1. Abbreviated New Drug Application (ANDA) 2. Absolute Risk Reduction 3. Accelerated Approval 4. Adverse Drug Reaction Reporting 5. Adverse Event Evaluation 6. Adverse Event Report System (AERS) 7. Advisory Committees 8. AIDS Clinical Trials Group (ACTG) 9. Algorithm-Based Designs 10. Aligned Rank Test 11. Allocation Concealment 12. Alpha-Spending Function 13. Analysis of Variance ANOVA 14. Analysis Population 15. Application of New Designs in Phase I Trials 16. ASCOT Trial 17. Assay Sensitivity 18. Assessment Bias 19. Assessment of Health-Related Quality of Life 20. Audit 21. Audit Certificate 22. Audit Report 23. Bayesian Dose-Finding Designs in Healthy Volunteers 24. Benefit/Risk Assessment in Prevention Trials 25. Biased-Coin Randomization 26. Bioequivalence (BE) Testing for Generic Drugs 27. Biological Assay, Overview 28. Block Randomization 29. Bootstrap 30. Cardiac Arrhythmia Suppression Trial (CAST) 31. Categorical Response Models 32. Causal Inference 33. Cell Line 34. Censored Data 35. Center for Devices and Radiological Health (CDRH) 36. Center for Drug Evaluation and Research (CDER) 37. Central Nervous System (CNS) 38. CFR 21 Part 11 39. Change/Percent Change From Baseline 40. Chemistry, Manufacturing and Controls (CMC) 41. Citizen Petition 42. Clinical Data Coordination 43. Clinical Data Management 44. Clinical Hold Decision 45. Clinical Significance 46. Clinical Trial Misconduct 47. Clinical Trials, Early Cancer and Heart Disease 48. Clinical Trials to Support Prescription to Over-the-Counter Switches 49. Cluster Randomization 50. Code of Federal Regulations (CFR)
51. Coherence in Phase I Clinical Trials 52. Cohort vs. Repeated Cross-Sectional Survey Designs 53. Collinearity 54. Combination Therapy 55. Committee for Medicinal Product for Human Use (CHMP) 56. Common Technical Document (CTD) 57. Community-Based Breast and Cervical Cancer Control Research in Asian Immigrant Populations 58. Compliance and Survival Analysis 59. Composite Endpoints in Clinical Trials 60. Computer-Assisted Data Collection 61. Conditional Power in Clinical Trial Monitoring 62. Confidence Interval 63. Confidence Intervals and Regions 64. Confirmatory Trials 65. Confounding 66. CONSORT 67. Contract Research Organization (CRO) 68. Control Groups 69. Cooperative North Scandinavian Enalapril Survival Study (CONSENSUS) 70. Cooperative Studies Program, US Department of Veterans Affairs 71. Coordinating Committee 72. Coordinating Investigator 73. Coronary Drug Project 74. Correlation 75. Cost-Effectiveness Analysis 76. Covariates 77. Cox Proportional Hazard Model 78. Cronbach's Alpha 79. Crossover Design 80. Crossover Trials 81. Data Mining 82. Data Monitoring Committee 83. Data Safety and Monitoring Board (DSMB) 84. Data Standards 85. Dermatology Trials 86. Designs with Randomization Following Initial Study Treatment 87. Diagnostic Studies 88. Discriminant Analysis 89. Disease Trials for Dental Drug Products 90. Disease Trials in Reproductive Diseases 91. Disease Trials on Pediatric Patients 92. DNA Bank 93. Dose Escalation and Up-and-Down Designs 94. Dose Escalation Guided by Graded Toxicities 95. Dose Finding Studies 96. Dose Ranging Crossover Designs 97. Double-Dummy 98. Drift (For Interim Analysis) 99. Drug Development
100. Drug Packaging 101. Drug Registration and Listing System (DRLS) 102. Drug Supply 103. Eastern Cooperative Oncology Group (ECOG) 104. Eligibility and Exclusion Criteria 105. Emergency Use Investigational New Drug (IND) 106. End of Phase II Meeting 107. End-of-Phase I Meeting 108. Enrichment Design 109. Environmental Assessments (EAs) 110. Equivalence Trials and Equivalence Limits 111. Essential Documents 112. Ethical Challenges Posed by Cluster Randomization 113. Ethical Issues in International Research 114. European Medicines Agency (EMEA) 115. European Organization for Research and Treatment of Cancer (EORTC) 116. Factor Analysis: Confirmatory 117. Factorial Designs in Clinical Trials 118. Fast Track 119. FDA Division of Pharmacovigilance and Epidemiology (DPE) 120. FDA Modernization Act (FDAMA) of 1997 121. Federal Food, Drug and Cosmetic Act 122. Federal Register 123. Fileable New Drug Application (NDA) 124. Financial Disclosure 125. Fisher's Exact Test 126. Flexible Designs 127. Food and Drug Administration (FDA) 128. Frailty Models 129. Futility Analysis 130. Generalized Estimating Equations 131. Generalized Linear Models 132. Generic Drug Review Process 133. Gene Therapy 134. Genetic Association Analysis 135. Global Assessment Variables 136. Gold Standard 137. Good Clinical Practice (GCP) 138. Good Laboratory Practice (GLP) 139. Goodness of Fit 140. Group-Randomized Trials 141. Group Sequential Designs 142. Hazard Rate 143. Hazard Ratio 144. Heritability 145. Historical Control 146. Hypothesis 147. Hypothesis Testing
148. Identifying the Most Successful Dose (MSD) in Dose-Finding Studies 149. Imaging Science in Medicine 150. Imputation 151. Incompetent Subjects and Proxy Consent 152. Independent Ethics Committee (IEC) 153. Inference, Design-Based vs. Model-Based 154. Informed Consent Process, Forms, and Assent 155. Institution 156. Institutional and Independent Review Boards 157. Institutional Review Boards (IRB) 158. Integrated Database 159. Intention-to-Treat Analysis 160. Interaction Model 161. Interim Analyses 162. Interim Clinical Trial/Study Report 163. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) 164. International Studies of Infarct Survival (ISIS) 165. Interrater Reliability 166. Interval Censored 167. Intraclass Correlation Coefficient 168. Intrarater Reliability 169. Investigational Device Exemption (IDE) 170. Investigational New Drug Application Process (IND) 171. Investigational Product 172. Investigator 173. Investigator/Institution 174. Investigator's Brochure 175. Kaplan–Meier Plot 176. Kappa 177. Kefauver–Harris Drug Amendments 178. Lan-DeMets Alpha-Spending Function 179. Large Simple Trials 180. Linear Model 181. Logistic Regression 182. Logrank Test 183. Longitudinal Data 184. Manual of Policies and Procedures (MaPPs) 185. Masking 186. Maximum Duration and Information Trials 187. Maximum Tolerable Dose (MTD) 188. Metadata 189. Methods for Conduct of Rigorous Group-Randomization 190. Microarray 191. Minimum Effective Dose (MinED) 192. Ministry of Health, Labour and Welfare (MHLW, Japan) 193. Min Test 194. Missing Data 195. Monitoring 196. Monotherapy
197. Mother to Child Human Immunodeficiency Virus Transmission Trials 198. Multicenter Trial 199. Multinational (Global) Trial 200. Multiple Comparisons 201. Multiple Endpoints 202. Multiple Evaluators 203. Multiple Risk Factor Intervention Trial (MRFIT) 204. Multiple Testing in Clinical Trials 205. Multistage Genetic Association Studies 206. National Cancer Institute (NCI) 207. National Center for Toxicological Research (NCTR) 208. National Cooperative Gallstone Study 209. National Eye Institute (NEI) 210. National Heart, Lung, and Blood Institute (NHLBI) 211. National Human Genome Research Institute (NHGRI) 212. National Institute of Allergy and Infectious Disease (NIAID) 213. National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS) 214. National Institute of Biomedical Imaging and Bioengineering (NIBIB) 215. National Institute of Child Health and Human Development (NICHD) 216. National Institute of Dental and Craniofacial Research (NIDCR) 217. National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) 218. National Institute of Environmental Health Science (NIEHS) 219. National Institute of General Medical Sciences (NIGMS) 220. National Institute of Mental Health (NIMH) 221. National Institute of Neurological Disorders and Stroke (NINDS) 222. National Institute of Nursing Research (NINR) 223. National Institute on Aging (NIA) 224. National Institute on Alcohol Abuse and Alcoholism (NIAAA) 225. National Institute on Deafness and Other Communication Disorders (NIDCD) 226. National Institute on Drug Abuse (NIDA) 227. National Institutes of Health (NIH) 228. National Institutes of Health Stroke Scale (NIHSS) 229. National Library of Medicine (NLM) 230. New Drug Application (NDA) Process 231. N of 1 Randomized Trial 232. Non-Compartmental Analysis 233. Noncompliance 234. Non-Inferiority Trial 235. Nonparametric Methods 236. Non-Randomized Trial 237. Objectives 238. Office of Orphan Products Development (OOPD) 239. Office of Pharmacoepidemiology and Statistical Science (OPaSS) 240. Office of Regulatory Affairs (ORA) 241. One-Sided Versus Two-Sided Tests 242. Open-Labeled Trial 243. Optimal Biological Dose for Molecularly-Targeted Therapies
244. Optimizing Schedule of Administration in Phase I Clinical Trials 245. Orphan Drug Act (ODA) 246. Orphan Drugs 247. Orphan Products Grant Program 248. Outliers 249. Overdispersion 250. Over-the-Counter (OTC) Drug Product Review Process 251. Over the Counter (OTC) Drugs 252. Overview of Anti-Infective Drug Development 253. Overview of Safety Pharmacology 254. Paired t Test 255. Parallel Track 256. Partially Balanced Designs 257. "Patients" vs. "Subjects" 258. Permutation Tests in Clinical Trials 259. Pharmacoepidemiology, Overview 260. Pharmacovigilance 261. Phase 2/3 Trials 262. Phase I/II Clinical Trials 263. Phase III Trials 264. Phase II Trials 265. Phase I Trials 266. Phase I Trials in Oncology 267. Phase IV Trials 268. Physicians' Health Study (PHS) 269. Placebos 270. Planning A Group-Randomized Trial 271. Poisson Regression 272. Population Pharmacokinetic and Pharmacodynamic Methods 273. Postmenopausal Estrogen/Progestin Intervention Trial (PEPI) 274. Power Transformations 275. Predicting Random Effects in Community Intervention 276. Preference Trials 277. Premarket Approval (PMA) 278. Premarket Notification 510(k) 279. Premature Termination or Suspension 280. Pre-NDA Meeting 281. Prescription Drug User Fee Act (PDUFA) 282. Prescription Drug User Fee Act (PDUFA) II 283. Prescription Drug User Fee Act (PDUFA) IV 284. Prescription Drug User Fee Act III 285. Prevention Trials 286. Primary Efficacy Endpoint 287. Priority Review 288. Prognostic Variables in Clinical Trials 289. Propensity Score 290. Proportional Odds Model 291. Proportions, Inferences, and Comparisons 292. Prostate Cancer Prevention Trial 293. Protocol
294. Protocol Amendment 295. Protocol Deviators 296. Protocol Violators 297. Publication Bias 298. P Value 299. Quality Assessment Of Clinical Trials 300. Quality Assurance 301. Quality Control 302. Quality of Life 303. Query Management: The Route to a Quality Database 304. Question-Based Review (QbR) 305. Randomization-Based Nonparametric Analysis of Covariance 306. Randomization Procedures 307. Randomization Schedule 308. Rank-Based Nonparametric Analysis of Covariance 309. Record Access 310. Refuse to File Letter 311. Registration of Drug Establishment Form 312. Regression 313. Regression Models to Incorporate Patient Heterogeneity 314. Regression to the Mean 315. Regulatory Authorities 316. Regulatory Definitions 317. Relative Risk Modeling 318. Reliability Study 319. Repeatability and Reproducibility 320. Repeated Measurements 321. Repository 322. Response Adaptive Randomization 323. Response Surface Methodology 324. Risk Assessment 325. Robust Two-Stage Model-Guided Designs for Phase I Clinical Studies 326. Run-In Period 327. Safety Information 328. Sample Size Calculation for Comparing Means 329. Sample Size Calculation for Comparing Proportions 330. Sample Size Calculation for Comparing Time-to-Event Data 331. Sample Size Calculation for Comparing Variabilities 332. Sample Size Considerations for Morbidity/Mortality Trials 333. Screening, Models of 334. Screening Trials 335. Secondary Efficacy Endpoints 336. Sensitivity, Specificity and Receiver Operator Characteristic (ROC) Methods 337. Sequential Analysis 338. Serious Adverse Event (SAE) 339. Simple Randomization 340. Software for Genetics/Genomics 341. Sponsor 342. Sponsor-Investigator
343. Spontaneous Reporting System (SRS) 344. Stability Analysis 345. Stability Study Designs 346. Standard Operating Procedures (SOP) 347. Statins 348. Stepped Wedge Design 349. Stopping Boundaries 350. Stratification 351. Stratified Designs 352. Stratified Randomization 353. Subgroup 354. Subgroup Analysis 355. Subinvestigator 356. Superiority Trial 357. Surrogate Endpoints 358. Survival Analysis, Overview 359. Suspension or Termination of IRB Approval 360. The Belmont Report 361. The Carotene and Retinol Efficacy Trial (Caret) 362. The Center for Biologics Evaluation and Research 363. The Cochrane Collaboration 364. The Community Intervention Trial for Smoking Cessation (COMMIT) 365. The FDA and Regulatory Issues 366. Therapeutic Dose Range 367. Therapeutic Equivalence 368. Therapeutic Index 369. TNT Trial 370. Treatment-by-Center Interaction 371. Treatment Interruption 372. Treatment Investigational New Drug (IND) 373. Trial Site 374. True Positives, True Negatives, False Positives, False Negatives 375. UGDP Trial 376. Update in Hyperlipidemia Clinical Trials 377. Using Internet in Community Intervention Studies 378. Vaccine 379. Vaccine Adverse Event Report System (VAERS) 380. Web Based Data Management System 381. Wei-Lin-Weissfeld Method for Multiple Times to Events 382. Wilcoxon–Mann–Whitney Test 383. Wilcoxon Signed-Rank Test 384. Women's Health Initiative: Statistical Aspects and Selected Early Results 385. Women's Health Initiative Dietary Trial 386. Women's Health Initiative Hormone Therapy Trials 387. World Health Organization (WHO): Global Health Situation
ABBREVIATED NEW DRUG APPLICATION (ANDA)
An Abbreviated New Drug Application (ANDA) contains data that, when submitted to the U.S. Food and Drug Administration's (FDA) Center for Drug Evaluation and Research, Office of Generic Drugs, provides for the review and ultimate approval of a generic drug product. Once approved, an applicant may manufacture and market the generic drug product to provide a safe, effective, low-cost alternative to the American public. Generic drug applications are termed "abbreviated" because they are generally not required to include preclinical (animal) and clinical (human) data to establish safety and effectiveness. Instead, generic applicants must scientifically demonstrate that their product is bioequivalent (i.e., performs in the same manner as the innovator drug). One way scientists demonstrate bioequivalence is to measure the time it takes the generic drug to reach the bloodstream in 24 to 36 healthy volunteers. This gives them the rate of absorption, or bioavailability, of the generic drug, which they can then compare with that of the innovator drug. The generic version must deliver the same amount of active ingredients into a patient's bloodstream in the same amount of time as the innovator drug. Using bioequivalence as the basis for approving generic copies of drug products was established by the "Drug Price Competition and Patent Term Restoration Act of 1984," also known as the Waxman-Hatch Act. This Act expedites the availability of less costly generic drugs by permitting the FDA to approve applications to market generic versions of brand-name drugs without conducting costly and duplicative clinical trials. At the same time, brand-name companies can apply for up to 5 additional years of patent protection for the new medicines they developed, to make up for time lost while their products were going through the FDA's approval process. Brand-name drugs are subject to the same bioequivalence tests as generics upon reformulation.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/regulatory/applications/ANDA.htm) by Ralph D'Agostino and Sarah Karl.
ABSOLUTE RISK REDUCTION
ROBERT G. NEWCOMBE
Cardiff University, Cardiff, Wales, United Kingdom

Many response variables in clinical trials are binary: the treatment was successful or unsuccessful; the adverse effect did or did not occur. Binary variables are summarized by proportions, which may be compared between different arms of a study by calculating either an absolute difference of proportions or a relative measure, the relative risk or the odds ratio. In this article we consider several point and interval estimates for the absolute difference between two proportions, for both unpaired and paired study designs. The simplest methods encounter problems when numerators or denominators are small; accordingly, better methods are introduced. Because confidence interval methods for differences of proportions are derived from related methods for the simpler case of the single proportion, which itself can also be of interest in a clinical trial, this case is also considered in some depth. Illustrative examples relating to data from two clinical trials are shown.

1 PRELIMINARY ISSUES

In most clinical trials, the unit of data is the individual, and statistical analyses for efficacy and safety outcomes compare responses between the two (or more) treatment groups. When subjects are randomized between these groups, responses of subjects in one group are independent of those in the other group. This leads to unpaired analyses. Crossover and split-unit designs require paired analyses. These have many features in common with the unpaired analyses and will be described in the final section. Thus, we study n1 individuals in group 1 and n2 individuals in group 2. Usually, all analyses are conditional on n1 and n2. Analyses conditional on n1 and n2 would also be appropriate in other types of prospective studies or in cross-sectional designs. (Some hypothesis testing procedures such as the Fisher exact test are conditional also on the total number of "successes" in the two groups combined. This alternative conditioning is inappropriate for confidence intervals for a difference of proportions; in particular, in the event that no successes are observed in either group, this approach fails to produce an interval.) The outcome variable is binary: 1 if the event of interest occurs, 0 if it does not. (We do not consider here the case of an integer-valued outcome variable; typically, this involves the number of episodes of relapse or hospitalization, number of accidents, or similar events occurring within a defined follow-up period. Such an outcome would instead be modeled by the Poisson distribution.) We observe that r1 subjects in group 1 and r2 subjects in group 2 experience the event of interest. Then the proportions having the event in the two groups are given by p1 = r1/n1 and p2 = r2/n2. If responses in different individuals in each group are independent, then the distribution of the number of events in each group is binomial. Several effect size measures are widely used for comparison of two independent proportions:

Difference of proportions: p1 − p2
Ratio of proportions (risk ratio or relative risk): p1/p2
Odds ratio: (p1/(1 − p1))/(p2/(1 − p2))
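These three measures are simple functions of the observed counts; the following minimal Python sketch is an illustration by the editor (not code from the article), using hypothetical counts.

```python
# Illustrative sketch: the three effect measures computed from 2x2 counts.
# The function name and the example counts are hypothetical.
def effect_measures(r1, n1, r2, n2):
    p1, p2 = r1 / n1, r2 / n2
    risk_difference = p1 - p2                                   # absolute risk reduction
    risk_ratio = p1 / p2 if p2 > 0 else float("inf")            # relative risk
    odds_ratio = ((p1 / (1 - p1)) / (p2 / (1 - p2))
                  if 0 < p2 < 1 and p1 < 1 else float("nan"))
    return risk_difference, risk_ratio, odds_ratio

# Hypothetical example: 18/60 events in group 1 versus 30/60 in group 2.
print(effect_measures(18, 60, 30, 60))   # (-0.2, 0.6, 0.4285...)
```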
In this article we consider in particular the difference between two proportions, p1 – p2 , as a measure of effect size. This is variously referred to as the absolute risk reduction, risk difference, or success rate difference. Other articles in this work describe the risk ratio or relative risk and the odds ratio. We consider both point and interval estimates, in recognition that ‘‘confidence intervals convey information about magnitude and precision of effect simultaneously, keeping these two aspects of measurement closely linked’’ (1). In the clinical trial context, a difference between two proportions is often referred to as an
absolute risk reduction. However, it should be borne in mind that any term that includes the word "reduction" really presupposes that the direction of the difference will be a reduction in risk—such terminology becomes awkward when the anticipated benefit does not materialize, including the nonsignificant case when the confidence interval for the difference extends beyond the null hypothesis value of zero. The same applies to the relative risk reduction, 1 − p1/p2. Whenever results are presented, it is vitally important that the direction of the observed difference should be made unequivocally clear. Moreover, sometimes confusing labels are used, which might be interpreted to mean something other than p1 − p2; for example, Hashemi et al. (2) refer to p1 − p2 as attributable risk. It is also vital to distinguish between relative and absolute risk reduction. In clinical trials, as in other prospective and cross-sectional designs already described, each of the three quantities we have discussed may validly be used as a measure of effect size. The risk difference and risk ratio compare two proportions from different perspectives. A halving of risk will have much greater population impact for a common outcome than for an infrequent one. Schechtman (3) recommends that both a relative and an absolute measure should always be reported, with appropriate confidence intervals. The odds ratio is discussed at length by Agresti (4). It is widely regarded as having a special preferred status on account of its role in retrospective case-control studies and in logistic regression and meta-analysis. Nevertheless, it should not be regarded as having gold standard status as a measure of effect size for the 2 × 2 table (3, 5).

2 POINT AND INTERVAL ESTIMATES FOR A SINGLE PROPORTION

Before considering the difference between two independent proportions in detail, we first consider some of the issues that arise in relation to the fundamental task of estimating a single proportion. These issues have repercussions for the comparison of proportions because confidence interval methods for p1 − p2 are generally based closely on those
for proportions. The single proportion is also relevant to clinical trials in its own right. For example, in a clinical trial comparing surgical versus conservative management, we would be concerned with estimating the incidence of a particular complication of surgery such as postoperative bleeding, even though there is no question of obtaining a contrasting value in the conservative group or of formally comparing these. The most commonly used estimator for the population proportion π is the familiar empirical estimate, namely, the observed proportion p = r/n. Given n, the random variable R denoting the number of subjects in which the response occurs has the binomial B(n, π) distribution, with Pr[R = r] = {n!/(r!(n − r)!)} π^r (1 − π)^(n−r). The simple empirical estimator is also the maximum likelihood estimate for the binomial distribution, and it is unbiased—in the usual statistical sense that its expectation given n equals the true value, E[p|n] = π. However, when r = 0, many users of statistical methods are uneasy with the idea that p = 0 is an unbiased estimate. The range of possible values for π is the interval from 0 to 1. Generally, this means the open interval 0 < π < 1, not the closed interval 0 ≤ π ≤ 1, as usually it would already be known that the event sometimes occurs and sometimes does not. As the true value of π cannot then be negative or zero but must be greater than zero, the notion that p = 0 should be regarded as an unbiased estimate of π seems highly counterintuitive. Largely with this issue in mind, alternative estimators known as shrinkage estimators are available. These generally take the form pψ = (r + ψ)/(n + 2ψ) for some ψ > 0. The quantity ψ is known as a pseudo-frequency. Essentially, ψ observations are added to the number of "successes" and also to the number of "failures." The resulting estimate pψ is intermediate between the empirical estimate p = r/n and 1/2, which is the midpoint and center of symmetry of the support scale from 0 to 1. The degree of shrinkage toward 1/2 is great when n is small and minor for large n. Bayesian analyses of proportions lead naturally to shrinkage estimators, with ψ = 1
and 1/2 corresponding to the most widely used uninformative conjugate priors, the uniform prior B(1,1) and the Jeffreys prior B(1/2, 1/2). It also is important to report confidence intervals, to express the uncertainty due to sampling variation and finite sample size. The simplest interval, p ± z × SE(p) where SE(p) = √(pq/n), remains the most commonly used. This is usually called the Wald interval. Here, z denotes the relevant quantile of the standard Gaussian distribution. Standard practice is to use intervals that aim to have 95% coverage, with 2.5% noncoverage in each tail, leading to z = 1.9600. Unfortunately, confidence intervals for proportions and their differences do not achieve their nominal coverage properties. This is because the sample space is discrete and bounded. The Wald method for the single proportion has three unfavorable properties (6–9). These can all be traced to the interval's simple symmetry about the empirical estimate. First, the achieved coverage is much lower than the nominal value; for some values of π, the achieved coverage probability is close to zero, and the noncoverage probabilities in the two tails are very different. Second, the location of the interval is too distal—too far out from the center of symmetry of the scale, 1/2—and the noncoverage of the interval is predominantly mesial. Third, the calculated limits often violate the boundaries at 0 and 1. In particular, when r = 0, a degenerate, zero-width interval results. For small non-zero values of r (1, 2, and sometimes 3 for a 95% interval), the calculated lower limit is below zero. The resulting interval is usually truncated at zero, but this is unsatisfactory as the data tell us that 0 is an impossible value for π. Corresponding anomalous behavior at the upper boundary occurs when n − r is 0 or small. Many improved methods for confidence intervals for proportions have been developed. The properties of these methods are evaluated by choosing suitable parameter space points (here, combinations of n and π), using these to generate large numbers of simulated random samples, and recording how often the resulting confidence interval includes the true value π. The resulting coverage probabilities are then summarized by
calculating the mean coverage and minimum coverage across the simulated datasets. Generally, the improved methods obviate the boundary violation problem, and improve coverage and location. The most widely researched options are as follows. A continuity correction may be incorporated: p ± {z√(pq/n) + 1/(2n)}. This certainly improves coverage and obviates zero-width intervals but increases the incidence of boundary overflow. The Wilson score method (10) uses the theoretical value π, not the empirical estimate p, in the formula for the standard error of p. Lower and upper limits are obtained as the two solutions of the equation p = π ± z × SE(π) = π ± z√(π(1 − π)/n), which reduces to a quadratic in π. The two roots are given in closed form as {2r + z² ± z√(z² + 4rq)}/{2(n + z²)}. It is easily demonstrated (7) that the resulting interval is symmetrical on the logit scale—the other natural scale for proportions—by considering the product of the two roots for π, and likewise for 1 − π. The resulting interval is boundary respecting and has appropriate mean coverage. In contrast to the Wald interval, location is rather too mesial. The midpoint of the score interval, on the ordinary additive scale, is a shrinkage estimator with ψ = (1/2)z², which is 1.92 for the default 95% interval. With this (and also Bayesian intervals) in mind, Agresti and Coull (8) proposed a pseudo-frequency method, which adds ψ = 2 to the numbers of successes (r) and failures (n − r) before using the ordinary Wald formula. This is also a great improvement over the Wald method, and is computationally and conceptually very simple. It reduces but does not eliminate the boundary violation problem. A variety of alternatives can be formulated, with different choices for ψ, and also using something other than n + 2ψ as the denominator of the variance. Alternatively, the Bayesian approach described elsewhere in this work may be used. The resulting intervals are best
referred to as credible intervals, in recognition that the interpretation is slightly different from that of frequentist confidence intervals such as those previously described. Bayesian inference starts with a prior distribution for the parameter of interest, in this instance the proportion π . This is then combined with the likelihood function comprising the evidence from the sample to form a posterior distribution that represents beliefs about the parameter after the data have been obtained. When a conjugate prior is chosen from the beta distribution family, the posterior distribution takes a relatively simple form: it is also a beta distribution. If substantial information about π exists, an informative prior may be chosen to encapsulate this information. More often, an uninformative prior is used. The simplest is the uniform prior B(1,1), which assumes that all possible values of π between 0 and 1 start off equally likely. An alternative uninformative prior with some advantages is the Jeffreys prior B( 1/2, 1/2). Both are diffuse priors, which spread the probability thinly across the whole range of possible values from 0 to 1. The resulting posterior distribution may be displayed graphically, or may be summarized by salient summary statistics such as the posterior mean and median and selected centiles. The 2 1/2 and 97 1/2 centiles of the posterior distribution delimit the tail-based 95% credible interval. Alternatively, a highest posterior density interval may be reported. The tail-based interval is considered preferable because it produces equivalent results when a transformed scale (e.g., logit) is used (11). These Bayesian intervals perform well in a frequentist sense (12). Hence, it is now appropriate to regard them as confidence interval methods in their own right, with theoretical justification in the Bayesian paradigm but empirical validation from a frequentist standpoint. They may thus be termed beta intervals. They are readily calculated using software for the incomplete beta function, which is included in statistical packages and also spreadsheet software such as Microsoft Excel. As such, they should now be regarded as computationally of ‘‘closed form,’’ though less transparent than Wald methods.
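To make the preceding formulas concrete, the following is a minimal Python sketch of the Wald, Wilson score, Agresti-Coull, and tail-based beta (Bayes) intervals described above. It is an illustration by the editor rather than code from the article; the function names and layout are assumptions, and only standard scipy quantile routines (norm.ppf, beta.ppf) are used.

```python
# Illustrative sketch of single-proportion intervals (editor's own, not the article's code).
from math import sqrt
from scipy.stats import beta, norm

def wald(r, n, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)
    p = r / n
    se = sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se              # may violate the [0, 1] boundaries

def wilson(r, n, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)
    q = 1 - r / n
    centre, half = 2 * r + z**2, z * sqrt(z**2 + 4 * r * q)
    denom = 2 * (n + z**2)
    return (centre - half) / denom, (centre + half) / denom

def agresti_coull(r, n, conf=0.95, psi=2):
    z = norm.ppf(1 - (1 - conf) / 2)
    p = (r + psi) / (n + 2 * psi)              # shrinkage estimate
    se = sqrt(p * (1 - p) / (n + 2 * psi))
    return p - z * se, p + z * se

def beta_interval(r, n, a=0.5, b=0.5, conf=0.95):
    """Tail-based Bayes (credible) interval with a Beta(a, b) prior."""
    alpha = 1 - conf
    return (beta.ppf(alpha / 2, r + a, n - r + b),
            beta.ppf(1 - alpha / 2, r + a, n - r + b))

# Treatment A toxicity data from Table 1 below (2 events in 14 patients);
# Wilson should give roughly 0.0401 to 0.3994 and Agresti-Coull 0.0302 to 0.4143,
# as in Table 2.
for f in (wald, wilson, agresti_coull, beta_interval):
    print(f.__name__, f(2, 14))
```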
Many statisticians consider that a coverage level should represent minimum, not average, coverage. The Clopper-Pearson "exact" or tail-based method (13) achieves this, at the cost of being excessively conservative; intervals are unnecessarily wide. There is a trade-off between coverage and width; it is always possible to increase coverage by widening intervals, and the aim is to attain good coverage without excessive width. A variant on the "exact" method involving a mid-P accumulation of tail probabilities (14, 15) aligns mean coverage closely with the nominal 1 − α. Both methods have appropriate location. The Clopper-Pearson interval, but not the mid-P one, is readily programmed as a beta interval, of similar form to Bayes intervals. A variety of shortened intervals have also been developed that maintain minimum coverage but substantially shrink interval length (16, 17). Shortened intervals are much more complex, both computationally and conceptually. They also have the disadvantage that what is optimized is the interval, not the lower and upper limits separately; consequently, they are unsuitable when interest centers on one of the limits rather than the other. Numerical examples illustrating these calculations are based on some results from a very small randomized phase II clinical trial performed by the Eastern Cooperative Oncology Group (18). Table 1 shows the results for two outcomes, treatment success defined as shrinkage of the tumor by 50% or more, and life-threatening treatment toxicity, for the two treatment groups A and B. Table 2 shows 95% confidence intervals for both outcomes for treatment A. These examples show how Wald and derived intervals often produce inappropriate limits (see asterisks) in boundary and near-boundary cases.
Table 1. Some Results from a Very Small Randomized Phase II Clinical Trial Performed by the Eastern Cooperative Oncology Group

                                                           Treatment A   Treatment B
Number of patients                                              14            11
Number with successful outcome: tumor shrinkage by ≥ 50%         0             0
Number with life-threatening treatment toxicity                  2             1

Source: Parzen et al. J Comput Graph Stat. 2002; 11: 420–436.
Table 2. 95% Confidence Intervals for Proportions of Patients with Successful Outcome and with Life-Threatening Toxicity on Treatment A in the Eastern Cooperative Oncology Group Trial

Outcome                                    Successful tumor shrinkage   Life-threatening toxicity
Empirical estimate                         0                            0.1429
Wald interval                              0 to 0*                      < 0* to 0.3262
Wald interval with continuity correction   < 0* to 0.0357               < 0* to 0.3619
Wilson score interval                      0 to 0.2153                  0.0401 to 0.3994
Agresti-Coull shrinkage estimate           0.1111                       0.2222
Agresti-Coull interval                     < 0* to 0.2563               0.0302 to 0.4143
Bayes interval, B(1,1) prior               0 to 0.2180                  0.0433 to 0.4046
Bayes interval, B(1/2, 1/2) prior          0 to 0.1616                  0.0309 to 0.3849
Clopper-Pearson "exact" interval           0 to 0.2316                  0.0178 to 0.4281
Mid-P interval                             0 to 0.1926                  0.0247 to 0.3974

Note: Asterisks denote boundary violations.
Source: Parzen et al. J Comput Graph Stat. 2002; 11: 420–436.
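As noted above, the Clopper-Pearson limits are readily programmed as beta quantiles; a brief illustrative sketch by the editor (not code from the article) follows.

```python
# Clopper-Pearson "exact" limits expressed as beta quantiles (editor's sketch).
from scipy.stats import beta

def clopper_pearson(r, n, conf=0.95):
    a = (1 - conf) / 2
    lower = 0.0 if r == 0 else beta.ppf(a, r, n - r + 1)
    upper = 1.0 if r == n else beta.ppf(1 - a, r + 1, n - r)
    return lower, upper

print(clopper_pearson(0, 14))  # should be close to the "0 to 0.2316" row of Table 2
print(clopper_pearson(2, 14))  # should be close to "0.0178 to 0.4281"
```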
3 AN UNPAIRED DIFFERENCE OF PROPORTIONS

We return to the unpaired difference case. As described elsewhere in this work, hypothesis testing for the comparison of two proportions takes a quite different form according to whether the objective of the trial is to ascertain difference or equivalence. When we report the contrast between two proportions with an appropriately constructed confidence interval, this issue is taken into account only when we come to interpret the calculated point and interval estimates. In this respect, in comparison with hypothesis testing, the confidence interval approach leads to much simpler, more flexible patterns of inference. The quantity of interest is the difference between two binomial proportions, π1 and π2. The empirical estimate is p1 − p2 = r1/n1 − r2/n2. It is well known that, when comparing means, there is a direct correspondence between hypothesis tests and confidence intervals. Specifically, the null hypothesis is rejected at the conventional two-tailed α = 0.05 level if and only if the 100(1 − α) = 95% confidence interval for the difference excludes the null hypothesis value of zero. A similar property applies also to the comparison of proportions—usually, but not invariably. This is because there are several options for constructing a confidence interval for the difference of proportions, which have
different characteristics and do not all correspond directly to purpose-built hypothesis tests. The Wald interval is calculated as p1 − p2 ± z√(p1q1/n1 + p2q2/n2). It has poor mean and minimum coverage and fails to produce an interval when both p1 and p2 are 0 or 1. Overshoot can occur when one proportion is close to 1 and the other is close to 0, but this situation is expected to occur infrequently in practice. Use of a continuity correction improves mean coverage, but minimum coverage remains low. Several better methods have been developed, some of which are based on specific mathematical models. Any model for the comparison of two proportions necessarily involves both the parameter of interest, δ = π1 − π2, and an additional nuisance parameter γ. The model may be parametrized in terms of δ and π1 + π2, or δ and (π1 + π2)/2, or δ and π1. We will define the nuisance parameter as γ = (π1 + π2)/2. Some of the better methods substitute the profile estimate γδ, which is the maximum likelihood estimate of γ conditional on a
hypothesized value of δ. These include scoretype asymptotic intervals developed by Mee (19) and Miettinen and Nurminen (20). Newcombe (21) developed tail-based exact and mid-P intervals involving substitution of the profile estimate. All these intervals are boundary respecting. The ‘‘exact’’ method aligns the minimum coverage quite well with the nominal 1 – α; the others align mean coverage well with 1 – α, at the expense of fairly complex iterative calculation. Bayesian intervals for p1 – p2 and other comparative measures may be constructed (2, 11), but they are computationally much more complex than in the single proportion case, requiring use of numerical integration or computer-intensive methodology such as Markov chain Monte Carlo (MCMC) methods. It may be more appropriate to incorporate a prior for p1 – p2 itself rather than independent priors for p1 and p2 (22). The Bayesian formulation is readily adapted to incorporate functional constraints such as δ ≥ 0 (22). Walters (23) and Agresti and Min (11) have shown that Bayes intervals for p1 – p2 with uninformative beta priors have favorable frequentist properties. Two computationally simpler, effective approaches have been developed. Newcombe (21) also formulated square-and-add intervals for differences of proportions. The concept is a very simple one. Assuming independence, the variance of a difference between two quantities is the sum of their variances. In other words, standard errors ‘‘square and add’’—they combine in the same way that differences in x and in y coordinates combine to give the Euclidean distance along the diagonal, as in Pythagoras’ theorem. This is precisely how the Wald interval for p1 – p2 is constructed. The same principle may be applied starting with other, better intervals for p1 and p2 separately. The Wilson score interval is a natural choice as it already involves square roots, though squaring and adding would work equally effectively starting with, for instance, tail-based (24) or Bayes intervals. It is easily demonstrated that the square-and-add process preserves the property of respecting boundaries. Thus, the square-and-add interval is obtained as follows. Let (li , ui ) denote the
score interval for pi, for i = 1, 2. Then the square-and-add limits are

p1 − p2 − √{(p1 − l1)² + (u2 − p2)²}  and  p1 − p2 + √{(u1 − p1)² + (p2 − l2)²}.
This easily computed interval aligns mean coverage closely with the nominal 1 − α. A continuity correction is readily incorporated, resulting in more conservative coverage. Both intervals tend to be more mesially positioned than the γδ-based intervals discussed previously. The square-and-add approach may be applied a second time to obtain a confidence interval for a difference between differences of proportions (25); this is the linear scale analogue of assessing an interaction effect in logistic regression. Another simple approach that is a great improvement over the Wald method is the pseudo-frequency method (26, 27). A pseudo-frequency ψ is added to each of the four cells of the 2 × 2 table, resulting in the shrinkage estimator (r1 + ψ)/(n1 + 2ψ) − (r2 + ψ)/(n2 + 2ψ). The Wald formula then produces the limits pψ1 − pψ2 ± z√{pψ1(1 − pψ1)/(n1 + 2ψ) + pψ2(1 − pψ2)/(n2 + 2ψ)}, where pψi = (ri + ψ)/(ni + 2ψ), i = 1, 2. Agresti and Caffo (27) evaluated the effect of choosing different values of ψ, and they reported that adding 1 to each cell is optimal here. So here, just as for the single proportion case, in total four pseudo-observations are added. This approach also aligns mean coverage effectively with 1 − α. Interval location is rather too mesial, very similar to that of the square-and-add method. Zero-width intervals cannot occur. Boundary violation is not ruled out but is expected to be infrequent. Table 3 shows 95% confidence intervals calculated by these methods, comparing treatments A and B in the ECOG trial (18).
Table 3. 95% Confidence Intervals for Differences in Proportions of Patients with Successful Outcome and with Life-Threatening Toxicity between Treatments A and B in the Eastern Cooperative Oncology Group Trial

Outcome                                            Successful tumor shrinkage   Life-threatening toxicity
Empirical estimate                                 0                            0.0519
Wald interval                                      0* to 0*                     −0.1980 to 0.3019
Mee interval                                       −0.2588 to 0.2153            −0.2619 to 0.3312
Miettinen-Nurminen interval                        −0.2667 to 0.2223            −0.2693 to 0.3374
Tail-based "exact" interval                        −0.2849 to 0.2316            −0.2721 to 0.3514
Tail-based mid-P interval                          −0.2384 to 0.1926            −0.2539 to 0.3352
Bayes interval, B(1,1) priors for p1 and p2        −0.2198 to 0.1685            −0.2432 to 0.2986
Bayes interval, B(1/2, 1/2) priors for p1 and p2   −0.1768 to 0.1361            −0.2288 to 0.3008
Square-and-add Wilson interval                     −0.2588 to 0.2153            −0.2524 to 0.3192
Agresti-Caffo shrinkage estimate                   −0.0144                      0.0337
Agresti-Caffo interval                             −0.2016 to 0.1728            −0.2403 to 0.3076

Note: Asterisks denote boundary violations.
Source: Parzen et al. J Comput Graph Stat. 2002; 11: 420–436.
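The two computationally simple methods just described are straightforward to program. The sketch below is the editor's own illustration (not code from the article); for the toxicity outcome it should reproduce the square-and-add Wilson and Agresti-Caffo rows of Table 3.

```python
# Editor's sketch: square-and-add (Wilson-based) and Agresti-Caffo intervals
# for an unpaired difference of proportions.
from math import sqrt
from scipy.stats import norm

def wilson(r, n, z):
    q = 1 - r / n
    half = z * sqrt(z**2 + 4 * r * q)
    denom = 2 * (n + z**2)
    return (2 * r + z**2 - half) / denom, (2 * r + z**2 + half) / denom

def square_and_add(r1, n1, r2, n2, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)
    p1, p2 = r1 / n1, r2 / n2
    l1, u1 = wilson(r1, n1, z)
    l2, u2 = wilson(r2, n2, z)
    d = p1 - p2
    return (d - sqrt((p1 - l1)**2 + (u2 - p2)**2),
            d + sqrt((u1 - p1)**2 + (p2 - l2)**2))

def agresti_caffo(r1, n1, r2, n2, conf=0.95, psi=1):
    z = norm.ppf(1 - (1 - conf) / 2)
    p1s = (r1 + psi) / (n1 + 2 * psi)          # shrinkage estimates
    p2s = (r2 + psi) / (n2 + 2 * psi)
    se = sqrt(p1s * (1 - p1s) / (n1 + 2 * psi) + p2s * (1 - p2s) / (n2 + 2 * psi))
    return p1s - p2s - z * se, p1s - p2s + z * se

# Life-threatening toxicity, treatment A (2/14) versus B (1/11):
print(square_and_add(2, 14, 1, 11))   # should be close to -0.2524 to 0.3192
print(agresti_caffo(2, 14, 1, 11))    # should be close to -0.2403 to 0.3076
```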
4 NUMBER NEEDED TO TREAT
In the clinical trial setting, it has become common practice to report the number needed to treat (NNT), defined as the reciprocal of the absolute risk difference: NNT = 1/(p1 – p2 ) (28, 29). This measure has considerable intuitive appeal, simply because we are used to assimilating proportions expressed in the form of ‘‘1 in n,’’ such as a 1 in 7 risk of lifethreatening toxicity for treatment A in Table 1. The same principle applies to differences of proportions. These tend to be small decimal numbers, often with a leading zero after the decimal point, which risk being misinterpreted by the less numerate. Thus if p1 = 0.35 and p2 = 0.24, we could equivalently report p1 – p2 = 0.11, or as an absolute difference of 11% or an NNT of 9. The latter may well be an effective way to summarize the information when a clinician discusses a possible treatment with a patient. As always, we need to pay careful attention to the direction of the difference. By default, NNT is read as ‘‘number needed to treat for (one person to) benefit,’’ or NNTB. If the intervention of interest proves to be worse than the control regime, we report the number needed to harm (NNTH). A confidence interval for the NNT may be derived from any good confidence interval method for p1 – p2 by inverting the two limits. For example, Bender (30) suggests an
interval obtained by inverting square-andadd limits (21). But it is when we turn attention to confidence intervals that the drawback of the NNT approach becomes apparent. Consider first the case of a statistically significant difference, with p1 – p2 = +0.25, and 95% confidence interval from +0.10 to +0.40. Then an NNT of 4 is reported, with 95% confidence interval from 2.5 to 10. This has two notable features. The lower limit for p1 – p2 gives rise to the upper limit for the NNT and vice versa. Furthermore, the interval is very skewed, and the point estimate is far from the midpoint. Neither of these is a serious contraindication to use of the NNT. But often the difference is not statistically significant— and, arguably, reporting confidence intervals is even more important in this case than when the difference is significant. Consider, for example, p1 – p2 = +0.10, with 95% confidence interval from –0.05 to +0.25. Here, the estimated NNT is 1/0.10 = +10. Inverting the lower and upper confidence limits for p1 − p2 gives −20 and +4. This time, the two limits do not change places apparently. But there are two problems. The point estimate, +10, is not intermediate between −20 and +4. Moreover, the interval from −20 to +4 does not comprise the values of the NNT that are compatible with the data, but rather the ones that are not compatible with it. In fact, the confidence region for the NNT in this case consists of two intervals that extend to infinity, one from + 4 to +
∞ in the direction of benefit, the other from −20 to −∞ in the direction of harm. It could be a challenge to clinicians and researchers at large to comprehend this singularity that arises when a confidence interval spanning 0 is inverted (31). Accordingly, it seems preferable to report absolute risk reductions in percentage rather than reciprocal form. The most appropriate uses of the NNT are in giving simple bottom-line figures to patients (in which situation, usually only the point estimate would be given), and in labeling a secondary axis on a graph.
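Because the direction of the difference and the possible two-part confidence region are easy to get wrong, a small illustrative sketch of the inversion may help; it is a hypothetical helper written by the editor (not part of the article) and assumes that a positive risk difference denotes benefit.

```python
# Editor's sketch: turning a risk-difference estimate and CI into an NNT statement,
# flagging the singularity discussed above when the CI spans zero.
def nnt_summary(rd, lower, upper):
    """rd and (lower, upper) are on the risk-difference scale; rd > 0 means benefit."""
    if lower < 0 < upper:
        # The CI for the difference spans 0, so the NNT confidence region is two
        # disjoint intervals, each running out to infinity.
        return (f"NNTB {1 / rd:.0f}; 95% confidence region: "
                f"NNTB {1 / upper:.0f} to infinity and NNTH {1 / abs(lower):.0f} to infinity")
    return f"NNTB {1 / rd:.0f} (95% CI {1 / upper:.1f} to {1 / lower:.1f})"

print(nnt_summary(0.25, 0.10, 0.40))   # NNTB 4, CI 2.5 to 10, as in the first example above
print(nnt_summary(0.10, -0.05, 0.25))  # NNTB 10, with the two-part confidence region
```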
5 A PAIRED DIFFERENCE OF PROPORTIONS
Crossover and split-unit trial designs lead to paired analyses. Regimes that aim to produce a cure are generally not suitable for evaluation in these designs, because in the event that a treatment is effective, there would be a carryover effect into the next treatment period. For this reason, these designs tend to be used for evaluation of regimes that seek to control symptomatology, and thus most often give rise to continuous outcome measures. Examples of paired analyses of binary data in clinical trials include comparisons of different antinauseant regimes administered in randomized order during different cycles of chemotherapy, comparisons of treatments for headache pain, and split-unit studies in ophthalmology and dermatology. Results can be reported in either risk difference or NNT form, though the latter appears not to be frequently used in this context. Other examples in settings other than clinical trials include
longitudinal comparison of oral carriage of an organism before and after third molar extraction, and twin studies. Let a, b, c, and d denote the four cells of the paired contingency table. Here, b and c are the discordant cells, and interest centers on the difference of marginals: p1 − p2 = (a + b)/n − (a + c)/n = (b − c)/n. Hypothesis testing is most commonly performed using the McNemar approach (32), using either an asymptotic test statistic expressed as z or chi-square, or an aggregated tail probability. In both situations, inference is conditional on the total number of discordant pairs, b + c. Newcombe (33) reviewed confidence interval methods for the paired difference case. Many of these are closely analogous to unpaired methods. The Wald interval performs poorly. So does a conditional approach, based on an interval for the simple proportion b/(b + c). Exact and tail-based profile methods perform well; although, as before, these are computationally complex. A closedform square-and-add approach, modified to take account of the nonindependence, also aligns mean coverage with 1 – α, provided that a novel form of continuity correction is incorporated. Tango (34) developed a score interval, which is boundary respecting and was subsequently shown to perform excellently (35). Several further modifications were suggested by Tang, Tang, and Chan (36). Agresti and Min (11) proposed pseudo-frequency methods involving adding ψ = 0.5 to each cell
and demonstrated good agreement of mean coverage with 1 − α. However, overshoot can occasionally occur. The above methods are appropriate for a paired difference of proportions. But for crossover and simultaneous split-unit studies, a slightly different approach is preferable. Thus, in a crossover study, if the numbers of subjects who get the two treatment sequences AB and BA are not identical, the simple difference of marginals contains a contribution from period differences. A more appropriate analysis is based on the analysis of differences of paired differences described by Newcombe (25). The example in Table 4 relates to a crossover trial of home versus hospital physiotherapy for chronic multiple sclerosis (37). Twenty-one patients were randomized to receive home physiotherapy followed by hospital physiotherapy, and 19 to receive these treatments in the reverse order. Following Hills and Armitage (38) and Koch (39), the treatment effect is estimated as half the difference between the within-subjects period differences in the two treatment order groups. The resulting estimate, +0.1454, and 95% confidence interval, −0.0486 to +0.3238, are very similar but not identical to those obtained by direct application of the modified square-and-add approach (33), +0.1500 and −0.0488 to +0.3339.

Table 4. Crossover Trial of Home Versus Hospital Physiotherapy: Treating Physiotherapist's Assessment of Whether the Patient Benefited from Either Type of Treatment

                          Number of patients benefiting from:    Difference (first minus second treatment)
Treatment sequence        Both   First   Second   Neither        Estimate    95% confidence interval
Home versus hospital       11      6       1         3           +0.2381     −0.0127 to +0.4534
Hospital versus home        9      4       5         1           −0.0526     −0.3372 to +0.2434
Difference                                                       +0.2907     −0.0973 to +0.6475
Half                                                             +0.1454     −0.0486 to +0.3238

Source: R. G. Newcombe, Estimating the difference between differences: measurement of additive scale interaction for proportions. Stat Med. 2001; 20: 2885–2893. Reproduced with permission.
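The point estimates in Table 4 follow directly from the four counts in each sequence group. The sketch below is an illustration by the editor (not code from the article); it computes only the period differences and the Hills-Armitage half-difference estimate, while the confidence intervals require the square-and-add machinery described earlier.

```python
# Editor's sketch of the crossover treatment-effect estimate summarized in Table 4:
# half the difference between within-subject period differences (first minus second
# period) in the two treatment-order groups.
def crossover_effect(group_ab, group_ba):
    # Each group is (both, first_only, second_only, neither).
    def period_diff(both, first, second, neither):
        n = both + first + second + neither
        return (first - second) / n
    d_ab = period_diff(*group_ab)   # e.g., home-then-hospital sequence
    d_ba = period_diff(*group_ba)   # e.g., hospital-then-home sequence
    return d_ab, d_ba, (d_ab - d_ba) / 2

print(crossover_effect((11, 6, 1, 3), (9, 4, 5, 1)))
# approximately (0.2381, -0.0526, 0.1454), matching the Estimate column of Table 4
```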
REFERENCES

1. K. Rothman, Modern Epidemiology. Boston: Little, Brown, 1986. 2. L. Hashemi, B. Nandram, and R. Goldberg, Bayesian analysis for a single 2 × 2 table. Stat Med. 1997; 16: 1311–1328. 3. E. Schechtman, Odds ratio, relative risk, absolute risk reduction, and the number needed to treat—which of these should we use? Value Health. 2002; 5: 431–436. 4. A. Agresti, Categorical Data Analysis, 2nd ed. Hoboken, NJ: Wiley, 2002. 5. R. G. Newcombe, A deficiency of the odds ratio as a measure of effect size. Stat Med. 2006; 25: 4235–4240. 6. S. E. Vollset, Confidence intervals for a binomial proportion. Stat Med. 1993; 12: 809–824. 7. R. G. Newcombe, Two-sided confidence intervals for the single proportion: comparison of seven methods. Stat Med. 1998; 17: 857–872.
8. A. Agresti and B. A. Coull, Approximate is better than ‘‘exact’’ for interval estimation of binomial proportions. Am Stat. 1998; 52: 119–126. 9. L. D. Brown, T. T. Cai, and A. DasGupta, Interval estimation for a proportion. Stat Sci. 2001; 16: 101–133. 10. E. B. Wilson, Probable inference, the law of succession, and statistical inference. J Am Stat Assoc. 1927; 22: 209–212. 11. A. Agresti and Y. Min, Frequentist performance of Bayesian confidence intervals for comparing proportions in 2 × 2 contingency tables. Biometrics. 2005; 61: 515–523. 12. B. P. Carlin and T. A. Louis, Bayes and Empirical Bayes Methods for Data Analysis. London: Chapman & Hall, 1996. 13. C. J. Clopper and E. S. Pearson, The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika. 1934; 26: 404–413. 14. H. O. Lancaster, The combination of probabilities arising from data in discrete distributions. Biometrika. 1949; 36: 370–382. 15. G. Berry and P. Armitage, Mid-P confidence intervals: a brief review. Statistician. 1995; 44: 417–423. 16. H. Blaker, Confidence curves and improved exact confidence intervals for discrete distributions. Can J Stat. 2000; 28: 783–798. 17. J. Reiczigel, Confidence intervals for the binomial parameter: some new considerations. Stat Med. 2003: 22; 611–621. 18. M. Parzen, S. Lipsitz, J. Ibrahim, and N. Klar, An estimate of the odds ratio that always exists. J Comput Graph Stat. 2002: 11; 420–436. 19. R. W. Mee, Confidence bounds for the difference between two probabilities. Biometrics. 1984; 40: 1175–1176. 20. O. S. Miettinen and M. Nurminen, Comparative analysis of two rates. Stat Med. 1985; 4: 213–226. 21. R. G. Newcombe, Interval estimation for the difference between independent proportions: comparison of eleven methods. Stat Med. 1998; 17: 873–890. 22. R. G. Newcombe, Bayesian estimation of false negative rate in a clinical trial of sentinel node biopsy. Stat Med. 2007; 26: 3429–3442. 23. D. E. Walters, On the reliability of Bayesian confidence limits for a difference of two proportions. Biom. J. 1986; 28: 337–346. 24. T. Fagan, Exact 95% confidence intervals for differences in binomial proportions. Comput
Biol Med. 1999; 29: 83–87. 25. R. G. Newcombe, Estimating the difference between differences: measurement of additive scale interaction for proportions. Stat Med. 2001; 20: 2885–2893. 26. W. W. Hauck and S. Anderson, A comparison of large-sample confidence interval methods for the difference of two binomial probabilities. Am Stat. 1986; 40: 318–322. 27. A. Agresti and B. Caffo, Simple and effective confidence intervals for proportions and differences of proportions result from adding 2 successes and 2 failures. Am Stat. 2000; 54: 280–288. 28. A. Laupacis, D. L. Sackett, and R. S. Roberts, An assessment of clinically useful measures of the consequences of treatment. N Engl J Med. 1988; 318: 1728–1733. 29. D. G. Altman, Confidence intervals for the number needed to treat. BMJ. 1998; 317: 1309–1312. 30. R. Bender, Calculating confidence intervals for the number needed to treat. Control Clin Trials. 2001; 22: 102–110. 31. R. G. Newcombe, Confidence intervals for the number needed to treat—absolute risk reduction is less likely to be misunderstood. BMJ. 1999; 318: 1765. 32. Q. McNemar, Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika. 1947; 12: 153–157. 33. R. G. Newcombe, Improved confidence intervals for the difference between binomial proportions based on paired data. Stat Med. 1998; 17: 2635–2650. 34. T. Tango, Equivalence test and CI for the difference in proportions for the paired-sample design. Stat Med. 1998; 17: 891–908. 35. R. G. Newcombe, Confidence intervals for the mean of a variable taking the values 0, 1 and 2. Stat Med. 2003; 22: 2737–2750. 36. M. L. Tang, N. S. Tang, and I. S. F. Chan, Confidence interval construction for proportion difference in small sample paired studies. Stat Med. 2005; 24: 3565–3579. 37. C. M. Wiles, R. G. Newcombe, K. J. Fuller, S. Shaw, J. Furnival-Doran, et al., Controlled
randomised crossover trial of physiotherapy on mobility in chronic multiple sclerosis. J Neurol Neurosurg Psych. 2001; 70: 174–179. 38. M. Hills and P. Armitage, The two-period cross-over clinical trial. Br J Clin Pharmacol. 1979; 8: 7–20. 39. G. G. Koch, The use of non-parametric methods in the statistical analysis of the twoperiod change-over design. Biometrics. 1972; 28: 577–584.
FURTHER READING

Microsoft Excel spreadsheets that implement chosen methods for the single proportion, unpaired and paired difference, and interaction cases can be downloaded from the author's website: http://www.cardiff.ac.uk/medicine/epidemiology statistics/research/statistics/newcombe

The availability of procedures to calculate confidence intervals for differences of proportions is quite patchy in commercial software. StatXact (Cytel Statistical Software) includes confidence intervals for differences and ratios of proportions and odds ratios. These are "exact" intervals, designed to guarantee minimum coverage 1 − α. The resulting intervals are likely to be relatively wide compared with methods that seek to align the mean coverage approximately with 1 − α.
CROSS-REFERENCES

Estimation
Confidence interval
Categorical variables
Comparing proportions
Sample size for comparing proportions (superiority and noninferiority)
Relative risk
Odds ratio
Bayesian approach
Equivalence analysis
Noninferiority analysis
ACCELERATED APPROVAL
LOUIS A. CABANILLA, M.S.
CHRISTOPHER P. MILNE, D.V.M., M.P.H., J.D.
Tufts University Center for the Study of Drug Development, Boston, Massachusetts

The development and approval of a new drug is a complex and time-consuming procedure that takes many years to accomplish. By the time a drug receives FDA approval, extensive laboratory and clinical work must be performed to ensure that the drug is both safe and effective, and the FDA spends months or years reviewing the drug's application. Although this level of quality control is a great benefit to the people who use the drug, the extensive time spent on development and regulatory review represents a barrier for a person with a serious or life-threatening disease for which no treatment exists, or for a person for whom available treatments have failed. In such cases, speedy development and regulatory review are of the utmost importance. A patient with no or few therapeutic options is often willing to accept a higher risk-to-benefit ratio, including the use of a treatment whose efficacy is predicated on indirect measures of the expected clinical benefit.

1 ACCELERATED DEVELOPMENT VERSUS EXPANDED ACCESS IN THE U.S.A.

In the United States, an emphasis has been placed on making potentially life-saving drugs available as soon as possible. In some instances, this process involves making experimental drugs available to patients who are not enrolled in clinical trials, but more typically it involves programs designed to decrease the time to market for these important drugs. This movement took shape in the middle to late 1980s, when AIDS drugs were being developed but were not made available to the wider patient population quickly enough, which caused outrage among those who risked dying while waiting for the FDA to approve these drugs for marketing.

In 1987, four major initiatives were drafted to speed drug delivery, cut cost, and make drugs available sooner. Two of these initiatives, "Treatment IND" and "Parallel Track," focused on expanding access to potentially life-saving drugs for specific patient populations prior to approval of the drug. In contrast, the "Subpart E" and "Accelerated Approval" programs focused on decreasing the amount of clinical and regulatory review time needed to approve a life-saving drug. Subpart E provides a regulatory framework to grant approval to certain drugs after an extended Phase II trial. Accelerated approval allows a drug to be approved with restrictions on distribution or with the use of unvalidated surrogate endpoints or measures of indirect clinical benefit to determine efficacy (1). The terminology and regulatory implications of the various FDA programs can be confusing and are often mislabeled in the literature. In understanding the terms, it is worth noting that "Subpart E," also known as "Expedited Development," refers to that subpart of Title 21 of the Code of Federal Regulations (CFR). Likewise, "Accelerated Approval" is sometimes referred to as "Subpart H," as that is the subpart of 21 CFR that created the regulations allowing for this process by the FDA. Since the 1990s, further initiatives such as Fast Track designation, Priority Review, and Rolling Review have been implemented to facilitate the path of life-saving drugs to market.
2 SORTING THE TERMINOLOGY—WHICH FDA INITIATIVES DO WHAT?

• Treatment IND. "TIND" provides for early access to promising drugs for patients with serious or life-threatening illnesses who have no treatment options, or for patients who have failed available therapies. Drug companies are allowed to recoup the cost of producing the drug for these patients, who are not involved in the clinical trials.

• Parallel Track. This program is a more extensive version of the TIND program aimed at people who do not qualify for participation in clinical trials. The program was a response to the AIDS epidemic, and it is generally intended for HIV/AIDS treatments or related problems. Drug companies collect basic safety data from these patients and are allowed to recoup the cost of producing the drug.

• Expedited Development (Subpart E). This provision allows for extensive interaction and negotiation with the FDA to move a drug to the market as quickly as possible. This process includes the possibility of expanding a Phase II trial and using those data for a new drug application (NDA)/biologic licensing application (BLA) submission, and then focusing on postmarketing safety surveillance.

• Accelerated Approval (Subpart H). This provision allows drugs to gain approval with distribution restrictions or, more commonly, based on unvalidated surrogate endpoints or an indirect clinical benefit. It allows life-saving drugs to become available while the company completes a long-term Phase IV study.

• Fast Track. This program originated as a provision of the second Prescription Drug User Fee Act (PDUFA II), and it allows the FDA to facilitate the development of and to expedite the review of drugs intended to treat serious or life-threatening illnesses that have the potential to address unmet medical needs.

• Priority Review. This review allows the FDA to allocate more resources to the review of a "priority" drug (i.e., one that represents a significant advance over those currently on the market).

• Rolling Review. Under rolling review, companies can submit sections or "Reviewable Units" (RUs) of an NDA or BLA as they complete them, to be reviewed by the FDA. Although an RU may be complete, marketing rights are not granted until all RUs are submitted and approved (1,2).

3 ACCELERATED APPROVAL REGULATIONS: 21 C.F.R. 314.500, 314.520, 601.40

The FDA has several programs to expedite the development and approval of life-saving drugs. Because some similarities are observed among the programs with respect to eligibility requirements and regulatory language, they are sometimes confused with each other. Although some overlap does occur, they generally affect different segments of the development and application review timeline (see Fig. 1).

4 STAGES OF DRUG DEVELOPMENT AND FDA INITIATIVES
Figure 1. Stages of drug development and FDA initiatives. The progressive stages of drug development (pre-IND, IND, end of Phase I, end of Phase II, end of Phase III, NDA submission, NDA approval, and marketing), and where the initiatives (Subpart E, Fast Track, Accelerated Approval, Priority Rating, and Rolling Review) begin and end.

Figure 2. Accelerated Approval (1992–2005) and other initiatives. The overlap (percentage of accelerated approvals) between accelerated approval and various other initiatives (Fast Track, Subpart E, Rolling NDA, and Orphan Drug).

The intended effect of FDA expedited development programs is to speed development and approval; however, the respective programs focus on disparate aspects of the development and regulatory process. For example,
accelerated approval is often confused with fast track designation and priority review; however, they are quite different. The central mechanism of the accelerated approval program is ‘‘conditional’’ approval based on the use of unvalidated surrogate endpoints or indirect clinical endpoints as evidence of efficacy, or restricted distribution. Fast track designation provides the opportunity for intensive scientific interaction with the FDA and acts as a threshold criterion for rolling review. Priority review is an administrative prioritization scheme implemented by the FDA to give precedence to applications for drugs that represent an improvement to the currently marketed products for a particular disease or condition (also applies to eligible diagnostics and preventatives). Moreover, it should be recognized that the same drug could be a part of accelerated approval, fast track, and other programs at the same time (see Fig. 2). Many accelerated approvals have also benefited from other programs aimed at decreasing the time and resources required to bring crucial drugs to market. In addition, many accelerated approvals are also orphan drugs. Orphan drugs are treatments for rare diseases and conditions, and the FDA designation provides certain economic and regulatory incentives for a drug company to develop them.
5 ACCELERATED APPROVAL REGULATIONS: 21 CFR 314.500, 314.520, 601.40

Accelerated approval regulations were promulgated December 11, 1992. The law stipulates the following (3):

FDA may grant marketing approval for a new drug product on the basis of adequate and well-controlled clinical trials establishing that the drug product has an effect on a surrogate endpoint that is reasonably likely, based on epidemiologic, therapeutic, pathophysiologic, or other evidence, to predict clinical benefit or on the basis of an effect on a clinical endpoint other than survival or irreversible morbidity.
Conditions for approval:

• Possible FDA restrictions on distribution and use by facility or physician, a mandated qualifying test, or a procedural administration requirement
• Promotional materials must be submitted to, and approved by, the FDA
• Streamlined withdrawal mechanisms if:
  – The anticipated benefit is not confirmed in Phase IV trials
  – The sponsor fails to exercise due diligence in performing Phase IV trials
  – Restrictions on use prove insufficient to ensure safe usage
  – Restrictions on use and distribution are violated
  – Promotional materials are false or misleading
  – There is other evidence that the drug is not safe or effective (3)

An FDA advisory committee can recommend a drug for accelerated approval based on the set criteria for qualification, which include the soundness of evidence for the surrogate markers. Following an advisory committee recommendation, the FDA can then review the application and approve the drug for marketing. Drugs must be intended for patients with serious or life-threatening illnesses. Moreover, the data used for the approval must show effectiveness on an unvalidated surrogate endpoint that is "reasonably likely to predict a clinical effect." In contrast, a drug supported by a validated surrogate endpoint would proceed through the normal approval process. If a company seeks accelerated approval based on restricted distribution, it must have clear distribution restriction practices and provider/user education programs in place for the drug to gain approval. An NDA submission based on unvalidated surrogate endpoints must still stand up to the scrutiny of the NDA review process, and it can be rejected for any of the reasons a traditional NDA could be rejected, such as safety, efficacy, or concerns about product quality. Beyond the issues of approval, all promotional materials for an accelerated approval drug must be submitted to the FDA for approval, and they must be periodically reviewed. This review is another method of ensuring the appropriate understanding and availability of these drugs for both doctors and patients.

The initial approval of a drug by the FDA pertains to a particular or limited set of indications for a particular subpopulation of patients. However, it is common for a drug later to be found to have a benefit for multiple populations and multiple indications. For a drug that is already approved for use, the expansion to a subsequent indication requires a less comprehensive supplemental NDA (sNDA) or supplemental BLA (sBLA). A supplement can be eligible for accelerated approval status, whether or not the first indication was a traditional approval. Over the past 5 years more sNDAs have been granted accelerated approval. In the 2000s,
23 sNDA accelerated approvals were granted by the FDA, whereas in the 1990s (1993–1999) only five sNDAs were approved (4). Most accelerated approvals have been granted to sponsors of small molecule drugs (i.e., chemicals), but a significant minority, which has increased over time, have been granted to large molecule drugs (i.e., biologics). This latter group is generally composed of monoclonal antibodies, although many more designations have been granted for other biologics such as vaccines or recombinant proteins. Nearly 31% of all accelerated approvals have been for biologics, most for oncology, although several accelerated approvals have been granted for rare diseases such as multiple sclerosis and Fabry's disease (5). Increasingly, accelerated approvals have been given to speed the development and availability of preventative vaccines, such as the influenza vaccine for the 2006–2007 flu season (6) (Fig. 3).

6 ACCELERATED APPROVAL WITH SURROGATE ENDPOINTS

The major advantage of the accelerated approval designation is that it can decrease the complexity of late-stage clinical trials using an unvalidated surrogate endpoint, which may decrease the number of patients who must be enrolled, decrease the amount of data being collected, and, most of all, decrease the time required to conduct the necessary studies (7). For example, in 1992 ddI (also called Videx or didanosine) was approved based on CD4 levels rather than the survival rate for HIV-infected patients, which greatly reduced the Phase III trial time. However, this endpoint has become validated; as a result, new drugs that use CD4 levels as a clinical measurement now proceed through the normal NDA process. In terms of speed to market, drugs that are designed to increase the survival rate or slow disease progression can take years to properly test, which is generally the case for illnesses such as cancer, HIV, or multiple sclerosis. In such cases, accelerated approval is beneficial in making a drug available to patients while long-term data are collected.
The ability to use easily observable and quantifiable data as evidence of clinical effectiveness allows a drug to reach the market much more quickly than using traditional measurements such as overall survival (7). Surrogate endpoints are a subset of biological markers that are believed to correlate with clinical endpoints in a trial (8). Although it is not guaranteed that an actual relationship exists, surrogate endpoints are "based on scientific evidence for clinical benefit based on epidemiologic, therapeutic, pathophysiologic or other scientific evidence" (8). A clinical trial testing the efficacy of a statin, for example, might use a reduction in cholesterol as a surrogate for a decrease in heart disease. Elevated cholesterol levels are linked to heart disease, so this correlation is likely. The benefit of using cholesterol levels is that heart disease often takes decades to develop and would require a time-consuming and expensive trial to test. With this in mind, the FDA might approve the statin for marketing contingent on the sponsor company's agreement to complete postmarketing commitment(s) to test for a clinical effect (Phase IIIb or Phase IV studies). Phase IIIb and Phase IV studies are distinct from each other in that the former begins after the application submission but before approval, and it typically continues into the postmarketing period; a Phase IV study is a postmarketing study.
6.1 What Is a Surrogate Endpoint?

A surrogate endpoint is a laboratory or physical sign (a biomarker) that is used in therapeutic trials as a substitute for a clinically meaningful endpoint that is a direct measure of how a patient feels, functions, or survives and that is expected to predict the effect of the therapy (8).

6.2 What Is a Biomarker?

A biomarker is a characteristic that is measured and evaluated objectively as an indicator of normal biologic or pathogenic processes or pharmacological responses to a therapeutic intervention. Surrogate endpoints are a subset of biomarkers (8).

7 ACCELERATED APPROVAL WITH RESTRICTED DISTRIBUTION

Although accelerated approval drugs are generally approved on the basis of surrogate endpoints, 21 C.F.R. Section 314.520 also allows for approval of a drug based on restrictions on distribution. These restrictions can refer to certain facilities or physicians who are allowed to handle and prescribe the drug, or to certain requirements, tests, or medical procedures that must be performed prior to use of the drug.
Figure 3. Product type of accelerated approvals, 1992–2005. Comparison between small molecule and biologic drugs approved by the FDA through Subpart H (percentage of approvals by type of drug).

The drug thalidomide has a restricted accelerated approval for the treatment of Erythema
Nodosum Leprosum because the drug is associated with a high risk of birth defects in infants whose mothers are exposed to the drug. To mitigate this risk, doctors must complete an educational program on the risks and safe usage of thalidomide before they are certified to prescribe the drug. Few accelerated approvals are based on this restriction, although they have become more common over time. The number of restricted distribution approvals averaged one per year between 2000 and 2005. However, in the first half of 2006 there were four restricted approvals, all of which were supplemental approvals (Fig. 4) (4).

8 PHASE IV STUDIES/POSTMARKETING SURVEILLANCE

Although a drug that is given accelerated approval might reach the market, the company that sponsors the drug is required to complete the necessary research to determine efficacy with a Phase IV study. These studies are intended to bring the quality of the application dossier up to the scientific standards of drugs approved through the traditional NDA/BLA process. Sec. 314.510 (9) stipulates:

Approval under this section will be subject to the requirement that the applicant study the drug further, to verify and describe its clinical benefit, where there is uncertainty as to
the relation of the surrogate endpoint to clinical benefit, or of the observed clinical benefit to ultimate outcome. Postmarketing studies would usually be studies already underway. When required to be conducted, such studies must also be adequate and well controlled. The applicant shall carry out any such studies with due diligence.
Some ambiguity exists in this section of the law, as no actual time frame was given, until recent years, for companies to complete their Phase IV trials and submit their findings to the FDA. Rather, it was expected that companies would use "due diligence" in completing their research. This nebulous definition has led to some contention about the level of compliance by the drug industry. This question is considered in more depth in a subsequent section.

9 BENEFIT ANALYSIS FOR ACCELERATED APPROVALS VERSUS OTHER ILLNESSES

Drugs that qualify for accelerated approval are typically intended to treat serious conditions and diseases such as pulmonary arterial hypertension, HIV, malignancies, Fabry's disease, and Crohn's disease.
Figure 4. Approval type, 1992–2005 (accelerated approval information through 06/06). A comparison between Subpart H approvals granted on a restricted distribution basis versus a surrogate endpoint basis, for original NDAs/BLAs and supplemental sNDAs/sBLAs (number of approvals).

Given that the average median total development time for all FDA-approved drugs during the 1990s and 2000s was just over 7 years (87 months), drugs under the accelerated approval program have spent considerably less time in
the journey from bench to bedside in such critical therapeutic areas as HIV/AIDS and cancer (Fig. 5).

To qualify for accelerated approval, a drug must offer a meaningful therapeutic benefit over available therapy, such as:

• Greater efficacy
• A more favorable safety profile
• Improved patient response over existing therapies (9)

Given the severity of the illnesses and the potential for a "meaningful therapeutic benefit," the FDA is generally willing to accept a higher risk-to-benefit ratio for a potential therapy. A drug with a comparatively high risk of side effects or with less certain proof of efficacy can be approved. For example, a drug seeking traditional approval for the treatment of insomnia would need an acceptable overall safety and efficacy profile. Conversely, an accelerated approval drug for multiple sclerosis could gain approval with a likely clinical effect based on surrogate endpoints. This difference exists because patient advocacy groups and public opinion have made it clear that patients in such situations are willing to bear additional risks. For patients with no other options, a new therapy often represents the only possibility for increasing the length or quality of their life. Likewise, a drug designated as a first-line treatment for a malignancy would have
a more rigorous approval process, given the currently available treatments, than a drug designated as a third-line treatment for patients who have failed all conventional treatment options. Within any patient population, a certain percentage will fail to respond to first- or second-line treatments or may become resistant to therapy. This is particularly true for malignancies that have high relapse rates and for infections such as HIV, in which the virus develops resistance to treatments over time. In situations such as these, creating the incentive for drug companies to develop second-, third-, and even fourth-line treatments with new mechanisms of action is critical.
Figure 5. Median development and approval times for accelerated approval drugs, 1992–2005. Median development and approval times (in months), by indication (HIV/AIDS, cancer, cardiovascular and related diseases, and other), for drugs approved through Subpart H.

10 PROBLEMS, SOLUTIONS, AND ECONOMIC INCENTIVES

Accelerated approval is intended to increase the speed to market for important drugs. Although this timeline is a benefit to patients, it is also a benefit to the company that sponsors the drug. The use of surrogate endpoints can greatly decrease the length and complexity of a clinical trial. This allows a drug company to begin recouping sunk costs earlier and provides returns on sales for a longer period. Without accelerated approval, it is possible that many of these drugs would have been viewed as too risky or expensive to develop in terms of return on investment for
drug companies. Even after approval, sponsor companies face some degree of uncertainty. It is possible that some practitioners, patients, and third-party payers may consider such approvals to be conditional and avoid their use until no alternative is available. Moreover, insurance companies may refuse to reimburse experimental drugs; thus, the financial burden of the treatment may also place the drugs out of reach of patients. However, these drugs can be very profitable. In 2004, 20 drugs first approved under accelerated approval were among the top 200 best-selling drugs, with combined sales of $12.54 billion (Table 1) (10).

Table 1. 2004 sales figures for drugs first approved under accelerated approval (billions of US dollars)

Viracept (nelfinavir mesylate)  $0.23
Biaxin (clarithromycin)  $0.27
Cipro (ciprofloxacin HCl)  $0.27
Eloxatin (oxaliplatin)  $0.78
Taxotere (docetaxel)  $0.94
Levaquin (levofloxacin)  $1.70
Remicade (infliximab)  $1.80
Celebrex (celecoxib)  $2.70
Erbitux (cetuximab)  $0.24
Temodar (temozolomide)  $0.24
Thalomid (thalidomide)  $0.27
Casodex (bicalutamide)  $0.28
Norvir (ritonavir)  $0.30
Arimidex (anastrozole)  $0.30
Epivir (lamivudine)  $0.31
Gleevec (imatinib mesylate)  $0.34
Betaseron (interferon beta-1b)  $0.38
Kaletra (lopinavir/ritonavir)  $0.45
Camptosar (irinotecan HCl)  $0.48
Viread (tenofovir disoproxil fumarate)  $0.53

A prerequisite for accelerated approval designation is that the drug company will complete a Phase IV trial with "due diligence" following the marketing of the drug. Once approved, evidence from a Phase IV trial that does not support the efficacy of the drug may lead to the drug being withdrawn. The original text has proven problematic in that "due diligence" is ambiguous and is understood differently by the various stakeholders. Moreover, despite the wording of the provisions for drug withdrawal, it is difficult to do so once market rights have
been granted. A congressional staff inquiry found that as of March 2005, 50% of outstanding accelerated approval postmarketing studies had not begun, whereas most of the other 50% of studies were underway. Overall, a significant proportion (26%) of trials is taking longer than expected to complete (11). The incentive for a company to perform a Phase IV trial is minimal given that the sponsor company has already received marketing approval, and the completion of the trial is expensive and time consuming. As of 2006, no drug had been withdrawn for failure to complete a Phase IV trial. However, proposed regulations and guidance within the FDA could change this (12). The biggest change to the original procedure for accelerated approval came in 2003 when the FDA reinterpreted the laws regarding the number of accelerated approvals granted per oncology indication. Whereas the old practice allowed only one accelerated approval per indication, the new interpretation viewed a medical need as ‘‘unmet’’ until the sponsoring company completes a Phase IV trial (13). This change provides an incentive for drug companies to complete Phase IV studies quickly to avoid market competition
from other drugs. More recently, the FDA has released a guidance suggesting that accelerated approvals could be based on interim analyses of a trial that continues until a final Phase IV report is completed, rather than on separate Phase IV trials (14). This guidance would potentially decrease the cost of a Phase IV study as well as alleviate the issue of patient enrollment, which is often cited as a cause of delays by drug companies. The idea of granting "conditional" approval contingent on Phase IV completion for "full" approval has also been mentioned (15). Increasing Phase IV follow-through is important to the proper functioning of the accelerated approval program. Although the failure to complete a Phase IV trial might hurt the patient population by leaving some uncertainty as to a drug's safety or efficacy, it also hurts the drug companies in the long run. In 2005, an advisory committee refused to recommend the drug Zarnestra (Johnson & Johnson, New Brunswick, NJ) for accelerated approval and cited the FDA's failure to enforce Phase IV obligations as a reason (16). This decision was an example of the impact that a lack of confidence in FDA enforcement and sponsor compliance could have on the future performance of the program. A final issue that the FDA must address is that of public perception and understanding of exactly what accelerated approval is and is not. There are often misconceptions about the conditions under which accelerated approval is granted, such as the belief that it is better than a full approval or that it is the same as a fast track approval. In the interest of transparency, the public must understand that an accelerated approval is less certain than a traditional approval and that less than a full quantum of evidence was used as the basis for efficacy. However, an accelerated approval should not be viewed as inherently more risky, when in fact it has been demonstrated to be more beneficial for certain patients to take the drug than not, based on consideration of all the risks and benefits currently known.
11 FUTURE DIRECTIONS
Will there be changes to the accelerated approval program? On one hand, some
patient groups believe that accelerated approval is not delivering on its promise of bringing more therapeutic options to patients with diseases for which few or no adequate treatments are available. They question the arduousness of the process, the increasing trepidation of the FDA regarding potential ill effects, as well as political influences being brought to bear on a program that was born of an initiative inspired by patients, medical practitioners, and the FDA in response to a public health emergency. On the other hand, there is increasing discomfort among some consumer watchdog groups as well as patient groups concerning the diligence with which the confirmatory postmarketing studies are being pursued by biopharmaceutical firms. Addressing this issue was a key focus of the public debates surrounding the reauthorization of the PDUFA in 2007. Will the concept of accelerated approval go global? Research on biomarkers is becoming a global endeavor, and the need to get life-saving drugs to patients more expeditiously is a worldwide concern. It would seem likely that the regulatory agencies in countries and regions that are major centers of biopharmaceutical research and development, such as Japan and the European Union, would follow the United States' lead, as they did in setting up their orphan drug programs 10 and 20 years after the United States, respectively (with piggy-back programs emerging in other industrialized countries such as Australia and Singapore). However, the EU already has a program that is similar to accelerated approval, termed marketing authorization "under exceptional circumstances," for products that do not have, and are likely never to have, full safety and efficacy data (17). The EU also recently enacted "conditional approval," which allows for approval based on a less complete approval package than is normally required (18). It is possible that some symbiosis may occur between the EU and U.S. programs. The FDA may become more comfortable with the concept of a conditional approval (i.e., annual status review and restrictive marketing until confirmatory studies are complete), whereas the EU may become more expansive in employing conditional approval based on the increasing availability of surrogate biomarkers as
adequate indicators of safety and efficacy in "exceptional circumstances." The Japanese regulatory body has also adopted a conditional approval process, which includes provisions for the use of restricted access and surrogate endpoints; however, Japan does not have a program that is specific to accelerated approvals. This conditional approval process, which encompasses many initiatives, has been an effective means of decreasing development and regulatory time (19). Overall, the success of the accelerated approval program to date in bringing some 100 drug products to market months or years faster than would have happened through the traditional process, together with advances in the science and technology of the clinical evaluation of biopharmaceuticals and the more powerful voice of patient advocacy in determining regulatory agency policy, ought to ensure that accelerated approval will expand its reach both therapeutically and geographically.
REFERENCES

1. S. S. Shulman and J. S. Brown, The Food and Drug Administration's early access and fast-track approval initiatives: how have they worked. Food Drug Law J. 1995; 50: 503–531.
2. Guidance for Industry: Fast Track Drug Development Programs – Designation, Development, and Application Review. January 2006, Procedural, Revision 2. U.S. Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research (CDER) and Center for Biologics Evaluation and Research (CBER).
3. New Drug, Antibiotic, and Biological Drug Product Regulations; Accelerated Approval. Final Rule. 57 Federal Register 58942 (December 11, 1992). Department of Health and Human Services, Food and Drug Administration.
4. Accelerated Approvals under 21 CFR 314 Subpart H (drugs) and 21 CFR 601 Subpart E (biologics). U.S. Food and Drug Administration, CDER website. Available at: http://www.fda.gov/cder/rdmt/accappr.htm. Last accessed 09/15/2006.
5. Unpublished data, Marketed Database, September 2006. Tufts Center for the Study of Drug Development. Boston, MA: Tufts University.
6. FDA News: FDA Approves Additional Vaccine for Upcoming Influenza Season. October 5, 2006. U.S. Food and Drug Administration. [Press release] Available at: http://www.fda.gov/hhs/topics/NEWS/2006/NEW01478.html.
7. R. L. Schilsky, Hurry up and wait: is accelerated approval of new cancer drugs in the best interests of cancer patients? J. Clin. Oncol. 2003; 21: 3718–3720.
8. A. G. Chakravarty, Surrogate markers – their role in the regulatory decision process. FDA Workshop, American Statistical Association, Toronto, Canada, 2004.
9. Subpart H, Accelerated Approval of New Drugs for Serious or Life-Threatening Illnesses. 21 CFR 314.500 (Revised April 1, 2004).
10. Top 200 Drugs for 2004 by U.S. Sales. Available at: http://www.drugs.com/top200 2004.html.
11. Conspiracy of Silence: How the FDA Allows Drug Companies to Abuse the Accelerated Approval Process. Responses by the Food and Drug Administration and the Securities and Exchange Commission to correspondence from Rep. Edward J. Markey (D-MA). Congressional staff summary, June 1, 2005.
12. A. Kaspar, Accelerated approval should be "conditional," panel says. Pink Sheet Daily, June 17, 2005.
13. FDA revising accelerated approval standard in oncology; biomarkers sought. The Pink Sheet, June 9, 2003; 65: 23.
14. E. Susman, Accelerated approval seen as triumph and roadblock for cancer drugs. J. Natl. Cancer Inst. 2004; 96: 1495–1496.
15. J. L. Fox, Conflicting signals on US accelerated approvals. Nature Biotechnology, published online, 31 August 2005. Available at: http://www.nature.com/news/2005/050829/full/nbt09051027.html.
16. Iressa casts shadow over Zarnestra advisory committee review. The Pink Sheet, May 16, 2005; 67: 18.
17. Guideline on procedures for the granting of a marketing authorization under exceptional circumstances, pursuant to Article 14(8) of Regulation (EC) No 726/2004. London, 15 December 2005. Doc. Ref. EMEA/357981/2005. European Medicines Agency (EMEA).
18. Guideline on the scientific application and the practical arrangements necessary to implement Commission Regulation (EC) No 507/2006 on the conditional marketing authorization for medicinal products for human use falling within the scope of Regulation (EC) No 726/2004. London, 5 December 2006. Doc. Ref. EMEA/509951/2006. European Medicines Agency (EMEA).
19. Japan regulators increase use of "conditional approvals" on drugs. Pacific Bridge Medical: Asian Medical Newsletter 2006; 6: 1–2.
FURTHER READING

Federal Register Final Rules, Title 21, Chapter 1, Subchapter D, Drugs for Human Use, Subpart H, Accelerated Approval of New Drugs for Serious or Life-Threatening Illnesses. §314 and 600.
Guidance for Industry: Providing Clinical Evidence of Effectiveness for Human Drug and Biological Products. U.S. Department of Health and Human Services, Food and Drug Administration, May 6, 1998.
Guidance for Industry: Fast Track Drug Development Programs – Designation, Development, and Application Review. U.S. Department of Health and Human Services, Food and Drug Administration, January 2006.
ADVERSE DRUG REACTION REPORTING

The sponsor should expedite the reporting to all concerned investigator(s)/institution(s), to the Institutional Review Board(s) (IRB)/Independent Ethics Committee(s) (IEC) where required, and to the regulatory authority(ies) of all adverse drug reactions (ADRs) that are both serious and unexpected. Such expedited reports should comply with the applicable regulatory requirement(s) and with the ICH Guideline for Clinical Safety Data Management: Definitions and Standards for Expedited Reporting. The sponsor should submit to the regulatory authority(ies) all safety updates and periodic reports, as required by applicable regulatory requirement(s).
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
ADVERSE EVENT EVALUATION
LINDSAY A. MCNAIR, M.D.
Vertex Pharmaceuticals, Cambridge, MA
Boston University School of Public Health, Department of Epidemiology, Boston, Massachusetts

1 INTRODUCTION

Although much of the focus in later-stage clinical trials is often on study design and the evaluation of the efficacy of an intervention or pharmaceutical product, the appropriate collection and assessment of safety data is a critical part of any clinical study. In early-stage trials, the primary objective of the study may be the assessment of safety (or the identification of a maximum tolerated dose). For the assessment of safety data, several key factors must be collected for any adverse event. This article describes the identification of adverse events and the data to be collected, providing the necessary tools for safety assessment in clinical studies.

2 IDENTIFYING AN ADVERSE EVENT
An adverse event is defined as any untoward medical occurrence in a patient or clinical investigation subject administered a pharmaceutical product, which does not necessarily have a causal relationship with the treatment. It may include worsening of a preexisting disease or condition that occurs during the study period and that is outside the normally expected variations (1). Adverse events may also include abnormal or clinically significant results of laboratory tests or medical procedures (see Adverse Event Definitions). It is important to note that the definition of an adverse event does NOT state or imply any relationship to the intervention or investigational product. Therefore, any medical event, regardless of whether the investigator thinks it may have been related to the study intervention, must be collected as an adverse event. The rationale for this is that, at the time of the data collection, it may be extremely difficult to assess the causality of a single, specific event, and it is only through the review of aggregated data, or after a series of apparently unrelated events has occurred, that a relationship becomes apparent. For example, falls are not uncommon in an elderly population, and it may seem that a fall resulting in minor injuries in an older study participant is unremarkable and does not really constitute an adverse event. However, if several subjects in the study seem to be having falls, at a rate higher than would be expected, additional investigation may demonstrate that the investigational product causes dizziness and that the risk of falling is a true safety issue. Although it is often easiest to think of safety assessment in terms of adverse events occurring during a study in which the intervention is a pharmaceutical product, adverse events may also occur in studies in which there is a different type of intervention. For example, emotional distress or nightmares may result from a study in which a new technique for counseling is used in the treatment of survivors of childhood trauma.

3 INFORMATION COLLECTED FOR ADVERSE EVENTS
The first piece of information collected is the adverse event name. The exact term used by the investigator to name the event (e.g., hypertension or diarrhea) is called the verbatim term. Some medical events can be named in many different ways that all refer to the same clinical finding or diagnosis (e.g., hypertension, high blood pressure, elevated blood pressure, increased BP, etc.). For the analysis of adverse events, the verbatim terms will be interpreted through a coding dictionary. The coding dictionary will match each reported verbatim term as closely as possible to the appropriate medical term, which will then be grouped with other like terms that all describe the same clinical event. The dictionary includes a hierarchy of terms, so that the more specific, lower-level terms can then be grouped into disorders of specific
mechanisms, diagnoses, and body systems (see Coding Dictionaries).
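As a toy illustration of this coding step, the sketch below (Python) maps a few reported verbatim terms to a preferred term and a body-system group and then tabulates events by preferred term. The terms and groupings are hypothetical and are not taken from MedDRA or any licensed dictionary.

    # Hypothetical mini-dictionary: verbatim term -> (preferred term, body-system group).
    # Real studies use a licensed coding dictionary such as MedDRA.
    from collections import Counter

    CODING_DICTIONARY = {
        "hypertension": ("Hypertension", "Vascular disorders"),
        "high blood pressure": ("Hypertension", "Vascular disorders"),
        "elevated blood pressure": ("Hypertension", "Vascular disorders"),
        "increased bp": ("Hypertension", "Vascular disorders"),
        "diarrhea": ("Diarrhoea", "Gastrointestinal disorders"),
        "loose stools": ("Diarrhoea", "Gastrointestinal disorders"),
    }

    def code_event(verbatim):
        """Return (preferred term, group) for a reported verbatim term."""
        return CODING_DICTIONARY.get(verbatim.strip().lower(), ("UNCODED", "UNCODED"))

    # Counting events by preferred term rather than by verbatim wording:
    reported = ["High blood pressure", "increased BP", "Diarrhea", "hypertension"]
    print(Counter(code_event(v)[0] for v in reported))  # Hypertension: 3, Diarrhoea: 1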
4 RELATIONSHIP TO THE INTERVENTION
Assessing relationship to the intervention or the investigational product is often one of the more difficult aspects of safety data collection and interpretation. Clinical study subjects often have comorbidities, physical findings related to the disease under study, side effects from other concomitant medications, or even normal variations in health status that complicate the decision about whether a particular adverse event is related to the intervention being studied. Even in early phase studies of healthy volunteers, there are reports of adverse events of alanine aminotransferase elevations in subjects receiving placebo (2), which may be due to the change in daily habits during study participation or to fluctuations in normal laboratory values when subjects are watched more closely than in standard clinical practice. In determining the relationship, there are several questions the investigator must consider. What was the timing of the adverse event (start date/time and stop date/time)? What was the relationship to the administration of the intervention? Depending on the type of investigational product and its pharmacokinetics, this may be related to the actual dosing time, the dosing route (for example, an injection site reaction), a maximal plasma concentration, or cumulative exposure over time. If the intervention or investigational product is dose reduced or withdrawn, does the event improve? If the dose is increased or the intervention is restarted, does the event recur? Are there other comorbid conditions or other medications that may have caused the event? Has this event or this type of event been seen before with this intervention? Is there a reasonable biologic mechanism through which this event could have been caused? After determining the likelihood that the adverse event is related to the intervention, the investigator must choose one of several options that reflects the clinical assessment. The options usually include no relationship,
an unlikely relationship, a possible relationship, a probable relationship, and a definite relationship. The specific wording of the options may vary ("unlikely" is sometimes called "remote"). There are rarely more options for the relationship categories, but there are often fewer; sometimes the options are collapsed into four, three, or even two categories (related/not related). The CDISC standard, which will be standard for all data used to support regulatory approvals by 2010 (3), includes four options (not related/unlikely related/possibly related/related).

5 ASSESSING SEVERITY

The investigator must assess the severity, sometimes called the intensity, of the adverse event. For the interpretation of the data collected, it is important that investigators are using the same guidelines to determine severity, so that an event considered moderate in intensity by one investigator is reasonably similar to the same type of event considered moderate by a different investigator. For this reason, there are many standard guidelines for the assessment of the severity of adverse events. Perhaps the best known of the standard guidelines is the National Cancer Institute's Common Toxicity Criteria (NCI CTC) (4). This guideline, now in Version 3.0 and more extensive with each version, includes hundreds of possible adverse events with brief descriptions of what would be considered appropriate for each grade of severity, from grade 1 through grade 4. In general, grade 1 events are considered mild, grade 2 events are moderate, grade 3 events are severe, and grade 4 events are life-threatening. Any event that results in death is considered grade 5. For example, nausea is grade 1 if it includes loss of appetite but no real change in eating habits, grade 2 if there is decreased oral intake but no weight loss or malnutrition and IV fluids are needed for less than 24 hours, and grade 3 if oral intake is so low that parenteral nutrition or hydration is required for more than 24 hours. Results of laboratory tests are graded as well, usually by the proportional change from the reference range for the laboratory (i.e., 0.5 times
lower limit of the normal range) rather than by absolute numbers. The NCI CTC is well known, extensive, and commonly used, but it is not appropriate for all clinical studies. The NCI CTC is designed for use in clinical studies in oncology, and in other populations (particularly healthy volunteers) the grading of events may be considered too lenient, requiring significant clinical signs and symptoms before an event is considered severe. In the design of a new clinical study, it is necessary to review different grading guidelines to find the one best suited to the study and the study population—or, in some cases, to develop a new grading guideline, particularly if there are safety events that are expected or that are of particular interest. In general, for events for which there is no specific grading guidance, the usual recommendation is to consider the impact on the activities of daily life for the study participant, with a mild event having little or no impact, a moderate event having some impact, and a severe event having substantial impact.
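As a schematic example of grading by proportional change from the reference range, the sketch below (Python) assigns a grade to a laboratory elevation based on its multiple of the upper limit of normal. The cutoffs are invented for illustration and are not the NCI CTC values; a real protocol would take them from its chosen grading guideline.

    # Illustrative only: grade a lab elevation by its multiple of the upper limit
    # of normal (ULN). The cutoffs below are hypothetical, not NCI CTC thresholds.
    def grade_elevation(value, uln, cutoffs=(1.5, 3.0, 5.0, 10.0)):
        """Return 0 (within range) to 4 based on value / ULN and ascending cutoffs."""
        ratio = value / uln
        grade = 0
        for g, cutoff in enumerate(cutoffs, start=1):
            if ratio > cutoff:
                grade = g
        return grade

    # Example: ALT of 180 U/L with a ULN of 40 U/L gives a ratio of 4.5, grade 2 here.
    print(grade_elevation(180, 40))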
6 DETERMINING EXPECTEDNESS
The determination of when a specific event is considered to be ‘‘Expected’’ for a specific investigational product is a formal process. In general, an event is considered expected if events of the same nature and severity are described in the product label (or in the case of investigational products, the Investigator’s Brochure). The formal determination of expectedness is important, as it is a significant factor in the requirements for the reporting of adverse events to regulatory authorities. In general, serious adverse events that are considered expected do not have to be reported in the same expedited manner (within 7–15 days, depending on other factors) to investigators and regulatory agencies. Therefore, the persons who are responsible for maintaining the Investigator’s Brochure usually make the decision about what is considered ‘‘Expected.’’ In addition to considering the documented safety profile for the intervention, the investigator may use an informal interpretation of expectedness also taking into account his or her own experience. The expectedness should
be considered as part of the determination of relationship, but the investigator will not be asked to specify the expectedness of an event.
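The interplay of seriousness, expectedness, and relationship described above is what typically drives expedited reporting decisions. The sketch below (Python) is a simplified illustration of that logic only; the actual criteria, relationship categories, and timelines come from the applicable regulations and the protocol, not from this example.

    # Simplified sketch: flag events that are serious, unexpected, and at least
    # possibly related for expedited reporting. Real rules and timelines are set
    # by the applicable regulations and the protocol.
    RELATED_CATEGORIES = {"possibly related", "related"}  # CDISC-style options

    def needs_expedited_report(serious, expected, relationship):
        """Return True if the event should be handled as an expedited report."""
        return serious and not expected and relationship in RELATED_CATEGORIES

    print(needs_expedited_report(True, False, "possibly related"))  # True
    print(needs_expedited_report(True, True, "related"))            # False (expected)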
7 ACTIONS TAKEN
In addition to collecting the event name, start and stop dates (and sometimes times), the relationship, and the severity, it is also important to know what actions were taken in response to a specific event. Did the dose of the investigational drug have to be decreased, or did dosing have to be interrupted (with possible impact on efficacy)? Did administration of the intervention have to be stopped completely? Did the subject discontinue from the study? Did the subject require additional medications to treat the event, or hospitalization? Although, in general, the more dramatic actions (stopping administration, etc.) may parallel the event severity, this may not always be the case. Adverse events that do not have a significant clinical safety impact may still affect the tolerability of the investigational product, especially if it will be used over a long period of time. Because of the potential effects of tolerability on treatment adherence, this information must be analyzed as part of the overall safety profile.
REFERENCES

1. European Medicines Agency, ICH E2A. Clinical Safety Data Management, Definitions and Guidelines for Expedited Reporting. (March 1995). Available: http://www.fda.gov/cder/guidance/iche2a.pdf, accessed January 2008.
2. Rosenzweig P, Miget N, Broheir S. Transaminase elevation on placebo in Phase 1 clinical trials: prevalence and significance. J. Clin. Pharmacol. 1999; 48: 19–23.
3. Clinical Data Interchange Standards Consortium; mission statement and strategic plan summary. Available: http://www.cdisc.org/about/index.html, accessed January 2008.
4. National Cancer Institute Common Toxicity Criteria, Version 3.0. Available: http://ctep.cancer.gov/reporting/ctc v30.html, accessed January 2008.
FURTHER READING

Guidance for Industry: Adverse Event Reporting—Improving Human Subject Protection. (Issued in draft, April 2007). Available: http://www.fda.gov/cber/gdlns/advreport.pdf, accessed January 2008.
Guidance for Industry: Premarketing Risk Assessment. (Issued March 2005). Available: http://www.fda.gov/cder/guidance/6357fnl.pdf, accessed January 2008.
CROSS-REFERENCES

Coding dictionaries
Regulatory definitions
Safety assessments
Adverse event reporting system
ADVERSE EVENT REPORT SYSTEM (AERS)
The Adverse Event Reporting System (AERS) is a computerized information database designed to support the U.S. Food and Drug Administration's (FDA) postmarketing safety surveillance program for all approved drug and therapeutic biologic products. The ultimate goal of the AERS is to improve the public health by providing the best available tools for storing and analyzing safety reports. The FDA receives adverse drug reaction reports from manufacturers as required by regulation. Health-care professionals and consumers send reports voluntarily through the MedWatch program. These reports become part of a database. The structure of this database is in compliance with the International Conference on Harmonisation (ICH) international safety reporting guidelines (ICH E2B). The guidelines describe the content and format for the electronic submission of reports from manufacturers. The FDA codes all reported adverse events using the standardized international terminology of MedDRA (the Medical Dictionary for Regulatory Activities). Among the AERS system features are the on-screen review of reports, searching tools, and various output reports. The FDA staff use AERS reports in conducting postmarketing drug surveillance and compliance activities and in responding to outside requests for information. The AERS reports are evaluated by clinical reviewers in the Center for Drug Evaluation and Research (CDER) and the Center for Biologics Evaluation and Research (CBER) to detect safety signals and to monitor drug safety. They form the basis for further epidemiologic studies when appropriate. As a result, the FDA may take regulatory actions to improve product safety and protect the public health, such as updating a product's labeling information, sending out a "Dear Health Care Professional" letter, or reevaluating an approval decision.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/aers/default.htm) by Ralph D'Agostino and Sarah Karl.
ADVISORY COMMITTEES

Advisory committees provide the U.S. Food and Drug Administration (FDA) with independent advice from outside experts on issues related to human and veterinary drugs, biological products, medical devices, and food. In general, advisory committees include a Chair, several members, plus a consumer, industry, and sometimes a patient representative. Additional experts with special knowledge may be added for individual meetings as needed. Although the committees provide advice to the FDA, final decisions are made by the Agency. Nominations for scientific members, consumer, industry, and patient representatives originate from professional societies, industry, consumer and patient advocacy groups, the individuals themselves, or other interested persons. Candidates are asked to provide detailed information regarding financial holdings, employment, research grants and contracts, and other potential conflicts of interest that may preclude membership. Persons nominated as scientific members must be technically qualified experts in their field (e.g., clinical medicine, engineering, biological and physical sciences, biostatistics, and food sciences) and have experience interpreting complex data. Candidates must be able to analyze detailed scientific data and understand its public health significance.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/oc/advisory/vacancies/ acvacfaq.html) by Ralph D’Agostino and Sarah Karl.
AIDS CLINICAL TRIALS GROUP (ACTG)
JANET W. ANDERSEN
ACTG Statistics and Data Management Center, Harvard School of Public Health, Boston, Massachusetts

This article provides a brief overview of HIV and AIDS, a short history of the founding of the ACTG, an overview of the structure of the ACTG, and sections that outline the major areas covered by the agenda of the ACTG, including clinical trial design issues that developed in the context of these studies and the science that was developed to answer this agenda. The ACTG website (www.aactg.org) provides brief overviews of Group studies open to enrollment, information about the ACTG and clinical trials in general, and links to a vast array of sites with information about HIV and AIDS.

1 A BRIEF PRIMER ON HIV/AIDS
Acquired immunodeficiency syndrome (AIDS) is a collapse of the immune system caused by infection with the human immunodeficiency virus (HIV). HIV is a retrovirus; a retrovirus is a virus whose genetic material is RNA that is read into DNA (or reverse transcribed to DNA) when the virus enters the target cell. This DNA incorporates itself into the DNA in the host cell's nucleus, and it is then replicated to produce new retrovirus particles. HIV is passed from person to person primarily by direct blood transfer from transfusions or needles, by unprotected sexual activity, or from a mother to her child at birth or through breast-feeding. HIV's target cell in humans is the CD4 or helper T-cell. CD4 cells are the part of the blood's lymphocytes, or white blood cells, that are responsible for organizing and arousing (activating) the other cells of the immune system to produce antibodies against an invader or to attack a foreign cell directly. HIV infection of the CD4 cells results in their destruction, either directly by the virus or by the immune system recognizing that a CD4 cell is infected and destroying it. After a burst of viral replication that occurs immediately after people are initially infected, viral loads drop to a lower "set point." The effects on the immune system are gradual, so individuals often are unaware they are infected for years. If left untreated, the disruption of the immune system by HIV infection is catastrophic. It leaves people vulnerable to a long list of opportunistic infections (OIs), wasting, neurologic damage, and cancers that people with an intact immune system do not normally have. When this state is reached, people are considered to have AIDS. AIDS was initially identified in 1981, when a small but growing number of individuals in cities in the United States and Europe were diagnosed with unusual and rapidly fatal infections and cancers. Once the syndrome was described and a blood test for antibody to the virus became available in 1983, it became apparent that the infection already existed worldwide at pandemic proportions. In 2005, it was estimated that 33.4 to 46.0 million people were living with HIV worldwide, 3.4–6.2 million became newly infected, and 2.4–3.3 million people had already died from HIV/AIDS (1).
2 ACTG OVERVIEW
2.1 History

In 1986, the National Institute of Allergy and Infectious Diseases (NIAID) established the Division of AIDS (DAIDS). In that same year, NIAID established the AIDS Treatment Evaluation Units, and, in 1987, it established the AIDS Clinical Studies Groups to conduct studies of treatment for AIDS and HIV. In 1988, the AIDS Clinical Trials Group (ACTG) was formed by the merger of these two groups. Two original units had a primary focus on the treatment of children with HIV. This component of the ACTG agenda grew as the Network grew. In 1997, NIAID funded the Pediatric AIDS Clinical Trials Group (PACTG), constituted from the sites in the ACTG that specialized in studies of pediatric HIV and prevention of mother-to-child transmission (MTCT). At that time, the
ACTG was renamed the Adult AIDS Clinical Trials Group (AACTG) to reflect the change in its focus (2). In 2006, the NIAID and DAIDS initiated a wide-ranging reorganization of the AIDS Networks in the United States including the vaccine and prevention networks as well as those like the AACTG that focused on treatment of the disease. In the grant period between 1999 and 2006, the AACTG had increased its agenda to include international membership engaged in both common and country-specific clinical research. In the reorganization of the networks, the AACTG defined and expanded this international effort, restructured its scientific committees, and integrated the Oral HIV/AIDS Research Alliance (OHARA) that studies oral manifestations and complications of HIV. To reflect this revised and refocused agenda, the AACTG reassumed the name ‘‘AIDS Clinical Trials Group’’ and the acronym, ACTG. This overview will refer to the Group as the ACTG throughout.
2.2 ACTG Structure

The infrastructure required for the success of a network of the size of the ACTG is massive, interlocking, and complex. The general structure of the ACTG has been very stable over time, consisting of the following:

• The Group Chair, Vice-Chair, and Executive Committee. The ACTG Executive Committee (AEC) is constituted of Group leadership and representatives from the Scientific Committees, community, sites, Operations Center, Statistical and Data Management Center (SDMC), and DAIDS. Subcommittees of the AEC include one that evaluates the performance of the Group and its sites, and the Scientific Agenda Steering Committee that coordinates and oversees the scientific agenda of the Group. The Principal Investigators of each site are an essential component of the leadership of the Group.

• The Operations Center, sited at Social and Scientific Systems (Silver Spring, MD). This center provides administrative support and coordination for leadership and scientific activities, protocol development and implementation, site establishment, regulatory activities, and contract management.

• The SDMC, constituted of the Statistical and Data Analysis Center (SDAC), sited at the Harvard School of Public Health (Boston, MA), and the Data Management Center (DMC), sited at the Frontier Science Technology and Research Foundation, Inc. (Amherst, NY). SDAC members provide leadership and expertise in protocol design and implementation, and they prepare both interim and final analyses of studies. SDAC members also conduct methodological research essential to HIV studies. DMC members design the study forms, monitor data and specimen submission and quality, conduct training for sites, and work closely with sites on data issues.

• The Network Laboratory Committee, constituted of laboratories studying virology (quantification of HIV and other viruses; testing of in vitro and genetic drug resistance), immunology (measurement of cytokines, lymphocyte proliferation, and cellular immune responses), and pharmacology (measurement of parameters of drug distribution and elimination and drug bioavailability). The laboratories also develop innovative assays and methodology in this changing field.

• The Network Community Advisory Board and the community at large, which bring community perspectives and input to the Group's agenda as a whole and to issues related to specific research areas and protocols. Community representatives are members of ACTG leadership groups as well as of all protocol teams.

• The sites, site investigators and staff, and subjects, who are the backbone of the research.
•
leadership and scientific activities, protocol development and implementation, site establishment, regulatory activities, and contract management. The SDMC that is constituted of the Statistical and Data Analysis Center (SDAC) sited at the Harvard School of Public Health (Boston, MA) and the Data Management Center (DMC) sited at the Frontier Science Technology and Research Foundation, Inc. (Amherst, NY). SDAC members provide leadership and expertise in protocol design and implementation, and they prepare both interim and final analyses of studies. SDAC members also conduct methodological research essential to HIV studies. DMC members design the study forms, monitor data and specimen submission and quality, conduct training for sites, and work closely with sites on data issues. The Network Laboratory Committee that is constituted of laboratories studying virology (quantification of HIV and other viruses; testing of in vitro and genetic drug resistance), immunology (measurement of cytokines, lymphocyte proliferation, and cellular immune responses), and pharmacology (measurement of parameters of drug distribution and elimination and drug bioavailability). The laboratories also develop innovative assays and methodology in this changing field. The Network Community Advisory Board and the community at large bring community perspectives and input to the Group’s agenda as a whole and to issues related to specific research areas and protocols. Community representatives are members of ACTG leadership groups as well as of all protocol teams. The sites, site investigators and staff, and subjects who are the backbone of the research.
3 ACTG SCIENTIFIC ACTIVITIES
The scientific agenda of the ACTG is carried out through its scientific committees. It is impossible to detail the activity of the ACTG since its inception in this brief overview. A few studies are mentioned to illustrate the nature and breadth of the integrated research conducted by the Group. References are provided where results are given but not for every study number cited. A search engine such as PubMed (3) can be used to identify ACTG publications by topic and/or study number. Information on current studies is available on the ACTG website.
4 DEVELOPMENT OF POTENT ANTIRETROVIRAL THERAPY (ART)
Treatment of HIV ultimately depends on interrupting components of the retroviral lifecycle. Thus far, no treatment has been shown to eradicate the virus or cure progressive HIV disease. However, the development of drugs and combination ART has resulted in prolonged survival and profound improvements in quality of life when drugs are taken as prescribed. The numbers and types of drugs available have mirrored the increasing understanding of the structure and functions of the virus and its components. Although the ACTG is active in early drug development, an essential role of a large, multicenter network is the ability to complete large comparative (i.e., Phase III and IV) studies quickly and to make the results about optimal treatments available to the clinical community. Once HIV was identified as a retrovirus, nucleosides such as zidovudine (ZDV), didanosine (ddI), lamivudine (3TC), and zalcitabine (ddC) were developed that interfere with the reverse transcription of the viral RNA into DNA. This class of drugs is called nucleoside reverse transcriptase inhibitors (NRTIs). The ACTG conducted pivotal studies that led to the approval of these drugs as monotherapy and also identified the emergence of drug resistance that limits long-term efficacy. To illustrate the scope of the epidemic, ACTG 019, which tested ZDV versus placebo, enrolled 3222 subjects in 2 years. ACTG 175 (4) was a randomized, double-blind, placebo-controlled study of ZDV, ddI, ZDV + ddI, or ZDV + ddC, which demonstrated the superiority of combination NRTIs over monotherapy. Numerous studies investigated other NRTIs and combinations.
A second class of drugs, the nonnucleoside reverse transcriptase inhibitors (NNRTIs) such as nevirapine (NVP) and efavirenz (EFV), was then developed. ACTG 241 (5) was the first study of a 2-class, 3-drug regimen (ZDV, ddI, and NVP), and it showed the superiority of multiclass therapy. ACTG 364 (6) extensively evaluated EFV + NFV + 2 NRTIs, which showed this arm to be superior to three-drug regimens in subjects with prior NRTI experience. For HIV to replicate, a long protein must be cleaved by an enzyme (a protease) encoded by its genes. The protease inhibitors (PIs) were the next class of effective drugs to be developed. ACTG 320 (7) was a turning point in HIV research. 3TC or 3TC plus the PI indinavir (IDV) were added to ZDV in ZDV-experienced subjects. A data safety monitoring board (DSMB) halted the study early because of the superiority of the IDV arm in preventing AIDS-defining events (opportunistic infections [OIs], cancer, etc.) or death. Data and experience from this trial, which followed both clinical events and HIV viral loads with a newly developed assay, also provided convincing evidence that quantitative measurement of the virus was a highly relevant and reliable early surrogate endpoint. Subsequent studies investigated increasingly active combination therapy. For example, ACTG 384 was a partially blinded, randomized, 2 × 3 factorial design in which the first factor was two different NRTI backbones, and the second factor was the use of nelfinavir (NFV, a PI) or EFV or both. This study specified second-line regimens to be used if the initial regimen failed; primary endpoints included both the results of the initial regimen and comparisons of consecutive treatments (8). This study demonstrated the superiority of one of the six arms (ZDV/3TC + EFV), that ZDV/3TC in combinations was more effective and safer than ddI + d4T, and that a four-drug regimen was not better than the three-drug regimen (9, 10). These findings led to changes in U.S. ART guidelines. Some Phase III studies evaluate the equivalence or noninferiority of new drugs or drug combinations compared with standard regimens. A5095 investigated three PI-sparing regimens hoping to preserve the class for those who failed first-line therapy. It compared a control arm of 3 NRTIs,
including abacavir (ABC), with two arms that contained NRTIs + EFV. The primary aims were to demonstrate the noninferiority of ABC/3TC/ZDV to 3TC/ZDV/EFV, the superiority of ABC/3TC/ZDV/EFV to 3TC/ZDV/EFV, and the superiority of ABC/3TC/ZDV/EFV to ABC/3TC/ZDV. At a prespecified review, and based on the statistical design that included guidelines for early evaluation of the noninferiority endpoint (11), the DSMB recommended discontinuation of the three-NRTI arm because of inferior results and suggested continuation of the EFV-containing arms (12). This study illustrates numerous issues involved in the design and management of equivalence and noninferiority studies, and it is an example of efficient study design. As the number of drugs in all classes and the certainty of HIV control in a high proportion of ART-naïve subjects increased, tablets that combine highly active drugs were formulated, including a "one pill, once a day" regimen. A5175 is a three-arm noninferiority study comparing a once-daily PI-containing regimen with once-daily and twice-daily PI-sparing regimens, primarily targeted for resource-limited settings (refrigeration not required) and including sites in Africa, India, South America, and Thailand, although sites in the United States are also participating. A5202 is a four-arm equivalence study of once-daily regimens. Although the arms form a factorial design (PI vs. NNRTI as one factor, and the NRTI backbone as the other), the sample size is fully powered for between-arm equivalence comparisons to guard against a between-factor interaction that could limit the interpretability of the results. Studies conducted internationally must consider numerous other factors, such as differences in host and viral genomics; socioeconomic issues, including communication, transportation, and laboratory capabilities; malnutrition; the social stigma of HIV infection; and co-infections such as tuberculosis (TB) and malaria.
4.0.1 Side Effects.
No regimen is without adverse effects. Some side effects of ART, such as nausea, diarrhea, reduced blood counts, and impaired liver and kidney function, can be managed with dose and
drug changes as well as supportive care. Others are more complex, including ART-related neuropathy and metabolic complications, such as disfiguring lipoatrophy and fat redistribution, mitochondrial toxicity, and extremely high cholesterol levels. The first challenges for the ACTG were developing definitions of these new conditions and understanding their etiologies (13). In fact, some of the first investigations involved establishing whether the metabolic complications were a direct effect of ART or a result of dysregulation from the longer-term survival with HIV that ART was making possible. The ACTG has conducted numerous studies to investigate the underlying cause of these conditions and also to prevent and treat them. The use of substudies, which are intensive investigations on a subset of subjects in a large study, is very powerful. For example, A5097s investigated the neuropsychological effects of EFV in A5095 (14), and early effects were found, such as "bad dreams" during the first week of therapy.
4.0.2 Viral Load Data, Complex Endpoints.
Both the development of sensitive assays to measure the HIV viral load and the advent of fully suppressive ART resulted in the need for innovative methods for study design and analysis. All tests for viral loads as well as many other markers have a lower limit of reliable quantification; a result below this limit does not mean the virus has been eliminated. SDAC members have been at the forefront of developing methods for such "left-censored" data (15), combined correlated endpoints (16), and the use of surrogate endpoints in clinical trials (17).
4.0.3 Drug Resistance.
HIV replication is highly error prone, which results in a high rate of viral mutation. A change in just one base in HIV's RNA is sufficient to confer resistance not only to a specific drug but also to other drugs from the same class. ACTG researchers have provided breakthroughs in understanding, identifying, and quantifying drug resistance, including extensive modeling to uncover complex patterns of mutations that affect viral control and the use of predicted drug susceptibility to guide regimen choice. Salvage studies such as A5241
and treatment options after virologic failure included in front-line studies like A5095 base treatment assignment on the drug-resistance profile of an individual's virus. Additionally, the possibility of resistance developing in the mother's virus from brief treatment with ART during pregnancy to prevent MTCT, and the implications for the future treatment of the mother, are considered in several studies (A5207, A5208).
4.0.4 ALLRT.
As survival with HIV has increased, long-term follow-up of subjects with standardized data and sample collection beyond participation in an individual study became vital. The ACTG mounted the ALLRT (AIDS Longitudinally Linked Randomized Trials, or A5001) in 2000 with an accrual target of at least 4500 enrollees. Subjects in targeted, large, randomized ART studies are followed on the ALLRT during and far beyond participation in their initial study. Data available from any concurrently enrolled ACTG study are merged with data collected specifically for the ALLRT, which results in a complete longitudinal database. Numerous analyses that involve long-term response, immunologic recovery, quality of life, complications of HIV and therapy, co-infection, and epidemiologic design have resulted or are underway.
4.1 Drug Development, Pharmacology (PK), and Pharmacogenomics
A large component of the ACTG's activities is the initial evaluation of new therapies for HIV, from the earliest days of ddI to intensive investigations of novel modalities such as entry inhibitors that target one specific molecule on the surface of the CD4 cell (A5211). Some of the earliest stages of drug development involve establishing the pharmacology of a drug: How fast is it absorbed and eliminated? Even brief monotherapy quickly results in viral drug resistance as noted earlier, so studies of individual drugs are performed in HIV-uninfected volunteers. Subjects with HIV are taking at least three antiretrovirals in their ART. Thus, any PK study in HIV-infected subjects is ultimately a study of drug–drug interactions, either comparing the PK of the target drug between
monotherapy in HIV-uninfected volunteers and when added to ART in HIV-infected subjects, or comparing the PK of one or more of the ART drugs before and after receipt of the new agent. These studies are essential because one drug can cause another either to be ineffective or to reach toxic levels at normal dosing. Subjects with HIV also receive numerous medications to prevent and treat AIDS-defining illnesses and side effects of ART. Thus, the ACTG also conducts essential studies of interactions between antiretrovirals and other medications. For example, simvastatin is often prescribed to combat ART-related dyslipidemia. A5047 (18) provided crucial evidence of a life-threatening increase in simvastatin levels when given with ritonavir (RTV) plus saquinavir, a PI, whereas A5108 (19) established that it was safe when given with EFV. In addition to population-based analyses, determining the genetic basis of drug metabolism and other biologic systems is essential both to individualize treatment and to explain and ultimately predict outcomes. In 2002, the ACTG launched a unique study, called A5128 (20), to collect informed consent and samples specifically for human genetic data to be linked to clinical data in a highly confidential manner. An example of the power of both A5128 and the interrelationship of ACTG studies is a merge of data from A5128, A5095, and A5097s that determined that a specific allelic variant in a liver enzyme gene that is more common in African Americans is associated with much greater EFV exposure (21). The ACTG conducts some classic dose-escalating Phase I studies, but it has performed numerous studies that test the efficacy of new drugs or, perhaps more importantly, identify inefficacious or problematic drugs and combinations. ACTG 290 and 298 confirmed laboratory findings that ZDV and d4T are antagonistic, which led to ART guidelines that they not be used together. ACTG 307 and A5025 demonstrated that hydroxyurea was inefficacious. Additionally, the latter also showed that the risk of pancreatitis was increased when ddI and d4T were given together, which led to the recommendation not to combine them. Phase II
evaluations of strategies for multidrug-class-experienced subjects led to studies of "class-sparing" front-line regimens, and research on studies such as ACTG 398 demonstrated the importance of "minority" variants in the HIV genome to drug resistance.
4.2 Immunology, Pathogenesis, and Translational Research
Ultimately, central issues in HIV infection are the failure of the individual's immunity to mount an effective defense when the person first becomes infected and the inability of the immune system to clear the virus over time. The ACTG's research in this area includes investigations in immunity and its regulation, immune-based therapies to enhance anti-HIV immunity after infection, immune restoration with ART, and special issues such as reservoirs where virus can lie latent and unique challenges and opportunities in specific populations.
4.2.1 HIV-Specific Immunity.
It was not clear whether the inability of the immune system to recognize and destroy HIV was caused solely by the progressive loss of both naïve (unprimed) and memory (lymphocytes that remain after an immune response) antigen-reactive subsets of the CD4 cells, or whether wider intrinsic or HIV-related defects occurred in cellular immunity. Clinical and basic research inform each other, and both were critical in understanding the interaction of complex biological systems involved in the immune dysregulation in this disease. Studies in the pre-ART era (e.g., ACTG 137 and 148) found some immunological effect but no clinical benefit of vaccination with proteins derived from the virus' outer coating, a strategy effective in other diseases. Once CD4 counts were being restored with ART, the logical step was to investigate strategies to boost HIV-specific immunity as an adjunct to the initiation of ART and in synergy with successful ART. This strategy required the ACTG to pursue innovative basic research in the underlying biological mechanisms of cellular immunity; the nature of the defects, which include both those caused by the virus and those lingering despite viral suppression; and the level
of immunocompetency required for any given immune-based strategy to be successful. This research also required an intensive effort in the development of specialized laboratories, standardization and validation of research assays, and development of a central specimen repository to ensure proper storage and that samples could be retrieved long after study closure.
4.2.2 Immune Restoration.
Once reliable suppressive ART had been defined, it became important to study whether the immunodeficiency caused by HIV is reversible and what factors are associated with the degree and pace of reconstitution. ACTG studies 315 and 375 followed subjects over 6 years of successful ART, finding that most failed to restore normal CD4 counts and function (22). Intensive studies of clinical (A5014, A5015), immunologic (DACS 094, DACS 202), and human genetic (NWCS 233) parameters could not identify strong predictors of who would and would not achieve immunologic reconstitution among subjects who had sustained virologic control on ART. Studies such as ACTG 889 found that recovery of immunity to pathogens other than HIV is also incomplete, which indicated not only that patients might still be susceptible to OIs despite a CD4 increase, but also that the effectiveness of anti-HIV immune therapy might need to be investigated in populations with a relatively intact immune system. Although consistent predictors of immunologic reconstitution have not yet been identified, the Group has produced extensive cutting-edge research in immune activation and modulation, CD4 cell subsets, pathogen-specific immunity, thymic activity, and HIV-1 pathogenesis.
4.2.3 Immune-Based Therapy.
Because HIV control alone is insufficient to restore a normal immune system, the ACTG has investigated immune-based treatments as adjuncts to ART, including immunomodulators (e.g., growth factors, cyclosporin A, IL-2, and IL-12) and therapeutic vaccination with more general polyvalent peptide vaccines (A5176) and also those based on viral vectors (A5197). Future studies might include infusions of autologous cells primed ex vivo
with autologous HIV RNA. The "analytical treatment interruption" (ATI) is a unique and powerful tool for evaluating immunomodulation of HIV. After establishment of sustained viral suppression, and with careful criteria for CD4 levels, all antiretrovirals (ARVs) are discontinued. HIV viral rebound is closely monitored, and ART is restarted if rebound is rapid or the CD4 count drops. Successful immunomodulation provides a substantial period of time with minimal viral replication and with sustained CD4 counts. ATI provides a rapid readout to test vaccines and other immune-based modalities.
4.2.4 Latent Reservoirs, Special Populations.
Understanding HIV has required innovative research into underlying differences between populations. An intensive study (A5015) included measurement of thymic volume as well as serial quantification of numerous T-cell and serum markers of immune function. Subjects were 13–30 years of age or at least 45 years old. Not only did the study identify age-associated differences in immune restoration with ART, but numerous new findings were also made in the biology of immune restoration. Genomic research is covered elsewhere, but it is an important component of understanding subpopulations. PK studies in women consider interactions between ARVs and contraceptives, and other studies evaluate other aspects of gender differences (e.g., do women and men with the same body mass index have different PK?). HIV can lie latent in resting cells, and several intensive studies consider viral reservoirs other than the blood, such as lymph nodes, cerebrospinal fluid, mucosal tissue, and the genital tracts of men and women, to understand how to eradicate the virus completely and prevent resurgence.
4.3 OIs and Other AIDS-Related Complications
As noted above, the immune system collapse caused by prolonged infection with HIV makes individuals susceptible to a long list of diseases that are very rare in people with an intact immune system and that often have an unusually rapid course with AIDS. The Centers for Disease Control, the World
Health Organization, and other groups have developed a list of AIDS-defining conditions (23). These conditions include the following: fungal infections (candidiasis, Pneumocystis carinii pneumonia, and cryptococcosis); bacterial infections [Mycobacterium avium complex (MAC) and TB]; viral infections (cytomegalovirus); wasting; neurological complications [progressive multifocal leukoencephalopathy (PML), dementia, and peripheral neuropathy (PN)]; and cancers [Kaposi's sarcoma (KS), lymphomas, and cervical cancer].
4.3.1 OIs.
Before effective ART was developed, numerous trials in the ACTG were performed to study prevention and treatment of these life-threatening diseases and conditions. An example of the prevalence of these unusual conditions in AIDS is ACTG study 196, which tested treatment for MAC and enrolled over 1200 subjects in 9 months from late 1993 to early 1994. The role of the ACTG in drug development is evident in sequences like ACTG 196, which demonstrated the superiority of clarithromycin over standard therapy (24), followed by ACTG 223, which demonstrated the superiority of clarithromycin plus two other drugs (25). Similar sequences of studies in other OIs led to enhanced quality of life and prolonged survival for subjects awaiting successful drugs for HIV. The development of highly effective ART in the late 1990s produced a paradigm shift. Several OI studies were halted or completely refocused when it became apparent that immune reconstitution with ART dramatically reduced OI incidence. For example, a planned interim analysis of ACTG 362, a large three-arm study of prevention of MAC, found only two events when many were anticipated. The study was continued for extended follow-up to focus on cardiovascular, metabolic, and neurologic changes and the rates of AIDS-defining events in these subjects who once had a very low CD4 count (<50 cells/mm3), and it has produced numerous important findings. More recent studies examine the timing of ART, for example, whether immediate or deferred ART is beneficial in HIV-infected people with very low CD4 counts who present with a life-threatening OI (A5164).
4.3.2 Cancer.
Uncontrolled infection with HIV causes a 100-fold increase in cancers like KS and non-Hodgkin's lymphoma. KS is a rare tumor of the skin and connective tissue previously observed in elderly men of Mediterranean descent. Between 1987 and 1998, the ACTG enrolled subjects in 15 KS studies, generally evaluating conventional cancer chemotherapy in unconventionally low doses in these patients, who were debilitated from wasting, OIs, and a failed immune system. Oncology groups were also conducting trials in AIDS-related cancers. The National Cancer Institute funded the AIDS Malignancy Consortium (26) in 1995 to consolidate this effort. Since then, the ACTG has studied cancers involving human papillomavirus (HPV), including cervical cancer, and, more recently, has conducted studies of vaccines to prevent HPV infection in HIV-infected individuals.
4.3.3 Neurology.
The neurological effects of HIV are complicated and include effects of the virus itself (dementia, PN), OIs such as PML and toxoplasmosis, and drug-related toxicity. As with other complications, early studies examined the use of conventional treatment in severely debilitated and immune-compromised patients. It was also important to ascertain whether conditions such as dementia and PN were similar to those observed in other settings (e.g., Alzheimer's disease and diabetes, respectively). Nearly all studies have both treatment and developmental components. Diagnosis, confirmation of response, and research in basic biology in this setting are complicated and often invasive, requiring specialized scans and samples of cerebrospinal fluid (CSF), nerves, muscles, and the brain. The need for special expertise in all these areas led to the founding of the Neurologic AIDS Research Consortium (NARC) (27) in 1993. The ACTG and NARC work closely in joint studies that combine therapeutics with extensive scientific investigations. For example, despite being a negative study with respect to treatment, ACTG 243 definitively established that measurement of JC polyomavirus in the CSF can be used for diagnosis of PML, making brain biopsy unnecessary (28). ACTG 700, which was a substudy of ACTG 301 showing that a drug active in Alzheimer's disease did not have the same activity in HIV cognitive
impairment, extensively evaluated the use of brain magnetic resonance spectroscopy in biopsy-free diagnosis and in research in complex brain metabolism and neuronal activity (29).
4.3.4 Co-Infections.
With prolonged survival with ART, co-infection with hepatitis C virus (HCV) and hepatitis B virus (HBV) has emerged as a significant problem in people with HIV. Evidence suggests that hepatitis progression to liver failure or cancer is faster with HIV, and ART is more difficult to manage in those with poor liver function. The primary treatment for HCV in monoinfection (HCV alone) is various forms of interferon, a drug that affects the immune system. A5071 (30) demonstrated the superiority of the pegylated form of interferon (PEG) in HCV/HIV coinfection, but it also showed that the durability of viral suppression is far lower than in HCV monoinfection. A5178 tests weight-adjusted dosing to improve viral control and the use of PEG in those without viral suppression. Biopsy, which is invasive and expensive, is the "gold standard" for assessing liver damage (fibrosis). Several ancillary studies associated with the Group's hepatitis protocols examine new markers for fibrosis and statistically model panels of standard and new clinical parameters to develop alternatives to biopsy. International ACTG studies are evaluating HBV in Africa, where the disease is more common than in the United States. Co-infections with other pathogens such as TB and malaria occur at high rates in some countries. The standard regimen for TB is generally made up of four drugs. Subjects who initiate treatment for TB and HIV together are starting seven to eight drugs they may never have taken before, each with side effects and with the complexity of managing all the dosing. A5221 is an international study that evaluates immediate versus delayed ART start in subjects who initiate treatment for TB, to test whether a phased start is superior to starting all the drugs together. This setting is an example of the difficulty of managing HIV in areas with high rates of other endemic serious diseases.
5 EXPERT SYSTEMS AND INFRASTRUCTURE
The infrastructure required for the success of a group of this size is massive and interlocking. Systems had to be developed de novo. For example, the DMC developed the Laboratory Data Management System, which is a system that tracks a sample from subject to freezer to repository to assay lab and transfers the sample's data from the assay lab to the database. Expert systems have been developed by the SDMC to enable web-based transfer of viral genomic data and provide rapid sequence alignment and quality assurance. Both of these systems are used by other networks. The Operations Center and DMC have moved the ACTG to a "paperless" system, in which essential documents are downloaded from secure websites. Large contracts and regulatory groups ensure seamless study conduct. Training of investigators and site personnel is a constant focus, possibly all the more important as international sites without prior clinical trial experience join the ACTG.
REFERENCES
1. Joint United Nations Programme on HIV/AIDS. AIDS epidemic update, 2005. (data.unaids.org/pub/GlobalReport/2006/2006 GR ch02 en.pdf)
2. aidshistory.nih.gov/timeline/index.html.
3. PubMed. www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed.
4. Hammer SM, Katzenstein D, Hughes MD, Gundacker H, Schooley RT, Haubrich RH, Henry WK, Lederman ML, Phair JP, Niu M, Hirsch MS, Merigan TC. A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. N. Engl. J. Med. 1996; 335: 1081–1090.
5. D'Aquila RT, Hughes MD, Johnson VA, Fischl MA, Sommadossi JP, Liou SH, Timpone J, Myers M, Basgoz N, Niu M, Hirsch MS. Nevirapine, zidovudine and didanosine compared with zidovudine and didanosine in patients with HIV-1 infection. Ann. Intern. Med. 1996; 124: 1019–1030.
6. Albrecht MA, Bosch RJ, Hammer SM, Liou SH, Kessler H, Para MF, Eron J, Valdez H, Dehlinger M, Katzenstein DA. Nelfinavir, efavirenz or both after the failure of nucleoside treatment of HIV infection. N. Engl. J. Med. 2001; 345: 398–407.
7. Hammer SM, Squires KE, Hughes MD, Grimes JM, Demeter LM, Currier JS, Eron JJ, Feinberg JE, Balfour HH, Deyton LR, Chodakewitz JA, Fischl MA. A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and CD4 cell counts of 200 per cubic millimeter or less. N. Engl. J. Med. 1997; 337: 725–733.
8. Smeaton LM, DeGruttola V, Robbins GK, Shafer RW. ACTG (AIDS Clinical Trials Group) 384: a strategy trial comparing consecutive treatments for HIV-1. Control. Clin. Trials 2001; 22: 142–159.
9. Robbins GK, De Gruttola V, Shafer RW, Smeaton LM, Snyder SW, Pettinelli C, Dubé MP, Fischl MA, Pollard RB, Delapenha R, Gedeon L, van der Horst C, Murphy RL, Becker MI, D'Aquila RT, Vella S, Merigan TC, Hirsch MS; AIDS Clinical Trials Group 384 Team. Comparison of sequential three-drug regimens as initial therapy for HIV-1 infection. N. Engl. J. Med. 2003; 349: 2293–2303.
10. Shafer RW, Smeaton LM, Robbins GK, De Gruttola V, Snyder SW, D'Aquila RT, Johnson VA, Morse GD, Nokta MA, Martinez AI, Gripshover BM, Kaul P, Haubrich R, Swingle M, McCarty SD, Vella S, Hirsch MS, Merigan TC; AIDS Clinical Trials Group 384 Team. Comparison of four-drug regimens and pairs of sequential three-drug regimens as initial therapy for HIV-1 infection. N. Engl. J. Med. 2003; 349: 2304–2315.
11. Ribaudo HR, Kuritzkes DR, Schackman BR, Acosta EP, Shikuma CM, Gulick RM. Design issues in initial HIV-treatment trials: focus on ACTG A5095. Antivir. Ther. 2006; 11: 751–760.
12. Gulick RM, Ribaudo HJ, Shikuma CM, Lustgarten S, Squires KE, Meyer WA 3rd, Acosta EP, Schackman BR, Pilcher CD, Murphy RL, Maher WE, Witt MD, Reichman RC, Snyder S, Klingman KL, Kuritzkes DR. Triple-nucleoside regimens versus efavirenz-containing regimens for the initial treatment of HIV infection. N. Engl. J. Med. 2004; 350: 1850–1861.
13. Wohl DA, McComsey G, Tebas P, Brown TT, Glesby MJ, Reeds D, Shikuma C, Mulligan K, Dube M, Wininger D, Huang J, Revuelta M, Currier J, Swindells S, Fichtenbaum C, Basar M, Tungsiripat M, Meyer W, Weihe J, Wanke C. Current concepts in the diagnosis and management of metabolic complications of HIV infection and its therapy. Clin. Infect. Dis. 2006; 43: 645–653.
14. Clifford DB, Evans S, Yang Y, Acosta EP, Goodkin K, Tashima K, Simpson D, Dorfman D, Ribaudo H, Gulick RM. Impact of efavirenz on neuropsychological performance and symptoms in HIV-infected individuals. Ann. Intern. Med. 2005; 143: 714–721.
15. Hughes M. Analysis and design issues for studies using censored biomarker measurements, with an example of viral load measurements in HIV clinical trials. Stat. Med. 2000; 19: 3171–3191.
16. DiRienzo AG, De Gruttola V. Design and analysis of clinical trials with a bivariate failure time endpoint, with application to AIDS Clinical Trials Group Study A5142. Control. Clin. Trials 2003; 24: 122–134.
17. Hughes MD. Evaluating surrogate endpoints. Control. Clin. Trials 2002; 23: 703–707.
18. Fichtenbaum CJ, Gerber JG, Rosenkranz SL, Segal Y, Aberg JA, Blaschke T, Alston B, Fang F, Kosel B, Aweeka F. Pharmacokinetic interactions between protease inhibitors and statins in HIV seronegative volunteers: ACTG Study A5047. AIDS 2002; 16: 567–577.
19. Gerber JG, Rosenkranz SL, Fichtenbaum CJ, Vega JM, Yang A, Alston BL, Brobst SW, Segal Y, Aberg JA; AIDS Clinical Trials Group A5108 Team. Effects of efavirenz on the pharmacokinetics of simvastatin, atorvastatin, and pravastatin: results of AIDS Clinical Trials Group 5108 Study. J. Acquir. Immune Defic. Syndr. 2005; 39: 307–312.
20. Haas DW, Wilkinson GR, Kuritzkes DR, Richman DD, Nicotera J, Mahon LF, Sutcliffe C, Siminski S, Andersen J, Coughlin K, Clayton EW, Haines J, Marshak A, Saag M, Lawrence J, Gustavson J, Bennett J, Christensen R, Matula MA, Wood AJJ. A multi-investigator/institutional DNA bank for AIDS-related human genetic studies: AACTG protocol A5128. HIV Clin. Trials 2003; 4: 287–300.
21. Haas DW, Ribaudo HJ, Kim RB, Tierney C, Wilkinson GR, Gulick RM, Clifford DB, Hulgan T, Marzolini C, Acosta EP. Pharmacogenetics of efavirenz and central nervous system side effects: an Adult AIDS Clinical Trials Group study. AIDS 2004; 18: 2391–2400.
22. Smith K, Aga E, Bosch R, Valdez H, Connick E, Landay A, Kuritzkes D, Gross B, Francis I, McCune JM, Kessler H, Lederman M. Long-term changes in circulating CD4 T lymphocytes in virologically suppressed patients after 6 years of highly active antiretroviral therapy. AIDS 2004; 18: 1953–1956.
23. 1993 Revised classification system for HIV infection and expanded surveillance case definition for AIDS among adolescents and adults. www.who.int/hiv/strategic/en/cdc.1993 hivaids def.pdf.
24. Benson CA, Williams PL, Cohn DL, Becker S, Hojczyk P, Nevin T, Korvick JA, Heifets L, Child CC, Lederman MM, Reichman RC, Powderly WG, Notario GF, Wynne BA, Hafner R. Clarithromycin or rifabutin alone or in combination for primary prophylaxis of Mycobacterium avium complex disease in patients with AIDS: a randomized, double-blind, placebo-controlled trial. The AIDS Clinical Trials Group 196/Terry Beirn Community Programs for Clinical Research on AIDS 009 Protocol Team. J. Infect. Dis. 2000; 181: 1289–1297.
25. Benson CA, Williams PL, Currier JS, Holland F, Mahon LF, MacGregor RR, Inderlied CB, Flexner C, Neidig J, Chaisson R, Notario GF, Hafner R; AIDS Clinical Trials Group 223 Protocol Team. A prospective, randomized trial examining the efficacy and safety of clarithromycin in combination with ethambutol, rifabutin, or both for the treatment of disseminated Mycobacterium avium complex disease in persons with acquired immunodeficiency syndrome. Clin. Infect. Dis. 2003; 37: 1234–1243.
26. www.niaid.nih.gov/daids/pdatguide/amc.htm.
27. www.neuro.wustl.edu/narc.
28. Yiannoutsos CT, Major EO, Curfman B, Jensen PN, Gravell M, Hou J, Clifford DB, Hall CD. Relation of JC virus DNA in the cerebrospinal fluid to survival in acquired immunodeficiency syndrome patients with biopsy-proven progressive multifocal leukoencephalopathy. Ann. Neurol. 1999; 45: 816–821.
29. Schifitto G, Navia B, Yiannoutsos C, Marra C, Chang L, Ernst T, Jarvik J, Miller E, Singer E, Ellis R, Kolson D, Simpson D, Nath A, Berger J, Shriver S, Millar L, Colquhoun D, Lenkinski R, Gonzalez R, Lipton S. Memantine and HIV-associated cognitive impairment: a neuropsychological and proton magnetic resonance spectroscopy study. AIDS 2007; 21: 1877–1886.
30. Chung R, Andersen J, Volberding P, Robbins G, Liu T, Sherman K, Peters M, Koziel M, Bhan A, Alston-Smith B, Colquhoun D, Nevin T, Harb G, van der Horst C. Peg-interferon alfa-2a plus ribavirin versus interferon alfa-2a plus ribavirin for chronic hepatitis C in HIV-coinfected persons. N. Engl. J. Med. 2004; 351: 451–459.
ALGORITHM-BASED DESIGNS

WEILI HE
Clinical Biostatistics, Merck Research Labs, Merck & Co., Inc., Rahway, New Jersey

JUN LIU
MRI Unit, Department of Psychiatry, Columbia University, New York State Psychiatric Institute, New York, New York

1 PHASE I DOSE-FINDING STUDIES
The primary goal of a phase I cancer clinical trial is to determine the maximum tolerated dose (MTD) to be used for subsequent phase II and III trials evaluating efficacy (1, 2). The MTD is determined as the highest dose level of a potential therapeutic agent at which patients experience an acceptable level of dose-limiting toxicity (DLT) (3). The acceptable level of DLT is usually referred to as a tolerable or target toxicity level (TTL) (4). It is generally specified at the design stage of a trial by the investigators and depends on the potential therapeutic benefit of the drug, which may include but is not limited to considerations of the disease, disease stage, type of drug being tested, and patient population (4). Hence, the determination of the MTD level chiefly depends on the TTL that the investigators assign to the study drug. As with the TTL, the DLT is specifically defined in the study protocol at the design stage before the start of the study. In the United States, the common toxicity criteria (CTC) of the National Cancer Institute (NCI) are used (4). These comprise a large list of adverse events (AEs), subdivided into organ/symptom categories, that can be related to the anticancer treatment. Each AE can be categorized into five classes: grade 0, no toxicity; grade 1, mild toxicity; grade 2, moderate toxicity; grade 3, serious/severe toxicity; and grade 4, very serious or life-threatening. Usually, a toxicity of grade 3 or 4 is considered the DLT. That identifies a subset of toxicities from the CTC list, and the limits of grading define the DLTs for the investigational drug. Sometimes the list of DLTs is open, such that any AE from the CTC catalog of grade 3 and higher related to treatment is considered a DLT (4).

1.1 Algorithm-based A + B Designs
The standard methods for conducting phase I cancer trials are the traditional algorithm-based A + B designs, which are often applied without close inspection of their statistical properties (2, 5). Our focus here is on the traditional algorithm-based designs; alternative designs, such as the continual reassessment method (CRM) (6), decision theoretic approaches (7, 8), and escalation with overdose control (9), are discussed elsewhere in this book. With algorithm-based designs, a trial begins with the selection of a starting dose low enough to avoid severe toxicity but also high enough for a chance of activity and potential efficacy in humans (4). The starting dose is generally chosen based on animal studies, usually one-tenth of the lethal dose in mice, or based on information from previous trials. The dose escalation usually follows a modified Fibonacci scheme with dose escalations of 100%, 65%, 52%, 40%, 33%, 33%, 33%, and so on (4). Large increments occur at early doses, and smaller ones at higher levels. The traditional 3 + 3 design is a special case of an A + B design. With this design, three patients are assigned to the first dose level. If there is no DLT, the trial proceeds to the next dose level with a cohort of three other patients. If at least two out of the three patients experience the DLT, then the previous dose level is considered the MTD; otherwise, if only one patient experiences the DLT, then three additional patients are enrolled at the same dose level. If at least one of the three additional patients experiences the DLT, then the previous dose is considered the MTD; otherwise, the dose will be escalated. This design scheme is considered a 3 + 3 design without dose de-escalation. The 3 + 3 design with dose de-escalation is essentially similar, but it permits more patients to be treated at a lower dose. When excessive DLT incidences occur at the current dose level, another three patients would be treated at the previous dose level if there were only
three patients treated at that level previously. Hence, the dose may de-escalate until reaching a dose level at which six patients are already treated and at most one DLT is observed out of six patients. The MTD is defined as the highest dose level (≥ the first dose level) at which either at most one of six treated patients experiences the DLT, or none of three treated patients experiences the DLT, and at which the immediately higher dose level has at least two patients experiencing the DLT out of a cohort of three or six patients. If escalation occurs at the last observed dose level during the study, then the MTD is at or above the last dose level. If the trial stops at the first dose, then the MTD is below the first dose level. In either of these cases, the prescribed dose levels for the trial may need to be altered to determine the MTD.
1.2 Modification of A + B Designs and Derivation of Key Statistical Properties
Lin and Shih (10) and Shih and Lin (11) discussed statistical properties of the traditional and modified algorithm-based designs in a general setting: A + B designs with or without dose de-escalation, which include the popular 3 + 3 design as a special case. They provided exact formulas to derive the key statistical properties of these traditional and modified A + B designs, which include (1) the probability of a dose being chosen as the MTD; (2) the expected number of patients at each dose level; (3) the attained TTL (the expected DLT rate at the MTD found through the algorithm-based design); (4) the expected number of toxicities at each dose level; and (5) the expected overall toxicity in a trial. They realized that, although the traditional A + B design is quite a general family, it still does not include some important variants of the algorithm-based designs that are often encountered in many phase I cancer clinical trials. They referred to the experience of Smith et al. (12): "[P]hase I trials should consider alternative designs that increase a patient's chance to receive a therapeutic dose of an agent and also provide a better dose estimate for phase II trials." They introduced three modifications of the traditional A + B design. In the traditional A + B design, the previous lower dose level is always declared as
the MTD. The current dose has no chance at all to be declared as the MTD, which is conservative and wastes the already scarce information in phase I cancer trials. The first modification they made is to allow the investigator to declare the current dose level as the MTD (M1 A + B design). Because the starting dose in the traditional A + B design is always the lowest dose level in the study protocol, which may be conservative, the second modification they made, in addition to M1, is to start somewhere in the middle of the prespecified dose levels, with escalation or de-escalation in the sequence of treatment (M2 A + B design). This M2 design has the advantage of reaching the MTD faster; when the starting dose is too toxic, the dose can be de-escalated to avoid stopping the trial or halting it for a protocol amendment. The third modification they made is to have a two-stage process (M3 A + B design). In the first stage, only one or two patients are treated per dose level in an increasing dose scheme until the first sign of toxicity (either minor or DLT) is observed, at which time the second stage of a standard A + B design will begin. Lin and Shih derived the exact formulae for the key statistical properties described above for the traditional A + B design and also for the M1 to M3 A + B designs. (A computer program to calculate these statistical quantities is provided on the authors' website at http://www2.umdnj.edu/~linyo.) For clinicians who use the algorithm-based A + B design or the M1 to M3 A + B designs, Lin and Shih provide powerful tools to gain an understanding of the design properties before the start of a trial.
1.3 Advantages and Drawbacks of Algorithm-based Designs
There are two schools of thought in designing phase I cancer clinical trials. One treats the MTD as being observed from the data, and the other treats the MTD as a parameter to be estimated from a model. The former approach of treating the MTD as being identifiable from the data can be seen in the algorithm-based A + B designs. These designs have been continuously and widely used, even though many alternative model-based designs have been proposed and advocated in recent years,
such as the continual reassessment method (CRM) (6), random walk rules (13), decision theoretic approaches (14, 15), and escalation with overdose control (16). The chief reason for the broad use of algorithm-based designs is their simplicity in application. Model-based designs (6, 13–16) have to overcome many operational and logistic issues for practitioners, such as real-time data collection and processing, real-time estimation of model parameters by statisticians, prompt communication of the dose level to practitioners for the next cohort of patients, and logistics for drug supplies to these patients. These operational issues become increasingly difficult to manage when a phase I study is conducted in multiple study centers. The traditional A + B design is not without its drawbacks. As already mentioned, the chief complaint is that the recommended dose observed from the data from an algorithm-based design has no interpretation as an estimate of the dose that yields a prespecified rate of target toxicity (6). Hence, it has no intrinsic property that makes the study stop at any particular percentile of toxicity, even on average (5). In addition, there is no mechanism in the trial to estimate the sampling error in the identified MTD dose; thus, the recommended dose for phase II will always be an imprecise estimate of the optimal dose (17). Other criticisms of traditional algorithm-based designs include that, if the initial starting dose is far below the true MTD, the trial may be long, with many patients (especially those who entered the trial early) treated at a suboptimal dose (5, 12, 18). Smith et al. (12) recommended that the planning stage for future phase I trials should consider alternative designs that increase a patient's chance to receive a therapeutic dose of a drug and provide a better dose estimate for phase II studies.
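The 3 + 3 rules of Section 1.1 and the operating characteristics listed in Section 1.2 (the probability of each dose being chosen as the MTD, the expected number of patients, and the attained toxicity level) can also be approximated by Monte Carlo simulation when the exact formulas of Lin and Shih are not at hand. The sketch below is a minimal illustration of the 3 + 3 design without dose de-escalation; the five dose levels, their true DLT probabilities, the number of simulated trials, and the function names are assumptions made for illustration only.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2024)

def simulate_3p3(true_dlt_probs, rng):
    """Run one trial under the traditional 3 + 3 design without dose de-escalation.

    Returns (mtd_index, n_patients): mtd_index is None when the trial stops at the
    first dose (MTD below dose 1) and len(true_dlt_probs) when escalation passes
    the highest prescribed dose (MTD at or above the last level).
    """
    n_pat, level = 0, 0
    while True:
        dlts = rng.binomial(3, true_dlt_probs[level])       # first cohort of 3
        n_pat += 3
        if dlts == 1:                                        # 1/3 DLT: expand to 6
            extra = rng.binomial(3, true_dlt_probs[level])
            n_pat += 3
            if extra >= 1:                                   # >1/6 DLT: previous dose is MTD
                return (level - 1 if level > 0 else None), n_pat
        elif dlts >= 2:                                      # >=2/3 DLT: previous dose is MTD
            return (level - 1 if level > 0 else None), n_pat
        level += 1                                           # 0/3 or 1/6 DLT: escalate
        if level == len(true_dlt_probs):
            return len(true_dlt_probs), n_pat

# Assumed (hypothetical) true DLT probabilities for five prescribed dose levels.
p_true = [0.05, 0.10, 0.20, 0.35, 0.50]
sims = [simulate_3p3(p_true, rng) for _ in range(20_000)]

counts = Counter(mtd for mtd, _ in sims)
for key in [None] + list(range(len(p_true) + 1)):
    print(f"P(MTD index = {key}): {counts[key] / len(sims):.3f}")
print("Expected number of patients:", np.mean([n for _, n in sims]))

# Attained toxicity level: average true DLT probability at the dose selected as the MTD.
chosen = [p_true[mtd] for mtd, _ in sims if mtd is not None and mtd < len(p_true)]
print("Mean true DLT probability at the selected MTD:", round(float(np.mean(chosen)), 3))
```

With a gently increasing toxicity curve such as the assumed one, the simulated attained toxicity level at the selected MTD typically falls well below 33%, which anticipates the 19% to 24% range discussed in Section 3.1.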
2 ACCELERATED DESIGNS
Because of the perceived drawbacks of the traditional algorithm-based A + B designs, a few variants have been proposed and evaluated (11, 18–22). Storer (21) was probably the first to examine the characteristics of
the traditional algorithm-based designs. He proposed several variants of the standard designs, including a two-stage procedure. Operating characteristics of the standard design along with stopping rules have been derived (23). In the first stage, one patient is recruited per dose level, and dose escalation continues until the first DLT is observed. After that, accrual to the second stage begins following the standard design. The MTD is established based on the scheme of the standard design. Simon et al. (22) proposed a family of accelerated titration designs similar to Storer's. Design 1 was a conventional design (similar to the commonly used modified Fibonacci method) using cohorts of three to six patients, with 40% dose-step increments and no intrapatient dose escalation. Designs 2 through 4 included only one patient per cohort until one patient experienced dose-limiting toxic effects or two patients experienced grade 2 toxic effects (during their first course of treatment for designs 2 and 3, or during any course of treatment for design 4). Designs 3 and 4 used 100% dose steps during this initial accelerated phase. After the initial accelerated phase, designs 2 through 4 reverted to standard cohorts of three to six patients, with 40% dose-step increments. Designs 2 through 4 used intrapatient dose escalation if the worst toxicity in a patient's previous course was grade 0 to 1. Using simulations, Simon et al. (22) showed that accelerated titration designs appear to effectively reduce the number of patients who are undertreated, speed up the completion of phase I trials, and provide a substantial increase in the information obtained. They cautioned that the use of an accelerated titration design requires careful definition of the level of toxicity that is considered dose limiting and of the level considered sufficiently low that intrapatient dose escalation is acceptable. In most phase I clinical trials, however, intrapatient dose escalation may not be allowed when the toxicity profile of the compound being studied is not well known. Otherwise, it may be difficult to tease out the toxicity response at one dose level from the response at another dose level. Dancey et al. (24) also provided a thorough evaluation of the performance of the family of accelerated designs by Simon et al.
and conducted a literature search to assess the usefulness of the accelerated designs in the evaluation of novel oncology therapeutics. They reached conclusions about the advantages of accelerated designs similar to those of Simon et al. Based on the results of their review of published phase I studies that used the Simon group's accelerated titration designs, they observed that the use of minimum one-patient cohorts and larger dose-escalation steps may be advantageous under the following circumstances: (1) the agent is of a chemical class that has been widely studied; (2) the agent is predicted to have minimal interpatient variability in pharmacokinetics; and (3) the agent's anticipated toxicity is unlikely to be severe or irreversible and is amenable to close monitoring and supportive interventions. Conversely, there are situations where an accelerated titration design may not provide the optimal balance between safety and efficiency, because either larger numbers of patients per dose cohort or smaller dose increments (or both) would be preferable. Agents associated with steep dose-response curves for toxicity, severe irreversible toxicity, unexplained mortality in animal toxicology studies, or large variability in the doses or plasma drug levels eliciting effects may require alternative designs to balance safety and efficiency optimally. They found that, despite the advantages, the accelerated designs are not widely used, probably because of the conservativeness of investigators. There may be scenarios under the accelerated designs proposed by Storer and by Simon et al. in which the MTD could be declared at a dose level at which only one patient had been studied, if the design is without dose de-escalation. This is not desirable because more than one patient should be evaluated at the MTD dose level. It is also possible that the MTD could not be identified in a trial because no DLT was encountered during the acceleration of one patient per level through the entire set of prescribed dose levels. This case is also less than optimal because one would want to identify the MTD within the prescribed dose levels of a trial; restarting the trial with altered dose levels prolongs the study duration and requires additional patient and financial resources.
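Storer's two-stage procedure and the single-patient accelerated phase of the Simon et al. designs can be sketched in the same simulation style. The code below is a simplified illustration only: it uses one-patient cohorts until the first DLT and then reverts to the 3 + 3 rules of Section 1.1, it ignores intrapatient escalation and grade 2 toxicity triggers, and it does not count the accelerated-phase patient toward the first standard cohort. The dose levels and DLT probabilities are again assumed values, not those of any published design.

```python
import numpy as np

rng = np.random.default_rng(7)
p_true = [0.05, 0.10, 0.20, 0.35, 0.50]   # assumed true DLT probabilities per dose level

def three_plus_three_from(level, true_dlt_probs, rng):
    """Standard 3 + 3 stage (no de-escalation) starting at `level`; returns (mtd_index, n)."""
    n_pat = 0
    while True:
        dlts = rng.binomial(3, true_dlt_probs[level])
        n_pat += 3
        if dlts == 1:
            extra = rng.binomial(3, true_dlt_probs[level])
            n_pat += 3
            if extra >= 1:
                return (level - 1 if level > 0 else None), n_pat
        elif dlts >= 2:
            return (level - 1 if level > 0 else None), n_pat
        level += 1
        if level == len(true_dlt_probs):
            return len(true_dlt_probs), n_pat

def simulate_accelerated(true_dlt_probs, rng):
    """Stage 1: one patient per level until the first DLT; Stage 2: 3 + 3 from that level."""
    n_pat, level = 0, 0
    while level < len(true_dlt_probs):
        n_pat += 1
        if rng.random() < true_dlt_probs[level]:
            mtd, n2 = three_plus_three_from(level, true_dlt_probs, rng)
            return mtd, n_pat + n2
        level += 1
    return len(true_dlt_probs), n_pat        # escalated through all doses without a DLT

sims = [simulate_accelerated(p_true, rng) for _ in range(20_000)]
print("Expected number of patients:", np.mean([n for _, n in sims]))
print("P(no MTD identified within the prescribed doses):",
      np.mean([mtd == len(p_true) for mtd, _ in sims]))
```

Comparing the expected sample size here with that of the plain 3 + 3 sketch shows where the savings come from: the accelerated stage passes each low dose with a single patient. The second printed quantity also illustrates the caveat noted above, that a trial may escalate through every prescribed dose without observing a DLT and thus fail to identify an MTD.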
2.1 A Modification of the Accelerated Designs
Liu et al. (25) proposed a modified accelerated design in which an MTD dose needs to be tested in a cohort of A + B patients. Further, if no DLT occurs throughout the prescribed doses in the acceleration phase during the trial, then the last dose level will be studied with a cohort of at least A or A + B patients; this way, more patients are evaluated at the last dose level, thus increasing the chance of identifying an MTD dose for the trial. This modified accelerated 1 + A + B design closely approximates the M3 A + B design proposed by Shih and Lin (11), except that it modifies the standard A + B portion of the study to ensure that A + B patients will be studied at the MTD level and that a cohort of at least A or A + B patients will be studied at the last dose level n if no DLT occurs at the previous n − 1 doses. Liu et al. (25) used the same notation as Lin and Shih (10). Let A, B, C, D, and E be integers. The notation C/A indicates that there are C toxicity incidences out of A patients, and >C/A means more than C toxicity incidences out of A patients. Similarly, the notation ≤C/A means no more than C toxicity incidences out of A patients.
Figure 1. Diagram of modified accelerated design.
Liu et al. (25) evaluated the properties of the modified accelerated design as compared with the standard A + B design in two case studies. They concluded that the modified accelerated design and the standard design select a dose level as the MTD with
similar probabilities. However, the modified accelerated design requires a smaller expected number of patients because of its ability to accelerate dose escalation through low-dose
levels and to switch to the standard design when a DLT is encountered. Under certain dose-toxicity profiles, such as a dose-toxicity response curve that is gradually increasing, the savings in patient resources and time may be tremendous, translating into much shorter studies. For trials with high starting-dose toxicity or with only a few dose levels, the modified accelerated design may not reduce the number of patients because it rushes through the initial doses to the higher doses and may require dose de-escalation a few more levels down than the standard design. Their modified accelerated design can also be extended easily to allow the flexibility to start a trial at a dose level that may be several levels above the study's prescribed minimum dose. This flexibility gives researchers increased freedom to expedite a trial as needed while having the option to move downward to the lower doses when necessary.
3 MODEL-BASED APPROACH IN THE ESTIMATION OF MTD
In cancer clinical trials, a single universal TTL does not appear to exist based on the literature (5, 12, 21, 26). Depending on the type and severity of a disease, a TTL of approximately 33% is suitable for many cancer trials, and a TTL of 50% may also be reasonable when the toxicities are viewed as less serious (or more acceptable) relative to the severity of the disease being treated. Among the variety of possible TTLs, it would be difficult to use the algorithm-based approach to derive the MTD for a given TTL, because the MTD observed from the data from an algorithm-based design has no interpretation as an estimate of the dose that yields a prespecified rate of target toxicity (6).
3.1 Evaluation of the Traditional Algorithm-based 3 + 3 Design
The common perception by most clinicians is that the traditional algorithm-based 3 + 3 designs produce a 33% toxicity rate (5). This is not true, based on discussions by Lin and Shih (10) and Kang and Ahn's simulation studies (27, 28). Kang and Ahn showed that the expected toxicity rate at the MTD is between
19% and 22% if the dose-toxicity relationship is assumed to follow a logistic or hyperbolic tangent function. He et al. (29) also conducted a simulation study to investigate the properties of the algorithm-based 3 + 3 design. Instead of assuming a particular functional form of the dose-toxicity relationship, they only required the dose-toxicity relationship to be monotonically nondecreasing. They found that the expected toxicity levels are in the range of 19% to 24% rather than the anticipated 33%, depending on the number of prescribed doses in a study. They observed that the estimate of the expected toxicity level and its associated standard deviation decrease with an increasing number of dose levels prescribed for a study. This is most likely for the following reasons. For the algorithm-based 3 + 3 designs, the MTD is defined as the highest dose level (≥ the first dose level) at which either six patients are treated with at most one patient experiencing the DLT, or three patients are treated with no patient experiencing the DLT, and at which the immediately higher dose level has at least two patients experiencing the DLT out of a cohort of three or six patients. The expected toxicity level at the MTD identified under this design scenario is therefore more likely associated with DLT rates between 0% (0/3) and 16.7% (1/6), because at most zero out of three patients or one out of six patients experiences the DLT at the MTD level. With an increasing number of prescribed dose levels in a study, the accuracy and precision of identifying an MTD associated with an expected toxicity level approaching this limit increase. Because it has been shown that the traditional 3 + 3 algorithm-based designs cannot provide accurate estimates of the MTD when TTLs are set high (27–29), He et al. (29) proposed a model-based approach to the estimation of the MTD following an algorithm-based design.
3.2 Model-based Approach in the Estimation of MTD
He et al. (29) assumed a one-parameter family of models to depict the dose-toxicity relationship, denoted by ψ(d, a). Let the dose range be D = {d_1, ..., d_k}, where k is the number of dose levels in the study; let S =
ALGORITHM-BASED DESIGNS
{s1 , . . ., sm } be the dose level in the dose escalation or de-escalation steps during the trial, where m is the total number of steps of dose escalation and/or de-escalation before the study stops; and let Y = {y1 , . . ., ym } be the number of patients experiencing DLT events at the corresponding S = {s1 , . . ., sm } steps, and the elements in Y may take on values (0, 1, 2, 3). Under traditional 3 + 3 designs, at each step i three patients are studied at dose level si , where dose level si may or may not be equal to dose level sj at step j (i = j). They derived the likelihood function as follows: L(a) =
m
Pr(yl |sl ) ∝
l=1
m {ψ(sl , a)}yl l=1
{1 − ψ(sl , a)}
(3−yl )
.
(1)
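For concreteness, one simple one-parameter working model of this type, shown here purely as an illustration and not necessarily the form adopted by He et al. (29), is the hyperbolic tangent model familiar from the continual reassessment method (6),

ψ(d, a) = [(tanh d + 1)/2]^a,   a > 0,

which is monotonically increasing in the dose label d for every admissible value of a.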
These derivations can be easily extended to general traditional A + B algorithm-based designs. Let v = {v1, . . . , vm} be the numbers of patients studied at the corresponding steps S = {s1, . . . , sm}, where the elements of v may take on the values A or B for the traditional A + B designs. Based on the likelihood function in equation 1, the likelihood function for a traditional A + B design can be written as

L(a) ∝ ∏_{l=1}^{m} {ψ(s_l, a)}^{y_l} {1 − ψ(s_l, a)}^{v_l − y_l}.    (2)
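To make the likelihood concrete, the following minimal sketch evaluates and maximizes the likelihood in equation 2 for a hypothetical escalation record under an assumed logistic-type working model; the dose labels, DLT counts, model form, and target toxicity level are all illustrative, and the fit is by maximum likelihood rather than the Bayesian approach used by He et al. (29).

import numpy as np
from scipy.optimize import minimize_scalar

# Assumed one-parameter working model: monotone increasing in the dose label d.
def psi(d, a):
    return 1.0 / (1.0 + np.exp(-(d - a)))

# Hypothetical record of a 3 + 3 trial: dose label, cohort size, and DLT count per step.
s = np.array([-2.0, -1.0, 0.0, 0.0, 1.0])   # dose labels visited, in order
v = np.array([3, 3, 3, 3, 3])               # patients treated at each step
y = np.array([0, 0, 1, 0, 2])               # DLTs observed at each step

def neg_log_likelihood(a):
    p = psi(s, a)
    # negative log of equation 2: sum of y*log(psi) + (v - y)*log(1 - psi)
    return -np.sum(y * np.log(p) + (v - y) * np.log(1.0 - p))

fit = minimize_scalar(neg_log_likelihood, bounds=(-10.0, 10.0), method="bounded")
a_hat = fit.x

# Dose label whose fitted DLT probability is closest to a prespecified TTL of 33%.
dose_grid = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
mtd_label = dose_grid[np.argmin(np.abs(psi(dose_grid, a_hat) - 0.33))]
print(f"a_hat = {a_hat:.2f}, model-based MTD label = {mtd_label}")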
Because the likelihood function in equation 2 is derived based on the data from each step of dose escalation or de-escalation, this likelihood function can also apply to the modified accelerated 1 + A + B designs mentioned in the previous section. In the case of modified accelerated designs, v = {v1, . . . , vm} may take on the values 1, A, or B. Further, for any algorithm-based design, be it an A + B design, an accelerated design, or an M1–M3 A + B design (7), the likelihood function can be written as equation 2, with v = {v1, . . . , vm} taking on possible integer values in (1 to A) or (1 to B). He et al. (29) employed a Bayesian approach for the model inference. They conducted simulations to investigate the properties of their model-based approach in the estimation of the MTD as compared with the MTD identified from a traditional algorithm-based 3 + 3 design. The simulation results demonstrated that their model-based approach produced much less biased estimates of the MTDs compared with the estimates obtained from the traditional 3 + 3 designs. Although the traditional 3 + 3 designs produce estimates of the MTD that are generally associated with expected toxicity levels of 19% to 24% regardless of any prespecified target toxicity level, their model-based approach enables a substantially less biased estimate of the MTD based on the data as compared with the traditional 3 + 3 designs. Their method is applicable in any trial that was conducted using an algorithm-based design when the intended target toxicity level was much higher than 20%, potentially salvaging the trial, and probably the whole development program, by bringing forward to later phase trials a therapeutic dose for confirmatory efficacy studies.

4 EXPLORING ALGORITHM-BASED DESIGNS WITH PRESPECIFIED TARGETED TOXICITY LEVELS

4.1 Rationale for Exploring the Properties of Algorithm-based Designs

As mentioned in the previous section, the chief reason for the broad use of algorithm-based designs is their simplicity in application, even though many alternative model-based designs have been proposed and advocated in recent years. Correspondingly, the chief complaint about these designs is that the recommended MTD dose observed from the data from an algorithm-based design has no interpretation as an estimate of the dose that yields a prespecified rate of target toxicity. The properties of the traditional algorithm-based 3 + 3 designs have been studied extensively by many (27–29). Simulation studies have shown that the traditional algorithm-based 3 + 3 designs are associated with expected toxicity levels in the range of 19% to 24% or even less, depending on the actual algorithm and number of prescribed doses in a trial. Other algorithm-based designs have not been studied, and it is not certain what design properties, especially the expected toxicity level, an identified MTD may be associated with at the conclusion of an algorithm-based trial. In the following section, we study the expected toxicity levels of a few A + B and 1 + A + B designs with varying design parameters by simulations.
4.2 Simulation Study Design and Results

Suppose that k dose levels from the dose range D = {d1, . . . , dk} are chosen for a study. Let θi be the probability of DLT that corresponds to the dose level di (i = 1, . . . , k). We assume that θi satisfies θi = ψ(di), where ψ( ) is any monotonic function with the constraint θ1 ≤ θ2 ≤ · · · ≤ θk. Denote ri = P(MTD = di) as the probability that dose level di is declared as the MTD. Based on the exact formulas of Lin and Shih (10) and of Liu et al. (25), ri can be derived for traditional A + B designs with and without dose de-escalation and for the modified accelerated 1 + A + B designs. Once ri is obtained for each dose level di, we can derive the expected toxicity level (ETL) at the MTD as

ETL = P(toxicity at MTD | dose 1 ≤ MTD < dose k) = Σ_{i=1}^{k−1} θi ri / Σ_{i=1}^{k−1} ri

and the corresponding standard deviation as

SD = sqrt( Σ_{i=1}^{k−1} θi² ri / Σ_{i=1}^{k−1} ri − (ETL)² ).

The ETL is different from the target toxicity level (TTL) in that the TTL is an acceptable toxicity level at the MTD prespecified by investigators at the onset of a trial, whereas the ETL is the expected toxicity level at the MTD achieved after the conclusion of a trial. For dose levels d1, . . . , dk, instead of calculating the ETL for one set of fixed θ1, . . . , θk, we generated N = 4000 sets of the corresponding probabilities of DLT θ1, . . . , θk from the uniform distribution (0, 1) with θ1 ≤ θ2 ≤ · · · ≤ θk through simulation. Then, for each of the N sets of θ1, . . . , θk, the ETL is obtained by exact computation using the formulas of Lin and Shih (10) and Liu et al. (25), respectively, for the A + B and 1 + A + B designs. For our simulations, k = 10 is used. The results of the simulations are then pooled together to evaluate the parameter estimates with the following summary statistics:

ÊTL = Σ_{l=1}^{N} ÊTL_l / N,

where ÊTL_l represents the expected toxicity level at the l-th (l = 1, . . . , N) simulation. The standard deviation of the ÊTL_l is defined as

SD(ÊTL) = sqrt( Σ_{l=1}^{N} (ÊTL_l − ÊTL)² / (N − 1) ).

Using the same definitions for A, B, C, D, E, ≤C/A, and >C/A as in the designs described earlier, Table 1 gives the estimated ETLs for a number of commonly used A + B and 1 + A + B designs; several of these designs are associated with expected toxicity levels that are higher than 19% to 24%. This is extremely desirable because algorithm-based designs are still widely used due to their operational simplicity for implementation. Now that the statistical property in terms of the expected toxicity levels has been studied for these designs, practitioners could choose the one algorithm-based design that fits their prespecified target toxicity level. Table 2 provides the estimated ETLs for less commonly used A + B and 1 + A + B designs. Although the variability for each estimated ETL was not reported, the standard deviations generally vary in the range of 3% to 6%. In general, the precision of the ETL estimates increases with an increasing number of doses included in a trial (29).

Table 1. Estimated Expected Toxicity Levels Based on Algorithm-Based Designs: Commonly Used Algorithm-Based Designs
(A–E are the design parameters; the last two columns give the estimated ETL for the A + B and the 1 + A + B version of each design.)

A  B  C  D  E     A + B    1 + A + B
3  3  1  1  1     20%      21%
3  3  1  1  2     27%      29%
3  3  1  2  2     30%      32%
3  3  2  2  2     33%      32%
3  3  2  2  3     42%      42%
3  6  1  1  2     21%      22%
3  6  1  1  3     26%      27%
3  6  1  1  4     30%      32%
3  6  2  2  2     25%      25%
3  6  2  2  3     31%      32%
3  6  2  2  4     38%      38%
3  6  2  2  5     45%      45%
3  6  2  3  4     39%      39%
3  6  2  3  5     48%      47%
Table 2. Estimated Expected Toxicity Levels Based on Algorithm-Based Designs: Less Commonly Used Algorithm-Based Designs
(A–E are the design parameters; the last two columns give the estimated ETL for the A + B and the 1 + A + B version of each design.)

A  B  C  D  E     A + B    1 + A + B
3  2  1  1  1     22%      24%
3  4  1  1  1     18%      18%
3  4  1  1  2     25%      26%
3  4  1  1  3     30%      32%
3  5  1  1  3     28%      30%
3  5  1  1  4     31%      33%
4  3  2  2  2     35%      35%
4  4  2  2  4     38%      39%
4  5  1  2  3     28%      30%
4  5  1  2  4     35%      37%
4  5  2  2  3     29%      29%
4  5  2  2  4     35%      36%
4  6  1  1  5     26%      29%
4  6  1  2  5     37%      39%
4  6  2  2  5     38%      38%
5  3  2  2  3     30%      31%
5  4  1  2  3     27%      30%
5  4  2  2  3     28%      28%
5  4  2  2  4     32%      34%
5  5  1  1  4     30%      32%
5  5  2  2  4     30%      31%
5  5  2  2  5     33%      35%
5  6  1  2  5     32%      34%
6  3  2  3  3     28%      30%
6  3  3  3  4     36%      36%
6  6  1  3  5     33%      35%
6  6  2  3  6     38%      39%
6  6  3  3  6     38%      38%
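As a rough illustration of how expected toxicity levels such as those in Tables 1 and 2 can be approximated, the sketch below simulates a simplified traditional 3 + 3 design (no dose de-escalation) over randomly generated monotone dose-toxicity curves. It uses Monte Carlo rather than the exact formulas of Lin and Shih (10) or Liu et al. (25), so its output only approximates the tabulated values.

import numpy as np

rng = np.random.default_rng(12345)

def run_3_plus_3(theta, rng):
    # Simulate a simplified 3 + 3 design without de-escalation.
    # Returns the index of the declared MTD, or None if no MTD in d1, ..., d_{k-1}.
    for i in range(len(theta)):
        dlt = rng.binomial(3, theta[i])          # first cohort of three
        if dlt == 1:
            dlt += rng.binomial(3, theta[i])     # expand to six at the same dose
        if dlt <= 1:
            continue                             # escalate to the next dose
        return i - 1 if i >= 1 else None         # >= 2 DLTs: MTD is the next lower dose
    return None                                  # escalated past the top dose

n_sim, k = 4000, 10
etl = []
for _ in range(n_sim):
    theta = np.sort(rng.uniform(0.0, 1.0, size=k))   # random nondecreasing toxicity curve
    mtd = run_3_plus_3(theta, rng)
    if mtd is not None:
        etl.append(theta[mtd])                       # true DLT probability at the declared MTD

print(f"approximate ETL = {np.mean(etl):.3f}, SD = {np.std(etl, ddof=1):.3f}")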
REFERENCES 1. N. L. Geller, Design of phase I and II clinical trials in cancer: a statistician’s view. Cancer Invest. 1984; 2: 483–491. 2. W. Rosenberger and L. Haines, Competing designs for phase I clinical trials: a review. Stat Med. 2002; 21: 2757–2770. 3. S. Green, J. Benedetti, and J. Crowley, Clinical Trials in Oncology. Boca Raton, FL: Chapman & Hall/CRC, 2003. 4. L. Edler, Overview of phase I trials. In: J. Crowley, ed. Handbook of Statistics in Clinical Oncology. New York: Marcel Dekker, 2001, pp. 1–34. 5. B. Storer and D. DeMets, Current phase I/II designs: are they adequate? J Clin Res Drug Dev. 1986; 1: 121–130. 6. J. O’Quigley, M. Pepe, and L. Fisher, Continual reassessment method: a practical design for phase I clinical studies in cancer. Biometrics. 1990; 46: 33–48. 7. J. Whitehead and H. Brunier, Bayesian decision procedures for dose determining experiments. Stat Med. 1995; 14: 885–893. 8. J. Whitehead, Bayesian decision procedures with application to dose-finding studies. Int J Pharm Med. 1997; 11: 201–208. 9. J. Babb, A. Rogatko, and S. Zacks, Cancer phase I clinical trials: efficient dose escalation with overdose control. Stat Med. 1998; 17: 1103–1120. 10. Y. Lin and W. J. Shih, Statistical properties of the traditional algorithm-based designs for phase I cancer clinical trials. Biostatistics. 2001; 2: 203–215. 11. W. J. Shih and Y. Lin, Traditional and modified algorithm-based designs for phase I cancer clinical trials. In: S. Chevret (ed.), Statistical Methods for Dose-Finding Experiments. New York: Wiley, 2006, pp. 61–90. 12. T. Smith, J. Lee, H. Kantarjian, S. Legha, and M. Rober, Design and results of phase I cancer clinical trials: three-year experience at M.D. Anderson Cancer Center. J Clin Oncol. 1996; 14: 287–295. 13. S. Durham, N. Flournoy, and W. Rosenberger, A random walk rule for phase I clinical trials. Biometrics. 1997; 53: 745–760. 14. J. Whitehead and H. Brunier, Bayesian decision procedures for dose determining experiments. Stat Med. 1995; 14: 885–893. 15. J. Whitehead, Bayesian decision procedures with application to dose-finding studies. Int J Pharm Med. 1997; 11: 201–208.
16. J. Babb, A. Rogatko, and S. Zacks, Cancer phase I clinical trials: efficient dose escalation with overdose control. Stat Med. 1998; 17: 1103–1120. 17. M. Ratain, R. Mick, R. Schilsky, and M. Siegler, Statistical and ethical issues in the design and conduct of phase I and II clinical trials of new anticancer agents. J Natl Cancer Inst. 1993; 85: 1637–1643. 18. E. Korn, D. Midthune, T. Chen, L. Rubinstein, M. Christian, and R. Simon, A comparison of two phase I trial designs. Stat Med. 1994; 13: 1799–1806. 19. C. Ahn, An evaluation of phase I cancer clinical trial designs. Stat Med. 1998; 17: 1537–1549. 20. W. Rosenberger and L. Haines, Competing designs for phase I clinical trials: a review. Stat Med. 2002; 21: 2757–2770. 21. B. Storer, Design and analysis of phase I clinical trials. Biometrics. 1989; 45: 925–937. 22. R. Simon, B. Freidlin, L. Rubinsteir, S. Arbuck, J. Collins, and M. Christian, Accelerated titration designs for phase I clinical trials in oncology. J Natl Cancer Inst. 1997; 89: 1138–1147. 23. E. Reiner, X. Paoletti, and J. O’Quigley, Operating characteristics of the standard phase I clinical trial design. Comput Stat Data Anal. 1999; 30: 303–315. 24. J. Dancey, B. Freidlin, and L. Rubinstein, Accelerated titration designs. In: S. Chevret (ed.), Statistical Methods for Dose-Finding Experiments. New York: Wiley, 2006, pp. 91–113. 25. J. Liu, W. He, and H. Quan, Statistical properties of a modified accelerated designs for phase I cancer clinical trials. Commun Stat Theory Methods. 2007. In press. 26. M. Stylianou, M. Proschan, and N. Flournoy, Estimating the probability of toxicity at the target dose following an up-and-down design. Stat Med. 2003; 22: 535–543. 27. S. Kang and C. Ahn, The expected toxicity rate at the maximum tolerated dose in the standard phase I cancer clinical trial design. Drug Inf J. 2001; 35: 1189–1199. 28. S. Kang and C. Ahn, An investigation of the traditional algorithm-based designs for phase I cancer clinical trials. Drug Inf J. 2002; 36: 865–873. 29. W. He, J. Liu, B. Binkowitz, and H. Quan, A model-based approach in the estimation of the maximum tolerated dose in phase I cancer clinical trials. Stat Med. 2006; 25: 2027–2042.
CROSS-REFERENCES: Continual Reassessment Method; Maximum Tolerable Dose; Phase I Trials
ALIGNED RANK TEST
REBECCA B. McNEIL, Biostatistics Unit, Mayo Clinic, Jacksonville, Florida
ROBERT F. WOOLSON, Department of Biostatistics, Bioinformatics and Epidemiology, Medical University of South Carolina, Charleston, South Carolina

When a comparison of means or medians between two or more experimental groups is desired, and differences between the scale parameters (variance) are not of concern, a location shift hypothesis is under consideration. Analysis of variance (ANOVA) methods are the most common hypothesis-testing approach in this experimental framework. However, the validity of the parametric approach is dependent on the distribution of the data. Rank tests are a popular alternative method of analysis. The application of these tests generally involves the ranking of all observations without regard to experimental group; a function of the ranks is then compared between the groups. But it is often necessary to block or stratify the data on one or more covariates for the purpose of reducing experimental error. In this situation, it is routine to consider the application of methods such as the Friedman generalization of the Wilcoxon test (1), which ranks observations on an intrablock basis to develop a comparison of experimental groups. Another approach with comparatively higher efficiency in these multi-way designs is that of aligned rank testing, which compares experimental groups based on the ranks of residuals obtained after the alignment of intrablock observations via removal of block effects.

1 BACKGROUND

Aligned rank tests were introduced by Hodges and Lehmann in 1962 (2) within the motivating context of the two-way layout. The concept of ranking after alignment according to one or more factors has since been extended to multi-way designs, and has also been considered for time-to-event data (3). The goal of the procedure is to preserve information regarding interblock variability through the application of interblock rankings while removing the additive block-specific effects from the data. This permits a more efficient test of the experimental effect under examination. To this end, all observations are centered (aligned) according to their intrablock mean, median, or another symmetric and translation-invariant measure of location. The residuals from centering are then ranked without respect to block or experimental group, and are labeled according to group. After labeling, the groupwise rank sums are compared using a procedure appropriate for the hypotheses. For example, the Wilcoxon rank-sum statistic may be computed for two-sample comparisons, and the Kruskal-Wallis statistic or F statistic (4) may be considered for comparisons of more than two experimental groups. Like their traditional nonaligned rank test counterparts, these aligned procedures provide an exact test of the null hypothesis. The significance of the test may be computed by hand for small samples where derivation of the permutation distribution of the rank labels is convenient; larger samples or multiple blocking factors require the use of computerized software or the chi-square approximation from Sen (5), which was developed further in Puri and Sen (6) and supported by Tardif (7).

2 ASSUMPTIONS

We restrict our further discussion to the two-way layout. Consider a randomized block design with n blocks, p experimental treatment groups, and a total of N measurements. Assume that the block and experimental effects are additive and linear, and that the experimental effect is a fixed effect. Then the endpoint under consideration within the i-th block and j-th experimental group, Yij, may be described as

Yij = µ + αi + βj + εij

where µ is the overall mean response, αi is the block effect, βj is the experimental (treatment) effect, and εij denotes residual error. We assume that the vectors εi, which contain the residuals specific to block i (i = 1, . . . , n), constitute independent random variables with a continuous joint cumulative distribution function. In addition, it is assumed that the rank scoring sequence selected is a function F of the sequence m/(N + 1), m = 1, . . . , N, where F is defined according to the Chernoff-Savage criteria. This allows for most of the commonly chosen scoring regimens.
3 EFFICIENCY, SIZE, AND POWER
Akritas (8) shows that when a two-way model is considered, the aligned rank test of the hypothesis of no main or interaction effects due to the experimental variable outperforms the rank transform ANOVA and the F test in efficiency for logistic and double exponential data. The F test exceeds the aligned rank test and the rank transform ANOVA in efficiency when the data have a Gaussian distribution (8). Specifically, the aligned rank test has minimum efficiency of 3/π = 0.955 in comparison with the rank transform F test in normal data (9). When compared with the equivalent test based on interblock rankings, the aligned rank test has efficiency of 3/2 for two experimental groups, and the efficiency declines monotonically to equivalence between methods as the sample size grows. It should be noted that these efficiencies are correct regardless of whether the sample size increases as a result of increasing the number of experimental groups, or as a result of increasing the groupwise sample size while maintaining a limited number of experimental groups (9). The size (type I error rate) of the aligned rank test has not been comprehensively described. O’Gorman (10) presents the results of simulations of empirical size for 4 or 10 blocks and 3 or 6 treatments in a randomized complete block design. The distributions considered were the uniform, normal, and double exponential. The aligned rank test (large sample, with referral to the chisquare distribution) was consistently conservative, with estimated size ranging from 1.4%
to 4.8%, depending on whether the alignment was with respect to the mean or the median. A smaller number of blocks resulted in increased conservatism. An aligned test with referral to the F distribution, but with denominator degrees of freedom modified for improved size, ranged between 3.3% and 5.2% for the same data. The power of the mean- and median-aligned rank tests was slightly inferior to that of the denominator-adjusted test when small sample sizes were considered. In larger samples where at least 15 blocks were present, the aligned rank tests had power comparable to or greater than that of the F test and Friedman's test, particularly in the simulations of data from double exponential distributions. Thus, the F test is preferred for normal or uniform distributions, and the aligned rank tests are preferred for larger sizes and skewed or long-tailed distributions (10). However, selection of an analysis method should take into account both the comparative size and power of the candidate methods.

4 EXAMPLE

We illustrate the use of this method using data collected in a clinical trial of high-dose vitamin D supplementation in pregnant and lactating mothers. Table 1 presents selected data from the trial. The measurements represent the serum level of 25(OH)D, a form of vitamin D measured through a blood test. The data are drawn from three racial blocks and two dosage groups. We align blocks using the block mean. After computation of the block mean, the observations are aligned and the residuals ranked (from least to greatest) and labeled with respect to dose group, as shown in Table 2.

Table 1. Selected 25(OH)D Measurements and Intrablock Means from a Clinical Trial with Two Dose Groups and Three Blocks

Block    Dose A          Dose B                Mean
1        42.0, 41.4      14.6                  32.7
2        36.3, 32.9      15.9, 18.6            25.9
3        39.5, 46.1      25.0, 25.6, 31.3      33.5

Source: Data kindly provided by Dr. Carol Wagner, Medical University of South Carolina.

Table 2. Calculation, Ranking, and Labeling of Residuals After Alignment by Subtraction of the Intrablock Mean

25(OH)D    Block mean    Residual    Ranking    Dose    Block
42.0       32.7          9.3         10         A       1
41.4       32.7          8.7         9          A       1
14.6       32.7          -18.1       1          B       1
36.3       25.9          10.4        11         A       2
32.9       25.9          7.0         8          A       2
15.9       25.9          -10.0       2          B       2
18.6       25.9          -7.3        5          B       2
39.5       33.5          6.0         7          A       3
46.1       33.5          12.6        12         A       3
25.0       33.5          -8.5        3          B       3
25.6       33.5          -7.9        4          B       3
31.3       33.5          -2.2        6          B       3

To compare dose groups, we now compute the sum of ranks specific to a single group. For group A, the ranks total W = 57. This sum is referred to the permutation distribution of the rank sums, obtained by varying the group labelings. Specifically, within block 1, the two assignments to group A are permutable among the three observations to yield rank totals for group A of 10, 11, and 19. Within block 2, the two assignments to group A can be varied among the four observations to give six sums (some not unique), and 10 (5-choose-2) sums may be obtained from the labelings in block 3. Thus, 57 is the largest possible value of the rank sum for group A, obtainable through only one labeling: the one that assigns the largest intrablock ranks to group A. The significance of the test is then 1/(3 × 6 × 10) = 0.0056. This exact test of the null hypothesis of no difference between groups indicates that there is a significant difference between the mean 25(OH)D levels of the two dose groups.
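The calculation just described can be reproduced with a short script; the sketch below aligns each block by its mean, pools and ranks the residuals, and enumerates the exact permutation distribution of the group A rank sum for the data of Tables 1 and 2 (it returns W = 57 and p = 1/180, about 0.0056).

from itertools import combinations, product

import numpy as np
from scipy.stats import rankdata

# 25(OH)D values by block, with dose-group labels, from Tables 1 and 2.
blocks = [
    ([42.0, 41.4, 14.6],             ["A", "A", "B"]),
    ([36.3, 32.9, 15.9, 18.6],       ["A", "A", "B", "B"]),
    ([39.5, 46.1, 25.0, 25.6, 31.3], ["A", "A", "B", "B", "B"]),
]

# Align within block (subtract the intrablock mean), then rank all residuals together.
residuals, labels, block_id = [], [], []
for b, (y, lab) in enumerate(blocks):
    y = np.asarray(y)
    residuals.extend(y - y.mean())
    labels.extend(lab)
    block_id.extend([b] * len(y))
ranks = rankdata(residuals)

# Observed statistic: sum of the pooled ranks belonging to dose group A.
w_obs = sum(r for r, lab in zip(ranks, labels) if lab == "A")

# Exact permutation distribution: within each block, the group A labels may fall on
# any subset of that block's units; the blocks permute independently.
per_block_sums = []
for b in range(len(blocks)):
    r_b = [r for r, g in zip(ranks, block_id) if g == b]
    n_a = sum(1 for lab, g in zip(labels, block_id) if g == b and lab == "A")
    per_block_sums.append([sum(c) for c in combinations(r_b, n_a)])

all_sums = [sum(s) for s in product(*per_block_sums)]
p_exact = sum(s >= w_obs for s in all_sums) / len(all_sums)
print(f"W = {w_obs:.0f}, exact p = {p_exact:.4f}")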
5 LARGE-SAMPLE APPROXIMATION

For randomized complete block designs, in which each of n blocks contains p experimental units assigned to p treatments, a large-sample chi-square approximation is available. This may be obtained through referral of the statistic

S = (p − 1) n² Σ_{j=1}^{p} (R̄_{·j} − R̄_{··})² / Σ_{i=1}^{n} Σ_{j=1}^{p} (R_{ij} − R̄_{i·})²

to the χ²_{p−1} distribution, where R_{ij} denotes the aligned rank in block i and treatment group j, R̄_{i·} and R̄_{·j} are the corresponding block and treatment means, and R̄_{··} is the grand mean of the ranks.
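The statistic is straightforward to compute once the aligned ranks are in hand. The sketch below applies it to simulated data from a hypothetical randomized complete block design (the example data above are not a complete block design, so invented balanced data are used instead); all parameter values are illustrative.

import numpy as np
from scipy.stats import chi2, rankdata

rng = np.random.default_rng(1)

# Hypothetical randomized complete block design: n blocks, p treatments, one unit per cell.
n, p = 15, 3
block_effects = rng.normal(0.0, 2.0, size=n)[:, None]
treatment_effects = np.array([0.0, 0.5, 1.0])[None, :]
y = block_effects + treatment_effects + rng.normal(0.0, 1.0, size=(n, p))

# Align within block, then rank all N = n * p residuals together.
resid = y - y.mean(axis=1, keepdims=True)
R = rankdata(resid).reshape(n, p)

# Large-sample aligned rank statistic, referred to chi-square with p - 1 df.
col_means = R.mean(axis=0)
grand_mean = R.mean()
row_means = R.mean(axis=1, keepdims=True)
S = (p - 1) * n**2 * np.sum((col_means - grand_mean) ** 2) / np.sum((R - row_means) ** 2)
print(f"S = {S:.2f}, p = {chi2.sf(S, df=p - 1):.4f}")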
6 RELATED METHODS
The aligned rank test relies on analysis of residuals, after adjustment for block effects, to complete a test of the experimental effect. However, residual analysis is not restricted to aligned tests. It is common to examine residuals after fitting a chosen model so as to assess the lack of fit. Often, this examination is visual, using graphical techniques, but Quade (11) has formalized some residual analysis methods with demonstrated equivalence to preexisting nonparametric tests. The weighted rank tests from Quade permit comparisons between regression models, with weighting matrices applied to residuals to provide emphasis on specific aspects of the models.
REFERENCES 1. M. H. Hollander and D. A. Wolfe, Nonparametric Statistical Methods 2nd ed. New York: Wiley, 1999. 2. J. L. Hodges and E. L. Lehmann, Rank methods for combination of independent experiments in analysis of variance. Ann Math Stat 1962; 33: 482–497. 3. M. D. Schluchter, An aligned rank test for censored data from randomized block designs. Biometrika 1985; 72: 609–618.
4
ALIGNED RANK TEST 4. R. F. Fawcett and K. C. Salter, A Monte Carlo study of the F test and three tests based on ranks of treatment effects in randomized block designs. Commun Stat Simul Comput 1984; 13: 213–225. 5. P. K. Sen, On a class of aligned rank order tests in two-way layouts. Ann Math Stat 1968; 39: 1115–1124. 6. M. L. Puri and P. K. Sen, Nonparametric Methods in Multivariate Analysis New York: Wiley, 1971. 7. S. Tardif, On the almost sure convergence of the permutation distribution for aligned rank test statistics in randomized block designs. Ann Stat 1981; 9: 190–193. 8. M. G. Akritas, The rank transform method in some two-factor designs. J Am Stat Assoc 1990; 85: 73–78. 9. K. L. Mehra and J. Sarangi, Asymptotic efficiency of certain rank tests for comparative experiments. Ann Math Stat 1967; 38: 90–107.
10. T. W. O’Gorman, A comparison of the F-test, Friedman’s test, and several aligned rank tests for the analysis of randomized complete blocks. J Agric Biol Environ Stat 2001; 6: 367–378. 11. D. Quade, Regression analysis based on the signs of the residuals. J Am Stat Assoc 1979; 74: 411–417.
CROSS-REFERENCES: Analysis of variance (ANOVA); Balanced design; Blocking; Rank tests
ALLOCATION CONCEALMENT
KENNETH F. SCHULZ, Family Health International, Research Triangle Park, North Carolina

Randomization eliminates bias in the allocation of participants to treatment (intervention) groups. Generating an unpredictable randomized allocation sequence represents the first vital step in randomization (1). Implementing that sequence while concealing it, at least until assignment occurs, a process termed allocation concealment, represents the second step (2, 3). Without adequate allocation concealment, randomization crumples in a trial. Misunderstandings thrive concerning allocation concealment. Some researchers attempting to discuss it digress into flipping coins or computer random number generators. Allocation concealment, however, is distinct from the techniques that generate the sequence. Allocation concealment techniques implement the sequence (3). Still other people confuse allocation concealment with blinding (2–4). However, unmistakable theoretical and practical differences separate the terms. Allocation concealment primarily prevents selection bias and protects a sequence before and until assignment. By contrast, blinding prevents ascertainment bias and protects the sequence after assignment. Furthermore, allocation concealment can, and must, always be successfully implemented in a randomized controlled trial (RCT); but in many trials, blinding cannot be successfully implemented. Proper allocation concealment secures strict implementation of a sequence without foreknowledge of treatment assignments. It safeguards the upcoming assignments from those who admit participants to a trial. The judgment to accept or reject a participant must be made, and informed consent obtained, in ignorance of the upcoming assignment (5).

1 INDICATIONS OF THE IMPORTANCE OF ALLOCATION CONCEALMENT

Trialists recently coined the term "allocation concealment" to describe, as well as highlight the importance of, the process (2, 3). Proper allocation concealment appears to prevent bias. Four empirical investigations have found that trials that used inadequate or unclear allocation concealment, compared with those that used adequate concealment, yielded up to 40% larger estimates of effect, on average (3, 6–8). The poorly executed trials tended to exaggerate treatment effects. One other investigation did not find an overall exaggeration, but suggested that poor quality, including allocation concealment, may distort the trial results in either direction (9). In any case, an updated meta-analysis of the relevant studies provides empirical evidence that inadequate allocation concealment allows bias to seep into trials (10). The bias could fluctuate in either direction. The worst concealed trials yielded greater heterogeneity in results (i.e., the results vacillated extensively above and below the estimates from better trials) (3). Indeed, having a randomized sequence should make little difference without adequate allocation concealment. Even random, unpredictable assignment sequences can be subverted (3, 11, 12). For example, if investigators implemented a truly randomized sequence by merely posting the sequence on a bulletin board, everyone involved with the trial can obtain the upcoming assignments. Similarly, the allocation sequence could be implemented through placing method indicator cards in regular (translucent) envelopes. This inadequate allocation concealment process could be deciphered by simply holding the envelopes to a bright light. With both these inadequate approaches, awareness of
the next assignment could lead to the exclusion of certain patients based on their prognosis. Alternatively, awareness of the next assignment could lead to directing some participants preferentially to groups, which can easily be accomplished by delaying a participant's entry into the trial until the next "preferred" allocation appears. Bias could easily be introduced, despite an adequate randomized sequence (11). Understandably, however, investigators seldom document the sensitive details of subverting allocation schemes. Nevertheless, when investigators responded anonymously, many did relate instances of deciphering (11). The descriptions varied from holding translucent envelopes to bright lights, to holding opaque envelopes to "hot lights" in radiology, to rifling files in the dead of night (11). Most deciphering stemmed from inadequate allocation concealment schemes. The value of randomization paradoxically produces its most vexing implementation problems. Randomization antagonizes humans by frustrating their clinical inclinations (11, 13, 14). Thus, many involved with trials will be tempted to undermine randomization if afforded the opportunity to decipher assignments. To minimize the impact of this human tendency, trialists must devote meticulous attention to concealing allocation schemes. Proper randomization hinges on adequate allocation concealment.

2 ADEQUATE METHODS OF ALLOCATION CONCEALMENT

The following approaches to allocation concealment are generally considered adequate (2, 3, 15): sequentially numbered, opaque, sealed envelopes (SNOSE); pharmacy-controlled; numbered or coded containers; central randomization (e.g., by telephone to a trials office); or other methods whose description contains elements convincing of concealment (e.g., a secure computer-assisted method). In assessing allocation concealment from published reports, readers will be fortunate to find even these minimal standards reasonably met (see Block 1). Investigators should, however, embrace even more rigorous approaches to assure that selection and confounding biases have been averted. Methods based on envelopes are more vulnerable to exploitation through human ingenuity and resourcefulness than most other approaches. Therefore, some trialists consider envelopes a less than ideal method of concealment (16). However, they can be adequate, and they are eminently practical, even in developing world settings. With an envelope method, investigators must diligently develop and monitor the allocation process to preserve concealment. In addition to using SNOSE, investigators must ensure that the envelopes are opened sequentially, and only after the participant's name and other details are written on the appropriate envelope (17). Using pressure-sensitive or carbon paper inside the envelope also transfers the information written on the outside of the envelope to the assigned allocation card inside the envelope, which promotes randomization adherence and creates a valuable audit trail. Cardboard or aluminum foil placed inside the envelope further inhibits detection of assignments via "hot lights" in radiology or the like. Pharmacy-controlled methods of allocation concealment tend to be adequate, but they are also subject to allocation concealment and sequence generation problems. Although reports in which the assignment was made by the pharmacy have generally been classified as having "adequate" allocation concealment (2, 3, 15), conformity with proper randomization by pharmacists is unknown. Their detailed methods should have been reported. The author is aware of instances in which pharmacists have violated assignment schedules (11). For example, one pharmacy depleted the stock of one of the two drugs being compared in a randomized trial. They proceeded to allocate the other drug to all newly enrolled participants to avoid slowing recruitment. Another pharmacy allocated (what they inaccurately called "randomization") by alternate assignment. Investigators should not assume that pharmacists grasp RCT methodology. For pharmacy-controlled allocation, investigators should develop the allocation concealment procedures in concert with the pharmacy, instruct in the details if necessary, and monitor the process.
Table 1. Important Elements in Adequate Allocation Concealment Schemes (general scheme, followed by details on adequate approaches for implementing allocation concealment)

Envelopes: Should be sequentially numbered, opaque, sealed envelopes (SNOSE). Must ensure that the envelopes are opened sequentially only after participant details are written on the envelope. Pressure-sensitive or carbon paper inside the envelope transfers that information to the assignment card inside the envelope (also creates an audit trail). Cardboard or aluminum foil inside the envelope renders the contents of the envelope impermeable to intense light.

Containers: Must ensure that the containers are sequentially numbered or coded to prevent deciphering. Must ensure that containers are tamper-proof, equal in weight, identical in appearance, and equivalent in sound upon shaking.

Pharmacy-controlled: Investigators should develop, or at least validate in concert with the pharmacy, a proper randomization scheme, including allocation concealment. Should instruct the pharmacy in the details of proper allocation concealment, if necessary, and monitor the process.

Central randomization: They must determine the mechanism for contact (e.g., telephone, fax, or e-mail), the rigorous procedures to ensure enrollment before randomization, and the comprehensive training for those individuals staffing the central randomization office.

Others, such as automated electronic assignment systems: Investigators may use other systems for allocation concealment. Any such system must safeguard allocations until enrollment is assured and assignment confirmed. A computer-based system must be impenetrable for it to be adequate. Simple computer-based systems that merely store assignments or naively protect assignments may prove as transparent as affixing an allocation list to a bulletin board.
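To make the central randomization and automated assignment rows of Table 1 concrete, the following sketch shows, in schematic form, the core logic of a hypothetical assignment service: the permuted-block allocation list is held only inside the service, and an assignment is disclosed only after eligibility, consent, and registration are recorded. All names here are illustrative and do not refer to any real system.

import random

class CentralRandomizationService:
    # Illustrative, hypothetical allocation service (not a real product).
    # The pre-generated permuted-block list never leaves the service, and an
    # assignment is revealed only after the participant is irrevocably enrolled.

    def __init__(self, arms=("A", "B"), block_size=4, n_blocks=50, seed=2024):
        rng = random.Random(seed)
        schedule = []
        for _ in range(n_blocks):
            block = list(arms) * (block_size // len(arms))
            rng.shuffle(block)
            schedule.extend(block)
        self._schedule = schedule      # concealed allocation list
        self._next = 0
        self.audit_log = []            # audit trail of disclosed assignments

    def enroll(self, participant_id, eligible, consented):
        if not (eligible and consented):
            raise ValueError("Confirm eligibility and consent before randomization.")
        arm = self._schedule[self._next]
        self.audit_log.append((self._next + 1, participant_id, arm))
        self._next += 1
        return arm                     # revealed only now, after registration

service = CentralRandomizationService()
print(service.enroll("site01-0001", eligible=True, consented=True))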
Proper use of sequentially numbered containers prevents foreknowledge of treatment assignment, but only if investigators take proper precautions. They must assure that all of the containers are tamper-proof, equal in weight, identical in appearance, and equivalent in sound upon shaking. It is also encouraged that some audit trail be established, such as writing participant identifiers on the empty containers. Although central randomization persists as an excellent allocation concealment approach, indeed perhaps the "gold standard," investigators must establish effective trial procedures and monitor adherence in their execution. At a minimum, they determine the mechanism for contact (e.g., telephone, fax, or e-mail), the rigorous procedures to ensure enrollment before randomization, and the comprehensive training for those individuals staffing the central randomization office. Other methods may suffice for adequate allocation concealment. For example, an investigator may use a secure computer-assisted method to safeguard assignments until enrollment is assured and confirmed. Undoubtedly, automated electronic assignment systems will become more common (18, 19). However, simple computer-based systems that merely store assignments or naively protect assignments may prove as transparent as affixing an allocation list to a bulletin board. A computer-based system must be impenetrable for it to be adequate. Finally, in their trial report, investigators must fully describe the details of their allocation concealment approach (11, 20). Many reports, however, neglect to provide even the bare essentials (21). Only 4% of trials in a recent review of 2000 trials in schizophrenia described a method of concealment (22). Another paper found that over 20% of trials in head injury described an allocation concealment method, but the reports were more likely to describe an inadequate method than an adequate method (23), which hampers attempts by readers to assess RCTs. Fortunately, the situation is improving with more medical journals adopting CONSORT reporting standards for RCTs (20, 24, 25). With transparent reporting, more investigators will have to design and conduct methodologically sound trials.

3 BLOCK 1: EXAMPLES OF DESCRIPTIONS OF ALLOCATION CONCEALMENT
". . . concealed in sequentially numbered, sealed, opaque envelopes, and kept by the hospital pharmacist of the two centres" (26).

"Treatments were centrally assigned on telephone verification of the correctness of inclusion criteria . . ." (27).

"Glenfield Hospital Pharmacy Department did the randomisation, distributed the study agents, and held the trial codes, which was disclosed after the study" (28).

"The various placebo and treatment blocks were then issued with a medication number and assigned to consecutive patients in a sequential order. Two copies of the randomisation list were prepared: one was used by the packaging department, . . . supplied in blister packs containing 20 capsules for morning and evening administration over 10 days. These blister packs were supplied in labeled boxes–ie, one box for each patient and each dose" (29).
REFERENCES

1. K. F. Schulz and D. A. Grimes, Generating allocation sequences in randomised trials: chance, not choice. Lancet 2002; 359: 515–519.
2. K. F. Schulz, I. Chalmers, D. A. Grimes, and D. G. Altman, Assessing the quality of randomization from reports of controlled trials published in obstetrics and gynecology journals. JAMA 1994; 272(2): 125–128.
3. K. F. Schulz, I. Chalmers, R. J. Hayes, and D. G. Altman, Empirical evidence of bias. Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995; 273(5): 408–412.
4. K. F. Schulz, I. Chalmers, and D. G. Altman, The landscape and lexicon of blinding in randomized trials. Ann. Intern. Med. 2002; 136(3): 254–259.
5. T. C. Chalmers, H. Levin, H. S. Sacks, D. Reitman, J. Berrier, and R. Nagalingam, Meta-analysis of clinical trials as a scientific discipline. I: control of bias and comparison with large co-operative trials. Stat. Med. 1987; 6(3): 315–328.
6. D. Moher et al., Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet 1998; 352(9128): 609–613.
7. L. Kjaergard, J. Villumsen, and C. Gluud, Quality of randomised clinical trials affects estimates of intervention. Abstracts for Workshops and Scientific Sessions, 7th International Cochrane Colloquium. Rome, Italy, 1999.
8. P. Jüni, D. G. Altman, and M. Egger, Systematic reviews in health care: assessing the quality of controlled clinical trials. BMJ 2001; 323(7303): 42–46.
9. E. M. Balk, P. A. Bonis, H. Moskowitz, C. H. Schmid, J. P. Ioannidis, C. Wang, and J. Lau, Correlation of quality measures with estimates of treatment effect in meta-analyses of randomized controlled trials. JAMA 2002; 287(22): 2973–2982.
10. P. Jüni and M. Egger, Allocation concealment in clinical trials. JAMA 2002; 288(19): 2407–2408; author reply 2408–2409.
11. K. F. Schulz, Subverting randomization in controlled trials. JAMA 1995; 274(18): 1456–1458.
12. S. Pocock, Statistical aspects of clinical trial design. Statistician 1982; 31: 1–18.
13. K. F. Schulz, Unbiased research and the human spirit: the challenges of randomized controlled trials. CMAJ 1995; 153(6): 783–786.
14. K. F. Schulz, Randomised trials, human nature, and reporting guidelines. Lancet 1996; 348(9027): 596–598.
15. D. G. Altman and C. J. Doré, Randomisation and baseline comparisons in clinical trials. Lancet 1990; 335(8682): 149–153.
16. C. L. Meinert, Clinical Trials: Design, Conduct, and Analysis. New York: Oxford University Press, 1986.
17. C. Bulpitt, Randomised Controlled Clinical Trials. The Hague, The Netherlands: Martinus Nijhoff, 1983.
18. K. Dorman, G. R. Saade, H. Smith, and K. J. Moise Jr., Use of the World Wide Web in research: randomization in a multicenter clinical trial of treatment for twin-twin transfusion syndrome. Obstet. Gynecol. 2000; 96(4): 636–639.
19. U. Haag, Technologies for automating randomized treatment assignment in clinical trials. Drug Inform. J. 1998; 118: 7–11.
20. D. Moher, K. F. Schulz, and D. Altman, The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group trials. Lancet 2001; 357: 1191–1194.
21. K. F. Schulz and D. A. Grimes, Allocation concealment in randomised trials: defending against deciphering. Lancet 2002; 359(9306): 614–618.
22. B. Thornley and C. Adams, Content and quality of 2000 controlled trials in schizophrenia over 50 years. BMJ 1998; 317(7167): 1181–1184.
23. K. Dickinson, F. Bunn, R. Wentz, P. Edwards, and I. Roberts, Size and quality of randomised controlled trials in head injury: review of published studies. BMJ 2000; 320(7245): 1308–1311.
24. D. G. Altman et al., The revised CONSORT statement for reporting randomized trials: explanation and elaboration. Ann. Intern. Med. 2001; 134(8): 663–694.
25. D. Moher, A. Jones, and L. Lepage, Use of the CONSORT statement and quality of reports of randomized trials: a comparative before-and-after evaluation. JAMA 2001; 285(15): 1992–1995.
26. T. Smilde, S. van Wissen, H. Wollersheim, M. Trip, J. Kastelein, and A. Stalenhoef, Effect of aggressive versus conventional lipid lowering on atherosclerosis progression in familial hypercholesterolaemia (ASAP): a prospective, randomised, double-blind trial. Lancet 2001; 357: 577–581.
27. Collaborative Group of the Primary Prevention Project (PPP), Low-dose aspirin and vitamin E in people at cardiovascular risk: a randomised trial in general practice. Collaborative Group of the Primary Prevention Project. Lancet 2001; 357(9250): 89–95.
28. C. E. Brightling et al., Sputum eosinophilia and short-term response to prednisolone in chronic obstructive pulmonary disease: a randomised controlled trial. Lancet 2000; 356(9240): 1480–1485.
29. I. McKeith, T. Del Ser, P. Spano et al., Efficacy of rivastigmine in dementia with Lewy bodies: a randomised, double-blind, placebo-controlled international study. Lancet 2000; 356(9247): 2031–2036.
ALPHA-SPENDING FUNCTION
DAVID L. DEMETS, University of Wisconsin-Madison, Madison, Wisconsin
K. K. GORDON LAN, Johnson & Johnson, Raritan, New Jersey

The randomized control clinical trial (RCT) is the standard method for the definitive evaluation of the benefits and risks of drugs, biologics, devices, procedures, diagnostic tests, and any intervention strategy. Good statistical principles are critical in the design and analysis of these RCTs (1,2). RCTs also depend on interim analysis of accumulating data to monitor for early evidence of benefit, harm, or futility. This interim analysis principle was established early in the history of RCTs (3) and was implemented in early trials such as the Coronary Drug Project (4,5). Evaluation of the interim analysis may require the advice of an independent data monitoring committee (DMC) (6,7), including certain trials under regulatory review (8,9). However, although ethically and scientifically compelling, interim repeated analysis of accumulating data has the statistical consequence of increased false positive claims unless special steps are taken. The issue of sequential analysis has a long tradition (10,11) and has received special attention for clinical trials (12,13). In particular, increasing the frequency of interim analysis can substantially increase the Type I error if the same criteria are used for each interim analysis (13). This increase was demonstrated in the Coronary Drug Project, which used sequential analysis for monitoring several treatment arms compared with a placebo (4). Most of the classical sequential methods assumed continuous analysis of accumulating data, a practice not realistic for most RCTs. Rather than continuous monitoring, most clinical trials review accumulating data periodically after additional data have been collected. Assume that the procedure is a one-sided test, but the process can be easily generalized to introduce two one-sided symmetric or asymmetric boundaries. In general, a test statistic Z(j), j = 1, 2, 3, . . . , J, is computed at each successive interim analysis. In the large sample case and under the null hypothesis, the Z(j)s are standard N(0,1). At each analysis, the test statistic is compared with a critical value Zc(j). The trial would continue as long as the test statistic does not exceed the critical value. That is, continue the trial as long as

Z(j) < Zc(j) for j = 1, 2, 3, . . . , J − 1.

Otherwise, the trial might be considered for termination. We would fail to reject the null hypothesis if Z(j) < Zc(j) for all j (j = 1, 2, . . . , J). We would reject the null hypothesis if, at any interim analysis,

Z(j) ≥ Zc(j) for j = 1, 2, . . . , J.

Peto et al. (14) recommended using a very conservative critical value for each interim analysis, say a standardized value of Zc(j) = 3.0 for all j (j = 1, 2, . . . , J), such that the impact on the overall Type I error would be minimal. In 1977, Pocock (15) published a paper based on the earlier work of Armitage and colleagues (13) that formally introduced the idea of a group sequential approach. This modification developed a more conservative critical value than the naïve one (e.g., 1.96 for a one-sided Type I error of 0.025) to be used at each analysis such that the overall Type I error was controlled. For example, if a total of 5 interim analyses were to be conducted with an intended Type I error of 0.025, then Zc(j) = 2.413 would be used at each interim analysis (j = 1, 2, . . . , 5). Note that this final critical value is much larger than the standard critical value. In 1979, O'Brien and Fleming (16) introduced an alternative group sequential boundary for evaluating interim analyses. In this approach, the critical values change with each interim analysis, starting with a very conservative (i.e., large) value and shrinking to a final value close to the nominal critical value at the scheduled completion. The exact form for each critical value is Zc(j) = ZOBF(J)√(J/j). In this case, for the same 5 interim analyses and an overall Type I error of 0.025, the ZOBF(5) value is 2.04, which
makes the 5 critical values 2.04√(5/j) for j = 1, 2, . . . , 5, or (4.56, 3.23, 2.63, 2.28, and 2.04). Both of these latter models assume an equal increment in information between analyses and that the number of interim analyses J is fixed in advance. These three group sequential boundaries have been widely used, and examples are shown in Fig. 1. In fact, the OBF group sequential method was used in the Beta-Blocker Heart Attack Trial (BHAT) (17), which terminated early because of an overwhelming treatment benefit for mortality.

[Figure 1. Group sequential boundaries (standardized statistic versus information fraction), showing O'Brien-Fleming and truncated O'Brien-Fleming boundaries with "Reject H0" and "Accept H0" regions: upper boundary values corresponding to the α1(t*) spending function for α = 0.05 at information fractions t* = 0.25, 0.50, 0.75, and 1.0, and for a truncated version at a critical value of 3.0.]

In 1987, Wang and Tsiatis generalized the idea of Pocock and O'Brien-Fleming and introduced a family of group sequential boundaries. For given α, J, and a shape parameter φ, a constant C is chosen so that the probability that Z(j) ≥ C(J/j)^φ for some j = 1, 2, . . . , J is equal to α. The choice of φ = 0.5 yields the OBF boundary, and φ = 0 yields the Pocock boundary.

1 ALPHA SPENDING FUNCTION MOTIVATION

The BHAT trial was an important factor in the motivation for the alpha spending function approach to group sequential monitoring. BHAT was a cardiovascular trial that evaluated a beta-blocker class drug to reduce mortality following a heart attack (17). An independent DMC reviewed the data periodically, using the OBF group sequential boundaries as a guide. A beneficial mortality trend emerged early in the trial and continued to enlarge with subsequent evaluations. At the sixth of a planned seven interim analyses, the logrank test statistic crossed the OBF boundary. After careful examination of all aspects, the DMC recommended that the trial be terminated, approximately 1 year earlier than planned. However, although the OBF sequential boundaries were used, the assumptions of these models were not strictly met. The increment in the number of deaths between DMC meetings was not equal. Furthermore, additional interim analyses were contemplated although not done. This experience suggested the need for more flexible sequential methods for evaluating interim results. Neither the number nor the timing of interim analyses can be guaranteed in advance. A DMC may need to add additional interim analyses as trends that suggest benefit or harm emerge. As described by Ellenberg, Fleming, and DeMets (7), many factors must be considered before recommendations for early termination are made, and an additional interim analysis may be necessary to confirm or more fully evaluate these issues. Thus, the need for a flexible group sequential method seemed compelling.

Table 1. Comparison of boundaries using spending functions with Pocock (P) and O'Brien-Fleming (OBF) methods (α = 0.05; t* = 0.2, 0.4, 0.6, 0.8, and 1.0)

t*     α1(t*)    OBF     α2(t*)    P
0.2    4.90      4.56    2.44      2.41
0.4    3.35      3.23    2.43      2.41
0.6    2.68      2.63    2.41      2.41
0.8    2.29      2.28    2.40      2.41
1.0    2.03      2.04    2.36      2.41
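A short calculation reproduces the classical boundary values quoted above for five equally spaced analyses; the constants 2.413 and 2.04 are taken from the text rather than re-derived, and the resulting O'Brien-Fleming values can be compared with the OBF column of Table 1.

import numpy as np

J = 5
z_pocock = 2.413   # Pocock constant for J = 5 analyses (from the text)
z_obf = 2.04       # O'Brien-Fleming final critical value for J = 5 (from the text)

pocock_bounds = [z_pocock] * J
obf_bounds = [z_obf * np.sqrt(J / j) for j in range(1, J + 1)]

print("Pocock:", [f"{b:.2f}" for b in pocock_bounds])
print("OBF   :", [f"{b:.2f}" for b in obf_bounds])   # 4.56, 3.23, 2.63, 2.28, 2.04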
2 THE ALPHA SPENDING FUNCTION
The initial (or classical) group sequential boundaries are formed by choosing boundary values such that the sum of the probabilities of exceeding those critical values during the interim analyses is exactly the specified alpha level set in the trial design, assuming the null hypothesis of no intervention effect. That is, the total available alpha is allocated or "spent" over the prespecified times of interim analyses. The alpha spending function proposed by Lan and DeMets (18) allocated the alpha level over the interim analyses by a continuous function, α(t), where t is the information fraction, 0 ≤ t ≤ 1. Here t could be the fraction of target patients recruited (n/N) or the fraction of targeted deaths observed (d/D) at the time of the interim analysis. In general, if the total information for the trial design is I, then at the j-th analysis the information fraction is t_j = I_j/I. The total expected information I should have been determined by the trial design if properly done. The function α(t) is defined such that α(0) = 0 and α(1) = α. Boundary values Zc(j), which correspond to the α-spending function α(t), can be determined successively so that

P0{Z(1) ≥ Zc(1), or Z(2) ≥ Zc(2), . . . , or Z(j) ≥ Zc(j)} = α(t_j),    (1)
where {Z(1), . . . , Z(j)} represent the test statistics from the interim analyses 1, . . . , j. The specification of α(t) will create a boundary of critical values for interim test statistics, and we can specify functions that approximate the O'Brien-Fleming or Pocock boundaries as follows:

α1(t) = 2 − 2Φ(z_{α/2}/√t)    (O'Brien-Fleming type)

α2(t) = α ln(1 + (e − 1)t)    (Pocock type)

where Φ denotes the standard normal cumulative distribution function. The shape of the alpha spending function is shown in Fig. 2 for both of these boundaries.

[Figure 2. Comparison of spending functions α1(t*), α2(t*), and α3(t*) at information fractions t* = 0.2, 0.4, 0.6, 0.8, and 1.0; cumulative alpha spent is plotted against the information fraction.]

Two other families of spending functions that have been proposed (19, 20) are

α(θ, t) = α t^θ for θ > 0

α(γ, t) = α[(1 − e^{−γt})/(1 − e^{−γ})] for γ ≠ 0.

The increment α(t_j) − α(t_{j−1}) represents the additional amount of alpha, or Type I error probability, that can be used at the j-th analysis. In general, to solve for the boundary values Zc(j), we need to obtain the multivariate distribution of Z(1), Z(2), . . . , Z(J). In the cases to be discussed, the distribution is asymptotically multivariate normal with covariance structure Σ = (σ_jk), where

σ_jk = cov(Z(j), Z(k)) = √(t_j/t_k) = √(i_j/i_k) for j ≤ k,

where i_j and i_k are the amounts of information available at the j-th and k-th data monitoring, respectively. Note that at the j-th data monitoring, i_j and i_k are observable and σ_jk is known even if I (the total information) is unknown. However, if I is not known during interim analysis, we must estimate I by Î and t_j by t̂_j = I_j/Î so that we can estimate α(t_j) by α(t̂_j). If these increments have an independent distributional structure, which is often the case, then derivation of the values of the Zc(j) from the chosen form of α(t) is relatively straightforward using Equation (1) and the methods of Armitage et al. (21, 22). If the sequentially computed statistics do not have an independent increment structure, then the derivation of the Zc(j) involves a more complicated numerical integration and sometimes is estimated by simulation. However, as discussed later, for the most frequently used test statistics, the independent increment structure holds.
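As an illustration of how boundary values satisfying equation (1) might be obtained for a given spending function, the sketch below works with the O'Brien-Fleming-type function α1(t) at a one-sided α = 0.025 (one side of the two-sided α = 0.05 setting of Table 1). It exploits the independent increment structure of the score process and solves for each Zc(j) by bisection over a numerically integrated sub-density. This is only a schematic approximation; validated software, such as the Lan-DeMets program noted below, should be used in practice.

import numpy as np
from scipy.stats import norm

def obf_spend(t, alpha=0.025):
    # O'Brien-Fleming-type spending function: alpha1(t) = 2 - 2*Phi(z_{alpha/2}/sqrt(t)).
    return 2.0 - 2.0 * norm.cdf(norm.ppf(1.0 - alpha / 2.0) / np.sqrt(t))

def trap(f, x):
    # trapezoidal integration on a grid
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x)))

def spending_boundaries(t, spend, alpha=0.025, grid=2000, lower=-8.0):
    # Work with the score process S_j = sqrt(t_j) * Z(j), which has independent
    # N(0, t_j - t_{j-1}) increments under the null hypothesis.
    t = np.asarray(t, dtype=float)
    pi = np.diff(np.concatenate(([0.0], spend(t, alpha))))   # alpha spent at each look

    bounds = [norm.ppf(1.0 - pi[0])]                         # first look: Z(1) ~ N(0, 1)
    s = np.linspace(lower, bounds[0] * np.sqrt(t[0]), grid)  # continuation region for S_1
    g = norm.pdf(s, scale=np.sqrt(t[0]))                     # sub-density of surviving paths

    for j in range(1, len(t)):
        dt = t[j] - t[j - 1]

        def cross_prob(c):
            # P(no earlier crossing and S_j >= c * sqrt(t_j))
            return trap(g * norm.sf(c * np.sqrt(t[j]) - s, scale=np.sqrt(dt)), s)

        lo, hi = 0.0, 10.0
        for _ in range(60):                                  # bisection for the boundary
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if cross_prob(mid) > pi[j] else (lo, mid)
        bounds.append(0.5 * (lo + hi))

        # Propagate the sub-density onto the new continuation region S_j < Zc(j)*sqrt(t_j).
        s_new = np.linspace(lower, bounds[-1] * np.sqrt(t[j]), grid)
        g = np.array([trap(g * norm.pdf(x - s, scale=np.sqrt(dt)), s) for x in s_new])
        s = s_new

    return np.round(bounds, 2)

print(spending_boundaries([0.2, 0.4, 0.6, 0.8, 1.0], obf_spend))
# approximately [4.88, 3.36, 2.68, 2.29, 2.03], close to the alpha_1(t*) column of Table 1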
This formulation of the alpha spending function provides two key flexible features. Neither the timing nor the total number of interim analyses has to be fixed in advance. The critical boundary value at the j-th analysis depends only on the information fraction t_j, the previous j − 1 information fractions t_1, t_2, . . . , t_{j−1}, and the specific spending function being used. However, once an alpha spending function has been chosen before the initiation of the trial, that spending function must be used for the duration of the trial. A DMC can change the frequency of the interim analyses as trends emerge without appreciably affecting the overall α level (23, 24). Thus, it is difficult to abuse the flexibility of this approach. The timing and spacing of interim analyses using the alpha spending function approach have been examined (19, 25–27). For most trials, two early analyses with less than 50% of the information fraction are adequate. An early analysis, say at 10%, is often useful to make sure that all of the operational and monitoring procedures are in order. In rare cases, such early interim reviews can identify unexpected harm, as in the Cardiac Arrhythmia Suppression Trial (28), which terminated early for increased mortality at 10% of the information fraction using an alpha spending function. A second early analysis at 40% or 50% of the information fraction can also identify strong, convincing trends of benefit, as in two trials that evaluated beta blocker drugs in chronic heart failure (29, 30). Both trials terminated early, at approximately 50% of the information fraction, with mortality benefits. Computation of the alpha spending function can be facilitated by available software on the web (www.biostat.wisc.edu/landemets) or by commercial software packages (www.cytel.com/Products/East/default.asp).

3 APPLICATION OF THE ALPHA SPENDING FUNCTION

Initial development of group sequential boundaries was for comparison of proportions or means (15, 16, 26). In these cases, the increments in information are represented by additional groups of subjects and their responses to the intervention. For comparing means or proportions, the information fraction t can be estimated by n/N, the observed sample size divided by the expected sample size. However, later work expanded the use to other common statistical procedures. Tsiatis and colleagues (31, 32) demonstrated that sequential logrank test statistics and the general class of rank statistics used in censored survival data have the independent increment structure that makes the application to group sequential boundaries straightforward. Later, Kim and Tsiatis (33) demonstrated that the alpha spending function approach for sequential logrank tests is also appropriate. In this case, the information fraction is approximated by d/D, the number of observed events or deaths divided by the expected (design) number of events or deaths (34). Application of the alpha spending function for logrank tests has been used in several clinical trials (e.g., 28–30). Group sequential procedures, including the alpha spending function, have also been applied to longitudinal studies using a linear random effects model (35, 36). Longitudinal studies have also been evaluated using generalized estimating equations (37). In a typical longitudinal clinical trial, subjects are added over time, and more observations are gathered for each subject during the course of the trial. One commonly used statistic evaluates the rate of change by essentially computing the slope of the observations for each subject and then taking a weighted average of these slopes over the subjects in each intervention arm. The sequential test statistics for comparison of slopes using the alpha spending function must take into account their distribution. If the information fraction is defined in terms of the Fisher information (i.e., the inverse of the variance of the slope estimates), then the increments in the test statistic are independent, and the alpha spending function can be applied directly (38). The total expected information may not be known exactly, but it often can be estimated. Wu and Lan (36) provide other approaches to estimate the information fraction in this setting. Scharfstein and Tsiatis (39) demonstrated that any class of test statistics that satisfies specific likelihood function criteria will have this property and thus can be used directly in this group sequential setting.
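As a small illustration of the information fraction in the three settings just described, with all numbers hypothetical:

# Hypothetical examples of the information fraction t at an interim analysis.
n, N = 240, 600                       # means/proportions: patients observed vs. planned
d, D = 95, 380                        # logrank: observed vs. design number of events
var_now, var_final = 0.020, 0.008     # slopes: Fisher information = 1 / variance

t_means = n / N                               # 0.40
t_logrank = d / D                             # 0.25
t_slopes = (1 / var_now) / (1 / var_final)    # 0.40

print(t_means, t_logrank, round(t_slopes, 2))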
4 CONFIDENCE INTERVALS AND ESTIMATION

Confidence intervals for an unknown parameter θ following early stopping can be computed by using the same ordering of the sample space described by Tsiatis et al. (32) and by using a process developed by Kim and DeMets (25,40) for the alpha spending function procedures. The method can be briefly summarized as follows: a 1 − γ lower confidence limit is the smallest value of θ for which an event at least as extreme as the one observed has a probability of at least γ. A similar statement can be made for the upper limit. For example, if the Z-value first crosses the boundary at tj, with observed Z*(j) ≥ Zc(j), then the upper and lower confidence limits θU and θL are

θU = sup{θ : Pθ[Z(1) ≥ Zc(1), or ..., or Z(j−1) ≥ Zc(j−1), or Z(j) ≥ Z*(j)] ≤ 1 − γ}
θL = inf{θ : Pθ[Z(1) ≥ Zc(1), or ..., or Z(j−1) ≥ Zc(j−1), or Z(j) ≥ Z*(j)] ≥ γ}

Confidence intervals obtained by this process will have coverage closer to 1 − γ than naïve confidence intervals of the form θ̂ ± Zγ/2 SE(θ̂). As an alternative to computing confidence intervals after early termination, Jennison and Turnbull (41) have advocated the calculation of repeated confidence intervals. This calculation is achieved by inverting a sequential test to obtain the appropriate coefficient Z*α/2 in the general form of the confidence interval, θ̂ ± Z*α/2 SE(θ̂). This inversion can be achieved when the sequential test is based on an alpha spending function. If we compute the interim analyses at the tj, obtaining corresponding critical values Zc(j), then the repeated confidence intervals are of the form θ̂j ± Zc(j) SE(θ̂j), where θ̂j is the estimate of the parameter θ at the j-th analysis.
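The repeated confidence interval form θ̂j ± Zc(j) SE(θ̂j) is simple to evaluate once a monitoring boundary is in hand. The sketch below illustrates the arithmetic using the O'Brien–Fleming-type boundary values quoted in the design example later in this article; the estimates and standard errors are hypothetical, not drawn from any particular trial.

```python
# Repeated confidence intervals theta_hat_j +/- Z_c(j) * SE(theta_hat_j).
# Boundary values are the O'Brien-Fleming-type values used in the design example
# later in this article; the estimates and standard errors are hypothetical.
boundaries = [4.56, 3.23, 2.63, 2.28, 2.04]   # Z_c(j) at t = 0.2, 0.4, 0.6, 0.8, 1.0
estimates  = [0.35, 0.30, 0.28, 0.26, 0.25]   # hypothetical theta_hat_j
std_errors = [0.20, 0.14, 0.11, 0.10, 0.09]   # hypothetical SE(theta_hat_j)

for j, (zc, est, se) in enumerate(zip(boundaries, estimates, std_errors), start=1):
    print(f"analysis {j}: repeated CI = ({est - zc * se:.3f}, {est + zc * se:.3f})")
# The intervals are very wide at the early looks and approach a conventional
# interval width only at the final analysis.
```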
Methodology has also been developed to obtain adjusted estimates of the intervention effect (42–47). Clinical trials that terminate early are prone to exaggerate the magnitude of the intervention effect, and these methods shrink the observed estimate closer to the null. The size of the adjustment may depend on the specific sequential boundary employed. Conservative boundaries, such as those proposed by Peto or by O'Brien and Fleming, generally require less adjustment, and the naïve point estimate and confidence intervals may be quite adequate. Another issue is the relevance of the estimate to clinical practice. The population sample studied is usually not a representative sample of current and future practice: subjects were those who passed all of the inclusion and exclusion criteria and volunteered to participate, and early subjects may differ from later subjects as experience is gained with the intervention. Thus, the intervention effect estimate may represent populations like the one studied (the only solid inference) but may be less relevant to how the intervention will eventually be used. For this reason, complex adjustments may not be as useful.
5 TRIAL DESIGN
If a trial is planning to have interim analyses for monitoring for benefit or harm, then that plan must be taken into account in the design. The reason is that group sequential methods affect the final critical value, and thus the power, depending on which boundary is used. For the alpha spending function approach, the specific alpha spending function must be chosen in advance. In addition, for planning purposes, the anticipated number of interim analyses must be estimated. This number does not have to be adhered to in the application, but it is necessary for the design; departures from this number in the application will not practically affect the power of the trial. Thus, the design strategy for the alpha spending function is similar to the strategy described by Pocock for the initial group sequential methods (15). The key factor when the sample size is computed is to take into consideration the critical value at the last analysis, when the information fraction is 1.0. One simple approach is to use this new critical value in the standard sample size formula. This estimate will reasonably approximate the more exact approach described below.
To illustrate, consider a trial that compares failure rates between successive groups of subjects. Here,

H0: pC − pT = 0
HA: pC − pT = δ ≠ 0

where pC and pT denote the unknown response rates in the control and new-treatment groups, respectively. We estimate the unknown parameters by p̂C and p̂T, the observed event rates in our trial. For a reasonably large sample size, we often use the test statistic

Z = (p̂C − p̂T) / √[ p̂(1 − p̂)(1/mC + 1/mT) ]

to compare event rates, where p̂ is the combined event rate across treatment groups. For sufficiently large n, where n = mC = mT, this statistic has an approximately normal distribution with unit variance and mean Δ, where Δ = 0 under the null hypothesis H0. In this case, assuming an equal sample size n per arm in each sequential group,

Δ = √n (pC − pT) / √[2p(1 − p)] = √n δ / √[2p(1 − p)], where p = (pC + pT)/2.

It follows that

n = 2Δ²p(1 − p) / δ²
To design the study, we evaluate the previous equation for n, the sample size per treatment arm per sequential group. Because the plan is to have J groups, each of size 2n, the total sample size 2N equals 2nJ. Now, to obtain the sample size in the context of the alpha spending function, we proceed as follows:

1. For planning purposes, estimate the number of planned interim analyses J at equally spaced increments of information (i.e., 2n subjects per increment). It is also possible to specify unequal increments, but equal spacing is sufficient for design purposes.
2. Obtain the boundary values for the J interim analyses under the null hypothesis H0 that achieve a prespecified overall alpha level α for the specific spending function α(t).
3. For the boundary obtained, obtain the value of the drift parameter Δ that achieves the desired power 1 − β.
4. Determine the value of n, which determines the total sample size 2N = 2nJ.
5. Having computed these design parameters, conduct the trial with interim analyses based on the information fraction tj, approximated by tj = (number of subjects observed at the j-th analysis)/2N (38).

The number of actual interim analyses may not equal J, but the alpha level and the power will be affected only slightly (26). As a specific example, consider using an O'Brien–Fleming-type alpha spending function α1(t) with a one-sided 0.025 alpha level and 0.90 power, at equally spaced increments t = 0.2, 0.4, 0.6, 0.8, and 1.0. Using previous publications (16) or available computer software, we obtain the boundary values 4.56, 3.23, 2.63, 2.28, and 2.04. Using these boundary values and available software, we find that Δ = 1.28 provides the desired power of 0.90. If we specify pC = 0.6 and pT = 0.4 (so that p = 0.5) under the alternative hypothesis, then we can obtain the sample size as follows. For Δ = 1.28,

n = 2(1.28)²(0.5)(0.5)/(0.2)² = 20.5,

and we have a total sample size of 2(21)(5) = 210 subjects. We can then proceed to conduct interim analyses at information fractions tj equal to the observed number of subjects divided by 210. Similar formulations can be developed for the comparison of means, repeated measures, and survival analysis (48). However, for most applications, the standard sample size formulas with the new alpha spending function final critical value will be a very good approximation.
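The arithmetic of this worked example is easily verified. In the sketch below, the drift value Δ = 1.28 and the boundary values are taken as given from group sequential software, exactly as in the text.

```python
# Sketch of the sample size arithmetic in the example above; the drift value
# Delta = 1.28 (for 90% power with the O'Brien-Fleming-type boundary) is taken
# as given from group sequential software.
from math import ceil

J = 5                       # planned number of equally spaced analyses
p_C, p_T = 0.6, 0.4         # assumed control and new-treatment event rates
delta = p_C - p_T           # treatment difference under the alternative
p_bar = (p_C + p_T) / 2.0   # common event rate
drift = 1.28                # Delta from the boundary/power calculation

n_stage = 2.0 * drift**2 * p_bar * (1.0 - p_bar) / delta**2   # n per arm per stage
total = 2 * ceil(n_stage) * J                                 # 2N = 2nJ

print(f"n per arm per sequential group = {n_stage:.1f}")      # about 20.5 -> 21
print(f"total sample size 2N = {total}")                      # 2 * 21 * 5 = 210
```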
6 CONCLUSIONS

The alpha spending function approach to group sequential interim analysis has provided the flexibility necessary for data monitoring committees to fulfill their task. DMCs can adjust their analysis schedule as data accumulate and trends emerge. As long as the alpha spending function is specified in advance, there is little room for abuse. Many trials sponsored by industry and government have successfully used this approach. Although the decision to terminate any trial early, for benefit or harm, is a very complex decision process, the alpha spending function can be an important factor in that process.
REFERENCES

1. L. Friedman, C. Furberg, and D. DeMets, Fundamentals of Clinical Trials. Littleton, MA: John Wright–PSG Inc., 1981.
2. S. J. Pocock, Clinical Trials: A Practical Approach. New York: Wiley, 1983.
3. Heart Special Project Committee. Organization, review and administration of cooperative studies (Greenberg Report): a report from the Heart Special Project Committee to the National Advisory Council, May 1967. Control. Clin. Trials 1988; 9: 137–148.
4. P. L. Canner, Monitoring treatment differences in long-term clinical trials. Biometrics 1977; 33: 603–615.
5. Coronary Drug Project Research Group. Practical aspects of decision making in clinical trials: The Coronary Drug Project as a case study. Control. Clin. Trials 1982; 9: 137–148.
6. D. L. DeMets, L. Friedman, and C. D. Furberg, Data Monitoring in Clinical Trials: A Case Studies Approach. New York: Springer Science + Business Media, 2005.
7. S. Ellenberg, T. Fleming, and D. DeMets, Data Monitoring Committees in Clinical Trials: A Practical Perspective. West Sussex, UK: John Wiley & Sons, Ltd., 2002.
8. ICH Expert Working Group: International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline: Statistical principles for clinical trials. Stats. Med. 1999; 18: 1905–1942.
9. U.S. Department of Health and Human Services, Food and Drug Administration. Docket No. 01D–0489. Guidance for Clinical Trial Sponsors on the Establishment and Operations of Clinical Trial Data Monitoring Committees. Federal Register 66: 58151–58153, 2001. Available: http://www.fda.gov/OHRMS/DOCKETS/98fr/112001b.pdf.
10. F. J. Anscombe, Sequential medical trials. Journal of the American Statistical Association 1963; 58: 365–383.
11. I. Bross, Sequential medical plans. Biometrics 1952; 8: 188–205.
12. P. Armitage, Sequential Medical Trials, 2nd ed. New York: John Wiley and Sons, 1975.
13. P. Armitage, C. K. McPherson, and B. C. Rowe, Repeated significance tests on accumulating data. J. Royal Stat. Soc. Series A 1969; 132: 235–244.
14. R. Peto, M. C. Pike, P. Armitage, et al., Design and analysis of randomized clinical trials requiring prolonged observations of each patient. I. Introduction and design. Br. J. Cancer 1976; 34: 585–612.
15. S. J. Pocock, Group sequential methods in the design and analysis of clinical trials. Biometrika 1977; 64: 191–199.
16. P. C. O'Brien and T. R. Fleming, A multiple testing procedure for clinical trials. Biometrics 1979; 35: 549–556.
17. Beta-Blocker Heart Attack Trial Research Group. A randomized trial of propranolol in patients with acute myocardial infarction. I. Mortality results. J. Amer. Med. Assoc. 1982; 247: 1707–1714.
18. K. K. G. Lan and D. L. DeMets, Discrete sequential boundaries for clinical trials. Biometrika 1983; 70: 659–663.
19. K. Kim and D. L. DeMets, Design and analysis of group sequential tests based on the type I error spending rate function. Biometrika 1987; 74: 149–154.
20. I. K. Hwang, W. J. Shih, and J. S. DeCani, Group sequential designs using a family of type I error probability spending functions. Stats. Med. 1990; 9: 1439–1445.
21. K. K. G. Lan and D. L. DeMets, Group sequential procedures: Calendar versus information time. Stats. Med. 1989; 8: 1191–1198.
22. D. M. Reboussin, D. L. DeMets, K. M. Kim, and K. K. G. Lan, Computations for group sequential boundaries using the Lan–DeMets spending function method. Control. Clin. Trials 2000; 21: 190–207.
23. M. A. Proschan, D. A. Follman, and M. A. Waclawiw, Effects of assumption violations on type I error rate in group sequential monitoring. Biometrics 1992; 48: 1131–1143.
24. K. K. G. Lan and D. L. DeMets, Changing frequency of interim analyses in sequential monitoring. Biometrics 1989; 45(3): 1017–1020.
25. K. Kim and D. L. DeMets, Confidence intervals following group sequential tests in clinical trials. Biometrics 1987; 4: 857–864.
26. K. Kim and D. L. DeMets, Sample size determination for group sequential clinical trials with immediate response. Stats. Med. 1992; 11: 1391–1399.
27. Z. Li and N. L. Geller, On the choice of times for data analysis in group sequential trials. Biometrics 1991; 47: 745–750.
28. Cardiac Arrhythmia Suppression Trial (CAST) Investigators. Preliminary report: Effect of encainide and flecainide on mortality in a randomized trial of arrhythmia suppression after myocardial infarction. N. Engl. J. Med. 1989; 321: 406–412.
29. MERIT-HF Study Group. Effect of metoprolol CR/XL in chronic heart failure: Metoprolol CR/XL randomised intervention trial in congestive heart failure. Lancet 1999; 353: 2001–2007.
30. M. Packer, A. J. S. Coats, M. B. Fowler, H. A. Katus, H. Krum, P. Mohacsi, J. L. Rouleau, M. Tendera, A. Castaigne, C. Staiger, et al., for the Carvedilol Prospective Randomized Cumulative Survival (COPERNICUS) Study Group. Effect of carvedilol on survival in severe chronic heart failure. New Engl. J. Med. 2001; 334: 1651–1658.
31. A. A. Tsiatis, Repeated significance testing for a general class of statistics used in censored survival analysis. J. Am. Stat. Assoc. 1982; 77: 855–861.
32. A. A. Tsiatis, G. L. Rosner, and C. R. Mehta, Exact confidence intervals following a group sequential test. Biometrics 1984; 40: 797–803.
33. K. Kim and A. A. Tsiatis, Study duration for clinical trials with survival response and early stopping rule. Biometrics 1990; 46: 81–92.
34. K. K. G. Lan and J. Lachin, Implementation of group sequential logrank tests in a maximum duration trial. Biometrics 1990; 46: 759–770.
35. J. W. Lee and D. L. DeMets, Sequential comparison of changes with repeated measurement data. J. Am. Stat. Assoc. 1991; 86: 757–762.
36. M. C. Wu and K. K. G. Lan, Sequential monitoring for comparison of changes in a response variable in clinical trials. Biometrics 1992; 48: 765–779.
37. S. J. Gange and D. L. DeMets, Sequential monitoring of clinical trials with correlated categorical responses. Biometrika 1996; 83: 157–167.
38. K. K. G. Lan, D. M. Reboussin, and D. L. DeMets, Information and information fractions for design and sequential monitoring of clinical trials. Communicat. Stat.–Theory Methods 1994; 23: 403–420.
39. D. O. Scharfstein, A. A. Tsiatis, and J. M. Robins, Semiparametric efficiency and its implication on the design and analysis of group-sequential studies. J. Am. Stat. Assoc. 1997; 92: 1342–1350.
40. K. Kim, Point estimation following group sequential tests. Biometrics 1989; 45: 613–617.
41. C. Jennison and B. W. Turnbull, Interim analyses: The repeated confidence interval approach. J. Royal Stat. Soc., Series B 1989; 51: 305–361.
42. S. S. Emerson and T. R. Fleming, Parameter estimation following group sequential hypothesis testing. Biometrika 1990; 77: 875–892.
43. M. D. Hughes and S. J. Pocock, Stopping rules and estimation problems in clinical trials. Stats. Med. 1981; 7: 1231–1241.
44. Z. Li and D. L. DeMets, On the bias of estimation of a Brownian motion drift following group sequential tests. Stat. Sinica 1999; 9: 923–937.
45. J. C. Pinheiro and D. L. DeMets, Estimating and reducing bias in group sequential designs with Gaussian independent structure. Biometrika 1997; 84: 831–843.
46. D. Siegmund, Estimation following sequential tests. Biometrika 1978; 65: 341–349.
47. J. Whitehead, On the bias of maximum likelihood estimation following a sequential test. Biometrika 1986; 73: 573–581.
48. D. L. DeMets and K. K. G. Lan, The alpha spending function approach to interim data analyses. In: P. Thall (ed.), Recent Advances in Clinical Trial Design and Analysis. Dordrecht, The Netherlands: Kluwer Academic Publishers, 1995, pp. 1–27.
FURTHER READING

M. N. Chang and P. C. O'Brien, Confidence intervals following group sequential tests. Control. Clin. Trials 1986; 7: 18–26.
T. Cook and D. L. DeMets, Statistical Methods in Clinical Trials. Boca Raton, FL: CRC Press/Taylor & Francis, 2007.
D. L. DeMets, Data monitoring and sequential analysis: an academic perspective. J. Acq. Immune Def. Syn. 1990; 3(Suppl 2): S124–S133.
D. L. DeMets, R. Hardy, L. M. Friedman, and K. K. G. Lan, Statistical aspects of early termination in the Beta-Blocker Heart Attack Trial. Control. Clin. Trials 1984; 5: 362–372.
D. L. DeMets and K. K. G. Lan, Interim analysis: the alpha spending function approach. Stats. Med. 1994; 13: 1341–1352.
D. L. DeMets, Stopping guidelines vs. stopping rules: a practitioner's point of view. Communicat. Stats.–Theory Methods 1984; 13: 2395–2417.
T. R. Fleming and D. L. DeMets, Monitoring of clinical trials: issues and recommendations. Control. Clin. Trials 1993; 14: 183–197.
J. L. Haybittle, Repeated assessment of results in clinical trials of cancer treatment. Brit. J. Radiol. 1971; 44: 793–797.
K. K. G. Lan, W. F. Rosenberger, and J. M. Lachin, Sequential monitoring of survival data with the Wilcoxon statistic. Biometrics 1995; 51: 1175–1183.
J. W. Lee, Group sequential testing in clinical trials with multivariate observations: a review. Stats. Med. 1994; 13: 101–111.
C. L. Meinert, Clinical Trials: Design, Conduct, and Analysis. New York: Oxford University Press, 1986.
S. Piantadosi, Clinical Trials: A Methodologic Perspective, 2nd ed. Hoboken, NJ: John Wiley and Sons, Inc., 2005.
S. J. Pocock, Statistical and ethical issues in monitoring clinical trials. Stats. Med. 1993; 12: 1459–1469.
S. J. Pocock, When to stop a clinical trial. Br. Med. J. 1992; 305: 235–240.
E. Slud and L. J. Wei, Two-sample repeated significance tests based on the modified Wilcoxon statistic. J. Am. Stat. Assoc. 1982; 77: 862–868.
A. Wald, Sequential Analysis. New York: John Wiley and Sons, 1947.
ANALYSIS OF VARIANCE ANOVA
Dr. JÖRG KAUFMANN
AG Schering, SBU Diagnostics & Radiopharmaceuticals, Berlin, Germany

1 INTRODUCTION

The development of analysis of variance (ANOVA) methodology has in turn had an influence on the types of experimental research carried out in many fields. ANOVA is one of the most commonly used statistical techniques, with applications across the full spectrum of experiments in agriculture, biology, chemistry, toxicology, pharmaceutical research, clinical development, psychology, social science, and engineering. The procedure involves the separation of the total observed variation in the data into individual components attributable to various factors, as well as those caused by random or chance fluctuation. It allows hypothesis tests of significance to be performed to determine which factors influence the outcome of the experiment. However, although hypothesis testing is certainly a very useful feature of the ANOVA, it is by no means the only aspect. The methodology was originally developed by Sir Ronald A. Fisher (1), the pioneer and innovator of the use and application of statistical methods in experimental design, who coined the name "Analysis of Variance" (ANOVA). For most biological phenomena, inherent variability exists within the response processes of treated subjects as well as among the conditions under which treatment is received. This results in sampling variability, meaning that results for a subject included in a study will differ to some extent from those of other subjects in the sampled population. Thus, the sources of variability must be investigated and suitably taken into account if data from comparative studies are to be evaluated correctly. Clinical studies are a particularly fruitful field for the application of this methodology.

The basis for generalizability of a successful clinical trial is strengthened when the coverage of a study is as broad as possible with respect to geographical area, patient demographics, and pre-treatment characteristics, as well as other factors that are potentially associated with the response variables. At the same time, heterogeneity among patients becomes more extensive and conflicts with the precision of statistical estimates, which is usually enhanced by homogeneity of subjects. The ANOVA methodology is a means to structure the data and their evaluation by accounting for the sources of variability such that homogeneity is regained in subsets of subjects and heterogeneity is attributed to the relevant factors. The ANOVA method is based on the use of sums of squares of the deviations of the observations from their respective means (→ Linear Model). The tradition of arraying sums of squares and the resulting F-statistics in an ANOVA table is so firmly entrenched in the analysis of balanced data that the extension of the analysis to unbalanced data is necessary. For unbalanced data, many different sums of squares can be defined and then used in the numerators of F-statistics, providing tests for a wide variety of hypotheses. In order to provide a practically relevant and useful approach, the ANOVA is introduced here through the cell means model. The concept of the cell means model was introduced by Searle (2, 3), Hocking and Speed (4), and Hocking (5) to resolve some of the confusion associated with ANOVA models for unbalanced data. The simplicity of such a model is readily apparent: no confusion exists about which functions are estimable, what their estimators are, and what hypotheses can be tested. The cell means model is conceptually easier, it is useful for understanding ANOVA models, and it is, from the sampling point of view, the appropriate model to use. In many applications, the statistical analysis is characterized by the fact that a number of detailed questions need to be answered. Even if an overall test is significant, further analyses are, in general, necessary to assess specific differences between the treatments. The cell means model provides, within the ANOVA framework, an appropriate basis for correct statistical inference and thus for honest statements about statistical significance in a clinical investigation.

2 FACTORS, LEVELS, EFFECTS, AND CELLS
One of the principal uses of statistical models is to explain variation in measurements. This variation may be caused by a variety of influencing factors, and it manifests itself as variation from one experimental unit to another. In well-controlled clinical studies, the sponsor deliberately changes the levels of experimental factors (e.g., treatment) to induce variation in the measured quantities and so lead to a better understanding of the relationship between those experimental factors and the response. Those factors are called independent variables, and the measured quantities are called dependent variables. For example, consider a clinical trial in which three different diagnostic imaging modalities are used on both men and women in different centers. Table 1 shows schematically how the resulting data could be arrayed in tabular fashion. The three elements used for classification (center, sex, and treatment) identify the source of variation of each datum and are called factors. The individual classes of the classifications are the levels of the factors (e.g., the three different treatments T1, T2, and T3 are the three levels of the factor treatment). Male and female are the two levels of the factor sex, and center 1, center 2, and center 3 are the three levels of the factor center.
Table 1. Factors, Levels, and Cells (ijk)

                            Treatment (k)
Center (i)   Sex (j)        T1            T2              T3
1            male
             female
2            male                         cell (2 1 2)
             female
3            male
             female
A subset of the data corresponding to a combination of one level of each factor under investigation is considered a cell of the data. Thus, with the three factors center (3 levels), sex (2 levels), and treatment (3 levels), 3 × 2 × 3 = 18 cells exist, numbered by the triple index (i, j, k). Repeated measurements in one cell may exist, and they usually do. Unbalanced data occur when the numbers of repeated observations per cell, nijk, differ for at least some of the indices (i, j, k). In clinical research, this occurrence is the rule rather than the exception. One obvious reason is missing data in an experiment; restricted availability of patients for specific factor combinations is another frequently experienced reason.

3 CELL MEANS MODEL

A customary practice since the seminal work of R. A. Fisher has been that of writing a model equation as a vehicle for describing ANOVA procedures. The cell means model is now introduced via a simple example in which only two treatments and no further factors are considered. Suppose that yir, i = 1, 2, r = 1, ..., ni, represents random samples from two normal populations with means µ1 and µ2 and common variance σ². The data point yir denotes the rth observation on the ith population, of size ni, and its value is assumed to follow a Gaussian (normal) distribution: yir ~ N(µi, σ²). The fact that the sizes n1 and n2 of the two samples differ indicates that a situation of unbalanced data exists. In linear model form, it is written

yir = µi + eir, i = 1, 2; r = 1, ..., ni    (1)

where the errors eir are independent, identically distributed normal N(0, σ²) variables (i.i.d.). Notice that a model consists of more than just a model equation: it is an equation such as Equation (1) plus statements that describe the terms of the equation. In the example above, µi is defined as the population mean of the ith population and is equal to the expectation of yir:

E(yir) = µi, r = 1, ..., ni, for i = 1, 2    (2)

The difference yir − E(yir) = yir − µi = eir is the deviation of the observed value yir from the expected value E(yir). This deviation, denoted eir, is called the error term or residual error term, and from the introduction of the model above it is a random variable with expectation zero and variance v(eir) = σ². Notice that in the model above the variance is assumed to be the same for all eir. The cell means model can now be summarized as follows:

yir = µi + eir
E(yir) = µi, E(eir) = 0, v(eir) = σ² for all i and r    (3)

Notice that Equation (3) does not explicitly assume the Gaussian (normal) distribution. In fact, one can formulate the cell means model more generally by specifying only means and variances. Below, however, it is restricted to the special assumption eir ~ i.i.d. N(0, σ²). It will be shown that the cell means model can be used to describe any of the models that are classically known as ANOVA models.
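As a minimal illustration of the model in Equation (3), the sketch below simulates two unbalanced samples with hypothetical means and a common variance; the group sample means recover µ1 and µ2, anticipating the least squares results of Section 5.

```python
# Tiny simulation of the two-population cell means model of Equation (3):
# y_ir = mu_i + e_ir with e_ir ~ i.i.d. N(0, sigma^2) and unequal group sizes.
# The means, standard deviation, and group sizes below are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
mu = {1: 5.0, 2: 3.0}            # hypothetical population means
sigma = 1.2                      # common standard deviation
n = {1: 8, 2: 14}                # unbalanced sample sizes

y = {i: mu[i] + sigma * rng.standard_normal(n[i]) for i in (1, 2)}
for i in (1, 2):
    print(f"population {i}: mu = {mu[i]}, sample mean = {y[i].mean():.2f}")
# The group sample means are the least squares estimates of mu_1 and mu_2,
# as made explicit in Section 5.
```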
4 ONE-WAY CLASSIFICATION
Let us begin with the case of a cell means model in which the populations are identified by a single factor with I levels and ni observations at the ith level, i = 1, ..., I.

4.1 Example 1

A clinical study was conducted to compare the effectiveness of three different doses of a new drug and placebo for treating patients with high blood pressure. Forty patients were included in the study. To control for unknown sources of variation, 10 patients each were assigned at random to the four treatment groups. The response considered is the difference in diastolic blood pressure between the baseline measurement (pre-value) and the measurement four weeks after administration of treatment. The response measurements yir, sample means ȳi., and sample variances s²i are shown in Table 2. The cell means model used to analyze the data of Example 1 is then given as

yir = µi + eir, i = 1, 2, 3, 4; r = 1, ..., 10, with eir ~ i.i.d. N(0, σ²)    (4)

where µi defines the population mean at the ith level as E(yir) = µi, and σ² is the common variance of the eir.

Table 2. Example 1, Data for a Dose-Finding Study

        Placebo   dose 1   dose 2   dose 3
        1.6       1.7      3.6      3.1
        0.1       4.7      2.3      3.5
        −0.3      0.2      2.1      5.9
        1.9       3.5      1.6      3.0
        −1.3      1.7      1.8      5.2
        0.6       2.6      3.3      2.6
        −0.6      2.3      4.3      3.7
        1.0       0.2      1.2      5.3
        −1.1      0.3      1.5      3.4
        1.1       0.7      2.4      3.6
ȳi.     0.30      1.79     2.41     3.93     sample means
s²i     1.24      2.31     1.01     1.26     sample variances

σ̂² = 1.46 (estimated common error variance of the model)
5 PARAMETER ESTIMATION
The µi in Equation (4) are usually estimated by the method of least squares. The estimator for µi (i = 1, ..., I) is then given by

µ̂i = (1/ni) Σr yir =: ȳi.    (5)

and the common error variance is estimated by

σ̂² = [1/(n. − I)] Σi Σr (yir − µ̂i)²    (6)

where n. = Σi ni. The sample means and sample variances for each treatment and the estimated common error variance σ̂² are given in Table 2.
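For Example 1, the estimates of Equations (5) and (6) can be reproduced directly from the data keyed in from Table 2; the sketch below prints the group means, the sample variances, and the pooled error variance, which agree with Table 2 up to rounding.

```python
# Least squares estimates of Equations (5) and (6) for the Example 1 data
# keyed in from Table 2; results agree with Table 2 up to rounding.
import numpy as np

groups = {
    "placebo": [1.6, 0.1, -0.3, 1.9, -1.3, 0.6, -0.6, 1.0, -1.1, 1.1],
    "dose 1":  [1.7, 4.7, 0.2, 3.5, 1.7, 2.6, 2.3, 0.2, 0.3, 0.7],
    "dose 2":  [3.6, 2.3, 2.1, 1.6, 1.8, 3.3, 4.3, 1.2, 1.5, 2.4],
    "dose 3":  [3.1, 3.5, 5.9, 3.0, 5.2, 2.6, 3.7, 5.3, 3.4, 3.6],
}

sse, n_dot = 0.0, 0
for name, values in groups.items():
    y = np.asarray(values)
    mu_hat = y.mean()                       # Equation (5): group sample mean
    print(f"{name}: mu_hat = {mu_hat:.2f}, s^2 = {y.var(ddof=1):.2f}")
    sse += ((y - mu_hat) ** 2).sum()
    n_dot += y.size

sigma2_hat = sse / (n_dot - len(groups))    # Equation (6): pooled error variance
print(f"sigma^2_hat = {sigma2_hat:.2f}")    # approximately 1.46
```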
6 THE R(.) NOTATION – PARTITIONING SUM OF SQUARES

The ANOVA procedure can be summarized as follows. Given n observations yi, one defines the total sum of squared deviations from the mean by

Total SS = Σi (yi − ȳ.)²    (7)

with ȳ. = (1/n) Σi yi. The ANOVA technique partitions the variation among observations into two parts: the sum of squared deviations of the model from the overall mean,

Model SS = Σi (ŷi − ȳ.)²    (8)

and the sum of squared deviations of the observed values yi from the model,

Error SS = SSE = Σi (yi − ŷi)²    (9)

These two parts are called the sum of squares due to the model and the residual or error sum of squares, respectively. Thus, Total SS = Model SS + Error SS. The Total SS always has the same value for a given set of data because it is nothing other than the sum of squares of all data points relative to the common mean. However, the partitioning into Model SS and Error SS depends on the model selected. Generally, the addition of a new factor to a model will increase the Model SS and, correspondingly, reduce the Error SS. When two models are considered, each sum of squares can be expressed as the difference between the sums of squares of the two models. Therefore, the approach based on sums of squares allows two ANOVA models to be compared very easily.

In the one-way classification in Equation (4), the total sum of squares over all observations is

SST = Σi Σr y²ir    (10)

The error sum of squares after fitting the model E(yir) = µi is

SSE(µi) = Σi Σr (yir − µ̂i)²    (11)

R(µi) = SST − SSE(µi) is denoted the reduction in sum of squares due to fitting the model E(yir) = µi. Fitting the simplest of all linear models, the constant mean model E(yir) = µ, the estimate of E(yir) would be ŷir = µ̂ = ȳ, and the error sum of squares results as

SSE(µ) = Σi Σr (yir − µ̂)²

R(µ) = SST − SSE(µ) is denoted the reduction in sum of squares due to fitting the model E(yir) = µ. The two models E(yir) = µi and E(yir) = µ can now be compared in terms of their respective reductions in sums of squares, R(µi) and R(µ).
Table 3a. ANOVA: Partitioning the Total Sum of Squares, One-Way Classification, Cell Means Model

Source of variation   df       Sum of squares       Mean square               F statistic
Model µ               1        R(µ)                 R(µ)
Model µi              I − 1    R(µi/µ)              R(µi/µ)/(I − 1)           [R(µi/µ)/(I − 1)] / [(SST − R(µi))/(n. − I)]
Residual              n. − I   SSE = SST − R(µi)    (SST − R(µi))/(n. − I)
Total                 n.       SST

Table 3b. Partitioning the Total Sum of Squares Adjusted for the Mean

Source of variation   df       Sum of squares       Mean square               F statistic
Model µi              I − 1    R(µi/µ)              R(µi/µ)/(I − 1)           [R(µi/µ)/(I − 1)] / [(SST − R(µi))/(n. − I)]
Residual              n. − I   SSE = SST − R(µi)    (SST − R(µi))/(n. − I)
Total a.f.m.          n. − 1   SST − R(µ)
The difference R(µi) − R(µ) is the extent to which fitting E(yir) = µi brings about a greater reduction in the sum of squares than does fitting E(yir) = µ. Obviously, the R(.) notation is a useful mnemonic for comparing different linear models in terms of the extent to which fitting each accounts for a different reduction in the sum of squares. The works of Searle (2, 3), Hocking (5), and Littel et al. (6) are recommended for deeper insight. It is now very easy to partition the total sum of squares SST into the terms that develop in the ANOVA. Therefore, the identity

SST = R(µ) + (R(µi) − R(µ)) + (SST − R(µi)) = R(µ) + R(µi/µ) + SSE(µi)    (12)

or SST − R(µ) = R(µi/µ) + SSE(µi), is used, with R(µi/µ) = R(µi) − R(µ). The separation in Table 3a and Table 3b corresponds to the first and the last line of Equation (12), respectively. Table 3a displays the separation into the components attributable to the model µ in the first line, to the model µi beyond µ in the second line, and to the error term in the third line, with the last line giving the total sum of squares. Table 3b displays only the separation into the two components attributable to the model µi beyond µ and, in the second line, to the error term.
7 ANOVA: HYPOTHESIS OF EQUAL MEANS
Consider the following inferences about the cell means. The analysis includes the initial null hypothesis of equal means (the global hypothesis that all means are simultaneously the same), the so-called ANOVA hypothesis, together with pairwise comparisons, contrasts, and other linear functions, in the form of either hypothesis tests or confidence intervals. Starting from the model E(yir) = µi, the global null hypothesis

H0: µ1 = µ2 = ... = µI

is of general interest. A suitable F-statistic can be used for testing this hypothesis H0 (see the standard references (2–6)). The F-statistic testing H0 and the sums of squares of Equation (12) are tabulated in Table 3a and Table 3b, columns 3–5. The primary goal of the experiment in Example 1 was to show that the new drug is effective compared with placebo. At first, one may test the global null hypothesis H0: µ1 = µ2 = µ3 = µ4 under the model E(yir) = µi. Table 3c shows the information for testing the hypothesis H0. R(µi/µ) = R(µi) − R(µ) = 67.81 is the difference between the respective reductions in sums of squares for the two models E(yir) = µi and E(yir) = µ. From the mean square R(µi/µ)/(I − 1) = 22.60 and the error term (SST − R(µi))/(n. − I) = σ̂² = 1.46, one obtains the F-statistic F = 15.5 for testing the null hypothesis H0, and the probability Pr(F > Fα) < 0.0001. As this probability is less than the type I error α = 0.05, the hypothesis H0 can be rejected in favor of the alternative µi ≠ µj for at least one pair i and j of the four treatments. Rejection of the null hypothesis H0 indicates that differences exist among treatments, but it does not show where the differences are located. Investigators' interest is rarely restricted to this overall test, but extends to comparisons among the doses of the new drug and placebo. As a consequence, multiple comparisons of the three dose groups with placebo are required.

Table 3c. ANOVA, Example 1

Source of variation   df   Sum of squares   Mean square   F statistic   Pr > F
Model µi              3    67.81            22.60         15.5          <0.0001
Residual              36   52.52            1.46
Total a.f.m.          39   120.33
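The partition in Equation (12) and the entries of Table 3c can be reproduced from the raw data; the sketch below computes R(µi/µ), SSE, the mean squares, and the F statistic for Example 1, using SciPy only for the tail probability of the F distribution.

```python
# Reproducing the R(.) partition and the global F test of Table 3c for Example 1.
import numpy as np
from scipy.stats import f as f_dist

# Table 2 data: placebo, dose 1, dose 2, dose 3
groups = [
    np.array([1.6, 0.1, -0.3, 1.9, -1.3, 0.6, -0.6, 1.0, -1.1, 1.1]),
    np.array([1.7, 4.7, 0.2, 3.5, 1.7, 2.6, 2.3, 0.2, 0.3, 0.7]),
    np.array([3.6, 2.3, 2.1, 1.6, 1.8, 3.3, 4.3, 1.2, 1.5, 2.4]),
    np.array([3.1, 3.5, 5.9, 3.0, 5.2, 2.6, 3.7, 5.3, 3.4, 3.6]),
]
y_all = np.concatenate(groups)
n_dot, I = y_all.size, len(groups)

total_afm = ((y_all - y_all.mean()) ** 2).sum()         # SST - R(mu), total adjusted for mean
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)  # SSE(mu_i) = SST - R(mu_i)
r_mui_given_mu = total_afm - sse                        # R(mu_i / mu)

F = (r_mui_given_mu / (I - 1)) / (sse / (n_dot - I))
p = f_dist.sf(F, I - 1, n_dot - I)
print(f"R(mu_i/mu) = {r_mui_given_mu:.2f}, SSE = {sse:.2f}, total a.f.m. = {total_afm:.2f}")
print(f"F = {F:.1f} on ({I - 1}, {n_dot - I}) df, Pr > F = {p:.2g}")
# Prints approximately 67.81, 52.52, 120.33, and F = 15.5 with Pr > F < 0.0001.
```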
8 MULTIPLE COMPARISONS
In many clinical trials, more than two drugs or more than two dose levels of one drug are often considered. Having rejected the global hypothesis of equal treatment means (e.g., when the probability for the F-statistic in Table 3c, last column, is less than 0.05), questions related to picking out the drugs that differ from the others, or determining which dose level differs from the others and from placebo, have to be attacked. These analyses generally involve making many (multiple) further comparisons among the treatments in order to detect effects of prime interest to the researcher. The excessive use of multiple significance tests in clinical trials can greatly increase the chance of false-positive findings. A large amount of statistical research has therefore been devoted to multiple comparison procedures and the control of false-positive results caused by multiple testing. Each procedure usually has the objective of controlling the experiment-wise or family-wise error rate. A multiple test controls the experiment-wise or family-wise multiple level α if the probability of rejecting at least one of the true null hypotheses does not exceed α, irrespective of how many hypotheses, and which of them, are in fact true. For ANOVA, several types of multiple comparison procedures exist that adjust the critical value of the test statistic (for example, the Scheffé, Tukey, and Dunnett tests), as well as procedures that adjust the comparison-wise P-values (e.g., the Bonferroni–Holm procedure) and the more general closed-test procedures (7, 8). Marcus et al. (9) introduced these so-called closed multiple test procedures, which keep the family-wise multiple level α under control. The closed-test principle requires a special structure among the set of null hypotheses, and it can be viewed as a general tool for deriving a multiple test. In general, the use of an appropriate multiple-comparison test for inference concerning treatment comparisons is indicated (10):
1. to make an inference concerning a particular comparison that has been selected on the basis of how the data have turned out;
2. to make an inference that requires the simultaneous examination of several treatment comparisons, for example, the minimum effective dose in dose-finding studies; and
3. to perform so-called "data dredging," namely, assembling the data in various ways in the hope that some interesting differences will become observable.
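By way of illustration only (the text does not prescribe a specific procedure for Example 1), the sketch below compares each dose with placebo using two-sample t tests and applies the Bonferroni–Holm step-down adjustment to the comparison-wise P-values.

```python
# Sketch of Bonferroni-Holm adjustment for the three dose-versus-placebo
# comparisons in Example 1 (two-sample t tests; a closed test or Dunnett
# procedure would be alternatives).
import numpy as np
from scipy.stats import ttest_ind

placebo = np.array([1.6, 0.1, -0.3, 1.9, -1.3, 0.6, -0.6, 1.0, -1.1, 1.1])
doses = {
    "dose 1": np.array([1.7, 4.7, 0.2, 3.5, 1.7, 2.6, 2.3, 0.2, 0.3, 0.7]),
    "dose 2": np.array([3.6, 2.3, 2.1, 1.6, 1.8, 3.3, 4.3, 1.2, 1.5, 2.4]),
    "dose 3": np.array([3.1, 3.5, 5.9, 3.0, 5.2, 2.6, 3.7, 5.3, 3.4, 3.6]),
}

raw = {name: ttest_ind(y, placebo).pvalue for name, y in doses.items()}

# Holm step-down: order raw p-values, multiply by (m - rank), enforce monotonicity.
m = len(raw)
adjusted, running_max = {}, 0.0
for rank, (name, p) in enumerate(sorted(raw.items(), key=lambda kv: kv[1])):
    running_max = max(running_max, min(1.0, (m - rank) * p))
    adjusted[name] = running_max

for name in doses:
    print(f"{name}: raw p = {raw[name]:.4f}, Holm-adjusted p = {adjusted[name]:.4f}")
```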
9 TWO-WAY CROSSED CLASSIFICATION

Two basic aspects of the design of experiments are error control and the optimal structuring of the treatment groups. A general way of reducing the effect of uncontrolled variation on the error of treatment comparisons is to group the experimental units (patients) into sets of units that are as alike (uniform) as possible. All comparisons are then made within and between sets of similar units (11). Randomization schemes (e.g., complete randomization, randomized blocks) are part of the error control of the experimental design. On the other hand, the structure of the treatments, that is, which factor(s) and factor levels are to be observed, is called the treatment design (6). The factorial treatment design is one of the most important and widely used treatment structures; one distinguishes between treatment and classification factors. The factorial design can be used with any randomization scheme. Compared with the one-factor-at-a-time approach, factorial experiments have the advantage of giving greater precision in the estimation of overall factor effects, they enable the exploration of interactions between different factors, and they allow an extension of the range of validity of the conclusions by the insertion of additional factors. Factorial designs enhance the basis for any generalizability of the trial conclusions with respect to geographical area, patient demographics, and pre-treatment characteristics, as well as other factors that are potentially associated with the response variable.

At the same time, the more extensive heterogeneity among patients conflicts with the precision of statistical estimates, which is usually enhanced by requiring homogeneity of subjects (12). Two study design strategies are followed:

1. Stratified assignment of treatments to subjects who are matched for similarity on one or more block factors, such as gender, race, age, or initial severity, or on strata, such as region or centers. Separately and independently within each stratum, subjects are randomly assigned to the treatment groups in a block randomization design (11). When a stratified assignment is used, treatment comparisons are usually based on appropriate within-stratum/center differences (13).
2. A popular alternative to the block randomization design is the stratification of patients according to their levels on prognostic factors after complete randomization, in the analysis phase. This alternative strategy, termed post-stratification, leads to exactly the same kind of statistical analysis, but it has disadvantages compared with pre-stratification. Pre-stratification guards by design against unlikely but devastating differences between the groups in their distributions on the prognostic factors and in the sample sizes within strata. With pre-stratification, these will be equal, or at least close to equal, by design. With post-stratification, due to randomization, they will be equal only in a long-term average sense (14).
10 BALANCED AND UNBALANCED DATA
When the number of observations in each cell is the same, the data are described as balanced data. Balanced data typically come from well-designed factorial trials that have been executed as planned. The analysis of balanced data is relatively easy and has been described extensively (15). When the number of observations in the cells is not uniform, the data are described as unbalanced data.
Table 4. Liver Enzyme (Log Transform): Two-Way Crossed Classification, Treatment (T), Liver Impairment (B)

Case  T  B  X       Case  T  B  X       Case  T  B  X       Case  T  B  X
  1   2  1  0.85      21  1  1  1.11     101  2  1  1.15     121  2  1  1
  2   2  1  1         22  2  1  1.28     102  1  1  1.11     122  1  1  0.95
  3   1  2  1.32      23  2  1  0.90     103  1  2  1.15     123  1  2  1.04
  4   2  1  0.95      24  2  1  1        104  1  1  0.95     124  1  2  1.43
  5   1  1  0.90      25  2  1  1        105  2  1  0.90     125  1  2  1.53
  6   1  2  1.43      26  1  1  0.85     106  1  2  1.52     127  2  1  1.04
  7   1  1  1.30      27  2  1  1        107  1  1  1.36     128  2  1  1
  8   1  1  1.08      28  2  1  0.90     108  1  1  1.18     129  2  1  1
  9   1  1  1         29  1  1  0.90     109  1  2  1.15     130  1  1  1.26
 10   2  1  1.08      30  2  1  1        110  2  2  1.46     131  2  1  1
 11   1  1  1.15      31  1  1  1        111  2  2  1.18     132  2  1  0.90
 12   2  2  1.60      33  2  1  1.04     112  2  1  1.04     133  2  2  1.15
 13   1  1  1.34      34  1  2  1.49     113  2  2  1.34     134  1  1  0.95
 14   2  1  0.95      35  1  1  0.85     114  2  2  1        135  2  1  1.11
 15   2  2  1.36      36  2  1  1.08     115  1  1  0.95     136  1  1  1.32
 16   1  1  0.95      37  2  1  0.90     116  2  1  0.70     137  1  1  0.95
 17   2  1  0.90      38  1  2  1.64     117  1  2  1.20     138  1  2  1.40
 18   1  1  1.28      39  2  1  1.26     118  1  1  0.85     139  2  1  0.90
 19   2  1  0.95      40  1  1  1.08     119  1  1  1.30     140  2  1  1.04
 20   1  1  0.85                         120  2  1  1.15

T = 1: contrast media 1; T = 2: contrast media 2. B = 1: liver impairment no; B = 2: liver impairment yes.
This fact will be illustrated using a numerical example. Suppose interest exists in the comparison of two contrast media, T = 1, 2, with regard to a liver enzyme, and a second factor B, with categories B = 1, 2, describes the initial severity of the liver impairment (B = 1 not impaired, B = 2 impaired). Table 4 shows the data of a clinical trial designed accordingly. Let xijk be the kth observation on treatment i for liver impairment severity j, where i = 1, 2, j = 1, 2, and k = 1, ..., nij, with nij being the number of observations on treatment i and impairment j. For the sake of generality, treatment is referred to as the row factor A with levels i = 1, ..., I, and liver impairment as the column factor B with levels j = 1, ..., J. Notice that the data of Table 4 are represented as logarithms yijk = log(xijk) of the liver enzyme data xijk. The cell means model is as follows:

yijk = µij + eijk, i = 1, ..., I; j = 1, ..., J; k = 1, ..., nij    (13)

with yijk = log xijk and eijk ~ i.i.d. N(0, σ²); that is, a normal (Gaussian) distribution is assumed for the log-transformed measurements. It follows, as for the one-way classification, that the cell means µij are estimated by the mean of the observations in the corresponding cell (i, j),

µ̂ij = (1/nij) Σk yijk =: ȳij.    (14)

and the residual error variance is estimated as

σ̂² = SSE/(N − s)    (15)

where SSE = Σi Σj Σk (yijk − µ̂ij)², s = I × J − (number of empty cells), and N = Σi Σj nij is the total sample size.

An important difference between balanced and unbalanced data lies in the definition of a mean over rows or columns, respectively. For example, the unweighted row mean of the cell means µij in row i is straightforward,

µ̄i. = (1/J) Σj µij    (16)

and its estimator is

(1/J) Σj µ̂ij    (17)

with variance estimate (1/J²) Σj (1/nij) σ̂². For Example 2, the estimates 1.22 and 1.15 are obtained for rows 1 and 2. A different row mean of the cell means µij in row i, sometimes thought to be an interesting alternative to µ̄i., is the weighted mean with weights nij/ni.,

µ̃i. = Σj (nij/ni.) µij    (18)

estimated by the mean over all observations in row i,

Σj (nij/ni.) µ̂ij = ȳi..    (19)

with variance estimate σ̂²/ni. For Example 2, the estimates 1.16 and 1.05 are obtained for rows 1 and 2. The difference between these two cell mean estimates consists in the fact that µ̄i. is independent of the number of data points in the cells, whereas µ̃i. depends on the number of observations in each cell. Because the sample sizes of the participating centers in a multi-center study usually differ, a weighted mean over the centers is a better overall estimator for treatment comparisons (13, 14, 16). If it can be ascertained that the population in a stratum or block is representative of a portion of the general population of interest, for example in post-stratification, then it is natural to use a weighted mean as an overall estimator of a treatment effect. Many clinical trials result in unbalanced two-way factorial designs; therefore, it is appropriate to define row or column means as weighted means, which results in different sums of squares. If weights other than these two possibilities are wanted (e.g., Cochran–Mantel–Haenszel weights), one may consider a general weighting of the form

µi. = Σj tij µij    (20)

with Σj tij = 1, the estimator Σj tij ȳij., and variance estimate σ̂² Σj t²ij/nij.
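The two row-mean estimates quoted above for Example 2 follow directly from the cell sizes and cell means reported in Table 5; the sketch below reproduces them (values agree with the text up to rounding).

```python
# Unweighted (Equations 16-17) and weighted (Equations 18-19) row means for
# Example 2, computed from the cell sizes and cell means given in Table 5.
cells = {  # (treatment, impairment): (n_ij, cell mean)
    ("T1", "B1"): (27, 1.07), ("T1", "B2"): (12, 1.36),
    ("T2", "B1"): (32, 1.00), ("T2", "B2"): (7, 1.30),
}

for trt in ("T1", "T2"):
    row = [(n, m) for (t, _), (n, m) in cells.items() if t == trt]
    unweighted = sum(m for _, m in row) / len(row)            # (1/J) * sum of cell means
    weighted = sum(n * m for n, m in row) / sum(n for n, _ in row)
    print(f"{trt}: unweighted = {unweighted:.2f}, weighted = {weighted:.2f}")
# Prints approximately 1.22 and 1.16 for T1, and 1.15 and 1.05 for T2,
# matching the values quoted in the text.
```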
11 INTERACTION BETWEEN ROWS AND COLUMNS

Consider an experiment with two factors, A (with two levels) and B (with three levels). The levels of A may be thought of as two different hormone treatments and the three levels of B as three different races. Suppose that no uncontrolled variation exists and that observations such as those in Fig. 1a are obtained. These observations can be characterized in various equivalent ways:

1. the difference between the observations corresponding to the two levels of A is the same for all three levels of B;
2. the difference between the observations for any two levels of B is the same for the two levels of A; and
3. the effects of the two factors are additive.

For levels i and i′ of factor A and levels j and j′ of factor B, consider the relation among means given by

µij − µi′j = µij′ − µi′j′    (21)

If this relation holds, the difference in the means for levels i and i′ of factor A is the same for levels j and j′ of factor B and, vice versa, the change from level j to level j′ of factor B is the same for both levels of factor A. If Equation (21) holds for all pairs of levels of factor A and all pairs of levels of factor B, or equivalently when conditions (1), (2), and (3) are satisfied, one says that factor A does not interact with factor B: no interaction exists between factors A and B. If the relation in Equation (21), or conditions (1), (2), and (3), are not satisfied, then an interaction exists between factor A and factor B. There are many ways in which interaction can occur; a particular case is shown in Fig. 1b. The presence of an interaction can mislead conclusions about the treatment effects in terms of row effects. It is therefore advisable to assess the presence of interaction before drawing conclusions, which can be done by testing an interaction hypothesis. The test may indicate that the data are consistent with the absence of interaction, but it cannot prove that no real interaction exists.
Figure 1a. Two-way classification, no interaction: 2 rows, 3 columns (cell means mij plotted against the levels of factor B for A1 and A2).

Figure 1b. Two-way classification, interaction: 2 rows, 3 columns (cell means mij plotted against the levels of factor B for A1 and A2).
Table 5. Sample Sizes and Means, (nij) ȳij.

        B1            B2
T1      (27) 1.07     (12) 1.36
T2      (32) 1.00     (7)  1.30
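Before turning to the formal test, the no-interaction relation (21) can be eyeballed from the Table 5 cell means; the sketch below simply compares the two treatment differences within B1 and within B2.

```python
# Informal check of the no-interaction relation (21) using the Table 5 cell means:
# the T1 - T2 difference within B1 should be close to the difference within B2
# if treatment and liver impairment act additively.
mean = {("T1", "B1"): 1.07, ("T1", "B2"): 1.36,
        ("T2", "B1"): 1.00, ("T2", "B2"): 1.30}

diff_b1 = mean[("T1", "B1")] - mean[("T2", "B1")]
diff_b2 = mean[("T1", "B2")] - mean[("T2", "B2")]
print(f"T1 - T2 within B1: {diff_b1:.2f}; within B2: {diff_b2:.2f}")
# About 0.07 versus 0.06, consistent with the non-significant T*B interaction
# reported later in Table 8 (Pr > F = 0.94).
```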
12 ANALYSIS OF VARIANCE TABLE
ANOVA tables have been successful for balanced data. They are generally available, widely documented, and ubiquitously accepted (15). For unbalanced data, there is often no unique, unambiguous method for presenting the results. Several methods are available, but they are often not as easily interpretable as the methods for balanced data. In the context of hypothesis testing, or of arraying sums of squares in an ANOVA format, a variety of sums of squares are often used. The problem in interpreting the output of specific computer programs is to identify those sums of squares that are useful and those that are misleading. The information for conducting tests for row effects, column effects, and interactions between rows and columns is summarized in an extended ANOVA table.

Various computational methods have existed for generating sums of squares in an ANOVA table since the work of Yates (17). The advantage of using the µij-model notation introduced above is that all the µij are clearly defined; thus, a hypothesis stated in terms of the µij is easily understood. Speed et al. (18), Searle (2, 3), and Pendleton et al. (16) give interpretations of four different types of sums of squares computed, for example, by the SAS, SPSS, and other systems. To illustrate the essential points, use the model in Equation (13), assuming all nij > 0. For reference, six hypotheses about weighted means are listed in Table 6 and will be related to the different methods (16, 18). A typical method might refer to H1, H2, or H3 as the "main effect of A," the row effect. The hypotheses H4 and H5 are the counterparts of H2 and H3 generally associated with the "main effect of B," the column effect. The hypothesis of no interaction is H6, and it is seen to be common to all methods under the assumption nij > 0. The hypotheses H1, H2, and H3 agree in the balanced case (i.e., if nij = n for all i and j) but not otherwise. The hypothesis H3 does not depend on the nij; all means have the same weights 1/J and are easy to interpret. As it states, no difference exists between the levels of factor A when averaged over all levels of factor B [Equation (16)].
Table 6. Cell Means Hypotheses

Hypothesis   Main Effect          Weighted Means
H1           Factor A (rows)      Σj nij µij / ni. = Σj ni′j µi′j / ni′.
H2           Factor A (rows)      Σj nij µij = Σj Σi′ nij ni′j µi′j / n.j
H3           Factor A (rows)      µ̄i. = µ̄i′.
H4           Factor B (columns)   Σi nij µij = Σi Σj′ nij nij′ µij′ / ni.
H5           Factor B (columns)   µ̄.j = µ̄.j′
H6           Interaction A × B    µij − µij′ − µi′j + µi′j′ = 0

for all i, i′, j, j′ with i ≠ i′ and j ≠ j′.
H1: weighted means, Equation (18); H2: Cochran–Mantel–Haenszel weights, Equation (20) with tij = nij ni′j / n.j; H3: means as defined in Equation (16); H4: counterpart of H2 for factor B; H5: counterpart of H3 for factor B; H6: interaction between factor A and factor B.
The hypotheses H1 and H2 represent comparisons of weighted averages [Equations (18) and (20)] with weights that are a function of the cell frequencies. A hypothesis weighted by the cell frequencies might be appropriate if the frequencies reflected population sizes, but it would not be considered the standard hypothesis. Table 7 specifies the particular hypothesis tested by each type of ANOVA sum of squares. Table 8 shows three different analysis of variance tables for Example 2 (liver enzyme), computed with SAS PROC GLM. The hypothesis for the interaction term T×B is given by H6; the test is the same for the Type I, II, and III sums of squares. In this example, no interaction exists between treatment and liver impairment (P-value Pr > F = 0.94).

The hypothesis for the "main effect of A" (the treatment comparison) is given by H1, H2, or H3, corresponding to the different weightings of the means (Table 6 and Table 7). The results for the treatment comparison differ accordingly (Table 8, source T):

1. for Type I, the P-value Pr > F = 0.0049 is less than 0.005 (highly significant);
2. for Type II, the P-value Pr > F = 0.074 lies between 0.05 and 0.1 (not significant);
3. for Type III, the P-value Pr > F = 0.142 is greater than 0.1 (not significant).

The hypothesis H2 (i.e., Type II) is appropriate for the treatment effect in the analysis of this example, a two-way design with unbalanced data.
Table 7. Cell Means Hypotheses Being Tested

Sum of Squares            Type I   Type II   Type III
Row effect, Factor A      H1       H2        H3
Column effect, Factor B   H4       H4        H5
Interaction, A × B        H6       H6        H6

Type I, Type II, and Type III agree when the data are balanced.

Table 8. Liver Enzyme: Two-Way Classification with Interaction Term, Treatment (T), Impairment (B), Interaction (T*B)

Source   DF   Type I SS     Mean Square   F-Value   Pr > F
T        1    0.20615       0.20615       8.41      0.0049
B        1    1.22720       1.22720       50.08     0.0001
T*B      1    0.00015       0.00015       0.01      0.9371

Source   DF   Type III SS   Mean Square   F-Value   Pr > F
T        1    0.05413       0.05413       2.21      0.1415
B        1    1.19127       1.19127       48.61     0.0001
T*B      1    0.00015       0.00015       0.01      0.9371

Source   DF   Type II SS    Mean Square   F-Value   Pr > F
T        1    0.08038       0.08038       3.28      0.0742
B        1    1.22720       1.22720       50.08     0.0001
T*B      1    0.00015       0.00015       0.01      0.9371
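Table 8 was produced with SAS PROC GLM. Purely as an illustration of how analogous Type I, II, and III decompositions can be obtained elsewhere, the sketch below uses the Python statsmodels package on a small hypothetical subset of the Table 4 cases (a full analysis would use all 78 cases); sum-to-zero contrasts are needed for SAS-style Type III tests.

```python
# Illustration only: Table 8 comes from SAS PROC GLM, but an analogous
# Type I/II/III decomposition can be sketched in Python with statsmodels.
# The data frame below holds just a few Table 4 cases for brevity; a real
# analysis would include all 78 cases (columns: T, B, and the log response y).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "T": [2, 1, 1, 2, 1, 2],
    "B": [1, 2, 1, 2, 1, 1],
    "y": [0.85, 1.43, 1.30, 1.36, 1.08, 0.95],
})

fit = smf.ols("y ~ C(T) * C(B)", data=df).fit()
print(anova_lm(fit, typ=1))     # sequential (Type I) sums of squares
print(anova_lm(fit, typ=2))     # Type II

# SAS-style Type III tests require sum-to-zero contrasts for the factors.
fit_sum = smf.ols("y ~ C(T, Sum) * C(B, Sum)", data=df).fit()
print(anova_lm(fit_sum, typ=3)) # Type III
```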
No interaction exists between treatment and liver impairment. The different test results for H1, H2, and H3 arise because the data are unbalanced with respect to the factor liver impairment. Any rules for combining centers, blocks, or strata in the analysis should be set up prospectively in the protocol, and decisions concerning this approach should always be taken blind to treatment. All features of the statistical model to be adopted for the comparison of treatments should be described in advance in the protocol. The hypothesis H2 is appropriate in the analysis of multi-center trials when treatment differences over all centers are considered (13, 14, 19). The essential point emphasized here is that the justification of a method should be based on the hypotheses being tested and not on heuristic grounds or computational convenience. In the presence of a significant interaction, the hypotheses of main effects may not be of general interest, and more specialized hypotheses might be considered. With regard to missing cells, the hypotheses being tested can be somewhat complex for the various procedures or types of ANOVA tables, and the complexities associated with those models are aggravated when dealing with models for more than two factors. No matter how many factors exist or how many levels each factor has, the mean of the observations in each filled cell is an estimator of the population mean for that cell. Any linear hypothesis about the cell means of nonempty cells is testable; see the work of Searle (3: 384–415).

REFERENCES

1. R. A. Fisher, Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd, 1925.
2. S. R. Searle, Linear Models. New York: John Wiley & Sons, 1971.
3. S. R. Searle, Linear Models for Unbalanced Data. New York: John Wiley & Sons, 1987.
4. R. R. Hocking and F. M. Speed, A full rank analysis of some linear model problems. JASA 1975; 70: 706–712.
5. R. R. Hocking, Methods and Applications of Linear Models: Regression and the Analysis of Variance. New York: John Wiley & Sons, 1996.
6. R. C. Littel, W. W. Stroup, and R. J. Freund, SAS for Linear Models. Cary, NC: SAS Institute Inc., 2002.
7. P. Bauer, Multiple primary treatment comparisons on closed tests. Drug Inform. J. 1993; 27: 643–649.
8. P. Bauer, On the assessment of the performance of multiple test procedures. Biomed. J. 1987; 29(8): 895–906.
9. R. Marcus, E. Peritz, and K. R. Gabriel, On closed testing procedures with special reference to ordered analysis of variance. Biometrika 1976; 63: 655–660.
10. C. W. Dunnett and C. H. Goldsmith, When and how to do multiple comparison statistics in the pharmaceutical industry. In: C. R. Buncker and J. Y. Tsay, eds. Statistics and Monograph, vol. 140. New York: Dekker, 1994.
11. D. R. Cox, Planning of Experiments. New York: John Wiley & Sons, 1992.
12. G. G. Koch and W. A. Sollecito, Statistical considerations in the design, analysis, and interpretation of comparative clinical studies. Drug Inform. J. 1984; 18: 131–151.
13. J. Kaufmann and G. G. Koch, Statistical considerations in the design of clinical trials, weighted means and analysis of covariance. Proc. Conference in Honor of Shayle R. Searle, Biometrics Unit, Cornell University, 1996.
14. J. L. Fleiss, The Design and Analysis of Clinical Experiments. New York: John Wiley & Sons, 1985.
15. H. Sahai and M. I. Ageel, The Analysis of Variance: Fixed, Random and Mixed Models. Boston: Birkhäuser, 2000.
16. O. J. Pendleton, M. von Tress, and R. Bremer, Interpretation of the four types of analysis of variance tables in SAS. Commun. Statist.–Theor. Meth. 1986; 15: 2785–2808.
17. F. Yates, The analysis of multiple classifications with unequal numbers in the different classes. JASA 1934; 29: 51–56.
18. F. M. Speed, R. R. Hocking, and O. P. Hackney, Methods of analysis of linear models with unbalanced data. JASA 1978; 73: 105–112.
19. S. Senn, Some controversies in planning and analysing multi-centre trials. Stat. Med. 1998; 17: 1753–1765.
ANALYSIS POPULATION

ANDREW S. MUGGLIN
JOHN E. CONNETT
Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota

Most applied statisticians, at some time, have been given the request: "Please go analyze the data and then report back." But hidden in the instruction, "...go analyze the data" are at least three major issues, each of which has many subissues. Exactly which data should be analyzed? What hidden biases exist in the way that these data were collected? Which analysis technique is the most appropriate for these data? None of these questions has a trivial answer, but appropriate, high-quality analysis and interpretation depend on careful attention to these questions. In this article, we examine specifically the question of which data should be analyzed. In data sets that emanate from clinical trials, this task can be surprisingly challenging.

1 APPRECIATING THE DIFFICULTIES

Consider the following hypothetical example. Suppose we have a trial of 100 patients, randomized equally into two study arms: oral medication A and oral medication B. Suppose that the following plausible scenarios all occur:

• Three patients who violated the inclusion/exclusion criteria of the trial were enrolled—one patient because of a clerical mistake, one because of an erroneous lab value, and one because of a clinician's judgment about the intent of the trial.
• Because of clerical error, one patient who should have been assigned to A was actually assigned to B.
• Three patients assigned to take A refused to take it and dropped out of the study.
• Three patients in the A arm and eight patients in the B arm were lost to follow-up during the 2-year duration of the study.
• Four patients were missing a lab measure at baseline that could serve as an important covariate.
• At least three patients in the A arm were known to be taking medication B, perhaps in addition to A.
• During the trial, five patients moved to another state, and although no lab measures or visit data were missing, lab measures for these patients have now been conducted at different facilities, with unknown quality-control procedures.

Even in a simple setting such as this, it is not at all clear how to "go analyze the data." The problems only grow when a trial contains thousands of patients, longer follow-up, and more complicated data collection schemes. One simple and common approach to this setting might be to take all patients who have complete datasets and analyze them as they were actually treated. On the surface, this approach sounds appealing on two fronts: First, the notion of "complete datasets" presents no analytical difficulties (statistical techniques are typically well defined on well-defined datasets). Second, to analyze patients as they were actually treated seems to imply that we are studying the actual scientific consequence of treating patients in a specified way. However, these approaches have several major problems. What does "actually treated" mean when patients either forget or refuse to take their medication precisely as prescribed? What kinds of systematic factors lurk behind a patient's or caregiver's decision not to comply completely with a specified treatment regimen? And if only complete datasets are analyzed, what systematic factors influence whether a patient's dataset is incomplete? For instance, if sicker patients tend to have more missing values or tend not to take their medication faithfully, an analysis that ignores incomplete datasets and handles patients according to their actual
treatment regimen can produce biased and misleading conclusions. Another approach might be to take all enrolled patients—regardless of whether they had data missing—and analyze them in the groups to which they were assigned by randomization. In this instance, the analyst immediately faces the problem of how to handle the missing values. Is it best to impute the worst possible value for these? Or observed values from some other patients? Or to impute these values via some mathematical model? And considering assignment by randomization (assuming the trial is randomized), analyzing patients according to how they were randomized results in a bias toward no treatment difference in the study arms. (The term "bias" in clinical trials is not intended to be limited to intentional or fraudulent bias but is meant to include the more common situation of hidden influences that systematically affect an outcome.) If the aim of the study is to show that a new treatment is better than an older one, then this bias is generally deemed to be a small evil, because it only makes it harder to demonstrate the superiority of the new treatment. But if the aim of the study is to show that the new treatment is equivalent or noninferior to an existing one, then analyzing patients by assigned treatment can be dangerous—especially if compliance to treatment is low—because the bias may falsely cause an inferior treatment to seem equivalent to a superior one.

2 DEFINITIONS OF COMMON TERMS
Several terms are commonly used to describe a particular analysis data set and the principles by which it was chosen. We describe some of them here.
2.1 Intention-to-Treat (or Intent-to-Treat, or ITT)

This principle dictates that patients be divided into analysis groups based on how the investigator intended to treat them, rather than on how they were actually treated. This principle is easiest to apply in a study where patients are randomized to treatments, because in most cases the intended way to treat a patient is how they were randomized. (Exceptions and difficult cases have occurred; for example, a patient is accidentally assigned to the wrong treatment, or a patient somehow gets enrolled but is never randomized.) In nonrandomized studies, this principle is more difficult to implement but is probably most likely to be implemented as the intended course of treatment when a patient was initially enrolled, if that can be determined. Intention-to-treat is often the default analysis philosophy because it is conceptually simple and because it is conservative in the sense that lack of perfect compliance to the prescribed therapy tends to bias the treatment effects toward each other, which makes claims of superiority of one therapy over another more difficult to achieve. However, as mentioned previously, in the setting of equivalence or noninferiority trials, this bias can be an anticonservative bias.

2.2 As-Randomized

This term is often used interchangeably with intention-to-treat, but it has a few differences. Obviously, the term "as-randomized" does not apply in a nonrandomized study. More subtle differences include situations where a patient is incorrectly assigned to a treatment. For instance, if randomizations are preprinted in a sequence of envelopes, and someone opens the wrong envelope for a given patient, this patient may be randomized to a treatment other than what was intended. Also, if patients are enrolled and intended to be randomized but for some reason are never randomized, an intention-to-treat analysis may place those patients into whichever arm seems to have the most conservative effect, whereas an as-randomized analysis may omit them.

2.3 As-Treated

This approach assigns patients to analysis groups based on how they were actually treated. A patient randomized to therapy A but who receives therapy B will be analyzed as belonging to the therapy B group. This approach attempts to study the more purely scientific effect of the therapies as actually implemented, which avoids the bias that occurs under intention to treat. It has
other biases, however. If patients who take an incorrect therapy do so for some systematic reason (e.g., the assigned therapy has failed, or sicker patients tend to switch therapies because of side effects), this will bias the interpretation in unpredictable ways. Moreover, it can be difficult in practice to determine how a patient is actually treated. What, for example, does one do with data from a patient who has taken the prescribed medication intermittently? Or who took only half the prescribed dose? Or one who, after taking therapy A for a while, discontinues it and begins taking therapy B?

2.4 Per Protocol

Sometimes called an "adherers only" analysis, this term is used to denote an analysis where patients' data are included in the analysis dataset only if the patients adhered to the requirements of the protocol. This analysis typically means that patients are randomized correctly to a therapy, have met the inclusion and exclusion criteria for the study, and have followed the treatment regimen correctly. Patients who drop out of a treatment arm or who drop in to a different arm (whether or not they discontinue the assigned treatment) are excluded, at least for the periods of time where they were not properly complying with the protocol. This approach attempts to answer the scientific question of how the treatment behaves under ideal conditions, while avoiding the biases inherent in analyzing patients as they were actually treated (see above). However, some biases still exist, because patients who are excluded from the analysis may be eliminated for some systematic reason (e.g., elderly people had more trouble complying with therapy B than with therapy A and thus more elderly people are excluded from the B arm).
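To make the distinctions above concrete, the following sketch (hypothetical records and invented field names, not taken from any actual trial) classifies the same patients into intention-to-treat, as-treated, and per-protocol populations and computes a crude difference in response rates under each; the point is only that the populations, and therefore the estimates, need not agree.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    assigned_arm: str        # arm assigned at randomization ("A" or "B")
    received_arm: str        # arm actually received
    adherent: bool           # met entry criteria and followed the regimen
    responded: bool          # outcome of interest

patients = [
    Patient("A", "A", True, True),
    Patient("A", "A", False, False),   # poor compliance
    Patient("A", "B", True, True),     # received the wrong arm
    Patient("A", "A", True, False),
    Patient("B", "B", True, False),
    Patient("B", "B", True, True),
    Patient("B", "A", False, True),    # drop-in to A, non-adherent
    Patient("B", "B", False, False),
]

def response_rate(group):
    return sum(p.responded for p in group) / len(group) if group else float("nan")

def effect(population, arm_of):
    arm_a = [p for p in population if arm_of(p) == "A"]
    arm_b = [p for p in population if arm_of(p) == "B"]
    return response_rate(arm_a) - response_rate(arm_b)

# Intention-to-treat / as-randomized: everyone, grouped by assigned arm.
itt = effect(patients, lambda p: p.assigned_arm)
# As-treated: everyone, grouped by the arm actually received.
as_treated = effect(patients, lambda p: p.received_arm)
# Per-protocol: adherent patients only, grouped by assigned arm.
per_protocol = effect([p for p in patients if p.adherent], lambda p: p.assigned_arm)

print(f"ITT: {itt:+.2f}  as-treated: {as_treated:+.2f}  per-protocol: {per_protocol:+.2f}")
```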
2.5 Completer Analysis Set

This term is less common and seems to be used in different ways. Some use it to describe an analysis set that includes only patients who have finished the specified minimum follow-up time. When contrasted in publication with an intention-to-treat analysis, it sometimes represents those patients who finished the entire prescribed course of therapy or some minimum amount of the entire course of therapy (making it more of an as-treated or per-protocol analysis). It can help with understanding the long-term effects of therapy, especially among compliers. It is subject to a very serious bias, however, because it ignores patients who drop out of the study or who do not comply. It is dangerous to assume that noncompliance is a random occurrence.

2.6 Evaluable Population

This term is not very informative in how it handles patients (e.g., as randomized?), but the intention is that "nonanalyzable" data are excluded. Data can be classified as "nonanalyzable" for a variety of reasons: They can be missing (e.g., a patient has missed a follow-up visit, a serum assay was lost or not performed), technically inadequate (as occurs in certain echocardiographic or other high-tech measurements), or in some cases important endpoints cannot be adjudicated reliably (e.g., whether a death was specific to a certain cause). In some instances, the term seems to refer to patients who have taken the therapy at some minimum level of adherence or had a minimum level of follow-up (making it essentially a completer analysis). This approach is one of convenience. It is subject to many biases, and data interpretation can be difficult. Authors who use this term should take care to define it, and readers should pay careful attention to the authors' definition.

3 MAJOR THEMES AND ISSUES

In selecting the analysis dataset, certain issues and questions are generated frequently. In this section, we address several of them.

3.1 Should We Analyze Data as Intention-To-Treat or Not?

The reflexive answer to this question is often affirmative, but the decision is sometimes not easy. Cases where it is easy are settings in which an investigator is trying to prove that an experimental treatment is superior to a standard treatment or placebo. As mentioned previously, the rationale here is that
biases tend to work against the experimental therapy. If patients cannot comply well with the experimental regimen, then it will tend to work against the finding that the experimental treatment is superior. If a large amount of drop-in or drop-out of the experimental arm occurs (e.g., control subjects getting the experimental therapy, or experimental arm patients who follow the control regimen), then it will have the same effect. This conservative bias is also an incentive to investigators to be vigilant in executing the trial according to plan, because mistakes in randomizing or assigning patients to treatments will tend to hurt the goals of the trial. Another compelling reason to analyze patients by intention to treat is that assignment of patients to analysis groups is usually straightforward (although it may be challenging to conduct the analysis, such as in the case of missing data). In some settings, however, this approach may not be the best. In an equivalence trial, where the goal is to acquire evidence to establish that two treatments are nearly the same, poor compliance will bias the trial in favor of the desired finding. In some settings, particularly pilot studies, the question of most interest regards the treatment effect when implemented correctly ("efficacy"), rather than its effect in actual practice ("effectiveness"). In other settings, an investigator may opt for an intention-to-treat analysis when studying a therapy's efficacy or effectiveness but not for its safety. It is good practice to think through these issues and document one's approach prior to knowing the results for any particular analysis. This practice defends against the criticism that a particular cohort was selected simply because it yields the most favorable interpretation.

3.2 What If Inclusion or Exclusion Criteria Are Violated?

In trials of moderate or large size, it is common that a few patients are enrolled who should not have been enrolled. Reasons range from simple mistakes, to lab values that were not known when the patient enrolled, to investigator fraud. Hopefully in a well-executed trial, this group of patients will be small enough that results do not depend
on whether these patients are included. But there are no guarantees, and in the end, one analysis must be selected as the official trial result. Should false inclusions be contained in this analysis? One argument to exclude data from such patients is that they do not come from the intended study population. It is difficult to know whether their inclusion in the analysis represents a conservative or anticonservative bias. An argument to include them is that this sort of mistake will inevitably happen in real life, so it is best to analyze data that way. Senn (1, p. 157) and Friedman et al. (2, ch. 16) give several other arguments on each side. Once again, it is important that investigators think through their specific cases prior to knowing the analysis results. In our opinion, it is generally best to analyze according to the principle that "if enrolled, then analyzed." A variant on this question is, "If I discover a false inclusion during the trial, should follow-up on that patient be discontinued even though I intend to analyze the data I have thus far collected?" This question has almost no good answer. If the answer is "Yes," then potential late adverse events may be lost. If the answer is "No," then the investigator may be exposing the patient to an inappropriate risk inherent in the therapy. In our opinion, this situation must be handled on a case-by-case basis, with investigators wrestling with the issues and erring, if necessary, on the side of the patient's best interests and the most disclosure about the nature of the mistake.

3.3 What Do We Do if Data Are Missing?

Missing data are inevitable in a study that involves humans. Data go missing for many reasons: patients drop out or withdraw consent, patients miss scheduled visits, urine samples are lost, records are lost, required tests are not conducted, test results are found to be technically inadequate, memory banks in recording devices become full, and so on. Furthermore, many different data items can be missing, and they are not always in the same patients. A patient who relocates and loses contact with the study center will have all data missing after a certain time point. A patient may have complete follow-up data but be missing an important baseline lab measure. A patient cannot make a certain scheduled visit but may have full information from all other visits. If data were always missing because of random causes, then it would be sensible to analyze the data that are actually collected and ignore missing cases. Most statistical software packages do this by default. But what assurance is there that the nonmissing data are representative of the missing data? In particular, if the sickest patients are dropping out or skipping scheduled visits because of their condition, or if caregivers are not prescribing certain tests because it would put a very sick patient under excessive stress, then the nonmissing data values are not representative of the missing ones, and any analyses that ignore missingness will be biased. Much statistical research has focused on the topic of statistical adjustment for missing data (3,4). But the fundamental reality is that missing data are still missing, and any statistical adjustment is only as good as the assumptions employed in the adjustment. It is important to assess and disclose the potential impact of missing data on study conclusions through sensitivity analyses. It is even more important to expend maximal energy to prevent the data from going missing in the first place.
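As a small illustration of the kind of sensitivity analysis mentioned above, the sketch below (hypothetical data) compares a complete-case estimate of a difference in response rates with simple best-case and worst-case fills of the missing outcomes. It is only a bounding exercise, not the model-based adjustment of references 3 and 4.

```python
# None marks a missing outcome; 1 = response, 0 = no response (invented data).
treatment = [1, 1, 0, 1, None, None, 1, 0, 1, 0]
control   = [1, 0, 0, 1, 0, None, 0, 0, 1, 0]

def rate(outcomes, fill=None):
    """Response rate, either dropping missing values or filling them with 0/1."""
    values = [fill if y is None else y for y in outcomes]
    values = [y for y in values if y is not None]
    return sum(values) / len(values)

scenarios = {
    "complete case": (None, None),        # ignore missing outcomes entirely
    "worst case for treatment": (0, 1),   # missing treated fail, missing controls respond
    "best case for treatment": (1, 0),
}
for name, (fill_trt, fill_ctl) in scenarios.items():
    diff = rate(treatment, fill_trt) - rate(control, fill_ctl)
    print(f"{name:>26}: estimated difference {diff:+.2f}")
```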
3.4 What Is the Value of Subgroup Analyses or Responder Analyses?

At the conclusion of a trial, investigators often wish to know if a therapy was especially effective for particular types of patients. A subgroup analysis is one where specific subclasses of patients are selected and their data are analyzed. Typical subgroups are defined by gender, race, geographical location, or many disease-specific categorizations. Here, the analysis is typically well defined (it is usually the primary objective analysis applied to a subset of patients), but the number of subgroups that could be analyzed is practically limitless. Subgroup analyses are often presented graphically, with confidence intervals for the treatment effect in each subgroup presented together in one graph. It is not surprising that the treatment effect is not the same in all subgroups. Inevitably, some will have larger confidence intervals than others (because of varying subgroup sizes or different subgroup variances), and some can seem to be different from the others in the location of the confidence interval. It is important to remember that if enough subgroups are tested, probably every possible finding will be observed, so results should be interpreted with caution, and researchers should examine other studies to corroborate any apparent findings. A related analysis is a post-hoc responder analysis. Here, patients are identified as being a "responder" to a therapy according to some measure (e.g., still alive, or blood pressure adequately controlled), and the task of the analyst is to determine whether any subgroupings, baseline measures, or early indicators significantly predict who the responders will be. This method can be helpful in generating hypotheses for additional research. But because of the potentially limitless number of covariates that can be analyzed, these results should be viewed with extreme caution.
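The following sketch (invented counts) computes the subgroup risk differences and 95% Wald confidence intervals that such a graph would display, showing how interval width varies with subgroup size; it is illustrative only and does not address the multiplicity problem described above.

```python
from math import sqrt

# (responders_trt, n_trt, responders_ctl, n_ctl) per subgroup -- invented counts
subgroups = {
    "male":    (30, 200, 40, 200),
    "female":  (12, 80, 15, 75),
    "elderly": (5, 30, 9, 28),
}
for name, (r1, n1, r0, n0) in subgroups.items():
    p1, p0 = r1 / n1, r0 / n0
    diff = p1 - p0
    se = sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    print(f"{name:>8}: risk difference {diff:+.3f}, "
          f"95% CI ({diff - 1.96 * se:+.3f}, {diff + 1.96 * se:+.3f})")
```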
3.5 How Should Analyses Be Handled If There Is Poor Compliance to the Treatment Regimen?

In some studies, such as certain medical device studies, compliance is nearly perfect (e.g., if a pacemaker is implanted, it does what it is programmed to do; noncompliance is rare and might be caused by device malfunction or physician's programming error but not by lack of patient cooperation). In other studies, compliance is much poorer (e.g., when a patient must remember to take a combination of poorly tolerated medications on different dosing schedules). It is often difficult or impossible to determine the extent of the noncompliance accurately, although this fact should not discourage an investigator from attempting its assessment. This assessment is more for the purpose of defending trial conduct and interpretation than to decide whether to analyze any particular patient's data. At one extreme, the data from all patients are analyzed regardless of compliance, which is consistent with the intention-to-treat approach. This method usually makes sense whenever the intention-to-treat analysis
approach makes sense. At the other extreme, absolutely perfect or at least very good compliance is required for a patient to be included in an analysis. This analysis makes sense if the aim is to determine whether a treatment works under ideal circumstances or to identify adverse effects of a treatment when applied exactly as intended. It is usually better to analyze more data than less.
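One way to examine the middle ground between these two extremes is to vary the compliance threshold for inclusion and watch how the estimate moves, as in the following sketch (hypothetical records; the thresholds are arbitrary).

```python
# (arm, fraction of prescribed doses taken, outcome) -- invented records
records = [
    ("A", 1.00, 1), ("A", 0.95, 1), ("A", 0.60, 0), ("A", 0.30, 0), ("A", 0.90, 1),
    ("B", 1.00, 0), ("B", 0.85, 1), ("B", 0.50, 0), ("B", 0.20, 1), ("B", 0.95, 0),
]

def estimated_effect(min_compliance):
    kept = [(arm, y) for arm, c, y in records if c >= min_compliance]
    def rate(target):
        grp = [y for arm, y in kept if arm == target]
        return sum(grp) / len(grp) if grp else float("nan")
    return rate("A") - rate("B")

for threshold in (0.0, 0.5, 0.8, 0.95):
    print(f"include if compliance >= {threshold:.2f}: "
          f"estimated effect {estimated_effect(threshold):+.2f}")
```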
4 RECOMMENDATIONS
Hopefully, it is clear at this point that the resolution to most issues in choosing an analysis dataset is seldom simple and that experts can and do disagree. The following sections represent our opinions on some approaches that we have found useful in determining the makeup of an analysis dataset from a clinical trial.

4.1 Wrestle with the Issues

It is not possible to eliminate or control all biases in a clinical investigation. It is usually not possible to estimate their magnitudes, and in some cases, it is impossible even to know the direction of bias. But it is possible to understand many of them and take steps to mitigate them, especially in trial conduct but also in data analysis. It can be hard work, but the result of this effort is a great enhancement in the quality and validity of the results.

4.2 Prespecify Wherever Possible

Not all issues can be anticipated, but that should not be an excuse for not trying. Identify and document your planned approach to selecting an analysis cohort in advance.
4.3 Take the Conservative Course

When biases cannot be controlled, seek to align them against the conclusion that you wish to reach, if possible. Then, if you achieve statistical significance regarding your desired conclusion, you have firm evidence.

4.4 Know That What Is Conservative in One Setting May Not Be in Another

As previously mentioned, the intention-to-treat principle is usually conservative in a trial of superiority, although it is not conservative in a trial of equivalence or noninferiority. At times, what is conservative for an efficacy objective may not be conservative for a safety objective.

4.5 Favor General Applicability Over Pure Science

Several legitimate reasons can be cited to study a therapy under ideal conditions and also under more realistic conditions. When in doubt, we have found it helpful to favor the realistic side and to attempt to include rather than exclude data.

4.6 Conduct Sensitivity Analyses

Check the robustness of your conclusions to varying definitions of the analysis cohort, as well as missing data assumptions and analysis techniques. Disclose in the final report or manuscript whether the conclusions change under reasonable deviations from your planned approach.

4.7 Account Painstakingly for How the Analysis Dataset Was Selected

Account carefully in any final report for every patient that is included in or excluded from each analysis. In a journal article, we have less room to describe this in detail, but it is still good practice to account for patients' data in summary fashion. The CONSORT statement (5) provides a checklist and example statements to illustrate how journal articles should report on methods in general (including specifically the numbers analyzed) for randomized clinical trials. Many of these principles apply to nonrandomized studies as well.

4.8 Illustrative Example

The Lung Health Study (6) provides an excellent example of several issues presented above. The intent of this randomized trial was to study three treatments for chronic obstructive pulmonary disease, which is a
major cause of morbidity and mortality that occurs almost exclusively in smokers. The three treatments were 1) smoking intervention plus an inhaled bronchodilator, 2) smoking intervention plus placebo, and 3) no intervention. The primary goal was to determine whether the treatments differentially influence the rate of decline in forced expiratory volume in 1 second (FEV1 ). It was intended that 6000 patients be enrolled and followed at annual visits for 5 years. The manuscript in Reference 6 provides an unusually detailed description of the study methods, which includes paragraphs on design, entry criteria, patient recruitment, spirometry, compliance monitoring, follow-up procedures, quality control, data management, and statistical methods. In a table in the Results section, the investigators report the follow-up rates at each annual visit by treatment group, with a maximum follow-up rate of 95% in the intervention plus bronchodilator group at year 5, and a minimum follow-up rate of 88% that occurred at one time in each of the three arms. In year 5, follow-up rates were at least 94% in all three groups. The generally high follow-up rates and especially the 94–95% follow-up rates in year 5 were in part the result of extra efforts by the investigators to ensure attendance at the fifth annual visit by as many surviving participants as possible. This result is important because the fifth annual visit data had the greatest influence on the computation of rates of change of the outcome variable and gave the most accurate estimate of the participant’s lung function status at the time they exited the study. It is possible to estimate the rate of decline in pulmonary function if a person has at least two annual visits (i.e., baseline and at least one follow-up measurement). The implicit assumption in such an analysis is that if a visit is missed, then the outcome data are missing at random (i.e., the missingness of the measurement is independent of the value of the outcome variable itself). This assumption is unlikely to be true in practice. A person with seriously impaired lung function is more likely to be sick or incapacitated, and therefore more likely to miss a visit, than a similar person with better lung function. In this case, no adjustments for
nonrandom missingness were applied in the estimates of rates of decline in lung function. Similarly, data for participants who died during the course of the study were included for all lung function measurements made prior to the time of death, and the data were entered into the analysis of the outcome. The manuscript points out several details that bolster the reader's understanding of the analysis cohort. Among these are statements such as the following: "At the end of the study, the whereabouts and vital status of only 21 participants (0.4%) were unknown." "Of the 5887 participants, 149 died during the 5-year follow-up period." In captions in several figures, statements disclose how missing data values were handled: "Nonattenders of follow-up visits are counted as smokers" and "Participants not attending the visit were classified as noncompliant." In an assessment of patient compliance to the bronchodilation regimen, the investigators report that 20–30% of patients who attended follow-up visits did not bring in their medication canisters for weighing, and this missingness probably represents a source of bias. The investigators discuss this bias, along with overestimates and underestimates of the bias, and report that by either measure, inhaler compliance did not differ between the intervention plus bronchodilator group and the intervention plus placebo group. This example demonstrates several principles found earlier in this article. First, analysis was by intention to treat, which was conservative in this case. (The manuscript words it as a "comparison of the randomly assigned groups.") Strong efforts were made to prevent losses to follow-up, with a high degree of success. Researchers had an understanding of and presented a disclosure of biases involved in missing data items, as well as a sensitivity analysis of how the missing compliance data were handled. And finally, researchers accounted carefully for patients at various time points and how it impacted each analysis.

4.9 Summary

The process of selecting a dataset to represent the official results of a clinical trial is far from trivial. The principles involved are not
universally accepted, and even when there is general agreement, there are often difficulties in specific data items that make implementing those principles difficult or even impossible. Nevertheless, some common terms can be used by researchers to describe their approaches to data selection and analysis, and common situations can develop wherein researchers can learn from each other. If this article can be boiled down to its simplest essence, it is this: Selecting a clinical trial analysis dataset is much more difficult than it seems. Therefore, think hard about the issues and potential biases, take a conservative course when you are unsure, and openly disclose how the final datasets were selected.
The difficulty involved in selecting an analysis dataset is just another of the many challenges—legal, ethical, scientific, practical—involved in medical research on humans. But in carefully adhering to good scientific principles and in paying close attention to biases and limitations—which include the selection of data for analysis—the impact of medical research is substantially enhanced.

REFERENCES

1. S. Senn, Statistical Issues in Drug Development. New York: John Wiley & Sons, 1997.
2. L. M. Friedman, C. D. Furberg, and D. L. DeMets, Fundamentals of Clinical Trials, 3rd ed. New York: Springer-Verlag, 1998.
3. D. B. Rubin, Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons, 1987.
4. J. L. Schafer, Analysis of Incomplete Multivariate Data. New York: Chapman and Hall, 1997.
5. D. Moher, K. F. Schulz, and D. G. Altman, for the CONSORT Group, The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. Lancet 2001; 357: 1191–1194.
6. N. R. Anthonisen, J. E. Connett, J. P. Kiley, M. D. Altose, W. C. Bailey, A. S. Buist, W. A. Conway Jr, P. L. Enright, R. E. Kanner, P. O'Hara, G. R. Owens, P. D. Scanlon, D. P. Tashkin, and R. A. Wise, for the Lung Health Study Research Group, Effects of smoking intervention and the use of an anticholinergic bronchodilator on the rate of decline of FEV1. JAMA 1994; 272: 1497–1505.
FURTHER READING

P. Armitage, Exclusions, losses to follow-up, and withdrawals in clinical trials. In: S. H. Shapiro and T. A. Louis (eds.), Clinical Trials: Issues and Approaches. New York: Marcel Dekker, 1983, pp. 99–113.
S. Piantadosi, Clinical Trials: A Methodologic Perspective. New York: John Wiley & Sons, 1997.
The ICH Guidelines are also useful references. See especially E3: International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use, ICH Harmonized Tripartite Guideline, Structure and Content of Clinical Study Reports (E3), 1995. Available in the Federal Register on July 17, 1996 (61 FR 37320) and at: http://www.ich.org/LOB/media/MEDIA479.pdf and also at: http://www.fda.gov/cder/guidance/iche3.pdf.
CROSS-REFERENCES

Bias
Missing Values
Intention-to-Treat Analysis
Equivalence Trial
APPLICATION OF NOVEL DESIGNS IN PHASE I TRIALS
ELIZABETH GARRETT-MAYER
Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, Maryland

Phase I dose finding studies have become increasingly heterogeneous in regard to the types of agents studied, the types of toxicities encountered, and the degree of variability in outcomes across patients. For these reasons and others, when designing a phase I study it is especially important to select a design that can accommodate the specific nature of the investigational agent. However, most current phase I studies use the standard 3 + 3 design, which has been demonstrated to have poor operating characteristics, allows little flexibility, and cannot properly accommodate many current dose-finding objectives. Novel phase I designs have been introduced over the past 20 years that have superior properties and are much more flexible than the 3 + 3 design, yet they have not been fully embraced in practice. These newer designs are in general not cumbersome and provide a more accurate and precise estimate of the appropriate dose to pursue in phase II and later studies.

1 OBJECTIVES OF A PHASE I TRIAL

It is important to be reminded of the goals of the phase I trial when selecting a design. Phase I studies are generally dose-finding studies, where the goal of the trial is to identify the optimal dose to take forward for further testing (usually for efficacy testing). For ethical reasons, patients cannot be randomized to doses: one cannot treat patients at high doses until lower ones have been explored. The standard approach, therefore, has been to administer a low dose of an agent to a small number of patients and, if the dose appears to be well-tolerated within those patients, to administer a higher dose to another small number of patients. Historically, because phase I trial methodology has its roots in the development of cytotoxic drugs in cancer research, the optimal dose was considered the maximally tolerated dose (MTD): the highest dose that has an acceptable level of toxicity. But there are other optimality criteria, such as the minimally effective dose (MinED): the minimum dose that shows sufficient efficacy.

2 STANDARD DESIGNS AND THEIR SHORTCOMINGS

2.1 The Standard 3 + 3 Design

The standard 3 + 3 is the most commonly used and well-known phase I design (1). However, it is also probably the most inappropriate in most cases. It is an algorithmic design in which a set of rules is followed for dose escalation or de-escalation based on observed dose-limiting toxicities (DLTs); at the end of the trial, a maximally tolerated dose (MTD) is declared without any analysis of the data. The algorithm works as follows. Enter three patients on dose level k:

1. If 0 of 3 patients have a DLT, escalate to dose level k + 1.
2. If 2 or 3 of 3 patients have a DLT, de-escalate to dose level k – 1.
3. If 1 of 3 patients has a DLT, add an additional 3 patients at dose level k.
   (a) If 1 of the 6 patients treated at k has a DLT, escalate to dose level k + 1.
   (b) If 2 or more of the 6 patients at k have a DLT, de-escalate to dose level k – 1.

When de-escalations occur, an additional 3 patients are usually treated at the dose unless 6 patients have already been treated at that dose. The MTD is then defined as the highest dose at which 0 or 1 of 6 patients has experienced a DLT. Note that there are slight variations to this design, but the way it is described here is one of the more common implementations (1).
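As a concrete illustration of these rules, the following sketch simulates one trial under a simplified version of the algorithm described above; the true DLT probabilities per dose level are invented, and lower dose levels are not re-expanded after a de-escalation, which is one of the simplifications.

```python
import random

def run_3_plus_3(true_dlt_probs, seed=1):
    """Simulate one trial under (a simplified form of) the 3 + 3 rules above."""
    rng = random.Random(seed)
    n_treated = [0] * len(true_dlt_probs)
    n_dlt = [0] * len(true_dlt_probs)

    def treat(k, n):
        n_dlt[k] += sum(rng.random() < true_dlt_probs[k] for _ in range(n))
        n_treated[k] += n

    k = 0
    while True:
        treat(k, 3)
        if n_dlt[k] == 1:
            treat(k, 3)                     # 1/3 DLTs: add three more patients
        if n_dlt[k] <= 1 and k + 1 < len(true_dlt_probs):
            k += 1                          # acceptable toxicity: escalate
        else:
            break                           # 2+ DLTs, or highest dose reached
    if n_dlt[k] > 1:
        k -= 1                              # too toxic: de-escalate one level
    if k >= 0 and n_treated[k] < 6:
        treat(k, 3)                         # expand the candidate MTD to six patients
    while k >= 0 and n_dlt[k] > 1:
        k -= 1                              # still too toxic: step down (not re-expanded)
    return k, n_treated, n_dlt              # k == -1 means no tolerable dose was found

mtd, treated, dlts = run_3_plus_3([0.05, 0.10, 0.20, 0.35, 0.55])
print("declared MTD (dose level index):", mtd)
print("patients treated per level:", treated, "| DLTs per level:", dlts)
```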
Many agents currently under evaluation in phase I studies do not have the same toxicity concerns that cytotoxic agents have had in the past. The standard 3 + 3 phase I design makes several assumptions that are not practical for many current phase I studies. Some of these assumptions are the following:

1. As dose increases, the chance of a response increases (i.e., there is a monotonic increasing association between dose and response).
2. As dose increases, the chance of toxicity increases (i.e., there is a monotonic increasing association between dose and toxicity).
3. The dose that should be carried forward should be the highest dose with acceptable toxicity.
4. Acceptable toxicity is approximately 15% to 30% (regardless of the definition of a toxic event).

These assumptions are very limiting and are not practical in many of our current settings. For example, angiogenesis inhibitors or colony-stimulating factors may have very low probability of toxicity, and a dose-finding study may be more concerned with evaluating pharmacodynamic response over an increasing range of doses. As such, we would not want to escalate until we saw toxicity: we would want to escalate until we saw sufficient pharmacodynamic response. We may also assume that there may be a range of doses that provides a pharmacodynamic response, but doses below or above this range may be ineffective. This invalidates the assumption of monotonicity between dose and response.

2.1.1 What Is Wrong with the Standard 3 + 3 Design?

There are a number of flaws that can be identified in the standard 3 + 3 design, and a few are highlighted here. One of the most problematic issues with this design is that it does not allow for the desired toxicity rate to be user-specified. For example, in certain cancers with very high mortality rates (e.g., brain cancer), the acceptable toxicity rate (i.e., the proportion of patients who experience a DLT) is higher than in some other more curable and less deadly cancers. Also, acceptable toxicity rates for short-term treatments (e.g., several cycles
of chemotherapy) will likely be higher than for long-term treatments (e.g., a daily dose of tamoxifen for 5 years for breast cancer survivors). The standard 3 + 3 design is prespecified in such a way that these considerations cannot be accommodated. The target toxicity rate for these trials is not well-defined (although it is often assumed that they target a rate between 15% and 30%) because it depends heavily on the spacing between prespecified dose levels and how many dose levels are included. In many cases, the 3 + 3 design will significantly underestimate or overestimate the desired toxicity rate. Another major shortcoming of the standard 3 + 3 is that, when choosing the MTD, it ignores any toxicity information observed at lower doses and has very poor precision. This is a common feature of algorithmic designs, but it is still a serious inefficiency. Newer designs estimate a dose-toxicity curve after all the data have been observed to not only help in choosing the most appropriate MTD, but also to provide an estimate of precision. By using the data that have accrued throughout the trial at all doses, there will be greater precision in the estimate of the MTD. Even if the target toxicity rate is approximately 20% and the goal of the trial is to find the MTD, the 3 + 3 will be a less efficient and more error-prone way of identifying a dose than model-based approaches described later in this chapter. Lastly, the 3 + 3 design will not achieve the objectives of a phase I trial when the primary outcome of interest for dose selection is not the maximally tolerated dose. When identifying the MinED, escalating based on DLTs is illogical. However, the standard 3 + 3 only allows for toxicity-based outcomes.

2.2 The Accelerated Titration Design

There have been some improvements made to the standard 3 + 3 approach. Simon et al. (2) introduced the accelerated titration design, which has two major advantages over the standard design: (1) it treats only one or two patients per dose level until a DLT is observed; and (2) at the end of the trial, all of the data are incorporated into a statistical model for determining the appropriate dose to recommend for phase II testing. These
designs also allow for intrapatient dose escalation. It is clearly a more ethical and more appealing design to patients (patients are less likely to be treated on ineffective doses and can be escalated if they tolerate the treatment). This design still has the shortcoming that escalation is determined based on the same rules as the 3 + 3. However, this is mitigated to some extent by choosing the MTD based on a fitted statistical model. The accelerated titration design is also limited in that it is only appropriate for dose escalation that is toxicity based: for agents where toxicity is not the major outcome for dose finding, the accelerated titration design should not be used.

3 SOME NOVEL DESIGNS
In the past 20 years, a number of designs have been introduced for increased efficiency in phase I trials, but, unfortunately, most of these designs have not been adopted into mainstream use despite their significant improvements as compared with the standard algorithmic designs. Part of the reason for this is that they are based on statistical models and those without a strong background in statistics may be wary of them. There is a misconception that these novel designs are not as safe as the standard 3 + 3, but the evidence is quite to the contrary: many investigators have shown that, for example, the continual reassessment method is safer than the standard 3 + 3 because it exposes fewer patients to toxic doses and requires a smaller sample size (3–7). Even so, proponents of the 3 + 3 continue to promote it with the idea that such a simple and standard design is safer than the alternatives and that its historic presence is a testament to its utility. An adaptive design is one in which certain decisions about how to proceed (e.g., what doses to give patients, how many patients to treat, which treatment to apply) are made for future patients in the trial based on results observed on all patients treated earlier in the trial. Adaptive phase I designs are designs in which the dose for a future patient is chosen based on the results from patients already treated on the study. There are several ways
that adaptive designs can be used in phase I studies, and just a few are described here. These designs have well-documented advantages over nonadaptive approaches and are becoming more commonly seen in many phase I clinical trials (3–7).

3.1 The Continual Reassessment Method

The continual reassessment method (CRM) was first introduced by O'Quigley et al. in 1990 (8) and has undergone many modifications in the past 16 years. It is fundamentally different from the standard 3 + 3 design and most other "up and down" designs because (1) it relies on a mathematical model, (2) it assumes that dose is continuous (i.e., doses are not from a discrete set), and (3) it allows and requires the user to select a target toxicity rate. The original CRM used a Bayesian framework and had relatively few safety considerations. Newer and more commonly used versions have included several safety concerns (e.g., not allowing dose to increase by more than 100%) and a simpler estimation approach (i.e., maximum likelihood estimation). Although there are a number of CRMs currently used in practice, the model of Piantadosi et al. (9) will be the approach described here because of its simplicity. For other modifications, see Goodman et al. (4), Faries et al. (10), and Moller (11). The design requires several user-specified inputs:

1. The target dose-limiting toxicity rate
2. A dose that is expected to have a low level of toxicity (e.g., 5% or 10%)
3. A dose that is expected to have a high level of toxicity (e.g., 90% or 95%)
4. The number of patients per dose level (usually 1, 2, or 3)

The number of patients per cohort usually will depend on the anticipated accrual. The mathematical model used is prespecified; with the knowledge of the first three items listed above, the dose for the first cohort of patients can be calculated. This is demonstrated in Figure 1A where three assumptions are made: a 5% DLT rate at a dose of 100 mg, a 95% DLT rate at a dose of 900 mg, and a logistic dose-toxicity model. (For more
detail on the logistic dose-toxicity model, see Piantadosi et al. [9]). The dose-toxicity curve is drawn according to this information, with a horizontal line at the target toxicity rate (25%). The dose corresponding to a 25% DLT rate is 343 mg, and this is the dose for our first cohort of patients. Let us assume that two patients are treated at dose level 1 (dose = 343 mg). The DLT information observed has three possibilities: (1) neither patient has a DLT, (2) one patient has a DLT, or (3) both patients have a DLT. We add this information to our presumed information described in the previous paragraph. In Figures 1B to 1D, the updated
dose-toxicity curves are drawn using this new information, corresponding to the three possible outcomes, with the toxicity outcomes included in the graphs (0 = no DLT, 1 = DLT). If neither patient has a DLT, the dose is escalated to 418 mg (see Figure 1B); if one patient has a DLT, the dose is decreased to 218 mg (see Figure 1C); if both have a DLT, then dose is decreased to 147 mg for the next cohort (see Figure 1D). The approach continues where, after each cohort of two patients is observed, the additional data are incorporated to update the estimated dose-toxicity curve. The CRM is considered "adaptive" because we are using the data collected as we proceed to determine the doses for future cohorts. And, unlike the algorithmic designs, when choosing a dose for the next cohort, we use information collected from all previous cohorts and not just the last cohort of patients.

[Figure 1: four panels plotting probability of DLT against dose (mg).]

Figure 1. (A) Estimated dose-toxicity curve based on a priori information about likely low and high toxicity doses. The starting dose is 343 mg. (B) Updated dose-toxicity curve if no dose-limiting toxicities (DLTs) are seen in two patients treated at 343 mg. The updated dose for next cohort would be 418 mg. (C) Updated dose-toxicity curve if one DLT is seen in two patients treated at 343 mg. The updated dose for next cohort would be 233 mg. (D) Updated dose-toxicity curve if two DLTs are seen in two patients treated at 343 mg. The updated dose for next cohort would be 147 mg.
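The following sketch illustrates the flavor of this update in code. It assumes a two-parameter logistic model on a rescaled dose, enters the prior guesses (5% DLT at 100 mg, 95% DLT at 900 mg) as fractional pseudo-observations, and refits by maximum likelihood after each cohort. It is not the exact estimation procedure of Piantadosi et al. (9), so the doses it returns are close to, but not identical to, the values quoted above.

```python
# Minimal CRM-style sketch: logistic dose-toxicity model fit by maximum
# likelihood to prior pseudo-observations plus observed (dose, DLT) outcomes.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit

TARGET = 0.25   # target DLT rate

def neg_log_lik(params, x, y):
    a, b = params
    p = np.clip(expit(a + b * x), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def recommended_dose(observed_doses, observed_dlts):
    # Prior guesses enter as two pseudo-observations with fractional responses.
    x = np.array([100.0, 900.0] + list(observed_doses)) / 100.0
    y = np.array([0.05, 0.95] + list(observed_dlts), dtype=float)
    a, b = minimize(neg_log_lik, x0=[0.0, 0.0], args=(x, y)).x
    return float(100.0 * (logit(TARGET) - a) / b)   # dose with 25% estimated DLT risk

start = recommended_dose([], [])
print(f"starting dose: {start:.0f} mg")
# Suppose two patients are treated at the starting dose and neither has a DLT:
print(f"dose after 0/2 DLTs: {recommended_dose([start, start], [0, 0]):.0f} mg")
```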
There are a variety of approaches that may be used to determine when the CRM trial should stop. Suggested approaches include stopping when the recommended dose differs by no more than 10% from the current dose (9), defining a prespecified number of patients (4), or basing the decision on a measure of precision of the curve (12). For a detailed discussion of sample size choice, see Zohar and Chevret (13). Although in theory CRMs are markedly better than the standard 3 + 3 design, they do have several potential limitations. First, they require a statistician to be intimately involved with the daily coordination of the trial. The dose for a new patient depends on the data observed up until the new patient is enrolled and on the statistical model that has been chosen. Determining the next dose is not a trivial endeavor and requires that a statistician be available and familiar with the trial to perform these calculations on short notice. Second, the mathematical model that is assumed may not be flexible enough to accommodate the true or observed dose-toxicity relationship. The approach by Piantadosi et al. (9) that we have highlighted here is more flexible than some others, but in some cases, the chosen model may not be robust.

3.2 Extensions of the CRM for Efficacy Outcomes

Although the CRM was developed to be used for dose-toxicity relationships, dose-finding based on dose-efficacy relationships can also be explored. This is most appropriate in the setting where a targeted therapy is being evaluated and, for example, there is a desired level of inhibition of a particular target. The dose of interest could be defined, for example, as the dose that achieves inhibition of the target in at least 80% of patients. Pharmacokinetic parameters (e.g., area under the curve) could also be the basis for the design of a CRM study.
3.3 Bayesian Adaptive Designs

Bayesian adaptive designs comprise a wide range of designs that are all based on the same general principles. For phase I studies, Bayesian adaptive designs have many similarities to the CRM described in the previous sections. They tend to be more complicated mathematically, so many of the details of these designs are beyond the scope of this article. However, the basic elements required for Bayesian decision making are (1) a statistical model, (2) a prior distribution that quantifies the presumed information about the toxicity of doses before the trial is initiated, (3) a set of possible actions to take at each look at the data, and (4) a "gain" function. We will consider each of these in turn.
1. The statistical model. This is a mathematical representation of the relationship between dose and toxicity. In the CRM we have described, we assumed a logistic model, but there are many possible models.
2. A prior distribution. This is similar to the quantification of toxicity a priori that is shown in Figure 1A. In the Bayesian approach, this is more formally described using an actual probability distribution but will take a form similar to that shown in Figure 1A. Often these are taken to be quite conservative by assuming that high doses tend to be very toxic so that the dose escalation is less likely to investigate high doses unless there is relatively strong evidence that they are safe, based on the accumulated information at lower doses.
3. Possible actions. These are the actions that can be taken when new patients are enrolled and include all of the possible doses that could be given to them.
4. Gain function. This is an attribute of the Bayesian approach that is quite different from the CRM. The gain function reflects what is best for the next cohort of patients. This is particularly relevant in phase I trials because of the tradeoff of efficacy and toxicity: a higher dose likely means an increase in both, and
the gain function allows us to include that tradeoff in our decision making about the next cohort. However, the gain function can take many forms and need not represent the tradeoff between toxicity and efficacy.
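A toy numerical version of these four ingredients, using a one-parameter dose-toxicity model on a grid, a flat prior, and a gain function that penalizes squared distance from a 25% target DLT rate, might look as follows; it is purely illustrative and does not correspond to any published design (in practice the prior would often be chosen conservatively, as noted above).

```python
import numpy as np

doses = np.array([100.0, 200.0, 400.0, 600.0, 900.0])   # the possible actions
# Model: P(DLT at dose d) = d / (d + theta), indexed by a single parameter theta.
thetas = np.linspace(100.0, 4000.0, 200)                 # grid of theta values
prior = np.ones_like(thetas) / thetas.size               # flat prior on the grid

def posterior(prior, data):
    """Update the grid prior with (dose, dlt) pairs via Bayes' rule."""
    post = prior.copy()
    for dose, dlt in data:
        p = dose / (dose + thetas)
        post *= p if dlt else (1.0 - p)
    return post / post.sum()

def gain(post, target=0.25):
    """Negative expected squared distance of each dose's DLT rate from the target."""
    p = doses[:, None] / (doses[:, None] + thetas[None, :])
    return -np.sum(post * (p - target) ** 2, axis=1)

data = [(200.0, 0), (200.0, 0), (400.0, 1)]   # invented outcomes observed so far
next_dose = doses[np.argmax(gain(posterior(prior, data)))]
print("dose chosen for the next cohort:", next_dose)
```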
mathematical details needed to fully describe the approach, but the general idea is shown in Figure 2. Each curve (i.e., ‘‘contour’’) in Figure 2 represents a set of efficacy–toxicity tradeoffs that would be considered equally desirable. The user must elicit information from the clinical investigator to determine what a target contour would be. Dose is escalated based on the combined outcomes of toxicity and efficacy, and new doses are determined as a function of previously observed results. Another adaptive design recently proposed uses a three category outcome variable to describe both efficacy and toxicity (22). This approach has many nice properties, such as maximizing efficacy with constraints on toxicity instead of allowing them equal importance. This may be more practical in many settings, especially when it is expected that the agent under investigation has very low toxicity.
These designs are described in detail by Gatsonis and Greenhouse (14), Whitehead (15), and Haines et al. (16). Descriptions aimed at clinical investigators are provided by Zhou (17) and Whitehead et al. (18). Zohar et al. (19) have developed publicly available software for conducting Bayesian dose-finding studies. 3.4 Efficacy and Toxicity for Dose Finding Recently developed dose escalation designs have formally accounted for the desire to find doses that have both acceptable toxicity and maximal efficacy. This is in response to the trend in recent years of applying standard designs to find MTDs and subsequently looking at efficacy outcomes to see which, if any, of the doses explored showed efficacy either based on clinical outcomes, such as tumor response, or based on correlative outcomes, such as modulation of a genetic marker known to be in the cancer pathway. A more efficient approach is to simultaneously maximize efficacy while minimizing toxicity. Thall and Cook (20, 21) have proposed an adaptive Bayesian method that chooses the optimal dose based on both efficacy and toxicity. The principles are like those described in the previous section, but with the added complexity of two outcomes. There are many
[Figure 2: contours in the plane of probability of efficacy (horizontal axis) versus probability of toxicity (vertical axis).]

Figure 2. An example of contours for Thall and Cook's adaptive Bayesian design using both efficacy and toxicity. The target contour is shown by the thick line. Several other contours are shown with thinner lines.

4 DISCUSSION

In recent years, there have been quite a few novel designs proposed for phase I trials, many of which are Bayesian and most of which are adaptive. The trend has been to steer away from the old-fashioned algorithmic designs that have been shown repeatedly to be inefficient. Chevret (23) has recently published a book on dose-finding methods that addresses many of the issues discussed here in more depth and detail. There are a number of approaches that could not be discussed in this chapter due to space considerations. Readers interested
in learning more about novel designs should also consider investigating the curve-free method (24) and the biased coin up-and-down design with isotonic regression (25) that have been introduced in recent years.

REFERENCES

1. L. Edler, Overview of phase I trials. In: J. Crowley (ed.), Handbook of Statistics in Clinical Oncology. New York: Marcel Dekker, 2001, pp. 1–34.
2. R. M. Simon, B. Freidlin, L. V. Rubinstein, S. Arbuck, J. Collins, and M. Christian, Accelerated titration design for phase I clinical trials in oncology. J Natl Cancer Inst. 1997; 89: 1138–1147.
3. B. Storer, An evaluation of phase I clinical trial designs in the continuous dose-response setting. Stat Med. 2001; 20: 2399–2408.
4. S. N. Goodman, M. L. Zahurak, and S. Piantadosi, Some practical improvements in the continual reassessment method for phase I studies. Stat Med. 1995; 14: 1149–1161.
5. E. Garrett-Mayer, The continual reassessment method for dose-finding: a tutorial. Clin Trials. 2006; 3: 57–71.
6. J. O'Quigley, Another look at two phase I clinical trial designs. Stat Med. 1999; 18: 2683–2690.
7. C. Ahn, An evaluation of phase I cancer clinical trial designs. Stat Med. 1998; 17: 1537–1549.
8. J. O'Quigley, M. Pepe, and L. Fisher, Continual reassessment method: a practical design for phase I clinical trials in cancer. Biometrics. 1990; 46: 33–48.
9. S. Piantadosi, J. D. Fisher, and S. Grossman, Practical implementation of a modified continual reassessment method for dose-finding trials. Cancer Chemother Pharmacol. 1998; 41: 429–436.
10. D. Faries, Practical modifications of the continual reassessment method for phase I cancer clinical trials. J Biopharm Stat. 1994; 4: 147–164.
11. S. Moller, An extension of the continual reassessment methods using a preliminary up-and-down design in a dose finding study in cancer patients, in order to investigate a greater range of doses. Stat Med. 1995; 14: 911–922.
12. N. Ishizuka and Y. Ohashi, The continual reassessment method and its applications: a Bayesian methodology for phase I cancer clinical trials. Stat Med. 2001; 20: 2661–2681.
13. S. Zohar and S. Chevret, The continual reassessment method: comparison of Bayesian stopping rules for dose-ranging studies. Stat Med. 2001; 20: 2827–2843.
14. C. Gatsonis and J. Greenhouse, Bayesian methods for phase I clinical trials. Stat Med. 1992; 11: 1377–1389.
15. J. Whitehead, Using Bayesian decision theory in dose-escalation studies. In: S. Chevret (ed.), Statistical Methods for Dose-Finding Experiments. Chichester, UK: Wiley, 2006, pp. 149–171.
16. L. M. Haines, I. Perevozskaya, and W. F. Rosenberger, Bayesian optimal designs for phase I clinical trials. Biometrics. 2003; 59: 591–600.
17. Y. Zhou, Choice of designs and doses for early phase trial. Fundam Clin Pharmacol. 2004; 18: 1–7.
18. J. Whitehead, Y. Zhou, S. Patterson, N. D. Webber, and S. Francis, Easy-to-implement Bayesian methods for dose-escalation studies in healthy volunteers. Biostatistics. 2001; 2: 47–61.
19. S. Zohar, A. Latouche, M. Tacconet, and S. Chevret, Software to compute and conduct sequential Bayesian phase I and II dose-ranging clinical trials with stopping rules. Comput Methods Programs Biomed. 2003; 72: 117–125.
20. P. F. Thall and J. D. Cook, Dose-finding based on efficacy-toxicity trade-offs. Biometrics. 2004; 60: 684–693.
21. P. F. Thall and J. D. Cook, Using both efficacy and toxicity for dose-finding. In: S. Chevret (ed.), Statistical Methods for Dose-Finding Experiments. Chichester, UK: Wiley, 2006, pp. 275–285.
22. W. Zhang, D. J. Sargent, and S. Mandrekar, An adaptive dose-finding design incorporating both toxicity and efficacy. Stat Med. 2006; 25: 9243–9249.
23. S. Chevret, ed. Statistical Methods for Dose-Finding Experiments. Chichester, UK: Wiley, 2006.
24. M. Gasparini and J. Eisele, A curve-free method for phase I clinical trials. Biometrics. 2000; 56: 609–615.
25. M. Stylianou and N. Flournoy, Dose-finding using the biased coin up-and-down design and isotonic regression. Biometrics. 2002; 58: 171–177.
FURTHER READING

J. O'Quigley, Dose-finding designs using the continual reassessment method. In: J. Crowley (ed.), Handbook of Statistics in Clinical Oncology. New York: Marcel Dekker, 2001, pp. 35–72.
B. Storer, Choosing a phase I design. In: J. Crowley (ed.), Handbook of Statistics in Clinical Oncology. New York: Marcel Dekker, 2001, pp. 73–92.

CROSS-REFERENCES

Continual reassessment method
Dose escalation design
Phase I trials
Bayesian approach
ASCOT TRIAL
JESPER MEHLSEN
Frederiksberg Hospital—Clinical Physiology & Nuclear Medicine, Frederiksberg, Denmark

One of the first randomized, placebo-controlled trials in hypertensive subjects (1) demonstrated that whereas treatment dramatically reduced mortality and the incidence of strokes, it did not prevent myocardial infarction. This paradox was confirmed by meta-analyses of randomized trials—all of which used standard diuretic and/or beta-blocker therapy—showing treatment effects similar to those predicted by prospective, observational studies on stroke incidence but not on the occurrence of coronary heart disease (CHD; 2). It has been speculated that adverse effects of the older antihypertensive drugs on serum lipids, glucose, and potassium could blunt the cardioprotection conferred by lowering of blood pressure.

1 OBJECTIVES

The rationale for the ASCOT study (3) was thus to address the question of whether a newer combination of antihypertensive agents, a dihydropyridine calcium channel blocker (CCB) and an angiotensin-converting enzyme (ACE) inhibitor, would produce greater benefits in terms of reducing CHD events than the standard beta-blocker/diuretic combination. The second main issue of ASCOT was whether lipid lowering with a statin would provide additional beneficial effects in those hypertensive patients with average or below average levels of serum cholesterol.

2 STUDY DESIGN

ASCOT involved two treatment comparisons in a factorial design—a prospective, randomized, open, blinded endpoint design comparing two antihypertensive regimens (blood pressure-lowering arm, ASCOT-BPLA; 4) and, in a subsample of those hypertensives studied, a double-blind, placebo-controlled trial of a lipid-lowering agent (lipid-lowering arm, ASCOT-LLA; 5). Between 1998 and 2000, patients were recruited to an independent, investigator-initiated, investigator-led, multicenter, prospective, randomized controlled trial. Patients were eligible for ASCOT-BPLA if they were aged 40–79 years at randomization, and had either untreated hypertension with a systolic blood pressure of 160 mm Hg or more, diastolic blood pressure of 100 mm Hg or more, or both, or they were treated for hypertension with systolic blood pressure of 140 mm Hg or more, diastolic blood pressure of 90 mm Hg or more, or both. In addition, the patients had to have at least three other cardiovascular risk factors and they were to be free of any previous cardiac events or current cardiac disease.

2.1 Endpoints

Primary objectives were as follows:
1) To compare the effects on the combined outcome of non-fatal myocardial infarction (MI) and fatal CHD of a beta-blocker-based regimen (atenolol), plus a diuretic (bendroflumethiazide-K) if necessary, with a calcium channel blocker-based regimen (amlodipine), plus an ACE inhibitor (perindopril) if necessary.
2) To compare the effect on the combined outcome of non-fatal MI and fatal CHD of a statin (atorvastatin) with that of placebo among hypertensive patients with total cholesterol < 6.5 mmol/L.

2.2 Drug Treatment

The ASCOT-BPLA used forced titration of the study drugs with specific add-on drugs, the first of which was the alpha-blocker doxazosin gastrointestinal transport system (GITS). The target for antihypertensive treatment was 140/90 mm Hg for nondiabetics and 130/80 mm Hg for patients with diabetes mellitus. The ASCOT-LLA used a standard dose of atorvastatin 10 mg for all patients assigned to active treatment without any specific target level for cholesterol.

2.3 Sample Size

Assuming an adjusted yearly rate of nonfatal myocardial infarction and fatal CHD events of 1.42% and an intention-to-treat effect of 15% reduction in risk, it was estimated that a sample size of 18,000 was required to generate 1150 primary endpoints. This sample size would provide 80% power to detect such an effect. In the ASCOT-LLA, a 30% reduction in cholesterol was expected and assumed to translate into a reduction in nonfatal MI and fatal CHD of 30%. Under these conditions, a sample of 9000 patients would have 90% power to detect such an effect.
of 1.42% and an intention-to-treat effect of a 15% reduction in risk, it was estimated that a sample size of 18,000 was required to generate 1150 primary endpoints. This sample size would provide 80% power to detect such an effect. In the ASCOT-LLA, a 30% reduction in cholesterol was expected and assumed to translate into a reduction in nonfatal MI and fatal CHD of 30%. Under these conditions, a sample of 9000 patients would have 90% power to detect such an effect.

2.4 Data Safety Monitoring Plan

The Data Safety Monitoring Committee (DSMC) performed unblinded interim analyses during the trial and used the symmetric Haybittle–Peto statistical boundary (critical value Z = 3) as a guideline for deciding whether to recommend early termination of the trial.
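A minimal numerical sketch in Python, using only the assumptions quoted above (roughly 1150 primary endpoints, a 15% relative risk reduction treated as a hazard ratio of 0.85, and a two-sided alpha of 0.05), shows how an event-driven power calculation of this kind can be reproduced approximately with a Schoenfeld-type formula; it is an illustration rather than the trial's actual computation, and the interim Z value is invented.

```python
from math import log, sqrt
from scipy.stats import norm

# Assumed design values (illustration only, not ASCOT's actual computation):
alpha = 0.05      # two-sided type I error
hr = 0.85         # hazard ratio corresponding to a 15% relative risk reduction
events = 1150     # number of primary endpoints the trial aimed to accrue

# Schoenfeld-type approximation with 1:1 allocation:
z_alpha = norm.ppf(1 - alpha / 2)
power = norm.cdf(sqrt(events * 0.25) * abs(log(hr)) - z_alpha)
print(f"Approximate power with {events} events: {power:.2f}")      # ~0.79

# Number of events needed for 80% power under the same assumptions:
z_beta = norm.ppf(0.80)
events_needed = (z_alpha + z_beta) ** 2 / (0.25 * log(hr) ** 2)
print(f"Events required for 80% power: {events_needed:.0f}")       # ~1190

# Haybittle-Peto monitoring guideline: recommend early termination only if
# an interim test statistic exceeds the critical value |Z| = 3.
z_interim = 2.4   # hypothetical interim value
print("Recommend early stopping:", abs(z_interim) > 3)
```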
3 RESULTS
In the Nordic countries, 686 family practices randomized patients, and in the United Kingdom and Ireland, 32 regional centers to which patients were referred by their family doctors recruited patients. A total of 19,342 patients were randomized, but 85 patients had to be excluded because of irregularities in blood-pressure measurements, resulting in 19,257 evaluable patients. Participants were well matched between groups; over 80% were on previous antihypertensive treatment, they were mainly white and male, and they had a mean age of 63 years, a mean body mass index (BMI) of almost 29 kg/m², a mean total cholesterol of 5.9 mmol/L, and a mean baseline sitting blood pressure of 164/95 mm Hg. The average number of the additional cardiovascular risk factors required for inclusion in the trial was 3.7. From the main study, 10,305 patients were found to be eligible for inclusion in ASCOT-LLA. In 2002, the Data Safety Monitoring Committee (DSMC) recommended that ASCOT-LLA be stopped on the grounds that atorvastatin had resulted in a highly significant reduction in the primary endpoint of CHD events compared with placebo. All patients in the ASCOT-LLA were offered atorvastatin 10 mg daily, to be continued to the end of ASCOT-BPLA. In 2004, the DSMC recommended the ASCOT-BPLA to be stopped as those patients allocated the atenolol-based regimen had significantly higher mortality as well as worse outcomes on several other secondary endpoints than those allocated the amlodipine-based regimen.

ASCOT-LLA accumulated 33,041 patient-years of follow-up (median, 3.3 years). At the close of follow-up for the lipid-lowering arm, complete information was obtained on 10,186 (98.8%) of the patients originally randomized. By the end of the study, 87% of patients originally assigned atorvastatin were still taking a statin, and 9% of those in the placebo group had been prescribed open-label statins. Comparing atorvastatin treatment to placebo, total and LDL-cholesterol were lowered by 1.0 mmol/L and 1.0 mmol/L (19% and 29%), respectively, and triglycerides were reduced by 14%. Changes in HDL-cholesterol concentrations were minimal, and blood pressure control throughout the trial was similar in the two groups. The primary endpoint of nonfatal myocardial infarction and fatal CHD was significantly lower, by 36% (hazard ratio 0.64 [95% CI 0.50–0.83], P = 0.0005), in the atorvastatin group than in the placebo group. There were also significant reductions in four of seven secondary endpoints, some of which incorporated the primary endpoint: total cardiovascular events including revascularization procedures (21%); total coronary events (29%); the primary endpoint excluding silent myocardial infarction (38%); and fatal and nonfatal stroke (27%). Effects of the statin on the secondary endpoints of heart failure or cardiovascular mortality did not differ significantly from those of placebo.

ASCOT-BPLA accumulated 106,153 patient-years of follow-up (median, 5.5 years). In October 2004, the DSMC recommended the trial be stopped on the grounds that, compared with those allocated the amlodipine-based regimen, those allocated the atenolol-based regimen had significantly higher mortality as well as worse outcomes on several other secondary endpoints. Complete endpoint information was collected at the end of the study for 18,965 patients (99%). On average, blood pressure dropped from a mean of 164.0/94.7 (SD 18.0/10.4) mm Hg to a mean of 136.9/78.3 (16.7/9.8) mm Hg. At the trial
close-out, 32% of patients with diabetes and 60% of those without had reached both the systolic and the diastolic blood pressure targets. Compared with those allocated the atenolol-based regimen, blood pressure values were lower in those allocated the amlodipine-based regimen, and the average difference throughout the trial was 2.7/1.9 mm Hg. By the end of the trial, as intended by design, most patients (78%) were taking at least two antihypertensive agents. The primary endpoint of nonfatal myocardial infarction plus fatal CHD was nonsignificantly lowered by 10% in those allocated the amlodipine-based regimen compared with those allocated the atenolol-based regimen. There were significant reductions in most of the secondary endpoints (except fatal and nonfatal heart failure): nonfatal myocardial infarction (excluding silent myocardial infarction) and fatal CHD (reduced by 13%); total coronary events (13%); total cardiovascular events and procedures (16%); all-cause mortality (11%); cardiovascular mortality (24%); and fatal and nonfatal stroke (23%). The difference in all-cause mortality was caused by the significant reduction in cardiovascular mortality, with no apparent difference in noncardiovascular mortality. Of the tertiary endpoints, there were significant reductions associated with the amlodipine-based regimen for unstable angina (32%), peripheral arterial disease (35%), development of diabetes (30%), and development of renal impairment (15%). Twenty-five percent of patients stopped therapy because of an adverse event, with no significant difference between the allocated treatment groups.
4 DISCUSSION & CONCLUSIONS
The lipid-lowering arm of ASCOT (ASCOT-LLA) showed that cholesterol lowering with atorvastatin compared with placebo conferred a significant reduction in nonfatal myocardial infarction and in fatal CHD in hypertensive patients at moderate risk of developing cardiovascular events. Observational data have indicated a relatively weak association between serum cholesterol and the risk of stroke (6), but previous randomized trials of statin use have shown significant reductions in stroke events of the
same order of magnitude as in ASCOT-LLA (7). There were no significant adverse effects on any of the prespecified secondary or tertiary endpoints in association with the use of atorvastatin. The relative magnitude of the benefits in ASCOT-LLA is notably larger for CHD prevention than are the effects of blood-pressure lowering in randomized, placebo-controlled trials (2), whereas the relative reduction in stroke seems somewhat smaller. However, the results show the benefits of statin treatment are additional to those of good blood-pressure control. The findings support the concept that strategies aimed at reducing cardiovascular disease should depend on global assessment of risk, and that benefits of lipid lowering are present across the whole range of serum cholesterol concentrations. Subsequent economic analysis has indicated that adopting the treatment strategy used in ASCOT-LLA would be cost-effective (8). ASCOT-BPLA showed that amlodipine-based treatment was superior to an atenolol-based therapy in hypertensive patients at moderate risk of developing cardiovascular events in terms of reducing the incidence of all types of cardiovascular events and all-cause mortality, and in terms of risk of subsequent new-onset diabetes. The effective blood pressure lowering achieved by the amlodipine-based regimen, particularly in the first year of follow-up, is likely to have contributed to the differential cardiovascular benefits. However, the systolic blood pressure difference observed would, based on previous randomized trials (2) and on observational studies (9), be expected to generate a difference in coronary events and in strokes far below that achieved. A large substudy in ASCOT, the Conduit Artery Function Evaluation (CAFE) Study (10), showed that the two drug regimens had substantially different effects on central aortic pressures and hemodynamics despite a similar impact on brachial blood pressure in those included. The study indicated that differences in central aortic pressures could be a potential mechanism to explain the different clinical outcomes between the two blood pressure treatment arms in ASCOT.
Another explanation for the difference observed could possibly be found in a prespecified assessment of whether any synergistic effects were apparent between the lipid-lowering and blood-pressure-lowering regimens (11). This analysis revealed that atorvastatin reduced the relative risk of CHD events by 53% (P < 0.0001) among those allocated the amlodipine-based regimen, but nonsignificantly (16%) among those allocated the atenolol-based regimen (P < 0.025 for heterogeneity). A significant excess of new-onset diabetes was observed in those allocated the atenolol-based regimen and is compatible with the results of previous studies (12). The effect on short-term cardiovascular outcomes of individuals who became diabetic during the course of the trial might not be apparent compared with those who did not develop diabetes, although adverse outcomes associated with type 2 diabetes could reasonably be expected with extended follow-up. The ASCOT-BPLA results reaffirm that most hypertensive patients need at least two agents to reach recommended blood pressure targets, and that most can reach current targets if suitable treatment algorithms are followed. Economic analysis has indicated that amlodipine-based therapy would be cost-effective when compared with the atenolol-based therapy (13). Both arms of the ASCOT trial have influenced current recommendations for the treatment of hypertension. The European guidelines (14) now recommend the addition of a statin to the antihypertensive treatment in hypertensive patients aged less than 80 years who have an estimated 10-year risk of cardiovascular disease of more than 20% or of cardiovascular death of 5% or more. The British National Institute for Health and Clinical Excellence (15) no longer recommends beta-blockers as a first-line therapy, and the European guidelines express several reservations regarding the use of beta-blockers in hypertension, particularly in combination with a thiazide diuretic.

REFERENCES

1. VA Cooperative Study Group, Effects of treatment on morbidity in hypertension. II. Results in patients with diastolic blood pressure averaging 90 through 114 mm Hg. JAMA 1970; 213: 1143–1152.
2. Blood Pressure Lowering Treatment Trialists' Collaboration, Effects of different blood pressure lowering regimens on major cardiovascular events: results of prospectively designed overviews of randomised trials. Lancet 2003; 362: 1527–1535.
3. Sever PS, Dahlöf B, Poulter NR, et al. Rationale, design, methods and baseline demography of participants of the Anglo-Scandinavian Cardiac Outcomes Trial. J. Hypertens. 2001; 6: 1139–1147.
4. Dahlöf B, Sever PS, Poulter NR, et al. Prevention of cardiovascular events with an antihypertensive regimen of amlodipine adding perindopril as required versus atenolol adding bendroflumethiazide as required, in the Anglo-Scandinavian Cardiac Outcomes Trial-Blood Pressure Lowering Arm (ASCOT-BPLA): a multicentre randomised controlled trial. Lancet 2005; 366: 895–906.
5. Sever PS, Dahlöf B, Poulter NR, et al. Prevention of coronary and stroke events with atorvastatin in hypertensive patients who have average or lower-than-average cholesterol concentrations, in the Anglo-Scandinavian Cardiac Outcomes Trial-Lipid Lowering Arm (ASCOT-LLA): a multicentre randomised controlled trial. Lancet 2003; 361: 1149–1158.
6. Eastern Stroke and Coronary Heart Disease Collaborative Research Group. Blood pressure, cholesterol, and stroke in eastern Asia. Lancet 1998; 352: 1801–1807.
7. Crouse JR III, Byington RP, Furberg CD. HMG-CoA reductase inhibitor therapy and stroke risk reduction: an analysis of clinical trials data. Atherosclerosis 1998; 138: 11–24.
8. Lindgren P, Buxton M, Kahan T, et al. Cost-effectiveness of atorvastatin for the prevention of coronary and stroke events: an economic analysis of the Anglo-Scandinavian Cardiac Outcomes Trial-lipid-lowering arm (ASCOT-LLA). Eur. J. Cardiovasc. Prev. Rehabil. 2005; 12: 29–36.
9. Lewington S, Clarke R, Qizilbash N, et al. Age-specific relevance of usual blood pressure to vascular mortality: a meta-analysis of individual data for one million adults in 61 prospective studies. Lancet 2002; 360: 1903–1913.
10. Williams B, Lacy PS, Thom SM, et al. Differential impact of blood pressure-lowering drugs on central aortic pressure and clinical outcomes: principal results of the Conduit Artery Function Evaluation (CAFE) Study. Circulation 2006; 113: 1213–1225.
11. Sever P, Dahlöf B, Poulter N, et al. Potential synergy between lipid-lowering and blood-pressure-lowering in the Anglo-Scandinavian Cardiac Outcomes Trial. Eur. Heart J. 2006; 27: 2982–2988.
12. Opie LH, Schall R. Old antihypertensives and new diabetes. J. Hypertens. 2004; 22: 1453–1458.
13. Lindgren P, Buxton M, Kahan K, et al. Economic evaluation of ASCOT-BPLA: antihypertensive treatment with an amlodipine-based regimen is cost-effective compared to an atenolol-based regimen. Heart Online, October 2007.
14. The Task Force for the Management of Arterial Hypertension of the European Society of Hypertension (ESH) and of the European Society of Cardiology (ESC). 2007 Guidelines for the Management of Arterial Hypertension. J. Hypertens. 2007; 25: 1105–1187.
15. National Institute for Health and Clinical Excellence. Hypertension: management of hypertension in adults in primary care. 2006. Available from: http://www.nice.org.uk/nicemedia/pdf/CG034NICEguideline.pdf.
ASSAY SENSITIVITY
CHRISTY CHUANG-STEIN
Pfizer Global Research and Development
Kalamazoo, Michigan

A good source for assay sensitivity is the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) guidance on Choice of Control Group and Related Issues in Clinical Trials (E10) (1). According to ICH E10, the assay sensitivity of a clinical trial is defined as the trial's ability to distinguish an effective treatment from a less effective or ineffective one. Since it is always important for a trial to tell treatments with different effects apart, assay sensitivity is an important consideration when designing and conducting a trial. The lack of adequate assay sensitivity has different implications for trials designed to demonstrate superiority and trials designed to demonstrate noninferiority, as we will discuss below.

1 SUPERIORITY TRIAL

The primary objective of a superiority trial is to demonstrate the superiority of a treatment over a concurrent comparator. The comparator could be a placebo or another treatment. The primary endpoint could be related to efficacy, safety, or benefit/risk. When determining the sample size for a superiority study, a trialist needs to consider the type I error (false-positive) rate and the power (one minus the false-negative rate) at a particular effect size. The allowable type I error rate, typically set to 0.05 for a two-sided test in a confirmatory trial, determines the level at which statistical significance will be claimed. As for statistical power, a prespecified requirement (80% or 90%) at a prespecified effect size (Δ) gives the probability that we will conclude a treatment effect difference if the treatment effect difference is truly of the amount Δ. Given a sample size, the type I error rate, and the variability associated with the estimate for the primary endpoint, one can calculate the power at any effect size, yielding the power function. This power function takes a value equal to the type I error rate when the effect size is 0 and a value equal to the desired power when the effect size is Δ (a numerical sketch is given at the end of this section). Because the goal of a superiority trial is to detect a difference, high assay sensitivity will help a trial achieve its objective in the presence of a true treatment difference. In such a trial, the power function can help provide some quantification for assay sensitivity: the higher the power to detect a treatment effect, the greater the assay sensitivity. For a superiority trial, several factors could reduce the true treatment effect or increase the variability, leading to a smaller signal-to-noise ratio when comparing an effective treatment with a less effective or an ineffective one. These factors include, but are not limited to:

1. High measurement or ascertainment errors caused by the use of multiple pieces of equipment, multiple technicians, and multiple raters that produce different results.
2. Poor compliance with therapy, such as study medication administration, use of concomitant medications, visit schedule, sample collection, and data recording.
3. A large patient dropout rate that makes analysis challenging and renders findings suspect.
4. An enrolled population that is not the one the effective treatment will benefit the most.
5. A dose or administration schedule that is not optimal, which results in suboptimal exposure.
6. A large placebo effect in the enrolled population that makes it hard to demonstrate a treatment benefit.
7. Poorly or inconsistently applied diagnostic criteria that result in the inclusion of subjects who might not have the disorder under investigation.
8. Treatment randomization that fails to take into consideration crucial baseline covariates, leading to imbalance
between treatment groups with respect to important covariates.
9. Lack of clarity in the protocol, leaving investigators to interpret the requirements themselves and leading to discrepancies in implementation.

When a superiority trial concludes a treatment effect difference, the finding demonstrates that the trial has assay sensitivity. If the trial fails to conclude a treatment effect difference, it could be because of the absence of a true treatment effect difference. It could also be that the study lacks adequate assay sensitivity because of one or more of the reasons listed above. Since the latter is something that a trialist can control with proper steps, it is important that a trialist ensure adequate assay sensitivity through careful design and excellent execution. By comparison, it is hard to change the innate properties of a drug candidate.
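As a minimal sketch of the power function described at the start of this section, the following Python fragment evaluates a two-sample z-test power function at effect size 0 and at the design effect size Δ; the standard deviation and per-arm sample size are hypothetical values chosen only for illustration.

```python
from math import sqrt
from scipy.stats import norm

def power_two_sample(delta, sd, n_per_arm, alpha=0.05):
    """Power function of a two-sided, two-sample z-test at true effect size delta."""
    se = sd * sqrt(2.0 / n_per_arm)
    z_crit = norm.ppf(1 - alpha / 2)
    z = delta / se
    # probability of rejecting the null hypothesis in either direction
    return norm.cdf(z - z_crit) + norm.cdf(-z - z_crit)

# Hypothetical design: detect a difference of 5 points with SD 12 and 128 patients per arm
sd, n = 12.0, 128
print(power_two_sample(0.0, sd, n))   # ~0.05: equals the type I error rate at effect size 0
print(power_two_sample(5.0, sd, n))   # ~0.92: the planned power at the design effect size
```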
2 NONINFERIORITY TRIALS
For convenience, we will focus our discussion on noninferiority trials (2) even though many of the comments are equally applicable to equivalence trials. Here, we assume that a noninferiority decision will be based on a confidence interval constructed for the treatment effect difference. Furthermore, we assume that a positive value for the difference signals a beneficial treatment effect (for the new treatment). Noninferiority will be concluded if the lower confidence limit exceeds a predetermined quantity. When noninferiority in efficacy is being investigated, the hope is to confirm the efficacy of the new treatment over a placebo based on the proven efficacy of the active control (over a concurrent placebo) observed in previous trials. The idea is that if the efficacy of the new treatment is ‘‘reasonably’’ close to that of the active control and the latter is believed to have a better efficacy than a placebo, then one could conclude the efficacy of the new treatment over a placebo if a placebo had been included in the trial. To make the above work, it is critical that the active control successfully and consistently demonstrated a benefit over a concurrent placebo in past trials. In other words,
historical evidence of sensitivity to drug effects is available showing that similarly designed trials regularly distinguished the active control from a placebo. This condition, although it sounds simple, is not always true. In fact, we know this condition often fails for disorders such as depression. For depression, a non-negligible chance exists that an effective treatment fails to demonstrate efficacy over a placebo comparator in a given trial. Even when historical evidence for assay sensitivity exists, the effect over the concurrent placebo is not necessarily similar from trial to trial. When this happens, no assurance is given that similarly designed noninferiority trials involving the active control will have similar assay sensitivity. All trials should be planned carefully. A trialist should consider all things that could possibly go wrong and make provisions for them at the design stage. On the other hand, the appropriateness of trial conduct can only be fully evaluated after the trial is completed. To extrapolate the efficacy of the active control observed in previous trials to the current one, the current trial should be similar to those from which historical evidence has been drawn. For example, we should have similar inclusion/exclusion criteria, similar allowance for concomitant medications, similar measurement processes, and similar follow-up and data collection/analysis methods. Since historical evidence for assay sensitivity only applies to similarly designed trials, the current trial should be conducted with the same high quality as the previous ones.

3 IMPACT OF SLOPPY TRIAL EXECUTION

Careful conduct of a clinical trial according to the protocol has a major impact on the credibility of the results (3). Sloppy trial conduct affects both noninferiority (4,5) and superiority trials, albeit differently. Since the primary objective of a noninferiority trial is to demonstrate the absence of a difference, the lack of assay sensitivity could reduce the treatment effect difference and increase the chance of erroneously concluding noninferiority. On the other hand, the variability could be increased as a result of increased sources
of variability. The increased variability contributes to a greater confidence interval width, which in turn could reduce our chance to conclude noninferiority correctly. The above discussion is further complicated in the case of a binary endpoint (6). Consider the case of an antibiotic trial in which the success rate of the new treatment and the positive control is expected to be around 70%. If the observed rates are close to 50%, this would result in a higher than expected variability when constructing the confidence interval for the difference in the success rates. Several factors could play against each other in regard to the noninferiority decision. Clearly, the observed treatment effect is a major factor. However, the magnitude of the change in the point estimate for treatment effect needs to be evaluated against a likely increase in the variability for their combined effect on the final conclusion. Koch and Rohmel (7) conducted a simulation study to evaluate the impact of sloppy study conduct on noninferiority studies. They showed that sloppy study conduct could affect the rate of false-positive conclusions in noninferiority trials. In their investigations, Koch and Rohmel examined the following four factors: (1) the objective of the trial; (2) the population analyzed and the mechanism for imputation of nonevaluable measurements; (3) the mechanism generating nonevaluable measurements; and (4) the selected measure for treatment effect. They encouraged researchers to perform simulation studies routinely to investigate the effect of these often counteracting factors in practical situations. By comparison, sloppy study conduct in a superiority trial generally leads to a decrease in assay sensitivity, and therefore it reduces our chance to conclude superiority even if the treatment of interest is better than the comparator.
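A minimal sketch of the confidence-interval decision rule, and of how success rates near 50% inflate the variance of a binary endpoint, is given below; the 10-percentage-point noninferiority margin and the sample sizes are assumptions made for illustration and are not drawn from any particular trial.

```python
from math import sqrt
from scipy.stats import norm

def noninferior(x_new, n_new, x_ctl, n_ctl, margin, alpha=0.05):
    """Conclude noninferiority if the lower limit of the 95% CI for the
    difference in success rates (new minus control) exceeds -margin."""
    p_new, p_ctl = x_new / n_new, x_ctl / n_ctl
    diff = p_new - p_ctl
    se = sqrt(p_new * (1 - p_new) / n_new + p_ctl * (1 - p_ctl) / n_ctl)
    lower = diff - norm.ppf(1 - alpha / 2) * se
    return lower > -margin, round(lower, 3)

# Expected setting: about 70% success in both arms, 180 patients per arm
print(noninferior(126, 180, 126, 180, margin=0.10))   # (True, -0.095): noninferiority concluded
# Same zero observed difference, but rates near 50% inflate the variance
print(noninferior(90, 180, 90, 180, margin=0.10))     # (False, -0.103): CI now crosses the margin
```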
4 DESIGNS INCLUDING THREE ARMS
A three-arm design is increasingly used to help gauge assay sensitivity in a trial (8). This applies to trials designed from either the superiority or the noninferiority perspective. The three arms discussed here are a placebo, an active control, and a new treatment.
In a superiority trial, the inclusion of an active control helps decide whether a study has adequate assay sensitivity. If the study can conclude a significant treatment effect between the active control (known to have an effect) and the placebo, one can conclude that the study has adequate assay sensitivity. If the new treatment fails to demonstrate a significant effect over the placebo in the same study, one could conclude that the new treatment, as applied to the patients in the study, is not likely to possess the kind of treatment effect used to size the study at the design stage. The three-arm design has gained popularity for proof of concept or dose-ranging studies also. Since most product candidates entering into clinical testing will fail, many drug development programs are designed so that the sponsors could terminate a program early if the candidate is not likely to meet the required commercial profile. To this end, the inclusion of an active control in an early trial (proof of concept or dose-ranging) helps answer the question of whether the observed effect of the new treatment is reasonably close to the truth based on the observed response to the active control. Similarly, for situations where product differentiation is important, including an active control allows the sponsor to compare the new treatment head-to-head with the active control after efficacy of the new treatment is confirmed. In some cases, a sponsor might cease further development of a new treatment because it does not meet the commercial requirements relative to the active control even if the new treatment demonstrates efficacy over a placebo. A classic noninferiority trial designed from the efficacy perspective does not include a placebo. The basis for this is typically because of ethical consideration. For example, in trials involving patients with life-threatening conditions for which treatments are available, it will be unethical to treat patients with a placebo. For confirmatory trials that do contain a placebo and an active control, the primary analysis is typically to compare the new treatment with the placebo (9). A secondary analysis is to compare the new treatment with the active control for product differentiation.
In some situations, the primary objective of the trial is to compare the new treatment with a placebo on safety to rule out a safety signal. The design is basically a noninferiority design. The hope is to conclude that the primary safety endpoint associated with the new treatment is within a certain limit of that associated with the placebo. This is the case with the ‘‘thorough QT/QTc’’ study discussed in ICH E14 (10). ICH E14 states that the ‘‘thorough QT/QTc’’ study is intended to determine whether a drug candidate under development has a threshold pharmacologic effect on cardiac repolarization as detected by QT/QTc prolongation. Such a study, which is typically conducted early in a clinical development program, is intended to provide maximum guidance for the collection of ECG data in later trials. For example, a negative ‘‘thorough QT/QTc’’ study will often allow the collection of on-therapy ECGs in accordance with the current practices in each therapeutic area. On the other hand, a positive ‘‘thorough QT/QTc study’’ will almost always lead to an expanded ECG safety evaluation during later stages of drug development. Because of the critical role of this study, the confidence in the study’s findings can be greatly enhanced by the use of a concurrent positive control (pharmacological or nonpharmacological) to establish assay sensitivity. Detecting the positive control’s effect in the study will establish the ability of the study to detect an effect similar to that of the positive control. A common choice of the positive control is a drug or an intervention that has an effect on the mean QT/QTc interval of about 5 ms. Positive controls play a similar role in the assessment of mutagenicity studies. According to Hauschke et al. (11), the classification of an experiment as being negative or positive should be based also on the magnitude of the responses in the positive control. In such studies, the determination of the maximum safe dose is often done by incorporating a biologically meaningful threshold value, which is expressed as a fraction of the difference between positive and vehicle control. Therefore, the inclusion of a positive control not only determines the quality of the experiment qualitatively, but it is also part of the summary results involving a new compound in a quantitative manner.
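The "thorough QT/QTc" decision rule described above can be sketched as a one-sided comparison against a regulatory threshold; the 10-ms bound used below is the threshold commonly cited in connection with ICH E14, and the summary statistics and sample size are hypothetical.

```python
from math import sqrt
from scipy.stats import norm

def qtc_study_negative(mean_diff, sd, n, threshold=10.0, alpha=0.05):
    """Negative 'thorough QT/QTc' finding if the upper bound of the one-sided
    95% CI for the placebo-corrected mean QTc change stays below the threshold (ms)."""
    upper = mean_diff + norm.ppf(1 - alpha) * sd / sqrt(n)
    return upper < threshold, round(upper, 1)

def positive_control_detected(mean_diff, sd, n, alpha=0.05):
    """Assay sensitivity check: the positive control's expected ~5 ms effect
    should be detectable, e.g., its lower one-sided 95% CI bound should exclude 0."""
    lower = mean_diff - norm.ppf(1 - alpha) * sd / sqrt(n)
    return lower > 0

# Hypothetical summaries: drug arm shows a 3 ms mean change (SD 8 ms, n = 60),
# and the positive control shows its expected 5 ms change.
print(qtc_study_negative(3.0, 8.0, 60))        # (True, 4.7): study would be judged negative
print(positive_control_detected(5.0, 8.0, 60)) # True: assay sensitivity demonstrated
```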
In summary, assay sensitivity is a critical factor in the success of a trial, whether the trial is designed from the efficacy or the safety perspective.

REFERENCES

1. International Conference on Harmonisation E10. (2000). Choice of control group and related issues in clinical trials. Step 5. Available: http://www.ich.org/cache/compo/276-254-1.html.
2. Hung HMJ, Wang SJ, O'Neill R. (2006). Noninferiority trials. Wiley Encyclopedia of Clinical Trials.
3. International Conference on Harmonisation E9. (1998). Statistical principles for clinical trials. Step 5. Available: http://www.ich.org/cache/compo/276-254-1.html.
4. Jones B, Jarvis P, Lewis JA, Ebbutt AF. Trials to assess equivalence: the importance of rigorous methods. Br. Med. J. 1996; 313: 36–39.
5. Sheng D, Kim MY. The effect of noncompliance on intent-to-treat analysis of equivalence trials. Stat Med. 2006; 25: 1183–1199.
6. Chuang-Stein C. Clinical equivalence—a clarification. Drug Information J. 1999; 33: 1189–1194.
7. Koch A, Rohmel J. The impact of sloppy study conduct on non-inferiority studies. Drug Information J. 2002; 36: 3–6.
8. Chuang C, Sanders C, Snapinn S. An industry survey on current practice in the design and analysis of active control studies. J. Biopharm. Stat. 2004; 14(2): 349–358.
9. Koch A, Rohmel J. Hypothesis testing in the "gold standard" design for proving the efficacy of an experimental treatment relative to placebo and a reference. J. Biopharm. Stat. 2004; 14(2): 315–325.
10. International Conference on Harmonisation E14. (2005). The clinical evaluation of QT/QTc interval prolongation and proarrhythmic potential for non-antiarrhythmic drugs. Step 4 document. Available: http://www.ich.org/cache/compo/276-254-1.html.
11. Hauschke D, Slacik-Erben R, Hensen S, Kaufmann R. Biostatistical assessment of mutagenicity studies by including the positive control. Biometrical J. 2005; 47: 82–87.
ASSESSMENT BIAS
PETER C. GØTZSCHE
Nordic Cochrane Centre
Rigshospitalet, København Ø, Denmark
1 INTRODUCTION
Assessment bias in a clinical trial occurs if bias in the assessment of the outcome exists. It is also called ascertainment bias, diagnostic bias, or detection bias (1). A major cause of assessment bias is lack of blinding. Other problems relate to differential identification of harmless or false-positive cases of disease, bias in assessment of disease-specific mortality, the use of composite outcomes, competing risks, timing of the assessments, and bias in assessment of harms.
2 LACK OF BLINDING
One of the most important and most obvious causes of assessment bias is lack of blinding. In empirical studies, lack of blinding has been shown to exaggerate the estimated effect by 14%, on average, measured as odds ratio (2). These studies have dealt with a variety of outcomes, some of which are objective and would not be expected to be influenced by lack of blinding (e.g., total mortality). When patient-reported outcomes are assessed, lack of blinding can lead to far greater bias than the empirical average. An example of a highly subjective outcome is the duration of an episode of the common cold. A cold does not stop suddenly, and awareness of the treatment received could therefore bias the evaluation. In a placebo-controlled trial of vitamin C, the duration seemed to be shorter when the active drug was given, but many participants had guessed they received the vitamin because of its taste (3). When the analysis was restricted to those who could not guess what they had received, the duration was not shorter in the active group. Assessments by physicians are also vulnerable to bias. In a trial in multiple sclerosis, neurologists found an effect of the treatment when they assessed the effect openly but not when they assessed the effect blindly in the same patients (4). Some outcomes can only be meaningfully evaluated by the patients (e.g., pain and well-being). Unfortunately, blinding patients effectively can be very difficult, which is why active placebos are sometimes used. The idea behind an active placebo is that patients should experience side effects of a similar nature as when they receive the active drug, although it contains so little of a drug that it can hardly cause any therapeutic effect. As lack of blinding can lead to substantial bias, it is important in blinded trials to test whether the blinding has been compromised. Unfortunately, this test is rarely done (Asbjørn Hróbjartsson, unpublished observations), and, in many cases, double-blinding is little more than window dressing. Some outcome assessments are not made until the analysis stage of the trial (see below). Blinding should, therefore, also be used during data analysis, and it should ideally be preserved until two versions of the manuscript—written under different assumptions about which of the treatments is experimental and which is control—have been approved by all the authors (5).
3 HARMLESS OR FALSE-POSITIVE CASES OF DISEASE

Assessment bias can occur if increased diagnostic activity leads to increased diagnosis of true, but harmless, cases of disease. Many stomach ulcers are silent (i.e., they come and go and give no symptoms). Such cases could be detected more frequently in patients who receive a drug that causes unspecific discomfort in the stomach. Similarly, if a drug causes diarrhea, it could lead to more digital rectal examinations, and therefore also to the detection of more cases of prostatic cancer, most of which would be harmless, because many people die with prostatic cancer but rather few die from prostatic cancer. Assessment bias can also be caused by differential detection of false-positive cases
of disease. Considerable observer variation often exists with common diagnostic tests. For gastroscopy, for example, a kappa value of 0.54 has been reported for the interobserver variation in the diagnosis of duodenal ulcers (6), which usually means that rather high rates of both false-positive findings and false-negative findings occur. If treatment with a drug leads to more gastroscopies because ulcers are suspected, one would therefore expect to find more (false) ulcers in patients receiving that drug. A drug that causes unspecific, non-ulcer discomfort in the stomach could therefore falsely be described as an ulcer-inducing drug. The risk of bias can be reduced by limiting the analysis to serious cases that would almost always become known (e.g., cases of severely bleeding ulcers requiring hospital admission or leading to death).
4 DISEASE-SPECIFIC MORTALITY
Disease-specific mortality is very often used as the main outcome in trials without any discussion of how reliable it is, even in trials of severely ill patients in which it can be difficult to ascribe particular causes for the deaths with acceptable error. Disease-specific mortality can be highly misleading if a treatment has adverse effects that increase mortality from other causes. It is only to be expected that aggressive treatments can have such effects. Complications of cancer treatment, for example, cause mortality that is often ascribed to other causes, although these deaths should have been added to the cancer deaths. A study found that deaths from causes other than cancer were 37% higher than expected and that most of this excess occurred shortly after diagnosis, suggesting that many of the deaths were attributable to treatment (7). The use of blinded endpoint committees can reduce the magnitude of misclassification bias, but cannot be expected to remove it. Radiotherapy for breast cancer, for example, continues to cause cardiovascular deaths even 20 years after treatment (8), and it is not possible to distinguish these deaths from cardiovascular deaths from other causes. Furthermore, to work in an unbiased
way, death certificates and other important documents must have been completed, and patients and documents selected for review, without awareness of status, and it should not be possible to break the masking during any of these processes, including review of causes of death. This seems difficult to achieve, in particular because those who prepare excerpts of the data should be kept blind to the research hypothesis (1). Fungal infections in cancer patients with neutropenia after chemotherapy or bone marrow transplantation provide another example of bias in severely ill patients. Not only is it difficult to establish with certainty that a patient has a fungal infection and what was the cause of death, but evidence also exists that some of the drugs (azole antifungal agents) may increase the incidence of bacteraemias (9). In the largest placebo-controlled trial of fluconazole, more deaths were reported on drug than on placebo (55 vs 46 deaths), but the authors also reported that fewer deaths were ascribed to acute systemic fungal infections (1 vs 10 patients, P = 0.01) (10). However, if this subgroup result is to be believed, it would mean that fluconazole increased mortality from other causes (54 vs 36 patients, P = 0.04). Bias related to classification of deaths can also occur within the same disease. After publication of positive results from a trial in patients with myocardial infarction (11), researchers at the U.S. Food and Drug Administration found that the cause-of-death classification was "hopelessly unreliable" (12). Cardiac deaths were classified into three groups: sudden deaths, myocardial infarction, or other cardiac event. The errors in assigning cause of death nearly all favored the conclusion that sulfinpyrazone decreased sudden death, the major finding of the trial.

5 COMPOSITE OUTCOMES

Composite outcomes are vulnerable to bias when they contain a mix of objective and subjective components. A survey of trials with composite outcomes found that when they included clinician-driven outcomes, such as hospitalization and initiation of new antibiotics, in addition to objective outcomes such
as death, it was twice as likely that the trial reported a statistically significant effect (13).
6 COMPETING RISKS
Composite outcomes can also lead to bias because of competing risks (14), for example, if an outcome includes death as well as hospital admission. A patient who dies cannot later be admitted to hospital. This bias can also occur in trials with simple outcomes. If one of the outcomes is length of hospital stay, a treatment that increases mortality among the weakest patients, who would have had long hospital stays, may spuriously appear to be beneficial.
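A small simulation with invented numbers illustrates the length-of-stay example: when treatment-related deaths remove the frailest patients, who would otherwise have had the longest stays, the mean stay among treated survivors looks shorter even though no patient's stay was actually shortened.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Frailty drives both the expected length of stay and the risk of dying on treatment.
frailty = rng.uniform(0, 1, n)
stay = 5 + 20 * frailty + rng.normal(0, 2, n)        # days; frail patients stay longer

# Control arm: no excess deaths; treated arm: excess deaths concentrated in the frailest.
dies_on_treatment = rng.uniform(0, 1, n) < 0.3 * frailty

mean_stay_control = stay.mean()                      # all patients contribute a stay
mean_stay_treated = stay[~dies_on_treatment].mean()  # only survivors contribute a stay
print(f"Mean stay, control: {mean_stay_control:.1f} days")
print(f"Mean stay, treated survivors: {mean_stay_treated:.1f} days (spuriously shorter)")
```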
7 TIMING OF OUTCOMES
Timing of outcomes can have profound effects on the estimated result, and the selection of time points for reporting of the results is often not made until the analysis stage of the trial, when treatment codes may already have been broken. A trial report of the anti-arthritic drug celecoxib gave the impression that it was better tolerated than its comparators, but the published data referred to 6 months of follow-up, and not to 12 and 15 months, as planned, when little difference existed; in addition, the definition of the outcome had changed compared with what was stated in the trial protocol (15). Trials conducted in intensive care units are vulnerable to this type of bias. For example, the main outcome in such trials can be total mortality during the stay in the unit, but if the surviving patients die later, during their subsequent stay at the referring department, little may be gained by a proven mortality reduction while the patients were sedated. A more relevant outcome would be the fraction of patients who leave the hospital alive.
8 ASSESSMENT OF HARMS
Bias in assessment of harms is common. Even when elaborate, pretested forms have been used for registration of harms during a trial, and guidelines for their reporting have been given in the protocol, the conversion of these data into publishable bits of information can be difficult and often involves subjective judgments. Particularly vulnerable to assessment bias is the exclusion of reported effects because they are not felt to be important, or not felt to be related to the treatment. Trials that have been published more than once illustrate how subjective and biased assessment of harms can be. Both the number of adverse effects and the number of patients affected can vary from report to report, although no additional inclusion of patients or follow-up has occurred, and these re-interpretations or reclassifications sometimes change an insignificant difference into a significant difference in favor of the new treatment (16).
REFERENCES

1. A. R. Feinstein, Clinical Epidemiology. Philadelphia: Saunders, 1985.
2. P. Jüni, D. G. Altman, and M. Egger, Systematic reviews in health care: assessing the quality of controlled clinical trials. BMJ 2001; 323: 42–46.
3. T. R. Karlowski, T. C. Chalmers, L. D. Frenkel, A. Z. Kapikian, T. L. Lewis, and J. M. Lynch, Ascorbic acid for the common cold: a prophylactic and therapeutic trial. JAMA 1975; 231: 1038–1042.
4. J. H. Noseworthy et al., The impact of blinding on the results of a randomized, placebo-controlled multiple sclerosis clinical trial. Neurology 1994; 44: 16–20.
5. P. C. Gøtzsche, Blinding during data analysis and writing of manuscripts. Controlled Clin. Trials 1996; 17: 285–290.
6. T. Gjørup et al., The endoscopic diagnosis of duodenal ulcer disease. A randomized clinical trial of bias and interobserver variation. Scand. J. Gastroenterol. 1986; 21: 561–567.
7. B. W. Brown, C. Brauner, and M. C. Minnotte, Noncancer deaths in white adult cancer patients. J. Natl. Cancer Inst. 1993; 85: 979–987.
8. Early Breast Cancer Trialists' Collaborative Group, Favourable and unfavourable effects on long-term survival of radiotherapy for early breast cancer: an overview of the randomised trials. Lancet 2000; 355: 1757–1770.
9. P. C. Gøtzsche and H. K. Johansen, Routine versus selective antifungal administration for
control of fungal infections in patients with cancer (Cochrane Review). In: The Cochrane Library, Issue 3. Oxford: Update Software, 2003.
10. J. L. Goodman et al., A controlled trial of fluconazole to prevent fungal infections in patients undergoing bone marrow transplantation. N. Engl. J. Med. 1992; 326: 845–851.
11. The Anturane Reinfarction Trial Research Group, Sulfinpyrazone in the prevention of sudden death after myocardial infarction. N. Engl. J. Med. 1980; 302: 250–256.
12. R. Temple and G. W. Pledger, The FDA's critique of the Anturane Reinfarction Trial. N. Engl. J. Med. 1980; 303: 1488–1492.
13. N. Freemantle, M. Calvert, J. Wood, J. Eastaugh, and C. Griffin, Composite outcomes in randomized trials: greater precision but with greater uncertainty? JAMA 2003; 289: 2554–2559.
14. M. S. Lauer and E. J. Topol, Clinical trials - multiple treatments, multiple end points, and multiple lessons. JAMA 2003; 289: 2575–2577.
15. P. Jüni, A. W. Rutjes, and P. A. Dieppe, Are selective COX 2 inhibitors superior to traditional non steroidal anti-inflammatory drugs? BMJ 2002; 324: 1287–1288.
16. P. C. Gøtzsche, Multiple publication in reports of drug trials. Eur. J. Clin. Pharmacol. 1989; 36: 429–432.
ASSESSMENT OF HEALTH-RELATED QUALITY OF LIFE
C. S. WAYNE WENG
Department of Biomedical Engineering
Chung Yuan Christian University
Chungli, Taiwan
1 INTRODUCTION
Randomized clinical trials are the gold standard for evaluating new therapies. The primary focus of clinical trials has traditionally been evaluation of efficacy and safety. As clinical trials evolved from traditional efficacy and safety assessment of new therapies, clinicians became interested in an overall evaluation of the clinical impact of these new therapies on patient daily functioning and well-being as measured by health-related quality of life (HRQOL). As a result, HRQOL assessments in clinical trials have risen steadily throughout the 1990s and continue into the twenty-first century. What is HRQOL? Generally, quality of life encompasses four major domains (1):

1. Physical status and functional abilities
2. Psychological status and well-being
3. Social interactions
4. Economic or vocational status and factors

The World Health Organization (WHO) defines "health" (2) as a "state of complete physical, mental, and social well-being and not merely the absence of infirmity and disease." HRQOL focuses on the parts of quality of life that are related to an individual's health. The key components of this definition of HRQOL include (1) physical functioning, (2) mental functioning, and (3) social well-being, and a well-balanced HRQOL instrument should include these three key components. For example, the Medical Outcomes Study Short Form-36 (SF-36), a widely used HRQOL instrument, includes a profile of eight domains: (1) Physical Functioning, (2) Role-Physical, (3) Bodily Pain, (4) Vitality, (5) General Health, (6) Social Functioning, (7) Role-Emotional, and (8) Mental Health. These eight domains can be further summarized by two summary scales: the Physical Component Summary (PCS) and Mental Component Summary (MCS) scales. This article is intended to provide an overview of assessment of HRQOL in clinical trials. For more specific details on a particular topic mentioned in this article, the readers should consult the cited references. The development of a new HRQOL questionnaire and its translation into various languages are separate topics and are not covered in this article.
2 CHOICE OF HRQOL INSTRUMENTS
HRQOL instruments can be classified into two types: generic instruments and disease-specific instruments. The generic instrument is designed to evaluate general aspects of a person's HRQOL, which should include physical functioning, mental functioning, and social well-being. A generic instrument can be used to evaluate the HRQOL of a group of people in the general public or a group of patients with a specific disease. As such, data collected with a generic instrument allow comparison of HRQOL among different disease groups or against a general population. Because a generic instrument is designed to cover a broad range of HRQOL issues, it may be less sensitive to issues that are important for a particular disease or condition. Disease-specific instruments focus assessment in a more detailed manner for a particular disease. A more specific instrument allows detection of changes in disease-specific areas that a generic instrument is not sufficiently sensitive to detect. For example, the Health Assessment Questionnaire (HAQ) was developed to measure the functional status of patients with rheumatic disease. The HAQ assesses the ability to function in eight areas of daily life: dressing and grooming, arising, eating, walking, hygiene, reach, grip, and activities. Table 1 (3–30) includes a list of generic and disease-specific HRQOL instruments for common diseases or conditions.
Table 1. Generic and Disease-Specific HRQOL Instruments for Common Diseases or Conditions

Generic HRQOL Instruments (Reference):
Short Form-36 (SF-36) (3–5)
Sickness Impact Profile (6)
Nottingham Health Profile (7)
Duke Health Profile (8)
McMaster Health Index Questionnaire (9)
Functional Status Questionnaire (10)
WHO Quality of Life Assessment (11)

Disease-Specific HRQOL Instruments, by Disease (Reference):
Pain: Brief Pain Inventory (12); McGill Pain Questionnaire (13); Visual Analogue Pain Rating scales (various authors; see Reference 14, p. 341)
Depression: Beck Depression Inventory (15); Center for Epidemiologic Studies Depression Scale (CES-D) (16); Hamilton Rating Scale for Depression (17); The Hospital Anxiety and Depression Questionnaire (18); Zung Self-Rating Depression Scale (19); WHO Well-Being Questionnaire (20)
Rheumatic Disease (Rheumatoid Arthritis, Osteoarthritis, Ankylosing Spondylitis, Juvenile Rheumatoid Arthritis): Health Assessment Questionnaire (HAQ) (21)
Inflammatory Bowel Disease: Inflammatory Bowel Disease Questionnaire (IBDQ) (22, 23)
Asthma: Asthma Quality of Life Questionnaire (AQLQ) (24)
Airway Disease: St. George's Respiratory Questionnaire (SGRQ) (25)
Seasonal Allergic Rhinitis: Rhinoconjunctivitis Quality of Life Questionnaire (RQLQ) (26)
Parkinson's Disease: Parkinson's Disease Questionnaire, 39-item (PDQ-39) (27); Parkinson's Disease Quality of Life Questionnaire (PDQL) (28)
Cancer (both instruments have tumor-specific modules): EORTC QLQ-C30 (29); Functional Assessment of Cancer Therapy (FACT) (30)

A comprehensive approach to assessing HRQOL in clinical trials can be achieved using a battery of questionnaires when a single questionnaire does not address all relevant HRQOL components, or a "module" approach, which includes a core measure of HRQOL domains supplemented in the same questionnaire by a disease- or treatment-specific set of items. The battery
approach combines a generic HRQOL instrument plus a disease-specific questionnaire. For example, in a clinical trial on rheumatoid arthritis (RA), one can include the SF-36 and the HAQ to evaluate the treatment effect on HRQOL. The SF-36 allows comparison of the RA burden on patients' HRQOL with that of other diseases as well as with the general population. The HAQ, being a disease-specific instrument, measures patients' ability to perform activities of daily life and is more sensitive to changes in an RA patient's condition. The module approach has been widely adopted in oncology, as different tumors impact patients in different ways. The most popular cancer-specific HRQOL questionnaires, the EORTC QLQ-C30 and the FACT, both include core instruments that measure physical functioning, mental functioning, and social well-being as well as common cancer symptoms, supplemented with a list of tumor- and treatment-specific modules. In certain diseases, a disease-specific HRQOL instrument is used alone in a trial because the disease's impact on general HRQOL is so small that a generic HRQOL instrument will not be sensitive enough to detect changes in disease severity. For example, the burden of allergic rhinitis on generic HRQOL is relatively small compared with the general population. Most published HRQOL studies in allergic rhinitis use Juniper's Rhinoconjunctivitis Quality of Life Questionnaire (RQLQ), a disease-specific HRQOL questionnaire for allergic rhinitis.

3 ESTABLISHMENT OF CLEAR OBJECTIVES IN HRQOL ASSESSMENTS

A clinical trial is usually designed to address one hypothesis or a small number of hypotheses, evaluating a new therapy's efficacy, safety, or both. When considering whether to include HRQOL assessment in a study, the question of what additional information will be provided by the HRQOL assessment must be asked. As estimated by Moinpour (31), the total cost per patient is $443 to develop an HRQOL study, monitor HRQOL form submission, and analyze HRQOL data. Sloan et al. (32) have revisited the issue of the cost of HRQOL assessment in a number of settings including clinical trials and
suggest a wide cost range depending on the comprehensiveness of the assessment. This is not a trivial sum of money to be spent in a study without a clear objective in HRQOL assessment. The objective of HRQOL assessment is usually focused on one of four possible outcomes: (1) improvement in efficacy leads to improvement in HRQOL, (2) treatment side effects may cause deterioration in HRQOL, (3) the combined effect of (1) and (2) on HRQOL, and (4) similar efficacy with an improved side effect profile leads to improvement in HRQOL. After considering the possible HRQOL outcomes, one can decide whether HRQOL assessment should be included in the trial. In many published studies, HRQOL was included without a clear objective. These studies generated HRQOL data that provided no additional information at the completion of the studies. Goodwin et al. (33) provide an excellent review of HRQOL measurement in randomized clinical trials in breast cancer. They suggest that, given the existing HRQOL database for breast cancer, it is not necessary to measure HRQOL in every trial, at least until ongoing trials are reported. An exception is interventions with a psychosocial focus, where HRQOL has to be the primary outcome.
4 METHODS FOR HRQOL ASSESSMENT
The following components should be included in a study protocol with an HRQOL objective:

– Rationale for assessing HRQOL objective(s) and for the choice of HRQOL instrument(s): To help study personnel understand the importance of HRQOL assessment in the study, inclusion of a clear and concise rationale for HRQOL assessment is essential, along with a description of the specific HRQOL instrument(s) chosen.

– HRQOL hypotheses: The study protocol should also specify hypothesized HRQOL outcomes with respect to general and specific domains. It is helpful to identify
the primary domain and secondary domains for HRQOL analysis in the protocol.

– Frequency of HRQOL assessment: In a clinical trial, the minimum number of HRQOL assessments required is two, at baseline and at the end of the study, for studies with a fixed treatment duration in which most patients are expected to complete the treatment. One or two additional assessments should be considered between baseline and study endpoint, depending on the length of the study, so that a patient's data will still be useful if endpoint data are not collected. More frequent assessments should be considered if the treatment's impact on HRQOL may change over time. Three or more assessments are necessary to characterize patterns of change for individual patients. In oncology trials, it is common to assess HRQOL on every treatment cycle, as patients' HRQOL is expected to change over time. However, assessment burden can be minimized if specific time points associated with expected clinical effects are of interest and can be specified by clinicians (e.g., assess HRQOL after the minimum number of cycles of therapy required to observe clinical activity of an agent). Another factor to be considered for the frequency of HRQOL assessment is the recall period for a particular HRQOL instrument. The recall period is the time window a subject is asked to consider when responding to an HRQOL questionnaire. The most common recall periods are one week, two weeks, and four weeks.

– Administering HRQOL questionnaires: To objectively evaluate HRQOL, one needs to minimize physician and study nurse influence on a patient's responses to HRQOL questions. Therefore, the protocol should indicate that patients are to complete the HRQOL questionnaire in a quiet place in the doctor's office at the beginning of his/her office
visit, prior to any physical examination and clinical evaluation by the study nurse and physician. – Specify the magnitude of difference in HRQOL domain score that can be detected with the planned sample size: – This factor is especially important when the HRQOL assessment is considered a secondary endpoint. As the sample size is based on the primary endpoint, it may provide only enough power to detect a relatively large difference in HRQOL scores. The question of whether to increase the sample size to cover HRQOL assessment often depends on how many more additional patients are needed and the importance of the HRQOL issue for the trial. To collect HRQOL data when power will be insufficient to detect effects of interest is a waste of clinical resources and the patient’s time. – Specify how HRQOL scores are to be calculated and analyzed in the statistical analysis section: – Calculation of HRQOL domain scores should be stated clearly, including how missing items will be handled. As a result of the nature of oncology studies, especially in late-stage disease, patients will stop treatment at different time points because of disease progression, intolerance to treatment side effects, or death and therefore fail to complete the HRQOL assessment schedule. For example, if data are missing because of deteriorating patient health, the study estimates of effect on HRQOL will be biased in favor of better HRQOL; the term ‘‘informative missing data’’ is the name for this phenomenon and must be handled with care. Fairclough (34) has written a book on various longitudinal methods to analyze this type of HRQOL data. However, analyzing and interpreting HRQOL data in this setting remain a challenge. – Strategies to improve HRQOL data collection:
• Education at the investigators' meeting and during the site initiation visit: It is important to have investigators and study coordinators committed to the importance of HRQOL assessment. Without this extra effort, HRQOL assessment is likely to be unsuccessful, simply because collecting HRQOL data is not part of routine clinical trial conduct.
• Emphasize the importance of the HRQOL data: Baseline HRQOL forms should be required in order to register a patient to the trial. Associate grant payment per patient with submission of the patient's efficacy data and HRQOL forms; in the author's clinical experience, tying some portion of the grant payment to submission of the HRQOL form has significantly increased the HRQOL completion rate.
• Establish a prospective reminder system for upcoming HRQOL assessments and a system for routine monitoring of forms at the same time clinical monitoring is being conducted.
The following checklist (Table 2) may be helpful when considering inclusion of HRQOL assessment in a clinical trial protocol.

Table 2. Checklist for HRQOL Assessment
• Rationale to assess HRQOL and the choice of HRQOL instrument(s)
• Hypothesis in terms of expected HRQOL outcomes
• Frequency of HRQOL assessment
• Procedures for administering HRQOL questionnaires
• Specify the magnitude of difference in HRQOL domain score that can be detected with the planned sample size
• Specify in the statistical analysis section how HRQOL scores are to be calculated and analyzed
• Strategies to improve HRQOL data collection
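As a rough illustration of the sample-size item above, the sketch below computes the smallest HRQOL difference detectable with a planned per-arm sample size, assuming a two-sided z-test with a known common standard deviation. The per-arm size of 150, the domain-score SD of 1.2 points, and the use of scipy are illustrative assumptions, not values from any particular trial.

```python
# Sketch: minimum detectable difference in an HRQOL domain score for a
# two-arm comparison, given a planned per-arm sample size. Assumes a
# two-sided z-test with a known common standard deviation; all numbers
# are illustrative, not taken from any specific trial.
from scipy.stats import norm

def detectable_difference(n_per_arm, sd, alpha=0.05, power=0.80):
    """Smallest true mean difference detectable with the stated power."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # quantile for the desired power
    return (z_alpha + z_beta) * sd * (2.0 / n_per_arm) ** 0.5

# Example: 150 patients per arm (driven by the efficacy endpoint) and an
# assumed domain-score standard deviation of 1.2 points.
print(round(detectable_difference(n_per_arm=150, sd=1.2), 2))
```

Comparing the resulting detectable difference with the instrument's minimum important difference indicates whether the planned trial can say anything useful about HRQOL.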
5 HRQOL AS THE PRIMARY ENDPOINT
To use HRQOL as the primary endpoint in a clinical trial, prior information must demonstrate at least comparable efficacy of the study treatment to its control. In this context, to design a study with HRQOL as the primary endpoint, the sample size must be large enough to assure adequate power to detect meaningful differences between treatment groups in HRQOL. Another context for a primary HRQOL endpoint is the setting of treatment palliation. In this case, treatment efficacy is shown by the agent's ability to palliate disease-related symptoms and improve overall HRQOL without incurring treatment-related toxicities. For example, patient report
of pain reduction can document the achievement of palliation [e.g., see the Tannock et al. (35) example below]. An HRQOL instrument usually has several domains to assess various aspects of HRQOL, and some instruments also provide an overall or total score. The HRQOL endpoint should specify a particular domain, or the total score, as the primary endpoint of the HRQOL assessment in order to avoid multiplicity issues. If HRQOL is included as a secondary endpoint, it is good practice to identify a particular domain as the primary focus of the HRQOL assessment. This practice forces specification of the expected outcomes of the HRQOL assessment. Some investigators have applied multiplicity adjustments to HRQOL assessments. The approach may be statistically prudent, but it provides little practical value: the variability of HRQOL domain scores is generally large, and with multiple domains being evaluated, only a very large difference between groups will achieve the required statistical significance level. When evaluating HRQOL as a profile of a therapy's impact on patients, clinical judgment of the magnitude of HRQOL changes should carry more weight than statistical significance. However, this ''exploratory'' analysis perspective should also be tempered by the recognition that some results may be only marginally significant and could have occurred by chance.
6 INTERPRETATION OF HRQOL RESULTS
Two approaches have been used to interpret the meaningfulness of observed HRQOL differences between two treatment groups in a clinical trial: distribution-based and anchor-based approaches. The most widely used distribution-based approach is the effect size, among the methods listed in Table 3 (36–45). Based on the effect size, an observed difference is classified as small (0.2), moderate (0.5), or large (0.8). Advocating the use of the effect size to facilitate the interpretation of HRQOL data, Sloan et al. (46) suggested a 0.5 standard deviation as
a reasonable benchmark for clinical meaningfulness on a 0–100 scale. This suggestion is consistent with Cohen's (47) suggestion that one-half of a standard deviation indicates a moderate, and therefore clinically meaningful, effect. The anchor-based approach compares observed differences with an external standard. Investigators have used this approach to define the minimum important difference (MID). For example, Juniper and Guyatt (26) suggested that a 0.5 change in the RQLQ be the MID (RQLQ score ranges from 1 to 7), and Osoba (48) suggested that a 10-point change in the EORTC QLQ-C30 questionnaire would be a MID. Both of these MIDs are group average scores; how they apply to individual patients is still an issue. Another issue in using the MID relates to the starting point of patients' HRQOL scores. Guyatt et al. (49) provide a detailed overview of various strategies for interpreting HRQOL results.

Table 3. Common Methods Used to Measure a Questionnaire's Responsiveness to Change
Method | Formula | Reference
Relative change | (Mean_test1 − Mean_test2) / Mean_test1 | 36
Effect size | (Mean_test1 − Mean_test2) / SD_test1 | 37, 38
Relative efficiency | (Effect size_dimension / Effect size_standard)² | 39
Standardized response mean | (Mean_test1 − Mean_test2) / SD_difference | 40
Responsiveness statistic | (Mean_test1 − Mean_test2) / SD_stable group | 41, 42
Paired t statistic | (Mean_test1 − Mean_test2) / SE_difference | 43
SE of measurement | SD_test × √(1 − Reliability coefficient_test) | 44, 45
Reprinted with permission.
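To make the distribution-based and anchor-based ideas concrete, the following minimal sketch computes the effect size and standardized response mean from Table 3 for a set of made-up paired scores and compares the mean change with an assumed MID of 0.5; all numbers are illustrative, not data from any published study.

```python
# Sketch: two distribution-based responsiveness measures from Table 3
# (effect size and standardized response mean) and a comparison of the
# mean change with an assumed anchor-based MID of 0.5. The paired scores
# below are made up for illustration only.
import numpy as np

baseline = np.array([4.1, 3.8, 4.5, 3.9, 4.2, 4.0, 3.6, 4.4])   # first assessment
follow_up = np.array([4.8, 4.2, 5.1, 4.6, 4.9, 4.3, 4.1, 5.0])  # second assessment
change = follow_up - baseline

# Effect size: mean change divided by the SD of the first assessment.
effect_size = change.mean() / baseline.std(ddof=1)
# Standardized response mean: mean change divided by the SD of the changes.
srm = change.mean() / change.std(ddof=1)

# Rough binning against the conventional 0.2 / 0.5 / 0.8 benchmarks.
label = "small" if effect_size < 0.5 else "moderate" if effect_size < 0.8 else "large"

mid = 0.5  # assumed minimum important difference
print(f"mean change = {change.mean():.2f}, effect size = {effect_size:.2f} ({label}), "
      f"SRM = {srm:.2f}, exceeds MID: {bool(change.mean() >= mid)}")
```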
7 EXAMPLES
7.1 HRQOL in Asthma
To evaluate salmeterol's effect on quality of life, patients with nocturnal asthma were enrolled into a double-blind, parallel group, placebo-controlled, multicenter study (50). The study rationale was that patients with nocturnal asthma who are clinically stable have been found to have poorer cognitive performance and poorer subjective and objective sleep quality compared with normal, healthy patients. To assess salmeterol's effect on reducing the impact of nocturnal asthma on patients' daily functioning and well-being, patients were randomized to
receive salmeterol, 42 µg, or placebo twice daily. Patients were allowed to continue theophylline, inhaled corticosteroids, and ''as-needed'' albuterol. Treatment duration was 12 weeks, with a 2-week run-in period. The primary study objective was to assess the impact of salmeterol on asthma-specific quality of life using the validated Asthma Quality of Life Questionnaire (AQLQ) (24). Patients were to return to the clinic every 4 weeks. Randomized patients were to complete an AQLQ at day 1; at weeks 4, 8, and 12; and at the time of withdrawal from the study for any reason. Efficacy (FEV1, PEF, nighttime awakenings, asthma symptoms, and albuterol use) and safety assessments were also conducted at these clinic visits. Scheduling the HRQOL assessment prior to the efficacy and safety evaluations at office visits minimizes investigator bias and reduces missing HRQOL evaluation forms. The AQLQ is a 32-item, self-administered, asthma-specific instrument that assesses quality of life over a 2-week time interval. Each item is scored using a scale from 1 to 7, with lower scores indicating greater impairment and higher scores indicating less impairment in quality of life. Items are grouped into four domains: (1) activity limitation (assesses the amount of limitation of individualized activities that are important to the patient and are affected by asthma); (2) asthma symptoms (assesses the frequency and degree of discomfort of shortness of breath, chest tightness, wheezing, chest heaviness, cough, difficulty breathing out, fighting for air, heavy breathing, and difficulty getting a good night's sleep); (3) emotional function (assesses the frequency of being afraid of not having medications, concerned about medications, concerned about
having asthma, and frustration); and (4) environmental exposure (assesses the frequency of exposure to and avoidance of irritants such as cigarette smoke, dust, and air pollution). Individual domain scores and a global score are calculated. A change of 0.5 (for both global and individual domain scores) is considered the smallest difference that patients perceive as meaningful (51). Achieving 80% power to detect a difference of 0.5 in the AQLQ between two treatment arms would require only 80 patients per arm at a significance level of 0.05. However, this study was designed to enroll 300 patients per arm so that it could also provide 80% power to detect differences in efficacy variables (e.g., FEV1, nighttime awakening) between the two treatment arms at a significance level of 0.05. A total of 474 patients were randomly assigned to treatment. Mean change from baseline for the AQLQ global score and each of the four domain scores was significantly greater (P < 0.005) with salmeterol compared with placebo, first observed at week 4 and continuing through week 12. In addition, differences between the salmeterol and placebo groups were greater than 0.5 at all visits except at week 4 and week 8 for the environmental exposure domain. At week 12, salmeterol significantly (P < 0.001 compared
with placebo) increased mean change from baseline in FEV1, morning and evening PEF, percentage of symptom-free days, percentage of nights with no awakenings due to asthma, and the percentage of days and nights with no supplemental albuterol use. This study demonstrated that salmeterol's improvement of patients' asthma symptoms translated into improvement in patients' daily activity and well-being.
7.2 HRQOL in Seasonal Allergic Rhinitis
A randomized, double-blind, placebo-controlled study was conducted to evaluate the efficacy, safety, and quality-of-life effects of two approved therapies (fexofenadine HCl 120 mg and loratadine 10 mg) for the treatment of seasonal allergic rhinitis (SAR) (52). Clinical efficacy was based on the patient's evaluation of SAR symptoms: (1) sneezing; (2) rhinorrhea; (3) itchy nose, palate, or throat; and (4) itchy, watery, or red eyes. The primary efficacy endpoint was the total score for the patient symptom evaluation, defined as the sum of the four individual symptom scores. Each symptom was evaluated on a 5-point scale (0 to 4), with higher scores indicating more severe symptoms. Treatment duration was 2 weeks, with a run-in period of
3–7 days. After randomization at study day 1, patients were to return to the clinic every week. During these visits, patients were to be evaluated for the severity of SAR symptoms and to complete a quality of life questionnaire. Patient-reported quality of life was evaluated using a validated disease-specific questionnaire, the Rhinoconjunctivitis Quality of Life Questionnaire (RQLQ) (26). The RQLQ is a 28-item instrument that assesses quality of life over a 1-week time interval. Each item is scored using a scale from 0 (i.e., not troubled) to 6 (i.e., extremely troubled), with higher scores indicating greater impairment in quality of life. Items are grouped into seven domains: (1) sleep, (2) practical problems, (3) nasal symptoms, (4) eye symptoms, (5) non-nose/eye symptoms, (6) activity limitations, and (7) emotional function. Individual domain scores and an overall score are calculated. A change of 0.5 (for both global and individual domain scores) is considered the smallest difference that patients perceive as meaningful (53). The RQLQ assessment was a secondary endpoint; no sample size or power justification was mentioned in the published paper. A total of 688 patients were randomized to receive fexofenadine HCl 120 mg, loratadine 10 mg, or placebo once daily. The mean 24-hour total symptom score (TSS) as evaluated by the patient was significantly reduced from baseline by both fexofenadine HCl and loratadine (P ≤ 0.001) compared with placebo. The difference between fexofenadine HCl and loratadine was not statistically significant. For overall quality of life, a significant improvement from baseline occurred for all three treatment groups (mean improvement was 1.25, 1.00, and 0.93 for fexofenadine HCl, loratadine, and placebo, respectively). The improvement in the fexofenadine HCl group was significantly greater than that in either the loratadine (P ≤ 0.03) or placebo (P ≤ 0.005) groups. However, the magnitude of the differences among the treatment groups was less than the minimal important difference of 0.5. The asthma example demonstrates that salmeterol not only significantly improved patients' asthma-related symptoms, both statistically and clinically, but also relieved their
asthma-induced impairments in daily functioning and well-being. The SAR example, on the other hand, demonstrates that fexofenadine HCl and loratadine were both effective in relieving SAR symptoms; the difference between fexofenadine HCl and loratadine in HRQOL was statistically significant but not clinically meaningful. However, Hays and Woolley (54) have cautioned investigators about the potential for oversimplification when applying a single minimal clinically important difference (MCID).
7.3 Symptom Relief for Late-Stage Cancers
Although the main objective of treatment in early-stage cancers is to eradicate the cancer cells and prolong survival, this may not be achievable in late-stage cancers. More often, the objective of treatment in late-stage cancers is palliation, mainly through relief of cancer-related symptoms. Because the relief of cancer-related symptoms represents a clinical benefit to patients, the objective of some clinical trials in late-stage cancer has been relief of a specific cancer-related symptom such as pain. To investigate the benefit of mitoxantrone in patients with symptomatic hormone-resistant prostate cancer, hormone-refractory patients with pain were randomized to receive mitoxantrone plus prednisone or prednisone alone (35). The primary endpoint was a palliative response, defined as a two-point decrease in pain as assessed by a six-point pain scale completed by patients (or complete loss of pain if initially 1+), without an increase in analgesic medication and maintained for two consecutive evaluations at least 3 weeks apart. A palliative response was observed in 23 of 80 patients (29%; 95% confidence interval, 19–40%) who received mitoxantrone plus prednisone, and in 10 of 81 patients (12%; 95% confidence interval, 6–22%) who received prednisone alone (P = 0.01). No difference existed in overall survival. In another study, assessing the effect of gemcitabine on relief of pain (55), 162 patients with advanced symptomatic pancreatic cancer completed a lead-in period to characterize and stabilize pain and were randomized to receive either gemcitabine 1000 mg/m2
weekly × 7 followed by 1 week of rest, then weekly × 3 every 4 weeks thereafter, or fluorouracil (5-FU) 600 mg/m2 once weekly. The primary efficacy measure was clinical benefit response, which was a composite of measurements of pain (analgesic consumption and pain intensity), Karnofsky performance status, and weight. Clinical benefit required a sustained (≥4 weeks) improvement in at least one parameter without worsening in any others. Clinical benefit response was experienced by 23.8% of gemcitabine-treated patients compared with 4.8% of 5-FU-treated patients (P = 0.0022). In addition, the median survival durations were 5.65 and 4.41 months for gemcitabine-treated and 5-FU-treated patients, respectively (P = 0.0025). Regarding the use of composite variables, researchers have urged investigators to report descriptive results for all components so that composite results do not obscure potential negative results for one or more of the components of the composite (56, 57). In a third study example, although symptom assessment was not the primary endpoint, it was the main differentiating factor between the two study arms in study outcomes. As second-line treatment of small cell lung cancer (SCLC), topotecan was compared with cyclophosphamide, doxorubicin, and vincristine (CAV) in 211 patients with SCLC who had relapsed at least 60 days after completion of first-line therapy (58). Response rate and duration of response were the primary efficacy endpoints. Patient-reported lung-cancer-related symptoms were also evaluated as secondary endpoints. Similar efficacy in response rate, progression-free survival, and overall survival was observed between topotecan and CAV. The response rate was 26 of 107 patients (24.3%) treated with topotecan and 19 of 104 patients (18.3%) treated with CAV (P = 0.285). Median times to progression were 13.3 weeks (topotecan) and 12.3 weeks (CAV) (P = 0.552). Median survival was 25.0 weeks for topotecan and 24.7 weeks for CAV (P = 0.795). However, the proportion of patients who experienced symptom improvement was greater in the topotecan group than in the CAV group for four of eight lung-cancer-related symptoms evaluated, including dyspnea, anorexia,
hoarseness, and fatigue, as well as interference with daily activity (P ≤ 0.043).
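As an illustration of how a derived endpoint such as the palliative response in the mitoxantrone trial might be programmed, the sketch below codes the rule described above (a two-point pain decrease, or complete loss of pain if the baseline score is 1, with no increase in analgesic use, maintained over two consecutive evaluations). The function name, the list-of-visits data structure, and the reading of ''1+'' as a score of 1 are assumptions made for this example.

```python
# Sketch of the palliative-response rule from the mitoxantrone example:
# a two-point pain decrease (or loss of pain if the baseline score is 1),
# with no increase in analgesic medication, maintained over two consecutive
# evaluations. The evaluations are assumed to be recorded in chronological
# order at least 3 weeks apart; names and data layout are hypothetical.
def palliative_response(baseline_pain, visits):
    def responding(pain, analgesic_increased):
        improved = (baseline_pain - pain >= 2) or (baseline_pain == 1 and pain == 0)
        return improved and not analgesic_increased

    flags = [responding(pain, up) for pain, up in visits]
    # Require the response to hold at two consecutive evaluations.
    return any(flags[i] and flags[i + 1] for i in range(len(flags) - 1))

# Example: baseline pain 4; follow-up scores 2, 1, 1 with no analgesic increase.
print(palliative_response(4, [(2, False), (1, False), (1, False)]))   # True
print(palliative_response(4, [(2, True), (1, False), (3, False)]))    # False
```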
8 CONCLUSION
Although HRQOL assessment in clinical trials has increased steadily over the years, substantial challenges remain in the interpretation of HRQOL results and in acceptance of their value in clinical research. Both issues will require time for clinicians and regulators to fully accept HRQOL assessments. To help build acceptance, existing HRQOL instruments should be validated in each therapeutic area, rather than developing new instruments. The most urgent need in HRQOL research is to increase HRQOL acceptance by clinicians and regulators so that pharmaceutical companies will continue to include financial support for HRQOL assessments in new and existing drug development programs.
9 ACKNOWLEDGMENT
The author is deeply indebted to Carol M. Moinpour for her numerous suggestions and to Carl Chelle for his editorial assistance.

REFERENCES

1. J. A. Cramer and B. Spilker, Quality of Life and Pharmacoeconomics: An Introduction. Philadelphia: Lippincott-Raven, 1998.
2. World Health Organization, The First Ten Years of the World Health Organization. Geneva: World Health Organization, 1958, p. 459.
3. J. E. Ware, Jr. and C. D. Sherbourne, The MOS 36-Item Short-Form Health Survey (SF-36). I. Conceptual framework and item selection. Med. Care 1992; 30: 473–483.
4. C. A. McHorney, J. E. Ware, Jr., and A. E. Raczek, The MOS 36-Item Short-Form Health Survey (SF-36): II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med. Care 1993; 31: 247–263.
5. C. A. McHorney et al., The MOS 36-Item Short-Form Health Survey (SF-36): III. Tests of data quality, scaling assumptions, and reliability across diverse patient groups. Med. Care 1994; 32: 40–66.
6. M. Bergner et al., The Sickness Impact Profile: development and final revision of a health status measure. Med. Care 1981; 19: 787–805.
7. S. M. Hunt, J. McEwen, and S. P. McKenna, Measuring Health Status. London: Croom Helm, 1986.
8. G. R. Parkerson, Jr., W. E. Broadhead, and C. K. Tse, The Duke Health Profile. A 17-item measure of health and dysfunction. Med. Care 1990; 28: 1056–1072.
9. L. W. Chambers, The McMaster Health Index Questionnaire (MHIQ): Methodologic Documentation and Report of the Second Generation of Investigations. Hamilton, Ontario, Canada: McMaster University, Department of Clinical Epidemiology and Biostatistics, 1982.
10. A. M. Jette et al., The Functional Status Questionnaire: reliability and validity when used in primary care. J. Gen. Intern. Med. 1986; 1: 143–149.
11. S. Szabo (on behalf of the WHOQOL Group), The World Health Organization Quality of Life (WHOQOL) assessment instrument. In: B. Spilker (ed.), Quality of Life and Pharmacoeconomics in Clinical Trials. Philadelphia: Lippincott-Raven, 1996.
12. C. S. Cleeland, Measurement of pain by subjective report. In: C. R. Chapman and J. D. Loeser (eds.), Issues in Pain Measurement. New York: Raven Press, 1989.
13. R. Melzack, The McGill Pain Questionnaire: major properties and scoring methods. Pain 1975; 1: 277–299.
14. I. McDowell and C. Newell, Measuring Health: A Guide to Rating Scales and Questionnaires, 2nd ed. New York: Oxford University Press, 1996.
15. A. T. Beck et al., An inventory for measuring depression. Arch. Gen. Psychiat. 1961; 4: 561–571.
16. L. S. Radloff, The CES-D scale: a self-report depression scale for research in the general population. Appl. Psychol. Measure. 1977; 1: 385–401.
17. M. Hamilton, Standardised assessment and recording of depressive symptoms. Psychiat. Neurol. Neurochir. 1969; 72: 201–205.
18. A. Zigmond and P. Snaith, The Hospital Anxiety and Depression Questionnaire. Acta Scand. Psychiat. 1983; 67: 361–368.
19. W. W. K. Zung, A self-rating depression scale. Arch. Gen. Psychiat. 1965; 12: 63–70.
20. P. Bech et al., The WHO (Ten) Well-Being Index: validation in diabetes. Psychother. Psychosomat. 1996; 65: 183–190.
21. J. F. Fries et al., The dimension of health outcomes: the Health Assessment Questionnaire, disability and pain scales. J. Rheumatol. 1982; 9: 789–793.
22. G. H. Guyatt, A. Mitchell, E. J. Irvine et al., A new measure of health status for clinical trials in inflammatory bowel disease. Gastroenterology 1989; 96: 804–810.
23. E. J. Irvine, B. Feagan et al., Quality of life: a valid and reliable measure of therapeutic efficacy in the treatment of inflammatory bowel disease. Gastroenterology 1994; 106: 287–296.
24. E. F. Juniper, G. H. Guyatt, P. J. Ferrie, and L. E. Griffith, Measuring quality of life in asthma. Am. Rev. Respir. Dis. 1993; 147: 832–838.
25. P. W. Jones, F. H. Quirk, and C. M. Baveystock, The St. George's Respiratory Questionnaire. Respiratory Med. 1991; 85: 25–31.
26. E. F. Juniper and G. H. Guyatt, Development and testing of a new measure of health status for clinical trials in rhinoconjunctivitis. Clin. Exp. Allergy 1991; 21: 77–83.
27. V. Peto et al., The development and validation of a short measure of functioning and well being for individuals with Parkinson's disease. Qual. Life Res. 1995; 4(3): 241–248.
28. A. G. E. M. De Boer et al., Quality of life in patients with Parkinson's disease: development of a questionnaire. J. Neurol. Neurosurg. Psychiat. 1996; 61(1): 70–74.
29. N. K. Aaronson et al., The European Organization for Research and Treatment of Cancer QLQ-C30: a quality-of-life instrument for use in international clinical trials in oncology. J. Natl. Cancer Inst. 1993; 85: 365–376.
30. D. F. Cella et al., The Functional Assessment of Cancer Therapy scale: development and validation of the general measure. J. Clin. Oncol. 1993; 11: 570–579.
31. C. M. Moinpour, Costs of quality-of-life research in Southwest Oncology Group trials. J. Natl. Cancer Inst. Monogr. 1996; 20: 11–16.
32. J. A. Sloan et al. and the Clinical Significance Consensus Meeting Group, The costs of incorporating quality of life assessments into clinical practice and research: what resources are required? Clin. Therapeut. 2003; 25(Suppl D).
33. P. J. Goodwin et al., Health-related quality-of-life measurement in randomized clinical trials in breast cancer—taking stock. J. Natl. Cancer Inst. 2003; 95: 263–281.
34. D. L. Fairclough, Design and Analysis of Quality of Life Studies in Clinical Trials. Boca Raton, FL: Chapman & Hall/CRC Press, 2002.
35. I. F. Tannock et al., Chemotherapy with mitoxantrone plus prednisone or prednisone alone for symptomatic hormone-resistant prostate cancer: a Canadian randomized trial with palliative end points. J. Clin. Oncol. 1996; 14: 1756–1764.
36. R. A. Deyo and R. M. Centor, Assessing the responsiveness of functional scales to clinical change: an analogy to diagnostic test performance. J. Chronic Dis. 1986; 39: 897–906.
37. Kazis et al., 1989.
38. R. A. Deyo, P. Diehr, and D. L. Patrick, Reproducibility and responsiveness of health status measures: statistics and strategies for evaluation. Control Clin. Trials 1991; 12(4 Suppl): 142S–158S.
39. C. Bombardier, J. Raboud, and the Auranofin Cooperating Group, A comparison of health-related quality-of-life measures for rheumatoid arthritis research. Control Clin. Trials 1991; 12(4 Suppl): 243S–256S.
40. J. N. Katz et al., Comparative measurement sensitivity of short and longer health status instruments. Med. Care 1992; 30: 917–925.
41. G. H. Guyatt, S. Walter, and G. Norman, Measuring change over time: assessing the usefulness of evaluative instruments. J. Chronic Dis. 1987; 40: 171–178.
42. G. H. Guyatt, B. Kirshner, and R. Jaeschke, Measuring health status: what are the necessary measurement properties? J. Clin. Epidemiol. 1992; 45: 1341–1345.
43. M. H. Liang et al., Comparative measurement efficiency and sensitivity of five health status instruments for arthritis research. Arthritis Rheum. 1985; 28: 542–547.
44. K. W. Wyrwich, N. A. Nienaber, W. M. Tierney, and F. D. Wolinsky, Linking clinical relevance and statistical significance in evaluating intra-individual changes in health-related quality of life. Med. Care 1999; 37: 469–478.
45. K. W. Wyrwich, W. M. Tierney, and F. D. Wolinsky, Further evidence supporting an SEM-based criterion for identifying meaningful intra-individual changes in health-related quality of life. J. Clin. Epidemiol. 1999; 52: 861–873.
46. J. A. Sloan et al., Randomized comparison of four tools measuring overall quality of life in patients with advanced cancer. J. Clin. Oncol. 1998; 16: 3662–3673.
47. J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. London: Academic Press, 1988.
48. D. Osoba, G. Rodrigues, J. Myles, B. Zee, and J. Pater, Interpreting the significance of changes in health-related quality of life scores. J. Clin. Oncol. 1998; 16: 139–144.
49. G. H. Guyatt et al. and the Clinical Significance Consensus Meeting Group, Methods to explain the clinical significance of health status measures. Mayo Clin. Proc. 2002; 77: 371–383.
50. R. F. Lockey et al., Nocturnal asthma: effect of salmeterol on quality of life and clinical outcomes. Chest 1999; 115: 666–673.
51. E. F. Juniper et al., 1994.
52. P. Van Cauwenberge and E. F. Juniper, The Star Study Investigating Group. Comparison of the efficacy, safety and quality of life provided by fexofenadine hydrochloride 120 mg, loratadine 10 mg and placebo administered once daily for the treatment of seasonal allergic rhinitis. Clin. Exper. Allergy 2000; 30: 891–899.
53. E. F. Juniper et al., 1996.
54. R. D. Hays and J. M. Woolley, The concept of clinically meaningful difference in health-related quality-of-life research. How meaningful is it? Pharmacoeconomics 2000; 18: 419–423.
55. H. A. Burris, III et al., Improvements in survival and clinical benefit with gemcitabine as first-line therapy for patients with advanced pancreas cancer: a randomized trial. J. Clin. Oncol. 1997; 15: 2403–2413.
56. N. Freemantle, M. Calvert, J. Wood, J. Eastaugh, and C. Griffin, Composite outcomes in randomized trials. Greater precision but with greater uncertainty? JAMA 2003; 289: 2554–2559.
57. M. S. Lauer and E. J. Topol, Clinical trials—multiple treatment, multiple end points, and multiple lessons. JAMA 2003; 289: 2575–2577.
58. J. von Pawel et al., Topotecan versus cyclophosphamide, doxorubicin, and vincristine for the treatment of recurrent small-cell lung cancer. J. Clin. Oncol. 1999; 17: 658–667.

FURTHER READING

B. Spilker (ed.), Quality of Life and Pharmacoeconomics in Clinical Trials. Philadelphia: Lippincott-Raven, 1996.
M. A. G. Sprangers, C. M. Moinpour, T. J. Moynihan, D. L. Patrick, and D. A. Revicki, Clinical Significance Consensus Meeting Group, Assessing meaningful change in quality of life over time: a users' guide for clinicians. Mayo Clin. Proc. 2002; 77: 561–571.
J. E. Ware et al., SF-36 Health Survey: Manual and Interpretation Guide. Boston: The Health Institute, New England Medical Center, 1993.
World Health Organization, International Classification of Impairments, Disabilities, and Handicaps. Geneva: World Health Organization, 1980.
AUDIT

An Audit is a systematic and independent examination of trial-related activities and documents to determine whether the evaluated trial-related activities were conducted, and the data were recorded, analyzed, and reported accurately according to the protocol, the sponsor's Standard Operating Procedures (SOPs), Good Clinical Practice (GCP), and applicable regulatory requirement(s). The purpose of a sponsor's audit, which is independent of and separate from routine monitoring or quality control functions, should be to evaluate trial conduct and compliance with the protocol, SOPs, GCP, and applicable regulatory requirements.

The sponsor should appoint individuals who are independent of the clinical trial/data collection system(s) to conduct audits. The sponsor should ensure that the auditors are qualified by training and by experience to conduct audits properly. An auditor's qualifications should be documented. The sponsor should ensure that the auditing of clinical trials/systems is conducted in accordance with the sponsor's written procedures on what to audit, how to audit, the frequency of audits, and the form and content of audit reports. The sponsor's audit plan and procedures for a trial audit should be guided by the importance of the trial to submissions to regulatory authorities, the number of subjects in the trial, the type and complexity of the trial, the level of risks to the trial subjects, and any identified problem(s). The observations and findings of the auditor(s) should be documented.

To preserve the independence and the value of the audit function, the regulatory authority(ies) should not request the audit reports routinely. Regulatory authority(ies) may seek access to an audit report on a case-by-case basis, when evidence of serious GCP noncompliance exists, or in the course of legal proceedings or investigations. Where required by applicable law or regulation, the sponsor should provide an audit certificate.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D'Agostino and Sarah Karl.
AUDIT CERTIFICATE

An Audit Certificate is the declaration of confirmation by the auditor that an audit has taken place. Where required by applicable law or regulation, the sponsor should provide an audit certificate.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D'Agostino and Sarah Karl.
AUDIT REPORT

An Audit Report is the written evaluation by the sponsor's auditor of the results of the audit. To preserve the independence and the value of the audit function, the regulatory authority(ies) should not request the audit reports routinely. Regulatory authority(ies) may seek access to an audit report on a case-by-case basis, when evidence of serious Good Clinical Practice (GCP) noncompliance exists, or in the course of legal proceedings or investigations.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D'Agostino and Sarah Karl.
BAYESIAN DOSE-FINDING DESIGNS IN HEALTHY VOLUNTEERS
YINGHUI ZHOU
Medical and Pharmaceutical Statistics Research Unit, The University of Reading, Reading, Berkshire, United Kingdom

Phase I studies have unique characteristics that distinguish them from other phases of clinical research. This is the phase in which new experimental drugs are given to human subjects for the first time: a more explicit name for such trials is ''first-into-man'' (FIM) studies. Although intensive toxicologic studies have been carried out in preclinical trials, the primary concern in FIM studies is always the safety of the participating subjects (1). Safety can be assessed by the incidence of dose-limiting events (DLE), such as moderate or severe adverse events, clinically significant changes in cardiac function, exceeding the maximum exposure limits of pharmacokinetic profiles such as the area under the curve (AUC) or the maximum concentration (Cmax), and changes in pharmacodynamic parameters such as blood flow or heart rate. The majority of FIM studies are conducted on healthy volunteers, for whom dose-limiting events should generally be avoided. Only a small proportion of FIM studies is conducted on late-stage patients in cancer trials. For these patients, standard therapies have failed, and low, safe doses will not achieve therapeutic effects, while high, unsafe doses will cause toxicity. A small risk of a DLE is therefore permitted to gain some therapeutic effect. The primary objective of FIM studies is to find an optimal dose, both safe and potentially efficacious, for later phases of clinical research. This involves a dose-escalation scheme: a fixed range of discrete doses, d1 < ··· < dk for some integer k, is predefined by the investigators to be administered in turn to subjects to assess the safety and tolerability of the compound. For healthy volunteer studies, the optimal dose is the maximum safe dose leading to an acceptable concentration of drug in plasma or to an adequate description of the biological effects.

The majority of phase I studies are conducted in healthy volunteers. Current dose-escalation designs for healthy volunteer studies are usually crossover designs in which each subject receives a dose in each of a series of consecutive treatment periods separated by washout periods. Such a design is illustrated in Table 1. Groups of subjects, known as cohorts or panels, some on active doses and others on placebo, are treated simultaneously. The lowest dose, d1, is normally regarded as the only ''safe'' dose to be used for the first cohort of subjects. Safety data, along with pharmacokinetic/pharmacodynamic (PK/PD) data, from each treatment period are collected and then summarized in tables, listings, and graphs for a safety review committee to make decisions regarding the doses to be given during the next period. The committee will normally assign doses according to some predefined dose-escalation scheme, but they may alter the scheme by repeating the previous doses or de-escalating one dose level; they may even stop the trial if its safety is in question. This is essentially a PK/PD ''clinical judgment''–guided design. Response variables are analyzed either by simple analysis of variance (ANOVA) approaches (2) or by repeated measures analysis of variance. The PK/PD models may also be estimated and presented at the end of a trial. Under conventional dose-escalation rules, it is likely that healthy volunteers will be treated at subtherapeutic doses. Consequently, information gathered from a trial is mostly not relevant for identifying the optimal dose for phase II studies. Bayesian decision-theoretic designs, motivated by statistical-model–based designs for cancer trials, have been proposed to enhance the precision of the optimal dose for phase II studies and to increase the overall efficiency of the dose-escalation procedure while maintaining the safety of subjects (3–6).

Table 1. An example of a crossover design: cohorts of healthy volunteers are treated in a staggered sequence of dosing intervals, with an X marking the intervals in which each cohort receives treatment.
1 A BAYESIAN DECISION-THEORETIC DESIGN
In healthy volunteer studies, several PK/PD measurements are normally monitored and recorded. The methodology described here,
however, focuses on a single pharmacokinetic variable derived from the curve relating the concentration of the drug in plasma to the time since administration (7). Commonly used summaries such as the area under the curve (AUC) or the peak drug concentration (Cmax) are often modeled by the normal distribution after a logarithmic transformation (8). The mean value of y, denoting either log(AUC) or log(Cmax), is modeled as linear in log dose: E(y) = θ1 + θ2 log(dose), where θ1 and θ2 are unknown population parameters. A maximum exposure level, L, is defined before the start of the study based on the toxicity profile for the compound observed in the most sensitive animal species. Any concentration of the drug leading to an AUC or Cmax in excess of this level will be unacceptable. As each subject receives more than one dose, the multiple responses of each subject are correlated. Furthermore, different subjects vary considerably in terms of pharmacokinetics. Therefore, both random subject effects and random errors are included in the model. A log–log mixed effects model (3–6) is fitted to yij, the response following the j-th dose to the i-th subject:

yij = θ1 + θ2 ℓij + si + εij,    (1)

where ℓij denotes the logarithm of the j-th active dose received by the i-th subject. The term si is a random effect relating to the i-th subject. The si and εij are modeled as mutually independent, normally distributed random variables with mean zero and variances τ2 and σ2, respectively. The correlation, ρ, between two responses on the same subject is equal to τ2/(σ2 + τ2). Placebo administrations are ignored in this model because no drug will be detected in plasma.
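A minimal sketch of simulating log(Cmax) data from model (1) is given below, using the parameter estimates quoted later in equation (5) (θ1 = 1.167, θ2 = 0.822, σ² = 0.053, τ² = 0.073); the dose sequence, the number of subjects, and the use of NumPy are illustrative assumptions.

```python
# Sketch: simulate log(Cmax) responses from the log-log mixed model (1),
#   y_ij = theta1 + theta2 * log(dose_ij) + s_i + eps_ij,
# with s_i ~ N(0, tau^2) and eps_ij ~ N(0, sigma^2). The parameter values
# are the estimates quoted later in equation (5); the dose sequence and
# the number of subjects are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
theta1, theta2 = 1.167, 0.822
sigma2, tau2 = 0.053, 0.073

doses = np.array([2.5, 10.0, 25.0, 50.0])    # one subject's active doses (ug)
n_subjects = 8

for i in range(n_subjects):
    s_i = rng.normal(0.0, tau2 ** 0.5)                       # random subject effect
    eps = rng.normal(0.0, sigma2 ** 0.5, size=doses.size)    # within-subject errors
    log_cmax = theta1 + theta2 * np.log(doses) + s_i + eps
    print(f"subject {i + 1}: simulated Cmax (pg/mL) =", np.round(np.exp(log_cmax), 1))
```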
Bayesian decision theory supplies a general framework for making decisions under uncertainty, which can be applied in many scientific and business fields (9–11). Let θ = (θ1, θ2) be the vector of unknown parameters. Some information or expert opinion about θ, from either preclinical data or experience with similar compounds, will be available before the dose-escalation study begins. This information can be formulated as a probability density function for θ, denoted by h0(θ) and known as the ''prior density.'' Let x denote the data collected in the dose-escalation study, and let f(x | θ) be the likelihood function of the data. Then the posterior density function of θ, h(θ | x), can be derived using Bayes' theorem:

h(θ | x) = h0(θ) f(x | θ) / ∫ h0(φ) f(x | φ) dφ.

This posterior represents the opinion about θ formed by combining the prior opinion with the data. Potentially, one could treat θ1, θ2, τ2, and σ2 all as unknown parameters. However, the more parameters that are modeled, the less accurate the resulting estimates, especially as sample sizes are small in phase I trials. One way to address this problem is to give some parameters fixed values as if they were known. In particular, the within-subject correlation, ρ, might be set to a value such as 0.6. There are then three unknown parameters in model (1) instead of four, namely θ1, θ2, and σ2. Fixing ρ is convenient, and the effect of doing so has been studied by Whitehead et al. (5) by taking the alternative strategy of specifying discrete priors on ρ. In the resulting analyses, ρ is underestimated by the Bayesian procedure, and σ2 is overestimated. Consequently, it was concluded that fixing ρ is a reasonable policy for practical use.

Conjugate priors have been proposed for θ1, θ2, and ν (3–6), where ν is the within-subject precision, ν = σ−2. In particular, the conditional distribution of θ given ν and the marginal distribution of ν can be taken to be

θ | ν ∼ N(µ0, (νQ0)−1);  ν ∼ Ga(α0, β0),    (2)

where N denotes a normal distribution, Ga a gamma distribution, and the values of µ0, Q0, α0, and β0 are chosen to represent prior knowledge. Being a conjugate prior, the posterior distribution shares the same form. Suppose that n subjects have been treated. The i-th subject has received pi periods of treatment, i = 1, …, n, and so a total of p1 + ··· + pn = p observations are available. Let the p-dimensional y denote the vector of responses with elements yij ordered by subject and by period within subject. The (p × 2) design matrix X is made up of rows of the form (1, ℓij), ordered in the same way as y. The (p × n) matrix U, the design matrix of the random subject effects, is defined as having a 1 in the i-th column of rows p1 + ··· + pi−1 + 1, …, p1 + ··· + pi for i = 1, …, n, and zeros elsewhere. The identity matrix is denoted by I, and the (p × p) matrix P is defined as P = [I + {ρ/(1 − ρ)}UU′]−1. The posterior distributions for θ and ν are

θ | ν ∼ N(µ, (νQ)−1);  ν ∼ Ga(α, β),    (3)

where α = α0 + p/2, β = β0 + (y′Py + µ0′Q0µ0 − µ′Qµ)/2, µ = (Q0 + X′PX)−1(Q0µ0 + X′Py), and Q = Q0 + X′PX.

Priors reflect not only one's opinions, but also how strongly they are held. Here, a small value for α0 represents a weak opinion. Consequently, dose escalation may be quick at the beginning, as a few safe observations will soon overcome any prior reservations. A larger value for α0 represents a strong prior, and the resulting dose escalation will be conservative, as prior concerns will be removed only by a clear demonstration of safety.

A safety constraint can be used to help control overdosing (3–6). This requires that no dose be given if the predicted response at that dose is likely to exceed the safety limit L. Mathematically, this can be expressed as

P(yij > log L | dij, y) ≥ π0,    (4)

where π0, the tolerance level, can be set at a low value such as 0.05 or 0.20. The dose at which the above probability is equal to π0 is called the maximum safe dose for the i-th subject following the j-th observation. The maximum safe dose is subject related, and posterior estimates may differ among subjects who have already been observed, being lower for a subject who previously absorbed more drug than average and higher if the absorption was less. After each treatment period of the dose-escalation procedure, the posterior predictive probability that a future response will lie above the safety limit is updated.

The decision of which dose to administer to each subject in each dosing period is made using a predefined criterion. This criterion can be based on safety; for example, one could use the maximum safe dose as the recommended dose. It can also be based on the accuracy of estimates of the unknown parameters; for example, the optimal choice of doses is that which minimizes the determinant of the posterior variance-covariance matrix of the joint posterior distribution or minimizes the posterior variance of some key parameter.
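The conjugate update in equations (2) and (3) and the safety screen in (4) can be sketched in a few lines of code. The fragment below is only an illustration: the prior mean, α0, and β0 are taken from the worked example in the next section, but the prior precision matrix Q0, the observed doses and Cmax values, and the Monte Carlo evaluation of the predictive exceedance probability (rather than a closed-form calculation) are assumptions made for this sketch.

```python
# Sketch of the conjugate posterior update in equations (2)-(3) and a
# Monte Carlo check of the safety constraint (4) for a new subject.
# Prior precision Q0, the doses, and the responses are illustrative.
import numpy as np

rng = np.random.default_rng(0)
rho = 0.6                                  # fixed within-subject correlation
log_L = np.log(200.0)                      # safety limit on Cmax (pg/mL)

# Prior: theta | nu ~ N(mu0, (nu*Q0)^-1), nu ~ Ga(alpha0, beta0)
mu0 = np.array([0.756, 1.0])
Q0 = np.eye(2)                             # assumed prior precision matrix
alpha0, beta0 = 1.0, 0.309

# Hypothetical observations from the first two subjects
subject = np.array([0, 0, 1, 1])
dose = np.array([2.5, 10.0, 2.5, 10.0])    # ug
cmax = np.array([9.9, 19.7, 12.4, 27.0])   # pg/mL

y = np.log(cmax)
X = np.column_stack([np.ones_like(dose), np.log(dose)])
n, p = subject.max() + 1, len(y)
U = np.zeros((p, n))
U[np.arange(p), subject] = 1.0
P = np.linalg.inv(np.eye(p) + (rho / (1 - rho)) * U @ U.T)

# Posterior: theta | nu ~ N(mu, (nu*Q)^-1), nu ~ Ga(alpha, beta)
Q = Q0 + X.T @ P @ X
mu = np.linalg.solve(Q, Q0 @ mu0 + X.T @ P @ y)
alpha = alpha0 + p / 2
beta = beta0 + (y @ P @ y + mu0 @ Q0 @ mu0 - mu @ Q @ mu) / 2

def prob_exceeds_limit(d, draws=20000):
    """Posterior predictive P(log Cmax > log L) for a new subject at dose d."""
    nu = rng.gamma(alpha, 1.0 / beta, size=draws)             # precision draws
    z = rng.multivariate_normal(np.zeros(2), np.linalg.inv(Q), size=draws)
    theta = mu + z / np.sqrt(nu)[:, None]                     # theta | nu draws
    mean = theta[:, 0] + theta[:, 1] * np.log(d)
    sd_new = np.sqrt(1.0 / (nu * (1 - rho)))                  # subject effect + error
    y_new = mean + rng.normal(0.0, 1.0, size=draws) * sd_new
    return np.mean(y_new > log_L)

for d in [2.5, 5, 10, 25, 50, 100, 150]:
    print(f"dose {d:6.1f} ug: P(Cmax > 200) = {prob_exceeds_limit(d):.3f}")
```

The largest candidate dose whose exceedance probability stays below the tolerance level π0 would then play the role of the maximum safe dose.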
2 AN EXAMPLE OF DOSE ESCALATION IN HEALTHY VOLUNTEER STUDIES
An example, described in detail by Whitehead et al. (5), in which the principal pharmacokinetic measure was Cmax, will be outlined briefly here. The safety cutoff for this response was taken to be yL = log(200). Seven doses were used according to the schedule 2.5, 5, 10, 25, 50, 100, and 150 µg. The actual trial was conducted according to a conventional design, and the dosing structure and the resulting data are listed in Table 2.

Table 2. Real data from a healthy volunteer trial in Whitehead et al. (2006): doses (2.5–150 µg) and Cmax values (pg/mL) for 25 subjects over four treatment periods. Source: Whitehead et al. Stat Med. 2006; 25: 433–445.

From a SAS PROC MIXED analysis, the maximum likelihood estimates of the parameters in model (1) are

θ1 = 1.167, θ2 = 0.822, σ2 = 0.053, and τ2 = 0.073.    (5)
Note that it follows that the within-subject correlation is estimated as 0.579. As an illustration, the Bayesian method described above will be applied retrospectively to this situation. Conjugate priors must first be expressed. It is assumed that prior expert opinion suggested Cmax values of 5.32 and 319.2 pg/mL at the doses 2.5 and 150 µg, respectively. This prior indicates that the highest dose, 150 µg, is not a safe dose, because the predicted Cmax exceeds the safety limit of 200 pg/mL. The value for ρ is set as 0.6. This forms the bivariate normal prior distribution for θ:

θ | ν ∼ N( (0.756, 1)′, ( 0.940  −0.109 ; −0.109  0.04 ) ).
The value for α0 is set as 1, as suggested by Whitehead et al. (5). The value for β0 can be found via the safety constraint P(y01 > log L | d01 = 2.5) = 0.05. This implies that the 2.5 µg dose will be the maximum safe dose for all new subjects at the first dosing period in the first cohort. Therefore, ν ∼ Ga(1, 0.309). To illustrate Bayesian dose escalation, data are simulated based on the parameters found from the mixed model analysis given by Whitehead et al. (5). Thus, Cmax values are generated for three cohorts of eight healthy volunteers, each treated in four consecutive periods and receiving three active doses and one randomly placed placebo. A simulated dose escalation with doses chosen according to the maximum safe dose criterion is shown in Table 3. The first six subjects received the lowest dose, 2.5 µg. All subjects at the next dosing period received 10 µg, so that the 5 µg dose was skipped for subjects 1 to 4. Subjects 7 and 8, who were on placebo in the first dosing period, skipped two doses, 2.5 and 5 µg, to receive 10 µg in the second dosing period. If this dosing proposal were presented to a safety committee in a real trial, the committee members might wish to alter the dose recommendation. The Bayesian approach provides a scientific recommendation for dose escalation; however, the final decision on which doses are given should come from a safety committee. The procedure would be able to make use of results
from any dose administered. In Table 3, it is shown that the maximum safe dose for a new subject at the beginning of the second cohort is 25 µg. All subjects in the second cohort received 50 µg at least twice. Subjects 17 to 22 received 50 µg in the first dosing period of the final cohort. However, they all had high values of Cmax. The Bayesian approach then recommended a lower dose, 25 µg, for subjects 17 to 20 (subjects 21 and 22 were on placebo). This shows that the Bayesian approach can react to different situations quickly: when escalation looks safe, dose levels can be skipped; when escalation appears to be unsafe, lower doses are recommended. Two high doses that were administered in the real trial, 100 and 150 µg, were never used in this illustrative run.

Table 3. A simulated dose escalation: recommended doses and simulated Cmax values (pg/mL) for 24 subjects (three cohorts of eight) over four dosing periods, with doses chosen according to the maximum safe dose criterion.

The posterior distributions for θ at the end of the third cohort are
θ | ν ∼ N( (1.376, 0.790)′, ( 0.024  −0.006 ; −0.006  0.002 ) )
and ν ∼ Ga(37, 2.625). Figure 1 shows the doses administered to each subject and the corresponding responses. Table 4 gives the maximum likelihood estimates from the real data in Table 2 that were used as the true values in the simulation, together with the maximum likelihood estimates from the simulated data in Table 3. Results show that σ 2 and τ 2 were underestimated from the simulated data, with there being no evidence of between-subject variation. Consequently, the estimated correlation from the simulated data is zero, in contrast to the true value of 0.579 used in the simulation. This is a consequence of the small dataset, and illustrates the value of fixing ρ during the escalation process. Different prior settings will result in different dose escalations. For example, if the value for α 0 is changed from 1.0 to 0.1, then the dose-escalation will be more rapid. Table 5 summarizes the recommended doses and simulated responses from another simulation run where α 0 = 0.1. In the second dosing period of the first cohort, subjects 1, 2, and 4 skipped two doses, 5 and 10 µg. Subjects 3, 7, and 8 skipped three doses in that cohort.
The starting dose for all subjects in the first dosing period of the second and third cohorts was 50 µg (25 µg and 50 µg were the corresponding doses in Table 3). On all but one occasion, subjects repeatedly received 50 µg during the second cohort. In the third cohort, the dose of 100 µg was used quite frequently. The highest dose, 150 µg, was never used. This example shows that different prior settings will affect dose-escalation procedures. Multiple simulations should therefore be conducted to gain a better understanding of the properties of a design. Different scenarios should be tried to ensure that the procedure has good properties, whether the drug is safe, unsafe, or safe only at some lower doses.

Table 5. A simulated dose escalation with α0 = 0.1: recommended doses and simulated Cmax values (pg/mL) for 24 subjects over four dosing periods.
3 DISCUSSION
Bayesian methods offer advantages over conventional designs. Unlike conventional
designs, more doses within the predefined dose range, or even outside of the predefined dose range, can be explored without necessarily needing extra dosing time as dose level skipping can be permitted. From the simulation runs in Tables 3 and 5, more doses, such as 40, 60, and 80 µg, could perhaps be included in the dose scheme. Simulations have shown that skipping dose levels does not affect either safety or accuracy; on the contrary, it improves safety or accuracy as a wider range of doses is used (12). Providing a greater choice of doses, while allowing dose skipping, leads to procedures that are more likely to find the target dose and to learn about the dose-response relationship curve efficiently. Ethical concerns can be expressed through cautious priors or safety constraints. Dose escalation will be dominated by prior opinion at the beginning, but it will soon
be influenced more by the accumulating real data. Before a Bayesian design is implemented in a specific phase I trial, intensive simulations should be used to evaluate different prior settings, safety constraints, and dosing criteria under a range of scenarios. Investigators can then choose a satisfactory design based on the simulation properties that they
are interested in. For instance, they may look for a design that gives the smallest number of toxicities or the most accurate estimates. Until recently, such prospective assessment of design options was not possible, but now that the necessary computing tools are available, it would appear inexcusable not to explore any proposed design before it is implemented.
Table 4. Maximum likelihood estimates (MLE) and Bayesian modal estimates of the simulated data in Table 3 (with standard errors or standard deviations)

                                              θ1              θ2              σ2      τ2      ρ       df*
Truth for simulations (MLE from real data)    1.167 (0.158)   0.822 (0.046)   0.053   0.073   0.579   73.42
Final MLE (excluding the prior information)   1.568 (0.143)   0.741 (0.041)   0.090   0.000   0.000   69.45
Bayesian prior modal estimates                0.759 (0.967)   1.00 (0.192)    0.309   0.463   0.6     2.5
Bayesian posterior modal estimates            1.376 (0.156)   0.790 (0.042)   0.071   0.106   0.6     58.18
Figure 1. An illustration of a simulated dose escalation (using data from Table 3): the dose administered to each of the 24 subjects and the corresponding Cmax values, classified relative to the safety limit (bands at 0–0.2L, 0.2L–0.5L, and 0.5L–L, with the 20%, 50%, and 100% safety limits marked).
Although the methodology described here is presented only for a single pharmacokinetic outcome, the principles are easily generalized for multiple endpoints. Optimal doses can be found according to the safety limits for each of the endpoints, and then the lowest of these doses can be recommended. The Bayesian decision-theoretic approach has also been extended for application to an attention deficit disorder study
(13), where a pharmacodynamic response (heart rate change from baseline), a pharmacokinetic response (AUC), and a binary response (occurrence of any dose-limiting events) are modeled. A one-off simulation run indicates that the Bayesian approach controls unwanted events, dose-limiting events, and AUC levels exceeding the safety limit, while achieving more heart rate changes within the therapeutic range.
Bayesian methodology only provides scientific dose recommendations. These should be treated as additional information for guidance of the safety committee of a trial rather than dictating final doses to be administered. Statisticians and clinicians need to be familiar with this methodology. Once the trial starts, data need to be available quickly and presented unblinded with dose recommendations to the safety committee. They can then make a final decision, based on the formal recommendations, together with all of the additional safety, laboratory, and other data available.
REFERENCES

1. U.S. Food and Drug Administration. 1997. Guidance for Industry: General considerations for the clinical evaluation of drugs.
2. K. Gough, M. Hutchison, O. Keene, B. Byrom, S. Ellis, et al., Assessment of dose proportionality—Report from the Statisticians in the Pharmaceutical Industry/Pharmacokinetic UK Joint Working Party. Drug Inf J. 1995; 29: 1039–1048.
3. S. Patterson, S. Francis, M. Ireson, D. Webber, and J. Whitehead, A novel Bayesian decision procedure for early-phase dose finding studies. J Biopharm Stat. 1999; 9: 583–597.
4. J. Whitehead, Y. Zhou, S. Patterson, D. Webber, and S. Francis, Easy-to-implement Bayesian methods for dose-escalation studies in healthy volunteers. Biostatistics. 2001; 2: 47–61.
5. J. Whitehead, Y. Zhou, A. Mander, S. Ritchie, A. Sabin, and A. Wright, An evaluation of Bayesian designs for dose-escalation studies in healthy volunteers. Stat Med. 2006; 25: 433–445.
6. Y. Zhou, Dose-escalation methods for phase I healthy volunteer studies. In: S. Chevret (ed.), Statistical Methods for Dose-Finding Experiments. Chichester, UK: Wiley, 2006, pp. 189–204.
7. S. C. Chow and J. P. Liu, Design and Analysis of Bioavailability and Bioequivalence Studies. Amsterdam: Dekker, 1999.
8. W. J. Westlake, Bioavailability and bioequivalence of pharmaceutical formulations. In: K. E. Peace (ed.), Biopharmaceutical Statistics for Drug Development. New York: Dekker, 1988, pp. 329–352.
9. D. V. Lindley, Making Decisions. London: Wiley, 1971.
10. J. Q. Smith, Decision Analysis: A Bayesian Approach. London: Chapman & Hall, 1988.
11. J. O. Berger, Statistical Decision Theory and Bayesian Analysis. New York: Springer, 1985.
12. Y. Zhou and M. Lucini, Gaining acceptability for the Bayesian decision-theoretic approach in dose escalation studies. Pharm Stat. 2005; 4: 161–171.
13. L. Hampson, Bayesian Methods to Escalate Doses in a Phase I Clinical Trial. M.Sc. dissertation. School of Applied Statistics, University of Reading, Reading, United Kingdom, 2005.
CROSS-REFERENCES: Phase I trials; Pharmacodynamic study; Pharmacokinetic study; Crossover design; Analysis of variance (ANOVA); Placebo; Bayesian approach.
BENEFIT/RISK ASSESSMENT IN PREVENTION TRIALS
JOSEPH P. COSTANTINO
University of Pittsburgh, Pittsburgh, PA, USA

Benefit/risk assessment (B/RA) is a mathematical procedure to estimate the probability of detrimental outcomes, beneficial outcomes and the net effect anticipated from exposure to a given agent. B/RAs of health-related outcomes are used for public health planning, decision-making regarding health care financing and therapeutic decision-making in clinical practice (12,24). Information obtained from B/RAs based on findings from controlled clinical trials, particularly those with double-masking of treatment, is most informative for health care planning and decision-making because such information is less likely to be biased than is information obtained from observational studies (20,23,26). Thus, B/RAs in prevention trials are an excellent source of information to use as the basis for the types of health care planning and decision-making mentioned above. However, in a prevention trial a B/RA is primarily performed as a supplement for planning, monitoring and analyzing the trial. It is designed to provide a global assessment of all potential beneficial and harmful effects that may occur as a result of a treatment that is being evaluated as a means to reduce the incidence of some particular disease or condition. The Women's Health Initiative (WHI), the Breast Cancer Prevention Trial (BCPT) and the Study of Tamoxifen and Raloxifene (STAR) are examples of large-scale, multicenter prevention trials which included B/RA as part of the trial methodology (7,8,22,28,29). Compared with treatment trials, the need for the type of information provided by a B/RA may be greater in prevention trials. This situation exists because prevention trials usually involve healthy persons among whom only a small proportion may develop the disease of primary interest during the course of the trial (5–7,13,22,29,30). As such, all participants are subjected to the risks of therapy during the course of the trial, but relatively few will receive a preventive benefit from the therapy. In this setting, the use of B/RAs provides an additional mechanism to ensure that all participants comprehend the full extent of potential benefits and risks, and that they make a well-informed decision about the interchange of benefits and risks they are willing to accept by participating in the trial. The use of B/RA in prevention trials also provides a method to evaluate the global effect of the therapy as a safeguard against subjecting trial participants to an unforeseen harmful net effect of treatment. Once the results of the trial are known and the true levels of benefits and risks of the therapy have been established, the individualized B/RA employed in the trial can become the basis for the development of a B/RA methodology that could be used in the clinical setting to facilitate the decision-making process for individuals and their health care providers who may be considering the use of preventive therapy. The trial results can also be used to develop a population-based B/RA to identify changes in patient loads for the outcomes affected by the preventive therapy that would be anticipated as health care professionals incorporate the use of the preventive therapy into their clinical practice. This information could in turn be used for decision-making regarding the planning for and use of health care resources.
1 TYPES OF B/RAs PERFORMED IN PREVENTION TRIALS In a prevention trial a B/RA can take one of three forms which can be classified according to the nature of the population that constitutes the basis for the assessment. These include assessments based on the general population, those based on the trial cohort and those based on an individual trial participant. Each of these forms of B/RA is
performed for specific purposes, namely to support various aspects of the conduct of the trial. A B/RA based on the general population is often performed pre-trial as part of the justification for initiating the trial. The purpose of this form of assessment is to demonstrate the potential net health benefit to society that could be obtained if the therapy being evaluated in the trial actually exhibits the efficacy that is anticipated. This type of assessment is the most generalized form. It is usually accomplished by estimating effects on a national basis assuming the therapy is administered to all susceptible individuals or to a subset of high-risk individuals and demonstrating that there is a significant net benefit when comparing the number of cases prevented with the estimates for the number of additional cases of detrimental outcomes that may be caused as a side-effect of the therapy. A B/RA based on the trial cohort is performed during the course of trial implementation as part of the safety monitoring effort. It can be accomplished in a regimented fashion as part of the formal plan for the interim monitoring of the trial or as an informal tool used by the data monitoring committee to assess the overall safety of the therapy being evaluated. This type of assessment is not usually necessary during a trial if the anticipated effects from the therapy involve only a few outcomes or if the anticipated beneficial effects substantially outweigh the anticipated detrimental effects. However, in complex situations where the anticipated outcomes affected by the therapy involve multiple diseases or conditions and/or the magnitude of the anticipated net benefit may not be large, a B/RA based on the trial cohort can be a very useful supplement for trial surveillance as a method of monitoring the global effect of all beneficial and detrimental outcomes combined. A notable difference between a B/RA based on the general population and one based on the study cohort is in the nature of the measures that are provided by these two forms of assessment. A risk assessment based on a general population provides a measure of the theoretical net effect of the therapy from estimates of
anticipated beneficial and detrimental outcomes. In contrast, a risk assessment based on the trial cohort determines the observed net effect of therapy based on outcomes actually experienced by the cohort during the course of the trial. A B/RA based on an individual trial participant is similar to that of the population-based assessment in that it is also a theoretical estimate. In this case the assessment is not made for the general population, but instead for a specific subpopulation of persons who have the same risk factor profile (age, sex, race, medical history, family history, etc.) for the anticipated beneficial and detrimental outcomes as that of a particular individual participating in the trial. Information from this type of assessment is used to facilitate the communication to each potential trial participant of the nature of the benefits and risks that are anticipated for them as a result of taking therapy during trial participation. This type of individualized B/RA is used in prevention trials when the nature of anticipated effects is complex and benefit/risk communication is a more difficult task due to the interplay of multiple beneficial and detrimental outcomes. When it is used in this manner, it becomes an integral part of the process to obtain informed consent for each individual's participation in the trial.

2 ALTERNATIVE STRUCTURES OF THE BENEFIT/RISK ALGORITHM USED IN PREVENTION TRIALS

The core components of a B/RA are the measures of the treatment effect for each of the health outcomes that may be affected by the therapy being assessed. In this instance the treatment effect is defined as the difference between the probability that the outcome will occur among individuals who do not receive the therapy being evaluated (p0) and the probability that the outcome will occur among those who do receive the therapy (p1). For outcomes beneficially affected by therapy, the treatment effect (p0 − p1) will have a positive sign, representing cases prevented by therapy. For outcomes detrimentally affected by therapy, the treatment effect will have a
negative sign, representing cases caused by therapy. In its simplest structure, the benefit/risk analysis is summarized by an index of net effect (Δ) as the summation of treatment effects for all outcomes affected. If there are I outcomes affected by therapy, then the basic algorithm for the B/RA is defined as:

$$\Delta_1 = \sum_{i=1}^{I} (p_{0,i} - p_{1,i}). \qquad (1)$$
When the sign of the index of net effect is positive, the therapy exhibits an overall beneficial health effect. When the sign is negative, the therapy has an overall detrimental effect. When dealing with a B/RA based on the trial cohort, the probabilities of (1) are obtained directly from the observations in the trial. When dealing with assessments based on the general population or the individual trial participant, the probabilities utilized as anticipated values among those who do not receive therapy (p0 ) are usually taken from some type of national database or from prospective studies of large populations that included measurements of the outcomes of interest. The probabilities used in these latter types of assessments as anticipated values among those who receive therapy are determined by multiplying the anticipated probability among those not treated by the relative risk (untreated to treated) anticipated as the treatment effect. For example, if we anticipate that treatment will reduce the incidence of a particular outcome by 35% then the anticipated relative risk would be 0.65 and the value used for p1 would be 0.65 p0 . If we anticipate that treatment will increase the incidence of an outcome by 30%, then the anticipated relative risk would be 1.30 and the value used for p1 would be 1.30 p0 . Estimates of the anticipated treatment effects for each outcome are taken from the literature dealing with pharmacokinetics, animal studies and studies in humans undertaken as preliminary investigations of the therapy as an agent to prevent the disease, or from human studies in which the therapy was being used as an agent for the treatment of disease.
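As a minimal illustration of how equation (1) is applied with anticipated relative risks, consider the following Python sketch; the outcome labels, baseline probabilities, and relative risks are hypothetical values, not estimates from any particular trial.

```python
# Hypothetical anticipated probabilities without therapy (p0) and anticipated
# relative risks (RR) for three outcomes affected by the therapy.
p0 = {"outcome 1": 0.020, "outcome 2": 0.010, "outcome 3": 0.004}
rr = {"outcome 1": 0.65,  "outcome 2": 1.30,  "outcome 3": 1.10}

# Anticipated probabilities among the treated: p1 = RR * p0.
p1 = {k: rr[k] * p0[k] for k in p0}

# Equation (1): index of net effect as the sum of treatment effects (p0 - p1).
# A positive sign indicates an overall beneficial effect of therapy.
delta_1 = sum(p0[k] - p1[k] for k in p0)
print(f"Index of net effect: {delta_1:+.4f}")
```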
In the prevention trial setting it is often advantageous to utilize structures of the benefit/risk algorithm other than that defined in (1). Since a B/RA based on the trial cohort is meant to be performed as part of the effort to monitor safety during the trial, an alternative structure of the benefit/risk algorithm can be used to facilitate this effort. This structure incorporates a standardization of the differences between the probabilities among those receiving and not receiving the therapy being evaluated. In this situation the index of net effect is defined as:

$$\Delta_2 = \frac{\sum_{i=1}^{I} (p_{0,i} - p_{1,i})}{\sum_{i=1}^{I} \mathrm{s.e.}(p_{0,i} - p_{1,i})}. \qquad (2)$$
In this structure, the index of net effect (Δ2) becomes a standardized value with an N(0,1) distribution. As such, the standardized values are Z-scores. Critical values of this index of net effect in the form of Z and −Z can then be used as cut-points for global monitoring indicating that there is a significant net effect that is beneficial or detrimental, respectively. In addition to that for the standardized score, there are other structures of the algorithm used in the prevention trial setting. Instead of expressing the differences between those treated and not treated in terms of the probabilities of the anticipated outcomes, an alternative structure of the algorithm is that based on differences between treatment groups in terms of the number of cases of the outcomes. The structure of the algorithm based on the difference in terms of the number of cases is defined as:

$$\Delta_3 = \sum_{i=1}^{I} (n_{0,i} - n_{1,i}), \qquad (3)$$
where n0 is the number of cases occurring among those who do not receive the therapy being evaluated and n1 is the number of cases among those who do receive the therapy. This structure of the algorithm is that which is utilized to perform B/RAs based on the general population. This type of assessment is meant
to justify the need for a trial by demonstrating the potential health benefit to society. The net effect to society is more effectively communicated to a greater proportion of individuals when it is expressed as the number of cases prevented from (3) than when it is expressed as the probability from (1). This facilitation of risk communication is also the reason that (3) is preferred over (1) for B/RAs based on individual trial participants where the specific goal of the assessment is to enhance the individual's comprehension of benefits and risks that may be experienced as a result of trial participation. For a population-based assessment, the numbers of cases in (3) are determined by multiplying the anticipated probabilities p0,i and p1,i by the number of persons in the general population, frequently that of the total US, to obtain an estimate of the number of cases that may be prevented or caused by treatment for each outcome on an annual basis. For an individual participant-based assessment, the numbers of cases in (3) are determined by multiplying the anticipated probabilities by a fixed sample size (N) of theoretical individuals who all have a risk factor profile similar to that of the individual being assessed. A fixed period of follow-up time (t) is assumed to obtain the number of cases prevented or caused by treatment in t years among N individuals. In scenarios where the length of follow-up is long and/or the population is of older age, the estimation of n0,i and n1,i should incorporate the competing risk of mortality that would be anticipated. If di is the probability of dying and RR is the relative risk anticipated for the outcome of interest, the adjusted expected number of cases among those not treated can be calculated as:

$$n_{0,i} = N\,\frac{p_{0,i}}{p_{0,i} + d_i}\left[1 - \exp\{-t(p_{0,i} + d_i)\}\right], \qquad (4)$$

and the adjusted expected number of cases among those treated can be calculated as:

$$n_{1,i} = N\,\frac{RR\,p_{0,i}}{RR\,p_{0,i} + d_i}\left[1 - \exp\{-t(RR\,p_{0,i} + d_i)\}\right]. \qquad (5)$$
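The following Python sketch evaluates equations (4) and (5) for a single outcome; the values of p0, RR, d, N, and t are illustrative assumptions only.

```python
import math

def adjusted_expected_cases(p0, rr, d, N, t):
    """Expected numbers of cases among N untreated and N treated individuals
    over t years, adjusted for the competing risk of death (equations 4, 5)."""
    # Equation (4): untreated group.
    n0 = N * (p0 / (p0 + d)) * (1.0 - math.exp(-t * (p0 + d)))
    # Equation (5): treated group, with the outcome probability scaled by RR.
    n1 = N * (rr * p0 / (rr * p0 + d)) * (1.0 - math.exp(-t * (rr * p0 + d)))
    return n0, n1

# Hypothetical example: 10,000 individuals followed for 5 years.
n0, n1 = adjusted_expected_cases(p0=0.006, rr=0.65, d=0.01, N=10_000, t=5)
print(f"{n0:.0f} cases expected untreated, {n1:.0f} treated "
      f"({n0 - n1:.0f} cases prevented)")
```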
In most prevention trials the outcomes that are potentially affected by the therapy
being evaluated encompass a wide range of severity. A simple adding together of the risks of beneficial and detrimental effects without including a consideration of the relative severity of the outcomes may not be appropriate or desirable. For example, suppose a therapy is anticipated to prevent breast cancer and hip fractures, but may cause an increase in uterine cancer and cataracts. Is it appropriate to equate one case of breast cancer prevented to one case of cataracts caused or equate one case of hip fracture prevented to one case of uterine cancer caused? In situations where it is important to include a consideration of the relative severity of the outcomes affected by the therapy, the equations described above for determining the index of net effect can be modified to incorporate a weighting of the outcomes. If wi is used to represent the weight for each of the I outcomes, then the modification to (3) to incorporate weighting of the outcomes is:

$$\Delta_4 = \sum_{i=1}^{I} w_i (n_{0,i} - n_{1,i}). \qquad (6)$$
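A short sketch of the case-based indices in equations (3) and (6), using hypothetical case counts and illustrative severity weights:

```python
# Hypothetical numbers of cases by outcome in the untreated (n0) and treated
# (n1) groups, and illustrative severity weights w (chosen for the example).
n0 = {"outcome 1": 60, "outcome 2": 20, "outcome 3": 12, "outcome 4": 15}
n1 = {"outcome 1": 35, "outcome 2": 12, "outcome 3": 25, "outcome 4": 21}
w  = {"outcome 1": 1.0, "outcome 2": 1.0, "outcome 3": 0.5, "outcome 4": 1.0}

delta_3 = sum(n0[k] - n1[k] for k in n0)                 # equation (3)
delta_4 = sum(w[k] * (n0[k] - n1[k]) for k in n0)        # equation (6)
print(f"Unweighted net cases prevented: {delta_3}")
print(f"Severity-weighted net effect:  {delta_4}")
```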
Equations (1) and (2) can be modified in a similar fashion by including wi as a multiplier of the quantity of difference in the probabilities.

3 METHODOLOGICAL AND PRACTICAL ISSUES WITH B/RA IN PREVENTION TRIALS

There are several issues to be faced when performing a B/RA in a prevention trial. These issues concern the variability of the index of net effect, weighting the outcomes by severity, estimating the index of net effect for individuals with specific profiles of risk factors and communicating the findings of a B/RA to individual participants. Some discussion of each of these issues is presented below. The estimates of p0,i and p1,i used in a B/RA have a variability associated with them in terms of the strength of evidence supporting the treatment effect and in terms of the precision of the treatment effect. If this variability is substantial, then it may be necessary to incorporate consideration of the variability into the B/RA. Freedman et al. (8,25) have described a Bayesian approach
to incorporating a measure of variability into the index of net effect when it is measured in the form of weighted, standardized probabilities. They assume a skeptical prior distribution based on the strength of the preliminary evidence used as the anticipated treatment effect for each outcome potentially affected by therapy. Gail et al. (10) have described a method to incorporate a measure of variability into the estimate of the index of net effect measured in the form of a weighted number of cases. Their method involves bootstrapping, based on the 95% confidence intervals of the anticipated relative risk associated with treatment for each outcome, to determine the probability that the net number of cases is greater than zero. The values used for weighting the differences between those treated and not treated can be based on a utility function related to the severity of the outcome, preferences in terms of levels of risk acceptability or other considerations. However, the best choice of a utility function is not always obvious. A measure of mortality such as the case-fatality ratio is one possible utility. If this type of weighting is used, then the choice of the one-year, five-year or ten-year case-fatality ratios would be an issue because the relative weighting of the outcomes could likely be very different depending on which time period for case-fatality is used. Also, weights based on case-fatality would eliminate the consideration of any nonfatal outcome, which would not be preferable if there were several nonfatal outcomes of interest or if a nonfatal outcome has a significant impact on morbidity. Issues also arise with the use of rankings based on the impact on quality of life or preferences regarding the acceptability of risk (1,11). The issues with these utilities arise because the rankings are often subjective in nature, based on the opinions of a relatively small panel of individuals, and it is possible that the rankings of outcomes could differ substantially depending on the population from whom the opinions are ascertained (2,15,16). In light of these issues, attempting to identify a basis for weighting a B/RA is a practical problem that can be difficult to resolve. The preferred choice for any particular trial could differ from one group of individuals to another. As such, if a B/RA is planned as part of
trial monitoring, it is essential that the data monitoring committee reviews and reaches a consensus regarding the proposed weighting before it initiates review of the outcome data. To accomplish the individualization desired for B/RAs based on individual trial participants, it is necessary to provide estimates of effect specific to the individual's full spectrum of risk factors for each of the outcomes expected to be affected by the therapy of interest. A problem likely to be faced when performing individualized assessments is the unavailability of probability estimates specific to the individual's full set of risk factors. For outcomes with several relevant risk factors to be considered or for outcomes that have not been studied in diverse populations, estimates of the outcome probabilities for a specific category of risk factor profiles may not exist. In some cases, multivariate regression models are available that can be used to predict probabilities of outcomes for specific risk factor profiles from data based on the general population. Examples of such models include those for heart disease, stroke and for breast cancer (4,9,14,17–19). However, the models currently available are primarily limited to those for the more common diseases and are not generally applicable to all race and sex populations. Also, relatively few of these models have been well validated. Thus, in practice it is often necessary to use estimates of outcome probabilities for individualized B/RAs that are taken from populations that are more representative of the general population than of the population specific to the risk factor profile of the individual being assessed. When this is the case, the limitations of the methodology need to be recognized and the results interpreted in this light. Nonetheless, a B/RA that has been individualized to the extent possible is more informative to a trial participant than one based on the general population. Additional discussions of the limitations of individualized B/RAs can be found in presentations concerning individualized B/RAs for the use of tamoxifen to reduce breast cancer risk (3,4,27). Communicating the results of a B/RA to an individual is a skilled task. An effort must be made to provide information in a manner that facilitates the individual's comprehension (21). Tools are needed to facilitate this
Table 1. Example of Data Presentation Tool for Communicating the Benefits and Risks of Tamoxifen Therapy

The information below provides the number of certain events that would be expected during the next five years among 10 000 untreated women of your age (ageX), race (raceY) and five-year breast cancer risk (riskZ). To help you understand the potential benefits and risks of treatment, these numbers can be compared with the numbers of expected cases that would be prevented or caused by five years of tamoxifen use.

Severity of event | Type of event | Expected number of cases among 10 000 untreated women | Expected effect among 10 000 women if they all take tamoxifen for five years
Life-threatening events | Invasive breast cancer | N0,1 cases expected | Potential benefit: N1,1 of these cases may be prevented
Life-threatening events | Hip fracture | N0,2 cases expected | Potential benefit: N1,2 of these cases may be prevented
Life-threatening events | Endometrial cancer | N0,3 cases expected | Potential risk: N1,3 more cases may be caused
Life-threatening events | Stroke | N0,4 cases expected | Potential risk: N1,4 more cases may be caused
Life-threatening events | Pulmonary embolism | N0,5 cases expected | Potential risk: N1,5 more cases may be caused
Other severe events | In situ breast cancer | N0,6 cases expected | Potential benefit: N1,6 of these cases may be prevented
Other severe events | Deep vein thrombosis | N0,7 cases expected | Potential risk: N1,7 more cases may be caused
Other events | Potential benefits: Tamoxifen use may reduce the risk of a certain type of wrist fracture called Colles' fracture by about 39%, and also reduce the risk from fractures of the spine by about 26%. Potential risk: Tamoxifen use may increase the occurrence of cataracts by about 14%.
effort. These tools must be developed before the initiation of the trial and included as part of the protocol approved by the Institutional Review Board. Relatively little work has been done in the area of developing tools for communicating the benefits and risks of participation in a prevention trial. However, some tools have been developed that serve as examples for future development. Tools to enhance the communication of B/RA information to women screened for participation were developed for use in the BCPT (7,22). Since the conclusion of this trial, the tools were refined for use in the STAR trial (28). Table 1 provides an example of the type of tool used in the STAR trial to inform potential participants regarding their individualized B/RA. This tool was developed based on the principles put forth by the participants of the National Cancer Institute’s workshop convened to develop information to assist in counseling women about the benefits and risks of tamoxifen when used to reduce
the risk of breast cancer. This workshop and the specific methodology used for the B/RA are described by Gail et al. (10). There were several key working premises that guided the development of the STAR trial tool displayed in Table 1. The premises were considerations of form and format to facilitate the participant’s comprehension of their individualized B/RA. These working premises were to: (1) avoid the use of probabilities and relative risk as these concepts are not readily understood by the nonstatistician; (2) provide information for each outcome anticipated to be affected by therapy; (3) group the information presented by severity of the outcomes; (4) provide detailed information for the outcomes with more severe consequences and provide an estimate of effects among those not treated so the individual can understand the context in which to place the expected treatment effects; and (5) limit the tool to one page of data presentation to reduce the amount of data overload perceived by the individual.
The precise considerations involved in any prevention trial may differ; however, working premises of this nature designed to enhance comprehension should always be employed when developing tools to communicate B/RA information to potential trial participants. REFERENCES 1. Bennett, K. J. & Torrance, G. W. (1996). Measuring health state preferences and utilities: ratings scale, time trade-offs and standard gamble techniques, in Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd Ed., B. Spilker, ed. Lippincott-Raven, Philadelphia, pp. 253–265. 2. Boyd, N. F., Sutherland, H. J., Heasman, K. Z., TritchlerD. L. & Cummings B. J. (1990). Whose utilities for decision analysis?, Medical Decision Making 1, 58–67. 3. Costantino, J. P. (1999). Evaluating women for breast cancer risk-reduction therapy, in ASCO Fall Education Book. American Society of Clinical Oncology, pp. 208–214. 4. Costantino, J. P., Gail, M. H., Pee, D., Anderson, S., Redmond, C. K. & Benichou, J. (1999). Validation studies for models to project the risk of invasive and total breast cancer incidence, Journal of the National Cancer Institute 91, 1541–1548. 5. Cummings, S. R., Echert, S., Krueger, K. A., Grady, D., Powles, T. J., Cauley, J. A., Norton, L., Nickelsen, T., Bjarnason, N. H., Morrow M., Lippman M. E., Black, D., Glusman, J. E. & Jordan, V. C. (1999). The effect of raloxifene on risk of breast cancer in postmenopausal women: results from the MORE randomized trial, Journal of the American Medical Association 281, 2189–2197. 6. Ettinger, B., Black, D. M., Mitlak B. H., Knickerbocker, R. K., Nickelsen, T., Genant, H. K., Christiansen, C., Delmas, P. D., Zanchetta, J. R., Stakkestad, J., Gluer, C. C., Krueger, K., Cohen, F. J., Eckert, S., Ensrud, K. E., Avioli, L. V., Lips, P. & Cummings, S. R. (1999). Reduction of vertebral fracture risk in postmenopausal women with osteoporosis treated with raloxifene: results from a 3-year randomized clinical trial, Journal of the American Medical Association 282, 637–645. 7. Fisher, B., Costantino, J. P., Wickerham, D. L., Redmond, C. K., Kavanah, M., Cronin, W. M., Vogel, V., Robidoux, A., Dimitrov, N., Atkins, J., Daly, M., Wieand, S., Tan-Chiu, E., Ford, L. & Wolmark, N. (1998). Tamoxifen
for prevention of breast cancer: report of the National Surgical Adjuvant Breast and Bowel Project P-1 study, Journal of the National Cancer Institute 90, 1371–1388. 8. Freedman, L., Anderson, G., Kipnis, V., Prentice, R., Wang, C. Y., Rousouw, J., Wittes, J. & DeMets, D. (1996). Approaches to monitoring the results of long-term disease prevention trials: examples from the Women’s Health Initiative, Controlled Clinical Trials 17, 509–525. 9. Gail, M. H., Brinton, L. A., Byar, D. P., Corle, D. K., Green, S. B., Schairer, C. & Mulvihill, J. J. (1989). Projecting individualized probabilities of developing breast cancer for white females who are being examined annually, Journal of the National Cancer Institute 81, 1879–1886. 10. Gail, M. H., Costantino, J. P., Bryant, J., Croyle, R., Freedman, L., Helzlsouer, K. & Vogel V. (1999). Weighing the risks and benefits of tamoxifen for preventing breast cancer, Journal of the National Cancer Institute 91, 1829–1846. 11. Guyatt, G., Feeny, D. & Patrick, D. (1993). Measuring health-related quality of life, Annuals of Internal Medicine 118, 622–629. 12. Haynes, R. B., Sackett, D. L., Gray, J. A. M., Cook, D. J. & Guyatt, G. H. (1996). Transferring evidence from research to practice: 1. The role of clinical care research evidence in clinical decisions, APC Journal Club 125, A14–A15. 13. Hulley, S., Grady, D., Bush, T., Furberg, C., Herrington, D., Riggs, B. & Vittinghoff, E. (1998). Randomized trial of estrogen plus progestin for secondary prevention of coronary heart disease in postmenopausal women. Heart and Estrogen/Progestin Replacement Study (HER) Research Group, Journal of the American Medical Association 280, 605–613. 14. Liao, Y., McGee, D. L., Cooper, R. S. & Sutkowski, M. B. (1999). How generalizable are coronary risk prediction models? Comparison of Framingham and two other national cohorts, American Heart Journal 137, 837–845. 15. Llewellyn-Thomas, H. A. (1995). Patients’ health care decision making: a framework for descriptive and experimental investigations, Medical Decision Making 15, 101–106. 16. Llewellyn-Thomas, H. A., Naylor, C. D., Cohen, M. N., Baskiniski, A. S., Ferris, L. E. & Williams, J. E. (1992). Studying patients’ preferences in health care decision making,
Canadian Medical Association Journal 147, 859–864.
17. Lloyd-Jones, D. M., Larson, M. G., Beiser, A. & Levy, D. (1999). Lifetime risk of developing coronary heart disease, Lancet 353, 89–92.
18. Manolio, T. A., Kronmal, R. A., Burke, G. L., O'Leary, D. H. & Price, T. R. (1996). Short-term predictors of incidence stroke in older adults. The Cardiovascular Health Study, Stroke 27, 1479–1486.
19. Menotti, A., Jacobs, D. R., Blackburn, H., Krombout, D., Nissinen, A., Nedeljkovic, S., Buzina, R., Mohacek, I., Seccareccia, F., Giampaoli, S., Dontas, A., Aravanis, C. & Toshima, H. (1996). Twenty-five year prediction of stroke deaths in the seven countries study: the role of blood pressure and its changes, Stroke 27, 381–387.
20. Pocock, S. J. & Elbourne, D. R. (2000). Randomized trials or observational tribulations?, The New England Journal of Medicine 342, 1907–1909.
21. Redelmeier, D. A., Rozin, P. & Kahneman, D. (1993). Understanding patients' decision–cognitive and emotional perspectives, Journal of the American Medical Association 270, 72–76.
22. Redmond, C. K. & Costantino, J. P. (1996). Design and current status of the NSABP Breast Cancer Prevention Trial, Recent Results in Cancer Research 140, 309–317.
23. Sackett, D. L. (1979). Bias in analytical research, Journal of Chronic Diseases 32, 51–63.
24. Simon, G., Wagner, E. & VonKorff, M. (1995). Cost-effectiveness comparisons using "real world" randomized trials: the case of the new antidepressant drugs, Journal of Clinical Epidemiology 48, 363–373.
25. Spiegelhalter, D. J., Freedman, L. & Parmar, M. K. B. (1994). Bayesian approaches to randomized trials, Journal of the Royal Statistical Society, Series A 157, 357–416.
26. Steineck, G. & Ahlbom, A. (1992). A definition of bias founded on the concept of the study base, Epidemiology 3, 477–482.
27. Taylor, A. L., Adams-Campbell, L. & Wright, J. T. (1999). Risk/benefit assessment of tamoxifen to prevent breast cancer—still a work in progress, Journal of the National Cancer Institute 91, 1792–1793.
28. Wolmark, N., Wickerham, D. L., Costantino, J. P. & Cronin, W. (1999). NSABP Protocol P-2: Study of Tamoxifen and Raloxifene (STAR) for the Prevention of Breast Cancer. National Surgical Adjuvant Breast and Bowel Project, Pittsburgh, Pennsylvania.
29. Women's Health Initiative Study Group (1998). Design of the Women's Health Initiative clinical trial and observational study, Controlled Clinical Trials 19, 61–109.
30. Writing Group for the PEPI Trial (1995). Effects of estrogen/progestin regimens on heart disease risk factors in postmenopausal women: the Post-menopausal Estrogen/Progestin Intervention (PEPI) Trial, Journal of the American Medical Association 273, 199–208.
BIASED COIN RANDOMIZATION
MIKE D. SMITH
Clinical Statistics, Pfizer Global Research & Development, New London, Connecticut

1 RANDOMIZATION STRATEGIES FOR OVERALL TREATMENT BALANCE

In a "simple" or "complete" randomization (CR) scheme, subjects are allocated to treatment groups based on a fixed probability without any regard to previous allocation of subjects or level of current imbalance. In the case of equal randomization to two groups, this procedure is equivalent to tossing a coin. Because it is based on the inability to predict or guess the treatment for future patients, this procedure is free from selection bias (1), which is the bias introduced into a trial from awareness of a patient's treatment allocation. The CR procedure does not guarantee overall treatment balance at the end of the trial, and there is a potential during the trial for a long run of one treatment versus another. For these reasons, the CR procedure is rarely used. Avoiding imbalance at specific time points during the trial is particularly important in the implementation of adaptive designs based on interim analysis of data. An early alternative to CR that addressed the issue of potential treatment imbalance is the nonrandom systematic design (e.g., ABBAABBAA), which is not discussed here as it contains no element of randomization. An alternative randomization procedure that aims to limit imbalance throughout the trial is "permuted block randomization" (PBR) (2). For a fixed number of treatment groups, fixed-sized blocks are generated that contain all possible treatment permutations. The full randomization list is then generated by randomly selecting from this full set of treatment blocks until the required sample size is achieved. Under PBR, the magnitude of treatment imbalance and the maximum number of consecutive subjects receiving the same treatment are both limited. In the case of two treatment groups, with a block size of 2b the maximum treatment imbalance is b, and the maximum number of consecutive subjects with the same treatment is 2b. This method is most commonly used in clinical trial practice. However, it may be possible to predict or guess future treatments for subjects in the later part of each block with knowledge of the treatment for patients earlier in the block. For example, with two treatment groups and a block size of 4, an initial allocation of two patients to treatment A will mean that the next two subjects will receive treatment B. As such, the treatment allocation is deterministic for patients toward the end of blocks. In blinded trials, the block size is usually not disclosed because of the potential for unblinding and bias. However, even this strategy may not afford total protection from selection bias. If study staff are able to make a good guess at the previous treatment—based on frequent occurrence of known treatment-related adverse events or strong efficacy results (sometimes called functional unblinding)—then selection bias may still occur in PBR. The aim of this article is to describe a class of randomization procedures called biased coin randomization procedures, which are designed to eliminate or substantially reduce the problems of the CR and PBR procedures by
• minimizing the size of any treatment imbalance in a trial (and within each stratum) and reducing the chance of long runs of one treatment group.
• removing any deterministic allocation of patients to treatment and so eliminating selection bias.

The randomization procedures discussed here, including biased coin randomization and others, are described in detail in Rosenberger and Lachin (3), and summarized in Chow and Chang (4).

2 THE BIASED COIN RANDOMIZATION PROCEDURE

Efron (5) first described a biased coin randomization procedure for two treatment groups
(A and B) where equal treatment allocation over a whole study is required. We can consider this random allocation as tossing a coin. If the current treatment allocation is balanced, then the coin used will be a fair unbiased coin. However, if there is any imbalance in the current treatment allocation, then the coin will have a fixed bias favoring the treatment with the smaller number of subjects. Thus, the bias is independent of the size of treatment imbalance. Let nA and nB be the current number of subjects assigned to treatment A and B, respectively. The probability of the next subject being assigned to treatment A in the trial is

$$\text{Probability (treatment A)} = \begin{cases} 1 - p & n_A > n_B \\ 0.5 & n_A = n_B \\ p & n_A < n_B, \end{cases}$$

where p represents the bias in the randomization "coin" (1 ≥ p ≥ 0.5). This procedure is labeled BCD(p). Complete randomization is equivalent to using BCD(0.5). A block randomization using a block size of 2 is equivalent to using BCD(1), which is not usually considered as it has deterministic allocation for each alternate subject with the associated potential for selection bias. Efron (5) favored the value BCD(2/3) to provide a clear bias for the coin while maintaining an adequate random element. Pocock (6) suggested that BCD(3/4) is too predictable in the presence of imbalance; BCD(2/3) is appropriate for a relatively small trial; and BCD(3/5) is adequate for a larger trial with at least 100 patients. Operationally, the BCD(p) procedure uses three distinct fixed-randomization schedules, one for each value of the probability of being assigned to treatment A. For an investigator-blinded trial or a trial with multiple investigator sites, the three fixed randomization lists would be
administered centrally and not at individual sites. Within a stratified randomization (with one stratification factor), it would be proposed that each stratum be treated as a separate trial. Therefore, an attempt is made to achieve treatment balance for each stratum and thereby over the whole trial. A discussion of a more complex stratification procedure is described later.
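As an illustration of the BCD(p) allocation rule just described, the following Python sketch simulates assignments for a two-arm study; the sample size and the choice p = 2/3 are arbitrary values for the example.

```python
import random

def bcd_allocate(n_a, n_b, p=2/3):
    """Efron's biased coin: return 'A' or 'B' for the next subject,
    given the numbers already assigned to each arm."""
    if n_a > n_b:
        prob_a = 1 - p          # favor the under-represented arm B
    elif n_a < n_b:
        prob_a = p              # favor the under-represented arm A
    else:
        prob_a = 0.5            # balanced: toss a fair coin
    return "A" if random.random() < prob_a else "B"

def simulate_trial(n_subjects=100, p=2/3, seed=1):
    random.seed(seed)
    n_a = n_b = 0
    for _ in range(n_subjects):
        if bcd_allocate(n_a, n_b, p) == "A":
            n_a += 1
        else:
            n_b += 1
    return n_a, n_b

print(simulate_trial())   # final arm sizes, typically close to 50/50
```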
3 PROPERTIES The BCD(p) procedure produces a Markov Chain random walk process for the difference DN = nA − nB . Under the assumption that both positive and negative imbalances are equally important to avoid, then the properties of the BCD(p) procedure can be evaluated more simply by considering |DN |. The transition matrix for |DN | is formed by considering the probability of moving from the current state to all possible states for the next patient. If balance is achieved at any stage in the study, the probability of an imbalance of 1 following the next subject is 1. If the study is imbalanced at any stage in the study, the probability of greater imbalance will be 1 − p and the probability of lesser imbalance will be p. The limiting probabilities (N = nA + nB → ∞) of imbalance of size j can be calculated from the transition matrix for |DN | (3). These are shown in Table 1 for even and odd N. The likelihood of perfect balance (for even N) or an imbalance of 1 (for odd N) tends to 1 as r → ∞ (p → 1); however, the deterministic nature of treatment allocation will increase, and so increase the potential for selection bias. An illustration of the limiting probabilities of imbalance for BCD(0.60), BCD(0.75) and BCD(0.90) is given in Figure 1A and B.
Table 1. Probability of imbalance of size j for the BCD(p) in terms of r = p/(1 − p)

Even N: $1 - 1/r$ for $j = 0$; $(r^2 - 1)/r^{j+1}$ for even $j$, $j \geq 2$.
Odd N: $(r^2 - 1)/r^{j+1}$ for odd $j$, $j \geq 1$.
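The limiting probabilities in Table 1 are straightforward to evaluate; the sketch below tabulates them for an illustrative value p = 0.75 (so r = 3), truncating the display at j = 10.

```python
# Limiting probabilities of imbalance |D_N| = j for BCD(p), from Table 1,
# expressed in terms of r = p / (1 - p).  Illustrative choice: p = 0.75.
p = 0.75
r = p / (1 - p)

even_n = {0: 1 - 1 / r}
even_n.update({j: (r**2 - 1) / r**(j + 1) for j in range(2, 11, 2)})
odd_n = {j: (r**2 - 1) / r**(j + 1) for j in range(1, 11, 2)}

print("even N:", {j: round(q, 4) for j, q in even_n.items()})
print("odd N: ", {j: round(q, 4) for j, q in odd_n.items()})
```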
Figure 1. Probability of imbalance for (A) even and (B) odd total number of subjects using biased coin design with p = 0.60 (blue), 0.75 (red), 0.90 (green).
Figure 2. Difference in number of subjects by additional recruited subjects (A) BCD(0.60), (B) BCD(0.75), and (C) BCD(0.90) with starting imbalance of +20 (red line), 0 (blue line), –20 (green line).
An example of this procedure would be an ongoing clinical trial where a further 200 subjects will be randomized. Let the current level of imbalance in this example be Dn = +20, 0, and −20 shown as the red, blue, and green curves, respectively, in Figure 2A–C. We can
see the difference DN = nA − nB for the additional 200 randomized subjects using the BCD(0.60), BCD(0.75), and BCD(0.90) procedures (as A, B, and C, respectively). The plots indicate that as p increases we achieve faster convergence to Dn = 0, given initial
imbalance and a lower level of imbalance around Dn = 0. Many investigators use the expected number of correct guesses of future treatment assignment as a surrogate measure of potential selection bias (5). Wei (7) showed that the number of correct guesses of the next treatment assignment is much lower for BCD than for PBR. The CR procedure has the minimum expected number of correct guesses, and so (as expected) it minimizes the potential selection bias. The number of correct guesses is greater for BCD compared with CR; however, this difference decreases as studies get larger.
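The sketch below gives a Monte Carlo version of this comparison, under the assumed guessing strategy of always predicting the arm with fewer subjects so far (guessing at random when the arms are balanced); the trial sizes and p = 2/3 are illustrative choices.

```python
import random

def expected_correct_guesses(n_subjects, p, n_sims=2000, seed=7):
    """Monte Carlo estimate of the expected proportion of correct guesses when
    the guesser always predicts the currently under-represented arm (guessing
    at random when the arms are balanced).  p = 0.5 corresponds to CR."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        n_a = n_b = correct = 0
        for _ in range(n_subjects):
            guess = "A" if n_a < n_b else "B" if n_b < n_a else rng.choice("AB")
            if n_a > n_b:
                prob_a = 1 - p
            elif n_a < n_b:
                prob_a = p
            else:
                prob_a = 0.5
            actual = "A" if rng.random() < prob_a else "B"
            correct += (guess == actual)
            n_a += (actual == "A")
            n_b += (actual == "B")
        total += correct / n_subjects
    return total / n_sims

for n in (20, 100):
    print(n,
          round(expected_correct_guesses(n, 0.5), 3),    # CR
          round(expected_correct_guesses(n, 2/3), 3))    # BCD(2/3)
```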
4 EXTENSIONS TO THE BIASED COIN RANDOMIZATION

4.1 Unequal Randomization Ratio

A simple extension to the BCD procedure is to allow for unequal treatment allocation. If the required randomization ratio (A:B) is k, the probability of assigning the next subject to treatment A could be

$$\text{Probability (treatment A)} = \begin{cases} \dfrac{k}{k+1} - p & n_A > k\,n_B \\[4pt] \dfrac{k}{k+1} & n_A = k\,n_B \\[4pt] \dfrac{k}{k+1} + p & n_A < k\,n_B. \end{cases}$$

This procedure will further modify the bias of the originally biased coin. The initial coin bias is k/(k+1), and this bias is either increased or decreased by p in cases of imbalance from the planned randomization ratio, subject to the constraint 1 − k/(k+1) > p > 0 (where k > 1).

4.2 More Than Two Treatments

The BCD procedure can be easily modified to more than two treatment groups. Under equal randomization (with t treatment groups and a total of N subjects), the expected number of subjects in each treatment group will be N/t. In this procedure, the coin bias is applied to each treatment group and is established by comparing the current allocation for each individual treatment against the expected level.
We define the probability of assigning the next patient to treatment i (i = 1, . . . , t) as follows:

$$p_i = \text{Probability (treatment } i\text{)} = \begin{cases} c(1/t - p) & n_i > N/t \\ c(1/t) & n_i = N/t \\ c(1/t + p) & n_i < N/t, \end{cases}$$

where ni is the current number of subjects for treatment i, 1/t > p > 0, and c is a rescaling value such that the probabilities sum to 1 ($\sum_{i=1}^{t} p_i = 1$). As before, larger values of p lead to faster convergence toward N/t and subsequently to a lower level of imbalance around N/t. Figure 3(A and B) shows an example of the proportion of subjects assigned to each of four treatment groups for an intended randomization ratio of 1:1:1:1, using values of p equal to 0.05 and 0.15 for 200 subjects. As before in the two-treatment case, a larger value of p will tend the proportion of subjects allocated to each treatment more quickly toward the expected value (here 1/t) and maintain a smaller imbalance around the expected value. Over a large number of subjects, the proportion of subjects allocated to each treatment will tend to 1/t regardless of the choice of p. However, the choice of p does greatly influence the proportion of subjects allocated to each treatment over a small or moderate number of subjects. The choice of p will also influence the chance of long runs of one treatment group, regardless of the number of subjects in the trial.

5 ADAPTIVE BIASED COIN RANDOMIZATION

The coin bias in the original BCD procedure is constant, regardless of the size of the treatment imbalance (DN = nA − nB) or the current size of the study (N = nA + nB). For example, the probability of the next subject being assigned to treatment A is the same, regardless of whether DN is 1 or 100, or whether N is 20 or 2000. An extension to the simple BCD(p) procedure is to modify the coin bias according to the current size of the imbalance. We will describe this using
Figure 3. Proportion of subjects allocated to each of four treatment groups using (A) p = 0.05 and (B) p = 0.15.

Table 2. Probability of imbalance of size j for the Big Stick Rule

Even c, even N: 1/c for j = 0 or j = c; 2/c for even j, c − 2 ≥ j ≥ 2.
Even c, odd N: 2/c for odd j, c − 1 ≥ j ≥ 1.
Odd c, even N: 1/c for j = 0; 2/c for even j, c − 1 ≥ j ≥ 2.
Odd c, odd N: 1/c for j = c; 2/c for odd j, c − 2 ≥ j ≥ 1.
both step and continuous functions for the coin bias. A further extension is to also modify the coin bias based on the current size of the study, and this will be described for continuous coin bias functions.
5.1 Step Functions for Coin Bias

Rosenberger and Lachin (3) describe two modified BCD procedures using a step function for coin bias based on DN. The first is called the Big Stick Rule (8) and uses an unbiased coin for treatment allocation where the current imbalance |DN| < c for some predefined critical value of c. Where the imbalance reaches this critical value, the probability of being assigned to treatment A is 0 (DN = c) or 1 (DN = −c). The imbalance between treatment groups is limited to c, and the maximum consecutive allocation of one treatment group is 2c + 1. There is a chance that treatment allocation is determined in a nonrandomized manner when the maximum imbalance is reached, and the frequency of this depends on the choice of c. A value of c = ∞ gives the BCD(p) design. Using the random walk process, we can calculate the long-term probabilities of imbalance of size j, which depend on whether N and c are odd or even. The long-term probability of imbalance of size j for the Big Stick Rule is shown in Table 2. Therefore, the long-term probability of applying the deterministic nonrandom treatment allocation is 1/c (when N and c are either both odd or both even), and it will decrease as c increases. The second modified BCD procedure described by Rosenberger and Lachin (3) using a step function for coin bias is the "biased coin design with imbalance intolerance" (9). This procedure is a mixture of Efron's original BCD procedure and the Big Stick Rule. The treatment allocation is deterministic where the imbalance exceeds some critical value as in the Big Stick Rule; however, the biased coin is still used for any imbalance below the critical level as in Efron's
procedure. This rule is shown here:

$$\text{Probability (treatment A)} = \begin{cases} 0 & D_n = c \\ 1 - p & c > D_n > 0 \\ 1/2 & D_n = 0 \\ p & 0 > D_n > -c \\ 1 & D_n = -c, \end{cases}$$

where 1 ≥ p ≥ 0.5 (a value of 0.5 gives the big stick design, and a value of 1 gives the big stick design with c = 1). Chen (9) gives the probability of imbalance of size j. In particular, the probability of imbalance of size c (and so the probability of applying the deterministic treatment allocation) is $\frac{(p-q)p^{c-1}}{p^c - q^c}\cdot\frac{q^{c-1}}{p^{c-1}}$, where q = 1 − p. This is always less than 1/c, so the likelihood of reaching the imbalance boundary is less with this procedure compared with the Big Stick Rule. Similarly (for even N), the probability of balance is $\frac{(p-q)p^{c-1}}{p^c - q^c}$, which is always greater than 1/c, so the likelihood of achieving balance is greater with this procedure compared with the Big Stick Rule. Both the Big Stick Rule and the biased coin design with imbalance intolerance procedures use nonrandom allocation of treatment to patients when imbalance has reached a predefined limit, which creates hard boundaries of −c and c for Dn. Pocock (6) describes a modified procedure that avoids the deterministic allocation in the Big Stick Rule by replacing the probabilities of 0 and 1 with 1 − p and p. Therefore, the treatment assignment in this procedure would be:

$$\text{Probability (treatment A)} = \begin{cases} 1 - p & D_N = c \\ 1/2 & |D_N| < c \\ p & D_N = -c. \end{cases}$$
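A brief Python sketch of the two step-function rules displayed above, using the illustrative values c = 3 and p = 0.60:

```python
def prob_a_imbalance_intolerance(d_n, c=3, p=0.60):
    """Biased coin design with imbalance intolerance: a biased coin is used
    inside the boundary, and allocation is deterministic once |D_n| reaches c."""
    if d_n >= c:
        return 0.0              # too many on A: assign B with certainty
    if d_n <= -c:
        return 1.0              # too many on B: assign A with certainty
    if d_n > 0:
        return 1 - p
    if d_n < 0:
        return p
    return 0.5

def prob_a_pocock(d_n, c=3, p=0.60):
    """Pocock's modification of the Big Stick Rule: a fair coin while
    |D_n| < c, with the boundary probabilities 0 and 1 replaced by
    1 - p and p, so no allocation is fully deterministic."""
    if d_n >= c:
        return 1 - p
    if d_n <= -c:
        return p
    return 0.5

for d in range(-3, 4):
    print(d, prob_a_imbalance_intolerance(d), prob_a_pocock(d))
```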
A similar modification to the biased coin design with imbalance intolerance might be to replace 0 and 1 with 1 − p2 and p2 (such that 1/2 < p < p2 < 1). This would give greater balance compared with Pocock's procedure, but at the expense of a more complex strategy now requiring the definition of two levels of coin bias rather than one. In this procedure, the imbalance is not limited by a hard boundary of c but rather by what might be viewed as a "soft" limit. An illustration of this procedure (with p = 0.60, p2 = 0.90, and a soft limit of 3) is compared with Chen's design (with p = 0.60 and a hard limit of 3) in Figure 4.

5.2 Continuous Functions for Coin Bias

A continuous function can be defined for the treatment allocation coin bias such that it is both (1) directly proportional to the size of the current treatment imbalance and (2) inversely proportional to the current size of the study. This provides the benefit that changes to treatment balance are greatest earlier in the trial and/or where the treatment imbalance is greatest. The use of a continuous function for coin bias means that
Figure 4. Difference in number of subjects (A-B) by subject number for adaptive step function with hard (red) and soft (blue) limits of 3 using p = 0.60 and p2 = 0.90.
the arbitrary and subjective cut-off points in the step functions described previously do not need to be defined. Two examples of a continuous function for coin bias are shown here. As before, for two treatment groups, and shown in terms of the probability of assigning the next subject to treatment A, these are

$$\frac{n_B^2}{n_A^2 + n_B^2}$$

(10) and

$$\frac{\omega + \beta n_B}{2\omega + \beta(n_A + n_B)}$$

for some ω, β ≥ 0 (11). Wei (7, 12) describes a general class of functions for coin bias and defines a set of properties that this class of function should meet. The CR, PBR, and BCD procedures are shown to fit within this general framework. For two treatment groups, the probability of assigning the next subject to treatment A would be

$$p(D_n/N) = p\!\left(\frac{n_A - n_B}{n_A + n_B}\right),$$

where p(x) is a nonincreasing function satisfying p(x) + p(−x) = 1 for −1 ≤ x ≤ 1. This class of functions need not be continuous (e.g., the step functions for coin bias described in section 5.1 would satisfy these conditions). However, Wei (7, 12) shows that if the function p(x) is continuous at x = 0 then selection bias will tend to 0 as the sample size increases. The function $p(x) = \frac{(1-x)^\rho}{(1+x)^\rho + (1-x)^\rho}$ with x = Dn/N yields the following function for the probability of assigning the next subject to treatment A:

$$\frac{n_B^\rho}{n_A^\rho + n_B^\rho}$$

(13). This function is a more general form of the rule by Atkinson (10). The function p(x) = (1 − x)/2 with x = Dn/N gives the allocation rule for the probability of being assigned to treatment A as:

$$\frac{n_B}{n_A + n_B}$$
(12). In these procedures, as for the previous BCD strategies, the coin bias will favor randomization toward the under-represented treatment group. However, by using these continuous functions for coin bias, the coin bias is affected by the size of the imbalance. Additionally, as the denominator in all these functions includes nA and nB, the coin bias will tend to 0 as the study gets larger; thus, these procedures will tend to CR as the sample size increases. Another general class of continuous coin bias functions, given by Baldi Antognini and Giovagnoli (14), is the adjustable biased coin designs (ABCDs). The ABCDs are based on Dn rather than Dn/N. In these functions, the coin bias does not necessarily tend to 0 as the study gets larger. This property can be viewed as either an advantage (14) or a disadvantage (12). One ABCD function highlighted by Baldi Antognini and Giovagnoli (14) is

$$\text{Probability (treatment A)} = \begin{cases} \dfrac{|D_n|^a}{|D_n|^a + 1} & D_n < 0 \\[4pt] 0.5 & D_n = 0 \\[4pt] \dfrac{1}{|D_n|^a + 1} & D_n > 0, \end{cases}$$

where a ≥ 0, a function labeled as ABCD(Fa). Smaller values of a will increase the chance of some treatment imbalance but will decrease selection bias (the special case a = 0 would give CR with no selection bias). The function ABCD(Fa) will tend to the big stick design with boundaries of ±2 as a → ∞. The effect of a can be dramatic even when the treatment imbalance is not large. For example, the probability of being assigned to treatment A is <0.01 when Dn ≥ 4 (for a = 4) and when Dn ≥ 5 (for a = 3). The function ABCD(Fa) is continuous, but the probability of assigning treatment A is equal to 1/2 when Dn is 0 or ±1, that is, where the current allocation is balanced or as close to being balanced as possible (the latter in the case of an odd number of subjects). Figure 5 shows the probability of assigning treatment A to the next subject based on the current imbalance using ABCD(F0.5), ABCD(F1), and ABCD(F4). Although the ABCD functions are described in this section as continuous functions for coin bias, the noncontinuous step
functions (section 5.1) could be viewed as a special case of the ABCD as they also represent decreasing probability functions (for treatment A) in Dn and satisfy p(Dn) + p(−Dn) = 1. We conclude this section by illustrating a number of the biased coin procedures. Table 3 shows the probability of assigning treatment A for a variety of situations based on the current level of treatment imbalance (Dn) and the current sample size (N).

6 URN MODELS

6.1 Pólya Urn Model
later. The range of descriptive terms for this design is discussed in Rosenberger [15]. This design also allows the initial number of balls to differ between the two groups: for example, UD(ω1 , ω2 , β). However, when equal allocation to treatment is required throughout the trial, we may assume an equal number of balls for the two treatment groups in the starting urn. The case of an unequal initial composition has relevance where unequal treatment allocation might be required. In this case, the number of balls returned would need to depend not on which ball was drawn but on the comparison of the current urn versus the target randomization ratio. An extension to this design might be to update the urn with a random number of balls of the appropriate color, using some predefined probability-generating function. This is called the Randomized Urn Model (20). The application of the urn model for multiple randomization strata is most simply to treat each strata level as an independent trial with a separate urn (11), as also suggested in the BCD (5). This is described in more detail later. The initial number of balls in the urn will affect the bias in the randomization when treatment imbalance occurs early in the trial. Any imbalance that occurs is more quickly countered with smaller versus larger values of ω (as β has a greater relative effect). As the total number of subjects increases, the effect of ω diminishes, and the procedure tends toward the CR procedure with perfect
Figure 5. Probability of assigning treatment A to the next subject by the current treatment imbalance (Dn = nA − nB) for ABCD(F0.5) [blue], ABCD(F1) [red], and ABCD(F4) [green].
Table 3. Probability of assigning treatment A for a variety of situations based on the current level of treatment imbalance (Dn) and the current sample size (N). Efron (5) uses a constant coin bias, Chen (9) a step function for the coin bias, and the remaining designs continuous functions for the coin bias.

N     nA, nB     Dn   Efron 1971 (5)a   Chen 1999 (9)b   Atkinson 1982 (10)c   Wei 1978 (12)d   Wei 1977 (11)e   ABCD(Fa) (14)f
1     1, 0       1    0.4               0.4              0                     0                0.45             0.5/0.5/0.5
2     1, 1       0    0.5               0.5              0.5                   0.5              0.5              0.5/0.5/0.5
3     2, 1       1    0.4               0.4              0.2                   0.33             0.46             0.5/0.5/0.5
4     3, 1       2    0.4               0.4              0.1                   0.25             0.43             0.414/0.333/0.059
20    15, 5      10   0.4               0                0.1                   0.25             0.33             0.240/0.091/0.001
100   55, 45     10   0.4               0                0.4                   0.45             0.45             0.240/0.091/0.001
1000  505, 495   10   0.4               0                0.49                  0.495            0.495            0.240/0.091/0.001

a BCD(p = 0.6).
b BCD with imbalance intolerance (p = 0.6, c = 5).
c Pr(treatment A) = nB²/(nA² + nB²).
d Pr(treatment A) = nB/(nA + nB).
e Pr(treatment A) = (ω + βnB)/(2ω + β(nA + nB)), with ω = 10 and β = 2.
f Pr(treatment A) as described for ABCD(Fa) of Baldi Antognini and Giovagnoli 2004 (14), with a = 0.5, 1, and 4.
balance. This can be seen by the approximate probability of treatment imbalance for a moderate or large study, which is
Pr(|Dn| > r) ≈ 2[1 − Φ(r√(3/(nA + nB)))]
(3). Note that this probability is independent of both ω and β. There is a direct equivalence between the BCD and the Pólya UD, where the UD(ω,β) is equivalent to the BCD using the allocation probability defined by the continuous function for coin bias of (ω + βnB)/(2ω + β(nA + nB)) (11). The special case UD(0,1) is equivalent to the BCD with the probability of allocating treatment A proportional to the number of subjects already allocated to treatment group B, that is, nB/(nA + nB). Wei and Lachin (17) show the following properties of the UD(ω,β) as compared with other randomization procedures:

• The chance of imbalance for UD(ω,β) is far less than under CR.
• The potential for selection bias for UD(ω,β) is less than for BCD(p) and PBR, and it tends to the minimum selection bias possible (i.e., that under CR) as the study size gets larger.
• As the study size gets larger, UD(ω,β) is free from accidental bias (which may be caused by an imbalance of a known or unknown prognostic factor).

The extension of this design to more than two treatments is simple (19). The initial urn contains ω distinct balls for each treatment group. A ball is removed from the urn at random and replaced. The next patient is allocated the treatment corresponding to this drawn ball. The urn is updated by adding β balls for each of the other treatment groups. The probability of the next subject receiving treatment i (i = 1, . . . , t) is
[ω + β(N − ni)] / [tω + β(t − 1)N],
where N is the total number of subjects (N = n1 + . . . + nt ). The probability of imbalance for treatment group i versus perfect balance
is
Pr(|ni − N/t| > r) ≈ 2[1 − Φ(r√(t²(t + 1)/(N(t − 1)²)))].
Again the long-term probability of imbalance is independent of both ω and β. The UD has a particularly simple application to response-adaptive randomization and is described later.

6.2 Urn Model without Replacement

The urn design UD(N/2,0) using sampling from the urn without replacement is equivalent to the random allocation rule (21). In this design, treatment balance at the end of the trial is guaranteed, given that the total number of subjects required for the trial is known and fixed in advance. The major criticism of this procedure is that once all N/2 subjects are allocated to either group, the remaining subjects must all be assigned to the other treatment group, which may increase both selection and accidental bias, the latter due to time trends in patient characteristics, for example. Another criticism is that this procedure may still lead to imbalances during the trial or to a long run of assignments to one treatment group. The former might be an important consideration if the trial uses interim analyses or is terminated early.

6.3 Friedman's Urn Model

Friedman (18) introduced a more general urn model whereby a ball is removed at random from the urn (initially containing ω balls of each of two colors as before). The ball is then replaced together with α balls of the same color and β balls of the opposite color. Treatment allocation for the next patient is chosen from the color of the selected ball. This design will be labeled as UD(ω,α,β). The model UD(ω,0,β) is the Pólya urn model discussed previously, which is usually described using two parameters, UD(ω,β). To achieve treatment balance, we require β > α ≥ 0 (the case where β = α is CR). Wei (19) confirms that the UD(ω,α,β)
tends to the CR procedure as the study size gets larger, as for the simpler UD(ω,β). The UD(ω,α,β) also has the same properties of asymptotic balance and is free from selection and accidental bias. The UD(ω,α,β) tends to balance more quickly as β/α gets larger (19), so the Pólya design UD(ω,0,β) will tend to balance more quickly than the generalized Friedman design with α > 0. This model is equivalent to a BCD, with an adaptive probability of the next subject being assigned to treatment A as
(ω + αnA + βnB) / [2ω + (α + β)(nA + nB)].
The extension of the Friedman urn model to more than two treatment groups (t > 2) is described by Wei (19), and it follows the same procedure as for the simpler Pólya urn model. Following the random selection and replacement of a ball from the urn, the urn is then updated by adding β balls for each of the other treatment groups and α balls for the selected treatment group. The probability of the next subject receiving treatment i (i = 1, . . . , t) is
[ω + αni + β(N − ni)] / [tω + (α + β(t − 1))N],
where N is the total number of subjects (N = n1 + . . . + nt ). The probability of imbalance for treatment group i versus perfect balance for large N (where (t + 1) β > α) is
Pr(|ni − N/t| > r) ≈ 2[1 − Φ(r√(t²((t + 1)β − α)/(N(t − 1)(α + (t − 1)β))))].
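As a purely illustrative sketch (not from the source), the following Python code simulates allocation from a generalized Friedman urn UD(ω, α, β) with t treatment groups; setting α = 0 gives the Pólya-type urn UD(ω, β), and β = α corresponds to complete randomization. All names and parameter values are ours.

```python
import random

def friedman_urn_trial(n_subjects, t=2, omega=1, alpha=0, beta=1, seed=1):
    """Simulate treatment allocation from a generalized Friedman urn
    UD(omega, alpha, beta) with t treatments."""
    rng = random.Random(seed)
    balls = [omega] * t            # current number of balls per treatment
    counts = [0] * t               # subjects allocated per treatment
    for _ in range(n_subjects):
        # draw a ball at random (with replacement)
        i = rng.choices(range(t), weights=balls)[0]
        counts[i] += 1
        # update the urn: alpha balls for the drawn colour,
        # beta balls for every other colour
        for j in range(t):
            balls[j] += alpha if j == i else beta
    return counts

if __name__ == "__main__":
    print(friedman_urn_trial(100, t=2, omega=5, alpha=0, beta=1))
    print(friedman_urn_trial(100, t=3, omega=1, alpha=1, beta=3))
```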
7 TREATMENT BALANCE FOR COVARIATES

In this section we discuss amendments to the biased coin designs that attempt to balance the treatment allocation not only over the whole study, but also within levels of important covariates or prognostic factors. In a study with treatment imbalance with respect to prognostic factors, the estimated treatment differences will still be unbiased as long as the prognostic factor is accounted for
in the analysis (this is not true if the factor is ignored in the analysis). Therefore, the real benefit of maintaining treatment balance within strata is to reduce the variability of the estimated treatment difference, and so to improve the efficiency of the study (22). The BCD procedures aim to bias the treatment allocation for future subjects based on the treatment allocation of previous subjects. As we have seen, the potential for overall treatment imbalance is greatly reduced in the BCD compared with CR. However, the BCD procedures will not guarantee treatment balance within levels of any important prognostic factor. A common procedure to achieve treatment balance within strata is stratified randomization, and the most common method of this is stratified PBR. This method will help improve balance within strata across the prognostic factors. However, it is possible that the overall study could still have some sizeable imbalance across the strata, particularly where the total study size is small and the number of strata is large. In this case, many blocks may be unfilled and potentially imbalanced at any point in the trial. The potential for selection bias would also still exist in this procedure. Another simple procedure to achieve treatment balance within strata is minimization, which attempts to achieve balance within the strata of each main prognostic factor (not in strata across factors as in stratified randomization). For example, we can create balance for both gender and for all study sites, but not at each gender-site level. Minimization has been proposed using:

• A deterministic nonrandom allocation rule (23).
• A procedure using a combination of both nonrandom and random allocation rules, the former for higher imbalance, the latter for no or mild imbalance (2).
• A rule maintaining random allocation (10, 24).

The latter approaches can be viewed as biased coin randomization procedures, for example, where the coin bias is weighted toward allocation that gives the least treatment imbalance in the marginal covariate
factors for the next patient (24), or toward allocation that minimizes the variances of the treatment contrasts under the design containing the current state of the treatment allocation and covariate information and the covariate factors for the next patient (D-optimality [10, 25]). Within the BCD and UD procedures previously described, a common simple solution is to treat each stratum as a separate experiment. The bias in the coin is then calculated using the current state of the single urn (stratum) relevant for the next patient (e.g., the single urn for males, age < 45 years, in study site 10). Again, where the number of strata is large and/or the sample size is small, this procedure may not protect against overall treatment imbalance or even imbalance within strata. In this case, Wei (19) suggests balancing the marginal urns, either using some composite score calculated from the imbalance over all relevant urns (e.g., the three urns for males, age < 45 years, and study site 10) or using only the currently most imbalanced marginal urn. This procedure would not aim to achieve balance within cross-factor strata, but it is more likely to achieve balance in the marginal factor levels and across the study. Therefore, the BCD procedure to achieve marginal-factor urn balance is a minimization procedure maintaining the random allocation rule.

8 APPLICATION OF BIASED COIN DESIGNS TO RESPONSE-ADAPTIVE RANDOMIZATION

The aim of the various biased coin designs is to use a randomization strategy that tends the treatment allocation toward an intended allocation ratio (most often balanced) while reducing allocation bias and other biases in the trial. These strategies can also be extended to the treatment allocation within important strata (covariate-adaptive randomization). A quite distinct class of techniques are those for response-adaptive randomization (RAR); as their name suggests, these techniques allocate subjects to treatment groups based on the known responses of previous subjects, rather than on just the treatment group or covariate values. Broadly, the aim of RAR techniques could be:
• To optimize some key parameter of interest.
• To tend the treatment allocation toward the more "successful" treatment groups and away from the inferior groups (with respect to efficacy, safety or toleration, or some combination).

Both of these objectives can be viewed as creating a more ethical design for the patients recruited into the clinical trial, for example, by minimizing the variance of the treatment contrasts (to increase power or reduce overall sample size) or by reducing the number of patients subjected to a clearly inferior treatment. Both the BCD and UD techniques may be used for RAR, whether the aim is to optimize some parameter or to bias allocation toward successful treatment groups. The application is easiest for binary response data, as in the randomized "play the winner" rule (RPW), the most well-known RAR technique (26). In this urn model, the initial composition is ω balls for each treatment group (each treatment identified by different colored balls). The first patient is allocated the treatment according to a ball drawn (with replacement) from the initial urn, and the response is noted. If the subject has a treatment response, β colored balls for that treatment are added into the urn, and α balls for each of the other treatments. If the subject does not have a treatment response, α and β balls are added for the current and each of the other treatments, respectively. The study then continues to recruit subjects, to collect the patient responses, and to update the urn. This design is labeled as the RPW(ω,α,β), where β > α ≥ 0 (CR is a special case where β = α). In the two-group case (treatments A and B) for RPW(ω,0,β), the ratio of balls in the urn and the treatment allocation ratio (A/B) will tend to qB/qA, where qi = 1 − pi is the nonresponse rate for treatment i. For the more general RPW(ω,α,β), the treatment allocation ratio will tend to a function based on α and β. These results show that the BCD, using a continuous function for coin bias based on the responses in each treatment group, could also be used as a strategy for a RAR design with the same objective as for the RPW design.
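The following Python sketch (an illustration under our own naming, not part of the source) simulates the RPW(ω, α, β) urn for binary responses; the true response rates are used only to generate hypothetical data, since in practice they are unknown.

```python
import random

def rpw_trial(response_probs, n_subjects, omega=1, alpha=0, beta=1, seed=1):
    """Simulate the randomized play-the-winner rule RPW(omega, alpha, beta).
    response_probs[i] is the response rate of treatment i, used here only
    to simulate patient outcomes."""
    rng = random.Random(seed)
    t = len(response_probs)
    balls = [omega] * t
    counts = [0] * t
    responses = [0] * t
    for _ in range(n_subjects):
        i = rng.choices(range(t), weights=balls)[0]   # draw with replacement
        counts[i] += 1
        success = rng.random() < response_probs[i]
        responses[i] += success
        for j in range(t):
            if success:
                balls[j] += beta if j == i else alpha   # reward the winner
            else:
                balls[j] += alpha if j == i else beta   # shift away on failure
    return counts, responses

if __name__ == "__main__":
    counts, _ = rpw_trial([0.7, 0.4], n_subjects=200, omega=1, alpha=0, beta=1)
    # allocation ratio A/B tends toward q_B/q_A = 0.6/0.3 = 2
    print(counts)
```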
Many modifications have been made to the initial RPW urn model for RAR. These include strategies for more than two treatment groups, for categorical and continuous outcomes, for delayed responses, for treatment elimination from the randomization, and for targeting response percentiles (3, 4, 15, 27).
REFERENCES

1. D. Blackwell and J. L. Hodges, Design for the control of selection bias. Ann Math Stat. 1957; 28: 449–460.
2. M. Zelen, The randomization and stratification of patients to clinical trials. J Chron Dis. 1974; 27: 365–375.
3. W. F. Rosenberger and J. M. Lachin, Randomization in Clinical Trials: Theory and Practice. New York: Wiley, 2002.
4. S. C. Chow and M. Chang, Adaptive Design Methods in Clinical Trials. Boca Raton: Chapman & Hall, 2007.
5. B. Efron, Forcing a sequential experiment to be balanced. Biometrika. 1971; 58: 403–417.
6. S. J. Pocock, Clinical Trials: A Practical Approach. New York: Wiley, 1983.
7. L. J. Wei, A class of treatment assignment rules for sequential experiments. Commun Stat Theory Methods. 1978; A7: 285–295.
8. J. F. Soares and C. F. Wu, Some restricted randomization rules in sequential designs. Commun Stat Theory Methods. 1983; 12: 2017–2034.
9. Y. P. Chen, Biased coin design with imbalance intolerance. Commun Stat Stochastic Models. 1999; 15: 953–975.
10. A. C. Atkinson, Optimum biased coin designs for sequential clinical trials with prognostic factors. Biometrika. 1982; 69: 61–67.
11. L. J. Wei, A class of designs for sequential clinical trials. J Am Stat Assoc. 1977; 72: 382–386.
12. L. J. Wei, The adaptive biased coin design for sequential experiments. Ann Stat. 1978; 6: 92–100.
13. R. L. Smith, Sequential treatment allocation using biased coin designs. J R Stat Soc Ser B Methodol. 1984; 46: 519–543.
14. A. Baldi Antognini and A. Giovagnoli, A new 'biased coin design' for the sequential allocation of two treatments. Appl Stat. 2004; 53: 651–664.
15. W. F. Rosenberger, Randomized urn models and sequential design. Sequential Analysis. 2002; 21: 1–28.
16. F. Eggenberger and G. Pólya, Über die Statistik verketteter Vorgänge. Zeitschrift für angewandte Mathematik und Mechanik. 1923; 3: 279–289.
17. L. J. Wei and J. M. Lachin, Properties of the urn randomization in clinical trials. Control Clin Trials. 1988; 9: 345–364.
18. B. Friedman, A simple urn model. Commun Appl Math. 1949; 1: 59–70.
19. L. J. Wei, An application of an urn model to the design of sequential controlled clinical trials. J Am Stat Assoc. 1978; 73: 559–563.
20. N. L. Johnson and S. Kotz, Urn Models and Their Applications. New York: Wiley, 1977.
21. J. M. Lachin, Properties of simple randomization in clinical trials. Control Clin Trials. 1988; 9: 312–326.
22. S. Senn, Statistical Issues in Drug Development. Chichester, UK: Wiley, 1997.
23. D. R. Taves, Minimization: a new method of assigning patients to treatment and control groups. Clin Pharm Ther. 1974; 15: 443–453.
24. S. J. Pocock and R. Simon, Sequential treatment assignment with balancing for prognostic factors in the controlled clinical trial. Biometrics. 1975; 31: 103–115.
25. A. C. Atkinson, Optimum biased-coin designs for sequential treatment allocation with covariate information. Stat Med. 1999; 18: 1741–1752.
26. L. J. Wei and S. Durham, The randomized play-the-winner rule in medical trials. J Am Stat Assoc. 1978; 73: 840–843.
27. A. Ivanova and W. F. Rosenberger, A comparison of urn designs for randomized clinical trials of K > 2 treatments. J Biopharm Stat. 2000; 10: 93–107.
BIOEQUIVALENCE (BE) TESTING FOR GENERIC DRUGS

To receive approval for an Abbreviated New Drug Application (ANDA), an applicant generally must demonstrate, among other things, that its product has the same active ingredient, dosage form, strength, route of administration, and conditions of use as the listed drug and that the proposed drug product is bioequivalent to the reference listed drug [21 United States Code (U.S.C.) 355(j)(2)(A); 21 Code of Federal Regulations (CFR) 314.94(a)]. Bioequivalent drug products show no significant difference in the rate and extent of absorption of the therapeutic ingredient [21 U.S.C. 355(j)(8); 21 CFR 320.1(e)]. Studies for BE are undertaken in support of ANDA submissions with the goal of demonstrating BE between a proposed generic drug product and its reference listed drug. The regulations governing BE are provided at 21 CFR Part 320.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/6772dft.pdf) by Ralph D’Agostino and Sarah Karl.
BIOLOGICAL ASSAY, OVERVIEW
PRANAB K. SEN, Chapel Hill, NC, USA
This article mainly emphasizes the classical aim of biological assay (or bioassay), to estimate relative potency, arising out of a need for biological standardization of drugs and other products for biological usage. There is a basic difference between biological and chemical endpoints or responses: the former exhibits greater (bio)variability and thereby requires in vivo or in vitro biological assays wherein a standard preparation (or a reference material) is often used to have a meaningful interpretation of relative potency. However, the term bioassay has also been used in a wider sense, to denote an experiment, with biological units, to detect possible adverse effects such as carcinogenicity or mutagenicity (see Mutagenicity Study). In the context of environmental impact on biosystems, toxicodynamic and toxicokinetic (TDTK) models as well as physiologically based pharmacokinetic (PBPK) models have been incorporated to expand the domain of bioassays; structure–activity relationship information (SARI) is often used to consolidate the adoption of bioassays in a more general setup; the genesis of dosimetry (or animal studies) lies in this complex. The use of biomarkers in studying environmental toxic effects on biological systems, as well as in carcinogenicity studies, has enhanced the scope of bioassays to a greater interdisciplinary field; we need to appraise, as well, bioassays in this broader sense. Further, recent advances in bioinformatics have added new frontiers to the study of biological systems; bioassay models are gaining more popularity in the developing area of computational biology. Our appraisal of bioassay would remain somewhat incomplete without an assessment of the role of pharmacogenomics as well as toxicogenomics in establishing a knowledge base of the chemical effects in biological systems. The developments in genomics during the past eight years have opened the doors for a far more penetrating level of research focusing on the gene-environment interaction in conventional experiments with biological units, and thereby calling for drastically different statistical resolutions for bioassays. We include also a brief synopsis of these recent developments.

Traditionally, in a bioassay, a test (new) and a standard preparation are compared by means of reactions that follow their applications to some biological units (or subjects), such as subhuman primates (or humans), living tissues or organs; the general objective being to draw interpretable statistical conclusions on the relative potency of the test preparation with respect to the standard one. Usually, when a drug or a stimulus is applied to a subject, it induces a change in some measurable characteristic that is designated as the response variable. In this setup, the dose may have several chemically or therapeutically different ingredients while the response may also be multivariable. Thus the stimulus–response or dose–response relationship for the two preparations, both subject to inherent stochastic variability, are to be compared in a sound statistical manner (with adherence to biological standardization) so as to cast light on their relative performance with respect to the set objectives. Naturally, such statistical procedures may depend on the nature of the stimulus and response, as well as on other extraneous experimental (biological or therapeutical) considerations. As may be the case with some competing drugs for the treatment of a common disease or disorder, the two (i.e. test and standard) preparations may not have the same chemical or pharmacological constitution, and hence, statistical modeling may be somewhat different than in common laboratory experimentation. Nevertheless, in many situations, the test preparation may behave (in terms of the response/tolerance distribution) as if it is a dilution or concentration of the standard one. For this reason, often, such bioassays are designated to compare the relative performance of two drugs under the dilution–concentration postulation, and are thereby termed dilution assays.
Dilution assays are classified into two broad categories: Direct dilution and indirect dilution assays. In a direct assay, for each preparation, the exact amount of dose needed to produce a specified response is recorded, so that the response is certain while the dose is a nonnegative random variable that defines the tolerance distribution. Statistical modeling of these tolerance distributions enables us to interpret the relative potency in a statistically analyzable manner, often in terms of the parameters associated with the tolerance distributions. By contrast, in an indirect assay, the dose is generally administered at some prefixed (usually nonstochastic) levels, and at each level, the response is observed for subjects included in the study. Thus, the dose is generally nonstochastic and the stochastic response at each level leads to the tolerance distributions that may well depend on the level of the dose as well as the preparation. If the response is a quantitative variable, we have an indirect quantitative assay, while if the response is quantal in nature (i.e. all or nothing), we have a quantal assay. Both of these indirect assays are more commonly addressed in statistical formulations. Within this framework, the nature of the dose–response regression may call for suitable transformations on the dose variable (called the dosage or dose-metameter) and/or the response variable, called the response-metameter. The basic objective of such transformations is to achieve a linear dosage-response regression, which may induce simplifications in statistical modeling and analysis schemes. In view of the original dilution structure, such transformations may lead to different designs for such assays, and the two most popular ones are (i) parallel-line assays and (ii) slope-ratio assays. Within each class, there is also some variation depending on the (assumed) nature of tolerance distributions, and within this setup, the probit (or normit) and logit transformations, based on normal and logistic distributions respectively, are quite popular in statistical modeling and analysis of bioassays. Bliss 2 contains an excellent account of the early developments in this area, while the various editions of Finney 6 capture more up-to-date developments, albeit with a predominantly parametric flavor. We refer to
these basic sources for extensive bibliography of research articles, particularly in the early phase of developments where biological considerations often dominated statistical perspectives. In this framework, it is also possible to include bioassays that may be considered for bioavailability and bioequivalence studies, though basically there are some differences in the two setups: Bioassays for assessing relative potency relate to clinical therapeutic equivalence trials, while in bioequivalence trials, usually, the relative bioavailability of different formulations of a drug are compared. Thus, in bioequivalence studies, the pharmacologic results of administrating essentially a common drug in alternative forms, such as a capsule versus a tablet, or a liquid dose of certain amount, capsules (tablets) or liquid forms of larger dose versus smaller dose with increased frequency of prescription, or even the administration of a drug at different time of the day, such as before breakfast or sometime after a meal, and so on, are to be assessed in a valid statistical manner. In this sense, the active ingredients in the drug in such alternative forms may be essentially the same, and differences in bioavailability reflect the form and manner of administration. We shall see later on that these basic differences in the two setups call for somewhat different statistical formulations and analysis schemes.
1 DIRECT DILUTION ASSAYS

As an illustrative example, consider two toxic preparations (say, S and T), such that a preparation is continuously injected into the blood stream of an animal (say, cat) until its heart stops beating. Thus, the response (death) is certain, while the exact amount of the dose (X) required to produce the response is stochastic. Let XS and XT stand for the dose (variable) for the standard and test preparation, and let FS(x) and FT(x), x ≥ 0, be the two tolerance distributions. The fundamental assumption of a direct dilution assay is the following:

FT(x) = FS(ρx) for all x ≥ 0,   (1)
where ρ(>0) is termed the relative potency of the test preparation with respect to the standard one. Standard parametric procedures for drawing statistical conclusions on ρ are discussed fully in the classical text of Finney (6), where other references are also cited in detail. If FS(.) is assumed to be a normal distribution function, then ρ is characterized as the ratio of the two means, as well as the ratio of the two standard deviations. Such simultaneous constraints on means and variances vitiate the simplicity of achieving optimality of parametric procedures (in the sense of maximum likelihood estimators and related likelihood ratio tests). On the other hand, if we use the log-dose transformation on the two sets of doses, and the resulting dosage distributions, denoted by F*S(.) and F*T(.) respectively, are taken as normal, then they have the same variance, while the difference of their means defines log ρ. Interestingly enough, in the first case, the estimator of ρ is the ratio of the sample arithmetic means, while in the other case, it turns out as the ratio of the sample geometric means. A different estimator emerges when one uses a power-dosage (as is common in slope-ratio assays). Thus, in general, these estimators are not the same, and they depend sensibly on the choice of a dosage. This explains the lack of invariance property of such parametric estimates (as well as associated test statistics) under monotone dosage transformations. From an operational point of view, an experimenter may not have the knowledge of the precise dosage, and hence, it may not be very prudent to assume the normality, lognormality or some other specific form of the tolerance distribution. Therefore, it may be reasonable to expect that an estimator of the relative potency should not depend on the chosen dosage as long as the latter is strictly monotone. For example, if the true tolerance distribution is logistic while we assume it to be (log)normal, the sample estimator may not be unbiased and fully efficient. Even when the two tolerance distributions are taken as normal, the ratio of the sample means is not unbiased for ρ. In this respect, such parametric procedures for the estimation of the relative potency (or allied tests for the fundamental assumption) are
not so robust, and any particular choice of a dosage may not remain highly efficient over a class of such chosen tolerance distributions. Nonparametric procedures initiated by Sen (17–19) and followed further by Shorack (26), and Rao and Littell (14), among others, eliminate this arbitrariness of dosage selection and render robustness to a far greater extent. Basically, we may note that ranks are invariant under strictly monotone (not necessarily linear) transformations on the sample observations. As such, a test for the fundamental assumption in (1) based on an appropriate rank statistic remains invariant under such transformations. Similarly, if an estimator of the relative potency is based on suitable rank statistics, it remains invariant under such strictly monotone dosage transformations. Both the Wilcoxon–Mann–Whitney two-sample rank-sum test and the (Brown–Mood) median test statistics were incorporated by Sen (17) for deriving nonparametric estimators of relative potency, and they also provide distribution-free confidence intervals for the same parameter. If there are m observations XS1, . . . , XSm for the standard preparation and n observations XT1, . . . , XTn for the test preparation, we define the differences

Yij = XSi − XTj, for i = 1, . . . , m; j = 1, . . . , n.   (2)
We arrange the N(= mn) observations Yij in ascending order of magnitude, and let ỸN be the median of these N observations. If N is even, we take the average of the two central order statistics. Then ỸN is the Wilcoxon score estimator of log ρ, and it is a robust and efficient estimator of log ρ. The estimator is invariant under any strictly monotone transformation on the dose. Similarly, the confidence interval for log ρ can be obtained in terms of two specified order statistics of the Yij, and this is a distribution-free and robust procedure. A similar procedure works out for the median procedure; for general rank statistics, generally, an iterative procedure is needed to solve for such robust R-estimators. Rao and Littell (14) incorporated the two-sample Kolmogorov–Smirnov test statistics in the formulation of their estimator. For computational convenience, because of
the invariance property, it is simpler to work with the log-dose dosage, and in that way, the estimators of the log-relative potency correspond to the classical rank estimators in the two-sample location model. These direct dilution assays require the measurement of the exact doses needed to produce the response; this may not be the case if there are some latent effects. For example, the time taken by the toxic preparation to traverse from the point of infusion to the heart multiplied by the infusion rate may account for such a latent effect. In general, the situation may be much more complex. This naturally affects the fundamental assumption in (1), and variations in the modeling and statistical analysis to accommodate such effects have been discussed in (6) and (17) in the parametric and nonparametric cases, respectively.
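As a minimal illustration (ours, with hypothetical data), the following Python sketch computes the Wilcoxon-score estimator of log ρ described above — the median of all mn pairwise differences of log-doses — and returns a pair of ordered differences as a distribution-free confidence interval; the particular order-statistic indices would in practice be chosen from the null distribution of the Wilcoxon statistic, which is not computed here.

```python
import math
from statistics import median

def log_relative_potency(doses_standard, doses_test, conf_index=None):
    """Wilcoxon-score (Hodges-Lehmann type) estimator of log(rho):
    median of all pairwise differences log(X_Si) - log(X_Tj).
    conf_index = (a, b) selects the a-th and b-th ordered differences
    (1-based) as a distribution-free confidence interval."""
    diffs = sorted(math.log(xs) - math.log(xt)
                   for xs in doses_standard for xt in doses_test)
    est = median(diffs)
    if conf_index is None:
        return est
    a, b = conf_index
    return est, (diffs[a - 1], diffs[b - 1])

if __name__ == "__main__":
    standard = [2.1, 2.6, 3.0, 3.8, 4.2]     # doses needed under S (hypothetical)
    test     = [1.0, 1.4, 1.6, 1.9, 2.3]     # doses needed under T (hypothetical)
    est, ci = log_relative_potency(standard, test, conf_index=(4, 22))
    print(math.exp(est), [math.exp(c) for c in ci])   # rho-hat and interval
```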
2 INDIRECT DILUTION ASSAYS
As an example, consider two drugs, A and B, each administered at k(≥ 2) prefixed levels (doses) d1, . . . , dk. Let XSi and YTi be the response variables for the standard and test preparation respectively. These drugs may not have the same chemical ingredients, and may not have the same dose levels. It is not necessary to have the same doses for both the preparations, but the modifications are rather straightforward, and hence we assume this congruence. We assume first that both XSi and YTi are continuous (and possibly nonnegative) random variables. Suppose further that there exist some dosage xi = ξ(di), i = 1, . . . , k and response-metameter X* = g(X), Y* = g(Y), for some strictly monotone g(.), such that the two dosage–response regressions may be taken as linear, namely, that

Y*Ti = αT + βT xi + eTi,   X*Si = αS + βS xi + eSi,   (3)
for i = 1, . . . , k, where for statistical inferential purposes, certain distributional assumptions are needed for the error components eTi and eSi , i = 1, . . . , k. Generally, in the context of log-dose transformations, we have a parallel-line assay, while slope-ratio assays
arise typically for power transformations. Thus, in a parallel-line assay, the two dose– response regression lines are taken to be parallel, and further that the errors eTi and eSi have the same distribution (often taken as normal). In this setup, we have then β S = β T = β (unknown), while α T = α S + β log ρ, where ρ is the relative potency of the test preparation with respect to the standard one. This leads to the basic estimating function log ρ =
(αT − αS)/β,   (4)
so that if the natural parameters β, α S and α T are estimated from the acquired bioassay dataset, statistical inference on logρ (and hence ρ) can be drawn in a standard fashion. For normally distributed errors, the whole set of observations pertains to a conventional linear model with a constraint on the two slopes β S , β T , so that the classical maximum likelihood estimators and allied likelihood ratio tests can be incorporated for drawing statistical conclusions on the relative potency or the fundamental assumption of parallelism of the two regression lines. However, the estimator of logρ involves the ratio of two normally distributed statistics, and hence, it may not be unbiased; moreover, generally, the classical Fieller’s theorem [6] is incorporated for constructing a confidence interval for logρ (and hence, ρ), and it is known that this may result in an inexact coverage probability. Because of this difference in setups (with that of the classical linear model), design aspects for such parallel-line assays need a more careful appraisal. For equispaced (log —)doses, a symmetric 2k-point design has optimal information contents, and are more popularly used in practice. We refer to 6 for a detailed study of such bioassay designs in a conventional normally distributed errors model. Two main sources of nonrobustness of such conventional inference procedures are the following: 1. Possible nonlinearity of the two regression lines (they may be parallel but yet curvilinear); 2. Possible nonnormality of the error distributions.
On either count, the classical normal theory procedures may perform quite nonrobustly, and their (asymptotic) optimality properties may not hold even for minor departures from either postulation. However, if the two dose–response regressions (linear or not) are not parallel, the fundamental assumption of parallel-line assays is vitiated, and hence, statistical conclusions based on the assumed model may not be very precise. In a slope-ratio assay, the intercepts α S and α T are taken as the same, while the slopes β S and β T need not be the same and their ratio provides the specification of the relative potency ρ. In such slope-ratio assays, generally, a power transformation: dosage = (dose)λ , for some λ > 0 is used, and we have ρ=
(βT/βS)^(1/λ),   (5)
which is typically a nonlinear function of the two slopes β T and β S , and presumes the knowledge of λ. In such a case, the two error components may not have the same distribution even if they are normal. This results in a heteroscedastic linear model (unless ρ = 1), where the conventional linear estimators or allied tests may no longer possess validity and efficiency properties. Moreover, as ρ λ is a ratio of two slopes, its conventional estimator based on usual estimators of the two slopes is of the ratio-type. For such ratio-type estimators, again the well-known Fieller Theorem (6) is usually adopted to attach a confidence set to ρ or to test a suitable null hypothesis. Such statistical procedures may not have the exact properties for small to moderate sample sizes. Even for large sample sizes, they are usually highly nonrobust for departures from the model-based assumptions (i.e. linearity of regression, the fundamental assumption, and normality of the errors). Again the design aspects for such slope-ratio assays need a careful study, and (6) contains a detailed account of this study. Because of the common intercept, usually a 2k + 1 point design, for some nonnegative integer k is advocated here. The articles on Parallel-line Assay and Slope-ratio Assay should be consulted for further details. The primary emphasis in these articles is on standard parametric
methods, and hence we discuss briefly here the complementary developments of nonparametric and robust procedures for such assays. These were initiated in (21, 22) and also systematically reviewed in (23). First, we consider a nonparametric test for the validity of the fundamental assumption in a parallel-line assay. This is essentially a test for the equality of slopes of two regression lines, and as in (21), we consider an aligned test based on the Kendall τ statistic. For each preparation, with the set of dosages as independent variate and responses as dependent variable, one can define the Kendall tau statistic in the usual manner. We consider the aligned observations YTi − bxi and xi, and denote the corresponding Kendall τ (in the summation but not average form) as KT(b), and for the standard preparation, an aligned Kendall's τ statistic is defined by KS(b), where we allow b to vary over the entire real line. Let then

K*(b) = KT(b) + KS(b),   −∞ < b < ∞.   (6)

Note then that KT(b), KS(b), and hence K*(b) are all nonincreasing in b and have finitely many step-down discontinuities. Equating K*(b) to 0 (20), we obtain the pooled estimator β̂ of β. Let us then write

L = {[KT(β̂)]² + [KS(β̂)]²} / Vn,   (7)
(8)
for different i, and treating them as two independent samples, as in the case of dilution direct assays, we use the
6
BIOLOGICAL ASSAY, OVERVIEW
Figure 1. Graphical procedure for obtaining a nonparametric confidence interval for the log potency ratio in a parallel-line assay
parameters, and obtain similar robust estimation and testing procedures. It is also possible to use general (aligned) M-statistics for this purpose. In general, such solutions are to be obtained by iterative methods, and hence, for simplicity and computational ease, we prescribe the use of the Kendall tau and twosample rank- sum statistics for the desired statistical inference. Next, we proceed to the case of sloperatio assays, and consider first a nonparametric test for the validity of the fundamental assumption (of a common intercept but possibly different slopes). We define the Kendall tau statistics K T (b) and K S (b) as in the case of the parallel-line assay, and equating them to 0, we obtain the corresponding estimates of β T and β S , which are denoted by βˆT and βˆS respectively. Consider then the residuals Y˜ Ti = YTi − βˆT xi , Y˜ Si = YSi − βˆS xi , ∀i. (9)
Wilcoxon–Mann–Whitney rank-sum test statistic to estimate the difference of the intercepts α T − α S in a robust manner. As in the direct dilution assay, this estimator is the median of the differences of all possible pairs of residuals from the test and standard preparation respectively. A robust, consistent and asymptotically normally distributed estimator of logρ is then obtained by dividing this estimator by the pooled estimaˆ tor β. For drawing a confidence interval for logρ (and hence, ρ), we can then use the Fieller Theorem by an appeal to the asymptotic normality of the estimator, or as in (21), consider a rectangular confidence set for β and α T − α S by computing a coordinate-wise confidence interval for each with coverage probability 1 − γ /2, and as in Figure 1, draw a robust confidence set for logρ with coverage probability 1 − γ. Though this does not have an exact coverage probability, it is quite robust and works out well even for quite nonnormal error distributions. In the above setup, instead of the Kendall τ and the two-sample rank-sum statistics, we may use a general linear rank statistic for regression and a two-sample linear rank statistic for difference of location
We pool all these residuals into a combined set, and use the signed-rank statistic to derive the corresponding rank estimator of the hypothesized common value of the intercept; this estimator, denoted by α,isthe ˜ median of all possible midranges of the set of residuals listed above. Let then Yˆ Ti = ˜ Yˆ Si = Y˜ Si − α, ˜ ∀i,and for each prepaY˜ Ti − α, ration, based on these residuals, we consider the Wilcoxon signed-rank statistic. These are ˆ S respectively. As in the ˆ T and W denoted by W case of parallel-line assays, here we consider a test statistic for testing the validity of the fundamental assumption as L=
ˆ 2} ˆ 2 +W {W T S , Vn
(10)
where V n is the variance of Wilcoxon signedrank statistic under the hypothesis of symmetry of the distribution around 0 (and is a known quantity). When the fundamental assumption holds, the distribution of L isclose to the central chi-square distribution with 1 degree of freedom, and hence a test can be carried out using the percentile point of this chi-square law. This test is quite robust and the underlying normality of the errors may not be that crucial in this context. Note that for the slope-ratio assay, granted the
BIOLOGICAL ASSAY, OVERVIEW
fundamental assumption of a common intercept, a natural plug-in estimator of ρ is given by ρˆ =
βˆT βˆS
1/λ .
(11)
We may use the Fieller Theorem under an asymptotic setup to construct a confidence interval for ρ. Alternatively, as in the case of a parallel line assay, for a given γ (0 < γ < 1), we may consider a distribution-free confidence interval of coverage probability 1 − γ /2 for each of the two slopes β T and β S , and obtain a confidence interval for ρ λ (and hence ρ). The situation is quite comparable to the Figure for the parallel-line assay, excepting that β T and β S are taken for the two axes. Here also, instead of the Kendall tau statistics and the Wilcoxon signed-rank statistics, general regression rank statistics and (aligned) signed-rank statistics (or even suitable M-statistics) can be used to retain robustness of the procedures without sacrificing much efficiency. However, the solutions are generally to be obtained by iterative procedures, and hence, we prefer to use the simpler procedures considered above. 3
INDIRECT QUANTAL ASSAYS
In this type of (indirect) assays, the response is quantal (i.e. all or nothing) in nature. For each preparation (T or S) and at each level of administered dose, among the subjects, a certain number manifest the response while the others do not; these frequencies are stochastic in nature and their distribution depends on the dose level and the preparation. Thus, for a given dosage x, we denote by F T (x) and F S (x) the probability of the response for the test and standard preparation respectively. It is customary to assume that both F T (x) and F S (x) are monotone increasing in x, and for each α(0 < α < 1), there exits unique solutions of the following FT (ξTα ) = α,
and FS (ξSα ) = α,
(12)
so that ξ Tα and ξ Sα are the α-quantile of the test and standard preparation; they are
7
termed the 100α% effective dosage. In particular, for α = 1/2, they are termed the median effective dosage. Whenever the response relates to death (as is usually the case with animal and toxicologic studies), the ξTα , ξSα are also termed 100α%-lethal dosage. In many studies, generally, low dosages are contemplated so that α is chosen to be small. This is particularly the case with radioimmunoassays, and we shall comment on that later on. Estimation of the ξ Tα and ξ Sα with due attention to their interrelations is the main task in a quantal assay. The concept of parallel-line and slope-ratio assays, as laid down for indirect quantitative assays, is also adoptable in quantal assays, and a detailed account of the parametric theory based on normal, lognormal, logistic, and other notable forms of the distribution F T (x) is available with Finney [6, Chapter 17]. In this context, the probit and logit analyses are particularly notable, and we shall discuss them as well. To set the ideas, we consider a single preparation at k(≥ 2)- specified dosage d1 , . . . ,dk , where d1 < d2 < . . . < dk . Suppose that the dosage di has been administered to ni subjects, out of which ri respond positively while the remaining ni − ri do not, for i = 1, . . . , k. In this setup, the di , ni are nonstochastic, while the ri are random. The probability of a positive response at dosage di , denoted by π (di ), is then expressed as π (di ) = π (θ + βdi ),
i = 1, . . . , k,
(13)
where θ and β are unknown (intercept and regression) parameters, and π (x), −∞ < x < ∞, is a suitable distribution function. In a parametric mold, the functional form of π (.) is assumed to be given, while in nonparametrics, no such specific assumption is made. Note that the joint probability law of r1 , . . . , rk is given by k ni i=1
ri
π (θ + βdi )ri [1 − π (θ + βdi )]ni −ri , (14)
so that the likelihood function involves only two unknown parameters θ and β. The loglikelihood function or the corresponding estimating equations are not linear in the parameters, and this results in methodological as well as computational complications.
8
BIOLOGICAL ASSAY, OVERVIEW
If π (.) is taken as a logistic distribution, that is, π (x) = {1 + e−x }−1 , then we have from the above discussion π (di ) = θ + βdi , i = 1, . . . , k. log [1 − π (di )] (15) This transformation, known as the logit transformation, relates to a linear regression on the dosage, and simplifies related statistical analysis schemes. Thus, at least intuitively, we may consider the sample logits ri − ri ) , i = 1, . . . , k, (16) Zi = log ni and attempt to fit a linear regression of the Zi on di . In passing, we may remark that technically ri could be equal to zero or ni (with a positive probability), so that Zi would assume the values − ∞ and +∞ with a positive probability, albeit for large ni , this probability converges to zero very fast. As in practice, the ni may not be all large; to eliminate this impasse, we consider the Anscombe correction to a binomial variable, and in (16), modify the Zi as (ri + 38 ) Zi = log , i = 1, . . . , k. (17) (ni − ri + 38 ) Though the ri have binomial distributions, the Zi have more complex probability laws, and computation of their exact mean, variance, and so on, is generally highly involved. For large values of the ni , we have the following √
ni (Zi − θ − βdi )
D
→ N(0, {π (di )[1 − π (di )]}−1 ),
(18)
for each i = 1, . . . , k, where the unknown π (di ) can be consistently estimated by the sample proportion pi = ri /ni . Thus, using the classical weighted least squares estimation (WLSE) methodology, we may consider the quadratic norm Q(θ , β) =
k i=1
and minimize this with respect to θ , β to obtain the WLS estimators. Although the logit transformation brings the relevance of generalized linear models (GLM), the unknown nature of their variance functions makes the WLSE approach more appropriate for the suggested statistical analysis. In any case, the asymptotic flavor should not be overlooked. If π (x) is taken as the standard normal distribution function (x), whose density function is denoted by φ(x), then we may consider the transformation ri , i = 1, . . . , k, (20) Zi = −1 ni known as the probit or normit transformation. Here also, it would be better to modify Zi as Zi =
−1
(ni + 12 )
,
i = 1, . . . , k. (21)
Note that by assumption, −1 (π (di )) = θ + βdi , i = 1, . . . , k, and this provides the intuitive appeal for a conventional linear regression analysis. However, the likelihood approach based on the product-binomial law encounters computational difficulties and loses its exactness of distribution theory to a greater extent. Here also, we would have complications in the computation of the exact mean, variance, or distribution of the Zi , and hence, as in the logit model, we consider a WLSE approach in an asymptotic setup where the ni are large. By virtue of √ ni (pi − the asymptotic normality of the π (di ))(where again we take pi = (ri + 3/8)/(ni + 1/2)), we obtain that for every i ≥ 1, √
ni [Zi − θ − βdi ]
D
→ N(0,
π (di )[1 − π (di )] ), φ 2 ( −1 (π (di )))
(22)
so that we consider quadratic norm in a WLSE formulation Q(θ , β) =
ni pi (1 − pi ){Zi − θ − βdi }2 , (19)
(ri + 38 )
k ni φ 2 ( −1 (pi )) i=1
pi (1 − pi )
[Zi − θ − βdi ]2 , (23)
BIOLOGICAL ASSAY, OVERVIEW
and minimizing this with respect to θ , β, we arrive at the desired estimators. For both the logit and probit models, the resulting estimators of θ , β are linear functions of the Zi with coefficients depending on the ni and the pi . Therefore, the asymptotic normality and other properties follow by standard statistical methodology. Moreover the (asymptotic) dispersion matrix of these estimators, in either setup, can be consistently estimated from the observational data sets. Thus, we have the access to incorporate standard asymptotics to draw statistical conclusions based on these estimators. Let us then consider the case of quantal bioassays involving two preparations (S and T), and for each preparation, we have a setup similar to the single preparation case treated above. The related parameters are denoted by θ S , β S and θ T , β T respectively, and for modeling the response distributions, we may consider either the logit or probit model, as has been discussed earlier. If we have a parallel-line assay, as in the case of an indirect assay, we have then βT = βS = β unknown, and θT − θs = β log ρ, (24) so that based on the estimates θˆS , βˆS , θˆT and βˆT ,along with their estimated dispersion matrix, we can incorporate the WLSE to estimate the common slope β and the intercepts θ S and θ T . The rest of the statistical analysis is similar to the case of indirect assays. Moreover, this WLSE methodology is asymptotically equivalent to the classical likelihood-function–based methodology, so it can be regarded, computationally, as a simpler substitute for a comparatively complicated one. For a slope-ratio assay, we have similarly a common intercept while the ratio of the slopes provide the measure of the relative potency, and hence, the WLSE based on the individual preparation estimators can be adopted under this restriction to carryout the statistical analysis as in the case of an indirect assay. Besides the logit or probit method, there are some other quasi-nonparametric methods, of rather an ad hoc nature, and among these, we may mention of the following estimators of the median effective dosage:
9
¨ 1. The Spearman–Karber estimator; 2. The Reed–Muench estimator, and 3. The Dragstedt–Behrens estimator. These procedures are discussed in (7), p. 43. If the tolerance distribution is symmetric, ¨ the Spearman–Karber estimator estimates the median effective dosage closely; otherwise, it may estimate some other characteristic of this distribution. Miller 13 studied the relative (asymptotic) performance of these three estimators, casting light on their bias terms as well. From a practical point of view, none of these estimators appears to be very suitable. Rather, if the π (di ) do not belong to the extreme tails (i.e. are not too small or close to 1), the logit transformation provides a robust and computationally simpler alternative, and is being used more and more in statistical applications. In passing, we may remark that Finney [7, Chapter 10] contains some other techniques that incorporate modifications in the setup of usual quantal assays, such as the numbers ni being unknown and possibly random, multiple (instead of binary) classifications, errors in the doses. In the following chapter, he also introduced the case of doses in mixtures that require a somewhat extended model and more complex statistical designs and analysis schemes. We shall comment on these below. 4 STOCHASTIC APPROXIMATION IN BIOASSAY In the context of a quantal assay, we have the dosage-response model in terms of the tolerance distribution π (d), and the median effective (lethal) dosage, LD50, is defined by the implicit equation π (LD50) = 0.50. In this context, for each preparation (standard or test), corresponding to initial dosage levels d1 , . . . , dk , we have estimates p(d1 ), . . . , p(dk ) of the unknown π (d1 ), . . . , π (dk ). We may set pi = π (di ) + e(di ), i = 1, . . . , k,
(25)
where the errors are (for large ni , the number of subjects treated) closely normally distributed with zero mean and variance n−1 i π (di )[1 − π (di )].Onthe basis of this initial response data, we can choose an appropriate
10
BIOLOGICAL ASSAY, OVERVIEW
do for which the corresponding p(do ) is closest to 1/2. Then, we let d(1) = do + ao [p(do ) − 1/2], for some ao > 0, and recursively we set 1 , d(j+1) = d(j) + aj p(d(j) ) − 2 for some aj > 0; j ≥ 0. (26) The aim of this stochastic approximation procedure, due to Robbins and Monro 15, is to estimate the LD50 without making an explicit assumption on the form of the tolerance distribution π (d). But in this setup, the p(d(j) ) as well as the d(j) are stochastic elements, and for the convergence of this stochastic iteration procedure, naturally, some regularity conditions are needed on the {ai ;i ≥ 0} and π (d) around the LD50. First of all, in order that the iteration scheme terminates with a consistent estimator of the LD50, it is necessary that the ai converge to zero as i increases. More precisely, it is assumed in this context that an diverges to + ∞, but a2n < +∞. n≥0
n≥0
in many toxicologic studies where a turn occurs at an unknown level. Often this is treated in a general change-point model framework. The main advantage of the stochastic approximation approach over the classical quantal assay approach is that no specific assumption is generally needed on the probability function π (d), so that the derived statistical conclusions remain applicable in a much wider setup. On the other hand, the stochastic iteration scheme generally entails a larger number of subjects on which to administer the study, and often that may run contrary to the practicable experimental setups, especially with respect to cost considerations. In this general context, a significant amount of methodological research work has been carried out during the past 40 years, and an extensive review of the literature on stochastic approximation is made by Ruppert (16) where the relevant bibliography has also been cited. The scope of stochastic approximation schemes is by no means confined to quantal assays; they are also usable for quantitative bioassays, and even to other problems cropping up in far more general setups.
(27) In addition, the continuity and positivity of the density function corresponding to the distribution function π (x) at the population LD50 is also a part of the regularity assumptions. Further assumptions are needed to provide suitable (stochastic) rates of convergence of the estimator of the LD50 and its asymptotic normality and related large sample distributional properties. Once the LD50 values are estimated for each preparation, we may proceed as in the case of a quantal assay, and draw conclusions about the relative potency and other related characteristics. It is not necessary to confine attention specifically to the LD50, and any LD100α, for α ∈ (0, 1) can be treated in a similar fashion. In fact, Kiefer and Wolfowitz (12) considered an extension of the Robbins–Monro stochastic approximation procedure that is aimed to locate the maximum (or minimum) of a dose–response function that is not necessarily (piecewise or segmented) linear but is typically nonmonotone, admitting a unique extremum (maximum or minimum) of experimental importance. Such dose–response regressions arise
5 RADIOIMMUNOASSAY In radioimmunoassays antigens are labeled with radioisotopes, and in immunoradiometric assays antibodies are labeled. For a broad range of antigens, such radioligand assays enable the estimation of potency from very small quantities of materials and usually with high precision. Radioligand assays are based upon records of radiation counts in a fixed time at various doses, so that potency estimation involves the relation between counts of radioactivity and dose, generally both at low levels (8). In many such studies, the regression function of the count of radioactivity on dose has been found to be satisfactorily represented by a logistic curve; however, the lower and upper asymptotes of such a curve are not necessarily equal to zero and one, but are themselves unknown parameters. This difference with the classical logistic distribution is reflected in a somewhat different form of the variance function of radiation counts. Unlike the Poisson process, the variance function may not be
equal to the mean level of the radiation counts U(d) (i.e., their expectation at a given dose level d); in many studies, it has been observed empirically that the variance function V(d) behaves like [U(d)]^λ, where λ (> 0) typically lies between 1 and 2. For this reason, the usual Poisson regression model in generalized linear models (GLM) methodology may not be universally applicable in radioimmunoassays. Moreover, such radioligand assays may not be regarded as strictly bioassays, since they may not depend upon responses measured in living organisms or tissues. However, the advent of the use of biologic markers in mutagenesis studies and in molecular genetics, particularly during the past 20 years, has extended the domain of statistical perspectives in radioligand assays to a much wider setup of investigations, and strengthened the structural similarities between radioimmunoassays and the classical bioassays. They involve statistical modeling and analysis schemes of a very similar nature, and in this sense, their relevance in a broader setup of bioassays is quite appropriate.
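The following sketch fits such a dose–response curve by weighted least squares, taking the mean count to follow a logistic curve with unknown lower and upper asymptotes and the variance to behave like [U(d)]^λ. The data, the value λ = 1.5, and the use of SciPy's curve_fit are illustrative assumptions, not part of the original text.

```python
import numpy as np
from scipy.optimize import curve_fit

def four_pl(dose, lower, upper, log_ed50, slope):
    """Four-parameter logistic mean count U(d); both asymptotes are free parameters."""
    return lower + (upper - lower) / (1.0 + np.exp(-slope * (np.log(dose) - log_ed50)))

# hypothetical calibration data: radiation counts at a dilution series of doses
dose = np.array([0.25, 0.5, 1, 2, 4, 8, 16, 32], dtype=float)
counts = np.array([180, 210, 290, 420, 610, 760, 830, 850], dtype=float)

lam = 1.5  # assumed power in the variance function V(d) = U(d)**lam, between 1 and 2

# crude iteratively reweighted fit: weights recomputed from the current mean estimate
params = np.array([counts.min(), counts.max(), np.log(2.0), 1.0])
for _ in range(5):
    sigma = np.maximum(four_pl(dose, *params), 1.0) ** (lam / 2.0)
    params, _ = curve_fit(four_pl, dose, counts, p0=params, sigma=sigma, maxfev=10000)

print(dict(zip(["lower", "upper", "log_ed50", "slope"], params.round(3))))
```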
6 DOSIMETRY AND BIOASSAY
As has been noted earlier, a dose–response model exhibits the (mathematical) relationship between an amount of exposure or treatment and the degree of a biological or health effect, generally a measure of an adverse outcome. Bioassays and clinical trials are generally used in such dose–response studies. With the recent advances in pharmacoepidemiology as well as in risk analysis, bioassays have led to another, broader domain of statistical appraisal of biological dose–response studies, known as dosimetry (typically based on animal studies). Pharmacoepidemiology rests on the basic incorporation of pharmacodynamics (PD) and pharmacokinetics (PK) in the development of so-called structure–activity relationship information (SARI). Though a PD model directly relates to a dose–response model, the PK actions of the exposure or drug need to be taken into account in the dose–response modeling. This is now done more in terms of SARI, where the structure refers to the dose factors
and activity refers to the biological reactions that follow the exposure (dose) in a specific species or organism. In a majority of cases, the target population is human, but owing to various ethical and other experimental constraints, human beings may not be usable to the full extent needed for such dose–response modeling. As such, animal studies are often used to gather good background information, which is intended for incorporation in human studies in bioassays and clinical trials. Dosimetry pertains to this objective. Dosimetry models intend to provide a general description of the uptake and distribution of inhaled (or ingested or absorbed) toxics (or compounds having adverse health effects) in the entire body system. For judgments about the human population, such dosimetric models for animal studies need to be extrapolated with a good understanding of the interspecies differences. SARI is a vital component in enhancing the statistical validity of pooling the information from various animal studies and extrapolating to the human population. Most dose–response relationships are studied through well-controlled animal bioassays with exposure or dose levels generally much higher than those typically encountered in human risk analysis. In this respect, dosimetry is directly linked to bioassay, though in dosimetry the SARI is more intensively pursued to facilitate extrapolation. Not only may PD and PK aspects vary considerably from subhuman primates to human beings, but there is also much less control over human exposure to such toxics. Also, metabolism in the human being is generally quite different from that in subhuman primates. An important element in this context is the environmental burden of disease (EBD) factor, which exhibits considerable interspecies variation as well as geopolitical variation. Hence, ignoring the SARI part, a conventional dose–response model for a subhuman primate may not be of much help in depicting a similar model for human exposure. For the same reason, conventional statistical extrapolation tools may be of very limited utility in such interspecies extrapolation problems (25). Finally, in many carcinogenicity studies, it has been observed that xenobiotic effects underlie such dose–response
relations, and this is outlined in a later section.
7 SEMIPARAMETRICS IN BIOASSAYS
The GLM methodology has been incorporated in a broad variety of statistical modeling and analysis schemes pertaining to a wide range of applications, and bioassays are no exceptions. Going back to the direct dilution assays, if we had taken both the distributions, FS and FT, as exponentials with respective means µS and µT, then the two distributions would have constant hazard rates 1/µS and 1/µT, respectively, so that the relative potency ρ is equal to the ratio of the two hazard rates. Inspired by this observation, and by the evolution of the Cox (3) proportional hazard model (PHM), research workers have attempted to relate the two survival functions SS(x) = P{XS > x} and ST(x) = P{XT > x} as

ST(x) = [SS(x)]^ρ, x ≥ 0, (28)
and interpret ρ as the relative potency of the test preparation with respect to the standard one. Though this representation enables one to import the PHM-based statistical analysis tools for the estimation of the relative potency, for distributions other than the exponential ones the interpretation of "dilution assays" may no longer be tenable under such a PHM. There is an alternative interpretation in terms of the parallelism of the two log-hazard functions, but that may not fit well with the fundamental assumption in dilution assays. For some related statistical analysis of bioassays based on GLM methodologies, we refer to (24), where indirect bioassays have also been treated in the same manner along with the classical parametrics.
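In the exponential special case described above, the relative potency reduces to a ratio of sample means, as the following minimal sketch illustrates; the tolerance data are hypothetical, and a full PHM analysis would instead fit the relation in Equation (28) directly.

```python
def relative_potency_exponential(standard_tolerances, test_tolerances):
    """Under exponential tolerance distributions, rho is the ratio of hazard rates,
    i.e. (1/mean_T)/(1/mean_S) = mean_S/mean_T, so that S_T(x) = S_S(x)**rho."""
    mean_s = sum(standard_tolerances) / len(standard_tolerances)
    mean_t = sum(test_tolerances) / len(test_tolerances)
    return mean_s / mean_t

# hypothetical direct-assay tolerances (doses needed to produce the response)
standard = [4.1, 5.6, 3.8, 6.2, 4.9, 5.3]
test = [2.0, 2.9, 1.8, 3.1, 2.4, 2.6]
print(round(relative_potency_exponential(standard, test), 2))  # about 2: the test preparation is roughly twice as potent
```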
8 NONPARAMETRICS IN BIOASSAYS
The estimators of relative potency and tests for fundamental assumptions in dilution (direct as well as indirect) assays based on rank statistics, considered earlier, marked the first incorporation of nonparametrics in biological assays. However, these may be characterized more in terms of semiparametrics, in the sense that the assumed linearity of dose–response regressions was essentially parametric in nature, while the unknown form of the underlying tolerance distribution constitutes the nonparametric component. Thus, together they form the so-called semiparametric models. It is possible to incorporate more nonparametrics in bioassays, mostly through the nonparametric regression approach. For direct dilution assays, such nonparametric procedures are quite simple in interpretation and actual formulation. We consider the log-dose transformation, so that the dosages for the test and standard preparations have the distributions FT*(x) and FS*(x), respectively, where FT*(x) = FS*(x + log ρ) for all x. If we denote the p-quantiles of FT* and FS* by QT(p) and QS(p), respectively, then we have

QS(p) − QT(p) = log ρ, for all p ∈ (0, 1), (29)

so that the well-known Q–Q plot for the two preparations results in a linear regression form, and this provides the statistical information to test for this fundamental assumption as well as to estimate the relative potency. A similar conclusion can also be drawn from a conventional P–P plot. The classical Kolmogorov–Smirnov statistics (in the two-sample case) can be used for drawing statistical conclusions, and we may refer to Rao and Littell (14) for some related work.
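A minimal sketch of this Q–Q idea follows: on the log-dose scale the quantile differences QS(p) − QT(p) are all estimates of log ρ, so their median gives a simple nonparametric estimate of the relative potency. The data and quantile levels below are hypothetical; a two-sample Kolmogorov–Smirnov statistic (e.g., scipy.stats.ks_2samp) could then be applied to the shifted samples to examine the fundamental assumption.

```python
import math
import statistics

def log_relative_potency(standard_doses, test_doses, probs=(0.25, 0.5, 0.75)):
    """Estimate log(rho) as the median of Q_S(p) - Q_T(p) on the log-dose scale."""
    log_s = sorted(math.log(x) for x in standard_doses)
    log_t = sorted(math.log(x) for x in test_doses)

    def quantile(sorted_vals, p):
        # simple empirical quantile: the order statistic at position ceil(p*n)
        return sorted_vals[max(0, math.ceil(p * len(sorted_vals)) - 1)]

    diffs = [quantile(log_s, p) - quantile(log_t, p) for p in probs]
    return statistics.median(diffs)

# hypothetical direct-assay data: test tolerances are roughly half the standard ones
standard = [4.0, 5.5, 3.9, 6.1, 5.0, 5.2, 4.6, 5.8]
test = [2.1, 2.8, 1.9, 3.0, 2.5, 2.6, 2.3, 2.9]
log_rho = log_relative_potency(standard, test)
print(round(math.exp(log_rho), 2))  # estimated relative potency, roughly 2
```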
The situation is a bit more complex with indirect assays. In the classical parametric setup, we work with the expected response at different dosages, assuming of course a linear regression. In a semiparametric approach, this linearity of the dosage–response regression is taken as a part of the basic assumption, but the distribution of the errors is allowed to be a member of a wider class, so that robust procedures based on rank or M-statistics are advocated instead of the classical WLSE. In a pure nonparametric setup, the linearity of the dosage–response regression is not taken for granted. Therefore the two dosage–response regression functions may be of quite arbitrary nature, and yet parallel in an interpretable manner. The statistical task is therefore to assess this parallelism without imposing linearity or some other parametric forms. Here
also, at a given dosage level, instead of the mean response level, we may consider the median or a p-quantile, and based on such robust estimators, we draw statistical conclusions allowing the quantile functions to be of a rather arbitrary nature. Asymptotics play a dominant role in this context, and often this may require a relatively much larger sample size. On the other hand, in terms of robustness and validity, such pure nonparametric procedures have a greater scope than parametric or semiparametric ones.

9 BIOAVAILABILITY AND BIOEQUIVALENCE MODELS

As has been explained earlier, bioequivalence trials differ from conventional bioassays, as here, generally, the active substances in the drug are the same but the differences in bioavailability reflect the form and manner of administration. Such alternative modes may therefore call for additional restraints in the statistical formulation, and because of anticipated biological equivalence, there is less emphasis on relative potency and more on general equivalence patterns. For such reasons, regulatory requirements for establishing average bioequivalence of two preparations (that are variations of an essentially common drug) relate to a verification of the following: a confidence interval for the relative potency, with confidence limits ρL and ρU, lies between two specified endpoints, say ρo and ρo′ with ρo < 1 < ρo′, with a high coverage probability (or confidence coefficient) γ. Generally, γ is chosen close to 1 (namely, 0.95), and ρo = 1/ρo′ is chosen very close to one. These requirements in turn entail a relatively large sample size, and therefore (group) sequential testing procedures are sometimes advocated (9). For general considerations underlying such bioequivalence trials, we refer to (1,11,29), where other pertinent references are cited. Generally, such statistical formulations are more complex than the ones referred to earlier. As has been mentioned earlier, the term bioassay is used in a more general form, and this is equally true for bioequivalence and bioavailability models. Kinetic measures of
bioavailability and pharmacokinetic parameters have been developed to meet the demands of such recent usage. We will illustrate this somewhat differently with pharmacogenomics, which is revolutionizing the field of bioinformatics and experiments with biological units in general.

10 PHARMACOGENOMICS IN MODERN BIOASSAYS

Following Ewens and Grant (5), we take bioinformatics to mean the emerging field of science growing from the application of mathematics, statistics, and information technology, including computers and the theory surrounding them, to the study and analysis of very large biological and, in particular, genetic data sets. Having its genesis 50 years ago (28), the field has been fueled by the immense increase in DNA data generation. An earlier interpretation of bioinformatics, with emphasis on computational biology, by Waterman (27) also merits serious consideration, while Durbin et al. (4) had a viewpoint geared toward computer algorithms along with some heuristic usage of hidden Markov models. At the current stage, gene scientists can scarcely scramble fast enough to keep up with genomics, with developments emerging at a furious rate and in astounding detail. Bioinformatics, at least at this stage, as a discipline, does not aim to lay down fundamental mathematical laws (which might not even exist in such biological diversity). However, its utility is perceived in the creation of innumerable computer graphics and algorithms that can be used to analyze the exceedingly large data sets arising in bioinformatics. In this context, naturally, data mining and statistical learning tools (under the terminology Knowledge Discovery and Data Mining, KDDM) are commonly used (10), though often in a heuristic rather than objective manner. There could be some serious drawbacks to statistical analysis based on such KDDM algorithms alone, and model selection has emerged as a challenging task in bioinformatics. Given the current status of bioinformatics as the information technology (advanced computing) based discipline of analyzing
exceedingly high dimensional data with special emphasis on genomics, and that genomics looks at the vast network of genes, over time, to determine how they interact with, manipulate, and influence biological pathways, networks, and physiology, it is quite natural to heed genetic variation (or polymorphism) in most studies involving biological units. Moreover, because of the drug–response relationship, basic in bioassay, it is natural to appraise the role of pharmacogenomics in this setup. Pharmacology is the science of drugs, including materia medica, toxicology, and therapeutics, dealing with the properties and reactions of drugs, especially in relation to their therapeutic values. In the same vein, pharmacodynamics, a branch of pharmacology, deals with reactions between drugs and living structures; pharmacokinetics relates to the study of the bodily absorption, distribution, metabolism, and excretion of drugs. In bioequivalence trials, these tools have already been recognized as fundamental. Pharmacogenetics deals with genetic variation underlying differential response to drugs as well as drug metabolism. The whole complex constitutes the discipline of pharmacogenomics. In the same way, toxicogenomics relates to the study of gene–environment interactions in disease and dysfunction, casting light on how genomes respond to environmental stress or toxics. It is conceived that there are certain genes that are associated with disease phenotype, side effects, and drug efficacy. Also, because of inherent (genetic) variations and an enormously large number of genes, as well as a very large pool of diseases and disorders, there is a genuine need for statistical methods to assess the genetic mapping of disease genes. Pharmaco-toxicogenomics is therefore destined to play a fundamental role in biological assays in the years to come.

11 COMPLEXITIES IN BIOASSAY MODELING AND ANALYSIS

There are generally other sources of variation, which may invalidate the use of standard statistical analysis schemes in bioassays to a certain extent. Among these factors, special mention may be made of the following:
1. Censoring of various types,
2. Differential/nondifferential measurement errors,
3. Stochastic compliance of dose,
4. Correlated multivariate responses, and
5. The curse of dimensionality in genomics.

It is generally assumed that censoring is of Type I (truncation of the experiment at a prefixed time point), of Type II (truncation following a certain prefixed number or proportion of responses), or random, where the censoring time and response time are assumed to be stochastically independent; moreover, the censoring is assumed to be noninformative, so that the censoring time distribution remains the same for both preparations. In actual practice, this may not be generally true, and hence the effects of departures from such assumptions on the validity and efficacy of standard statistical procedures need to be assessed. Measurement of the actual dose levels in quantal assays, or of the response levels in an indirect assay, may often be impaired to a certain extent by measurement errors. In statistical analysis, such measurement errors are usually assumed to be of either differential or nondifferential type, and appropriate statistical models and related analysis schemes depend on such assumptions. In radioimmunoassays, in dosimetric studies in pharmacokinetics, as well as in other types of study, the full amount of a prescribed dose may not go into the organ or experimental unit, and the actual consumption of the dose may be (often highly) stochastic in nature. Therefore, the dose–response regression relation may be subject to nonidentifiability and overdispersion effects. This calls for further modifications of existing models and analysis schemes. Finally, when there are multiple endpoints with possibly binary or polytomous responses, a dimension reduction for the model-based parameters becomes necessary from statistical modeling and inference perspectives. Otherwise, an enormously large sample size may be needed to handle the full parameter model adequately, and this may run contrary to the practical setup of an assay. The situation is worse when some of the responses are
quantitative while the others are quantal or at best polychotomous. These naturally introduce more model complexities and call for more complicated statistical analysis tools.
REFERENCES 1. Anderson, S. & Hauck, W. W. (1990) Considerations of individual bioequivalence, Journal of Pharmacokinetics and Biopharmaceutics 18, 259–273. 2. Bliss, C. I. (1952) The Statistics of Bioassay. Academic Press, New York. 3. Cox, D. R. (1972) Regression models and life tables (with discussion), Journal of the Royal Statistical Society B 34, 187–220. 4. Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. (1998) Biological Sequence Analysis: Probabilistic Models for Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK. 5. Ewens, W. J. & Grant, G. R. (2001) Statistical Methods in Bioinformatics: An Introduction. Springer-Verlag, New York. 6. Finney, D. J. (1964) Statistical Methods in Biological Assay, 2nd Ed. Griffin, London. 7. Finney, D. J. (1971) Probit Analysis, 3rd ed. University Press, Cambridge. 8. Finney, D. J. (1976) Radioligand assay, Biometrics 32, 721–730. 9. Gould, A. L. (1995) Group sequential extensions of a standard bioequivalence testing procedure, Journal of Pharmacokinetics and Biopharmaceutics 23, 57–86. 10. Hastie, T., Tibshirani, R. & Friedman, J. (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York. 11. Hochberg, Y. (1955) On assessing multiple equivalences with reference to bioequivalence, in Statistical Theory and Applications: Papers in Honor of H. A. David., H. N. Nagaraja, P. K. Sen & D. F. Morrison, eds. Springer-Verlag, New York, pp. 265–278.
15. Robbins, H. & Monro, S. (1951) A stochastic approximation method, Annals of Mathematical Statistics 22, 400–407. 16. Ruppert, D. (1991) Stochastic approximation, in Handbook of Sequential Analysis, B. K. Ghosh & P. K. Sen, eds. Marcel Dekker, New York, pp. 503–529. 17. Sen, P. K. (1963) On the estimation of relative potency in dilution (-direct) assays by distribution-free methods, Biometrics 19, 532–552. 18. Sen, P. K. (1964) Tests for the validity of fundamental assumption in dilution (-direct) assays, Biometrics 20, 770–784. 19. Sen, P. K. (1965) Some further applications of nonparametric methods in dilution (-direct) assays, Biometrics 21, 799–810. 20. Sen, P. K. (1968) Estimates of the regression coefficient based on Kendall’s tau, Journal of the American Statistical Association 63, 1379–1389. 21. Sen, P. K. (1971) Robust statistical procedures in problems of linear regression with special reference to quantitative bioassays, I, International Statistical Review 39, 21–38. 22. Sen, P. K. (1972) Robust statistical procedures in problems of linear regression with special reference to quantitative bioassays, II, International Statistical Review 40, 161–172. 23. Sen, P. K. (1984) Nonparametric procedures for some miscellaneous problems, in Handbook of Statistics, Vol. 4: Nonparametric Methods, P. R. Krishnaiah & P. K. Sen, eds. Elsevier, Holland, pp. 699–739. 24. Sen, P. K. (1997) An appraisal of generalized linear models in biostatistical applications, Journal of Applied Statistical Sciences 5, 69–85. 25. Sen, P. K. (2003) Structure-activity relationship information in health related environmental risk assessment, Environmetrics 14, 223–234. 26. Shorack, G. R. (1966) Graphical procedures for using distribution-free methods in the estimation of relative potency in dilution (-direct) assays, Biometrics 22, 610–619.
12. Kiefer, J. & Wolfowitz, J. (1952) Stochastic estimation of the maximum of a regression function, Annals of Mathematical Statistics 23, 462–466.
27. Waterman, M. S. (1995) Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall, Cambridge.
13. Miller, R. G., Jr. (1973) Nonparametric estimators of the mean tolerance in bioassay, Biometrika 60, 535–542.
28. Watson, J. D. & Crick, F. H. C. (1953) Genetical implications of the structure of deoxyribonucleic acid, Nature 171, 964–967.
14. Rao, P. V. & Littell, R. (1976) An estimator of relative potency, Communications in Statistics, Series A 5, 183–189.
29. Westlake, W. J. (1988) Bioavailability and bioequivalence of pharmaceutical formulations, in Biopharmaceutical Statistics for
Drug Development, K. E. Peace, ed. Marcel Dekker, New York, pp. 329–352.
FURTHER READING Cox, C. (1992) A GLM approach to quantal response models for mixtures, Biometrics 48, 911–928. Moses, L. E. (1965) Confidence limits from rank tests, Technometrics 7, 257–260. Sen, P. K. (2002) Bioinformatics: statistical perspectives and controversies, in Advances in Statistics, Combinatorics and Related Areas, C. Gulati & S. N. Mishra, eds. World Science Press, London, pp. 275–293.
BLOCKED RANDOMIZATION
DAMIAN McENTEGART Head of Statistics, Clinphone Group Ltd, Nottingham, United Kingdom
Randomization techniques can be classified as static, in that allocation is made from a randomization list that is generated before the start of the trial, or dynamic, in that the allocation depends on the current balance of allocated treatments either overall or within particular subgroups. This article deals with static randomization from a list composed of one or more blocks. Blocked randomization is the most commonly used randomization technique as judged by reports in medical journals (1) and my own experience (80% of trials on our database of over 1500 randomized trials use blocked randomization). Most theory and examples presented will relate to the case of randomization into two treatment groups with a desired equal allocation. This is for simplicity of exposition, and everything can be generalized to cover multiple treatments and/or unequal randomization ratios.
1 SIMPLE RANDOMIZATION
To understand the rationale for blocked randomization and what it entails, it is first necessary to define simple randomization as a basis for comparison. Simple randomization (alternatively called complete or unconstrained randomization) occurs when each treatment assignment is made at random, and the allocation is completely independent of previous allocations. It has the attractive property of complete unpredictability. One method of constructing a list in a trial of two treatments would be to toss a coin for each allocation. Two disadvantages of simple randomization are that the randomization may yield an imbalance in the overall numbers of subjects assigned to the treatment groups, and similarly there is no control of the balance of allocation to groups over time. Consider the issue of overall imbalance. Let NA and NB represent the numbers of subjects randomized to treatments A and B by simple randomization, respectively. Then the difference NA − NB is asymptotically normal with zero mean and variance N = NA + NB. This property can be used to derive the distribution of the difference. The probability that the absolute value of the difference exceeds D is approximately 2[1 − Φ(D/√N)], where Φ is the cumulative standard normal distribution function. For example, in a trial of 100 subjects, the 95% confidence interval for the difference is calculated as ±19.6, which we round up to 20. Thus, there is a 5% chance of a split that equals or is more extreme than 40 versus 60 or 60 versus 40. Particularly as the sample size increases, the loss of study power that is a consequence of such imbalances is minimal, but other considerations apply. Concerns may include the following: risk of low power at interim analyses, which will have smaller sample sizes than the final analysis; trial credibility to journal or regulatory reviewers; and limited trial supplies. Temporal bias may be caused by too many subjects being assigned to one group early in the study and too few at later time points. If the type of patient enrolled changes over time, then simple randomization may lead to imbalances in the baseline characteristics of the treatment groups. Such changes are entirely feasible and could be caused by investigators who wish to acquire experience of the treatment and trial before they enter the sicker subjects. Similarly, temporal trends in patient outcomes could be caused by the learning curve associated with the trial and treatments. Although theoretically the temporal trends can be accounted for in the analysis, it may not be simple in the typical multicenter trial that starts up centers at different times. For both of the above reasons, simple randomization is rarely used.
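The normal approximation above is easy to check numerically; the short sketch below reproduces the 100-subject example (the function name and the chosen threshold are illustrative, not from the original article).

```python
from math import erf, sqrt

def prob_imbalance_exceeds(n_subjects, d):
    """P(|N_A - N_B| > d) under simple 1:1 randomization of n subjects,
    using the normal approximation N_A - N_B ~ N(0, n)."""
    phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF
    return 2.0 * (1.0 - phi(d / sqrt(n_subjects)))

print(round(prob_imbalance_exceeds(100, 19.6), 3))  # about 0.05, i.e. a 60/40 split or worse
```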
2 RESTRICTED RANDOMIZATION THROUGH THE USE OF BLOCKS

The alternative to simple randomization is to impose some form of restriction on the randomization. Restricted randomization is defined as any procedure that deviates from simple randomization and controls the randomization to achieve balance between groups in terms of the overall size of the groups and/or their characteristics. The focus of this article is blocked randomization methods, which are used to balance treatment groups overall and, if needed, for time trends and prognostic factors. We will deal with two types of blocked randomization. The first is that in which the block size (or length) equals the required sample size (i.e., there is one block for the entire trial). Abel (2) terms this complete balanced randomization. We will devote a section to such blocks. The second is random permuted blocks, in which several blocks make up the randomization list. For convenience, we deal with the second type of scheme first.

2.1 Random Permuted Blocks

The permuted blocks procedure involves randomizing subjects to treatment by constructing a randomization list partitioned into blocks. In the simplest case of a constant block size with two treatments and m subjects per treatment in each block, the block size is 2m. If the list is composed of B blocks, then the total number of entries on the list is 2mB. Allocation should be made sequentially from the list in the order that subjects are randomized; otherwise the process is subject to subversion. Thus, the list enforces a perfect balance of treatment allocations after every 2m subjects have been recruited. Although this method protects against temporal trends, the constraint of return to periodic balance has the disadvantage that the allocations become increasingly predictable toward the end of each block if the past allocations and block size are known. It may lead to investigator selection bias, where the investigator may decide not to enter particular subjects in the trial to avoid their receiving treatments that are not favoured by the investigator (e.g., the
control) or delay their entry until the chances of their receiving a particular treatment (e.g., the experimental treatment) are better than average. We will refer to the topic of selection bias again. Permuted blocks are used when there is a perceived need to protect against time trends in patient characteristics and/or outcomes. Even in the perceived absence of time trends, it is often advantageous to ensure a reasonable degree of overall treatment allocation balance during the trial in case of interim analyses. An additional motivation for permuted blocks may be the need to balance treatment groups within prognostic strata.

2.2 Generation of Blocks

Two methods of generating blocks are used: the random allocation rule and the truncated binomial design. Given the block size 2m, the random allocation rule is to generate the randomization probability for each treatment in accordance with the totals of each treatment that still have to be assigned to meet the overall balance requirement. Thus, for example, if we are randomizing the sixth patient in a block of eight, and the allocations in the block to date have been three for treatment A and two for treatment B, then the patient will be randomized to receive treatment A with probability 1/3 and treatment B with probability 2/3. The random allocation rule is equivalent to sampling from the population of all possible randomization orderings with equal probability. If multiple blocks are used, then generally the blocks are generated independently (i.e., sampled with replacement). The alternative method, the truncated binomial design (3), is effected by using simple randomization until m allocations to one of the treatments have been made and then assigning the other treatment to the remaining subjects. The chance of a long run of predictable allocations at the end of truncated binomial design blocks has precluded their use in practice, and they will not be referred to again.
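A minimal sketch of the random allocation rule follows (treatment labels and block size are arbitrary). Because drawing each assignment with probability proportional to the allocations still to be made is equivalent to taking a uniformly random ordering of m copies of each treatment, the same function extends directly to more than two treatments or to unequal ratios by changing the contents of the remaining-allocation counts.

```python
import random

def random_allocation_block(m, treatments=("A", "B")):
    """Generate one permuted block of size 2m via the random allocation rule:
    at each step the next treatment is drawn with probability proportional to
    the number of its allocations still remaining in the block."""
    remaining = {t: m for t in treatments}
    block = []
    while any(remaining.values()):
        total = sum(remaining.values())
        r = random.random() * total
        for t, count in remaining.items():
            if r < count:
                block.append(t)
                remaining[t] -= 1
                break
            r -= count
    return block

random.seed(7)
print(["".join(random_allocation_block(2)) for _ in range(4)])  # four blocks of size 4, each with two A's and two B's
```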
2.3 Stratified Randomization Using Permuted Blocks

In the absence of fraud or selection bias, there will be no systematic reason for treatment imbalances on prognostic factors using
simple or blocked randomization, but chance imbalances are possible, especially in smaller trials. Kernan et al. (4) investigate the chances of an imbalance for two treatment groups on a binary prognostic factor that is present in 30% of the population. The chance that the two treatment group proportions will arithmetically differ by more than 10% is 43% for a trial of 30 subjects, 38% for a trial of 50 subjects, 27% for a trial of 100 subjects, 9% for a trial of 200 subjects, and 2% for a trial of 400 subjects. These results relate to a single factor; the chance of at least one imbalance on multiple factors is magnified. Chance imbalances on important prognostic variables reduce the precision of the treatment estimate, particularly for smaller trials. But if the prognostic variables are accounted for in the analysis model (poststratification), then the loss in precision for trials of more than 50 subjects per treatment group is slight (5). Of more importance may be the attitude of regulators to chance imbalances. European regulators require sensitivity analyses "to demonstrate that any observed positive treatment effect is not solely explained by imbalances at baselines in any of the covariates" and further note that in the case of a "very strong baseline imbalance, no adjustment may be sufficiently convincing to restore the results" (6). Thus trialists may wish to protect the credibility of their trial by balancing on a few important prognostic factors. A list of other potential reasons for achieving balance is given by McEntegart (5). With a randomization list, balancing takes place within individual strata defined as the cross-classification of prognostic factor levels. For continuous scale variables, this will involve some categorization; methods for deciding on cut-points are available (7). As the number of balancing factors and levels increases, the number of strata increases geometrically. Regulatory guidance suggests that the use of more than two or three balancing factors is rarely necessary (8). If the randomization is to be stratified, then in the absence of temporal trends, a single block (complete balanced randomization) within each stratum would be a possibility, in which case the numbers of patients to be recruited in each stratum can be fixed in advance. Fixed recruitment, however, is often undesirable
because of the implications for the duration of trial recruitment and the extra administrative burden [although interactive voice response (IVR)/web randomization provides a solution in which recruitment to each stratum is automatically closed once the stratum limit has been reached (9)]. Alternatively, where stratum sizes are not fixed in advance, permuted blocks will have to be used. Blocks are generated and assigned to each stratum, with the effect that a separate randomization list is used for each stratum. The allocation of blocks to strata is normally done at the time of generating the randomization list. If an IVR/web system is being used to perform the allocation during the trial, then this is not necessary, and blocks can be allocated dynamically as they are needed. This method can be advantageous when center or country is a stratification factor. Consider the example of stratification by gender and site. For any particular gender-by-site stratum, when the first subject with the relevant characteristics enrolls in the study, the system allocates the first free unallocated block of randomization code to that stratum. The current subject is then assigned the treatment associated with the first randomization code in the block, and the block is reserved for future subjects in this stratum. Subsequent subjects who enroll into this stratum are assigned to the next available randomization entry in the block. Once a block is completed, the next available block is reserved when another subject enrolls into that stratum. This allocation saves randomization code, which may have some small benefits in terms of checking, maintenance, and system performance. More importantly, it allows flexibility in site recruitment, including easily allowing new, unplanned sites to join the study; the system simply allocates blocks of code following the same process. When using such a process, however, it is important that the randomization (sequence) number is not revealed to the investigator, or else he can deduce the block size from the step change in the randomization numbers when allocation is made from a new block. If it is desired to reveal the randomization number, then a scrambled number that differs from the sequence number on the list should be used.
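A minimal sketch of this kind of on-demand assignment of blocks to strata is given below; the class name, block size, and stratum labels are illustrative assumptions, and a real IVR/web system would of course add pack management, audit trails, and scrambled randomization numbers.

```python
import random

class StratifiedBlockAllocator:
    """Assign permuted blocks to strata on demand, as an IVR/web system might:
    a stratum receives its first block only when its first subject enrolls."""

    def __init__(self, block_half_size, treatments=("A", "B")):
        self.m = block_half_size
        self.treatments = treatments
        self.open_blocks = {}  # stratum -> list of not-yet-used assignments

    def _new_block(self):
        # uniform random permutation of m copies of each treatment
        # (equivalent to the random allocation rule)
        block = list(self.treatments) * self.m
        random.shuffle(block)
        return block

    def randomize(self, stratum):
        queue = self.open_blocks.setdefault(stratum, [])
        if not queue:                 # stratum is new or has exhausted its block
            queue.extend(self._new_block())
        return queue.pop(0)

random.seed(11)
allocator = StratifiedBlockAllocator(block_half_size=2)
print([allocator.randomize(("female", "site 01")) for _ in range(5)])
print(allocator.randomize(("male", "site 02")))
```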
Arguably, using dynamic allocation
of blocks makes unscrupulous monitoring by the study team more difficult because it takes more work to deduce the blocks used within each stratum. In a stratified design with K strata, there are potentially K incomplete (open) blocks at the end of the trial, each with a maximum imbalance of m, giving a maximum imbalance for the trial as a whole of Km. If there are too many strata relative to the number of subjects and treatments, then the trial is subject to what has been called overstratification (10). Balance may not be achieved for the study as a whole or for the individual strata levels if multiple stratifying factors exist. Hallstrom and Davies (11) provide methodology and formulae to evaluate the risk of an aggregate imbalance combined over strata. For a two-treatment group trial, an approximate 95% confidence interval for the difference between the treatment group allocations can be calculated as ±1.96 × [K(2m + 1)/6]^1/2. Extensions to other numbers of treatment groups and unequal allocation ratios are possible, or simulation may be used. Kernan et al. (4) suggest that, to avoid the worst effects of incomplete blocks in trials with interim analyses, the number of strata for a two-treatment group trial be limited to the number of subjects divided by four times the block size; this is to allow for the smaller sample sizes at interim analyses. If the number of strata is too large, then it may be necessary to use a dynamic randomization technique such as minimization (12) or dynamic balancing (13). The merits and issues of using such techniques are discussed elsewhere (1,5). If it is desired to balance over a stratification factor at the study level in a multicenter trial, then a central randomization using IVR/web randomization will have to be used. Country is often an appropriate factor in multinational trials (6). Finally, we note that the logistics of organizing medication supplies for stratified randomization in multicenter trials are considerably simpler in IVR/web systems that separate the randomization step from the dispensing step (9). In a traditional trial, supplies are provided to sites in patient-numbered packs; normally the patient number will match the number used on the case record form used to
collect the patient data. If the trial is stratified by any factor in addition to site, then typically separate sets of patient-numbered packs are reserved for the lists used in each stratum at the site. Unless the numbers to be recruited to each stratum at each site are fixed, which rarely happens as this would slow recruitment, considerable wastage of patient supplies will occur, as packs are provided that allow for flexibility of stratum composition in recruitment.

2.4 Multicenter Trials and Blocking

To meet recruitment timelines and to aid generalization of results, most clinical trials are conducted at more than one center. Although a special case of stratification, multicenter trials deserve a separate discussion with regard to blocking because this is the most common use of blocked randomization. European regulatory guidance (6) notes that "most multicentre trials are stratified by centre (or investigator) for practical reasons or because centre (or investigator) is expected to be confounded with other known or unknown prognostic factors." This guidance reflects the ICH E9 Guidance (8), which states that "It is advisable to have a separate random scheme for each centre, i.e. to stratify by centre or to allocate several whole blocks to each centre." Most pharmaceutical industry trials are stratified by center, and this seems to be true of trials more generally. In a survey of trials reported in a 3-month period in 1997 in leading journals, Assmann et al. (14) found that 18 of the 25 trials giving sufficient detail were stratified by site. One reason for using some form of site-stratified, blocked randomization in multicenter trials that are not managed electronically arises from the logistical considerations surrounding supplies and the uncertainty surrounding the recruitment process. One way of organizing the randomization would be to construct a single block of randomization codes equal to the trial sample size and sequentially send several corresponding patient-numbered packs to each site—the exact number depending on each site's agreed recruitment target. If all sites filled their target, then randomization would be completely balanced at the study level. But sites
rarely meet their recruitment target, with a consequent high risk of substantial imbalance at the study level. To avoid these problems, randomizations for multicenter trials are usually balanced at the site level; this is most commonly achieved via a blocked randomization, which also guards against temporal trends within each site as well as simplifying the logistics. One way of stratifying by site in a trial with randomization managed electronically is to construct the site-stratified blocks dynamically by use of a balancing constraint. This method is a variant of the balanced block randomization scheme first described by Zelen (15). Consider a trial with five treatment groups where it is required to balance at the site level. In the dynamic scheme, subject to the balance constraint, subjects are allocated the treatment that corresponds to the next unused randomization code from a central unstratified randomization list with a block size of five. The balance constraint is that allocation must not cause the treatment imbalance (maximum minus minimum) at the site to exceed one. In effect, blocks of size five are being used at the site level, with the blocks being dynamically constructed considering the study-level balance. A consequence of this method is that the codes on the central list are not always allocated sequentially. The advantage of the scheme is that it allows site balancing without the risk of imbalance at the study level from too many open blocks. It is particularly relevant in studies with many treatments conducted at many sites in which the average recruitment is expected to be less than the number of treatments. If site stratification is employed when it may not otherwise have been (because of the risk of study-level imbalance), then marked supply savings can occur if a system that automatically manages site inventories is also used (16). Assuming subjects enter in a random order, all randomization sequences within any given center are possible before the trial starts, which is a perfectly valid randomization scheme, albeit that the blocks are not being selected independently and the sample space is restricted (for instance, it would not be possible to have all centers commencing the first block with the same treatment). However, the scheme technically violates the
ICH E9 Guidance (8), which states that "The next subject to be randomized into a trial should always receive the treatment corresponding to the next free number in the appropriate randomization list." Wittes (17) has argued that block sizes can be varied by clinic, with expected higher-recruiting centers having larger block sizes. This option is rarely used because of the problems of predicting the high-recruiting clinics. It has, however, been implemented on an ongoing basis in the light of actual experience within an IVR system. In this scheme, the size of block that was dynamically allocated by the system was determined by the calculated recruitment status of the center.

3 SCHEMES USING A SINGLE BLOCK FOR THE WHOLE TRIAL

If no stratification occurs, then the simplest way to overcome the problem of unequal allocation to groups is to restrict the randomization list so that the numbers in each group are equal (or in the required proportions). This is equivalent to using a single block for the whole trial.

3.1 Maximal Procedure

Abel (2) described an alternative randomization scheme that constrains the size of the within-block imbalance as follows. One specifies a measure of imbalance, I, and a maximum permitted imbalance I0; the simplest imbalance measure is the range of the treatment allocations. A randomization list of the required length n1 is then generated using simple randomization. The imbalance In is computed for each n ≤ n1; if, for some n, In > I0, the list is replaced. The process is repeated until an acceptable list is found. Berger et al. (18) extended Abel's concept through the maximal procedure, which also imposes the condition of so-called terminal balance on the list. Thus, the maximal procedure retains the desirable characteristics of the randomized permuted block design, namely minimizing imbalance and temporal bias. The permuted block procedure is characterized by its forced returns to perfect balance at the end of each block.
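The sketch below generates a single MTI-constrained list by rejection sampling in the spirit of these procedures: candidate orderings with terminal balance are drawn uniformly and kept only if the running imbalance never exceeds the permitted maximum. It is an illustrative implementation, not the exact algorithm of Abel or of Berger et al.

```python
import random

def mti_constrained_list(n_per_arm, mti, max_tries=100000):
    """Generate a two-treatment list of length 2*n_per_arm whose running
    imbalance |#A - #B| never exceeds the maximum tolerated imbalance (MTI),
    with terminal balance, by rejection sampling over balanced orderings."""
    for _ in range(max_tries):
        seq = ["A"] * n_per_arm + ["B"] * n_per_arm
        random.shuffle(seq)                      # uniform over terminally balanced orderings
        imbalance, ok = 0, True
        for t in seq:
            imbalance += 1 if t == "A" else -1
            if abs(imbalance) > mti:
                ok = False
                break
        if ok:
            return seq
    raise RuntimeError("no acceptable sequence found")

random.seed(3)
print("".join(mti_constrained_list(n_per_arm=6, mti=3)))  # one admissible sequence of 6 A's and 6 B's
```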
For a given block size, the imbalance cannot
exceed half the block size, and this maximum is called the maximum tolerated imbalance (MTI). Forced returns to periodic balance at the end of every block do not substantially help control temporal bias, yet they do potentially increase selection bias. The maximal procedure considers schedules that satisfy the constraint of never having an imbalance between treatments that exceeds the given MTI and also satisfy the constraint of perfect balance at the end of the sequence. The scheme is therefore less subject to selection bias, at the cost of a very small increase in expected imbalance if the trial does not run to completion. Sequences are generated using the random allocation rule until one is found that satisfies the MTI conditions. The technique can be used in stratified trials, but unless the numbers in the strata are fixed, there is a slight increase in the risk of imbalance at the study level. This occurs because of the slightly higher expected imbalance caused by stopping part way through the maximal block. Nevertheless, it seems a small price to pay, and the technique should be considered in any site-stratified trial where the possibility of selection bias is a concern.

3.2 Hadamard Matrix Procedure

The maximal scheme has some similarities to Bailey and Nelson's method (19). This method of generating blocks involves selecting a block at random from an array derived from the permuted columns of a Hadamard matrix (20), with a selection criterion based on the MTI at any point in the block. For instance, with a block size of 12 for a two-treatment trial, the example scheme used in the paper has an MTI of 3, and there are 44 possible blocks to choose from. If multiple blocks are selected with replacement, then the resultant scheme is valid in the sense that, if terms for treatment and block are included in the analysis model, the model-based estimator of the error variance is unbiased: its expectation over the randomization is correct no matter what the order of patient presentation. If multiple blocks are used in an unstratified trial, then we are enforcing periodic returns to balance; thus, the similarity with the maximal procedure ends here. Furthermore, the Hadamard scheme is
less flexible than the maximal scheme, more prone to selection bias, more restrictive (for example, for a trial size of 12 and an MTI of 3, there are 792 potential maximal sequences and only 44 in the Hadamard example), and harder to generate for block sizes larger than 12. It does, however, avoid the need for a permutation test in the analysis (see later).

4 USE OF UNEQUAL AND VARIABLE BLOCK SIZES

The issue of predictability and selection bias is particularly relevant in site-stratified trials that are not blinded and where the investigator knows the past treatment allocations. One strategy that is sometimes used in an attempt to reduce potential selection bias is to employ a variable block design with some degree of randomness in the block sizes. The block sizes could be chosen at random from a specified subset of block sizes or, alternatively, be determined as a random ordering of a defined number of each possible block size. The advantage of the latter is that it is possible to fix the exact number of records in the randomization list, which is not possible with the former scheme. This strategy is implicitly encouraged in the International Conference on Harmonisation (ICH) E9 Guidance on statistical principles for clinical trials (8), which states that:

Care should be taken to choose block lengths that are sufficiently short to limit possible imbalance, but that are long enough to avoid predictability towards the end of the sequence in a block. Investigators and other relevant staff should generally be blind to the block length; the use of two or more block lengths, randomly selected for each block, can achieve the same purpose. (Theoretically, in a double-blind trial predictability does not matter, but the pharmacological effects of drugs may provide the opportunity for intelligent guesswork.)
The considerations that surround such a strategy depend on the assumed type of selection bias. Under the Blackwell-Hodges model (3), the experimenter wishes to bias the study by selecting a patient with a more favorable expected outcome when he guesses that treatment A is the next treatment to
be allocated. Conversely, the experimenter selects a patient with a poorer prognosis when he guesses that treatment B is the next treatment to be allocated. Clearly the optimal strategy for the investigator is to guess treatment A when it has been allocated least to date, to guess treatment B if it has been allocated least to date, and to guess with equal probability in the case of a tie in prior allocations. This is called the convergence strategy. Under the Blackwell–Hodges model, the use of random block sizes does not reduce the potential for or degree of selection bias, as the investigator's strategy and expected successes remain the same even if he is unmasked to the block sizes. Rosenberger and Lachin (21) show that a design employing multiple block sizes has an expected bias factor associated with the average block size. So, under the Blackwell–Hodges model, no advantage exists to the use of random block sizes. Indeed, if the MTI condition is not relaxed, then variable blocking will lead to even more scope for prediction than fixed-size blocks because of the greater number of blocks and the associated return to balance on the completion of each block (18). But this argument is overstated, as one would not generally know the size of the current block. Furthermore, in my experience the MTI requirement is usually relaxed when variable blocking is used (e.g., practitioners might consider mixed blocks of 4 and 6 rather than fixed blocks of size 4). From the above, one might conclude that there is little advantage in employing variable-sized blocks. But this would not be a complete representation of the situation. Arguably, with fixed blocks, investigators often can guess the block size from customary practice based on logistical and balance considerations. For example, in a site-stratified trial, supplies may be sent to sites in single blocks (22,23). Also, overall study-level balance considerations may dictate against the use of larger block sizes; in a site-stratified trial of two treatments, the block size is almost always chosen to be four, and investigators are aware of this. In this case, the potential for selection bias is a concern as the investigator can determine the tail allocations of the block with complete certainty. Matts and Lachin (24) formally investigate
the bias properties in this situation and show that the expected number of treatments that are completely predictable within a block of known length 2m containing m allocations to each of two treatments is 2m/(m + 1), that is, a proportion 1/(m + 1) of the block. They investigate the scenario in which the sequence of block sizes is masked but the sizes employed are unmasked. As might be expected, the use of variable block sizes will substantially reduce the potential for prediction with certainty in this scenario. Dupin-Spriet et al. (25) quantify the degree of prediction with certainty in trials with more than two arms and in trials with an unequal allocation ratio. For example, in a three-arm trial with balanced blocks of size 6, 20% of treatment assignments are predictable. In a later paper (26), the same authors devise methods to quantify prediction with certainty in the case of series of two or three unequal-sized blocks for schemes that they claim are in common use (e.g., a fixed number of subjects per center of 8, 12, or 30). For instance, in the case of two blocks, the calculations quantify the probability of identifying a long block when it comes before a short one, if it starts with a sequence incompatible with the content of a short block. Both the situation of known block lengths with concealed order and that of concealed block lengths and order are considered. The results are compared against the predictability of the maximal method in the cases where the MTI is known and unknown. For a blocked scheme, if the details of the scheme (block lengths and order) can be assumed to be unknown, then there is actually no reduction in predictability with certainty when using unequal-length blocks. The best scheme is always the one with the shortest possible block length, as the unknown ordering of lengths decreases opportunities for deductions about the composition of the blocks. But of course, it is precisely this scheme that is the most likely to be guessed. Furthermore, the scenarios of fixed recruitment per center are relatively uncommon. So although it is interesting to consult this reference when devising a scheme, the use of variable blocking should not be dismissed solely on the basis of this reference.
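As a quick numerical check of the forced-assignment argument above, the following simulation counts, for blocks generated by the random allocation rule, how many assignments in a block of size 2m are predictable with certainty because one arm has already been exhausted; the simulation itself is illustrative and not taken from the cited papers.

```python
import random

def expected_forced_assignments(m, n_sim=20000):
    """Monte Carlo estimate of the expected number of assignments in a block of
    size 2m that are predictable with certainty (one arm already exhausted)."""
    total = 0
    for _ in range(n_sim):
        block = ["A"] * m + ["B"] * m
        random.shuffle(block)
        a_used = b_used = 0
        for t in block:
            if a_used == m or b_used == m:   # all remaining assignments are forced
                total += 1
            a_used += t == "A"
            b_used += t == "B"
    return total / n_sim

random.seed(5)
for m in (2, 3, 4):
    print(m, round(expected_forced_assignments(m), 2), round(2 * m / (m + 1), 2))  # simulated vs. 2m/(m+1)
```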
In summary, the ICH suggestion to use blocks of unequal size is not as straightforward as it might first seem. But it is to be encouraged in open-label trials (or blinded trials that are subject to guesswork) where, for logistical or other reasons, it is felt necessary to stratify at the site level. It is most relevant when trial designers can be persuaded to use a higher MTI than they would otherwise have considered or where there is no requirement for a fixed number of randomization numbers at each site. In the absence of the latter requirement, block sizes can truly be chosen at random from the specified set of allowable block sizes, which provides less scope for deduction.

4.1 Mixed Randomization

The notion of variable blocking has been extended by Schulz and Grimes (27), who recommend their method of mixed randomization for non-double-blind trials. The mixed randomization procedure begins with an uneven block generated with a maximum disparity between the treatments. This is then followed by a series of random permuted blocks with randomly varied block sizes, interspersed with a simple random sequence. If used in site-stratified multicenter trials, the sample size per center would have to be large to allow for the practicalities of the scheme, and more scope for study-level imbalances exists compared with conventional site stratification. Nevertheless, this method adds another tool to the armoury of trying to confuse the issue and may be useful. The concept of uneven blocks at the start of the sequence is one I have used on occasion.

5 INFERENCE AND ANALYSIS FOLLOWING BLOCKED RANDOMIZATION

The theory underpinning inferences from analyses following the use of blocked randomization is provided by Rosenberger and Lachin (21). On most occasions, a randomization model is used as the basis for inference. Permutation tests based on the randomization model assume that, under the null hypothesis, the set of observed patient responses is fixed irrespective of treatment assignment. Then, the observed difference
between the treatment groups depends only on the treatment assigned, and statistical testing can be based around it. Treating the observed patient outcomes as given, the chosen test statistic is computed for all possible permutations of the randomization sequence. The number of permutations yielding a test statistic equal to or more extreme than the observed test statistic is enumerated, and the probability is calculated by dividing by the total number of possible randomizations. This calculation is assumption free. As enumeration of all possible randomizations may be onerous as the sample size increases, an adequate alternative is to perform a Monte Carlo simulation of 10,000 or 100,000 randomization sequences. Actually, there is no need to perform such analyses following conventional blocked randomization, because conventional tests, such as the block-stratified Mantel–Haenszel test or an analysis of variance fitting a factor for the block effect, will provide tests that are asymptotically equivalent to the permutation test (21).
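For completeness, the sketch below shows a Monte Carlo version of such a re-randomization test for a continuous outcome, permuting assignments within each block so that the blocked structure of the actual randomization is respected; the outcomes, block size, and number of simulations are hypothetical.

```python
import random

def rerandomization_pvalue(outcomes, treatments, block_size, n_sim=10000):
    """Monte Carlo permutation test: re-randomize within each block and compare
    the observed mean difference (A minus B) with its re-randomization distribution."""
    def mean_diff(assign):
        a = [y for y, t in zip(outcomes, assign) if t == "A"]
        b = [y for y, t in zip(outcomes, assign) if t == "B"]
        return sum(a) / len(a) - sum(b) / len(b)

    observed = mean_diff(treatments)
    blocks = [list(range(i, min(i + block_size, len(outcomes))))
              for i in range(0, len(outcomes), block_size)]
    extreme = 0
    for _ in range(n_sim):
        shuffled = list(treatments)
        for idx in blocks:                      # permute assignments within each block
            sub = [shuffled[i] for i in idx]
            random.shuffle(sub)
            for i, t in zip(idx, sub):
                shuffled[i] = t
        if abs(mean_diff(shuffled)) >= abs(observed):
            extreme += 1
    return extreme / n_sim

# hypothetical outcomes from three blocks of size 4 (in the order subjects were randomized)
outcomes = [5.1, 6.3, 4.8, 6.0, 5.5, 6.8, 5.0, 6.1, 5.2, 6.5, 4.9, 6.2]
treatments = list("ABBA" "BAAB" "ABAB")
print(rerandomization_pvalue(outcomes, treatments, block_size=4))
```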
design, it is appropriate to account for these factors in the analysis,'' and the European guidance (6) states that ''the primary analysis should reflect the restriction on the randomization implied by the stratification. For this reason, stratification variables—regardless of their prognostic value—should usually be included as covariates in the primary analysis.''
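As an aside, the Monte Carlo approximation to the permutation test described above can be sketched as follows. This is a minimal illustration rather than the analysis recommended by any guideline: the outcomes, treatment labels, and difference-in-means statistic are hypothetical, and a trial using blocked or stratified randomization would re-randomize within blocks rather than permuting labels freely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outcomes and treatment labels for a small two-arm trial.
outcome = np.array([4.1, 5.0, 3.8, 6.2, 5.5, 4.9, 7.0, 5.8, 6.4, 4.4, 6.1, 5.3])
treatment = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

def diff_in_means(y, t):
    return y[t == 1].mean() - y[t == 0].mean()

observed = diff_in_means(outcome, treatment)

# Monte Carlo permutation test: re-randomize the labels many times,
# holding the observed outcomes fixed, and count extreme statistics.
n_sim = 10_000
extreme = 0
for _ in range(n_sim):
    permuted = rng.permutation(treatment)
    if abs(diff_in_means(outcome, permuted)) >= abs(observed):
        extreme += 1

print(f"observed difference = {observed:.3f}")
print(f"Monte Carlo permutation p-value = {extreme / n_sim:.4f}")
```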
The European guidance recognizes the problem of low recruiting centers: if center was used in the randomization but not adjusted for in the analysis, then sponsors ''should explain why and demonstrate through well explained and justified sensitivity analyses, simulations, or other methods that the trial conclusions are not substantially affected because of this.'' But generally the intention of the regulators is clear—you should ''analyse as you randomize.''

Appropriate theory has not been developed for the other single-block methods described. Thus, to use conventional asymptotic analysis, one must appeal to the homogeneity of the patient population. For those not prepared to make this assumption, Kalish and Begg (29) argue that large time trends would need to be present before there was substantial distortion of the test characteristics; such large trends would be observable and can be accounted for in the model. This issue remains debated.

6 MISCELLANEOUS TOPICS RELATED TO BLOCKED RANDOMIZATION

6.1 Blocking in Crossover and Group Randomized Designs

In certain trials, subjects will be scheduled to receive the treatments in different orders. In these crossover trials, each subject serves as his or her own control provided that the results from each treatment administration are independent of the preceding allocations; arguably, then, no need exists to constrain the randomization in any way (i.e., a simple randomization can be used for the sequence of treatment administrations to each patient). But carryover of effects from the previous treatment can never be ruled out, so it is conventional to balance the treatment sequences (e.g., in a
two-treatment, two-period crossover trial, the two possible sequences are usually generated in blocks). In trials with three treatments, the six possible orderings are grouped in a block. In trials with more than three treatments, Williams squares are used as blocks; these are balanced Latin square designs constructed so that each treatment is preceded equally often by each other treatment (30,31).

Bellamy et al. (32) consider the situation of randomizing subjects in cohorts of a predetermined size to one of two study treatments using blocking factors, the composition of which cannot be determined before assembling the cohorts for randomization. This situation develops in trials in which subjects are to be randomized as a group, but not all the subjects attend for the visit. The randomization is effected by alternation into arbitrarily labelled groups within each stratification factor level and then random assignment of these groups to treatment.

6.2 Generation of Blocks: Software and Practical Considerations

Most computer statistical packages include pseudo-random number generators, so called because the numbers they generate are not the result of a random process but have properties similar to those generated via a random process. These numbers can be used to prepare randomized blocks using the random allocation rule, or by generating the block contents and then using the random number generator to assign block position. The SAS procedure PROC PLAN (SAS Institute, Inc., Cary, NC) provides a simple and flexible module for generating randomization lists; Deng and Graz (33) give some useful examples of generating blocked randomizations using this procedure. A variety of freely available software can also generate blocked randomizations; a recent and flexible package that overcomes the lack of flexibility of some earlier software is described by Saghaei (34).

It is important that the trial statistician reviews the generation process used for the randomization list before the trial. Ideally, the trial statistician should not generate the list. My recommended process is that the statistician reviews a ''dummy list'' that is
created by a validated software package that, based on a set of input parameters, simultaneously generates two lists using different seeds. The dummy list can be used for data set-up, testing, and quality control procedures. Procedures should be in place to ensure secure storage of, and appropriately limited access to, the list after generation. Generally, blocks will be generated independently of one another—that is, they are sampled with replacement—as this method underpins the theory for statistical inference. There is usually no need to restrict the number of possible randomization orderings unnecessarily, although marginal advantages for overall treatment balance may be obtained by ensuring a balance of block types within the list for center-stratified schemes where incomplete blocks can be expected. No real objection to this practice arises provided the details are kept away from staff involved in the monitoring of the trial.

6.3 Blocked Randomization Also Used for Drug Packaging

Trial material packaging is also done using randomized blocks. When the patient number corresponds to the randomization sequence number, as it does in a traditional trial, only one scheme is needed. If, however, the randomization and material dispensing steps are separated, as is often the case with IVR and web trials (9), then separate randomizations will have to be performed for the randomization list and the packaging list. To avoid unblinding through logical deduction, it is important that, after the randomization has identified the appropriate treatment to allocate, the selection from the set of packs available at the site is performed randomly (35). This is best achieved by use of a double-randomized pack list: a block-randomized pack list that is then scrambled to remove any association between pack number and sequence number (35).
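The generation steps discussed in Sections 6.2 and 6.3 can be sketched as follows. This is a simplified illustration rather than a validated production process: the treatment codes, block sizes, seed, and list length are arbitrary, and the ''double-randomized'' pack list is only a loose rendering of the idea in Lang et al. (35).

```python
import random

random.seed(20231)  # a fixed seed so the dummy list can be reviewed and reproduced

def blocked_list(n_subjects, treatments=("A", "B"), block_sizes=(4, 6)):
    """Permuted-block randomization list; each block size is drawn at random
    from the allowable set and treatments are balanced within every block."""
    schedule = []
    while len(schedule) < n_subjects:
        size = random.choice(block_sizes)
        block = list(treatments) * (size // len(treatments))
        random.shuffle(block)
        schedule.extend(block)
    return schedule[:n_subjects]

randomization_list = blocked_list(24)

# Pack list: a second, independently blocked treatment list whose pack numbers
# are then scrambled so that pack number carries no information about sequence.
pack_treatments = blocked_list(24)
pack_numbers = random.sample(range(1, 25), 24)
pack_list = sorted(zip(pack_numbers, pack_treatments))

print("randomization list:", randomization_list)
print("first packs       :", pack_list[:6])
```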
6.4 Documentation and Reporting

From the preceding discussions, it is clear that the more that can be kept from investigators and monitors, the better it is for study integrity. Thus, before and during the trial conduct, specifics of randomization should be kept in separate documents or appendices that are not seen by staff involved with the trial performance or conduct. Good Clinical Practice (36) requires that the protocol contain ''a description of the measures taken to minimize/avoid bias, including randomization and blinding,'' but it is not necessary to go into specific detail about the type of blocking and block sizes. In contrast, after the trial has completed, reports of the trial should contain full details of the randomization procedures. The CONSORT group statement (37) asks for details of the method used to generate the random allocation sequence and details of blocking and stratification.

REFERENCES

1. N. W. Scott, G. C. McPherson, C. R. Ramsay, and M. K. Campbell, The method of minimization for allocation to clinical trials: a review. Control. Clin. Trials 2002; 23: 662–674.
2. U. Abel, Modified replacement randomization. Statist. Med. 1987; 6: 127–135.
3. D. Blackwell and J. L. Hodges, Design for the control of selection bias. Ann. Math. Stats. 1957; 28: 449–460.
4. W. N. Kernan, C. M. Viscoli, R. W. Makuch, L. M. Brass, and R. I. Horwitz, Stratified randomization for clinical trials. J. Clin. Epidemiol. 1999; 52: 19–26.
5. D. J. McEntegart, The pursuit of balance using stratified and dynamic randomisation techniques. Drug Inf. J. 2003; 37: 293–308. Available: http://www.clinphone.com/files/Stratfied%20Dynamic%20Randomization%20Techniques.pdf.
6. Committee for Proprietary Medicinal Products, Points to Consider on Adjustment for Baseline Covariates. 2003. CPMP/EWP/283/99. Available: http://www.emea.eu.int/pdfs/human/ewp/286399en.pdf.
7. S. C. Choi, T. P. Germanson, and T. Y. Barnes, A simple measure in defining optimal strata in clinical trials. Control. Clin. Trials 1995; 16: 164–171.
8. International Conference on Harmonisation, E-9 Document, Guidance on statistical principles for clinical trials. Federal Register 1998; 63: 49583–49598. Available: http://www.fda.gov/cder/guidance/91698.pdf.
9. B. Byrom, Using IVRS in clinical trial management. Appl. Clin. Trials 2002; 10: 36–42. Available: http://www.clinphone.com/files/Using IVRS in trial%20management.pdf.
10. R. Simon, Patient subsets and variation in therapeutic efficacy. Br. J. Clin. Pharmacol. 1982; 14: 473–482.
11. A. I. Hallstrom and K. Davis, Imbalance in treatment assignments in stratified blocked randomization. Control. Clin. Trials 1988; 9: 375–382.
12. S. J. Pocock and R. Simon, Sequential treatment assignment with balancing for prognostic factors in the controlled clinical trial. Biometrics 1975; 31: 103–115.
13. S. Heritier, V. Gebski, and P. Pillai, Dynamic balancing randomization in controlled clinical trials. Stat. Med. 2005; 24: 3729–3741.
14. S. F. Assman, S. J. Pocock, L. E. Enos, and L. E. Kasten, Subgroup analysis and other (mis)uses of baseline data in clinical trials. Lancet 2000; 355: 1064–1069.
15. M. Zelen, The randomization and stratification of patients to clinical trials. J. Chronic Dis. 1974; 27: 365–375.
16. D. McEntegart and B. O'Gorman, Impact of supply logistics of different randomisation and medication management strategies used within IVRS. Pharmaceut. Engin. 2005; 36–46. Available: http://www.clinphone.com/files/ISPE%20paper1.pdf.
17. J. Wittes, Randomized treatment assignment. In: P. Armitage and T. Colton (eds.), Encyclopedia of Biostatistics. New York: John Wiley & Sons, 1998, pp. 3703–3711.
18. V. W. Berger, A. Ivanova, and M. Deloria Knoll, Minimizing predictability while retaining balance through the use of less restrictive randomization procedures. Stat. Med. 2003; 22: 3017–3028.
19. R. A. Bailey and P. R. Nelson, Hadamard randomisation: a valid restriction of random permuted blocks. Biomet. J. 2003; 45: 554–560.
20. A. S. Hedayat and W. D. Wallis, Hadamard matrices and their applications. Ann. Statist. 1978; 6: 1184–1238.
21. W. F. Rosenberger and J. M. Lachin, Randomization in Clinical Trials. New York: John Wiley & Sons, 2002.
22. S. J. Kunselman, T. J. Armstrong, T. B. Britton, and P. E. Forand, Implementing randomisation procedures in the Asthma Clinical Research Network. Control. Clin. Trials 2001; 22: 181S–195S.
23. D. McEntegart, Implementing randomization procedures in the Asthma Clinical Research Network. Control. Clin. Trials 2002; 23: 424–426.
24. J. P. Matts and J. M. Lachin, Properties of permuted-block randomization in clinical trials. Control. Clin. Trials 1988; 9: 327–344.
25. T. Dupin-Spriet, J. Fermanian, and A. Spriet, Quantification of predictability in block randomization. Drug Inf. J. 2004; 38: 135–141.
26. T. Dupin-Spriet, J. Fermanian, and A. Spriet, Quantification methods were developed for selection bias by predictability of allocations with unequal block randomization. J. Clin. Epidemiol. 2005; 58: 1269–1270.
27. K. F. Schulz and D. A. Grimes, Unequal group sizes in randomised trials: guarding against guessing. Lancet 2002; 359: 966–970.
28. J.-M. Grouin, S. Day, and J. Lewis, Adjustment for baseline covariates: an introductory note. Stat. Med. 2004; 23: 697–699.
29. L. A. Kalish and C. B. Begg, The impact of treatment allocation procedures on nominal significance levels and bias. Control. Clin. Trials 1987; 8: 121–135.
30. E. J. Williams, Experimental designs balanced for the estimation of residual effects of treatments. Australian J. Sci. Res. 1949; 2A: 149–168.
31. R. G. Newcombe, Sequentially balanced three-squares cross-over designs. Stat. Med. 1996; 15: 2143–2147.
32. S. L. Bellamy, A dynamic block-randomization algorithm for group-randomized clinical trials when the composition of blocking factor is not known in advance. Contemp. Clin. Trials 2005; 26: 469–479.
33. C. Deng and J. Graz, Generating randomisation schedules using SAS programming. SUGI 27, April 14–17, 2002 (SAS Users Group), Paper 267-27. Available: http://www2.sas.com/proceedings/sugi27/p267-27.pdf.
34. M. Saghaei, Random allocation software for parallel group randomized trials. BMC Med. Res. Methodol. 2004; 4: 26. Available: http://www.biomedcentral.com/1471-2288/4/26.
35. M. Lang, R. Wood, and D. McEntegart, Double-randomised packaging lists in trials managed by IVRS. Good Clin. Pract. J. 2005; 10–13. Available: http://www.clinphone.com/files/GCPJ%20article%20final%20Nov%202005.pdf.
36. International Conference on Harmonisation, E-6 Document, Good Clinical Practice. Federal Register 1997; 62: 25691–25709. Available: http://www.fda.gov/cder/guidance/iche6.htm.
37. CONSORT Group, D. G. Altman, et al., The revised CONSORT statement for reporting randomised trials: explanation and elaboration. Ann. Intern. Med. 2001; 134: 663–694.
FURTHER READING V. W. Berger, Selection Bias and Covariate Imbalances in Clinical Trials. New York: John Wiley & Sons, 2005.
CROSS-REFERENCES Blocking Randomization Randomization Codes Randomization Methods Randomization Procedures Randomization List Simple Randomization Stratified Randomization Interactive Voice Randomization System (IVRS)
BOOTSTRAP

TIM HESTERBERG
Insightful Corp., Seattle, Washington

1 INTRODUCTION

We begin with an example of the simplest type of bootstrapping in this section, then discuss the idea behind the bootstrap, implementation by random sampling, using the bootstrap to estimate standard error and bias, the central limit theorem and different types of bootstraps, the accuracy of the bootstrap, confidence intervals, hypothesis tests, planning clinical trials, and the number of bootstrap samples needed and ways to reduce this number; we conclude with references for additional reading.

Figure 1 shows a normal quantile plot of arsenic concentrations from 271 wells in Bangladesh, from http://www.bgs.ac.uk/arsenic/bangladesh/Data/SpecialStudyData.csv, referenced from statlib, http://lib.stat.cmu.edu/datasets. The sample mean and standard deviation are x̄ = 124.5 and s = 298, respectively. The usual formula standard error is s/√n = 18.1, and the usual 95% confidence interval x̄ ± t_(α/2,n−1) s/√n is (88.8, 160.2). This interval may be suspect because of the skewness of the data, despite the reasonably large sample size.

We may use the bootstrap for inferences for the mean of this dataset. We draw a bootstrap sample, or resample, of size n with replacement from the data, and compute the mean. We repeat this process many times, say 10^4 or more. The resulting bootstrap means comprise the bootstrap distribution, which we use to estimate aspects of the sampling distribution for X̄. Figure 2 shows a histogram and normal quantile plot of the bootstrap distribution. The bootstrap standard error is the standard deviation of the bootstrap distribution; in this case the bootstrap standard error is 18.2, which is close to the formula standard error. The mean of the bootstrap means is 124.4, which is close to x̄ (the difference is −0.047, to three decimal places). The bootstrap distribution looks normal, with some skewness. This amount of skewness is a cause for concern. This example may be counter to the intuition of many readers, who use normal probability plots to look at data. This bootstrap distribution corresponds to a sampling distribution, not raw data. This distribution is after the central limit theorem has had its one chance to work, so any deviations from normality here may translate into errors in inferences. We may quantify how badly this amount of skewness affects confidence intervals; we defer this to the section on bootstrap confidence intervals. We first discuss the idea behind the bootstrap, and give some idea of its versatility.
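The basic resampling loop just described is easy to program. The sketch below is a minimal illustration assuming only NumPy; the arsenic measurements themselves are not reproduced here, so a skewed synthetic sample of the same size stands in for them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder data: 271 skewed, positive values standing in for the arsenic sample.
x = rng.gamma(shape=0.5, scale=250.0, size=271)

B = 10_000
boot_means = np.empty(B)
for b in range(B):
    resample = rng.choice(x, size=x.size, replace=True)  # draw n values with replacement
    boot_means[b] = resample.mean()

print("sample mean             :", round(x.mean(), 2))
print("formula standard error  :", round(x.std(ddof=1) / np.sqrt(x.size), 2))
print("bootstrap standard error:", round(boot_means.std(ddof=1), 2))
print("mean of bootstrap means :", round(boot_means.mean(), 2))
```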
2 PLUG-IN PRINCIPLE
The idea behind the bootstrap is the plug-in principle (1)—if a quantity is unknown, we plug in an estimate for it. This principle is used all the time in statistics. The standard deviation of a sample mean for i.i.d. observations from a population with standard deviation σ is σ/√n; when σ is unknown, we plug in an estimate s to obtain the usual standard error s/√n. What is different in the bootstrap is that we plug in an estimate for the whole population, not just for a numerical summary of the population.

Statistical inference depends on the sampling distribution. The sampling distribution depends on the following:

1. the underlying population(s),
2. the sampling procedure, and
3. the statistic, such as X̄.

Conceptually, the sampling distribution is the result of drawing many samples from the population and calculating the statistic for each. The bootstrap principle is to plug in an estimate for the population, then mimic the real-life sampling procedure and statistic calculation.
Figure 1. Arsenic concentrations (As, ug/L) in 271 wells in Bangladesh.
Figure 2. Bootstrap distribution for arsenic concentrations.
The bootstrap distribution depends on:

1. an estimate for the population(s),
2. the sampling procedure, and
3. the statistic, such as X̄.

The simplest case is when the original data are an i.i.d. sample from a single population, and we use the empirical distribution F̂_n to estimate the population, where F̂_n(u) = (1/n) Σ I(x_i ≤ u). This gives the ordinary nonparametric bootstrap, which corresponds to drawing samples of size n with replacement from the original data.
2.1 How Useful is the Bootstrap Distribution?

A fundamental question is how well the bootstrap distribution approximates the sampling distribution. We discuss this question in greater detail in the section on accuracy of bootstrap distributions, but note a few key points here. For most common estimators (statistics that are estimates of a population parameter; e.g., X̄ is an estimator for µ, whereas a t statistic is not an estimator), and under fairly general distribution assumptions:

center: the center of the bootstrap distribution is not an accurate
approximation for the center of the sampling distribution. For example, the bootstrap distribution for X̄ is centered at approximately x̄, the mean of the sample, whereas the sampling distribution is centered at µ.

spread: the spread of the bootstrap distribution does reflect the spread of the sampling distribution.

bias: the bootstrap bias estimate (see below) does reflect the bias of the sampling distribution.

skewness: the skewness of the bootstrap distribution does reflect the skewness of the sampling distribution.

The first point bears emphasis. It means that the bootstrap is not used to get better parameter estimates, because the bootstrap distributions are centered around statistics θ̂ calculated from the data (e.g., x̄ or a regression slope β̂) rather than the unknown population values (e.g., µ or β). Drawing thousands of bootstrap observations from the original data is not like drawing observations from the underlying population; it does not create new data.
Instead, the bootstrap sampling is useful for quantifying the behavior of a parameter estimate, such as its standard error or bias, or for calculating confidence intervals. Exceptions do exist where bootstrap averages are useful for estimation, such as random forests (2). These examples are beyond the scope of this article, except that we give a toy example to illustrate the mechanism. Consider the case of simple linear regression, and suppose that a strong linear relationship exists between y and x. However, instead of using linear regression, one uses a step function: the data are split into eight equal-size groups based on x, and the y values in each group are averaged to obtain the height of the step. Applying the same procedure to bootstrap samples randomizes the location of the step edges, and averaging across the bootstrap samples smooths the edges of the steps. This is shown in Fig. 3. A similar effect holds in random forests, using bootstrap averaging of tree models, which fit higher-dimensional data using multivariate analogs of step functions.

Figure 3. Step function defined by eight equal-size groups, and average across bootstrap samples of step functions.

2.2 Other Population Estimates

Other estimates of the population may be used. For example, if there was reason to assume that the arsenic data followed a gamma distribution, we could estimate parameters for the gamma distribution, then draw
samples from a gamma distribution with those estimated parameters. In other cases, we may believe that the underlying population is continuous; rather than draw from the discrete empirical distribution, we may instead draw samples from a density estimated from the data, say a kernel density estimate. We return to this point in the section entitled ''Bootstrap distributions are too narrow.''

2.3 Other Sampling Procedures

When the original data were not obtained using an i.i.d. sample, the bootstrap sampling should reflect the actual data collection. For example, in stratified sampling applications, the bootstrap sampling should be stratified. If the original data are dependent, the bootstrap sampling should reflect the dependence; this may not be straightforward. Some cases exist where the bootstrap sampling should differ from the actual sampling procedure, including:

• regression (see ''Examples'' section),
• planning clinical trials (see ''Planning clinical trials'' section),
• hypothesis testing (see ''Hypothesis testing'' section), and
• small samples (see ''Bootstrap distributions are too narrow'').

2.4 Other Statistics

The bootstrap procedure may be used with a wide variety of statistics—mean, median, trimmed mean, regression coefficients, hazard ratio, x-intercept in a regression, and others—using the same procedure. It does not require problem-specific analytical calculations. This is a major advantage of the bootstrap. It allows statistical inferences such as confidence intervals to be calculated even for statistics for which there are no easy formulas. It offers hope of reforming statistical practice—away from simple but nonrobust estimators like a sample mean or least-squares regression estimate, in favor of robust alternatives.
3 MONTE CARLO SAMPLING—THE ''SECOND BOOTSTRAP PRINCIPLE''

The second bootstrap ''principle'' is that the bootstrap is implemented by random sampling. This is not actually a principle, but an implementation detail. Given that we are drawing i.i.d. samples of size n from the empirical distribution F̂_n, there are at most n^n possible samples. In small samples, we could create all possible bootstrap samples deterministically. In practice, n is usually too large for that to be feasible, so we use random sampling. Let B be the number of bootstrap samples used (e.g., B = 10^4). The resulting B statistic values represent a random sample of size B with replacement from the theoretical bootstrap distribution that consists of n^n values (including ties).

In some cases, we can calculate the theoretical bootstrap distribution without simulation. In the arsenic example, parametric bootstrapping from a gamma distribution causes the theoretical bootstrap distribution for the sample mean to be another gamma distribution. In other cases, we can calculate some aspects of the sampling distribution without simulation. In the case of the nonparametric bootstrap when the statistic is the sample mean, the mean and standard deviation of the theoretical bootstrap distribution are x̄ and σ̂_F̂/√n, respectively, where σ̂²_F̂ = n⁻¹ Σ (x_i − x̄)². Note that this differs from the usual sample standard deviation in using a divisor of n instead of n − 1. We return to this point in the section ''Bootstrap distributions are too narrow.''

The use of Monte Carlo sampling adds additional unwanted variability, which may be reduced by increasing the value of B. We discuss how large B should be in the section ''How many bootstrap samples are needed.''

4 BIAS AND STANDARD ERROR

Let θ = θ(F) be a parameter of a population, such as the mean, or the difference in regression
coefficients between subpopulations. Let θ̂ be the corresponding estimate from the data, θ̂* be the estimate from a bootstrap sample,

θ̄* = B⁻¹ Σ_(b=1)^B θ̂*_b

be the average of the B bootstrap estimates, and

s²_θ̂* = (B − 1)⁻¹ Σ_(b=1)^B (θ̂*_b − θ̄*)²

be the sample variance of the bootstrap estimates. Some bootstrap calculations require that θ̂ be a functional statistic, which is one that depends on the data only through the empirical distribution, not on n. A mean is a functional statistic, whereas the usual sample standard deviation s with divisor n − 1 is not—repeating each observation twice gives the same empirical distribution but a different s. The bootstrap bias estimate for a functional statistic is

θ̄* − θ̂.   (1)

Note how this estimate relates to the plug-in principle. The bias of a statistic is E(θ̂) − θ, which for a functional statistic may be expanded as E_F(θ̂) − θ(F), the expected value of θ̂ when sampling from F minus the value for population F. Substituting F̂ for the unknown F in both terms yields the theoretical bootstrap analog

E_F̂(θ̂*) − θ(F̂).   (2)

The Monte Carlo version in Equation (1) substitutes the sample average of bootstrap statistics for the expected value.

5 EXAMPLES

In this section, we consider some examples, with a particular eye to standard error, bias, and normality of the sampling distribution.

5.1 Relative Risk

A major study of the association between blood pressure and cardiovascular disease found that 55 of 3338 men with high blood pressure died of cardiovascular disease during the study period, compared with 21 out of 2676 patients with low blood pressure. The estimated relative risk is θ̂ = p̂₁/p̂₂ = 0.0165/0.0078 = 2.12. To bootstrap this, we draw samples of size n₁ = 3338 with replacement from the first group, independently draw samples of size n₂ = 2676 from the second group, and calculate the relative risk θ̂*. In addition, we record the individual proportions p̂*₁ and p̂*₂.

The bootstrap distribution for relative risk is shown in the left panel of Fig. 4. It is highly skewed, with a long right tail caused by a divisor relatively close to zero. The standard error, from 10^4 bootstrap samples, is 0.6188. The theoretical bootstrap standard error is undefined, because some of the n₁^n₁ n₂^n₂ bootstrap samples have θ̂* undefined (the denominator p̂*₂ is zero); this is not important in practice. The average of the bootstrap replicates is larger than the original relative risk, which indicates bias. The estimated bias is 2.205 − 2.100 = 0.106, which is 0.17 standard errors. Although the bias does not seem large in the figure, this amount of bias can have a huge impact on inferences; a rough calculation suggests that the actual noncoverage of one side of a two-sided 95% confidence interval would be Φ(0.17 − 1.96) = 0.0367 rather than 0.025, or 47% too large.

The right panel of Fig. 4 shows the joint bootstrap distribution of p̂*₁ and p̂*₂. Each point corresponds to one bootstrap sample, and the relative risk is the slope of a line between the origin and the point. The original data are at the intersection of the horizontal and vertical lines. The solid diagonal lines exclude 2.5% of the bootstrap observations on each side; their slopes are the endpoints of a 95% bootstrap percentile confidence interval. The bottom and top dashed diagonal lines are the endpoints of a t interval with standard error obtained using the usual delta method. This interval corresponds to calculating the standard error of residuals above and below the central line (the line with slope θ̂), going up and down 1.96 residual standard errors from the central point (the original data) to the circled points; the endpoints of the interval are the slopes of the lines from the origin to the circled points. A t interval would not be appropriate in this example because of the bias and skewness. In practice, one would normally do a t interval on a transformed statistic, for example, the log of relative risk, or the log-odds-ratio log(p̂₁(1 − p̂₂)/((1 − p̂₁)p̂₂)).

Figure 5 shows a normal quantile plot for the bootstrap distribution of the log of relative risk. The distribution is much less skewed than is relative risk, but it is still noticeably skewed. Even with a log transformation, a t interval would only be adequate for work where accuracy is not required. We discuss confidence intervals further in the section entitled ''Bootstrap confidence intervals.''

Figure 4. Bootstrap distribution for relative risk. Left panel: relative risk, with bootstrap and t confidence intervals; an outlier at 10.1 is omitted and other large observations are indicated below. Right panel: proportion in the high-risk group versus proportion in the low-risk group; slope = relative risk.

Figure 5. Bootstrap distribution for log of relative risk.
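The two-sample resampling used for the relative risk is straightforward to code. The sketch below is a minimal illustration using only the group sizes and event counts quoted above; the seed, variable names, and the decision to ignore the (very rare) resamples with a zero denominator are choices made here, not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(3)

# Event counts and group sizes from the blood pressure study quoted above.
n1, d1 = 3338, 55   # high blood pressure group
n2, d2 = 2676, 21   # low blood pressure group

group1 = np.repeat([1, 0], [d1, n1 - d1])
group2 = np.repeat([1, 0], [d2, n2 - d2])
theta_hat = group1.mean() / group2.mean()

B = 10_000
rr = np.empty(B)
for b in range(B):
    p1 = rng.choice(group1, n1, replace=True).mean()
    p2 = rng.choice(group2, n2, replace=True).mean()  # zero denominators are possible but extremely rare
    rr[b] = p1 / p2

print("estimated relative risk :", round(theta_hat, 3))
print("bootstrap standard error:", round(rr.std(ddof=1), 4))
print("bootstrap bias estimate :", round(rr.mean() - theta_hat, 4))
```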
5.2 Linear Regression

The next examples, for linear regression, are based on a dataset from a large pharmaceutical company. The response variable is a pharmacokinetic parameter of interest, and candidate predictors are weight, sex, age, and dose (3 levels—200, 400, and 800). In all, 300 observations are provided, one per subject. Our primary interest in this dataset will be to use the bootstrap to investigate the behavior of stepwise regression; however, first we consider some other issues.
A standard linear regression using main effects gives:
              Value     Std. Error   t value    Pr(>|t|)
(Intercept)   32.0819   4.2053        7.6290    0.0000
wgt            0.2394   0.0372        6.4353    0.0000
sex           −7.2192   1.2306       −5.8666    0.0000
age           −0.1507   0.0367       −4.1120    0.0000
dose           0.0003   0.0018        0.1695    0.8653
The left panel of Fig. 6 contains a scatterplot of clearance versus weight for the 25 females who received dose = 400, as well as regression lines from 30 bootstrap samples. This graph is useful for giving a rough idea of variability. A bootstrap percentile confidence interval for mean clearance given weight would be the range of the middle 95% of heights of regression lines at a given weight.

The right panel shows all 300 observations and predictions for the clearance/weight relationship using (1) all 300 observations, (2) the main-effects model, and (3) predictions for the ''base case,'' females who received dose = 400. In effect, this graph uses the full dataset to improve predictions for a subset, ''borrowing strength.'' Much less variability is observed than in the left panel, primarily because of the larger sample size and also because the addition of an important covariate (age) to the model reduces residual variance. Note that the y values here are the actual data; they are not adjusted for differences between the actual sex and age and the base case. Adjusting the male observations would raise these values. Adjusting both would make the apparent residual variation in the plot smaller, to match the residual variance from the regression.

Figure 6. Bootstrap regression lines. Left panel: 25 females receiving dose = 400. Right panel: all observations, predictions for females receiving dose = 400.

5.2.1 Prediction Intervals and Non-Normality. The right panel also hints at the difference between a confidence interval (for the mean response given covariates) and a prediction interval (for a new observation). With large n, the regression lines show little variation, but the variation of an individual point above and below the (true) line remains constant regardless of n. Hence as n increases, confidence intervals become narrow but prediction intervals do not. This is reflected in the standard formula for confidence intervals,

ŷ ± t_α s √(1/n + (x − x̄)²/S_xx),   (3)

and for prediction intervals in the simple linear regression case,

ŷ ± t_α s √(1 + 1/n + (x − x̄)²/S_xx).   (4)

As n → ∞, the terms inside the square root decrease to zero for a confidence interval but
approach 1 for a prediction interval, so the prediction interval approaches ŷ ± z_α s. Now, suppose that residuals are not normally distributed. Asymptotically, and for reasonably large n, the confidence intervals are approximately correct, but prediction intervals are not—the interval ŷ ± z_α s is only correct for normally distributed data. Prediction intervals should approach (ŷ + F̂_ε⁻¹(α/2), ŷ + F̂_ε⁻¹(1 − α/2)) as n → ∞, where F̂_ε is the estimated residual distribution. In other words, no central limit theorem exists for prediction intervals. The outcome for a new observation depends primarily on a single random value, not an average across a large sample. Equation (4) should only be used after confirming that the residual distribution is approximately normal. And, in the opinion of this author, Equation (4) should not be taught in introductory statistics to students ill-equipped to understand that it should only be used if residuals are normally distributed. A bootstrap approach that takes into account both the shape of the residual distribution and the variability in regression lines is outlined below in the subsection entitled ''Prediction Intervals.''

5.2.2 Stepwise Regression. Now consider the case of stepwise regression. We consider models that range from the intercept-only model to a full second-order model that includes all main effects, all interactions, and quadratic functions of dose, age, and weight. We use forward and backward stepwise regression, with terms added or subtracted to minimize the Cp statistic, using the step function of S-PLUS (Insightful Corp., Seattle, WA). The resulting coefficients and inferences are:
              Value     Std. Error   t value    Pr(>|t|)
(Intercept)   12.8035   14.1188       0.9068    0.3637
wgt            0.6278    0.1689       3.7181    0.0002
sex            9.2008    7.1634       1.2844    0.1980
age           −0.6583    0.2389      −2.7553    0.0055
I(age^2)       0.0052    0.0024       2.1670    0.0294
wgt:sex       −0.2077    0.0910      −2.2814    0.0218
The sex coefficient is retained even though it has a small t value because main effects are included before interactions. We use the bootstrap here to check model stability, obtain standard errors, and check for bias.

6 MODEL STABILITY

The stepwise procedure selected a six-term model. We may use the bootstrap to check the stability of the procedure under random sampling (does it consistently select the same model, or is there substantial variation?) and to observe which terms are consistently included. Here, we create bootstrap samples by resampling subjects—whole rows of the data—with replacement. We sample whole rows instead of individual values to preserve covariances between variables. In 1000 bootstrap samples, only 95 resulted in the same model as for the original data; on average, 3.2 terms differed between the original model and the bootstrap models. The original model has six terms; the bootstrap models ranged from 4 to 12, with an average of 7.9, which is 1.9 more than for the original data. This suggests that stepwise regression tends to select more terms for random data than for the corresponding population, which in turn suggests that the original six-term model may also be overfitted.

Figure 7 shows the bootstrap distributions for two coefficients: dose and sex. The dose coefficient is usually zero, although it may be positive or negative. This suggests that dose is not very important in determining clearance. The sex coefficient is bimodal, with the modes on opposite sides of zero. It turns out that the sex coefficient is usually negative when the weight–sex interaction is included, otherwise it is positive. Overall, the bootstrap suggests that the original model is not very stable.

Figure 7. Bootstrap distributions for dose and sex coefficients in stepwise regression.

For comparison, repeating the experiment with a more stringent criterion for variable inclusion—a modified Cp statistic with double the penalty—results in a more stable model. The original model has the same six
terms. In all, 154 bootstrap samples yielded the same model, and on average the number of different terms was 2.15. The average number of terms was 5.93, which is slightly less than for the original data; this number suggests that stepwise regression may now be slightly underfitting. 6.0.2.1 Standard Errors. At the end of the stepwise procedure, the table of coefficients, standard errors, and t values was calculated, ignoring the variable selection process. In particular, the standard errors are calculated under the usual regression assumptions, and assuming that the model were fixed from the outset. Call these nominal standard errors. In each bootstrap sample, we recorded the coefficients and the nominal standard errors. For the main effects the bootstrap standard errors (standard deviation of bootstrap coefficients) and average of the nominal standard errors are:
            boot SE   avg. nominal SE
Intercept   27.9008   14.0734
wgt          0.5122    0.2022
sex          9.9715    5.4250
age          0.3464    0.2137
dose         0.0229    0.0091
The bootstrap standard errors are much larger than the nominal standard errors.
This is not surprising—the bootstrap standard errors reflect additional variability because of model selection, such as the bimodal distribution for the sex coefficient. This is not to say that one should use the bootstrap standard errors here. At the end of the stepwise variable selection process, it is appropriate to condition on the model and do inferences accordingly. For example, a confidence interval for the sex coefficient should be conditional on the weight–sex interaction being included in the model. But it does suggest that the nominal standard errors may be optimistic. Indeed they are, even conditional on the model terms, because the residual standard error is biased.

6.0.2.2 Bias. Figure 8 shows bootstrap distributions for R² (unadjusted) and the residual standard deviation. Both show very large bias. The bias is not surprising—optimizing generally gives biased results. Consider ordinary linear regression: unadjusted R² is biased. If it were calculated using the true βs instead of estimated β̂s, it would not be biased; optimizing β̂ to minimize residual squared error (and maximize R²) makes unadjusted R² biased. In classic linear regression, with the model selected in advance, we commonly use adjusted R² to counteract the bias. Similarly, we use the residual variance calculated with a divisor of (n − p − 1) instead of n, where p is the number of terms in the model. But in this case, it is not only the values of the coefficients that are optimized, but also which terms are included in the model. This is not reflected in the usual formulae. As a result, the residual standard error obtained from the stepwise procedure is biased downward, even using a divisor of (n − p − 1).

Figure 8. Bootstrap distributions for R² and residual standard deviation in stepwise regression.

6.0.3 Bootstrapping Rows or Residuals. Two basic ways exist to bootstrap linear regression models—resampling rows, or resampling residuals (3). To resample residuals, we fit the initial model ŷ_i = β̂₀ + Σ_j β̂_j x_ij, calculate the residuals r_i = y_i − ŷ_i, then create new bootstrap samples as

y*_i = ŷ_i + r*_i   (5)
for i = 1, . . . , n, where r*_i is sampled with replacement from the observed residuals {r_1, . . . , r_n}. We keep the original x and ŷ values fixed to create new bootstrap y* values. Resampling rows corresponds to a random-effects sampling design, in which x and y are both obtained by random sampling from a joint population. Resampling residuals corresponds to a fixed-effects model, in which the xs are fixed by the experimental design and the ys are obtained conditional on the xs.
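The two schemes can be sketched as follows for a simple straight-line fit. This is a minimal illustration with synthetic data (the pharmacokinetic dataset is not public), using ordinary least squares via numpy.polyfit; it simply reports how the slope's bootstrap standard error comes out under each resampling scheme.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic (x, y) pairs standing in for a clearance/weight-type relationship.
n = 60
x = rng.uniform(40, 100, n)
y = 5.0 + 0.3 * x + rng.normal(0, 4, n)

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

B = 2_000
slopes_rows = np.empty(B)
slopes_resid = np.empty(B)
for b in range(B):
    # (a) resample rows: draw (x, y) pairs jointly, with replacement
    idx = rng.integers(0, n, n)
    slopes_rows[b] = np.polyfit(x[idx], y[idx], 1)[0]
    # (b) resample residuals: keep x and the fitted values fixed
    y_star = fitted + rng.choice(residuals, n, replace=True)
    slopes_resid[b] = np.polyfit(x, y_star, 1)[0]

print("bootstrap SE of slope, resampling rows     :", round(slopes_rows.std(ddof=1), 4))
print("bootstrap SE of slope, resampling residuals:", round(slopes_resid.std(ddof=1), 4))
```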
So at first glance, it would seem appropriate to resample rows when the original data collection has random xs. However, in classic statistics we commonly use inferences derived under the fixed-effects model even when the xs are actually random; we do inferences conditional on the observed x values. Similarly, in bootstrapping we may resample residuals even when the xs were originally random. In practice, the difference matters most when there are factors with rare levels, or interactions of factors with rare combinations. If resampling rows, it is possible that a bootstrap sample may contain none of the rare level or combination, in which case the corresponding term cannot be estimated and the software may give an error. Or, what is worse, there may be one or two rows with the rare level—enough that the software does not crash, but instead quietly gives answers that are imprecise because they are based on few observations. Hence, with factors with rare levels, or with small samples more generally, it may be preferable to resample residuals.

Resampling residuals implicitly assumes that the residual distribution is the same for every x and that there is no heteroskedasticity. A variation on resampling residuals that allows heteroskedasticity is the wild bootstrap, which in its simplest form adds either plus or minus the original residual r_i to each
fitted value,

y*_i = ŷ_i ± r_i,   (6)

with equal probabilities. Hence the expected value of y*_i is ŷ_i, and the standard deviation is proportional to |r_i|. For more discussion see Ref. 3. Other variations on resampling residuals exist, such as resampling studentized residuals or weighted error resampling for nonconstant variance (3).

6.0.4 Prediction Intervals. The idea of resampling residuals provides a way to obtain more accurate prediction intervals. To capture both variation in the estimated regression line and residual variation, we may resample both. Variation in the regression line may be obtained by resampling either residuals or rows to generate random β̂* values and the corresponding ŷ* = β̂*₀ + Σ_j β̂*_j x_0j for predictions at x₀. Independently, we draw random residuals r* and add them to the ŷ*. After repeating this many times, the range of the middle 95% of the (ŷ* + r*) values gives a prediction interval. For more discussion and alternatives see Ref. 3.
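A minimal sketch of this resampling-based prediction interval, again with synthetic data and numpy.polyfit, might look as follows; the choice of t-distributed errors is only there to make the point that no normality assumption is needed.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic simple-regression data with non-normal errors; x0 is the new point.
n = 80
x = rng.uniform(0, 10, n)
y = 2.0 + 1.5 * x + rng.standard_t(df=4, size=n)
x0 = 7.5

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

B = 5_000
pred = np.empty(B)
for b in range(B):
    # variation in the regression line: refit after resampling residuals
    y_star = fitted + rng.choice(residuals, n, replace=True)
    b1, b0 = np.polyfit(x, y_star, 1)
    # residual variation: add an independently drawn residual to the prediction
    pred[b] = b0 + b1 * x0 + rng.choice(residuals)

lo, hi = np.percentile(pred, [2.5, 97.5])
print(f"95% bootstrap prediction interval at x0 = {x0}: ({lo:.2f}, {hi:.2f})")
```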
6.1 Logistic Regression

In logistic regression, it is straightforward to resample rows of the data, but resampling residuals fails—the y values must be either zero or one, and adding the residual from one observation to the prediction from another yields values anywhere between −1 and 2. Instead, we keep the xs fixed and generate y values from the estimated conditional distribution given x. Let p̂_i be the predicted probability that y_i = 1 given x_i. Then

y*_i = 1 with probability p̂_i, and y*_i = 0 with probability 1 − p̂_i.   (7)

The kyphosis dataset (4) contains observations on 81 children who had corrective spinal surgery, on four variables: Kyphosis (a factor indicating whether a postoperative deformity is present), Age (in months), Number (of vertebrae involved in the operation), and Start (beginning of the range of vertebrae involved). A logistic regression using main effects gives coefficients:

              Value          Std. Error     t value
(Intercept)   −2.03693352    1.449574526    −1.405194
Age            0.01093048    0.006446256     1.695633
Start         −0.20651005    0.067698863    −3.050421
Number         0.41060119    0.224860819     1.826024

These suggest that Start is the most important predictor. The left panel of Fig. 9 shows Kyphosis versus Start, together with the predicted curve for the base case with Age = 87 (the median) and Number = 4 (the median). This graph is a sunflower plot (5,6), in which a flower with k > 2 petals represents k duplicate values. The right panel of Fig. 9 shows predictions from 20 bootstrap curves.

Figure 9. Bootstrap curves for predicted kyphosis, for Age = 87 and Number = 4.

Figure 10. Bootstrap distributions for logistic regression coefficients.

Figure 10 shows the bootstrap distributions for the four regression coefficients. All distributions are substantially non-normal. It would not be appropriate to use classic normal-based inferences. Indeed, the printout of regression coefficients above, from a standard statistical package (S-PLUS), includes t values but omits p values. Yet it would be tempting for a package user to treat the t values as coming from a t distribution; the bootstrap demonstrates that it
would be improper. The distributions are so non-normal as to make the use of standard errors doubtful. The numerical bootstrap results are:
              Observed    Mean        Bias        SE
(Intercept)   −2.03693    −2.41216    −0.375224   1.737216
Age            0.01093     0.01276     0.001827   0.008017
Start         −0.20651    −0.22991    −0.023405   0.084246
Number         0.41060     0.48335     0.072748   0.274049
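A sketch of this conditional-distribution (parametric) bootstrap is given below: keep the xs fixed, redraw each y*_i from Equation (7), and refit. It is only an illustration; the kyphosis data are replaced by a synthetic one-predictor dataset, and the logistic fit is done with a small Newton-Raphson routine rather than the S-PLUS call used in the article.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson (numpy only)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30, 30)
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)
        hessian = X.T @ (w[:, None] * X) + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    return beta

rng = np.random.default_rng(6)

# Synthetic stand-in for the kyphosis example: intercept plus one predictor.
n = 81
start = rng.integers(1, 18, n).astype(float)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(2.0 - 0.35 * start))))

X = np.column_stack([np.ones(n), start])
beta_hat = fit_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))

B = 2_000
boot = np.empty((B, X.shape[1]))
for b in range(B):
    y_star = rng.binomial(1, p_hat)      # Equation (7): xs fixed, y redrawn
    boot[b] = fit_logistic(X, y_star)

print("coefficients     :", beta_hat.round(4))
print("bootstrap SEs    :", boot.std(axis=0, ddof=1).round(4))
print("bootstrap biases :", (boot.mean(axis=0) - beta_hat).round(4))
```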
The bootstrap standard errors are larger than the classic (asymptotic) standard errors by 20–24%. The distributions are also ex-
tremely biased, with absolute bias estimates that range from 0.22 to 0.28 standard errors. These results are for the conditional distribution bootstrap, which is a kind of parametric bootstrap. Repeating the analysis with the nonparametric bootstrap (resampling observations) yields bootstrap distributions that are even longer-tailed, with larger biases and standard errors. This reinforces the conclusion that classic normal-based inferences are not appropriate here.

7 ACCURACY OF BOOTSTRAP DISTRIBUTIONS

How accurate is the bootstrap? This entails two questions:

• How accurate is the theoretical bootstrap?
• How accurately does the Monte Carlo implementation approximate the theoretical bootstrap?

We begin this section with a series of pictures intended to illustrate both questions. We conclude this section with a discussion of cases where the theoretical bootstrap is not accurate, and remedies. In the section on ''How many bootstrap samples are needed,'' we return to the question of Monte Carlo accuracy. The treatment in this section is mostly not rigorous. Much literature examines the first question rigorously and asymptotically; we reference some of that work in other sections, particularly in the section about confidence intervals, and also refer the reader to (7,8) and some sections of Ref. 3, as well as the references therein.

Figure 11 shows a population, and five samples of size 50 from the population, in the left column. The middle column shows the sampling distribution for the mean, and bootstrap distributions from each sample based on B = 1000 bootstrap samples. Each bootstrap distribution is centered at the statistic (x̄) from the corresponding sample rather than being centered at the population mean µ. The spreads and shapes of the bootstrap distributions vary a bit.

This example informs what the bootstrap distributions may be used for. The bootstrap
does not provide a better estimate of the population parameter µ, because no matter how many bootstrap samples are used, they are centered at x̄ (plus random variation), not µ. On the other hand, the bootstrap distributions are useful for estimating the spread and shape of the sampling distribution.

The right column shows five more bootstrap distributions from the first sample; the first four using B = 1000 resamples and the final using B = 10^4. These illustrate the Monte Carlo variation in the bootstrap. This variation is much smaller than the variation caused by different original samples. For many uses, such as quick and dirty estimation of standard errors or approximate confidence intervals, B = 1000 resamples is adequate. However, there is noticeable variability, particularly in the tails of the bootstrap distributions, so when accuracy matters, B = 10^4 or more samples should be used. Note the difference between using B = 1000 and B = 10^4 bootstrap samples. These correspond to drawing samples of size 1000 or 10^4 observations, with replacement, from the theoretical bootstrap distribution. Using more samples reduces random Monte Carlo variation, but it does not fundamentally change the bootstrap distribution—it still has the same approximate center, spread, and shape.

Figure 12 is similar to Fig. 11, but for a smaller sample size, n = 9 (and a different population). As before, the bootstrap distributions are centered at the corresponding sample means, but now the spreads and shapes of the bootstrap distributions vary substantially because the spreads and shapes of the samples also vary substantially. As before, the Monte Carlo variation is small, and it may be reduced using B = 10^4 or more samples.

It is useful to compare the bootstrap distributions to classic statistical inferences. With classic t intervals of the form x̄ ± t_(α/2) s/√n, the confidence interval width varies substantially in small samples as the sample standard deviation s varies. Similarly, the classic standard error s/√n varies. The bootstrap is no different in this regard—bootstrap standard errors and widths of confidence intervals for the mean are proportional to s.
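The Monte Carlo variation just described is easy to see directly: rerunning the bootstrap on one fixed sample with different seeds gives slightly different standard errors, and the disagreement shrinks as B grows. The sketch below uses an arbitrary skewed sample for illustration.

```python
import numpy as np

x = np.random.default_rng(7).exponential(scale=2.0, size=50)  # one fixed, skewed sample

def boot_se(data, B, seed):
    rng = np.random.default_rng(seed)
    means = np.array([rng.choice(data, data.size, replace=True).mean() for _ in range(B)])
    return means.std(ddof=1)

for B in (1_000, 10_000):
    ses = [boot_se(x, B, seed) for seed in range(5)]
    print(f"B = {B:>6}: bootstrap SEs over five runs:", np.round(ses, 4))
```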
Figure 11. Bootstrap distribution for mean, n = 50. The left column shows the population and five samples. The middle column shows the sampling distribution and bootstrap distributions from each sample. The right column shows five more bootstrap distributions from the first sample, with B = 1000 or B = 10^4.
Where the bootstrap does differ from classic inferences is in how it handles skewness. The bootstrap percentile interval, and the other bootstrap confidence intervals discussed in the next section, are in general asymmetrical, with asymmetry depending on the sample. They estimate the population skewness from the sample skewness. In contrast, classic t intervals assume that the population skewness is zero. In Bayesian terms, the bootstrap uses
a noninformative prior for skewness, whereas classic procedures use a prior with 100% of its mass on skewness = 0. Which method is preferred? In large samples, it is clearly the bootstrap. In small samples, the classic procedure may be preferred. If the sample size is small, then skewness cannot be estimated accurately from the sample, and it may be better to assume skewness = 0 despite the bias, rather than to use an estimate that has high variability.

Figure 12. Bootstrap distribution for mean, n = 9. The left column shows the population and five samples. The middle column shows the sampling distribution and bootstrap distributions from each sample. The right column shows five more bootstrap distributions from the first sample, with B = 1000 or B = 10^4.

Now turn to Fig. 13, where the statistic is the sample median. Here, the bootstrap distributions are poor approximations of the sampling distribution. The sampling distribution is continuous, but the bootstrap distributions are discrete, with the only possible values being values in the original sample (here n is odd). The bootstrap distributions are very sensitive to the sizes of gaps among the observations near the center of the sample. The bootstrap tends not to work well for statistics such as the median or other quantiles that depend heavily on a small number of observations out of a larger sample.
sample (here n is odd). The bootstrap distributions are very sensitive to the sizes of gaps among the observations near the center of the sample. The bootstrap tends not to work well for statistics such as the median or other quantiles that depend heavily on a small number of observations out of a larger sample.
16
BOOTSTRAP
−4
M
10
−4
M
10
Bootstrap distribution for Sample 1
Sample 1
−4
M
10
−4
M
10
M
10
−4
M
10
M
10
−4
M
10
M
10
−4
M
Sample 5
−4
M
10
M
10
−4
M
M
10
10
Bootstrap distribution 4 for Sample 1
−4
M
10
Bootstrap distribution 5 for Sample 1
−4
M
Bootstrap distribution for Sample 5
−4
10
Bootstrap distribution 3 for Sample 1
Bootstrap distribution for Sample 4
Sample 4
−4
−4
Bootstrap distribution for Sample 3
Sample 3
−4
Bootstrap distribution 2 for Sample 1
Bootstrap distribution for Sample 2
Sample 2
−4
Population median = M Sample median = m
Sampling distribution
Population
10
Bootstrap distribution 6 for Sample 1
−4
M
10
Figure 13. Bootstrap distribution for median, n = 15. The left column shows the population and five samples. The middle column shows the sampling distribution and bootstrap distributions from each sample. The right column shows five more bootstrap distributions from the first sample.
7.1 Systematic Errors in Bootstrap Distributions
We note three ways in which bootstrap distributions are systematically different from sampling distributions. First, as noted above, bootstrap distributions are centered at the statistic θ̂ (plus bias) rather than at the parameter θ (plus bias).

Second, in many applications there is a relationship between the statistic and its standard error [''acceleration'' in the terminology of Efron (9)]. For example, the standard error of a binomial proportion, √(p̂(1 − p̂)/n), depends on p̂. Similarly, when sampling from a gamma distribution, the variance of the sample mean depends on the
underlying mean. More generally when sampling the mean from positively skewed distributions, samples with larger means tend to give larger standard errors. When is acceleration occurs, the bootstrap standard error reflects the standard error ˆ not the true standard corresponding to θ, deviation of the sampling distribution (corresponding to θ ). Suppose the relationship is positive; when θˆ < θ , it tends to be true that the estimated standard error is also less than the true standard deviation of the sampling distribution, and confidence intervals tend to be too short. This is true for t intervals, whether using a formula or bootstrap standard error, and also to a lesser extent for bootstrap percentile intervals. The more accurate intervals discussed in the next section correct for acceleration. The third systematic error is that bootstrap distributions tend to be too narrow. 7.2 Bootstrap Distributions Are Too Narrow In small samples, bootstrap distributions tend to be too narrow. Consider the case of a sample mean from a single population; in this case, the standard √ theoretical bootstrap error is σˆ / n where σˆ 2 = (1/n) (xi − x)2 . In contrast to the usual sample standard deviation s, this method uses a divisor of n rather than n − 1. The reason the distributions are too narrow relates to the plug-in principle; when plugging in the empirical distribution Fˆ n for use as the population, we are drawing samples from a population with standard deviation σˆ . The result is that bootstrap standard errors are too small, by√a factor 1 − 1/n relative to the usual s/ n; the errors are about 5% too small when n = 10, about 1% too small when n = 50, and so on. In stratified bootstrap situations, the bias depends on the strata sizes rather than on the total sample size. Some easy remedies can be implemented. The first is to draw bootstrap samples of size n − 1, with replacement from the data of size n. The second, bootknife sampling (10), is a combination of jackknife and bootstrap sampling—first create a jackknife sample by omitting an observation, then draw a bootstrap sample of size n with replacement from
17
the n − 1 remaining observations. The omission can be random or systematic. A third remedy is the smoothed bootstrap. Instead of drawing random samples from the discrete distribution Fˆ n , we draw from a kernel density estimate Fˆ h (x) = n−1 ((x − xi )/h), where is the standard normal density (other densities may be used). The original motivation (11,12) was to draw samples from continuous distributions, but it can also be used to correct for narrowness (10). The variance of an observation from Fˆ h is σˆ 2 + h2 . Using h2 = s2 /n results in the theoretical bootstrap standard error matching the usual formula (10). For multidimensional data x the kernel covariance can be 1/n times the empirical covariance matrix. For non-normal data, it may be appropriate to smooth on a transformed scale; for example, for failure time data, take a log transform of the failure times, add normal noise, then transform it back to the original scale. 8
8 BOOTSTRAP CONFIDENCE INTERVALS
Several bootstrap confidence intervals have been proposed in the literature. Reviews of confidence intervals are found in Refs. 13–15. Here we focus on five examples: t intervals with either bootstrap or formula standard errors, bootstrap percentile intervals (16), bootstrap t intervals (17), bootstrap BCa intervals (9), and bootstrap tilting (18–21). Note that "t intervals with bootstrap standard errors" and "bootstrap t intervals" are different.
Percentile and t intervals are quick-and-dirty intervals that are relatively simple to compute, but they are not very accurate except for very large samples. They do not properly account for factors such as bias, acceleration, or transformations. They are first-order correct: under fairly general circumstances (basically, for asymptotically normal statistics), the one-sided noncoverage levels for nominal (1 − α) intervals are α/2 + O(1/√n). The O(1/√n) errors decrease to zero very slowly. The BCa, tilting, and bootstrap t intervals are second-order correct, with coverage errors O(1/n).
The percentile, BCa, and tilting intervals are transformation invariant; they give equivalent results for different transformations of a statistic, for example, hazard ratio and log-hazard ratio, or relative risk and log relative risk. t intervals are not transformation invariant. Bootstrap t intervals are less sensitive to transformations than are t intervals; the use of different (smooth) transformations has coverage effects of order O(1/n), compared with O(1/√n) for t intervals.
Our focus is on one-sided errors because few practical situations are truly two-sided. A nominal 95% interval that misses 2% of the time on the left and 3% of the time on the right should not be considered satisfactory. It is a biased confidence interval; both endpoints are too low, so it gives a biased impression about where the true parameter may be. The appropriate way to add one-sided coverage errors is by adding their absolute values, so the 2%/3% interval has a total coverage error of |2 − 2.5| + |3 − 2.5| = 1%, not 0%.

8.1 t Intervals

A t interval is of the form
θ̂ ± t_{α/2,ν} s_θ̂    (8)
where s_θ̂ is a standard error computed using a formula or using the bootstrap, and ν is the degrees of freedom, typically set to n − 1 (although other values would be better for non-normal distributions).
The bootstrap standard error may be computed using the techniques in the section "Bootstrap Distributions Are Too Narrow": bootknife sampling, sampling with reduced size, or the smoothed bootstrap. This results in slightly wider intervals that are usually more accurate in practice. These techniques have an O(1/n) effect on one-sided coverage errors, which is unimportant for large samples but is important in small samples. For example, for a sample of independent identically distributed observations from a normal distribution, a nominal 95% t interval for the mean using a bootstrap standard error without these corrections would have the following one-sided coverage errors:
n      Non-coverage   Error
10     0.0302         0.0052
20     0.0277         0.0027
40     0.0264         0.0014
100    0.0256         0.0006
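As a rough sketch of Equation (8) with a bootstrap standard error (hypothetical data; the choices of B and the exponential sample are assumptions):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.exponential(scale=10.0, size=30)   # hypothetical data
n, B, alpha = len(x), 10000, 0.05

# Bootstrap standard error of the mean (ordinary resampling).
boot_means = np.array([rng.choice(x, size=n, replace=True).mean()
                       for _ in range(B)])
se_boot = boot_means.std(ddof=1)

# t interval of Equation (8), with nu = n - 1 degrees of freedom.
tcrit = stats.t.ppf(1 - alpha / 2, df=n - 1)
print(x.mean() - tcrit * se_boot, x.mean() + tcrit * se_boot)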
8.2 Percentile Intervals

In its simplest form, a 95% bootstrap percentile interval is the range of the middle 95% of a bootstrap distribution. More formally, bootstrap percentile intervals are of the form
(Ĝ⁻¹(α/2), Ĝ⁻¹(1 − α/2))    (9)
where Ĝ is the estimated bootstrap distribution of θ̂*.
Variations are possible that improve finite-sample performance. These have received little attention in the bootstrap literature, which tends to focus on asymptotic properties. In particular, the simple bootstrap percentile intervals tend to be too narrow, and the variations give wider intervals with better coverage.
First, the bootknife or other techniques cited previously may be used. Second, the percentiles may be adjusted. In a simple situation, like the sample mean from a symmetric distribution, the interval is similar to the t interval in Equation (8) but using quantiles of a normal distribution rather than a t distribution, z_{α/2} rather than t_{α/2,n−1}. As a result, the interval tends to be too narrow. A correction is to adjust the quantiles based on the difference between a normal and a t distribution,
(Ĝ⁻¹(α′/2), Ĝ⁻¹(1 − α′/2))    (10)
where Φ⁻¹(α′/2) = F⁻¹_{t,n−1}(α/2), Φ is the standard normal distribution, and F_{t,n−1} is the t distribution function with n − 1 degrees of freedom. This gives wider intervals. Extensive simulations (not included here) show that this gives smaller coverage errors in practice, in a wide variety of applications. The effect on coverage errors is O(1/n), which is the same order as the
bootknife adjustment, but the magnitude of the effect is larger; for example, the errors caused by using z rather than t quantiles in a standard t interval for a normal population are as follows:
n      Non-coverage   Error
10     0.0408         0.0158
20     0.0324         0.0074
40     0.0286         0.0036
100    0.0264         0.0014
A third variation relates to how quantiles are calculated for a finite number B of bootstrap samples. Hyndman and Fan (22) give a family of definitions of quantiles for finite samples, governed by a parameter 0 ≤ δ ≤ 1: the bth order statistic θ̂*_(b) is the (b − δ)/(B + 1 − 2δ) quantile of the bootstrap distribution, for b = 1, . . . , B. Linear interpolation between adjacent bootstrap statistics is used if the desired quantile is not of the form (b − δ)/(B + 1 − 2δ) for some integer b. For bootstrap confidence intervals, δ = 0 is preferred, as other choices result in lower coverage probability. The effect on coverage errors is O(1/B).
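The percentile intervals of Equations (9) and (10) can be sketched as follows; the data are hypothetical, and the δ = 0 quantile rule is obtained with NumPy's "weibull" method (NumPy 1.22 or later).

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.exponential(scale=10.0, size=30)   # hypothetical data
n, B, alpha = len(x), 100000, 0.05

boot = np.array([rng.choice(x, size=n, replace=True).mean() for _ in range(B)])

# Simple percentile interval, Equation (9), with the delta = 0 quantile
# definition (Hyndman-Fan type 6, method="weibull").
simple = np.quantile(boot, [alpha / 2, 1 - alpha / 2], method="weibull")

# Expanded percentile interval, Equation (10): adjust alpha so that the
# implicit normal quantile matches the t quantile with n - 1 df.
alpha_prime = 2 * stats.norm.cdf(stats.t.ppf(alpha / 2, df=n - 1))
expanded = np.quantile(boot, [alpha_prime / 2, 1 - alpha_prime / 2],
                       method="weibull")
print(simple, expanded)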
8.3 Bootstrap t

The difference between t intervals (possibly using bootstrap standard errors) and bootstrap t intervals is that the former assume that a t statistic follows a t distribution, whereas the latter estimate the actual distribution using the bootstrap. Let
t = (θ̂ − θ) / s_θ̂    (11)
be the t statistic. Under certain conditions, the t statistic follows a t distribution. Those conditions are rarely met in practice. The bootstrap analog of t is
t* = (θ̂* − θ̂) / s_θ̂*    (12)
The standard error may be calculated either by formula or by bootstrap sampling; in the latter case, calculating each s_θ̂* requires a second level of bootstrap sampling, with second-level bootstrap samples drawn from each first-level bootstrap sample.
Figure 14 shows the bootstrap distribution for the t statistic for mean arsenic concentration, where t is the ordinary t statistic (x̄ − µ)/(s/√n). In contrast to Fig. 2, in which the bootstrap distribution for the mean is positively skewed, the distribution for the t statistic is negatively skewed. The reason is that there is positive correlation between x̄* and s*, as seen in the right panel of Fig. 14, so that a negative numerator in Equation (12) tends to occur with a small denominator.
The bootstrap t interval is based on the identity
P(G_t⁻¹(α/2) < (θ̂ − θ)/s_θ̂ < G_t⁻¹(1 − α/2)) = 1 − α    (13)
where G_t is the sampling distribution of t [Equation (11)]. Assuming that t* [Equation (12)] has approximately the same distribution as t, we substitute quantiles of the bootstrap distribution for t*; then solving for θ yields the bootstrap t interval
(θ̂ − G_{t*}⁻¹(1 − α/2) s_θ̂, θ̂ − G_{t*}⁻¹(α/2) s_θ̂)    (14)
Note that the right tail of the bootstrap distribution of t* is used in computing the left side of the confidence interval, and conversely. The bootstrap t and other intervals for the mean arsenic concentration example described in the introduction are shown in Table 1.
It is not appropriate to use bootknife or the other sampling methods of the section "Bootstrap Distributions Are Too Narrow" with the bootstrap t. The reason we use those methods with the other intervals is that those intervals are too narrow if the plug-in population is narrower, on average, than the parent population. The sampling distribution of a t statistic, in contrast, is invariant under changes in the scale of the parent population. This gives it an automatic correction for the plug-in population being too narrow, and adding bootknife sampling would overcorrect.
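A minimal sketch of the bootstrap t interval of Equation (14) for a sample mean, where a formula standard error is available so no second level of resampling is needed (hypothetical data):

import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=10.0, size=30)   # hypothetical data
n, B, alpha = len(x), 5000, 0.05

theta_hat = x.mean()
se_hat = x.std(ddof=1) / np.sqrt(n)        # formula standard error

# Bootstrap distribution of t* = (theta* - theta_hat)/se*, Equation (12).
t_star = np.empty(B)
for b in range(B):
    xb = rng.choice(x, size=n, replace=True)
    t_star[b] = (xb.mean() - theta_hat) / (xb.std(ddof=1) / np.sqrt(n))

# Bootstrap t interval, Equation (14): note the crossed tails.
q_lo, q_hi = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])
print(theta_hat - q_hi * se_hat, theta_hat - q_lo * se_hat)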
Figure 14. Bootstrap distribution for t statistic, and relationship between bootstrap means and standard deviations, of arsenic concentrations.

Table 1. Confidence Intervals for Mean Arsenic Concentration, Based on 100,000 Bootstrap Samples, Using Ordinary Nonparametric and Bootknife Resampling

                      95% Interval     Asymmetry
Formula
  t                   (88.8, 160.2)    ±35.7
Ordinary bootstrap
  t w boot SE         (88.7, 160.2)    ±35.8
  percentile          (91.5, 162.4)    (−33.0, 38.0)
  bootstrap t         (94.4, 172.6)    (−30.1, 48.1)
  BCa                 (95.2, 169.1)    (−29.3, 44.6)
  tilting             (95.2, 169.4)    (−29.3, 44.9)
Bootknife
  t w boot SE         (88.7, 160.3)    ±35.8
  percentile          (91.5, 162.6)    (−32.9, 38.1)
  BCa                 (95.4, 169.3)    (−29.1, 44.8)
  tilting             (95.2, 169.4)    (−29.3, 45.0)
The ‘‘asymmetry’’ column is obtained by subtracting the observed mean. The ‘‘t w boot SE’’ interval is a t interval using a bootstrap standard error.
8.4 BCa Intervals

The bootstrap BCa (bias-corrected, accelerated) intervals are quantiles of the bootstrap distribution, like the percentile interval, but with the percentiles adjusted depending on a bias parameter z₀ and an acceleration parameter a. The interval is
(Ĝ⁻¹(p(α/2)), Ĝ⁻¹(p(1 − α/2)))    (15)
where
p(c) = Φ( z₀ + (z₀ + Φ⁻¹(c)) / (1 − a(z₀ + Φ⁻¹(c))) )    (16)
is the adjusted probability level for quantiles; it simplifies to c when z₀ = a = 0.
The BCa interval is derived by assuming a smooth transformation h exists such that
h(θ̂) ∼ N(h(θ) + z₀ σ_h, σ_h²)    (17)
where σ_h = 1 + a h(θ), and that the same relationship holds for bootstrap samples (substitute θ̂* for θ̂, and θ̂ for θ). Some algebra yields the BCa confidence interval. The transformation h cancels out, so it does not need to be estimated.
For the nonparametric bootstrap, the parameter z₀ is usually estimated using the fraction of bootstrap observations that fall below the original observed value,
z₀ = Φ⁻¹(#(θ̂* < θ̂)/B)    (18)
and the acceleration parameter is based on the skewness of the empirical influence function. One estimate of that skewness is obtained from jackknife samples; let θ̂_(i) be the statistic calculated from the original sample but excluding observation i, and θ̄_(·) be the average of those values; then
a = − Σ_{i=1}^{n} (θ̂_(i) − θ̄_(·))³ / ( 6 [ Σ_{i=1}^{n} (θ̂_(i) − θ̄_(·))² ]^{3/2} )    (19)
Davison and Hinkley (3) also give expressions for a in the case of stratified sampling, which includes two-sample applications.
For the arsenic data, z₀ = 0.0438 (based on 100,000 replications) and a = 0.0484. The 95% interval is then the range from the 0.0436 to 0.988 quantiles of the bootstrap distribution.
The BCa interval has greater Monte Carlo error than the ordinary percentile interval because Monte Carlo error in estimating z₀ propagates into the endpoints and because typically one of the quantiles is farther in the tail than for the percentile interval. For example, here the 98.8% quantile is used instead of the 97.5% quantile. In the best case, a = z₀ = 0, this method requires a bit more than twice as many bootstrap samples as the percentile interval for comparable Monte Carlo accuracy.
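A sketch of the BCa computation for a sample mean, following Equations (15), (16), (18), and (19); the data are hypothetical and the code is illustrative rather than the author's implementation.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.exponential(scale=10.0, size=30)   # hypothetical data
n, B, alpha = len(x), 100000, 0.05

theta_hat = x.mean()
boot = np.array([rng.choice(x, size=n, replace=True).mean() for _ in range(B)])

# Bias parameter z0, Equation (18).
z0 = stats.norm.ppf(np.mean(boot < theta_hat))

# Acceleration a from jackknife values, Equation (19).
jack = np.array([np.delete(x, i).mean() for i in range(n)])
d = jack - jack.mean()
a = -np.sum(d**3) / (6.0 * np.sum(d**2) ** 1.5)

# Adjusted quantile levels, Equation (16), then the interval of Equation (15).
def p_adj(c):
    zc = stats.norm.ppf(c)
    return stats.norm.cdf(z0 + (z0 + zc) / (1 - a * (z0 + zc)))

print(np.quantile(boot, [p_adj(alpha / 2), p_adj(1 - alpha / 2)]))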
8.5 Bootstrap Tilting Intervals

In parametric statistics, the left endpoint of a confidence interval for a parameter θ is the value θ_left such that the sampling distribution has probability 2.5% of exceeding the observed value, P_{θ_left}(θ̂* > θ̂) = 0.025. Bootstrap tilting borrows this idea.
The idea behind bootstrap tilting is to create a one-parameter family of populations that includes the empirical distribution function, and to find the member of that family that has 2.5% (or 97.5%) of the bootstrap distribution exceeding the observed value. Let the left (right) endpoint of the interval be the parameter of interest calculated from that population. The family is restricted to have support on the empirical data, with varying probabilities on the observations.
For example, given i.i.d. observations (x₁, . . . , xₙ), when the parameter of interest is the population mean, one suitable family is the exponential tilting family, which places probability
p_i = c exp(τ x_i)    (20)
on observation i, where τ is a tilting parameter and c is a normalizing constant (depending on τ) such that Σ_i p_i = 1. τ = 0 gives equal probabilities p_i = 1/n, which corresponds to the empirical distribution function, and about half of the bootstrap distribution is below the observed x̄. τ < 0 places higher probabilities on smaller observations; sampling with these probabilities is more likely to give samples with smaller observations, and smaller bootstrap means, so more of the bootstrap distribution is below x̄. We find the value of τ for which only 2.5% of the bootstrap distribution is above x̄. The left endpoint of the confidence interval is the mean of the corresponding weighted population,
θ_left = Σ_{i=1}^{n} p_i x_i.
Similarly, the right endpoint is
Σ_{i=1}^{n} p_i x_i
with the p_i computed using the τ that puts 97.5% of the bootstrap distribution to the right of x̄.
Another suitable family is the maximum likelihood family, with probability
p_i = c / (1 − τ(x_i − x̄))    (21)
on observation i.
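The brute-force version of tilting for a mean can be sketched as follows; the data are hypothetical, and the grid search over τ stands in for the importance sampling implementation described next.

import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=10.0, size=30)   # hypothetical data
n, B = len(x), 20000
xbar = x.mean()

def tilt_probs(tau):
    """Exponential tilting probabilities of Equation (20)."""
    w = np.exp(tau * x)
    return w / w.sum()

def tail_above(tau):
    """Fraction of the tau-tilted bootstrap distribution above the observed mean."""
    p = tilt_probs(tau)
    idx = rng.choice(n, size=(B, n), replace=True, p=p)
    return np.mean(x[idx].mean(axis=1) > xbar)

# Trial-and-error search over tau for the left endpoint (2.5% above xbar).
taus = np.linspace(-0.2, 0.0, 21)
tau_left = taus[np.argmin([abs(tail_above(t) - 0.025) for t in taus])]
left_endpoint = np.sum(tilt_probs(tau_left) * x)   # weighted mean
print(tau_left, left_endpoint)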
8.6 Importance Sampling Implementation

Conceptually, finding the right value of τ requires trial and error; for any given τ, we calculate p = (p₁, . . . , pₙ), draw bootstrap samples with those probabilities, calculate the bootstrap statistics, and calculate the fraction of those statistics that are above θ̂, then repeat with a different τ until the fraction is 2.5%. This method is expensive, and the fraction varies because of random sampling.
In practice, we use an importance sampling implementation. Instead of sampling with unequal probabilities, we sample with equal probabilities, then reweight the bootstrap samples by the relative likelihood of the sample under weighted and ordinary bootstrap sampling. The likelihood for a bootstrap sample is
l(x*₁, . . . , x*ₙ) = Π_i w*_i    (22)
where w*_i is the tilted probability placed on the observation drawn in position i, compared with (1/n)ⁿ for ordinary bootstrap sampling. Let w_b = Π_i w*_i / (1/n)ⁿ = Π_i (n w*_i) be the relative likelihood for bootstrap sample b. We estimate the probability by
P̂_p(θ̂* > θ̂) = B⁻¹ Σ_{b=1}^{B} w_b I(θ̂*_b > θ̂) = B⁻¹ Σ_{b∈R} w_b    (23)
where R is the subset of {1, . . . , B} with θ̂*_b > θ̂. In practice we also worry about ties, cases with θ̂*_b = θ̂; let E be the subset with θ̂*_b = θ̂. We numerically find τ to solve
0.025 B = Σ_{b∈R} w_b + (1/2) Σ_{b∈E} w_b.
Similar calculations are done for the τ used for the right endpoint; solve
0.025 B = Σ_{b∈L} w_b + (1/2) Σ_{b∈E} w_b
where L is the subset of {1, . . . , B} with θ̂*_b < θ̂. In any case, after finding τ, the endpoint of the interval is the weighted mean for the empirical distribution with probabilities calculated using τ.

8.6.1 Tilting for Nonlinear Statistics. The procedure can be generalized to statistics other than the mean using a least-favorable single-parameter family, one for which inference within the family is not easier, asymptotically, than for the original problem (18). This is best done in terms of derivatives. Let F_p denote a weighted distribution with probability p_i on original data point x_i, θ(p) = θ(F_p) be the parameter for the weighted distribution (e.g., weighted mean, or weighted regression coefficient), and p₀ = (1/n, . . . , 1/n) correspond to the original equal-probability empirical distribution function. The gradient of θ(p) is
U_i(p) = lim_{ε→0} ε⁻¹ (θ(p + ε(δ_i − p)) − θ(p))    (24)
where δ_i is the vector with 1 in position i and 0 elsewhere. When evaluated at p₀, these derivatives are known as the empirical influence function, or infinitesimal jackknife. Four least-favorable families found in the tilting literature are:
F1: p_i = c exp(τ U_i(p₀))
F2: p_i = c exp(τ U_i(p))
F3: p_i = c (1 − τ U_i(p₀))⁻¹
F4: p_i = c (1 − τ U_i(p))⁻¹    (25)
each indexed by a tilting parameter τ, where each c normalizes the probabilities to add to 1. F1 and F2 are well known as "exponential tilting," and they coincide with Equation (20) if θ is a mean. Similarly, F3 and F4 are maximum likelihood tilting and coincide with Equation (21) for a mean. F2 and F4 minimize the backward and forward Kullback–Leibler distances between p and p₀, respectively, subject to p_i ≥ 0, Σ p_i = 1, and θ(p) = A; varying A results in solutions of the form given in Equation (25). F4 also maximizes the likelihood Π p_i subject to the same constraints.
As in the case of the sample mean, having selected a family, we find the value of τ for which 2.5% (or 97.5%) of the bootstrap distribution is to the right of the observed θ̂; the left (right) endpoint of the confidence interval is then the parameter calculated for the weighted distribution with probability p_i on x_i.
All four families result in second-order accurate confidence intervals (19), but the finite-sample performance differs, sometimes dramatically for smaller samples. The fixed-derivative versions F1 and F3 are easier to work with, but they have inferior statistical properties. They are shorter, have actual coverage probability lower than the nominal confidence, and for sufficiently high nominal confidence levels the actual coverage can decrease as the nominal confidence increases. The maximum likelihood version F4 gives the widest intervals, with the highest and usually most accurate coverage. The differences in coverage between the four families are O(1/n). In short, the maximum likelihood family has better statistical properties, producing wider confidence intervals with coverage closer to the desired levels, whereas the exponential tilting family is more convenient numerically.

8.7 Confidence Intervals for Mean Arsenic Concentration

Table 1 shows 95% confidence intervals for the mean arsenic concentration example described in the Introduction. The intervals vary dramatically, particularly in the degree of asymmetry. The t intervals are symmetric about x̄. The bootstrap t interval reaches much farther to the right, and it is much wider. The percentile interval is asymmetrical, longer on the right side, but to a lesser extent than the other asymmetrical intervals. Although it is asymmetrical, it is so haphazardly rather than by design, and the amount of asymmetry is too small for good accuracy. Although preferable to the t intervals, it is not as accurate as the second-order accurate procedures.
The t intervals assume that the underlying population is normal, which is not true here. Still, the common practice with a sample size as large as 271 would be to use t intervals anyway. The bootstrap can help answer whether
that is reasonable, by giving an idea of what the actual noncoverage is for a 95% t interval. Table 2 shows what nominal coverage levels would be needed for the bootstrap t and BCa intervals to coincide with the actual endpoints of the t interval; in other words, what the bootstrap t and BCa intervals think is the actual noncoverage of the t interval.

Table 2. Actual Noncoverage of Nominal 95% t Intervals, as Estimated from Second-Order Accurate Intervals

              Left      Right
bootstrap t   0.0089    0.062
BCa           0.0061    0.052

The actual noncoverage should be 0.025 on each side. A t interval would miss more than twice too often on the right side.

The discrepancies are striking. On the left side, the t interval should miss 2.5% of the time; it actually misses only about a third or a fourth as often, according to the bootstrap t and BCa intervals. On the right side, it should miss 2.5% of the time, but it actually misses somewhere between 5.2% and 6.2%, according to the BCa and bootstrap t procedures. This suggests that the t interval is severely biased, with both endpoints systematically lower than they should be.

8.8 Implications for Other Situations

The t intervals are badly biased in the arsenic example. What does this imply for other situations? On the one hand, the arsenic data are skewed relative to most data observed in practice. On the other hand, the sample size is large. What can we say about other combinations of sample size and population skewness?
For comparison, samples of size 47 from an exponential population are comparable with the arsenic data, in the sense that the sampling distribution for the mean is equally skewed. A quick simulation with 10⁶ samples of exponential data with n = 47 shows that the actual noncoverage of 95% t intervals is 0.0089 on the left and 0.0567 on the right, comparable with the bootstrap estimates above. This shows that for a distribution with only moderate skewness,
like the exponential distribution, n = 30 is not nearly enough to use t intervals; even n = 47 results in noncoverage probabilities that are off by factors of about 3 and 2 on the two sides. Reducing the errors in noncoverage to a more reasonable 10% of the desired value, i.e., so that the actual one-sided noncoverage probabilities are between 2.25% and 2.75% on each side for a nominal 95% interval, would require around n = 7500 for an exponential distribution. Even for distributions that are not particularly skewed, say with 1/4 the skewness of an exponential distribution (e.g., a gamma distribution with shape = 16), the sample size would need to be around 470 to reduce the errors in noncoverage to 10% of the desired values.
To obtain reasonable accuracy for smaller sample sizes requires the use of more accurate confidence intervals, either a second-order accurate bootstrap interval or a comparable second-order accurate nonbootstrap interval. Two general second-order accurate procedures that do not require sampling are the ABC (23) and automatic percentile (24) intervals, which are approximations for BCa and tilting intervals, respectively.
The current practice of statistics, using normal and t intervals with skewed data, systematically produces confidence intervals with endpoints that are too low (for positively skewed data). Similarly, hypothesis tests are systematically biased; for positively skewed data, they reject H₀: θ = θ₀ too often for cases with θ̂ < θ₀ and too little for θ̂ > θ₀. The primary reason is acceleration: when θ̂ < θ₀, acceleration makes it likely that s < σ, and the t interval does not correct for this, so it improperly rejects H₀.
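A quick simulation along these lines can be sketched as follows (exponential population, n = 47; the number of simulated samples is an arbitrary choice).

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, nsim, alpha = 47, 100000, 0.05
true_mean = 1.0                          # exponential(1) population

x = rng.exponential(true_mean, size=(nsim, n))
means = x.mean(axis=1)
half = stats.t.ppf(1 - alpha / 2, df=n - 1) * x.std(axis=1, ddof=1) / np.sqrt(n)

# Noncoverage on each side: the parameter falls below the lower endpoint
# (miss on the left) or above the upper endpoint (miss on the right).
miss_left = np.mean(true_mean < means - half)
miss_right = np.mean(true_mean > means + half)
print(miss_left, miss_right)             # roughly 0.009 and 0.057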
8.9 Comparing Intervals

t intervals and bootstrap percentile intervals are quick-and-dirty intervals that are suitable for rough approximations, but these methods should not be used where accuracy is needed. Among the others, I recommend the BCa in most cases, provided that the number of bootstrap samples B is very large.
In my experience with extensive simulations, the bootstrap t is the most accurate in terms of coverage probabilities. However, it achieves this result at a high cost: the interval is longer on average than the BCa and tilting intervals, often much longer. Adjusting the nominal coverage level of the BCa and tilting intervals upward gives comparable coverage to the bootstrap t with shorter length. And the lengths of bootstrap t intervals vary much more than the others. I conjecture that this length difference occurs because bootstrap t intervals are sensitive to the kurtosis of the bootstrap distribution, which is hard to estimate accurately from reasonable-sized samples. In contrast, BCa and tilting intervals depend primarily on the mean, standard deviation, and skewness of the bootstrap distribution. Also, the bootstrap t is computationally expensive if the standard error is obtained by bootstrapping. If s_θ̂ is calculated by bootstrapping, then s_θ̂* is calculated using a second level of bootstrapping, drawing bootstrap samples from each first-level bootstrap sample (requiring a total of B + B·B₂ bootstrap samples, if B₂ second-level bootstrap samples are drawn from each of the B first-level bootstrap samples).
The primary advantage of bootstrap tilting over BCa is that it requires many fewer bootstrap replications, typically by a factor of 37 for a 95% confidence interval. The disadvantages of tilting are that the small-sample properties of the fixed-derivative versions F1 and F3 are not particularly good, whereas the more rigorous F2 and F4 are harder to implement reliably.

9 HYPOTHESIS TESTING

An important point in bootstrap hypothesis testing is that sampling should be performed in a way that is consistent with the null distribution. We describe here three bootstrap hypothesis testing procedures: pooling for two-sample tests, bootstrap tilting, and bootstrap t.
The first is for two-sample problems, such as comparing two means. Suppose that the null hypothesis is that θ₁ = θ₂, and that one is willing to assume that if the null hypothesis is true, then the two populations are the same. Then one may pool the data, draw samples of size n₁ and n₂ with replacement
from the pooled data, and compute a test statistic such as θ̂₁ − θ̂₂ or a t statistic. Let T* be the bootstrap test statistic, and let T₀ be the observed value of the test statistic. The P-value is the fraction of the time that T* exceeds T₀. In practice, we add 1 to the numerator and denominator when computing the fraction; the one-sided P-value for the one-sided alternative hypothesis θ₁ − θ₂ > 0 is
(#(T* > T₀) + 1) / (B + 1).    (26)
The lower one-sided P-value is
(#(T* < T₀) + 1) / (B + 1),
and the two-sided P-value is two times the smaller of the one-sided P-values.
This procedure is similar to the two-sample permutation test, which pools the data and draws n₁ observations without replacement for the first sample and allots the remaining n₂ observations to the second sample. The permutation test is preferred. For example, suppose one outlier exists in the combined sample; every pair of permutation samples has exactly one copy of the outlier, whereas the bootstrap samples may have 0, 1, 2, . . . copies. This adds extra variability not present in the original data, and it detracts from the accuracy of the resulting P-values.
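A minimal sketch of the pooled two-sample bootstrap test, with the +1 correction of Equation (26); the two groups are hypothetical.

import numpy as np

rng = np.random.default_rng(7)
x1 = rng.exponential(10.0, size=25)      # hypothetical group 1
x2 = rng.exponential(12.0, size=30)      # hypothetical group 2
n1, n2, B = len(x1), len(x2), 10000

T0 = x1.mean() - x2.mean()               # observed test statistic
pooled = np.concatenate([x1, x2])

# Resample both groups from the pooled data, consistent with the null hypothesis.
T = np.array([rng.choice(pooled, size=n1, replace=True).mean()
              - rng.choice(pooled, size=n2, replace=True).mean()
              for _ in range(B)])

p_upper = (np.sum(T > T0) + 1) / (B + 1)   # Equation (26)
p_lower = (np.sum(T < T0) + 1) / (B + 1)
print(p_upper, p_lower, 2 * min(p_upper, p_lower))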
Now suppose that one is not willing to assume that the two distributions are the same. Then bootstrap tilting hypothesis testing (3,25,26) may be suitable. Tilting may also be used in one-sample and other contexts. The idea is to find a version of the empirical distribution function(s) with unequal probabilities that satisfies the null hypothesis (by maximizing likelihood or minimizing Kullback–Leibler distance subject to the null hypothesis), then draw samples from the unequal-probability empirical distributions, and let the P-value be the fraction of times the bootstrap test statistic exceeds the observed test statistic. As in the case of confidence intervals, importance sampling may
be used in place of sampling with unequal probabilities; see the section "Bootstrap Confidence Intervals." This method shares many close connections with empirical likelihood (27). Bootstrap tilting hypothesis tests reject H₀ if bootstrap tilting confidence intervals exclude the null hypothesis value.
The third general-purpose bootstrap testing procedure is related to bootstrap t confidence intervals. A t statistic is calculated for the observed data, and the P-value for the statistic is calculated not by reference to the Student's t distribution, but rather by reference to the bootstrap distribution for the t statistic. In this case, the bootstrap sampling need not be done consistently with the null hypothesis, because t statistics are approximately pivotal: their distribution is approximately the same independent of θ.
10 PLANNING CLINICAL TRIALS
The usual bootstrap procedure is to draw samples of size n from the empirical data, or more generally to plug in an estimate for the population and draw samples using the sampling mechanism actually used in practice. In planning clinical trials, we may modify this procedure in two ways:
• try other sampling procedures, such as different sample sizes or stratification, and/or
• plug in alternate population estimates.
For example, given training data of size n, to estimate the standard errors or confidence interval widths that would result from a possible clinical trial of size N, we may draw bootstrap samples of size N with replacement from the data. Similarly, we may estimate the effects of different sampling mechanisms, such as stratified sampling or case-control allocation to arms, even if the pilot data were obtained in other ways.
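A sketch of this resizing idea, assuming a hypothetical pilot sample and a simple mean as the statistic:

import numpy as np

rng = np.random.default_rng(8)
pilot = rng.exponential(10.0, size=40)   # hypothetical pilot data (n = 40)
B = 10000

def boot_se_of_mean(N):
    """Bootstrap SE of the mean for a hypothetical trial of size N,
    resampling from the pilot data."""
    means = np.array([rng.choice(pilot, size=N, replace=True).mean()
                      for _ in range(B)])
    return means.std(ddof=1)

# Anticipated precision for candidate trial sizes.
for N in (40, 100, 200, 400):
    print(N, boot_se_of_mean(N))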
For example, we consider preliminary results from a clinical trial to evaluate the efficacy of maintenance chemotherapy for acute myelogenous leukemia (28,29). After achieving remission through chemotherapy, the patients were assigned to a treatment group that received maintenance chemotherapy and a control group that did not. The goal was to examine whether maintenance chemotherapy prolonged the time until relapse. The data are in Table 3. In all, 11 subjects were in the treatment group, and 12 subjects were in the control group.

Table 3. Leukemia Data

Group           Length of Complete Remission (in weeks)
Maintained      9, 13, 13+, 18, 23, 28+, 31, 34, 45+, 48, 161+
Nonmaintained   5, 5, 8, 8, 12, 16+, 23, 27, 30, 33, 43, 45

A Cox proportional hazards regression, using Breslow's method of breaking ties, yields a log-hazard ratio of 0.904 and standard error 0.512:

      coef  exp(coef)  se(coef)    z      p
group 0.904      2.47     0.512  1.77  0.078

An ordinary bootstrap with B = 10⁴ resulted in 11 samples with complete separation, where the minimum observed relapse time in the treatment group exceeds the maximum observed relapse time in the control group, which yields an infinite estimated hazard ratio. A stratified bootstrap reduces the number of samples with complete separation to three. Here, stratification is preferred (even if the original allocation were not stratified) to condition on the actual sample sizes and prevent imbalance in the bootstrap samples. Omitting the three observations results in a slightly long-tailed bootstrap distribution, with standard error 0.523, which is slightly larger than the formula standard error.
Drawing 50 observations from each group results in a bootstrap distribution for the log-hazard ratio that is nearly exactly normal with almost no bias, no samples with separation (they are still possible, but unlikely), and a standard error of 0.221; this is about 14% less than would be obtained by dividing the formula standard error by two, 0.512/2 = 0.256. Similar results are obtained using Efron's method for handling ties and from a smoothed bootstrap with a small amount of noise added to the remission times. The fact that the reduction in standard error was 14% greater than expected based on the usual
O(1/√n) rate may be because censored observations have a less serious impact with larger sample sizes.

10.1 "What If" Analyses: Alternate Population Estimates

In planning clinical trials, it may often be of interest to do "what if" analyses, perturbing various inputs. For example, how might the results differ under sampling from populations with a log hazard ratio of zero, or 0.5? This should be done by reweighting observations (30,31). This is a version of bootstrap tilting (18,20,30,32) and is closely related to empirical likelihood (33).
Consider first a simple example: sampling the difference in two means, θ̂ = x̄₁ − x̄₂. To sample from populations with different values of θ, it is natural to consider perturbing the data, shifting one or both samples, for example, adding θ − θ̂ to each value in sample 1. Perturbing the data does not generalize well to other situations. Furthermore, perturbing the data would often give incorrect answers. Suppose that the observations represent positively skewed observations such as survival times, with a mode at zero. Shifting one sample to the left would give negative times; a shift to the right would make the mode nonzero. More subtle, but very important, is that shifting ignores the mean–variance relationship for skewed populations: increasing the mean should also increase the variance.
Instead, we use a weighted version of the empirical data, which maximizes the likelihood of the observed data subject to the weighted distributions satisfying desired constraints. To satisfy µ₁ − µ₂ = θ₀, for example, we maximize
Π_{i=1}^{n₁} w_{1i} · Π_{i=1}^{n₂} w_{2i}    (27)
subject to constraints on the weights (given here for two samples):
w_{1i} > 0, i = 1, . . . , n₁
w_{2i} > 0, i = 1, . . . , n₂
Σ_{i=1}^{n₁} w_{1i} = 1
Σ_{i=1}^{n₂} w_{2i} = 1    (28)
and the constraint specific to comparing two means:
Σ_{i=1}^{n₁} w_{1i} x_{1i} − Σ_{i=1}^{n₂} w_{2i} x_{2i} = θ₀    (29)
For other statistics, we replace Equation (29) with the more general
θ(F̂_{n,w}) = θ₀    (30)
where F̂_{n,w} is the weighted empirical distribution (with the obvious generalization to multiple samples or strata). The computational tools used for empirical likelihood (33) and bootstrap tilting (18,20) are useful in determining the weights. The bootstrap sampling is from the weighted empirical distributions; that is, the data are sampled with unequal probabilities.
Figure 15 shows this idea applied to the leukemia data. The top left shows Kaplan-Meier survival curves for the original data, and the top right shows the bootstrap distribution for the log hazard ratio, using 50 observations in each group. The bottom left shows weights chosen to maximize Equation (27), subject to Equation (28) and a log hazard ratio equal to 0.5. To reduce the ratio from its original value of 0.904, the treatment group gets high weights early and low weights later (the weighted distribution has a higher probability of early events), whereas the control group gets the converse. Censored observations get roughly the average weight of the remaining noncensored observations in the same group. The middle left shows the resulting weighted survival estimates, and the middle right shows the corresponding bootstrap distribution. In this case, both bootstraps are nearly normal, and the standard errors are very similar: 0.221 for the ordinary bootstrap and 0.212 for the weighted bootstrap, both with 50 observations per group.
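A simplified sketch of such a "what if" analysis for the difference of two means, using exponential tilting with a single parameter rather than the full maximum likelihood weights of Equations (27) to (30); the data and the target θ₀ are hypothetical.

import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(9)
x1 = rng.exponential(10.0, size=25)      # hypothetical group 1
x2 = rng.exponential(12.0, size=30)      # hypothetical group 2
theta0 = 0.0                             # target difference in means
n1, n2, B = len(x1), len(x2), 10000

def probs(x, tau):
    w = np.exp(tau * x)
    return w / w.sum()

def gap(tau):
    """Weighted mean difference minus the target, tilting the two groups
    in opposite directions with a single parameter tau."""
    return (np.sum(probs(x1, tau) * x1)
            - np.sum(probs(x2, -tau) * x2) - theta0)

tau = brentq(gap, -1.0, 1.0)             # weights satisfying the constraint
p1, p2 = probs(x1, tau), probs(x2, -tau)

# Weighted bootstrap: sample each group with its tilted probabilities.
diffs = np.array([rng.choice(x1, size=n1, replace=True, p=p1).mean()
                  - rng.choice(x2, size=n2, replace=True, p=p2).mean()
                  for _ in range(B)])
print(diffs.mean(), diffs.std(ddof=1))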
11 HOW MANY BOOTSTRAP SAMPLES ARE NEEDED

We suggested in the section "Accuracy of Bootstrap Distributions" that 1000 bootstrap samples is enough for rough approximations, but that more are needed for greater accuracy. In this section, we give details. The focus here is on Monte Carlo accuracy: how well the usual random-sampling implementation of the bootstrap approximates the theoretical bootstrap distribution. A bootstrap distribution based on B random samples corresponds to drawing B observations with replacement from the theoretical bootstrap distribution. Quantities such as the mean, standard deviation, or quantiles of the bootstrap distribution converge to their theoretical counterparts at the rate O(1/√B), in probability.
Efron and Tibshirani (1) suggest that B = 200, or even as few as B = 25, suffices for estimating standard errors, and that B = 1000 is enough for confidence intervals. We argue that larger sizes are appropriate, on two grounds. First, those criteria were developed when computers were much slower; with faster computers, it is much easier to take large samples. Second, those criteria were developed using arguments that combine the random variation caused by the original sample with the random variation caused by bootstrap sampling. For example, Efron and Tibshirani (1) indicate that
cv(se_B) ≈ ( cv(se_∞)² + (E(δ) + 2)/(4B) )^{1/2}
where cv is the coefficient of variation, cv(Y) = σ_Y/E(Y), se_B and se_∞ are bootstrap standard errors using B or ∞ replications, respectively, and δ relates to the kurtosis of the bootstrap distribution; it is zero for normal distributions. Even relatively small values of B make the ratio cv(se_B)/cv(se_∞) not much larger than 1.
Figure 15. Perturbed bootstrap distribution for survival. [Panels: Kaplan-Meier curves and the ordinary bootstrap distribution of the log hazard ratio (top), weighted Kaplan-Meier curves and the weighted bootstrap distribution (middle), and the observation weights for the maintained and control groups (bottom).]
We feel that more relevant is the variation in bootstrap answers conditional on the data. This is particularly true in clinical trial applications, where
• reproducibility is important: two people analyzing the same data should get (almost exactly) the same results, with random variation between their answers minimized, and
• the data may be very expensive: there is little point in wasting the value of expensive data by introducing extraneous variation using too small a B.
Given the choice between reducing variation in the ultimate results by gathering more data or by increasing B, it would be cheaper to increase B, at least until B is large.
Conditional on the data, cv(se_B) ≈ √((δ + 2)/(4B)), where δ is the kurtosis of the theoretical bootstrap distribution (conditional on the data). When δ is zero (usually approximately true), this simplifies to cv(se_B) ≈ 1/√(2B).
To determine how large B should be, we consider the effect on confidence intervals. Consider a t interval of the form θ̂ ± t_{α/2} se_B. Suppose that such an interval using se_∞ would be approximately correct, with one-sided noncoverage α/2. Then the actual noncoverage using se_B in place of se_∞ would be F_{t,n−1}((se_B/se_∞) F⁻¹_{t,n−1}(α/2)). For large n and α = 0.05, to have the actual one-sided noncoverage fall within 10% of the desired value (between 0.0225 and 0.0275) requires that se_B/se_∞ be between Φ⁻¹(0.025·1.1)/Φ⁻¹(0.025) = 0.979 and Φ⁻¹(0.025·0.9)/Φ⁻¹(0.025) = 1.023. To have 95% confidence of no more than 10% error requires that 1.96/√(2B) ≤ 0.022, or B ≥ 0.5(1.96/0.022)² = 3970, or about 4000 bootstrap samples. To satisfy the more stringent criterion of 95% confidence that the noncoverage error is less than 1% of 0.025 would require approximately 400,000 bootstrap samples. With modern computers, this number is not unreasonable, unless the statistic is particularly slow to compute.
Consider also bootstrap confidence intervals based on quantiles. The simple bootstrap percentile confidence interval is the range
from the α/2 to 1 − α/2 quantiles of the bootstrap distribution. Let G⁻¹_∞(c) be the c quantile of the theoretical bootstrap distribution; the number of bootstrap statistics that fall below this quantile is approximately binomial with parameters B and c (the proportion parameter may differ slightly because of the discreteness of the bootstrap distribution). For finite B, the one-sided error has standard error approximately √(c(1 − c)/B). For c = 0.025, to reduce 1.96 standard errors to c/10 requires B ≥ (10/0.025)² (1.96²)(0.025)(0.975) = 14,980, or about 15,000 bootstrap samples. The more stringent criterion of a 1% error would require approximately 1.5 million bootstrap samples.
The bootstrap BCa confidence interval has greater Monte Carlo error because it requires estimating a bias parameter using the proportion of bootstrap samples that fall below the original θ̂ (and the variance of a binomial proportion, p(1 − p)/B, is greatest for p = 0.5). It requires a B about twice as large as the bootstrap percentile interval for equivalent Monte Carlo accuracy: 30,000 bootstrap samples to satisfy the 10% criterion. On the other hand, the bootstrap tilting interval requires about 17 times fewer bootstrap samples for the same Monte Carlo accuracy as the simple percentile interval, so that about 1000 bootstrap samples would suffice to satisfy the 10% criterion.
In summary, to have 95% probability that the actual one-sided noncoverage for a 95% bootstrap interval falls within 10% of the desired value, between 0.0225 and 0.0275, conditional on the data, requires about 1000 samples for a bootstrap tilting interval, 4000 for a t interval using a bootstrap standard error, 15,000 for a bootstrap percentile interval, and 30,000 for a bootstrap BCa interval.
Figure 16 shows the Monte Carlo variability of several bootstrap confidence interval procedures, for various combinations of sample size, statistic, and underlying data; these are representative of a larger collection of examples in a technical report (21). The panels show the variability caused by Monte Carlo sampling with a finite bootstrap sample size B, conditional on the data.
Figure 16 is based on 2000 randomly generated datasets for each sample size, distribution, and statistic. For each dataset, and for each value of B, two sets of bootstrap samples are created and intervals calculated using all methods. For each method, a sample variance is calculated using the usual unbiased sample variance (based on two observations). The estimate of Monte Carlo variability is then the average across the 2000 datasets of these unbiased sample variances. The result is the "within-group" component of variance (caused by Monte Carlo variability) and excludes the "between-group" component (caused by differences between datasets).
11.1 Assessing Monte Carlo Variation
To assess Monte Carlo variation in practice, two options can be employed. The first is to use asymptotic formulae. For example, the bootstrap estimate of bias in Equation (1) depends on the sample mean of the bootstrap statistics; the usual formula for the standard error of a sample mean gives se_B/√B, in which se_B is the sample standard deviation of the bootstrap statistics. The standard error of a bootstrap proportion p̂ is √(p̂(1 − p̂)/B). The standard error of a bootstrap standard error se_B is approximately se_B √((δ + 2)/(4B)).
The other alternative is to resample the bootstrap values. Given B i.i.d. observations θ̂*₁, θ̂*₂, . . . , θ̂*_B from the theoretical bootstrap distribution, and a summary statistic Q (e.g., standard error, bias estimate, or endpoint of a confidence interval), we may draw B₂ bootstrap samples of size B from the B observations and calculate the summary statistics Q*₁, Q*₂, . . . , Q*_{B₂}. The sample standard deviation of the Q*'s is the Monte Carlo standard error.
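A minimal sketch of the resampling approach, using the lower endpoint of a percentile interval as the summary Q; the data, B, and B₂ are arbitrary choices.

import numpy as np

rng = np.random.default_rng(10)
x = rng.exponential(10.0, size=30)       # hypothetical data
n, B, B2 = len(x), 2000, 1000

# First-level bootstrap statistics.
theta_star = np.array([rng.choice(x, size=n, replace=True).mean()
                       for _ in range(B)])

def Q(values):
    """Summary of interest: here, the 2.5% percentile-interval endpoint."""
    return np.quantile(values, 0.025)

# Resample the B bootstrap values to estimate the Monte Carlo standard
# error of Q, conditional on the data.
Q_star = np.array([Q(rng.choice(theta_star, size=B, replace=True))
                   for _ in range(B2)])
print(Q(theta_star), Q_star.std(ddof=1))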
11.2 Variance Reduction
Several techniques can be used to reduce the Monte Carlo variation. The balanced bootstrap (34), in which each of the n observations is included exactly B times in the B bootstrap samples, is useful for bootstrap bias estimates but is of little value otherwise. Antithetic variates (35) is moderately helpful for bias estimation but is of little value otherwise.
Importance sampling (36,37) is particularly useful for estimating tail quantiles, as for bootstrap percentile and BCa intervals. For nonlinear statistics, one should use a defensive mixture distribution (38,39). Control variates (35,38,40,41) are moderately to extremely useful for bias and standard error estimation and can be combined with importance sampling (42). They are most effective in large samples for statistics that are approximately linear. Concomitants (41,43) are moderately to extremely useful for quantiles and can be combined with importance sampling (44). They are most effective in large samples for statistics that are approximately linear; linear approximations tailored to a tail of interest can dramatically improve the accuracy (45). Quasi-random sampling (46) can be very useful for small n and large B; the convergence rate is O((log B)ⁿ B⁻¹), compared with O(B⁻¹/²) for Monte Carlo methods. Analytical approximations for bootstrap distributions are available in some situations, including analytical approximations for bootstrap tilting and BCa intervals (19,23) and saddlepoint approximations (47–51).
12 ADDITIONAL REFERENCES

In this section we give some additional references. Hesterberg et al. (52) is an introduction to the bootstrap written for introductory statistics students. It is available free at http://bcs.whfreeman.com/pbs/cat 160/PBS18.pdf. Efron and Tibshirani (1) is an introduction to the bootstrap written for upper-level undergraduate or beginning graduate students. Davison and Hinkley (3) is the best general-purpose reference to the bootstrap for statistical practitioners. Hall (7) looks at asymptotic properties of various bootstrap methods. Chernick (53) has an extensive bibliography, with roughly 1700 references related to the bootstrap.
Figure 16. Monte Carlo variability for confidence intervals. [Panels plot the Monte Carlo standard deviation against bootstrap sample size for the empirical (percentile), BCa, bootstrap t, exponential tilting, and maximum likelihood tilting intervals, for: Mean, Exponential, n = 40; Correlation, Normal, n = 80; Variance, Normal, n = 10; Variance, Exponential, n = 20; Ratio of Means, Normal, n = 10; and Ratio of Means, Exponential, n = 80.]
The author’s website www.insightful.com/ Hesterberg/bootstrap has resources for teaching statistics using the bootstrap and for bootstrap software in S-PLUS. Some topics that are beyond the scope of this article include bootstrapping dependent data (time series, mixed effects models), cross-validation and bootstrap-validation (bootstrapping prediction errors and classification errors), Bayesian bootstrap, and bootstrap likelihoods. References 1 and 3 are good starting points for these topics, with the exception of mixed effects models. REFERENCES 1. B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. Boca Raton, FL: Chapman and Hall, 1993. 2. L. Breiman, Random forests. Mach. Learn. 2001; 45(1): 5–32. 3. A. Davison, and D. Hinkley, Bootstrap Methods and their Applications. Cambridge, UK: Cambridge University Press, 1997. 4. J. Chambers and T. Hastie, Statistical Models in S. Beimont, CA: Wadsworth, 1992. 5. J. M. Chambers, W. S. Cleveland, B. Kleiner, and P. A. Tukey, Graphical Methods for Data Analysis. Beimont, CA: Wadsworth, 1983. 6. A. Ruckstuhl, W. Stahel, M. Maechler, and T. Hesterberg, sunflower. 1993. Statlib: Availables: http://lib.stat.emu.edu/S. 7. P. Hall, The Bootstrap and Edgeworth Expansion. New York: Springer, 1992. 8. J. Shao and D. Tu, The Jackknife and Bootstrap. New York: Springer-Verlag, 1995. 9. B. Efron, Better bootstrap confidence intervals (with discussion). J. Am. Stat. Assoc., 1987; 82: 171–200. 10. T. C. Hesterberg, Unbiasing the Bootstrap Bootknife Sampling vs. Smoothing. In Proceedings of the Section on Statistics & the Environment, Alexandria, VA: American Statistical Association, 2004, pp. 2924–2930. 11. B. Silverman and G. Young, The bootstrap: to smooth or not to smooth. Biometrika 1987; 74: 469–479. 12. P. Hall, T. DiCiccio, and J. Romano, On smoothing and the bootstrap. Ann. Stat., 1989; 17: 692–704. 13. T. J. DiCiccio and J. P. Romano, A review of bootstrap confidence intervals (with discussion). J. Royal Stat. Soc. Series B, 1988; 50(3): 338–354.
14. P. Hall, Theoretical comparison of bootstrap confidence intervals (with discussion). Ann. Stat., 1988; 16: 927–985.
15. T. DiCiccio and B. Efron, Bootstrap confidence intervals (with discussion). Stat. Sci., 1996; 11(3): 189–228.
16. B. Efron, Bootstrap methods: another look at the jackknife (with discussion). Ann. Stat., 1979; 7: 1–26.
17. B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans. National Science Foundation–Conference Board of the Mathematical Sciences Monograph 38. Philadelphia, PA: Society for Industrial and Applied Mathematics, 1982.
18. B. Efron, Nonparametric standard errors and confidence intervals. Canadian J. Stat., 1981; 9: 139–172.
19. T. J. DiCiccio and J. P. Romano, Nonparametric confidence limits by resampling methods and least favorable families. Internat. Stat. Rev., 1990; 58(1): 59–76.
20. T. C. Hesterberg, Bootstrap tilting confidence intervals and hypothesis tests. In: K. Berk and M. Pourahmadi (eds.), Computer Science and Statistics: Proceedings of the 31st Symposium on the Interface, vol. 31. Fairfax Station, VA: Interface Foundation of North America, 1999, pp. 389–393.
21. T. C. Hesterberg, Bootstrap Tilting Confidence Intervals. Research Department 81. Seattle, WA: MathSoft, Inc., 1999.
22. R. J. Hyndman and Y. Fan, Sample quantiles in statistical packages. Am. Stat., 1996; 50: 361–364.
23. T. DiCiccio and B. Efron, More accurate confidence intervals in exponential families. Biometrika, 1992; 79: 231–245.
24. T. J. DiCiccio, M. A. Martin, and G. A. Young, Analytic approximations to bootstrap distribution functions using saddlepoint methods. Technical Report 356. Palo Alto, CA: Department of Statistics, Stanford University, 1990.
25. B. Efron, Censored data and the bootstrap. J. Am. Stat. Assoc., 1981; 76(374): 312–319.
26. D. V. Hinkley, Bootstrap significance tests. Bull. Internat. Stat. Inst., 1989; 65–74.
27. A. Owen, Empirical likelihood ratio confidence intervals for a single functional. Biometrika 1988; 75: 237–249.
28. S. H. Embury, L. Elias, P. H. Heller, C. E. Hood, P. L. Greenberg, and S. L. Schrier, Remission maintenance therapy in acute
myelogenous leukemia. Western J. Med. 1977; 126: 267–272.
29. Insightful, S-PLUS 8 Guide to Statistics. Seattle, WA: Insightful, 2007.
30. T. C. Hesterberg, Bootstrap tilting diagnostics. Proceedings of the Statistical Computing Section, 2001.
31. T. C. Hesterberg, Advances in Importance Sampling. PhD thesis, Statistics Department, Stanford University, Palo Alto, California, 1988.
32. T. C. Hesterberg, Resampling for Planning Clinical Trials Using S+Resample. Paris: Statistical Methods in Biopharmacy, 2005. Available: http://www.insightful.com/Hesterberg/articles/Paris05-ResampleClinical.pdf.
33. P. Hall and B. Presnell, Intentionally biased bootstrap methods. J. Royal Stat. Soc., Series B, 1999; 61(1): 143–158.
34. A. Owen, Empirical Likelihood. Boca Raton, FL: Chapman & Hall/CRC Press, 2001.
35. J. R. Gleason, Algorithms for balanced bootstrap simulations. Am. Stat., 1988; 42(4): 263–266.
36. T. M. Therneau, Variance Reduction Techniques for the Bootstrap. PhD thesis, Department of Statistics, Stanford University, Palo Alto, California, 1983. Technical Report No. 200.
37. M. V. Johns, Importance sampling for bootstrap confidence intervals. J. Am. Stat. Assoc., 1988; 83(403): 701–714.
38. A. C. Davison, Discussion of paper by D. V. Hinkley. J. Royal Stat. Soc. Series B, 1986; 50: 356–357.
39. T. C. Hesterberg, Weighted average importance sampling and defensive mixture distributions. Technometrics 1995; 37(2): 185–194.
40. A. C. Davison, D. V. Hinkley, and E. Schechtman, Efficient bootstrap simulation. Biometrika, 1986; 73: 555–566.
41. B. Efron, More efficient bootstrap computations. J. Am. Stat. Assoc., 1990; 85(409): 79–89.
42. T. C. Hesterberg, Control variates and importance sampling for efficient bootstrap simulations. Stat. Comput., 1996; 6(2): 147–157.
43. K. A. Do and P. Hall, Distribution estimation using concomitants of order statistics, with application to Monte Carlo simulations for the bootstrap. J. Roy. Stat. Soc., Series B, 1992; 54(2): 595–607.
44. T. C. Hesterberg, Fast bootstrapping by combining importance sampling and concomitants. In: E. J. Wegman and S. Azen (eds.), Computing Science and Statistics: Proceedings of the 29th Symposium on the Interface, vol. 29. Fairfax Station, VA: Interface Foundation of North America, 1997, pp. 72–78.
45. T. C. Hesterberg, Tail-specific linear approximations for efficient bootstrap simulations. J. Computat. Graph. Stat., 1995; 4(2): 113–133.
46. K. A. Do and P. Hall, Quasi-random sampling for the bootstrap. Stat. Comput., 1991; 1(1): 13–22.
47. M. Tingley and C. Field, Small-sample confidence intervals. J. Am. Stat. Assoc., 1990; 85(410): 427–434.
48. H. E. Daniels and G. A. Young, Saddlepoint approximation for the studentized mean, with an application to the bootstrap. Biometrika 1991; 78(1): 169–179.
49. S. Wang, General saddlepoint approximations in the bootstrap. Stat. Prob. Lett., 1992; 13: 61–66.
50. T. J. DiCiccio, M. A. Martin, and G. A. Young, Analytical approximations to bootstrap distribution functions using saddlepoint methods. Statistica Sinica 1994; 4(1): 281.
51. A. J. Canty and A. C. Davison, Implementation of saddlepoint approximations to bootstrap distributions. In: L. Billard and N. I. Fisher (eds.), Computing Science and Statistics: Proceedings of the 28th Symposium on the Interface, vol. 28. Fairfax Station, VA: Interface Foundation of North America, 1997, pp. 248–253.
52. T. Hesterberg, S. Monaghan, D. S. Moore, A. Clipson, and R. Epstein, Bootstrap methods and permutation tests. In: D. Moore, G. McCabe, W. Duckworth, and S. Sclove, The Practice of Business Statistics. New York: W. H. Freeman, 2003.
53. M. R. Chernick, Bootstrap Methods: A Practitioner's Guide. New York: Wiley, 1999.
CARDIAC ARRHYTHMIA SUPPRESSION TRIAL (CAST)
LEMUEL A. MOYÉ
University of Texas School of Public Health

The Cardiac Arrhythmia Suppression Trial (CAST) was the National Heart, Lung, and Blood Institute (NHLBI)-sponsored, randomized, controlled clinical trial designed to confirm the arrhythmia suppression hypothesis. Instead, it revealed a 2.5-fold increase (95% CI 1.6 to 4.5) in the mortality rate of patients assigned to either encainide or flecainide, which resulted in the early termination of the study. Its findings altered the development program of antiarrhythmic agents and demonstrated the value of objective evidence in formulating treatment guidance.
After a myocardial infarction (MI), patients have an increased risk of death from arrhythmia and nonarrhythmic cardiac causes. Although many factors contribute to this risk, ventricular premature contractions (VPCs) that occur post MI confer an independent risk for both arrhythmic death and death from all cardiac causes (1,2). These patients are commonly treated with antiarrhythmic drugs (3). During the 1980s, the available antiarrhythmic agents (quinidine and procainamide) were supplemented with the development of three new agents: encainide, flecainide, and moricizine. Initial reports from case studies suggested that these drugs successfully suppressed ventricular arrhythmias (4). After numerous discussions, but in the absence of a clinical trial, these drugs were approved by the FDA for use in patients with serious ventricular arrhythmias, in the belief that the suppression of asymptomatic or mildly symptomatic ventricular arrhythmias would reduce the arrhythmic death rate. This theory became known as the arrhythmia suppression hypothesis. After FDA approval, the National Heart, Lung, and Blood Institute (NHLBI) designed a clinical trial to test this concept.

1 OBJECTIVES

The major objective of CAST was to test whether post-MI patients with left ventricular dysfunction (defined by reduced ejection fraction) who have either symptomatic or asymptomatic ventricular arrhythmias suppressed by antiarrhythmic agents would be less likely to experience an arrhythmic death over 3 years (5).
1.1 Primary Endpoint

The single primary endpoint for CAST was death from arrhythmia. This definition included (1) witnessed instantaneous death in the absence of severe congestive heart failure or shock, (2) unwitnessed death with no preceding change in symptoms and for which no other cause was identified, and (3) cardiac arrest.

1.2 Secondary Endpoints

Additional, prospectively declared endpoints were (1) new or worsened congestive heart failure, (2) sustained ventricular tachycardia, (3) recurrent myocardial infarction, (4) cardiac procedures, and (5) quality of life.
2 STUDY DESIGN
CAST was a randomized, double-blind, placebo-controlled clinical trial.

2.1 Patient Eligibility

Patients were eligible for CAST screening if they had an MI between 6 days and 2 years before the screening. Patients whose MI was within 90 days before the screening Holter monitor examination had to have ejection fractions of 55% or less. Patients whose MI occurred more than 90 days before the qualifying Holter were required to have an ejection fraction of 40% or less. To be eligible, patients must have had a ventricular arrhythmia, defined as at least six ventricular premature contractions per hour, and, in addition, must not have had documented sustained ventricular tachycardia
to be randomized. However, patients whose ventricular ectopy caused either presyncope or syncope were excluded. In addition, patients who had contraindications to the therapy or were poor compliers were also denied entry to the study. The protocol was approved by the institutional review board of each clinical center.

2.2 Titration

Patients who met the above entry criteria then underwent an open titration period, receiving up to three drugs (encainide, flecainide, and moricizine) at two oral doses. The goal of the titration was to suppress the ventricular arrhythmia. Titration ceased when the arrhythmia was suppressed, defined by a greater than 60% reduction in premature ventricular contractions and at least a 90% reduction in runs of unsustained ventricular tachycardia. Subjects with ejection fractions less than 30% were not given flecainide. In all, 135 patients (6%) had incomplete arrhythmia suppression. Patients whose arrhythmia was aggravated by the titration therapy or who proved to be intolerant to the drugs were not randomized. The titration exercise produced 447 patients who either died or were intolerant to the drug.

2.3 Randomization and Follow-Up

Patients who (1) met the inclusion and exclusion criteria and (2) had their ventricular ectopy suppressed during the titration phase were randomized by telephone call through the CAST Coordinating Center. Patients were randomized to receive either encainide, flecainide, moricizine, or placebo therapy. Randomization was stratified by clinical center, ejection fraction (either ≤ 30% or > 30%), and time between the MI and the qualifying Holter examination (< 90 days or ≥ 90 days). Follow-up examinations were scheduled at 4-month intervals. The expected duration of follow-up was 3 years.

2.4 Sample Size

The primary study hypothesis was whether the use of antiarrhythmic agents (encainide, flecainide, and moricizine) would reduce the 3-year cumulative mortality rate from sudden death. Notably, CAST was designed as a
2.5 Analysis Plan
Baseline comparisons were performed by t-tests and chi-square tests. Primary and secondary endpoint comparisons that involved time-to-event analyses were executed using log-rank tests. The significance levels for the individual drug comparisons were adjusted three-fold for the multiple comparisons among the three drug groups. Confidence intervals were estimated by the method of Cornfield. The relative risk of treatment in clinically defined subgroups was calculated to evaluate the consistency of the effects of the drugs on the study endpoints across subgroups.

2.6 Monitoring Procedures
The Data and Safety Monitoring Board met twice yearly to review the unblinded CAST results. This group approved the interim monitoring protocol in September 1988, before it evaluated any data. The protocol included a conservative boundary for stopping the study because of demonstrated benefit, a symmetric boundary for advising stopping the study because of adverse effects, and a stochastic curtailment boundary to permit stopping CAST for a low probability of demonstrating a beneficial effect. The total number of events anticipated in the trial was 425, but the data as of March 1989 indicated that fewer than 300 events would occur.
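As a hedged illustration of the stochastic curtailment idea, the sketch below computes conditional power with the standard B-value (Brownian motion) approximation. The information fraction, final critical value, and drift used here are invented inputs for illustration; they are not quantities reported for the CAST monitoring plan.

```python
# Conditional power via the B-value approximation used in stochastic
# curtailment.  All numeric inputs below are hypothetical.
from math import sqrt
from scipy.stats import norm

def conditional_power(z_interim, info_frac, z_final=1.96, drift=None):
    """P(final Z > z_final | interim data); drift = assumed E[Z(1)] for the
    remainder of the trial (None = current-trend estimate)."""
    b = z_interim * sqrt(info_frac)        # B-value B(t) = Z(t) * sqrt(t)
    if drift is None:
        drift = b / info_frac              # current-trend drift estimate
    remaining = 1 - info_frac
    return 1 - norm.cdf((z_final - b - drift * remaining) / sqrt(remaining))

# hypothetical interim look: Z = -3.22 in the harm direction at 40% information
print(conditional_power(z_interim=-3.22, info_frac=0.40))             # ~0
print(conditional_power(z_interim=-3.22, info_frac=0.40, drift=0.0))  # ~0 under null drift
```

Under either drift assumption, a trial this far in the harm direction has essentially no remaining chance of demonstrating benefit, which is the logic behind a curtailment boundary.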
3 RESULTS

3.1 Screening
As of March 30, 1989, 2309 patients had been recruited and either were undergoing or had completed the open-label titration phase of the study. Of these, 1727 patients met the arrhythmia requirement, had their arrhythmias suppressed, and completed randomization. Assignments to therapy were as follows: 1455 received either encainide, flecainide, or placebo; 272 received either moricizine or placebo. Overall, 730 patients were randomly assigned to encainide or flecainide, and 725 patients received placebo therapy.

3.2 Baseline Characteristics
The baseline characteristics of patients randomized to encainide or flecainide were compared with those of patients recruited to placebo therapy. The average age was 61 years, and 82% of participants were male. Placebo and treatment groups were similar with respect to all characteristics, including age, ejection fraction, time elapsed since myocardial infarction, and use of beta-blockers, calcium-channel blockers, digitalis, or diuretics at baseline. The mean left ventricular ejection fraction was 0.40 in patients treated with encainide or flecainide. Approximately 2.2% of patients had an ejection fraction below 0.20; 20% had an ejection fraction between 0.30 and 0.55. The mean frequency of VPCs was 127 per hour in drug-treated patients, and 20.6% of the patients had at least one run of unsustained ventricular tachycardia (≥ 120 beats per minute) during baseline Holter recording.

3.3 Compliance
As of March 30, 1989, 8.4% of patients randomized to encainide or flecainide therapy had their therapy discontinued. More than half of the withdrawals exited because of protocol-defined reasons (e.g., major adverse events or symptoms). Placebo therapy was discontinued in 8.6% of patients. Of the patients still taking active therapy or placebo, 79% were taking at least 80% of their medication. Of the patients who died of arrhythmia or were resuscitated after a cardiac arrest, 88% were following the study regimen at the time of the fatal event. The average exposure to therapy was 295 days in the encainide/flecainide group and 300 days in the placebo group.
3.4 Mortality Findings
A total of 730 patients were recruited to encainide or flecainide, and 725 patients were recruited to placebo. In the combined groups, 78 deaths occurred: 56 in the encainide/flecainide group and 22 in the placebo group. The total mortality relative risk for encainide or flecainide therapy compared with placebo was 2.5 (95% CI 1.6 to 4.5). The relative risks of encainide and flecainide were indistinguishable (2.7 and 2.2, respectively). Thirty-three deaths occurred because of either arrhythmia or cardiac arrest in the encainide/flecainide group, and 9 occurred in the placebo group. The relative risk of death from arrhythmia for encainide or flecainide was 3.6 (95% CI 1.7–8.5). The relative risks of death from arrhythmia or cardiac arrest for patients receiving encainide or flecainide considered separately were not different (3.4 and 4.4, respectively). Fourteen deaths occurred for other cardiac reasons in the encainide/flecainide group compared with 6 in the placebo group. Nine noncardiac deaths, unclassified deaths, or other cardiac arrests occurred in the encainide/flecainide group, and 7 occurred in the placebo group.

3.5 Subgroup Analyses
The effect of the therapy was consistent across subgroups. In all subgroups, patients treated with encainide or flecainide had higher rates of both total mortality and death because of arrhythmia than patients treated with placebo. The observed increased risk from either of these two agents was present regardless of age, prolonged QRS interval, or use of beta-blockers, calcium-channel blockers, digitalis, or diuretics at baseline. Notably, among patients with an ejection fraction ≥ 30%, the stratum in which both encainide and flecainide were used, the relative risks were essentially equal (4.6 and 4.4, respectively).
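The headline relative risk can be checked directly from the reported counts. The sketch below uses a simple log-relative-risk (Katz) interval rather than the Cornfield method used by CAST, so the limits are only approximate and will not reproduce the published 1.6 to 4.5 exactly.

```python
# Recomputing the total mortality relative risk from the published counts:
# 56/730 deaths on encainide/flecainide versus 22/725 on placebo.
from math import exp, log, sqrt

deaths_active, n_active = 56, 730
deaths_placebo, n_placebo = 22, 725

rr = (deaths_active / n_active) / (deaths_placebo / n_placebo)
se_log_rr = sqrt(1 / deaths_active - 1 / n_active
                 + 1 / deaths_placebo - 1 / n_placebo)
lo = exp(log(rr) - 1.96 * se_log_rr)
hi = exp(log(rr) + 1.96 * se_log_rr)
print(f"relative risk {rr:.2f}, approximate 95% CI ({lo:.2f}, {hi:.2f})")
```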
3.6 Early Termination
On April 16 and 17, 1989, the CAST Data and Safety Monitoring Board reviewed the data available as of March 30, 1989. An evaluation of the data for all randomized patients revealed that the interim monitoring statistic (Z = −3.22) had crossed the lower advisory boundary for harm (Z = −3.11). In addition, the conditional power for a beneficial effect (< 0.27) was well below the minimum conditional power established (0.55). The DSMB therefore recommended that the encainide and flecainide arms of the study be discontinued.

4 CONCLUSIONS
The scheduled termination of CAST was preempted by an unexpected increase in mortality, which prompted the termination of the encainide and flecainide limbs of the study. Given that millions of patients received prescriptions for approved antiarrhythmic agents during the conduct of CAST, the public health implications of its finding of antiarrhythmic therapy-induced harm were profound (6). The use of these drugs was sharply curtailed as the practicing community struggled to understand the staggering findings. No confounding factors that could explain the finding of harm were identified. In fact, the consistency of the excess mortality risk of encainide and flecainide across all subgroups was a noteworthy feature of this study. The implications of CAST for the development of new antiarrhythmic drugs were profound. Its demonstration of hazard in the face of VPC suppression removed this suppression as an informative, predictive surrogate endpoint for clinical endpoints. In addition, the results had regulatory implications, as the FDA grappled with the correct combination of the number of trials, the type of patients, the sample size, and the goals of these trials to view the risk–benefit balance of antiarrhythmic therapy clearly. Finally, perhaps no point more clearly reveals the expectations of the CAST investigators than their acceptance of the one-tailed statistical hypothesis test, specifically designed not to focus on harm. The demonstration of excess mortality overwhelmed the CAST analysis, which had been designed simply to demonstrate benefit versus lack of benefit. The use of one-tailed testing continues to be debated in the clinical literature (7,8). The surprising results of CAST reinforced
the time-tested need for study investigators not just to consider the possibility of harm, but also to weave this concern into the fabric of the study (9). Fortunately, their wise decision to include a placebo group not only distinguished CAST from the many preliminary studies that led to FDA approval of drugs subsequently shown to be harmful but also was the one essential characteristic of the study that permitted a clear demonstration of the damage stemming from the use of these drugs.

REFERENCES
1. J. T. Bigger, Jr. et al., Multicenter Post-Infarction Research Group, The relationships among ventricular arrhythmia, left ventricular dysfunction, and mortality in the 2 years after myocardial infarction. Circulation. 1984; 69: 250–258.
2. J. Mukharji et al., Risk factors for sudden death after acute myocardial infarction: two year follow-up. Am J Cardiol. 1984; 54: 31–36.
3. S. Vlay, How the university cardiologist treats ventricular premature beats: a nationwide survey of 65 university medical centers. Am Heart J. 1985; 110: 904–912.
4. The CAPS Investigators, The Cardiac Arrhythmia Pilot Study. Am J Cardiol. 1986; 57: 91–95.
5. The CAST Investigators, Preliminary report: effect of encainide and flecainide on mortality in a randomized trial of arrhythmia suppression after myocardial infarction. N Engl J Med. 1989; 321: 406–412.
6. L. K. Hines, T. P. Gross, and D. L. Kennedy, Outpatient arrhythmic drug use from 1970 to 1986. Arch Intern Med. 1989; 149: 1524–1527.
7. J. A. Knottnerus and L. M. Bouter, Commentary: the ethics of sample size: two-sided testing and one-sided thinking. J Clin Epidemiol. 2001; 54: 109–110.
8. L. A. Moyé and A. Tita, Defending the rationale for the two-tailed test in clinical research. Circulation. 2002; 105: 3062–3065.
9. L. A. Moyé, Statistical Reasoning in Medicine: The Intuitive P-Value Primer. New York: Springer-Verlag, 2006.
FURTHER READING

T. Moore, Deadly Medicine. New York: Simon and Schuster, 1995.
CROSS-REFERENCES

Placebo controlled clinical trial
Run in period
Interim analyses
Trial termination
CATEGORICAL RESPONSE MODELS

GERHARD TUTZ
Institut für Statistik, Munich, Germany

1 INTRODUCTION

In clinical trials, categorical data occur if the response is measured in categories. The simplest experimental design compares two treatment groups, usually a treatment and a control group, on a binary response that may be generally described by the categories of success or failure. Thus, data are given in a 2 × 2 table (see Table 1), where the marginals n1+ and n2+ denote the number of observations in treatment groups 1 and 2.

Table 1. 2 × 2 Contingency Table for Treatment and Success

                              Response
  Treatment                   Success (1)    Failure (2)    Marginals
  Drug (Treatment 1)          n11            n12            n1+
  Control (Treatment 2)       n21            n22            n2+

With the binary response given by y = 1 for success and y = 0 for failure, and T denoting the treatment, investigation concerns the conditional probabilities of success in treatment group t,

  πt1 = P(y = 1 | T = t).

The effect of the treatment is measured by comparing π11 and π21 (i.e., the probabilities of success in groups 1 and 2) or functions thereof. Instead of comparing the probabilities themselves, it is often preferable to compare the odds, which for treatment group t are defined by

  πt1/πt2 = P(y = 1 | T = t) / P(y = 0 | T = t)

where πt2 = 1 − πt1 = P(y = 0 | T = t). Comparison of treatments may be based on the odds ratio (also called cross-product ratio)

  θ = (π11/π12) / (π21/π22)

which is a directed measure of association between treatment and success. If θ = 1, the odds in both treatment groups are the same, and therefore treatment has no effect on the probability of success. If θ > 1, the odds in treatment group 1 are larger than in treatment group 2; for 0 < θ < 1, treatment group 2 is superior to treatment group 1.

The dependence of y on treatment group may be given in the form of a binary logit model

  log(πt1/πt2) = β0 + βT(t)    (1)

where βT(t) is a parameter connected to treatment group t. As πt1 + πt2 = 1, only two probabilities, namely π11 and π21, are free to vary. Therefore, the model has too many parameters and an additional restriction is needed. By setting βT(1) = βT, βT(2) = 0, one immediately obtains

  log(π11/π12) = β0 + βT,    log(π21/π22) = β0.

Simple calculation shows that βT may be interpreted as the log odds ratio

  βT = log[(π11/π12) / (π21/π22)]

and

  exp(βT) = (π11/π12) / (π21/π22)

is the odds ratio. Thus, the case of no effect, where the odds ratio equals 1, corresponds to βT = 0. For general designs, it is instructive to give the model in Equation (1) in the usual form of a regression model. With the restriction βT(2) = 0, Equation (1) is equivalent to the logit model

  log(πt1/πt2) = β0 + xt βT    (2)

where x1 = 1 (coding T = 1) and x2 = 0 (coding T = 2).
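As a small numerical illustration of Equations (1) and (2), the sketch below (with invented counts, not data from a real trial) computes the maximum likelihood estimates for a 2 × 2 table, where they reduce to the observed log odds of the control group and the observed log odds ratio.

```python
# Numerical check of Equations (1)-(2) for a hypothetical 2x2 table:
# the ML estimates are the observed log odds and log odds ratio.
from math import exp, log

n11, n12 = 40, 60   # drug: successes, failures
n21, n22 = 25, 75   # control: successes, failures

beta_0 = log(n21 / n22)                      # log odds of success under control
beta_T = log((n11 / n12) / (n21 / n22))      # log odds ratio, Equation (2)

print(f"odds ratio theta = exp(beta_T) = {exp(beta_T):.3f}")
print(f"control success probability   = {exp(beta_0) / (1 + exp(beta_0)):.3f}")
```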
2 BINARY RESPONSE MODELS WITH COVARIATES

2.1 Observations from Several Strata

Often treatments are compared by using data from different sources. If, in a multi-center study, the success of treatment is investigated, one obtains count data as given in Table 2. As the effect of a treatment may depend on the specifics of a center, it is necessary to include the information where the data come from.

Table 2. m × 2 × 2 Contingency Table with Binary Response

                                Response
  Center    Treatment           Success    Failure    Marginals
  1         Drug                n111       n112       n11+
            Control             n121       n122       n12+
  2         Drug                n211       n212       n21+
            Control             n221       n222       n22+
  ·           ·                   ·          ·          ·
  m         Drug                nm11       nm12       nm1+
            Control             nm21       nm22       nm2+

With πst1 = P(y = 1 | T = t, C = s) denoting the conditional probability of success given treatment t and center s, a logit model that uses center as a covariate is given by

  log(πst1/πst2) = β0 + xt βT + βC(s)    (3)

where πst2 = 1 − πst1 = P(y = 0 | T = t, C = s), and x1 = 1 (for T = 1) and x2 = 0 (for T = 2). The parameter βT specifies the effect of treatment whereas βC(s) represents the center effect. The effect on the odds is seen from the form

  πst1/πst2 = exp(β0) · exp(βT)^{xt} · exp(βC(s)).

Thus, exp(βT) and exp(βC(s)) represent the multiplicative effects of treatment and center on the odds. If βT = 0, variation of success might exist across centers but treatment is without effect. If Equation (3) holds, the association between treatment and response for center s is given by the odds ratio

  (πs11/πs12) / (πs21/πs22)
    = [P(y = 1 | T = 1, C = s)/P(y = 0 | T = 1, C = s)] / [P(y = 1 | T = 2, C = s)/P(y = 0 | T = 2, C = s)]
    = exp(βT)

which does not depend on the center. Thus, the effect of the treatment (measured by the odds ratio) does not depend on the center, although centers might differ in efficiency. If the effect of treatment is modified by centers, a more complex model including interactions applies.

2.2 The Binary Logit Model with Covariates

The multi-center study is just one example in which observations come from several strata that should be included as control variables. Alternative covariates that influence the response may be levels of age or severity of the condition being treated. If the covariates are categorical, as, for example, age levels, one again obtains a contingency table of the form given in Table 2. In the more general case where covariates also may be metric, as, for example, age in years, data are no longer given in contingency tables. Let yi, with yi = 1 for success and yi = 0 for failure, denote the response of the ith individual (or observed unit). Together with yi one has information about treatment (xT,i) and a vector of covariates (zi), which may contain age, gender, condition, and so on. With πi = P(yi = 1 | xi, zi), the corresponding binary logit model is given by

  log(πi/(1 − πi)) = β0 + xT,i βT + zi′βz.    (4)

For two treatment groups, xT,i is a dummy variable as in Equation (3), given by xT,i = 1 if treatment 1 is applied to the ith observation and xT,i = 0 if treatment 2 is applied to the ith observation. It should be noted that i refers to observations, not treatments.
In the case with m treatment groups, xT,i represents a vector of dummy variables, xT,i = (xT(1),i, . . . , xT(m−1),i), where

  xT(s),i = 1 if treatment s is used on individual i, and 0 otherwise.

The variables xT(s),i are dummy variables that code whether treatment s is used by 0–1 coding. Alternative coding schemes are given, for example, in Fahrmeir and Tutz (1). More generally, xT,i may contain dummy variables for treatment groups as well as metric variables referring to treatment, as, for example, dosage of drug. The same holds for the covariates zi. In a multi-center study, zi is a vector of dummy variables coding centers, but it may also contain metric variables. Equation (4) accounts for the control variables given in zi but still assumes that the effect of treatment is not modified by these variables. The model is a main effect model because it contains treatment as well as covariates as separate additive effects within the logit model. By collecting intercept, treatment, and control variables into one vector xi′ = (1, xT,i′, zi′) and defining β′ = (β0, βT′, βz′), one obtains the general form of the logit model

  log(πi/(1 − πi)) = xi′β    (5)

or equivalently

  πi = exp(xi′β) / (1 + exp(xi′β)).

The last form has the general structure

  πi = F(xi′β)

where F is the logistic distribution function F(η) = exp(η)/(1 + exp(η)). The logit model is a special case of a generalized linear model [compare Kauermann & Norris in EoCT and McCullagh & Nelder (2)], with F as the (inverse) link function. Alternative (inverse) link functions in use are based on the exponential distribution F(η) = 1 − exp(−η), the extreme minimal-value distribution F(η) = 1 − exp(−exp(η)), or the normal distribution function. The latter yields the so-called probit model [see McCullagh & Nelder (2)].

2.3 Inference in Binary Response Models

Estimation of parameters may be based on the concept of maximum likelihood. The log-likelihood for Equation (5) is given by

  l(β) = Σ_{i=1}^{n} [yi log(πi) + (1 − yi) log(1 − πi)]    (6)

where πi is specified by πi = F(xi′β). Maximization of l(β) yields the maximum likelihood estimator β̂. In the general case, no explicit form of β̂ is available; therefore the estimator is computed iteratively [e.g., McCullagh & Nelder (2)]. For the accuracy of the maximum likelihood estimator, the Fisher or expected information matrix F(β) = E(−∂²l(β)/∂β ∂β′) has to be computed. One obtains

  F(β) = Σ_{i=1}^{n} xi xi′ F′(xi′β)² / [F(xi′β)(1 − F(xi′β))]

where F′(η) is the first derivative F′(η) = ∂F(η)/∂η. Under regularity conditions [see Fahrmeir & Kaufmann (3)], one obtains asymptotic existence of the maximum likelihood estimate, consistency, and asymptotic normality,

  β̂ ~ N(β, F(β)⁻¹).

If only categorical covariates are present and therefore data are given in a contingency table, estimation may also be based on the weighted least squares approach suggested by Grizzle et al. (4).

If the linear predictor ηi = xi′β is partitioned into treatment effects xT,i′βT and covariates zi′βz, the focus of interest is on the null hypothesis H0: βT = 0, although the effect of covariates may also be investigated by considering the null hypothesis H0: βz = 0. These hypotheses are special cases of the linear hypothesis

  H0: Cβ = ξ    against    H1: Cβ ≠ ξ

where C is a specified matrix and ξ is a given vector.
Let β̂ denote the maximum likelihood estimate for the model and β̃ denote the maximum likelihood estimate for the submodel under the additional restriction Cβ = ξ. Test statistics for the linear hypothesis are the likelihood ratio test

  λ = −2[l(β̃) − l(β̂)]

where l(β) is the log-likelihood given in Equation (6), the Wald test

  w = (Cβ̂ − ξ)′ [C F(β̂)⁻¹ C′]⁻¹ (Cβ̂ − ξ)

and the score test

  u = s(β̃)′ F(β̃)⁻¹ s(β̃)

where s(β) = ∂l(β)/∂β is the score function. With rg(C) denoting the rank of matrix C, all three test statistics, λ, w, and u, have the same limiting χ²-distribution,

  λ, w, u ~ χ²(rg(C))  asymptotically,

under regularity conditions, which are similar to the conditions needed for asymptotic results for maximum likelihood estimation (5). An advantage of the Wald statistic is that only one fit is necessary. Whereas for the likelihood ratio and score tests the restricted estimate β̃ has to be computed, the Wald test uses only the maximum likelihood estimate β̂. For testing single coefficients H0: βT = 0, the Wald test is the square (w = t²) of the "t-value"

  t = β̂T / √(âTT)

where âTT = vâr(β̂T) is the corresponding diagonal element of the estimated covariance matrix F(β̂)⁻¹. Most program packages report t when investigating the effect of single variables.

3 ORDINAL RESPONSES

In many cases, the response is measured in ordered categories; for example, patient condition may be given by good, fair, serious, or critical. The use of the ordering of categories allows for models that have fewer parameters than models in which ordinality of the response is ignored. For categorical data, parsimony of parameters is always recommended. In the following, let Y take values from ordered categories 1, . . . , k.

3.1 Cumulative Type Models

A simple concept for treating the ordinal response is to use a model for binary responses in which the binary response corresponds to a split between category r and r + 1, yielding binary response categories {1, . . . , r} and {r + 1, . . . , k}. The corresponding binary response model in Equation (5) with linear predictor β0r + x′β is

  P(Y ≤ r | x) = F(β0r + x′β).    (7)

Although the resulting models for r = 1, . . . , k − 1 seem to be separate models, they are linked by the common response Y. A consequence is that the restriction β01 ≤ . . . ≤ β0,k−1 must hold because P(Y ≤ r) ≤ P(Y ≤ r + 1). Moreover, because β does not depend on the category in Equation (7), it is assumed that the effect of x is the same for all splits into categories {1, . . . , r}, {r + 1, . . . , k}. As categories are cumulated in Equation (7), it is referred to as a cumulative response model. It may be motivated by considering Y as a coarser version of an underlying unobservable variable Ỹ = −x′β + ε, with β01 ≤ . . . ≤ β0,k−1 denoting the category boundaries on the latent continuum that define the levels of Y. The link between the observable Y and the underlying variable Ỹ is specified by Y = r if β0,r−1 < Ỹ ≤ β0r, where ε has distribution function F.

The most common model is the proportional odds model that results from choosing F as the logistic distribution function. One obtains

  log[P(Y ≤ r | x)/P(Y > r | x)] = β0r + x′β,    r = 1, . . . , k − 1.

The model is extensively discussed in McCullagh (6). The name proportional odds derives from the property that the cumulative odds are proportional; for example, for two values x1 and x2 of the explanatory variables, one obtains

  log{ [P(Y ≤ r | x1)/P(Y > r | x1)] / [P(Y ≤ r | x2)/P(Y > r | x2)] } = (x1 − x2)′β

which does not depend on r. Therefore, the ratio of cumulative odds depends only on the explanatory variables.

3.2 Sequential Type Models

For ordered categories, it may often be assumed that they are reached successively. The sequential model may be seen as a process model that starts in category 1 and models consecutively the binary transitions to higher categories. The transition is specified by

  P(Y = r | Y ≥ r, x) = F(β0r + x′β),    r = 1, . . . , k − 1.

The binary model specifies the probability that the process stops in category r, given that category r is reached. In most common use is the sequential logistic model with F(η) = exp(η)/(1 + exp(η)). An advantage of the sequential model over the cumulative model is that no ordering restriction for the intercepts is necessary. Moreover, it may be estimated by software that handles binary regression models. The relationship to cumulative models is investigated in Läärä and Matthews (7) and Greenland (8).
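The proportional-odds property described in Section 3.1 can be verified numerically. The sketch below, using invented intercepts and slopes (not estimates from any data set), checks that the difference in cumulative log odds between two covariate vectors is the same for every split r.

```python
# Numerical check of the proportional-odds property for a hypothetical
# cumulative logit model with k = 4 ordered categories.
import numpy as np

beta0 = np.array([-1.0, 0.5, 2.0])   # ordered intercepts beta_{0r}
beta = np.array([0.8, -0.4])         # common slope vector

def cum_logit(x):
    """log P(Y <= r | x) / P(Y > r | x) for r = 1, ..., k-1."""
    return beta0 + x @ beta

x1 = np.array([1.0, 2.0])
x2 = np.array([0.0, 1.0])

print(cum_logit(x1) - cum_logit(x2))   # identical entries for every r ...
print((x1 - x2) @ beta)                # ... all equal to (x1 - x2)'beta
```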
4 NOMINAL RESPONSE

For a nominal response variable Y ∈ {1, . . . , k}, the most common model is the multinomial logit model

  P(Y = r | x) = exp(x′βr) / [1 + Σ_{s=1}^{k−1} exp(x′βs)],    r = 1, . . . , k − 1,

where k is the reference category with P(Y = k) = 1 − P(Y = 1) − . . . − P(Y = k − 1). The alternative form

  log[P(Y = r | x)/P(Y = k | x)] = x′βr    (8)

shows the similarity to the binary logit model. It should be noted that when comparing P(Y = r | x) to P(Y = k | x), which is what is basically parameterized in Equation (8), the parameter βr has to depend on r. As Y is assumed to be on a nominal scale level, the interpretation of the parameters in Equation (8) is similar to the binary logit model, with βr referring to the effect of x on the logits between category r and the reference category k. The model is easily reparameterized if not k but k0 ∈ {1, . . . , k} is considered as the reference category, by considering

  log[P(Y = r | x)/P(Y = k0 | x)] = x′(βr − βk0) = x′β̃r

where β̃r = βr − βk0 are the parameters corresponding to reference category k0.
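A brief numerical sketch of the multinomial logit model may help. The parameter vectors below are invented; the code only checks that the category probabilities sum to one and that the reference-category logit of Equation (8) is recovered.

```python
# Multinomial logit probabilities for a hypothetical k = 3 response with
# category k as the reference.
import numpy as np

betas = [np.array([0.5, -1.0]), np.array([-0.2, 0.3])]   # beta_1, beta_2
x = np.array([1.0, 2.0])

eta = np.array([x @ b for b in betas])
denom = 1.0 + np.exp(eta).sum()
probs = np.append(np.exp(eta) / denom, 1.0 / denom)       # P(Y=1), P(Y=2), P(Y=3)

print(probs, probs.sum())                                  # probabilities sum to 1
print(np.log(probs[0] / probs[-1]), x @ betas[0])          # Equation (8): equal
```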
5 INFERENCE FOR MULTICATEGORICAL RESPONSE

For data (Yi, xi), i = 1, . . . , n, the multicategorical models from Sections 3 and 4 may be written as multivariate generalized linear models

  πi = h(Xi β)

where πi′ = (πi1, . . . , πi,k−1), πir = P(Yi = r | xi), h is a link function depending on the model, and Xi is a design matrix. The multinomial model, for example, uses the block design matrix Xi = Diag(xi′, . . . , xi′), the parameter vector β′ = (β1′, . . . , βk−1′), and the link function h = (h1, . . . , hk−1) with hr(η1, . . . , ηk−1) = exp(ηr)/(1 + Σ_{s=1}^{k−1} exp(ηs)). For more details, see, for example, Fahrmeir and Tutz (1). The log-likelihood has the form

  l(β) = Σ_{i=1}^{n} { Σ_{j=1}^{k−1} yij log[πij/(1 − πi1 − . . . − πi,k−1)] + log(1 − πi1 − . . . − πi,k−1) }.

The Fisher matrix F(β) = E(−∂²l/∂β ∂β′) is given by

  F(β) = Σ_{i=1}^{n} Xi′ Wi(β) Xi

where

  Wi(β) = { [∂g(πi)/∂πi] Σi(β) [∂g(πi)/∂πi]′ }⁻¹

with g denoting the inverse function of h, g = h⁻¹, and Σi(β) = Diag(πi) − πi πi′ denoting the covariance matrix. Under regularity conditions, one asymptotically obtains

  β̂ ~ N(β, F(β)⁻¹).

Linear hypotheses H0: Cβ = ξ may be tested by likelihood ratio tests, Wald tests, or score tests in the same way as for binary response models.
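For concreteness, the Wald statistic for such a linear hypothesis can be computed directly from an estimated coefficient vector and covariance matrix. The numbers in the sketch below are hypothetical, and the covariance matrix stands in for F(β̂)⁻¹ from a fitted model.

```python
# Generic Wald test of H0: C beta = xi given hypothetical estimates.
import numpy as np
from scipy.stats import chi2

beta_hat = np.array([-1.2, 0.7, 0.4])
cov_hat = np.diag([0.10, 0.05, 0.08])           # plays the role of F(beta_hat)^{-1}
C = np.array([[0.0, 1.0, 0.0],                   # tests beta_1 = 0 and beta_2 = 0
              [0.0, 0.0, 1.0]])
xi = np.zeros(2)

diff = C @ beta_hat - xi
w = diff @ np.linalg.solve(C @ cov_hat @ C.T, diff)
df = np.linalg.matrix_rank(C)
print(f"w = {w:.2f}, p = {chi2.sf(w, df):.4f}")
```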
6 FURTHER DEVELOPMENTS

The models in the previous sections use the linear predictor ηi = xi′β. This assumption may be weakened by assuming a more flexible additive structure ηi = f(1)(xi1) + . . . + f(p)(xip), where the f(j) are unspecified functions. Several methods have been developed to estimate the components f(j); see Hastie and Tibshirani (9) and Fan and Gijbels (10) for localizing approaches and Green and Silverman (11) for penalizing approaches. For small sample sizes, asymptotic inference may not be trustworthy. Exact inference, for example, in case-control studies, is considered in Agresti (12). A short overview of exact inference is found in Agresti (13). In medical research, continuous variables are often converted into categorical variables by grouping values into categories. By using models with a flexible predictor structure, categorization may be avoided. Categorization always means loss of information (14), and the choice of cutpoints may yield misleading results [e.g., Altman et al. (15)].

7 AN EXAMPLE

Beitler and Landis (16) considered a clinical trial in which two cream preparations, an active drug and a control, are compared on their success in curing an infection [see also Agresti (12)]. The data were collected at eight centers. The present analysis is restricted to four centers, and the data are given in Table 3.

Table 3. Clinical Trial for Curing an Infection (16)

                            Response
  Center    Treatment       Success    Failure    Odds Ratio
  1         Drug            11         25         1.19
            Control         10         27
  2         Drug            16          4         1.82
            Control         22         10
  3         Drug            14          5         4.80
            Control          7         12
  4         Drug             2         14         2.29
            Control          1         16

By considering the centers as fixed (not as a random sample), the most general logistic model incorporates interactions between treatment (T) and centers (C). Testing the main effect model 1 + T + C against the model with interaction 1 + T + C + T.C yields λ = 2.66 on 4 degrees of freedom. Therefore, the interaction effect may be omitted. Further simplification does not seem advisable, as the likelihood ratio test for the relevance of the main effect treatment yields λ = 4.06 on 1 degree of freedom and, for the main effect center, λ = 48.94 on 3 degrees of freedom. The estimates for the main effect model with center 1 as reference category are given in Table 4. The Wald test for the effect of treatment shows significance at the 0.05 level, confirming the result of the likelihood ratio test.

Table 4. Estimated Effects for Main Effects Model

                   Estimate    Std Error    t-value    Tail Probability
  1 (intercept)    −1.265      0.325        −3.89      .000090
  C2                2.028      0.419         4.83      .000001
  C3                1.143      0.423         2.70      .00681
  C4               −1.412      0.622        −2.133     .03295
  T                 0.676      0.340         1.98      .04676
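This analysis can be reproduced, at least approximately, with standard software. The sketch below refits the main-effects logit model to the Table 3 counts; it assumes the statsmodels package is available, and the estimates should be close to those in Table 4 (treatment effect near 0.68, likelihood ratio statistic near 4 on 1 degree of freedom).

```python
# Refit the main-effects logit model to the Beitler-Landis counts of Table 3.
import numpy as np
import statsmodels.api as sm

# rows ordered as center 1..4, drug then control within each center
successes = np.array([11, 10, 16, 22, 14, 7, 2, 1])
failures  = np.array([25, 27,  4, 10,  5, 12, 14, 16])
endog = np.column_stack([successes, failures])

center = np.repeat([1, 2, 3, 4], 2)
drug = np.tile([1, 0], 4)                      # 1 = drug, 0 = control
exog = np.column_stack([np.ones(8),            # intercept (center 1 as reference)
                        (center == 2).astype(float),
                        (center == 3).astype(float),
                        (center == 4).astype(float),
                        drug])

fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(fit.params)                               # compare with Table 4

# likelihood ratio test for the treatment effect (drop the treatment column)
fit0 = sm.GLM(endog, exog[:, :4], family=sm.families.Binomial()).fit()
print("LR statistic:", 2 * (fit.llf - fit0.llf))   # about 4 on 1 df
```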
REFERENCES
1. L. Fahrmeir and G. Tutz, Multivariate Statistical Modelling Based on Generalized Linear Models, 2nd ed. New York: Springer, 2001.
2. P. McCullagh and J. A. Nelder, Generalized Linear Models, 2nd ed. New York: Chapman & Hall, 1989.
3. L. Fahrmeir and H. Kaufmann, Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. Ann. Stat. 1985; 13: 342–368.
4. J. E. Grizzle, C. F. Starmer, and G. G. Koch, Analysis of categorical data by linear models. Biometrics 1969; 25: 489–504.
5. L. Fahrmeir, Asymptotic likelihood inference for nonhomogeneous observations. Statist. Hefte (N.F.) 1987; 28: 81–116.
6. P. McCullagh, Regression models for ordinal data (with discussion). J. Royal Stat. Soc. B 1980; 42: 108–127.
7. E. Läärä and J. N. Matthews, The equivalence of two models for ordinal data. Biometrika 1985; 72: 206–207.
8. S. Greenland, Alternative models for ordinal logistic regression. Stat. Med. 1994; 13: 1665–1677.
9. T. Hastie and R. Tibshirani, Generalized Additive Models. London: Chapman & Hall, 1990.
10. J. Fan and I. Gijbels, Censored regression: local linear approximation and their applications. J. Amer. Statist. Assoc. 1994; 89: 560–570.
11. P. J. Green and B. W. Silverman, Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. London: Chapman & Hall, 1994.
12. A. Agresti, Categorical Data Analysis. New York: Wiley, 2002.
13. A. Agresti, A survey of exact inference for contingency tables. Stat. Sci. 1992; 7: 131–177.
14. J. Cohen, The cost of dichotomization. Appl. Psycholog. Meas. 1983; 7: 249–253.
15. D. G. Altman, B. Lausen, W. Sauerbrei, and M. Schumacher, Dangers of using "optimal" cutpoints in the evaluation of prognostic factors. J. Natl. Cancer Inst. 1994; 86: 829–835.
16. P. Beitler and J. Landis, A mixed-effects model for categorical data. Biometrics 1985; 41: 991–1000.
CAUSAL INFERENCE

YILI L. PRITCHETT
Abbott Laboratories, Abbott Park, Illinois

Causal inference refers to making statistical inference with respect to causality. An effect of a cause is always defined in the context of another cause. Cause is different from attributes; that is, causal relationships are different from associational relationships. Associational inferences use observed data to estimate how different outcomes from the same experimental unit are related. A typical example is that correlation, the most commonly used statistic to quantify the relation between two variables, does not imply any causation.

To evaluate a causal effect, an experimental setting is the desired and simplest setup. Ronald A. Fisher (1) and Jerzy Neyman (2) pioneered the design of experiments for causal inference. Fisher's sharp null hypothesis testing, using P-values to quantify the strength of the evidence when rejecting a null hypothesis, and his exact test for 2 × 2 tables were among his many significant contributions to causal inference. Neyman first came up with the notation of potential outcomes, which provided the foundation for a general framework developed later by Donald B. Rubin (3).

Rubin's causal model can be described in the setting of clinical trials as follows, where "treatment" is the cause of interest. Assume there is a population U of subjects. Let S be a binary variable of treatment assignment: S(u) = t indicates that subject u receives the experimental medicine, and S(u) = c indicates the control medicine. S(u) could differ from subject to subject. Let Y(u) be the outcome measure of the clinical trial. For each subject u, both Yt(u) and Yc(u) are potential outcomes; however, only one of them will be observed in the study, depending on which treatment the subject is exposed to. The causal effect of t on subject u is defined as

  Yt(u) − Yc(u)    (1)

Since it is not possible to observe the values of Yt(u) and Yc(u) at the same time, one cannot observe the causal effect of the experimental medicine on subject u. This is considered to be the Fundamental Problem of Causal Inference. To solve this problem, one turns to using a set of units with similar characteristics to infer the average causality. The SUTVA (stable unit treatment value assumption) is essential for ensuring the equivalence between the effect at a unit and the effect in the set of units (the population). SUTVA assumes (1) no interference between units; and (2) no hidden versions of treatment; that is, no matter how unit u receives a treatment, the observed outcome would be Yt(u) or Yc(u). When special assumptions are added to the above model, the causal effect can be inferred. Some of those assumptions are described below, again in the context of clinical trials.

1. Temporal Stability and Causal Transience Assumptions. Temporal stability assumes consistency of potential outcomes over time; that is, a subject's outcome is independent of when the treatment is given. Causal transience assumes that the effect of a previous treatment under c is transient so that it has no impact on the potential outcome under the current treatment t, and vice versa. Under these two assumptions, it is plausible that one can expose a subject to different treatments in sequence, t and c, and observe outcomes under each treatment. The causal effect can then be estimated by Yt(u) − Yc(u). These two assumptions are the fundamental assumptions supporting the validity of using a crossover design to evaluate causal effects.

2. Unit Homogeneity Assumption. Assume that Yt is homogeneous over different subjects, and so is Yc. Thus, the causal effect of t can be estimated as Yt(u1) − Yc(u2). Case-control studies operate under this assumption (4). In those types of studies, to evaluate a causal effect, subjects with similar (if not identical) demographic and baseline characteristics are matched, with the only exception being the existence of the condition that is under investigation as a potential cause. The outcome measures are compared between these two groups of subjects, and the difference is attributed to that single distinction, possession of the specific condition. The unit homogeneity assumption implies the weaker constant effect assumption, which assumes that the causal effect is a constant over the population.

3. Independence Assumption. Denote the average causal effect of T on population U as

  T = E(Yt − Yc) = E(Yt) − E(Yc)    (2)
If the average potential outcomes of Yt and Yc in the population can be estimated using observed data, the average causal effect T can then be estimated. In a parallel group design, at least half of the Yt(u) and Yc(u) are missing. What one can estimate is E(YS | S = t), the conditional average of the outcome over those who receive t; similarly, E(YS | S = c) is also estimable. When subjects are assigned at random to either group by a physical mechanism, the determination of which treatment the subject is exposed to is regarded as statistically independent of all other variables over U, including the outcomes Yt and Yc. That is, in a randomized clinical trial, if the randomization procedure is carried out correctly, it is plausible that S is independent of Yt and Yc. Therefore, we have E(Yt) = E(YS | S = t) and E(Yc) = E(YS | S = c). These equations imply

  T = E(Yt) − E(Yc) = E(YS | S = t) − E(YS | S = c)    (3)

That is, when the independence assumption holds true, the average causal effect T can be estimated by the between-group difference in sample means, as is routinely done in parallel design, randomized clinical trials. Holland (5) introduced the concept of the prima facie causal effect, TPF, an associational parameter for the joint distribution of (YS, S), defined as

  TPF = E(YS | S = t) − E(YS | S = c)    (4)

In general, TPF and T are not equal if the independence assumption does not hold.

Randomized experiments play an important role in causal inference for the reason shown above. In nonrandomized experiments, Equation (3) no longer holds true, because the units exposed to one treatment generally differ systematically from the units exposed to the other treatment. That is, the sample means are confounded with selection bias, and they cannot be used directly to infer the causal effect. A solution to this problem is to create a balancing score b(X), a function of the observed covariates X such that the treatment assignment is independent of X conditional on b(X). Therefore, the causal effect can be evaluated at each stratum characterized by b(X), and the average effect over b(X) can be used to estimate the average causal effect in the population. In that approach, the unit-level probability of treatment assignment is called the propensity score (6).

Causal inference relies heavily on how the treatment is assigned to the experimental units. Using the concept of the "assignment mechanism," where the probability that a specific treatment is assigned to an experimental unit is modeled, randomized, nonrandomized, and response-adaptive randomized experiments can be characterized within the same framework. To illustrate, let Pr(S = t | X, Yt, Yc) denote the assignment mechanism for a clinical trial, where X denotes the covariates, Yt and Yc are the potential outcomes, and YOBS is the accumulated observed outcome data. We have
• Pr(S = t | X, Yt, Yc) = Pr(S = t), when this is a classic randomized design.
• Pr(S = t | X, Yt, Yc) = Pr(S = t | X), when this is a nonrandomized design.
• Pr(S = t | X, Yt, Yc) = Pr(S = t | YOBS), when this is a response-adaptive, randomized design. In this type of design, the probability of assigning treatment t is not a constant anymore, but is dynamically adapted based on the observed response data.

Of note, causal inference is not the same in concept as causal modeling, which has been used extensively in social science. In causal modeling, path diagrams are used to describe the relationships among variables that have plausible causal or associational relationships. For example, in studying the relationship between age and job satisfaction (7), a plausible assumption was "the older the employee gets, the more satisfied he is with his job." When other intervening factors were also taken into consideration, such as the responsibility and the income associated with the job, the relation between age and job satisfaction became more complicated. One could use a path diagram to put those plausible relationships together and use regression approaches to find the answer. That is, the coefficients between variables estimated from regression models could be used to quantify the relationship from age to job satisfaction along each potential pathway, and the sum of the coefficients over several pathways provides a quantitative measure of the relationship between the two variables. One can see that the two variables of interest in this example were not causal in nature. This example demonstrates that the application of causal modeling can deliver answers for associational inference but not necessarily for causal inference. However, if all variables included in the causal modeling are causal factors, then causal modeling can be used as one of the tools to deduce causal effects.
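A small simulation can make the distinction between T and TPF concrete. In the sketch below, whose outcome distributions are invented for illustration, the difference in observed group means recovers the average causal effect under a classic randomized assignment mechanism but is biased when assignment depends on a prognostic covariate.

```python
# Simulation sketch: the prima facie effect equals the average causal effect
# under randomization, but not under covariate-dependent assignment.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)                    # a prognostic covariate
y_c = x + rng.normal(size=n)              # potential outcome under control
y_t = y_c + 1.0                           # true unit-level causal effect = 1

def prima_facie(assign_prob):
    s = rng.random(n) < assign_prob       # S = t with the given probability
    y_obs = np.where(s, y_t, y_c)
    return y_obs[s].mean() - y_obs[~s].mean()

print(prima_facie(np.full(n, 0.5)))                 # randomized: close to 1
print(prima_facie(1 / (1 + np.exp(-2 * x))))        # depends on x: biased upward
```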
REFERENCES
1. R. A. Fisher, The Design of Experiments. Edinburgh: Oliver & Boyd, 1935.
2. J. Neyman, On the application of probability theory to agricultural experiments: essay on principles, Section 9 (in Polish). Reprinted in English in Stat. Sci. 1990; 5: 463–480.
3. D. B. Rubin, Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 1974; 66: 688–701.
4. D. B. Rubin, Matched Sampling for Causal Effects. Cambridge, UK: Cambridge University Press, 2006.
5. P. W. Holland, Statistics and causal inference (with discussion). J. Am. Stat. Assoc. 1986; 81: 945–970.
6. P. R. Rosenbaum and D. B. Rubin, The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70: 41–55.
7. R. D. Retherford and M. K. Choe, Statistical Models for Causal Analysis. New York: Wiley, 1993.
CELL LINE

THERU A. SIVAKUMARAN, SUDHA K. IYENGAR
Case Western Reserve University, Department of Epidemiology and Biostatistics, Cleveland, Ohio

Culturing of cells is a process whereby cells derived from various tissues are grown with appropriate nutrients under appropriate environmental conditions. Cell cultures and lines can be used as an alternative approach to the use of live animals for studying models of physiological function. A cell line, which is formed after the first subculture of the primary culture, is a permanently established cell culture that can proliferate indefinitely under appropriate conditions. The formation of a cell line from a primary culture implies an increase in the total number of cells over several generations as well as the ultimate predominance of the cells or cell lineages with the capacity for high growth, which results in a degree of uniformity in the cell population. Although all the cells in a cell line are very similar, they may have similar or distinct phenotypes. If one cell lineage is selected to have certain specific properties in the bulk of cells in the culture, it becomes a cell strain.

1 CELL CULTURE: BACKGROUND

The concept of maintaining live cells separated from their original tissue was developed in the nineteenth century. In late 1885, Wilhelm Roux (1) established the principle of tissue culture by demonstrating that embryonic chick cells can be maintained outside the animal body in a saline solution for several days. The first success in culturing animal cells is attributed to Harrison (2), who showed that the spinal cord from a frog embryo could be cultured in a natural medium of clotted frog lymph; he also observed outgrowth of nerve fibers from the explant. The man responsible for establishing the idea of cell culture was Alexis Carrel. He showed that cells could grow for long periods in culture if they are fed regularly under aseptic conditions (3). It was not until the 1940s that evidence began to develop indicating that vertebrate cells could be maintained indefinitely in culture. Wilton Earle, at the National Cancer Institute, isolated single cells of the L cell line and showed that they form clones of cells in tissue culture (4). At around the same time, Gey et al. (5) were attempting to establish continuous cultures from rat and human tissue. One of their successes was the culture of a continuous line of cells from a human cervical carcinoma, the cell line that later became the well-known HeLa cell line (5). The cultivation of human cells received another stimulus from the development of different serum-free selective media as well as from the greater control of contamination with antibiotics and clean-air equipment.

2 PRIMARY CELL CULTURE

The primary culture is the stage that begins with the isolation of cells, lasts until the culture reaches confluence, and ends when the culture is split into multiple cultures for the first time. At this stage, the cells are usually heterogeneous but still represent the parent cell types and the expression of tissue-specific properties. A primary cell culture can be obtained by one of three methods.

2.1 Mechanical Disaggregation
The tissue is disaggregated by chopping with a scalpel or by forcing the tissue through a mesh screen or syringe needle. The resulting suspension of cells is allowed to attach to an appropriate substrate to enable growth.

2.2 Enzymatic Disaggregation
Cell–cell and cell–substrate adhesion is generally mediated by three major classes of transmembrane proteins: the cell–cell adhesion molecules, cell–substrate interaction molecules, and transmembrane proteoglycans. Proteolytic enzymes, such as trypsin, collagenase, and so on, are usually added to the tissue fragments to dissolve the cement that holds the cells together; enzymatic disaggregation is the most widely used method to isolate cells for primary culture.

2.3 Primary Explantation
This method was originally developed by Harrison (2) for initiating cell culture, and it is useful when a small amount of tissue is available. The tissue is finely chopped and rinsed, and the small fragments are seeded onto the surface of the culture vessel so that they will adhere. Adherence is facilitated by scratching the plastic surface or trapping the tissue under a coverslip, or by use of surface tension created by adding a small volume of medium with a high concentration of serum or clotted plasma. Once adhesion is achieved, the primary culture develops by outgrowth from the explant.

The tendency of animal cells in vivo to interact with one another and with the surrounding extracellular matrix is mimicked in their growth in culture. After tissue disaggregation or subculture, most normal cells, with the exception of hematopoietic cells, need to spread out on a substrate to proliferate; such cells are said to be anchorage dependent. Cells that can grow in suspension, such as hematopoietic cells, tumor cells, transformed cell lines, and so on, are known as anchorage independent. Cell adhesion is known to be mediated by specific cell surface receptor molecules in the extracellular matrix, and it is likely that secretion of extracellular matrix proteins and proteoglycans precedes spreading by the cells. These matrix components bind to the culture surface and function as a bridge between the surface and the cells. Hence, the culture surface can be conditioned by treating it with spent medium from another culture or with purified fibronectin or collagen.
3 SUBCULTURE

When cells in culture have grown and filled up the entire available surface, they must be transferred, subcultured, or passaged into fresh culture vessels to give them room for continued growth. The subculture process involves removal of the culture medium and addition of trypsin to dislodge the cells from the substrate and to disrupt the cell–cell connections. When the primary cell culture is thus subcultured, it generates a secondary culture that becomes known as a cell line. Each cell line can be subcultured several times, such that a secondary subculture then yields a tertiary culture and so on; this nomenclature is seldom used beyond the tertiary culture. With each successive subculture, the components of the cell population with the ability to proliferate most rapidly will gradually predominate, and nonproliferating or slowly proliferating cells will be diluted out. This becomes apparent after the first subculture, in which differences in the proliferative capacity of cells are compounded by varying abilities to withstand trypsinization and transfer.

To assess when a cell line will require subculture, or how long cells will take to reach a certain density, it is necessary to become familiar with the growth cycle for each cell line, as cells at different phases of the growth cycle proliferate at different rates and show differences in respiration, synthesis of a variety of products, enzyme activity, and so on. This growth cycle is typically divided into three phases (6).

3.1 The Lag Phase
When cells are taken from a stationary culture, a lag occurs before growth begins (this lag is the time that follows reseeding after subculture) (Fig. 1). During this time, cells replace elements of the cell surface and extracellular matrix lost during trypsinization, adhere to the substrate, and spread out.

3.2 The Log Phase
The log phase is the phase in which the cells grow and the population doubles. The length of the log phase depends on the density at which the cells were seeded (the initial culture), the growth rate of the cells, and the final concentration that will inhibit additional cell proliferation. The log phase, when the growth rate is as high as 90–100%, is the optimal time for using cells for experiments, as the culture is in its most reproducible form.

3.3 The Plateau Phase
At the end of the log phase, the entire available growth surface is occupied, and every cell is in contact with its surroundings; this phenomenon is known as confluence. As a result, the growth rate slows and eventually may stop; specialized proteins may now be synthesized faster than during other phases. The growth rate in the plateau phase drops to 0–10%. Because transformed cells are deficient in contact inhibition of cell motility and density limitation of cell proliferation, they will continue to proliferate for several generations even after reaching confluence, and they will reach higher saturation densities at plateau. The plateau phase of these cultures is a balance between increased proliferation and cell loss from apoptosis, which results in a greater turnover than occurs in normal cells at plateau.

The construction of a growth curve from cell counts made at regular intervals after initiation of the subculture enables the measurement of parameters, such as the lag period (or lag time), the population doubling time, and the saturation density, that determine the characteristics of the cell line under a given set of culture conditions. When these parameters are consistent, they can be used to calculate variables such as the split ratio, defined as the degree of dilution required to reinitiate a new growth cycle with a short lag period and to achieve the appropriate density for subculture at a convenient time in the future.

[Figure 1. Three different phases of the growth cycle of cultured cells, showing seeding concentration, lag, doubling time, and saturation density (cells/ml and cells/cm2 plotted against days from subculture). Reproduced from Freshney's Culture of Animal Cells: A Manual of Basic Technique (6).]
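As a rough illustration of how these growth-curve parameters are used in practice, the sketch below computes a population doubling time from two counts taken during the log phase and a split ratio from a confluent and a target seeding density; all numbers are hypothetical.

```python
# Hedged sketch: doubling time and split ratio from hypothetical counts.
from math import log2

def doubling_time_hours(n_start, n_end, hours_elapsed):
    """Population doubling time assuming exponential (log-phase) growth."""
    return hours_elapsed / log2(n_end / n_start)

def split_ratio(confluent_density, seeding_density):
    """Dilution needed to return a confluent culture to the seeding density."""
    return confluent_density / seeding_density

print(doubling_time_hours(1.0e5, 8.0e5, 72))   # counts (cells/cm2) 72 h apart -> 24 h
print(split_ratio(1.0e6, 1.25e5))              # e.g., a 1:8 split
```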
4 EVOLUTION OF CELL LINES

Cell lines that are derived from normal tissue can divide only a limited number of times, and their lifespan is determined by the cell type as well as by the environmental conditions at the first subculture. If the medium, substrate, and other conditions are appropriate, then the cell line may progress through several serial subcultures (Fig. 2). Conversely, if the conditions are not adequate, then the cell line will die out within one or two subcultures. Even for cell lines grown under satisfactory conditions, the maximum number of attainable subcultures (passages) is fixed. After a number of passages, cells enter a state in which they no longer divide, known as replicative senescence. For example, normal human fibroblasts typically divide only 25–40 times in culture before they stop. Senescence is a genetic event determined by the inability of the terminal DNA sequences in the telomeres of each chromosome to replicate at each cell division, which results in progressive shortening of the telomeres until the cell cannot divide any more (7). Human somatic cells have turned off the enzyme, called telomerase, that normally maintains the telomeres, which is why their telomeres shorten with each cell division. Cell lines that provide only a limited number of population doublings are called finite cell lines (6). Some cell lines, notably those from rodents and most tumors, are not limited by a finite lifespan, and they can continue to divide indefinitely. These cell lines are called continuous cell lines. The ability to grow continuously reflects the action of specific genetic variations, such as deletions or mutations of the p53 gene and overexpression of the telomerase gene.

[Figure 2. Evolution of a cell line: cumulative cell number over weeks in culture, from explantation and primary culture through serial passage of a finite cell line to senescence and cell death, or transformation to a continuous cell line. Reproduced from Freshney's Culture of Animal Cells: A Manual of Basic Technique (6).]
5 DEVELOPMENT OF IMMORTALIZED CELL LINES

The use of normal human cells in biomedical research is restricted in part by the limited proliferative potential of primary cultures. Thus, researchers need to re-establish fresh cultures from explanted tissue frequently, which is a tedious process. To use the same cells throughout an ongoing research project, primary cells need to extend their replicative capacity or undergo immortalization. Some cells immortalize spontaneously by passing through replicative senescence and thus adapt easily to life in culture. However, these spontaneously immortalized cells invariably have unstable genotypes and are host to numerous genetic mutations, which renders them less reliable representatives of their starting tissue's phenotype. Many applications require essentially unlimited numbers of carefully characterized cells with the desired phenotypes. Therefore, the ideal immortalization protocol would produce cells that are not only capable of extended proliferation but also possess the same genotype and tissue markers as their parental tissue. The most commonly used method to attain this objective is the transduction of normal cells, before they enter senescence, with genes from a DNA tumor virus, such as simian virus 40 (SV40), a human papillomavirus (HPV), or Epstein-Barr virus (EBV).

5.1.1 SV40. The major region of the SV40 oncoprotein encoded by the early region DNA, the large T antigen (LT) gene, inactivates the p53 and Rb proteins. Cells that express these oncoproteins proliferate beyond the point of senescence. Transduction of normal cells with the SV40 early region genes by transfection of an expression plasmid remains a very common immortalization technique for almost any human cell type (8).

5.1.2 HPV. The E6 and E7 oncogenes of high oncogenic risk HPV strains, HPV types 16 and 18, have been shown to assist the immortalization of several different cell types. The E6 protein causes degradation of the p53 protein, upregulates c-myc expression, and also partially activates telomerase (9). The HPV16 E7 protein induces degradation of the Rb protein via the ubiquitin-proteasome pathway, and this mechanism explains the ability of the E7 gene to immortalize cells independently of the E6 gene (10).

5.1.3 EBV. This herpes virus has the unique ability to transform B lymphocytes into permanent, latently infected lymphoblastoid cell lines. Every infected cell carries multiple extrachromosomal copies of the viral episome and constitutively expresses a limited set of viral gene products, called latent proteins, that comprise six EBV nuclear antigens (EBNAs 1, 2, 3A, 3B, 3C, and -LP) and three latent membrane proteins (LMPs 1, 2A, and 2B). The latent membrane protein 1 and EBNA1 are likely to be required for EBV to immortalize B lymphocytes; they have been shown to mediate an increase in p53 levels via activation of the NF-kB transcription factor (11).

The process of transfection or infection extends the proliferative capacity of cells to 20–30 population doublings before they enter a state called crisis. In some cultures, very few cells (about 1 per 10^7) recommence proliferation after a variable period of crisis, probably because of additional genetic changes, and form an immortalized cell line. The limitation of transformation with DNA tumor viruses is that the proteins they encode may cause undesirable changes in the resulting immortalized cell lines (e.g., loss of some differentiated properties and loss of normal cell-cycle checkpoint controls). These unwanted changes can be avoided by direct induction of telomerase activity, accomplished by transfecting the cells with the telomerase gene hTRT, which extends the life span of the cell line such that a proportion of these cells become immortal but not malignantly transformed (7).

5.2 Characterization and Authentication of Cell Lines

Research that involves the use of cell lines demands precise knowledge of the purity and species of origin of the working cell lines. In facilities where multiple cell
5
lines are maintained, the possibility of cross contamination can occur. In some cases, the contaminating colony of cells in the culture may be identified visually. In other cases, this unintentional coculture cannot be determined by visual inspection. Therefore, it is important to verify the identity and purity of cell cultures periodically. In the absence of such a monitoring system, it is possible that interspecies and intraspecies cell line contamination may occur, which results in the generation of false conclusions. Apart from this cross contamination, most human cell lines are prone to various sorts of genetic rearrangements, such as chromosomal aneuploidy, deletion of chromosomal regions, and so on, which affect the biochemical, regulatory, and other phenotypic features of cells during their cultivation. Therefore, it would be very useful to monitor the authenticity of the cell lines permanently and/or to have a comprehensive set of standard tests for confirming their cellular identity. Finally, instances occur when hybrid cells lines from two species are created, which are called somatic cell hybrids (e.g., culture lines that are developed with a mix of human and mouse cells through Sendai virus fusion). These types of hybrids predominantly retain the genome of one of the species but also keep some chromosomal complement from the alternate species. Herein, it would be necessary to determine which species contributed which element of the genomic complement. The standard approach that is generally followed is derivation of multiple immunological and genetic marker system profiles from each cell culture to be tested. Once an early molecular signature of the cells is obtained, it can later be compared with all future cell lines for identity validation. Some methods used to derive these profiles are described below. 5.2.1 Isoenzyme Analysis. Isoenzymes are enzymes that exhibit interspecies and intraspecies polymorphisms. Isoenzymes have similar catalytic properties but differ in their structure, and they can be detected based on their differences in electrophoretic mobility. By studying the isoenzymes present in the cell lines, it would be possible
to identify the species from which the cell line was derived.

5.2.2 Chromosomal Analysis. Chromosome content is one of the best-defined criteria for differentiating cell lines derived from more than one species and sex. Chromosomal banding patterns (e.g., trypsin-Giemsa banding or G-banding) can distinguish between normal and malignant cells and can also distinguish between human and mouse chromosomes. Thus, this technology is used frequently in cell line characterization. Fluorescence in situ hybridization is another powerful and rapid method for detecting aneuploidy, chromosomal deletions, and so on. Using different combinations of chromosome- and species-specific probes to hybridize to individual chromosomes, it would be possible to identify chromosomal fragments in metaphases of interspecies hybrid cells.

5.2.3 DNA Profiling. This method is based on the existence of dispersed hypervariable regions of tandem-repetitive nucleotide sequences in the genome (12). The original approach is based on using multilocus Variable Number of Tandem Repeats (VNTR) probes and Southern blotting, which is a DNA analysis technique that works on the principle that smaller fragments of DNA electrophorese at rates faster than larger fragments of DNA. When the DNA from a cell line is analyzed by Southern blotting with probes targeted to a single locus, fragments of unique size, characteristic of the many different alleles encoding that particular locus, are identified. Given the high degree of heterogeneity at every locus, each cell line produces two specific bands. After examining many such VNTR loci, application of multiple single-locus probes to a single enzyme digest yields a unique cell line profile. This profile is cell line specific unless more than one cell line is derived from the same individual or if highly inbred donor subjects have been used. More recently, other techniques that can produce profiles more rapidly have been advocated. Thus, the second approach is a more rapid method that involves amplification of microsatellite markers using species-specific primers by polymerase chain reaction; this method is used only for human cell lines. The
banding pattern produced by electrophoretic separation of amplified DNA products can be compared with archived profiles on a database, which is similar to what is done for forensic profiling. Like multilocus VNTR analysis, this technique can also be used to demonstrate continuity between master and working banks and to detect contamination and instability in cell lines.

5.3 Cell Banks

Advances in cell technology enable scientists to isolate and cultivate any tissue cells from any species of interest. Alternatively, well-characterized cell lines can be obtained from cell banks. Several cell banks, such as the American Type Culture Collection (ATCC), the European Collection of Animal Cell Cultures, the Coriell Institute for Medical Research, and so on, acquire, preserve, authenticate, and distribute reference cell lines to the scientific community. Currently, the ATCC houses over 4000 cell lines derived from over 150 different species (Table 1). Once obtained, cell lines can be preserved in cell banks for future use.
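As an illustration of the database comparison described under DNA profiling (Section 5.2.3), the short Python sketch below compares a query microsatellite (STR) profile with archived reference profiles by counting shared allele calls. The marker names, allele values, and the 0.8 match threshold are illustrative assumptions only and do not reproduce any repository's actual matching procedure.

    # Illustrative comparison of short tandem repeat (STR) profiles, stored as
    # {marker: set of allele calls}. Profiles and threshold are hypothetical.
    def match_score(query, reference):
        """Percent match: shared alleles counted twice, divided by the total
        number of alleles at the markers present in both profiles."""
        shared = sum(len(query[m] & reference[m]) for m in query if m in reference)
        total = sum(len(query[m]) + len(reference[m]) for m in query if m in reference)
        return 2.0 * shared / total if total else 0.0

    archived = {
        "cell line A": {"TH01": {7, 9.3}, "D5S818": {11, 12}, "vWA": {16, 18}},
        "cell line B": {"TH01": {6, 8},   "D5S818": {9, 13},  "vWA": {14, 17}},
    }
    query = {"TH01": {7, 9.3}, "D5S818": {11, 12}, "vWA": {16, 17}}

    for name, reference in archived.items():
        score = match_score(query, reference)
        print(name, round(score, 2), "possible match" if score >= 0.8 else "no match")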
REFERENCES
1. J. Paul, Achievement and challenge. In: C. Barigozzi (ed.), Progress in Clinical and Biological Research, vol. 26. New York: Alan R Liss, Inc., 1977, pp. 3–10.
2. R. G. Harrison, Observations on the living developing nerve fiber. Proc. Soc. Exp. Biol. Med. 1907; 4: 140–143.
3. A. Carrel, Artificial activation of the growth in vitro of connective tissue. J. Exp. Med. 1913; 17: 14–19.
4. W. R. Earle, E. L. Schilling, T. H. Stark, N. P. Straus, M. F. Brown, and E. Shelton, Production of malignancy in vitro; IV: the mouse fibroblast cultures and changes seen in the living cells. J. Natl. Cancer Inst. 1943; 4: 165–212.
5. G. O. Gey, W. D. Coffman, and M. T. Kubicek, Tissue culture studies of the proliferative capacity of cervical carcinoma and normal epithelium. Cancer Res. 1952; 12: 364–365.
6. R. I. Freshney, Culture of Animal Cells: A Manual of Basic Technique. New York: Wiley-Liss, 2000.
Table 1. List of Some Commonly Used Cell Lines

Cell Line    Cell Type      Origin                        Age
MRC-5        Fibroblast     Human lung                    Embryonic
MRC-9        Fibroblast     Human lung                    Embryonic
BHK21-C13    Fibroblast     Syrian hamster kidney         Newborn
HeLa         Epithelial     Human cervix                  Adult
293          Epithelial     Human kidney                  Embryonic
3T3-A31      Fibroblast     Mouse BALB/c                  Embryonic
CHO-K1       Fibroblast     Chinese hamster ovary         Adult
WI-38        Fibroblast     Human lung                    Embryonic
ARPE-19      Epithelial     Human retina (RPE)            Adult
C2           Fibroblastoid  Mouse skeletal muscle         Embryonic
BRL3A        Fibroblast     Rat liver                     Newborn
NRK49F       Fibroblast     Rat kidney                    Adult
A2780        Epithelial     Human ovary                   Adult
A9           Fibroblast     Mouse subcutaneous            Adult
B16          Fibroblastoid  Mouse melanoma                Adult
MOG-G-CCM    Epithelioid    Human glioma                  Adult
SK/HEP-1     Endothelial    Human hepatoma                Adult
Caco-2       Epithelial     Human colon                   Adult
HL-60        Suspension     Human myeloid leukemia        Adult
Friend       Suspension     Mouse spleen                  Adult
ZR-75-1      Epithelial     Human breast, ascites fluid   Adult
C1300        Neuronal       Rat neuroblastoma             Adult
HT-29        Epithelial     Human colon                   Adult
KB           Epithelial     Human oral                    Adult
Vero         Fibroblast     Monkey kidney                 Adult
7. A. G. Bodnar, M. Ouellette, M. Frolkis, S. E. Holt, C. P. Chiu, G. B. Morin, et al., Extension of life-span by introduction of telomerase into normal human cells. Science 1998; 279: 349–352.
8. L. V. Mayne, T. N. C. Price, K. Moorwood, and J. F. Burke, Development of immortal human fibroblast cell lines. In: R. I. Freshney and N. R. Cooper (eds.), Culture of Immortalized Cells. New York: Wiley-Liss, 1996, pp. 77–93.
9. A. J. Klingelhutz, S. A. Foster, and J. K. McDougall, Telomerase activation by the E6 gene product of human papillomavirus type 16. Nature 1996; 380: 79–82.
10. S. N. Boyer, D. E. Wazer, and V. Band, E7 protein of human papilloma virus-16 induces degradation of retinoblastoma protein through the ubiquitin-proteasome pathway. Cancer Res. 1996; 56: 4620–4624.
11. W. P. Chen and N. R. Cooper, Epstein-Barr virus nuclear antigen 2 and latent membrane protein independently transactivate p53 through induction of NF-kappa B activity. J. Virol. 1996; 70: 4849–4853.
12. A. J. Jeffreys, V. Wilson, and S. L. Thein, Individual-specific fingerprints of human DNA. Nature 1985; 316: 76–79.
FURTHER READING J. P. Mather and D. Barnes, eds. Animal cell culture methods. In: Methods in Cell Biology, vol. 57. San Diego, CA: Academic Press, 1997.
CROSS-REFERENCES In vitro/in vivo correlation Pharmacokinetic study Toxicity Pharmacogenomics Microarray DNA Bank
CENSORED DATA
PER KRAGH ANDERSEN
University of Copenhagen, Copenhagen, Denmark

In classical statistics, the observations are frequently assumed to include independent random variables X_1, . . . , X_n, with X_i having the density function

f_i^θ(x) = α_i^θ(x) S_i^θ(x),

where α_i^θ(x) is the hazard function, S_i^θ(x) is the survival function, and θ is a vector of unknown parameters. Then inference on θ may be based on the likelihood function,

L(θ) = ∏_i f_i^θ(X_i),

in the usual way. In survival analysis, however, one can rarely avoid various kinds of incomplete observation. The most common form of this is right-censoring, where the observations are

(X̃_i, D_i),  i = 1, . . . , n,   (1)

where D_i is the indicator I{X̃_i = X_i}, and X̃_i = X_i, the true survival time, if the observation of the lifetime of i is uncensored, and X̃_i = U_i, the time of right-censoring, otherwise. Thus, D_i = 1 indicates an uncensored observation, and D_i = 0 corresponds to a right-censored observation. Other kinds of incomplete observation will be discussed below. Survival analysis, then, deals with ways in which inference on θ may be performed based on the censored sample (1). We would like to use the function

L^c(θ) = ∏_i α_i^θ(X̃_i)^{D_i} S_i^θ(X̃_i) = ∏_i f_i^θ(X̃_i)^{D_i} S_i^θ(X̃_i)^{1−D_i}   (2)

for inference, but there are two basic problems:

1. The presence of censoring may alter the hazard function of the lifetime X_i, i.e. the conditional distribution of X_i, given that i is alive at t (X_i ≥ t) and uncensored at t (U_i ≥ t), may be different from what it was in the uncensored case, i.e. just given X_i ≥ t (dependent censoring).

2. The observed right-censoring times, U_i, may contain information on θ (informative censoring).

An example of a dependent censoring scheme would be if, in a clinical trial with survival times as the outcome variables, one removed patients from the study while still alive and when they appeared to be particularly ill (or particularly well), so that patients remaining at risk are not representative of the group that would have been observed in the absence of censoring. In other words, dependent censoring represents a dynamic version of what in an epidemiologic context would be termed a selection bias. An example is provided below (Example 1). Mathematical formulations of independent censoring (conditions on the joint distribution of X_i and U_i) may be given, and it may be shown that several frequently used models for the generation of the times of right-censoring satisfy these conditions. The difficulty in a given practical context lies in the fact that the conditions may be impossible to verify, since they refer to quite hypothetical situations. The second concept mentioned, noninformative censoring, is simpler and relates to the fact that if censoring is informative, then a more efficient inference on θ may be obtained than the one based on (2); see below.
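For illustration, the function (2) can be maximized numerically once a parametric hazard is specified. The Python sketch below assumes exponential lifetimes, i.e. a constant hazard α_i^θ(x) = θ and S^θ(x) = exp(−θx), with independent uniform censoring times; the particular distributions and sample size are arbitrary choices made only for the example.

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Right-censored exponential model: by (2),
    #   log Lc(theta) = sum(D_i) * log(theta) - theta * sum(X_tilde_i).
    rng = np.random.default_rng(0)
    n = 1000
    true_theta = 0.5
    lifetimes = rng.exponential(scale=1.0 / true_theta, size=n)   # X_i
    censoring = rng.uniform(0.0, 4.0, size=n)                     # U_i, independent of X_i
    x_tilde = np.minimum(lifetimes, censoring)                    # observed times X_tilde_i
    d = (lifetimes <= censoring).astype(float)                    # D_i = 1 if uncensored

    def neg_log_lik(theta):
        return -(d.sum() * np.log(theta) - theta * x_tilde.sum())

    fit = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
    # The closed-form maximizer sum(D_i) / sum(X_tilde_i) should agree with fit.x.
    print(fit.x, d.sum() / x_tilde.sum())

Under this independent random censorship scheme the estimate is consistent for θ; the two problems listed above describe exactly the situations in which such an analysis would break down.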
1 INDEPENDENT CENSORING

The general definition of independent censoring given by Andersen et al. [2, Section III.2.2] for multivariate counting processes has the following interpretation for the special case of survival analysis with time-fixed covariates. The basic (uncensored) model is that, conditional on covariates Z = (Z_1, . . . , Z_n), the lifetimes X_1, . . . , X_n are independent, X_i having the hazard function

α_i^θ(t | Z_i) ≈ P_θφ(X_i ∈ I_dt | X_i ≥ t, Z)/dt.   (3)

Here, I_dt is the interval [t, t + dt) and P_θφ is the joint distribution of X_1, . . . , X_n, Z, and the censoring times. Note that the hazard function only depends on θ, i.e. φ is a nuisance parameter. Because of the conditional independence of the X_i it follows that

P_θφ(X_i ∈ I_dt | F_t−) ≈ α_i^θ(t | Z_i) I{X_i ≥ t} dt,

where the history F_t− contains Z and all information on X_1, . . . , X_n from the interval [0, t), i.e. values of X_i for i with X_i < t and the information that X_j ≥ t for j with X_j ≥ t. Let there now be given right-censoring times U_1, . . . , U_n and define the enlarged history G_t as the one containing F_t and all information on U_1, . . . , U_n from the interval [0, t], i.e. values of U_i ≤ t and the information that U_j ≥ t for those j where U_j ≥ t. The condition for independent censoring is then that

P_θφ(X_i ∈ I_dt | F_t−) = P_θφ(X_i ∈ I_dt | G_t−).   (4)

It follows that simple type I censoring, where all U_i are equal to a fixed time, u_0, and simple type II censoring, where all U_i are equal to the kth smallest lifetime X_(k) for some k between 1 and n, are both independent, since the right-censoring times in these cases give rise to no extra randomness in the model; that is, F_t = G_t. In some models, U_1, . . . , U_n are assumed to be independent given Z, and Z_1, . . . , Z_n are independent identically distributed (iid). Then the assumption (4) reduces to

α_i^θ(t | Z_i) ≈ P_θφ(X_i ∈ I_dt | X_i ≥ t, U_i ≥ t, Z)/dt   (5)

and it is fulfilled, e.g. if U_i and X_i are independent given Z_i. This is, for instance, the case in the simple random censorship model where U_1, . . . , U_n are iid and independent of X_1, . . . , X_n. Some authors take the condition (5) (which is less restrictive than (4)) as the definition of independent censoring; see, for example, [6, p. 128]. However, (4) may be generalized to other models based on counting processes, and both (4) and (5) cover the most frequently used mathematical models for the right-censoring mechanisms. These include both the models already mentioned, i.e. simple type I, type II, and random censorship, and various generalizations of these (e.g. progressive type I censorship (cf. Example 2, below), general random censorship, and randomized progressive type II censorship; see [2, Section III.2.2]). Earlier contributions to the definition and discussion of independent censoring are the monographs by Kalbfleisch & Prentice [13, p. 120] and Gill [7, Theorem 3.1.1] and the papers by Cox [5], Williams & Lagakos [16], Kalbfleisch & MacKay [12], and Arjas & Haara [3], all of whom give definitions that are close or equivalent to (5). Another condition for independent censoring, stronger than (5) but different from (4), is discussed by Jacobsen [11]. From (4) and (5) it is seen that censoring is allowed to depend on covariates as long as these are included in the model for the hazard function of the lifetime distribution in (3). Thus, an example of a dependent censoring scheme is one where the distribution of U_i depends on some covariates that are not included there. This is illustrated in the following example.

1.1 Example 1: Censoring Depending on Covariates

Suppose that iid binary covariates, Z_1, . . . , Z_n, have P_θφ(Z_i = 1) = 1 − P_θφ(Z_i = 0) = φ, and that X_1, . . . , X_n are iid with survival function S(t). The Kaplan–Meier estimator Ŝ(t) based on the X_i then provides a consistent estimate of θ = S(·), the marginal distribution of X_i. This may be written as S(t) = φ S_1(t) + (1 − φ) S_0(t), where S_j(t), for j = 0, 1, is the conditional distribution given Z_i = j. Note that these may be different, e.g. S_1(t) < S_0(t) if individuals with Z_i = 1 are at higher risk than those with Z_i = 0. Define now the right-censoring times U_i by

U_i = u_0 if Z_i = 1,   U_i = +∞ if Z_i = 0.
Then, for t < u_0, the Kaplan–Meier estimator will still consistently estimate S(t), while for t > u_0, Ŝ(t)/Ŝ(u_0) will estimate S_0(t)/S_0(u_0). If, however, the covariate is included in the model for the distribution of X_i, i.e. θ = [S_0(·), S_1(·)], then Ŝ_j(t), the Kaplan–Meier estimator based on individuals with Z_i = j, j = 0, 1, will consistently estimate the corresponding S_j(t), also based on the right-censored sample (though, of course, no information will be provided about S_1(t) for t > u_0). It is seen that censoring is allowed to depend on the past and on external (in the sense of conditionally independent) random variation. This means that if, in a lifetime study, sex and age are included as covariates, then a right-censoring scheme, where, say, every year, one out of the two oldest women still alive and uncensored is randomly (e.g. by flipping a coin) chosen to be censored, is independent. However, a right-censoring scheme depending on the future is dependent. This is illustrated in the following example.

1.2 Example 2: Censoring Depending on the Future

Suppose that, in a clinical trial, patients are accrued at calendar times T_1, . . . , T_n and that they have iid lifetimes X_1, . . . , X_n (since entry) independent of the entry times. The study is terminated at calendar time t_0 and the entry times are included in the observed history, i.e. Z_i = T_i in the above notation. If, at t_0, all patients are traced and those still alive are censored (at times U_i = t_0 − T_i) and, for those who have died, their respective lifetimes, X_i, are recorded, then this right-censoring is independent (being deterministic, given the entry times, so-called progressive type I censoring). Consider now, instead, the situation where patients are only seen, for instance, every year, i.e. at times T_i + 1, . . . , T_i + k_i ≤ t_0, and suppose that if a patient does not show up at a scheduled follow-up time, then this is because he or she has died since last follow-up and the survival time is obtained. Suppose, further, that for the patients who are alive at the time, T_i + k_i, of their last scheduled follow-up, and who die before time t_0, there is a certain probability, φ, of obtaining information on the failure, whereas for those
who survive past t_0 nothing new is learnt. If these extra survival times are included in the analysis and if everyone else is censored at k_i, then the right-censoring scheme is dependent. This is because the fact that patient i is censored at k_i tells the investigator that this patient is likely not to die before t_0, and the right-censoring, therefore, depends on the future. To be precise, if the average probability of surviving past t_0, given survival until the last scheduled follow-up time, is 1 − π, then the probability of surviving past t_0, given censoring at the time of the last scheduled follow-up, is (1 − π)/[π(1 − φ) + 1 − π], which is 1 if φ = 1, 1 − π if φ = 0, and between 1 − π and 1, otherwise. If, alternatively, everyone still alive at time T_i + k_i were censored at k_i, then the censoring would be independent (again being deterministic given the entry times). Another censoring scheme that may depend on the future relative to "time on study", but not relative to calendar time, occurs in connection with testing with replacement; see, for example, [8]. Let us finally in this section discuss the relation between independent right-censoring and competing risks. A competing risks model with two causes of failure, d and c, is an inhomogeneous Markov process W(·) with a transient state 0 ("alive"), two absorbing states d and c, and two cause-specific hazard functions α_0d(t) and α_0c(t); see, e.g., Andersen et al. [1]. This generates two random variables, X = inf[t : W(t) = d] and U = inf[t : W(t) = c], which are incompletely observed since the observations consist of the transition time X̃ = X ∧ U and the state W(X̃) = d or c reached at that time. The elusive concept of "independent competing risks" (e.g. [13, Section 7.2]) now states that in a population where the risk c is not operating, the hazard function for d is still given by α_0d(t). This condition is seen to be equivalent to censoring by U being independent. However, since the population where a given cause of failure is eliminated is
usually completely hypothetical in a biological context, this formal equivalence between the two concepts is of little help in a practical situation and, as is well known from the competing risks literature (e.g. [4], [15], and [13, Chapter 7]), statistical independence of the random variables X and U cannot be tested from the incomplete observations [X̃, W(X̃)]. What can be said about the inference on the parameter θ = α_0d(·) based on these data is that consistent estimation of θ may be obtained by formally treating failures from cause c as right-censorings, but that this parameter has no interpretation as the d failure rate one would have had in the hypothetical situation where the cause c did not operate. For the concept of independent censoring to make sense, the "uncensored experiment" described in the beginning of this section should, therefore, be meaningful.

2 LIKELIHOODS: NONINFORMATIVE CENSORING

The right-censored data will usually consist of

(X̃_i, D_i, Z_i; i = 1, . . . , n)

and, under independent censoring, the likelihood can then be written using product-integral notation

L(θ, φ) = P_θφ(Z) ∏_i ∏_{t≥0} { α_i^θ(t)^{D_i(dt)} [1 − α_i^θ(t)dt]^{1−D_i(dt)} γ_i^θφ(t)^{C_i(dt)} [1 − γ_i^θφ(t)dt]^{1−C_i(dt)} }.   (6)
Here, D_i(dt) = I{X_i ∈ I_dt}, C_i(dt) = I{U_i ∈ I_dt}, and α_i^θ(t) and γ_i^θφ(t) are the conditional hazards of failure and censoring, respectively, given the past up until t− (including covariates). The likelihood (6) may be written as L(θ, φ) = L^c(θ) L*(θ, φ), with L^c(θ) given by (2) and where the contributions from censoring and covariates are collected in L*(θ, φ). Thus, the function (2), which is usually taken as the standard censored data likelihood, is, under independent
censoring, a partial likelihood on which a valid inference on θ may be based. It is only the full likelihood for θ if L*(θ, φ) does not depend on θ, which is the case if censoring (and covariates) are noninformative. Thus, noninformative censoring is a statistical concept (while the concept of independent censoring is probabilistic) and means that the conditional hazard of censoring γ_i^θφ(t) does, in fact, not depend on θ, the parameter of interest. An example of an informative right-censoring scheme could be in a study with two competing causes of failure and where only one of the two cause-specific failure rates is of interest; if the two cause-specific failure rates are proportional (as in the so-called Koziol–Green model for random censoring [14]), then the failures from the second cause (the censorings) will carry information on the shape of the hazard function for the failure type of interest. It is, however, important to notice that even if the censoring is informative, then inference based on (2) will still be valid (though not fully efficient), and as it is usually preferable to make as few assumptions as possible about the distribution of the right-censoring times, the (partial) likelihood (2) is often the proper function to use for inference.

3 OTHER KINDS OF INCOMPLETE OBSERVATION

When observation of a survival time, X, is right-censored, then the value of X is only known to belong to an interval of the form [U, +∞). This is by far the most important kind of censoring for survival data, but not the only one. Thus, the observation of X is interval-censored if the value of X is only known to belong to an interval [U, V), and it is said to be left-censored if U = 0. It was seen above that under independent right-censoring a right-censored observation, U_i, contributed to the partial likelihood function with a factor S^θ(U_i), which was also the contribution to the full likelihood under noninformative censoring. Similarly, concepts of independent and noninformative interval-censoring may be defined as leading to a contribution of S^θ(U_i) − S^θ(V_i) to, respectively, the partial and the full likelihood.
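The same bookkeeping extends directly to these other kinds of incomplete observation. The sketch below is a schematic Python example, assuming (for illustration only) a Weibull survival function S^θ(x) = exp[−(x/b)^a]; it accumulates the log-likelihood contributions for exact, right-censored, and interval-censored observations, a left-censored time being an interval with left endpoint 0.

    import numpy as np

    def surv(x, shape, scale):
        # Weibull survival function S(x) = exp(-(x/scale)^shape)
        return np.exp(-(np.asarray(x, dtype=float) / scale) ** shape)

    def dens(x, shape, scale):
        x = np.asarray(x, dtype=float)
        return (shape / scale) * (x / scale) ** (shape - 1) * surv(x, shape, scale)

    def log_lik(observations, shape, scale):
        """observations: list of (kind, a, b); kind is 'exact' (failure at a),
        'right' (right-censored at a), or 'interval' (failure known in [a, b))."""
        total = 0.0
        for kind, a, b in observations:
            if kind == 'exact':
                total += np.log(dens(a, shape, scale))      # contribution f(a)
            elif kind == 'right':
                total += np.log(surv(a, shape, scale))      # contribution S(a)
            else:
                total += np.log(surv(a, shape, scale) - surv(b, shape, scale))  # S(a) - S(b)
        return total

    obs = [('exact', 2.3, None), ('right', 5.0, None),
           ('interval', 1.0, 4.0), ('interval', 0.0, 2.0)]  # the last one is left-censored
    print(log_lik(obs, shape=1.2, scale=3.0))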
These concepts have received relatively little attention in the literature; however, this way of viewing censoring is closely related to the concept of coarsening at random. Formally, grouped data, where for each individual the lifetime is known only to belong to one of a fixed set of intervals [u_{k−1}, u_k) with 0 = u_0 < u_1 < · · · < u_m = +∞, are also interval-censored. However, the fact that the intervals are the same for everyone simplifies the likelihood to a binomial-type likelihood with parameters p_k^θ = S^θ(u_{k−1}) − S^θ(u_k), k = 1, . . . , m. Let us finally remark that while, following Hald [9, 10, p. 144], censoring occurs when we are able to sample a complete population but individual values of observations above (or below) a given value are not specified, truncation corresponds to sampling from an incomplete population, i.e. from a conditional distribution. Left-truncated samples, where an individual is included only if his or her lifetime exceeds some given lower limit, also occur frequently in the analysis of survival data, especially in epidemiologic studies where hazard rates are often modeled as a function of age and where individuals are followed only from age at diagnosis of a given disease or from age at employment in a given factory.
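A minimal numerical illustration of the binomial-type likelihood for grouped data mentioned above, assuming (for the example only) an exponential survival function and hypothetical interval counts:

    import numpy as np

    # Grouped lifetimes: p_k = S(u_{k-1}) - S(u_k) with S(t) = exp(-theta*t), u_m = +inf.
    theta = 0.4
    cuts = np.array([0.0, 1.0, 2.0, 5.0, np.inf])   # 0 = u_0 < u_1 < ... < u_m = +inf
    surv = np.exp(-theta * cuts)                    # S(u_k); S(+inf) = 0
    p = surv[:-1] - surv[1:]                        # p_1, ..., p_m (they sum to 1)

    counts = np.array([30, 22, 31, 17])             # hypothetical failures per interval
    log_lik = np.sum(counts * np.log(p))            # multinomial log-likelihood in the p_k
    print(p, log_lik)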
REFERENCES
1. Andersen, P.K., Abildstrom, S. & Rosthøj, S. (2002). Competing risks as a multistate model, Statistical Methods in Medical Research 11, 203–215.
2. Andersen, P.K., Borgan, Ø., Gill, R.D. & Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer-Verlag, New York.
3. Arjas, E. & Haara, P. (1984). A marked point process approach to censored failure data with complicated covariates, Scandinavian Journal of Statistics 11, 193–209.
4. Cox, D.R. (1959). The analysis of exponentially distributed life-times with two types of failure, Journal of the Royal Statistical Society, Series B 21, 411–421.
5. Cox, D.R. (1975). Partial likelihood, Biometrika 62, 269–276.
6. Fleming, T.R. & Harrington, D.P. (1991). Counting Processes and Survival Analysis. Wiley, New York.
7. Gill, R.D. (1980). Censoring and stochastic integrals, Mathematical Centre Tracts 124, Mathematisch Centrum, Amsterdam.
8. Gill, R.D. (1981). Testing with replacement and the product limit estimator, Annals of Statistics 9, 853–860.
9. Hald, A. (1949). Maximum likelihood estimation of the parameters of a normal distribution which is truncated at a known point, Skandinavisk Aktuarietidsskrift 32, 119–134.
10. Hald, A. (1952). Statistical Theory with Engineering Applications. Wiley, New York.
11. Jacobsen, M. (1989). Right censoring and martingale methods for failure time data, Annals of Statistics 17, 1133–1156.
12. Kalbfleisch, J.D. & MacKay, R.J. (1979). On constant-sum models for censored survival data, Biometrika 66, 87–90.
13. Kalbfleisch, J.D. & Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York.
14. Koziol, J.A. & Green, S.B. (1976). A Cramér–von Mises statistic for randomly censored data, Biometrika 63, 465–474.
15. Tsiatis, A.A. (1975). A nonidentifiability aspect of the problem of competing risks, Proceedings of the National Academy of Sciences 72, 20–22.
16. Williams, J.A. & Lagakos, S.W. (1977). Models for censored survival analysis: constant sum and variable sum models, Biometrika 64, 215–224.
CENTER FOR DEVICES AND RADIOLOGICAL HEALTH (CDRH)
the product’s safety. These specific regulations may include requirements for meeting performance standards recognized by the FDA, postmarket surveillance, patient registries, or other appropriate requirements. Class III: Premarket Approval. Class III devices are life supporting, life sustaining, or important in preventing impairment of human health. Because general controls may be insufficient to provide reasonable assurance of the device’s safety and effectiveness, FDA preapproval is required before it is marketed. Under Class III regulations, devices such as heart valves, breast implants, and cranial electrotherapy stimulators must be reviewed for safety and effectiveness before marketing.
The U.S. Food and Drug Administration’s Center for Devices and Radiological Health (CDRH) ensures the safety and effectiveness of medical devices and the safety of radiologic products. 1
MEDICAL DEVICES
The Food, Drug, and Cosmetic (FD&C) Act defines a medical device as any health-care product that does not achieve its principal intended purposes by chemical action or by being metabolized. Under this definition, a ‘‘device’’ can be as simple as a tongue depressor or a thermometer, or as complex as a kidney dialysis machine. Medical devices are classified and regulated according to their degree of risk to the public.
2
OBTAINING FDA APPROVAL
The CDRH works with both medical device and radiologic health industries. A manufacturer of a Class III device files a PreMarket Approval Application (PMA) to obtain FDA approval to market the product. Like the submission that is filed for the approval of a new drug, a PMA contains clinical and laboratory testing data to demonstrate safety and effectiveness. A Premarket Notification, also known as a 510(k), is an application submitted to the FDA to demonstrate that a medical device is substantially equivalent to (i.e., meaning just as safe and effective as) a legally marketed device that does not require premarket approval.
1.1 Regulatory Classes Because each device is different, the Food and Drug Administration (FDA) establishes three different regulatory classes to ensure that each device is subject to regulations that are appropriate. Class I: General Controls. Class I devices are subject to a set of general regulations that apply to all devices. General controls include the registration of manufacturers, general recordkeeping requirements, and compliance with Good Manufacturing Practice regulations. Class II: Special Controls. Class II devices are those for which general regulations are not enough to guarantee the safety of the device. A Class II device may be subject to specific regulations to ensure
3 GOOD MANUFACTURING PRACTICES (GMPS) The FDA further ensures the safety and effectiveness of medical devices by regulating their manufacture. As with drugs, the FDA has established Good Manufacturing
This article was modified from the website of the United States Food and Drug Administration (http://www.eduneering.com/fda/courses/FDATour3/tourFDA-frames-08.html) by Ralph D’Agostino and Sarah Karl.
Wiley Encyclopedia of Clinical Trials, Copyright © 2008 John Wiley & Sons, Inc.
1
2
CENTER FOR DEVICES AND RADIOLOGICAL HEALTH (CDRH)
Practices for medical devices and regularly inspects manufacturers to ensure they comply with these regulations.

4 CONTINUOUS ANALYSIS
After approval of a medical device, the FDA continuously analyzes reports to ensure that products are safe and to watch for dangerous events related to the use of medical devices. The CDRH also monitors certain electronic products to protect the public from unnecessary exposure to radiation. Products that are monitored by the FDA include televisions, microwave ovens, X-ray machines, and devices employing lasers (including laser light shows). The FDA administers the law by setting and enforcing standards to limit unnecessary radiation emissions.
CENTER FOR DRUG EVALUATION AND RESEARCH (CDER)
The U.S. Food and Drug Administration's Center for Drug Evaluation and Research (CDER) ensures that safe and effective prescription, nonprescription, and generic drugs are made available as quickly as possible by overseeing the research, development, manufacture, and marketing of drugs. The CDER reviews the clinical trial evidence of the safety and effectiveness of new drugs before approving them for marketing and monitors their performance for unexpected health risks. The CDER also ensures that drug labeling, drug information for patients, and drug promotion are truthful, helpful, and not misleading.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/about/default.htm and http://www.eduneering.com/fda/courses/FDATour3/tourFDA-frames-06.html) by Ralph D'Agostino and Sarah Karl.

1 CDER MISSION

The Center is organized into four main functional areas.

1. New Drug Development and Review. Drug development is a highly complicated, lengthy process. During preclinical drug development, a sponsor evaluates the drug's toxic and pharmacologic effects through in vitro and in vivo laboratory animal testing. Next, the firm submits an investigational new drug (IND) application to the CDER to obtain U.S. Food and Drug Administration (FDA) approval that the new drug is sufficiently safe to allow clinical trials. During clinical trials, the investigational drug is administered to humans and is evaluated for its safety and effectiveness in treating, preventing, or diagnosing a specific disease or condition. The results of this testing will comprise the single most important factor in the approval or disapproval of a new drug and are the basis for a new drug application (NDA). Once the NDA is approved, the product can be legally marketed in the United States.

2. Postmarket Drug Surveillance. The CDER continuously monitors the safety of drugs that have already been marketed. After the drug is approved and marketed, the FDA uses different mechanisms for ensuring that firms adhere to the terms and conditions of approval described in the application and that the drug is manufactured in a consistent and controlled manner. This is done by periodic unannounced investigations of drug production and control facilities by the FDA's field investigators and analysts. The Center also ensures that prescription drug information provided by drug firms is truthful, balanced, and accurately communicated. In addition, the CDER follows up with companies on concerns related to medication errors, drug shortages, and ineffective or toxic drugs.

3. Generic Drug Review. Generic drug applications are termed "abbreviated" in that they are generally not required to include preclinical (animal) and clinical (human) data to establish safety and effectiveness. The Abbreviated New Drug Application (ANDA) provides for the review and ultimate approval of a generic drug product. After all components of the application are found to be acceptable, an approval or tentative approval letter is issued to the applicant. Tentative approvals require the manufacturer to delay marketing the generic drug until all patent/exclusivity issues have expired.

4. Over-the-Counter Drug Review. Over-the-counter (OTC) drug products are those that are available to consumers without a prescription, and 6 out of 10 medications bought by consumers are OTC drugs. The CDER's OTC program establishes drug monographs for each class of product; these are like a recipe book, covering the acceptable ingredients, doses, formulations, and labeling. Products conforming to a monograph may be marketed without further FDA clearance, and those that do not must undergo separate review and approval through the new drug approval process.

2 GOOD MANUFACTURING PRACTICES

To make sure that drugs are manufactured to the same high standards that are required for their approval, the FDA provides a set of regulations called Good Manufacturing Practices (GMPs). The law requires that the Office of Regulatory Affairs (ORA) perform periodic inspections of all drug firms for compliance with GMPs.

3 ADVERSE EVENT REPORTING

The FDA maintains several reporting systems that alert the Agency to side effects that were not detected during clinical trials but rather emerged when the product became widely used. One of these programs is the CDER's MedWatch, which encourages health professionals to report serious adverse events involving any medical product (including drugs, devices, and biologics). If necessary, the FDA can take regulatory actions to protect consumers. Regulatory actions may include restrictions on the product's use or its withdrawal from the market. About 1% to 3% of products approved each year must be removed later because of rare but serious side effects.

4 PRESCRIPTION DRUG USER FEE ACT

In the Prescription Drug User Fee Act of 1992 (PDUFA) the U.S. Congress, pharmaceutical industry, and FDA agreed on specific review goals for certain drugs and biologics, which are to be achieved with the help of user fees paid to the FDA by the products' manufacturers. The program has been instrumental in reducing the FDA's median drug-review times by more than half. As a result, typical drug applications are processed by the FDA in 1 year or less; priority applications for breakthrough medications are usually approved in 6 months. The PDUFA user fees, however, do not cover the FDA's expenses connected with generic and nonprescription drugs, plant inspections, postmarket surveillance, or monitoring of drug advertisements.

5 ACCELERATED APPROVAL
Many of the drugs currently used to treat life-threatening conditions such as cancer were approved through an accelerated FDA review process. In accelerated approval, the FDA approves the drug on the condition that the applicant will study and report the findings of the clinical benefit of the drug. The FDA continues to review new information and data about these drugs as the data become available; if the findings are negative, the appropriate actions are taken.
CENTRAL NERVOUS SYSTEM (CNS)
DAVID B. SOMMER
LARRY B. GOLDSTEIN
Duke University Medical Center, Durham, North Carolina

1 INTRODUCTION

Clinical neurology has a rich history based on careful patient observation. Over the last quarter century, however, a major shift toward evidence-based medicine has occurred, and that shift includes the practice of neurology. Although many diagnostic and therapeutic approaches employed in neurology today have never been formally evaluated, clinicians, patients, and payers now seek informative results from well-designed and conducted clinical trials to help guide therapeutic decisions. The randomized controlled trial (RCT) is the gold standard for evaluating putative interventions. The scope of central nervous system (CNS) diseases is vast, and a comprehensive review would be lengthy. The number of published trials has nearly doubled since the last such report was published in 2001 (1). This article provides an overview of issues related to CNS clinical trials focusing on six selected conditions (ischemic stroke, Alzheimer's disease, migraine headache, epilepsy, Parkinson's disease, and multiple sclerosis) that primarily involve the CNS and about which an increasing number of trials have been published over the last few decades (Table 1). These conditions were chosen both because they are associated with significant morbidity (Fig. 1) (2) and because they illustrate the diversity of diseases affecting the CNS. It should be noted that each of these topics could support an extensive report and that, for the topics that are discussed, only selected studies are cited. We do not address trials of treatments for CNS effects of systemic illness, psychiatric diseases, infections of the CNS, sleep disorders, traumatic brain injury, CNS neoplasms, inborn errors of metabolism and congenital malformations, nutritional deficiencies, environmental toxicities, or cerebrovascular disorders, neurodegenerative conditions, headache syndromes, seizure disorders, movement disorders, or autoimmune-mediated diseases other than those listed. Finally, we restrict our discussion to RCTs and do not review epidemiological or observational studies.

2 CNS TRIALS: GENERAL INCENTIVES AND CONSTRAINTS

An RCT can be conducted ethically only when clinical equipoise exists between the interventions being considered, the trial can provide an answer to the question being studied, and the rights of subjects can be preserved. RCTs are often expensive and therefore require appropriate levels of funding. In 2005, public funding for brain research in the United States was estimated at 6.1 billion euros. Private funding for brain research in the United States was 8.4 billion euros during the same period. In Europe, the totals were 0.9 billion and 3.3 billion euros, respectively (3). Public granting agencies generally fund clinical trials in CNS disease based on the quality of the science, but in the United States only about 15% of applications are currently funded, with costly clinical trials competing for funds with a variety of other types of research projects (4). Because of the nature of the peer-review process, great effort can be expended in developing successive versions of a proposal with no guarantee of success. Industry-funded trials are undertaken primarily to provide evidence of the efficacy and safety of a company's product. Research investment is generally limited to interventions that are likely to be profitable. Economic analyses show that device trials have the highest return on investment because of the lower regulatory hurdles for governmental approval and the high market prices (5,6). Suggesting the roles of incentives and constraints other than potential public health impact, Figs. 1 and 2 illustrate the discordance between the burdens of different groups of CNS diseases and the numbers of clinical trials conducted in each area.
Table 1. Number of Randomized Controlled Trials in Selected Neurological Diseases Indexed in Medline

Year of Publication   Stroke   Alzheimer's Disease   Migraine Headache   Epilepsy   Parkinson's Disease   Multiple Sclerosis
Before 1981              5              1                    44              29              45                  14
1981–1986               16             18                    52              50              21                  26
1987–1991               35             62                    75              81              67                  62
1992–1996               49            155                   135             181             134                 108
1997–2001              281            201                   193             236             179                 163
2002–2006              674            335                   240             199             262                 230
Total                 1060            772                   739             776             708                 603

Source: PubMed search for articles indexed with major subject heading (MeSH) of Stroke, Alzheimer Disease, Migraine Disorders, Epilepsy, Parkinson Disease, and Multiple Sclerosis (searches limited to randomized controlled trials and by publication year).
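The counts in Table 1 can, in principle, be regenerated programmatically. The Python sketch below uses Biopython's Entrez utilities to issue the same kind of query; the e-mail address is a placeholder, and counts retrieved today will not match the table exactly because MEDLINE indexing has continued since the original search.

    # Sketch of the PubMed query behind Table 1 (Biopython's Entrez E-utilities).
    # Requires network access; "you@example.org" is a placeholder e-mail address.
    from Bio import Entrez

    Entrez.email = "you@example.org"

    diseases = ["Stroke", "Alzheimer Disease", "Migraine Disorders",
                "Epilepsy", "Parkinson Disease", "Multiple Sclerosis"]
    periods = [("1981", "1986"), ("1987", "1991"), ("1992", "1996"),
               ("1997", "2001"), ("2002", "2006")]

    for mesh in diseases:
        for start, end in periods:
            term = (f'"{mesh}"[MeSH Major Topic] AND Randomized Controlled Trial[ptyp] '
                    f'AND ("{start}"[PDAT] : "{end}"[PDAT])')
            handle = Entrez.esearch(db="pubmed", term=term, retmax=0)
            record = Entrez.read(handle)
            handle.close()
            print(mesh, f"{start}-{end}", record["Count"])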
Figure 1. Global Burden of Disease (Million Quality Adjusted Life Years): Stroke, 49.2; Alzheimer's (and other dementias), 10.4; Migraine, 7.7; Epilepsy, 7.3; Parkinson's Disease, 1.6; Multiple Sclerosis, 1.5. Source: World Health Report 2002 (2).
Figure 2. Number of Randomized Controlled Trials (1965–2006): Stroke, 1060; Epilepsy, 776; Alzheimer's, 772; Migraine, 739; Parkinson's, 708; Multiple Sclerosis, 603. Source: PubMed search for articles indexed with major subject heading (MeSH) of Stroke, Alzheimer Disease, Migraine Disorders, Epilepsy, Parkinson Disease, and Multiple Sclerosis (searches limited to randomized controlled trials and by publication year).
3 CNS TRIALS: COMMON CHALLENGES
Adequate and timely subject enrollment and retention are challenges faced by all prospective clinical trials. Trials involving CNS diseases can be particularly difficult. CNS disease can affect cognition and/or communication and therefore impair the subject’s capacity to give informed consent. In these circumstances, investigators must obtain consent from a surrogate decision maker. This requirement is particularly challenging in clinical trials of interventions for acute stroke in which the onset of disease is abrupt and unexpected, and the time window for treatment is limited. Because many neurological diseases evolve over several years (e.g., clinical worsening in Alzheimer’s disease, disability in multiple sclerosis, evolution of Parkinson’s disease, etc.), long periods of observation may be necessary to detect a treatment effect. This need presents a particular challenge for patient retention, increases the problem of competing comorbidities, and raises trial costs. In the case of industry-developed treatments in which a period of patent protection is limited, the latter two issues may make a trial unfeasible. A short follow-up period also limits a trial’s capacity to detect delayed harm. The majority of CNS trials end at or before 24 months, yet many drugs used to treat CNS disease are taken for a lifetime. Another historic problem faced by clinical trials of CNS disease has been a lack of standardized disease definitions or reliable, validated outcome measures. For example, a 1987 review of all 28 articles including clinical assessments published in the journal Stroke in 1984 and 1985 found that no single neurological scale was used more than once (7). This problem was not unique to stroke research and has been at least partially addressed not only for stroke but also for a variety of other conditions as described for each disease reviewed below. Agreement on useful outcome measures has been challenging for many CNS diseases because of
the lack of readily available pathologic data (brain biopsy being generally considered the most invasive of monitoring procedures). Neuroimaging technologies [e.g., magnetic resonance imaging (MRI)] have been employed to provide a measure of disease activity in diseases such as multiple sclerosis, but their use as a primary surrogate endpoint remains controversial (8). No valid or practical biologic endpoints exist for many CNS diseases, for which endpoints must be constructed from clinician assessments and/or patient/caregiver report.

4 ISCHEMIC STROKE—PREVENTION
Stroke has the greatest health impact of all neurological diseases (Fig. 1). About 80% of strokes are ischemic, and consequently, interventions that prevent or ameliorate ischemic stroke have the greatest potential cumulative impact of all CNS treatments. The advent of thrombolytic therapy for selected patients with acute ischemic stroke has done little to reduce the overall burden of the disease, and primary and secondary prevention remain the most important interventions from the public health standpoint (9). Numerous clinical trials have shaped current medical therapy for stroke prevention. The benefit of lowering blood pressure in patients with hypertension has been demonstrated in at least 25 controlled trials of a variety of antihypertensive drugs (10). Treatment with dose-adjusted warfarin reduces the risk of stroke by 60% in patients with nonvalvular atrial fibrillation, whereas antiplatelet drugs such as aspirin reduce the risk by approximately 20% (11). Antiplatelet therapy represents a cornerstone of secondary stroke prevention, and several studies inform current practice. The Canadian Cooperative Study was among the first RCTs to demonstrate a benefit of aspirin in secondary stroke prevention (12). Supported by numerous other trials, aspirin use is associated with an approximate 13% reduction in the risk of serious vascular events in patients with prior stroke or transient ischemic attack (TIA) (13). The role of clopidogrel is somewhat more controversial. The Clopidogrel versus Aspirin for the Prevention of Recurrent Ischemic Events (CAPRIE) trial found a reduction in
a composite endpoint of myocardial infarction, stroke, or vascular death in patients with a history of stroke, myocardial infarction, or symptomatic peripheral arterial disease with clopidogrel as compared with aspirin, but because of sample size limitations, it was not clear whether patients with stroke benefited (14). A meta-analysis of available data in 2002 demonstrated similar degrees of risk reduction with ASA and clopidogrel (15). The Management of Atherothrombosis with Clopidogrel in High Risk Patients (MATCH) trial found that the combination of aspirin plus clopidogrel was associated with a higher risk of bleeding without a reduction in recurrent ischemic events in high-risk patients with cerebrovascular disease when compared to clopidogrel alone (16). The Clopidogrel for High Atherothrombotic Risk and Ischemic Stabilization, Management, and Avoidance Trial (CHARISMA) found no benefit for the combination as compared with aspirin alone in high-risk patients (17). In contrast, the European Stroke Prevention Study-2 (ESPS2) found that the combination of aspirin and sustained-release dipyridamole was more efficacious than aspirin alone, a result that now has been supported by the European/Australasian Stroke Prevention in Reversible Ischemia Trial (ESPRIT) (18,19). Negative RCTs are as important as positive ones in that they show which treatments are not efficacious or safe. The Warfarin–Aspirin Recurrent Stroke Study (WARSS) found no benefit for warfarin over aspirin for secondary prevention in patients with noncardioembolic stroke (20). A substudy of WARSS evaluating patients with cryptogenic stroke who had a patent foramen ovale (PFO) similarly found no benefit for warfarin as compared with aspirin (21). The Warfarin–Aspirin Symptomatic Intracranial Disease (WASID) trial found no benefit of warfarin over aspirin in patients with a symptomatic intracranial stenosis (22). Postmenopausal hormone replacement therapy was once thought to reduce the risk of stroke and other cardiovascular events and was widely prescribed, but RCTs now have found either no benefit or increased stroke risk for women with coronary heart disease (23) or prior stroke history (24) or for otherwise healthy postmenopausal women (25).
Trials in subjects with established coronary heart disease or certain risk factors show an approximate 20% reduction in the risk of stroke associated with the use of HMG-CoA reductase inhibitors (statins). Studies with combined vascular endpoints had demonstrated lower stroke rates with statins versus placebo in patients with vascular risk factors (9). The Stroke Prevention with Aggressive Reductions in Cholesterol Levels (SPARCL) trial also found a reduction in recurrent stroke and other vascular events associated with a statin in patients with a recent stroke or TIA, no known coronary heart disease, and a low-density lipoprotein cholesterol (LDL-C) level between 100 to 190 mg/dL (26). A randomized trial first drastically altered surgical intervention for stroke prevention in 1985 with the publication of the negative Extracranial– Intracranial Bypass Study (27). The role of carotid endarterectomy (CEA) in severe symptomatic carotid stenosis for stroke prevention now has been firmly established by RCTs (28–30). The Asymptomatic Carotid Atherosclerosis Study (ACAS) (31) and the Asymptomatic Carotid Surgery Trial (ACST) (32) also demonstrate a smaller benefit in asymptomatic patients with high-grade carotid stenosis. Several trials comparing endarterectomy with carotid angioplasty/stenting have been published (33–35), but the results are inconsistent and the procedure has only been approved in the United States for symptomatic patients deemed at high endarterectomy risk. Several interventions recommended for stroke prevention (e.g., smoking cessation, moderation of alcohol consumption, and diet and weight management) are based on epidemiologic data and have not been examined in an RCT, either because it would be unethical or impractical to do so. Ongoing stroke prevention trials sponsored by the National Institutes of Health (NIH)/NINDS or other public agencies include evaluation of aspirin versus aspirin plus clopidogrel for secondary prevention of small subcortical strokes (NCT00059306), evaluation of warfarin versus aspirin for primary stroke prevention in patients with a cardiac ejection fraction less than
35% (NCT00041938), PFO closure versus anticoagulation versus antiplatelet therapy in recurrent stroke with PFO (NCT00562289), endarterectomy versus carotid stenting in symptomatic carotid stenosis (NCT00004732), and multiple trials of cholesterol-lowering agents in primary prevention (additional information on these trials and other ongoing trials can be accessed at ClinicalTrials.gov by typing the ID number into the search box). Industry-sponsored trials include two large Phase III trials of oral direct thrombin inhibitors compared with warfarin in the prevention of stroke in atrial fibrillation (NCT00403767 and NCT00412984), a large Phase III trial of aspirin for primary prevention in patients with moderate cardiovascular risk (NCT00501059), and evaluation of an endovascularly deployed device to prevent embolism from the left atrial appendage (NCT00129545). A large industry-sponsored Phase III randomized trial is comparing carotid stenting versus endarterectomy for asymptomatic severe carotid stenosis (NCT00106938), and multiple industry-sponsored trials evaluate PFO closure devices for patients with cryptogenic stroke (e.g., NCT00289289 and NCT00201461).

5 ISCHEMIC STROKE—ACUTE TREATMENT
The outcomes to be measured in stroke prevention trials are relatively straightforward (i.e., the numbers of strokes or other vascular events prevented). The associated costs and impact of therapy on quality of life are secondary measures. In contrast, the primary outcomes of interest for trials of acute stroke treatments are functional measures of stroke severity and are reflected in scales that assess neurological impairments [e.g., the NIH Stroke Scale (NIHSS)], disability (e.g., the Barthel Index), and handicap [e.g., the Modified Rankin Scale (mRS)]. Outcomes of acute stroke interventions are generally assessed 90 days poststroke. The optimal way of analyzing functional outcome data remains a point of discussion. No single trial has had a greater impact in revolutionizing the medical approach to
acute ischemic stroke than the NINDS trial of tissue plasminogen activator (tPA) (36). For patients treated within 3 hours of symptom onset, a 13% absolute (32% relative) increase occurred in the proportion of patients with little or no functional deficit after 3 months. Because of the limited time window for treatment and because the likelihood of benefit increases the sooner the drug can be given, the NINDS tPA trial has led to a major change in the approach to acute ischemic stroke (37). A pilot RCT suggested that transcranial ultrasound could augment the effectiveness of systemic tPA; however, this finding has not yet been replicated or adopted in practice (38). Trials of several neuroprotective agents, heparin, and abciximab for the treatment of acute stroke have all yielded disappointing results (37). Currently, the NINDS is sponsoring three Phase III trials in acute stroke management. Albumin in Acute Ischemic Stroke (ALIAS, NCT00235495) and Field Administration of Stroke Therapy–Magnesium (FAST–MAG, NCT00059332) are placebo-controlled RCTs that evaluate the potential neuroprotective effects of human albumin and magnesium sulfate, respectively. The Interventional Management of Stroke III Trial (NCT00359424) is randomizing patients presenting with acute stroke within 3 hours to standard therapy with intravenous tPA versus partial-dose intravenous tPA followed by angiography and individualized intravascular intervention in eligible candidates. Currently, industry is sponsoring Phase III neuroprotective trials with citicoline (NCT00331890); ONO-2506, a homolog of valproic acid (NCT00229177); and Tanakan, a standardized ginkgo biloba extract (NCT00276380). Industry is also conducting a placebo-controlled double-blinded evaluation of ancrod, an enzyme derived from snake venom, for the treatment of ischemic stroke presenting within 6 hours (NCT00141011). The manufacturers of the NeuroThera topical phototherapy system and the NeuroFlo partial aortic occlusion devices are sponsoring Phase III efficacy trials (NCT00419705 and NCT00119717, respectively).
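The distinction between the absolute and relative differences quoted for the NINDS tPA trial is purely arithmetic. With hypothetical outcome proportions (not the trial's actual figures) of 40% of control patients and 53% of treated patients having little or no deficit, the absolute increase is 53% − 40% = 13 percentage points, while the relative increase is 0.13/0.40 ≈ 32%.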
6 ALZHEIMER'S DISEASE
Alzheimer’s disease (AD) is the most common cause of dementia and leads to a significant burden of disease for both patients and caregivers. Histopathology provides the gold standard diagnostic test for AD, but this test cannot be used to identify subjects for enrollment in clinical trials and must be interpreted in the context of the patient’s clinical status. Standardized clinical criteria for the diagnosis of AD have been developed that have a correlation of 80–100% as compared with histopathology and provide a means for identifying subjects for therapeutic trials (39). The National Institute of Neurological and Communicative Diseases and Stroke/Alzheimer’s Disease and Related Disorders Association (NINCDS–ADRDA) criteria are among the best validated and most widely used for identifying potential subjects with possible or probable AD (40). Objective cognitive assessments, clinician or caregiver subjective global assessments, and functional assessments are all used to measure outcome in AD trials. The MiniMental State Examination (MMSE) (41) is widely used in trials as a cognitive instrument, but the Alzheimer’s Disease Assessment Scale–Cognitive Portion (ADAS–Cog) (42) is a more robust measure and has been used in several clinical trials. Multiple disease-specific instruments are available for assessing functional status including the Disability Assessment for Dementia (DAD) (43) and the Alzheimer’s Disease Cooperative Study Activities of Daily Living inventory (ACDS–ADL) (44). In addition to clinical assessment, several groups are actively seeking to identify biomarkers that can be used to aid in the diagnosis and monitoring of AD. This finding might help in early identification of cases, when patients stand to benefit the most from treatments intended to slow the course of the disease (45). Despite sustained efforts at finding disease-modifying therapies for AD, none have been identified to date, and the available symptomatic therapies have limited clinical benefit (46). Tacrine was the first cholinesterase inhibitor shown to have beneficial effects in a randomized trial, but its use was limited by side effects (47). The
pivotal efficacy studies of donepezil, which had the major advantages of daily dosing and a favorable adverse effects profile, were published in 1998 (48,49). This was followed by publication of positive efficacy trials of rivastigmine (50) and galantamine (51). The N-methyl-D-aspartate (NMDA) receptor antagonist memantine has been shown to improve function compared with placebo in patients with severe AD (45,52). Publicly funded trials currently are evaluating the effectiveness of selenium and vitamin E (NCT00040378) as well as celecoxib and naproxen (NCT00007189) for the primary prevention of AD. The National Institute on Aging (NIA) is also sponsoring trials of simvastatin (NCT00053599) and docosahexaenoic acid (NCT00440050) to slow AD progression. Currently, industry is funding Phase III trials of three novel agents targeted at the beta amyloid protein: MCP-7869 (NCT00105547), 3APS (NCT00088673), and bapineuzumab (NCT00574132). Rosiglitazone is also being evaluated as a disease-modifying agent in AD (NCT00348309).

7 MIGRAINE

Migraine headache is defined for clinical and research purposes by the International Headache Society (IHS) criteria (53). The prevalence of migraine in the United States and Western Europe has been estimated at 9% of the general population, and about one third of migraineurs have headaches severe enough to interfere with activities of daily living (52). Migraine headaches have long been the subject of RCTs—propranolol was first shown to be effective versus placebo for migraine prophylaxis in 1974 (54). Clinical trials have shaped current practice in both the acute treatment of migraines and the prevention of recurrent migraines. Outcomes in migraine trials rely entirely on subjective patient report. The IHS has recommended the following standardized outcomes for acute therapy trials: percentage of patients pain-free at 2 hours, sustained pain-free state at 48 hours, intensity of headache rated on a categorical scale, percentage of patients with reduction in pain from
moderate–severe to none–mild, and functional outcome rated on a categorical scale. For prophylaxis trials, monthly frequency of migraine attacks is the IHS-recommended outcome (55). In 2000, the American Academy of Neurology reviewed the available evidence for the treatment of migraine. Convincing effectiveness data from RCTs in acute treatment existed for all of the serotonin agonists (triptans), dihydroergotamine (DHE) nasal spray, aspirin, ibuprofen, naproxen sodium, butorphanol nasal spray, and opiates. Convincing efficacy data from RCTs in migraine prophylaxis were available for divalproex sodium, amitriptyline, propranolol, timolol, and methysergide (56). Clinical trial data supporting the efficacy of gabapentin (57) and topiramate (58) for migraine prophylaxis have since become available. RCTs have continued to refine our understanding of migraine treatment by evaluation of combination therapy (59), by direct comparisons of different agents (60), and through rigorous testing of widely used but unproven therapies. A recent placebo-controlled trial of intravenous dexamethasone administered in the emergency department failed to find benefit in acute migraine (61). Although parenteral DHE is widely used and believed to be helpful, its efficacy has never been proven in an RCT (62). The Chilean government currently is sponsoring a direct comparison trial of pregabalin and valproate in migraine prophylaxis (NCT00447369). Ongoing industry-sponsored trials include a Phase III placebo-controlled evaluation of botulinum toxin type A for migraine prophylaxis (NCT00168428) and evaluations of an implantable neurostimulator (NCT00286078), transcranial magnetic stimulation (NCT00449540), and a cardiac atrial septal repair device (NCT00283738).
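The IHS acute-therapy outcome definitions above translate directly into simple summary statistics. The sketch below is purely illustrative (hypothetical field names and toy data; not part of any IHS or trial software) and shows how the 2-hour pain-free rate and the headache-response rate (moderate–severe reduced to none–mild) might be tabulated for one treatment arm, with pain scored on the usual 4-point categorical scale (0 = none, 1 = mild, 2 = moderate, 3 = severe).

# Illustrative sketch only; field names are hypothetical.
def acute_migraine_endpoints(patients):
    """Summarize two IHS-recommended acute-therapy outcomes for one arm.

    Each patient record is a dict with:
      'pain_baseline': categorical pain score at dosing (2 or 3 for a qualifying attack)
      'pain_2h': categorical pain score 2 hours after dosing
    """
    n = len(patients)
    pain_free = sum(1 for p in patients if p['pain_2h'] == 0)
    response = sum(1 for p in patients
                   if p['pain_baseline'] >= 2 and p['pain_2h'] <= 1)
    return {'pct_pain_free_2h': 100.0 * pain_free / n,
            'pct_headache_response_2h': 100.0 * response / n}

arm = [{'pain_baseline': 3, 'pain_2h': 0},
       {'pain_baseline': 2, 'pain_2h': 1},
       {'pain_baseline': 3, 'pain_2h': 2}]
print(acute_migraine_endpoints(arm))
# roughly 33% pain-free and 67% headache response in this toy arm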
8 EPILEPSY
Epilepsy affects nearly 1% of the population and has a cumulative incidence of 3% by age 74 years. Before 1990, only six drugs were available for the treatment of seizures, but the range of available options
has since more than doubled (63). Definitions for seizure types and epilepsy syndromes used in research are maintained and periodically updated by the International League Against Epilepsy (www.ilae.org). Outcomes in antiepileptic drug (AED) trials are generally based on patient- or caregiver-reported seizure frequency. A common practice is to dichotomize subjects into responders (at least 50% reduction in seizure frequency) and nonresponders. The proportion of subjects remaining seizure-free has also been reported. The development of novel AEDs has been spurred by the fact that some patients remain refractory to all available therapies. These patients usually comprise the subjects for efficacy trials of new AEDs, both because a new AED can be ethically tested against placebo as add-on therapy and because an effect is more likely to be demonstrated in a short period of time in a population with frequent seizures. To evaluate the efficacy of newer AEDs in nonrefractory patients, randomized trials with active comparison groups rather than placebo-controlled trials are required, because it would be unethical to deprive nonrefractory patients of therapy with proven benefit. The first placebo-controlled RCTs of AEDs as add-on therapy in refractory patients were published in 1975: carbamazepine (64) and valproate (65) were evaluated in 23 and 20 patients, respectively. In a typical modern add-on trial, levetiracetam was recently shown to reduce mean seizure frequency and to have a higher responder rate than placebo in subjects with uncontrolled idiopathic generalized epilepsy (66). The 45% responder rate among placebo patients in this study illustrates the importance of randomized placebo-controlled trials in the evaluation of therapeutic efficacy. Comparative trials of AEDs are less common than add-on trials. This is unfortunate, because comparative trials have greater value in clinical decision making, yet little RCT evidence is available to guide AED choices (67). One example is a trial comparing lamotrigine and carbamazepine in subjects with both new-onset partial and generalized seizures (68). This trial found
no difference in efficacy but improved tolerability with lamotrigine. The trial was not powered to detect a subtle difference in efficacy. The only Phase III NIH-sponsored trial currently in progress is a randomized, parallel-assignment trial comparing ethosuximide, lamotrigine, and valproic acid in the treatment of childhood absence epilepsy (NCT00088452). Industry-sponsored trials include a comparison of zonisamide versus carbamazepine in newly diagnosed partial epilepsy (NCT00477295) and a comparison of pregabalin versus levetiracetam (NCT00537238). Ongoing placebo-controlled add-on trials of new agents include evaluations of rufinamide (NCT00334958), SPM 927 (NCT00136019), retigabine (NCT00235755), and RWJ-333369 (NCT00433667).
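As a concrete illustration of the responder analysis used in the add-on trials described above, the following is a minimal sketch (hypothetical field names; the 50% cutoff is simply the conventional responder definition mentioned earlier, not taken from any specific trial's analysis plan) of how subjects might be dichotomized by percent reduction in seizure frequency and a responder rate computed for one arm.

# Illustrative sketch only.
def percent_reduction(baseline_rate, treatment_rate):
    """Percent reduction in seizure frequency (e.g., seizures per 28 days)."""
    return 100.0 * (baseline_rate - treatment_rate) / baseline_rate

def responder_rate(subjects, cutoff=50.0):
    """Proportion of subjects with at least `cutoff` percent reduction from baseline."""
    responders = [s for s in subjects
                  if percent_reduction(s['baseline'], s['treatment']) >= cutoff]
    return len(responders) / len(subjects)

arm = [{'baseline': 12.0, 'treatment': 4.0},   # 67% reduction: responder
       {'baseline': 8.0, 'treatment': 6.0},    # 25% reduction: nonresponder
       {'baseline': 10.0, 'treatment': 5.0}]   # 50% reduction: responder
print(responder_rate(arm))  # 2 of 3 subjects, i.e., about 0.67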
9 PARKINSON'S DISEASE
Parkinson’s Disease (PD) is a neurodegenerative disorder with cardinal motor manifestations of tremor, bradykinesia, and rigidity. Levodopa was found to be the first efficacious symptomatic therapy for PD in the 1960s, and several RCTs have since demonstrated that this and several other agents are efficacious in ameliorating PD symptoms. No therapy has yet been proven to alter the progressive natural history of the disease (69). PD is usually diagnosed for research purposes using the United Kingdom Parkinson’s Disease Brain Bank criteria (70). Disability stage is measured using the classic five-stage scale of Hoehn and Yahr (71), but disease progression now is usually measured using the more comprehensive and detailed Unified Parkinson’s Disease Rating Scale (UPDRS) (72). Although the combination of levodopa and carbidopa has long been used for the treatment of PD, its effect on the natural history of the disease was uncertain. A landmark RCT, published in 2004, randomized patients to placebo or 3 different doses of levodopa and demonstrated a dose-related improvement in function with treatment that partially persisted 2 weeks after discontinuation of drug (73). A stable formulation
of levodopa/carbidopa for continuous duodenal infusion was shown to be efficacious in advanced PD (74). The dopamine agonist bromocriptine was shown to offer symptomatic benefit over placebo in 1975 (75), and pergolide was proven efficacious in 1985 (76). Pergolide has since been removed from the market because of toxicity. In the late 1990s, the dopamine agonists pramipexole (77) and ropinirole (78) were proven efficacious in RCTs and have been widely adopted into practice. Transdermal rotigotine has also recently been adopted as an effective dopaminergic therapy (79). Subcutaneous injection of apomorphine has been shown to treat refractory "off periods" effectively (80). Compounds that alter the metabolism of dopamine, such as the monoamine oxidase inhibitor selegiline, can improve symptom control (81). More recently, rasagiline (82) and an orally disintegrating formulation of selegiline (83) have been reported to improve motor symptoms. The catechol-O-methyltransferase (COMT) inhibitors entacapone (84) and tolcapone (85) have been shown to reduce motor fluctuations in PD, but the latter has had limited use because of hepatotoxicity. Amantadine has been suggested to be efficacious in several small randomized trials, each with fewer than 20 patients (86). A large NINDS trial of vitamin E for neuroprotection in PD was negative (87). A trial of coenzyme Q10 yielded intriguing but nonsignificant results (88). Surgical treatment of advanced PD with deep brain stimulation (DBS) has been shown to improve quality of life and motor symptoms compared with medical therapy alone (89). The NINDS currently is sponsoring a Phase III evaluation of creatine as a disease-modifying therapy in PD (NCT00449865). Two novel compounds, E2007 (NCT00286897) and SLV308 (NCT00269516), are currently in Phase III industry-sponsored trials. Another industry-sponsored Phase III evaluation of continuous duodenal infusion of levodopa/carbidopa gel is in progress (NCT00357994).
10 MULTIPLE SCLEROSIS
Multiple sclerosis (MS) is a multifocal autoimmune demyelinating disease of the CNS associated with axonal injury, with a typical onset in early adulthood. MS can have a variable course but can lead to significant disability. MS is nosologically divided into relapsing-remitting MS (RRMS), secondary progressive MS (SPMS), and primary progressive MS (PPMS). Before 1990, no treatments proven to alter the disease or its clinical manifestations were available (90). MS is commonly diagnosed for clinical trial purposes using the so-called revised McDonald criteria (91), which incorporate both clinical and imaging findings. Commonly employed clinical outcomes in MS studies are relapse rates (RRMS) and disability progression using the Expanded Disability Status Scale (EDSS). Several short-term imaging endpoints have been employed, but their correlation with the more important clinical endpoints has been debated (92). Four injectable drugs now have been proven efficacious for reducing disability and relapse rates in patients with RRMS. RCT results were published for interferon (IFN) beta-1b (93) and glatiramer acetate (94) in 1995; IFN beta-1a in 1996 (95); and natalizumab in 2006 (96). The first three compounds now are commonly used in patients with RRMS, whereas natalizumab's use has been limited because of associated cases of progressive multifocal leukoencephalopathy (PML). In 2002, a higher-dose, three-times-weekly formulation of IFN beta-1a was shown to be superior to the previously approved once-weekly formulation (97) and is now commonly in use. Promising results of preliminary RCTs of two oral agents, fingolimod (98) and laquinimod (99), in RRMS have been published. Administration of high-dose oral steroids has been shown to be efficacious versus placebo in reducing short-term disability in acute MS exacerbations (100). A small RCT found no difference between treatment of acute exacerbations with high-dose oral versus intravenous methylprednisolone (101). Nonetheless, many clinicians avoid the use of oral steroids for exacerbations because of the results of the Optic Neuritis Treatment
Trial, in which the group that received low-dose oral steroids fared worse than the group that received a placebo (102). Trials in the treatment of progressive disease (PPMS and SPMS) have been less encouraging. Trials of azathioprine, cladribine, cyclophosphamide, cyclosporine, and methotrexate all have demonstrated a lack of efficacy or overwhelming toxicity (103). Positive results have been published for mitoxantrone (104,105), and this agent is frequently used in refractory patients. Currently, the NINDS is sponsoring a trial comparing IFN beta-1a versus glatiramer acetate versus the combination of the two agents (NCT00211887). Other sources of public funding currently are sponsoring comparative evaluations of cyclophosphamide versus methylprednisolone in secondary progressive disease (NCT00241254) and intravenous versus high-dose oral steroids in acute exacerbations of RRMS (NCT00418145). Industry currently is sponsoring Phase III evaluations of three novel oral agents for RRMS: laquinimod (NCT00509145), BG00012 (NCT00420212), and fingolimod (NCT00420212). All three of these trials have target enrollments of at least 1000 patients. Industry is also sponsoring Phase III evaluations of alemtuzumab as add-on therapy to IFN beta-1a in RRMS (NCT00530348), MBP8298 in SPMS (NCT00468611), mitoxantrone in SPMS (NCT00146159), and rituximab in PPMS (NCT00087529).
11 CONCLUSION
This brief overview of clinical trials for selected diseases or conditions primarily affecting the CNS serves as an introduction to some of the issues involved in their conduct and interpretation. Only a few example trials for a few selected conditions are discussed, but they underscore the breadth of clinical trials being conducted in this area. Therapeutic trials for CNS diseases remain in their infancy, but they already have had a dramatic impact on the practice of clinical neurology and will likely play an ever more important role in the future.
REFERENCES 1. R. J. Guiloff (ed.), Clinical Trials in Neurology. London: Springer, 2001. 2. C. Murry and A. Lopez (eds.), The World Health Report 2002—Reducing Risks, Promoting Health Life. Geneva, Switzerland: World Health Organization, 2002. 3. P. Sobocki, I. Lekander, S. Berwick, J. Olesen, and B. J¨onsson, Resource allocation to brain research in Europe (RABRE). Eur. J. Neurosci., 24(10): 2691–2693, 2006. 4. NINDS Funding Strategy—FY 2007. Available: http://www.ninds.nih.gov/funding/ ninds funding strategy.htm. Accessed December 14, 2007. 5. H. Moses, 3rd, E. R. Dorsey, D. H. Matheson, and S. O. Thier, Financial anatomy of biomedical research. JAMA. 294(11):1333–1342, 2005. 6. L. B. Goldstein, Regulatory device approval for stroke: fair and balanced? Stroke. 38: 1737–1738, 2007. 7. K. Asplund, Clinimetrics in stroke research. Stroke. 18: 528–530, 1987. 8. D. S. Goodin, E. M. Frohman, G. P. Garmany, Jr., et al., Disease modifying therapies in multiple sclerosis: report of the Therapeutics and Technology Assessment Subcommittee of the American Academy of Neurology and the MS Council for Clinical Practice Guidelines. Neurology. 58(2): 169–178, 2002. 9. L. B. Goldstein, R. Adams, M. J. Alberts, et al., Primary prevention of ischemic stroke: a guideline from the American Heart Association/American Stroke Association Stroke Council. Stroke. 37: 1583–1633, 2006. 10. C. M. Lawes, D. A. Bennett, V. L. Feigin, and A. Rodgers, Blood pressure and stroke: an overview of published reviews. Stroke. 35: 1024, 2004. 11. R. G. Hart, L. A. Pearce, and M. I. Aguilar, Meta-analysis: antithrombotic therapy to prevent stroke in patients who have nonvalvular atrial fibrillation. Annals of Internal Medicine. 146: 857–867, 2007. 12. The Canadian Cooperative Study Group, A randomized trial of aspirin and sulfinpyrazone in threatened stroke. New Engl. J. Med. 299: 53–59, 1978. 13. A. Algra and J. van Gijn, Cumulative metaanalysis of aspirin efficacy after cerebral ischaemia of arterial origin. J. Neurol. Neurosurg. Psychiatry. 66: 255, 1999. 14. CAPRIE Steering Committee, A randomized, blinded, trial of Clopidogrel Versus Aspirin in
Patients at Risk of Ischemic Events (CAPRIE). Lancet. 348: 1329–1339, 1996. 15. Antithrombotic Trialists’ Collaboration, Collaborative meta-analysis of randomised trials of antiplatelet therapy for prevention of death, myocardial infarction, and stroke in high risk patients. BMJ. 324: 71–86, 2002. 16. H. C. Diener, J. Bogousslavsky, L. M. Brass, et al., Aspirin and clopidogrel compared with clopidogrel alone after recent ischaemic stroke or transient ischaemic attack in highrisk patients (MATCH): randomised, doubleblind, placebo-controlled trial. Lancet. 364: 331–337, 2004. 17. D. L. Bhatt, K. A. Fox, W. Hacke, et al., Clopidogrel and aspirin versus aspirin alone for the prevention of atherothrombotic events. N. Engl. J. Med. 354: 1706–1717, 2006. 18. H. C. Diener, L. Cunha, C. Forbes, J. Sivenius, P. Smets, A. Lowenthal, European Stroke Prevention Study, 2: dipyridamole and acetylsalicylic acid in the secondary prevention of stroke. J. Neurol. Sci. 143: 1–13, 1996. 19. ESPRIT Study Group, Aspirin plus dipyridamole versus aspirin alone after cerebral ischaemia of arterial origin (ESPRIT): randomised controlled trial. Lancet. 367: 1665–1673, 2006. 20. J. P. Mohr, J. L. Thompson, R. M. Lazar, et al., A comparison of warfarin and aspirin for the prevention of recurrent ischemic stroke. N. Engl. J. Med. 345: 1444–1451, 2001. 21. S. Homma, R. L. Sacco, M. R. Di Tullio, R. R. Sciacca, and J. P. Mohr, for the PFO in Cryptogenic Stroke Study (PICSS) Investigators, Effect of medical treatment in stroke patients with patent foramen ovale. Circulation. 105: 2625–2631, 2002. 22. M. I. Chimowitz, M. J. Lynn, H. HowlettSmith, et al., Comparison of warfarin and aspirin for symptomatic intracranial arterial stenosis. N. Engl. J. Med. 352: 1305–1316, 2005. 23. S. Hulley, D. Grady, T. Bush, et al., Heart and Estrogen/progestin Replacement Study (HERS) Research Group, Randomized trial of estrogen plus progestin for secondary prevention of coronary heart disease in postmenopausal women. JAMA. 280: 605–613, 1998. 24. C. M. Viscoli, L. M. Brass, W. N. Kernan, P. M. Sarrel, S. Suissa, and R. I. Horwitz, A clinical trial of estrogen-replacement therapy after ischemic stroke. N. Engl. J. Med. 345: 1243–1249, 2001.
CENTRAL NERVOUS SYSTEM (CNS) 25. J. E. Rossouw, G. L. Anderson, R. L. Prentice, et al., Risks and benefits of estrogen plus progestin in healthy postmenopausal women: principal results from the Women’s Health Initiative randomized controlled trial. JAMA. 288: 321–333, 2002. 26. P. Amarenco, J. Bogousslavsky, A. Callahan, 3rd, et al., Stroke Prevention by Aggressive Reduction in Cholesterol Levels (SPARCL) Investigators. High-dose atorvastatin after stroke or transient ischemic attack. N. Engl. J. Med. 355: 549–559, 2006. 27. The EC/IC Bypass Study Group, Failure of extracranial-intracranial arterial bypass to reduce the risk of ischemic stroke: results of an international randomized trial. N. Engl. J. Med. 313: 1191–1200, 1985. 28. North American Symptomatic Carotid Endarterectomy Trial Collaborators, Beneficial effect of carotid endarterectomy in symptomatic patients with high-grade carotid stenosis. N. Engl. J. Med. 325: 445–453, 1991. 29. European Carotid Surgery Trialists’ Collaborative Group, MRC European Carotid Surgery Trial: interim results for symptomatic patients with severe (70–99%) or with mild (0–29%) carotid stenosis. Lancet. 337: 1235–1243, 1991. 30. M. R. Mayberg, S. E. Wilson, F. Yatsu, et al., Carotid endarterectomy and prevention of cerebral ischemia in symptomatic carotid stenosis: Veterans Affairs Cooperative Studies Program. JAMA. 266: 3289–3294, 1991. 31. Executive Committee for the Asymptomatic Carotid Atherosclerosis Study, Endarterectomy for asymptomatic carotid artery stenosis. JAMA. 273: 1421–1428, 1995. 32. A. Halliday, A. Mansfield, J. Marro, et al., Prevention of disabling and fatal strokes by successful carotid endarterectomy in patients without recent neurological symptoms: randomised controlled trial. Lancet. 363: 1491–1502, 2004. 33. J. S. Yadav, M. H. Wholey, R. E. Kuntz, et al., Protected carotid-artery stenting versus endarterectomy in high risk patients. N. Engl. J. Med. 351: 1493–1501, 2004. 34. The SPACE Collaborative Group, 30 day results from the SPACE trial of stent-protected angioplasty versus carotid endarterectomy in symptomatic patients: a randomised non-inferiority trial. Lancet. 368: 1239–1247, 2006. 35. J. L. Mas, G. Chatellier, B. Beyssen, et al., Endarterctomy versus stenting in patients
with symptomatic severe carotid stenosis. N. Engl. J. Med. 355: 1660–1167, 2006. 36. The National Institute of Neurological Disorders and Stroke rt-PA Stroke Study Group, Tissue plasminogen activator for acute ischemic stroke. N. Engl. J. Med. 333: 1581–1587, 1995. 37. H. P. Adams, Jr., G. del Zoppo, M. J. Alberts, et al., Guidelines for the early management of adults with ischemic stroke. Stroke. 38(5): 1655–1711, 2007. 38. A. V. Alexandrov, C. A. Molina, J. C. Grotta, et al., Ultrasound-enhanced systemic thrombolysis for acute ischemic stroke. N. Engl. J. Med. 351: 2170–2178, 2004. 39. O. L. Lopez, I. Litvan, K. E. Catt, et al., Accuracy of four clinical diagnostic criteria for the diagnosis of neurodegenerative dementias. Neurology. 53: 1292–1299, 1999. 40. G. McKhann, D. A. Drachman, M. F. Folstein, R. Katzman, D. L. Price, and E. Stadlan, Clinical diagnosis of Alzheimer’s disease: report of the NINCDS-ADRDA Work Group under the auspices of the Department of Health and Human Services Task Force on Alzheimer’s disease. Neurology. 34: 939–944, 1984. 41. J. R. Cockrell and M. F. Folstein, Mini-Mental State Examination (MMSE). Psychopharmacol. Bull. 24: 689–692, 1988. 42. R. C. Mhos, The Alzheimer’s disease assessment scale. Int. Psychogeriatr. 8: 195–203, 1996. 43. I. Gelinas, L. Gauthier, M. McIntyre, and S. Gauthier, Development of a functional measure for persons with Alzheimer’s disease: the disability assessment for dementia. Am. J. Occup. Ther. 53: 471–481, 1999. 44. D. Galasko, D. Bennett, M. Sano, et al., An inventory to assess activities of daily living for clinical trials in Alzheimer’s disease. Alzheimer Dis. Assoc. Disord. Suppl. 2:S33–S39, 1997. 45. A. M. Fagan, C. A. Csernansky, J. C. Morris, D. M. Holtzman, The search for antecedent biomarkers of Alzheimer’s disease. J. Alzheimers Dis. 8(4): 347–358, 2005. 46. R. S. Doody, J. C. Stevens, C. Beck, et al., Practice parameter: management of dementia (an evidence-based review). Report of the Quality Standards Subcommittee of the American Academy of Neurology. Neurology. 56(9): 1154–1166, 2001. 47. W. K. Summers, L. V. Majovski, G. M. Marsh, K. Tachiki, and A. Kling, Oral tetrahydroaminoacridine in long-term treatment of
senile dementia, Alzheimer type. N. Engl. J. Med. 315: 1241–1245, 1986. 48. S. Rogers, R. Doody, R. Mohs, et al., and the Donepezil Study Group, Donepezil improved cognition and global function in Alzheimer's disease. Arch. Intern. Med. 158: 1021–1031, 1998. 49. S. Rogers, M. Farlow, R. Doody, et al., and the Donepezil Study Group, A 24-week, double-blind, placebo-controlled trial of donepezil in patients with Alzheimer's disease. Neurology. 50: 136–145, 1998. 50. M. Rosler, R. Anand, A. Cicin-Sain, et al., Efficacy and safety of rivastigmine in patients with Alzheimer's disease: international randomised controlled trial. B. Med. J. 318: 633–638, 1999. 51. M. Raskind, E. Peskind, T. Wessel, et al., and the Galantamine Study Group, Galantamine in AD: a 6-month randomized, placebo-controlled trial with a 6-month extension. Neurology. 54: 2261–2268, 2000. 52. J. L. Brandes, Global trends in migraine care: results from the MAZE survey. CNS Drugs. 16 Suppl. 1: 13–18, 2002. 53. Headache Classification Committee of the International Headache Society, Classification and diagnostic criteria for headache disorders, cranial neuralgias and facial pain. Cephalalgia. 8 Suppl. 7: 1–96, 1998. 54. T. E. Wideroe and T. Vigander, Propranolol in the treatment of migraine. Br. Med. J. 2: 699–701, 1974. 55. P. Tfelt-Hansen, G. Block, C. Dahlöf, et al., Guidelines for controlled trials of drugs in migraine: second edition. Cephalalgia. 20: 765–786, 2000. 56. S. D. Silberstein, Practice parameter: evidence-based guidelines for migraine headache (an evidence-based review): report of the Quality Standards Subcommittee of the American Academy of Neurology. Neurology. 55: 754–762, 2000. 57. N. T. Mathew, A. Rapoport, J. Saper, et al., Efficacy of gabapentin in migraine prophylaxis. Headache. 41: 119, 2001. 58. J. L. Brandes, J. R. Saper, M. Diamond, et al., Topiramate for migraine prevention: a randomized controlled trial. JAMA. 291: 965, 2004. 59. J. L. Brandes, D. Kudrow, S. R. Stark, et al., Sumatriptan-naproxen for acute treatment of migraine: a randomized trial. JAMA. 297: 1443, 2007. 60. B. W. Friedman, J. Corbo, R. B. Lipton, et al., A trial of metoclopramide vs sumatriptan for the emergency department treatment of migraines. Neurology. 64: 463, 2005. 61. B. W. Friedman, P. Greenwald, T. C. Bania, et al., Randomized trial of IV dexamethasone for acute migraine in the emergency department. Neurology. 69: 2038–2044, 2007. 62. I. Colman, M. D. Brown, G. D. Innes, E. Grafstein, T. E. Roberts, and B. H. Rowe, Parenteral dihydroergotamine for acute migraine headache: a systematic review of the literature. Ann. Emerg. Med. 45: 393–401, 2005. 63. J. A. French, A. M. Kanner, J. Bautista, et al., Efficacy and tolerability of the new antiepileptic drugs I: treatment of new onset epilepsy: report of the Therapeutics and Technology Assessment Subcommittee and Quality Standards Subcommittee of the American Academy of Neurology and the American Epilepsy Society. Neurology. 62: 1252–1260, 2004. 64. H. Kutt, G. Solomon, C. Wasterlain, H. Peterson, S. Louis, and R. Carruthers, Carbamazepine in difficult to control epileptic out-patients. Acta Neurol. Scand. Suppl. 60: 27–32, 1975. 65. A. Richens and S. Ahmad, Controlled trial of sodium valproate in severe epilepsy. Br. Med. J. 4: 255–256, 1975. 66. S. F. Berkovic, R. C. Knowlton, R. F. Leroy, J. Schiemann, and U. Falter, Levetiracetam N01057 Study Group, Placebo-controlled study of levetiracetam in idiopathic generalized epilepsy. Neurology. 69: 1751–1760, 2007. 67. J. A. French and R. J. Kryscio, Active control trials for epilepsy: avoiding bias in head-to-head trials. Neurology. 66: 1294–1295, 2006. 68. M. J. Brodie, A. Richens, and A. W. Yuen, Double-blind comparison of lamotrigine and carbamazepine in newly diagnosed epilepsy. UK Lamotrigine/Carbamazepine Monotherapy Trial Group. Lancet. 345: 476–479, 1995. 69. E. Tolosa and R. Katzenschlager, Pharmacological management of Parkinson's disease. In: J. Jankovic and E. Tolosa (eds.), Parkinson's Disease and Movement Disorders. Philadelphia, PA: Lippincott Williams and Wilkins, 2007. 70. A. J. Hughes, S. E. Daniel, L. Kilford, et al., Accuracy of clinical diagnosis of idiopathic Parkinson's disease: a clinico-pathological study of 100 cases. J. Neurol. Neurosurg. Psychiatry. 55: 181–184, 1992. 71. M. M. Hoehn and M. D. Yahr, Parkinsonism: onset, progression and mortality. Neurology. 17: 427–442, 1967.
CENTRAL NERVOUS SYSTEM (CNS) 72. S. Fahn and R. Elton, Members of the UPDRS Development Committee, Unified Parkinson’s Disease Rating Scale. In: S. Fahn, C. D. Marsden, D. B. Calne, and M. Goldstein (eds.), Recent Developments in Parkinson’s Disease, vol. 2. Florham Park, NJ: MacMillan, 1987. 73. Parkinson Study Group, Levodopa and the progression of Parkinson’s disease. N. Engl. J. Med. 351: 2498–2508, 2004. 74. D. Nyholm, A. I. Nilsson Remahl, N. Dizdar, et al., Duodenal levodopa infusion monotherapy vs oral polypharmacy in advanced Parkinson disease. Neurology. 64(2): 216–223, 2005. 75. P. F. Teychenne, P. N. Leigh, J. L. Reid, D. B. Calne, J. K. Greenacre, A. Petrie, and A. N. Bamji, Idiopathic parkinsonism treated with bromocriptine. Lancet. 2(7933): 473–476, 1975. 76. J. I. Sage and R. C. Duvoisin, Pergolide therapy in Parkinson’s disease: a double-blind, placebo-controlled study. Clin. Neuropharmacol. 8(3): 260–265, 1985. 77. A. Lieberman, A. Ranhosky, and D. Korts, Clinical evaluation of pramipexole in advanced Parkinson’s disease: results of a double-blind, placebo-controlled, parallelgroup study. Neurology. 49: 162–168, 1997. 78. A. Lieberman, C. W. Olanow, K. Sethi, et al., A multicenter trial of ropinirole as adjunct treatment for Parkinson’s disease, Ropinirole Study Group, Neurology. 51: 1057–1062, 1998. 79. R. L. Watts, J. Jankovic, C. Waters, A. Rajput, B. Boroojerdi, and J. Rao, Randomized, blind, controlled trial of transdermal rotigotine in early Parkinson disease. Neurology. 68(4): 272–276, 2007. 80. R. B. Dewey, Jr., J. T. Hutton, P. A. LeWitt, and S. A. Factor, A randomized, double-blind, placebo-controlled trial of subcutaneously injected apomorphine for parkinsonian offstate events. Arch. Neurol. 58: 1385–1392, 2001. 81. B. Sivertsen, E. Dupont, B. Mikkelsen, P. Mogensen, C. Rasmussen, F. Boesen, and E. Heinonen, Selegiline and levodopa in early or moderately advanced Parkinson’s disease: a double-blind controlled short- and longterm study. Acta. Neurol. Scand. Suppl. 126: 147–152, 1989. 82. Parkinson Study Group, A controlled, randomized, delayed-start study of rasagiline in early Parkinson disease. Arch. Neurol. 61: 561–566, 2004. 83. C. H. Waters, K. D. Sethi, R. A. Hauser,
E. Molho, and J. M. Bertoni, Zydis selegiline reduces off time in Parkinson’s disease patients with motor fluctuations: a 3-month, randomized, placebo-controlled study. Mov. Disord. 19: 426–432, 2004. 84. Parkinson Study Group, Entacapone improves motor fluctuations in levodopatreated Parkinson’s disease patients. Ann. Neurol. 42: 747–755, 1997. 85. M. C. Kurth, C. H. Adler, M. S. Hilaire, et al., Tolcapone improves motor function and reduces levodopa requirement in patients with Parkinson’s disease experiencing motor fluctuations: a multicenter, double-blind, randomized, placebo-controlled trial, Tolcapone Fluctuator Study Group I. Neurology. 48(1): 81–87, 1997. ´ 86. F. P. da Silva-Junior, P. Braga-Neto, F. Sueli Monte, and V. M. de Bruin, Amantadine reduces the duration of levodopainduced dyskinesia: a randomized, doubleblind, placebo-controlled study. Parkinsonism Relat. Disord. 11(7): 449–452, 2005. 87. The Parkinson Study Group, Effects of tocopherol and deprenyl on the progression of disability in early Parkinson’s disease. N. Engl. J. Med. 328: 176–183, 1993. 88. C. W. Shults, D. Oakes, K. Kieburtz, et al., Effects of coenzyme Q10 in early Parkinson disease: evidence of slowing of the functional decline. Arch. Neurol. 59: 1541–1550, 2002. 89. G. Deuschl, C. Schade-Brittinger, P. Krack, et al., A randomized trial of deep-brain stimulation for Parkinson’s disease. N. Engl. J. Med. 355(9): 896–908, 2006. 90. M. J. Olek and D. M. Dawson, Multiple sclerosis and other inflammatory demyelinating diseases of the central nervous system. In: W. G. Bradley, R. B. Daroff, G. M. Fenichel, and J. Jankovic (eds.), Neurology in Clinical Practice. Philidelphia, PA: Butterworth Heinemann, 2004. 91. C. H. Polman, S. C. Reingold, G. Edan, et al., Diagnostic criteria for multiple sclerosis: 2005 revisions to the ‘‘McDonald Criteria’’. Ann. Neurol. 58(6): 840–846, 2005. 92. D. S. Goodin, E. M. Frohman, G. P. Garmany, Jr., et al., Disease modifying therapies in multiple sclerosis: report of the Therapeutics and Technology Assessment Subcommittee of the American Academy of Neurology and the MS Council for Clinical Practice Guidelines. Neurology. 58(2): 169–178, 2002. 93. The IFNB Multiple Sclerosis Study Group and the UBC MS/MRI Analysis Group, Interferon beta-1b in the treatment of MS: final outcome
of the randomized controlled trial. Neurology. 45: 1277–1285, 1995. 94. K. P. Johnson, B. R. Brooks, J. A. Cohen, et al., Copolymer 1 reduces relapse rate and improves disability in relapsing-remitting multiple sclerosis: results of a phase III multicenter, doubleblind, placebo-controlled trial. Neurology. 45: 1268–1276, 1995. 95. L. D. Jacobs, D. L. Cookfair, R. A. Rudick, et al., Intramuscular interferon beta-1a for disease progression in exacerbating remitting multiple sclerosis. Ann. Neurol. 39: 285–294, 1996. 96. C. H. Polman, P. W. O’Connor, E. Havrdova, et al., A randomized, placebo-controlled trial of natalizumab for relapsing multiple sclerosis. N. Engl. J. Med. 354: 899, 2006. 97. H. Panitch, D. S. Goodin, G. Francis, et al., Randomized, comparative study of interferon beta-1a treatment regimens in MS: The EVIDENCE Trial. Neurology. 59: 1496, 2002. 98. L. Kappos, J. Antel, G. Comi, et al., Oral fingolimod (FTY720) for relapsing multiple sclerosis. N. Engl. J. Med. 355: 1124, 2006. 99. C. Polman, F. Barkhof, M. SandbergWollheim, et al., Treatment with laquinimod reduces development of active MRI lesions in relapsing MS. Neurology. 64: 987, 2005. 100. F. Sellebjerg, J. L. Frederiksen, P. M. Nielsen, and J. Olesen, Double-blind, randomized, placebo-controlled study of oral, highdose methylprednisolone in attacks of MS. Neurology. 51: 529, 1998. 101. D. Barnes, R. A. Hughes, R. W. Morris, et al., Randomised trial of oral and intravenous methylprednisolone in acute relapses of multiple sclerosis. Lancet. 349: 902, 1997. 102. R. W. Beck, P. A. Cleary, M. M. Anderson, Jr., et al., the Optic Neuritis Study Group, A randomized, controlled trial of corticosteroids in the treatment of acute optic neuritis. N. Engl. J. Med. 326: 581, 1992. 103. M. J. Olek, Treatment of progressive multiple sclerosis in Adults. In: B. D. Rose (ed.), UpToDate. Waltham, MA:, 2007. 104. G. Edan, D. Miller, M. Clanet, et al., Therapeutic effect of mitoxantrone combined with methylprednisolone in multiple sclerosis: A randomised multicentre study of active disease using MRI and clinical criteria. J. Neurol. Neurosurg. Psychiatry. 62: 112, 1997. 105. H. P. Hartung, R. Gonsette, N. Konig, et al., Mitoxantrone in progressive multiple sclerosis: a placebo-controlled, double-blind, randomised, multicentre trial. Lancet. 360: 2018, 2002.
CFR 21 PART 11

MICHAEL P. OWINGS and PAUL A. BLEICHER
Phase Forward, Waltham, Massachusetts

One consequence of Part 11 was to provide a framework for electronic data capture and electronic record keeping in clinical trials; as such, it has had a significant impact on clinical research, and specifically on clinical trials, with regard to the implementation of technology, validation, data collection, data management, and regulatory submissions. Several guidances remain in effect for the interpretation of 21 CFR Part 11, specifically the Guidance for Computerized Systems Used in Clinical Trials and a Scope and Application Guidance. The latter guidance indicates that the FDA is in the process of reinterpreting, and possibly redefining, 21 CFR Part 11 in line with a risk management approach.

On March 20, 1997, the U.S. FDA issued an important regulation under the Code of Federal Regulations entitled 21CFR11: ELECTRONIC RECORDS; ELECTRONIC SIGNATURES, which is commonly referred to as "Part 11." The purpose of the regulation is to provide criteria under which the FDA will consider electronic records to be equivalent to paper records, and electronic signatures equivalent to traditional handwritten signatures. Part 11 applies to persons who, in fulfillment of a requirement in a statute, have chosen to maintain records or submit designated information electronically to the FDA, and to records in electronic form that are created, modified, maintained, archived, retrieved, or transmitted under any records requirements set forth in Agency regulations. Part 11 also applies to electronic records submitted to the Agency under the Federal Food, Drug, and Cosmetic Act (the Act) (3) and the Public Health Service Act (the PHS Act) (4), even if such records are not specifically identified in Agency regulations (§ 11.1). The underlying requirements set forth in the Act, the PHS Act, and FDA regulations (other than Part 11) are referred to as predicate rules (2). It is important to note that the predicate rules encompass a broad range of areas and apply not only to Good Clinical Practice, but also to good laboratory practices and good manufacturing practices (GMP). Records maintained or submitted to the agency in these areas are subject to Part 11. One consequence of Part 11 was to provide a framework for electronic data capture and record keeping in clinical trials. As clinical trials increase in number and complexity, electronic solutions claim a growing presence in the effort to improve data collection, transmission, storage, and retrieval. Sponsors are moving toward electronic alternatives to introduce speed and greater reliability into these data-driven processes that, historically, have been paper based. Additionally, the capability to execute electronic signatures at the investigative site (electronic case report forms and case books) is accompanied by increased responsibilities for sponsors and technology providers to adhere to and provide solutions that are compliant with 21CFR11 and the current FDA guideline(s). Significant challenges lie ahead, most notably how to implement electronic solutions so that the resulting electronic records are considered equivalent to paper records, and electronic signatures are considered equal to handwritten ones. Since its effective date, Part 11 has had a significant impact on clinical research, and specifically on clinical trials, with regard to the implementation of technology, validation, data collection, data management, and regulatory submissions.
1 BACKGROUND
Part 11 was developed by the FDA in concert with other government initiatives that began in the early 1990s and continue to the present. The final rule is the result of a six-year process, which began in 1991, when members of the pharmaceutical industry met with the FDA to explore how they could accommodate paperless record systems
Table 1. Selected Information Specified in Subpart A Related to Scope and Definitions 21CFR11 Subpart A—General Provisions §11.1 – Scope (a) The regulations in this part set forth the criteria under which the agency considers electronic records, electronic signatures, and handwritten signatures executed to electronic records to be trustworthy, reliable, and generally equivalent to paper records and handwritten signatures executed on paper. (b) This part applies to records in electronic form that are created, modified, maintained, archived, retrieved, or transmitted, under any records requirements set forth in agency regulations. However, this part does not apply to paper records that are, or have been, transmitted by electronic means. (c) Where electronic signatures and their associated electronic records meet the requirements of this part, the agency will consider the electronic signatures to be equivalent to full handwritten signatures, initials, and other general signings as required by agency regulations, unless specifically excepted by regulation(s) effective on or after August 20, 1997. (e) Computer systems (including hardware and software), controls, and attendant documentation maintained under this part shall be readily available for, and subject to, FDA inspection. §11.3 – Definitions (4) Closed system means an environment in which system access is controlled by persons who are responsible for the content of electronic records that are on the system (6) Electronic record means any combination of text, graphics, data, audio, pictorial, or other information representation in digital form that is created, modified, maintained, archived, retrieved, or distributed by a computer system. (7) Electronic signature means a computer data compilation of any symbol or series of symbols executed, adopted, or authorized by an individual to be the legally binding equivalent of the individual’s handwritten signature. (9) Open system means an environment in which system access is not controlled by persons who are responsible for the content of electronic records that are on the system.
under the current good manufacturing practice (cGMP) regulations in 21 CFR Parts 210 and 211. From these meetings came a task force (a working subgroup), which developed a 1992 notice in the Federal Register that the FDA was considering the use of electronic identification and signatures, a proposed rule in 1994, and ultimately the final rule in 1997. In addition to 21 CFR Parts 210 and 211, Part 11 was complementary to other paper-elimination initiatives already underway within the federal government, such as the Paperwork Reduction Act of 1995 (5) and, later, the Government Paperwork Elimination Act (GPEA) of 1998 (6). The stated purpose of the Paperwork Reduction Act of 1995 is to minimize the paperwork burden for individuals, small businesses, educational and nonprofit institutions, Federal contractors, State, local and tribal governments, and other persons resulting from the collection of information by or for the Federal Government.

Part 11 and GPEA shared the purpose of eliminating paper from complex and costly processes and promoting the use of technology in the creation and management of electronic records and the use of electronic signatures. This period was an inflection point, as the U.S. Government began to encourage and even require the use of technology. GPEA was enacted on October 21, 1998, as title XVII of P.L. 105-277, and contains specific requirements on the use of technology for electronic records, such as: the technology shall be compatible with standards and technology for electronic signatures that are generally used in commerce and industry . . . and shall ensure that electronic signatures are as reliable as is appropriate for the purpose in question and keep intact the information submitted.
As will be shown later in the article, these requirements are consistent with, and in some cases almost identical to, the basic tenets contained in Part 11.
Table 2. Information Specified in Subpart B Related to the Manifestation of the Electronic Signature and Signature Linking

21CFR11 Subpart B–Controls for Closed Systems

§11.10 - Controls for closed systems
Persons who use closed systems to create, modify, maintain, or transmit electronic records shall employ procedures and controls designed to ensure the authenticity, integrity, and, when appropriate, the confidentiality of electronic records, and to ensure that the signer cannot readily repudiate the signed record as not genuine. Such procedures and controls shall include the following:
(a) Validation of systems to ensure accuracy, reliability, consistent intended performance, and the ability to discern invalid or altered records.
(b) The ability to generate accurate and complete copies of records in both human readable and electronic form suitable for inspection, review, and copying by the agency.
(c) Protection of records to enable their accurate and ready retrieval throughout the records retention period.
(d) Limiting system access to authorized individuals.
(e) Use of secure, computer-generated, time-stamped audit trails to independently record the date and time of operator entries and actions that create, modify, or delete electronic records. Record changes shall not obscure previously recorded information. Such audit trail documentation shall be retained for a period at least as long as that required for the subject electronic records and shall be available for agency review and copying.

§11.50 - Signature manifestations
(a) Signed electronic records shall contain information associated with the signing that clearly indicates all of the following: (1) The printed name of the signer; (2) The date and time when the signature was executed; and (3) The meaning (such as review, approval, responsibility, or authorship) associated with the signature.

§11.70 - Signature/record linking
Electronic signatures and handwritten signatures executed to electronic records shall be linked to their respective electronic records to ensure that the signatures cannot be excised, copied, or otherwise transferred to falsify an electronic record by ordinary means.
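The audit-trail provision excerpted in Table 2 (§11.10(e)) is often easier to picture with a concrete data shape in mind. The following is a minimal, purely illustrative sketch in Python (hypothetical class and field names; not a validated system and not an FDA-endorsed design) of an append-only log in which every operator action on an electronic record is captured with a computer-generated timestamp and prior values are preserved rather than obscured.

# Illustrative sketch only. A real Part 11 system would also require
# validation, access controls, retention procedures, and secure storage.
from datetime import datetime, timezone

class AuditTrail:
    """Append-only record of operator actions on electronic records."""

    def __init__(self):
        self._entries = []  # entries are only appended, never edited or deleted

    def record(self, user_id, action, record_id, old_value, new_value):
        entry = {
            'timestamp': datetime.now(timezone.utc).isoformat(),  # computer-generated, time-stamped
            'user_id': user_id,        # operator who created, modified, or deleted the record
            'action': action,          # e.g., 'create', 'modify', 'delete'
            'record_id': record_id,
            'old_value': old_value,    # previously recorded information is not obscured
            'new_value': new_value,
        }
        self._entries.append(entry)
        return entry

    def copy_for_inspection(self):
        """Human-readable copy of the trail for review and copying."""
        return list(self._entries)

trail = AuditTrail()
trail.record('investigator01', 'modify', 'CRF-1023/vital-signs',
             old_value={'systolic_bp': 120}, new_value={'systolic_bp': 128})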
2 ORGANIZATION
Part 11 is divided into three subparts: General Provisions, Electronic Records, and Electronic Signatures. Within these three subparts, Part 11 states the scope of the rule, provides definitions, and states that individuals using the system(s) are to employ procedures and controls designed to ensure the authenticity, integrity, and, when appropriate, confidentiality of electronic records.

2.1 General Provisions

Subpart A, General Provisions, of Part 11 contains three sections: 11.1–Scope, 11.2–Implementation, and 11.3–Definitions. In essence, these three sections establish a basic framework for employing electronic records and electronic signatures in lieu of paper records, and the criteria under which the FDA considers electronic records and electronic signatures to be equivalent to paper records and signatures executed on paper. Additionally, the applicability of Part 11 is briefly addressed:

. . . applies to records in electronic form that are created, modified, maintained, archived, retrieved, or transmitted, under any records requirements set forth in agency regulations. However, this part does not apply to paper records that are, or have been, transmitted by electronic means.

Thus, Part 11 would not apply to a facsimile copy of a paper record or signature required by the agency. The definitions section contains nine definitions related to electronic records and electronic signatures (ERES). Definitions are included for the following terms: Electronic Record, Electronic Signature, and Open and
Table 3. A Selected Portion of the Signature General Requirements and Procedures Contained in Subpart C 21CFR11 Subpart C–Electronic Signatures §11.100 - General requirements (a) Each electronic signature shall be unique to one individual and shall not be reused by, or reassigned to, anyone else. (b) Before an organization establishes, assigns, certifies, or otherwise sanctions an individual’s electronic signature, or any element of such electronic signature, the organization shall verify the identity of the individual. (c) Persons using electronic signatures shall, prior to or at the time of such use, certify to the agency that the electronic signatures in their system, used on or after August 20, 1997, are intended to be the legally binding equivalent of traditional handwritten signatures. (2) Persons using electronic signatures shall, upon agency request, provide additional certification or testimony that a specific electronic signature is the legally binding equivalent of the signer’s handwritten signature. §11.200 - Electronic signature components and controls. (a) Electronic signatures that are not based upon biometrics shall: (1) Employ at least two distinct identification components such as an identification code and password. (2) Be used only by their genuine owners; and (3) Be administered and executed to ensure that attempted use of an individual’s electronic signature by anyone other than its genuine owner requires collaboration of two or more individuals. (b) Electronic signatures based upon biometrics shall be designed to ensure that they cannot be used by anyone other than their genuine owners. §11.300 - Controls for identification codes/passwords Persons who use electronic signatures based upon use of identification codes in combination with passwords shall employ controls to ensure their security and integrity. Such controls shall include: (a) Maintaining the uniqueness of each combined identification code and password, such that no two individuals have the same combination of identification code and password. (d) Use of transaction safeguards to prevent unauthorized use of passwords and/or identification codes, and to detect and report in an immediate and urgent manner any attempts at their unauthorized use to the system security unit, and, as appropriate, to organizational management.
Table 4. An Overview of the Topics Contained in the Guidance for Industry: Computerized Systems Used in Clinical Trials
• General Principles
• Standard Operating Procedures
• Data Entry
• System Features
• Security
• System Dependability
• System Controls
• Training of Personnel
• Records Inspection
• Certification of Electronic Signatures
Closed Systems. Although all definitions in this section should be noted, the definitions provided for Electronic Record, Electronic Signature, and Open and Closed Systems have historically played a very important role
in how practitioners have defined systems and approached implementation and validation of systems for clinical applications in an effort to comply with the rule. Validation and controls of the system are paramount,
given that the General Provisions also state that "Computer systems (including hardware and software), controls, and attendant documentation maintained under this part shall be readily available for, and subject to, FDA inspection."

2.2 Subpart B–Electronic Records

Subpart B–Electronic Records contains four sections: 11.10–Controls for closed systems, 11.30–Controls for open systems, 11.50–Signature Manifestations, and 11.70–Signature/record linking. These sections address how applicable systems are to be controlled, what information must accompany the manifestation of an electronic signature on an electronic record, and how that signature must be linked to its respective electronic records. The preamble of 21 CFR Part 11, which is essential to interpreting the final rule, is flexible on the method of linkage. It states,

While requiring electronic signatures to be linked to their respective electronic records, the final rule affords flexibility in achieving that link through use of any appropriate means, including use of digital signatures and secure relational database references.

Part 11 Subpart B is designed such that the controls delineated under "Controls for Closed Systems" also apply to open systems. This subpart includes important topics such as validation "to ensure accuracy, reliability, consistent intended performance, and the ability to discern invalid or altered records"; system access and security; copies for inspection; record retention; and the "use of secure, computer-generated, time-stamped audit trails to independently record the date and time of operator entries and actions that create, modify, or delete electronic records." In addition, procedures and controls must be in place so that the signer cannot readily repudiate the signed record as not genuine.

2.3 Subpart C–Electronic Signatures

Subpart C contains three sections: 11.100–General requirements, 11.200–Electronic
signature components and controls, and 11.300–Controls for identification codes/passwords. It addresses electronic signatures and defines them as "a computer data compilation of any symbol or series of symbols, executed, adopted, or authorized by an individual to be the legally binding equivalent of the individual's handwritten signature." This section explains the components of electronic signatures and distinguishes between signatures that are based on biometrics and those that are not. It specifies that a nonbiometric signature must "employ at least two distinct identification components such as an identification code and password." According to the rule, whether or not an electronic signature is based on biometrics, steps must be taken to ensure that it cannot be used by anyone but its genuine owner. Biometrics refers to the use of automated methods to verify an individual's identity based on measurement of that individual's physical features and/or repeatable actions that are unique to that individual. Examples include fingerprints, voice patterns, retinal or iris scanning, facial recognition, temperature, or hand geometry. Electronic signature integrity and detection of unauthorized use are also emphasized in this section: it requires the "use of transaction safeguards to prevent unauthorized use of passwords and/or identification codes" and to detect and report any attempts at their unauthorized use to the system security unit and, as appropriate, to organizational management. In the event that an individual or company is going to employ electronic signatures, the FDA requires that the party submit an affidavit to "certify to the agency that the electronic signatures in their system, used on or after August 20, 1997, are intended to be the legally binding equivalent of traditional handwritten signatures."
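As an informal illustration of how these Subpart C requirements fit together, the sketch below is hypothetical (invented function and variable names, simplified authentication) and shows only one of the "appropriate means" of linkage that the preamble permits: a non-biometric signature built from two distinct identification components, a signature manifestation carrying the printed name, date and time, and meaning of the signing, and a digest that ties the signature to the exact content of the record it signs.

# Illustrative sketch only; a real system must also verify user identity,
# manage identification codes and passwords per §11.300, and be validated.
import hashlib
import json
from datetime import datetime, timezone

def sign_record(record, user_id, password, printed_name, meaning, credential_store):
    """Return an electronic signature manifestation for an electronic record."""
    # Two distinct identification components: ID code plus password (§11.200(a)(1)).
    supplied = hashlib.sha256(password.encode()).hexdigest()
    if credential_store.get(user_id) != supplied:
        raise PermissionError('authentication failed')

    # Digest of the record content links the signature to this record (§11.70).
    record_digest = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()

    return {
        'printed_name': printed_name,                         # §11.50(a)(1)
        'signed_at': datetime.now(timezone.utc).isoformat(),  # §11.50(a)(2)
        'meaning': meaning,                                   # §11.50(a)(3), e.g., 'approval'
        'user_id': user_id,
        'record_digest': record_digest,
    }

credentials = {'pi_smith': hashlib.sha256(b'correct horse battery').hexdigest()}
case_report_form = {'subject': '001-0042', 'visit': 'Week 4', 'ae_reported': False}
signature = sign_record(case_report_form, 'pi_smith', 'correct horse battery',
                        'Jane Smith, MD', 'approval', credentials)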
3 PART 11 GUIDANCE
Following the issuance of Part 11 in 1997, significant discussions ensued among industry, system providers, and the Agency concerning the interpretation and the practicalities of complying with the new regulations. In particular, concerns were raised in the areas
of validation, audit trails, record retention, record copying, and legacy systems. These concerns suggested the need for additional official guidance from the FDA on the interpretation of Part 11. In an effort to address these concerns and to assist in implementation of the regulation, the FDA published a compliance policy guide in 1999 (CPG 7153.17: Enforcement Policy: 21 CFR Part 11; Electronic Records; Electronic Signatures) (7) and began to publish several Guidance for Industry documents, which included the following:

• 21 CFR Part 11; Electronic Records; Electronic Signatures, Computerized Systems Used in Clinical Trials (Final: April 1999) (8)
• 21 CFR Part 11; Electronic Records; Electronic Signatures, Validation (Draft: August 2001) (9)
• 21 CFR Part 11; Electronic Records; Electronic Signatures, Glossary of Terms (Draft: August 2001) (10)
• 21 CFR Part 11; Electronic Records; Electronic Signatures, Time Stamps (Draft: February 2002) (11)
• 21 CFR Part 11; Electronic Records; Electronic Signatures, Maintenance of Electronic Records (Draft: July 2002) (12)
• 21 CFR Part 11; Electronic Records; Electronic Signatures, Electronic Copies of Electronic Records (Draft: August 2002) (13)

The stated purpose of these guidances is to describe the Agency's current thinking on a topic; therefore, it should be noted that the FDA states that "guidance documents do not establish legally enforceable responsibilities . . . and should be viewed only as recommendations." Computerized Systems Used in Clinical Trials (GCSUCT): A Guidance for Industry was originally released in April 1999 (8). "The Guidance" (as it is often referred to in the electronic clinical trials arena) provides additional detail in support of Part 11 and a commentary on implementation. According to the GCSUCT, it was developed
for two related purposes: to address requirements of 21 CFR Part 11 and to provide the agency's current thinking on issues that pertain to computerized systems used to create, modify, maintain, archive, retrieve, or transmit clinical data intended for submission to the FDA. Although the original version of the GCSUCT is no longer in effect, a new, updated guidance has been issued to replace it (see discussion below). The GCSUCT has been useful and pertinent as a basic guide to Part 11 compliance and implementation for clinical systems.

3.1 Re-Examination of Part 11 and the Scope and Application Guidance

After release of the GCSUCT in 1999, the industry awaited additional guidances in the hope that they would provide greater clarity as to how the FDA expected the industry to comply with Part 11. The release of each of the five additional draft guidances was followed by many comments, re-examinations, discussions, and some confusion within the industry. By the end of 2002, five years after Part 11 was released, industry personnel, system providers, and the FDA still found themselves struggling with the interpretation and implementation of Part 11. Many persons within the industry suggested that full compliance with Part 11 could lead to monumental additional costs for sponsors. These concerns spurred the FDA to re-examine the progress of Part 11 to date and to discuss how to proceed. At the time, the FDA was also re-examining the cGMP regulations under an initiative called Pharmaceutical CGMPs for the 21st Century: A Risk-Based Approach; A Science and Risk-Based Approach to Product Quality Regulation Incorporating an Integrated Quality Systems Approach. It only made sense to also re-examine Part 11. On February 20, 2003, the FDA released a new draft guidance, 21 CFR Part 11; Electronic Records; Electronic Signatures: Scope and Application (14). Accompanying the guidance was the withdrawal of the CPG 7153.17 Part 11 compliance policy and all previous Part 11 draft guidances (with the exception of the final GCSUCT guidance). This new FDA approach to 21CFR11 included
the beginning of an overall re-examination of the rule. The FDA stated that this re-examination may lead to further modifications of Part 11 itself, aligned with the FDA's risk- and science-based approach to the GMP regulations announced in August 2002 (Pharmaceutical cGMPs for the 21st Century). It is important to note that the issuance of this guidance was not a general withdrawal of Part 11, but a repositioning on the part of the FDA. In the Scope and Application guidance, the FDA attempts to narrow the scope and application of Part 11 by closely coupling it with the predicate rules. In effect, the FDA re-emphasized the original basis for Part 11, which for the clinical space is Good Clinical Practice. The final Scope and Application guidance was issued in the Federal Register in August 2003. The approach outlined in the guidance, as stated by the FDA, was based on three main elements:

• Part 11 will be interpreted narrowly by the agency; we (the FDA) are now clarifying that fewer records will be considered subject to part 11.
• For those records that remain subject to Part 11, we (the FDA) intend to exercise enforcement discretion with regard to part 11 requirements for validation, audit trails, record retention, and record copying in the manner described in this guidance and with regard to all part 11 requirements for systems that were operational before the effective date of part 11 (known as legacy systems).
• We (the FDA) will enforce all predicate rule requirements, which include predicate rule record and recordkeeping requirements.
Several key points and concepts were introduced and discussed in the Scope and Application guidance; risk assessment was one of the more prominent topics. The FDA states:

We recommend that you base your approach on a justified and documented risk assessment and a determination of the potential of the system to affect product quality and safety and record integrity.

This allows for narrowing the scope of systems under Part 11 requirements based on a documented risk assessment with relation to Safety, Efficacy, and Quality (SEQ) as opposed to arbitrary parameters.

The FDA refers to enforcement discretion throughout the Scope and Application guidance. The FDA expressed its intent not to take regulatory action to enforce compliance with Part 11 regarding validation, audit trail, record retention, and record copying requirements. The guidance states that the FDA intends to exercise enforcement discretion with regard to legacy systems that otherwise met predicate rule requirements prior to August 20, 1997 (the date that the original Part 11 went into effect). Legacy systems, however, are not well defined in the guidance. For example, the guidance is not specific about systems that were in place in 1997 but whose components or software have been replaced subsequently. The language in the guidance suggests that the (legacy) system had to be in compliance with the predicate rule prior to 1997 to qualify for enforcement discretion under this guidance. Some persons in the industry take the stance that if the basic system was in place and functions as originally intended, then it falls within the definition of a legacy system. A stricter interpretation would only allow systems that were implemented entirely prior to 1997 into the category of legacy systems. One interpretation of the guidance would suggest that it reduced the need for validation, but the majority interpretation was that this result was not what the FDA intended. Some basic tenets still hold true regarding validation and Part 11:
1. Validation is still required for systems covered by Part 11. The basic approach to validation set forth in the Predicate Rule and previous guidances is not specific to 21 CFR Part 11 and had been in practice before Part 11.
2. Although the CPG.7153.17 Part 11 Compliance Policy and all but one previous Part 11 DRAFT guidance have been withdrawn, 21CFR11 is still in force and compliance is expected within the modified expectations of the Scope and Application guidance.
3. Predicate rules have increased in importance for company management and compliance regarding Part 11, which presents an opportunity for industry to re-examine and refocus their efforts on the truly important aspects of compliance and the predicate rules regarding clinical systems and electronic records.
3.2 Computerized Systems Used in Clinical Investigations; Guidance for Industry (May 2007) (16) Supplements the Scope and Application Guidance and Serves to Unify Part 11 Guidances

In May 2007, the FDA issued a new final guidance: Computerized Systems Used in Clinical Investigations. This guidance was based on the DRAFT guidance issued in 2004: Computerized Systems Used in Clinical Trials (15). As stated in the introduction, the guidance ‘‘provides to sponsors, contract research organizations (CROs), data management centers, clinical investigators, and institutional review boards (IRBs), recommendations regarding the use of computerized systems in clinical investigations’’ and, additionally, ‘‘supersedes the guidance of the same name dated April 1999; and supplements the guidance for industry on Part 11, Electronic Records; Electronic Signatures — Scope and Application and the Agency’s international harmonization efforts when applying these guidances to source data generated at clinical study sites.’’ After the Scope and Application guidance was issued, inconsistencies existed between the new guidance and the 1999 GCSUCT guidance that remained in effect. Namely, the areas identified by the FDA as areas in which it would exercise ‘‘enforcement discretion’’ were addressed explicitly and rather prescriptively in the GCSUCT. This caused a misalignment between the guidances and required the FDA to re-examine and revise the GCSUCT. The document titled ‘‘Computerized Systems Used in Clinical Investigations’’ is the result of that work. The following are 11 key points to note in the guidance:
1. Application of the guidance. This guidance applies to:
• ‘‘Computerized systems that contain any data that are relied on by an applicant in support of a marketing application, including computerized laboratory information management systems that capture analytical results of tests conducted during a clinical trial’’
• ‘‘Recorded source data transmitted from automated instruments directly to a computerized system (e.g., data from a chemistry autoanalyser or a Holter monitor to a laboratory information system)’’
• ‘‘Source documentation that is created in hardcopy and later entered into a computerized system, recorded by direct entry into a computerized system, or automatically recorded by a computerized system (e.g., an ECG reading)’’
The guidance states that it does not apply to ‘‘computerized medical devices that generate such data and that are otherwise regulated by FDA.’’ In addition, the guidance states that acceptance of data from clinical trials for decision-making purposes depends on the FDA’s ability to verify the quality and integrity of the data during FDA on-site inspections and audits (21 CFR 312, 511.1(b), and 812).
2. The introduction references and emphasizes the importance of ‘‘source data’’ to reconstruct the study. The word ‘‘source’’ is used 24 times throughout the document in reference to source data and/or documentation. This usage differs from previous guidance(s), in which ‘‘source data’’ is hardly mentioned in the electronic context.
3. The guidance focuses on electronic source data and reiterates previously stated tenets related to data integrity: ‘‘Such electronic source data and source documentation must meet the same fundamental elements of data quality (e.g., attributable, legible, contemporaneous, original, and accurate) that are expected of paper records and must
comply with all applicable statutory and regulatory requirements.’’
4. The guidance requires that sponsors identify each step at which a computerized system will be used to create, modify, maintain, archive, retrieve, or transmit source data.
5. The recommendations section (section IV) states that computerized systems should be designed ‘‘to prevent errors in data creation, modification, maintenance, archiving, retrieval, or transmission (e.g., inadvertently unblinding a study).’’ This statement is important to note; essentially the FDA is saying that computerized systems can and should be designed and used to improve data quality and ensure data integrity.
6. Also under Section IV (c) regarding Source Documentation and Retention, the FDA explicitly states that ‘‘when original observations are entered into a computerized system, the electronic record is the source document’’ (eSource). Here the FDA makes clear its position regarding the interpretation, or definition, of electronic source data.
7. Overall, the guidance is not that prescriptive, with the exception of Section IV (d) on Internal Security Safeguards, which literally provides suggested system features and workflow on how the user should conduct a data entry session.
8. Audit Trails: The most significant point regarding audit trails is that the FDA explicitly states that ‘‘Audit trails . . . used to capture electronic record activities should describe when, by whom, and the reason changes were made to the electronic record.’’ Historically, it has been debated whether ‘‘reason for change’’ is a necessary system feature; this guidance makes clear the FDA position that ‘‘reason for change’’ is required. (A minimal illustrative sketch of such an audit trail record follows this list.)
9. Date and Time Stamps: The FDA states that ‘‘Any changes to date and time should always be documented . . . we do not expect documentation of time changes that systems make automatically to adjust to daylight savings time.’’
10. Under Section IV (e) on External Security Safeguards, the guidance states that ‘‘you should maintain a cumulative record that indicates, for any point in time, the names of authorized personnel, their titles, and a description of their access privileges.’’
11. Change Control: The guidance emphasizes the need to preserve data integrity during ‘‘security and performance patches, or component replacement.’’ In addition, it recommends ‘‘maintaining back up and recovery logs.’’ The specificity of this section is more explicit than previous guidance(s).
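To make point 8 concrete, the following is a minimal sketch of an append-only audit trail record that captures the ‘‘when, by whom, and reason for change’’ elements. It is written in Python purely for illustration; the guidance does not prescribe any particular technology, and all class and field names here are hypothetical.

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List

@dataclass(frozen=True)
class AuditTrailEntry:
    # Hypothetical fields mirroring the "when, by whom, and the reason" expectation.
    record_id: str          # identifier of the electronic record that was changed
    field_name: str         # which data element was changed
    old_value: str
    new_value: str
    changed_by: str         # authenticated user ("by whom")
    changed_at: datetime    # computer-generated timestamp ("when")
    reason_for_change: str  # free-text or coded reason, captured at the time of change

class AuditTrail:
    """Append-only log: entries are added, never edited or deleted."""
    def __init__(self) -> None:
        self._entries: List[AuditTrailEntry] = []

    def record_change(self, record_id, field_name, old_value, new_value,
                      changed_by, reason_for_change) -> AuditTrailEntry:
        entry = AuditTrailEntry(
            record_id=record_id,
            field_name=field_name,
            old_value=old_value,
            new_value=new_value,
            changed_by=changed_by,
            changed_at=datetime.now(timezone.utc),
            reason_for_change=reason_for_change,
        )
        self._entries.append(entry)
        return entry

The design point worth noting is that entries are only ever appended, which mirrors the expectation that an audit trail preserves the full history of an electronic record rather than overwriting it.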
4 CONCLUSION
21CFR11 has fulfilled its original intent to allow for electronic recording of data and electronic record keeping in all documents intended for submission to the FDA. The FDA’s intention on the scope and enforcement of the rule was clarified in the Scope and Application guidance of 2003. 21CFR11 is currently being ‘‘re-examined’’ by the FDA, but the agency has not provided a clear timeframe for when this re-examination will conclude or what the expected outcome of this initiative will be. Some industry insiders speculate that it will conclude sooner rather than later, and that the regulation will either be significantly changed or revoked. Others believe that the re-examination will result in simply incorporating the philosophy of the Scope and Application guidance and the recently updated GCSUCI guidance into the actual regulation.

REFERENCES
1. 62 FR 13430, 21 CFR Part 11; Electronic Records; Electronic Signatures. Published in the Federal Register, March 20, 1997.
2. Predicate Rule: These requirements include certain provisions of the Current Good Manufacturing Practice regulations (21 CFR Part 211), Current Good Manufacturing Practice in Manufacturing, Processing, Packing, or
Holding of Drugs (21 CFR Part 210), the Quality System Regulation (21 CFR Part 820), the Good Laboratory Practice for Nonclinical Laboratory Studies regulations (21 CFR Part 58), Investigator recordkeeping and record retention (21 CFR 312.62), IND safety reports (21 CFR 312.32), Protection of Human Subjects (21 CFR Part 50), Documentation of informed consent (21 CFR 50.27), Institutional Review Boards (21 CFR PART 56): Subpart D—Records and Reports, 56.115 IRB records; Subpart C—IRB Functions and Operations, 56.109 IRB review of research; and 21 CFR PART 314—APPLICATIONS FOR FDA APPROVAL TO MARKET A NEW DRUG (NDA), Subpart B—Applications.
3. Federal Food, Drug, and Cosmetic Act (the Act).
4. Public Health Service Act (the PHS Act).
5. Paperwork Reduction Act of 1995.
6. Government Paperwork Elimination Act (GPEA), Title XVII, P.L. 105-277, 10/21/98.
7. CPG 7153.17: Enforcement Policy: 21 CFR Part 11; Electronic Records; Electronic Signatures.
8. Computerized Systems Used in Clinical Trials (GCSUCT), Guidance for Industry (Final: April 1999).
9. 21 CFR Part 11; Electronic Records; Electronic Signatures, Validation (Draft: August 2001).
10. 21 CFR Part 11; Electronic Records; Electronic Signatures, Glossary of Terms (Draft: August 2001).
11. 21 CFR Part 11; Electronic Records; Electronic Signatures, Time Stamps (Draft: February 2002).
12. 21 CFR Part 11; Electronic Records; Electronic Signatures, Maintenance of Electronic Records (Draft: July 2002).
13. 21 CFR Part 11; Electronic Records; Electronic Signatures, Electronic Copies of Electronic Records (Draft: August 2002).
14. 21 CFR Part 11; Electronic Records; Electronic Signatures: Scope and Application (Final: August 2003).
15. Computerized Systems Used in Clinical Trials (GCSUCT), Guidance for Industry (Draft: September 2004).
16. Computerized Systems Used in Clinical Investigations (GCSUCI), Guidance for Industry (May 2007).
FURTHER READING R. D. Kush, eClinical Trials, Planning and Implementation. Boston, MA: Thomson Centerwatch, 2003.
CROSS-REFERENCES Case Report Form Clinical Data Management Software for Data Management Software for Genetics/Genomics Clinical Trial/Study Conduct Electronic Submission of NDA EMEA Food and Drug Administration (FDA, USA) Good Clinical Practice (GCP) International Conference on Harmonization (ICH) Investigational New Drug Application (IND) Confidentiality HIPAA
ANALYZING CHANGE FROM BASELINE IN RANDOMIZED TRIALS

ANDREW J. VICKERS
Memorial Sloan-Kettering Cancer Center, New York, New York

A very common scenario in medicine is for patients to present to a clinician with some sort of a chronic complaint and ask whether the severity of the complaint can be reduced by treatment. Accordingly, many clinical trials involve measuring the severity of a complaint before and after randomly allocated treatment: Patients with chronic headache, obesity, or hypertension are randomized to treatment or control to observe whether the treatment is effective for reducing pain, weight, or blood pressure. This article provides a nontechnical introduction to the statistical analysis of two-arm randomized trials with baseline and follow-up measurement of a continuous outcome.

1 METHODS FOR ANALYZING TRIALS WITH BASELINE AND FOLLOW-UP MEASURES

In the simplest case, an outcome measure such as pain, weight, or blood pressure is measured only twice, once before randomization (‘‘baseline’’) and once at some point after randomization (‘‘follow-up’’), such as shortly after treatment is complete. We will use a simple notation for this introductory paper: X is the baseline measurement, Y is the follow-up assessment, and T is an indicator variable coded 1 for patients who receive treatment and coded 0 for controls. We will assume that only a single X and Y exist per patient so that, for example, if a weekly pain score is reported for 4 weeks after treatment, a summary statistic such as a mean is calculated and used as each patient’s datum. We will compare four different methods of analysis: POST, CHANGE, FRACTION, and ANCOVA (analysis of covariance), which are defined as follows:
• POST: Y is compared by T.
• CHANGE: X − Y is compared by T.
• FRACTION: (X − Y) ÷ X is compared by T.
• ANCOVA: Calculate Y = β1X + β2T + constant.

To illustrate the methods, we will use pain scores on a 0–10 scale as an example, making the common assumption that the scores can be treated as a continuous variable, rather than 11 ordered categories. Take the case of a patient with a baseline pain score of 5, which improves to a score of 3 after randomized treatment. The number entered into a between-group two-sample comparison, such as a t-test or Mann-Whitney U test, will be 3 for the POST analysis, 2 for the CHANGE analysis, and 40% for the FRACTION analysis. In ANCOVA, 3 would be entered for the dependent variable (Y) and 5 for the covariate (X).

1.1 Clinical Interpretation of Statistical Results

For the sake of argument, assume that our illustrative patient represented the mean in the treatment group, and that there was a change in means from 5 to 4 in the control group. The results of the trial could be stated in the following terms:
• POST: Mean pain at follow-up was 3 in the treatment group compared with 4 in controls.
• CHANGE: Pain fell by a mean of 2 points in the treatment group compared with 1 point in controls.
• FRACTION: Pain fell by a mean of 40% in the treatment group compared with 20% in controls.
• ANCOVA: As for CHANGE, where β2 is the change score.

The ease with which clinicians can understand each of these interpretations will vary with the endpoint. In the case of hypertension, where particular cut-points such as 140 mm Hg are considered important, clinicians will often want to know the mean
posttreatment measure (POST); for obesity research, clinical questions are often expressed in terms of weight loss (CHANGE). Percentage change (FRACTION) is obviously of little relevance for weight or blood pressure, but it is often useful for a symptom score, such as pain or number of hot flashes, which would ideally be close to zero.
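Before turning to statistical power, it may help to see the four analyses run on data. The sketch below simulates a small two-arm trial and applies POST, CHANGE, FRACTION, and ANCOVA. It is written in Python using the numpy, pandas, scipy, and statsmodels packages; the article specifies no software, so this is an illustrative assumption and the simulated numbers are arbitrary.

import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 30                                    # patients per arm
rho = 0.5                                 # approximate baseline/follow-up correlation
t = np.repeat([0, 1], n)                  # 0 = control, 1 = treatment
x = np.clip(rng.normal(5, 2, size=2 * n), 1, 10)   # baseline pain score, kept on a 1-10 range
noise = rng.normal(0, 2 * np.sqrt(1 - rho ** 2), size=2 * n)
y = 5 + rho * (x - 5) + noise - 1.0 * t   # follow-up score; treatment lowers scores by about 1 point

df = pd.DataFrame({"X": x, "Y": y, "T": t})

# POST: compare follow-up scores between arms
print("POST:    ", stats.ttest_ind(y[t == 1], y[t == 0]))
# CHANGE: compare the reduction (baseline minus follow-up) between arms
change = x - y
print("CHANGE:  ", stats.ttest_ind(change[t == 1], change[t == 0]))
# FRACTION: compare the proportional reduction between arms
fraction = (x - y) / x
print("FRACTION:", stats.ttest_ind(fraction[t == 1], fraction[t == 0]))
# ANCOVA: regress follow-up on baseline and the treatment indicator
fit = smf.ols("Y ~ X + T", data=df).fit()
print("ANCOVA:  ", fit.params["T"], fit.pvalues["T"])

The coefficient on T from the ANCOVA fit is β2, the between-group difference at follow-up adjusted for baseline.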
2 STATISTICAL POWER OF THE ALTERNATIVE ANALYSIS METHODS
Frison and Pocock (1) have published formulae that can be used to estimate the relative power of POST, CHANGE, and ANCOVA. Relative power depends on the correlation between the baseline and follow-up measure (ρxy). Assuming that the variances of the baseline and postrandomization measures are similar, CHANGE has higher power than POST if the correlation is greater than 0.5 and lower power otherwise. This can be thought of in terms of whether the baseline measure adds variation (correlation less than 0.5) or prediction (correlation greater than 0.5). ANCOVA has superior power to both POST and CHANGE irrespective of the correlation between the baseline and postrandomization measure. However, if the correlation is either very high or very low, the power of ANCOVA is not importantly higher than that of CHANGE in the former case or POST in the latter. These relationships are shown in Fig. 1.

Figure 1. Relative sample size required for different methods of analysis, plotted against the correlation between baseline and follow-up measures. Dashed line: POST. Solid grey line: CHANGE. Solid dark line: ANCOVA.
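The relationships in Fig. 1 can be reproduced from the variances of the three estimators. Assuming equal baseline and follow-up variances (the simplification used above), the sample size needed relative to POST is 2(1 − ρ) for CHANGE and 1 − ρ² for ANCOVA. A minimal sketch, under that assumption:

# Relative sample size needed to match the power of POST (POST = 1.0),
# assuming equal baseline and follow-up variances and correlation rho.
def relative_sample_size(rho: float) -> dict:
    return {
        "POST": 1.0,
        "CHANGE": 2.0 * (1.0 - rho),   # beats POST only when rho > 0.5
        "ANCOVA": 1.0 - rho ** 2,      # never worse than POST or CHANGE
    }

for rho in (0.2, 0.5, 0.8):
    print(rho, relative_sample_size(rho))
# rho = 0.2 -> CHANGE needs 1.6x the patients of POST; ANCOVA needs 0.96x
# rho = 0.8 -> CHANGE needs 0.4x; ANCOVA needs 0.36x

At ρ = 0.5 the CHANGE and POST lines cross, and ANCOVA is never worse than either, which is exactly the pattern described above.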
A slight complication in the use of Frison and Pocock’s formulae is that they require the correlation between baseline and follow-up for the data set in the randomized trial. This correlation will differ from the correlation observed in any pretrial data set because of the effects of treatment. For example, imagine that a group of researchers has a large database of pain scores taken before and after treatment with a standard analgesic agent, and the correlation between baseline and follow-up (ρxy) is 0.5. In a randomized trial that compares a new drug to the standard, ρxy will be close to 0.5 in the control arm as well as in the trial as a whole if the new drug and the standard analgesic have similar analgesic effects. However, if the new agent were completely effective, with pain scores of 0 for all patients, ρxy would be 0 in the experimental arm, and therefore somewhere between 0 and 0.5 for the trial as a whole. However, for the moderate effect sizes observed in most clinical trials, the difference between ρxy in the control arm and ρxy for the whole trial will be small: For example, if ρxy is 0.5 in the control arm, and there is an effect size of 0.5 standard deviations, then ρxy will be approximately 0.485 for control and treatment group patients combined. No analytic formulae have been published for the power of FRACTION. Accordingly, the power of the four analytic approaches can best be compared by simulation. Our base
scenario is a trial with a 0.5 standard deviation difference between groups and a total sample size of 60 equally divided between two arms. This sample size was chosen to yield a power close to 50% for the POST analysis. The correlation between baseline and follow-up measures within each group separately was varied between 0.2 and 0.8. Two other parameters were varied: whether the effect of treatment was a shift (e.g., treatment reduced scores by two points) or a ratio (e.g., treatment reduced scores by 40%), and whether change was observed in the control arm. The results are shown in the tables. Table 1 is similar to a previously published result (2). Power is highest for ANCOVA; the relative power of POST and CHANGE is as shown in Fig. 1. FRACTION has lower power than CHANGE. Table 2 is a result that has not been previously reported: The scores are reduced in the control group (e.g., the ‘‘placebo effect’’) but are reduced more in the treatment group. FRACTION has higher power than CHANGE; however, if scores increase in both groups (e.g., pain worsening over time), although less so in treated patients, the power of FRACTION is reduced. Clearly, the relative power of FRACTION and
CHANGE depends on the difference between baseline and follow-up measures in the control group. Tables 3 and 4 show the results of the simulation using a ratio treatment effect. The relative powers of the different methods of analysis are similar to those for a shift treatment effect.
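The following is a minimal sketch of the kind of simulation that produces the power estimates reported in Tables 1–4 below. The article does not provide its simulation code, so the distributional details here (bivariate normal scores, a shift effect of 0.5 standard deviations, 30 patients per arm) are assumptions matched to the stated parameters.

import numpy as np
import scipy.stats as stats

def simulate_power(rho, n=30, effect=0.5, n_sims=2000, seed=1):
    """Estimate power of POST, CHANGE, FRACTION, and ANCOVA for a shift treatment effect."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    hits = {"POST": 0, "CHANGE": 0, "FRACTION": 0, "ANCOVA": 0}
    for _ in range(n_sims):
        # Correlated baseline (column 0) and follow-up (column 1) scores,
        # centered at 10 so that FRACTION is well defined.
        ctrl = rng.multivariate_normal([10.0, 10.0], cov, size=n)
        trt = rng.multivariate_normal([10.0, 10.0], cov, size=n)
        trt[:, 1] -= effect                     # shift treatment follow-up down by 0.5 SD
        x = np.concatenate([ctrl[:, 0], trt[:, 0]])
        y = np.concatenate([ctrl[:, 1], trt[:, 1]])
        t = np.repeat([0.0, 1.0], n)
        tests = {
            "POST": stats.ttest_ind(y[t == 1], y[t == 0]).pvalue,
            "CHANGE": stats.ttest_ind((x - y)[t == 1], (x - y)[t == 0]).pvalue,
            "FRACTION": stats.ttest_ind(((x - y) / x)[t == 1], ((x - y) / x)[t == 0]).pvalue,
        }
        # ANCOVA: test the treatment coefficient in Y = b1*X + b2*T + constant.
        design = np.column_stack([np.ones_like(x), x, t])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        df_resid = len(y) - design.shape[1]
        sigma2 = resid @ resid / df_resid
        se_b2 = np.sqrt(sigma2 * np.linalg.inv(design.T @ design)[2, 2])
        tests["ANCOVA"] = 2 * stats.t.sf(abs(beta[2] / se_b2), df_resid)
        for name, p in tests.items():
            hits[name] += p < 0.05
    return {name: count / n_sims for name, count in hits.items()}

print(simulate_power(rho=0.5))   # should roughly match the 0.50 column of Table 1

Varying rho, the type of treatment effect, and the change in the control arm should reproduce the qualitative patterns shown in the tables.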
Table 1. Power of Methods for Analyzing Trials with Baseline and Follow-up Measures where Treatment Effect is a Shift

Correlation between baseline and follow-up:   0.20    0.35    0.50    0.65    0.80
POST                                          0.475   0.484   0.482   0.481   0.475
CHANGE                                        0.322   0.387   0.470   0.626   0.847
FRACTION                                      0.300   0.367   0.454   0.609   0.838
ANCOVA                                        0.488   0.527   0.589   0.701   0.874

The trial parameters are 30 patients per group, no change between baseline and follow-up in the control group, and a mean 0.5 standard deviation decrease in the treatment group.

Table 2. Power of FRACTION for Analyzing Trials with Baseline and Follow-up Measures, Varying Change in the Control Group

Correlation between baseline and follow-up:   0.20    0.35    0.50    0.65    0.80
1 SD decrease                                 0.362   0.436   0.531   0.678   0.859
2 SD decrease                                 0.422   0.492   0.570   0.675   0.788
3 SD decrease                                 0.461   0.515   0.561   0.611   0.655
1 SD increase                                 0.251   0.303   0.379   0.510   0.727

The trial parameters are 30 patients per group, and a 0.5 standard deviation decrease in the treatment group. Power for other methods is invariant to the change in controls; accordingly, Table 1 can be used as a comparison.
Table 3. Power of Methods for Analyzing Trials with Baseline and Follow-up Measures where Treatment Effect is a Ratio

Correlation between baseline and follow-up:   0.20    0.35    0.50    0.65    0.80
POST                                          0.513   0.524   0.520   0.518   0.513
CHANGE                                        0.337   0.402   0.488   0.646   0.860
FRACTION                                      0.307   0.370   0.457   0.608   0.829
ANCOVA                                        0.525   0.570   0.630   0.743   0.903

The trial parameters are 30 patients per group and no change between baseline and follow-up in the control group. Scores in the treatment group are reduced by a ratio equivalent to a mean 0.5 standard deviation.

Table 4. Power of Methods for Analyzing Trials with Baseline and Follow-up Measures where Treatment Effect is a Ratio and Scores Improve in the Control Group

Correlation between baseline and follow-up:   0.20    0.35    0.50    0.65    0.80
POST                                          0.523   0.534   0.530   0.527   0.522
CHANGE                                        0.340   0.406   0.491   0.649   0.862
FRACTION                                      0.378   0.448   0.543   0.688   0.871
ANCOVA                                        0.537   0.579   0.640   0.754   0.908

The trial parameters are 30 patients per group and a 1 standard deviation change between baseline and follow-up in the control group. Scores in the treatment group are reduced by a ratio equivalent to a mean 0.5 standard deviation.

2.1 Which Method to Use?

The relative power of POST, CHANGE, and FRACTION varies in complex ways depending on the correlation, the type of treatment effect, and the degree of change in the control group. Nonetheless, the key message remains the same: ANCOVA has the greatest power irrespective of the scenario. Yet, as pointed out above, occasionally other methods may provide results of greater clinical relevance. A simple solution is based on the principle that presentation of statistical results need not be determined solely by the principal inferential analysis. The most obvious application of this principle is when outcome is measured on a continuous scale, but a binary outcome is used as an additional method to present trial data. For example, in a trial on headache pain (3), the primary outcome was a difference between days
with headache, but the authors also reported the proportion of patients who experienced a 50% or greater reduction in headache days. Accordingly, one can report the results of the ANCOVA and then, if it would be helpful, give a percentage change by dividing β2, the difference between treatments estimated by ANCOVA, by the mean follow-up score in the control group. Note that the choice of ANCOVA as the method of analysis has implications for sample size calculation. Such calculations require an estimate of the correlation between baseline and follow-up measures (1). Also, note that sample size was not varied in the simulations: ANCOVA involves the loss of an additional degree of freedom, which can hurt power if sample sizes are very small (such as fewer than five patients per group).

2.2 Additional Advantages of ANCOVA

ANCOVA has several advantages, in addition to statistical power, in comparison with CHANGE, POST, or FRACTION. The first concerns chance imbalance between groups at baseline, for example, if pain scores at randomization were higher in the treatment group. Of course, all methods of analysis are
unconditionally unbiased, that is, their power (true positive rate) and size (false positive rate or Type I error) are close to nominal levels when averaged over a large number of trials, even if a chance baseline imbalance is observed in some trials. However, baseline imbalance may compromise any specific trial. In the case where the baseline measure is higher in the treatment group, a POST analysis will underestimate the true treatment effect and, because of regression to the mean, a CHANGE analysis will overestimate treatment benefit [empirical examples of this phenomenon have been published (4)]. ANCOVA provides a good estimate of treatment effect irrespective of chance baseline imbalance. The second advantage of ANCOVA is related to stratification. If a trial incorporates one or more stratification factors in randomization, then failure to incorporate these factors in analysis will inflate P values (5). Power can be returned to nominal levels by adding randomization strata as covariates in ANCOVA. For example, if a hypertension trial was randomized using age and weight as stratification variables, then analysis should use Y = β1X + β2T + β3age + β4weight + constant, where age and weight are entered
in the same form as for stratification (e.g., as continuous variables for minimized randomization; as binary variables for randomization by stratified blocks). ANCOVA is extended to more complex designs and analyses more easily than two-sample comparisons. These designs include trials that assess treatment effects that diverge over time, for example, where the effects of a treatment are cumulative; trials with more than two arms (e.g., psychotherapy vs. relaxation therapy vs. no treatment); or analyses that examine whether treatment response differs by baseline severity. Although these topics are outside the scope of this article, it can be noted briefly that trials with divergent treatment effects can be analyzed either by an ANCOVA using the slope of follow-up measurements as the dependent variable or by a generalized estimating equations approach with terms for both time and time by treatment interaction. Multiarm trials can be analyzed by ANCOVA using dummy variables for treatments (e.g., ‘‘contact’’ and ‘‘talking therapy’’ would be coded 1, 1 for psychotherapy; 1, 0 for relaxation therapy; 0, 0 for controls); analyses that examine whether treatment response differs by baseline severity can use an interaction term such that Y = β1X + β2T + β3(T × X) + constant.

3 EFFECT OF NON-NORMAL DISTRIBUTIONS ON ANALYSIS

ANCOVA is a parametric analysis, and it has become an article of faith that parametric statistics require either normally distributed data or sample sizes large enough to invoke the ‘‘Central Limit Theorem.’’ As an example, one popular statistics textbook states that, ‘‘Parametric methods require the observations in each group to have an approximately Normal distribution . . . if the raw data do not satisfy these conditions . . . a non-parametric method should be used’’ (6). Moreover, a considerable number of simulation studies have shown that although Type I errors of parametric methods such as the t-test are unaffected by the sample distribution, such methods are underpowered for the analysis of nonsymmetric data (7). Accordingly perhaps, the New England Journal of
Medicine currently includes the following in its instructions to authors. ‘‘For analysis of [continuous] measurements, nonparametric methods should be used to compare groups when the distribution of the dependent variable is not normal’’ (note that, at the time of writing, these recommendations are under revision). That said, for a linear regression, it is not the distribution of the dependent variable Y that is of interest, but the distribution of Y conditional on the independent variables. This statement means that, for an ANCOVA, the distribution of interest is not Y but X − Y, in other words, the change score. As it happens, if X and Y are both drawn from non-normal distributions, X − Y will often be distributed normally, because change scores are a linear combination and the Central Limit Theorem therefore applies. As a straightforward example, imagine that baseline and post-treatment measure were represented by a single throw of a die. The posttreatment measure has a flat (uniform) distribution, with each possible value having an equal probability. The change score has a more normal distribution: A peak is observed in the middle at zero; the chance of a zero change score is the same as the chance of throwing the same number twice, that is 1 in 6—with more rare events at the extremes—there is only a 1 in 18 chance of increasing or decreasing score by 5. A similar result can be demonstrated for a wide variety of distributions, which include uniform, asymmetric, and what is probably most common in medical research, positive skew. Distributions for baseline and follow-up measures may show extreme deviations from the normal, yet the change score tends toward normality (7). Simulation studies have shown that ANCOVA retains its power advantages over CHANGE and POST for data sampled from distributions other than the normal (7).
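The die-throw illustration is easy to verify by enumeration; a short sketch in Python, counting all 36 equally likely baseline/follow-up pairs:

from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely (baseline, follow-up) pairs of die throws.
pairs = list(product(range(1, 7), repeat=2))

follow_up = Counter(y for _, y in pairs)   # flat: each value has probability 6/36
change = Counter(x - y for x, y in pairs)  # peaked at zero, tapering at the extremes

print({k: Fraction(v, 36) for k, v in sorted(follow_up.items())})
print({k: Fraction(v, 36) for k, v in sorted(change.items())})
# A change score of 0 has probability 6/36 = 1/6; a change of +5 or -5 has
# probability 1/36 each, i.e., 1/18 for a change of 5 in either direction combined.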
REFERENCES 1. L. Frison and S. J. Pocock, Repeated measures in clinical trials: analysis using mean summary statistics and its implications for design. Stat. Med. 1992; 11: 1685–1704.
2. A. J. Vickers, The use of percentage change from baseline as an outcome in a controlled trial is statistically inefficient: a simulation study. BMC Med. Res. Methodol. 2001; 1: 6. 3. D. Melchart, A. Streng, A. Hoppe, et al., Acupuncture in patients with tension-type headache: randomised controlled trial. BMJ 2005; 331: 376–382. 4. A. J. Vickers, Statistical reanalysis of four recent randomized trials of acupuncture for pain using analysis of covariance. Clin. J. Pain 2004; 20: 319–323. 5. L. A. Kalish and C. B. Begg, The impact of treatment allocation procedures on nominal significance levels and bias. Control. Clin. Trials 1987; 8: 121–135. 6. D. G. Altman, Practical Statistics for Medical Research. London: Chapman and Hall, 1991. 7. A. J. Vickers, Parametric versus nonparametric statistics in the analysis of randomized trials with non-normally distributed data. BMC Med. Res. Methodol. 2005; 5: 35.
FURTHER READING L. J. Frison and S. J. Pocock, Linearly divergent treatment effects in clinical trials with repeated measures: efficient analysis using summary statistics. Stat. Med. 1997; 16: 2855–2872.
L. Frison and S. J. Pocock, Repeated measures in clinical trials: analysis using mean summary statistics and its implications for design. Stat. Med. 1992; 11: 1685–1704. Y. K. Tu, A. Blance, V. Clerehugh, and M. S. Gilthorpe, Statistical power for analyses of changes in randomized controlled trials. J. Dent. Res. 2005; 84: 283–287. A. J. Vickers, Analysis of variance is easily misapplied in the analysis of randomized trials: a critique and discussion of alternative statistical approaches. Psychosom. Med. 2005; 67: 652–655.
CROSS-REFERENCES Longitudinal data (study design Phase III) ANCOVA (methodology / analysis) Generalized estimating equations (methodology / analysis) Linear model (methodology / analysis)
CHEMISTRY, MANUFACTURING AND CONTROLS (CMC)
NORMAN R. SCHMUFF
U.S. Food and Drug Administration, Silver Spring, Maryland

DAVID T. LIN
Biologics Consulting Group, Inc., Alexandria, Virginia

1 OVERVIEW

Formerly, the U.S. Food and Drug Administration (FDA) used the term ‘‘Chemistry, Manufacturing, and Controls’’ (CMC) to refer to what other countries and regions called the ‘‘Quality’’ portion of a new drug submission. The terms are now used interchangeably at FDA. CMC/Quality requirements for INDs are spelled out briefly in the Code of Federal Regulations (CFR) (1). Note that under FDA’s Good Guidance Practices (2,3), items that are obligatory are referred to as ‘‘requirements.’’ Requirements are contained either in laws or legislation (4) (e.g., the Food, Drug and Cosmetic Act (5) or the Public Health Service Act (6) for biological products) or regulation, which, in this case, is a formal FDA-drafted elaboration and interpretation of the legislation, contained in Title 21 of the CFR. These requirements comprise what sponsors and applicants ‘‘must’’ do, whereas recommendations, which are generally contained in guidance documents, are not mandatory.

2 ADMINISTRATIVE ASPECTS

INDs for vaccines, blood products, and gene therapy are under the review jurisdiction of CBER. CDER currently has jurisdiction over small molecule drugs and biological drugs derived from natural sources, whether chemically synthesized or from recombinant DNA technology. For CDER or CBER, electronic INDs can be sent via FDA’s secure electronic gateway (7) that was established in 2007. Paper-based IND applications and electronic submissions on physical media should be addressed to the appropriate clinical division (8) and directed to the document control center at the appropriate address indicated below.
1. For drug products regulated by CDER, send the IND submission to the Central Document Room, Center for Drug Evaluation and Research, Food and Drug Administration, 5901–B Ammendale Rd., Beltsville, MD 20705–1266.
2. For biological products regulated by CDER, send the IND submission to the Central Document Room, Center for Drug Evaluation and Research, Food and Drug Administration, 5901–B Ammendale Rd., Beltsville, MD 20705–1266.
3. For biological products regulated by CBER, send the IND submission to the Document Control Center (HFM–99), Center for Biologics Evaluation and Research, Food and Drug Administration, 1401 Rockville Pike, Suite 200 N, Rockville, MD 20852–1448.
Quality aspects of these applications are assessed either by the Office of New Drug Quality Assessment (ONDQA) or the Office of Biotechnology Products (OBP); however, sponsors should rely on the clinical division as their primary contact point. ONDQA currently handles all of the traditional small molecule and hormone protein (e.g., the insulins and growth hormones) products. OBP handles most of the remaining biological products under CDER’s review jurisdiction.

3 PRE-INDS

CDER offers a Pre-Investigational New Drug Application (pre-IND) consultation program to foster early communications between sponsors and new drug review divisions and to provide guidance on the data necessary to be submitted in INDs. The review divisions are organized generally along therapeutic class and can each be contacted using the designated ‘‘Pre-IND Consultation List’’ (9).
4 QUALITY FOR IND TYPES
4.1 IND Categories
The two IND categories, both of which are standard INDs not distinguished by the IND regulations, are as follows:
• Commercial
• Research (noncommercial)
‘‘Commercial’’ INDs are filed by pharmaceutical firms intending to file subsequent New Drug Applications (NDAs)/BLAs to commercialize their product. IND filings from large research institutions, such as the National Institutes of Health, are also considered to be in this category. ‘‘Research’’ INDs may be submitted by a physician who both initiates and conducts an investigation, and under whose immediate direction the investigational drug is administered or dispensed. A physician might submit a research IND to propose studying an unapproved drug or studying an approved product for a new indication or in a new patient population. The quality data expectations for these applications are the same and are described in general in 21 CFR 312 (see Code of Federal Regulations (CFR) article).

4.2 Special IND Types
The three special IND types are as follows:
• Emergency Use
• Treatment IND
• Exploratory IND
An ‘‘Emergency Use’’ IND allows the FDA to authorize use of an experimental drug in an emergency situation that does not allow time for submission of an IND in accordance with 21 CFR 312.23 or 21 CFR 312.34. It is also used for patients who do not meet the criteria of an existing study protocol or for cases in which an approved study protocol does not exist. Typically, these INDs involve single patients. In nearly all cases for quality data, the application references other existing applications.
A ‘‘Treatment IND’’ is submitted for experimental drugs that show promise in clinical testing for serious or immediately life-threatening conditions while the final clinical work is conducted and the FDA review takes place. (For additional information, see the Treatment IND article.) In January of 2006, CDER issued ‘‘Guidance for Industry, Investigators, and Reviewers: Exploratory IND Studies.’’ An ‘‘exploratory IND study’’ is defined as a clinical trial that
• is conducted early in Phase 1,
• involves very limited human exposure, and
• has no therapeutic or diagnostic intent (e.g., screening studies or microdose studies).
The extent of CMC-related IND information is stated to be ‘‘similar to that described in current guidance for use of investigational products.’’ For CMC purposes, little difference exists between the data expectations for this type of study and those for a traditional Phase 1 study.

4.3 IND Phases
The regulations at 21 CFR 312.23(a)(7)(i) emphasize the graded nature of manufacturing and controls information. Although in each phase of the investigation sufficient information should be submitted to ensure the proper ‘‘identity, strength, quality, and purity’’ of the investigational drug, the amount of information needed to make that assurance will vary with the phase of the investigation, the proposed duration of the investigation, the dosage form, and the amount of information otherwise available.

5 QUALITY-BY-DESIGN AND INDS
As part of the Pharmaceutical Quality for the 21st Century Initiative (10), FDA is encouraging a more systematic approach to pharmaceutical development and manufacturing that generally falls under the concept of Quality-by-Design (QbD). The QbD approach suggests that product quality should be built in from the ground up by designing each
manufacturing step for optimal product performance at every stage of the development process. QbD is defined in the Step 2 ICH Annex to Q8 (11) as: A systematic approach to development that begins with predefined objectives and emphasizes product and process understanding and process control, based on sound science and quality risk management.
ONDQA Director Moheb Nasr described the QbD approach this way in his November 2006 presentation (12) at the American Association of Pharmaceutical Scientists Annual Meeting. In a Quality by Design system:
• The product is designed to meet patient needs and performance requirements
• The process is designed to consistently meet product critical quality attributes
• The impact of starting raw materials and process parameters on product quality is understood
• The process is continually monitored, evaluated, and updated to allow for consistent quality over time
• Critical sources of process variability are identified and controlled
• Appropriate control strategies are developed
As described below, much of the QbD content of an IND or NDA/BLA will be included in the P.2 Pharmaceutical Development section of the application. Although QbD is not a statutory requirement, FDA believes this approach to be beneficial to both sponsors/applicants and to the public.

6 ELECTRONIC SUBMISSIONS FOR INDS
Submissions of INDs in the electronic Common Technical Document (eCTD) format (13) have been strongly encouraged, although not required. However, as of January 1, 2008, the only acceptable electronic submissions to FDA are submissions in the eCTD format (14). Advantages to FDA include less required storage space for paper submissions, expedited routing, and simultaneous
availability to reviewers and managers. An advantage to both sponsors and the agency is the ‘‘lifecycle’’ features built into the eCTD. These features allow for an historical view of changes within a given section. It is also noteworthy that submission redundancies can be eliminated (e.g., submitting the same study report to both an IND and an NDA/BLA) because subsequent applications, both IND and NDA/BLA, simply can include links to previously submitted files. In 2007, an FDA electronic gateway (15) was established to facilitate timely submission and eliminate the need for submissions on physical media. Note that for purposes of eIND submissions (but not NDAs/BLAs), it is not necessary to comply with the ‘‘Granularity’’ Annex (16) to ICH M4. That is, it would be acceptable to submit a single file for the Module 2 Quality Overall Summary (consistent with the ‘‘Granularity’’ document), a single file for the ‘‘S’’ drug substance section, and a single file for the ‘‘P’’ drug product section. These latter two options are in conflict with the ‘‘Granularity’’ Annex but nonetheless acceptable for eINDs. 7
GENERAL CONSIDERATIONS
For the proposed use of a U.S. marketed product in unmodified form and in its original packaging, the sponsors need only to indicate this use. For products that are modified (e.g., over-encapsulated tablets), the IND should include some evidence of equivalent performance of the original to the modified product for the duration of the proposed trial. For repackaged products, data should be included to demonstrate that the product is stable for the duration of the trial. For placebo-controlled trials, component and composition data for the placebo could be included in a separate drug product section. A brief rationale should be included for why the blinding is adequate; it should address, for example, issues relating to appearance, color, and taste. Foreign-sourced comparators present special challenges. Generally, a U.S.-sourced comparator product is preferred. The use of FDA-approved drug products provides assurance of drug quality. Where this use is not
possible and local products are used, documentation should be provided to show that the drug product is comparable in quality to the U.S. product. This documentation could involve, for example, comparing impurity profiles, dissolution profiles, and content uniformity. Without an adequate showing of comparability, study results would be considered inconclusive and would likely not be adequate to support approval of a new agent or a new indication for an existing product. In certain cases, a showing of comparability might involve an in vivo bioequivalence study. For cases in which it is desired to use a foreign comparator when no U.S.-approved product is available, discussions with FDA should occur at an early development stage, before filing the IND.

8 CMC CLINICAL HOLD CONSIDERATIONS
A clinical hold is as follows: An order issued by FDA to the sponsor to delay a proposed clinical investigation or to suspend an ongoing investigation. The clinical hold order may apply to one or more of the investigations covered by an IND. When a proposed study is placed on clinical hold, subjects may not be given the investigational drug. When an ongoing study is placed on clinical hold, no new subjects may be recruited to the study and placed on the investigational drug; patients already in the study should be taken off therapy involving the investigational drug unless specifically permitted by FDA in the interest of patient safety (17)
Every effort is made by FDA to facilitate the initiation of clinical trials. When serious deficiencies are identified before the 30-day date, sponsors are generally notified by telephone or e-mail. It is especially important in this period to respond promptly and completely to FDA requests. It is advantageous to have the primary contact available during FDA working hours during this time. Any CMC deficiencies should be addressed through a formal submission to the IND. The previously mentioned FDA electronic gateway (18) was established to facilitate timely submission and reviewer availability of such IND amendments.
For all disciplines, including CMC, the review focus is on the safety of the proposed clinical trial. Consequently, issues relating to commercial manufacture are generally not pertinent. However, inadequate attention to product variability can be a safety issue and may compromise the results of a clinical trial. For ‘‘Research’’ INDs, the most common CMC reasons for a clinical hold relate to a lack of administrative information, such as a reference to an existing IND and a letter of authorization from that IND’s sponsor. Issues may also develop relating to the performance or stability of a repackaged or reformulated commercial product. Any manipulation of a sterile product is also of special concern. For ‘‘Commercial’’ INDs, quoting the Phase 1 guidance: The identification of a safety concern or insufficient data to make an evaluation of safety is the only basis for a clinical hold based on the CMC section. Reasons for concern may include, for example: 1) a product made with unknown or impure components; 2) a product possessing chemical structures of known or highly likely toxicity; 3) a product that cannot remain chemically stable throughout the testing program proposed; or 4) a product with an impurity profile indicative of a potential health hazard or an impurity profile insufficiently defined to assess a potential health hazard; or 5) a poorly characterized master or working cell bank.
Regarding impurities, it is critical that the material proposed for use in the clinical trial have an impurity profile similar to that of the material used in nonclinical studies. For example, the same drug substance lot used in nonclinical studies might be used to manufacture clinical trial lots. If impurities appear only in the drug product, then it is imperative that these, too, be appropriately qualified.

9 SUBMISSION FORMAT AND CONTENT
The content, but not the format, of the CMC section for INDs is prescribed by 21 CFR 312.23(a)(7). In 1995, CDER and CBER issued ‘‘Guidance for Industry: Content and Format of Investigational New Drug Applications (INDs) for Phase 1 Studies of Drugs,
Including Well-Characterized, Therapeutic, Biotechnology-derived Products’’ (19). This guidance predated work on the International Conference on Harmonisation (20) (ICH) topic M4, the Common Technical Document (CTD), which reached Step 4 of the ICH process in November of 2000. Although the content of the guidance is still pertinent, the submission should now be formatted according to the CTD format as described below. In May of 2003, CDER issued ‘‘Guidance for Industry: INDs for Phase 2 and Phase 3 Studies Chemistry, Manufacturing, and Controls Information’’ (21). Like the Phase 1 guidance described above, it includes no references to the CTD. Nevertheless, as for Phase 1 submissions, these, too, should be submitted in the CTD format, preferably as an eCTD.

9.1 Module 1
Module 1 contains region-specific data, largely administrative, that fall outside the scope of the ICH CTD. For eCTD applications, the format and content of Module 1 are described by FDA (22), the Japanese MHLW/PMDA (23), and the European EMEA (24). Quality-specific data for FDA should include, if appropriate, a request for categorical exclusion from, or submission of, an environmental assessment (25). Also, this module should include a copy of all labels and labeling to be provided to each investigator (26). A mock-up or printed representation of the proposed labeling that will be provided to investigator(s) in the proposed clinical trial should be submitted. Investigational labels must carry a ‘‘caution’’ statement as required by 21 CFR 312.6(a). That statement reads: ‘‘Caution: New Drug Limited by Federal (or United States) law to investigational use.’’ Although INDs are generally exempt from expiration dating, where new drug products for investigational use are to be reconstituted at the time of dispensing, their labeling must bear expiration information for the reconstituted drug product (27).

9.2 Module 2
Unlike Efficacy and Safety, Quality has a single Module 2 heading known as the ‘‘Quality Overall Summary’’ (QOS). FDA has issued
no guidance on IND expectations for the QOS or other Module 2 summaries. Consequently, the inclusion of a QOS in an IND can be considered optional. However, if submitting a QOS, it seems appropriate to limit its length to five or fewer pages. The format could be as described in M4-Q (28), or it might be abbreviated as that guidance applies to NDAs/BLAs and ANDAs. 9.3 Module 3 If a QOS is not included, then provide a summary of the application in Module 3 (29). The file should be submitted under eCTD (v 3.2) element ‘‘m3-quality.’’ Although it is not obvious from the eCTD specification, it is permissible to submit documents under this heading (or element, to use XML terminology). A look at the eCTD technical specification will reveal that this possibility also exists for many other headings with no content other than subordinate headings. As suggested in the Phase 1 guidance (30), the sponsor should state whether it believes that either 1) the chemistry of either the drug substance or the drug product, or 2) the manufacturing of either the drug substance or the drug product, presents any signals of potential human risk. If so, then these signals of potential risks should be discussed, and the steps proposed to monitor for such risks should be described, or the reasons why the signals should be dismissed should be discussed. In addition, sponsors should describe any chemistry and manufacturing differences between the drug substance/product proposed for clinical use and the drug substance/product used in the animal toxicology trials that formed the basis for the sponsor’s conclusion that it was safe to proceed with the proposed clinical study. How these differences might affect the safety profile of the drug product should also be discussed. If no differences are found, then this fact should be stated explicitly. 10
S DRUG SUBSTANCE
A short summary, a few paragraphs in length, describing the drug substance should be provided. If the IND proposes a U.S.-approved
drug, then the entire drug substance section can be omitted. If it is proposed to use a U.S.-approved drug manufactured by an alternate supplier, then complete CMC information should be included.

10.1 General Information P1
(Sections marked P1 should be included in submissions for Phase 1 studies.)

10.1.1 Nomenclature P1. Provide all names (including, e.g., company code, INN, and USAN) by which the substance has been known. Pick one of these names and use it consistently throughout the application. If legacy reports use alternate names, then it will be helpful to the reviewer if you can modify these reports so that a single standard name is used across the application. Follow a similar practice for naming impurities. Also, include a systematic name, preferably the name used by the Chemical Abstracts Service (CAS). Include the CAS Registry Number (CAS RN) for the drug and any relevant salt forms.

10.1.2 Structure P1. Provide a legible graphical structure and clearly indicate any relevant stereochemical features. Include the molecular mass. For biotechnological materials, provide a schematic amino acid sequence and indicate secondary and tertiary structure, if known, and sites of posttranslational modifications. It may be helpful to provide a machine-readable representation of the structure, such as a SMILES (31) string, MDL MOLFILE (32), or InChI (33).

10.1.3 General Properties P1. Include melting point and other physico-chemical properties, such as log P and a pH solubility profile. Indicate which polymorph is proposed for use, as appropriate. Also, include a short summary of biological activity.
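As an illustration of the machine-readable representations mentioned in section 10.1.2, the sketch below converts a SMILES string into a MOLFILE block, an InChI, an InChIKey, and a molecular mass. It assumes the open-source RDKit toolkit, which is not named in any guidance and is used here only as a convenient example; aspirin is used as a publicly known structure.

from rdkit import Chem
from rdkit.Chem import Descriptors

# Aspirin (acetylsalicylic acid), used here purely as an illustrative structure.
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
mol = Chem.MolFromSmiles(smiles)

print(Chem.MolToMolBlock(mol))           # MDL MOLFILE representation
print(Chem.MolToInchi(mol))              # IUPAC InChI
print(Chem.MolToInchiKey(mol))           # hashed InChIKey, convenient for registry lookups
print(round(Descriptors.MolWt(mol), 2))  # molecular mass (g/mol)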
10.2 Manufacture P1
After the initial IND submission, if changes have been made in manufacturing sites or processes, then provide a short summary in the following sections. Amend the IND only if such changes involve lots used for stability studies or preclinical or clinical testing.

10.2.1 Manufacturer(s) P1. Provide the complete name and street address for the drug substance manufacturer and its FDA Establishment Identifier number (FEI) and/or Central File Number (CFN), if known.

10.2.2 Description of Manufacturing Process and Process Controls P1. A brief narrative description should be included along with a detailed flow diagram. The latter should indicate weights, yields, solvents, reagents, in-process controls, and isolated intermediates. Critical steps should be identified, if known. Include milling or size-reduction steps. More information may be needed to assess the safety of biotechnology-derived drugs or drugs extracted from human or animal sources. Such information includes the cell bank, cell culture, harvest, purification and modification reactions, and filling, storage, and shipping conditions. The general description of the synthetic and manufacturing process (e.g., fermentation and purification) provided to support the Phase 1 studies should be updated from a safety perspective if changes or modifications have been introduced. Reprocessing procedures and controls need not be described except for natural sourced drug substances when the steps affect safety (e.g., virus or impurity clearance). For sterile drug substances, updates on the Phase 1 manufacturing process should be submitted. The Phase 2 information should include changes in the drug substance sterilization process (e.g., terminal sterilization to aseptic processing). Information related to the validation of the sterilization process need not be submitted at this time.
10.2.3 Control of MaterialsP1 . Indicate the quality standards for the starting material, solvents, and reagents. For fermentation products or natural substances extracted from plant, human, or animal sources, the following information should be provided in Phase 1 submissions: (1) origin (e.g., country), source (e.g., pancreas), and taxonomy (e.g., family, genus, species, and variety) of the starting materials or strain of the microorganism, (2) details of appropriate screening procedures for adventitious agents, if relevant, and (3) information to support the
safe use of any materials of microbial, plant, human, or animal origin (e.g., certification, screening, and testing). Any updates to the information submitted in Phase 1 and any new information to support the safety of materials of human or animal origin should be provided in later submissions.

10.2.4 Controls of Critical Steps and Intermediates. The general strategy for identifying critical steps and the nature of controls for these steps, if identified, should be discussed briefly. For isolated intermediates, provide a tentative specification. To the extent possible in Phase 2, sponsors should provide information on controls of critical steps and intermediates and tentative acceptance criteria to ensure that the manufacturing process is controlled at predetermined points. Although controls of critical steps and intermediates still can be in development, information on controls for monitoring adventitious agents should be provided for fermentation and natural sourced (human or animal) drug substances, as appropriate.

10.2.5 Process Validation and/or Evaluation P1. Generally this section is not needed in an IND submission. However, it might be appropriate, for example, to describe criteria used to assess clearance of adventitious agents.

10.2.6 Manufacturing Process Development. Although only safety-related aspects need to be reported for Phase 1 studies, this section provides an opportunity for a brief description of the sponsor’s ‘‘Quality-by-Design’’ approach.

10.3 Characterization P1
10.3.1 Elucidation of Structure and Other Characteristics. Evidence to support the proposed chemical structure of the drug should be submitted. It is understood that the amount of structure information will be limited in the early stage of drug development. Data should be summarized in tabular form and indicate, for example, peak assignments and coupling constants for 1H NMR spectra. Infrared and 1H NMR spectra should be submitted. For the latter, provide expansions of regions of interest. For biotechnological
materials, information should be provided on the primary, secondary, and higher-order structure. In addition, information on biological activity, purity, and immunochemical properties (when relevant) should be included. 10.3.2 ImpuritiesP1. This section should provide an HPLC chromatogram that clearly indicates impurities and should provide structural identification when possible. If structures have not been assigned, then spectra and/or spectral data should be included. Although Q3A does not apply to drug substances for investigational studies, it is advisable to follow its general principles in that thresholds should be established for reporting, identification, and qualification. Subsequently, new impurities (e.g., from a change in synthetic pathway) should be qualified, quantified, and reported, as appropriate. Procedures to evaluate impurities to support an NDA/BLA (e.g., recommended identification levels) may not be practical at this point in drug development. Suitable limits should be established based on manufacturing experience, stability data, and safety considerations.
10.4 Control of Drug SubstanceP1
10.4.1 SpecificationP1. ICH Q6A (34) and Q6B (35) define specification as "... a list of tests, references to analytical procedures, and appropriate acceptance criteria which are numerical limits, ranges, or other criteria for the tests described." A tentative specification should include the "Universal Tests/Criteria" defined in Q6A as well as the appropriate "Specific Tests/Criteria." An individual test and acceptance criterion should be in place for impurities above an identification threshold defined by the sponsor. The specification should also include an acceptance criterion for total impurities. 10.4.2 Analytical ProceduresP1. Analytical procedures used to assure the identity, strength, quality, and purity of the drug substance must be included. Except for compendial procedures, a general description of each test method should be provided.
10.4.3 Validation of Analytical Procedures. Although data should support that the analytical procedures are suitable for their intended purpose, validation data ordinarily need not be submitted at the initial stage of drug development. However, for some well-characterized, therapeutic biotechnology-derived products, validation data may be needed in certain circumstances to ensure safety in Phase 1. For Phase 2/3 studies, the analytical procedure (e.g., high-pressure liquid chromatography) used to perform a test and to support the tentative acceptance criteria should be described briefly, and changes should be reported when they are such that an update of the brief description is warranted. A complete description of analytical procedures and appropriate validation data should be available for analytical procedures that are not from an FDA-recognized standard reference (e.g., an official compendium or the AOAC International Book of Methods), and this information should be submitted on request. 10.4.4 Batch AnalysesP1. Batch analyses for all relevant batches should be included. Where possible, test results should be given as numeric values, not just "pass" or "fail." The use of each batch should be indicated, with a clear indication of whether the proposed clinical trial batch is the same as that used in preclinical testing. 10.4.5 Justification of Specification. Justification of the interim specification should rely on all relevant data, including, for example, batches used in preclinical testing and development.
10.5 Reference Standards or MaterialsP1
Some information should be provided on reference standards. For biological and biotechnological products, if the working reference standard is based on a national or international standard (e.g., World Health Organization), then information should be provided on that reference standard.
10.6 Container Closure SystemP1
A brief general description of the container closure system should be provided. Include,
if known, whether the components in direct contact with the drug substance comply with the indirect food additive regulations in 21 CFR 174.5. 10.7 StabilityP1 Sufficient information should be submitted to support the stability of the drug substance during the toxicologic studies and the proposed clinical studies (see stability analysis and stability study design chapters). 10.7.1 Stability Summary and ConclusionsP1 . Summarize the data from section S.7.3. 10.7.2 Post-approval Stability Protocol and Stability Commitment. This section is not appropriate to an IND and can be deleted. Note that in the eCTD, most sections (elements) are optional. If sections are deleted, then it is important, however, not to change any of the element names that include the section numbers (i.e., if a section like this one is not included, then do not renumber subsequent sections). 10.7.3 Stability DataP1 . Provide the stability data in tabular format. A graphical display of the data can be helpful. 11 DRUG PRODUCT Provide a short drug product summary here. A separate Drug Product section should be provided for any comparator products. If a placebo is used in the trial, then it, too, can be in a separate Drug Product section, but to facilitate comparisons, it is preferable to include this information in the same section as the drug product. If the IND is for a U.S.-approved drug, and the drug product is used in unmodified form in its original container, then this situation should be clearly indicated. If the product is modified (e.g., the marketed tablet is over-encapsulated) or packaged in something other than the marketed container, then information assuring equivalent product performance should be included in the appropriate section.
11.1 Description and Composition of the Drug ProductP1
A description of the drug product and its composition should be provided. The information provided should include, for example:
• Description of the dosage form,
• Composition, for example, a list of all components of the dosage form and their amount on a per-unit basis (including overages, if any), the function of the components, and a reference to their quality standards (e.g., compendial monographs or manufacturer's specifications),
• Description of accompanying reconstitution diluent(s), and
• Type of container and closure used for the drug product and accompanying reconstitution diluent, if applicable.
Note that this section includes the components and composition of the unit dose but not the batch formula, which is in section P.3.2.
11.2 Pharmaceutical Development
This section provides an opportunity to describe Quality-by-Design efforts that may have been built into the drug product. The expectations for a Phase 1 IND are quite modest, and most information indicated in M4Q (R1) can be considered optional. For more extensive P.2 guidance, consult the ICH Q8 "Pharmaceutical Development" guidance (36). Note, however, that the foregoing documents strictly apply to NDAs and BLAs.
11.2.1 Components of the Drug Product
11.2.1.1 Drug Substance. At a minimum, the compatibility of the drug substance with the excipients listed in P.1 should be discussed. Likewise, for fixed-dose products containing more than one active ingredient, the compatibility of the drug substances with each other should be addressed.
11.2.1.2 Excipients. The choice of excipients listed in P.1, their concentration, and their characteristics that can influence drug product performance should be discussed relative to their respective functions.
11.2.2 Drug Product 11.2.2.1 Formulation Development A brief summary that describes the development of the drug product should be provided, taking into consideration the proposed route of administration and usage. 11.2.2.2 Overages Any overages in the formulation(s) described in P.1 should be justified. 11.2.2.3 Physico-Chemical and Biological Properties Parameters relevant to the performance of the drug product, such as pH, ionic strength, dissolution, redispersion, reconstitution, particle size distribution, aggregation, polymorphism, rheological properties, biological activity or potency, and/or immunological activity, should be addressed. 11.2.3 Manufacturing Process Development. A general discussion of the approach to process development could be included here, with any specifics on what has been done so far, and what is planned. 11.2.4 Container Closure System. A brief description of the suitability of the container closure for its intended purpose could be discussed, considering, for example, choice of materials, protection from moisture and light, compatibility of the materials of construction with the dosage form (including sorption to container and leaching), safety of materials of construction, and performance (such as reproducibility of the dose delivery from the device when presented as part of the drug product). 11.2.5 Microbiological Attributes. Where appropriate, the microbiological attributes of the dosage form should be discussed, as well as the selection and effectiveness of preservative systems in products that contain antimicrobial preservatives. If applicable, then include the rationale for not performing microbial limits testing or not including preservatives in, for example, multiple-use liquid products. 11.2.6 Compatibility. The compatibility of the drug product with reconstitution diluents or dosage devices (e.g., precipitation of drug substance in solution, sorption on injection vessels, and stability) should be addressed.
11.3 ManufactureP1
11.3.1 Manufacturer(s)P1. The name, address, and responsibility of each manufacturer, including contractors, and each proposed production site or facility involved in manufacturing and testing should be provided, as well as the Central File Number (CFN)/Firm Establishment Identifier (FEI) identifying number. 11.3.2 Batch FormulaP1. A batch formula should be provided that includes a list of all components of the dosage form to be used in the manufacturing process, their amounts on a per-batch basis (including overages), and a reference to their quality standards. 11.3.3 Description of Manufacturing Process and Process ControlsP1. A diagrammatic presentation and a brief written description of the manufacturing process should be submitted, including the sterilization process for sterile products. Flow diagrams are suggested as the usual, most effective presentation of this information. If known, critical steps could be indicated. 11.3.4 Controls of Critical Steps and Intermediates. If critical steps are identified in P.3.3, then the corresponding controls should be briefly identified. 11.3.5 Process Validation and/or Evaluation. Although process validation is not expected for drug product manufactured during the IND stages of development, information appropriate to the development stage should be available to demonstrate that each process step accomplishes its intended purpose.
11.4 Control of ExcipientsP1
Generally, a reference to the current USP is adequate in this section. For atypical use of compendial excipients (e.g., lactose in a dry-powder inhaler), control of other attributes may be appropriate. Quality standards for noncompendial excipients should be described briefly. If subsections of P.4 are not needed, then the corresponding headings can be deleted.
11.4.1 Specifications. This section need only be included if appropriate, as described above. 11.4.2 Analytical Procedures. This section need only be included if appropriate, as described above. 11.4.3 Validation of Analytical Procedures. This section need only be included if appropriate, as described above. 11.4.4 Justification of Specifications. This section need only be included if appropriate, as described above. 11.4.5 Excipients of Human or Animal OriginP1 . For excipients of human or animal origin, information should be provided regarding adventitious agents (e.g., sources, specifications; description of the testing performed; and viral safety data). 11.4.6 Novel ExcipientsP1 . For excipients used for the first time in a drug product or by a new route of administration, details of manufacture, characterization, and controls, with cross references to support safety data (nonclinical and/or clinical), should be provided directly in the IND or by reference to a DMF, another IND, or a NDA/BLA. 11.5 Control of Drug ProductP1 11.5.1 Specification(s)P1 . The ICH Q6A defines specification as ‘‘ . . . a list of tests, references to analytical procedures, and appropriate acceptance criteria which are numerical limits, ranges, or other criteria for the tests described.’’ A tentative specification should include the ‘‘Universal Tests / Criteria’’ defined in Q6A as well as the appropriate ‘‘Specific Tests / Criteria.’’ An individual test and acceptance criteria should be in place for degradation products above an identification threshold defined by the sponsor. The specification should also include an acceptance criterion for total impurities. 11.5.2 Analytical ProceduresP1 . The analytical procedures used for testing the drug product should be described briefly.
11.5.3 Validation of Analytical Procedures. Validation data ordinarily need not be submitted at the initial stage of drug development. However, sponsors should have data in hand to demonstrate that the procedures are suitable for their intended use. For some well-characterized, therapeutic biotechnology-derived products, method qualification (37) data may be needed to ensure safety in Phase 1. 11.5.4 Batch AnalysesP1. A description of batches, their use (including study numbers), and results of batch analyses should be provided. The batch proposed for clinical use should be clearly identified. 11.5.5 Characterization of Impurities. Information on the characterization of impurities should be provided, if not previously provided in "S.3.2 Impurities." 11.5.6 Justification of Specification. Justification for the proposed interim drug product specification should be provided. It is understood that the specification will evolve during the drug's development. For degradants, although ICH Q3B does not apply to drug products for investigational studies, it is advisable to follow its general principles in that thresholds should be established for reporting, identification, and qualification. As with Q3B, acceptance criteria should be set for individual degradants that are present above the identification threshold. A general acceptance criterion for unspecified impurities should be set at less than or equal to the identification threshold.
11.6 Reference Standards or MaterialsP1
Some information should be provided on reference standards. For biological and biotechnological products, if the working reference standard is based on a national or international standard (e.g., World Health Organization), then information should be provided on that reference standard.
11.7 Container Closure SystemP1
The container closure system is defined as all packaging components that together contain and protect the product. A brief description of the components in the container closure system should be provided. Additional information may be requested for atypical delivery systems such as metered dose inhalers and disposable injection devices. Include, if known, whether the components in direct contact with the drug substance comply with the indirect food additive regulations in 21 CFR 174.5.
11.8 StabilityP1
(See also stability analysis and stability study design chapters.) 11.8.1 Stability Summary and ConclusionP1. Summarize the data provided in section P.8.3. Note that although expiry dating of IND products is not required, it is necessary to obtain sufficient data to support the stability of the product for the duration of the clinical trial. However, where new drug products for investigational use are to be reconstituted at the time of dispensing, their labeling must bear expiration information for the reconstituted drug product. 11.8.2 Post-approval Stability Protocol and Stability Commitment. This section is not appropriate for an IND. 11.8.3 Stability DataP1. For Phase 1, provide available stability data in tabular format. For Phase 2 and 3, provide data on the clinical product used in Phase 1 and 2, respectively, in tabular format.
12 MEETINGS AND OTHER COMMUNICATIONS
Both CFR sections (38) and FDA guidance (39) relate to meetings. Firms have generally been encouraged to hold CMC-specific meetings at the end of Phase 2 (EOP2). However, for pre-IND questions, a preference is expressed for written questions, which will receive written answers. Often, questions can be clarified by reference to existing guidance documents. Sometimes a follow-up telephone conference can clear up any unresolved issues. For matters that require extensive discussion, a meeting may be appropriate.
13 GMPs FOR CLINICAL TRIALS
A direct final rule exempting most Phase 1 products, including biological products, from the GMP regulations (21 CFR 211) was issued on January 17, 2006 (40). A proposed rule with language similar to the direct final rule and a companion draft guidance (41) were issued on the same date. These documents provided FDA's recommendations on approaches to complying with current good manufacturing practice as required under section 501(a)(2)(B) of the Federal Food, Drug, and Cosmetic Act (FD&C Act). Although the direct final rule exempting Phase 1 products from the GMP regulations was withdrawn on May 2, 2006 (42), the approaches in the proposed rule and the draft guidance can be used until both documents are finalized.
14 OTHER REGIONS
Recently, extensive regulations (43) relating to clinical trials have been put in place in the European Union. One quality-specific aspect is that Directive 2001/83/EC (44) requires that a "qualified person" provide certain oversight functions, such as certifying individual batch release. Another recent European Union guideline (45), EMEA/CHMP/SWP/28367/07, entitled "Guideline on Strategies to Identify and Mitigate Risks for First-in-Human Clinical Trials with Investigational Medicinal Products," includes a section (4.2) on mitigating quality-associated risks in clinical trials. The Japanese Pharmaceuticals and Medical Devices Agency (PMDA) provides information on clinical trials on its English-language website (46). The United Kingdom's MHRA has a web page devoted to clinical trials (47). The website of the Canada Gazette, the official newspaper of the Government of Canada, also has a page devoted to clinical trials (48).
REFERENCES
1. Code of Federal Regulations (CFR). Available: http://www.gpoaccess.gov/cfr/index.html.
2. Good Guidance Practices 21 CFR 10.115. Available: http://www.accessdata.fda.gov/ scripts/cdrh/cfdocs/cfCFR/CFRSearch.cfm? fr=10.115. 3. Federal Register Notice on Administrative Practices and Procedures; Good Guidance Practices. Available: http://www.fda. gov/OHRMS/DOCKETS/98fr/091900a.pdf. 4. Laws Enforced by the FDA and Related Statutes. Available: http://www.fda.gov/ opacom/laws/. 5. Federal Food, Drug, and Cosmetic Act. Available: http://www.fda.gov/opacom/laws/ fdcact/fdctoc.htm. 6. Public Health Service Act. Available: http:// www.fda.gov/opacom/laws/phsvcact/phsvcact. htm. 7. FDA Electronic Submissions Gateway. Available: http://www.fda.gov/esg/. 8. CDER Reference Guide. Available: http:// www.fda.gov/cder/directories/reference guide.htm. 9. Pre-IND Consultation Contacts. Available: http://www.fda.gov/cder/ode4/preind/PreINDConsultationList.pdf. 10. FDA’s Pharmaceutical Quality for the 21st Century. Available: http://www.fda.gov/ oc/cgmp/. 11. Pharmaceutical Development: Annex to Q8. Available: http://www.ich.org/cache/compo/ 363-272-1.html#Q8. 12. Office of New Drug Quality Assessment (ONDQA): Presentations. Available: http:// www.fda.gov/cder/Offices/ONDQA/presentations.htm. 13. eCTD Specification and Related Files. Available: http://estri.ich.org/ectd. 14. Memo regarding Docket Notice on eCTDs as the only acceptable electronic submission format. Available: http://www.fda.gov/ ohrms/dockets/dockets/92s0251/92s-0251m000034-vol1.pdf. 15. FDA Electronic Submissions Gateway. Available: http://www.fda.gov/esg/. 16. Guidance for Industry Granularity Document Annex to M4: Organization of the CTD. Available: http://www.fda.gov/cder/guidance/ 7042fnl.htm. 17. INDs: Clinical holds and requests for modification. Available: http://www.accessdata.fda. gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm? fr=312.42. 18. FDA Electronic Submissions Gateway. Available: http://www.fda.gov/esg/.
19. Guidance for Industry: Content and Format of Investigational New Drug Applications (INDs) for Phase 1 Studies of Drugs, Including Well-Characterized, Therapeutic, Biotechnology-derived Products, Nov 1995. Available: http://www.fda.gov/cder/guidance/clin2.pdf.
20. International Conference on Harmonisation. Available: http://www.ich.org.
21. Guidance for Industry: INDs for Phase 2 and Phase 3 Studies Chemistry, Manufacturing, and Controls Information. Available: http://www.fda.gov/cder/guidance/3619fnl.pdf.
22. The eCTD Backbone Files Specification for Module 1. Available: http://www.fda.gov/cder/regulatory/ersr/Module1Spec.pdf.
23. Pharmaceuticals and Medical Devices Agency [Japan]. Available: http://www.pmda.go.jp. In the search box on the home page, enter as a search term "0527004." This retrieves eCTD-related PDF files, including the Module 1 Schema: http://www.pmda.go.jp/ich/m/m4_ectd_toriatsukai_04_5_27.pdf.
24. Telematic Implementation Group for Electronic Submission and ICH Implementation. Available: http://esubmission.emea.eu.int/tiges.
25. a) IND content and format 21 CFR 312.23(a)(7)(iv)(e). Available: http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfCFR/CFRSearch.cfm?fr=312.23. b) Guidance for Industry: Environmental Assessment of Human Drug and Biologics Applications, July 1998. Available: http://www.fda.gov/cder/guidance/1730fnl.pdf.
26. IND content and format 21 CFR 312.23(a)(7)(iv)(d). Available: http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfCFR/CFRSearch.cfm?fr=312.23.
27. CGMPs, Packaging and Labeling Control, Expiration Dating. Available: http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfCFR/CFRSearch.cfm?fr=211.137.
28. a) Guidance for Industry: M4Q: The CTD—Quality. Available: http://www.fda.gov/cder/guidance/4539Q.PDF. b) Guidance for Industry: M4: The CTD—Quality Questions and Answers/Location Issues. Available: http://www.fda.gov/cder/guidance/5951fnl.pdf.
29. In Module 3, the information that should be included to support a Phase 1 IND submission is noted by a superscript P1.
30. Guidance for Industry: Content and Format of Investigational New Drug Applications (INDs) for Phase 1 Studies of Drugs, Including Well-Characterized, Therapeutic, Biotechnology-derived Products, Nov 1995. Available: http://www.fda.gov/cder/guidance/clin2.pdf.
31. Simplified Molecular Input Line Entry Specification. Available: http://en.wikipedia.org/wiki/Simplified_molecular_input_line_entry_specification.
32. Connection Table File (MDL MOLFILE). Available: http://en.wikipedia.org/wiki/MDL_Molfile.
33. International Chemical Identifier. Available: http://en.wikipedia.org/wiki/International_Chemical_Identifier. The IUPAC International Chemical Identifier. Available: http://iupac.org/inchi/index.html.
34. International Conference on Harmonisation; Guidance on Q6A Specifications: Test Procedures and Acceptance Criteria for New Drug Substances and New Drug Products: Chemical Substances. Available: http://www.fda.gov/OHRMS/DOCKETS/98fr/122900d.pdf.
35. Guidance for Industry Q6B Specifications: Test Procedures and Acceptance Criteria for Biotechnological/Biological Products. Available: http://www.fda.gov/cder/guidance/Q6Bfnl.PDF.
36. Guidance for Industry Q8 Pharmaceutical Development. Available: http://www.fda.gov/cder/guidance/6746fnl.pdf.
37. N. Ritter, S. J. Advant, J. Hennessey, H. Simmerman, J. McEntire, A. Mire-Sluis, and C. Joneckis, What is test method qualification? Bioprocess Intern. 2004; 2(8): 32–46. Available: http://www.bioprocessintl.com/default.asp?page=articles&issue=9%2F1%2F2004 (site requires registration).
38. 21 CFR 312.47 Investigational New Drug Application: Meetings. Available: http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfCFR/CFRSearch.cfm?fr=312.47.
39. a) Guidance for Industry: IND Meetings for Human Drugs and Biologics Chemistry, Manufacturing, and Controls Information, May 2001. Available: http://www.fda.gov/cder/guidance/3683fnl.pdf. b) Guidance for Industry: Formal Meetings With Sponsors and Applicants for PDUFA Products. Available: http://www.fda.gov/cder/guidance/2125fnl.pdf.
40. Current Good Manufacturing Practice Regulation and Investigational New Drugs. Federal Register. 2006; 71(10): 2458–2462.
41. Guidance for Industry: INDs—Approaches to Complying with CGMP During Phase 1. Available: http://www.fda.gov/cder/guidance/6164dft.pdf.
42. Current Good Manufacturing Practice Regulation and Investigational New Drugs; Withdrawal. Federal Register. 2006; 71(84): 25747.
43. EUDRALEX Volume 10—Clinical trials: Medicinal Products for human use in clinical trials (investigational medicinal products). Available: http://ec.europa.eu/enterprise/pharmaceuticals/eudralex/homev10.htm.
44. Directive 2001/83/EC of the European Parliament and of the Council of 6 November 2001 on the Community Code Relating to Medicinal Products for Human Use. Available: http://ec.europa.eu/enterprise/pharmaceuticals/eudralex/vol-1/consol_2004/human_code.pdf.
45. Guideline on Strategies to Identify and Mitigate Risks for First-in-Human Clinical Trials with Investigational Medicinal Products. Available: http://www.emea.europa.eu/pdfs/human/swp/2836707enfin.pdf.
46. Pharmaceuticals and Medical Devices Agency: Clinical Trial Related Operations. Available: http://www.pmda.go.jp/english/clinical.html.
47. MHRA: Clinical Trials for Medicinal Products. Available: http://www.mhra.gov.uk/home/idcplg?IdcService=SS_GET_PAGE&nodeId=101.
48. Canada Gazette: Regulations Amending the Food and Drug Regulations (1024–Clinical Trials). Available: http://canadagazette.gc.ca/partII/2001/20010620/html/sor203-e.html.
CITIZEN PETITION
Anyone may request or petition the U.S. Food and Drug Administration (FDA) to change or create an Agency policy or regulation under 21 Code of Federal Regulations (CFR) Part 10.30. Requests should be directed to FDA's Dockets Management Branch. When submitting a petition, keep these points in mind:
• Clearly state what problem you think the Agency needs to address.
• Propose specifically what the Agency's action should be. Your proposal should be based on sound, supportable facts.
• Submit the petition (an original and three (3) copies), unless otherwise stipulated in the Federal Register announcement, to:
Food and Drug Administration
Dockets Management Branch
Room 1-23
12420 Parklawn Drive
Rockville, MD 20857
827-6860
The FDA carefully considers every petition and must respond within 180 days by either approving or denying it (in whole or in part), or by providing a tentative response indicating why the FDA has been unable to reach a decision. All petitions will be subject to public examination and copying as governed by the rules in 21 CFR 10.20(j). If the FDA approves the petition, it may be published in the Federal Register. Any petition could eventually be incorporated into Agency policy.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/ora/fed_state/Small_business/sb_guide/petit.html) by Ralph D'Agostino and Sarah Karl.
CLINICAL DATA COORDINATION
CHERYL KIOUS

1 INTRODUCTION
The goal of clinical research trials is to generate data that prove the safety and effectiveness of a drug or device, leading to marketing approval by a regulatory authority. The acquisition, validation, and integration of these clinical data, known as clinical data management (CDM), are an integral component of the drug development process. CDM is the process of preparing the clinical data for statistical analysis; the conclusions from this analysis form the basis of a regulatory submission for approval and subsequent marketing. The discipline of CDM includes case report form design, clinical study database design and programming, data entry, data validation, coding, quality control, database finalization, and data archiving. CDM begins in the study design phase and ends when the clinical database is finalized and the data are considered ready for statistical analysis. CDM processes are greatly enhanced through the development of a comprehensive data management plan, adherence to good CDM practices, and by ensuring that quality steps are built into all stages of data handling. CDM is a comprehensive series of activities performed by a diverse team of people, ranging from data entry operators, database programmers, and clinical data coordinators to coding specialists. The core of the CDM team is the lead data manager (LDM), who serves as the lead clinical data coordinator and primary CDM contact for a clinical research study. The LDM is responsible for developing the data management plan, providing direction and leadership to the data management team, coordinating the data management activities, and ensuring that a high-quality database is provided for subsequent statistical data analysis. This article provides an overview of the role of the LDM, including best practices and considerations in the CDM process. The CDM process is responsible for taking the raw data provided by investigative sites and converting it into a high-quality database "...enabling the data to be efficiently analysed and displayed" (1, p. 69). The processes outlined here assume a regulatory study (i.e., a study intended for submission to a regulatory agency for approval of a drug or device) using a paper-based case report form. Data management activities are divided into the following phases:
• study initiation
• study conduct
• study closure
Best practices for the LDM to follow will be outlined for each data management phase.

2 STUDY INITIATION
Careful planning is one of the most important activities in the study initiation phase. "CDM need to ensure that they have the specific procedures and systems in place to handle the data when they arrive" (1, p. 73). It is important to set up good lines of communication within CDM and with other functional groups to allow for optimal data handling. Even with good planning, unexpected things happen during the course of the study. "If we plan for what is expected it will give us more time to devote to the unexpected and unplanned events" (1, p. 86). Good planning processes can facilitate the collection, processing, and reporting of high-quality data in a timely manner (1, p. 87). The following activities occur in the study initiation phase:
• case report form design
• development of the data management plan
• design study database and program edit checks

2.1 Case Report Form Design
A well-designed case report form (CRF) can facilitate accurate data collection, analysis, and reporting. "Traditionally CRFs have been very extensive, capturing many thousands of
data points of which only 50-60% may actually be used. Quantity versus quality needs to be reconciled and only essential data collected and computerized’’ (1, p. 82). Data are collected for clinical studies for two primary purposes: (1) ‘‘To provide the answer to the question posed by the protocol’’ and (2) ‘‘to provide the required safety data regarding the drug being studied’’ (1, p. 134). A CRF should be designed to accomplish three basic purposes: the collection of raw data, ease in monitoring or auditing, and to facilitate processing of the data. It is challenging to optimally meet all of these purposes; compromises may need to be made. In that case, the investigative site’s perspective should take precedence over other groups’ needs (2). For detailed information and example CRFs, see Reference 2. The principles outlined below should be incorporated during the CRF development phase. 1. Document the procedures used for the creation, approval, and version control of the CRF. 2. Design the CRF to collect only the key safety and efficacy data specified in the protocol. The CRF can be drafted early in the protocol development phase because most of the data to be collected are known before the protocol is finalized (2). The CRF design process can help ‘‘. . .identify attributes of a protocol that are inappropriate or infeasible, requiring additional protocol development’’ (2, p. 4). The statistical analysis plan (SAP) can provide guidance on what data should be collected. If the SAP is not available during the CRF design phase, the statistical methods section of the protocol is a good reference to identify what data to include when developing the CRF. 3. Use standard forms whenever possible to facilitate integration of data across studies, such as the standards defined by the Clinical Data Interchange Standards Consortium (CDISC) Operational Data Model (ODM). Database setup can be achieved more efficiently and reliably with the use of standard CRF modules (1).
4. Obtain cross-functional input on the draft CRF. Representatives from clinical, biostatistics, and CDM programming departments should review the CRF to ensure it is adequate from their respective group’s view. Depending on the type of study and data being collected, other reviewers may need to be included, such as investigators and subject matter consultant experts. 5. If using a validated data collection instrument, such as rating scales or questionnaires, it should not be modified without consulting the author to ensure the validity of the tool is not compromised by the change (3). 6. Avoid collecting duplicate data whenever possible. If redundancy is needed to validate data, ensure the duplicate data are obtained through independent measurements (3). In some cases, some data may need to be collected to confirm other data, but ‘‘. . .it is essential that no one piece of data is asked for twice, either directly or indirectly’’ (1). 7. Include the investigator’s signature in the CRF to document the investigator’s review and approval of the recorded data. 8. Finalize the CRF in conjunction with the finalization of the protocol, ensuring each reviewer’s feedback is incorporated during each review cycle. Document approval of key reviewers on a CRF approval form, noting the version information on the approval form. 9. Ensure CRF completion instructions and training are provided to site personnel. All efforts should be made to ensure the CRF is available to the site before the first subject is enrolled in the study.
2.2 Development of the Data Management Plan A Data Management Plan (DMP) is a comprehensive document developed by the LDM to describe the data handling and data cleaning processes to be used during the study. The content for each section of the DMP is
suggested below or described in the corresponding section that follows. Each component of the DMP should have input, review, and approval from appropriate data management and cross-functional team members. The DMP should be created at the beginning of the study and updated to reflect any process decisions or changes as needed throughout the study. Each document in the DMP should, at a minimum, contain the following:
• document title
• version number or date
• protocol identification
• company identification (and sponsor name, if the study is managed by a contract research organization)
• author, reviewer, and authorizer names
• distribution list
• change history section

The following major components are recommended for the DMP:
• data management personnel: this section includes a list of key CDM team members, such as the LDM, database programmer, coding lead, and data entry lead
• data management status reports: this section includes samples of the status reports to be provided, the frequency, and the distribution list
• database design (see section below)
• data tracking and entry (see section below)
• data validation (see section below)
• coding (see section below)
• external data processing (see section below)
• quality control (see section below)
• serious adverse event (SAE) reconciliation (see section below)
• data transfer: if applicable, this section includes specifications for the transfer of data, such as from a CRO to the sponsor
• data handling report (see section below)
• standard operating procedures (SOPs): this section includes a list of applicable SOPs, including the dates they became effective for the study

The relevant section of the DMP should be finalized before the corresponding activities begin. For example, data entry should not begin until the Data Tracking and Entry sections of the DMP are finalized. The foundation of the DMP is based on SOPs. SOPs identify what procedures are needed and why, when they occur, how procedures are implemented, and the roles or positions responsible for each procedure (1). The SOPs provide the core framework on which study-specific processes are documented in detail in the DMP.

2.3 Design Study Database and Program Edit Checks
The focus of this section is the design and validation of a project-specific clinical database using a validated clinical data management system (CDMS). It does not address the requirements for software application development and validation or installation and functional-level testing of a packaged CDM application. For further information, see References 4 and 5, which address computerized systems in clinical trials and provide information on how "...expectations and regulatory requirements regarding data quality might be satisfied where computerized systems are being used to create, modify, maintain, archive, retrieve, or transmit clinical data" (5). Ensure that adequate procedures, as specified in an SOP, are in place to control user accounts and access within the CDMS. Granting and revoking of study access for each user should be documented in the data management study file. Access should be revoked for users who no longer need access to the study (e.g., no longer working on the study, resignation from the company). In the design of a project-specific clinical database, the database programmer should follow good programming practices, such as use of a header block, completion of informal development testing, and use of sound programming code. Whenever possible, use
standard data structures, such as CDISC ODM (6). Input from clinical and biostatistics are important during the database design phase ‘‘. . .as the more thought at the start about what is required at the end will result in a study that can be processed efficiently and effectively. It must be remembered that the computer is a tool and that the output from it is only as good as the ‘human’ instructions it receives’’ (1). A database should not be released into production until the study protocol and CRF are finalized. Database design activities consist of the following components: 1. Computer system validation (CSV) plan: The database programmer should create a clinical study database CSV plan to document the testing strategy for the database structure, data entry screens, edit checks, external data imports, and data transfers. The CSV plan outlines the scope of testing, methodology to be used, format for the test data, and documentation and resolution of errors in the testing process. 2. Database annotated CRF: Upon finalization of the CRF, the database programmer annotates the CRF with data file names, item names, code lists, derived items, and other relevant information for each data variable, including version information. 3. Database structure and data entry screens: Using the database annotated CRF, the programmer creates the database metadata and data entry screens. 4. Coding: The variables to be coded are set up as part of the database structure and entry screens and the programs used for auto-encoding are mapped to the appropriate dictionaries. The version of dictionaries used for coding should be referenced in the metadata. 5. Edit checks: Edit checks are programmed according to the edit check specifications document. Edit checks may consist of programmed edit checks that periodically run on the
data or programmed listings that report discrepant data based on specified criteria. 6. External data: Programs used to electronically load non-CRF data into the CDMS or system table (such as Oracle) should be tested to ensure that the import process does not affect the integrity of the data. 7. Data transfer: Programs used to export the data from the CDMS should be tested to assure the transfer process does not affect the integrity of the data being transferred. 8. Formal testing of the database may be done by the programmer or the LDM. Hypothetical or dummy subject data is entered to validate the database structure, data entry screens, coding setup, and edit specifications. All errors identified through the testing process, the action taken to correct the errors, and each correction retested to validate modifications should be documented. Once the database has undergone all testing and is considered ready for use, document the release of the database and edit checks to a production environment. Minimum testing requirements consist of: A. Database structure: verify each field on the CRF is present; each data variable’s type, name, length, ranges, and code lists are accurate; derivations calculate and derive to the targeted item correctly; confirm setup of auto encoding and the appropriate dictionary term populates correctly; and the flow of data entry screens are compatible with the CRF. B. Edit checks: both valid (clean) and invalid (unclean) should be entered to verify each check achieves the expected result and performs according to the edit specifications document. C. External data: completing a test import for each data source verifying the number of records matched the raw data file and all data fields mapped correctly to
CLINICAL DATA COORDINATION
the appropriate database variable in the CDMS or designated table (such as Oracle). If any modifications to the file specifications or programs used to load the data occur, the changes must be retested and documented. D. Data transfers: completing a test transfer verifying the number of records exported and that all data, data labels, code lists, formats, and calculated variables are exported as intended.
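As an illustration of the minimum testing requirements for external data described above, the hedged Python sketch below checks that a test import preserves the record count of the raw data file and that every expected field is present so it can map to a database variable. The file name, field names, mapping, and stand-in load function are hypothetical and are not part of any real CDMS.

```python
# A minimal sketch, assuming a delimited vendor file; all names are hypothetical.
import csv

# Hypothetical mapping: raw vendor field -> CDMS variable name.
expected_mapping = {
    "SUBJID": "subject_id",
    "LBTEST": "lab_test",
    "LBORRES": "lab_result",
    "LBDTC": "collection_date",
}

def load_into_test_area(raw_rows):
    """Stand-in for the real import program: rename fields per the mapping."""
    return [{expected_mapping[k]: row[k] for k in expected_mapping} for row in raw_rows]

with open("central_lab_test_transfer.csv", newline="") as f:   # hypothetical file
    raw_rows = list(csv.DictReader(f))

loaded = load_into_test_area(raw_rows)

# The number of loaded records must match the raw data file.
assert len(loaded) == len(raw_rows), "record count differs from raw data file"

# Every expected source field must be present in the raw file.
missing = [k for k in expected_mapping if raw_rows and k not in raw_rows[0]]
assert not missing, f"missing or unmapped fields: {missing}"
```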
3.
4. 5.
6. 3
STUDY CONDUCT
The following activities occur in the study conduct phase: • data management documents tracking
and data entry • data validation • coding • serious adverse event (SAE) reconcilia-
7.
tion • external data processing • quality control
3.1 Data Management Documents Tracking and Data Entry The CDM documents that are expected to be received and the process and flow for tracking, entry, and filing should be defined at the start of the study and documented in the DMP. CDM documents may consist of CRFs, Data Clarification Forms (DCFs), sitebased DCFs (SDCFs), and source documents (for example, laboratory reports, ECGs, and subject diaries). These documents should be stored in a secure and controlled environment. It is recommended that the following information be documented: 1. Documents that will be received as a hard copy, fax, or other electronic method, such as PDF. 2. Documents that will be received as originals or copies. Define any hard copies of reports that will be received
8.
9.
5
that correspond with any electronically transferred data, such as laboratory reports. Identify whether any reconciliation of electronically transferred data with hard copy reports will occur. The process and systems to be used for any documents that will be tracked (manually or electronically). The filing organization of documents, within a paper file or electronically. The process for processing CRFs received in a language different than expected. The mechanism that will be used to ensure all expected CRFs and other data entry documents are received and entered in the CDMS. A program written within the CDMS to track pages once they are entered is an efficient and accurate method. Such a program can also provide reports for monitors of expected pages that are missing. The mechanism used to verify unscheduled pages have been received and entered, such as unscheduled visits, repeating pages of log-type forms such as concomitant medications and subject diaries. Any manual review that will be completed before data entry, such as verification for complete and correct subject identifiers, measurement of scales. Describe any data conventions that can be applied before entry. The method of data entry, such as whether it is interactive double data entry, independent double data entry with third party reconciliation, or single entry with 100% manual verification. ‘‘The great debate on whether single or double entry should be used to enter clinical data is still unresolved and it should be questioned if there is a right or wrong answer’’ (1). FDA and ICH regulations do not mandate a specific data entry method. ‘‘All clinical trial information should be recorded, handled, and stored in a way that allows its accurate reporting, interpretation, and verification’’
6
CLINICAL DATA COORDINATION
(7, p. 25695). The method can be evaluated based on factors such as the skill level of the entry operators, amount of time to complete the entry process, and the percent of acceptable errors in the data. 10. General and page-specific data entry guidelines to document handling of data, including date formats, subject initials, blank pages, missing pages, missing data fields, text fields, values recorded as not done or unknown, extraneous comments or marginalia, abbreviations, and illegible fields.
3.2 Data Validation Data validation or data cleaning are the processes used to validate the accuracy of data (3). Data validation can identify errors made while entering data from CRFs into the database and transcription errors made by the site when transcribing from source documents to the CRFs. These errors are identified through checks of the data that report inconsistencies within a subject’s data, medically invalid values, data values out of predefined ranges or formats, incorrect protocol procedures, spelling and legibility problems, and missing data. This check can be achieved through manual review of data listings and programmed edit checks that identify discrepant data. ‘‘The ultimate purpose of validation is to ensure that data meet a level of quality such that the inferences and conclusions drawn from any analyses may be correct’’ (1). Comprehensive data validation guidelines should be developed to check CRF and any other external data related to the study. ‘‘When data management locks a database that is still dirty, the data problems are passed off to the statisticians and programmers. The data problems do not go away. In fact, they become more expensive to fix’’ (8) Comprehensive data validation guidelines (DVGs) should be created consistent with the protocol requirements, CRF, and the statististical analysis plan (or key safety and efficacy data). Cross-functional input from clinical and biostatistics on the DVGs is important and will assure a sound data cleaning plan.
All data changes should be documented at the data management center and the investigative site. The DVGs will generally include the following components: 1. DCF process and flow: A. Frequency for running consistency checks and whether they will be run automatically or manually B. Whether DCF updates will be made to the CRF C. Timelines for DCF generation and receipt from investigative sites D. Method of DCF distribution to investigative sites E. Definitions of discrepancy and DCF statuses and audit trail reasons for change and their use F. Process for handling unsolicited comments on CRFs, for example, review by clinical or medical team members or issue a DCF to the investigative site G. Process for handling any duplicate DCFs 2. Edit specifications: A. Edit specifications can encompass pre-entry review, programmed consistency checks, and manual listings review. The edit specifications should be sufficiently detailed that the database programmer can use this document to program consistency checks and data listings. Examples include: a. Checks for sequential dates b. Checks for missing or required data c. Checks to assure data values are consistent with the data expected, such as character, numeric, within a predetermined range d. Checks across different types of data (cross-panel or crossmodule) e. Checks for protocol violations f. Checks to identify outliers, such as medically implausible
data, outside expected ranges for the study population B. Standardized DCF message text should be developed for both programmed consistency checks and manual queries issued from manual review of data listings. This process can increase consistency and efficiency during processing, especially when multiple CDCs are working on the study. 3. Data conventions A. Data cleaning conventions may be created for changes that can be made without the investigator’s approval on each individual change. These should be determined at the beginning of the study, before data cleaning activities begin, and only used for obvious errors. Data cleaning conventions should be made by trained data management personnel. B. Data conventions should not be applied to key safety or efficacy variables. C. The investigator should receive and acknowledge at least the first and final versions of data cleaning conventions. D. If certain protocol tests are being performed locally by the site and use normal reference ranges, such as a local laboratory, determine whether local reference ranges will be entered in the database and referenced in reporting or whether laboratory reference ranges will be normalized across investigative sites or studies.
3.3 Coding Clinical data, such as diseases, diagnoses, and drugs that are collected on CRFs, are coded using standard dictionaries to facilitate the analysis and reporting of the data. Coding of data provides enhanced data retrieval, flexible reporting capabilities, the ability to group consistent terms, and the capability
7
for different levels of hierarchical or crossreference reporting (1). ‘‘The aim of processing clinical trials data and safety information is to order and structure the data to allow effective reporting and retrieval of information for regulatory, safety, marketing and planning usage’’ (1). Data are typically coded using an auto-encoding system. Auto-encoders vary in complexity from exact verbatim text to dictionary term-matching to use of a thesaurus to manage synonyms that map to a dictionary term. When selecting an auto-encoder, capabilities to consider include the ability to manage synonyms, misspellings, mixed case, word variations, word order, and irrelevant verbatim information (3). Auto-encoders must be validated and comply with FDA regulations on electronic records (4). The International Conference on Harmonization (ICH) Medical Dictionary for Regulatory Activities (MedDRA) Terminology is recommended for coding clinical trial adverse event data. For guidance regarding the use of MedDRA and version updates, see the Maintenance and Support Services Organization (MSSO) website for publications entitled MedDRA Implementation and Versioning for Clinical Trials and MedDRA Term Selection: Points to Consider, Release 3.4 (http://www.meddramsso.com) (9, 10). Medications are coded to facilitate the analysis and reporting of possible drug interactions (3). The WHO Drug Dictionary is commonly used for coding medications (see Uppsala Monitoring Centre) (11). If adverse events and serious adverse events are maintained in two separate databases, ensure that the same dictionaries and versions are used, if possible, and that consistent coding guidelines are used to code serious adverse events. All versions of dictionaries used should remain available for reference. If changes or additions are needed to publish dictionaries, the organizations that maintain the dictionaries have procedures for submitting change requests (3). ‘‘All levels, codes or group assignments for the data should be stored’’ (3). This storing ensures all information is available if recoding is needed or if hierarchy or multi-axial information is needed to fully understand a particular coded term.
8
The following points should be considered for coding data and detailed coding guidelines should be developed for each study: 1. Identify the variables to be coded, the dictionaries, and the version used. 2. Define the process for terms that do not auto-encode, including misspellings, abbreviations, and irrelevant verbatim information. If coding guidelines allow modifications to be made to facilitate auto-encoding, the modifications should only be made to the corresponding modifiable term and the original verbatim term should remain unchanged in the database. Changes to the original verbatim term should only be made based on a DCF confirmed by the investigator. 3. Develop guidelines for how terms containing multiple concepts will be handled, whether they can be split per coding guidelines or issuing a DCF to investigator. In general, all adverse events reported as multiple events should be issued on a DCF to the investigator for clarification. 4. Any terms that cannot be coded should be documented in a data handling report. All serious adverse events should be coded. 5. Determine how dictionary updates that occur during the study will be managed. If updates will be applied, identify the frequency of planned updates and how updates will be applied (e.g., updates only or reload dictionary). 6. Define the process and frequency for review of coded terms for missing codes, consistency, and medical or clinical judgment. Define reports that will be used for coding review. For further information on dictionary management, the following are good resources. 1. Good Clinical Data Management Practices, Version 3 (3) 2. MedDRA maintenance and the Maintenance Support Services
Organization (MSSO), http://www. meddramsso.com (9, 10) 3. Uppsala Monitoring Centre, http:// www.who-umc.org (11)
3.4 Serious Adverse Event Reconciliation Serious adverse event (SAE) reconciliation is performed for studies in which SAE data and study adverse event data are maintained in two separate databases. SAE data are typically reported to a drug safety group responsible for maintaining the regulatory safety reporting database. Adverse events (AEs) are reported along with CRF data and entered into the clinical database. As some events are entered into both databases, a comparison of the databases for key safety data variables is performed to ensure consistency. When designing the AE and SAE forms, standardize the capture of SAE data variables for consistency in reporting and to facilitate reconciliation. Develop SAE reconciliation guidelines to include: 1. The systems and locations of the safety database and the clinical database. 2. The method for providing safety data between the two groups (i.e., paper or electronic). If electronic, the medium to be used and format of the data, such as ASCII or SAS. 3. The mechanism for comparing data (i.e., manually or programmatically). 4. The time intervals for when comparisons will be performed, including details for comparisons prior to interim analyses or safety data reporting. 5. The coding dictionaries and versions used to code diagnoses among the two databases. If using different dictionaries or a different version of the same coding dictionary, identify how differences in coded terms for the same event will be resolved. 6. Determine cut-off dates, such as database lock, after which no new SAEs will be added to the clinical database.
CLINICAL DATA COORDINATION
7. Identify the variables that will be reconciled and which items must be an exact match and which may be similar. The following items are examples of items that might be reconciled if they are present in both the safety and clinical databases: A. protocol B. investigator C. subject identification (e.g., subject number, randomization number, initials, date of birth, gender, race) D. diagnosis (i.e., verbatim or coded term) E. severity F. onset date G. date of death H. resolution date I. causality assessment J. action taken with drug K. outcome 8. Reconciliation can occur on events where all data to be reconciled have been entered and validated, no outstanding DCFs exist, and the event has been coded. It is recommended that discrepancies be documented consistently, such as on an SAE reconciliation form. 9. Determine the process for documenting and resolving discrepancies. Outline the communication process with the drug safety group and expected turn-around time for resolving discrepancies. Data clarification forms should be issued to the investigator to resolve discrepancies. The drug safety group should be notified as soon as possible of any events that are in the clinical database but not in the safety database. It is possible for events to exist in the safety database but not in the clinical database, reflecting CRFs that have not been retrieved and entered. 10. It is recommended to have written documentation from the drug safety lead and the LDM once all data have been received and reconciled. Any discrepancies that cannot be resolved should be documented in the DMP.
3.5 External Data Processing Use of a centralized vendor for processing and reporting of select data is very common in multicenter trials. Centralized vendors are used to standardize the methods of testing and reporting, using consistent reference values for all data collected across study sites. Many sponsors use one vendor for protocol procedures or tests such as automated randomization, clinical laboratory, ECG interpretation, pharmacokinetic data, and subject diaries. The results are created, entered, or processed and quality control checked by the vendor and typically sent as electronic file (or files) to the sponsor or CRO. The sponsor or CRO should ensure the vendor’s personnel, systems, and procedures are adequate and meet regulatory requirements. For further information, see Good Clinical Data Management Practices, Version 3, section Vendor Management (3). Once the vendor has been identified and qualified, it is important to work with a contact person at the vendor to develop transfer specifications, which are then documented. The transfer specifications and detailed guidelines for each type of external data should be developed as part of the DMP and include information, such as: 1. Contact names and phone numbers for both the sponsor or CRO and vendor. 2. Format of data to be received, such as ASCII or SAS, and version identification. 3. Method of data transfer, for example, disk, CD, data tape, Internet. If data are to be transferred through the Internet, ensure appropriate encryption mechanisms are used. 4. Schedule for data transfers. 5. Whether transfers will be incremental or cumulative. 6. Testing strategy, including completing a test transfer to validate the integrity of the data following the import process (see Design Study Database section). 7. Process for loading data, the programs used to manipulate data, if applicable, and loading logs.
8. Location of loaded data, such as CDMS, Oracle table, or SAS datasets. 9. Detailed file structure, including variable names, labels, attributes (size, format), code lists, mapping of external data variables to clinical database structure and position within file, required variables, how missing variables will be identified. 10. Details of normal ranges and process for conversion to different units, if applicable. The vendor should provide a list of reference ranges and effective dates. Ensure reference ranges are up-to-date throughout the study and prior to database lock. Identify how any reference range changes will be handled for reporting purposes. 11. Process for handling data corrections and additions. If an audit trail is not available for corrections at the data management site, the vendor should make all data changes and provide a corrected data file. 12. Process for handling repeat tests, duplicates, and partial records. 13. Process for handling special characters, such as symbols and quotation marks. 14. Mechanism to identify retransmission of corrected data. 15. Communication process with vendor to acknowledge receipt of data transfers, such as e-mail or hard copy form. 16. If external data contains information that would unblind the CDM team, identify the process for handling the data to ensure blind is maintained. Once transfer specifications have been agreed on with the vendor, the external data import program has been set up and tested, external data transfers may occur per the specifications. The following points should be considered in processing external data: 1. Acknowledge receipt of each data transfer. 2. Communicate any problems encountered with the data file and whether the data file was acceptable.
3. Obtain a list of checks that are performed on the data by the vendor before transfer to ensure CDM does not duplicate checks.
4. Identify in the edit specifications the checks that will be performed on the data following each transfer. Some checks to consider include:
A. Verify a minimum number of demographic variables to ensure the data correspond to the intended subject, such as subject number, subject initials, and date of birth.
B. Verify for each record in the data transfer file that a corresponding CRF record exists in the clinical database and that the intended data variables match, such as visit identifier and date of sample or procedure.
C. Verify for each CRF record in the clinical database that a corresponding record exists in the data transfer file and that the intended data variables match, as above.
D. If the CRF and clinical database do not capture unscheduled tests or procedures, develop a process to ensure these records are present in the data transfer file and contain the appropriate data. It is recommended that the CRF capture, at a minimum, the date of each unscheduled test or procedure to ensure all expected data are received and reconciled in the transfer file.
5. To avoid duplication of information, transfer only those variables that are needed to accurately match the subject and associated visit with the CRF data in the clinical database.
6. Retain an electronic version of the data transfer file (i.e., each transfer file if incremental and the final transfer if cumulative).
3.6 Quality Control
Quality control in the context of CDM comprises the procedures undertaken to ensure the
‘‘. . .data contains no errors, inconsistencies or omissions (that can be rectified) and that the data are coded correctly and transferred accurately from the case record form (CRF) to the computer database’’ (1). The ICH GCP guideline states, ‘‘Quality control should be applied to each stage of data handling to ensure that all data are reliable and have been processed correctly’’ (7, p. 25699). Quality is not a single measurement; quality is measured with individual steps of a total process (1). Quality can be influenced by the experience and judgment used by the CDM personnel reviewing discrepancies, ‘‘. . .while queries are usually issued in line with set procedures which broadly define what should and should not be queried, there will always be some discrepancies where an element of judgment is needed in order to arrive at the decision of whether a query is warranted’’ (1). The level of expertise and appropriate judgment used in determining whether to send a DCF to the investigative site can affect the overall quality of the study data. CDM staff responsible for data validation should receive adequate training and quality control of their work product until a reasonable level of accuracy is attained before working independently. Quality is typically measured by setting a certain standard for an allowable number of errors within a defined number of data values. Errors can be weighted differently according to the value of the data (e.g., key efficacy data may need to be error-free). Good data validation is time-consuming to define, develop, and implement. ‘‘It is important to assess the value of the effort put into validations against the resulting improvement in the data’’ (1). A quality control (QC) plan should be developed at the beginning of the study to document the QC procedures that will be performed during the study. QC reviews should consist of a comparison of the selected data variables and corresponding DCFs, site DCFs, and data conventions to ensure no entry, programming, or database update errors occurred in the data management process. The QC plan may provide for initial, ongoing, and final inspections. Initial inspections can provide immediate feedback in areas where investigative site or monitoring training is needed. Ongoing inspections
can be beneficial for studies of long duration or large size to ensure that no systematic processes resulting in errors have occurred. The final inspection is performed to provide an error rate between the CRF and the final database used for analysis. All errors identified in inspections should be documented and corrected in the clinical database. ‘‘Error detection methods that are applied to only a sample of the data can be used to obtain estimates of the distribution of undetected errors’’ (3). The Good Clinical Data Management Practices document suggests that the best practice for determining appropriate QC sample size is by statistical sampling (3, p. 82). CDM error rate standards vary across the industry. Errors should be expressed in the form of an error rate rather than a raw count of the number of errors. Error rates should be calculated by taking the number of errors identified divided by the total number of fields inspected. ‘‘The error rate gives a common scale of measurement for data quality’’ (3). Study-specific details should be documented in a QC plan, covering details that are not specified in the relevant QC SOP. Study-specific details that might be included in the QC plan are:
1. Identify any key safety and efficacy variables that are deemed critical for the analysis to undergo 100% QC and be error-free. These variables can be determined from the SAP (or the statistical methods section of the protocol if the SAP is not final) and in discussion with biostatistics. If the study has a very large volume of key safety and efficacy data, the LDM in conjunction with biostatistics determines an acceptable error rate for key variables while maintaining a statistical confidence interval. 2. If inspections will occur other than on the final, clean data, identify the timing and sampling methodology to be used.
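The error-rate calculation described above, together with a simple way to gauge the uncertainty of a rate estimated from a sample of inspected fields, can be sketched as follows; the inspection numbers and acceptance threshold are hypothetical, not industry standards.

```python
# Illustrative error-rate calculation for a QC inspection.
import math

def error_rate(errors_found: int, fields_inspected: int) -> float:
    """Errors identified divided by total fields inspected."""
    return errors_found / fields_inspected

def upper_confidence_bound(errors_found: int, fields_inspected: int, z: float = 1.96) -> float:
    """Approximate 95% upper bound on the true error rate (normal approximation)."""
    p = error_rate(errors_found, fields_inspected)
    return p + z * math.sqrt(p * (1 - p) / fields_inspected)

errors, fields = 12, 10_000           # example inspection results
rate = error_rate(errors, fields)     # 0.0012 -> 1.2 errors per 1,000 fields inspected
ucb = upper_confidence_bound(errors, fields)
threshold = 0.005                     # hypothetical acceptance threshold (5 per 1,000 fields)
print(f"error rate={rate:.4f}, 95% upper bound={ucb:.4f}, acceptable={ucb <= threshold}")
```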
4 STUDY CLOSURE
The following activities occur in the study closure phase:
• final quality control procedures
• database lock
• data archiving
4.1 Final Quality Control Procedures
Before database lock and when a specified percentage of the study data are considered clean, such as 90%, a final inspection is performed to estimate the error rate between the CRFs and the clinical database. The final inspection is to ensure data management processes resulted in high-quality data suitable for statistical analysis.
4.2 Database Lock
Database lock occurs when all study data are considered clean and ready for statistical analysis. For some studies, databases may be locked for interim analyses to support study continuance for safety or efficacy reasons or for regulatory submissions. The locking of the database in the CDMS is completed to prevent inadvertent or unauthorized changes once analysis and reporting have begun (3). Written procedures should be in place for the locking of the database and removal of write access in the CDMS. Use of a database lock checklist is recommended to ensure all appropriate procedures have been completed. Potential items to include on a database lock checklist are mentioned later in the section. Once the database lock process is underway (such as initiation of signatures on a database lock authorization form), no additional changes should be made to the database.
Instances occur when a database may need to be unlocked. A discrepancy may be identified during the statistical analysis and reporting phase that may warrant unlocking the database. A discrepancy should be discussed with project team members, including clinical, biostatistics, and data management, to determine any effect the discrepancy might have on the analysis. If the decision is made to unlock the database to make corrections
or query the investigator for additional information, the unlock process should be documented, including authorization signatures, date, time, and the names of individuals given access to the unlocked database. Once discrepancies are resolved or corrections are made to the database, formal database lock procedures should once again be followed, including completion of a database lock checklist. If the decision is made not to unlock the database, the rationale should be documented in the data management study file, preferably in a data handling report or comparable document, and as errata in the clinical study report. In some cases, the discrepancy identified post-lock does not warrant unlocking because the data either do not affect or only minimally affect the statistical results.
Before database lock, it is good practice to use a database lock checklist to ensure all items have been completed. The following should be considered in evaluating the database for readiness to lock:
1. All study data have been received or accounted for by data management
2. All data have been entered in the CDMS and all entry discrepancies have been resolved
3. All noncoded text fields and unsolicited comments have been reviewed and appropriate actions taken
4. All electronic external data have been received and reconciled with the clinical database and all discrepancies resolved
5. All data validation has been completed per the Data Management Plan and all queries have been received and updated
6. Coding has been completed and reviewed for consistency and accuracy
7. SAE reconciliation has been completed and all discrepancies have been resolved (applicable if AEs and SAEs are maintained in separate databases)
8. Reference ranges, such as laboratory normal ranges, are available in the database and have been verified to map appropriately to corresponding data values
9. Inspections specified in the QC plan have been completed and an acceptable error rate was achieved 10. All automatic batch jobs have been canceled 11. All deviations and unresolved data issues have been documented, such as in the Data Handling Report The concept of a Data Handling Report is to document deviations in processes defined in the DMP; deviations to SOPs; unresolved data issues, such as DCFs, missing CRF pages or other study data; terms that could not be coded; and any other information that will be useful for the statistical analysis and clinical study report. Write access to the clinical database should be revoked from all users once the database is locked. 4.3 Data Archiving Environmentally protected, secure, and accessible storage for study data documents is usually limited in most CDM organizations. Routine data archiving is good practice to allow space for ongoing studies and ensure the data are easily accessible by CDM personnel throughout the life of the study. Data archiving also ensures the documents are available for audits by independent auditing groups and regulatory authorities. All CDM organizations should have an SOP for archiving paper and electronic records based on at least the minimum requirements specified as essential documents ‘‘. . .those documents that individually and collectively permit evaluation of the conduct of a trial and the quality of the data produced. . .’’ in ICH Guideline, E6 Good Clinical Practice: Consolidated Guidance (7, p. 25705). In many organizations, paper archiving follows a master filing plan or template for all clinical study documents within which CDM documents are filed. A standard master filing plan ensures documents can be easily retrieved if needed for future reference. Study data documents should be retained according to applicable regulations of the country (or countries) where the drug is approved. According to ICH Guideline, ‘‘. . .essential documents should be retained
until at least 2 years after the last approval of a marketing application in an ICH region . . .'' (7, p. 25700).
Archiving consists of both paper and electronic records. Electronic data media are changing rapidly. ''Media obsolescence can make orphans of data in the clinical world'' (12). It is important for clinical trial data to have continued readability once archived. As technology advances, the hardware and software to read data files can quickly become obsolete. One strategy to keep data from becoming obsolete and unreadable is to migrate them to newer versions. This process can be time-consuming and costly to ensure the data are properly migrated and validated. The process of migrating data has risks, such as losing data in the transfer or unintended transformation (12). Most recently, an industry-accepted standard for data archiving is the CDISC Operational Data Model (ODM). For further information, please see http://www.cdisc.org. The following should be considered when archiving CDM documents and electronic records:
1. Ensure electronic data are archived in an open format, such as the Extensible Markup Language (XML) used in CDISC ODM, because it can be easily accessed regardless of the system software version used.
2. These items should be archived: A. Database metadata and program code B. Raw data, including CRF data and any externally loaded electronic data C. External data electronic files in original format, retaining each file if incremental data files were received, or the last file only if cumulative data files were received D. Copy of the dictionary used to code data E. Each version of laboratory reference ranges used F. Audit trail G. Final data, including any derived data
H. Original documents, including CRFs, DCFs, the DMP, project-specific computer system validation documentation, and any other relevant study documentation, such as correspondence I. Discrepancy management data for data that failed edit checks J. SOPs used and any documented deviations K. Database closure documentation, such as database lock authorization, confirmation of database lock, and removal of access L. Electronic back-up copies
3. The recommended timing to archive data is subjective, but a standardized approach might be to archive once the clinical study report for a trial is finalized.
4.4 CDM's Role in Training and Education
CDM can play a significant role in the training and continuing education of clinical study site personnel and the clinical research associates who are responsible for monitoring the data. CDM input in all phases of the clinical study can improve the quality and integrity of the data. Examples of CDM input are:
1. CDM participates in the review or development of CRF completion guidelines. These guidelines provide detailed instructions to clinical study site personnel on how to complete each page of the CRF.
2. CDM participates in investigator meetings to present CRF completion guidelines and provide sample completed CRFs and DCFs. CDM provides instructions for correcting errors on the original CRF and for handling errors on CRFs that have already been submitted to CDM. They communicate the flow of the data collection and cleaning process and identify potential errors in consistency across data collection forms.
3. CDM provides ongoing feedback on trends in CRF completion and frequency of data errors at clinical team
meetings or teleconferences, investigative site newsletters, and annual investigator meetings.
5 SUMMARY CDM is a crucial step in the drug development process. Good CDM processes can improve the integrity and quality of data when built on the foundation of a welldesigned clinical study and data collection tool and an adequate clinical monitoring plan. CDM is a complex process in which data from a clinical study are acquired, processed, validated according to predefined rules, and integrated into a comprehensive database for subsequent statistical analysis and reporting. Key to successful data management is a knowledgeable LDM, who applies the principles of good CDM practices in all stages of data handling. In the study initiation phase, it is important to proactively plan and document detailed processes in a Data Management Plan; develop a comprehensive data management timeline to ensure a complete and high-quality database within the required company or CRO timelines; lead the cross-functional team providing input on the design of the data collection tool following industry-accepted standards; and coordinate the design and testing of the database and edit checks before processing production data. In the next phase—study conduct—the LDM should assess the adequacy of the Data Management Plan and adapt processes to meet the requirements of the study; provide early and ongoing feedback to monitors and study sites on the quality of the data; ensure the data validation guidelines comprehensively address both anticipated and unanticipated data scenarios; and modify or add edit checks as needed. All of the planning efforts completed at the beginning of the study and the data validation steps that occur while the study is ongoing culminate in the study closure phase. During this final phase, QC procedures are carried out and the final steps to database lock are completed, including documenting
any deviations to SOPs or processes outlined in the Data Management Plan. Following database lock, completion of the statistical analysis, and a final clinical study report, the paper and electronic study data may be permanently archived. Well-planned and managed CDM processes that follow industry-accepted standards and practices can result in a timely, high-quality database. The conclusions drawn from high-quality databases can prove the safety and effectiveness of a drug or device, providing the foundation for a regulatory submission and subsequent approval and marketing. REFERENCES 1. R. K. Rondel, S. A. Varley, and C. F. Webb (eds.), Clinical Data Management. London: Wiley, 1993. 2. B. Spilker and J. Schoenfelder, Data Collection Forms in Clinical Trials. New York: Raven Press, 1991. 3. Society for Clinical Data Management, Inc., Good Clinical Data Management Practices (GCDMP), Version 3. Milwaukee, WI: SCDM, September 2003. 4. Food and Drug Administration, 21 CFR Part 11, Electronic Records; Electronic Signatures; Final Rule. Fed. Reg. 1997; 62(54): 13429–13466. 5. Food and Drug Administration, Guidance for Industry: Computerized Systems Used in Clinical Trials. Washington, DC: FDA, April 1999. 6. Clinical Data Interchange Standards Consortium (CDISC). Operational Data Model, Final Version 1.2.1. (online). Available: http://www.cdisc.org/models/odm/v1.2.1/ index.html. 7. International Conference on Harmonization, Good Clinical Practice (ICH GCP): Consolidated Guideline. Fed. Reg. 1997; 62(90). 8. K. L. Monti (2001). A statistician shows how to save time and money through data management. Appl. Clin. Trials (online). Available: http://www.actmagazine.com/ appliedclinicaltrials/article/articleDetail.jsp? id=92010. 9. Maintenance and Support Services Organization (MSSO), MedDRA Term Selection: Points to Consider, Release 3.4, November 18, 2004. (online). Available: http://www.meddramsso.com/NewWeb2003/
document library/9530-710%20ptc output%20doc122204.pdf.
10. Maintenance and Support Services Organization (MSSO), Recommendation for MedDRA Implementation and Versioning for Clinical Trials. (online). Available: http://www.meddramsso.com/NewWeb2003/Docs/clinicaltrialversioning.pdf.
11. Uppsala Monitoring Centre (2005). (online). Available: http://www.who-umc.org.
12. P. Bleicher (2002). Diamonds may be forever, but data? Appl. Clin. Trials (online). Available: http://www.actmagazine.com/appliedclinicaltrials/article/articleDetail.jsp?id=87252.
CLINICAL DATA MANAGEMENT
RUTH MCBRIDE
Axio Research, Seattle, Washington
Even before computers were used in clinical research, data were being collected, organized, and analyzed. Since the advent of computers and specialized data management software, the area of clinical data management has matured to the point that clinical investigators can count on having complete, reliable, and current data for interim and final analyses. Clinical data management, as the name implies, is data management focused on clinical data, but not on all clinical data. Clinical data are collected for a variety of purposes: as part of electronic medical records to manage medical or dental practice, as part of billing information, and, in the context that this section of the encyclopedia will discuss, to support clinical research and, in particular, clinical trials. In this context, there are some specialized requirements of a data management system or process. First of all, data collected for clinical trials research are primarily study participant-centric (or patient-centric). This implies a hierarchical data structure where most data records are tied to participants. Within participant, data are further organized by ''visit.'' Contrast this data structure with databases designed to support, for example, banking or airlines. These databases are designed to support many transactions and are more typically relational in structure. Although commercial clinical data management systems are built on relational database servers (e.g., Oracle or SQL/Server), in order to retrieve and display data most efficiently, these systems take advantage of the hierarchical structure to build links and keys to optimize access. Second, the volume of data in clinical data management databases is relatively small compared with other commercial applications. A typical database for a phase III study might contain several thousand megabytes of information. Databases for large online retailers contain many terabytes of data. As will be demonstrated in the articles that follow, the accuracy and completeness of data collected for clinical research is a primary focus of many of the activities surrounding clinical data management, whereas the focus for online retailers, airlines, and banking applications is optimization for transaction processing response and security. Clinical data management, then, encompasses the processes involved in transferring data to a computerized database; applying a variety of techniques to make sure that the data are as accurate and complete as possible; tracking all changes, modifications, additions, and deletions to the data; and providing a mechanism for delivering the data for statistical analysis.
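As a minimal illustration of the participant-centric, visit-organized structure described above, the record hierarchy might be modeled as in the following sketch; the form and field names are hypothetical, not drawn from any particular system.

```python
# Minimal sketch of a participant-centric hierarchy; names are illustrative.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FormRecord:
    form_name: str                    # e.g., "VITALS", "AE"
    fields: Dict[str, str]            # item name -> collected value

@dataclass
class Visit:
    visit_id: str                     # e.g., "BASELINE", "WEEK4"
    forms: List[FormRecord] = field(default_factory=list)

@dataclass
class Participant:
    subject_id: str
    visits: Dict[str, Visit] = field(default_factory=dict)

    def add_form(self, visit_id: str, form: FormRecord) -> None:
        self.visits.setdefault(visit_id, Visit(visit_id)).forms.append(form)

p = Participant("1001")
p.add_form("BASELINE", FormRecord("VITALS", {"SYSBP": "128", "DIABP": "82"}))
p.add_form("WEEK4", FormRecord("AE", {"AETERM": "Headache", "AESEV": "MILD"}))
```

A relational CDMS would store these records in keyed tables rather than nested objects, but the logical organization (subject, then visit, then form) is the same.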
1 HOW HAS CLINICAL DATA MANAGEMENT EVOLVED? 1.1 Coronary Drug Project and Early Trials As early as the 1960s computerized systems were being used for clinical data management in NIH-sponsored clinical trials. The Coronary Drug Project, which began collecting data in 1966 (1), describes data management processes that included computer files, keypunched data, and computer programs to scan for possible data problems. Originally, data for this trial were stored in fixed-length patient records on computer tape, indexed by a master patient directory. Each record contained variables or reserved space for each of the possible patient visits. Thus, a lot of precious file space was wasted. Data were keypunched onto 80 column punch cards and 100% verified before being loaded into the database. Text data were separately keyed and stored. Eventually, ‘‘key-to-disk’’ machines replaced the punch cards. Computer programs were written to check for a variety of potential data problems. The range of data edits, remarkably, was consistent with the data edits applied to more contemporary clinical databases: missing responses, where required; values outside expected limits; internal logical consistency; and data values indicating possible protocol deviations.
Without intelligent data entry, however, these early data editing programs had to check to make sure that legitimate values were entered even for categorical responses. More contemporary programs will not allow an invalid entry for categorical responses such as ‘‘Yes’’/‘‘No’’ variables. In the late 1970s, the Coronary Artery Surgery Study (CASS), another NIH-sponsored clinical trial, initiated the use of distributed data entry or remote data entry (RDE). Computer workstations were installed at each of the clinical sites. The workstations allowed the clinical site staff to key study data directly to a computer diskette. Data entry programs checked for some possible data problems, generally univariate data edits. Double-entry verification was enforced. In the beginning, the computer diskettes were mailed to the Data Coordinating Center. Later during the trial, the data were electronically transmitted from the workstation at the clinical site to a similar workstation at the Coordinating Center. The key advantage to this system was that data were received at the Coordinating Center more quickly than if paper CRFs had been sent. Data editing programs could be run sooner and queries sent back to the clinical sites more quickly. A major premise driving this innovation is that the sooner that potential data errors or omissions are detected, the more likely they will be corrected. This was based on the conjecture that the clinic staff would be more likely to remember a patient visit and the medical chart would more likely still be easily available. The key-to-disk software was developed by staff at the Coordinating Center. Even at that time, there were few commercial programs to support clinical data management. The CASS Coordinating Center, like many other data coordinating centers involved in clinical trials at the time, developed its own clinical data management system (2). Separate computer programs were developed to scan the database for potential data problems. Query reports were periodically sent by mail to the clinical sites.
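The classes of edits described above (missing required responses, values outside expected limits, and internal logical consistency) might be expressed today as in the following sketch; the variable names and limits are hypothetical, not taken from any particular trial.

```python
# Illustrative univariate and consistency edit checks; variable names and
# limits are hypothetical.

def edit_checks(record: dict) -> list:
    problems = []
    # Missing response where required
    if not record.get("SEX"):
        problems.append("SEX is missing")
    # Value outside expected limits
    sysbp = record.get("SYSBP")
    if sysbp is not None and not (60 <= sysbp <= 250):
        problems.append(f"SYSBP {sysbp} outside expected limits (60-250 mmHg)")
    # Internal logical consistency
    if record.get("PREGNANT") == "Y" and record.get("SEX") == "M":
        problems.append("PREGNANT = Y inconsistent with SEX = M")
    return problems

print(edit_checks({"SEX": "M", "SYSBP": 300, "PREGNANT": "Y"}))
# ['SYSBP 300 outside expected limits (60-250 mmHg)',
#  'PREGNANT = Y inconsistent with SEX = M']
```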
1.2 Development of Commercial Software to Support Clinical Data Management Until the 1980s, most clinical data management was done using ‘‘home-grown’’ software. Data Coordinating Centers at major universities supporting NIH-sponsored research, fledgling Contract Research Organizations, and pharmaceutical companies with their own data centers developed and deployed a wide variety of data systems. These early ‘‘home-grown’’ systems initially had to provide their own underlying file structure since database engines such as Oracle were not yet widely available. A few products emerged, such as the Scientific Information Retrieval System (3), which provided a database system tailored to clinical studies. SIR was a hierarchical data system optimizing for data linked together for cases and visits. It provided a robust storage format for clinical data, but at that time lacked many of the data entry and data cleaning tools of more contemporary systems. By the mid-1980s, a few commercial systems came to the market to support clinical data management. These systems added more features such as data entry, data cleaning, and reporting functions. Most of these systems presumed that paper CRFs would be sent to a central data center for entry and processing. The Internet was in its infancy, and Web-based applications were yet to come. By the mid-1990s, personal computers had penetrated the business world and the Web was quickly becoming a platform for secure business transactions. Clinical data management moved slowly toward adopting this new technology. A few university-based Data Coordinating Centers were experimenting with home-grown systems to allow data entry using Web applications. By the turn of the century, however, the market for Web-based clinical data management systems, primarily ‘‘electronic data capture’’ or eDC systems had emerged. Even into the 1990s, a big decision for any clinical data management group, however, was whether to acquire and implement a commercial software system or to develop a software system more closely tailored to internal processes.
2 ELECTRONIC DATA CAPTURE
Numerous articles have been written on the value of eDC systems over conventional paper CRFs. In the late 1990s, experts were predicting that by now well over half of clinical trials would be conducted using eDC systems. Granted, there has been tremendous growth in the adoption of eDC, reported by some to be as much as 23% per year (4). The number of eDC vendors has grown steadily in the intervening years, with a few systems capturing a sizeable portion of this market. However, at present, the proportion of trials that are managed using paper CRFs remains substantial. Several factors have limited the rate of adoption of eDC, including acceptance by clinical site staff. Although eDC clearly lessens the burden for storage of paper records, it does not necessarily lessen the burden on clinical site staff for collecting and transcribing that information from source documents to the CRF, whether it be paper or electronic. Not all clinics are equipped with sufficient numbers of properly configured PCs to allow research staff access to the eDC system when it would be most efficient. For example, most examination rooms do not yet have high-speed Internet access or wireless Internet access. 3 REGULATORY INVOLVEMENT WITH CLINICAL DATA MANAGEMENT For those clinical trials conducted in support of a new drug application or approval of a medical device, regulatory agencies such as the U.S. FDA, the European EMEA, and Health Canada have issued guidance documents and regulations that govern many aspects of clinical data management. The objective of these guidance documents and regulations is to assure that the information presented to the regulatory body is as complete and accurate as possible, since the safety of the patients who will use these new treatments will depend on its reliability. Perhaps the most far-reaching of these regulations is 21CFR11, Electronic Records and Signature. This regulation was issued in 1997 after extensive consultation with the pharmaceutical industry and other stakeholders. Work began on this regulation in
1991 with the FDA forming a Working Group in 1992 to investigate the various issues surrounding digital signatures in particular. A proposed rule was published in 1994 and resulted in a hailstorm of discussion. The final rule was issued in March 1997. The regulation, as its title implies, puts forth requirements for the use of both electronic data and electronic signatures for studies to be submitted to the U.S. FDA. The regulation's primary focus is to make sure that electronic data are as reliable and attributable as data or information that would have been supplied on paper, and that, if an electronic signature is used in place of a physical signature, the electronic signature carries the same weight as a physical signature, is attributable, and cannot be tampered with. Part 11 also addresses requirements for electronic systems that would support creation, storage, modification, and transmission of electronic records. These regulations lay out some fundamental requirements for clinical data management systems, such as a complete electronic audit trail, attributability of data by automatic date/time stamp, and use of security measures such as passwords to prevent unauthorized access. In addition to 21CFR11, the FDA issued a guidance document in April 1999 on the use of computerized systems in clinical trials. This guidance document ''applies to records in electronic form that are used to create, modify, maintain, archive, retrieve, or transmit clinical data required to be maintained, or submitted to the FDA'' (5). The guidance was updated in May 2007 and contains extensive information about the agency's expectations regarding the design and validation of computerized systems, such as clinical data management systems.
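As an illustration of the audit-trail and attributability requirements noted above, a data-change record in a CDMS might capture at least the elements sketched below; the field names are illustrative assumptions, not a format prescribed by 21 CFR Part 11.

```python
# Minimal sketch of an audit-trail entry for a data change; field names are
# illustrative, not mandated by the regulation.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)               # frozen: entries are written once, never edited
class AuditTrailEntry:
    subject_id: str
    form: str
    item: str
    old_value: str
    new_value: str
    changed_by: str                   # authenticated individual user, not a shared login
    reason_for_change: str
    changed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

entry = AuditTrailEntry(
    subject_id="1001", form="VITALS", item="SYSBP",
    old_value="128", new_value="138",
    changed_by="jsmith", reason_for_change="DCF-confirmed correction from site",
)
```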
4 PROFESSIONAL SOCIETIES
The Society for Clinical Trials, founded in 1978 (www.sctweb.org), has always included professionals involved with clinical data management. Their annual meetings include numerous workshops and plenary sessions on various topics related to clinical data management. A supplement to the Society’s journal, Controlled Clinical Trials, published in
1995 (6) discusses the fundamentals of clinical data management. Similarly, the Drug Information Association (www.dia.org) supports professional interchange on clinical data management. Their annual meeting includes a specific track on data management, and they sponsor other meetings throughout the year with a particular focus on clinical data management. And the Society for Clinical Data Management (www.scdm.org) was founded to specifically ‘‘advance the discipline of clinical data management.’’ In addition to meetings, the SCDM offers a complete guide to ‘‘Good Clinical Data Management Practices’’ and a certification program for clinical data managers.
5 LOOK TO THE FUTURE
5.1 Standardization In recent years there has been a move toward standardizing the way that data are collected and exchanged. The Clinical Data Interchange Standards Consortium (CDISC, www. CDISC.org) was formed as a working group from the Drug Information Association. This now independent organization has developed several standards for data models that are becoming widely adopted, not just in the United States, but worldwide. The National Cancer Institute began an initiative to standardize the way that clinical data are captured with their Cancer Data Standards Repository (CaDSR) project (http://ncicb.nci.nih. gov/NCICB/infrastructure/cacore overview/ cadsr). This repository contains standardized data element definitions for a wide variety of domains involving both cancer studies and clinical studies in other areas, such as dentistry and cardiology. 5.2 Electronic Data Capture and Electronic Health Records As access to the Web continues to increase and as our society continues to move toward electronic rather than paper records, the proportion of studies conducted using electronic systems such as Web eDC systems is bound to increase. Whether eDC completely replaces paper CRFs is still a topic for debate. ‘‘Going
electronic’’ will require changes in the workflow model to take advantage of the efficiencies that an eDC system can offer. As long as the clinic staff are still recording clinical information on paper (either as clinic notes, source documentation, or paper worksheets), they are not realizing the full advantages of the eDC system since they will need to transcribe the information from paper, rather than recording it once electronically. The next level of efficiency to be gained will come from linking research needs with electronic health records (EHRs). The proportion of in-patient and out-patient facilities using electronic health records is increasing rapidly and has gained acceptance in Europe more than in the United States. A few groups have explored extracting data directly from the electronic health record to an electronic case report form. CDISC partnered with the Duke Clinical Research Institute in 2003 to demonstrate integration of electronic health records with clinical trial data (Starbrite project presented at DIA, www.cdisc.org/pdf/ss DIAJune2004lb.ppt). The objectives for EHR systems differ substantially from the objectives for eDC systems (7). Health records might contain data as discrete data points (e.g., laboratory values or blood pressure readings), or they may contain less-structured data such as progress notes or descriptions of symptoms. The purpose of the EHR is to store and convey to the health-care providers as much information as possible to assist in the diagnosis and treatment of a patient presenting at that health-care facility. The purpose of an eDC system is to collect and organize the information needed for a clinical trial, a much more focused need. The market for EHR systems is huge when compared with the market for eDC systems. Vendors of EHR systems will be driven by their market, in particular by large hospital systems. The number of EHR vendors is expected to be large, and thus, the formats for collecting data will be widely varied. The challenges are large, but the potential benefits are also large in terms of the savings in time and the decrease in transcription errors. Groups such as CDISC and PhRMA are collaborating on addressing issues related to
integrating clinical research with electronic health records (8, 9).
6 CONCLUSION
The proliferation of the use of computers and the rapid penetration of the Internet as a means of secure exchange of data has had a profound impact on clinical data management. Where it might have taken months to collect, clean, and analyze data for a clinical trial 30 years ago, we now expect complete, relatively clean data almost as soon as a patient visit has been completed. Changes in technology have had a profound impact on the way that a trial is managed and in the role for data management staff (10). As technology has and continues to advance, the nature of clinical data management has changed, but the objective remains the same: to provide complete, accurate, and attributable data to support clinical studies, in particular, clinical trials. The articles that follow will describe various aspects of clinical data management and various approaches that have been taken to collect, organize, clean, and report clinical data. REFERENCES 1. C. L. Meinert, E. C. Heinz, and S. A. Forman, ‘‘The Coronary Drug Project: Role and Methods of the Coordinating Center’’ and other articles in this supplement. Control Clin Trials. 1983; 4(4): 355–375. 2. L. D. Fisher, M. J. Gillespie, M. Jones, and R. McBride, Design of clinical database management systems and associated software to facilitate medical statistical research. CRC Critical Reviews in Medical Informatics. Vol. I-4. 3. G. D. Anderson, E. Cohen, W. Gazdzik, and B. Robinson, ‘‘Scientific information retrieval system: A new approach to research data management’’, Proceedings of the 5th Annual ACM SIGUCCS Conference on User Services, 1977: 209–212. 4. K. A. Getz, The imperative to support site adoption of EDC. Appl. Clin. Trials. (Jan. 2006). Available: http://appliedclinical trialsonline.findpharma.com/appliedclinical trials/article/articleDetail.jsp?id=283027.
5. http://www.fda.gov/cder/guidance/7359fnl. pdf 6. R. McBride and S. W. Singer (eds.), Data management for multicenter studies: Methods and guidelines. Controlled Clin Trials. 1995; 16(2, suppl): 1–179. 7. P. Bleicher, Integrating EHR with EDC: When two worlds collide. Appl. Clin. Trials. (Mar. 2006). Available: http://appliedclinical trialsonline.findpharma.com/appliedclinical trials/IT/Integrating-EHR-with-EDC-WhenTwo-Worlds-Collide/ArticleStandard/Article/ detail/310798. 8. eClinical Forum and PhRMA, The Future Vision of Electronic Health Records as eSource for Clinical Research. Discussion document published Sept. 2006. Available: http://www.cdisc.org/publications/index. html. 9. CDISC, CDISC Standards and Electronic Source Data in Clinical Trials. Discussion document published Nov. 2006. Available: http://www.cdisc.org/publications/index. html. 10. F. Falk, Impact of eDC on clinical staff roles. Appl. Clin. Trials. (June 2007). Available: http://appliedclinicaltrialsonline.findpharma. com/appliedclinicaltrials/Feature+Article/ Impact-of-EDC-on-Clinical-Staff-Roles/ ArticleStandard/Article/detail/431922? contextCategoryId=35507.
CLINICAL HOLD DECISION A clinical hold is the mechanism that the U.S. Food and Drug Administration’s Center for Drug Evaluation and Research (CDER) uses when it does not believe, or cannot confirm, that the study can be conducted without unreasonable risk to the subjects/patients. If this occurs, the Center will contact the sponsor within the 30-day initial review period to stop the clinical trial. The CDER may either delay the start of an early-phase trial on the basis of information submitted in the Investigational New Drug Application (IND), or stop an ongoing study based on a review of newly submitted clinical protocols, safety reports, protocol amendments, or other information. When a clinical hold is issued, a sponsor must address the issue that is the basis of the hold before the order is removed. The CDER’s authority concerning clinical holds is outlined in federal regulations. The regulations specify the clinical hold criteria that CDER applies to various phases of clinical testing. In addition, all clinical holds are reviewed by upper management of CDER to ensure consistency and scientific quality in the Center’s clinical hold decisions.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/handbook/clinhold.htm) by Ralph D’Agostino and Sarah Karl.
CLINICAL SIGNIFICANCE
CELIA C. KAMATH and JEFFREY A. SLOAN
Health Sciences Research, Mayo Clinic, Rochester, Minnesota
JOSEPH C. CAPPELLERI
Pfizer, Inc., Global Research & Development, Groton, Connecticut
1 INTRODUCTION
The field of patient-reported outcomes, particularly health-related quality of life (QOL), has burgeoned in the last few years (1, 2). The importance assigned to the study of these outcomes has been attributed to the aging of the population, and consequently the higher prevalence of chronic diseases, along with the reality that medical treatment often fails to cure the disease but may affect QOL (3). Health-related quality of life has gained attention in research and clinical trial settings (3, 4). The increasing importance assigned by patients and clinicians to QOL in medical decision-making has resulted in greater attention paid to the interpretation of QOL scores, particularly as it relates to clinical significance (5–7). Clinical significance relates to the clinical meaningfulness of inter-subject or intra-subject changes in QOL scores. Clinical significance has been difficult to determine, in part because of the development of a myriad of QOL instruments over the past decade (8, 9). Some of these developments have had little or no psychometric (1, 2, 6, 10, 11) or clinical validation (9, 12). Moreover, relative to traditional clinical endpoints like survival and systolic blood pressure, QOL as a clinical endpoint is relatively unfamiliar, especially in regard to the interpretation and relevance of changes in QOL scores (13). Why is clinical significance of QOL scores important? It aids in the design of studies by helping to determine sample size calculations. Evidence of clinical significance may be used by regulatory agencies for drug approval, by clinicians to decide between treatment alternatives, by patients to make informed decisions about treatment, by the health-care industry for formulary and reimbursement decisions, and by health-care policy makers to make policy decisions regarding resource allotment. Early evidence of the clinical implications of QOL is seen in the links between survival and QOL components such as patients' fatigue levels, social support, and group counseling (14–17). Even a simple, single-item measure of patient global QOL can be related to patient survival (18). Changes in QOL scores can also be linked to positive economic (19, 20) and social (21) outcomes.
2 HISTORICAL BACKGROUND
Statistical significance as measured by a Pvalue is influenced by sample size and data variability. Although statistical significance can be considered a prerequisite for clinical significance, only clinical significance assigns meaning to the magnitude of effect observed in any study. Historically, Cohen (22) was responsible for proposing one of the earliest criteria for identifying important change, which can be construed as clinically significant. He suggested that a small ‘‘effect size’’ (defined later in the article) was 0.2 standard deviation units, a medium ‘‘effect size’’ was 0.5, and a large ‘‘effect size’’ was 0.8. Although his intention was to provide guidance for sample size calculations in the social and behavioral science, Cohen’s benchmarks have extended to health-care research to decide whether a change in QOL scores is important. Current research suggests that a moderate effect size of one-half a standard deviation unit (effect size = 0.5) is typically important (23). A more recent and popular definition of clinical significance uses an anchor-based approach based on an external standard that is interpretable and appreciably correlated to the target QOL measure in order to elucidate the meaning of change on the target QOL measure.
Embedded under the rubric of clinical significance is the minimum important difference, a lower bound on clinical significance. One definition of a minimum important difference (MID) is ‘‘the smallest difference in score in the domain of interest which patients perceive as beneficial and which would mandate, in the absence of troublesome side-effects and excessive cost, a change in the patient’s management’’ (24). Some researchers prefer to use the term ‘‘minimally detectable difference’’ (25, 26); other definitions have sprouted (e.g., the ERES method) (27, 28). No single solution to the challenging topic of assessing clinical significance exists. Nevertheless, a series of proposals has engendered understanding and appreciation of the topic. Special issues in Statistics in Medicine (1999, Vol. 18) and the Journal of Consulting and Clinical Psychology (1999, Vol. 67) have been dedicated to the topic of clinical significance of QOL and other clinical measures. Proceedings from a meeting of an international group of about 30 QOL experts were published recently in a special issue of the Mayo Clinic Proceedings (2002, Vol. 77) (29–35), which provides practical guidance regarding the clinical significance of QOL measures.
3 ARTICLE OUTLINE
This article draws largely from recent scientific literature, including the Mayo Clinic Proceedings (29–35) and other sources (4), to provide an overview on the clinical significance of QOL measures. The following section on Design and Methodology covers the different perspectives and existing methods to determine clinical significance. The next section on Examples illustrates trials in which researchers attempted to define the concept on specific QOL measures. Then, the section on Recent Developments highlights new methods to determine clinical significance. Finally, the section on Concluding Remarks discusses some future directions for research.
4 DESIGN AND METHODOLOGY 4.1 Perspectives for Determining and Interpreting Clinical Significance Clinical significance involves assigning meaning to study results. The process of establishing such meaning can be conceptualized in two steps: (1) understanding what changes in score mean to the concerned stakeholder (e.g., patient, clinician, clinical researcher, policy maker) and (2) making results of clinical studies interpretable and comprehensible to such stakeholders or decision makers (30, 36). The term ‘‘clinical’’ in relation to significance has different meanings and implications for different stakeholders such as patients, clinicians, and society. From the patient’s perspective, clinical significance can be defined as the change in QOL scores that patients perceive as beneficial (or detrimental) and important and prompts them to seek health care or request changes in their treatment (33), or that induces patients to determine that the intervention has been successful (24). From the clinician’s perspective, it can be defined as the diagnosis of the clinician as to the amount of change in QOL scores that would mandate some form of clinical intervention (37). From the societal or population perspective, clinical significance is based on the values of the group surveyed, in which importance is defined by the outcomes that are deemed worthy of society’s resources. Any or all of these perspectives for defining clinical significance may be applicable, but they are not always in agreement (4). An equally important issue is the different perspectives for interpreting clinical meaningfulness of changes in reported QOL (35). For example, a clinician may use QOL data to explain the treatment alternatives to a patient, whereas a health-policy maker may describe to elected officials the financial impact on a patient population whose QOL has changed. Similarly, a regulatory agency and pharmaceutical company may ascertain the appropriate level of evidence for a successful research study (35). Thus, QOL results must be framed, analyzed, and presented in a way that is meaningful to the pertinent audience and its respective needs.
Only then will the concept be meaningful and gain greater acceptance and use over time.
4.2 Methods to Explain the Clinical Significance of Health Status Measures
Two common approaches used to establish the interpretability of QOL measures are termed anchor-based and distribution-based. The characteristics of each approach are described below. Several examples will be given later in the section on Examples. Interested readers are encouraged to read Crosby et al. (4) and Guyatt et al. (30) for an expanded discussion of the concepts presented here.
Anchor-based approaches, used to determine clinically meaningful change via cross-sectional or longitudinal methods, involve comparing measures of QOL to measures with clinical relevance (4). Cross-sectional methods include several forms: (1) comparing groups that are different in terms of some disease-related criterion (38, 39); (2) linking QOL to some external benchmarking criteria (40–42); (3) eliciting preference-based ratings on a pair-wise basis, where one person's rated state serves as an anchor to evaluate the other person's ratings (43); and (4) using normative information from dysfunctional and functional populations (6). Longitudinal methods involve the comparison of changes in QOL scores across time with the use of (1) global ratings of change as ''heuristics'' to interpret changes in QOL scores (5, 24, 38, 44); (2) significant future medical events for establishing difference thresholds (45); and (3) comparisons of changes in HRQOL to other disease-related measures of outcome across time (46). Anchor-based methods are cataloged in Table 1 (4–6, 24, 38–41, 43–45, 47, 48).
Anchor-based methods require two properties (30): (1) anchors must be interpretable, otherwise they will hold no meaning to clinicians or patients, and (2) anchors must share appreciable correlation with the targeted QOL measure. The biggest advantage of anchor-based approaches is the link with a meaningful external anchor (4), akin to establishing the construct validity of the measure (49). Potential problems, however, exist with this approach. These problems include
recall biases (50), low or unknown reliability and validity of the anchor measure (51), low correlation between the anchor and the actual QOL change score (52–55), and complex relationships between anchors and QOL scores (56). Hays and Wooley (57) recommend caution in the indiscriminate dependence on and use of a single minimum important difference (MID) measure. They also list several problems in estimating MIDs: the estimated magnitude could vary depending on the distributional index (57, 58), the external anchor (59), the direction of change (improvement vs. decline) (60), and the baseline value (61). In general, longitudinal methods are preferable because of their direct link with change (4).
Distribution-based approaches for determining the importance of change are based on the statistical characteristics of the obtained sample, namely average scores and some measure of variability in results. They are categorized as (1) those that are based on statistical significance using P-values (i.e., given no real change, the probability of observing this change or a more extreme change), which include the paired t-statistic (62) and growth curve analysis (63); (2) those that are based on sample variation (i.e., those that evaluate mean change in relation to average variation around a mean value), which include the effect size (22, 64), the standardized response mean (SRM) (44), and the responsiveness statistic (65); and (3) those that are based on the measurement precision of the instrument (i.e., those that evaluate change in relation to variation in the instrument as opposed to variation of the sample), which include the standard error of measurement (SEM) (7) and the reliable change index (RC) (6). Distribution-based methods are cataloged in Table 2 (4, 6, 7, 22, 44, 62–65).
An advantage of the distribution-based methods is that they provide a way of establishing change beyond random variation and statistical significance. The effect size version of the distribution-based methods is useful to interpret differences at the group level and has benchmarks of 0.20 standard deviation units as a small effect, 0.50 as a moderate effect, and 0.80 as a large effect (22, 64, 66). The measures that seem most promising
Table 1. Anchor-Based Methods of Determining Change

Cross-sectional methods:
Comparison to disease-related criteria (References 39 and 47). HRQOL evaluated in relation to: disease severity or diagnosis. Advantages: can be standardized; easy to obtain. Disadvantages: may not reflect change; groups may differ in other key variables.
Comparison to nondisease-related criteria (References 40 and 41). HRQOL evaluated in relation to: impact of life events. Advantages: easy to obtain; provides external basis for interpretation. Disadvantages: may not reflect change; groups may differ on other key variables; relationship to HRQOL not clear.
Preference rating (Reference 43). HRQOL evaluated in relation to: pairwise comparisons of health status. Advantages: all health states are compared. Disadvantages: may not reflect change; hypothetical, artificial; time consuming.
Comparison to known populations (Reference 6). HRQOL evaluated in relation to: functional or dysfunctional populations. Advantages: uses normative information. Disadvantages: normative information not always available; amount of change needed not specified; does not consider measurement precision.

Longitudinal methods:
Global ratings of change (References 5, 24, 38, 44). HRQOL evaluated in relation to: patients' or clinicians' ratings of improvement. Advantages: easy to obtain; best measure from individual perspective; can take into account a variety of information. Disadvantages: unknown reliability; influenced by specific rating scale and anchors; does not consider measurement precision.
Prognosis of future events (Reference 45). HRQOL evaluated in relation to: those experiencing and not experiencing some future event. Advantages: prospective; provides evidence of predictive validity. Disadvantages: difficult to obtain.
Changes in disease-related outcome (Reference 48). HRQOL evaluated in relation to: changes in clinical outcome. Advantages: tied to objective outcome measure; known psychometric properties. Disadvantages: does not consider measurement precision; assumes strong HRQOL-outcome correlation.

Reprinted with permission from Crosby et al. (4).
for the purpose of establishing clinical significance at the individual patient level are the SEM and the RC. These measures are based on the measurement precision of the instrument and incorporate the reliability of the instrument (e.g., Cronbach's alpha or test-retest reliability) and the standard deviation of scores. In principle, the SEM and RC are sample-invariant. Researchers have favored Cronbach's alpha over test-retest reliability to calculate reliability for the SEM (7, 30, 67). Distribution-based methods are particularly helpful when used together with meaningful anchors, which enhances the validity, and hence the meaning, of the QOL measure. It is encouraging that anchor-based measures appear to coincide with distribution-based methods. Researchers have found a correspondence between the SEM and anchor-based determinations of a minimum important difference across different diseases (7, 23, 67, 68). The 1 SEM benchmark corresponds with an effect size of approximately 0.5. The standard error of measurement is moderated by the reliability of the measure, where measures with higher reliability are ''rewarded'' by lowering the effect size needed to achieve a minimally important difference. A rationale for the SEM as a measure of the MID is provided by Norman et al. (23), who assert that Miller's theory (69) of the limits of human discernment is linked to the threshold of 0.5 standard deviation units.
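To make the arithmetic behind these individual-level indices concrete, the sketch below computes the SEM and the reliable change (RC) index in Python. It is illustrative only: the 0–100 scale, baseline standard deviation, reliability, and patient scores are hypothetical assumptions, not values from the studies cited above.

```python
# Minimal sketch (hypothetical numbers): standard error of measurement (SEM)
# and the Jacobson-Truax reliable change (RC) index.
#   SEM = SD * sqrt(1 - reliability)
#   RC  = (x1 - x0) / sqrt(2 * SEM^2)
import math

def sem(sd_baseline: float, reliability: float) -> float:
    """SEM from the baseline SD and a reliability coefficient
    (e.g., Cronbach's alpha or test-retest reliability)."""
    return sd_baseline * math.sqrt(1.0 - reliability)

def reliable_change(x0: float, x1: float, sd_baseline: float, reliability: float) -> float:
    """RC index for one patient's pre-treatment (x0) and post-treatment (x1) scores."""
    return (x1 - x0) / math.sqrt(2.0 * sem(sd_baseline, reliability) ** 2)

# Hypothetical 0-100 QOL scale: baseline SD = 15, Cronbach's alpha = 0.90.
one_sem = sem(15.0, 0.90)                       # about 4.7 points
rc = reliable_change(60.0, 68.0, 15.0, 0.90)    # about 1.2
print(round(one_sem, 2), round(rc, 2))
# |RC| > 1.96 is the usual cutoff for change beyond measurement error.
```

Note that SEM/SD equals the square root of (1 − reliability), so with reliability near 0.75 one SEM is roughly 0.5 SD, which is the correspondence with the half-standard-deviation benchmark noted above; higher reliability lowers the change, in SD units, needed to exceed 1 SEM.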
5 EXAMPLES
This section provides examples of studies used to determine clinical significance and presents general advice for defining and interpreting clinical significance in clinical studies. Table 3 (5, 7, 24, 64, 67, 70–72) includes several examples of the use of both anchor-based methods and distribution-based methods to establish clinical significance across a wide range of QOL measures. These examples span several disease groups, instruments, and methods for determining clinical significance. Readers are encouraged to review the cited papers for further details on these studies. The authors begin with a classic paper by Jaeschke et al. (24), one of the first papers
on clinically meaningful differences determined through the anchor-based approach. The magnitude of difference considered minimally significant was an average of 0.5 per item on a 7-point scale, which was confirmed by Juniper et al. (5) on the Asthma Quality of Life Questionnaire (AQLQ). A third study, by Kazis et al. (64), examined the difference between statistical significance and clinical significance. Using several pain studies and the Pain Intensity Numerical Rating Scale (PI-NRS), Farrar et al. (70) found a reduction of 2 points or 30% on the 11-point pain scale to be clinically significant. Using the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) scale for osteoarthritis, Angst et al. (71) compared estimates derived from anchor-based and distribution-based approaches; they determined sample sizes for specified changes signifying worsening and, separately, improvement. Using the Chronic Heart Failure Questionnaire (CHQ), Wyrwich et al. (7) also compared anchor- and distribution-based approaches in determining that 1 SEM equals the MID of 0.5 per item determined by the anchor-based approach. Finally, using the Functional Assessment of Cancer Therapy-Lung (FACT-L) questionnaire, Cella et al. (72) showed the convergence of three different complementary approaches on clinical significance. Taken from Sprangers et al. (34), Table 4 provides a useful checklist of questions to help in the interpretation of longitudinal, patient-derived QOL results presented in clinical trials and the clinical literature. These questions are based on the premise that detecting meaningful change depends on the adequacy of the research design, measurement quality, and data analysis.
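As an illustration of how such an anchor-based threshold can be applied to trial data, the sketch below classifies patients as responders when they achieve at least a 2-point or 30% reduction on an 11-point pain scale, the criterion reported by Farrar et al. (70). The data and the helper function are invented for illustration; they are not taken from the cited studies.

```python
# Minimal sketch (hypothetical data): responder classification against an
# anchor-based minimum important difference -- a reduction of >= 2 points
# or >= 30% from baseline on an 11-point (0-10) pain scale.
baseline = [8, 6, 7, 5, 9, 4]   # invented baseline PI-NRS scores
followup = [5, 5, 3, 4, 8, 4]   # invented follow-up scores

def is_responder(pre: float, post: float) -> bool:
    """True if the change meets the 2-point or 30% reduction criterion."""
    reduction = pre - post
    return reduction >= 2 or (pre > 0 and reduction / pre >= 0.30)

responders = [is_responder(b, f) for b, f in zip(baseline, followup)]
print(sum(responders), "of", len(responders), "patients met the criterion")
```

Reporting the proportion of patients in each arm who meet such a criterion is one way to translate a group-level result into the patient-level statements of clinical benefit discussed in the concluding remarks.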
6 RECENT DEVELOPMENTS

6.1 The 1/2 Standard Deviation Rule
It would be desirable to simply define, at least initially, what a clinically significant result is likely to be. Emerging research comparing anchor-based and distribution-based estimates provides an evolving standard as to what to use as an initial estimate (23). The anchor-based estimates averaging 0.5 per
item on a 7-point scale appear to converge with an estimate of 1/2 standard deviation (SD) units. This latter estimate is derived through distribution-based methods such as the effect size approach (22, 64), the SEM (7, 67, 68), the Jacobson reliable change index (6), and the standardized response mean (73). Potential moderating factors that could shift these estimates upward or downward are the method used to determine minimum difference estimates and whether patients were suffering from acute or chronic conditions (23, 74).

6.2 Empirical Rule Effect Size

Sloan et al. (27, 28) have taken this concept one step further in the form of the Empirical Rule Effect Size (ERES) by combining Cohen's effect size categorization (22) with the empirical rule from statistical theory (75). The ERES is based on Tchebyschev's Theorem and states that the distribution of any QOL tool is contained within 6 SDs of the observed values. The ERES entails the estimation of QOL change scores in terms of SD estimates, expressed as units on the theoretical range of a QOL instrument. Thus, small, moderate, and large effect sizes for comparing QOL treatment groups turn out to be 3%, 8%, and 13%, respectively, of the theoretical range of any QOL tool (a numerical sketch of these benchmarks appears at the end of Section 6). This simple and intuitive rule for identifying the magnitude of clinically significant changes is likely to be easy for clinical researchers to comprehend. The rule can facilitate the design of clinical trials in terms of sample size calculations and interim monitoring. The ERES framework for a priori establishment of effect sizes is sample-independent and thus an improvement over sample-dependent methods (5, 21, 76). However, the simplicity of the ERES method gives rise to some challenges and questions. The theoretical range of an instrument is rarely observed in its entirety, so the theoretical range may need to be modified to more practical limits before taking the ERES estimate of 1 SD as 16.7% (i.e., one sixth of the distribution of observed values) of the range. Similarly, truncated distributions, where the patient population is homogeneously ill or uniformly healthy, can be accommodated by incorporating this knowledge into the definition of the appropriate range. These guidelines for clinical treatments can be used in the absence of other information but will need modification in their application to idiosyncratic or unique clinical settings. More research is needed to examine the generalizability of such benchmarks across baseline patient health, severity of illness, and disease groups.

6.3 Group Change vs. Individual Change

Distinctions should be made in determining the significance of change at the group versus the individual level. Not every individual in a group experiences the same change in outcomes (group-level outcomes are assigned a mean change value), and individual responses show higher variability than group responses. Depending on the distribution of individual differences, the same group mean can have different implications for an individual (77). The traversing of group- and individual-level QOL data entails procedures for moving from one level to the other, involving two distinct scientific traditions: deductive and inductive (31). A deductive approach is employed when one addresses the extent to which group data can be used to estimate clinical significance at the individual level. An inductive approach is used when one evaluates the extent to which individual change data can be brought to the group level to define clinical significance. Readers are advised to read Cella et al. (31) for a more detailed account.

6.4 Quality of Life as a ''Soft'' Endpoint

The ''softness'' of QOL as an endpoint, relative to, say, survival and tumor response, is cited as a particular barrier to implementation and interpretation of results (13). However, methodological and conceptual strides made in defining and measuring QOL, and the growing familiarity with the interpretation and potential utility of QOL data, make those concerns increasingly outdated. Psychometric advances have been made in QOL assessment tools across disease areas (8, 78–81). Funding opportunities to study QOL endpoints have allowed for study designs
that are large enough to have power to detect meaningful differences (13). Moreover, accumulated experience with analyzing QOL endpoints has resulted in the recognition that their statistical challenges are no different from those of ''hard'' endpoints.
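As noted in Section 6.2, the following minimal sketch applies the ERES benchmarks and the half-standard-deviation rule to a hypothetical instrument. The 0–100 range is an assumption chosen purely for illustration.

```python
# Illustrative arithmetic (hypothetical instrument): the ERES expresses Cohen's
# small/moderate/large effects (0.2, 0.5, 0.8 SD) as roughly 3%, 8%, and 13%
# of an instrument's theoretical range, taking 1 SD as 1/6 of that range.
range_points = 100.0             # hypothetical 0-100 QOL scale
sd_eres = range_points / 6.0     # about 16.7 points under the empirical rule
thresholds = {label: round(effect * sd_eres, 1)
              for label, effect in (("small", 0.2), ("moderate", 0.5), ("large", 0.8))}
print(thresholds)   # {'small': 3.3, 'moderate': 8.3, 'large': 13.3} points
```

On this hypothetical scale, the half-SD rule gives 0.5 × 16.7, or roughly 8 points, which coincides with the moderate ERES benchmark.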
7 CONCLUDING REMARKS
Several suggestions on clinical significance are offered. First, the application of multiple strategies for determining clinical significance is recommended. Doing so would enable better interpretability and validity of clinically significant change, would add to existing evidence of the magnitude of change that constitutes clinical significance, and would provide indicators of distributional parameters that create convergence or divergence in estimation of clinical significance. For example, Kolotkin et al. (46) found convergence between anchor-based and distribution-based methods at moderate levels of impairment but wide disparities at mild and severe levels of impairment. Second, more research is needed to identify the effect that psychometric properties (i.e., the reliability, validity, and responsiveness of QOL instruments) have in quantifying clinically meaningful change (4, 62, 82). Similarly, research into the psychometric properties of global rating and health transition scales used in anchor-based methods is also needed. Global ratings tend to be single-item measures and may therefore fall short in terms of explaining complex QOL constructs. Anchor assessments also tend to be positively correlated with post-treatment states but show near-zero correlation with pre-treatment states, suggesting recall bias (83) or response shift (84). More research is needed to address the cognitive process used by patients to retrospectively assess changes in health over time (30). Third, extreme baseline severity gives rise to regression to the mean (RTM), an error-based artifact describing the statistical tendency of extreme scores to become less extreme at follow-up. Failure to take this tendency into account may lead to false conclusions that patients with severe impairments at baseline have shown clinically significant change
when, in fact, it was just RTM. RTM also has a greater impact on data when the measure is less reliable (4, 85). More research is also needed into the effect of baseline QOL impairment on magnitude of clinically meaningful change (4, 48, 66, 86, 87). Similar research is needed in terms of the generalizability of the standardized benchmarks for determining clinically meaningful change, especially for distribution-based methods (4, 66). Specifically, how satisfactory are the evolving benchmarks (effect sizes of 0.2, 0.5, and 0.8 for small, moderate, and large change, respectively) across different dimensions of QOL (e.g., mental versus physical), different disease groups (e.g., arthritis versus cancer), respondents (e.g., patients versus clinicians), measures (e.g., generic versus disease-specific), patient populations (e.g., older versus younger), or patient conditions (e.g., improving versus deteriorating)? Finally, care must be taken in presenting results of studies in a way that is familiar to the user of the information. For example, translating clinical significance into a number needed to treat (NNT) and a proportion of patients achieving various degrees of clinical benefit relative to the control may provide a desirable way to present study results (30).

REFERENCES

1. N. Aaronson, Methodologic issues in assessing the quality of life of cancer patients. Cancer 1991; 67(3 Suppl): 844–850. 2. D. Cella and A. E. Bonomi, Measuring quality of life. Oncology 1995; 9(11 Suppl): 47–60. 3. R. Berzon, Understanding and Using Health-Related Quality of Life Instruments within Clinical Research Studies. Quality of Life Assessment in Clinical Trials: Methods and Practice. Oxford, UK: 2000, pp. 3–15. 4. R. D. Crosby, R. L. Kolotkin, and G. R. Williams, Defining clinically meaningful change in health-related quality of life. J. Clin. Epidemiol. 2003; 56: 397–407. 5. E. F. Juniper, G. H. Guyatt, and A. Willan, Determining a minimal important change in a disease-specific quality of life questionnaire. J. Clin. Epidemiol. 1994; 47: 81–87. 6. N. S. Jacobson and P. Truax, Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. J. Consult. Clin. Psychol. 1991; 59: 12–19.
7. K. W. Wyrwich, W. M. Tierney, and F. D. Wolinsky, Further evidence supporting an SEM-based criterion for identifying meaningful intra-individual changes in health-related quality of life. J. Clin. Epidemiol. 1999; 52: 861–873. 8. D. Cella, Quality of life outcomes: measurement and validation. Oncology 1996; 10(11 Suppl): 233–246. 9. J. A. Sloan, J. R. O'Fallon, and V. J. Suman, Incorporating quality of life measurements in oncology clinical trials. Proceedings of the Biometrics Section of the American Statistical Association, 1998: 282–287.
10. B. Spilker, Quality of Life and Pharmacoeconomics in Clinical Trials. New York: Lippincott Raven, 1996. 11. D. Osoba, What has been learned from measuring health-related quality of life in clinical oncology. Eur. J. Cancer 1999; 35(11): 1565–1570. 12. J. A. Sloan and T. Symonds, Health-related quality of life measurement in clinical trials: when does a statistically significant change become relevant? unpublished manuscript, 2003. 13. M. H. Frost and J. A. Sloan, Quality of life measures: a soft outcome - or is it? Amer. J. Managed Care 2002; 8(18 Suppl): S574–S579. 14. L. Degner and J. A. Sloan, Symptom distress in newly diagnosed ambulatory cancer patients as a predictor of survival in lung cancer. J. Pain Symptom Manag. 1995; 10(6): 423–431. 15. H. M. Chochinov and L. Kristjanson, Dying to pay: the cost of end-of-life care. J. Palliat. Care 1998; 14(4): 5–15. 16. R. A. Silliman, K. A. Dukes, and L. M. Sullivan, Breast cancer care in older women: sources of information, social support, and emotional health outcomes. Cancer 1998; 83(4): 706–711. 17. D. Spiegel, J. R. Bloom, and H. Kraemer, Psychological support for cancer patients. Lancet 1989; 2(8677): 1447. 18. J. A. Sloan, C. L. Loprinzi, and S. A. Kuross, Randomized comparison of four tools measuring overall quality of life in patients with advanced cancer. J. Clin. Oncol. 1998; 16: 3662–3673. 19. D. L. Patrick and P. Erickson, Applications of health status assessment to health policy. In: B. Spilker (ed.), Quality of Life and Pharmacoeconomics in Clinical Trials. New York: Lippincott Raven, 1996, pp. 717–727.
20. M. R. Gold, D. L. Patrick, and G. W. Torrance, Identifying and valuing outcomes. In: M. Gold et al. (eds.), Cost Effectiveness in Health and Medicine. New York: Oxford University Press, 1996, pp. 82–134. 21. E. F. Juniper, The value and quality of life in asthma. Eur. Resp. J. 1997; 7: 333–337. 22. J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates, 1988. 23. G. R. Norman, J. A. Sloan, and K. W. Wyrwich, Interpretation of changes in health-related quality of life: the remarkable universality of half a standard deviation. Med. Care 2003; 41(5): 582–592. 24. R. Jaeschke, J. Singer, and G. H. Guyatt, Measurement of health status. Ascertaining the minimal clinically important difference. Control Clin. Trials 1989; 10(4): 407–415. 25. P. Jones, Interpreting thresholds for a clinically significant change in health status (quality of life) with treatment for asthma and COPD. Eur. Resp. J. 2002; 19: 398–404. 26. J. G. Wright, The minimally important difference: who’s to say what is important? J. Clin. Epidemiol. 1996; 49: 1221–1222. 27. J. A. Sloan et al., Detecting worms, ducks, and elephants: a simple approach for defining clinically relevant effects in quality of life measures. J. Cancer Integrative Med. 2003; 1(1): 41–47. 28. J. A. Sloan, T. Symonds, D. Vargas-Chanes, and B. Fridley, Practical guidelines for assessing the clinical significance of health-related QOL changes within clinical trials. Drug Inf. J. 2003; 37: 23–31. 29. J. A. Sloan et al., Assessing clinical significance in measuring oncology patient quality of life: introduction to the symposium, content overview, and definition of terms. Mayo Clin. Proc. 2002; 77: 367–370. 30. G. H. Guyatt et al., Methods to explain the clinical significance of health status measures. Mayo Clin. Proc. 2002; 77: 371–383. 31. D. Cella et al., Group vs individual approaches to understanding the clinical significance of differences or changes in quality of life. Mayo Clin. Proc. 2002; 77: 384–392. 32. J. A. Sloan et al., Assessing the clinical significance of single items relative to summated scores. Mayo Clin. Proc. 2002; 77: 479–487. 33. M. H. Frost et al., Patient, clinician, and population perspectives on determining the clinical significance of quality-of-life scores. Mayo Clin. Proc. 2002; 77: 488–494.
34. M. A. G. Sprangers et al., Assessing meaningful change in quality of life over time: a users’ guide for clinicians. Mayo Clin. Proc. 2002; 77: 561–571.
47. R. A. Deyo et al., Physical and psychosocial function in rheumatoid arthritis: clinical use of a self-administered health status instrument. Arch. Intern. Med. 1992; 142: 879.
35. T. Symonds et al., The clinical significance of quality-of-life results: practical considerations for specific audiences. Mayo Clin. Proc. 2002; 77: 572–583.
48. R. L. Kolotkin, R. D. Crosby, and G. R. Williams, Integrating anchor-based and distribution-based methods to determine clinically meaningful change in obesity-specific quality of life. Qual. Life Res. 2002; 11: 670.
36. M. A. Testa, Interpretation of quality-of-life outcomes: issues that affect magnitude and meaning. Med. Care 2000; 38: II166–II174. 37. C. van Walraven, J. L. Mahon, D. Moher, C. Bohm, and A. Laupacis, Surveying physicians to determine the minimal important difference: implications for sample-size calculation. J. Clin. Epidemiol. 1999; 52: 717–723. 38. R. A. Deyo and T. S. Inui, Toward clinical application of health status measures: sensitivity of scales to clinically important changes. Health Serv. Res. 1984; 19: 278–289. 39. P. A. Johnson, L. Goldman, E. J. Orav et al., Comparison of the medical outcomes study short-form 36-item health survey in black patients and white patients with acute chest pain. Med. Care 1995; 33: 145–160. 40. J. E. Ware, R. H. Brook, A. Davies-Avery et al., Conceptualization and Measurement of Health for Adults in the Health Insurance Study, vol. 1. Model of Health and Methodology. Santa Monica, CA: Rand Corporation, 1979. 41. M. Testa and W. R. Lenderking, Interpreting pharmacoeconomic and quality-of-life clinical trial data for use in therapeutics. Pharmacoeconomics 1992; 2: 107. 42. M. Testa and D. C. Simonson, Assessment of quality-of-life outcomes. N. Engl. J. Med. 1996; 28: 835–840. 43. H. A. Llewellyn-Thomas, J. I. Williams, and L. Levy, Using a trade-off technique to assess patients' treatment preferences for benign prostatic hyperplasia. Med. Decis. Making 1996; 16: 262–272. 44. G. Stucki, M. H. Liang, and A. H. Fossel, Relative responsiveness of condition specific and health status measures in degenerative lumbar spinal stenosis. J. Clin. Epidemiol. 1995; 48: 1369–1378. 45. J. M. Mossey and E. Shapiro, Self-rated health: a predictor of mortality among the elderly. Amer. J. Public Health 1982; 72: 800–808. 46. R. L. Kolotkin, R. D. Crosby, and K. D. Kosloski, Development of a brief measure to assess quality of life in obesity. Obes. Res. 2001; 9: 102–111.
49. E. Lydick and R. S. Epstein, Interpretation of quality of life changes. Qual. Life Res. 1993; 2: 221–226. 50. N. Schwartz and S. Sudman, Autobiographical Memory and the Validity of Retrospective Reports. New York: Springer-Verlag, 1994. 51. K. W. Wyrwich, S. Metz, and A. N. Babu, The reliability of retrospective change assessments. Qual. Life Res. 2002; 11: 636. 52. B. Mozes, Y. Maor, and A. Shmueli, Do we know what global ratings of health-related quality of life measure? Qual. Life Res. 1999; 8: 269–273. 53. J. R. Kirwan, D. M. Chaput de Sainttonge, and C. R. B. Joyce, Clinical judgment in rheumatoid arthritis. III. British rheumatologists' judgment of 'change in response to therapy.' Ann. Rheum. Dis. 1984; 43: 686–694. 54. D. Cella, E. A. Hahn, and K. Dineen, Meaningful change in cancer-specific quality of life scores: differences between improvement and worsening. Qual. Life Res. 2002; 11: 207–221. 55. G. H. Guyatt and R. Jaeschke, Reassessing quality of life instruments in the evaluation of new drugs. Pharmacoeconomics 1997; 12: 616–626. 56. F. Lydick and B. P. Yawn, Clinical interpretation of health-related quality of life data. In: M. J. Staquet et al. (eds.), Quality of Life Assessment in Clinical Trials: Methods and Practice. Oxford: Oxford University Press, 1998, pp. 299–314. 57. R. D. Hays and J. M. Wooley, The concept of clinically meaningful difference in health-related quality-of-life research. How meaningful is it? Pharmacoeconomics 2000; 18(5): 419. 58. J. Wright and N. L. Young, A comparison of different indices of responsiveness. J. Clin. Epidemiol. 1997; 50: 239–246. 59. B. Barber, N. C. Santanello, and R. S. Epstein, Impact of the global on patient perceivable change in an asthma specific QOL questionnaire. Qual. Life Res. 1996; 5: 117–122. 60. J. Ware, K. Snow, M. Kosinski et al., SF-36 Health Survey: Manual and Interpretation
Guide. Boston, MA: The Health Institute, 1993. 61. D. W. Baker, R. D. Hays, and R. H. Brook, Understanding changes in health status: is the floor phenomenon merely the last step of the staircase? Med. Care 1997; 35: 1–15. 62. J. A. Husted, R. J. Cook, and V. T. Farewell, Methods for assessing responsiveness: a critical review and recommendations. J. Clin. Epidemiol. 2000; 53: 459–468. 63. D. C. Speer and P. D. Greenbaum, Five methods for computing significant individual client change and improvement rates: support for an individual growth curve approach. J. Consult. Clin. Psychol. 1995; 63: 1044–1048. 64. L. Kazis, J. J. Anderson, and R. S. Meenan, Effect sizes for interpreting changes in health status. Med. Care 1989; 27(Suppl 3): S178–S189. 65. G. H. Guyatt, C. Bombardier, and P. X. Tugwell, Measuring disease-specific quality of life in clinical trials. CMAJ 1986; 134: 889–895. 66. G. Samsa, D. Edelman, and M. L. Rothman, Determining clinically important differences in health status measures: a general approach with illustration to the Health Utilities Index Mark II. Pharmacoeconomics 1999; 15: 41–55. 67. K. W. Wyrwich, N. A. Nienaber, and W. M. Tierney, Linking clinical relevance and statistical significance in evaluating intra-individual changes in health-related quality of life. Med. Care 1999; 37: 469–478. 68. K. W. Wyrwich, W. M. Tierney, and F. D. Wolinsky, Using the standard error of measurement to identify important changes on the Asthma Quality of Life Questionnaire. Qual. Life Res. 2002; 11: 1–7. 69. G. A. Miller, The magic number seven plus or minus two: some limits on our capacity for processing information. Psychol. Rev. 1956; 63: 81–97. 70. J. T. Farrar et al., Clinical importance of changes in chronic pain intensity measured on an 11-point numerical pain rating scale. Pain 2001; 94: 149–158. 71. F. Angst, A. Aeschlimann, and G. Stucki, Smallest detectable and minimal clinically important differences of rehabilitation intervention with their implication for required sample sizes using WOMAC and SF-36 quality of life measurement instruments in patients with osteoarthritis of the lower extremities. Arthrit. Care Res. 2001; 45: 384–391. 72. D. Cella et al., What is a clinically meaningful change on the Functional Assessment of Cancer Therapy-Lung (FACT-L) Questionnaire?
Results from Eastern Cooperative Oncology Group (ECOG) Study 5592. J. Clin. Epidemiol. 2002; 55: 285–295. 73. C. McHorney and A. Tarlov, Individual-patient monitoring in clinical practice: are available health status measures adequate? Qual. Life Res. 1995; 4: 293–307. 74. A. L. Stewart, S. Greenfield, and R. D. Hays, Functional status and well-being of patients with chronic conditions: results from the medical outcomes study. JAMA 1989; 262: 907–913. 75. F. Pukelsheim, The three sigma rule. Amer. Stat. 1994; 48: 88–91. 76. E. F. Juniper, G. H. Guyatt, and D. H. Feeny, Measuring quality of life in childhood asthma. Qual. Life Res. 1996; 5: 35–46. 77. G. Guyatt, E. F. Juniper, S. D. Walter, L. E. Griffith, and R. S. Goldstein, Interpreting treatment effects in randomized trials. BMJ 1998; 316: 690–693. 78. O. Chassany et al., Patient-reported outcomes: the example of health-related quality of life - a European guidance document for the improved integration of health-related quality of life assessment in the drug regulatory process. Drug Inf. J. 2002; 36: 209–238. 79. C. Spielberger, State-Trait Anxiety Inventory: STAI (Form Y). Palo Alto, CA: Consulting Psychologists Press, Inc., 1983. 80. L. Radloff, The CES-D scale: a self-report depression scale for research in the general population. Appl. Psychol. Meas. 1977; 1: 385–481. 81. D. M. McNair, M. Lorr, and L. F. Droppleman, Profile of Mood States Manual. San Diego, CA: EdiTS, 1992. 82. R. D. Hays, R. Anderson, and D. Revicki, Psychometric considerations in evaluating health-related quality of life measures. Qual. Life Res. 1993; 2: 441–449. 83. G. R. Norman, P. W. Stratford, and G. Regehr, Methodological problems in the retrospective computation of responsiveness to change: the lessons of Cronbach. J. Clin. Epidemiol. 1997; 50(8): 869–879. 84. C. E. Schwartz and M. A. G. Sprangers, Methodological approaches for assessing response shift in longitudinal health-related quality-of-life research. Social Sci. Med. 1999; 48: 1531–1548. 85. M. T. Moser, J. Weis, and H. H. Bartsch, How does regression to the mean affect thresholds of reliable change statistics? Simulations and examples for estimation of true change in
cancer-related quality of life. Qual. Life Res. 2002; 11: 669. 86. C. McHorney, Generic health measurement: past accomplishments and a measurement paradigm for the 21st century. Ann. Intern. Med. 1997; 127: 743–750. 87. P. W. Stratford, J. Binkley, and D. L. Riddle, Sensitivity to change of the Roland-Morris Back Pain Questionnaire: part 1. Phys. Ther. 1998; 78: 1186–1196.
Table 2. Distribution-Based Methods of Determining Change

– Paired t-statistic (Reference 62). HRQOL evaluated in relation to: standard error of the mean change. Calculation: $(\bar{x}_1 - \bar{x}_0)\big/\sqrt{\sum (d_i - \bar{d})^2 / [n(n-1)]}$. Advantages: none. Disadvantages: increases with sample size.
– Growth curve analysis (Reference 63). HRQOL evaluated in relation to: standard error of the slope. Calculation: $B/\sqrt{V}$, the estimated slope divided by its standard error. Advantages: not limited to pre-test and post-test scores; uses all of the available data. Disadvantages: increases with sample size; requires large sample sizes; assumes data missing at random.
– Effect size (References 22, 64). HRQOL evaluated in relation to: pre-test standard deviation. Calculation: $(\bar{x}_1 - \bar{x}_0)\big/\sqrt{\sum (x_{0i} - \bar{x}_0)^2 / (n-1)}$. Advantages: standardized units; benchmarks for interpretation; independent of sample size. Disadvantages: decreases with increased baseline variability of sample; does not consider variability of change.
– Standardized response mean (Reference 44). HRQOL evaluated in relation to: standard deviation of change. Calculation: $(\bar{x}_1 - \bar{x}_0)\big/\sqrt{\sum (d_i - \bar{d})^2 / (n-1)}$. Advantages: standardized units; independent of sample size; based on variability of change. Disadvantages: may vary widely among samples; varies as a function of effectiveness of treatment.
– Responsiveness statistic (Reference 65). HRQOL evaluated in relation to: standard deviation of change in a stable group. Calculation: $(\bar{x}_1 - \bar{x}_0)\big/\sqrt{\sum (d_{i,\mathrm{stable}} - \bar{d}_{\mathrm{stable}})^2 / (n-1)}$. Advantages: standardized units; more conservative than effect size; independent of sample size; takes into account spurious change due to measurement error. Disadvantages: data on stable subjects frequently not available.
– Standard error of measurement (Reference 7). HRQOL evaluated in relation to: standard error of measurement. Calculation: $(x_1 - x_0)\big/\left[\sqrt{\sum (x_{0i} - \bar{x}_0)^2/(n-1)}\,\sqrt{1-r}\right]$, where $r$ is the reliability of the measure. Advantages: relatively stable across populations; takes into account the precision of the measure; cutoffs based on confidence intervals. Disadvantages: assumes measurement error to be constant across the range of possible scores.
– Reliable change index (Reference 6). HRQOL evaluated in relation to: standard error of the measurement difference. Calculation: $(x_1 - x_0)\big/\sqrt{2\,\mathrm{SEM}^2}$. Advantages: takes into account precision of measure; cutoffs based on confidence intervals; relatively stable across populations. Disadvantages: assumes measurement error to be constant across the range of possible scores.

Reprinted with permission from Crosby et al. (4).
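To complement the individual-level sketch given earlier for the SEM and reliable change index, the following sketch computes the group-level indices from Table 2. The pre- and post-treatment scores are invented for illustration; nothing here is drawn from the cited studies.

```python
# Minimal sketch (hypothetical data): paired t-statistic, effect size
# (mean change / pre-test SD), and standardized response mean
# (mean change / SD of change), as defined in Table 2.
import statistics as st

pre  = [52.0, 47.0, 61.0, 55.0, 49.0, 58.0, 44.0, 50.0]
post = [58.0, 50.0, 66.0, 54.0, 57.0, 65.0, 49.0, 55.0]
change = [p1 - p0 for p0, p1 in zip(pre, post)]
n = len(change)

mean_change = st.mean(change)
paired_t    = mean_change / (st.stdev(change) / n ** 0.5)  # grows with sample size
effect_size = mean_change / st.stdev(pre)                  # benchmarks: 0.2 / 0.5 / 0.8
srm         = mean_change / st.stdev(change)               # based on variability of change

print(round(paired_t, 2), round(effect_size, 2), round(srm, 2))
```

As the table's advantages and disadvantages indicate, the paired t-statistic increases with sample size, whereas the effect size and standardized response mean are standardized and independent of sample size.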
Table 4. Checklist for Assessing Clinical Significance over Time in QOL

Is the QOL questionnaire relevant, reliable, valid, and responsive to change?
– Is the questionnaire appropriate given the research objective and the rationale for QOL assessment?
– Is the questionnaire appropriate given the domains included and in light of the disease and population characteristics?
– Is the questionnaire appropriate given practical considerations (e.g., regarding respondent burden and the availability of different language versions)?
– Is the questionnaire reliable and valid? Is this information reported in the article?
– Is the questionnaire responsive to change? Is this information reported in the article?

What are the characteristics of the population for whom changes in QOL are reported?
– What are their disease (e.g., tumor type), treatment (e.g., duration), sociodemographic and cultural (e.g., age, ethnicity), and behavioral (e.g., alcohol use) characteristics?
– To what extent are the QOL data applicable to your patients?
– Is actual QOL status of individual patients reported (e.g., by providing confidence intervals, standard deviations, subgroup data, individual data plots), thus documenting the amount of individual variation in response to treatment?

Is the study adequately powered?
– Is the sample size appropriate for the research questions (e.g., by providing a power calculation)?
– Is a rationale and/or source for the anticipated effect size specified?
– Does the power calculation take into account: the scale range of the anticipated effect, the score distribution (i.e., magnitude and form), the number of outcome measures, and the research hypothesis (i.e., equivalence versus difference)?

Are the timing and frequency of assessments adequate?
– Is a baseline assessment included?
– Is QOL assessed at appropriate times for determining minimally important change given the natural course of the disease?
– Is QOL assessed at appropriate times to document treatment course, clinical events, and post-treatment effects?
– Is QOL assessed long enough to determine a clinical effect, taking disease stage into account?
– Are standard research design procedures followed (e.g., avoidance of respondent burden, collection of data prior to treatment or consultation)?
– Is the timing of the QOL assessments similar across treatment arms?

How are multiple QOL outcomes addressed in analyses?
– Is the adopted approach of handling multiplicity explicitly described?
– Which approach is taken: limiting the QOL outcomes, use of summary measures, adjustment of p-values, and/or multivariate statistical analysis and modeling?
– Did the interpretation of the results take the problem of multiple outcomes into account?

How are multiple time-points handled?
– Are the data presented in a meaningful and suitable way enabling an overview of QOL changes over time?
– Do the tabular and graphical presentations take the problems inherent in the data into account (e.g., presence of floor and ceiling effects, patient attrition)?
– Are the data appropriately analyzed (e.g., are all time points included, are missing data taken into account, are pre-treatment co-variates included)?
– Does the article provide sufficient information on the statistical models selected?

Can alternative explanations account for the observed change or lack of observed change?
Are dissimilar baseline characteristics adequately accounted for?
– Are patients' baseline QOL scores close to the extremes of the response scale? Do the treatment groups differ in baseline QOL?
– Is the baseline QOL score used as a co-variate?
Are missing data handled adequately?
– Does the article indicate how missing items within a questionnaire are handled?
– Does the article report the number of missing questionnaires at each scheduled assessment?
– Does the article report the reasons for missing questionnaires?
– Is there an association between patients' health status and missing QOL data?
– If patients with incomplete data are excluded from the analysis (e.g., by using complete case methods), does the article document that these are nonignorable missing data?
– In cases of non-ignorable missing data, are several analytical approaches presented to address possible bias in conclusions based on this QOL data set?
– If patients have died in the course of the study, is mortality accounted for in the evaluation of QOL?
– Is observed survival difference combined with QOL in evaluating change?
– Are summary indices (e.g., QALYs, Q-TWiST) or imputation techniques used?
Did the patient's QOL perspective change over time?
– Are changes in patient's internal standards, values, and/or the conceptualization of QOL explicitly measured?
– Are insignificant or small changes in QOL reported despite substantial changes in patient's health status (i.e., deterioration or improvement)?
– How likely is it that patients have changed their internal standards, values, and/or their conceptualization of QOL as a result of adaptation to deteriorating or improving health?

How is statistical significance translated into meaningful change?
– Does the article provide some guidance regarding the clinical importance of the observed change in QOL?
– To what extent is the statement of clinical importance appropriate and empirically warranted?

Reprinted with permission from Sprangers et al. (34).
CLINICAL TRIAL MISCONDUCT

DRUMMOND RENNIE
University of California Institute for Health Policy Studies, San Francisco, California

1 THE SCOPE OF THIS ARTICLE

The purpose of clinical trials of interventions is to obtain unbiased, relevant, and reliable information on the value of the interventions (1). Many ethical problems can occur during a trial. It can be designed incorrectly, for example, so that the results of a placebo-controlled trial on a me-too drug are worthless to the clinician who wants to know whether it represents an improvement over the best already available. Randomization can be broken and concealment made inadequate. The trial can be under-powered so that the research participants are put at risk without any hope of obtaining results of value to mankind. The trial may be stopped early by the sponsor for financial reasons (2), the results of the trial can be buried, or the same results can be reported repeatedly by different authors without cross-referencing (3), resulting in publication bias or distortion of the published literature. These and other such actions are, to varying degrees, unethical, and all of them have at one time or another been labeled research or scientific misconduct (4), but in the United States at least, they fall outside the federal definition of research misconduct and no one would be found guilty of misconduct for such acts alone. It is particularly useful to look at the history and present situation of scientific misconduct in the United States to see how this came about, because it was in the United States that the first well-publicized cases occurred, and the United States was the first country to take decisive action to deal with it. This action took the form of forming a consensus that a problem to be solved existed; reaching a definition; developing a process to be followed when allegations arose; and modifying the process in response to experience and new case law. The contrast between the present situation in the United States, where handling misconduct has become routine, and the United Kingdom, where efforts to establish consensus have been fragmentary and fitful, serves as a sharp reminder of the importance of facing up to and dealing with this unpleasant side of science. It also confirms that scientific institutions are unlikely to deal with the problem, except on a case-by-case basis, unless forced to do so by their governmental paymasters (5).
2 WHY DOES RESEARCH MISCONDUCT MATTER?

Crooked research, clinical or otherwise, corrupts the record, leading other scientists down false trails. In clinical trials, it inevitably distorts the apparent efficacy of interventions, may falsely raise the expectations of patients and physicians alike, and, when uncovered, leads to public anger, skepticism, and loss of confidence in research in general and in research institutions. Within trials and within institutions, misconduct, if not dealt with, leads to loss of morale, cynicism, and resignations of honest investigators. It often leads to the ostracism of whistleblowers, even if they are correct in their accusations, and to deep divisions within research teams. Clinical trials are exceedingly expensive, so efforts to correct tainted research may be thwarted because replication may be too costly to organize. Finally, the process of investigation is expensive and involves numerous busy people, with more pressing claims on their time, in extra work.

3 EARLY CASES
The modern history of research misconduct started in 1974 at Sloan-Kettering with Summerlin and his painted mouse (6–12). Over the next 15 years, numerous spectacular cases occurred at major U.S. research universities, later summarized in many books and articles. These cases were reported fully by the media, as were the responses of the scientific establishment. In general,
scientists, accustomed to trusting their colleagues, had great difficulty imagining that a colleague would ever break that trust. They denied fraud could occur, or maintained, in the absence of any evidence, that those who committed fraud were extreme aberrations or sick individuals. Few seemed to accept the commonsense view that some percentage of scientists, as in any other profession, were likely to be fraudulent. ''The response of each research institution varied but was only too often characterised by circling the wagons, denial, and cover up. Under the eyes of the press, each institution would hurriedly patch together its own process, assembling ad hoc panels, sometimes with glaring conflicts of interest. The results were frequently slow, bungled, idiosyncratic, and unfair to almost everyone'' (5). Politicians became involved in 1981 when then Congressman Al Gore held the first of some dozen hearings at which delinquent scientists were called on the carpet, whistleblowers testified on how they had been abused, and important institutional administrators denied to openly skeptical audiences that any problem existed at all (5, 13–15). At the close of the first hearing, Gore could not ''avoid the conclusion that one reason for the persistence of this type of problem is the reluctance of people high in the science field to take these matters seriously'' (13). In the face of inertia, denial, and opposition on the part of the scientific establishment and an increasingly rancorous atmosphere, Congress, using the justification that it was responsible for overseeing how tax dollars were spent, and that faked research was a fraud against the public purse, forced the setting up of an Office of Scientific Integrity (OSI) in the National Institutes of Health (NIH) in 1989 (6). Rules and a definition of misconduct were promulgated, and the process for investigation and adjudication laid down. Although governmental regulations did not apply to those on non-governmental money, the new regulations applied to all institutions receiving money from the government, so this effectively meant that all biomedical research institutions in the United States had to draw up institutional rules that complied with the government's definition of misconduct and procedures for responding to it. Although
those funded by the National Science Foundation operated under a slightly different system, the governmental rules became, de facto, the universal rules within biomedicine. After several years, the initiative and the OSI were moved outside the NIH, but operated only within the Public Health Service. The OSI transmogrified into the more lawyer-heavy Office of Research Integrity (ORI). It provoked bitter attacks from scientists, politicians, accused persons and their attorneys, whistleblowers, and the press. The procedures were said to be amorphous, inconsistent, illegal, and poorly articulated, the process too quick, too slow, too timid, too aggressive, and often bungled. The accused and whistleblower/accusers both complained of being deprived of their rights (6, 15). The OSI/ORI operated on a collegial ''scientific dialog'' model, intended to keep the process in the hands of scientists, rather than lawyers. But as cases accumulated and the OSI/ORI lost them when they were appealed to the courts, this was found to be legally flawed and unfair to the accused, and was abandoned for a process that followed the stricter rules of administrative law. Meanwhile, some whistleblowers, exasperated by the confusion, pursued an alternative route provided by the False Claims Act, which qualified them to share in a portion of any grant monies recovered through the courts rather than through the designated institutional routes (6, 15).

4 DEFINITION

The process of developing the regulations had been greatly facilitated by a successful series of meetings, endorsed by the American Bar Association and scientific societies, notably the American Association for the Advancement of Science, and attended by scientists, administrators, politicians, and lawyers. The most fundamental issue, the definition, was a bone of contention from the start. In general, scientists, well aware of the problems caused by breakdowns in the complex interpersonal relationships formed during team research, and afraid of the devastating effect of malicious and unfounded accusations, attempted to limit the definition to the relatively clear
crimes of fabrication, falsification, and plagiarism. Lawyers and politicians, who were mainly lawyers, were concerned to separate definitely criminal acts from uncouth and uncollegial behavior (to separate, in Tina Gunsalus' phrase, ''the crooks from the jerks'') and to separate those acts, such as sexual harassment, which could occur outside the scientific environment and for which laws already existed. Abused whistleblowers (many of whom, although correct in their accusations, had ended up losing their jobs because of the universal tendency on the part of friends to rally around their accused colleagues, no matter what the evidence) often felt that their rights had been violated and usually sought a wider definition. The precise details of each of their cases differed, so there was pressure from them to include a large number of acts, from duplicate publication to impolite behavior, in the definition of research misconduct. Institutions were concerned that they retain control over disciplining their faculty, while at the same time not being made to sink under the weight of elaborate new rules that were expensive to enforce. Everyone wanted clarity because, without a clear and universal definition, everyone had to invent their own rules, leading to confusion and unfairness. It was decided to use the term ''misconduct'' rather than ''fraud,'' because a finding of fraud, usually financial, needed several specific requirements to be proved, which would have been impossible to sustain in the case of purely scientific irregularities. The definition of scientific misconduct adopted by the OSI in 1989 was ''Fabrication, falsification, plagiarism or other practices that seriously deviate from those that are commonly accepted within the scientific community for proposing, conducting, or reporting research'' (16). Although everyone agreed that fabrication, falsification, and plagiarism were antithetical to good science, large scientific organizations immediately objected to the inclusion of ''other practices . . . '' on the grounds that this would include, and so inhibit, breakthrough, unconventional science. They wanted the definition limited to fabrication, falsification, and plagiarism, what came to be called ''FF&P.'' Although this argument
seemed spurious to many, underlying the concern was the understandable fear that scientists would be found guilty of practices they had no idea could be wrong. Against this notion, others argued that many cases were not covered by FF&P, for example, stealing work during peer review. There could not be a definition that implied that such behaviors were not wrong.

5 INTENT
Gruber said: ''The power and the beauty of science do not rest upon infallibility which it has not, but on corrigibility without which it is nothing'' (17). No one could undertake the risky and uncertain business of science if every error were to be construed as misconduct. Distinguishing error from misconduct requires making a judgment about intent. ''Misconduct'' in law means the ''willful'' transgression of a definite rule. Its synonyms are ''misdemeanor, misdeed, misbehavior, delinquency, impropriety, mismanagement, offense, but not negligence or carelessness'' (18), which is one reason why the definition decided on in 1999 at a consensus conference in Edinburgh, Scotland (''Behavior by a researcher, intentional or not, that falls short of good ethical and scientific standards.'') is doomed (19). The law has a long history of judging intent, and that is what, in addition to deciding on the facts, panels looking into scientific misconduct must do.

6 WHAT SCIENTIFIC MISCONDUCT WAS NOT

What became clear from the battle was that a great many examples of egregious conduct had to be dealt with, if at all, by different mechanisms. Thus, failure to obtain informed consent, or theft of laboratory funds, or mistreatment of laboratory animals, or accepting money to do fake research while billing the government for the costs, were serious offenses covered by laws concerning conduct of trials, laws against theft, laws governing animal research, or anti-kickback laws. Similarly, inserting the results of fake experiments as supporting evidence in grant applications to the government contravened the
law that, in the United States at least, it is illegal to lie intentionally to the government. For an excellent discussion of legal issues in research, the reader is referred to Kalb and Koehler (20). All sorts of other unethical behaviors were left to be dealt with by the institutions, the scientific community, and the scientific journals. So, for example, data dredging, duplicate publication, ghost and guest authorship, failure to publish, starting trials in the absence of equipoise, failure to share data, or failure to reveal massive financial conflicts of interest, and many other practices damaging to science and exasperating to scientists were either by implication condoned or left to the community to sanction. In this chapter, such issues, and problems in clinical trials such as mismanagement of funds or improper and biased statistical analysis, will not be addressed.

7 THE PROCESS
The 1989 regulations detailed a process to be followed whenever an allegation was received, either by an official at the research institution or by the OSI/ORI (henceforth called the ORI): an initial inquiry by the institution to see whether there might be any merit to the accusation and, if likely, a full investigation. This investigation was to be carried out in the institutions by people with no conflict of interest, with the results to be forwarded to the ORI. In addition, the ORI was given a monitoring function to assure the public that all research institutions were taking the matter seriously and complying with the regulations. The most important change, made in 1993, was the introduction of an appeals process before a Research Integrity Adjudication Panel appointed by the Departmental Appeals Board of the Department of Health and Human Services (DHHS). The process now is that the ORI reviews the findings and either (rarely) refers to the DHHS Office of Inspector General or, almost always, refers to the Assistant Secretary of Health for acceptance or rejection of the institutions' recommendations. If a finding of misconduct occurs, the ORI would impose sanctions or negotiate an agreement with the
accused. The accused could then appeal to the Departmental Appeals Board, where the accused scientist has a right to be represented by counsel, to discovery of all evidence used to convict, to cross-examine witnesses, including the accuser, and to participate in the hearing (21). An independent analysis of policies and procedures at institutions approved by the ORI, done in 2000, has shown considerable variation between institutions in all phases of the handling of cases of suspected misconduct before the report goes to the ORI (22), and low levels of due process. For example, only 32% had policies requiring that the accused have access to all the evidence, and only 21% the right to present witnesses (22). Mello and Brennan have criticized the process as providing insufficient safeguards and due process for accused scientists, for whom a finding of misconduct would be devastating (21).

8 THE PAST DECADE

Despite an apparently clear and workable system, the controversy continued and, in the early 1990s, several high-profile cases were decided against government science agencies and their ways of proceeding. To try to restore calm, the U.S. Public Health Service set up a Commission on Research Integrity (named the Ryan Commission, after its chair, Kenneth J. Ryan of Harvard Medical School). The Commission heard testimony in 15 meetings across the United States from a large number of witnesses, including scientists and their organizations, whistleblowers, attorneys, institutions, the press, interested citizens, and government officials. Their report was completed in November 1995 (23). The Commission recommended that the definition of research misconduct should be ''based on the premise that research misconduct is a serious violation of the fundamental principle that scientists be truthful and fair in the conduct of research and the dissemination of its results'' (23). ''Research misconduct is significant misbehavior that improperly appropriates the intellectual property or contributions of others, that intentionally impedes the progress
of research, or that risks corrupting the scientific record or compromising the integrity of scientific practices. Such behaviors are unethical and unacceptable in proposing, conducting, or reporting research or in reviewing the proposals or research reports of others'' (23). The Commission specifically included within this definition ''misappropriation,'' which included plagiarism and making use of another's ideas and words during peer review; ''interference'' with another's research; and ''misrepresentation'' by reporting scientific data falsely. Because its members had the advantage of hearing many actual cases, the Commission also included obstruction of investigations of research misconduct and noncompliance with research regulations (23). Finally, the Commission, recognizing that whistleblowers represented an invaluable quality assurance mechanism, but that they had to receive protection from retaliation lest they suffer damage to their careers as a result of their actions, presented an appendix, ''Responsible Whistleblowing: A Whistleblower's Bill of Rights.'' The reaction to the report varied from enthusiastic (Nature and the Lancet) to angry. The President of the Federation of American Societies for Experimental Biology (FASEB) wrote to the Secretary for Health and Human Services that the ''Commission's report is so seriously flawed that it is useless as a basis for policy making and should be disavowed . . . we find the definition to be unworkable, and therefore unacceptable'' (24). He was quoted in the press as calling the report ''an attack on American science'' (25), surprising words given the reliance the Commission had put on the National Academy of Science's own report (26, 27). It seems likely that the cause of this excessive reaction was the Whistleblower's Bill of Rights, the insertion of which was interpreted not as attention to the plight of the accuser, but as a failure to protect the rights of the already well-protected accused. Once again, the bogeyman that regulation beyond FF&P would inhibit groundbreaking research was raised, this time to attack the Commission, even though in thousands of actual cases this had never once happened. In response to court decisions, changes to the process were introduced to allow more
protections for the accused and to bring the entire process more into line with the procedures set down by administrative law. Institutions became used to handling cases, the process was seen to be fairer, and the late 1990s were characterized by a decrease in the shrillness of the debate. On December 6, 2000, the United States issued the new, government-wide regulations defining research misconduct and laying down the rules for investigation and adjudication of allegations of misconduct concerning research done with U.S. federal funds (Fig. 1) (28). As all important universities and research institutions receive such funds, these regulations, broadly speaking, became institutional rules, although institutions are allowed to have their own additional rules if they wish to impose a higher internal standard (28). The regulations (see Fig. 1) restricted the definition to FF&P, required intent to be taken into account, and stipulated that the allegation must be proven by a ''preponderance of the evidence.'' Relative quiet has descended on the community as institutions have successfully handled cases in an entirely routine fashion. The ORI now serves chiefly an educational and monitoring function, whereas the accused is now assured of the full protections of the law.

9 LESSONS FROM THE U.S. EXPERIENCE
Among the lessons to be gleaned from the turbulent experience in the United States are the following. Despite numerous very public scandals, and enormous publicity, building consensus on such an emotional topic is hard and takes many years. With notable exceptions, scientists are naturally loath to cede any of their authority, and good scientists often find it hard to imagine that anyone could break the bonds of trust that allow science to function. Unless pushed by those holding the purse strings, they will do little to police their own profession and tend to challenge, resist, and circumscribe attempts from outside to impose rules. They will tend to be suspicious of attempts to protect whistleblowers, who remain by far the most useful source for reporting misconduct and who often, perhaps usually, suffer for their efforts on behalf
of science. Scientists have learned that the same body must not be investigator, prosecutor, and judge. Lastly, although scientific misconduct has to be defined and assessed by scientists, the threats to the livelihood of the accused are so great that no solution will work unless it is seen to be fully in accord with the law, about whose workings scientists are often ill-informed.

Figure 1. Federal Policy on Research Misconduct
I. Research† Misconduct Defined
Research misconduct is defined as fabrication, falsification, or plagiarism in proposing, performing, or reviewing research, or in reporting research results.
• Fabrication is making up data or results and recording or reporting them.
• Falsification is manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.‡
• Plagiarism is the appropriation of another person’s ideas, processes, results, or words without giving appropriate credit.
• Research misconduct does not include honest error or differences of opinion.
II. Findings of Research Misconduct
A finding of research misconduct requires that:
• There be a significant departure from accepted practices of the relevant research community; and
• The misconduct be committed intentionally, or knowingly, or recklessly; and
• The allegation be proven by a preponderance of evidence.
No rights, privileges, benefits, or obligations are created or abridged by issuance of this policy alone. The creation or abridgment of rights, privileges, benefits, or obligations, if any, shall occur only on implementation of this policy by the federal agencies.
† Research, as used herein, includes all basic, applied, and demonstration research in all fields of science, engineering, and mathematics, which includes, but is not limited to, research in economics, education, linguistics, medicine, psychology, social sciences, statistics, and research involving human subjects or animals.
‡ The research record is the record of data or results that embody the facts resulting from scientific inquiry, and includes, but is not limited to, research proposals, laboratory records, both physical and electronic, progress reports, abstracts, theses, oral presentations, internal reports, and journal articles.
§ The term ‘‘research institutions’’ is defined to include all organizations using federal funds for research, including, for example, colleges and universities, intramural federal research laboratories, federally funded research and development centers, national user facilities, industrial laboratories, or other research institutes. Independent researchers and small research institutions are covered by this policy.
10 OUTSIDE THE UNITED STATES
The most thoughtful national response to the issue of scientific misconduct has been made in Denmark (29, 30). What is particularly interesting is that those who organized this response did so before Denmark, in sharp contradistinction to the United States and other countries, had any important cases or scandals with which to deal. Their effort was aimed from the start at prevention as well as at setting down procedures for responding to allegations. For a full description and analysis, the reader is referred to ‘‘Scientific Dishonesty & Good Scientific Practice,’’ published by the Danish Medical Research Council in 1992 (29, 30).
In the United Kingdom, the situation still seems similar to that in the United States 20 years ago, few lessons having been learned from abroad. Certain granting agencies (31–33) have well-established rules—their authority, as in the United States, based on the need to account for funds disbursed. But little enthusiasm has been expressed for any general response, despite energetic pushing from the editors of the BMJ and Lancet, who, exasperated, founded the Committee on Publication Ethics (COPE). Editors of general medical journals are quite frequently made aware of suspect research, and are then called on to investigate—something they ordinarily do not have the mandate, authority, money, legal standing, time, or expertise to do. In the United States, editors simply convey the allegation to the researcher’s institution and act as witnesses. In the United Kingdom, there may be no one and no system to investigate, adjudicate, and sanction. COPE, which has reported annually since 1998, serves as a sounding board, looks into ethical matters, and, where necessary and possible, refers cases to the appropriate authority. Although the editors have repeatedly called for a national
body, very little progress has been made (34, 35). Elsewhere, many other countries that, like the United Kingdom and South Africa, have poorly defined procedures are either struggling to deal with the problem or sweeping it under the table. Germany has had to contend with long-running and widely publicized cases and is attempting to frame regulations (36). In China, Beijing University has issued broad regulations that include, for example, ‘‘intentionally exaggerating the academic value and economic and social results of a research finding’’ (37).
11 SCIENTIFIC MISCONDUCT DURING CLINICAL TRIALS
Many of the early cases of scientific misconduct were perpetrated by physicians under contract to pharmaceutical companies to follow up patients and to help with post-marketing surveillance. Wells and Brock have described several cases (38, 39). Pharmaceutical companies were left with a quandary. Their audits had revealed clear misconduct, but until the 1990s they did not wish to antagonize the physician-perpetrators or their colleagues by prosecuting (39). Now that the public recognizes that scientific misconduct exists, companies are backed up by the medical community, so their reluctance to pursue cases has diminished. As with misconduct in other settings, distinguished researchers in large medical centers may be found guilty (40). The difficulties of dealing with misconduct during a trial when no standard procedures exist are well illustrated by Hoeksema et al. (41). Of recent cases of misconduct within a clinical trial, the most notorious concerns the South African oncologist Dr. Werner Bezwoda of the University of Witwatersrand in Johannesburg. Bezwoda’s clinical trial was alone in reporting striking benefit from high-dose chemotherapy and peripheral blood stem cell rescue in women with high-risk breast cancer (42). His results, presented in 1999, were so at variance with the three other reported trials of the same therapy that the National Cancer Institute sent an audit team to South Africa to investigate.
The team rapidly discovered serious protocol deviations, poor documentation, nonexistent patients, failure to obtain ethical approval for the study from the institution, and no evidence of informed consent. They concluded that Bezwoda’s study was invalid (43). Among the consequences of this behavior were disappointment for thousands of women and their physicians and loss of confidence in clinical research. It is unclear what part Bezwoda’s colleagues played, but the continued day-to-day involvement of colleagues in all aspects of this research would surely have prevented this misconduct.
12 AUDIT
Audit is an inefficient and costly system for policing science and should not in general be used. Clinical trials are different. In 1978, well before there were standard methods for investigating misconduct, numerous researchers at Boston University Medical Center reported being pressured by their chief to falsify and fabricate data in clinical trials in oncology. He, on the other hand, alleged that the falsifications, later shown to involve 15% of the data, had been perpetrated by his juniors (7). He also maintained that ‘‘there are certain types of studies that are almost beyond the limits of absolute surveillance’’ as they are so complex (7). As a consequence of the wide publicity given this case, the National Cancer Institute mandated on-site audits by its cooperative groups. Subsequently, in 1985, Shapiro and Charrow, examining the results of U.S. Food and Drug Administration (FDA) audits, showed a high prevalence of bungled and fraudulent research (44). In 1993, Weiss et al. reported on the audit system of the Cancer and Leukemia Group B and found a very low rate of misconduct (45). However, the finding of one case of misconduct and one of ‘‘gross scientific error’’ was made under conditions where everyone had been put on notice that there would be regular audits; where one of those audited had received a suspended prison sentence for misconduct; where three large centers with adverse audit results had simply been dropped from the study; and where everyone involved had made a very strong commitment
to the expensive and time-consuming task of audit (46). It was a routine audit that revealed misconduct by one investigator, Dr. Roger Poisson of Montreal, in one of the most significant of trials, the National Surgical Adjuvant Breast and Bowel Project (NSABP). The ORI, eventually alerted by the National Cancer Institute, discovered 115 well-documented cases of fabrication and falsification, for which Poisson took responsibility, as well as falsifications in other studies (47). This episode was particularly unfortunate because it took years for the audit results to become known to the journals that had published articles including Poisson’s data, and the journals and the public learned of it via the Chicago Tribune (47). The scientists who later asserted that Poisson’s falsifications did not materially affect the results of the trial, so that informing the public and the journals was unimportant, were missing an essential point. Research is paid for by the public, who are deeply interested in the results of important trials in breast cancer. The legitimacy of researchers and their trials depends on the trust of the public, which this incident, the fact that it was unearthed by a reporter, and the massive publicity it aroused damaged severely. The result was doubly unfortunate, given that it was the trial’s own audit that had first revealed to its leaders that a problem existed. As Brock has pointed out, the knowledge that audit will be routine and thorough, and that misconduct will be reported and prosecuted, is likely to have a deterrent effect (39). Steward et al. have suggested that peer-reviewed journals consider publication of the results of trials only when a program of random audits exists (48). Experience has shown that in far-flung enterprises like multicenter clinical trials, careful attention to audit, conducted by senior investigators involved with the trial, is necessary if trials are to be credible.
13 CAUSES
A great deal of unsupported opinion has been published on the causes. The usual favorite is that scientists are driven to misconduct
because of the ‘‘publish or perish’’ atmosphere of the research life. This facile explanation takes no account of the fact that no correlation, positive or negative, exists between publication and misconduct; that the vast majority of scientists are honest, whatever their publication rates; that the pressure to publish has been shown to be exaggerated; and that at least one prominent scientist found guilty of fabrication went out of his way to deny this was the cause (13). Brock concludes that the motive is often simple greed, which, associated with a desire for advancement, may be a powerful causative factor. We are far from understanding the psychological processes at work here, as is evidenced by, for example, the case of Karen Ruggiero, whose extensive fabrications came as a complete shock to her supervising colleague, a psychologist (49). It is to be hoped that the large amount of research stimulated by the ORI’s grants and research conferences will shed light on this topic. In the meantime, it seems likely that ignorance of scientific mores, poorly enunciated standards, poor or absent teaching, and inadequate individual supervision of young researchers are far more important.
14 PREVALENCE
We do not know how commonly misconduct in general, or misconduct in clinical trials in particular, occurs. The publication of data from audit committees such as that by Weiss et al. (45) is useful, but such audits reflect only the experience of the trials with strict audit and the highest standards. Pharmaceutical companies seem to be increasing their audit of the performance of trials conducted in the community, but their combined experience has not been gathered and reported. The annual reports of the ORI (http://ori.hhs.gov) give some indication of the number of cases of misconduct, not confined to clinical trials, reported to the ORI and their disposition. In 2002, the following types of allegations were reported by institutions in the United States: fabrication: 45; falsification: 58; plagiarism: 27; ‘‘other’’: 33. During that year, there were 67 initial inquiries and 31 full investigations (50). No data exist to show that misconduct is more prevalent in one country than another.
It is simplest to assume that clinical scientists everywhere lie on a curve reaching from the obsessively honest through to the serially dishonest, and that as scientific or medical degrees do not come with any guarantee of probity, there will always be those who commit and report fraudulent research. The idea of an experimental, short-term, confidential audit, to be reported only in aggregate, in order to establish the prevalence of gross fabrication was first proposed in 1988 (51). But when the ORI recently funded a survey of misconduct and other forms of poor behavior, there were immediate and forceful protests from the same scientific bodies that had fought so hard to limit the definition of misconduct to FF&P (52, 53). There were legitimate concerns about the scientific quality of the projected survey. But in an editorial, the journal Nature described the defensiveness of the actions of the large scientific societies as ‘‘heads-in-the-sand,’’ and a ‘‘good impersonation of . . . special interests with something to hide’’ (54). Whether this is true or not, given these official attitudes, it will be a long time before we have any clear idea of the prevalence of misconduct.
15 PEER REVIEW AND MISCONDUCT
A few cases of plagiarism of ideas and words have occurred during peer review. Given the privileged and confidential nature of manuscripts sent to reviewers, the possibility of plagiarism or breaking confidentiality is a particularly disturbing one, and one that editors must do their best to prevent by reminding reviewers of their responsibilities. Editors have a duty to report breaches of confidentiality and act as witnesses to investigatory panels. For a fuller account, the reader is referred to Reference 55. Peer review operates on the assumption that the authors are telling the truth, and is a very poor detector of fraudulent research, unless, as sometimes happens, the reviewer sees his or her own words plagiarized.
16 RETRACTIONS
The case of Poisson mentioned above, and a great many other serious cases, have amply
demonstrated that a single incident of misconduct should immediately mean that the validity of all the guilty scientist’s other research cannot be assumed, and all of it must be scrutinized carefully. The University of California San Diego (UCSD) set the standard after the discovery, in 1985, of duplicate data in two studies published by Robert A. Slutsky. Over the next year, investigating panels held Slutsky’s co-authors responsible for the validity of every part of every one of 137 articles published in 7 years (56). Overall, 77, including reviews, were judged valid, 48 were ‘‘questionable’’ (and therefore unciteable), and 12 were deemed at that time ‘‘fraudulent’’ (this event occurred before federal definitions had been promulgated). Nevertheless, most of the journals asked to retract the articles refused to do so, with the result that Slutsky’s fraudulent articles were being cited as if correct years later (57). Moreover, 2 years after a high-profile inquiry in Germany looking into the work of Friedhelm Herrmann and Marion Brach found 94 published papers to include manipulated data, 14 of the 29 journals publishing the articles had not published retraction notices concerning any of the articles (58). Journal editors tend to be cowed by threats either from the guilty scientist or from the innocent co-authors, each of whom has an interest in preventing the appearance of a retraction. The editor of Nature has written feelingly of the problems of retracting seven papers all sharing the first author, Jan Hendrik Schön, who maintained throughout that his work should stand unless faced with hard evidence to the contrary (59). Despite this fact, it is everyone’s duty to correct the literature, and an increased resolve on the part of editors, backed up by strong policies from editorial societies, is in evidence. The International Committee of Medical Journal Editors policy reads: ‘‘The retraction or expression of concern, so labeled, should appear on a numbered page in a prominent section of the journal, be listed in the contents page, and include in its heading the title of the original article. It should not simply be a letter to the editor. Ideally, the first author should be the same in the retraction as in the article, although under certain
circumstances the editor may accept retractions by other responsible people. The text of the retraction should explain why the article is being retracted and include a bibliographic reference to it’’ (60). It is the duty of research institutions and a guilty scientist’s co-authors to check the scientist’s entire published work. It is also their duty, and that of the relevant journals, to issue retractions when the reports are found to contain fraudulent work (61, 62). The difficulties of doing this are shown in the case of Poehlman, who was sentenced to serve a year in prison for his misconduct (62). These difficulties are much greater when the work has originated in a country with no process for dealing with misconduct.
17 PREVENTION
From the start, it was recognized that many cases of misconduct were associated with apparent ignorance on the part of young researchers of the mores and standards of good research, and others with a complete breakdown in the assumed mentor-trainee relationship, resulting in ineffective supervision, monitoring, or training of young researchers. Add to that the fact that clinical trials demand rigorous adherence to sometimes elaborate protocols on the part of many individuals, some of whom, although quite senior, may be new to research, and many of whom are from different cultural backgrounds. The responsibility on trial leaders to educate, train, and monitor their colleagues is therefore considerable, but unavoidable. Every effort should be made not to put inordinate pressure on people such as clinical coordinators to recruit patients (63) or to distort the whole process with excessive monetary incentives. Audit and putting people on notice about the consequences of misconduct are necessary, but building up close relationships within a well-trained team may well turn out to be more important.
REFERENCES
1. I. Chalmers, Unbiased, relevant, and reliable assessments in health care. BMJ 1998; 317: 1167–1168.
2. B. M. Psaty and D. Rennie, Stopping medical research to save money: a broken pact with researchers and patients. JAMA 2003; 289: 2128–2131. 3. D. Rennie, Fair conduct and fair reporting of clinical trials. JAMA 1999; 282: 1766–1768. 4. I. Chalmers, Underreporting research is scientific misconduct. JAMA 1990; 263: 1405–1408. 5. D. Rennie, Dealing with research misconduct in the United Kingdom. An American perspective on research integrity. BMJ 1998; 316(7146): 1726–1728. 6. D. Rennie and C. K. Gunsalus, Scientific misconduct. New definition, procedures, and office—perhaps a new leaf. JAMA 1993; 269(7): 915–917. 7. W. Broad and N. Wade, Betrayers of the Truth—Fraud and Deceit in the Halls of Science. New York: Simon & Schuster, 1982. 8. M. C. LaFollette, Stealing Into Print—Fraud, Plagiarism, and Misconduct in Scientific Publishing. Berkeley, CA: University of California Press, 1992. 9. S. Lock and F. Wells (eds.), Fraud and Misconduct in Biomedical Research. 1st ed. London: BMJ Publishing Group, 1993. 10. S. Lock and F. Wells (eds.), Fraud and Misconduct in Biomedical Research. 2nd ed. London: BMJ Publishing Group, 1996. 11. S. Lock, F. Wells, and M. Farthing (eds.), Fraud and Misconduct in Biomedical Research. 3rd ed. London: BMJ Publishing Group, 2001. 12. A. Kohn, False Prophets. Fraud and Error in Science and Medicine. Oxford: Basil Blackwell Ltd., 1986. 13. Fraud in Biomedical Research. In: Hearings Before the Subcommittee on Investigations and Oversight of the Committee on Science and Technology, 1981. 14. Fraud in NIH Grant Programs. In: Hearings Before the Subcommittee on Energy and Commerce, 1988. 15. D. Rennie and C. K. Gunsalus, Regulations on scientific misconduct: lessons from the US experience. In: S. Lock, F. Wells, and M. Farthing (eds.), Scientific Fraud and Misconduct. 3rd ed. BMJ Publishing Group, 2001, pp. 13–31. 16. U.S. Department of Health and Human Services, Public Health Service, Responsibilities of awardee and applicant institutions for dealing with and reporting possible misconduct
in science: final rule. Fed. Reg. 1989; 54: 32446–32451. 17. New York Times. July 22, 1975. 18. B. Mishkin, The investigation of scientific misconduct: some observations and suggestions. New Biologist 1991; 3: 821–823. 19. Joint Consensus Conference on Misconduct in Biomedical Research. In: Royal College of Physicians of Edinburgh, 1999. 20. P. E. Kalb and K. G. Koehler, Legal issues in scientific research. JAMA 2002; 287: 85–91. 21. M. M. Mello and T. A. Brennan, Due process in investigations of research misconduct. N. Engl. J. Med. 2003; 349: 1280–1286. 22. CHPS Consulting, Analysis of Institutional Policies for Responding to Allegations of Scientific Misconduct. Rockville, MD: Office of Research Integrity. CHPS Consulting, 2000. 23. Integrity and Misconduct in Research. Report of the Commission on Research Integrity to the Secretary of Health and Human Services, the House Committee on Commerce and the Senate Committee on Labor and Human Resources (the Ryan Commission). (1995). (online). Available: http://gopher.faseb.org/opar/cri.html. 24. R. A. Bradshaw, Letter to Secretary of Health and Human Services Donna Shalala. January 4, 1996. 25. B. Goodman, Scientists are split over finding of Research Integrity Commission. The Scientist 1996; Sect. 1. 26. Responsible Science—Ensuring the Integrity of the Research Process, vol. I. Washington, DC: National Academy Press, 1992. 27. Responsible Science—Ensuring the Integrity of the Research Process, vol. II. Washington, DC: National Academy Press, 1993. 28. Office of Science and Technology Policy, Federal policy on research misconduct. Fed. Reg. 2000; 76260–76264. 29. D. Andersen, L. Attrup, N. Axelsen, and P. Riis, Scientific Dishonesty & Good Scientific Practice. Copenhagen: Danish Medical Research Council, 1992. 30. P. Riis, The concept of scientific dishonesty: ethics, value systems, and research. In: S. Lock, F. Wells, and M. Farthing (eds.), Fraud and Misconduct in Biomedical Research. 3rd ed. London: BMJ Publishing Group, 2001, p. 268. 31. The Wellcome Trust. (2002). Guidelines on Good Research Practice. (online). Available: http://www.wellcome.ac.uk/en/1/awtvispolgrpgid.html.
32. R. Koenig, Wellcome rules widen the net. Science 2001; 293: 1411–1413. 33. I. Evans, Conduct unbecoming— the MRC’s approach. BMJ 1998; 316: 1728–1729. 34. R. Smith, The need for a national body for research misconduct—nothing less will reassure the public. BMJ 1998; 316: 1686–1687. 35. M. Farthing, R. Horton, and R. Smith, UK’s failure to act on research misconduct. Lancet 2000; 356: 2030. 36. A. Bostanci, Germany gets in step with scientific misconduct rules. Science 2002; 296: 1778. 37. D. Yimin, Beijing U. issues first-ever rules. Science 2002; 296: 448–449. 38. F. Wells, Counteracting research misconduct: a decade of British pharmaceutical industry action. In: S. Lock, F. Wells, and M. Farthing (eds.), Fraud and Misconduct in Biomedical Research. 3rd ed. London: BMJ Publishing Group, 2001, pp. 64–86. 39. P. Brock, A pharmaceutical company’s approach to the threat of research fraud. In: S. Lock, F. Wells, and M. Farthing (eds.), Fraud and Misconduct in Biomedical Research. 3rd ed. London: BMJ Publishing Group, 2001, pp. 89–104. 40. J. H. Tanne, FDA limits research of former AHA president for submitting false information. BMJ 2002; 325: 1377. 41. H. L. Hoeksema et al., Fraud in a pharmaceutical trial. Lancet 2000; 356: 1773. 42. R. Horton, After Bezwoda. Lancet 2000; 355: 942–943. 43. R. B. Weiss et al., High-dose chemotherapy for high-risk primary breast cancer: an on-site review of the Bezwoda study. Lancet 2000; 355: 999–1003. 44. M. F. Shapiro and R. P. Charrow, Scientific misconduct in investigational drug trials. N. Engl. J. Med. 1985; 312: 732–736. 45. R. B. Weiss et al., A successful system of scientific data audits for clinical trials. JAMA 1993; 270: 459–464. 46. D. Rennie, Accountability, audit, and reverence for the publication process. JAMA 1993; 270(4): 495–496. 47. D. Rennie, Breast cancer: how to mishandle misconduct. JAMA 1994; 271(15): 1205–1207. 48. W. P. Steward, K. Vantongelen, J. Verweij, D. Thomas, and A. T. Van Oosterom, Chemotherapy administration and data collection in an EORTC collaborative group—can we trust the results. Eur. J. Cancer 1993; 29A: 943–947.
49. C. Holden, Psychologist made up sex bias results. Science 2001; 294: 2457. 50. Report on 2002 Institutional Annual Report on Possible Research Misconduct. Washington, DC: Office of Research Integrity, August 2003. 51. D. Rennie (ed.), Mark, Dupe, Patsy, Accessory, Weasel, Flatfoot. In: Ethics and Policy in Scientific Publication. Bethesda, MD: Council of Biology Editors, Inc., 1990, pp. 155–174. 52. C. Holden, Planned misconduct surveys meet stiff resistance. Science 2002; 298: 1549. 53. D. S. Greenberg, Misconduct poll prompts fury among scientists. Lancet 2002; 360: 1669. 54. Soft responses to misconduct. Nature 2002; 240: 253. 55. D. Rennie, Misconduct and journal peer review. In: F. Godlee and T. Jefferson (eds.), Peer Review in Health Sciences. London: BMJ Books, 1999, pp. 90–99. 56. R. L. Engler et al., Misrepresentation and responsibility in medical research. N. Engl. J. Med. 1987; 317: 1383–1389. 57. W. P. Whitely, D. Rennie, and A. W. Hafner, The scientific community's response to evidence of fraudulent publication. The Robert Slutsky case. JAMA 1994; 272(2): 170–173. 58. A. Abbott and J. Schwarz, Dubious data remain in print two years after misconduct inquiry. Nature 2002; 418: 113. 59. Retractions' realities. Nature 2003; 422: 1. 60. International Committee of Medical Journal Editors. (2001). Uniform requirements for manuscripts submitted to biomedical journals. (online). Available: http://www.icmje.org/index.html#top. 61. E. Marshall, Panel finds extensive Sudbø fraud. Science 2006; 313: 29. 62. H. C. Sox and D. Rennie, Research misconduct, retraction and cleansing the medical literature: lessons from the Poehlman case. Ann. Intern. Med. 2006; 144: 609–613. 63. P. A. Cola, Follow up to scientific misconduct. Clin. Researcher 2002; 2: 26–27.
CLINICAL TRIALS, EARLY CANCER AND HEART DISEASE
MARVIN A. SCHNEIDERMAN∗
EDMUND A. GEHAN
Georgetown University Medical Center, Washington, DC, USA

Early developments in controlled clinical trials at the National Institutes of Health (NIH) took place mainly at the National Cancer Institute (NCI) and what was then the National Heart Institute (NHI) [subsequently the National Heart, Lung, and Blood Institute (NHLBI)] beginning in the 1950s. This article reviews the developments from the early 1950s to the late 1960s at both institutes, summarizing the early efforts in clinical trials, the organizations set up to conduct and monitor the clinical trials, and the developments in statistical methodology that have formed the basis for conducting many of the present day randomized controlled trials. The early history of clinical trials at these institutes has been reviewed in more detail at NCI by Gehan & Schneiderman and at NHLBI by Halperin et al. (28,32).
1 DEVELOPMENTS IN CLINICAL TRIALS AT THE NATIONAL CANCER INSTITUTE (NCI)
A major advance in the development of chemical agents for the treatment of cancer came from observations of the treatment of children with acute lymphocytic leukemia, which was a rapidly fatal disease until 1948 when Sidney Farber, in a nonrandomized study of methotrexate, observed complete remissions and longer survival among some pediatric patients (21). However, results did not meet with uniform acceptance and questions were raised about diagnosis, selection of patients, and reporting. There was a need for a more organized approach to treatment experimentation that would lead to unbiased evaluations of treatments. At about the same time, animal models of the major forms of cancer—sarcomas, carcinomas, and leukemias—were developed that could be used to screen candidate materials and, if the materials were effective and not overly toxic, ultimately lead to clinical trials in humans. By 1960, there was an annual screening of approximately 25 000–30 000 materials sponsored by NCI with only about 10–20 new agents having sufficient effectiveness in animal systems to merit consideration for testing in humans. Peter Armitage, of the London School of Hygiene and Tropical Medicine, was a visiting scientist at NCI in the late 1950s. His background in sequential statistical procedures quickly found direct application in the development of two- and three-stage screening procedures for animal tumor systems that permitted rejection of an agent at any stage but acceptance only at the final stage (3,43). The object was to determine quickly which new compounds should be considered for further study in man. In the late 1950s, optimism was high that this screening program would lead to a new chemotherapeutic treatment that would make large clinical trials unnecessary. Also, there was a belief that different forms of cancer were sufficiently similar so that an agent active in one form of the disease would also be active in another.
1.1 Early Efforts in Clinical Trials
Dr C. Gordon Zubrod came to NCI in 1954 at about the time that Dr James Holland departed for Roswell Park Memorial Institute in Buffalo, NY. Drs Emil Frei and E.J. Freireich arrived at NCI in 1955. Under the leadership of Zubrod, these clinicians formed the key group who initiated the clinical trials program at NCI. When Zubrod was at Johns Hopkins University in the early 1950s, he indicated that there ‘‘were two streams of influence (relating to developments in clinical trials)—infectious disease chemotherapy and comparative studies of analgesics and hypnotic drugs’’ (52). Among those playing an important role in the conduct of clinical trials at Johns Hopkins were Dr James Shannon (later Director of the National Institutes of Health), the pharmacologist E.K. Marshall, Jr and W.G. Cochran. About this time, the studies of streptomycin in pulmonary tuberculosis by the Medical Research Council were
published and had a profound influence on the Johns Hopkins group (41). The first effort at a randomized trial was a comparison of the efficacy of tetracycline and penicillin in the treatment of lobar pneumonia (5). At the same time, the Veterans Administration began its first randomized controlled trials in tuberculosis (50).
1.2 The Organization of Trials
In 1954, the US Congress created the Cancer Chemotherapy National Service Center (CCNSC) to stimulate research in the chemotherapy of cancer. A clinical panel was formed, headed by Dr I. Ravdin, and included among others Drs Zubrod and Holland. At an early meeting, the clinical panel reviewed a paper by Louis Lasagna, which enunciated five principles of the controlled clinical trial, including randomization and the statistical treatment of data (38). Over the next several years, the clinical panel of the CCNSC oversaw the organization of cooperative clinical trials groups for the conduct of clinical trials in cancer. By 1960, there were 11 cooperative clinical study groups (Table 1), each comprised of a number of universities and/or V.A. Hospitals and Medical Centers and a Statistical Coordinating Center (48). The cooperative groups were funded by the NCI through the Chairman and a Statistical Center. Zubrod recruited the chairman of each group and Marvin Schneiderman recruited the biostatisticians and statistical centers. One of the statisticians, W.J. Dixon, had two graduate students who were writing general statistical programs for the analysis of biomedical data. NCI awarded a contract to carry out this work that subsequently became the Biomedical Data Processing Program (BMDP) package of statistical programs. In the establishment of a clinical cooperative group, CCNSC agreed that there should be adherence to the following principles: combination of data from all institutions to accumulate rapidly the necessary number of patients; standard criteria of diagnosis, treatment, and measurement of effect; statistical design of the study, with a randomized assignment of patients to the groups to be compared; and statistical analysis and collaborative reporting of the results.
The clinical trials effort involved more types of clinical studies than randomized trials. There was a sequence of trials with differing objectives: Phase I—to determine the maximum tolerated dose of a regimen that can be used in looking for therapeutic effect; Phase II—to determine whether a particular dosage schedule of an agent is active enough to warrant further study; and Phase III—a comparative trial, usually randomized, to decide whether a new therapy is superior to a standard therapy. The primary objective of the clinical trials program was to provide a means of testing in humans new agents that had previously demonstrated effectiveness in animal tumor systems.
1.3 Some Early Trials
Following some preliminary discussions between Dr Zubrod and Jerome Cornfield, a leading statistician at NIH, there was agreement that childhood leukemia was an ideal disease for testing some of the new agents objectively. The first randomized cooperative clinical trial in acute leukemia was planned in 1954, begun in 1955, and reported by Frei et al. in 1958 (23). The trial involved two regimens of combination chemotherapy—6-mercaptopurine plus either intermittent or continuous methotrexate—in 65 patients. The study had the following features: a uniform protocol at the four participating institutions; uniform criteria of response; adherence to the principles of the controlled clinical trial, especially the randomization of patients to therapies; and stratification of patients by age, type of leukemia, and history of prior therapy. Statistical methods used were a comparison of median survival times and median duration of remissions between therapies, confidence intervals, and Fisher's exact test. The first randomized clinical trial in solid tumors was conducted by members of the Eastern Solid Tumor Group and reported by Zubrod et al. in 1960 (53). The trial involved a randomized comparison of two alkylating agents (thiotepa vs. nitrogen mustard) in patients with solid tumors. One objective was to ‘‘study the feasibility and usefulness of collaborative clinical research in cancer
chemotherapy’’. The trial involved 258 randomized patients, and notable features were: blind evaluation of response by vote of clinical investigators; objective procedures for measurement of tumors and determination of when a response began and ended; the importance of accounting for type I and type II statistical errors; appropriate sample size for detection of differences between treatments; and statistical analysis in the reporting of results. A subsequent trial demonstrated the value of combination chemotherapy in acute leukemia and the independent action of drugs to increase the probability that a patient achieves complete remission (24). Freireich et al. (25) reported a prospective, randomized, double-blind, placebo-controlled, sequential study of 6-mp vs. placebo in the maintenance of remissions in pediatric acute leukemia. This study established that 6-mp maintenance treatment leads to substantially longer remissions than placebo and was a forerunner to many adjuvant studies in other forms of cancer, such as breast cancer, in which treatments are administered when the patients are in a disease-free state (25). This study also was a motivation for the development of an extension of the Wilcoxon test for comparing survival distributions subject to censoring (27) and was used as an example by Cox in his, now classic, paper on regression models and life tables (16).

Table 1. Cooperative Clinical Study Groups in 1960 (group; chairman; statistician)
Acute leukemia, Group A; M. Lois Murphy; I. Bross
Acute leukemia, Group B; E. Frei; M. Schneiderman
Eastern Solid Tumor Group; C.G. Zubrod; M. Schneiderman
Southeastern Group; R.W. Rundles; B.G. Greenberg
Western Group; F. Willett; E. MacDonald
Southwestern Group; H.G. Taylor; D. Mainland
Prostate Group; H. Brendler; M. Schneiderman
Breast Group A; A. Segaloff; M. Schneiderman
Breast Group B; G. Gordon; M. Patno
V.A. Groups—various malignancies; J. Wolf et al.; R. Stiver
University Groups—lung, breast, stomach, ovary, colon; A. Curreri et al.; G. Beebe, W. Dixon
1.4 Developments in Methodology
In the clinical trials program at NCI prior to 1970, there were several developments in methodology that have influenced the conduct of subsequent clinical trials. Before 1960, the clinical testing of new agents often involved as few as five patients, with the agent discarded if no positive response was obtained in at least one patient. In 1961, Gehan proposed a plan for Phase II trials that determined the minimum number of consecutive patients to study when all patients are nonresponders, before one could reject a new agent for further study, at given levels of rejection error (26). This plan, or now more commonly Simon's modification, continues in use today in Phase II studies (46).
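The arithmetic behind this rule is simple enough to show in a few lines. The sketch below is one assumed reading of the plan as just described, not code from the original paper, and the 20% response rate and 5% rejection error are illustrative choices: the agent is rejected only after enough consecutive nonresponders have been seen that a truly active agent would very probably have produced at least one response.

    # Minimal sketch, assuming the single-stage reading described above:
    # reject the agent after n consecutive nonresponders, where n is the
    # smallest number making a false rejection unlikely for an agent whose
    # true response rate is at least p_target.
    def gehan_n(p_target, rejection_error):
        n = 1
        while (1.0 - p_target) ** n > rejection_error:
            n += 1
        return n

    # An agent with a true response rate of 20% is wrongly rejected with
    # probability at most 0.05 only after 14 consecutive nonresponders.
    print(gehan_n(0.20, 0.05))  # 14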
Several philosophical issues arose from the drug development program. The practice of human experimentation could be questioned by ‘‘Doesn't a physician have an implied duty to give his patient the best treatment? If that is the case, how can one justify having the toss of a coin (i.e. randomization) decide which treatment a patient should receive?'' The reply was (and is), ‘‘If the physician really knows what is the best treatment for the patient, the patient must receive that treatment and not be randomized into a trial.'' The question then becomes, ‘‘How and when does a physician know what is the best treatment for a specific patient?'' The major ethical issue then becomes one of learning quickly (i.e. with a minimum number of patients) what is the best treatment. There have been several proposals for establishing what one ‘‘knows'' while minimizing the number of patients who will receive the less effective treatment. Armitage proposed closed sequential procedures with paired patients on each treatment, and with the trial terminated as soon as one could establish the superiority of one of the treatments over the other (2). A feature of the plans was an upper limit on the number of patients one could enter. Schneiderman & Armitage later described a family of sequential procedures, called wedge plans because of the shape of the acceptance boundary, which provided a bridge between the open plans derived from Wald's theory and the restricted procedures of Armitage (44,45). In the 6-mp vs. placebo study for maintaining remissions in pediatric leukemia, patients were paired according to remission status (complete or partial), one patient receiving 6-mp and the other placebo by a random allocation, and a preference was recorded for 6-mp or placebo depending upon the therapy which resulted in the longer remission. The trial was conducted sequentially according to one of the Armitage plans (2) and a sequential boundary favoring 6-mp was reached after 18 preferences had occurred—15 for 6-mp and 3 for placebo. There were 12 patients still in remission at the time the study was terminated, although one could record a preference for one or the other treatment because the pair-mate had relapsed at an earlier time. It was clear that a more efficient analysis could be obtained by using the actual lengths of remission. Gehan, while working on an NCI fellowship with D.R. Cox at Birkbeck College in London, developed a generalization of the Wilcoxon test for the fixed sample size problem with each sample subject to arbitrary right censoring (27). Halperin had previously developed a generalization of the Wilcoxon test, when all times to censoring were equal to the longest observation time (30). Mantel noticed that one could utilize the chi-square test for comparison of survival data between two or more groups, assuming that one constructs a contingency table of deaths and survivors at each distinct failure time in the groups of patients under study. This chi-square test was appropriate when the risk of failure in one group was a constant multiple of that in the other; this test was an extension of the earlier test developed by Mantel and Haenszel which measured the statistical significance of an observed association between a disease and a factor under study in terms of an increased relative risk of disease (39,40). This test subsequently became known variously as the Mantel–Haenszel test, the logrank test, or the Cox–Mantel test, and has been studied by Cox and Peto, among others (16,42).
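The contingency-table construction just described can be illustrated with a small, self-contained computation. The data and the function name below are hypothetical; the code simply builds the table of deaths and survivors at each distinct death time and accumulates observed-minus-expected deaths in one group, which is the logrank (Mantel–Haenszel) form described in the text.

    # Hedged illustration of the logrank (Mantel-Haenszel) statistic described
    # above, on made-up data. Each record is (time, event, group), with event 1
    # for a death and 0 for a censored observation, and group 0 or 1.
    from math import sqrt

    def logrank_z(records):
        death_times = sorted({t for t, e, g in records if e == 1})
        o_minus_e = 0.0
        variance = 0.0
        for t in death_times:
            at_risk = [r for r in records if r[0] >= t]                 # still under observation
            n = len(at_risk)
            n1 = sum(1 for _, _, g in at_risk if g == 1)                # at risk in group 1
            d = sum(1 for tt, e, _ in at_risk if e == 1 and tt == t)    # deaths at time t
            d1 = sum(1 for tt, e, g in at_risk if e == 1 and tt == t and g == 1)
            o_minus_e += d1 - d * n1 / n
            if n > 1:
                variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        return o_minus_e / sqrt(variance)   # approximately standard normal under the null

    control = [(6, 1, 0), (7, 1, 0), (9, 0, 0), (13, 1, 0), (15, 1, 0)]
    treated = [(10, 1, 1), (12, 0, 1), (16, 1, 1), (20, 0, 1), (22, 0, 1)]
    print(round(logrank_z(control + treated), 2))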
Another development in the 1960s was the exponential regression model proposed by Feigl & Zelen (22). Dr Robert Levin of NCI was interested in studying the relationship of the survival time of leukemia patients to the concomitant variate of white blood count, separately according to the presence or absence of Auer rods and/or significant granulature of leukemia cells in the bone marrow at diagnosis. Feigl & Zelen proposed a model in which an exponential survival distribution is postulated for each patient and the expected value of the survival time is linearly related to the patient's white blood count. A more general loglinear model was subsequently given by Glasser (29), and there have been numerous subsequent developments in parametric regression models with censored survival data (17, Chapters 5 and 6, pp. 62–90).
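A compact way to see what this model asks of the data is to write down its likelihood and fit it. The sketch below does so for uncensored observations under the stated assumption that each survival time is exponential with mean equal to a linear function of white blood count; the white-cell values, survival times, and starting values are invented for illustration, and this is not the authors' own analysis.

    # Sketch of the exponential regression model described above (hypothetical
    # data): T_i ~ exponential with mean mu_i = a + b * x_i, fitted by maximum
    # likelihood for uncensored survival times.
    import numpy as np
    from scipy.optimize import minimize

    x = np.array([2.3, 3.1, 3.8, 4.4, 4.9, 5.5])      # e.g. log white blood counts
    t = np.array([65.0, 40.0, 30.0, 16.0, 8.0, 4.0])  # survival times in weeks

    def neg_log_likelihood(params):
        a, b = params
        mu = a + b * x
        if np.any(mu <= 0.0):                 # the linear mean must stay positive
            return np.inf
        return float(np.sum(np.log(mu) + t / mu))

    fit = minimize(neg_log_likelihood, x0=[60.0, -5.0], method="Nelder-Mead")
    print(fit.x)                              # fitted intercept and slope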
2 DEVELOPMENTS IN CLINICAL TRIALS AT THE NATIONAL HEART, LUNG, AND BLOOD INSTITUTE (NHLBI)
Prior to 1960, the National Heart Institute (NHI), subsequently to become the NHLBI, had little involvement in multicenter clinical trials. In a trial designed in 1951, there was a comparison of ACTH, cortisone, and aspirin in the treatment of rheumatic fever and the prevention of rheumatic heart disease. A total of 497 children were enrolled in 12 centers in the UK, the US, and Canada. Felix Moore, then Chief of the Biometrics Section of NHI, was a statistical consultant. There were no differences in treatment effectiveness in the study, and no statistical or methodologic problems were mentioned in the final report (8). Subsequently, there was a multicenter observational study of lipoproteins in atherosclerosis that had substantial impact on the methodology for coordinating studies performed at several sites (47). The Statistical Center was led by Felix Moore and Tavia Gordon at NHI. Careful quality control procedures and standardization of methods across centers were emphasized.
2.1 Early Efforts in Clinical Trials
Jerome Cornfield joined the NHI in 1960 and strongly influenced the conduct of clinical trials at NHI and statistical research on methodologic issues arising in clinical trials. In the early 1960s, intensive planning for two clinical trials was begun at NHI to reduce risk factors for coronary heart disease—the Diet Heart Feasibility Study (DHFS) and the Coronary Drug Project (CDP) (14,20). These studies reflected the strong interest in both dietary and drug approaches to the prevention of coronary heart disease and the recurrence of myocardial infarction. For the DHFS, the NHI Biometrics Branch served as the statistical coordinating center, first under the supervision of Joseph Schachter and later of Fred Ederer. Max Halperin rejoined the NHI in 1966 and, upon Cornfield's retirement in 1968, became Chief of the Biometrics Research Branch until his retirement in 1977. Four areas of clinical trials and methodology can be traced to these early studies and the individuals responsible for them. These areas are: organizational structure for clinical trials at NIH; methodology for the interim analysis of accumulating data, including the Bayesian approach, group sequential and stochastic curtailment methods; design and analysis of clinical trials, including the effects of patient noncompliance on power and the intention to treat principle; and methods for analysis of data from longitudinal clinical trials.
2.2 The Organization of NHLBI Trials
The ‘‘NHLBI Model'' for cooperative clinical trials evolved from discussion during the planning stage of the CDP among outside medical experts and NHI medical and statistical staff. In 1967, a report by a committee appointed by the National Advisory Heart Council and chaired by Bernard Greenberg
described this structure (35). The report, subsequently known as the ‘‘Greenberg Report'', became the basis for the structure of nearly all subsequent NHLBI trials as well as for many other trials sponsored at NIH. The major components of the organizational structure include a Steering Committee, a Policy Advisory Board, a Data Monitoring Committee, and a Statistical or Data Coordinating Center, as well as individual clinics, central laboratories, and various other committees which served the needs of the trial. These might include committees to develop eligibility criteria, to assign cause of death, to define methodology and standards, or to oversee the preparation of manuscripts. From the biostatistical viewpoint, the Data Monitoring Committee has the responsibility of monitoring accumulating data on a periodic basis and analyzing results for evidence of early benefit or harm. Primary and secondary outcome measures are reviewed, along with safety data, compliance to the protocol, and subgroup analyses which may identify particular risk groups. The Statistical Coordinating Center and the Data Monitoring Committee work closely together in performing the appropriate data analyses needed for fulfilling the Committee's responsibilities. The Statistical and Data Coordinating Centers for early trials at the NHLBI are given in Table 2. Personnel at these coordinating centers have played an important role in the development of clinical trials and made numerous contributions to statistical methodology.
Table 2. Early NHLBI Coordinating Centers
University of Maryland/Maryland Research Institute: Coronary Drug Project
University of Texas School of Public Health: Hypertension Detection and Follow-up Program
University of North Carolina—Chapel Hill, School of Public Health: Lipid Research Clinical Program
University of Minnesota School of Public Health, Biometry Division: Multiple Risk Factor Intervention Trial
University of Washington School of Public Health, Biostatistics Department: Coronary Artery Surgery Study
George Washington University Biostatistics Center: Intermittent Positive Pressure Breathing Trial
NHLBI Biometrics Research Branch: National Diet Heart Feasibility Study, Urokinase Pulmonary Embolism Trial, Urokinase Streptokinase Pulmonary Embolism Trial
2.3 Developments in Methodology
These are considered under three headings: data monitoring, design and analysis, and longitudinal studies.
2.3.1 Data Monitoring
Jerome Cornfield was involved in the planning and conduct of two clinical trials—the DHFS and the CDP. Both Cornfield and Halperin served on the Data and Safety Monitoring Committee of the CDP. At least partly motivated by his involvement in these trials, Cornfield published papers in 1966 on sequential trials, sequential analysis, and the likelihood principle, from a Bayesian perspective (9,10). In 1966, Max Halperin worked jointly with Cornfield and Samuel Greenhouse (then at the National Institute of Mental Health) to develop an adaptive allocation procedure that would assign an increasing proportion of patients to the better of two treatments as evidence accumulated (13). Their approach to the problem was Bayesian and generalized the earlier work of Anscombe and Colton (1,7). At around the same time, Cornfield published a general paper on the Bayesian approach that involved the use of a prior probability distribution with a mass of probability P at the null hypothesis, with a continuous density of total mass 1 − P over a set of alternative hypotheses (11). A key feature of Cornfield's proposal was the rejection of the null hypothesis when the posterior odds (the relative betting odds or RBO) became small for H0. The RBO was used in the CDP in the monitoring of mortality differences between the control and each of the drug treatment groups. Subsequently, Canner, of the CDP Coordinating Center, considered the determination of critical values for decision making at multiple time points during the conduct of the clinical trial from the Neyman–Pearson perspective (6). Later, curtailment and stochastic curtailment methods were developed and applied to trials of the NHLBI in the 1970s and early 1980s (19,31,34,37).
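A toy version of the relative betting odds makes the idea concrete. The prior mass on the null, the prior spread under the alternative, and the observed mortality difference below are all invented for illustration; they are not the priors or data used in the CDP.

    # Hedged sketch of posterior odds on the null with a point-mass prior, as
    # described above. All numbers are illustrative.
    from scipy.stats import norm

    def relative_betting_odds(effect_estimate, std_error, prior_mass_null=0.5, tau=0.5):
        # Likelihood of the estimate under H0 (true effect 0) and its marginal
        # likelihood under H1 (true effect drawn from N(0, tau^2)).
        like_h0 = norm.pdf(effect_estimate, loc=0.0, scale=std_error)
        like_h1 = norm.pdf(effect_estimate, loc=0.0, scale=(std_error**2 + tau**2) ** 0.5)
        prior_odds = prior_mass_null / (1.0 - prior_mass_null)
        return prior_odds * like_h0 / like_h1

    # An estimated difference 2.5 standard errors from zero moves the betting
    # odds well away from even money on the null.
    print(round(relative_betting_odds(0.25, 0.10), 3))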
Statisticians working with the CDP were aware that, as the data accumulated, repeated testing for treatment differences using conventional statistical significance levels would increase the type I error over the nominal alpha level associated with that critical value. Armitage et al. evaluated the impact of repeated testing on the type I error and demonstrated that multiple tests could increase the type I error substantially (4). Interim analyses of clinical data are necessary for scientific and ethical reasons, but large type I errors are not acceptable. Canner developed a method for the CDP for determining the critical value at each interim analysis so that the overall type I error is close to the desired level (6). Statisticians involved with NHLBI trials developed group sequential methods and applied them to trials starting with the CDP.
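The size of the inflation is easy to reproduce by simulation. The sketch below, on synthetic data under the null with the numbers of looks and patients chosen arbitrarily, tests at the conventional two-sided 0.05 level at each of five equally spaced looks and records how often at least one look is ‘‘significant''; the answer is roughly three times the nominal level, which is the phenomenon Armitage et al. quantified.

    # Monte Carlo sketch of type I error inflation under repeated testing, as
    # described above. Illustrative settings only.
    import numpy as np

    rng = np.random.default_rng(0)
    n_per_look, looks, reps = 20, 5, 20000
    rejections = 0

    for _ in range(reps):
        data = rng.standard_normal(n_per_look * looks)   # no true treatment effect
        for k in range(1, looks + 1):
            seen = data[: n_per_look * k]
            z = seen.mean() / (seen.std(ddof=1) / np.sqrt(seen.size))
            if abs(z) > 1.96:                            # nominal 0.05 two-sided test
                rejections += 1
                break

    print(rejections / reps)   # roughly 0.14, not 0.05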
2.3.2 Design and Analysis
In the DHFS, it was projected that a reduction in cardiovascular risk would result from a reduction in cholesterol level. The original sample size projection was for the entry of 8000 patients into several treatment arms. Although a special review committee suggested that this sample size might be too large, Cornfield argued that there were too many inconclusive small studies already in the literature. Several aspects of the trial required consideration, including noncompliance with the treatment regimen. It was presumed that the maximum effect on risk would occur only after some period of time on treatment and that failure to adhere to the treatment regimen could mean a return to higher risk levels. Halperin et al. (33) incorporated these considerations into the design of clinical trials by proposing methods for adjusting sample size for noncompliance in the treatment group. Studies were considered with a fixed period of observation and a comparison of proportions as the main analysis. Implicit in this paper is the ‘‘intention to treat'' principle, i.e. analysis of all randomized patients in their assigned treatment group regardless of compliance. Ultimately, the report of the CDP recognized this point (15). Most primary and secondary prevention trials conducted by the NHLBI since 1970 have made use of sample size adjustments for noncompliance. The Framingham Heart Study was begun in 1948 and has had an important influence on methodologic research at the NHLBI and the design of prevention trials. Over 5000 adult residents of Framingham, Massachusetts, were entered into a longitudinal study with the objective of evaluating the effects of various risk factors on the development of subsequent cardiovascular disease. The study has clarified the roles of high blood pressure, elevated total serum cholesterol, and cigarette smoking on the risk of cardiovascular disease (18,36). Occurrence or not of a cardiovascular event in a 2-year follow-up period is a binary outcome. Cornfield considered a regression approach to deal with the binary outcome variables. The problem was closely related to the discrimination problem between two samples from multivariate normal distributions. For a specific prior probability of belonging or not to a disease group, the posterior probability could be represented as a logistic regression function that was closely related to what could be obtained from a conventional discriminant function analysis (49). Cornfield & Mitchell argued that one could use the logistic model to predict the impact on risk of specified changes in risk factors (12). Subsequently, this logistic model approach was used in the design of several NHLBI prevention trials.
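The effect of noncompliance on the required sample size can be sketched with a first-order calculation in the same spirit, although it is not the specific method of Halperin et al.: under an intention-to-treat comparison of proportions, patients in the treated arm who do not comply are assumed here to experience the control event rate, which dilutes the observed difference and inflates the required sample size. The event rates and noncompliance fraction below are invented.

    # Hedged sketch: two-proportion sample size, then the same calculation with
    # the treated-arm rate diluted by an assumed noncompliance fraction.
    from scipy.stats import norm

    def n_per_arm(p_control, p_treated, alpha=0.05, power=0.90):
        z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
        p_bar = (p_control + p_treated) / 2
        term = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
                + z_b * (p_control * (1 - p_control) + p_treated * (1 - p_treated)) ** 0.5)
        return term ** 2 / (p_control - p_treated) ** 2

    p_control, p_treated, noncompliance = 0.20, 0.15, 0.25
    p_treated_itt = (1 - noncompliance) * p_treated + noncompliance * p_control
    print(round(n_per_arm(p_control, p_treated)))       # full compliance
    print(round(n_per_arm(p_control, p_treated_itt)))   # diluted effect: far larger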
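The link between the discriminant problem and the logistic form can be checked numerically. In the sketch below the means, covariance matrix, and prior disease probability are invented; the point is only that, for two multivariate normal groups sharing a covariance matrix, the Bayes posterior probability of disease equals a logistic function of a linear score, which is the representation described above.

    # Numerical check of the logistic/discriminant connection described above,
    # using invented parameters for two risk factors.
    import numpy as np
    from scipy.stats import multivariate_normal

    mu0 = np.array([120.0, 200.0])          # disease-free means (e.g. BP, cholesterol)
    mu1 = np.array([135.0, 240.0])          # diseased means
    cov = np.array([[225.0, 80.0], [80.0, 900.0]])
    prior1 = 0.10                           # prior probability of disease

    beta = np.linalg.solve(cov, mu1 - mu0)  # discriminant-function coefficients
    beta0 = (np.log(prior1 / (1 - prior1))
             - 0.5 * (mu1 @ np.linalg.solve(cov, mu1) - mu0 @ np.linalg.solve(cov, mu0)))

    x = np.array([130.0, 230.0])            # one individual's risk-factor values
    f1 = multivariate_normal.pdf(x, mean=mu1, cov=cov)
    f0 = multivariate_normal.pdf(x, mean=mu0, cov=cov)
    bayes_posterior = prior1 * f1 / (prior1 * f1 + (1 - prior1) * f0)
    logistic_form = 1.0 / (1.0 + np.exp(-(beta0 + x @ beta)))
    print(bayes_posterior, logistic_form)   # identical up to rounding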
2.3.3 Longitudinal Studies
A methodology for analysis of longitudinal data was needed for the Framingham Study, which could be considered both a cohort and a longitudinal study. Cohorts of individuals were followed to observe patterns of morbidity and mortality, and biennial measurements of cardiovascular risk factors provided an opportunity to study patterns relating to aging. Early reports of the Framingham study used simple graphical and descriptive methods to describe patterns of aging. During the 1980s, there was much work on methodology for longitudinal studies that ultimately led to NHLBI sponsorship of a workshop on methods for analysis of longitudinal and follow-up studies, whose proceedings have appeared as a special issue of Statistics in Medicine (51).
REFERENCES
1. Anscombe, F. J. (1963). Sequential medical trials, Journal of the American Statistical Association 58, 365–383. 2. Armitage, P. (1957). Restricted sequential procedures, Biometrika 44, 9–26. 3. Armitage, P. & Schneiderman, M. (1958). Statistical problems in a mass screening program, Annals of the New York Academy of Science 76, 896–908. 4. Armitage, P., McPherson, C. K. & Rowe, B. C. (1969). Repeated significance tests on accumulating data, Journal of the Royal Statistical Society, Series A 132, 235–244. 5. Austrian, R., Mirick, G., Rogers, D., Sessoms, S. M., Tumulty, P. A., Vickers, W. H., Jr. & Zubrod, C. G. (1951). The efficacy of modified oral penicillin therapy of pneumococcal lobar pneumonia, Bulletin of Johns Hopkins Hospital 88, 264–269. 6. Canner, P. L. (1977). Monitoring treatment differences in long-term clinical trials, Biometrics 33, 603–615. 7. Colton, T. (1963). A model for selecting one of two medical treatments, Journal of the American Statistical Association 58, 388–400. 8. Cooperative Clinical Trial of ACTH, Cortisone and Aspirin in the Treatment of Rheumatic Fever and the Prevention of Rheumatic Heart Disease (October 1960). Circulation 22. 9. Cornfield, J. (1966). Bayesian test of some classical hypotheses—with applications to sequential clinical trials, Journal of the American Statistical Association 61, 577–594. 10. Cornfield, J. (1966). Sequential trials, sequential analysis, and the likelihood principle, American Statistician 20, 18–23.
11. Cornfield, J. (1969). The Bayesian outlook and its application, Biometrics 25, 617–657. 12. Cornfield, J. & Mitchell, S. (1969). Selected risk factors in coronary disease. Possible intervention effects, Archives of Environmental Health 19, 382–394. 13. Cornfield, J., Halperin, M. & Greenhouse, S. (1969). An adaptive procedure for sequential clinical trials, Journal of the American Statistical Association 64, 759–770. 14. Coronary Drug Project Research Group (1973). The Coronary Drug Project. Design, methods, and baseline results, Circulation 47, Supplement 1, 179. 15. Coronary Drug Research Group (1980). Influence of adherence to treatment and response of cholesterol on mortality in the Coronary Drug Project, New England Journal of Medicine 303, 1038–1041. 16. Cox, D. R. (1972). Regression models and life tables (with discussion), Journal of the Royal Statistical Society, Series B 34, 187–220. 17. Cox, D. R. & Oakes, D. (1984). Analysis of Survival Data. Chapman & Hall, London. 18. Dawber, T. R., Meadors, G. F. & Moor, F. E. (1951). Epidemiological approaches to heart disease: the Framingham Study, American Journal of Public Health 41, 279–286. 19. DeMets, D. L. & Halperin, M. (1981). Early stopping in the two-sample problem for bounded variables, Controlled Clinical Trials 3, 1–11. 20. Diet–Heart Feasibility Study Research Group (1968). The National Diet–Heart Study Final Report, Circulation 37, Supplement 1, 428. 21. Farber, S., Diamond, L. K., Mercer, R., Sylvester, R. F. Jr. & Wolff, J. A. (1948). Temporary remissions in children produced by folic acid antagonist aminopterin, New England Journal of Medicine 238, 787–793. 22. Feigl, P. & Zelen, M. (1965). Estimation of exponential survival probabilities with concomitant information, Biometrics 21, 826–838. 23. Frei, E., III, Holland, J. F., Schneiderman, M. A., Pinkel, D., Selkirk, G., Freireich, E. J., Silver, R. T., Gold, G. L. & Regelson, W. (1958). A comparative study of two regimens of combination chemotherapy in acute leukemia, Blood 13, 1126–1148. 24. Frei, E., III, Freireich, E. J., Gehan, E. A., Pinkel, D., Holland, J. F., Selawry, O., Haurani, F., Spurr, C. L., Hayes, D. M., James, W., Rothberg, H., Sodee, D. B., Rundles, W., Schroeder, L. R., Hoogstraten, B., Wolman, I. J., Tragis, D. G., Cooper, T., Gendel, B.
R., Ebaugh, F. & Taylor, R. (1961). Studies of sequential and combination antimetabolite therapy in acute leukemia: 6-mercaptopurine and methotrexate, Blood 18, 431–454. 25. Freireich, E. J., Gehan, E. A., Frei, E., III, Schroeder, L. R., Wolman, I. J., Anbari, R., Bergert, O., Mills, S. D., Pinkel, D., Selawry, O. S., Moon, J. H., Gendel, B. R., Spurr, C. L., Storrs, R., Haurani, F., Hoogstraten, B. & Lee, S. (1963). The effect of 6-mercaptopurine on the duration of steroid-induced remissions in acute leukemia: A model for evaluation of other potentially useful therapy, Blood 21, 699–716. 26. Gehan, E. A. (1961). The determination of the number of patients required in a preliminary and follow-up trial of a new chemotherapeutic agent, Journal of Chronic Diseases 13, 346. 27. Gehan, E. A. (1965). A generalized Wilcoxon test for comparing arbitrarily singly-censored samples, Biometrika 52, 203–223. 28. Gehan, E. A. & Schneiderman, M. A. (1990). Historical and methodological developments in clinical trials at the National Cancer Institute, Statistics in Medicine 9, 871–880. 29. Glasser, M. (1967). Exponential survival with covariance, Journal of the American Statistical Association 62, 561–568. 30. Halperin, M. (1960). Extension of the Wilcoxon-Mann-Whitney test to samples censored at the same fixed point, Journal of the American Statistical Association 55, 125–138. 31. Halperin, M. & Ware, J. (1974). Early decision in a censored Wilcoxon two-sample test for accumulating survival data, Journal of the American Statistical Association 69, 414–422. 32. Halperin, M., DeMets, D. L. & Ware, J. H. (1990). Early methodological developments for clinical trials at the National Heart Lung and Blood Institute, Statistics in Medicine 9, 881–882. 33. Halperin, M., Rogot, E., Gurian, J. & Ederer, F. (1968). Sample sizes for medical trials with special reference to long term therapy, Journal of Chronic Diseases 21, 13–24. 34. Halperin, M., Ware, J., Johnson, N. J., Lan, K. K. & Demets, D. (1982). An aid to data monitoring in long-term clinical trials, Controlled Clinical Trials 3, 311–323. 35. Heart Special Project Committee (1988). Organization, review, and administration of cooperative studies (Greenberg Report): A report from the Heart Special Project Committee to the National Advisory Heart Council, May 1967, Controlled Clinical Trials 9, 137–148.
36. Kannel, W. B., Dawber, T. R., Kagan, A., Nevotskie, N. & Stokes, J. (1961). Factors of risk in the development of coronary heart disease—six year followup experience: the Framingham Study, Annals of Internal Medicine 55, 33–50. 37. Lan, K. K. G., Simon, R. & Halperin, M. (1982). Stochastically curtailed tests in long-term clinical trials, Communications in Statistics—Stochastic Models 1, 207–219. 38. Lasagna, L. (1955). The controlled clinical trial: theory and practice, Journal of Chronic Diseases 1, 353–358. 39. Mantel, N. (1966). Evaluation of survival data and two new rank order statistics arising in its consideration, Cancer Chemotherapy Reports 50, 163–170. 40. Mantel, N. & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease, Journal of the National Cancer Institute 22, 719–748. 41. Medical Research Council (1948). Streptomycin treatment of pulmonary tuberculosis, British Medical Journal 2, 769–783. 42. Peto, R. & Peto, J. (1972). Asymptotically efficient rank invariant test procedures (with discussion), Journal of the Royal Statistical Society, Series A 135, 185–206. 43. Schneiderman, M. A. (1961). Statistical problems in the screening search for anti-cancer drugs by the National Cancer Institute of the United States, in Quantitative Methods in Pharmacology. North-Holland, Amsterdam. 44. Schneiderman, M. A. & Armitage, P. (1962). A family of closed sequential procedures, Biometrika 49, 41–56. 45. Schneiderman, M. A. & Armitage, P. (1962). Closed sequential t-tests, Biometrika 49, 359–366. 46. Simon, R. (1989). Optimal two stage designs for Phase II trials, Controlled Clinical Trials 10, 1–10. 47. Technical Group and Committee on Lipoproteins and Atherosclerosis (1956). Evaluation of serum lipoproteins and cholesterol measurements as predictors of clinical complications of atherosclerosis, Circulation 14, 691–742. 48. The National Program of Cancer Chemotherapy Research (1960). Cancer Chemotherapy Reports 1, 5–34. 49. Truett, J., Cornfield, J. & Kannel, W. B. (1967). A multivariate analysis of the risk factors of coronary heart disease in Framingham, Journal of Chronic Diseases 20, 511–524.
50. Tucker, W. B. (1954). Experiences with controls in the study of the chemotherapy of tuberculosis, Transactions of the 13th Veterans Administration Conference on the Chemotherapy of Tuberculosis, Vol. 15. 51. Wu, M., Wittes, J. T., Zucker, D. & Kusek, J. eds (1988). Proceedings of the Workshop on Methods for Longitudinal Data Analysis in Epidemiological and Clinical Studies, Statistics in Medicine 7, 1–361. 52. Zubrod, C. G. (1982). Clinical trials in cancer patients: an introduction, Controlled Clinical Trials 3, 185–187. 53. Zubrod, C. G., Schneiderman, M., Frei, E., III, Brindley, C., Gold, G. L., Shnider, B., Oviedo, R., Gorman, J., Jones, R., Jr, Jonsson, U., Colsky, J., Chalmers, T., Ferguson, B., Dederick, M., Holland, J., Selawry, O., Regelson, W., Lasagna, L. & Owens, A. H., Jr (1960). Appraisal of methods for the study of chemotherapy of cancer in man: Comparative therapeutic trial of nitrogen mustard and thiophosphoramide, Journal of Chronic Diseases 11, 7–33.
CASE STUDIES: OVER-THE-COUNTER DRUGS
ERIC P. BRASS Harbor-UCLA Center for Clinical Pharmacology Torrance, California
Switching a drug from prescription to over-the-counter (OTC) status is based on evidence that consumers can use the drug safely and effectively in the absence of a healthcare professional (1). The OTC drug label is the key source of information for consumers when deciding whether to use a specific OTC drug and how to self-manage their condition when using the drug. Manufacturers must provide data supporting the claim that consumers will use the drug properly in the unsupervised OTC setting. Specifically, manufacturers submit clinical research data to regulatory authorities demonstrating that consumers can understand the OTC drug label and will heed the instructions contained in the label. Label comprehension studies evaluate the ability of consumers to understand key communication objectives as presented in the proposed OTC label (2, 3). Self-selection and actual use studies test the ability of consumers to decide if the OTC drug is appropriate for their use based on their individual health history and whether they can self-manage the treatment course. The design, conduct, and interpretation of these studies pose unique challenges, some of which are illustrated in our case examples. Although some label development studies have been published (4–6), many examples are available from the deliberations of Food and Drug Administration (FDA) Advisory Committee meetings. Importantly, each OTC clinical program must be individually designed to meet the specific issues associated with the specific OTC candidate.
1 LABEL COMPREHENSION STUDIES
Construction of questions or instruments to assess label comprehension poses particular challenges. These questions must probe a consumer's understanding of OTC label messages without cueing particular responses. Open-ended questions may introduce the least bias but may be difficult to objectively score. Some of these issues will be illustrated using data from the label development program supporting the prescription-to-OTC switch of omeprazole (Prilosec) as presented to the FDA's Nonprescription Drugs Advisory Committee (7). An early version of the OTC Prilosec label included a warning that the consumer should "Ask a doctor before use if you . . . are taking . . . phenytoin (seizure medicine)" (7) because of concerns about a potential drug–drug interaction. In a label comprehension study, participants were provided the proposed label and were asked: "You suffer from seizures and are taking a medicine called Phenytoin to help control your seizures. You also routinely suffer from heartburn several times per week. You have just heard about this new product, Prilosec, for the prevention and relief of heartburn. If you were the person described in this situation and you wanted to use Prilosec to prevent or treat your heartburn, what would you do now?" (7). This question, which was asked of 504 participants, includes a scenario designed to assess comprehension of the warning against concomitant use. To focus on this issue, the scenario provides a clinical indication for the use of Prilosec. Approximately 90% of participants responded that they would check with their doctor or would not use the Prilosec. Both of these answers are acceptable in that they are consistent with the intent of the warning. A low-literacy subset of the participants performed in a manner similar to the total study population. These results are encouraging and suggest that the label would prevent concomitant use of phenytoin and omeprazole based on effective communication of this specific message. However, the same participants were asked a similar question but with Prozac, a drug not mentioned on the label, as a possible concomitant medication. Over half the participants again indicated that they would not use the Prilosec or would consult with their
doctor first. This conservative response by consumers is reassuring from a public health perspective, but it suggests that the excellent response to the phenytoin scenario was less a measure of comprehension than of a more universal tendency to give a safe response in the testing scenario. The communication of the concomitant medication warnings for Prilosec was further studied in a group of individuals who had label contraindications for Prilosec use. These people were provided the label and asked, "If you were a heartburn sufferer and you wanted to relieve heartburn, would it be OK for you to use Prilosec or not?" (7). Note that, in contrast to the phenytoin scenario, this question is very focused and applies to the individual participant. Over 40% of participants with a label contraindication responded that it would be OK to use, in contrast to the label's intent. Attempts to use open-ended questions to elicit an understanding of why the incorrect responders felt that they could use the drug yielded answers that ranged from "Don't know" to "Warning labels don't apply to me" (7). The Prilosec example illustrates the challenge in motivating consumers to heed clearly communicated label warnings. Comprehension of the core message is necessary but not sufficient for an OTC label to ensure proper drug use by consumers. Thus, self-selection and actual use studies are required to assess consumer behaviors. The poor intent-to-heed rate in the Prilosec case would represent a barrier to OTC approval if the phenytoin–omeprazole interaction resulted in substantial risk to consumers. Moreover, it illustrates how label comprehension study questions must be designed to truly assess label communication as opposed to general consumer insights into expected responses (for example, "I'd ask my doctor"). Ultimately, Prilosec was approved for OTC marketing. The final label has a more limited concomitant medication warning excluding phenytoin and focusing on drugs where a more clinically relevant drug–drug interaction might occur.
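For endpoints of this kind, the observed response rate is typically reported with a confidence interval and judged against a prespecified performance standard (see the Conclusions). A minimal sketch of that calculation follows, assuming illustrative counts consistent with the roughly 90% acceptable-response rate among the 504 participants described above; the 85% threshold is purely hypothetical.

from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """Two-sided Wilson score confidence interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Illustrative counts consistent with the Prilosec study described above:
# roughly 90% of 504 participants gave an acceptable response.
n_participants = 504
n_acceptable = 454        # hypothetical count (~90%)
threshold = 0.85          # hypothetical prespecified performance standard

lower, upper = wilson_ci(n_acceptable, n_participants)
print(f"observed rate = {n_acceptable / n_participants:.3f}")
print(f"95% CI = ({lower:.3f}, {upper:.3f})")
print("meets threshold" if lower > threshold else "does not meet threshold")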
2 ACTUAL USE STUDIES Most proposed OTC drugs require a unique set of self-selection and overall self-management behaviors by consumers. Thus, the design of each actual use trial must incorporate assessments designed to evaluate those aspects most relevant to the specific OTC drug being studied. 2.1 Orlistat Orlistat is a drug that was approved for prescription marketing in the United States for obesity management, including weight loss and weight maintenance. It works as an inhibitor of lipases in the gastrointestinal tract, and thus its proper use requires that it be taken with meals in patients compliant with a low-fat diet (8). Additionally, as it may inhibit absorption of other compounds from the intestine, including fat-soluble vitamins, it is recommended that orlistat users also take a multivitamin at least 2 hours before or after meals. Thus, when orlistat was proposed for a switch to OTC status, the self-selection and actual use trials were an important aspect of the FDA’s evaluation (9). The actual use study for orlistat was conducted at 18 pharmacies with a wide geographic distribution (9). Study participants were recruited through in-store and newspaper advertisements. Interested consumers were shown the orlistat package and told: ‘‘Imagine you are in a store and this is a new over-the-counter medicine. You can take as much time as you need to look at the packaging. Let me know when you are finished.’’ They were then asked the selfselection question: ‘‘Do you think this medication is appropriate for you to use?’’ After answering the self-selection question, they were asked, ‘‘The cost of this medication is $45 for a bottle of 90 capsules. Would you like to purchase the medicine today?’’ Each bottle would provide up to 30 days of treatment, and participants were allowed to buy up to three bottles at a time. Consumers who purchased orlistat were followed by periodic phone contact, which used a structured question-based interview. Additionally, participants could return to the study pharmacy to purchase additional orlistat. They were followed for up to 90 days of treatment.
Figure 1. Flow of research subjects in orlistat OTC actual use study. Of the 703 participants screened at sites, 237 were included in the core analysis of longitudinal use behaviors. However, other cohorts provided useful information on the effectiveness of other label elements. Data from Feibus (9).
In the orlistat actual use study, of 681 eligible participants who examined the package, 543 said the product was appropriate for their use, 339 wanted to purchase the product, 262 actually purchased the product, and 237 of the purchasers had evaluable behavioral data (Figure 1). This numerical characterization of the population highlights an important aspect of data analysis for actual use studies: which denominator should be used? There is no generalized answer as the population should be the most relevant to the question/hypothesis being examined. Most often this is either the self-selection population or the evaluable purchasers. However, there may be exceptions. For example, if the label’s ability to deter use by consumers with a contraindication is being tested, the denominator may be the number of individuals with the contraindication who evaluated the product, and the numerator those that self-selected or purchased the drug. This is more meaningful than simply expressing the percentage of purchasers with the contraindication, which may be misleadingly low if a small percentage of the evaluating individuals had the contraindication. As in any study, omission
of a portion of the study population from the analysis for any reason raises questions as to the robustness of the data set. Understanding how orlistat would be used in an unsupervised setting was a major objective of the actual use trial. Key behaviors studied included whether orlistat was used with meals (>95% used with meals), whether the maximal daily dose was exceeded (<3% exceeded the dose), whether consumers would use orlistat chronically as required to obtain benefit (46% of participants were still using orlistat after 90 days), and whether consumers were supplementing therapy with multivitamins properly (<50% were using multivitamin as per label). The latter two results raised important questions. The low adherence with chronic therapy is suboptimal, but adherence may be low in the prescription setting as well (10). The incorrect use of multivitamins led to recommendations that labeling for these directions be improved. Many consumer subsets of interest were underrepresented in the orlistat actual use study. For example, only two participants were taking cyclosporine, a drug known to
interact with orlistat. Despite a label contraindication, one of the two cyclosporine users elected to purchase orlistat. Similarly, seven of 14 warfarin users elected to purchase orlistat contrary to label directions. These data led to label modifications and more intensive studies of cohorts with concomitant drug use to optimize heeding of the label directions. Additionally, only 4% of participants in the study were of low literacy, raising concerns about whether the results could be generalized. These results were of concern during advisory committee deliberations and led to recommendations for further testing and label refinements. Thus, the orlistat actual use study addressed a number of important issues germane to regulatory decision making, and guided further label development and research when the results identified nonoptimal consumer behaviors. 2.2 Lovastatin Statins have been considered for OTC use based on their proven efficacy in reducing cardiovascular morbidity and mortality, and on the large number of patients who are eligible for statin therapy based on guidelines but are not receiving therapy (11). However, use of a statin in the OTC setting requires consumers to comply with a complex selfmanagement paradigm that includes making proper self-selection based on knowledge of their serum cholesterol concentrations and cardiovascular risk factors, triaging to more intensive therapy based on their response to OTC statin therapy, avoiding drug interactions, and recognizing possible statin muscle toxicity. Aspects of OTC statin therapy have been studied in a number of actual use studies, including the Consumer Use Study of Over-the-Counter Lovastatin (CUSTOM) study (12). The CUSTOM study recruited to study sites over 3300 consumers interested in self-management of their cholesterol. These consumers were given the opportunity to purchase OTC lovastatin (20 mg) after reviewing the drug package, which included the proposed OTC label. This label contained information to guide self-selection and use of lovastatin. The label also contained warnings
about adverse events and instructions for self-management based on cholesterol concentrations after 8 weeks of OTC lovastatin treatment. Over 1000 consumers purchased and used lovastatin in the study. The study yielded valuable insights into how consumers might use a statin in the unsupervised OTC setting. The OTC label used in CUSTOM guided consumer self-selection based on their serum low-density lipoprotein (LDL) cholesterol concentration. Consumers could elect to purchase a cholesterol test at the study site, but this was not required. After their purchase decision, consumers were asked to report their LDL cholesterol concentration. Additionally, blood samples were taken for assay of cholesterol, but the results were not provided to the consumers unless they had obtained the test before purchase. When analyzed by broad ranges (<130 mg/dL, 130–170 mg/dL, >170 mg/dL), 76% of the 667 consumers who used OTC lovastatin and for whom data were available accurately reported their LDL cholesterol. Thirty-nine consumers with LDL cholesterol levels above 170 mg/dL, who were thus likely to require more intensive statin treatment, self-reported their LDL cholesterol as less than 170 mg/dL (13). The target population for OTC lovastatin was consumers with a moderate 10-year cardiovascular risk of 5% to 20%. The ability of the surrogates used on the proposed OTC label to allow self-selection of this cohort was assessed by the CUSTOM study. Of note, 24% of the participants who self-selected to use lovastatin in CUSTOM had a 10-year cardiovascular risk of greater than 20% (Figure 2) (14). Thus, CUSTOM identified the potential for some consumers to be diverted from optimal, intensive statin therapy by using OTC statins. Labeling of OTC statins to minimize this diversion is thus a priority. Similarly, 29% of OTC lovastatin users were at low risk as assessed using conventional tools (see Figure 2). The CUSTOM study illustrates the ability of an actual use trial to assess complex self-selection paradigms and to identify areas requiring improvement before regulatory approval.
Figure 2. Self-selection for OTC treatment and cardiovascular risk in the CUSTOM study. The proposed paradigm for OTC statin use requires consumers to risk-stratify based on surrogates provided on the label. The CUSTOM study assessed the effectiveness of this approach. Data from Brass et al. (14).
Overall, correct self-selection of OTC lovastatin required a large number of assessments by the consumer, including their LDL cholesterol level and their cardiac risk factors as well as age, concomitant medications, previous muscle symptoms, history of liver disease or pregnancy, cutoffs based on highdensity lipoprotein cholesterol and triglycerides concentration, and the presence or absence of concomitant diseases such as diabetes. When all these criteria were taken into account, only 10% of lovastatin users in the CUSTOM study met all label criteria (15). This illustrates an important aspect of interpreting actual use study results. Not all label directions and restrictions are equivalently important. A study analysis can benefit from having prespecified the most important label-heeding requirements and from defining the thresholds or performance standards that must be met to be consistent with unsupervised OTC use. For example, in the case of OTC lovastatin, the hypothetical decision of a 54-year-old woman with hypertension and a plasma LDL concentration of 160 mg/dL to use the drug and the decision of a 55year-old man currently taking gemfibrozil to use OTC lovastatin are both incorrect. The decision by the woman is unlikely to have an adverse health consequence, but that of the 55-year-old man is associated with a potentially clinically significant drug–drug interaction. Compliance with the concomitant medication warning thus merits a meaningful predefined standard of performance, and mild deviations from the age limits are substantially less important. In the case of CUSTOM, 10 of 48 consumers on gemfibrozil
who evaluated OTC lovastatin purchased and used the OTC statin (13). After they had started therapy with OTC lovastatin, the label instructed consumers to have their cholesterol rechecked. Those consumers with high LDL cholesterol concentrations on therapy, which might include those with very high levels when therapy was started, should seek guidance on their therapy from a health-care professional. This instruction is important to prevent undertreatment of higher-risk consumers; heeding of this was evaluated by CUSTOM. The study design permitted participants to return to the study site for cholesterol testing as well as to purchase additional drug. Consumers could also obtain cholesterol testing independent of the study structure. Seventy percent of the patients for whom follow-up data were available reported at least one follow-up cholesterol test during the 26-week study (13). Of these participants, 75% made a self-triage decision consistent with the label directions (e.g., to continue therapy based on the result or discontinue therapy and talk to their doctor). This again illustrates the ability of an actual use study to assess complex consumer behaviors over time with minimal cueing.
3 CONCLUSIONS
Label comprehension and actual use studies provide data critical to developing OTC drugs that can be used safely and effectively in the OTC setting. Each study should contain design elements that permit robust assessment of the label elements most important to
correct use of the OTC candidate. Statistical analysis of these studies should include prioritized endpoints with prespecified definition of acceptable consumer behavior rates. 3.0.1 Acknowledgment. The author is a consultant to GSK Consumer Healthcare, Johnson&Johnson-Merck, McNeil Consumer Pharmaceuticals, and Novartis Consumer Health. REFERENCES 1. E. P. Brass, Changing the status of drugs from prescription to over-the-counter availability. N Engl J Med. 2001; 345: 810–816. 2. E. P. Brass, ed. Clinical trials to support prescription to over-the-counter switches. In: R. D’Agostino, L. Sullivan, and J. Massaro (eds.), Wiley Encyclopedia of Clinical Trials, New York: Wiley, 2008. 3. E. Brass and M. Weintraub, Label development and the label comprehesion study for over-the-counter drugs. Clin Pharmacol Ther. 2003; 74: 406–412. 4. A. A. Ciociola, M. A. Sirgo, K. A. Pappa, J. A. McGuire, and K. Fung, A study of the nonprescription drug consumer’s understanding of the ranitidine product label and actual product usage patterns in the treatment of episodic heartburn. Am J Ther. 2001; 8: 387–398. 5. E. G. Raymond, S. M. Dalebout, and S. I. Camp, Comprehension of a prototype overthe-counter label for an emergency contraceptive pill product. Obstet Gynecol. 2002; 100: 342–349. 6. J. M. Melin, W. E. Struble, R. Tipping, T. C. Vassil, J. Reynolds, et al., The CUSTOM study: a consumer use study of OTC Mevacor. Am J Cardiol. 2004; 94: 1243–1248. 7. K. Lechter, DDMAC Review Omeprazole Magnesium Tablets Study 3358 Label Comprehension. August 16, 2000. Available from: http:// www.fda.gov/ohrms/dockets/ac/02/briefing/386 1B1 11 Label%20comprehension.pdf
8. S. Henness and C. M. Perry, Orlistat: a review of its use in the management of obesity. Drugs. 2006; 66: 1625–1656. 9. K. B. Feibus, Orlistat OTC, 60mg capsules GlaxoSmithKline (GSK) Consumer Healthcare New Drug Application 21-887 Actual Use Study NM17285. Slide presentation, 2006. Available from: http://www.fda.gov/ ohrms/dockets/ac/06/slides/2006-4201S1 07 FDA-Feibus files/frame.htm#slide0060.htm 10. M. Malone and S. A. Alger-Mayer, Pharmacist intervention enhances adherence to orlistat therapy. Ann Pharmacother. 2003; 37: 1598–1602. 11. A. M. Gotto, Jr., Over-the-counter statins and cardiovascular disease prevention: perspectives, challenges, and opportunities. Clin Pharmacol Ther. 2005; 78: 213–217. 12. J. M. Melin, W. E. Struble, R. W. Tipping, J. M. Reynolds, T. C. Vassil, et al., A consumer use study of over-the-counter lovastatin (CUSTOM). Am J Cardiol. 2004; 94: 1243–1248. 13. E. P. Brass, Consumer behavior in the setting of over-the-counter statin availability: lessons from the consumer use study of OTC Mevacor. Am J Cardiol. 2004; 94: 22F–29F. 14. E. P. Brass, S. E. Allen, and J. M. Melin, Potential impact on cardiovascular public health of over-the-counter statin availability. Am J Cardiol. 2006; 97: 851–856. 15. D. Shetty, Mevacor Daily 20mg Tablets Rx-toOTC Switch. PowerPoint presentation, 2005. Available from: http://www.fda.gov/ohrms/ dockets/ac/05/slides/2005-4086S1 06 FDAShetty.ppt#267,1,MevacorTM
CROSS-REFERENCES: Clinical trials to support prescription to over-the-counter switches; Confidence interval; Statistical analysis plan; Informed consent; Power
CLUSTER RANDOMIZATION
NEIL KLAR PhD
Division of Preventive Oncology, Cancer Care Ontario, Toronto, Ontario, Canada

ALLAN DONNER PhD
Department of Epidemiology and Biostatistics, The University of Western Ontario, London, Ontario, Canada

1 INTRODUCTION

Randomized trials in which the unit of randomization is a community, worksite, school, or family are becoming increasingly common for the evaluation of lifestyle interventions for the prevention of disease. This form of treatment assignment is referred to as cluster randomization or group randomization. Reasons for adopting cluster randomization are diverse, but they include administrative convenience, a desire to reduce the effect of treatment contamination, and the need to avoid ethical issues that might otherwise arise. Data from cluster randomization trials are characterized by between-cluster variation, which is equivalent to saying that responses of cluster members tend to be correlated. Dependencies among cluster members typical of such designs must be considered when determining sample size and in the subsequent data analyses. Failure to adjust standard statistical methods for within-cluster dependencies will result in underpowered studies with spuriously elevated type I errors. These statistical features of cluster randomization were not brought to wide attention in the health research community until the now famous article by Cornfield (1). However, the 1980s saw a dramatic increase in the development of methods for analyzing correlated outcome data in general (2) and methods for the design and analysis of cluster randomized trials in particular (3, 4). Books summarizing this research have also recently appeared (5, 6), and new statistical methods are in constant development.

Several recently published trials that are reviewed in Section 2 will be used to illustrate the key features of cluster randomization. The principles of experimental design, including the benefits of random assignment and the importance of replication, are discussed in Sections 3–5, whereas issues of sample size estimation are considered in Section 6. Methods of analysis at the cluster level and at the individual level are discussed in Sections 7 and 8, respectively, whereas designs involving repeated assessments are considered in Section 9. In Section 10, the authors provide recommendations for trial reporting, and in Section 11, they conclude the article by considering issues arising in meta-analyses that may include one or more cluster randomization trials. Readers interested in a more detailed discussion might wish to consult Donner and Klar (5) from which much of this article was abstracted.
2 EXAMPLES OF CLUSTER RANDOMIZATION TRIALS 1. A group of public health researchers in Montreal (7) conducted a household randomized trial to evaluate the risk of gastrointestinal disease caused by consumption of drinking water. Participating households were randomly assigned to receive an in-home water filtration unit or were assigned to a control group that used tap water. Households were the natural randomization unit in this trial for assessing an in-home water filtration unit. There were 299 households (1206 individuals) assigned to the filtered water group and 308 households (1201 individuals) assigned to the tap water group. The annual incidence of gastrointestinal illness was analyzed using an extension of Poisson regression that adjusted for the within-household correlation in the outcome variable. Based on these analyses, investigators concluded that approximately 35% of the reported gastrointestinal illnesses among control group subjects were preventable.
2. The National Cancer Institute of the United States funded the Community Intervention Trial for Smoking Cessation (COMMIT), which investigated whether a community-level, fouryear intervention would increase quit rates of cigarette smokers (8). Communities were selected as the natural experimental unit because investigators assumed that interventions offered at this level would reach the greatest number of smokers and possibly change the overall environment, thus making smoking less socially acceptable. Random digit dialing was used to identify approximately 550 heavy smokers and 550 light-to-moderate smokers in each community. Eleven matched-pairs of communities were enrolled in this study, with one community in each pair randomly assigned to the experimental intervention and the remaining community serving as a control. Matching factors included geographic location, community size, and general sociodemographic factors. Each community had some latitude in developing smoking cessation activities, which included mass media campaigns and programs offered by health-care providers or through worksites. These activities were designed to increase quit rates of heavy smokers, which in theory, should then also benefit lightto-moderate smokers whose tobacco use tends to be easier to change. The effect of the intervention was assessed at the community level by calculating the difference in community-specific quit rates for each pair. Hypothesis tests were then constructed by applying a permutation test to the 11 matched-pair difference scores, an analytic approach that accounts for both the between-community variability in smoking quit rates as well as for the matching. Further details concerning this cluster level method of analysis are provided in Section 7. Unfortunately, although the experimental intervention offered by COMMIT significantly increased smoking quit rates among light-to-moderate
smokers from about 28% to 31% (P = 0.004, one-sided), no similar effect was identified among the cohort of heavy smokers. 3. Antenatal care in the developing world has attempted to mirror care that is offered in developed countries even though not all antenatal care interventions are known to be effective. The World Health Organization (WHO) antenatal care randomized trial (9) compared a new model of antenatal care that emphasized health-care interventions known to be effective with the standard model of antenatal care. The primary hypothesis in this equivalence trial was that the new model of antenatal health care would not adversely effect the health of women. Participating clinics, recruited from Argentina, Cuba, Saudi Arabia, and Thailand, were randomly assigned to an intervention group or control group separately within each country. Clinics were selected as the optimal unit of allocation in this trial for reasons of administrative and logistic convenience. This decision also reduced the risk of experimental contamination that could have arisen had individual women been randomized. However, random assignment of larger units (e.g., communities) would have needlessly reduced the number of available clusters, thus compromising study power. Twenty-seven clinics (12568 women) were randomly assigned to the experimental arm, whereas 26 control group clinics (11958 women) received standard antenatal care. The primary analyses examining low birthweight (<2500 g) as the principal endpoint were based on (1) extensions of the Mantel–Haenszel test statistic and (2) extensions of logistic regression, both of which were adjusted for betweencluster variability and took into account the stratification by country. The resulting odds ratio comparing experimental to control group women with respect to low birthweight was estimated as equal to 1.06 (95% confidence interval 0.97–1.15). Based on
these results, and bolstered by similar findings from other study outcomes, the investigators concluded that the new antenatal care model does not adversely affect perinatal and maternal health.
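Each of these analyses had to account for the correlation of responses within randomized clusters. As a rough illustration of the kind of cluster-adjusted Poisson analysis used in the first example, the sketch below fits a Poisson model by generalized estimating equations (an approach discussed further in Section 8) with an exchangeable working correlation to simulated household data; the simulated data, effect sizes, and use of the Python statsmodels package are assumptions for illustration, not a description of the investigators' actual analysis.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulate 600 households of size 4 with a shared household effect,
# mimicking the clustered structure of the water filtration trial.
n_households, size = 600, 4
household = np.repeat(np.arange(n_households), size)
treated = np.repeat(rng.integers(0, 2, n_households), size)     # household-level assignment
frailty = np.repeat(rng.normal(0, 0.3, n_households), size)     # induces within-household correlation
episodes = rng.poisson(np.exp(0.1 - 0.35 * treated + frailty))  # annual illness counts

X = sm.add_constant(treated)
model = sm.GEE(episodes, X, groups=household,
               family=sm.families.Poisson(),
               cov_struct=sm.cov_struct.Exchangeable())
result = model.fit()
print(result.summary())  # robust standard errors account for within-household correlation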
3 PRINCIPLES OF EXPERIMENTAL DESIGN
The science of experimental design as initially put forward by R. A. Fisher (10) was based on the principles of random assignment, stratification, and replication. These principles may be illustrated by considering in more detail the cluster randomization trials described above. All three trials reviewed in Section 2 used random allocation in assigning clusters to an intervention group. They differed, however, in the experimental unit: Random assignment was by household in the Montreal trial (7), by clinic in the WHO Antenatal Care Trial (8), and by community in COMMIT (9). The advantages of random allocation for cluster randomization trials are essentially the same as for clinical trials randomizing individuals. These advantages include the assurance that selection bias has played no role in the assignment of clusters to different interventions; the balancing, in an average sense, of baseline characteristics in the different intervention groups; and formal justification for the application of statistical distribution theory to the analysis of results. A final compelling reason for randomized assignment is that the results are likely to have much more credibility in the scientific community, particularly if they are unexpected. Three designs are most frequently adopted in cluster randomization trials: 1. Completely randomized, involving no pre-stratification or matching of clusters according to baseline characteristics 2. Matched-pair, in which one of two clusters in a stratum is randomly assigned to each intervention 3. Stratified, involving the assignment of two or more clusters to at least some combination of stratum and intervention.
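These three allocation schemes can be made concrete with a short sketch; the cluster labels, pairings, and strata below are purely hypothetical, and the point is simply that randomization operates on clusters rather than on individual subjects.

import random

random.seed(2007)

clusters = [f"clinic_{i:02d}" for i in range(1, 13)]   # hypothetical cluster labels
arms = ["experimental", "control"]

# 1. Completely randomized: no pre-stratification; clusters are simply split at random.
shuffled = random.sample(clusters, len(clusters))
completely_randomized = {c: arms[i % 2] for i, c in enumerate(shuffled)}

# 2. Matched-pair: within each (hypothetical) pair of similar clusters,
#    one cluster is randomly assigned to each arm.
pairs = [clusters[i:i + 2] for i in range(0, len(clusters), 2)]
matched_pair = {}
for first, second in pairs:
    flip = random.randrange(2)
    matched_pair[first] = arms[flip]
    matched_pair[second] = arms[1 - flip]

# 3. Stratified: several clusters are randomized to each arm within each stratum.
strata = {"country_A": clusters[:6], "country_B": clusters[6:]}
stratified = {}
for members in strata.values():
    order = random.sample(members, len(members))
    for i, c in enumerate(order):
        stratified[c] = arms[i % 2]

print(completely_randomized, matched_pair, stratified, sep="\n")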
The completely randomized design is most appropriate for studies having large numbers of clusters, such as the Montreal trial, which enrolled over 600 households. Matching or stratification is often considered for community intervention trials such as COMMIT (8) in which the numbers of clusters that can be enrolled are usually limited by economic or practical considerations. The main attraction of this design is its potential to provide very tight and explicit balancing of important prognostic factors, thereby improving statistical power. The stratified design is an extension of the matched-pair design in which several clusters, rather than just one, are randomly assigned within strata to each intervention and control group. This design was selected for the WHO antenatal care trial (9). The stratified design has been used much less frequently than either the matched-pair or completely randomized design. However, for many studies, it would seem to represent a sensible compromise between these two designs in that it provides at least some baseline control on factors thought to be related to outcome, while easing the practical difficulties of finding appropriate pair-matches. 4 EXPERIMENTAL AND QUASI-EXPERIMENTAL DESIGNS Investigators may sometimes be hesitant to adopt a random allocation scheme because of ethical and/or practical concerns that arise. For example, the systematic allocation of geographically separated control and experimental clusters might be seen as necessary to alleviate concerns regarding experimental contamination. Nonrandomized designs may also seem easier to explain to officials, to gain broad public acceptance, and more generally, to allow the study to be carried out in a simpler fashion, without the resistance to randomization that is often seen. In this case, well-designed quasi-experimental comparisons (11) will be preferable to obtaining no information at all on the effectiveness of an intervention. At a minimum, it seems clear that non-experimental comparisons may generate hypotheses that can subsequently be tested in a more rigorous framework.
Several reviews of studies in the health sciences have made it clear that random assignment is in fact being increasingly adopted for the assessment of nontherapeutic interventions (5). Similar inroads have been made in the social sciences, where successfully completed randomized trials help investigators to recognize that random assignment may well be possible under circumstances in which only quasi-experimental designs had been previously considered. The availability of a limited number of clusters has been cited as a reason to avoid randomization, because it may leave considerable imbalance between intervention groups on important prognostic factors. This decision seems questionable as there is no assurance that quasi-experimental designs will necessarily create acceptably balanced treatment groups with a limited number of available clusters. Moreover, a matchedpairs or stratified design can usually provide acceptable levels of balance in this case. Concern has been raised by some researchers at the relatively small number of community intervention trials that have actually identified effective health promotion programs (12). It is worth noting, however, that the dearth of effective programs identified using cluster randomization is largely confined to evaluations of behavioral interventions. There has been considerably more success in evaluating medical interventions as in trials assessing the effect of vitamin A on reducing childhood morbidity and mortality (13). There is a long history of methodological research comparing the effectiveness of methods of treatment assignment in controlled clinical trials (14). For example, Chalmers et al. (15) reviewed 145 reports of controlled clinical trials examining treatments for acute myocardial infarction published between 1946 and 1981. Differences in casefatality rates were found in only about 9% of blinded randomized trials, 24% of unblinded randomized studies, and 58% of nonrandomized studies. These differences in the effects of treatment were interpreted by the authors as attributable to differences in patient participation (i.e., selection bias), which should give pause to investigators touting the benefits
of non-randomized comparisons as a substitute for cluster randomization. Murray (16) in addressing the ability of investigators to identify effective behavioral interventions stated that The challenge is to create trials that are (a) sufficiently rigorous to address these issues, (b) powerful enough to provide an answer to the research question of interest, and (c) inexpensive enough to be practical. The question is not whether to conduct group-randomized trials, but how to do them well.
5 THE EFFECT OF FAILING TO REPLICATE Some investigators have designed community intervention trials in which exactly one cluster has been assigned to the experimental group and one to the control group, either with or without the benefit of random assignment (5). Such trials invariably result in interpretational difficulties caused by the total confounding of two sources of variation: (1) the variation in response due to the effect of intervention, and (2) the natural variation that exists between two communities (clusters) even in the absence of an intervention effect. These sources of confounding can be only be disentangled by enrolling multiple clusters per intervention group. The natural between-household variation in infection rates may easily be estimated separately for subjects in the control and experimental arms of the completely randomized Montreal water filtration trial (7). However, this task is more complicated for COMMIT (8), because the pair-matched design adopted for this trial implies that there is only a single community in each combination of stratum and intervention. Inferences concerning the effect of intervention must therefore be constructed using the variation in treatment effect assessed across the 11 replicate pairs. Between cluster variation is also directly estimable in stratified trials such as the WHO antenatal care trial (9) using, for example, the replicate clusters available per intervention group within each of the four participating countries. The absence of replication, although not exclusive to community intervention trials, is
most commonly seen in trials that enroll clusters of relatively large size. Unfortunately, the large numbers of study subjects in such trials may mislead some investigators into believing valid inferences can be constructed by falsely assuming individuals within a community provide independent responses. A consequence of this naive approach is that the variance of the observed effect of intervention will invariably be underestimated. The degree of underestimation is a function of the variance inflation due to clustering (i.e., the design effect). In particular, the variance inflation factor is given by 1 + (m − 1)ρ, where m denotes average cluster size and ρ is a measure of intracluster correlation, interpretable as the standard Pearson correlation between any two responses in the same cluster. With the additional assumption that the intracluster correlation is non-negative, ρ may also be interpreted as the proportion of overall variation in response that can be accounted for by the between-cluster variation. As demonstrated in Section 6, even relatively small values of intracluster correlation, combined with large cluster sizes, can yield sizable degrees of variance inflation. Moreover, in the absence of cluster replication, positive effects of intervention artificially inflate estimates of between-cluster variation, thus invalidating the resulting estimates of ρ. More attention to the effects of clustering when determining the trial sample size might help to eliminate designs that lack replication. Even so, investigators will still need to consider whether statistical inferences concerning the effect of intervention will be interpretable if only two, or a few, replicate clusters are allocated to each intervention group.
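The variance inflation factor is easy to compute directly. The short sketch below uses illustrative values of m and ρ (of the same order as those discussed for the WHO antenatal care trial in the next section) to show how even small intracluster correlations, combined with large clusters, produce substantial inflation.

def design_effect(m, rho):
    """Variance inflation factor 1 + (m - 1) * rho for average cluster size m
    and intracluster correlation rho."""
    return 1 + (m - 1) * rho

# Illustrative values only: a small range of cluster sizes and correlations.
for m in (4, 100, 300, 600):
    for rho in (0.002, 0.01, 0.05):
        print(f"m = {m:4d}  rho = {rho:.3f}  design effect = {design_effect(m, rho):.2f}")

# For example, m = 600 and rho = 0.002 give a design effect of about 2.2
# (as noted for the WHO trial in the next section): a variance estimated while
# ignoring clustering would be less than half its true size, and the required
# number of subjects increases by the same factor.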
6 SAMPLE SIZE ESTIMATION
A quantitatively justified sample size calculation is almost universally regarded as a fundamental design feature of a properly controlled clinical trial. Methodologic reviews of cluster randomization trials have consistently shown that only a small proportion of these studies have adopted a predetermined
sample size based on formal considerations of statistical power (5). There are several possible explanations for the difficulties investigators have faced in designing adequately powered studies. One obvious reason is that the required sample size formulas still tend to be relatively inaccessible, not being available, for example, in most standard texts or software packages. A second reason is that the proper use of these formulas requires some prior assessment of both the average cluster size and the intracluster correlation coefficient ρ. The average cluster size for a trial may at times be directly determined by the selected interventions. For example, households were the natural unit of randomization in the Montreal water filtration study (7), where the average cluster size at entry was approximately four. When relatively large clusters are randomized, on the other hand, subsamples of individual cluster members may be selected to reduce costs. For example, the primary endpoint in the community intervention trial COMMIT (8) was the quit rate of approximately 550 heavy smokers selected from each cluster. Difficulties in obtaining accurate estimates of intracluster correlation are slowly being addressed as more investigators begin publishing these values in their reports of trial results. Summary tables listing intracluster correlation coefficients and variance inflation factors from a wide range of cluster randomization trials and complex surveys are also beginning to appear (5). In practice, estimates of ρ are almost always positive and tend to be larger in smaller clusters. The estimated variance inflation factor may be directly applied to determine the required sample size for a completely randomized design. Let Zα/2 denote the two-sided critical value of the standard normal distribution corresponding to the error rate α, and Zβ denote the critical value corresponding to β. Then, assuming the difference in sample means for the experimental and control groups, Y 1 − Y 2 , can be regarded as approximately normally distributed, the number of subjects required per intervention group is
given by

    n = (Zα/2 + Zβ)² (2σ²) [1 + (m − 1)ρ] / (µ1 − µ2)²,
where µ1 − µ2 denotes the magnitude of the difference to be detected, σ 2 denotes the common but unknown variance for an individual subject’s outcome Y, and m denotes the average cluster size. Equivalently, the number of clusters required per group is given by k = n/m. Formulas needed to estimate sample size for pair-matched and stratified designs for a variety of study outcomes are provided by Donner and Klar (5). Regardless of the study design, it is useful to conduct sensitivity analyses exploring the effect on sample size by varying values of the intracluster correlation, the number of clusters per intervention group, and the subsample size. The benefits of such a sensitivity analysis are illustrated by a report describing methodological considerations in the design of the WHO antenatal care trial (9, 17). The values for ρ, suggested from data obtained from a pilot study, varied from 0 to 0.002, whereas cluster sizes were allowed to vary between 300 and 600 patients per clinic. Consequently, the degree of variance inflation due to clustering could be as great as 1 + (600 − 1)0.002 = 2.2. As is typical in such sensitivity analyses, power was seen to be much more sensitive to the number of clusters per intervention group than to the number of patients selected per clinic. Ultimately a total of 53 clinics were enrolled in this trial, with a minimum of 12 clinics per site and with each clinic enrolling approximately 450 women. It must be emphasized that the effect of clustering depends on the joint influence of both m and ρ. Failure to appreciate this point has led to the occasional suggestion in the epidemiological literature that clustering may be detected or ruled out on the basis of testing the estimated value of ρ for statistical significance, i.e., testing H0 : ρ = 0 versus HA : ρ > 0. The weakness of this approach is that observed values of ρ may be very small, particularly for data collected from the very large clusters typically recruited for community intervention trials. Therefore, the power of a test for detecting such values as
statistically significant tends to be unacceptably low (5). Yet small values of ρ, combined with large cluster sizes, can yield sizable values of the variance inflation factor, which can seriously disturb the validity of standard statistical procedures if unaccounted for in the analyses. Thus we would recommend that investigators inherently assume the existence of intracluster correlation, a well-documented phenomenon, rather than attempting to rule it out using statistical testing procedures. 7 CLUSTER LEVEL ANALYSES Many of the challenges of cluster randomization arise when inferences are intended to apply at the individual level, whereas randomization is at the cluster level. If inferences were intended to apply at the cluster level, implying that an analysis at the cluster level would be most appropriate, the study could be regarded, at least with respect to sample size estimation and data analysis, as a standard clinical trial. For example, one of the secondary aims of the Community Intervention Trial for Smoking Cessation (COMMIT) was to compare the level of tobacco control activities in the experimental and control communities after the study ended (18). The resulting analyses were then naturally conducted at the cluster (community) level. Analyses are inevitably more complicated when data are available from individual study subjects. In this case, the investigator must account for the lack of statistical independence among observations within a cluster. An obvious method of simplifying the problem is to collapse the data in each cluster, followed by the construction of a meaningful summary measure, such as an average, which then serves as the unit of analysis. Standard statistical methods can then be directly applied to the collapsed measures, which removes the problem of nonindependence because the subsequent significance tests and confidence intervals would be based on the variation among cluster summary values rather than on variation among individuals. An important special case arises in trials having a quantitative outcome variable when each cluster has a fixed number of subjects.
In this case, the test statistic obtained using the analysis of variance is algebraically identical to the test statistic obtained using a cluster level analysis (5). Thus, the suggestion that is sometimes made that a cluster level analysis intrinsically assumes ρ = 1 is misleading, because such an analysis can be efficiently conducted regardless of the value of ρ. It is important to note, however, that this equivalence between cluster level and individual level analyses, which holds exactly for quantitative outcome variables under balance, holds only approximately for other outcome variables (e.g., binary, time to event, count). A second implication of this algebraic identity is that the well-known ecological fallacy cannot arise in the case of cluster level intention-to-treat analyses, because the assigned intervention is shared by all cluster members. In practice, the number of subjects per cluster will tend to exhibit considerable variability, either by design or by subject attrition. Cluster level analyses that give equal weight to all clusters may therefore be inefficient. However, it is important to note that appropriately weighted cluster level analyses are asymptotically equivalent to individual level analyses. On the other hand, if there are only a small number of clusters per intervention group, the resulting imprecision in the estimated weights might even result in a loss of power relative to an unweighted analysis. In this case, it might therefore be preferable to consider exact statistical inferences constructed at the cluster level, as based on the randomization distribution for the selected experimental design (e.g., completely randomized, matched-pair, stratified). As noted in Section 2, COMMIT investigators (8) adopted this strategy, basing their primary analysis of tobacco quit rates on the permutation distribution of the difference in event rates within a pair-matched study design. Using a two-stage regression approach, investigators were also able to adjust for important baseline imbalances on known prognostic variables.
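A minimal sketch of such a matched-pair permutation analysis follows. The 11 within-pair differences in quit rates are hypothetical stand-ins (the COMMIT data are not reproduced here); under the null hypothesis the sign of each difference is exchangeable, so the observed mean difference is referred to the distribution generated by all 2^11 possible sign assignments.

from itertools import product

# Hypothetical quit-rate differences (intervention minus control) for 11 matched pairs.
diffs = [0.031, -0.008, 0.024, 0.017, -0.012, 0.040, 0.009, 0.022, -0.005, 0.015, 0.028]

observed = sum(diffs) / len(diffs)

# Exact permutation distribution: flip the sign of each pair difference in every possible way.
count_extreme = 0
n_perms = 0
for signs in product((1, -1), repeat=len(diffs)):
    stat = sum(s * d for s, d in zip(signs, diffs)) / len(diffs)
    count_extreme += stat >= observed   # one-sided, as in the COMMIT analysis
    n_perms += 1

print(f"observed mean difference = {observed:.4f}")
print(f"one-sided permutation P-value = {count_extreme / n_perms:.4f}")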
8 INDIVIDUAL LEVEL ANALYSES
Standard methods of analysis applied to individually randomized trials have all
been extended to allow for the effect of between-cluster sources of variation. These methods include extensions of contingency table methods (e.g., Pearson chi-square test, Mantel–Haenszel methods) and of two sample t-tests. More sophisticated extensions of multiple regression models have also been developed and are now available in standard statistical software. Attention in this section will be focused on methods for the analysis of binary outcome data, which arise more frequently in cluster randomization trials than continuous, count, or time-to-event data. The discussion will also be limited to data obtained from completely randomized and stratified designs. Methods of analysis for other study outcomes are considered in detail elsewhere (5), whereas some analytic challenges unique to pair-matched trials are debated by Donner and Klar (5) and by Feng et al. (19). Analyses of data from the WHO antenatal care trial (9) are now considered in which clinics were randomly assigned to experimental or control groups separately within each of the four participating sites (countries). An extension of the Mantel–Haenszel statistic adjusted for clustering was used to compare the risk of having a low birthweight outcome for women assigned to either the new model of antenatal care or a standard model. For clusters of fixed size m, this statistic is equal to the standard Mantel–Haenszel statistic divided by the variance inflation factor 1 + (m − 1)ρ, ˆ where ρˆ is the sample estimate of ρ. Thus, failure to account for between cluster variability (i.e., incorrectly assuming ρ = 0) will tend to falsely increase the type I error rate. A key advantage of this approach is that the resulting statistic simplifies to the standard Mantel–Haenszel test statistic. Similar advantages are shared by most other individual level test statistics. Additional analyses reported in this trial used an extension of logistic regression that allowed adjustment for other potential baseline predictors of low birth-weight including maternal education, maternal age, and nulliparity. These analyses allowed examination of the joint effects of individual-level and cluster-level predictors (i.e., intervention, strata).
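For clusters of a fixed size, the adjustment just described amounts to dividing the ordinary Mantel–Haenszel chi-square statistic by the estimated variance inflation factor. The sketch below illustrates the computation with hypothetical stratum-level 2 × 2 tables and assumed values for the cluster size and the intracluster correlation estimate; none of the numbers are taken from the trial itself.

# Hypothetical stratum-level 2x2 tables, one per country:
# rows = (new model, standard model), columns = (low birthweight, not low birthweight).
tables = [
    ((118, 2882), (112, 2888)),
    (( 95, 2405), ( 90, 2410)),
    (( 60, 1440), ( 57, 1443)),
    (( 84, 2016), ( 80, 2020)),
]

num, var = 0.0, 0.0
for (a, b), (c, d) in tables:
    n1, n2 = a + b, c + d                         # row totals
    m1, m2 = a + c, b + d                         # column totals
    n = n1 + n2
    num += a - n1 * m1 / n                        # observed minus expected events
    var += n1 * n2 * m1 * m2 / (n**2 * (n - 1))   # hypergeometric variance

chi2_mh = num**2 / var                            # standard Mantel-Haenszel chi-square

m = 450          # assumed fixed cluster size (women per clinic)
rho_hat = 0.002  # assumed estimate of the intracluster correlation
chi2_adjusted = chi2_mh / (1 + (m - 1) * rho_hat)

print(f"unadjusted Mantel-Haenszel chi-square = {chi2_mh:.2f}")
print(f"adjusted for clustering               = {chi2_adjusted:.2f}")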
Two frequently used extensions of logistic regression are the logistic-normal model and the generalized estimating equations (GEEs) extension of this procedure (5). The logistic-normal model assumes that the logit transform of the probability of having a low birthweight outcome follows a normal distribution across clusters. The resulting likelihood ratio tests will have maximum power for detecting effects of intervention as statistically significant when parametric assumptions such as these are satisfied. It may be difficult in practice to know whether the assumptions underlying the use of parametric models are reasonable. Therefore, attention is limited here to the GEE approach, which has the advantage of not requiring specification of a fully parametric distribution. Two distinct strategies are available to adjust for the effect of clustering using the GEE approach. The first can be said to be model-based, as it requires the specification of a working correlation matrix that describes the pattern of correlation between responses of cluster members. For cluster randomization trials, the simplest assumption to make is that responses of cluster members are equally correlated, i.e., to assume the correlation structure within clusters is exchangeable. The second strategy that may be used to adjust for the effect of clustering employs "robust variance estimators" that are constructed using between-cluster information. These estimators consistently estimate the true variance of estimated regression coefficients even if the working correlation matrix is misspecified. Moreover, provided there are a large number of clusters, inferences obtained using robust variance estimators will become equivalent to those obtained using the model-based strategy provided the working correlation matrix is correctly specified. The examples we have considered involve only a single level of clustering. More sophisticated multilevel methods of analysis are available (5, 20) that allow examination of effects at two or more levels. For example, women participating in the WHO antenatal care trial might have been cared for by a specific physician within each clinic. Responses of women would then be clustered by physician nested within clinics, generating two
The examples we have considered involve only a single level of clustering. More sophisticated multilevel methods of analysis are available (5, 20) that allow examination of effects at two or more levels. For example, women participating in the WHO antenatal care trial might have been cared for by a specific physician within each clinic. Responses of women would then be clustered by physician nested within clinics, generating two levels of clustering. This additional structure could then be used to explore differential intervention effects across physicians, for example, to consider whether years of training were associated with relatively fewer women having low birthweight babies. Although these analyses may enrich our understanding of the trial results, they are almost always exploratory in nature. It is important to note that statistical inferences constructed using individual-level analyses are approximate, with their validity only assured in the presence of a large number of clusters. This requirement essentially flows from the difficulty in accurately estimating between-cluster sources of variation. Thus the validity of statistical inferences constructed using individual-level analyses may be in question should there be fewer than 20 clusters enrolled. If a small number of clusters is enrolled, it may only be possible to consider cluster-level analyses by constructing statistical inferences based on the selected randomization distribution.

9 INCORPORATING REPEATED ASSESSMENTS

Investigators are often interested in considering the longitudinal effects of intervention as part of a cluster randomized trial. The choice here is between a cohort design that tracks the same individuals over time and a repeated cross-sectional design that tracks the same clusters over time but draws independent samples of individuals at each calendar point. This section outlines how the study objectives should determine the choice between these designs and how the resulting decision affects data collection, data analysis, and the interpretation of study results. Cohort samples of subjects were included in each of the three trials presented in Section 2. For example, smoking quit rates were measured for subsamples of heavy and light-to-moderate smokers selected from each of the participating COMMIT communities (8). Cohort members were followed and contacted annually during the four-year intervention. The length of follow-up time was a consequence of the study objectives, for which smoking quit rates were defined " . . . as the
fraction of cohort members who had achieved and maintained cessation for at least six months at the end of the trial." This outcome illustrates how cohort designs are best suited to measuring change within individual participants, implying that the unit of inference is most naturally directed at the individual level. A desirable design feature of COMMIT was that subjects were selected before random assignment, thereby avoiding any concerns about possible selection bias. This strategy was not available for the WHO antenatal care trial (9), as women could only be identified after their initial clinic visit, which for most women occurred after random assignment. Selection bias is unlikely, however, because all new patients from participating clinics were enrolled in the trial and birthweight data from singleton births were available for 92% of women from each intervention group. A secondary objective of COMMIT was to determine whether the intervention would decrease the prevalence of adult cigarette smoking (8). This objective was achieved by conducting separate surveys before random assignment and after completion of the intervention. A principal attraction of such repeated cross-sectional surveys is that any concerns regarding the effects of possible attrition would be avoided. Of course, differential rates of participation in cross-sectional surveys conducted after random assignment can still compromise validity, because willingness to participate may be a consequence of the assigned intervention. Nonetheless, random samples of respondents at each assessment point will be more representative of the target population than will be a fixed cohort of smokers. The final decision regarding the selection of a cohort or cross-sectional design should be based primarily on the study objectives and the associated unit of inference. However, it can still be informative to quantitatively evaluate the relative efficiency of the two designs (21). As repeated assessments are made on the same subjects, the cohort design tends to have greater power than a design involving repeated cross-sectional surveys. Note, however, that in practice, subject attrition may eliminate these potential gains in power.
The number and timing of assessments made after baseline should be determined by the anticipated temporal responses in each intervention group. For example, it might be reasonable to expect different linear trends over time across intervention groups in a community randomized trial of smoking cessation if the effects of intervention were expected to diffuse slowly through each community. Alternatively, the effects of intervention might diffuse rapidly but be transient, requiring a more careful determination of assessment times to ensure that important effects are not missed. The methods of analysis presented in Sections 7 and 8 assumed the presence of only a single assessment after random assignment. Extensions of these methods to cluster randomization trials having longitudinal outcome measures are beginning to appear (22, 23).

10 STUDY REPORTING
Reporting standards for randomized clinical trials have now been widely disseminated (24). Many principles that apply to trials randomizing individuals also apply to trials randomizing intact clusters. These principles include a carefully posed justification for the trial, a clear statement of the study objectives, a detailed description of the planned intervention, and an accurate accounting of all subjects randomized to the trial. Unambiguous inclusion–exclusion criteria must also be formulated, although perhaps separately for cluster level and individual level characteristics. There are, however, some unique aspects of cluster randomization trials that require special attention at the reporting stage. The focus here is on some of the most important of these aspects. The decreased statistical efficiency of cluster randomization relative to individual randomization can be substantial, depending on the sizes of the clusters randomized and the degree of intracluster correlation. Thus, unless it is obvious that there is no alternative, the reasons for randomizing clusters rather than individuals should be clearly stated. This information, accompanied by a clear description of the units randomized, can
help a reader decide whether the loss of precision due to cluster randomization is in fact justified. Having decided to randomize clusters, investigators may still have considerable latitude in their choice of unit. As different levels of statistical efficiency are associated with different cluster sizes, it would seem important to select the unit of randomization on a carefully considered basis. An unambiguous definition of the unit of randomization is also required. For example, a statement that ‘‘neighbourhoods’’ were randomized is clearly incomplete without a detailed description of this term in the context of the planned trial. The clusters that participate in a trial may not be representative of the target population of clusters. Some indication of this lack of representativeness may be obtained by listing the number of clusters that met the eligibility criteria for the trial, but which declined to participate, along with a description of their characteristics. A continuing difficulty with reports of cluster randomization trials is that justification for the sample size is all too often omitted. Investigators should clearly describe how the sample size for their trial was determined, with particular attention given to how clustering effects were adjusted. This description should be in the context of the experimental design selected (e.g., completely randomized, matched-pair, stratified). It would also be beneficial to the research community if empirical estimates of ρ were routinely published (with an indication of whether the reported values have been adjusted for the effect of baseline covariates). It should be further specified what provisions, if any, were made in the sample size calculations to account for potential loss to follow-up. As the factors leading to the loss to follow-up of individual members of a cluster may be very different from those leading to the loss of an entire cluster, both sets of factors must be considered here. A large variety of methods, based on very different sets of assumptions, have been used to analyze data arising from cluster randomization trials. For example, possible choices for the analysis of binary outcomes include adjusted chi-square statistics, the method of
generalized estimating equations (GEE), and logistic-normal regression models. These methods are not as familiar as the standard procedures commonly used to analyze clinical trial data, partly because the methodology for analyzing cluster randomization trials is in a state of rapid development, with virtually no standardization and a proliferation of associated software. Therefore, it is incumbent on authors to provide a clear statement of the statistical methods used, accompanied, where it is not obvious, by an explanation of how the analysis adjusts for the effect of clustering. The software used to implement these analyses should also be reported.

11 META-ANALYSIS

Meta-analyses involving the synthesis of evidence from cluster randomization trials raise methodologic issues beyond those raised by meta-analyses of individually randomized trials. Two of the more challenging of these issues are (1) the increased likelihood of study heterogeneity, and (2) difficulties in estimating design effects and selecting an optimal method of analysis (25). These issues are illustrated in a meta-analysis examining the effect of vitamin A supplementation on child mortality (13). This investigation considered trials of hospitalized children with measles as well as community-based trials of healthy children. Individual children were assigned to intervention in the four hospital-based trials, whereas allocation was by geographic area, village, or household in the eight community-based trials. One community-based trial included only one geographic area per intervention group, each of which enrolled approximately 3000 children. On the other hand, there was an average of about two children from each cluster when allocation was by household. Thus, an important source of heterogeneity arose from the nature and size of the randomization units allocated in the different trials. This problem was dealt with by performing the meta-analysis separately for the individually randomized and the cluster randomized trials.
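One generic way to combine cluster randomized trials, sketched below, is fixed-effect inverse-variance pooling of log odds ratios after inflating each trial's naive variance by its design effect. This is an illustration of the general idea discussed in this section, not the adjusted Mantel–Haenszel procedure of Reference 5, and the numbers are invented rather than taken from the vitamin A trials.

```python
import numpy as np

def pooled_log_or(log_or, var_log_or, design_effect):
    """Fixed-effect (inverse-variance) pooling of log odds ratios, inflating
    each trial's variance by its design effect 1 + (m - 1) * rho before pooling."""
    log_or = np.asarray(log_or, float)
    var_adj = np.asarray(var_log_or, float) * np.asarray(design_effect, float)
    w = 1.0 / var_adj
    pooled = np.sum(w * log_or) / np.sum(w)
    se = np.sqrt(1.0 / np.sum(w))
    return pooled, se

# Illustrative inputs only:
log_or = np.log([0.75, 0.80, 0.70, 0.85])
var_naive = [0.010, 0.015, 0.020, 0.012]   # variances ignoring clustering
deff = [1.10, 1.25, 1.40, 1.15]            # per-trial design effects
est, se = pooled_log_or(log_or, var_naive, deff)
print(np.exp(est), np.exp(est - 1.96 * se), np.exp(est + 1.96 * se))
```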
It is straightforward to summarize results across trials when each study provides a common measure of the estimated effect of intervention (such as an odds ratio, for example) and a corresponding variance estimate that appropriately accounts for the clustering. Unfortunately, the information necessary to apply this approach is rarely available to meta-analysts. One consequence of this difficulty is that investigators are sometimes forced to adopt ad hoc strategies when relying on published trial reports that fail to provide estimates of the variance inflation factor. For example, in the meta-analysis described above, only four of the eight community-based trials reported that they accounted for clustering effects. The authors argued that increasing the variance of the summary odds ratio estimator computed over all eight trials by an arbitrary 30% was reasonable because the design effects ranged from 1.10 to 1.40 in those studies that did adjust for clustering effects. Even when each trial provides an estimate of the design effect, several different approaches could be used for conducting a meta-analysis. For example, a procedure commonly adopted for combining the results of individually randomized clinical trials with a binary outcome variable is the well-known Mantel–Haenszel test. The adjusted Mantel–Haenszel test (5) may be used to combine results of cluster randomized trials. Other possible approaches are discussed by Donner et al. (26).

REFERENCES

1. J. Cornfield, Randomization by group: a formal analysis. Amer. J. Epidemiol. 1978; 108: 100–102.
2. M. Ashby, J. M. Neuhaus, W. W. Hauck et al., An annotated bibliography of methods for analyzing correlated categorical data. Stat. Med. 1992; 11: 67–99.
3. A. Donner, N. Birkett, and C. Buck, Randomization by cluster: sample size requirements and analysis. Amer. J. Epidemiol. 1981; 114: 906–914.
4. R. F. Gillum, P. T. Williams, and E. Sondik, Some consideration for the planning of total-community prevention trials: when is sample size adequate? J. Commun. Health 1980; 5: 270–278.
5. A. Donner and N. Klar, Design and Analysis of Cluster Randomization Trials in Health Research. London: Arnold, 2000.
6. D. M. Murray, Design and Analysis of Group-Randomized Trials. Oxford: Oxford University Press, 1998.
7. P. Payment, L. Richardson, J. Siemiatycki, R. Dewar, M. Edwardes, and E. Franco, A randomized trial to evaluate the risk of gastrointestinal disease due to consumption of drinking water meeting microbiological standards. Amer. J. Public Health 1991; 81: 703–708.
8. COMMIT Research Group, Community Intervention Trial for Smoking Cessation (COMMIT): I. Cohort results from a four-year community intervention. Amer. J. Public Health 1995; 85: 183–192.
9. J. Villar, H. Ba'aqeel, G. Piaggio, et al. for the WHO Antenatal Care Trial Research Group, WHO antenatal care randomised trial for the evaluation of a new model of routine antenatal care. Lancet 2001; 357: 1551–1564.
10. D. A. Preece, R. A. Fisher and experimental design: a review. Biometrics 1990; 46: 925–935.
11. W. R. Shadish, T. D. Cook, and D. T. Campbell, Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston, MA: Houghton Mifflin Company, 2002.
12. C. Merzel and J. D'Affitti, Reconsidering community-based health promotion: promise, performance, and potential. Amer. J. Public Health 2003; 93: 557–574.
13. W. W. Fawzi, T. C. Chalmers, M. G. Herrera, and F. Mosteller, Vitamin A supplementation and child mortality: a meta-analysis. JAMA 1993; 269: 898–903.
14. M. McKee, A. Britton, N. Black, K. McPherson, C. Sanderson, and C. Bain, Interpreting the evidence: choosing between randomised and non-randomised studies. Brit. Med. J. 1999; 319: 312–315.
15. T. C. Chalmers, P. Celano, H. S. Sacks, and H. Smith, Jr., Bias in treatment assignment in controlled clinical trials. N. Engl. J. Med. 1983; 309: 1358–1361.
16. D. M. Murray, Efficacy and effectiveness trials in health promotion and disease prevention: design and analysis of group-randomized trials. In: N. Schneiderman, M. A. Speers, J. M. Silva, H. Tomes, and J. H. Gentry (eds.), Integrating Behavioral and Social Sciences with Public Health. Washington, DC: American Psychological Association, 2001.
17. A. Donner, G. Piaggio, J. Villar et al. for the WHO Antenatal Care Trial Research Group, Methodological considerations in the design of the WHO Antenatal Care Randomised Controlled Trial. Paediatr. Perinatal Epidemiol. 1998; 12(Suppl 2): 59–74.
18. B. Thompson, E. Lichtenstein, K. Corbett, L. Nettekoven, and Z. Feng, Durability of tobacco control efforts in the 22 Community Intervention Trial for Smoking Cessation (COMMIT) communities 2 years after the end of intervention. Health Educ. Res. 2000; 15: 353–366.
19. Z. Feng, P. Diehr, A. Peterson, and D. McLerran, Selected statistical issues in group randomized trials. Annu. Rev. Public Health 2001; 22: 167–187.
20. A. V. Diez-Roux, Multilevel analysis in public health research. Annu. Rev. Public Health 2000; 21: 171–192.
21. H. A. Feldman and S. M. McKinlay, Cohort vs. cross-sectional design in large field trials: precision, sample size, and a unifying model. Stat. Med. 1994; 13: 61–78.
22. A. I. Sashegyi, K. S. Brown, and P. J. Farrell, Application of a generalized random effects regression model for cluster-correlated longitudinal data to a school-based smoking prevention trial. Amer. J. Epidemiol. 2000; 152: 1192–1200.
23. D. M. Murray, P. J. Hannan, R. D. Wolfinger, W. L. Baker, and J. H. Dwyer, Analysis of data from group-randomized trials with repeat observations on the same groups. Stat. Med. 1998; 17: 1581–1600.
24. D. Moher, K. F. Schulz, and D. G. Altman for the CONSORT Group, The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomised trials. Lancet 2001; 357: 1191–1194.
25. A. Donner and N. Klar, Issues in the meta-analysis of cluster randomized trials. Stat. Med. 2002; 21: 2971–2980.
26. A. Donner, G. Piaggio, and J. Villar, Statistical methods for the meta-analysis of cluster randomization trials. Stat. Methods Med. Res. 2001; 10: 325–338.
CODE OF FEDERAL REGULATIONS (CFR)

The Code of Federal Regulations (CFR) is a codification of the general and permanent rules published in the Federal Register by the executive departments and agencies of the federal government; it is published by the U.S. Government Printing Office. The CFR is divided into 50 titles, which represent broad areas subject to federal regulation. Each title is divided into chapters, which usually bear the name of the issuing agency. Each chapter is further subdivided into parts that cover specific regulatory areas, and large parts may be subdivided into subparts. All parts are organized in sections, and most citations to the CFR are made to the section level.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cdrh/devadvice/365.html) by Ralph D’Agostino and Sarah Karl.
COHERENCE IN PHASE I CLINICAL TRIALS
YING KUEN (KEN) CHEUNG, Columbia University, New York

In the early phase clinical development of a new treatment, phase I trials are typically small studies that evaluate toxicity and determine a safe dosage range. Often, several doses of the treatment will be tested; in these situations, a more specific objective is to determine the maximum dose that does not exceed an acceptable level of toxicity, called the maximum tolerated dose (MTD). For ethical reasons, balanced randomization to each dose level is not feasible because it may unknowingly allocate many patients to excessively toxic doses. Rather, dose assignments in phase I trials are usually made in an outcome-adaptive manner in that previous outcomes in the trial form the basis of dose assignments for future subjects. Conventionally, the MTD is approached from lower doses according to the rule that escalates dose after a cohort of three subjects who show no treatment-related adverse or toxic event (Table 1). Although this conventional method has several limitations and shortcomings in terms of its statistical properties (1), it embraces a principle that appears to be clinically and ethically sound: when the most current subject suffers a treatment-related toxic event or dose-limiting toxicity (DLT), no escalation should take place for the next subject without testing the current or the lower doses in more subjects. This principle is called coherence in escalation (2). For phase I trials conducted in patients, such as those in oncology, the flipside of the ethical concern is to avoid treating many patients at very low and hence inefficacious doses (3, 4). Analogously, coherence in de-escalation (2) stipulates that de-escalation should not take place when no DLT is observed in the current patient. This article reviews the coherence principles in phase I trials and some related concepts and applications. For brevity in discussion, we will focus on trials conducted in patients, as opposed to those in healthy volunteers.

1 COHERENCE: DEFINITIONS AND ORGANIZATION

Phase I dose-finding designs dictate dose assignments during a trial in an outcome-adaptive manner, escalating or de-escalating doses for future patients based on the previous observations. An escalation for the next patient is said to be incoherent if the most recent patient experiences a DLT. A dose-finding design is coherent in escalation if it does not induce an incoherent escalation in any possible outcome sequence, that is, if it allows escalation only after a non-DLT observation. For example, Fig. 1 shows an outcome sequence with an incoherent escalation, and thus by definition the design that generates this outcome sequence is not coherent in escalation. By the same token, a de-escalation for the next patient is incoherent if the most recent patient has no sign of toxicity. A design is coherent in de-escalation if it does not induce any incoherent de-escalation. Figure 2 shows an example of incoherent de-escalation. Examples of coherent designs and the utility of coherence are discussed under the headings "Coherent designs" and "Compatible initial design." Because coherent escalation and de-escalation are defined with respect to the outcome of the most recent patient, the concept of coherence is directly applicable to clinical trials with short-term toxicity, so that the toxicity outcome from the most recent patient is readily available when the next patient is enrolled. In clinical situations where the DLT is defined with respect to a nontrivial period of time, it becomes infeasible to wait for the outcome of the most recent patient before enrolling a new patient, because doing so may cause repeated accrual suspensions and hence a long trial duration. Two common ways to handle these situations are to enroll patients in groups and to monitor the trial in real time. Therefore, extensions of the basic concept of coherence to these clinical settings are needed to enhance its practical use. These extensions are discussed under the headings "Group coherence" and "Real-time coherence."
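As a concrete illustration of these single-patient definitions, the following sketch flags incoherent moves in a realized sequence of dose assignments. The function and the example path are illustrative; the path loosely mirrors the description of Fig. 1 (escalation after every three non-DLT observations, with a DLT for patient 10 followed by an escalation for patient 11).

```python
def coherence_violations(doses, dlts):
    """Check a realized phase I trial path for incoherent moves.

    doses: dose levels assigned to patients 1, 2, ..., in order of entry;
    dlts:  1 if the corresponding patient had a DLT, 0 otherwise.
    Returns a list of (patient number, violation type), applying the
    single-patient, short-term-toxicity definitions given above.
    """
    flags = []
    for i in range(1, len(doses)):
        if doses[i] > doses[i - 1] and dlts[i - 1] == 1:
            flags.append((i + 1, "incoherent escalation"))
        if doses[i] < doses[i - 1] and dlts[i - 1] == 0:
            flags.append((i + 1, "incoherent de-escalation"))
    return flags

# Escalation for patient 11 after a DLT in patient 10 is flagged:
doses = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4]
dlts  = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print(coherence_violations(doses, dlts))   # [(11, 'incoherent escalation')]
```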
Table 1. Conventional Rules for Escalation and De-escalation from a Given Dose*

# DLT / # patients
Cohort 1    Cohort 2         Action/conclusion
0/3         Not necessary    Escalate dose
1/3         0/3              Escalate dose or called MTD
1/3         >=1/3            MTD has been exceeded; de-escalate dose
>=2/3       Not necessary    MTD has been exceeded; de-escalate dose

* Up to two cohorts of three patients will be enrolled to a dose, and escalation will occur only when all patients in the most current cohort have no DLT.
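The decision rule in Table 1 can be written directly as a small function. This is a sketch only; the returned strings are informal paraphrases of the table's actions.

```python
def conventional_action(dlt_cohort1, dlt_cohort2=None):
    """Decision rule of Table 1 for cohorts of three at the current dose.

    dlt_cohort1: number of DLTs among the first three patients;
    dlt_cohort2: number of DLTs among a second cohort of three, or None if a
                 second cohort has not (yet) been treated at this dose.
    """
    if dlt_cohort1 == 0:
        return "escalate dose"
    if dlt_cohort1 >= 2:
        return "MTD exceeded; de-escalate dose"
    # exactly one DLT in the first cohort: a second cohort of three is required
    if dlt_cohort2 is None:
        return "treat a second cohort of three at the same dose"
    if dlt_cohort2 == 0:
        return "escalate dose (or call this dose the MTD)"
    return "MTD exceeded; de-escalate dose"

print(conventional_action(0))      # escalate dose
print(conventional_action(1, 1))   # MTD exceeded; de-escalate dose
```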
[Figure 1. Dose assignments (dose level versus patient number) and outcomes of the first 15 patients generated according to a two-stage CRM for a 25% target DLT rate. Each point represents a patient, with a circle "o" indicating no DLT and a cross "x" a DLT. The initial design escalates doses after every three successive non-DLT observations; the model setup for the CRM is the same as that used in Reference (2). Escalation for patient 11 is incoherent after a DLT observation from patient 10. This escalation also violates compatibility because patient 11 would have received dose level 4 according to the initial design if patient 10 did not have DLT.]
2 COHERENT DESIGNS
The phase I design literature clearly indicates that it is almost instinctive to enforce coherence as a consequence of ethical considerations, even before the term ‘‘coherence’’ was coined in Reference 2. Phase I designs can be generally classified as algorithm-based and model-based (5). An algorithm-based design prespecifies the rules of escalation and
de-escalation before a trial starts. Examples include the conventional method (Table 1), the up-and-down designs (6), and the biased coin design (7). Coherence is enforced in the biased coin design by construction of the rules, and it is an intrinsic property of the method. The conventional method and the up-and-down designs are also coherent in escalation by construction; moreover, common-sense implementation suffices to ensure coherence in de-escalation. For example, if the conventional method is used to facilitate dose assignments, and if the first two patients at a dose have DLT, then one should
de-escalate dose for the next patient instead of rigidly treating a third patient at the same dose, as in Fig. 2. A model-based design chooses the next dose based on a dose-toxicity curve, which is estimated repeatedly throughout the trial. As no hard-and-fast rule exists to determine escalation and de-escalation, there is no guarantee that a pure model-based design will adhere to coherence. Even if coherence holds, it is not as obvious as in an algorithm-based design, and it needs to be proved on a method-by-method basis. The continual reassessment method (CRM) and its modified versions (3, 8-12) are by far the most commonly used model-based designs in phase I trials. Briefly, the original proposal of the CRM (3) defines the MTD as a dose that causes DLT in the patient population with a prespecified target rate, assumes a one-parameter model to describe the dose-toxicity relationship, treats the first patient at the prior MTD based on the model, and repeatedly estimates the model-based MTD after every patient so that the subsequent patient is treated at the updated MTD estimate. This original version of the CRM, regardless of the functional form of the model used, has been proved to be coherent (2). For practical purposes, several authors have suggested starting a trial conservatively at the lowest dose instead of the model-based prior MTD (8-12). As a result of this deviation from the initial model-based dose assignment, the subsequent model-based dose assignments are not necessarily coherent in escalation (see Fig. 1). Being aware of the potential for incoherent escalation, all authors of the modified CRMs routinely apply an additional restriction that the dose level for the next patient cannot be higher than the level for the most recent patient who has just had a DLT; see Reference 8 for example. In other words, coherence in escalation is enforced by restriction. Alternatively, Reference (2) suggests a simple way to achieve coherence of a modified CRM by calibration of the design parameters. An advantage of this latter calibration approach is that it ensures that the initial dose escalation plan is compatible with the model-based plan and the chosen target DLT rate; see the section "Compatible
initial design." All the modified CRMs referenced above are coherent in de-escalation.

3 COMPATIBLE INITIAL DESIGN
As mentioned earlier, a practical modification of the CRM is to start the trial at the lowest dose instead of the model-based prior MTD. As a result, an additional set of rules or an initial design is needed to govern escalation at the beginning of the trial before it turns to the model-based assignments. A two-stage CRM design is thus formed by an initial design and a CRM stage. In a two-stage design, the first observed DLT is often the trigger on which the trial will be switched to the CRM stage (11, 12). Therefore, the initial design in effect dictates dose assignments (escalations) in a trial when there is no DLT. As such, the initial design should represent the fastest escalation scheme that takes place when no DLT is observed throughout the entire trial. (Otherwise, it seems hard to justify the quickening of escalation on the observation of any DLT.) Motivated by this assertion, an initial design is said to be compatible with the CRM component in a two-stage design with respect to a target DLT rate if dose escalation according to the two-stage design is no faster after observation of a DLT than it would have been in the absence of DLT. As a general rule of thumb, incompatibility can be caused by (1) an initial design that escalates after a conservatively large number of non-DLT observations, and/or (2) a large target DLT rate. Capitalizing on this relationship between the initial design and the target DLT rate, one may calibrate an initial design so that it is compatible with the CRM component in the two-stage design with respect to the target rate. For example, one may start by considering the initial design that escalates after groups of moderate-sized (e.g., five patients) and by decreasing the group size until compatibility is satisfied. A sufficient but not necessary condition for compatibility of an initial design is to verify that the two-stage design thus obtained is coherent by calibration (2) instead of by restriction; see the discussion at the end of the previous section. The outcome sequence in Fig. 1 is in fact generated by a two-stage design for a 25%
target DLT rate with an initial design that escalates dose after every three non-DLT observations. Once the first DLT is observed, the dose-toxicity curve would be updated after every single patient, and the next patient will be given the dose with DLT rate estimated to be closest to the target 25% according to the CRM. The dose-toxicity model setup for the CRM is the same as that used in Reference 2. The escalation for patient 11 is incoherent and leads to the violation of compatibility. Thus, by the principles of coherence and compatibility, this initial design is overly conservative; this observation is contrary to the conventional yet unexamined intuition that groups-of-three is reasonable for a 25% target rate. In particular, it can be verified that a groups-of-two initial design is compatible with the CRM setup in Fig. 1; see Reference 2.
4 GROUP COHERENCE
4.1 Clinical Settings

In situations in which the DLT can only be evaluated after a nontrivial period of time (i.e., late toxicity) and patient enrollment is comparatively fast, the duration of a trial can be substantially reduced by making interim dose decisions after every small group of patients, as opposed to every single patient, is accrued. Thus, we will view each group of patients, instead of each individual patient, as an experimental unit for dose assignment purposes.
[Figure 2. Dose assignments (dose level versus patient number) and outcomes generated according to the conventional method in Table 1. Each point represents a patient, with a circle "o" indicating no DLT and a cross "x" a DLT. De-escalation for patient 13 is incoherent after a non-DLT observation from patient 12.]
4.2 Definitions

An escalation for the next group of patients is said to be group incoherent when the observed DLT rate in the most recent group is larger than or equal to the target DLT rate. Analogously, a de-escalation for the next group is group incoherent if the observed DLT rate in the most recent group is less than or equal to the target DLT rate. A dose-finding design is group coherent if it does not induce a group incoherent escalation or de-escalation in any possible outcome sequence. Figure 3 provides an illustration of a group incoherent design.

4.3 Examples

Examples of grouped accrual designs include the conventional method, Storer's designs C and D (6), and the grouped accrual CRM (10). Whether a design is group coherent depends on the target DLT rate. Take Storer's design C as an example: it enrolls two patients at a time, escalates if no DLT occurs, and de-escalates if at least one DLT occurs. In other words, a de-escalation will occur when the observed DLT rate is 50%. Therefore, Storer's design C is group incoherent in de-escalation for a target DLT rate that is larger than 50%, although we note that such a high target rate is an unlikely choice in practice. Goodman et al. (10) propose the grouped accrual CRM in an attempt to reduce the trial duration imposed by the original CRM, which enrolls one patient at a time.
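A minimal sketch of the group coherence check implied by these definitions is given below; the function and argument names are illustrative.

```python
def group_move_is_coherent(move, dlt_count, group_size, target):
    """Group coherence check for the dose decision following one group.

    move: 'escalate', 'de-escalate', or 'stay'; dlt_count / group_size is the
    observed DLT rate in the most recent group; target is the target DLT rate.
    """
    observed = dlt_count / group_size
    if move == "escalate" and observed >= target:
        return False     # group incoherent escalation
    if move == "de-escalate" and observed <= target:
        return False     # group incoherent de-escalation
    return True

# Escalating after 1/3 observed DLTs with a 25% target is group incoherent,
# as for Group 4 in Fig. 3:
print(group_move_is_coherent("escalate", dlt_count=1, group_size=3, target=0.25))  # False
```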
[Figure 3. Dose assignments and outcomes of the first six patient groups (of size three), plotted as dose level against group number, generated according to a grouped accrual CRM that starts at the lowest dose for a 25% target DLT rate. Each number indicates the observed DLT rate of the group. Escalation for Group 4 is group incoherent. The model setup for the CRM is the same as that used in Reference (2).]
If the first group of patients is treated at the prior MTD without resorting to an initial design, the grouped accrual CRM is group coherent (2). However, if the trial starts at the lowest dose, as suggested also in (10), then the grouped accrual CRM without additional restriction is not necessarily group coherent (Fig. 3). In this case, therefore, one may need to apply the ad hoc restriction of no escalation (de-escalation) if the observed DLT rate exceeds (falls below) the target DLT rate. Alternatively, one can achieve group coherence by calibrating the group size and the CRM parameters using the technique in Reference 2.
5 REAL-TIME COHERENCE
5.1 Clinical Settings

In situations where a DLT may occur at a random time during the observation period, partial information from previously enrolled patients may be available as a new patient becomes eligible for the trial; thus, each individual patient should be treated as an experimental unit. For example, suppose that each patient is to be followed monthly for a maximum of six months for DLT evaluation. By the time of the arrival of a new patient, complete information is available from the currently enrolled patients who have already had a DLT. For those without any sign of DLT, we still have partial information, because a patient without DLT for five months apparently conveys more
safety information about the treatment than a patient without DLT for only one month. Ignoring such information is inefficient and could cause bias. In addition, incoherent moves may occur if we rigidly implement a grouped accrual design without considering the updated information between patient enrollments. Consider Fig. 2, for example: suppose that patient 11 has already had a DLT by the time patient 12 arrives; treating patient 12 at dose level 4 according to the conventional method then leads to an incoherent de-escalation for patient 13. In this particular case, to avoid incoherence, de-escalation should have taken place for patient 12. In general clinical settings where potentially updated information exists between patient enrollments, dose assignment decisions should be made continuously in real time during the trial.

5.2 Definitions

For a phase I design defined on real time, coherence in de-escalation between any two time points stipulates that the dose assignment at the later time point should not be lower than that at the earlier time point if no new DLT is observed in any enrolled patient between the two time points; see Fig. 4 for a graphical illustration. In contrast, it is difficult to provide a practical definition for coherence in escalation between any two time points. If a patient experiences a DLT at the later time point, then escalation from the earlier time
[Figure 4. Enrollment of the first eight patients to a phase I trial on the study timeline (patient number against study time in months). Each line represents a patient, with "•" indicating the entry time of the patient; a line ending with a cross "x" indicates the time of a DLT, and a line ending with a circle "o" indicates no DLT at the end of follow-up, which is six months after study entry. According to coherence in de-escalation, patient 7 (entry at 4 months since study began) should receive a dose no less than that received by patients 1-6, because no DLT is observed during the period between the entries of these previous patients and four months. On the other hand, patient 8 (entry at 6 months) may receive a dose higher than that received by patient 7 without violating coherence in escalation, even though a DLT occurs between four and six months.]
point may still be reasonable, because other patients without DLT will also have longer follow-up at the later time point than at the earlier time point and so provide stronger evidence for the safety of the treatment at the later time. Consider the entry of patient 8 in Fig. 4. Although a DLT occurs between the entries of patient 7 (4 months) and patient 8 (6 months), the longer DLT-free follow-up of patients 1 through 6 at 6 months provides stronger evidence for the safety of the treatment than at 4 months. Therefore, it may not be unreasonable to escalate the dose for patient 8 from that of patient 7. However, if the two time points are close enough, the patients who remain without DLT will contribute only a little additional information, too little to offset the DLT observed at the later time point. Letting the two time points be arbitrarily close, we can then define real-time coherence in escalation as the condition that no escalation is allowed in such a short time interval; a rigorous, mathematical definition is given in Reference 2. We note
that this definition bears little practical relevance because the monitoring intervals are seldom short. However, the calculus involved is a useful theoretical concept for a real-time dose-finding design to respect. We contend that a dose-finding design with desirable theoretical properties such as consistency (13) and coherence is a necessary, yet by no means sufficient, condition for its general use.

5.3 Examples

The time-to-event continual reassessment method (TITE-CRM) (14) is the first phase I design defined on real time. The main idea of the method is to incorporate into the CRM the partial information from incompletely followed patients by down-weighting the non-DLT observations according to the length of follow-up via a weight function. If the weight function is chosen such that the weight given to each DLT-free patient is increasing as the trial progresses, the TITE-CRM will then be coherent. This monotonic increasing requirement is intuitive because
as the follow-up period of a DLT-free patient becomes longer, his or her contribution to the likelihood (via his or her weight) should also increase. Readers interested in the technical details are referred to theorem 3 in Reference 2. Several recent developments in real-time phase I designs have occurred; see References 15 and 16 for example. Whether these methods are coherent has not yet been examined.
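For illustration, the sketch below uses a simple linear weight of the kind discussed above, together with the weighted likelihood contribution of a single patient. This is a sketch of the general idea under stated assumptions rather than the exact specification of Reference 14.

```python
import math

def tite_weight(followup, obs_window):
    """A simple linear weight: follow-up time as a fraction of the full
    observation window, capped at 1. A patient who completes follow-up
    without DLT gets weight 1; a DLT-free patient observed for half the
    window contributes with weight 0.5."""
    return min(followup / obs_window, 1.0)

def weighted_loglik_contribution(p, dlt, followup, obs_window):
    """Contribution of one patient to the weighted working likelihood at
    toxicity probability p: full weight for an observed DLT, a linear
    weight for a DLT-free patient still under follow-up."""
    w = 1.0 if dlt else tite_weight(followup, obs_window)
    return dlt * math.log(p) + (1 - dlt) * math.log(1.0 - w * p)

# A DLT-free patient followed for 3 of 6 months contributes log(1 - 0.5 * p):
print(weighted_loglik_contribution(p=0.25, dlt=0, followup=3, obs_window=6))
```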
6 DISCUSSION
To summarize, this article reviews the concept and the applications of coherence under various phase I trial settings with (1) short-term toxicities, (2) late toxicities, and (3) toxicities that occur at random over a nontrivial period of time. These principles are motivated by general ethical considerations rather than by the operating characteristics of particular dose-finding designs. Therefore, any coherent design should be evaluated with extensive simulations and numerical techniques (13) before being used in actual trials. This being said, calibrating a coherent two-stage CRM avoids an over-conservative initial escalation plan (see the discussion in the section "Compatible initial design") and hence adheres to the motivating rationale of the CRM (3), which is to place fewer patients at low doses. If one finds a compatible initial design to be too aggressive, it indicates that the specified target DLT rate may be too large, and a lower tolerance should be specified. Indeed, the concept of coherence provides an objective bridge between the dose escalation plan and the specified target.
REFERENCES

1. B. Storer and D. DeMets, Current phase I/II designs: are they adequate? J. Clin. Res. Drug Dev. 1987; 1: 121–130.
2. Y. K. Cheung, Coherence principles in dose-finding studies. Biometrika 2005; 92: 863–873.
3. J. O'Quigley, M. Pepe, and L. Fisher, Continual reassessment method: a practical design for phase 1 clinical trials in cancer. Biometrics 1990; 46: 33–48.
4. D. D. Rosa, J. Harris, and G. C. Jayson, The best guess approach to phase I trial design. J. Clin. Oncol. 2006; 24: 206–208.
5. S. Chevret (ed.), Statistical Methods for Dose-Finding Experiments. New York: John Wiley & Sons, 2006.
6. B. E. Storer, Design and analysis of phase I clinical trials. Biometrics 1989; 45: 925–937.
7. S. D. Durham, N. Flournoy, and W. F. Rosenberger, A random walk rule for phase I clinical trials. Biometrics 1997; 53: 745–760.
8. D. Faries, Practical modifications of the continual reassessment method for phase I cancer clinical trials. J. Biopharm. Stat. 1994; 4: 147–164.
9. E. L. Korn, D. Midthune, T. T. Chen, L. V. Rubinstein, M. C. Christian, and R. M. Simon, A comparison of two phase I trial designs. Stat. Med. 1994; 13: 1799–1806.
10. S. N. Goodman, M. L. Zahurak, and S. Piantadosi, Some practical improvements in the continual reassessment method for phase I studies. Stat. Med. 1995; 14: 1149–1161.
11. S. Møller, An extension of the continual reassessment methods using a preliminary up-and-down design in a dose finding study in cancer patients, in order to investigate a greater range of doses. Stat. Med. 1995; 14: 911–922.
12. J. O'Quigley and L. Z. Shen, Continual reassessment method: a likelihood approach. Biometrics 1996; 52: 673–684.
13. Y. K. Cheung and R. Chappell, A simple technique to evaluate model sensitivity in the continual reassessment method. Biometrics 2002; 58: 671–674.
14. Y. K. Cheung and R. Chappell, Sequential designs for phase I clinical trials with late-onset toxicities. Biometrics 2000; 56: 1177–1182.
15. T. M. Braun, Z. Yuan, and P. F. Thall, Determining a maximum-tolerated schedule of a cytotoxic agent. Biometrics 2005; 61: 335–343.
16. T. M. Braun, Generalizing the TITE-CRM to adapt for early- and late-onset toxicities. Stat. Med. 2006; 25: 2071–2083.

CROSS-REFERENCES

Dose Escalation Design
Phase I Trials
Continual Reassessment Method (CRM)
COHORT VS. REPEATED CROSS-SECTIONAL SURVEY DESIGNS
PAULA DIEHR, University of Washington, Department of Biostatistics, Seattle, WA, USA

1 INTRODUCTION

A central goal of community-based health promotion programs is to reduce the prevalence of risky health behaviors in the intervention communities. Surveys of community residents at two or more points in time are often required to obtain direct evidence on whether such a goal is met. These surveys might use a cohort design, in which a panel of individuals is surveyed repeatedly, or else a repeated cross-sectional design, in which a fresh sample of individuals from each community is interviewed on each survey occasion, or occasionally both. The choice of survey design depends, of course, on the goals of the intervention under evaluation. A long-term intervention (changing laws, for example) would affect people in a community no matter how long they lived there, suggesting that two cross-sectional surveys would suffice, whereas an intervention that depended on exposing individuals to certain information or programs might be better evaluated in a cohort of residents. Cohort estimates of change are desirable because, in using each person as his or her own control, an estimate will generally have a smaller standard error than an estimate of change based on two independent samples. On the other hand, cohort estimates can be seriously biased if losses to follow-up or changes in the community over time make the cohort unrepresentative of the population of primary interest. The relative merits of the cohort and repeated cross-sectional sampling approaches have been described in several places (1-8). The following discussion is based on a theoretical approach for choosing between the two survey designs that includes issues of cost and attrition, and it is illustrated with relevant data. More detail is available elsewhere (9). Data from a large community-based program evaluation are used to motivate and illustrate the theory (10). A random digit dialing (RDD) survey was conducted at baseline (T0), followed by a smaller RDD survey conducted 2 years later (T1) in the same communities, yielding 5475 T0 surveys and 3524 T1 surveys with complete data. People in the T0 cross-sectional samples were also followed at T1, and 3280 repeat interviews were obtained (40% attrition); this is referred to as the cohort sample. For the T0 and T1 cross-sectional samples, there was an overall 54% response rate, which is considered in the discussion section. Consider the Venn diagram in Fig. 1. The two large circles refer to the community population at T0 and at T1; their intersection is the long-term residents who were in the community at both times; they are referred to here as "stayers." This group will be considered the population of primary interest in the following analyses, although other choices might have been made (9, 11). The two smaller circles (A + B + C and D + E) represent the respondents to the two cross-sectional surveys taken at T0 and T1, respectively. The smallest circle (C) represents the cohort sample, successfully surveyed at both T0 and T1. People in segment A (outmigrants) were surveyed at T0 but left the population before the T1 survey; those in segment C (the cohort sample) remained and were interviewed again at T1; those in B remained but were not reinterviewed, because they either refused or could not be contacted. The T1 cross-sectional survey collected information from segments D (stayers) and E (inmigrants). As B + C is a random sample of the stayers at T0, and D is a random sample of the stayers at T1, the survey estimates from these segments provide estimates for the stayers at the respective time points. Here "bias" is defined as the difference between the cohort (or cross-sectional) change estimate and the change observed among the stayers.
[Figure 1. The Venn diagram for smoking prevalence: the two large circles are the community populations at T0 and T1. A + B + C + D + F + G is the community at T0; B + C + D + E + G + H is the community at T1; B + C + D + G is the stayers; A + B + C is the T0 cross-sectional sample; D + E is the T1 cross-sectional sample; C is the cohort sample.]
It is clear from Fig. 1 that the different surveys are reaching different populations. The T 0 cross-section surveys both stayers and outmigrants, the T 1 cross-section surveys stayers and inmigrants, and the cohort represents a subset of the stayers who were willing to be surveyed at both T 0 and T 1 . Clearly, different survey designs can yield different answers. If they differ, which is most appropriate? Table 1 gives sample information for each segment of the Venn diagram. The percent of people who smoke cigarettes (smoking prevalence) is used here as the health behavior of interest. Sample sizes, estimates of smoking prevalence at T 0 and T 1 , and the standard deviations of the estimates, changes in prevalence, and biases are shown in Table 1. For example, segment A at T 0 had a sample of 805 people whose smoking prevalence was 24.7%, with a standard deviation (standard error) of 1.5 percentage points. The change in prevalence for stayers was 23.4 − 23.9 = − 0.45 percentage points, a slight decrease in healthful behavior. The change in prevalence for the cohort was +2.29 percentage points, and the bias of the cohort estimate is thus 2.29 − ( − .45) = 2.74 percentage points. The cohort estimate of change has higher bias than the cross-sectional estimate, but a lower standard error. Note that segments A through E had different smoking
prevalences, and the cohort sample was more favorable than the others.

2 THEORY: BIAS AND PRECISION

This section develops algebraic inequalities to show when a cohort survey is better than a cross-sectional survey, for a fixed survey budget, in terms of the mean squared error of its estimates. The choice of design in making optimal estimates of change is first considered for a single community and, second, for comparing changes in two communities, such as a treatment and a control community. It is assumed that the goal is to measure change in the stayers and that only two waves of surveys will be conducted, a baseline and a single follow-up.

2.1 Choice of Design for Estimating Change in One Community

The goal of the surveys is to obtain an accurate estimate of the change in (for instance) prevalence of smoking for "stayers" in a particular community. Let a particular measure of personal health behavior be independently and identically distributed with variance σ² at both times. If a cross-sectional sample of size Nx is taken at T0 and again at T1, then the variance of each estimate of smoking prevalence will be σ²/Nx, and the variance of the estimated community change will be 2σ²/Nx, assuming no overlap in the samples.
Table 1. Smoking Prevalence for Different Segments and Groups: % Current Smokers (s.d. of Estimate)*

Group     N            T0 Prevalence   T1 Prevalence   Change        Bias
A         805          24.7 (1.5)      --              --            --
B         1390         27.1 (1.1)      --              --            --
C         3280         21.9 (0.7)      19.6 (0.7)      --            --
D         2981         --              23.9 (0.8)      --            --
E         543          --              25.4 (1.8)      --            --
Stayer    4670, 2981   23.4 (0.6)      23.9 (0.8)      -0.45 (1.0)   --
Cohort    3280         21.9 (0.7)      19.6 (0.7)      2.29 (0.5)    2.74 (1.0)
X-sect    5475, 3524   23.6 (0.6)      24.1 (0.7)      -0.50 (0.9)   -0.05 (0.4)

* The standard deviation of an estimate is often referred to as a standard error.
If instead a cohort sample of size Nc is taken at T0 and followed over time, the variance of the estimated community change will be (1 − ρ)2σ²/Nc, where ρ is the correlation between a cohort member's smoking status at T0 and T1. If Nc = Nx, the variance of the cohort estimator is smaller than the cross-sectional variance, especially if ρ is near 1. Two desirable characteristics of estimators are unbiasedness and small variance (similar to the concepts of validity and precision). An investigator might prefer a biased estimator if its variance were sufficiently small. In Table 1, the cohort estimate is more biased but has a lower standard deviation than the cross-sectional estimate. If the sample size were much larger, the standard deviations for both methods would approach zero, and the cross-sectional estimate would be clearly better. If the sample size were much smaller, the cohort estimate might be preferable because of its smaller standard deviation. To formalize this choice, consider the mean squared error (MSE) of an estimator, which is the average squared distance of the estimator from the true parameter value. Suppose the estimator X has mean µx and variance σ²x, and that the true parameter value is θ. The expected (mean) squared distance of X from θ is then E(X − θ)² = E(X − µx + µx − θ)² = E(X − µx)² + (µx − θ)² = σ²x + (µx − θ)². The MSE is thus equal to the sum of the variance and the squared bias of X.
Letting the biases of the cohort and cross-sectional estimates be Bc and Bx, respectively, the MSEs are MSE(cross-sectional) = 2σ²/Nx + Bx² and MSE(cohort) = (1 − ρ)2σ²/Nc + Bc². A cohort estimate has a lower mean squared error than a cross-sectional estimate if

2σ²/Nx + Bx² > (1 − ρ)2σ²/Nc + Bc²    (1)
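The comparison in inequality (1) is easy to evaluate numerically. The sketch below simply computes both MSEs; the input values are illustrative placeholders, not estimates from the study.

```python
def mse_cross_sectional(sigma2, n_x, bias_x):
    """MSE of the cross-sectional estimate of change: variance plus squared bias."""
    return 2.0 * sigma2 / n_x + bias_x ** 2

def mse_cohort(sigma2, n_c, rho, bias_c):
    """MSE of the cohort estimate of change."""
    return (1.0 - rho) * 2.0 * sigma2 / n_c + bias_c ** 2

# Inequality (1): the cohort design is preferred when its MSE is smaller.
# Illustrative values only (variance in percentage-point^2 units):
print(mse_cohort(sigma2=1800.0, n_c=400, rho=0.8, bias_c=2.7)
      < mse_cross_sectional(sigma2=1800.0, n_x=500, bias_x=0.05))
```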
For large values of Nx and Nc, the inequality depends only on the bias terms. The comparison considers designs that have the same total cost, which includes the cost of nonresponse for the primary survey and implicitly assumes that the nonresponse rate is α. Let S be the total number of surveys that can be afforded. For a cross-sectional survey with two waves, Nx = S/2. For a cohort survey, however, the effect of attrition must be included. To arrive at Nc surveys in the final cohort sample, Nc + W surveys had to be conducted at T0, where W is the number of T0 surveys "wasted" because no follow-up interview was obtained (segments A and B). Let the attrition rate be α = W/(Nc + W); then W = Ncα/(1 − α). The total number of surveys in the cohort design is Nc + W at T0 and Nc at T1, so that S = 2Nc + W = 2Nc + Ncα/(1 − α) = Nc(2 − α)/(1 − α) and Nc = S(1 − α)/(2 − α). Substituting these values for Nx and Nc into Equation (1), the cohort estimate has lower MSE if

2σ²[(2 − α)/(1 − α)][ρ − α/(2 − α)] > S(Bc² − Bx²)    (2)

This inequality can be used to determine situations in which a cohort design has lower MSE than a cross-sectional design. As the
cohort bias is usually larger than the cross-sectional bias, the cross-sectional estimate will have lower MSE when S is large. Similarly, if ρ < α/(2 − α), then the left side of Equation (2) is negative, and the cross-sectional design will be better. The equation can be solved for S*, the value of S for which the cohort and cross-sectional estimates have the same MSE. Notice that if (Bc² − Bx²) is negative, dividing Equation (2) through by this term reverses the inequality; and if it is zero, the inequality is independent of S and depends only on ρ and α. Continuing the smoking example, the correlation of smoking status at T0 and at T1 is estimated as 0.80 and σ² = 1826.71. From Table 1, the attrition rate is 1 − 3280/5475 = 0.40. For 22 different health-related survey variables (including smoking), both the cohort and the cross section differed significantly from the stayers, with the stayers having less favorable results. The intertemporal correlation, ρ, ranged from 0.19 for being unemployed to 1.0 for race and gender (which do not change). For 19 of the 22 variables, the cohort bias was larger in absolute value than the cross-sectional bias, making the right-hand side of Equations (2) and (3) positive. All variables but one had correlations larger than α/(2 − α) = 0.25, meaning that the left-hand term was usually positive. The values of S*, the total sample size at which the two MSEs are the same, ranged from 244 to infinity. For moderate and large numbers of surveys, a cross-sectional design may be best for variables with a low S*; a cohort estimate is better for the others unless very large samples can be afforded.
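The break-even budget S* can be computed directly from inequality (2). The sketch below does so for the usual case in which the squared cohort bias exceeds the squared cross-sectional bias, using the smoking-example inputs quoted above and the biases from Table 1; the printed value is simply what these inputs imply, not a figure reported in the original study.

```python
def s_star(sigma2, rho, alpha, bias_c, bias_x):
    """Break-even total number of surveys S* at which the cohort and
    cross-sectional MSEs are equal, from inequality (2). Assumes the usual
    case Bc^2 > Bx^2; see the text for the other cases."""
    bias_term = bias_c ** 2 - bias_x ** 2
    if bias_term <= 0:
        raise ValueError("see text: no finite positive S* in this case")
    lhs = 2.0 * sigma2 * (2 - alpha) / (1 - alpha) * (rho - alpha / (2 - alpha))
    return lhs / bias_term

# Smoking example: sigma^2 = 1826.71, rho = 0.80, alpha = 0.40,
# cohort bias 2.74 and cross-sectional bias -0.05 from Table 1:
print(round(s_star(1826.71, 0.80, 0.40, 2.74, -0.05)))
```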
2.2 Comparison of Change in Two Communities

Inequality (2) shows when a cohort design is better than a cross-sectional design for estimating change in a single community. However, if the goal of the survey is to estimate the difference in change between two communities (usually a treatment versus a control community), it is possible that the biases in the treatment and control estimates are similar and will thus cancel out. A similar argument to that above shows that a cohort design estimate of the difference between the treatment change and the control change has a lower mean squared error than a cross-sectional design estimate for the same fixed budget if

4σ²[(2 − α)/(1 − α)][ρ − α/(2 − α)] > S(βc² − βx²)    (3)
where βc = Bc,tx − Bc,ctrl (the difference between the treatment and control cohort biases) and βx = Bx,tx − Bx,ctrl (the difference between the treatment and control cross-sectional biases). Note that, even if Bc,tx and Bc,ctrl are large, βc may be small. Differences in costs and sample sizes can be incorporated into this equation. Of the 22 variables considered, the term (βc² − βx²) was negative in only 5 instances, but S* was usually very large. That is, a cohort estimate has a lower mean squared error unless the samples are very large. If about 1000 surveys could be afforded, the cross-sectional estimate would be better for estimating the difference in changes for only 8 of the 22 variables. Equations (2) and (3) may be used to help plan community surveys when attrition is a factor, and other methods may be used when it is not (7, 8). The data used here for illustration had a high attrition rate, which should be favorable to repeated cross-sectional survey designs, but the cohort designs usually had lower MSE. Other articles cited here have different examples and somewhat different findings.
REFERENCES

1. D. C. Altman, A framework for evaluating community-based heart disease prevention programs. Soc. Sci. Med. 1986; 22: 479–487.
2. G. V. Glass, V. L. Willson, and J. M. Gottman, Design and Analysis of Time-Series Experiments. Boulder: Colorado Assoc. Univ., 1975.
3. S. Salvini, D. J. Hunter, L. Sampson, M. J. Stampfer, G. A. Colditz, et al., Food-based validation of dietary questionnaires: the effects of week-to-week variation in food consumption. Int. J. Epidemiol. 1989; 18: 858–867.
4. T. D. Koepsell, E. H. Wagner, A. C. Cheadle, et al., Selected methodological issues in evaluating community-based health promotion and disease prevention programs. Annu. Rev. Publ. Health 1992; 13: 31–57.
5. J. T. Salonen, T. W. Kottke, D. R. Jacobs, and P. J. Hannan, Analysis of community-based studies-evaluation issues in the North Karelia Project and the Minnesota Heart Health Program. Int. J. Epidemiol. 1986; 15: 176–182.
6. L. S. Caplan, D. S. Lane, and R. Grimson, The use of cohort vs repeated cross-sectional sample survey data in monitoring changing breast cancer screening practices. Prev. Med. 1995; 24: 553–556.
7. S. M. McKinlay, Cost-efficient designs of cluster unit trials. Prev. Med. 1994; 23: 606–611.
8. H. A. Feldman and S. M. McKinlay, Cohort versus cross-sectional design in large field trials: precision, sample size, and a unifying model. Stat. Med. 1994; 13: 61–78.
9. P. Diehr, D. C. Martin, T. Koepsell, A. Cheadle, E. Wagner, and B. M. Psaty, Optimal survey design for community-intervention evaluations: cohort or cross-section? J. Clin. Epidemiol. 1995; 48: 1461–1472.
10. E. H. Wagner, T. D. Koepsell, C. Anderman et al., The evaluation of the Henry J Kaiser Family Foundation's Community Health Promotion Grant Program: Design. J. Clin. Epidemiol. 1991; 44: 685–699.
11. M. H. Gail, D. P. Byar, T. F. Pechachek, and D. K. Corle, Aspects of statistical design for the Community Intervention Trial for Smoking Cessation (COMMIT). Controlled Clinical Trials 1992; 13: 6–21.
COLLINEARITY
G. A. DARLINGTON
Cancer Care Ontario, Toronto, Ontario, Canada

Collinearity (or "multicollinearity") refers to a high level of correlation within a set of explanatory variables. In a regression modeling situation, if explanatory variables are highly correlated, then regression coefficient estimates may become unstable and not provide accurate measures of the individual effects of the variables. The estimate of the precision of these coefficient estimates is also affected, and therefore confidence intervals and hypothesis tests are likewise affected. For the estimation of regression coefficients, the columns of the design matrix must be linearly independent. At an extreme, if two explanatory variables are perfectly linearly associated (i.e., their correlation is equal to 1), then such collinearity is an example of linearly dependent columns in the design matrix, X. While two parameters require estimation (i.e., the regression coefficients for the two explanatory variables), information is not available in the design matrix to estimate both coefficients uniquely. The two individual effects cannot be distinguished as a result of this collinearity. While collinearity typically does not involve completely linearly related explanatory variables, high levels of correlation can still lead to difficulties in coefficient estimation. It should be noted that this issue pertains to the relationship among explanatory variables, which ultimately affects the ability to investigate simultaneously the relationship between the response variable and the explanatory variables. Therefore, the identification of potential collinearity problems is usually addressed by examination of the relationships among explanatory variables. One simple technique for the identification of collinearity is presented in Kleinbaum et al. (1): the computation of the variance inflation factor (VIF). If there are p explanatory variables, each explanatory variable is, in turn, regarded as an outcome variable in a regression equation that includes the remaining p − 1 explanatory variables. Then, Rj² represents the squared multiple correlation obtained using explanatory variable j, j = 1, . . . , p, as the response. The VIF is then defined for each such regression as

VIFj = 1 / (1 − Rj²).
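As a minimal sketch of this computation (hypothetical data, NumPy only; not code from reference (1)), each explanatory variable is regressed on the remaining ones and VIFj = 1/(1 − Rj²) is formed from the resulting squared multiple correlation:

```python
# Sketch: variance inflation factors obtained by regressing each explanatory
# variable on the remaining ones. The data below are hypothetical.
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Return VIF_j = 1 / (1 - R_j^2) for each column of the n x p matrix X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        Z = np.delete(X, j, axis=1)
        Z1 = np.column_stack([np.ones(n), Z])       # add an intercept column
        coef, *_ = np.linalg.lstsq(Z1, y, rcond=None)
        resid = y - Z1 @ coef
        r2 = 1.0 - resid.var() / y.var()            # squared multiple correlation
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)           # nearly collinear with x1
x3 = rng.normal(size=200)                           # unrelated variable
print(vif(np.column_stack([x1, x2, x3])))           # large VIFs for x1 and x2, ~1 for x3
```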
If there is a strong relationship between explanatory variable j and the remaining p − 1 explanatory variables, then Rj² is close to 1 and VIFj is large. It is suggested in (1) that values of VIF greater than 10 indicate serious collinearity that will affect coefficient and precision estimation. Collinearity may also be indicated if coefficient estimates from fitting simple regression models of the response with each explanatory variable are substantially different from coefficient estimates from fitting a multiple regression model including all explanatory variables. Similarly, if the order in which certain terms are included in the model seriously affects the coefficient estimates for these terms, then collinearity is indicated. Of course, one of the primary purposes of multivariate regression models is to examine the role of each explanatory variable after "adjusting" for the other variables in the model, so such behavior is not necessarily a problem. However, serious collinearity problems may prevent a multivariate model from being fitted at all. If two or more explanatory variables are highly correlated because they represent measurements of the same general phenomenon (e.g., highest attained level of education and current salary are both aspects of socioeconomic status), then collinearity can be addressed by choosing one variable thought to be the most relevant. This variable would then be included in any models and the remaining, so-called redundant, variables would be excluded. The identification of such redundant variables may be difficult, so, alternatively, a new variable that combines information on the correlated variables can be derived. This aggregate variable would
be included in models instead of all of the component variables. It is sometimes helpful, particularly when collinearity is created as a result of including polynomial terms (e.g., X and X² are included in a model together) but also in general, to center the original explanatory variables. This is accomplished by computing new explanatory variables that are the original measurements with their means subtracted. Suppose there are n individuals and p explanatory variables measured on each individual, Xji, i = 1, . . . , n, j = 1, . . . , p. Then the new explanatory variables are Zji = Xji − X̄j, where X̄j is the mean of the jth explanatory variable. If a quadratic model is of interest, then one would include the terms Zji and Zji² in the model. In (1), an example of the effectiveness of such an approach for correcting collinearity is presented.
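A brief sketch with hypothetical data illustrates the effect of centering when a quadratic term is included: the raw variable and its square are almost perfectly correlated, whereas the centered variable and its square are nearly uncorrelated.

```python
# Sketch: centering an explanatory variable before forming its square
# greatly reduces the correlation between the linear and quadratic terms.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=5, size=500)   # hypothetical measurements with a large mean
z = x - x.mean()                            # centered version

print("corr(X, X^2) =", round(np.corrcoef(x, x**2)[0, 1], 3))   # close to 1
print("corr(Z, Z^2) =", round(np.corrcoef(z, z**2)[0, 1], 3))   # close to 0
```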
When polynomial regression is being undertaken, the further step of orthogonalization of the explanatory variables is also possible and is frequently used in some settings. Orthogonalization of more general sets of explanatory variables is possible but not as widely used. REFERENCES 1. Kleinbaum, D. G., Kupper, L. L. & Muller, K. E. (1988). Applied Regression Analysis and Other Multivariate Methods, 2nd Ed. PWS-Kent, Boston.
COMBINATION THERAPY
JANET DARBYSHIRE
London, United Kingdom

1 DEFINITION

Combination therapy regimens most commonly combine two or more drugs that have additive or synergistic action to treat a disease more effectively, frequently infections or cancer, with the aim of producing more effective regimens and preventing the emergence of resistance of the micro-organism or the tumor to the drugs. However, combinations of drugs are widely used to treat many other diseases such as hypertension, asthma, diabetes, and arthritis. Combinations of drugs and other modalities, such as surgery or radiotherapy, are also used in certain disease areas, such as cancer. Clinical trials may be designed to evaluate the combination therapy or to assess the effect of a new drug in addition to other drugs as part of a combination. In the first case, it is the combination that is being evaluated, and all the components may be different between the regimens or, indeed, a combination may be tested against a single drug (for example, the ICON 3 trial compared a combination of two drugs with either a regimen of 3 drugs or monotherapy) (1). Indeed, as the number of available drugs has increased for the treatment of diseases such as cancer, tuberculosis, or HIV infection, trials have successively evaluated combinations of two, three, or more drugs to achieve a more rapid or complete cure depending on the disease. The balance between increasing efficacy by increasing the number of drugs in a combination and the risk of increasing toxicity is key to the development of optimal regimens. The introduction of new drugs, especially with different modes of action, may lead to major changes in the therapy of a disease. Tuberculosis is a classic example of how the use of combination therapy changed over time. The first drug, streptomycin (S), was clearly highly effective, but only for a short time as resistance rapidly emerged (2). The successive development of two- and three-drug regimens as more drugs became available, initially PAS (P) and isoniazid (H), led to effective treatments with regimens of three drugs for one to three months followed by two drugs for 12 to 18 months (SPH/PH) (3). The introduction of new drugs, particularly rifampicin (R) and pyrazinamide (Z), led to the development of more effective regimens, although it has proved impossible to reduce the duration below 6 months and maintain highly effective regimens (SHRZ/HR) (4). The need for new antituberculosis drugs is largely driven by the high incidence of resistance to one or more of these key drugs, but there would also be major benefits in combination regimens that would reduce the total duration to less than 6 months. A very similar process has led to the current standard therapies for HIV infection, with trials demonstrating that, as they became available, two drugs were better than one (5) and three better than two (6). Now, although different combinations are compared in many trials, often only one of the drugs differs. For example, in the ACTG 384 trial, triple and quadruple regimens were compared, which all included two nucleoside analogue reverse transcriptase inhibitors (NRTIs) with either efavirenz (EFV), a non-nucleoside reverse transcriptase inhibitor (NNRTI); nelfinavir, a protease inhibitor (PI); or both drugs (7). This trial also explored, in a factorial design, two different NRTI combinations, didanosine (ddI) plus stavudine (d4T) and zidovudine (ZDV) plus lamivudine (3TC). As in tuberculosis, a need exists for new drugs and combinations for HIV infection, as the current therapies have to be given for long periods and are associated with substantial failure rates, usually because of difficulties with adherence to therapy and the emergence of resistance. Once a combination has been shown to be effective and the optimal dosages of the drugs are clearly defined, advantages of combining the drugs in a single preparation exist, as this combination aids compliance by simplifying therapy. It also minimizes the risk of the
development of resistance to a drug, in the case of antibiotics, as a result of the patients choosing to take only one of the drugs. However, disadvantages exist in terms of the lack of flexibility to relate dose to weight and the management of toxicity. Combinations of ethambutol and isoniazid, which are widely used for the treatment of tuberculosis, are available in a variety of formulations because of the dose-related toxicity of ethambutol and, therefore, the need to relate the dose to weight. Such combinations should be compared with the individual drugs to ensure that they are adequately absorbed and produce comparable pharmacokinetic profiles. If a trial is evaluating combination regimens that are available in combined preparations, a decision will have to be made as to whether the preparation should be used. If they are, and the comparator is not available as a combined preparation, it may overestimate the benefits of the combined preparation because it is more likely to be taken. However, if single drugs are given for both regimens, the potential benefits of the combined preparation cannot be adequately assessed. In a trial that is exploring experimental regimens, single preparations are more often used, but if the regimens are being assessed as they will be used in routine practice, then a case for using the combined preparation exists. The development of combination regimens is often built on evidence from laboratory studies of additive or synergistic activity of drugs and the need for combinations because no single drug is adequate to cure a disease in some or all patients (for example, cancer or tuberculosis) or to control it (for example, HIV infection, hypertension, or diabetes). In different diseases, the approach may be different according to availability of drugs or the disease course. In chronic diseases, such as hypertension, Type II diabetes, or epilepsy, the aim is to control the disease with the minimum therapy. New drugs may be added to a combination over time to achieve this result. Trials may be designed to compare aggressive therapy with this standard approach to assess the impact on long-term disease control. Trials of new drugs are likely to mimic this approach by adding a new drug or a standard drug compared with adding a placebo or,
alternatively, adding an existing drug, if this practice is standard. In some diseases, different aims from trials of combination therapies may exist. One approach is to try to improve the results of treatment by increasing the potency of the regimens, by adding more drugs to the initial regimen, ideally with a different mode of action. For example, the ACTG 384 trial compared a four-drug regimen with the two standard three-drug regimens for HIV infection (7). An alternative approach is to reduce the toxicity of the regimens by minimizing the exposure to drugs while maintaining control of the infection. Such an approach may be particularly important in diseases such as HIV infection where the drugs are likely to be given for long periods of time and have potentially serious long-term side effects. A number of ongoing trials are exploring different approaches, for example, comparing strict control of viral replication with less aggressive therapy based on immunological markers. Trials of combination therapies may be used to assess new drugs when they will only be given as part of such regimens. Two alternative approaches exist that may be appropriate in different circumstances. The first is to randomize to add the new drug or placebo to existing standard therapy (8). The second is to substitute the new drug for one of the drugs in a standard combination and to compare with the standard combination (9). The advantages of the former are that, theoretically, it is easier to demonstrate a difference from placebo, but the risk is that if the current therapy is highly effective, little if any benefit may exist from adding a new drug. The disadvantage of the second approach is that it is likely to be more difficult to demonstrate superiority or equivalence to an existing drug. Further, reluctance may exist to substitute a new drug for one that is known to be effective. In some areas, such as leukaemia or lymphoma where therapies are becoming more and more effective, and yet are still not uniformly successful, it is becoming increasingly difficult to assess new combination therapies as the improvements are likely to be small. The large trials needed to reliably assess such
small differences may not be feasible, especially for rarer types of disease. The development of new drugs for diseases such as cancer and HIV brings new challenges, not the least of which is how best to evaluate the many potential combinations that can be selected by combining the new and old drugs. Novel trial designs are needed together with the development of better surrogate markers that can be used to select the best combinations to take forward into large trials to assess clinical benefits and risks. Two-stage designs, such as those reported by Royston and Parmar (10), are innovative approaches to this problem. Combination therapies are only needed if monotherapies are not potent enough. Ultimately, the aim is to provide a therapy that is effective, safe, and simple to take, and in many diseases, major advantages would exist in replacing regimens of three, four, or even more drugs by a monotherapy regimen. When the first trials of protease inhibitors demonstrated their high potency in HIV infection, some hope existed that they might be as effective as current two-drug combination regimens, but it soon became clear that they were not sufficiently potent to prevent the emergence of resistance on their own, although they had a major impact when added to the two-drug regimen. Combinations of drug and nondrug therapy may be used to treat diseases, such as cancer, where bone marrow or stem cell transplants require chemotherapy as part of the whole treatment. Similarly, combinations of chemotherapy and radiotherapy are effective in some forms of tumor (referred to as chemoradiotherapy). Other multi-mode combinations, such as chemotherapy and immunotherapy (with drugs or therapeutic vaccines), may be developed for infections. The approach to assessing all of these is similar to drug combinations, but special issues often exist, such as the timing of the different interventions within the overall treatment plans to achieve maximum benefits and to minimize toxicity. These issues of timing may make the assessment of the regimens more complex, particularly if the timings differ between regimens.
REFERENCES 1. ICON Collaborators, ICON3: randomised trial comparing paclitaxel plus carboplatin against standard chemotherapy of either single agent carboplatin or CAP (cyclophosphamide, doxorubicin, cisplatin) in women with ovarian cancer. Lancet 2002; 360: 505–515. 2. A five-year assessment of patients in a controlled trial of streptomycin in pulmonary tuberculosis. Quart. J. Med. 1954; 91: 347–366. 3. Medical Research Council, Long-term chemotherapy in the treatment of chronic pulmonary tuberculosis with cavitation. Tubercle 1962; 43: 201–267. 4. W. Fox, Whither short-course chemotherapy? Br. J. Dis. Chest 1981; 75: 331–357. 5. HIV Trialists' Collaborative Group, Zidovudine, didanosine, and zalcitabine in the treatment of HIV infection: meta-analyses of the randomised evidence. Lancet 1999; 353: 2014–2025. 6. S. M. Hammer, K. E. Squires, M. D. Hughes, J. M. Grimes, L. M. Demeter, J. S. Currier et al., A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and CD4 cell counts of 200 per cubic millimeter or less. N. Engl. J. Med. 1997; 337(11): 725–733. 7. G. K. Robbins, V. De Gruttola, R. W. Shafer et al., Comparison of sequential three-drug regimens as initial therapy for HIV-1 infection. N. Engl. J. Med. 2003; 349: 2293–2303. 8. D. W. Cameron, M. Heath-Chiozzi, S. Danner, C. Cohen, S. Kravcik, C. Maurath et al., Randomised placebo controlled trial of ritonavir in advanced HIV-1 disease. Lancet 1998; 351(9102): 543–549. 9. S. Staszewski, J. Morales-Ramirez, K. T. Tashima, A. Rachlis, D. Skiest, J. Stanford, R. Stryker, P. Johnson, D. F. Labriola, D. Farina, D. J. Manion, and N. M. Ruiz, Efavirenz plus zidovudine and lamivudine, efavirenz plus indinavir, and indinavir plus zidovudine and lamivudine in the treatment of HIV-1 infection in adults. N. Engl. J. Med. 1999; 341(25): 1865–1873. 10. P. Royston, M. K. B. Parmar, and W. Qian, Novel designs for multi-arm clinical trials with survival outcomes with an application in ovarian cancer. Stat. Med. 2003; 22: 2239–2256.
COMMITTEE FOR MEDICINAL PRODUCT FOR HUMAN USE (CHMP)
The Committee for Medicinal Products for Human Use (CHMP) is responsible for preparing the European Medicines Agency's opinions on all questions concerning medicinal products for human use, in accordance with Regulation (EC) No 726/2004. The CHMP plays a vital role in the marketing procedures for medicines in the European Union (EU):
• The CHMP is responsible for conducting the initial assessment of medicinal products for which a Community-wide marketing authorization is sought. The CHMP is also responsible for several post-authorization and maintenance activities, including the assessment of any modifications or extensions (variations) to the existing marketing authorization.
• The CHMP arbitrates in cases of disagreement between member states over the marketing authorization of a particular medicinal product. The CHMP also acts in referral cases, initiated when there are concerns relating to the protection of public health or where other community interests are at stake.
Assessments conducted by the CHMP are based on purely scientific criteria and determine whether the products concerned meet the necessary quality, safety, and efficacy requirements in accordance with EU legislation (particularly Directive 2001/83/EC). These processes ensure that once medicinal products reach the marketplace they have a positive risk–benefit balance in favor of the patients/users of the products. Subsequent monitoring of the safety of authorized products is conducted through the EU's network of national medicines agencies, in close cooperation with health-care professionals and the pharmaceutical companies themselves. The CHMP plays an important role in EU-wide pharmacovigilance by closely monitoring reports of potential safety concerns (Adverse Drug Reaction Reports, or ADRs) and, when necessary, by making recommendations to the European Commission regarding changes to a product's marketing authorization or the product's suspension/withdrawal from the market. In cases where there is an urgent requirement to modify the authorization of a medicinal product due to safety concerns, the CHMP can issue an Urgent Safety Restriction (USR) to inform health-care professionals about changes in how or under what circumstances the medication may be used. The CHMP publishes a European Public Assessment Report (EPAR) for every centrally authorized product that is granted a marketing authorization, which sets out the scientific grounds for the Committee's opinion in favor of granting the authorization. A Summary of Product Characteristics (SPC) is also published with the labeling and packaging requirements for the product and details of the procedural steps taken during the assessment process. These EPARs are published on the EMEA's website and are generally available in all official languages of the EU. Scientific assessment work conducted by the CHMP is subject to an internal peer-review system to safeguard the accuracy and validity of opinions reached by the Committee. The EMEA's integrated quality-management system ensures effective planning, operation, and control of the CHMP's processes and records. Other important activities of the CHMP and its working parties include:
• Assistance to companies in researching and developing new medicines.
• Preparation of scientific and regulatory guidelines for the pharmaceuticals industry.
• Cooperation with international partners on the harmonization of regulatory requirements for medicines.
This article was modified from the website of the European Medicines Agency (http://www.emea.europa.eu/htms/general/contacts/CHMP/CHMP.html) by Ralph D'Agostino and Sarah Karl.
COMMON TECHNICAL DOCUMENT (CTD)
Through the International Conference on Harmonisation (ICH) process, considerable harmonization has been achieved among the three regions (Japan, Europe, and the United States) in the technical requirements for the registration of pharmaceuticals for human use. However, until now, no harmonization of the organization of a submission has existed. Each region has its own requirements for the organization of the technical reports in the submission and for the preparation of the summaries and tables. In Japan, the applicants must prepare the GAIYO, which organizes and presents a summary of the technical information. In Europe, expert reports and tabulated summaries are required, and written summaries are recommended. The U.S. Food and Drug Administration (FDA) has guidance regarding the format and content of the new drug application submission. To avoid generating and compiling different registration dossiers, this guidance describes a harmonized format for the Common Technical Document (CTD) that will be acceptable in all three regions. Throughout the CTD, the display of information should be unambiguous and transparent to facilitate the review of the basic data and to help a reviewer quickly become oriented to the application contents. Text and tables should be prepared by using margins that allow the document to be printed on both A4 paper (E.U. and Japan) and 8.5 x 11" paper (U.S.). The left-hand margin should be sufficiently large that information is not obscured through binding. Font sizes for text and tables should be of a style and size that are large enough to be easily legible, even after photocopying. Times New Roman, 12-point font is recommended for narrative text. Acronyms and abbreviations should be defined the first time they are used in each module. References should be cited in accordance with the current edition of the Uniform Requirements. The CTD should be organized into five modules: Module 1 is region specific; modules 2, 3, 4, and 5 are intended to be common for all regions. Conformance with the CTD guidances should help ensure that these four modules are provided in a format acceptable to the regulatory authorities (see the overall outline below).
Module 1. Administrative Information and Prescribing Information
This module should contain documents specific to each region; for example, application forms or the proposed label for use in the region. The content and format of this module can be specified by the relevant regulatory authorities. For information about this module, see the guidance for industry, General Considerations for Submitting Marketing Applications According to the ICH/CTD Format.

Module 2. Common Technical Document Summaries
Module 2 should begin with a general introduction to the pharmaceutical, including its pharmacologic class, mode of action, and proposed clinical use. In general, the introduction should not exceed one page. Module 2 should contain seven sections in the following order:
• CTD Table of Contents
• CTD Introduction
• Quality Overall Summary
• Nonclinical Overview
• Clinical Overview
• Nonclinical Written and Tabulated Summaries
• Clinical Summary
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cber/gdlns/m4ctd.pdf) by Ralph D’Agostino and Sarah Karl.
Because Module 2 contains information from the Quality, Efficacy, and Safety sections of the CTD, the organization of the
individual Module 2 summaries is discussed in three separate documents:
• M4Q: The CTD – Quality
• M4S: The CTD – Safety
• M4E: The CTD – Efficacy
Module 3. Quality
Information on Quality should be presented in the structured format described in the guidance M4Q.

Module 4. Nonclinical Study Reports
The Nonclinical Study Reports should be presented in the order described in the guidance M4S.

Module 5. Clinical Study Reports
The human study reports and related information should be presented in the order described in the guidance M4E.

The CTD should be organized according to the following general outline.

Module 1: Administrative Information and Prescribing Information
1.1 Table of Contents of the Submission Including Module 1
1.2 Documents Specific to Each Region (for example, application forms and prescribing information)

Module 2: Common Technical Document Summaries
2.1 CTD Table of Contents
2.2 CTD Introduction
2.3 Quality Overall Summary
2.4 Nonclinical Overview
2.5 Clinical Overview
2.6 Nonclinical Written and Tabulated Summary
    Pharmacology
    Pharmacokinetics
    Toxicology
2.7 Clinical Summary
    Biopharmaceutics and Associated Analytical Methods
    Clinical Pharmacology Studies
    Clinical Efficacy
    Clinical Safety
    Synopses of Individual Studies

Module 3: Quality
3.1 Module 3 Table of Contents
3.2 Body of Data
3.3 Literature References

Module 4: Nonclinical Study Reports
4.1 Module 4 Table of Contents
4.2 Study Reports
4.3 Literature References

Module 5: Clinical Study Reports
5.1 Module 5 Table of Contents
5.2 Tabular Listing of All Clinical Studies
5.3 Clinical Study Reports
5.4 Literature References
COMMUNITY-BASED BREAST AND CERVICAL CANCER CONTROL RESEARCH IN ASIAN IMMIGRANT POPULATIONS
VICTORIA M. TAYLOR
Fred Hutchinson Cancer Research Center, Division of Public Health Sciences, Seattle, Washington
University of Washington, Department of Health Services, Seattle, Washington

T. GREGORY HISLOP
Cancer Control Research, British Columbia Cancer Agency, Vancouver, British Columbia, Canada

YUTAKA YASUI
Fred Hutchinson Cancer Research Center, Division of Public Health Sciences, Seattle, Washington

SHIN-PING TU and J. CAREY JACKSON
University of Washington, Department of Medicine, Seattle, Washington
Harborview Medical Center, Seattle, Washington

1 INTRODUCTION

The number of Asian Americans in the United States increased from a little over one million in 1970 to nearly seven million in 1990, and reached more than 10 million by 2000 (1–3). Two-thirds (66%) of Asian Americans are foreign born and the majority (56%) do not speak English "very well." Further, North America's Asian population is heterogeneous and includes individuals from East Asia (e.g., mainland China), South Asia (e.g., India), Southeast Asia (e.g., Vietnam), and Island Asia (e.g., the Philippines) (4). The Presidential Initiative on Race aims to eliminate disparities in six areas of health status, including cancer screening, over the next decade (5, 6). Current American Cancer Society screening guidelines specify that women aged 40 and older should receive a screening mammogram annually and women aged 18 and older should receive Papanicolaou (Pap) testing every one to three years, depending on their risk for disease (7, 8). The Healthy People 2010 objectives for breast and cervical cancer screening are as follows: 70% of age-eligible women should have received mammography in the preceding two years and 90% of women should have received a Pap test within the preceding three years (9). Data from the 2000 National Health Interview Survey (NHIS) show wide variations in breast and cervical cancer screening rates by race and ethnicity (Table 1). In addition, the "Pathways Project" surveyed five San Francisco Bay populations in 1994; nearly all the white (99%) and black (98%) respondents reported at least one Pap smear, compared with 76% of Latina, 67% of Chinese, and 42% of Vietnamese respondents; similar patterns were observed for mammography (10). As the NHIS is conducted in English, it excludes individuals with limited English proficiency (11). The "Breast and Cervical Cancer Intervention Study" found breast and cervical cancer screening rates were significantly lower among non-English speaking than English speaking Chinese and Latina women (12). These findings support the importance of cancer screening interventions targeting immigrants, particularly those that are less acculturated (12, 13). This article focuses on community-based breast and cervical cancer control research in Asian populations. First, the authors address principles of community participatory research and several important program evaluation issues: sampling, survey methods, recruitment and retention, translation, data quality, and control group "contamination." Second, the authors summarize community-based studies that aimed to increase mammography or Pap testing levels among Asian women. For this summary, the authors used the same approach as Legler et al. (14) in their meta-analysis of interventions to promote mammography among women with historically low mammography rates. Specifically, the authors only considered studies in which the researchers used an experimental or quasi-experimental design to evaluate the effectiveness of their intervention, and reported intervention outcomes based on actual receipt of breast or cervical cancer screening (14).
Table 1. Breast and Cervical Cancer Screening Rates by Race and Ethnicity, United States, 2000

Race/Ethnicity | Mammogram within the past two years (a), % | Pap test within the past three years (b), %
White | 72 | 82
Black | 68 | 84
Latina | 63 | 77
American Indian or Alaska Native | 52 | 77
Asian or Pacific Islander | 57 | 67

(a) Women aged 40 years and older. (b) Women aged 18 years or older.
2 COMMUNITY PARTICIPATORY RESEARCH The importance of active community participation in research projects involving racial/ethnic minority groups is increasingly recognized (15, 16). Community-based research principles, developed by the University of Washington School of Public Health and Community Medicine, specify the following: Community partners should be involved in the earliest stages of the project and have real influence on project direction; research processes and outcomes should benefit the community; members of the community should be part of the analysis and interpretation of data; productive partnerships between researchers and community members should be encouraged to last beyond the life of the project; and community members should be empowered to initiate their own research projects that address needs they identify themselves (17). The University of California at San Francisco’s ongoing study ‘‘Reaching Vietnamese American Women: A Community Model for Promoting Cervical Cancer Screening’’ provides one example of successful communitybased research. The study collaborates with a community coalition that includes representatives from community-based organizations and health agencies that provide services to Vietnamese Americans; it uses community forums to both develop and evaluate program components; and it employs lay health workers from the targeted community. However, this research group has noted that principles of community-based research,
developed in the United States as a product of American culture, cannot always be applied to immigrants. For example, the researchers had to strike a balance between respecting the dominant cultural pattern of deferring to authority figures (e.g., local Vietnamese physicians) and encouraging community members to contribute their views. Additionally, the researchers were not able to involve non-English speaking community members in data collection because their Institutional Review Board would not allow individuals to engage in data collection activities unless they received human subjects certification (and no human subjects training materials are available in Vietnamese) (18).
3 EVALUATION ISSUES 3.1 Sampling Simple random sampling can only be used for studies targeting Asian-American populations when the group of interest is highly concentrated within a defined geographic boundary (19). For example, the ‘‘Pathways Project’’ successfully identified Chinese households using random digit dialing within San Francisco telephone prefixes that were at least 35% Chinese (20). In less concentrated populations, the following approaches have variously been used: purchasing marketing company lists for individuals of a particular Asian ethnicity, acquiring lists from organizations that serve one particular community (e.g., the Chinese Community Health Plan in Northern California), and convenience sampling
(e.g., from Filipino and Korean church congregations and community organizations) (20–24). Certain Asian populations have characteristic surnames that can be applied to existing databases (e.g., telephone books) (25–29). However, the efficiency of this approach to sampling varies by Asian subgroup and is a function of population density (19, 27, 29). For example, it works well for Vietnamese who have a relatively small number of highly characteristic last names (it has been shown that 99% of Vietnamese households can be identified by using 37 surnames) (30). Surname lists work less well for other groups. It has been found to be an inefficient method of identifying Cambodian households, and it is doubtful that this approach would work for Filipinos who often have Spanish surnames (19, 31). Additionally, last name lists are not useful for identifying women from Asian subgroups with a high rate of interracial marriage (e.g., Japanese Americans) (19). Finally, for some Asian populations, a universal set of last names that can be used in research projects may not exist; rather, each locale and time period may require its own list of names. For example, some Chinese immigrant communities include a higher proportion of people with Mandarin (as opposed to Cantonese) last names than others (25). 3.2 Survey Methods Cancer control intervention studies often use surveys to identify individuals for randomized controlled trials or to evaluate intervention effectiveness. In-person interviews, conducted in the language of an interviewee’s choice, are believed to be the most effective method of enlisting cooperation from Asian immigrant groups, and can be cost-efficient when the sample is drawn from nondispersed communities (e.g., Vietnamese in the Tenderloin area of San Francisco); however, it is usually a cost-prohibitive approach for surveying dispersed Asian populations (e.g., Vietnamese in multiple Californian counties) (32, 33). Table 2 provides survey response rates from population-based breast and cervical cancer control projects, and shows that response rates vary by Asian subgroup and geographic area as well as method of survey administration (30, 31, 33–38).
3.3 Recruitment and Retention It can be difficult to obtain high response rates and low loss to follow-up in populations over-represented by immigrants for multiple reasons. Specifically, immigrants may not have working telephones, frequently have more than one job, may be concerned about safety and suspicious of strangers, often have multiple health and social problems, and tend to move frequently (19, 31, 39). Also, some Asian immigrants travel frequently between their original and adopted countries. A British research group identified South Asian women who were registered with general practitioners in England for a breast cancer intervention study, and found that 14% of the sample had recently moved while 12% were currently in Asia (40). Generally, recruitment and retention are enhanced by employing bilingual and bicultural staff members from the targeted community. Gender matching of research personnel and participants is also important, especially for members of traditional Asian cultures in which discussions of certain topics (e.g., cancer and gynecologic issues) are inappropriate with the opposite sex (41). 3.4 Translation As many Asian immigrants do not speak English, questionnaires have to be prepared in Asian languages, which is a somewhat cumbersome and time-consuming process that usually involves one of the following: translation, back-translation to ensure equivalence, and reconciliation of any discrepancies between the original and back-translated English versions; or double forward translation into the relevant Asian language, and then review by a referee (41–44). Additionally, some Asian subgroups include individuals who speak multiple languages or dialects (e.g., Filipino American communities include Cebuano-, Ilocano-, and Tagalog-speaking individuals) (45). Experience from the "Pathways Project" shows that Pap smear and mammogram, for example, cannot be translated into Chinese or Vietnamese. Therefore, the terms have to be stated in English and subsequently defined in the relevant Asian language (43, 44).
Table 2. Response Rates from Surveys of Asian Populations

Study | Population | Geographic area | Year(s) survey conducted | Survey method | Response rate (%)
Taylor, 1999 | Cambodian | Seattle, Washington | 1997-98 | In-person | 72
Hislop, 2003 | Chinese | Vancouver, British Columbia | 1999 | In-person | 55
Taylor, 2002 | Chinese | Seattle, Washington | 1999 | In-person | 64
Wismer, 1998 | Korean | Alameda/Santa Clara counties, California | 1994 | Telephone | 80
Bird, 1998 | Vietnamese | San Francisco, California; Sacramento, California | 1996 | In-person | 79; 74
Jenkins, 1999 | Vietnamese | Alameda/Santa Clara counties, California; Los Angeles/Orange counties, California | 1996 | Telephone | 45; 42
Nguyen, 2002 | Vietnamese | Santa Clara county, California; Harris county, Texas | 2000 | Telephone | 63; 54
Taylor, Submitted | Vietnamese | Seattle, Washington | 2002 | In-person | 82
3.5 Data Quality Most community-based studies to evaluate breast and cervical cancer screening intervention programs have used self-reported survey data to evaluate intervention effectiveness (46). However, increasing evidence exists that the quality of survey data may differ by race and ethnicity (46, 47). The "Pathways Project" surveyed five San Francisco Bay populations in 1994. Agreement between the baseline survey and a callback survey (conducted in a 10% randomly selected subsample) for the question "Have you ever had a mammogram?" was higher among white than racial/ethnic minority women. Specifically, the test-retest reliabilities among Chinese, Vietnamese, Latina, black, and white women were 0.90 (95% CI: 0.74-0.97), 0.74 (95% CI: 0.51-0.88), 0.90 (95% CI: 0.79-0.95), 0.93 (95% CI: 0.72-0.99), and 1.00 (95% CI: 0.86-1.00), respectively. Following a baseline telephone survey in multi-ethnic Alameda County, California, investigators from the "Pathfinders Project" examined medical records to validate breast and cervical cancer self-reports. The proportions of mammograms and Pap smears that
could be validated were significantly lower among racial/ethnic minority than white women. Specifically, mammograms were validated for 89% of white women, 72% of black women, 72% of Latina women, 67% of Chinese women, and 76% of Filipina women. The corresponding proportions for Pap testing were 85%, 66%, 66%, 68%, and 67%, respectively (46). Several researchers have concluded that Asian Americans have a greater tendency than whites to provide socially desirable responses to survey questions, and have recommended using other methods of outcome ascertainment (e.g., medical record review), when possible (43, 46). 3.6 Control Group ‘‘Contamination’’ Program evaluation in Asian immigrant communities can be compromised by dissemination of the intervention to a study’s control group (24, 38, 48). Many Asian immigrant communities are relatively small and self-contained with strong social as well as extended family networks, and information is often quickly disseminated throughout the community. Although these communication channels serve the community well, they can
compromise the methodological rigor needed for randomization protocols (48).

4 COMMUNITY-BASED STUDIES
4.1 Overview The authors identified nine studies that met inclusion criteria for review (i.e., the study design was experimental or quasi-experimental, and intervention outcomes were based on actual screening test receipt) (Table 3). These studies targeted Cambodian, Chinese, Filipina, Korean, South Asian, and Vietnamese women in Canada, the United States, and England. Overall, three of the studies randomized individual women to experimental or control status, two randomized groups of women, and four used a quasi-experimental (two-community) study design. 4.2 Individual Randomized Trials Taylor et al. (39) conducted a three-arm randomized controlled trial to evaluate cervical cancer screening interventions for Chinese American/Canadian women in Seattle, Washington, and Vancouver, British Columbia. Baseline survey respondents who under-utilized Pap testing were randomized to an outreach worker intervention (that included a home visit, use of a video and print materials, and logistic assistance accessing screening services), a direct mail intervention (that included a video and print materials), or control status. Outcome evaluation was based on results from a follow-up survey as well as medical record verification of self-reported Pap testing. Overall, 39% of the 129 women who received the outreach intervention, 25% of the 139 women who received the direct mail intervention, and 15% of the 134 controls reported Pap testing following randomization (outreach worker versus control P < 0.001, direct mail versus control P = 0.03, and outreach worker versus direct mail P = 0.02) (39). Investigators in Leicester, England, conducted a four-arm randomized controlled trial to evaluate the effects of health education on the uptake of Pap smears among women originally from the Indian subcontinent (Bangladesh, Pakistan, and India).
Nearly one-half (47%) of women who were shown a video during a home visit by an outreach worker adhered to screening recommendations, as did 37% of those who were visited and given a leaflet as well as a fact sheet. In contrast, only 5% of women who were not contacted and 11% of women who were sent print materials in the mail completed Pap testing (49). In another British study, researchers evaluated the effect of a home visit by an outreach worker on mammography participation by South Asian (Bangladeshi and Pakistani) women aged 50–64 years in Oldham. No difference existed in the proportion of intervention and control group women who subsequently responded to an invitation to Britain's population-based mammography screening program (40). 4.3 Group Randomized Trials A Seattle research team conducted a group-randomized controlled trial to evaluate a neighborhood-based outreach worker intervention to increase Pap testing among Cambodian refugees. Interventions were delivered by bicultural, bilingual Cambodian outreach workers; they included home visits and small group neighborhood meetings, use of a motivational video, and tailored logistic assistance accessing screening services. At baseline, 44% of the women in intervention neighborhoods and 51% of women in control neighborhoods reported Pap smear receipt in the past year. At follow-up, the proportions reporting a Pap test in the last 12 months were 61% and 62% among intervention and control women, respectively. Increases in intervention group (17%) and control group (11%) cervical cancer screening rates were not significantly different (48). Maxwell et al. recently reported their results from a randomized controlled trial to increase breast and cervical cancer screening among Filipina American women in Los Angeles (24). Women aged 40 and older were recruited from community-based organizations as well as churches, and invited to attend a group session with a health educator. Groups were randomly assigned to receive a cancer screening module (intervention) or a physical activity module (control). Moderate increases in breast and cervical
cancer screening rates were observed in both groups (9 to 12 percentage points). However, among recent immigrants (women who had spent less than 10 years in the United States), mammography screening increased significantly more in the intervention arm than in the control arm (a 27 versus a 6 percentage point increase, P < 0.05). 4.4 Quasi-Experimental Studies Wismer et al. (38) have reported interim evaluation data from a community-based project that aimed to increase breast and cervical cancer screening rates among Korean Americans. Lay health workers were trained to provide workshops and print materials were distributed. After an 18-month intervention period, no significant changes occurred in Pap testing rates in either the intervention community (Alameda County, California) or the control community (Santa Clara County, California). Observed mammography increases in both the intervention county and the control county were equivalent. The researchers concluded that competing programs in Santa Clara County, diffusion of the intervention from Alameda County to neighboring Santa Clara County, and secular trends for mammography may all have contributed to their negative findings (38). The "Vietnamese Community Health Promotion Project" in San Francisco has evaluated several breast and cervical cancer screening interventions for Vietnamese-American women (32–34, 50–52). One study evaluated an outreach intervention to promote receipt and screening interval maintenance of mammography and Pap smears. Indigenous lay health workers conducted a series of three small group educational sessions with Vietnamese women in the Tenderloin district of San Francisco while women in Sacramento, California, served as controls. Pre- and post-intervention surveys showed that the proportions of women reporting at least one mammogram increased from 54% to 69% in the experimental area (P = 0.006). In contrast, rates remained constant in the control community. Similar results were obtained for previous Pap testing; rates increased from 46% to 66% (P = 0.001) in San Francisco, but did not increase in Sacramento (34).
Another "Vietnamese Community Health Promotion Project" study evaluated a media-led campaign that included use of television, newspaper, and billboard advertising as well as the distribution of audio-visual and print educational materials. Post-intervention, no difference existed in recent mammography or Pap testing use among women in two northern California experimental counties and those in two southern California control counties. However, women in the intervention area were more likely to be planning to have mammograms and Pap tests in the future than women in the control area (33). Finally, the "Vietnamese Community Health Promotion Project" used a two-community study design to evaluate a multi-faceted community intervention to increase mammography participation in a northern California community. No intervention effect was demonstrated in this study (52). 5 CONCLUSION The Indigenous Model has been proposed as an effective framework for the delivery of health education programs targeting immigrant groups. This model specifies that successful programs should be delivered by individuals who are acceptable and accessible to the target population, in convenient locations, and in multiple ways (53–55). The three interventions that were successful in promoting cancer screening among women of Asian descent have all used these principles (34, 39, 49). For example, Taylor et al. recruited Chinese women from the targeted communities to serve as outreach workers, delivered the intervention in women's homes, and provided education through discussion sessions with outreach workers as well as audio-visual and print materials (39). Rychetnik et al. (56) believe that criticisms of the randomized controlled trial in public health research are based on a consideration of classic trials in which the intervention is standardized and the individual is the unit of randomization. They proposed the routine use of cluster trials, with groups as the unit of randomization and analysis, to evaluate complex public health interventions (56). Two studies of breast and cervical
cancer interventions in Asian communities have used this approach with negative findings. In both cases, the investigators reported increases in screening levels among both the intervention and control groups, and proposed that control group contamination may have occurred (24, 48). As cancer control intervention programs often involve multi-faceted, community-driven strategies, randomized controlled trials cannot always accommodate such programs (56). Consequently, investigators have used quasi-experimental designs with intervention and control communities (33, 34, 38, 52). Bird et al. (34) successfully used this study design to evaluate their breast and cervical cancer control intervention targeting Vietnamese women. A marked increase in breast and cervical cancer screening rates among women in the experimental community with no change among women in the control community (together with the absence of any other promotional activities in the experimental community) provides compelling evidence for an intervention effect (34). However, other researchers have had difficulties evaluating breast and cervical cancer intervention programs, using a two-community design, because of unanticipated promotional activities in their control communities (38, 52, 57). This article addressed selected issues in the evaluation of public health interventions for Asian-American women. However, the authors did not address other important issues that are highly relevant to program development and delivery. For example, the importance of using qualitative methods during the development of interventions for immigrant groups and applying a culturally appropriate conceptual framework are not discussed (13, 41, 58, 59, 60, 61). In 1994, the Centers for Disease Control and Prevention published its report Chronic Disease in Minority Populations. The authors of the section about Asian Americans made the following recommendations: Special data collection efforts should focus on each Asian subgroup, attention should be given to low use of preventive measures, and intervention approaches should be culturally tailored to Asian communities (62). Almost a decade later, only a few studies have evaluated
breast and cervical cancer control interventions in Asian communities. However, these studies highlight the methodologic challenges in conducting evaluative public health research, particularly in racial/ethnic minority communities.

6 ACKNOWLEDGEMENTS
This work was supported by cooperative agreement #86322 and grant #82326 from the National Cancer Institute. REFERENCES 1. J. S. Lin-Fu, Asian and Pacific Islanders: an overview of demographic characteristics and health care issues. Asian Amer. Pacific Islander J. Health 1993; 1: 21–36. 2. U.S. Census Bureau, The Asian Population: 2000. Washington, DC: U.S. Department of Commerce, 2000. 3. U.S. Census Bureau, A Profile of the Nation’s Foreign-born Population from Asia (2000 update). Washington, DC: U.S. Department of Commerce, 2002. 4. U.S. Department of Commerce, We the Asian Americans. Washington, DC: U.S. Department of Commerce, 1993. 5. Department of Health and Human Services, Racial and Ethnic Disparities in Health. Washington, DC: Department of Health and Human Services, 1998. 6. D. Satcher, Eliminating racial and ethnic disparities in health: the role of the ten leading health indicators. J. Natl. Med. Assoc. 2000; 92: 315–318. 7. American Cancer Society, Cancer Prevention and Early Detection: Facts and Figures. Atlanta, GA: American Cancer Society, 2003. 8. D. Saslow et al., American Cancer Society guideline for the early detection of cervical neoplasia and cancer. CA Cancer J. Clinicians 2002; 52: 342–362. 9. Department of Health and Human Services, Healthy People 2010. Washington, DC: U.S. Government Printing Office, 2000. 10. R. A. Hiatt et al., Pathways to early cancer detection in the multiethnic population of the San Francisco Bay Area. Health Educ. Quart. 1996; 23: 10–27. 11. M. Kagawa-Singer and N. Pourat, Asian American and Pacific Islander breast and cervical carcinoma screening rates and Healthy
People 2000 objectives. Cancer 2000; 89: 696–705. 12. R. A. Hiatt et al., Community-based cancer screening for underserved women: design and baseline findings from the Breast and Cervical Cancer Intervention Study. Prevent. Med. 2001; 33: 190–203. 13. R. A. Hiatt and R. J. Pasick, Unsolved problems in early breast cancer detection: focus on the underserved. Breast Cancer Res. Treat. 1996; 40: 37–51. 14. J. Legler et al., The effectiveness of interventions to promote mammography among women with historically lower rates of screening. Cancer Epidemiol. Biomark. Prevent. 2002; 11: 59–71. 15. B. A. Israel et al., Review of community-based research: assessing partnership approaches to public health. Annu. Rev. Public Health 1998; 19: 173–202. 16. P. M. Lantz et al., Can communities and academia work together on public health research? Evaluation results from a community-based participatory research partnership in Detroit. J. Urban Health 2001; 78: 495–507. 17. University of Washington School of Public Health and Community Medicine. (2003). Community-based research principles. (online). Available: http://sphcm.washington.edu/research/community.htm. 18. T. K. Lam et al., Encouraging Vietnamese-American women to obtain Pap tests through lay health worker outreach and media education. J. Gen. Intern. Med. 2003; 18: 516–524. 19. S. H. Yu and W. T. Lui, Methodologic issues. In: N. W. S. Zane et al., (eds.), Methodologic Issues. Thousand Oaks, CA: Sage Publications, 1994. 20. M. Lee, F. Lee, and F. Stewart, Pathways to early breast and cervical cancer detection for Chinese American women. Health Educ. Quart. 1996; 23: 76–88. 21. A. E. Maxwell, R. Bastani, and U. S. Warda, Breast cancer screening and related attitudes among Filipino-American women. Cancer Epidemiol. Biomark. Prevent. 1997; 6: 719–726. 22. A. E. Maxwell, R. Bastani, and U. S. Warda, Mammography utilization and related attitudes among Korean-American women. Women Health 1998; 27: 89–107. 23. A. E. Maxwell, R. Bastani, and U. S. Warda, Demographic predictors of cancer screening among Filipino and Korean immigrants in the United States. Amer. J. Prevent. Med. 2000; 18: 62–68.
24. A. E. Maxwell, R. Bastani, P. Vida, and U. S. Warda, Results of a randomized trial to increase breast and cervical cancer screening among Filipino American women. Prevent. Med. 2003; 37: 102–109. 25. B. C. K. Choi et al., Use of surnames to identify individuals of Chinese ancestry. Amer. J. Epidemiol. 1993; 138: 723–734. 26. B. K. Hage et al., Telephone directory listings of presumptive Chinese surnames: an appropriate sampling frame for a dispersed population with characteristic surnames. Epidemiology 1990; 1: 405–408. 27. D. S. Lauderdale and B. Kestenbaum, Asian American ethnic identification by surname. Population Res. Policy Rev. 2000; 19: 283–300. 28. A. Nicoll, K. Bassett, and S. J. Ulijaszek, What’s in a name? Accuracy of using surnames and forenames in ascribing Asian ethnic identity in English populations. J. Epidemiol. Community Health 1986; 40: 364–368. 29. E. Y. Tjam, How to find Chinese research participants: use of a phonologically based surname search method. Can. J. Public Health 2001; 92: 138–142. 30. T. Nguyen et al., Predictors of cervical Pap smear screening awareness, intention, and receipt among Vietnamese-American women. Amer. J. Prevent. Med. 2002; 23: 207–214. 31. V. M. Taylor et al., Cervical cancer screening among Cambodian-American women. Cancer Epidemiol. Biomark. Prevent. 1999; 8: 541–546. 32. S. J. McPhee et al., Pathways to early cancer detection for Vietnamese women: Suc Khoe La Vang! (Health is gold!). Health Educ. Quart. 1996; 23: 60–75. 33. C. N. Jenkins et al., Effect of a media-led education campaign on breast and cervical cancer screening among Vietnamese-American women. Prevent. Med. 1999; 28: 395–406. 34. J. A. Bird et al., Opening pathways to cancer screening for Vietnamese-American women: lay health workers hold a key. Prevent. Med. 1998; 27: 821–829. 35. T. G. Hislop et al., Facilitators and barriers to cervical cancer screening among Chinese Canadian women. Can. J. Public Health 2003; 94: 68–73. 36. V. M. Taylor et al., Pap testing adherence among Vietnamese American women. Cancer Epidemiol. Biomark. Prevent. 2004; 13: 613–619.
37. V. M. Taylor et al., Cervical cancer screening among Chinese Americans. Cancer Detect. Prevent. 2002; 26: 139–145. 38. B. A. Wismer et al., Interim assessment of a community intervention to improve breast and cervical cancer screening among Korean American women. J. Public Health Manag. Pract. 2001; 7: 61–70. 39. V. M. Taylor et al., A randomized controlled trial of interventions to promote cervical cancer screening among Chinese women in North America. J. Natl. Cancer Inst. 2002; 94: 670–677. 40. T. Hoare et al., Can the uptake of breast screening by Asian women be increased? A randomized controlled trial of a linkworker intervention. J. Public Health Med. 1994; 16: 179–185. 41. M. Kagawa-Singer, Improving the validity and generalizability of studies with underserved US populations expanding the research paradigm. Ann. Epidemiol. 2000; 10: S92–S103. 42. J. Eyton and G. Neuwirth, Cross-cultural validity: ethnocentrism in health studies with special reference to the Vietnamese. Social Sci. Med. 1984; 18: 447–453. 43. R. J. Pasick et al., Problems and progress in translation of health survey questions: the Pathways experience. Health Educ. Quart. 1996; 23: S28–S40. 44. S. P. Tu et al., Translation challenges of cross-cultural research and program development. Asian Amer. Pacif. Islander J. Health 2003; 10: 58–66. 45. M. R. McBride et al., Factors associated with cervical cancer screening among Filipino women in California. Asian Amer. Pacif. Islander J. Health 1998; 6: 358–367. 46. S. J. McPhee et al., Validation of recall of breast and cervical cancer screening by women in an ethnically diverse population. Prevent. Med. 2002; 35: 463–473. 47. R. J. Pasick et al., Quality of data in multiethnic health surveys. Public Health Reports 2001; 116: 223–243. 48. V. M. Taylor et al., Evaluation of an outreach intervention to promote cervical cancer screening among Cambodian American women. Cancer Detect. Prevent. 2002; 26: 320–327. 49. B. R. McAvoy and R. Raza, Can health education increase uptake of cervical smear testing among Asian women? Brit. J. Med. 1991; 302: 833–836.
50. S. J. McPhee, Promoting breast and cervical cancer screening among Vietnamese American women: two interventions. Asian Amer. Pacif. Islander J. Health 1998; 6: 344–350. 51. S. J. McPhee and T. T. Nguyen, Cancer, cancer risk factors, and community-based cancer control trials in Vietnamese Americans. Asian Amer. Pacif. Islander J. Health 2000; 8: 18–31. 52. T. Nguyen et al., Promoting early detection of breast cancer among Vietnamese-American women. Results of a controlled trial. Cancer 2001; 91: 267–273. 53. M. S. Chen, Jr. et al., Implementation of the indigenous model for health education programming among Asian minorities: beyond theory and into practice. J. Health Educ. 1992; 23: 400–403. 54. M. S. Chen, Jr., Cardiovascular health among Asian Americans/Pacific Islanders: an examination of health status and intervention approaches. Amer. J. Health Promot. 1993; 7: 199–207. 55. M. S. Chen, Jr. et al., An evaluation of heart health education for Southeast Asians. Amer. J. Health Promot. 1994; 10: 205–208. 56. L. Rychetnik et al., Criteria for evaluating evidence on public health interventions. J. Epidemiol. Community Health 2002; 56: 119–127. 57. L. Suarez et al., Why a peer intervention program for Mexican American women failed to modify the secular trend. Amer. J. Prevent. Med. 1997; 13: 411–417. 58. S. J. Curry and K. M. Emmons, Theoretical models for predicting and improving compliance with breast cancer screening. Ann. Behav. Med. 1994; 16: 302–316. 59. F. A. Hubbell et al., From ethnography to intervention: developing a breast cancer control program for Latinas. J. Natl. Cancer Inst. Monographs 1995; 109–115. 60. J. C. Jackson et al., Development of a cervical cancer control intervention program for Cambodian American women. J. Community Health 2000; 25: 359–375. 61. J. C. Jackson et al., Development of a cervical cancer control intervention for Chinese immigrants. J. Immigrant Health 2002; 4: 147–157. 62. B. I. Truman, J. S. Wing, and N. L. Keenan, Asians and Pacific Islanders. In: D. Satcher et al., (eds.), Chronic Disease in Minority Populations. Atlanta, GA: Centers for Disease Control, 1994.
Table 3. Community-based Breast and Cervical Cancer Intervention Studies

First author, year of publication | Breast and/or cervical cancer screening | Group | Assignment unit | Intervention strategies | Evaluation method(s) | Main finding(s)
Taylor, 2002 | Cervical | Cambodian | Small area neighborhood | Home visit by outreach worker; group education session; use of video; logistic assistance | Survey 12 months after randomization; review of medical records | No effect
Taylor, 2002 | Cervical | Chinese | Individual | Group 1: home visit by outreach worker, use of video and print materials, logistic assistance. Group 2: direct mailing of video and print materials | Survey six months after randomization; review of medical records | Both interventions effective; outreach worker intervention more effective than direct mailing intervention
Maxwell, 2003 | Both | Filipina | Small group | Group education session; use of print materials | Survey 12 months after randomization | No overall effect for mammography or Pap testing; effective for mammography in recent immigrant sub-group
Wismer, 2002 | Both | Korean | Community | Workshops delivered by lay health workers; distribution of print materials | Pre- and post-intervention cross-sectional surveys | No effect
McAvoy, 1991 | Cervical | South Asian | Individual | Group 1: home visit by outreach worker, use of video. Group 2: home visit by outreach worker, use of print materials. Group 3: direct mailing of print materials | Review of screening program computerized records four months after randomization | Home visits were effective
Hoare, 1994 | Breast | South Asian | Individual | Home visit by outreach worker | Review of screening program computerized records | No effect
Bird, 1998 | Both | Vietnamese | Community | Series of three group education sessions delivered by lay health workers; distribution of print materials | Pre- and post-intervention cross-sectional surveys | Effective for mammography and Pap testing
Jenkins, 1999 | Both | Vietnamese | Community | Media campaign; distribution of audio-visual and print materials | Pre- and post-intervention cross-sectional surveys | No effect on mammography or Pap testing behavior; increased mammography and Pap testing intentions
Nguyen, 2001 | Breast | Vietnamese | Community | Media campaign; group education sessions; distribution of audio-visual and print materials | Pre- and post-intervention cross-sectional surveys | No effect
COMPLIANCE AND SURVIVAL ANALYSIS
ELS GOETGHEBEUR, Ghent University, Ghent, Belgium

1 COMPLIANCE: CAUSE AND EFFECT

Today, new treatments must prove their worth in comparative (double blind) randomized clinical trials, the gold standard design for causal inference. With noninformatively right-censored survival outcomes, a typical robust intention-to-treat analysis compares groups as randomized using the popular (weighted) logrank test. Accompanying Kaplan–Meier curves describe nonparametrically how survival chances differ between arms. A one-parameter summary of the contrast follows from a semiparametric Cox proportional hazards (PH) model or an accelerated failure-time model (6). In general, and especially with long-term treatments, patients tend to deviate from their prescribed treatment regime. Varying patterns of observed exposure relative to the assigned regime are called "compliance (levels)" and are recognized as a likely source of variation in treatment effect. Because deviations from prescribed regimes occur naturally in clinical practice, it is wise to learn about them within the trial context rather than restrict the study population to perfect compliers, an atypical and sometimes small and unobtainable subset of the future patient horizon (12). Treatments that are stopped or switched, or less dramatic lapses in dosing, happen in response to a given assignment. Different exposure patterns between treatment arms therefore point to (perceived) differences following alternative assignments. Studying compliance patterns as an outcome can yield valuable insights (15). Of course, actual treatment regimes may also influence primary outcomes. From the intent-to-treat perspective, underdosing causes reduced power and a requirement for larger samples. Fortunately, the strong null hypothesis, where treatment and its assignment have no impact on outcome, is consistently tested irrespective of compliance levels. Under the alternative, however, we expect different (smaller) intent-to-treat effects than the prescribed regime would create when it materializes. This happens as the treatment group becomes a mix of varying (lower) degrees of exposure (2,8). Estimation of the causal effect of actual dose timing becomes challenging when observed exposure patterns are no longer randomized. (Un)measured patient characteristics and earlier experience may determine exposure levels that become confounded with the natural treatment-free hazard of the patient. The association between compliance, which induces treatment exposure levels, and treatment-free hazards is often called a selection effect, in line with missing data terminology. An "as-treated" analysis, such as a PH analysis with the currently received treatment as a time-dependent covariate, compares hazards between differently treated groups at a given time and thus estimates a mix of selection and causal effects (11). An "on-treatment" analysis censors patients as soon as they go off the assigned treatment and thus generates informative censoring for the same reason. Structural accelerated failure time (SAFT) models and structural PH models have been designed to avoid these biases. We explain their key features and potential through a simple example first.
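As a concrete, minimal sketch of the intention-to-treat analysis just described (not part of the original article), the following code fits Kaplan–Meier curves by randomized arm and applies the logrank test; it assumes the Python lifelines package and uses simulated right-censored data.

```python
# Intention-to-treat sketch: compare arms as randomized, ignoring observed compliance.
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(1)
n = 200
arm = rng.integers(0, 2, size=n)                        # R_i: randomized assignment
latent = rng.exponential(scale=np.where(arm == 1, 12.0, 10.0))
censor = rng.uniform(0, 20, size=n)                     # noninformative right censoring
time = np.minimum(latent, censor)
event = (latent <= censor).astype(int)

for a, label in [(0, "control"), (1, "experimental")]:
    kmf = KaplanMeierFitter()
    kmf.fit(time[arm == a], event_observed=event[arm == a], label=label)
    # kmf.survival_function_ now holds the nonparametric survival estimate for this arm

result = logrank_test(time[arm == 1], time[arm == 0],
                      event_observed_A=event[arm == 1],
                      event_observed_B=event[arm == 0])
print(result.test_statistic, result.p_value)
```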
2 ALL-OR-NOTHING COMPLIANCE
In randomized studies that evaluate a oneshot treatment, such as surgery (7), vaccination (4), or an invitation to undergo screening (1), all-or-nothing compliance with experimental assignment arises naturally. Let Ri = 1(0) indicate whether individual i, with baseline covariates Xi , gets randomized to the experimental (control) arm. When experimental treatment is not available outside the treatment arm the control arm remains
uncontaminated by experimental exposure, and its outcomes can serve as reference outcomes for causal inference. Let T0i denote such a potential survival time for individual i following a control trajectory free of experimental exposure. Let T1i and Ci respectively be survival time and compliance following a possible treatment assignment. Ri operates independently of the vector (T0i, T1i, Ci, Xi) but determines which components are observed. Observed survival time and exposure are simply denoted Ti, Ei for all. With an uncontaminated control arm, Ei = Ci Ri. One goal of causal inference is to estimate how the contrast between potential survival times T1i and T0i varies over the subpopulations determined by different Ci levels and their induced level of experimental exposure on the treatment arm, Ei. The sharp null hypothesis assumes it makes no difference what arm one is assigned to and hence

T0i =d|Xi T1i,    (1)

where =d|Xi indicates equality in distribution conditional on Xi. The most obvious violation of (1) occurs when (some) patients on the experimental arm become exposed and exposure alters survival chances. This is called a direct causal effect of exposure (10). When an assignment influences survival through mechanisms of action operating independently from exposure levels, we have an indirect effect. Below, we consider a cancer clinical trial (7), where the experimental intervention consists of implanting an arterial device during surgical resection of metastases. A planned implant could lead to an operation scheduled earlier in the day, and timing may create its own set of prognostic circumstances. In addition, the news that a planned implant did not happen could be depressing to the patient and diminish survival chances beyond what would have happened on the control arm. Both mechanisms can lead to an indirect (clinical) effect of exposure assignment. Double blind studies are carefully designed to avoid indirect effects, so they satisfy T0i =d|{Ci=0, Xi} T1i and hence P(T1i > t | Ci = 0, Ri = 1, Xi) = P(T0i > t | Ci = 0, Ri = 0, Xi), for all t. The contrast between P(T1i > t | Ci = e, Ri = 1, Xi) and P(T0i > t | Ci = e, Ri = 0, Xi) then represents the causal effect of exposure level e. In general, however, this combines direct and indirect effects of assignment in the population with compliance level e. In what follows, we ignore Xi for simplicity, but stronger inference can be drawn when assumptions condition on Xi. To estimate P(T0i > t | Ci = 1, Ri = 1), one can solve

P(T0i > t | Ci = 1, Ri = 1) P(Ci = 1 | Ri = 1) + P(T0i > t | Ci = 0, Ri = 1) P(Ci = 0 | Ri = 1) = P(T0i > t | Ri = 1) = P(T0i > t | Ri = 0)

after substituting empirical means or (Kaplan–Meier) estimates for the other unknown terms; the last equality holds by randomization. Isotonic regression can turn the pointwise estimates into a monotone survival curve. To evaluate the treatment effect among the exposed, one compares the estimate of P(T1i > t | Ci = 1, Ri = 1) with the estimate of P(T0i > t | Ci = 1, Ri = 1). The selective nature of exposure is seen by contrasting treatment-free survival probabilities for the exposed and nonexposed subpopulations: the estimates of P(T0i > t | Ci = 1, Ri = 1) and of P(T0i > t | Ci = 0, Ri = 1).
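A minimal numerical sketch of the mixture calculation above (not from the original article): with an uncontaminated control arm and no indirect effects, the treatment-free survival curve of treatment-arm compliers can be backed out from the control-arm curve and the observed curve of unexposed (noncomplying) treatment-arm patients. The sketch assumes the lifelines package, evaluates the pointwise estimates on a user-supplied time grid, and omits the isotonic regression step.

```python
# Solve, pointwise in t, the mixture identity
#   P(T0 > t | C=1, R=1) * p1 + P(T0 > t | C=0, R=1) * (1 - p1) = P(T0 > t | R=0),
# where p1 = P(C=1 | R=1) and treatment-arm noncompliers are unexposed, so their
# observed survival is treatment free.
import numpy as np
from lifelines import KaplanMeierFitter

def complier_treatment_free_survival(time, event, arm, complier, grid):
    p1 = complier[arm == 1].mean()
    km = KaplanMeierFitter()
    s_control = km.fit(time[arm == 0], event[arm == 0]).survival_function_at_times(grid).to_numpy()
    unexposed = (arm == 1) & (complier == 0)
    s_noncomp = km.fit(time[unexposed], event[unexposed]).survival_function_at_times(grid).to_numpy()
    s0_compliers = (s_control - (1 - p1) * s_noncomp) / p1
    return np.clip(s0_compliers, 0.0, 1.0)   # pointwise only; isotonic regression would enforce monotonicity
```

Comparing this curve with the ordinary Kaplan–Meier curve of the exposed treatment-arm patients gives the causal contrast among the exposed; comparing it with the noncomplier curve displays the selection effect.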
3 MORE GENERAL EXPOSURE PATTERNS

A structural model parameterizes the shift in distribution from observed survival time Ti to a reference time Tei following a specific exposure regime e, in terms of observed (possibly time-dependent) exposures Ei and covariates Xi. One can thus perform parameter-specific, (Ei, Xi)-dependent back transformations of observed survival times (or distributions). The parameter value that solves estimating equations demanding equality of estimated Tei distributions (conditional on baseline covariates) between arms is our point estimate. The procedure is illustrated in Figure 1 for the SAFT model Ti exp{−β0 Ei} =d|Ri T0i in our trial, where Ei indicates an actual implant of the arterial device. For time-dependent implants Ei(t), we could have used the SAFT model ∫0^Ti exp{−β0 Ei(u)} du =d|Ri T0i. For technical points concerning the specific treatment of censored data, we refer the reader to (5,12).

Figure 1. Estimation of structural parameters

The left-hand panel of Figure 1 shows ITT Kaplan–Meier curves in the standard and intervention arm. In the right-hand panel, the survival curve for the standard arm is compared with KM-curves following the transformations Ti exp{−βEi} with β = −1.5 and β = −0.36 on the intervention arm. Reducing treated failure times by the factor exp(−1.5) overcompensates for the observed harmful treatment effect, as survival chances on the intervention arm are now higher than on the standard arm. This is confirmed by the logrank chi-squared value of 9.326, plotted in the middle panel. The survival curve corresponding to the point estimate −0.36 for β0 (exp(−0.36) = 70%) is convincingly close to the observed survival in the standard arm. The middle panel reveals chi-squared values for a range of hypothesized structural parameter values. Those that do not lead to significantly different curves at the 5% level form the 95% confidence interval [−1.07, 0.39] for β0.
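The estimation strategy behind Figure 1 can be sketched as a grid search (a sketch under simplifying assumptions, not the authors' implementation): for each hypothesized value of the structural parameter, back-transform the intervention-arm times and test equality with the standard arm by the logrank test. The code assumes the lifelines package and all-or-nothing exposure, and it ignores the recensoring adjustments needed when data are censored (see (5,12)).

```python
# SAFT grid search for the model  T_i * exp(-beta0 * E_i) =_d T_0i  (given randomization):
# the point estimate is the beta whose back-transformed treated times are least
# distinguishable from the control arm; betas not rejected at the 5% level form the CI.
import numpy as np
from lifelines.statistics import logrank_test

def saft_grid_estimate(time, event, arm, exposure, grid=np.linspace(-2.0, 1.0, 301)):
    chisq = []
    for beta in grid:
        t_back = np.where(arm == 1, time * np.exp(-beta * exposure), time)
        res = logrank_test(t_back[arm == 1], t_back[arm == 0],
                           event_observed_A=event[arm == 1],
                           event_observed_B=event[arm == 0])
        chisq.append(res.test_statistic)
    chisq = np.asarray(chisq)
    beta_hat = grid[np.argmin(chisq)]
    accepted = grid[chisq < 3.84]          # chi-squared 1-df critical value at the 5% level
    return beta_hat, (accepted.min(), accepted.max())
```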
4 OTHER STRUCTURAL MODELING OPTIONS

One can propose many different maps of the observed into the treatment-specific survival distributions. This may happen on a PH scale (7) or involve time-dependent measures of effect in the SAFT setting (12). Estimation methods, which rely on the instrument of randomization, protect the α-level just like the intent-to-treat test but shift the point estimate away from a diluted average. To achieve this, they rely on the postulated structural model, which can sometimes be rejected by the data but generally not confirmed, owing to a lack of observed degrees of freedom. Special care is thus required when interpreting these models and their results. Some diagnostic procedures have been proposed, as well as forms of sensitivity analyses (13,14). To explicitly account for measured time-dependent confounders, structural nested failure-time models can be used as an alternative, or marginal structural models for Tei as in (3). The estimation process then relies on the assumption of "no residual confounding", ignores the instrument Ri, and loses its robust protection of the α-level. Structural modeling of failure time distributions has opened a world of practical and theoretical developments for the analysis of compliance and survival time. The field of
research is very much alive today. Recent work (9), for instance, proposes to estimate optimal treatment regimes from compliance data. Our brief account can give but a flavor of this wealth.
5
ACKNOWLEDGMENTS
We thank Tom Loeys for drawing the figure.
REFERENCES 1. Baker, S. G. (1999). Analysis of Survival data from a randomized trial with all-or-nothing compliance: estimating the cost-effectiveness of a cancer screening program, Journal of the American Statistical Association 94, 929–934. 2. Frangakis, C. E. & Rubin, D. B. (1999). Addressing complications of intention-totreat analysis in the combined presence of all-or-none treatment-noncompliance and subsequent missing outcomes, Biometrika 80, 365–379. 3. Hernan, M. A., Brumback, B. & Robins, J. M. (2001). Marginal structural models to estimate the joint causal effect of nonrandomized treatments, Journal of the American Statistical Association 96, 440–448. 4. Hirano, K., Imbens, G., Rubin, D. & Zhou, X. H. (2000). Assessing the effect of an influenza vaccine in an encouragement design, Biostatistics 1, 69–88. 5. Joffe, M. M. (2001). Administrative and artificial censoring in censored regression models, Statistics in Medicine 20, 2287–2304. 6. Kalbfleisch, J. D. & Prentice, R. L. (2002). The Statistical Analysis of Failure Time Data, 2nd Ed. Wiley, New Jersey, Hoboken. 7. Loeys, T. & Goetghebeur, E. (2003). A causal proportional hazards estimator for the effect of treatment actually received in a randomized trial with all-or-nothing compliance, Biometrics 59, 100–105. 8. Mark, S. D. & Robins, J. M. (1993). A method for the analysis of randomized trials with compliance information: an application to the multiple risk factor intervention trial, Controlled Clinical Trials 14, 79–97. 9. Murphy, S. (2003). Optimal dynamic treatment regimes, Journal of the Royal Statistical Society, Series B 65, 331–355.
10. Pearl, J. (2001). Causal inference in the health sciences: a conceptual introduction, Health Services and Outcomes Research Methodology 2, 189–220. 11. Robins, J. M. & Greenland, S. (1994). Adjusting for differential rates of PCP prophylaxis in high-versus low-dose azt treatment arms in an aids randomized trial, Journal of the American Statistical Association 89, 737–749. 12. Robins, J. M. & Tsiatis, A. A. (1991). Correcting for non-compliance in randomized trials using rank preserving structural failure time models, Communications in Statistics, A 20, 2609–2631. 13. Scharfstein, D., Robins, J. M., Eddings, W. & Rotnitzky, A. (2001). Inference in randomized studies with informative censoring and discrete time-to-event endpoints, Biometrics 57, 404–413. 14. White, I. & Goetghebeur, E. (1998). Clinical trials comparing two treatment arm policies: which aspects of the treatment policies make a difference? Statistics in Medicine 17, 319–340. 15. White, I. & Pocock, S. (1996). Statistical reporting of clinical trials with individual changes from allocated treatment, Statistics in Medicine 15, 249–262.
CROSS-REFERENCES

Noncompliance, Adjustment for
Survival Analysis, Overview
COMPOSITE ENDPOINTS IN CLINICAL TRIALS
PETER C. O'BRIEN, Division of Biostatistics, Mayo Clinic College of Medicine, Rochester, Minnesota

BARBARA C. TILLEY, Department of Biostatistics, Bioinformatics, and Epidemiology, Medical University of South Carolina, Charleston, South Carolina

PETER J. DYCK, Peripheral Nerve Lab, Mayo Clinic College of Medicine, Rochester, Minnesota

Composite endpoints have become increasingly useful as a measure of a patient's disease status. They have been found useful in both clinical practice and medical research. Here, we shall examine the usefulness of composite endpoints as a primary endpoint in clinical trials. The basic rationale for their use is discussed and illustrated with examples, followed by a variety of procedures for forming composites, including how they should be interpreted. In the section on global assessment variables, alternate approaches to conducting a primary analysis with multiple endpoints are considered, and these methods are compared with the use of composite endpoints.

1 THE RATIONALE FOR COMPOSITE ENDPOINTS

We consider the outcome measures used in the typical phase III clinical trial for efficacy, for simplicity assuming that an experimental therapy is being compared with a placebo. Most commonly, a protocol for such a study specifies a single primary endpoint (see the discussion of primary endpoints), various secondary endpoints (see the discussion of secondary endpoints), and often additional tertiary endpoints. The goal of categorizing study endpoints in this manner is to control the type I error rate (see the discussion of type I error). Thus, the study is judged to have achieved statistically significant evidence of efficacy and to be a positive trial only if the test for the primary endpoint achieves the prespecified level of significance. The approach of relying on a single primary endpoint enables the investigators to control the overall probability of falsely declaring the experimental therapy to be efficacious, and it assists in the interpretation of analyses of the secondary endpoints, which will be helpful in delineating the nature of any treatment effect when the primary endpoint is significant. If the primary endpoint is not statistically significant, analyses of secondary endpoints are not used to definitively establish efficacy, but they may provide insights about treatment effects as well as information that may be helpful in designing future clinical trials. In performing secondary analyses, one must be concerned about the problem of multiple testing—that one or more secondary endpoints might reach statistical significance by chance. Because the overall error rate for the trial is controlled by specifying the primary endpoint, the usual approach to minimizing the problems with multiple testing is to limit the group of secondary endpoints to a relatively small number of the most important endpoints. Others are classified as tertiary endpoints. Because specification of the primary endpoint is critically important in the design of a clinical trial, it deserves serious consideration, a review of the relevant literature, and perhaps even preliminary studies. Five desirable properties of a primary endpoint have been proposed (1): it should be (1) clinically relevant, (2) accurate, (3) reproducible, (4) objective, and (5) quantitative. In practice, specification of a primary endpoint will also need to consider the nature of the disease and the manifestations under study as well as the availability of patients, investigators, and resources. The choice of the components of a composite endpoint also should be related to the primary question of interest (2). The use of a composite endpoint, defined a priori, that comprises a variety of patient characteristics is often better suited to accomplishing the objectives. In particular, by providing a more comprehensive assessment of the patient, it may be able to achieve a much higher level of clinical relevance. These considerations are illustrated with examples, including a detailed example involving clinical trials of drugs to treat diabetic neuropathy.
2 FORMULATION OF COMPOSITE ENDPOINTS

Next, some useful methods for defining a composite endpoint for use as a primary endpoint in a clinical trial are considered.

2.1 Existing Scoring Systems

Ideally, a suitable comprehensive composite endpoint has already been identified and extensively studied, such as the composite scores for diabetic sensory polyneuropathy. These types of composites seek to cumulate information about qualitatively different patient characteristics, all of which are directly relevant to efficacy. A commonly used type of composite endpoint consists of scores assigned to each of numerous individual questions, which are then summated to obtain the composite score. Many such scoring systems have been developed, validated, and widely used in clinical applications as well as in clinical trials. Their usefulness may have been studied by a group of experts in the field, and the appropriate use and interpretation may have been delineated by a consensus statement that will indicate what constitutes a meaningful effect size, among other things. The availability of such an existing composite endpoint that adequately measures patient status for the purposes of the specific clinical trial under consideration may be viewed as the ideal. If such a scoring system is available, a statistical approach for defining a composite endpoint is not needed.

2.2 Methods Based on Rankings

In many situations, no well accepted scoring systems for combining variables are available. It may also be considered more meaningful to weight each of the endpoints equally and without regard to their units of measure. A rank sum scoring approach (see the discussion of global assessment variables) may be useful in these situations. Specifically, one ranks all the subjects from best to worst separately for each variable. One then computes the sum of the ranks received for each subject to obtain rank sum scores measuring the relative performance of each subject. The relative rankings obtained are helpful in assessing the clinical relevance of the treatment effect and can be used to estimate the probability that the experimental treatment will produce a better result than placebo (again, see the discussion of global assessment variables).
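Because the rank-sum construction is purely mechanical, a short sketch may help make it concrete. The code below is illustrative only (not from the original article); it assumes simulated data, the numpy and scipy packages, and endpoints already oriented so that larger values are better.

```python
# Rank-sum composite: rank subjects separately on each endpoint, then sum the ranks.
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

def rank_sum_scores(endpoints):
    """endpoints: (n_subjects, n_endpoints) array; returns one composite score per subject."""
    ranks = np.column_stack([rankdata(endpoints[:, j]) for j in range(endpoints.shape[1])])
    return ranks.sum(axis=1)

rng = np.random.default_rng(0)
treated = rng.normal(0.3, 1.0, size=(40, 3))     # hypothetical arm-level data, 3 endpoints
placebo = rng.normal(0.0, 1.0, size=(40, 3))
scores = rank_sum_scores(np.vstack([treated, placebo]))
s_trt, s_pbo = scores[:40], scores[40:]

# Compare composite scores between arms; U / (n1 * n2) also estimates the probability
# that a treated subject has a better composite score than a placebo subject.
u_stat, p_value = mannwhitneyu(s_trt, s_pbo, alternative="two-sided")
print(p_value, u_stat / (len(s_trt) * len(s_pbo)))
```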
2.3 Methods for Endpoints That Are All Dichotomous or Censored

In many trials, all the individual endpoints are censored. Cardiology trials, for example, are often interested in the time to any of several cardiac-related events. An approach is to consider the time to the first event as the composite endpoint. Similarly, if all the variables are binary, a composite binary endpoint can be defined as the occurrence of any of the individual events.
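A small illustrative sketch of these constructions follows (not from the original article; the component names and values are hypothetical). It builds a time-to-first-event composite from censored components and an any-event composite from binary components.

```python
# Composite endpoints from censored or binary components.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "t_mi":     [3.2, 7.0, 5.1],   # time to myocardial infarction or end of follow-up (years)
    "e_mi":     [1,   0,   0],     # 1 = event observed, 0 = censored
    "t_stroke": [6.0, 7.0, 2.4],
    "e_stroke": [0,   0,   1],
    "t_death":  [6.0, 7.0, 5.5],
    "e_death":  [0,   0,   1],
})

times = df[["t_mi", "t_stroke", "t_death"]].to_numpy()
events = df[["e_mi", "e_stroke", "e_death"]].to_numpy()

# Time-to-first-event composite: earliest observed component event, otherwise censored
# at the end of follow-up (here taken as the shortest component follow-up time).
first_event_time = np.where(events == 1, times, np.inf).min(axis=1)
followup_time = times.min(axis=1)
composite_event = (events == 1).any(axis=1).astype(int)
composite_time = np.where(composite_event == 1, first_event_time, followup_time)

# If all components were binary outcomes instead, the composite would simply be
# the occurrence of any of them:
any_event = (events == 1).any(axis=1).astype(int)
print(composite_time, composite_event, any_event)
```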
3 EXAMPLES

3.1 Clinical Trials in Diabetes

We can illustrate the use of composite versus individual measurement with the condition diabetic sensory polyneuropathy (DSPN) (3), which we have studied in detail in a cross-sectional and longitudinal study (the Rochester Diabetic Neuropathy Study) and in several controlled clinical trials. Intuitively, we observe that DSPN is the sum total of a patient's neuropathic symptoms; neurologic signs of weakness, sensory loss (different modalities), and autonomic nerve dysfunctions; as well as overall impairments and disability. To track all of these diverse manifestations and appropriately assign a relative worth to each component is not presently feasible. Use of every measure might also be too intrusive, time consuming, and expensive. The problem is even more complex because a variety of composite scores are available (e.g., attributes of nerve conduction, neurologic signs, sensation tests, and so on). Without going into great detail, we found that no one attribute of nerve conduction or single clinical sign adequately represented the condition DSPN or was a suitable marker for its presence or severity. For controlled trials focusing on the development of DSPN, it is possible to use a single criterion such as that chosen for conduct of the Diabetes Control and Complications Trial study (4), which compared rigorous versus conventional glycemic control. The criterion for DSPN was: ≥2 abnormalities from among these three: (1) decreased or absent ankle reflexes, (2) decreased or absent vibration sensation of great toes, or (3) symptoms of polyneuropathy not attributable to other neuropathies. Although this score performed well for this purpose, it provided information only about frequency and not about severity of DSPN. For many purposes, knowledge about severity of DSPN would provide additional important information. We have developed several composite and continuous measures of severity of DSPN. One such composite score is Σ5 NC nds. In this score, five attributes of nerve conduction (NC) of leg—chosen because they are distal nerves, representing different functional groups—are each expressed as a normal deviate (nd) from percentiles corrected for age, sex, height, and weight based on a study of healthy subjects, with all abnormalities expressed in the upper tail of the normal distribution; these normal deviates are summed, divided by the number of measurable attributes, and multiplied by 5. We have found that this summated score can be used to provide an excellent measure not only of the presence but also the severity of polyneuropathy because it tracks severity of DSPN without a floor or ceiling effect, is sensitive at defined levels of specificity, is very reproducible, and correlates reasonably with clinical neurologic deficit. It has the advantage of being objective (patients cannot will the result). This measure has also been shown to have good monotonicity and is highly suitable for controlled trials. It is also our impression that Σ5 NC nds might have a low degree of inter–medical center variability when used by certified electromyographers whose training and choice of equipment would likely
result in a high degree of reproducibility of results. However, for judging severity of DSPN, other primary outcome measures might be used. For example, if the emphasis is to be on the effect on neurologic signs (i.e., leg weakness, reflex loss, or sensation loss), a summed score of neurologic signs (e.g., neuropathy impairment score of lower limb), or summated quantitative sensation tests ( QST nds) might be used. Scales that judge the severity of a patient’s symptoms are needed when the drug to be tested has its putative action on the relief of symptoms (e.g., pain). Thus, in conditions such as headache, trigeminal or other neuralgias, fatigue states, anxiety, and the like, if composite clinical endpoints are chosen, these outcomes should include a measure of the frequency, duration, and severity of symptoms. In contrast to use of impairment measures in which an external physician or scientific observer makes observations, the patient’s judgment of quality of life, impairment for acts of daily living, or disability may also need to be considered. Scales of acts of daily living and quality of life are used to represent the patient’s perception of the impact of his or her illness on performance of life tasks and quality of life and whether an intervention improves the dysfunction or disability (5). Scoring of a single or a battery of motor functions or acts of daily living may also be used to assess overall severity of polyneuropathy. This approach has been extensively explored in assessment of therapy in immune polyneuropathies (6).
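As a purely illustrative sketch (not the authors' software), the summated normal-deviate construction described above for the nerve-conduction composite can be written as follows; the percentiles are assumed to come from reference data already corrected for age, sex, height, and weight, with abnormality expressed in the upper tail.

```python
# Summated normal-deviate composite: convert each attribute's percentile to a normal
# deviate, average over the measurable attributes, and rescale to the number of attributes.
import numpy as np
from scipy.stats import norm

def summated_nc_nds(percentiles, n_attributes=5):
    """percentiles: attribute percentiles in (0, 100); use np.nan for unmeasurable attributes."""
    p = np.asarray(percentiles, dtype=float) / 100.0
    nds = norm.ppf(p)                              # percentile -> normal deviate
    measurable = ~np.isnan(nds)
    return nds[measurable].sum() / measurable.sum() * n_attributes

print(summated_nc_nds([97.5, 90.0, 99.0, np.nan, 95.0]))   # one attribute not measurable
```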
3.2 Clinical Trials in Cardiovascular Disease

In cardiovascular disease (CVD) and other diseases where treatment is expected to affect both morbidity and mortality, it is common to choose a composite outcome representing the occurrence of any one of a set of poor outcomes. For example, in assessing the benefits of aspirin for CVD prevention, outcomes such as vascular death, myocardial infarction, or major stroke; or death, myocardial infarction, or stroke; or death or reinfarction have been used as composite outcomes (7).
3.3 Clinical Trials in Rheumatoid Arthritis The American College of Rheumatology criteria (ACR) for treatment success have been developed as a composite outcome. The ACR criteria were based on achieving success on at least some of the outcomes of interest. The composite outcome has been validated in a variety of settings and has been modified as background treatment has improved (8). 3.4 Other Comments In general, one can categorize the situations where multiple endpoints are needed to measure patient status in two ways: (1) each endpoint measures a qualitatively different patient attribute versus providing alternate measures of the same attribute, and (2) each endpoint considered by itself is sufficiently important that it could be used as a primary endpoint (but is not as informative and meaningful as the multiple measures taken together) versus a situation where one or more of the endpoints taken alone would not be sufficiently meaningful to provide a primary endpoint by itself but nonetheless adds important information in arriving at an overall assessment of the patient. Although these distinctions are helpful, they may be difficult to make in practice. For example, one might consider electromyographic measures of nerve conduction and measures of sensation to be assessing different attributes. However, for both endpoints, it is possible to distinguish whether the function of small or large nerve fibers is being assessed. Therefore, it could be argued that if both endpoints are evaluating small fibers then they are measuring the same attribute. The distinctions mentioned above also would be important in a disease that has qualitatively different manifestations. For example, Hunter syndrome affects primarily lung function in some patients and primarily ambulation in others, so focusing on only one of these attributes with a single primary endpoint would not provide a clinically comprehensive and meaningful assessment. Although the primary goal of a composite endpoint is to obtain the most clinically meaningful measure of patient status with which to measure efficacy, a more comprehensive measure will typically also have the
desirable statistical property of increasing the power of the trial. Combining multiple endpoints to provide a more comprehensive assessment of the patient may also be helpful in evaluating treatment safety. 4 INTERPRETING COMPOSITE ENDPOINTS The appropriate interpretation of a composite endpoint depends on how it was formed and its intended use. The most straightforward situation is the case of existing scoring systems that have been validated and widely used in clinical practice. In this case, there is a clear understanding of the clinical relevance of the magnitude of group differences observed in the trial. If the composite consists of a summation of many items, reporting of trial results often focuses only on the overall score and does not examine the individual components. This is unfortunate because one would expect that important additional information may be gleaned from determining whether the treatment effects were seen in only a few of the individual variables (in which case, the nature of any clustering may be important) or whether the effects were displayed consistently across all variables. One way to approach the interpretation of the composite score is to focus on inspection of the individual components. For example, with relatively few individual variables, it may be possible to prespecify what constitutes a clinically meaningful effect size for each endpoint. Individual statistical tests accompanying each endpoint may further assist in interpretation. Because the overall error rate has been controlled by the test for the composite endpoint, adjustment for these individual tests is not needed. Even if testing of individual outcomes is not conducted, at a minimum the results for each of the components of the composite outcome should be reported. As Wong (9) has commented, a review article on aspirin and CVD (7) that reports on only the composite outcome and myocardial infarction could mask the relationships between aspirin and stroke; if the individual outcomes as well as the composite outcome had been reported, this concern would have been addressed.
Strategies for making decisions about clinical relevance with composite endpoints depend on the context. For example, if the overall test is significant and at least one of the endpoints reaches a clinically meaningful threshold, should the corresponding statistical test associated with that endpoint also be required to reach statistical significance? In our view, this depends on the circumstances. Alternatively, the overall test for the composite endpoint might reach statistical significance (demonstrating that the experimental treatment has a real beneficial effect), but none of the individual endpoints may reach the threshold for clinical importance or statistical significance. This could be likely to happen when the composite outcome is defined as the occurrence of any one of several events. Should the magnitude of the treatment effect be considered not clinically meaningful, or might the nature of the effect be diffuse but sufficiently consistent across endpoints that it cumulates to something clinically meaningful? Again, this depends on the circumstances, and clinical judgment is most important for arriving at a conclusion. It may be particularly important to examine the treatment effect on individual endpoints if the composite is the time to first event or the occurrence of any event because the type of events that occur soonest or most frequently may have an outsize effect on the composite. 5
CONCLUSIONS
Composite endpoints are valuable in clinical trials of efficacy, primarily because they provide a comprehensive and clinically meaningful assessment of the patient. The efficient use of multiple sources of information may also result in an increase in power over a single outcome. REFERENCES 1. P. C. O’Brien, Commentary to: A clinical trial endpoint based on subjective rankings, by Follmann D, Wittes J and Cutler JA. Stat Med. 1992; 11: 447–449. 2. N. Freemantle, M. Calvert, J. Wood, J. Eastaugh, and C. Griffen, Composite outcomes
in randomized trials: greater precision but with greater uncertainty? JAMA. 2003: 289: 2554–2559. 3. DCCT Research Group. The Diabetes Control and Complications Trial (DCCT): design and methodologic considerations for the feasibility phase. Diabetes. 1986; 35: 530–545. 4. DCCT Research Group. The effect of intensive treatment of diabetes on the development and progression of long-term complications in insulin-dependent diabetes mellitus. N Engl J Med. 1993; 329: 977–986. 5. P. J. Dyck and P. C. O’Brien, Polyneuropathy dysfunction scores. J Neurol Neurosurg Psychiatry. 2006; 77: 899–900. 6. Graham RC, Hughes RAC. A modified peripheral neuropathy scale: the Overall Neuropathy Limitations Scale. J Neurol Neurosurg Psychiatry. 2006; 77: 973–976. 7. C. L. Campbell, S. Smyth, G. Montalescot, and S. Steinhubl, Aspirin dose for the prevention of cardiovascular disease: a systematic review. JAMA. 2007. 297; 2018–2024. 8. D. T. Felson, J. J. Anderson, M. L. Lange, G. Wells, and M. P. LaValley, Should improvement in rheumatoid arthritis clinical trials be defined as fifty percent or seventy percent improvement in core set measures, rather than twenty percent? Arthritis Rheum. 1998; 41: 1564–1570. 9. G. K. Wong, Letter to the editor. JAMA. 2007; 298: 625.
COMPUTER-ASSISTED DATA COLLECTION
TIM J. GABEL, SUSAN D. PEDRAZZANI, and ORIN A. DAY, Survey and Computing Sciences, RTI International, Research Triangle Park, North Carolina

Computer-assisted data collection, or electronic data capture (EDC), has been used for more than two decades in the government survey and market research industries but has only in recent years begun gaining broad acceptance in the clinical research setting, which relies heavily on the use of paper-based case report forms (CRFs). In many clinical studies, participants typically use paper forms to provide self-reported data for the study. These forms are then converted to electronic format using data-entry or clinical data management systems suitable for EDC. An ever-increasing number of clinical studies are deploying computer-assisted or EDC systems, choosing them over paper-and-pencil data collection as information on the advantages becomes more widely known. Driving this conversion from paper to computer-assisted data collection is the need to collect quality data in a more accurate, efficient, and cost-effective manner. Audio computer-assisted self-interviewing (A-CASI) and telephone audio computer-assisted self-interviewing (T-ACASI) are just two of the numerous methods of computer-assisted data collection that can be used in clinical trials. A-CASI and T-ACASI may be particularly applicable to the collection of high-quality patient-reported data, especially in Phase IV and postmarketing surveillance studies.

1 DESCRIPTION OF COMPUTER-ASSISTED INTERVIEWING

Computer-assisted interviewing for telephone-based data collection has been in use since at least the early 1970s (1). With the introduction of personal computers in the 1980s, computer-assisted methods began appearing for in-person data collection as well. Before then, research data from study participants were typically captured on paper forms. Key differences exist between computer-based data collection (or EDC) and traditional paper-based data collection. In a traditional data-collection setting, an interviewer reads question items and response options to the study participant and then captures the responses on paper. Alternatively, participants independently complete paper forms or self-administered questionnaires. Unlike a paper-based form, an EDC system can dynamically tailor question wording on the basis of previously entered responses, implement complex routing logic so that only applicable questions are asked of respondents, and incorporate real-time data edits to ensure that responses are within valid ranges and are consistent with prior responses. Moreover, the computer system can enforce response to questionnaire items (using a "don't know" or "refuse to answer" response when necessary) to minimize or eliminate missing data. Also unlike paper data, data collected through EDC or computer-assisted interviewing can rapidly be made available to a data management team for quality control and preliminary results. Under typical computer-assisted, in-person data collection, data are collected with use of a laptop or desktop computer and then retrieved from the computer as often as desired (generally by secure electronic transmissions initiated from each computer). By contrast, data captured on paper forms are less immediately available because the responses must be subjected to a data-entry and verification process. If the data are collected at multiple locations, there may be the further delay of shipping paper forms to a central data entry location or the added overhead of setting up multiple data entry sites. The advantages of EDC, or computer-assisted interviewing, over traditional pencil-and-paper methods can be summarized as follows:

- Reduction in time from last patient visit to database release
- Reduction in number of required queries (fewer data clarification forms)
- Faster and easier query resolution
- Reduction in site monitoring costs (improved and easier monitoring)
- Reduction in project management costs
- Efficiency gains from reuse of forms (development of standards library)
- Ability to perform interim or ad hoc analysis earlier
- Faster access to data, enabling better-informed decision making
- Elimination of paper handling
- Potential for more secure data with use of passwords, data encryption, and limitation of access to database
- Potential for more complicated questionnaire routing
- Ability to specify and computerize enforcement of logic checks and range checks
- Elimination of or reduction in missing data items

Despite the advantages of computer-assisted interviewing methods over paper-based data collection, however, often the necessary involvement of a live interviewer carries disadvantages that a self-administered instrument avoids. In 1990 Miller et al. (2) summarized some of the methodological challenges in conducting surveys concerned with acquired immunodeficiency syndrome (AIDS). Observing that participation and response were impacted by participants' concerns about confidentiality, Miller and colleagues reported that "because of the sensitive and highly personal nature of these questions, virtually all of the surveys made some provision to permit respondents to reveal the details of their sexual behavior without undue embarrassment" (2). A-CASI and T-ACASI currently answer this kind of need for private response while retaining the advantages of EDC.

2 AUDIO COMPUTER-ASSISTED SELF-INTERVIEWING

In 1994, O'Reilly et al. (3) described the concept of A-CASI as a new technology for
collecting data. With the use of A-CASI, instead of being read questions by an interviewer and responding verbally while the interviewer enters the responses into the computer, the respondent listens in privacy to digitally recorded audio questions delivered to headphones through the sound card of the computer. He or she then enters responses independently, using the computer keyboard or a touch-screen. In this way, an A-CASI instrument temporarily removes the interviewer from the process; in fact, an A-CASI instrument can offer modes in which questions are not even displayed on the screen so that interviews can be conducted in a completely confidential manner in an otherwise less than private setting. Building on the research by O’Reilly and colleagues, Turner et al. (4) reported on the outcomes of an A-CASI experiment embedded in the 1995 National Survey of Adolescent Males. The embedded A-CASI experiment demonstrated statistically significant differences in the reporting of ‘‘sensitive’’ behavior (in this case, male–male sexual contact within a national probability sample of males aged 15 to 19 years) when A-CASI methods were used instead of interviewer-administered survey questions. Since that time, A-CASI has gained widespread acceptance in governmentsponsored social science research. In more recent years, the A-CASI approach has been extended to telephone-based data collection efforts. Health-related studies collecting sensitive data have been implemented using Telephone A-CASI, or T-ACASI. This approach combines interactive voice-response technologies with interviewer-administered questions (5). With T-ACASI, the respondent listens to digitally recorded questions over the telephone and enters his or her response by using a touch-tone telephone keypad. Again, without a clinician or interviewer present to ask the questions or to enter the responses, the study participant can feel more comfortable answering sensitive questions in a confidential setting (6).
3 COMPUTER-ASSISTED METHODS IN CLINICAL TRIALS These computer-assisted data collection methods apply especially well to the clinical research arena. The more prevalent EDC becomes in clinical trials, the more likely these methods will gain acceptance in the clinical community. However, careful consideration and planning regarding A-CASI and T-ACASI approaches are crucial for effective implementation and study outcomes. 3.1 Questionnaire Design Most clinical researchers recognize the importance of devoting adequate time and resources to the development of the questionnaire or clinical forms, but this is especially true for studies incorporating A-CASI or T-ACASI methods. Because these approaches involve self-administration by study participants who interact with a computer, all questionnaire items must be clear and understandable to the participants. We recommend that a forms design team develop and test the questions with the target population in mind, giving especially careful consideration to those items that will be self-administered. To design and test self-administered questions for requisite clarity, the design team must first assess study participant characteristics such as age, literacy, computer experience or access to a touch-tone telephone, culture, language, and disease or condition severity. Consideration of these characteristics will help determine, at the outset, whether A-CASI or T-ACASI can be a successful mode for interviewing a given population. For example, the population of older persons may require a large-point font in which to view an onscreen A-CASI application, a mentally challenged population may require the digitally recorded text to be administered at a much slower pace than is customary for other populations, or it may be necessary to record the text in multiple languages. The key is to build an A-CASI or T-ACASI application that the target population will understand and easily use, so that required data will truly inform the hypotheses. A clinical study that has successfully addressed the specific characteristics and
needs of a target population can fully benefit from one of the major advantages of using A-CASI and T-ACASI: computerized standardization of questionnaire administration. This standardization permits the collection of better-quality data than is possible with a live interviewer because the prerecorded questions are administered as audio files in exactly the same way to every respondent who completes the questionnaire. The voice, tone, inflection, speed, and text are identical for each respondent, no matter how many times a question is repeated for each respondent. Unlike clinician- or intervieweradministered questionnaires, A-CASI or T-ACASI prevents alteration of question inflection or of specific word or phrase emphasis that might compromise standardization. Because no interviewer or clinician is involved in the self-administered phase of the interview process, computerized questionnaires must be designed accordingly. One design feature common to most computerassisted questionnaires is provision for a ‘‘don’t know’’ response if the respondent does not understand the question or simply has no response. Typically, at the beginning of the self-administered portion of the questionnaire, some standard conventions are provided to the study participant. For example, the respondent may be told to press a specific key or button for a ‘‘don’t know’’ response, and a different key or button to hear the question repeated. Help screens or help text can also be built into the questionnaire to clarify a specific question or series of response categories. In keeping with the goal of maximizing study response participation when a live interviewer is absent, the design team for an A-CASI questionnaire may do well to consider whether the study participant will provide input by computer keyboard or by touch-screen monitor. Most adults and children in today’s society have become adept at using touch-screen systems such as airport kiosks, automatic teller machines, and grocery store checkout machines. There is some evidence to suggest that, when given a choice, clinical study participants prefer responding by touch-screen computer rather than by keyboard. For example, among a
convenience sample of 108 patients at a Baltimore, Maryland, sexually transmitted disease (STD) clinic, nearly 70% indicated that using a touch-screen A-CASI application was easier than responding by computer keyboard (7). 3.2 Edit Checks Computer-assisted questionnaire edit checks can preempt the need for laborious data editing and cleaning on the back end. In clinical research, errors and anomalies occur despite careful study design and implementation of quality assurance and quality control strategies. To minimize the impact of such anomalies and errors on the study results, data editing and cleaning typically occur as part of the data management process once the data are collected. Data editing is the process of identifying ambiguous or erroneous data and diagnosing the source of the problem, and data cleaning is the process of changing the erroneous data shown to be incorrect. Typical clinical trial data editing procedures include a ‘‘query’’ process, whereby data anomalies are resolved by referring back to a CRF, and in some instances by interacting with staff at a clinical site. With A-CASI and T-ACASI, there is no CRF. In fact, there is no paper record at all for clarifying data issues. The patient-reported source data are initialized in electronic format, and real-time edits must be incorporated into the computerized application to ensure that high-quality data are captured. Data edits, such as logic and range checks, are routinely programmed into the questionnaire to ensure that data values are within legitimate ranges and to help ensure consistency across items. Although the ability to implement edit checks with the use of A-CASI or T-ACASI is an advantage over paper-and-pencil data collection, the questionnaire designer must still consider how often to use edit checks, when to use them, and whether to deploy ‘‘soft’’ edit checks, ‘‘hard’’ edit checks, or both during administration of the questionnaire. Use of edit checks involves balancing the need for the highest quality data possible with the added burden and potential frustration they pose for the respondent. With a triggered hard edit check, for example, a
message will appear—usually in a pop-up box on the same screen—requiring a change in a patient's response. By contrast, the soft edit check, if triggered, will simply recommend that the respondent make a change in response but also will permit the response to remain unchanged and will allow the respondent to continue with the questionnaire. Among options for edit checks, one type of hard edit requires that the respondent actually enter a value into an item before moving on to the next question. This safeguard is particularly recommended for A-CASI and T-ACASI because the clinician, being removed from the self-administered interviewing phase, cannot help prevent missing data. Good candidates for hard edit checks are questionnaire items that aim to collect data for critical or key study endpoints. Hard edits ensure that questions collecting key study hypotheses data will have little or no missing data, and that the data will be within expected and acceptable ranges. Soft edit candidates are items for which a logical range of values can be established but outlier values may be feasible. In such instances, the range edit check can be triggered when out-of-range values are entered, preventing data errors due to miskeyed values but allowing the respondent to reconfirm or reenter the out-of-range value and continue completing the computerized form. One way to establish optimal use of edit checks is to pretest computerized forms. Interviewing a small number of participants who have the same characteristics as the target population provides valuable feedback about where and when a questionnaire requires edit checks. For example, if more than half the pretest participants enter a particular question response outside the range of values the team was expecting, adding an edit check to that question may be wise. If the pretest indicates that many questionnaire items require an edit check, the team may want to consider prioritizing the questions and implementing edit checks for the more critical data items, and clarifying the other questions or adding help screens. Pretesting also provides an opportunity to judge the overall usability (e.g., clarity of question wording, placement of entry fields on the screen) of the computerized system by study participants. In developing any type of edit checks, the study team should consider how best to convey each "error" or notification message with minimal burden to the respondent. Ideally, an "error" or notification message is evident to the respondent and clearly specifies what the respondent is to do. For example, an error message regarding a mis-keyed birth month might read, "Month of Birth must be between 1 and 12. Please reenter." A well-designed error message is brief and clear; the subsequent instructions are simple and relatively easy to act upon.
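As a generic illustration of the distinction between hard and soft edits (not tied to any particular EDC product; the field limits and messages are hypothetical), a range check might look like this:

```python
# Hard versus soft range edits for a computerized questionnaire item.
def check_response(value, low, high, hard=True):
    """Return (accepted, message). A hard edit forces re-entry; a soft edit only warns."""
    if low <= value <= high:
        return True, ""
    message = f"Value must be between {low} and {high}. Please reenter."
    if hard:
        return False, message                                  # block until a valid value is entered
    return True, message + " Or confirm to keep this value."   # respondent may keep the outlier

# Hard edit on birth month (critical item); soft edit on a plausible-but-checkable outlier.
print(check_response(14, 1, 12, hard=True))    # (False, 'Value must be between 1 and 12. ...')
print(check_response(60, 0, 50, hard=False))   # (True, warning shown; value may be confirmed)
```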
3.3 Quality Assurance

Although not the focus of this article, a brief discussion is essential to address clinical studies that, because of sponsor requirements, must follow specified regulations—such as studies that ultimately submit their data to the U.S. Food and Drug Administration (FDA) or other regulatory agencies. The conduct of these studies, including the development and documentation surrounding the computer application, must follow all relevant federal regulations and guidelines. One example of such a regulation is "Electronic Records; Electronic Signatures," 21 C.F.R. pt. 11 (2003). Audit trails are a key component of such regulations and should be included in the computerized application. The FDA defines an audit trail as a "chronological record of system activities that is sufficient to enable the reconstruction, reviews, and examination of the sequence of environments and activities surrounding or leading to each event in the path of a transaction from its inception to output of final results" (8). Although most commercial survey systems were not originally designed to be used for clinical research, many of them (e.g., Blaise from Statistics Netherlands, www.blaise.com) include electronic audit trail features. The key is to know the requirements of the sponsor and any relevant governing agencies and to implement a system that meets all the requirements at study inception. Implementation includes creating and maintaining the system and the testing documentation according to those same requirements.
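The kind of chronological audit-trail record such regulations call for can be sketched as follows (illustrative only; the field names are hypothetical, and this is not the FDA-specified or Blaise format):

```python
# Append-only audit-trail entry recorded for every change to a data item.
import datetime
import json

def audit_record(user, form, field, old_value, new_value, reason):
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user": user,              # authenticated user making the change
        "form": form,              # questionnaire or CRF identifier
        "field": field,
        "old_value": old_value,
        "new_value": new_value,
        "reason": reason,          # reason for change, needed to reconstruct the transaction
    }

with open("audit_trail.log", "a") as log:      # append-only; entries are never edited in place
    entry = audit_record("coordinator01", "BASELINE", "birth_month", 14, 11, "keying error corrected")
    log.write(json.dumps(entry) + "\n")
```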
4 CLINICAL CASE STUDIES USING COMPUTER-ASSISTED METHODS In this section, selected case studies illustrate how A-CASI and T-ACASI have been used in clinical studies. Although the research described includes both federally and privately sponsored studies and covers a wide array of topics, it all shares one common thread: the need to collect sensitive or personal information in a private setting. Each of the four studies programmed its instrument to incorporate real-time data edits, with consistency checks across numerous questions. At the end of data collection, each study extracted a data file from the system to allow for further edits of the data. Because of the inclusion of the edit and consistency checks, and the ability to review interim data files throughout data collection, data managers performed very limited data cleaning before beginning data analyses. 4.1 Case 1 A seroprevalence study of herpes simplex virus-2 was conducted at 36 primary care physician (PCP) offices in suburban areas surrounding six cities in the United States in 2002 (9). Adults, aged 18 to 59 years, were asked to participate during a routine visit to the PCP office. After providing consent, patients were asked to complete a 15minute, computer-assisted self-interview and to provide a blood sample. The questionnaire included demographic questions as well as a series of questions designed to elicit sexual attitudes, behaviors, and symptom history relating to genital herpes. The A-CASI technology was selected to provide patients a confidential method to report sensitive information. The questionnaire was completed and a blood sample provided by 5433 patients. 4.2 Case 2 The survey of Male Attitudes Regarding Sexual Health (MARSH) was conducted in 2001 to determine the age- and race-specific prevalence of mild, moderate, and severe erectile dysfunction (ED) in African American, Caucasian, and Hispanic males aged 40 years or older (10). The survey was a nationally administered random-digit-dial, list-assisted
representative telephone interview. A trained telephone interviewer administered the first part of the questionnaire, which included the screening, demographics, and general healthrelated questions, and then transferred the respondent to the T-ACASI section. The respondent then answered the remaining questions with the keypad on his touch-tone telephone. The T-ACASI method provided a private setting in which the more sensitive questions were asked (e.g., questions about sexual activity, sexual gratification, the number of sexual partners, ED treatment, and the discussion of erectile problems). The survey was conducted in both English and Spanish. Interviews were completed by 2173 respondents.
4.3 Case 3

The National Institute of Mental Health–sponsored Collaborative HIV/STD Prevention Trial began in 1999 to examine the efficacy of a community-level intervention to reduce HIV/STD incidence and high-risk behaviors in China, India, Peru, Russia, and Zimbabwe. The intervention sought to modify social norms at the community level to effect mass changes in HIV/STD risk behaviors. In each of the five countries, community popular opinion leaders were engaged as behavior-change agents within their community of friends, neighbors, and coworkers. As part of this work, several pilot studies were conducted using A-CASI technology, developed in the respective languages of each country (11). The study selected this technology to test the feasibility of its use in developing countries and to allow for added privacy when sensitive and personal information is collected (e.g., risk behaviors). The study concluded that A-CASI appears to be feasible in these settings (12).

4.4 Case 4

The National Institute of Child Health and Human Development's Initiative to Reduce Infant Mortality in Minority Populations in the District of Columbia began in 1993 to develop projects designed to better understand the factors that influence the high rate of infant mortality and morbidity in Washington, DC, and to design and evaluate interventions aimed at reducing the number of infants there who are at increased risk of dying during their first year of life. One of the numerous protocols implemented under this initiative and supported by the National Institute on Alcohol Abuse and Alcoholism was the Prevention and Fetal Alcohol Effects Study, which used A-CASI technology to allow for a more private environment when disadvantaged, pregnant women were asked about their alcohol consumption. The study results demonstrated that using computer-assisted technology to screen for alcohol use in disadvantaged pregnant populations is feasible and acceptable to the respondents (13).

REFERENCES
1. J. C. Fink, CATI’s first decade: the Chilton experience. Sociol Methods Res. 1983; 12: 153–168. 2. H. G. Miller, C. F. Turner, and L. E. Moses, eds. Methodological issues in AIDS surveys. In: H. G. Miller, C. F. Turner, and L. E. Moses (eds.), AIDS: The Second Decade. Washington, DC: National Academy Press, 1990, Chapter 6. 3. J. M. O’Reilly, M. L. Hubbard, J. T. Lessler, P. P. Biemer, and C. F. Turner, Audio and video computer-assisted self-interviewing: preliminary tests of new technologies for data collection. J Off Stat. 1994; 10: 197–214. 4. C. F. Turner,. L. Ku, F. L. Sonenstein, and J. H. Pleck, Impact of ACASI on reporting of male-male sexual contacts: preliminary results from the 1995 National Survey of Adolescent Males In: R. Warnecke (Ed.), Health Survey Research Methods. DHHS Pub. No. (PHS) 96-1013. Hyattsville, MD: National Center for Health Statistics, 1996, pp. 171–176. 5. P. C. Cooley, H. G. Miller, J. N. Gribble, and C. F. Turner, Automating telephone surveys: using T-ACASI to obtain data on sensitive topics. Comput Human Behav. 2000; 16: 1–11. 6. J. N. Gribble, H. G. Miller, J. A. Catania, L. Pollack, and C. F. Turner, The impact of T-ACASI interviewing on reported drug use among men who have sex with men. Subst Use Misuse. 2000; 35: 63–84. 7. P. C. Cooley, S. M. Rogers, C. F. Turner, A. A. Al-Tayyib, G. Willis, and L. Ganapathi, Using touch screen audio-CASI to obtain data on
sensitive topics. Comput Human Behav. 2001; 17: 285–293. 8. Office of Regulatory Affairs, U.S. Food and Drug Administration. Glossary of Computerized System and Software Development Terminology. Available at: http://www.fda.gov/ora/inspect ref/igs/gloss.html 9. P. Leone, D. T. Fleming, A. Gilsenan, L. Li, and S. Justus, Seroprevalence of herpes simplex virus-2 in suburban primary care offices in the United States. Sex Transm Dis. 2004; 31: 311–316. 10. E. O. Laumann, S. West, D. Glasse, C. Carson, R. Rosen, and J. H. Kang, Prevalence and correlates of erectile dysfunction by race and ethnicity among men aged 40 or older in the United States: from the male attitudes regarding sexual health survey. J Sex Med. 2007; 4: 57–65.
11. L. C. Strader, Developing of a multi-lingual survey instrument in A-CASI. Paper presented at the American Public Health Association Annual Meeting; October 24, 2001, Atlanta, GA. Abstract 28287. 12. The NIMH Collaborative HIV/STD Prevention Trial Group. The feasibility of audio computer-assisted self-interviewing in international settings. AIDS. 2007; 21(Suppl 2): S49–58. 13. J. Thornberry, B. Bhaskar, C. J. Krulewitch, B. Weslet, M. L. Hubbard, et al., Audio computerized self-report interview in prenatal clinics: audio computer-assisted self-interview with touch screen to detect alcohol consumption in pregnant women: application of a new technology to an old problem. Comput Inform Nurs. 2002; 20: 46–52.
CONDITIONAL POWER

MING T. TAN
Division of Biostatistics, University of Maryland School of Medicine and Greenebaum Cancer Center, College Park, Maryland

1 INTRODUCTION

For ethical, economical, and scientific reasons, clinical trials may be terminated early for a pronounced treatment benefit or the lack thereof. In addition to group sequential tests and type I error-spending functions, another approach is to address the question more directly by asking whether the accumulated evidence is sufficiently convincing for efficacy. Or, if evidence is lacking, one should ask whether the trial should continue to the planned end and whether the conventional (reference) nonsequential test (RNST) should be employed. Conditional power is one way to quantify this evidence. It is simply the usual statistical power of the RNST conditional on the current data. Thus, it is the conditional probability that the RNST will reject the null hypothesis on completion of the trial, given the data currently available, at a given parameter value of the alternative hypothesis. Therefore, if the conditional power is too low or exceedingly high, then the trial may be terminated early for futility or for efficacy, respectively. The early stopping procedure derived from conditional power is referred to as stochastic curtailing, by which an ongoing trial is curtailed as soon as the trend based on current data becomes highly convincing (1). Therefore, conditional power serves two closely related purposes: (1) as a measure of trend reversal (e.g., a futility index) and (2) as an interim analysis procedure (formally, a group sequential method). However, this method can be readily communicated to nonstatisticians because it simply answers the question of whether the evidence in the interim data is sufficient for making an early decision on treatment efficacy, or the lack thereof, in reference to what the RNST would conclude.

2 CONDITIONAL POWER

To introduce the concept with statistical rigor and clarity, consider the statistical framework of testing a normal mean, into which many clinical trials can be formulated (2,3). We test $H_0: \mu \le 0$ versus the alternative $H_1: \mu > 0$, where $\mu$ denotes the treatment effect parameter and (for example) is the mean elevation of diastolic blood pressure above 90 mm Hg. Let $m$ be the maximum sample size of the RNST and the current data be $S_n$. Then the conditional power is defined as $P_n(\mu) = P(\text{the RNST will reject } H_0 \text{ at } m \mid S_n)$. Furthermore, let $X_i$ be the observed elevation of diastolic blood pressure above 90 mm Hg of the $i$th subject. Thus, $X_i \sim N(\mu, \sigma^2)$, $i = 1, \ldots, n, \ldots, m$. The current data are summarized by the sufficient statistic, the partial sum $S_n = \sum_{i=1}^{n} X_i$, and $S_n \sim N(n\mu, n\sigma^2)$, $n = 1, 2, \ldots, m$, where it is assumed that $\sigma = 28.3$ mm Hg based on preliminary data. We are interested in detecting a clinically important difference of 10 mm Hg at a significance level of $\alpha$ (for example, 0.025) with power $1 - \beta$ (for example, 0.90). The fixed sample design requires $m = 86$ subjects, and the null hypothesis would be rejected if $Z_m > z_\alpha = 1.96$, where $Z_m = S_m/(\sqrt{m}\,\sigma)$ and $z_\alpha$ is the upper $\alpha$-percentile of the standard normal distribution, or equivalently $S_m > s_0$, where $s_0 = z_\alpha \sigma \sqrt{m} \approx 514.4$. Therefore, the conditional power is evaluated under the conditional distribution

$S_m \mid \mu, S_n \sim N\big((m-n)\mu + S_n,\ (m-n)\sigma^2\big), \qquad (1)$

and thus it is given by

$P_n(\mu) = P_\mu(\text{RNST rejects } H_0 \mid S_n) = P_\mu(S_m > z_\alpha \sigma \sqrt{m} \mid S_n) = 1 - \Phi\!\left(\frac{z_\alpha \sigma \sqrt{m} - S_n - (m-n)\mu}{\sigma\sqrt{m-n}}\right),$

where $\Phi$ denotes the standard normal distribution function. A similar derivation for two-sided tests can be found in Reference 3.
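As a numerical sketch (not part of the original article), the one-sided conditional power formula above can be evaluated directly; with the design values stated in this section it reproduces, up to rounding, the midcourse example discussed next:

```python
from math import sqrt
from scipy.stats import norm

def conditional_power(s_n, n, mu, m=86, sigma=28.3, alpha=0.025):
    """One-sided conditional power P_n(mu) for the normal-mean setting:
    probability that the fixed-sample test rejects H0 at the planned end,
    given the current partial sum S_n = s_n after n observations."""
    z_alpha = norm.ppf(1 - alpha)                  # 1.96 for alpha = 0.025
    s0 = z_alpha * sigma * sqrt(m)                 # rejection cutoff for S_m
    return 1 - norm.cdf((s0 - s_n - (m - n) * mu) / (sigma * sqrt(m - n)))

# Midcourse (n = 43) with average elevations of 8 and 2 mm Hg,
# evaluated at the alternative mu = 10 mm Hg:
print(round(conditional_power(43 * 8, 43, 10), 2))  # about 0.92
print(round(conditional_power(43 * 2, 43, 10), 2))  # about 0.50 (text quotes 49%)
```

Small differences from the percentages quoted in the text reflect rounding of the design constants.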
Because the conditional power depends on the unknown true treatment effect parameter $\mu$, different hypothetical values of $\mu$ have to be given to evaluate the conditional power. A common practice is to consider three values: the null value, the value under the alternative hypothesis that the study is designed to detect, and the current estimate; one such current value is the average of the null and the alternative. If at midcourse ($n = 43$) the average elevation of blood pressure (xbar in Fig. 1) is 8 mm Hg (i.e., $S_n = 43 \times 8 = 344$) and the true treatment effect $\mu$ is 10 mm Hg, then the conditional power is 92%, which implies there is a good chance the null hypothesis will be rejected at the planned end of the trial. On the other hand, if the average elevation is 2 mm Hg, then the conditional power is only 49%. The conditional power has been used as a futility index: if the conditional power at the alternative hypothesis is too low (for example, less than 0.20), then the trial is not likely to reach statistical significance, which provides an argument for early termination of the trial for futility. Figure 1 gives the usual power function of the test as well as the conditional power that corresponds to the two average elevations of blood pressure (xbar values) at midcourse ($n = 43$) of the trial.

Figure 1. Illustration of the conditional power.

Figure 2 gives the stochastic curtailing boundaries based on the conditional power. The difficulty is which $\mu$ to choose, because it may be hard to anticipate a future trend. A stochastic curtailing procedure can be derived
using the conditional power of the RNST given $S_n$ and some plausible values of the treatment effect. Then, we can derive a formal sequential test with upper boundary $a_n$ and lower boundary $b_n$, where we reject $H_0$ the first time $Z_n \ge a_n$ or accept $H_0$ the first time $Z_n \le b_n$. If the conditional power at $\mu = 0$ is greater than $\gamma_0$ (for example, 0.80), then $H_0$ is rejected, and if the conditional power at $\mu = 10$ is less than $1 - \gamma_1$ (for example, 0.20), then $H_0$ is accepted. The sequential boundaries are

$a_n = z_\alpha\sqrt{m/n} + z_{1-\gamma_0}\sqrt{(m-n)/n}$ and $b_n = z_\alpha\sqrt{m/n} - z_{1-\gamma_1}\sqrt{(m-n)/n} - \mu(m-n)/(\sigma\sqrt{n}).$

It can be shown (1) that the derived curtailing procedure has a type I error no greater than $\alpha/\gamma_0$ (0.0625) and a type II error no greater than $\beta/\gamma_1$ (0.25). Figure 2 gives the boundaries of the two stochastic curtailing procedures based on conditional power with $\gamma_0 = \gamma_1 = \gamma = 0.80$ and 0.98. The extreme early conservatism of stochastic curtailing is apparent.

Figure 2. Stopping boundaries of different stochastic curtailing procedures (DP denotes discordance probability).

It is now well known, and widely utilized in the monitoring of clinical trials, that the test statistic in most common phase III clinical trials can be formulated into the general Brownian motion framework (4). In other words, the test statistic can be rescaled into the statistic $B_t = Z_n\sqrt{t}$ ($0 \le t \le 1$), which follows approximately a Brownian motion with drift parameter $\mu$. Thus $\mu$ may represent a pre-post change, a log odds ratio, or a log hazard ratio. The primary goal is to test $H_0$:
$\mu \le 0$ against the alternative $H_1: \mu > 0$. Then, the conditional distribution of $B_1$ given $B_t$ is again normal with mean $B_t + (1-t)\mu$ and variance $1-t$. Therefore, the conditional power in the general Brownian motion formulation is

$p_t(\mu) = P_\mu(\text{RNST rejects } H_0 \mid B_t) = P_\mu(B_1 > z_\alpha \mid B_t) = 1 - \Phi\!\left(\frac{z_\alpha - B_t - (1-t)\mu}{\sqrt{1-t}}\right).$

For a two-sided test, the conditional power is given by

$p_t(\mu) = P_\mu(|B_1| > z_{\alpha/2} \mid B_t) = 1 - \Phi\!\left(\frac{z_{\alpha/2} - B_t - (1-t)\mu}{\sqrt{1-t}}\right) + \Phi\!\left(\frac{-z_{\alpha/2} - B_t - (1-t)\mu}{\sqrt{1-t}}\right).$

Several authors (3–5) have documented in detail how to formulate common clinical trials with various types of endpoints into the Brownian motion framework. For example, the sequentially computed log-rank statistic is asymptotically normally distributed with an independent increment structure (6,7). The conditional power is given in Reference 8 for comparing two proportions, in Reference 9 for censored survival time for log-rank or
weighted log-rank statistics, in Reference 10 for longitudinal studies, and in Reference 11 for models with covariates. In addition, several authors have used conditional power as an aid in extending an ongoing clinical trial beyond the originally planned end, for survival outcomes (12) and in the Brownian motion framework (13). More recently, the discordance probability has also been extended and derived under the general Brownian motion framework (14,15).

3 WEIGHT-AVERAGED CONDITIONAL POWER

Another way to avoid explicit choices of the unknown parameter is to use the weighted average of the conditional power, with weights given by the posterior distribution of the unknown parameter $\mu$ given the currently available data. Let the prior distribution of $\mu$ be $\pi(\mu)$ and its posterior be $\pi(\mu \mid S_n)$. Then, the weight-averaged conditional power (also known as predictive power) for the one-sided hypothesis test is given by $P_n = \int p_n(\mu)\,\pi(\mu \mid S_n)\,d\mu$. If the improper, noninformative prior $\pi(\mu) = 1$ is chosen, then the posterior of $\mu \mid S_n$ is normal with mean $S_n/n$ and variance $\sigma^2/n$. Then, from Equation (1), the marginal distribution of $S_m \mid S_n$ is again normal with
mean $(m/n)S_n$ and variance $\sigma^2(m-n)m/n$. The predictive power is thus simply

$P_n = P(S_m > z_\alpha\sigma\sqrt{m} \mid S_n) = 1 - \Phi\!\left(\frac{z_\alpha\sigma\sqrt{m} - (m/n)S_n}{\sigma\sqrt{(m-n)(m/n)}}\right).$

Several authors have used the predictive power approach (16–18). Similar to conditional power, if $P_n \ge \gamma_0$, we consider rejecting the null, and if $P_n \le 1 - \gamma_1$, we consider accepting the null. This criterion results in an interim analysis procedure with boundaries $a_n = z_\alpha\sqrt{n/m} + z_{1-\gamma_0}\sqrt{(m-n)/m}$ and $b_n = z_\alpha\sqrt{n/m} - z_{1-\gamma_1}\sqrt{(m-n)/m}$. Unfortunately, no simple relationship relates the type I and II errors of the procedure to the predictive power. However, a more informative use of the predictive power may be through an informative prior: the data monitoring committee can use the predictive power to explore the consequences of various prior beliefs about the unknown treatment effect parameter.

4 CONDITIONAL POWER OF A DIFFERENT KIND: DISCORDANCE PROBABILITY

Based on the same principle of stopping a trial early as soon as the trend becomes inevitable, it is revealing to consider the conditional likelihood of the interim data given the reference test statistic at the planned end of the trial ($m$):

$S_n \mid S_m \sim N\!\left(\frac{n}{m}S_m,\ \frac{n(m-n)}{m}\sigma^2\right).$
The distinct advantage of using this conditional probability is that it does not depend on the unknown parameter µ because conditioning is made on Sm , which is a sufficient statistic for µ. Using this conditional likelihood of the test statistic calculated at an interim time, we can derive a different kind of stochastic curtailing based on discordance probability defined as the probability the sequential test does not agree with the RNST in terms of accepting or rejecting the null hypothesis should the trial continue to the planned end (m). At a
given interim time point $n$, let $a_n$ be the upper (rejection) boundary (i.e., if $S_n \ge a_n$, reject the null hypothesis). Then $P_\mu(S_n \ge a_n \mid S_m \le s_0)$ is the probability that the decision to reject $H_0$ at $n$ with $S_n \ge a_n$ is discordant with the decision to accept $H_0$ at $m$ when $S_m \le s_0$. For any $\mu$, $P_\mu(S_n \ge a_n \mid S_m \le s_0) \le P(S_n \ge a_n \mid S_m = s_0)$, so we can use $P(S_n \ge a_n \mid S_m = s_0)$ to derive a sequential boundary. If this probability is smaller than $\xi$, then we stop the test and reject the null hypothesis; similarly, if $P(S_n \le b_n \mid S_m = s_0) < \xi$, then we stop the test and do not reject the null hypothesis. If we choose $\xi$ (say, 0.05) as the same cut-off point for each $n$ ($n = 1, \ldots, m$), then we have

$P(S_n \ge a_n \mid S_m = s_0) = 1 - \Phi\!\left(\frac{a_n - n s_0/m}{\sigma\sqrt{n(m-n)/m}}\right) \le \xi.$

Solving this equation, we have $a_n = z_\alpha\sqrt{n/m} + z_{1-\xi}\sqrt{(m-n)/m}$ and $b_n = z_\alpha\sqrt{n/m} - z_{1-\xi}\sqrt{(m-n)/m}$. It is worth noting again that the boundaries are derived using the marginal probability $\xi$ for each $n$ ($n = 1, \ldots, m$). Marginally, the stopping boundaries are the same as those from predictive power with a noninformative prior. A more accurate statement may be made via a global discordance probability, defined as the probability that the sequential test on interim data does not agree with the acceptance/rejection conclusion of the RNST at the planned end (19). Xiong (19) derived the elegant sequential conditional probability ratio test (SCPRT) via a conditional likelihood ratio approach and obtained boundaries of the same form. Most importantly, he derived the intricate relationship among the type I and II errors and the discordance probability, and he provided efficient algorithms to compute them. It shows that the sequential boundary can be derived such that it has virtually the same type I and II errors as the RNST, and that the probability that the rejection or acceptance of the null hypothesis based on interim data might be reversed, should the trial continue to the planned end, is less than a given level $\rho_0$ (for example, 0.02). With instantaneous computation of the type I and
II errors and various discordance probabilities, sharper monitoring boundaries can be derived (14,19–22). Similar boundaries have also been derived for Bernoulli series using the same parameter-free approach, in the context of reducing computation in a simulation study designed to evaluate the error rates of a bootstrap test (23). Figure 2 also gives the boundaries of the stochastic curtailing procedure based on the discordance probability, with a maximum discordance probability (denoted DP in Fig. 2) less than 0.02. In contrast to the extreme early conservatism of stochastic curtailing based on conditional power, the three boundaries become closer as the trial approaches its end. Interestingly, in the last quarter of the information fraction of the trial, the curtailing procedure with $\gamma = 0.98$ almost coincides with that of the SCPRT, whereas the boundary with $\gamma = 0.80$ becomes slightly tighter than that of the SCPRT, which results in an increase in discordance probability relative to that of the RFSST, reflecting the conservatism of the SCPRT. A more detailed comparison of the two curtailing approaches and the SCPRT with common group sequential procedures such as the O'Brien-Fleming, Pocock, and Haybittle-Peto procedures is given in References 21 and 22.
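The boundary formulas in Sections 2–4 are simple enough to tabulate directly. The following sketch is an illustration only: it reuses the blood-pressure design above and assumes that $z_q$ denotes the upper $q$-point of the standard normal distribution, which is how the formulas were read here.

```python
from math import sqrt
from scipy.stats import norm

def cp_boundaries(n, m=86, sigma=28.3, alpha=0.025, mu_alt=10.0,
                  gamma0=0.80, gamma1=0.80):
    """Stochastic-curtailing boundaries based on conditional power
    (upper a_n, lower b_n) for the running z-statistic Z_n."""
    z_a = norm.ppf(1 - alpha)
    z_g0 = norm.ppf(gamma0)      # z_{1-gamma0} read as an upper percentile
    z_g1 = norm.ppf(gamma1)
    a_n = z_a * sqrt(m / n) + z_g0 * sqrt((m - n) / n)
    b_n = (z_a * sqrt(m / n) - z_g1 * sqrt((m - n) / n)
           - mu_alt * (m - n) / (sigma * sqrt(n)))
    return a_n, b_n

def dp_boundaries(n, m=86, alpha=0.025, xi=0.02):
    """Boundaries based on the discordance probability (same form as the
    predictive-power boundaries with a noninformative prior)."""
    z_a = norm.ppf(1 - alpha)
    z_xi = norm.ppf(1 - xi)
    a_n = z_a * sqrt(n / m) + z_xi * sqrt((m - n) / m)
    b_n = z_a * sqrt(n / m) - z_xi * sqrt((m - n) / m)
    return a_n, b_n

for n in (22, 43, 65, 86):
    print(n, [round(x, 2) for x in cp_boundaries(n)],
             [round(x, 2) for x in dp_boundaries(n)])
```

Both pairs of boundaries converge to $z_\alpha = 1.96$ at $n = m$, and the early conservatism of the conditional-power boundaries relative to the discordance-probability boundaries is visible in the printed table, as described for Figure 2.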
5 ANALYSIS OF A RANDOMIZED TRIAL
The Beta-Blocker Heart Attack Trial was a randomized, double-blind trial, sponsored by the National Institutes of Health, that compared propranolol (n = 1916) with placebo (n = 1921) in patients who had a recent myocardial infarction. Patients were accrued from June 1978 to June 1980 with a 2-year follow-up period, resulting in a 4-year maximum duration. The trial was terminated early for a pronounced treatment benefit. Aspects of the interim monitoring and early stopping of this trial have been summarized elsewhere (24–26). The minimum difference of clinical importance to be detected was 0.26 on the log hazard ratio scale, derived from projected 3-year mortality rates of 0.1746 for the placebo group and 0.1375 for the treatment group, adjusting for compliance. Roughly 628 deaths are
required for a fixed sample size test to detect such a difference at a significance level of 5% with 90% power. Seven interim analyses, corresponding to the times the Policy and Data Monitoring Board met, were planned. The trial was stopped 9 months early at the sixth interim analysis, with 318 deaths (183 in the placebo arm and 125 in the treatment arm) and a standardized z-statistic of 2.82; the O'Brien-Fleming boundary was crossed. The conditional power can be evaluated for various numbers of expected additional deaths. For example, a linear interpolation of the life table based on the current survival data suggests 80 additional deaths in the ensuing 9 months. Therefore, the information time at the sixth analysis is $318/(318 + 80) = 0.80$, and then $B_t = 2.82\sqrt{0.80} = 2.52$. The conditional power $p_{0.80}(0)$ is 0.89. If an additional 90 deaths are expected, then the conditional power $p_{0.78}(0)$ is 0.87. Both suggest a rather high conditional power for a treatment effect. Assuming an additional 90 deaths, the SCPRT curtailing procedure based on the discordance probability can be derived (14), which gives a maximum discordance probability of 0.001. This finding implies that there is only a 0.1% chance that the conclusion might be reversed had the trial continued to the planned end (14). If the 628 total deaths in the original design are used, and if an SCPRT is used as stated in the protocol, then the maximum discordance probability is 1%, which implies only a slight chance (1%) that the decision based on the SCPRT procedure in the protocol would be reversed had the trial continued to the planned end (21). Therefore, by all three procedures it is highly unlikely that the early stopping decision for efficacy would have been reversed had the trial continued to the planned end. However, the SCPRT-based curtailing provides a sharper stopping boundary for trend reversal, as expected.

6 CONDITIONAL POWER: PROS AND CONS

To put things in perspective, the conditional power approach attempts to assess whether evidence for efficacy, or the lack of it, based on the interim data is consistent with that
at the planned end of the trial by projecting forward. Thus, it substantially alleviates a major inconsistency of other group sequential tests, whereby different sequential procedures applied to the same data yield different answers. This inconsistency with the nonsequential test sets up a communication barrier in practice: we can claim a significant treatment effect via the nonsequential test but cannot do so via a sequential test based on the same data set, or we can claim significance with one sequential method but cannot do so with another. For example, in a clinical trial that compares two treatments at the 5% significance level with five planned interim analyses, the nominal level at the fifth analysis for the Pocock procedure is 0.016, whereas the nominal level at the fifth interim analysis for the O'Brien-Fleming procedure is 0.041. If the trial has a nominal P-value of 0.045 at the fifth analysis, then according to either of the group sequential designs the treatment effect would not be significant, whereas investigators with the same data simply carrying out a fixed sample size test would claim a significant difference. However, if the nominal P-value is 0.03, then the treatment effect is significant according to the O'Brien-Fleming procedure but not according to the Pocock procedure. The advantage of the conditional power approach for trial monitoring is its flexibility. It can be used for unplanned analyses and even analyses whose timing depends on previous data. For example, it allows inferences from over-running or under-running (namely, when more data come in after the sequential boundary is crossed, or when the trial is stopped before the stopping boundary is reached). Conditional power can be used to aid the decision for early termination of a clinical trial, to complement the use of other methods, or when other methods are not applicable. However, such flexibility comes with a price: potentially more conservative type I and type II error bounds ($\alpha/\gamma_0$ and $\beta/\gamma_1$) that one can report. The SCPRT-based approach removes the unnecessary conservatism of the conditional power approach and can retain virtually the same type I and II errors with a negligible discordance probability, by accounting for how the data pattern (sample path) is traversed. The use of the SCPRT
especially for making decisions at early stages has been explored by Freidlin et al. (27) and, for one-sided tests, by Moser and George (28). The greatest advantage of predictive power is that it allows us to explore the consequences of various prior beliefs about the unknown treatment effect parameter. Finally, conditional power has also been used to derive tests that adapt to the data in the first stage of the trial (see Reference 13). More recently, the related reverse stochastic curtailing and the discordance probability have been used to derive group tests that adapt to updated estimates of the nuisance parameter (14).
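Returning to the Beta-Blocker Heart Attack Trial numbers reported in Section 5, the Brownian-motion conditional power at the null drift can be reproduced with a short sketch (illustrative only; the death counts and the interim z-statistic of 2.82 are taken from the text above, and a one-sided alpha of 0.025 is assumed so that the cutoff is z = 1.96):

```python
from math import sqrt
from scipy.stats import norm

def brownian_conditional_power(z_n, t, mu=0.0, alpha=0.025):
    """One-sided conditional power p_t(mu) in the Brownian motion
    formulation, given the interim z-statistic z_n at information time t."""
    b_t = z_n * sqrt(t)
    z_alpha = norm.ppf(1 - alpha)
    return 1 - norm.cdf((z_alpha - b_t - (1 - t) * mu) / sqrt(1 - t))

# Sixth interim analysis: 318 deaths observed; 80 or 90 more expected.
print(round(brownian_conditional_power(2.82, 318 / 398), 2))  # about 0.89
print(round(brownian_conditional_power(2.82, 318 / 408), 2))  # about 0.87
```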
REFERENCES

1. K. Lan, R. Simon, and M. Halperin, Stochastically curtailed tests in long-term clinical trials. Sequent. Anal. 1982; 1: 207–219. 2. J. Whitehead, A unified theory for sequential clinical trials. Stat. Med. 1999: 2271–2286. 3. C. Jennison and B. W. Turnbull, Group Sequential Methods with Applications to Clinical Trials. New York: Chapman & Hall/CRC, 2000. 4. K. K. Lan and D. M. Zucker, Sequential monitoring of clinical trials: the role of information and Brownian motion. Stat. Med. 1993; 12: 753–765. 5. J. Whitehead, Sequential methods based on the boundaries approach for the clinical comparison of survival times. Stat. Med. 1994; 13: 1357–1368. 6. M. H. Gail, D. L. DeMets, and E. V. Slud, Simulation studies on increments of the two-sample logrank score test for survival time data, with application to group sequential boundaries. In: J. Crowley and R. A. Johnson (eds.), Survival Analysis. Hayward, CA: Institute of Mathematical Statistics, 1982, pp. 287–301. 7. A. A. Tsiatis, Repeated significance testing for a general class of statistics used in censored survival analysis. J. Am. Stat. Assoc. 1982; 77: 855–861. 8. M. Halperin, K. K. Lan, J. H. Ware, N. J. Johnson, and D. L. DeMets, An aid to data monitoring in long-term clinical trials. Control. Clin. Trials 1982; 3: 311–323. 9. D. Y. Lin, Q. Yao, and Z. Ying, A general theory on stochastic curtailment for censored survival data. J. Am. Stat. Assoc. 1999; 94: 510–521.
10. M. Halperin, K. K. Lan, E. C. Wright, and M. A. Foulkes, Stochastic curtailing for comparison of slopes in longitudinal studies. Control. Clin. Trials 1987; 8: 315–326. 11. C. Jennison and B. W. Turnbull, Group-sequential analysis incorporating covariate information. J. Am. Stat. Assoc. 1997; 92: 1330–1341. 12. P. K. Andersen, Conditional power calculations as an aid in the decision whether to continue a clinical trial. Control. Clin. Trials 1986; 8: 67–74. 13. M. A. Proschan and S. A. Hunsberger, Designed extension of studies based on conditional power. Biometrics 1995; 51: 1315–1324. 14. X. Xiong, M. Tan, and J. Boyett, Sequential conditional probability ratio tests for normalized test statistic on information time. Biometrics 2003; 59: 624–631. 15. X. Xiong, M. Tan, and J. Boyett, A sequential procedure for monitoring clinical trials against historical controls. Stat. Med. 2007; 26: 1497–1511. 16. J. Herson, Predictive probability early termination plans for phase II clinical trials. Biometrics 1979; 35: 775–783. 17. S. C. Choi, P. J. Smith, and D. P. Becker, Early decision in clinical trials when the treatment differences are small. Experience of a controlled trial in head trauma. Control. Clin. Trials 1985; 6: 280–288. 18. D. J. Spiegelhalter, L. S. Freedman, and P. R. Blackburn, Monitoring clinical trials: Conditional or predictive power? Control. Clin. Trials 1986; 7: 8–17. 19. X. Xiong, A class of sequential conditional probability ratio tests. J. Am. Stat. Assoc. 1995; 90: 1463–1473. 20. X. Xiong, M. Tan, and M. H. Kutner, Computational methods for evaluating sequential tests and post-test estimation via the sufficiency principle. Statist. Sin. 2002; 12: 1027–1041. 21. M. Tan, X. Xiong, and M. H. Kutner, Clinical trial designs based on sequential conditional probability ratio tests and reverse stochastic curtailing. Biometrics 1998; 54: 682–695. 22. M. Tan and X. Xiong, Continuous and group sequential conditional probability ratio tests for phase II clinical trials. Stat. Med. 1996; 15: 2037–2051. 23. C. Jennison, Bootstrap tests and confidence intervals for a hazard ratio when the number of observed failures is small, with applications to group sequential survival studies. In: C. Page and R. LePage (eds.), Computing Science and Statistics: Twenty-second Symposium on
the Interface. Berlin: Springer-Verlag, 1992, pp. 89–97. 24. D. L. DeMets and K. K. Lan, Interim analysis: the alpha spending function approach. Stat. Med. 1994; 13: 1341–1352; discussion 1353–1346. 25. D. L. DeMets, R. Hardy, L. M. Friedman, and K. K. Lan, Statistical aspects of early termination in the beta-blocker heart attack trial. Control. Clin. Trials 1984; 5: 362–372. 26. K. K. Lan and D. L. DeMets, Changing frequency of interim analysis in sequential monitoring. Biometrics 1989; 45: 1018–1020. 27. B. Freidlin, E. L. Korn, and S. L. George, Data monitoring committees and interim monitoring guidelines. Control. Clin. Trials 1999; 20: 395–407. 28. B. K. Moser and S. L. George, A general formulation for a one-sided group sequential design. Clin. Trials 2005; 2: 519.
CROSS-REFERENCES
Group Sequential Designs
Interim Analysis
CONFIDENCE INTERVAL
JENŐ REICZIGEL
Szent István University, Faculty of Veterinary Science, Budapest, Hungary

Figure 1. Point estimate θ̂ and confidence interval (L, U) for a parameter θ (prevalence, mean, difference, relative risk, etc.). θ̂ represents the value experienced in the sample, whereas (L, U) defines a "reasonable range" or "likely range" for the unknown true (population) value θ.
A confidence interval (CI) or interval estimate for an unknown parameter θ is a ‘‘from-to’’ range calculated from the sample so that it contains θ with a high—usually 90%, 95%, or 99%—probability (Fig. 1). As opposed to a point estimate θˆ , which is a single number calculated from the sample, the interval (L,U) gives a more expressive and easy-tounderstand picture about the uncertainty of the estimate. L and U are also called confidence limits for the parameter θ . Some examples of how CIs are usually communicated in papers are as follows:
• "The estimated prevalence rate of the disease in the study population was 8.3% (95% CI 5.8% to 11.7%)."
• "Blood cholesterol fell by 0.76 mmol/l in group A vs 0.32 mmol/l in group B (difference 0.44 mmol/l, 95% CI 0.16 to 0.73)."
• "Acute infections occurred in 25 of the patients on active treatment and in 36 of the patients on placebo (relative risk 0.69, 95% CI 0.55 to 0.87)."
• "Treatment failure was eight times more likely in the placebo than in the antibiotic group (13.1% vs 1.6%, odds ratio 8.2, 95% CI 1.9 to 34.7)."

CIs can be given for unknown population parameters like the mean, variance, median, or proportion; for model parameters like a regression coefficient, odds ratio, or correlation coefficient; or for functions of the above-mentioned parameters like, for example, the variance/mean ratio, the difference of two proportions or two means, and so on (1). In clinical studies, CIs are typically given for the mean or median treatment effect, the proportion of cure, the difference in these outcome measures comparing two treatments, and so on. Practically for anything for which an estimate from the sample is meaningful at all, a CI is also meaningful. For curves like a regression line or a ROC curve, interval estimation results in a so-called confidence band (Fig. 2).

Figure 2. Confidence band for a regression line (a) and for a ROC curve (b).

Technically, a CI, being a "from-to" range calculated from the sample, consists of two functions of the sample values, L = L(x1, x2, . . . , xn) and U = U(x1, x2, . . . , xn), the lower and upper confidence limits, which satisfy P(L < θ < U) = 1 − α, where 1 − α is called the coverage probability or level of confidence, and α is called the error probability. (These probabilities may also depend on the true value of θ; see below for details.) For symmetry, CIs are usually constructed so that they have an error probability of α/2 on each side, that is, P(θ < L) = P(θ > U) = α/2 (Fig. 3a); these are called symmetric or equal-tailed CIs. In the asymmetric case, the sum of the error probabilities is α, but they do not need to be equal (Fig. 3b). One-sided CIs are intervals of the form (−∞, U) or (L, ∞) with the property P(θ < U) = 1 − α or P(θ > L) = 1 − α (Fig. 3c,d).

Figure 3. Symmetric (a), asymmetric (b), and one-sided (c, d) confidence intervals.

It is easy to verify that given two 1 − α level one-sided CIs (−∞, U) and (L, ∞) for a certain parameter, (L, U) forms a 1 − 2α level symmetric CI for the same parameter. For example, two 95% one-sided intervals define a 90% two-sided one. Similarly, if (L, U) is a 1 − α level equal-tailed CI, then (−∞, U) as well as (L, ∞) are 1 − α/2 level one-sided CIs for the same parameter. That is, a 95% level two-sided symmetric interval can be converted into two 97.5% one-sided ones. In the frequentist (as opposed to Bayesian) view, the parameter θ is assumed to be a fixed
value, whereas L and U, as they depend on the sample, are random variables. Therefore, the statement P(L < θ < U) = 1 − α can only be interpreted in relation to drawing samples repeatedly and recalculating L and U each time. A small α corresponds to a low proportion of failures in repeated application of the method (i.e., it indicates that the method is good enough and makes mistakes as seldom as with a rate of α). However, considering one particular sample and the CI calculated from that sample, say when L = 12.1 and U = 13.5, the probability P(12.1 < θ < 13.5) is meaningless. Clearly, it is either true or false, given that θ is also a fixed value. To emphasize this fact, some authors suggest that in reporting the results one should avoid saying "the unknown parameter lies between 12.1 and 13.5 with 95% probability"; one should rather say "with 95% confidence," or the CI should be reported as shown in the examples above. In the Bayesian model (in which even θ is regarded as a random variable), the corresponding interval estimates are the credible interval (i.e., the range of parameter values with the highest posterior probability) and Fisher's fiducial interval. However, the calculation as well as the interpretation of these notions are different from each other, and quite different from that of a frequentist CI. If there are n unknown parameters to estimate (i.e., a parameter vector θ ∈ R^n), one may want to construct a simultaneous confidence region (or confidence set) S ⊆ R^n with the property P(θ ∈ S) = 1 − α, rather than to give a CI for each parameter separately. Such questions come up most naturally in multiparameter models, for example in regression models. A basic example of a two-dimensional confidence region develops when one has
a sample from a normal distribution with unknown µ and σ², and one would like to construct a joint confidence set for the parameter pair (µ, σ²), rather than two separate confidence intervals for µ and σ² (2). Such a confidence set (Fig. 4) can be used to derive CIs for functions of the parameters, for example, for the variance/mean ratio.

Figure 4. Joint confidence region (gray shaded area) for the mean and variance of a normal variable (µ_obs and σ²_obs denote the sample mean and variance).
In nonparametric inference, as no "parameters of the distribution" exist (in this case, neither the mean nor the variance is regarded as a parameter of the distribution in the strict statistical sense), functionals of the distribution play the role of the above "unknown parameters." For the nonstatistician user, however, this makes no practical difference. CIs are closely related to statistical tests. Given a (1 − α)-level two-sided CI for a parameter θ, it is easy to convert it to a test of H0: θ = θ0 versus H1: θ ≠ θ0, simply by rejecting H0 if θ0 is not contained in the CI, that is, taking the CI as the acceptance region of the test (for a one-tailed test, a one-sided CI is needed). The resulting test has a Type I error rate of α. It works in the opposite direction as well; that is, tests can serve as the basis for CI construction (see the test inversion method below).

1 THE ROLE OF CONFIDENCE INTERVALS IN CLINICAL TRIALS

In the last few decades, there has been a tendency toward wider use of CIs in reporting the results of clinical trials as well as other studies (3–7). The main advantage of reporting CIs instead of (or even better, together with) P-values lies in better quantification of the effect size (the confidence limits are easy to compare with the clinically relevant effect size) and in direct indication of the uncertainty of the results (the wider the CI, the greater the uncertainty). Confidence intervals are particularly important in equivalence trials. To verify the equivalence of a new experimental treatment to the standard one, it must be shown that the difference between the treatment effects Ex and Es remains below a prespecified limit T defining the range of equivalence. Here, one can take either the absolute or the relative difference [i.e., equivalence can be defined either as −T < Ex − Es < T (absolute difference) or as 1 − T < Ex/Es < 1 + T (relative difference)]. For example, using the relative difference with a range of equivalence of ±10% can be formally written as 0.90 < Ex/Es < 1.10. Then, a common method of analysis is to construct a CI for the difference Ex − Es or Ex/Es based
on the observed sample, and to accept treatment equivalence if this CI is fully contained in the interval (−T, T) or (1 − T, 1 + T). The rationale behind this approach is that one would like to know whether the unknown true difference (rather than the difference between the particular observed samples) lies in the range of equivalence.

Figure 5. Possible locations of a confidence interval for the difference of treatment effects in an equivalence trial (if the true difference lies between the limits −T and T, treatments are regarded as equivalent).

According to this method, the interpretation of the CIs in Fig. 5 is as follows.

a. Good evidence for equivalence.
b. The new treatment is less effective than the standard one.
c. The new treatment is noninferior (i.e., it may be either more effective or equivalent).
d. The trial failed to have enough power to allow any definite conclusion concerning the relation between treatment effects. A larger trial is needed.

Confidence intervals are also helpful in the interpretation of the results of superiority trials. Let D denote the smallest difference between the experimental and the standard treatment that is still clinically relevant. Then, the CI for the difference between treatment effects (Ex − Es) (absolute difference) allows drawing the following conclusions (Fig. 6).
a. Experimental treatment is definitely worse.
b. Experimental treatment is either worse or the difference is clinically irrelevant.
c. No clear conclusion: A larger trial is needed.
d. Experimental treatment is better, but the difference is clinically irrelevant.
e. Experimental treatment is better: The difference may or may not be clinically relevant.
f. Experimental treatment is definitely better: The difference is clinically relevant.

Figure 6. Possible locations of a confidence interval for the difference of treatment effects in a superiority trial (a difference greater than D is regarded as clinically relevant).

2 PROPERTIES AND EVALUATION OF PERFORMANCE

The coverage probability of a CI for a parameter is defined as the probability that the interval contains the parameter. However, this probability may depend on the true value of the parameter as well as on some nuisance parameters. A good method produces a CI with coverage probabilities as close as possible to the desired (nominal) level irrespective of the values of the parameters: Ideally, the coverage probability should always be equal to the nominal level. Because it is impossible in many cases (e.g., because of the discreteness of the distribution, see Fig. 7), global measures were introduced to evaluate the performance of the various methods. The traditional confidence coefficient, also called minimum coverage, is defined as the minimum (or infimum), whereas the mean coverage (8) is the average, of the coverage probability over the whole parameter range (e.g., for the binomial parameter, over the [0,1] range). An important difference exists in the interpretation of these measures. The confidence coefficient can be interpreted as a lower bound of the coverage rate, guaranteed by the procedure. Mean coverage, however, being based on averaging over the whole parameter set, does not have any meaningful interpretation in relation to a single problem, in which the parameter is assumed to be a certain fixed value. (Note that, in the Bayesian model, averaging could be naturally made with respect to the prior distribution of the parameter.) A CI construction method is said to be conservative if its minimum coverage is never less than the nominal level and anticonservative otherwise. Although conservatism is a desirable property as compared with anticonservatism, it is unsatisfactory if a method is too conservative (say, if at the nominal level of 95% the method produces a CI with actual coverage of 98%), because it results in unnecessarily wide intervals. A way to control conservatism is to determine the actual coverage and to adjust it if necessary. Reiczigel (9) describes a computer-intensive level-adjustment procedure varying the nominal level iteratively until the confidence coefficient gets close enough to the desired level. The procedure can be combined with any reasonable CI construction method, and it can also be applied for adjusting the mean coverage.

The notion of "exactness" of a CI is associated with various meanings in the literature. In one sense, exactness means that the CI is based on the exact probability distribution of the given test statistic. In another sense, exactness means the same as conservatism above (i.e., that the coverage probability reaches at least the nominal level for any parameter value). The third sense is that the coverage is strictly equal to the nominal level
for all parameter values. Computer simulations by several investigators show that exact CIs meant in the first two senses may perform rather poorly (8, 10), because a method may be too conservative even if its minimum coverage is equal to the nominal level: for some parameter values the coverage probability may well exceed its minimum (as is the case in Fig. 7).

Figure 7. Actual coverage of the "exact" 95% Clopper-Pearson CI for the binomial parameter p. Although the minimum coverage is 95%, for most parameter values the coverage probability is well above 95%.

Therefore, it is a little misleading to designate the property that the minimum coverage equals the nominal level as exactness. In fact, a further necessary condition exists for a method to perform well: Its coverage probability should not show much fluctuation. The so-called asymptotic or "large-sample" CIs are based on an approximate probability distribution of the test statistic, and therefore have only approximately the stated confidence level, and only for large samples. Whereas an exact CI (meant in the third sense above) is valid for any sample size, an asymptotic CI is valid only for a sufficiently large sample size, and it depends on the procedure as well as on the true value of the parameter what "sufficiently large" means. Such approximations are most often made by the normal distribution, based on the Central Limit Theorem (see below).
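As an illustration of how coverage curves such as the one in Fig. 7 are obtained (a sketch, not taken from the article; it assumes the standard Clopper-Pearson construction based on beta quantiles), the actual coverage of the nominal 95% interval can be computed exactly for any n and true p by summing the binomial probabilities of the outcomes whose interval contains p:

```python
from scipy.stats import beta, binom

def clopper_pearson(k, n, conf=0.95):
    """Clopper-Pearson ("exact") two-sided CI for a binomial proportion."""
    a = (1 - conf) / 2
    lo = 0.0 if k == 0 else beta.ppf(a, k, n - k + 1)
    hi = 1.0 if k == n else beta.ppf(1 - a, k + 1, n - k)
    return lo, hi

def actual_coverage(p, n, conf=0.95):
    """Exact coverage probability at true value p: sum the binomial
    probabilities of all outcomes k whose interval contains p."""
    total = 0.0
    for k in range(n + 1):
        lo, hi = clopper_pearson(k, n, conf)
        if lo <= p <= hi:
            total += binom.pmf(k, n, p)
    return total

for p in (0.05, 0.1, 0.3, 0.5):
    print(p, round(actual_coverage(p, n=50), 3))  # typically above 0.95
```

Running the loop over a fine grid of p values reproduces the sawtooth shape of the coverage curve and its conservatism.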
3 METHODS FOR CONSTRUCTING CONFIDENCE INTERVALS

3.1 Normal Approximation

In the case of consistency and asymptotic normality of a point estimator (in most cases, when estimation is made by the maximum likelihood method), an asymptotic CI can be obtained as the point estimate plus and minus the asymptotic standard error of the estimate multiplied by a critical value from the normal distribution. This is demonstrated by two examples:

1. Let $\bar{x}$ and s denote the mean and standard deviation of a variable, estimated from a sample of n. Then, a 95% asymptotic CI for the true mean can be obtained as $\bar{x} \pm 1.96\, s/\sqrt{n}$.

2. Let ln(OR) denote the logarithm of the odds ratio estimated from a two-by-two table, that is, ln(OR) = ln(ad) − ln(bc), where a, b, c, d denote the table cell counts. It can be proven that the asymptotic standard error of ln(OR) is $s = \sqrt{1/a + 1/b + 1/c + 1/d}$. From this, a 95% asymptotic CI for the true ln(OR) is obtained as ln(OR) ± 1.96 s.

The actual coverage of such a CI approaches the nominal level as the sample size tends to infinity. Recommendations exist (for each case separately!) for what sample size makes the approximation acceptable.
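A brief numerical sketch of the second example, using hypothetical cell counts chosen only to make the computation concrete:

```python
from math import log, sqrt, exp

def log_or_ci(a, b, c, d, z=1.96):
    """95% asymptotic CI for the odds ratio from a 2x2 table with
    cell counts a, b (first row) and c, d (second row); OR = ad/(bc)."""
    ln_or = log(a * d) - log(b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo, hi = ln_or - z * se, ln_or + z * se
    return exp(lo), exp(hi)          # back-transform to the OR scale

# Hypothetical table: 20/80 events vs. non-events in the treated group,
# 40/60 in the control group.
print(log_or_ci(20, 80, 40, 60))     # roughly (0.20, 0.71)
```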
3.2 Test Inversion
Test inversion can be applied if a hypothesis test exists for the parameter θ that one wants to estimate. Suppose one wants to invert a test of H0 : θ = θ0 based on a test statistic t, to get a 95% CI for θ . The basic idea is that a 95% confidence set should consist of
all such θ0 values for which H0: θ = θ0 is not rejected at the 95% level. That is, 95% acceptance regions (t1, t2) of H0: θ = θ0 should be determined for all possible θ0, and the smallest and largest θ0 for which the acceptance region contains the observed tobs should be taken (Fig. 8).

Figure 8. 95% acceptance region (t1, t2) of the test H0: θ = θ0 (thick vertical line) and confidence interval (L, U) belonging to the observed value tobs of the test statistic (thick horizontal line). L and U are obtained as the smallest and largest values of θ for which the 95% acceptance region contains tobs.

Note that, for some tests, the endpoints of the acceptance regions do not increase monotonically with θ. In such cases, the set of those θ0 values for which H0: θ = θ0 is not rejected at a given level may not form a proper interval, but may contain "holes." Of course, taking the smallest and largest θ0 always results in a proper interval. Properties of the CI are easy to derive from the properties of the acceptance regions. For example, an exact test (i.e., one having exact acceptance regions for all θ0) results in an exact CI; if the acceptance regions of one test are contained in those of another test for all θ0, then the CIs are also contained in the intervals obtained from the other test, and so on. CIs obtained by inverting a likelihood ratio test, as well as those based on the so-called empirical likelihood, are reported to have good properties (11).

3.3 Bootstrap Resampling

Bootstrap CIs are applied either in the nonparametric case (i.e., if one has no parametric model for the population or process that generated the observed sample) or if the sampling distribution of an estimator is difficult or impossible to determine analytically (e.g., for the correlation coefficient). When bootstrapping from the sample, the observed sample itself is used as an estimate
of the underlying population. A large number of samples are taken from the original (observed) sample by sampling with replacement, and these bootstrap samples are used to obtain an approximation of the sampling distribution of the estimate (to determine bias, standard error, tail probabilities, and so on) or to construct bootstrap CIs (12). Although sometimes this determination can be made analytically, typically it is carried out by computer simulation. A variety of methods exist for constructing bootstrap CIs, and many improvements to these methods also exist, so one must consult the latest books and articles before implementing a method for a specific problem.

REFERENCES

1. M. J. Gardner and D. G. Altman, Statistics with Confidence: Confidence Intervals and Statistical Guidelines. London: BMJ Books, 1989. 2. B. C. Arnold and M. Shavelle, Joint confidence sets for the mean and variance of a normal distribution. Amer. Stat. 1998; 52: 133–140. 3. Uniform Requirements for Manuscripts Submitted to Biomedical Journals. JAMA 1993; 269: 2282–2286. 4. Uniform Requirements for Manuscripts Submitted to Biomedical Journals. (2003). (online). Available: http://www.icmje.org. 5. M. J. Gardner and D. G. Altman, Confidence intervals rather than p-values: estimation rather than hypothesis testing. BMJ 1986; 292: 746–750. 6. L. Rózsa, J. Reiczigel, and G. Majoros, Quantifying parasites in samples of hosts. J. Parasitol. 2000; 86: 228–232.
7. J. A. C. Sterne and G. D. Smith, Sifting the evidence: what's wrong with significance tests? BMJ 2001; 322: 226–231. 8. R. G. Newcombe, Two-sided confidence intervals for the single proportion: comparison of seven methods. Stat. Med. 1998; 17: 857–872. 9. J. Reiczigel, Confidence intervals for the binomial parameter: some new considerations. Stat. Med. 2003; 22: 611–621. 10. A. Agresti and B. A. Coull, Approximate is better than "exact" for interval estimation of binomial proportions. Amer. Stat. 1998; 52: 119–126. 11. A. B. Owen, Empirical Likelihood. London: Chapman and Hall/CRC, 2001. 12. B. Efron and R. Tibshirani, An Introduction to the Bootstrap. New York: Chapman & Hall, 1993.
CONFIDENCE INTERVALS AND REGIONS
Confidence intervals are used for interval estimation. Whether an interval estimate is required depends upon the reason for the statistical analysis. Consider the analysis of measurements of the compressive strength of test cylinders made from a batch of concrete. If we were concerned with whether the mean strength of the batch exceeds some particular value, our problem would be one of hypothesis testing*. Our conclusion might be to accept or to reject the hypothesis, perhaps with an associated degree of confidence. If a simple indication of the strength likely to be achieved under the particular conditions of test is required, the observed mean strength might be quoted as an estimate of the true mean strength. This is called point estimation*. Interval estimation is the quoting of bounds between which it is likely (in some sense) that the real mean strength lies. This is appropriate when it is desired to give some indication of the accuracy with which the parameter is estimated. A large number of statistical problems may be included in the classes of hypothesis testing, point estimation, or interval estimation. It must be pointed out that there are several schools of thought concerning statistical inference. To quote confidence intervals is the interval estimation method advocated by the most widely accepted of these schools, variously referred to as the Neyman-Pearson*, Neyman-Pearson-Wald, frequentist, or classical school. There are other ways of obtaining interval estimates and we will refer to them later. (See also BAYESIAN INFERENCE, FIDUCIAL INFERENCE, LIKELIHOOD, STRUCTURAL INFERENCE.)

BASIC IDEA OF A CONFIDENCE INTERVAL

The term "confidence interval" has an intuitive meaning as well as a technical meaning. It is natural to expect it to mean "an interval in which one may be confident that a parameter lies." Its precise technical meaning differs substantially from this (see Jones [13], Cox [7], and Dempster [9]), but the intuitive idea is not entirely misleading. An example should help to explain the technical meaning.

Example 1. Suppose that some quantity is measured using a standard testing procedure. Suppose that the quantity has a well-defined true value µ, but that the measurement is subject to a normally distributed error that has known variance σ². Let X denote the random variable that is the result of a single measurement and let x be a particular value for X. Now X is normally distributed with mean µ and variance σ². Using the properties of the normal distribution we can make probability statements about X; e.g.,

$\Pr[\mu - 1.96\sigma \le X \le \mu + 1.96\sigma] = 0.95. \quad (1)$

We could rewrite this as

$\Pr[X - 1.96\sigma \le \mu \le X + 1.96\sigma] = 0.95 \quad (2)$

or

$\Pr[\mu \in (X - 1.96\sigma,\ X + 1.96\sigma)] = 0.95. \quad (3)$

Although µ may appear to be the subject of statements (2) and (3), the probability distribution referred to is that of X, as was more obvious in statement (1). If X is observed to be x, we say that we have 95% confidence that $x - 1.96\sigma \le \mu \le x + 1.96\sigma$, or say that $(x - 1.96\sigma,\ x + 1.96\sigma)$ is a 95% confidence interval for µ. No probability statement is made about the proposition

$x - 1.96\sigma \le \mu \le x + 1.96\sigma \quad (4)$

involving the observed value, x, since neither x nor µ has a probability distribution. The proposition (4) will be either true or false, but we do not know which. If confidence intervals with confidence coefficient p were computed on a large number of occasions, then, in the long run, the fraction p of these confidence intervals would contain the true parameter value. (This is provided that the occasions are independent and that there is no selection of cases.)
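The long-run interpretation in the last paragraph is easy to check by simulation. The following sketch (illustrative only; in practice the true µ and σ are of course unknown) repeats Example 1 many times and counts how often the interval (X − 1.96σ, X + 1.96σ) covers µ:

```python
import random

def coverage_fraction(mu=10.0, sigma=2.0, n_repeats=100_000, seed=1):
    """Fraction of replications in which (X - 1.96*sigma, X + 1.96*sigma)
    contains the true mean mu, for a single N(mu, sigma^2) measurement X."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_repeats):
        x = rng.gauss(mu, sigma)
        if x - 1.96 * sigma < mu < x + 1.96 * sigma:
            hits += 1
    return hits / n_repeats

print(coverage_fraction())   # close to 0.95
```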
CONFIDENCE INTERVALS BASED ON A SINGLE STATISTIC

Many confidence intervals can be discussed in terms of a one-dimensional parameter θ and a one-dimensional statistic T(X), which depends upon a vector of observations X. A more general formulation will be given below under the heading ‘‘Confidence Regions.’’ Provided that T(X) is a continuous random variable, given probabilities α1 and α2, it is possible to find T1(θ) and T2(θ) such that

Pr[T(X) ≤ T1(θ) | θ] = α1   (5)

and

Pr[T(X) ≥ T2(θ) | θ] = α2.   (6)

In other words, T1(θ) and T2(θ) are as shown in Fig. 1. Another diagram that can be used to illustrate the functions T1(θ) and T2(θ) is Fig. 2. For every particular value of θ the probability that T lies between T1(θ) and T2(θ) is 1 − α1 − α2. The region between the curves T = T1(θ) and T = T2(θ) is referred to as a confidence belt. In terms of Fig. 2, the basic idea of confidence intervals is to express confidence 1 − α1 − α2 that the point (θ, T) lies in the confidence belt after T has been observed. If T1 and T2 are well-behaved functions, they will have inverse functions θ2 and θ1, as shown in the figure, and the three propositions ‘‘T1(θ) ≤ T(X) ≤ T2(θ),’’ ‘‘θ lies in the confidence belt,’’ and ‘‘θ1(T) ≤ θ ≤ θ2(T)’’ will be equivalent. Thus (θ1(T), θ2(T)) is a (1 − α1 − α2) confidence interval for θ.

Example 2. Consider eight observations from a normal distribution with known mean µ and unknown variance σ². Take θ = σ², X = (X1, . . . , X8), and T(X) = Σ_{i=1}^{8} (Xi − µ)². From the fact that T/θ has a χ² distribution with 8 degrees of freedom, we know that Pr[T/θ ≤ 2.18] = 0.025 and that Pr[T/θ ≥ 17.53] = 0.025. Thus we take T1(θ) = 2.18θ, T2(θ) = 17.53θ and calculate θ1(T) = 0.057T, θ2(T) = 0.46T. The interval (0.057T, 0.46T) may be quoted as a 95% confidence interval for σ².

This confidence interval may be described as a central confidence interval because α1 = α2 (= 0.025). Noncentral confidence intervals are seldom quoted except when we are primarily concerned about large values for the parameter or about small values. In such cases it is common to quote only a single confidence limit. In Example 2, Pr[T/θ < 2.73] = 0.05 or, equivalently, Pr[θ > 0.366T] = 0.05. Thus (0, 0.366T) is a confidence interval for θ = σ² at confidence level 0.95.

DISCRETE DISTRIBUTIONS

When the statistic T is a discrete∗ random variable it is generally not possible to find functions T1 and T2 such that (5) and (6) hold precisely. Instead, we ask that Pr[T(X) ≤ T1(θ) | θ] be as large as possible but not greater than α1 and that Pr[T(X) ≥ T2(θ) | θ] be as large as possible but not greater than α2. The functions T1 and T2 define a confidence belt which generally has a staircase-shaped perimeter. Keeping [14, p. 98] and Kendall and Stuart [15, p. 105] give examples.

Example 3. Consider the problem of finding a 90% confidence interval for the probability, p, of success on each trial in a sequence of independent trials, if two successes are observed in 12 trials. Some calculation yields that

Pr[number of successes ≤ 2 | p] = (1 − p)^12 + 12p(1 − p)^11 + 66p^2(1 − p)^10,

which equals 0.05 if p = 0.438 and is less than 0.05 if p > 0.438, and

Pr[number of successes ≥ 2 | p] = 1 − (1 − p)^12 − 12p(1 − p)^11,

which equals 0.05 if p = 0.03046 and is less than 0.05 if p < 0.03046.
Figure 1. Illustration of the meanings of T1 and T2 for fixed θ.
Figure 2. Confidence limits for θ based on the statistic T.
Thus the required 90% confidence interval is (0.03046, 0.438). (Although this method of construction does not make the probability of including the true value of p to be equal to 90%, it does ensure that this probability is not less than 90%.)

NUISANCE PARAMETERS∗ AND SIMILAR REGIONS∗

Under some circumstances it is easy to find confidence intervals despite the presence of a nuisance parameter. Consider the following example.

Example 4. Suppose that X1, X2, . . . , Xn are normally distributed with mean µ and variance σ², both unknown. Let X̄ and s denote the sample mean and sample standard deviation. Now (X̄ − µ)√n/s has a t-distribution with n − 1 degrees of freedom no matter what the value of σ². Therefore, letting t denote the 1 − ½α quantile∗ of that t-distribution,

Pr[µ − ts/√n ≤ X̄ ≤ µ + ts/√n] = 1 − α;

or, equivalently,

Pr[X̄ − ts/√n ≤ µ ≤ X̄ + ts/√n] = 1 − α.

The interval (X̄ − ts/√n, X̄ + ts/√n) is a confidence interval for µ at confidence level 1 − α. The parameter σ² is described as a nuisance parameter because we are not interested in estimating it, but it does affect the probability distribution of the observations. The regions of the sample space of the form (X̄ − as, X̄ + as) are described as similar regions because the probability of each of them is independent of the parameters. Confidence regions are generally based on similar regions when they exist. However, they often do not exist.

CONFIDENCE REGIONS

Confidence regions are a generalization of confidence intervals in which the confidence set is not necessarily an interval. Let θ be a (possibly multidimensional) parameter and let Θ denote the set of possible values for θ.
Let X denote a random variable, generally vector-valued. A function I that gives a subset of Θ for a value x of X is said to be a confidence set estimator or a confidence region for θ with confidence coefficient p if

Pr[θ ∈ I(X)] = p.   (7)

For any such confidence region, to reject the hypothesis θ = θ0 whenever θ0 is not in I(X) is a Neyman—Pearson hypothesis test which has probability 1 − p of wrongly rejecting the hypothesis θ = θ0.

Choosing between Possible Confidence Regions

There may be many functions I such that Pr[θ ∈ I(X)|θ] = p for every θ. How should we choose which to use? Within the formulation where confidence intervals are based on a single statistic T, the problem is essentially that of choosing a statistic on which to base the confidence intervals. Perhaps confidence intervals based on the sample median∗ would be better in some ways than confidence intervals based on the sample mean. A number of criteria have been advanced to help decide between alternative confidence regions. We discuss some of them briefly. Standard texts on theoretical statistics may be consulted for further details.

Confidence intervals should be based on sufficient statistics (see SUFFICIENT STATISTICS) and should be found conditional on the value of ancillary statistics (see ANCILLARY STATISTICS —I). A confidence region I is said to be unbiased if Pr[θ1 ∈ I(X)|θ2] ≤ p for all θ1, θ2 ∈ Θ. This means that wrong values for the parameter are not more likely to be included in the region I(X) than the correct values. The region I is said to be shorter, more accurate, or more selective than the region J if Pr[θ1 ∈ I(X)|θ2] ≤ Pr[θ1 ∈ J(X)|θ2] for all θ1, θ2 ∈ Θ. Intuitively, this means that incorrect values for θ are more likely to be
in J than in I. More selective regions correspond to more powerful tests of hypotheses and unbiased regions correspond to unbiased tests when parametric hypotheses are rejected whenever the parameter does not lie in the confidence region. The term ‘‘more selective’’ is preferred to ‘‘shorter’’ (which stems from Neyman [19]) to avoid confusion with the expected length of confidence intervals. For complex problems it may be difficult or impossible to apply some of these and other criteria. Sometimes it may only be possible to show that a particular confidence region is optimal in some sense within a particular class of regions, such as those invariant in some way. Different criteria sometimes suggest different regions. There is no completely general way of deciding which confidence interval to use. CRITICISMS OF THE THEORY OF CONFIDENCE INTERVALS There have been many arguments about the foundations of statistical inference, and there will probably be many more. Three (not independent) criticisms of the theory of confidence intervals are mentioned below. Note that they are criticisms of the frequentist school of thought, not merely of confidence intervals, which are the interval estimation technique used by that school. Likelihood Principle∗ The likelihood principle states that the ‘‘force’’ of an experiment should depend only upon the likelihood function, which is the probability density for the results obtained as a function of the unknown parameters. Many people find this principle compelling. Pratt [20] presents a persuasive defense of it in an entertaining way. Confidence interval theory violates the likelihood principle essentially because confidence intervals are concerned with the entire sample space. Coherence∗ It has been shown in several ways (e.g., Savage [23]), using various simple coherence conditions, that inference must be Bayesian if it is to be coherent. This means that every
Neyman confidence interval procedure that is not equivalent to a Bayesian procedure violates at least one of each set of coherence properties.

Conditional Properties

For a confidence region I such that Pr[θ ∈ I(X)] = α for all θ, if there is a subset C of the sample space and a positive number ε such that either

Pr[θ ∈ I(X)|X ∈ C] ≤ α − ε for all θ

or

Pr[θ ∈ I(X)|X ∈ C] ≥ α + ε for all θ,

then the set C is a relevant subset. The idea stems from Fisher's use of the term ‘‘recognizable subset’’ [10] and was formalized by Buehler [6]. Some people argue that the existence of a relevant subset implies that the confidence coefficient α is not an appropriate measure of confidence that θ ∈ I(x) when it happens that x belongs to the relevant subset. Consider the following quite artificial example, in which there are only two possible parameter values and four values for a random variable that is observed only once.

Example 5. Suppose that when θ = θ1, Pr[X = 1] = 0.9, Pr[X = 2] = 0.01, Pr[X = 3] = 0.05, and Pr[X = 4] = 0.04, whereas when θ = θ2, Pr[X = 1] = 0.02, Pr[X = 2] = 0.9, Pr[X = 3] = 0.03, and Pr[X = 4] = 0.05. The region defined by I(X) = {θ1} if X = 1 or X = 3, and I(X) = {θ2} if X = 2 or X = 4, is a confidence region for θ with confidence coefficient 0.95. However,

Pr[θ ∈ I(X)|X ∈ {1, 2}, θ] ≥ 90/92 for both θ values   (8)

and

Pr[θ ∈ I(X)|X ∈ {3, 4}, θ] ≤ 5/8 for both θ values.   (9)

Thus both {1, 2} and {3, 4} are relevant subsets.

Conditional properties of confidence intervals for practical problems are seldom as poor as for this example and those of Robinson [22]. Note particularly that the complements of relevant subsets are not necessarily relevant. However, such examples do illustrate the point made by Dempster [9], Hacking [11], and others that confidence coefficients are a good measure of uncertainty before the data have been seen, but may not be afterward.

LINKS WITH BAYESIAN INFERENCE∗
Bayesian confidence regions are derived by taking a prior distribution, usually considered to represent subjective belief about unknown parameters, modifying it using observed data and Bayes’ theorem∗ to obtain a posterior distribution, and quoting a region of the parameter space which has the required probability according to the posterior distribution∗ (see BAYESIAN INFERENCE). Bayesian procedures satisfy most coherence principles, satisfy the likelihood principle, and have good conditional properties. However, their conclusions depend upon the arbitrarily or subjectively chosen prior distribution, not merely upon the data, and this is widely considered to be undesirable. A clear distinction must be made between proper and improper Bayesian procedures. Proper Bayesian procedures are those based on prior distributions which are proper (i.e., are probability distributions) and which use bounded loss and utility functions should loss or utility functions be required. Other Bayesian procedures are called improper and sometimes lack some of the desirable properties of proper Bayesian procedures. However, they are often used because they are more tractable mathematically. The bases of the frequentist and Bayesian schools of thought are quite different. However, many statistical procedures that are widely used in practice are both confidence interval procedures and improper Bayesian procedures. (see Bartholomew [1], Jeffreys [12], de Groot [8], and Lindley [17]). Of direct interest to people using the confidence intervals that may also be derived as improper Bayesian interval estimates is that the alternative derivation is often sufficient to ensure that these confidence intervals have
most of the desirable properties of proper Bayesian procedures. An exception is that there are relevant subsets for the usual confidence intervals based on the t-distribution∗ for the unknown mean of a normal distribution when the variance is also unknown (see Brown [4]).

RELATIONSHIP TO FIDUCIAL INFERENCE∗

Fiducial inference generally proceeds by finding pivotal variables∗, functions of both random variables and parameters which have a distribution that is independent of all parameters, and assuming that those pivotal variables have the same distribution after the random variables have been observed. Given the observed values of the random variables, the distribution of the pivotal variables implies a distribution for the parameters, called a fiducial distribution. To the extent that fiducial inference and confidence intervals both involve asserting faith, after seeing the data, in statements for which probabilities could be quoted before seeing the data, they are similar theories. Bartlett [2] has argued that resolving the difference between these two theories is less important than resolving the difference between the pair of them and Bayesian methods. The clearest point of disagreement between them is that they support different solutions to the Behrens—Fisher problem∗.
CONDITIONAL CONFIDENCE REGIONS

Brownie and Kiefer [5] consider that one of the weaknesses of Neyman—Pearson∗ methodology is that classical procedures generally do not give a measure of conclusiveness which depends upon the data observed. Most other schools of thought do vary their measure of conclusiveness with the data. Kiefer [16] has developed a theory of conditional confidence which extends Neyman—Pearson methodology to allow both a data-dependent measure of conclusiveness and a frequency interpretation of this measure. The basic idea is most easily explained by an example of testing between two hypotheses. (See also CONDITIONAL INFERENCE.)

Example 6. Suppose that we wish to discriminate between two simple hypotheses, H0: X has a standard normal distribution and H1: X is distributed normally with mean 3 and unit variance, on the basis of a single observation. A standard Neyman—Pearson procedure would be to accept H0 (or fail to reject H0) if X ≤ 1.5 and to accept H1 if X > 1.5, and note that the probability of being correct is 0.933 as the measure of conclusiveness. That the same conclusiveness is expressed when X = 1.6 and when X = 3.6 seems unsatisfactory to Kiefer. Kiefer's idea is to partition the sample space and to evaluate the conclusiveness of a statistical procedure conditionally for each subset of the partition. Here the sample space might be partitioned into three sets: (−∞, 0] ∪ (3, ∞), (0, 1] ∪ (2, 3], and (1, 2]. Conditionally on X being in the various sets, the probabilities of the decision to accept H0 or H1 being correct are 0.9973, 0.951, and 0.676. These could be considered to indicate ‘‘quite conclusive,’’ ‘‘reasonably conclusive,’’ and ‘‘slight’’ evidence, respectively. The article and discussion of Kiefer [16] refers to most other work relevant to conditional confidence regions. Most research has addressed the problem of which partitions of the sample space to use. Until the theory is further developed, it is difficult to see whether it will escape from the known weaknesses of Neyman—Pearson inference.

CONFIDENCE INTERVALS IN PRACTICAL STATISTICS

Confidence intervals are widely used in practice, although not as widely supported by people interested in the foundations of statistics. One reason for this dominance is that the most readily available statistical computer programs are based on the methods of the Neyman—Pearson school. Another reason is that many common confidence intervals (those based on normal, t, and binomial distributions) may also be derived as improper Bayesian procedures and do not suffer from most of the possible weaknesses of confidence intervals. These common procedures have some robustness∗ with respect to the vagaries of inference theory and may therefore be used without worrying very much
about the theory behind a particular derivation of them. Furthermore, it is fairly safe to use the intuitive notion of confidence rather than the restricted technical notion in such cases. When interpreting confidence intervals for several comparable parameters it should be noted that for two confidence intervals to overlap does not imply that the confidence interval for the difference between the two parameters would include the point zero. Also note that comparing more than two parameters at a time requires special theory (see MULTIPLE COMPARISONS —I). REFERENCES 1. Bartholomew, D. J. (1965). Biometrika, 52, 19–35. 2. Bartlett, M. S. (1965). J. Amer. Statist. Ass., 60, 395–409. 3. Birnbaum, A. (1962). J. Amer. Statist. Ass., 57, 269–326. (Very difficult to read.) 4. Brown, L. D. (1967). Ann. Math. Statist., 38, 1068–1071. 5. Brownie, C. and Kiefer, J. (1977). Commun. Statist. A—Theory and Methods, 6, 691–751. 6. Buehler, R. J. (1959). Ann. Math. Statist., 30, 845–863. (Fundamental reference on conditional properties of statistical procedures.) 7. Cox, D. R. (1958). Ann. Math. Statist., 29, 357–372. 8. de Groot, M. H. (1973). J. Amer. Statist. Ass., 68, 966–969. 9. Dempster, A. P. (1964). J. Amer. Statist. Ass., 59, 56–66. 10. Fisher, R. A. (1956). Statistical Methods and Scientific Inference, Oliver and Boyd, Edinburgh. (See p. 32.) 11. Hacking, I. (1965). Logic of Statistical Inference. Cambridge University Press, Cambridge. 12. Jeffreys, H. (1940). Annals of Eugenics, 10, 48–51. 13. Jones, H. L. (1958). J. Amer. Statist. Ass., 53, 482–490. 14. Keeping, E. S. (1962). Introduction to Statistical Inference. Van D. Nostrand, Princeton, N. J. 15. Kendall, M. G. and Stuart, A. (1961). The Advanced Theory of Statistics, Vol. 2: Inference and Relationship. Charles Griffin, London.
16. Kiefer, J. (1977). J. Amer. Statist. Ass., 72, 789–827. 17. Lindley, D. V. (1965). Introduction to Probability and Statistics from a Bayesian Viewpoint, Part 2, Inference. Cambridge University Press, Cambridge, England. 18. Neyman, J. (1934). J. R. Statist. Soc. A, 97, 558–606. (Especially Note I, p. 589, and discussion by Fisher R. A. p. 614. Mainly of historical interest.) 19. Neyman, J. (1937). Philos. Trans. R. Soc. Lond. A, 236, 333–380. (Fundamental reference on confidence intervals. These papers by Neyman are reproduced in Neyman, J. A Selection of Early Statistical Papers of J. Neyman, Cambridge University Press, Cambridge, 1967.) 20. Pratt, J. W. (1962). J. Amer. Statist. Ass., 57, 314–316. 21. Rao, C. R. (1965). Linear Statistical Inference and Its Applications. Wiley, New York. 22. Robinson, G. K. (1975). Biometrika, 62, 155–161. (Contrived, but reasonably simple examples.) 23. Savage, L. J. (1954). The Foundations of Statistics. Wiley, New York. (Argues that Bayesian inference is the only coherent theory.) 24. Savage, L. J. (1962). The foundations of Statistical Inference (a Discussion). Methuen, London. See also BAYESIAN INFERENCE; CONDITIONAL INFERENCE; FIDUCIAL INFERENCE; and INFERENCE, STATISTICAL.
G. K. ROBINSON
CONFIRMATORY TRIALS
RAFE JOSEPH MICHAEL DONAHUE
Vanderbilt University Medical Center, Nashville, Tennessee

1 DEFINITION AND DESCRIPTION

1.1 What They Are

Confirmatory clinical trials, as defined in ICH guideline E9 (1), are ‘‘adequately controlled trials in which the hypotheses are stated in advance and evaluated.’’ ICH guideline E8 (2) partitions clinical studies according to objectives; ‘‘Therapeutic Confirmatory’’ studies are one class, along with ‘‘Human Pharmacology,’’ ‘‘Therapeutic Exploratory,’’ and ‘‘Therapeutic Use.’’ These classes correspond to a partition of the drug development process into what are typically viewed as four phases: Phase I (human pharmacology), Phase II (therapeutic exploratory), Phase III (therapeutic confirmatory), and Phase IV (therapeutic use). As such, confirmatory clinical trials are most often considered Phase III trials. These Phase III therapeutic confirmatory trials have objectives that are typically demonstration or confirmation of efficacy, establishment of a safety profile, construction of an adequate basis for assessing the benefit/risk relationship in order to support licensing, or establishment of a dose-response relationship. ICH E9 continues: ‘‘As a rule, confirmatory trials are necessary to provide firm evidence of efficacy or safety. In such trials the key hypothesis of interest follows directly from the trial's primary objective, is always predefined, and is the hypothesis that is subsequently tested when the trial is complete. In a confirmatory trial it is equally important to estimate with due precision the size of the effects attributable to the treatment of interest and to relate these effects to their clinical significance.’’ This prespecification of the hypothesis that will be used to examine the primary objective is the key component of confirmatory trials. Since confirmatory trials are intended to provide firm evidence in support of claims, strict adherence to protocols and standard operating procedures is mandatory when such trials are run. Reporting of confirmatory trials requires discussion of any unavoidable changes and their impact on the study (1). Confirmatory trials typically address only a limited number of questions, and fundamental components of the analysis and justification for the design of confirmatory trials are set out in the protocol (1).
1.2 What They Are Not

Confirmatory trials are not exploratory trials. Although exploratory trials have clear and precise objectives like all clinical trials, these objectives may not lead to simple tests of predefined hypotheses. Such trials cannot, by themselves, provide sufficient evidence of efficacy; yet they may contribute to the full body of evidence (1).

2 CONSEQUENCES
In January 2005, the innovative pharmaceutical industry, represented worldwide by the European Federation of Pharmaceutical Industries and Associations (EFPIA), the International Federation of Pharmaceutical Manufacturers and Associations (IFPMA), the Japanese Pharmaceutical Manufacturers Association (JPMA), and the Pharmaceutical Research and Manufacturers of America (PhRMA), committed itself to increasing the transparency of their members' clinical trials (3). One outcome of this commitment was the establishment of a registry of the results from confirmatory clinical trials. An industrywide registry, www.clinicalstudyresults.org, houses the results from ‘‘all clinical trials, other than exploratory trials’’ (3). Another registry of clinical trials, www.clinicaltrials.gov, contains lists of ongoing confirmatory and exploratory clinical trials.

3 ISSUES AND CONTROVERSIES
D’Agostino and Massaro (4) address issues involved in the design, conduct, and analysis of confirmatory clinical trials. Among the
issues discussed in detail are study objectives, target populations, sample population, efficacy variables, control groups, study design, comparisons, sample size, trial monitoring, data analysis sets, unit of analysis, missing data, safety, subsets, and clinical significance. All of these issues need to be addressed to successfully develop a confirmatory clinical trial.

Parmar et al. (5) address issues around whether a specific, unique, confirmatory trial is necessary. They present a Bayesian statistical framework in which differences are assessed. Through this, they provide a method to assess the need for a confirmatory trial. Their examples revolve around non-small-cell lung cancer. Their argument centers on the belief that a major reason for performing confirmatory randomized clinical trials is a prior skepticism over whether the new treatment is likely to be clinically worthwhile. They argue that it might be wiser to accept the treatment into practice than to wait perhaps years to accrue patients and carry out a confirmatory trial.

An editorial by Berry (6) discusses the pros and cons of the Parmar et al. approach. He concludes that confirmatory trials are certainly important from a scientific perspective, but whether they are ethical is a different matter. He points out that earlier trials may have been carried out in different settings and that a confirmatory trial would likely show a different treatment effect—usually a smaller one—than was observed in the exploratory trials, due to regression to the mean. This regression to the mean bias makes it difficult to assess the magnitude of the treatment benefit as observed in a single confirmatory trial.

REFERENCES

1. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH). ICH Guideline E9: Statistical principles for clinical trials. ICH web site. Available: www.ich.org, Accessed 2006.12.31.
2. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH). ICH Guideline E8: General considerations for clinical trials. ICH web site. Available: www.ich.org, Accessed 2006.12.31.
3. Joint Position on the Disclosure of Clinical Trial Information via Clinical Trial Registries and Databases. European Federation of Pharmaceutical Industries and Associations (EFPIA) web site. Available: www.efpia.org, Accessed 2006.12.31.
4. R. B. D'Agostino Sr, and J. M. Massaro, New developments in medical clinical trials. J Dent Res. 2004; 83 Spec No C: C18-24.
5. M. K. B. Parmar, R. S. Ungerleider, R. Simon, Assessing whether to perform a confirmatory randomized clinical trial. J. Natl. Cancer Inst. 1996; 88: 1645-1651.
6. D. A. Berry, When is a confirmatory randomized clinical trial needed? J. Natl. Cancer Inst. 1996; 88, No. 22: 1606-1607.
CROSS-REFERENCES Phase III Trials Pivotal Trial
CONFOUNDING
SANDER GREENLAND
University of California, Los Angeles, CA, USA

The word confounding has been used to refer to at least three distinct concepts. In the oldest usage, confounding is a bias in estimating causal effects (see Causation). This bias is sometimes informally described as a mixing of effects of extraneous factors (called confounders) with the effect of interest. This usage predominates in nonexperimental research, especially in epidemiology and sociology. In a second and more recent usage, confounding is a synonym for noncollapsibility, although this usage is often limited to situations in which the parameter of interest is a causal effect. In a third usage, originating in the experimental-design literature, confounding refers to inseparability of main effects and interactions under a particular design. The term aliasing is also sometimes used to refer to the latter concept; this usage is common in the analysis of variance literature. The three concepts are closely related and are not always distinguished from one another. In particular, the concepts of confounding as a bias in effect estimation and as noncollapsibility are often treated as identical, although there are many examples in which the two concepts diverge (8,9,14); one is given below.

1 CONFOUNDING AS A BIAS IN EFFECT ESTIMATION

1.1 Confounding

A classic discussion of confounding in which explicit reference is made to ‘‘confounded effects’’ is Mill [15, Chapter X] (although in Chapter III Mill lays out the primary issues and acknowledges Francis Bacon as a forerunner in dealing with them). There, he lists a requirement for an experiment intended to determine causal relations:

. . . none of the circumstances [of the experiment] that we do know shall have effects susceptible of being confounded with those of the agents whose properties we wish to study (emphasis added).

It should be noted that, in Mill's time, the word ‘‘experiment’’ referred to an observation in which some circumstances were under the control of the observer, as it still is used in ordinary English, rather than to the notion of a comparative trial. Nonetheless, Mill's requirement suggests that a comparison is to be made between the outcome of his experiment (which is, essentially, an uncontrolled trial) and what we would expect the outcome to be if the agents we wish to study had been absent. If the outcome is not as one would expect in the absence of the study agents, then his requirement ensures that the unexpected outcome was not brought about by extraneous circumstances. If, however, those circumstances do bring about the unexpected outcome, and that outcome is mistakenly attributed to effects of the study agents, then the mistake is one of confounding (or confusion) of the extraneous effects with the agent effects.

Much of the modern literature follows the same informal conceptualization given by Mill. Terminology is now more specific, with ‘‘treatment’’ used to refer to an agent administered by the investigator and ‘‘exposure’’ often used to denote an unmanipulated agent. The chief development beyond Mill is that the expectation for the outcome in the absence of the study exposure is now almost always explicitly derived from observation of a control group that is untreated or unexposed. For example, Clayton & Hills (2) state of observational studies,

. . . there is always the possibility that an important influence on the outcome . . . differs systematically between the comparison [exposed and unexposed] groups. It is then possible [that] part of the apparent effect of exposure is due to these differences, [in which case] the comparison of the exposure groups is said to be confounded (emphasis in the original).

In fact, confounding is also possible in randomized experiments, owing to systematic improprieties in treatment allocation, administration, and compliance. A further
and somewhat controversial point is that confounding (as per Mill's original definition) can also occur in perfect randomized trials due to random differences between comparison groups (6,8). Various mathematical formalizations of confounding have been proposed. Perhaps the one closest to Mill's concept is based on a formal counterfactual model for causal effects. Suppose our objective is to determine the effect of applying a treatment or exposure x1 on a parameter µ of population A, relative to applying treatment or exposure x0. For example, A could be a cohort of breast-cancer patients, treatment x1 could be a new hormone therapy, x0 could be a placebo therapy, and the parameter µ could be the 5-year survival probability. The population A is sometimes called the target population or index population; the treatment x1 is sometimes called the index treatment; and the treatment x0 is sometimes called the control or reference treatment (which is often a standard or placebo treatment). The counterfactual model assumes that µ will equal µA1 if x1 is applied, µA0 if x0 is applied; the causal effect of x1 relative to x0 is defined as the change from µA0 to µA1, which might be measured as µA1 − µA0 or µA1/µA0. If A is observed under treatment x1, then µ will equal µA1, which is observable or estimable, but µA0 will be unobservable. Suppose, however, we expect µA0 to equal µB0, where µB0 is the value of the outcome µ observed or estimated for a population B that was administered treatment x0. The latter population is sometimes called the control or reference population. Confounding is said to be present if in fact µA0 ≠ µB0, for then there must be some difference between populations A and B (other than treatment) that is affecting µ. If confounding is present, a naive (crude) association measure obtained by substituting µB0 for µA0 in an effect measure will not equal the effect measure, and the association measure is said to be confounded. For example, if µB0 ≠ µA0, then µA1 − µB0, which measures the association of treatments with outcomes across the populations, is confounded for µA1 − µA0, which measures the effect of treatment x1 on population A. Thus, saying a measure of association such as µA1 − µB0 is confounded for a measure of effect such as µA1 − µA0 is synonymous with saying the two measures are not equal.

The preceding formalization of confounding gradually emerged through attempts to separate effect measures into a component due to the effect of interest and a component due to extraneous effects (1,4,10,12,13). These decompositions will be discussed below. One noteworthy aspect of the above formalization is that confounding depends on the outcome parameter. For example, suppose populations A and B have a different 5-year survival probability µ under placebo treatment x0; that is, suppose µB0 ≠ µA0, so that µA1 − µB0 is confounded for the actual effect µA1 − µA0 of treatment on 5-year survival. It is then still possible that 10-year survival, ν, under the placebo would be identical in both populations; that is, νA0 could still equal νB0, so that νA1 − νB0 is not confounded for the actual effect of treatment on 10-year survival. (We should generally expect no confounding for 200-year survival, since no treatment is likely to raise the 200-year survival probability of human patients above zero.)

A second noteworthy point is that confounding depends on the target population of inference. The preceding example, with A as the target, had different 5-year survivals µA0 and µB0 for A and B under placebo therapy, and hence µA1 − µB0 was confounded for the effect µA1 − µA0 of treatment on population A. A lawyer or ethicist may also be interested in what effect the treatment x1 would have had on population B. Writing µB1 for the (unobserved) outcome of B under treatment x1, this effect on B may be measured by µB1 − µB0. Substituting µA1 for the unobserved µB1 yields µA1 − µB0. This measure of association is confounded for µB1 − µB0 (the effect of treatment x1 on 5-year survival in population B) if and only if µA1 ≠ µB1. Thus, the same measure of association, µA1 − µB0, may be confounded for the effect of treatment on neither, one, or both of populations A and B.

1.2 Confounders

A third noteworthy aspect of the counterfactual formalization of confounding is that it invokes no explicit differences (imbalances)
between populations A and B with respect to circumstances or covariates that might influence µ (8). Clearly, if µA0 and µB0 differ, then A and B must differ with respect to factors that influence µ. This observation has led some authors to define confounding as the presence of such covariate differences between the compared populations. Nonetheless, confounding is only a consequence of these covariate differences. In fact, A and B may differ profoundly with respect to covariates that influence µ, and yet confounding may be absent. In other words, a covariate difference between A and B is a necessary but not sufficient condition for confounding. This point will be illustrated below.

Suppose now that populations A and B differ with respect to certain covariates, and that these differences have led to confounding of an association measure for the effect measure of interest. The responsible covariates are then termed confounders of the association measure. In the above example, with µA1 − µB0 confounded for the effect µA1 − µA0, the factors responsible for the confounding (i.e. the factors that led to µA0 ≠ µB0) are the confounders. It can be deduced that a variable cannot be a confounder unless it can affect the outcome parameter µ within treatment groups and it is distributed differently among the compared populations (e.g. see Yule (23), who however uses terms such as ‘‘fictitious association’’ rather than confounding). These two necessary conditions are sometimes offered together as a definition of a confounder. Nonetheless, counterexamples show that the two conditions are not sufficient for a variable with more than two levels to be a confounder as defined above; one such counterexample is given in the next section.

1.3 Prevention of Confounding

Perhaps the most obvious way to avoid confounding in estimating µA1 − µA0 is to obtain a reference population B for which µB0 is known to equal µA0. Among epidemiologists, such a population is sometimes said to be comparable to or exchangeable with A with respect to the outcome under the reference treatment. In practice, such a population may be difficult or impossible to find. Thus, an
investigator may attempt to construct such a population, or to construct exchangeable index and reference populations. These constructions may be viewed as design-based methods for the control of confounding. Perhaps no approach is more effective for preventing confounding by a known factor than restriction. For example, gender imbalances cannot confound a study restricted to women. However, there are several drawbacks: restriction on enough factors can reduce the number of available subjects to unacceptably low levels, and may greatly reduce the generalizability of results as well. Matching the treatment populations on confounders overcomes these drawbacks and, if successful, can be as effective as restriction. For example, gender imbalances cannot confound a study in which the compared groups have identical proportions of women. Unfortunately, differential losses to observation may undo the initial covariate balances produced by matching. Neither restriction nor matching prevents (although it may diminish) imbalances on unrestricted, unmatched, or unmeasured covariates. In contrast, randomization offers a means of dealing with confounding by covariates not accounted for by the design. It must be emphasized, however, that this solution is only probabilistic and subject to severe constraints in practice. Randomization is not always feasible, and (as mentioned earlier) many practical problems, such as differential loss and noncompliance, can lead to confounding in comparisons of the groups actually receiving treatments x1 and x0 . One somewhat controversial solution to noncompliance problems is intention-totreat analysis, which defines the comparison groups A and B by treatment assigned rather than treatment received. Confounding may, however, affect even intentionto-treat analyses. For example, the assignments may not always be random, as when blinding is insufficient to prevent the treatment providers from protocol violations. And, purely by bad luck, randomization may itself produce allocations with severe covariate imbalances between the groups (and consequent confounding), especially if the study size is small (6,8,19). Block randomization can help ensure that random imbalances on
the blocking factors will not occur, but it does not guarantee balance of unblocked factors. 1.4 Adjustment for Confounding Design-based methods are often infeasible or insufficient to prevent confounding. Thus there has been an enormous amount of work devoted to analytic adjustments for confounding. With a few exceptions, these methods are based on observed covariate distributions in the compared populations. Such methods can successfully control confounding only to the extent that enough confounders are adequately measured. Then, too, many methods employ parametric models at some stage, and their success may thus depend on the faithfulness of the model to reality. These issues cannot be covered in depth here, but a few basic points are worth noting. The simplest and most widely trusted methods of adjustment begin with stratification on confounders. A covariate cannot be responsible for confounding within internally homogeneous strata of the covariate. For example, gender imbalances cannot confound observations within a stratum composed solely of women. More generally, comparisons within strata cannot be confounded by a covariate that is constant (homogeneous) within strata. This is so regardless of whether the covariate was used to define the strata. Generalizing this observation to a regression context, we find that any covariate with a residual variance of zero conditional on the regressors cannot confound regression estimates of effect (assuming that the regression model is correct). A broader and more useful observation is that any covariate that is unassociated with treatment conditional on the regressors cannot confound the effect estimates; this insight leads directly to adjustments using a propensity score. Some controversy has existed about adjustment for covariates in randomized trials. Although Fisher asserted that randomized comparisons were unbiased, he also pointed out that they could be confounded in the sense used here (e.g. see Fisher [6, p. 49]). Fisher’s use of the word ‘‘unbiased’’ was unconditional on allocation, and therefore of little guidance for analysis of a given trial. The ancillarity
of the allocation naturally leads to conditioning on the observed distribution of any pretreatment covariate that can influence the outcome parameter. Conditional on this distribution, the unadjusted treatment–effect estimate will be biased if the covariate is associated with treatment; this conditional bias can be removed by adjustment for the confounders (8,18). Note that the adjusted estimate is also unconditionally unbiased, and thus is a reasonable alternative to the unadjusted estimate even without conditioning.

1.5 Measures of Confounding

The parameter estimated by a direct unadjusted comparison of cohorts A and B is µA1 − µB0. A number of authors have measured the bias (confounding) of the unadjusted comparison by (10,12)

(µA1 − µB0) − (µA1 − µA0) = µA0 − µB0.

When the outcome parameters, µ, are risks (probabilities), epidemiologists use instead the analogous ratio

(µA1/µB0)/(µA1/µA0) = µA0/µB0

as a measure of bias (1,4,14); µA0/µB0 is sometimes called the confounding risk ratio. The latter term is somewhat confusing because it is sometimes misunderstood to refer to the effect of a particular confounder on risk. This is not so, although the ratio does reflect the net effect of the differences in the confounder distributions of A and B.

1.6 Residual Confounding

Suppose now that adjustment for confounding is done by subdividing the total study population (A + B) into K strata indexed by k. Let µA1k, µA0k, and µB0k denote the values taken in stratum k by the parameters µA1, µA0, and µB0. The effect of treatment x1 relative to x0 in stratum k may be defined as µA1k − µA0k or µA1k/µA0k. The confounding that remains in stratum k is called the residual confounding in the stratum, and is measured by µA0k − µB0k or µA0k/µB0k.
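As a small numerical illustration of these measures (the risks below are hypothetical and chosen only to make the arithmetic transparent), the crude difference decomposes into the effect of interest plus the bias µA0 − µB0, the crude ratio decomposes into the effect ratio times the confounding risk ratio µA0/µB0, and the residual confounding can be read off stratum by stratum. A minimal Python sketch:

# Hypothetical (not real-study) risks for population A under x1 and x0 and population B under x0
mu_A1, mu_A0, mu_B0 = 0.30, 0.20, 0.10

crude_diff = mu_A1 - mu_B0            # what the unadjusted comparison estimates: 0.20
effect_diff = mu_A1 - mu_A0           # causal effect on A: 0.10
bias_diff = mu_A0 - mu_B0             # confounding bias: 0.10, and crude = effect + bias

crude_ratio = mu_A1 / mu_B0           # 3.0
effect_ratio = mu_A1 / mu_A0          # 1.5
crr = mu_A0 / mu_B0                   # confounding risk ratio: 2.0, and crude = effect * crr
print(f"{crude_diff:.2f} = {effect_diff:.2f} + {bias_diff:.2f}")
print(f"{crude_ratio:.1f} = {effect_ratio:.1f} x {crr:.1f}")

# Stratum-specific (residual) confounding: hypothetical values for two strata
mu_A1k = [0.30, 0.50]
mu_A0k = [0.20, 0.25]
mu_B0k = [0.10, 0.25]
for k in range(2):
    residual = mu_A0k[k] - mu_B0k[k]  # residual confounding in stratum k (difference scale)
    print(f"stratum {k + 1}: residual confounding = {residual:.2f}")
# stratum 1 is still confounded (0.10); stratum 2 shows no residual confounding (0.00)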
Like effects, stratum-specific residual confounding may be summarized across the strata in a number of ways, for example by standardization methods or by other weighted-averaging methods. As an illustration, suppose we are given a standard distribution p1, . . . , pK for the stratum index k. In ratio terms, the standardized effect of x1 vs. x0 on A under this distribution is

RAA = (Σk pk µA1k) / (Σk pk µA0k),

whereas the standardized ratio comparing A with B is

RAB = (Σk pk µA1k) / (Σk pk µB0k).

The overall residual confounding in RAB is thus

RAB/RAA = (Σk pk µA0k) / (Σk pk µB0k),

which may be recognized as the standardized ratio comparing A and B when both are given treatment x0, using p1, . . . , pK as the standard distribution.

1.7 Regression Formulations

For simplicity, the above presentation has focused on comparing two populations and two treatments. The basic concepts extend immediately to the consideration of multiple populations and treatments. Paired comparisons may be represented using the above formalization without modification. Parametric models for these comparisons then provide a connection to more familiar regression models. As an illustration, suppose population differences and treatment effects follow the model µk(x) = αk + xβ, where the treatment level x may range over a continuum, and k indexes populations. Suppose population k is given treatment xk, even though it could have been given some other treatment. The absolute effect of x1 vs. x2 on µ in population 1 is µ1(x1) − µ1(x2) = (x1 − x2)β. Substitution of µ2(x2), the value of µ in population 2 under treatment x2, for µ1(x2) yields µ1(x1) − µ2(x2) = α1 − α2 + (x1 − x2)β, which is biased by the amount

µ1(x2) − µ2(x2) = α1 − α2.

Thus, under this model no confounding will occur if the intercepts αk equal a constant α across populations, so that µk(x) = α + βx. When constant intercepts cannot be assumed and nothing else is known about the intercept magnitudes, it may be possible to represent our uncertainty about αk via the following mixed-effects model: µk(x) = α + xβ + εk. Here, αk has been decomposed into α + εk, where εk has mean zero, and the confounding in µ1(x1) − µ2(x2) has become an unobserved random variable, ε1 − ε2. Correlation of population membership k with xk leads to a correlation of εk with xk, which in turn leads to bias in estimating β. This bias may be attributed to or interpreted as confounding for β in the regression analysis. Confounders are now covariates that causally ‘‘explain’’ the correlation between εk and xk. In particular, confounders normally reduce the correlation of xk and εk when entered in the model. The converse is false, however: a variable that reduces the correlation of xk and εk when entered need not be a confounder; it may, for example, be a variable affected by both the treatment and the exposure.

2 CONFOUNDING AND NONCOLLAPSIBILITY

Much of the statistics literature does not distinguish between the concept of confounding as described above and the concept
of noncollapsibility. Nonetheless, the two concepts are distinct: for certain outcome parameters, confounding may occur with or without noncollapsibility and noncollapsibility may occur with or without confounding (8,9,14,17,20,22). Mathematically identical conclusions have been reached by other authors, albeit with different terminology in which noncollapsibility corresponds to ‘‘bias’’ and confounding corresponds to covariate imbalance (7,11). As an example of noncollapsibility with no confounding, consider the response distributions under treatments x1 and x0 given in Table 1 for a hypothetical index population A, and the response distribution under treatment x0 given in Table 2 for a hypothetical reference population B. If we take the odds of response as the outcome parameter µ, we get

µA1 = 1460/540 = 2.70

and

µA0 = µB0 = 1000/1000 = 1.00.
Table 1. Distribution of Responses for Population A, within Strata of Z and Ignoring Z, under Treatments x1 and x0

                 Number of responses
Subpopulation    under x1    under x0    Subpopulation size
Z = 1               200         100             400
Z = 2               900         600            1200
Z = 3               360         300             400
Totals             1460        1000            2000
Table 2. Distribution of Responses for Population B, within Strata of Z and Ignoring Z, under Treatment x0

Subpopulation    Number of responses under x0    Subpopulation size
Z = 1                        200                        800
Z = 2                        200                        400
Z = 3                        600                        800
Totals                      1000                       2000
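The arithmetic in this example is easy to reproduce. The short Python sketch below simply retypes the counts from Tables 1 and 2 and recovers the crude odds given above as well as the stratum-specific odds ratios discussed in the text that follows.

# (responses, subpopulation size) by stratum of Z, taken from Tables 1 and 2
A_x1 = [(200, 400), (900, 1200), (360, 400)]   # population A under x1
A_x0 = [(100, 400), (600, 1200), (300, 400)]   # population A under x0
B_x0 = [(200, 800), (200, 400), (600, 800)]    # population B under x0

def crude_odds(cells):
    responses = sum(r for r, n in cells)
    nonresponses = sum(n - r for r, n in cells)
    return responses / nonresponses

print(f"{crude_odds(A_x1):.2f} {crude_odds(A_x0):.2f} {crude_odds(B_x0):.2f}")  # 2.70 1.00 1.00

# Stratum-specific odds ratios: A under x1 versus B under x0 (the same values result versus A under x0)
for (r1, n1), (r0, n0) in zip(A_x1, B_x0):
    stratum_or = (r1 / (n1 - r1)) / (r0 / (n0 - r0))
    print(f"{stratum_or:.2f}")   # 3.00 in every stratum, although the crude odds ratio is 2.70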
There is thus no confounding of the odds ratio: µA1/µA0 = µA1/µB0 = 2.70/1.00 = 2.70. Nonetheless, the covariate Z is associated with response and is distributed differently in A and B. Furthermore, the odds ratio is not collapsible: within levels of Z, the odds ratios comparing A under treatment x1 with either A or B under x0 are (200/200)/(200/600) = (900/300)/(200/200) = (360/40)/(600/200) = 3.00, a bit higher than the odds ratio of 2.70 obtained when Z is ignored.

The preceding example illustrates a peculiar property of the odds ratio as an effect measure: treatment x1 (relative to x0) elevates the odds of response by 170% in population A, yet within each stratum of Z it raises the odds by 200%. When Z is associated with response conditional on treatment but unconditionally unassociated with treatment, the stratum-specific effects on odds ratios will be further from the null than the overall effect if the latter is not null (7). This phenomenon is often interpreted as a ‘‘bias’’ in the overall odds ratio, but in fact there is no bias if one does not interpret the overall effect as an estimate of the stratum-specific effects. The example also shows that, when µ is the odds, the ‘‘confounding odds ratio’’ (µA1/µB0)/(µA1/µA0) = µA0/µB0 may be 1 even when the odds ratio is not collapsible over the confounders. Conversely, we may have µA0/µB0 ≠ 1 even when the odds ratio is collapsible. More generally, the ratio of crude and stratum-specific odds ratios does not equal µA0/µB0 except in some special cases. When the odds are low, however, the odds will be close to the corresponding risks, and so the two ratios will approximate one another.

The phenomenon illustrated in the example corresponds to the differences between cluster-specific and population-averaged (marginal) effects in nonlinear mixed-effects regression (16). Specifically, the clusters of correlated outcomes correspond to the strata, the cluster effects correspond to covariate effects, the cluster-specific treatment effects correspond to the stratum-specific log odds ratios, and the population-averaged treatment effect corresponds to the crude log odds ratio. Results of Gail (7) imply that if the effect measure is the difference or ratio of response proportions, then the above phenomenon–
7
A naive approach would be to use groups of equal size, assigning one group to placebos only (x0, y0, z0) and the remaining three groups to one active treatment each: (x1, y0, z0), (x0, y1, z0), and (x0, y0, z1). Unfortunately, with a fixed number N of subjects available, this design would provide only N/4 subjects under each active treatment. As an alternative, consider the design with four groups of equal size with treatments (x0, y0, z0), (x1, y1, z0), (x1, y0, z1), and (x0, y1, z1). This fractional factorial design would provide N/2 subjects under each active treatment, at the cost of confounding main effects and interactions. For example, no linear combination of group means containing the main effect of x1 vs. x0 would be free of interactions. If one could assume that all interactions were negligible, however, this design could provide considerably more precise estimates of the main effects than the naive four-group design. To see these points, consider the following linear model:

µXYZ = α + β1X + β2Y + β3Z + γ1XY + γ2XZ + γ3YZ + δXYZ,

where X, Y, and Z equal 1 for x1, y1, and z1, and 0 for x0, y0, and z0, respectively. The group means in the fractional factorial design are then

µ000 = α,
µ110 = α + β1 + β2 + γ1,
µ101 = α + β1 + β3 + γ2,
µ011 = α + β2 + β3 + γ3.

Treating the means as observed and the coefficients as unknown, the above system is underidentified. In particular, there is no solution for any main effect βj in terms of the means µijk. Nonetheless, assuming all γj = 0 yields immediate solutions for all the βj. Additionally assuming a variance of σ² for each estimated group mean yields that the main-effect estimates under this design would have variances of σ², as opposed to 2σ² for the main-effect estimates from the naive four-group design of the same size.
For example, under the confounded fractional factorial design (assuming no interactions),

β̂1 = (µ̂110 + µ̂101 − µ̂000 − µ̂011)/2,

so var(β̂1) = 4σ²/4 = σ², whereas under the naive design, β̂1 = µ̂100 − µ̂000, so var(β̂1) = 2σ². Of course, the precision advantage of the confounded design is purchased by the assumption of no interaction, which is not needed by the naive design.

REFERENCES

1. Bross, I. D. J. (1967). Pertinency of an extraneous variable, Journal of Chronic Diseases 20, 487–495.
2. Clayton, D. & Hills, M. (1993). Statistical Models in Epidemiology. Oxford University Press, New York.
3. Cochran, W. G. & Cox, G. M. (1957). Experimental Designs, 2nd Ed. Wiley, New York.
4. Cornfield, J., Haenszel, W., Hammond, E. C., Lilienfeld, A. M., Shimkin, M. B. & Wynder, E. L. (1959). Smoking and lung cancer: recent evidence and a discussion of some questions, Journal of the National Cancer Institute 22, 173–203.
5. Cox, D. R. (1958). The Planning of Experiments. Wiley, New York.
6. Fisher, R. A. (1935). The Design of Experiments. Oliver & Boyd, Edinburgh.
7. Gail, M. H. (1986). Adjusting for covariates that have the same distribution in exposed and unexposed cohorts, in Modern Statistical Methods in Chronic Disease Epidemiology, S. H. Moolgavkar & R. L. Prentice, eds. Wiley, New York.
8. Greenland, S. & Robins, J. M. (1986). Identifiability, exchangeability, and epidemiological confounding, International Journal of Epidemiology 15, 413–419.
9. Greenland, S., Robins, J. M. & Pearl, J. (1999). Confounding and collapsibility in causal inference, Statistical Science 14, 29–46.
10. Groves, E. R. & Ogburn, W. F. (1928). American Marriage and Family Relationships. Henry Holt & Company, New York, pp. 160–164.
11. Hauck, W. W., Neuhaus, J. M., Kalbfleisch, J. D. & Anderson, S. (1991). A consequence of omitted covariates when estimating odds ratios, Journal of Clinical Epidemiology 44, 77–81.
12. Kitagawa, E. M. (1955). Components of a difference between two rates, Journal of the American Statistical Association 50, 1168–1194.
13. Miettinen, O. S. (1972). Components of the crude risk ratio, American Journal of Epidemiology 96, 168–172.
14. Miettinen, O. S. & Cook, E. F. (1981). Confounding: essence and detection, American Journal of Epidemiology 114, 593–603.
15. Mill, J. S. (1843). A System of Logic, Ratiocinative and Inductive. Reprinted by Longmans, Green & Company, London, 1956.
16. Neuhaus, J. M., Kalbfleisch, J. D. & Hauck, W. W. (1991). A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data, International Statistical Review 59, 25–35.
17. Pearl, J. (2000). Causality. Cambridge University Press, New York, Ch. 6.
18. Robins, J. M. & Morgenstern, H. (1987). The mathematical foundations of confounding in epidemiology, Computers and Mathematics with Applications 14, 869–916.
19. Rothman, K. J. (1977). Epidemiologic methods in clinical trials, Cancer 39, 1771–1775.
20. Rothman, K. J. & Greenland, S. (1998). Modern Epidemiology, 2nd ed. Lippincott, Philadelphia, Ch. 4.
21. Scheffé, H. A. (1959). The Analysis of Variance. Wiley, New York.
22. Wickramaratne, P. & Holford, T. (1987). Confounding in epidemiologic studies: the adequacy of the control groups as a measure of confounding, Biometrics 43, 751–765.
23. Yule, G. U. (1903). Notes on the theory of association of attributes in statistics, Biometrika 2, 121–134.
CONSORT

A report of a randomized controlled trial (RCT) should convey to the reader, in a transparent manner, why the study was undertaken, and how it was conducted and analyzed. To assess the strengths and limitations of an RCT, the reader needs and deserves to know the quality of its methodology. Despite several decades of educational efforts, RCTs still are not being reported adequately [2, 5, 10]. The Consolidated Standards of Reporting Trials (CONSORT) statement, published in the Journal of the American Medical Association in 1996 [1], was developed to try to help rectify this problem. The CONSORT statement was developed by an international group of clinical trialists,
statisticians, epidemiologists and biomedical editors. The CONSORT statement is one result of previous efforts made by two independent groups, the Standards of Reporting Trials (SORT) group [9] and the Asilomar Working Group on Recommendations for Reporting of Clinical Trials in the Biomedical Literature [11]. The CONSORT statement consists of two components, a 21-item checklist (Table 1) and a flow diagram (Figure 1). The checklist has six major headings that pertain to the contents of the report of a trial, namely Title, Abstract, Introduction, Methods, Results and Discussion. Within these major headings there are subheadings that pertain to specific items that should be included in any clinical trial manuscript. These items constitute the key pieces of information necessary for authors to address when reporting
[Figure 1 flow diagram boxes: Registered or eligible patients (n = ...); Not randomized (n = ...), with reasons (n = ...); Randomization; then, for each of the two randomized groups: Received standard intervention as allocated (n = ...); Did not receive standard intervention as allocated (n = ...); Followed up (n = ...), with timing of primary and secondary outcomes; Withdrawn (n = ...), with reasons (intervention ineffective, lost to follow-up, other); Completed trial (n = ...).]
Figure 1 CONSORT flowchart. Reproduced with permission from the Journal of the American Medical Association, 1996, Volume 276, 637–665. Copyrighted (1996), American Medical Association.
Table 1 CONSORT checklist (columns: Heading; Subheading; Descriptor; Was it reported? Page no.?)
Title: Identify the study as a randomized trial.
Abstract: Use a structured format.
Introduction: State prospectively defined hypothesis, clinical objectives, and planned subgroup or covariate analyses.
Methods / Protocol: Describe 1. Planned study population, together with inclusion/exclusion criteria. 2. Planned interventions and their timing. 3. Primary and secondary outcome measure(s) and the minimum important difference(s), and indicate how the target sample size was projected. 4. Rationale and methods for statistical analyses, detailing main comparative analyses and whether they were completed on an intention-to-treat basis. 5. Prospectively defined stopping rules (if warranted).
Methods / Assignment: Describe 1. Unit of randomization (e.g. individual, cluster, geographic). 2. Method used to generate the allocation schedule. 3. Method of allocation concealment and timing of assignment. 4. Method to separate the generator from the executor of assignment.
Methods / Masking (Blinding): Describe mechanism (e.g. capsules, tablets); similarity of treatment characteristics (e.g. appearance, taste); allocation schedule control (location of code during trial and when broken); and evidence for successful blinding among participants, person doing intervention, outcome assessors, and data analysts.
Results / Participant Flow and Follow-up: Provide a trial profile (Figure 1) summarizing participant flow, numbers and timing of randomization assignment, interventions, and measurements for each randomized group.
Results / Analysis: State estimated effect of intervention on primary and secondary outcome measures, including a point estimate and measure of precision (confidence interval). State results in absolute numbers when feasible (e.g. 10/20, not 50%). Present summary data and appropriate descriptive and inferential statistics in sufficient detail to permit alternative analyses and replication. Describe prognostic variables by treatment group and any attempt to adjust for them. Describe protocol deviations from the study as planned, together with the reasons.
Comment: State specific interpretation of study findings, including sources of bias and imprecision (internal validity) and discussion of external validity, including appropriate quantitative measures when possible. State general interpretation of the data in light of the totality of the available evidence.
(The final column, "Was it reported? Page no.?", is left blank for authors to complete.)
the results of an RCT. Their inclusion is based on evidence, whenever possible. For example, authors are asked to report on the methods they used to achieve allocation concealment, which is possible in every randomized trial. There is growing evidence that inadequately concealed trials, compared with adequately concealed ones, exaggerate the estimates of intervention benefit by 30%–40%, on average [7, 8]. Additional benefits of the checklist (and flow diagram) include helping editors, peer reviewers, and journal readers evaluate the internal and external validity of a clinical trial report. The flow diagram pertains to the process of winnowing down the number of participants from those eligible or screened for a trial to those who ultimately completed the trial and were included in the analysis. The flow diagram pertains particularly to a two-group, parallel design, as stated in the CONSORT statement. Other checklists and flow diagrams have been developed for reporting cluster randomized trials [4] and other designs (see http://www.consort-statement.org). The flow diagram, in particular, requests relevant information regarding participants in each of the intervention and control groups who did not receive the regimen for the group to which they were randomized, those who were discontinued, withdrew, or became lost to follow-up during the course of the trial, and those who have incomplete information for some other reason. There is emerging evidence that reports of RCTs prepared with the use of the CONSORT statement, compared with those prepared without it, are of higher quality on several dimensions, such as less frequent unclear reporting of allocation concealment [6]. Similarly, use of the flow diagram was associated with better overall reporting of RCTs [3]. The CONSORT statement (checklist and flow diagram) is available on the CONSORT website (www.consort-statement.org). This site includes information on the growing number of health care journals and biomedical editorial groups, such as the International Committee of Medical Journal Editors (ICMJE), that support the use of the CONSORT statement for reporting RCTs. At this writing the CONSORT statement is undergoing revision. Present plans call for the revised statement to appear in Spring 2001 along with an
extensive explanation and elaboration document intended to overcome some of the shortcomings of the original statement; both will be available on the above website.
References
[1] Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., Pitkin, R., Rennie, D., Schulz, K.F., Simel, D. & Stroup, D.F. (1996). Improving the quality of reporting of randomized controlled trials: the CONSORT Statement, Journal of the American Medical Association 276, 637–639.
[2] Dickinson, K., Bunn, F., Wentz, R., Edwards, P. & Roberts, I. (2000). Size and quality of randomized controlled trials in head injury: review of published studies, British Medical Journal 320, 1308–1311.
[3] Egger, M., Jüni, P., Bartlett, C. for the CONSORT Group (2001). The value of CONSORT flow charts in reports of randomized controlled trials: bibliographic study, Journal of the American Medical Association, in press.
[4] Elbourne, D.R. & Campbell, M.K. (2001). Extending the CONSORT statement to cluster randomized trials: for discussion, Statistics in Medicine 20, 489–496.
[5] Hotopf, M., Lewis, G. & Normand, C. (1997). Putting trials on trial – the costs and consequences of small trials in depression: a systematic review of methodology, Journal of Epidemiology and Community Health 51, 354–358.
[6] Moher, D., Jones, A., Lepage, L. for the CONSORT Group (2001). Does the CONSORT statement improve the quality of reports of randomized trials: a comparative before and after evaluation?, Journal of the American Medical Association, in press.
[7] Moher, D., Pham, B., Jones, A., Cook, D.J., Jadad, A.R., Moher, M. & Tugwell, P. (1998). Does the quality of reports of randomized trials affect estimates of intervention efficacy reported in meta-analyses?, Lancet 352, 609–613.
[8] Schulz, K.F., Chalmers, I., Hayes, R.J. & Altman, D.G. (1995). Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials, Journal of the American Medical Association 273, 408–412.
[9] The Standards of Reporting Trials Group (1994). A proposal for structured reporting of randomized controlled trials, Journal of the American Medical Association 272, 1926–1931. Correction: Journal of the American Medical Association 273, 776.
[10] Thornley, B. & Adams, C.E. (1998). Content and quality of 2000 controlled trials in schizophrenia over 50 years, British Medical Journal 317, 1181–1184.
[11] Working Group on Recommendations for Reporting of Clinical Trials in the Biomedical Literature (1994). Call for comments on a proposal to improve reporting of clinical trials in the biomedical literature: a position paper, Annals of Internal Medicine 121, 894–895.
(See also QUOROM)
DAVID MOHER
CONTRACT RESEARCH ORGANIZATION (CRO)
A Contract Research Organization (CRO) is a person or an organization (commercial, academic, or other) contracted by the sponsor to perform one or more of the trial-related duties and functions of the sponsor. A sponsor may transfer any or all of the trial-related duties and functions of the sponsor to a CRO, but the ultimate responsibility for the quality and integrity of the trial data always resides with the sponsor. The CRO should implement quality assurance and quality control. Any trial-related duty and function that is transferred to and assumed by a CRO should be specified in writing. Any trial-related duties and functions not specifically transferred to and assumed by a CRO are retained by the sponsor. All references to a sponsor in this guideline also apply to a CRO to the extent that a CRO has assumed the trial-related duties and functions of a sponsor.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
CONTROL GROUPS
STEPHANIE GREEN, Clinical Biostatistics, Pfizer, Inc., New London, Connecticut
A control group for a clinical trial is a group of uniformly treated patients selected to be compared with a group receiving a test (new or different) treatment. For the comparison to be a valid assessment of differences in outcome due to test treatment, the two groups must be as similar as possible with the exception of treatment. Various types of control groups are employed, and the choice is critical to the interpretation of a trial (1). The type chosen for a particular trial will depend on many factors such as availability of resources, subjectivity of the primary outcome, availability of effective treatments, current extent of knowledge concerning treatment, severity of disease, and ethical considerations. Randomized trials comparing a control group with an experimental treatment group are the most convincing and reliable method for minimizing bias and demonstrating effectiveness of new treatments. Depending on standard of care, control groups for randomized trials may be patients randomized to no treatment, a placebo (i.e., an inactive agent or a procedure with the appearance of the new treatment), or standard active therapy with or without a placebo. Because of practical and ethical constraints, not all clinical questions can be addressed with a randomized trial. Nonrandomized controlled trials can have specific comparator groups or be controlled in the sense that carefully chosen historical information from the literature is used for reference. Proper use of control groups has been a significant advance in medicine, with controlled clinical trials now the mainstay of clinical research without which ''the doctor walks at random and becomes the sport of illusion'' (2, 3).
1 HISTORY
It can be argued that the history of clinical research is the history of control groups. Informal observational methods have been used to identify treatments throughout the history of medicine. Control groups were, in essence, previously treated patients assessed according to general clinical impression. The humoralistic philosophy of medicine—treatment to restore the balance of blood, phlegm, black bile, and yellow bile in the sick—espoused by Hippocrates and codified by Galen (130 AD) remained largely unchallenged and untested until the Renaissance (4). Unfortunately, the replacement theories were not much better, nor were the treatments or experimental methods. Lack of adequately controlled trials might not have been so important except that so many treatments were actively harmful. Numerical methods were introduced in the early 1800s, leading to important epidemiologic observations such as Snow's discovery that cholera is a waterborne infectious disease (5). However, the methods were not as useful in assessing treatment effectiveness because if they were applied at all, they were applied using unreliable control groups. For example, a diphtheria antiserum was introduced in Europe in 1894 to 1895, and death rates due to diphtheria declined—but the decline had started before 1894 and rose again to pre-1894 rates by 1924 (6). The mix of responsible bacteria changed over time, making the contribution of treatment to the initial decline uncertain. An early exception was the seawater-controlled, 6-arm, 12-patient scurvy study undertaken by James Lind in 1747, which has been noted to be ''the first deliberately planned controlled experiment ever undertaken on human subjects'' (7). This success lacked a completely happy ending, however, in that it took decades before the results were reliably applied and accepted (8, 9). Although principles of controlled medical experiments were expressed as early as 1866 (2), the modern randomized controlled trial is a relatively recent development. The first properly randomized controlled treatment trial was a study of streptomycin in tuberculosis. In 2 years, the trial convincingly demonstrated that streptomycin plus
bed rest was superior to the control treatment of bed rest alone, in stark contrast to centuries of unanswered questions on previous tuberculosis treatments (10). Sir Austin Bradford Hill, the statistical champion for this study, is credited as the individual most instrumental in fostering use of the randomized controlled clinical trial (11). Since then, use of controlled trials has resulted in numerous treatment breakthroughs, such as the prevention of polio established by the beautifully executed Salk vaccine trial (12). Perhaps of equal or greater importance, controlled trials have also proven the lack of effectiveness of purported breakthroughs, such as the Cardiac Arrhythmia Suppression Trial (CAST) demonstrating that encainide and flecainide increased the death rate in patients with recent myocardial infarction instead of decreasing it, despite the well-documented suppression of ventricular arrhythmia by these agents (13). Use of proper control groups continues to be a critical principle in clinical research methods.
2 ETHICS
Ethical considerations with respect to experiments using control groups and experimental treatments center on the tension between care of individual patients and the need to study treatment effectiveness. The welfare of the individual patient should not be compromised by inclusion in a clinical research trial. On the other hand, the welfare of patients in general depends on identification of new effective treatments and discarding ineffective treatments. We often do not know whether the welfare of the individual is being compromised or enhanced, thus treatment without knowledge also raises ethical issues—consider the CAST study example of unknowingly doing harm instead of good. International guidance for ethics in medical research such as the Declaration of Helsinki (14) emphasizes protection of health and rights of patients. The Helsinki declaration acknowledges the physicians’ responsibilities to individual subjects: ‘‘considerations related to the well-being of the human subject should take precedence over the interests of science and society’’ and ‘‘it is the duty
of the physician in medical research to protect the life, health, privacy, and dignity of the human subject.’’ It also acknowledges responsibilities to patients as a whole: ‘‘it is the duty of the physician to safeguard the health of the people,’’ ‘‘medical progress is based on research which ultimately must rest in part on experimentation involving human subjects,’’ and ‘‘the benefits, risks, burdens and effectiveness of a new method should be tested against those of the best current prophylactic, diagnostic, and therapeutic methods.’’ The U.S. Belmont report (15) identified three major ethical principles: respect for persons (requiring informed and voluntary consent), beneficence (requiring that benefits outweigh risks plus careful study monitoring), and justice (requiring equitable selection of subjects and protection of vulnerable subjects). The main principle related to use of a control group is beneficence. Items from the Declaration of Helsinki that help inform decisions related to use of a control group include a requirement for careful assessment of risk–benefit, agreement that risks have been assessed and can be adequately managed, agreement that the importance of the objective outweighs the inherent risks to individuals, and a reasonable likelihood that populations in which the research is carried out stand to benefit from the results. Key considerations in the type of control group to be used are severity of disease and availability of effective treatments. If the disease is severe and there is an effective treatment, for instance, a no-treatment or placebo-alone arm on a randomized trial will not be an appropriate control as the risks to the individual are too high. The Declaration of Helsinki explicitly states that use of placebos or no-treatment are not precluded although ‘‘extreme care’’ must be taken in the decision to employ these. The current wording on this point is controversial, and the implications are perhaps too restrictive (16). Another key to ethical use of a control group is equipoise, the acknowledgment that usefulness of the new treatment compared with the control treatment is not yet known. Individual investigators may hold strong opinions and decline to participate in a trial,
CONTROL GROUPS
but the general sense in the scientific community should be one of uncertainty. In the CAST study, for example, there was considerable controversy in doing a randomized controlled trial as many investigators incorrectly believed that the striking arrhythmia evidence made a control arm unethical. Fortunately, the level of uncertainty in the scientific community allowed the trial to proceed, thereby sparing future patients a harmful treatment. An interesting illustration of the issues in using placebo-treated control groups was a double-blind transplantation trial in patients with Parkinson’s disease that used the standard treatment plus sham surgery as the control arm (17). The sham surgery included general anesthesia, a scalp incision, a partial burr hole, antibiotics and cyclosporine, and positron emission tomography (PET) studies, all of which entailed some risk to the patient. Because outcomes in Parkinson’s disease are subjective, use of a placebo-treated control group was the best way to assess the effectiveness of new therapy. The question was important, there was no adequate alternative therapy, the new treatment held promise but was uncertain, and future patients would benefit. Thus, the suitability of placebo use centered on whether the risk–benefit ratio for patients treated with placebo was acceptable. The potential benefits were the contribution to science, the no-cost standard medical treatment provided as part of the study, and later transplant at no cost if the treatment was found to be beneficial. As well, these patients were spared the additional risks of the transplant if it was found not to be beneficial. The risks included the possibility of injury or death due to the sham procedure and the inconvenience and discomfort of an extensive procedure with no possible benefit. As presented in a New England Journal of Medicine sounding board (18, 19), the case illustrates how assessment of levels of risk and benefit is not always clear cut, and how reasonable people may disagree on whether the ratio, and therefore the trial, is acceptable.
3 TYPES OF CONTROL GROUPS: HISTORICAL CONTROLS
A comparison of control versus experimental groups will indicate whether outcomes are different in the experimental group, but without randomization causality cannot be assumed. Any difference in the groups (or lack of difference) may be due to factors other than treatment. Any particular choice of control group, no matter how carefully chosen, will be systematically different from the experimental group in many (often unknown or unmeasurable) ways due to the systematic reasons patients are chosen for treatment. Many factors contribute to a choice of treatment, and many of these factors are related to outcome; to the extent that the experimental group has better or worse prognosis than the control group, the comparison will be biased. Biases can make the test treatment appear either more or less effective than it actually is. Because of potential toxicities, for instance, investigators may be inclined to include only patients who are relatively healthy; in other circumstances, they may choose only patients who are too compromised for other options. In the first case, an ineffective treatment might appear to be an improvement over historical treatment. In the second, an effective treatment might appear ineffective. If a historically controlled trial is to be done, the patient population for the control group should be as similar as possible to the population for the experimental group. This includes similar general health status, similar general medical care, and use of the same diagnostic and screening procedures. Also, the primary outcome should be defined and assessed the same way in both groups and be objective so that results from the experimental group have the same interpretation as results from the historical control group. Results of such trials must always be interpreted cautiously because of the potential for bias.
3.1 Historical Control from the Medical Literature
Historically controlled trials are usually single-arm trials without specific control groups. Rather, these trials are controlled
in the sense that statistical hypotheses are based on estimates from the literature: instead of being compared with a specific set of patients, patients on the new treatment are assessed against a fixed value. For instance, if the success rate of the standard treatment in the literature is consistently 20%, a test may be done to ascertain whether the percentage of patients with success on the new treatment is significantly greater than 20%. Or if a particular time-to-event distribution is hypothesized, a 1-sample logrank test (20) might be used to test superiority. Such trials are conducted for reasons such as a severely limited patient population or a need for preliminary assessment before planning a large definitive trial. This approach works best if historical estimates are well characterized and are stable over time with low variability. This may be the case for uniformly nonresponsive disease with no effective treatment, resulting in low variability in patient outcomes. There should also be no recent treatment improvements, changes in staging definitions, improvements in diagnostic procedures, or changes in the way primary outcome is assessed so that results will not be confounded by temporal changes. Of course, definitions, treatments, and methods typically do not remain stable over time, and uniformity is not common, so historical estimates for the same treatment in ostensibly the same patient population may vary substantially. It is often difficult to ascertain which estimates may be appropriate for assessing a new treatment. Considering the high potential for bias of unknown magnitude or direction, in most circumstances a single-arm trial will not provide a definitive answer to the question of whether the new treatment is an improvement over the standard. In the past, the probability of success for various types of cancer was uniformly dismal, so this setting provides an example of feasible use of this approach. With recent treatment and diagnostic advances, old assumptions no longer hold, so the approach is becoming less informative.
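As a small illustration of the single-arm comparison against a fixed historical value described above, the sketch below tests whether an observed success proportion exceeds a historical rate of 20%. The counts are hypothetical, and an exact binomial test is only one of several reasonable choices.

```python
# Hypothetical single-arm trial: test whether the success rate exceeds
# a historical (literature-based) rate of 20%.
from scipy.stats import binomtest

n_patients = 45          # hypothetical sample size
n_successes = 16         # hypothetical number of responders
historical_rate = 0.20   # fixed value taken from the literature

# One-sided exact binomial test of H0: p = 0.20 versus H1: p > 0.20
result = binomtest(n_successes, n_patients, p=historical_rate,
                   alternative="greater")

print(f"Observed success rate: {n_successes / n_patients:.2f}")
print(f"Exact one-sided P-value: {result.pvalue:.4f}")
print(f"95% CI for the success probability: {result.proportion_ci(confidence_level=0.95)}")
```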
Other approaches to controlling early studies of new regimens in cancer are becoming more common as variability in treatment outcome increases and as previously stable historical rates improve.
3.2 Specific Historical Control Groups
Trials with control groups consisting of a specific set of past patients are sometimes designed. For this type of trial, individual patient data from the control group are used for comparison with patients on the test treatment. Such trials may be justified when the probability of success and expected benefit of therapy are both considered to be so high that ethical considerations dictate a current control arm would not be in the best interests of patients. A limited patient population may also be a reason to design a historically controlled trial. A carefully chosen specific control group potentially will be better than using an estimate from the literature but will still be subject to selection biases. Known prognostic factors should be available for use in the analysis of this type of trial, and only large differences after suitable adjustment for prognostic factors should be accepted as evidence of improvement due to experimental therapy. An example illustrating how difficult it is to select a suitable specific historical control is provided by a sequence of four Southwest Oncology Group studies in multiple myeloma. The trials had the same eligibility criteria, and largely the same institutions participating on the trials. The overall survival estimates by study were remarkably similar across all four trials. There were no significant improvements demonstrated for any of the test treatments. This stability over time would suggest that use of a prior control arm for assessing the next treatment might be reasonable. However, when the identically treated control arms for each of these four trials were compared, the results appeared different, and the test of differences was suggestive (P = 0.07) of heterogeneity among the arms (21). Comparison of a new regimen with the worst of these control arms could have resulted in a spurious positive result. If comparability cannot be counted on in ideal cases such as this, then it is unlikely to apply in less ideal cases.
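The published comparison of the four identically treated myeloma control arms used survival methods; as a simplified, purely illustrative stand-in, the sketch below applies a chi-squared test of homogeneity to hypothetical two-year survival counts across four control arms. All numbers are invented for illustration only.

```python
# Illustrative check of heterogeneity across four historical control arms.
# Counts are hypothetical; the published myeloma comparison used
# time-to-event methods rather than a simple chi-squared test.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: control arms of four successive trials
# Columns: [alive at 2 years, dead by 2 years]
counts = np.array([
    [60, 90],
    [72, 88],
    [55, 95],
    [75, 75],
])

chi2, pvalue, dof, expected = chi2_contingency(counts)
print(f"Chi-squared = {chi2:.2f} on {dof} df, P = {pvalue:.3f}")
# A small P-value suggests the "identical" control arms differ more than
# chance alone would explain, cautioning against reuse as historical controls.
```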
One should assume that there will be systematic differences between the chosen control group and the test treatment group no matter how carefully they are chosen. The myeloma trials also provide an example of how poorly chosen historical control groups can mislead. A specific comparison of a high-dose therapy regimen with one of the trial arms indicated a striking benefit due to the experimental regimen. It is tempting to conclude in such cases that not all of the difference could be due to bias. However, patients must be in better health to receive a high-dose regimen than to receive the standard treatment. When the comparison was restricted to younger patients with good renal function, a much smaller difference remained (21). A randomized trial was deemed necessary, and it did not demonstrate a benefit due to high-dose therapy, despite the initial promising comparison (22). It should be noted that design considerations are different when a retrospective control group is used instead of a prospective control. In this case, sample size considerations are a function of the results of the control group, and they may also account for covariate adjustment (23–25).
4 TYPES OF CONTROL GROUPS: RANDOMIZED CONTROLS
A control versus experimental comparison allows an assessment of whether outcomes are different in the experimental group. Randomization allows the possibility of concluding that the difference is actually caused by the test treatment, due to the elimination of biases in treatment assignment. Although randomization is necessary, it may not be sufficient for attributing differences to the test treatment. A poorly conducted randomized trial may still result in substantial bias. Some potential sources of bias in the comparison of groups are structural and can be avoided relatively easily. For instance, outcome assessment schedules should be the same for both the control group and the test group. If, for instance, time to failure is assessed less frequently in one group, then results will be biased in favor of this group. Methods of assessment should also be the same. If the method of assessment for failure in one group is more sensitive, then the
results will be biased against this group. Criteria for inclusion in the analysis should also be the same. If information is available in one group but not the other, then this information should not be used to exclude patients from analysis. For example, disease information in a surgical test group might identify patients unlikely to benefit from treatment, but these cannot be excluded from analysis because similarly unsuitable patients from the control group will not be excluded. Other sources of bias are not so easily eliminated. To the extent these differ according to treatment group, the treatment comparison is compromised. For instance, patients in the control group may be less compliant with trial requirements than those in the test group or may drop out of the trial altogether, potentially resulting in worse than expected outcome. Or if the test treatment includes over-the-counter agents, patients in the control group may treat themselves with the agents, potentially resulting in a better than expected outcome. Subjective outcomes are particularly problematic because assessments are easily influenced by knowledge of the treatment assignment. Investigators may overestimate improvement in the test group in anticipation or hope of benefit from a promising new treatment, or adverse events may be more likely to be attributed to treatment in the experimental group compared with the control group. Patients are also subject to the well-known placebo effect. A proportion of patients will report improvement in subjective disease symptoms or experience of treatment side effects whether or not the treatment is active (26). For example, trials of venlafaxine in sexual dysfunction (27), hot flashes (28), panic disorder (29), generalized anxiety disorder (30), and migraine pain (31) all noted improvements with placebo, sometimes less than the effect of venlafaxine, sometimes not. Or consider an antiemetic trial of placebo versus prochlorperazine versus tetrahydrocannabinol (THC, the active marijuana component). In this trial, sedation side effects of treatment were reported in 46% of placebo patients, and ''highs'' were reported in 12% of prochlorperazine patients (32), presumably due in part to anticipation of being on
THC. In a study for which the placebo effect occurs mainly in the experimental group, an improvement in the experimental group may be observed even when the test treatment is inactive.
4.1 Untreated Control Group, Randomized Trial
The category of untreated controls includes control groups for which patients receive the same palliative care or routine monitoring as patients in the experimental group, but the control group receives no additional treatment while the experimental group receives the test treatment. A trial with this type of control group has potential for all of the biases previously discussed. Problems may be particularly acute in this type of trial because of the no-treatment control. For instance, patients may have little motivation to return to the clinic for routine visits, so outcomes may not be carefully collected. As well, the placebo effect will occur only in the experimental group.
4.2 Standard Treatment Control Group, Add-On Randomized Trial
When not treating is considered unethical or unfeasible, one option may be to treat both the control group and the experimental group with standard treatment and to add the test treatment in the experimental group. Bias issues are similar to those for an untreated control group. Use of this type of control group allows for assessment of improvement over standard treatment due to the test treatment. However, conclusions cannot be made concerning the usefulness of the test treatment alone. Because of potential treatment synergy or inhibition, improvement over the standard treatment does not prove single-agent activity, and lack of improvement over the standard treatment does not disprove single-agent activity.
4.3 Placebo-Treated Control Group, Randomized Trial
The placebo-treated control category includes control groups for which patients receive the same palliative treatment or routine monitoring as the experimental group plus a placebo
while the experimental group receives the test treatment. Patients are given identical-appearing treatments in each treatment arm to mask knowledge of the assigned treatment. Blinded placebo-controlled trials are done to reduce the potential for biases related to knowledge of the treatment assignment. For patients, compliance is enhanced, and supplemental treatments, while not necessarily eliminated, should at least be balanced across the groups. For investigators, the objectivity of outcome assessments is improved. In addition, use of a placebo will control for the placebo effect. Both groups will experience the placebo effect, so differences between the groups will be due to the active treatment. Sometimes only patients are blinded (single blind), but preferably both patients and investigators will be blinded (double blind) to avoid bias from both sources. Although it would seem best always to have a placebo-treated control group, it may not always be practical or ethical. Blinded placebo-controlled trials are resource intensive. Significant time and money are needed for manufacturing of the placebo; for labeling, shipping, and tracking the coded placebos and active agents; for setting up mechanisms for distribution with pharmacies; and for arranging for emergency unblinding. If the outcomes are objective and other problems associated with knowledge of the treatment assignment are anticipated to be minor, blinding may be deemed unnecessary. In other cases, it may be impossible to blind, such as when a treatment has a distinctive side effect that cannot be reproduced in a placebo. In yet other cases, the placebo treatment may entail too much risk to the patient. Sham surgeries are particularly controversial, as already noted.
4.4 Placebo-Treated Control Group, Add-On Randomized Trial
For an add-on randomized trial, patients in the control group can receive standard treatment and a placebo that appears identical to the test treatment while patients in the experimental group receive the standard plus the test treatment. The issues are the same as for a placebo-treated control group. Again, use of this type of control group allows for
assessment of improvement over the standard treatment due to the test treatment, but it would not address the usefulness of the test treatment alone.
4.5 Active Control Group
When no-treatment or placebo-only control groups are not appropriate and when adding the test treatment to the standard is not of interest, then an active control trial may be done. The control group in this type of trial receives standard treatment while the experimental group receives the test treatment. The test treatment in this case may be a single agent, a combination of therapies, different ways of administering treatment, new schedules, or other variations that cannot be described simply as standard plus new. Double placebos are sometimes used to mask treatment assignment for these trials. Risks and benefits of forgoing proven useful treatment must be assessed carefully in this setting. Such trials are appropriate when standard treatment is useful but not fully effective and a delay in receiving standard treatment is acceptable. The aim of an active control trial may be to show superiority of the test treatment or to show equivalence. A superiority trial will allow for assessment of the difference between control and the test treatment. However, unlike an add-on trial, if the test treatment is a combination of therapies, the cause of any difference will likely not be attributable to a particular component of the combination. Equivalence trials aim to show that the test treatment is as good as the standard. ''As good as'' is difficult to establish. Equivalence trials need both active control groups and historical information because results of the standard versus test comparison must not only demonstrate that the test treatment is similar to control, but also that it would have been superior to placebo had the trial been placebo-controlled. Design considerations for such trials are complex, and they are the subject of ongoing research and discussion (33–35). Lack of accurate estimates of the benefit of active control compared with placebo is a common challenge in this setting. As for other historical control settings, the assumption of no change over time is particularly troublesome.
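One common way to operationalize the "as good as" requirement is the fixed-margin approach sketched below: a non-inferiority margin is derived from historical placebo-controlled estimates of the active control's effect, and the new-versus-control comparison is then judged against that margin. All numbers here are hypothetical, and this is only one of several approaches discussed in the design literature cited in this section.

```python
# Hypothetical fixed-margin non-inferiority sketch on the log hazard ratio scale.
import math

# Historical meta-analysis of active control vs placebo (hypothetical):
# hazard ratio 0.70 with 95% CI (0.58, 0.85); the control is beneficial (HR < 1).
hist_hr_upper = 0.85                 # conservative (least favorable) bound
m1 = -math.log(hist_hr_upper)        # control-vs-placebo effect assumed preserved
m2 = 0.5 * m1                        # margin preserving at least 50% of that effect

# New trial, test treatment vs active control (hypothetical result):
# hazard ratio 1.05 with 95% CI (0.90, 1.22).
new_hr_ci_upper = 1.22
log_upper = math.log(new_hr_ci_upper)

print(f"Non-inferiority margin on log-HR scale: {m2:.3f} (HR {math.exp(m2):.3f})")
print(f"Upper 95% CI bound, test vs control: {log_upper:.3f} (HR {new_hr_ci_upper})")
print("Non-inferior at this margin:", log_upper < m2)
```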
4.6 Multiple Control Groups
In some studies, more than one control group may be used. When appropriate, using both a placebo-only control group and an active control group can be useful for addressing both the absolute effect of a new treatment and the effect relative to the standard treatment. It may also provide information on why a trial fails to show usefulness of a new treatment. If the standard group also fails to show improvement compared with the placebo, the failure may be due to the trial rather than to lack of efficacy of the new treatment.
5 CONCLUSION
Choice of the control group in a clinical trial is critical to the success of the trial. A poor choice may result in biases that severely compromise interpretation of results. A blinded randomized trial with a placebo-treated control group will provide the most definitive evidence concerning the usefulness of a new treatment. If it is not possible to conduct this type of trial, the best control group under the circumstances should be used. The trial should be randomized if feasible. For any choice, every effort should be made to ensure the control group is as similar as possible to the experimental group. Reliability of conclusions about new treatments depends on it!
REFERENCES
1. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline: E10 Choice of Control Group and Related Issues in Clinical Trials. Current Step 4 version, July 20, 2000. Available at: http://www.ich.org/LOB/media/MEDIA486.pdf
2. J. P. Boissel, Impact of randomized clinical trial on medical practices. Control Clin Trials. 1989; 10: 120S–134S.
3. C. L. Bernard, Introduction à l'Étude de la Médecine Expérimentale. First published in 1866. London: Garnier-Flammarion, 1966.
4. D. DeMoulin, A Short History of Breast Cancer. Dordrecht, the Netherlands: Kluwer, 1989.
5. D. Freedman, From association to causation: some remarks on the history of statistics. Stat Sci. 1999; 14: 243–258.
6. H. O. Lancaster, Quantitative Methods in Biological and Medical Sciences. New York: Springer-Verlag, 1994.
7. C. Stuart and D. Guthrie, eds., Lind's Treatise on Scurvy. Edinburgh: University Press, 1953.
8. G. Cook, Scurvy in the British Mercantile Marine in the 19th century, and the contribution of the Seamen's Hospital Society. Postgrad Med J. 2004; 80: 224–229.
9. D. Thomas, Sailors, scurvy and science. J Roy Soc Med. 1997; 90: 50–54.
10. W. Silverman, Doctoring: From art to engineering. Control Clin Trials. 1992; 13: 97–99.
11. W. Silverman and I. Chalmers, Sir Austin Bradford Hill: an appreciation. Control Clin Trials. 1991; 10: 1–10.
12. J. Smith, Patenting the Sun: Polio and the Salk Vaccine Trial. New York: William Morrow, 1990.
13. D. Echt, P. Liebson, L. Mitchell, R. Peters, D. Obias-Manno, et al., and the CAST investigators. Mortality and morbidity in patients receiving encainide, flecainide or placebo: the Cardiac Arrhythmia Suppression Trial. N Engl J Med. 1991; 324: 781–788.
14. World Medical Association General Assembly. World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects. J Int Bioethique. 2004; 15: 124–129.
15. Protection of human subjects; Belmont Report: notice of report for public comment. Fed Regist. 1979; 44: 23191–23197.
16. S. Piantadosi, Clinical Trials: A Methodologic Approach, 2nd ed. New York: Wiley, 2005.
17. C. W. Olanow, C. Goetz, J. Kordower, A. J. Stoessl, V. Sossi, et al., A double-blind controlled trial of bilateral fetal nigral transplantation in Parkinson's disease. Ann Neurol. 2003; 54: 403–414.
18. T. Freeman, D. Vawtner, P. Leaverton, J. Godbold, R. Hauser, et al., Use of placebo surgery in controlled trials of a cellular-based therapy for Parkinson's disease. N Engl J Med. 1999; 341: 988–992.
19. R. Macklin, The ethical problems with sham surgery in clinical research. N Engl J Med. 1999; 341: 992–996.
20. R. Woolson, Rank tests and a one-sample logrank test for comparing observed survival data to a standard population. Biometrics. 1981; 37: 687–696.
21. S. Green, J. Benedetti, and J. Crowley, Clinical Trials in Oncology. Boca Raton, FL: Chapman and Hall/CRC Press, 2003.
22. B. Barlogie, R. Kyle, K. Anderson, P. Greipp, H. Lazarus, et al., Standard chemotherapy compared with high-dose chemoradiotherapy for multiple myeloma: final results of phase III US Intergroup Trial S9321. J Clin Oncol. 2006; 24: 929–936.
23. R. Makuch and R. Simon, Sample size consideration for non-randomized comparative studies. J Chronic Dis. 1980; 33: 175–181.
24. D. Dixon and R. Simon, Sample size consideration for studies comparing survival curves using historical controls. J Clin Epidemiol. 1988; 41: 1209–1213.
25. J. O'Malley, S. L. Normand, and R. Kuntz, Sample size calculation for a historically controlled clinical trial with adjustment for covariates. J Biopharm Stat. 2002; 12: 227–247.
26. A. Shapiro and K. Shapiro, The Powerful Placebo: From Ancient Priest to Modern Physician. Baltimore: Johns Hopkins University Press, 1997.
27. S. Kilic, H. Ergin, and Y. Baydinc, Venlafaxine extended release for the treatment of patients with premature ejaculation: a pilot, single-blind, placebo-controlled, fixed-dose crossover study on short-term administration of an antidepressant drug. Int J Androl. 2005; 28: 47–52.
28. M. Evans, E. Pritts, E. Vittinghoff, K. McClish, K. Morgan, and R. Jaffe, Management of postmenopausal hot flushes with venlafaxine hydrochloride: a randomized, controlled trial. Obstet Gynecol. 2005; 105: 161–166.
29. M. Pollack, U. Lepola, H. Koponen, N. Simon, J. Worthington, et al., A double-blind study of the efficacy of venlafaxine extended-release, paroxetine, and placebo in the treatment of panic disorder. Depress Anxiety. 2007; 24: 1–14.
30. A. Gelenberg, R. B. Lydiard, R. Rudolph, L. Aguiar, J. T. Haskins, and E. Salinas, Efficacy of venlafaxine extended-release capsules in nondepressed outpatients with generalized anxiety disorder: a 6-month randomized controlled trial. JAMA. 2000; 283: 3082–3088.
31. S. Ozyalcin, G. K. Talu, E. Kiziltan, B. Yucel, M. Ertas, and R. Disci, The efficacy and safety of venlafaxine in the prophylaxis of migraine. Headache. 2005; 45: 144–152.
32. S. Frytak, C. Moertel, J. O'Fallon, J. Rubin, E. Cregan, et al., Delta-9-Tetrahydrocannabinol as an antiemetic for patients receiving cancer chemotherapy. Ann Int Med. 1979; 91: 825–830.
33. R. D'Agostino, J. Massaro, and L. Sullivan, Non-inferiority trials: design concepts and issues—the encounters of academic consultants in statistics. Stat Med. 2003; 22: 169–186.
34. S. Durrleman and P. Chaikin, The use of putative placebo in active control trials: two applications in a regulatory setting. Stat Med. 2003; 22: 941–952.
35. Y. Tsong, S. J. Wang, H. M. Hung, and L. Cui, Statistical issues on objective, design and analysis of non-inferiority active-controlled clinical trial. J Biopharm Stat. 2003; 13: 29–41.
CROSS-REFERENCES
Adequate and well-controlled trial; Belmont Report; Historical control; Placebo-controlled trial; Active-controlled trial
COOPERATIVE NORTH SCANDINAVIAN ENALAPRIL SURVIVAL STUDY
STEVEN M. SNAPINN, Biostatistics, Amgen, Inc., Thousand Oaks, California
Heart failure, also known as congestive heart failure, is a condition that occurs when the heart is unable to pump sufficient blood to meet the body's oxygen demand. Signs and symptoms include dyspnea, or shortness of breath, particularly when lying down; edema, or the buildup of fluid; and cough. It is a very serious condition and, in its most severe form, is associated with a high mortality rate. Heart failure can be caused by a number of factors such as hypertension, ischemic heart disease, and cardiomyopathy. As of the mid-1980s, the standard treatment for heart failure usually included digitalis and diuretics. There was also growing evidence that direct-acting vasodilator therapy would be beneficial. However, a meta-analysis by Furberg and Yusuf (1) found little evidence for improved survival with this therapy, and instead suggested that angiotensin-converting enzyme (ACE) inhibition (see also the article on ACE inhibitors) held the most promise. This relatively new class of drugs, including enalapril, captopril, and lisinopril, was known to be effective in the treatment of hypertension and had been associated with symptomatic improvement in patients with heart failure. However, the effects of these agents on survival were unknown.
1 OBJECTIVES
The primary objective of the Cooperative North Scandinavian Enalapril Survival Study, also known as CONSENSUS, was to study the effect on mortality of enalapril compared with placebo, when added to conventional therapy, in patients with severe heart failure (2). Other objectives included evaluating the safety of enalapril, its effect on symptoms of heart failure, and its effects on neurohormones known to be associated with mortality in these patients.
2 STUDY DESIGN
CONSENSUS was a randomized, double-blind, placebo-controlled, parallel-group trial (2). To be eligible, patients had to have the most severe form of heart failure, New York Heart Association (NYHA) class IV, which means that the symptoms of heart failure were present at rest. Patients were receiving optimal treatment for heart failure at the start of the trial, including digitalis and diuretics, and continued to receive these treatments during the trial. In addition, patients were randomized to receive either enalapril or placebo. The starting dose was 5 mg twice a day, and could be titrated up to 20 mg twice a day, depending on clinical response. Note that early in the trial the occurrence of symptomatic hypotension in some patients led the investigators to reduce the starting dose to 2.5 mg daily for some high-risk patients. The primary endpoint of the trial was death by any cause within 6 months of randomization. Secondary endpoints included 12-month mortality and mortality during the entire trial period. The sample size was calculated to be 400 patients, 200 per treatment group, based on the assumption that the 6-month mortality rate would be 40% in the placebo group, and would be reduced to 24% by enalapril. This sample size provides 90% power at a two-sided significance level of 5%. Differences in mortality between treatment groups were to be analyzed using life-table methods, and the analysis was to be by intention-to-treat; that is, the survival information for each patient from the date of randomization to the date of death or study termination was to be included in the analysis. Although the trial was to be monitored by an ethical review committee, there was no formal rule governing the committee's decisions. As its name suggests, CONSENSUS took place in three countries in the north of Scandinavia: Finland, Norway, and Sweden (see also Multinational [Global] Trial). The CONSENSUS Trial Study group consisted of the set of investigators in these three countries, a steering committee chaired by John Kjekshus, an ethical review committee
(sometimes referred to in other trials as a data and safety monitoring board or a data monitoring committee) chaired by Jacobus Lubsen, and an administrative office run by the study's sponsor, Merck Research Laboratories.
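A rough check of the sample-size statement above can be made with the usual normal-approximation formula for comparing two proportions (40% versus 24%, two-sided alpha of 0.05, 90% power). This simple sketch gives roughly 175–180 patients per group; the exact method and any allowances behind the planned 200 per group are not stated here.

```python
# Normal-approximation sample size per group for comparing two proportions,
# using the design assumptions stated above (placebo 40%, enalapril 24%,
# two-sided alpha = 0.05, power = 90%).
import math
from scipy.stats import norm

p1, p2 = 0.40, 0.24          # assumed 6-month mortality rates
alpha, power = 0.05, 0.90

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
pbar = (p1 + p2) / 2

n_per_group = ((z_alpha * math.sqrt(2 * pbar * (1 - pbar))
                + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
               / (p1 - p2) ** 2)

print(f"Approximate sample size per group: {math.ceil(n_per_group)}")
```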
3 RESULTS
3.1 Interim Analysis Results
CONSENSUS terminated ahead of schedule based on a recommendation from the ethical review committee (2). Their opinion was that the results favoring enalapril were so strong that continuation of the trial would be of limited scientific interest and could not be justified from an ethics perspective. An editorial in the Lancet (3) commented that, while correct, the decision to stop CONSENSUS was highly unusual, and that it is desirable for trials to run their entire course. This is especially true given that CONSENSUS had no formal stopping rule. Lubsen (4) provided a detailed review of the ethical review committee's deliberations. Table 1 contains the interim mortality results reviewed by the committee. The members were informed by telephone of the June 27, 1986, results and agreed to a meeting to discuss them. At the meeting, held on September 14, 1986, the sponsor provided a more current update: the 6-month mortality rates were now 24 and 49% (P = 0.0002). Despite this large difference, the committee felt that they needed additional information from the sponsor on patient characteristics. This required a large data entry effort, and the information was not available until December 7, 1986. At this meeting, the 6-month mortality rates were 27 and 48% (P = 0.001), and a review of the baseline characteristics showed balance between groups and a consistent treatment effect among important subgroups. Therefore, the committee decided to recommend termination of the trial, and they authorized the committee chair to inform the steering committee. This took place on December 13, 1986; the steering committee accepted this recommendation, set December 14, 1986, as the study termination date, and informed the clinical centers.
3.2 Final Analysis Results
The final results of CONSENSUS were reported in 1987 (2) and 1988 (5). A total of 253 patients were randomized, 127 to enalapril and 126 to placebo. The mean age was approximately 70 years, approximately 70% of the patients were men, and most patients had an etiology of ischemic heart disease. The median duration of heart failure before enrollment was approximately 48 months. The final mortality results are summarized in Table 2 and displayed in Figure 1, which shows the Kaplan-Meier curves for overall mortality (see also the article on Kaplan-Meier plots). Overall, 68 placebo patients (54%) had died, compared with 50 enalapril patients (39%, P = 0.003 using life-table methods). The benefit of enalapril was restricted to deaths from progression of heart failure: 44 placebo patients and 22 enalapril patients died of this cause, and 24 placebo patients and 28 enalapril patients died of other causes (primarily sudden cardiac death). Patients treated with enalapril also experienced symptomatic benefit.
Table 1. Interim Mortality Results (no. of deaths / no. randomized)
January 1, 1986: enalapril 8/52; placebo 15/51
April 1, 1986: enalapril 16/77; placebo 24/72
May 1, 1986: enalapril 20/84; placebo 31/78
June 27, 1986: enalapril 26/100; placebo 44/93
September 14, 1986: enalapril 28/101; placebo 52/99
December 7, 1986: enalapril 44/124; placebo 66/120
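As a quick way to read Table 1, the sketch below applies Fisher's exact test to the crude death counts at the December 7, 1986 review. Note that the ethical review committee's decision was based on 6-month life-table mortality rates, not on this crude comparison.

```python
# Crude comparison of the December 7, 1986 interim counts from Table 1.
# (The ethical review committee actually reviewed life-table mortality rates.)
from scipy.stats import fisher_exact

enalapril_deaths, enalapril_n = 44, 124
placebo_deaths, placebo_n = 66, 120

table = [
    [enalapril_deaths, enalapril_n - enalapril_deaths],  # [deaths, alive]
    [placebo_deaths, placebo_n - placebo_deaths],
]

odds_ratio, pvalue = fisher_exact(table)
print(f"Crude death rates: {enalapril_deaths/enalapril_n:.0%} vs {placebo_deaths/placebo_n:.0%}")
print(f"Fisher's exact P-value: {pvalue:.4f}")
```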
Table 2. Final Mortality Results
Mortality within 6 months: placebo (n = 126) 55 (44%); enalapril (n = 127) 33 (26%); reduction in relative risk 40%; P = 0.002 (life-table analysis)
Mortality within 1 year: placebo 66 (52%); enalapril 46 (36%); reduction in relative risk 31%; P = 0.001
Total mortality: placebo 68 (54%); enalapril 50 (39%); reduction in relative risk 27%; P = 0.003
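The "reduction in relative risk" column of Table 2 can be reproduced from the raw counts, as in this short check (for example, for total mortality, 1 − (50/127)/(68/126) ≈ 0.27).

```python
# Reproduce the relative risk reduction column of Table 2 from the raw counts.
rows = {
    "Mortality within 6 months": (55, 33),
    "Mortality within 1 year":   (66, 46),
    "Total mortality":           (68, 50),
}
n_placebo, n_enalapril = 126, 127

for label, (d_placebo, d_enalapril) in rows.items():
    rr = (d_enalapril / n_enalapril) / (d_placebo / n_placebo)
    print(f"{label}: relative risk reduction = {1 - rr:.0%}")
```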
[Figure 1 plots the cumulative probability of death (y-axis) by month since randomization (x-axis, 0–12 months) for the placebo and enalapril groups. Numbers of patients at risk at months 0 through 12: placebo (N = 126) 126, 102, 78, 63, 59, 53, 47, 42, 34, 30, 24, 18, 17; enalapril (N = 127) 127, 111, 98, 88, 82, 79, 73, 64, 59, 49, 42, 31, 26.]
Figure 1. Kaplan-Meier curves comparing total mortality for enalapril and placebo.
In the enalapril group, 54 surviving patients had an end-of-study NYHA class of between I and III, including 16 patients with class I or II; in the placebo group, there were only 27 surviving patients with an end-of-study NYHA class of between I and III and only two patients in class I or II. Enalapril was very well-tolerated; the major side effect associated with enalapril was hypotension.
3.3 Neurohormone Analysis
CONSENSUS provided a wealth of information on neurohormones. Blood samples were obtained from most randomized patients at the time of randomization and after 6 weeks of treatment. The first set of publications focused on the following neurohormones: angiotensin-converting enzyme, angiotensin II, aldosterone, noradrenaline, adrenaline, dopamine, and atrial natriuretic factor (6, 7). In the placebo group, there were strong and statistically significant
associations between several of these neurohormones and mortality: angiotensin II (P < 0.05), aldosterone (P = 0.003), noradrenaline (P < 0.001), adrenaline (P = 0.001), and atrial natriuretic factor (P = 0.003). However, similar associations were not seen in the enalapril group. In addition, the mortality benefit of enalapril appeared stronger among patients with values of these neurohormones above the median. These results suggested that the effect of enalapril on mortality is related to hormonal activation, and to activation of the renin-angiotensin system in particular. Subsequent publications focused on the novel neurohormones N-terminal proatrial natriuretic factor ANF(1-98) and atrial natriuretic peptides ANP(1-98) and ANP(99-126) (8, 9). It was concluded that the magnitude of changes in these neurohormones provides important information on prognosis and on the therapeutic effects of enalapril.
3.4 Long-Term Follow-Up
When CONSENSUS terminated in December of 1986, all surviving patients were removed from blinded study medication but were informed of the reason for terminating the trial and given the option of taking open-label enalapril. In addition, patients continued to be followed for their survival status. The follow-up mortality data were analyzed in two ways. First, the follow-up data were analyzed as if the surviving patients were randomized (using the original randomization schedule) into a new study beginning on the date the blinded portion of the trial terminated, December 15, 1986. Second, the data from the blinded portion of the trial were combined with the follow-up data, as if the trial had never stopped. The first report of additional follow-up came 8.5 months after the end of blinded therapy (10). There were 58 surviving patients from the placebo group, of whom 18 (31%) died during the follow-up period, and 77 surviving patients from the enalapril group, of whom 16 (21%) died during the follow-up period. The next report included 2-year follow-up information (11). By the end of this period, 26 patients from the original placebo group were still alive, compared with 38 patients from the original enalapril group. The final follow-up publication was based on a 10-year follow-up period (12). By the end of the 10-year follow-up period, only five patients were still alive, all of whom had been in the enalapril group during the blinded portion of the trial. This analysis showed that, despite the fact that enalapril was made available to all surviving patients at the termination of the trial, the benefit accrued to patients randomized to enalapril persisted for at least 3.5 years after termination.
4 CONCLUSIONS
CONSENSUS was a milestone for treatment of patients with heart failure (13) and a landmark trial in the history of cardiovascular clinical research (see also the article on disease trials for cardiovascular diseases). The highly significant results and the magnitude of the clinical benefit led to the approval of enalapril for the treatment of severe heart failure and the adoption of ACE inhibitors as a standard treatment for these patients. CONSENSUS was the first heart failure study to show dramatic benefits from ACE inhibition (14, 15). However, CONSENSUS was conducted in a population with severe heart failure and did not answer the question of whether this treatment was beneficial in patients with milder forms of the disease. Subsequent to CONSENSUS, two major trials evaluated the effects of enalapril in patients with NYHA class II–III heart failure. The second Veterans Administration Cooperative Vasodilator-Heart Failure Study (VHeFT-II) compared enalapril with a vasodilator regimen in 804 men. Over 5 years of follow-up, mortality was consistently lower in the enalapril group, but not significantly so (P = 0.08). The Studies of Left Ventricular Dysfunction (SOLVD) Treatment trial compared enalapril and placebo with respect to total mortality in 2569 patients. Cumulative 4-year all-cause mortality among patients randomized to enalapril was 16% lower than in the placebo group (P = 0.004). As a result of these trials, ACE inhibition is now standard treatment for NYHA class II–IV heart failure. Note that there was a later trial with a similar name: Cooperative New Scandinavian Enalapril Survival Study (CONSENSUS II). Although this trial did involve the same experimental treatment, enalapril, it was conducted in patients with an acute myocardial infarction rather than in patients with heart failure. In addition, the letter ''N'' in the acronym refers to ''New'' rather than ''North,'' a reference to the fact that CONSENSUS II included the three countries that were part of CONSENSUS plus Denmark. Therefore, to avoid confusion with CONSENSUS II, CONSENSUS is now sometimes referred to as CONSENSUS I.
REFERENCES
1. C. D. Furberg and S. Yusuf, Effect of drug therapy on survival in chronic congestive heart failure. Am J Cardiol. 1988; 62: 41A–45A.
2. CONSENSUS Trial Study Group. Effects of enalapril on mortality in severe congestive heart failure: results of the Cooperative North Scandinavian Enalapril Survival Study (CONSENSUS). N Engl J Med. 1987; 316: 1429–1435.
3. Lancet [editorial]. Consensus on heart failure management? Lancet. 1987; 330: 311–312.
4. J. Lubsen, for the CONSENSUS Ethical Review Committee. Appendix: monitoring methods, considerations, and statement of the Cooperative North Scandinavian Enalapril Survival Study (CONSENSUS) Ethical Review Committee. Am J Cardiol. 1988; 62: 73A–74A.
5. K. Swedberg and J. Kjekshus, Effects of enalapril on mortality in severe congestive heart failure: results of the Cooperative North Scandinavian Enalapril Survival Study (CONSENSUS). Am J Cardiol. 1988; 62: 60A–66A.
6. K. Swedberg, P. Eneroth, J. Kjekshus, and S. Snapinn, for the CONSENSUS Trial Study Group. Effects of enalapril and neuroendocrine activation on prognosis in severe congestive heart failure (follow-up of the CONSENSUS Trial). Am J Cardiol. 1990; 66: 40D–45D.
7. K. Swedberg, P. Eneroth, J. Kjekshus, and L. Wilhelmsen, for the CONSENSUS Trial Study Group. Hormones regulating cardiovascular function in patients with severe congestive heart failure and their relation to mortality. Circulation. 1990; 82: 1730–1736.
8. C. Hall, J. Kjekshus, P. Eneroth, and S. Snapinn, The plasma concentration of N-terminal proatrial natriuretic factor ANF(1-98) is related to prognosis in severe heart failure. Clin Cardiol. 1994; 17: 191–195.
9. S. V. Eriksson, K. Caidahl, C. Hall, P. Eneroth, J. Kjekshus, et al. Atrial natriuretic peptide ANP(1-98) and ANP(99-126) in patients with severe chronic congestive heart failure: relation to echocardiographic measurements. A subgroup analysis from the Cooperative North Scandinavian Enalapril Survival Study (CONSENSUS). J Cardiac Failure. 1995; 1: 109–116.
10. K. Swedberg and J. Kjekshus. Effect of enalapril on mortality in congestive heart failure: follow-up survival data from the CONSENSUS Trial. Drugs. 1990; 39(Suppl 4): 49–52.
11. J. Kjekshus, K. Swedberg, and S. Snapinn, for the CONSENSUS Trial Group. Effects of enalapril on long-term mortality in severe congestive heart failure. Am J Cardiol. 1992; 69: 103–107.
12. K. Swedberg, J. Kjekshus, and S. Snapinn, for the CONSENSUS Investigators. Long-term survival in severe heart failure in patients treated with enalapril: ten-year follow-up of CONSENSUS I. Eur Heart J. 1999; 20: 136–139.
13. G. A. J. Riegger, Lessons from recent randomized controlled trials for the management of congestive heart failure. Am J Cardiol. 1993; 71: 38E–40E.
14. W. B. Hood, Role of converting enzyme inhibitors in the treatment of heart failure. J Am Coll Cardiol. 1993; 22(Suppl A): 154A–157A.
15. J. B. Young, Angiotensin-converting enzyme inhibitors in heart failure: new strategies justified by recent clinical trials. Int J Cardiol. 1994; 43: 151–163.
16. P. Sleight, Angiotensin II and trials of cardiovascular outcomes. Am J Cardiol. 2002; 89(Suppl): 11A–17A.
CROSS-REFERENCES
Multinational (global) trial
Kaplan-Meier plot
Data and safety monitoring board
Angiotensin-converting enzyme (ACE) inhibitors
Disease trials for cardiovascular diseases
Cooperative Studies Program, US Department of Veterans Affairs
The Department of Veterans Affairs (VA) is in a unique position in the US, and perhaps the world, in conducting multicenter clinical trials. This is due to several factors: (1) its network of 172 medical centers geographically dispersed throughout the country, under one administrative system; (2) a dedicated group of talented physicians and other health professionals serving at these medical centers; (3) a loyal and compliant patient population of nearly four million veterans; (4) a system of experienced coordinating centers that provide biostatistical, data processing, pharmacy and administrative support; and (5) a research service that recognizes the uniqueness and importance of the program and strongly supports its mission. The VA has conducted multicenter clinical trials for more than half a century, beginning with its first trial, which was organized in 1945 to evaluate the safety and efficacy of chemotherapeutic agents for tuberculosis. This article describes the history of the program, its organization and operating procedures, some of its noteworthy trials, and current challenges and opportunities.
History of the Cooperative Studies Program (CSP)
The first cooperative clinical trial conducted by the VA was a joint study with the US Armed Forces to evaluate the safety and efficacy of chemotherapeutic agents for tuberculosis. Drs John B. Barnwell and Arthur M. Walker initiated a clinical trial to evaluate various drugs in the treatment of tuberculosis, including the antibiotic streptomycin [3, 48]. The challenge of caring for 10 000 veterans suffering from the disease following World War II was the impetus for the study. Not only did the results revolutionize the treatment of tuberculosis, they also led to the development of an innovative method for testing the effectiveness of new therapies – the cooperative clinical trial. A VA Program for conducting cooperative studies in psychiatry was started in 1955 and supported by a newly developed Central Neuropsychiatric
Research Laboratory at the Perry Point, Maryland VA Medical Center (VAMC). This Program emphasized the design and conduct of randomized trials for the treatment of chronic schizophrenia. Trials were completed evaluating the efficacy of prefrontal lobotomy [2], chlorpromazine and promazine [8], phenothiazine derivatives [10], other psychotropic drugs [9, 31], the reduction or discontinuation of medication [6], the combination of medication and group psychotherapy [20], brief hospitalization and aftercare [7], the need for long-term use of antiparkinsonian drugs [30], and intermittent pharmacotherapy [43]. Noteworthy VA cooperative clinical trials in other disease areas were started in the late 1950s and 1960s. A VA cooperative study group on hypertension was started in the 1950s (and still exists today). This group was the first to show that antihypertensive drug therapy reduces the long-term morbidity and mortality in patients with severe [54] and moderate [55] elevations of blood pressure. Other areas researched by the early VA cooperative studies included: use of long-term anticoagulants after myocardial infarction; lipid lowering drugs to prevent myocardial and cerebral infarction; treatment of gastric ulcer disease; efficacy of gamma globulin in posttransfusion hepatitis; analgesics to reduce postoperative pain; surgical treatment of coronary artery disease; the effect of portal caval shunt in esophageal varices; and the effects of radical prostatectomy, estrogens, and orchiectomy in the treatment of prostate cancer. In 1962, the VA developed a concept, novel in Federal Government medical research programs at that time, of providing its investigators access to techniques and specialized help and information essential to their research. Four regional research support centers were established: the Eastern Research Support Center at the West Haven, CT VAMC; the Midwest Research Support Center at the Hines, IL VAMC; the Southern Research Support Center at the Little Rock, AR VAMC; and the Western Research Support Center at the Sepulveda, CA VAMC (see Data Management and Coordination). Individual investigators were assisted in such areas as research design, statistical methods, data management, computer programming, and biomedical engineering. The early VA cooperative studies were coordinated by VA Central Office staff in Washington, DC, by these regional research support
centers, and by contracts with university coordinating centers. The program was led by Mr Lawrence Shaw. Beginning in 1972, a special emphasis was placed on the CSP in the VA's Medical Research Service and its budget was quadrupled over the next decade. Under the leadership of James Hagans, MD, PhD, the program's current organization and structure were developed and codified in the Cooperative Studies Program Guidelines [1]. This included the establishment of four statistical/data processing/management coordinating centers and a research pharmacy coordinating center solely dedicated to conducting cooperative studies; central human rights committees attached to each of the statistical coordinating centers; a standing central evaluation committee for the review of all new proposals for VA cooperative studies and all ongoing studies every three years; and clearly defined procedures for the planning, implementation, conduct, and closeout of all VA cooperative studies. The Central Neuropsychiatric Research Laboratory at the Perry Point, MD VAMC; the Eastern Research Support Center at the West Haven, CT VAMC; the Midwest Research Support Center at the Hines, IL VAMC; and a new center at the Palo Alto, CA VAMC were established as the four new VA Cooperative Studies Program Coordinating Centers (CSPCCs). The Cooperative Studies Program Clinical Research Pharmacy Coordinating Center (CSPCRPCC) was established at the Washington, DC VAMC, but later relocated to the Albuquerque, NM VAMC in 1977. Daniel Deykin, MD was the first person to head simultaneously the VA research programs both in Health Services Research and Development and the CSP, from 1985 to 1996. He took advantage of this opportunity to promote the development of a series of multicenter clinical trials in the organization and delivery of health services. These trials represented unique challenges in design and conduct. Some of these trials have recently been completed and are in the process of being published [25, 44, 57]. In 1996, John Feussner, MD, MPH was appointed as the VA's Chief Research & Development Officer, and simultaneously assumed leadership of the CSP. Up until the time Dr Feussner was appointed, the VA Research Service was composed of three major research programs – Medical Research (of which the CSP was a part), Rehabilitation Research & Development, and Health Services Research & Development.
Dr Feussner moved the CSP out of the VA Medical Research Service and elevated the Program to an equal level with the three other major VA research programs. New emphases brought to the Program by Dr Feussner include: initiation of a strategic planning process; more integration and interdependence of the coordinating centers; institution of good clinical practices and standard operating procedures at the coordinating centers; pharmaceutical manufacturing; experimentation to improve the process of informed consent [32]; educational programs in clinical research for VA investigators; partnering with industry, National Institutes of Health (NIH), and international clinical trials groups; the development of three new Epidemiology Research and Information Centers at the VAMCs in Seattle, WA, Boston, MA, and Durham, NC [5]; and Intranet and Internet communications. The strategic planning process initiated in 1997 defined the vision, mission, and specific goals for the Program (Table 1).
Organization and Functioning of the CSP
This section describes how a VA cooperative study evolves and the support provided by the VACSP. These aspects of the Program have been reported previously [24, 27, 28].
Table 1 Vision/mission/goals of the VACSP
Vision
• The CSP is a premier research program conducting multicenter studies with world-wide impact on health care
Mission
• To advance the health and care of veterans through education, training, and collaborative research studies that produce innovative and effective solutions to national healthcare problems
Goals
• To enhance the proficiency of CSP staff and CSP partners (chairpersons, participating investigators) in the conduct of multicenter trials
• To enhance the consistency of management support for the CSP
• To increase the flow of new research ideas for cooperative studies
• To increase the application of research products into clinical practice
• To enhance the interdependence of the CSP coordinating centers
• To improve the capabilities of dissemination of research findings
Figure 1 Development of a VA cooperative study (flowchart: an investigator submits a study idea to the CSP Headquarters Office; about 70% of study ideas are disapproved at concept review; approved ideas are assigned to a CSPCC, and to the CSPCRPCC for drug or device studies, for planning meetings, HRC review, and CSEC review, where about 40% of protocols are disapproved)
Planning Request
Initiation of a planning request through the evaluation phase is outlined in Figure 1. The VA Research Program, including the CSP, involves strictly intramural research. To receive VA research funding, the investigator must be at least five-eighths time VA. One of the strengths of the CSP is that most of its studies are investigator-initiated. The research questions come from investigators throughout the VA health care system who are on the front lines in providing health care for veterans.
To start the process, the investigator submits to VA Headquarters a 5–10 page planning request outlining the background of the problem, the hypothesis, a brief overview of the design, and anticipated size of the study. The planning request is given a CSP number to aid in tracking the study through its evolutionary phases. The planning request is sent to four or five independent experts in the field who initially judge the importance and feasibility of the study. If this review is sufficiently positive, the study is put into planning and assigned to one of the CSPCCs (and the CSPCRPCC if it involves drugs or devices) for development of the full proposal. This process has evolved to satisfy two important needs. First, the CSP recognizes that the ability to
come up with a good idea needing a rigorous test does not necessarily carry with it the ability to pull together all the expertise necessary to plan a clinical trial. This was especially true in the early days, when "trialists" were few and far between, and clinical researchers seldom had training in modern statistical trial design. So it is important to provide access to this expertise early in the planning process. However, such aid is expensive and scarce, so it is important not to waste it on ideas that do not show promise. Thus, the second need is for an initial concept review. This has proved to be a very efficient allocation method; about 70% of all initial proposals are not approved to go on to planning, and of the surviving 30%, about two-thirds complete the planning process. Of those that are successfully planned, about three-quarters are approved and funded. Thus, the method helps to avoid the problem of insufficiently developed protocols, while conserving the scarce resources of planning.
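As a rough illustration of the funnel just described, the overall yield from initial idea to funded study can be computed directly from the approximate fractions quoted above; these are not official program statistics.

```python
# Back-of-the-envelope check of the approval funnel described above
# (approximate fractions quoted in the text, not official program statistics).
approved_for_planning = 0.30        # ~70% of initial ideas are not approved
complete_planning     = 2 / 3       # of those planned, about two-thirds finish planning
approved_and_funded   = 0.75        # of completed proposals, about three-quarters are funded

overall_yield = approved_for_planning * complete_planning * approved_and_funded
print(f"Overall yield from idea to funded study: {overall_yield:.0%}")   # about 15%
```

That is, on the order of 15% of submitted ideas eventually become funded cooperative studies.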
Planning Phase
Once the study is approved for planning, the resources of the CSPCCs are applied to the development of the full proposal. Within the coordinating centers, the study is assigned to a specific biostatistician and clinical research pharmacist. These individuals work with the principal proponent in nominating a planning committee which is reviewed and approved by the CSPCC and CSP Directors. The planning committee generally consists of the principal proponent, study biostatistician, study clinical research pharmacist, CSPCC Director, two or three potential participating site investigators, and outside consultants, as needed. The planning committee is funded for two planning meetings. At the first meeting, the basic design features of the study are agreed upon (hypothesis, patient population, treatment groups, primary and secondary endpoints, pharmacologic and drug handling issues, baseline and follow-up data collection and frequency, treatment effect size, sample size, number of sites, duration of study, publication plan, and budget). The full proposal is then written, and at the second planning meeting a draft of the protocol is fine-tuned. Development of a full proposal generally requires six to nine months.
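For illustration only, the following sketch shows the kind of sample-size calculation that might inform the committee's agreement on effect size and sample size; the event rates, power, and significance level are hypothetical and are not taken from any VA study.

```python
# Hypothetical example: sample size per arm for comparing two proportions
# using the standard normal-approximation formula. Event rates are invented.
from scipy.stats import norm

def n_per_arm(p1, p2, alpha=0.05, power=0.90):
    """Approximate sample size per arm for a two-sided test of two proportions."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# e.g., detecting a reduction in a one-year event rate from 30% to 22%
print(round(n_per_arm(0.30, 0.22)))   # roughly 630 patients per arm with these inputs
```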
Evaluation Phase
The completed proposal is first reviewed by the Human Rights Committee (HRC) attached to the CSPCC. This committee is composed of scientists and laypeople from the community who review proposals for all new VA cooperative studies and all ongoing studies annually. The committee serves as a central Institutional Review Board (IRB) for studies assigned to the CSPCC and considers such aspects of the proposal as risks versus benefits to the patients, patient management, burdens placed on the patients from participation in the study, community equipoise with regard to the treatments being compared, and the informed consent procedures. This committee has absolute authority over approval or disapproval of the study. Only the HRC has the authority to change its own decisions. The composition of the committee follows VA regulations and is consistent with Food and Drug Administration (FDA) guidelines. Minimum membership of the HRC includes a VA chairperson, a practicing physician from the community, a nonphysician scientist, a veteran representative, a member of a recognized minority group, a clergyman or ethicist, and an attorney. If the HRC approves the study, then the proposal is submitted to VA Headquarters for scientific review and a funding decision. The proposal is sent to four or five experts in the field and a biostatistician for written reviews. All cooperative study proposals are reviewed by a standing scientific review committee, called the Cooperative Studies Evaluation Committee (CSEC). This committee is composed of senior physician scientists and biostatisticians who have had extensive experience in cooperative studies and clinical trials. The CSEC meets in the spring and fall of each year. The principal proponent and study biostatistician present the study to the CSEC in person (reverse site visit), defend their proposal and answer questions from the CSEC members. The CSEC then goes into executive session, and decides to recommend approval or disapproval of the study and, for approvals, gives a scientific priority score, ranging from 10 to 50 with 10 being the best score. The final funding decision is made by the CSP Director. The advantages of this review process are that the study investigators have the opportunity to interact personally with the review body to answer their criticisms and concerns, and the final decision is known immediately following the review and executive session.
Implementation of the Trial
Figure 2 Conduct of a VA cooperative study (flowchart: after CSEC approval with a fundable scientific priority score, the study is funded and moves through an organizational phase, patient recruitment and follow-up with yearly Study Group, Executive Committee, DSMB, and HRC reviews and periodic CSEC review, and finally closeout procedures and manuscript preparation)
Implementation and conduct of a VA cooperative trial are outlined in Figure 2. Once the CSP Director approves funding, the implementation phase of the cooperative study begins. All activities in this process are closely coordinated by the CSPCC, CSPCRPCC and the Study Chairperson's Office. The necessity for carefully controlled medical treatment and data collection procedures for the successful conduct of multicenter clinical trials is well recognized. Because of its administrative structure, the VA provides an environment that is uniquely suited to this type of research. Each participating
facility is funded by one control point for the entire period of the study, and the VAMC system provides a structure in which a relatively high degree of medical, scientific, and administrative control can be exercised. This same degree of control is often more difficult to obtain in studies that involve participating sites from different administrative systems [27]. The CSPCC recommends funding levels and monitors the performance of the individual medical centers. This information is reviewed regularly by the
study biostatistician, center director, the Executive Committee, and at least annually by the Data and Safety Monitoring Board. This integrated monitoring of scientific, biostatistical, and administrative aspects by the CSPCC provides a comprehensive approach to the management of multicenter clinical trials, in contrast to other clinical trials biostatistics groups that are responsible only for the analytical and data processing aspects and exercise no administrative control [27]. The Research and Development Committee and the IRB of each participating medical center must review and approve a cooperative study before it can be implemented at that facility. They are able to make modifications to the prototype consent form approved by the CSPCC HRC, but all local modifications must be reviewed and approved by the CSPCC. Included in the implementation component of a cooperative study is the establishment of the Executive Committee and Data and Safety Monitoring Board (DSMB) who share responsibility for the conduct and monitoring of the cooperative study in the ongoing phase. The Executive Committee, which often includes several members of the original Planning Committee, consists of the study chairperson who heads the committee, the study biostatistician, the clinical research pharmacist, two or three participating investigators, and one or two consultants. This committee is responsible for the detailed operational aspects of the study during its ongoing phase and ensures adherence to the study protocol, including aspects relating to patient recruitment, treatment, laboratories, data collection and submission, biostatistical analysis, data processing, subprotocols and reporting. The Executive Committee sometimes recommends probation or termination of sites whose performance is poor. The DSMB consists of five to eight individuals who have not been involved in the planning and development of the proposal, and includes one or two biostatisticians and two or more subject-matter experts in the field(s) of the cooperative study. This committee is charged with the responsibility of monitoring and determining the course of the ongoing study and considers such aspects as patient accrual; performance of the participating sites, CSPCC, and chairperson’s office; and safety and efficacy data. Perhaps a unique feature of the CSP is that the CSPCC HRC also reviews each ongoing study annually and receives the same data reports as presented
to the DSMB. The study chairperson, participating site investigators, and other members of the Executive Committee are masked to the outcome data during the course of the study. Only the DSMB, HRC, CSPCC and CSPCRPCC see the outcome data during the conduct of the study. The fourth body involved in the conduct of the cooperative study is the Study Group, which consists of all participating investigators, the study chairperson (co-chairpersons), biostatistician, clinical research pharmacist, and consultants. This body meets once annually to consider the progress of the study and to resolve problems that may arise at the participating centers. Within the CSPCC, the biostatistician heads a team of administrative, programming and data management personnel that provides regular monitoring of the study. This team develops an operations manual (see Clinical Trials Protocols), in conjunction with the chairperson's office, to train study personnel in the day-to-day conduct of the trial. They also develop a computer data management system to edit, clean, and manage the study data. Automated query reports are generated by the computer system and sent to the participating sites for data checking and cleaning. Statistical progress reports are published by the CSPCC and distributed to the Study Group, Executive Committee, and DSMB at scheduled meetings. An initial kickoff meeting is held before the study starts to train site personnel in the conduct of the study. Annual meetings are held thereafter to refresh training and discuss issues in the conduct of the study. Frequent conference calls of the study committees are also used to facilitate communication and training. Another unique aspect of the CSP is the CRPCC, which operationalizes the pharmaceutical aspects of the clinical trials (Table 2). In the planning stages of the study the clinical research pharmacist designs the drug or device treatment and handling protocol and works with the study chairperson and pharmaceutical and device industries to purchase or obtain donations of clinical supplies for the study. The CRPCC coordinates the development of appropriate drug dosage formulations and the manufacture of study drugs or devices. In the event that drug donations are not possible, the CRPCC has the expertise and capability to provide the in-house manufacture of active drugs and matching placebo.
Table 2 Unique functions and roles of the CSP pharmacy coordinating center
• Design of a drug or device handling protocol for each study involving drugs or devices
• Preparation and submission of INDAs or IDEs
• Obtaining donations or purchase of clinical supplies for study
• Coordination of appropriate drug dosage formulations and manufacture of study drugs or devices
• Quality control testing of drugs
• Development of blinding methods
• Storage, packaging and shipment of clinical supplies to pharmacies at the participating sites
• Computerized drug inventory and information system to track and replenish supplies at site pharmacies
• Monitoring adverse medical events and reporting to appropriate authorities
• Monitoring, auditing and education services to ensure sites are in compliance with GCP
• Preparation of final drug/device accountability reports
Drugs for all cooperative studies must pass the testing of the CRPCC's quality control testing laboratory. The CRPCC also assesses study product blinding methods. At the CRPCC, study medications are stored in an electronically controlled and secured environment. The CRPCC customizes labels and packages all study medications, which are centrally distributed to the pharmacies at the participating sites. In doing so, the CRPCC provides a computerized drug inventory and information system for complete accountability of clinical supplies. This includes automated study supply tracking and replenishment systems for maintaining adequate study supplies at participating sites as well as automated telephone randomization and drug assignment systems. The clinical research pharmacist is then able to direct and monitor the study prescribing and dispensing activities as well as to monitor the compliance with the study protocol treatments at the participating sites. At the end of the study the CRPCC directs the retrieval and disposition of unused clinical supplies and prepares a final drug/device accountability report. The clinical research pharmacist also works closely with the study chairperson and the manufacturers to prepare, submit, and maintain Investigational New Drug Applications (INDAs) or Investigational Device Exemptions (IDEs), which
includes preparing and submitting annual and special reports to the FDA. Along with this responsibility, the clinical research pharmacist coordinates the monitoring and reporting of all adverse medical events to study management, FDA and associated manufacturers. Recently the CRPCC established a central Good Clinical Practices (GCP) Assurance Program. The Program provides monitoring, auditing, and educational services for all VA cooperative studies to ensure that the participating sites are in GCP compliance. If needed, the Program is capable of providing full GCP monitoring for studies under regulatory (FDA) scrutiny.
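As a purely illustrative sketch of the general idea behind the automated randomization and coded drug-assignment systems mentioned above (this is not the CRPCC's actual software; the block size, kit labels, and seed are hypothetical):

```python
# Illustrative sketch only: permuted-block randomization with coded drug kits.
# Block size, kit labels, and seed are hypothetical, not the CRPCC's actual system.
import random

def permuted_block_schedule(n_patients, treatments=("A", "B"), block_size=4, seed=2024):
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_patients:
        block = list(treatments) * (block_size // len(treatments))
        rng.shuffle(block)                 # balance assignments within each block
        schedule.extend(block)
    return schedule[:n_patients]

# map coded kit numbers to treatments so site staff dispense by kit number only
schedule = permuted_block_schedule(8)
kit_assignments = {f"KIT-{i + 1:04d}": arm for i, arm in enumerate(schedule)}
for kit, arm in kit_assignments.items():
    print(kit, "->", arm)
```

In a blinded study, only the pharmacy coordinating center would hold the mapping from kit codes to treatments; the participating sites would dispense by kit number alone.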
Final Analysis and Publication Phase
Upon completion of patient intake and follow-up, the study enters the final analysis and publication phase. If the Executive Committee, the CSPCC, and the study biostatistician have performed their tasks well, this phase should be quite straightforward. It requires an updating of all study files and the processing and analysis of the complete data set. The interim statistical analyses that were run during the ongoing phase of the study are now executed on the complete data. In addition, some analyses may point to additional questions that would be of interest; however, it is anticipated that the majority of final analyses and interpretation of results will occur within 6 to 12 months after study termination. All publications emanating from the cooperative study must be approved by the Executive Committee. Although the responsibility of the DSMB terminates at the end of data collection, the Board is at times requested to review manuscripts and give advice prior to submission for publication [27]. Usually each trial generates a number of manuscripts. The Executive Committee establishes priorities for statistical analyses and manuscript development and appoints writing committees composed of members of the Executive Committee and Study Group for each paper. Authorship of the main paper(s) usually consists of the chairperson, study biostatistician, study clinical research pharmacist, members of the Executive Committee, and, in some cases, the participating site investigators. Secondary papers are often written by other members of the Executive Committee and site investigators.
The CSPCC serves as the final data repository for the study. The study database, protocol, operations manuals, forms, study correspondence and interim and final statistical progress reports are archived at the CSPCC.
Role of the Biostatistician and Pharmacist in the CSP
One of the unique features of the CSP is that the biostatistician at the CSPCC plays a major organizational, fiscal, and administrative role, in addition to the usual technical role. In recent times, as the administration of studies has become more complex, the biostatistician may be assisted by a study coordinator but, as in the past, the greater part of the burden of management falls on the biostatistician. In contrast to the pharmaceutical industry and to many Contract Research Organizations (CROs), the biostatistician is responsible for monitoring site adherence to protocol, recruitment, and many other aspects of the study conduct. In addition, the study pharmacist plays a key role in monitoring adverse effects, maintaining supplies of the study drug, regulatory reporting, and the like. In a sense, the study team is deployed to support the investigators, but has independent authority and responsibility as well. One of the strengths of this approach to study management is that it is possible to guarantee some degree of uniformity in the conduct of the studies, independent of the varying managerial skills and style of the study chairs. The biostatistician and pharmacist, together with the coordinating centers of which they are a part, provide institutional memory and continuity. Their central position on the study teams reinforces the key idea that the studies mounted by the VACSP are the joint responsibility of the program and the investigators. Such an intramural program can only succeed on a limited budget if issues of cost and complexity are kept to the forefront during the planning process. A consequence that is easily observed is that the typical CSP trial is a lean, focused attack on a single important clinical question, rather than a broad-based research project with many interwoven strands of investigation. In contrast to the much larger NIH clinical trials efforts, which are organized along disease lines, the CSP biostatisticians and CSPCCs are generalists, doing studies in all areas of medicine with
relevance to the VA. Along the way, some centers have developed some special experience in certain areas, but there has never been a “heart” center or a “cancer” center. Because the CSP has such a broad medical purview, but a relatively low volume of studies, it has not made economic sense to specialize. The scarce resource of statistical and data management expertise has needed to be allocated efficiently to support the proposals that were emerging from the field. Since VA resources have followed the strength of the proposals rather than disease areas, the CSPCCs have not specialized to any large degree. While there are undoubted advantages to specialization, as shown by the contributions made by the National Cancer Institute (NCI) (see Cooperative Cancer Trials) and the National Heart, Lung, and Blood Institute (NHLBI) (see Cooperative Heart Disease Trials) statisticians to the statistical science of their disease areas, there are some advantages to generalizing. In particular, it has been possible to transplant methods and lessons learned from well-studied areas such as cancer and heart disease, to other areas such as psychopharmacology, device research, health services research, and trials of surgical procedures. The absence of disease-area “stovepiping” has facilitated a high general level of sophistication in the conduct of trials, with techniques travelling readily across borders. This cross-pollination has also been facilitated by the structure of the VACSP scientific peer review. The standing committee that reviews and recommends studies for funding mixes disciplines with common expertise in multisite studies. Ad hoc reviewers provide the crucial discipline-specific input to the committee, but the same committee may review a heart failure trial in the morning and a psychopharmacology trial in the afternoon. The result is a high degree of uniformity in the standards for the research across disease areas, and this has been an enduring strength of the program.
Ongoing and Completed Cooperative Studies (1972–2000)
One hundred and fifty-one VA cooperative studies were completed or are currently ongoing in the period 1972–2000. Table 3 presents the health care areas of these studies.
Table 3 Health care areas of ongoing and completed VA cooperative studies (1972–2000)
Health care area: number of studies (percent of studies)
Cardiology/cardiac surgery: 24 (15.9%)
Hypertension: 15 (10.0%)
Gastrointestinal: 14 (9.3%)
Substance abuse: 11 (7.3%)
Mental health: 10 (6.6%)
Infectious diseases: 9 (6.0%)
Cancer: 8 (5.3%)
Dental: 6 (4.0%)
General surgery/anesthesia: 6 (4.0%)
Cerebrovascular: 5 (3.3%)
Peripheral vascular: 5 (3.3%)
Military service effects: 4 (2.6%)
Ambulatory care: 4 (2.6%)
Epilepsy: 4 (2.6%)
Genitourinary: 4 (2.6%)
Diabetes: 3 (2.0%)
Renal: 3 (2.0%)
Sleep: 3 (2.0%)
Pulmonary: 2 (1.3%)
Hematology: 2 (1.3%)
Hearing: 2 (1.3%)
One each in seven other areas*: 7 (4.7%)
Total: 151 (100.0%)
*Analgesics, arthritis, geriatrics, hospital-based home care, laboratory quality control, computerized neuropsychological testing, ophthalmology
These areas are generally reflective of the major health problems of the US veteran population, consisting mainly of middle-aged and senior adult males. Studies in cardiology and cardiac surgery represent 15.9% of the 151 studies, followed by hypertension (10.0%), gastrointestinal diseases (9.3%), substance abuse (7.3%), mental health (6.6%), infectious diseases (6.0%), and cancer (5.3%). There are a few notable disease areas that are prevalent in the VA population and yet might be considered underrepresented in the CSP. These include diabetes (2.0%), renal diseases (2.0%), pulmonary diseases (1.3%), hearing diseases (1.3%), arthritis (0.7%), and ophthalmologic diseases (0.7%). Because the CSP mainly relies on investigator-initiated studies, the conclusion might be drawn that these subspecialties have underutilized the Program. Although studies on effects of military service represent only 2.6% of the 151 studies, studies listed in other categories have investigated treatments for
service-connected illnesses (e.g. posttraumatic stress disorder studies are categorized under mental health, and the substance abuse studies could be considered consequences of military service). Table 4 briefly summarizes some of the noteworthy VA cooperative clinical trials that were completed in the 1980s and 1990s. Many of these trials resulted in advances in clinical medicine that could immediately be applied to improve the health care of US veterans and the US population in general.
Current Challenges and Opportunities
Although the VACSP has had numerous past successes, it faces many challenges and opportunities in the future. These include: (1) changes in the VA health care system and their effects on research; (2) nationwide concerns about violations of patients' rights in research; (3) increasing the efficiency and interdependence among the coordinating centers and standardizing procedures; (4) ensuring the adequacy of flow of research ideas and training of investigators; and (5) partnering with industry, other federal agencies, nonprofit organizations, and international clinical trial groups to enhance the capacity of the Program.
Changes in the VA Health Care System
The VA health care system has been undergoing substantial changes that could adversely affect research. In 1996, the VA reorganized into 22 geographically defined Veterans Integrated Service Networks (VISNs). Much of the central authority, decision-making, and budgeting once performed in VA Headquarters in Washington, DC, has been delegated to the 22 VISN offices. Within the VISNs, administrative and health care services and in some cases entire VAMCs are being consolidated. The largest component of the VA patient population, the World War II veterans, is rapidly declining. Health care personnel in some VISNs are experiencing reductions in force, with the result that the remaining personnel have limited time to devote to research. These factors may already be adversely affecting the Program's ability to meet recruitment goals in ongoing trials [23].
Concerns About Patients' Rights in Research
The nature of the veteran population treated at VA hospitals raises some special issues in human rights protections (see Medical Ethics and Statistics).
Table 4 Noteworthy VA cooperative studies
• 80% of strokes in patients with atrial fibrillation can be prevented with low-dose warfarin [15]
• Carotid endarterectomy is effective in preventing strokes in symptomatic patients and transient ischemic attacks in asymptomatic patients [26, 39]
• Aggressive treatment of moderate hypertension works well in elderly patients [19, 37]
• Age and racial groupings can be used to optimize selection of first-line drugs in hypertension [38]
• Coronary artery bypass surgery prevents mortality in patients with left main disease and in high-risk patients without left main disease [42, 49]
• Low-dose aspirin reduces heart attacks and death by 50% in patients with unstable angina [34]
• Vasodilators and angiotensin converting enzyme inhibitors prevent deaths in patients with congestive heart failure [12, 13]
• Low-dose aspirin started 6 hours after coronary artery bypass surgery and continued for one year prevents the occlusion of the bypass grafts [17, 18]
• Mechanical artificial aortic heart valves prolong survival more than bioprosthetic aortic heart valves [22]
• A conservative, ischemia-guided strategy is safe and effective for management of patients with non-Q-wave myocardial infarction [4]
• Digoxin does not reduce mortality but does reduce hospitalizations in patients with congestive heart failure [50]
• The rate of coronary events (myocardial infarction or death) in men with coronary heart disease can be reduced by 22% with gemfibrozil therapy, which increases high-density lipoprotein cholesterol and lowers triglyceride levels [45]
• Progression of human immunodeficiency virus (HIV) infection to full-blown acquired immune deficiency syndrome (AIDS) can be delayed with the drug zidovudine [21]
• Steroid therapy does not improve survival of patients with septic shock [52]
• Patients with advanced laryngeal cancer can be treated with larynx-sparing chemotherapy and radiation compared with standard surgical removal of the larynx and have equivalent long-term survival [14]
• The drug terazosin is more effective than finasteride in relieving the symptoms of benign prostatic hyperplasia [33]
• Transurethral resection of the prostate is an effective operation, but watchful waiting can be effective in many patients [56]
• An implantable insulin pump is more effective than multiple daily insulin injections in reducing hypoglycemic side-effects and enhancing quality of life in adult-onset Type II diabetes mellitus [46]
• Multi-channel cochlear implants are superior to single-channel implants in restoring hearing to patients with profound hearing loss [11]
• Sclerotherapy is an effective treatment for esophageal varices in patients who have had prior bleeds but not in patients without prior bleeds [51]
• Antireflux surgery is more effective than medical therapy in patients with complicated gastroesophageal reflux disease [47]
• Severely malnourished patients benefit from pre-operative total parenteral nutrition but mildly malnourished patients do not [53]
• Clozapine is a cost-effective treatment for patients with refractory schizophrenia who have high hospital use [44]
• Erythropoietin administered subcutaneously compared with intravenously can significantly reduce the costs of hemodialysis [29]
• Use of intrapleural tetracycline reduces recurrence rate by 39% in patients with spontaneous pneumothorax [35]
• Rapid access to high-quality primary care for patients with severe chronic illnesses greatly improves patient satisfaction with care but may lead to an increase in hospital readmissions [57]
• Levomethadyl acetate (LAAM) is a safe and efficacious drug to use for heroin addiction; studies were used to gain FDA approval for LAAM as treatment for heroin addiction [16, 36]
• Systemic corticosteroids improve clinical outcomes up to three months in patients with chronic obstructive pulmonary disease [40]
The VA treats about four million veterans, who tend to be less well off than the average veteran (or the average citizen). They are often more severely ill than non-VA patients with the same age and diagnosis, and often have multiple co-morbidities. They are on average more dependent on the VA for their health care than the typical non-VA patient is on his or her usual health care provider. Against this background we note the extraordinary willingness of the veteran patient to engage in research, trusting the clinical researcher to an astonishing degree. Such trust demands an extraordinary level of protection in response. The CSP has instituted a unique framework of human subjects' protections, going beyond the usual procedures that other federal sponsors and drug companies require. This begins in the planning stage, when each proposal must undergo a rigorous review by the HRC attached to the coordinating center. It typically meets for several hours over a single protocol, reviewing it in fine detail. The protocol cannot go forward without their independent approval. The CSP also requires the usual individual site IRB approval, and other reviews that are mandated at the local site, before a study can start at a site. The ongoing IRB reviews at the local sites (annually, or more often, as stipulated in the initial review) are monitored by the CSP staff. As has become standard in multisite trials, each CSP study has its own independent DSMB that meets at least annually to review the progress of the study. The unique CSP innovation to this process is the joint review by the HRC and DSMB. Thus, after every DSMB meeting, the two groups meet to review and recommend, with the same basis of information on study progress. The CSP has found that the HRC is able to hear the recommendation of the DSMB, which is typically heavily weighted with subject-matter expertise, and interpret it in the light of the other perspectives they bring. The CSP believes that this has been a successful experiment in resolving the knotty issue of how to obtain full and informed ongoing review of studies where investigators are kept blind, and site-level information must be far less informative than the big picture presented to the DSMB. We believe that such joint reviews add considerably to the level of protection of human subjects.
In addition, members of the central HRCs conduct three site visits per year during which patients are interviewed about their participation in the trials. Thus, the Program as a whole conducts 12 such visits per year. The Albuquerque auditing group periodically conducts site visits to VAMCs participating in cooperative studies and performs audits to ensure that the sites are complying with GCP guidelines. The CSPCCs also receive copies of consent forms from all patients in all of the trials as further evidence of proper consent procedures. The VA recently established its own office to oversee the protection of patients' rights in VA research, performing functions similar to those of the Office of Protection from Research Risks (OPRR) of the Department of Health and Human Services. IRBs at VAMCs currently are required to be accredited by an external, non-VA entity. In addition to these standard procedures, followed in all studies, the CSP has recognized two other areas of human subjects' protection in which it can make a contribution. The Enhancing the Quality of Informed Consent (EQUIC) program [32] is designed to institutionalize the process of testing innovations in methods for obtaining informed consent. It piggybacks tests of new methods on ongoing CSP studies, and provides a centralized assessment of the quality of informed consent encounters (by remote telephone interview of patients). In the spirit of EQUIC, a substudy is being conducted in one ongoing VA cooperative study to evaluate the utility of an informed consent document developed by a focus group of subjects eligible for the trial [41]. The second topic that the CSP has engaged is the ethical, legal, and social implications of genetics research, specifically of deoxyribonucleic acid (DNA) banking with linked clinical (phenotype) data. The CSP has begun a project to provide uniform methods for obtaining and banking such samples. Steps to ensure human subjects' protection in VA cooperative studies are listed in Table 5.
Table 5 Steps to ensure human subjects' protection in the VACSP
• Investigator's integrity
• Development of proposal through collaboration between investigators and CSPCCs
• HRC review of proposal initially
• Site Monitoring and Review Team (SMART) audit of consent form contents
• CSEC review of proposal
• Initial review of proposal by participating site R&D and IRB
• Annual central reviews of trial by DSMB and HRC
• Annual reviews of study by local R&D committee and IRB
• SMART audit of participating sites
• HRC site visits and interviews of study patients
• Receipt of copies of patient consent forms by CSPCC, local research offices, and local pharmacies
• Implementation of SOPs and good clinical practices
• Compliance with all FDA and VA regulations
• Innovative studies on improving informed consent
Efficiency and Interdependence of the CSPCCs
The VACSP recently contracted with an outside vendor to help develop standard operating procedures (SOPs) for the CSPCCs. Twenty-two SOPs were developed in the areas of administration, planning and implementing clinical trials, data management,
study closeout, and study oversight (Table 6). By standardizing among and within the coordinating centers certain procedures that are performed in every study, we will achieve an even higher level of support to all studies more efficiently than previously done. The SOPs will also enable the CSPCCs to be in better compliance with GCP principles and International Conference on Harmonization (ICH) guidelines. Since 1996, the Directors of the Program and centers have been meeting as a group semiannually to identify current and future challenges and opportunities, and to develop annual strategic plans to respond to these challenges and opportunities. This has enhanced the development of mutual projects which the centers can work on together to further the goals of the organization as a whole, such as the development of a Clinical Research Methods Course, a one-year sabbatical program for clinical investigators to enhance their training and skills, and SOPs for the central HRCs.
Table 6 Recently adopted SOPs for the VACSP
Administration
• Preparing, issuing and updating SOPs
• Training and training documentation
Planning and implementation of clinical trials
• Developing, approving and amending protocols
• Study/training meetings
• Preparing and approving operations manuals
• Study initiation
• Developing and approving case report forms
• Creating and validating data entry screens and programs
• Preparing, documenting and validating data checking programs
• Preparing, documenting and validating statistical programs
• Developing and conducting statistical analyses
Handling data from clinical trials
• Randomization, blinding and unblinding
• Central monitoring
• Case report form flow and tracking
• Data entry and verification
• Data cleaning
• Reporting adverse events
Study closeout
• Archiving study documentation
• Study closeout
Study oversight
• Assuring site R&D and IRB approvals
• DSMB
• HRC
Ensuring the Adequacy of Flow of Ideas and Training of Investigators
In recent years, the CSP has developed several educational opportunities to help train VA investigators
in clinical research and to encourage utilization of the Program to answer important clinical questions. These include a five-day course in clinical trials and sabbatical and career development programs focused on clinical research methodology. The five-day course is taught once each year and involves 10 faculty members (two from each of the five coordinating centers) and 60 VA investigators selected from applications from throughout the country. The course consists of 15 lecture/discussion sessions on various aspects of designing a clinical trial, interspersed with breakout sessions during which the students are divided into five planning committees to design a clinical trial. On the last day of the course, the student groups take turns in presenting their clinical trials and receiving critiques from the audience. The course has been taught twice and has received excellent feedback from the students.
The CSP Career Development Program provides protected time to clinician–investigators for a period of concentrated clinical research activity. The objective is to build capacity in a wide geographic distribution for the Department of Veterans Affairs to conduct clinical research in acute-care hospitals, long-term care facilities, or outpatient settings. The Program is designed to foster the research careers of clinician–scientists who are not yet fully independent but who seek to become independent clinical researchers. The award provides three years of salary and some supplemental research support, and the awardees are expected to work at least part of the time at one of the five CSPCCs or three Epidemiology Research and Information Centers (ERICs). In 1999 CSP announced a sabbatical program for established clinician–scientists to train at one of the CSPCCs or ERICs for up to one year. The purpose of the program is to support clinician–investigators who wish to secure training time to learn about the conduct of cooperative studies and epidemiologic research.
Partnering with Outside Organizations
The VACSP has partnered with NIH and industry for many years in conducting multicenter clinical trials. In recent years, a special emphasis has been placed on partnering with outside agencies to enhance the effect of the limited VA research funding, and these efforts have been fruitful. Recent examples of this partnering include: the Digitalis in Heart Failure Trial, sponsored by the VA, NHLBI, and Burroughs–Wellcome Company and conducted in 302 VA and non-VA sites in the US and Canada; a series of trials sponsored by the VA and the National Institute on Drug Abuse (NIDA) to evaluate new treatments for drug abuse; the Prostate Cancer Intervention Versus Observation Trial (PIVOT), sponsored by the VA, the Agency for Healthcare Research and Quality (AHRQ), and NCI; the Beta-Blocker Evaluation of Survival Trial (BEST), funded by the VA, NHLBI, and industry; the Clinical Outcomes Utilizing Revascularization and Aggressive Drug Evaluation (COURAGE) trial, supported by the VA and 10 pharmaceutical companies; the VA/National Institute on Deafness and Other Communication Disorders (NIDCD) Hearing Aid Clinical Trial; and the Shingles Prevention Study sponsored by the VA, NIH, and a pharmaceutical company.
The VACSP has been working with the American College of Surgeons to promote clinical trials evaluating new surgical operations and technologies. This collaboration has resulted in a VA trial comparing the outcomes of laparoscopic vs. open tension-free inguinal hernia repair, a trial comparing open tension-free hernia repair vs. watchful waiting funded by AHRQ, and a trial comparing pallidotomy vs. deep brain stimulation in Parkinson's Disease. The VACSP has also issued a program announcement for the development of multinational clinical trials between the VA and the Medical Research Councils of Canada and the UK. As the field of clinical trials matures, it is likely that achievable treatment effect sizes will decrease, necessitating larger and larger trials, or "mega" trials. These types of collaborations will be important in the future, as the larger trials will exceed the capacity of any single clinical trials program.
Table 7  Strengths of the VACSP

Related to VA health care system
• Large veteran population willing to participate in research, well-represented by minority groups
• Largest integrated healthcare system in US with 172 medical centers under single administrative system
• High-quality physician–investigators
• National administrative databases that allow for tracking of patients
• Supportive management in VA Headquarters
• System of local research offices and IRBs at participating sites that facilitate multicenter research

Related to the CSP
• Quality and experience of the coordinating centers
• Well-established mechanisms for conducting multi-site trials
• Planning process usually produces tightly focused, cost-effective protocols
• Rigorous review process by HRC and CSEC
• Guidelines and SOPs for conducting trials
• Multiple levels of protection of research subjects
• Ability to conduct trials with high power and generalizability, so the impact on changing health care practices is maximized compared with other research programs
Table 8  Limitations of the VACSP

Related to VA health care system
• Primarily male population, limiting generalizability of results
• Large studies in female and childhood diseases are not possible
• Changes in the health care system, including aging and declining of veteran population, decentralization and consolidation of facilities
• Reduction in dedicated research time for physician–investigators

Related to the CSP
• Long duration from submission of planning request to publication of main results raises the risk of study becoming outdated
• Limitation of funding
• Limited capacity to conduct mega trials within VA system
Concluding Remarks

In summary, we believe that there are considerable strengths to conducting multicenter clinical trials in the VA health care system, as enumerated in Table 7. There are also some acknowledged limitations of the Program, some of which can be addressed in the future (Table 8). This article has described the history, organization, and productivity of a clinical trials program designed as an integral part of a large health care system. The biostatistical and pharmacy positions in the Program are ideal in that these staff members are integrally involved in the research from beginning to end and play a major role in the conduct of the trials. The Program is an example of how clinician–investigators and methodologists can work together successfully to design and conduct large-scale clinical research.
Acknowledgments  We are extremely indebted to the US Congress, the Executive Branch, and VA management for their foresight and support, and to the health care providers and veteran patients in the VA system, whose dedication enables us to carry out this important research.
WILLIAM G. HENDERSON, PHILIP W. LAVORI, PETER PEDUZZI, JOSEPH F. COLLINS, MIKE R. SATHER & JOHN R. FEUSSNER
COORDINATING COMMITTEE A Coordinating Committee is a committee that a sponsor may organize to coordinate the conduct of a multicenter trial. If a coordinating committee and/or coordinating investigator(s) are used in multicenter trials, their organization and/or selection is the responsibility of the sponsor.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
Wiley Encyclopedia of Clinical Trials, Copyright © 2008 John Wiley & Sons, Inc.
COORDINATING INVESTIGATOR The Coordinating Investigator is an investigator who is assigned the responsibility for the coordination of the investigators at different centers that participate in a multicenter trial. If a coordinating committee and/or coordinating investigator(s) are used in multicenter trials, their organization and/or selection is the responsibility of the sponsor.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
Wiley Encyclopedia of Clinical Trials, Copyright © 2008 John Wiley & Sons, Inc.
CORONARY DRUG PROJECT

PAUL L. CANNER
Maryland Medical Research Institute, Baltimore, Maryland

By 1960, evidence had accrued that linked elevated blood lipid levels with increased incidence of coronary heart disease (CHD). At the same time, the pharmaceutical industry was developing drugs that were effective in reducing blood cholesterol in persons with hyperlipidemia. The time had come to assess whether reduction of lipid levels would be effective in the treatment and possible prevention of CHD. In November 1960, proceedings were started that culminated in a line item in the Federal budget for funding in 1966 of the Coronary Drug Project (CDP) (1).

1 OBJECTIVES

The CDP was a randomized, double-blind, placebo-controlled clinical trial of the efficacy and safety of five lipid-modifying agents in men with previous myocardial infarction (MI) (2). The patients were followed for a minimum of 5 years to determine whether pharmaceutical modification of blood lipids would lead to improved survival and a reduction in cardiovascular (CV) mortality and morbidity. A secondary objective was to identify baseline prognostic factors for CV mortality and morbidity in the placebo group of patients.

2 STUDY DESIGN AND METHODS

2.1 Treatments

CDP patients were randomly allocated to six treatment arms: mixed conjugated equine estrogens at two dosage levels (2.5 and 5.0 mg/day), clofibrate (1.8 g/day), dextrothyroxine (6.0 mg/day), nicotinic acid (or niacin, 3.0 g/day), and a lactose placebo (3.8 g/day) (2). These treatments were dispensed in identical-appearing capsules (9 per day at full dosage). Both the patients and the clinical staff were blinded as to each patient's treatment group, except that side effects of the estrogen and niacin treatments tended to unblind these treatment groups.

2.2 Outcome Variables

The primary outcome variable was all-cause mortality for the entire follow-up period, with secondary outcomes of CHD and CV death, recurrent definite nonfatal MI, the combination of CHD death or definite nonfatal MI, cerebral stroke, and others (2).

2.3 Sample Size and Duration of Follow-up

From March 1966 to October 1969, 53 Clinical Centers (in the United States and Puerto Rico) recruited a total of 8341 patients—about 1100 in each of the five drug groups and 2789 in the placebo group (2). The 2.5 to 1 ratio of patients in the placebo group relative to each drug group was designed to minimize the total sample size while achieving a specified power relative to each of the five drug–placebo comparisons (3,4). Patients were followed until July 1974 with clinic visits and examinations every 4 months for a minimum of 5 years, a maximum of 8.5 years, and a mean of 6.2 years per patient on their assigned treatment regimen (5).

2.4 Eligibility Criteria

A prospective participant in the CDP had to be a male aged 30 to 64 years with electrocardiogram-documented evidence of an MI that occurred not less than 3 months previously. Insulin-dependent diabetics and persons already on lipid-modifying medications at time of entry were excluded (2).

2.5 The CDP Aspirin Study

Three CDP treatment regimens were discontinued early because of adverse effects. The patients in these three groups who were eligible and willing (1529 patients altogether) were randomized into a short (10 to 28 months, mean 22 months) double-blind trial of 972 mg/day aspirin and placebo. As with the CDP, the primary outcome variable for the CDP Aspirin Study was all-cause mortality (6).
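The 2.5-to-1 placebo allocation described in Section 2.3 can be motivated with a short calculation. The sketch below is an illustration of Dunnett-style allocation for several treatments sharing one control, not a reproduction of the CDP's original power computation: it holds the variance of each drug–placebo contrast fixed and scans the placebo-to-drug allocation ratio. The total sample size is minimized near the square root of the number of active comparisons, sqrt(5) ≈ 2.24, close to the ratio the CDP adopted.

```python
import numpy as np

k = 5                                  # five active drug groups, one shared placebo group
ratios = np.linspace(1.0, 4.0, 301)    # candidate placebo:drug allocation ratios

# Relative total sample size needed to keep the variance of each drug-placebo
# contrast (proportional to 1/n_drug + 1/n_placebo) at a fixed level.
relative_total_n = (1 + 1 / ratios) * (k + ratios)

best_ratio = ratios[np.argmin(relative_total_n)]
print(round(best_ratio, 2))            # ~2.24, i.e., about sqrt(5)
```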
2.6 Administrative Structure

In addition to 53 clinical centers, the CDP had a Coordinating Center (at the present usually
called Data Coordinating Center; University of Maryland, Baltimore), a Central Laboratory (Centers for Disease Control, Atlanta), an ECG Reading Center (University of Minnesota, Minneapolis), and a Drug Procurement and Distribution Center (Public Health Service Supply Service Center, Perry Point, Maryland). Linkage of these centers with the National Heart, Lung, and Blood Institute (NHLBI) was provided through a Medical Liaison Officer, members of the Biometrics Research Branch, and a Health Scientist Administrator for medical, biostatistical, and budgetary matters, respectively. A Steering Committee composed of representatives of the Clinical Centers, Coordinating Center, Central Laboratory, ECG Reading Center, and NHLBI provided scientific direction for the study at the operational level (7).

2.7 Data and Safety Monitoring

A Data and Safety Monitoring Committee (DSMC) composed of representatives from the Coordinating Center, Central Laboratory, ECG Reading Center, and NHLBI, the Chairman and Vice-Chairman of the Steering Committee, one outside statistician, and three clinicians who were not participants in the Clinical Centers met at least twice a year to review the treatment group data and make recommendations with regard to continuation or discontinuation of the study treatment groups. A Policy Board, which was composed of a clinical pharmacologist, a biostatistician, and three clinicians/cardiologists who had no direct involvement in the CDP, acted in a senior advisory capacity on policy matters throughout the duration of the study. The Policy Board reviewed and either ratified or disapproved recommendations from the DSMC (7,8). This model of both a DSMC and an independent Policy Board has evolved into the present-day Data and Safety Monitoring Board (DSMB) made up of scientists who are independent of the study.

The members of the CDP DSMC and Policy Board wished not to be blinded as to identification of the study treatments in the data monitoring reports. They recognized that assimilating hundreds of pages of tables and graphs on a great variety of safety and efficacy outcomes in a short period was difficult
enough without their being blinded to treatment group identification as well. With treatment group blinding, significant patterns in the data with respect to treatment efficacy or safety might be missed easily. Furthermore, they realized that decisions concerning treatment efficacy are not symmetrical with those concerning treatment safety, with more evidence required for early stopping for efficacy than for safety.

3 RESULTS

Three CDP treatment groups were terminated early, that is, before the scheduled end of the study; these groups were both estrogen groups and the dextrothyroxine group (9–11). The two remaining drug groups—clofibrate and niacin—and the placebo group were followed until the scheduled conclusion of the trial (5).

3.1 High Dose Estrogen

In May 1970, a decision was reached to discontinue the 5.0 mg/day estrogen (ESG2) group because of an increased incidence of cardiovascular events. One major consideration in the deliberations over the ESG2 findings had to do with possible bias in diagnosing definite nonfatal MI and nonfatal thromboembolic events because of the treatment being unblinded in a large percentage of the patients because of feminizing side effects of the medication. Several special analyses were carried out to assess the possibility and extent of such bias. These analyses pertained to (1) incidence and duration of hospitalization for cardiac problems; (2) incidence of subsequent cardiovascular death in patients with definite nonfatal MI since entry; (3) incidence of several nonfatal cardiovascular events ranked in order of severity, counting only the single most serious event for a given patient; and (4) comparison of the centrally read electrocardiographic findings taken in connection with new MI events for the ESG2 and placebo groups. These analyses did not yield any evidence in support of the hypothesis of overdiagnosis of nonfatal MI in the ESG2 group (9,12). At its meeting of May 13, 1970, the DSMC reviewed the subgroup analyses shown in Table 1.
Table 1. Mortality and Morbidity in the High Dose Estrogen (ESG2) and Placebo Groups, Coronary Drug Project

                                          ESG2               Placebo
Event                  Risk group(a)      n       %          n       %      z-value
Total mortality        All              1118     8.1       2789     6.9       1.33
                       1                 738     5.1       1831     6.1      −0.95
                       2                 380    13.9        958     8.5       3.02
Definite nonfatal MI   All              1022     6.2       2581     3.2       4.11
                       1                 684     6.7       1689     2.9       4.30
                       2                 338     5.0        892     3.7       1.05

(a) Risk 1 = men with one MI without complications prior to entry into the study; Risk 2 = men with more than one previous MI or one MI with complications prior to entry.
Source: Coronary Drug Project Research Group (12); reprinted from Controlled Clinical Trials © 1981, with permission from Elsevier.
For this analysis, Risk 1 comprised men with one previous MI without complications prior to entry into the study and Risk 2 included men with either more than one previous MI or one MI with complications prior to entry. For the total group the z-value for the ESG2–placebo difference in total mortality was 1.33 and the corresponding z-value for definite nonfatal MI was 4.11. [A z-value is defined here as a drug–placebo difference in proportions of a given event, divided by the standard error of the difference; z-values of ±1.96 correspond to a conventional P-value of 0.05. However, given the multiple treatment groups, multiple endpoints (here, total mortality and definite nonfatal MI), and multiple reviews of the data during the course of the study, it was judged necessary to require much larger z-values than these to establish statistical significance (13,14).] The DSMC agreed to discontinue the ESG2 treatment in Risk 2 patients only, but the Policy Board voted to discontinue the entire ESG2 treatment group (9,12).
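As a concrete illustration of the z-value defined above, the sketch below recomputes the total-mortality comparison in the "All" rows of Table 1 directly from the tabled percentages; because the tabled percentages are rounded, it reproduces the published 1.33 only approximately.

```python
from math import sqrt

def z_value(p1, n1, p2, n2):
    """Difference in proportions divided by the standard error of the difference."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se

# ESG2 vs. placebo, total mortality, "All" rows of Table 1
p_esg2, n_esg2 = 0.081, 1118
p_plac, n_plac = 0.069, 2789
print(round(z_value(p_esg2, n_esg2, p_plac, n_plac), 2))   # ~1.3, vs. 1.33 from exact counts
```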
3.2 Dextrothyroxine

In October 1971, a decision was reached to discontinue the dextrothyroxine (DT4) treatment group in the CDP, based primarily on a higher mortality in the DT4 group compared with placebo. Although the z-value (1.88) for the difference did not achieve the conventional P < 0.05 level of statistical significance, and even less so when taking into account the five treatment versus placebo group comparisons, the DSMC members did not think that conventional significance levels necessarily applied to adverse treatment effects. The deliberations that led to the decision for discontinuation focused largely on the question of whether the excess mortality was present consistently throughout the total group of DT4-treated patients or whether it was concentrated in certain subgroups (10).

Table 2 gives the observed DT4 and placebo group findings for total mortality in subgroups defined by baseline risk categorization, history of angina pectoris, and ECG heart rate. Within each higher risk subgroup, there was a substantially higher mortality in the DT4 group than in the placebo group. Conversely, DT4 showed a somewhat lower mortality than placebo in the three lower risk subgroups. These subgroups that showed adverse effects of DT4 were identified following a survey of 48 baseline variables. Because no a priori hypotheses concerning DT4 effects in defined subgroups had been specified, the observed effects were treated as a posteriori findings in the evaluation of their statistical significance.

The statistical analysis of these subgroup findings lay primarily in two directions (10). First, because the observed subgroup findings emerged from an analysis that involved 48 different baseline variables, it was desirable to determine whether the observed differences were any greater than might be expected by chance alone from evaluation of 48 variables.
Table 2. Percentage of Deaths in Selected Subgroups, Dextrothyroxine (DT4) and Placebo Groups, Coronary Drug Project

                                               DT4                  Placebo
Baseline characteristic                      n       %            n       %       z-value
Risk group,(a) 8/1/71
  Risk 1                                   719    10.8         1790    11.0        −0.11
  Risk 2                                   364    22.5          925    15.4         3.06
History of angina pectoris, 8/1/71
  Negative                                 440     7.7         1142     9.9        −1.33
  Suspect/definite                         643    19.6         1573    14.4         3.06
ECG heart rate, 8/1/71
  <70/min                                  576     9.5         1482    10.7        −0.74
  ≥70/min                                  494    21.3         1194    14.7         3.32
Combination,(b) 8/1/71
  Subgroup A                               460     6.5         1210     9.9        −2.17
  Subgroup B                               623    20.9         1505    14.6         3.58
Combination, 10/1/70
  Subgroup A                               460     4.1         1210     7.7        −2.60
  Subgroup B                               623    16.4         1505    11.2         3.29
Combination in interval, 10/1/70–8/1/71
  Subgroup A                               441     2.5         1117     2.4         0.09
  Subgroup B                               521     5.4         1337     3.8         1.50

(a) Risk 1 = men with one MI without complications prior to entry into the study; Risk 2 = men with more than one previous MI or one MI with complications prior to entry.
(b) Subgroup A = men with baseline ECG heart rate <70/min and either Risk 1 or with no history of angina pectoris prior to entry; Subgroup B = men with either baseline ECG heart rate ≥70/min or Risk 2 plus history of suspect or definite angina pectoris prior to entry.
Source: Coronary Drug Project Research Group (10); reprinted from Journal of the American Medical Association © 1972, American Medical Association; all rights reserved.
Computer simulations showed that the observed interactions with treatment for baseline heart rate and for history of angina pectoris both fell in the 5% tail of the simulated distribution of maximum interaction effects. Thus, the observed difference in the effects of DT4 on mortality between the two subgroups defined by these two variables likely was real.

The second approach to the statistical evaluation of subgroup findings focused on two rather complicated subgroups based on entry heart rate, history of angina pectoris, and risk categorization (Table 2). In the data report prepared for a previous DSMC meeting, these two subgroups were defined as a result of a trial and error process of maximizing the treatment–subgroup interaction with respect to total mortality. For Subgroup A the z-value for the DT4–placebo difference was −2.60 and for Subgroup B it was +3.29 (Table 2). The z-value for interaction was 3.96 for these subgroups. Because this subgroup was constructed solely because of analyzing the data many different ways, there was inadequate statistical
evidence that this would indeed be the best subgroup to stop. Hence, the patients were followed for another few months to observe whether DT4 continued to do poorly in Subgroup B and well in Subgroup A. During this additional follow-up period, in Subgroup B, DT4 continued to show a greater than 40% higher mortality than the placebo group (5.4% vs. 3.8%), thus justifying the choice of this subgroup as one in which DT4 treatment should be discontinued. However, in Subgroup A, there was no longer a mortality benefit with DT4 (2.5% vs. 2.4%; Table 2) (10). As a result of these and other data analyses, the DSMC, at its meeting on October 21, 1971, approved motions to discontinue DT4 medication in both Subgroup A and Subgroup B. These decisions were ratified by the CDP Policy Board (12).
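The multiplicity issue behind the CDP's simulations can be illustrated with a calculation far simpler than the one actually used. If each of 48 treatment-by-subgroup interaction z-statistics were roughly standard normal under the null hypothesis, and (unrealistically) independent, the largest |z| among them would typically be around 2.4 to 2.5 even with no true interaction, and its 95th percentile would sit near 3.3. The sketch below simulates this simplified reference distribution; it illustrates the multiplicity penalty only and is not a reproduction of the CDP's data-based simulations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Null reference distribution for the largest of 48 (approximately independent)
# interaction z-statistics, each ~N(0, 1) when no true subgroup effect exists.
max_abs_z = np.abs(rng.standard_normal((100_000, 48))).max(axis=1)

print(np.quantile(max_abs_z, 0.50))   # median of the maximum: about 2.45
print(np.quantile(max_abs_z, 0.95))   # 95th percentile: about 3.3
```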
Table 3. Projection of Future Mortality Experience, Low Dose Estrogen (ESG1) and Placebo Groups, Coronary Drug Project

                                                               ESG1                 Placebo
A. Current % deaths                                            19.9 (219/1101)      18.8 (525/2789)
B. Current survivors                                           882                  2264
C. % deaths at end of study, 1.96 SE difference                21.1 (232/1101)      24.0 (670/2789)
D. Future % deaths given 1.96 SE difference at end of study     1.5 (13/882)          6.4 (145/2264)

Source: Coronary Drug Project Research Group (12); reprinted from Controlled Clinical Trials © 1981, with permission from Elsevier.
3.3 Low Dose Estrogen

In March 1973, a decision was reached to discontinue the 2.5 mg/day estrogen (ESG1) group (11). This decision was based on an excess incidence of venous thromboembolism, an excess mortality (not statistically significant) from all cancers, and a small, statistically insignificant excess of total mortality in the ESG1 group compared to the placebo group. As of February 1, 1973, 19.9% of the patients in the ESG1 group had died, compared with 18.8% in the placebo group (Table 3) (11). This difference was not statistically significant, and there was no clear evidence that the drug was doing definite harm. The DSMC posed the question: What is the possibility that this trend could reverse itself with ESG1 showing a statistically significant beneficial effect with respect to all-cause mortality at the end of the study in summer 1974? For this to happen, future mortality in the two groups would have to be 1.5% for ESG1 compared with 6.4% for placebo (Table 3, line D). Because this outcome was considered to be extremely unlikely given the experience to date, this analysis plus the other considerations noted earlier led to the early discontinuation of the ESG1 group in 1973 (12).

Years later, a striking parallel was noted in the mortality findings between the ESG1 versus placebo comparison in the CDP in men and the estrogen/progestin versus placebo comparison in the Heart and Estrogen/progestin Replacement Study in women. In both studies, excess mortality was observed in the treated group in the first year or two, followed by a slight mortality benefit in subsequent years (15).
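Line C of Table 3 can be checked directly: the projected end-of-study death rates are those for which the ESG1–placebo difference would just reach 1.96 standard errors in favor of ESG1. A minimal sketch, assuming a pooled-proportion standard error (the exact variance formula used in the CDP projection is not stated here):

```python
from math import sqrt

def z_pooled(x1, n1, x2, n2):
    """(p2 - p1) divided by a pooled-proportion standard error."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)
    return (p2 - p1) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

# Row C of Table 3: 232/1101 deaths (ESG1) vs. 670/2789 deaths (placebo)
print(round(z_pooled(232, 1101, 670, 2789), 2))   # ~1.96, the boundary used in the projection
```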
3.4 Clofibrate

At the conclusion of the CDP, the mortality of the clofibrate group was almost identical to that of the placebo group (25.5% vs. 25.4%) (5). However, it was not always this way. The DSMC reviewed interim reports of treatment versus placebo differences in all-cause mortality at 2-month intervals throughout the course of the study. For the clofibrate–placebo comparison, on three occasions during the first 30 months of the study the z-value exceeded the −1.96 boundary (signifying a conventional P-value less than 0.05) (12). If the DSMC had decided to stop the study and declare clofibrate therapeutically efficacious on the basis of these early "statistically significant" results, then it is evident that in retrospect it would very likely have been a wrong decision. However, the DSMC realized that the chances of finding significant differences in the absence of true differences became much higher than 5%—perhaps as high as 30 or 35%—when the data were examined repeatedly over time for such differences (16) and when five different drug groups were being compared with placebo at each of these time points. Several statistical methods were used in the CDP to take account of the repeated analysis of treatment effects in the decision-making process (13,14,17–20). The use of each of these methods required a much more extreme z-value than −1.96 early in the study to conclude that a statistically significant difference had been found.
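The inflation of the false-positive rate under repeated looks is easy to reproduce by simulation. The sketch below is a generic illustration, not the CDP's monitoring schedule or stopping boundaries: it tests an accumulating two-sample z-statistic against ±1.96 at each of 15 interim looks when there is no true treatment difference. The proportion of simulated trials that ever cross the boundary comes out several times the nominal 5%, even before accounting for the five separate drug–placebo comparisons.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_looks, n_per_look = 5000, 15, 100   # hypothetical monitoring schedule

crossed = 0
for _ in range(n_trials):
    drug = rng.standard_normal((n_looks, n_per_look))
    plac = rng.standard_normal((n_looks, n_per_look))
    for k in range(1, n_looks + 1):             # accumulate data look by look
        d, p = drug[:k].ravel(), plac[:k].ravel()
        z = (d.mean() - p.mean()) / np.sqrt(d.var(ddof=1) / d.size +
                                            p.var(ddof=1) / p.size)
        if abs(z) > 1.96:                       # naive fixed-sample boundary
            crossed += 1
            break

print(crossed / n_trials)                        # roughly 0.2 or more, far above 0.05
```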
3.5 Niacin

The results in the niacin group at the end of follow-up on study medication in 1974 were only modestly encouraging. Patients taking niacin experienced about 10% and 26% decreases in serum cholesterol and triglycerides, respectively, over 5 years of follow-up. Also, substantial reductions were observed in the niacin group in definite nonfatal MI (z = −3.09) and the combination of CHD death or definite nonfatal MI (z = −2.77) at the end of the trial. However, for the primary outcome—all-cause mortality—the findings for the niacin and placebo groups were almost identical. Thus, the following conclusion was reported by the CDP Research Group: "The Coronary Drug Project data yield no evidence that niacin influences mortality of survivors of myocardial infarction; this medication may be slightly beneficial in protecting persons to some degree against recurrent nonfatal myocardial infarction. However, because of the excess incidence of arrhythmias, gastrointestinal problems, and abnormal chemistry findings in the niacin group, great care and caution must be exercised if this drug is to be used for treatment of persons with coronary heart disease" (5).

Nine years later, the NHLBI provided funding to the CDP Coordinating Center to conduct a mortality follow-up of CDP patients. The main interest was to ascertain whether an excess of cancer deaths had occurred in the two estrogen groups. The answer to that question turned out to be "no." However, most surprisingly, a highly significant reduction in all-cause mortality was observed in the niacin group compared with placebo (52.0% vs. 58.2%; z = −3.52) as well as to all other CDP treatment groups (21). The survival curves for the niacin and placebo groups were nearly identical for the first 6 years of follow-up before starting to diverge (21). No clear explanation can be provided for the very long lag period in mortality benefit. Possibly, counterbalancing adverse effects of niacin had occurred—such as a significantly higher incidence of atrial fibrillation and other cardiac arrhythmias in the niacin group (5)—while patients were taking the drug. The CDP data leave unsettled the question of whether benefit or the suppression of benefit might derive from taking niacin for more than 6 years.
Patients who took niacin experienced an elevation of plasma fasting and 1-hour glucose levels over time—an apparent adverse effect of niacin. For this reason, the use of niacin in diabetic patients has been restrained, despite the encouraging findings with respect to 15-year mortality. However, recent analyses of the CDP data indicate no reduced mortality benefit in patients with higher plasma glucose levels at baseline or with increased glucose levels at 1 year (22). A similar outcome has been observed with respect to the metabolic syndrome (23).

3.6 The CDP Aspirin Study

With a mean follow-up of 22 months, the all-cause mortality in the aspirin and placebo groups was 5.8% and 8.3%, respectively (z = −1.90) (6).

3.7 Mortality by Level of Adherence

Although clofibrate ultimately showed no benefit overall, study investigators as well as outside observers (24) raised the question as to whether the patients who adhered well to the clofibrate treatment regimen showed any benefit with respect to mortality and cardiovascular morbidity. Clofibrate and placebo group patients were classified according to their cumulative adherence—the estimated number of capsules actually taken as a percentage of the number that should have been taken according to protocol during the first 5 years of follow-up or until death, if earlier. For those with ≥80% adherence, 5-year mortality was nearly identical in the clofibrate and placebo groups (15.0% vs. 15.1%; Table 4). For those with <80% adherence, 5-year mortality was somewhat lower for clofibrate (24.6%) than for placebo (28.2%). However, the most astounding finding came from focusing on the two figures for the placebo group: 15.1% mortality for good adherers versus 28.2% for poor adherers. The z-value for this difference was −8.12 (P = 4.7 × 10⁻¹⁶). Adjustment for 40 baseline characteristics reduced the mortality difference between good and poor adherers only to 16.4% versus 25.8% (z = −5.78).
Table 4. Five-Year Mortality in the Clofibrate and Placebo Groups, According to Cumulative Adherence to Protocol Prescription, Coronary Drug Project

                            Clofibrate                     Placebo
Adherence(a)              n     % mortality(b)           n     % mortality(b)
<80%                    357     24.6 (22.5)            882     28.2 (25.8)
≥80%                    708     15.0 (15.7)           1813     15.1 (16.4)
Total study group      1065     18.2 (18.0)           2695     19.4 (19.5)

(a) A patient's cumulative adherence was computed as the estimated number of capsules actually taken as a percentage of the number that should have been taken according to the protocol during the first 5 years of follow-up or until death (if death occurred during the first 5 years).
(b) The figures in parentheses are adjusted for 40 baseline characteristics.
Source: Coronary Drug Project Research Group (25); reprinted from New England Journal of Medicine © 1980 Massachusetts Medical Society; all rights reserved.
In conclusion, it is doubtful that any valid conclusions can be drawn from analyses like these because there is no way of ascertaining precisely how or why the patients in the treated and control groups have selected themselves or have become selected into the subgroups of good and poor adherers (25). Analyses of drug–placebo differences in mortality by change in serum cholesterol over time were similarly shown to give anomalous and uninterpretable results (25).
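The quoted P-value for the adherence comparison follows directly from the normal approximation; a one-line check (the z-value itself is taken from the text above, not recomputed):

```python
from math import erfc, sqrt

z = 8.12                                  # good vs. poor adherers in the placebo group
p_two_sided = erfc(abs(z) / sqrt(2))      # 2 * (1 - Phi(|z|)) for a standard normal
print(p_two_sided)                        # ~4.7e-16, as quoted in the text
```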
3.8 Studies of the Natural History of CHD

As noted earlier, the CDP placebo group was sized 2 1/2 times larger—a total of 2789 patients—than any of the five drug groups so as to optimize the sample size for the five drug–placebo comparisons. Another benefit of such a large placebo group was the ability to carry out several data-analytical studies of the natural history of CHD in the placebo group, particularly the evaluation of the relationship of 5-year mortality and cardiovascular morbidity to various cardiovascular risk factors measured or observed at baseline, either singly or in combinations (26–35).
3.9 Evaluation of Data from External Quality Control Programs

Split samples of blood specimens were sent to the Central Laboratory in a blinded fashion to assess the technical error (measurement error at the laboratory plus any specimen deterioration that may have taken place during storage and shipment) of laboratory
determinations. Similarly, ECGs were resubmitted in a blinded fashion to the ECG Reading Center to assess the ECG reading variability. After attempting to select the most meaningful measure of technical error or reading variability among different possibilities, the CDP statisticians developed a new measure that related the measurement error to the total variability among patients of a particular laboratory determination or ECG reading. In addition, for ECG characteristics, such as Q/QS patterns, ST-depression, and T-wave patterns, for which significant worsening of ECG findings was compared between each treatment group and placebo, the duplicate readings of the same ECGs were used to determine "significant worsening" from the first to the second reading. In this way the amount of "significant worsening" from the baseline to a follow-up ECG attributable to reading variability could be determined (36).
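The CDP's specific error measure is not spelled out here, but the general idea of relating duplicate-determination error to between-patient variability can be sketched as follows. This is a generic illustration with hypothetical split-sample values, not the CDP's published formula.

```python
import numpy as np

def technical_error_ratio(first, second):
    """Duplicate-determination error variance as a fraction of total variance.

    `first` and `second` are blinded duplicate determinations on the same
    specimens (or duplicate readings of the same ECGs, coded numerically).
    """
    first, second = np.asarray(first, float), np.asarray(second, float)
    technical_var = np.mean((first - second) ** 2) / 2.0   # within-pair (technical) variance
    total_var = np.concatenate([first, second]).var(ddof=1)
    return technical_var / total_var

# Hypothetical split-sample serum cholesterol determinations (mg/dL)
a = [231, 250, 198, 305, 274, 220]
b = [228, 255, 202, 298, 270, 223]
print(round(technical_error_ratio(a, b), 3))   # small: technical error is a small share of total variability
```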
4 CONCLUSIONS AND LESSONS LEARNED

4.1 Making Decisions with Respect to Treatment Efficacy and Safety

Arriving at a decision for early termination of a treatment group or of an entire clinical trial, because of either beneficial or adverse results, is a complex process. It may involve, among other things, the need to (1) determine whether the observed treatment differences are likely to represent real effects and are not caused by chance; (2) weigh the importance of different response variables, some possibly trending in favor of the treatment and some against it; (3) adjust for differences in distributions of baseline characteristics among the treatment groups; (4) discern possible biases (because the study was not double blind) in the medical management of patients or in the diagnosis of events; and (5) evaluate treatment effects in subgroups of the study participants (12).

4.2 Assessment of Treatment Efficacy by Follow-up Responses to Treatment

Analyses of drug–placebo differences in mortality by adherence to CDP treatment regimen or by change in serum cholesterol over time were shown to give anomalous and uninterpretable results. The reason is that there is no way of ascertaining precisely how or why the patients in the treated and control groups have selected themselves or have become selected into the subgroups of good and poor adherers or of cholesterol responders or nonresponders (25).

4.3 Assessment of Technical Error of Laboratory and ECG Determinations

New methods for evaluating the technical error of Central Laboratory determinations and ECG reading variability of the ECG Reading Center were developed, relating these sources of error to the total variability among study patients in these laboratory determinations or ECG characteristics (36).

4.4 Impact of Treatment Group Findings on the Practice of Medicine

Of the five original treatment regimens evaluated in the CDP, only niacin remains as a viable lipid-modifying regimen. Because of the long-term mortality benefit of immediate-release niacin shown in the CDP, there is renewed interest among cardiologists in prescribing this medication for modification of serum lipid levels in persons at risk for CHD. Also, more recently, more clinical trials of immediate-, sustained-, and extended-release niacin, in combination with other medications that impact blood lipids, have been conducted (37–39).

The results of the short-duration study of aspirin and placebo in patients whose CDP treatment regimens had been discontinued have led to several additional studies
of aspirin in patients with CHD (40,41) and a resultant promotion of this popular medication to the forefront of prevention and treatment of MI.

REFERENCES

1. W. J. Zukel, Evolution and funding of the Coronary Drug Project. Control. Clin. Trials 1983; 4: 281–312.
2. Coronary Drug Project Research Group, The Coronary Drug Project. Design, methods, and baseline results. Circulation 1973; 47: I1–I79.
3. P. L. Canner and C. R. Klimt, Experimental design features of the Coronary Drug Project. Control. Clin. Trials 1983; 4: 313–332.
4. C. W. Dunnett, Multiple comparison procedures for comparing several treatments with a control. J. Am. Statist. Assoc. 1955; 50: 1096–1121.
5. Coronary Drug Project Research Group, Clofibrate and niacin in coronary heart disease. JAMA 1975; 231: 360–381.
6. Coronary Drug Project Research Group, Aspirin in coronary heart disease. J. Chron. Dis. 1976; 29: 625–642.
7. P. L. Canner and J. Stamler, Organizational structure of the Coronary Drug Project. Control. Clin. Trials 1983; 4: 333–343.
8. P. L. Canner, Monitoring of the data for evidence of adverse or beneficial treatment effects in the Coronary Drug Project. Control. Clin. Trials 1983; 4: 467–483.
9. Coronary Drug Project Research Group, The Coronary Drug Project. Initial findings leading to modifications of its research protocol. JAMA 1970; 214: 1303–1313.
10. Coronary Drug Project Research Group, The Coronary Drug Project. Findings leading to further modifications of its protocol with respect to dextrothyroxine. JAMA 1972; 220: 996–1008.
11. Coronary Drug Project Research Group, The Coronary Drug Project. Findings leading to discontinuation of the 2.5 mg/day estrogen group. JAMA 1973; 226: 652–657.
12. Coronary Drug Project Research Group, Practical aspects of decision making in clinical trials: The Coronary Drug Project as a case study. Control. Clin. Trials 1981; 1: 363–376.
13. P. Armitage, C. K. McPherson, and B. C. Rowe, Repeated significance tests on accumulating data. J. Royal Statist. Soc. A 1969; 132: 235–244.
14. P. L. Canner, Monitoring treatment differences in long-term clinical trials. Biometrics 1977; 33: 603–615.
15. N. Wenger, G. L. Knatterud, and P. L. Canner, Early risks of hormone therapy in patients with coronary heart disease. JAMA 2000; 284: 41–43.
16. P. Armitage, Sequential Medical Trials, 2nd ed. New York: Wiley, 1975.
17. J. Cornfield, Sequential trials, sequential analysis and the likelihood principle. Am. Statist. 1966; 20: 18–23.
18. J. Cornfield, A Bayesian test of some classical hypotheses—with applications to sequential clinical trials. J. Am. Statist. Assoc. 1966; 61: 577–594.
19. J. Cornfield, The Bayesian outlook and its application. Biometrics 1969; 25: 617–657.
20. M. Halperin and J. Ware, Early decision in a censored Wilcoxon two-sample test for accumulating survival data. J. Am. Statist. Assoc. 1974; 69: 414–422.
21. P. L. Canner, K. G. Berge, N. K. Wenger, J. Stamler, L. Friedman, R. J. Prineas, and W. Friedewald, for the Coronary Drug Project Research Group, Fifteen year mortality in Coronary Drug Project patients: Long-term benefit with niacin. J. Am. Coll. Cardiol. 1986; 8: 1245–1255.
22. P. L. Canner, C. D. Furberg, M. L. Terrin, and M. E. McGovern, Benefits of niacin by glycemic status in patients with healed myocardial infarction (from the Coronary Drug Project). Am. J. Cardiol. 2005; 95: 254–257.
23. P. L. Canner, C. D. Furberg, and M. E. McGovern, Benefits of niacin in patients with versus without the metabolic syndrome and healed myocardial infarction (from the Coronary Drug Project). Am. J. Cardiol. 2006; 97: 477–479.
24. A. R. Feinstein, Clinical biostatistics. XLI. Hard science, soft data, and the challenges of choosing clinical variables in research. Clin. Pharmacol. Ther. 1977; 22: 485–498.
25. Coronary Drug Project Research Group, Influence of adherence to treatment and response of cholesterol on mortality in the Coronary Drug Project. N. Engl. J. Med. 1980; 303: 1038–1041.
26. Coronary Drug Project Research Group, The prognostic importance of the electrocardiogram after myocardial infarction: Experience in the Coronary Drug Project. Ann. Intern. Med. 1972; 77: 677–689.
27. Coronary Drug Project Research Group, Prognostic importance of premature beats following myocardial infarction: Experience in the Coronary Drug Project. JAMA 1973; 223: 1116–1124.
28. Coronary Drug Project Research Group, Left ventricular hypertrophy patterns and prognosis. Experience postinfarction in the Coronary Drug Project. Circulation 1974; 49: 862–869.
29. Coronary Drug Project Research Group, Factors influencing long-term prognosis after recovery from myocardial infarction—three-year findings of the Coronary Drug Project. J. Chron. Dis. 1974; 27: 267–285.
30. Coronary Drug Project Research Group, Serum uric acid: Its association with other risk factors and with mortality in coronary heart disease. J. Chron. Dis. 1976; 29: 557–569.
31. Coronary Drug Project Research Group, The prognostic importance of plasma glucose levels and of the use of oral hypoglycemic drugs after myocardial infarction in men. Diabetes 1977; 26: 453–465.
32. Coronary Drug Project Research Group, The natural history of myocardial infarction in the Coronary Drug Project: Long-term prognostic importance of serum lipid levels. Am. J. Cardiol. 1978; 42: 489–498.
33. Coronary Drug Project Research Group, Cigarette smoking as a risk factor in men with a prior history of myocardial infarction. J. Chron. Dis. 1979; 32: 415–425.
34. Coronary Drug Project Research Group, The natural history of coronary heart disease: Prognostic factors after recovery from myocardial infarction in 2789 men. The 5-year findings of the Coronary Drug Project. Circulation 1982; 66: 401–414.
35. Coronary Drug Project Research Group, High-density lipoprotein cholesterol and prognosis after myocardial infarction. Circulation 1982; 66: 1176–1178.
36. P. L. Canner, W. F. Krol, and S. A. Forman, External quality control programs in the Coronary Drug Project. Control. Clin. Trials 1983; 4: 441–466.
37. G. Brown, J. J. Albers, L. D. Fisher, S. M. Schaefer, J. T. Lin, C. Kaplan, X. Q. Zhao, B. D. Bisson, V. F. Fitzpatrick, and H. T. Dodge, Regression of coronary artery disease as a result of intensive lipid-lowering therapy in men with high levels of apolipoprotein B. N. Engl. J. Med. 1990; 323: 1289–1298.
38. B. G. Brown, X. Q. Zhao, A. Chait, L. D. Fisher, M. C. Cheung, J. S. Morse, A. H. Dowdy, E. K. Mariino, E. L. Bolson, P. Alaupovic, et al., Simvastatin and niacin, antioxidant vitamins, or the combination for the prevention of coronary disease. N. Engl. J. Med. 2001; 345: 1583–1592.
39. A. J. Taylor, L. E. Sullenberger, H. J. Lee, J. K. Lee, and K. A. Grace, Arterial Biology for the Investigation of the Treatment Effects of Reducing Cholesterol (ARBITER) 2. A double-blind, placebo-controlled study of extended-release niacin on atherosclerosis progression in secondary prevention patients treated with statins. Circulation 2004; 110: 3512–3517.
40. Aspirin Myocardial Infarction Study Research Group, A randomized controlled trial of aspirin in persons recovered from myocardial infarction. JAMA 1980; 243: 661–669.
41. P. L. Canner, An overview of six clinical trials of aspirin in coronary heart disease. Stat. Med. 1987; 6: 255–263.
FURTHER READING Coronary Drug Project Research Group, The Coronary Drug Project: Methods and lessons of a multicenter clinical trial. Control. Clin. Trials 4: 273–536. This monograph includes the following chapters: 1. Brief description of the Coronary Drug Project and other studies; 2. Evolution and funding of the Coronary Drug Project; 3. Experimental design features; 4. Organizational structure of the study; 5. Role of the National Institutes of Health; 6. Role and Methods of the Coordinating Center; 7. Role and Methods of the Central Laboratory; 8. Role and Methods of the ECG Reading Center; 9. Role and Methods of the Drug Procurement and Distribution Center; 10. Role and Methods of the Clinical Centers; 11. Design of data forms; 12. External quality control programs; 13. Monitoring of the data for evidence of adverse or beneficial treatment effects; 14. Further aspects of data analysis; 15. Closing down the study; 16. Impact of the Coronary Drug Project on clinical practice.
CROSS REFERENCES

Adherence
Data Monitoring Committee
Early Termination - Study Stopping Boundaries
Subgroup Analysis
CORRELATION

PETER ARMITAGE
University of Oxford, Oxford, UK

In a rather loose sense, two characteristics or variables are said to be correlated if changes in one variable tend to be accompanied by changes in the other, in either the same or the opposite direction. Thus, the incidence of ischemic heart disease is positively correlated with the softness of drinking water, since many epidemiologic studies have shown that the incidence tends to be higher in areas with softer water; and, conversely, the incidence is negatively correlated with water hardness. In view of the more specific definitions to be discussed below, it would perhaps be preferable to use the term association for this informal usage. Correlation implies a linear relationship with superimposed random variation; association often loosely implies a monotone relationship, but the term may also be applied to nominal data where rank order is undefined.

For an account of the early history of the term correlation, see (10), especially pp. 297–299. The term was current during the middle of the nineteenth century, but its statistical usage is rightly attributed to Francis Galton, who initially used the spelling "co-relation". Galton was concerned with the correlation between characteristics of related individuals; for example, between an individual's height and the mean height of the two parents. The correlation coefficient emerged from Galton's work, after further elucidation by F. Y. Edgeworth and Karl Pearson, to become a central tool in the study of relationships between variables, especially (in Pearson's work) between physiological and behavioral measurements on human beings.

1 THE PRODUCT–MOMENT CORRELATION COEFFICIENT

The product–moment correlation coefficient (normally abbreviated to correlation coefficient) is a measure of the closeness of the association to a straight line. If the variables X and Y are random variables with a joint probability distribution, then the correlation coefficient is defined as

\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y},    (1)

where σ_XY, σ_X, and σ_Y are, respectively, the covariance of X and Y and the standard deviations of X and Y. The value of ρ is bounded between −1 and +1, taking these extreme values only when there is a linear functional relation between X and Y. Thus, if Y = α + βX exactly (as, for instance, with temperatures recorded in Fahrenheit for Y and centigrade for X), then ρ = 1 if β > 0 (as in this example), and ρ = −1 if β < 0. Biologic variables are not normally connected by linear functional relations, and the correlation coefficient usually lies between the two extremes.

There is a close connection between the concept of correlation and that of linear regression. Let β_Y.X be the slope of the linear regression of Y on X, and β_X.Y that of the regression of X on Y. Then

\beta_{Y.X} = \frac{\sigma_{XY}}{\sigma_X^2}, \qquad \beta_{X.Y} = \frac{\sigma_{XY}}{\sigma_Y^2}

and, from (1), β_Y.X β_X.Y = ρ². Since ρ² ≤ 1, |β_Y.X| ≤ |1/β_X.Y| (the latter expression being the slope of the regression of X on Y in a diagram with Y as the ordinate), equality being achieved only for perfect correlation. Thus, the two regression lines are in general inclined at an angle. When the correlation coefficient ρ = 0, both β_Y.X and β_X.Y are zero, and the two regression lines are at right angles.

For a particular value of X, define Y_0 to be the value predicted by the linear regression of Y on X. Then the variance of the residuals about regression, E[(Y − Y_0)²], is equal to σ_Y²(1 − ρ²). One interpretation of the correlation coefficient is, therefore, the fact that its square is the proportion of the variance of one variable that is "explained" by linear regression on the other. The relationship is symmetric, being equally true for the other regression.
Wiley Encyclopedia of Clinical Trials, Copyright 2008 John Wiley & Sons, Inc.
1
2
2
CORRELATION
THE SAMPLE CORRELATION
The correlation coefficient may also be defined, in similar manner, for a finite set of n paired quantitative observations (x_1, y_1), . . . , (x_n, y_n). The correlation coefficient, denoted now by r, may be calculated as

r = Σ(x_i − x̄)(y_i − ȳ) / [Σ(x_i − x̄)² Σ(y_i − ȳ)²]^{1/2}.   (2)

Here, x̄ and ȳ are the mean values of the x_i and y_i, respectively, and the summations run from 1 to n. The basic properties of the sample correlation coefficient are essentially those outlined above for random variables. The relationship with regression now applies to the slopes of the least squares regression lines, b_y.x and b_x.y. The squared correlation coefficient, r², is the proportion of the total sum of squares Σ(y_i − ȳ)² "explained" by the regression of y on x. Some scatter diagrams representing simple situations are shown schematically in Figure 1. In Figures 1(a) and (e) the points lie on a straight line, and r = +1 and −1, respectively. In Figure 1(c) the variation in one variable is approximately independent of the value of the other variable, and r = 0. In Figures 1(b) and (d) there is an intermediate degree of correlation, the variance of one variable being reduced when the value of the other variable is fixed, so 0 < r < 1 for (b) and −1 < r < 0 for (d). The sample correlation coefficient given in (2) is sometimes referred to as the Pearson product–moment correlation coefficient, the reference here being to Karl Pearson (7). It provides the method of calculation for any finite set of paired values, and invites consideration of its sampling error.

Figure 1. A schematic representation of scatter diagrams with regression lines, illustrating different values of the correlation coefficient. Reproduced from (1) by permission of Blackwell Science, Oxford.
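The computation in Equation (2) is easy to carry out directly. The following Python sketch is illustrative only (the data are simulated and the function name is ours); it computes r and its square, the proportion of variance explained by the fitted regression.

```python
import numpy as np

def pearson_r(x, y):
    """Sample (Pearson product-moment) correlation coefficient, as in Equation (2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Simulated paired observations with an approximately linear relation.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 0.8 * x + rng.normal(scale=0.5, size=50)

r = pearson_r(x, y)
print(round(r, 3))       # agrees with np.corrcoef(x, y)[0, 1]
print(round(r ** 2, 3))  # proportion of variance "explained" by the linear regression
```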
3 SAMPLING ERROR
Suppose that the n pairs of observations are drawn at random from a bivariate distribution of random variables X and Y. The sampling distribution of r will depend on the characteristics of the parent distribution, in particular on the population correlation coefficient ρ. In general, r is a consistent but biased estimator of ρ, but the bias is of order 1/n and likely to be small except for very small values of n. More specific results are available if stronger assumptions are made about the nature of the parent distribution, and the traditional assumption is that of a bivariate normal distribution. This model was widely used in the early work of Galton and Karl Pearson; appropriately enough, since it provides a reasonable description of many of the biometric variables studied by them. Under the bivariate normal assumption, the distribution of r depends only on ρ and n. The density was first derived in 1915 by Fisher (4) and subsequently tabulated by David (3). If ρ = 0, then the statistic

(n − 2)^{1/2} r / (1 − r²)^{1/2}   (3)

follows a Student's t distribution with n − 2 degrees of freedom, and can be used to test the null hypothesis that ρ = 0. In fact, this test is valid more generally, for the standard model for linear regression, in which the values x of X are chosen arbitrarily but Y is distributed normally with constant variance around a linear function of x. Returning to the bivariate normal model, Fisher derived a variance-stabilizing transformation of r,

z = tanh⁻¹ r = (1/2) log[(1 + r)/(1 − r)],

the distribution of which approaches normality more rapidly than that of r, as the sample size, n, increases. Asymptotically, E(z) = tanh⁻¹ ρ, and, approximately, var(z) = 1/(n − 3). An alternative transformation (9) provides a generalization of (3): for ρ ≠ 0, the statistic

(n − 2)^{1/2} (r − ρ) / [(1 − r²)(1 − ρ²)]^{1/2}

follows approximately a t distribution with n − 2 degrees of freedom. For further details about the distribution of r, see (6, Chapter 10).
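As a concrete (hypothetical) illustration, the test of ρ = 0 based on statistic (3) and an approximate confidence interval from Fisher's z transformation can be coded as follows; SciPy is assumed only for the t and normal distribution functions, and the numerical inputs are invented.

```python
import numpy as np
from scipy import stats

def test_rho_zero(r, n):
    """Test of rho = 0 using statistic (3), referred to Student's t on n - 2 df."""
    t = np.sqrt(n - 2) * r / np.sqrt(1 - r ** 2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-sided p-value
    return t, p

def fisher_z_ci(r, n, conf=0.95):
    """Approximate CI for rho via z = arctanh(r), using var(z) ~ 1/(n - 3)."""
    z = np.arctanh(r)
    half_width = stats.norm.ppf(0.5 + conf / 2) / np.sqrt(n - 3)
    return np.tanh(z - half_width), np.tanh(z + half_width)

print(test_rho_zero(0.45, n=30))   # t statistic and p-value for an observed r of 0.45
print(fisher_z_ci(0.45, n=30))     # approximate 95% confidence interval for rho
```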
4 INTRACLASS CORRELATION
Suppose that observations on a single variable y are arranged in n groups, each containing m observations, and that there is reason to expect possible differences in the mean level of y between groups. If such differences exist, observations in the same group will tend to be positively correlated. This phenomenon is called intraclass correlation. Fisher (5) illustrates this for the case m = 2 by referring to measurements on pairs of brothers. He distinguishes between situations in which the brothers fall into labeled categories such as ‘‘elder’’ and ‘‘younger’’, and those in which no such categorization is required. In the first case, there are two variables—measurements for elder and younger brothers—and the standard product–moment correlation coefficient may be calculated. This is called the interclass correlation. In the second case, each pair would enter into the calculation twice, since they
are not naturally ordered, and the denominator of the correlation coefficient may be based on a single sum of squares about the mean for all the 2n observations. This is the intraclass correlation. It is interesting to note that Fisher used the term "class" to denote the possible labeling categories ("elder" and "younger" here), whereas many modern writers use it to denote the groups into which the observations are clustered ("sibships" here). The intraclass correlation coefficient, r_I, may be calculated as a modified variant of (2) for all the nm(m − 1) pairs of observations in the same group, each pair being counted twice. In the cross product in the numerator in (2), all deviations are taken from the mean, ȳ, of all mn observations. Since each observation appears m − 1 times in the cross product, the denominator of (2) is m − 1 times the sum of squares of deviations of all mn observations about ȳ. An equivalent formula for r_I clarifies the relation between intraclass correlation and variation between the group means. Denote by ȳ_i the mean for the ith group, and by v the variance of the mn observations with divisor mn rather than mn − 1. Then

r_I = [m Σ(ȳ_i − ȳ)² − nv] / [(m − 1)nv],

the summation running from 1 to n. It follows that

−1/(m − 1) ≤ r_I ≤ 1,

the lower limit of −1/(m − 1) being achieved when all the ȳ_i are equal, and the upper limit of 1 when there is no variation within the groups, so that Σ(ȳ_i − ȳ)² = nv. Data of the type considered here would normally be analyzed by a one-way analysis of variance, and, as Fisher (5) showed, there is a close connection between the two approaches. If, in the analysis of variance, the mean squares between and within groups are denoted by s_b² and s_w², respectively, then

r_I = [s_b² − (n/(n − 1)) s_w²] / [s_b² + (m − 1)(n/(n − 1)) s_w²],

so, for large n,

r_I ≈ (s_b² − s_w²) / [s_b² + (m − 1) s_w²].

Equivalently, r_I ≈ (F − 1)/(F + m − 1), where F is the usual variance ratio statistic, s_b²/s_w². Two approaches may be followed in discussion of the sampling error of the intraclass correlation coefficient, using either finite population theory, as is usual in sample surveys, or the random effects model more usual in biologic applications. The first approach assumes that the n groups are randomly selected from a larger set of N, and that each set of m observations within a group is randomly selected from M. If the intraclass correlation coefficient for the finite population of MN observations is denoted by ρ_I, the sample value r_I may be regarded as an estimator of ρ_I. The random effects model effectively assumes infinite values for M and N. The group means are assumed to be distributed with a variance component σ_b², and the within-group deviations with a component σ_w². Then

ρ_I = σ_b² / (σ_b² + σ_w²).
Inferences about ρ I may be made using standard results for the F distributions in the one-way analysis of variance. Some examples of the use of intraclass correlation as a descriptive tool are as follows: 1. In sample survey theory, to indicate the correlation between observations in the same cluster due to systematic between-cluster variation. 2. In statistical genetics, to indicate the correlations in genetic traits between members of the same family: see the articles on familial correlations (which deals in detail with the situation in which the family groups vary in size) and genetic correlations and covariances. The numerical values of intraclass correlation coefficients are more meaningful in genetics than in most other applications, because of the predictions of Mendelian theory,
although the predicted values for familial correlations, for example between siblings, may be distorted by the additional effects of environmental correlation. 3. In studies of the reliability of repeated measurements, or the agreement between observers in measuring characteristics of the same subject. Note that when the same observers are used for each subject, there may be systematic differences in the level of recording for different observers, and the one-way analysis of variance analogue is no longer valid (2).
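The ANOVA-based formulas above translate directly into code. The sketch below is illustrative only (balanced simulated data; the function name is ours) and returns both the exact expression in terms of s_b² and s_w² and the large-n approximation.

```python
import numpy as np

def intraclass_correlation(groups):
    """Intraclass correlation from a balanced one-way layout.

    `groups` is an (n, m) array: n groups, m observations per group.
    Returns the exact ANOVA-based value and the large-n approximation
    (s_b^2 - s_w^2) / (s_b^2 + (m - 1) s_w^2) given in the text.
    """
    y = np.asarray(groups, dtype=float)
    n, m = y.shape
    grand_mean = y.mean()
    group_means = y.mean(axis=1)
    ss_between = m * np.sum((group_means - grand_mean) ** 2)
    ss_within = np.sum((y - group_means[:, None]) ** 2)
    s2_b = ss_between / (n - 1)        # between-group mean square
    s2_w = ss_within / (n * (m - 1))   # within-group mean square
    c = n / (n - 1)
    r_exact = (s2_b - c * s2_w) / (s2_b + (m - 1) * c * s2_w)
    r_approx = (s2_b - s2_w) / (s2_b + (m - 1) * s2_w)
    return r_exact, r_approx

# Hypothetical data: 20 groups (e.g., sibships) of size 2 sharing a group effect,
# with between- and within-group variance components both equal to 1 (rho_I = 0.5).
rng = np.random.default_rng(1)
group_effects = rng.normal(scale=1.0, size=(20, 1))
data = group_effects + rng.normal(scale=1.0, size=(20, 2))
print(intraclass_correlation(data))
```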
5 SOME GENERALIZATIONS

The concepts underlying the product–moment correlation coefficient may be generalized or modified in various ways. The (x, y) pairs may not be closely linearly related, but may nevertheless be perfectly associated through a nonlinear relation. If this relation is monotone, the observations will be ranked in the same order by both x and y. Two commonly used coefficients of rank correlation are Spearman's ρ (essentially the product–moment correlation of the ranks), and Kendall's τ (based on the number of discrepancies in the ranking of paired observations by the two variables). Both coefficients are bounded by the values −1 and +1 for perfect negative and positive agreement between the rankings. When there are more than two quantitative variables, the correlations between pairs play an important part in various methods of multivariate analysis. In multiple linear regression, two generalizations of r are commonly used. The multiple correlation coefficient, R, generalizes a property of r, in that the proportion of the variance of the dependent variable y that is "explained" by the multiple regression on x_1, . . . , x_k is R². (Since the squared form carries the essential information, R is never given a negative sign.) The partial correlation coefficient measures the product–moment correlation between y and one of the predictor variables, x_i say, when all the other predictors are kept constant. It is therefore useful in assessing the separate effects of different predictors, especially if they are themselves closely correlated.

In the early work of the Galton–Pearson school, much effort was put into the estimation of correlation coefficients when the observed values were nominal—perhaps even binary—but were supposed to represent divisions of some underlying, unobserved, continuous variables. The correlation between the presumed continuous variables had to be estimated from the discrete observations. It was usually assumed that the underlying distribution was bivariate normal. For two binary classifications, the measure is called tetrachoric correlation. When one variable is binary and the other is quantitative, the measure is called biserial correlation. These methods are less frequently used now, as alternate models for categorical data seem more appropriate.
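The rank, multiple, and partial correlation coefficients described here can be illustrated with a short sketch; the data are simulated, SciPy is assumed for the rank coefficients, and the partial correlation is obtained by the usual residual-on-residual construction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=100)   # predictors deliberately correlated
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(size=100)

# Rank correlation between y and x1: Spearman's rho and Kendall's tau.
rho, _ = stats.spearmanr(x1, y)
tau, _ = stats.kendalltau(x1, y)

# Multiple correlation coefficient R from the regression of y on x1 and x2.
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
R2 = 1.0 - np.var(y - X @ coef) / np.var(y)       # R^2 = proportion of variance explained

# Partial correlation of y and x1 with x2 held constant:
# correlate the residuals of y on x2 with the residuals of x1 on x2.
def residuals(z, w):
    W = np.column_stack([np.ones_like(w), w])
    b, *_ = np.linalg.lstsq(W, z, rcond=None)
    return z - W @ b

partial_r = np.corrcoef(residuals(y, x2), residuals(x1, x2))[0, 1]
print(round(rho, 3), round(tau, 3), round(np.sqrt(R2), 3), round(partial_r, 3))
```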
6 INTERPRETATION
The high profile assumed by the concept of correlation during the early part of the twentieth century has now largely vanished. This is partly due to the emergence of more penetrating methods of statistical analysis. In particular, the emphasis has gradually moved away from an index measuring a degree of association, to an attempt to describe more explicitly the nature of that association. In a word, the emphasis has moved from correlation toward regression. Apart from this general shift in viewpoint, there are some specific problems in the interpretation of correlation coefficients, which need to be taken into account in any data analysis in which they are used: 1. Correlation does not imply causation. Two variables may be highly correlated because they are both causally related to a third variable or a group of such variables, and yet have no causative relation to each other. Relations of this type are often called nonsense or spurious correlations. Often, the intervening variable is time. That is, two variables x and y may both be steadily increasing with time, over a certain time period,
or one may be increasing while the other decreases. The two variables are then likely to be highly correlated. For instance, during the first two-thirds of the twentieth century, imports of tobacco into the UK increased steadily, as did the number of divorces granted. The two variables, measured in successive decades, are highly correlated. It would certainly not be correct to assume that either variable caused the other. Nonsense correlations are among the most prevalent causes of injudicious inferences from statistical data by the general public and the media. 2. Correlation measures closeness to a linear relationship. Two variables may be very closely associated by a nonlinear relation, and yet have a low correlation coefficient. As an extreme example, in Figure 2 (reproduced from (8)) is shown a scatter diagram between randomly drawn standard normal deviates and their squares. The population correlation coefficient is exactly zero, but it would have been entirely wrong to assume a lack of dependence. Independent random variables have zero correlation; but zero correlation does not imply independence. 3. The correlation between two biologic variables may be affected by selection of particular values of one variable. For instance, if X and Y have a distribution approximating to a bivariate normal distribution, with correlation coefficient ρ, restriction of the range of X by removal of extreme values in both directions will tend to decrease the correlation coefficient below the original value ρ. Thus, the correlation between height and age of children is higher for the age range 5–12 years than for the range 7–8 years. Conversely, omission of central values of X with retention of extreme values will tend to increase the correlation coefficient. This phenomenon may make it difficult to compare correlation coefficients in different populations differing in the degree and type of selection.
4. The effect of sampling variation is often underestimated, so that undue importance is given to moderately high correlations based on few observations. The upper 5 percentile of the distribution of |r| when the population value ρ = 0, from (3), is 0.878 for n = 5, and 0.632 for n = 10. In this sense, moderately large correlations from small numbers of observations are inherently unreliable.
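The 5% points just quoted follow from inverting statistic (3). A short sketch (ours, using SciPy only for the t quantile) reproduces them for a range of sample sizes:

```python
from math import sqrt
from scipy import stats

def r_critical(n, alpha=0.05):
    """Two-sided critical value of |r| under rho = 0, obtained by inverting statistic (3)."""
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t / sqrt(t ** 2 + n - 2)

for n in (5, 10, 20, 50):
    print(n, round(r_critical(n), 3))   # n = 5 -> ~0.878, n = 10 -> ~0.632
```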
Figure 2. A scatter plot of Y = X² for 100 standard normally distributed random numbers and their squares. Reproduced from (8) by permission of Wiley, New York.
REFERENCES
1. Armitage, P., Berry, G. & Matthews, J. N. S. (2002). Statistical Methods in Medical Research, 4th Ed. Blackwell Science, Oxford.
2. Bartko, J. J. (1966). The intraclass correlation coefficient as a measure of reliability, Psychological Reports 19, 3–11.
3. David, F. N. (1938). Tables of the Correlation Coefficient. Cambridge University Press, Cambridge.
4. Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population, Biometrika 10, 507–521.
5. Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver & Boyd, Edinburgh.
6. Patel, J. K. & Read, C. B. (1982). Handbook of the Normal Distribution. Marcel Dekker, New York.
7. Pearson, K. (1896). Mathematical contributions to the theory of evolution, III: regression, heredity and panmixia, Philosophical Transactions of the Royal Society of London, Series A 187, 253–318.
8. Rodriguez, R. N. (1982). Correlation, in Encyclopedia of Statistical Sciences, Vol. 2, S. Kotz & N. L. Johnson, eds. Wiley, New York, pp. 193–204.
9. Samiuddin, M. (1970). On a test for an assigned value of correlation in a bivariate normal distribution, Biometrika 57, 461–464.
10. Stigler, S. M. (1986). The History of Statistics. Belknap Press, Cambridge, Mass.
COST-EFFECTIVENESS ANALYSIS

MORVEN LEESE
Institute of Psychiatry—Health Service and Population Research Department, London, UK

In health economics, evidence from both clinical and economic perspectives is considered together. The result might be to prioritize interventions, set research priorities, or devise a definitive recommendation about whether a specific intervention should be adopted as being both clinically effective and good value for money. The first step in a health economics study would be to gather and present costs and clinical consequences side by side, without them necessarily being entered into a formal cost-effectiveness analysis (a process referred to as a "cost-consequence" analysis). Cost-effectiveness analysis (CEA) in a clinical trial, as discussed in this article, is a more specific analysis that takes this process a step further. It typically aims to provide information about the value for money of a small number of competing treatments, often just two—the new intervention and its comparator. This comparison is usually expressed through the incremental cost effectiveness ratio (ICER), which is the cost per unit of benefit obtained from the intervention compared with the comparator. A related measure is the incremental net benefit (INB), which is the overall net cost for a given willingness-to-pay per unit of benefit. The INB has recently become much more widely used for inferential purposes than the ICER, because of analytical problems that develop from the ICER's ratio form. However, the comparison made directly within the clinical trial framework may provide only partial information on cost-effectiveness (1). Consequently, modeling is often used to generalize the results of a CEA to time frames or patient groups other than those used in the trial. The various approaches to making and generalizing cost-effectiveness comparisons in clinical trials are described later in this article. First, a short review of definitions and design issues will be given.

1 DEFINITIONS AND DESIGN ISSUES

1.1 The Various Types of Analysis Used in Economic Evaluation

Cost-effectiveness analysis is a method for comparing the costs and clinical outcomes of competing courses of action, which allows health providers and policy makers to choose between them within budgetary constraints. The procedures to be compared may be drug treatments, psychological interventions, or health-management systems. It is usually regarded as an essential component of a clinical trial and regulatory bodies may require the submission of economic information as part of the process of reimbursement (2). Cost-effectiveness analysis allows different procedures to be compared in terms of cost per unit of outcome. Cost utility analysis (CUA) is similar to cost effectiveness analysis (and is sometimes considered to be a branch of it), except that outcomes are measured in generalized units of utility, rather than in disease-specific units such as symptom reduction scores. CUA enables mortality and morbidity to be combined into a single measure, for example as quality-adjusted life years (QALYs), and allows the comparison of treatments for different diseases to be placed on a common basis. Cost minimization analysis (CMA) is also used in economic evaluation. It is only relevant where competing procedures can be regarded as having equal benefit, in which case it makes sense to compare their costs directly. CMA is less commonly used in clinical trials than CEA or CUA, because equivalence of treatments is difficult to establish (3). Both the terms "cost effectiveness analysis" and "cost benefit analysis" (CBA) are sometimes used to refer to any kind of economic evaluation, but they should be distinguished. CEA differs from CBA in that the latter measures both costs and consequences in monetary terms, whereas in CEA the consequences are measured in terms of a common health outcome. CBA, which is described in another article, tends to be used in a general context where very different health technologies, often more than two, are to be compared with each other and/or with other modes of expenditure within an overall budget (such as the national economy).

1.2 The Economic Perspective

Culyer (4) defines the economic perspective as "the viewpoint adopted for the purposes of an economic appraisal (cost effectiveness, cost-utility studies and so on) which defines the scope and character of the costs and benefits to be examined, as well as other critical features, which may be social value-judgemental in nature, such as the rate of discount." The perspective of the patient would be concerned with, for example, symptom reduction or loss of income because of illness. The perspective of the health provider would include the costs of drugs or hospital beds, and the perspective of society might include loss of taxes because of illness-related unemployment. If appropriate, the analysis can be presented from more than one perspective. It should be said, however, that in many clinical trials a limited perspective is forced onto the investigator, especially if the economic component of a clinical trial has been "piggybacked" on to an existing trial design. Funding bodies may insist on a particular perspective for submissions. For example, the National Institute for Health and Clinical Excellence in England and Wales specifies this to be all health outcomes on individuals, and costs to the National Health Service and Personal Social Services (5). The U.S. Public Health Service panel requires a societal perspective (6).

1.3 Choice of Comparator

From the health economist's point of view, the choice of comparator would ideally be current best practice (if it can be adequately defined), for example, one that has been shown to be most effective or cost effective in the past. However, the comparator might simply be the currently most widely used treatment. The choice would depend on the scenario in which the economic decision is to be made. Placebo controls would rarely be appropriate; exceptions occur in which the new drug is to
be used as an adjunctive therapy to supplement rather than to substitute for an existing intervention (7), or if it is necessary to obtain regulatory approval.
1.4 Setting and Timescale

The economic perspective will influence the setting and length of time to which the economic analysis applies (the analytic horizon). The ideal time horizon from an economic point of view would be that over which the intervention being evaluated might have an effect on costs and benefits. In practice, however, shorter time horizons—such as the duration of the effectiveness trial—are chosen for practical and cost reasons. When the relevant data cannot be measured directly during the clinical trial, it is common practice to extrapolate the results by modeling. This process may entail additional data gathering outside the trial, either from surveys or published literature. If the results are to be generalized to different patient groups (for example from an urban setting to a rural setting), then information on costs and relevant patient characteristics may need to be gathered from the different settings so that modeling can be undertaken. A consideration for both costs and benefits is discounting to value future costs to the present day (8). A typical rate would be 3.5% per annum for costs. Discounting health outcomes can also be applied and a similar rate to that used for costs would usually be appropriate for consistency. (Note that inflation adjustment for costs is a different procedure that may also be applied, in addition to discounting.) The net-present value of the streams of costs and benefits, and hence the cost-effectiveness ratio, may be affected by discounting even when the same timescales are involved. However, it becomes a particular issue when benefits operate over different timescales to costs, for example where a current treatment is compared with a preventive measure. In practice, discounting is not considered in many clinical trials, on the basis that the rate would apply equally in both arms.
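As a simple, hypothetical illustration of discounting future costs at 3.5% per annum (the figures below are invented and the convention of leaving year 0 undiscounted is one common choice):

```python
def present_value(amounts, rate=0.035):
    """Discount a stream of future amounts (one per year, year 0 undiscounted)."""
    return sum(a / (1 + rate) ** t for t, a in enumerate(amounts))

# Hypothetical per-patient costs in years 0-4 of follow-up (in £).
annual_costs = [1200.0, 800.0, 800.0, 800.0, 800.0]
print(round(present_value(annual_costs), 2))              # discounted total
print(round(present_value(annual_costs, rate=0.0), 2))    # undiscounted total, for comparison
```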
1.5 Sample Size

As a rule, sample sizes are larger for an economic evaluation than for the corresponding clinical evaluation. This larger sample size occurs because many possible sources of variation exist for usage indicators and costs, so that standard deviations are generally very high. It is worth noting that in cluster-randomized designs, sample sizes might be even larger. Not only must the number of individuals be increased according to the design effect, but also the minimum number of clusters should be increased because of the special features of health economics data (9). Simulation or bootstrapping approaches for sample sizes based on the ICER have been developed (10). A Bayesian approach has also been suggested (11), although it has not been very widely adopted possibly because of its relative complexity. A straightforward classic approach based on the INB is described later. For an extensive discussion of approaches to sample size calculation, see Willan and Briggs (12).

1.6 Economic Analysis Plans

Glick et al. (13) give guidelines on the content of an economic analysis plan. The plan can either be drawn up in parallel with the main analysis plan or incorporated within it. In any case, the two plans should be as consistent as possible, for example, in the treatment of missing values and choice of primary clinical outcome and whether to use an intention to treat approach. However, some differences are likely to occur. Additional disaggregated outcomes may be specified, as would technical economic issues such as the economic perspective and sources of cost data. A description of any intended adjustment of trial-based estimates through modeling would also be included in the economic analysis plan.

2 COST AND EFFECTIVENESS DATA

2.1 The Measurement of Costs

The measurement of costs involves determining the resources used and placing monetary values on them via unit costs. Resource use is multiplied by these unit costs to obtain total cost per patient. If costs are common between two alternatives, then they may be
omitted from the analysis, bearing in mind that it would restrict more general comparisons in the future. Resources can be direct (e.g., drugs, hospitalization, travel to hospital) or indirect (e.g., lifetime medical costs, sickness absence). Data collection methods may be informal if the costs are simple (e.g., pharmacy notes on hospital case reports) or via especially designed questionnaires, if complex. Examples of the latter are the CSSRI-EU (14), used for psychiatric studies, in which aggregated resource use over a six-month period is noted from case records or patient interviews, and the cost diary (15), in which the patient records all use of resources over a period of time such as a week or month. Different methods can result in different estimates (16), and validity and accuracy checks are advisable. Unit costs can be found from several sources, which include the health economics literature and charges made by health providers. See Glick et al. (17) for an extended discussion of unit cost estimation.

2.2 The Measurement of Effectiveness

Although mortality and morbidity, for example symptom or quality of life scores, may be used as clinical effectiveness measures, generic health utility-based outcome measures such as QALYs (18) tend to be preferred for economic evaluations. This preference occurs because they take into account both quantity and health-related quality of life, they allow for patient or societal preferences about outcomes (utilities), and they can be used to compare treatments in different contexts. (A cost-effectiveness analysis using such measures would then take the form of a cost-utility analysis.) Utility-based outcome measures are typically derived as follows. Patients describe their state of health, using categorical responses to several items using a standard scale, via a self- or interviewer-administered quality-of-life questionnaire. These responses are mapped on to a single score, which usually ranges from 1 for perfect health to 0 for dead (although it is feasible for some states to be regarded as worse than death and have negative values). The mapping process uses previously obtained utility (or preference) weights that
reflect public or patient preferences as to different health states. These weights are typically derived from large samples, either of the general population or patients with the disease of interest. A survey would normally provide weights to compute QALYs in many individual studies and would only be repeated when a questionnaire was to be used in a new patient group. Ryan et al. (18,19) discuss methods for deriving preference weights, which include the following widely used techniques: time trade-off, standard gamble, and visual analog scale methods. The QALY is computed as the product of the utility score and the length of time over which QALYs are to be accumulated. This value might be the length of the trial, survival within the trial or a normal lifespan projected into the future. One QALY is equivalent to a year of perfect health or two years in a health state valued at 0.5. Because QALYs can be derived from a variety of scales and/or sets of preference weights, it is advisable to specify the particular "flavor" of QALY when reporting results.

2.3 Quality-of-Life Scales

A criterion for the choice of a quality-of-life scale might be its ability to produce utility scores and hence QALYs. Two examples are as follows: The EQ5D (20) (previously known as the EuroQol) is a public-domain self-administered generic measure of health-related quality of life that can produce QALYs. It uses a five-domain classification system: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. Each domain is scored as follows: no health problems = 1, moderate health problems = 2, and extreme health problems = 3. Each of the 243 unique health states thus has an associated descriptor that ranges from 11111 for perfect health to 33333 for the worst possible state. The HUI (21) is self- or interviewer-administered. The HUI2 consists of seven attributes, each with three to five levels that range from highly impaired to normal: sensation (vision, hearing, and speech), mobility, emotion, cognition, self-care, pain, and fertility. The HUI3 has eight attributes (vision, hearing, speech, ambulation, dexterity, emotion, cognition, and pain) each with five to six levels.
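The QALY calculation described above amounts to accumulating the utility score over time. A minimal sketch (hypothetical utilities; the trapezoidal rule between assessment points is one common convention, not a prescription from this article):

```python
import numpy as np

def qalys(times_in_years, utilities):
    """QALYs accrued over follow-up: area under the utility curve,
    using the trapezoidal rule between assessment points."""
    t = np.asarray(times_in_years, dtype=float)
    u = np.asarray(utilities, dtype=float)
    return float(np.sum(0.5 * (u[1:] + u[:-1]) * np.diff(t)))

# Hypothetical patient assessed at baseline, 6 and 12 months with EQ-5D-type utility scores.
print(round(qalys([0.0, 0.5, 1.0], [0.62, 0.75, 0.80]), 3))   # about 0.73 QALYs over one year

# One year at utility 1.0 and two years at utility 0.5 both give 1 QALY.
print(qalys([0.0, 1.0], [1.0, 1.0]), qalys([0.0, 2.0], [0.5, 0.5]))
```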
3 THE ANALYSIS OF COSTS AND OUTCOMES

3.1 The Comparison of Costs

Whereas the balance between cost and effectiveness is generally the key issue in any evaluation, an analysis of the relative costs of competing interventions may be of interest in itself, and it would be the primary analysis in a cost-minimization study. Most cost data is skewed and, in the past, nonparametric methods or transformation have been suggested to deal with this situation. It is now generally recognized that the arithmetic mean cost is of prime interest (because it can be used to produce total costs for populations), and it has led to the suggestion of bootstrapping of the mean costs and group differences (22) or the use of generalized linear models. Nixon and Thompson (23) have compared a number of parametric models (such as the Normal, gamma, log-normal, and log-logistic distributions) from the point of view of estimating the mean. They conclude that large samples are required to distinguish between these models, because the fit in the tails is important, and advise conducting sensitivity analyses with different model choices.

3.2 The Incremental Cost Effectiveness Ratio

Balancing cost and effectiveness can be achieved through the Incremental Cost Effectiveness Ratio (ICER), which is defined as

ICER = C/E,   (1)
where C is the difference in cost between the two interventions to be compared, and E is the difference in effectiveness. The E, C combinations can be displayed on the cost-effectiveness plane, in which the ICER corresponding to any combination is the slope of the line joining the point to the origin, as shown in Fig. 1. This figure also shows 1000 bootstrapped resamples from the distribution of E and C.

Figure 1. A cost-effectiveness plane (difference in effects in QALYs on the horizontal axis, difference in costs in £ on the vertical axis), showing 1000 bootstrapped resamples from a distribution of cost and effect differences (means £10,000 for costs and 0.25 for QALYs). The point estimate of the ICER (£40,000) is indicated as a bold dot. Lines corresponding to maximum willingness-to-pay values of £30,000 and £40,000 are shown.

The four quadrants in the plane have different interpretations. Quadrant SE shows "dominance" of the treatment (where it is more effective and less costly and therefore the obvious choice); quadrant NW shows dominance of the comparator. In the other
two quadrants, the choice between the alternatives depends on the maximum willingness to pay, λ. This value is indicated by a line whose slope is λ, so that in quadrant NE any combination that lies below the line would be regarded as cost effective at that value of λ. Points in quadrant SW, where E is negative, would often be regarded as unacceptable because a new treatment which was less effective would often be rejected for ethical reasons even if less costly. Several problems occur in analyzing ICERs. One problem is that a single value may be consistent with different combinations of E and C lying in different quadrants and therefore having different interpretations. A negative ratio is consistent with a more effective and less costly treatment and with a less effective and more costly treatment. Furthermore, this ratio is undefined at the origin, and it has a skewed distribution. These features mean that analytical solutions for estimating confidence intervals may be problematic even when the point estimate is unambiguous, because they may span different quadrants. Confidence limits for the
ICER have been the subject of much investigation (24). One solution to confidence interval estimation, which assumes that C and E have a joint normal distribution, is based on Fieller's method (25,26), the interval being given by

[(C·E − z²_{α/2} σ_EC) ± √{(C·E − z²_{α/2} σ_EC)² − (E² − z²_{α/2} σ_E²)(C² − z²_{α/2} σ_C²)}] / (E² − z²_{α/2} σ_E²),   (2)

where z_{α/2} is the critical point from the standard normal distribution (α = 0.05 for 95% limits), σ_E is the standard deviation of the effect differences, σ_C is the standard deviation of the cost differences, and σ_EC is the covariance between them. Confidence interval estimation can also be performed by ordering the slopes associated with bootstrapped estimates of E and C, which determine the bottom and top percentiles for the ICER. However, when the limits span the axes, the problems with ICERs mentioned above may still occur. Bootstrapped
estimates on the cost-effectiveness plane (as shown in Fig. 1) are nevertheless informative in illustrating the range of values consistent with the data.

3.3 The Incremental Net Benefit

The Incremental Net Benefit (INB) approach (27) can avoid the problems associated with the ICER by a simple rearrangement of the ICER as follows:

INB = λE − C,   (3)

where C is the difference in costs, E is the difference in effects, and λ is the maximum willingness to pay. The variance of the INB, σ_B², is given by

σ_B² = λ²σ_E² + σ_C² − 2λσ_EC.   (4)

This formulation is expressed in units of cost; it can of course also be expressed in terms of units of effectiveness. The advantage of the INB is that it avoids the use of ratios. In most cases, it can be treated as approximately normally distributed so that standard statistical methods can be used to analyze it, including sample size estimation and OLS regression. The disadvantage is that a choice of λ needs to be made, which is not always easy. Sample size calculations based on the INB are straightforward, although they may need to be performed for a range of figures for λ. The sample size per arm for a single-sided test at significance level α and power 1 − β is given by

N = (z_{1−α} + z_{1−β})² (σ²_BS + σ²_BT) / δ²,   (5)
where z1−α and z1−β are the critical values from the standard normal distribution, δ is the minimum clinically relevant value for the INB, and σ bS and σ bT are the standard deviations for the INBs in the standard and treatment group, respectively. Estimates of standard deviations for the costs and effects can be obtained from pilot studies or the literature. The correlation between cost and effectiveness is reported in the literature rarely; if such data are unavailable, then it is advisable to use a range of plausible correlations. As expected, sample size for estimating cost effectiveness approaches that for estimating the effectiveness alone as λ increases,
and also as the correlation between cost and effectiveness increases. 3.4 The Cost Effectiveness Acceptability Curve The Cost Effectiveness Acceptability Curve (CEAC) plots the probability of a treatment being cost effective in relation to a comparator, against a range of plausible values for λ. For example, in Fig. 1 the proportion of bootstrapped estimates of E and C lying below the line with slope λ = £30,000 is about 40%. By varying λ, a nonparametric CEAC can be built up as in Fig. 2. It has been proposed that the proportion is interpreted as the probability of the new intervention being cost effective at willingness-to-pay λ. However, controversy surrounds the use of the term ‘‘probability.’’ Briggs (28), among others, argues that it only makes sense in Bayesian terms. Nevertheless, this concept has provided a widespread and useful method of summarizing the evidence regarding cost effectiveness. Note that the proportion where the solid line cuts the y axis is the single-sided P value for the cost difference (because the value of λ is 0, and the INB is determined solely by the cost). Similarly, the probability to which the solid line tends is 1 minus the P value for the difference in effectiveness. Indeed, rather than bootstrapping estimates, such curves can be produced via the P values from analytical tests of INB = 0 versus INB > 0 for different values of λ. Parametric CEACs can also be computed by integrating an assumed joint distribution for E and C or by parametric bootstrapping (26,29). In Fig. 2, all combinations of E and C that lie under the λ -line are included in the calculation of the proportion. However, one might wish to exclude negative Es in the SW quadrant on ethical grounds. Severens et al. (30) point out that the CEAC might need to be adapted to deal with points in the SW quadrant if the willingness to accept a health loss differed from the willingness to pay for health gain (the lines would have different slopes in the two quadrants). O’Brien and Briggs (26) illustrate CEACs along with a related curve in which the INB is plotted against λ. In such a plot, the points where
the 95% CIs for the INB cross the λ axis are identical to the confidence limits for the ICER. Fenwick et al. (29) give an extended discussion of the use and interpretation of CEACs.

Figure 2. A Cost Effectiveness Acceptability Curve based on the bootstrapped resamples shown in Figure 1 (x axis: maximum willingness to pay in £ (λ); y axis: percentage of bootstrapped samples cost effective). The solid line indicates cost effectiveness of the intervention; the dotted line denotes cost effectiveness of the comparator. The vertical line indicates the point estimate of the ICER (corresponding to a probability of cost effectiveness of 50%).
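The quantities in this section can be illustrated end to end with simulated patient-level data. The sketch below is not from the article: the cost and effect distributions, sample sizes, and willingness-to-pay values are invented, the two arms are treated as independent, and the covariance term in the Fieller interval is set to zero because costs and effects are simulated independently here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical per-patient costs (£) and effects (QALYs) in two trial arms.
n = 200
cost_t = rng.gamma(shape=2.0, scale=6000.0, size=n)   # treatment arm
cost_c = rng.gamma(shape=2.0, scale=4000.0, size=n)   # comparator arm
eff_t = rng.normal(0.70, 0.25, size=n)
eff_c = rng.normal(0.55, 0.25, size=n)

C = cost_t.mean() - cost_c.mean()   # incremental cost
E = eff_t.mean() - eff_c.mean()     # incremental effect
print("ICER:", C / E)               # Equation (1)

# Fieller-type 95% limits (Equation (2)), assuming joint normality of C and E
# and, for simplicity in this sketch, zero covariance between them.
z2 = stats.norm.ppf(0.975) ** 2
var_C = cost_t.var(ddof=1) / n + cost_c.var(ddof=1) / n
var_E = eff_t.var(ddof=1) / n + eff_c.var(ddof=1) / n
cov_EC = 0.0
a = E ** 2 - z2 * var_E
b = C * E - z2 * cov_EC
c = C ** 2 - z2 * var_C
disc = np.sqrt(b ** 2 - a * c)
print("Fieller 95% CI:", ((b - disc) / a, (b + disc) / a))

# Nonparametric bootstrap of (E, C) and a CEAC based on the INB (Equation (3)).
B = 2000
boot = np.empty((B, 2))
for i in range(B):
    idx_t = rng.integers(0, n, n)
    idx_c = rng.integers(0, n, n)
    boot[i, 0] = eff_t[idx_t].mean() - eff_c[idx_c].mean()
    boot[i, 1] = cost_t[idx_t].mean() - cost_c[idx_c].mean()

for lam in (20_000, 30_000, 50_000):
    inb = lam * boot[:, 0] - boot[:, 1]
    print(f"lambda = {lam}: proportion cost effective ~ {np.mean(inb > 0):.2f}")
```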
4 ROBUSTNESS AND GENERALIZABILITY IN COST-EFFECTIVENESS ANALYSIS

Ensuring the robustness and generalizability of economic conclusions as far as possible is a key issue in cost-effectiveness evaluations. The special characteristics of economic data may need to be addressed. Classic sensitivity analysis would be considered as essential; this method seeks to ensure that the conclusions of an economic evaluation are robust and do not depend heavily on the values of particular parameters or model assumptions. It may also be necessary to account for imbalance in the arms of the trial, to reduce bias in an overall estimate, or to make recommendations for a particular subgroup (typically using regression techniques as in the analysis of clinical data). Finally, extrapolation into the future through modeling may be required. These analytical considerations are now discussed, and two published examples are summarized that illustrate the range of techniques available.

4.1 Missing Data

Economic evaluations frequently entail missing values (because of dropouts or failure to follow up participants), censored values, and treatment switches. In a discussion of possible solutions to the problem, Briggs et al. (31) point out that (1) complete case analysis is inefficient and possibly biased, (2) available case analysis leads to difficulties in using standard statistical methods and can preclude the estimation of total cost per patient, and (3) naïve imputation methods fail to reflect variability in the data. They illustrate more sophisticated methods in which the appropriate level of variability can be incorporated into summary statistics, based on the variation between several complete imputed data sets (multiple imputation). These authors also give several references to methods appropriate for attrition that is not at random and a list of software for missing value treatment.
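When multiple imputation is used, the estimates from the m completed data sets are typically combined with Rubin's rules, which add the between-imputation variability to the average within-imputation variance. A minimal sketch (the numbers are invented for illustration):

```python
import numpy as np

def rubins_rules(estimates, variances):
    """Combine m multiply-imputed estimates of a quantity (e.g., mean cost per patient).

    Returns the pooled estimate and its total variance W + (1 + 1/m) * B,
    where W is the average within-imputation variance and B is the
    between-imputation variance of the estimates.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()
    W = u.mean()
    B = q.var(ddof=1)
    return q_bar, W + (1 + 1 / m) * B

# Hypothetical mean-cost estimates and their variances from m = 5 imputed data sets.
est = [10150.0, 9980.0, 10310.0, 10090.0, 10220.0]
var = [250_000.0] * 5
pooled, total_var = rubins_rules(est, var)
print(round(pooled, 1), round(np.sqrt(total_var), 1))   # pooled mean cost and its standard error
```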
4.2 Censored Data

Censored data may occur when accumulated cost or survival is measured over a fixed period. Whereas survival itself can be estimated using usual life table techniques, the same is not true of quality adjusted survival (QALYs) or cost. Standard survival techniques need to be modified for such data because the costs or QALYs accumulated by the censoring point will be correlated with those accumulated at death: Those who are censored are likely to have smaller values, and censoring is thus informative. Two methods that have been proposed to deal with such data are as follows. The direct (Lin) method (32) is based on patients' cost histories over subintervals of the follow-up, with costs or QALYs in each interval weighted by the probability of surviving to the start of the interval. Inverse probability weighting (33) also involves weighting costs in subintervals, but weights are based on the probability of surviving to the actual censoring point rather than the start of the interval. It is more flexible than the direct method for that reason and also because adjustment for covariates is possible.

4.3 Treatment Switches

In addition to missing or censored data, another threat to the robustness of conclusions is "off protocol" treatment switching. Torrance et al. (34) discuss the implications of switches in different scenarios (e.g., treatment to comparator and vice versa). The traditional analytical approach is intention to treat, in which patients are analyzed (so long as data are available) as if they were in the arm to which they were randomized. This approach compares the strategies of offering the treatments under consideration. In economic evaluations that are aimed at "real-life" scenarios, such switches might not be discouraged, and the interest might be in the true treatment effect rather than the effect of an offer of treatment. In that case, specialized analytical approaches are available (35).

4.4 Multicenter Trials and Pooling

Multicenter trials may be undertaken to maximize the number of patients available for a
clinical study and to improve generalizability of the clinical findings. If the centers are in different countries, then purchasing power parities can be used to equalize the purchasing power of different currencies. Drummond and McGuire (7), who give a full discussion of issues in costing in this situation, advise presenting all costs in their raw state, as well as any final adjusted values. A problem for economic evaluations in multicenter trials is that whereas clinical effects might reasonably be expected to transfer from one center to another, it is less likely to be true for costs because both resource use patterns and unit costs may differ by center (36). This analysis applies particularly in multinational studies. Thompson et al. (37) discuss the use of multilevel gamma models for modeling costs taking account of both patient and center heterogeneity. Drummond and Pang (38) discuss the issues involved in pooling cost-effectiveness estimates and the stage at which this should be performed (e.g., pooling all data before calculating ICERs or pooling the ICERs themselves). They point out that homogeneous effects for the clinical outcome as measured on an odds ratio scale (e.g., when the outcome is mortality) may be heterogeneous when measured on the additive scale necessary for cost-effectiveness estimates, especially if the baseline rates differ by center. Furthermore, treatment-by-center interactions may have limited power to detect such heterogeneity in economic data. Decisions about pooling are often difficult to make, and it is advisable to present detailed center-specific summary statistics as well as any pooled estimates of the costs, the ICERs, or the INBs. 4.5 Classic Sensitivity Analysis Examples of assumptions that would be entered into a sensitivity analysis would be the choice of distributional model used to compare mean costs or the use of common or center-specific unit costs in multicenter trials. Parameters are varied and the base case analysis is re-run so that the impact of uncertainty in these parameters or assumptions can be assessed. Typically, extreme values for parameters (rather than the best estimates as used for the base-case) are substituted
and the analysis is rerun. Favorable or unfavorable scenarios may be devised in this way, and the substitution may be performed either one parameter or several at a time. Walker and Fox-Rushby (39) give guidance on how to plan standard sensitivity analyses. Probabilistic sensitivity analysis entails assigning a distribution to parameters and simulating data sets consistent with these distributions. Briggs (40) discusses some issues in choosing appropriate distributions for these parameters.

4.6 Regression Models

Regression-based modeling can also be used to investigate costs or cost effectiveness in relation to patient characteristics. Motivation includes adjustment for baseline imbalance in randomized groups and exploratory analysis of factors associated with varying levels of costs or cost effectiveness. In the former case, the adjusted estimate of cost effectiveness is expected to be less biased because of imbalance than estimates from traditional methods. In the latter case, separate summary statistics and CEACs can be presented for different subgroups as predicted by the model. In the case of modeling costs alone, generalized linear models with gamma errors might be appropriate to deal with skewness in the data. Poisson or zero-inflated Poisson models can be used for count-based resource use data (41). For modeling cost effectiveness, the INB is much easier to deal with than the ICER because ordinary least-squares (OLS) regression can be used, although such models would need to be estimated for a range of willingness to pay values. Willan et al. (42) show how seemingly unrelated regression can be used; this method, which recognizes that different sets of covariates may need to be included for cost and for effectiveness, consists of two separate regression models linked through the individual patient error terms.

4.7 Markov Models to Extrapolate Over Time

The Markov model can be used to extrapolate results from clinical trials over much longer periods, even whole lifetimes. For example, it can be used to link an intermediate outcome (such as blood pressure) with a longer term
outcome (such as mortality). Markov models are derived from a mutually exclusive set of health states (e.g., alive or dead) with transitions among states that occur at regular intervals; each type of transition has an associated transition probability. These probabilities may be estimated from published data, for example, mortality rates. The minimum time that a patient can spend in any state determines the cycles through which the model passes, and costs and utilities of passing one cycle in each state are assigned. Expected values for costs and utilities are calculated by multiplying the proportion of patients in each state by the corresponding cost and utility values and summing across both states and cycles. Monte Carlo simulation can be employed to assign distributions to the parameters of the model, so that uncertainty in the final expected values can be assessed. Lang and Hill (43) provide an overview of modeling techniques such as these. 4.8 Examples of Modeling and Sensitivity Analysis Oostenbrink et al. (44) combine a Markov model with a probabilistic sensitivity analysis in a study of the cost effectiveness of bronchodilators. The model was structured around transitions in disease severity and the number and type of exacerbations, with cycles (after the first) being one month over the course of a year. The data for this came from trials in six centers. Dirichlet, beta, and gamma distributions were used to represent uncertainty in transitions between disease states, exacerbations, and resource use, respectively. Monte Carlo simulations of costs and effects were shown on the cost-effectiveness plane. Sensitivity analyses included changing transitions and exacerbation probabilities, as well as utilities assigned to disease states. Hoch et al. (45) show how regression can be used in conjunction with the INB to identify the marginal benefits for various subgroups of patients with particular characteristics and to adjust for baseline imbalance in a trial of a program in assertive treatment for people with severe mental illness compared with usual community services. The INB
was fitted first as a regression on arm and including the covariates of age, a functioning score, and ethnic group (the latter differing between treatment arms at baseline). This formula produced an overall CEAC adjusted for these variables. Finally, an interaction between treatment group and the covariates was included, which allowed CEACs to be presented for the black and white groups separately. REFERENCES 1. M. J. Sculpher, K. Claxton, M. Drummond, and C. McCabe, Whither trial-based economic evaluation for health care decision making? Health Econ. 2006; 15: 677–687. 2. R. S. Taylor, M. F. Drummond, G. Salkeld, and S. D. Sullivan, Inclusion of cost effectiveness in licensing requirements of new drugs: the fourth hurdle. BMJ 2004; 329: 972–975. 3. A. H. Briggs and B. J. O’Brien, The death of cost minimisation analysis? Health Econ. 2004; 10: 179–184. 4. A. Culyer, The Dictionary of Health Economics. Cheltenham, UK: Edward Elgar, 2005. 5. NICE. Guide to the Methods of Technology Appraisal (reference N0515). 2004. National Institute for Health and Clinical Excellence, UK. 6. M. R. Gold, J. E. Siegel, L. B. Russell, and M. C. Weinstein, Cost-effectiveness in health and medicine. New York: Oxford University Press, 1996. 7. M. E. Drummond and A. McGuire, Economic Evaluation in Health Care. Merging Theory with Practice. Oxford, UK: Oxford University Press, 2001, p. 236. 8. D. H. Smith and H. Gravelle, The practice of discounting in economic evaluation of health care interventions. Int. J. Technol Assess. Health Care. 2001; 17: 236–243. 9. T. N. Flynn and T. J. Peters, Conceptual issues in the analysis of economic data from cluster randomised trials. J. Health Services Res. Policy 2005; 10: 97–102. 10. M. J. Al, B. A. Van Hout, B. C. Michel, and F. F. H. Rutten, Sample size calculations in economic evaluation. Health Econ. 1998; 7: 327–335. 11. A. O’Hagan and J. W. Stevens, Bayesian methods for design and analysis of costeffectiveness trials in the evaluation of health
care technologies. Stats. Methods Med. Res. 2001; 11: 469–490. 12. A. R. Willan and A. H. Briggs, Statistical Analysis of Cost-Effectiveness Data. Chichester, UK: Wiley, 2006. 13. H. Glick, D. Polsky, and K. Schulman, Trialbased economic evaluations: an overview of design. In: M. Drummond and A. McGuire (eds.), Economic Evaluation in Health Care: Merging Theory with Practice. Oxford, UK: Oxford University Press, 2001, pp. 113–140. 14. D. Chisholm, M. R. Knapp, H. C. Knudsen, F. Amaddeo, L. Gaite, B. van Wijngaarden, Client socio-demographic and service receipt inventory--European version: development of an instrument for international research. EPSILON Study 5. Br. J. Psychiatry. 2000;supplementum(39):s28–33. 15. M. E. J. B. Goossens, M. P. M. H. Ruttenvan M¨olken, J. W. S. Vlaeyen, and S. M. J. P. van der Linden, The cost diary: a method to measure direct and indirect costs in costeffectiveness research. J. Clin. Epidemiol. 2000; 53: 688–695. 16. S. Byford, M. Leese, M. Knapp, H. Seivewright, S. Cameron, V. Jones, K. Davidson, and P. Tyrer, Comparison of alternative methods of collection of service use data for the economic evaluation of health care interventions. Health Econ. 2006. 17. H. A. Glick, S. M. Orzol, J. F. Tooley, D. Polsky, and J. O. Mauskopf, Design and analysis of unit cost estimation studies: How many hospital diagnoses: how many countries? Health Econ. 2003; 12: 517–527. 18. G. W. Torrance, Measurement of health state utilities for economic appraisals: a review. J. Health Econ. 1986; 5: 1–30. 19. M. Ryan, D. A. Scott, C. Reeves, A. Bate, E. R. van Teijlingen, E. M. Russell, M. Napper, and C. M. Robb, Eliciting public preferences for healthcare: a systematic review of techniques. Health Technol. Assess. 2001; 5: 1–186. 20. EuroQol Group, EuroQol - a new facility for the measurement of health-related quality of life. Health Policy. 1990; 16: 199–208. 21. J. Horsman, W. Furlong, D. Feeny, and G. Torrance, The Health Utilities Index (HUI(r)): concepts, measurement properties and applications. Health Quality Life Outcomes. 2003;(1). Available at: http://www.hqlo.com/content/1/1/54. 22. J. A. Barber and S. G. Thompson, Analysis of cost data in randomised controlled trials: an application of the non-parametric bootstrap. Stat. Med. 2000; 19: 3219–3236.
23. R. M. Nixon and S. G. Thompson, Parametric modelling of cost data in medical studies. Stat. Med. 2004; 23: 1311–1331.
24. D. Polsky, H. A. Glick, R. Willke, and K. Schulman, Confidence intervals for cost-effectiveness ratios: a comparison of four methods. Health Econ. 1997; 6: 243–252.
25. A. R. Willan and B. J. O'Brien, Confidence intervals for cost-effectiveness ratios: an application of Fieller's theorem. Health Econ. 1996; 5: 297–305.
26. B. J. O'Brien and A. H. Briggs, Analysis of uncertainty in health care cost-effectiveness studies: an introduction to statistical issues and methods. Stat. Methods Med. Res. 2002; 11: 455–468.
27. A. A. Stinnett and J. Mullahy, Net health benefits: a new framework for the analysis of uncertainty in cost-effectiveness analysis. Medical Decision Making 1998; 18(suppl): S68–S80.
28. A. H. Briggs, A Bayesian approach to stochastic cost-effectiveness analysis. Health Econ. 1999; 8: 257–261.
29. E. Fenwick, B. J. O'Brien, and A. Briggs, Cost-effectiveness acceptability curves: facts, fallacies and frequently asked questions. Health Econ. 2004; 13: 405–415.
30. J. L. Severens, D. E. M. Brunenberg, E. A. L. Fenwick, B. O'Brien, and M. Joore, Cost-effectiveness acceptability curves and a reluctance to lose. Pharmacoeconomics 2005; 23: 1207–1214.
31. A. Briggs, T. Clark, J. Wolstenholme, and P. Clarke, Missing . . . presumed at random: cost-analysis of incomplete data. Health Econ. 2003; 12: 377–392.
32. D. Y. Lin, E. J. Feuer, R. Etzioni, and Y. Wax, Estimating medical costs from incomplete follow-up data. Biometrics 1997; 53: 419–434.
33. A. R. Willan, D. Y. Lin, R. J. Cook, and E. B. Chen, Using inverse-weighting in cost-effectiveness analysis with censored data. Stat. Methods Med. Res. 2002.
34. G. W. Torrance, M. F. Drummond, and V. Walker, Switching therapy in health economics trials: confronting the confusion. Med. Decision Making 2003; 335–339.
35. G. Dunn, M. Maracy, and B. Tomenson, Accounting for non-compliance in clinical trials. Stat. Methods Med. Res. 2005; 14: 369–395.
36. R. J. Willke, H. A. Glick, D. Polsky, and K. Schulman, Estimating country-specific cost-effectiveness from multinational clinical trials. Health Econ. 1998; 7: 481–493.
37. S. G. Thompson, R. M. Nixon and R. Grieve, Addressing the issues that arise in analysing multicentre cost data, with application to a multinational study. J. Health Econ. 2006; 25: 1015–1028.
38. M. Drummond and F. Pang, Transferability of economic evaluation results. In: M. Drummond and A. McGuire (eds.), Economic Evaluation in Health Care: Merging Theory with Practice. Oxford, UK: Oxford University Press, 2001.
39. D. Walker and J. Fox-Rushby, How to do (or not to do) . . . Allowing for uncertainty in economic evaluations: sensitivity analysis. Health Policy Planning 2001; 16: 435–443.
40. A. Briggs, Probabilistic analysis of cost-effectiveness models: statistical representation of parameter uncertainty. Value in Health 2005; 8: 1–2.
41. A. M. Jones, Applied Econometrics for Health Economists – A Practical Guide. London: Office of Health Economics, 2001.
42. A. R. Willan, A. H. Briggs, and J. S. Hoch, Regression methods for covariate adjustment and subgroup analysis for non-censored cost-effectiveness data. Health Econ. 2004; 13: 461–475.
43. D. Lang and S. R. Hill, Use of pharmacoeconomics in prescribing research. Part 5: modelling – beyond clinical trials. J. Clin. Pharm. Therapeut. 2003; 28: 433–439.
44. J. B. Oostenbrink, M. P. M. H. Rutten-van Mölken, B. U. Monz, and J. M. FitzGerald, Probabilistic Markov model to assess the cost-effectiveness of bronchodilator therapy in COPD patients in different countries. Value in Health 2005; 8: 32–46.
45. J. S. Hoch, A. H. Briggs, and A. R. Willan, Something old, something new, something borrowed, something blue: a framework for the marriage of health econometrics and cost-effectiveness analysis. Health Econ. 2002; 11: 415–430.
FURTHER READING

A registry of published cost-effectiveness analyses and preference weights for various diseases is available at the Tufts–New England Medical Center Cost-Effectiveness Registry: http://www.tufts-nemc.org/CEARegistry/. The web page of the International Society for Pharmacoeconomics and Outcomes Research gives
an extremely useful summary of national guidelines for health evaluation, including information on discount rates, sources of cost information, preferred method of analysis, and so on: http://www.ispor.org/PEguidelines/index.asp. Unit costs for the US, the UK (the Department of Health), and internationally (the World Health Organisation's WHO-CHOICE) may be found from the following websites, which also give general guidance and/or information on publications about cost-effectiveness analysis:
http://www.cms.hhs.gov
http://www.pssru.ac.uk
http://www.who.int/choice
Purchasing Power Parities (PPPs) may be found from the Organisation for Economic Co-operation and Development website: http://www.oecd.org.
CROSS-REFERENCES

cost benefit analysis
bootstrapping
treatment by center interaction
quality of life
intention to treat
COVARIATES
WILLI SAUERBREI
Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, Freiburg, Germany

1 UNIVERSAL CHARACTER OF COVARIATES

In clinical trials, covariates are used for many purposes. They are often called explanatory variables, predictor variables, independent variables, prognostic variables, or prognostic factors. Covariates can be used in all phases of clinical studies, including defining the patient population, calculating the required sample size, allocating treatment in randomized and nonrandomized trials, adjusting the estimate of a treatment effect, defining subgroups considered for investigations of interactions with treatment, helping in the biological understanding of a disease and its potential treatment, and developing prediction rules. Covariates can be demographic factors such as country or race, or biological and genetic factors such as sex or blood group, or factors that may change over time such as blood pressure or tumor size, or characteristics of patients that are relevant for the course of the disease and the effect of a treatment. Covariates are measured on different scales, such as binary (sex), categorical (histologic tumor type), ordered categorical (tumor grade), or continuous (age, height). Because of their universality and general applicability, covariates play a central role in clinical research, in randomized controlled trials (RCT), and in nonrandomized trials (NRT). Beyond this, covariates are used in medical decision making and health policy decisions (see 2.3). The measurement of covariates can be simple, cheap, and yield high data quality (e.g., sex), but it can also be very complicated, expensive, and imperfect. The scale of the measurement and the quality of the data can have a serious influence on the use and usefulness of covariates for the analysis of clinical data.

The overwhelming majority of clinical literature concentrates on investigating the prognostic effect of covariates in observational studies, and for almost every disease a large number of prognostic factors are examined. Statistical methods used for the analysis vary considerably, and the statistical community is far from a consensus as to the most sensible approach. (Some of these issues are discussed in the separate article on prognostic factor studies.) In the context of prognostic factor studies in oncology, several analysis strategies are discussed in Schumacher et al. (1) and McShane & Simon (2). Differences in approach, assumptions, and results are illustrated. The severe weaknesses in reporting methods and results of prognostic factor studies are discussed by Riley et al. (3), and several resulting difficulties of a systematic summary of the effect of a prognostic factor are discussed by Altman (4). When analyzing a randomized trial, the discussion centers around the advantages and disadvantages of adjusting the treatment effect for covariates and also the proper selection of covariates for this purpose. Some consensus has been reached on these issues, and the evolving recommendations from various groups are available (5–8). Aside from the overall results of a RCT, clinicians are interested in whether the treatment effect is similar in all patients or whether the effect is heterogeneous in specific subgroups of patients. Finding subgroups for which a treatment is more effective could be an important step toward the aim of individualizing treatments, a topic of growing interest. Specifically, as our ability to measure and to understand the biological mechanisms of gene expression data becomes more advanced, the definition of prognostic subgroups has become one of the most promising areas of research. However, a review in four major medical journals in 1997 demonstrated severe problems in subgroup analyses (9). Most trials presented baseline comparability in unduly large tables, and about half the trials inappropriately used significance tests for baseline comparison.
Two-thirds of the reports presented subgroup findings, but mostly without appropriate statistical tests for interaction. Many reports put too much emphasis on subgroup analyses that commonly lacked statistical power. These studies were conducted before gene expression data became available in trials; although the enormous amount of new covariate-type data offers new possibilities and hope for improving prediction and treatment of patients, it also poses new challenges to the design and analysis of clinical trials. The broad range of possible uses and different types of issues, such as the controversies surrounding gene expression data, are beyond the scope of this discussion; instead, we focus on some of the most important aspects and relevant applications. To this end, we will implicitly assume the situation of a phase III trial or a large observational study. We consider different uses of covariates in controlled clinical trials, where they play a central role in all parts of the trial: for example, definition of the patient population, RCT treatment allocation, adjustment of the estimate of the treatment effect, and definition of subgroups for the investigation of interaction. Continuous covariates also need specific attention; in application, they are often changed to a categorical form. Simplicity is an important reason for categorization, but categorization introduces several critical issues. Finally, we will examine some issues of reporting and summary assessment of prognostic markers. It will become obvious that substantial improvements are required for a better use of marker data in clinical trials.

2 USE OF COVARIATES IN CLINICAL TRIALS
2.1 Patient Population

To select an adequate population in the planning phase of a new clinical trial, knowledge about the effect of covariates is required. Prospective randomized trials are usually designed for risk-adapted patient populations where covariates define the main inclusion criteria. Exclusion criteria are applied to exclude patients with extreme values or rare conditions. Age is a specific covariate that is
often used to define a population by excluding young (e.g., below 18 years) or old (e.g., above 70 years) patients. In recent years, several trials have been designed specifically to investigate treatments in older patients. With more specific knowledge about treatment responsiveness, covariates are used to restrict the trial population to patients who are more likely to profit from the therapy. In most recent trials investigating hormonal therapies in breast cancer, for example, patients must have a positive estrogen receptor status. In the design of a new trial, investigators must find a compromise between internal and external validity. Internal validity aims to eliminate the possibility of biases in the design and conduct of the trial, whereas external validity considers the extent to which the results of a trial provide a correct basis for generalizations to other circumstances. For eligibility, consideration should be given both to patients who are likely to benefit from treatment and to the desired generalizability of the results. If the eligibility criteria are very narrow, the generalizability of the study is compromised; if they are too broad, the effectiveness of treatment may be masked by the inclusion of patients who have little chance of responding (10). In this respect, covariates can also play a central role in deciding whether a trial is useful and whether it is ethical. The targeted population may be too small for a comparative trial; for example, a study of early stage disease would require a very large sample size because the expected event rates in a reasonable time period are low. Another problem arises when the knowledge about risk factors is insufficient to identify which patients have a severely increased risk; only those patients would be suitable candidates for inclusion in the population for a prevention trial (11).

2.2 Treatment Allocation in a RCT

Covariates play an important role in allocating treatment to patients in a RCT. Usually, allocation is implemented in one of two ways. Most often used is preallocation: a randomization list is prepared in advance, giving a new patient the next treatment on the list.
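The preallocation approach can be made concrete with a small sketch. The code below is an illustration only, not tied to any particular trial system; the block size of four and the seed are arbitrary choices made here. It generates a randomization list from permuted blocks, each containing two allocations to treatment A and two to treatment B; in practice the block size would be concealed from the investigators, and, as discussed next, a separate list would be prepared for each stratum when covariates are used for stratified randomization.

```python
# Illustration only: a preallocated randomization list built from permuted
# blocks of four, each block containing two allocations to treatment A and
# two to treatment B.
import random

def permuted_block_list(n_blocks, block=("A", "A", "B", "B"), seed=2024):
    rng = random.Random(seed)
    allocations = []
    for _ in range(n_blocks):
        b = list(block)
        rng.shuffle(b)          # random order of treatments within the block
        allocations.extend(b)
    return allocations

# The next recruited patient simply receives the next entry on the list.
randomization_list = permuted_block_list(25)
print(randomization_list[:8])
```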
To use covariates for a stratified randomization, different lists are required for each stratum. Also common is a form of postallocation: the patient is recruited and entered into the trial, then—possibly after covariate information has been taken into account and perhaps using some automated telephone or Web-based system—the investigator is told the treatment or (in a double-blind trial) the number of the pack to be given to the patient (12). To achieve balance with respect to the covariates suspected to have an influence on the outcome, it is common practice to use them as stratification factors in the randomization or to use dynamic techniques such as minimization (13) or biased coin designs (14). Often, the study center may be the first stratification factor used. In trials with a large number of participating centers, however, practicality may argue against using further stratification criteria because exact balance becomes impossible if several centers enter only a small number (say, less than five) patients. This is also the case when several covariates are used as stratification factors, resulting in a large number of strata (15). Too many strata also incur an organizational cost; they may complicate the process of treatment allocation when it becomes necessary to have the information on covariates available at the time of randomization (7). Randomized blocks is the most commonly used technique. Using either preallocation or postallocation, this approach ensures that the treatments are allocated in the chosen proportions for each block. The block size should be unknown to the investigators. Because it employs a separate list for each stratum, it can balance for covariates only by stratification. In general, it is advisable to
incorporate the stratification criteria in the analysis, although exceptions may be made if the strata become very small and the strata are not themselves predictive of the outcome (7). Another popular technique is covariate-adaptive randomization (16), of which minimization (13) seems to be used most often. However, Senn (12) questions whether all such schemes really involve randomization and refers to this approach by the more general term "allocation based on covariates" (ABC). The patient must be entered into the trial and the covariate values must be reported first, before the central trial office can announce the allocation, so only the method of postallocation is suitable, using either telephone or (nowadays) Web-based allocation. A common form of ABC is "minimization," an attempt to balance trials as well as possible by using prognostic information. Senn (12) discusses the advantages and disadvantages of ABC schemes and points out that their advantages have been overstressed in the literature. He argues that the decision to adjust for prognostic factors has a far more dramatic effect on the precision of the inferences than the choice of an ABC or randomization approach.

2.3 Adjustment of the Estimate of a Treatment Effect in a RCT

To demonstrate the importance of stratification or adjustment for covariates in estimating a treatment effect in a randomized trial, we use a well-known and extreme theoretical example for the analysis of survival time data from Peto et al. (17). Data are given in Table 1. The univariate analysis gives a P-value of 0.252 for the effect of treatment when using the logrank test. The P-value for the effect of the prognostic factor is <0.0001.
Table 1. Hypothetical Data of Patients Treated With Either A or B and One Prognostic Factor

Treatment   Prognostic factor   Survival times
A           normal              8, 220, 365+, 852+, 1296+, 1328+, 1460+, 1976+
A           elevated            8, 52, 63, 63
B           normal              70, 76, 180, 195, 210, 632, 700, 1296, 1990+, 2240+
B           elevated            13, 18, 23

Note: +: censored observation. Source: Peto et al. Br J Cancer. 1977; 35: 1–39.
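The analyses of these data reported next can be reproduced with any Cox regression software. The following minimal sketch assumes the Python lifelines package (an assumption about available tooling, not part of the original source) and codes treatment B and an elevated prognostic factor as 1, following the reconstruction of Table 1 above.

```python
# Sketch only: unadjusted, covariate-adjusted, and stratified Cox models for
# the hypothetical data of Table 1 (treatment B coded 1, elevated factor coded 1).
import pandas as pd
from lifelines import CoxPHFitter

rows = []
def add(trt_b, elevated, times, censored):
    for t in times:
        rows.append({"time": t, "event": 0 if t in censored else 1,
                     "trt_B": trt_b, "elevated": elevated})

add(0, 0, [8, 220, 365, 852, 1296, 1328, 1460, 1976], {365, 852, 1296, 1328, 1460, 1976})
add(0, 1, [8, 52, 63, 63], set())
add(1, 0, [70, 76, 180, 195, 210, 632, 700, 1296, 1990, 2240], {1990, 2240})
add(1, 1, [13, 18, 23], set())
df = pd.DataFrame(rows)

# Unadjusted: treatment only.
CoxPHFitter().fit(df[["time", "event", "trt_B"]], "time", "event").print_summary()
# Adjusted: treatment plus the prognostic factor.
CoxPHFitter().fit(df, "time", "event").print_summary()
# Stratified by the prognostic factor.
CoxPHFitter().fit(df, "time", "event", strata=["elevated"]).print_summary()
```

If the table has been reconstructed correctly, the hazard ratios for trt_B from the three fits should be close to the values quoted in the text that follows.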
Stratifying for the prognostic factor, the logrank test for treatment indicates an effect with a P-value of 0.016. Assuming proportional hazards, the estimated hazard ratio for treatment is 1.77 (95% confidence interval, 0.65–4.81) in a univariate Cox model, but it increases to 3.47 (1.07–11.22) when adjusting for the prognostic factor and even to 4.32 (1.19–15.75) in a stratified Cox model. For a more realistic, but still hypothetical, example demonstrating the effect of imbalance of a prognostic variable on the estimated treatment effect, see Altman (18). Randomized controlled trials are the gold standard for the comparison of two or more treatments because they promote comparability among the study groups. In an observational study, such comparability can only be attempted by adjusting for or matching on known covariates, with no guarantee or assurance, even asymptotically, of control for other baseline patient characteristics. Even for unknown important covariates, randomization ensures comparability with a high probability, and it provides a probabilistic basis for an inference from the observed results when considered in reference to all possible results (16). However, it is obvious that some imbalance between baseline covariates will often remain after randomization. To describe the patient population and to demonstrate baseline comparability between treatment groups, most trials present covariate information in a table. These tables are often unduly large, and Assmann et al. (9) found that about half the trials inappropriately used significance tests for baseline comparison. In randomized trials, treatment groups are indeed random samples, and any differences observed between them are necessarily due to chance, so the use of hypothesis tests is highly questionable. The only use of significance testing for baseline differences is to check the randomization process; however, there is only minimal power for it (12, 19). As demonstrated with the introductory example, stratification or adjustment of the treatment estimate for covariates is an important issue involving several elements. If a covariate has a strong prognostic effect,
adjustment can be quite important even if the imbalance between groups is only small. Such imbalance can work in either direction, masking or overstating the true treatment difference (19). Reducing the variance of the estimated treatment effect by adjusting for covariates correlated with outcome is a further important statistical issue of the adjustment for baseline covariates. Usually, most of the arguments are discussed in the context of a linear model and later transferred to generalized linear models or models for survival data. For the latter models, some issues differ; however, most practical consequences are similar (20–23).

2.3.1 General Recommendations. Concerning issues of which covariates to select and when to select them, some consensus was reached by the Committee for Proprietary Medicinal Products (6). Their "Points to Consider" comprised the following recommendations (here, numbered Points 1 to 13). Not all statisticians may agree to all of the following points, so acceptance by the statistical community is still under debate. However, these points were meant to clarify and harmonize issues on the use of baseline covariates in clinical trials:

P1. Stratification may be used to ensure balance of treatments across covariates; it may also be used for administrative reasons. The factors that are the basis of stratification should normally be included as covariates in the primary model.

P2. Variables known a priori to be strongly, or at least moderately, associated with the primary outcome and/or variables for which there is a strong clinical rationale for such an association should also be considered as covariates in the primary analysis. The variables selected on this basis should be prespecified in the protocol or the statistical analysis plan.

P3. Baseline imbalance observed post hoc should not be considered an appropriate reason for including a variable as covariate in the primary analysis.
P4. Variables measured after randomization and so potentially affected by the treatment should not normally be included in the primary analysis.

P5. If a baseline value of a continuous outcome measure is available, then this should usually be included as a covariate.

P6. Only a few covariates should be included in a primary analysis.

P7. In the absence of prior knowledge, a simple functional form (usually either linearity or dichotomizing a continuous scale) should be assumed for the relationship between a continuous covariate and the outcome variable.

P8. The validity of model assumptions must be checked when assessing the results.

P9. Whenever adjusted analyses are presented, results of the treatment effect in subgroups formed by the covariates (appropriately categorized, if relevant) should be presented to enable an assessment of the validity of the model assumptions.

P10. Sensitivity analyses should be preplanned and presented to investigate the robustness of the primary results. Discrepancies should be discussed and explained.

P11. The primary model should not include treatment by covariate interactions. If substantial interactions are expected a priori, the trial should be designed to allow separate estimates of the treatment effects in specific subgroups.

P12. Exploratory analyses may be carried out to improve the understanding of covariates not included in the primary analysis.

P13. A primary analysis, unambiguously prespecified in the protocol or statistical analysis plan, correctly carried out and interpreted, should support the conclusions that are drawn from the trial. Since there may be a number of alternative valid analyses, results based on prespecified analyses will carry most credibility.
2.3.2 Selection of Covariates for Adjustment. In many diseases, a long list of potentially important covariates is considered when a clinical trial is being designed and analyzed. There is general agreement that only a few covariates should be included in a primary analysis (P6). However, it is less obvious how to select these covariates. For linear models with continuous outcomes, some methodological arguments are presented to guide selection of specific covariates (7). Covariates for which adjustment is to be carried out should be identified at the design stage and specified in the study protocol (P2). During the analysis phase, variable selection procedures should not be used to select covariates. Analysis strategies based on preliminary tests of covariate imbalance reduce the power of the analysis to detect a treatment difference where such a difference exists. Post hoc tests of covariates associated with outcome have a different danger in that they may bias the estimates of the treatment effect. If, by chance, a prognostic factor is severely imbalanced, problems concerning the best way to analyze and to interpret the results may appear. Adjusting or not for this factor should hardly influence the estimate of the treatment effect if the prognostic value is only weak. Fortunately, the critical case is rare in which a strong prognostic factor is a priori not considered for the final analysis and, at the same time, is severely imbalanced between treatment groups. Presenting sensitivity analyses and interpreting the results carefully are the only general recommendations for these cases. Covariates specified a priori for adjustment are often based on clinical knowledge and the expected distribution in the trial. For the linear regression model, a more formal approach for the choice of covariates to include in the analysis plan for a clinical trial is presented by Raab et al. (7). From the analysis of previous trials, they need to obtain the multiple correlation between the outcome and each potential set of covariates considered for inclusion, as well as the number of patients planned for inclusion in the current trial. Where practical, the design should aim to balance these factors across treatments.
2.3.3 Binary or Survival Outcomes. Unfortunately, some aspects are known to be different for randomized studies with binary and survival outcomes, usually analyzed in the framework of a logistic regression model or a Cox regression model, respectively (7). In particular, in the Cox model, the omission of a balanced covariate leads to a mixture model that is no longer a proportional hazards model (20) and results in a downward bias of the treatment effect estimate (21, 22). The same is true in the situation when the covariate is not completely omitted from the model but rather is included in a categorized form (23). Furthermore, the variance of the treatment effect estimate in an analysis omitting a balanced covariate is smaller than in the adjusted analysis. Despite the gain in precision due to covariate omission, there is a loss of power due to an underestimation of the treatment effect, which is much more important. Therefore, it is recommended to adjust for important prognostic factors even in randomized treatment comparisons. However, in the case of the Cox model, this recommendation is based on totally different arguments than in the case of the classic linear model. The practical consequences of adjusting for a prognostic covariate may be similar to the linear case (23). Robinson and Jewell (24) showed that the asymptotic power of the test for the treatment effect in logistic regression is always increased by adjusting for a prognostic covariate. This is due to an increase of the size of the adjusted treatment effect rather than to a reduction of its standard error.

2.3.4 Examples Demonstrating Influence of Adjustment. There are several published RCTs showing that adjustment or stratification for covariates can have a severe influence on the result. In a randomized trial, the Canadian Clinical Trials Group compared two adjuvant chemotherapy regimens in 716 postmenopausal women with early breast cancer (25). They used a stratified block randomization procedure with three stratification factors. Treatments were well balanced in these three factors. However, slight imbalances were observed in two other important prognostic factors. Comparing the overall survival for the two treatments based
on the logrank test gave a statistically nonsignificant P-value of 0.11, ignoring the covariates. A stratified logrank test considering the three factors used for randomization gave a P-value of 0.034. The P-value was additionally reduced to 0.028 by adjusting the stratified analysis for the two further factors observed to be slightly imbalanced (8, 26). The analysis using the three prespecified randomization criteria is certainly sensible; however, the value of the analysis including the two additional factors is questionable because these factors were determined in a data-dependent way (P3). Christensen et al. (27) present another example showing the influence of adjustment in a randomized trial. The unadjusted analysis comparing azathioprine versus placebo in patients with primary biliary cirrhosis gave P = 0.2. There was some imbalance in serum bilirubin, which is a very strong prognostic variable in such patients. The azathioprine group had higher levels on average and hence a worse prognosis. Using a stratified logrank analysis, the P-values for therapy varied with the number of strata but were always smaller compared with the P-value from the unadjusted analysis. With one, four, and eight strata, the P-values were 0.1, 0.07, and 0.05, respectively. This illustrates the increased power of a stratified analysis, but it also shows that for a highly prognostic covariate it might not be feasible to have enough strata to remove the effect of imbalance. Adjusting the treatment effect by use of regression modeling is then preferable (18).

2.4 Definition of Subgroups and Investigation of Interactions

Covariates are often used to define subgroups and to investigate whether treatment effects are heterogeneous in subgroups of the patient population. The statistical term for heterogeneity of this type is interaction. In a series of short statistical notes, Matthews and Altman (28–30) illustrated that the approach of analyzing data separately in each subgroup is incorrect. They show in two simple examples that it is important to compare effect sizes and not P-values between subgroups. In an assessment of 50 reports from randomized clinical trials, Assmann et al. (9)
found that two thirds of the reports presented subgroup findings without appropriate statistical tests for interaction. Many reports put too much emphasis on subgroup analyses, which commonly lack statistical power. Many trials confined subgroup attention to just one baseline factor, but five trials examined more than six factors; often, subgroup differences were studied for more than one endpoint. Less than half of subgroup-analysis reports used statistical tests of interactions, which directly examine whether the treatment difference in an outcome depends on the patient's subgroup. Most other reports relied instead on P-values for the treatment difference in each separate subgroup. However, because trials are usually powered to detect an overall effect, subgroup analysis must suffer from a lack of power. Assmann et al. (9) stated that, of all the various multiplicity problems in clinical trials, subgroup analysis remains the most overused and overinterpreted. Even though these problems have been elucidated (31, 32), trial reports continue to employ many subgroup analyses. This reflects an intellectually important issue: real treatment differences may depend on certain baseline characteristics. Thus, such data exploration is not bad on its own, provided that investigators (and readers) consider it as hypothesis generation only. Because of low power, researchers are inevitably left in doubt as to whether a suggestive result from a subgroup analysis (e.g., with a P-value of the test of interaction between P = 0.05 and P = 0.15) is simply due to chance or merits further investigation. Even after reporting an interaction that is not statistically significant, investigators may still overinterpret subgroup findings. As stated in P11 of the "Points to Consider," treatment by covariate interactions should not be included in a primary model, but possible expected interactions should be specified a priori. Obviously, this affects interpretation of a trial result. An example with a strong interaction in a prespecified subgroup analysis was reported in Jonat et al. (33). In a large randomized trial for adjuvant treatment of premenopausal breast cancer patients, chemotherapy was
compared with hormone therapy. The primary analysis of disease-free survival for all 1614 patients showed that chemotherapy was superior to hormone therapy (hazard ratio 1.18, P = 0.029). However, in a prespecified analysis, a highly statistically significant interaction between treatment and estrogen-receptor status was observed (P = 0.0016). Therefore, the planned estrogen-receptor subgroups were considered separately. There was a large advantage for the chemotherapy arm in the small (304 patients) estrogen-receptor–negative group (hazard ratio 1.76, P = 0.0006), but equivalence for the two treatments (hazard ratio 1.01; 95% confidence interval, 0.84–1.20) in the larger estrogen-receptor–positive subgroup. Investigations of a treatment interaction with a continuous covariate add another difficulty. In general, the covariate is categorized, which raises issues as to the number of categories (most often two are used) and the selection of a cutpoint. Some issues discussed below regarding categorization or determination of the functional form for a continuous covariate transfer to the investigation of interactions. To avoid a substantial loss of information by categorizing a continuous variable in two or three groups, several approaches have been proposed recently (34, 35). For specific examples, they have demonstrated their usefulness, but they all need further investigation of their properties.

3 CONTINUOUS COVARIATES: CATEGORIZATION OR FUNCTIONAL FORM?

3.1 Categorization

Often variables are measured in a continuous form, but changing them into a categorical form is common practice in many analyses. An important reason is that categorization makes it easier for clinicians to use covariate information in making treatment decisions (36). Another reason is that the functional form of the influence of a covariate is often unknown, and an analysis based on categorization is easier. Further, categorization allows graphical presentation (e.g., in survival analysis, Kaplan-Meier curves can be presented).
In particular, values of a variable are frequently divided into two groups. Categorization enables researchers to avoid strong assumptions about the relation between the marker and risk. However, this comes at the expense of throwing away information. The information loss is greatest with only two groups, although this approach is very common. Often splitting at the sample median is used (37). It is well known that the results of analyses can vary if different cutpoints are used for splitting. In recent years, there has been increasing interest in investigating various cutpoints and choosing the one that corresponds to the most statistically significant correlation with outcome. In other words, the cutpoint defining "low" and "high" risk is chosen such that it minimizes the P-value relating the prognostic factor to outcome. The cutpoint so chosen is often termed "optimal," but this description is inadvisable because of the well-known problem of multiple testing. When a series of statistical tests, each with a prespecified nominal type 1 error—for example, 5%—is performed on the same data, this procedure leads to a global error rate for the whole procedure that might be much higher than 5%. Theoretical arguments (38, 39) and results from simulation studies (37, 40) demonstrate that the false-positive rate can be inflated to values exceeding 40% when a nominal level of 5% is used. Altman et al. (37) preferred to call this procedure the "minimum P-value approach." They provide a correction, valid for large sample sizes, of the minimal P-value to allow for the multiple testing. For practical applications, a simple approximation for the corrected P-value is given by

\[
P_{\mathrm{cor}} \approx -1.63\, P_{\min} (1 + 2.35 \log_e P_{\min}) \quad \text{for } \varepsilon = 10\%,
\]
\[
P_{\mathrm{cor}} \approx -3.13\, P_{\min} (1 + 1.65 \log_e P_{\min}) \quad \text{for } \varepsilon = 5\%,
\]

where Pmin denotes the minimum P-value of the logrank statistic, taken over the selection interval characterized by the proportion ε of the smallest and largest values of the prognostic factor that are not considered as potential cutpoints. This approximation works well
for small minimum P-values (0.0001 < Pmin < 0.1). For example, to reach a value Pcor = 0.05 requires Pmin = 0.002 when ε = 10% and even Pmin = 0.001 when ε = 5%. Using the correction formula above, the corrected P-value for the optimal cutpoint was 0.403 in an example investigating the prognostic value of S-phase fraction (SPF) in breast cancer patients, in contrast to the minimum P-value of 0.037, which would indicate a statistically significant influence of SPF (37). The "optimal cutpoint approach" leads to other serious problems. Most likely, different cutpoints will be termed "optimal" in different studies or subpopulations. Analyzing the data in multivariate analysis may further change the optimal cutpoint. This was illustrated by Altman et al. (37). Results of an analysis based on the optimal cutpoint may show impressive prognostic effects, but it must be kept in mind that the effect can be heavily overestimated. For a discussion of this issue and possible ways to correct for it, see Schumacher et al. (41) and Holländer et al. (42). For a more detailed discussion on categorizing a continuous covariate, see Mazumdar and Glassman (36) and Mazumdar et al. (43). Categorizing a continuous covariate into a small number of groups or having an ordered categorical variable with at least three groups raises the issue of how to code the variable for the analysis. The importance of this coding is often underrated in analyses. Using a test for trend, the specific choice of scores (e.g., 1, 2, 3 or 1, 2, 4, or scores that are the median of each category) used for the coding has an influence on the P-value. Based on the coding, a covariate may be statistically significant or not. Using dummy coding for an ordered variable with several groups, the coding may reflect the ordered nature of the variable, or it may only reflect a categorized variable ignoring the ordering. When variable selection strategies are used to derive a final multivariable model, the coding can also have a substantial influence. A model that does not reflect the true influence of a variable may be selected if coding does not reflect the scale of the covariate.
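To make the cutpoint correction given earlier in this section concrete, it can be coded directly; the following sketch uses nothing beyond the two approximations quoted above and approximately reproduces the S-phase fraction example.

```python
# Direct implementation of the approximate correction for the minimum
# P-value approach, using only the formulas quoted above.
import math

def p_corrected(p_min, epsilon=0.10):
    """Approximate corrected P-value after an 'optimal' cutpoint search.

    epsilon is the proportion of smallest and largest covariate values
    excluded as potential cutpoints (formulas are given for 10% and 5%).
    """
    if epsilon == 0.10:
        return -1.63 * p_min * (1 + 2.35 * math.log(p_min))
    if epsilon == 0.05:
        return -3.13 * p_min * (1 + 1.65 * math.log(p_min))
    raise ValueError("coefficients are available only for epsilon = 10% or 5%")

# S-phase fraction example from the text: a minimum P-value of 0.037
# corresponds to a corrected P-value of roughly 0.4.
print(round(p_corrected(0.037, 0.10), 3))
```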
It has to be stressed that categorizing continuous covariates often has serious disadvantages and should be avoided. For the development of a prognostic model, we propose the better approach of working with continuous covariates.

3.2 Determination of the Functional Relationship

Traditionally, for continuous covariates, either a linear functional relationship or a step function after grouping is assumed. Problems of the latter approach are issues about defining cutpoints, overparameterization, and loss of efficiency (some of which we have just discussed). In any case, a cutpoint model is often an inadequate way to describe a smooth relationship. The assumption of linearity may be incorrect, leading to a misspecified final model in which a relevant variable may not be included (e.g., because the true relationship with the outcome is non-monotonic) or in which the assumed functional form differs substantially from the unknown true form. An alternative approach is to keep the variable continuous and to allow some form of nonlinearity. Instead of using quadratic or cubic polynomials, a more general family of parametric models has been proposed by Royston and Altman (44) that is based on so-called fractional polynomial (FP) functions. Here, usually one or two terms of the form X^p are fitted, the exponents p being chosen from a small preselected set S of integer and noninteger values. Although only a small number of functions is considered (besides no transformation, p = 1, the set S includes seven transformations for FPs of degree 1 (FP1) and 36 for FPs of degree 2 (FP2)), FP functions provide a rich class of possible functional forms leading to a reasonable fit to the data in many situations. Royston and Altman (44) dealt mainly with the case of a single predictor, but they also suggested and illustrated an algorithm for fitting FPs in multivariable models.
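As an illustration of the first-degree FP idea, the following sketch searches the conventional FP1 power set and keeps the best-fitting power, with p = 0 denoting the log transformation. It is a sketch only, not the MFP algorithm itself, and it fits by ordinary least squares for simplicity (an assumption made here); in prognostic factor applications the same search would be embedded in a Cox or logistic model.

```python
# Sketch of a first-degree fractional polynomial (FP1) search using ordinary
# least squares for illustration only.
import numpy as np

FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)    # conventional FP1 power set S

def fp1_transform(x, p):
    # p = 0 conventionally denotes the log transformation.
    return np.log(x) if p == 0 else x ** p

def best_fp1(x, y):
    best_power, best_rss = None, np.inf
    for p in FP_POWERS:
        design = np.column_stack([np.ones_like(x), fp1_transform(x, p)])
        coef, rss, *_ = np.linalg.lstsq(design, y, rcond=None)
        rss = rss[0] if len(rss) else np.sum((y - design @ coef) ** 2)
        if rss < best_rss:
            best_power, best_rss = p, rss
    return best_power

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 200)                    # positive covariate, as FPs require
y = 3 * np.log(x) + rng.normal(0, 0.5, 200)    # true relationship is logarithmic
print(best_fp1(x, y))                          # typically selects p = 0 (log)
```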
By combining backward elimination with the search for the most suitable FP transformation for continuous predictors, Sauerbrei and Royston (45) propose modifications to this multivariable FP (MFP) procedure. A further extension of the MFP procedure aims to reflect basic knowledge of the types of relationships to be expected between certain predictors and the outcome. In their study on prognostic factors for patients with breast cancer, Sauerbrei et al. (46) illustrated this approach in a simple way by comparing results with multivariable model building by either assuming linearity for continuous factors or by working with categorized data. Stability investigations with bootstrap resampling have shown that MFP can find stable models, despite the considerable flexibility of the family of FPs and the consequent risk of overfitting when several variables are considered (47).

4 REPORTING AND SUMMARY ASSESSMENT OF PROGNOSTIC MARKERS

For many diseases, a large number of prognostic markers can be considered. Systematic reviews and meta-analytical approaches to identifying the most valuable prognostic markers are needed because conflicting evidence relating to markers is usually published across a number of studies. Obviously, this requires a number of important steps, starting with systematic reports of all studies. For randomized trials, the CONSORT (Consolidated Standards of Reporting Trials) statement (48) was developed and has improved reporting. Weaknesses of the original version were addressed in a revised version (49, 50). Covariates play only a minor role in this document, which requests that baseline demographic and clinical characteristics of each treatment group should be reported. Furthermore, details should be given for subgroup analyses and adjusted analyses. Covariates are not mentioned in the revised template of the CONSORT diagram. Reporting of prognostic factor studies is hardly addressed in the literature, and a meta-analytic approach to assess the prognostic value of a marker is still an exception. To investigate the practicality of such an approach, an empirical investigation of a systematic review of tumor markers for neuroblastoma was performed by Riley et al. (3). They identified 260 studies of prognostic markers, which considered 130 different
markers. They found that the reporting of these studies was often inadequate, in terms of both statistical analysis and presentation. There was also considerable heterogeneity for many important clinical/statistical factors. These problems restricted both the extraction of data and the meta-analysis of results from the primary studies, limiting the feasibility of the evidence-based approach. Recently, reporting recommendations for tumor marker prognostic studies (REMARK) were published by an NCI-EORTC Working Group on Cancer Diagnostics (51). The investigators suggested guidelines to provide relevant information about the study design, preplanned hypotheses, patient and specimen characteristics, assay methods, and statistical analysis methods. In addition, the guidelines suggested helpful presentations of data and important elements to include in reports. The goal of these guidelines is to encourage transparent and complete reporting so that the relevant information will be available to others to help them evaluate the usefulness of the data and understand the context in which the conclusions apply (51). It is hoped that the quality of primary studies and of their reporting will improve so that clear conclusions and policy recommendations about prognostic markers can be formed. Altman and Lyman (52) have proposed guidelines for both conducting and evaluating prognostic marker studies, including the need for prospective registration of studies. Alongside these, Riley et al. (3) have developed simple guidelines on how to report results to facilitate both interpretation of individual studies and the undertaking of systematic reviews, meta-analysis, and, ultimately, evidence-based practice. They recommend that research groups collaborate and identify the most commonly used prognostic tools in current practice so that adjusted results of new prognostic marker studies can use consistent sets of adjustment factors. The availability of full individual patient data, including all markers considered, is the most viable way to produce valid and clinically useful evidence-based reviews and meta-analyses. Individual patient data would limit the large problems of poor and heterogeneous reporting that have been observed, and also reduce the potential impact of reporting bias.
Prospective registration of studies alongside the availability of individual patient data would also help restrict the potential for publication bias. These guidelines all point to researchers working together toward planned pooled analyses, currently a particularly important concept for epidemiological research (53). However, summary assessments of prognostic factor studies based on individual patient data are still an exception. Concerning the current situation, Altman (4) lists several problems with systematic reviews of prognostic studies from publications:
Studies are difficult to identify.
Negative (nonsignificant) results may not be reported (publication bias).
Reporting of methods is inadequate.
Study designs vary.
Most studies are retrospective.
Inclusion criteria vary.
Quality assessments lack recognized criteria.
Studies employ different assays/measurement techniques.
Methods of analysis vary.
Studies use differing methods of handling of continuous variables (some data dependent).
Studies use different statistical methods of adjustment.
Studies adjust for different sets of variables.
Quantitative information on outcome is inadequately reported.
Presentation of results varies (e.g., survival at different time points).
This list clearly underlines the necessity of individual patient data for a sensible summary of prognostic factor studies.

REFERENCES

1. M. Schumacher, N. Holländer, G. Schwarzer, and W. Sauerbrei, Prognostic factor studies. In: J. Crowley and D. P. Ankerst (eds.), Handbook of Statistics in Clinical Oncology, 2nd ed. Boca Raton, FL: Chapman & Hall/CRC Press, 2006, pp. 289–333.
COVARIATES 2. L. McShane and R. Simon, Statistical methods for the analysis of prognostic factor studies. In: M. K. Gospodarowicz, D. E. Henson, R. V. P. Hutter, B. O’Sullivan, L. H. Sobin, and C. Wittekind (eds.), Prognostic Factors in Cancer, 2nd ed. New York: Wiley-Liss, 2001, pp. 37–48. 3. R. D. Riley, K. R. Abrams, A. J. Sutton, P. C. Lambert, D. R. Jones, et al., Reporting of prognostic markers: current problems and development of guidelines for evidence based practice in the future. Br J Cancer. 2003; 88: 1191–1198. 4. D. G. Altman, Systematic reviews of studies of prognostic variables. In: M. Egger, S. Davey, and D. Altman (eds.), Systematic Reviews in Health Care: Meta-Analysis in Context. London: BMJ Publishing, 2001, pp. 228–247. 5. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline: E9 Statistical Principles for Clinical Trials. Current Step 4 version, 5 February 1998. Available at: http://www.ich.org/LOB/media/ MEDIA485.pdf 6. Committee for Proprietary Medicinal Products (CPMP), Points to consider on adjustment for baseline covariates. Stat Med. 2004; 23: 701–709. 7. G. Raab, S. Day, and J. Sales, How to select covariates to include in the analysis of a clinical trial. Control Clin Trials 2000; 21: 330–342. 8. D. Tu, K. Shalay, and J. Pater, Adjustment of treatment effect for covariates in clinical trials: statistical and regulatory issues. Drug Inf J. 2000; 34: 511–523. 9. S. Assmann, S. Pocock, L. Enos, and L. Kasten, Subgroup analysis and other (mis)uses of baseline data in clinical trials. Lancet. 2000; 255: 1064–1069. 10. S. Green, J. Benedetti, and J. Crowley, Clinical Trials in Oncology 2nd ed. Boca Raton, FL: Chapman & Hall/CRC, 2003. 11. M. Baum, Shooting sacred cows. Cancer Futures 2003; 2: 273–278. 12. S. Senn, Added values—controversies concerning randomization and additivity in clinical trials. Stat Med. 2004; 23: 3729–3753. 13. S. J. Pocock, and R. Simon, Sequential treatment assignment with balancing of prognostic factors Biometrics. 1975; 31: 103–115. 14. B. Efron, Forcing a sequential experiment to be balanced. Biometrika 1971; 58: 403–417.
15. T. M. Therneau, How many stratification factors are too many to use in a stratification plan? Control Clin Trials. 1993; 14: 98–108.
16. W. Rosenberger and J. Lachin, Randomization in Clinical Trials: Theory and Practice. New York: Wiley, 2002.
17. R. Peto, M. C. Pike, P. Armitage, N. E. Breslow, D. R. Cox, et al., Design and analysis of randomized clinical trials requiring prolonged observations of each patient: II. Analysis and examples. Br. J. Cancer. 1977; 35: 1–39.
18. D. G. Altman, Comparability of randomised groups. Statistician. 1985; 34: 125–136.
19. D. G. Altman, Covariate imbalance, adjustment for. In: C. Redmond and T. Colton (eds.), Biostatistics in Clinical Trials. New York: Wiley, 2001, pp. 122–127.
20. M. Schumacher, M. Olschewski, and C. Schmoor, The impact of heterogeneity on the comparison of survival times. Stat Med. 1987; 6: 773–784.
21. M. H. Gail, S. Wieand, and S. Piantadosi, Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika. 1984; 71: 431–444.
22. I. Ford, J. Norrie, and S. Ahmadi, Model inconsistency, illustrated by the Cox proportional hazards model. Stat Med. 1995; 14: 735–746.
23. C. Schmoor and M. Schumacher, Effects of covariate omission and categorization when analysing randomized trials with the Cox model. Stat Med. 1997; 16: 225–237.
24. L. D. Robinson and N. P. Jewell, Some surprising results about covariate adjustment in logistic regression models. Int Stat Rev. 1991; 58: 227–240.
25. M. N. Levine, V. H. Bromwell, K. I. Pritchard, B. D. Norris, L. E. Shepherd, et al., Randomized trial of intensive cyclophosphamide, epirubicin, and fluorouracil in premenopausal women with node-negative breast cancer. J Clin Oncol. 1998; 16: 2651–2658.
26. C. D. Atkins, Adjuvant chemotherapy with CEF versus CMF for node-positive breast cancer. J Clin Oncol. 1998; 16: 3916–3917.
27. E. Christensen, J. Neuberger, and J. Crowe, Beneficial effect of azathioprine and prediction of prognosis in primary biliary cirrhosis. Gastroenterology. 1985; 89: 1084–1091.
28. D. G. Altman and J. N. S. Matthews, Interaction 1: heterogeneity of effects. BMJ 1996; 313: 486.
29. J. N. S. Matthews and D. G. Altman, Interaction 2: compare effect sizes not P values. BMJ. 1996; 313: 808.
30. J. N. S. Matthews and D. G. Altman, Interaction 3: How to examine heterogeneity. BMJ. 1996; 313: 862.
31. S. Yusuf, J. Wittes, J. Probstfield, and H. Tyroler, Analysis and interpretation of treatment effects in subgroups of patients in randomised clinical trials. JAMA. 1991; 266: 93–98.
32. S. J. Pocock, M. D. Hughes, and R. J. Lee, Statistical problems in the reporting of clinical trials: a survey of three medical journals. N Engl J Med. 1987; 317: 426–432.
33. W. Jonat, M. Kaufman, W. Sauerbrei, R. Blamey, J. Cuzick, et al., for the Zoladex Early Breast Cancer Research Association Study, Goserelin versus cyclophosphamide, methotrexate, and fluorouracil as adjuvant therapy in premenopausal patients with node-positive breast cancer: The Zoladex Early Breast Cancer Research Association Study. J Clin Oncol. 2002; 20: 4628–4635.
34. M. Bonetti and R. Gelber, A graphical method to assess treatment-covariate interactions using the Cox model on subsets of the data. Stat Med. 2000; 19: 2595–2609.
35. P. Royston and W. Sauerbrei, A new approach to modelling interactions between treatment and continuous covariates in clinical trials by using fractional polynomials. Stat Med. 2004; 23: 2509–2525.
36. M. Mazumdar and J. R. Glassman, Categorizing a prognostic variable: review of methods, code for easy implementation and applications to decision-making about cancer treatments. Stat Med. 2000; 19: 113–132.
37. D. G. Altman, B. Lausen, W. Sauerbrei, and M. Schumacher, Dangers of using "optimal" cutpoints in the evaluation of prognostic factors. J Natl Cancer Inst. 1994; 86: 829–835.
38. R. Miller and D. Siegmund, Maximally selected chi square statistics. Biometrics. 1982; 38: 1011–1016.
39. B. Lausen and M. Schumacher, Maximally selected rank statistics. Biometrics. 1992; 48: 73–85.
40. S. G. Hilsenbeck and G. M. Clark, Practical P-value adjustment for optimally selected cutpoints. Stat Med. 1996; 15: 103–112.
41. M. Schumacher, N. Holländer, and W. Sauerbrei, Resampling and cross-validation techniques: a tool to reduce bias caused by model building? Stat Med. 1997; 16: 2813–2827.
42. N. Holländer, W. Sauerbrei, and M. Schumacher, Confidence intervals for the effect of a prognostic factor after selection of an "optimal" cutpoint. Stat Med. 2004; 23: 1701–1713.
43. M. Mazumdar, A. Smith, and J. Bacik, Methods for categorizing a prognostic variable in a multivariable setting. Stat Med. 2003; 22: 559–571.
44. P. Royston and D. G. Altman, Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling (with discussion). Appl Stat. 1994; 43: 429–467.
45. W. Sauerbrei and P. Royston, Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials [corrections published in J R Stat Soc Ser A Stat Soc. 2002; 165: 399–400]. J R Stat Soc Ser A Stat Soc. 1999; 162: 71–94.
46. W. Sauerbrei, P. Royston, H. Bojar, C. Schmoor, M. Schumacher, and the German Breast Cancer Study Group (GBSG), Modelling the effects of standard prognostic factors in node positive breast cancer. Br J Cancer 1999; 79: 1752–1760.
47. P. Royston and W. Sauerbrei, Stability of multivariable fractional polynomial models with selection of variables and transformations: a bootstrap investigation. Stat Med. 2003; 22: 639–659.
48. C. Begg, M. Cho, S. Eastwood, R. Horton, D. Moher, et al., Improving the quality of reporting of randomized controlled trials. The CONSORT statement. JAMA. 1996; 276: 637–639.
49. D. Moher, K. Schulz, and D. Altman, The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. Ann Intern Med. 2001; 134: 657–662.
50. D. Altman, K. Schulz, D. Moher, M. Egger, F. Davidoff, et al., The revised CONSORT statement for reporting randomized trials: explanation and elaboration. Ann Intern Med. 2001; 134: 663–664.
51. L. M. McShane, D. G. Altman, W. Sauerbrei, S. E. Taube, M. Gion, and G. M. Clark, for the Statistics Subcommittee of the NCI-EORTC Working Group on Cancer Diagnostics, REporting recommendations for tumor MARKer prognostic studies (REMARK). J Natl Cancer Inst. 2005; 97: 1180–1184.
52. D. G. Altman and G. H. Lyman, Methodological challenges in the evaluation of prognostic factors in breast cancer. Breast Cancer Res Treat. 1998; 52: 289–303.
53. M. Blettner, W. Sauerbrei, B. Schlehofer, T. Scheuchenpflug, and C. Friedenreich, Traditional reviews, meta-analyses and pooled analyses in epidemiology. Int J Epidemiol 1999; 28: 1–9.
COX PROPORTIONAL HAZARDS MODEL
JIANWEN CAI and DONGLIN ZENG
University of North Carolina at Chapel Hill, North Carolina

The Cox proportional hazards model, which is also referred to as Cox regression, proportional hazards regression, or relative-risk regression, is a regression procedure in the study of failure times that may be censored. It was originally proposed by Cox (1). It has become a standard method for dealing with censored failure time data and has been widely used in clinical trials and other biomedical research. Much methodological work has been motivated by this model. The aim of this article is to provide an overview of the Cox proportional hazards model. We will first present the model in its original form and discuss the interpretation and estimation of the regression parameters. We will also present the estimation of the cumulative hazard function and the survival function after a Cox model fit. Extension to the stratified model, approximations for handling tied failure-time data, incorporation of time-dependent covariates, as well as ways to check the proportional hazards assumption, will be discussed.

1 MODEL AND ESTIMATION

1.1 Hazard Rate Function

Let T denote a random variable that represents failure time, and let S(t) = P(T ≥ t) be the survival function. The hazard rate function for T is then defined as λ(t) = lim_{h↓0} h⁻¹ P(t ≤ T < t + h | T ≥ t). That is, λ(t) denotes the instantaneous rate of failure at time t. Mathematically, the hazard rate function uniquely determines the distribution of T. Specifically, S(t) = exp{−∫₀ᵗ λ(s) ds}. The hazard rate function is used commonly in analyzing censored survival data. For example, when comparing the survival distributions of two groups, one can use the ratio of the hazard functions from the two groups. The latter is called a hazard ratio. When survival functions are compared across multiple groups or subpopulations, a commonly used model is to assume that hazard ratios are proportional across groups or subpopulations over time. This model is called the Cox proportional hazards model, which we will discuss in detail.

1.2 Model and Parameter Interpretation

The Cox proportional hazards model specifies the hazard rate λ(t|Z) for an individual with covariate vector Z as

\[
\lambda(t \mid Z) = \lambda_0(t)\exp\{\beta' Z\}, \quad t \ge 0, \qquad (1)
\]

where β is a p-vector of unknown regression coefficients, and λ₀(t) is an unknown and unspecified non-negative function. The function λ₀(t) is referred to as the baseline hazard function, and it is the hazard rate for an individual with a covariate vector Z = 0. Model (1) is a semiparametric model, in that the effect of the covariates on the hazard is explicitly specified, whereas the form of the baseline hazard function is unspecified. The regression coefficient represents the log-hazard ratio for a one-unit increase in the corresponding covariate given that the other covariates in the model are held at the same value. For example, consider a clinical trial that compares treatment A with treatment B. Let Z be an indicator for treatment group A and Age represent the age of the subject. The proportional hazards model for studying the effect of treatment after adjusting for age is λ(t|Z, Age) = λ₀(t) exp(β_trt Z + β_age Age). The coefficient β_trt is the log-hazard ratio for comparing treatment A to treatment B for subjects of the same age, whereas β_age is the log-hazard ratio for a 1-year increase in age for patients within the same treatment group.

1.3 Partial Likelihood

Typical failure time data are subject to right censoring, in which some study subjects are only observed to survive beyond some time points. The censoring mechanism is usually assumed to be independent censoring; that is, the probability of failing at [t, t + dt) given all failure and censoring information as well
as all information on the covariates up to time t is the same as the probability of failing at [t, t + dt) given the subject did not fail up to time t and all information on the covariates up to time t. Let vj (j = 1, . . . , n) denote the observed failure and censoring times for a sample of n subjects and t1 ≤ · · · ≤ tk denote the observed distinct failure times. Under independent censoring and assuming no tied failure times exist, the estimation of the regression parameter in Model (1) can be carried out by applying the standard asymptotic likelihood procedure to the "partial" likelihood function [Cox (2)]

L(β) = ∏_{i=1}^{k} exp(β′Zi) / Σ_{l∈Ri} exp(β′Zl),   (2)
where Zi is the covariate vector for the subject failing at ti, and Zl is the corresponding covariate vector for the lth member of the risk set Ri, where Ri = {j: vj ≥ ti, j = 1, . . . , n}. In fact, the ith factor in Equation (2) is precisely the probability that the subject with covariate vector Zi fails at ti given the failure, censoring, and covariate information prior to ti on all subjects in the sample. The estimator for β, denoted by β̂, is obtained by maximizing the partial likelihood, which is equivalent to solving the score equation U(β) = 0, provided a finite maximum exists, where the score is

U(β) = ∇β log L(β) = Σ_{i=1}^{k} {Zi − Ei(ti; β)},   (3)

with

Ei(ti; β) = Σ_{l∈Ri} Zl exp(β′Zl) / Σ_{l∈Ri} exp(β′Zl).

Note that U(β) is the sum of random variables that have a conditional, and hence an unconditional, mean of zero, so that U(β) = 0 provides an unbiased estimating equation for β. It was pointed out that in the special case of the two-sample problem, U(0) reduces to the log-rank test [Mantel (3); Peto and Peto (4)] when there are no tied observed failure times. Iterative procedures, such as the Newton-Raphson method, are used commonly to solve the score equation.

The partial likelihood given in Equation (2), which differs from the marginal or conditional likelihood, was first introduced by Cox (2). It was later justified [cf. Johansen (5)] that Equation (2) is also the profile likelihood function for β. Hence, standard theory for parametric likelihood functions is applicable under suitable regularity conditions. For example, the inverse of −∇²ββ log L(β̂) is a consistent estimator of the asymptotic variance of β̂. The Wald test, score test, or likelihood ratio test for comparing two nested models can also be carried out using the partial likelihood. The formal justification of the asymptotic properties of β̂ was given in Andersen and Gill (6) using martingale theory.

1.4 Cumulative Hazard Function Estimation

The cumulative hazard function Λ0(t) = ∫₀ᵗ λ0(s) ds can be estimated by

Λ̂0(t) = Σ_{ti≤t} 1 / Σ_{l∈Ri} exp(β̂′Zl)   (4)

after the Cox model fit [Oakes (7); Breslow (8)]. In the special case of one homogeneous population (i.e., no covariates exist), the estimator in Equation (4) reduces to the Nelson-Aalen estimator of the cumulative hazard function. Because S0(t) = exp{−Λ0(t)}, the baseline survival function S0(t) can be estimated by Ŝ0(t) = exp{−Λ̂0(t)}. To estimate the cumulative hazard function and the survival function for any given value of Z, using Λ̂0(t), Ŝ0(t), and β̂, we have Λ̂(t | Z) = Λ̂0(t) exp(β̂′Z) and Ŝ(t | Z) = [Ŝ0(t)]^exp(β̂′Z).
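As an illustration of Equation (4) and the plug-in survival estimator, the following hedged R sketch (survival package) extracts the estimated cumulative hazard and a covariate-specific survival curve from a fitted model; fit, d, and newd are hypothetical objects for such a fit.

    library(survival)

    # After a fit such as fit <- coxph(Surv(time, status) ~ trt + age, data = d):
    # Breslow-type cumulative hazard estimate, at Z = 0 (centered = FALSE)
    # or at the covariate means (centered = TRUE).
    H0 <- basehaz(fit, centered = FALSE)   # columns: hazard (cumulative), time

    # Estimated survival curve for a specific covariate value, S(t | Z) = S0(t)^exp(beta'Z)
    newd <- data.frame(trt = 1, age = 60)  # hypothetical covariate values
    sf <- survfit(fit, newdata = newd)
    summary(sf)                            # survival probabilities over time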
2 PRACTICAL ISSUES IN USING THE COX MODEL

2.1 Tied Data

When failure time is continuous, tied times are impossible in theory, but tied data can develop in practice because times are measured imprecisely. The true time ordering of the ties is not observed, and the partial likelihood function in Equation (2) needs to be modified to reflect this ordering. In particular, the exact partial likelihood function should be used to account for all possible orderings at the tied failure times. However, such calculation can be cumbersome. For example, suppose three patients with subscripts 1, 2, 3, from a total of n patients, died on the same day after randomization. Because there are 3! = 6 possible orderings of these 3 patients, we denote the ordering sequences as A1 = {1, 2, 3}, A2 = {1, 3, 2}, A3 = {2, 1, 3}, A4 = {2, 3, 1}, A5 = {3, 1, 2}, and A6 = {3, 2, 1}. Then the term contributing to the partial likelihood function from that day, that is, the chance of observing these three patients dying on the same day, is equal to Σ_{i=1}^{6} P(Ai). In particular,

P(A1) = [exp(β′Z1) / (exp(β′Z1) + exp(β′Z2) + exp(β′Z3) + · · · + exp(β′Zn))]
        × [exp(β′Z2) / (exp(β′Z2) + exp(β′Z3) + · · · + exp(β′Zn))]
        × [exp(β′Z3) / (exp(β′Z3) + exp(β′Z4) + · · · + exp(β′Zn))],

P(A2) = [exp(β′Z1) / (exp(β′Z1) + exp(β′Z2) + exp(β′Z3) + · · · + exp(β′Zn))]
        × [exp(β′Z3) / (exp(β′Z2) + exp(β′Z3) + · · · + exp(β′Zn))]
        × [exp(β′Z2) / (exp(β′Z2) + exp(β′Z4) + · · · + exp(β′Zn))],

and so on. Clearly, the number of these terms increases dramatically, in factorial fashion, with the number of ties. For instance, with 5 ties, 5! = 120 possible orderings must be calculated. In practice, to avoid the cumbersome calculation of the exact partial likelihood, approximations have been proposed. Breslow (9) suggests treating the sequentially occurring ties as if they were distinct. Therefore, in the previous example, the corresponding term in the partial likelihood function is

exp{β′(Z1 + Z2 + Z3)} / [exp(β′Z1) + exp(β′Z2) + exp(β′Z3) + · · · + exp(β′Zn)]³.

Another approximation is suggested by Efron (10), in which the corresponding term in the partial likelihood function is

[exp(β′Z1) / (exp(β′Z1) + exp(β′Z2) + exp(β′Z3) + · · · + exp(β′Zn))]
× [exp(β′Z2) / (2{exp(β′Z1) + exp(β′Z2) + exp(β′Z3)}/3 + exp(β′Z4) + · · · + exp(β′Zn))]
× [exp(β′Z3) / ({exp(β′Z1) + exp(β′Z2) + exp(β′Z3)}/3 + exp(β′Z4) + · · · + exp(β′Zn))].

Empirically, the Breslow approximation works well when the number of ties is relatively small; Efron's approximation yields results closer to the exact results than Breslow's approximation with only a trivial increase in computer time. Most software uses the Breslow approximation as default.
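In practice the choice among these approximations is made through a software option; for example, in R's survival package the ties argument of coxph selects the method. A hedged sketch, with a hypothetical data frame d:

    library(survival)

    # Same model fit under the three ties-handling methods discussed above
    fit_breslow <- coxph(Surv(time, status) ~ trt + age, data = d, ties = "breslow")
    fit_efron   <- coxph(Surv(time, status) ~ trt + age, data = d, ties = "efron")
    fit_exact   <- coxph(Surv(time, status) ~ trt + age, data = d, ties = "exact")

    # With few ties, the three sets of coefficients are typically very close
    rbind(breslow = coef(fit_breslow), efron = coef(fit_efron), exact = coef(fit_exact))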
2.2 Time-Dependent Covariates

Time-dependent covariates refer to covariates that may vary over time. They are usually classified into external and internal [Kalbfleisch and Prentice (11), section 6.3]. In particular, external time-dependent covariates are those that do not depend on a subject's survival for their value at any time; they can be obtained even if the subject has died. Some examples of external time-dependent variables are "fixed" variables (i.e., non-time-dependent), the amount of pollution or radiation in an area, and the number of years since birth. Comparatively, internal time-dependent covariates are those variables that can only be measured, or only have meaning, while the subject is alive. Examples include white blood count at time t, systolic blood pressure at time t, and spread of cancer at time t. The Cox model can handle both external and internal covariates. In particular, for time-dependent covariates Z(t), the Cox model assumes

λ(t | Z(t)) = λ0(t) exp{β′Z(t)}.

Under this model, the hazard ratio comparing subjects with covariate path Z(t) versus subjects with Z̃(t) is equal to exp{β′(Z(t) − Z̃(t))}, which is no longer independent of t. Estimation of β can be obtained by maximizing the partial likelihood function

L(β) = ∏_{i=1}^{k} exp{β′Zi(ti)} / Σ_{l∈Ri} exp{β′Zl(ti)},
where Zi(t) is the covariate vector for the subject failing at ti, and Zl(t) is the corresponding covariate vector for the lth member of the risk set Ri, where Ri = {j: vj ≥ ti, j = 1, . . . , n}. In practice, time-dependent covariates can be introduced to model hazard ratios associated with fixed covariates that change over time. For example, in the Atomic Bomb Survivors Study, the event of interest is leukemia incidence, and the follow-up time is the number of years since 1945. Let XE be the indicator for being exposed, and fit the following model: λ(t) = λ0(t) exp(βXE + γ XE · t). Based on this model, the hazard ratio between exposed subjects and nonexposed subjects is λ(t | XE = 1)/λ(t | XE = 0) = exp(β + γt). Thus, this model allows the log-hazard ratio to change linearly with time. Additionally, we can test H0: the hazard ratio is constant versus H1: the log-hazard ratio changes linearly with time by testing H0: γ = 0 versus H1: γ ≠ 0 in the above model. A time-dependent covariate can also be included to study the effect of a covariate that changes over time. For example, suppose we are interested in the association of smoking and lung cancer incidence, with time measured by age. Let XE(t) denote the cumulative number of packs smoked by age t. The following model can be fitted: λ(t) = λ0(t) exp{βXE(t)}. As a result, the hazard ratio comparing a smoker who has smoked 350 packs/year since age 20 with a person of the same age who has never smoked is exp{β · 350(t − 20)} if t ≥ 20 and 1 if t < 20. When β > 0, the hazard ratio increases with time as long as the smoker continues to smoke. Although the Cox model with time-dependent covariates is flexible in modeling time-varying hazard ratios, internal time-dependent covariates are particularly susceptible to being inappropriately controlled, because they often lie in the causal pathway about which one wants to make inferences. For example, in a clinical trial for the effect of immunotherapy in the treatment of metastatic colorectal carcinoma, adjusting for the most recent depressed white blood counts (WBC) would facilitate treatment comparisons among subjects with similar prognoses at any specified time. However, immunotherapy might improve the prognosis by improving WBC over time; adjustment for WBC over time might remove the apparent effect of treatment, because treated and control subjects with the same WBC might have similar prognoses. In general, when the time-dependent covariate lies in the causal pathway between the treatment and the event process, such a time-dependent covariate should not be adjusted for in the model if the treatment effect is the focus of interest. Caution needs to be exercised when applying the Cox model with internal time-dependent covariates.
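For software, time-dependent covariates are typically handled either with the counting-process (start, stop] data format or with a time-transform term. A minimal hedged R sketch of both, with hypothetical data frames and variable names:

    library(survival)

    # (a) Counting-process format: one row per subject per interval over which
    #     the covariate value is constant; `wbc` is a hypothetical internal covariate.
    fit_tv <- coxph(Surv(start, stop, event) ~ trt + wbc, data = d_long)

    # (b) Time-transform term: lets the log-hazard ratio for `exposed` change
    #     linearly with time, as in the exposure-by-time model in the text.
    fit_tt <- coxph(Surv(time, status) ~ exposed + tt(exposed), data = d,
                    tt = function(x, t, ...) x * t)
    summary(fit_tt)   # the tt() coefficient plays the role of gamma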
2.3 Stratified Population

In many applications, the study population may consist of more than one distinct stratum because of some confounding variable C. Estimation of β without adjusting for the confounding variable can lead to misleading results. In the Cox proportional hazards model, there are two ways that one can adjust for the effect of C. To illustrate, suppose a confounder C has three levels, and let ZE denote the indicator of exposure status in which we are interested. The first approach is to create indicator variables for the different levels of C and include them as covariates in the model. Specifically, let Zj = 1 if C = j, and Zj = 0 otherwise, for j = 2 and 3. To adjust for the effect of C, we fit the following model: λ(t) = λ0(t) exp(β2 Z2 + β3 Z3 + βE ZE). Based on the above model, when C = 1, the hazard is λ(t) = λ0(t) exp(βE) for the exposed group and λ(t) = λ0(t) for the unexposed group. The hazard ratio that compares the exposed with the unexposed when C = 1 is then HR = exp(βE). Similarly, within the other levels of C, the hazard ratio is also exp(βE). Note also that the hazard ratio comparing the unexposed group with C = j to the unexposed group with C = 1 is exp(βj) for j = 2 and 3. Because β2 and β3 are regression coefficients and are not restricted, the stratum-to-stratum differences in log hazards for the unexposed groups may have different magnitudes, but the log hazards are parallel over time. This parallelism is the key assumption of such an indicator-variable adjustment approach. A second method of stratification is to use a stratified Cox model. Specifically, λ(t) = λ0j(t) exp(βE ZE), where j = 1, 2, and 3 correspond to the three different levels of C. Based on
this stratified model, when C = 1, the hazard is λ(t) = λ01(t) exp(βE) for the exposed group and λ(t) = λ01(t) for the unexposed group. The hazard ratio that compares the exposed with the unexposed when C = 1 is then HR = exp(βE). Similarly, within the other levels of C, the hazard ratio is also exp(βE). Note that the hazard ratio comparing the unexposed group with C = j to the unexposed group with C = 1 is λ0j(t)/λ01(t) for j = 2 and 3. Because λ0j(t) (j = 1, 2, 3) are arbitrary functions of time, the stratum-to-stratum differences in log hazards for the unexposed groups across the different levels of C are completely arbitrary. Comparing the above two approaches, the first, which uses the indicator-variable adjustment model, is restrictive because it requires parallel log hazards across strata in the unexposed group, whereas the second approach, which uses stratum-specific baseline hazard functions, is more robust. However, with the increasing number of parameters in the second approach, any test of the null exposure effect is less powerful compared with the first approach if the parallel assumption for the log hazards is satisfied. Additionally, if the confounder C is measured continuously and the strata were formed by grouping its values, better control for C might be achieved with continuous covariate adjustment as done in the first approach.
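Both adjustment strategies are directly available in standard software; a hedged R sketch (survival package, hypothetical data frame d with a three-level factor C):

    library(survival)

    # Approach 1: indicator-variable adjustment (common baseline hazard; C enters as covariates)
    fit_adj   <- coxph(Surv(time, status) ~ exposure + factor(C), data = d)

    # Approach 2: stratified Cox model (separate baseline hazard for each level of C)
    fit_strat <- coxph(Surv(time, status) ~ exposure + strata(C), data = d)

    # Both report exp(coef) for `exposure` as the within-stratum hazard ratio
    summary(fit_adj)$coefficients
    summary(fit_strat)$coefficients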
2.4 Censoring Mechanism

The standard inference for the Cox proportional hazards model is valid only when the right-censoring time satisfies a certain censoring mechanism. In practice, such a censoring mechanism often occurs in the following two kinds of scenarios:

1. Random censorship. The censoring times of the individuals are stochastically independent of each other and of the failure times. Random censorship includes the special case of Type I censoring, in which the censoring time of each individual is fixed in advance, as well as the case in which subjects enter the study at random over time and the analysis is carried out at some prespecified time.

2. Independent censorship. At time t and given all information on the covariates up to time t, individuals cannot be censored because they seem to be at unusually high (or low) risk of failure. In terms of a formula,

P{failure in [t, t + dt) | H(t), Z(t)} = P{failure in [t, t + dt) | survival to t, Z(t)},

where H(t) contains the failure and censoring information up to time t. Many usual censoring schemes belong to this class, which includes random censorship. Type II censoring is one special case of independent censoring, in which the study may continue until the dth smallest failure time occurs, at which time all surviving items are censored. Another special case of independent censoring is progressive Type II censoring, in which a specified fraction of individuals at risk may be censored at each of several ordered failure times.
2.5 Assessing the Proportional Hazards Assumption

One crucial assumption of the proportional hazards model is that the hazard ratio for different values of a time-independent covariate is constant over time. This assumption needs to be checked. Four practical approaches can be implemented to assess the proportional hazards assumption.

2.5.1 Method 1. This approach introduces an interaction term between the covariate being assessed and a specified function of time. Specifically, if we are interested in assessing the proportional hazards assumption for Zj, which is the jth element of Z, consider fitting the model λ(t | Z) = λ0(t) exp{β′Z + γ Zj Q(t)}, where Q(t) is a specified function of t [often Q(t) = t]. Thus, to check the proportional hazards assumption, we can test the null hypothesis γ = 0. This test is sensitive to the type of departure specified by the function Q(t).
2.5.2 Method 2. The second approach is to divide the time axis into K intervals, denoted [0, t1), [t1, t2), · · · , [tK−1, tK), with t0 = 0 and tK = ∞, and to replace β′Z by

Σ_{k=1}^{K} βk′ Z I{tk−1 ≤ t < tk}

in the model. Heterogeneity of (β̂1, β̂2, . . . , β̂K) is evidence of nonproportionality.

2.5.3 Method 3. The third approach is based on the fact that, under the proportional hazards assumption, log{−log S(t | Z)} = β′Z + log{−log S0(t)}, where S(t | Z) is the conditional survival function given Z and S0(t) = exp{−Λ0(t)}. Therefore, the curves of log{−log S(t | Z)} for different levels of Z should be parallel. We can then examine the plot of log{−log Ŝ(t | Z)} versus t to assess the proportional hazards assumption for a single discrete variable Z, where Ŝ(t | Z) is estimated nonparametrically or using stratified Cox regression when other covariates need to be controlled for. When Z is continuous, we may stratify Z into Q levels and, after estimation within each level, plot log{−log Ŝq(t)} versus t for q = 1, · · · , Q to examine whether the curves are parallel.

2.5.4 Method 4. As model checking for linear regression models is based on residuals, this approach uses the so-called Schoenfeld residuals. In particular, for subject i and the jth covariate, the Schoenfeld residual is defined as Lij = Zij(ti) − Z̄j(ti), where

Z̄j(t) = Σ_{l=1}^{n} I(tl ≥ t) Zlj(t) exp{β̂′Zl(t)} / Σ_{l=1}^{n} I(tl ≥ t) exp{β̂′Zl(t)}.

If the proportional hazards assumption holds, Lij should behave like a random walk. For assessing general departures from proportional hazards, the Schoenfeld residuals can be plotted. The residuals should be randomly scattered about zero if the proportional hazards assumption holds. The trend in the Schoenfeld residuals reflects how the effect of the covariate varies over time.

2.6 Software

The Cox proportional hazards model can be fit in several software programs, including R (http://www.r-project.org/), S-Plus (http://www.insightful.com/), and SAS (http://www.sas.com/). In particular, both R and S-Plus use the function "coxph," whereas SAS uses the procedure "PHREG." Options in these packages allow one to choose different ways of handling ties, fit stratified Cox models, and perform tests to assess the proportional hazards assumption. All packages can handle time-dependent covariates as well.
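As a concrete illustration of Method 4, the following hedged R sketch uses cox.zph from the survival package, which tests proportional hazards using scaled Schoenfeld residuals; the fitted object fit is hypothetical.

    library(survival)

    # fit <- coxph(Surv(time, status) ~ trt + age, data = d)   # hypothetical fit
    zph <- cox.zph(fit)   # score-type test of proportional hazards for each covariate
    print(zph)            # small p-values suggest non-proportional hazards
    plot(zph)             # smoothed scaled Schoenfeld residuals versus time;
                          # a flat trend is consistent with proportional hazards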
3 EXAMPLE

Data are taken from a randomized trial in which men with advanced inoperable lung cancer were randomized to either a standard treatment or a test chemotherapy treatment [Kalbfleisch and Prentice (11)]. The failure time of interest was the time from randomization to death, and one important question was to compare the two treatments in terms of the hazard rate for death. The data contain 137 patients; 69 patients were randomized to the standard treatment and the remaining to the test chemotherapy treatment. The median survival time is 62 days, and 9 patients were right censored. The Cox proportional hazards model can be used to analyze the data. The covariates used in the regression model include the treatment indicator, age at randomization, indicator variables for the four histologic types of tumor (squamous, small, adeno, and large), and patient performance status measured by the Karnofsky rating. Efron's method is used to handle ties. The estimates of the regression coefficients, which maximize the partial likelihood function, are given in Table 1. The results show that the two treatments have no statistically significant difference on survival; patients with small or adeno tumors had higher risk than patients with squamous tumors, and patients with a lower Karnofsky rating tended to have higher risk. In particular, for patients within the same treatment group and with the same Karnofsky rating and age, those with a small tumor had 135.5% higher risk than those with a squamous tumor, and those with an adeno tumor had 225% higher risk than those with a squamous tumor.

Table 1. Analysis of lung cancer data using Cox regression

Covariate                              β̂        exp(β̂)    se(β̂)    P-value
Chemotherapy versus standard           0.303     1.354     0.206     0.14
Small tumor versus squamous tumor      0.856     2.355     0.271     0.00
Adeno tumor versus squamous tumor      1.179     3.250     0.296     0.00
Large tumor versus squamous tumor      0.402     1.495     0.283     0.15
Karnofsky rating                      −0.034     0.968     0.005     0.00
Age                                   −0.009     0.991     0.009     0.33

The baseline cumulative hazard function can be estimated using the expression in Equation (4), and Fig. 1 displays the estimated baseline cumulative hazard function versus time. To assess the proportional hazards assumption, we calculate the Schoenfeld residuals and plot them in Fig. 2. Each subplot in the figure plots the Schoenfeld residuals associated with one covariate versus time. The figure indicates no obvious departure from the proportional hazards assumption.

Figure 1. Estimated baseline cumulative hazard function over time.

Figure 2. Schoenfeld residuals for assessing the proportional hazards assumption (one panel per covariate: treatment, small vs. squamous, adeno vs. squamous, large vs. squamous, Karnofsky rating, age).
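This trial is the classic VA lung cancer data set, which ships with R's survival package as veteran; a hedged sketch that fits essentially the model summarized in Table 1 (the variable coding in veteran, e.g., trt, celltype, karno, age, may differ slightly from the labels used above):

    library(survival)

    # veteran: 137 men with advanced inoperable lung cancer
    fit <- coxph(Surv(time, status) ~ factor(trt) + celltype + karno + age,
                 data = veteran, ties = "efron")
    summary(fit)           # coefficients, hazard ratios, standard errors, p-values

    bh <- basehaz(fit, centered = FALSE)          # cf. Fig. 1
    plot(bh$time, bh$hazard, type = "s",
         xlab = "Time in days", ylab = "Baseline cumulative hazard")
    plot(cox.zph(fit))     # Schoenfeld-residual checks of proportional hazards (cf. Fig. 2)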
4 EXTENSION OF THE COX PROPORTIONAL HAZARDS MODEL

Since its introduction, the Cox proportional hazards model has been extended from analyzing univariate failure time data to several other problems. We list some important extensions below.

1. Extension to counting processes. A counting process is a stochastic process in time t that counts the number of failures by time t. Thus, single-failure time data are analyzed by a special counting process in which only one event can occur; however, the counting process is more general, and its definition includes recurrent failures. If we denote this process by N(t), then, similar to the hazard function, we can define the so-called intensity function given covariates Z as λ(t | Z) = lim_{h↓0} h⁻¹ E{N(t + h) − N(t) | H(t), Z}, where H(t) is the history process up to time t. With covariates Z, one natural extension of the Cox
proportional hazards model is the proportional intensity model, which assumes λ(t | Z) = λ0(t) exp{β′Z}. The partial likelihood function, which is the same as before, can be constructed to estimate and make inference on β. Martingale theory based on counting processes was developed [cf. Andersen and Gill (6); Fleming and Harrington (12)] and has been fruitful in obtaining large-sample results for event time models.

2. Extension to multivariate failures. Another extension of the Cox proportional hazards model is to model multivariate failure times [Wei et al. (13); Hougaard (14)]. In this case, multiple events, such as death, time to relapse, and so on, are of interest. An event-specific Cox regression model can be used to model the hazard function of each event, each with a different baseline hazard function and covariate effect. A pseudo-partial likelihood function, which is the simple product of the partial likelihoods from each event, can be maximized to estimate the covariate effects, whereas the variance needs to be estimated using a sandwich form.

3. Extension to clustered failures. In clustered failure time data, subjects may or may not experience the same type of event, but they may be correlated because subjects are from the same cluster. The Cox proportional hazards model can be used to model each individual's hazard function; however, an unobserved cluster-specific frailty is introduced into each model to account for within-cluster correlation. Such a model is called a frailty model for clustered failures [Clayton and Cuzick (15); Hougaard (14)], but it is essentially an extension of the Cox proportional hazards model.

4. Extension to different data structures. The Cox proportional hazards model has also been applied to analyze different data structures. Examples
include left-truncated data [Wang (16)]; interval-censored data [Huang (17); Sun (18)], in which the failure time is not observed directly but is only known to lie within some interval; and case-cohort and case-control designs [Breslow and Day (19)], in which subjects are sampled based on their failure outcomes.
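Several of these extensions are available through the same R fitting function; a hedged sketch, with hypothetical data frames and an id variable identifying subjects or clusters:

    library(survival)

    # Counting-process formulation for recurrent events: one (start, stop] row per at-risk interval
    fit_recur <- coxph(Surv(start, stop, event) ~ trt, data = d_recur)

    # Marginal model for correlated failure times: robust sandwich variance via cluster()
    fit_marg  <- coxph(Surv(time, status) ~ trt + cluster(id), data = d_clust)

    # Shared (gamma) frailty model for clustered failures
    fit_frail <- coxph(Surv(time, status) ~ trt + frailty(id), data = d_clust)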
REFERENCES

1. D. R. Cox, Regression models and life-tables. J. Royal Stat. Soc. B 1972; 34: 187–202.
2. D. R. Cox, Partial likelihood. Biometrika 1975; 62: 269–276.
3. N. Mantel, Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemother. Rep. 1966; 50: 163–170.
4. R. Peto and J. Peto, Asymptotically efficient rank invariant test procedures (with discussion). J. Royal Stat. Soc. A 1972; 135: 185–206.
5. S. Johansen, An extension of Cox's regression model. Internat. Stat. Rev. 1983; 51: 258–262.
6. P. K. Andersen and R. D. Gill, Cox's regression model for counting processes: a large sample study. Ann. Stat. 1982; 10: 1100–1120.
7. D. Oakes, Contribution to the discussion of the paper by D. R. Cox. J. Royal Stat. Soc. B 1972; 34: 208.
8. N. E. Breslow, Contribution to the discussion of the paper by D. R. Cox. J. Royal Stat. Soc. B 1972; 34: 216–217.
9. N. E. Breslow, Covariance analysis of censored survival data. Biometrics 1974; 30: 89–99.
10. B. Efron, The efficiency of Cox's likelihood function for censored data. J. Am. Stat. Assoc. 1977; 72: 557–565.
11. J. D. Kalbfleisch and R. L. Prentice, The Statistical Analysis of Failure Time Data. New York: Wiley, 2002.
12. T. R. Fleming and D. P. Harrington, Counting Processes and Survival Analysis. New York: Wiley, 1991.
13. L. J. Wei, D. Y. Lin, and L. Weissfeld, Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. J. Am. Stat. Assoc. 1989; 84: 1065–1073.
14. P. Hougaard, Analysis of Multivariate Survival Data. New York: Springer, 2000.
15. D. Clayton and J. Cuzick, Multivariate generalizations of the proportional hazards model (with discussion). J. Royal Stat. Soc. A 1985; 148: 82–117.
16. M. C. Wang, Hazards regression analysis for length-biased data. Biometrika 1996; 83: 343–354.
17. J. Huang, Efficient estimation for the proportional hazards model with interval censoring. Ann. Stat. 1996; 24: 540–568.
18. J. Sun, The Statistical Analysis of Interval-censored Failure Time Data. New York: Springer, 2006.
19. N. E. Breslow and N. E. Day, Statistical Methods in Cancer Research: The Design and Analysis of Cohort Studies. Lyon, France: International Agency for Research on Cancer, 1987.
FURTHER READING

P. K. Andersen, Ø. Borgan, R. D. Gill, and N. Keiding, Statistical Models Based on Counting Processes. New York: Springer-Verlag, 1993.
T. R. Fleming and D. P. Harrington, Counting Processes and Survival Analysis. New York: Wiley, 1991.
P. Hougaard, Analysis of Multivariate Survival Data. New York: Springer, 2000.
J. D. Kalbfleisch and R. L. Prentice, The Statistical Analysis of Failure Time Data. New York: Wiley, 2002.
T. M. Therneau and P. M. Grambsch, Modeling Survival Data: Extending the Cox Model. New York: Springer, 2000.
J. Sun, The Statistical Analysis of Interval-censored Failure Time Data. New York: Springer, 2006.
CROSS-REFERENCES

Censoring
Hazard rate
Hazard ratio
Logrank Test
Survival Analysis
CRONBACH'S ALPHA

HELENA CHMURA KRAEMER
Stanford University, California

The underlying model for Cronbach's alpha is that the response on the jth item for subject i is modeled by Xij = µ + ξi + ηj + εij, where ξi is the characteristic of interest for subject i, ηj is the fixed mean for item j, and εij (j = 1, 2, . . . , m) are error terms that are independent of each other and of ξi. Then, if the score for subject i is the sum or average of the m items, under this model the inter-rater reliability (q.v.) of that score is given by Cronbach's alpha,

α = mρ / [(m − 1)ρ + 1],

where ρ is Variance(ξi)/Variance(Xij) (in this model, it is assumed to be the same for each item).

1 ESTIMATION OF ALPHA

To estimate α, one draws a sample of N subjects from the population of interest and administers the m items to each subject. Then, using a two-way analysis of variance (ANOVA; N subjects by m items), one can estimate α by (FS − 1)/FS, where FS is the F-statistic for testing the subject effect in the ANOVA with (N − 1) and (m − 1)(N − 1) degrees of freedom. Although the motivation for Cronbach's alpha is based on considerations of the reliability of the score based on the sum or average of the m items, in practice it is usually appropriate only to cite Cronbach's alpha as a measure of internal consistency among the items used to compute that score.
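A minimal R sketch of this ANOVA-based estimate, for a hypothetical N × m item-score matrix X (the matrix and its contents are placeholders, not data from the article):

    # X: hypothetical N x m matrix of item scores (rows = subjects, columns = items)
    long <- data.frame(
      score   = as.vector(X),
      subject = factor(as.vector(row(X))),
      item    = factor(as.vector(col(X)))
    )
    fit <- lm(score ~ subject + item, data = long)   # two-way ANOVA: subjects by items
    Fs  <- anova(fit)["subject", "F value"]          # F-statistic for the subject effect
    alpha_hat <- (Fs - 1) / Fs                       # Cronbach's alpha estimate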
Cronbach's alpha, in general, is not a measure of reliability, although this is commonly reported as such in the psychology research literature, for the assumptions underlying that claim are seldom satisfied. Consider the following two examples.

Example 1. Suppose the m items each relate to the different systems that might be primarily affected by syphilis (the last one "Other"), with each item to be answered 0 (No) or 1 (Yes). Then for those without syphilis, the answer is No to all items, and for those with syphilis, the answer is Yes to one and only one item. The total score then is an indicator of the presence of syphilis in the individuals in the population and may have nearly perfect inter-rater or test–retest reliability. However, Cronbach's alpha will be near zero, because answering Yes to one item precludes answering Yes to any other. In this case, the m items do not all measure exactly the same characteristic, and the errors are not independent or identically distributed. Thus, Cronbach's alpha near zero in this kind of situation is not necessarily an indication that the total score is poor. Here Cronbach's alpha near zero is an indication that the items are not addressing the same characteristic in the same way; i.e., inconsistency exists among the items. In many cases, the types of items used in measure development for randomized clinical trials deliberately attempt to tap into different, relatively independent facets of the construct the measure seeks, which is a situation that might well lead to Cronbach's alpha near zero, but also to a reliable and valid measure.

Example 2. Suppose that for each administration of the test, the subject is assigned a random number, and then asked m times: What number do you have? Then, if subjects comply, Cronbach's alpha should be 1, and yet the average of the m responses is a random number with zero reliability over separate administrations of the test. A Cronbach's alpha that is too close to 1 may indicate redundancy among the items or correlated errors, and it is not necessarily an indication that the score reliably measures any subject characteristic. Such a situation presents a difficulty in documenting a measure for use
in a randomized clinical trial, for presenting redundant items merely places greater measurement burden on the patients in the trial with no increase in the quality or quantity of information available to be used to answer the research questions of the trial. When m items are presented to a subject at one time, as they usually are, it is difficult to ensure that the errors are uncorrelated, and such correlated errors tend to inflate Cronbach's alpha. Consequently, Cronbach's alpha indicates how well the multiple items tap into a single construct (including correlated errors); it is always a measure of interitem consistency, but not necessarily a measure of reliability of the score obtained by summing those items.

2 ITEM-TEST CORRELATIONS
Generally, statistical computer programs that estimate Cronbach’s alpha also examine the item–test correlations in two ways, which is a useful addendum to Cronbach’s alpha. First, the correlation between each of the m items and the total score is computed. Since the item appears in both variables correlated in this way, that correlation coefficient might be inflated, particularly if the number of items is relatively small. Alternatively, the correlation between each of the m items and the score based on the remaining m – 1 items is computed. Ideally, if all the items measure the same construct and contribute about equal information, as hypothesized in the underlying model for alpha, each item–test correlation will be reasonably and about equally as strong. If an item seems to have substantially smaller correlation than others, that suggests an inconsistency with those other items. If, in fact, the remaining items are valid measures of the construct of interest, the impact of removing that item would be to improve the quality of the score as a measure of that construct. However, it is possible that the one or two items that seem to be inconsistent with the others are the most reliable and valid measures of the construct of interest, whereas the others measure some other construct. Then removing the inconsistent few items may actually detract from the quality of the score
as a measure of the construct of interest. For this reason, before seeking to estimate how internally consistent a set of items is, there should be some empirical support for the reliability and validity of each of the items for the construct of interest.

3 CONSIDERATIONS IN THE USE OF ALPHA

Computing and reporting Cronbach's alpha is only one of many ways available to document the quality of a score based on summing or averaging a set of m items. For example, psychometricians often perform factor analysis on the set of items. If such a factor analysis indicated one major factor underlying the set of items, with a factor structure that placed approximately equal weight on each item for that factor, this would automatically indicate that Cronbach's alpha should be quite high. On the other hand, should factor analysis indicate the presence of multiple independent factors, or one factor with very disparate weights on the m items, it is likely that Cronbach's alpha would be low to moderate. Another interesting alternative is item response theory (2). In this approach, it is assumed that there is one latent (unmeasurable) construct underlying the response to each item. The responses to each item (or some transformation of the responses) are assumed to have a distribution determined by a linear function of that underlying construct. However, the slope and intercept of the linear model relating a response to each item may be item-specific. The intercept is an indication of the "difficulty" of the item, and the slope of the "discrimination" of the item. If an item has zero discrimination, then responses on that item are of no value in estimating the latent construct, and such an item is best removed. The stronger the discrimination, the more informative is the response on that item about the level of the latent construct. At the same time, the same answer on one of two items with the same discrimination may indicate a higher level of the underlying construct than on the other, and this is reflected in the difficulty of the item. The ideal set of items to measure that latent construct would include items at varying difficulty to ensure
sensitivity to variations at all levels of the latent construct. The goal would be to estimate the value of the latent construct for each subject from the responses on the m items. Cronbach’s alpha and item response theory both assume a single underlying latent construct. However, item response theory attempts to estimate that latent construct, whereas users of Cronbach’s alpha simply sum or average the items, putting equal weight on items that are more or less difficult, and have better or poorer discrimination. Coefficient alpha clearly plays an important role in measure development, a process that logically predates use of such measures in randomized clinical trials. REFERENCES 1. L. J. Cronbach, Coefficient alpha and the internal structure of tests. Psychometrika 1951; 16:297–334. 2. S. E. Embretson and S. Reise, Item Response Theory for Psychologists. Mahwah, N.J.: Erlbaum, 2000.
CROSS-REFERENCES

Interrater Reliability
Reliability Analysis
CROSSOVER DESIGN
MICHAEL G. KENWARD
London School of Hygiene and Tropical Medicine, London, United Kingdom

1 INTRODUCTION

In the familiar parallel-group design, each subject is randomly allocated to one of the experimental interventions or treatments. In contrast, in a crossover design, each subject is randomized to a sequence of treatments, which is a special case of a repeated measures design. Despite this difference, the goal of a crossover study in nearly all settings remains the same as that of a parallel-group study: to compare the effects of the single treatments, not the effects of the sequences to which the subjects are randomized. Thus, in a crossover trial, subjects are not randomized to the interventions under comparison, and this has important implications for the design, analysis, and interpretation. The main advantage of a crossover design over the parallel group is the opportunity it provides to compare the effects of treatments within subjects. That is, any component of an individual's response that is consistent across time is removed from the treatment comparison. The degree of such consistency is measured by the within-subject (or intraclass) correlation. As a consequence, the precision of the treatment effect from a crossover study will be greater than that from a parallel-group study with the same number of subjects, and it will have the greatest advantage when the within-subject correlation is high. In this respect, the crossover study has much in common with a blocked or matched study, in which the subject is the blocking/matching variable. A second advantage in terms of precision per subject, less often remarked in this context, is that each subject contributes observations under more than one treatment. So, although a simple baseline measurement can be used to remove between-subject variability in a parallel-group trial, the precision will still be substantially less than the corresponding crossover trial. Therefore, the main reason for considering the use of a crossover design is the ability to produce sufficiently high precision with comparatively small numbers of subjects. Indeed, examples can be suggested in which the precision from a crossover trial is orders of magnitude greater than a parallel-group study of equivalent size. Crossover trials have also proved to be important in dose-ascending studies of the pharmacokinetics of new drug substances in healthy volunteers, for which the requirements above are broadly satisfied.

This potential gain in precision comes with a price, however. Obviously such a design cannot be used with treatments that irreversibly change the condition of the subject, such as treatments that are curative. Once treatment has ceased, the subject must return to the original condition, at least approximately. For the same reason, such designs are only appropriate for conditions that are reasonably stable over time. For example, crossover designs have often been used in the past for chronic conditions like angina and asthma. When using such a design, the possibility always remains that some consequence of earlier treatment may still be influential later in the trial. In the crossover context, this is called a carryover effect. In principle, such effects may be caused literally by a physical carryover of a drug, or by a less-direct cause such as a psychological effect of prior treatment. The key point is that subjects in the second and later periods of a crossover trial have systematic differences in their past experience in the trial. This potential source of bias is akin to confounding in an epidemiological study and implies that, to some extent, the analysis of data from a crossover trial will inevitably rely more on assumptions and modeling, and less directly on the randomization, than a conventional parallel-group study. This issue is particularly apparent in the so-called two-period, two-treatment or 2 × 2 design. This design is the simplest and, arguably, the most commonly used design in a clinical setting. In this design, each subject receives two different treatments that we conventionally label as A and B. Half the subjects are randomized to receive A first and then, after a suitably chosen period of time, cross over to B. The remaining subjects receive B first and then cross over to A. Because this particular design is used so commonly and because it raises very special issues in its own right, the next section is devoted specifically to it. Many other so-called higher order designs exist with more than two periods, and/or treatments, and/or sequences. The choice of such designs and associated analyses are considered in the remaining sections. Much of what follows is elaborated in two standard texts on crossover trials: Senn (1) and Jones and Kenward (2), and we refer the reader to these for more details.

2 THE TWO-PERIOD, TWO-TREATMENT DESIGN

In the two-period, two-treatment, or 2 × 2, design, each subject is randomized to one of two treatment sequences: A then B, or B then A. We use the term period for the time interval in which the subject receives a particular treatment; hence it is a two-period design. Often, an additional period occurs between the two treatment periods in which subjects do not take either treatment. This period is called the washout period, and its purpose is largely to lessen the likelihood of bias caused by carryover. We return below to such concerns, as well as to the incorporation into the analysis of measurements made in the washout period. To begin, we assume that no carryover effect is present and no treatment-by-period interaction has occurred; we focus on the use of the measurements made only in the treatment periods. To represent these measurements, we introduce some notation. Let Yijk denote the outcome from subject k (k = 1, . . . , ni) on sequence i (i = 1, 2) in period j (j = 1, 2). It is assumed that ni subjects are randomized to the ith sequence. This outcome might, for example, be a summary measure from all or part of the relevant treatment period or a measure from a particular time, often the end, of the period. When the treatment effect is estimated from these measurements, it is necessary to acknowledge the repeated measures structure of the design, which implies that we cannot rule out a priori an overall change in response between the two periods: a so-called period effect. Hence, on design considerations alone, we must adjust for a period effect. It is the requirement for this adjustment that means that, in general, any crossover design must have at least as many sequences as treatments. Fortunately, the adjustment is very simple to do in practice, particularly so in the 2 × 2 design. Denote the four sequence-by-period sample means by Ȳij·. The B-A mean treatment difference adjusted for period effect (τ, say) is then estimated as

τ̂ = (1/2)(Ȳ12· + Ȳ21· − Ȳ11· − Ȳ22·).   (1)

To make inferences about τ, we note that if we define the within-subject differences Dik = Yi2k − Yi1k, then τ̂ is simply half the difference of the means of these between the two sequence groups:

τ̂ = (1/2)(D̄1· − D̄2·).
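A minimal R sketch of this within-subject analysis, using a hypothetical data frame with one row per subject containing the sequence label and the two period outcomes:

    # Hypothetical data: columns `sequence` ("AB" or "BA"), `y1` (period 1), `y2` (period 2)
    d$D <- d$y2 - d$y1                      # within-subject differences D_ik

    # Two-sample t-test of the differences between sequence groups (tests the treatment effect)
    t.test(D ~ sequence, data = d, var.equal = TRUE)

    # Point estimate of tau (B - A adjusted for period): half the difference of group means
    tau_hat <- 0.5 * (mean(d$D[d$sequence == "AB"]) - mean(d$D[d$sequence == "BA"]))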
For inferences we can therefore use standard two-sample procedures on the differences Dik. Most commonly, the standard two-sample t-test framework is used to provide test statistics, confidence intervals, and power calculations. For the latter, it must be remembered that the relevant variance is that of the within-subject differences Dik, not the raw observations Yijk. Alternatively, nonparametric procedures can be applied to the same differences. We therefore observe that with complete data, under the assumption of no carryover or treatment-by-period interaction, the analysis for the 2 × 2 design is a very straightforward application of a two-sample procedure applied to the within-subject differences. However, concern has existed in this setting about the carryover assumption. If a carryover effect exists, then the simple estimate in Equation (1) will be biased for τ. An early and influential paper by Grizzle (3) made the suggestion that the analysis should therefore begin with a test for the presence of such a carryover and, only if it is insignificant, should the estimator in Equation (1) be used. If it is found significant, then the treatment effect should be estimated from the first-period data only (i.e., Ȳ21· − Ȳ11·).
This two-stage procedure was widely adopted. However, it suffers from two very serious problems that mean, in practice, it should be avoided. The first problem lies with the power of the initial test for carryover, which, in the 2 × 2 trial, turns out to be equivalent to a test for treatment-by-period interaction. This test parallels that for the direct treatment effect above, but it is calculated using the subject sums Sik = Yi1k + Yi2k in place of the differences Dik, and there is no factor of 1/2. Suppose that an observation Yijk has variance σ² and the within-subject correlation is equal to ρ. Then V(Dik) = (1 − ρ)σ² and V(Sik) = (1 + ρ)σ². It follows that the ratio of the variances of the carryover estimator and the direct treatment estimator is

R = 2 + ρ/(1 − ρ).
As the within-subject correlation increases, this ratio grows without limit. In a typical crossover setting the within-subject correlation is large, and the trial is powered for the direct treatment effect, which leads to comparatively small numbers of subjects. Consequently, the power for the carryover effect is often very low indeed. As an example, consider the 2 × 2 crossover trial on treatment for asthma described in Reference 2, section 2.10. This study had 17 subjects; the outcome was a lung function measure: forced expired volume in one second. In this trial, ρ was estimated as 0.96, with corresponding R equal to 28.6. As a consequence, 28.6 times as many subjects (476) would have been required to estimate the carryover effect with the same precision as the direct-treatment effect and, in terms of power in the actual sample, the test for carryover is effectively useless. Because of the typical low power of the test for carryover, it was suggested in the original two-stage proposal (3) that it might be made at a higher nominal size, such as 10%. However, any such modification is insufficient in most cases to rescue this test, and negative results from such low-powered tests convey no useful information. The second major problem with the two-stage procedure was pointed out by Freeman
(4). Because of the statistical dependence between the preliminary test for carryover and the first-period treatment comparison, the two-stage procedure leads to bias in the resultant treatment estimator and increases the probability of making a Type I error. In other words, the actual significance level of the direct treatment test is higher than the nominal one chosen for the test. The first-period comparison would only be unbiased were it always used or chosen for a reason independent of it. Unfortunately, this is not the case in the two-stage procedure, in which it is only used after a significant (hence large) carryover. Although attempts have been made to circumvent this problem, all solutions involve introducing information about the carryover effect that is not contained in the data. Given that a crossover trial will be powered for within-subject estimation of the treatment effect and will not have the sensitivity to provide useful information on the carryover, it has been argued that the best approach to this problem is to take all practical steps to avoid possible carryover. Examples of these steps include using washout periods of adequate length between the treatment periods and then, assuming that this effect is negligible, using the simple estimator in Equation (1) and the associated analysis. In turn, this method requires a good working knowledge of the treatment effects, which will most likely be based on prior knowledge of, and/or the mechanisms behind, these effects. Many 2 × 2 trials have additional measurements taken before the first treatment period: at the end of the so-called run-in period and at the end of the washout period. These measurements are called baseline measurements, noting that in contrast to a parallel-group study such a baseline measurement (the ones from the washout period) can be collected after randomization. Although it is tempting to introduce these baselines into the analysis in a fashion analogous to a parallel-group design, for example using change from baseline as response and/or with baselines as covariates, some care is needed in their introduction. Because of the within-subject nature of the primary analysis, these baselines do not have the usual role familiar from parallel-group studies. Between-subject variability as
existing in the treatment periods is already eliminated from the treatment comparison; for the baselines to make a contribution over and above this they must explain additional variability. It will happen only when the correlation between an outcome Yijk and its associated baseline is considerably greater than between the two outcomes Yi1k and Yi2k . This outcome might happen for example if the time interval between the treatment measurements is much greater than between baseline and associated treatment measurement. If the correlations among the four measurements from a subject are all similar, then not only will no improvement in precision be obtained from using baselines in the analysis, but also their introduction is likely to actually decrease precision. For example, if all within-subject correlations are exactly equal, then the variance of the treatment estimator obtained using the change from baseline as a response is exactly double that of the estimator (1) that uses only the treatment outcomes. The baselines have much greater potential value when estimating the carryover effect because they can be used to eliminate between-subject variability from this method. In some settings, this method can lead to a considerable increase in precision. However, the difficulty associated with the two-stage procedure is not eliminated. Now, two potential types of carryover exist: the original one associated with the second treatment period, and an additional one associated with the end of the washout period. Before one uses the baselines to examine the former, the status of the latter needs to be established, which leads to an even more complicated sequential procedure with additional potential for bias (5). In addition, the gain in precision afforded by inclusion of baseline information will not bring it to the same level as that of the direct treatment comparison. Again the advice is not to attempt exploration of a possible carryover effect as a part of the primary treatment comparison. Related issues apply to the use of a genuine baseline covariate, that is, one that is measured before randomization only. Details on the incorporation of such covariates into the analysis are given in Reference 2, Section 2.11. The very simple analyses so far described rest on the assumption that
the data are continuous. The analog of the two-sample t test of the differences Dik for binary data is given by the Mainland-Gart test, which is a simple extension of McNemar's test that incorporates adjustment for a possible period effect. The Mainland-Gart test is just the test for association in a 2 × 2 contingency table that is constructed as follows. When the outcome is binary, each subject presents one of four possible responses for the two periods (Yi1k, Yi2k): (0,0), (0,1), (1,0), and (1,1). The so-called nonpreference responses (0,0) and (1,1) are discarded, and the columns of the table correspond to the preference responses: (0,1) and (1,0). The rows are then given by the two sequence groups:
                 Outcome
Sequence     (0,1)     (1,0)
AB            n11       n12
BA            n21       n22
In this table, n11 is, for example, the number of subjects who responded with a 0 in period 1 and a 1 in period 2 in sequence group AB. The test for treatment effect, ignoring carryover, is then the test for association in the table, for example, using Pearson's chi-squared statistic. If numbers in this table are small, as they often are in a crossover trial, especially after discarding the (0,0) and (1,1) responses, then a small-sample exact procedure such as the Fisher exact test can be used. Other outcomes, such as ordinal and count, are best handled using appropriate model-based methods, to which we return briefly below. Sample size calculations in such categorical settings are less straightforward, however, than for continuous outcomes. Choices must be made about the scale on which the treatment effect is to be measured, and an appropriate measure of within-subject dependence must be introduced. The actual calculations will be based on analytical approximations or simulation methods. Finally, we note that the simplicity of the analyses met so far rests on the completeness of the data. If subjects drop out of the trial, then some singleton measurements will be left. We may choose to leave these measurements out of the analysis and use the same simple procedures with
the completers. In terms of precision, if the within-subject correlation is high, then these singleton measurements contribute very little to the direct treatment effect, and little precision is lost through their omission. However, if we do wish to incorporate them in the analysis, then methods are needed that can combine within- and between-subject information, and suitable tools for this are met below when general models for higher order designs are discussed.
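Returning to the binary-outcome analysis described above, a minimal R sketch of the Mainland-Gart-type test on the preference table (the counts here are purely hypothetical):

    # Hypothetical preference counts: rows = sequence (AB, BA), columns = (0,1) vs (1,0)
    pref <- matrix(c(5, 12,
                     11, 4),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(sequence = c("AB", "BA"),
                                   outcome  = c("(0,1)", "(1,0)")))

    chisq.test(pref, correct = FALSE)   # Pearson chi-squared test for association
    fisher.test(pref)                   # small-sample exact alternative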
3 HIGHER ORDER DESIGNS
It has been recognized that it is best to assume that any carryover effect is negligible when using the 2 × 2 crossover design. In most settings, the design provides insufficient information on such a carryover to allow either adjustment for it or elimination of it. Several higher order designs for two treatments have been proposed that circumvent this problem, but only at the expense of strong assumptions that concern the nature of the carryover effect. These designs have extra sequences and/or extra periods. For example, extending the design to three periods means that the following three alternative designs with two sequences may be used: ABB/BAA, ABA/BAB, or AAB/BBA. In the absence of carryover, these designs are effectively equivalent for estimating the direct treatment effect. However, more importantly, they also allow the treatment effect to be estimated using within-subject information after adjustment for a particular form of simple carryover effect, which is something that is impossible with the 2 × 2 design. In these circumstances, the three designs behave rather differently in terms of precision. It turns out that the first design, ABB/BAA, is the most efficient among the three, and it provides an estimate of the adjusted treatment effect of useful precision. At first sight, this design seems to provide a solution to the main problem associated with the 2 × 2 design. However, this advantage depends on strong and potentially unrealistic assumptions about the form of the carryover, namely 1) that the carryover effect is the same whether carryover is occurring into A or into B and 2) that it affects the immediately following period
only. In general, we call carryover effects that satisfy these constraints "simple first-order." This example demonstrates a general problem with the design of higher order crossover designs: Many arrangements are possible, and some can be shown to have particularly good mathematical properties under specific modeling assumptions (especially concerning carryover). However, these assumptions are themselves usually not open to sufficiently precise assessment from the data generated by the design, so the same main issue surrounding the 2 × 2 design persists, albeit in a more-complicated and less-accessible form. For this reason, great care is needed in interpreting the term "optimal design" in this setting. The higher order two-treatment designs can be extended in several ways with more sequences and periods, which allows additional effects to be estimated, such as treatment-by-period interaction, second-order carryover, and carryover-by-treatment interaction. They are not widely used in practice, however, and the main role of higher order two-treatment designs is in the form of three-period designs in bioequivalence trials (Reference 2, Chapter 7). For a design with more than 2 treatments (t, say) it is usual, where practical, for each subject to receive all treatments, which implies that the number of periods (p, say) is equal to t if each subject receives each treatment once. In addition, because adjustment for period effects is the norm, it is important for reasons of efficiency to balance treatments within periods. A design in which this balance is exact is called a Latin square design. In such a design, every subject receives each treatment, and each treatment occurs equally often in each period. This example demonstrates four sequences, four treatments (A, B, C, and D), and four periods:
Sequence   Period 1   Period 2   Period 3   Period 4
1          A          B          C          D
2          D          A          B          C
3          C          D          A          B
4          B          C          D          A
A design based on this scheme would have a multiple of four subjects in total, with one quarter randomly assigned to each sequence. Provided carryover is not a concern, any such square would be equally efficient for estimation of the direct treatment effects, and similar arrangements can be constructed with any number of treatments. If adjustment is to be made for carryover, then this design is very poor. It can be observed that each treatment is preceded by the same treatment in the three sequences in which it does not occur first, which means that treatment and carryover effects are highly nonorthogonal, and adjustment causes a severe loss in precision. This issue can be largely resolved by choosing a special Latin square arrangement in which every treatment is preceded equally often by each other treatment. For example, in the four-treatment case we might consider the following design:
Sequence   Period 1   Period 2   Period 3   Period 4
1          A          B          C          D
2          B          D          A          C
3          C          A          D          B
4          D          C          B          A
Latin square designs with such balance are called Williams square designs (6). In fact, they are only square when t is even; for t odd, a Williams design has 2t sequences. Examples of Williams designs for a range of values of t are given in Reference 2, Chapter 4. Even if adjustment for carryover is not to be used in the analysis, these designs have the advantage of a degree of robustness to carryover: an unadjusted analysis will produce estimates of the direct treatment effects that are less biased in the presence of carryover than those from the earlier Latin square design. However, as above, these remarks refer strictly to a specific form of carryover: simple first-order. Many alternative forms are conceivable, and a design that is chosen with one type in mind may be poor for another. Not all higher order crossover designs are intended, as the Williams squares are, for an undifferentiated set of treatments. Particular designs exist for comparisons with a control and for factorial treatment structures. A special type is used when the treatments are
increasing doses of a drug, and for safety reasons a higher dose cannot be experienced before a lower one. Details of these and other special designs can be found in Reference 2, Chapter 4. It is a good choice, where possible, to have each subject experience each treatment in a crossover trial, as in a Williams square. Such designs are sometimes called complete, by analogy with a complete-block design. However, this is sometimes not possible, often because of practical constraints on the length of the trial, and then so-called incomplete designs are required, for which p < t. Each subject receives only a subset of the treatments. An example with t = 3 and p = 2 is as follows:
Sequence   Period 1   Period 2
1          A          B
2          B          A
3          A          C
4          C          A
5          B          C
6          C          B
To balance out the allocation of treatments, it is usual for such designs to have more sequences than treatments. These arrangements induce considerable nonorthogonality between subjects and treatments, and consequently, adjustment for subject effects has a detrimental effect on precision. Such designs are inefficient, which implies, among other things, that potentially important information on the treatment effects is contained in the subject sums, which is not the case with complete designs. To recover such information, special methods of analysis are needed, which brings us to consideration of general analysis tools for crossover designs.
4 MODEL-BASED ANALYSES
The simple two-sample analyses that were used above with complete data from the 2 × 2 design need generalization for other designs and for incomplete data. Because the data from crossover trials are examples of repeated-measures data, we can draw on the
body of analysis tools available for this setting (7,8). Accordingly, with continuous data, linear regression models for dependent data can be used. For discrete data, the appropriate analogs of these are used. In many settings with continuous data, we can in fact use very simple, ordinary, regression models by analogy with randomized-block analyses. With crossover trials, one normally expects fairly stable responses over time, and therefore it is often sufficient to capture the within-subject dependence through the introduction into a linear model of a simple subject effect. This effect can be treated as fixed or random. To understand the implications of choosing one or other of these, we need to consider the source of the information on treatment effects in the given crossover design. It is natural and convenient to divide this information into two independent components: the between- and within-subject information. The between-subject information is that provided by the sums (or means) of all the observations from each subject. By contrast, the within-subject information is that contained among all differences of observations from each subject. The information available from these two sources depends on both the design and the model fitted. It is also affected by the size of the within-subject correlation: as the size increases, the variability of the between-subject sums grows. In terms of models, in many settings, a very simple, or minimal, form will suffice, which will include, in addition to the subject effects, categorical effects for period and treatment. On occasion, it will be augmented by carryover effects and, more rarely, other terms. In a fully balanced Latin square design, with the minimal model, all information on treatments is contained within subjects. When some form of imbalance is introduced (for example, because of dropout, the introduction of additional model terms, or the use of an incomplete design), then some information on treatment effects will move to the between-subject stratum. We are now in a position to consider the choice of fixed- or random-subject effects. The use of fixed-subject effects, that is, treating the subject effects as any other linear model parameter, implies that all between-subject information is discarded. However,
the resulting model is an example of a standard linear regression model, and all the familiar regression procedures are available, including exact small-sample F- and t-based inferences. Moreover, in such analyses, no assumption needs to be made about the distribution of the subject effects among subjects, implying an additional degree of robustness in these analyses. They can be viewed as crossover versions of conventional randomized block analyses in which subjects are blocks. Many advantages exist to keeping the crossover analysis within this framework, not least the simplicity. We should regard this as the approach of choice provided it is appropriate, that is, when there is negligible between-subject information. In contrast, the use of random-subject effects allows the recovery of the between-subject information, but it moves the analysis framework to that of the linear mixed model. This model is less robust than the fixed-effects model because of the need to assume some distribution for the subject effects. In addition, small-sample inferences must rely on approximate methods (9), whose properties become less reliable in very small samples. Hence, this approach should be used only when strictly necessary: when nonnegligible between-subject information is available for recovery and the sample size is sufficiently large. Considerable nonorthogonality is needed to make this worthwhile; examples of such settings are inefficient designs (e.g., with p < t) and trials with a high proportion of withdrawals. For the reasons given above, this will not be the case in the typical efficient, complete, crossover design with high within-subject correlation. Hence, careful consideration needs to be given to the data, design, and other components of the analysis model before choosing the random-subject effects approach. In certain settings, this approach is appropriate, but these settings tend in practice to be the minority. Another point is that if baselines or covariates measured at the beginning of each treatment period are introduced into the analysis (within-subject covariates), then the use of random, as opposed to fixed, subject effects will lead to bias in the estimated treatment comparisons.
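To make the distinction concrete, the following sketch (not part of the original text; the simulated data, column names, and use of pandas/statsmodels are assumptions for illustration) fits both a fixed-subject-effects model and a random-subject-effects linear mixed model to a small, complete crossover data set in long format.

```python
# Minimal sketch: fixed- vs random-subject-effects analyses of crossover data.
# Data are simulated purely for illustration; in a balanced, complete design
# the two fits give essentially the same treatment estimates.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
treatments = ["A", "B", "C", "D"]
rows = []
for subj in range(12):
    u = rng.normal(0, 2)                    # subject effect
    seq = np.roll(treatments, subj % 4)     # crude rotation of ABCD (illustrative only)
    for period, treat in enumerate(seq):
        y = u + 0.3 * period + (1.0 if treat == "A" else 0.0) + rng.normal(0, 1)
        rows.append({"subject": subj, "period": period, "treat": treat, "y": y})
df = pd.DataFrame(rows)

# Fixed subject effects: an ordinary randomized-block style regression that
# uses only within-subject information on the treatment contrasts.
fixed = smf.ols("y ~ C(subject) + C(period) + C(treat)", data=df).fit()
print(fixed.params.filter(like="treat"))

# Random subject effects: a linear mixed model, which can also recover
# between-subject information when the design is unbalanced or incomplete.
mixed = smf.mixedlm("y ~ C(period) + C(treat)", data=df, groups=df["subject"]).fit()
print(mixed.params.filter(like="treat"))
```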
Similar issues apply when considering analyses for discrete data, with some important distinctions. First, asymptotic arguments become the norm with the discrete-data analogs of these analyses. This implies, among other things, that the fixed-subject model cannot be used in the same way as above, by simply estimating an effect for each subject. Instead, so-called conditional likelihood analyses are the proper analogs of these and are available only for certain models: namely, logistic regression for a binary outcome (Reference 2, Section 6.2), multivariate logistic regression for a nominal outcome (Reference 2, Section 6.4), and log-linear regression for a Poisson (count) outcome (Reference 2, Section 6.5.1). No conditional likelihood analog exists, for example, for a proportional odds model for an ordinal outcome. Random-effects models can, however, be used with all such classes. A second important distinction concerns the variety of classes of model available for non-normal outcomes, in contrast to the almost automatic use of the linear regression basis of models for continuous data. This distinction is largely a consequence of the nonlinearity in the link between the model effects, such as treatment and period, and the mean outcome that is necessary in such non-normal settings. The distinction between two important classes of models, the so-called marginal and subject-specific, is especially relevant in the repeated measures setting and needs to be fully appreciated when analyzing discrete crossover data. These classes are discussed in the crossover setting in Reference 2, Chapter 6 and in a more general repeated measures setting in Reference 8. A brief consideration of other types of outcome, such as event times, is given in Reference 2, Section 6.5.
REFERENCES
1. S. J. Senn, Cross-over Trials in Clinical Research, 2nd ed. New York: Wiley, 2002. 2. B. Jones and M. G. Kenward, Design and Analysis of Cross-over Trials, 2nd ed. London: Chapman & Hall/CRC, 2003. 3. J. E. Grizzle, The two-period change-over design and its use in clinical trials. Biometrics 1965; 21: 467–480.
4. P. Freeman, The performance of the two-stage analysis of two-treatment, two-period crossover trials. Stat. Med. 1989; 8: 1421–1432. 5. M. G. Kenward and B. Jones, The analysis of data from 2 × 2 cross-over trials with baseline measurements. Stat. Med. 1987; 6: 911–926. 6. E. J. Williams, Experimental designs balanced for the estimation of residual effects of treatments. Aust. J. Scientif. Res. 1949; 2: 149–168. 7. G. Verbeke and G. Molenberghs, Linear Mixed Models for Longitudinal Data. New York: Springer, 2000. 8. G. Molenberghs and G. Verbeke, Models for Discrete Longitudinal Data. New York: Springer, 2005. 9. M. G. Kenward and J. H. Roger, Small sample inference for fixed effects estimators from restricted maximum likelihood. Biometrics 1997; 53: 983–997.
CROSSOVER TRIALS
The aim of medical research is to develop improved treatments or cures for diseases and medical ailments. Part of that research involves comparing the effects of alternative treatments with a view to recommending those that should be used in practice. The treatments are compared using properly controlled randomized clinical trials∗ . In such trials the treatments are given either to healthy volunteers (in the early phase of development) or to patients (in the later phases of development). We will refer to patients, volunteers, or whoever is being compared in the trial as the subjects. Two types of design are used in these trials: the parallel-group design and the crossover design. In order to explain these we will consider trials for comparing two treatments A and B. The latter might be different ingredients in an inhaler used to treat asthma attacks or two drugs used to relieve the pain of arthritis, for example. In a parallel-group trial the subjects are randomly divided into two groups of equal size. Everyone in the first group gets A, and everyone in the second group gets B. The difference between the treatments is usually estimated by the difference between the group means. In a crossover trial the subjects are also randomly divided into two groups of equal size. (In agriculture and dairy science, crossover trials often are referred to as changeover trials.) Now, however, each subject gets both treatments for an equal period of time. In the first group the subjects begin by getting A for the first period and then cross over to B for the second period. Each subject in the second group begins with B in the first period and then crosses over to A for the second period. The basic plan of this design is given in Table 1. This type of crossover trial uses two treatment sequences AB and BA and is usually referred to as the 2 × 2 trial. The main advantage the crossover trial has over the parallel-group trial is that the two treatments are compared within subjects as opposed to between subjects. That is, the
Table 1. Plan of 2 × 2 Trial
Group   Period 1   Period 2
1       A          B
2       B          A
2 × 2 trial provides two repeated measurements on each subject, and the difference between these is used to estimate the difference between A and B. In this way each subject ‘‘acts as his or her own control’’ and any variability between subjects is eliminated. As the variability within subjects is usually much smaller than that between subjects, a relatively precise estimate of the treatment difference is obtained. In contrast, the treatment difference in the parallel-group trial is estimated by taking differences of measurements taken on different subjects, and so is based on between-subject variability. As a consequence the crossover trial requires far fewer subjects than a parallel-group trial to achieve equivalent power to detect a particular size of treatment difference. A detailed comparison of the two types of design is given by Grieve [18]. If yij is the mean for period j in group i, then for the 2 × 2 trial the within-subjects estimator of the A − B treatment difference is D = ½[(y11 − y12) − (y21 − y22)]. Obviously, crossover trials are not suitable for treatments that effect a cure. A basic assumption is that subjects will be in the same state at the start of the second period as they were in at the start of the first period. Therefore, it is essential to ensure that the effect of the first treatment is not present at the start of the second period. One way of achieving this is to separate the two active periods by a washout period of sufficient length to ensure that the effects of the first treatment have disappeared by the start of the second period. Any effect of previous treatment allocation that affects the second period is a carryover effect. If τA and τB denote the effects of treatments A and B, respectively, and λA and λB denote the carryover effects of treatments A and B, then in the presence of unequal carryover effects the expected value of D, the within-subjects estimator of the treatment
difference, is (τA − τB) − ½(λA − λB), i.e., D is biased. If the carryover difference is of the same sign as the treatment difference, then D underestimates the true treatment difference. Therefore, if a significant treatment difference is detected, it is still appropriate to conclude that the treatments are different, because the trial has detected an even smaller difference between the treatments than anticipated at the planning stage. In the basic 2 × 2 trial it is not possible to estimate the difference between the carryover effects using within-subject information. This is because the estimate is based on differences of subject totals. However, for designs with more than two periods or treatment sequences it is possible to estimate the carryover difference using within-subject information. Also, in the basic 2 × 2 trial the carryover difference is completely confounded with the group difference and the treatment-by-period interaction. Both the parallel-group trial and the crossover trial are usually preceded by a run-in period, when subjects are acclimatized and, perhaps, are monitored for eligibility. Response measurements are usually taken during the run-in period of the parallel-group trial and during the run-in and washout periods of the 2 × 2 trial. The run-in and washout measurements can be used to estimate the carryover difference in the 2 × 2 trial, and the run-in measurements can be used to improve the precision of the parallel-group trial. However, even when run-in measurements are used, the parallel-group trial still falls short of the crossover trial as far as precision of estimation is concerned and so needs more subjects than the crossover trial to achieve comparable power. In the general case of t treatments, a crossover trial consists of randomly dividing the subjects into s groups and assigning a different sequence of treatments to each group. The sequences are of length p, corresponding to the p periods of the trial, and some of the sequences must include at least one change of treatment. The choice of sequences depends on the number of treatments and periods and on the purposes of the trial. Examples of designs for t = 3 and t = 4 treatments are given later.
2 × 2 CROSSOVER TRIAL
In the absence of run-in and washout measurements, a standard analysis for this design follows the two-stage approach of Grizzle [21] and Hills and Armitage [24]. In the first stage, a test, based on the subject totals, is done to determine if the carryover effects are equal. If they are not significantly different (usually at the 10% level), then in the second stage the within-subjects test for a treatment difference, based on D, is done (usually at the 5% level). If the first-stage test is significant, then the test for a treatment difference uses only the data collected in the first period of the trial, i.e., is based on y11 − y21. This analysis has been criticized, particularly because the actual significance level can be much higher than the nominal level of 5% [14,46]. However, some improvement in the performance of this two-stage procedure is possible if measurements are available from the run-in and washout periods [25,28]. The best advice is to base the analysis on the assumption that carryover effects are either absent or equal for both treatments and to proceed directly to the test for a treatment difference that uses within-subject comparisons. This assumption would need to rely on prior knowledge or washout periods of adequate length. As noted above, the within-subjects estimate of the treatment difference is biased downwards if carryover effects are different, but the extent of the bias will not be great unless the carryover difference is large. The detailed views of a working group of the Biopharmaceutical Section of the American Statistical Association∗ are given in Peace [42, Chap. 3]. Their view is that the 2 × 2 design is not the design of choice if carryover effects exist. A Bayesian approach to the analysis is described in Grieve [16,19,20], and further discussion on analysis can be found in Jones and Kenward [26, §2.13] and Senn [46, Chap. 3]. Overall, if significant carryover effects are likely to occur, then the 2 × 2 design is best avoided if possible. A number of better designs for two treatments are mentioned below.
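As a concrete illustration of this analysis (a sketch with simulated data, not taken from the article), the treatment test can be computed as a two-sample t-test on the within-subject period differences, and the first-stage carryover test as a two-sample t-test on the subject totals:

```python
# Sketch of the standard 2 x 2 crossover analysis on simulated data.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n = 12                                   # subjects per sequence group
subj = rng.normal(0, 2, size=2 * n)      # between-subject variability
tau = 1.0                                # true A - B difference

# Group 1: A then B; group 2: B then A (no period or carryover effects here)
y11 = subj[:n] + tau / 2 + rng.normal(0, 1, n)    # group 1, period 1 (A)
y12 = subj[:n] - tau / 2 + rng.normal(0, 1, n)    # group 1, period 2 (B)
y21 = subj[n:] - tau / 2 + rng.normal(0, 1, n)    # group 2, period 1 (B)
y22 = subj[n:] + tau / 2 + rng.normal(0, 1, n)    # group 2, period 2 (A)

# Within-subject treatment test: compare the period differences between groups
d1, d2 = y11 - y12, y21 - y22
t_treat, p_treat = ttest_ind(d1, d2)
D = 0.5 * (d1.mean() - d2.mean())                 # estimator of A - B
print("estimate of A - B:", round(D, 3), " p-value:", round(p_treat, 4))

# First-stage carryover test (Grizzle): compare subject totals between groups
t_carry, p_carry = ttest_ind(y11 + y12, y21 + y22)
print("carryover test p-value:", round(p_carry, 4))
```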
HIGHER-ORDER DESIGNS FOR TWO TREATMENTS
The disadvantages of the 2 × 2 design can be overcome if more periods or sequences are used, given certain assumptions about the behavior of the carryover effects. The two-period design with four sequences AA, BB, AB, and BA enables the treatment difference to be estimated within subjects even if carryover effects are present and different. However, in common with all two-period designs, the estimate of the treatment difference is inefficient, and it is better to use at least three periods. For three periods the recommended design has two sequences ABB and BAA, and for four periods it has four sequences AABB, BBAA, ABBA, and BAAB.
Table 2. Balanced Design for t = 3, p = 3
Sequence No.   Treatment Sequence
1              ABC
2              ACB
3              BAC
4              BCA
5              CAB
6              CBA

Table 3. Balanced Design for t = 4, p = 4
Sequence No.   Treatment Sequence
1              ABCD
2              BDAC
3              CADB
4              DCBA
DESIGNS FOR THREE OR MORE TREATMENTS
There is a great variety of designs for three or more treatments. The choice of design will be determined by the purpose of the trial, the number of permissible periods or sequences, and other practical constraints. An important feature of these designs is that carryover differences can be estimated using within-subject information. In a variance-balanced design the variance of the estimated difference between any two treatments, allowing for subjects, periods, and any carryover effects, is the same whichever pair of treatments is considered. Plans of such designs for t = 3 and t = 4 treatments, each with t periods, are given in Tables 2 and 3. For properties of balanced and nearly balanced designs see CHANGEOVER DESIGNS [41]. A general introduction, including a review, tables of designs, discussion of optimality, and choice of design, is given in Jones and Kenward [26, Chap. 5]. More recent reviews are given in Afsarinejad [1] and Matthews [37]. Study of the optimality of crossover designs has generally concentrated on a model that has fixed effects for the subject, period, treatment, and carryover effects and independent within-subject errors that have constant variance. It is mostly assumed that if carryover effects can occur they are only of first order, i.e., last for only one period. Most results in the literature refer to universal optimality [27]; a useful review is given in Matthews [37].
Closely linked to optimality are the concepts of uniformity and balance. A design is uniform if each treatment occurs equally often in each period and each treatment is allocated equally often to each subject. A design is (combinatorially) balanced if every treatment follows every other treatment equally often. Balanced uniform designs are universally optimal for the estimation of treatment and carryover effects [22,23,6]. When p = t, balanced uniform designs are the Williams designs (Williams [50]) and can be constructed easily using the algorithm given by Sheehe and Bross [48]. Examples of these designs are given in Tables 2 and 3 for t = 3 and t = 4, respectively. If every treatment follows every other treatment, including itself, equally often, the design is strongly balanced, and strongly balanced uniform designs are universally optimal [6]. A simple way of generating a strongly balanced design is to repeat the last period of a Williams design to give a design with p = t + 1. These designs are variance-balanced and have the additional property that the treatment and carryover effects are orthogonal. For those combinations of t, s, and p for which a variance-balanced design does not exist, it may be possible to construct a partially balanced design∗ . In such a design the variances of the estimated treatment differences are not all the same but do not vary
much. This makes them attractive in practice, as they usually require fewer periods or sequences than a fully balanced design. Such designs are tabulated in Jones and Kenward [26] and in Ratkowsky et al. [45]. Another potentially important group of designs is where the treatments are made up of factorial combinations of two or more ingredients. For example, the four treatments A, B, C, and D might correspond to all possible combinations of two ingredients X and Y, where each ingredient can occur either at a high or a low level. Here designs that are efficient for estimating the main effects∗ or the interaction∗ of X and Y can be constructed [13].
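The following sketch (an assumed implementation of the standard cyclic construction, not a transcription of the Sheehe and Bross algorithm) builds a Williams-type design for t treatments and checks the balance property that every treatment is preceded equally often by every other treatment. For t = 3 it reproduces the set of sequences in Table 2; for t = 4 it produces a balanced square, although not necessarily the particular one shown in Table 3.

```python
# Sketch: cyclic construction of a Williams-type design for t treatments.
from collections import Counter

def williams_design(t):
    # Base sequence 1, t, 2, t-1, 3, ... ; then all t cyclic shifts.
    base, lo, hi = [], 1, t
    for k in range(t):
        base.append(lo if k % 2 == 0 else hi)
        lo, hi = (lo + 1, hi) if k % 2 == 0 else (lo, hi - 1)
    design = [[(x + s - 1) % t + 1 for x in base] for s in range(t)]
    if t % 2 == 1:                      # odd t needs 2t sequences
        design += [list(reversed(row)) for row in design]
    return design

def predecessor_counts(design):
    # How often each ordered pair (previous treatment, current treatment) occurs
    return Counter((row[j - 1], row[j]) for row in design for j in range(1, len(row)))

labels = "ABCDEFGH"
for t in (3, 4):
    design = williams_design(t)
    print(["".join(labels[i - 1] for i in row) for row in design])
    print(predecessor_counts(design))   # every ordered pair appears equally often
```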
ANALYSIS OF CONTINUOUS DATA
Crossover data are examples of repeated measurements∗; that is, they consist of a set of short sequences of measurements. The observations from one subject will typically be correlated, which needs to be accommodated in the analysis. Continuous crossover data are most commonly analyzed using a conventional factorial linear model and analysis of variance∗. The model will almost invariably include terms for period and treatment effects. Other terms may be included as required, such as first- and higher-order carryover effects, treatment-by-period interaction, and treatment-by-carryover interactions, although for some designs there may be aliasing among these, and the inclusion of more than a small number of such terms can seriously reduce the efficiency of the analysis. For all terms except carryover, the definition of appropriate factor levels is straightforward. Construction of factors for the latter is not obvious, because there are observations for which these effects cannot occur, for example those in the first period. One simple solution for this is to deliberately alias part of the carryover effect with period effects. For example, for a first-order carryover factor, levels follow treatment allocation in the preceding period, except for observations in the first period, when any factor level can be used provided it is the same in all sequences. After adjustment for the period term, this factor
gives the correct sums of squares and degrees of freedom for the first-order carryover. Within-subject dependence is normally allowed for by the inclusion of fixed subject effects in the linear model. Essentially, a randomized block analysis is used with subjects as blocks. In the case of the 2 × 2 trial, this analysis reduces to a pair of t-tests, each comparing the two sequence groups (Hills and Armitage [24]). For the treatment effect, the comparison is of the within-subject differences, and for the carryover–treatment-by-period interaction, it is of the subject totals. Baseline measurements contribute little to the efficiency of the direct treatment comparison, but may substantially increase that of the carryover. Analyses for higher-order two-treatment two-sequence designs can be expressed in terms of t-tests in a similar way. For designs in which treatment effects are not orthogonal to subjects (for example, when t > p, or generally when a carryover term is included), there exists some treatment information in the between-subject stratum; this is lost when fixed subject effects are used. It has been suggested that this between-subject (interblock) information should be recovered through the use of random subject effects. Restricted maximum likelihood∗ (REML) is an appropriate tool for this. However, small, well-designed crossover trials are not ideal for the recovery of interblock information∗: most of the treatment information lies in the within-subject stratum, between-subject variability will typically be high, and the reliance on asymptotic estimates of precision means that standard errors of effects can be seriously underestimated. The use of random subject effects implies a simple uniform covariance structure for the sequence of observations from a subject. A more general structure can be used, for example, to allow for a serial correlation pattern, but such analyses are not widely used, and there is little evidence as yet to suggest that these models are needed for routine crossover data. Such modeling may be important, however, when there are repeated measurements within treatment periods, and the multivariate linear model provides an appropriate framework for this setting.
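As a small illustration (a sketch with assumed column names, not code from the article), the carryover factor described above can be built in a long-format data set by lagging the treatment within each subject and assigning a single arbitrary level to all first-period observations:

```python
# Sketch: constructing a first-order carryover factor for crossover data held
# in a long-format pandas data frame with columns subject, period, and treat.
import pandas as pd

def add_carryover_factor(df):
    df = df.sort_values(["subject", "period"]).copy()
    # Level = treatment given in the preceding period; the constant level
    # "none" for all first-period rows deliberately aliases part of the
    # carryover effect with the period effects, as described in the text.
    df["carry"] = df.groupby("subject")["treat"].shift(1).fillna("none")
    return df

# A model such as  y ~ C(subject) + C(period) + C(treat) + C(carry)  then gives
# the correct sums of squares and degrees of freedom for first-order carryover.
```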
Nonparametric methods of analysis for crossover data are not well developed, apart from the two-treatment two-sequence designs in which the t-tests can be replaced by their nonparametric analogues. In other designs the occurrence of different treatment sequences precludes a straightforward application of an orthodox multivariate rank test, while the use of the rank transformation followed by a conventional analysis of variance is best avoided [8]. Senn [46] develops an ad hoc test based on a combination of two-sequence tests, and a general review is given in Tudor and Koch [49].
ANALYSIS OF DISCRETE DATA
We consider a binary response first, coded 0/1. As with continuous data, the 2 × 2 trial forms a simple special case. Each subject then provides one of four categories of joint response: (0,0), (0,1), (1,0), and (1,1). Given ni subjects on sequence i, the data from such a trial can be summarized in a 2 × 4 contingency table∗ as in Table 4. Two tests for treatment effect (assuming no carryover effect) are based on the
Table 4. Binary Data from a 2 × 2 Trial
               Joint Response
Sequence   (0,0)   (0,1)   (1,0)   (1,1)
AB         n11     n12     n13     n14
BA         n21     n22     n23     n24
Table 5. Contingency Table for Mainland–Gart Test
Sequence   (0,1)   (1,0)
AB         n12     n13
BA         n22     n23
Table 6. Contingency Table for Prescott’s Test
Sequence   (0,1)   (0,0) or (1,1)   (1,0)
AB         n12     n11 + n14        n13
BA         n22     n21 + n24        n23
entries in this table. The Mainland–Gart test is the test for association in the 2 × 2 contingency table [35,15] as given in Table 5. This involves the data from only those subjects who make a preference. Prescott’s test [43] introduces the pooled nonpreference data, and it is the test for linear trend in the 2 × 3 table given as Table 6. The test for carryover–treatment-by-period interaction is the test for association in the table involving only the nonpreference outcomes [2] as given in Table 7. Conventional chi-square, likelihood-ratio, or conditional exact tests can be used with these tables. These tests are quick and simple to use. Unfortunately, they do not generalize satisfactorily for higher-order designs and ordinal responses, and they are awkward to use when there are missing data. Recent developments in modeling dependent discrete data have made available a number of more flexible model-based approaches that are applicable to crossover data. Generalized estimating equation (GEE) methods can be used to fit marginal models to binary data from any crossover design, whether data are complete or not [52,51,30,31], and have been extended for use with ordinal data [32,31]. A marginal, or population-averaged, model defines the outcome probabilities for any notional individual in the population under consideration for the given covariate values (treatment, period, and so on). It is marginal with respect to the other periods and, provided a crossover trial is used to draw conclusions about constant as opposed to changing treatment conditions, can be regarded as the appropriate model from which to express conclusions of most direct clinical relevance. The simpler forms of GEE (GEE1) are comparatively straightforward to use, but may provide poor estimates of precision in small trials. Extended GEE methods (GEE2) are more complicated, but give better estimates of precision. The full likelihood for a marginal model cannot be expressed in closed form for p > 2, so likelihood-based analyses require considerably more elaborate computation (e.g., [32,3]). In contrast to marginal models, subject-specific models include subject effect(s) that
determine an individual’s underlying outcome probabilities. Other effects, such as period and direct treatment, modify these subject-specific probabilities, and generally these effects will not have the same interpretation as their marginal-model analogues. Marginal representations of probabilities can be obtained from subject-specific models by taking expectations over the distribution of the subject effects, but only in special cases will the treatment–covariate structure of the subject-specific model be preserved. In analyses using subject-specific models, subject effects cannot be treated as ordinary parameters as with continuous data, because the number of these effects increases at the same rate as the number of subjects. This implies that estimates of other effects will be inconsistent, a generalization of the well-known result for matched case-control studies. Two alternative approaches can be used: conditional likelihood and random subject effects. If, for binary data∗, a logistic regression∗ model is used, or, for categorical data∗, a model based on generalized (or adjacent-category) logits, then a conditional likelihood analysis can be used in which the subject effects are removed through conditioning on appropriate sufficient statistics [29]. In the binary case these statistics are the subject totals; in the categorical case, the subject joint outcomes ignoring the order. The application of this approach to binary data from the two-period two-treatment design produces the Mainland–Gart test. The conditional likelihood can be calculated directly, or the whole analysis can be formulated as a log-linear analysis for a contingency table∗ of the form of the 2 × 4 table above, with appropriate extension for other designs and for categorical outcomes. One advantage of this approach is the availability of conditional exact tests when sample sizes are very small.
Table 7. Contingency Table for Hills and Armitage’s Test
Sequence   (0,0)   (1,1)
AB         n11     n14
BA         n21     n24
The two main disadvantages are (1) the discarding of between-subject information in the process of conditioning, which precludes a population-averaged interpretation of the results, and (2) the use of generalized logits, which are not ideal for ordinal categorical outcomes. If the subject effects are assumed to follow some distribution, typically the normal, then the likelihood for the model can be obtained through numerical integration∗ [11,12]. In general such analyses are computationally intensive, but are usually manageable for crossover trials, for which sample sizes are commonly small. The inferences from such models are subject-specific, but population-averaged summary statistics, for example marginal probabilities, can be produced using integration. Numerical integration can be avoided through the use of an approximate or hierarchical likelihood in place of the full marginal likelihood [4,34]. However, the consistency of such procedures is not guaranteed for all sample configurations, and the small-sample properties of the resulting analyses for crossover data have not yet been explored.
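To illustrate the simple binary-data tests described above (a sketch with made-up counts; the Prescott test is computed here as a Cochran–Armitage-type trend statistic, one common way of carrying it out), the calculations based on Tables 5 and 6 might look as follows:

```python
# Sketch: Mainland-Gart and Prescott tests for binary 2 x 2 crossover data,
# using the layout of Tables 4-6.  The counts are illustrative only.
import numpy as np
from scipy.stats import fisher_exact, chi2

# n[i, j]: sequence i (0 = AB, 1 = BA), joint response j in the order
# (0,0), (0,1), (1,0), (1,1)
n = np.array([[10, 4, 12, 9],
              [11, 9,  5, 8]])

# Mainland-Gart: association in the 2 x 2 table of "preference" outcomes
odds_ratio, p_mg = fisher_exact(n[:, [1, 2]])
print("Mainland-Gart exact p-value:", round(p_mg, 4))

# Prescott: linear trend in the 2 x 3 table (0,1) | (0,0) or (1,1) | (1,0)
tab = np.column_stack([n[:, 1], n[:, 0] + n[:, 3], n[:, 2]])
scores = np.array([-1.0, 0.0, 1.0])
N, row, col = tab.sum(), tab.sum(axis=1), tab.sum(axis=0)
T = (scores * tab[0]).sum() - row[0] * (scores * col).sum() / N
var_T = (row[0] * row[1] / (N - 1)) * (
    (scores ** 2 * col).sum() / N - ((scores * col).sum() / N) ** 2)
stat = T ** 2 / var_T
print("Prescott trend chi-square:", round(stat, 3),
      " p-value:", round(chi2.sf(stat, df=1), 4))
```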
CONCLUDING REMARKS
There is a large and diverse literature on the statistical aspects of crossover trials, which reflects their extensive use in medical research. There are at present three books on the subject (Jones and Kenward [26]; Senn [46]; Ratkowsky et al. [45]) and several reviews. The literature is scattered over numerous journals and conference proceedings, e.g., [5,7,10,33,36]. A particularly useful review is given in Statist. Methods Med. Res., 3, No. 4 (1994). In addition to medicine, crossover trials are used in areas such as psychology [39], agriculture [40], and dairy science. An industrial example is given by Raghavarao [44].
REFERENCES
1. Afsarinejad, K. (1990). Repeated measurements designs—a review. Commun. Statist. Theory and Methods, 19, 3985–4028. 2. Armitage, P. and Hills, M. (1982). The two-period cross-over trial. Statistician, 31, 119–131. 3. Balagtas, C. C., Becker, M. P., and Lang, J. B. (1995). Marginal modelling of categorical data from crossover experiments. Appl. Statist., 44, 63–77. 4. Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear models. J. Amer. Statist. Ass., 88, 9–24. 5. Carriere, K. C. and Reinsel, G. C. (1992). Investigation of dual-balanced crossover designs for two treatments. Biometrics, 48, 1157–1164. 6. Cheng, C.-S. and Wu, C.-F. (1980). Balanced repeated measurements designs. Ann. Statist., 8, 1272–1283. Correction (1983), 11, 349. 7. Chi, E. M. (1992). Analysis of cross-over trials when within-subject errors follow an AR(1) process. Biometrical J., 34, 359–365. 8. Clayton, D. and Hills, M. (1987). A two-period cross-over trial. In The Statistical Consultant in Action, D. J. Hand and B. S. Everitt, eds. Cambridge University Press. 9. Cochran, W. G., Autrey, K. M., and Cannon, C. Y. (1941). A double change-over design for dairy cattle feeding experiments. J. Dairy Sci., 24, 937–951. 10. Cornell, R. G. (1991). Non-parametric tests of dispersion for the two-period crossover design. Commun. Statist. Theory Methods, 20, 1099–1106. 11. Anon. (1985–1990). EGRET: Epidemiological, Graphics, Estimation and Testing Package. Statistics and Epidemiology Research Corp., Seattle. 12. Ezzet, F. and Whitehead, J. (1991). A random effects model for ordinal responses from a cross-over trial. Statist. Med., 10, 901–907. 13. Fletcher, D. J., Lewis, S. M., and Matthews, J. N. S. (1990). Factorial designs for crossover clinical trials. Statist. Med., 9, 1121–1129. 14. Freeman, P. R. (1989). The performance of the two-stage analysis of two-treatment, two-period cross-over trials. Statist. Med., 8, 1421–1432. 15. Gart, J. J. (1969). An exact test for comparing matched proportions in crossover designs. Biometrika, 56, 75–80.
7
16. Grieve, A. P. (1985). A Bayesian analysis of the two-period cross-over trial. Biometrics, 41, 979–990. Correction (1986), 42, 456. 17. Grieve, A. P. (1987). A note on the analysis of the two-period crossover design when period–treatment interaction is significant. Biometric J., 29, 771–775. 18. Grieve, A. P. (1990). Crossover vs parallel designs. In Statistics in Pharmaceutical Research, D. A. Berry, ed. Marcel Dekker, New York. 19. Grieve, A. P. (1994). Extending a Bayesian analysis of the two-period crossover to allow for baseline measurements. Statist. Med., 13, 905–929. 20. Grieve, A. P. (1994). Bayesian analyses of two-treatment crossover studies. Statist. Methods Med. Res., 4, 407–429. 21. Grizzle, J. E. (1965). The two-period change-over design and its use in clinical trials. Biometrics, 21, 467–480. 22. Hedayat, A. and Afsarinejad, K. (1978). Repeated measurements designs, I. In A Survey of Statistical Design and Linear Models, J. N. Srivastava, ed. North-Holland, Amsterdam, pp. 229–242. 23. Hedayat, A. and Afsarinejad, K. (1978). Repeated measurements designs, II. Ann. Statist., 6, 619–628. 24. Hills, M. and Armitage, P. (1979). The two-period cross-over clinical trial. Brit. J. Clin. Pharm., 8, 7–20. 25. Jones, B. and Lewis, J. A. (1995). The case for cross-over trials in phase III. Statist. Med., 14, 1025–1038. 26. Jones, B. and Kenward, M. G. (1989). Design and Analysis of Crossover Trials. Chapman & Hall, London. (This text takes a broad view with emphasis on crossover trials used in medical research. Both theory and practice are covered in some detail. The analysis of repeated measurements both between and within periods is considered. Methods for analyzing binary and categorical data are described as well as methods for continuous data.) 27. Kiefer, J. (1975). Construction and optimality of generalized Youden designs. In A Survey of Statistical Design and Linear Models, J. N. Srivastava, ed. North-Holland, Amsterdam, pp. 333–341. 28. Kenward, M. G. and Jones, B. (1987). The analysis of data from 2 × 2 cross-over trials with baseline measurements. Statist. Med., 6, 911–926.
29. Kenward, M. G. and Jones, B. (1991). The analysis of categorical data from cross-over trials using a latent variable model. Statist. Med., 10, 1607–1619. 30. Kenward, M. G. and Jones, B. (1992). Alternative approaches to the analysis of binary and categorical repeated measurements. J. Biopharm. Statist., 2, 137–170.
31. Kenward, M. G. and Jones, B. (1994). The analysis of binary and categorical data from crossover trials. Statist. Methods Med. Res. 3, 325–344. 32. Kenward, M. G., Lesaffre, E., and Molenberghs, G. (1994). An application of maximum likelihood and generalized estimating equations to the analysis of ordinal data from a longitudinal study with cases missing at random. Biometrics, 50, 945–953.
33. Laserre, V. (1991). Determination of optimal designs using linear models in crossover trials. Statist. Med., 10, 909–924. 34. Lee, Y. and Nelder, J. A. (1996). Hierarchical generalized linear models. J. R. Statist. Soc. B, 58, 619–678. 35. Mainland, D. (1963). Elementary Medical Statistics, 2nd ed. Saunders, Philadelphia. 36. Matthews, J. N. S. (1990). The analysis of data from crossover designs: the efficiency of ordinary least squares. Biometrics, 46, 689–696. 37. Matthews, J. N. S. (1994). Multi-period crossover designs. Statist. Methods Med. Res., 4, 383–405. 38. Molenberghs, G. and Lesaffre, E. (1994). Marginal modelling of ordinal data using a multivariate Plackett distribution. J. Amer. Statist. Ass., 89, 633–644. 39. Namboodiri, K. N. (1972). Experimental design in which each subject is used repeatedly. Psychol. Bull., 77, 54–64. 40. Patterson, H. D. (1950). The analysis of change-over trials. J. R. Statist. Soc. B, 13, 256–271. 41. Patterson, H. D. (1982). Change over designs. In Encyclopedia of Statistical Sciences, S. Kotz, N. L. Johnson, and C. B. Read, eds. Wiley, New York, pp. 411–415. 42. Peace, K. E. (1990). Statistical Issues in Drug Research and Development. Marcel Dekker, New York. 43. Prescott, R. J. (1979). The comparison of success rates in cross-over trials in the presence of an order effect. Appl. Statist., 30, 9–15. 44. Raghavarao, D. (1989). Crossover designs in
industry. In Design and Analysis of Experiments, with Applications to Engineering and Physical Sciences, S. Gosh, ed. Marcel Dekker, New York. Ratkowsky, D. A., Evans, M. A., and Alldredge, J. R. (1993). Crossover Experiments. Marcel Dekker, New York. (The contents of this text are presented from the viewpoint of someone who wishes to use the SAS statistical analysis system to analyze crossover data. Two nonstandard features are the way designs are compared and the approach suggested for the analysis of categorical data. Extensive tables of designs, which would otherwise be scattered over the literature, are included.) Senn, S. (1993). Cross-over Trials in Clinical Research. Wiley, Chichester. (This text is mainly written for biologists and physicians who want to analyze their own data. The approach is nontechnical, and explanations are given via worked examples that are medical in nature. A critical view is expressed on mathematical approaches to modeling carryover effects.) Senn, S. (1994). The AB/AB crossover: past, present and future? Statist. Methods Med. Res., 4, 303–324. Scheehe, P. R. and Bross, I. D. J. (1961). Latin squares to balance for residual and other effects. Biometrics, 17, 405–414. Tudor, G. and Koch, G. G. (1994). Review of nonparametric methods for the analysis of crossover studies. Statist. Methods Med. Res., 4, 345–381. Williams, E. J. (1949). Experimental designs balanced for the estimation of residual effects of treatments. Austral. J. Sci. Res., 2, 149–168. Zeger, S. L. and Liang, K. -Y. (1992). An overview of models for the analysis of longitudinal data. Statist. Med., 11, 1825–1839.
52. Zhao, L. P. and Prentice, R. L. (1990). Correlated binary regression using a quadratic exponential model. Biometrika, 77, 642–648. See also CHANGEOVER DESIGNS; CLINICAL TRIALS —II; and REPEATED MEASUREMENTS.
M. G. KENWARD B. JONES
DATA MINING
DAVID J. HAND Imperial College London, London, UK

Data mining is a new discipline, which has sprung up at the confluence of several other disciplines, stimulated chiefly by the growth of large databases. The basic motivating stimulus behind data mining is that these large databases contain information which is of value to the database owners, but this information is concealed within the mass of uninteresting data and has to be discovered. That is, one is seeking surprising, novel, or unexpected information and the aim is to extract this information. This means that the subject is closely allied to exploratory data analysis. However, issues arising from the sizes of the databases, as well as ideas and tools imported from other areas, mean that there is more to data mining than merely exploratory data analysis. Perhaps the main economic stimulus to the development of data mining tools and techniques has come from the commercial world: the promise of money to be made from data processing innovations is a familiar one, and huge commercial databases are now rapidly growing in size, as well as in number. However, there is also substantial scientific and medical interest: philosophers of science have remarked that advances and innovation often occur when a mismatch between the data and the predictions of a theory occurs, and nowadays to detect such mismatches often requires extensive analysis of large data sets. Examples of areas of scientific applications of data mining include astronomy (3) and molecular biology (8). Genomics, proteomics, microarray data analysis, and bioinformatics, in general, are areas that are making extensive use of data mining tools. Apart from the sizes of the data sets, one of the distinguishing features of data mining is that the data to which it is applied are often secondary. That is, the data will typically have been collected to answer some other question, or perhaps secondarily in the course of pursuing some other issue (medical records will have been collected for monitoring and treating patients, but can subsequently be analyzed en masse in the search for previously unsuspected relationships and causes of disease). Of course, there is no reason—apart from the expense of collecting large data sets—why data should not be collected specifically to answer a particular question, but then the analysis is a more standard statistical one. The excitement of data mining is also partly a consequence of this secondary nature: it suggests that there is valuable information concealed within the data one already has, simply waiting for someone to tease it out. Unfortunately, the ‘‘simply’’ part of this exercise is rather misleading. Indeed, if it was simple, it would doubtless already have been done. One of the problems is that large data sets necessarily have a great deal of structure in them, but this structure has three major sources in addition to the target one of ‘‘important, real, undiscovered structure’’. These three sources are data contamination, chance occurrences of data, and structure which is already known to the database owner (or, if not explicitly articulated as known, sufficiently obvious once it has been pointed out to be of no genuine interest or value—such as the fact that married people come in pairs). The first and second of these are sufficiently important to warrant some discussion. It is probably not too much of an exaggeration to say that all data sets are contaminated or distorted in some way, though with small data sets this may be difficult to detect. With large data sets, it means that the data miner may triumphantly return an unusual pattern which is simply an artifact of data collection, recording, or other inadequacies. Brunskill (1) describes errors that occurred in the coding of birth weights: 14 oz recorded as 14 lb, birth weights of one pound (1 lb) being read as 11 lb, and misplaced decimal points, for example, 510 gms recorded as 5100 gms. Note that all of these errors yield overreporting of birth weights. A data mining exercise might therefore report an unusual excess of
high birth weights (indeed, we might hope that a successful analysis would report this), but an excess that was of no real interest. Indeed, it seems that this may have happened (2,7). Digit preference is another cause of such curiosities, and is only detectable in large data sets. Wright and Bray (9) describe this occurring in measurements of nuchal translucency thickness, and Hand et al. (5) describe it in blood pressure measurements. In statistics, one is often able to cope with data inadequacies by extending the model to cope with them. Thus, for example, distorted sampling may be allowed for by including a model for the case selection process, and incomplete vectors of measurements may be handled via the EM algorithm. However, such strategies can only be adopted if one has some awareness (and understanding) of the data contamination mechanism. In data mining, with secondary data, this is often not the case. Often, such problems are ignored—with obvious potential for misleading conclusions. Statistics tackles the possibility of spurious (chance) patterns arising in the data using tools which estimate the probability of such structures arising by chance, merely as a consequence of random variation; that is, with hypothesis and significance tests. Unfortunately, with large data sets, and large sets of possible patterns being sought, the opportunity for the discovery of apparent structures is clearly great. This means that the statistical approach cannot be readily applied. Instead, data miners simply define score functions (for the ‘‘interestingness’’ or ‘‘unusualness’’ of a pattern), without any probability interpretation, and pass those patterns that show the largest such scores over to an expert for evaluation. This description reveals the process nature of data mining. Data mining is not a ‘‘one-off’’ exercise, to be done and finished with. Rather, it is an ongoing process: one examines a data set, identifies features of possible interest, discusses them with an expert, goes back to the data in the light of these discussions, and so on. The score functions may be the same as the criteria used in statistical model fitting, without the probabilistic interpretation, or there may be other criteria. An illustration of the
differences in perspective is given by regression analysis. A statistician may find the maximum likelihood estimates of the parameters, assuming a normal error distribution. In contrast, a data miner may adopt the sum of squared residuals as a score function to use in choosing the parameters. Since maximizing the likelihood based on a normal error distribution leads to the sum of squares criterion, these two approaches yield the same result—but they start from different positions. The statistician has a formal model in mind, while the data miner is simply aiming to find a good description of the data. Of course, the distinction is not a rigid one—there is overlap between the two perspectives. This example does show how central the concept of modeling is to the statistician. In contrast, data miners tend to place much more emphasis on algorithms. Given the essential role of computers in data mining, this algorithmic emphasis is perhaps not surprising. Moreover, when data sets are very large, the popular statistical algorithms may become impracticable (tools that make repeated passes through the data, for example, may be out of the question with a billion data points). Computers are, of course, also important for statistics, but many statistical techniques can be applied on small data sets without computers—many were originally developed that way. One consequence of the stress on algorithms is that it may be difficult to describe exactly what model is being fit to the data. This can have adverse consequences. For example, cluster analysis is widely used in data mining, but without careful thought about the nature of the procedure, it can be difficult to be clear about what sort of ‘‘clusters’’ are being found. Thus, compact structures may be appropriate in some situations (e.g. to produce compact summarizing descriptions, with the clusters being represented by ‘‘central’’ points), while in others, elongated shapes may be desirable, in which neighboring points in the same cluster are similar but distant ones are not. Without an awareness of the type of structure that the method reveals, inappropriate conclusions could be drawn: a species could be incorrectly partitioned on a dimension in which it has substantial variability.
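To make the point of this regression illustration concrete, the following small sketch (simulated data, not from the article) fits the same straight line both ways: once by the closed-form least-squares solution, which is the maximum likelihood estimate under normal errors, and once by direct numerical minimization of the sum-of-squared-residuals score function.

```python
# Small illustrative sketch: the normal-theory maximum likelihood fit and a
# direct minimization of the sum-of-squared-residuals score give the same
# regression estimates.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=200)
X = np.column_stack([np.ones_like(x), x])

# "Statistician's" route: closed-form least squares (the normal-theory MLE)
beta_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

# "Data miner's" route: treat the sum of squared residuals purely as a score
sse = lambda b: np.sum((y - X @ b) ** 2)
beta_score = minimize(sse, x0=np.zeros(2)).x

print(beta_mle, beta_score)   # essentially identical estimates
```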
In the above, we have used phrases such as ‘‘model’’, ‘‘structure’’, and ‘‘pattern’’ without defining them. Hand et al. (4) define a model as a large scale summary of a set of data (i.e. as the standard statistical notion of a model), and a pattern as a small scale, local structure. A Box–Jenkins decomposition of a time series (see ARMA and ARIMA Models) is a model, whereas a conjunction of values, which occasionally repeats itself (e.g. a petit mal seizure in an EEG trace), is a pattern. Models are the staples of statistics, but patterns are something with which it has generally not been concerned (there seem to be three main exceptions: the study of scan statistics, of spatial disease clusters, and of outliers, all concerned with local anomalies). An examination of the data mining literature shows that both models and patterns are important, but narrow views of data mining sometimes fail to recognize the diversity of the tools used. Thus, for example, it is sometimes claimed that data mining is merely the application of recursive partitioning methods (e.g. tree classifiers), but this is a parody of the breadth of the field. Likewise, the viewpoint sometimes proposed in the econometric literature, that data mining is merely an elaborate and extensive form of model search, fails to recognize the various other kinds of data mining activities that go on. A large number of different kinds of tools are used in data mining— reflecting the eclecticism of its origins. Some recent ones in pattern detection, culled from the data mining literature with no particular objective other than to indicate the diversity of different kinds of methods are: tools for characterizing, identifying, and locating patterns in multivariate response data; tools for detecting and identifying patterns in two dimensional displays (such as fingerprints); identifying sudden changes over time (as in patient monitoring); and identifying logical combinations of values that differ between groups. Some examples of important tools in model building in data mining (again chosen with no particular aim other than to illustrate the range of such methods) include: recursive partitioning, cluster analysis, regression modeling, segmentation of time series into a small number of segment types, techniques for condensing huge (tens of billions of data points)
data sets into manageable summaries, and collaborative filtering, in which transactions are processed as they arrive, so that future transactions may be treated in a more appropriate manner. Much statistical theory is aimed at producing valid inferences from a sample to some population (real or notional) from which the sample has been drawn. This might be so that one can make comparative statements about the populations, or for forecasting, or for other reasons. These methods are also appropriate in data mining, provided one has a sample and that it has been drawn in a probabilistic way (so that one knows the probability of each object appearing in the sample). Going further than this, in many data mining applications one has available data on the entire population (for example, all chemical molecules in a particular class) and then, in model building data mining applications, analyzing a sample from the data set may be a sensible way to proceed. In contrast, however, in a pattern detection exercise, it will typically be necessary to analyze the entire data set: if one is seeking those data points that are anomalous, there is no alternative to examining every data point. It is clear that data mining will be of increasing importance as time progresses. However, the importance should not conceal the difficulties. Finding unsuspected structures in large data sets and identifying those that are due to phenomena of genuine interest and not merely arising from data contamination or due to chance is by no means a trivial exercise. Issues of theory, of data management, and of practice all arise. General descriptions of data mining include those of Fayyad et al. (4) and Hand, Mannila, and Smyth (6).
REFERENCES
1. Brunskill, A. J. (1990). Some sources of error in the coding of birth weight, American Journal of Public Health 80, 72–73. 2. David, R. J. (1980). The quality and completeness of birthweight and gestational age data in computerized birth files, American Journal of Public Health 70, 964–973. 3. Fayyad, U. M., Djorgovski, S. G. & Weir, N. (1996). Automating the analysis and cataloging
of sky surveys, in Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth & R. Uthurusamy, eds. AAAI Press, Menlo Park, pp. 471–493. 4. Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P. & Uthurusamy, R., eds. (1996). Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park. 5. Hand, D. J., Blunt, G., Kelly, M. G. & Adams, N. M. (2000). Data mining for fun and profit, Statistical Science 15, 111–131.
6. Hand, D. J., Mannila, H. & Smyth, P. (2000). Principles of Data Mining. MIT Press.
7. Neligan, G. (1965). A community study of the relationship between birth weight and gestational age, in Gestational Age, Size and Maturity, Vol. 19. Clinics in Developmental Medicine, Spastics Society Medical Education Unit, pp. 28–32.
8. Su, S., Cook, D. J. & Holder, L. B. (1999). Knowledge discovery in molecular biology: identifying structural regularities in proteins, Intelligent Data Analysis 6, 413–436.
9. Wright, D. E. & Bray, I. (2003). A mixture model for rounded data, The Statistician 52, 3–13.
DATA MONITORING COMMITTEE
SUSAN ELLENBERG
University of Pennsylvania
Philadelphia, Pennsylvania

Data Monitoring Committees (DMCs) are oversight groups for clinical trials. All investigations of medical interventions require ongoing monitoring. Monitoring is required first and foremost to ensure that any emerging safety issues are identified and dealt with as rapidly as possible. Monitoring is also important to ensure that the study is being implemented appropriately: that patients are being entered as anticipated, that treatment is being administered according to the study protocol, that the dropout and ineligibility rates are not excessive, and so forth. In most trials, monitoring functions are performed by the sponsors of the study together with the individuals who are carrying out the study. In some trials, however, particularly those in which treatment strategies are compared to determine impact on survival and other serious outcomes, DMCs [also often called data and safety monitoring boards (DSMBs)] traditionally have been established to review the accumulating data on a regular basis and make recommendations to the study sponsor regarding continuation, termination, or modification. Over the last decade, the use of DMCs has increased in both government-sponsored and industry-sponsored trials. Two complementary books provide information on DMC operations; one focuses on principles and practices (1), and one is a series of case studies that describe how DMCs operated and the issues they faced in specific trials (2).

1 EVOLUTION OF DMCS AS A COMPONENT OF CLINICAL TRIALS

Research with human subjects has a very long history, but the widespread consensus on the need for mechanisms to protect research subjects is relatively recent. The Nuremberg Code (3), issued in 1949, was a simple statement of principles that emerged following the revelation of horrific experiments performed on inmates of the Nazi concentration camps during World War II. Additional motivation for effort in this area came from the work of Henry Beecher, whose 1966 publication on ethical lapses in clinical research received wide attention (4). The Nuremberg Code was followed by the Declaration of Helsinki (5), first issued in 1964 with numerous subsequent revisions, and, in the United States, by the report of the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research (the ''Belmont Report''), issued in 1979 (6).

The modern era of attention to protection of human subjects, beginning in the mid-twentieth century, coincided with the emergence of the randomized clinical trial as the preferred method of evaluating effects of medical treatments. The concept of selecting treatments for individuals essentially by the flip of a coin prompted its own ethical concerns, particularly in areas of serious and difficult-to-treat diseases such as cancer, but many issues developed in regard to the proper conduct of trials. One key issue, the approaches to which are still evolving, is the need to monitor the emerging results of a clinical trial to ensure that participation remains safe and appropriate.

In the earliest days of the randomized clinical trial, the monitoring of emerging data was performed routinely by the study sponsors (pharmaceutical companies or government funding agencies). The investigators themselves, or at least the investigators who were part of the study leadership, were also often involved, especially for studies funded by the National Institutes of Health (NIH). But an awareness was developing that when difficult judgments had to be made about emerging results, involvement in the study could threaten the objectivity of these judgments. In 1967, a major NIH-commissioned assessment of then-current clinical trials methodology was reported by a committee headed by Dr. Bernard Greenberg of the University of North Carolina (7). One item recommended in this report was the establishment of a policy advisory board that would include individuals with relevant expertise
not otherwise involved in the study to provide advice about important decisions affecting the conduct of the study. The entity now known as a data monitoring committee, a data and safety monitoring board, or any of several other similar appellations evolved from this recommendation. In the remainder of this article, I will use the term ''data monitoring committee (DMC)'' as an umbrella term that includes all variants.

By the late 1970s, the use of DMCs was fairly standard in major randomized trials sponsored by many institutes of the NIH as well as the Veterans Administration and other federal research agencies. Around this time, statistical methods were developed to accommodate the need to conduct regular interim analyses throughout the duration of a study without inflating the Type 1 error rate by providing multiple opportunities to declare a positive result (8–11). These approaches rapidly gained widespread use in the large multicenter trials sponsored by NIH. The tensions and urgencies surrounding the early clinical trials of treatments for HIV/AIDS, beginning in the late 1980s, led to increased attention to operational aspects of DMCs (12–14). By that time, emerging recognition of problems in cancer trials, likely attributable to widely accessible interim results (15), had led to the establishment of DMCs for the NCI-funded cancer cooperative groups.

By the mid-1990s, use of DMCs for randomized studies funded by federal agencies had become standard, and in 1998 the NIH issued its first formal policy on the use of such committees in its extramural research programs (16). Industry sponsors have also increasingly engaged such committees, particularly for Phase 3 trials intended to serve as the primary basis for marketing approval of drugs to treat serious illness. DMCs are not required by the Food and Drug Administration (FDA) except for the very limited set of trials done in emergency research settings for which exception from the informed consent requirement has been approved (17). Nevertheless, the value of simultaneously ensuring optimal oversight of patient safety (by having outside experts review the interim data) and ensuring and
maintaining trial integrity (by keeping individuals with a financial interest in the trial outcome blinded to the interim data) was recognized by both FDA and industry scientists, and the use of DMCs for industry trials has increased substantially since the early 1990s. In 2006, FDA issued a detailed guidance document on the use of DMCs in trials subject to FDA regulation (18). Today, DMCs are widely used in clinical trials sponsored both by industry and by government agencies and are often used even in early phase, uncontrolled studies when the treatment is novel and risks may be unusually high. 2 ROLES AND RESPONSIBILITIES OF DMCS The primary responsibility of a DMC is to protect study subjects from risks that exceed the level anticipated when the study was initiated and described to subjects during the informed consent process. To fulfill this responsibility, a DMC will receive reports from the study statistician at regular intervals, will convene to discuss the interim data, and will make recommendations to the study sponsor regarding any actions that may be needed. Such recommendations might range from informing subjects about a new potential side effect that has been observed to changing dose or schedule or to terminating the study early. A recommendation for early termination could result from a variety of patterns of emerging data, including a level of benefit that is already definitive, an observed magnitude of risk that is unlikely to be outweighed by benefits, or a consistent similarity in event rates such that demonstrating an advantage of one or the other treatment is very unlikely even if the trial were continued to its planned completion target (futility). Of course, a DMC often notes that no changes are warranted and recommends that the trial continue as designed. Another critical responsibility of a DMC is protecting the integrity of the study. A DMC that is the only entity with access to interim data provides assurance that the study design and/or conduct will not change as a result of emerging results in a way that would bias the results or cause the study
to fail to provide useful information. For example, if investigators are aware of the interim data and these data begin to suggest an advantage for one treatment, then they may decide, long before the data are definitive, to stop entering subjects into the study and thereby prevent the study from producing conclusive results. Similarly, a sponsor aware of interim data might suggest a change in the primary endpoint midway through the study if the data on that endpoint are more suggestive of a benefit than the data on the originally specified primary endpoint. If neither sponsors nor investigators have access to interim data, changes in design and conduct that are intended to promote a more efficient and/or informative study can be initiated without concern that these changes were motivated by interim results and therefore could cause problems in interpreting the final statistical analyses. A DMC will also generally monitor the quality of the study as evidenced by dropout rate, timeliness of information, ineligibility rate, adherence to the protocol, and so forth, and make recommendations regarding issues that could threaten the study’s informativeness. Although quality issues should be monitored closely by the sponsor and lead investigator(s), it is often helpful to have an outside perspective on these matters that can give the trial leadership an extra push in regard to lagging recruitment, delayed data submissions, and other operational issues. 3
DMC MEMBERSHIP
DMC members must have the same type and magnitude of expertise as the trial leadership so the sponsor and investigators are comfortable that the DMC can fulfill its mission of protecting study subjects and make good judgments about stopping, continuing, or modifying the trial. In most cases, this involves clinical expertise in the area(s) relevant to the trial and statistical expertise in the methods of clinical trial design, analysis, and interim monitoring. For example, a study of a new drug to treat rheumatoid arthritis would include rheumatologists and at least one statistician, all of whom would be expected to be familiar with clinical trials methodology. In studies with multiple
components or with complex treatment regimens, multiple areas of clinical expertise may be required. For example, the DMC for the Women's Health Initiative required clinicians with expertise in women's health, including specialists in cardiology, cancer, gynecology, and bone metabolism (19). For trials with complex designs, it is often advisable to include more than a single statistician, as multiple perspectives may be found on statistical methods and interpretation of data, just as multiple viewpoints may develop on optimal treatment in a specific situation. In addition, nonphysician scientists, such as immunologists, pharmacologists, engineers (for medical device trials), or epidemiologists, may be important DMC members for some trials. Trials that raise particularly difficult ethical issues and trials that have the potential to have a major impact on public health often include a bioethicist on their DMCs. The role of a bioethicist is more than that of a subject advocate; all DMC members generally consider themselves to be serving that role. The bioethicist can be especially helpful in framing the difficult decisions a DMC may have to make and may facilitate the development of a consensus view. In some areas, trials that affect a particular community routinely include a community representative on the DMC. For example, clinical trials conducted by the NCI-funded cooperative oncology groups must include a lay representative on their DMCs (20). Such individuals may contribute an extremely valuable perspective, particularly when difficult decisions must be made. It is important to be certain, however, that maintaining confidentiality of interim data will not be a problem for such a member, who may have many connections with individuals whose choice of treatment may be influenced by the results of the trial.

Other issues besides type of expertise should be considered in constituting a DMC. Evaluation of potential conflicts of interest is extremely important. Some conflicts are obvious, such as being in a position to benefit financially from a particular trial outcome, either through place of employment (e.g., the sponsoring company) or investment holdings.
Other conflicts, especially nonfinancial conflicts, may be trickier. Should someone serve on a DMC if her spouse is an investigator in the trial or works for the sponsor? What about her brother or a close friend? Should someone on a speaker's bureau for the pharmaceutical sponsor be on the DMC? What about someone who was the lead investigator on a Phase 1 study that had promising results and led to more investigations? (If the product ultimately is deemed successful, then that Phase 1 study will be referenced far more frequently than if the product ultimately is found ineffective or harmful.) Other considerations for DMC member selection include representation of participating countries or regions in international trials, representation of population subsets who may be the focus of the trial (it would seem strange, for example, to have an all-female DMC overseeing a trial that evaluates a treatment for testicular cancer or to have an all-white DMC overseeing a trial in the area of sickle-cell anemia), and prior DMC experience. Although it is neither feasible nor sensible to require that all DMC members have such experience, it is important that at least some members do. Finally, DMC members should be prepared to give DMC meetings their highest priority. A DMC cannot work effectively if its members are casual about their attendance and participation.

DMCs range in size. A general principle is that the DMC should be as small as possible while including all the required areas of expertise. The primary advantage of a small committee is the ease of arranging meetings; this advantage can be particularly important when an unscheduled emergency meeting is necessary. (Of course, from a sponsor's perspective, the costs are lower for a small committee!) Many DMCs, even for major trials, include only three members, typically two clinicians and one statistician. Trials that require members with a variety of types of expertise may require more members. Also, DMCs that monitor multiple trials, such as DMCs for cooperative group or network studies, typically require more members because multiple studies increase the range of expertise required.

The choice of DMC chair is a critical one. A DMC chair definitely should have had
prior DMC experience. Successful chairs can have any of the types of scientific expertise required for the study the DMC is monitoring; the key characteristic required is the ability to manage discussion and to keep it constructive and on a path toward consensus. A DMC chair is often selected because of his/her eminence in the professional community, but such individuals are often extremely busy, with many commitments. It is essential for the trial sponsor to have an understanding with the DMC chair that DMC meetings must be given the highest priority. 4 OPERATIONAL ISSUES 4.1 Charter The operational principles for a DMC should be laid out in a charter that is provided and agreed to by all DMC members before the initiation of review. The charter will address issues such as expected meeting schedule, meeting format, statistical approach to interim monitoring, determination of conflicts of interest, handling of meeting minutes, types and format of reports to the committee, and other aspects of DMC processes. Sample charters that can serve as templates for most DMCs have been published (1,21). 4.2 Meeting Format DMC meetings are often arranged in segments. One segment, usually called the ‘‘open session,’’ permits the study sponsor and/or lead investigator(s) to discuss the status of the study with the DMC and bring to the committee’s attention any issues that may be relevant to its deliberations. For example, a sponsor may be conducting multiple studies of a product in a variety of settings and may wish to inform the DMC of findings in these other studies. Another segment, a ‘‘closed session,’’ is typically limited to the DMC and the statistician who has prepared the interim analyses. In the closed session, the interim results are presented to the DMC, with opportunity for discussion. The DMC may also request an ‘‘executive session’’ in which only DMC members would participate. In many studies sponsored by federal agencies, it is the practice for representatives of the sponsoring agency to attend closed
sessions; in some agencies, executive sessions from which such representatives would be excluded are not permitted. In trials sponsored by industry, it is rare for sponsors to attend closed sessions or to have access to the interim data.

4.3 Meeting Schedule

It is optimal for DMCs to have their initial meeting before the initiation of the study. It is important for all DMC members to have a thorough understanding of the study design and goals and of the plan for monitoring the interim data. In some cases, DMCs may have useful suggestions for modifications, particularly with respect to the monitoring plan, that the sponsor may wish to consider. In other cases (which should occur only very rarely), a DMC member may realize that he or she is not comfortable with the study as designed; in such circumstances, if the design cannot be changed, this member may resign from the committee and be replaced before any review of interim data.

Meeting schedule and frequency once the study is underway depend on the rapidity with which information becomes available. Interim analyses may be scheduled based on chronologic time (e.g., every year for the duration of the study) or, more typically, on the proportion of information on the primary outcome that has been achieved. The study may plan for only a few interim analyses, with the potential of early termination for higher-than-expected benefit or clear futility, but it may be necessary for the DMC to meet at times other than when these formal interim analyses are scheduled to allow for regular review of safety data and of study operational issues. Certainly, any trial that develops particular safety concerns would likely arrange for frequent DMC review of the accumulating safety data over and above the scheduled times for formal interim analysis. Most DMCs meet at least on a semi-annual basis to make sure that safety oversight and review of study quality occur frequently enough that problems can be identified and rectified quickly. It may happen on occasion that the data reviewed at a regularly scheduled interim analysis are on the margin of a decision point.
That is, data may suggest that termination or modification may be appropriate, but the DMC may not yet be certain that such a recommendation is warranted. In such situations, a DMC may be reluctant to wait until the next regularly scheduled analysis time for another opportunity to recommend changes in the study, and an extra meeting therefore may be scheduled. Statistical monitoring approaches have been developed that account for such ''extra looks'' at the accumulating data while still controlling Type 1 error (11,22).
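To give a sense of what an alpha-spending approach does in numerical terms, the sketch below evaluates the O'Brien-Fleming-type spending function described by Lan and DeMets at a set of information fractions and reports how much of a two-sided 0.05 significance level has been ''spent'' by each look. This is an illustration only, not code from the cited papers: the information fractions are arbitrary, the SciPy library is assumed, and converting the spent increments into actual stopping boundaries requires the joint distribution of the sequential test statistics, which is normally handled by specialized group sequential software.

from scipy.stats import norm

def obf_spending(t, alpha=0.05):
    # Lan-DeMets O'Brien-Fleming-type spending function (two-sided):
    # alpha*(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t))), so that alpha*(1) = alpha.
    z = norm.ppf(1 - alpha / 2)
    return 2 * (1 - norm.cdf(z / t ** 0.5))

looks = [0.25, 0.50, 0.75, 1.00]   # illustrative information fractions at each look
cumulative = [obf_spending(t) for t in looks]
increments = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
for t, cum, inc in zip(looks, cumulative, increments):
    print(f"information fraction {t:.2f}: cumulative alpha spent {cum:.4f}, increment {inc:.4f}")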
4.4 Other Operational Issues

The process for coming to a decision is optimally one of developing consensus; every attempt should be made to come to a mutually satisfactory agreement on recommendations without resorting to formal votes. Many DMC meetings are convened by teleconference. This method may be adequate for routine meetings at which safety data are reviewed but no formal interim analyses are presented; for meetings at which decisions may be made about continuing or terminating a study on the basis of a scheduled interim analysis, or when concern about emerging safety issues is to be discussed, it is preferable to meet in person. The initial meeting of a DMC should be held in person if at all possible so that board members have the opportunity to get to know each other, which facilitates interaction at later meetings that may need to be held by teleconference. Unscheduled meetings of the DMC, particularly ''emergency'' meetings held to discuss new reports of serious and unexpected safety issues, may have to be held by teleconference because of the practical difficulty of bringing committee members to a central site on short notice. Some debate exists over the way the interim comparative data should be presented to a DMC. Some individuals prefer to keep the data coded so that DMC members see the results by arm but do not know which arm is which. This practice, however, would require separate coding of safety and efficacy data in any trial in which one treatment has a known excess of a certain side effect; if multiple side effects are more common on one arm than another, it might even be necessary to
use different codes for each adverse event reported. Any attempt to maintain blinding of DMC members, whether one or multiple codes are employed, will present major obstacles to a DMC trying to weigh risks and benefits effectively. Therefore, many experienced DMC members have advocated that DMCs have access to unblinded data (23,24). Motivated by the increased use of DMCs and the recognition that much variability has developed in how they operate, the Health Technology Assessment Programme of the United Kingdom’s National Health Service commissioned a detailed study of the ways in which accumulating data are monitored and decisions are made about possible changes in ongoing trials. The report of this project, entitled DAMOCLES (Data Monitoring Committees: Lessons, Ethics, Statistics) was issued in 2005 (26); several papers reporting specific aspects of the project have been published (27–30). 5 INDEPENDENCE OF THE MONITORING PROCESS Although much consensus has developed regarding practices of clinical trial DMCs, differing perspectives are found on a variety of DMC operational aspects. One major area of debate relates to issues of independence, not just of DMC members but also of the entire process. As noted earlier, the theme of independence is an important one in DMC establishment and operation. The individuals who review the interim data and make judgments about whether the trial should continue as designed, whether changes should be made, or whether the trial should terminate early, should not have any major financial or other type of stake in the trial outcome, as this could render any recommendations suspect. Additionally, individuals with a financial or other major stake in the outcome (primarily, the commercial sponsor) are generally thought to best avoid having any access to the interim data, as such access will render suspect any change in design or analytical plan the sponsor might wish to make for the most valid reasons. So, where are the controversies?
5.1 Government versus Commercial Sponsors Representatives of federally sponsored research programs often play what are essentially ex officio roles on DMCs of trials sponsored by their agencies. Such individuals do not have any financial stake in the outcome; also, they have a responsibility to ensure the protection of study subjects and the appropriate expenditures of federal dollars. For these reasons, it is often argued that representatives of government agencies do have a role to play in evaluating the interim data from ongoing trials funded by their agencies. Another perspective is that, although such individuals have no financial stake in the outcome, they may have a major professional stake in the outcome if they proposed the study and/or were responsible for preliminary data that suggested that the study be undertaken. In addition, and just as for commercial sponsors, if new information becomes available from external sources that suggests that the study design should be altered in some way during its course, the sponsor representative may know the impact of such a change on the ultimate study conclusions and therefore no longer can consider the argument for making the change in a completely objective manner. 5.2 Statisticians Performing the Interim Analyses Other than DMC members, clearly one other person must be aware of the interim comparisons, and that is the statistician (or statistical group) who performs the interim analyses, develops the report, and presents the data to the DMC. What are the independence issues for the reporting statistician? If the statistician works for the commercial sponsor of the study, then one might worry about overt or subtle pressure on the statistician during the course of the study to indirectly reveal data trends to others in the company. Another concern is that if the statistician doing the interim analysis is the same statistician who works with the trial leadership and sponsor on design and conduct of the trial, that statistician may be put in a difficult position if a change in design partway through the trial is proposed by the sponsor and/or study leadership and the statistician is aware of whether
such a change will have an impact on the ultimate trial conclusions. In this situation, the statistician would not be able provide objective advice, uninfluenced by the interim data, as it is impossible to be uninfluenced by information of which one is aware. For this reason, it has been recommended that the statistician who performs the interim analyses and reports to the DMC be different from the trial’s primary statistician, who was involved in the design, will advise on conduct issues, and will be the primary statistical author on the trial report (31–34). A strong counterargument has been made, to the effect that such an arrangement could lead to a suboptimal monitoring process if the statistician doing the interim analysis is less knowledgeable about the study and therefore less able to engage fully with the DMC when concerns develop (35,36). REFERENCES 1. S. Ellenberg, T. R. Fleming, and D. L. DeMets, Data Monitoring Committees in Clinical Trials: A Practical Perspective. Chichester, U. K.: John Wiley & Sons, 2002. 2. D. L. DeMets, C. D. Furberg, and L. M. Friedman, Data Monitoring in Clinical Trials: A Case Studies Approach. New York: Springer, 2006. 3. The Nuremberg Code. In: R. J. Levine (ed.), Ethics and Regulation of Clinical Research, 2nd ed. Baltimore, MD: Urban and Schwarzenberg, 1986, pp. 425–426. 4. H. K. Beecher, Ethics and clinical research. N. Engl. J. Med. 1966;274:1354–1360. 5. World Medical Association, World Medical Association Declaration of Helsinki. Ethical principles for medical research involving human subjects. September 10, 2004. Available: http://www.wma.net/e/policy/b3.htm. Accessed November 29, 2007. 6. The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, The Belmont Report. Ethical principles and guidelines for the protection of human subjects of research. April 18, 1979. National Institutes of Health Office of Human Subjects Research. Available: http: //www.nihtraining.com/ohsrsite/guidelines/ belmont.html. Accessed November 29, 2007. 7. Organization, review, and administration of cooperative studies (Greenberg Report): A
report from the Heart Special Project Committee to the National Advisory Heart Council. May 1967. Controlled Clinical Trials 1988;9:137–148. 8. J. L. Haybittle, Repeated assessment of results in clinical trials of cancer treatments. British J. Radiology. 1971;44:793–793. 9. S. J. Pocock, Group sequential methods in the design and analysis of clinical trials. Biometrika. 1977;64:191–199. 10. P. C. O’Brien and T. R. Fleming, A multiple testing procedure for clinical trials. Biometrics. 1979;35:549–556. 11. K. K. G. Lan and D. L. DeMets, Discrete sequential boundaries for clinical trials. Biometrika. 1983;70:659–663. 12. S. S. Ellenberg, M. W. Myers, W. C. Blackwelder, and D. F. Hoth. The use of external monitoring committees in clinical trials of the National Institute of Allergy and Infectious Diseases. Stat. Med. 1993;12:461–467. 13. S. S. Ellenberg, N. Geller, R. Simon, and S. Yusuf, (eds.), Practical issues in data monitoring of randomized clinical trials (workshop proceedings). Stat. Med. 1993;12:415–616. 14. D. L. DeMets, T. R. Fleming, R. J. Whitley, J. F. Childress, S. S. Ellenberg, M. Foulkes, K. H. Mayer, J. R. O’Fallon, R. B. Pollard, J. Rahal, M. Sande, S. Straus, L. Walters, and P. Whitley-Williams, The data monitoring board and acquired immune deficiency syndrome (AIDS) clinical trials. Controlled Clinical Trials. 1995;16:408–421. 15. S. J. Green, T. R. Fleming, and J. R. O’Fallon, Policies for study monitoring and interim reporting of results. J. Clinical Oncology. 1987;5:1477–1484. 16. National Institutes of Health, NIH policy for data and safety monitoring. June 10, 1998. Available: grants.nih.gov/grants/guide/noticefiles/not98-084.html.. Accessed November 29, 2007. 17. Code of Federal Regulations, Title 21, Part 50.24. 18. U.S. Food and Drug Administration, Guidance for clinical trial sponsors, Establishment and operation of clinical trial data monitoring committees. March 27, 2006.. Available: http://www.fda.gov/cber/gdlns/clindatmon .pdf. Accessed November 29, 2007. 19. The Women’s Health Initiative Study Group, Design of the Women’s Health Initiative clinical trial and observational study. Controlled Clinical Trials. 1998;19:61–109.
20. Policy of the National Cancer Institute for data and safety monitoring of clinical trials. Available: http://deainfo.nci.nih.gov/ grantspolicies/datasafety.htm. 21. DAMOCLES Study Group, NHS Health Technology Assessment Programme. A proposed charter for clinical trial data monitoring committees: helping them to do their job well. Lancet. 2005;365:711–722. 22. K. K. G. Lan and D. L. DeMets, Changing frequency of interim analyses in sequential monitoring. Biometrics. 1989;45:1017– 1020. 23. C. L. Meinert, Masked monitoring in clinical trials–blind stupidity? N. Engl. J. Med. 1998;338:1381–1382. 24. T. R. Fleming, S. Ellenberg, and D. L. DeMets, Monitoring clinical trials: issues and controversies regarding confidentiality. Stat. Med. 2002;21:2843–2851. 25. A. M. Grant, D. G. Altman, A. B. Babiker, M. K. Campbell, F. J. Clemens, J. H. Darbyshire, D. R. Elbourne, S. K. McLeer, M. K. Parmar, S. J. Pocock, D. J. Spiegelhalter, M. R. Sydes, A. E. Walker, and S. A. Wallace, DAMOCLES study group, Issues in data monitoring and interim analysis of trials. Health Technol. Assess. 2005;9:1–238, iii–iv. 26. M. R. Sydes, D. J. Spiegelhalter, D. G. Altman, A. B. Babiker, and M. K. B. Parmar, DAMOCLES Study Group, Systematic qualitative review of the literature on data monitoring committees for randomized controlled trials. Clinical Trials. 2004;1:60–79. 27. M. R. Sydes, D. G. Altman, A. B. Babiker, M. K. B. Parmar, and D. Spielgelhalter, DAMOCLES Group, Reported use of data monitoring committees in the main published reports of randomized controlled trials: a cross-sectional study. Clinical Trials. 2004;1:48–59. 28. F. Clemens, D. Elbourne, J. Darbyshire, and S. Pocock, DAMOCLES Group, Data monitoring in randomized controlled trials: surveys of recent practice and policies. Clinical Trials. 2005;2:22–33. 29. A. E. Walker and S. K. McLeer, DAMOCLES Group, Small group processes relevant to data monitoring committees in controlled clinical trials: an overview of reviews. Clinical Trials. 2004;1:282–296. 30. S. S. Ellenberg, and S. L. George, Should statisticians reporting to data monitoring committees be independent of the trial sponsor and leadership? Stat. Med. 2004;23:1503–1505.
31. J. P. Siegel, R. T. O’Neill, R. Temple, G. Campbell, and M. A. Foulkes, Independence of the statistician who analyses unblinded data. Stat. Med. 2004;23:1527–1529. 32. M. Packer, J. Wittes, and D. Stump, Terms of reference for data and safety monitoring committees. Am. Heart J. 2001;141:542–547. 33. D. L. DeMets and T. R. Fleming, The independent statistician for data monitoring committees. Stat. Med. 2004;23:1513–1517. 34. S. J. Pocock, A major trial needs three statisticians: why, how and who? Stat. Med. 2004;23:1535–1539. 35. S. Snapinn, T. Cook, D. Shapiro, and D. Snavely, The role of the unblinded sponsor statistician. Stat. Med. 2004;23:1531–1533. 36. J. Bryant, What is the appropriate role of the trial statistician in preparing and presenting interim findings to an independent Data Monitoring Committee in the U.S. Cancer Cooperative Group setting? Stat. Med. 2004;23:1507–1511.
FURTHER READING S. Ellenberg, T. R. Fleming, and D. L. DeMets, Data Monitoring Committees in Clinical Trials: A Practical Perspective. Chichester, U. K.: John Wiley & Sons, 2002. D. L. DeMets, C. D. Furberg, and L. M. Friedman, Data Monitoring in Clinical Trials: A Case Studies Approach. New York: Springer, 2006. A. M. Grant, D. G. Altman, A. B. Babiker, M. K. Campbell, F. J. Clemens, J. H. Darbyshire, D. R. Elbourne, S. K. McLeer, M. K. B. Parmar, S. J. Pocock, D. J. Spiegelhalter, M. R. Sydes, A. E. Walker, and S. A. Wallace, the DAMOCLES study group, Issues in data monitoring and interim analysis of trials. Health Technol. Assess. 2005;9(7): 1–238.
CROSS-REFERENCES
Alpha-spending function
Benefit-Risk Assessment
Futility Analysis
Group Sequential Designs
Interim Analysis
Stopping Boundaries
Trial Monitoring
Lan-DeMets Alpha-Spending Function
Early Termination–Study
DATA SAFETY AND MONITORING BOARD (DSMB)

A Data Safety and Monitoring Board (DSMB) is an independent committee composed of community representatives and clinical research experts that reviews data while a clinical trial is in progress to ensure that participants are not exposed to undue risk. A DSMB may recommend that a trial be stopped if there are safety concerns or if the trial's objectives have been achieved.
This article was modified from the website of the National Institutes of Health (http://clinicaltrials.gov/ct/info/glossary) by Ralph D'Agostino and Sarah Karl.
DATA STANDARDS
EDWARD HELTON, SAS
REBECCA KUSH, CDISC

In 1997, the use of technology for clinical research outside of the statistical programming area was really in its infancy. Yet a handful of individuals were well ahead of the times. One individual, who came into clinical research from a major chemical company, had worked with all sites in a large site network to develop standard electronic source documents and was seeking a means to transfer the source data electronically into electronic case report forms and then to sponsors in a standard format. Another, who came from the defense industry, had developed a site-management system that had been rolled out to hundreds of sites, such that all of these sites were collecting important clinical research information in a standard format, particularly for subject recruitment and management of studies. A third individual was among the first to implement (not pilot) electronic data capture within the clinical research industry. These individuals recognized a true value for standards to facilitate the exchange of information among research stakeholders. Unfortunately, they could not find sponsors ready to accept electronic data in a standard manner, and no industry interchange standards were in use. Hence, a challenge was put forth to the industry in July 1997 to help develop such standards. Despite the response that biopharmaceutical companies were struggling to set standards within their own companies, much less industry standards, there was sufficient interest to form a volunteer group to initiate the development of industry-wide data interchange standards. The Clinical Data Interchange Standards Consortium (CDISC) was initiated at an inaugural meeting in September of that year (1). At that point, CDISC entered the realm of Standards Development Organizations in a complex global standards environment.

1 THE STANDARDS ENVIRONMENT

CDISC is now a nonprofit standards development organization that has been working collaboratively with numerous groups for the past decade to fulfill its initial mission to establish global standards for the acquisition, exchange, submission, and archiving of clinical research data and metadata. Its current mission is to develop and support global, platform-independent data standards that enable information system interoperability to improve medical research and related areas of healthcare. In 2001, CDISC established a charter agreement with Health Level Seven (HL7), a healthcare standards development organization (2); a commitment was made between the two organizations to harmonize their standards to facilitate the flow of information between healthcare (electronic health records) and clinical research. This harmonization is being done through the development of a Biomedical Research Integrated Domain Group (BRIDG) Model, which represents the clinical research domain in the context of the HL7 Reference Information Model (RIM). In addition, CDISC was recently approved as a Liaison A organization to the International Standards Organization (ISO) Technical Committee 215 for healthcare standards.

Unfortunately, the standards environment is far more complex than three organizations, because many additional standards organizations are developing standards for various purposes, from transferring digital images (DICOM) to purchase order information for consumers (X12). Other types of standards are not for data interchange but serve such purposes as the quality of care or the content of clinical reports (ICH). Fortunately, CDISC has a ''niche'' in the clinical research arena and has made a concerted effort not to duplicate the efforts of others, but rather to ensure productive collaboration.
1.1 CDISC Collaborations and Standards Development

Although CDISC has been working closely with HL7 since 2001 as a partnering standards development organization, additional
alliances and collaborative working relationships have contributed to the progress of CDISC over the past decade of its existence. It is impossible to list all the organizations with which CDISC has worked; however, the relationships with the regulatory authorities in the United States [Food and Drug Administration (FDA)], Europe [European Medicines Agency (EMEA)], and Japan [Ministry of Health, Labor and Welfare] have been particularly meaningful and important for CDISC. In addition, CDISC works with groups such as the pharmaceutical manufacturers associations in the United States, Europe, and Japan (PhRMA, EFPIA, and JPMA); national health institutions and academia; other nonprofit organizations such as the Critical Path Institute and the Pharmaceutical Safety Institute; and other organizations in the healthcare and research arenas. The CDISC teams are multidisciplinary, and the members are volunteers from many different organizations, including biopharmaceutical companies, device companies, academia, and technology and service providers. The volunteers and alliance partners are extremely important to developing standards through an open, consensus-based process (3).

1.2 Benefits of the CDISC Standards

A robust business case study by Gartner (4), with support from PhRMA and CDISC, found that the CDISC standards, when implemented from the beginning of a clinical study, can significantly improve processes, thus saving time and cost (∼60% of the nonsubject participation time, with a 70–90% savings in the start-up stage). Additional benefits beyond cost and time savings include increased data quality; an improved ability to integrate data, which facilitates data reusability in knowledge warehouses to improve science, marketing, and safety surveillance; streamlined data interchange among partners and communication among project team members; facilitated regulatory submissions; and the ability to transfer and integrate data from disparate tools and technologies. The implementation of data standards appears to be a key strategic initiative of several global companies in the biopharmaceutical industry because it will position them better to use
electronic health records for clinical research in the future.

1.3 CDISC Standards and Healthcare

CDISC has a Healthcare Link initiative that is an integral part of the CDISC strategy. Through this initiative, CDISC has performed several projects whose goals are to (1) streamline the ability of investigative site personnel to do research within the workflow of clinical care and (2) enter data electronically only once for multiple downstream purposes, which include safety reporting, clinical trial registration, and results reporting. The CDISC Healthcare Link projects include eSource Data Interchange; ''Single Source,'' which has now been extended to an integration profile; and the BRIDG Model. The eSource Data Interchange (eSDI) project (5) resulted in a document that explains the value and viable means of collecting data only once, electronically, for use in research and clinical care. Specifically, it provides recommendations for eSource systems used to conduct clinical research, as well as examples of various scenarios that can accomplish eSDI in the context of the existing guidelines and regulations for regulated research. ''Single Source'' is the basis for one eSDI scenario. This proof of concept, conducted at Duke University Medical Center and Duke Clinical Research Institute, demonstrated a viable standards-based opportunity for data entered once to be used for both patient care and clinical research. It laid the foundation for the development of an integration profile, working with Integrating the Healthcare Enterprise. The profile, called Retrieve Form for Data Capture (6), was demonstrated in five use cases in January 2007, which include clinical trials, trial registry, biosurveillance, safety reporting, imaging, and lab data (7). It is now being applied in actual implementations in hospital settings, with use cases to support data collection within an electronic health record system for downstream safety reporting to regulatory authorities, academic institutions, the NIH, and institutional review boards, and for use in protocol-based regulated clinical research.
The BRIDG Model mentioned previously is the means to link healthcare and protocol-driven clinical research semantically. It was initiated by CDISC primarily to ensure harmonization of its own standards, in addition to linking these with the healthcare standards of HL7. It is a domain analysis model that represents clinical research in the context of the HL7 Reference Information Model. This open model is now a collaborative initiative, currently with the FDA, the National Cancer Institute, CDISC, and HL7 as the primary stakeholders. The model has been adopted by the HL7 Regulated Clinical Research Information Management (RCRIM) Technical Committee (8) as its domain analysis model and the basis for all of the HL7 RCRIM standards.

1.4 CDISC Standards—Basic Types and Value

CDISC has two basic types of standards today: content standards that define the information, metadata, and terminology, and a transport standard that can carry the standard content, which is the CDISC XML-based Operational Data Model (ODM) (9). The content standards are harmonized in the BRIDG model (or are in the process of being harmonized). In certain cases, additional transport standards can be used to carry CDISC standard content, such as XML-based HL7 messages or even ASCII or SAS. The critical aspect is that the content is semantically consistent and that it is harmonized and represented in an overarching model, the BRIDG model in this case. Semantic interoperability allows different transport mechanisms and various technologies and tools to be used to exchange information in a meaningful and efficient manner. This quality is the value of standards: to support the seamless exchange of information among systems in research and healthcare.

2 AN OVERVIEW OF KEY CLINICAL RESEARCH STANDARDS

The CDISC data standards were developed ''starting with the end in mind.'' It was important to ensure that the CDISC standards
would readily support the end goal of a warehouse and/or an integrated report, particularly a regulatory submission. That being said, the business case emphasized the benefit of collecting the data in the same standard format such that the information would readily flow from end to end. To be efficient and to make use of the varied expertise of the CDISC volunteers, the CDISC standards were developed in parallel. As noted in the prior section, the BRIDG model has served to ensure that the CDISC standards are a harmonized and compatible set, in addition to serving as the link with healthcare standards. Figure 1 depicts the key CDISC standards and how they support the clinical research data flow from end to end, with the content standards being ''carried'' by the transport standard, ODM. It also indicates the linkage with electronic health records. Table 1 briefly describes the key standards relevant to clinical research.

3 ADOPTION OF CLINICAL RESEARCH STANDARDS AND BEST PRACTICES

3.1 Adoption of CDISC Standards

The latest survey on adoption of the CDISC standards, conducted by the Tufts Center for Drug Development in 2007, indicates that adoption has continued to climb since the standards were first made available (10). Adoption logically follows the order in which the standards were released: the Study Data Tabulation Model (SDTM) is being used by 69% of the global companies surveyed and piloted by 22%, and 8% plan to implement it in the future, which leaves only 1% of those surveyed without implementation in progress or anticipated. The Operational Data Model (ODM) is the next most widely adopted CDISC standard: it is used by 32% of those surveyed, is being piloted by 32%, and 34% are planning to implement it. Downloads of the CDISC SDTM have numbered over 11,000 in approximately 40 different countries. The global reach of CDISC also extends to events in India, China, and Australia, along with well-established CDISC Coordinating Committees in Europe and Japan. The CDISC standards and/or initiative outcomes have been included in strategic planning documents in Japan and in documents for EMEA inspectors in Europe (11).
Table 1. Clinical Research Standards. This table lists key standards in the clinical research arena, including those from healthcare that relate to clinical research; the standard name, type, and a brief description are given for each. More information is available at www.cdisc.org, www.bridgmodel.org, and www.hl7.org.

Reference Information Model (RIM) [Object Model]: An object model created by HL7 as a pictorial representation of the clinical data (domains); the RIM identifies the life cycle of events that an HL7 message (transport standard) will carry, explicitly representing the connections between information carried in the fields of HL7 messages.

Biomedical Research Integrated Domain Group (BRIDG) Model [Domain Analysis Model]: Open, collaborative model initiated by CDISC that will bridge the gap between biomedical research and healthcare through the harmonization of standards for research with the relevant healthcare standards of HL7 through the RIM; currently governed by CDISC, HL7, NCI, and FDA.

SDTM [Content Standard]: The CDISC standard for regulatory submission of case-report form data tabulations from clinical research studies; referenced as a specification in the final FDA guidance for eSubmissions.

Operational Data Model (ODM) [Transport Standard]: The XML-based CDISC standard for the acquisition, exchange, reporting or submission, and archival of CRF-based clinical research data.

Analysis Data Model (ADaM) [Content Standard]: The CDISC content standard for regulatory submission of analysis datasets and associated files.

Laboratory Data Model (LAB) [Content Standard]: The CDISC standard for data transfers of laboratory information between clinical laboratories and study sponsors/CROs; this content standard has four different potential implementations (transport methods): HL7 message, ODM XML, SAS XPT, ASCII.

Case Report Tabulation Data Definition Specification (CRTDDS) – define.xml [Content and Transport Standard]: The XML-based CDISC standard referenced by the FDA as the specification for the data definitions for CDISC SDTM datasets; this standard is an extension of the ODM.

Standard for the Exchange of Nonclinical Data (SEND) [Content Standard]: An extension of the CDISC SDTM standard for submission of data from preclinical (nonclinical) or animal toxicology studies.

Protocol Representation [Content Standard]: The standard supporting the interchange of clinical trial protocol information; a collaborative effort with Health Level Seven (HL7).

Trial Design Model (TDM) [Content Standard]: The CDISC standard that defines the structure for representing the planned sequence of events and the treatment plan of a trial; a subset of SDTM and Protocol Representation.

Clinical Data Acquisition Standards Harmonization (CDASH) [Content Standard]: A CDISC-led collaborative initiative to develop the content standard for basic data collection fields in case report forms; CDASH is the collection standard based on SDTM as the reporting standard.

Terminology [Controlled Vocabulary (Content Standard)]: The controlled standard vocabulary and code sets for all CDISC standards.

Glossary [Dictionary of Term Definitions]: The CDISC dictionary of terms and their definitions related to the electronic acquisition, exchange, and reporting of clinical research information; a list of acronyms and abbreviations is also available.
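As a concrete, purely illustrative sketch of the distinction drawn above between content and transport standards, the fragment below expresses a single vital-signs finding using SDTM-style variable names (the content) and then serializes it for exchange as one possible transport. The values are invented, and a real submission dataset would follow the full SDTM implementation guide rather than this toy structure.

import csv, io

# One vital-signs finding expressed with SDTM-style variable names (VS domain).
# All values are invented for illustration.
vs_record = {
    "STUDYID": "ABC-123",       # study identifier
    "DOMAIN": "VS",             # domain code for vital signs
    "USUBJID": "ABC-123-0001",  # unique subject identifier
    "VSTESTCD": "SYSBP",        # test short name (systolic blood pressure)
    "VSORRES": "128",           # result as originally collected
    "VSORRESU": "mmHg",         # original units
}

# The same content could be carried by different transports; here it is
# simply written out as a delimited row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(vs_record))
writer.writeheader()
writer.writerow(vs_record)
print(buffer.getvalue())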
3.2 Critical Path Initiative

In March 2004, through the leadership of Dr. Janet Woodcock, the FDA published a report entitled ''Innovation/Stagnation: Challenge and Opportunity on the Critical Path to New Medical Products'' (12). This report was a unique effort on the part of the FDA to bring attention to the need to modernize the current medical product evaluation and development methodologies to reap public health benefits. The resulting Opportunities List (13) outlines 76 opportunities for public and private partnerships that could potentially help bring the fast-paced biomedical discovery realm closer to therapeutic development, which has a much slower pace. The opportunities fall into the following categories:

• Better Evaluation Tools (e.g., biomarkers) (#1–33)
• Streamlining Clinical Trials (e.g., trial designs, patient response measurements, streamlining the clinical trial process) (#34–45)
• Harnessing Bioinformatics (#46–53)
• Moving Manufacturing into the 21st Century (#54–66)
• Developing Products to Address Urgent Public Health Needs (#67–71)
• Specific At-risk Populations – Pediatrics (#72–76)

The principal activities of CDISC and HL7 fall into the second category, streamlining clinical trials, with Opportunities #44 and #45. In addition, CDISC has an FDA pilot in progress that is addressing the at-risk population of pediatrics through Opportunity #72. Opportunity #45 is Development of Data Standards. It states, ''Lack of standardization is not only inefficient; it multiplies the potential for error.'' It also refers to CDISC as paving the way with its SDTM and notes that ''HL7 and CDISC are working to create standards that can be used for the exchange, management, and integration of electronic healthcare information to increase the effectiveness and efficiency of healthcare delivery.''
Figure 1. Data Flow Using CDISC Standards; Clinical Research Link to Healthcare. This figure shows the different CDISC standards and how they work together to support the flow of clinical research from protocol through reporting for data warehouses and electronic regulatory submissions. It also shows the linkage point between clinical research and healthcare (electronic medical records).
Specifically, #45 calls for Consensus on Standards for Case Report forms. These standards for data collection will facilitate the way that investigators and clinical study personnel and data managers complete and manage the data entered on paper or electronic case report forms. ‘‘Differences in case report forms across sponsors and trials creates opportunities for confusion and error.’’ To address this opportunity, CDISC, with the encouragement of the FDA and the Association of Clinical Research Organizations, formed a Collaborative Group to provide strategic direction and help fund this initiative. Since then, several hundred volunteers have been working to establish the Clinical Data Acquisition Standards Harmonization (CDASH) standards for data collection (14). These standards are based on the SDTM. Opportunity #72 is Better Extrapolation Methods and Best Practices in Pediatrics. It explicitly calls for the creation of an integrated set of pediatric data for quantitative analysis to exploit past experiences based on adults. ‘‘Analysis of such a database could reveal best practices for other aspects of pediatric trial designs as well, enabling sponsors to avoid repeating less useful or inefficient trial designs. As a result, fewer children would be exposed to unnecessary or suboptimal clinical studies.’’ CDISC is working with the FDA on a pilot to use the CDISC standards for integration of data from numerous pediatric studies to evaluate the value of using standards in this context. 3.3 CDISC Pilots with FDA Currently, CDISC is conducting or has conducted four pilots with the FDA to implement the CDISC standards in a controlled and educational manner. One pilot initiated in 2007 involves the Standard for the Exchange of Nonclinical Data (SEND) and standards for pharmacogenomic data developed through the CDISC LAB team and HL7. ‘‘The National Center for Toxicological Research (NCTR) in collaboration with CDER staff is supporting the SEND pilot by hosting SEND data and making them available to CDER reviewers. The goal of NCTR’s participation is to identify key bottlenecks and limitations of current eSubmission
strategies and identify best practices to facilitate data flow at each stage of the data submission pipeline. This activity includes evaluating the Janus data model for SEND data’’ (15). Another FDA pilot that uses CDISC standards is with the Operational Data Model (ODM). This pilot will go through 2008 with six sponsors who have volunteered to submit ODM XML files instead of paper/PDF case report forms. The ODM offers a transport standard for providing CRF data in XML format, along with a standard audit trail that adheres to the 21CFR11 regulations (16). The SDTM/ADaM pilot was the first CDISC-FDA pilot, and it focused generally on eSubmissions. This pilot was initiated in 2006 and was designed to test the effectiveness of data submitted to the FDA using CDISC standards in meeting the needs and the expectations of both medical and statistical FDA reviewers. In doing so, the project also assessed the data structure/architecture, resources, and processes needed to transform data from legacy datasets into the SDTM and ADaM formats and to create the associated metadata and CRTODS. The report from this pilot is now posted openly on the CDISC website so that others can learn from the experiences of the pilot team. The follow-on pilot from this first eSubmission pilot specifically addresses the integration of safety data from multiple pediatric studies (as noted for Opportunity #72 in the Critical Path Opportunity List). This pilot, which is called the Integrated Safety Data Pilot, has a major objective to explore and evaluate best practices for the processes associated with the use of standards for facilitating regulatory eSubmissions and safety assessments. The pilot will evaluate data integration, workflow and processes, and semantic interoperability, as well as standard analysis and reporting. These two eSubmission pilots are addressed in more depth in the next two sections of this article.
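As a rough sketch of what supplying CRF data in XML with an audit trail might look like, the fragment below assembles a simplified, ODM-like structure in Python. The element and attribute names follow the general pattern of the CDISC ODM but are approximations chosen for illustration; the published ODM specification, not this sketch, defines the normative schema.

import xml.etree.ElementTree as ET

# Assemble a simplified, ODM-like fragment for one collected data item.
# Element and attribute names approximate the ODM pattern and are illustrative only.
odm = ET.Element("ODM", FileOID="example-transfer-001")
clinical = ET.SubElement(odm, "ClinicalData", StudyOID="ABC-123")
subject = ET.SubElement(clinical, "SubjectData", SubjectKey="0001")
form = ET.SubElement(subject, "FormData", FormOID="VITALS")
group = ET.SubElement(form, "ItemGroupData", ItemGroupOID="VS")
item = ET.SubElement(group, "ItemData", ItemOID="SYSBP", Value="128")

# A minimal record of who entered the value and when, in the spirit of the
# 21 CFR Part 11 audit-trail requirement mentioned above.
audit = ET.SubElement(item, "AuditRecord")
ET.SubElement(audit, "UserRef", UserOID="site.coordinator.01")
ET.SubElement(audit, "DateTimeStamp").text = "2008-01-15T10:30:00"

print(ET.tostring(odm, encoding="unicode"))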
4 DATA STANDARDS FOR A FINAL STUDY REPORT (FDA: THE FORMAT AND CONTENT OF THE FULL INTEGRATED CLINICAL AND STATISTICAL REPORT OF A CONTROLLED CLINICAL STUDY AND ICH/FDA E3: STRUCTURE AND CONTENT OF CLINICAL STUDY REPORTS)
An earlier FDA guideline, provided in 1988 for the Format and Content of the Clinical and Statistical Sections of New Drug Applications, gave a strong contextual description of an integrated clinical and statistical report. This contextual format was used as a basis for the next generation of an integrated clinical and statistical report developed by the International Conference on Harmonization (ICH). Through a rigorous review and approval process, the ICH E3 Guideline: Structure and Content of Clinical Study Reports was adopted by the United States, Japan, and Europe in 1996 (17). This guideline would presently be viewed as standard metadata or context for a final study report (FSR). If one looks at the Table of Contents of the E3, there is congruency with the standard data domains in SDTM such as demography, vital signs, medical history, exposure, labs, concomitant therapy/medications, adverse events, inclusion/exclusion criteria, and so on. It is important to note that the E3 does not provide extensive standard granularity or data content (e.g., required standard variables, values, units, data definitions, terminology, etc.). A clarification of the requirements of a standard FSR was provided by the ICH eCTD, which was adopted by the FDA in April 2006 as Providing Regulatory Submissions in Electronic Format—Human Pharmaceutical Product Applications and Related Submissions Using the eCTD Specifications (18). The FDA removed all previous guidances for the New Drug Application effective 1 January 2008. The true significance of the FDA eCTD Guidance is the strong references to the CDISC data content models or standards and the required file structure. The electronic Common Technical Document contains five modules, and the Module 5 contextual structure provides the Clinical Study Reports Folder requirements. Very briefly
presented here are initial comments regarding the requirements for an FSR (note the reference to the E3 metadata structure).
4.1 Study Reports
Typically, clinical study reports are provided as more than one document, based on the ICH E3 guidance document, when a study is submitted. The individual documents that should be included in a study report are listed below:
• Synopsis (E3 2)
• Study report (E3 1, 3–15)
• Patients excluded from the efficacy studies (E3 16.2.3)
• Demographic data (E3 16.2.4)
• Compliance and/or drug concentration data (E3 16.2.5)
• Individual efficacy response data (E3 16.2.6)
• Adverse event listings (E3 16.2.7)
• Listing of individual laboratory measurements by patient (E3 16.2.8)
• Case report forms (E3 16.3)
• Individual patient data listings (CRTs) (E3 16.4)
Some of the most detailed data content requirements of the eCTD Guidance are presented in an associated document on the FDA website called the Study Data Specifications. For example, the specifications for organizing study datasets and their associated files in folders found in the Study Data Specification document are summarized in Figure 2. In parallel with the dataset file structure, the requirements for the use of CDISC data content models such as SDTM (or SEND), CRTDDS (define.xml), or using the HL7 normative standard for creating the annotated ECG waveform data files are also provided in the eCTD Datasets Study Data Specification document. Most importantly, we have entered a new era for the use and application of standard data content that will greatly enhance the usefulness of the previously available standard FSR metadata or context. Also, to support standard data collection, the e-CTD Guidance provides a contextual requirement
for case report forms that are enhanced by the previously described CDISC CDASH model. Again, we have not previously had standardized case report forms in the industry; having standard CRFs based on SDTM content will improve the workflow and process for the reporting of both safety and efficacy data. To demonstrate that all these contextual and content standards would actually work (using the eCTD file structure, the context, or metadata of the E3 and the related requirements for CDISC standards in the Study Data Specification), CDISC and the FDA performed a pilot for the generation, submission, and review of a mock FSR by the FDA. The pilot submission package included one abbreviated study report that documented the pilot project team’s analyses and reporting of the legacy clinical trial data. The purpose of providing a study report was to test the summarizing of results and the linking to the metadata, as well as providing results or findings for the regulatory review team to review and/or reproduce. Accompanying the study report were the tabulation datasets, analysis datasets, Define.xml files that contain all associated metadata, an annotated case report form (aCRF), and a reviewer’s guide. The report used data from a Phase II clinical trial provided by Eli Lilly and Company. The trial data was mapped to SDTM as the simple analytic view of the raw or observed data to begin the process described above for generating a standard report. The report and results of the SDTM/ADaM CDISC/FDA Pilot are posted on the CDISC website for use by the industry.
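To give a concrete, if simplified, picture of the dataset organization the Study Data Specifications describe (summarized in Figure 2), the following Python sketch lays out one possible folder tree for a single study under Module 5. The folder names (tabulations/sdtm, analysis/adam) and the placement of define.xml files are assumptions made for illustration; the authoritative layout is the one given in the FDA Study Data Specifications document itself.

```python
from pathlib import Path

# Illustrative only: an assumed Module 5 dataset layout for one study, with
# tabulation (SDTM) and analysis (ADaM) datasets each accompanied by a
# define.xml (CRTDDS) metadata file.
root = Path("m5") / "datasets" / "study-001"
for sub in ("tabulations/sdtm", "analysis/adam"):
    (root / sub).mkdir(parents=True, exist_ok=True)

(root / "tabulations" / "sdtm" / "define.xml").touch()
(root / "analysis" / "adam" / "define.xml").touch()

for path in sorted(root.rglob("*")):
    print(path)
```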
5 STANDARD DATA FOR REGULATED AND REQUIRED WRITTEN ADVERSE EVENT SAFETY REPORTS [INDIVIDUAL CASE SAFETY REPORT (ICSR) AND PERIODIC SAFETY UPDATE REPORT (PSUR)]
Reporting drug safety events and data to regulatory authorities generally is classified into two categories that we normally describe as preapproval (under an investigational drug application) and postmarket approval reports (under a marketing approval license). An expedited safety report for either preapproval or postapproval reporting is described as a transactional process, meaning that from the time of the adverse event to the submission to a regulatory agency, considerable data are gathered, processed, evaluated, and reported as required by the appropriate laws or regulations. The preapproval reports have the benefit of, more often than not, having parallel control data. As we know, the postapproval reports frequently do not have supportive control data but do generally have a much larger database. The postapproval adverse event analysis expands the risk management aspect of reporting, and it becomes more strategic after approval because of a lack of control data, the presence of frequent uncontrolled concomitant medications (polytherapy, polypharmacy), and unpredicted dosing and off-label use. Continual assessment in relationship to pharmacological class and comparator medications becomes essential. Standard data and standard data collection greatly enhance this process and will build the relationship to translational medicine and clinical pharmacology. They will also support the need to correlate phenotype to biomarkers expressed by the genome as signals of
Figure 2. Specifications for organizing study datasets and their associated files in folders.
severe or unexpected untoward events. Historically, efforts were made to standardize safety data reporting by generating the ICH Medical Dictionary for Regulatory Activities (MedDRA) and the ICH E2B(R3)/E2BM2 (Data Element for Transmission of Individual Case Safety Reports). The MedDRA provides a standard international terminology for describing an adverse event with a hierarchal and standardized structure of terminology from system organ classification to preferred, verbatim, and lowest level term. The standard element of transmission of an expedited safety report (IND Initial Written Report) on adverse and unexpected events in the preapproval arena uses the MedWatch 3500A form. This 3500A form and standard elements are used in the FDA/ICH supported ICSR (Individual Case Safety Report) for postmarketing expedited reports as well. These expedited safety reports using the MedDRA and E2BM2 standard data elements are summarized in the PSUR (Periodic Safety Report) or in an IND Annual Safety Report or any time a Standard Summary Safety report is required by the FDA. An HL7 message is being tested to carry the ICSR content. The requirement for standardized safety reporting is growing exponentially to enable integration of large datasets from multiple sources, and it is driven by increasing standardization by the global regulatory authorities. For example, the FDA’s Sentinel Network for a federated network of both private and public safety databases will best be served by standard data and standard data collection. The FDA has taken a strong position that the content standard will be CDISC data models. Similarly, collaborative efforts are led by the NIH to standardize safety data reporting and monitoring across federal agencies using the BAER (Basal Adverse Event Report). These NIH, NCI, FDA, CDISC, and HL7 have worked together to harmonize their various AE reporting standards within the BRIDG model to support interoperability and to facilitate information exchange from preapproval through postmarketing for a therapy. CDISC has led these harmonization activities. The NIH reporting using BAER will be in parallel with the new FDA initiatives for MedWatch Plus and
the new FDA Adverse Event Reporting System. Also, the FDA’s four recent Guidances for Pre-Approval Safety, Risk Minimization Action Plans, Pharmacovigilance, and Pharmacoepidemology and the Good Review Practice, Conducting a Clinical Safety Review of a New Product Application and Preparing a Report on the Review, are all soon to embrace electronic-data and SDTM (the simple analytic view of the observed data). A new pilot between the FDA and CDISC regarding the application of CDISC data standards to support integrated safety data analysis was initiated in 2007 (19). The overall mission of the second iteration of the CDISC/FDA Pilot (the Integrated Safety Data Submission pilot) project team is to demonstrate that a safety data submission (including safety analysis plan and key safety statistical analyses) created using the CDISC standards will meet the needs and expectations of FDA reviewers in conducting an integrated and appropriately directed safety review of data from multiple studies, compounds, and sponsors. During the process of preparing, aggregating, and analyzing the data, the project team will assess the applicability of the CDISC standards to integrated data, identifying any issues/questions to be addressed by the CDISC standards development teams. A major objective of this pilot is to explore and evaluate best practices for the processes associated with the use of standards for facilitating regulatory eSubmissions and safety assessments. The pilot will evaluate data integration, workflow and process, semantic interoperability, and standard analysis and reporting. Practice datasets from eight clinical studies from multiple submissions that span several products have been deidentified and altered for confidentiality and to include intentional signals for analysis by the pilot team. The planned deliverable is submission of an integrated data summary that would support a safety review and analysis. Last, it is in the scope of the pilot team’s mission to serve the FDA Critical Path Opportunities Number 44—Propose Regulations to Require Electronic Submission of Study Data, Number 45—Streamlining Clinical Trial Data Collection and Number 72—Develop Pediatric Trials Database.
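As a rough illustration of what data integration across studies can look like once adverse event data share the SDTM structure, the sketch below pools AE records from several studies and produces a simple cross-study summary. The SDTM variable names (STUDYID, USUBJID, AEDECOD) are standard AE-domain variables; the file names and the idea of reading them from CSV extracts are assumptions made only for this example.

```python
import pandas as pd

# Hypothetical pooling of SDTM AE domain extracts from several studies.
studies = ["ae_study1.csv", "ae_study2.csv", "ae_study3.csv"]  # assumed files
ae = pd.concat((pd.read_csv(f) for f in studies), ignore_index=True)

# Integrated summary: number of unique subjects reporting each dictionary-
# derived (MedDRA preferred) term, by study.
summary = (ae.groupby(["STUDYID", "AEDECOD"])["USUBJID"]
             .nunique()
             .rename("n_subjects")
             .reset_index()
             .sort_values(["STUDYID", "n_subjects"], ascending=[True, False]))
print(summary.head(10))
```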
The CDISC standards are now available in production versions to support the transport, management and analysis, submission, and archive of electronic clinical research data. Standards to support data collection and protocol representation will be in production versions in 2008. CDISC will continue to support the FDA pilots and industry implementations of these standards to augment the best practices and case study results that have already been compiled. Other areas, including terminology, will also continue to be enhanced and harmonized into the BRIDG model for the open support of systems that can become increasingly interoperable within clinical research and with healthcare. Continued contributions from the industry will only serve to make these standards more valuable in streamlining processes, which improves data quality and ultimately enhances patient safety. REFERENCES 1. Clinical Data Interchange Standards Consortium. Available: http://www.cdisc.org/. 2. Health Level Seven. Available: http://www. hl7.org/. 3. http://www.cdisc.org/about/bylaws pdfs/ CDISC-COP-001-StandardsDevelopmentFeb2006.pdf. 4. C. Rozwell, R. Kush, and E. Helton, Save time and money. Appl. Clin. Trials 2007; 16(6). 70 5. eSource Data Interchange Group, Leveraging the CDISC Standards to Facilitate the use of Electronic Source Data within Clinical Trials, V 1.0, 20 2006. Available: http://www.cdisc.org/eSDI/eSDI.pdf. 6. Retrieve Form for Data Capture (RFD) Integration Profile. Available: http://wiki.ihe.net/ index.php?title=Retrieve Form for Data Capture. 7. Interoperability Showcase Press Release. Available: http://www.cdisc.org/news/10-042007 PR29 HIMSS Interoperability Showcase.pdf. 8. Regulated Clinical Research Information Management (RCRIM) Technical Committee. Available: www.hl7.org/Special/committees/rcrim/index.cfm. 9. Operational Data Model. Available: http:// www.cdisc.org/models/odm/v1.3/index.html.
10. D. Borfitz, Survey: New Mindset about eClinical Technologies, Standards. eCliniqua. 2007.
11. The Draft Reflection Paper on Expectations for Electronic Source Documents used in Clinical Trials, European Medicines Agency. Available: http://www.emea.europa.eu/Inspections/docs/50562007en.pdf.
12. Food and Drug Administration, Innovation/Stagnation: Challenge and Opportunity on the Critical Path to New Medical Products. Available: http://www.fda.gov/oc/initiatives/criticalpath/whitepaper.pdf.
13. Food and Drug Administration, Critical Path Opportunities List. Available: http://www.fda.gov/oc/initiatives/criticalpath/reports/opp_list.pdf.
14. Clinical Data Acquisition Standards Harmonization. Available: http://www.cdisc.org/standards/cdash/index.html.
15. SEND Pilot. Available: http://www.fda.gov/oc/datacouncil/send.html.
16. Title 21 Food and Drugs Chapter; Food and Drug Administration, Department of Health and Human Services, Part 11 Electronic Records; Electronic Signatures. Available: http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfCFR/CFRSearch.cfm?CFRPart=11.
17. International Conference on Harmonisation, ICH Harmonised Tripartite Guideline: Structure and Content of Clinical Study Reports. Available: http://www.ich.org/LOB/media/MEDIA479.pdf.
18. Guidance for Industry: Providing Regulatory Submissions in Electronic Format—Human Pharmaceutical Product Applications and Related Submissions Using the eCTD Specifications. Available: http://www.fda.gov/cder/guidance/7087rev.pdf.
19. CDISC SDTM/ADaM Pilot Team, CDISC SDTM/ADaM Pilot Project-Project Report, January 2008. Available: http://www.cdisc.org/publications/SDTMADaMPilotProjectReport.pdf.
CROSS-REFERENCES Metadata
DERMATOLOGY TRIALS
STEVEN R. FELDMAN
ADELE R. CLARK
Wake Forest University School of Medicine, Winston-Salem, North Carolina

Skin diseases are among the most common human afflictions. These conditions have as much impact on patients' quality of life as other major medical conditions (1). As the skin forms the interface with the environment, it has complex and important immune functional roles. Perhaps as a result, numerous different inflammatory skin diseases exist. One of the most common is psoriasis. Many new psoriasis treatments are being developed based on advancements in our understanding of the immune system (2). Topical corticosteroids continue to be used for psoriasis. Although topical treatment offers the opportunity to use potent medications with little risk of systemic side effects, topical treatments are inconvenient (3,4). New delivery systems are being developed. A trial of a topical clobetasol spray formulation presents a good model for dermatology trials (5).

1 OBJECTIVES
The objective of this study was to assess the efficacy and safety of clobetasol propionate spray 0.05% in the treatment of moderate to severe plaque-type psoriasis (5). Clobetasol is a corticosteroid with very potent anti-inflammatory properties. Clobetasol is already available in ointment, cream, gel, and other formulations. Dermatologic dogma suggests that moisturizing vehicles are best suited for psoriasis treatment (6). Yet patients may perceive such vehicles as messy and inconvenient (6). An easy-to-use, nonmessy clobetasol formulation was developed. The primary hypothesis of this study was that more subjects treated with clobetasol propionate spray would achieve success than would subjects treated with placebo spray (5). The study would provide an assessment of the efficacy of the spray that could be compared with results in the literature for traditional clobetasol ointment products. This Phase III study was performed to support registration of the drug.

2 STUDY DESIGN
The study was a 4-week, multicenter, randomized, double-blinded, vehicle-controlled, parallel-group comparative study of clobetasol propionate 0.05% spray in subjects with moderate to severe psoriasis (5).

2.1 Study Drug
Clobetasol propionate 0.05% spray (Clobex spray; Galderma Laboratories, L.P., Fort Worth, TX) and the spray vehicle were compared. A key issue in topical treatment studies is that the vehicle may have important effects of its own, including direct effects on the skin and indirect effects on the delivery of the active drug through the skin. "Inactive" components of the vehicle can be irritating and can cause allergic reactions. Moreover, patients' adherence to treatment will be affected by the vehicle (4,6). In this study, the vehicle was composed of alcohol, isopropyl myristate, sodium lauryl sulfate, and undecylenic acid (7).

2.2 Study Population
In all, 120 subjects with moderate to severe psoriasis were recruited. Subjects could be of either sex, had to be at least 18 years of age, and had to present with an area of plaque psoriasis that covered at least 2% of the body surface area. The face, scalp, groin, axillae, and other intertriginous areas were excluded because clobetasol is too potent to be used on these areas. Subjects were required to have a target lesion with a severity (accounting for scaling, erythema, and plaque elevation) of at least 3 (moderate) on a scale ranging from 0 (none) to 4 (severe/very severe). Including plaques of this degree of severity is important to help define the psoriasis as moderate to severe. It also gives room to define improvement to clear (0) or almost clear (1) as success. Women of childbearing potential had to have a negative urine pregnancy test and had to agree to use an effective nonprohibited
form of birth control for the duration of the study. Prior to entering the study, subjects had to respect treatment-specific wash-out periods as defined by the protocol regarding any topical and systemic treatments known to affect psoriasis, which includes immunomodulatory therapies and exposure to natural or artificial UV radiation. Normally, this duration is about 2 weeks for topical treatments, 1 month for phototherapy or oral systemic medications, and 3 months for long acting agents, which include some biologic treatments. As patients wash out from topical medication, some flaring of the disease can occur at study entry; nearly all trials want to evaluate subjects with stable disease. Fortunately, patients tend to be noncompliant and are already largely off the medications they should be using. In the case of psoriasis, seasonal variations in disease can occur. Thus, it tends to be easier to recruit patients in the late fall and winter when the disease tends to be worse. If the treatment lasts 6 months, however, patients in placebo groups tend to improve as the weather warms and they begin to have ultraviolet light exposure. 2.3 Treatment Half the subjects were randomized to twicedaily use of clobetasol propionate spray 0.05% for 4 weeks and the other half to vehicle spray. Topical clobetasol is highly effective, so a 2- to 4-week study interval is often sufficient to observe efficacy. The study period may be included in the FDA-approved drug label. A short period can be used in marketing to promote the high efficacy of a product. A longer duration can be useful in marketing the safety of the product. In short-term studies such as this one, a vehicle control group may be entirely reasonable. In longer term studies (3 months or more), patients in the vehicle or placebo control arm may switch over to active treatment after a predetermined period. This arrangement helps promote recruitment to the study as well as retention in the placebo arm. The crossover is particularly important for studies of patients with more severe disease, as they might not be willing to risk entering a
study and being on placebo, nor may they be willing to wait several months on placebo if no ‘‘carrot’’ of active drug is offered at the end of the trial. Subjects were also instructed to separate applications by at least 8 hours, not to wash the treated area for at least 4 hours after treatment, and not to apply the study medication within 4 hours prior to study visits. One should recognize that patients are often noncompliant to topical treatments, even in a clinical trial (8,9). Without objective confirmation, researchers have little confidence that the recommendations were followed. The treatment phase was followed by a 4-week treatment-free follow-up period. This period is required to assess how persistent the effect of the product is and whether the condition has any tendency to rebound when the product is discontinued (10). 2.4 Efficacy Assessments The study was evaluated at baseline and weeks 1, 2, 4, and 8 (follow-up 4 weeks after the end of treatment). The frequent follow-up in the clinical trial promotes better adherence than is observed in clinical practice (11). Thus, many medications seem to work better in the clinical trial than in clinical practice. The investigators assessed scaling, erythema, plaque elevation, and overall disease severity using a 5-point scale (0 = clear, 1 = almost clear, 2 = mild, 3 = moderate, 4 = severe/very severe). Although investigator assessments of severity tend to be more objective than subjects’ self-assessments, considerable subjectivity may occur in investigators’ evaluation of these parameters. This subjectivity may be a cause of apparent placebo responses in dermatologic trials (12). In some studies, training sessions are held at investigator meetings to reduce variation between investigators’ efficacy assessments, yet even then a wide range of responses may occur. Using the same evaluator may be more critical in dermatology trials than in studies that use more objective tests as outcomes. Subjects’ assessments of pruritus were also assessed. Dermatology trials often include a measure of subjects’ quality of life; the quality of life measure confirms that objective changes in skin lesions actually result in improvement in the patient’s life (13).
Use of success rates as a primary response measure facilitates intent to treat analyses and shows regulators that the drug makes a meaningful improvement for subjects. In this study, success was defined as a grade of 2 (mild) or less on the 5-point scale at week 2 and defined as a grade of 1 (almost clear) or less on the same scale at the end of treatment at week 4. Many trials of treatments for moderate to severe psoriasis employ the Psoriasis Area and Severity Index (PASI). Although the PASI provides a quantitative assessment of severity, the scores are not easy to interpret clinically, and the measure is not used much in clinical practice. Using ‘‘clear or almost clear’’ as a success measure gives clinicians a parameter that is meaningful in their care of patients. 2.5 Safety Assessments Safety was evaluated by reported adverse events, which include skin atrophy, telangiectasia, burning/stinging, and folliculitis. Clinical signs and symptoms of hypothalamus–pituitary–adrenal (HPA) axis suppression were also assessed. This method is an insensitive way to assess HPA axis suppression. Adrenal stimulation tests are used in other studies to assess more fully the potential for HPA axis suppression (14). 2.6 Statistical Analyses The intent-to-treat population (all subjects enrolled and randomized) was used to assess success rates. Safety evaluations included all subjects who received treatment. The success rate of all parameters was calculated for each time point using a Cochran–Mantel–Haenszel test to confirm the superiority of clobetasol propionate spray 0.05%. Statistical significance of the above efficacy parameters was based on two-tailed tests of the null hypothesis, which resulted in P values of .05 or less. 3 RESULTS 3.1 Efficacy At 2 weeks, 52 subjects (87%) achieved success with clobetasol propionate versus 17 subjects (28%) treated with the spray vehicle (P < 0.001). Despite the more restrictive
criteria for success at week 4 than at week 2 (subjects had to be almost or completely clear at week 4, but mild disease or better was considered success at week 2), 47 subjects (78%) achieved success with 4 weeks of treatment with clobetasol propionate compared with 2 subjects (3%) treated with vehicle (P < 0.001).
3.2 Safety
Burning/stinging was reported by 14 subjects (23%) within 15 minutes after the first application of clobetasol propionate and by 13 subjects (22%) after application of the vehicle; most reports were mild in severity. No cutaneous signs of skin atrophy, telangiectasia, or folliculitis were detected in either group throughout the duration of the study. No adrenal suppression was identified. Although the subjects tolerated burning and stinging well in the trial, many patients in clinical practice will not be so tolerant of side effects. Side effects seem to be better tolerated in study settings for several reasons. First, research subjects may be highly motivated. Second, frequent follow-up visits are required, so subjects may be thinking they have to put up with the irritation for a shorter period of time than the typical clinic patient does. Third, at the return visits the study staff may act as cheerleaders to help keep the subject motivated to use the medication. Fourth, continuing to be in the study may mean continuing to collect remuneration. One wonders sometimes whether some subjects stay in the study even when they have stopped using the medication.
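As a rough numerical check of the week-4 efficacy comparison reported above, the sketch below runs an unstratified test on a single 2x2 table. The published analysis used a Cochran–Mantel–Haenszel test (which stratifies, for example, by center), so this is only an approximation; the counts assume 60 subjects per treatment arm, as implied by the reported percentages.

```python
from scipy.stats import chi2_contingency, fisher_exact

# Week-4 success counts assumed from the reported percentages (78% vs 3%,
# 60 subjects per arm); an unstratified approximation to the CMH analysis.
table = [[47, 60 - 47],   # clobetasol spray: success, failure
         [2, 60 - 2]]     # vehicle spray:    success, failure

chi2, p_chi2, _, _ = chi2_contingency(table, correction=True)
odds_ratio, p_exact = fisher_exact(table)
print(f"chi-square p = {p_chi2:.2e}, Fisher exact p = {p_exact:.2e}")
```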
4 CONCLUSIONS
The clobetasol propionate 0.05% spray product was highly effective for moderate to severe plaque-type psoriasis and was superior to its vehicle (5). As is typical of psoriasis trials, the study was powered to assess efficacy. Assessing safety is far more difficult. In this group of subjects, the treatment was safe and well tolerated. Superpotent corticosteroids are often labeled for only up to 2 weeks of continuous use. The 4-week treatment period in this study yielded greater clinical benefit with no detectable change in the safety profile. Nevertheless, our certainty in safety is limited by
the size of the study, and the study provides no evidence on which to base conclusions about the longer-term safety profile. Dermatologic dogma suggests that moisturizing, occlusive ointment preparations are most appropriate for treating a dry, scaly condition like psoriasis (15,16). Furthermore, dogma suggests that keratolytic treatments (such as topical salicylic acid) should be used to reduce scale and enhance penetration of topical psoriasis treatments (15,16). The high efficacy in this study of a nonmoisturizing product used without any keratolytic agent goes contrary to established dogma. The effectiveness of superpotent topical corticosteroids in plaque-type psoriasis probably depends more on whether the patient applies the product than on the moisturizing, occlusive, or keratolytic properties of the vehicle. The spray formulation of clobetasol propionate 0.05% is easy to apply and may provide a valuable alternative for specific patient groups.
REFERENCES
1. S. R. Rapp, S. R. Feldman, M. L. Exum, A. B. Fleischer Jr., and D. M. Reboussin, Psoriasis causes as much disability as other major medical diseases. J. Am. Acad. Dermatol. 1999; 41: 401–407.
2. E. M. Berger and A. B. Gottlieb, Developments in systemic immunomodulatory therapy for psoriasis. Curr. Opin. Pharmacol. 2007; 7: 434–444.
3. K. K. Brown, W. E. Rehmus, and A. B. Kimball, Determining the relative importance of patient motivations for nonadherence to topical corticosteroid therapy in psoriasis. J. Am. Acad. Dermatol. 2006; 55: 607–613.
4. K. G. Bergstrom, K. Arambula, and A. B. Kimball, Medication formulation affects quality of life: a randomized single-blind study of clobetasol propionate foam 0.05% compared with a combined program of clobetasol cream 0.05% and solution 0.05% for the treatment of psoriasis. Cutis 2003; 72: 407–411.
5. M. T. Jarratt, S. D. Clark, R. C. Savin, L. J. Swinyer, C. F. Safley, R. T. Brodell, and K. Yu, Evaluation of the efficacy and safety of clobetasol propionate spray in the treatment of plaque-type psoriasis. Cutis 2006; 78: 348–354.
6. T. S. Housman, B. G. Mellen, S. R. Rapp, A. B. Fleischer Jr., and S. R. Feldman, Patients with psoriasis prefer solution and foam vehicles: a quantitative assessment of vehicle preference. Cutis 2002; 70: 327–332.
7. Clobex® Spray Prescribing Information. Fort Worth, TX: Galderma Laboratories, L.P., 2005.
8. J. Krejci-Manwaring, M. G. Tusa, C. Carroll, F. Camacho, M. Kaur, D. Carr, A. B. Fleischer Jr., R. Balkrishnan, and S. R. Feldman, Stealth monitoring of adherence to topical medication: adherence is very poor in children with atopic dermatitis. J. Am. Acad. Dermatol. 2007; 56: 211–216.
9. C. L. Carroll, S. R. Feldman, F. T. Camacho, J. C. Manuel, and R. Balkrishnan, Adherence to topical therapy decreases during the course of an 8-week psoriasis clinical trial: commonly used methods of measuring adherence to topical therapy overestimate actual use. J. Am. Acad. Dermatol. 2004; 51: 212–216.
10. K. B. Gordon, S. R. Feldman, J. Y. Koo, A. Menter, T. Rolstad, and G. Krueger, Definitions of measures of effect duration for psoriasis treatments. Arch. Dermatol. 2005; 141: 82–84.
11. S. R. Feldman, F. T. Camacho, J. Krejci-Manwaring, C. L. Carroll, and R. Balkrishnan, Adherence to topical therapy increases around the time of office visits. J. Am. Acad. Dermatol. 2007; 57: 81–83.
12. J. Hick and S. R. Feldman, Eligibility creep: a cause for placebo group improvement in controlled trials of psoriasis treatments. J. Am. Acad. Dermatol. 2007; 57: 972–976.
13. S. R. Feldman and G. G. Krueger, Psoriasis assessment tools in clinical trials. Ann. Rheum. Dis. 2005; 64(suppl 2): ii65–68; discussion ii69–73.
14. J. L. Jorizzo, K. Magee, D. M. Stewart, M. G. Lebwohl, R. Rajagopalan, and J. J. Brown, Clobetasol propionate emollient 0.05 percent: hypothalamic-pituitary-adrenal-axis safety and four-week clinical efficacy results in plaque-type psoriasis. Cutis 1997; 60: 55–60.
15. T. P. Habif, Psoriasis and Other Papulosquamous Diseases. Clinical Dermatology: A Color Guide to Diagnosis and Therapy. Philadelphia, PA: Mosby, 2004.
16. A. Rook, T. Burns, S. Breathnach, N. Cox, and C. Griffiths, Rook's Textbook of Dermatology. 2004.
FURTHER READING J. Garduno, MJ. Bhosle, R. Balkrishnan, and S. R. Feldman, Measures used in specifying psoriasis lesion(s), global disease and quality of life: a systematic review. J. Dermatol. Treat. 2007; 18: 223–242. C. E. Griffiths, C. M. Clark, R. J. Chalmers, A. Li Wan Po, and H. C. Williams, A systematic review of treatments for severe psoriasis. Health Technol. Assess. 2000; 4: 1–125. C. Hoare, A. Li Wan Po, and H. Williams, Systematic review of treatments for atopic eczema. Health Technol. Assess. 2000; 4: 1–191. K. T. Hodari, J. R. Nanton, C. L. Carroll, S. R. Feldman, and R. Balkrishnan, Adherence in dermatology: a review of the last 20 years. J. Dermatol. Treat. 2006; 17: 136–142. I. A. Lee and H. I. Maibach, Pharmionics in dermatology: a review of topical medication adherence. Am. J. Clin. Dermatol. 2006; 7: 231–236. H. P. Lehmann, K. A. Robinson, J. S. Andrews, V. Holloway, and S. N. Goodman, Acne therapy: a methodologic review. J. Am. Acad. Dermatol. 2002; 47: 231–240.
CROSS-REFERENCES Dosage Form Outcomes Center for Drug Evaluation and Research (CDER) Controlled Clinical Trials
DESIGNS WITH RANDOMIZATION FOLLOWING INITIAL STUDY TREATMENT
CORNELIA DUNGER-BALDAUF
Novartis Pharma AG, Basel, Switzerland

When an innovative treatment is being investigated, reliable evaluation of its benefit for patients is required. The double-blind, randomized, controlled clinical trial with parallel treatment groups has become widely accepted as the gold standard to achieve this goal. In such a trial, patients are initially randomized to a specific treatment or regimen on which they remain throughout the duration of the study. At the end of the treatment period, the benefit to patients is compared between treatment groups. Despite the proven success of the standard design, there remain scientific questions or situations for which it does not offer the most appropriate solution, such as in diseases with serious morbidity where maintaining a full placebo group to the end of a parallel design trial may be considered to be medically and ethically unsatisfactory. In 1975, Amery and Dony (1) proposed a "clinical trial avoiding undue placebo treatment" to evaluate drugs for the treatment of angina pectoris. According to this design, all patients are initially treated with a test drug. After titration to the desired outcome, responders after the first stage are randomized to either continued treatment or placebo. Amery showed that this design and a more conventional design gave similar results (2). Further situations where the standard trial design is difficult to apply are described in this article, followed by a discussion of the designs that address the difficulties.

1 RATIONALE FOR RANDOMIZATION FOLLOWING INITIAL STUDY TREATMENT
Difficulties in applying the standard parallel-group design were noted in the following situations:
1. In the evaluation of long-term efficacy of a drug, long-term placebo treatment is not always feasible, for example, for cancer or psychiatric diseases. If an active control is available, non-inferiority studies might be considered, but these may require large sample sizes, and the determination of a non-inferiority margin can be difficult.
2. For major depression, several stages of treatment can be distinguished: treatment of acute symptoms, and for patients who respond, the continuation phase (to prevent relapse), and the maintenance phase (to prevent recurrence). Parallel-group, placebo-controlled trials aimed at evaluating relapse and/or recurrence include the so-called extension studies. These studies have an initial short-term, parallel-group, placebo-controlled phase. The second stage includes only stage I responders, who continue the treatment with the same medication under double-blind conditions. This design has the ethical advantage that only responders are exposed to study treatment in the second stage, but the possibility of imbalance between treatment groups (i.e., active treatment responders and placebo responders) at the start of the extension stage cannot be excluded, especially following differential dropout in stage I. This was noted by Storosum et al. (3) as a major methodological shortcoming of this design, making interpretation of extension study results difficult.
3. For progressive diseases like Alzheimer's or Parkinson's disease, it is important to determine the effect of a treatment on symptoms (short term) and on disease progression (long term). This could be easily addressed in the gold standard design if markers for the progression of the diseases were available, but such markers have not yet been established. Instead, both effects have to be inferred from the patient's performance status over time. In the
comparison of treatments at the end of a standard trial, the short-term and long-term effects would not be distinguishable. 4. For a disease with fluctuating symptoms, the effect of continuous treatment observed in a standard trial might be diluted, as patients have symptom-free periods where the efficacy of a drug cannot be established. A parallel-group trial with several periods of intermittent treatment (short-course administration for a predetermined time period after symptom recurrence) can be considered. However, due to potentially different dropout patterns, initially comparable treatment groups might no longer be comparable at the start of repeated treatment such that the comparison of repeated treatment in this design becomes questionable. Extensions of the standard design have been proposed to address these questions. Randomized withdrawal designs can be considered for situations 1 and 2, randomized delayed-start designs for situation 3, and retreatment designs for situation 4. Instead of (re)randomizing patients following initial study treatment (at the start of the second stage), a randomization to treatment sequences could be performed upfront before the trial. However, in the presence of dropouts or in case of non-eligibility of some patients after the first stage, randomization of only the patients who enter the second stage is more likely to generate balanced treatment groups. 2
RANDOMIZED WITHDRAWAL TRIALS
In a randomized withdrawal trial (Figure 1), patients receiving a test treatment for a specified time are randomly assigned to continued treatment with the test drug or placebo (i.e., withdrawal of active therapy) (4). Randomization can be performed for all patients who are available to enter the withdrawal stage, or for a selected population such as responders to initial treatment. If possible, the withdrawal stage would be conducted in a double-blind fashion. The trial could start
with a single open-label treatment, but also with several active treatments. The initial treatment stage can be of any length. The withdrawal stage could be designed to have a fixed duration, or to use a time-to-event approach (e.g. relapse of depression) where the time of placebo exposure for a patient with poor response would be kept to a minimum. Treatment effects are assessed based on the differences between continued active treatment and placebo.
2.1 Application of the Approach
There are various situations where the randomized withdrawal design can be useful. It is important to consider which effects are reflected by the treatment difference in the withdrawal stage. A situation that could be interpreted reasonably well would be characterized by a chronic, stable disease where a treatment improves the patient's health status by alleviating signs or symptoms, but a major change in health status (e.g., curing the disease or considerable deterioration) is not expected. Upon withdrawal of the test drug, the patient's response is thought to return to the placebo level. Under these conditions, similar results would be observed in a parallel-group, placebo-controlled, long-term efficacy trial and a randomized withdrawal trial. This may be the case for chronic pain or mild to moderate hypertension where a long-term placebo-controlled trial would be difficult to perform. A randomized withdrawal trial is an alternative with considerably reduced placebo exposure. In non-chronic major depression, the design can be used to evaluate the treatment effect of a test drug in the continuation/maintenance phase. Following treatment of acute symptoms, responders are randomized to continue with test drug treatment or to receive placebo. The reappearance of symptoms (relapse and/or recurrence) is then compared between treatments (3). Randomized withdrawal can also be employed for dose finding. In a dose-response study of a treatment for hypertensive children (5), patients were assigned to a low, middle, and high dose group at the beginning of a 2-week double-blind, randomized, dose-response period. In the high dose group,
Figure 1. Designs for studies with randomization following initial study treatment: the randomized withdrawal design, the randomized delayed-start design, and the randomized re-treatment design. TFI: treatment-free interval.
the initial dose was doubled after 3 days unless adverse events or hypotension were observed. On completion of the dose-response stage, patients entered a 2-week double-blind withdrawal stage, randomized to either their last dose or placebo. The study was useful in assessing effectiveness of the dose groups and the dose-response relationship, and the value of up-titration. The randomized withdrawal design is also suitable to assess how long a therapy should be continued (e.g. post-mastectomy adjuvant treatment for women with operable breast cancer (6)). Multiple withdrawal stages can be planned if more than two therapy durations are studied.
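To make the logic of the withdrawal stage concrete, the following sketch simulates a hypothetical randomized withdrawal trial in which only responders to open-label treatment are randomized to continue the test drug or switch to placebo, and relapse is then compared between the two arms. All numbers (sample size, response rate, relapse risks) are invented for illustration and do not come from any of the studies cited here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical randomized withdrawal trial (illustrative numbers only).
n = 400
responded = rng.random(n) < 0.6                   # stage 1: open-label test drug
n_resp = int(responded.sum())

# Stage 2: responders randomized 1:1 to continued drug or withdrawal to placebo.
continue_drug = rng.random(n_resp) < 0.5
p_relapse = np.where(continue_drug, 0.15, 0.40)   # assumed relapse risks
relapse = rng.random(n_resp) < p_relapse

for mask, label in [(continue_drug, "continued test drug"),
                    (~continue_drug, "withdrawn to placebo")]:
    print(f"{label}: relapse rate {relapse[mask].mean():.2f} (n={int(mask.sum())})")
```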
2.2 Advantages and Limitations of the Design An advantage of the randomized withdrawal design is that placebo exposure can often be restricted to a minimum (e.g. in the investigation of long-term efficacy). It is also a way to address scientific questions regarding a multistage course of therapy (e.g. in depression), which would be very difficult to achieve in a placebo-controlled, parallel-group trial. It is important to consider phenomena which may interfere with the interpretation of the results of the withdrawal stage. First, carry-over and rebound effects (exacerbation of symptoms beyond pre-treatment levels upon withdrawal of active treatment) may result in erroneous conclusions on the treatment effect. Second, as the patient can receive different treatments during the
course of the study, his or her expectation and anticipation of the treatment effect might differ from that in a standard trial. This could prompt the patient to report the experience with the drug differently. In addition, partial unblinding could occur as the patient might perceive the absence of specific characteristic test drug effects when switched to placebo. As another issue, it is to be expected that primarily patients with a good prognosis are moving forward into the withdrawal part of the study, and efficacy and safety assessments are performed on a selected population, where safety issues may tend to be under-reported. 2.3 A Modified Version of a Randomized Withdrawal Design A modified version of a randomized withdrawal design was proposed by Leber (7) to evaluate treatments of a progressing disease (e.g., Alzheimer’s and Parkinson’s disease). There has been increasing interest in determining whether these have, in addition to symptomatic (short-term) effects, the potential to slow the progression of the disease (disease-modifying or long-term effects). Patients are initially randomized to receive the test drug or placebo. After a period of treatment designed to be long enough that disease-modifying effects are manifested, patients who received the test drug in the first stage are randomized to either continued test drug or placebo, and patients randomized to placebo in stage I remain on placebo. The second stage extends over a period of time of sufficient length for symptomatic effects to have dissipated. If any treatment effect evident after the initial stage (short-term and longterm effects) persists (the long-term effect), it will be reflected in the treatment difference of the test drug–placebo and placebo–placebo sequence at the end of the withdrawal stage. The symptomatic effect can be obtained by subtracting the disease-modifying effect from the treatment effect at the end of the first stage (comprises both effects), under the assumption that the symptomatic and disease-modifying effect are approximately additive. The test drug–test drug sequence is not used to distinguish between short-term and long-term effect, but it is essential to keep the trial blinded in both stages, and allows
the estimation of the combined short-term and long-term effect of the second stage. 3 RANDOMIZED DELAYED-START TRIALS A randomized withdrawal design suitable to evaluate treatments of progressive diseases such as Parkinson’s or Alzheimer’s disease has been described above. The randomized delayed-start design is another design tailored to distinguish between a drug’s effect in relieving symptoms and on the course of the disease. In the randomized delayed-start design, patients are initially randomized to placebo or test drug. Those who received placebo in the first stage are re-randomized to placebo or the test drug, while the test drug patients remain on test drug (see Figure 1). Patients in the placebo–test drug sequence are expected to improve in the second stage. If the test drug had no disease-modifying effect, then the patients in the test drug–test drug sequence would have no advantage over the patients in the placebo–test drug sequence, and both groups would be expected to reach the same level of response at the end of the second stage. Thus, any difference by which the patients in the placebo–test drug sequence stay below the stage II level of response in the test drug–test drug sequence is considered to reflect the disease-modifying effect. As the treatment effect of the first period comprises both types of effects, the symptomatic effect can be determined as explained above. 3.1 Advantages and Limitations of the Design As an advantage, the design is suitable to distinguish between symptomatic and diseasemodifying treatment effects, maintaining blinding throughout. It is also suited to determine the second stage effect (comprising the disease-modifying and symptomatic effect). The design has an advantage over the randomized withdrawal design with three sequences in that there is a higher chance of receiving the test drug in the second stage, with a potentially favorable impact on patient retention. On the other hand, both designs involve an overall prolonged placebo exposure, which may give rise to ethical issues if the drug is known or expected to modify the
course of the disease. Also, disease-modifying effects currently have to be inferred from performance scores, and the involved differences seem to be small due to an imperfect relationship between the underlying pathologic state and performance. This may require large sample sizes (8).
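The additivity argument used for both the three-sequence withdrawal design and the delayed-start design can be summarized schematically as below. The notation is ours, not the source's: S denotes the symptomatic (short-term) effect, D the disease-modifying (long-term) effect, and the decomposition holds only under the approximate additivity assumption stated above.

```latex
\begin{align*}
\Delta_1  &\approx S + D && \text{(test vs. placebo at the end of stage 1)}\\
\Delta_2  &\approx D     && \text{(test--placebo vs. placebo--placebo, withdrawal design, end of stage 2)}\\
\Delta_2' &\approx D     && \text{(test--test vs. placebo--test, delayed-start design, end of stage 2)}\\
S         &\approx \Delta_1 - \Delta_2 \;\text{ or }\; \Delta_1 - \Delta_2'
\end{align*}
```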
4 RANDOMIZED RE-TREATMENT TRIALS
A randomized re-treatment design can be used to evaluate treatments for chronic diseases with fluctuating symptoms like irritable bowel syndrome (IBS). The majority of IBS patients report episodes of mild to moderate symptoms, which alternate with periods of low symptom intensity (9). Intermittent treatment (short-course administration for a predetermined time period after symptom recurrence) may be suitable for these patients. The design is directed at evaluating the initial and repeated test-drug effect. The trial should start with a treatmentfree baseline period. This serves to record baseline disease activity, to familiarize the patient with data collection methods like paper or electronic diaries, and to ensure that patients start initial treatment with a similar intensity of symptoms. To evaluate first and repeated use of a test drug, patients are randomized to test drug or placebo for a short-term period, followed by a variable period off treatment with rerandomization to placebo or test drug in a second cycle when symptoms warrant restarting medication (10). Restarting treatment only upon symptomatic relapse is an essential condition; patients could otherwise be at different phases of their symptom cycles if the duration of the treatment-free interval was fixed. The initial and repeated test drug effect can be evaluated based on the treatment difference in the initial period and the treatment difference in the repeated treatment period for those patients who received the test drug as the initial treatment. The repeated treatment results of patients initially assigned to placebo are not used to assess the effects of repeated treatment with the test drug; these patients could therefore be switched to test drug in the second cycle in a blinded fashion (see Figure 1) to improve patient retention.
A modification of the design has been applied in a study to evaluate treatment efficacy in IBS with constipation (11). 4.1 Advantages and Limitations of the Design An advantage of the randomized re-treatment design lies in the limited exposure to placebo. Further, all patients who are included in the evaluation of repeated treatment effects receive the test drug as initial treatment. For this reason, the treatment groups at the start of second therapy are much more likely to be comparable than in the standard parallelgroup design, where differential dropout patterns could emerge after initial test drug or placebo treatment. As a disadvantage, sample size tends to be large as only patients initially receiving the test drug contribute to the evaluation of the repeated test-drug effect. Unbalanced randomization can relieve this disadvantage. A concern has been raised that unblinding of the first-stage treatments could occur due to different symptom profiles and different times to recurrence after withdrawal of the test drug or placebo (12) Although not used for treatment comparisons, symptoms should therefore also be monitored and recorded during the treatment-free interval to assess this concern. 5
ENRICHMENT
In some randomized withdrawal trials or retreatment trials, the second stage following initial treatment is designed to include an enriched population, e.g. responders to initial treatment. This may be preferred for ethical reasons, or the scientific question may be defined for responders (like relapse and remission in depression). Enrichment could also mean selecting nonresponders to initial treatment if the efficacy of alternative treatments is being evaluated for patients who do not benefit sufficiently from a specific initial treatment. Even in the absence of a planned enrichment, the second stage of a trial will likely be enriched with patients who can tolerate the drug and who experienced the initial treatment as beneficial. It is important to consider the impact of the selection on the observed
treatment effect. If the outcomes of the initial and second treatment stage are positively correlated, as is often the case for the test drug–test drug sequence, then those with a better initial value (responders) will more likely have a better value in the second stage, resulting in a higher test drug response level. If the correlation was less strong in the test drug–placebo sequence (this is often plausible due to different treatments, but not necessarily always the case), then the treatment effect would be larger than in the unselected population. While this would be an advantage in terms of sample size requirements, it has to be weighed against the smaller number of patients who enter the second stage and are evaluated for the comparison of treatments. A larger sample size may be required especially if the enriched population includes not more than half of the unselected population. 6
CONCLUDING REMARKS
When a trial with randomization following initial study treatment is considered, it is useful to assess how various effects can influence the treatment difference seen in the second stage and how it might differ from what could be observed in a parallel-group trial (if one was feasible). These possible effects include major changes in the patient’s health status due to treatment, changes in the effect of treatment over time, potential unblinding by the patient’s experience with more than one treatment, and rebound effects and carry-over from the previous stage. The second stage should be planned to be long enough to ensure that carry-over does not influence the results. Taking this into account, trials with randomization after initial treatment can be specifically useful in the following situations:
• Chronic, stable diseases
Long-term efficacy: randomized withdrawal design
Dose finding, including titration: randomized withdrawal design
• Diseases with a multistage treatment
Efficacy in later treatment stages: randomized withdrawal design
• Diseases with fluctuating symptoms
Efficacy of re-treatment: randomized re-treatment design
• Progressive diseases
Distinguish between disease-modifying and symptomatic effects: randomized withdrawal design (three treatment sequences), randomized delayed-start design (three treatment sequences)
Designs with two stages were discussed above. Trials with more than two stages can be designed to evaluate treatments in specific settings and populations. In a trial with three stages and outcome-driven rerandomization, patients whose benefit from a treatment was unsatisfactory could receive subsequent treatments (13). A randomized withdrawal design and a randomized delayed-start design, each with two periods and three treatment sequences, were proposed to distinguish between disease-modifying and symptomatic effects of treatments for progressing diseases. In contrast, the familiar two-treatment, two-period, crossover design is most useful to establish the symptomatic effects of a stable disease. Another difference is that it usually requires a washout period between the two periods and equal period lengths. A general two-period design with the four treatment sequences test drug–test drug, test drug–placebo, placebo–test drug, and placebo–placebo (complete two-period design [14]) includes the other designs and allows a check of the structural assumptions in addition to the determination of both types of effects. Another part of this general design, the treatment sequences test drug–test drug, test drug–placebo, placebo–test drug, can be used in the context of the one study versus two studies paradigm (15).
REFERENCES 1. W. Amery and J. Dony, A clinical trial design avoiding undue placebo treatment. J Clinical Pharmacol. 1975; 15: 674–679. 2. R. J. Temple, Special study designs: early escape, enrichment, studies in nonresponders. Commun Stat Theory Methods. 1994; 23: 499–531.
3. J. G. Storosum, B. J. van Zwieten, H. D. Vermeulen, and T. Wohlfarth, Relapse and recurrence prevention in major depression: a critical review of placebo-controlled efficacy studies with special emphasis on methodological issues. Eur Psychiatry. 2001; 16: 327–335.
4. The European Agency for the Evaluation of Medicinal Products, Human Medicines Evaluation Unit, Committee for Proprietary Medicinal Products (CPMP). Note for Guidance on the Choice of Control Group in Clinical Trials. CPMP/ICH/364/96. January 2001. Available at: http://www.emea.europa.eu/pdfs/human/ich/036496en.pdf
5. T. Wells, V. Frame, B. Soffer, W. Shaw, Z. Zhang, et al., A double-blind, placebo-controlled, dose-response study of the effectiveness and safety of enalapril for children with hypertension. J Clin Pharmacol. 2002; 42: 870–880.
6. H. J. Stewart, A. P. Forrest, D. Everington, C. C. McDonald, J. A. Dewar, et al., Randomised comparison of 5 years adjuvant tamoxifen with continuous therapy for operable breast cancer. The Scottish Cancer Trials Breast Group. Br J Cancer. 1996; 74: 297–299.
7. P. Leber, Slowing the progression of Alzheimer disease: methodologic issues. Alzheimer Dis Assoc Disord. 1997; 11 (suppl 5): S10–S21.
8. E. Mori, M. Hashimoto, K. R. Krishan, and M. Doraiswami, What constitutes clinical evidence for neuroprotection in Alzheimer disease. Alzheimer Dis Assoc Disord. 2006; 20: S19–S26.
9. B. Hahn, M. Watson, S. Yan, D. Gunput, and J. Heuijerjans, Irritable bowel syndrome symptom patterns. Dig Dis Sci. 1998; 43: 2715–2718.
10. The European Agency for Evaluation of Medicinal Products, Human Medicine Evaluation Unit, Committee for Proprietary Medicinal Products (CPMP). Points to Consider on the Evaluation of Medicinal Products for the Treatment of Irritable Bowel Syndrome. CPMP/EWP/785/97. March 19, 2003. Available at: http://www.emea.europa.eu/pdfs/human/ewp/078597en.pdf
11. J. Tack, S. Müller-Lissner, P. Bytzer, R. Corinaldesi, L. Chang, et al., A randomised controlled trial assessing the efficacy and safety of repeated tegaserod therapy in women with irritable bowel syndrome with constipation (IBS-C). Gut. 2005; 54: 1707–1730.
12. E. Corazziari, P. Bytzer, M. Delvaux, G. Holtmann, J. R. Malagelada, et al., Clinical trial guidelines for pharmacological treatment of irritable bowel syndrome. Aliment Pharmacol Ther. 2003; 18: 569–580. 13. S. M. Davis, G. G. Koch, C. E. Davis, and L. M. LaVange, Statistical approaches to effectiveness measurement and outcome-driven rerandomizations in the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) studies. Schizophr Bull. 2003; 29: 73–80. 14. L. N. Balaam, A two-period design with t2 experimental units. Biometrics. 1968; 24: 61–73. 15. G. Koch, Commentaries on statistical consideration of the strategy for demonstrating clinical evidence of effectiveness—one larger vs two smaller pivotal studies. Stat Med. 2005; 24: 1639–1646.
FURTHER READING C. Dunger-Baldauf, A. Racine, and G. Koch, Retreatment studies: design and analysis. Drug Inf J. 2006; 40: 209–217. V. E. Honkanen, A. F. Siegel, J. P. Szalai, V. Berger, B. M. Feldman, and J. N. Siegel, A three-stage clinical trial design for rare disorders. Stat Med. 2001; 20: 3009–3021. M. P. McDermott, W. J. Hall, D. Oakes, and S. Eberly, Design and analysis of two-period studies of potentially disease-modifying treatments. Control Clin Trials. 2002; 23: 635–649.
CROSS-REFERENCES Carry-over Placebo-controlled trial Parallel-group design Randomization Unblinding
DIAGNOSTIC STUDIES
CARSTEN SCHWENKE
Schering AG, SBU Diagnostics & Radiopharmaceuticals, Berlin, Germany

1 INTRODUCTION

Diagnostic studies are performed to investigate the clinical usefulness of a new diagnostic procedure, for example, a new contrast agent for imaging or a new laboratory test to detect antibodies or antigens in blood. Sometimes, these studies are designed to replace the standard procedure with an improved or less expensive procedure that is not inferior in terms of efficacy and has some benefit compared with the standard procedure. Before starting a diagnostic study, the unit for analysis should first be defined in terms of the "experimental unit" and the "observational unit." All general considerations on clinical studies, as proposed by ICH and other guidelines, apply to diagnostic studies as well, but explicit guidelines also exist for diagnostic studies, as described in Section 2.1 (1, 2). So-called "blinded reader" studies were introduced by these guidelines as part of diagnostic studies to assess the main efficacy objective in the clinical study. The clinical usefulness of a new diagnostic procedure is evaluated using three components: the reliability, the validity, and the add-on value. Reliability and validity are defined components that investigate, along with parameters for measurement and analysis tools, the performance of a diagnostic procedure. To measure the benefit of a new diagnostic procedure for the patient, the add-on value is often shown by an increased validity compared with a control. Throughout this article, the parameters of interest are described in the text, whereas the formal definitions are displayed separately in a table. The term "patient" is used for a participant of a clinical study regardless of the presence or absence of the disease under investigation.

2.1 Guidelines for Diagnostic Studies

Diagnostic studies are clinical trials with diagnostic objectives and outcomes that, in general, follow the same scientific and regulatory rules as clinical studies for any other medical product. The design and conduct of the clinical study as well as the statistical methods are governed by the relevant guidelines such as ICH-E3, ICH-E6, ICH-E9, and ICH-E10. Nevertheless, some topics specific to diagnostic studies exist, which are covered by special guidelines. The Food and Drug Administration (FDA) and the Committee for Proprietary Medicinal Products (CPMP) of the European Agency for the Evaluation of Medicinal Products (EMEA) provide guidance on the evaluation of medical imaging drugs and diagnostic agents. FDA's draft guidance for industry (2) on "Developing Medical Imaging Drugs and Biological Products" and CPMP's "Points to Consider on the Evaluation of Diagnostic Agents" (1) provide considerations on the methodology and data presentation along with explanations about how to define the gold standard and clinical endpoints.
2.2 The Cascade to Test the Performance of Diagnostic Procedures: Reliability and Validity

In diagnostic studies, three parts of efficacy are of interest to claim the clinical usefulness of a new diagnostic procedure. The first part is the reliability (e.g., the rate of agreement of different raters on the same patient). The second part is the validity (i.e., the ability of the diagnostic procedure to measure what it is supposed to measure). As a third part, the procedure has to provide an add-on value as compared with the ability of diagnosing the disease without the new procedure. As an example, the add-on value of a contrast agent for an imaging device may be shown by comparing images with and without the contrast agent. For the validity of a diagnostic procedure, the true disease status has to be known. When no true status is available,
Figure 1. Cascade to test the performance of a diagnostic procedure: reliability (parameters: kappa, intra-class correlation coefficient, Lin's coefficient of concordance, chapt. 3.1; analysis: Bland-Altman plot, McNemar's test, chapt. 3.2), followed, when the "truth" is known, by validity (parameters: sensitivity and specificity, accuracy, positive and negative predictive value, chapt. 4.1; analysis: receiver operator characteristic (ROC), likelihood ratio, chapt. 4.2), and the add-on value.
only the reliability in terms of agreement with a control and the add-on value can be used to show clinical usefulness. In Fig. 1, the cascade to conduct the testing of a new diagnostic procedure is graphically presented. The parameters and analysis tools for the evaluation of reliability and validity are described in Sections 3 and 4. For further reading on diagnostic studies, Knottnerus (3) provides more introductory explanations. 2.3 Evaluation of the ‘‘Truth’’ and Choice of Control A critical design issue in diagnostic studies is the determination of the truth provided by the so-called gold standard, standard of truth, or standard of reference (SOR). As nearly no procedure can provide 100% correct diagnoses, the terms ‘‘gold standard’’ and ‘‘standard of truth’’ are not commonly used any more as they imply an absolutely accurate diagnosis. Today, the term ‘‘standard of
reference’’ is preferred to name a method, which provides the so-called true disease status. In general, the SOR should reflect the best available standard procedure but, in any case, has to be independent from the test procedure to avoid any bias in favor or disfavor of the test procedure. When a CT is under investigation, the standard of reference should be a method different from CT like pathology or MRI to obtain independent results. To avoid any bias, in addition, the results of the test procedure should be unknown when the SOR is evaluated, but also the results of the SOR should be unknown when the test procedure is performed. Examples for SORs are CT, MRI, histopathology, biopsies, ultrasound, intraoperative ultrasound, the final diagnosis, or the clinical outcome, obtained from a follow-up examination. Sometimes the best available method to obtain the truth is inappropriate and unethical, for example, performing a biopsy or histopathology on patients with suspected benign tumors and
suspected absence of any malignancies. In these cases, the SOR might be suboptimal, less valid, or less reliable. The consequence may be incorrect SOR diagnoses, which may lead to biased estimates for the number of true positive (TP) and true negative (TN) test results (for definitions of TP and TN, see section True Positives, True Negatives, False Positives, False Negatives) and so to a biased estimation of the validity of the procedure. This bias could become unavoidable in ethically performed clinical studies. Therefore, the SOR method has to be chosen in a way that it is ethical and appropriate for the patient and provides information closest to the truth. Any inappropriateness and weakness of the SOR has to be taken into account when interpreting study results. When the distribution of patients with weak SOR is similar across treatment groups, the underestimation may be also similar such that the absolute values of sensitivity and specificity show underestimation, but the estimation of the add-on value of the new diagnostic procedure in terms of relative difference between two groups remains appropriate. Therefore, controls are needed in diagnostic studies to evaluate the performance of a diagnostic procedure. In studies on contrast agents, one may use the images without the contrast agent as control to perform an intraindividual comparison. Also, active controls like other contrast agents, other diagnostic procedures (e.g., CT versus MR) or placebo- or vehicle-controls may be used. The type of control depends on the primary objective of the respective study. In general, the control has to be chosen appropriately to enable the assessment of the clinical usefulness of the procedure under investigation. Therefore, the control should always reflect the current medical practice. To maximize the precision of the estimates and to reduce the variability a paired study design is preferable where the test and the control procedures are performed in the same patient. 2.4 Experimental Unit and Observational Unit In general, an experimental unit is the smallest unit to which one treatment is applied (e.g., a patient who receives the contrast
agent). The outcome is obtained from the observational unit, which is the smallest unit for which an observation can be made. When a treatment is applied to a patient and a single observation is made for this patient, then the observational unit and the experimental unit are the same. An example is the treatment of headache, where a drug is given to the patient and the pain relief is observed at a fixed point in time. In diagnostic studies, the experimental unit and the observational unit are often not the same. The diagnostic procedure is applied to the patient as experimental unit, but the observational units are several regions of interest within the patient like different organs, different vessels, or different lesions within an organ. An example is the detection of tumors in the liver by an imaging medical device with contrast agent. Here, the patient is the experimental unit as he receives the contrast agent, but the observational units are multiple focal liver lesions. 2.5 Clustered Data in Diagnostic Studies The fundamental unit for statistical analysis of clinical studies is the observational unit rather than the experimental unit. Often, the observational units are highly correlated within the experimental unit. When, for example, several focal liver lesions were found for a patient and one lesion is a metastasis, then the probability may be high that the other lesions are also metastases. In longitudinal studies, where observations are made for the same subject at several points in time, the measurements are correlated, which has to be taken into account for the analysis. To provide a patient-based analysis and to obtain valid estimates for the error variance of the statistical parameters, the experimental unit is preferred as unit for analysis. A solution is to define the patient to be a cluster of multiple correlated observations. Appropriate statistical methods are available to obtain valid estimates for the treatment effect based on clustered data. Examples are the cluster-adjusted McNemar test (4, 5) and the adjusted X2 -test (6). 2.6 Blinded Reader Studies As stated earlier, for diagnostic studies, special guidelines (1, 2) apply. One particularity,
especially in imaging studies, is the evaluation of the efficacy by multiple independent readers who have to perform a blinded assessment. The blinded read is an artificial scenario to assess the performance of a diagnostic procedure and may not reflect the clinical routine. Nevertheless, it forces the readers to evaluate the images under objective conditions and helps to avoid any bias as a result of nonindependence or access to unblinded information like the medical history, age, or sex of the patient. The relevant guidelines (1, 2) state that the readers have to be ‘‘independent’’ (i.e., not yet involved in the development of the diagnostic procedure or affiliated with the sponsor of the trial or the institutions) where the study is conducted. The ‘‘blindness’’ of the readers is the unawareness of most of the information given in the study protocol (i.e., the treatment identity, demographics, the primary objective, clinical information on the patient, and other information that may influence the evaluation of the images). The limitation of providing very limited study and patientrelated information to the readers is introduced to force the reader to focus on the images alone. At least two blinded readers are requested by the FDA and CPMP (1, 2) to evaluate the images to investigate the interreader and intrareader variability and to allow for a wider generalization of the results. A profound selection and training of certified clinical specialists as blinded readers is needed to minimize the interreader variability.
3 RELIABILITY

Reliability is introduced to measure the agreement of different methods. In general, a measurement lacks reliability when it is prone to random error. In the following, the term "method" is used as a synonym for the diagnostic procedure from which the observations are obtained and for which the agreement with another method is of interest. Examples for such methods are different diagnostic procedures, multiple readers of images, different laboratory tests, or different raters in psychological studies. Cohen's kappa κ, the intraclass correlation coefficient (ICC), Lin's concordance coefficient rLin, and the Bland-Altman plot can be used to assess reliability. Statistical analyses of reliability may be based on McNemar's test (for further reading, see Reference 7). It should be noted that reliability is different from correlation. Correlation coefficients, like Pearson's correlation coefficient for continuous data and Spearman's rank correlation coefficient for ordinal and continuous observations, measure the linear and monotonic relationship of two methods, respectively, not the agreement. As shown in Fig. 2, even an optimal linear correlation does not imply a moderate or substantial agreement, but may result in a poor agreement. To estimate the agreement of multiple methods, it is clearly not sufficient to only calculate the correlation. Shoukri and Edge (5) and Bland and Altman (8) provide introductory explanations in more detail on reliability. Advanced readers are referred to Shoukri (9) and Fleiss (7, 10).
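To make the distinction concrete, the following short sketch (illustrative numbers only, not study data) compares two methods whose readings are perfectly correlated but systematically shifted: the Pearson correlation is 1, yet the methods never agree exactly.

```python
import numpy as np

# Two methods measured on the same 10 patients (hypothetical values on a 0-1 scale).
# Method 2 is method 1 plus a constant shift, so the linear relationship is perfect
# while the two methods never report the same value.
m1 = np.linspace(0.1, 0.5, 10)
m2 = m1 + 0.4

pearson_r = np.corrcoef(m1, m2)[0, 1]          # linear correlation
exact_agreement = np.mean(np.isclose(m1, m2))  # proportion of identical readings
mean_difference = np.mean(m2 - m1)             # systematic bias between the methods

print(f"Pearson r       = {pearson_r:.2f}")        # 1.00
print(f"exact agreement = {exact_agreement:.2f}")  # 0.00
print(f"mean difference = {mean_difference:.2f}")  # 0.40
```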
Figure 2. Example for the correlation and agreement of two methods (two scatter-plots of method 1 against method 2: one with optimal correlation and optimal agreement, one with optimal correlation but poor agreement).
3.1 Reliability Parameters/Measurements of Reliability

3.1.1 Kappa. In 1960, Cohen (11) proposed kappa for the measurement of the agreement of dichotomous outcomes from two methods; it is based on a 2 × 2 contingency table. It should be noted that neither method is a standard of reference. Kappa compares the observed proportion of agreement in the diagonal cells (PO) and the expected proportion of agreement based on chance alone (PE). Both proportions can be calculated from the counts of the 2 × 2 table. These counts are shown in Table 1. In addition, the agreement rates for the outcome "disease present" (Pdiseased) and for the outcome "disease absent" (Pnon diseased) can be calculated. The average of these rates is then the overall agreement rate (Pall). The
formulas for kappa and the associated parameters can be found in Table 2. In Fleiss (7), the standard error and confidence interval is presented along with a description of the kappa statistic. For illustration, the calculation of kappa and the associated proportions is shown by an example. In a hypothetical study, 169 ultrasound images were assessed by two different readers, evaluating the presence or absence of cirrhosis in the liver. The results of the evaluation are shown in Table 3 with the absolute counts and percentages. From this 2 × 2 table, the observed and expected proportions as well as the proportions of diseased and nondiseased patients then are: PO = 0.70, PE = 0.62, Pdiseased = 0.40, and Pnon diseased = 0.80. Then, kappa = (0.7–0.62)/(1 − 0.62) = 0.21. After the
Table 1. 2 × 2 Table of Counts for Two Methods with Two Outcomes

                              Method 2
Method 1             disease present   disease absent   Total
disease present      a                 b                a + b
disease absent       c                 d                c + d
Total                a + c             b + d            a + b + c + d = n
Table 2. Formulas for Kappa and Associated Parameters (7)

Kappa:
    κ = (PO − PE) / (1 − PE),
    PO = (a + d)/n = observed agreement rate,
    PE = {[(a + c)·(a + b)] + [(b + d)·(c + d)]} / n² = expected agreement rate by chance,
    a, b, c, d and n according to Table 1.

Overall proportion of agreement:
    Pall = 0.5 · (Pdiseased + Pnot diseased),
    Pdiseased = 2a/(2a + b + c) = proportion with outcome "disease present",
    Pnot diseased = 2d/(2d + b + c) = proportion with outcome "disease absent",
    a, b, c, d and n according to Table 1.
Table 3. 2 × 2 Table of an Ultrasound Evaluation of Two Readers

                              Reader 2
Reader 1             cirrhosis present   cirrhosis absent   Total
cirrhosis present    17 (10%)            34 (20%)           51 (30%)
cirrhosis absent     17 (10%)            101 (60%)          118 (70%)
Total                34 (20%)            135 (80%)          169 (100%)
Table 4. Interpretation Rules for Kappa (12)

Kappa          Interpretation
<0.01          chance of agreement
0.01–0.20      slight agreement
0.21–0.40      fair agreement
0.41–0.60      moderate agreement
0.61–0.80      substantial agreement
0.81–1.00      almost perfect agreement
calculation of kappa, the question of how to interpret the results develops. In 1977, Landis and Koch (12) proposed rules for the interpretation of kappa, which are shown in Table 4. For the ultrasound example, the two readers would be regarded as in fair agreement. With kappa = 1, an optimal agreement of two methods is concluded; for kappa > 0, the observed agreement is higher than the agreement rate by chance alone; for kappa = 0, the observed and the expected agreement rates are the same. When a kappa < 0 is found, then the observed agreement is worse than the agreement by chance. The expected agreement rate is dependent on the prevalence of the outcome (i.e., with a prevalence near 0 or 1, kappa can be low even when the observed agreement rate is high). Therefore, kappa may change with different prevalences of the disease, even when the observed agreement rate stays the same. This effect is illustrated by a hypothetical example with different prevalences (Tables 5 and 6). In both examples, the observed proportion of agreement is the same (i.e., PO = 0.870). With a prevalence of 65% in example 1 and 85% in example 2, the expected agreement rates of PE = 0.554 and PE = 0.766, respectively, differ, which leads to different estimates for kappa (i.e., kappa = 0.709 for example 1 and kappa = 0.444 for example 2). Kappa can also be calculated when more than two outcomes for each method are present
(e.g., a pain score with 5 outcomes from "no pain" to "severe pain"). Then the overall kappa is estimated as the weighted mean of individual kappas, which are based on the 2 × 2 cross tables for each pair of outcomes (for further reading, see References 7 and 9).

3.1.2 The Intraclass Correlation Coefficient. The intraclass correlation coefficient (ICC) was introduced by Bloch and Kraemer in 1989 as an alternative to Cohen's kappa and is a commonly used measurement of agreement for quantitative assessments. The ICC and kappa are equivalent and differ only by a term with the factor 1/n, where n is the number of patients under investigation, but the ICC can also be used for the analysis of continuous data. The ICC can be obtained by applying an analysis of variance. For the introduction of the ICC, the one-factorial linear model is used. Let Y be a single outcome for a patient and X be the underlying mean score of many repeated measurements for this patient (i.e., the expected or "true" score). When an experiment is conducted, Y can be found to be different from X for several reasons like misinterpretations of images by the reader or imperfect calibration of the test procedure. The difference of the observed score Y and the expected score X is expressed as error e in the classical linear model Y = X + e. The score X will vary across a population of patients with mean mX and variance s²X. The mean of the error e will vary around zero with variance s²e. The distributions of e and X are assumed to be independent, so that the variances of e and X are also independent. The variance of Y can then be expressed as the sum of the variance components of X and e (i.e., s²Y = s²X + s²e). Then the ICC1 is the proportion of the variance component of X compared with the overall variance. With a decreasing error variance, the denominator decreases, and so the agreement rate increases to a
Table 5. Example 1: Kappa Coefficient in Case of a Medium Prevalence of the Disease

                              Method 2
Method 1             disease present   disease absent   Total
disease present      60                8                68
disease absent       5                 27               32
Total                65                35               100
Table 6. Example 2: Kappa Coefficient in Case of a High Prevalence of the Disease

                              Method 2
Method 1             disease present   disease absent   Total
disease present      80                8                88
disease absent       5                 7                12
Total                85                15               100
maximum of 1, whereas an increasing error variance would lead to a decreased agreement rate, down to a minimum of 0. The ICC can therefore be interpreted as the proportion of variance caused by the patient-to-patient variability, corrected for random error. In Table 7, the formulas are given along with the confidence interval for ICC1 (7, 13). For a sample size of at least 20 patients, the ICC2 can be equivalently expressed in terms of the within-patient mean square (WMS) and the between-patient mean square (BMS) obtained by an analysis of variance, as shown in Table 8. Despite the name, the ICC measures the agreement of multiple methods, not the correlation in terms of some monotonic relationship between the methods, as it also takes into account whether the methods assess on the same scale. As a rule for interpretation, an ICC of at least 0.75 is suggested to be meaningful and of good reliability (14). Taking into account the sampling error, the ICC should be called meaningful when the lower limit of the 95% confidence interval for the ICC is at least 0.75. A shortcoming of the ICC is its dependency on the degree of variability within the sample, as it is calculated from this variation. In Fleiss (10), further introductory explanations can be found; advanced readers are referred to Fleiss (7).

3.1.3 Lin's Coefficient of Concordance. As an alternative to the ICC, Lin's coefficient of concordance rLin may be used, which measures
the degree of concordance between two methods M1 and M2, taking into account not only the variances of the two methods but also the means. In Table 9 (15), the formulas are shown for the coefficient and its variance. rLin = 1 indicates a perfect positive agreement, rLin = 0 indicates no agreement, and rLin = −1 indicates a perfect negative agreement (15). Like the ICC, Lin's coefficient is dependent on the degree of variability within the sample.

3.2 Analysis of Reliability

3.2.1 Bland-Altman Plot. For the graphical investigation of agreement, the Bland-Altman plot may be used (16). In the graph, the difference of the values of method 1 and method 2 is plotted against the average of method 1 and method 2. An example is shown in Fig. 3. The limits of agreement are defined as two times the standard deviation of the mean difference. The plot should be checked for any systematic pattern (e.g., when method 1 gives systematically three times higher values than method 2). If the mean difference is close to 0 and the single differences do not exceed the agreement limits, good agreement can be concluded. As a shortcoming, the Bland-Altman plot does not provide any quantitative measure of agreement, but it provides a good graphical insight into the data. To overcome this shortcoming, the Bradley-Blackwood procedure may be used, which is extensively described in Bartko (17).
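As an illustration of the kappa formulas in Table 2, the following sketch (plain Python written for this article's worked numbers, not code from the source) reproduces the ultrasound example of Table 3.

```python
# Counts of the 2 x 2 agreement table from the ultrasound example (Table 3)
a, b, c, d = 17, 34, 17, 101      # a = both readers "present", d = both "absent"
n = a + b + c + d

p_o = (a + d) / n                                      # observed agreement rate
p_e = ((a + c) * (a + b) + (b + d) * (c + d)) / n**2   # agreement expected by chance
kappa = (p_o - p_e) / (1 - p_e)

p_diseased = 2 * a / (2 * a + b + c)          # agreement rate for "disease present"
p_not_diseased = 2 * d / (2 * d + b + c)      # agreement rate for "disease absent"
p_all = 0.5 * (p_diseased + p_not_diseased)   # overall proportion of agreement

print(f"P_O = {p_o:.2f}, P_E = {p_e:.2f}, kappa = {kappa:.2f}")
# -> P_O = 0.70, P_E = 0.62, kappa = 0.21 ("fair agreement" by the rules in Table 4)
print(f"P_diseased = {p_diseased:.2f}, P_not_diseased = {p_not_diseased:.2f}, P_all = {p_all:.2f}")
```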
Table 7. Formulas for ICC1 and Associated Parameters (7, 13)

ICC1 by variance components:
    Let Y = X + e be a classical linear model with observations Y, factor X, and error e,
    s²X = variance component of factor X, s²e = variance component of the error term e.
    => ICC1 = s²X / (s²X + s²e).

95% confidence interval:
    95% CI(ICC) = [ (F/FU − 1)/(n + F/FU − 1) ; (F/FL − 1)/(n + F/FL − 1) ], with
    F = MSX/MSE, MSX = mean square of X, MSE = mean square of e,
    FU and FL = upper and lower points of the F-distribution,
    P(FL ≤ F(a−1, a(n−1)) ≤ FU) = 1 − α.

Figure 3. Example for a Bland-Altman plot (difference of method 1 and method 2 plotted against the average of method 1 and method 2, with the mean difference and the ± 2·SD limits of agreement).
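The ICC1 of Table 7 can be estimated from a one-way analysis of variance with the patient as the random factor. The sketch below is a minimal illustration under the assumption of a complete matrix of n patients rated by the same m raters; the data are invented for demonstration.

```python
import numpy as np

# Hypothetical ratings: rows = patients, columns = raters (continuous scores)
scores = np.array([
    [8.2, 8.0, 8.4],
    [5.1, 5.4, 5.0],
    [6.8, 6.5, 6.9],
    [9.0, 8.8, 9.1],
    [4.2, 4.5, 4.1],
])
n, m = scores.shape

grand_mean = scores.mean()
patient_means = scores.mean(axis=1)

# One-way ANOVA with patients as the random factor
ss_between = m * np.sum((patient_means - grand_mean) ** 2)
ss_within = np.sum((scores - patient_means[:, None]) ** 2)
bms = ss_between / (n - 1)        # between-patient mean square
wms = ss_within / (n * (m - 1))   # within-patient mean square

# Variance components and ICC1: proportion of variance due to patients
s2_error = wms
s2_patient = (bms - wms) / m
icc1 = s2_patient / (s2_patient + s2_error)

print(f"BMS = {bms:.3f}, WMS = {wms:.3f}, ICC1 = {icc1:.3f}")
```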
3.2.2 McNemar’s Test. For the analysis of reliability in terms of hypothesis testing, McNemar’s test may be used. For dichotomous outcomes, McNemar’s test determines whether the probability of a positive test result is the same for two methods. McNemar’s test is based on the determination of the symmetry in the discordant pairs of observations (i.e., where the two methods provide different outcomes). The hypotheses and test statistic are shown in Table 10. In case
of clustered data (i.e., where several observations are obtained from the same subject, like the detection of focal liver lesions in the same patient), the cluster-adjusted McNemar test (4, 5) should be used. For illustration, McNemar's test is calculated for the ultrasound example (Table 3). Let X be reader 1 and Y reader 2; then PX = 34/169 = 20%, PY = 51/169 = 30%, and SE(PX − PY) = 0.0423. The test statistic then is χ² = (34 − 17 − 1)²/51 = 5.02 and leads to
Table 8. Formulas for ICC2 and Associated Parameters (10)

ICC2 by mean squares:
    Let BMS = (1/n) · Σ_{i=1}^{n} (xi − mi·p̄)²/mi be the mean square between the methods
    and WMS = (1/(n·(m̄ − 1))) · Σ_{i=1}^{n} xi·(mi − xi)/mi be the mean square within the methods,
    with mi = number of ratings on the ith patient, with mean m̄ = (1/n) · Σ_{i=1}^{n} mi,
    m0 = m̄ − Σ_{i=1}^{n} (mi − m̄)² / (n·(n − 1)·m̄),
    xi = number of positive ratings on the ith patient,
    p̄ = (1/(n·m̄)) · Σ_{i=1}^{n} xi = mean proportion of positive ratings, and
    n = number of patients,
    => ICC2 = (BMS − WMS) / (BMS + (m0 − 1)·WMS).
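Table 8's mean-square version for dichotomous ratings with possibly unequal numbers of ratings per patient can be sketched as follows (invented counts; an illustration of the formulas, not code from the source).

```python
import numpy as np

# x[i] = number of positive ratings for patient i, m[i] = number of ratings obtained
x = np.array([3, 0, 2, 4, 1, 4, 0, 3])
m = np.array([4, 3, 4, 4, 3, 4, 4, 4])
n = len(x)

m_bar = m.mean()
p_bar = x.sum() / (n * m_bar)                            # mean proportion of positive ratings
m0 = m_bar - np.sum((m - m_bar) ** 2) / (n * (n - 1) * m_bar)

bms = np.sum((x - m * p_bar) ** 2 / m) / n               # between mean square (Table 8)
wms = np.sum(x * (m - x) / m) / (n * (m_bar - 1))        # within mean square (Table 8)

icc2 = (bms - wms) / (bms + (m0 - 1) * wms)
print(f"BMS = {bms:.3f}, WMS = {wms:.3f}, ICC2 = {icc2:.3f}")
```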
statistical significance as compared with the critical value of χ²(1; 0.05) = 3.841.

4 VALIDITY
The validity of a diagnostic procedure is substantial for its performance. It is defined as the accuracy or the grade of truthfulness of the test result, but can only be evaluated when the true disease status is known. The main question regarding the validity of a procedure is, therefore, whether it measures what it is intended to measure. Often, the observations for estimating the validity are on binary scale (i.e., with two outcomes) or dichotomized (i.e., reduced to two outcomes) to estimate validity parameters like sensitivity and specificity. When the observations are on ordinal scale (i.e., with more than two ordered outcomes like rating scores on 5-scale) or when the data are continuous (e.g., laboratory values of a PSA test as marker for prostate cancer), the area under the Receiver Operator Characteristic curve (ROC) and likelihood ratios are commonly used to analyze the validity of diagnostic
procedures. Most often, the primary efficacy objective in diagnostic studies is based on the validity (i.e., the diagnostic accuracy and the ability of the diagnostic procedure to discriminate the disease status from the nondisease status). The advantage of the new diagnostic procedure (i.e., the add-on value) is often based on an increased validity compared with some control. 4.1 Validity Parameters Basis for the definition of validity parameters is the 2 × 2-table defining true positive, false positive, false negative, and true negative outcomes (please refer to section ‘‘True Positives, True Negatives, False Positives, False Negatives’’). Then the prevalence of a disease is defined as the number of patients with a disease divided by all patients in the target population at a given and fixed point in time. The prevalence is called the prior or pretest probability for the proportion of patients with the disease. Validity parameters are sensitivity, specificity, and accuracy. The sensitivity is defined as the number of diseased patients
Table 9. Formulas for Lin's Concordance Coefficient and Associated Parameters (15)

Lin's concordance coefficient:
    rLin = 2·sM1,M2 / [s²M1 + s²M2 + (µM1 − µM2)²],
    M1 = method 1, M2 = method 2,
    s²M1 = variance of method M1, s²M2 = variance of method M2,
    sM1,M2 = covariance of M1 and M2,
    µM1 = mean for method M1, µM2 = mean for method M2.

Variance:
    Var(rLin) = (1/(n − 2)) · [ (1 − r²)·r²Lin·(1 − r²Lin)/r² + 4·r³Lin·(1 − rLin)·u²/r − 2·r⁴Lin·u⁴/r² ],
    r = Pearson's correlation coefficient, u = (µM1 − µM2)/√(sM1·sM2).
Table 10. Formulas for McNemar's Test and Associated Parameters (10)

Hypotheses:
    Let M1 = method 1 and M2 = method 2, then
    H0: PM1 = PM2 vs. H1: PM1 ≠ PM2  ⇔  H0: PM1 − PM2 = 0 vs. H1: PM1 − PM2 ≠ 0,
    PM1 = (a + b)/n, PM2 = (a + c)/n, a, b, c, d and n according to Table 1.
    => PM1 − PM2 = (b − c)/n = difference in proportions,
    with standard error SE(PM1 − PM2) = √(b + c)/n.

Test statistic:
    χ²1 = [ (|PM1 − PM2| − 1/n) / SE(PM1 − PM2) ]² = (|b − c| − 1)²/(b + c), with 1 degree of freedom.
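A short sketch of the McNemar computation of Table 10, applied to the discordant pairs of the ultrasound example (b = 34, c = 17); SciPy is used only for the chi-square reference distribution. This is an illustration, not code from the article.

```python
from math import sqrt
from scipy.stats import chi2

# Discordant pairs from the ultrasound example (Table 3):
# b and c count the patients on whom the two readers disagree.
b, c, n = 34, 17, 169

diff = (b - c) / n                             # difference in marginal proportions
se_diff = sqrt(b + c) / n                      # its standard error
chi2_stat = (abs(b - c) - 1) ** 2 / (b + c)    # continuity-corrected McNemar statistic

p_value = chi2.sf(chi2_stat, df=1)
critical = chi2.ppf(0.95, df=1)

print(f"difference = {diff:.3f} (SE = {se_diff:.4f})")
print(f"chi2 = {chi2_stat:.2f} vs critical value {critical:.3f}, p = {p_value:.3f}")
# -> chi2 = 5.02 > 3.841, i.e., the two readers rate "cirrhosis present" at different rates
```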
with positive test result divided by the overall number of patients with the disease. In statistical terms, the sensitivity can also be expressed as the probability for a positive test result under the condition that the disease is present. Therefore, the sensitivity is exclusively based on patients with the disease.
The specificity, on the other hand, is based on nondiseased subjects only. It is defined as the number of nondiseased patients with a negative test result divided by the overall number of patients without the disease. In statistical terms, the specificity is the probability of a negative test result under the
condition that the disease is absent. A combination of the sensitivity and specificity is the accuracy, where the information of both diseased patients and nondiseased patients is used. The accuracy is defined as the number of patients with either true positive or true negative test result divided by the overall number of patients under investigation. The disadvantage of the accuracy is its dependence on the prevalence of the disease: When the prevalence is low, the accuracy is mainly driven by the specificity (i.e., the number of true negative test results). When the prevalence is high, the accuracy is mainly influenced by the sensitivity (i.e., the number of true positive test results). As a consequence, a low specificity at a high prevalence and a low sensitivity at a low prevalence may be masked and missed. Other possibilities of combining sensitivity and specificity are the prevalence-independent receiver operating characteristic curve and the likelihood ratio, which are described later. Other validity parameters are the positive and negative predictive value. The positive predictive value (PPV) is the number of true positive test results divided by the overall number of positive test results, both true results and false results. It expresses the probability that a positive test result is correct, which is equivalent to the post-test probability for the disease when a positive test result is found. The negative predictive value (NPV) is the number of correct negative test results divided by the number of negative test results overall, which also is true only in cases where the prevalence in the sample is equal to the prevalence in the target population. In this case, the NPV is the probability for a negative test result being correct. The NPV is equivalent to the post-test probability that a negative test result is correct. Both PPV and NPV are measurements of how well the diagnostic procedure performs and can also be expressed in terms of sensitivity and specificity. In Zhou et al. (18), further descriptions can be found for situations in which the prevalences of the sample and the target population differ. In Table 11 (19), the formulas of all validation parameters and the associated parameters like the prevalence are given for a 2 × 2 table.
To demonstrate the calculation of the validation parameters, assume a study with 100 patients with suspected prostate cancer to test the performance of a new laboratory test to discriminate diseased and nondiseased patients. The test measures the PSA level in blood. A cut-off value is chosen to discriminate the diseased and nondiseased patients. Biopsy is used as standard of reference to obtain information on the true disease status. In Table 12, hypothetical results of such a study are in terms of a 2 × 2 table. The validation parameters and associated parameters can be calculated to investigate the performance of the new laboratory test on PSA levels. The results are: Se = 55/70 = 78.6%, Sp = 20/30 = 66.7%, Acc = 75/100 = 75%, PPV = 55/65 = 84.6%, and NPV = 20/35 = 57.1%. For further reading, see Panzer (20) as an introductory text; Zhou et al. (18) for advanced readers regarding validation parameters, their variances, and confidence intervals; and Agresti (21) for categorical data in general. 4.2 Analysis of Validity 4.2.1 Likelihood Ratio. As a measurement of the discrimination of a diagnostic procedure, the likelihood ratio combines information on the sensitivity and specificity simultaneously. For dichotomous data, the likelihood ratio for diseased patients (LR+) is the probability for a positive test result in diseased patients (i.e., the sensitivity) compared with nondiseased patients (i.e., 1-specificity). Analogously, the likelihood ratio for nondiseased patients is the probability for a negative test result in diseased patients compared with the patients without the disease. In Table 13, the likelihood ratios are given for dichotomous but also ordinal data with more than two categories along with the 95% confidence interval. With a likelihood ratio of 1, the diagnostic procedure adds no information, with increasing LR+ and decreasing LR−, the ability of the diagnostic procedure to discriminate between diseased and nondiseased patients increases. With the data from the PSA example (Table 12), the likelihood ratios for the PSA test would lead to LR+ = 0.786/(1 − 0.667) = 2.360 with 95% CI = [1.402; 3.975] and
LR− = (1 − 0.786)/0.667 = 0.321 with 95% CI = [0.192; 0.538]. In Knottnerus (3), more introductory explanations can be found.
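The validity parameters of Table 11 and the likelihood ratios of Table 13 can be computed directly from the four cell counts of Table 12. The following sketch (illustrative only) reproduces the PSA example, including the log-scale confidence interval for LR+.

```python
from math import sqrt, log, exp

# Hypothetical PSA study (Table 12): test result vs. standard of reference (biopsy)
TP, FP, FN, TN = 55, 10, 15, 20
n_diseased, n_nondiseased = TP + FN, FP + TN

se = TP / (TP + FN)                       # sensitivity
sp = TN / (TN + FP)                       # specificity
acc = (TP + TN) / (TP + FP + FN + TN)     # accuracy
ppv = TP / (TP + FP)                      # positive predictive value
npv = TN / (TN + FN)                      # negative predictive value

lr_pos = se / (1 - sp)                    # likelihood ratio for a positive test
lr_neg = (1 - se) / sp                    # likelihood ratio for a negative test

# 95% CI for LR+ on the log scale (Table 13), with p1 = Se and p0 = 1 - Sp
p1, p0 = se, 1 - sp
se_log = sqrt((1 - p1) / (p1 * n_diseased) + (1 - p0) / (p0 * n_nondiseased))
ci_low = exp(log(lr_pos) - 1.96 * se_log)
ci_high = exp(log(lr_pos) + 1.96 * se_log)

print(f"Se={se:.3f} Sp={sp:.3f} Acc={acc:.2f} PPV={ppv:.3f} NPV={npv:.3f}")
print(f"LR+ = {lr_pos:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f}), LR- = {lr_neg:.2f}")
# -> Se=0.786 Sp=0.667 Acc=0.75 PPV=0.846 NPV=0.571, LR+ = 2.36, LR- = 0.32
```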
4.2.2 ROC. A standard approach to evaluate the sensitivity and specificity of diagnostic procedures simultaneously is the Receiver Operating Characteristic (ROC) curve. For ordinal and continuous outcomes, a curve
is fitted to describe the inherent tradeoff between sensitivity and specificity of a diagnostic procedure. In contrast to the accuracy, the ROC is independent of the prevalence of the disease. Each point on the ROC curve is associated with a specific diagnostic criterion (i.e., a specific cut-off value to classify the patient with regard to the presence/absence of the disease). The ROC was
Table 11. Formulas for the Validation Parameters and Associated Parameters for a 2 × 2 Table (19)

True positives, true negatives, false positives, false negatives:
    see section "True Positives, True Negatives, False Positives, False Negatives".
Prevalence:
    Pr = (TP + FN)/N
Sensitivity:
    Se = P(T+ | D+) = TP/(TP + FN), where T+ = test positive, D+ = disease present.
    Standard error: SE(Se) = √[Se·(1 − Se)/(TP + FN)]
    95% confidence interval: 95% CI(Se) = Se ± 1.96·SE(Se)
Specificity:
    Sp = P(T− | D−) = TN/(FP + TN), where T− = test negative, D− = disease absent.
    Standard error: SE(Sp) = √[Sp·(1 − Sp)/(TN + FP)]
    95% confidence interval: 95% CI(Sp) = Sp ± 1.96·SE(Sp)
Accuracy:
    Acc = (TP + TN)/(TP + FP + FN + TN)
Positive predictive value:
    PPV = P(D+ | T+) = TP/(TP + FP) = Pr·Se / [Pr·Se + (1 − Pr)·(1 − Sp)]
Negative predictive value:
    NPV = P(D− | T−) = TN/(TN + FN) = (1 − Pr)·Sp / [(1 − Pr)·Sp + Pr·(1 − Se)]
Table 12. A 2 × 2 Table for a Hypothetical Clinical Study in Patients with Suspected Prostate Cancer

                          Standard of reference
Test result               D+ (disease present)   D− (disease absent)   Total
T+ (test positive)        55                     10                    65
T− (test negative)        15                     20                    35
Total                     70                     30                    100
Table 13. Formulas for the Likelihood Ratio and Associated Parameters (3)

Likelihood ratio for data with X categories (X > 2):
    likelihood ratio for outcome X: LRX = P(X | D+) / P(X | D−),
    P(X | D+) = probability for outcome X in diseased patients,
    P(X | D−) = probability for outcome X in non-diseased patients.

Likelihood ratios for dichotomous data:
    likelihood ratio for diseased patients: LR+ = P(T+ | D+) / P(T+ | D−) = Se / (1 − Sp),
    likelihood ratio for non-diseased patients: LR− = P(T− | D+) / P(T− | D−) = (1 − Se) / Sp,
    P(T+ | D+) = probability for a positive test result in diseased patients,
    P(T+ | D−) = probability for a positive test result in non-diseased patients,
    P(T− | D+) = probability for a negative test result in diseased patients,
    P(T− | D−) = probability for a negative test result in non-diseased patients,
    Se = sensitivity, Sp = specificity.

95% confidence interval for the likelihood ratio:
    95% CI(LRX) = exp[ ln(p1/p0) ± 1.96·√((1 − p1)/(p1·n1) + (1 − p0)/(p0·n0)) ],
    p1 = P(X | D+) based on sample size n1 of diseased patients,
    p0 = P(X | D−) based on sample size n0 of non-diseased patients.
developed in the signal detection theory during World War II for the analysis of radar images and was adopted to medical statistics in the 1960s and 1970s for the analysis and interpretation of diagnostic procedures where signals on images or screens were to be evaluated. For the estimation of a ROC curve, first the sample is divided into the groups truly diseased and truly not diseased patients. These groups have different distributions of the outcomes of the diagnostic procedure (see Fig. 4). Assuming that higher test results indicate the disease, the test procedure is set to ‘‘positive’’ when the outcome is equal or
higher than the cut-off value and to ‘‘negative’’ when the result is lower than the cut-off value. Each possible cut-off value c represents an inherent tradeoff between sensitivity and specificity and results in a 2 × 2 table. For the ROC analysis, the outcome of a diagnostic procedure has to be at least ordinal (see Table 14). In the example, the image quality is measured on a score from 1 to 10, the cut-off value dichotomizes the outcome so that, for example, scores 1 to 5 would lead to the outcome ‘‘poor image’’ and scores 6 to 10 would lead to the outcome ‘‘good image.’’ The discriminatory accuracy of a diagnostic procedure can be represented graphically
Figure 4. Distribution of patients with and without disease, with cut-off value c on the score of the diagnostic procedure: values at or above c are test positive (TP among diseased, FP among not diseased patients), values below c are test negative (TN among not diseased, FN among diseased patients).

Table 14. Outcome Categories and True Disease Status in a 2 × C Table

                           Truth
Test result categories     D+ (D = 1): disease present     D− (D = 0): disease absent     Total
1                          n11                             n01                            n.1
2                          n12                             n02                            n.2
...                        ...                             ...                            ...
K                          n1K                             n0K                            n.K
Total                      n1                              n0                             N
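A minimal sketch of an empirical ROC analysis for continuous scores (invented data; higher scores indicate disease): every observed value is used as a cut-off, the resulting sensitivity/1-specificity pairs trace the ROC curve, and the area under it is obtained by the trapezoidal rule and checked against its interpretation as the probability that a diseased patient scores higher than a non-diseased one.

```python
import numpy as np

# Hypothetical diagnostic scores; higher values indicate disease
scores_diseased = np.array([6.1, 7.4, 5.2, 8.0, 6.6, 9.3, 4.9, 7.1])
scores_healthy = np.array([3.2, 4.1, 5.0, 2.8, 4.6, 3.9, 5.5, 4.3])

# Sweep every observed value as a cut-off c: "test positive" means score >= c
cutoffs = np.sort(np.concatenate([scores_diseased, scores_healthy]))[::-1]
tpr = [np.mean(scores_diseased >= c) for c in cutoffs]   # sensitivity
fpr = [np.mean(scores_healthy >= c) for c in cutoffs]    # 1 - specificity

# Area under the empirical ROC curve by the trapezoidal rule,
# with the end points (0, 0) and (1, 1) added explicitly
xs = np.array([0.0] + fpr + [1.0])
ys = np.array([0.0] + tpr + [1.0])
auc = np.sum((xs[1:] - xs[:-1]) * (ys[1:] + ys[:-1]) / 2)

# Probabilistic interpretation: P(diseased score > healthy score), ties counted half
diffs = scores_diseased[:, None] - scores_healthy[None, :]
auc_rank = np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)

print(f"AUC (trapezoid) = {auc:.3f}, AUC (rank interpretation) = {auc_rank:.3f}")
```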
Figure 5. Example of a ROC curve (sensitivity plotted against 1-specificity).
by a ROC curve (see Fig. 5), where the sensitivity (i.e., the rate of true positive test results) is displayed on the vertical axis and 1-specificity (i.e., the rate of false positive results) on the horizontal axis. If the ROC hits the upper-left corner, an optimal differentiation between diseased and not diseased patients (i.e., the true positive rate = 1 and
the false negative rate = 0) is achieved. At the bisector of the graph, the discrimination of the diagnostic procedure is equal to chance. For c = + ∞ (plus infinity), all patients are set as ‘‘not diseased’’; for c = − ∞ (minus infinity), all patients are set as ‘‘diseased.’’ The advantages of the ROC analysis are that the curve is invariant with regard to the scale of the diagnostic procedure itself and is invariant to monotonic transformations. These advantages enable scientists to compare diagnostic procedures with different scales based on ROC curves. ROC curves are, in addition, independent from the prevalence of the disease. The likelihood ratio can be extracted from the ROC curve as the slope of the tangent line at a given cut-off value. As a measurement of the validity, the area under the ROC curve (AUC) summarizes the sensitivity and specificity over the whole range of cut-off values. An AUC of 0.5 indicates that the procedure is equal to chance alone, whereas a value of 1 indicates a perfect discrimination of ‘‘diseased’’ and ‘‘not diseased’’ patients. The AUC is limited to the interval [0, 1]. In practice, the AUC should be at least 0.5. Values below 0.5 lead to a diagnostic procedure that would be worse than chance
Figure 6. Examples of AUCs: ROC curves with AUC = 100% (perfect discrimination) and AUC = 50% (no better than chance).
alone. In general, the AUC can be interpreted as the average sensitivity across all values of the specificity and vice versa. For the assessment of a diagnostic procedure, in addition to the AUC, the ROC curve should also be provided and taken into consideration to check whether the curves of the test and control procedure cross or whether one procedure is consistently superior to the other (see Fig. 6). For further reading, see Dorfman (22), Hanley and McNeil (23), and especially Zweig and Campbell (24) as an introductory text and Zhou et al. (18) for advanced readers.

REFERENCES

1. CPMP, CPMP points to consider: points to consider on the evaluation of diagnostic agents, Committee for Proprietary Medicinal Products (CPMP) of the European Agency for the Evaluation of Medicinal Products (EMEA), CPMP/EWP/1119/98, 2001.
2. FDA Draft Guideline, Developing Medical Imaging Drugs and Biological Products. Washington, DC: FDA, 2000.
3. A. J. Knottnerus, The Evidence Base of Clinical Diagnosis. London: BMJ Books, 2002.
4. M. Eliasziw and A. Donner, Application of the McNemar test to non-independent matched pair data. Stat. Med. 1991; 10: 1981–1991.
5. M. M. Shoukri and V. Edge, Statistical Methods for the Health Sciences. Boca Raton, FL: CRC Press, 1996.
6. A. Donner, Statistical methods in ophthalmology: an adjusted chi-square approach. Biometrics 1989; 45(2): 605–611.
7. J. L. Fleiss, Statistical Methods for Rates and Proportions, 2nd ed. New York: John Wiley & Sons, 1981.
8. J. M. Bland and D. G. Altman, Measuring agreement in method comparison studies. Stat. Meth. Med. Res. 1999; 8: 135–160.
9. M. M. Shoukri, Measures of Interobserver Agreement. Boca Raton, FL: CRC Press, 2003.
10. J. L. Fleiss, The Design and Analysis of Clinical Experiments. New York: John Wiley & Sons, 1999.
11. J. Cohen, A coefficient of agreement for nominal scales. Educ. Psycholog. Measur. 1960; 20: 37–46.
12. J. Landis and G. Koch, The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159–174.
13. S. R. Searle, G. Casella, and C. E. McCulloch, Variance Components. New York: John Wiley & Sons, 1992.
14. L. G. Portney and M. P. Watkins, Foundations of Clinical Research: Applications to Practice. Norwalk, CT: Appleton & Lange, 1993.
15. L. I. Lin, A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989; 45: 255–268.
16. J. M. Bland and D. G. Altman, Statistical method for assessing agreement between two methods of clinical measurement. Lancet 1986; i: 307–310.
17. J. J. Bartko, General methodology II. Measures of agreement: a single procedure. Stat. Med. 1994; 13: 737–745.
18. X-H. Zhou, N. Obuchowski, and D. K. McClish, Statistical Methods in Diagnostic Medicine. New York: John Wiley and Sons, 2002.
19. L. Edler and C. Ittrich, Biostatistical methods for the validation of alternative methods for in vitro toxicity testing. ATLA 2003; 31(Suppl 1): S5–S41.
20. R. J. Panzer, E. R. Black, and P. F. Griner, Diagnostic Strategies for Common Medical Problems. Philadelphia, PA: American College of Physicians, 1991.
21. A. Agresti, An Introduction to Categorical Data Analysis. New York: John Wiley & Sons, 1996.
22. D. D. Dorfman, Maximum-likelihood estimation of parameters of signal-detection theory and determination of confidence intervals rating method data. J. Math. Psychol. 1969; 6: 487–496.
23. J. A. Hanley and B. J. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143: 29–36.
24. M. H. Zweig and G. Campbell, Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin. Chem. 1993; 39(4): 561–577.
FURTHER READING P. Armitage and G. Berry, Statistical Methods in Medical Research. 3rd ed. Oxford: Blackwell Science, 1994.
DISCRIMINANT ANALYSIS
KLAUS-D. WERNECKE
The Humboldt University of Berlin
Charité—University Medicine Berlin
Berlin, Germany

1 INTRODUCTION

Classification problems occur in a wide range of medical applications in which n objects (e.g., patients) are to be assigned to K certain classes (e.g., diagnoses, tumor classes, lesion types) on the basis of p observations x = (x1, x2, ..., xp) (e.g., symptoms, complaints, intraocular pressure, blood pressure, X-ray, histology, grading) on these patients. As a result of a set of p ≥ 1 observations (features or variables) for every patient, the corresponding statistical methods are designated as multivariate ones. Given a set of data from n patients in p variables, one task is the identification of groups, classes, or clusters of patients that are "similar" (mostly in terms of a distance) with respect to the variables x1, ..., xp within the classes and different between the classes. The question here is to give a description of (diagnostic) patient groups on the basis of the observed variables x1, ..., xp.

Example 1: Clustering with parameters from blood tests
The features x1: pH, x2: bicarbonate, x3: pO2, x4: pCO2, x5: lactate, x6: glucose, x7: sodium, x8: potassium, x9: chloride, x10: carbamide, x11: creatinine, x12: content of serum iron, and x13: content of alkaline phosphatase should be used for a description of diagnostic classes such as hepatitis, liver cirrhosis, obstructive jaundice, or healthy persons (1). Figure 1 shows a scatter-plot with 20 (numbered) patients using the two variables, x12 and x13, respectively. Groups of patients, "similar" in x12, x13, and corresponding to the diagnoses in question, are bordered (H: hepatitis, LC: liver cirrhosis, OI: obstructive jaundice, HP: healthy persons).

Another important task in clinical medicine is the prediction of a diagnosis for a certain patient given a set of p variables. In medical practice, situations often exist where the definitive diagnosis can only be made after a large number of clinical investigations, not seldom only after autopsy. What is needed here is a statistically-supported decision, taking into consideration available features x = (x1, x2, ..., xp) such as clinical or laboratory-type observations that can be accessed without too much inconvenience to the patient and without high cost. In a first (training) step of constructing a rule of prediction, one knows the class membership of patients (even from autopsy) and tries to "repeat" the diagnostic process by means of the trained prediction rule using only the available features x mentioned. If successful, further patients with unknown class membership can then be allocated (diagnosed) on the basis of the rule obtained using only the p available features x1, ..., xp (prediction step).

Example 2: Diagnosis of neuroborreliosis burgdorferi
The correct diagnosis of neuroborreliosis burgdorferi is known to be extremely difficult in children. In the university hospital of Graz, Austria, various clinical outcomes were examined for their qualification to predict this diagnosis. Two patient groups with a known diagnosis served as a starting point of the investigations:
• Group 1 (LNB): 20 patients suffering from liquor-manifest neuroborreliosis (positive B. burgdorferi CSF-titer—definitive neuroborreliosis)
• Group 2 (NNB): 41 patients without neuroborreliosis (control group)
(CSF: cerebrospinal fluid). The following features were examined:
• Cephalea, Stiff Neck, Paresis, CSF cells, CSFigg, CSFalb, CSFprot, SBigg (SB: serum binding, igg: immune globulin g, alb: albumin, prot: protein).
Figure 1. Visualization of the patients in a scatter-plot using blood tests (content of serum iron [mg/l] plotted against content of alkaline phosphatase [U/l]; the bordered groups correspond to H: hepatitis, LC: liver cirrhosis, OI: obstructive jaundice, HP: healthy persons).
It was unknown to which extent certain features and combinations of features are important for the diagnosis and which variables may be omitted for further diagnostic purposes (2). Following the remarks above, two principal tasks exist in classification (3): clustering (also known as unsupervised learning, classification, cluster analysis, class discovery, and unsupervised pattern recognition) and discrimination (also designated as supervised learning, classification, discriminant analysis, class prediction, and supervised pattern recognition). In the first case, the classes are unknown a priori and need to be discovered from the given observations. The procedure then consists of the estimation of the number K of classes (or clusters) and the determination of the classes in connection with the assignment of the given patients to these classes. This type of problem will not be considered in the following. In the second case, however, the existence of K distinct populations, groups, or classes is known a priori, and a random sample has been drawn from each of these classes (for reasons of simplification, the discussion is restricted to K = 2 classes). The given data are used to construct a classification rule (or classifier), which allows an allocation or prediction to
the predefined classes on the basis of the given observations x1j = (x1j1, x1j2, ..., x1jp), j = 1, ..., n1 from class C1, and x2j = (x2j1, x2j2, ..., x2jp), j = 1, ..., n2 from class C2, respectively. In this procedure, the given observations (with known class membership), the so-called training sample {x} = {x11, ..., x2n2}, serve for the construction of the classifier (training step) that forms the basis for a later allocation of patients with unknown class membership (prediction step). A close relationship exists between the two classification problems. On the one hand, the classes discovered from clustering can be used as a starting point for discrimination. On the other hand, in discriminant analysis, an evaluation of the homogeneity of the predefined classes can be achieved by a clustering of the data.

1.1 Exploratory Data Analysis

Before applying a statistical procedure like discriminant analysis, it is always important to check the data with respect to the distributions of the given variables x1, ..., xp or to look for outliers (particularly when using procedures with serious preconditions on the data; see, for example, Section 4).
Figure 2. Data on SBigg [mg/l] and CSFprot [mg/l] for patients from LNB and NNB, respectively.
Example 2 (continuation): Diagnosis of neuroborreliosis burgdorferi The next step should be a visualization of the data in order to judge whether a discrimination is worthwhile, which can be done by means of two-dimensional scatter-plots using some appropriately selected variables from the given set of features. Figure 2 shows the scatter-plot of the observations in the two classes LNB and NNB with respect to SBigg and CSFprot. In case of many given variables, a scatter-plot matrix can be helpful, but it may be principally difficult to select features of possible importance for discrimination (the general problem of feature selection will be considered in Section 4.2). One possibility of visualizing the data as a test of their suitability for discrimination is a transformation [under the criterion of minimal loss of information such as principal component analysis; see, among others, Ripley, (4)] of the given observation vectors x = (x1 , x2 , . . . , xp ) into vectors with lower dimension and their visualization in a two- or three-dimensional space. Figure 3 shows a scatter-plot with the first two principal components of the neuroborreliosis data. Third, one might test for a difference between the two groups of patients by a multivariate analysis of variance (MANOVA)
using the Lawley-Hotelling test statistics [e.g., Morrison (5)]. Whereas discriminant analysis allows an individual allocation for every single patient to one of the given classes, the MANOVA tests the hypothesis of differences in the means between the classes. The analysis results in a significant difference between the classes LNB and NNB with respect to the mean vectors x̄i = ((1/ni) Σj xij1, ..., (1/ni) Σj xijp), where the sums run over j = 1, ..., ni; i = 1, 2. Although this outcome may be interesting regarding the problem of the neuroborreliosis diagnostics from a scientific point of view, an affected patient is certainly less interested in a statement about a group of patients than in the definitive diagnosis for himself or herself. In other words, one is looking for a decision rule that allows for an individual allocation of a given patient to exactly one of the classes (diagnoses) in question.
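A small sketch of the principal component projection mentioned above, using a centered data matrix and its singular value decomposition; the feature matrix here is simulated and only stands in for the neuroborreliosis features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative feature matrix: 61 patients x 8 features (e.g., CSF measurements)
X = rng.normal(size=(61, 8))

# Principal component analysis via SVD of the centered data matrix
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores_2d = Xc @ Vt[:2].T          # coordinates on the first two principal components
explained = s**2 / np.sum(s**2)    # proportion of variance per component

print("first two components explain "
      f"{explained[:2].sum():.1%} of the total variance")
print(scores_2d[:3])               # first three patients in the reduced space
```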
1.2 Fisher's Approach

R. A. Fisher (6) proposed a procedure for finding a linear combination of the variables x1, x2, ..., xp that gives a separation "as good as possible" between the two classes C1, C2. Moreover, Fisher (6) suggested that such a
linear combination can be obtained by a multiple regression approach

    z_j = λ0 + Σ_{l=1}^{p} λl x_jl + ε_j,   j = 1, ..., n,   n = n1 + n2,

where ε_j is the jth error term that denotes the difference between the actual observation z_j and the average value of the response z. λ0 and λ1, ..., λp are unknown parameters that have to be estimated. Fisher (6) took z_j = (−1)^i ni/n if x_j ∈ Ci, i = 1, 2, as indicators for the group membership. With the least-squares estimations of λ0 and λ1, ..., λp from the equation above, one can calculate a corresponding predicted value ẑ_j for every observation vector x_j = (x_j1, ..., x_jp), which means that z̄1 = (1/n1) Σj ẑ_j for all x_j ∈ C1 and z̄2 = (1/n2) Σj ẑ_j for all x_j ∈ C2, respectively. A plausible rule for an (individual) prediction of a given observation x_j to one of the two given classes is then the allocation to that class Cm with the minimal distance to the corresponding mean: |ẑ_j − z̄m| = min_i |ẑ_j − z̄i| (i = 1, 2).
Figure 3. Dimension reduction with principal component analysis. The first two principal components for patients from the LNB and NNB group are shown
Table 1. Classification Table for the Neuroborreliosis Data Using Multiple Regression (Resubstitution)

                         Predicted class
Observed (true) class    LNB    NNB    Total
LNB                      14     6      20
NNB                      0      41     41
Total                    14     47     61
Example 2 (continuation): Diagnosis of neuroborreliosis burgdorferi Using this approach, the following results are obtained for the neuroborreliosis data, arranged in a so-called classification table (Table 1), in which for every observation the prediction has been compared with the (observed) true class of origin (that means a resubstitution of every observation). Thus, altogether 6 wrong predictions were made, which stands for an error rate for this decision rule of 9.84%.
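The regression formulation of Fisher's approach can be sketched as follows on simulated two-class data (the neuroborreliosis data themselves are not reproduced here): least squares supplies the coefficients, each observation is allocated to the class whose mean predicted value is closest, and a resubstitution classification table is tallied.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated training sample: two classes with shifted means in p = 3 features
n1, n2, p = 20, 41, 3
X = np.vstack([rng.normal(loc=0.0, size=(n1, p)),
               rng.normal(loc=1.0, size=(n2, p))])
cls = np.array([1] * n1 + [2] * n2)
n = n1 + n2

# Fisher's indicator coding: z_j = (-1)^i * n_i / n for class i
z = np.where(cls == 1, -n1 / n, n2 / n)

# Least-squares fit of z on the features (with intercept)
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, z, rcond=None)
z_hat = A @ coef

# Class means of the predicted values and nearest-mean allocation
z_bar = np.array([z_hat[cls == k].mean() for k in (1, 2)])
predicted = np.argmin(np.abs(z_hat[:, None] - z_bar[None, :]), axis=1) + 1

# Resubstitution classification table (observed x predicted) and error rate
table = np.array([[np.sum((cls == t) & (predicted == q)) for q in (1, 2)]
                  for t in (1, 2)])
error_rate = np.mean(predicted != cls)
print("observed x predicted:\n", table)
print(f"resubstitution error rate = {error_rate:.2%}")
```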
2 ALLOCATION RULES AND THEIR ERROR RATE

As already mentioned, the example is restricted to K = 2 classes to simplify matters.
Nevertheless, all the methods to be described can be generalized to K > 2 easily. In the neuroborreliosis example, a decision rule is sought that allows for an (individual) prediction or allocation of a given patient to one of the two diagnoses in question. Generally, R(x) is defined as a prediction rule or an allocation rule on the basis of p observations x = (x1, x2, ..., xp) of a patient, where R(x) = i implies that the patient is to be assigned to the ith class Ci (i = 1, 2). The rule R(x) divides the feature space into two disjoint regions R1 and R2, and, if x falls in the region Ri, the patient is allocated to class Ci. The probability of a wrong allocation from class C1 to class C2 for any patient (from class C1) is given by P12(R(x)) = P(R(x) = 2 | x ∈ C1) (the probability of an allocation to class C2 provided the patient belongs to class C1). The probability of a wrong allocation from class C2 to class C1 is defined analogously by P21(R(x)) = P(R(x) = 1 | x ∈ C2). Supposing patients in class Ci have feature vectors x = (x1, ..., xp) with a class-specific p-dimensional distribution Fi(x) and a corresponding density function fi(x) (i = 1, 2), then P12(R(x)) and P21(R(x)) can be expressed as

$$P_{12}(R(x)) = \int_{R_2} f_1(x)\,dx \quad \text{and} \quad P_{21}(R(x)) = \int_{R_1} f_2(x)\,dx,$$

respectively.
As Fig. 4 illustrates for p = 1 and normal distributions Fi(x) with densities fi(x), the integral ∫_{R_2} f_1(x) dx, for example, defines the probability that patients from class 1 have observed values x ∈ R2 (i.e., within the region R2), and it is nothing more than the area under the curve f1(x) over R2 (hatched areas in Fig. 4). The overall probability of misallocations, also known as the (overall) misallocation rate or (overall) error rate (7), can then be calculated as

$$E(R(x)) = P_{12}(R(x)) + P_{21}(R(x)) = \int_{R_2} f_1(x)\,dx + \int_{R_1} f_2(x)\,dx.$$
Figure 4 shows the overall probability of misallocations as the sum of the two hatched areas in the case of two normal distributions fi(x) (i = 1, 2) and p = 1.

2.1 Estimations of Error Rates

The practical use of a discriminator depends on the quality with which newly appearing patients of unknown class membership can be allocated on the basis of the allocation rule R(x). Given a training sample {x} as introduced in Section 1, one therefore wants to estimate the overall probability of misallocations, that is, the overall error rate connected with R(x). Among the many proposals for estimating that error rate [for a review, see Ambroise and McLachlan (8)], the leaving-one-out method (9) and the so-called π-error estimation (10) are frequently used (Fig. 5). In the leaving-one-out procedure, the available data are repeatedly reduced by one patient. The allocation rule is formed from the remaining subset and then applied to the removed patient. Because the class membership of that patient is known, one can decide whether the allocation is wrong or correct. The procedure estimates the so-called true error rate, namely, the error rate of an allocation rule constructed on the basis of a training sample and applied to an individual patient not belonging to the training set (in the later application of the classifier, to patients with unknown class membership). The training sample {x} = {x11, ..., x2n2}, which is customarily ordered by classes, has to be randomly mixed before starting the algorithm in order to secure a representative sample {x}rem. The leave-one-out error estimation and the well-known resubstitution method (in which the discriminator is trained and tested on the whole training sample, as in Section 1.2 for the neuroborreliosis data) will be almost identical if the sample size n is large enough (relative to the number of features). Generally, however, the resubstitution estimate proves to be too small and is thus overly optimistic (11). Another rather often used method of estimating the error rate is the bootstrap method (12). Again, one generates new samples (bootstrap samples) by randomly drawing one (or more) observation(s) from the training sample {x} and repeating this sampling several times (drawing with replacement). The error rate is then calculated on the basis of the bootstrap samples.
Figure 4. Misallocation in the case of two normal distributions f1(x) and f2(x) (p = 1).
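A brief sketch contrasting the resubstitution estimate with the leaving-one-out estimate for a linear discriminant classifier; it assumes scikit-learn is available and uses synthetic data in place of a real training sample.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (30, 4)), rng.normal(0.8, 1.0, (30, 4))])
y = np.array([0] * 30 + [1] * 30)

clf = LinearDiscriminantAnalysis()

# Resubstitution: train and test on the whole sample (tends to be optimistic).
resub_error = 1.0 - clf.fit(X, y).score(X, y)

# Leaving-one-out: each observation is removed, the rule is rebuilt on the rest,
# and the removed observation is then allocated with that rule.
loo_error = 1.0 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

print(f"resubstitution error: {resub_error:.3f}")
print(f"leave-one-out error:  {loo_error:.3f}")
```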
2.2 Error Rates in Diagnostic Tests

The results of the error estimation will be summarized in a classification table. Table 2 shows the corresponding quantities for diagnostic tests, where C1 denotes the population of diseased, say D+, and C2 the population of non-diseased, say D−. Prediction to class C1 means a positive test result, say T+, and prediction to class C2 a negative test result, say T−. h11, h12, h21, and h22 are the obtained frequencies of allocation, tp [tn] denotes the frequency of true positive [negative] decisions, and fp [fn] the frequency of false positive [negative] decisions. The probability P(T+ | D+) of true positive decisions (prediction to D+ for patients from D+) is known as sensitivity; the probability P(T− | D−) of true negative decisions (prediction to D− for patients from D−) is known
as specificity. For clinical practice, it is even more important to know the probability of a true positive decision referring to all positive decisions (i.e., the probability of membership to D+ provided that the patient has been allocated to D+). This is called the positive predictive value (PPV), with probability P(D+ | T+). Analogously, the probability of membership to D−, provided that the patient has been allocated to D−, is denoted by P(D− | T−) [negative predictive value (NPV)]. The calculation of the above-mentioned statistical quantities in diagnostic tests is summarized in Table 3. Another calculation of the predictive values goes back to a very famous mathematical theorem, already established in 1763 by Thomas Bayes (13), the so-called Bayes theorem. The Bayes theorem combines prior assessments (pre-test probabilities) with the eventual test results to obtain a posterior assessment of the diagnosis (post-test probability).
Table 2. Classification Table in Diagnostic Tests

True class (disease)    Test result T+                              Test result T−                               Total
D+                      h11 (tp): sensitivity; pos. pred. value     h12 (fn): 1 − sensitivity; 1 − NPV           h11 + h12
D−                      h21 (fp): 1 − specificity; 1 − PPV          h22 (tn): specificity; neg. pred. value      h21 + h22
Total                   h11 + h21                                   h12 + h22                                    n = h11 + h12 + h21 + h22
Figure 5. π-error estimation (m = 1: leaving-one-out method). A subset {x}k = {xk1, ..., xkm} is removed from {x}, the classifier is trained on {x}rem = {x} − {x}k and tested on {x}k, giving pk = Σ_{j=1}^{m} pkj with pkj = 0 if xkj is correctly classified and pkj = 1 if it is wrongly classified; after n/m passes, the error estimate is E(π) = (1/n) Σ_{k=1}^{n/m} pk.
Table 3. Statistical Quantities in Diagnostic Tests

Probability    Description                        Estimation
P(T+ | D+)     Sensitivity                        tp / (tp + fn)
P(T− | D−)     Specificity                        tn / (tn + fp)
P(D+ | T+)     Positive predictive value (PPV)    tp / (tp + fp)
P(D− | T−)     Negative predictive value (NPV)    tn / (tn + fn)
Taking the pre-test probability P(D+) as the prevalence of the disease, and P(D−) as the probability of no disease, the positive predictive value P(D+ | T+) can be calculated according to Bayes' theorem as

$$P(D+ \mid T+) = \frac{P(T+ \mid D+)\,P(D+)}{P(T+)} = \frac{P(T+ \mid D+)\,P(D+)}{P(T+ \mid D+)\,P(D+) + P(T+ \mid D-)\,P(D-)}$$

(P(T+ | D+): sensitivity, P(T+ | D−): 1 − specificity), which shows the dependence of P(D+ | T+) on the prevalence. The NPV is calculated analogously.
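As a numerical illustration of this formula (the prevalence, sensitivity, and specificity below are made-up values, not taken from the example), the positive and negative predictive values follow directly from Bayes' theorem:

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (PPV, NPV) from the pre-test prevalence and the test's operating characteristics."""
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)  # P(T+)
    ppv = sensitivity * prevalence / p_pos
    npv = specificity * (1 - prevalence) / (1 - p_pos)
    return ppv, npv

# Hypothetical test: 10% prevalence, 90% sensitivity, 95% specificity.
ppv, npv = predictive_values(0.10, 0.90, 0.95)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")  # the PPV stays modest despite the high specificity
```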
Example 2 (continuation): Diagnosis of neuroborreliosis burgdorferi
For reasons of simplification, only the resubstitution method was applied for estimating the error rate in Section 1.2. To complete the calculations, the results are now presented using Fisher's approach as allocation rule and the leaving-one-out method for error estimation, supplemented with the statistical quantities mentioned above (Table 4). The overall error rate of 16.39% is connected with a sensitivity of only 60.00% but a high specificity of 95.12% (i.e., patients from class NNB are predicted better than those from class LNB). Positive and negative predictive values are sufficiently high (85.71% and 82.98%, respectively), referring to the prevalence in the given sample (this prevalence is not necessarily representative of the general population).
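The quantities of Tables 2 and 3 can be read off any 2×2 classification table; the short sketch below reproduces the figures just quoted from the leave-one-out counts reported in Table 4 below (12 and 8 for true LNB, 2 and 39 for true NNB).

```python
# Leave-one-out counts from Table 4: rows = true class (LNB, NNB), columns = predicted class.
tp, fn = 12, 8    # true LNB predicted as LNB / as NNB
fp, tn = 2, 39    # true NNB predicted as LNB / as NNB
n = tp + fn + fp + tn

sensitivity = tp / (tp + fn)     # 0.6000
specificity = tn / (tn + fp)     # 0.9512
ppv = tp / (tp + fp)             # 0.8571
npv = tn / (tn + fn)             # 0.8298
overall_error = (fn + fp) / n    # 0.1639

print(sensitivity, specificity, ppv, npv, overall_error)
```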
Table 4. Classification Table for the Neuroborreliosis Data (leave-one-out, Fisher's approach). Row [%]: sensitivity and specificity; Col. [%]: positive predictive value and negative predictive value

True class    Predicted LNB            Predicted NNB            Total
LNB           12 (60.00%; 85.71%)      8 (40.00%; 17.02%)       20
NNB           2 (4.88%; 14.29%)        39 (95.12%; 82.98%)      41
Total         14                       47                       61

3 BAYES ALLOCATION RULES

The Bayes theorem, mentioned in the previous section, bases the (posterior) probability of belonging to a certain class on a prior estimate of that probability. The prior probability that a patient x belongs to class C1 may be denoted by π1 = P(x ∈ C1) (e.g., the patient has the disease, known as the prior probability of the disease or prevalence). Analogously, membership to class C2 is assigned the prior π2 = P(x ∈ C2) (e.g., the prior probability of no disease), supposing π1 + π2 = 1. Furthermore, the distribution functions Fi(x) of feature vectors (x1, ..., xp) from class Ci and their corresponding densities fi(x) (i = 1, 2) are assumed to be known a priori. The posterior probability that a patient with unknown class membership and given observation vector x belongs to class Ci is given by

$$P(\text{patient} \in C_i \mid x) = P(C_i \mid x) = \frac{\pi_i f_i(x)}{\pi_1 f_1(x) + \pi_2 f_2(x)}, \qquad i = 1, 2,$$

according to Bayes' theorem (13). In other words, given a patient with vector of observations x and unknown class membership, one asks for the (posterior) probability that the patient belongs to class Ci, given the (prior) probabilities of class membership πi and the class-specific probability functions P(x | x ∈ Ci) = fi(x) of class Ci (i = 1, 2). In decision-theoretic terms, any specific allocation rule that minimizes the overall probability of misallocations, the so-called optimal error rate (14), for known πi and fi(x) is said to be a Bayes rule.

3.1 Prediction

A Bayes rule RI(x) is defined by

$$R_I(x) = m, \quad \text{if } \pi_m f_m(x) \ge \pi_i f_i(x) \quad \text{or} \quad P(C_m \mid x) = \max_i P(C_i \mid x); \qquad i = 1, 2,$$

after which a patient will be assigned to that class Cm with the maximum posterior probability P(Cm | x). This rule is also called prediction or identification, respectively. Figure 6 illustrates the identification rule in the case of two normal distributions and equal priors π1 = π2.

Figure 6. Allocation according to identification for equal priors π1 = π2: x is allocated to class 1 where f1(x) ≥ f2(x) and to class 2 where f2(x) ≥ f1(x).

3.2 Action-Oriented Discrimination

The allocation rule RI(x) assesses every misallocation with the same probability 1 − P(Ci | x); i = 1, 2 (i.e., equally for all classes), which may be disadvantageous in medical applications, where certain diagnoses have to be predicted with particular safety. Therefore, losses or costs lij of allocation are introduced when a patient from Ci is allocated to Cj, with lij = 0 for i = j (i.e., zero costs for a correct prediction). A Bayes rule RA(x), given a patient x, is then defined as

$$R_A(x) = m, \quad \text{if } CP_m(R(x)) = \min[CP_1(R(x)), CP_2(R(x))], \quad \text{with} \quad CP_1(R(x)) = \pi_2 l_{21} f_2(x), \quad CP_2(R(x)) = \pi_1 l_{12} f_1(x),$$

after which a patient will be assigned to that class Cm with the minimum misallocation loss CPm(R(x)), conditional on x. This rule is also called the action-oriented classification rule (15). It can easily be seen that the rule RI(x) follows immediately from RA(x) for equal losses l12 = l21. As a result of the rather arbitrary nature of assigning losses of misallocation in practice, they are often taken as equal. According to McLachlan (11), this assignment is not as arbitrary as it may appear at first sight. For Bayes' rule, only the ratio of l12 and l21 is relevant; therefore, the losses can be scaled by π1 l12 + π2 l21. For example, given two classes where C1 are patients suffering from
a rare disease and C2 are not, then, although π1 and π2 are substantially different, the cost of misallocating a patient from C1 may be much greater than the cost of misclassifying a healthy individual. Consequently, π1 l12 and π2 l21 may be comparable, and the assumption of equal priors with unit costs of misallocation is not very far from real situations.

3.3 Maximum-Likelihood Discrimination Rules

In the real world, class priors πi and class conditional densities fi(x) are unknown. The frequentist analogue of Bayes' rule is then the maximum likelihood discriminant rule, in which the unknown quantities have to be estimated appropriately. For known class conditional densities fi(x), the maximum likelihood (ML) discriminator allocates an observation with vector x to the class with the greatest likelihood: RML(x) = m, if fm(x) = max_i fi(x). In the case of equal class priors πi, this means maximizing the posterior probabilities [i.e., it is the optimal or Bayes rule RI(x)]. Otherwise, the rule RML(x) is not optimal (i.e., it does not minimize the overall error rate). Approaches to estimating the class conditional densities fi(x) have been developed by various authors, among others the so-called plug-in rules (see Section 4.1) and nonparametric methods such as kernel or nearest-neighbor methods (see Sections 5 and 7, respectively).
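A compact sketch of the allocation rules of this section for two univariate normal classes (the densities, priors, and losses are invented for illustration and are not taken from the article's data): R_I allocates by maximum posterior, R_A weights the densities by priors and misallocation losses, and the ML rule uses the densities alone.

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities, priors, and losses (assumptions for this sketch).
f1 = norm(loc=0.0, scale=1.0).pdf
f2 = norm(loc=2.0, scale=1.0).pdf
pi1, pi2 = 0.7, 0.3
l12, l21 = 1.0, 5.0   # loss of sending a C1 patient to C2, and vice versa

def allocate(x):
    post1 = pi1 * f1(x) / (pi1 * f1(x) + pi2 * f2(x))   # P(C1 | x)
    r_identification = 1 if post1 >= 0.5 else 2          # R_I: maximum posterior
    cp1 = pi2 * l21 * f2(x)                               # conditional loss if allocated to C1
    cp2 = pi1 * l12 * f1(x)                               # conditional loss if allocated to C2
    r_action = 1 if cp1 <= cp2 else 2                     # R_A: minimum misallocation loss
    r_ml = 1 if f1(x) >= f2(x) else 2                     # maximum likelihood rule
    return r_identification, r_action, r_ml

for x in (0.5, 1.0, 1.5):
    print(x, allocate(x))
```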
4 DISCRIMINATION USING NORMAL MODELS

The precondition of known priors is usually not fulfilled in practice. In medical diagnostics, the prevalence rates can be taken as estimates of the priors if available. If the training sample {x} = {x11, ..., x2n2} has been obtained as a mixture C of the supposed classes C1 and C2, then the prior πi can be estimated by its maximum likelihood estimator π̂i = ni/n (i = 1, 2). Even more problematic is the estimation of the usually unknown class conditional densities fi(x) from the given training sample {x}. Under the precondition of p-dimensionally normally distributed observation vectors (x1, ..., xp) [i.e., x ∼ N(µi, Σi)], with correspondingly normal class densities fi(x) (the parameters µ1, µ2 denote the (expected) class means and Σ1, Σ2 the corresponding covariance matrices in the classes), two special decision rules are obtained from Bayes' formula: linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), respectively. These two decision rules are widely used classification tools and are implemented in most of the known software packages for statistics. Another approach to discrimination in the case of normally distributed observations x is the already mentioned Fisher's discriminant analysis (6), which uses a multiple linear regression approach with the class indicator variable as response and the observations x = (x1, ..., xp) as regressors (16).
4.1 Linear Discriminant Analysis

Supposing equal covariance matrices in all classes (i.e., Σ1 = Σ2 = Σ), the optimal or Bayes decision rule (i = 1, 2) is obtained as

$$R_O(x) = m, \quad \text{if } (x - \mu_m)\,\Sigma^{-1}(x - \mu_m)^T - \ln \pi_m = \min_i\left[(x - \mu_i)\,\Sigma^{-1}(x - \mu_i)^T - \ln \pi_i\right].$$

Supposing a training sample {x}, the corresponding maximum likelihood estimates from {x} are used for the unknown parameters, and a so-called plug-in rule is obtained:

$$\mathrm{LDA}(x) = m, \quad \text{if } (x - \bar{x}_m)\,S^{-1}(x - \bar{x}_m)^T - \ln \hat{\pi}_m = \min_i\left[(x - \bar{x}_i)\,S^{-1}(x - \bar{x}_i)^T - \ln \hat{\pi}_i\right],$$

where S = n\hat{\Sigma}/(n − 2) is the so-called bias-corrected version of the estimated covariance matrix \hat{\Sigma}. Therefore, a patient with observation vector x0 is allocated to the class Cm for which the Mahalanobis distance Mi(x0) (i = 1, 2),

$$M_m(x_0) = \min_i M_i(x_0) = \min_i (x_0 - \bar{x}_i)\,S^{-1}(x_0 - \bar{x}_i)^T,$$

is minimal (apart from the term −ln π̂i). Figure 7 shows the geometrical visualization [for Σ = I (I the identity matrix), the Mahalanobis distance equals the Euclidian distance]. In Fig. 7, the patient with observation vector x0 will be allocated to class C1.

Example 3: Differential diagnosis of glaucoma and ocular hypertension
A difficult problem in ophthalmological diagnostics is the differentiation between glaucoma and ocular hypertension. In order to reach the diagnosis, various clinical investigations (such as measurements of intraocular pressure, fields of vision, different anamnestic findings, clinical history, etc.) are accomplished, resulting in both quantitative and categorical scaled data. The first step of the diagnostic process often consists in the investigation of the quantitatively scaled data. For that reason, two groups of patients were investigated, 58 (n1) patients (class 1: glaucoma) and 41 (n2) patients (class 2: ocular hypertension), using the following features:
• daily intraocular pressure (IOP), brachialis blood pressure (BP),
• ophthalmica blood pressure (OP), perfusion pressure (difference BP − OP),
• treatment—test (IOP after treatment),
and others (2). Table 5 shows the results obtained using linear discriminant analysis [i.e., rule LDA(x)], supposing equal priors with unit losses. The true error rates have been estimated according to the leave-one-out method, including the first four features (IOP, BP, OP, BP−OP), which resulted in altogether 16 wrong allocations and hence an overall error estimate of 16.16%. As was to be expected, the rule LDA(x) has a very good quality for patients suffering from ocular hypertension (only one patient wrongly allocated: 2.44%), whereas 15 patients out of 58 (25.86%) were wrongly diagnosed in the glaucoma class. On the other hand, the positive predictive value of glaucoma is rather large at 97.73% (i.e., the probability of glaucoma is high when it is predicted).
Table 5. Classification Table for the Glaucoma Data (leave-one-out, LDA; Row [%], Col. [%])

True class      Allocated glaucoma       Allocated hypertension    Total
glaucoma        43 (74.14%; 97.73%)      15 (25.86%; 27.27%)       58
hypertension    1 (2.44%; 2.27%)         40 (97.56%; 72.73%)       41
Total           44                       55                        99
Figure 7. Geometrical illustration of linear discriminant analysis (patients from class C1 with mean x̄1 are marked by crosses, class C2 with mean x̄2 by circles).
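The plug-in rule LDA(x) of Section 4.1 can be sketched in plain NumPy as follows (synthetic data only; in practice one would rely on the LDA routines of standard statistical software): the pooled bias-corrected covariance matrix S is estimated, and a new observation is allocated to the class with the smaller Mahalanobis distance, corrected by −ln π̂i as in the rule above.

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=58)   # stand-in for class 1
X2 = rng.multivariate_normal([1, 1], [[1, 0.3], [0.3, 1]], size=41)   # stand-in for class 2
n1, n2 = len(X1), len(X2)
n = n1 + n2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Pooled, bias-corrected covariance matrix S (equal covariance matrices assumed).
S = ((n1 - 1) * np.cov(X1, rowvar=False) + (n2 - 1) * np.cov(X2, rowvar=False)) / (n - 2)
S_inv = np.linalg.inv(S)
pi1, pi2 = n1 / n, n2 / n

def lda_allocate(x0):
    # Squared Mahalanobis distance to each class mean, minus ln(prior) as in the plug-in rule.
    d1 = (x0 - m1) @ S_inv @ (x0 - m1) - np.log(pi1)
    d2 = (x0 - m2) @ S_inv @ (x0 - m2) - np.log(pi2)
    return 1 if d1 <= d2 else 2

print(lda_allocate(np.array([0.2, 0.1])), lda_allocate(np.array([1.2, 0.9])))
```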
4.2 Feature Selection

Particularly in medical applications, the number p of features may be very large relative to the sample size. Consideration might then be given to methods for reducing the number of features, because too many features may cause problems with the performance of the sample discriminant function. On the other hand, each feature selection is an intervention into the given data and always causes a loss of information (17). Nevertheless, for practical applications in medicine, it is often interesting to separate important variables from redundant ones. In principle, if one wants to know which set of variables is effective, one must examine all possible subsets of x1, x2, ..., xp. As very many subsets may exist, a common strategy is to accomplish the selection in a stepwise procedure, forward (including, step by step, the most important variable) or backward (excluding redundant variables). As a measure of the "importance" of a variable, the overall error rate is preferred as selection criterion, which means a measure of class separation over one or a few linear combinations of the features (4). Furthermore, a cross-validated feature selection is always performed. Thereby, the given sample is repeatedly reduced by a certain number of observations, and the feature selection process is carried out on the remainder. In every step, the removed patients are allocated using the decision rule obtained, and the wrong allocations are counted (cross-validated error rate). In the end, those features that appear most frequently in the
validation steps are selected for the discrimination rule (Fig. 8). If certain features show up again and again in the validation steps, they can be considered particularly stable. These most frequent features were recommended to the medical doctors, and for those features a new error estimation was accomplished (recommended error rate).

Example 3 (continuation): Diagnosis of glaucoma and ocular hypertension
The discrimination result of an overall error rate of 16.16% was reached with the four features IOP, BP, OP, BP−OP. In addition, it is always interesting to estimate the true error rate for the individual features univariately in order to get an overview of their performance separately. On the other hand, it may be of interest to judge the relationships between features and their impact on the classification results when combined. Table 6 gives an overview of the error estimations (leaving-one-out estimation) separately and in building-up steps of feature selection (without cross-validation). This procedure selects in each step the feature with the best error estimate. As a result, IOP is the best-predicting feature among the four given characteristics.
Table 6. Error Rates of Separate Features and Consecutive Combinations

Feature    Error rate (separately)    Error rate (in combination)
IOP        16.16%                     16.16%
BP         38.38%                     15.15%
OP         19.19%                     18.18%
BP-OP      34.34%                     16.16%
Example 2 (continuation): Diagnosis of neuroborreliosis burgdorferi
For the diagnosis mentioned, the doctors want to separate important from redundant variables. When selecting features, the obtained set of variables will certainly depend on the arbitrariness of the given training sample. Therefore, a serious feature selection must always include a cross-validation in order to avoid that randomness (cross-validated feature selection). Table 7 gives the results of the cross-validation process for the neuroborreliosis data. A set of three features was chosen most often (five times), and the variables 5 (SBigg), 8 (CSF-prot), and 3 (Paresis) were selected with frequencies 11, 11, and 10, respectively. Thus, the feature set with the 3 variables 5 (SBigg), 8 (CSF-prot), and 3 (Paresis) was recommended to the medical doctors. The recommended error estimation (leave-one-out) for this feature set resulted in 11.48%. For reasons of brevity, the corresponding classification table is not presented here.

Figure 8. Cross-validated feature selection (compare also Fig. 5): in each validation step a subset {x}k is removed, the classifier is trained and a feature set [Mk] is selected on the remaining data, and the removed observations give the cross-validated error F(cv) = (1/n) Σk Σj pkj; the feature set [Mk0] that was most frequent in the validations is then assessed on the whole sample, giving the recommended error F(rec) = (1/n) Σj pk0j.

Table 7. Results of a Cross-Validation in 11 Steps

Selected numbers of features    3     4     5
Corresponding frequencies       5     4     2

Selected features               5     8     3     1     2
Corresponding frequencies       11    11    10    6     3

4.3 Quadratic Discriminant Analysis

In case of unequal covariance matrices, Σ1 ≠ Σ2, but with normally distributed feature vectors (x1, x2, ..., xp), another decision rule, the so-called quadratic discriminant analysis (QDA), is obtained, written in the form of a plug-in rule as

$$\mathrm{QDA}(x) = m, \quad \text{if } Q_m(x) = \min_i Q_i(x), \quad \text{with} \quad Q_i(x) = \tfrac{1}{2}(x - \bar{x}_i)\,S_i^{-1}(x - \bar{x}_i)^T + \tfrac{1}{2}\ln|S_i| - \ln \hat{\pi}_i; \qquad i = 1, 2,$$
where the estimates of the covariance matrices Σ̂i are replaced by the bias-corrected estimates Si = ni Σ̂i /(ni − 1). This rule is also implemented in most of the commercial software packages, although it is rather seldom worth applying, because in the majority of applications it yields no essential improvement of the results. It could be shown in numerous simulation experiments that quadratic discriminant analysis should be applied only in the case of "huge" differences between the covariance matrices (18).

5 NONPARAMETRIC DISCRIMINATION

The methods of discrimination described so far suppose some model [for the class conditional densities fi(x)] on the basis of which certain parameters have to be estimated; see, for example, Section 4 for normally distributed densities fi(x). Such approaches are called parametric. On the other side, procedures that have been developed without any postulation of a certain model are called nonparametric. The focus here is on kernel methods, which dominate the nonparametric density estimation literature. Given a training sample {x} = {x11, ..., x2n2}, in the nonparametric kernel estimation of the class conditional density fi(x) a "kernel" is laid around each observation xij of the training sample. The kernel function K(i)(x, xij) may have any shape but has to fulfil the conditions of a probability density function. A convenient kernel that is very often used is a multivariate normal density (i = 1, 2),

$$K^{(i)}(x, x_{ij}) = (2\pi)^{-p/2}\,|\hat{\Sigma}_i|^{-1/2} \exp\{-\tfrac{1}{2}(x - x_{ij})\,\hat{\Sigma}_i^{-1}(x - x_{ij})^T\},$$
where Σ̂i = ςi · diag{s²i1, ..., s²ip} is taken as a multiple of the diagonal matrix of the sample variances s²il = 1/(ni − 1) Σ_{j=1}^{ni} (xijl − x̄il)² (l = 1, ..., p), and the ςi, so-called smoothing parameters, have to be estimated from the training sample. The kernel estimator of fi(x) in class Ci is then defined as the arithmetic mean of all K(i)(x, xij) in the ith class (i = 1, 2):

$$\hat{f}_i(x) = \frac{1}{n_i} \sum_{j=1}^{n_i} K^{(i)}(x, x_{ij}).$$
A graphical representation of a kernel estimation in class Ci is given in Fig. 9 for 7 observations and p = 1 (15). With the estimated class conditional densities f̂i(x), the decision rule RI(x) or RA(x), respectively, can be applied for discrimination.
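A kernel discriminator along these lines can be sketched with a Gaussian kernel density estimate per class. The sketch below is an approximation under stated assumptions: the data are synthetic, and scipy's gaussian_kde uses its own bandwidth rule rather than the class-specific smoothing parameters ςi described above.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(5)
X1 = rng.gamma(2.0, 1.0, size=(60, 1))   # deliberately non-normal class 1 observations
X2 = rng.gamma(4.0, 1.0, size=(40, 1))   # deliberately non-normal class 2 observations
pi1, pi2 = 0.5, 0.5                      # equal priors

f1_hat = gaussian_kde(X1.T)              # kernel estimate of f1(x)
f2_hat = gaussian_kde(X2.T)              # kernel estimate of f2(x)

def kernel_allocate(x):
    # Decision rule R_I(x): allocate to the class with the larger pi_i * f_i_hat(x).
    p1, p2 = pi1 * f1_hat(x), pi2 * f2_hat(x)
    return np.where(p1 >= p2, 1, 2)

print(kernel_allocate(np.array([1.0, 3.0, 6.0])))
```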
Example 4: The IIBM (International Investigation of Breast MRI) study
Accuracy data about contrast-enhanced MRI in the literature vary, and concern exists that widespread use of contrast-enhanced MRI combined with insufficient standards might cause unsatisfying results. Therefore, an international multicenter study was established (19) in order to find out whether, based on MRI enhancement characteristics, a statistically founded distinction between benign and malignant lesions is possible. The study resulted in findings on more than 500 patients. The analyses were performed with the MRI signal intensities SIGs at six consecutive sample points (s = 1, ..., 6). The differentiation between "malignant" and "benign" was made on the basis of histological classes, resulting in 132 malignant and 63 benign lesions (one per patient) with complete datasets. For each lesion, a total of 12 parameters were calculated from the original MRI signal intensities. From the training sample, it was concluded that the
assumption of normally distributed densities fi(x) was not fulfilled. Thus, a nonparametric kernel discriminator was applied and achieved, after a (cross-validated) feature selection and with equal priors π̂1 = π̂2 = 0.5, a best π-error estimation of 15.90%. Table 8 (left side) presents the results obtained, showing both a high sensitivity (90.91%) and a high positive predictive value (86.33%). But the study aimed at a discrimination with as high a specificity as possible. For this reason, the original prior π̂1 was replaced by π̂12 = π̂1 l12, and π̂2 by π̂21 = π̂1 l21, respectively, in order to realize the influence of different losses lij (compare Section 3.2). With the priors π̂12 = 0.3 and π̂21 = 0.7, the specificity increased to 76.19% with a slightly improved positive predictive value of 87.80%, but at the expense of a higher overall error rate of 20.00% (Table 8, right side).

Figure 9. Kernel estimation f̂i(x) of the class conditional density fi(x).

Table 8. Classification with Kernel Discriminator and Two Different Variants of Priors (leave-one-out; Row [%], Col. [%]; ς̂1 = 0.31, ς̂2 = 0.24)

Priors π̂1 = π̂2 = 0.5:

True class    Allocated malignant       Allocated benign         Total
malignant     120 (90.91%; 86.33%)      12 (9.09%; 21.43%)       132
benign        19 (30.16%; 13.67%)       44 (69.84%; 78.57%)      63
Total         139                       56                       195

Priors π̂12 = 0.3, π̂21 = 0.7:

True class    Allocated malignant       Allocated benign         Total
malignant     108 (81.82%; 87.80%)      24 (18.18%; 33.33%)      132
benign        15 (23.81%; 12.20%)       48 (76.19%; 66.67%)      63
Total         123                       72                       195

6 LOGISTIC DISCRIMINANT ANALYSIS

According to J. A. Anderson (20), logistic discrimination is broadly applicable in a wide variety of distributions, including multivariate normal distributions with equal
covariance matrices, multivariate discrete distributions following the log-linear model with equal interaction terms, and joint distributions of continuous and discrete random variables; it is therefore particularly advantageous in applications with mixed data. Logistic discriminant analysis is called semi-parametric because the class conditional densities fi(x) are not modeled themselves, but only the ratio between them. With given priors π1 and π2, the model

$$\log \frac{\pi_1 f_1(x)}{\pi_2 f_2(x)} = \beta_0 + \beta x^T$$

is postulated, where β0 and β = (β1, ..., βp) are unknown parameters that have to be estimated. Remembering P(Ci | x) = πi fi(x) (without the denominator π1 f1(x) + π2 f2(x)) from rule RI(x) leads to P(C1 | x)/[1 − P(C1 | x)] = exp(β0 + βx^T), from which one obtains

$$P(C_1 \mid x) = \frac{\exp(\beta_0 + \beta x^T)}{1 + \exp(\beta_0 + \beta x^T)}$$
and

$$P(C_2 \mid x) = \frac{1}{1 + \exp(\beta_0 + \beta x^T)}$$
analogously as posterior probabilities for the corresponding allocations. After estimating the unknown parameters β0, β1, ..., βp according to the maximum likelihood principle, one again decides for the class with the greatest posterior probability [i.e., on the basis of the decision rule RLOG(x) = m, if P(Cm | x) = max_i P(Ci | x), a maximum likelihood discrimination rule]. The software packages mostly contain an equivalent decision: allocation of the patient with observation vector x to class C1 if P(C1 | x) > 0.5, otherwise to C2.

Example 3 (continuation): Diagnosis of glaucoma and ocular hypertension
Beside the already mentioned metrically scaled measurements, categorical observations from various clinical investigations have been collected:
• fields of vision according to Goldmann, Tuebinger, Octopus
• different anamnestic findings.
The whole set of metrical variables in combination with two categorical features (fields of vision according to Tuebinger and Octopus) has been included in a logistic regression analysis in the sense of logistic discrimination as described above. With the estimated parameters β̂0, β̂1, ..., β̂p, one is able to determine the corresponding posterior probabilities and to decide on the corresponding allocations (predicted allocation). Together with the observed allocations (true classes), one can calculate a corresponding classification table (Table 9), which resulted in an overall error rate of 10.10%. As error estimations according to the π-method are not offered in commercial software for logistic discrimination, the numbers are unfortunately shown with the resubstitution method. A comparison with previous results is therefore only possible using the same error estimation. The LDA resulted in an overall error of 11.11%, using only the four features IOP, BP, OP, and BP−OP and resubstitution as error estimation (2).
Table 9. Classification Table for the Glaucoma Mixed Data (Resubstitution, Logistic Discrimination; Row [%], Col. [%])

Observed        Predicted glaucoma       Predicted hypertension    Total
glaucoma        52 (89.66%; 92.86%)      6 (10.34%; 13.95%)        58
hypertension    4 (9.76%; 7.14%)         37 (90.24%; 86.05%)       41
Total           56                       43                        99
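A minimal logistic-discrimination sketch (scikit-learn assumed; synthetic mixed-type data in place of the glaucoma measurements): the logit of P(C1 | x) is modeled as β0 + βx^T, and a patient is allocated to C1 when the estimated posterior exceeds 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
# Two quantitative features plus one dichotomous (categorical) feature.
X = np.column_stack([
    rng.normal(0, 1, 120),
    rng.normal(0, 1, 120),
    rng.integers(0, 2, 120),
])
y = (0.8 * X[:, 0] - 0.5 * X[:, 1] + 1.2 * X[:, 2] + rng.normal(0, 1, 120) > 0.5).astype(int)

model = LogisticRegression().fit(X, y)
posterior_c1 = model.predict_proba(X)[:, 1]            # estimated P(C1 | x)
allocation = np.where(posterior_c1 > 0.5, "C1", "C2")  # allocate to C1 if posterior > 0.5

# Resubstitution classification table (rows: true class, columns: predicted class).
for true_cls in (1, 0):
    row = [int(np.sum((y == true_cls) & (allocation == pred))) for pred in ("C1", "C2")]
    print("true class", true_cls, row)
```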
Generally, it can be expected that the inclusion of clinically relevant features will improve the performance of a prediction rule. The problem of discriminant analysis for mixed data has been treated in particular by Wernecke (21), who gives a powerful algorithm for coupling the results of various discriminators, each adequate for the different scaling of the data. Practical examples from medicine demonstrate the improvement of discrimination obtained by using the full dataset of both quantitative and categorical features.

7 FURTHER METHODS OF DISCRIMINATION

Beside the methods of discrimination mentioned, a number of further procedures for discrimination exist. A short overview is provided of some methods of particular interest for medical applications. In many medical applications, the data are categorically scaled. For such data, a special discrimination with categorical features can be applied. Instead of the original realizations of the given features x1, ..., xp, one here regards the frequencies h1(x1, ..., xp), ..., hK(x1, ..., xp) with which the feature combinations appear in the classes C1, ..., CK (where for every (x1, ..., xp) all combinations of category levels have to be considered). Such frequencies are usually arranged in contingency tables. As an example, consider a study with patients suffering from migraine. Table 10 shows a corresponding contingency table with six cells for two features (p = 2) with two and three categories, respectively, for class C1.
Table 10. Two-Dimensional Contingency Table for Class C1 with p = 2 Variables

                           variable x2 (headache)
variable x1 (nausea)       cat. 1 (low)           cat. 2 (moderate)       cat. 3 (strong)
cat. 1 (low)               h1(x11, x21) = 16      h1(x11, x22) = 28       h1(x11, x23) = 22
cat. 2 (strong)            h1(x12, x21) = 10      h1(x12, x22) = 24       h1(x12, x23) = 16
Accordingly, there are 16 patients with low headache and low nausea complaints, 28 patients with moderate headache and low nausea complaints, and 22 patients with strong headache and low nausea complaints; overall, 66 patients have low nausea, 38 patients have strong headache complaints, and so on, and altogether there are n1 = 116 patients in class C1. After estimating the unknown cell probabilities pi(x1, ..., xp) with the help of the corresponding frequencies hi(x1, ..., xp) [e.g., by using p̂i(x1, ..., xp) = hi(x1, ..., xp)/ni, ni = Σ hi(x1, ..., xp), the so-called actuarial model (2)], a patient with observation vector x will be assigned to the class Cm for which
$$R_C(x) = m, \quad \text{if } \pi_m \hat{p}_m(x) = \max_i \pi_i \hat{p}_i(x);$$
[Bayes optimal for known π i , pi (x)—Linhart (22)]. Friedman (23) proposed a regularized discriminant analysis as a compromise between normal-based linear and quadratic discriminant analysis. In this approach, two parameters are introduced that control shrinkage of ˆ i (i = 1, . . . , the heteroscedastic estimations ˆ K) toward a common estimate and shrinkage toward a series of prespecified covariance matrices. Nearest neighbor classifiers are based on a pairwise distance function dj (x) = d(x,xj ) [known from cluster analysis; see, for example, Everitt (24)] between an observation x with unknown class membership and some vector of observations xj from the given training sample. Denoting the sorted distances by d(1) ( x) ≤ d(2) ( x) ≤ · · · ≤ d(n) ( x), class posterior probability estimates are obtained for class Ci as the fraction of class Ci observations among the k nearest neighbors to x (25)
$$\hat{P}(C_i \mid x) = \frac{1}{k} \sum_{j=1}^{n} \mathbf{1}\left[d_j(x) \le d_{(k)}(x)\right] \quad \text{for all } x_j \in C_i,$$

and x is allocated to the class Cm with P̂(Cm | x) = max_i P̂(Ci | x); i = 1, ..., K. The number k of nearest neighbors is often taken as k = 1. Another approach is a cross-validated selection, where for given k = 1, 3, 5, ... a leave-one-out error estimation (see Section 2.1) is applied to the training sample and the k with the minimal error rate is chosen. A discriminant analysis for longitudinal data (repeated measurements) has been introduced by Tomasko et al. (26) and extended by Wernecke et al. (27). The procedure is a modification of linear discriminant analysis using the mixed-model MANOVA for the estimation of fixed effects and for a determination of various structures of the covariance matrices, including unstructured, compound symmetry, and autoregressive of order 1. Among the nonparametric statistical classification methods, Classification and Regression Trees (CART) play an important role, especially in medical applications (28). Given a set of risk factors x1, ..., xp that influence a response variable y, the data are split step by step into subgroups, which should be internally as homogeneous and externally as heterogeneous as possible, as measured by a function F(y|xl); l = 1, ..., p (for dichotomous variables, the function F(y|xl) will be the maximal measure of dependence between y and xl from a χ²-statistic). Every split corresponds to a so-called node, and every observation will be localized at exactly one terminal node in the end. The obtained tree defines a classification rule by assigning every terminal node to a class Ci by comparing the risks at the beginning and at the terminal nodes (29).
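The nearest-neighbor posterior estimate just described can be sketched in plain NumPy as follows (synthetic data; an illustration only): the k smallest distances to x are found, and P̂(Ci | x) is the fraction of those neighbors belonging to class Ci.

```python
import numpy as np

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 1.0, (30, 3)), rng.normal(1.2, 1.0, (30, 3))])
cls = np.array([1] * 30 + [2] * 30)

def knn_posteriors(x, k=5):
    d = np.linalg.norm(X - x, axis=1)              # d_j(x) = d(x, x_j)
    neighbors = cls[np.argsort(d)[:k]]             # classes of the k nearest training points
    return {i: float(np.mean(neighbors == i)) for i in (1, 2)}   # P_hat(C_i | x)

post = knn_posteriors(np.array([0.5, 0.5, 0.5]), k=5)
print(post, "-> allocate to class", max(post, key=post.get))
```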
Support vector machines were introduced by Vapnik (30) and refer to K = 2 classes. The discrimination is defined by the ‘‘best’’ hyperplane separating the two classes, where ‘‘best’’ means that hyperplane that minimizes the sum of distances from the hyperplane to the closest correctly classified observations while penalizing for the number of misclassifications (25).
8 JUDGING THE RELIABILITY OF THE ESTIMATED POSTERIOR PROBABILITY

In the previous sections, the author dealt with discriminant rules on the basis of posterior probabilities P(patient ∈ Ci | x) = P(Ci | x) = πi fi(x)/[Σk πk fk(x)] (i = 1, ..., K), which could be estimated from a training sample {x} = {x1, ..., xn} in different ways (n = Σi ni, xj = (xj1, ..., xjp), j = 1, ..., n). The overall performance of the discrimination rules described can generally be judged by an appropriate estimation of the error rate (see Section 2.1). Nevertheless, one problem is left and should at least be mentioned: how can one judge the reliability of an allocation on the basis of the estimated posteriors, conditional on the observed patient x? As emphasized by McLachlan (11), even in the case of low error rates there may still be patients about whom there is great uncertainty as to their class of origin. Conversely, for high error rates it may still be possible to allocate some patients with great certainty. The error rate (as an average criterion) addresses the global performance of the decision rule and judges the quality of the allocation of some chosen patient with unknown class membership. Mainly in clinical medicine, it is often more appropriate to proceed conditionally on the observed individual patient x with its particular realizations of the variables x1, ..., xp. One way to assess the mentioned reliability is the calculation of standard errors and also interval estimates (95% confidence intervals) for the point estimates P̂(Ci | x), conditional on x. For reasons of brevity, corresponding results are not presented here; the reader is referred to the literature [see, among others, McLachlan (11)].
9 SOFTWARE PACKAGES FOR DISCRIMINATION
Software packages for discrimination are numerous. Disregarding the permanent changes of program packages and the development of new software, some remarks may be worthwhile, concentrating on the most important statistical packages. R. Goldstein (31) gives an excellent overview of commercial software in biostatistics generally. Judgements on classification software (discrimination, clustering, and neural networks) were given in some detail by D. Michie et al. (3). In this critical overview, different program packages are described and recommendations for their use are given. For applications in clinical trials, not so many discrimination methods are worth considering. The applications of discriminant analysis in medical studies are rather often confined to the methods described in Sections 4 and 6, respectively, including error estimations (Section 2.1) and feature selection (Section 4.2) (unfortunately usually not validated). Therefore, it seems sufficient to refer to the well-known commercial software packages SAS (SAS Institute, Inc., Cary, NC, USA) and SPSS (SPSS, Inc., Chicago, IL, USA). SPSS offers both linear and quadratic discrimination as well as logistic discriminant analysis with the mentioned π-error estimations (not for logistic regression) and stepwise feature selection (not validated). Moreover, various diagnostics, such as tests for equality of the covariance matrices Σi, can be established. SAS provides a broader variety of discrimination methods, including linear and quadratic discriminant analysis, logistic discrimination, nonparametric kernel discrimination, and nearest-neighbor methods. Error estimations, (partly) according to Section 2.1, and stepwise feature selection (not validated) are also implemented. Users with ambitions to develop certain routines for themselves may be referred to S-PLUS 2000 (Data Analysis Products Division, MathSoft, Seattle, WA, USA). The
above-mentioned examples have been calculated mostly using SAS and SPSS, respectively, together with special routines (such as cross-validated feature selection or discrimination for categorical features) developed in S-PLUS (available in the public domain on request).
REFERENCES

1. H-J. Deichsel and G. Trampisch, Clusteranalyse und Diskriminanzanalyse. Stuttgart: Gustav Fischer Verlag, 1985.
2. K-D. Wernecke, Angewandte Statistik für die Praxis. Bonn, Germany: Addison-Wesley, 1996.
3. D. Michie, D. J. Spiegelhalter, and C. C. Taylor, Machine Learning, Neural and Statistical Classification. Chichester: Ellis Horwood, 1994.
4. B. D. Ripley, Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press, 2004.
5. D. F. Morrison, Multivariate analysis of variance. In: P. Armitage and T. Colton (eds.), Encyclopedia of Biostatistics. New York: John Wiley and Sons, 1998, pp. 2820–2825.
6. R. A. Fisher, The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936; 7: 179–188.
7. P. A. Lachenbruch, Discriminant Analysis. New York: Hafner Press, 1975.
8. C. Ambroise and G. J. McLachlan, Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. 2002; 99(10): 6562–6566.
9. P. A. Lachenbruch and M. R. Mickey, Estimation of error rates in discriminant analysis. Technometrics 1968; 10: 1–11.
10. G. T. Toussaint, Bibliography on estimation of misclassification. IEEE Trans. Inform. Theory 1974; 20: 472.
11. G. J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition. New York: John Wiley and Sons, 1992.
12. B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia: SIAM, 1982.
13. T. Bayes, An essay towards solving a problem in the doctrine of chance. Communicated by Mr. Price, in a letter to John Canton, A.M., F.R.S., December 1763.
14. P. A. Lachenbruch, Discriminant analysis. In: S. Kotz and N. L. Johnson (eds.), Encyclopedia of Statistical Science. New York: John Wiley and Sons, 1982, pp. 389–397.
15. J. Hermans, J. D. F. Habbema, T. K. D. Kasanmoentalib, and J. W. Raatgever, Manual for the alloc80 discriminant analysis program. The Netherlands: Department of Medical Statistics, University of Leiden, 1982.
16. B. Flury, A First Course in Multivariate Statistics. New York: Springer, 1997.
17. J. Läuter, Stabile Multivariate Verfahren (Diskriminanzanalyse, Regressionsanalyse, Faktoranalyse). Berlin: Akademie Verlag, 1992.
18. S. Marks and O. J. Dunn, Discriminant functions when covariance matrices are unequal. JASA 1974; 69: 555–559.
19. S. H. Heywang-Köbrunner, U. Bick, W. G. Bradley, B. Boné, J. Casselman, A. Coulthard, U. Fischer, M. Müller-Schimpfle, H. Oellinger, R. Patt, J. Teubner, M. Friedrich, G. Newstead, R. Holland, A. Schauer, E. A. Sickles, L. Tabar, J. Waisman, and K-D. Wernecke, International investigation of breast MRI: results of a multicentre study (11 sites) concerning diagnostic parameters for contrast-enhanced MRI based on 519 histopathologically correlated lesions. Eur. Radiol. 2001; 11: 531–546.
20. J. A. Anderson, Logistic discrimination. In: P. R. Krishnaiah and L. Kanal (eds.), Handbook of Statistics, 2. Amsterdam: North Holland, 1982, pp. 169–191.
21. K-D. Wernecke, A coupling procedure for the discrimination of mixed data. Biometrics 1992; 48(2): 497–506.
22. H. Linhart, Techniques for discriminant analysis with discrete variables. Metrica 1959; 2: 138–149.
23. J. H. Friedman, Regularized discriminant analysis. J. Amer. Statist. Assoc. 1989; 84: 165–175.
24. B. S. Everitt, Cluster Analysis. London: Halstead Press, 1980.
25. S. Dudoit and J. Fridlyand, Classification in microarray experiments. In: T. Speed (ed.), Statistical Analysis of Gene Expression Microarray Data. Boca Raton, FL: Chapman and Hall/CRC, 2003.
26. L. Tomasko, R. W. Helms, and S. M. Snapinn, A discriminant analysis extension to mixed models. Stat. Med. 1999; 18: 1249–1260.
27. K-D. Wernecke, G. Kalb, T. Schink, and B. Wegner, A mixed model approach to discriminant analysis with longitudinal data. Biom. J. 2004; 46(2): 246–254.
28. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Monterey, CA: Wadsworth, 1984.
29. K-D. Wernecke, K. Possinger, G. Kalb, and J. Stein, Validating classification trees. Biom. J. 1998; 40(8): 993–1005.
30. V. Vapnik, Statistical Learning Theory. New York: Wiley and Sons, 1998.
31. R. Goldstein, Software, biostatistical. In: P. Armitage and T. Colton (eds.), Encyclopedia of Biostatistics. New York: John Wiley and Sons, 1998, pp. 2820–2825.
DISEASE TRIALS FOR DENTAL DRUG PRODUCTS
MAX GOODSON
RALPH KENT
The Forsyth Institute—Clinical Research, Boston, Massachusetts

Disease trials for dental drug products do not greatly differ from disease trials for medical drug products and should adhere to standard principles for the design, conduct, and evaluation of clinical trials (1). Oral research is, however, associated with certain important simplifications and some complications. One simplifying factor is that dentistry is principally concerned with treatment of two diseases: periodontal diseases and dental caries. In addition, however, a surprising number of dental drug products are designed for prevention, improved esthetics, and reduced pain. A principal complication is the multiplicity created by teeth. A unique attribute of statistical development in oral research is the 30-year presence of a statistical "think tank," the Task Force on Design and Analysis (2). This organization has been the source of many American Dental Association Guidelines and of professional input to the FDA, and it has sponsored meetings and conferences to discuss methodological issues in oral research. Throughout this era, it has provided a forum for debate on substantial statistical issues connected with evaluation of drugs used in treatment of oral diseases.

1 COMMON FEATURES OF CLINICAL TRIALS FOR TESTING DENTAL DRUG PRODUCTS

1.1 Data Summary

In oral disease research, evaluations are made for numerous teeth, tooth surfaces, or sites for each subject. Most commonly, observations are quantitative, dichotomous, or, as in the case of widely used indices for plaque and gingivitis, ordinal. Summary values for each subject are obtained in various ways, such as means (e.g., mean pocket depth) or percents or counts of sites or teeth that exhibit an attribute (e.g., # of decayed, missing, or filled teeth, or % sites bleeding on probing). It has become accepted to average ordinal scores from gingival and plaque indices. Statistical properties of such measures have been examined by Sullivan and D'Agostino (3). These can also be examined using the percent of sites, teeth, or surfaces exceeding a specified index level. Summaries over subsets of teeth or sites of particular interest, such as pockets harboring specific bacterial species at baseline or tooth surfaces diagnosed with "white spots," can provide more efficient and focused comparisons. In clinical trials, summary measures for each subject for primary outcomes of interest should be the basic unit of analysis for comparisons across treatment groups, not individual sites or teeth (4). This having been said, great interest often develops in understanding the dynamics of disease and treatment effects at specific sites or tooth surfaces. In such situations, it is critical that analyses of site-specific associations recognize and take into account that individual teeth and sites within subjects are not statistically independent units. Statistical methods for analyzing site-specific association in the presence of intrasubject dependency are based on two general approaches. The first approach considers each subject as a stratum, computes a site-specific measure of association for each subject, evaluates the homogeneity of the individual measures of association across subjects, and, if appropriate, generates a summary measure of association. Examples are the use of Mantel–Haenszel procedures for categorical outcomes (5) or the incorporation of subject effect as a fixed-effect indicator variable in a regression analysis of site-specific covariables (6). The second approach uses statistical methods explicitly incorporating intrasubject correlations among sites into the statistical model. As an example, generalized estimating equation procedures, described with applications to multiple-site data in Reference 7, can accommodate logistic and standard multiple regression models, subject and site level covariates, unequal numbers of sites per subject, explicitly specified or empirically estimated correlation structures, and model-based or robust
estimates of model parameters and standard errors.

1.2 Training and Calibration

The subjectivity and variability of many assessments make it critical to use examiner training and calibration (8,9), assignment of specific examiners to follow each subject throughout the study, blinding of examiners, and balanced assignment of examiners across experimental groups.

1.3 Adjustment for Baseline Severity

Baseline levels of outcome measures for a subject are frequently highly correlated with subsequent posttreatment observations. Stratification or matching by baseline level, evaluation of change from baseline, and analysis of covariance are commonly used to reduce the effects of baseline variation across subjects for primary outcome measures. For example, the variability of an assay is largely reduced by selecting periodontal sites in a relatively narrow baseline pocket depth range (10).

2 TESTING DRUGS IN ORAL HYGIENE STUDIES

Dental plaque is the soft, cheesy white material that accumulates on teeth in the absence of oral hygiene. It is composed of bacterial cells in a polysaccharide polymer created by the bacteria. It is considered by most individuals to be the precursor for most oral diseases. Among the more common clinical trials conducted to test dental products are those associated with toothpastes, mouth rinses, and toothbrushes used for suppression of dental plaque formation or removal. From the FDA viewpoint, dental plaque is not considered to be a disease but a cosmetic issue, so these products are largely unregulated in this area. Studies of dental plaque removal or suppression depend largely on clinical indices designed to measure the extent or magnitude of plaque accumulation. The Turesky modification of the Quigley–Hein Index for plaque (11) is the most commonly used plaque index. By this method, stained plaque tooth coverage is
evaluated by assigning a value between 0 (none) and 5 (≥ 2/3 coverage) to each tooth surface evaluated. When plaque becomes calcified, the hard, adherent mass is called dental calculus or tartar. Suppression of formation or removal of calculus is also most often measured by indices. In this case, the Volpe–Manhold index (12) is the index most commonly used. By this method, calculus accretion is evaluated by measuring the extent of tooth coverage at three locations on the tooth using a periodontal probe.
3 TESTING DRUGS IN GINGIVITIS STUDIES

The most common periodontal disease is gingivitis, a mild inflammation of the periodontal tissues with no evidence of connective tissue destruction (13). This condition exists at some time in virtually everyone's mouth (14). Many outcome variables have been defined for measurement of drugs used to treat gingivitis (15). All outcome variables depend on evaluating either gingival redness and/or gingival bleeding. Drugs most commonly tested for treatment of gingivitis are antibacterial mouth rinses and toothpastes. It is generally difficult, if not impossible, to demonstrate product efficacy in treatment of gingivitis by starting with patients having natural accumulations of plaque or tartar (calcified plaque, calculus) (16). For this reason, most studies are initiated with a thorough tooth cleaning, and efficacy is demonstrated by the ability of a test product to reduce the rate of reaccumulation of plaque or tartar (17). The standard trial for treatment of gingivitis uses the protocol recommended by the American Dental Association Acceptance program (18) and evaluates the gingival index (19) as the primary outcome variable. There are issues concerning use of qualitative ordinal indices as outcome variables. Difficulty in interpreting mean index values and the underlying assumptions of combining observations such as redness and bleeding into a single index have led some researchers to consider statistical analysis of dichotomized index values (20).
4 TESTING DRUGS IN PERIODONTITIS STUDIES

Several periodontal diseases are currently differentiated by their clinical characteristics. Periodontitis is a complicated, multifactorial, chronic inflammatory disease condition that involves specific bacterial pathogens, host responses, and biochemical mediators that create inflammation in the tissues that support the teeth and may result in their ultimate destruction (21). The critical difference between periodontitis and gingivitis is that loss of supporting bone and periodontal ligament occurs in the former but not the latter. This loss of alveolar bone may be seen using radiographs, but because of the biohazard of gamma radiation, Institutional Review Boards will seldom approve radiography as a clinical research measurement tool. For this reason, periodontal disease is most often evaluated indirectly by periodontal probing. Periodontal probing is used as a clinical measure of periodontal disease progression. Pocket depth (Fig. 1) is the distance from the free margin of the gingiva to the depth of the pocket formed by the disease process. In normal healthy individuals, a periodontal sulcus measures 2–3 mm in depth. In the early stages of periodontitis, a periodontal pocket forms adjacent to the tooth and can progressively deepen up to the length of the tooth root (12–15 mm). Periodontal pocket depth (PD), a continuous measure, is the most common outcome
variable used to evaluate periodontal disease. The periodontal probe is a cylindrical, thin (usually 0.5 mm diameter), hand-held measurement tool calibrated in millimeters and designed to be placed in the periodontal pocket. Skilled clinicians can measure pocket depth paired differences with a standard deviation of 0.6 mm. Several variants of this measurement are also used, notably attachment level (AL or, more precisely, clinical attachment level), which estimates the distance from a fixed anatomical tooth landmark (the cementoenamel junction) to the base of the periodontal pocket (Fig. 1). This measure is generally considered to be the best indirect estimate of bone loss (22) because it corrects for gingival swelling (hypertrophy) or shrinkage (recession). Because AL is the difference between the probe measurements of PD and recession (B) (23), its reproducibility is lower than that of PD (skilled clinicians can measure AL paired differences with a standard deviation of 0.8 mm). As one would expect, AL is highly correlated with PD. Drugs most commonly tested to treat periodontitis are antibacterial agents or antibiotics, occasionally anti-inflammatory agents. Subject selection should include cognizance of possible confounding with age, smoking (24), obesity, and diabetes (25). Proposed guidelines for periodontal therapy trials have been published (26).
Figure 1. Periodontal probe measurements and anatomical landmarks. Pocket depth (PD) is a single measurement most commonly recorded in clinical practice. The anatomically adjusted measurement of attachment level (AL) requires two measurements, PD and recession (B), and is associated with increased variability. The value of B carries a positive sign for sites with gingival hypertrophy and a negative sign for sites with gingival recession.
4.1 Statistical Issues in Periodontitis Studies

The appropriate unit of statistical analysis has been the subject of considerable controversy. The concept of periodontal sites was introduced into the literature as an analytical feature defining 4 to 6 locations around the tooth that were identified by anatomical landmarks and evaluated longitudinally to preserve information concerning the local site-specific changes in PD, AL, and bacteria potentially related to disease (27). Subsequent microbiological evaluations indicated that individual sites within the same mouth can have a completely different microbial ecology (28). Because bacterial diseases are generally defined by the infecting agent(s), the hypothesis was advanced that different periodontal diseases may coexist in the same mouth and that it would be inappropriate to pool periodontal sites simply because they had the same clinical signs of periodontitis. Some investigators interpreted these fundamental observations related to defining the pathogenic process as an effort to justify summarizing statistical data on periodontal sites and thereby inflating statistical significance (each subject evaluated at 6 sites/tooth may contribute as many as 192 sites). The choice of the unit of statistical analysis depends on the questions being asked. Most often, the subject is the appropriate unit of statistical summary. If sites, tooth surfaces, teeth, or quadrants within subjects are to be used as the unit of analysis, however, then it is important to use procedures, such as described in Section 1.1, appropriate to the evaluation of correlated observations. The measurement of numerous bacterial species, host factors, clinical parameters, and genetic markers has created serious issues of multiple testing. For limited numbers of variables, standard multiple comparisons and multiple testing procedures can be used to control the Type I or "false positive" error rate. Loss of statistical power, however, can be severe for large numbers of variables. The application of groupings derived from data mining procedures such as cluster analysis is an alternate approach that has been useful in microbiological studies (29). For example, a clinical trial using "red complex" bacteria derived from cluster analysis as a primary
outcome variable successfully evaluated the antibacterial response of a locally applied antibiotic in the presence of conventional therapy (30). Because a high level of symmetry is found in the mouth, many investigators have suggested that test and control therapies in some instances can be conducted in a single subject with an assumed reduction in response variability. Potential problems introduced by these designs are considerable (31,32). Because all teeth are nourished by the same vascular system, one can never be certain that treatment effects are truly isolated. Evidence suggests that even toothcleaning on one side of the mouth may affect the other side (33). Carry-across effects can reduce differences between treatment and control responses. For example, statistically significant differences observed in single-site local antibiotic therapy (34) could not be repeated when multiple diseased sites were treated, presumably because antibiotic from multiple locally treated sites in one quadrant affected the response of control sites in other quadrants (35). Although it is true that the most likely error in split mouth designs is an underestimation of the therapeutic effectiveness of the test product, the magnitude of the consequences may not be acceptable. Without question, parallel designs are more dependable in evaluating therapeutic responses in periodontal disease therapy. 5 TESTING DRUGS FOR TREATMENT OF DENTAL CARIES Dental caries or dental decay occurs in more than 90% of our population (36). Although much may be said about dental caries as a disease process, from the clinical testing point of view, it can be viewed as the result of bacteria that partially ferment carbohydrates and produce organic acids that dissolve the calcium hydroxyapatite from teeth to form cavities. The disease process starts with a ‘‘white spot’’ that represents decalcification of the tooth enamel and proceeds to a cavity that if not treated can result in tooth loss. Outcome variables for the measurement of dental caries are discrete measurement variables that represent the number of holes in
teeth, white spots in teeth, missing teeth, filled teeth, and so forth. The classic measure of decay in permanent teeth is DMFT (37): the number of decayed, missing, or filled permanent teeth (or, DMFS for surfaces). Criteria for evaluating dental caries using visual criteria or detection with a sharp probe vary greatly between investigators. A review of criteria that have been used to score dental decay found 29 unique systems (38). The NIDCR criteria (39) have been used for largescale epidemiologic surveys. The widespread acceptance of fluoride toothpaste as an effective decay prevention product often demands that trials be conducted using noninferiority designs (40). Guidelines for the conduct of clinical trials for agents used to treat dental caries have been published (41) and are widely accepted as a standard. Characteristically, clinical trials for prevention of dental caries measure net conversion from sound to cavitated tooth surfaces or teeth over an appropriate time interval (incidence). Reliance on classification rather than measurement results in inefficient clinical trials that require many subjects. In addition, these methods are not easily adapted to measuring treatments designed to reverse demineralization of tooth surfaces. Newer methods have been developed to address this problem (42), analytic methods have been described that can be adapted to this measurement system (43), and clinical trials using these methods have been conducted (36). At this time, it is fair to say that these methods will require more research before being generally accepted as dependable. 6 TESTING DRUGS FOR LOCAL ANESTHETICS AND ANALGESICS By far, the most commonly used drugs in dental therapy are local anesthetics and analgesics. The standard for analgesic efficacy, the third molar extraction model, has become a standard used in medicine as well. Third molar extraction as a standard pain stimulus is generally evaluated using a visual analog scale (VAS). This scale is a 10-cm line on which the subject marks a perceived pain level between the two extremes of no pain
and worst pain. Mean centimeter differences between treatments and controls are then evaluated (44). The onset and duration of local anesthesia is often evaluated by electrical pulp testing (45). Guidelines for local anesthetic testing have been published (46). 7 TESTING DRUGS FOR TOOTH WHITENING Tooth whitening has become an extremely popular product area. Research in this area is particularly demanding because of the intensely subjective nature of the response. The most common clinical measure is visual comparison by a professional examiner with a tooth shade guide. Shade guides are not linear and contain different levels of several pigments. Results using this approach are somewhat arbitrary and highly subjective. One may also solicit a response rating from the subject but this rating can be associated with a high placebo response. Objective measurement by reflectance spectrophotometers (chromameters) or photographic equivalents calibrated in the LAB color domain have also been used. These devices produce results that are somewhat insensitive and often difficult to interpret. Published guidelines for testing tooth whitening products (47) suggest that studies be conducted using both clinical and quantitative measures. Studies that find concordance between objective and clinical evaluations tend to provide the most convincing evidence from clinical testing (48). REFERENCES 1. R. B. D’Agostino, and J. M. Massaro, New developments in medical clinical trials. J. Dent. Res. 2004; 83 (Spec No C): C18–24. 2. A. Kingman, P. B. Imrey, B. L. Pihlstrom, and S. O. Zimmerman, Chilton, Fertig, Fleiss, and the Task Force on Design and Analysis in Dental and Oral Research. J. Dent. Res. 1997; 76: 1239–1243. 3. L. M. Sullivan, and R. B. D’Agostino, Robustness of the t test applied to data distorted from normality by floor effects. J. Dent. Res. 1992; 71: 1938–1943. 4. J. L. Fleiss, S. Wallenstein, N. W. Chilton, and J. M. Goodson, A re-examination of withinmouth correlations of attachment level and of
change in attachment level. J. Clin. Periodontol. 1988; 15: 411–414.
and gingivitis. J. Am. Dent. Assoc. 1986; 112: 529–532.
5. P. P. Hujoel, W. J. Loesche, and T. A. DeRouen, Assessment of relationships between site-specific variables. J. Periodontol. 1990; 61: 368–372.
19. H. Loe and J Silness, Periodontal disease in pregnancy. I. Prevalence and severity. Acta Odontol. Scand. 1963; 21: 533–551.
6. T. A. DeRouen, Statistical Models for Assessing Risk of Periodontal Disease. In: J. D. Bader (ed.), Risk Assessment in Dentistry. Chapel Hill, NC: University of North Carolina Dental Ecology, 1990. 7. T. A. DeRouen, L. Mancl, and P. Hujoel, Measurement of associations in periodontal diseases using statistical methods for dependent data. J. Periodontal. Res. 1991; 26: 218–229.
20. T. M. Marthaler, Discussion: Current status of indices of plaque. J. Clin. Periodontol. 1986; 13: 379–380. 21. M. A. Listgarten, Pathogenesis of periodontitis. J. Clin. Periodontol. 1986; 13: 418–430. 22. G. P. Kelly, R. J. Cain, J. W. Knowles, R. R. Nissle, F. G. Burgett, R. A. Shick, et al., Radiographs in clinical periodontal trials. J. Periodontol. 1975; 46: 381–386.
8. A. M. Polson, The research team, calibration, and quality assurance in clinical trials in periodontics. Ann. Periodontol. 1997; 2: 75–82.
23. S. P. Ramfjord, J. W. Knowles, R. R. Nissle, F. G. Burgett, and R. A. Shick, Results following three modalities of periodontal therapy. J. Periodontol. 1975; 46: 522–526.
9. E. G. Hill, E. H. Slate, R. E. Wiegand, S. G. Grossi, and C. F. Salinas, Study design for calibration of clinical examiners measuring periodontal parameters. J. Periodontol. 2006; 77: 1129–1141.
24. J. Haber, J. Wattles, M. Crowley, R. Mandell, K. Joshipura, and R. L. Kent, Evidence for cigarette smoking as a major risk factor for periodontitis. J. Periodontol. 1993; 64: 16–23.
10. J. M. Goodson, M. A. Cugini, R. L. Kent, G. C. Armitage, C. M. Cobb, D. Fine, et al., Multicenter evaluation of tetracycline fiber therapy: I. Experimental design, methods, and baseline data. J. Periodontal. Res. 1991; 26: 361–370. 11. S. L. Fischman, Current status of indices of plaque. J. Clin. Periodontol. 1986; 13: 371–380. 12. A. R. Volpe, J. H. Manhold, and S. P. Hazen, In Vivo Calculus Assessment. I. A method and its examiner reproducibility. J. Periodontol. 1965; 36: 292–298. 13. R. C. Page, Gingivitis. J. Clin. Periodontol. 1986; 13: 345–359. 14. J. W. Stamm, Epidemiology of gingivitis. J. Clin. Periodontol. 1986; 13: 360–370. 15. S. G. Ciancio, Current status of indices of gingivitis. J. Clin. Periodontol. 1986; 13: 375–378. 16. E. F. Corbet, J. O. Tam, K. Y. Zee, M. C. Wong, E. C. Lo, A. W. Mombelli, et al., Therapeutic effects of supervised chlorhexidine mouthrinses on untreated gingivitis. Oral Dis. 1997; 3: 9–18. 17. J. C. Gunsolley, A meta-analysis of sixmonth studies of antiplaque and antigingivitis agents. J. Am. Dent. Assoc. 2006; 137: 1649–1657. 18. Council on Dental Therapeutics, Guidelines for acceptance of chemotherapeutic products for the control of supragingival dental plaque
25. R. J. Genco, S. G. Grossi, A. Ho, F. Nishimura, and Y. Murayama, A proposed model linking inflammation to obesity, diabetes, and periodontal infections. J. Periodontol. 2005; 76: 2075–2084. 26. P. B. Imrey, N. W. Chilton, B. L. Pihlstrom, H. M. Proskin, A. Kingman, M. A. Listgarten, et al., Proposed guidelines for American Dental Association acceptance of products for professional, non-surgical treatment of adult periodontitis. Task Force on Design and Analysis in Dental and Oral Research. J. Periodontal. Res. 1994; 29: 348–360. 27. J. M. Goodson, A. C. Tanner, A. D. Haffajee, G. C. Sornberger, and S. S. Socransky, Patterns of progression and regression of advanced destructive periodontal disease. J. Clin. Periodontol. 1982; 9: 472–481. 28. S. S. Socransky, A. C. Tanner, J. M. Goodson, A. D. Haffajee, C. B. Walker, J. L. Ebersole, et al., An approach to the definition of periodontal disease syndromes by cluster analysis. J. Clin. Periodontol. 1982; 9: 460–471. 29. S. S. Socransky, A. D. Haffajee, M. A. Cugini, C. Smith, and R. L. Kent, Jr., Microbial complexes in subgingival plaque. J. Clin. Periodontol. 1998; 25: 134–144. 30. J. M. Goodson, J. C. Gunsolley, S. G. Grossi, P. S. Bland, J. Otomo-Corgel, F. Doherty, et al., Minocycline HCl microspheres reduce red-complex bacteria in periodontal disease therapy. J. Periodontol. 2007; 78: 1568–1579.
31. P. B. Imrey, Considerations in the statistical analysis of clinical trials in periodontitis. J. Clin. Periodontol. 1986; 13: 517–532. 32. P. P. Hujoel and T. A. DeRouen, Validity issues in split-mouth trials. J. Clin. Periodontol. 1992; 19: 625–627. 33. A. P. Pawlowski, A. Chen, B. M. Hacker, L. A. Mancl, R. C. Page, and F. A. Roberts, Clinical effects of scaling and root planing on untreated teeth. J. Clin. Periodontol. 2005; 32: 21–28. 34. J. M. Goodson, M. A. Cugini, R. L. Kent, G. C. Armitage, C. M. Cobb, D. Fine, et al., Multicenter evaluation of tetracycline fiber therapy: II. Clinical response. J. Periodontal. Res. 1991; 26: 371–379. 35. C. L. Drisko, C. M. Cobb, W. J. Killoy, B. S. Michalowicz, B. L. Pihlstrom, R. A. Lowenguth, et al., Evaluation of periodontal treatments using controlled-release tetracycline fibers: clinical response. J. Periodontol. 1995; 66: 692–699. 36. J. D. Bader, D. A. Shugars, and A. J. Bonito, Systematic reviews of selected dental caries diagnostic and management methods. J. Dent. Educ. 2001; 53: 960–968. 37. H. Klein, C. E. Palmer, and J. W. Knutson, Studies on dental caries. Public Health Reports 1938; 53: 751–765. 38. A. I. Ismail, Visual and visuo-tactile detection of dental caries. J. Dent. Res. 2004; 83 Spec No C: C56–C66. 39. Oral Health Surveys of the National Institute of Dental Research. Diagnostic Criteria and Procedures. NIH Publication No. 91-2870, 1991; 1–99. 40. R. B. D’Agostino, Sr., J. M. Massaro, and L. M. Sullivan, Non-inferiority trials: design concepts and issues—the encounters of academic consultants in statistics. Stat. Med. 2003; 22: 169–186. 41. Guidelines for the acceptance of fluoride-containing dentifrices. Council on Dental Therapeutics. J. Am. Dent. Assoc. 1985; 110: 545–547. 42. A. Hall and J. M. Girkin, A review of potential new diagnostic modalities for caries lesions. J. Dent. Res. 2004; 83 Spec No C: C89–C94. 43. P. B. Imrey and A. Kingman, Analysis of clinical trials involving non-cavitated caries lesions. J. Dent. Res. 2004; 83 Spec No C: C103–C108. 44. K. M. Hargreaves and K. Keiser, Development of new pain management strategies. J. Dent. Educ. 2002; 66: 113–121.
45. P. A. Moore, S. G. Boynes, E. V. Hersh, S. S. DeRossi, T. P. Sollecito, J. M. Goodson, et al., The anesthetic efficacy of 4 percent articaine 1:200,000 epinephrine: two controlled clinical trials. J. Am. Dent. Assoc. 2006; 137: 1572–1581. 46. Guideline on appropriate use of local anesthesia for pediatric dental patients. Pediatr. Dent. 2005; 27 (7 Suppl): 101–106. 47. Guidelines for the acceptance of peroxide-containing oral hygiene products. J. Am. Dent. Assoc. 1994; 125: 1140–1142. 48. M. Tavares, J. Stultz, M. Newman, V. Smith, R. Kent, E. Carpino, et al., Light augments tooth whitening with peroxide. J. Am. Dent. Assoc. 2003; 134: 167–175.
FURTHER READING J. L. Fleiss, The Design and Analysis of Clinical Experiments. New York: John Wiley & Sons, 1986.
CROSS-REFERENCES
Surrogate Endpoints
Data Mining
Generalized Estimating Equations
Analysis of Covariance (ANCOVA)
Multiple Comparisons
Paired T-Test
Two-sample T-Test
Nonparametrics
DISEASE TRIALS ON REPRODUCTIVE DISEASES

JULIA V. JOHNSON
University of Vermont College of Medicine, Burlington, Vermont

1 INTRODUCTION

Reproductive health issues are common for men and women, but clinical research in this area has been limited in study number and quality. The importance of these health issues is clear. To use infertility as an example of reproductive health issues, this disease affects 15% of women aged 15–44 years (1). Indeed, infertility is one of the most common medical disorders that affect reproductive-aged men and women. Although high-quality prospective, randomized controlled trials (RCTs) are required to determine the ideal methods for diagnosis and treatment of this common disease, the quality of research is historically poor (2). The limitation of federal funding for reproductive health issues has led to programs in the NICHD in the past 10 years (3). In the interim, most reproductive research has been funded by pharmaceutical companies and academic centers without set standards for these clinical trials. The limited research funding impairs scientific analysis of the available data, and it adversely affects providers' ability to care for patients. This article will review the limitations of current research in reproductive medicine and recommend the standards for researchers and clinicians to consider when reviewing the literature. Also, the current NICHD programs to optimize research in reproductive medicine will be discussed.

2 LIMITATIONS OF CURRENT STUDIES

As in all medical fields, evidence-based medicine is critical to allow effective decision making for providers. The quality of studies in the reproductive sciences has not been consistent, which limits providers' ability to make decisions with their patients and decreases the trust of the public in reproductive medicine and the science on which it relies. Fortunately, several initiatives have been undertaken to improve research in the area of reproductive health. Action by distinguished researchers has identified research standards and guided researchers in the development of translational and clinical research studies. Multiple issues have been identified that, when not addressed, impair reproductive health science: a well-defined objective; diagnostic accuracy of the disease; appropriate study population; appropriate sample size; appropriate sampling and data collection; specifically developed materials and methods; appropriate randomization of treatment groups; avoidance of selection bias; test reliability and reproducibility; and appropriate index test and/or reference standards. In addition, for infertility and treatment, potential bias can exist because of previous failed or successful treatment and the use of crossover trials. Even when the "gold standard" of clinical trials (the prospective RCT) is selected, serious limitations have been observed when examining treatment of infertility (4). To estimate the limitations of reproductive medicine studies, researchers tested the accuracy of studies published in two distinguished journals: Human Reproduction and Fertility and Sterility. Using the Standards for Reporting of Diagnostic Accuracy (STARD) checklist, which tests for 25 measures of diagnostic accuracy, Coppus et al. (5) assessed publications in 1999 and 2004. They found that less than one half of the studies reported at least 50% of the items, demonstrating poor reporting, with no improvement between 1999 and 2004. Unfortunately, the reporting of individual items varied widely between studies; no consistent problem with diagnostic accuracy was observed in the articles reviewed. Clearly, researchers need to consider these issues when developing their studies, and journals need to ask that all items of diagnostic accuracy be addressed in the publication.

Table 1. Use of CONSORT in Reproductive Medicine RCTs (% of studies)

                      1990   1996   2002
Randomized Trial       16%    40%    48%
Blinded Allocation     11%    20%    37%
Parallel Design        67%    77%    95%

Using the Consolidated Standards for Reporting Trials (CONSORT), some improvement has occurred in the past 12 years in the
reporting of RCTs published in 1990, 1996, and 2002 (4). Collecting studies from the Cochrane Menstrual Disorder and Subfertility Group (MDSG), the researcher found that ideal methods of treatment allocation (randomization), concealment of allocation (blinding), and study design (parallel design) improved (see Table 1). However, many studies did not reach the characteristics suggested by CONSORT. In addition, this study demonstrated that 40% of studies had errors in patient flow (number of subjects randomized vs. number analyzed). Unit-of-analysis errors decreased over the 12 years but continued to occur. As is commonly the case with reproductive research, the data on pregnancy outcome was still lacking. Although miscarriages were reported in 62% of studies in 2002, only 28% reported ectopic pregnancies and 44% reported multiple gestations. An excellent example of a serious study design flaw is the use of crossover trials to assess infertility treatment. As described by Khan et al. (6), crossover trials overestimate the effectiveness of treatment, compared to parallel trials, by an odds ratio of 74%. Although it is tempting to use crossover trials to the lower sample size, the outcome (pregnancy) prevents participation in the second arm of the trial. Fortunately, the crossover study design was recognized as flawed; this information presumably led to the increase in parallel trials from 1990 (67%) to 2002 (95%) (4). 3
IDEAL REPRODUCTIVE STUDY DESIGN
Excellent reviews have been written to summarize the ideal clinical study in reproductive medicine (7,8). The first consideration, as with any research project, is to identify the objective clearly and determine whether the study will be effective in testing the hypothesis. In reproductive medicine, the sample size
is critical to consider. Because the sample size significantly increases if the difference with treatment is small, multicenter trials may be required for adequate subject recruitment. The target population must also be considered carefully. In fertility trials, for example, successful treatment is based on multiple factors, which include women’s age, sperm quality, previous pregnancy or failed therapy, gynecologic disorders, and the multiple causes of infertility. The inclusion and exclusion criteria need to be established early in the design of the study to assure that the appropriate population can be recruited. Once the patient population is established, a randomized controlled trial requires an appropriate study design. The CONSORT statement should be reviewed prior to beginning the study, not just for publication of the trial. Studies demonstrate that the use of CONSORT improves the quality of studies (9). Appropriate randomization, performed immediately before treatment to minimize postrandomization withdrawals, must be blinded to the researcher and subject to prevent selection and ascertainment biases. Subjects are followed closely to assure no unintended influences, or cointervention, that may alter results. Once subjects are randomized, those who discontinue treatment or are lost to follow-up must be included in the intent-to-treat analysis of the data. Also, as noted in the example described above, parallel trials are necessary for RCTs examining the treatment of infertility. The ideal study has a fully considered objective with an appropriate sample size to test the hypothesis as well as a lack of bias through randomization of a parallel trial and an analysis of all subjects. Maximal possible blinding is advised, as possible, for subjects, investigators, and health care providers, as well as laboratory personnel.
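To make the sample-size point concrete, the sketch below applies the standard normal-approximation formula for comparing two proportions to a hypothetical parallel-group infertility trial. The pregnancy rates (15% control versus 25% treatment, and a smaller 3-point difference), alpha, and power are invented for illustration and are not taken from this article.

```python
# Hedged illustration: normal-approximation sample size per arm for a
# parallel-group trial with a binary pregnancy outcome. The rates below
# (15% vs. 25%, and 15% vs. 18%) are invented for this sketch.
from scipy.stats import norm

def n_per_arm(p_control: float, p_treatment: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Subjects per arm to detect p_treatment vs. p_control, two-sided test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return (z_alpha + z_beta) ** 2 * variance / (p_treatment - p_control) ** 2

print(round(n_per_arm(0.15, 0.25)))  # roughly 250 per arm for a 10-point difference
print(round(n_per_arm(0.15, 0.18)))  # roughly 2,400 per arm for a 3-point difference
```

Small absolute differences in pregnancy rates quickly push recruitment beyond what a single center can supply, which is the practical argument for the multicenter networks discussed in the next section.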
4 IMPROVING REPRODUCTIVE MEDICINE RESEARCH The importance of reproductive health issues and the demands of high-quality clinical research are recognized. The large sample size required for these studies has led to networks, such as the Reproductive Medicine Network (3). This network allows multicenter trials that involve eight centers and a datacoordinating center. Recently, the Reproductive Medicine Network completed a landmark study that demonstrated the most effective ovulation induction agent for women with polycystic ovarian syndrome (10). The unexpected results of this well-designed study altered medical care for women with this common disorder. Additional programs include the Specialized Cooperative Center Program in Reproduction Research (SCCPRR), which was developed to increase translational research in reproductive sciences. This program encourages collaboration with other centers within the institution and with outside entities. Four SCCPRR focus groups include endometrial function, ovarian physiology, male reproduction, and neuroendocrine function. The National Cooperative Program for Infertility Research (NCPIR) has two sites that involve patient-oriented research. Currently, the NCPIR is emphasizing the genetic basis of polycystic ovarian syndrome. The addition of the Cochrane MDSG has improved the interpretation of reproductive medicine trials (11). With more than 2000 randomized controlled trials in fertility, this group allows assessment of the quality and content of the published trials. 5
CONCLUSIONS
Providers rely on high-quality clinical research to guide their practice; researchers rely on standards that optimize study design and ensure appropriate analysis of results. Most importantly, patients rely on research to determine the most effective diagnostic tests and reliable treatment options. Classically, reproductive medicine had limited funding, which resulted in suboptimal clinical studies. The factors that complicate these
clinical trials are now recognized, and appropriate patient selection and study design has been clarified. The use of standards such as CONSORT and STARD will improve the research development and publication significantly. In addition, the efforts from the NICHD to increase multicenter and collaborative trials will set the standard for highquality research in reproductive medicine. REFERENCES 1. J. C. Abama, A. Chandra, W. D. Mosher, et al., Fertility, family planning, and women’s health: new data from the 1995 National Survey of Family Growth. Vital Health Stat. 1997; 19: 1–114. 2. S. Daya, Methodological issues in infertility research. Best Prac. Res. Clin. Obstet. Gynecol. 2006; 20: 779–797. 3. L. V. DePaolo and P. C. Leppert, Providing research training infrastructures for clinical research in the reproductive sciences. Am. J. Obstet. Gynecol. 2002; 187: 1087–1090. 4. A. Dias, R. McNamee, and A. Vail, Evidence of improving quality of reporting of randomized controlled trials in subfertility. Hum. Reprod. 2006; 21: 2617–2627. 5. S. F. P. J. Coppus, F. van der Venn, P. M. M. Bossuyt, and B. W. J. Mol, Quality of reporting of test accuracy studies in reproductive medicine: impact of the Standards for Reporting of Diagnostic Accuracy (STARD). Fertil. Steril. 2006; 86: 1321–1329. 6. K. S. Khan, S. Daya, JA. Collins, and S. D. Walter, Empirical evidence of bias in infertility research: overestimation of treatment effect in crossover trials using pregnancy as the outcome measure. Fertil. Steril. 1996; 65: 939–945. 7. J-C. Acre, A. Nyboe Anderson, and J. Collins, Resolving methodological and clinical issues in the design of efficacy trials in assisted reproductive technologies: a mini-review. Hum. Reprod. 2005; 20: 1751–1771. 8. S. Daya, Pitfalls in the design and analysis of efficacy trials in subfertility. Hum. Reprod. 2003: 18; 1005–1009. 9. D. Moher, K. F. Schulz, and D. Altman for the CONSORT Group, The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. JAMA 2001; 285: 197–191. 10. R. S. Legro, H. X. Barnhart, W. D. Sclaff, B. R. Carr, et al., for the Cooperative Multicenter
Reproductive Medicine Network, Clomiphene, metformin, or both for infertility in the polycystic ovarian syndrome. N. Engl. J. Med. 2007; 356: 551–566. 11. C. M. Farquhar, A. Prentice, D. H. Barlow, J. L. H. Evers, P. Vandekerckhove, and A. Vail, Effective treatment of subfertility: introducing the Cochrane menstrual disorders and subfertility group. Hum. Reprod. 1999; 14: 1678–1683.
FURTHER READING E. G. Hughes, Randomized clinical trials: The meeting place of medical practice and clinical research. Semin. Reprod. Med. 2003; 21: 55–64. G. Piaggio and A. P. Y. Pinol, Use of the equivalence approach in reproductive health clinical trials. Stats. Med. 2001; 20: 3571–3587. K. F. Schultz, I. Chalmers, R. J. Hayes, et al., Empirical evidence of bias: dimensions of methodological quality associated with treatment effects in controlled trials. JAMA 1995; 273: 408–412. L. Roenbert, Physician-scientists: endangered and essential. Science 1999; 288: 331–332. A. Vail and E. Gardener, Common statistical error in the design and analysis of subfertility trials. Hum. Reprod. 2003; 18: 1000–1004.
DISEASE TRIALS ON PEDIATRIC PATIENTS
HOWARD BAUCHNER
Boston University School of Medicine, Boston Medical Center, Boston, Massachusetts

Conducting clinical trials in which children are the subjects is important. Appropriately powered randomized clinical trials (RCTs) remain the most influential study design. Some clinical trials change practice on their own; others contribute through meta-analyses. Both RCTs and meta-analyses are considered the cornerstone for the Cochrane Collaboration and the primary basis for recommendations of the United States Preventive Services Task Force. In a recent analysis, we found a large gap between high-quality study designs in children and adults. We assessed all research articles published in the New England Journal of Medicine, Journal of the American Medical Association, Annals of Internal Medicine, Pediatrics, Archives of Internal Medicine, and Archives of Adolescent and Pediatric Medicine during the first 3 months of 2005 (1). A total of 405 original research reports were reviewed, of which 189 included only adults as subjects and 181 included only children as subjects (total N = 370). Both RCTs and systematic reviews were significantly more common in studies that involved adults compared with those that involved children. Of the 370 studies, 32.6% were RCTs, 23.8% involving adults and 8.8% involving children. Of the 12.3% of studies that were systematic reviews, 10.6% involved adults and 1.7% involved children. Cross-sectional studies (considered less robust designs) were twice as common in the pediatric literature (38.1% vs. 17.7%). Unlike large RCTs, in which both known and unknown biases are usually equally distributed between groups, in cross-sectional designs, confounding is of major concern. This survey highlights the concern about the lack of high-quality study designs—RCTs and systematic reviews—in child health research but raises the question of why this gap exists.
Researchers face unique obstacles when conducting RCTs in children, including the limited number of children with major medical diseases, issues related to ethics, and, finally, measurable objective outcomes. Unlike adults, most children in the United States are healthy. Of the estimated 1,400,000 patients newly diagnosed each year with cancer, only about 10,000 are children 0–14 years of age (2). The most common chronic disease in childhood is asthma, which affects approximately 10% of the 75,000,000 children 0 to 18 years of age. However, the vast majority of even these children have mild to moderate disease. Few seek care in emergency rooms, and even fewer are hospitalized. These are two examples of important childhood diseases, but they affect such limited numbers of children that conducting RCTs is difficult. Investigators in child health, therefore, frequently have to resort to multicenter trials. A good example is the Children's Oncology Group (COG), which enrolls children around the United States in clinical trials and has been conducting multisite trials for decades (3). Virtually every child in the United States with cancer can enroll in a clinical trial. More recent examples include the Pediatric Research in Office Settings group (PROS) (4), a pediatric emergency network in the United States (5), and a clinical trials unit in the United Kingdom (6). PROS is a practice-based network, established in 1986, which includes over 700 practices in the United States, Puerto Rico, and Canada. It was established by the American Academy of Pediatrics (AAP). A research staff at the AAP helps coordinate projects and assists with identifying funding and data analysis. Research ideas are vetted through numerous committees, which include biostatisticians, epidemiologists, and clinicians. The Pediatric Emergency Care Applied Research Network (PECARN) is the first federally funded collaborative network for research in pediatric emergency medicine. The group recently published the results of a large RCT that assessed the role of dexamethasone for infants diagnosed with bronchiolitis in the emergency
room setting (5). Twenty emergency rooms were involved in this study. The United Kingdom network, Medicines for Children Research Network (MCRN), was created in 2005 because of the concern about lack of high-quality clinical trials in children. The unit is based in Liverpool, but six local research networks are established, covering most of England, and nine clinical studies groups exist, too, including anaesthesia, diabetes, neonatal, neurosciences, and respiratory. The goal is to develop and conduct RCTs that will enroll children from around the United Kingdom. Research consortia like COG, PROS, PECARN, and MCRN are critical to the future health of children. However, unlike single-site trials, multisite trials are far more difficult to conduct because of issues related to expense, reliability of data, and ethics committee approval. Ethical issues are an important consideration when conducting research with children (7). According to federal regulation, children are considered a vulnerable population and, as such, are afforded additional protections by Institutional Review Boards. For example, children can participate in research with greater than minimal risk only as long as they are likely to benefit from research. The risk/benefit ratio must be perceived at least as favorable as available alternatives. The issue of when children, rather than their parents, should give consent is also complicated. When considering whether children are capable of consenting, IRBs consider age, maturity, and psychological state of the children. The adolescent years pose another potential problem for researchers. Determining when adolescents become ‘‘adults’’ and can consent to participate in research without their parents being present or contacted is another complicated issue that can vary from state to state. A final consideration in conducting trials in children is appropriate health outcomes. Often trials with adults use as objective outcomes, death, hospitalization, or major medical problems, such as myocardial infarction or stroke. These outcomes are quite rare in children, so the outcomes used in many pediatric trials relate to functional measures of performance, including cognitive outcomes and physical activity. These outcomes are
affected by many variables, so ensuring adequate sample sizes in RCTs that involve children is critical. In addition, many other variables can modify the results (effect modifiers). In addition, many outcomes that are important to child health researchers occur in the adolescent and adult years. For example, researchers in childhood cancer are particularly interested in rates of secondary cancer. Many of these occur 20–30 years after initial treatment, hence the need for longterm follow-up studies. The same is true for Type 1 diabetes and cystic fibrosis, cases in which investigators are interested in complications and/or other outcomes that occur in the adult years. Many problems in child health necessitate long-term follow-up, which is both expensive and logistically difficult. Clinical trials improve health care. However, the number of clinical trials that involve children is limited, with researchers facing a unique set of obstacles. However, I remain optimistic that the development of research networks and the growing recognition that we need to conduct more clinical trials that involve children will lead to a growing number of RCTs. REFERENCES 1. C. Martinez-Castaldi, M. Silverstein, and H. Bauchner, Child versus adult research: The gap in high quality study design. Abstract, Pediatric Academic Societies Meeting, May 2006. Available: http://www.abstracts2view. com/pasall/search.php?query=Bauchner&where []=authors&intMaxHits=10&search=do. 2. http://www.cancer.org/downloads/STT/CAFF 2007PWSecured.pdf. 3. http://www.curesearch.org/. 4. http://www.aap.org/pros/abtpros.htm. 5. H. M. Corneli et al., A multicentre, randomized, controlled trial of dexamethasone for bronchiolitis. NEJM. 2007; 357: 331–339. Available: http://ctuprod.liv.ac.uk/mcrn/library/docs/ MCRNCC%202005-06.doc. 6. http://www.mcrn/org.uk. 7. http://www.bu.edu/crtimes/.
DNA BANK
THERU A. SIVAKUMARAN
Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio

SUDHA K. IYENGAR
Departments of Epidemiology and Biostatistics, Genetics, and Ophthalmology, Case Western Reserve University, Cleveland, Ohio

1 DEFINITION AND OBJECTIVES OF DNA BIOBANKS

DNA bank, also known as DNA biobank or Biobank, is an organized collection of DNA or biological samples, such as blood plasma, and so on, that are used to isolate DNA. These warehouses also contain information about the donor of the material, such as demographic characteristics, the type of disease associated with the sample, the outcome of the disease, treatment, and so on. According to the American Society of Human Genetics (ASHG) policy statement (1), a DNA bank is defined as a facility to store DNA for future analysis. The American National Bioethics Advisory Commission defined a DNA bank as a facility that stores extracted DNA, transformed cell lines, frozen blood, or other tissue or biological materials for future DNA analysis (2). Although many types of DNA banks are available (described below), this article focuses mainly on the DNA banks that are meant for human genome research (i.e., academic laboratory DNA banks, population-based public DNA biobanks, and commercial biobanks). Population-based public or national DNA biobanks represent a new paradigm for biomedical research and differ from traditional academic laboratory DNA banks in several aspects. The public or commercial DNA biobanks that obtain samples may not be engaged in research but may be only intermediary brokers who supply specimens to other researchers. They act as repositories that can be used for many research applications. Because DNA collected at a particular time can survive for a long time, the scope of use of the specimen may be far broader than specified at the time of collection.

1.1 Common or Complex Diseases

Modest-sized genetic studies have been successful in the identification of single genes that, when mutated, cause rare, highly heritable mendelian disorders. However, studies that are aimed at identifying genes responsible for predisposition to common diseases, such as heart disease, stroke, diabetes, age-related macular degeneration, cancer, and so on, are challenging. Common diseases are thought to be caused by many modest-sized, often additive effects, which represent the outcome of genetic predisposition, as well as lifestyle factors such as diet and smoking habits, and the environment in which we live and work. Locating these complex interactions between genes and the environment, and their role in disease, depends on the collection of well-documented epidemiological clinical information, as well as biological specimens. The concept of obtaining and maintaining prospective cohorts to understand health and disease has a precedent in epidemiological research, with the investment in studies such as the Framingham Heart Study (3), but this paradigm has only recently been broadened to genetic epidemiology investigations. In response to the large sample sizes needed to obtain statistical power to detect the predisposing genes, as well as the recent technological developments in identifying the genetic variants, DNA biobanks are growing from small, local biological repositories to large population-based collections.
1.2 Pharmacogenomics Medications prescribed by a physician can have associated side effects called adverse drug reactions in some individuals. Approximately 100,000 people die each year from adverse reactions to drugs (4), and millions of people must bear uncomfortable or even dangerous side effects. Currently, no simple
method is available to determine whether people will respond well, badly, or not at all to a particular drug. Therefore, all pharmaceutical companies follow a ‘‘one size fits all’’ system, which allows for development of drugs to which the ‘‘average’’ patient will respond. Once a drug is administered, it is absorbed and distributed to its site of action, where it interacts with numerous proteins, such as carrier proteins, transporters, metabolizing enzymes, and multiple types of receptors. These proteins determine drug absorption, distribution, excretion, target site of action, and pharmacological response. Moreover, drugs can also trigger downstream secondary events that may vary among patients, for example, seldane (terfenadine), which is an antihistamine used to treat allergies, hives (urticaria), and other allergic inflammatory conditions, was withdrawn because it caused increased risk for long QT syndrome and other heart rhythm abnormalities. Similarly, Vioxx (Merck & Co., Whitehouse Station, NJ), which is a drug prescribed to relieve signs and symptoms of arthritis and acute pain in adults, as well as painful menstrual cycles, has also been removed from the market because of an increased risk for heart attack and stroke. Thus, the way a person responds to a drug is a complex trait that is influenced by many different genes that code for these proteins. Without knowing all of the genes involved in drug response, scientists have found it difficult to develop genetic tests that could predict a person’s response to a particular drug. Pharmacogenomics is a science that examines the inherited variations in genes that dictate drug response and explores the ways these variations can be used to predict whether a patient will have a good response to a drug, a bad response to a drug, or no response at all. The concept of pharmacogenetics, which is a discipline that assesses the genetic basis of drug response and toxicity, originated from the clinical observation that many patients had very high or very low plasma or urinary drug concentrations, followed by the realization that the biochemical traits that lead to this variation were inherited. The studies conducted in early 1950s examine the drug metabolizing enzyme variants in
the cytochrome P450 family. The cytochrome P450 monooxygenase system is responsible for a major portion of drug metabolism in humans. This large family of genes has been intensely studied, and among the numerous subtypes, CYP2D6, 3A4/3A5, 1A2, 2E1, 2C9, and 2C19 play particularly critical roles in genetically determined responses to a broad spectrum of drugs (5). Because drugs that act on the central nervous system penetrate the blood-brain barrier, renal excretion is minimal for these compounds, and cytochrome P450 metabolism, particularly CYP2D6 and CYP2C19, provides the only means of effective drug elimination. The activity of the CYP2D6 enzyme is extremely variable because of more than 50 genetic variants, and patients who are homozygous for the CYP2D6 null alleles have impaired degradation and excretion of many drugs, which include debrisoquine, metoprolol, nortriptyline, and propafone (6). These patients are termed ‘‘poor metabolizers’’ for CYP2D6 selective drugs, and they are more likely to exhibit adverse drug reactions. Similarly, patients who are homozygous for the ‘‘null’’ allele of the P450 isoform CYP2C19 are highly sensitive to omeprazole, diazepam, propranolol, mephenytoin, amitriptyline, hexobarbital, and other drugs. Today, clinical trial researchers use genetic tests for variations in cytochrome P450 genes to screen and monitor patients prior to drug administration. 1.3 Finding-Causative Genetic Factors The process of identifying genes that are responsible for common disease as well as drug response begins with scanning the genetic variations in the entire DNA sequence of many individuals with the disease, or those who respond to a particular drug, and contrasting this information with that from individuals without the disease, or who do not respond to the drug. The most commonly found human genetic variations between two individuals are variants that involve a single nucleotide, named single nucleotide polymorphisms (SNPs). In comparing two haploid genomes, a SNP was observed to occur on average every 1331 bases. When more than two haploid genomes are surveyed, a SNP
is expected to occur on average every 240 bases (7). In the general population, the collective human genome is expected to contain about 10 million SNPs, and testing all of them would be very expensive. Systematic studies that identify these common genetic variants are facilitated by the fact that it is not necessary to interrogate every possible variant. The International HapMap Project (8) showed that a particular SNP allele at one site often carries information about specific alleles at other nearby variant sites; this association is known as linkage disequilibrium (LD). Thus, genetic variants that are located close to each other tend to be inherited together; these variants are often labeled as a haplotype or LD block. Because little or no recombination is observed within this haplotype block, a small subset of SNPs, called tag SNPs, are sufficient to distinguish each pair of patterns in the block and this reduces the necessity of querying all 10 million SNPs. Approximately 1 million common SNPs carry most information on common variants. Examination of these SNPs by traditional methods, such as sequencing, is very slow and expensive, but DNA microarray technology has made it possible to genotype large numbers of SNPs in a single array/chip. In the past few years, the number of SNPs available on a single microarray chip has grown from 10,000 to 1,000,000. Once a particular area of the genome is associated with disease is identified through scanning thousands of tag SNPs, it should be possible to zero in on specific genes involved in the disease process much more quickly. This process occurs because when a new causal variant develops through mutation, it is initially tethered to a unique chromosomal region that is marked by a haplotype block. If a disease is associated with a tag SNP within that particular haplotype block, then variants that contribute to the disease might be located somewhere within or near that haplotype block. In summary, it is feasible to use surrogate markers to locate disease genes or to find genes that affect the action of pharmaceutical compounds. Biobanks play a significant role in this discovery process.
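The tag-SNP logic above rests on pairwise linkage disequilibrium between nearby variants. The sketch below computes the standard LD summaries D, D', and r² for two hypothetical biallelic SNPs; the haplotype and allele frequencies are invented for illustration and are not drawn from the HapMap data cited here.

```python
# Hedged illustration of pairwise linkage disequilibrium between two
# biallelic SNPs. The haplotype frequency p_ab and allele frequencies
# p_a, p_b below are invented; they are not HapMap values.

def ld_stats(p_ab: float, p_a: float, p_b: float):
    """Return D, D' and r^2 given the A-B haplotype frequency and the allele frequencies."""
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

d, d_prime, r2 = ld_stats(p_ab=0.29, p_a=0.30, p_b=0.32)
print(f"D = {d:.3f}, D' = {d_prime:.2f}, r^2 = {r2:.2f}")
# With r^2 this high, genotyping SNP A alone recovers most of the information
# carried by SNP B, which is the rationale for tag SNPs within a haplotype block.
```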
2 TYPES OF DNA BIOBANKS
Several kinds of DNA banks are based on usage and location. 2.1 Academic DNA Banks Repositories are housed in the laboratories of researchers who study one or more genetic disorders. This collection typically contains DNA samples obtained from families at risk for the disease of interest or from cases with certain common disease, such as heart disease, diabetes, cancer, and so on, as well as samples from healthy controls. One of the first academic biobanks was the Centre d’etude du Polymorphism Humaine (CEPH) (9), which facilitated human gene mapping and was the nexus for the Human Genome Project and the International HapMap Project. Unlike most other collections, the CEPH collection was initially assembled with no information on disease. 2.2 Population-Based DNA Biobanks These centers are large private/public repositories of donated human DNA, with health and other relevant information collected from volunteers with and without the disease. The aim of public initiatives, which are supported by local governments, is to provide specimens to the scientific community for research geared toward identifying genes that contribute to common disease as well as to drug response and toxicity. The volunteers for these biobanks are approached mainly through their health care providers. Most pharmaceutical companies have their own large collections of DNA samples acquired through research collaborations or from subjects enrolled in clinical trials (10). 2.3 Commercial Biobanks Several commercial biobanks have been established in the last few years with the goal of storing personal genetic information and providing this information to researchers, as well as to pharmaceutical companies, for-a-fee. Some commercial biobanks in the United States are Genomics Collaborative, Inc., Ardais Corporation, and DNA Sciences Inc. (11). Genomics Collaborative Inc. claims
to have a repository that contains about 500,000 biological samples and also health data from 120,000 people from all over the world. Ardais Corporation, in agreement with several physicians, recruits patients to donate samples, and the repository contains more than 220,000 tissue samples collected from over 15,000 donors. DNA Sciences Inc. has a collection of over 18,000 DNA samples in its repository (11). 2.4 State Forensic DNA Data Banks DNA profiling and databases have become common in criminal investigation and prosecution. Forensic DNA data banks serve as repositories for long-term storage of DNA collected from certain defined criminal offenders and the profiles derived from analysis of the DNA. Most states in the United States permit DNA profile databasing of offenders, of missing persons and their relatives, and of DNA profiles from criminal-case evidence in which the depositor is unknown. The national DNA database of the United Kingdom also stores DNA samples and other specimens collected from crime scenes as well as samples taken from individuals in police custody. 2.5 Military DNA Data Banks Military DNA banks assist in the identification of human remains. The U.S. military uses DNA profiles in place of traditional means of identification such as ‘‘dog tags,’’ and new recruits routinely supply blood and saliva samples used to identify them in case they are killed in action. 2.6 Nonhuman DNA Banks DNA banking has become an important resource in worldwide efforts to address the biodiversity crisis, manage the world’s genetic resources, and maximize their potential. Several plant and animal DNA banks are located around the world. A Japan-based National Institute of Agrobiological sciences’ DNA Bank was established in 1994 as a section of Ministry of Agriculture, Forestry, and Fisheries. It maintains the DNA materials and genomic information that were collected as part of rice, animal, and silkworm genome projects (12). Royal Botanic
Gardens (RBG) at Kew established a DNA bank database with the aim of extracting DNA from all plant species that are grown at RBG. Currently, it has genomic DNA from over 22,000 plants in its facility, which are stored at −80◦ C, and these samples are sent to collaborators all over the world for research purposes (13). The Australian plant DNA bank has a comprehensive collection of DNA from both Australian native and important crop plant species in its repository. It provides DNA samples and the associated data to the research community (14). Recently, Cornell University’s College of veterinary medicine has established a DNA bank called Cornell Medical Genetic Archive to understand the genetic basis of disease in many species such as dogs, cats, horses, cows, and exotic animals. The blood samples taken from these species, with the owner’s written consent, will be used for DNA extraction, and these samples will be used by the researchers at Cornell (15). In summary, DNA banks can be speciesspecific, and their use is tailored to particular needs. In general, human biobanks have been established to (1) study a specific genetic disease under collaborative guidelines; (2) to accrue information from populations of entire countries (see below) for current surveillance or for future use; (3) in pharmacogenetic studies, as part of clinical trials; and (4) in hospital based settings to be used for current and future research.
3 TYPES OF SAMPLES STORED
A variety of biological materials is now banked for isolating genomic DNA for genetic research. All nucleated cells, which include cells from the hair follicle, buccal swab, saliva, and urine, are suitable for isolating DNA. Many large biobanks obtain whole blood, as it provides the amounts of DNA necessary for current applications. Whole blood is generally stored in three different forms in the biobank (i.e., genomic DNA, dried blood spots, and Epstein-Barr virus (EBV)-transformed cell lines).
3.1 Genomic DNA

Blood is most often collected using ethylenediaminetetraacetic acid (EDTA) or other anticoagulants, which include heparin and acid citrate dextrose, and genomic DNA is isolated using in-house methods or commercially available kits. The purified genomic DNA is stored at 4°C or −20°C. Most samples stored in biobanks are isolated genomic DNA (16). Often, an EDTA biospecimen can also serve as a source of material for other bioassays, as this type of specimen is suitable for a variety of biochemical, hematologic, and other types of tests (17).
3.2 Dried Blood Spots

Dried blood spots, which are also known as Guthrie cards, are generally used in newborn screening programs to quantify the analytes that include metabolic products, enzymes, and hormones (18). Blood spots from military personnel are also stored to serve as biologic "dog tags" for identification purposes. Blood spots may yield enough DNA to genotype several variants (19,20), although with current genotyping technology it is feasible to genotype thousands of markers. The United Kingdom's national biobank is planning to spot whole blood onto filter paper in a 384-well configuration for storage (10).

3.3 EBV-Transformed Cell Lines

EBV-transformed lymphocytes provide an unlimited source of DNA, to be used for whole genome scans. These cells are also used for functional studies. (See Cell Line for additional details.)

4 QUALITY ASSURANCE AND QUALITY CONTROL IN DNA BIOBANKS

The quality of the DNA and other biological specimens in biobanks is of primary importance, as they are the resource for downstream processes such as genotyping, sequencing, and other bioassays. The storage of these specimens varies according to the type of sample: DNA samples are generally stored at 4°C or −20°C; blood and other tissue samples used for DNA extraction are stored at −80°C; and viable cells such as EBV-transformed lymphoid cells are stored at even lower temperatures. Some factors that generally affect the quality or integrity of DNA samples are the following:
1. Improper storage conditions, such as temperature, evaporation, and frequent freezing and thawing, may lead to the degradation of samples.
2. Possibility of cross-contamination of neighboring samples, either by sample carryover during handling or by sample switches during processing.
3. Improper handling of DNA samples (i.e., improper mixing and pipetting).
The ASHG policy statement on DNA banking (1) proposed minimal standards for quality assurance, which include: (1) a DNA bank should occupy space equipped with an alarm system that is separate from other functions; (2) the bank should maintain a manual of procedures and train personnel in meticulous technique; (3) samples should be coded, and very few individuals should have access to the identity; (4) records should be maintained for the receipt, disposition, and storage of each sample; (5) samples should be divided and stored in more than one place; and (6) control samples should be analyzed before deposit and at periodic intervals to demonstrate that DNA profiles are unaffected by storage. To ensure the quality and integrity of biological specimens stored over a long term, biobanks follow standard procedures at every step, such as collection, transport, storage, and handling. Some procedures that are followed to ensure the quality of samples include: • Proper labeling of samples and the asso-
ciated data at the time of collection and during downstream processes. • All samples are barcoded and scanned at every step during procedures. Storage units are highly organized and compartmentalized, in which every sample occupies a unique and computer-tracked location. • Measures are taken to ensure the safety of the samples as well as to prevent damage caused by disasters, such as
availability of backup power and backup cooling systems in case of power failure.
• Conducting DNA profiling of each sample and generating an extensive database of signature profiles prior to storage for later identification.
• Storing samples in conditions that protect them from degradation and following procedures to maintain their integrity despite frequent freezing and thawing.
• Performing periodic maintenance checks to ensure sample traceability and integrity. This process is usually done by DNA profiling samples and comparing the results with the original profiles stored in a database.
• Finally, the clinical information that accompanies the samples and other relevant information is stored in computer systems that run complex database management and analysis software. (A minimal illustration of such computer-tracked sample management follows this list.)
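As a concrete, and entirely hypothetical, illustration of the barcode-and-location tracking described in the list above, the sketch below shows the kind of bookkeeping that a laboratory information management system automates. The class names, location scheme, and freeze-thaw counter are invented for this example and do not describe any particular biobank's software.

```python
# Hypothetical sketch only: barcoded tubes, unique computer-tracked storage
# locations, and freeze-thaw counting, as described in the list above.
# Real biobanks rely on dedicated LIMS software, not ad hoc code like this.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SampleRecord:
    barcode: str                  # unique scanned identifier
    location: str                 # e.g. "freezer-3/rack-B/box-12/pos-A07" (invented scheme)
    freeze_thaw_cycles: int = 0
    history: list = field(default_factory=list)

    def log(self, event: str) -> None:
        self.history.append((datetime.now(timezone.utc).isoformat(), event))

class SampleInventory:
    def __init__(self):
        self._by_barcode = {}     # barcode -> SampleRecord
        self._occupied = set()    # storage positions already in use

    def check_in(self, barcode: str, location: str) -> SampleRecord:
        if barcode in self._by_barcode or location in self._occupied:
            raise ValueError("duplicate barcode or occupied location")
        record = SampleRecord(barcode, location)
        record.log(f"checked in at {location}")
        self._by_barcode[barcode] = record
        self._occupied.add(location)
        return record

    def retrieve(self, barcode: str) -> SampleRecord:
        record = self._by_barcode[barcode]   # KeyError flags an unknown barcode
        record.freeze_thaw_cycles += 1       # each retrieval implies a thaw
        record.log("retrieved (freeze-thaw cycle counted)")
        return record

inventory = SampleInventory()
inventory.check_in("BB-000123", "freezer-3/rack-B/box-12/pos-A07")
print(inventory.retrieve("BB-000123").freeze_thaw_cycles)   # 1
```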
With proliferation in the number of samples stored in biobanks and the increased need to access more samples quickly and accurately, researchers have started using automated systems for sample storage and retrieval (10). These automated sample management systems dramatically increase throughput and eliminate manual errors. Some automated biological management systems are the DNA archive and reformatting (DART) system that is used by AstraZeneca, the Polar system developed by The Automated Partnership, and RTS Life Science's automated sample management systems. The sample storage and use of the DART system at AstraZeneca is controlled by a Laboratory Information Management System. The DART system can store over 400,000 tubes and can handle over 650 tubes per hour. A complete description of the automation of sample management can be found elsewhere (10,21).
ETHICAL ISSUES
Biobanks raise several technical, ethical, and legal concerns. In the realm of ethics, these concerns revolve primarily around how donors of biological material or data will be assured that their privacy and interests will be protected, given the increasing number of large for-profit companies and entirely private DNA biobanks. 5.1 Informed Consent Informed consent is a key ethical standard required for clinical research, and its underlying principle is respect for the autonomy of individuals. Codes of ethical practice, such as the Declaration of Helsinki (22), state that a research participant must be provided the opportunity for voluntary informed consent before a study begins. In the United States, the rights and welfare of human subjects in genetic research are protected by institutional review boards (IRBs). The IRB is an independent ethics committee that is governed by principles established by the Office of Human Research Protections in the Department of Health and Human Services (23), and it is located at the institution where the research is conducted. IRBs are also run by commercial enterprises to assist governmental and industrial bodies in regulating research. In general, informed consent is requested using a special form approved by the IRB. All aspects relating to the handling of the preserved material, as well as the data, must be written clearly in the consent form. Elements of a traditional model of informed consent include the following:

• The document should explain the proposed research, its purpose, the duration and procedures of the study, and a description of potential risks.
• The document should state that the sample will be used exclusively for diagnostic or research purposes and never for direct profit. If profit is the goal, then the beneficiaries should be clearly listed.
• The potential benefits derived from the use of these samples to the individual and/or the entire community should be explained in lay terms.
• The document should clarify the procedure for handling the data to ensure anonymity as well as the protection of
confidentiality of the sample and its related investigations.
• The informed consent document should state whom to contact for questions about the subjects' rights or about study-specific questions.
• Finally, information on how to withdraw consent should be outlined, with provisions for immediate withdrawal of samples and participation.

Although the principle of informed consent is recognized, its translation into biobanks that store samples and data for long-term use, and provide these data to any entity with appropriate permissions (such as IRB approval) on a fee-for-service basis, leads to many practical difficulties (24–26). Some of these difficulties include: (1) potential participants can only be given information about the sort of research that can be foreseen with present knowledge, and it is difficult to describe all the types of research that may become feasible at a later date; (2) at the time of recruitment, it is not possible to describe the information that will subsequently be collected from the volunteers because they do not know which disease they are going to develop (27); (3) it is not possible to give information about the research or research teams who will be using the samples; and (4) it is not possible to give information back to the participants on their risk profile if the research is being conducted by a third party. Because of these difficulties, no clear international guidelines on informed consent for biobanks are available. Several models of informed consent are found in the literature (27–29), and the common recommendation proposed by the United Nations Educational, Scientific and Cultural Organization (UNESCO) International Bioethics Committee is called "blanket consent." Herein, research participants consent once to the future use of their biological samples (30). The U.K. Biobank and the Estonian Biobank use the blanket approach (29,27), whereas the Icelandic Biobank uses two different consent forms: Form "A" authorizes the use of samples for specific research, and form "B" authorizes the use of samples for specific research and for
additional similar research if approved by the National Bioethics Committee (31), without further contact with the participant. 5.2 Confidentiality The term confidentiality refers to the safeguarding of identifiable information that an individual discloses with the expectation that it will not be provided to third parties (23). The possibility of tracing the person from whom a sample and data were derived varies according to how samples are linked to their donor's identity. In general, the labeling of a sample can range from irreversible anonymity to complete identification, based on the choices expressed in the written informed consent form. 5.3 Anonymous Samples are identified only by a code from the start, and all identifying links to the donors are destroyed at the time of collection. Therefore, it is not possible to link the sample to a given person, and this mechanism offers the most protection of confidentiality. 5.4 Anonymized The patient's personal data are removed after a code is assigned, after which it is no longer possible to connect the two. 5.5 Double-Coding An anonymous number is assigned to the participants as well as to the samples, and a code that links the anonymous number and the personal information is placed in a secure location, which is accessible to the person in charge of the biobank and his/her immediate colleagues. 5.6 Identifiable The identity of individuals is directly linked to the samples or data. This option is only possible on explicit request by the interested party and in any case only for exclusive personal and family use.
6 CURRENT BIOBANK INITIATIVES
Centralized population-based biobanks have been established in at least eight countries, including Iceland, the United Kingdom, Canada, Japan, Sweden, Latvia, Singapore, and Estonia. Population homogeneity and a strong interest in genealogy led to the establishment of the world's first population-based biobank in Iceland. deCODE Genetics, which is a private company, successfully partnered with the Icelandic parliament in 1998 to create and operate a centralized database of nonidentifiable health data (32). They aimed to enroll most of the country's 270,000 citizens and currently have genetic samples from more than 100,000 volunteers (10,33). deCODE links data from the Iceland Health Database with public genealogical records. In Canada, three population-based biobank initiatives are in various stages of development. One project is CARTaGENE, a publicly funded biobank in Quebec. In the first phase, it aims to recruit a random sample of about 20,000 adults between 45 and 69 years of age, which represents about 1% of the citizens in this age group in the selected regions of Quebec. This resource will be available to researchers who are interested in conducting population-based genomic research (34). Two other population-based genetic biobank projects planned in Canada are a national birth cohort and a longitudinal study on aging. The U.K. Biobank was started in August 2006 with the goal of studying the separate and combined effects of genetic and nongenetic risk factors in the development of multifactorial diseases of adult life. This program plans to involve 0.5 million people 45–69 years of age and intends to collect environmental and lifestyle data, as well as information from medical records, along with biological samples. The center will also coordinate the activities of six scientific regional collaborating centers, each a consortium of academic and research institutions responsible locally for recruitment and collection of data and samples. This biobank will serve as a major resource that can support a diverse range of research intended to improve the prevention, diagnosis, and treatment of illness and the promotion of health throughout society (35).
A genome project set up by the tiny European country of Estonia, founded in 2001, is accelerating plans to take blood samples from three quarters of its population (i.e., 1 million of its 1.4 million people) and promises to be the biggest such initiative. This project aims not only to enable research on the genetic and nongenetic components of common disease, but also to create a biological description of a large and representative sample of the Estonian population (36). The Latvian Genome Project, which was launched in 2002, is a unified national network of genetic information and data processing intended to collect representative amounts of genetic material for genotyping of the Latvian population and to compare genomic data with clinical and pedigree information. This project is planned for a period of 10 years, with an expected sample size of 60,000 in its pilot phase (37). The Swedish National Biobank is estimated to house about 50–100 million human samples, a collection that increases at a rate of 3–4 million samples per year (38). GenomEUtwin is a project that aims to analyze twin and general cohorts to determine the influence of genetic and nongenetic factors that cause predisposition to obesity, stature, coronary heart disease, stroke, and longevity, and to create synergies in genetic epidemiology. The implementation is coordinated by the Finnish National Public Health Institute and the University of Helsinki, and it builds on existing twin cohorts from a few other European countries (39). The Biobank Japan project, which was established in 2003 with the support of the Japanese government, plans to collect DNA, sera, and clinical information from 300,000 patients in participating hospitals (40). Genetic Repositories Australia is a central national facility for establishing, distributing, and maintaining the long-term secure storage of human genetic samples. It provides Australian medical researchers with DNA and cell lines, as well as associated clinical information, collected from patients and controls (41). China has also launched its first biobank program, called "The Guangzhou Biobank Cohort Study," with the aim of creating profiles on about 11,000 Guangzhou people aged above 50 years in the first phase (42).
Several biobanks are located in the United States, and one of the largest population-based biobanks is the Personalized Medicine Research Project of the Marshfield Clinic in Wisconsin. The investigators of this project are planning to enroll at least 100,000 people who live in northern and central Wisconsin and to make the samples, as well as the data, available to other researchers. Currently, this repository contains information from 40,000 participants (43,44). The Center for Genetic Medicine of Northwestern University also initiated the first hospital DNA biobank, and it plans to collect DNA samples with associated clinical healthcare information from 100,000 volunteers who receive their healthcare at Northwestern-affiliated hospitals and clinics (45). The DNA bank and tissue repository at the Center for Human Genetics at Duke University is one of the oldest academic DNA banks in the United States, and it contains samples from more than 127,500 individuals (46). Howard University is also planning to establish a large DNA and health database on individuals of African American descent. They aim to enroll about 25,000 volunteers over 5 years and use the data to study the genetic and lifestyle factors that contribute to common diseases (47). Children's Hospital of Philadelphia has launched a biobank of children's DNA with the aim of collecting DNA from 100,000 children and using it to study common childhood diseases such as asthma, diabetes, and obesity. They plan to create a database of children's genetic profiles, which hospital researchers can use to develop diagnostic tests and drugs (48). The U.S. Department of Veterans Affairs has also proposed a national gene bank that would link DNA donated by up to 7 million veterans and their family members with anonymous medical records (49). Other similar initiatives are being implemented in many parts of the United States, but no centralized federally mandated repository is active.

7 CONCLUSIONS
Biobanking initiatives have been globally embraced because the value of these enterprises in solving the problem of emerging chronic diseases has been recognized.
However, translating the basic knowledge obtained from specific gene variants to the clinic remains the current challenge. The current pharmaceutical drug discovery pipeline is very inefficient, and in general the yield has been disappointingly low. Although it was initially hoped that the Human Genome Project and the International HapMap Project would accelerate drug discovery, progress has not followed the hypothesized pace because the discovery of gene variants and the discovery of drug targets do not follow parallel paths. The former has been the domain of academic scientists interested in the biological mechanisms of disease, whereas the latter, drug discovery and chemical screening, has been the province of the pharmaceutical industry, each with its respective governance and pipeline. In the past decade, there has been a proliferation in the size of biobanks and the clinical data collected by academic and industry researchers alike, all with the hope that larger sample sizes and revolutions in technology will enable them to develop better models for disease and health.

REFERENCES

1. ASHG Ad Hoc Committee on DNA Technology, DNA banking and DNA analysis—points to consider. Am. J. Hum. Genet. 1988; 42: 781–783.
2. National Bioethics Advisory Commission, Research Involving Human Biological Materials: Ethical Issues and Policy Guidance. Rockville, MD: National Bioethics Advisory Commission, 1999.
3. Framingham Heart Study. Available at: http://www.framinghamheartstudy.org/.
4. J. Lazarou, B. H. Pomeranz, and P. N. Corey, Incidence of adverse drug reactions in hospitalized patients—a meta-analysis of prospective studies. JAMA 1998; 279: 1200–1205.
5. W. W. Weber, Pharmacogenetics. New York: Oxford University Press, 1997.
6. F. Broly, A. Gaedigk, M. Heim, M. Eichelbaum, K. Morike, and U. A. Meyer, Debrisoquine/sparteine hydroxylation genotype and phenotype—analysis of common mutations and alleles of CYP2D6 in a European population. DNA Cell Biol. 1991; 10: 545–558.
7. The International SNP Map Working Group. Nature 2001; 409: 928–933.
8. The International HapMap Consortium. The International HapMap Project. Nature 2003; 426: 789–796.
9. Centre d'Etude du Polymorphisme Humain (CEPH). Available at: http://www.cephb.fr/.
10. M. Thornton, A. Gladwin, R. Payne, R. Moore, C. Cresswell, D. McKechnie, S. Kelly, and R. March, Automation and validation of DNA-banking systems. Drug Discov. Today 2005; 10: 1369–1375.
11. J. Kaiser, Biobanks—Private biobanks spark ethical concerns. Science 2002; 298: 1160.
12. NIAS DNA Bank. Available at: http://www.dna.affrc.go.jp/about/.
13. Royal Botanic Gardens, Kew: Plant DNA Bank database. Available at: http://www.kew.org/data/dnaBank/homepage.html.
14. The Australian Plant DNA Bank. Available at: https://www.dnabank.com.au/.
15. The Cornell Medical Genetic Archive. Available at: http://www.vet.cornell.edu/research/DNABank/intro.htm.
16. I. Hirtzlin, C. Dubreuil, N. Preaubert, J. Duchier, B. Jansen, J. Simon, P. L. de Faria, A. Perez-Lezaun, B. Visser, G. D. Williams, and A. Cambon-Thomsen, An empirical survey on biobanking of human genetic material and data in six EU countries. Eur. J. Hum. Genet. 2003; 11: 475–488.
17. S. Clark, L. D. Youngman, A. Palmer, S. Parish, R. Peto, and R. Collins, Stability of plasma analytes after delayed separation of whole blood: implications for epidemiological studies. Int. J. Epidemiol. 2003; 32: 125–130.
18. E. R. B. McCabe, S. Z. Huang, W. K. Seltzer, and M. L. Law, DNA microextraction from dried blood spots on filter-paper blotters—potential applications to newborn screening. Hum. Genet. 1987; 75: 213–216.
19. J. V. Mei, J. R. Alexander, B. W. Adam, and W. H. Hannon, Use of filter paper for the collection and analysis of human whole blood specimens. J. Nutr. 2001; 131: 1631S–1636S.
20. K. Steinberg, J. Beck, D. Nickerson, M. Garcia-Closas, M. Gallagher, M. Caggana, Y. Reid, M. Cosentino, J. Ji, D. Johnson, R. B. Hayes, M. Earley, F. Lorey, H. Hannon, M. J. Khoury, and E. Sampson, DNA banking for epidemiologic studies: a review of current practices. Epidemiology 2002; 13: 246–254.
21. S. Mahan, K. G. Ardlie, K. F. Krenitsky, G. Walsh, and G. Clough, Collaborative design for automated DNA storage that allows for rapid, accurate, large-scale studies. Assay Drug Devel. Technol. 2004; 2: 683–689.
22. World Medical Association, Declaration of Helsinki: Ethical principles for medical research involving human subjects. 2000. http://www.wma.net/e/policy/pdf/17c.pdf.
23. IRB Guidebook, Office of Human Research Protections, Department of Health and Human Services. Available at: http://www.hhs.gov/ohrp/irb/irb guidebook.htm.
24. A. Cambon-Thomsen, Science and society—The social and ethical issues of post-genomic human biobanks. Nat. Rev. Genet. 2004; 5: 866–873.
25. M. Deschenes, G. Cardinal, B. M. Knoppers, and K. C. Glass, Human genetic research, DNA banking and consent: a question of 'form'? Clin. Genet. 2001; 59: 221–239.
26. B. Godard, J. Schmidtke, J. J. Cassiman, and S. Ayme, Data storage and DNA banking for biomedical research: informed consent, confidentiality, quality issues, ownership, return of benefits. A professional perspective. Eur. J. Hum. Genet. 2003; 11: S88–S122.
27. D. Shickle, The consent problem within DNA biobanks. Stud. Hist. Phil. Biol. & Biomed. Sci. 2006; 37: 503–519.
28. M. G. Hansson, J. Dillner, C. R. Bartram, J. A. Carlson, and G. Helgesson, Should donors be allowed to give broad consent to future biobank research? Lancet Oncol. 2006; 7: 266–269.
29. K. J. Maschke, Navigating an ethical patchwork—human gene banks. Nat. Biotechnol. 2005; 23: 539–545.
30. UNESCO. Human genetic data: Preliminary study by the IBC on its collection, processing, storage and use. 2002. http://portal.unesco.org/shs/en/files/2138/10563744931Rapfinal gendata en.pdf/Rapfinal gendata en.pdf.
31. V. Arnason, Coding and consent: moral challenges of the database project in Iceland. Bioethics 2004; 18: 27–49.
32. M. A. Austin, S. Harding, and C. McElroy, Genebanks: a comparison of eight proposed international genetic databases. Community Genet. 2007; 6: 500–502.
33. deCODE Genetics. Available at: http://www.decode.com/biorepository/index.php.
34. CARTaGENE project. Available at: http://www.cartagene.qc.ca/accueil/index.asp.
35. The UK Biobank. Available at: http://www.ukbiobank.ac.uk/about/what.php.
36. The Estonian Genome Project. Available at: http://www.geenivaramu.ee/index.php?lang=eng&sub=58.
37. The Latvian Genome Project. Available at: http://bmc.biomed.lu.lv/gene/.
38. Swedish National Biobank Program. Available at: http://www.biobanks.se/.
39. GenomEUtwin. Available at: http://www.genomeutwin.org/index.htm.
40. Y. Nakamura, Biobank Japan Project: Towards personalised medicine. 2007. http://hgm2007.hugo-international.org/Abstracts/Publish/Plenaries/Plenary01/hgm02.html.
41. Genetic Repositories Australia. Available at: http://www.powmri.edu.au/GRA.htm.
42. C. Q. Jiang, G. N. Thomas, T. H. Lam, C. M. Schooling, W. S. Zhang, X. Q. Lao, R. Adab, B. Liu, G. M. Leung, and K. K. Cheng, Cohort profile: the Guangzhou Biobank Cohort Study, a Guangzhou-Hong Kong-Birmingham collaboration. Int. J. Epidemiol. 2006; 35: 844–852.
43. C. A. McCarty, R. A. Wilke, P. F. Giampietro, S. D. Wesbrook, and M. D. Caldwell, Marshfield Clinic Personalized Medicine Research Project (PMRP): design, methods and recruitment for a large population-based biobank. Personalized Med. 2005; 2: 49–79.
44. H. Swede, C. L. Stone, and A. R. Norwood, National population-based biobanks for genetic research. Genet. Med. 2007; 9: 141–149.
45. NUgene Project. Available at: http://www.nugene.org/.
46. DNA Bank and Tissue Repository, Center for Human Genetics, Duke University. Available at: http://www.chg.duke.edu/research/dnabank.html.
47. J. Kaiser, Genomic medicine—African-American population biobank proposed. Science 2003; 300: 1485.
48. J. Kaiser, Genetics—US hospital launches large biobank of children's DNA. Science 2006; 312: 1584–1585.
49. J. Couzin, Veterans Affairs—Gene bank proposal draws support—and a competitor. Science 2005; 309: 684–685.
CROSS-REFERENCES Genetic Association Analysis Pharmacogenomics Microarray Repository Two-Stage Genetic Association Studies
ESCALATION AND UP-AND-DOWN DESIGNS
ANASTASIA IVANOVA
University of North Carolina at Chapel Hill
Chapel Hill, North Carolina

Up-and-down designs are widely used in both preclinical (animal) and clinical dose-finding trials. These designs operate on a prespecified set of doses of the investigational product. The dose for the next subject (or a cohort of subjects) is repeated, increased, or decreased according to the outcome of the subject (or cohort of subjects) having received the immediate prior dose. The decision rules used in up-and-down designs are very simple and intuitive. The goal is to find the maximally tolerated dose (MTD). The MTD is sometimes defined as the dose just below the lowest dose level with unacceptable toxicity rate; or the MTD can be defined as the dose with the probability of toxicity closest to a prespecified rate Γ. The underlying model assumption is that the probability of toxicity is a nondecreasing function of dose. Up-and-down designs do not require any other assumptions on the dose–toxicity relationship.

1 HISTORY OF UP-AND-DOWN DESIGNS FOR DOSE FINDING

Von Békésy (1) and Dixon and Mood (2) described an up-and-down design where the dose level is increased after a nontoxic response and decreased if toxicity is observed. This approach clusters the treatment distribution around the dose for which the probability of toxicity is equal to Γ = 0.5. To target any quantile Γ, Derman (3) modified the decision rule of the design using a biased coin with the probability of heads computed as a function of the target quantile Γ. Durham and Flournoy (4, 5) considered two biased coin designs in the spirit of Derman. Giovagnoli and Pintacuda (6) later obtained interesting theoretical results on biased coin designs. Wetherill (7) and Tsutakawa (8, 9) proposed assigning subjects in groups rather than one at a time to target a wide range of quantiles Γ. Storer (10) and Korn et al. (11) used decision rules of group designs to suggest several designs for dose finding. Among the designs studied in Korn et al. (11) and Shih and Lin (12) were versions of the traditional or 3 + 3 design widely used in dose-finding trials in oncology. Lin and Shih (13) generalized the 3 + 3 by introducing A + B designs.

2 BIASED COIN DESIGNS

In biased coin designs, subjects are assigned to a dose level of a drug one at a time. The biased coin design from Durham and Flournoy (4), developed for the case of Γ ≤ 0.5, uses a biased coin with the probability of heads equal to b = Γ/(1 − Γ), 0 ≤ b ≤ 0.5. If the outcome of the most recent subject is toxicity, the dose is decreased. If the outcome of the most recent subject is no toxicity, the dose is increased if the biased coin's toss results in heads, and repeated if the toss results in tails. This process is continued until a prespecified number of subjects is assigned. To determine the next assignment, biased coin designs use the outcome of a single (most recent) subject only and hence are not efficient when data from more than one subject at a dose are available.

3 GROUP UP-AND-DOWN DESIGNS
Group up-and-down designs (7–9) are the building blocks for many dose-finding designs (14). A group up-and-down design induces a Markov chain on the set of doses; hence, some statistical properties of a group design can be obtained using Markov chain theory. Gezmu and Flournoy (15) studied small-sample properties of group designs, and Ivanova et al. (16) studied large-sample properties. Let d1 < . . . < dK be the dose levels selected for the study, and p1 < . . . < pK be the corresponding probabilities of toxicity at these doses. In a group up-and-down design, subjects are assigned to treatment in cohorts of size s starting with the lowest dose. Let X be the number of toxicities in the most recent
cohort assigned to dose dj, X ∼ Bin(s, pj), where Bin(s, p) denotes the binomial random variable with parameters s and p. Let cL and cU be two integers such that 0 ≤ cL < cU ≤ s. Then,

1. if X ≤ cL, the next cohort of s subjects is assigned to dose dj+1.
2. if cL < X < cU, the dose is repeated for the next cohort of s subjects.
3. if X ≥ cU, the next cohort of s subjects is assigned to dose dj−1.

Appropriate adjustments are made at the lowest and highest dose so that the assignments stay within d1, . . . , dK. The process is continued until a prespecified number of patients are assigned. Assignments in a group up-and-down design for large sample sizes are clustered around the dose with toxicity rate Γs, where Γs is the solution of

Pr{Bin(s, Γs) ≤ cL} = Pr{Bin(s, Γs) ≥ cU}.

That is, if there is a dose dk such that Γs = pk, the assignments are clustered around dk. If pk−1 < Γs < pk, the assignments are clustered around dose k − 1 or k (16). To find Γs for given parameters s, cL, and cU, one needs to write the equation above using formulas for binomial probabilities. For example, for the group design with group size s, cL = 0, and cU = 1, the equation has the form (1 − Γs)^s = 1 − (1 − Γs)^s, with the solution Γs = 1 − (0.5)^(1/s). For most of the group up-and-down designs, closed-form solutions of this equation do not exist, but the equation can be easily solved numerically. For most practical applications of group designs, the approximation Γs ≈ (cL/s + cU/s)/2 can be used.
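For given s, cL, and cU, the value of Γs is easily obtained numerically. The short sketch below (Python; the function names and the bisection tolerance are ours, for illustration only) solves the defining equation using binomial probabilities:

```python
from math import comb

def binom_cdf(x, n, p):
    """Pr{Bin(n, p) <= x} computed from the binomial probability mass function."""
    return sum(comb(n, k) * p**k * (1.0 - p)**(n - k) for k in range(0, x + 1))

def gamma_s(s, c_lower, c_upper, tol=1e-10):
    """Solve Pr{Bin(s, g) <= cL} = Pr{Bin(s, g) >= cU} for g by bisection."""
    def diff(g):
        return binom_cdf(c_lower, s, g) - (1.0 - binom_cdf(c_upper - 1, s, g))
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # diff decreases in g: positive below the root, negative above it
        lo, hi = (mid, hi) if diff(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Group size s = 3 with cL = 0 and cU = 1 reproduces the closed form 1 - 0.5**(1/3)
print(round(gamma_s(3, 0, 1), 4))   # 0.2063
```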
4 ESCALATION DESIGNS
Escalation designs are widely used in first-time-in-human phase I studies in a variety of therapeutic areas. Such trials usually investigate three to eight dose levels, and patients are typically assigned to cohorts of six to eight patients, some of whom receive placebo (instead of the investigational agent). All patients receiving investigational product in a cohort receive the same
dose. Doses are increased by one level for each subsequent cohort. The trial is stopped when an unacceptable number of adverse events or an unacceptable type of adverse event is observed, when the highest dose level is reached, or for other reasons. The "target dose," the dose recommended for future trials, is usually determined on the basis of the rates or types of adverse events (or both) at the dose levels studied, often in addition to pharmacokinetic parameters and considerations. If only the rate of adverse events is taken into account, with the goal of finding the target dose, that is, the dose with an adverse event rate of Γ, the escalation design can be more formally defined as follows. Patients are assigned to treatment in cohorts of size m starting with the lowest dose. Let the design parameter CU be an integer such that 0 ≤ CU < m. Assume that the most recent cohort of patients was assigned to dose level dj, j = 1, . . . , K − 1. Let X be the number of adverse events in a cohort assigned to dj. Then if X ≤ CU, the next cohort of m patients is assigned to dose dj+1; otherwise, the trial is stopped. The dose one level below the dose where > CU adverse events were observed is the estimated target dose. If the escalation was not stopped at any of the doses, the highest dose dK is recommended. The obvious choice of CU is such that CU/m ≤ Γ < (CU + 1)/m. The frequency of stopping escalation at a certain dose level depends on the adverse event rate at this dose as well as the rates at all lower dose levels. Ivanova (14) studied how cohort size and the choice of dose levels affect the precision of the escalation design. Clearly, designs with a large cohort size have better statistical precision; hence values m < 6 are not recommended. From a safety standpoint, especially in early phase studies, it might be risky for all m patients in a cohort to receive the investigational product at the same time. In such trials, the cohort of m can be split, for example, into two subcohorts. Patients are assigned to the first subcohort and then to the second, if treatment in the first subcohort was well tolerated. A similar strategy is used in the A + B designs described in the next section.
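The rule just described is simple enough to express in a few lines of code. The following sketch (illustrative only; the function names are invented) computes CU from the target rate Γ and the cohort size m and returns the cohort-level decision:

```python
import math

def choose_CU(target_rate, m):
    """The integer CU satisfying CU/m <= target_rate < (CU + 1)/m."""
    return math.floor(target_rate * m)

def escalation_decision(x, CU):
    """One decision of the escalation design for a cohort with x adverse events:
    escalate one level if x <= CU, otherwise stop the trial (the estimated target
    dose is then one level below the current level)."""
    return "escalate" if x <= CU else "stop"

# Example: a target adverse-event rate of 0.2 with cohorts of m = 6 gives CU = 1,
# so a cohort with 0 or 1 events escalates and a cohort with 2 or more stops the trial.
print(choose_CU(0.2, 6))              # 1
print(escalation_decision(1, CU=1))   # escalate
print(escalation_decision(2, CU=1))   # stop
```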
5 A + B DESIGNS

The A + B designs (13) include a stopping rule as in escalation designs but save resources at lower doses. In general terms, the A + B design without de-escalation is defined as follows. Let A and B be positive integers. Let cL, cU, and CU be integers such that 0 ≤ cL < cU ≤ A, cU − cL ≥ 2, and cL ≤ CU < A + B. Patients are assigned to doses in cohorts of size A starting with the lowest dose. Assume that the most recent cohort was a cohort of A patients assigned to receive dose dj, j = 1, . . . , K − 1. Let XA be the number of toxicities in a cohort of size A assigned to dose dj, and let XA+B be the number of toxicities in the combined cohort of size A + B. Then,

1. if XA ≤ cL, the next cohort of A patients is assigned to dose dj+1.
2. if cL < XA < cU, the cohort of B patients is assigned to dose dj.
3. if XA ≥ cU, the trial is stopped.

Then, if in the combined cohort assigned to dj, XA+B ≤ CU, the next cohort of size A receives dose dj+1; otherwise, the trial is stopped.

The dose one level below the dose where an unacceptable number of toxicities was observed (≥ cU toxicities in a cohort of size A or > CU toxicities in a cohort of size A + B) is the estimated MTD. If the escalation was not stopped at any of the doses, the highest dose dK is recommended. The frequency of stopping escalation at a certain dose level depends on the toxicity rate at this dose as well as the rates at all lower dose levels. Hence, it is impossible to identify a quantile targeted by a certain A + B design. However, some practical guidelines on how to choose the design parameters can be formulated (14). If Γ is the target quantile, the parameters A, B, cL, cU, and CU in the A + B design can be selected according to the following constraints:

1. 0 ≤ cL < cU ≤ A, cU − cL ≥ 2, and cL ≤ CU < A + B.
2. (cL/A + cU/A)/2 ≈ Γ or slightly exceeds Γ.
3. CU/(A + B) < Γ < (CU + 0.5)/(A + B).

The choice of cohort sizes A and B so that A ≤ B yields more effective designs on average. Several A + B designs that satisfy the rules above are presented in Table 1. The values of Γ in Table 1 were computed from constraint [3] as the midinterval rounded to the nearest decimal. The designs described here are A + B designs without dose de-escalation. The description of A + B designs with dose de-escalation and of several modifications of the A + B designs can be found in Shih and Lin (12, 13).

Table 1. Examples of A + B designs.

Γ      Design parameters
0.1    A = B = 5, cL = 0, cU = 2, CU = 1
0.2    A = B = 3, cL = 0, cU = 2, CU = 1
0.3    A = B = 4, cL = 0, cU = 3, CU = 2
0.4    A = B = 4, cL = 1, cU = 3, CU = 3
0.5    A = B = 3, cL = 0, cU = 3, CU = 3
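The decision rules of an A + B design without de-escalation can likewise be sketched directly from the description above. The code below is an illustration only (the function name and return labels are ours); the traditional 3 + 3 design of the next section corresponds to A = B = 3, cL = 0, cU = 2, CU = 1:

```python
def a_plus_b_decision(x_A, x_AB, cL, cU, CU):
    """Decision rule of an A + B design without de-escalation.

    x_A  : number of toxicities among the first A patients at the current dose
    x_AB : number of toxicities among all A + B patients at that dose, or None
           if the additional B patients have not been treated
    Returns 'escalate', 'add B patients', or 'stop'.
    """
    if x_AB is None:
        if x_A <= cL:
            return "escalate"
        if x_A < cU:           # cL < x_A < cU
            return "add B patients"
        return "stop"          # x_A >= cU
    return "escalate" if x_AB <= CU else "stop"

# The traditional 3 + 3 design is the special case A = B = 3, cL = 0, cU = 2, CU = 1.
print(a_plus_b_decision(1, None, cL=0, cU=2, CU=1))  # add B patients
print(a_plus_b_decision(1, 1, cL=0, cU=2, CU=1))     # escalate (1 of 6 with toxicity)
print(a_plus_b_decision(1, 2, cL=0, cU=2, CU=1))     # stop (2 of 6 with toxicity)
```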
6 Traditional or 3 + 3 Design
The traditional or 3 + 3 design frequently used in dose finding in oncology (see the article on phase I oncology trials for a description of the 3 + 3 design) is a special case of an A + B design without de-escalation (13), as was described here, with A = B = 3, cL = 0, cU = 2, and CU = 1.

REFERENCES

1. G. von Békésy, A new audiometer. Acta Otolaryngologica. 1947; 35: 411–422.
2. W. J. Dixon and A. M. Mood, A method for obtaining and analyzing sensitivity data. J Am Stat Assoc. 1954; 43: 109–126.
3. C. Derman, Nonparametric up and down experimentation. Ann Math Stat. 1957; 28: 795–798.
4. S. D. Durham and N. Flournoy, Random walks for quantile estimation. In: S. S. Gupta and J. O. Berger (eds.), Statistical Decision Theory and Related Topics V. New York: Springer-Verlag, 1994, pp. 467–476.
5. S. D. Durham and N. Flournoy, Up-and-down designs I. Stationary treatment distributions. In: N. Flournoy and W. F. Rosenberger (eds.), Adaptive Designs. Hayward, CA: Institute of Mathematical Statistics, 1995, pp. 139–157. 6. A. Giovagnoli and N. Pintacuda, Properties of frequency distributions induced by general ‘‘up-and-down’’ methods for estimating quantiles. J Stat Plan Inference. 1998; 74: 51–63. 7. G. B. Wetherill, Sequential estimation of quantal response curves. J R Stat Soc Ser B Methodol. 1963; 25: 1–48. 8. R. K. Tsutakawa, Random walk design in bioassay. J Am Stat Assoc. 1967; 62: 842–856. 9. R. K. Tsutakawa, Asymptotic properties of the block up-and-down method in bio-assay. Ann Math Stat. 1967; 38: 1822–1828. 10. B. E. Storer, Design and analysis of phase I clinical trials. Biometrics. 1989; 45: 925–937. 11. L. Korn, D. Midthune, T. T. Chen, L. V. Rubinstein, M. C. Christian, and R. M. Simon, A comparison of two phase I trial designs. Stat Med. 1994; 13: 1799–1806. 12. W. J. Shih and Y. Lin, Traditional and modified algorithm-based designs for phase I cancer clinical trials. In: S. Chevret (ed.), Statistical Methods for Dose Finding. New York: Wiley, 2006, pp. 61–90.
13. Y. Lin and W. J. Shih, Statistical properties of the traditional algorithm-based designs for phase I cancer clinical trials. Biostatistics. 2001; 2: 203–215. 14. A. Ivanova, Escalation, group and A + B designs for dose-finding trials. Stat Med. 2006; 25: 3668–3678. 15. M. Gezmu and N. Flournoy, Group up-anddown designs for dose-finding. J Stat Plan Inference. 2006; 136: 1749–1764. 16. A. Ivanova, N. Flournoy, and Y. Chung, Cumulative cohort design for dose-finding. J Stat Plan Inference. 2007; 137: 2316–2327.
FURTHER READING S. Chevret, ed., Statistical Methods for Dose Finding. New York: Wiley, 2006. J. Crowley, ed., Handbook of Statistics in Clinical Oncology. New York/Basel: Marcel Dekker, 2006. N. Ting, ed., Dose Finding in Drug Development. New York: Springer-Verlag, 2006.
CROSS-REFERENCES Phase I trials in oncology
DOSE ESCALATION GUIDED BY GRADED TOXICITIES
JOHN O'QUIGLEY
NOLAN WAGES
University of Virginia
Charlottesville, Virginia

1 BACKGROUND

In phase 1 and phase 2 dose-finding studies, the endpoints of interest are typically the presence or absence of toxicity and/or the presence or absence of some indication of therapeutic effect. In numerical terms, these outcomes are represented as simple binary variables. Most protocols, however, will stipulate that intermediary degrees, or grades, of toxicity be recorded. Our purpose here is to consider how such intermediary information may be used to obtain a more accurate estimation of the maximum tolerated dose, both at the end of the study and for those patients being treated during the course of the study. For now, we limit our attention to phase 1 studies alone, in which toxicity is the focus of our interest. Working with two-stage continual reassessment method (CRM) designs, which are briefly described in the following section, we can observe that considerable use can be made of graded information both during the initial escalation period and during the second stage of a two-stage design. Here, we appeal to simple working models.

2 TWO-STAGE CRM DESIGNS

The purpose of the design is to identify a level, from among the k dose levels available d1, . . . , dk, such that the probability of toxicity at that level is as close as possible to some value θ. The value θ is chosen by the investigator such that he or she considers probabilities of toxicity higher than θ to be unacceptably high, whereas those lower than θ are unacceptably low in that they indicate, indirectly, the likelihood of too weak an antitumor effect. Figure 1 illustrates typical behavior of a CRM design with a fixed sample size, in which level 7 is the correct level. Patients enter the study sequentially. The working dose-toxicity curve, which is taken from the CRM class (described below), is refitted after each inclusion. The curve is then inverted to identify which available level has an associated estimated probability as close as we can get to the targeted acceptable toxicity level. The next patient is then treated at this level. The cycle is continued until a fixed number of subjects has been treated or until we apply some stopping rule (1,2).

Figure 1. A typical trial history for a two-stage CRM design using accelerated early escalation based on grade. MTD corresponds to a toxicity rate of 25% and is found at level 7. (Trial history plot: dose level, 1 to 9, by subject number, 1 to 28.)

The di, which is often multidimensional, describes the actual doses or combinations of doses being used. We assume monotonicity, and we take monotonicity to mean that the dose levels are equally well identified by their integer subscripts i (i = 1, . . . , k), which are ordered so that the probability of toxicity at level i′ is greater than that at level i whenever i′ is greater than i. The monotonicity requirement, or the assumption that we can so order our available dose levels, is important. The dose for the jth entered patient, Xj, can be viewed as random, taking values xj, most often discrete, in which case xj ∈ {d1, . . . , dk}, but possibly continuous, where Xj = x, x ∈ R+. In light of the remarks of the previous two paragraphs we can, if desired, entirely suppress the notion of dose and retain only information that pertains to dose level. This information is all we need, and we may prefer to write xj ∈ {1, . . . , k}. Let Yj be a binary random variable (0, 1) where 1 denotes severe toxic response for the jth entered patient (j = 1, . . . , n). We model R(xj), which is the true probability of toxic response at Xj = xj; xj ∈ {d1, . . . , dk} or xj ∈ {1, . . . , k}, via

R(xj) = Pr(Yj = 1 | Xj = xj) = E(Yj | xj) = ψ(xj, a)

for some one-parameter working model ψ(xj, a). For given fixed x, we require that ψ(x, a) be strictly monotonic in a. For fixed a, we require that ψ(x, a) be monotonic increasing in x or, in the usual case of discrete dose levels di, i = 1, . . . , k, that ψ(di, a) > ψ(dm, a) whenever i > m. The true probability of toxicity at x (i.e., whatever treatment
combination has been coded by x) is given by R(x), and we require that, for the specific doses under study (d1, . . . , dk), values of a, say a1, . . . , ak, exist such that ψ(di, ai) = R(di), (i = 1, . . . , k). In other words, our one-parameter working model has to be rich enough to model the true probability of toxicity at any given level. We call it a working model because we do not anticipate a single value of a to work precisely at every level, that is, we do not anticipate a1 = a2 = · · · = ak = a. Many choices are possible. Excellent results have been obtained with the simple choice:

ψ(di, a) = αi^a, (i = 1, . . . , k)    (1)

where 0 < α1 < · · · < αk < 1 and 0 < a < ∞. It can sometimes be advantageous to make use of the reparameterized model ψ(di, a) = αi^exp(a) so that no constraints are placed on the parameter a. Of course, likelihood estimates of ψ are unchanged. Once a model has been chosen and we have data, in the form of the set Ωj = {y1, x1, . . . , yj, xj}, on the outcomes of the first j experiments, we obtain estimates R̂(di), (i = 1, . . . , k), of the true unknown probabilities R(di), (i = 1, . . . , k), at the k dose levels. The target dose level is that level having associated with it a probability of toxicity as close as we can get to θ. The dose or dose level xj assigned to the jth included patient is
such that

|R̂(xj) − θ| < |R̂(di) − θ|, (i = 1, . . . , k; di ≠ xj).

Thus xj is the closest level to the target level in the above precise sense. Other choices of closeness could be made by incorporating cost or other considerations. We could also weight the distance, for example multiplying |R̂(xj) − θ| by some constant greater than 1 when R̂(xj) > θ. This method would favor conservatism; such a design tends to experiment more often below the target than a design without weights. Similar ideas have been pursued by Babb et al. (3). After the inclusion of the first j patients, the log-likelihood can be written as

Lj(a) = Σ_{l=1}^{j} yl log ψ(xl, a) + Σ_{l=1}^{j} (1 − yl) log(1 − ψ(xl, a))    (2)

and is maximized at a = âj. Maximization of Lj(a) can easily be achieved with a Newton-Raphson algorithm or by visual inspection using some software package such as Microsoft Excel (Microsoft Corporation, Redmond, WA). Once we have calculated âj, we can next obtain an estimate of the probability of toxicity at each dose level di via

R̂(di) = ψ(di, âj), (i = 1, . . . , k).
We would not anticipate these estimates to be consistent at all dose levels, which would usually require a richer model than the one we work with. However, under broad conditions, we will (4) obtain consistency at the recommended maximum tolerated dose (MTD). Based on this formula, the dose to be given to the (j + 1)th patient, xj+1, is determined. The experiment is considered as not being fully underway until we have some heterogeneity in the responses. This heterogeneity can come about in a variety of different ways, which include use of the standard Up and Down approach, use of an initial Bayesian CRM as outlined below, or use of a design believed to be more appropriate by the investigator. Once we have achieved heterogeneity, the model kicks in and we continue as prescribed above, iterating between estimation and dose allocation. The design is then split into two stages: an initial exploratory escalation followed by a more refined homing in on the target. Storer (5) was the first to propose two-stage designs in the context of the classic Up and Down schemes. His idea was to enable more rapid escalation in the early part of the trial where we may be far from a level at which treatment activity could be anticipated. Moller (6) was the first to use the idea in the context of CRM designs. Her idea was to allow the first stage to be based on some variant of the usual Up and Down procedures. In the context of sequential likelihood estimation, the necessity of an initial stage was pointed out by O'Quigley and Shen (7), because the likelihood equation fails to have a solution on the interior of the parameter space unless some heterogeneity in the responses has been observed. Their suggestion was to work with any initial scheme, such as Bayesian CRM or Up and Down. For any reasonable scheme, the operating characteristics seem relatively insensitive to this choice. However, something very natural and desirable is observed in two-stage designs, and currently they could be taken as the designs of choice. The reason is the following: early behavior of the method, in the absence of heterogeneity (i.e., lack of toxic response), seems to be rather arbitrary. A decision to escalate after inclusion of three patients who tolerated some level, or after a
single patient tolerated a level, or according to some Bayesian prior, however constructed, is translating directly (although less directly for the Bayesian prescription) the simple desire to try a higher dose because we've encountered no toxicity thus far. We can make use of information on toxicity grade in either one of these two stages. In the first stage, no model is being used, and we use graded toxicities simply to escalate more rapidly when it seems we are far below any level likely to result in dose-limiting toxicities. Once we begin to observe some intermediary toxicity, then we slow the escalation down. The ideas are straightforward and appeal mostly to common sense arguments. Nonetheless, it can be observed that use of graded toxicity information in the first stage alone can make an important contribution to increased efficiency. Use of graded toxicity information in the second stage requires an additional model beyond that already used for the rate of dose-limiting toxicities. We consider these two different situations in the following two sections.

3 USING GRADED INFORMATION IN THE FIRST STAGE

Consider the following example of a two-stage design that has been used in practice. Many dose levels were used, and the first included patient was treated at a low level. As long as we observe very low-grade toxicities, we escalate quickly, including only a single patient at each level. As soon as we encounter more serious toxicities, escalation is slowed down. Ultimately, we encounter dose-limiting toxicities, at which time the second stage, based on fitting a CRM model, comes fully into play. This is done by integrating this information and that obtained on all the earlier non-dose-limiting toxicities to estimate the most appropriate dose level. We can use information on low-grade toxicities in the first stage of a two-stage design to allow rapid initial escalation, because it may be the case that we are far below the target level. Specifically, we define a grade severity variable S(i) to be the average toxicity severity observed at dose level i (i.e., the sum of the
severities at that level divided by the number of patients treated at that level). The rule is to escalate provided S(i) is less than 2. Furthermore, once we have included three patients at some level, escalation to higher levels only occurs if each cohort of three patients does not experience dose-limiting toxicity. This scheme means that, in practice, as long as we observe only toxicities of severities coded 0 or 1, we escalate. Only a single patient is necessary (for whom little or no evidence of any side effects is observed) to decide to escalate. The first severity coded 2 necessitates another inclusion at this same level, and anything other than a 0 severity for this inclusion would require yet another inclusion and a non-dose-limiting toxicity before being able to escalate. This design also has the advantage that, should we be slowed down by a severe (severity 3), albeit non-dose-limiting, toxicity, we retain the capability of picking up speed (in escalation) should subsequent toxicities be of low degree (0 or 1). This method can be helpful in avoiding being handicapped by an outlier or an unanticipated and possibly not drug-related toxicity. Many variants on this particular escalation scheme and use of graded severity are available. It is for the investigator to decide which scheme is suitable for the given circumstance and which scheme seems to provide the best balance between rapid escalation and caution in not moving so quickly as to overshoot the region where we will begin to encounter dose-limiting toxicities. Once a dose-limiting toxicity has been encountered, this phase of the study (the initial escalation scheme) ends, and we proceed to the second stage, based on a CRM model-based recommendation. Although the initial phase is closed, the information obtained on both dose-limiting and non-dose-limiting toxicities is used in the second stage.
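As a small illustration (not a prescribed implementation), the first-stage rule based on the average severity S(i), with the first stage ending at the first dose-limiting toxicity, can be written as follows; the 0 to 4 severity coding follows the grades described in the text:

```python
def first_stage_decision(grades_at_level, dose_limiting_grade=4):
    """One decision of the grade-guided first stage: the first stage ends at the
    first dose-limiting toxicity; otherwise escalate while the average observed
    severity S(i) at the current level is below 2, and stay there otherwise."""
    if any(g >= dose_limiting_grade for g in grades_at_level):
        return "end first stage, start CRM stage"
    s_i = sum(grades_at_level) / len(grades_at_level)
    return "escalate" if s_i < 2 else "stay at current level"

print(first_stage_decision([0]))       # escalate after a single patient with no toxicity
print(first_stage_decision([2]))       # stay: a first grade-2 severity requires another inclusion
print(first_stage_decision([2, 0]))    # escalate: average severity back below 2
print(first_stage_decision([1, 4]))    # end first stage, start CRM stage
```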
4 USE OF GRADED TOXICITIES IN THE SECOND STAGE

Table 1. Toxicity "Grades" (Severities) for Trial

0   No toxicity
1   Mild toxicity (non-dose-limiting)
2   Non-mild toxicity (non-dose-limiting)
3   Severe toxicity (non-dose-limiting)
4   Dose-limiting toxicity

Table 2. Compared Frequency of Final Recommendations of a Standard CRM and a Design Using Known Information on Graded Toxicities

Dose level      1      2      3      4      5      6
Rk              0.05   0.11   0.22   0.35   0.45   0.60
Standard        0.04   0.22   0.54   0.14   0.06   0.00
Using grades    0.00   0.09   0.60   0.29   0.02   0.00

Although we refer to dose-limiting toxicities as a binary (0,1) variable, most studies record information on the degree of toxicity, from 0, complete absence of side effects, to 4, life-threatening toxicity (Table 1). The natural reaction for a statistician is to consider that the response variable, which is toxicity, has been simplified when going from five levels to two and that it may help to employ models accommodating multilevel responses. In fact, we do not believe that progress is to be made using these methods. The issue is not that of modeling a response (toxicity) at 5 levels but of controlling for dose-limiting toxicity, mostly grade 4 but possibly also certain kinds of grade 3. Lower grades are helpful in that their occurrence indicates that we are approaching a zone in which the probability of encountering a dose-limiting toxicity is becoming large enough to be of concern. This idea is used implicitly in the two-stage designs described in the section entitled "Using graded information in the first stage." If we wish to proceed more formally and extract yet more information from the observations, then we need models to relate the occurrence of dose-limiting toxicities to the occurrence of lower-grade toxicities. By modeling the ratio of the probabilities of the different types of toxicity, we can make striking gains in efficiency because the more frequently observed lower grade toxicities carry a great deal of information on the potential occurrence of dose-limiting toxicities. Such a situation would also allow gains in safety because, at least hypothetically, it would be possible to predict at some level the rate of occurrence of dose-limiting toxicities without necessarily having observed very many, the prediction leaning largely on the model. At the opposite end of the model/hypothesis spectrum, we might decide we know nothing about the relative rates of occurrence of the different toxicity types and simply allow the accumulating observations to provide the necessary estimates. In this case, it turns out that we neither lose nor
gain efficiency, and the method behaves identically to one in which the only information we obtain is whether the toxicity is dose limiting. These two situations suggest that a middle road might exist, using a Bayesian prescription, in which very careful modeling can lead to efficiency improvements, if only moderate ones, without making strong assumptions. To make this model more precise, let us consider the case of three toxicity levels, the highest being dose limiting. Let Yj denote the toxic response for subject j, (j = 1, . . . , n). The variable Yj can assume three levels: 1, 2, and 3. The goal of the trial is to identify a level of dose whose probability of severe toxicity is closest to a given percentile of the dose-toxicity curve. Supposing, for patient j, that xj = di, a working model for the CRM could be:
Pr(Yj = 3) = ψ1(xj, a) = αi^exp(a),
Pr(Yj = 2 or Yj = 3) = ψ2(xj, a, b) = αi^exp(a+b),

from which Pr(Yj = 1) = 1 − ψ2(xj, a, b) and Pr(Yj = 2) = αi^exp(a+b) − αi^exp(a). The contributions to the likelihood are 1 − ψ2(xj, a, b) when Yj = 1, ψ1(xj, a) when Yj = 3, and ψ2(xj, a, b) − ψ1(xj, a) when Yj = 2. With no prior information, and being able to maximize the likelihood, we obtain results almost indistinguishable from those obtained with the more usual one-parameter CRM, which is caused by near-parameter orthogonality. Therefore, no efficiency gain occurs, although there is the advantage of learning about the relationship between the different toxicity types. However, based on previous studies, we often have a very precise idea concerning the relative rates between certain toxicity grades. We can imagine that this relationship can be estimated with good precision. Suppose that the parameter b is known precisely.
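For illustration, the likelihood under this three-level working model can be evaluated and maximized over a with b held fixed at an assumed value. In the sketch below the skeleton, the outcomes, and the value of b are invented; b is taken to be nonpositive so that Pr(Yj = 2 or Yj = 3) is at least Pr(Yj = 3) under this parameterization:

```python
import numpy as np

def loglik_three_level(a, y, x, alphas, b):
    """Log-likelihood for the three-level working model of the text with b fixed:
    Pr(Y = 3) = alpha_i ** exp(a), Pr(Y >= 2) = alpha_i ** exp(a + b),
    so Pr(Y = 1) = 1 - alpha_i ** exp(a + b) and
       Pr(Y = 2) = alpha_i ** exp(a + b) - alpha_i ** exp(a).

    y : graded outcomes coded 1 (low grade), 2 (intermediate), 3 (dose limiting)
    x : 0-based dose-level indices;  alphas : skeleton values in (0, 1);  b <= 0
    """
    alphas = np.asarray(alphas, dtype=float)
    ai = alphas[np.asarray(x, dtype=int)]
    p3 = ai ** np.exp(a)
    p23 = ai ** np.exp(a + b)
    probs = np.where(np.asarray(y) == 3, p3,
             np.where(np.asarray(y) == 2, p23 - p3, 1.0 - p23))
    return float(np.sum(np.log(probs)))

# Maximize over a on a grid, with b fixed from prior knowledge of the grade ratio.
skeleton = [0.05, 0.10, 0.20, 0.30, 0.45]
grid = np.linspace(-3, 3, 601)
lls = [loglik_three_level(a, y=[1, 2, 1, 3], x=[1, 2, 2, 3], alphas=skeleton, b=-0.7)
       for a in grid]
print(round(float(grid[int(np.argmax(lls))]), 2))
```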
The model need not be correctly specified, although b should maintain interpretation outside the model, for instance some simple function of the ratio of grade 3 to grade 2 toxicities. Efficiency gains can then be substantial. Table 2 provides a simple illustration of the order of magnitude of the gains we might anticipate when we are targeting a value of θ around 0.25. The rate of lower grade toxicities is known to be twice this rate. A Bayesian framework would allow us to make weaker assumptions on the parameter b so that any errors in assumptions can then be overwritten by the data. More work is needed on this subject, but the early results are very promising.

REFERENCES

1. J. O'Quigley and E. Reiner, A stopping rule for the continual reassessment method. Biometrika 1998; 85: 741–748.
2. J. O'Quigley, Estimating the probability of toxicity at the recommended dose following a Phase I clinical trial in cancer. Biometrics 1992; 48: 853–862.
3. J. Babb, A. Rogatko, and S. Zacks, Cancer phase I clinical trials: efficient dose escalation with overdose control. Stats. Med. 1998; 17: 1103–1120.
4. L. Z. Shen and J. O'Quigley, Consistency of continual reassessment method in dose finding studies. Biometrika 1996; 83: 395–406.
5. B. E. Storer, Phase I clinical trials. In: Encyclopedia of Biostatistics. New York: Wiley, 1998.
6. S. Moller, An extension of the continual reassessment method using a preliminary up and down design in a dose finding study in cancer patients in order to investigate a greater number of dose levels. Stats. Med. 1995; 14: 911–922.
7. J. O'Quigley and L. Z. Shen, Continual reassessment method: a likelihood approach. Biometrics 1996; 52: 163–174.
FURTHER READING C. Ahn, An evaluation of phase I cancer clinical trial designs. Stats. Med. 1998; 17: 1537–1549. S. Chevret, The continual reassessment method in cancer phase I clinical trials: a simulation study. Stats. Med. 1993; 12: 1093–1108. D. Faries, Practical modifications of the continual reassessment method for phase I cancer clinical trials. J. Biopharm. Stat. 1994; 4: 147–164. C. Gatsonis and J. B. Greenhouse, Bayesian methods for phase I clinical trials. Stats. Med. 1992; 11: 1377–1389. S. Goodman, M. L. Zahurak, and S. Piantadosi, Some practical improvements in the continual reassessment method for phase I studies. Stats. Med. 14: 1149–1161. J. O’Quigley, M. Pepe, and L. Fisher, Continual reassessment method: a practical design for Phase I clinical trials in cancer. Biometrics 1990; 46: 33–48. J. O’Quigley and S. Chevret, Methods for dose finding studies in cancer clinical trials: a review and results of a Monte Carlo study. Stats. Med. 1991; 10: 1647–1664. J. O’Quigley, L. Shen, and A. Gamst, Two sample continual reassessment method. J. Biopharm. Statist. 1999; 9: 17–44. S. Piantadosi and G. Liu, Improved designs for dose escalation studies using pharmacokinetic measurements. Stats. Med. 1996; 15: 1605–1618. J. Whitehead and D. Williamson, Bayesian decision procedures based on logistic regression models for dose-finding studies. J. Biopharm. Statist. 1998; 8: 445–467.
DOSE-FINDING STUDIES
FRANK BRETZ
Novartis Pharma AG
Basel, Switzerland

JOSÉ C. PINHEIRO
Novartis Pharmaceuticals
East Hanover, New Jersey

Understanding and adequately representing the dose–response profile of a compound, with respect to both efficacy and safety, is a fundamental objective of clinical drug development. An indication of its importance is the early publication of the International Conference on Harmonization E4 guideline on dose–response studies (1). The dose–response profile describes how the expected response—for example, a clinical endpoint of interest—varies in relation to the dose levels being administered. Proper understanding of this relationship is crucial for two critical decisions required during drug development: (1) whether there is an overall dose–response effect (proof of concept), and (2) if so, which dose level(s) should be selected for further development (dose finding). Selecting too low a dose decreases the chance of showing efficacy in later studies, whereas selecting too high a dose may result in tolerability or safety problems. Indeed, it may occur that only after having marketed a new drug at a specified dose does it become apparent that the level was set too high. This phenomenon has been documented by the U.S. Food and Drug Administration (FDA), who reported that approximately 10% of drugs approved between 1980 and 1989 have undergone dose changes—mostly decreases—of greater than 33% (2, 3). Over the past several years, an increase in interest and research activities in this area has taken place. Illustrating this trend, three books solely dedicated to dose finding in clinical drug development, from different perspectives, have recently been published (4–6). In addition, both the FDA and the Pharmaceutical Research and Manufacturers of America (PhRMA) have identified poor dose selection resulting from incorrect or incomplete knowledge of the dose–response relationship, for both efficacy and safety, as one of the key drivers of the high attrition rates currently plaguing late phase clinical trials across the pharmaceutical industry (7, 8). PhRMA has constituted a working group to evaluate and propose recommendations to address this specific issue (8). In light of these ongoing discussions and activities, this article reviews some of the key methodologies used in the dose-finding trials typically encountered in the late stage of drug development, that is, in phase II and/or phase III clinical trials. Out of scope for this overview are dose-finding studies in early development, which often take place under different constraints and use different methodologies, such as the traditional 3 + 3 designs, up-and-down designs, or continual reassessment methods (4, 5).

1 MULTIPLE COMPARISON PROCEDURES
The analysis of dose-finding studies can be classified into two major strategies: modeling techniques (9, 10) and multiple comparison procedures (MCP) (11, 12). Modeling techniques assume a functional relationship between the dose (taken as a quantitative factor) and the response variable, according to a prespecified parametric model (defined in the study protocol). In this section we consider MCP, in which the dose is regarded as a qualitative factor and very few, if any, assumptions are made about the underlying dose–response model. MCP can be used either for detecting an overall dose-related signal by means of trend tests or for the estimation of target doses by stepwise testing strategies, while preserving the overall type I error rate at a prespecified level α. Such procedures are relatively robust to the underlying dose–response shape, but they are not designed for extrapolation of information beyond the observed dose levels. Inference is thus confined to the selection of the target dose among the dose levels under investigation. A classic method proposed by Dunnett (13) compares several treatments with a control. Because it relies on pairwise comparisons based on t-tests using the pooled variance estimate, structural information from the
logical ordering of the dose levels is not incorporated. Trend tests exist that borrow strength from neighboring dose levels to increase the likelihood of successfully detecting a dose–response signal at the end of the study. The likelihood ratio test (14) is an example of a powerful test for detecting such a dose–response trend. However, because its critical values are difficult to compute, its application is reduced to balanced one-way layouts and other simple designs (15). Single contrast tests provide a popular alternative (16, 17), but these tests are potentially less powerful than competing methods if the true dose–response shape deviates substantially from the prespecified vector of contrast coefficients. Multiple contrast tests have been proposed instead, which take the maximum test statistic over several single contrast tests, properly adjusting it for multiplicity (18, 19). The advantage of such an approach is its greater robustness with regard to the uncertainty about the unknown true dose–response model, resulting from testing simultaneously different shapes. An additional appealing feature of multiple contrast tests is that they can be easily extended to general linear models incorporating covariates, factorial treatment structures, and random effects. Many standard trend tests are in fact special cases of (or at least closely related to) multiple contrast tests (20). Bretz et al. (21) provide more technical details on multiple contrast tests. As mentioned, the second major goal of dose-finding studies is to estimate target doses of interest, such as the minimum effective dose (MinED), which is the smallest dose showing a statistically significant and clinically relevant effect (1, 3); the maximum safe dose, which is the largest dose that is still safe; or the maximum effective dose, which is the smallest dose showing a maximum effect (1). If MCP are used, the dose levels are typically tested in a fixed order. Different possibilities exist for choosing the sequence of hypotheses to be tested and the appropriate test statistics (e.g., pairwise comparisons with the control group or any of the previously mentioned trend tests). Tamhane et al. (17, 22) and Strassburger et al. (23) provide more details.
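To make the mechanics of a multiple contrast test concrete, the following sketch (a hypothetical illustration, not taken from the references above) computes several candidate contrast statistics on simulated dose-group means and adjusts the maximum statistic for multiplicity by permutation; the dose levels, contrast coefficients, and effect sizes are invented, and the published procedures typically use the multivariate t distribution rather than permutation to obtain the adjusted critical value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 5 dose groups (group 0 = placebo), n = 20 patients per group
doses = np.array([0, 1, 2, 4, 8])
n = 20
group = np.repeat(np.arange(5), n)
y = 0.3 * np.log1p(doses)[group] + rng.normal(0, 1, size=group.size)

# Candidate contrasts, one per anticipated dose-response shape
contrasts = np.array([
    [-2.0, -1.0, 0.0, 1.0, 2.0],      # linear trend
    [-4.0, 0.0, 1.0, 1.5, 1.5],       # early-plateau (Emax-like) shape
    [-1.0, -1.0, -1.0, -1.0, 4.0],    # effect only at the highest dose
])
contrasts -= contrasts.mean(axis=1, keepdims=True)   # ensure each row sums to zero

def max_contrast_stat(y, group, contrasts, n_groups=5):
    means = np.array([y[group == g].mean() for g in range(n_groups)])
    resid = y - means[group]
    s2 = resid @ resid / (y.size - n_groups)          # pooled variance estimate
    se = np.sqrt(s2 * (contrasts ** 2).sum(axis=1) / n)
    return (contrasts @ means / se).max()

t_obs = max_contrast_stat(y, group, contrasts)

# Null distribution of the maximum statistic via permutation of the group labels
t_null = np.array([max_contrast_stat(y, rng.permutation(group), contrasts)
                   for _ in range(5000)])
p_adj = (t_null >= t_obs).mean()
print(f"max contrast statistic = {t_obs:.2f}, multiplicity-adjusted p = {p_adj:.4f}")
```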
2 MODELING TECHNIQUES

Modeling approaches are commonly used to describe the functional dose–response relationship, and many different models have been proposed (24). One classification of such models is according to whether they are linear in their parameters (standard linear regression model, linear in log-dose model, etc.) or not (Emax, logistic, etc.). Pinheiro et al. (9) described several linear and nonlinear regression dose–response models commonly used in practice, including the clinical interpretations for the associated model parameters. Once a dose–response model is fitted (10), it can be used to test whether a dose-related effect is present. For example, one could test whether the slope parameter is different from 0 in a linear regression model. If a dose–response signal has been detected, the fitted dose–response curve could then be used to estimate a target dose achieving a desired response. In contrast to MCP, dose estimation under modeling is not confined to the set of dose levels under investigation. Although such a modeling approach provides flexibility in investigating the effect of doses not used in the actual study, the validity of its conclusions depends highly on the correct choice of the dose–response model, which is typically unknown. A more detailed description of model-based approaches, and of potential solutions to overcome their disadvantages, is given below. A major pitfall when conducting statistical modeling is related to the inherent model uncertainty. The intrinsic problem is the introduction of a new source of variability by selecting a particular model M (say) at any stage before the final analysis. Standard statistical analysis neglects this fact and reports the final outcomes without accounting for this extra variability. Typically, one is interested in computing the variance var(θ̂) of a parameter estimate θ̂. In practice, however, the conditional variance var(θ̂ | M) for a given model M is computed and stated as if it were var(θ̂), ignoring the model uncertainty. In addition, substantial bias in estimating the parameters of interest can be introduced from the model selection process. Whereas it is admittedly a more difficult task
DOSE-FINDING STUDIES Set of candidate models
to compute unbiased estimates conditional on the selected model, ignoring completely the model uncertainty can lead to very undesirable effects (25, 26). A common approach to addressing the model selection problem is to use information criteria based on a reasonable discrepancy measure assessing the lack of fit. Many model selection criteria are available, and the discussion of which gives the best method is still ongoing (27, 28). Examples of such model selection criteria include the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the order restricted information criterion (29). It should be kept in mind, however, that the application of any of the model selection criteria will always lead to the selection of a model, irrespective of its goodness of fit to the observed data. Different approaches have thus been proposed to overcome the problem of conditional inference on a selected model. Such proposals include (1) weighting methods, which incorporate, rather than ignore, model uncertainty by computing estimates for quantities of interest which are defined for all models (such as the MinED) using a weighted average across the models (30–32); (2) computer-intensive simulationbased inferences, such as cross-validation techniques, which select the model with the best predictive ability across the replications (26, 33); and (3) considering model selection as a multiple hypotheses testing problem, where the selection of a specific model is done while controlling the overall type I error rate (34). 3 HYBRID APPROACHES COMBINING MCP AND MODELING Hybrid dose-finding methods combine principles of MCP with modeling techniques; see Tukey et al. (35) for an early reference. Bretz et al. (36) proposed a methodology for dose-finding studies that they called MCP-Mod, which provides the flexibility of modeling for dose estimation while preserving the robustness to model misspecification associated with MCP. The general flow of the MCP-Mod methodology, including its key steps, is depicted in Figure 1. Practical considerations regarding the implementation of
3
Optimum contrast coefficients Selection of significant models while controlling the overall type I error
Selection of a single model using max-T, AIC, ..., possibly combined with external data Dose estimation and selection (minimum effective dose, ...)
Figure 1. Combining multiple comparisons and modeling techniques in dose-finding studies
this methodology were discussed by Pinheiro et al. (37). Extensions to Bayesian methods estimating or selecting the dose–response curve from a sparse dose design have also been investigated (38, 39). The central idea of the MCP-Mod methodology is to use a set of candidate dose–response models to cover the possible shapes anticipated for the dose–response relationship. Multiple comparison procedures are applied to a set of test statistics, determined by optimal contrasts representing the models in the candidate set, to decide which shapes give statistically significant signals. If no candidate model is statistically significant, the procedure stops and declares that no dose–response relationship can be established from the observed data (i.e., no proof of concept). Otherwise, the maximum contrast test and possibly further contrast tests are statistically significant. Out of the statistically significant models in the candidate set, a best model is selected for dose estimation in the last stage of the procedure. The selection of the dose-estimation model can be based on the minimum P-value (of the model contrast tests) or some other relevant model selection criteria like the AIC or the BIC. The selected dose–response model is then employed to estimate target doses using inverse regression techniques and possibly incorporating information on clinically relevant effects. The precision of the estimated doses can be assessed using, for example, bootstrap methods. In contrast to a direct application of modelbased dose estimation, the MCP step accounts
for possible model misspecification and includes the associated statistical uncertainty in a hypothesis-testing context. Note that different model selection criteria may lead to different dose estimates because of different sources of information and/or decision goals. Information criteria, such as the AIC or BIC, are statistical decision rules taking into account only the data from the study under consideration. Bayesian decision rules, on the other hand, may additionally include information external to the study, though still based on statistical reasoning. Finally, nonstatistical selection rules based on updated clinical knowledge, economic reasons, and so forth may also be used in the dose-selection process. Simulation results suggest that hybrid methods are as powerful as standard trend tests while allowing more precise estimation of target doses than MCP due to their modeling component (9, 36). It is worth pointing out that these methods can be seen as seamless designs that combine proof of concept (phase Ib/IIa) with dose finding (phase IIb) in one single study.
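The modeling and dose-estimation steps just outlined can be sketched as follows. This is not the MCP-Mod procedure of (36, 37) itself, only a simplified stand-in: two hypothetical candidate models (Emax and linear) are fitted to simulated data, one is selected by AIC, and the minimum effective dose is then estimated by inverse regression against an assumed clinical relevance threshold delta. All doses, sample sizes, and parameter values are invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(7)

# Hypothetical dose-finding data: doses in mg, 25 patients per arm
doses = np.array([0, 10, 25, 50, 100, 150])
n_per_arm = 25
dose_obs = np.repeat(doses, n_per_arm)
y = 1.2 * dose_obs / (30 + dose_obs) + rng.normal(0, 0.8, dose_obs.size)

# Two candidate dose-response shapes (assumed, for illustration only)
def emax(d, e0, emax_, ed50):
    return e0 + emax_ * d / (ed50 + d)

def linear(d, e0, slope):
    return e0 + slope * d

def fit_and_aic(model, p0):
    popt, _ = curve_fit(model, dose_obs, y, p0=p0, maxfev=10000)
    rss = np.sum((y - model(dose_obs, *popt)) ** 2)
    k = len(popt) + 1                      # +1 for the residual variance
    aic = dose_obs.size * np.log(rss / dose_obs.size) + 2 * k
    return popt, aic

fits = {"emax": fit_and_aic(emax, p0=[0.0, 1.0, 20.0]),
        "linear": fit_and_aic(linear, p0=[0.0, 0.01])}
best = min(fits, key=lambda m: fits[m][1])
print("selected model:", best)

# Inverse regression: smallest dose giving a clinically relevant effect over placebo
delta = 0.6                                 # assumed clinical relevance threshold
grid = np.linspace(0, doses.max(), 1501)
popt = fits[best][0]
pred = (emax if best == "emax" else linear)(grid, *popt)
effect = pred - pred[0]
min_ed = grid[np.argmax(effect >= delta)] if (effect >= delta).any() else np.nan
print(f"estimated minimum effective dose for delta={delta}: {min_ed:.1f} mg")
```

In the full MCP-Mod methodology, the contrast-test step sketched earlier would first establish proof of concept and restrict the candidate set before any model is fitted.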
4 ADAPTIVE DOSE-FINDING DESIGNS
A fast emerging field of research is the area of adaptive dose-finding designs, which can be employed to extend the previously described methods. These designs offer efficient ways to learn about the dose response through repeated looks at the data being accrued during the conduct of a clinical trial. This interim information can be used to guide decision making on which dose to select for further development or whether to discontinue a program. It is both feasible and advantageous to design a proof-of-concept study as an adaptive dose-finding trial. The continuation of a dose-finding trial into a confirmatory stage through a seamless design is a further opportunity to increase information earlier in development on the correct dose, and thus reduce the total duration of the clinical development program. Accordingly, we briefly review (1) adaptive dose ranging studies as investigated by the related PhRMA working group (8), (2) flexible designs that strongly control the overall type I error rate, and (3) Bayesian adaptive designs.
The PhRMA working group on adaptive dose ranging studies has evaluated several existing dose-finding methods and provided recommendations on their use in clinical drug development (8). The methods considered comprise a representative cross-section of currently available dose-finding procedures, ranging from more traditional methods based on analysis of variance up to adaptive designs based on an advanced Bayesian model. Through an extensive simulation study based on a common set of scenarios (sample sizes, number of doses, etc.) for all procedures, the strengths and weaknesses of each method were investigated, in particular with respect to the ability of the procedures to learn from the data and adapt to emerging information. Flexible designs that strongly control the overall type I error rate of incorrectly rejecting any null hypothesis of no dose level effect are available for adaptive dose selection in multistage clinical trials. In the context of adaptive dose-finding trials, multiplicity concerns typically arise due to (1) the comparison of several doses with a control and (2) multiple interim looks at the data for decision making. Performing each hypothesis test at the nominal level α intended for the whole trial would inflate the overall type I error rate. Therefore, the significance levels of the individual tests have to be adjusted appropriately. Classic group sequential designs are a type of flexible design in which the planned sample size or, more generally, the planned amount of information may be updated as a result of the trial. In these trials, test statistics produced at interim analyses are compared with prespecified upper or lower stopping boundaries that ensure the overall type I error rate control (40–42). Stallard and Todd (43) extended classic group sequential procedures to multiarm clinical trials incorporating treatment selection at the interim analyses. Flexible designs, which may be regarded as an extension of classic group sequential designs, offer more flexibility for adaptation within the multistage framework. These methods offer a high level of flexibility for decision making during the trial, such as increasing the sample size based on the observed effect, modifying the target patient population, or selecting good treatments
(44, 45). They require little prespecification of decision rules before the beginning of a trial; therefore, the total information available at each interim time point can be used in designing or adjusting the next stage. Bayesian adaptive dose-finding designs are an important alternative to flexible designs. Information can be updated either continuously as data are accrued in the trial, or in cohorts of patients. This makes this class of designs very appealing for sequential decision making and experimentation, including clinical studies. Bayesian approaches enable the calculation of predictive probabilities of future results for any particular design, which allows comparison of designs on the basis of the probabilities of their consequences. Although control of the type I error rate is not an intrinsic property of a Bayesian design, simulations can be used to tailor a Bayesian adaptive trial such that it maintains this and other desirable frequentist operating characteristics. A potential downside to the Bayesian approach is the computational complexity coupled with the absence of commercial software packages to assist with study design and analysis. Berry et al. (46) and Krams et al. (47) provide more methodological details and an example of a clinical study employing Bayesian dose-finding methods in practice.
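As an illustration of the Bayesian adaptive idea, the stylized simulation below (all numbers invented) assigns successive cohorts to the dose with the highest posterior probability of exceeding a target effect, using independent conjugate normal priors and a known response standard deviation so that posterior updates are closed-form. A real design such as those in (46, 47) would use a dose–response model rather than independent dose means and would be tuned by simulating many such trials to verify its frequentist operating characteristics.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stylized setting (all values hypothetical): true mean response per dose, known SD
doses = np.array([0, 20, 40, 80, 160])
true_mean = np.array([0.0, 0.2, 0.6, 1.0, 1.1])
sigma = 1.0
target = 0.8                           # clinically interesting effect size

# Independent conjugate normal priors N(m, v) for each dose mean
m = np.zeros(len(doses))
v = np.full(len(doses), 4.0)

def posterior_update(m_i, v_i, data):
    prec = 1.0 / v_i + len(data) / sigma ** 2
    mean = (m_i / v_i + data.sum() / sigma ** 2) / prec
    return mean, 1.0 / prec

cohort, n_cohorts = 8, 10
for _ in range(n_cohorts):
    # Posterior probability that each dose mean exceeds the target effect
    draws = rng.normal(m, np.sqrt(v), size=(4000, len(doses)))
    p_target = (draws > target).mean(axis=0)
    next_dose = int(np.argmax(p_target))             # greedy allocation rule
    data = rng.normal(true_mean[next_dose], sigma, cohort)
    m[next_dose], v[next_dose] = posterior_update(m[next_dose], v[next_dose], data)

final_draws = rng.normal(m, np.sqrt(v), size=(4000, len(doses)))
print("posterior means by dose:", np.round(m, 2))
print("P(mean > target):       ", np.round((final_draws > target).mean(axis=0), 2))
```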
5 CONCLUSION
Dose-finding studies play a key role in any drug development program and are often the gatekeeper for the large confirmatory studies in phase III. Many approaches exist for the proper design and analysis of these trials. The ultimate choice of the method to be applied depends on the particular settings and goals. Dose-finding studies should thus be tailored to best fit the needs of the particular drug development program under consideration. Methods are available, for example, to allow the conduct of seamless proof-ofconcept and dose-finding studies. Alternatively, if it is desired to extend dose-finding trials straight into a confirmatory phase III study, adaptive designs offer efficient possibilities to control the overall type I error rate at a prespecified level. We encourage the consideration and implementation of advanced
dose-finding methods, which efficiently make use of accumulating information during the drug development process. REFERENCES 1. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline: E4 Dose-Response Information to Support Drug Registration. Step 4 version, March 1994. Available at: http://www.ich.org/LOB/media/ MEDIA480.pdf. 2. FDC Reports from May 6, 1991. The Pink Sheet. 1991: 53(18): 14–15. 3. S. J. Ruberg, Dose response studies I. Some design considerations. J Biopharm Stat. 1995; 5: 1–14. 4. N. Ting (ed.), Dose Finding in Drug Development. New York: Springer, 2006. 5. S. Chevret (ed.), Statistical Methods for Dose Finding Experiments. New York: Wiley, 2006. 6. R. Krishna (ed.), Dose Optimization in Drug Development. New York: Informa Healthcare, 2006. 7. U.S. Food and Drug Administration, Department of Health and Human Services. Innovation/Stagnation: Challenge and Opportunity on the Critical Path to New Medical Products. March 2004. Available at: http://www.fda.gov/ oc/initiatives/criticalpath/whitepaper.html 8. B. Bornkamp, F. Bretz, A. Dmitrienko, G. Enas, B. Gaydos, et al., Innovative Approaches for Designing and Analyzing Adaptive DoseRanging Trials. White Paper of the PhRMA Working Group on Adaptive Dose-Ranging Studies. J Biopharm Stat. 2007, in press. Available at: http://www.biopharmnet.com/ doc/phrmaadrs white paper.pdf 9. J. Pinheiro, F. Bretz, M. Branson, Analysis of dose response studies: modeling approaches In: N. Ting, (ed.), Dose Finding in Drug Development. New York: Springer, 2006, pp. 146–171. 10. D. M. Bates and D. G. Watts, Nonlinear Regression Analysis and Its Applications. New York: Wiley, 1988. 11. Y. Hochberg, and A. C. Tamhane, Multiple Comparisons Procedures. New York: Wiley, 1987. 12. J. C. Hsu, Multiple Comparisons. New York: Chapman and Hall, 1996. 13. C. W. Dunnett, A multiple comparison procedure for comparing several treatments
with a control. J Am Stat Assoc. 1955; 50: 1096–1121.
14. D. J. Bartholomew, Ordered tests in the analysis of variance. Biometrika. 1961; 48: 325–332.
15. T. Robertson, F. T. Wright, and R. L. Dykstra, Order Restricted Statistical Inference. New York: Wiley, 1988.
16. S. J. Ruberg, Contrast for identifying the minimum effective dose. J Am Stat Assoc. 1989; 84: 816–822.
17. A. C. Tamhane, C. W. Dunnett, and Y. Hochberg, Multiple test procedures for dose finding. Biometrics. 1996; 52: 21–37.
18. L. A. Hothorn, M. Neuhäuser, and H. F. Koch, Analysis of randomized dose-finding studies: closure test modifications based on multiple contrast tests. Biom J. 1997; 39: 467–479.
19. W. H. Stewart and S. J. Ruberg, Detecting dose response with contrasts. Stat Med. 2000; 19: 913–921.
20. F. Bretz, An extension of the Williams trend test to general unbalanced linear models. Comput Stat Data Anal. 2006; 50: 1735–1748.
21. F. Bretz, J. Pinheiro, and A. C. Tamhane, Multiple testing and modeling in dose response problems In: A. Dmitrienko, A. C. Tamhane, and F. Bretz (eds.), Multiple Testing Problems in Pharmaceutical Statistics. New York: Taylor & Francis, 2009 (in press). 22. A. C. Tamhane, C. W. Dunnett, J. W. Green, and J. D. Wetherington, Multiple test procedures for identifying the maximum safe dose. J Am Stat Assoc. 2001; 96: 835–843. 23. K. Strassburger, F. Bretz, and H. Finner, Ordered multiple comparisons with the best and their applications to dose-response studies. Biometrics. 2007; May 8 (e-pub). 24. D. A. Ratkowsky, Handbook of Nonlinear Regression Models. New York: Marcel Dekker, 1989. 25. D. Draper, Assessment and propagation of model uncertainty. J R Stat Soc Ser B Methodol. 1995; 57: 45–97. 26. J. S. Hjorth, Computer Intensive Statistical Methods—Validation, Model Selection and Bootstrap. London: Chapman & Hall, 1994. 27. W. Zucchini, An introduction to model selection. J Math Psychol. 2000; 44: 41–61. 28. J. B. Kadane and N. A. Lazar, Methods and criteria for model selection. J Am Stat Assoc. 2004; 99: 279–290. 29. K. Anraku, An information criterion for parameters under a simple order restriction. Biometrika. 1999; 86: 141–152.
30. S. T. Buckland, K. P. Burnham, and N. H. Augustin, Model selection: an integral part of inference. Biometrics. 1997; 53: 603–618. 31. J. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky, Bayesian model averaging. Stat Sci. 1999; 14: 382–417. 32. K. H. Morales, J. G. Ibrahi, C. J. Chen, and L. M. Ryan, Bayesian model averaging with applications to benchmark dose estimation for arsenic in drinking water. J Am Stat Assoc. 2006; 101: 9–17. 33. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. New York: Springer, 2001. 34. H. Shimodaira, An application of multiple comparison techniques to model selection. Ann Inst Stat Math. 1998; 50: 1–13. 35. J. W. Tukey, J. L. Ciminera, and J. F. Heyse, Testing the statistical certainty of a response to increasing doses of a drug. Biometrics. 1985; 41: 295–301. 36. F. Bretz, J. Pinheiro, and M. Branson, Combining multiple comparisons and modeling techniques in dose response studies. Biometrics. 2005; 61: 738–748. 37. J. Pinheiro, B. Bornkamp, and F. Bretz, Design and analysis of dose finding studies combining multiple comparisons and modeling procedures. J Biopharm Stat. 2006; 16: 639–656. 38. T. Neal, Hypothesis testing and Bayesian estimation using a sigmoid Emax model applied to sparse dose response designs. J Biopharm Stat. 2006; 16: 657–677. 39. A. Wakana, I. Yoshimura, and C. Hamada, A method for therapeutic dose selection in a phase II clinical trial using contrast statistics. Stat Med. 2007; 26: 498–511. 40. S. J. Pocock, Group sequential methods in the design and analysis of clinical trials. Biometrika. 1977; 64: 191–199. 41. P. C. O’Brien and T. R. Fleming, A multiple testing procedure for clinical trials. Biometrics. 1979; 35: 549–556. 42. C. Jennison, and B. W. Turnbull, Group Sequential Methods with Applications to Clinical Trials. London: Chapman and Hall, 2000. 43. N. Stallard and S. Todd, Sequential designs for phase III clinical trials incorporating treatment selection. Stat Med. 2003; 22: 689–703. 44. G. Hommel, Adaptive modifications of hypotheses after an interim analysis. Biom J. 2001; 43: 581–589. 45. F. Bretz, H. Schmidli, F. K¨onig, A. Racine, and W. Maurer, Confirmatory seamless phase
II/III clinical trials with hypotheses selection at interim: general concepts (with discussion). Biom J. 2006; 48: 623–634.
46. D. A. Berry, P. Müller, A. P. Grieve, M. Smith, T. Parke, et al., Adaptive Bayesian designs for dose-ranging drug trials. In: C. Gatsonis, B. Carlin, and A. Carriquiry (eds.), Case Studies in Bayesian Statistics V. New York: Springer, 2001, pp. 99–181.
47. M. Krams, K. R. Lees, W. Hacke, A. P. Grieve, J. M. Orgogozo, and G. A. Ford, Acute stroke therapy by inhibition of neutrophils (ASTIN): an adaptive dose response study of UK-279,276 in acute ischemic stroke. Stroke. 2003; 34: 2543–2548.
CROSS-REFERENCES Hypothesis testing Minimum effective dose (MinED) Multiple comparisons
DOSE RANGING CROSS-OVER DESIGNS
SCOTT D. PATTERSON
Wyeth Research & Development, Collegeville, Pennsylvania

BYRON JONES
Pfizer Pharmaceuticals, Sandwich, UK

NEVINE ZARIFFA
GlaxoSmithKline, Philadelphia, Pennsylvania

1 INTRODUCTION

Why do we care about dose? Many people seemingly are of the opinion that "if a little is good, more is probably better." However, this opinion is not necessarily the case in drug treatment. It is more complicated than that. As illustrated by Sheiner and Steimer (1) and described in more detail in References 2 and 3, administration of a dose of drug results in exposure to drug at the site of action once drug enters the systemic circulation. At the site of action, drug molecules bind to receptors in the relevant organ, tissue, and so on, and hopefully cause a desired response, like lowering blood pressure, which improves how a patient feels, how long a patient lives, or some other desired clinical outcome. Of course, once in the circulatory system, the drug may go to other sites than that desired and cause an undesired response. For example, consider the fatal example of terfenadine (4). Doses of this drug resulted in cardiac arrhythmia and death when given with a metabolic inhibitor. Experiences such as this have confirmed that a thorough understanding of how dose relates to response is critically important for a proper understanding of a drug's usefulness (or lack thereof) (5–7). An example in which response to dose can be monitored is in the use of a drug called warfarin. Such a drug is dangerous to use, but its use is said to prevent "twenty strokes for every episode of bleeding" (8). The drug is titrated to the desired level of effective and safe anticoagulation in a series of daily doses. Too much anticoagulation and undesired bleeding may result, which threatens a patient's well-being; too little anticoagulation could mean that thromboembolic events might not be treated or prevented (9). The choice of what series of doses of warfarin to use depends on many factors—age, nutritional status, gender, what other drugs the patient is taking, and so on. These factors make warfarin challenging to use in practice, and frequent laboratory monitoring of blood coagulation level is used to protect patients (8), although this seems not to be uniformly successful (9). Not all drugs, however, require such a high level of monitoring. Other drugs may be more benign in terms of maintaining safety over a range of doses. Early drug development focuses on exploration of the dose–response relationship (for both undesired and desired effects). The focus here (in what is commonly referred to as Phase I and II) is on "learning" (10) about the compound's properties. Once a safe and effective range of doses is identified, confirmatory clinical trials are performed to provide regulators with sufficient evidence of benefit relative to risk to support access of a product at certain prescribed doses to patients in the marketplace. This article focuses on the use of crossover designs (11–13) in clinical drug development, which are used for dose-ranging to evaluate desired and undesired response. Those interested in the general topic of clinical dose-response may find recent works in References 14–16 helpful in their research. These works tend to concentrate on application of dose-ranging in parallel group designs (most frequently, those used in oncology), and we cross-reference the findings of these works when appropriate to their application in crossover designs. Those interested in the application of techniques in nonclinical, toxicology drug research will find Hothorn and Hauschke's work in Chapters 4 and 6 of Reference 17 of interest. After brief discussions of current international guidance and of clinical dose-ranging study design, we discuss two categories of dose ranging crossover designs. In the next
section, titration designs and extension studies are discussed. Then, fully randomized crossover studies are considered. Each section includes references to examples of data, the statistical model, and methods of inference. A comprehensive exposition of dose-response modeling in crossover designs is beyond the scope of this article, and we concentrate on models which in our experience are most frequently applied, providing references for those interested in more details. 1.1 Objectives of Dose-Ranging and Summary of Current International Regulations The International Conference on Harmonization guidance ICH E4 (18) denotes choice of dose as an ‘‘integral part of drug development.’’ However, the guidance is complex and calls for determination of several factors: for example, the starting dose, the maximum tolerated dose (MTD), the minimum effective dose, the maximum effective dose, and a titration algorithm for physicians to use to determine which dose to use. Thus, a doseranging trial has several objectives that can be covered by statements as nebulous as, for example, ‘‘characterizing dose response’’ to as precise as confirming that ‘‘X mg is the MTD in a certain population.’’ In addition to choice of objective, the choice of endpoint is critical to the selection of design. Drug research contains examples of both acute, reversible responses (e.g., blood pressure), and chronic responses that change (generally for the worse) with time (e.g., survival time), which depend on the therapy area being investigated. Studies used to assess dose-response should be ‘‘well-controlled, using randomization and blinding’’ and ‘‘should be of adequate size’’ (18). In early phase development, ‘‘placebo controlled individual subject titration designs (18)’’ support later, larger trials in patients. ICH E4 recommends the use of parallel-group dose response studies; however, this method is not in keeping with regulatory trends toward the individualization of therapy (5) and it requires greater consideration of different regimens within the same patient. ICH E4 lists several potential study designs
1. Parallel dose-response designs— Subjects are randomized to fixed-dose groups and followed while on that dose to evaluate response. This type of dose ranging design is the most common in drug development but does not generally permit the evaluation of individual response (18). These designs are discussed in References 15 and 16 and are not discussed here in favor of discussion of more powerful (11) designs as follows. 2. Titration designs—Subjects (forced titration) or subjects not yet achieving a response (optional titration) receive predetermined increases in dose until the desired effect is observed, or until dosing is stopped because patients are not tolerating the regimen. Although this approach confounds time-related effects with dose and with cumulative dose response effects, it does permit the evaluation of individual dose-response under certain assumptions. Titration designs are discussed in the section entitled, ‘‘Titration designs and extension studies’’. 3. Crossover designs—Subjects are randomized to sequences of different doses with each period of treatment with a given dose separated by a wash-out period to enable the body’s function to return to basal levels (11, 12). This type of design is discussed in the section entitled, ‘‘Randomized designs.’’ FDA’s implementation of ICH E4 (18) may be found in Reference 19. This guideline enhances several aspects of ICH E4 by calling, in part, for (19): 1. Prospectively defined hypotheses/objectives, 2. Use of an appropriate control group, 3. Use of randomization to ensure comparability of treatment groups and to minimize bias. Both guidelines (18, 19) suggest that results of dose-response studies can serve as supporting evidence for confirmatory studies (conducted subsequently in drug development). This evidence may mitigate requirements for multiple confirmatory trials in
some settings. Especially critical to this role as supporting evidence is the choice of endpoint. If the dose-response study or studies consider a clinically relevant endpoint related directly to dose with little to no time lag to response (e.g., blood pressure for a hypertensive agent), then use of many alternative designs (parallel, titration, or crossover) can be informative, and the outcome of the study may support registration directly. Where significant hysteresis (see Reference 3 for a definition) is present, more care in choice of design is warranted, and the relationship of dose ranging to registration is defined less well, see Fig. 1. The lists of designs in References 18 and 19 are by no means exhaustive, and other alternatives may be used. For example, another regulatory perspective may be found in Hemmings’ Chapter 2 of Reference 15, which discusses several alternative crossover study designs: 1. Randomized titration designs in nonresponders—Here, dose is titrated to effect in subjects randomly assigned to a dose or a control group. Those subjects who respond to the dose continue on that dose whereas nonresponders are randomized to continue on the dose (to evaluate whether prolonged exposure
[Figure 1. Selected points to consider in different dose-response study designs from FDA guidance (19).]
will result in an effect) or to a higher dose. 2. Randomized withdrawal designs— Here, dose is titrated to effect in subjects randomly assigned to a dose of drug or placebo. Subjects randomly assigned to drug who respond to treatment are randomly assigned to switch to placebo or to continue on drug (to enable consideration of duration of effect once drug is discontinued). The latter type of study is known as Discontinuation designs (15), and it is a class of incomplete crossover designs (11). This design is sometimes called ‘‘Retreatment’’ designs (20) or ‘‘Enrichment’’ designs (21–24). We do not examine such studies further here as: 1. They are not generally analyzed (25, 26) in a manner consistent with crossover designs (11), and 2. Published examples examined for this article (23, 27, 28) were not dose-ranging [despite suggestions that such designs have use in this area (24)]. Thus, it can be observed that the objective of dose-response studies may be tactical {i.e., exploratory in the sense of deciding how
to dose in subsequent studies (10)], strategic (i.e., decide how to dose in confirmatory trials for regulatory submission), and/or regulatory (i.e., provide data directly supporting submissions in place of a confirmatory trial). For the latter, ICH E5 (29) for example calls for collection of dose-response as a prerequisite for bridging findings from one ethnicity to another for registration. However, practical implementation of such an approach is region-specific (30–32). As a practical matter for those reviewing such trial designs (or their results), one can generally deduce what the true purpose of the design is by looking at the choice of endpoint and how many assumptions are made in design. If the endpoint is a biomarker (c.f., 33) and the design is titration (making the assumption that time related effects are small relative to the effect of dose), then the study is probably designed for tactical, exploratory purposes. If, however, the endpoint is a surrogate marker (33) or clinical endpoint (34), and the study employs a fully randomized control and sequence of doses [(e.g., for a crossover a William’s square (11)], then it is most likely for strategic or for regulatory purposes. Generally, the number of subjects in the study is not a good way to deduce the purpose of a study, as regulatory considerations for extent of exposure recommended for registration are more complex than this, see Reference 35. 1.2 Statistical Aspects of the Design of Dose Ranging Studies As in all protocols, the choice of population is of critical importance as are the controlled aspects of study procedures, but we will neglect those subjects here to consider statistical analysis for dose-ranging trials. Statistical consideration of such dose-ranging studies has been geared mainly toward provision of confirmatory hypothesis tests (generally used in the later stages of drug development referred to as Phase IIb-III). Modeling is the procedure that should most often be used in exploratory trials (in the early stages of drug development, referred to as Phase I and IIa). The analysis method chosen depends directly on choice of objective and endpoint.
The choice of endpoint generally determines whether one can consider a crossover (see the section entitled ‘‘Randomized designs’’) trial design in this setting (11, 13). The advantage of such a design is that one can compare the response at different doses of drug ‘‘within-subject,’’ which presumably leads to a more accurate and precise measurement (11). A potential disadvantage is that responses from the previous administration may carryover to the next treatment period, which complicates (and potentially confounds) a clear inference. If one is considering a response that is fairly ‘‘permanent’’ (i.e., survival time), then one is likely constrained by circumstance to use a parallel group or titration design (see the section entitled ‘‘Titration designs and extension studies’’). This latter design is useful here if one can assume that time related effects are small relative to the effects of dose. If one is looking at an endpoint that is reversible (i.e., generally goes back to where it started when one stops treatment), then crossover designs are recommended. The key question many statisticians face in design of dose ranging trials is, ‘‘how many subjects do we need?’’ This question is very complex to answer if one adopts a traditional approach to powering. Closed form solutions amenable to use in confirmatory trials can become very complicated as multiple doses, multiple looks, adaptive or group-sequential adjustment are involved (e.g., see Reference 36). We recommend the simpler procedure of simulation be employed. Simulated crossover data are generated very easily using a variety of commercial software packages (13), and we will not dwell on technical details here. For an example, see Chapter 6 of Reference 13. Simulation of data permits one to consider a variety of modeling and testing techniques at the design stage, as appropriate to the objective and endpoint of the trial, and we now consider two such approaches: modeling and hypothesis tests. Modeling of data from dose response trials follows the general principle 9 of regression and has been described by many authors, for example, References 1, 10, 14, and 37–42. These approaches examine response as a mathematical function of dose. These may
be linear, nonlinear, or some step-function of dose as appropriate to the response involved. A well-known model used in this setting is the power model (40):

y_ik = (α + ξ_k) + π_i + β(ld) + γ_(i−1) + ε_ik,   (1)

where α is the overall mean pharmacokinetic response at a unit dose (logDose, ld = 0), known in statistics as the population intercept; ξ_k is the random intercept that accounts for each subject (k) as their own control; π_i is the time-related effect of period i on the response; β is the slope parameter of interest regressed on logDose (ld); γ_(i−1) is any effect of the regimen from the previous period, known as a carry-over effect (11); and ε_ik denotes within-subject error for each log-transformed response y_ik (e.g., AUC or Cmax; see Reference 13). See for example Fig. 2.

[Figure 2. Estimated logDose versus logAUC curve with individual data points; reproduced from Patterson and Jones (13), Example 7.2.1, with the permission of CRC Press.]

The authors have found this model particularly useful in the setting of pharmacokinetics (43). Modern computing technology has permitted the application of even more complex models in this setting (1). In brief, nonlinear mixed effect models for pharmacokinetics are described in an "effect compartment" (1) (a hypothesized part of the body where pharmacodynamic effect is thought to be induced by drug treatment); a function of this model is related to response using a statistical model (41). An Emax model (Chapter 9 of Reference 16) is:

E = Emax·C^N / (EC50^N + C^N) + E0,

where E is the effect being modeled, E0 is the effect observed without any drug present, C is the concentration of drug in the effect compartment, EC50 is the concentration needed to cause a 50% response, N is the slope factor (estimating "sensitivity of the response to the dose range"), and Emax is the maximum effect that can occur with drug treatment. This example is a nonlinear (in concentration) additive model. If drug concentration in the effect compartment is not related to effect, then Emax and EC50 would be zero. Note that instead of concentration C, one may use the dose or the log-dose in the expression if no effect compartment is being used. As a practical matter when modeling, we recommend that those who develop such models follow the principle of parsimony (Section 4.10.1 of Reference 44): the simplest model that accurately describes the data is best. The parameters estimated from such a model may be used for picking doses in subsequent studies, as described in References 13 and 45–47; see additional examples below. Assessment of model fit is done typically by inspection of residual values (i.e., predicted less observed), although more formal methods may sometimes be applied (48).
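As a sketch of the simulation approach to design recommended above, the code below generates crossover data under a simplified version of model (1) (subject intercepts and a common slope on log-dose, with period and carry-over effects set to zero) and estimates the probability that the 90% confidence interval for β falls within an illustrative acceptance range. All parameter values are assumptions made for the example, not recommendations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Assumed design and parameter values (illustration only)
doses = np.array([50.0, 250.0, 1000.0])     # three periods, one dose level each
ld = np.log(doses)
n_subj, n_sim = 12, 2000
beta_true = 0.9                              # 1.0 would be exact dose proportionality
sd_subj, sd_within = 0.40, 0.25
accept = (0.80, 1.20)                        # illustrative acceptance range for beta

def one_trial():
    subj = np.repeat(np.arange(n_subj), doses.size)
    x = np.tile(ld, n_subj)
    y = (rng.normal(1.0, sd_subj, n_subj)[subj]       # alpha + xi_k per subject
         + beta_true * x + rng.normal(0, sd_within, x.size))
    X = np.column_stack([np.eye(n_subj)[subj], x])    # subject intercepts + slope
    coef, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
    dof = y.size - X.shape[1]
    se = np.sqrt(rss[0] / dof * np.linalg.inv(X.T @ X)[-1, -1])
    half = stats.t.ppf(0.95, dof) * se                # 90% two-sided interval
    return coef[-1] - half, coef[-1] + half

inside = np.mean([lo > accept[0] and hi < accept[1]
                  for lo, hi in (one_trial() for _ in range(n_sim))])
print(f"P(90% CI for beta within {accept}) with {n_subj} subjects: {inside:.2f}")
```

Varying the number of subjects or the assumed variance components in such a simulation gives a direct answer to the "how many subjects do we need?" question for the chosen analysis.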
Such model-based approaches may be extended to situations in which more than one endpoint is of importance (49). These approaches are also amenable to statistical optimization (50, 51), although such techniques are rarely applied in drug development. Bayesian approaches are not rare in oncology (52–54), and extensions to consider more than one endpoint are known (55). Although Bayesian extensions to the models described above may also be applied in exploratory studies in other therapy areas (43, 56–59), such use is not yet widespread (60). Results of models such as those applied later in this article may be used for confirmatory purposes in the testing of hypotheses. These testing procedures are geared toward provision of statistical confirmation of effect and are designed more to make a claim. As such, results must prove responses as desired beyond a reasonable doubt under a traditional hypothesis testing framework for consideration by regulators. ICH E4 (18) does not require this proof, but it is still often done for strategic and confirmatory purposes. One might test, for example:

H0: μ_Di − μ_P ≤ 0   (2)

for all i, where Di denotes dose i = 1, 2, 3, 4, . . . relative to placebo P. This hypothesis is to be tested versus the alternative hypothesis:

H1: μ_Di − μ_P > 0   (3)
for at least one Di in the study. Modifications of the traditional statistical hypothesis tests can be used for this purpose to adjust for the multiplicity inherent to such analyses in later phase (IIb–III) development. Many alternatives exist: the Fisher’s LSD test, the Bonferroni-Dunn test, the Tukey’s HSD test, the Newman-Keuls test, the Scheff´e test, the Dunnett test to name a few—see Chapter 24 of Reference 61, as well as References 62 and 63 for details. These alternative procedures are many and varied, and they are geared toward protection against a type 1 error (which is declaring a dose as providing a given response, incorrectly, relative to a null hypothesis that is true). The focus in this article is on human clinical trials, but the confirmatory
techniques discussed can be applied to nonhuman studies of the same purpose (and vice-versa). For readers interested in more on this topic, we recommend Hothorn and Hauschke’s work in Chapters 4 and 6 of Reference 17. Recent summaries of such testing procedures may be found in References 64 and 65. Use of such procedures can have an impact on study design, and it is recommended that those who employ such approaches see References 51 and 66 for discussion of relevant aspects of study design. Note that in such confirmatory trials, control of type 1 error can become complex because group-sequential or adaptive looks are commonly employed (67–71). In theory, one could also use such an approach to inference in a titration (see the next section) or discontinuation design, but given the assumptions necessary to perform such an analysis (i.e., the confounding of time with dose), it is rarely done in practice, and emphasis is placed on model-based interpretation. We now turn to such data. 2 TITRATION DESIGNS AND EXTENSION STUDIES Titration designs are used when one wishes to observe the safety and the efficacy results at one dose before choosing subsequent doses in a given experimental unit. These types of approaches are commonly used in real-world clinical practice (37) to titrate to effect under a forced (predetermined) or, most often, an optional escalation structure. Warfarin is an historical example of such an approach. Eligible patients are titrated at weekly doses of warfarin between 27.5 to 42.5 mg to achieve an international normalized ratio (INR, a measure of the blood’s ability to coagulate—the greater the value, the lesser the coagulation) of 2 to 3 (8). Weekly INR measurements are taken to determine how to adjust the dose upward or downward at the physician’s discretion. Warfarin is by no means the only drug used in this manner—see Reference 72 and Chapter 18 of Reference 3 for other examples. Clinical trials also use such approaches when the endpoint involved is reversible and/or concerns develop with tolerability at
higher doses. In such cases, a titration design may be used. For example in first-time-inhuman trials, doses are generally chosen for inclusion in the study based on allometric scaling of findings from experiments in animals or in vitro studies (73, 74). Once this range of potentially safe doses is determined, testing in humans commences with small, well-controlled, data-intensive trials. See Chapter 7 of Reference 13 for more details. Exposure data from such a trial is shown in Table 1. In this case, it is determined that dose-escalation in this three-period, randomized (to placebo), single blind, crossover design would be halted when mean exposure neared 20 ng.h/mL. Dosing in this study began with the first subject at 100 mg and proceeded in cohorts of subjects until exposure levels neared the desired level. Once this was achieved, it was desired to determine what exposure levels were at lower levels (i.e., 50 mg), and several subjects were dosed at that level. Such data are easily analyzed using the power model of expression [Equation (1)] that accounts for each subject as their own control
as described in the section entitled, ‘‘Statistical aspects of the design of dose ranging studies’’ (see Chapter 7 of Reference 13 for more details). This procedure may be performed after the study (as illustrated here) or while the study is ongoing to aid in the choice of doses. Note that in this study, each study period was separated by a washout of at least seven days to ensure no effect of the previous period carried over to the next period. Study procedures, which include the assay, were standardized and systematic in all periods to minimize the potential for any time related effects. Therefore, we can assume that the effect of the previous dose does not carry over (i.e., in some manner explain or confound) the results at the next dose (i.e., that γi−1 = 0) and that any time-related effects are negligible (i.e., that πi = 0) relative to the effect of dose. Note also that placebo data are not included in this model, as their null values are challenging to account for in such a small sample size relative to nondetectable levels of the exposure assay. Different approaches to analysis may be taken if this is not the case (75).
Table 1. Example of Exposures (ng.h/mL) in an Optional Titration Cross-over Design

Subject  Sequence   50 mg   100 mg   250 mg   500 mg   1000 mg   1500 mg
1        APA        ·       2.688    ·        13.255   ·         ·
2        AAP        ·       4.258    8.651    ·        ·         ·
3        PAA        ·       3.21     ·        19.508   ·         ·
4        PAA        ·       ·        5.373    5.436    ·         ·
5        AAP        ·       4.379    8.111    5.436    ·         ·
6        APA        ·       ·        2.624    7.931    ·         ·
7        APA        ·       ·        6.568    ·        9.257     ·
8        AAP        ·       ·        ·        7.84     ·         ·
9        PAA        ·       ·        ·        8.069    8.664     ·
10       PAA        ·       ·        ·        ·        7.07      8.595
11       PPA        ·       ·        ·        ·        14.196    20.917
12       APA        1.857   ·        ·        ·        ·         17.418
13       PAA        ND      ·        ·        ·        15.054    ·
14       APA        0.963   ·        ·        ·        ·         11.496
15       PPA        1.939   ·        ·        ·        ·         36.783
16       AAP        1.387   ·        ·        ·        ·         13.516
18       APA        ·       2.802    5.256    ·        ·         ·
19       PAA        1.812   4.146    ·        ·        ·         ·

P: Placebo, A: Active Dose of Drug. ND: Not Detectable by Assay.
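A minimal version of this analysis can be run on a subset of the non-missing values from Table 1 (the ND value and some subjects are omitted for brevity). The sketch uses a fixed-effects stand-in for model (1), with period and carry-over terms taken as negligible as discussed above, so its predictions will not exactly reproduce Table 2, which was based on all subjects and the random-intercept fit described in Reference 13.

```python
import numpy as np

# A subset of the non-missing exposures from Table 1: (subject, dose in mg, AUC)
records = [
    (1, 100, 2.688), (1, 500, 13.255),
    (2, 100, 4.258), (2, 250, 8.651),
    (3, 100, 3.21),  (3, 500, 19.508),
    (5, 100, 4.379), (5, 250, 8.111), (5, 500, 5.436),
    (9, 500, 8.069), (9, 1000, 8.664),
    (12, 50, 1.857), (12, 1500, 17.418),
    (14, 50, 0.963), (14, 1500, 11.496),
    (18, 100, 2.802), (18, 250, 5.256),
    (19, 50, 1.812), (19, 100, 4.146),
]
subjects = sorted({r[0] for r in records})
idx = {s: i for i, s in enumerate(subjects)}
subj = np.array([idx[r[0]] for r in records])
ldose = np.log([r[1] for r in records])
lauc = np.log([r[2] for r in records])

# Fixed-effects version of model (1): one intercept per subject plus a common
# slope on log(dose)
X = np.column_stack([np.eye(len(subjects))[subj], ldose])
coef, *_ = np.linalg.lstsq(X, lauc, rcond=None)
beta, alpha_bar = coef[-1], coef[:-1].mean()

for d in (50, 100, 250, 500, 1000, 1500):
    pred = np.exp(alpha_bar + beta * np.log(d))
    print(f"{d:5d} mg: predicted mean exposure ~ {pred:5.1f} ng.h/mL")
```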
Thus, the model is a reasonable choice given the design, it looks to describe the data well, and the resulting estimates can be useful for decision making. However, we have not accounted for uncertainty in the choice of subsequent doses conditioned on the observed data, and implicitly, the physician and the medical staff will use such data to help determine their next choice of dose under protocol. If one wishes to take this sort of approach statistically, then methods are available to accommodate such models. This example shows an inductive reasoning approach of Bayesian analysis (13), but we will not discuss this approach here as application is extremely limited to date in drug development (60). In this case, after the study, we see that average exposure levels overlapped the desired maximum at 1500 mg as described in Table 2. Predicted mean responses may be compared using pairwise statistical comparisons as described in the first section, although this comparison is not often done in practice. When one uses the conservative Tukey’s approach (63), for example, one finds that exposure is significantly greater across the increasing doses used in this study until one reaches the 500–1500-mg doses. The average exposure does not seem to increase significantly with dose (adjusted P > 0.10).
Table 2. Predicted Mean (90% Confidence Interval, CI) Exposures (ng.h/mL) in an Optional Titration Cross-over Design

Dose (mg)   Predicted Mean (90% CI)
50          1.7 (1.3–2.2)
100         2.8 (2.2–3.4)
250         5.2 (4.4–6.3)
500         8.5 (7.1–10.2)
1000        13.7 (11.2–16.8)
1500        18.2 (14.4–22.8)

We now turn to a related topic in drug development known as extension studies. Extension studies are nonrandomized crossover studies in which everyone receives a certain dose after the close of the double-blind randomized portion of the study relative to a control. For example, a study might be designed in which n eligible subjects randomly receive one of two active doses or placebo in period 1, and treatment is double blind. At the end of this session of therapy, the data are assessed, and if the drug was effective, then every subject receives the higher of the two doses in period 2; see Table 3.

Table 3. Example of a Random Allocation of Sequences to Subjects in an Extension Design

Subject   Sequence   Period 1   Period 2
1         LH         L          H
2         HH         H          H
3         PH         P          H
·         ·          ·          ·
n         HH         H          H

L: Low Dose, H: High Dose, P: Placebo.

If this method is planned from the start of the study [often done for the purposes of collecting additional safety data under (35)], and if the endpoint is not biased by the open-label extension nature of period 2 (i.e., one is assessing a hard endpoint like blood level, where knowledge of treatment does not influence outcome, or a "softer" endpoint is assessed by a blinded adjudication committee), then one can regard this design as an incomplete Balaam's design (11). Subjects are randomized de facto to sequence at study start, and data may be used for the purposes of treatment effect assessment (which one would expect to be more precise than the between-group assessment of period 1 data alone). One can assess continuity of effect with treatment by looking at data from the HH sequence and can confirm that the onset of treatment effects observed between groups in period 1 is similar to that observed within-subjects in the PH sequence. One can also assess, using the LH sequence data, whether those "under-treated" initially benefit from later treatment with the higher dose. The models applied are similar to those described
in a titration design above or in nonrandomized crossover studies (see Chapter 7 of Reference 13) and so are omitted here. To illustrate, consider the following summary of such a simulated data set (n = 1000) based on a dose-response curve for lipid data found in Reference 76. See Table 4. As one would anticipate from the data of Table 4, the data of period 1 confirms that a dose response is present (Tukey adjusted P < 0.0001); however, when one takes into account each subject as their own control by inclusion of the period 2 data (accounting for sequence and dose in the model), standard errors for the comparisons of interest narrow by 20%, which is consistent with the increased degree of precision expected in accounting for within-subject controls. This analysis is additionally informative in confirming (P < 0.0001) that subjects differ in their response to dose depending on sequence assignment, which is consistent with the differential response observed by dose in period 1. Analysis by sequence confirms that the effect of treatment with 500 mg after the placebo treatment of period 1 is similar in magnitude to the effect between groups in period 1 [i.e., 28.0 (SE = 0.05) versus 27.7 (SE = 0.15)], which suggests that delay in active treatment is not detrimental to a subject’s later response. However, comparison of the data from the HH sequence suggests that continuation of treatment may result in a decreased average response over time of approximately 0.9 (SE = 0.1, P < 0.0001). Thus, we might expect to see treatment effects diminishing over time in long studies.
Table 4. Unadjusted Mean (SE) of Response Data in an Extension Design

Sequence   n     Period 1     Period 2
LH         333   4.7 (0.1)    27.7 (0.1)
HH         334   28.0 (0.1)   26.9 (0.1)
PH         333   0.1 (0.1)    27.9 (0.1)

L: 100 mg, H: 500 mg, P: Placebo.
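The gain in precision from the period 2 data can be illustrated with a small simulation (all values invented, loosely patterned on the scale of Table 4). A between-group analysis of period 1 alone is compared with a mixed-model analysis of both periods that uses each subject as their own control; the statsmodels package is assumed to be available.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)

# Stylized extension design: 300 subjects randomized 1:1:1 to Low, High, or Placebo
# in period 1; everyone receives High in period 2.
n_arm = 100
trt1 = np.repeat(["L", "H", "P"], n_arm)
effect = {"P": 0.0, "L": 5.0, "H": 28.0}
sd_subj, sd_within = 3.0, 1.0

u = rng.normal(0, sd_subj, trt1.size)            # subject random intercepts
rows = []
for i, t1 in enumerate(trt1):
    rows.append((i, 1, t1, effect[t1] + u[i] + rng.normal(0, sd_within)))
    rows.append((i, 2, "H", effect["H"] + u[i] + rng.normal(0, sd_within)))
df = pd.DataFrame(rows, columns=["subject", "period", "trt", "y"])

# (a) Between-group comparison using period 1 only
p1 = smf.ols("y ~ C(trt, Treatment('P'))", data=df[df["period"] == 1]).fit()
# (b) Mixed model using both periods, with a random intercept per subject
both = smf.mixedlm("y ~ C(trt, Treatment('P')) + C(period)", data=df,
                   groups=df["subject"]).fit()

k1 = [k for k in p1.params.index if k.endswith("[T.H]")][0]
k2 = [k for k in both.params.index if k.endswith("[T.H]")][0]
print("H vs P, period 1 only: %.2f (SE %.3f)" % (p1.params[k1], p1.bse[k1]))
print("H vs P, both periods : %.2f (SE %.3f)" % (both.params[k2], both.bse[k2]))
```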
Inclusion of the period 2 data in this setting seems to be informative. Of note, however, if such models are applied, then one must take care to assess the assumptions being made in order to evaluate the results properly: that the length of treatment in period 2 is sufficient to preclude the need for a wash-out from period 1 (i.e., no carryover), and that disease state in the subjects does not change significantly, all else being equal, across the length of the trial (i.e., period effects are null). If either of these assumptions is of concern (perhaps the body acquires resistance to the drug, for example, or is sufficiently improved in period 1 that disease state is modified by period 2), then one may wish to alter the design to better accommodate unbiased assessment of dose-response over time (i.e., randomize to more sequences, including LL, etc.), which leads into the next topic: fully randomized, dose-ranging crossover designs.
3 RANDOMIZED DESIGNS
Situations often develop in which a range of safe doses are available, and it is necessary to determine either the dose-response relationship or to decide which doses are superior to placebo. For some medical conditions, it is possible to give subjects a sequence of doses and in this situation crossover designs offer large reductions in sample size compared with traditional parallel group designs. The simplest crossover design compares either a single dose against placebo or one active dose against another. For example, Jayaram (77) compared 10 mg of montelukast against placebo to determine the eosinophils-lowering effect of the dose in asthma patients. The plan of such a design is given in Table 5. In this table, it will be observed that subjects are randomized to receive one of two treatment sequences: Dose-Placebo or PlaceboDose, over two periods of time. In Jayaram (77), for example, the periods were 4 weeks long. Ideally, an equal number of subjects should be allocated to each sequence group. Often, a wash-out period is inserted between the two periods to ensure that the entire drug has been eliminated before the second period
10
DOSE RANGING CROSS-OVER DESIGNS
Table 5. Plan of Two-treatment, Two-period Cross-over Trial
Table 6. Latin Square Design for Four Treatments
Period Group
1
1 2
Dose Placebo
Period 2 Placebo Dose
begins. In situations in which no wash-out period occurs, or in which the wash-out period is of inadequate length, it is possible that the effect of the drug given in the first period carries over into the second period. If carryover effects cannot be removed by a wash-out period, then the simple two-period design is unsuitable and a design with either more than two periods or treatment sequences should be considered. See Reference 11 for a comprehensive description and discussion of crossover designs for two treatments. When more than one dose is to be compared with placebo or a dose-response relationship over three or more doses is to be determined, then crossover designs with three or more periods may be used. As with the two-treatment designs, the medical condition studied must be such that crossover designs are appropriate. With three or more doses and periods, a range of designs exist to choose from. It is important to make the correct choice as some designs require many more subjects than others. One advantage of multiperiod crossover designs is that carryover effects can be allowed for in the design and the analysis and so it is not essential to have wash-out periods between the treatment periods. However, where possible it is good practice to include wash-out periods and/or to ensure the length of the treatment periods is adequate to remove carryover effects. As an example of a design with four periods, we refer to Reference 78, who compared three doses of magnesium and placebo using 12 subjects in three Latin squares. In a single Latin square, as many different treatment sequences exist treatments. An example of such a design for four treatments is given in Table 6. Every subject gets each treatment over the four periods. To ensure the
Table 6. Latin Square Design for Four Treatments

Sequence    Period 1    Period 2    Period 3    Period 4
1           Placebo     Dose1       Dose2       Dose3
2           Dose1       Dose2       Dose3       Placebo
3           Dose2       Dose3       Placebo     Dose1
4           Dose3       Placebo     Dose1       Dose2
To ensure the design has adequate power to detect differences of given sizes between the treatments, it is usually necessary to assign more than one subject to each sequence (or to use multiple Latin squares, as done in Reference 78). We note that each treatment occurs an equal number of times in the design, and it is balanced in the sense that all estimated pairwise comparisons among the treatments have the same variance. The ordering of the treatments is of critical importance because a bad choice can lead to a very inefficient design if carryover effects are present. A useful measure of design efficiency is defined in Reference 11. This measure compares the variance of an estimated treatment comparison to a theoretical lower bound for the variance. If the lower bound is achieved, then the design has an efficiency of 100%. The design in Table 6 has an efficiency of 18.18% if differential carryover effects are present, so it is very inefficient. On the other hand, the design in Table 7 has an efficiency of 90.91%. Another way of expressing efficiency is to say, ''the design in Table 6 will require about 90.91/18.18 = 5 times more subjects to achieve the same power as the design in Table 7.'' In the absence of differential carryover effects, both designs have an efficiency of 100%. Therefore, it is important at the planning stage to decide whether differential carryover effects are likely. The design in Table 7 is an example of a Williams design for four treatments. Such designs exist for all numbers of treatments, but for an odd number of treatments, the number of sequences is twice the number of treatments. Some exceptions to this rule exist. For example, a balanced design for 9 treatments only requires 9 sequences, not 18. An example for three treatments is given in Table 8.
Table 7. Balanced Latin Square Design for Four Treatments

Subject    Period 1    Period 2    Period 3    Period 4
1          Placebo     Dose3       Dose1       Dose2
2          Dose1       Placebo     Dose2       Dose3
3          Dose2       Dose1       Dose3       Placebo
4          Dose3       Dose2       Placebo     Dose1

Table 8. Williams Design for Three Treatments

Sequence    Period 1    Period 2    Period 3
1           Placebo     Dose1       Dose2
2           Dose1       Dose2       Placebo
3           Dose2       Placebo     Dose1
4           Placebo     Dose2       Dose1
5           Dose1       Placebo     Dose2
6           Dose2       Dose1       Placebo

Table 9. Incomplete Design for Four Treatments in Three Periods

Sequence    Period 1    Period 2    Period 3
1           Dose1       Dose2       Dose3
2           Dose2       Dose1       Dose4
3           Dose3       Dose4       Dose1
4           Dose4       Dose3       Dose2
5           Dose1       Dose3       Dose4
6           Dose2       Dose4       Dose3
7           Dose3       Dose1       Dose2
8           Dose4       Dose2       Dose1
9           Dose1       Dose4       Dose2
10          Dose2       Dose3       Dose1
11          Dose3       Dose2       Dose4
12          Dose4       Dose1       Dose3
The efficiency of this design is 80% in the presence of differential carryover effects. When many doses exist in a study, for example, six plus a placebo, it is unlikely that a design with as many periods as treatments will be practical or desirable. Fortunately, many designs are available to deal with this situation. Tables of designs for up to nine treatments are given in Reference 11 and cover all practical situations. Some designs have the same number of periods as treatments, other designs have fewer periods than treatments, and still others have more periods than treatments. An example of a design with more treatments than periods is given in Reference 79, in which the authors used a crossover trial to assess the dose proportionality of rosuvastatin in healthy volunteers. Four doses were compared over three periods. Although several designs could be suggested, the design in Table 9 would be a suitable choice. It has an efficiency of 71.96% in the presence of differential carryover effects and an efficiency of 88.89% if differential carryover effects are absent.
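To make the construction of carryover-balanced sequences concrete, the sketch below generates a Williams design by the standard cyclic method (a single square when the number of treatments is even, the square plus its mirror image when it is odd) and converts the efficiencies quoted above into an approximate relative sample-size requirement. This is an illustration only: a generated square may order its rows and periods differently from Tables 7 and 8 while sharing the same balance property, and the efficiency values used at the end are simply those quoted in the text (Reference 11 gives the formal definitions).

```python
def williams_design(n_treatments, labels=None):
    """Generate a Williams (carryover-balanced) crossover design.

    Returns one treatment sequence per row: n_treatments sequences when the
    number of treatments is even, 2 * n_treatments (square plus mirror) when odd.
    """
    t = n_treatments
    labels = labels or ["T%d" % i for i in range(t)]
    # First row interleaves 0, 1, t-1, 2, t-2, ... so that successive
    # differences mod t are all distinct, which is the source of carryover balance.
    row, lo, hi = [0], 1, t - 1
    while lo <= hi:
        row.append(lo)
        if lo != hi:
            row.append(hi)
        lo, hi = lo + 1, hi - 1
    square = [[(x + k) % t for x in row] for k in range(t)]
    rows = square if t % 2 == 0 else square + [list(reversed(r)) for r in square]
    return [[labels[i] for i in r] for r in rows]

# Four treatments: four sequences in which each treatment is preceded by every
# other treatment exactly once (the property shared with Table 7).
for seq in williams_design(4, ["Placebo", "Dose1", "Dose2", "Dose3"]):
    print(seq)

# Relative sample size implied by the efficiencies quoted in the text
# (Table 6 cyclic square vs. Table 7 balanced square, carryover present).
eff_cyclic, eff_balanced = 0.1818, 0.9091
print("Approximate subject ratio:", round(eff_balanced / eff_cyclic, 1))  # about 5
```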
An important and more recent use of crossover designs is in safety trials to determine whether certain doses of a drug prolong the QTc interval (see Reference 80). The QTc interval is determined from an electrocardiogram of the heart. If the QTc interval is prolonged sufficiently in humans, then potentially fatal cardiac arrhythmias can result. New drugs, and potentially existing drugs that seek new indications, must study and rule out the potential for QTc interval prolongation. An example of such data may be found in Chapter 8 of Reference 13. Regimen E was a dose of an agent known to be a mild prolonger of the QTc interval and was included as a positive control. Regimen C was a therapeutic dose, and D was a supra-therapeutic (greater than that normally administered) dose of a moderate QTc-prolonging agent. Forty-one normal, healthy volunteers are in the example data set, and QTc was measured in triplicate at baseline (time 0) and over the course of the day at set times after dosing. Triplicate measurements were averaged at each time of ECG sampling (i.e., 0, 0.5, 1, 1.5, 2.5, 4, etc.) for inclusion in the analysis, and only samples out to four hours post dose are included here. In Figure 3, mild (E) and moderate (C) degrees of prolongation relative to regimen F (placebo) are observed, with slightly greater prolongation observed at the supra-therapeutic dose of the drug being studied (D). Both mild and moderate prolongation
Table 10. Mean Changes (90% CI) between Regimens following a Single Dose (n = 41), reproduced from Patterson and Jones (13), Example 8.1, with the permission of CRC Press

Comparison    Time (h)    Difference    90% CI
C-F           0.5         4.4923        (2.1997, 6.7848)
C-F           1           8.1830        (5.8904, 10.4755)
C-F           1.5         6.0120        (3.7195, 8.3045)
C-F           2.5         3.7444        (1.4518, 6.0369)
C-F           4           5.2944        (3.0018, 7.5869)
D-F           0.5         6.6868        (4.4035, 8.9701)
D-F           1           10.4591       (8.1758, 12.7425)
D-F           1.5         7.4421        (5.1588, 9.7255)
D-F           2.5         6.2212        (3.9379, 8.5046)
D-F           4           5.7591        (3.4757, 8.0424)
E-F           0.5         2.0069        (−0.2778, 4.2915)
E-F           1           7.5171        (5.2324, 9.8017)
E-F           1.5         6.2216        (3.9369, 8.5062)
E-F           2.5         6.9994        (4.7147, 9.2840)
E-F           4           8.4446        (6.1599, 10.7292)

C = Therapeutic Dose; D = Supra-therapeutic Dose; E = Dose of Positive Control; F = Placebo.

Figure 3. Mild and moderate QTc prolongation (n = 41): adjusted mean QTc by regimen (C, D, E, Placebo) against time (h) following a single dose [reproduced from Patterson and Jones (13), Example 8.1, with the permission of CRC Press].
refer to effect sizes greater than zero but less than the ICH E14 (80) level of probable concern for causing arrhythmia of 20 msec (6). In the example, moderate and statistically significant (note that the lower 90% confidence bounds exceed zero) QTc prolongation is observed for C and D within a half-hour of dosing, and it remains prolonged out to four hours post dosing. Significant prolongation for E is not observed until after a half-hour post dose (the 90% confidence interval at 0.5 hour includes zero), and QTc returns to parity with F shortly after 4 hours post dose (data not shown). If a randomized crossover design can be employed under circumstances in which parallel group or titration designs are being considered, then it probably should be used. In addition to the substantial gains in statistical efficiency and power (34) from the use of each subject as their own control, these designs allow for evaluation of time-related and carryover effects and consideration of their impact on the study results. Within-subject inclusion of a positive control is seemingly very valuable to regulators seeking to confirm the validity of a trial (80).
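As a sketch of how comparisons like those in Table 10 can be produced, the code below fits a simple mixed-effects model to long-format change-from-baseline QTc data at a single post-dose time, with a random intercept per subject and fixed effects for regimen and period, and reports 90% confidence intervals for each regimen contrast against placebo (F). The column names (subject, period, regimen, qtc_change, time_h) and the restriction to one time point are assumptions made for illustration; a full ICH E14 analysis would typically also account for time point, sequence, and baseline.

```python
import pandas as pd
from scipy.stats import norm
import statsmodels.formula.api as smf

def regimen_contrasts(data: pd.DataFrame, alpha: float = 0.10) -> pd.DataFrame:
    """Regimen-vs-placebo QTc differences with (1 - alpha) confidence intervals.

    Expects one row per subject and period at a chosen post-dose time, with
    hypothetical columns: subject, period, regimen ('F' = placebo), qtc_change.
    """
    model = smf.mixedlm(
        "qtc_change ~ C(regimen, Treatment(reference='F')) + C(period)",
        data=data,
        groups=data["subject"],      # random intercept for each subject
    )
    fit = model.fit(reml=True)
    est = fit.fe_params              # fixed-effect estimates
    se = fit.bse_fe                  # standard errors of the fixed effects
    z = norm.ppf(1 - alpha / 2)      # about 1.645 for a 90% interval
    table = pd.DataFrame({"estimate": est,
                          "lower": est - z * se,
                          "upper": est + z * se})
    # Keep only the regimen contrasts (the C-F, D-F, E-F style rows).
    return table[table.index.str.contains("regimen")]

# Usage with a hypothetical data frame 'ecg' of averaged triplicate readings:
# print(regimen_contrasts(ecg[ecg["time_h"] == 1.0]))
```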
4 DISCUSSION AND CONCLUSION

Dose finding studies are of critical importance in drug development. In fact, it is generally accepted that dose finding is one of the ''opportunities'' (7) waiting to be realized in the development of new medicines. As such, it is vitally important to design, execute, and interpret dose ranging studies early in the development of a new medicine. Broadly speaking, three types of dose ranging studies exist. The first type comprises the very early studies conducted in Phase I to help ascertain the likely range of doses for additional study of a new molecule. These studies are primarily pharmacokinetic studies that aim to translate the animal exposure to the exposure observed in humans. These trials are often exploratory in terms of the statistical framework. An example of this type of early dose ranging trial was highlighted in the section entitled ''Titration designs and extension studies.'' The results of these early trials often set the dosage range for studies conducted in early Phase IIa, with a biomarker or a surrogate endpoint. Although this second type of dose ranging trial can be
performed in the target patient population and with the registration endpoint of interest (e.g., blood pressure or lipid endpoints in cardiovascular research), it is much more common that these studies are moderately sized trials in low-risk patient populations that aim to refine the understanding of mechanistic effects of the drug under study; see the section entitled ''Titration designs and extension studies.'' The last and perhaps most important of the dose ranging trials are the Phase IIb trials conducted in the patient population itself (or a very closely related patient population) with the registration endpoint or a well-accepted clinical surrogate (e.g., QTc; see the section entitled ''Randomized designs''). The duration of dosing in these trials is often lengthy, and the use of a crossover design implies a very meaningful lengthening of the trial duration, which can make these designs impractical. This last type of dose ranging trial is often viewed as time consuming and has historically been an area of opportunity for enhancement in drug development. A key issue in this last category of trials is the choice of a control arm. Not discussed at length in this article is the option to include an active agent for the purpose of calibrating the trial results against something more convincing than historic data from older agents. This decision is often difficult, but it can add a tremendous amount of value when interpreting a trial's findings.
In summary, beyond the statistical intricacies of the design and analysis of dose finding studies that are reviewed in this article, the fact remains that these trials are often the linchpin of drug development and deserve the full attention of statisticians.
REFERENCES

1. L. B. Sheiner and J-L. Steimer, Pharmacokinetic-pharmacodynamic modeling in drug development. Annu. Rev. Pharmacol. Toxicol. 2000; 40: 67–95.
2. M. Rowland and T. N. Tozer, Clinical Pharmacokinetics: Concepts and Applications. Philadelphia, PA: Lea and Febiger, 1980.
3. A. Atkinson, C. Daniels, R. Dedrick, C. Grudzinskas, and S. Markey, eds., Principles of Clinical Pharmacology. San Diego, CA: Academic Press, 2001.
4. C. Pratt, S. Ruberg, J. Morganroth, B. McNutt, J. Woodward, S. Harris, J. Ruskin, and L. Moye, Dose-response relation between terfenadine (Seldane) and the QTc interval on the scalar electrocardiogram: distinguishing drug effect from spontaneous variability. Am. Heart J. 1996; 131: 472–480.
5. R. Temple, Policy developments in regulatory approval. Stats. Med. 2002; 21: 2939–2948.
6. R. Temple, Overview of the concept paper, history of the QT/TdP concern; Regulatory implications of QT prolongation. Presentations at Drug Information Agency/FDA Workshop, 2003. Available: www.diahome.org.
7. FDA Position Paper, Challenge and Opportunity on the Critical Path to New Medical Products. 2004. 8. J. Horton and B. Bushwick, Warfarin therapy: evolving strategies in anticoagulation. Am. Fam. Physician 1999; 59: 635–648.
9. M. Reynolds, K. Fahrbach, O. Hauch, G. Wygant, R. Estok, C. Cella, and L. Nalysnyk, Warfarin anticoagulation and outcomes in patients with atrial fibrillation. Chest 2004; 126: 1938–1945.
10. L. B. Sheiner, Learning versus confirming in clinical drug development. Clin. Pharmacol. Therapeut. 1997; 61: 275–291.
11. B. Jones and M. G. Kenward, Design and Analysis of Cross-over Trials, 2nd ed. London: Chapman and Hall, CRC Press, 2003.
12. S. Senn, Cross-over Trials in Clinical Research, 2nd ed. New York: John Wiley and Sons, 2002.
13. S. Patterson and B. Jones, Bioequivalence and Statistics in Clinical Pharmacology. London: Chapman and Hall, CRC Press, 2006.
14. R. Tallarida, Drug Synergism and Dose-Effect Data Analysis. London: Chapman and Hall, CRC Press, 2000.
15. S. Chevret, ed., Statistical Methods for Dose Finding Experiments. West Sussex, UK: Wiley, 2006.
16. N. Ting, ed., Dose Finding in Drug Development. New York: Springer, 2006.
17. S. C. Chow and J. Liu, eds., Design and Analysis of Animal Studies in Pharmaceutical Development. New York: Marcel Dekker, 1998.
18. International Conference on Harmonization, E4: Dose Response Information to Support Drug Registration. 1994. Available: http://www.fda.gov/cder/guidance/.
19. FDA Guidance, Exposure-Response Relationships—Study Design, Data Analysis, and Regulatory Applications. 2003. Available: http://www.fda.gov/cder/guidance/.
20. C. Dunger-Baldouf, A. Racine, and G. Koch, Retreatment studies: design and analysis. Drug Informat. J. 2006; 40: 209–217.
21. R. Temple, Enrichment designs: efficiency in development of cancer treatments. J. Clin. Oncol. 2005; 23: 4838–4839.
22. R. Temple, FDA perspective on trials with interim efficacy evaluations. Stats. Med. 2006; 25: 3245–3249.
23. R. Temple, Special study designs: early escape, enrichment, studies in nonresponders. Communicat. Stat. Theory Methods 1994; 23: 499–531.
24. R. Temple, Government viewpoint of clinical trials. Drug Informat. J. 1982; 1: 10–17.
25. B. Freidlin and R. Simon, Evaluation of randomized discontinuation design. J. Clin. Oncol. 2005; 23: 5094–5098.
26. J. Kopec, M. Abrahamowicz, and J. Esdaile, Randomized discontinuation trials: utility and efficiency. J. Clin. Epidemiol. 1993; 46: 959–971.
27. W. Stadler, G. Rosner, E. Small, D. Hollis, B. Rini, S. Zaentz, J. Mahoney, and M. Ratain, Successful implementation of the randomized discontinuation trial design: an application to the study of the putative antiangiogenic agent carboxyaminoimidazole in renal cell carcinoma - CALGB 69901. J. Clin. Oncol. 2005; 23: 3726–3732.
28. G. Rosner, W. Stadler, and M. Ratain, Randomized discontinuation design: application to cytostatic antineoplastic agents. J. Clin. Oncol. 2002; 20: 4478–4484.
29. International Conference on Harmonization, E5: Guidance on Ethnic Factors in the Acceptability of Foreign Clinical Data. 1998. Available: http://www.fda.gov/cder/guidance/.
30. R. Nagata, H. Fukase, and J. Rafizadeh-Kabe, East-West development: Understanding the usability and acceptance of foreign data in Japan. Internat. J. Clin. Pharmacol. Therapeut. 2000; 38: 87–92.
31. S. Ono, C. Yoshioka, O. Asaka, K. Tamura, T. Shibata, and K. Saito, New drug approval times and clinical evidence in Japan. Contemp. Clin. Trials 2005; 26: 660–672.
32. Y. Uyama, T. Shibata, N. Nagai, H. Hanaoka, S. Toyoshima, and K. Mori, Successful bridging strategy based on ICH E5 guideline for drugs approved in Japan. Clin. Pharmacol. Therapeut. 2005; 78: 102–113.
33. Biomarker Definition Working Group, Biomarkers and surrogate endpoints: preferred definitions and conceptual framework. Clin. Pharmacol. Therapeut. 2001; 69: 89–95.
34. L. Lesko and A. Atkinson, Use of biomarkers and surrogate markers in drug development. Annu. Rev. Pharmacol. Toxicol. 2001; 41: 347–366.
35. International Conference on Harmonization, E1: The Extent of Population Exposure to Assess Clinical Safety for Drugs Intended for Long-term Treatment of Non-Life-Threatening Conditions. 1995. Available: http://www.fda.gov/cder/guidance/.
DOSE RANGING CROSS-OVER DESIGNS 36. A. Tamhane, K. Shi, and K. Strassburger, Power and sample size determination for a step-wise test procedure for determining the maximum safe dose. J. Stat. Planning Infer. 2006; 36: 2163–2181. 37. L. B. Sheiner, S. L. Beal, and N. C. Sambol, Study designs for dose-ranging. Clin. Pharmacol. Therapeut. 1989; 46: 63–77. 38. L. B. Sheiner, Y. Hashimoto, and S. Beal, A simulation study comparing designs for dose ranging. Stats. Med. 1991; 10: 303–321. 39. L. B. Sheiner, Bioequivalence revisited. Stats. Med. 1992; 11: 1777–1788. 40. R. E. Walpole, R. H. Myers, and S. L. Myers, Probability and Statistics for Engineers and Scientists, 6th ed. Englewood Cliffs, NJ: Prentice Hall, 1998. 41. S. Machado, R. Miller, and C. Hu, A regulatory perspective on pharmacokinetic and pharmacodynamic modelling. Stats. Methods Med. Res. 1999; 8: 217–245. 42. W. Slob, Dose-response modeling of continuous endpoints. Toxicol. Sci. 2002; 66: 298–312. 43. S. Patterson, S. Francis, M. Ireson, D. Webber, and J. Whitehead, A novel Bayesian decision procedure for early-phase dose-finding studies. J. Biopharmaceut. Stats. 1999; 9: 583–598. 44. F. Harrell, Regression Modelling Strategies. New York: Springer, 2001. 45. T. G. Filloon, Estimating the minimum therapeutically effective dose of a compound via regression modeling and percentile estimation. Stats Med. 1995; 14: 925–932. 46. T. N. Johnson, Modelling approaches to dose estimation in children. Br. J. Clin. Pharmacol. 2005; 59: 663–669. 47. T. Johnson, J. Taylor, R. Haken, and A. Eisbruch, A Bayesian mixture model relating dose to critical organs and functional complication in 3D conformal radiation therapy. Biostatistics 2005; 6: 615–632. 48. P. Lupinacci, and D. Raghavarao, Designs for testing lack of fit for a nonlinear dose-response curve model. J. Biopharmaceut. Stats. 2000; 10: 43–53. 49. J. Whitehead, Y. Zhou, J. Stevens, G. Blakey, J. Price, and J. Leadbetter, Bayesian decision procedures for dose-escalation based on evidence of undesirable events and therapeutic benefit. Stats. Med. 2006a; 25: 37–53. 50. V. Fedorov and P. Hackl, Model-Oriented Design of Experiments. New York: Springer, 1997.
51. S. Biedermann, H. Dette, and W. Zhu, Optimal designs for dose-response models with restricted design spaces. J. Am. Stats. Assoc. 2006; 101: 747–759. 52. L. Desfrere, S. Zohar, P. Morville, A. Brunhes, S. Chevret, G. Pons, G. Moriette, E. Reyes, and J. M. Treluyers, Dose-finding study of ibuprofen in patent ductus arteriosus using the continual reassessment method. J. Clin. Pharm. Therapeut. 2005; 30: 121–132. 53. M. Gonen, A Bayesian evaluation of enrolling additional patients at the maximum tolerated dose in phase I trials. Contemp. Clin. Trials 2005; 26: 131–140. 54. Y. Loke, S. Tan, Y. Cai, and D. Machin, A Bayesian dose finding design for dual endpoint phase I trials. Stats. Med. 2006; 25: 3–22. 55. P. Thall and J. Cook, Dose-finding based on efficacy-toxicity trade-offs. Biometrics 2004; 60: 684–693. 56. J. Whitehead, Y. Zhou, S. Patterson, D. Webber, and S. Francis, Easy-to implement Bayesian methods for dose-escalation studies in healthy volunteers. Biostatistics 2001a; 2: 47–61. 57. J. Whitehead, Y. Zhou, N. Stallard, S. Todd, and A. Whitehead, Learning from previous responses in phase 1 dose escalation studies. Br. J. Clin. Pharmacol. 2001b; 52: 1–7. 58. J. Whitehead, Y. Zhou, A. Mander, S. Ritchie, A. Sabin, and A. Wright, An evaluation of Bayesian designs for dose-escalation studies in healthy volunteers. Stats. Med. 2006b; 25: 433–445. 59. Y. Zhou, J. Whitehead, E. Bonvini, and J. Stevens, Bayesian decision procedures for binary and continuous bivariate doseescalation studies. Pharmaceut. Stats. 2006; 5: 125–133. 60. C. Buoen, O. Bjerrum, and M. Thomsen, How first-time-in-humans studies are being performed: a survey of phase I dose-escalation trials in healthy volunteers published between 1995 and 2004. J. Clin. Pharmacol. 2005; 45: 1123–1136. 61. D. Sheshkin, Handbook of Parametric and Nonparametric Statistical Procedures. London: Chapman & Hall, CRC Press, 2000. 62. J. Hsu, Multiple Comparisons: Theory and Methods. London: Chapman & Hall, CRC Press, 1996. 63. J. W. Tukey, J. L. Ciminera, and J. F. Heyse, Testing the statistical certainty of a response to increasing doses of a drug. Biometrics 1985; 41: 295–301.
64. P. Bauer, J. Rohmel, W. Maurer, and L. Hothorn, Testing strategies in multi-dose experiments including active control. Stats. Med. 1998; 17: 2133–2146. 65. A. Dmitrienko, W. Offen, O. Wang, and D. Xiao, Gate-keeping procedures in doseresponse clinical trials based on the Dunnett test. Pharmaceut. Stats. 2006; 5: 19–28. 66. Y. Cheung, Coherence principles in dosefinding studies. Biometrika 2005; 92: 863–873. 67. J. A. Robinson, Sequential choice of an optimal dose: a prediction intervals approach. Biometrika 1978; 65: 75–78. 68. C. Jennison and B.W. Turnbull, Group Sequential Methods with Applications to Clinical Trials. New York: Chapman and Hall, 2000.
69. M. Krams, K. Lees, W. Hacke, A. Grieve, J.-M. Orgogozo, and G. Ford, for the ASTIN Study Investigators, Acute stroke therapy by inhibition of neutrophils: an adaptive dose-response study of UK-279276 in acute ischemic stroke. Stroke 2003; 34: 2543–2548.
70. V. Dragalin and V. Fedorov, Adaptive Model-Based Designs for Dose-Finding Studies. GSK BDS Technical Report 2004-02, 2004.
71. L. Kong, G. Koch, T. Liu, and H. Wang, Performance of some multiple testing procedures to compare three doses of a test drug and placebo. Pharmaceut. Stats. 2004; 4: 25–35.
72. H. Moldofsky, F. Lue, C. Mously, B. Roth-Schechter, and W. Reynolds, The effect of zolpidem in patients with fibromyalgia: a dose ranging, double blind, placebo-controlled, modified cross-over study. J. Rheumatol. 1996; 23: 529–533.
73. FDA Draft Guidance, Estimating the Safe Starting Dose in Clinical Trials for Therapeutics in Adult Healthy Volunteers. 2006. Available: http://www.fda.gov/cder/guidance/.
74. B. Reigner and K. Blesch, Estimating the starting dose for entry into humans: principles and practice. Euro. J. Clin. Pharmacol. 2001; 57: 835–845.
75. C. Chuang-Stein and W. Shih, A note on the analysis of titration studies. Stats. Med. 1991; 10: 323–328.
76. J. Mandema, D. Hermann, W. Wang, T. Sheiner, M. Milad, R. Bakker-Arkema, and D. Hartman, Model-based development of gemcabene, a new lipid-altering agent. AAPS J. 2005; 7: E513–E522.
77. L. Jayaram, M. Duong, M. M. M. Pizzichini, E. Pizzichini, D. Kamada, A. Efthimiadis, and F. E. Hargreave, Failure of montelukast to reduce sputum eosinophilia in high-dose corticosteroid-dependent asthma. Euro. Respir. J. 2005; 25: 41–46.
78. M. M. Huycke, M. T. Naguib, M. M. Stroemmel, K. Blick, K. Monti, S. Martin-Munley, and C. Kaufman, A double-blind placebo-controlled crossover trial of intravenous magnesium sulfate for foscarnet-induced ionized hypocalcemia and hypomagnesemia in patients with AIDS and cytomegalovirus infection. Antimicrob. Agents Chemother. 2000; 44: 2143–2148.
79. P. D. Martin, M. J. Warwick, A. L. Dane, and M. V. Cantarini, A double-blind, randomized, incomplete crossover trial to assess the dose proportionality of rosuvastatin in healthy volunteers. Clin. Therapeut. 2003; 25: 2215–2224.
80. International Conference on Harmonization, E14: The Clinical Evaluation of QT/QTc Interval Prolongation and Proarrhythmic Potential for Non-Antiarrhythmic Drugs. 2005. Available: http://www.fda.gov/cder/guidance/.

FURTHER READING

International Conference on Harmonization, E9: Statistical Principles for Clinical Trials. 1998. Available: http://www.fda.gov/cder/guidance/.
DOUBLE-DUMMY
KENNETH F. SCHULZ
Family Health International, Research Triangle Park, North Carolina

1 INTRODUCTION

Blinding in research began over two centuries ago (1). Most researchers and readers grasp the concept. Unfortunately, beyond that general comprehension lies confusion. Terms such as ''single-blind,'' ''double-blind,'' and ''triple-blind'' mean different things to different people (2). Although clinical trial textbooks (3–5), clinical trial dictionaries (6, 7), and a new edition of Last's epidemiology dictionary (8) address blinding, they do not entirely clear the lexicographical fog (9). Investigators, textbooks, and published articles all vary greatly in their interpretations (2). In other words, terminological inconsistencies surround blinding, and double-dummy terms add a whole additional level of confusion. I will discuss blinding in general and then describe the relevance of double-dummy blinding.

2 POTENTIAL IMPACTS OF BLINDING

Perhaps most crucially, blinding aids in reducing differential assessment of outcomes (commonly termed information or ascertainment bias) prompted by knowledge of the group assignment of individuals being observed (3–5). For example, if unblinded outcome assessors believe a new intervention is better, they could record more ''charitable'' responses to that intervention. Indeed, in a placebo-controlled multiple sclerosis trial, the unblinded, but not the blinded, neurologists' assessments showed an apparent benefit of the intervention (10). Subjective outcomes (e.g., pain scores) present greater opportunities for bias (9). Even some outcomes considered as objective, such as myocardial infarction, can be fraught with subjectivity. In general, blinding becomes less important to reduce information bias as the outcomes become less subjective. Objective (hard) outcomes, such as death, leave little opportunity for bias. Less understood, blinding also operationally improves compliance and retention of trial participants and reduces biased supplemental care or treatment (sometimes called co-intervention) (3–5). Many potential advantages emanate from participants, investigators, and outcome assessors not knowing the intervention group to which the participants have been assigned (Table 1) (11).

Table 1. Potential Advantages of Successfully Blinding Participants, Investigators, and Assessors (11)

If blinded?            Potential advantages

Participants           • Less likely to have biased responses to the interventions
                       • More likely to comply with trial treatments
                       • Less likely to seek supplementary interventions
                       • More likely to continue in the trial, providing outcome data

Trial investigators    • Less likely to transfer their preferences or viewpoints to participants
                       • Less likely to differentially dispense co-interventions
                       • Less likely to differentially alter dosage
                       • Less likely to differentially withdraw participants
                       • Less likely to differentially dissuade participants from continuing in the trial

Assessors              • Less likely to have biases influence their outcome assessments

3 ''DOUBLE-BLINDING'' DEFINED

The terminology ''double-blind'' (or double-mask) usually means that trial participants, investigators (usually health care providers), and assessors (those collecting outcome data) all remain oblivious to the intervention assignments throughout the trial (9) so that they will not be influenced by that knowledge. Given that three groups are involved, ''double-blind'' appears misleading. In medical research, however, an investigator also frequently assesses, so, in that instance, the terminology accurately refers to two individuals. When I use ''double-blind'' or its derivatives in this article, I mean that steps have been taken to blind participants, investigators, and assessors. In reporting randomized controlled trials (RCTs), authors should explicitly state what steps were taken to keep whom blinded, as clearly stated in the CONSORT guidelines.

4 PLACEBOS AND BLINDING

Interventions (treatments) at times have no effect on the outcomes being studied (9). When an ineffective intervention is administered to participants in the context of a well-designed RCT, however, beneficial effects sometimes occur on participants' attitudes, which in turn influence outcomes (3). Researchers refer to these phenomena as the ''placebo effect.'' A placebo refers to a pharmacologically inactive agent that investigators administer to participants in the control group in a trial (9). The use of a placebo control
group balances the placebo effect in the treatment group, allowing for the independent assessment of the treatment effect. Although placebos may have effects mediated through psychological mechanisms, they are administered to participants in a trial because they are otherwise ''inactive.'' Although the effect of placebos is contentious (12), when assessing the effects of a proposed new treatment for a condition for which no effective treatment already exists, the widespread view remains that placebos should be administered, whenever possible, to participants in the control group (3, 4). Placebos are generally used in trials of drugs, vaccines, and other medicinal interventions, but can sometimes also be used in trials of procedures, such as ultrasound, acupuncture, and, occasionally, surgery.
5 DOUBLE-DUMMY BLINDING
As most double-blinding involves drugs, I confine my comments to such instances. When an effective standard treatment exists, it is usually used in the control group for comparison against a new treatment (9). Thus, trialists compare two active treatments. In that circumstance, they usually have three options to double-blind their trial. First, they could obtain the drugs in raw form and package the two treatments identically, such as in capsules or pills. Preferably, taste would also be equilibrated. Participants
would simply take one normal-sized capsule or pill. Frequently, however, this approach presents production problems. Pharmaceutical manufacturers seldom provide their drugs in any form other than the standard formulations. Moreover, even if they did, generating different formulations might raise objections with government regulatory bodies, such as the U.S. Food and Drug Administration (FDA), as to the equivalent bioavailability of the created formulations, even with drugs with approved formulations (13). Erasing those objections probably involves more research, more expense, and more time. This identical packaging approach frequently presents formidable production obstacles. Second, the standard formulation drugs could be encapsulated in a larger capsule, which alleviates some production concerns, but, again, may trigger equivalent bioavailability questions (13), which generates additional research expense and delays. Moreover, although the participants would only take one capsule, it may be so large as to discourage ingestion. Furthermore, some encapsulation systems allow the participant to open the capsules thereby breaking the blind (13). This encapsulation approach still involves production difficulties and also presents compliance impediments and unblinding opportunities. Third, they could conduct a double-dummy (double-placebo) design where participants receive the assigned active drug and the
placebo matched to the comparison drug. The trial involves two active drugs and two matching placebos. For example, in comparing two agents, one in a blue capsule and the other in a red capsule, the investigators would acquire blue placebo capsules and red placebo capsules. Then both treatment groups would receive a blue and a red capsule, one active and one inactive. This option protects the double-blinding better than the second option and as well as the first. Unlike the other two options, it does not raise equivalent bioavailability questions nor does it involve production problems, delays, and costs. The only downside tends to be that participants take more pills, which could hurt enrollment or compliance. However, when examined, investigators found minimal impact on enrollment (13). Pragmatically, investigators should find fewer procedural problems with double-dummy blinding. No wonder that this approach appears the most frequently used of the three options. For simplicity of presentation, most of the discussions focused on drugs in capsule, pill, or tablet formulations. The concepts, however, extend easily to other formulations, such as intravenous fluids and ampoules administered through injections or intravenous drips (Panel 1). Where double-dummy blinding becomes difficult to implement are those situations in which blinding itself is difficult under any circumstances, such as in surgical trials. Unfortunately, some authors have disparaged placebos. Many of those efforts have been misguided. Placebos have served as the scapegoat for the problem of having an inactive treatment control group when ethically an active treatment control group is indicated because an effective treatment exists. Inappropriate control groups are the problem, not placebos. Thus, placebos emerge as critical to double-blinding in randomized trials. Inescapably, placebos are usually scientifically necessary for double-blinding, regardless of whether the control group is active or inactive. In trials with a comparison group receiving no active treatment, placebos have obvious importance. That manifestation fits the common paradigm of placebo
usage. However, as displayed above, placebos also frequently emerge as necessary if a trial compares two or more active treatments. If those treatments differ (for example, in shape, size, weight, taste, or color), the double-dummy technique (using two placebos) nearly always indispensably serves methodological and production concerns.

5.1 Panel 1: Examples of Descriptions of ''Double-Dummy'' Blinding

''The patients were allocated to double-dummy treatment with dalteparin 100 IU/kg subcutaneously twice a day (Fragmin, Pharmacia and Upjohn, Stockholm, Sweden) and placebo tablets every day, or aspirin tablets 160 mg every day (Albyl-E, Nycomed Pharma, Oslo, Norway) and placebo ampoules subcutaneously twice a day'' (14).

''To maintain masking, each patient received two simultaneous infusions, one active and one placebo. The placebo infusion regimen was identical to its respective active counterpart'' (15).

REFERENCES

1. T. J. Kaptchuk, Intentional ignorance: a history of blind assessment and placebo controls in medicine. Bull. Hist. Med. 1998; 72: 389–433. 2. P. J. Devereaux, B. J. Manns, W. A. Ghali, H. Quan, C. Lacchetti, V. M. Montori et al., Physician interpretations and textbook definitions of blinding terminology in randomized controlled trials. JAMA 2001; 285: 2000–2003. 3. S. J. Pocock, Clinical Trials: A Practical Approach. Chichester: Wiley, 1983. 4. C. L. Meinert, Clinical Trials: Design, Conduct, and Analysis. New York: Oxford University Press, 1986. 5. L. Friedman, C. Furberg, and D. DeMets, Fundamentals of Clinical Trials. St. Louis, MO: Mosby, 1996. 6. C. L. Meinert, Clinical Trials Dictionary. Baltimore, MD: The Johns Hopkins Center for Clinical Trials, 1996. 7. S. Day, Dictionary for Clinical Trials. Chichester: Wiley, 1999. 8. J. M. Last (ed.), A Dictionary of Epidemiology. Oxford: Oxford University Press, 2001.
9. K. F. Schulz, I. Chalmers, and D. G. Altman, The landscape and lexicon of blinding in randomized trials. Ann. Intern. Med. 2002; 136: 254–259. 10. J. H. Noseworthy, G. C. Ebers, M. K. Vandervoort, R. E. Farquhar, E. Yetisir, and R. Roberts, The impact of blinding on the results of a randomized, placebo- controlled multiple sclerosis clinical trial. Neurology 1994; 44: 16–20. 11. K. F. Schulz and D. A. Grimes, Blinding in randomised trials: hiding who got what. Lancet 2002; 359: 696–700. 12. A. Hrobjartsson and P. C. Gøtzsche, Is the placebo powerless? An analysis of clinical trials comparing placebo with no treatment. N. Engl. J. Med. 2001; 344: 1594–1602. 13. B. K. Martin, C. L. Meinert, and J. C. Breitner, Double placebo design in a prevention trial for Alzheimer’s disease. Control Clin. Trials 2002; 23: 93–99. 14. E. Berge, M. Abdelnoor, P. H. Nakstad, and P. M. Sandset, Low molecular-weight heparin versus aspirin in patients with acute ischaemic stroke and atrial fibrillation: a double-blind randomised study. HAEST Study Group. Heparin in acute embolic stroke trial. Lancet 2000; 355: 1205–1210. 15. F. Follath, J. G. Cleland, H. Just, J. G. Papp, H. Scholz, K. Peuhkurinen et al., Efficacy and safety of intravenous levosimendan compared with dobutamine in severe low-output heart failure (the LIDO study): a randomised double-blind trial. Lancet 2002; 360: 196–202.
DRIFT (FOR INTERIM ANALYSES)
DAVID M. REBOUSSIN
Wake Forest University School of Medicine, Division of Public Health Sciences, Winston-Salem, North Carolina

1 INTRODUCTION

An early step in the development of group sequential testing as it is commonly applied to phase III clinical trials was the investigation of repeated significance tests (1). This led to consideration of sequentially recruited groups of equal size (2,3) and then to procedures that allow some flexibility in the number and timing of interim analyses as a trial progresses using the spending function approach (4–6). We will focus on the spending function approach, but similar development can be applied to other approaches to group sequential testing (7,8).

Brownian motion is a stochastic process that starts at zero and has independent, normally distributed increments (9). Group sequential testing is implemented using computed probabilities associated with a standard Brownian motion process, B(t), with mean zero, observed at times 0 < t1, . . . , tK ≤ 1. The joint distribution of (B(t1), . . . , B(tK)) is N(θt, Σ), where

Σkl = min(tk, tl)    (1)

The parameter θ denotes the drift or mean of the Brownian motion process over time. Figure 1 illustrates an example of Brownian motion over time with a positive drift. A set of (two-sided symmetric) bounds ±b1, . . . , ±bK associated with the values B(t1), . . . , B(tK) can be determined iteratively. B(t1) has a normal distribution with mean zero and variance t1, so that calculation of b1 is straightforward; however, determination of subsequent bounds must account for not having stopped earlier. For k > 1, bk solves the equation

Pr{|B(tj)| ≤ bj, j = 1, . . . , k − 1; |B(tk)| > bk} = αk    (2)

The solution can be found by numerical integration, and a variety of software for doing so is available (10–13). Slud and Wei (14) developed an equation similar to Equation (2). Lan and DeMets related the αk to time through an alpha spending function α*(t). In the spending function approach, an increasing function α*(t) is defined such that α*(0) = 0, α*(1) = α, the overall type 1 error rate, and αk = α*(tk) − α*(tk−1). In this way, a certain proportion of the overall type 1 error is ''spent'' at each interim analysis.

To relate the Brownian motion process to accumulating evidence in a clinical trial directly, we use an analogy with partial sums (5,15,16). Consider N independent normal variables X1, X2, . . . , XN with unknown mean δ and variance equal to one. The sum of the first nk, S(nk), is distributed normally with mean nkδ and variance nk, and the Brownian motion process at tk is distributed normally with mean tkθ and variance tk. By equating the proportion of observations in the partial sum and the Brownian motion time scale, tk = nk/N, the relation of means is δ = θ/√N, so that B(tk) and S(nk)/√N have the same distribution. Also, the joint distribution of S(n1)/√N, . . . , S(nK)/√N is multivariate normal with the same covariance described by Equation (1). The Z statistics used for testing at each interim analysis are also related to the Brownian motion and partial sums. If Z(tk) is the standard normal summary statistic at interim analysis k, then B(tk) and Z(tk)√tk have the same distribution and covariance structure. Note that this covariance structure holds for a wide variety of trials with different study outcomes, including any for which the summary is a maximum likelihood estimate (17,18). For factorial or crossover designs or studies with multiple primary endpoints, the situation is more complex, but the analogy holds for any trial that can be summarized by a single Z statistic. Thus, there is a correspondence among a Brownian motion process B(t), interim test statistics, and partial sums representing accumulating data in the study.
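As a small numerical illustration of the spending-function idea introduced above, the sketch below evaluates two commonly used Lan-DeMets spending functions (an O'Brien-Fleming-like form and a Pocock-like form) at five equally spaced information times and prints the type 1 error spent at each look, αk = α*(tk) − α*(tk−1). These are standard functional forms rather than the only choices; computing the corresponding bounds bk still requires the recursive numerical integration of Equation (2), as implemented in the software cited above (10–13).

```python
import numpy as np
from scipy.stats import norm

def obf_spend(t, alpha=0.05):
    """O'Brien-Fleming-like spending: alpha*(t) = 2 - 2*Phi(z_(1-alpha/2)/sqrt(t))."""
    t = np.asarray(t, dtype=float)
    return 2.0 - 2.0 * norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(t))

def pocock_spend(t, alpha=0.05):
    """Pocock-like spending: alpha*(t) = alpha * ln(1 + (e - 1) * t)."""
    t = np.asarray(t, dtype=float)
    return alpha * np.log(1.0 + (np.e - 1.0) * t)

info_times = np.array([0.2, 0.4, 0.6, 0.8, 1.0])   # five equally spaced looks
for name, fn in [("O'Brien-Fleming-like", obf_spend), ("Pocock-like", pocock_spend)]:
    cumulative = fn(info_times)
    increments = np.diff(np.concatenate(([0.0], cumulative)))  # alpha_k at each look
    print(name, np.round(increments, 4), "total spent =", round(cumulative[-1], 4))
```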
A technical point associated with this description is that the time t must be understood as ''information time'' (16,19,20). That is, although actual analyses occur in calendar time—a certain number of days since the start of randomization—the time scale that best describes the unfolding evidence about a treatment effect is the proportion of the expected statistical information that has accumulated when the analysis is conducted. This idea can be extended to various designs (19), and in many situations, statistical information accrues more or less uniformly in calendar time, so that the distinction is not critical. However, being mindful of the difference between calendar time and information time can prevent confusion in certain situations, for example, in survival studies when the event rate is lower than anticipated and calendar time runs out before the expected number of events has occurred.

When determining bounds, the computation is done assuming B(t) has drift zero, as it does under the null hypothesis of no treatment effect. However, the same equation can be used to determine probabilities associated with nonzero drift means given a fixed set of bounds: the drift corresponding to a given cumulative exit probability can be determined. This is the basis for determining the effect of a specific monitoring plan on sample size requirements, estimation, and computation of the ''conditional power,'' all of which are detailed below.

2 SAMPLE SIZE DETERMINATION FOR TRIALS INVOLVING GROUP SEQUENTIAL TESTS

With the background above, sample size determinations that take into account the group sequential design can be done by relating the test statistic summarizing the treatment effect to a Brownian motion (5,13,15). Consider the required nonsequential sample size per group for a comparison of means in two normal populations, which is

Nfixed = (z1−α/2 + z1−β)² (2σ²/(µT − µC)²)

When there is no sequential monitoring, the drift corresponding to power 1 − β is θfixed = z1−α/2 + z1−β. For two-sided tests with alpha = 0.05 and 80%, 85%, and 90% power, θfixed is approximately 2.8, 3, and 3.25, respectively. For a given set of interim analysis times
and bounds, the corresponding drift can be computed as described above and inserted in place of z1−α/2 + z1−β so that

Nseq = θseq² (2σ²/(µT − µC)²)
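A minimal sketch of this adjustment, under the assumption that the sequential drift is already known (here 3.28, the value used in the worked example that follows; in practice it would come from group sequential software such as the programs cited in (13)):

```python
from scipy.stats import norm

def fixed_drift(alpha=0.05, power=0.90):
    """Drift z_(1-alpha/2) + z_(1-beta) for a fixed (nonsequential) design."""
    return norm.ppf(1 - alpha / 2) + norm.ppf(power)

def n_per_group(drift, sigma2, delta):
    """Per-group sample size for comparing two normal means."""
    return drift ** 2 * 2 * sigma2 / delta ** 2

theta_fixed = fixed_drift()   # about 3.24 (the text rounds this to 3.25)
theta_seq = 3.28              # five equally spaced O'Brien-Fleming-type looks (from the text)

print(round(n_per_group(theta_seq, sigma2=1.0, delta=0.5)))   # 86, as in the example
print(round((theta_seq / theta_fixed) ** 2, 3))               # inflation over the fixed design, a few percent
```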
For example, the drift corresponding to five equally spaced, two-sided O'Brien-Fleming-type bounds for alpha = 0.05 and 90% power is 3.28, so if σ² = 1 and µT − µC = 0.5, Nseq = 3.28² × (2/0.25) = 86. Note that the sequential design requires a slightly larger sample size, and the increase is the square of the ratio of the sequential drift to the fixed drift. In fact, the details of the relationship between design parameters and the sequential drift are not essential in order to adjust a fixed sample size to account for sequential monitoring. The drift associated with the same design monitored sequentially with given bounds and times can be computed using the equation above and available software (13). Then the required proportional increase in sample size is just the squared ratio of the drift under the planned sequential analyses to the drift for the fixed design. In the example above, the increase is (3.28/3.25)² or about 2%. This small increase reflects the early conservatism of the O'Brien-Fleming strategy, which is often desirable. Minor deviations in the number and timing of interim analyses have little effect on the drift (5), so it is reasonable to design a study based on an anticipated monitoring plan and expect that the actual monitoring will have negligible effect on power. For sample size determination, then, understanding that the drift of an underlying Brownian motion is related to the expectation of interim Z statistics provides a straightforward way to adjust the sample size for an anticipated sequential monitoring plan.

3 ESTIMATION AFTER GROUP SEQUENTIAL TESTING

There are a variety of approaches to estimation and bias reduction in the context of a group sequential test, and some refer to an underlying Brownian motion with nonzero drift (21–27). As an example that is closely related to the discussion of sample size in the
previous section, we can consider the median unbiased estimator (23). This is related to a confidence interval procedure that defines the 1 − γ lower confidence limit as the smallest value of the drift for which an outcome at least as extreme as the one observed has probability of at least γ, and likewise for the upper limit (22). Equation (2) in this context is used to determine the value of the drift. The last bound bk is replaced with the final observed Z statistic, and αk is replaced with γ or 1 − γ. The median unbiased estimator sets γ = αk = 0.5, and for a 95% confidence interval, the upper limit sets αk = 0.025 and the lower αk = 0.975. Thus, some approaches to estimation make use of the drift of an underlying Brownian motion to form confidence intervals and point estimates.
4 STOPPING FOR FUTILITY

The drift of a Brownian motion process is also a useful model for assessing probabilities associated with the decision to terminate a trial when interim analyses show that the null hypothesis is unlikely to be rejected. Lachin (28) provides a detailed review of methods based on ''conditional power,'' or the probability that the final study result will be statistically significant given the accumulated data and an assumption about future trends. This approach has also been described as stochastic curtailment (29–31). Lan and Wittes (32) describe the B-value, which is related to a Brownian motion process and provides a straightforward way to compute conditional power. The B-value is a translation of the interim Z statistic to a scale on which it can be assessed with respect to the drift of the Brownian motion underlying the study's accumulating data. At time t, the B-value is B(t) = Z(t)√t, and the time remaining before the planned end of the study is 1 − t. The entire study can be thought of as decomposed into two pieces before and after t: B(t) and U(t) = B(1) − B(t). The distribution of U(t), representing the as-yet unobserved data, is normal and independent of B(t) with mean θ × (1 − t) and variance 1 − t. B(t) + U(t) equals B(1), which equals Z(1), the Z value at the end of the study. Given the observed B(t), the probability that B(t) + U(t) exceeds a critical value can be computed under various assumptions for θ. Typically, the assumptions of particular interest are θ = 0, θ = B(t)/t (the current best estimate of the drift), and θ = θdes (the value assumed for the study design). For example, consider a study for which t = 0.6 and Z(0.6) = 1.28. Then B(0.6) = 0.99, and U(t) has a mean of θ × 0.4 and a variance of 0.4. The current estimate of drift is B(0.6)/0.6 = 1.65, so under this assumption, U(t) has a mean of 0.66. Using the current estimate for drift, the conditional probability that the final Z statistic, Z(1), will exceed 1.96 given that Z(0.6) = 1.28 is Pr(Z > (1.96 − 0.99 − 0.66)/√0.4) = Pr(Z > 0.49) = 0.31, or 31%. In this way, use of the underlying Brownian motion drift provides a simple way of assessing partway through the study the likelihood of a statistically significant final result.
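The arithmetic in the example just given is easy to reproduce. The sketch below computes conditional power under any assumed drift and evaluates it at the three standard choices (null, the current trend B(t)/t, and a design value); the design drift of 3.24 used in the last line is an arbitrary value for illustration rather than one taken from the text.

```python
from math import sqrt
from scipy.stats import norm

def conditional_power(z_t, t, theta, z_crit=1.96):
    """P(Z(1) > z_crit | Z(t) = z_t) under an assumed drift theta (one-sided)."""
    b_t = z_t * sqrt(t)                  # observed B-value
    mean_u = theta * (1 - t)             # mean of the unobserved increment U(t)
    var_u = 1 - t                        # variance of U(t)
    return 1 - norm.cdf((z_crit - b_t - mean_u) / sqrt(var_u))

z_obs, t_obs = 1.28, 0.6
trend = z_obs * sqrt(t_obs) / t_obs      # current drift estimate B(t)/t, about 1.65
for label, theta in [("null", 0.0), ("current trend", trend), ("assumed design drift", 3.24)]:
    print(label, round(conditional_power(z_obs, t_obs, theta), 2))
```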
5 CONCLUSION
Calculation of bounds for a group sequential procedure and the associated testing can be done without reference to Brownian motion or drift. However, the connection between interim test statistics and Brownian motion drift is tractable and facilitates understanding of such useful elaborations as sample size determination, estimation, and stopping for futility. REFERENCES 1. P. Armitage, C. K. McPherson, and B. C. Rowe, Repeated significance tests on accumulating data. J. R. Stat. Soc. Series A (General) 1969; 132(2):235–244. 2. P. C. O’Brien and T. R. Fleming, A multiple testing procedure for clinical trials. Biometrics 1979 Sep; 35(3):549–556. 3. S. J. Pocock, Group sequential methods in design and analysis of clinical-trials. Biometrika 1977; 64(2):191–200. 4. K. K. G. Lan and D. L. DeMets, Discrete sequential boundaries for clinical-trials. Biometrika 1983; 70(3):659–663. 5. K. Kim and D. L. DeMets, Design and analysis of group sequential tests based on the type i error spending rate function. Biometrika 1987 Mar; 74(1):149–154. 6. D. L. DeMets and K. K. Lan, Interim analysis: the alpha spending function approach. Stat. Med. 1994 Jul 15; 13(13-14):1341–1352.
7. S. S. Emerson and T. R. Fleming, Symmetric group sequential test designs. Biometrics 1989 Sep; 45(3):905–923. 8. J. Whitehead and I. Stratton, Group sequential clinical trials with triangular continuation regions. Biometrics 1983 Mar; 39(1):227–236. 9. P. Billingsley, Probability and Measure, 3rd ed. New York: Wiley, 1995.
10. Cytel Inc., East v. 5: Advanced Clinical Trial Design, Simulation and Monitoring System. Cambridge, MA: Cytel Inc., 2007. 11. MSP Research Unit, PEST 4: Operating Manual. Reading, UK: The University of Reading, 2000. 12. Insightful Corp., S + SeqTrial. Seattle, WA: Insightful Corp., 2006. 13. D. M. Reboussin, D. L. DeMets, K. Kim, and K. K. G. Lan, Computations for group sequential boundaries using the Lan-DeMets spending function method. Controlled Clin. Trials 2000; 21(3):190–207. 14. E. Slud and L. J. Wei, Two-sample repeated significance tests based on the modified Wilcoxon statistic. J. Am. Stat. Assoc. 1982 Dec; 77(380):862–868. 15. K. Kim and D. L. DeMets, Sample size determination for group sequential clinical trials with immediate response. Stat. Med. 1992 Jul; 11(10):1391–1399. 16. K. K. Lan and D. M. Zucker, Sequential monitoring of clinical trials: the role of information and Brownian motion. Stat. Med. 1993 Apr 30; 12(8):753–765. 17. C. Jennison and B. W. Turnbull, Groupsequential analysis incorporating covariate information. J. Am. Stat. Assoc. 1997 Dec; 92(440):1330–1341. 18. D. O. Scharfstein, A. A. Tsiatis, and J. M. Robins, Semiparametric efficiency and its implication on the design and analysis of group-sequential studies. J. Am. Stat. Assoc. 1997 Dec; 92(440):1342–1350. 19. K. K. G. Lan, D. M. Reboussin, and D. L. DeMets, Information and information fractions for design and sequential monitoring of clinical-trials. Commun. Stat. Theory Methods 1994; 23(2):403–420. 20. K. Lan and D. L. DeMets, Group sequential procedures: Calendar versus information time. Stat. Med. 1989 Oct; 8(10):1191–1198. 21. C. Jennison and B. W. Turnbull, Interim analyses: The repeated confidence interval approach. J. R. Stat. Soc. Series B (Methodological) 1989; 51(3):305–361.
22. K. Kim and D. L. DeMets, Confidence intervals following group sequential tests in clinical trials. Biometrics 1987 Dec; 43(4):857–864. 23. K. Kim, Point estimation following group sequential tests. Biometrics 1989 Jun; 45(2):613–617. 24. M. LeBlanc and J. Crowley, Using the bootstrap for estimation in group sequential designs: An application to a clinical trial for nasopharyngeal cancer. Stat. Med. 1999 Oct 15; 18(19):2635–2644. 25. Z. Q. Li ZQ and D. L. DeMets, On the bias of estimation of a Brownian motion drift following group sequential tests. Statistica Sinica 1999 Oct; 9(4):923–937. 26. A. Y. Liu and W. J. Hall, Unbiased estimation following a group sequential test. Biometrika 1999 Mar; 86(1):71–78. 27. J. C. Pinheiro and D. L. DeMets, Estimating and reducing bias in group sequential designs with Gaussian independent increment structure. Biometrika 1997 Dec; 84(4):831–845. 28. J. M. Lachin, A review of methods for futility stopping based on conditional power. Stat. Med. 2005 Sep 30; 24(18):2747–2764. 29. M. Halperin, K. K. Lan, E. C. Wright, and M. A. Foulkes, Stochastic curtailing for comparison of slopes in longitudinal studies. Control Clin. Trials 1987 Dec; 8(4):315–326. 30. K. K. G. Lan, D. L. DeMets, and M. Halperin, more flexible sequential and nonsequential designs in long-term clinicaltrials. Commun. Stat. Theory Methods 1984; 13(19):2339–2353. 31. M. Halperin, K. K. Lan, J. H. Ware, N. J. Johnson, and D. L. DeMets, An aid to data monitoring in long-term clinical trials. Control Clin Trials 1982 Dec; 3(4):311–323. 32. K. K. Lan and J. Wittes, The B-value: a tool for monitoring data. Biometrics 1988 Jun; 44(2):579–585.
CROSS-REFERENCES

Group Sequential Designs; Interim Analyses; Stopping Boundaries; Alpha-spending Function; Conditional Power
DRUG DEVELOPMENT

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/handbook/develop.htm) by Ralph D'Agostino and Sarah Karl.
[Figure: overview of new drug development, from pre-clinical research (synthesis and purification; short- and long-term animal testing) through clinical studies (Phases 1, 2, and 3, overseen by institutional review boards) to NDA review, showing industry and FDA time, the IND submission, NDA submission, and review decision, encouraged sponsor/FDA meetings and advisory committees, and the accelerated development/review, treatment IND, parallel track, and subpart E early-access routes.]
Under the Food and Drug Administration (FDA) requirements, a sponsor first must submit data that shows that the drug is reasonably safe for use in initial, small-scale clinical studies. Depending on whether the compound has been studied or marketed previously, the sponsor may have several options for fulfilling this requirement: (1) compiling existing nonclinical data from past in vitro laboratory or animal studies on the compound, (2) compiling data from previous clinical testing or marketing of the drug in the United States or another country whose
population is relevant to the U.S. population, or (3) undertaking new preclinical studies designed to provide the evidence necessary to support the safety of administering the compound to humans. During preclinical drug development, a sponsor evaluates the toxic and pharmacologic effects of the drug through in vitro and in vivo laboratory animal testing. Genotoxicity screening is performed, as well as investigations on drug absorption and metabolism, the toxicity of the metabolites of the drug, and the speed with which the drug and its metabolites are excreted from the body. At the preclinical stage, the FDA generally will ask, at a minimum, that sponsors: (1) develop
a pharmacologic profile of the drug, (2) determine the acute toxicity of the drug in at least two species of animals, and (3) conduct shortterm toxicity studies ranging from 2 weeks to 3 months, depending on the proposed duration of use of the substance in the proposed clinical studies. The research process is complicated, time consuming, and costly, and the end result is never guaranteed. Literally hundreds and sometimes thousands of chemical compounds must be made and tested in an effort to find one that can achieve a desirable result. The FDA estimates that it takes approximately 8.5 years to study and test a new drug before it can be approved for the general public. This estimate includes early laboratory and animal testing and later clinical trials that use human subjects. No standard route exists through which drugs are developed. A pharmaceutical company may decide to develop a new drug aimed at a specific disease or medical condition. Sometimes, scientists choose to pursue an interesting or promising line of research. In other cases, new findings from university, government, or other laboratories may point the way for drug companies to follow with their own research. New drug research starts with an understanding of how the body functions, both normally and abnormally, at its most basic levels. The questions raised by this research help determine a concept of how a drug might be used to prevent, cure, or treat a disease or medical condition. This concept provides the researcher with a target. Sometimes scientists find the right compound quickly, but usually hundreds or thousands must be screened. In a series of test tube experiments called assays, compounds are added one at a time to enzymes, cell cultures, or cellular substances grown in a laboratory. The goal is to find which additions show some effect. This process may require testing hundreds of compounds because some may not work but will indicate ways of changing the chemical structure of the compound to improve its performance. Computers can be used to simulate a chemical compound and design chemical structures that might work against the compound. Enzymes attach to the correct site on the
membrane of a cell, which causes the disease. A computer can show scientists what the receptor site looks like and how one might tailor a compound to block an enzyme from attaching there. Although computers give chemists clues as to which compounds to make, a substance still must be tested within a living being. Another approach involves testing compounds made naturally by microscopic organisms. Candidates include fungi, viruses, and molds, such as those that led to penicillin and other antibiotics. Scientists grow the microorganisms in what is known as a ‘‘fermentation broth,’’ with one type of organism per broth. Sometimes, 100,000 or more broths are tested to see whether any compound made by a microorganism has a desirable effect. In animal testing, drug companies make every effort to use as few animals as possible and to ensure their humane and proper care. Generally, two or more species (one rodent, one nonrodent) are tested because a drug may affect one species differently than another. Animal testing is used to measure how much of a drug is absorbed into the blood, how it is broken down chemically in the body, the toxicity of the drug and its breakdown products (metabolites), and how quickly the drug and its metabolites are excreted from the body. Short-term testing in animals ranges in duration from 2 weeks to 3 months, depending on the proposed use of the substance. Long-term testing in animals ranges in duration from a few weeks to several years. Some animal testing continues after human tests begin to learn whether long-term use of a drug may cause cancer or birth defects. Much of this information is submitted to FDA when a sponsor requests to proceed with human clinical trials. The FDA reviews the preclinical research data and then makes a decision as to whether to allow the clinical trials to proceed. The new drug application (NDA) is the vehicle through which drug sponsors formally propose that the FDA approve a new pharmaceutical for sale in the United States. To obtain this authorization, a drug manufacturer submits in an NDA nonclinical (animal) and clinical (human) test data and analyses,
drug information, and descriptions of manufacturing procedures. An NDA must provide sufficient information, data, and analyses to permit FDA reviewers to reach several key decisions, including:
• Whether the drug is safe and effective for its proposed use(s), and whether the benefits of the drug outweigh its risks;
• Whether the proposed labeling of the drug is appropriate, and, if not, what the label of the drug should contain; and
• Whether the methods used in manufacturing the drug and the controls used to maintain the quality of the drug are adequate to preserve the identity, strength, quality, and purity of the drug.
The purpose of preclinical work (animal pharmacology/toxicology testing) is to develop adequate data to undergird a decision that it is reasonably safe to proceed with human trials of the drug. Clinical trials represent the ultimate premarket testing ground for unapproved drugs. During these trials, an investigational compound is administered to humans and is evaluated for its safety and effectiveness in treating, preventing, or diagnosing a specific disease or condition. The results of this testing will comprise the single most important factor in the approval or disapproval of a new drug. Although the goal of clinical trials is to obtain safety and effectiveness data, the overriding consideration in these studies is the safety of the people in the trials. CDER monitors the study design and conduct of clinical trials to ensure that people in the trials are not exposed to unnecessary risks.
DRUG REGISTRATION AND LISTING SYSTEM (DRLS)
The Food and Drug Administration (FDA) attempted a comprehensive drug inventory for drug listings by establishing two voluntary programs. However, these two voluntary programs were unsuccessful. To make these efforts mandatory, the FDA instituted the Drug Listing Act of 1972; this regulatory policy is in the 21 Code of Federal Regulations (CFR) Part 207. The 21 CFR Part 207 addresses definitions, drug registration requirements, and drug listing requirements by FDA. This Act amended Section 510 of the Federal Food, Drug, and Cosmetic Act and defines the following applicable terms:
• The term Firm refers to a company engaged in the manufacture, preparation, propagation, compounding, or processing of a drug product.
• The term Drug Products refers to human drugs, veterinary drugs, and medicated animal feed premixes that include biological products, but do not include blood and blood components.
• The term Manufacturing and Processing refers to repackaging or otherwise changing the container, wrapper, or labeling of any drug product package in the distribution process from the original "maker" to the ultimate consumer.
1 REGISTRATION REQUIREMENTS
A firm must register all drug products (Domestic Manufacturers, Domestic Repackers, Domestic Labelers, and submissions for New Human Drug Application, New Animal Drug Application, Medicated Feed Application, Antibiotic Drug Application, and Establishment License Application to Manufacture Biological Products) whether or not they enter interstate commerce. All domestic distributors and foreign firms that import drug products into the United States must obtain a labeler code and must list all of their products.
2 LISTING REQUIREMENTS
All firms, unless exempted by the Act, are requested to list their commercially marketed drug products with FDA within 5 days after the beginning of operation. They are required to list/update their drug product listings twice a year (June and December). The initial listing and updates of a product are completed on form FDA 2657. Manufacturers are allowed to list the products for distributors on form FDA 2658. To assist the firms with the mandatory update in June, the Product Information Management Branch mails a Compliance Verification Report (CVR) to the firms. The CVR goes to all firms that have at least one prescription product listed with FDA. The firm is required to update the CVR and mail it back within 30 days.
3 REGISTRATION EXEMPTIONS
The following parties are exempt from registration: pharmacies, hospitals, and clinics that dispense drug products at retail; licensed physicians who use drug products solely for purposes related to their professional practice; and persons who use drug products solely for their professional needs, provided the products are not for sale.
4 REGISTRATION PROCESS
Firms can register by obtaining a Registration of Drug Establishment Form, FDA 2656, within 5 days after the beginning of operation or submission of an application. Firms are required to re-register annually by returning an Annual Registration of Drug Establishment Form, FDA 2656E, within 30 days after receiving it from the Product Information Management Branch.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/handbook/druglist.htm) by Ralph D’Agostino and Sarah Karl.
DRUG PACKAGING
The sponsor should ensure that the investigational product(s) (including active comparator(s) and placebo, if applicable) is characterized as appropriate to the stage of development of the product(s), is manufactured in accordance with any applicable Good Manufacturing Practice (GMP), and is coded and labeled in a manner that protects the blinding, if applicable. In addition, the labeling should comply with applicable regulatory requirement(s).
The sponsor should determine, for the investigational product(s), acceptable storage temperatures, storage conditions (e.g., protection from light), storage times, reconstitution fluids and procedures, and devices for product infusion, if any. The sponsor should inform all involved parties (e.g., monitors, investigators, pharmacists, storage managers) of these determinations.
The investigational product(s) should be packaged to prevent contamination and unacceptable deterioration during transport and storage.
In blinded trials, the coding system for the investigational product(s) should include a mechanism that permits rapid identification of the product(s) in case of a medical emergency, but does not permit undetectable breaks of the blinding.
If significant formulation changes are made in the investigational or comparator product(s) during the course of clinical development, the results of any additional studies of the formulated product(s) (e.g., stability, dissolution rate, bioavailability) needed to assess whether these changes would significantly alter the pharmacokinetic profile of the product should be available prior to the use of the new formulation in clinical trials.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
DRUG SUPPLY
The sponsor is responsible for supplying the investigator(s)/institution(s) with the investigational product(s). The sponsor should not supply an investigator/institution with the investigational product(s) until the sponsor obtains all required documentation [e.g., approval/favorable opinion from IRB (Institutional Review Board)/IEC (Independent Ethics Committee) and regulatory authority(ies)].
The sponsor should ensure that written procedures include instructions that the investigator/institution should follow for the handling and storage of investigational product(s) for the trial and documentation thereof. The procedures should address adequate and safe receipt, handling, storage, dispensing, retrieval of unused product from subjects, and return of unused investigational product(s) to the sponsor [or alternative disposition if authorized by the sponsor and in compliance with the applicable regulatory requirement(s)].
The sponsor should:
• Ensure timely delivery of investigational product(s) to the investigator(s).
• Maintain records that document shipment, receipt, disposition, return, and destruction of the investigational product(s).
• Maintain a system for retrieving investigational products and documenting this retrieval (e.g., for deficient product recall, reclaim after trial completion, expired product reclaim).
• Maintain a system for the disposition of unused investigational product(s) and for the documentation of this disposition.
In addition, the sponsor should:
• Take steps to ensure that the investigational product(s) are stable over the period of use.
• Maintain sufficient quantities of the investigational product(s) used in the trials to reconfirm specifications, should it become necessary, and maintain records of batch sample analyses and characteristics. To the extent stability permits, samples should be retained either until the analyses of the trial data are complete or as required by the applicable regulatory requirement(s), whichever represents the longer retention period.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
EASTERN COOPERATIVE ONCOLOGY GROUP (ECOG)
ROBERT J. GRAY
Dana-Farber Cancer Institute, Boston, Massachusetts
The Eastern Cooperative Oncology Group (ECOG) is one of ten NCI-funded Cooperative Groups. ECOG has been continuously in operation since the Cooperative Group program was established in 1955. During the period 2004–2007, an average of 75 ECOG-led protocols were open to accrual each year, of which 65 were therapeutic (17 Phase III); the others were variously prevention, symptom management, health practices, and laboratory studies. At any time, 50–60 studies are also closed to accrual but still in active follow-up before the primary analysis. Accrual to the ECOG-led therapeutic trials varied during this period from 2393 in 2005 to 5092 in 2007, and ECOG members also enrolled an average of 1580 therapeutic cases per year on studies coordinated by other groups. ECOG manages this large portfolio of studies with quite-limited grant funding. The major operational challenge for ECOG and the other cooperative groups has been conducting high-quality clinical trials as efficiently as possible. In this article, the history, structure, policies and procedures, and major accomplishments of ECOG are briefly described.
1 HISTORY
ECOG was founded in 1955 (the original name was the Eastern Cooperative Group in Solid Tumor Chemotherapy) under the leadership of Dr. C. G. Zubrod. At the time, development of standardized methodology for conducting multicenter clinical trials in oncology and for measuring effects of chemotherapy was needed. Methodology developed for early ECOG trials is described in Reference 1, including use of sealed envelopes for treatment assignment; collection of detailed data on "flow sheets," tumor measurement forms, and summary forms; and blinded central review of tumor measurement data to determine response and progression. That article also gives an early version of the well-known five-point ECOG performance status scale for classifying overall patient status. It is notable that these early studies did not include predefined criteria for the amount of increase and decrease in tumor measurements needed to establish response or progression but instead relied on a vote of the investigators based on blinded review of the data. Standardized criteria were later developed, and the ECOG criteria used for many years are given in Reference 2. During the 1970s and 1980s the ECOG Statistical Center was led by Dr. Marvin Zelen, first at the State University of New York at Buffalo and later at the Dana-Farber Cancer Institute. During this period, the Statistical Center included many leading biostatisticians who made numerous important contributions to the statistical methodology for clinical trials, including basic work on methods for analyzing time-to-event data, for sample size and power calculations, on randomization algorithms, and on methods for monitoring studies.
2 ORGANIZATION AND STRUCTURE
ECOG has a hierarchical membership structure with main institutions (usually academic centers) and affiliate institutions (usually community based). Also, separate community networks are funded through the Community Clinical Oncology Program (CCOP) of the NCI's Division of Cancer Prevention. The principal investigators of the main institutions and CCOPs form the governing body of the Group. They elect the Group chair and approve the Group's policies and procedures. The scientific program of ECOG is organized into disease-site committees (Breast, Gastrointestinal, Head and Neck, Lymphoma, Thoracic, Melanoma, Genitourinary, Leukemia, Myeloma, and a working group in Brain Tumors) and modality committees (such as surgery, radiation therapy, developmental therapeutics, and laboratory science).
Also, a separately funded program in prevention and cancer control research has been developed. The Group chair appoints the committee chairs, who then organize the scientific research program within each area and bring proposals for studies forward to the Group. An executive review committee reviews all study concepts and decides whether they are suitable for development in ECOG. All concepts and protocols must also be reviewed and approved by NCI through either the Cancer Therapy Evaluation Program or the Division of Cancer Prevention. A Group-wide meeting of the membership is held twice each year, at which each of the committees meets to discuss concepts for new studies and to review the progress of ongoing studies. ECOG has a Group Chair’s Office (in Philadelphia, PA), a combined Operations Office and Data Management Center (at Frontier Science and Technology Research Foundation in Brookline, MA) and a Statistical Office (at the Dana-Farber Cancer Institute). The head of the Statistical Center (the Group statistician) is appointed by the Group chair with the approval of the principal investigators. Approximately nine full-time-equivalent statisticians provide all statistical support for ECOG activities. These statisticians are assigned to work with the disease and modality committees. They collaborate actively with the investigators during the concept and protocol development process, and are responsible for developing the statistical design of the study and for reviewing all drafts of protocols. The statisticians also prepare the registration/randomization materials and collaborate with the data management staff on the development of the data collection instruments. After activation, statisticians monitor the progress of studies and adverse events and perform interim efficacy analyses when appropriate under the design. Reports are generated on all ongoing studies twice each year. Final reports are written by the statistician for each completed ECOG trial. Statisticians also work extensively on lab correlative studies conducted on the clinical trials and on investigator-initiated grants to support various aspects of ECOG activities (especially to provide support for lab studies).
3 PROCEDURES The Cancer Cooperative Groups have developed highly efficient systems for conducting clinical trials, but these systems also involve some compromises relative to standards in the pharmaceutical industry. A key difference is the ongoing relationship between ECOG and its member institutions; ECOG runs many studies simultaneously in the same network of centers, and ECOG studies are generally open to all member institutions. 3.1 Data Quality Control The focus in institutional monitoring in ECOG is on ensuring that the institution is following appropriate procedures and meeting acceptable data quality standards rather than on direct verification of all individual data items. Instead of relying on extensive (and expensive) on-site monitoring for each study to ensure complete accuracy of data, ECOG uses an audit system to verify that participating centers are performing well overall. All centers are audited at least once every 3 years, and an audit generally involves full review of at least 12% of the cases enrolled at a site as well as review of regulatory compliance and pharmacy procedures. Centers with deficiencies are required to submit written corrective action plans and may be reaudited more frequently or, in extreme cases, may be suspended. With only a limited proportion of cases audited at the institutions, ECOG also relies heavily on central review of data for internal consistency. Submitted data are checked for completeness and for invalid values, and extensive checks are made among submitted data items to ensure consistency of all data. Copies of limited source documentation, such as surgical pathology reports, are also collected and reviewed. For assessing response and progression, actual tumor measurements and lab results are required to provide documentation that protocol criteria have been met. A final review of the data coding is performed by the study chair. For the most part, ECOG study chairs are volunteers who are not paid for their time spent on ECOG studies. Available funding does not permit using paid independent clinical reviewers. Generally a blinded central radiology review is not
performed because of the time and expense required for collection and review of scans and other imaging studies. The extensive central review of reported measurements, together with the limited on-site auditing, has demonstrated acceptable results without central radiology review. 3.2 Adverse-Event Monitoring Timely monitoring of adverse events is a crucial part of any treatment trial. In ECOG, real-time monitoring occurs primarily through the expedited reporting of severe adverse events. Expedited reporting of certain classes of severe events is required by the NCI and FDA. Expedited reports are now submitted through an online system developed and maintained by NCI. The ECOG data management staff (and study chairs and statisticians) then can access the reports through the same system. A cumulative listing of expedited adverse-event reports is sent to the study chair, the study statistician, and the disease site toxicity monitor monthly if significant new events have been reported. They are each required to review the list and to determine whether any problems might require more review or other action. Adverse events are also routinely submitted on case report forms. Consistency between expedited reports and routine submission is checked as part of the data review process. Statisticians also prepare summary reports on the routine adverse-event data twice each year. These reports are made available to the study chairs, to the group membership, and, for Phase III studies, to the Data Monitoring Committee (DMC). For some studies (especially double-blind studies), more extensive information is also provided to the DMC. 3.3 Blinding of Treatment Until recent years, double-blind studies in oncology were rare. This rarity was partly because of the high level and sometimes characteristic nature of the toxicities and other side effects of many chemotherapy drugs. Also, many Phase III studies have had survival as their primary endpoint, and this endpoint is less subject to potential bias than other endpoints. With the advent of less-toxic
targeted therapies and increasing emphasis on endpoints such as progression-free survival, double-blind placebo-controlled studies have become more common. All registrations in ECOG now occur through a web-based registration system. On blinded studies, treatment assignments are prematched to drug identification numbers. At the time of randomization, the treatment assignment is determined using the randomization algorithm for the study, and then the assignment is matched to the drug identification. The treating center and the drug distribution center are only notified of the drug identification number. The links between the treatment codes and the drug identification numbers are kept in a database table that can only be accessed by senior database administrators and a few senior staff who need access for emergency unblinding and other reasons. All data review is performed blinded to treatment assignment. Statisticians obtain the treatment codes from the senior DBA when interim analyses are performed. Only pooled information on adverse events is made available to the group membership. Analyses by treatment arm are only presented to the DMC, until the DMC decides to release the information. Requests from institutions for unblinding treatment assignments are reviewed centrally by a committee of senior staff. Unblinding requires approval of a senior statistician and a senior clinician (usually the group statistician and the group executive officer, except in emergencies, if they are not available). Some studies permit routine unblinding of the treating physician at progression, if the information is generally needed for determining the next treatment (especially if the study involves crossover at progression). 3.4 Standardizing Terminology, Data Elements, and Criteria Standardization of data elements and terminology is important for efficient operations. As early as the mid-1980s, ECOG reached agreement with the Southwest Oncology Group (another NCI-funded Cooperative Group) on partial standardization of the data collected on breast cancer studies. Beginning in 1998, efforts at data standardization among
the Cooperative Groups were integrated into the NCI’s Common Data Elements (CDE) project. Efficient conduct of large studies requires that no unnecessary data items be collected; however, given the variety of diseases, treatments, modalities, endpoints, and ancillary scientific objectives, considerable variation in the data items is needed to address the objectives of a study. Consequently, the focus of the CDE project has been on improving efficiency through standardization of definitions and terminology. The CDE project has now become part of the NCI’s caBIG initiative, in the Vocabularies and Common Data Elements workspace (see https://cabig.nci.nih.gov). Standardization of evaluation criteria is another important issue for facilitating participation in studies led by different groups, for interpreting results across studies, and for combining data from multiple studies (e.g., in meta-analyses). Cooperative Groups have had a long history of standardizing criteria within their own studies (see, e.g., Reference 2), but they have not always considered standardization across groups. Recently, the NCI has taken a major role, coleading the development of the RECIST (3) solid tumor response criteria and developing the Common Terminology Criteria for Adverse Events (http://ctep.cancer.gov/reporting/ctc.html). Standardization of endpoints across studies and use of common terminology and definitions for different endpoints is another important issue for reporting of results. An important step in this direction for adjuvant breast cancer studies was recently taken by the Cooperative Group breast cancer committees (4). Although recognizing that different endpoints might be needed in different types of studies, they have given a common set of definitions and terminology for various possible endpoints. Whereas some issues are specific to breast cancer, such as the role of new ductal carcinoma in situ cancers in defining the primary endpoint for an adjuvant breast cancer study, some terminology could be applied much more broadly in studies of adjuvant therapy in other diseases. 3.5 Data Monitoring Committee ECOG typically has 20–30 Phase III studies being monitored at any time. Resource
limitations prevent setting up separate independent data monitoring committees for each study. Instead, ECOG has a single data monitoring committee (DMC) that monitors the entire portfolio of Phase III studies. This committee has nine voting members who are selected to have expertise in diverse areas of oncology and hematology. A majority of the voting members must have no affiliation with ECOG, and they must include an outside statistician and a consumer representative. The ECOG DMC meets regularly twice each year. Interim analyses are scheduled to occur in conjunction with the DMC meetings, using database cutoffs 8 weeks before the meeting date. Based on experience with Cooperative Group DMCs, it has been recommended that interim analyses of efficacy on Cooperative Group studies should be scheduled for every DMC meeting until the study reaches full information (5). Recent ECOG Phase III studies have often followed this recommendation, but it also requires considerable extra effort to have study databases cleaned for interim analyses twice each year. Another aspect of ECOG DMC operations is that the study statistician, rather than an independent outside statistician, performs the interim analysis. This aspect is contrary to the recommendations of the U.S. FDA ("Establishment and Operation of Clinical Trial Data Monitoring Committees," FDA Guidance Document, 2006, http://www.fda.gov/cber/gdlns/clintrialdmc.pdf). With the resource limitations of the Cooperative Groups, it is generally not feasible to have a second statistician funded to be sufficiently familiar with the study to perform meaningful analyses. The multiple levels of review required for protocol design changes within the disease committees, the Group's executive committee, and the NCI make manipulation of the design of the study on the basis of knowledge of interim results highly unlikely. However, the risk of errors in analysis and/or interpretation that could occur if the study statistician is not involved in the monitoring, because of a lack of detailed understanding of the protocol or of the study data collection and processing procedures, seems the greater risk. This approach does require that the study statistician be very careful not to give any kind of hint of interim study results
to the study investigators, but in ECOG's experience, this has not been a problem.
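The calendar-driven monitoring schedule described above, with efficacy looks at successive DMC meetings rather than at prespecified information fractions, is typically accommodated with an alpha-spending function that allocates the overall type I error across whatever information fractions the meeting dates produce. The sketch below is only an illustration of that bookkeeping, not ECOG's actual monitoring software: it is written in Python with scipy, the information fractions and overall alpha are hypothetical, and it reports spending increments under an O'Brien-Fleming-type Lan-DeMets spending function rather than exact group-sequential boundaries, which require joint-distribution calculations.

```python
# Illustrative sketch (not ECOG's actual monitoring code): cumulative alpha
# spent at calendar-driven interim looks under an O'Brien-Fleming-type
# Lan-DeMets spending function, alpha(t) = 2 - 2*Phi(z_{1-alpha/2} / sqrt(t)).
from scipy.stats import norm

ALPHA = 0.05  # overall two-sided type I error (hypothetical)

def obf_spent(t, alpha=ALPHA):
    """Alpha spent by information fraction t (0 < t <= 1)."""
    return 2.0 - 2.0 * norm.cdf(norm.ppf(1.0 - alpha / 2.0) / t ** 0.5)

# Hypothetical information fractions at successive semiannual DMC meetings.
info_fractions = [0.15, 0.35, 0.55, 0.75, 1.00]

previous = 0.0
for k, t in enumerate(info_fractions, start=1):
    spent = obf_spent(t)
    increment = spent - previous
    print(f"look {k}: information={t:.2f}  cumulative alpha={spent:.5f}  "
          f"increment={increment:.5f}")
    previous = spent
```

With this type of spending function almost no alpha is used at the early looks, which is one reason calendar-driven interim schedules of this kind cost little at the final analysis.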
4 MAJOR ACCOMPLISHMENTS
ECOG has made many contributions advancing the care of cancer patients. A few key recent results are establishing the unfavorable risk–benefit profile of autologous bone-marrow transplant therapy in breast cancer (6,7); establishing the benefit of the anti-CD20 monoclonal antibody rituximab in initial treatment of diffuse, aggressive non-Hodgkin's lymphoma (8); establishing the benefit of adding the anti-VEGF monoclonal antibody bevacizumab to chemotherapy in advanced lung cancer (9) and breast cancer (10) and in combination with oxaliplatin-based chemotherapy in advanced colorectal cancer (11); and establishing a new standard for initial treatment in multiple myeloma (12). Major studies currently accruing include Phase III studies of sorafenib in metastatic melanoma (E2603), of sorafenib or sunitinib for adjuvant renal cancer (E2805), of bevacizumab in combination with chemotherapy for adjuvant treatment of non-small cell lung cancer (E1505), of bevacizumab in combination with chemotherapy for adjuvant treatment of breast cancer (E5103), and a study of using the Oncotype DX genomic assay (13) to select patients for adjuvant chemotherapy in breast cancer (TAILORx). The NHL study E4494 of rituximab therapy (8) is an interesting case study of the complexity that can occur in monitoring and analyzing studies in oncology. While this study was underway, the GELA group in Europe released results of a similar study (14) with significantly better progression-free survival (PFS) and overall survival with rituximab. At that time, the ECOG DMC reviewed interim results from E4494 and the results from the GELA trial and decided that E4494 should continue, but the ECOG DMC also decided that they needed to review updated interim results every 6 months to reevaluate whether this recommendation continued to be appropriate, although only two interim analyses had been specified in the design. E4494 also involved an induction randomization between standard CHOP chemotherapy and
CHOP+rituximab and a maintenance therapy randomization of induction responders to rituximab versus observation, whereas the GELA trial only involved the induction randomization. The potential confounding of the effect of the maintenance treatment on the induction comparison was a significant problem for interpreting E4494. This problem led to the use of novel weighted analysis methods (see, e.g., Reference 15) for estimating the effect of induction rituximab in the absence of maintenance rituximab, although such analyses had not been specified prospectively. These analyses ultimately became part of the submission to the FDA for approval of rituximab for this indication. This study illustrates that when Phase III studies take an extended period to complete, plans developed at initiation can require significant modification. In such circumstances, it is important to have the flexibility to modify the original plans and to do so in accordance with sound statistical principles. The Cooperative Group Program is the premiere public system in the United States for conducting randomized comparative studies of cancer therapy. In this article, we have provided an overview of ECOG’s contributions to this program and its methods for achieving these results. REFERENCES 1. C. G. Zubrod, M. Schneiderman, E. Frei, C. Brindley, G. L. Gold, B. Shnider, R. Oviedo, J. Gorman J, R. Jones, Jr., U. Jonsson, J. Colsky, T. Chalmers, B. Ferguson, M. Dederick, J. Holland, O. Selawry, W. Regelson, L. Lasagna, A. H. Owens, Jr. Appraisal of methods for the study of chemotherapy of cancer in man: comparative therapeutic trial of nitrogen mustard and triethylene thiophosphoramide, J. Chron. Dis. 1960; 11:7–33 2. M. M. Oken, R. H. Creech, D. C. Tormey, J. Horton, T. E. Davis, E. T. McFadden, P. P. Carbone, Toxicity and response criteria of the Eastern Cooperative Oncology Group, Am. J. Clin. Oncol. 1982; 5:649–655. 3. P. Therasse, S. G. Arbuck, E. A. Eisenhauer, J. Wanders, R. S. Kaplan, L. Rubinstein, J. Verweij, M. V. Glabbeke, A. T. van Oosterom, M. C. Christian, and S. G. Gwyther. New guidelines to evaluate the response to treatment in solid tumors, J. Natl. Cancer Inst. 2000; 92:205–216.
4. C. A. Hudis, W. E. Barlow, J. P. Costantino, R. J. Gray, K. L. Pritchard, J. A. W. Chapman, J. A. Sparano, S. Hunsberger, R. A. Enos, R. D. Gelber, and J. Zujewski, Proposal for standardized definitions for efficacy end points in adjuvant breast cancer trials: the STEEP system, J. Clin. Oncol. 2007; 25:2127–2132. 5. B. Freidlin, E. L. Korn, and S. L. George, Data monitoring committees and interim monitoring guidelines, Controlled Clinical Trials. 1999; 20:395–40. 6. E. A. Stadtmauer, A. O’Neill, L. J. Goldstein, P. A. Crilley, K. F. Mangan, J. N. Ingle, I. Brodsky, S. Martino, H. M. Lazarus, J. K. Erban, C. Sickles, and J. H. Glick, Conventional-dose chemotherapy compared with high-dose chemotherapy plus autologous hematopoietic stem cell transplantation for metastatic breast cancer, N. Engl. J. Med. 2000; 342(15):1069–1076. 7. M. S. Tallman, R. J. Gray, N. J. Robert, C. F. LeMaistre, C. Osborne, W. P. Vaughan, W. J. Gradishar, T. M. Pisansky, J. H. Fetting, E. M. Paietta, H. M. Lazarus, Conventional adjuvant chemotherapy with or without high-dose chemotherapy and autologous stem cell transplantation in high-risk breast cancer, N. Engl. J. Med. 2003; 349(1):17–26. 8. T. M. Habermann, E. Weller, V. A. Morrison, R. Gascoyne, P. A. Cassileth, J. B. Cohn, S. R. Dakhil, B. Woda, R. I. Fisher, B. A. Peterson, and S. J. Horning, RituximabCHOP versus CHOP alone or with maintenance rituximab in older patients with diffuse large B-cell lymphoma, J. Clin. Oncol. 2006; 24(19)3121–3127. 9. A. B. Sandler, R. J. Gray, M. C. Perry, J. R. Brahmer, J H. Schiller, A. Dowlati, R. Lilenbaum, and D. H. Johnson, Paclitaxel plus Carboplatin with or without Bevacizumab in advanced non-squamous nonsmall cell lung cancer: a randomized study of the Eastern Coopertive Oncology Group, N. Engl. J. Med. 2006; 355(24):542–2550. 10. K. D. Miller, M. Wang, J. Gralow, M. Dickler, M. A. Cobleigh, E. A. Perez, T. Shenkier, D. F. Cella, and N. E. Davidson, Paclitaxel plus bevacizumab versus paclitaxel alone for metastatic breast cancer, N. Engl. J. Med. 2007; 357:2666–2676. 11. B. J. Giantonio, P. J. Catalano, N. J. Meropol, P. J. O’Dwyer, E. P. Mitchell, S. R. Alberts, M. A. Schwartz, and A. B. Benson III, Bevacizumab in combination with Oxaliplatin, Fluorouracil, and Leucovorin (FOLFOX4) for previously treated metastatic colorectal cancer: results from the Eastern Cooperative
Oncology Group Study E3200, J. Clin. Oncol. 2007; 25:1539–1544. 12. S. V. Rajkumar, S. J. Jacobus, N. S. Callander, R. Fonseca, D. H. Vesole, M. E. Williams, R. Abonour, D. S. Siegel, and P. R. Greipp, Phase III trial of lenalidomide plus highdose dexamethasone versus lenalidomide plus low-dose dexamethasone in newly diagnosed multiple myeloma (E4A03): a trial coordinated by the Eastern Cooperative Oncology Group [abstract]. Proc. ASCO 2007. Abstract LBA8025. 13. S. Paik, S. Shak, G. Tang, C. Kim, J. Baker, M. Cronin, F. L. Baehner, M. G. Walker, D. Watson, F. Park, W. Hiller, E. R. Fisher, D. L. Wickerham, J. Bryant, and N. Wolmark, A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer, N. Engl. J. Med. 2004; 351:2817–2826. 14. B. Coiffier, E. Lepage, J. Briere, R. Herbrecht, H. Tilly, R. Bouabdallah,. P. Morel, E. van den Neste, G. Salles, P. Gaulard, F. Reyes, and C. Gisselbrecht,. CHOP chemotherapy plus rituximab compared with CHOP alone in elderly patients with diffuse large-B-cell lymphoma, N. Engl. J. Med. 2002; 346:235–242. 15. J. K. Lunceford, M. Davidian, and A. A. Tsiatis, Estimation of survival distributions of treatment policies in two-stage randomization designs in clinical trials, Biometrics. 2002; 58:48–57.
CROSS-REFERENCES
Adverse Event
Clinical Data Management
Data Monitoring Committee
National Cancer Institute
Southwest Oncology Group
ELIGIBILITY AND EXCLUSION CRITERIA
MARY A. FOULKES
Food & Drug Administration, Rockville, MD, USA
The choice of eligibility criteria in a clinical trial can increase or decrease the magnitude of between-patient variation, which will in turn decrease or increase the statistical power of the trial for a given sample size. Theoretically, the more homogeneous the trial population, the greater is the power of the trial, but the more limited is the ability to generalize the results to a broad population. Thus, the choice of eligibility criteria can profoundly influence both the results and the interpretation of the trial. Besides controlling variation, the Institute of Medicine (IOM) Committee on the Ethical and Legal Issues Relating to the Inclusion of Women in Clinical Studies (16) discusses four other issues related to the choice of trial population; namely, disease stage, clinical contraindications, regulatory or ethical restrictions, and compliance considerations. We will discuss these and the related issues of explanatory vs. pragmatic trials, screening and recruitment processes, and the impact of eligibility criteria on the generalizability of trial results. Other factors influencing the selection of patients, such as factors in the selection of institutions in multicenter studies and physician preferences, are discussed elsewhere (2,22).
1 EXPLANATORY VS. PRAGMATIC TRIALS
The objectives of a trial affect the appropriate eligibility criteria (20). If the trial is designed to estimate the biological effect of a treatment (explanatory trial), then the eligibility criteria should be chosen to minimize the impact of extraneous variation, as in early investigations of protease inhibitors against human immunodeficiency virus (HIV) infection (18). If, however, the trial is designed to estimate the effectiveness of a treatment in a target population (pragmatic trial), then the eligibility criteria should be chosen to allow valid inferences to that population. For example, the Hypertension Prevention Trial (HPT) was aimed at normotensive individuals 25–49 years old with diastolic blood pressure between 78 mm Hg and 90 mm Hg, and these were the main eligibility criteria (3). Choosing the narrow eligibility criteria often appropriate for an explanatory trial can make it difficult to apply the results to a broader population (11). Yusuf (23), moreover, argues that a truly homogeneous cohort cannot be constituted because even apparently similar individuals can have very different outcomes. The consensus is that most Phase III randomized trials should be regarded as pragmatic.
1.1 The Uncertainty Principle
Byar et al. (4) describe the simplest possible form of eligibility criteria for a trial, in which patients are eligible provided the treating physician and the patient have "substantial uncertainty" as to which of the treatment options is better. This definition, known as the uncertainty principle, incorporates all factors that contraindicate one or more of the treatment options including stage of disease, co-existing disease, and patients' preferences. However, it also largely devolves definition of eligibility to the individual physicians participating in the trial. The consequent lack of control and strict definition of the cohort of patients entering the trial has been unattractive to some investigators.
2 CONTROL OF VARIATION VS. EASE OF RECRUITMENT
The debate over the uncertainty principle highlights the tension between two different ways of improving the precision of the estimated effect of treatment in a randomized trial. By using very strict eligibility criteria we seek to reduce between-patient variation in clinical outcomes, leading to improved precision of the treatment difference estimate. By using very flexible eligibility criteria (as with the uncertainty principle), we seek to allow a wider entry to the trial, thereby
increasing the number of eligible patients and usually the precision of the treatment difference estimate. The question is, therefore, do we try to control variation and accept the smaller size of the sample, or do we try to increase the sample size and accept a wider between-patient variation? While this debate continues, the general consensus among clinical trial statisticians is that it is generally difficult to control between-patient variation successfully because often we do not know the important determinants of prognosis. Therefore, attempts to use very strict eligibility criteria are less successful than attempts to gain precision by entering very large numbers of patients into trials (23). However, if there are categories of patients who are considered very unlikely to benefit from the treatment, it is clearly conceivable to exclude them from the trial (see later sections on Stage of Disease and Clinical Contraindications). Begg (2) criticizes the common practice of introducing a long list of eligibility criteria in clinical trials, particularly in the treatment of cancer. Such an approach greatly increases the difficulty of recruiting patients in large numbers. In examining such lists it is often found that many of the criteria are of questionable importance and do not relate directly to the safety of the patient or to the lack of benefit to be derived from the treatment. 2.1 Issues in the Screening and Recruitment Process Establishing eligibility often involves a screening process. Examples include choosing individuals for a heart disease trial with ejection fraction between 0.35 and 0.8 and a specific number of ectopic beats, or choosing HIV-infected individuals with slowly rather than rapidly progressing disease (15). Some eligibility criteria may be implicit in this process. For example, the recruitment method may require the patients to be accessible by telephone contact or to be able to read and write in English, such as trials in which the initial contact is via a prepaid postal response card. Multiple ‘‘baseline’’ visits that are sometimes used in the screening process can provide multiple opportunities for exclusion, e.g. the Coronary Primary Prevention
Trial used five baseline visits and the HPT used three baseline visits. Thus, those ultimately enrolled may affect the recruitment and screening mechanisms and resources for multiple participant contacts as much as the protocol-specific eligibility criteria. The impact of eligibility criteria and of recruitment procedures on the overall cost of the trial has rarely been investigated. Borhani (3) indicated that the ordering of the application of eligibility criteria can substantially affect costs. These costs are also sensitive to the cutoffs applied to continuous responses, e.g. diastolic blood pressure, high density lipoprotein cholesterol, coronary ejection fraction, and T-cell lymphocyte counts. As mentioned earlier, eligibility criteria can have a strong impact on the ease of recruitment. For example, the need to enroll newly diagnosed or previously untreated patients can severely restrict the ability to recruit. The need to enroll rapidly after a stroke, myocardial infarction, head trauma, or exposure to infectious agent can lead to difficulties. If the condition renders the patient unconscious for some period of time, or the patient lives far from the treatment center, or is unaware that an infection, stroke, infarction, or other event has occurred, it is less likely that they will be available for enrollment. Similarly, Carew (5) suggests that recruitment be enhanced by broad eligibility criteria, allowing potentially more sites and more individuals to participate. 2.2 Stage of Disease Often the stage of disease strongly affects the outcome of treatment, and is a primary source of variation. Eligibility is often restricted to the stages of disease most appropriately managed by the treatment. For many diseases, classification or staging systems have been developed to aid clinical management. Eligibility criteria involving stage of disease are best defined using an established classification system that is in wide use. Examples of such classification systems include the coronary functional class (7), Dukes’ colon cancer staging system, and the World Health Organization staging system for HIV infection (26).
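The observation in Section 2.1 that the ordering of eligibility criteria affects screening cost is easy to make concrete: when criteria are applied sequentially and a candidate who fails one criterion is not given the remaining tests, the expected cost per screened candidate depends on which tests come first. The sketch below is a hypothetical illustration in Python; the criteria names, pass rates, and unit costs are invented for the example, assume independence between criteria, and are not taken from the trials discussed above.

```python
# Hypothetical illustration of the point that the order in which
# eligibility criteria are applied affects screening cost. Criteria are
# applied sequentially; a candidate who fails any criterion is not given
# the remaining (more expensive) tests. Pass rates are treated as
# independent, which is a simplification.

criteria = {                      # name: (pass probability, cost per test)
    "telephone interview": (0.60, 5.0),
    "blood pressure visit": (0.30, 40.0),
    "lab panel":           (0.50, 120.0),
}

def expected_cost(order):
    """Expected screening cost per candidate for a given test order."""
    cost, still_in = 0.0, 1.0
    for name in order:
        p, c = criteria[name]
        cost += still_in * c      # only candidates still in line are tested
        still_in *= p             # fraction surviving this criterion
    return cost

cheap_first = ["telephone interview", "blood pressure visit", "lab panel"]
expensive_first = list(reversed(cheap_first))
print(f"cheap, restrictive tests first: ${expected_cost(cheap_first):.2f}")
print(f"expensive tests first:          ${expected_cost(expensive_first):.2f}")
```

Under these assumed numbers, applying the inexpensive and highly restrictive checks before the laboratory panel substantially reduces the expected cost per screened candidate.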
2.3 Clinical Contraindications
Exclusions arising because one of the treatments is clearly contraindicated are common (14). For example, 18% of those screened for the Beta Blocker Heart Attack Trial were excluded due to contraindications to the administration of propranolol (12). Since these prior conditions would preclude use of some of the treatments in a trial, the trial results could not apply to individuals with those conditions. Some argue that contraindications should be clearly delineated in the protocol to avoid investigator or regional differences in their use.
2.4 Compliance Considerations
A run-in (or qualification) period is sometimes built into the trial design so as to identify potential noncompliers and exclude them from enrollment. This reduces the dilution of treatment differences that noncompliance introduces. In some studies, this period can also be used to eliminate placebo responders. In these cases, the determination of noncompliance becomes one of the outcome measures of the trial.
2.5 Regulatory or Ethical Considerations
Various demographically or otherwise defined populations have been excluded from clinical trials in the past. For example, in trials of heart disease prevention, women have been excluded as their incidence of heart disease is lower than in men and their inclusion would have required a larger sample size. Similarly, minority groups have sometimes had little or no representation because no special efforts had been made to include them. Recent changes in US regulations have required special justification for the exclusion of women, minorities, or the elderly from National Institutes of Health sponsored trials. The scientific argument for including these groups is that it provides a more solid basis for extrapolating the results of the trial to the general population (6,8,10,13,17,19,24,25). There will usually be inadequate statistical power for detecting different effects in subpopulations, but sometimes meta-analysis of several studies may be able to detect such differences.
2.6 Implementing the Eligibility Criteria
The characterization of the target population and baseline homogeneity can be subverted by deviations during the conduct of the trial from the protocol-specified eligibility criteria. If extensive, these can adversely affect the assumptions underlying analyses and the interpretation of the results. Thus, monitoring the determination of eligibility criteria during the conduct of the trial is an important component of the implementation of the trial. Often, the office that conducts the randomized treatment assignment checks the eligibility criteria before enrolling the patient. Finkelstein & Green (9) discuss the exclusion from analysis of individuals found to be ineligible after enrollment in the trial.
2.7 Generalization of Results to Broader Populations
Treatment trials (or prevention trials) are usually conducted on samples of convenience, enrolling participants who present at specific hospitals or clinical sites. Therefore, the population to whom the trial results apply is generally not well defined. External validity—the ability to generalize from the trial to some broader population—is the ultimate goal of any trial. Adequately randomized trials can be assumed to produce valid results for the specific group of individuals enrolled, i.e. internal validity; the difficulties arise in extending the inference beyond that limited cohort. Since complete enumeration of the target population is rarely possible, inferences from studies are based on substantive judgment. A strong argument that is often used is that treatment differences in outcome are generally less variable among different patient populations than the outcomes themselves (23). Following publication, critics questioned the generalizability of the results of the International Cooperative Trial of Extracranial–Intracranial (EC/IC) Arterial Anastomosis to evaluate the effect of the EC/IC procedure on the risk of ischemic stroke. The results showed a lack of benefit that surprised many in the surgical profession. It became clear that many of the eligible patients at the participating clinical sites did
not enter the trial, while those enrolled in the trial were considered to have poorer risk and some argued that they were less likely to benefit from surgery (1,21). The ensuing controversy slowed acceptance of the trial results by the surgical community, although eventually they had a profound effect on the frequency with which EC/IC was performed.
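The trade-off posed at the start of Section 2, between reducing between-patient variation with strict criteria and enlarging the eligible pool with broad criteria, can be made concrete with a back-of-the-envelope calculation. The sketch below uses Python with scipy and the usual normal approximation to two-sample power; the effect size, standard deviations, and sample sizes are hypothetical and are chosen only to show why a modest gain in homogeneity is often outweighed by the larger enrollment that broad criteria permit.

```python
# Hypothetical comparison of two eligibility strategies for a two-arm trial
# with a continuous endpoint, using the usual normal approximation:
#   power = Phi( delta / (sigma * sqrt(2/n)) - z_{1-alpha/2} )
# Strict criteria shrink the outcome SD a little but also shrink the pool
# of eligible patients; broad criteria accept more variation but more patients.
from math import sqrt
from scipy.stats import norm

ALPHA = 0.05          # two-sided significance level
DELTA = 2.0           # assumed treatment difference (hypothetical units)

def power(n_per_arm, sigma, alpha=ALPHA, delta=DELTA):
    z_crit = norm.ppf(1.0 - alpha / 2.0)
    return norm.cdf(delta / (sigma * sqrt(2.0 / n_per_arm)) - z_crit)

strict = {"n_per_arm": 150, "sigma": 9.0}    # homogeneous but smaller cohort
broad  = {"n_per_arm": 400, "sigma": 10.0}   # more variable but larger cohort

print(f"strict eligibility: power = {power(**strict):.3f}")
print(f"broad eligibility:  power = {power(**broad):.3f}")
```

With these assumed numbers the broader, more variable cohort attains noticeably higher power, which is consistent with the argument attributed to Yusuf (23) earlier in this section.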
3 CONCLUSIONS
The goals and objectives of the trial, the intended target population, and the anticipated inferences from the trial results should all be carefully specified from the outset. If that is done, then the appropriate choice of eligibility criteria usually becomes clearer. Experience has shown that simplifying eligibility criteria generally enhances recruitment, allows a wider participation, and gives greater justification for generalizing the results to a broader population.
REFERENCES 1. Barnett, H. J. M., Sackett, D., Taylor, D. W., Haynes, B., Peerless, S. J., Meissner, I., Hachinski, V. & Fox, A. (1987). Are the results of the extracranial–intracranial bypass trial generalizable?, New England Journal of Medicine 316, 820–824. 2. Begg, C. B. (1988). Selection of patients for clinical trials, Seminars in Oncology 15, 434–440. 3. Borhani, N. O., Tonascia, J., Schlundt, D. G., Prineas, R. J. & Jefferys, J. L. (1989). Recruitment in the Hypertension Prevention Trial, Controlled Clinical Trials 10, 30S–39S. 4. Byar, D. P., Schoenfeld, D. A. & Green, S. B. (1990). Design considerations for AIDS trials, New England Journal of Medicine 323, 1343–1348. 5. Carew, B. D., Ahn, S. A., Boichot, H. D., Diesenfeldt, B. J., Dolan, N. A., Edens, T. R., Weiner, D. H. & Probstfield, J. L. (1992). Recruitment strategies in the Studies of Left Ventricular Dysfunction (SOLVD), Controlled Clinical Trials 13, 325–338. 6. Cotton P. (1990). Is there still too much extrapolation from data on middle-aged white men?, Journal of the American Medical Association 263, 1049–1050.
7. Criteria Committee of NYHA (1964). Diseases of the Heart and Blood Vessels: Nomenclature and Criteria for Diagnosis, 6th Ed. Little, Brown & Company, Boston. 8. El-Sadr, W. & Capps, L. (1992). Special communication: the challenge of minority recruitment in clinical trials for AIDS, Journal of the American Medical Association 267, 954–957. 9. Finkelstein, D. M. & Green, S. B. (1995). Issues in analysis of AIDS clinical trials, in AIDS Clinical Trials, D. M. Finkelstein & D. A. Schonfeld, eds. Wiley–Liss, New York, pp. 243–256. 10. Freedman L. S., Simon, R., Foulkes, M. A., Friedman, L., Geller, N. L., Gordon, D. J. & Mowery, R. (1995). Inclusion of women and minorities in clinical trials and the NIH Revitalization Act of 1993—the perspective of NIH clinical trialists, Controlled Clinical Trials 16, 277–285. 11. Gail, M. H. (1985). Eligibility exclusions, losses to follow-up, removal of randomized patients, and uncounted events in cancer clinical trials, Cancer Treatment Reports 69, 1107–1112. 12. Goldstein, S., Byington, R. & the BHAT Research Group (1987). The Beta Blocker Heart Attack Trial: recruitment experience, Controlled Clinical Trials 8, 79 S–85 S. 13. Gurwitz, J. H., Col, N. F. & Avorn, J. (1992). The exclusion of the elderly and women from clinical trials in acute myocardial infarction, Journal of the American Medical Association 268, 1417–1422. 14. Harrison K., Veahov, D., Jones, K., Charron, K. & Clements, M. L. (1995). Medical eligibility, comprehension of the consent process, and retention of injection drug users recruited for an HIV vaccine trial, Journal of Acquired Immune Deficiency Syndrome 10, 386–390. 15. Haynes, B. F., Panteleo, G. & Fauci, A. S. (1996). Toward an understanding of the correlates of protective immunity to HIV infection, Science 271, 324–328. 16. IOM Committee on the Ethical and Legal Issues Relating to the Inclusion of Women in Clinical Studies, (1996). Women and Health Research: Ethical and Legal Issues of Including Women in Clinical Studies, A. C. Mastroianni, R. Faden & D. Federman, eds. National Academy Press, Washington. 17. Lagakos, S., Fischl, M. A., Stein, D. S., Lim, L. & Vollerding, P. (1991). Effects of zidovudine therapy in minority and other subpopulations with early HIV infection, Journal of the American Medical Association 266, 2709–2712.
18. Markowitz, M., Mo, H., Kempf, D. J., Norbeck, D. W., Bhat, T. N., Erickson, J. W. & Ho, D. D. (1996). Triple therapy with AZT, 3TC, and ritonavir in 12 subjects newly infected with HIV-1, Eleventh International Conference on AIDS, Abstract Th.B. 933. 19. Patterson, W. B. & Emanuel, E. J. (1995). The eligibility of women for clinical research trials, Journal of Clinical Oncology 13, 293–299. 20. Schwartz, D., Flamant, R. & Lellouch, J. (1980). Clinical Trials. Academic Press, New York. 21. Sundt, T. M. (1987). Was the international randomized trial of extracranial–intracranial arterial bypass representative of the population at risk?, New England Journal of Medicine 316, 814–816. 22. Taylor, K. M., Margolese, R. G. & Soskolne, C. L. (1984). Physicians’ reasons for not entering eligible patients in a randomized clinical trial of surgery for breast cancer, New England Journal of Medicine 310, 1363–1367. 23. Yusuf, S., Held, P., Teo, K. K. & Toretsky, E. R. (1990). Selection of patients for randomized controlled trials: implications of wide or narrow eligibility criteria, Statistics in Medicine 9, 73–86. 24. Yusuf, S. & Furberg, C. D. (1991). Are we biased in our approach to treating elderly patients with heart disease?, American Journal of Cardiology 68, 954–956. 25. Wenger, N. K. (1992). Exclusion of the elderly and women from coronary trials: is their quality of care compromised?, Journal of the American Medical Association 268, 1460–1461. 26. World Health Organization (1990). Acquired immune deficiency syndrome (AIDS): interim proposal for a WHO staging system for HIV infection and disease, Weekly Epidemiology Record 65, 221–228.
CROSS-REFERENCES
Intention to Treat Analysis
EMERGENCY USE INVESTIGATIONAL NEW DRUG (IND)
The need for an investigational drug may develop in an emergency situation that does not allow time for submission of an Investigational New Drug (IND) Application in accordance with 21 CFR (Code of Federal Regulations) 312.23 or 21 CFR 312.34. In such a case, the Food and Drug Administration (FDA) may authorize shipment of the drug for a specified use in advance of submission of an IND. A request for such authorization may be transmitted to FDA by telephone or by other rapid communication means.
For investigational biological drugs regulated by the Center for Biologics Evaluation and Research (CBER), the request should be directed to the Office of Communication, Training and Manufacturers Assistance (HFM–40), Center for Biologics Evaluation and Research. For all other investigational drugs, the request for authorization should be directed to the Division of Drug Information (HFD–240), Center for Drug Evaluation and Research. After normal working hours, Eastern Standard Time, the request should be directed to the FDA Office of Emergency Operations (HFA–615). Except in extraordinary circumstances, such authorization will be conditioned on the sponsor making an appropriate IND submission as soon as possible after receiving the authorization.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/regulatory/applications/ ind page 1.htm) by Ralph D’Agostino and Sarah Karl.
END-OF-PHASE II MEETING
The purpose of an End-of-Phase II Meeting is to determine the safety of proceeding to Phase III, to evaluate the Phase III plan and protocols and the adequacy of current studies and plans to assess pediatric safety and effectiveness, and to identify any additional information necessary to support a marketing application for the uses under investigation. Although the End-of-Phase II meeting is designed primarily for Investigational New Drugs (INDs) that involve new molecular entities or major new uses of marketed drugs, a sponsor of any IND may request and obtain an End-of-Phase II meeting. To be most useful to the sponsor, End-of-Phase II meetings should be held before major commitments of effort and resources to specific Phase III tests are made. The scheduling of an End-of-Phase II meeting is not intended, however, to delay the transition of an investigation from Phase II to Phase III.
1 ADVANCE INFORMATION
At least 1 month in advance of an End-of-Phase II meeting, the sponsor should submit background information on the sponsor's plan for Phase III, including summaries of the Phase I and II investigations, the specific protocols for Phase III clinical studies, plans for any additional nonclinical studies, plans for pediatric studies (including a time line for protocol finalization, enrollment, completion, and data analysis or information to support any planned request for waiver or deferral of pediatric studies), and, if available, tentative labeling for the drug. The recommended contents of such a submission are described more fully in the Food and Drug Administration (FDA) Staff Manual Guide 4850.7 that is publicly available under the FDA public information regulations in Part 20.
Arrangements for an End-of-Phase II meeting are to be made with the division in the FDA Center for Drug Evaluation and Research or the Center for Biologics Evaluation and Research that is responsible for review of the IND. The meeting will be scheduled by FDA at a time convenient to both FDA and the sponsor. Both the sponsor and FDA may bring consultants to the meeting. The meeting should be directed primarily at establishing agreement between FDA and the sponsor on the overall plan for Phase III and the objectives and design of particular studies. The adequacy of the technical information to support Phase III studies and/or a marketing application may also be discussed. FDA will also provide its best judgment, at that time, of the pediatric studies that will be required for the drug product and whether their submission will be deferred until after approval. Agreements reached at the meeting on these matters will be recorded in minutes of the conference that will be taken by FDA in accordance with Sec. 10.65 and provided to the sponsor. The minutes, along with any other written material provided to the sponsor, will serve as a permanent record of any agreements reached. Barring a significant scientific development that requires otherwise, studies conducted in accordance with the agreement shall be presumed to be sufficient in objective and design for the purpose of obtaining marketing approval for the drug.
This article was modified from the website of the United States Food and Drug Administration (http://frwebgate.access.gpo.gov/cgi-bin/getcfr.cgi?TITLE=21&PART=312&SECTION=47&YEAR=1999&TYPE=TEXT) by Ralph D’Agostino and Sarah Karl.
END-OF-PHASE I MEETING

When data from Phase I clinical testing are available, the sponsor again may request a meeting with Food and Drug Administration (FDA) reviewing officials. The primary purpose of this meeting is to review and reach agreement on the design of Phase II controlled clinical trials, with the goal that such testing will be adequate to provide sufficient data on the safety and effectiveness of the drug to support a decision on its approvability for marketing, and to discuss the need for, as well as the design and timing of, studies of the drug in pediatric patients. For drugs for life-threatening diseases, FDA will provide its best judgment, at that time, whether pediatric studies will be required and whether their submission will be deferred until after approval. The procedures outlined in 21 CFR 312.47(b)(1) with respect to End-of-Phase II conferences, including documentation of agreements reached, would also be used for End-of-Phase I meetings.
This article was modified from the website of the United States Food and Drug Administration (http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/ cfCFR/CFRSearch.cfm?CFRPart=312&showFR=1 &subpartNode=21:5.0.1.1.3.5) by Ralph D’Agostino and Sarah Karl.
ENRICHMENT DESIGN

VALERII V. FEDOROV
Biomedical Data Sciences, GlaxoSmithKline Pharmaceuticals, Collegeville, Pennsylvania

TAO LIU
Department of Biostatistics & Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania

Enrichment designs for evaluating certain treatments or drugs had been used for decades before Hallstrom and Friedman (1), Temple (2), and Pablos-Mendez et al (3) gave them formal discussion and definition in the 1990s. In such designs, a subpopulation is selected or screened out from the general population for an experimental study. The procedure for selection of such a subpopulation is called enrichment (2). The goal of the design is to enhance the signal of an external intervention in the enriched subpopulation and separate it from the interference of many other undesired factors. The discussion and employment of such designs can be traced back to the 1970s (4). Recent years have seen these designs gain great popularity in many different disciplines, particularly in the field of clinical oncology. The enrichment intentions can be roughly classified into the following categories (2).

• Variance suppressing selection: Selecting the most homogeneous subpopulation, such as those patients with the greatest tendency to adhere to the study protocol, those whose cholesterol level falls within a certain range, or those with a similar tumor size and health condition (5).

• Response enhanced selection: Identifying the subpopulation with the strongest potential magnitude of response, such as improvement in mental condition, extension of survival, or reduction in tumor growth rate (6–9).

• Target responder selection: Choosing the subpopulation that is more likely to respond or to experience an event than the general population, such as people who responded to a treatment at the initial stage, or those with a recurrent history of a certain event (10, 11).

These selection maneuvers are not mutually exclusive. An enrichment process can often achieve more than one of the above objectives. An effective enrichment can greatly increase the power of detecting the target treatment effect (if only for the selected subpopulation). However, a very strict, multistage enrichment process may lead to a small subpopulation size, and consequently to lower statistical precision or to a prolonged recruitment period. An ideal enrichment design should be based on a careful trade-off between these two aspects.

There are many variants of enrichment designs. Figures 1, 2, and 3 show some relatively simple schemes.

Figure 1. The diagram of a randomized discontinuation trial.

Figure 2. The diagram of a placebo run-in trial.

The randomized discontinuation trial (RDT) (Figure 1) was first proposed by Amery and Dony (4) as an alternative to the classic placebo (or comparator)-controlled, randomized clinical trial (RCT) to reduce the trial's duration and the degree of the patients' exposure to inert placebo. In this design, after all of the eligible population have provided informed consent for randomization, they are assigned to an experimental treatment at the first stage. This stage is called the open stage (4); at the end of the open stage, the individuals' responses (often surrogate endpoints) are collected and evaluated by the study clinician. The individuals who had no response or showed serious adverse effects are excluded from the study. The rest (open-stage responders) are then randomized to a placebo or to the experimental treatment (or a comparator) in a double-blind fashion. The first stage serves as a filter for removing those who are unlikely to respond to treatment. The rationale is that the nonresponders would contribute little information about the population for whom the treatment can be useful. The second stage serves
to distinguish whether the treatment adds anything over the placebo effect. A commonly accepted assumption with an RDT is that ‘‘the treatment will not cure the condition during the open stage’’ (4). For this reason, an RDT is generally applied
under conditions that require sustained use of a therapy (6, 12, 13), such as stabilizing tumor growth or treating some chronic disease. Another often accepted assumption with this design is that the treatment effect at the open stage will not carry over to the second stage. In oncology, this might mean that the tumor growth rate is uniquely defined by the treatment and is changed as the treatment is changed (6).

Figure 3. Study design for a major depressive disorder trial. DB, double blind. (From Fava et al [11].)

Traditionally, the statistical analysis of an RDT uses only the outcomes from the second stage, treating it as an RCT rendered on the enriched subpopulation. Capra (14) compared the power of an RDT with that of an RCT when the primary endpoints are individuals' survival times. Kopec et al (7) evaluated the utility and efficiency of the RDT when the endpoints are binary; they compared the relative sample size required for the desired power of the RDT versus the RCT under various scenarios and parameter settings. Fedorov and Liu (10) considered maximum likelihood estimation of the treatment effect for binary endpoints. With some moderate assumptions, they incorporated the information from the open stage into their inference.

The placebo run-in trial (PRIT) is another often used enrichment design (15, 16). It is very similar to the RDT in its setup structure, except that the block "experimental treatment" is replaced by "placebo treatment" and "responders" by "compliers" (Figure 2). An assumption with the PRIT design is that the participants behave coherently throughout the trial. If a patient's adherence to the protocol is poor during the placebo run-in, then his adherence will be poor during the second stage, and vice versa. This design can be more efficient than a conventional RCT
when the compliance of the general population is known or expected to be poor (5) or "when poor adherence is associated with a substantial reduction of therapy" (17). Davis et al (18) examined the efficiency of the PRIT design through empirical evaluations, in the setting of evaluating a cholesterol-lowering drug for elderly patients. The analyses were carried out using the outcomes from the second stage only, as if it were a conventional RCT.

Both the RDT and PRIT designs are fairly simple schemes. In practice, researchers often employ these designs with certain modifications to meet each study's requirements. Sometimes the RDT and PRIT are even used in combination. For example, Fava et al (11) proposed a study design they named the "Sequential Parallel Comparison Design" for their psychiatric disorder study (see Figure 3 for the design diagram). The first stage of the design consists of three double-blinded (DB) arms: two placebo arms and one treatment arm with unequal randomization (usually more patients are on the placebo arms). Only the nonresponders of the first stage continue to the second stage, and they are assigned in a double-blinded way to the active treatment or placebo, depending on the arm they were on at the first stage. The rationale behind the design is that "since patients on the second stage have already 'failed placebo', their placebo response will be reduced." The data analysis of this design is similar to that of the RDT and PRIT. Fava et al (11) have discussed the statistical model for the design and the design optimization.
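The efficiency rationale shared by these enrichment schemes, namely that restricting the second-stage comparison to likely responders (RDT) or likely compliers (PRIT) yields a larger contrast between arms, can be illustrated with some back-of-the-envelope arithmetic. The sketch below is purely hypothetical: the response rates, the 5% two-sided level, the 80% power target, and the standard two-proportion approximation are illustrative assumptions rather than values from the cited papers, and the calculation deliberately ignores the cost of the screening stage and the misclassification issues discussed above.

```python
# Back-of-the-envelope comparison of per-arm sample sizes for an unselected
# randomized trial versus an enriched second-stage comparison.  All inputs are
# hypothetical, and the screening stage's own cost is ignored.
from math import ceil, sqrt

Z_ALPHA, Z_BETA = 1.96, 0.8416   # two-sided 5% level, 80% power

def n_per_arm(p1, p0):
    """Approximate per-arm sample size for detecting p1 vs. p0 (two proportions)."""
    p_bar = (p1 + p0) / 2
    numerator = (Z_ALPHA * sqrt(2 * p_bar * (1 - p_bar))
                 + Z_BETA * sqrt(p1 * (1 - p1) + p0 * (1 - p0))) ** 2
    return ceil(numerator / (p1 - p0) ** 2)

# Hypothetical unselected population: effect diluted by nonresponders/noncompliers
print("per-arm n, unselected trial:", n_per_arm(0.28, 0.145))
# Hypothetical enriched subpopulation: mostly responders, larger contrast
print("per-arm n, enriched second stage:", n_per_arm(0.58, 0.22))
```

Under these assumed rates the enriched comparison needs far fewer randomized patients per arm, which is the intuition behind the sample-size reductions reported in the literature; whether the saving survives in practice depends on the screening-stage costs and errors discussed later in this entry.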
1 MODELS

In most publications, it is (implicitly or explicitly) assumed that the general population consists of K subpopulations:

\[
  \varphi(x, \theta) \;=\; \sum_{k=1}^{K} \pi_k \, \varphi(x, \theta_{kj}), \qquad j = 1, 2, \ldots, J, \qquad (1)
\]
where x is the endpoint of interest (which can be a collection of several variables), π_k is the fraction of the k-th subpopulation, ϕ(x, θ_kj) is the distribution density of x in the k-th subpopulation under the j-th treatment, θ_kj are unknown parameters defining the distribution of x in each subpopulation, ϕ(x, θ) is the marginal distribution of x, and the vector θ comprises all π_k and θ_kj. The goal can be the estimation of all components of θ or, typically, a subset of θ, such as the fraction of responders π_k* and the parameters θ_k*j. In popular settings for continuous x, ϕ(x, θ_kj) is a normal density with θ_1kj = μ_kj = E(x | k) and θ_2kj = σ_kj² = Var(x | k). In this case, the parameters of interest can be μ_k*j, μ_k*j′, and π_k*, where j and j′ denote two comparative treatments, and the other parameters can be viewed as nuisance. Often the parameter estimation is complemented or replaced by hypothesis testing.

The population model (1) should be complemented by models that describe the evolution of the response to treatment and the observation processes. For instance, the enrichment process of an RDT is often achieved using only surrogate endpoints at the end of the first stage, which are less accurate measures than the primary endpoints and can lead to misclassification of treatment responders and nonresponders (Figure 1). Fedorov and Liu (10) proposed to model such an imperfect enrichment process through the introduction of false-positive and false-negative detection rates. This model builds the connection between the outcomes at the first and the second stages, and hence makes it possible to use the observed information
from both stages by constructing the complete data likelihood. In other settings (6), the outcome at the first stage is x(t_1) (e.g., tumor size) at moment t_1, while at the final stage it is x(t_1 + t_2). Thus, a model describing the relationship between x(t_1) and x(t_1 + t_2) is needed. In oncology, x(t) can be a tumor growth model. With this model in place, the optimal selection of t_1 given t_1 + t_2 can be considered (6, 8).
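As a purely notational illustration of model (1), the simplest special case takes K = 2 subpopulations (k = 1 responders, k = 2 nonresponders), J = 2 treatments, and normal component densities; the responder-specific mean difference Δ_1 shown below is just one possible choice of target quantity and is not prescribed by the cited papers.

```latex
% Illustrative special case of model (1): K = 2 subpopulations
% (k = 1 responders, k = 2 nonresponders), J = 2 treatments,
% normal component densities.
\[
  \varphi(x,\theta)
    = \pi_{1}\,\mathcal{N}\!\bigl(x;\,\mu_{1j},\,\sigma_{1j}^{2}\bigr)
    + (1-\pi_{1})\,\mathcal{N}\!\bigl(x;\,\mu_{2j},\,\sigma_{2j}^{2}\bigr),
  \qquad j = 1, 2.
\]
% One possible quantity of interest: the responder-specific treatment effect
\[
  \Delta_{1} = \mu_{11} - \mu_{12},
  \qquad \text{with the responder fraction } \pi_{1} \text{ also of interest.}
\]
```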
2 DESIGN AND EFFICIENCY
In terms of design, the choice of randomization scheme, rates of randomization, and selection of the length of the first stage can be diversified in many ways. Let us consider a simple RDT design (Figure 1) with binary outcomes as an example. Figure 4 shows the efficiency comparison between an RDT design and an RCT with two equal arms (10). Suppose that, at the end of the open phase, 10% of the responders to an active treatment are misclassified as nonresponders, and that 10% of the nonresponders are misclassified as responders. The x-axis represents the population response rate to placebo, and the y-axis the increase in the response rate due to the treatment, the estimation of which is the primary interest. Numbers next to each curve indicate the fraction of patients randomized to the placebo arm in the second stage of an RDT (usually the smaller value is more ethical). In the region under each curve, the RDT dominates the RCT in terms of efficiency. This figure demonstrates that an RDT has better performance under some scenarios (particularly when the treatment effect is small). However, this efficiency gain is not universal, even in this simple, idealized setting. For more realistic cases, careful selection of models, combined with ethical and logistical considerations, is essential. In some publications (6, 8, 9, 14), investigators use Monte Carlo simulations to compare different designs.

Figure 4. The area below each curve corresponds to the rate for which a randomized discontinuation trial (RDT) is superior to a randomized controlled trial (RCT), given the fraction of patients randomized to placebo (false positive = 0.10, false negative = 0.10).
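In the same spirit as the simulation-based comparisons cited above, the sketch below uses a crude Monte Carlo loop to estimate the probability of declaring a treatment effect under an RDT-style second-stage analysis and under a conventional two-arm RCT on the full mixed population. Every ingredient is a hypothetical assumption made for illustration (the mixture of responders and nonresponders, the response rates, the 10% misclassification rates echoing the setting of Figure 4, the total sample size, and the normal-approximation test); it shows the kind of comparison involved rather than reproducing the figure.

```python
# Crude Monte Carlo sketch: empirical power of a conventional two-arm RCT on a
# mixed population versus an RDT that analyzes only the enriched second stage.
# All population and design settings are hypothetical.
import math
import random

random.seed(2)

PI_RESP = 0.30          # assumed fraction of true responders
P_RESP_TREAT = 0.70     # responders' response rate on treatment
P_RESP_PLACEBO = 0.25   # responders' response rate on placebo
P_NONRESP = 0.10        # nonresponders' response rate on either arm
FALSE_POS = 0.10        # open-stage misclassification rates (10%, as in Figure 4)
FALSE_NEG = 0.10

def outcome(is_responder, on_treatment):
    if not is_responder:
        return random.random() < P_NONRESP
    return random.random() < (P_RESP_TREAT if on_treatment else P_RESP_PLACEBO)

def ztest_reject(x1, n1, x0, n0):
    """Two-sided two-proportion z-test at the 5% level (normal approximation)."""
    if min(n1, n0) == 0:
        return False
    p1, p0, p = x1 / n1, x0 / n0, (x1 + x0) / (n1 + n0)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n0))
    return se > 0 and abs(p1 - p0) / se > 1.96

def simulate_rct(n):
    arms = {True: [0, 0], False: [0, 0]}        # on_treatment -> [responses, size]
    for _ in range(n):
        is_resp = random.random() < PI_RESP
        arm = random.random() < 0.5
        arms[arm][0] += outcome(is_resp, arm)
        arms[arm][1] += 1
    (x1, n1), (x0, n0) = arms[True], arms[False]
    return ztest_reject(x1, n1, x0, n0)

def simulate_rdt(n, frac_placebo=0.5):
    arms = {True: [0, 0], False: [0, 0]}
    for _ in range(n):
        is_resp = random.random() < PI_RESP
        classified = (random.random() >= FALSE_NEG) if is_resp \
                     else (random.random() < FALSE_POS)
        if not classified:
            continue                            # open-stage nonresponders stop follow-up
        arm = random.random() >= frac_placebo   # True = stays on active treatment
        arms[arm][0] += outcome(is_resp, arm)
        arms[arm][1] += 1
    (x1, n1), (x0, n0) = arms[True], arms[False]
    return ztest_reject(x1, n1, x0, n0)

REPS, N = 2000, 400
print("empirical power, RCT:", sum(simulate_rct(N) for _ in range(REPS)) / REPS)
print("empirical power, RDT:", sum(simulate_rdt(N) for _ in range(REPS)) / REPS)
```

Changing the assumed placebo response rate, the treatment effect, or the fraction randomized to placebo in the second stage alters which design wins, which is exactly the kind of region-by-region comparison summarized in Figure 4.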
3 APPLICATIONS
The majority of applications of enrichment designs are in the field of clinical oncology. Typical examples include a study with
enrichment screening based on preliminary data on erlotinib (19), a study of a cytostatic antineoplastic agent (9), and a study of the putative antiangiogenic agent carboxyaminoimidazole in renal cell carcinoma (20). Temple (21) provided a review and discussion of enrichment designs in cancer treatments. For the early development of molecularly targeted anticancer agents, RDTs were employed to select sensitive subpopulations when an effective assay for such separation was not available (6, 8). Other applications of enrichment designs can be found in children's health research (22, 23), clinical research in psychiatry (11, 24, 25), a cardiac arrhythmia suppression study in cardiology (26), and a few other therapeutic areas (27–30). Similar enrichment strategies can also be found in some two-stage surveys (31). Freidlin and Simon (8) evaluated cytostatic drugs using a design in which they expected only certain patients to be sensitive to their target treatment.

4 DISCUSSION
Because the goal of enrichment is to separate certain subpopulations from the general population and randomize only the selected subpopulation in the trial, the enrichment design is usually capable of detecting only efficacy rather than effectiveness. This type of efficacy distinction is often of main interest in oncology, which is the reason that the RDT is frequently used for screening for treatment activity. When certain assumptions are satisfied, the efficacy detection can be greatly enhanced. The work by Kopec et al (7) illustrated that, when compared with an RCT, the sample size required for an RDT can be reduced by more than 50%. Fedorov and Liu (10) showed that the increase in efficiency of efficacy detection can be even higher if additional assumptions are made and the information from the open stage can be seamlessly included. However, enrichment designs are not always superior to other designs, even when all of the ethical and logistical conditions are acceptable. For example, studies that have compared the relative efficiency between
RDTs and the classic RCT (7, 8, 10, 32) have found that the RCT can be more efficient under certain conditions, even when the separation of the subpopulation at the first stage is perfect (i.e., no misclassifications). Other limitations of enrichment designs include:

1. The benefits of using enrichment designs come at the expense of the applicability of the study results to the general population. The Coronary Drug Project Research Group study (22) illustrated the effect of compliance on the conclusions for the enriched subpopulations and the general population.

2. The recruitment process for an enrichment design could last much longer than for a conventional RCT.

3. The use of surrogate endpoints at the first stage of an RDT can affect the performance of the enrichment process. Fedorov and Liu (10) showed the consequences of misclassification on design efficiency in the first stage of an RDT.

4. The screening phases (the run-in phase for a PRIT and the open phase for an RDT) are not free; they cost the sponsors time and money and come with errors (33, 34).

In general, an enrichment design should undergo a cost-benefit analysis (35) in which efficiency is not the only contributor to the utility function. The enrichment (selection) process is not universally applicable to all scenarios and may prove ethically controversial (36–38), even though its rationale is well supported and it has been a natural choice in the most commonly reported applications. Researchers must scrutinize the validity of the assumptions and consider the possible ethical issues associated with enrichment before the design is carried out.
REFERENCES

1. A. P. Hallstrom and L. Friedman, Randomizing responders. Control Clin Trials. 1991; 12: 486–503.
2. R. J. Temple, Special study designs: early escape, enrichment, studies in nonresponders. Commun Stat Theory Methods. 1994; 23: 499–531. 3. A. Pablos-Mendez, A. G. Barr, and S. Shea, Run-in periods in randomized trials. JAMA. 1998; 279: 222–225. 4. W. Amery and J. Dony, Clinical trial design avoiding undue placebo treatment. J Clin Pharmacol. 1975; 15: 674–679. 5. E. Brittain and J. Wittes, The run-in period in clinical trials: the effect of misclassification on efficacy. Control Clin Trials. 1990; 11: 327–338. 6. G. L. Rosner, W. Stadler, and M. J. Ratain, Discontinuation design: application to cytostatic antineoplastic agents. J Clin Oncol. 2002; 20: 4478–4484. 7. J. Kopec, M. Abrahamowicz, and J. Esdaile, Randomized discontinuation trials: utility and efficiency. J Clin Epidemiol. 1993; 46: 959–971. 8. B. Freidlin and R. Simon, Evaluation of randomized discontinuation design. J Clin Oncol. 2005; 23: 5094–5098. 9. L. V. Rubinstein, E. L. Korn, B. Freidlin, S. Hunsberger, S. P. Ivy, and M. A. Smith, Design issues of randomized phase II trials and proposal for phase II screening trials. J Clin Oncol. 2005; 23: 7199–7206. 10. V. V. Fedorov and T. Liu, Randomized discontinuation trials: design and efficiency. GlaxoSmithKline Biomedical Data Science Technical Report, 2005-3. 11. M. Fava, A. E. Evins, D. J. Dorer, and D. A. Schoenfeld, The problem of the placebo response in clinical trials for psychiatric disorders: culprits, possible remedies, and a novel study design approach. Psychother Psychosom. 2003; 72: 115–127. 12. C. Chiron, O. Dulac, and L. Gram, Vigabatrin withdrawal randomized study in children. Epilepsy Res. 1996; 25: 209–215. 13. E. L. Korn, S. G. Arbuck, J. M. Pulda, R. Simon, R. S. Kaplan, and M. C. Christian, Clinical trial designs for cytostatic agents: are new approaches needed? J Clin Oncol. 2001; 19: 265–272. 14. W. B. Capra, Comparing the power of the discontinuation design to that of the classic randomized design on time-to-event endpoints. Control Clin Trials. 2004; 25: 168–177. 15. The SOLVD Investigators. Effect of enalapril on survival in patients with reduced left ventricular ejection fractions and congestive heart failure. N Engl J Med. 1991; 325: 293–302.
ENRICHMENT DESIGN 16. J. E. Buring and C. E. Hennekens, Cost and efficiency in clinical trials: the US Physicians’ Health Study. Stat Med. 1990; 9: 29–33. 17. K. B. Schechtman and M. E. Gordon, A comprehensive algorithm for determining whether a run-in strategy will be a costeffective design modification in a randomized clinical trial. Stat Med. 1993; 12: 111–128. 18. C. E. Davis, W. B. Applegate, D. J. Gordon, R. C. Curtis, and M. McCormick, An empirical evaluation of the placebo run-in. Control Clin Trials. 1995; 16: 41–50. 19. Tarceva (erlotinib). Tablet package Insert. Melville, NY: OSI Pharmaceuticals, December 2004. 20. W. M. Stadler, G. Rosner, E. Small, D. Hollis, B. Rini, S. D. Zaentz, and J. Mahoney, Successful implementation of the randomized discontinuation trial design: an application to the study of the putative antiangiogenic agent carboxyaminoimidazole in renal cell carcinoma—CALGB 69901. J Clin Oncol. 2005; 23: 3726–3732. 21. R. J. Temple, Enrichment designs: efficiency in development of cancer treatments. J Clin Oncol. 2005; 23: 4838–4839. 22. The Coronary Drug Project Research Group. Influence of adherence to treatment and response of cholesterol on mortality in the Coronary Drug Project. N Engl J Med. 1980; 303: 1038–1041. 23. P. Casaer, J. Aicardi, P. Curatolo, K. Dias, M. Maia, et al, Flunazirizine in alternating hemiplegia in childhood. An international study in 12 children. Neuropediatrics. 1987; 18: 191–195. 24. F. M. Quitkin and J. G. Rabkin, Methodological problems in studies of depressive disorder: utilities of the discontinuation design. J Clin Psychopharmacol. 1981; 1: 283–288. 25. D. D. Robinson, S. C. Lerfald, B. Bennett, D. Laux, E. Devereaux, et al, Continuation and maintenance treatment of major depression with the monoamine oxidase inhibitor phenelzine: a double blind placebo-controlled discontinuation study. Psychopharmacol Bull. 1991; 27: 31–39.
26. D. S. Echt, P. R. Liebson, and L. B. Mitchell, Mortality and morbidity in patients receiving encainide, flecainide, or placebo. N Engl J Med. 1991; 324: 781–788. 27. T. D. Giles, G. E. Sander, L. Roffidal, M. G. Thomas, D. P. Mersch, et al, Remission of mild to moderate hypertension after treatment with carteolol, a beta-adrenoceptor blocker with intrinsic sympathomimetic activity. Arch Intern Med. 1988; 148: 1725–1728. 28. D. S. Echt, P. R. Liebson, L. B. Mitchell, R. W. Peters, D. Obias-Manno, et al, and the CAST Investigators. Mortality and morbidity in patients receiving encainide, flecainide, or placebo: the Cardiac Arrhythmia Suppression Trial. N Engl J Med. 1991; 324: 781–788. 29. J. R. Evans, K. Pacht, P. Huss, D. V. Unverferth, T. M. Bashore, and C. V. Leier, Chronic oral amrinone therapy in congestive heart failure: a double-blind placebo-controlled withdrawal study. Int J Clin Pharmacol Res. 1984; 4: 9–18. 30. G. H. Guyatt, M. Townsend, S. Nogradi, S. O. Pugsley, J. L. Keller, and M. T. Newhouse, Acute response to bronchodilator. An imperfect guide for bronchodilator therapy in chronic airflow limitation. Arch Intern Med. 1988; 148: 1949–1952. 31. J. L. Vazquez-Barquero, J. F. Diez-Manrique, C. Pena, R. G. Quintanal, and M. Labrador Lopez, Two stage design in a community survey. Br J Psychiatry. 1986; 149: 88–897. 32. C. Mallinckrodt, C. Chuang-Stein, P. McSorley, J. Schwartz, D. G. Archibald, et al, A case study comparing a randomized withdrawal trial and a double-blind long term trial for assessing the long-term efficacy of an antidepressant. Pharm Stat. 2007; 6: 9–22. 33. E. Brittain and J. Wittes, The run-in period in clinical trials: the effect of misclassification on efficiency. Control Clin Trials. 1990; 11: 327–338. 34. R. J. Glynn, J. E. Buring, C. H. Hennekens, D. Riley, T. J. Kaptchuk, et al, Concerns about run-in periods in randomized trials. JAMA. 1998; 279: 1526–1527. 35. K. B. Schechtman and M. E. Gordon, A comprehensive algorithm for determining whether a run-in strategy will be a cost-effective design modification in a randomized clinical trial. Stat Med. 1993; 12: 111–128. 36. S. J. Senn, A personal view of some controversies in allocating treatment to patients in clinical trials. Stat Med. 1995; 14: 2661–2674. 37. S. J. Senn, Ethical considerations concerning treatment allocation in drug development trials. Stat Methods Med Res. 2002; 11: 403–411. 38. P. D. Leber and C. S. Davis, Threats to validity of clinical trials employing enrichment strategies for sample selection. Control Clin Trials. 1998; 19: 178–187.
CROSS-REFERENCES

Clinical trial/study
Adaptive design
Randomization
Sample size estimation
Run-in period
Estimation
Inference
Hypothesis testing
Efficacy
Effectiveness
ENVIRONMENTAL ASSESSMENTS (EAs)

Under the National Environmental Policy Act of 1969 (NEPA), all Federal agencies are required to assess the environmental impact of their actions and to ensure that the interested and affected public is informed of the environmental analyses. The Center for Drug Evaluation and Research's (CDER) Environmental Assessment of Human Drug and Biologics Applications (Issued 7/1998, Posted 7/24/98) provides detailed information on a variety of topics related to preparing and filing EAs. In CDER, adherence to NEPA is demonstrated by the EA portion of the drug application. This section focuses on the environmental implications of consumer use and disposal from use of the candidate drug. However, because approval of many drugs is unlikely to have significant environmental effects, CDER has provisions for submission of abbreviated EAs rather than full EAs under certain circumstances, or has categorically excluded certain classes of actions. FDA has reevaluated its NEPA regulations found in 21 CFR (Code of Federal Regulations) Part 25 and has proposed to improve its efficiency in the implementation of NEPA and reduce the number of EAs by increasing the number of applicable categorical exclusions. The notice of proposed rulemaking was posted in the Federal Register on April 3, 1996.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/handbook/environ.htm) by Ralph D’Agostino and Sarah Karl.
EQUIVALENCE TRIALS AND EQUIVALENCE LIMITS1
H. M. JAMES HUNG
Division of Biometrics I, Office of Biostatistics, Office of Translational Sciences, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland

1 The views presented in this paper are not necessarily those of the U.S. Food and Drug Administration.

In clinical trials, when a test treatment is compared with an active or positive control that is being used in medical practice, one possible study objective may be to show that the test treatment is "equivalent" to the active control with respect to therapeutic benefits or adverse effects. Traditionally, equivalence means that the difference between the test treatment and the control treatment is clinically insignificant. The threshold of clinical insignificance would need to be defined, and this threshold is the so-called equivalence limit or equivalence margin.

1 EQUIVALENCE VERSUS NONINFERIORITY

In pharmaceutical applications, the concept of equivalence is mostly applied to assessment of a test formulation of a medical product relative to an established reference in so-called bioequivalence studies on a pharmacokinetic variable. In contrast, clinical trials are rarely aimed at showing equivalence; instead, the concept of noninferiority may be more applicable (1–18). The distinction between equivalence and noninferiority can be subtle. On one hand, a noninferiority hypothesis can be viewed as one side of an equivalence hypothesis. That is, if δ is the equivalence margin, then the equivalence hypothesis is that the test treatment and the control treatment differ by an extent smaller than δ. If the hypothesis at stake is that the test treatment is inferior to the control at most by a degree smaller than δ, it is a noninferiority hypothesis, and δ is the noninferiority margin. On the other hand, in some applications, noninferiority testing may be intended to study some other objectives, particularly when a placebo is absent from the trial. In such cases, the noninferiority margin δ may have a special meaning; for instance, the noninferiority testing may be to infer that the test treatment would have beaten the placebo had a placebo been in the trial, or that the test treatment would have retained a certain fraction of the control's effect.

2 EQUIVALENCE LIMIT OR MARGIN

In general, there is only one equivalence margin that defines the acceptable degree of clinical difference between the test treatment and the control, particularly when two effective treatments are compared to show that either treatment is not inferior to the other. As already mentioned, for clinical endpoints, the unacceptable margin of the inferiority of the test treatment to the control may have a special definition; for example, it may be an explicit function of the postulated control's effect. When such a noninferiority margin is determined to show that the test treatment is better than placebo or that the test treatment retains a certain fraction of the control's effect, it may be too large to conclude that the test treatment is not unacceptably inferior to the control. This margin is certainly irrelevant for defining the degree of superiority of the test treatment over the control; thus, if equivalence testing is pursued in these applications, the equivalence hypothesis might have two limits, one for inferiority and one for superiority.

3 DESIGN, ANALYSIS, AND INTERPRETATION OF EQUIVALENCE TRIALS

The necessary considerations for design, analysis, and interpretation of a noninferiority trial as stipulated in the references are, in principle, applicable to equivalence testing. To accept the equivalence hypothesis defined by the equivalence limits, the confidence
interval derived from the equivalence trial for the difference between the test treatment and the active control must completely fall within the equivalence range determined by the equivalence margins. This is in contrast to noninferiority testing, which requires only that the confidence interval exclude the relevant noninferiority margin (a small numerical sketch of these two decision rules follows the cross-references at the end of this entry).

REFERENCES

1. P. Bauer and M. Kieser, A unifying approach for confidence intervals and testing of equivalence and difference. Biometrika. 1996; 83: 934–937. 2. Committee for Medicinal Products for Human Use (CHMP), European Medicines Agency. Guideline on the Choice of the Non-Inferiority Margin. London, UK, July 27, 2005. Available at: http://www.emea.europa.eu/pdfs/human/ewp/215899en.pdf. 3. A. F. Ebbutt and L. Frith, Practical issues in equivalence trials. Stat Med. 1998; 17: 1691–1701. 4. S. S. Ellenberg and R. Temple, Placebo-controlled trials and active-control trials in the evaluation of new treatments. Part 2: Practical issues and specific cases. Ann Intern Med. 2000; 133: 464–470. 5. T. R. Fleming, Treatment evaluation in active control studies. Cancer Treat Reports. 1987; 71: 1061–1064. 6. T. R. Fleming, Design and interpretation of equivalence trials. Am Heart J. 2000; 139: S171–S176. 7. A. L. Gould, Another view of active-controlled trials. Control Clin Trials. 1991; 12: 474–485. 8. D. Hauschke, Choice of delta: a special case. Drug Inf J. 2001; 35: 875–879. 9. D. Hauschke and L. A. Hothorn, Letter to the Editor. Stat Med. 2007; 26: 230–233. 10. H. M. J. Hung, S. J. Wang, and R. O'Neill, Noninferiority trial. In: R. D'Agostino, L. Sullivan,
and J. Massaro (eds.), Encyclopedia of Clinical Trials. New York, Wiley, 2007. 11. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline: E9 Statistical Principles for Clinical Trials. Step 4 version, February 5, 1998. Available at: http://www.ich.org/LOB/media/ MEDIA485.pdf. 12. B. Jones, P. Jarvis, J. A. Lewis, and A. F. Ebbutt, Trials to assess equivalence: the importance of rigorous methods. BMJ. 1996; 313: 36–39. 13. T. H. Ng, Choice of delta in equivalence testing. Drug Inf J. 2001; 35: 1517–1527. 14. G. Pledger and D. B. Hall, Active control equivalence studies: do they address the efficacy issue? In: K. E. Peace (ed.), Statistical Issues in Drug Research and Development. New York: Marcel Dekker, 1990, pp. 226–238. 15. J. Rohmel, Therapeutic equivalence investigations: statistical considerations. Stat Med. 1998; 17: 1703–1714. 16. R. Temple, Problems in interpreting active control equivalence trials. Account Res. 1996; 4: 267–275. 17. R. Temple and S. S. Ellenberg, Placebocontrolled trials and active-control trials in the evaluation of new treatments. Part 1: Ethical and scientific issues. Ann Intern Med. 2000; 133: 455–463. 18. B. Wiens, Choosing an equivalence limit for non-inferiority or equivalence studies. Control Clin Trials. 2002; 23: 2–14.
CROSS-REFERENCES

Non-inferiority Trial
Non-inferiority Margin
Bioequivalence
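As a small numerical illustration of the decision rules discussed in Section 3 above, the following sketch checks, for one hypothetical data set, whether a 95% confidence interval for the test-minus-control difference supports equivalence (the whole interval must lie within the margins) or only noninferiority (only the lower margin must be excluded). The counts, the margin of 0.10, and the normal-approximation interval are all assumed for illustration and are not drawn from any specific guidance.

```python
# Hypothetical illustration of the equivalence vs. noninferiority decision rules
# based on a two-sided 95% confidence interval for a difference in proportions.
from math import sqrt

x_test, n_test = 180, 200   # assumed successes / sample size, test treatment
x_ctrl, n_ctrl = 168, 200   # assumed successes / sample size, active control
delta = 0.10                # assumed equivalence/noninferiority margin

p_test, p_ctrl = x_test / n_test, x_ctrl / n_ctrl
diff = p_test - p_ctrl
se = sqrt(p_test * (1 - p_test) / n_test + p_ctrl * (1 - p_ctrl) / n_ctrl)
lower, upper = diff - 1.96 * se, diff + 1.96 * se

print(f"difference = {diff:+.3f}, 95% CI = ({lower:+.3f}, {upper:+.3f})")
print("equivalence concluded:   ", -delta < lower and upper < delta)
print("noninferiority concluded:", lower > -delta)
```

With these assumed numbers the interval excludes −δ but extends beyond +δ, so noninferiority would be accepted while two-sided equivalence would not, mirroring the distinction drawn in Sections 1 and 2.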
ESSENTIAL DOCUMENTS
Essential documents are those documents that permit evaluation individually and collectively of the conduct of a trial and the quality of the data produced. These documents serve to demonstrate the compliance of the investigator, sponsor, and monitor with the standards of Good Clinical Practice (GCP) and with all applicable regulatory requirements. Essential documents also serve several other important purposes. Filing essential documents at the investigator/institution and sponsor sites in a timely manner can assist greatly in the successful management of a trial by the investigator, sponsor, and monitor. Also, these documents usually are audited by the sponsor's independent audit function and are inspected by the regulatory authority(ies) as part of the process to confirm the validity of the trial conduct and the integrity of data collected.

The minimum list of essential documents has been developed. The various documents are grouped in three sections according to the stage of the trial during which they will normally be generated: (1) before the clinical phase of the trial commences, (2) during the clinical conduct of the trial, and (3) after completion or termination of the trial. A description is given of the purpose of each document, and whether it should be filed in either the investigator/institution or the sponsor files, or both. It is acceptable to combine some documents, provided the individual elements are readily identifiable.

Trial master files should be established at the beginning of the trial, both at the investigator/institution's site and at the sponsor's office. A final close-out of a trial can be performed only when the monitor has reviewed both investigator/institution and sponsor files and has confirmed that all necessary documents are in the appropriate files. Any or all documents addressed in this guideline may be subject to, and should be available for, audit by the sponsor's auditor and inspection by the regulatory authority(ies).

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D'Agostino and Sarah Karl.
ETHICAL CHALLENGES POSED BY CLUSTER RANDOMIZATION
NEIL KLAR
Cancer Care Ontario, Division of Preventive Oncology, Toronto, Ontario, Canada

ALLAN DONNER
The University of Western Ontario, Department of Epidemiology and Biostatistics, London, Ontario, Canada

1 INTRODUCTION

Ethical guidelines for medical experimentation were first put forward in 1947 with publication of the Nuremberg Code (1, 2). Numerous national and international ethical codes have since been developed, written almost exclusively in the context of clinical trials in which a physician's patients are individually assigned for the purpose of evaluating the effect of therapeutic interventions. In spite of some limited early examples (3), far less attention has been given to the distinct ethical challenges of cluster randomization (4–7), reflecting, perhaps, the recent growth of interest in this design. Experience has shown that cluster randomization is primarily adopted to evaluate nontherapeutic interventions, including lifestyle modification, educational programs, and innovations in the provision of health care. The limited attention given to the ethical aspects of this design may be more broadly related to the fact that trials of disease prevention have tended to be largely exempt from ethical constraints, possibly because of a perception that participation in such trials carries only minimal risk (8). In Section 2, the key ethical issues raised by cluster randomization trials will be examined by discussing them in the context of several recently published studies. The relative potential for harm and benefit to subjects participating in either individually randomized or cluster randomized trials is described in Section 3, whereas difficulties in obtaining informed consent before random assignment are discussed in Section 4. Section 5 concludes the article, largely as a means of promoting debate, with recommendations for the reporting of ethical issues in cluster randomization trials. Readers interested in a more detailed discussion might wish to consult Donner and Klar (9), from which much of this article was abstracted.

2 EXAMPLES

1. A group of public health researchers in Montreal (10) conducted a household randomized trial to evaluate the risk of gastrointestinal disease due to consumption of drinking water. Participating households were randomly assigned to receive an in-home water filtration unit or were assigned to a control group that used tap water. Overall, 299 families (1206 individuals) were assigned to receive water filters and 308 families (1201 individuals) were in the control group.

2. The National Cancer Institute of the United States funded the Community Intervention Trial for Smoking Cessation (COMMIT), which investigated whether a community-level, 4-year intervention would increase quit rates of cigarette smokers (11). Communities were selected as the natural experimental unit because investigators assumed that interventions offered at this level would reach the greatest number of smokers and possibly change the overall environment, thus making smoking less socially acceptable. Overall, 11 matched pairs of communities were enrolled in this study, with one community in each pair randomly assigned to the experimental intervention and the remaining community serving as a control.

3. Antenatal care in the developing world has attempted to mirror care that is offered in developed countries even though not all antenatal care interventions are known to be effective.
The World Health Organization (WHO) antenatal care randomized trial (12) compared a new model of antenatal care that emphasized health-care interventions known to be effective with the standard model of antenatal care. The primary hypothesis in this equivalence trial was that the new model of antenatal health care would not increase the risk of having a low birthweight (<2500 g) child. Participating clinics, recruited from Argentina, Cuba, Saudi Arabia, and Thailand, were randomly assigned to an intervention group or control group separately within each country. Overall, 27 clinics (12,568 women) were randomly assigned to the experimental arm, whereas women attending 26 control group clinics (11,958 women) received standard antenatal care.
3 THE RISK OF HARM
The challenge of designing an ethical randomized trial requires balancing the potential benefits and risk of harm faced by individual participants with the potential long-term benefit to those subjects and to society at large. Indeed, it is concern for the risk of harm that has traditionally required that the use of random assignment be justified by the notion of ‘‘clinical equipoise,’’ whereby ‘‘experts are uncertain as to whether any of the proposed arms of a trial is superior to another’’ (2). Too often the potential risks of participating in health education or disease prevention trials have been largely neglected (8), which is unfortunate because not all such trials meet the criteria of minimal risk that, when achieved, may allow investigators to conduct research without first obtaining informed consent from study subjects (1). According to the U.S. Federal Government (13): Minimal risk means that the probability and magnitude of harm or discomfort anticipated in the research are not greater than those ordinarily encountered in daily life or during the
performance of routine physical or psychological examinations or tests.
Furthermore, as the benefits from participation in a trial decrease, so must the net risk of harm. Although risk could be considered low for subjects participating in each of the three trials described in Section 2, it is arguable that only COMMIT (11) meets the minimal risk criteria defined above. In this trial, participation was limited to the completion of standardized questionnaires and the intervention was comprised of various community-level smoking cessation programs. On the other hand, subject risk was perhaps greatest for participants in the WHO antenatal care trial, as women in the experimental group received a more limited range of antenatal care than women in the control group [although women at high risk for pregnancy complications in either group were offered whatever additional care was deemed appropriate (12)]. Attention to the risk/benefit ratio should also extend to control group subjects. Therefore, some investigators have attempted to ensure that these individuals can still benefit from participation by offering a minimal level of intervention or, alternatively, by offering all individuals the intervention by the technique of delaying its implementation in the control group. This strategy may have the added benefit of encouraging participation in the trial. Typically, however, control group subjects receive only usual care, as was the case, for example, with women participating in the WHO antenatal care trial. Investigators need to recognize that participation in a cluster randomization trial may entail considerable effort at both the individual and at the cluster level. At a minimum, it is important to communicate study findings both to trial participants and to decision makers. Investigators should also be sensitive to the consequences of a successful intervention so that the benefits enjoyed by study participants do not necessarily end when the trial is completed.
4 THE ROLE OF INFORMED CONSENT
According to the World Medical Association Declaration of Helsinki (14), consent must be obtained from each patient before random assignment. Such a requirement not only assures that the risks of experimentation are adequately communicated to patients, but also facilitates the process of random assignment, which may at times be seen to compromise the implicit contractual relationship between patient and physician. The situation is more complicated for cluster randomization trials, particularly when relatively large allocation units are randomized, such as schools, communities, or worksites. Then school principals, community leaders, or other key decision makers will usually provide permission for both random assignment and implementation of the intervention. For example, in the WHO antenatal care trial, women only became eligible to participate in the trial after becoming pregnant and so could not have been enrolled before random assignment (12). Individual study subjects may still be free to withhold their participation, although they may not even then be able to completely avoid the inherent risks of an intervention that is applied on a cluster-wide level (as, for example, in studies of water fluoridation or mass education). The identification of individuals mandated to provide agreement for random assignment may not be a simple task. Typically, it is elected or appointed officials who make such decisions. However, as Strasser et al. (7) point out, it is by no means certain when, or even if, securing the agreement of these officials is sufficient. Thus, as stated in a set of recently released guidelines for cluster randomized trials (15), ‘‘the roles of the guardians of the patients interests during the trial, the gatekeepers of access to patient groups, and sponsors of the research are even more important in cluster randomized trials where individuals may not have the opportunity to give informed consent to participation.’’ In the context of community-based trials, it may similarly be prudent for investigators to establish or collaborate with community advisory boards (16), a strategy used by COMMIT investigators (11, 17).
The practical difficulties of securing informed consent before random assignment may not necessarily occur when smaller clusters such as households or families are the unit of randomization. For instance, one of the eligibility criteria for the Montreal water filtration study (10) was a "willingness to participate in a longitudinal trial in which a random half of the households would have a filter installed." The relative absence of ethical guidelines for cluster randomized trials appears to have created a research environment in which the choice of randomization unit may determine whether informed consent is deemed necessary before random assignment. This phenomenon can be seen, for example, in several published trials of Vitamin A supplementation on childhood mortality. Informed consent was obtained from mothers before assigning children to either Vitamin A or placebo in the household randomization trial reported by Herrera et al. (18). However, in the community intervention trial of Vitamin A reported by the Ghana VAST Study Team (19), consent to participate was obtained only after random assignment, which may be seen as an example of the randomized consent design proposed by Zelen (20). It seems questionable, on both an ethical level and a methodological level, whether the unit of randomization should play such a critical role in deciding whether informed consent is required.

5 DISCUSSION
As a first step to developing a well-accepted set of ethical guidelines and norms for cluster randomization trials, editors could require investigators to report having IRB approval, and to indicate how issues of subject consent were dealt with. The greater challenge is to agree on ethical features of cluster randomization trials that should be reported (e.g., the timing of informed consent). When permission from key decision makers associated with each cluster is needed for assigning interventions, investigators should identify these decision makers and indicate how they were selected. Some information about the consent procedure administered
to individual study subjects should also be provided. For example, it would usually be helpful to know when consent was obtained and what opportunities, if any, existed for cluster members to avoid the inherent risks of intervention. Note that these suggestions may not require the development of novel ethical criteria. Very similar suggestions have been proposed earlier in the 1991 International Guidelines for Ethical Review of Epidemiological Studies put forward by CIOMS, the Council for International Organizations of Medical Sciences (1). To promote debate, the authors conclude by quoting from the Community Agreement section of these underutilized guidelines: When it is not possible to request informed consent from every individual to be studied, the agreement of a representative of a community or group may be sought, but the representative should be chosen according to the nature, traditions and political philosophy of the community or group. Approval given by a community representative should be consistent with general ethical principles. When investigators work with communities, they will consider communal rights and protection as they would individual rights and protection. For communities in which collective decision-making is customary, communal leaders can express the collective will. However, the refusal of individuals to participate in a study has to be respected: a leader may express agreement on behalf of a community, but an individual’s refusal of personal participation is binding (1).
REFERENCES

1. B. A. Brody, The Ethics of Biomedical Research. Oxford: Oxford University Press, 1998. 2. J. Sugarman, Ethics in the design and conduct of clinical trials. Epidemiologic Rev. 2002; 24: 54–58. 3. C. L. Schultze, Social programs and social experiments. In: A. M. Rivlin and P. M. Timpane (eds.), Ethical and Legal Issues of Social Experimentation. Washington, DC: The Brookings Institution, 1975, pp. 115–125. 4. K. Glanz, B. K. Rimer, and C. Lerman, Ethical issues in the design and conduct of community-based intervention studies. In:
S. S. Coughlin and T. L. Beauchamp (eds.), Ethics and Epidemiology. Oxford: Oxford University Press, 1996, pp. 156–177. 5. S. J. L. Edwards, D. A. Braunholtz, R. J. Lilford, and A. J. Stevens, Ethical issues in the design and conduct of cluster randomised controlled trials. Brit. Med. J. 1999; 318: 1407–1409. 6. J. L. Hutton, Are distinctive ethical principles required for cluster randomized controlled trials? Stat. Med. 2001; 20: 473–488. 7. T. Strasser, O. Jeanneret, and L. Raymond, Ethical aspects of prevention trials. In: S. Doxiadis (ed.), Ethical Dilemmas in Health Promotion. New York: Wiley, 1987, pp. 183–193. 8. P. Skrabanek, Why is preventive medicine exempted from ethical constraints? J. Med. Ethics 1990; 16: 187–190. 9. A. Donner and N. Klar, Design and Analysis of Cluster Randomization Trials in Health Research. London: Arnold, 2000. 10. P. Payment, L. Richardson, J. Siemiatycki, R. Dewar, M. Edwardes, and E. Franco, A randomized trial to evaluate the risk of gastrointestinal disease due to consumption of drinking water meeting microbiological standards. Amer. J. Public Health 1991; 81: 703–708. 11. COMMIT Research Group, Community Intervention Trial for Smoking Cessation (COMMIT): I. Cohort results from a four-year community intervention. Amer. J. Public Health 1995; 85: 183–192. 12. J. Villar et al. for the WHO Antenatal Care Trial Research Group, WHO antenatal care randomised trial for the evaluation of a new model of routine antenatal care. Lancet 2001; 357: 1551–1564. 13. Office for Protection from Research Risks, Protection of Human Subjects. Title 45, Code of Federal Regulations, Part 46. Bethesda, MD: Department of Health and Human Services N.I.H., 1994. 14. World Medical Association Declaration of Helsinki, Ethical principles for medical research involving human subjects. JAMA 2000; 284: 3043–3045. 15. Medical Research Council Clinical Trials Series, Cluster Randomized Trials: Methodological and Ethical Considerations. London: MRC, 2000. Available: www.MRC.ac.uk. 16. R. P. Strauss et al., The role of community advisory boards: involving communities in the informed consent process. Amer. J. Public Health 2001; 91: 1938–1943.
ETHICAL CHALLENGES POSED BY CLUSTER RANDOMIZATION 17. B. Thompson, G. Coronado, S. A. Snipes, and K. Puschel, Methodologic advances and ongoing challenges in designing community-based health promotion programs. Annu. Rev. Public Health 2003; 24: 315–340. 18. M. G. Herrera et al., Vitamin A supplementation and child survival. Lancet 1992; 340: 267–271. 19. Ghana VAST Study Team, Vitamin A supplementation in northern Ghana: effects on clinic attendances, hospital admissions, and child mortality. Lancet 1993; 342: 7–12. 20. M. Zelen, Randomized consent designs for clinical trials: an update. Stat. Med. 1990; 9: 645–656.
ETHICAL ISSUES IN INTERNATIONAL RESEARCH
REIDAR K. LIE
Department of Bioethics, National Institutes of Health, Bethesda, Maryland

1 BACKGROUND AND RATIONALE FOR INTERNATIONAL RESEARCH

Increasingly, clinical trials involve partners in two or more countries. Examples are multisite clinical trials recruiting patients in several countries, or clinical trials carried out in a host country different from the country of the sponsor. Such trials raise particularly challenging ethical issues, especially when there are substantial differences in wealth between host and sponsor countries. Sponsors can be public agencies or pharmaceutical companies. The portion of the U.S. National Institutes of Health (NIH) budget going to international sites, both directly and through U.S. principal investigators, has shown a steady increase over the past two decades, and is now over $500 million. The European Union, together with a number of national governments and private donors, has launched a high-profile initiative to fund research on interventions against poverty-related diseases, primarily malaria, human immunodeficiency virus (HIV), and tuberculosis; these efforts through the European and Developing Countries Clinical Trials Partnership (EDCTP) have been primarily related to competence building in African countries (1). The mission of this initiative is to "accelerate the development of new clinical interventions to fight HIV/AIDS, malaria and tuberculosis in developing countries (DCs), particularly sub-Saharan Africa, and to improve generally the quality of research in relation to these diseases" (1).

The pharmaceutical industry has traditionally been quite profitable; however, coupled with rising development costs, the risks associated with drug development have increased, and there have been recent concerns about the lack of new drugs in the pipeline. Pharmaceutical companies are looking for new ways to cut the costs of drug development and have shown increasing interest in outsourcing their clinical trials to low-cost countries. India and Eastern Europe have become attractive as new clinical trial sites because labor and infrastructure costs in those countries are cheaper, and there is an abundance of treatment-naive patients who are ideal for participation in trials of new candidate drugs. According to one estimate, 40% of all clinical trials are now carried out in countries such as Russia and India. It is estimated that patients can be recruited 10 times faster in Russia compared with Western countries, and that drug development can be $200 million cheaper if conducted in India, a substantial saving relative to normal drug development costs of an estimated $800 million (2).

One challenge common to all international collaborative research is how to deal with conflicting regulations between host and sponsor countries. This is particularly a challenge if host country regulations are territorial and cover research done in that country, whereas sponsor country regulations apply according to funding source, such as the U.S. Common Rule. If U.S. federally funded research is done in another country, both the U.S. regulations and the host country regulations apply to that research, so the question is what should be done if they are in conflict. If, for example, a host country research ethics committee takes the position that requiring a signature or a thumbprint on an informed consent form in an illiterate population is unethical, this decision would not be in accordance with the U.S. regulations. In the national legislation of some countries, research ethics committees do not have the authority to stop research; however, U.S. regulations require that institutional review boards (IRBs) have this authority. Fortunately, most of the time in cases of apparent conflict, the regulations are flexible enough to enable reasonable accommodations by the multiple ethics review boards involved in any one international trial.
Both publicly funded research on poverty-related diseases and research sponsored by pharmaceutical companies in poor countries aimed at developing drugs for richer countries raise a number of ethical issues that have been discussed widely in the literature during the past few years. We will discuss four of the main concerns.
2 THE STANDARD OF CARE CONTROVERSY
Perhaps the most contentious issue in research ethics during the past few years has been whether there is an ethical obligation to provide participants in trials, irrespective of differences in economic circumstances, interventions that are at least as good as those that would be provided to participants in resource-rich countries. This debate started with the controversy over trials to test treatment regimens to prevent perinatal HIV transmission. Studies had demonstrated that an expensive preventive regimen of providing azidothymidine (AZT) for several weeks during pregnancy, intravenously during delivery, and to the newborn for some time after birth could dramatically reduce infection in newborns from around 30% to less than 10% (3). This regimen was considered too expensive and logistically impossible for resource-poor countries. There was therefore a need to identify more suitable interventions in these settings. Some argued that only by doing a trial where these new interventions were tested against placebo could one obtain results that would be useful for the host countries (4). We have argued that this position is correct (5), although others have maintained that one could have obtained useful results by doing an equivalence trial (6). The critics of these placebo-controlled trials maintained that they are morally unacceptable because everyone in a clinical trial, regardless of location, is entitled to an intervention that is at least as good as that which is regarded as the best treatment option for a particular condition (7). Irrespective of what one’s position is with regard to the perinatal HIV transmission controversy, the question remains as to whether it is at all permissible at least sometimes
to provide participants with a lesser standard of care than is available elsewhere, but not locally, in order to obtain useful results for that particular setting. The World Medical Association (WMA) has taken the position that it is not permissible (Declaration of Helsinki, paragraph 29): The benefits, risks, burdens and effectiveness of a new method should be tested against those of the best current prophylactic, diagnostic and therapeutic methods. This does not exclude the use of placebo, or no treatment, in studies where no proven prophylactic, diagnostic or therapeutic method exists (8).
The clarification to this paragraph would also allow placebo use in cases where it is scientifically necessary, when doing research on conditions for which withholding effective interventions would cause only temporary and minor harm. It certainly would not allow withholding interventions in the case of lifesaving interventions such as antiretroviral agents to prevent HIV infection (Declaration of Helsinki, clarification to paragraph 29): The WMA hereby reaffirms its position that extreme care must be taken in making use of a placebo-controlled trial and that in general this methodology should only be used in the absence of existing proven therapy. However, a placebo-controlled trial may be ethically acceptable, even if proven therapy is available, under the following circumstances:
• Where for compelling and scientifically sound methodological reasons its use is necessary to determine the efficacy or safety of a prophylactic, diagnostic or therapeutic method; or
• Where a prophylactic, diagnostic or therapeutic method is being investigated for a minor condition and the patients who receive placebo will not be subject to any additional risk of serious or irreversible harm (8).
We have argued that most other international guidance documents have disagreed with this position, and we accept what we have called the international consensus
opinion (9). The Council for International Organizations of Medical Sciences (CIOMS) guidelines are more typical, and it is worth quoting at length from this document: An exception to the general rule [represented by the Declaration of Helsinki] is applicable in some studies designed to develop a therapeutic, preventive or diagnostic intervention for use in a country or community in which an established effective intervention is not available and unlikely in the foreseeable future to become available, usually for economic or logistic reasons. The purpose of such a study is to make available to the population of the country or community an effective alternative to an established effective intervention that is locally unavailable. Also, the scientific and ethical review committees must be satisfied that the established effective intervention cannot be used as comparator because its use would not yield scientifically reliable results that would be relevant to the health needs of the study population. In these circumstances an ethical review committee can approve a clinical trial in which the comparator is other than an established effective intervention, such as placebo or no treatment or a local remedy (10).
According to the CIOMS position, at least three conditions must be satisfied before one can allow an exception to the general rule requiring participants to receive care at least as good as what is regarded as state-of-the-art care for the condition considered:
1. The state-of-the-art intervention must not be available locally outside the research context. Research subjects must not be denied care that they would otherwise receive locally.
2. It is necessary, for scientific reasons, to provide an intervention different from the state-of-the-art intervention for the subject’s condition. One could not achieve the objectives of the trial in any other way.
3. The objective of the trial is to identify an intervention that can be useful in the local context.
If we apply these criteria to the original perinatal HIV transmission controversy, it is unclear whether these trials would be justified because a case can be made that they
violated the third condition. This is the basis for arguments made by at least some of the critics of these trials (11). However, trials such as those developing a new, simplified diagnostic tool, known to be less effective than viral load measurements, would be approved using this standard (12).
2.1 Obligations to Trial Participants to Provide Ancillary Care
Researchers interacting with study participants in clinical research will at times identify needs for clinical care that would be beneficial to the participants, but that are not necessary to provide as part of the design of the study. For example, during a malaria vaccine trial, participants may be identified as needing treatment for HIV infection. Assuming that treatment of the HIV infection is not part of the protocol and would not affect the design of the research one way or another, the question is what obligation investigators have, if any, to address these types of health needs of their participants, assuming that their needs cannot be met by the regular health-care system because of lack of resources. Is there an obligation to provide ancillary care to study participants, defined as care that is not needed as part of the design of the study? At least one international guidance document, CIOMS, says that there is no obligation to provide ancillary care: ‘‘Although sponsors are, in general, not obliged to provide health-care services beyond that which is necessary to conduct research, it is morally praiseworthy to do so’’ (10). It is clear from discussions with researchers in resource-poor settings that, informally, researchers and sponsors do in fact provide a substantial amount of ancillary care to their trial participants because, following CIOMS, they think it is the right and decent thing to do. The unanswered question is whether we should take a position stronger than that and claim that certain types of ancillary care are not only morally praiseworthy but also obligatory. Recently, Richardson and Belsky (13, 14) have attempted to provide an argument for why there is an obligation to provide some limited form of ancillary care. Basically, they argue that obligations are created by the interactions between researchers
and patients, and by the fact that the patients are providing researchers with privileged information that is useful for science. Information about a person’s health state that is gathered in the process of carrying out the protocol falls within the scope of the obligation. However, the strength of the claim may vary, depending on factors such as how important and urgent the health problem is, and how difficult it would be for the researcher to address the problem. In this model, if researchers need to test for HIV during a malaria vaccine trial because they are interested in seeing the influence of HIV status on the effectiveness of the vaccine, it would give them a prima facie obligation to ensure treatment access for the patients they identify as HIV positive. Whether they actually have such an obligation would in part depend on the cost of providing the care. However, in a pediatric trial of an HIV vaccine, it would not be an obligation for the researchers to ensure provision of treatment for malaria or acute life-threatening diarrhea, even though these health problems are arguably more urgent than the HIV infection in the malaria vaccine trial. For this reason, the Richardson/Belsky model has been criticized by Dickert et al. (15). The question of how to best analyze researcher and sponsor obligations to provide ancillary care has been relatively unexplored in the literature, so there is no clear consensus one way or another at the moment.
2.2 Obligations to Trial Participants to Continue to Provide Care after the Conclusion of the Trial
Traditionally, researchers and sponsors have only been responsible for trial participants during the trial. Trial interventions have only been provided to participants for the time necessary to complete the study, and there is an expectation that the ordinary health-care system will begin to provide effective interventions identified during a trial. Although a single trial may not provide evidence that a particular intervention is effective, there certainly are cases where it is clear that participants receive benefits from whatever interventions they are provided as part of the
design of the study. The question is whether researchers have an obligation to ensure provision of effective interventions to their trial participants after the completion of the trial when there is no expectation that the ordinary health-care system can do so. According to the Declaration of Helsinki (paragraph 30), At the conclusion of the study, every patient entered into the study should be assured of access to the best proven prophylactic, diagnostic, and therapeutic methods identified by that study (8).
Thus, in a trial of two different drug combinations for HIV, for example, there is an obligation to ensure continuation of a new combination of drugs after trial completion if that combination has been identified as effective. After the revised version of the Declaration was accepted in 2000, this statement came under immediate criticism. It was pointed out that this would present an undue burden on research sponsors, whether public or private; this would divert scarce resources away from important research if these institutions were expected to budget for treatment provision after completion of a trial. There is also a regulatory issue with regard to the use of still unapproved drugs outside a clinical trial, if such use is necessary after trial completion but before regulatory approval. The World Medical Association considered these objections and issued the following note of clarification: The WMA hereby reaffirms its position that it is necessary during the study planning process to identify post-trial access by study participants to prophylactic, diagnostic and therapeutic procedures identified as beneficial in the study or access to other appropriate care. Post-trial access arrangements or other care must be described in the study protocol so the ethical review committee may consider such arrangements during its review (8).
In one sense, this clarification is weaker than the original paragraph. It now states that ‘‘it is necessary . . . to identify post-trial access’’ and to describe arrangements in the protocol. Although it no longer says one must ‘assure access’—and it certainly does not require
funding access programs—what had actually been meant by assuring access in paragraph 30 was still left unspecified. Even if one agrees in principle that one cannot simply abandon one’s patients once the research is completed, there is a whole range of options open to sponsors and researchers, from discussing the options with patients to actually providing or paying for the interventions. Even if one agrees that one should do one’s best to ensure access to interventions once the trial is over, it is still not clear what follows in terms of arrangements that need to be in place before the trial starts. In particular, little guidance is provided regarding the criteria an IRB should use when rejecting a trial because insufficient arrangements have been made for post-trial access. An IRB would, for example, have to weigh the importance of the research and the benefits it can provide against the potential benefits to the participants of obtaining post-trial access. These types of trade-offs are illustrated by the NIH Guidance for Addressing the Provision of Antiretroviral Treatment for Trial Participants Following their Completion of NIH-Funded HIV Antiretroviral Treatment Trials in Developing Countries: For antiretroviral treatment trials conducted in developing countries, the NIH expects investigators/contractors to address the provision of antiretroviral treatment to trial participants after their completion of the trial . . . The NIH recommends investigators/contractors work with host countries’ authorities and other stakeholders to identify sources available, if any, in the country for the provision of antiretroviral treatment to antiretroviral treatment trial participants following their completion of the trial . . . Applicants are expected to provide NIH Program Staff for evaluation their plans that identify available sources, if any, for provision of antiretroviral treatment to research participants (16).
It is consistent with this policy that investigators have considered options but concluded that there is no possibility of continuing to provide post-trial access to antiretroviral agents used as part of the research design. Many would, of course, criticize this guidance as too weak.
However, the guidance also says that ‘‘Priority may be given to sites where sources are identified for provision of ARV treatment.’’ This part of the policy could be criticized as too strong. It could, for example, mean that research funded by NIH would shift to countries or sites that are relatively well off, where it is easier to guarantee post-trial access. This would deprive countries or sites that already are disadvantaged in terms of antiretroviral access of the potential benefits of hosting an antiretroviral treatment trial. If IRBs are to address the continued provision of care to trial participants as part of the approval process, guidance is needed about how they should resolve the tension between ensuring access to continued, effective care and not denying participants and communities the benefits of research participation. However, at this time there is very little practical guidance about how to go about this.
2.3 Obligation to be Responsive to Host Country Health Needs
International research ethics guidelines stress that research must be responsive to host country health needs. The Declaration of Helsinki, for example, states, ‘‘Medical research is only justified if there is a reasonable likelihood that the populations in which the research is carried out stand to benefit from the results of the research’’ (paragraph 19) (8). The CIOMS guidelines (10) state, ‘‘The health authorities of the host country, as well as a national or local ethical review committee, should ensure that the proposed research is responsive to the health needs and priorities of the host country’’ (see Guideline 3), and elsewhere maintain that research ‘‘should be responsive to their health needs and priorities in that any product developed is made reasonably available to them’’ (see General Ethical Principles). These guidelines suggest that there is a requirement to do research where the products developed in that research will meet some priority health needs of the host country. Outsourcing of drug development to host countries with the expectation that the product is primarily aimed at the sponsor country markets would therefore seem to be prohibited by these guidelines. It is unclear, however, whether
the associated capacity building, infrastructure development, and training that could be used by the host country in the future would be allowed with the justification that the research ‘‘is responsive to the health needs and priorities of the host country.’’ Criticism has been leveled at the requirement that host country benefit should be understood as making the product or intervention developed in the research reasonably available to the host country population (17, 18). The essential part of this criticism is that there may be other, more important benefits for the host country. For example, health-care infrastructure provided during a trial that can later be used to treat important health problems in the community may be more important than the ‘‘benefit’’ of guaranteed access to a product that may or may not turn out to be beneficial and, even if beneficial, may turn out to be inferior to comparable interventions tested in other countries. Or it may be that a vaccine trial for a product that is not of high priority for the host country may provide the crucial expertise to conduct locally sponsored trials of products that are still in development but target important health problems in the country; in such cases, a trial of a low-priority product may reasonably be preferred over a trial of a high-priority product whose promise is in scientific doubt. Host countries are allowed to make these types of trade-offs in the fair benefits framework, but not according to the CIOMS guidelines. This framework has been criticized because it does not take into account the weak bargaining power of host countries when they negotiate benefit agreements. If whatever is negotiated during interactions with powerful sponsors, whether public or private, is accepted as a ‘‘fair transaction,’’ it is clear that developing countries will lose out because of an existing lack of resources that is in itself unjust (19). But this is a misunderstanding of the fair benefits framework. There are two independent requirements in this framework. First, host countries are allowed to trade off health benefits to address current health problems against health benefits for future populations in terms of access to potential interventions. Second, the total amount of benefits received
will have to be fair, according to an acceptable theory of justice. What that level is—and what the appropriate theory of justice is—is not specified in the way this framework has been developed so far. That, of course, is a weakness of the present framework. Clearly, one’s position in this controversy would affect how one views the recent trends toward outsourcing of clinical research. If one takes the position that the product developed must be relevant to the health-care priorities of the host countries, one would probably be reluctant to approve studies outsourced to low-cost countries for products that are primarily aimed at rich country markets. If, on the other hand, it is at least in principle justified to do studies in a country that are primarily intended for the health needs of other countries, then it becomes a matter of determining whether the host country benefit from the research is large enough and fair, by some account of justice.
3 CONCLUSION: LOCAL VERSUS UNIVERSAL
All of the controversies we have examined seem to involve one underlying theme. Although all of us can agree that we have a responsibility to do something about the unjust and huge disparities in access to health care, there is disagreement about how one should go about doing something to change the current state of affairs. Some argue that, because we cannot do much about the underlying injustice, we should do the best we can within the current framework to produce some improvement. On the basis of this conviction, we could accept a local standard of care in a clinical trial to produce results for the common good, given current conditions. We might also accept a trial that may not yield a product that would be useful for the health priorities of the host country but would produce essential expertise, not easily obtained otherwise, which in turn could be used to improve local conditions. Others, however, argue that we always have a direct responsibility to apply universal principles of justice; that we should always, in every action, try to change the underlying unjust structure, even if an alternative course of
action would produce more benefits in the short term. In this view, we have an obligation to provide the best standard of care always, and we should only test interventions that can be directly useful for the population studied. Given the larger disagreements about underlying principles, it is not surprising that the ethical issues that arise in international research have generated so much controversy.
Acknowledgments
The opinions expressed are the authors’ own. They do not reflect any position or policy of the National Institutes of Health, U.S. Public Health Service, or Department of Health and Human Services. This research was supported by the Intramural Research Program of the NIH Clinical Center.
REFERENCES
1. European and Developing Countries Clinical Trials Partnership Programme (EDCTP). Programme for Action: Accelerated Action on HIV/AIDS, Malaria and Tuberculosis in the Context of Poverty Reduction. June 16, 2003. Available at: http://ec.europa.eu/research/info/conferences/edctp/edctp_en.html
2. S. Nundy and C. M. Gulhati, A new colonialism? Conducting clinical trials in India. N Engl J Med. 2005; 352: 1633–1636.
3. E. M. Connor, R. Sperling, and R. Gelber, Reduction of maternal-infant transmission of human immunodeficiency virus type 1 with zidovudine treatment. N Engl J Med. 1994; 331: 1173–1180.
4. H. Varmus and D. Satcher, Ethical complexities of conducting research in developing countries. N Engl J Med. 1997; 337: 1003–1005.
5. D. Wendler, E. J. Emanuel, and R. K. Lie, The standard of care debate: can research in developing countries be both ethical and responsive to those countries’ health needs? Am J Public Health. 2004; 94: 923–928.
6. P. Lurie and S. M. Wolfe, Unethical trials of interventions to reduce perinatal transmission of the human immunodeficiency virus in developing countries. N Engl J Med. 1997; 337: 1377–1381.
7. M. Angell, The ethics of clinical research in the third world. N Engl J Med. 1997; 337: 847–849.
8. World Medical Association. Declaration of Helsinki: Ethical Principles for Medical Research Involving Human Subjects. September 10, 2004. Available at: http://www.wma.net/e/policy/b3.htm
9. R. K. Lie, E. J. Emanuel, C. Grady, and D. Wendler, The standard of care debate: the international consensus opinion versus the Declaration of Helsinki. J Med Ethics. 2004; 30: 190–193.
10. Council for International Organizations of Medical Sciences. International Ethical Guidelines for Biomedical Research Involving Human Subjects. Geneva: CIOMS, 2002. Available at: http://www.cioms.ch/frame_guidelines_nov_2002.htm
11. L. H. Glantz, G. J. Annas, M. A. Grodin, and W. K. Mariner, Research in developing countries: taking ‘‘benefit’’ seriously. Hastings Center Report 1998; 28: 38–42.
12. J. Killen, C. Grady, G. K. Folkers, and A. S. Fauci, Ethics of clinical research in the developing world. Nat Rev Immunol. 2002; 2: 210–215.
13. L. Belsky and H. S. Richardson, Medical researchers’ ancillary clinical-care responsibilities. BMJ. 2004; 328: 1494–1496.
14. H. Richardson and L. Belsky, The ancillary-care responsibilities of medical researchers: an ethical framework for thinking about the clinical care that researchers owe their subjects. Hastings Center Report 2004; 34: 25–33.
15. N. Dickert, K. L. DeRiemer, P. Duffy, L. Garcia-Garcia, B. Sina, et al., Ancillary care responsibilities in observational research: two cases, two problems. Lancet. 2007; 369: 874–877.
16. Office of Extramural Research, National Institutes of Health. Guidance for Addressing the Provision of Antiretroviral Treatment for Trial Participants Following Their Completion of NIH-Funded HIV Antiretroviral Treatment Trials in Developing Countries. March 16, 2005. Available at: http://grants.nih.gov/grants/policy/antiretroviral/index.htm
17. Participants in the 2001 Conference on Ethical Aspects of Research in Developing Countries. Fair benefits for research in developing countries. Science. 2002; 298: 2133–2134.
18. Participants in the 2001 Conference on Ethical Aspects of Research in Developing Countries. Moving beyond reasonable availability to fair benefits for research in developing countries. Hastings Center Report 2004; 34: 17–27.
19. A. London, Justice and the human development approach to international research. Hastings Center Report 2005; 35: 24–37.
EUROPEAN MEDICINES AGENCY (EMEA)
This article was modified from the website of the European Medicines Agency (http://www.emea.europa.eu/mission.htm) by Ralph D’Agostino and Sarah Karl.

The European Medicines Agency (EMEA) is a decentralized body of the European Union (EU) with headquarters in London. Its main responsibility is the protection and promotion of public and animal health through the evaluation and supervision of medicines for human and veterinary use. The EMEA is responsible for the scientific evaluation of applications for European marketing authorization for medicinal products. Under the centralized procedure, companies submit a single marketing authorization application to the EMEA. All medicinal products for human and animal use derived from biotechnology and other high-technology processes must be approved via the centralized procedure. The same applies to all human medicines intended for the treatment of human immunodeficiency virus/acquired immunodeficiency syndrome (HIV/AIDS), cancer, diabetes, or neurodegenerative diseases and for all designated orphan medicines intended for the treatment of rare diseases. Similarly, all veterinary medicines intended for use as performance enhancers to promote the growth of treated animals or to increase yields from treated animals have to go through the centralized procedure. For medicinal products that do not fall under any of the above-mentioned categories, companies can submit an application for a centralized marketing authorization to the EMEA, provided that the medicinal product constitutes a significant therapeutic, scientific, or technical innovation or the product is in any other respect in the interest of patient or animal health. The safety of medicines is monitored constantly by the Agency through a pharmacovigilance network. The EMEA takes appropriate actions if adverse drug reaction reports suggest changes to the benefit–risk balance of a medicinal product. For veterinary medicinal products, the Agency has the responsibility to establish safe limits for medicinal residues in food of animal origin. The Agency also has a role to stimulate innovation and research in the pharmaceutical sector. The EMEA gives scientific advice and protocol assistance to companies for the development of new medicinal products. It publishes guidelines on quality, safety, and efficacy testing requirements. A dedicated office established in 2005 provides special assistance to small and medium-sized enterprises (SMEs). In 2001, the Committee for Orphan Medicinal Products (COMP) was established, charged with reviewing designation applications from persons or companies who intend to develop medicines for rare diseases (orphan drugs). The Committee on Herbal Medicinal Products (HMPC) was established in 2004 and provides scientific opinions on traditional herbal medicines. The Agency is also involved in referral procedures relating to medicinal products that are approved or under consideration by member states. The Agency brings together the scientific resources of over 40 national competent authorities in 30 EU and European Economic Area (EEA)-European Free Trade Association (EFTA) countries in a network of over 4000 European experts. It contributes to the EU’s international activities through its work with the European Pharmacopoeia, the World Health Organization, and the International Conference on Harmonization (ICH) and International Cooperation on Harmonization of Technical Requirements for Registration of Veterinary Medicinal Products (VICH) trilateral (the EU, Japan, and the U.S.) conferences on harmonization, among other international organizations and initiatives. The EMEA is headed by the executive director and in 2007 had a secretariat of about 440 staff members. The management board is the supervisory body of the EMEA and is responsible in particular for budgetary matters. The EMEA’s mission
statement is, in the context of continuing globalization, to protect and promote public and animal health by
• Developing efficient and transparent procedures to allow rapid access by users to safe and effective innovative medicines and to generic and nonprescription medicines through a single European marketing authorization.
• Controlling the safety of medicines for humans and animals, in particular through a pharmacovigilance network and the establishment of safe limits for residues in food-producing animals.
• Facilitating innovation and stimulating research, thereby contributing to the competitiveness of the EU-based pharmaceutical industry.
• Mobilizing and coordinating scientific resources from throughout the EU to provide high-quality evaluation of medicinal products; advise on research and development programs; perform inspections to ensure that fundamental Good Clinical Practice (GCP), Good Manufacturing Practice (GMP), and Good Laboratory Practice (GLP) provisions are consistently achieved; and provide useful and clear information to users and health-care professionals.
EUROPEAN ORGANIZATION FOR RESEARCH AND TREATMENT OF CANCER (EORTC)
MARTINE PICCART-GEBHARDT
FRANCOISE MEUNIER
EORTC, Brussels, Belgium

1 BACKGROUND
1.1 History of the EORTC
The eminent Belgian cancer expert Professor Henri Tagnon (1), along with other European pioneers, founded the European Organization for Research and Treatment of Cancer (EORTC) in 1962. This medical pioneer and former head of Medicine at the Institut Jules Bordet brought together leading specialists from major European cancer research institutes and hospitals to create the Groupe Européen de Chimiothérapie Anticancéreuse (GECA). The GECA was established as a coordinated effort to conduct large-scale clinical studies, a need already recognized and beyond the capacity of any individual hospital or institution. Futuristic for the time, GECA quickly developed into a pan-European multidisciplinary cancer research organization with many early research successes; in 1968, GECA was renamed the European Organization for Research and Treatment of Cancer. Today, the EORTC is a leading international nonprofit academic cancer research organization duly registered under Belgian Law. The EORTC formal governing body leads an extensive pan-European network of collaborating cancer researchers, scientists, clinicians, affiliated institutions, and academic centers.
1.2 Landmark Research
Over the past 45 years, the EORTC has continued to work in the spirit of its pioneering founders. Under the leadership of 14 consecutively elected presidents (Table 1), each of whom is a cancer research visionary in their own right, the EORTC has achieved multiple research milestones that have continued to improve the lives of cancer patients around the world.

Table 1. EORTC Presidents, 1962–2009
MATHÉ, Georges (Villejuif, France), 1962–1965
GARATTINI, Silvio (Milan, Italy), 1965–1968
VAN BEKKUM, Dirk (Rijswijk, The Netherlands), 1969–1975
TAGNON, Henri (Brussels, Belgium), 1975–1978
LATJA, Lazlo George (Manchester, United Kingdom), 1979–1981
SCHMIDT, Carl Gottfried (Essen, Germany), 1981–1984
VERONESI, Umberto (Milan, Italy), 1985–1988
DENIS, Louis (Antwerp, Belgium), 1988–1991
VAN DER SCHUEREN, Emmanuel (Leuven, Belgium), 1991–1994
MCVIE, Gordon (London, United Kingdom), 1994–1997
HORIOT, Jean-Claude (Dijon, France), 1997–2000
VAN OOSTEROM, Allan T. (Leuven, Belgium), 2000–2003
EGGERMONT, Alexander M.M. (Rotterdam, The Netherlands), 2003–2006
PICCART, Martine (Brussels, Belgium), 2006–2009

EORTC research has contributed to improvements in survival for testicular cancer (2,3), childhood leukemia and lymphoma (4–6), adult Hodgkin’s and non-Hodgkin’s lymphoma (7–9), as well as gastrointestinal stromal tumors (10,11). EORTC research has defined the standards of care for patients with glioblastoma (12,13), melanoma (14,15), breast (16–18), and colorectal (19,20) cancer. Organ preservation and minimal, nonmutilating surgery are now considered safe and equally effective for patients with breast (21,22) and larynx (23) cancer. Improved treatments for patients with severe and life-threatening cancer therapy-related infections have been established (24–26). EORTC studies that evaluate high-precision 3-dimensional conformal radiation therapy techniques (27) and innovative multidisciplinary and complex therapeutic strategies have also impacted current treatment approaches. EORTC research has also led to the development of various diagnostic and treatment guidelines (28,29), as well as disease (30–32), treatment (33), and quality of life (34) prognostic calculators to aid clinicians in the treatment decision-making process. These achievements are the result of large, multinational clinical trials carried out by the EORTC network. Today, several thousand patients are entered each year into EORTC studies, and an additional 30,000 patients continue to be followed on a yearly basis (Fig. 1).

Figure 1. EORTC clinical studies patient accrual from 2004 to 2006.

The EORTC clinical study database, which is housed at the EORTC headquarters, now contains outcome data for over 160,000 cancer patients. The EORTC currently has 56 ongoing clinical trials actively enrolling patients. In the past 5 years, a greater shift toward clinico-genomic cancer research has led to a significant increase in the number of newly initiated EORTC studies that incorporate a complex translational research component. One such study that has the potential to alter the future approach to the treatment
of breast cancer is the ‘‘microarray in node negative disease may avoid chemotherapy’’ or MINDACT study, which will evaluate 6000 women with early breast cancer (35). The primary objective of the MINDACT trial is to confirm that patients with a ‘‘low risk’’ molecular prognosis and ‘‘high risk’’ clinical prognosis can be safely spared chemotherapy without affecting distant metastases-free survival. The two other main objectives of the study address questions related to adjuvant treatment of breast cancer. MINDACT will compare
anthracycline-based chemotherapy regimens to a docetaxel-capecitabine regimen, which is possibly associated with increased efficacy and reduced long-term toxicities. MINDACT will also compare the efficacy and safety of 7 years of single-agent letrozole to those of the sequential strategy of 2 years of tamoxifen followed by 5 years of letrozole. The EORTC continues to grow and adapt innovative cancer research technologies, multidisciplinary strategies, and data management methodologies. Additional advances in the treatment of cancer patients will be
achieved primarily through translational research, efficient drug development, and the conduct of large, prospective, randomized, multicenter clinical trials that aim to establish state-of-the-art treatment strategies and to impact daily clinical practice. After 45 years, the EORTC is ever more dedicated to improving and saving the lives of cancer patients.
1.3 The EORTC Mission
The EORTC develops, conducts, coordinates, and stimulates high-quality translational and clinical research that will improve the standards of treatment for cancer patients in Europe. This result is achieved through the development of new drugs and other innovative approaches and through the testing of more effective therapeutic strategies, using currently approved drugs, surgery, and/or radiotherapy in clinical studies. The established EORTC research network facilitates and ensures the timely passage of experimental discoveries into state-of-the-art treatments by reducing the time for research breakthroughs to reach clinical use.
1.4 EORTC Funding
The EORTC conducts cancer research that is primarily academic. Core independent funding is provided by the EORTC Charitable Trust, which is supported by several national cancer leagues. This organization includes cancer leagues from Denmark, Germany, the United Kingdom, The Netherlands, France, Norway, Switzerland, Sweden, Belgium, Hong Kong, and Italy, as well as the ‘‘Fonds Cancer,’’ the U.S. National Cancer Institute (since 1972), private donations, and the Belgian National Lottery. The European Commission awards grants for specific research projects. The pharmaceutical industry provides mainly unrestricted educational grants and occasionally full sponsorship for EORTC-led studies that develop and evaluate new agents.
1.4.1 The EORTC Charitable Trust
In 1976, the EORTC Foundation (now called The EORTC Charitable Trust) was established by Royal Decree under the laws of the
Kingdom of Belgium as an international association. The specific aim of the Trust is to provide financial support for EORTC activities, the peer-reviewed organizational structure, and independent academic research projects. National Cancer Charities are the major contributors to the Trust; all major European National Cancer Charities that support the EORTC through the Charitable Trust are represented in the Charitable Trust General Assembly, along with the Hong Kong Cancer Fund and several distinguished lay members. The Honorary President of the Trust is H.R.H. Princess Astrid of Belgium, Sir Christopher Mallaby serves as the Chairman, and Sir Ronald Grierson, Past-Chairman, is the Honorary Vice-President.
1.4.2 The Academic Research Fund
The Academic Fund, operational since 2005, supports selected clinical trials for which funding is insufficient or lacking. These trials must be of academic excellence, include a translational research component, provide for additional research networking, and have the potential to establish a new standard of care and thus change clinical practice. The EORTC Board oversees the selection process and the disbursement of funds. Five landmark studies with the potential to redefine current standards of cancer care have been approved for funding. All of these studies aim to advance the standards of cancer care by defining a ‘‘tailored’’ approach to the treatment of specific types and stages of different cancers. The SUPREMO breast cancer study (selective use of postoperative radiotherapy after mastectomy) is a United Kingdom-led intergroup clinical trial that also involves the EORTC (EORTC Study 22051-10052). It will evaluate the benefits and risks of radiotherapy after mastectomy in 3700 patients in an attempt to help clinicians better predict which patients require radiotherapy and which might be spared this treatment (36). The EORTC, together with the groupe d’etudes de lymphomes de l’adulte (GELA), is conducting the H10 EORTC-GELA early Hodgkin’s Lymphoma study (EORTC Study 20051). This study will evaluate the use of fludeoxyglucose F 18 positron emission tomography scan-guided therapy in 1576 patients
to predict treatment response more accurately and therefore aid in selecting the most appropriate therapy for each patient (37). The Interfant-06 Study Group, which is an infant leukemia international collaborative group, is conducting a randomized clinical trial (EORTC 58051) to evaluate 445 infants under the age of 1 year with acute lymphoblastic or biphenotypic leukemia (38), whereas the EORTC 22043-30041 prostate cancer study will evaluate the use of postoperative hormone and radiation therapy compared with radiation therapy alone (39). The EORTC is collaborating with the Federation Nationale des Centres de Lutte contre le Cancer and the Intergroupe Francophone de Cancérologie Thoracique in conducting the Lung Adjuvant RadioTherapy or LungART study (EORTC 22055-08053), which will evaluate the role of postoperative conformal thoracic radiotherapy after complete resection in 700 patients with lymph node positive non-small-cell lung cancer (40).
2 ORGANIZATIONAL STRUCTURE
The EORTC General Assembly, Board, and Executive Committee provide leadership and strategic direction to the EORTC. The EORTC Network, which is composed of more than 2500 clinicians and scientists who are employed at over 200 institutions in 30 countries, carries out all EORTC laboratory and clinical research. Various committees provide oversight for the research activities of this complex network. The EORTC Headquarters ensures the overall coordination and daily management of all EORTC activities and related scientific, legal, and administrative issues (Fig. 2).

Figure 2. EORTC organizational structure.

2.1 The General Assembly
The General Assembly is the legislative body of the EORTC. It meets once a year to ratify policies, proposals, and other activities, and to delegate specific tasks to the EORTC Board, committees, or appointed persons. The General Assembly is composed of voting and non-voting EORTC members and includes
the current and past 3 EORTC Presidents as well as the chairmen of all EORTC groups, task forces, committees, and divisions. Representatives from the top 15 clinical trial-accruing institutions have full General Assembly membership. The General Assembly elects a new Board every 3 years.
2.2 The Board
The Board is the steering and executive body that advises the General Assembly on all new activities and formulates proposals to be ratified by the General Assembly. The Board meets at least twice a year and consists of 21 elected voting members and several ex officio members. The voting members select from among themselves the EORTC President, Vice-President, Treasurer, and Secretary General.
2.3 The Executive Committee
The Executive Committee, which was created in 1991, provides support to the EORTC President in the decision-making and strategy-planning process. It reports to the Board and consists of several voting members of
the Board plus the non-voting EORTC Director General. This committee meets approximately every 6 weeks and communicates by other means on a weekly basis.
2.4 The EORTC Research Network
All EORTC research activities fall under the divisions of laboratory research and/or clinical research; scientists and clinicians, who participate on a voluntary basis in EORTC research, do so within expertise- and disease-specific EORTC groups or task forces that belong to these two main divisions (Table 2).

Table 2. EORTC Groups and Task Forces
EORTC Groups:
EORTC Brain Tumour Group
EORTC Breast Cancer Group
EORTC Children’s Leukaemia Group
EORTC Gastro-intestinal Tract Cancer Group
EORTC Genito-Urinary Tract Cancer Group
EORTC Gynaecological Cancer Group
EORTC Head and Neck Cancer Group
EORTC Infectious Diseases Group
EORTC Leukemia Group
EORTC Lung Cancer Group
EORTC Lymphoma Group
EORTC Melanoma Group
EORTC PathoBiology Group
EORTC Pharmacology and Molecular Mechanisms Group
EORTC Quality of Life Group
EORTC Radiation Oncology Group
EORTC Soft Tissue and Bone Sarcoma Group
EORTC Task Forces:
EORTC Cancer in Elderly Task Force
EORTC Cutaneous Lymphoma Task Force

This intricate research network provides an integrated structure for developing, evaluating, and validating new cancer agents and multimodality therapeutic strategies through the full spectrum of early phase I studies to large-scale phase III confirmatory clinical trials. The Laboratory Research Division focuses on the preclinical evaluation of new anticancer agents, tumor markers, and receptors, and supports translational research, pathology, pharmacology, and functional-imaging projects conducted within the EORTC. It fosters and strengthens direct interaction between basic cancer research and the EORTC clinical activities. A broader objective
of this division is to promote translational research projects and closer cooperation with the EORTC clinical groups. The Clinical Research Division (CRD) is responsible for proposing, designing, and conducting clinical trials to evaluate new therapeutic agents or novel treatment strategies through organized cooperative groups and task forces. Each of the 17 EORTC cooperative groups and the two EORTC task forces is disease-site specific or focuses on a cancer-related discipline, such as radiotherapy or quality of life, and has an elected voting chairman on the General Assembly.
2.5 EORTC Committees
Several committees oversee the independence of the EORTC, as well as the relevance and scientific value of all research efforts, which thereby ensures the highest quality of scientific work possible. Each committee chairman serves a 3-year elected term. The New Drug Advisory Committee (NDAC) was created to facilitate the drug acquisition process for EORTC research projects. It supports the CRD groups and provides recommendations on strategy and drug-development prioritization within the EORTC Network. NDAC members participate in advisory boards for the pharmaceutical industry, stimulate interaction between drug developers and the EORTC disease-oriented groups, and advise the EORTC headquarters specialists on methodology issues that relate to innovative agents. This committee interacts with the Early Project Optimization Department (see below). The Translational Research Advisory Committee supports and provides expert advice on translational research projects conducted within the EORTC from both a scientific and a practical perspective. This practice ensures the relevance and scientific soundness of translational research projects and the subsequent results. It increases the scientific visibility of the EORTC and guarantees, as an advisory committee, a permanent EORTC forum between the Clinical and Laboratory Research Divisions. This function serves to foster translational research interest within the Clinical Research Division Groups and also supports the inclusion of
translational research project ‘‘sub-studies’’ within EORTC clinical trial protocols. The Protocol Review Committee (PRC) reviews and approves all new clinical trial protocols submitted by the EORTC groups with respect to their scientific value and relevance within the EORTC strategy and framework. The PRC also assists the cooperative groups with any aspect of trial design or implementation when required. The Scientific Audit Committee gives independent advice to the EORTC Board on the activities, overall strategies, and achievements of all EORTC groups. These comments include recommendations on a series of criteria such as conformity with EORTC structure and policies, interaction between divisions and between groups, scientific output, and publications, in addition to overall priorities and strategies for future research. The Quality Assurance Committee develops quality assurance guidelines and guarantees that adequate quality-assurance control mechanisms operate in all EORTC divisions and groups in an effort to improve the quality of EORTC research and data. These activities include the supervision of quality-assurance activities within the EORTC headquarters and the advice it provides to the Board in the event of misconduct or suspicion of fraud. The Independent Data Monitoring Committee (IDMC) reviews the status of EORTC clinical trials and provides recommendations to the CRD groups concerning the continuation, modification, and/or publication of the trial. The IDMC is involved in all phase III clinical trials in which formal interim analyses and early stopping rules are anticipated. Ad hoc expert advice is provided to the committee on a specific per protocol basis. The Membership Committee reviews applications from potential new EORTC members and forwards them to the relevant Group Chairman for consideration. New memberships are approved on a continuous basis with endorsement by the EORTC Board occurring once a year.
2.6 The EORTC Headquarters
The EORTC Headquarters, which is located in Brussels, Belgium, was established in 1974 to coordinate all EORTC scientific activities
and related legal and administrative issues. The Headquarters Director General implements all EORTC strategies and policies as defined by the Board and leads a staff of 160 professionals who represent 15 nationalities. The EORTC Headquarters team works within a matrix organization of cross-functional departments, offices, and units to provide full logistic and scientific support for all EORTC clinical trials and translational research projects (Fig. 3).

Figure 3. EORTC headquarters.

The Headquarters staff consists of medical doctors, statisticians, data managers, quality of life specialists, health-care professionals, computer specialists, research fellows, and administrative personnel. The Scientific Director oversees all medical and scientific activities related to prospective, newly proposed, and ongoing EORTC clinical trials. The Director of Methodology-Operations manages all departments and units involved in the operational aspects of clinical trial data acquisition, analysis, and quality assurance. This duty includes ensuring that all EORTC trials are appropriately designed
from a methodological and statistical perspective. The EORTC staff carries out specific research projects in the fields of statistics, quality of life, quality assurance, and information technology. Between the years 2000 and 2006, the EORTC investigators and Headquarters team published a total of 472 peer-reviewed medical and scientific papers. The Medical Department clinical research physicians coordinate all research activities of the EORTC groups and act in a liaison function between the groups and the various Headquarters staff, departments, and units. One such unit, the Pharmacovigilance Unit, provides expertise for the reporting, analysis and handling of serious adverse events that occur during EORTC clinical trials. These safety-related advisory and regulatory activities are conducted according to GCP guidelines and current national and European regulations. The Translational Research Unit promotes and facilitates research activities that transform the latest discoveries in cancer molecular biology and cellular pathology into
clinical applications that benefit cancer patients. The future platform for such projects will be facilitated by the EORTC centralized tumor bio-bank, which is a collection of tissue samples, pathology, and radiology images from patients enrolled in EORTC clinical trials. This unit works in close cooperation with the Early Project Optimization Department (EPOD), which provides assistance to all EORTC Groups in the optimization, strategic development, and proposal process of new clinical-translational research projects. The EPOD provides the clinical groups with support to mature their proposals and/or strategies and coordinate their interaction with industry. This process aims to accelerate and increase the success rate of new cancer agent development by carefully analyzing promising discoveries and preferentially accelerating these into clinical evaluation. Additional specialized departments within the EORTC headquarters address specific areas and issues related to the development and conduct of cancer clinical trials. The Statistics Department not only performs all standard data analyses for EORTC clinical trials but also actively works in codeveloping new methods of data analyses applied to translational research and many other projects on methodology, meta-analyses, and prognostic factors. The Regulatory Affairs Unit fulfils all legal requirements for the initiation and conduct
of EORTC clinical trials according to the regulations of each participating country. The Regulatory Affairs and Pharmacovigilance Units act in close collaboration for expediting reports of serious adverse events to the competent authorities according to legal requirements. The Quality of Life Department works to stimulate, enhance, and coordinate quality of life research in cancer clinical trials by cooperating with the various EORTC study groups and the Quality of Life Group. It also manages the administrative aspects associated with the use of EORTC Quality-of-Life questionnaires in both EORTC and non-EORTC studies. This department has participated in everything from the design to the final analysis and publication of quality-of-life outcome results derived from more than 120 EORTC trials. The Protocol Development Department coordinates the efforts of the Headquarters study team and the EORTC Group Study Coordinator in the development and finalization of EORTC clinical trial protocols. The Study Coordinator is responsible for protocol development and writing of the medical sections as well as the Protocol Review Committee approval process. The Headquarters study team statistician, clinical research physician, and data manager author all methodological and administrative sections of the protocol, perform final protocol review, and ensure that the process occurs within the shortest possible time.
Figure 4. EORTC headquarters core activities.
The Information Technology (IT) Department develops new technologies to facilitate and accelerate cancer clinical research. Specific in-house designed computer software now allows for 24-hour, 7-days-a-week on-line, web-based patient registration and randomization for EORTC studies. Clinical trial data management with automatic consistency checks and crosschecks plus overdue information tracking is also now possible. Additional software was developed for the transfer and storage of pathology and radiology images and the transfer of information to EORTC partners via database export. Emerging technologies are reviewed regularly and integrated into the development strategy of the EORTC Headquarters. The IT system design reduces to a minimum any type of human error in the management of clinical trial data and is based on client/server architecture. ‘‘Vista RDC’’ (41) is the latest EORTC IT application that allows investigators to transmit data electronically. The EORTC Fellowship Programme, which was established in 1991, is offered to medical doctors, biostatisticians, computer analysts, and other scientists. The program promotes training in the methodology of high-quality cancer clinical trials and fosters specific research projects on EORTC clinical trials, quality assurance, and quality of life. After completing their period of fellowship study and returning home, EORTC fellows serve as ambassadors by stimulating and maintaining new generations of European cancer researchers. All fellowship research is based on an international platform, which also helps expedite the diffusion of clinical trial study results. Scientific support is provided by the EORTC Headquarters staff, and supervision is available from members of the EORTC disease/specialty-oriented groups. The EORTC Headquarters has awarded 99 research fellowships in the past 16 years to fellows from countries within the European Union, Argentina, Australia, Brazil, and Canada, which totals 150 years of cancer research. The EORTC Communications Office disseminates up-to-date EORTC information on a regular basis to the international scientific community, European Cancer Leagues, patients, and the public. It reports on the
initiation of major EORTC clinical trials as well as the results of EORTC research and its relevance to cancer patients. The Education Office provides organizational support for EORTC-sponsored courses (e.g., clinical trial methodology, statistics, data management, and quality of life), international EORTC conferences, and internal EORTC Network meetings. The European Union Programme Office, which was created in 2006, supports, optimizes, and ensures the participation, visibility, and cooperation of the EORTC in European Union (EU) institution activities. This office aids EORTC researchers in the preparation of proposals for EU project funding and subsequently provides support for the management of funded projects. The EORTC currently receives research funding under the European Commission sixth and seventh Research Framework Programs for translational research-related projects in the areas of breast cancer, melanoma, and sarcoma. The EORTC actively participates in the European Clinical Research Infrastructures Network and biotherapy facilities Preparation Phase for the Infrastructure project and the Impact in Clinical Research of European Legislation project.
3 SCIENTIFIC RESEARCH STRATEGY
The EORTC has undertaken organizational and strategic initiatives over the years to remain a world leader in academic cancer research. This flexibility allows the EORTC to meet the challenges presented by the changing environments of cancer research and technology, government policy, and public awareness. Researchers now work in the molecular biology ‘‘omic’’ era (e.g., genomics, proteomics, and metabolomics), which will increasingly enable physicians to predict treatment risk and responsiveness and therefore ‘‘tailor’’ or personalize cancer therapy. The recent groundbreaking discoveries in molecular biology, coupled with the expanding costs of clinical trials and increasing pressure from other cancer research groups, have demanded additional refinement of the EORTC scientific strategy, project prioritization, and research funding. This refinement
will capitalize on the strengths of the organization, namely its vast range of expertise and cutting-edge research facility infrastructure. Despite these new challenges, the EORTC will continue to conduct large-scale clinical trials designed to improve the standards of cancer care, to increase the scientific understanding of cancer as a disease, and to address pivotal strategic therapeutic questions. The EORTC scientific and clinical research programs will be guided over the coming years by the establishment of new models for research collaboration in ‘‘niche’’ areas, the initiation of translational research-rich clinical trials for developing tailored targeted therapies, and the prioritization of high-quality clinical trials that address specific scientific questions.
3.1 New Models of Collaboration
The EORTC is seeking new ways to collaborate with national cancer research groups within Europe to address specific strategic research questions that involve multimodality cancer treatments (e.g., surgery, radiotherapy) and to ensure continued research of the less common types of cancer. These goals will be achieved through highly specialized, large phase III clinical trials conducted in ‘‘niche’’ areas of disease-specific and technical expertise. This approach involves the pooling of patient accrual at a pan-European level and allows for the study of rare cancers. It also permits the best use of nationally developed technical, scientific, and research expertise in nationally run clinical trials.
3.2 Translational Research-Rich Clinical Trials
To meet the challenges of clinical cancer research in today’s molecular era, the EORTC established the Network of Core Institutions (NOCI) to promote, support, and conduct high-quality translational research-rich trials across the EORTC Groups. This strategy optimizes the transfer of basic research molecular discoveries to the development of targeted tailored cancer therapies in terms of time and quality by capitalizing on the broad research expertise and infrastructure
of large institutions located in various European countries. Innovative translational research concepts are prioritized and tested in early-phase multitumor line trials. Additional development proceeds based on early trial outcome results and is prospectively streamlined for validation in disease-specific, large phase III confirmative EORTC trials. This method represents a new project-focused research strategy that is efficient in terms of time, resources, and finances, while simultaneously providing a platform for collaboration with partners in the pharmaceutical industry. NOCI trials involve innovative complex study designs, elaborate translational research, and allow for the large-scale collection of blood and tissue samples or biospecimens. A centralized, independent EORTC tumor biobank facility provides for the storage and management of a representative selection of these biospecimens and is linked to an electronic patient treatment outcome database. The NOCI initiative represents phenomenal potential and guarantees the future of independent academic clinico-genomic cancer research in Europe.
3.3 Prioritization of EORTC Clinical Trials
The EORTC Board classifies clinical trials into three broad categories (Table 3). This classification system is used to determine the place of each new trial proposal within the overall EORTC scientific strategy. It also allows for the prioritization of resource and/or funding allocation. This categorization facilitates the selection and prospective development of early-phase multitumor clinico-genomic studies under the NOCI umbrella, as well as the development of ‘‘niche’’-specific and multimodality clinical trials, which thereby increases the success of discovering and developing new standards of care. The EORTC will stimulate this scientific strategy by awarding 1 million Euros in 2008 for academic and clinico-genomic trial research proposals that have the highest promise of defining new standards of cancer care.
4 INTERNATIONAL PARTNERSHIPS
The EORTC collaborates with several international institutions, organizations,
Table 3. EORTC Clinical Trial Prioritization

Category 1
1A  Randomized phase III trials designed to answer a question that directly contributes to defining a new standard of care (e.g., trials with a strong multidisciplinary component prepared and run jointly by different EORTC Groups).
1B  Randomized phase III trials with a crucially important translational research component that may permit a fundamental advance in the understanding of a particular disease.

Category 2
2A  Intergroup randomized phase III trials not led by an EORTC Group but corresponding to the criteria listed under Category 1. Phase II randomized trials clearly designed as preparatory work for a following randomized phase III trial. Registration trials with a clinically relevant question and a translational research component.
2B  Phase Ib trials involving drugs with a novel mechanism of action and with commitment from the company for vertical drug development in the EORTC (including combination studies with radiotherapy and biological agents).

Category 3
3A  Randomized phase III trials that do not meet the requirements of the above-mentioned categories.
3B  Randomized phase II trials with no or a weak translational research component. Phase I and single-arm phase II trials with drugs having truly new mechanisms of action, but for which a plan for later development within the EORTC is lacking.
3C  Other phase I and single-arm phase II studies.
associations, regulatory bodies, government policymakers, and public interest groups in relation to basic scientific and clinical research, educational activities, government policy initiatives, patient advocacy, and public health issues.
4.1 Research Partners
The EORTC collaborates with the U.S. National Cancer Institute (NCI), the leading U.S. agency for cancer research and treatment, on several research and educational projects. The NCI Liaison Office is located next to the EORTC Headquarters in Brussels. Agreements signed between the EORTC and the U.S. NCI facilitate collaboration in the development of new cancer treatments. Today, drugs are developed to facilitate testing on either side of the Atlantic Ocean. As a result, common methods of compound acquisition, selection and screening, toxicology testing, and clinical evaluation are well established between the U.S. and Europe. The EORTC is actively involved in intergroup collaboration with both national and
international research groups. Transatlantic collaboration with U.S. and Canadian cooperative groups is ongoing.
4.2 Regulatory Agency Partners
In 1998, the EORTC standard operating procedures were filed with the U.S. Food and Drug Administration (FDA) and received an EORTC Drug Master File Number (No. DMF 13059). This places EORTC clinical trial data on par with data generated by the U.S. National Cancer Institute and its cooperative groups. EORTC research partners can therefore use the EORTC drug master file number as a reference when seeking U.S. approval for treatments tested in Europe. EORTC experts also serve as members of various European Medicines Evaluation Agency expert working groups to discuss and establish guidelines on the requirements for new anticancer agents submitted for market registration. The EORTC has both formal and informal contact with many national regulatory agencies for the submission of new
clinical studies and the exchange of research-related information.
4.3 Educational Partners
International collaboration also takes the form of jointly organized research symposia and conferences. The most recent basic cancer research discoveries, technologies, and innovations are shared at the annual ‘‘Molecular Targets and Cancer Therapeutics’’ conference, jointly organized by the U.S. NCI and the EORTC since 1986. In 2001, the American Association for Cancer Research joined the NCI–EORTC effort. Today, this conference is held annually, and the location alternates between the U.S. and Europe. A second annual event, ‘‘Molecular Markers in Cancer,’’ provides a forum for clinicians and scientists to discuss the clinical applications of the latest cancer research findings. As of 2007, it is jointly organized by the NCI, the EORTC, and the American Society of Clinical Oncology (ASCO).
4.4 European Journal of Cancer Collaboration
The European Journal of Cancer (EJC) was created in 1965 by Professor H. Tagnon, who served as editor-in-chief from 1963 until 1990. The EJC is the official journal of the EORTC, the European School of Oncology, the European Association for Cancer Research, the European Society of Mastology, and the European Cancer Organization. It is available in 18 countries within Europe and in 63 countries worldwide. The EJC publishes 18 journal issues annually, and several supplements, including the abstract books of major European cancer conferences, are produced.
5 FUTURE PERSPECTIVES
In recent years, numerous innovative agents have been discovered as a result of the tremendous development in the understanding of the molecular basis of cancer. More breakthroughs in the prevention, diagnosis, and treatment of cancer will be accomplished primarily through translational research projects, efficient drug development, and the conduct of translational research-guided, large prospective randomized multicenter
clinical trials. These trials will require a global cooperative approach coupled with an early and open exchange of research findings. Significant improvements in the survival of cancer patients will depend on the ability of researchers to discover new drugs and novel treatment strategies tailored to the individual patient and to specific characteristics of their disease. The EORTC has been a leader in cancer research for over 45 years and has undertaken major steps to create the infrastructure and organization needed for this continued challenge throughout the twenty-first century. REFERENCES 1. F. Meunier and A. T. van Osterom, 40 years of the EORTC: The evolution towards a unique network to develop new standards of cancer care. Eur. J. Cancer 2002; 38: 3–13. 2. R. De Wit, G. Stoter, S. B. Kaye, D. T. Steijffer, G. W. Jones, W. W. Ten Bokkel Huinink, L. A. Rea, L. Collette, and R. Sylvester, The importance of bleomycin in combination chemotherapy for good prognosis testicular non-seminoma: a randomized study of the EORTC Genitourinary Tract Cancer Cooperative Group. J. Clin. Oncol. 1997; 15: 1837–1843. 3. R. T. Oliver, M. D. Mason, G. M. Mead, H. von der Maase, G. J. Rustin, J. K. Joffe, R. de Wit, N. Aass, J. D. Graham, R. Coleman, S. J. Kirk, and S. P. Stenning, MRC TE19 collaborators and the EORTC 30982 collaborators. Radiotherapy versus single dose carboplatin in adjuvant treatment of stage I seminoma: a randomised trial. Lancet 2005; 366: 293–300. 4. F. Millot, S. Suciu, N. Philippe, Y. Benoit, F. Mazingue, A. Uyttebroeck, P. Lutz, F. Mechinaud, A. Robert, P. Boutard, et al., Children’s Leukemia Cooperative Group of the European Organization for Research and Treatment of Cancer, Value of High-Dose Cytarabine during interval therapy of a Berlin - Frankfurt - Munster based protocol in increased-risk Children with acute Lymphoblastic Lymphoma: Results of the European Organization for Research and Treatment of Cancer 58881 randomized phase III trial. J. Clin. Oncol. 2001; 19: 1935–1942. 5. N. Entz-Werle, S. Suciu, J. van der Werff ten Bosch, E. Vilmer, Y. Bertrand, Y. Benoit, G. Margueritte, E. Plouvier, P. Boutard,
E. Vandecruys, et al., EORTC Children Leukemia Group, Results of 58872 and 58921 trials in acute myeloblastic leukemia and relative value of chemotherapy vs allogeneic bone marrow transplantation in first complete remission: the EORTC Children Leukemia Group report. Leukemia 2005; 19: 2072–2981. 6. J. van der Werff Ten Bosch, S. Suciu, A. Thyss, Y. Bertrand, L. Norton, F. Mazingue, A. Uyttebroeck, P. Lutz, A. Robert, P. Boutard, et al., Value of intravenous 6-mercaptopurine during continuation treatment in childhood acute lymphoblastic leukemia and non-Hodgkin’s lymphoma: final results of a randomized phase III trial (58881) of the EORTC CLG. Leukemia 2005; 19: 721–7226. 7. C. Fermé, H. Eghbali, J. H. Meerwaldt, C. Rieux, J. Bosq, F. Berger, T. Girinsky, P. Brice, M. B. van’t Veer, J. A. Walewski, et al. for the EORTC–GELA H8 Trial, Chemotherapy plus involved-field radiation in early-stage Hodgkin’s disease. N. Engl. J. Med. 2007; 357: 1916–1927. 8. E. M. Noordijk, P. Carde, N. Dupouy, A. Hagenbeek, A. D. G. Krol, J. C. Kluin-Nelemans, U. Tirelli, M. Monconduit, J. Thomas, H. Eghbali, et al., Combined-modality therapy for clinical stage I or II Hodgkin’s lymphoma: long-term results of the European Organisation for Research and Treatment of Cancer H7 randomized controlled trials. J. Clin. Oncol. 2006; 24: 3128–3135. 9. M. H. van Oers, R. Klasa, R. E. Marcus, M. Wolf, E. Kimby, R. D. Gascoyne, A. Jack, M. Van’t Veer, A. Vranovsky, H. Holte, et al., Rituximab maintenance improves clinical outcome of relapsed/resistant follicular non-Hodgkin lymphoma in patients both with and without rituximab during induction: results of a prospective randomized phase 3 intergroup trial. Blood 2006; 108: 3295–3301. 10. J. Verweij, P. G. Casali, J. Zalcberg, A. Le Cesne, P. Reichardt, J. Y. Blay, R. Issels, A. van Oosterom, P. C. Hogendoorn, M. Van Glabbeke, et al., Progression-free survival in gastrointestinal stromal tumours with high-dose imatinib: randomised trial. Lancet 2004; 364: 1127–1134. 11. M. Van Glabbeke, J. Verweij, P. G. Casali, J. Simes, A. Le Cesne, P. Reichardt, R. Issels, I. R. Judson, A. T. van Oosterom, J. Y. Blay, Predicting toxicities for patients with advanced gastrointestinal stromal tumours treated with imatinib: a study of the European Organisation for Research and Treatment of Cancer, the Italian Sarcoma Group, and the
Australasian Gastro-Intestinal Trials Group (EORTC-ISG-AGITG). Eur. J. Cancer 2006; 42: 2277–2285. 12. R. Stupp, W. P. Mason, M. J. van den Bent, M. Weller, B. Fisher, M. J. Taphoorn, K. Belanger, A. A. Brandes, C. Marosi, U. Bogdahn, et al., European Organisation for Research and Treatment of Cancer Brain Tumor and Radiotherapy Groups, National Cancer Institute of Canada Clinical Trials Group. Radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma. N. Engl. J. Med. 2005; 352: 987–996. 13. M. E. Hegi, A. C. Diserens, T. Gorlia, M. F. Hamou, N. de Tribolet, M. Weller, J. M. Kros, J. A. Hainfellner, W. Mason, L. Mariani, et al., MGMT gene silencing and benefit from temozolomide in glioblastoma. N. Engl. J. Med. 2005; 352: 997–1003. 14. A. M. Eggermont, S. Suciu, R. MacKie, W. Ruka, A. Testori, W. Kruit, C. J. Punt, M. Delauney, F. Sales, G. Groenewegen, et al., EORTC Melanoma Group, Post-surgery adjuvant therapy with intermediate doses of interferon alfa 2b versus observation in patients with stage IIb/III melanoma (EORTC 18952): randomised controlled trial. Lancet 2005; 366: 1189–1196. 15. V. Winnepenninckx, V. Lazar, S. Michiels, P. Dessen, M. Stas, S. R. Alonso, M. F. Avril, P. L. Ortiz Romero, T. Robert, O. Balacescu, et al., Melanoma Group of the European Organization for Research and Treatment of Cancer. Gene expression profiling of primary cutaneous melanoma and clinical outcome. J. Natl. Cancer Inst. 2006; 98: 472–482. 16. L. Morales, P. Canney, J. Dyczka, E. Rutgers, R. Coleman, T. Cufer, M. WelnickaJaskiewicz, J. Nortier, J. Bogaerts, P. Therasse, R. Paridaens, Postoperative adjuvant chemotherapy followed by adjuvant tamoxifen versus nil for patients with operable breast cancer: A randomised phase III trial of the EUROPEAN ORGANISATION FOR RESEARCH AND TREATMENT OF CANCER BREAST GROUP. Eur. J. Cancer 2007; 43: 331–340. 17. M. J. Piccart-Gebhart, M. Procter, B. LeylandJones, A. Goldhirsch, M. Untch, I. Smith, L. Gianni, J. Baselga, R. Bell, C. Jackisch, D. Cameron, et al., Herceptin Adjuvant (HERA) Trial Study Team. Trastuzumab after adjuvant chemotherapy in HER2-positive breast cancer. N. Engl. J. Med. 2005; 353: 1659–1672. 18. H. Bartelink, J.C. Horiot, P. M. Poortmans,
H. Struikmans, W. Van den Bogaert, A. Fourquet, J. J. Jager, W. J. Hoogenraad, S. B. Oei, C. C. Wárlám-Rodenhuis, M. Pierart, and L. Collette, Impact of a higher radiation dose on local control and survival in breast-conserving therapy of early breast cancer: 10-year results of the randomized boost versus no boost EORTC 22881-10882 trial. J. Clin. Oncol. 2007; 25: 3259–3265.
19. C. H. Köhne, E. van Cutsem, J. Wils, C. Bokemeyer, M. El-Serafi, M. P. Lutz, M. Lorenz, P. Reichardt, H. Rückle-Lanz, N. Frickhofen, et al., European Organisation for Research and Treatment of Cancer Gastrointestinal Group. Phase III study of weekly high-dose infusional fluorouracil plus folinic acid with or without irinotecan in patients with metastatic colorectal cancer: European Organisation for Research and Treatment of Cancer Gastrointestinal Group Study 40986. J. Clin. Oncol. 2005; 23: 4856–4865. 20. B. Nordlinger, P. Rougier, J. P. Arnaud, M. Debois, J. Wils, J.C. Ollier, O. Grobost, P. Lasser, J. Wals, J. Lacourt, et al., Adjuvant regional chemotherapy and systemic chemotherapy versus systemic chemotherapy alone in patients with stage II-III colorectal cancer: a multicentre randomised controlled phase III trial. Lancet Oncol. 2005; 6: 459–468. 21. J. A. van Dongen, A. C. Voogd, I. S. Fentiman, C. Legrand, R. J. Sylvester, D. Tong, E. van der Schueren, P. A. Helle, K. van Zijl, and H. Bartelink, Long-term results of a randomized trial comparing breast-conserving therapy with mastectomy: European Organization for Research and Treatment of Cancer 10801 trial. J. Natl. Cancer Inst. 2000; 92: 1143–1150. 22. N. Bijker, P. Meijnen, J. L. Peterse, J. Bogaerts, I. Van Hoorebeeck, J. P. Julien, M. Gennaro, P. Rouanet, A. Avril, I. S. Fentiman, et al., Breast-conserving treatment with or without radiotherapy in ductal carcinoma-in-situ: ten-year results of European Organisation for Research and Treatment of Cancer randomized phase III trial 10853: a study by the EORTC Breast Cancer Cooperative Group and EORTC Radiotherapy Group. J. Clin. Oncol. 2006; 24: 3381–3387. 23. J. L. Lefebvre, D. Chevalier, B. Luboinski, A. Kirkpatrick, L. Collette, T. Sahmoud, Larynx preservation in pyriform sinus cancer: preliminary results of a European Organization for Research and Treatment of Cancer phase III trial. EORTC Head and Neck Cancer Cooperative Group. J. Natl. Cancer Inst.
1996; 88: 890–899. 24. R. Herbrecht, D. W. Denning, T. F. Patterson, J. E. Bennett, R. E. Greene, J. W. Oestmann, W. V. Kern, K. A. Marr, P. Ribaud, O. Lortholary, et al., Invasive Fungal Infections Group of the European Organisation for Research and Treatment of Cancer and the Global Aspergillus Study Group, Voriconazole versus amphotericin B for primary therapy of invasive aspergillosis. N. Engl. J. Med. 2002; 347: 408–415. 25. EORTC-International Antimicrobial Therapy Cooperative Group, Efficacy and toxicity of single daily doses of amikacin and ceftriaxone versus multiple daily doses of amikacin and ceftazidime for infection in patients with cancer and granulocytopenia. Ann. Intern. Med. 1993; 119: 584–593 26. C. Viscoli, A. Cometta, W. V. Kern, R. Bock, M. Paesmans, F. Crokaert, M. P. Glauser, T. Calandra, International Antimicrobial Therapy Group of the European Organization for Research and Treatment of Cancer, Piperacillin-tazobactam monotherapy in highrisk febrile and neutropenic cancer patients. Clin. Microbiol. Infect. 2006; 12: 212–216. 27. S. Senan, D. De Ruysscher, P. Giraud, R. Mirimanoff, and V. Budach, Radiotherapy Group of European Organization for Research and Treatment of Cancer, Literature-based recommendations for treatment planning and execution in high-dose radiotherapy for lung cancer. Radiother. Oncol. 2004; 71: 139–146. 28. A.P. van der Meijden, R. Sylvester, W. Oosterlinck, E. Solsona, A. Boehle, B. Lobel, and E. Rintala, for the EAU Working Party on Non Muscle Invasive Bladder Cancer, EAU guidelines on the diagnosis and treatment of urothelial carcinoma in situ. Eur. Urol. 2005; 48: 363–371. 29. M. S. Aapro, D. A. Cameron, R. Pettengell, J. Bohlius, J. Crawford, M. Ellis, N. Kearney, G. H. Lyman, V. C. Tjan-Heijnen, J. Walewski, et al., European Organisation for Research and Treatment of Cancer (EORTC) Granulocyte Colony-Stimulating Factor (G-CSF) Guidelines Working Party, EORTC guidelines for the use of granulocyte-colony stimulating factor to reduce the incidence of chemotherapy-induced febrile neutropenia in adult patients with lymphomas and solid tumours. Eur. J. Cancer 2006; 42: 2433–2453. 30. R. J. Sylvester, A. P. van der Meijden, W. Oosterlinck, J. A. Witjes, C. Bouffioux, L. Denis, D. W. Newling, and K. Kurth, Predicting recurrence and progression in individual patients with stage Ta T1 bladder cancer
using EORTC risk tables: a combined analysis of 2596 patients from seven EORTC trials. Eur. Urol. 2006; 49: 466–465; discussion 475–477. 31. M. Buyse, S. Loi, L. van’t Veer, G. Viale, M. Delorenzi, A. M. Glas, M. S. d’Assignies, J. Bergh, R. Lidereau, P. Ellis, et al., TRANSBIG Consortium, Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J. Natl. Cancer Inst. 2006; 98: 1183–1192. 32. T. Gorlia, M. J. van den Bent, M. E. Hegi, R. O. Mirimanoff, M. Weller, J. G. Cairncross, E. Eisenhauer, K. Belanger, A. A. Brandes, A. Allgeier, et al., Nomograms for predicting survival of patients with newly diagnosed glioblastoma: prognostic factor analysis of EORTC and NCIC trial 26981-22981/CE.3. Lancet Oncol. 2008; 9: 29–38. 33. M. Van Glabbeke, J. Verweij, P. G. Casali, A. Le Cesne, P. Hohenberger, I. Ray-Coquard, M. Schlemmer, A. T. van Oosterom, D. Goldstein, R. Sciot, et al., Initial and late resistance to imatinib in advanced gastrointestinal stromal tumors are predicted by different prognostic factors: a European Organisation for Research and Treatment of Cancer-Italian
Sarcoma Group-Australasian Gastrointestinal Trials Group study. J. Clin. Oncol. 2005; 23: 5795–5804. 34. M. E. Mauer, M. J. Taphoorn, A. Bottomley, C. Coens, F. Efficace, M. Sanson, A. A. Brandes, C. C. van der Rijt, H. J. Bernsen, M. Frénay, et al., EORTC Brain Cancer Group, Prognostic value of health-related quality-of-life data in predicting survival in patients with anaplastic oligodendrogliomas, from a phase III EORTC brain cancer group study. J. Clin. Oncol. 2007; 25: 5731–5737. 35. MINDACT STUDY. Available: http://www.eortc.be/services/unit/mindact/MINDACT websiteii.asp. 36. SUPREMO STUDY. Available: http://www.supremo-trial.com. 37. H10 EORTC-GELA STUDY. Available: http://www.cancer.gov/clinicaltrials/EORTC20051. 38. INTERFANT 06 STUDY. Available: http://www.trialregister.nl/trialreg/admin/rctview.asp?TC=695. 39. EORTC 22043-30041 PROSTATE STUDY. Available: http://groups.eortc.be/radio/future%20trials.htm. 40. LungART STUDY. Available: http://www.eortc.be/protoc/details.asp?protocol=22055. 41. VISTA RDC USER GUIDE. Available: http://rdc.eortc.be/rdc/doc/VISTA-RDC UserGuide.pdf.
CROSS-REFERENCES Trials Oncology Drug development Human Genomics Quality of Life
Factor Analysis: Confirmatory
Of primary import to factor analysis, in general, is the notion that some variables of theoretical interest cannot be observed directly; these unobserved variables are termed latent variables or factors. Although latent variables cannot be measured directly, information related to them can be obtained indirectly by noting their effects on observed variables that are believed to represent them. The oldest and best-known statistical procedure for investigating relations between sets of observed and latent variables is that of factor analysis. In using this approach to data analyses, researchers examine the covariation among a set of observed variables in order to gather information on the latent constructs (i.e., factors) that underlie them. In factor analysis models, each observed variable is hypothesized to be determined by two types of influences: (a) the latent variables (factors) and (b) unique variables (called either residual or error variables). The strength of the relation between a factor and an observed variable is usually termed the loading of the variable on the factor.
Exploratory versus Confirmatory Factor Analysis There are two basic types of factor analyses: exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). EFA is most appropriately used when the links between the observed variables and their underlying factors are unknown or uncertain. It is considered to be exploratory in the sense that the researcher has no prior knowledge that the observed variables do, indeed, measure the intended factors. Essentially, the researcher uses EFA to determine factor structure. In contrast, CFA is appropriately used when the researcher has some knowledge of the underlying latent variable structure. On the basis of theory and/or empirical research, he or she postulates relations between the observed measures and the underlying factors a priori, and then tests this hypothesized structure statistically. More specifically, the CFA approach examines the extent to which a highly constrained a priori factor structure is consistent with the sample data. In summarizing the primary distinction between the two methodologies, we can say
that whereas EFA operates inductively in allowing the observed data to determine the underlying factor structure a posteriori, CFA operates deductively in postulating the factor structure a priori [5]. Of the two factor analytic approaches, CFA is by far the more rigorous procedure. Indeed, it enables the researcher to overcome many limitations associated with the EFA model; these are as follows: First, whereas the EFA model assumes that all common factors are either correlated or uncorrelated, the CFA model makes no such assumptions. Rather, the researcher specifies, a priori, only those factor correlations that are considered to be substantively meaningful. Second, with the EFA model, all observed variables are directly influenced by all common factors. With CFA, each factor influences only those observed variables with which it is purported to be linked. Third, whereas in EFA, the unique factors are assumed to be uncorrelated, in CFA, specified covariation among particular uniquenesses can be tapped. Finally, provided with a malfitting model in EFA, there is no mechanism for identifying which areas of the model are contributing most to the misfit. In CFA, on the other hand, the researcher is guided to a more appropriately specified model via indices of misfit provided by the statistical program.
Hypothesizing a CFA Model Given the a priori knowledge of a factor structure and the testing of this factor structure based on the analysis of covariance structures, CFA belongs to a class of methodology known as structural equation modeling (SEM). The term structural equation modeling conveys two important notions: (a) that structural relations can be modeled pictorially to enable a clearer conceptualization of the theory under study, and (b) that the causal processes under study are represented by a series of structural (i.e., regression) equations. To assist the reader in conceptualizing a CFA model, I now describe the specification of a hypothesized CFA model in two ways; first, as a graphical representation of the hypothesized structure and, second, in terms of its structural equations.
Graphical Specification of the Model CFA models are schematically portrayed as path diagrams (see Path Analysis and Path Diagrams) through the incorporation of four geometric symbols: a circle (or ellipse) representing unobserved
Figure 1. Hypothesized CFA model: four correlated factors (Physical SC (Appearance) F1; Physical SC (Ability) F2; Social SC (Peers) F3; Social SC (Parents) F4), each measured by eight SDQ-I items (SDQ1–SDQ66), with one error term (E1–E66) per item.
latent factors, a square (or rectangle) representing observed variables, a single-headed arrow (→) representing the impact of one variable on another, and a double-headed arrow (↔) representing covariance between pairs of variables. In building a CFA model, researchers use these symbols within the framework of three basic configurations, each of which represents an important component in the analytic process. We turn now to the CFA model presented in Figure 1, which represents the postulated four-factor structure of nonacademic self-concept (SC) as tapped by items comprising the Self Description Questionnaire-I (SDQ-I; [15]). As defined by the SDQ-I, nonacademic SC embraces the constructs of physical and social SCs. On the basis of the geometric configurations noted above, decomposition of this CFA model conveys the following information: (a) there are four factors, as indicated by the four ellipses labeled Physical SC (Appearance; F1), Physical SC (Ability; F2), Social SC (Peers; F3), and Social SC (Parents; F4); (b) the four factors are intercorrelated, as indicated by the six two-headed arrows; (c) there are 32 observed variables, as indicated by the 32 rectangles (SDQ1–SDQ66); each represents one item from the SDQ-I; (d) the observed variables measure the factors in the following pattern: Items 1, 8, 15, 22, 38, 46, 54, and 62 measure Factor 1, Items 3, 10, 24, 32, 40, 48, 56, and 64 measure Factor 2, Items 7, 14, 28, 36, 44, 52, 60, and 69 measure Factor 3, and Items 5, 19, 26, 34, 42, 50, 58, and 66 measure Factor 4; (e) each observed variable measures one and only one factor; and (f) errors of measurement associated with each observed variable (E1–E66) are uncorrelated (i.e., there are no double-headed arrows connecting any two error terms). Although the error variables, technically speaking, are unobserved variables, and should have ellipses around them, common convention in such diagrams omits them in the interest of clarity. In summary, a more formal description of the CFA model in Figure 1 argues that: (a) responses to the SDQ-I are explained by four factors; (b) each item has a nonzero loading on the nonacademic SC factor it was designed to measure (termed target loadings), and zero loadings on all other factors (termed nontarget loadings); (c) the four factors are correlated; and (d) measurement error terms are uncorrelated.
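To make the covariance structure implied by such a loading pattern concrete, the following minimal sketch builds the model-implied covariance matrix, Σ = ΛΦΛ′ + Θ, for a deliberately small, hypothetical two-factor, six-item model; the loadings, factor correlation, and error variances are illustrative values, not SDQ-I estimates.

```python
import numpy as np

# Hypothetical CFA: two factors, three indicators each (illustrative values only)
Lambda = np.array([
    [0.8, 0.0],   # item 1 has a target loading on factor 1 and a zero (nontarget) loading on factor 2
    [0.7, 0.0],
    [0.6, 0.0],
    [0.0, 0.9],   # items 4-6 load only on factor 2
    [0.0, 0.8],
    [0.0, 0.7],
])
Phi = np.array([[1.0, 0.4],
                [0.4, 1.0]])                            # factor (co)variance matrix
Theta = np.diag([0.36, 0.51, 0.64, 0.19, 0.36, 0.51])   # uncorrelated measurement error variances

Sigma = Lambda @ Phi @ Lambda.T + Theta                 # model-implied covariance of the observed items
print(np.round(Sigma, 3))
```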
Structural Equation Specification of the Model From a review of Figure 1, you will note that each observed variable is linked to its related factor by a single-headed arrow pointing from the factor to the observed variable. These arrows represent regression paths and, as such, imply the influence of each factor in predicting its set of observed variables. Take, for example, the arrow pointing from Physical SC (Ability) to SDQ1. This symbol conveys the notion that responses to Item 1 of the SDQ-I assessment measure are ‘caused’ by the underlying construct of physical SC, as it reflects one’s perception of his or her physical ability. In CFA, these symbolized regression paths represent factor loadings and, as with all factor analyses, their strength is of primary interest. Thus, specification of a hypothesized model focuses on the formulation of equations that represent these structural regression paths. Of secondary importance are any covariances between the factors and/or between the measurement errors. The building of these equations, in SEM, embraces two important notions: (a) that any variable in the model having an arrow pointing at it represents a dependent variable, and (b) dependent variables are always explained (i.e., accounted for) by other variables in the model. One relatively simple approach to formulating these structural equations, then, is first to note each dependent variable in the model and then to summarize all influences on these variables. Turning again to Figure 1, we see that there are 32 variables with arrows pointing toward them; all represent observed variables (SDQ1–SDQ66). Accordingly, these regression paths can be summarized in terms of 32 separate equations as follows:

SDQ1 = F1 + E1
SDQ8 = F1 + E8
SDQ15 = F1 + E15
...
SDQ62 = F1 + E62
SDQ3 = F2 + E3
SDQ10 = F2 + E10
...
SDQ64 = F2 + E64
SDQ7 = F3 + E7
SDQ14 = F3 + E14
...
SDQ69 = F3 + E69
SDQ5 = F4 + E5
SDQ19 = F4 + E19
...
SDQ66 = F4 + E66    (1)
Although, in principle, there is a one-to-one correspondence between the schematic presentation of a model and its translation into a set of structural equations, it is important to note that neither one of these representations tells the whole story. Some parameters, critical to the estimation of the model, are not explicitly shown and thus may not be obvious to the novice CFA analyst. For example, in both the schematic model (see Figure 1) and the linear structural equations cited above, there is no indication that either the factor variances or the error variances are parameters in the model. However, such parameters are essential to all structural equation models and therefore must be included in the model specification. Typically, this specification is made via a separate program command statement, although some programs may incorporate default values. Likewise, it is equally important to draw your attention to the specified nonexistence of certain parameters in a model. For example, in Figure 1, we detect no curved arrow between E1 and E8, which would suggest the lack of covariance between the error terms associated with the observed variables SDQ1 and SDQ8. (Error covariances can reflect overlapping item content and, as such, represent the same question being asked, but with a slightly different wording.)
Testing a Hypothesized CFA Model Testing for the validity of a hypothesized CFA model requires the satisfaction of certain statistical assumptions and entails a series of analytic steps. Although a detailed review of this testing process is beyond the scope of the present chapter, a brief outline is now presented in an attempt to provide readers with at least a flavor of the steps involved. (For a nonmathematical and paradigmatic introduction to SEM based on three different programmatic approaches to the specification and testing of a variety of basic CFA models, readers are referred to [6–9]; for a more detailed and analytic approach to SEM, readers are referred to [3], [14], [16] and [17].)
Statistical Assumptions As with other multivariate methodologies, SEM assumes that certain statistical conditions have been met. Of primary importance is the assumption that the data are multivariate normal (see Catalogue of Probability Density Functions). In essence, the concept of multivariate normality embraces three requirements: (a) that the univariate distributions are normal; (b) that the joint distributions of all variable combinations are normal; and (c) that all bivariate scatterplots are linear and homoscedastic [14]. Violations of multivariate normality can lead to the distortion of goodness-of-fit indices related to the model as a whole (see, e.g., [12], [10]; see also Goodness of Fit) and to positively biased tests of significance related to the individual parameter estimates [14].
Estimating the Model Once the researcher determines that the statistical assumptions have been met, the hypothesized model can then be tested statistically in a simultaneous analysis of the entire system of variables. As such, some parameters are freely estimated while others remain fixed to zero or some other nonzero value. (Nonzero values such as the 1’s specified in Figure 1 are typically assigned to certain parameters for purposes of model identification and latent factor scaling.) For example, as shown in Figure 1, and in the structural equation above, the factor loading of SDQ8 on Factor 1 is freely estimated, as indicated by the single-headed arrow leading from Factor 1 to SDQ8. By contrast, the factor loading of SDQ10 on Factor 1 is not estimated (i.e., there is no single-headed arrow leading from Factor 1 to SDQ10); this factor loading is automatically fixed to zero by the program. Although there are four main methods for estimating parameters in CFA models, maximum likelihood estimation remains the one most commonly used and is the default method for all SEM programs.
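As a rough illustration of what maximum likelihood estimation minimizes, the sketch below codes the standard ML discrepancy function, F_ML = ln|Σ(θ)| + tr(SΣ(θ)⁻¹) − ln|S| − p, for a hypothetical one-factor, three-indicator model. The sample covariance matrix is invented for illustration; a real analysis would rely on a dedicated SEM program such as those discussed in the text.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical sample covariance matrix S for three indicators of a single factor
S = np.array([[1.00, 0.56, 0.48],
              [0.56, 1.00, 0.42],
              [0.48, 0.42, 1.00]])
p = S.shape[0]
_, logdet_S = np.linalg.slogdet(S)

def implied_cov(params):
    lam = params[:3].reshape(-1, 1)        # three factor loadings
    theta = np.diag(params[3:] ** 2)       # error variances, squared to stay non-negative
    return lam @ lam.T + theta             # factor variance fixed at 1 for identification

def f_ml(params):
    """Maximum likelihood discrepancy: ln|Sigma| + tr(S Sigma^-1) - ln|S| - p."""
    sigma = implied_cov(params)
    sign, logdet = np.linalg.slogdet(sigma)
    if sign <= 0:
        return np.inf                      # keep the search within admissible solutions
    return logdet + np.trace(S @ np.linalg.inv(sigma)) - logdet_S - p

start = np.array([0.5, 0.5, 0.5, 0.7, 0.7, 0.7])
fit = minimize(f_ml, start, method="Nelder-Mead")
print("estimated loadings:", np.round(fit.x[:3], 2))
# (N - 1) * f_ml(fit.x) gives the model chi-square statistic for sample size N.
```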
Evaluating Model Fit Once the CFA model has been estimated, the next task is to determine the extent to which its specifications are consistent with the data. This evaluative process focuses on two aspects: (a) goodness-of-fit of the model as a whole, and (b) goodness-of-fit of individual parameter estimates. Global assessment of fit
is determined through the examination of various fit indices and other important criteria. In the event that goodness-of-fit is adequate, the model argues for the plausibility of postulated relations among variables; if it is inadequate, the tenability of such relations is rejected. Although there is now a wide array of fit indices from which to choose, typically only one or two need be reported, along with other fit-related indicators. A typical combination of these evaluative criteria might include the Comparative Fit Index (CFI; Bentler, [1]), the standardized root mean square residual (SRMR), and the Root Mean Square Error of Approximation (RMSEA; [18]), along with its 90% confidence interval. Indicators of a well-fitting model would be evidenced from a CFI value equal to or greater than .93 [11], an SRMR value of less than .08 [11], and an RMSEA value of less than .05 [4]. Goodness-of-fit related to individual parameters of the model focuses on both the appropriateness (i.e., no negative variances, no correlations >1.00) and statistical significance (i.e., estimate divided by standard error >1.96) of their estimates. For parameters to remain specified in a model, their estimates must be statistically significant.
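Point estimates of two of these indices can be computed directly from the model and baseline (independence) chi-square statistics; the sketch below uses hypothetical chi-square values purely to illustrate the cutoffs quoted above.

```python
import math

def rmsea(chi2, df, n):
    """Point estimate of the Root Mean Square Error of Approximation."""
    return math.sqrt(max((chi2 / df - 1.0) / (n - 1.0), 0.0))

def cfi(chi2_model, df_model, chi2_baseline, df_baseline):
    """Comparative Fit Index relative to the independence (baseline) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_baseline - df_baseline, 0.0, d_model)
    return 1.0 - d_model / d_base if d_base > 0 else 1.0

# Hypothetical output for a CFA with 458 model df and 496 baseline df, N = 500
print(round(rmsea(chi2=480.0, df=458, n=500), 3))    # about 0.010, below the .05 cutoff
print(round(cfi(480.0, 458, 9000.0, 496), 3))        # about 0.997, above the .93 cutoff
```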
Post Hoc Model-fitting Presented with evidence of a poorly fitting model, the hypothesized CFA model would be rejected. Analyses then proceed in an exploratory fashion as the researcher seeks to determine which parameters in the model are misspecified. Such information is gleaned from program output that focuses on modification indices (MIs), estimates that derive from testing for the meaningfulness of all constrained (or fixed) parameters in the model. For example, the constraint that the loading of SDQ10 on Factor 1 is zero, as per Figure 1 would be tested. If the MI related to this fixed parameter is large, compared to all other MIs, then this finding would argue for its specification as a freely estimated parameter. In this case, the new parameter would represent a loading of SDQ10 on both Factor 1 and Factor 2. Of critical importance in post hoc model-fitting, however, is the requirement that only substantively meaningful parameters be added to the original model specification.
Interpreting Estimates Shown in Figure 2 are standardized parameter estimates resulting from the testing of the hypothesized CFA model portrayed in Figure 1. Standardization transforms the solution so that all variables have a variance of 1; factor loadings will still be related in the same proportions as in the original solution, but parameters that were originally fixed will no longer have the same values. In a standardized solution, factor loadings should generally be less than 1.0 [14]. Turning first to the factor loadings and their associated errors of measurement, we see that, for example, the regression of Item SDQ15 on Factor 1 (Physical SC; Appearance) is .82. Because SDQ15 loads only on Factor 1, we can interpret this estimate as indicating that Factor 1 accounts for approximately 67% (100 × .82²) of the variance in this item. The measurement error coefficient associated with SDQ15 is .58, thereby indicating that some 34% (as a result of decimal rounding) of the variance associated with this item remains unexplained by Factor 1. (It is important to note that, unlike the LISREL program [13], which does not standardize errors in variables, the EQS program [2] used here does provide these standardized estimated values; see Structural Equation Modeling: Software.) Finally, values associated with the double-headed arrows represent latent factor correlations. Thus, for example, the value of .41 represents the correlation between Factor 1 (Physical SC; Appearance) and Factor 2 (Physical SC; Ability). These factor correlations should be consistent with the theory within which the CFA model is grounded. In conclusion, it is important to emphasize that only issues related to the specification of first-order CFA models, and only a cursory overview of the steps involved in testing these models has been included here. Indeed, sound application of SEM procedures in testing CFA models requires that researchers have a comprehensive understanding of the analytic process. Of particular importance are issues related to the assessment of multivariate normality, appropriateness of sample size, use of incomplete data, correction for nonnormality, model specification, identification, and estimation, evaluation of model fit, and post hoc model-fitting. Some of these topics are covered in other entries, as well as the books and journal articles cited herein.
Figure 2. Standardized estimates for hypothesized CFA model (factor loadings and error terms for SDQ1–SDQ66, and correlations among the four factors F1–F4).
References
[1] Bentler, P.M. (1990). Comparative fit indexes in structural models, Psychological Bulletin 107, 238–246.
[2] Bentler, P.M. (2004). EQS 6.1: Structural Equations Program Manual, Multivariate Software Inc, Encino.
[3] Bollen, K. (1989). Structural Equations with Latent Variables, Wiley, New York.
[4] Browne, M.W. & Cudeck, R. (1993). Alternative ways of assessing model fit, in Testing Structural Equation Models, K.A. Bollen & J.S. Long, eds, Sage, Newbury Park, pp. 136–162.
[5] Bryant, F.B. & Yarnold, P.R. (1995). Principal-components analysis and exploratory and confirmatory factor analysis, in Reading and Understanding Multivariate Statistics, L.G. Grimm & P.R. Yarnold, eds, American Psychological Association, Washington.
[6] Byrne, B.M. (1994). Structural Equation Modeling with EQS and EQS/Windows: Basic Concepts, Applications, and Programming, Sage, Thousand Oaks.
[7] Byrne, B.M. (1998). Structural Equation Modeling with LISREL, PRELIS, and SIMPLIS: Basic Concepts, Applications, and Programming, Erlbaum, Mahwah.
[8] Byrne, B.M. (2001a). Structural Equation Modeling with AMOS: Basic Concepts, Applications, and Programming, Erlbaum, Mahwah.
[9] Byrne, B.M. (2001b). Structural equation modeling with AMOS, EQS, and LISREL: comparative approaches to testing for the factorial validity of a measuring instrument, International Journal of Testing 1, 55–86.
[10] Curran, P.J., West, S.G. & Finch, J.F. (1996). The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis, Psychological Methods 1, 16–29.
[11] Hu, L.-T. & Bentler, P.M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives, Structural Equation Modeling 6, 1–55.
[12] Hu, L.-T., Bentler, P.M. & Kano, Y. (1992). Can test statistics in covariance structure analysis be trusted? Psychological Bulletin 112, 351–362.
[13] Jöreskog, K.G. & Sörbom, D. (1996). LISREL 8: User’s Reference Guide, Scientific Software International, Chicago.
[14] Kline, R.B. (1998). Principles and Practice of Structural Equation Modeling, Guilford Press, New York.
[15] Marsh, H.W. (1992). Self Description Questionnaire (SDQ) I: A Theoretical and Empirical Basis for the Measurement of Multiple Dimensions of Preadolescent Self-concept: A Test Manual and Research Monograph, Faculty of Education, University of Western Sydney, Macarthur, New South Wales.
[16] Maruyama, G.M. (1998). Basics of Structural Equation Modeling, Sage, Thousand Oaks.
[17] Raykov, T. & Marcoulides, G.A. (2000). A First Course in Structural Equation Modeling, Erlbaum, Mahwah.
[18] Steiger, J.H. (1990). Structural model evaluation and modification: an interval estimation approach, Multivariate Behavioral Research 25, 173–180.
(See also History of Path Analysis; Linear Statistical Models for Causation: A Critical Review; Residuals in Structural Equation, Factor Analysis, and Path Analysis Models; Structural Equation Modeling: Checking Substantive Plausibility) BARBARA M. BYRNE
FACTORIAL DESIGNS IN CLINICAL TRIALS
Steven Piantadosi
Johns Hopkins University, Baltimore, MD, USA
Factorial experiments test the effect of more than one treatment (factor) using a design that permits an assessment of interactions between the treatments. A treatment could be either a single therapy or a combination of interventions. The essential feature of factorial designs is that treatments are varied systematically (i.e. some groups receive more than one treatment), and the experimental groups are arranged in a way that permits testing whether or not the treatments interact with one another. The technique of varying more than one treatment in a single study has been used widely in agriculture and industry based on work by Fisher (10, 11) and Yates (33). Influential discussions of factorial experiments were given by Cox (8) and Snedecor & Cochran (28). Factorial designs have been used relatively infrequently in medical trials, except recently in disease prevention studies. The discussion here will be restricted to randomized factorial clinical trials. Factorial designs offer certain advantages over conventional comparative designs, even those employing more than two treatment groups. The factorial structure permits certain comparisons to be made that cannot be achieved by any other design. In some circumstances, two treatments can be tested in a factorial trial using the same number of subjects ordinarily used to test one treatment. However, the limitations of factorial designs must be understood before deciding whether or not they are appropriate for a particular therapeutic question. Additional discussions of factorial designs in clinical trials can be found in Byar & Piantadosi (6) and Byar et al. (7). For a discussion of such designs related to cardiology trials, particularly in the context of the ISIS-4 trial (17), see Lubsen & Pocock (20). This article is based on a recent chapter discussing factorial designs in medical studies given by Piantadosi (24).
1 BASIC FEATURES OF FACTORIAL DESIGNS
The simplest factorial design has two treatments (A and B) and four treatment groups (Table 1). There might be n patients entered into each of the four treatment groups for a total sample size of 4n and a balanced design. One group receives neither A nor B, a second receives both A and B, and the other two groups receive one of A or B. This is called a 2 × 2 (two by two) factorial design. The design generates enough information to test the effects of A alone, B alone, and A plus B. The 2 × 2 design generalizes to higher order factorials. For example, a design studying three treatments, A, B, and C, is the 2 × 2 × 2. Possible treatment groups for this design are shown in Table 2. The total sample size is 8n if all treatment groups have n subjects. These examples highlight some of the prerequisites necessary for, and restrictions on, using a factorial trial. First, the treatments must be amenable to being administered in combination without changing dosage in the presence of each other. For example, in Table 1, we would not want to reduce the dose of A in the lower right cell where B is present. This requirement implies that the side effects of the treatments cannot be cumulative to the point where the combination is impossible to administer. Secondly, it must be ethically acceptable to withhold the individual treatments, or administer them at lower doses as the case may be. In some situations, this means having a no-treatment or placebo group in the trial. In other cases A and B may be administered in addition to a ‘‘standard’’ so that all groups receive some treatment. Thirdly, we must be genuinely interested in learning about treatment combinations; otherwise, some of the treatment groups might be unnecessary. Alternately, to use the design to achieve greater efficiency in studying two or more treatments, we must know that some interactions do not exist. Fourthly, the therapeutic questions must be chosen appropriately. We would not use a factorial design to test treatments that have exactly the same mechanisms of action (e.g. two ACE inhibitors for high blood pressure) because either would answer the question. Treatments acting through different mechanisms would be more appropriate for a factorial design (e.g. radiotherapy and chemotherapy for tumors). In some prevention factorial trials, the treatments tested also target different diseases.
Table 1. Treatment Groups and Sample Sizes in a 2 × 2 Balanced Factorial Design

           A: No    A: Yes    Total
B: No        n         n        2n
B: Yes       n         n        2n
Total       2n        2n        4n
Table 2. Treatment Groups in a Balanced 2 × 2 × 2 Factorial Design

Group    A     B     C     Sample size
1        No    No    No    n
2        Yes   No    No    n
3        No    Yes   No    n
4        No    No    Yes   n
5        Yes   Yes   No    n
6        No    Yes   Yes   n
7        Yes   No    Yes   n
8        Yes   Yes   Yes   n
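The eight groups in Table 2 are simply all combinations of the three two-level treatments; a brief illustrative sketch of this enumeration is shown below (the ordering produced here differs from that used in the table).

```python
from itertools import product

# All combinations of three two-level treatments give the 2 x 2 x 2 = 8 groups of Table 2
for group, (a, b, c) in enumerate(product(["No", "Yes"], repeat=3), start=1):
    print(f"group {group}: A={a:<3} B={b:<3} C={c:<3} size=n")
```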
1.1 Efficiency
Factorial designs offer certain very important efficiencies or advantages when they are applicable. Consider the 2 × 2 design and the estimates of treatment effects that would result using an additive model for analysis (Table 3). Assume that the responses are group averages of some normally distributed response denoted by Y. The subscripts on Y indicate which treatment group it represents. Half the patients receive one of the treatments (this is also true in higher order designs). For a moment, further assume that the effect of A is not influenced by the presence of B. There are two estimates of the effect of treatment A compared to placebo in the design, ȲA − Ȳ0 and ȲAB − ȲB. If B does not modify the effect of A, the two estimates can be combined (averaged) to estimate the overall effect of A (βA),

βA = [(ȲA − Ȳ0) + (ȲAB − ȲB)] / 2.    (1)

Similarly,

βB = [(ȲB − Ȳ0) + (ȲAB − ȲA)] / 2.    (2)

Table 3. Treatment Effects from a 2 × 2 Factorial Design

           A: No    A: Yes
B: No       Ȳ0       ȲA
B: Yes      ȲB       ȲAB

Thus, in the absence of interactions (i.e. the effect of A is the same with or without B, and vice versa), the design permits the full sample size to be used to estimate two treatment effects. Now suppose that each patient’s response has a variance σ² that is the same in all treatment groups. We can calculate the variance of βA to be

var(βA) = (1/4) × (4σ²/n) = σ²/n.

This is the same variance that would result if A were tested against placebo in a single two-armed comparative trial with 2n patients in each treatment group. Similarly,

var(βB) = σ²/n.

However, if we tested A and B in separate trials, we would require 4n subjects in each trial or a total of 8n patients to have the same precision. Thus, in the absence of interactions, factorial designs estimate main effects efficiently. In fact, tests of both A and B can be conducted in a single factorial trial with the same precision as two single-factor trials using twice the sample size.
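A small simulation can illustrate this efficiency claim; the sample size, variance, and number of replications below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, reps = 50, 1.0, 20000       # patients per factorial cell, response SD, replications

est_factorial, est_two_arm = [], []
for _ in range(reps):
    # 2 x 2 factorial with n patients per cell and no true treatment effects
    y0, ya, yb, yab = (rng.normal(0.0, sigma, n).mean() for _ in range(4))
    est_factorial.append(((ya - y0) + (yab - yb)) / 2)          # estimator of beta_A, equation (1)
    # conventional two-arm trial of A alone with 2n patients per arm (same 4n total)
    ctrl, trt = rng.normal(0.0, sigma, 2 * n), rng.normal(0.0, sigma, 2 * n)
    est_two_arm.append(trt.mean() - ctrl.mean())

# Both empirical variances should be close to sigma^2 / n
print(np.var(est_factorial), np.var(est_two_arm), sigma ** 2 / n)
```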
2 INTERACTIONS
The effect of A might be influenced by the presence of B (or vice versa). In other words, there might be a treatment interaction. Some of the efficiencies just discussed will be lost. However, factorial designs are even more relevant when interactions are possible. Factorial designs are the only type of trial design that permits study of treatment interactions. This is because the design has treatment groups with all possible combinations of treatments, allowing the responses to be compared directly. Consider again the two estimates of A in the 2 × 2 design, one in the presence of B and the other in the absence of B. The definition of an interaction is that the effect of A in the absence of B is different from the effect of A in the presence of B. This can be estimated by comparing βAB = (Y A − Y 0 ) − (Y AB − Y B )
(3)
to zero. If β AB is near zero, we would conclude that no interaction is present. It is straightforward to verify that β AB = β BA . When there is an AB interaction present, we must modify our interpretation of the main effects. For example, the estimates of the main effects of A and B [(1) and (2)] assumed no interaction was present. We may choose to think of an overall effect of A, but recognize that the magnitude (and possibly the direction) of the effect depends on B. In the absence of the other treatment, we could estimate the main effects using βA = (Y A − Y 0 )
(4)
βB = (Y B − Y 0 ).
(5)
and
In the 2 × 2 × 2 design, there are three main effects and four interactions possible, all of which can be tested by the design. Following the notation above, the effects are

βA = ¼[(ȲA − Ȳ0) + (ȲAB − ȲB) + (ȲAC − ȲC) + (ȲABC − ȲBC)]                  (6)

for the main effect of A,

βAB = ½{[(ȲA − Ȳ0) − (ȲAB − ȲB)] + [(ȲAC − ȲC) − (ȲABC − ȲBC)]}             (7)

for the AB interaction, and

βABC = [(ȲA − Ȳ0) − (ȲAB − ȲB)] − [(ȲAC − ȲC) − (ȲABC − ȲBC)]               (8)

for the ABC interaction. When certain interactions are present, we may require an alternative estimator for βA or βAB (or for other effects). Suppose that there is evidence of an ABC interaction. Then, instead of βA, one possible estimator of the main effect of A is βA = ½[(ȲA − Ȳ0) + (ȲAB − ȲB)], which does not involve βABC. Other estimators of the main effect of A are possible. Similarly, the AB interaction could be tested by βAB = (ȲA − Ȳ0) − (ȲAB − ȲB),
for the same reason. Thus, when treatment interactions are present, we must modify our estimates of main effects and lower order interactions, losing some efficiency.

2.1 Scale of Measurement

In the examples just given, the treatment effects and interactions have been assumed to exist on an additive scale. This is reflected in the use of sums and differences in the formulas for estimation. Other scales of measurement may be useful. As an example, consider the response data in Table 4, where the effect of Treatment A is to increase the baseline response by 10 units. The same is
true of B, and there is no interaction between the treatments on this scale because the joint effect of A and B is to increase the response by 20 units. In contrast, in Table 5 are shown data in which the effects of both treatments are to multiply the baseline response by 3.0. Hence, the combined effect of A and B is a ninefold increase, which is greater than the joint treatment effect for the additive case. If the analysis model were multiplicative, then Table 4 would show an interaction, whereas if the analysis model were additive, then Table 5 would show an interaction. Thus, to discuss interactions, we must establish the scale of measurement.

Table 4. Response Data from a Factorial Trial Showing no Interaction on an Additive Scale

                 A: No    A: Yes
     B: No          5        15
     B: Yes        15        25

Table 5. Response Data from a Factorial Trial Showing no Interaction on a Multiplicative Scale

                 A: No    A: Yes
     B: No          5        15
     B: Yes        15        45

2.2 Main Effects and Interactions

In the presence of an interaction in the 2 × 2 design, one cannot speak simply about an overall, or main, effect of either treatment. This is because the effect of A is different depending on the presence or absence of B. In the presence of a small interaction, where all patients benefit from A regardless of the use of B, we might observe that the magnitude of the "overall" effect of A is of some size and that therapeutic decisions are unaffected by the presence of an interaction. This is called a "quantitative" interaction, so named because it does not affect the direction of the treatment effect. For large quantitative interactions, it may not be sensible to talk about overall effects. In contrast, if the presence of B reverses the effect of A, then the interaction is "qualitative", and treatment decisions may need to be modified. Here, we cannot talk about an overall effect of A, because it could be positive in the presence of B, negative in the absence of B, and could yield an average effect near zero.
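The scale dependence discussed in Section 2.1 can be checked directly from the data of Tables 4 and 5. The following added sketch evaluates the interaction contrast of Equation (3) on the raw responses and on their logarithms; the function name is arbitrary.

```python
import math

def interaction_contrast(y0, yA, yB, yAB):
    """Additive-scale interaction contrast of Equation (3)."""
    return (yA - y0) - (yAB - yB)

table4 = (5, 15, 15, 25)   # additive effects of 10 units
table5 = (5, 15, 15, 45)   # multiplicative effects of 3.0

for name, cells in [("Table 4", table4), ("Table 5", table5)]:
    logs = tuple(math.log(v) for v in cells)
    print(name,
          "raw-scale contrast:", interaction_contrast(*cells),
          "log-scale contrast: %.3f" % interaction_contrast(*logs))
# Table 4: zero on the raw (additive) scale, nonzero on the log scale.
# Table 5: nonzero on the raw scale, zero on the log (multiplicative) scale.
```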
2.3 Analysis

Motivation for the estimators given above can be obtained using general linear models. There has been little theoretic work on analyses using other models. One exception is the work by Slud (27) describing approaches to factorial trials with survival outcomes. Suppose we have conducted a 2 × 2 factorial experiment with group sizes given by Table 1. We can estimate the AB interaction effect using the linear model

E{Y} = β0 + βA XA + βB XB + βAB XA XB,                      (9)

where the Xs are indicator variables for the treatment groups and βAB is the interaction effect. The design matrix X has dimension 4n × 4 and consists of four blocks of n identical rows, one block for each treatment group: (1, 0, 0, 0) for the group receiving neither treatment, (1, 1, 0, 0) for A alone, (1, 0, 1, 0) for B alone, and (1, 1, 1, 1) for both treatments. The columns represent effects for the intercept, treatment A, treatment B, and both treatments, respectively. The vector of responses has dimension 4n × 1 and is Y = {Y01, . . . , YA1, . . . , YB1, . . . , YAB1, . . .}. The ordinary least squares solution for model (9) is β̂ = (X′X)⁻¹X′Y. The covariance matrix is (X′X)⁻¹σ², where the variance of each observation is σ². We have

X′X = n ×
[ 4  2  2  1
  2  2  1  1
  2  1  2  1
  1  1  1  1 ],

(X′X)⁻¹ = (1/n) ×
[  1  −1  −1   1
  −1   2   1  −2
  −1   1   2  −2
   1  −2  −2   4 ],

and

X′Y = n ×
[ Ȳ0 + ȲA + ȲB + ȲAB
  ȲA + ȲAB
  ȲB + ȲAB
  ȲAB ],

where Ȳi denotes the average response in the ith group. Then,

β̂ =
[ Ȳ0
  ȲA − Ȳ0
  ȲB − Ȳ0
  Ȳ0 − ȲA − ȲB + ȲAB ],                                      (10)

which corresponds to the estimators given above in (3)–(5). However, if we assume no interaction, then the βAB effect is removed from the model, and we obtain the estimator

β̂* =
[ ¾Ȳ0 + ¼ȲA + ¼ȲB − ¼ȲAB
  −½Ȳ0 + ½ȲA − ½ȲB + ½ȲAB
  −½Ȳ0 − ½ȲA + ½ȲB + ½ȲAB ].

The main effects for A and B are as given above in (1) and (2). The covariance matrices for these estimators are

cov{β̂} = (σ²/n) ×
[  1  −1  −1   1
  −1   2   1  −2
  −1   1   2  −2
   1  −2  −2   4 ]

and

cov{β̂*} = (σ²/n) ×
[  ¾  −½  −½
  −½   1   0
  −½   0   1 ].
In the absence of an interaction, the main effects of A and B are estimated independently and with higher precision than when an interaction is present. The interaction effect is relatively imprecisely estimated, indicating that larger sample sizes are required to have a high power to detect such effects.
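The matrix results of this section are easy to reproduce numerically. The sketch below is an added illustration (NumPy, with an arbitrary choice of n and simulated responses); it builds the design matrix for model (9) and confirms that ordinary least squares returns the closed-form estimator (10).

```python
import numpy as np

n = 10                                     # patients per cell (assumed)
# Four blocks of n identical rows; columns = intercept, A, B, AB as in model (9)
blocks = [(1, 0, 0, 0), (1, 1, 0, 0), (1, 0, 1, 0), (1, 1, 1, 1)]
X = np.vstack([np.tile(row, (n, 1)) for row in blocks]).astype(float)

print(np.linalg.inv(X.T @ X) * n)          # reproduces (X'X)^{-1} up to the factor 1/n

rng = np.random.default_rng(1)
y = rng.normal(size=4 * n)                 # arbitrary simulated responses
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

ybar = y.reshape(4, n).mean(axis=1)        # cell means Ybar_0, Ybar_A, Ybar_B, Ybar_AB
closed_form = np.array([ybar[0],
                        ybar[1] - ybar[0],
                        ybar[2] - ybar[0],
                        ybar[0] - ybar[1] - ybar[2] + ybar[3]])
print(np.allclose(beta_hat, closed_form))  # True: OLS agrees with Equation (10)
```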
3 EXAMPLES
Several clinical trials conducted in recent years have used factorial designs. A sample of such studies is shown in Table 6.

Table 6. Some Recent Randomized Clinical Trials using Factorial Designs

Trial                                  Design       Reference
Physicians' Health Study               2 × 2        Hennekens & Eberlein
ATBC Prevention Trial                  2 × 2        Heinonen et al.
Desipramine                            2 × 2        Max et al.
ACAPS                                  2 × 2        ACAPS Group
Linxian Nutrition Trial                2⁴           Li et al.
Retinitis pigmentosa                   2 × 2        Berson et al.
Linxian Cataract Trial                 2 × 2        Sperduto et al.
Tocopherol/deprenyl                    2 × 2        Parkinson Study Group
Women's Health Initiative              2 × 2 × 2    Assaf & Carleton
Polyp Prevention Trial                 2 × 2        Greenberg et al.
Cancer/eye disease                     2 × 2        Green et al.
Cilazapril/hydrochlorothiazide         4 × 3        Pordy
Nebivolol                              4 × 3        Lacourciere et al.
Endophthalmitis vitrectomy study       2 × 2        Endophthalmitis Vitrectomy Study Group
Bicalutamide/flutamide                 2 × 2        Schellhammer et al.
ISIS-4                                 2 × 2 × 2    ISIS-4 Collaborative Group

Source: adapted from Piantadosi (24).

One important study using a 2 × 2 factorial design is the Physicians' Health Study (16,30). This trial was conducted in 22 000 physicians in the US and was designed to test the effects of (i) aspirin on reducing cardiovascular mortality and (ii) β-carotene on reducing cancer incidence. The trial is noteworthy in several ways, including its test of two interventions in unrelated diseases, use of physicians as subjects to report outcomes reliably, relatively low cost, and an all-male (high risk)
study population. This last characteristic has led to some unwarranted criticism. In January 1988 the aspirin component of the Physicians' Health Study was discontinued, because evidence demonstrated convincingly that it was associated with lower rates of myocardial infarction (20). The question concerning the effect of β-carotene on cancer remains open and will be addressed by continuation of the trial. In the likely absence of an interaction between aspirin and β-carotene, the second major question of the trial will be unaffected by the closure of the aspirin component. Another noteworthy example of a 2 × 2 factorial design is the α-tocopherol β-carotene Lung Cancer Prevention Trial conducted in 29 133 male smokers in Finland between 1987 and 1994 (3,15). In this study, lung cancer incidence is the sole outcome. It was thought possible that lung cancer incidence could be reduced by either or both interventions. When the intervention was completed in 1994, there were 876 new cases of lung cancer in the study population during the trial. Alpha-tocopherol was not associated with a reduction in the risk of cancer. Surprisingly, β-carotene was associated with a statistically significantly increased incidence of lung cancer (4). There was no evidence of a treatment interaction. The unexpected findings of this study have been supported by the recent results of another large trial of carotene and retinol (32). The Fourth International Study of Infarct Survival (ISIS-4) was a 2 × 2 × 2 factorial trial assessing the efficacy of oral captopril, oral mononitrate, and intravenous magnesium sulfate in 58 050 patients with suspected myocardial infarction (12,17). No significant interactions among the treatments were observed, and each main effect comparison was based on approximately 29 000 treated vs. 29 000 control patients. Captopril was associated with a small but statistically significant reduction in five-week mortality. The difference in mortality was 7.19% vs. 7.69% (a difference of 143 deaths out of 4319 total deaths), illustrating the ability of large studies to detect potentially important treatment effects even when they are small in relative magnitude. Mononitrate and magnesium therapy did not significantly reduce five-week mortality.
4 SIMILAR DESIGNS

4.1 Fractional and Partial Factorial Designs

Fractional factorial designs are those which omit certain treatment groups by design. A careful analysis of the objectives of an experiment, its efficiency, and the effects that it can estimate may justify not using some groups. Because many cells contribute to the estimate of any effect, a design may achieve its intended purpose without some of the cells. In the 2 × 2 design, all treatment groups must be present to permit estimating the interaction between A and B. However, for higher order designs, if some interactions are thought biologically not to exist, omitting certain treatment combinations from the design will still permit estimates of other effects of interest. For example, in the 2 × 2 × 2 design, if the interaction between A, B, and C is thought not to exist, omitting that treatment cell from the design will still permit estimation of all the main effects. The efficiency will be somewhat reduced, however. Similarly, the two-way interactions can still be estimated without ȲABC. This can be verified from the formulas above. More generally, fractional high-order designs will produce a situation termed "aliasing", in which the estimates of certain effects are algebraically identical to completely different effects. If both effects are biologically possible, the design will not be able to reveal which effect is being estimated. Naturally, this is undesirable unless additional information is available to the investigator to indicate that some aliased effects are zero. This can be used to advantage in improving efficiency and one must be careful in deciding which cells to exclude. See Cox (8) or Mason & Gunst (21) for a discussion of this topic.

The Women's Health Initiative clinical trial is a 2 × 2 × 2 partial factorial design studying the effects of hormone replacement, dietary fat reduction, and calcium and vitamin D on coronary disease, breast cancer, and osteoporosis (2). All eight combinations of treatments are given, but participants may opt to join one, two, or all three of the randomized components. The study is expected to accrue over 64 000 patients and is projected to finish in the year 2007. The dietary
component of the study will randomize 48 000 women using a 3:2 allocation ratio in favor of the control arm and nine years of follow-up. Such a large and complex trial presents logistical difficulties, questions about adherence, and sensitivity of the intended power to assumptions that can only roughly be validated.

4.2 Incomplete Factorial Designs

When treatment groups are dropped out of factorial designs without yielding a fractional replication, the resulting trials have been termed "incomplete factorial designs" (7). In incomplete designs, cells are not missing by design intent, but because some treatment combinations may be infeasible. For example, in a 2 × 2 design, it may not be ethically possible to use a placebo group. In this case, one would not be able to estimate the AB interaction. In other circumstances, unwanted aliasing may occur, or the efficiency of the design to estimate main effects may be greatly reduced. In some cases, estimators of treatment and interaction effects are biased, but there may be reasons to use a design that retains as much of the factorial structure as possible. For example, they may be the only way in which to estimate certain interactions.
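The estimability argument of Section 4.1 can be illustrated with a simple rank calculation. The following added sketch, which assumes 0/1 indicator coding, omits the cell receiving all three treatments from a 2 × 2 × 2 design and checks that the main effects and two-way interactions remain estimable while the three-way interaction does not.

```python
import numpy as np
from itertools import product

def design_row(a, b, c, include_abc):
    """Indicator-coded effects for one cell: intercept, mains, two-way (and ABC) terms."""
    row = [1, a, b, c, a * b, a * c, b * c]
    return row + ([a * b * c] if include_abc else [])

cells = [cell for cell in product([0, 1], repeat=3) if cell != (1, 1, 1)]  # drop the ABC cell

X_reduced = np.array([design_row(*cell, include_abc=False) for cell in cells])
X_with_abc = np.array([design_row(*cell, include_abc=True) for cell in cells])

print(np.linalg.matrix_rank(X_reduced))    # 7: mains and two-way interactions estimable
print(np.linalg.matrix_rank(X_with_abc))   # 7 (< 8 columns): the ABC term is not estimable
```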
REFERENCES 1. ACAPS Group (1992). Rationale and design for the Asymptomatic Carotid Artery Plaque Study (ACAPS), Controlled Clinical Trials 13, 293–314. 2. Assaf, A. R. & Carleton, R. A. (1994). The Women’s Health Initiative clinical trial and observational study: history and overview, Rhode Island Medicine 77, 424–427. 3. ATBC Cancer Prevention Study Group (1994). The alpha-tocopherol beta-carotene lung cancer prevention study: design, methods, participant characteristics, and compliance, Annals of Epidemiology 4, 1–9. 4. ATBC Cancer Prevention Study Group (1994). The effect of vitamin E and beta carotene on the incidence of lung cancer and other cancers in male smokers, New England Journal of Medicine 330, 1029–1034.
5. Berson, E. L., Rosner, B., Sandberg, M. A., Hayes, K. C., Nicholson, B. W., WeigelDiFranco, C. & Willett, W. (1993). A randomized trial of vitamin A and vitamin E supplementation for retinitis pigmentosa, Archives of Ophthalmology 111, 761–772. 6. Byar, D. P. & Piantadosi, S. (1985). Factorial designs for randomized clinical trials, Cancer Treatment Reports 69, 1055–1063. 7. Byar, D. P., Herzberg, A. M. & Tan, W.-Y. (1993). Incomplete factorial designs for randomized clinical trials, Statistics in Medicine 12, 1629–1641. 8. Cox, D. R. (1958). Planning of Experiments. Wiley, New York. 9. Endophthalmitis Vitrectomy Study Group. Results of the Endophthalmitis Vitrectomy Study (1995). A randomized trial of immediate vitrectomy and of intravenous antibiotics for the treatment of postoperative bacterial endophthalmitis, Archives of Ophthalmology 113, 1479–1496. 10. Fisher, R. A. (1935). The Design of Experiments. Collier Macmillan, London. 11. Fisher, R. A. (1960). The Design of Experiments, 8th Ed. Hafner, New York. 12. Flather, M., Pipilis, A., Collins, R. et al. (1994). Randomized controlled trial of oral captopril, of oral isosorbide mononitrate and of intravenous magnesium sulphate started early in acute myocardial infarction: safety and haemodynamic effects, European Heart Journal 15, 608–619. 13. Green, A., Battistutta, D., Hart, V., Leslie, D., Marks, G., Williams, G., Gaffney, P., Parsons, P., Hirst, L., Frost, C. et al. (1994). The Nambour Skin Cancer and Actinic Eye Disease Prevention Trial: design and baseline characteristics of participants, Controlled Clinical Trials 15, 512–522. 14. Greenberg, E. R., Baron, J. A., Tosteson, T. D., Freeman, D. H., Jr, Beck, G. J., Bond, J. H., Colacchio, T. A., Coller, J. A., Frankl, H. D., Haile, R. W., Mandel, R. W., Nierenberg, J. S., Rothstein, D. W., Richard, S., Dale, C., Stevens, M. M., Summers, R. W. & vanStolk, R. U. (1994). A clinical trial of antioxidant vitamins to prevent colorectal adenoma. Polyp Prevention Study Group, New England Journal of Medicine 331, 141–147. 15. Heinonen, O. P., Virtamo, J., Albanes, D. et al. (1987). Beta carotene, alpha-tocopherol lung cancer intervention trial in Finland, in Proceedings of the XI Scientific Meeting of the International Epidemiologic Association, Helsinki, August, 1987. Pharmy, Helsinki.
16. Hennekens, C. H. & Eberlein, K. (1985). A randomized trial of aspirin and beta-carotene among U.S. physicians, Preventive Medicine 14, 165–168.
17. ISIS-4 Collaborative Group (1995). ISIS-4: a randomized factorial trial assessing early captopril, oral mononitrate, and intravenous magnesium sulphate in 58 050 patients with suspected acute myocardial infarction, Lancet 345, 669–685.
18. Lacourciere, Y., Lefebvre, J., Poirier, L., Archambault, F. & Arnott, W. (1994). Treatment of ambulatory hypertensives with nebivolol or hydrochlorothiazide alone and in combination. A randomized double-blind, placebo-controlled, factorial-design trial, American Journal of Hypertension 7, 137–145.
19. Li, B., Taylor, P. R., Li, J. Y., Dawsey, S. M., Wang, W., Tangrea, J. A., Liu, B. Q., Ershow, A. G., Zheng, S. F., Fraumeni, J. F., Jr et al. (1993). Linxian nutrition intervention trials. Design, methods, participant characteristics, and compliance, Annals of Epidemiology 3, 577–585.
20. Lubsen, J. & Pocock, S. J. (1994). Factorial trials in cardiology (editorial), European Heart Journal 15, 585–588.
21. Mason, R. L. & Gunst, R. L. (1989). Statistical Design and Analysis of Experiments. Wiley, New York.
22. Max, M. B., Zeigler, D., Shoaf, S. E., Craig, E., Benjamin, J., Li, S. H., Buzzanell, C., Perez, M. & Ghosh, B. C. (1992). Effects of a single oral dose of desipramine on postoperative morphine analgesia, Journal of Pain & Symptom Management 7, 454–462.
23. Parkinson Study Group (1993). Effects of tocopherol and deprenyl on the progression of disability in early Parkinson's disease, New England Journal of Medicine 328, 176–183.
24. Piantadosi, S. (1997). Factorial designs, in Clinical Trials: a Methodologic Perspective. Wiley, New York. See Chapter 15.
25. Pordy, R. C. (1994). Cilazapril plus hydrochlorothiazide: improved efficacy without reduced safety in mild to moderate hypertension. A double-blind placebo-controlled multicenter study of factorial design, Cardiology 85, 311–322.
26. Schellhammer, P., Shariff, R., Block, N., Soloway, M., Venner, P., Patterson, A. L., Sarosdy, M., Vogelzang, N., Jones, J. & Kiovenbag, G. (1995). A controlled trial of bicalutamide versus flutamide, each in combination with luteinizing hormone-releasing hormone analogue therapy, in patients with advanced prostate cancer. Casodex Combination Study Group, Urology 45, 745–752.
27. Slud, E. V. (1994). Analysis of factorial survival experiments, Biometrics 50, 25–38.
28. Snedecor, G. W. & Cochran, W. G. (1980). Statistical Methods, 7th Ed. The Iowa State University Press, Ames.
29. Sperduto, R. D., Hu, T. S., Milton, R. C., Zhao, J. L., Everett, D. F., Cheng, Q. F., Blot, W. J., Bing, L., Taylor, P. R., Li, J. Y. et al. (1993). The Linxian cataract studies. Two nutrition intervention trials, Archives of Ophthalmology 111, 1246–1253.
30. Stampfer, M. J., Buring, J. E., Willett, W. et al. (1985). The 2 × 2 factorial design: its application to a randomized trial of aspirin and carotene in U.S. physicians, Statistics in Medicine 4, 111–116.
31. Steering Committee of the Physicians' Health Study Research Group (1989). Final report on the aspirin component of the ongoing physicians' health study, New England Journal of Medicine 321, 129–135.
32. Thornquist, M. D., Omenn, G. S., Goodman, G. E. et al. (1993). Statistical design and monitoring of the carotene and retinol efficacy trial (CARET), Controlled Clinical Trials 14, 308–324.
33. Yates, F. (1935). Complex experiments (with discussion), Journal of the Royal Statistical Society, Series B 2, 181–247.
FAST TRACK ‘‘Fast Track’’ is a formal mechanism to interact with the U.S. Food and Drug Administration (FDA) using approaches that are available to all applicants for marketing claims. The Fast Track mechanism is described in the Food and Drug Administration Modernization Act of 1997 (FDAMA). The benefits of Fast Track include scheduled meetings to seek FDA input into development plans, the option of submitting a New Drug Application (NDA) in sections rather than all components simultaneously, and the option of requesting evaluation of studies using surrogate endpoints. The Fast Track designation is intended for the combination of a product and a claim that addresses an unmet medical need but is independent of the Priority Review and Accelerated Approval programs. An applicant may use any or all of the components of Fast Track without the formal designation. Fast Track designation does not necessarily lead to Priority Review or Accelerated Approval.
This article was modified from the website of the United States Food and Drug Administration (http://www.accessdata.fda.gov/scripts/cder/onctools /Accel.cfm#FastTrack) by Ralph D’Agostino and Sarah Karl.
FDA DIVISION OF PHARMACOVIGILANCE AND EPIDEMIOLOGY (DPE)
The Center for Drug Evaluation and Research's (CDER) Division of Pharmacovigilance and Epidemiology (DPE) also carries out an epidemiologic function in the monitoring of drug safety. This function is performed by a multidisciplinary professional staff of physicians and Ph.D. epidemiologists, pharmacists, and program/project managers. The primary work is directed toward the evaluation and the risk assessment of drugs in the postmarketing environment using the tools of epidemiology. Epidemiologists integrate the medical/clinical details of the underlying disease being treated with the influence of patient factors, concomitant diseases, and medications, as well as the clinical pharmacology of the specific product under study. DPE's Epidemiology staff work closely with the Post-Marketing Safety Reviewers to provide clinical and epidemiologic case-series reviews of spontaneous adverse event reports submitted to the Food and Drug Administration (FDA). These data are used in a variety of ways to develop, refine, and investigate signals of clinical importance related to drug safety. As a complement, drug-use data are used frequently to estimate the size and to characterize the demographic composition of the population exposed to a given prescription product. Additionally, epidemiologists are involved in the design and the critique of Phase IV protocols for safety studies performed by industry and in the review of study findings. They also design, execute, and help to analyze data from epidemiologic studies performed through the mechanism of the DPE's cooperative agreement program that provides the Center with access to several large record-linked databases. The reports produced by the DPE are integral to the ongoing risk assessment and the risk management performed by CDER review divisions of a product's risk versus benefit profile. In addition, DPE epidemiologists are called on to meet with industry over important safety issues or to present their work before FDA advisory committees.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/handbook/epidemio.htm) by Ralph D'Agostino and Sarah Karl.
FDA MODERNIZATION ACT (FDAMA) OF 1997
The Food and Drug Administration (FDA) Modernization Act (FDAMA), which was enacted November 21, 1997, amended the Federal Food, Drug, and Cosmetic Act relating to the regulation of food, drugs, devices, and biological products. With the passage of FDAMA, Congress enhanced FDA's mission to recognize that the Agency would be operating in a twenty-first century characterized by increasing technological, trade, and public health complexities.

1 PRESCRIPTION DRUG USER FEES

The act reauthorizes, for five more years, the Prescription Drug User Fee Act of 1992 (PDUFA). In the past five years, the program has enabled the agency to reduce to 15 months the 30-month average time that used to be required for a drug review before PDUFA. This accomplishment was made possible by FDA managerial reforms and the addition of 696 employees to the agency's drugs and biologics program, which was financed by $329 million in user fees from the pharmaceutical industry.

2 FDA INITIATIVES AND PROGRAMS

The law enacts many FDA initiatives undertaken in recent years under Vice President Al Gore's Reinventing Government program. The codified initiatives include measures to modernize the regulation of biological products by bringing them in harmony with the regulations for drugs and by eliminating the need for establishment license application, to eliminate the batch certification and monograph requirements for insulin and antibiotics, to streamline the approval processes for drug and biological manufacturing changes, and to reduce the need for environmental assessment as part of a product application. The act also codifies FDA's regulations and practice to increase patient access to experimental drugs and medical devices and to accelerate review of important new medications. In addition, the law provides for an expanded database on clinical trials that will be accessible by patients. With the sponsor's consent, the results of such clinical trials will be included in the database. Under a separate provision, patients will receive advance notice when a manufacturer plans to discontinue a drug on which they depend for life support or sustenance, or for a treatment of a serious or debilitating disease or condition.

3 INFORMATION ON OFF-LABEL USE AND DRUG ECONOMICS

The law abolishes the long-standing prohibition on dissemination by manufacturers of information about unapproved uses of drugs and medical devices. The act allows a firm to disseminate peer-reviewed journal articles about an off-label indication of its product, provided the company commits itself to file, within a specified time frame, a supplemental application based on appropriate research to establish the safety and effectiveness of the unapproved use. The act also allows drug companies to provide economic information about their products to formulary committees, managed care organizations, and similar large-scale buyers of health-care products. The provision is intended to provide such entities with dependable facts about the economic consequences of their procurement decisions. The law, however, does not permit the dissemination of economic information that could affect prescribing choices to individual medical practitioners.
4 PHARMACY COMPOUNDING
The act creates a special exemption to ensure continued availability of compounded drug
products prepared by pharmacists to provide patients with individualized therapies not available commercially. The law, however, seeks to prevent manufacturing under the guise of compounding by establishing parameters within which the practice is appropriate and lawful.

5 RISK-BASED REGULATION OF MEDICAL DEVICES

The act complements and builds on the FDA's recent measures to focus its resources on medical devices that present the greatest risk to patients. For example, the law exempts from premarket notification class I devices that are not intended for a use that is of substantial importance to prevent impairment of human health, or that do not present a potential unreasonable risk of illness or injury. The law also directs FDA to focus its postmarket surveillance on high-risk devices, and it allows the agency to implement a reporting system that concentrates on a representative sample of user facilities, such as hospitals and nursing homes, that experience deaths and serious illnesses or injuries linked with the use of devices. Finally, the law expands an ongoing pilot program under which FDA accredits outside, so-called "third party", experts to conduct the initial review of all class I and low-to-intermediate risk class II devices. The act, however, specifies that an accredited person may not review devices that are permanently implantable, life-supporting, life-sustaining, or for which clinical data are required.

6 FOOD SAFETY AND LABELING
The act eliminates the requirement of the FDA’s premarket approval for most packaging and other substances that come in contact with food and may migrate into it. Instead, the law establishes a process whereby the manufacturer can notify the agency about its intent to use certain food contact substances and, unless FDA objects within 120 days, may proceed with the marketing of the new product. Implementation of the notification process is contingent on additional appropriations to cover its cost to the agency. The
act also expands procedures under which the FDA can authorize health claims and nutrient content claims without reducing the statutory standard.

7 STANDARDS FOR MEDICAL PRODUCTS

Although the act reduces or simplifies many regulatory obligations of manufacturers, it does not lower the standards by which medical products are introduced into the market place. In the area of drugs, the law codifies the agency's current practice of allowing in certain circumstances one clinical investigation as the basis for product approval. The act, however, does preserve the presumption that, as a general rule, two adequate and well-controlled studies are needed to prove the product's safety and effectiveness. In the area of medical devices, the act specifies that the FDA may keep out of the market products whose manufacturing processes are so deficient that they could present a serious health hazard. The law also gives the agency authority to take appropriate action if the technology of a device suggests that it is likely to be used for a potentially harmful unlabeled use.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/opacom/backgrounders/modact.htm, http://www.fda.gov/oc/fdama/default.htm) by Ralph D'Agostino and Sarah Karl.
FEDERAL FOOD, DRUG, AND COSMETIC ACT
The 1906 Pure Food and Drug Act (the "Wiley Act") prohibited the manufacture, sale, and interstate shipment of "adulterated" and "misbranded" foods and drugs. Product labels were required to make a truthful disclosure of contents but were not required to state weights or measures. By 1913, food manufacturers had grown alarmed by the growing variety of state-level weight and measure laws and sought uniformity at the federal level through the Gould Amendment, which required net contents to be declared, with tolerances for reasonable variations. Under the Wiley Act, the U.S. federal government's Bureau of Chemistry, which in 1927 became the Food, Drug, and Insecticide Administration, then in 1931, the Food and Drug Administration, could challenge illegal products in court but lacked the affirmative requirements to guide compliance. Food adulteration continued to flourish because judges could find no specific authority in the law for the standards of purity and content that the Food and Drug Administration (FDA) had set up. Such products as "fruit" jams made with water, glucose, grass seed, and artificial color were undercutting the market for honest products. False therapeutic claims for patent medicines also escaped control in 1912 when Congress enacted an amendment that outlawed such claims but required the government to prove them fraudulent; to escape prosecution, defendants had only to show that they personally believed in the fake remedy, a major weakness in the law for 26 years. The 1906 law became obsolete because technological changes were revolutionizing the production and marketing of foods, drugs, and related products. In addition, economic hardships of the 1930s magnified the many shortcomings of the 1906 act and brought a new consciousness of consumer needs. After several unpopular attempts to revise the Pure Food and Drug Act during the administration of Franklin D. Roosevelt, public outcry over the "Elixir Sulfanilamide" disaster in 1937, a mass poisoning incident in which a popularly marketed drug killed over 100 people, led to the Food, Drug, and Cosmetic Act of June 25, 1938.

1 THE PREVENTIVE AMENDMENTS

The 1938 Food, Drug, and Cosmetic Act in conjunction with World War II greatly expanded the FDA's workload. Wartime demands had stimulated the development of new "wonder drugs," especially the antibiotics, which were made subject to FDA testing, beginning with penicillin in 1945. Although there was now a law requiring premarket clearance of new drugs, consumers continued to be guinea pigs for a host of new chemical compounds of unknown safety. The law prohibited poisonous substances but did not require proof that food ingredients were safe. It also provided exemptions and safe tolerances for unavoidable or necessary poisons such as pesticides. When the FDA attempted to set a pesticide tolerance, an adverse court decision showed that the lengthy procedure required by law was unworkable. The FDA could stop the use of known poisons and did so in numerous cases, but the vast research efforts needed to ensure that all food chemicals were safe were clearly beyond government resources. Thus, three amendments fundamentally changed the character of the U.S. food and drug law: the Pesticide Amendment (1954), the Food Additives Amendment (1958), and the Color Additive Amendments (1960). These laws provide that no substance can legally be introduced into the U.S. food supply unless there has been a prior determination that it is safe, and the manufacturers themselves are required to prove a product's safety. Also very significant was the proviso in the food and color additive laws that no additive could be deemed safe (or given FDA approval) if it was found to cause cancer in humans or experimental animals. Known as the "Delaney Clause," this section was initially opposed by the FDA and by scientists, who agreed that an additive used at very low levels need not necessarily be banned only because it might cause cancer at high levels. However, its proponents justified the clause on the basis that cancer experts have not been able to determine a safe level for any carcinogen. This was the underlying basis for the 1959 nationwide FDA recall of cranberries contaminated by the weed killer aminotriazole, which was beneficial in convincing farmers that pesticides must be used with care. Preventing violations through premarket clearance has given consumers immeasurably better protection than merely prosecuting the few violations that could be proved by investigating after injuries were reported.

This article was modified from the website of the United States Food and Drug Administration (http://www.cfsan.fda.gov/~lrd/histor1a.html) by Ralph D'Agostino and Sarah Karl.
FEDERAL REGISTER
The Federal Register is one of the most important sources for information on the activities of the U.S. Food and Drug Administration (FDA) and other government agencies. Published daily, Monday through Friday, the Federal Register carries all proposed and finalized regulations and many significant legal notices issued by the various agencies as well as presidential proclamations and executive orders. Subscriptions to the Federal Register can be purchased from the federal government's Superintendent of Documents. As an alternative, copies can usually be found in local libraries, county courthouses, federal buildings, or on the Internet.

1 ADVANCE NOTICE

Often, the FDA publishes "Notices of Intent" in the Federal Register to give interested parties the earliest possible opportunity to participate in its decisions. These are notices that the FDA is considering an issue and that outside views are welcome before a formal proposal is made.

2 PROPOSED REGULATIONS

When a formal proposal is developed, the FDA publishes a "Notice of Proposed Rulemaking" in the Federal Register. The notice provides the timeframe in which written comments about the proposed action can be submitted. A written request also can be submitted that FDA officials extend the comment period. If FDA extends the period, a notice of the extension is published in the Federal Register. Occasionally, a second or third proposal is published in the Federal Register because of the nature of the comments received. Each time a proposal is substantively revised or amended, a notice is published in the Federal Register.

3 FINAL REGULATIONS

Ultimately, a "Final Rule" is published, which specifies the date when the new regulatory requirements or regulations become effective.

4 REGULATORY AGENDA

Twice a year (April and October), the entire Department of Health and Human Services, including the FDA, publishes an agenda in the Federal Register that summarizes policy-significant regulations, regulations that are likely to have a significant economic impact on small entities, and other actions under development. Each item listed includes the name, address, and telephone number of the official to contact for more information.

5 MEETINGS AND HEARINGS

Notices are published in the Federal Register announcing all meetings of the FDA's advisory committees and all public meetings that provide an information exchange between FDA and industry, health professionals, consumers, and the scientific and medical communities. The notice contains the date, time, and place of the meeting as well as its agenda. The Federal Register also announces administrative hearings before the FDA and public hearings to gain citizen input into Agency activities (see "Citizen Petition").

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/ora/fed state/Small business/sb guide/fedreg.html) by Ralph D'Agostino and Sarah Karl.
FILEABLE NEW DRUG APPLICATION (NDA)

After a New Drug Application (NDA) is received by the U.S. Food and Drug Administration's Center for Drug Evaluation and Research (CDER), it undergoes a technical screening, generally referred to as a completeness review. This evaluation ensures that sufficient data and information have been submitted in each area to justify "filing" the application, that is, to justify initiating the formal review of the NDA.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/handbook/fileable.htm) by Ralph D’Agostino and Sarah Karl.
FINANCIAL DISCLOSURE
The U.S. Food and Drug Administration (FDA) reviews data generated in clinical studies to determine whether medical device applications are approvable. Financial interest of a clinical investigator is one potential source of bias in the outcome of a clinical study. To ensure the reliability of the data, the financial interests and arrangements of clinical investigators must be disclosed to the FDA. This requirement applies to any clinical study submitted in a marketing application that the applicant or the FDA relies on to establish that the product is effective or is used to show equivalence to an effective product, and to any study in which a single investigator makes a significant contribution to the demonstration of safety. The requirement does not apply to studies conducted under the emergency use, compassionate use, or treatment use provisions. Financial compensation or interests information is used in conjunction with information about the design and purpose of the study as well as information obtained through on-site inspections in the agency's assessment of the reliability of the data. As of February 1999, anyone who submits a Premarket Approval (PMA) or Premarket Notification 510(k) that contains a covered clinical study must submit certain information concerning the compensation to and financial interests of any clinical investigator conducting clinical studies covered in the application. Applicants must certify the absence of certain financial interests of clinical investigators on Financial Interest Form (Certification: Financial Interests and Arrangements of Clinical Investigators, FDA Form 3454) or disclose those financial interests on Financial Interest Form (Disclosure: Financial Interests and Arrangements of Clinical Investigators, FDA Form 3455).

The financial arrangements that must be disclosed include the following:
• Compensation made to the investigator in which the value of the compensation could be affected by the study outcome.
• Significant payments to the investigator or institution with a monetary value of $25,000 or more (e.g., grants, equipment, retainers for ongoing consultation, or honoraria) over the cost of conducting the trial. Any such payments to the investigator or institution during the time the investigator is conducting the study and for 1 year after study completion must be reported.
• Proprietary interest in the device, such as a patent, trademark, copyright, or licensing agreement.
• Significant equity interest in the sponsor such as ownership, interest, or stock options. All such interests whose value cannot be readily determined through reference to public prices must be reported. If the sponsor is a publicly traded company, any equity interest whose value is greater than $50,000 must be reported. Any such interests held by the investigator while the investigator was conducting the study and for 1 year after study completion must be reported.

This requirement applies to investigators and subinvestigators, including their spouses and dependent children, but does not apply to full-time or part-time employees of the sponsor or to hospital or office staff. (For studies completed before February 1999, the requirements are reduced. That is, the sponsor does not need to report equity interest in a publicly held company or significant payments of other sorts. Other reporting still applies.) Sponsors are responsible for collecting financial information from investigators, and clinical investigators are responsible for providing financial disclosure information to the sponsor. The investigator's agreement with the sponsor should require the investigator to provide the sponsor with accurate financial disclosure information. Certification or
disclosure information should not be included in the Investigational Device Exemption (IDE) application. If the FDA determines that the financial interests of any clinical investigator raise a serious question about the integrity of the data, the FDA will take any action it deems necessary to ensure the reliability of the data, including:
• Initiating agency audits of the data derived from the clinical investigator in question.
• Requesting that the applicant submit additional analyses of data (e.g., to evaluate the effect of the clinical investigator's data on the overall study outcome).
• Requesting that the applicant conduct additional independent studies to confirm the results of the questioned study.
• Refusing to use the data from the covered clinical study as the primary basis for an agency action, such as PMA approval or 510(k) clearance.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cdrh/devadvice/ide/financial.shtml) by Ralph D'Agostino and Sarah Karl.
FISHER’S EXACT TEST
RICK ROUTLEDGE
Simon Fraser University, Vancouver, British Columbia, Canada

Fisher's exact test can be used to assess the significance of a difference between the proportions in two groups. The test was first described in independently written articles by Irwin (14) and Yates (25). Yates used the test primarily to assess the accuracy of his correction factor to the χ² test, and attributed the key distributional result underlying the exact test to R.A. Fisher. Fisher successfully promoted the test in 1935, presenting two applications, one to an observational study on criminal behavior patterns (8), and another to an artificial example of a controlled experiment on taste discrimination (9). Typical recent applications are to the results of simple experiments comparing a treatment with a control. The design must be completely randomized, and each experimental unit must yield one of two possible outcomes (like success or failure). Consider, for example, the study reported by Hall et al. (13). This was a randomized, double-blind, placebo-controlled study on the effect of ribavirin aerosol therapy on a viral infection (RSV) of the lower respiratory tract of infants. After five days of treatment, each infant was examined for the continued presence of viral shedding in nasal secretions. There were 26 patients in the randomized trial. For illustrative purposes, the following discussion focuses on hypothetical results from a smaller set of only eight patients. Also, a patient showing no signs of viral shedding in nasal secretions will be said to have recovered. Consider, then, the "results" displayed in Table 1.

Table 1. Results From a Small, Comparative Experiment

                Recovered    Not recovered    Totals
Treatment           3              1             4
Control             0              4             4
Totals              3              5             8

All three recoveries were in the treatment group. For a frequency table based on only four treatments and four control subjects, the evidence could hardly be more convincing, but is it statistically significant? Had the experiment included more patients, an approximate P value could have been obtained using the standard χ² test. But we cannot trust the accuracy of this approximation when it is based on observations on so few patients. Fisher's exact test provides a way around this difficulty. The reasoning behind the test is as follows. Suppose that the treatment was totally ineffectual, and that each patient's recovery over the subsequent five days was unaffected by whether the treatment were applied or not. Precisely three patients recovered. If the treatment was ineffectual, then these three, and only these three, individuals would have recovered regardless of whether they were assigned to the treatment or control group. The fact that all three did indeed appear in the treatment group would then have been just a coincidence whose probability could be calculated as follows. When four out of the eight subjects were randomly chosen for the treatment group, the chance that all three of those destined to recover should end up in the treatment group is given by the hypergeometric distribution as

(3C3)(5C1) / (8C4) = 0.071.
This is the standard P value for Fisher's exact test of the null hypothesis of no treatment effect against the one-sided alternative that the treatment has a positive benefit. Consider the more general setting, as portrayed in Table 2.

Table 2. Notation for a 2 × 2 Frequency Table of Outcomes From a Comparative Experiment

                Recovered    Not recovered    Totals
Treatment           a              b             n
Control             c              d             m
Totals              S              F             N

The P value for testing the null hypothesis that the treatment has no impact vs. the one-sided alternative that it has a positive value is

P = Σ_{y=a}^{min(n,S)} C(S, y) C(F, n − y) / C(N, n).   (1)

For a two-sided alternative there is no universally accepted definition. The two most common approaches are (i) to double the one-sided P value, or (ii) to extend the above sum over the other tail of the distribution, including all those terms which are less than or equal to the probability for the observed
table. The latter strategy is deployed by the major statistical packages, BMDP (Two-Way Tables in (5)), JMP (Contingency Table Analysis in Fit Y by X in (21)), SAS (FREQ procedure in (20)), S-PLUS (function fisher.test in (17)), SPSS (Crosstabs in (23)), StatXact (6), and Systat (Tables in (24)). Gibbons & Pratt (12) discuss possible alternatives. The test can be extended to an r × c contingency table, as proposed by Freeman & Halton (11). It is also used on r × 2 tables for multiple comparisons, with the usual controversy over adjustments for simultaneous inferences on a single data set (see (22) and references therein).
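Equation (1) is straightforward to compute directly. The short function below is an added illustration (the function name is arbitrary); it reproduces the one-sided P value of 0.071 for the data in Table 1.

```python
from math import comb

def fisher_one_sided_p(a, b, c, d):
    """One-sided P value from Equation (1) for the table [[a, b], [c, d]]
    with rows = treatment/control and columns = recovered/not recovered."""
    n, S, F, N = a + b, a + c, b + d, a + b + c + d
    return sum(comb(S, y) * comb(F, n - y) for y in range(a, min(n, S) + 1)) / comb(N, n)

print(fisher_one_sided_p(3, 1, 0, 4))   # 0.0714... for the data in Table 1
# scipy.stats.fisher_exact([[3, 1], [0, 4]], alternative='greater') gives the same value.
```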
1 APPLICABILITY AND POWER
A major advantage of Fisher's exact test is that it can be justified solely through the randomization in the experiment. The user need not assume that all patients in each group have the same recovery probability, nor that patient recoveries occur independently. The patients could, for example, go to one of four clinics, with two patients per clinic. Two patients attending the same clinic might well have experienced delayed recovery from having contracted the same subsidiary infection in their clinic, but the above argument would still be valid as long as individuals were randomly selected without restriction
from the group of eight for assignment to the treatment vs. control groups. If, however, the randomization was applied at the clinic level, with two of the four clinics selected for assignment to the treatment group, then the test would be invalid. Compared with the hypergeometric distribution, the data would most likely be overdispersed. Similarly, if the randomization was restricted by blocking with respect to clinic, the pair of individuals from each clinic being randomly split between the treatment and control groups, then the test would again be invalid. These alternative designs certainly have their place, particularly in larger experiments with more subjects, but the results would have to be analyzed with another test. The example also illustrates a major weakness of Fisher's exact test. The evidence for a table based on only four subjects in each of two groups could hardly have been more favorable to the alternative. Yet the P value still exceeds 5%, and most observers would rate the evidence as not statistically significant. It is in general difficult to obtain a statistically significant P value with Fisher's exact test, and the test therefore has low power. The most effective way to increase the power may well be to take quantitative measurements. Suppose, for instance, that all four patients who received the treatment showed reduced nasal shedding of the virus. By quantifying this evidence, and subjecting the quantitative measurements to a test of significance, the experimenter could, in many instances, generate a more powerful test. One could also, of course, consider running the study on a larger group of patients.

2 COMPETING BINOMIAL-MODEL TEST

It is also possible to obtain greater power by analyzing the above table with another statistical model. The most commonly used competitor involves assuming that the numbers of recovered patients in each group are independently binomially distributed. The test was mentioned by Irwin (14), and promoted by Barnard (2). Although he soon withdrew his support (3), it has since become a popular alternative. Its increased power has been
FISHER’S EXACT TEST
amply demonstrated by D'Agostino et al. (7) and others. For the above table, the P value is 0.035 vs. the 0.071 for Fisher's exact test. The P value based on this binomial model is typically smaller than the one generated by Fisher's exact test. The main reason for the difference is that the standard definition of the P value contains the probability of the observed table, and this probability is higher for Fisher's exact test than for the binomial model (1,10,18). Thus the null hypothesis is more frequently rejected, and the binomial-model test is more powerful. This test is available in StatXact (6). However, the increased power comes at a cost. To justify the binomial model, one must either assume that all patients within each group have the same recovery probability, or envisage that the patients were randomly sampled from some larger group. The trial must also have been conducted so as to ensure that patient outcomes occur independently. They cannot, for example, attend four clinics, with two patients per clinic. There is another, more subtle problem with the binomial model. Simple calculations show that had fewer than three or more than five patients recovered, then neither P value could possibly have been significant. This puts the researcher in an awkward quandary. For example, had only two patients recovered after 5 days, the researcher would have had an incentive either to present the results after more than five days of treatment when at least one more patient had recovered, or to incorporate more patients into the experiment. One does not win accolades for announcing the results of experiments that are not only statistically insignificant, but also apparently barely capable of ever producing significant results. These are important complications when it comes to interpreting the results of these sorts of small experiments. Suppose, for example, that in the above experiment the researcher was to have adjusted the five-day reporting time, if necessary, so as to guarantee between three and five recoveries. Then the binomial P value would be invalid. The probability of obtaining a table at least as favorable to the treatment as the above one can be shown to be 0.056, not 0.035, as generated by the standard binomial model.
3 THE MID-P VALUE
The P value of 0.071 generated by Fisher's exact test is still large compared with the 0.056 figure produced by this modified binomial model. There is yet another alternative with important theoretical and practical advantages (see, for example, (16), (4), (1), and (19)). This is the mid-P value, first introduced in 1949 by Lancaster (15). In place of the standard definition,

P value = Pr(evidence at least as favorable to Ha as observed | H0),

they propose the alternative,

mid-P value = Pr(evidence more favorable to Ha than observed | H0) + ½ Pr(evidence equally favorable to Ha as observed | H0).

Table 3 summarizes the possible P values for the above example. This table illustrates that the mid-P has the potential to provide a smaller, more significant-looking P value, and to reduce the discrepancy between P values generated by competing models. However, by using a smaller P value, one may reject a valid null hypothesis too frequently. Fortunately, amongst other desirable attributes of the mid-P, its routine use does indeed control a quantity closely related to the type I error rate (see (19), and references therein). The computer package StatXact (6) facilitates the calculation of the mid-P by providing the probability of the observed table along with the standard P value.
Table 3. Comparison of P Values for the Data in Table 1

                        Fisher's       Binomial    Modified
                        exact test     model       binomial model
Standard P value        7.1%           3.5%        5.6%
Mid-P value             3.6%           2.0%        3.0%
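The Fisher's exact test column of Table 3 can be reproduced from the hypergeometric distribution, for example with SciPy; the following added sketch computes both the standard and mid-P values for the data in Table 1.

```python
from scipy.stats import hypergeom

# Table 1: a = 3 recoveries in the treatment group; margins S = 3, n = 4, N = 8
N, S, n, a = 8, 3, 4, 3
dist = hypergeom(N, S, n)                # SciPy order: population, successes, draws

standard_p = dist.sf(a - 1)              # Pr(X >= a) = 0.0714 (7.1% in Table 3)
mid_p = dist.sf(a) + 0.5 * dist.pmf(a)   # Pr(X > a) + 0.5 Pr(X = a) = 0.0357 (3.6%)
print(round(standard_p, 4), round(mid_p, 4))
```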
4 CONCLUSION
Fisher’s exact test provides a widely applicable way to assess the results of simple randomized experiments leading to 2 × 2 contingency tables. But it has low power, especially when the standard P value is used. The power can be increased considerably through (i) using the mid-P value, or (ii) carefully constructing a test based at least in part on a binomial model. Further power increases can be generated through (iii) taking quantitative measurements on each subject, or (iv) running the trial with a larger number of patients. REFERENCES 1. Agresti, A. (1990). Categorical Data Analysis. Wiley–Interscience, New York. 2. Barnard, G. A. (1945). A new test for 2 × 2 tables, Nature 156, 177. 3. Barnard, G. A. (1949). Statistical inference, Journal of the Royal Statistical Society, Series B 11, 115–139. 4. Barnard, G. A. (1989). On alleged gains in power from lower p-values, Statistics in Medicine 8, 1469–1477. 5. BMDP Statistical Software, Inc. (1990). BMDP Statistical Software Manual: To Accompany the 1990 Software Release. University of California Press, Berkeley. 6. Cytel Software Corporation (1995). StatXact-3 for Windows. Cytel Software Corporation, Cambridge, Mass. 7. D’Agostino, R. B., Chase, W. & Belanger, A. (1988). The appropriateness of some common procedures for testing the equality of two independent binomial populations, American Statistician 42, 198–202. 8. Fisher, R. A. (1935). The logic of inductive inference, Journal of the Royal Statistical Society, Series A 98, 39–84. 9. Fisher, R. A. (1935). The Design of Experiments. Oliver & Boyd, Edinburgh. 10. Franck, W. E. (1986). P-values for discrete test statistics, Biometrical Journal 4, 403–406. 11. Freeman, G. H. & Halton, J. H. (1951). Note on an exact treatment of contingency, goodness of fit and other problems of significance, Biometrika 38, 141–149. 12. Gibbons, J. D. & Pratt, J. W. (1975). Pvalues: interpretation and methodology, American Statistician 29, 20–25.
13. Hall, C. B., McBride, J. T., Gala, C. L., Hildreth, S. W. & Schnabel, K. C. (1985). Ribavirin treatment of respiratory syncytial viral infection in infants with underlying cardiopulmonary disease, Journal of the American Medical Association 254, 3047–3051. 14. Irwin, J. O. (1935). Tests of significance for differences between percentages based on small numbers, Metron 12, 83–94. 15. Lancaster, H. O. (1949). The combination of probabilities arising from data in discrete distributions, Biometrika 36, 370–382. 16. Lancaster, H. O. (1961). Significance tests in discrete distributions, Journal of the American Statistical Association 56, 223–234. 17. MathSoft, Inc. (1993). S-PLUS Reference Manual, Version 3.2. MathSoft, Inc., Seattle. 18. Routledge, R. D. (1992). Resolving the conflict over Fisher’s exact test, Canadian Journal of Statistics 20, 201–209. 19. Routledge, R. D. (1994). Practicing safe statistics with the mid-p, Canadian Journal of Statistics 22, 103–110. 20. SAS Institute, Inc. (1989). SAS/STAT User’s Guide, Version 6, 4th Ed., Vol. 1. SAS Institute Inc., Cary. 21. SAS Institute, Inc. (1995). JMP Statistics and Graphics Guide, Version 3.1. SAS Institute Inc., Cary. 22. Savitz, D. A. & Olshan, A. F. (1995). Multiple comparisons and related issues in the interpretation of epidemiological data, American Journal of Epidemiology 142, 904–908. 23. SPSS, Inc. (1991). SPSS Statistical Algorithms, 2nd Ed. SPSS Inc., Chicago. 24. SYSTAT, Inc. (1992). SYSTAT for Windows: Statistics, Version 5. SYSTAT, Inc., Evanston. 25. Yates, F. (1934). Contingency tables involving small numbers and the χ 2 test, Journal of the Royal Statistical Society, Supplement 1, 217–235.
FLEXIBLE DESIGNS
MARTIN POSCH, PETER BAUER, and WERNER BRANNATH
Medical University of Vienna, Vienna, Austria

1 INTRODUCTION
Classical frequentist statistical inference is based on the assumption that the inferential procedure is completely specified in advance. Consequently, the consensus guideline International Conference on Harmonization (ICH) E9 on Statistical Principles for Clinical Trials (1) requires for confirmatory trials that the hypotheses and the statistical analysis plan are laid down in advance. For the planning of optimal trial designs, knowledge of quantities such as the expected efficacy of a new treatment, the safety properties, the appropriate doses or application forms of a treatment, the success rate in the control group, and the variability of the outcome parameters is essential. Typically, in the planning phase of a clinical trial, many of these quantities are unknown. However, relevant information may accumulate in the course of the trial. Based on this information, changes in the trial design can become desirable or even inevitable. The necessity to allow for cautious adaptations is also recognized in the ICH E9 document: ''If it becomes necessary to make changes to the trial, any consequent changes to the statistical procedures should be specified in an amendment to the protocol at the earliest opportunity, especially discussing the impact on any analysis and inferences that such changes may cause. The procedure selected should always ensure that the overall probability of type I error is controlled'' (1).
Statistical inference based on adaptive designs allows the implementation of design adaptations without inflating the type I error. The crucial point is that the adaptations may be based on the unblinded data collected so far as well as on external information, and the adaptation rules need not be specified in advance. Different ways have been used to define flexible multi-stage designs with adaptive interim analyses (2–5). In order to control the overall type I error probability, they all adhere to a common invariance principle: Separate standardized test statistics are calculated from the samples at the different stages and aggregated in a predefined way into test statistics that are used for the test decisions. Under the null hypothesis, the distributions of these separate test statistics are known; for example, stage-wise P-values follow stochastically independent uniform distributions on [0, 1], or stage-wise Z-scores follow independent standard normal distributions. Assume that, given no design modifications are permitted, the test procedure applied to the aggregated test statistics controls the level α. Then, every design modification that preserves the distributional properties of the separate stage-wise test statistics does not inflate the level α of the test procedure (6). The method of Proschan and Hunsberger (7), based on the definition of the conditional error function, can also be defined in terms of a test combining standardized stage-wise test statistics according to a prefixed rule (8–10). The self-designing method of Fisher (11, 12) for two stages also fits into this concept (8). However, it does not allow for an early rejection of the null hypothesis in the interim analysis. The method allows for multiple stages with design modifications (but no test decisions in the interim analyses), and it is up to the experimenter to decide if the trial is completed with a test decision after the next stage. Some additional flexibility exists because the experimenter can choose how future stages will be aggregated into the final test statistic. The weight of the last stage, however, is determined by the weights of the previous stages. A similar approach that also allows for rejection of the null hypothesis in the interim analyses has been proposed by Hartung and Knapp (13) and is based on the sum of χ²-distributed test statistics. Müller and Schäfer (14) use the notion of the conditional error function to extend the flexibility to the adaptive choice of the number
of interim analyses. The invariance principle behind their approach can be described in simple terms: After every stage, the remainder of the design can be replaced by a design that, given what has been observed up to now, would not result in a higher conditional type I error probability than the preplanned design. Or, in other words, design modifications at any time that preserve the conditional error probability of the original design do not compromise the overall type I error probability. This principle can also be defined concisely in terms of the recursive application of simple two-stage combination tests, which allows the construction of an overall P-value and of confidence intervals (15). By this generalization, an experimenter may decide in an interim analysis to insert a further interim analysis to save time (e.g., if there is a good chance of an early rejection, or if information from outside calls for quick decisions). Adaptive designs can also be applied to time-to-event data (16) when information from the first-stage sample is also used in the second-stage test statistics (e.g., exploiting the independent increment structure of the log-rank statistic). However, restrictions exist on the type of first-stage information that may be utilized for adaptations (17). Up to now, sample size reassessment (4, 5, 7, 10, 18–23) has been the issue of major interest. However, various other design modifications, such as changing doses (24), dropping treatment arms (25–27), redesigning multiple endpoints (28, 29), changing the test statistics (30–32), and selecting goals between non-inferiority and superiority (33, 34), have been considered.
2 THE GENERAL FRAMEWORK

To start fairly generally, let us assume that a one-sided null hypothesis H0 is planned to be tested in a two-stage design. The test decisions are performed by using the P-values p1 and p2 calculated from the samples at the separate stages. Early decision boundaries are defined for p1: If p1 ≤ α1 (where α1 < α), one stops after the interim analysis with an early rejection; if p1 > α0 (where α0 > α), one stops with an acceptance of H0 (stopping for futility). In case of proceeding to the second stage, the decision in the final analysis is based on a suitably defined combination function C(p1, p2), which is assumed to be left continuous, monotonically increasing in both arguments, and strictly increasing in at least one: If C(p1, p2) ≤ c, one rejects the hypothesis; otherwise one accepts. Note that for α0 = 1, no stopping for futility is applied. If, in addition, α1 = 0, no early test decision is taken and the interim analysis is performed only for adaptation purposes. The adaptive testing procedure is summarized in Fig. 1. If, under H0, the P-values are independently and uniformly distributed on [0, 1], then the level condition to determine c, α1, and α0 can be written as

α1 + ∫_{α1}^{α0} ∫_0^1 1[C(x, y) ≤ c] dy dx = α    (1)

Here, the indicator function 1[·] equals 1 if C(x, y) ≤ c and 0 otherwise.

Figure 1. The adaptive testing procedure.

One of the combination functions considered in the literature plays a special role because of its relationship to group sequential tests (see Group Sequential Designs). Assume that the one-sided null hypothesis H0: µA = µB versus H1: µA > µB is tested for comparing the means of two normal distributions with known variance (w.l.o.g. σ² = 1).
The weighted inverse normal combination function (4, 5) can be defined as

C(p1, p2) = 1 − Φ[w1 zp1 + w2 zp2],  0 < wi < 1,  w1² + w2² = 1    (2)

where zγ is the (1 − γ) quantile and Φ the cumulative distribution function of the standard normal distribution. For the group sequential test of the normal mean with samples balanced over treatments, n1 = n1A = n1B and n2 = n2A = n2B at the two stages, and w1 = √(n1/(n1 + n2)), w2 = √(n2/(n1 + n2)). The term in the squared brackets of Equation (2) is simply the standardized difference of the treatment means calculated from the total sample, Z = Z1 √(n1/(n1 + n2)) + Z2 √(n2/(n1 + n2)). Here, Z1 and Z2 are the standardized mean treatment differences calculated from the separate stages. Note that C(p1, p2) as defined in Equation (2) is just the P-value for the test based on the total sample. Hence, if no adaptation is performed, the test decision is the same as in the classical group sequential test with an early rejection boundary zα1. Given zα1 and an early stopping-for-futility boundary zα0, the critical boundary zα2 for the test statistic Z in the final analysis can be derived from Equation (1), which then is equivalent to the common level condition

Prob(Z1 ≥ zα1) + Prob(zα0 ≤ Z1 < zα1, Z ≥ zα2) = α    (3)

for the corresponding group sequential test with obligatory stopping for futility. It is obvious how this analogy works for more than two stages. The conditional error function A(p1) = Prob(reject H0 | p1) in the group sequential context leads to the so-called linear conditional error function (7)

A(z1) = 0                                             if z1 < zα0
A(z1) = 1 − Φ[(zα2 √(n1 + n2) − z1 √n1)/√n2]          if zα0 ≤ z1 ≤ zα1    (4)
A(z1) = 1                                             if z1 > zα1
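To make the decision rule concrete, the following is a minimal sketch (not part of the original article; the function names and the way the boundaries are passed in are illustrative assumptions) of the weighted inverse normal combination test in Equation (2) and the linear conditional error function in Equation (4). The boundaries zα1, zα0, and zα2 are assumed to be supplied, for example, from a standard group sequential design.

```python
# Minimal sketch of the weighted inverse normal combination test (Eq. 2)
# and the linear conditional error function (Eq. 4). The decision
# boundaries z_a1 (early rejection), z_a0 (futility), and z_a2 (final)
# are assumed to be given, e.g., from a group sequential design.
from math import sqrt
from scipy.stats import norm

def inverse_normal_test(p1, p2, n1, n2, z_a1, z_a0, z_a2):
    """Two-stage combination test with preassigned weights sqrt(n_i/(n1+n2))."""
    z1 = norm.ppf(1 - p1)              # first-stage Z-score from the P-value
    if z1 >= z_a1:
        return "reject at interim"
    if z1 < z_a0:
        return "stop for futility"
    w1, w2 = sqrt(n1 / (n1 + n2)), sqrt(n2 / (n1 + n2))
    z2 = norm.ppf(1 - p2)              # second-stage (possibly adapted) Z-score
    z_combined = w1 * z1 + w2 * z2     # term in brackets of Eq. (2)
    return "reject" if z_combined >= z_a2 else "accept"

def conditional_error(z1, n1, n2, z_a1, z_a0, z_a2):
    """Linear conditional error function A(z1) of Eq. (4)."""
    if z1 < z_a0:
        return 0.0
    if z1 > z_a1:
        return 1.0
    return 1 - norm.cdf((z_a2 * sqrt(n1 + n2) - z1 * sqrt(n1)) / sqrt(n2))
```

Because the second stage is formally tested at the level A(z1), standard fixed-sample formulas can be reused for choosing ñ2, as described next.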
In the following, ñ2 denotes the adapted second-stage sample size, which may be different from the n2 planned a priori. Let further Z̃2 denote the standardized mean of the actual second-stage sample and Z̃ = Z1 √(n1/(n1 + n2)) + Z̃2 √(n2/(n1 + n2)) the adaptive test statistic based on the preassigned weights to combine the two stages. Now, setting wi = √(ni/(n1 + n2)), i = 1, 2, Z̃ ≥ zα2 is equivalent to C(p1, p2) ≤ α2, where C(p1, p2) is defined in Equation (2) and p1, p2 are the P-values of the first and the possibly adapted second stage. Note that Z̃ ≥ zα2 is also equivalent to Z̃2 ≥ zA(z1), so that formally the test in the second-stage sample is performed at the level A(z1). Hence, when proceeding to the second stage, the choice of ñ2 can simply be based on sample size formulas for the fixed sample size case using the adjusted level A(z1). Some comments have to be made here. (1) The crucial property of these flexible designs is that the adaptation rule need not be specified a priori. (2) An alternative approach is to start with an a priori specified sample size reassessment rule ñ2(p1) and to use the classical test statistic for the final test (19, 20). To control the type I error, either an adjusted critical boundary or constraints on the sample sizes have to be applied. As this procedure always weights the stage-wise test statistics according to the actual sample sizes, it can be expected to be more efficient than when fixed weights are used. Unless extreme sample sizes are used, however, this difference is likely to be small (compare Reference 4). Note that with a predefined sample size reassessment rule, one can also define a combination test that uses the classical test statistic as long as one does not deviate from this prespecified rule. The corresponding combination function is identical to Equation (2), but in the weights wi the preplanned second-stage sample size n2 is replaced by ñ2(p1). Using this combination function, one can also deviate from the prespecified sample size reassessment rule; however, a deviation from this rule implies that the classical test statistic is no longer used. Hence, designs with prespecified mandatory adaptation rules can be looked at as a special case of a flexible design. (3) Clearly, combination functions can also be used for tests in a distributional environment completely different from the normal.
As long as the stage-wise P-values are independent and uniformly distributed under the global null hypothesis, the methodology will also apply. By transforming the resulting P-values via the inverse normal combination method, one formally arrives at independent standard normal increments so that all the results known for group sequential trials under the normal distribution can also be utilized for completely different testing problems (4). (4) The assumptions can even be relaxed, only requiring that the stage-wise P-values follow a distribution that, given the results of the previous stages, is stochastically larger than or equal to the uniform distribution (15). A very general formulation of adaptive designs is given in Reference 35. (5) Classical group sequential tests (see Group Sequential Designs) are a special case of the more general concept of adaptive combination tests, because they result from a special combination function for the aggregation of the stage-wise test statistics. (6) Hence, classical group sequential trials can be planned in the context of adaptive multistage designs. Moreover, if the trial has in fact been performed according to the preplanned schedule and no adaptations have been performed, no price at all has to be paid for the option to deviate from the preplanned design: then the classical test statistic and rejection boundaries can be used. However, if adaptations are performed, then because of the fixed weights the classical test statistic is no longer used in the final analysis. (7) Estimation faces the problems of sequential designs and the lack of a complete specification of the future sample space because of the flexibility. Several proposals for the construction of point estimates and confidence intervals have been made (4, 15, 34, 36–40). The crucial question is how to use the wide field of flexibility opened by this general concept in practice. Note that, in principle, at every interim analysis, a ''new'' trial at a significance level equal to the conditional error probability can be planned. This conditional error probability accounts for the past and assures that the overall type I error probability for the future is always controlled. Furthermore, by adopting the concept of (formally) performing interim looks without early test decisions after every sample unit,
this concept of the conditional error function can also be applied for mid-trial design modifications in trials without any preplanned interim analyses (41). Clearly, because of the large number and diversity of possible adaptations, their merits, for example, for the practice of clinical trials, are difficult to establish. In the following, some types of adaptations are discussed.

3 CONDITIONAL POWER AND SAMPLE SIZE REASSESSMENT

The conventional measure to assess the performance of tests in a particular design is the overall power: In the long run, the experimenter performing equally powered studies gets rejections with probability 1 − β under the alternative. In case of a sequential design with an early decision, however, unblinded information on the observed effect is available. So when proceeding to the next stage, one is tempted to have a look at the chances of a rejection in the end, given the results up to now. The overall power takes the expectation over all possible outcomes of the interim analysis. Now, when being halfway through the trial and having preliminary estimates, should the experimenter in a concrete situation average his perspectives over outcomes that he definitely knows have not been observed in the current trial? Or should he argue based on what he already knows? The conditional power (given Z1) for the comparison of two normal means by the weighted inverse normal combination function in Equation (2) is given by (σ² = 1)

CP(z1) = 1 − Φ[(zα2 √(n1 + n2) − z1 √n1)/√n2 − √ñ2 Δ/√2]    (5)

where Δ = µA − µB is the relevant difference on which the power is targeted (7, 10, 18, 21–23, 42). In the interim analysis, the second-stage sample size ñ2 needed to achieve a rejection probability of 1 − β in the final analysis is determined by solving the equation CP(z1) = 1 − β for ñ2. This raises the question of which value should be plugged in for the targeted effect Δ.
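For illustration, solving CP(z1) = 1 − β in Equation (5) for ñ2 gives a closed-form reassessment rule. The sketch below assumes the balanced two-sample setting with σ = 1 and per-group stage sizes; the optional cap n2_max is a hypothetical restriction in the spirit of the restricted conditional power rule discussed below, and the function name and example values are not from the article.

```python
# Sketch: conditional-power-based reassessment of the second-stage
# sample size (per group), solving CP(z1) = 1 - beta in Eq. (5) for n2_new.
# Assumes the balanced two-sample setting of the text with sigma = 1.
from math import sqrt, ceil
from scipy.stats import norm

def reassess_n2(z1, n1, n2, z_a2, delta, beta, n2_max=None):
    # Level at which the second stage is formally tested: A(z1) = 1 - Phi(t)
    t = (z_a2 * sqrt(n1 + n2) - z1 * sqrt(n1)) / sqrt(n2)
    z_beta = norm.ppf(1 - beta)
    n2_new = ceil(2 * ((t + z_beta) / delta) ** 2)   # from CP(z1) = 1 - beta
    if n2_max is not None:                           # optional upper limit
        n2_new = min(n2_new, n2_max)
    return n2_new

# Example: interim Z-score 1.0, 50 patients per group and stage planned,
# final boundary 2.0, targeted difference 0.4, power 80%, cap at 200.
print(reassess_n2(z1=1.0, n1=50, n2=50, z_a2=2.0, delta=0.4, beta=0.2, n2_max=200))
```

Because the final test uses the preassigned weights, the experimenter remains free to deviate from whatever rule produced this number.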
3.1 Using the Estimated Effect

It has been proposed to calculate the conditional power by replacing Δ by its first-stage estimate Δ̂ (12). Figure 2 shows this ''predictive'' conditional power for two group sequential tests (here, for comparability, α = 0.025, one-sided, 1 − β = 0.8 is chosen) with equal sample sizes at the two stages balanced over treatments, as a function of z1. Clearly, for small observed effects, the predictive conditional power does not promise a good chance to reject at the end; for large effects only slightly smaller than the rejection boundary zα1, however, the predictive conditional power exceeds 0.8, particularly in the O'Brien-Fleming (43) design.

Figure 2. The conditional power (bold line) and the predictive conditional power (thin line) as a function of z1 for a group sequential design with balanced sample sizes over treatments and stages, overall power 80%, and α = 0.025. The dotted line denotes the conditional error function.

The predictive conditional power is a random variable, and Fig. 3 shows its distribution function given that the trial proceeds to the second stage. Under the alternative, the conditional probability to end up with a predictive conditional power below 0.5 (given the event that the trial continues) is 0.54 for the Pocock design (44) and 0.35 for the O'Brien-Fleming design. Under the null hypothesis, the predictive conditional power will remain below 0.2 in more than 80% of the cases for both designs, which explains the finding that sample size reassessment based on the ''predictive'' conditional power, that is, using the estimate of the effect size, will in general lead to large expected sample sizes.

Figure 3. The conditional distribution functions (conditional on proceeding to the second stage) of the conditional power (bold lines) and the predictive conditional power (thin lines) for the null (dashed lines) and the alternative (continuous lines) hypothesis.

Jennison and Turnbull (10) have shown this when applying a sample size reassessment rule used by Cui et al. (5) allowing the reassessed sample size to become as large as 25 times the preplanned one. They apply a very large zα1 (>4), and the trial may go on even when a negative effect has been observed. They suggest instead using group sequential trials that are overpowered for the (optimistic) targeted treatment difference but still have reasonable power for a (pessimistic) smaller margin. The expected sample size for the larger treatment difference (although not being optimal for usual power values) then is still sufficiently small. Other arguments have also been brought up against the use of the mid-trial estimate of the effect size for sample size recalculation. In such procedures, the power does not sufficiently increase with the effect size, resulting in flat power curves (10). Moreover, when used in comparisons with placebo, they may aim at small and irrelevant effect sizes. The relevant effect size should follow from medical considerations accounting also for risks and costs; this is true in theory, but often not enough knowledge of all issues concerned exists in advance. Some of them may become clearer from the cumulating observations (e.g., when aggregating sufficient data on safety). In comparisons between active treatments, it is even more difficult to specify the relevant difference because any improvement of the standard therapy may be of interest (e.g., if safer therapies evolve). The current discussion around the choice of equivalence margins in large equivalence trials is a
good example for this changed environment. More emphasis is put on precise confidence intervals of the treatment differences in order to be able to position the treatment within the set of comparators, which, quite naturally, leads to the consideration of families of null hypotheses with diverging equivalence margins (45) and complicates the choice of an a priori sample size (33, 34).

3.2 Staying with the a Priori Defined Effect Size

One alternative is to base the conditional power on the a priori fixed relevant treatment difference. Figure 2 gives the conditional power for this strategy depending on z1 for two different group sequential tests. When proceeding to the second stage under the alternative in Pocock's design, only for very large z1 will the conditional power be above 0.8. In the O'Brien-Fleming design, the conditional power is higher because the final critical boundary zα2 is smaller. Figure 3 gives the distribution function of the conditional power given that one proceeds to the second stage. Under the alternative, this conditional power will stay below 0.8 in 74.6% of the cases for Pocock's design and in only 46.5% for the O'Brien-Fleming design. On the other hand, under the null hypothesis, conditional power values above 0.5 would be found in 14.8% and 20.8% of the cases, respectively. Denne (21) looked at properties of the resulting test in terms of power and expected sample size when sample size reassessment is
based on conditional power. He shows that, compared with the group sequential design, the power is slightly higher when the effect is smaller than expected, however at the price of a serious inflation of the expected sample size. Posch et al. (46) considered designs with sample size reassessment according to a restricted conditional power rule, where an upper limit on the second-stage sample size is applied. Such a design leads to considerable savings in expected sample size, compared with a two-stage group sequential design, in an important region of the alternative hypothesis. These savings come for a small price to be paid in terms of expected sample size close to the null hypothesis and in terms of maximal sample size. Brannath and Bauer (42) derived the optimal combination test in terms of average sample size when the conditional power rule is applied. Admittedly, for multistage designs, the arguments become much more complicated, so that it is difficult to quantify the impact of sample size reassessment rules.

3.3 Overpowered Group Sequential Versus Adaptive Trials

As mentioned above, cautious sample size reassessment rules based on conditional power have good properties in terms of expected sample size. Additionally, sample size reassessment in adaptive designs does not need to be performed according to strict predefined data-based rules. Information may develop from other sources (e.g.,
safety considerations not having been incorporated in a fixed rule, or information from outside the trial) that could strongly favor a change of the preplanned sample size. For example, the demand for a larger sample size may come up to achieve an appropriate judgement of the risk-benefit relationship, or the treatment allocation rule may be changed to get more information on patients under a particular treatment. To adhere to the design and start another trial may then not be considered a reasonable strategy for ethical and economic reasons. Clearly, if the cautious approach of overpowered group sequential designs is to be used more extensively in practice, then the stopping-for-futility option will have to be considered thoroughly and carefully in order to avoid radically large sample sizes under the null hypothesis. The very large maximal sample sizes to be laid down in the planning phase may be a further obstacle to the implementation of such designs. In practice, experimenters may rather tend to give up with a negative outcome if large increments in sample size are required to reach the next decision (and the chance of getting a positive decision is small). It is questionable whether overall power rather than conditional power arguments will prevent such unscheduled stopping-for-futility decisions based on ongoing results. But then the perspectives of the overpowered group sequential designs may not apply in real-life scenarios.

3.4 Inconsistency of Rejection Regions

At this point, the question about the price to be paid for the option of adaptations has to be answered. As mentioned in Section 2, no price has to be paid if the trial is performed according to the preplanned group sequential design (if the appropriate inverse normal combination function is applied), which may seem surprising, like a free lunch. However, the price to be paid may be the potential of being misled by the observed data into modifying a design that, in fact, may be optimal for the true state of nature. Additionally, in case of an adaptation in terms of sample size reassessment, the decisions are based on statistics deviating from the minimal sufficient statistics.
Note that the way the data from the different stages are combined has to be laid down before the stages are in fact performed (e.g., the weights when using the inverse normal method), either from the very beginning or recursively during the trial (14, 15). Denne (21) discussed the case where the adaptive test rejects but the group sequential two-stage test with weights corresponding to the actual sample sizes would not reject. He suggests avoiding this type of inconsistency by rejecting in the adaptive test only if this group sequential test rejects too. He shows that, for sample size reassessment based on conditional power, this additional condition has practically no impact on the performance of the adaptive test procedure. A more fundamental inconsistency occurs if the adaptive test rejects but the fixed sample size test would fail to reject [i.e., Z̃ ≥ zα2, but Z1 √(n1/(n1 + ñ2)) + Z̃2 √(ñ2/(n1 + ñ2)) < zα, where zα2 ≥ zα]. To get a complete picture, consider all possible constellations for which such inconsistencies occur with positive probability (i.e., there exist z̃2 values that lead to an inconsistency). It turns out that these constellations can be characterized in terms of the sample ratio r = ñ2/(n1 + ñ2). Thus, if balanced stages were preplanned (n1 = n2), r = 1/2 corresponds to the case of no sample size reassessment (ñ2 = n2 = n1); for ñ2 → ∞, one has r → 1; and if ñ2 = 1, then r = 1/(n1 + 1) (which approaches 0 for increasing n1). Figure 4 gives the regions where such inconsistencies occur with positive probability depending on Z1 when applying Pocock or O'Brien-Fleming boundaries with equal sample sizes per stage. The bad news is that such inconsistencies are possible for all values of Z1. One sees that the area for such inconsistencies is smaller with constant rejection boundaries (Pocock), because the adaptive test has to exceed the larger final decision boundary. Furthermore, if the sample size is increased in case of small and decreased in case of large observed Z1-values, which is a reasonable behavior in practice, no such inconsistencies may ever occur, which is good news. The lines in the figures denote the conditional power rule when performing sample size reassessment halfway through the group sequential designs (α = 0.025, one-sided) with overall power 0.8.
Figure 4. For sample size reassessment [expressed in terms of the ratio r = ñ2/(n1 + ñ2)] in the shaded regions, the adaptive test may, with a positive (but possibly very small) probability, lead to a rejection although the Z-test statistic of the pooled sample falls short of zα. For sample size reassessments in the white region, such inconsistencies never occur. The line corresponds to sample size reassessment according to the conditional power rule.
It can be seen that, following this rule, inconsistencies never occur. A general way to deal with inconsistencies without imposing restrictions on the sample size reassessment rule is to reject in the adaptive design only if the fixed sample size test in the end also rejects at the level α.

4 EXTENDING THE FLEXIBILITY TO THE CHOICE OF THE NUMBER OF STAGES

Interim analyses are not only costly and time consuming, but unblinding may also have an unfavorable impact on the course of the remainder of a trial. Hence, an interim analysis should only be performed if either relevant information for the adaptation of the trial design can be expected or a good chance exists to arrive at an early decision. A priori, it is often difficult to assess the right number of interim analyses. Given the results of, for example, the first interim analysis, one might want to cancel the second if the conditional probability to get an early decision is small. If, on the other hand, a high chance exists to stop the trial early, one might want to add further interim analyses. Also, for external reasons, a further interim analysis might be favorable, for example, if a competitor enters the market such that an early decision would be a competitive advantage.
Assume that in the first interim analysis of a group sequential test with at least three stages, no early stopping condition applies. Then, by a generalization of Equation (4), the conditional error function A(z1) gives the probability (under the null hypothesis) that, given Z1 = z1, the original design rejects at a later stage. Hence, it is the type I error probability that can be spent for later decisions. Thus, one can either spend this type I error probability by performing a single second stage with a final test at level A(z1) or, alternatively, proceed with the group sequential test. The decision of which way to go can be based on all information collected so far (14, 15). Posch et al. (46) investigated the expected sample size of a three-stage design with early rejection of the null hypothesis, where the second interim analysis is dropped when the chance for an early rejection is low. It is shown that such a design has nearly the expected sample size of a three-stage group sequential test. At the same time, it has a lower maximal sample size and saves a considerable number of interim analyses, especially under the null hypothesis. Instead of dropping an interim analysis, as in the above example, one can also add further interim analyses. For example, one can start out with a two-stage design, then
compute the conditional error A(z1) in the interim analysis and plan a further two-stage design with level A(z1). This procedure can be applied recursively (14, 15).

5 SELECTION OF THE TEST STATISTICS
Additionally, adaptive interim analyses give the opportunity to adaptively choose the test statistic used in the second stage, which allows one to select scores or contrasts based on the interim data (30, 31, 47, 48). If it turns out that a covariable can explain a substantial part of the variability of the primary variable, an analysis accounting for this covariable can be specified for the second-stage test.

6 MORE GENERAL ADAPTATIONS AND MULTIPLE HYPOTHESES TESTING

6.1 The Closed Testing Principle and Adaptive Designs

A more general type of adaptation occurs if, in the interim analysis, the null hypothesis to be tested is changed, which is the case, for example, if doses are selected, endpoints are re-weighted, or the study population is adapted. If a hypothesis H0,1 is tested in the first stage and a modified hypothesis H0,2 in the second stage, then the combination test tests only the intersection hypothesis H0,1 ∩ H0,2. The rejection of this intersection null hypothesis implies that H0,1 or H0,2 is false. If, in a dose-response setting, the proof of principle for efficacy of at least one dose is intended, this may be sufficient. Also, when multiple endpoints are considered, the proof of principle for efficacy of at least one of the endpoints may suffice. To make inference on the individual hypotheses, a multiple testing procedure that controls the multiple level (i.e., the probability to erroneously reject one or more null hypotheses) has to be applied. A general principle that guarantees control of the multiple level is the closed testing principle (49).

6.1.1 Closed Testing Procedure. Assume a set I of null hypotheses is to be tested at multiple level α. To reject an individual
hypothesis j ∈ I, for all subsets J ⊂ I that contain j, the intersection hypothesis H0,J = ∩i∈J H0,i (stating that all hypotheses in J are true) has to be rejected at local level α. For example, the Bonferroni and Bonferroni-Holm procedures can be formulated as special cases of this principle. The closed testing procedure can easily be integrated in adaptive designs by defining adaptive tests for all intersection hypotheses (26, 28, 29, 50), which opens a new dimension of flexibility: The hypotheses to be tested can also be adapted; some hypotheses can be dropped, and new hypotheses can be included in the interim analysis.

6.2 Selection and Addition of Hypotheses

The general formulation of the adaptive multiple testing procedure is quite technical (29), so the methodology is demonstrated here with a simple example. Assume that in the first stage two hypotheses H0,A, H0,B are tested (e.g., corresponding to two treatment groups that are compared with placebo). Then, according to the closed testing principle, level α tests for all intersection hypotheses H0,J, J ∈ J = {{A}, {B}, {A, B}} have to be defined. Thus, in the planning phase of the adaptive test, for all hypotheses H0,J, J ∈ J, a level α combination test C(·,·) with decision boundaries α0, α1, cα has to be specified. In the following, assume for simplicity that all hypotheses are tested with the same combination test and set α0 = 1, α1 = 0 such that no early stopping is possible. Now, first-stage tests and P-values p1,J, J ∈ J, for all (intersection) hypotheses have to be defined.

6.2.1 Selecting a Single Hypothesis. Assume that, in the interim analysis, it is decided that only hypothesis H0,A is selected for the second stage. Then, only a second-stage test for H0,A with P-value p2,A is specified. H0,A is rejected in the final analysis at multiple level α if both C(p1,{A,B}, p2,A) ≤ cα (which is a test for the intersection hypothesis H0,{A,B}) and C(p1,A, p2,A) ≤ cα. Note that, in this case, the second-stage test for H0,A is also used as a test for the intersection hypothesis H0,{A,B}.
6.2.2 Selecting Both Hypotheses. If, in the interim analysis, it has been decided to continue with both hypotheses, A and B, the second-stage test for the intersection hypothesis can be based on data for both hypotheses, leading to a P-value p2,{A,B}. Then, the individual hypothesis i ∈ {A, B} can be rejected at multiple level α if C(p1,{A,B}, p2,{A,B}) ≤ cα and C(p1,i, p2,i) ≤ cα.

6.2.3 Adding a Hypothesis. Finally, assume that it is decided in the interim analysis to add a new hypothesis H0,C. For simplicity, it is additionally assumed that the hypotheses H0,A, H0,B have been dropped in the interim analysis. Then H0,C can be rejected at multiple level α if all tests for the intersection hypotheses can be rejected:
Hypothesis                    Test
H0,C ∩ H0,A ∩ H0,B            C(p1,{A,B}, p2,C) ≤ cα
H0,C ∩ H0,A                   C(p1,A, p2,C) ≤ cα
H0,C ∩ H0,B                   C(p1,B, p2,C) ≤ cα
H0,C                          p2,C ≤ α
A simple way to construct P-values for the intersection hypotheses is to use Bonferroni-adjusted P-values. In the example, one can set p1,{A,B} = min[2 min(p1,A, p1,B), 1]. More general tests for intersection hypotheses allow one to give different weights to the individual hypotheses or to specify a hierarchical ordering among them. If several doses are tested, the intersection hypothesis could be tested by testing for a positive trend. Note that the second-stage tests for all (intersection) hypotheses can be chosen adaptively based on the data from the first stage. Clearly, sample size reassessment can be performed in addition to the adaptive choice of the hypotheses carried on to the second stage. The allocation ratio to different treatments could be changed, for example, investigating a larger sample for a test treatment. Also, more efficient tests could be planned for the second stage relying on the interim information.
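The mechanics of this two-hypothesis example can be sketched in a few lines of code. The sketch below is illustrative only (the function and dictionary names are not from the article); it assumes the inverse normal combination with equal weights, Bonferroni-adjusted P-values for the intersection hypothesis at each stage, and no early stopping (α0 = 1, α1 = 0), in which case the combination P-value is uniform under the null hypothesis and cα = α.

```python
# Sketch of the adaptive closed test for two hypotheses A and B
# (both carried to the second stage), using the inverse normal
# combination with equal weights and Bonferroni-adjusted stage-wise
# P-values for the intersection. With alpha0 = 1 and alpha1 = 0 the
# combination P-value is uniform under H0, so c_alpha = alpha.
from math import sqrt
from scipy.stats import norm

def combine(p1, p2, w1=sqrt(0.5), w2=sqrt(0.5)):
    """Inverse normal combination P-value of Eq. (2)."""
    return 1 - norm.cdf(w1 * norm.ppf(1 - p1) + w2 * norm.ppf(1 - p2))

def closed_test(p1, p2, alpha=0.025):
    """p1, p2: dicts of stage-wise P-values for 'A', 'B', and 'AB'."""
    reject_AB = combine(p1["AB"], p2["AB"]) <= alpha   # intersection H0,A ∩ H0,B
    return {
        "A": reject_AB and combine(p1["A"], p2["A"]) <= alpha,
        "B": reject_AB and combine(p1["B"], p2["B"]) <= alpha,
    }

# Bonferroni-adjusted intersection P-value at each stage (hypothetical values).
stage1 = {"A": 0.04, "B": 0.30, "AB": min(2 * min(0.04, 0.30), 1)}
stage2 = {"A": 0.01, "B": 0.50, "AB": min(2 * min(0.01, 0.50), 1)}
print(closed_test(stage1, stage2))   # -> {'A': True, 'B': False}
```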
6.3 Adaptation of Hypotheses in Clinical Trials

6.3.1 Treatment Selection. The selection of treatments or doses allows the integration of several phases of the drug development process into a single trial. Assume that in a first stage several dose groups are tested against placebo. In the interim analysis, one or more doses can be selected for the second stage. The selection process will typically be based on safety as well as efficacy information collected in the first stage, as well as possible external information coming, for example, from other trials or experiments.

6.3.2 Adapting Multiple Endpoints. If multiple endpoints are considered, the adaptive interim analysis allows one to select or even add new endpoints in the second stage. Consequently, endpoints that appear to be highly variable or for which the interim data show no efficacy at all can be dropped in the interim analysis. If a composite endpoint is used that summarizes multiple individual variables, the weights of these individual variables in the composite endpoint can be adapted.

6.3.3 Adapting the Population. Another option is to adapt the study population in the interim analysis, which may be desirable if, for example, the interim data show a strong treatment effect in a subpopulation that was not specified beforehand or safety problems occur in a subpopulation.

7 AN EXAMPLE

The methodology has been exploited in a two-stage design for an international, multicenter, five-armed clinical phase II dose-finding study (51, 52). The objectives for the first stage (433 patients recruited) were to obtain some initial evidence on the primary efficacy variable (infarct size measured by the cumulative release of alpha-HBDH from time 0 to 72 h), to select a subset of doses to carry through to stage two, and to determine the sample size to be applied at stage two. No strict adaptation rules were laid down in the protocol because, in this early phase, the decisions were planned to be taken by utilizing all the
information collected up to the interim analysis from inside and outside the trial. A global proof of principle for the existence of a dose-response relationship was intended by using the product p1p2 of the stage-wise P-values pi as the predefined combination function. The predefined first-stage test to give p1 was a linear trend test among the increasing doses including placebo. Multiplicity-controlled inference on the comparisons of the doses with a control was intended by applying the closed testing principle. The second and third highest doses were used in stage two in a balanced comparison with placebo. Based on a hierarchical testing strategy, the comparison of the highest dose applied in the second stage to placebo was laid down to create the second-stage P-value p2. The second-stage sample size was fixed at 316 per group. The decisions were taken in a two-day meeting by a group of persons: an independent interim analysis group, the principal investigator, safety experts, and others. Finally, the overall product combination test (C(p1, p2) = p1p2) after the recruitment of 959 patients at the second stage failed to prove a dose-response relationship because the promising first-stage result on the second highest dose could not be reproduced at the second stage. Still, it was argued a posteriori that the adaptive design saved time and resources to arrive at this decision as compared with other designs.
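For orientation, the critical value of such a product combination test can be computed numerically. The sketch below is an illustration under the simplifying assumption of no early stopping boundaries (α1 = 0, α0 = 1), in which case the level condition reduces to c(1 − ln c) = α; it is not the actual analysis code of the study, and the stage-wise P-values shown are hypothetical.

```python
# Sketch: critical value c for the product combination test p1 * p2 <= c,
# assuming no early stopping (alpha1 = 0, alpha0 = 1), so the level
# condition for independent uniform P-values is c * (1 - log(c)) = alpha.
from math import log
from scipy.optimize import brentq

def product_test_critical_value(alpha):
    return brentq(lambda c: c * (1 - log(c)) - alpha, 1e-12, alpha)

alpha = 0.025
c = product_test_critical_value(alpha)   # about 0.0038 for alpha = 0.025
p1, p2 = 0.015, 0.40                     # hypothetical stage-wise P-values
print(p1 * p2 <= c)                      # proof of principle? here: False
```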
8 CONCLUSION
The crucial point in the adaptive designs considered here is that the adaptation rule does not need to be fully specified in advance. Hence, information from all sources can be incorporated into the adaptation, and a full modeling of the decision process is not required. Tsiatis and Mehta (53) showed that, given a fixed adaptation rule, any adaptive design can be outperformed in terms of average sample size by a likelihood ratio-based sequential design with the same type I error spending function; this implies that such a design requires an interim look at every sample size where the adaptive design has a positive probability to reject the null hypothesis. Hence, in case of sample size reassessment, one essentially ends up with continuous monitoring. But group sequential designs have
been introduced just to avoid this type of monitoring, which is usually not practical and too costly in clinical trials. Additionally, every group sequential design specifies a combination function and is, thus, a special case of an adaptive design. However, the adaptive design gives the opportunity of extra flexibility. If extensive adaptations are performed in the interim analysis, for example, a reweighting of endpoints or a change of the study population, the transparency of the testing procedure may get lost and the trial can lose persuasiveness. Although many adaptations are possible in the sense that the type I error is controlled, not all of them are feasible, as the interpretability of results may suffer. Another point to consider is how to keep the integrity of the trial by avoiding any negative impact of a leakage of interim results to investigators or other persons involved in the trial. Even the decisions taken in the interim analysis may allow conclusions to be drawn on the interim results: If, for example, the second-stage sample size is increased, it may indicate that a poor interim treatment effect has been observed. Although the availability of such information, in principle, does not harm the validity of the flexible design (the type I error is still controlled), it may cause problems concerning the motivation of investigators or recruitment. When performing adaptations, one has to keep in mind that findings from small first-stage samples (''internal pilot studies'') will in general be highly variable. So, by looking at interim results and performing adaptations, the experimenter may quite frequently be diverted from a good design carefully laid down in the planning phase. Adaptive designs open a wide field of flexibility with regard to mid-trial design modifications. The authors believe that adaptive treatment selection may be considered the main advantage of adaptive designs, one that can hardly be achieved by other methods. However, sample size reassessment has attracted most of the attention up to now. Clearly, the acceptance of the methodology will be higher if the scope of the adaptations to be performed is anticipated in the protocol, which, as in the example above, does not
mean that the adaptation rule needs to be prespecified in any detail.
REFERENCES

1. European Agency for the Evaluation of Medical Products, ICH Topic E9: Notes for Guidance on Statistical Principles for Clinical Trials, 1998.
2. P. Bauer, Sequential tests of hypotheses in consecutive trials. Biometr. J. 1989; 31: 663–676.
3. P. Bauer and K. Köhne, Evaluation of experiments with adaptive interim analyses. Biometrics 1994; 50: 1029–1041.
4. W. Lehmacher and G. Wassmer, Adaptive sample size calculations in group sequential trials. Biometrics 1999; 55: 1286–1290.
5. L. Cui, H. M. J. Hung, and S. Wang, Modification of sample size in group sequential clinical trials. Biometrics 1999; 55: 321–324.
6. P. Bauer, W. Brannath, and M. Posch, Flexible two stage designs: an overview. Meth. Inform. Med. 2001; 40: 117–121.
7. M. A. Proschan and S. A. Hunsberger, Designed extension of studies based on conditional power. Biometrics 1995; 51: 1315–1324.
8. M. Posch and P. Bauer, Adaptive two stage designs and the conditional error function. Biometr. J. 1999; 41: 689–696.
9. G. Wassmer, Statistische Testverfahren für gruppensequentielle und adaptive Pläne in klinischen Studien. Germany: Verlag Alexander Mönch, 1999.
10. C. Jennison and B. Turnbull, Mid-course sample size modification in clinical trials based on the observed treatment effect. Stat. Med. 2003; 22: 971–993.
11. L. D. Fisher, Self-designing clinical trials. Stat. Med. 1998; 17: 1551–1562.
12. Y. Shen and L. Fisher, Statistical inference for self-designing clinical trials with a one-sided hypothesis. Biometrics 1999; 55: 190–197.
13. J. Hartung and G. Knapp, A new class of completely self-designing clinical trials. Biometr. J. 2003; 45: 3–19.
14. H.-H. Müller and H. Schäfer, Adaptive group sequential designs for clinical trials: combining the advantages of adaptive and of classical group sequential approaches. Biometrics 2001; 57: 886–891.
15. W. Brannath, M. Posch, and P. Bauer, Recursive combination tests. J. Amer. Stat. Assoc. 2002; 97: 236–244.
16. H. Schäfer and H.-H. Müller, Modification of the sample size and the schedule of interim analyses in survival trials based on data inspections. Stat. Med. 2001; 20: 3741–3751.
17. P. Bauer and M. Posch, A letter to the editor. Stat. Med. 2004; 23: 1333–1334.
18. M. Posch and P. Bauer, Interim analysis and sample size reassessment. Biometrics 2000; 56: 1170–1176.
19. Z. Shun, Sample size reestimation in clinical trials. Drug Inform. J. 2001; 35: 1409–1422.
20. Z. Shun, W. Yuan, W. E. Brady, and H. Hsu, Type I error in sample size re-estimations based on observed treatment difference. Stat. Med. 2001; 20: 497–513.
21. J. S. Denne, Sample size recalculation using conditional power. Stat. Med. 2001; 20: 2645–2660.
22. A. L. Gould, Sample size re-estimation: recent developments and practical considerations. Stat. Med. 2001; 20: 2625–2643.
23. T. Friede and M. Kieser, A comparison of methods for adaptive sample size adjustment. Stat. Med. 2001; 20: 3861–3874.
24. P. Bauer and J. Röhmel, An adaptive method for establishing a dose response relationship. Stat. Med. 1995; 14: 1595–1607.
25. M. Bauer, P. Bauer, and M. Budde, A simulation program for adaptive two stage designs. Comput. Stat. Data Anal. 1998; 26: 351–371.
26. P. Bauer and M. Kieser, Combining different phases in the development of medical treatments within a single trial. Stat. Med. 1999; 18: 1833–1848.
27. T. Friede, F. Miller, W. Bischoff, and M. Kieser, A note on change point estimation in dose-response trials. Comput. Stat. Data Anal. 2001; 37: 219–232.
28. M. Kieser, P. Bauer, and W. Lehmacher, Inference on multiple endpoints in clinical trials with adaptive interim analyses. Biometr. J. 1999; 41: 261–277.
29. G. Hommel and S. Kropf, Clinical trials with an adaptive choice of hypotheses. Drug Inform. J. 2001; 35: 1423–1429.
30. T. Lang, A. Auterith, and P. Bauer, Trend tests with adaptive scoring. Biometr. J. 2000; 42: 1007–1020.
31. M. Neuhäuser, An adaptive location-scale test. Biometr. J. 2001; 43: 809–819.
32. M. Kieser, B. Schneider, and T. Friede, A bootstrap procedure for adaptive selection of the test statistic in flexible two-stage designs. Biometr. J. 2002; 44: 641–652.
33. S-J. Wang, H. M. J. Hung, Y. Tsong, and L. Cui, Group sequential test strategies for superiority and non-inferiority hypotheses in active controlled clinical trials. Stat. Med. 2001; 20: 1903–1912.
34. W. Brannath, P. Bauer, W. Maurer, and M. Posch, Sequential tests for non-inferiority and superiority. Biometrics 2003; 59: 106–114.
35. Q. Liu, M. A. Proschan, and G. W. Pledger, A unified theory of two-stage adaptive designs. J. Amer. Stat. Assoc. 2002; 97: 1034–1041.
36. J. Lawrence and H. M. Hung, Estimation and confidence intervals after adjusting the maximum information. Biometr. J. 2003; 45: 143–152.
37. W. Brannath, F. König, and P. Bauer, Improved repeated confidence bounds in trials with a maximal goal. Biometr. J. 2003; 45: 311–324.
38. H. Frick, On confidence bounds for the Bauer-Köhne two-stage test. Biometr. J. 2002; 44: 241–249.
39. S. Coburger and G. Wassmer, Conditional point estimation in adaptive group sequential test designs. Biometr. J. 2001; 43: 821–833.
40. S. Coburger and G. Wassmer, Sample size reassessment in adaptive clinical trials using a bias corrected estimate. Biometr. J. 2003; 45: 812–825.
41. H.-H. Müller and H. Schäfer, A general statistical principle for changing a design anytime during the course of a trial. Stat. Med. 2004; 23: 2497–2508.
42. W. Brannath and P. Bauer, Optimal conditional error functions for the control of conditional power. Biometrics 2004; 60: 715–723.
43. P. C. O'Brien and T. R. Fleming, A multiple testing procedure for clinical trials. Biometrics 1979; 35: 549–556.
44. S. J. Pocock, Group sequential methods in the design and analysis of clinical trials. Biometrika 1977; 64: 191–199.
45. P. Bauer and M. Kieser, A unifying approach for confidence intervals and testing of equivalence and difference. Biometrika 1996; 83: 934–937.
46. M. Posch, P. Bauer, and W. Brannath, Issues in designing flexible trials. Stat. Med. 2003; 22: 953–969.
47. J. Lawrence, Strategies for changing the test statistic during a clinical trial. J. Biopharm. Stat. 2002; 12: 193–205.
48. M. Kieser and T. Friede, Simple procedures for blind sample size adjustment that do not affect the type I error rate. Heidelberg, Germany: Medical Biometry Unit, University of Heidelberg, 2002.
49. R. Marcus, E. Peritz, and K. R. Gabriel, On closed testing procedures with special reference to ordered analysis of variance. Biometrika 1976; 63: 655–660.
50. G. Hommel, Adaptive modifications of hypotheses after an interim analysis. Biometr. J. 2001; 43: 581–589.
51. U. Zeymer, H. Suryapranata, J. P. Monassier, G. Opolski, J. Davies, G. Rasmanis, G. Linssen, U. Tebbe, R. Schroder, R. Tiemann, T. Machnig, and K. L. Neuhaus, The Na+/H+ exchange inhibitor Eniporide as an adjunct to early reperfusion therapy for acute myocardial infarction. J. Amer. College Cardiol. 2001; 38: 1644–1651.
52. U. Zeymer, H. Suryapranata, J. P. Monassier, et al., Evaluation of the safety and cardioprotective effects of eniporide, a specific Sodium/Hydrogen exchange inhibitor, given as adjunctive therapy to reperfusion in patients with acute myocardial infarction. Heart Drug 2001; 1: 71–76.
53. A. A. Tsiatis and C. Mehta, On the inefficiency of the adaptive design for monitoring clinical trials. Biometrika 2003; 90: 367–378.
FOOD AND DRUG ADMINISTRATION (FDA)
The U.S. Food and Drug Administration is a scientific, regulatory, and public health agency that oversees items accounting for 25 cents of every dollar spent by consumers. Its jurisdiction encompasses most food products (other than meat and poultry); human and animal drugs; therapeutic agents of biological origin; medical devices; radiation-emitting products for consumer, medical, and occupational use; cosmetics; and animal feed. The agency grew from a single chemist in the U.S. Department of Agriculture in 1862 to a staff of approximately 9,100 employees and a budget of $1.294 billion in 2001, comprising chemists, pharmacologists, physicians, microbiologists, veterinarians, pharmacists, lawyers, and many others. About one third of the agency's employees are stationed outside of the Washington, D.C. area, staffing over 150 field offices and laboratories, including 5 regional offices and 20 district offices. Agency scientists evaluate applications for new human drugs and biologics, complex medical devices, food and color additives, infant formulas, and animal drugs. Also, the FDA monitors the manufacture, import, transport, storage, and sale of about $1 trillion worth of products annually at a cost to taxpayers of about $3 per person. Investigators and inspectors visit more than 16,000 facilities a year and arrange with state governments to help increase the number of facilities checked. The FDA is responsible for protecting the public health by assuring the safety, efficacy, and security of human and veterinary drugs, biological products, medical devices, our national food supply, cosmetics, and products that emit radiation. The FDA is also responsible for advancing the public health by helping to speed innovations that make medicines and foods more effective, safer, and more affordable and by helping the public get the accurate, science-based information they need to use medicines and foods to improve their health. FDA is the federal agency responsible for ensuring that foods are safe, wholesome, and sanitary; human and veterinary drugs, biological products, and medical devices are safe and effective; cosmetics are safe; and electronic products that emit radiation are safe. FDA also ensures that these products are represented honestly, accurately, and informatively to the public. Some of the specific responsibilities of the agency include the following:
Biologics
• Product and manufacturing establishment licensing
• Safety of the national blood supply
• Research to establish product standards and develop improved testing methods

Cosmetics
• Safety
• Labeling

Drugs
• Product approvals
• OTC and prescription drug labeling
• Drug manufacturing standards

Foods
• Labeling
• Safety of all food products (except meat and poultry)
• Bottled water

Medical Devices
• Premarket approval of new devices
• Manufacturing and performance standards
• Tracking reports of device malfunctioning and serious adverse reactions
Radiation-Emitting Electronic Products
• Radiation safety performance standards for microwave ovens, television receivers, and diagnostic equipment
• X-ray equipment, cabinet x-ray systems (such as baggage x-rays at airports), and laser products
• Ultrasonic therapy equipment, mercury vapor lamps, and sunlamps
• Accrediting and inspecting mammography facilities

Veterinary Products
• Livestock feeds
• Pet foods
• Veterinary drugs and devices

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/oc/history/historyoffda/default.htm), (http://www.fda.gov/opacom/morechoices/mission.html), (http://www.fda.gov/comments/regs.html) by Ralph D'Agostino and Sarah Karl.
FRAILTY MODELS
ANDREAS WIENKE
University Halle-Wittenberg, Institute of Medical Epidemiology, Biostatistics and Informatics, Germany

The notion of frailty provides a convenient way to introduce random effects, association, and unobserved heterogeneity into models for survival data. In its simplest form, a frailty is an unobserved random proportionality factor that modifies the hazard function of an individual or of related individuals. In essence, the frailty concept goes back to the work of Greenwood and Yule (1) on ''accident proneness.'' The term frailty itself was introduced by Vaupel et al. (2) in univariate survival models, and the model was substantially promoted by its application to multivariate survival data in a seminal paper by Clayton (3) (without using the notion ''frailty'') on chronic disease incidence in families. Frailty models are extensions of the proportional hazards model, which is best known as the Cox model (4), the most popular model in survival analysis. Normally, in most clinical applications, survival analysis implicitly assumes a homogeneous population to be studied, which means that all individuals sampled into that study are in principle subject to the same risk (e.g., risk of death, risk of disease recurrence). In many applications, the study population cannot be assumed to be homogeneous, but must be considered as a heterogeneous sample (i.e., a mixture of individuals with different hazards). For example, in many cases it is impossible to measure all relevant covariates related to the disease of interest, sometimes for economic reasons and sometimes because the importance of some covariates is still unknown. The frailty approach is a statistical modeling concept that aims to account for the heterogeneity caused by unmeasured covariates. In statistical terms, a frailty model is a random effect model for time-to-event data, where the random effect (the frailty) has a multiplicative effect on the baseline hazard function.

One can distinguish two broad classes of frailty models:

1. models with a univariate survival time as endpoint, and
2. models that describe multivariate survival endpoints (e.g., competing risks, recurrence of events in the same individual, occurrence of a disease in relatives).

In the first case, a univariate (independent) lifetime is used to describe the influence of unobserved covariates in a proportional hazards model (heterogeneity). The variability of survival data is split into a part that depends on risk factors, and is therefore theoretically predictable, and a part that is initially unpredictable, even when all relevant information is known. A separation of these two sources of variability has the advantage that heterogeneity can explain some unexpected results or give an alternative interpretation of some results, for example, crossing-over effects or convergence of hazard functions of two different treatment arms [see Manton and Stallard (5)], or leveling-off effects, which means the decline in the increase of mortality rates, which could result in a hazard function at old ages parallel to the x-axis [see Aalen and Tretli (6)]. More interesting, however, is the second case, in which multivariate survival times are considered and one aims to account for the dependence in clustered event times, for example, in the lifetimes of patients in study centers in a multi-center clinical trial, caused by center-specific conditions [see Andersen et al. (7)]. A natural way to model dependence of clustered event times is through the introduction of a cluster-specific random effect, the frailty. This random effect explains the dependence in the sense that, had the frailty been known, the events would be independent. In other words, the lifetimes are conditionally independent, given the frailty. This approach can be used for survival times of related individuals like family members or for recurrent observations on the same person. Different extensions of univariate frailty models to multivariate models are possible and will be considered below.
The key ideas of univariate frailty models can be explained by an illustrative example from Aalen and Tretli (6), who analyzed the incidence of testis cancer by means of a frailty model based on data from the Norwegian Cancer Registry collected from 1953 to 1993. The incidence of testicular cancer is greatest among younger men and then declines from a certain age. The frailty is considered to be established by birth, and caused by a mixture of genetic and environmental effects. The idea of the frailty model is that a subgroup of men is particularly susceptible to testicular cancer, which would explain why testis cancer is primarily a disease of young men. As time goes by, the members of the frail group acquire the disease, and at some age this group is more or less exhausted. Then the incidence, computed on the basis of all men at a certain age, will necessarily decline.
1 UNIVARIATE FRAILTY MODELS
The standard situation for the application of survival methods in clinical research projects assumes that a homogeneous population is investigated, with subjects observed under different conditions (e.g., experimental treatment and standard treatment). The appropriate survival model then assumes that the survival data of the different patients are independent of each other and that each patient's individual survival time distribution is the same (independent and identically distributed failure times). This basic presumption implies a homogeneous population. However, in the field of clinical trials, one observes in most practical situations that patients differ substantially. The effect of a drug, a treatment, or the influence of various explanatory variables may differ greatly between subgroups of patients. To account for such unobserved heterogeneity in the study population, Vaupel et al. (2) introduced univariate frailty models into survival analysis. The key idea is that individuals possess different frailties and that those patients who are most frail will die earlier than the others. Consequently, a systematic selection of robust individuals (that is, patients with low frailty) takes place.
When mortality rates are estimated, one may be interested in how these rates change over time or age. Quite often it is observed that the hazard function (or mortality rate) rises at the beginning, reaches a maximum, and then declines (unimodal intensity) or levels off at a constant value. The longer the patient lives after manifestation of the disease, the better his or her chances of survival. It is likely that unimodal intensities are often the result of a selection process acting in a heterogeneous population and do not reflect individual mortality: the population intensity may start to decline simply because the high-risk individuals have already died out, whereas the hazard rate of a given individual might well continue to increase. If protective factors or risk factors are known, they can be included in the analysis by using the proportional hazards model, which is of the form

µ(t, X) = µ0(t) exp(β^T X)

where µ0(t) denotes the baseline hazard function, assumed to be the same for all individuals in the study population, X is the vector of observed covariates, and β is the respective vector of regression parameters to be estimated. The mathematical convenience of this model is based on the separation of the effects of aging in the baseline hazard µ0(t) from the effects of the covariates in the parametric term exp(β^T X). There are two main reasons why it is often impossible to include all important factors at the individual level in the analysis: sometimes too many covariates exist to be considered in the model, and in other cases the researcher does not know or is not able to measure all the relevant covariates. In both cases, two sources of variability exist in survival data: variability accounted for by measurable risk factors, which is thus theoretically predictable, and heterogeneity caused by unknown covariates, which is theoretically unpredictable even when all the relevant information is known. There are advantages to separating these two sources of variability, because heterogeneity, in contrast to pure random variability, can explain some "unexpected" results or can provide an alternative explanation
of some results. Consider, for example, nonproportional hazards or decreasing hazards when unexpected extra variability prevails. In a proportional hazards model, neglect of a subset of the important covariates leads to biased estimates of both the regression coefficients and the hazard rate. The reason for this bias lies in the fact that the time-dependent hazard rate results in changes in the composition of the study population over time with respect to the covariates. If two groups of patients exist in a clinical trial and some individuals experience a higher risk of failure, then the remaining individuals at risk tend to form a more or less selected group with a lower risk. An estimate of the individual hazard rate that does not take the unobserved frailty into account would therefore underestimate the true hazard function, and the extent of underestimation would increase as time progresses. The univariate frailty model extends the Cox model such that the hazard of an individual depends, in addition, on an unobservable random variable Z, which acts multiplicatively on the hazard function:

µ(t, Z, X) = Z µ0(t) exp(β^T X)    (1)

Again, µ0(t) is the baseline hazard function, β the vector of regression coefficients, and X the vector of observed covariates; Z now is the frailty variable. The frailty Z is a random variable varying over the population, which lowers (Z < 1) or increases (Z > 1) the individual risk. Frailty corresponds to the notions of liability or susceptibility in other settings (8). The most important point here is that the frailty is unobservable. The respective survival function S, describing the fraction of surviving individuals in the study population, is given by

S(t | Z, X) = exp(−Z exp(β^T X) ∫0^t µ0(s) ds)    (2)

S(t | Z, X) may be interpreted as the fraction of individuals surviving to time t after the beginning of follow-up, given the vector of observable covariates X and the frailty Z. Note that Equation (1) and Equation (2) describe the same model in different notation. Up to now, the model has been described at the level of the individual. However, this individual model is not observable. Consequently, it is necessary to consider the model at the population level. The survival function of the total population is the mean of the individual survival functions (Equation 2). It can be viewed as the survival function of a randomly drawn individual, and it corresponds to what is actually observed. It is important to note that the observed hazard function will not be similar to the individual hazard rate. What may be observed in the population is the net result for a number of individuals with different Z. The population hazard rate may have a completely different shape than the individual hazard rate, as shown in Fig. 1. One important problem in the area of frailty models is the choice of the frailty distribution. The frailty distributions most often applied are the gamma distribution (2, 3), the positive stable distribution (9), a three-parameter distribution (PVF) (10), the compound Poisson distribution (11, 12), and the log-normal distribution (13). Univariate frailty models are widely applied; a few examples that can be consulted for more details are listed here. Aalen and Tretli (6) applied the compound Poisson distribution to the testicular cancer data already introduced above. The idea of the model is that a subgroup of men is particularly susceptible to testicular cancer, which results in selection over time. Another example is the malignant melanoma dataset, including records of patients who had radical surgery for malignant melanoma (skin cancer) at the University Hospital of Odense in Denmark. Hougaard (14) applied the traditional Cox regression model, a gamma frailty model, and a PVF frailty model, respectively, to these data and compared the results. The third example deals with the time from insertion of a catheter into dialysis patients until it has to be removed because of infection. A subset of the complete data, including the first two infection times of 38 patients, was published by McGilchrist and Aisbett (13). To account for heterogeneity within the data, Hougaard (14) used a univariate gamma frailty model. Henderson and Oman (15) tried to quantify the bias that may occur in estimated
covariate effects and fitted marginal distributions when frailty effects are present in survival data but are ignored in a misspecified proportional hazards analysis. Congdon (16) investigated the influence of different frailty distributions (gamma, inverse Gaussian, stable, binary) on total and cause-specific mortality data from the London area (1988–1990).
Figure 1. Conditional and unconditional hazard rates, plotted against age, in a simulated data set of human mortality. The red lines denote the conditional (individual) hazard rates for individuals with frailty 0.5, 1, and 2, respectively. The blue line denotes the unconditional (population) hazard rate.
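The selection mechanism behind Figure 1 can be reproduced with a short numerical sketch. The following Python fragment is not part of the original article; the Gompertz baseline parameters and the frailty variance are arbitrary assumptions chosen only to make the effect visible. It evaluates the conditional (individual) hazard Zµ0(t) for a few frailty values and the unconditional (population) hazard µ0(t)/(1 + σ²M0(t)) obtained by integrating a gamma frailty with mean 1 and variance σ² out of the model without covariates.

```python
# Illustrative sketch (not from the article): conditional vs. population hazard
# under a gamma frailty model with a Gompertz baseline hazard.
# The parameters a, b, and sigma2 are arbitrary assumptions.
import numpy as np

a, b = 1e-4, 0.1      # Gompertz baseline hazard mu0(t) = a * exp(b * t)
sigma2 = 0.5          # variance of the gamma frailty (its mean is fixed at 1)

def mu0(t):
    return a * np.exp(b * t)

def M0(t):
    # cumulative baseline hazard of the Gompertz model
    return (a / b) * (np.exp(b * t) - 1.0)

def conditional_hazard(t, z):
    # individual hazard for a subject with frailty z (no covariates)
    return z * mu0(t)

def population_hazard(t):
    # marginal hazard after integrating out the gamma frailty:
    # mu_pop(t) = mu0(t) / (1 + sigma2 * M0(t))
    return mu0(t) / (1.0 + sigma2 * M0(t))

for t in (30.0, 50.0, 70.0, 90.0, 110.0):
    print(f"age {t:5.0f}:  Z=0.5: {conditional_hazard(t, 0.5):.4f}  "
          f"Z=1: {conditional_hazard(t, 1.0):.4f}  "
          f"Z=2: {conditional_hazard(t, 2.0):.4f}  "
          f"population: {population_hazard(t):.4f}")
```

At higher ages the population hazard falls increasingly below the hazard of an average (Z = 1) individual, which is exactly the selection effect described in the text and illustrated in Figure 1.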
2 MULTIVARIATE FRAILTY MODELS

A second important application of frailty models is in the field of multivariate survival data. Such data occur, for example, if lifetimes (or times of onset of a disease) of relatives (twins, parent-child) or recurrent events like infections in the same individual are considered. In such cases, independence between the clustered survival times cannot be assumed. Multivariate models are able to account for the presence of dependence between these event times. A commonly used and very general approach is to specify independence among observed data items conditional on a set of unobserved or latent variables (14). The dependence structure in the multivariate model develops from a latent variable in the conditional models for multiple observed survival times. For example, let S(t1 | Z, X1) and S(t2 | Z, X2) be the conditional survival functions of two related individuals with different vectors of observed covariates X1 and X2, respectively [see Equation (2)]. Averaging over an assumed distribution for the latent variables (e.g., a gamma, log-normal, or stable distribution) then induces a multivariate model for the observed data. In the case of paired observations, the two-dimensional survival function is of the form

S(t1, t2) = ∫0^∞ S(t1 | z, X1) S(t2 | z, X2) g(z) dz

where g denotes the density of the frailty Z. In the case of twins, S(t1, t2) denotes the fraction of twin pairs with twin 1 surviving t1 and twin 2 surviving t2. Frailty models for multivariate survival data are derived under the conditional independence assumption by specifying latent variables that act multiplicatively on the baseline hazard.

2.1 The Shared Frailty Model

The shared frailty model is relevant to event times of related individuals, similar organs, and repeated measurements. Individuals in a cluster are assumed to share the same frailty Z, which is why this model is called the shared frailty model. It was introduced by Clayton (3) and extensively studied by Hougaard (14). The survival times are assumed to be conditionally independent with respect to the shared (common) frailty. For ease of presentation, the case of groups with pairs of individuals will be considered (bivariate failure times, for example, event times of twins or parent-child pairs). Extensions to multivariate data are straightforward. Conditional on the frailty Z, the hazard function of an individual in a pair is of the form Zµ0(t)exp(β^T X), where the value of Z is common to both individuals in the pair and thus is the cause of the dependence between survival times within pairs. Independence of the survival times within a pair corresponds to a degenerate frailty distribution (Z = 1, V(Z) = σ² = 0). In all other cases with σ² > 0, the dependence is positive by construction of the model. Conditional on Z, the bivariate survival function is given as

S(t1, t2 | Z) = S1(t1)^Z S2(t2)^Z

In most applications, it is assumed that the frailty distribution (i.e., the distribution of the random variable Z) is a gamma distribution with mean 1 and variance σ². Averaging the conditional survival function produces, under this assumption, survival functions of the form

S(t1, t2) = (S1(t1)^(−σ²) + S2(t2)^(−σ²) − 1)^(−1/σ²)
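How a shared frailty generates dependence can also be checked by simulation. The following sketch is not taken from the article; the baseline hazard and the frailty variance are arbitrary assumptions. It draws paired lifetimes that are conditionally independent exponentials given a common gamma frailty and compares the empirical Kendall's tau of the pairs with the value σ²/(σ² + 2) implied by the shared gamma frailty model.

```python
# Illustrative sketch (not from the article): simulating bivariate lifetimes
# under a shared gamma frailty and checking the induced dependence.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(1)
n_pairs = 20000
lambda0 = 0.05        # constant conditional baseline hazard (assumption)
sigma2 = 1.0          # frailty variance; the frailty mean is 1 (assumption)

# shared frailty per pair: gamma with mean 1 and variance sigma2
z = rng.gamma(shape=1.0 / sigma2, scale=sigma2, size=n_pairs)

# conditional on z, both lifetimes are independent exponentials with hazard z * lambda0
t1 = rng.exponential(scale=1.0 / (z * lambda0))
t2 = rng.exponential(scale=1.0 / (z * lambda0))

tau_hat, _ = kendalltau(t1, t2)
print(f"empirical Kendall's tau:            {tau_hat:.3f}")
print(f"theoretical sigma2 / (sigma2 + 2):  {sigma2 / (sigma2 + 2):.3f}")
```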
Shared frailty explains correlations between subjects within clusters. However, it does have some limitations. First, it forces the unobserved factors to be the same within the cluster, which may not always reflect reality. For example, at times it may be inappropriate to assume that all partners in a cluster share all their unobserved risk factors. Second, the dependence between survival times within the cluster is based on marginal distributions of survival times. However, when covariates are present in a proportional hazards model with gamma distributed frailty, the dependence parameter and the population heterogeneity are confounded, which implies that the joint distribution can be identified from the marginal distributions (10). Third, in most cases, a one-dimensional frailty can only induce positive association within the cluster. However, some situations exist in which the survival times for subjects within the same cluster are negatively associated. For example, in the Stanford Heart Transplantation Study, generally the longer an individual must wait for an available heart, the shorter he or she is likely to survive after
the transplantation. Therefore, the waiting time and the survival time afterwards may be negatively associated. To avoid the above-mentioned limitations of shared frailty models, correlated frailty models were developed.

2.2 The Correlated Frailty Model

Originally, correlated frailty models were developed for the analysis of bivariate failure time data, in which two associated random variables are used to characterize the frailty effect for each pair. For example, one random variable is assigned to partner 1 and one to partner 2, so that they are no longer constrained to have a common frailty. These two variables are associated and have a joint distribution; knowing one of them does not necessarily imply knowing the other. There is no longer a restriction on the type of correlation: the two variables can also be negatively associated, which would induce a negative association between the survival times. Assuming gamma-distributed frailties, Yashin and Iachine (17) used the correlated gamma frailty model, resulting in a bivariate survival distribution of the form

S(t1, t2) = S1(t1)^(1−ρ) S2(t2)^(1−ρ) / (S1(t1)^(−σ²) + S2(t2)^(−σ²) − 1)^(ρ/σ²)
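One way to obtain such correlated gamma frailties is an additive decomposition into a shared gamma component and two individual gamma components on a common scale. The sketch below is illustrative only; the variance and correlation values are assumptions. It constructs Z1 = Y0 + Y1 and Z2 = Y0 + Y2 so that each frailty has mean 1 and variance σ², the pair has correlation ρ, and then checks these moments empirically.

```python
# Illustrative sketch (not from the article): correlated gamma frailties via
# an additive decomposition Z_j = Y0 + Y_j with a shared component Y0.
import numpy as np

rng = np.random.default_rng(2)
n = 50000
sigma2, rho = 0.8, 0.5        # assumed frailty variance and correlation (0 <= rho <= 1)

k = 1.0 / sigma2              # total gamma shape; scale sigma2 gives mean 1, variance sigma2
k0 = rho * k                  # shape of the shared component
y0 = rng.gamma(k0, sigma2, size=n)
y1 = rng.gamma(k - k0, sigma2, size=n)
y2 = rng.gamma(k - k0, sigma2, size=n)
z1, z2 = y0 + y1, y0 + y2

print(f"mean: {z1.mean():.3f}  variance: {z1.var():.3f}  "
      f"correlation: {np.corrcoef(z1, z2)[0, 1]:.3f}")
```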
Examples of the use of multivariate frailty models are various and emphasize the importance of this family of statistical models for survival data:

– a shared log-normal frailty model for the catheter infection data mentioned above, used by McGilchrist and Aisbett (13);
– a shared frailty model with gamma and log-normal distributed frailty, applied to the recurrence of breast cancer by dos Santos et al. (18);
– a shared positive stable frailty model, applied by Manatunga and Oakes (19) to the data from the Diabetic Retinopathy Study, which examined the effectiveness of laser photo-coagulation in delaying the onset of blindness in patients with diabetic retinopathy. The positive stable frailty allows for proportional hazards both in the marginal and the conditional model;
– a study of Andersen et al. (7), who tested for center effects in multi-center survival studies by means of a frailty model with unspecified frailty distribution;
– a correlated gamma-frailty model, applied by Pickles et al. (20) to age of onset of puberty and antisocial behavior in British twins;
– a correlated gamma-frailty model by Yashin and Iachine (17) and Yashin et al. (21) to analyze mortality in Danish twins;
– a correlated gamma-frailty model by Wienke et al. (22) and Zdravkovic et al. (23) to analyze genetic factors involved in mortality caused by coronary heart disease in Danish and Swedish twins, respectively;
– an extension of the correlated gamma-frailty model by Wienke et al. (25), used to model death due to coronary heart disease in Danish twins;
– different versions of the correlated gamma-frailty model, applied by Zahl (26) to cause-specific cancer mortality in Norway to model the excess hazard.
3 SOFTWARE
Stata 7 (procedure st streg) allows one to explore univariate models with gamma and inverse Gaussian distributed frailty. aML 2 supports log-normal frailty models in univariate analysis. WinBUGS is designed for the analysis of shared frailty models with different frailty distributions, using Markov chain Monte Carlo methods. On the Internet, several SAS, GAUSS, S-Plus, and R routines are available dealing with different frailty models.

REFERENCES

1. M. Greenwood and G. U. Yule, An inquiry into the nature of frequency distributions representative of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents. J. Royal Stat. Soc. 1920; 83: 255–279.
2. J. W. Vaupel, K. G. Manton, and E. Stallard, The impact of heterogeneity in individual frailty on the dynamics of mortality. Demography 1979; 16: 439–454. 3. D. G. Clayton, A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika 1978; 65: 141–151. 4. D. R. Cox, Regression models and life-tables. J. Royal Stat. Soc. B 1972; 34: 187–220. 5. S. Manton, Methods for evaluating the heterogeneity of aging processes in human populations using vital statistics data: explaining the black/white mortality crossover by a model of mortality selection. Human Biol. 1981; 53: 47–67. 6. O. O. Aalen and S. Tretli, Analysing incidence of testis cancer by means of a frailty model. Cancer Causes Control 1999; 10: 285–292. 7. P. K. Andersen, J. P. Klein, and M-J. Zhang, Testing for centre effects in multi-centre survival studies: a Monte Carlo comparison of fixed and random effects tests. Stat. Med. 1999; 18: 1489–1500. 8. D. S. Falconer, The inheritance of liability to diseases with variable age of onset, with particular reference to diabetes mellitus. Ann. Human Genet. 1967; 31: 1–20. 9. P. Hougaard, A class of multivariate failure time distributions. Biometrika 1986; 73: 671–678. 10. P. Hougaard, Survival models for heterogeneous populations derived from stable distributions. Biometrika 1986; 73: 671–678. 11. O. O. Aalen, Heterogeneity in survival analysis. Stat. Med. 1988; 7: 1121–1137. 12. O. O. Aalen, Modeling heterogeneity in survival analysis by the compound Poisson distribution. Ann. Appl. Probabil. 1992; 4(2): 951–972. 13. C. A. McGilchrist and C. W. Aisbett, Regression with frailty in survival analysis. Biometrics 1991; 47: 461–466. 14. P. Hougaard, Analysis of Multivariate Survival Data. New York: Springer, 2000. 15. R. Henderson and P. Oman, Effect of frailty on marginal regression estimates in survival analysis. J. Royal Stat. Soc. B 1999; 61: 367–379. 16. P. Congdon, Modeling frailty in area mortality. Stat. Med. 1995; 14: 1859–1874. 17. A. I. Yashin and A. I. Iachine, Genetic analysis of durations: correlated frailty model applied
to survival of Danish twins. Genet. Epidemiol. 1995; 12: 529–538. 18. D. M. dos Santos, R. B. Davies, and B. Francis, Nonparametric hazard versus nonparametric frailty distribution in modeling recurrence of breast cancer. J. Stat. Plan. Infer. 1995; 47: 111–127. 19. A. K. Manatunga and D. Oakes, Parametric analysis of matched pair survival data. Lifetime Data Anal. 1999; 5: 371–387. 20. A. Pickles et al., Survival models for developmental genetic data: age of onset of puberty and antisocial behavior in twins. Genet. Epidemiol. 1994; 11: 155–170. 21. A. I. Yashin, J. W. Vaupel, and I. A. Iachine, Correlated individual frailty: an advantageous approach to survival analysis of bivariate data. Math. Pop. Studies 1995; 5: 145–159. 22. A. Wienke, N. Holm, A. Skytthe, and A. I. Yashin, The heritability of mortality due to heart diseases: a correlated frailty model applied to Danish twins. Twin Res. 2001; 4: 266–274. 23. S. Zdravkovic et al., Heritability of death from coronary heart disease: a 36-year follow-up of 20,966 Swedish twins. J. Intern. Med. 2002; 252: 247–254.
24. S. Zdravkovic et al., Genetic influences on CHD-death and the impact of known risk factors: Comparison of two frailty models. Behavior Genetics 2004; 34: 585–591. 25. A. Wienke, K. Christensen, A. Skytthe, and A. I. Yashin, Genetic analysis of cause of death in a mixture model with bivariate lifetime data. Stat. Model. 2002; 2: 89–102. 26. P. H. Zahl, Frailty modelling for the excess hazard. Stat. Med. 1997; 16: 1573–1585.
FURTHER READING

P. Hougaard, Modeling heterogeneity in survival data. J. Appl. Probabil. 1991; 28: 695–701.

A. Wienke, P. Lichtenstein, K. Czene, and A. I. Yashin, The role of correlated frailty models in studies of human health, ageing and longevity. In: Applications to Cancer and AIDS Studies, Genome Sequence Analysis, and Survival Analysis. N. Balakrishnan, J.-L. Auget, M. Mesbah, and G. Molenberghs (eds.), Birkhäuser, 2006, pp. 151–166.
FUTILITY ANALYSIS
BORIS FREIDLIN
Biometric Research Branch
National Cancer Institute
Bethesda, Maryland

Interim monitoring of outcome data has become a well-accepted component of randomized clinical trials (RCT); a trial can be stopped early for efficacy if a treatment arm appears definitively better than another arm. Most clinical trials are designed to demonstrate benefit of experimental vs. standard treatment and are thus implicitly addressing a one-sided hypothesis. The one-sided nature of the question provides rationale for both efficacy and futility monitoring. In technical terms, futility monitoring refers to a statistical procedure for stopping the trial early if it appears that the experimental arm is unlikely to be shown definitively better than the control arm if the trial is continued to the final analysis. In the context of evidence-based clinical research, the primary goal of a phase III trial is to provide data on the benefit-to-risk profile of the intervention that is sufficiently compelling to change medical practice. From this perspective, a futility boundary should be interpreted as the point at which convincing evidence exists to resolve the original question, that is, to convince the clinical community that the new treatment is not beneficial. The advantages of early futility stopping are obvious in terms of minimizing patient exposure to ineffective, potentially toxic experimental treatments as well as in terms of optimizing the use of resources (1). However, these potential advantages should be weighed against the risk of stopping short of obtaining sufficiently compelling evidence that the new treatment is not beneficial (2–4). Such premature stopping wastes time and resources that went into the design and conduct of the trial and may adversely affect similar ongoing and future trials. There is a rich literature on futility monitoring. DeMets and Ware (5) were among the first to point out the need for more aggressive monitoring for futility in studies with a one-sided hypothesis. Key contributions to the development of the futility monitoring concept include the introduction of stochastic curtailment by Lan et al. (6) and the adaptation of this methodology to the futility index by Ware et al. (1). First, we present the common statistical approaches to futility monitoring. Two examples of studies stopped for futility are then provided in detail. Finally, practical issues of futility monitoring are discussed.
1 COMMON STATISTICAL APPROACHES TO FUTILITY MONITORING

1.1 Statistical Background

For simplicity, the following presentation assumes that the study outcome has a normal distribution (or has been transformed to a normal outcome). The results are readily adapted to other common clinical outcome types (e.g., time-to-event or binary endpoints) where the normal approximation is routinely used in the monitoring of accumulating data (7, 8). Consider an RCT that compares an experimental arm (arm A) to a control arm (arm B). Let θ denote the primary measure of treatment effect (e.g., difference in cholesterol levels or log hazard ratio for survival), with θ = 0 corresponding to no difference between the arms and positive values of θ corresponding to an advantage of the experimental treatment. The trial is designed to test H0: θ = 0 versus HA: θ = θA, where θA > 0 is the minimally clinically meaningful treatment effect. Relevant design calculations are often based on the test statistic Z = θ̂/SE(θ̂), where SE(θ̂) is an estimate of the standard error of θ̂. The quantity SE(θ̂)^(−2) is called the information (I) about the parameter θ. The design rejects H0 if, at the final analysis, Z > cα (cα denotes the 1 − α quantile of the standard normal distribution). The type I error and type II error of the design are defined as PH0{Z > cα} and PHA{Z ≤ cα}, respectively. The amount of information needed to detect θA with type I error α and type II error β is

I = ((cα + cβ)/θA)².
For a specific outcome type, information I is translated into the required sample size. For ethical and economic reasons, it is important to stop the trial as soon as sufficiently compelling evidence has accumulated for either rejection of H0 in favor of HA (efficacy monitoring) or if there is no longer a reasonable chance that H0 can be rejected (futility monitoring). Most RCTs have prospectively specified interim monitoring guidelines that integrate the interim and the final analyses into a single framework. A typical interim analysis plan specifies K analyses, with the first K − 1 analyses designated as interim and the Kth as the final analysis. The test statistic and information at the kth analysis (k = 1, ..., K) are denoted by Zk and Ik, respectively. The quantity tk = Ik/IK (0 < tk ≤ tK = 1) represents the proportion of the total information at the kth analysis and is often referred to as the "information time" (9). Information time provides a convenient universal notation and will be used in the following presentation. Formal derivation of futility boundaries is based on the following property that holds for normal outcomes with independent observations: the statistics (Z1, ..., ZK) are multivariate normal with K-dimensional mean vector θ√IK (√t1, ..., √tK) and cov(Zl, Zk) = √(tl/tk) for l < k. It can also be shown that under some mild conditions the result holds asymptotically for most non-normally distributed outcomes that have been transformed to normal outcomes (10).

1.2 Conditional Power and Stochastic Curtailment

Conditional power at the kth analysis is defined as the probability of rejecting the null hypothesis at the final analysis given the currently observed data and a hypothesized true treatment effect θ:

CPk(θ) = P(ZK > cα | Zk = zk, θ)

The conditional power is an appealing way to summarize the accumulated data to quantify the feasibility of continuing the study to full accrual. If CPk(θ) is low for θ in the desirable range, the study may be stopped for futility. This approach is known as stochastic curtailment (6). The most common implementation of stochastic curtailment calculates conditional power at the minimally clinically meaningful treatment effect θA and stops at the kth analysis if CPk(θA) < γ for some prespecified γ. This formulation of the futility rule is easily understood by nonstatistical consumers of clinical trials such as clinicians and patients (1). Use of formal rules based on conditional power at the design alternative CPk(θA) is sometimes criticized because it conditions on current data and a value of θ that may be inconsistent with the current data. This criticism is somewhat misplaced because CPk(θA) is intended to quantify the degree of consistency between the observed data and the trial goal as represented by the parameters (θA, α, β). In any case, real-life applications of the stochastic curtailment method to futility monitoring typically involve evaluation of conditional power over a range of potential values of the parameter θ that includes both biologically plausible values consistent with the accumulated data and the treatment effect used in the design of the study (11). In another variation of stochastic curtailment, the stopping rule is based on the conditional power calculated under the current estimate of the treatment effect θ̂k, that is, stop at the kth analysis if CPk(θ̂k) < γ. This so-called "current trend" approach is problematic, as it may lead to premature stopping, especially early in the trial when the estimate θ̂k may be highly unreliable (see the antenatal corticosteroids example). As an alternative, Pepe and Anderson (12) proposed a more conservative approach using CPk(θ̂k + SE(θ̂k)). Futility boundaries based on conditional power can be derived as follows. At the kth interim analysis, the conditional distribution of ZK, given Zk = zk, is normal with mean zk√tk + θ√IK (1 − tk) and variance 1 − tk. The conditional power is then

CPk(θ) = 1 − Φ( (cα − zk√tk − θ√IK (1 − tk)) / √(1 − tk) )

where Φ(·) is the cumulative standard normal distribution function. Hence the condition CPk(θ) < γ corresponds to stopping for futility at the kth analysis if

Zk < (cα − θ√IK (1 − tk))/√tk − cγ √(1 − tk)/√tk
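The quantities just derived translate directly into a few lines of code. The following Python sketch is illustrative rather than normative: it computes the required information IK = ((cα + cβ)/θA)², the conditional power CPk(θ), and the corresponding stochastic curtailment futility boundary for Zk. The design values and interim inputs in the example are arbitrary assumptions, not data from any particular trial.

```python
# Hedged sketch of the conditional power and futility boundary formulas above.
from scipy.stats import norm

def required_information(theta_A, alpha=0.025, beta=0.10):
    """I_K = ((c_alpha + c_beta) / theta_A)**2 needed to detect theta_A."""
    c_alpha, c_beta = norm.ppf(1 - alpha), norm.ppf(1 - beta)
    return ((c_alpha + c_beta) / theta_A) ** 2

def conditional_power(z_k, t_k, theta, I_K, alpha=0.025):
    """CP_k(theta) = P(Z_K > c_alpha | Z_k = z_k) under drift theta."""
    c_alpha = norm.ppf(1 - alpha)
    mean = z_k * t_k ** 0.5 + theta * I_K ** 0.5 * (1 - t_k)
    return 1 - norm.cdf((c_alpha - mean) / (1 - t_k) ** 0.5)

def futility_boundary(t_k, theta, I_K, alpha=0.025, gamma=0.2):
    """Value below which Z_k implies CP_k(theta) < gamma (stop for futility)."""
    c_alpha, c_gamma = norm.ppf(1 - alpha), norm.ppf(1 - gamma)
    return (c_alpha - theta * I_K ** 0.5 * (1 - t_k) - c_gamma * (1 - t_k) ** 0.5) / t_k ** 0.5

# Example with assumed values: log hazard ratio alternative theta_A = log(1.5),
# an interim look at half the information, and an interim statistic z_k = 0.5.
theta_A = 0.405
I_K = required_information(theta_A)
print(f"required information: {I_K:.1f}")
print(f"conditional power at the design alternative: {conditional_power(0.5, 0.5, theta_A, I_K):.3f}")
print(f"futility boundary for Z_k (gamma = 0.2): {futility_boundary(0.5, theta_A, I_K):.3f}")
```

Under the common approximation that the information for a log hazard ratio in a 1:1 randomized trial is roughly one quarter of the number of events, the information computed here (about 64) would correspond to roughly 256 events; this conversion is an added assumption and not part of the formulas above.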
Because futility monitoring does not involve rejecting H0, the type I error of the study is not inflated. Futility monitoring does, however, inflate the type II error. Lan et al. (6) have shown that continuous application of the stochastic curtailment futility rule CPk(θA) < γ to a fixed-sample-size study with type II error β yields a type II error bounded by β/(1 − γ). Because in most implementations interim monitoring is not continuous, this bound is very conservative; a more accurate bound can be found in Davis and Hardy (13). See Lachin (14) and Moser and George (15) for a discussion of integrating the stochastic curtailment futility boundary with an efficacy boundary for simultaneous control of the type I and type II error rates. An alternative to calculating conditional power at a fixed value of the parameter θ is the use of predictive power (16, 17). It is derived by averaging the conditional power with respect to the posterior distribution for the parameter θ:

PPk = ∫ CPk(θ) π(θ | Dk) dθ

where π(θ | Dk) denotes the posterior distribution and Dk is the observed data at the kth analysis.

1.3 Group Sequential Formulation of Futility Boundary

Another common approach to futility monitoring is the one-sided group sequential rule of Pampallona and Tsiatis (18) (herein referred to as PT) that is based on the power family of group sequential tests (19). The rule combines both efficacy and futility boundaries to control the overall type I and type II error rates. Boundaries depend on a shape parameter Δ and are defined by pairs of critical values (ak, bk) for k = 1, ..., K, subject to ak ≥ bk and aK = bK. At the kth analysis time, the decision rule is: if Zk > ak, stop for efficacy; if Zk < bk, stop for futility; otherwise, continue. The critical values ak and bk are of the form

ak = C1 tk^(Δ−0.5)  and  bk = θA √Ik − C2 tk^(Δ−0.5),

where the constants C1 and C2 are selected to satisfy the specified type I and type II error rates for fixed Δ. Here, we focus on the futility boundary {bk}. This futility procedure corresponds to a one-sided group sequential boundary for testing the hypothesis HA against H0. The boundary uses a spending function to control the overall error rate at level β. The shape of the spending function is determined by the parameter Δ. For example, Δ = 0 corresponds to an O'Brien–Fleming type boundary (20), whereas Δ = 0.5 corresponds to a Pocock type boundary (21). In practice, the most commonly used value of Δ is 0. It can be shown that a PT futility boundary with Δ = 0 is similar to stochastic curtailment with γ = 0.5 (22). More generally, Moser and George (15) show that a PT boundary can be expressed in terms of the generalized stochastic curtailment procedure and vice versa. A number of published group sequential approaches to futility are based on repeated tests of the alternative hypothesis. For example, the following rule is suggested (23, 22): at an interim analysis, terminate the trial for futility if a one-sided test of consistency of the data with HA is statistically significant at some fixed nominal significance level (e.g., 0.005 or 0.001). This procedure has a negligible effect on power (22). The repeated test rules are statistically equivalent to approaches based on repeated confidence intervals (4, 24, 25).

1.4 Other Statistical Approaches to Constructing Futility Boundaries

Another simple futility rule was proposed by Wieand et al. (26) (see also Ellenberg and Eisenberger [27]). A single futility analysis is performed when half of the total trial information is available. If the estimate of the treatment effect is θ̂ < 0 (e.g., the log hazard ratio of the control arm to the experimental arm is less than 0), early termination of the trial is recommended. This simple rule results in negligible deflation of the type I error and less than a 2% loss of the fixed-sample-size design power. This approach was used for futility stopping in randomized studies of the somatostatin analogue octreotide for advanced colon cancer
(28) and for advanced pancreatic cancer (29). Although the simplicity of this rule is appealing, having long intervals between scheduled analyses may not be appropriate in some settings (30). A Bayesian framework can offer a useful tool for futility monitoring deliberations. Bayesian analysis provides a formal way of combining the totality of the prior and/or external information with the data accumulated in the trial. A futility decision is typically based on considering an "enthusiastic" prior, that is, a prior reflecting the opinion of the proponents of the new treatment. For example, in a normal setting, such a prior may be defined by centering it on the alternative hypothesis θA and fixing the probability of no benefit (θ < 0) at some low value (e.g., 5%). The posterior distribution of the treatment effect θ given the current data can then be used to model how the accumulated data affect the beliefs of the proponents of the intervention and whether the observed negative trend is sufficiently compelling to stop the trial for futility. For further discussion, see Spiegelhalter et al. (31).
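The predictive power and the "enthusiastic prior" ideas described above can be combined in a short numerical sketch. The fragment below is a hedged illustration, not a prescribed method: it builds an enthusiastic normal prior centered at θA with a 5% prior probability of no benefit, updates it with the interim estimate using the standard normal-normal conjugate formulas, and numerically averages the conditional power over the resulting posterior to obtain PPk. All numerical inputs are assumptions.

```python
# Hedged sketch of predictive power under an "enthusiastic" normal prior.
from scipy.stats import norm
from scipy.integrate import quad

def conditional_power(z_k, t_k, theta, I_K, alpha=0.025):
    c_alpha = norm.ppf(1 - alpha)
    mean = z_k * t_k ** 0.5 + theta * I_K ** 0.5 * (1 - t_k)
    return 1 - norm.cdf((c_alpha - mean) / (1 - t_k) ** 0.5)

def predictive_power(z_k, t_k, I_K, prior_mean, prior_sd, alpha=0.025):
    """PP_k: conditional power averaged over the posterior of theta given z_k."""
    I_k = t_k * I_K
    theta_hat = z_k / I_k ** 0.5                 # interim estimate of theta
    post_prec = 1.0 / prior_sd ** 2 + I_k        # normal-normal conjugate update
    post_mean = (prior_mean / prior_sd ** 2 + I_k * theta_hat) / post_prec
    post_sd = post_prec ** -0.5
    integrand = lambda th: conditional_power(z_k, t_k, th, I_K, alpha) * norm.pdf(th, post_mean, post_sd)
    pp, _ = quad(integrand, post_mean - 8 * post_sd, post_mean + 8 * post_sd)
    return pp

# Enthusiastic prior: centered at theta_A with P(theta < 0) = 0.05, so that
# prior_sd = theta_A / z_0.95. All values are assumptions for illustration.
theta_A = 0.405
prior_sd = theta_A / norm.ppf(0.95)
I_K = ((norm.ppf(0.975) + norm.ppf(0.90)) / theta_A) ** 2
pp = predictive_power(0.5, 0.5, I_K, theta_A, prior_sd)
print(f"predictive power at the interim look: {pp:.3f}")
```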
2 EXAMPLES
Examples of futility stopping from two randomized clinical trials are now presented in detail.

2.1 Optimal Duration of Tamoxifen in Breast Cancer

The National Surgical Adjuvant Breast and Bowel Project (NSABP) B-14 trial investigated the use of
tamoxifen in early-stage breast cancer. The relevant portion of the study for the discussion here was designed to address the optimal duration of tamoxifen administration. Breast cancer patients who had completed 5 years of tamoxifen therapy free of disease recurrence were randomized to either tamoxifen or placebo for another 5 years (second randomization). The study was designed to have 85% power (α = 0.05 one-sided) to detect a 40% reduction in the disease-free survival (DFS) hazard rate after observing 115 events (an event was defined as recurrence of breast cancer, occurrence of a new primary cancer, or death from any cause). The study was closed after the third interim analysis with the conclusion that a meaningful benefit of continuing tamoxifen would not be demonstrated. This result went against the prevailing clinical opinion at that time. The study team provided a detailed account of the various futility monitoring approaches that had led to the decision (32). The summary of DFS and the secondary endpoint overall survival (OS) from the three interim analyses is given in Table 1.

Table 1. Interim analysis results from the National Surgical Adjuvant Breast and Bowel Project (NSABP) B-14 trial

Analysis number                                           1               2               3
Information time (observed/planned number of events)  40% (46/115)    58% (67/115)    76% (88/115)
Placebo arm: DFS events (deaths)                          18 (6)          24 (10)         32 (13)
Tamoxifen arm: DFS events (deaths)                        28 (9)          43 (19)         56 (23)
Hazard ratio (tamoxifen/placebo)                            NA              1.8             1.7
DFS P-value (two-sided)                                     NA             0.028           0.015
Conditional power under the design alternative              NA              <5%              0

Note: DFS, disease-free survival; NA, not applicable.
Source: Dignam et al. Control Clin Trials. 1998; 19: 575–588.

At the time of the first interim analysis, a developing negative trend was observed, but the data were considered immature. At the time of the second interim analysis, conditional power (at 115 events) under the design alternative of a 40% reduction in hazard was less than 5%. In fact, even under a threefold (67%) reduction the conditional power was less than 5%. However, an argument was made that [1] the design alternative (40% reduction in hazard), which was roughly equivalent to the size of benefit for the patients receiving the first 5 years of tamoxifen, was not realistic, and [2] the sizing of the study was driven in part by the limitations of the eligible population pool. A trial with the same parameters targeting a more realistic 30% reduction in hazard would require 229 events. Based on this projection, conditional power was approximately 10% under a 30% reduction in hazard and 40% under a 40% reduction in hazard (see Figure 3 in Dignam et al. [32]). According to the study team, these conditional power calculations suggested that "sufficient statistical power to detect the benefit" was still present, and these considerations contributed to the decision to continue the trial at that time. At the third interim analysis, there was no possibility of a statistically significant result under the existing 115-event design (even if all 27 remaining events were to occur in the control arm, the final test would not achieve the predetermined level of statistical significance). For a 229-event design, the conditional power was approximately 3% under a 30% reduction in hazard and 15% under a 40% reduction in hazard. A Bayesian approach to futility, using a normal approximation for the distribution of the log hazard ratio, was also examined by the investigators. An enthusiastic prior was centered at the alternative treatment effect θA (40% reduction in hazard), with the variance corresponding to a 5% probability of no benefit (θ ≤ 0). At the third interim analysis, the posterior probability of any benefit for the tamoxifen arm (θ > 0) was 13%, and the posterior probability of the treatment effect exceeding θA was 0.00003.
2.2 A Randomized Study of Antenatal Corticosteroids

Pregnant women at risk for preterm delivery are often given weekly courses of antenatal corticosteroids. To assess the efficacy of weekly administration, a randomized, double-blind, placebo-controlled, phase III trial of weekly administration versus a single course was conducted by Guinn et al. (33). The trial was designed to enroll 1,000 women to give 90% power to detect a 33% reduction in composite neonatal morbidity from 25% to 16.5% (0.66 relative risk) at the 0.05 two-sided significance level (note that while a two-sided significance level was used for the sample size calculation, the study addresses a one-sided question). The study results are summarized in Table 2 (34).

Table 2. Interim analysis results from the antenatal corticosteroids trial (as reported in [34])

                                         First 308 patients       Last 194 patients        Total 502 patients
                                         Weekly      Single       Weekly      Single       Weekly      Single
# of patients                             161          147          95           99          256          246
# of morbidities (%)                   39 (24.2)    40 (27.2)    17 (17.8)    26 (26.2)    56 (21.9)    66 (26.8)
RR (95% CI)                              .89 (.61, 1.30)          .68 (.40, 1.74)           .82 (.6, 1.11)
Conditional power, "current trend"             14%                      75%                      40%
Conditional power, design alternative          74%                      77%                      77%

Source: Jenkins et al. JAMA. 2002; 287: 187–188.

At a planned interim analysis after 308 women had been randomized and evaluated, observed composite morbidity was 24% in the weekly arm and 27% in the single-course group, corresponding to only an 11% relative reduction. The investigators reported that under the "observed trend" the conditional power at 1,000 patients was less than 2%. (Guinn et al. do not detail the derivation of the 2% result, but based on the reported data the conditional power under the "observed trend" in the first 308 patients was approximately 14%.) The reported conditional power (along with some external safety reports) was used as the rationale to stop the trial after accrual of 502 patients (194 additional patients). Had the investigators calculated conditional power under the design alternative, they would have obtained a conditional power of 74%. After the study was finished, the observed morbidity rates in the last 194 patients were 17.9% in the weekly arm and 26.2% in the single-course arm, corresponding to a 32% reduction (very close to the 34% target effect) (34). At the time of the trial closure with 502 patients, conditional power (under the design alternative) was approximately 77%, and conditional power under the "current trend" was 40%. These data also illustrate the unstable nature of the "current trend" approach early in the study: conditional power ranged from 14% in the first 308 patients to 75% in the next 194.
3 DISCUSSION
The decision to stop a trial for futility is inherently complex. In most cases, it involves considerations beyond whether the primary outcome crossed a formal boundary. Such a fundamental decision requires a thorough evaluation of all relevant data (external and internal). Issues requiring careful consideration include consistency of results across primary and secondary outcomes, treatment group similarity, and complete and unbiased evaluation of outcome (35). The degree of evidence required to address the study objectives as well as justify early futility stopping depends on the nature of the intervention, the context of the disease, and the current standard of care. Randomized clinical trials are often conducted in the presence of efficacy evidence from randomized trials in similar clinical settings, epidemiologic studies, and surrogate endpoints. In some situations, experimental intervention is already in limited use in the community, and the trial is designed to provide definitive evidence to support widespread use. In such cases, stopping the trial early requires a clear refutation of meaningful benefit (4). Hence, futility decisions are typically made only after a definitive negative trend is established. This point of view is often taken in large trials in cardiovascular, cancer-prevention, and adjuvant cancer settings (32, 36). In contrast, in settings such as advanced cancer, the requirements for preliminary evidence of efficacy are relatively low and a large proportion of agents tested in phase III studies turn out to be inactive. In such settings, there is a need for an efficient approach for testing new
agents, and weaker evidence (or rather a lack of evidence of activity) is sufficient to affect medical practice (28, 29). These more aggressive futility standards allow one to redirect patients to other trials. A discrepancy between the futility monitoring rule and the degree of evidence required in the particular setting may result in the failure to answer an important clinical question as well as waste time and resources. A somewhat extreme example is provided by the use of the "current trend" conditional power in the antenatal corticosteroids trial (33). The investigators used conditional power under the "observed trend" with less than one-third of the total information available to justify stopping the trial. At that time, conditional power under the design alternative was 74%. Moreover, the observed data at that time did not rule out the target relative risk of 0.66. Consequently, the use of "current trend" conditional power early in this study was inappropriate. One can argue that the study failed to address its goal (37). Some of the rules described previously suggest stopping for futility with the experimental arm doing nontrivially better than the control arm. For example, consider the commonly used PT futility approach with Δ = 0. For a trial designed to detect a hazard ratio of 1.5 with four interim analyses (one-sided α = 0.025 and β = 0.1), the rule calls for stopping for futility at the fourth analysis (with 80% of total information) if the observed hazard ratio is (just less than) 1.22, with 90% confidence interval (0.98, 1.52; P = 0.068). For a trial that is designed to show a 50% improvement in median survival, a 22% increase in median survival time is a meaningful clinical effect in many cases. Furthermore, the upper confidence limit includes the design target of 1.5, while the lower confidence limit excludes all but a negligible inferiority of the experimental arm. Individuals reading the results of the study may infer that the experimental treatment is worth pursuing even though it was not as effective as the study investigators may have hoped. This leads to questions as to whether stopping the trial early was a correct decision. This example reiterates the point that investigators designing a trial should carefully examine the monitoring boundaries they are considering to ensure that they
would be comfortable stopping the trial for futility for certain observed positive differences (22). In addition to the potential harm to the integrity of the ongoing study, a premature stopping may jeopardize completion of other ongoing trials addressing this or similar questions (38). The decision to stop (or not to stop) a large study that is addressing a major public health question is often challenged by a wider audience. Even after thorough deliberation, some may question the wisdom and implications of the ultimate decision. For example, the early stopping of NSABP B14 was criticized (39). Some interim monitoring approaches integrate efficacy and futility boundaries under a single framework (15, 18). This allows simultaneous control of the type I and type II error rates. Others take a more informal approach to futility and implicitly argue that because futility monitoring does not inflate type I error no formal adjustment is needed. In general, due to the complexity of the monitoring process, interim monitoring boundaries are considered to be guidelines rather than strict rules. This is especially relevant to the futility stopping that in practice is often based on a synthesis of efficacy, safety, and feasibility considerations both internal and external to the ongoing trial (by contrast, common efficacy monitoring scenarios are driven by strong internal evidence of benefit with respect to the primary clinical outcome). On a purely mathematical level, the integration of the efficacy and futility boundaries typically involves adjusting the upper boundary downward to compensate for the deflation of the type I error due to the possibility of futility stopping. If the futility boundary is not strictly adhered to, the overall type I error of the design is not maintained. Therefore, due to the different nature of the efficacy and futility stopping and the priority to control type I error in many settings, the downward adjustment of the efficacy boundary may not be justified. This can potentially become an issue when the study is intended to support approval by a regulatory agency (40). Thus far, an implicit assumption in the discussion has been that the design treatment effect θ A represents a minimally clinically meaningful benefit and that it is within
the range of biological plausibility. However, some RCTs are sized using a hypothesized treatment effect that is a compromise between a realistic target supported by clinical/biological evidence and the feasibility considerations (achievable sample size and timely completion of the study). As a result, some studies, especially those in rare diseases, are often sized using an unrealistic target treatment effect. When an interim futility analysis conditioned on the observed data suggests that the study is unlikely to be positive if continued to the planned sample size, this may to a large degree reflect the unrealistic target effect. In such situations, more conservative futility boundaries should be used. The exact cut-off value γ used in stochastic curtailment may vary depending on the context of the trial. However, the use of γ = 0.5 or higher (i.e., to stop a trial for futility with a conditional power of 0.5 or higher) seems to be inconsistent with the original design. Recall that this conditional power calculation is done under the alternative hypothesis such that at the beginning of the trial the conditional power is the unconditional power (0.9, or possibly 0.8). This high unconditional power reflects the supporting preliminary data and rationale that are used to justify the expending of effort and resources to set up and conduct a trial. Therefore, once the trial is started, the drop in power from 0.9 (or 0.8) to 0.5 might not provide sufficient evidence to stop the trial early and risk wasting the resources that went into setting up and conducting the trial. It is instructive to draw an analogy with the widely accepted O’Brien–Fleming efficacy boundary. As mentioned previously, the boundary corresponds to stopping with conditional power 0.5 under the null (CPk (0) = 0.5); that is, the ‘‘conditional type I error’’ is 0.5. Thus, for a trial designed with ‘‘unconditional’’ 0.05 (0.025) type I error, this rule requires a 10- (20)-fold increase in conditional power to justify stopping the trial for efficacy. Although the exact choice of γ depends on circumstances, values in the 0.1 to 0.3 range are generally more consistent with common trial designs.
In many cases, it may be difficult to identify a simple expression for a stopping boundary that has the desired properties throughout the trial. Many commonly used futility rules may appear to be too conservative or too liberal at different information/logistic stages of a study. A futility boundary based on one statistical approach can usually be defined through other formulations (e.g., under a general definition, any futility stopping rule can be expressed in terms of the general stochastic curtailment approach or a repeated testing of the alternative hypothesis approach [15]). This can be useful in developing futility rules with desirable operational characteristics, in elucidating these rules to the data monitoring committee members, and in presenting the study results to the clinical community (41). In addition, increased flexibility can be achieved by selecting different values for the conditional power cut-off γ (or, equivalently, selecting different significance levels for the alternative hypothesis testing formulation), depending on the proportion of available information or depending on where the trial is in terms of accrual and follow-up. For example, different values of conditional power can be used during [1] the accrual period and [2] the follow-up period when all of the patients are off treatment. This strategy adjusts for the different impact that stopping has on the study patients and on the results. The first scenario implies terminating accrual and a possible treatment change for the patients who are still on the study medication; some data are lost, and the total design information is not reached. The second scenario means only releasing the study data early; complete follow-up data are generally available after an additional follow-up period. In general, we recommend [1] scheduling futility analyses at the same time as efficacy analyses (starting at 25% to 30% of the total information) (30), and [2] requiring a stronger level of evidence for stopping early in the trial. The responsibility for monitoring of RCTs is best handled by a data monitoring committee that is independent of the principal investigator and the sponsor and thus is free from potential conflicts of interest. A failure to provide transparent and independent futility monitoring may lead to controversy. For example, Henke et al. (42) reported results
from a randomized, placebo-controlled trial of erythropoietin for head and neck cancer. The paper was unclear as to who was doing the interim monitoring and stated that the drug company sponsoring the study had decided to omit the second of the two planned interim analyses. Two years after the scheduled date of the omitted interim analysis, the final analysis was carried out. It revealed a significant impairment in cancer control and in survival for the erythropoietin arm relative to the placebo arm: the locoregional progression-free survival (the primary endpoint) hazard ratio was 1.62 (P = 0.0008), and the overall survival hazard ratio was 1.39 (P = 0.02). Although the study outcome at the time of the omitted analysis is not available, it is hypothetically possible that if the omitted interim analysis were carried out the study would have closed for futility and the medical community would have had the benefit of such information 2 years earlier than they did. The fact that the decision to cancel the second interim analysis was made by the sponsor rather than by an independent data monitoring committee created an apparent potential conflict of interest (43). In addition to the futility considerations based on low conditional power (disappointing observed treatment effect), futility may also be affected by such factors as lower than expected event rates or slower than expected accrual (44). If the observed control arm event rate is lower than that used at the design stage (to calculate the sample-size/duration needed for the desired power), the study may not be able to accumulate enough information (within a reasonable time period) to achieve adequate power and reliably address its goals. For example, one of the considerations in the early termination of the aspirin component of the Physicians Health Study (45) was that, due to a fivefold lower than expected cardiovascular death rate, the trial would have to be extended another 10 years to achieve the prespecified power (46). Note that such post hoc unconditional power analyses are usually used in conjunction with a conditional power analysis (11). In summary, a clinical trial to which substantial time and resources have been expended should provide an unequivocal resolution of the primary question. Premature
stopping may have a major impact on the future of the intervention, clinical care, and research directions. Therefore, futility stopping rules should be consistent with the study objectives in requiring a degree of evidence that is sufficiently convincing to the relevant medical community. Futility procedures should be carefully outlined in the protocol. If a trial is stopped early, then a clear explanation of the procedure and justification should be reported.

REFERENCES

1. J. H. Ware, J. E. Muller, and E. Braunwald, The futility index. Am J Med. 1985; 78: 635–643. 2. S. J. Pocock, When to stop a clinical trial. BMJ. 1992; 305: 235–240. 3. P. W. Armstrong and C. D. Furberg, Clinical trial data and safety monitoring boards. Circulation. 1995; 91: 901–904. 4. D. L. DeMets, S. J. Pocock, and D. G. Julian, The agonizing negative trend in monitoring of clinical trials. Lancet. 1999; 354: 1983–1988. 5. D. L. DeMets and J. H. Ware, Group sequential methods for clinical trials with one-sided hypothesis. Biometrika. 1980; 67: 651–660. 6. K. K. G. Lan, R. Simon, and M. Halperin, Stochastically curtailed tests in long-term clinical trials. Commun Stat Theory Methods. 1982; 1: 207–219. 7. J. Whitehead, The Design and Analysis of Sequential Clinical Trials, 2nd ed. Chichester, UK: Wiley, 1997. 8. C. Jennison and B. W. Turnbull, Group Sequential Methods with Applications to Clinical Trials. Boca Raton, FL: Chapman & Hall/CRC, 2000. 9. K. K. G. Lan and J. Wittes, The B-value: a tool for monitoring data. Biometrics. 1988; 44: 579–585. 10. D. O. Scharfstein, A. A. Tsiatis, and J. M. Robins, Semiparametric efficiency and its implications on the design and analysis of the group-sequential trials. J Am Stat Assoc. 1997; 92: 1342–1350. 11. J. M. Lachin and S. H. Lan, for the Lupus Nephritis Collaborative Study Group, Termination of a clinical trial with no treatment group difference: the lupus nephritis collaborative study. Control Clin Trials. 1992; 13: 62–79. 12. M. S. Pepe and G. L. Anderson, Two-stage experimental designs: early stopping with a negative result. Appl Stat. 1992; 41: 181–190.
13. B. R. Davis and H. J. Hardy, Upper bounds on type I and type II error rates in conditional power calculations. Commun Stat Theory Methods. 1990: 19: 3572–3584. 14. J. M. Lachin, A review of methods for futility stopping based on conditional power. Stat Med. 2005; 24: 2747–2764. 15. B. K. Moser and S. L. George, A general formulation for a one-sided group sequential design. Clin Trials. 2005; 2; 519–528. 16. S. C. Choi, P. J. Smith, and D. P. Becker, Early decision in clinical trials when treatment differences are small. Control Clin Trials. 1985; 6: 280–288. 17. D. J. Spiegelhalter, L. S. Freedman, and P. R. Blackburn, Monitoring clinical trials: conditional or predictive power. Control Clin Trials. 1986; 7: 8–17. 18. S. Pampallona and A. A. Tsiatis, Group sequential designs for one-sided and two-sided hypothesis testing with provision for early stopping in favor of the null hypothesis. J Stat Plan Inference. 1994; 42: 19–35. 19. S. K. Wang and A. A. Tsiatis, Approximately optimal one-parameter boundaries for group sequential trials. Biometrics. 1987; 43: 193–200. 20. P. C. O’Brien and T. R. Fleming, A multiple testing procedure for clinical trials. Biometrics. 1979; 35: 549–556. 21. S. J. Pocock, Interim analyses for randomized clinical trials: the group sequential approach. Biometrics. 1982; 38: 153–162. 22. B. Freidlin and E. L. Korn, A comment on futility monitoring. Control Clin Trials. 2002; 23: 355–366. 23. T. R. Fleming, D. P. Harrington, and P. C. O’Brien, Designs for group sequential tests. Control Clin Trials. 1984; 5: 348–361. 24. T. R. Fleming and L. F. Watelet, Approaches to monitoring clinical trials. J Natl Cancer Inst. 1989; 81: 188–193. 25. C. Jennison and B. W. Turnbull, Interim analyses: the repeated confidence interval approach. J R Stat Soc Ser B Methodol. 1989: 51: 305–361. 26. S. Wieand, G. Schroeder, and J. R. O’Fallon, Stopping when the experimental regimen does not appear to help. Stat Med. 1994; 13: 1453–1458. 27. S. S. Ellenberg and M. A. Eisenberger, An efficient design for phase III studies of combination chemotherapies. Cancer Treat Rep. 1985; 69: 1147–1152.
28. R. M. Goldberg, C. G. Moertel, H.S. Wieand, J. E. Krook, A. J. Schutt, et al., A phase III evaluation of a somatostatin analogue (octreotide) in the treatment of patients with asymptomatic advanced colon carcinoma. Cancer. 1995; 76: 961–966. 29. P. A. Burch, M. Block, G. Schroeder, J. W. Kugler, D. J. Sargent, et al., Phase III evaluation of octreotide versus chemotherapy with 5-fluorouracil or 5-fluorouracil plus leucovorin in advanced exocrine pancreatic cancer. Clin Cancer Res. 2000; 6: 3486–3492. 30. B. Freidlin, E. L. Korn, and S. L. George, Data monitoring committees and interim monitoring guidelines. Control Clin Trials. 1999; 20: 395–407. 31. D. J. Spiegelhalter, L. S. Freedman, and M. K. B. Parmar, Bayesian approaches to randomized trials. J R Stat Soc Ser A Stat Soc. 1994; 157: 357–416. 32. J. J. Dignam, J. Bryant, H. S. Wieand, B. Fisher, and N. Wolmark, Early stopping of a clinical trial when there is evidence of no treatment benefit: protocol B-14 of the National Surgical Adjuvant Breast and Bowel Project. Control Clin Trials. 1998; 19: 575–588. 33. D. A. Guinn, M. W. Atkinson, L. Sullivan, M. Lee, S. MacGregor, et al., Single vs weekly courses of antenatal corticosteroids for women at risk of preterm delivery: A randomized controlled trial. JAMA. 2001; 286: 1581–1587. 34. T. M. Jenkins, R. J. Wapner, E. A. Thom, A. F. Das, and C. Y. Spong, Are weekly courses of antenatal steroids beneficial or dangerous? JAMA. 2002; 287: 187–188. 35. P. J. Schwartz, How reliable are clinical trials? The importance of the criteria for early termination. Eur Heart J. 1995; 16(Suppl G): 37–45. 36. G. S. Omenn, G. E. Goodman, M. D. Thornquist, J. Balmes, M. R. Cullen, et al., Effects of a combination of beta carotene and vitamin A on lung cancer and cardiovascular disease. N Engl J Med. 1996; 334: 1150–1155. 37. K. E. Murphy, M. Hannah, and P. Brocklehurst, Are weekly courses of antenatal steroids beneficial or dangerous? [letter]. JAMA. 2002; 287: 188. 38. S. Green, J. Benedetti, and J. Crowley, Clinical Trials in Oncology. London: Chapman & Hall, 2002.
39. R. Peto, Five years of tamoxifen-or more? J Natl Cancer Inst. 1996; 88: 1791–1793.
40. European Medicines Agency, Committee for Medicinal Products for Human Use (CHMP). Reflection Paper on Methodological Issues in Confirmatory Clinical Trials with Flexible Design and Analysis Plan. Draft. March 23, 2006. Available at: http://www.emea.eu.int/pdfs/human/ewp/245902en.pdf
41. S. S. Emerson, J. M. Kittelson, and D. L. Gillen, On the Use of Stochastic Curtailment in Group Sequential Clinical Trials. University of Washington Biostatistics Working Paper Series, no. 243. Berkeley, CA: Berkeley Electronic Press, March 9, 2005. Available at: http://www.bepress.com/uwbiostat/paper243/
42. M. Henke, R. Laszig, and C. Rübe, Erythropoietin to treat head and neck cancer patients with anemia undergoing radiotherapy: randomized, double-blind, placebo-controlled trial. Lancet. 2003; 362: 1255–1260.
43. B. Freidlin and E. L. Korn, Erythropoietin to treat anemia in patients with head and neck cancer [letter]. Lancet. 2004; 363: 81.
44. E. L. Korn and R. Simon, Data monitoring committees and problems of lower-than-expected accrual or event rates. Control Clin Trials. 1996; 17: 526–535.
45. Steering Committee of the Physicians' Health Study Research Group. Final report on the aspirin component of the ongoing Physicians' Health Study. N Engl J Med. 1989; 321: 129–135.
46. L. Friedman, C. D. Furberg, and D. L. DeMets, Fundamentals of Clinical Trials, 3rd ed. New York: Springer, 1999.
47. S. S. Ellenberg, T. R. Fleming, and D. L. DeMets, Data Monitoring Committees in Clinical Trials. Chichester, UK: Wiley, 2002.
48. A. Dmitrienko and M. D. Wang, Bayesian predictive approach to interim monitoring in clinical trials. Stat Med. 2006; 25: 2178–2195.
49. K. K. Lan and D. M. Zucker, Sequential monitoring of clinical trials: the role of information and Brownian motion. Stat Med. 1993; 12: 753–765.
50. P. K. Andersen, Conditional power calculations as an aid in the decision whether to continue a clinical trial. Control Clin Trials. 1987; 8: 67–74.
51. O. M. Bautista, R. P. Bain, and J. M. Lachin, A flexible stochastic curtailing procedure for the log-rank test. Control Clin Trials. 2000; 21: 428–439.

FURTHER READING

Excellent overviews of futility monitoring are provided in Friedman, Furberg, and DeMets (46) and in Ellenberg, Fleming, and DeMets (47). A review of recent developments in the Bayesian approach to futility monitoring is provided in Dmitrienko and Wang (48). A comprehensive review of the statistical methodology can be found in Whitehead (7), Jennison and Turnbull (8), and Emerson, Kittelson, and Gillen (41). A detailed discussion of information quantification in various outcome settings is provided in Lan and Zucker (49). Practical applications of stochastic curtailment for the time-to-event endpoint are presented in Andersen (50) and Bautista et al. (51).
CROSS-REFERENCES: Conditional power; Interim analysis; Group sequential designs; Stopping boundaries; Trial monitoring
GENERALIZED ESTIMATING EQUATIONS
JAMES W. HARDIN
University of South Carolina, Columbia, South Carolina

JOSEPH M. HILBE
Arizona State University, Tempe, Arizona

Parametric model construction specifies the systematic and random components of a model. Inference from maximum likelihood (ML)-based models relies on the validity of these specified components, and model construction proceeds from the specification of a likelihood based on some distribution to the implied estimating equation. In the case of ML models, an estimating equation is defined as the derivative of the log-likelihood function (with respect to one of the parameters of interest) set to zero, where there is one estimating equation for each unknown parameter. Solution of the (vector-valued) estimating equation then provides point estimates for each unknown parameter. In fact, the estimating equation is so called because its solution leads to such point estimates. Two obvious approaches to the analysis of correlated data include fixed-effects and random-effects models. Fixed-effects models incorporate a fixed increment to the model for each group (allowing group-specific intercepts), whereas random-effects models assume that the incremental effects from the groups are perturbations from a common random distribution; in such a model, the parameter (variance components) of the assumed random-effects distribution is estimated rather than the much larger collection of individual effects.

1 GENERALIZED LINEAR MODELS

The theory and an associated computational method for obtaining estimates in which the response variable follows a distribution from the single parameter exponential family was introduced in Nelder and Wedderburn (1). The authors introduced the term generalized linear models (GLMs), which refers to the entire class of models addressed in their proposed framework. The theoretical justification of and the practical application of GLMs have since been described in many articles and books; McCullagh and Nelder (2) is the classic reference. GLMs address a wide range of commonly used models, which include linear regression for continuous outcomes, logistic regression for binary outcomes, and Poisson regression for count data outcomes. A particular GLM requires specification of a link function to characterize the relationship of the mean response to a vector of covariates, and specification of a function to characterize the variance of the outcomes in terms of the mean. The derivation of the iteratively reweighted least squares algorithm appropriate for fitting GLMs begins with the likelihood specification for the single parameter exponential family of distributions. Within the usual iterative Newton-Raphson algorithm, an updated estimate of the coefficient vector can be computed via weighted ordinary least squares. This estimation is then iterated to convergence, for example, until the change in the estimated coefficient vector is smaller than some specified tolerance. The GLM response, with or without the conditioning of predictors, is a member of the single parameter exponential family of distributions described by

f(y) = exp{[yθ − b(θ)]/φ + c(y, φ)}

where θ is the canonical parameter, φ is a proportionality constant or scale, y is the response, and b(θ) is the cumulant, the moments of which describe the mean and the variance of the random response variable. After introducing covariates and associated regression parameters, this density function leads to an estimated p × 1 regression coefficient vector β estimated by solving the estimating equation

Ψ(β) = Σ_{i=1}^n Ψ_i = Σ_{i=1}^n X_i^T [∂µ_i/∂η_i] (y_i − µ_i)/[φ V(µ_i)] = 0_(p×1)    (1)
In the estimating equation, Xi is the ith row of an n × p matrix of covariates X, and µi = g−1(xi β) represents the mean or expected outcome E(yi) = b′(θi). µi is a transformation of the linear predictor ηi = xi β via a monotonic (invertible) link function g(·), and the variance V(µi) is a function of the expected value proportional to the variance of the outcome Var(yi) = φ V(µi). If the link-variance pair of functions coincide with those functions implied by a specific member of the exponential family of distributions, then the resulting coefficient estimates are equivalent to maximum likelihood estimates. However, data analysts are not limited to only those pairs of link and variance functions. When the selected variance and link functions do not coincide with the canonical form of a particular exponential family member distribution, the estimating equation is said to imply the existence of a quasilikelihood, and the resulting estimates are referred to as maximum quasilikelihood estimates.

2 THE INDEPENDENCE MODEL FOR CORRELATED DATA

When observations are clustered because of repeated measurements on the sampling unit or because the observations are grouped by identification of a cluster identifier variable, the model is written in terms of the observations yit for the clusters i = 1, . . . , n and the within-cluster (repeated) observations t = 1, . . . , ni. The total number of observations is then N = Σi ni. In this presentation, the clusters i are independent, but the within-cluster observations are assumed to be correlated. The individual-level model, which is otherwise known as the independence model, assumes no within-cluster correlation and is written by considering the n vector-valued observations yi as if they defined N independent observations. The independence model is a special case of correlated data models (such as those specified through GEEs). Although the independence model assumes that the repeated measures are independent, the model still provides consistent estimators in the presence of correlated data. This consistency is
paid for through inefficiency, but Glonek and McCullagh (3) show that loss of efficiency is not always large. As such, this simple model remains an attractive alternative because of its computational simplicity as well as its straightforward interpretation. In addition, the independence model serves as a reference model in the derivation of diagnostics for more sophisticated models for clustered and longitudinal data (such as GEE models).

3 ESTIMATING VARIANCE

The validity of the (naive) model-based variance estimators depends on the correct specification of the variance; in turn, this value depends on the correct specification of the working correlation model. A formal justification for an alternative estimator known as the sandwich variance estimator is given in Huber (4) and discussed at length in Hardin (5) and Hardin and Hilbe (6). Analysts can use the independence model to obtain point estimates along with standard errors based on the modified sandwich variance estimator to ensure that inference is robust to any type of within-cluster correlation. Although the inference regarding marginal effects is valid (assuming that the model for the mean is correctly specified), the estimator from the independence model is not efficient when the data are correlated. It should be noted that assuming independence is not always conservative: the model-based (naive) variance estimates based on the observed or expected Hessian matrix are not always smaller than those of the modified sandwich variance estimator. Because the sandwich variance estimator is sometimes called the robust variance estimator, this result may seem counterintuitive. However, this result is easily understood by assuming negative within-cluster correlation that leads to clusters with both positive and negative residuals. The cluster-wise sums of those residuals will be small, and the modified sandwich variance estimator, which squares the group-wise sums of residuals, will therefore yield smaller standard errors than the model-based Hessian variance estimators, which square and sum each residual.
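The practical effect of the sandwich estimator can be illustrated with a short sketch (not part of the original entry). It fits the independence working model as a Poisson GLM to simulated clustered counts and contrasts the model-based standard errors with cluster-robust (modified sandwich) standard errors. It assumes numpy and a reasonably recent statsmodels in which GLM.fit accepts cov_type="cluster".

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n_clusters, n_per = 100, 5
    ids = np.repeat(np.arange(n_clusters), n_per)
    x = rng.normal(size=n_clusters * n_per)
    # shared cluster effect induces within-cluster correlation
    u = np.repeat(rng.normal(scale=0.5, size=n_clusters), n_per)
    y = rng.poisson(np.exp(0.5 + 0.3 * x + u))
    X = sm.add_constant(x)

    naive = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    robust = sm.GLM(y, X, family=sm.families.Poisson()).fit(
        cov_type="cluster", cov_kwds={"groups": ids})

    print("model-based SEs:", naive.bse)    # ignore the clustering
    print("sandwich SEs:   ", robust.bse)   # robust to within-cluster correlation

With positively correlated clusters the sandwich standard errors are typically larger than the naive ones, in line with the discussion above.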
4 SUBJECT SPECIFIC VERSUS POPULATION-AVERAGED MODELS

Two main approaches are used to deal with correlation in repeated or longitudinal data. The population-averaged approach focuses on the marginal effects averaged across the individuals. The subject-specific approach focuses on the effects for given values of the random effects by fitting parameters of the assumed random-effects distribution. The population-averaged approach models the average response for observations that share the same covariates (across all of the clusters or subjects), whereas the subject-specific approach explicitly models the source of heterogeneity so that the fitted regression coefficients have an interpretation in terms of the individuals. The most commonly described GEE model was introduced in Liang and Zeger (7). This method is a population-averaged approach. Although it is possible to derive subject-specific GEE models, such models are not commonly supported in commercial software packages and so do not appear nearly as often in the literature. The basic idea behind the population-averaged approach is illustrated as follows. We initially consider the estimating equation for GLMs. The estimating equation, in matrix form, for the exponential family of distributions can be expressed as

Ψ(β) = Σ_{i=1}^n Ψ_i = Σ_{i=1}^n X_i^T D[∂µ_i/∂η_i] V^{−1}(µ_i) (y_i − µ_i)/φ
     = Σ_{i=1}^n X_i^T D[∂µ_i/∂η_i] V^{−1/2}(µ_i) I_(n×n) V^{−1/2}(µ_i) (y_i − µ_i)/φ = 0_(p×1)    (2)

This equation corresponds to the independence model we have previously discussed. However, the specification of the identity matrix between the factors of the variance matrix signals the point at which second-order variance (within-cluster correlation) can be introduced. Formally, Liang and Zeger (7) introduce a second estimating equation for the structural parameters of the working correlation
matrix. The authors then establish the properties of the estimators that result from the solution of these estimating equations. The GEE moniker was applied because the model is derived through a generalization of the GLM estimating equation; the second-order variance components are introduced directly into the estimating equation rather than appearing in consideration of a multivariate likelihood. Many major statistical software packages support estimation of these models, including R, SAS, S-PLUS, STATA, LIMDEP, SPSS, GENSTAT, and SUDAAN. R and S-PLUS users can easily find userwritten software tools for fitting GEE models, whereas such support is included in the other packages. 5 ESTIMATING THE WORKING CORRELATION MATRIX One should carefully consider the parameterization of the working correlation matrix because including the correct parameterization leads to more efficient estimates. We carefully consider this choice even if we employ the modified sandwich variance estimator for calculation of the standard errors of the regression parameters estimates. Although the use of the modified sandwich variance estimator assures robustness in the case of misspecification of the working correlation matrix, the advantage of more efficient point estimates is still worth the effort of trying to identify the correct structural constraints to place on the correlation matrix. No controversy surrounds the fact that the GEE estimates are consistent, but there is some controversy with regard to their efficiency. Typically, a careful analyst chooses some small number of candidate parameterizations. Pan (8) discusses the quasilikelihood information criterion measures for choosing between candidate parameterizations of the correlation matrix. This criterion measure is similar to the well-known Akaike information criterion. The most common choices for the working correlation R matrix are given by structural constraints that parameterize the elements of the matrix as provided in Table 1.
Table 1. Common correlation structures

Independent               Ruv = 0
Exchangeable              Ruv = α
Autocorrelated - AR1      Ruv = α^|u−v|
Stationary (k)            Ruv = α_|u−v| if |u − v| ≤ k, 0 otherwise
Nonstationary (k)         Ruv = α_uv if |u − v| ≤ k, 0 otherwise
Unstructured              Ruv = α_uv

Values are given for u ≠ v; Ruu = 1.
The independence model admits no extra parameters, and the resulting model is equivalent to a generalized linear model specification. The exchangeable correlation parameterization admits one extra parameter. The most general approach is to consider the unstructured (only imposing symmetry) working correlation parameterization that admits M(M − 1)/2 extra parameters, where M = maxi{ni}. The exchangeable correlation specification, which is the most commonly used correlation structure for GEEs, is also known as equal correlation, common correlation, and compound symmetry. The elements of the working correlation matrix are estimated using Pearson residuals, which are calculated following each iteration of model fit. Estimation alternates between estimating the regression parameters β, assuming that the current estimates of α are true, then obtaining residuals to update the estimate of α, and then using the new estimates of α to calculate updated parameter estimates, and so forth until convergence. GEE algorithms are typically estimated using the same routines used for estimation of GLMs with an additional subroutine called to update values of α. Estimation of GEE models using other correlation structures uses a similar methodology; only the properties of each correlation structure differ. A schematic representing how the foremost correlation structures appear is found below. Discussion on how the elements in each matrix are to be interpreted in terms of model fit can be found in Twisk (9) and Hardin and Hilbe (10).
Independent                     Exchangeable
1                               1
0   1                           p   1
0   0   1                       p   p   1
0   0   0   1                   p   p   p   1

Stationary or M-Dependent (2-DEP)        Autoregressive (AR-1)
1                                        1
p1  1                                    p    1
p2  p1  1                                p2   p    1
0   p2  p1  1                            p3   p2   p    1

Nonstationary (2-DEP)           Unstructured
1                               1
p1  1                           p1  1
p2  p3  1                       p2  p4  1
0   p4  p5  1                   p3  p5  p6  1
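As a complement to the schematics, the following minimal sketch (not from the original entry) shows the moment idea behind estimating the exchangeable parameter α from Pearson residuals. The function name is illustrative, and the simplification of omitting the usual small-sample adjustment for the number of estimated regression parameters is an assumption of the sketch.

    import numpy as np

    def exchangeable_alpha(pearson_resid, cluster_ids):
        """Moment estimate of the common within-cluster correlation alpha
        from Pearson residuals (simplified; production GEE code also adjusts
        for the number of estimated regression parameters)."""
        resid = np.asarray(pearson_resid, dtype=float)
        num, n_pairs = 0.0, 0
        for cid in np.unique(cluster_ids):
            r = resid[cluster_ids == cid]
            m = len(r)
            if m < 2:
                continue
            # sum of products over distinct pairs within the cluster
            num += (r.sum() ** 2 - (r ** 2).sum()) / 2.0
            n_pairs += m * (m - 1) // 2
        scale = (resid ** 2).mean()          # moment estimate of the dispersion
        return num / (n_pairs * scale)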
6 EXAMPLE The progabide data are commonly used as an example to demonstrate the various GEE correlation structures. The data are available in Thall and Vail (11). The data are from a panel study in which four 2-week counts of seizures were recorded for each epileptic patient. The response variable, or dependent variable of interest, is seizure, which is a count that ranges from 0 to 151. Explanatory predictors include time (1 = followup; 0 = baseline), progabide (1 = treatment; 0 = placebo), and time × prog (an interaction term). The natural log of the time period, which ranges from 2–8, is entered into the model as an offset.
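A hedged sketch of how a model of this type could be fit in Python with statsmodels' GEE implementation is given below. The data frame is a synthetic stand-in that only mimics the layout of the progabide data (59 ids, five periods, an offset for the log period); the variable names follow the article, but the numbers are simulated, so the output will not reproduce the tables that follow.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n_id, n_obs = 59, 5
    df = pd.DataFrame({
        "id": np.repeat(np.arange(n_id), n_obs),
        "time": np.tile([0, 1, 1, 1, 1], n_id),        # baseline then 4 follow-ups
        "progabide": np.repeat(rng.integers(0, 2, n_id), n_obs),
        "period": np.tile([8, 2, 2, 2, 2], n_id),      # observation period in weeks
    })
    df["timeXprog"] = df["time"] * df["progabide"]
    df["lnPeriod"] = np.log(df["period"])
    frailty = np.repeat(rng.normal(scale=0.4, size=n_id), n_obs)
    df["seizures"] = rng.poisson(df["period"] * np.exp(1.0 - 0.1 * df["timeXprog"] + frailty))

    model = smf.gee("seizures ~ time + progabide + timeXprog", groups="id", data=df,
                    family=sm.families.Poisson(),
                    cov_struct=sm.cov_struct.Exchangeable(),   # or Independence(), Autoregressive()
                    offset=df["lnPeriod"])
    result = model.fit()
    print(result.summary())              # robust (sandwich) standard errors by default
    print(model.cov_struct.dep_params)   # estimated working-correlation parameter

Swapping the cov_struct argument is all that is needed to move between the correlation structures compared in this section.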
GENERALIZED LINEAR MODEL

seizures     Coef.       Robust Std. Err.   z       P>|z|   [95% Conf. Interval]
time         .111836     .1169256           0.96    0.339   −.1173339    .3410059
progabide    .0275345    .2236916           0.12    0.902   −.410893     .465962
timeXprog    −.1047258   .2152769           −0.49   0.627   −.5266608    .3172092
cons         1.347609    .1587079           8.49    0.000   1.036547     1.658671
lnPeriod     (offset)
GEE MODEL ASSUMING INDEPENDENCE

seizures     Coef.       Robust Std. Err.   z       P>|z|   [95% Conf. Interval]
time         .111836     .1169256           0.96    0.339   −.1173339    .3410059
progabide    .0275345    .2236916           0.12    0.902   −.410893     .465962
timeXprog    −.1047258   .2152769           −0.49   0.627   −.5266608    .3172092
cons         1.347609    .1587079           8.49    0.000   1.036547     1.658671
lnPeriod     (offset)
In all, 59 patients participated in the study. Patients are identified with the variable identification (id). A robust cluster or modified sandwich variance estimator, clustered on id, is applied to the standard errors of the parameter estimates. Because the response is a count, we use a Poisson regression to model the data. We first model the data using a generalized linear model, which assumes that the observations are independent. Because five observations were obtained per patient, the data are likely correlated (on id); therefore, we use a robust variance estimator clustered on id to adjust the standard errors for the extra correlation. We subsequently model the data using GEE with various correlation structures. Each type of structure attempts to capture the correlation in a specific manner. Only the table of parameter estimates and associated statistics is shown for each model. GEE models have an accompanying within-id correlation matrix displayed to show how the model is adjusted.
The independence model is identical to the generalized linear model; that is, the correlation in the data caused by the clustering effect of id is not adjusted by an external correlation structure. It is only adjusted by means of a robust variance estimator. The exchangeable correlation matrix is nearly always used with clustered, nonlongitudinal data. A single correlation parameter is associated with this structure, which means that each cluster is assumed to be internally correlated in a similar manner. Subjects are assumed to be independent. The autoregressive correlation structure assumes that within time intervals, the correlation coefficients decrease in value based on the respective power increase in measurement. The values are the same for each respective off diagonal. Theoretically, the second-level diagonal values are the squares of the first-level diagonals. The third-level diagonals are the cube of the first, with a likewise increase in power for each larger diagonal. A large working correlation matrix,
GEE MODEL ASSUMING EXCHANGEABLE CORRELATION

seizures     Coef.       Semi-robust Std. Err.   z       P>|z|   [95% Conf. Interval]
time         .111836     .1169256                0.96    0.339   −.1173339    .3410059
progabide    .0275345    .2236916                0.12    0.902   −.410893     .465962
timeXprog    −.1047258   .2152769                −0.49   0.627   −.5266608    .3172092
cons         1.347609    .1587079                8.49    0.000   1.036547     1.658671
lnPeriod     (offset)
Estimated within-id correlation matrix R:

      c1      c2      c3      c4      c5
r1    1.0000
r2    0.7767  1.0000
r3    0.7767  0.7767  1.0000
r4    0.7767  0.7767  0.7767  1.0000
r5    0.7767  0.7767  0.7767  0.7767  1.0000
GEE MODEL ASSUMING AUTOREGRESSIVE (AR-1) CORRELATION

seizures     Coef.       Semi-robust Std. Err.   z       P>|z|   [95% Conf. Interval]
time         .1522808    .1124191                1.35    0.176   −.0680567    .3726183
progabide    .019865     .2135299                0.09    0.926   −.3986458    .4383758
timeXprog    −.1292328   .2620191                −0.49   0.622   −.6427809    .3843153
cons         1.3128      .1631003                8.05    0.000   .9931291     1.632471
lnPeriod     (offset)
Estimated within-id correlation matrix R:

      c1      c2      c3      c4      c5
r1    1.0000
r2    0.8102  1.0000
r3    0.6565  0.8102  1.0000
r4    0.5319  0.6565  0.8102  1.0000
r5    0.4309  0.5319  0.6565  0.8102  1.0000
which represents more within-group observations, will have increasingly small values in the extreme off diagonals. If the actual correlation structure varies considerably from the theoretical, then one should investigate using another structure. The nonstationary correlation structure is like the stationary except that the offdiagonal values are not the same. We would want to use these structures if we want to account for measurement error at each time period or measurement level. We observe above the same stopping point as in the stationary structure.
The unstructured correlation structure assumes that all correlations are different. This structure usually fits the model the best, but it does so by losing its interpretability—especially for models that have more than three predictors. Note that the number of unstructured correlation coefficients is determined by the maximum number of within-cluster observations M rather than by the number of predictors; the number of coefficients equals M(M − 1)/2.
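Choosing among these parameterizations can be guided by the quasilikelihood information criterion mentioned earlier (Pan, reference 8). A brief sketch of such a comparison, not part of the original entry and assuming a statsmodels version whose GEE results expose a qic() method, could look as follows (the data frame df is the synthetic stand-in introduced above).

    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    structures = {
        "independence": sm.cov_struct.Independence(),
        "exchangeable": sm.cov_struct.Exchangeable(),
    }
    for name, cs in structures.items():
        res = smf.gee("seizures ~ time + progabide + timeXprog", groups="id",
                      data=df, family=sm.families.Poisson(), cov_struct=cs,
                      offset=df["lnPeriod"]).fit()
        # qic() returns QIC-type criteria in recent statsmodels releases;
        # smaller values indicate a preferred working-correlation choice.
        print(name, res.qic())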
GEE MODEL ASSUMING STATIONARY(2) or M-DEPENDENT(2) CORRELATION

seizures     Coef.       Semi-robust Std. Err.   z       P>|z|   [95% Conf. Interval]
time         .0866246    .1739279                0.50    0.618   −.2542677    .4275169
progabide    .0275345    .2236916                0.12    0.902   −.410893     .465962
timeXprog    −.1486518   .2506858                −0.59   0.553   −.639987     .3426833
cons         1.347609    .1587079                8.49    0.000   1.036547     1.658671
lnPeriod     (offset)
Estimated within-id correlation matrix R:

      c1      c2      c3      c4      c5
r1    1.0000
r2    0.8152  1.0000
r3    0.7494  0.8152  1.0000
r4    0.0000  0.7494  0.8152  1.0000
r5    0.0000  0.0000  0.7494  0.8152  1.0000
GEE MODEL ASSUMING NONSTATIONARY(2) CORRELATION

seizures     Coef.       Semi-robust Std. Err.   z       P>|z|   [95% Conf. Interval]
time         .0866246    .1739279                0.50    0.618   −.2542677    .4275169
progabide    .0275345    .2236916                0.12    0.902   −.410893     .465962
timeXprog    −.1486518   .2506858                −0.59   0.553   −.639987     .3426833
cons         1.347609    .1587079                8.49    0.000   1.036547     1.658671
lnPeriod     (offset)
Estimated within-id correlation matrix R:

      c1      c2      c3      c4      c5
r1    1.0000
r2    0.9892  1.0000
r3    0.7077  0.8394  1.0000
r4    0.0000  0.9865  0.7291  1.0000
r5    0.0000  0.0000  0.5538  0.7031  1.0000
It is possible to use other correlation structures with GEE models, but the ones shown above are the most common. More details on these structures can be found in Hardin and Hilbe (10).

7 CONCLUSION
GEE models are an extension of generalized linear models. GLMs are based on likelihoods (or quasilikelihoods), which assume the independence of observations in the model. When this assumption is violated because of clustering or longitudinal effects, then an appropriate adjustment needs to be made to the
model to accommodate the violation. GEE is one such method; fixed and random effects models are alternatives to GEE. Likewise, hierarchical and mixed models have been used to deal with clustering and longitudinal effects, which bring extra correlation into the model. The GEE approach is generally known as a population-averaging approach in which the average response is modeled for observations
GEE MODEL ASSUMING UNSTRUCTURED CORRELATION

seizures     Coef.       Semi-robust Std. Err.   z       P>|z|   [95% Conf. Interval]
time         .0826525    .1386302                0.60    0.551   −.1890576    .3543626
progabide    .0266499    .224251                 0.12    0.905   −.4128741    .4661738
timeXprog    −.1002765   .2137986                −0.47   0.639   −.5193139    .318761
cons         1.335305    .1623308                8.23    0.000   1.017142     1.653467
lnPeriod     (offset)
Estimated within-id correlation matrix R:

      c1      c2      c3      c4      c5
r1    1.0000
r2    0.9980  1.0000
r3    0.7149  0.8290  1.0000
r4    0.8034  0.9748  0.7230  1.0000
r5    0.6836  0.7987  0.5483  0.6983  1.0000
across clusters and longitudinal subjects. This method varies from the subject-specific approach, which provides information concerning individual observations rather than averages of observations. The method has been used extensively in clinical research and trials, and it is widely available in the leading software packages. REFERENCES 1. J. A. Nelder and R. W. M. Wedderburn, Generalized linear models. J. Royal Stat. Soc. Series A 1972; 135: 370–384. 2. P. McCullagh and J. A. Nelder, Generalized Linear Models, 2nd ed. London: Chapman & Hall, 1989. 3. G. F. V. Glonek and R. McCullagh, Multivariate logistic models. J. Royal Stat. Soc. Series B 1995; 57: 533–546. 4. P. J. Huber, The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. Berkeley, CA: University of California Press, 1967, pp. 221–223. 5. J. W. Hardin, The sandwich estimate of variance. In: T. B. Fomby and R. C. Hill (eds.), Advances in Econometrics, Vol. 17. 2006, pp. 45–73. 6. J. W. Hardin and J. M. Hilbe, Generalized Linear Models and Extensions, 2nd ed. College Station, TX: Stata Press, 2007.
7. K.-Y. Liang and S. L. Zeger, Longitudinal data analysis using generalized linear models. Biometrika 1986; 73: 13–22. 8. W. Pan, Akaike’s information criterion in generalized estimating equations. Biometrics 2001; 57: 120–125. 9. J. W. R. Twisk, Applied Longitudinal Data Analysis for Epidemiology: A Practical Guide. Cambridge, UK: Cambridge University Press, 2003. 10. J. W. Hardin and J. M. Hilbe, Generalized Estimating Equations. Boca Raton, FL: Chapman & Hall/CRC Press, 2002. 11. P. F. Thall and S. C. Vail, Some covariance models for longitudinal count data with overdispersion. Biometrics 1990; 46: 657–671.
CROSS-REFERENCES: Generalized linear models; Mixed-effects models; Sandwich variance estimator
GENERALIZED LINEAR MODELS
GÖRAN KAUERMANN
University Bielefeld, Postfach 300131, Bielefeld, Germany

JOHN NORRIE
University of Aberdeen, Health Services Research Unit, Polwarth Building, Foresterhill, Aberdeen, Scotland, United Kingdom

1 INTRODUCTION

1.1 Motivation

Generalized linear models (GLMs) provide a flexible and commonly used tool for modeling data in a regression context. The unifying approach traces back to Nelder and Wedderburn (1), although its general breakthrough was about a decade later, initiated by the book of McCullagh and Nelder (2). The availability of software to fit the models increased the acceptance of GLMs (see Reference 3). The approach has had a major impact on statistical modeling technology with research on a wide range of extensions continuing today. Recent references include Myers, et al. (4) and Dobson (5), which are of an introductory level, and Fahrmeir and Tutz (6), which concentrates on extensions toward multivariate models. For a concise overview of GLMs, see also Firth (7). For clarity of presentation, the generalization of regression models starts with the classic linear model, which can be written as

y = β0 + xβx + ε    (1)

with y as response, x as some explanatory variable(s), and ε as the residual having mean zero. Model (1) plays a central role in statistics and mathematics, developed in the early nineteenth century by Gauß and Legendre (see Reference 8). A more recent reference is Stuart and Ord (9). This statistical model comprises two components: a structural and a random part. The structural part is given by η = β0 + xβx and models the functional relationship between y and x. As η is linear in the parameters, it is also called the linear predictor. Beside the structural part, the model contains the residual ε capturing all variation that is not included in η, which is the stochastic variability of y. Conventionally, one assumes normality for ε with zero mean and variance σ². With this in mind, the structural part of model (1) can be written as

µ = E(y|x) = η    (2)

That is, the mean of the response y given the explanatory covariate x is related in a linear fashion to x through the linear predictor η. Even though Equation (1) is a cornerstone in statistics, it is not an appropriate model if normality of the residuals does not hold. This applies, for instance, if y is a binary response variable taking values 0 and 1. To accommodate such a response, data model (1) needs to be generalized, and the decomposition of the structural and the stochastic part plays the key role in the GLM. For the stochastic part, one assumes that y for a given linear predictor η follows an exponential family distribution. This step provides an elegant mathematical framework and embraces several commonly used distributions, including the normal (or Gaussian), binomial, Poisson, exponential, gamma, and inverse normal distributions. Generalization of the structural part results as follows. Note that for normal response data, both sides of Equation (2) take values in the real numbers. This is not the case for other distributions. If, for instance, y is Binomially distributed, then E(y|x) takes values in [0, 1], so that the set of possible values for the left- and right-hand sides of Equation (2) differ. To overcome this problem, one introduces a transformation function g(·) called the link function and generalizes (2) to

g(µ) = g{E(y|x)} = η    (3)

The link function g(·) guarantees that the possible values of both sides of Equation (3) remain the real numbers. For a Binomially distributed response, a convenient choice for g(·) is the logit link g(µ) = logit(µ) = log(µ) − log(1 − µ), which uniquely transforms the set
[0, 1] into the real numbers. Alternatively, the probit link g(µ) = −1 (µ) can be used, with (·) as standard normal distribution function. If the response follows a Poisson distribution, as is often the case for count data, E(y|x) is a positive number. In this case, a suitable link is taking g(·) as the log function. Both examples will be revisited later. Note that classic linear regression is included by taking g(·) as the identity function.
1.2 Example: Logistic Regression

In a randomized controlled trial, the relapse in 144 schizophrenics was observed. Let y = 1 indicate the occurrence of a relapse measured by the positive and negative symptom score (PANSS) psychopathology scale, and let y = 0 otherwise. In this trial, there were two randomized treatments, cognitive behavioral therapy (CBT) on top of treatment as usual (TAU) against treatment as usual. The 144 schizophrenics (who fulfilled the American Psychiatric Association DSM-IV [1994] criteria for schizophrenia) were all receiving antipsychotic medication and considered relapse prone (relapse within last 2 years or living in stressful environment or alone or problems with taking antipsychotic medication). The PANSS is a 30-item observer-rated scale, scored as 1 (no psychopathology) to 7 (severe). The first seven items comprise the positive score, and a relapse was defined as admission to hospital or a 50% increase in positive score sustained for 1 week (for those with a positive item at baseline ≥3) or reaching a positive item ≥3 (if no positive item ≥3 at baseline). Further details of the methods and results are reported by Gumley et al. (10). Note that in that report, the relapse outcome was modeled using Cox regression, incorporating the time to relapse. The exposition here uses logistic regression, ignoring the time to relapse and concentrating on the occurrence of the event within the first 12 months only. During this 12-month follow-up, 13 out of 72 (18.1%) of those randomized to CBT + TAU relapsed, compared with 25 out of 72 (34.7%) in the TAU group. A logistic regression model was fitted with

logit{E(y|x)} = β0 + xβx

where x = 1 indicates the CBT + TAU group and x = 0 is the TAU group as reference category. The parameter estimates (with standard errors in brackets) are β̂0 = −0.631 (0.247) and β̂x = −0.881 (0.393). For binary response, the odds ratio is a nicely interpretable quantity that results via

exp(βx) = exp{logit(E(y = 1|x = 1)) − logit(E(y = 1|x = 0))}
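A minimal sketch (not part of the original article) shows how the reported 2 × 2 counts reproduce these estimates with a grouped-data logistic regression; it assumes statsmodels, whose Binomial family accepts a two-column (events, non-events) response.

    import numpy as np
    import statsmodels.api as sm

    # Relapse counts reported above: 13/72 on CBT + TAU, 25/72 on TAU alone.
    relapse = np.array([[13, 72 - 13],    # x = 1 (CBT + TAU): [relapsed, not relapsed]
                        [25, 72 - 25]])   # x = 0 (TAU)
    X = sm.add_constant(np.array([1.0, 0.0]))

    fit = sm.GLM(relapse, X, family=sm.families.Binomial()).fit()
    print(fit.params)                 # approximately beta0 = -0.631, betax = -0.881
    print(np.exp(fit.params[1]))      # odds ratio, approximately 0.414
    print(np.exp(fit.conf_int()[1]))  # approximately (0.191, 0.896)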
Here, the odds ratio of relapse on CBT + TAU against TAU is estimated by 0.414 with confidence interval (0.191, 0.896). This example is revisited later.

2 GENERALIZED LINEAR MODELS

2.1 Modeling

2.1.1 Stochastic Component. For the stochastic part, one assumes that response y given predictor η = g(µ), with g(·) the link function and µ the mean of y given the explanatory covariates, follows the exponential family distribution. The likelihood can be written in the form

L(µ, φ) = exp{[yθ − b(θ)]/(wφ) + h(y, φ)}    (4)

where b(·) and h(·) are known functions, φ and w are dispersion parameters, and θ = θ(µ) is called the natural or canonical parameter. The dispersion is thereby decomposed into two parts: a known weight w, which might vary between observations, and a dispersion parameter φ (sometimes referred to as a nuisance parameter), which is the same for all observations. The latter depends on the underlying distribution and is either known or unknown. The natural parameter θ and the mean µ are related via the first-order partial derivative of b(·) with respect to θ; that is,

∂b(θ)/∂θ = µ(θ)    (5)
Thus, the natural parameter θ determines the mean value µ and vice versa. As the function is invertible, θ can also be written in dependence of µ as θ (µ).
Important and commonly used statistical distributions belong to the exponential family [Equation (4)]. Taking, for instance, y ∈ {0, 1} as binary response, one obtains the likelihood function

L(µ, φ) = µ^y (1 − µ)^(1−y) = exp{yθ − b(θ)}    (6)

with θ = log{µ/(1 − µ)} and b(θ) = log{1 + exp(θ)}. The joint likelihood for m independent identically distributed binary responses yi (a Binomial distribution) is written as

L(µ, φ) = exp{[ȳθ − b(θ)]/(1/m) + log C(m, mȳ)},    (7)

where C(m, mȳ) denotes the binomial coefficient, ȳ = Σi yi/m, and w = 1/m is the weight. For a Poisson distributed response variable y with mean µ, the likelihood can be written as

L(µ, φ) = µ^y exp{−µ}/y! = exp{y log(µ) − µ − log(y!)}    (8)

so that θ = log(µ) and b(θ) = exp(θ). Finally, for Normally distributed response y, one gets L(µ, φ) = exp{(2yµ − µ²)/(2σ²) + h(y, σ²)} with h(·) collecting terms that are not dependent on the mean, θ = µ, and b(θ) = θ²/2 = µ²/2. Table 1 gives an overview of commonly used distributions belonging to Equation (4).

2.1.2 Variance Function. The exponential family implies a dispersion structure for response y. By standard results from exponential families, one finds by differentiation

[1/(wφ)] Var(y|x) = ∂²b(θ)/∂θ² =: v(µ)    (9)
with v(µ) also called the variance function, which captures the dependence of the variance of y on the mean value. For binomial response, the variance function is v(µ) = µ (1 − µ), whereas for Poisson data, it is v(µ) = µ. For normal responses v(µ) = 1, reflecting that the variance does not depend on the mean. The role of v(µ) is then to allow variance heterogeneity to be automatically incorporated into the estimation.
2.1.3 Structural Component. For the structural part of the model, the mean value of y is linked to the linear predictor η via the link function g(·). This function must be appropriately chosen so that both sides of Equation (2) have the same range of possible values. The choice of g(·) can therefore in principle be made in isolation from the stochastic part with the only requirement being that g(·) maps the space of mean values for y to the real numbers and vice versa. However, natural candidates for the link function are suggested by the stochastic part of the model. Such link functions are called natural or canonical links. They are obtained by setting θ = η in Equation (4); that is, the linear predictor is set to equal the natural parameter of the exponential family. This result gives µ(θ ) = µ(η) from Equation (5), which is mathematically as well as numerically a convenient choice. Natural links enjoy widespread use in practice, with the logit link µ(θ ) = µ(η) = exp(η)/(1 + exp(η)) the natural choice for the Binomial response, whereas for Poisson response, the natural link is µ(θ ) = exp(θ ). For a normally distributed response y, the natural link is the identity function, so that the classic regression model (1) results as a special case. 2.2 Estimation 2.2.1 Maximum Likelihood Estimation. Taking the logarithm of Equation (4) yields the log likelihood l(µ, φ) = {yθ − b(θ )}/(wφ) + h(y, φ).
(10)
It is notationally convenient to parameterize the likelihood by β instead of µ. Assuming independent observations yi, i = 1, . . . , n, with corresponding covariates xi, one obtains from the individual log likelihood contributions [Equation (10)] the log likelihood function

l(β, φ) = Σ_{i=1}^n l_i(β, φ) = Σ_{i=1}^n [{y_i θ_i − b(θ_i)}/(w_i φ) + h(y_i, φ)]    (11)
Table 1. Examples of Exponential Family Distributions

Distribution                    θ(µ)              b(θ)              µ(θ) = b′(θ)           v(µ) = b″(θ)    φ
Normal N(µ, σ²)                 µ                 θ²/2              θ                      1               σ²
Binomial B(1, µ)                log{µ/(1 − µ)}    log(1 + exp(θ))   exp(θ)/(1 + exp(θ))    µ(1 − µ)        1
Poisson P(µ)                    log(µ)            exp(θ)            exp(θ)                 exp(θ)          1
Gamma G(µ, ν)                   −1/µ              −log(−θ)          −1/θ                   µ²              ν⁻¹
Inverse Normal IN(µ, σ²)        1/µ²              −(−2θ)^(1/2)      (−2θ)^(−1/2)           µ³              σ²
with θi = θ(µi) = θ{g−1(ηi)} and linear predictor ηi = xi β. The maximum likelihood estimate for β is now obtained by maximizing Equation (11), which is achieved by setting the first derivative of l(β, φ) to zero. Simple differentiation using the chain rule leads to the score equation

0 = Σ_{i=1}^n (∂η̂_i/∂β)(∂µ̂_i/∂η)(∂θ̂_i/∂µ)(y_i − µ̂_i)    (12)
with η̂i = β̂0 + xi β̂x, µ̂i = g−1(η̂i), and θ̂i = θ(µ̂i). If g(·) is chosen as natural link, one has θ = η so that (12) becomes

0 = Σ_{i=1}^n (∂η̂_i/∂β)(y_i − µ̂_i).    (13)
Note that η = (1, x)β so that ∂η/∂β = (1, x)^T. Hence, for natural links, Equation (13) has a simple structure. Moreover, with normal response y and natural link, Equation (13) reads as X^T(Y − Xβ), with X^T = ((1, x1)^T, . . . , (1, xn)^T) and Y = (y1, . . . , yn), which is easily solved analytically by β̂ = (X^T X)^(−1) X^T Y. In general, however, µi = µ(θi) is a nonlinear function so that even for natural links, no analytic solution of Equation (13) will be available. Iterative fitting, for instance, based on Fisher scoring, is therefore required.

2.2.2 Fisher Scoring. A solution of Equation (12) can be found following a Newton-Raphson strategy. Taking β^(0) as starting value, then obtain an update β^(t+1) from β^(t), t = 0, 1, 2, . . ., from the iterative procedure

β^(t+1) = β^(t) − [∂²l(β^(t), φ)/∂β∂β^T]^(−1) ∂l(β^(t), φ)/∂β.    (14)

In practice, for ease of computation, it is convenient to replace the second-order derivative in Equation (14) by its expectation, otherwise known as the Fisher matrix. The resulting estimation routine is therefore called Fisher scoring. The procedure is sometimes also called iterative reweighted least squares, because the second-order derivative as well as the first-order derivative contain dispersion weights v(µi), which depend on the parameter β and have to be updated iteratively using weighted least squares in each step.
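The following short sketch (an illustration, not code from the article) implements Fisher scoring for the canonical logit link with numpy, mirroring Equations (13), (14), and (17); for a canonical link the observed and expected second derivatives coincide, so Newton-Raphson and Fisher scoring agree.

    import numpy as np

    def fisher_scoring_logit(X, y, n_iter=25, tol=1e-10):
        """Fisher scoring / IRLS for logistic regression (canonical logit link).
        X: n x p design matrix including an intercept column, y: 0/1 responses."""
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            eta = X @ beta
            mu = 1.0 / (1.0 + np.exp(-eta))      # inverse logit
            W = mu * (1.0 - mu)                  # variance function v(mu), phi = w = 1
            score = X.T @ (y - mu)               # score, cf. Equation (13)
            fisher = X.T @ (X * W[:, None])      # Fisher matrix, cf. Equation (17)
            step = np.linalg.solve(fisher, score)
            beta = beta + step                   # update, cf. Equation (14)
            if np.max(np.abs(step)) < tol:
                break
        return beta

    # small simulated check
    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(200), rng.normal(size=200)])
    y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.0 * X[:, 1]))))
    print(fisher_scoring_logit(X, y))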
2.3 Example: Multiple Logistic Regression

The subsequent example is taken from the textbook Brown (11). For 53 patients, nodal involvement of tumor was recorded as response variable y, indicating whether the tumor had spread to neighboring lymph nodes (y = 1) or not (y = 0). Explanatory variables are all binary, xage indicating whether the patient is 60 years or older (=1), xstage equal to 1 if the tumor has been classified as serious (=0 otherwise), xgrade describing the pathology of the tumor (=1 if assessed as serious, =0 otherwise), xxray if X-ray of the tumor led to a serious classification, and finally xacid indicating whether the serum acid phosphatase level exceeds the value 0.6. Parameter estimates of the multivariable model are shown in Table 2, with positive estimates indicating a tendency for an increased likelihood of tumor spread. The t-value in Table 2 is calculated by dividing the estimate by its standard deviation. The significance of the covariate can be assessed by comparing the t-value to standard normal distribution quantiles. By so doing, one should keep in mind the
Table 2. Parameter Estimates and Standard Deviation for Logistic Regression Example

Covariate    Estimate    Standard Deviation    t-Value
Intercept    −3.08       0.98                  −3.13
age          −0.29       0.75                  −0.39
stage        1.37        0.78                  1.75
grade        0.87        0.81                  1.07
xray         1.80        0.81                  2.23
acid         1.68        0.79                  2.13
Table 3. Parameter Estimates and Standard Deviation for Poisson Example

        model with linear time    model with quadratic time
β0      5.178 (0.046)             5.011 (0.076)
βt      −0.0628 (0.0057)          0.0081 (0.0238)
βtt     –                         −0.0043 (0.0015)
effects shown are conditional effects, meaning given the values of the other covariates. For instance, age exhibits a negative effect given the explanatory information contained in stage, grade, xray, and acid. It seems that some covariates show a significant effect, with their t-values taking values larger than 2 in absolute terms, whereas, for instance, age does not show a significant effect, again conditional on the other variates in the model (stage, grade, xray, acid). A variable selection to classify covariate effects as significant and nonsignificant will be demonstrated in a subsequent section.
2.4 Poisson Regression

Poisson regression can be useful for responses that are counts or frequencies. Consider, for example, the number of deaths from injury in childhood reported to the Registrar General in Scotland between 1981 and 1995. Morrison et al. (12) give full details of this data and explore socioeconomic differentials. Here the data are considered as overall number of deaths by year. Figure 1 shows the number of deaths per year, indicating a clear decline over the 15 years of observation, from a high of 173 in the late 1980s to a low of just over 50 in 1995. The number of deaths in any year is modeled by a Poisson distribution with mean µ as a function of calendar year T, starting with 1981 (T = 1) to 1995 (T = 15). The simplest model would relate time T to the mean in a linear fashion: µ = exp(β0 + tβt). In addition, a more complicated function, such as a quadratic function, can be used to accommodate nonlinearity: µ = exp(α + βt t + t²βtt). Table 3 provides the corresponding parameter estimates. The resulting log-linear and log-quadratic fits are plotted in Fig. 1. The example is picked up again later.

3 INFERENCE

As observed in the above three examples, it is necessary to have inferential tools at hand to select an appropriate model. This model contains only those covariates that need to be in the model (1) either because they are prognostically valuable or they are necessary for external reasons, or (2) they show significant effect. The following chapter discusses alternatives for the assessment of the latter point (2).
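Because the yearly counts of the Section 2.4 example are reproduced in Figure 1, the Poisson fits of Table 3 can be approximated with a few lines of Python. The sketch below is not part of the original article and should give estimates close to, though not necessarily identical with, those reported; it assumes statsmodels.

    import numpy as np
    import statsmodels.api as sm

    # Yearly counts read from Figure 1 (T = 1 corresponds to 1981).
    deaths = np.array([135, 148, 143, 173, 148, 123, 124, 105, 88, 82, 120, 70, 88, 67, 56])
    t = np.arange(1, 16)

    X_lin = sm.add_constant(t)
    X_quad = sm.add_constant(np.column_stack([t, t ** 2]))

    fit_lin = sm.GLM(deaths, X_lin, family=sm.families.Poisson()).fit()
    fit_quad = sm.GLM(deaths, X_quad, family=sm.families.Poisson()).fit()

    print(fit_lin.params)    # expected to be close to Table 3: 5.178, -0.0628
    print(fit_quad.params)   # expected to be close to Table 3: 5.011, 0.0081, -0.0043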
3.1 Variance of Estimates

3.1.1 Fisher Matrix. Standard likelihood theory can be applied to obtain variances for the maximum likelihood estimates β̂ = (β̂0, β̂x), which requires the calculation of the Fisher matrix by second-order differentiation of Equation (11). As ∂η/∂β = (1, x)^T does not depend on β, one finds for the second-order derivative of l(β) an expression consisting of two components:

∂²l(β)/(∂β∂β^T) = − Σ_{i=1}^n (∂η_i/∂β)(∂θ_i/∂η)[v(µ_i)/(w_i φ)](∂θ_i/∂η)(∂η_i/∂β^T)    (15)
                  + Σ_{i=1}^n (∂η_i/∂β)(∂²θ_i/∂η²)(∂η_i/∂β^T)(y_i − µ_i)/(w_i φ)    (16)
The first component in Equation (15) is a deterministic term, whereas Equation (16) is a stochastic one. Keeping in mind that
[Figure 1. Accidental deaths in children in Scotland 1981 (T = 1) to 1995 (T = 15); the plot shows the observed counts together with the fitted (log)linear and (log)quadratic curves.]

T        1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
Deaths   135  148  143  173  148  123  124  105  88   82   120  70   88   67   56
E(yi|xi) = µi, the stochastic term has mean zero so that the Fisher matrix equals

F(β) = E[−∂²l(β)/(∂β∂β^T)] = Σ_{i=1}^n (∂η_i/∂β)(∂θ_i/∂η)[v(µ_i)/(w_i φ)](∂θ_i/∂η)(∂η_i/∂β^T)    (17)
For natural link functions, the structure nicely simplifies because with θ (η) = η, one has ∂θi /∂η = 1 and ∂ 2 θi /∂η2 = 0. Hence, Equation (16) vanishes and Equation (15) has the simple structure
F(β) = Σ_{i=1}^n (∂η_i/∂β)[v(µ_i)/(w_i φ)](∂η_i/∂β^T)

In particular, in this case, the expected and the observed Fisher matrices coincide. From standard likelihood theory, we know that the variance of the maximum likelihood estimate β̂ is asymptotically equal to the inverse of the Fisher matrix, that is, Var(β̂) ≈ F(β)^(−1), and the central limit theorem provides asymptotic normality:

β̂ ∼ N(β, F(β)^(−1))    (asymptotically)
Therefore, standard statistical methods can be used for the parameter estimation. 3.1.2 Dispersion Parameter. The variance formulas above contain the dispersion parameter φ. Dependent on the stochastic model
being used, φ is either known or unknown. For instance, for binomial as well as for Poisson responses, one has φ = 1, which follows directly from Equation (6) and Equation (8), respectively. Hence, estimation of φ is not necessary. In contrast, for normally distributed response, component φ is the residual variance that is typically unknown and has to be estimated from the data. In principle, this estimation can be done by maximum likelihood theory based on the likelihood function (11). For the general case, however, maximum likelihood estimation of φ is not recommended (see Reference 2, p. 295) and instead a moment-based approach should be preferred. Based on variance formula (9), an estimate is found by

φ̂ = (1/n) Σ_{i=1}^n (y_i − µ̂_i)²/[w_i v(µ̂_i)].    (18)
For normal response models, the momentbased estimate is identical to the maximum likelihood estimate. This case is, however, special, and it does not hold in general. Moreover, replacing the denominator in Equation (18) by n − p, with p as number of parameters, reduces the bias occurring because of the use of fitted residuals (see Reference 2 for details). 3.2 Variable Selection It can be desirable within a statistical model with many explanatory covariates to only include covariates that make a substantive contribution to the understanding of the
relationship between the covariates and the response. This question is one of parsimony, that is, to check whether all covariates are needed or whether a model with less covariates fits the data similarly well. A large amount of literature on variable selection in regression is available, with a particularly readable overview given by Harrel (13). Also, the issue of model inconsistency should be considered, particularly in the nonlinear model (see Reference 14) as one adds and subtracts covariates from the model. The problem can be tackled by testing the significance of subsets of covariate effects, and there are three different possibilities in common use. All three methods coincide for normal response y, and they are asymptotically equivalent for the general case. 3.2.1 Wald Statistics. Let the set of explanatory variables contained in x be split into xa and xb , that is, x = (xa , xb ). Model (3) is rewritten as g(µ) = xa βa + xb βb
(19)
For notational convenience, the intercept is included in xa. The focus of interest is to test the hypothesis βb = βb0, with βb0 some given vector. The primary focus is usually on testing βb0 = 0. Taking advantage of standard likelihood theory, one can test this hypothesis using a Wald Statistic. Let the Fisher matrix F(β) be decomposed to

F(β) = [ Faa(β)  Fab(β)
         Fba(β)  Fbb(β) ]

with submatrices matching to the dimensions of βa and βb. The Wald Statistic is then defined as

wb = (β̂b − βb0)^T (F^bb(β))^(−1) (β̂b − βb0)    (20)
with F^bb(β) as the bottom right matrix of F(β)^(−1). For βb = βb0, the test statistic follows asymptotically a chi-squared distribution with p degrees of freedom; i.e., wb ∼ χ²_p, where p is the dimension of βb.

3.2.2 Score Statistics. A second test statistic is available from the score contributions, that is, the first derivatives of the
likelihood function. Let β̃ = (β̃a, βb0) be the maximum likelihood estimate in the hypothetical model with βb = βb0. The Score Statistic is then defined by

sb = [∂l(β̃, φ)/∂βb^T] F^bb(β̃) [∂l(β̃, φ)/∂βb]    (21)
The idea behind the Score Statistic is that if βb = βb0, then the score contribution at β̃ will be close to zero, which indicates that the maximum likelihood estimate β̂b in the complete model is close to βb0. Asymptotic likelihood theory shows sb ∼ χ²_p.

3.2.3 Likelihood Ratio Statistic. Two models are called nested if the smaller one results in setting some parameters in the larger model to zero (or to some specific value). In this respect, the model with βb = βb0 is nested in the larger model with no constraints on βb. Testing βb = βb0 can then be pursued in a general framework using the Likelihood Ratio Statistic:

λb = −2{l(β̃, φ) − l(β̂, φ)}    (22)
Again, standard likelihood theory indicates the asymptotic distribution λb ∼ χ²_p, assuming the smaller model holds.

3.2.4 Remarks for Practical Application. It should be noted that the Wald Statistic is calculated from β̂, the estimate in the alternative model, whereas the Score Statistic is calculated from β̃, the estimate in the hypothetical model. For the Likelihood Ratio Statistic, both estimates are required for calculation. Standard software usually calculates the Wald Statistic or the likelihood ratio statistic only, whereas the score statistic is typically not provided. In terms of asymptotic behavior, all three test statistics follow standard likelihood theory (see Reference 15) based on the central limit theorem. The fundamental assumption for the asymptotics to hold even in small samples is that Fisher matrices are well conditioned and grow asymptotically at the same rate as the sample size. ''Well conditioned'' means here that both the design matrix of covariates x as well as the coefficients β
are well behaved. The first is a common feature in any linear regression model, whereas the second is a special feature occurring in GLMs. For example, consider a binary covariate x taking values 0 and 1. If the binary response y = 1 for x = 1 and y = 0 for x = 0, then regardless of the design of x, the Fisher matrix will not have full rank because estimates tend to infinity in absolute terms. A thorough investigation of the effect of design on the asymptotic behavior is found in Fahrmeir and Kaufmann (16, 17). General guidelines on when the asymptotic approximations above are good enough to be reliable are difficult to derive, because the behavior depends strongly on the underlying stochastic model. Some discussion is found in Agresti (18, p. 246) or Fahrmeir and Tutz (6, p. 50). In the case of binomial and Poisson responses, Cressie and Read (19) introduced a power divergence statistic with improved small sample behavior. Further discussion of this point is also found in Santner and Duffy (20). As a rule of thumb, one can take that asymptotic results can be satisfactorily applied in finite samples if (1) the covariate design is well conditioned and (2) the mean value µ is not close to the boundary of its parameter space (for example, for a binary response y, we want µ to be not close to 0 or 1). If one of the two conditions is violated, the small sample behavior can be doubtful.

3.3 Example: Logistic Regression (Revisited)

The different test statistics are exemplified with the logistic regression example on schizophrenic relapse from the first section by testing the treatment effect and are given in Table 4. All three quantities are of similar size and show a significant effect of treatment based on an asymptotic χ² distribution with one degree of freedom. The corresponding P-values are also shown in Table 4.

3.4 Deviance and Goodness of Fit

The deviance is strongly linked to the likelihood ratio statistics. It results by comparing the model fit with the saturated model, that is, the model that fully describes the data via E(yi) = ηi for i = 1, . . . , n. Let η̃ = (η̃1, . . . , η̃n) be the maximizer in the saturated model,
Table 4. Test Statistics for the Logistic Regression Example Statistics
Estimate
P-Value
5.006 5.148 5.217
0.0253 0.0233 0.0224
Wald Score Likelihood ratio
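To make the relationship among the three statistics concrete, the following minimal Python sketch fits a logistic regression by Fisher scoring and computes the Wald, score, and likelihood ratio statistics for a single coefficient. The data are simulated for illustration (the schizophrenia relapse data are not reproduced here), so the numerical values will not match Table 4, and the function and variable names are ours rather than part of any published analysis.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit a logistic GLM by Fisher scoring (IRLS); return (beta_hat, Fisher information)."""
    beta = np.zeros(X.shape[1])
    F = np.eye(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        W = mu * (1.0 - mu)              # variance function v(mu) for a binary response
        F = X.T @ (W[:, None] * X)       # Fisher information matrix
        beta = beta + np.linalg.solve(F, X.T @ (y - mu))
    return beta, F

def log_lik(X, y, beta):
    """Bernoulli log-likelihood (up to a constant) at the given coefficients."""
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))

# Simulated data: intercept plus a binary treatment indicator (hypothetical, not the relapse study).
rng = np.random.default_rng(1)
n = 200
treat = rng.integers(0, 2, n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-0.5 - 0.9 * treat))))
X = np.column_stack([np.ones(n), treat.astype(float)])

beta_hat, F_hat = fit_logistic(X, y)        # unrestricted (alternative) model
beta0_hat, _ = fit_logistic(X[:, :1], y)    # restricted (hypothetical) model: treatment effect fixed at 0
beta_tilde = np.array([beta0_hat[0], 0.0])

# Wald statistic: squared estimate divided by its estimated variance.
wald = beta_hat[1] ** 2 / np.linalg.inv(F_hat)[1, 1]

# Score statistic: score vector and Fisher information evaluated at the restricted estimate.
mu0 = 1.0 / (1.0 + np.exp(-(X @ beta_tilde)))
score = X.T @ (y - mu0)
F0 = X.T @ ((mu0 * (1.0 - mu0))[:, None] * X)
score_stat = score @ np.linalg.solve(F0, score)

# Likelihood ratio statistic as in Equation (22).
lr = -2.0 * (log_lik(X, y, beta_tilde) - log_lik(X, y, beta_hat))

print(wald, score_stat, lr)   # each approximately chi-squared with 1 df under the hypothesis
```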
3.4 Deviance and Goodness of Fit

The deviance is strongly linked to the likelihood ratio statistic. It results from comparing the model fit with the saturated model, that is, the model that fully describes the data via E(yi) = ηi for i = 1, . . . , n. Let η̃ = (η̃1, . . . , η̃n) be the maximizer in the saturated model, and denote with li(ηi) the likelihood contributions in Equation (11) evaluated at the linear predictor. The deviance is then defined as

D(Y, μ) = 2φ Σi wi {li(η̃i) − li(η̂i)}

where Y = (y1, . . . , yn) and μ = (μ1, . . . , μn). The deviance is a helpful tool for model selection. Let, therefore, Ma denote the model with g(μ) = xaβa. Accordingly, Mab denotes the model built from g(μ) = xβ, with x = (xa, xb) and β = (βa, βb). The models are nested, and one can test Ma against Mab by making use of the asymptotic result that, under the hypothetical model Ma,

D(Y, μMa) − D(Y, μMab) ∼ χ²(dfa − dfab)    (23)
where D(Y, μM) is the deviance calculated under the corresponding model M, and dfa and dfab are the degrees of freedom for the two considered models, that is, n minus the number of estimated parameters. The deviance can also be used to validate the overall goodness of fit of a model. Assume that x is a metrically scaled covariate, which is included in the model in a linear form; that is, η = β0 + xβx. A general and important question is whether the model is in fact linear or whether the functional relationship is different. Note that theoretically the model could be checked by comparing its deviance with the deviance of the saturated model, that is, using D(Y, μ) − D(Y, Y). Based on Equation (23), this would follow a χ² distribution with dfa − 0 degrees of freedom. As n → ∞, however, one has dfa → ∞ as well, so this asymptotic approximation is useless for practical purposes. This in fact forbids the deviance from being used in this way for checking the overall goodness of fit (see
Reference 7 for more details). Instead, one can extend the model and test it against a more complex functional form. To check, for instance, the linear model η = β0 + xβx, one can test it against an extended polynomial model, e.g., η = β0 + xβx + x²βxx + · · ·. In this case, the difference of the deviances for the two models follows a well-behaved asymptotic pattern, and with Equation (23) the linear model can be checked. Over the last decade, such parametric model checks have increasingly been replaced by nonparametric methods in which the alternative model is specified through g(μ) = m(x), with m(·) an unknown but smooth function to be estimated from the data. The idea can be found in Firth et al. (21); for further references and examples, one should consult Bowman and Azzalini (22).

3.5 Example: Multiple Logistic Regression (Revisited)

For the nodal involvement example from above, a variable selection is desirable to include only those covariates in the model that are significant. Two methods are illustrated here. The first is a Wald test using Equation (20); the second is a likelihood ratio test as stated in Equation (22). The latter is equivalent to a deviance comparison making use of Equation (23). Table 5 shows the deviance for several models. The models are nested, in that in each new row a parameter of the preceding row is set to zero. Using Equation (23), the difference of the deviances in two consecutive rows is chi-squared distributed with 1 degree of freedom. The corresponding P-value is also shown in Table 5. In the same form, the P-value for the Wald Statistic is provided, testing the significance of the effect that is omitted from the preceding row. It would seem that the model including the stage of the tumor, the X-ray assessment, and the serum acid phosphatase level is appropriate for describing the data.

Table 5. Deviance and Wald Statistics for Logistic Regression Example

Model                                  Deviance   P-Value for Difference of Deviance   P-Value for Wald Statistic
xray + acid + stage + grade + age      47.61
xray + acid + stage + grade            47.76      0.699                                0.698
xray + acid + stage                    49.18      0.233                                0.235
xray + acid                            54.79      0.024                                0.018
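As a hedged illustration of the deviance comparison underlying Table 5, the sketch below fits two nested logistic models with statsmodels and refers the deviance difference to a χ² distribution with one degree of freedom, alongside the corresponding Wald test. The covariates are simulated stand-ins for xray, acid, and stage, so the numbers will not reproduce those in Table 5.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

# Simulated stand-ins for the xray, acid, and stage covariates (the nodal data are not reproduced here).
rng = np.random.default_rng(7)
n = 53
covariates = rng.normal(size=(n, 3))
eta = -0.5 + covariates @ np.array([0.8, 0.6, 0.9])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

X_full = sm.add_constant(covariates)
full = sm.GLM(y, X_full, family=sm.families.Binomial()).fit()
reduced = sm.GLM(y, X_full[:, :-1], family=sm.families.Binomial()).fit()   # drop the last covariate

# Difference of deviances between nested models, referred to chi-squared with 1 df (Equation 23).
diff = reduced.deviance - full.deviance
print(diff, chi2.sf(diff, df=1))

# Wald test of the same coefficient, for comparison (as in Table 5).
z = full.params[-1] / full.bse[-1]
print(chi2.sf(z ** 2, df=1))
```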
4 MODEL DIAGNOSTICS AND OVERDISPERSION

4.1 Residuals

It is a wise principle to check the model assumptions carefully before relying on inferential conclusions drawn from the model. A first and helpful step in this direction is the numerical and graphical investigation of the fitted residuals yi − μ̂i. However, even though this is a manifest approach in standard regression with normally distributed residuals, in the generalized case one is faced with two additional features. First, the residuals are not homoscedastic, that is, they have different variances, and second, the residuals can take clustered values, so that visual investigation is cumbersome. Both of these problems have been adequately addressed. Pierce and Schafer (23) provide an early discussion of this point; a very readable account of regression diagnostics is provided by Atkinson (24). For a detailed discussion of residual-based model checking, one may consult Cook and Weisberg (25). Heterogeneity of the residuals can be taken into account by standardization, which leads to the Pearson residuals defined by

ε̂Pi = (yi − μ̂i) / √{v(μ̂i)/wi}    (24)

The sum of the squared Pearson residuals is also known as the Pearson statistic, X² = Σi (yi − μ̂i)²/{v(μ̂i)/wi}. Beside the Pearson residual, the deviance residual is in frequent use. This residual is defined as

ε̂Di = √wi sign(η̃i − η̂i) [2{li(η̃i) − li(η̂i)}]^(1/2)    (25)

with sign(·) as the sign function and η̂i = β̂0 + xiβ̂x. Finally, a third type of residual has been suggested, given by transforming yi by the function T(·) defined through T(·) = ∫ dμ/v^(1/3)(μ), which leads to the so-called Anscombe residuals

ε̂Ai = [T(yi) − Ê{T(yi)}] / √(Var{T(yi)})
Standard software packages give both Pearson and deviance residuals. In contrast, Anscombe residuals are not well established in software packages, and therefore, they are less frequently used, principally because of the additional numerical burden resulting from estimating both the expectation and the variance of T(yi ). All three residuals correct for heterogeneity. For normal responses, they all reduce to the classic fitted residuals εˆ i = yi − µˆ i . For discrete data, however, the residuals exhibit an additional problem because they take clustered values. For instance, if yi is binary in a logistic regression on x, the residuals will mirror the binary outcomes of yi , which makes simple residual exploration by eye complicated or even impossible. To overcome this point, one can plot the residuals against x and apply some smoothing methods to explore whether there is any structure left that would indicate a lack of fit. Such approaches have been suggested by, among others, le Cessie and van Houwelingen (26), and they have become increasingly popular over the last couple of years. Bowman and Azzalini (22) can serve as a general reference.
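The following sketch illustrates this kind of residual check on simulated data, assuming the statsmodels package: Pearson and deviance residuals are extracted from a fitted binomial GLM and smoothed against a covariate with a lowess smoother. Systematic structure in the smooth curve, rather than a flat line around zero, would indicate a lack of fit.

```python
import numpy as np
import statsmodels.api as sm

# Simulated binary data with a single covariate x (illustrative only).
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(-2.0, 2.0, n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * x))))

fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()

# Pearson and deviance residuals, as in Equations (24) and (25).
pearson = fit.resid_pearson
deviance = fit.resid_deviance

# Smooth the residuals against x; structure left in the smooth would point to a lack of fit.
smooth = sm.nonparametric.lowess(pearson, x, frac=0.4)
print(smooth[:5])   # first column: sorted x, second column: smoothed residual
```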
4.2 Example: Multiple Logistic Regression (Revisited)

For the nodal involvement data, the Pearson residuals are plotted against the deviance residuals for the model including xstage, xxray, and xacid as covariates (Fig. 2). For better visibility, some small random noise has been added to avoid overlaying points in the plot. There is one observation with a large Pearson residual and a somewhat inflated deviance residual; this observation would need some extra consideration.

Figure 2. Pearson versus deviance residuals for nodal involvement data.
4.3 Overdispersion

For several distributions, the variance of y is completely determined by the mean. For such distributions, the dispersion parameter φ is a known constant, and the variance of yi depends exclusively on v(μi) and some known weights wi. Examples are the binomial distribution and the Poisson distribution. When using such distributions to model data, the analyst sometimes encounters the phenomenon of overdispersion, in which the empirical residual variance is larger than the variance determined by the model. Overdispersion can arise from several sources. A possible explanation is that relevant covariates have been omitted from the model, which could, for instance, happen if covariates have not been recorded or are unmeasurable (or latent).

Two commonly used modeling approaches correct for overdispersion. First, instead of assuming φ to be fixed and determined from the stochastic model, one allows φ to be unknown. One assumes Var(yi|ηi) = v(μi)wiφ with wi as known weights, v(·) as a known variance function, and φ as an unknown dispersion parameter. Estimation can then be carried out with Equation (18). The resulting stochastic model with φ unknown is not generally in a tractable form. There are, however, some exceptions. For instance, if y is Poisson distributed with mean m, say, and m is assumed to be random following a gamma distribution with mean μ and variance τ, one obtains a negative binomial distribution for y. Integrating out the random mean m yields the overdispersed model with E(y) = μ and Var(y) = μφ, where φ = (1 + τ)/τ. More details are found in McCullagh and Nelder (2). Alternatively, it has been suggested to accommodate overdispersion by fitting a mixed model with a latent random effect. This random effect mimics the latent covariates that are assumed to be responsible for
the observed overdispersion. Estimation is then carried out by numerical or approximative integration over the random effect (see Reference 27 for more details). The approach is attractive because it remains in the distributional framework. It requires, however, additional numerical effort. Tests on overdispersion can be constructed in two ways. First, one can use a Likelihood Ratio Statistic comparing the fitted models with and without overdispersion. For instance, in the Poisson/gamma model sketched above, one can test whether the random mean m has variance 0; i.e., τ = 0 (no overdispersion). The resulting test based on the log likelihood ratio is, however, nonstandard, because the hypothesis τ = 0 lies on the boundary of the parameter space τ > 0. Second, one can simply fit a model with a variance specified by Var(yi|xi) = μi + αh(μi), where h(·) is a prespecified, e.g., quadratic, function. Overdispersion is then tested by testing α = 0. Details on tests of overdispersion are found in Cameron and Trivedi (28). Joint modeling of mean and dispersion is treated in more depth in Jørgensen (29).

4.4 Example: Poisson Regression (Revisited)

In Table 6, the different goodness-of-fit measures are given for the Poisson data example from above.

Table 6. Goodness of Fit for Poisson Data Example

      Model with Linear Time    Model with Quadratic Time
X²           39.59                      31.52
D            38.56                      30.67
df           13                         12

As can be observed, the difference between the deviance and Pearson statistics
is negligible. Based on the difference of the goodness-of-fit statistics in the linear and quadratic models, the quadratic time effect seems significant, with P-values 0.0045 and 0.0049 for the Pearson and deviance statistics, respectively. There is no free dispersion parameter for the Poisson distribution, and so for a well-fitting model, the ratio of the Pearson χ² to its degrees of freedom should tend toward unity. A simple correction for overdispersion, proposed by McCullagh and Nelder (2), is to inflate the standard errors of the parameter estimates by an estimated dispersion parameter, and so give some protection against over-interpreting the data because of inappropriately small confidence intervals around the parameter estimates. Based on Equation (18), this means that confidence intervals are inflated by √φ̂ = √(X²/df). In the linear time model, this is √(39.59/13) = 1.745, and
in the quadratic model, it is √(31.52/12) = 1.621. After this correction, the estimated standard error for the quadratic time parameter increases to 0.0015 × 1.621 = 0.0024. In terms of estimating the change in accidents over, say, a 10-year period, the first model, linear in time but with the scale parameter fixed at 1, would estimate this as a 47% reduction (95% confidence interval 40% to 52%), whereas the linear model with the scale parameter estimated would return the same point estimate, i.e., 47%, but now the 95% confidence interval would stretch from 35% to 56%. For the quadratic model with the scale parameter estimated, the change for the first 10 years, 1981 to 1991, would be estimated as a 39% reduction with a 95% confidence interval of 23% to 52%.
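A minimal sketch of this correction, assuming statsmodels and simulated counts rather than the original accident series: the dispersion is estimated as the Pearson χ² divided by its residual degrees of freedom, and the naive standard errors are inflated by its square root, mirroring the √(X²/df) adjustment applied above.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical yearly counts standing in for the accident series (not the original data).
rng = np.random.default_rng(3)
year = np.arange(15.0)
counts = rng.poisson(np.exp(5.0 - 0.06 * year)) + rng.integers(0, 30, size=year.size)

fit = sm.GLM(counts, sm.add_constant(year), family=sm.families.Poisson()).fit()

# Estimated dispersion: Pearson chi-squared divided by the residual degrees of freedom.
phi_hat = fit.pearson_chi2 / fit.df_resid

# Inflate the naive standard errors by sqrt(phi_hat), as suggested above.
print(phi_hat, fit.bse, fit.bse * np.sqrt(phi_hat))
```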
5 EXTENSIONS

5.1 Multicategorical and Ordinal Models

In the standard form of a GLM, one assumes that the response y is univariate. The model can be readily extended to accommodate multivariate response variables, for example, if y is a categorical response with possible outcomes y ∈ {1, 2, . . . , k}. Recoding y by the dummy variables ỹr = 1 if y = r and ỹr = 0 otherwise, for r = 1, . . . , k − 1, the distribution of the resulting vector ỹ = (ỹ1, . . . , ỹk−1)T can be written in exponential family form exp{ỹTθ̃ − b(θ̃)} with multivariate parameter vector θ̃. Accordingly, the structural part of the model (3) is written in multivariate form

g(μ) = g{E(ỹ|x)} = η    (26)

with g(·) = {g1(·), . . . , gk−1(·)}T and η = (η1, . . . , ηk−1)T as a multivariate linear predictor. The natural link is then gr(μ) = log{P(y = r)/P(y = k)} with P(y = r) = E(ỹr) and P(y = k) = 1 − {E(ỹ1) + · · · + E(ỹk−1)}. For notational simplicity, the dependence on x is thereby omitted. For the linear predictor, one may set ηr = β0r + xβxr to allow covariate x to have a different influence on the separate cells of the discrete distribution.

Multicategorical responses are often measured on an ordinal scale. In this case, it is plausible to model the cumulated probabilities P(y ≤ r) = E(ỹ1) + · · · + E(ỹr) for r = 1, . . . , k − 1. Taking a logit-based link leads to the Cumulative Logistic Model or Proportional Odds Model, respectively, defined by the link function g(·) with components gr(μ) = log{P(y ≤ r)/P(y > r)}. The linear predictor should now be specified as ηr = β0r + xβx, mirroring a joint influence of covariate x on all categories but with separate category-specific intercepts, with β0r ≤ β0,r+1 for r = 1, . . . , k − 2. The model can also be interpreted via a latent univariate response variable, as motivated in depth in McCullagh and Nelder (2); see also Fahrmeir and Tutz (6).
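As an illustration of the multicategorical model with the natural (baseline-category) link and category-specific effects βxr, the sketch below fits a multinomial logit model to simulated three-category data using statsmodels' MNLogit. This is one possible implementation under assumed, illustrative data; it is not the software or data used by the authors.

```python
import numpy as np
import statsmodels.api as sm

# Simulated three-category response; covariate effects differ by category (beta_xr),
# which corresponds to the natural (baseline-category) link described above.
# Category 0 is used as the reference category in the simulation.
rng = np.random.default_rng(5)
n = 400
x = rng.normal(size=n)
eta = np.column_stack([np.zeros(n), 0.5 + 1.0 * x, -0.2 + 0.3 * x])
p = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)
y = np.array([rng.choice(3, p=pi) for pi in p])

fit = sm.MNLogit(y, sm.add_constant(x)).fit(disp=False)
print(fit.params)   # one column of (intercept, slope) per non-reference category
```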
5.2 Quasi-Likelihood

The cornerstone of so-called quasi-likelihood estimation (see Reference 30) is that the variance function v(μ) is inherited from the assumed distribution, but otherwise the distributional assumption is given up. As a consequence, no proper likelihood function is available for estimation, and instead, estimating equations are used. The structural part is extended to accommodate both mean and variance. The latter also allows multivariate response models. The mean specification can be written in matrix form as E(Y|X) = μ(β), where Y = (y1, . . . , yn) is the response vector and X is the design matrix comprising the rows (1, xi). Note that the explicit dependence on the parameters β is not now of primary interest, but it is retained for notational convenience. The distributional assumption is now expressed as Var(Y|X) = V(μ)Wφ, with W = diag(wi) the matrix of known weights and V(μ) as the prespecified variance matrix. In particular, correlated observations can be incorporated by appropriate specification of nondiagonal entries in V(μ). In the case of independent errors, V(μ) simplifies to diag{v(μi)}. Based on estimating equation theory (see Reference 31), the best estimating equation for β becomes

s(β̂) := (∂μ̂T/∂β) {V(μ̂)W}−1 (Y − μ̂)/φ    (27)
which can be solved by Newton-Raphson or Fisher scoring in the same way as in a likelihood framework. An attractive feature of the approach is that the score and information identities are inherited; that is, if the prespecified variance is the true variance, one gets

E{s(β)} = 0,    E{−∂s(β)/∂β} = Var{s(β)}
This approach became fashionable in the early 1990s with the introduction of generalized estimating equations (GEEs) (see Reference 32).
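A sketch of the estimating-equation approach in practice, assuming the statsmodels package: a GEE fit of a binomial mean model with an exchangeable working covariance to simulated clustered data. The data, cluster structure, and names are illustrative assumptions only, not taken from the article.

```python
import numpy as np
import statsmodels.api as sm

# Simulated clustered binary data: 50 clusters of 8 observations sharing a cluster effect,
# which induces within-cluster correlation.
rng = np.random.default_rng(11)
n_clusters, m = 50, 8
groups = np.repeat(np.arange(n_clusters), m)
cluster_effect = np.repeat(rng.normal(scale=0.7, size=n_clusters), m)
x = rng.normal(size=n_clusters * m)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-0.3 + 0.6 * x + cluster_effect))))

# GEE: a binomial mean model plus an exchangeable working covariance, estimated by
# solving an estimating equation of the form (27).
model = sm.GEE(y, sm.add_constant(x), groups=groups,
               family=sm.families.Binomial(),
               cov_struct=sm.cov_struct.Exchangeable())
result = model.fit()
print(result.params, result.bse)
```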
6 SOFTWARE
The software package GLIM (see Francis et al. (1993)) was to some extent a precursor on the computational side of GLMs. Nowadays, however, most standard software packages are well equipped with fitting routines for GLMs. SPSS (www.spss.com; SPSS Inc., 233 S. Wacker Dr., Chicago, IL 60606, USA) allows one to fit most parametric GLMs, including multinomial and ordinal models (see the GLM procedure). The same holds for SAS (www.sas.com; SAS Institute Inc., 100 SAS Campus Dr., Cary, NC 27513, USA; see, for instance, PROC GENMOD and PROC MIXED). The widest range of fitting possibilities, in particular using nonparametric approaches, is found in S-Plus (www.insightful.com; Insightful, Kägenstr. 17, 4153 Reinach, Switzerland) or its open-source clone R (www.r-project.org; see the function glm). Beside these mainstream programs, several smaller and specialized tools provide estimation routines for GLMs. A comprehensive overview is found in Fahrmeir and Tutz (6).

REFERENCES

1. J. A. Nelder and R. W. M. Wedderburn, Generalized linear models. Journal of the Royal Statistical Society, Series B 1972; 34: 370–384.
2. P. McCullagh and J. A. Nelder, Generalized Linear Models. 2nd ed. New York: Chapman and Hall, 1989.
3. M. Aitkin, D. Anderson, B. Francis, and J. Hinde, Statistical Modelling in GLIM. Oxford, U.K.: Oxford University Press, 1989.
4. R. H. Myers, D. C. Montgomery, and G. G. Vining, Generalized Linear Models: With Applications in Engineering and the Sciences. New York: Wiley, 2001.
5. A. J. Dobson, Introduction to Generalized Linear Models. Boca Raton, FL: Chapman & Hall/CRC, 2001.
6. L. Fahrmeir and G. Tutz, Multivariate Statistical Modelling Based on Generalized Linear Models. 2nd ed. New York: Springer Verlag, 2001.
7. D. Firth, Generalized linear models. In: D. V. Hinkley, N. Reid, and E. J. Snell (eds.), Statistical Theory and Modelling. London: Chapman and Hall, 1991.
8. S. M. Stigler, The History of Statistics: The Measurement of Uncertainty Before 1900. Cambridge, MA: Harvard University Press, 1986.
9. A. Stuart and J. K. Ord, Kendall's Advanced Theory of Statistics, Vol. 2A: Classical Inference & the Linear Model. New York: Oxford University Press, 1999.
10. A. Gumley, M. O'Grady, L. McNay, J. Reilly, K. Power, and J. Norrie, Early intervention for relapse in schizophrenia: Results of a 12 month randomised controlled trial of cognitive behavioural therapy. Psychological Medicine 2003; 419–431.
11. B. Brown, Prediction analysis for binary data. In: R. Miller, B. Efron, B. Brown, and L. Moses (eds.), Biostatistics Casebook. New York: Wiley, 1980, pp. 3–18.
12. A. Morrison, D. H. Stone, A. Redpath, H. Campbell, and J. Norrie, Trend analysis of socio-economic differentials in deaths from injury in childhood in Scotland, 1981–95. British Medical Journal 1999; 318: 567–568.
13. F. E. Harrell Jr., Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer, 2001.
14. J. Norrie and I. Ford, The role of covariates in estimating treatment effects and risk in long-term clinical trials. Statistics in Medicine 2002; 21(19): 2899–2908.
15. T. A. Severini, Likelihood Methods in Statistics. Oxford, U.K.: Oxford University Press, 2000.
16. L. Fahrmeir and H. Kaufmann, Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. The Annals of Statistics 1985; 13: 342–368.
17. L. Fahrmeir and H. Kaufmann, Asymptotic inference in discrete response models. Statistical Papers 1986; 27: 179–205.
18. A. Agresti, Categorical Data Analysis. New York: Wiley, 1990.
19. N. Cressie and T. R. C. Read, Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B 1984; 46: 440–464.
20. T. J. Santner and D. E. Duffy, The Statistical Analysis of Discrete Data. New York: Springer Verlag, 1990.
21. D. Firth, J. Glosup, and D. V. Hinkley, Model checking with nonparametric curves. Biometrika 1991; 78: 245–252.
22. A. W. Bowman and A. Azzalini, Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. Oxford, U.K.: Oxford University Press, 1997.
23. D. A. Pierce and D. W. Schafer, Residuals in generalized linear models. Journal of the American Statistical Association 1986; 81: 977–986.
24. A. Atkinson, Plots, Transformations and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis. Oxford Statistical Science Series. Oxford, U.K.: Clarendon Press, 1985.
25. R. D. Cook and S. Weisberg, Applied Regression Including Computing and Graphics. New York: Wiley, 1999.
26. S. le Cessie and J. van Houwelingen, A goodness-of-fit test for binary regression models, based on smoothing methods. Biometrics 1991; 47: 1267–1282.
27. M. Aitkin, A general maximum likelihood analysis of variance components in generalized linear models. Biometrics 1999; 55: 218–234.
28. A. Cameron and P. Trivedi, Regression Analysis of Count Data. Cambridge, U.K.: Cambridge University Press, 1998.
29. B. Jørgensen, The Theory of Dispersion Models. Boca Raton, FL: Chapman & Hall, 1997.
30. R. W. M. Wedderburn, Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 1974; 61: 439–447.
31. V. Godambe and C. Heyde, Quasi-likelihood and optimal estimation. International Statistical Review 1987; 55: 231–244.
32. P. J. Diggle, K.-Y. Liang, and S. L. Zeger, Analysis of Longitudinal Data. Oxford, U.K.: Oxford University Press, 1994.
33. B. Francis, M. Green, and C. Payne (eds.), The GLIM System: Release 4. Oxford, U.K.: Clarendon Press, 1993.
GENERIC DRUG REVIEW PROCESS
This article was modified from the website of the United States Food and Drug Administration http://www.fda.gov/cder/handbook/ by Ralph D'Agostino and Sarah Karl.

An Applicant is any person (usually a pharmaceutical firm) who submits an abbreviated new drug application (ANDA), or an amendment or supplement to one, to obtain Food and Drug Administration (FDA) approval to market a generic drug product, and/or any person (or firm) who owns an approved application or abbreviated application. A generic drug product is one that is comparable to an innovator drug product [also known as the reference listed drug (RLD) product as identified in the FDA's list of Approved Drug Products with Therapeutic Equivalence Evaluations] in dosage form, strength, route of administration, quality, performance characteristics, and intended use. The ANDA contains data that, when submitted to the FDA's Center for Drug Evaluation and Research, Office of Generic Drugs, provide for the review and ultimate approval of a generic drug product. Once approved, an applicant may manufacture and market the generic drug product provided all issues related to patent protection and exclusivity associated with the RLD have been resolved. Generic drug applications are termed "abbreviated" in that generally they are not required to include preclinical (animal) and clinical (human) data to establish safety and effectiveness. These parameters were established on the approval of the innovator drug product, which is the first version of the drug product approved by the FDA.

An application must contain sufficient information to allow a review to be conducted in an efficient and timely manner. On receipt of the application, a pre-filing assessment of its completeness and its acceptability is performed by a project manager within the Regulatory Support Branch, Office of Generic Drugs. If this initial review documents that the application contains all the necessary components, then an "acknowledgment letter" is sent to the applicant to indicate its acceptability for review and to confirm its filing date. Once the application has been determined to be acceptable for filing, the Bioequivalence, Chemistry/Microbiology, and Labeling reviews may begin. If the application is missing one or more essential components, a "Refuse to File" letter is sent to the applicant. The letter documents the missing component(s) and informs the applicant that the application will not be filed until it is complete. No additional review of the application occurs until the applicant provides the requested data and the application is acceptable and complete.

The Bioequivalence Review process establishes that the proposed generic drug is bioequivalent to the reference listed drug, based on a demonstration that both the rate and the extent of absorption of the active ingredient of the generic drug fall within established parameters when compared with those of the reference listed drug. The FDA requires an applicant to provide detailed information to establish bioequivalence. Applicants may request a waiver from performing in vivo (testing done in humans) bioequivalence studies for certain drug products where bioavailability (the rate and extent to which the active ingredient or active moiety is absorbed from the drug product and becomes available at the site of action) may be demonstrated by submitting data such as (1) a formulation comparison for products whose bioavailability is self-evident, for example, oral solutions, injectables, or ophthalmic solutions where the formulations are identical, or (2) comparative dissolution data. Alternatively, in vivo bioequivalence testing that compares the rate and the extent of absorption of the generic versus the reference product is required for most tablet and capsule dosage forms. For certain products, a head-to-head evaluation of comparative efficacy based on clinical endpoints may be required. On filing an ANDA, an establishment evaluation request is forwarded to the Office of
Compliance to determine whether the product manufacturer, the bulk drug substance manufacturer, and any outside testing or packaging facilities are operating in compliance with current Good Manufacturing Practice (cGMP) regulations as outlined in 21 CFR 211. Each facility listed on the evaluation request is evaluated individually, and an overall evaluation for the entire application is made by the Office of Compliance. Furthermore, a preapproval product-specific inspection may be performed to assure the data integrity of the application. The Chemistry/Microbiology review process provides assurance that the generic drug will be manufactured in a reproducible manner under controlled conditions. Areas such as the applicant's manufacturing procedures, raw material specifications and controls, sterilization process, container and closure systems, and accelerated and room temperature stability data are reviewed to assure that the drug will perform in an acceptable manner. The Labeling review process ensures that the proposed generic drug labeling (package insert, container, package label, and patient information) is identical to that of the reference listed drug except for differences caused by changes in the manufacturer, distributor, pending exclusivity issues, or other characteristics inherent to the generic drug product (tablet size, shape or color, etc.). Furthermore, the labeling review serves to identify and to resolve issues that may contribute to medication errors, such as similar-sounding or similar-appearing drug names, and the legibility or prominence of the drug name or strength. If at the conclusion of the Bioequivalence Review, it is determined that deficiencies exist in the bioequivalence portion of the application, a Bioequivalence Deficiency Letter is issued by the Division of Bioequivalence to the applicant. This deficiency letter details the deficiencies and requests information and data to resolve them. Alternatively, if the review determines that the applicant has satisfactorily addressed the bioequivalence requirements, the Division of Bioequivalence will issue a preliminary informational letter to indicate that no additional questions exist at this time. If deficiencies are involved in the Chemistry/Manufacturing/Controls, Microbiology,
or Labeling portions of the application, these deficiencies are communicated to the applicant in a facsimile. The facsimile instructs the applicant to provide information and data to address the deficiencies and provides regulatory direction on how to amend the application. Once the above sections, as well as the preapproval inspection and the bioequivalence portion of the application, are found to be acceptable, the application moves toward approval. If no additional deficiencies are noted after a final Office-level administrative review by all review disciplines, the application can be approved. A satisfactory recommendation from the Office of Compliance, based on an acceptable preapproval inspection, is required prior to approval. The preapproval inspection determines compliance with cGMPs and includes a product-specific evaluation of the manufacturing process for the application involved. If an unsatisfactory recommendation is received, a "not approvable" letter may be issued. In such a case, approval of the generic drug product will be deferred pending a satisfactory reinspection and an acceptable recommendation. After all components of the application are found to be acceptable, an "Approval" or a "Tentative Approval" letter is issued to the applicant. The letter details the conditions of the approval and allows the applicant to market the generic drug product with the concurrence of the local FDA district office. If the approval occurs prior to the expiration of any patents or exclusivities accorded to the reference listed drug product, a tentative approval letter is issued to the applicant that details the circumstances associated with the tentative approval of the generic drug product and delays final approval until all patent/exclusivity issues have expired. A tentative approval does not allow the applicant to market the generic drug product.
GENE THERAPY
SAMANTHA L GINN and IAN E ALEXANDER
Gene Therapy Research Unit, Children's Medical Research Institute and The Children's Hospital at Westmead, NSW, Australia

Gene therapy was conceived originally as an approach to the treatment of inherited monogenic diseases; it is defined by the United Kingdom's Gene Therapy Advisory Committee as "the deliberate introduction of genetic material into human somatic cells for therapeutic, prophylactic or diagnostic purposes (1)." Ongoing improvements in gene-transfer technologies and biosafety have been accompanied by a growing appreciation of the broader scope of gene therapy. Pathophysiological processes such as wound healing (2), chronic pain (3) and inflammation (4), cancer (5), and acquired infections such as HIV-1 (6) are now becoming realistic targets for this promising therapeutic modality.

The first authorized gene transfer study took place at the National Institutes of Health (NIH) in 1989. In this marker study, tumor-infiltrating lymphocytes were harvested, genetically tagged using a retroviral vector, and reinfused with the intention of examining the tumor-homing capacity of these cells. This landmark study provided the first direct evidence that human cells could be genetically modified and returned to a patient without harm (7). Since then, over 1300 trials have been approved, initiated, or completed worldwide, which are performed predominantly in the United States (8, 9). Most studies have focused on cancer, with cardiovascular and monogenic diseases the next most frequent indications (Table 1). These predominantly early-phase trials have provided invaluable proof-of-concept for gene therapy by confirming that desired changes to the target cell phenotype can be achieved successfully. In most trials, however, an insufficient number of cells have been genetically modified to achieve therapeutic benefit. Notable exceptions to the lack of therapeutic efficacy have been the successful treatment of several primary immunodeficiencies that affect the hematopoietic compartment (10–16).

Table 1. Gene therapy clinical trial indications (from www.wiley.co.uk/genemed/clinical)

Indication                  Number of protocols
Cancer diseases             871 (66.5%)
Cardiovascular diseases     119 (9.1%)
Monogenic diseases          109 (8.3%)
Infectious diseases         85 (6.5%)
Neurological diseases       20 (1.5%)
Ocular diseases             12 (0.9%)
Other diseases              21 (1.6%)
Gene marking                50 (3.8%)
Healthy volunteers          22 (1.7%)
Total                       1309

1 REQUIREMENTS FOR SUCCESSFUL THERAPEUTIC INTERVENTION

The prerequisites for successful gene therapy are complex and disease-specific, but invariably they include the availability of an efficient gene delivery technology and an understanding of its properties, capacity, and limitations, a detailed knowledge of the biology of the target cell population, and a precise understanding of the molecular basis of the target disease (Fig. 1). Although significant progress is being made in each of these areas, the ability to achieve efficient gene delivery has been described as "the Achilles heel of gene therapy (17)."

Figure 1. Venn diagram of requirements for successful therapeutic intervention (gene delivery technology, target cell biology, and understanding of the target disease).

1.1 Gene Delivery Technology

Gene delivery systems can be classified into two broad categories: nonviral physicochemical approaches and recombinant viral systems. The comparative strengths of nonviral approaches include ease of chemical characterization, simplicity and reproducibility of production, larger packaging capacity, and reduced biosafety risks (18, 19). Gene delivery, however, is relatively inefficient and
the effects are often transient. Examples of nonviral systems include microinjection of individual cells, DNA-calcium phosphate coprecipitation, and the formulation of DNA into artificial lipid vesicles (20–24). In contrast, viral systems, which are commonly modified to render them replication-incompetent, are markedly more efficient and exploit favorable aspects of virus biology (17, 25–27). Viral vectors can be divided into two main categories, nonintegrating and integrating, based on the intracellular fate of their genetic material. These properties are important when considering the required duration of the treatment. Nonintegrating vectors are maintained extrachromosomally in the nucleus, whereas the genome of integrating vectors becomes incorporated into the host cell genome, which provides the potential for stable long-term gene expression. Limitations include increased biosafety risks caused by contamination by replication-competent virus, the presence of potentially toxic viral gene products, and insertional mutagenesis when integrating vectors are used (28). Irrespective of vector type, another important limitation is the induction of unwanted immune responses directed against components of the delivery system and/or the encoded transgene product. These responses may be either cell-mediated or humoral and depend on several variables that include
the nature of the transgene, the vector and promoter used, the route and site of administration, the vector dose, and host factors. Ultimately, host-vector immune responses have the potential to influence clinical outcomes negatively (29–32). Accordingly, the optimization of gene delivery systems and strategies to evade deleterious immune responses remains a fundamentally important challenge (33–35). Difficulty in producing high-titre vector preparations and constraints on packaging capacity are also drawbacks. Hybrid vectors that combine the advantageous features of more than one virus have also been developed, although their application has been largely in in vitro models (36–40). Despite these limitations, the relative efficiency of gene transfer has resulted in the predominant use of viral vectors in pre-clinical and clinical gene therapy studies up to the present time. The use of gene-correction rather than gene-addition strategies, including targeted recombination (41–43), antisense oligonucleotide-induced exon skipping (44), and RNA interference (45), is also being investigated. Such strategies will be particularly important in the context of dominant disease processes in which the simple addition of a functionally normal gene is insufficient. Currently, these approaches lack the efficiency required for human gene therapy applications. Finally, although recombinant viral vectors have the most immediate potential for clinical use, it is envisaged that these systems will be supplanted by hybrid and derivative systems that combine the simplicity and safety of nonviral gene delivery with favorable aspects of viral biology. Virosomes are a prototypic example of such a system and, among these, the Hemagglutinating Virus of Japan liposome has been the most extensively investigated (46, 47).

1.2 Target Cell Biology

Each viral vector system possesses a unique set of biological properties (Table 2), and its use is governed largely by the biology of the target cell. For example, integration provides the molecular basis for stable long-term gene expression as would be required for the
Table 2. Properties of widely used viral vector systems

Vector                                Adenovirus          Retrovirus*      Lentivirus†      AAV
Genome                                dsDNA               ssRNA            ssRNA            ssDNA
Insert capacity                       7 to 30 kb          7 kb             10 kb            4.8 kb
Location in cell nucleus              extrachromosomal    integrated       integrated       integrated/extrachromosomal
Cell-cycle dependent gene transfer    no                  yes              no               no
Duration of transgene expression      transient           long-term        long-term        long-term
Functional titre (per mL)             ≥ 10^12             ≥ 10^7           ≥ 10^9           ≥ 10^9
Immunogenicity                        high                low              low              low

AAV, adeno-associated virus.
* Most commonly derived from Moloney-Murine Leukemia virus (MoMLV).
† Most commonly derived from HIV-1.
treatment of genetic disease in replicating cell populations. Integration also provides the potential for gene-modified cells to be expanded from a modest number of progenitors. This feature is of paramount importance when the genetic modification of cells capable of enormous proliferative expansion is required. This feature has been powerfully illustrated in the first successful treatment of a genetic disease by gene therapy (10, 11). In this study, a selective growth and survival advantage was conferred on hematopoietic stem cells following transduction with a retroviral vector based on the Moloney murine leukemia virus. Despite the advantages of viral integration, nonintegrating vectors are potentially effective if the target cell is nondividing. Efforts are also being made to develop integration-deficient forms of integrating vectors in an attempt to increase biosafety (48). Additional constraints, such as the replication state of the target cell or whether the target cell is amenable to ex vivo manipulation define vector choice and gene transfer protocols even more. For example, vectors based on lentiviruses and adeno-associated viruses can modify postmitotic and nondividing cells (49–52), which makes them of particular interest for targets such as muscle and the nervous system.
1.3 Disease Pathophysiology

Before a gene-therapy approach can be considered feasible for a disease or physiological process, a precise understanding of the underlying molecular basis is required. The requirement for transient or persistent transgene expression must also be considered. For example, the extrachromosomal nature of adenoviral vectors has the potential to limit the duration of gene expression by dilutional loss during cell division (53). In some contexts, in which only transient gene expression is required, such as anticancer gene therapy, this may be a positive attribute. The pathophysiology of the target disease also defines the number of cells that must be successfully gene-modified to achieve the desired therapeutic effect. For example, it is anticipated that for the treatment of hemophilia B, levels of Factor IX as low as 1% of normal will be therapeutic (54). For some more demanding disease phenotypes, expansion of gene-modified cells through in vivo selection is one strategy by which the fundamental challenge of gene-modifying sufficient cells to achieve therapeutic benefit may be overcome. Ultimately, however, the development of more efficient gene delivery technologies is required to allow the effective treatment of the many human diseases and pathophysiological processes that are potentially amenable to gene therapy.
2 PRECLINICAL RESEARCH
Before a clinical gene therapy protocol can be considered for human application, extensive preclinical testing is required. The data generated are vital in establishing whether the target cell can be safely gene-modified to produce the phenotypic changes required for disease correction. This involves years of preclinical experimental testing progressing from tissue culture models to small animal models (most commonly mice), and finally to large animal models when feasible.

2.1 In Vitro Studies

An important first step in establishing proof-of-concept data for a clinical gene therapy protocol is provided by in vitro studies. The manipulation of mammalian cells in culture can help define several important experimental parameters. Taking into account the biology of the target cell, several gene transfer approaches could potentially be available. For example, if neuron-targeted gene transfer is required, several recombinant viral vector systems are available, such as those based on adeno-associated virus, herpes simplex virus, and lentiviruses (55–60). Using cells in culture, it is relatively easy to select the vector system that is most effective in genetically modifying the cell type of interest. Important parameters that can be determined in vitro include the tropism of the virus for the relevant target cell population, the minimum vector dose required to achieve the desired phenotypic changes, the level and the duration of transgene expression, and vector toxicity. In vitro systems also allow aspects of expression cassette design to be examined, such as the use of tissue-specific promoters or regulated gene expression. Immortalized cell lines are commonly used for such studies, but frequently they are transduced more readily than primary cells and do not consistently model the challenge of transducing specific target cell populations in vivo. Culture of primary cells, and tissue explants where possible, is therefore preferable because it offers a more realistic representation of the target cell population in its native state before proceeding to animal models.
2.2 Animal Models

Another prerequisite for successful gene therapy is the availability of an animal model that approximates the human disease for preclinical testing. Indeed, successful phenotype correction in mouse models of human disease is now frequently reported (61–66). Up to the present time, these successes have rarely been replicated in large animal models or human clinical trials. The explanation for this is primarily quantitative. Success in larger subjects demands that proportionally greater numbers of target cells be gene-modified to reach a therapeutically meaningful threshold. The average child, for example, is approximately 1000-fold bigger than a mouse and, therefore, presents a correspondingly greater gene transfer challenge. In addition to the challenge of size, other factors, such as species-specific differences, exert potent effects in some contexts. Animal models also provide valuable safety data required by regulatory bodies before approving early-phase human trials, but these models do not always accurately predict adverse effects subsequently observed in human subjects. For example, in a therapeutically successful gene therapy trial for the X-linked form of severe combined immunodeficiency (SCID-X1), 4 of 11 patients developed a leukemia-like illness as a direct consequence of retroviral insertion into chromosomal DNA (28). The risk of insertional mutagenesis when using integrating vectors had long been recognized but was formerly considered low, because retroviral vectors had previously been employed extensively without incident in animal models and in almost 300 documented human clinical protocols (9). Interestingly, concurrent with the above report of vector-mediated insertional mutagenesis in humans, the first report of essentially the same phenomenon in mice was published (67). Collectively, these events illustrate the inherent challenge in predictive safety testing, whether in animal models or in early-phase human clinical trials. Preferably, such tests must be configured with specific adverse events in mind and, where possible, in a manner that accommodates the possible contribution of the particular disease for which gene therapy is being contemplated.
3 CONDUCTING A HUMAN CLINICAL GENE THERAPY TRIAL

In comparison with drug-based clinical trials, several additional factors must be considered when undertaking a human gene therapy application. These include additional layers of regulatory oversight, ethical considerations related to the genetic modification of the subject's cells, and the availability of appropriate skills, infrastructure, and reagents.

3.1 Regulatory Oversight

Clinical research with genetic material poses safety and methodological challenges not shared by other forms of human investigation. As a result, several regulatory requirements must be satisfied before human studies that involve gene transfer can be initiated. In most countries, compared with the requirements for pharmaceutical products, these requirements are addressed through an additional layer, or layers, of regulation. Depending on the host country, regulatory oversight can be complicated even more by the fact that existing regulatory frameworks have evolved for more conventional therapeutic products. As for all human clinical research, gene therapy trials must also be conducted according to a set of standards referred to as Good Clinical Practice that are based on the Declaration of Helsinki (68). In the United States, it is a federal requirement that clinical protocols that involve the administration of recombinant DNA products be reviewed and approved by filing an investigational new drug application with the Food and Drug Administration (FDA). In addition, applications must be approved by local institutional human ethics and biosafety committees. The key regulatory issues for U.S.-based clinical gene therapy trials have been reviewed by Cornetta and Smith (69). In the United Kingdom, gene therapy applications are similarly regulated by the Medicines and Healthcare Products Regulatory Agency, the Gene Therapy Advisory Committee, and local institutional committees. This regulatory complexity is particularly burdensome in the context of multinational trials and is a major driver behind efforts for global harmonization.
Such efforts will not only facilitate international studies but also improve data quality and participant safety (70).

3.2 Special Ethical Considerations

In contrast to drug-based clinical trials, several special ethical issues must be considered for human gene transfer studies. These issues include the possibility of inadvertent germ-line transmission and, depending on the type of delivery vehicle used, the ability to introduce lifelong modifications to the subject's chromosomal DNA, the latter resulting in the need for long-term clinical follow-up. Currently, only somatic cell gene therapy protocols have been initiated. The use of germline manipulation, in which the genomes of germ cells are deliberately modified with the likely effect of passing on changes to subsequent generations, is opposed at this time (71). Although genetic manipulation of the human germ-line is illegal in many countries, this consensus is not unanimous and its use remains the subject of vigorous debate (72, 73). For any research team that attempts to develop a new medical treatment, patient safety is of paramount importance, and the decision to proceed with a gene therapy approach requires a careful balancing of the associated risks and benefits. For example, bone marrow transplantation from an HLA-matched sibling donor is the treatment of choice for diseases such as SCID-X1. Unfortunately, however, this option is available for only one in five affected infants. The alternative is transplantation from a haploidentical or matched unrelated donor, and it carries a substantial risk of severe immunologic complications and of early mortality (74). For these children, a gene therapy approach may carry a lower risk even when possible adverse events associated with gene therapy, such as leukemia induction through insertional mutagenesis, are taken into account (75). Another ethical concern for gene therapy trials is the enrollment of infants and children, given their inability to provide informed consent. Although it may be preferable to undertake early-phase clinical trials in adults, many severe disease phenotypes are restricted to the pediatric age group, or
arise where meaningful intervention would only be possible early in the course of the target disease. Examples include SCID-X1 (10, 76) and cystic fibrosis (77). Accordingly, depending on the disease context, equally potent counterbalancing ethical arguments are in favor of early-phase trials in the pediatric population. Another important consideration is whether a need exists for long-term monitoring of the subject after the gene transfer protocol. Parameters that include the ability of the delivery vehicle to integrate into the genome, the site of integration, vector persistence, biology of the target cell, and transgene-specific effects all influence the risk associated with the treatment. If no vector persistence exists, the risk is analogous to that of any new drug, and long-term follow-up may not necessarily be required (78).

3.3 Skills, Infrastructure, and Reagents

To undertake a gene therapy clinical trial, a research team requires access to specialized facilities as well as appropriately trained staff to perform procedures in accordance with required standards. In most countries, the rigor of these requirements increases in late-phase trials to the level of good manufacturing practice. For each gene therapy protocol, the set of skills required is governed largely by the biology of the target cell. For example, in trials that involve gene transfer to hematopoietic stem cells, an ex vivo approach is the method of choice. This approach requires a medical team to harvest the subject's bone marrow and personnel who can maintain the target cells in sterile culture for up to five days after harvest (10). This approach also requires the availability of an on-site clean-room to perform the cellular manipulations. Alternatively, it might be necessary to deliver the vector directly to the target cell in vivo. Examples of this approach include gene transfer to organs such as the eye or brain or gene therapy protocols that deliver oncolytic agents to a solid tumor. Experimental products used for gene transfer studies are often complex and difficult to characterize completely in comparison with conventional pharmaceutical agents, which is true particularly for virus-based gene delivery systems that are also challenging to produce on a large scale and cannot
be sterilized by autoclaving or radiation. Biological variability may also result from the packaging of empty virions, titre differences between different production runs, and loss of titre during storage. The propensity of vectors to undergo inadvertent sequence alteration during production, through mechanisms such as recombination or transcription by error-prone polymerases (79), must also be monitored.

4 CLINICAL TRIALS

Although originally conceived as a treatment for monogenic diseases, the major indication addressed by gene therapy trials to date has been cancer (66.5%, Table 1). This is predominantly because of the poor prognosis of many cancer phenotypes, which makes the risk/benefit ratio more favorable for experimental intervention. Although initial trials have been largely unsuccessful, some positive outcomes have occurred. For example, in 2006, Morgan et al. (5) observed cancer regression in patients who received autologous peripheral blood lymphocytes modified by a retroviral vector to express a T cell receptor. Although regression was observed in only 2 of 15 patients (13%), which is considerably lower than the 50% response rate achieved when patients received tumor-infiltrating lymphocytes (TILs) in a similar trial (80), this method may prove useful in instances in which TILs are not available. Cardiovascular and monogenic diseases are the next most frequently addressed indications, with 119 (9.1%) and 109 (8.3%) trials, respectively, approved worldwide (Table 1). Shortly after the first authorized gene transfer trial was undertaken in 1989 (7), the first therapeutic trial, involving two children who suffered from a severe combined immunodeficiency caused by adenosine deaminase deficiency (ADA-SCID), was approved. This trial was unsuccessful for several reasons, which include maintaining the patients on PEG-ADA therapy and using patient T lymphocytes as the target cell population (81). Removal of PEG-ADA coupled with a myeloablative conditioning regime and the targeting of hematopoietic stem cells with an improved transduction protocol resulted
Figure 2. Number of gene therapy clinical trials approved worldwide from 1989 to 2007 (reprinted with permission from http://www.wiley.co.uk/genmed/clinical/).
in the successful treatment of patients in subsequent trials for ADA-SCID (13, 14), which highlighted the need to impart a positive growth or survival advantage to the transduced cells. After initiation of the first therapeutic trial, a progressive increase occurred in the number of gene therapy trials approved in the following years (Fig. 2). This increase slowed briefly in the mid-1990s after an NIH review committee co-chaired by Stuart Orkin and Arno Motulsky concluded that ‘‘Significant problems remain in all basic aspects of gene therapy. Major difficulties at the basic level include shortcomings in all current gene transfer vectors and an inadequate understanding of the biological interaction of these vectors with the host (82).’’ This trend toward increasing numbers of gene therapy trials leveled off after 1999 coincident with report of the first severe adverse event as a direct consequence of gene therapy (83). In a Phase I dose escalation study that investigated the safety of an adenoviral vector for the treatment of ornithine transcarbamylase deficiency, a young adult trial participant died as a result of a severe systemic inflammatory reaction to the injected adenoviral vector. This incident resulted in the suspension of all trials at the host institution by the
FDA, and in a Senate subcommittee investigation that also looked more broadly at clinical trial safety and reporting across the United States (84). In light of the information that emerged, a renewed emphasis exists on critical evaluation of risk/benefit ratios for trial participants and comprehensive reporting of trial outcomes and adverse events. To date, gene therapy trials have been performed in 28 countries (Fig. 3) spanning five continents. These trials have been most comprehensively reviewed by Edelstein et al. (8). Most trials have been conducted in the United States with 864 trials (66%), followed by the United Kingdom (150 trials, 11.5%). Viral vectors are the most frequently used gene delivery system (Table 3) because of their superior gene transfer efficiency over nonviral methods. Of the viral gene delivery systems available, adenoviral and retroviral vectors (derived from murine retroviruses) have been most commonly used, accounting for 24.9% and 23.1% of trials, respectively (Table 3). Lentiviral vectors derived from Human Immunodeficiency Virus Type 1 (HIV-1) are now being used clinically (85) after safety considerations relating to the inherent properties of the parent virus were addressed. Because of differences in their integration site patterns, lentiviral vectors may offer a
Figure 3. Geographical distribution of gene therapy clinical trials (reprinted with permission from http://www.wiley.co.uk/genmed/clinical/).

Table 3. Gene delivery systems in clinical trial use (from www.wiley.co.uk/genemed/clinical)

Vector                          Number of protocols
Adenovirus                      326 (24.9%)
Retrovirus                      302 (23.1%)
Naked/plasmid DNA               239 (18.3%)
Lipofection                     102 (7.8%)
Vaccinia virus                  65 (5.0%)
Poxvirus                        60 (4.6%)
Adeno-associated virus          48 (3.7%)
Herpes simplex virus            43 (3.3%)
Poxvirus and vaccinia virus     26 (2.0%)
RNA transfer                    17 (1.3%)
Lentivirus                      9 (0.7%)
Flavivirus                      5 (0.4%)
Gene Gun                        5 (0.4%)
Others                          22 (1.7%)
Unknown                         40 (3.1%)
Total                           1309
safer alternative to retroviral vectors derived from murine retroviruses (86–88).
5 LESSONS LEARNED
It is now more than 15 years since the first authorized gene transfer trial was undertaken in 1989. Since then, over 1300 clinical trials have been initiated worldwide (Table 1) with several notable successes since 2000. Although results from these predominantly early-phase trials have been largely
unsuccessful in providing clinical benefit to human subjects, they have (1) provided clear proof-of-concept for gene therapy, (2) demonstrated that gene therapy is relatively safe, and (3) highlighted several important issues that must be considered to advance the field. The field has also recently seen the first commercial release of a gene therapy product, Shenzhen SiBiono GenTech's Gendicine, approved in China for head and neck squamous cell carcinoma (89, 90).
5.1 The Power of In Vivo Selection

The major reason gene therapy has been unsuccessful in providing clinical benefit in many disease contexts is low gene transfer efficiency. Expansion of gene-modified cells through in vivo selection is one strategy by which the fundamental challenge of gene-modifying sufficient cells to achieve therapeutic benefit can be overcome (91). The power of in vivo selection has been impressively illustrated in the SCID-X1 trial, the first successful treatment of a genetic disease by gene therapy (10). For most diseases, however, the gene-corrected cells will not have a selective growth or survival advantage. Therefore, efforts are being made to develop strategies for providing modified cells with an exogenous selective advantage. One such strategy exploits mutant forms of the DNA repair enzyme methylguanine methyltransferase and targets expression to hematopoietic progenitor cells using integrating vector systems. This strategy imparts genetic chemoprotection to the gene-modified cells and has been successfully employed in large animal models (92, 93).

5.2 Insertional Mutagenesis

Insertional mutagenesis is now recognized as a clinically established risk associated with the use of integrating vector systems. Random integration events have the potential to drive a cell toward neoplastic transformation through inappropriate activation or inactivation of genes involved in the regulation of cell growth and differentiation. The risk of mutagenesis of cellular sequences promoting a malignant phenotype has been estimated to be about 10⁻⁷ per insertion (94). Although avoidance of integrating vector systems is not currently a viable option for gene therapy targeting the hematopoietic compartment, two broad strategies by which the risk of insertional mutagenesis can be significantly reduced are (1) reduction in the absolute number of integration sites to which patients are exposed and (2) reduction of the risk associated with individual integration events. Achievement of the former will depend on more sharply defining both the desired target cell population and the minimum effective dose of gene-corrected cells
to produce the desired phenotypic effect, as well as on optimization of transduction conditions. Reduction of the risk associated with individual integration events is theoretically achievable by careful choice of integrating vector system and optimized expression cassette design that lacks strong viral promoter/enhancer elements. Whether leukemia represents a generic risk associated with the use of integrating vectors to target hematopoietic progenitor cells, or is linked more directly to specific features of the SCID-X1 gene therapy trial, has yet to be established. Whatever the answer, future gene therapy studies that employ integrating vectors must be assessed against the as-yet poorly quantified risk of neoplasia developing as a consequence of insertional mutagenesis.
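To see why limiting the absolute number of integration events matters, the per-insertion risk estimate quoted above can be paired with a rough back-of-the-envelope calculation. The cell dose and vector copy number below are purely illustrative assumptions, not values from any trial; the sketch simply shows how the expected number of high-risk integrations scales.

```python
# Purely illustrative arithmetic: expected number of integration events that hit
# a growth-promoting locus, using the ~1e-7 per-insertion estimate cited above (94).
risk_per_insertion = 1e-7   # order-of-magnitude estimate from the text
cells_infused = 5e7         # hypothetical dose of gene-modified cells
copies_per_cell = 1.5       # hypothetical mean vector copy number per cell

expected_events = cells_infused * copies_per_cell * risk_per_insertion
print(f"Expected potentially transforming integrations per infusion: {expected_events:.1f}")
# Halving either the cell dose or the vector copy number halves this expectation,
# which is the motivation for strategy (1): limiting the number of integration sites.
```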
6 THE WAY FORWARD
The capacity of gene therapy to cure human disease is now an established reality, but for now most disease phenotypes and pathophysiological processes potentially amenable to this exciting therapeutic approach lie beyond the reach of existing technology. The major challenge for the future, therefore, is to address the inherent shortcomings in the efficacy and safety of available gene delivery systems. Developments in this area will be complemented by an improved understanding of target cell biology, in particular the capacity to identify and manipulate stem cell populations. Finally, unwanted host-vector interactions, such as immune responses directed against the vector and encoded transgene product, must be better understood and avoided. Such progress is fundamentally dependent on sound basic and pre-clinical research coupled with iterative human clinical trials.

REFERENCES

1. Gene Therapy Advisory Committee (GTAC). United Kingdom Department of Health. Available: http://www.advisorybodies.doh.gov.uk/genetics/gtac. 2. L. K. Branski, C. T. Pereira, D. N. Herndon, and M. G. Jeschke, Gene therapy in wound healing: present status and future directions. Gene Ther. 2007; 14: 1–10.
3. E. D. Milligan, E. M. Sloane, S. J. Langer, T. S. Hughes, B. M. Jekich, M. G. Frank, J. H. Mahoney, L. H. Levkoff, S. F. Maier, P. E. Cruz, T. R. Flotte, K. W. Johnson, M. M. Mahoney, R. A. Chavez, L. A. Leinwand, and L. R. Watkins, Repeated intrathecal injections of plasmid DNA encoding interleukin-10 produce prolonged reversal of neuropathic pain. Pain 2006; 126: 294–308. 4. C. H. Evans, P. D. Robbins, S. C. Ghivizzani, M. C. Wasko, M. M. Tomaino, R. Kang, T. A. Muzzonigro, M. Vogt, E. M. Elder, T. L. Whiteside, S. C. Watkins, and J. H. Herndon, Gene transfer to human joints: progress toward a gene therapy of arthritis. Proc. Natl. Acad. Sci. U.S.A. 2005; 102: 8698–8703. 5. R. A. Morgan, M. E. Dudley, J. R. Wunderlich, M. S. Hughes, J. C. Yang, R. M. Sherry, R. E. Royal, S. L. Topalian, U. S. Kammula, N. P. Restifo, Z. Zheng, A. Nahvi, C. R. de Vries, L. J. Rogers-Freezer, S. A. Mavroukakis, and S. A. Rosenberg, Cancer regression in patients after transfer of genetically engineered lymphocytes. Science 2006; 314: 126–129. 6. O. ter Brake, P. Konstantinova, M. Ceylan, and B. Berkhout, Silencing of HIV-1 with RNA interference: a multiple shRNA approach. Mol. Ther. 2006; 14: 883–892. 7. S. A. Rosenberg, P. Aebersold, K. Cornetta, A. Kasid, R. A. Morgan, R. Moen, E. M. Karson, M. T. Lotze, J. C. Yang, and S. L. Topalian, Gene transfer into humans-immunotherapy of patients with advanced melanoma, using tumor-infiltrating lymphocytes modified by retroviral gene transduction. N. Engl. J. Med. 1990; 323: 570–578. 8. M. L. Edelstein, M. R. Abedi, J. Wixon, and R. M. Edelstein, Gene therapy clinical trials worldwide 1989–2004-an overview. J. Gene Med. 2004; 6: 597–602. 9. Gene Therapy Clinical Trials Worldwide. The Journal of Gene Medicine. Available: http:// www.wiley.co.uk/genmed/clinical/2007. 10. M. Cavazzana-Calvo, S. Hacein-Bey, G. de Saint Basile, F. Gross, E. Yvon, P. Nusbaum, F. Selz, C. Hue, S. Certain, J. L. Casanova, P. Bousso, F. L. Deist, and A. Fischer, Gene therapy of human severe combined immunodeficiency (SCID)-X1 disease. Science 2000; 288: 669–672. 11. S. Hacein-Bey-Abina, F. Le Deist, F. Carlier, C. Bouneaud, C. Hue, J. P. de Villartay, A. J. Thrasher, N. Wulffraat, R. Sorensen, S. Dupuis-Girod, A. Fischer, E. G. Davies, W. Kuis, L. Leiva, and M. Cavazzana-Calvo, Sustained correction of X-linked severe combined
immunodeficiency by ex vivo gene therapy. N. Engl. J. Med. 2002; 346: 1185–1193. 12. M. G. Ott, M. Schmidt, K. Schwarzwaelder, S. Stein, U. Siler, U. Koehl, H. Glimm, K. Kuhlcke, A. Schilz, H. Kunkel, S. Naundorf, A. Brinkmann, A. Deichmann, M. Fischer, C. Ball, I. Pilz, C. Dunbar, Y. Du, N. A. Jenkins, N. G. Copeland, U. Luthi, M. Hassan, A. J. Thrasher, D. Hoelzer, C. von Kalle, R. Seger, and M. Grez, Correction of X-linked chronic granulomatous disease by gene therapy, augmented by insertional activation of MDS1-EVI1, PRDM16 or SETBP1. Nat. Med. 2006; 12: 401–409. 13. A. Aiuti, S. Vai, A. Mortellaro, G. Casorati, F. Ficara, G. Andolfi, G. Ferrari, A. Tabucchi, F. Carlucci, H. D. Ochs, L. D. Notarangelo, M. G. Roncarolo, and C. Bordignon, Immune reconstitution in ADA-SCID after PBL gene therapy and discontinuation of enzyme replacement. Nat. Med. 2002; 8: 423–425. 14. A. Aiuti, S. Slavin, M. Aker, F. Ficara, S. Deola, A. Mortellaro, S. Morecki, G. Andolfi, A. Tabucchi, F. Carlucci, E. Marinello, F. Cattaneo, S. Vai, P. Servida, R. Miniero, M. G. Roncarolo, and C. Bordignon, Correction of ADA-SCID by stem cell gene therapy combined with nonmyeloablative conditioning. Science 2002; 296: 2410–2413. 15. H. B. Gaspar, K. L. Parsley, S. Howe, D. King, K. C. Gilmour, J. Sinclair, G. Brouns, M. Schmidt, C. von Kalle, T. Barington, M. A. Jakobsen, H. O. Christensen, A. Al Ghonaium, H. N. White, J. L. Smith, R. J. Levinsky, R. R. Ali, C. Kinnon, A. J. Thrasher, Gene therapy of X-linked severe combined immunodeficiency by use of a pseudotyped gammaretroviral vector. Lancet 2004; 364: 2181–2187. 16. H. B. Gaspar, E. Bjorkegren, K. Parsley, K. C. Gilmour, D. King, J. Sinclair, F. Zhang, A. Giannakopoulos, S. Adams, L. D. Fairbanks, J. Gaspar, L. Henderson, J. H. Xu-Bayford, E. G. Davies, P. A. Veys, C. Kinnon, and A. J. Thrasher, Successful reconstitution of immunity in ADA-SCID by stem cell gene therapy following cessation of PEG-ADA and use of mild preconditioning. Mol. Ther. 2006; 14: 505–513. 17. N. Somia and I. M. Verma, Gene therapy: trials and tribulations. Nat. Rev. Genet. 2000; 1: 91–99. 18. S. D. Li and L. Huang, Gene therapy progress and prospects: non-viral gene therapy by systemic delivery. Gene Ther. 2006; 13: 1313–1319.
19. D. J. Glover, H. J. Lipps, and D. A. Jans, Towards safe, non-viral therapeutic gene expression in humans. Nat. Rev. Genet. 2005; 6: 299–310. 20. C. C. Conwell and L. Huang, Recent advances in non-viral gene delivery. Adv. Genet. 2005; 53: 3–18. 21. S. Mehier-Humbert and R. H. Guy, Physical methods for gene transfer: improving the kinetics of gene delivery into cells. Adv. Drug Deliv. Rev. 2005; 57: 733–753. 22. S. Simoes, A. Filipe, H. Faneca, M. Mano, N. Penacho, N. Duzgunes, and M. P. de Lima, Cationic liposomes for gene delivery. Expert. Opin. Drug Deliv. 2005; 2: 237–254. 23. C. Louise, Nonviral vectors. Methods Mol. Biol. 2006; 333: 201–226. 24. M. D. Lavigne and D. C. Gorecki, Emerging vectors and targeting methods for nonviral gene therapy. Expert. Opin. Emerg. Drugs 2006; 11: 541–557. 25. W. Walther and U. Stein, Viral vectors for gene transfer: a review of their use in the treatment of human diseases. Drugs 2000; 60: 249–271. 26. M. A. Kay, J. C. Glorioso, and L. Naldini, Viral vectors for gene therapy: the art of turning infectious agents into vehicles of therapeutics. Nat. Med. 2001; 7: 33–40. 27. I. M. Verma and M. D. Weitzman, Gene therapy: twenty-first century medicine. Annu. Rev. Biochem. 2005; 74: 711–738. 28. S. Hacein-Bey-Abina, C. von Kalle, M. Schmidt, F. Le Deist, N. Wulffraat, E. McIntyre, I. Radford, J. L. Villeval, C. C. Fraser, M. Cavazzana-Calvo, and A. Fischer, A serious adverse event after successful gene therapy for X-linked severe combined immunodeficiency. N. Engl. J. Med. 2003; 348: 255–256. 29. N. Chirmule, K. Propert, S. Magosin, Y. Qian, R. Qian, and J. Wilson, Immune responses to adenovirus and adeno-associated virus in humans. Gene Ther. 1999; 6: 1574–1583. 30. C. S. Manno, V. R. Arruda, G. F. Pierce, B. Glader, M. Ragni, J. Rasko, M. C. Ozelo, K. Hoots, P. Blatt, B. Konkle, M. Dake, R. Kaye, M. Razavi, A. Zajko, J. Zehnder, H. Nakai, A. Chew, D. Leonard, et al., Successful transduction of liver in hemophilia by AAV-Factor IX and limitations imposed by the host immune response. Nat. Med. 2006; 12: 342–347. 31. E. K. Broberg and V. Hukkanen, Immune response to herpes simplex virus and gamma134.5 deleted HSV vectors. Curr. Gene Ther. 2005; 5: 523–530. 32. A. K. Zaiss and D. A. Muruve, Immune responses to adeno-associated virus vectors. Curr. Gene Ther. 2005; 5: 323–331. 33. D. B. Schowalter, L. Meuse, C. B. Wilson, P. S. Linsley, and M. A. Kay, Constitutive expression of murine CTLA4Ig from a recombinant adenovirus vector results in prolonged transgene expression. Gene Ther. 1997; 4: 853–860. 34. E. Dobrzynski, J. C. Fitzgerald, O. Cao, F. Mingozzi, L. Wang, and R. W. Herzog, Prevention of cytotoxic T lymphocyte responses to factor IX-expressing hepatocytes by gene transfer-induced regulatory T cells. Proc. Natl. Acad. Sci. U.S.A. 2006; 103: 4592–4597. 35. B. D. Brown, M. A. Venneri, A. Zingale, S. L. Sergi, and L. Naldini, Endogenous microRNA regulation suppresses transgene expression in hematopoietic lineages and enables stable gene transfer. Nat. Med. 2006; 12: 585–591. 36. F. G. Falkner and G. W. Holzer, Vaccinia viral/retroviral chimeric vectors. Curr. Gene Ther. 2004; 4: 417–426. 37. A. L. Epstein and R. Manservigi, Herpesvirus/retrovirus chimeric vectors. Curr. Gene Ther. 2004; 4: 409–416. 38. A. Oehmig, C. Fraefel, X. O. Breakefield, and M. Ackermann, Herpes simplex virus type 1 amplicons and their hybrid virus partners, EBV, AAV, and retrovirus. Curr. Gene Ther. 2004; 4: 385–408. 39. A. Recchia, L. Perani, D. Sartori, C. Olgiati, and F. Mavilio, Site-specific integration of functional transgenes into the human genome by adeno/AAV hybrid vectors. Mol. Ther. 2004; 10: 660–670. 40. H. Wang and A. Lieber, A helper-dependent capsid-modified adenovirus vector expressing adeno-associated virus rep78 mediates site-specific integration of a 27-kilobase transgene cassette. J. Virol. 2006; 80: 11699–11709. 41. D. W. Russell and R. K. Hirata, Human gene targeting by viral vectors. Nat. Genet. 1998; 18: 325–330. 42. R. M. Kapsa, A. F. Quigley, J. Vadolas, K. Steeper, P. A. Ioannou, E. Byrne, and A. J. Kornberg, Targeted gene correction in the mdx mouse using short DNA fragments: towards application with bone marrow-derived cells for autologous remodeling of dystrophic muscle. Gene Ther. 2002; 9: 695–699. 43. D. de Semir and J. M. Aran, Targeted gene repair: the ups and downs of a promising gene therapy approach. Curr. Gene Ther. 2006; 6: 481–504.
44. G. McClorey, H. M. Moulton, P. L. Iversen, S. Fletcher, and S. D. Wilton, Antisense oligonucleotide-induced exon skipping restores dystrophin expression in vitro in a canine model of DMD. Gene Ther. 2006; 13: 1373–1381. 45. M. Schlee, V. Hornung, and G. Hartmann, siRNA and isRNA: two edges of one sword. Mol. Ther. 2006; 14: 463–470. 46. Y. Kaneda, Y. Saeki, and R. Morishita, Gene therapy using HVJ-liposomes: the best of both worlds? Mol. Med. Today 1999; 5: 298–303. 47. Y. Kaneda, Virosomes: evolution of the liposome as a targeted drug delivery system. Adv. Drug Deliv. Rev. 2000; 43: 197–205. 48. R. J. Yanez-Munoz, K. S. Balaggan, A. Macneil, S. J. Howe, M. Schmidt, A. J. Smith, P. Buch, R. E. Maclaren, P. N. Anderson, S. E. Barker, Y. Duran, C. Bartholomae, C. von Kalle, J. R. Heckenlively, C. Kinnon, R. R. Ali, and A. J. Thrasher, Effective gene therapy with nonintegrating lentiviral vectors. Nat. Med. 2006; 12: 348–353. 49. L. Naldini, U. Blomer, P. Gallay, D. Ory, R. Mulligan, F. H. Gage, I. M. Verma, and D. Trono, In vivo gene delivery and stable transduction of nondividing cells by a lentiviral vector. Science 1996; 272: 263–267. 50. L. Naldini, U. Blomer, F. H. Gage, D. Trono, and I. M. Verma, Efficient transfer, integration, and sustained long-term expression of the transgene in adult rat brains injected with a lentiviral vector. Proc. Natl. Acad. Sci. U.S.A. 1996; 93: 11382–11388. 51. M. A. Adam, N. Ramesh, A. D. Miller, and W. R. Osborne, Internal initiation of translation in retroviral vectors carrying picornavirus 5’ nontranslated regions. J. Virol. 1991; 65: 4985–4990. 52. P. E. Monahan and R. J. Samulski, AAV vectors: is clinical success on the horizon? Gene Ther. 2000; 7: 24–30. 53. M. Ali, N. R. Lemoine, and C. J. Ring, The use of DNA viruses as vectors for gene therapy. Gene Ther. 1994; 1: 367–384. 54. A. C. Nathwani and E. G. Tuddenham, Epidemiology of coagulation disorders. Baillieres Clin. Haematol. 1992; 5: 383–439. 55. D. J. Fink, L. R. Sternberg, P. C. Weber, M. Mata, W. F. Goins, and J. C. Glorioso, In vivo expression of beta-galactosidase in hippocampal neurons by HSV-mediated gene transfer. Hum. Gene Ther. 1992; 3: 11–19. 56. X. Xiao, J. Li, T. J. McCown, and R. J. Samulski, Gene transfer by adeno-associated virus
vectors into the central nervous system. Exp. Neurol. 1997; 144: 113–124. 57. U. Blomer, L. Naldini, T. Kafri, D. Trono, I. M. Verma, and F. H. Gage, Highly efficient and sustained gene transfer in adult neurons with a lentivirus vector. J. Virol. 1997; 71: 6641–6649. 58. J. C. Glorioso, N. A. DeLuca, and D. J. Fink, Development and application of herpes simplex virus vectors for human gene therapy. Annu. Rev. Microbiol. 1995; 49: 675–710. 59. W. T. Hermens and J. Verhaagen, Viral vectors, tools for gene transfer in the nervous system. Prog. Neurobiol. 1998; 55: 399–432. 60. J. Fleming, S. L. Ginn, R. P. Weinberger, T. N. Trahair, J. A. Smythe, and I. E. Alexander, Adeno-associated virus and lentivirus vectors mediate efficient and sustained transduction of cultured mouse and human dorsal root ganglia sensory neurons. Hum. Gene Ther. 2001; 12: 77–86. 61. R. O. Snyder, C. Miao, L. Meuse, J. Tubb, B. A. Donahue, H. F. Lin, D. W. Stafford, S. Patel, A. R. Thompson, T. Nichols, M. S. Read, D. A. Bellinger, K. M. Brinkhous, and M. A. Kay, Correction of hemophilia B in canine and murine models using recombinant adenoassociated viral vectors. Nat. Med. 1999; 5: 64–70. 62. G. M. Acland, G. D. Aguirre, J. Ray, Q. Zhang, T. S. Aleman, A. V. Cideciyan, S. E. PearceKelling, V. Anand, Y. Zeng, A. M. Maguire, S. G. Jacobson, W. W. Hauswirth, and J. Bennett, Gene therapy restores vision in a canine model of childhood blindness. Nat. Genet. 2001; 28: 92–95. 63. T. M. Daly, K. K. Ohlemiller, M. S. Roberts, C. A. Vogler, and M. S. Sands, Prevention of systemic clinical disease in MPS VII mice following AAV-mediated neonatal gene transfer. Gene Ther. 2001; 8: 1291–1298. 64. A. Bosch, E. Perret, N. Desmaris, D. Trono, and J. M. Heard, Reversal of pathology in the entire brain of mucopolysaccharidosis type VII mice after lentivirus-mediated gene transfer. Hum. Gene Ther. 2000; 11: 1139–1150. 65. R. Pawliuk, K. A. Westerman, M. E. Fabry, E. Payen, R. Tighe, E. E. Bouhassira, S. A. Acharya, J. Ellis, I. M. London, C. J. Eaves, R. K. Humphries, Y. Beuzard, R. L. Nagel, and P. Leboulch, Correction of sickle cell disease in transgenic mouse models by gene therapy. Science 2001; 294: 2368–2371. 66. T. H. Nguyen, M. Bellodi-Privato, D. Aubert, V. Pichard, A. Myara, D. Trono, and N. Ferry, Therapeutic lentivirus-mediated neonatal in
vivo gene therapy in hyperbilirubinemic Gunn rats. Mol. Ther. 2005; 12: 852–859. 67. Z. Li, J. Dullmann, B. Schiedlmeier, M. Schmidt, C. von Kalle, J. Meyer, M. Forster, C. Stocking, A. Wahlers, O. Frank, W. Ostertag, K. Kuhlcke, H. G. Eckert, B. Fehse, and C. Baum, Murine leukemia induced by retroviral gene marking. Science 2002; 296: 497. 68. Declaration of Helsinki. The World Medical Association. Available: http://www.wma.net/e/policy/b3.htm. 69. K. Cornetta and F. O. Smith, Regulatory issues for clinical gene therapy trials. Hum. Gene Ther. 2002; 13: 1143–1149. 70. S. M. Dainesi, Seeking harmonization and quality in clinical trials. Clinics 2006; 61: 3–8. 71. J. Spink and D. Geddes, Gene therapy progress and prospects: bringing gene therapy into medical practice: the evolution of international ethics and the regulatory environment. Gene Ther. 2004; 11: 1611–1616. 72. D. B. Resnik and P. J. Langer, Human germline gene therapy reconsidered. Hum. Gene Ther. 2001; 12: 1449–1458. 73. M. Fuchs, Gene therapy. An ethical profile of a new medical territory. J. Gene Med. 2006; 8: 1358–1362. 74. C. Antoine, S. Muller, A. Cant, M. Cavazzana-Calvo, P. Veys, J. Vossen, A. Fasth, C. Heilmann, N. Wulffraat, R. Seger, S. Blanche, W. Friedrich, M. Abinun, G. Davies, R. Bredius, A. Schulz, P. Landais, and A. Fischer, Long-term survival and transplantation of haemopoietic stem cells for immunodeficiencies: report of the European experience 1968–99. Lancet 2003; 361: 553–560. 75. M. Cavazzana-Calvo, A. Thrasher, and F. Mavilio, The future of gene therapy. Nature 2004; 427: 779–781. 76. A. J. Thrasher, S. Hacein-Bey-Abina, H. B. Gaspar, S. Blanche, E. G. Davies, K. Parsley, K. Gilmour, D. King, S. Howe, J. Sinclair, C. Hue, F. Carlier, C. von Kalle, G. de Saint Basile, F. Le Deist, A. Fischer, and M. Cavazzana-Calvo, Failure of SCID-X1 gene therapy in older patients. Blood 2005; 105: 4255–4257.
77. A. Jaffe, S. A. Prasad, V. Larcher, and S. Hart, Gene therapy for children with cystic fibrosis-who has the right to choose? J. Med. Ethics 2006; 32: 361–364. 78. K. Nyberg, B. J. Carter, T. Chen, C. Dunbar, T. R. Flotte, S. Rose, D. Rosenblum, S. L. Simek, and C. Wilson, Workshop on long-term follow-up of participants in human gene transfer research. Mol. Ther. 2004; 10: 976–980. 79. Structural characterization of gene transfer vectors. Food and Drug Administration. Available: http://www.fda.gov/OHRMS/DOCKETS/ac/00/backgrd/3664b1a.doc. 80. M. E. Dudley, J. R. Wunderlich, P. F. Robbins, J. C. Yang, P. Hwu, D. J. Schwartzentruber, S. L. Topalian, R. Sherry, N. P. Restifo, A. M. Hubicki, M. R. Robinson, M. Raffeld, P. Duray, C. A. Seipp, L. Rogers-Freezer, K. E. Morton, S. A. Mavroukakis, D. E. White, and S. A. Rosenberg, Cancer regression and autoimmunity in patients after clonal repopulation with antitumor lymphocytes. Science 2002; 298: 850–854. 81. R. M. Blaese, K. W. Culver, A. D. Miller, C. S. Carter, T. Fleisher, M. Clerici, G. Shearer, L. Chang, Y. Chiang, P. Tolstoshev, et al., T lymphocyte-directed gene therapy for ADA-SCID: initial trial results after 4 years. Science 1995; 270: 475–480. 82. S. H. Orkin and A. G. Motulsky, Report and recommendations of the panel to assess the NIH investment in research on gene therapy. National Institutes of Health. Available: http://www.nih.gov/news/panelrep.html. 83. S. E. Raper, N. Chirmule, F. S. Lee, N. A. Wivel, A. Bagg, G. P. Gao, J. M. Wilson, and M. L. Batshaw, Fatal systemic inflammatory response syndrome in an ornithine transcarbamylase-deficient patient following adenoviral gene transfer. Mol. Genet. Metab. 2003; 80: 148–158. 84. J. Savulescu, Harm, ethics committees and the gene therapy death. J. Med. Ethics 2001; 27: 148–150. 85. B. L. Levine, L. M. Humeau, J. Boyer, R. R. Macgregor, T. Rebello, X. Lu, G. K. Binder, V. Slepushkin, F. Lemiale, J. R. Mascola, F. D. Bushman, B. Dropulic, and C. H. June, Gene transfer in humans using a conditionally replicating lentiviral vector. Proc. Natl. Acad. Sci. U.S.A. 2006; 103: 17372–17377. 86. X. Wu, Y. Li, B. Crise, and S. M. Burgess, Transcription start regions in the human genome are favored targets for MLV integration. Science 2003; 300: 1749–1751. 87. R. S. Mitchell, B. F. Beitzel, A. R. Schroder, P. Shinn, H. Chen, C. C. Berry, J. R. Ecker, and F. D. Bushman, Retroviral DNA integration: ASLV, HIV, and MLV show distinct target site preferences. PLoS Biol. 2004; 2: E234. 88. P. Hematti, B. K. Hong, C. Ferguson, R. Adler, H. Hanawa, S. Sellers, I. E. Holt, C. E. Eckfeldt, Y. Sharma, M. Schmidt, C. von Kalle, D. A. Persons, E. M. Billings, C. M. Verfaillie, A. W. Nienhuis, T. G. Wolfsberg, C. E. Dunbar, and B. Calmels, Distinct genomic integration of MLV and SIV vectors in primate hematopoietic stem and progenitor cells. PLoS Biol. 2004; 2: E423. 89. Z. Peng, Current status of gendicine in China: recombinant human Ad-p53 agent for treatment of cancers. Hum. Gene Ther. 2005; 16: 1016–1027. 90. H. Jia, Controversial Chinese gene-therapy drug entering unfamiliar territory. Nat. Rev. Drug Discov. 2006; 5: 269–270. 91. T. Neff, B. C. Beard, and H. P. Kiem, Survival of the fittest: in vivo selection and stem cell gene therapy. Blood 2006; 107: 1751–1760. 92. T. Neff, B. C. Beard, L. J. Peterson, P. Anandakumar, J. Thompson, and H. P. Kiem, Polyclonal chemoprotection against temozolomide in a large-animal model of drug resistance gene therapy. Blood 2005; 105: 997–1002. 93. T. Neff, P. A. Horn, L. J. Peterson, B. M. Thomasson, J. Thompson, D. A. Williams, M. Schmidt, G. E. Georges, C. von Kalle, and H. P. Kiem, Methylguanine methyltransferase-mediated in vivo selection and chemoprotection of allogeneic stem cells in a large-animal model. J. Clin. Invest. 2003; 112: 1581–1588.
94. C. Stocking, U. Bergholz, J. Friel, K. Klingler, T. Wagener, C. Starke, T. Kitamura, A. Miyajima, and W. Ostertag, Distinct classes of factor-independent mutants can be isolated after retroviral mutagenesis of a human myeloid stem cell line. Growth Factors 1993; 8: 197–209.
FURTHER READING J. A. Wolff and J. Lederberg, An early history of gene transfer and therapy. Hum. Gene Ther. 1994; 5: 469–480.
CROSS-REFERENCES Risk-benefit analysis, safety, translation
GENETIC ASSOCIATION ANALYSIS
D. Y. LIN
Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina

In genetic association analysis, we assess the association between a genetic marker and a response variable. The association can originate in two ways: (1) the genetic marker is causally related to the response variable; or (2) the genetic marker is correlated with a causal variant. The latter might suffice for prediction/classification purposes, whereas the former would be necessary for identifying potential drug targets. The most commonly used genetic markers are single nucleotide polymorphisms, or SNPs. Virtually all SNPs are biallelic (i.e., having only two possible nucleotides or alleles). Thus, three possible genotypes exist at each SNP site: homozygous with allele A, homozygous with allele a, or heterozygous with one allele A and one allele a, where A and a denote the two possible alleles. In the case of a binary response variable with a total of n individuals, the data are represented in a 3 × 2 table:

                 Response
Genotype     Yes     No      Total
aa           n11     n12     n1.
aA           n21     n22     n2.
AA           n31     n32     n3.
Total        n.1     n.2     n

To test the independence between the genotype and the response, we can use the well-known Pearson's chi-squared statistic

\sum_{i=1}^{3}\sum_{j=1}^{2} \frac{(n_{ij}-e_{ij})^{2}}{e_{ij}}, \qquad (1)

where e_{ij} = n_{i.} n_{.j}/n. The null distribution of this test statistic is asymptotically chi-squared with two degrees of freedom. We can also perform Fisher's exact test, which is more accurate but more time-consuming.

In many situations, the effects of SNP alleles are roughly additive; that is, the probability of response for the heterozygote (i.e., genotype aA) is intermediate between those of the two homozygotes (i.e., genotypes aa and AA). Then it is desirable to use the Armitage (1) trend test statistic:

\frac{n\, n_{.1} n_{.2} \left( n_{21}/(2n_{.1}) + n_{31}/n_{.1} - n_{22}/(2n_{.2}) - n_{32}/n_{.2} \right)^{2}}{n\left( n_{2.}/4 + n_{3.} \right) - \left( n_{2.}/2 + n_{3.} \right)^{2}}, \qquad (2)

which has one degree of freedom. This test will be more powerful than Equation (1) if the genetic effects are indeed additive. We can also tailor our analysis to other genetic models. Under the dominant and recessive modes of inheritance, the test statistics become

\frac{n\left\{ n_{11}(n_{22}+n_{32}) - n_{12}(n_{21}+n_{31}) \right\}^{2}}{n_{1.}(n_{2.}+n_{3.})\, n_{.1} n_{.2}} \qquad (3)

and

\frac{n\left\{ n_{32}(n_{11}+n_{21}) - n_{31}(n_{12}+n_{22}) \right\}^{2}}{(n_{1.}+n_{2.})\, n_{3.}\, n_{.1} n_{.2}}, \qquad (4)

respectively. Again, Fisher's exact tests can also be used. We measure the strength of association by the odds ratio or the difference of response rates.

The choice of the test statistic should ideally be driven by the model of inheritance. Unfortunately, this knowledge is rarely available. The most common practice is to use the Armitage trend test, which should perform well unless the effects are far from additive.

Test statistics (2)-(4) can be obtained as the score statistics under the logistic regression model:

logit{Pr(Y = 1|X)} = α + βX, (5)

where Y is the response variable, X is the genotype score, α is the intercept, and β is the log odds ratio. Under the additive mode of inheritance, X denotes the number of A alleles; under the dominant model, X is the indicator for at least one A allele; and under the recessive model, X is the indicator for genotype AA. Inference on the odds ratio (i.e., e^β) is based on the maximum likelihood theory. Test statistic (1) can also be generated under model (5).
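As a small illustration of how these statistics are computed in practice, the sketch below evaluates tests (1)–(4) on a hypothetical 3 × 2 table of counts. The counts are invented for illustration, and the availability of NumPy and SciPy is assumed.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical genotype-by-response counts (rows: aa, aA, AA; columns: yes, no)
counts = np.array([[30.0, 70.0],
                   [45.0, 55.0],
                   [25.0, 15.0]])
row = counts.sum(axis=1)   # n_{1.}, n_{2.}, n_{3.}
col = counts.sum(axis=0)   # n_{.1}, n_{.2}
n = counts.sum()

# (1) Pearson chi-squared statistic, 2 df
expected = np.outer(row, col) / n
pearson = ((counts - expected) ** 2 / expected).sum()

# (2) Armitage trend test with genotype scores 0, 1/2, 1, 1 df
diff = (counts[1, 0] / (2 * col[0]) + counts[2, 0] / col[0]
        - counts[1, 1] / (2 * col[1]) - counts[2, 1] / col[1])
trend = (n * col[0] * col[1] * diff ** 2
         / (n * (row[1] / 4 + row[2]) - (row[1] / 2 + row[2]) ** 2))

# (3) Dominant model: aA and AA pooled against aa, 1 df
dominant = (n * (counts[0, 0] * (counts[1, 1] + counts[2, 1])
                 - counts[0, 1] * (counts[1, 0] + counts[2, 0])) ** 2
            / (row[0] * (row[1] + row[2]) * col[0] * col[1]))

# (4) Recessive model: aa and aA pooled against AA, 1 df
recessive = (n * (counts[2, 1] * (counts[0, 0] + counts[1, 0])
                  - counts[2, 0] * (counts[0, 1] + counts[1, 1])) ** 2
             / ((row[0] + row[1]) * row[2] * col[0] * col[1]))

print(f"Pearson (2 df): {pearson:.2f}, p = {chi2.sf(pearson, 2):.4f}")
for name, stat in [("trend", trend), ("dominant", dominant), ("recessive", recessive)]:
    print(f"{name} (1 df): {stat:.2f}, p = {chi2.sf(stat, 1):.4f}")
```

Because tests (2)-(4) are score statistics under model (5), fitting the logistic regression with the corresponding genotype coding in any standard statistical package gives essentially the same inference, which is often the more convenient route in practice.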
An important advantage of the logistic modeling is that it can readily accommodate multiple SNPs, environmental factors (e.g., treatment assignments), and gene-environment interactions. We adopt the general form of the logistic regression model

logit{Pr(Y = 1|X)} = α + β'X,

where X consists of genotype scores of (possibly) multiple SNPs, environmental factors, and products of genotype scores and environmental factors, and β is a vector of log odds ratios. In particular, we can assess the interactions between treatment assignments and genetic markers under this model. If the response variable is continuous, then we replace the logistic regression model with the familiar linear regression model. Indeed, we can use any generalized linear model (2). The only difference from the traditional regression analysis lies in the incorporation of appropriate genotype scores. If the response variable is survival time or event time, then we employ the proportional hazards model (3). All the analysis can be carried out in standard statistical software.

It can be problematic to include many SNPs, some of which are highly correlated, in the regression analysis. An alternative approach is to consider haplotypes. A haplotype contains alleles from one parent. Association analysis based on haplotypes can reduce the degrees of freedom and increase the power to capture the combined effects of multiple causal variants, as compared with SNP-based analysis. The current genotyping technologies do not separate the two homologous chromosomes of an individual, so haplotypes are not directly observable. Maximum likelihood methods have been developed to perform haplotype analysis based on genotype data (4). Haplotypes can also be used to infer the alleles of an untyped SNP (i.e., an SNP that is not on the genotyping chip) and thus allow analysis of the association between an untyped SNP and a response variable (5).

A potential drawback of genetic association analysis is that spurious associations
may result from unknown population structure or stratification. This problem originates when a genetic subgroup has a higher response rate than another, so that any SNP with allele proportions that differ among the subgroups will appear to be associated with the response variable. Several methods have been proposed to deal with population stratification. The most popular current approach is to infer axes of genetic variation from genomic markers through principal components analysis and then include those axes as covariates in the association analysis (6).

Multiple testing is a serious issue in genetic association analysis, especially if many markers are examined. The common practice is to use the Bonferroni correction, which essentially divides the overall significance level by the number of tests performed. This strategy is conservative, especially when the markers are highly correlated. Accurate control of the type I error can be achieved by incorporating the correlations of the test statistics into the adjustments for multiple testing (7).

The description thus far has focused on the frequentist paradigm. There is increasing interest in using Bayesian methods in genetic association analysis (8).
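As a small illustration of the Bonferroni correction just described, the sketch below adjusts a handful of hypothetical per-marker P-values. The values are invented, and the correlation-aware adjustments of (7) are beyond this simple example.

```python
# Hypothetical unadjusted P-values from single-marker tests
pvals = [0.0004, 0.013, 0.21, 0.47, 0.88]
m = len(pvals)
alpha = 0.05

# Bonferroni: multiply each P-value by the number of tests (capped at 1),
# which is equivalent to comparing unadjusted P-values with alpha / m
adjusted = [min(1.0, p * m) for p in pvals]
for p, p_adj in zip(pvals, adjusted):
    print(f"p = {p:.4f}  adjusted p = {p_adj:.4f}  significant: {p_adj < alpha}")
```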
REFERENCES

1. P. Armitage, Test for linear trend in proportions and frequencies. Biometrics 1955; 11: 375–386.
2. P. McCullagh and J. A. Nelder, Generalized Linear Models, 2nd ed. New York: Chapman & Hall, 1989.
3. D. R. Cox, Regression models and life-tables (with discussion). J. R. Stat. Soc. Ser. B 1972; 34: 187–220.
4. D. Y. Lin and D. Zeng, Likelihood-based inference on haplotype effects in genetic association studies (with discussion). J. Am. Stat. Assoc. 2006; 101: 89–118.
5. D. Y. Lin, Y. Hu, and B. E. Huang, Simple and efficient analysis of SNP-disease association with missing genotype data. Am. J. Hum. Genet. In press.
6. A. L. Price, N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick, and D. Reich, Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006; 38: 904–909.
7. D. Y. Lin, An efficient Monte Carlo approach to assessing statistical significance in genomic studies. Bioinformatics 2005; 21: 781–787.
8. J. Marchini, B. Howie, S. Myers, G. McVean, and P. Donnelly, A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 2007; 39: 906–913.
GLOBAL ASSESSMENT VARIABLES
BARBARA TILLEY
Medical University of South Carolina, Charleston, South Carolina

PENG HUANG
Johns Hopkins University

PETER C. O'BRIEN
Mayo Clinic

Although most clinical trials are designed using a single primary outcome, in some trials it is difficult to find the single most appropriate outcome for the main objective of the study. In the NINDS Stroke Trial, a consistent and persuasive poststroke improvement on multiple outcomes was required to define treatment efficacy (1). In Parkinson's disease clinical trials, a treatment would not be considered to slow progression if the treatment improved only motor score and other measures of outcome deteriorated. In studying quality of life, we may be interested in treatment effects on all subscales being measured. In all cases, no single measure is sufficient to describe the treatment effect of interest, and no validated summary score exists. In a quality-of-life example, we could measure an overall quality-of-life score by summing across subscales, but we may under- or overestimate the effect of treatment depending on how the treatment affects the individual subscales.

1 SCIENTIFIC QUESTIONS FOR MULTIPLE OUTCOMES

The choice of statistical method used for assessing treatment effect on multiple outcomes depends on the scientific question under investigation. Two types of questions lead to two different classes of tests. The first type is a directionless question: "Is there any difference, helpful or harmful, between the two treatments in these outcomes?" A treatment difference could be detected if both strong positive and negative effects were observed. The second type is a directional question: "Is there any global treatment benefit (or harm)?" with all outcomes expected to have an effect in the same direction. To answer the first question, separate tests for each single outcome can be performed. Methods for these comparisons are discussed later in this section in comparison to global assessment. To address the directional hypothesis requires either a composite outcome or a global assessment of treatment benefit.

Composite outcomes (see Composite Outcomes) refer to the situation where multiple outcomes are combined using a scoring system, which is defined a priori and considered to be clinically meaningful. For the situation where there is no such scoring system, O'Brien (2) as well as Pocock et al. (3) introduced the novel global statistical test (GST) methodology to combine information across outcomes and examine whether a treatment has a global benefit. GSTs have optimal statistical power when the treatment effect is of similar magnitude across all outcomes (common dose effect). Because these methods can test whether a treatment has a global benefit without the need for multiple tests, the GST approach has been widely applied in medical studies (4–8), and extensions of the GST method have appeared in the literature. Below we review different versions of the GST and describe where each GST can be appropriately used.

2 GENERAL COMMENTS ON THE GST

2.1 Definition of a GTE

Many GSTs can be described through two quantities. The first we call the global treatment effect (GTE), and it measures a treatment's overall benefit across multiple outcomes. The concept of GTE was first introduced by Huang et al. (9); they defined it as a scale-free quantity. However, a treatment's overall benefit for many GSTs is defined through some scale-dependent quantity. We continue to call such a quantity a GTE with the understanding that it may not be exactly the same as that defined by Huang et al. (9). The other quantity is the statistic that tests the significance of this GTE. For those GSTs without a GTE defined, the interpretation of
the GST is more difficult, as is the quantification of treatment benefit. The GTE from Huang et al. (9) is defined as the difference of two probabilities: the probability that a control group patient will do better if the patient is switched to the treatment, and the probability that a treatment patient will do better if the patient is switched to the control. A larger positive GTE value corresponds to stronger evidence of treatment benefit. Huang et al. (10) gave an unbiased estimate of the GTE for a rank-sum GST. A detailed discussion of the GTE is given in Huang et al. (9). A major advantage of using a GTE is that it can easily combine a treatment's effects across different outcomes, no matter which scales are used and whether the outcomes have skewed distributions. Such an advantage is not achieved by GTEs from parametric GSTs. A good discussion of the advantages of using such a scale-free measure is given by Acion et al. (11). Based on the rank-sum GST, suppose the GTE = 0.4, or 40%. This value implies that a patient in the control group would have a 70% − P/2 [= (100% + 40% − P)/2] probability of a better overall outcome if he/she had been assigned to the treatment group, where P is the probability that the two groups being compared have tied observations. If no ties are observed, then this probability would equal 70%. As the types of GSTs are described below, the GTEs will be indicated, or we will indicate that the GTE cannot be computed.
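To make the scale-free GTE concrete, the sketch below estimates it from two small samples of per-patient global scores by comparing every treatment patient with every control patient. The scores are hypothetical, and the estimator shown is a simple pairwise-comparison version of the rank-based idea described above rather than the exact procedure of Huang et al.

```python
import numpy as np

# Hypothetical per-patient global scores (e.g., rank-sum scores; higher = better)
treat = np.array([112.0, 95.5, 130.0, 101.0, 88.5, 121.0])
ctrl = np.array([90.0, 84.5, 107.0, 79.0, 99.5])

# Compare every treatment patient with every control patient:
# GTE estimate = P(treatment patient scores higher) - P(control patient scores higher)
diff = treat[:, None] - ctrl[None, :]
gte_hat = (diff > 0).mean() - (diff < 0).mean()
print(f"Estimated GTE: {gte_hat:.2f}")  # 0 = no benefit, +1 = uniformly better
```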
3 RECODING OUTCOME MEASURES
Because a global assessment of outcome variables is sensitive to the direction of the treatment effect on each of the variables, it is important to reverse-code any variables for which the treatment effect would be measured in the opposite direction. For example, one can multiply variables by (−1) if smaller observations indicate better outcomes for some but not all variables. In other words, all variables are coded such that larger observations are preferred before applying a GST. Hereafter, we assume that all variables are coded so that a larger value is preferred, although
the procedure could easily be applied when all variables are coded such that the smaller value is preferred.

3.1 Assumptions

Most GSTs require a common dose assumption, which implies that the treatment has a similar effect on all outcomes. The parametric GSTs must also meet the distributional assumptions of the test.
4 TYPES OF GLOBAL STATISTICAL TESTS (GSTS)

4.1 Rank-Sum-Type GST

O'Brien (2) proposed a simple nonparametric GST for multiple outcomes using ranks. First, for each outcome, all patients are ranked according to their values on this outcome. Second, for each patient, the patient's ranks are summed across outcomes to obtain a rank-sum score. Finally, a t-test (or a one-way ANOVA if more than two groups are involved) is applied to compare the rank-sum scores between the treatment and control groups. More generally, any univariate statistical procedure for comparing two samples may be applied to the rank-sum scores. The GTE of this test is a rescaled probability that a control group patient will do better if the patient switches to the treatment, as defined by Huang et al. (9). Because a probability takes a value between 0 and 1, they linearly rescaled it by subtracting 0.5 and then multiplying by 2, giving a value between −1 and +1, so that GTE = 0 implies no overall treatment benefit, a positive GTE value implies treatment benefit, and a negative GTE value implies the treatment is detrimental. The rank-sum-type GST is very simple to carry out. It is flexible in its ability to analyze multiple outcomes without the need for parametric assumptions regarding the distributions of the outcomes and the correlations among outcomes. The test is invariant to any monotone transformation of the data and is robust to outliers. Thus, the rank-sum-type GST is applicable to broad medical research settings. Huang et al. (9) provided sample size computation formulas and corresponding S-Plus program code for when the rank-sum-type GST is used as the primary analysis.
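The three steps just described translate directly into code. The sketch below is a minimal illustration with invented data for two arms and three outcomes (all coded so that larger is better); it assumes NumPy and SciPy are available and uses a simple two-sample t-test on the rank-sum scores, as in O'Brien's original proposal.

```python
import numpy as np
from scipy.stats import rankdata, ttest_ind

# Hypothetical data: rows = patients, columns = outcomes (all coded so larger = better)
treatment = np.array([[5.1, 22.0, 3.4],
                      [6.0, 25.0, 2.9],
                      [4.8, 27.5, 3.8],
                      [5.9, 24.0, 4.1]])
control = np.array([[4.2, 21.0, 2.8],
                    [5.0, 20.5, 3.0],
                    [4.1, 23.0, 2.5],
                    [4.9, 19.0, 3.1]])

combined = np.vstack([treatment, control])
# Step 1: rank all patients on each outcome separately (ties get average ranks)
ranks = np.apply_along_axis(rankdata, 0, combined)
# Step 2: sum each patient's ranks across outcomes to obtain a rank-sum score
scores = ranks.sum(axis=1)
# Step 3: compare rank-sum scores between arms with a two-sample t-test
t_stat, p_value = ttest_ind(scores[:len(treatment)], scores[len(treatment):])
print(f"Rank-sum-type GST: t = {t_stat:.2f}, p = {p_value:.3f}")
```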
4.2 Adjusted Rank-Sum-Type GST

Huang et al. (9) extended O'Brien's rank-sum-type GST to the case where the variances in the two treatment groups are different. The adjusted GST is computed similarly to O'Brien's unadjusted rank-sum GST, but it is divided by an adjusting factor. As for O'Brien's rank-sum test, the GTE is defined as the difference of two probabilities: the probability that a control group patient will do better if the patient is switched to the treatment, and the probability that a treatment patient will do better if the patient is switched to the control. If no tied observations are recorded, then this GTE measures the probability that a patient will do better if he/she is switched to the new treatment. Because this adjusted GST does not require the two treatment groups to have the same distribution under both the null and the alternative hypotheses, it can be applied to the Behrens-Fisher problem. The sample size formulas given by Huang et al. (9) can be applied to this adjusted rank-sum-type test.

4.3 Ordinary Least Squares (OLS)-Based GST and Generalized Least Squares (GLS)-Based GST

These two tests were designed to have optimal power when the common-dose assumption is met (2, 3). The test statistic for the OLS-based GST uses the standardized arithmetic mean (with equal weights) of all outcomes, and it is constructed from the ordinary least squares estimate of the GTE. The common standardized mean difference is the GTE. The test statistic for the GLS-based GST uses a weighted average of all outcomes, with weights depending on the correlations among the outcomes, and it is constructed from the generalized least squares estimate of the GTE. These two GSTs are most appropriate for cases in which the treatment effect is measured by the change in outcome means. When the common-dose assumption holds, the GLS-based GST has higher power than other tests with the same type I error rate. Its limitation is that when some weights are negative, the test will be difficult to interpret: it will be difficult to determine whether a larger or a smaller test statistic is preferred or indicates treatment benefit. The negative weights can develop when the outcomes are diverse and do not all have the same correlation with each other. Because of this problem with negative weights, several authors recommend the use of the unweighted ordinary least squares GST (3, 11, 12).
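As an illustration of the unweighted (OLS-type) approach, the sketch below standardizes each outcome, averages the standardized values with equal weights for each patient, and compares the resulting global scores between arms. The data are invented, and the standardization shown (pooled sample mean and SD followed by an ordinary t-test) is a simplification of the exact O'Brien formulation rather than a reproduction of it.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical data as before: rows = patients, columns = outcomes (larger = better)
treatment = np.array([[5.1, 22.0, 3.4],
                      [6.0, 25.0, 2.9],
                      [4.8, 27.5, 3.8],
                      [5.9, 24.0, 4.1]])
control = np.array([[4.2, 21.0, 2.8],
                    [5.0, 20.5, 3.0],
                    [4.1, 23.0, 2.5],
                    [4.9, 19.0, 3.1]])

combined = np.vstack([treatment, control])
# Standardize each outcome (pooled mean and SD), then average with equal weights
z = (combined - combined.mean(axis=0)) / combined.std(axis=0, ddof=1)
global_scores = z.mean(axis=1)
t_stat, p_value = ttest_ind(global_scores[:len(treatment)], global_scores[len(treatment):])
print(f"Unweighted (OLS-type) global test: t = {t_stat:.2f}, p = {p_value:.3f}")
```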
4.4 Likelihood Ratio (LR)-Based GST for Normal Outcomes

Recognizing that the common-dose assumption may not hold in practice and that the GLS-based GST may lose power in this case, Tang et al. (12) studied the GLS approach for normally distributed outcomes assuming that the treatment effects on all outcomes are in the same direction (i.e., all improved or all worsened) but that the magnitudes of the changes are a multiple of (or, equivalently, proportional to) a set of prespecified and possibly different values (13). The GTE of the LR-based GST is this common multiplier. The test statistic for the LR-based GST is constructed in a similar manner as the GLS-based GST, but the outcome weights are determined by the prespecified magnitudes of change and the correlations among the outcomes. This GST is also designed for normally distributed outcomes. Tang et al. (14) illustrated how to apply the LR-based GST in interim analysis and clinical trial design. Like the GLS-based GST, the LR-based GST has good power when the prespecified magnitudes of change are true for the data, but in practice these weights are not always known.

4.5 Approximate Likelihood Ratio (ALR)-Based GST for Normal Outcomes

Although the LR-based GST has good power when the true magnitudes of the treatment effects on all outcomes are proportional to some prespecified values, its power decreases rapidly as the true treatment effects deviate from the prespecified values. Because it would be difficult for investigators to know the true treatment effects in advance, Tang et al. (14) proposed an ALR-based GST that maintains good power over all possible magnitudes of treatment effects (all positive values). This GST is also designed for normally distributed outcomes. Its GTE is not easy to define. The test statistic is more complicated to compute than the previous GSTs. Simulations showed that the ALR test is more powerful than
O'Brien's GLS-based GST when treatment effects are larger on some outcomes than on others. However, the authors noted that the ALR test ignores differences going in the wrong direction, so caution is needed when using one-sided multivariate testing.

4.6 GST Using a Scaled Linear Mixed Model for Continuous Outcomes

When all outcomes are normally distributed and measure the same underlying event, Lin et al. (15) proposed a scaled linear mixed model to estimate the treatment's common dose effect. The method is flexible in studying multiple outcomes measured on different scales. Another advantage of this method is that it can adjust for confounding covariates and can be implemented by repeatedly fitting standard mixed models (See Mixed Models).

4.7 Follmann's GST for Normal Outcomes

Follmann proposed a simple test that has good power when the sum of treatment effects on all outcomes is positive (17). Follmann's GST uses Hotelling's T² test statistic to make a decision. It rejects the null hypothesis when T² exceeds its 2α critical value and the sum of treatment differences on all outcomes is positive. The test controls the type I error α whether or not the covariance matrix of the outcomes is known. Under the common dose assumption, Follmann's test is less powerful than O'Brien's GLS-based GST. However, when the common dose assumption is violated, Follmann's test could have higher power than O'Brien's GST and should avoid problems in interpretation if the correlation structure leads to negative weights for the GST.

4.8 GST for Binary Outcomes

Lefkopoulou et al. (18) applied quasi-likelihood methods to model multiple binary outcomes. Later they extended these methods to incorporate analyses of clusters of individuals and derived a score test for multiple binary outcomes (19). They showed the score test to be a special case of Pocock et al. (3). The common relative risk between different groups across all outcomes is the GTE measure of the GST and represents
the assumed common-dose effect. The relative risk can be derived from the common odds ratio using the methods of Lu and Tilley (20). Binary GSTs without the common dose assumption were also compared using GEE by Lefkopoulou et al. (18) and Legler et al. (21). As the number of unaffected outcomes included in the GST increases, the global treatment effect is diluted, and the statistical power of the GST decreases quickly.

4.9 GST for Binary and Survival Outcomes

Pocock et al. (3) gave a computational formula for the GST when the outcomes consist of binary outcomes and a survival outcome. First, a test statistic is constructed for each single outcome. Then, a multivariate test statistic is constructed based on these univariate test statistics and the covariances among them. Rejection of the null hypothesis implies that a treatment difference is observed on at least one outcome. Because this test does not assess whether a treatment has a global benefit, no GTE can be defined for the test. Also, because a global benefit is not computed, it could be considered a composite test (See Composite Endpoints). Simpler methods for accommodating binary and time-to-event endpoints might also be considered. Specifically, one could use the actual binary values and the rank-sum scores associated with each type of endpoint, respectively.

4.10 GST Using Bootstrap

Bloch et al. (22) defined treatment benefit as an improvement on at least one outcome and non-inferiority on all outcomes. For example, if we have two outcomes, let Δ1 be the mean treatment improvement on the first outcome and Δ2 the mean treatment improvement on the second outcome. They proposed to prespecify two positive numbers δ1 and δ2 such that treatment benefit is defined by the satisfaction of the following three conditions simultaneously:

max{Δ1, Δ2} > 0 and Δ1 > −δ1 and Δ2 > −δ2.

This method is very similar to the approach proposed earlier by Jennison and Turnbull
(23), who used an L-shaped subset of the parameter space to define treatment benefit. The test statistic T is similar to Hotelling's T² test statistic, but T is positive only when the estimated Δ1 and Δ2 values are greater than some critical values u1 and u2, respectively, and T = 0 otherwise. The null hypothesis of no treatment benefit is rejected when T > c for some critical value c. The critical values u1, u2, and c need to be determined to control the type I error rate. Because it is difficult to derive the null distribution of T analytically, the authors use bootstrap methods to estimate u1, u2, and c. Their simulations show that the proposed method controls the type I error rate and has good power for both normal and non-normal data, such as a mixture of normal distributions and a mixture of exponential distributions. Again, a GTE as a global treatment effect is not computed, and again, this method could be considered more like a composite test (See Composite Endpoints).

4.11 Extension of GST in Group Sequential Trial Design

When all observations are accrued independently, all GSTs discussed above can be extended to group sequential design settings. Tang et al. (14) discussed how to apply the likelihood ratio-based GST for normal outcomes in group-sequential trials. A more general theory is given by Jennison and Turnbull (24) for the case when the joint distribution of the multiple outcomes is known except for a few parameters. The key assumption in their methods is that the additional data used in the next interim analysis must be independent of the data used in the previous interim analyses. For example, if the first interim analysis uses data from n1 subjects, the second interim analysis uses data from the original n1 subjects plus n2 additional subjects. Observations from these n2 subjects are the additional data in the second interim analysis, and they must be independent of the data used in the first interim analysis. If data are not accrued independently, as is the case when patients are measured repeatedly and data from the same patients are used in different interim analyses, then the computation of the critical values for the
stopping rule will be much more complicated. When data are not normally distributed but normal theory is used to determine the critical values for the test statistics in the different interim analyses, both the sample size at each interim analysis and the sample size difference between two consecutive interim analyses need to be large to make the Central Limit Theorem applicable.

5 OTHER CONSIDERATIONS

5.1 Power of a GST

A GST generally provides a univariate test statistic to describe the overall benefit (GTE). Because data from multiple outcomes are used to assess the GTE, it generally has higher power than a univariate test (3, 14). More importantly, a GST helps us to make a decision regarding whether a treatment is preferred using a single test, rather than using multiple tests on different outcomes with the associated penalty for multiple testing. Where some outcomes strongly suggest benefit and others strongly suggest harm, power would be reduced, and we would be unlikely to reject the null hypothesis using a GST because we are testing the benefit of treatment. In the latter case, Hotelling's T² would be likely to reject the null hypothesis (See Multiple Comparisons). Because the power of the GST is generally greater than or equal to the power of the corresponding univariate test using any single outcome, one can use the smallest sample size calculated from the corresponding univariate tests as the sample size for the GST. However, this may lead to a larger sample size than needed. Rochon (25) applied a generalized estimating equation (GEE) approach to develop a sample size computation method for a univariate outcome measured longitudinally. His method can be used for a cross-sectional study with multiple outcomes. Rochon's approach is established under the assumptions that the variance–covariance matrix of the outcomes is a function of the outcome means and that observations from all subjects in the same subgroup have exactly the same covariates, mean values, and variance–covariance matrix. A sample size computation algorithm is provided along with a table of minimum required sample sizes for binary outcomes.
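When no closed-form sample size formula fits the design at hand, power for a GST can also be checked by simulation. The sketch below estimates the power of the rank-sum-type GST for a hypothetical design (three correlated normal outcomes, a common 0.3 SD effect, 50 patients per arm); all design values are illustrative assumptions rather than recommendations.

```python
import numpy as np
from scipy.stats import rankdata, ttest_ind

rng = np.random.default_rng(0)

def rank_sum_gst_pvalue(treat, ctrl):
    """P-value of the rank-sum-type GST described in Section 4.1."""
    combined = np.vstack([treat, ctrl])
    scores = np.apply_along_axis(rankdata, 0, combined).sum(axis=1)
    return ttest_ind(scores[:len(treat)], scores[len(treat):]).pvalue

# Hypothetical design: 3 correlated normal outcomes, common 0.3 SD effect, 50 per arm
n, k, effect, rho, n_sim = 50, 3, 0.3, 0.4, 1000
cov = np.full((k, k), rho) + (1 - rho) * np.eye(k)   # equicorrelated, unit variances

rejections = 0
for _ in range(n_sim):
    ctrl = rng.multivariate_normal(np.zeros(k), cov, size=n)
    treat = rng.multivariate_normal(np.full(k, effect), cov, size=n)
    rejections += rank_sum_gst_pvalue(treat, ctrl) < 0.05

print(f"Simulated power of the rank-sum GST: {rejections / n_sim:.2f}")
```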
5.2 Interpreting the GST

The interpretation of a GST depends on the variables included in the GST and the type of GST used. When it can be computed, the GTE measures a treatment's overall benefit across multiple outcomes. For those GSTs without a GTE defined, the interpretation of the GST is more difficult. Estimation of the GTE and its confidence interval may often provide more information about the treatment effect than the binary decision regarding whether the null hypothesis is rejected. Framing the hypothesis in terms of the GTE is helpful in sample size estimation and the design of clinical trials. For normal outcomes, if a null hypothesis is rejected by the OLS-based GST or by the GLS-based GST, then the common mean difference between the two groups across all outcomes included in the GST, which is its GTE measure, is claimed to be significantly different from zero. For the LR-based GST, a rejection of the null hypothesis implies that treatment improves the variable means by the same factor, which is measured by the GTE. Because a GTE is not defined for the ALR test, it is not easy to quantify the treatment effect when the null hypothesis is rejected. What we can say about the ALR test is that treatment improves all outcomes, but the amount of improvement for each outcome is not provided. For binary outcomes, the GTE is defined through the common odds ratio of success between the two groups; a rejection of the null hypothesis by the 1-degree-of-freedom GST implies that treatment improves the odds of success on all outcomes by the same factor, whose value equals the exponential of the GTE. For both the rank-sum-type GST and the adjusted rank-sum-type GST, the interpretation of the treatment effect is similar: rejection of the null hypothesis by these tests implies that treatment group patients have a greater chance of a better overall outcome than the control group patients do. It is important to note that a GST is designed to test a treatment's global effect on multiple outcomes. It does not provide separate assessments of the treatment effect on single outcomes. If an investigator is interested in testing
the treatment effect for each of the single outcomes after the global null hypothesis is rejected, several sequential procedures are available to do so (26–29). These procedures do not allow repeated tests on the same outcome, as is done in interim analyses. Kosorok et al. (30) proposed a group sequential design in which each single outcome is tested repeatedly in the interim analyses by partitioning the alternative region into mutually exclusive rectangular subregions, each corresponding to a combination of decisions based on all of the single outcomes. The decision on each outcome can be "the treatment is better," "the treatment is worse," or "undetermined." Thus, a total of 3^K − 1 subregions exist for K outcomes. The global type I error is defined as the probability that the test statistic incorrectly rejects the global null hypothesis of no treatment effect on any of the outcomes. The type II error is defined as the probability that the test statistic incorrectly leads to a conclusion that is inconsistent with the true distribution. For each outcome, stopping boundaries are constructed by spending its type I and type II errors separately; the boundaries are then combined to form multivariate stopping boundaries. To preserve the overall type I and type II errors, the boundaries are multiplied by a constant c that is determined through simulation. The advantage of using this method in clinical trials is that when a trial is stopped at an interim analysis, investigators can determine which outcomes are improved, which are worsened, and which are unchanged without inflation of the type I error. The limitation is its intensive computation. The authors provide software for this computation at http://www.bios.unc.edu/∼kosorok/clinicaltrial/main.html.

5.3 Choosing Outcome Variables to Include in the GST

All variables included in the GST must be related to the scientific question to be addressed and should all be expected to respond to treatment. If the common dose assumption is required, then the treatment is expected to have a similar benefit on all outcomes. If several correlated variables are considered,
whether one of them, some combination of them, or all of them should be included in a GST depends on how we want to express the treatment's effect on these variables. Adding redundant variables to a GST can bias its conclusion. Simulations by Khedouri (31) have shown that inclusion of highly correlated redundant variables artificially affects the power of different GSTs, including the OLS/GLS-based GSTs, the GEE score equation derived GST, and the rank-sum-type GST.
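As a concrete illustration of the OLS-based GST discussed in Section 5.2, the following minimal sketch (Python, with invented data and a hypothetical function name) standardizes each outcome against the pooled sample, averages the standardized values within each subject, and compares the per-subject averages between groups with a two-sample t-test. This is one simple variant of the approach, not a substitute for the published procedures, and it assumes all outcomes have been oriented so that larger values indicate benefit.

```python
# Minimal sketch of an O'Brien-type OLS global test: standardize each outcome,
# average the standardized values within subject, and compare the per-subject
# averages between groups with a two-sample t-test.
import numpy as np
from scipy import stats

def ols_global_test(y_treat, y_ctrl):
    """y_treat, y_ctrl: (subjects x K) arrays; outcomes oriented so larger = better."""
    pooled = np.vstack([y_treat, y_ctrl])
    # Standardize each outcome using the pooled mean and SD
    z = (pooled - pooled.mean(axis=0)) / pooled.std(axis=0, ddof=1)
    score = z.mean(axis=1)                  # one composite score per subject
    n_t = y_treat.shape[0]
    return stats.ttest_ind(score[:n_t], score[n_t:])

# Invented data: a modest benefit on all three outcomes
rng = np.random.default_rng(0)
treated = rng.normal(0.4, 1.0, size=(30, 3))
control = rng.normal(0.0, 1.0, size=(30, 3))
print(ols_global_test(treated, control))
```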
6 OTHER METHODS
The main rationale for using a global outcome in a clinical trial is that, by providing a more comprehensive and thus more informative assessment of the patient, it also provides a more clinically meaningful primary endpoint than any single variable by itself. Two other approaches for using multiple endpoints in the primary analysis of a clinical trial are the T2 statistic and the use of multiple comparison procedures (See Multiple Comparisons).
6.1 T2 Statistic
The T2 statistic computes the mean difference between groups for each of the endpoints and then computes the distance between these two sets of means relative to a multivariate measure of the variability in the data that accounts for the correlation structure among the individual variables. Importantly, it does not take into account whether individual group differences were in a positive or negative direction. Thus, by itself a significant T2 statistic does not indicate whether the treatment is beneficial. Furthermore, the failure to use this information about the direction of the differences results in very low power for identifying efficacy. For these reasons, it is rarely used as a primary analysis in clinical trials.
6.2 Bonferroni
Multiple comparison procedures are commonly used, especially Bonferroni adjustment of individual P-values. Specifically, the two treatment groups are compared in the usual way for each variable, and then the
several P-values are adjusted by multiplying them by the number of tests carried out. Statistical significance is claimed only if the smallest of these adjusted P-values is less than the prespecified threshold (typically .05). For example, if five primary endpoints are observed, then the smallest unadjusted P-value must be less than .01 to achieve overall significance at the .05 level. An argument sometimes made in favor of the Bonferroni approach is that interpretation is more straightforward and less ambiguous than using a global outcome, because when overall statistical significance is achieved, the endpoint that produced the smallest individual P-value can be reliably identified as an endpoint that is improved by the experimental treatment. Although this decreased ambiguity is an advantage, closer inspection of the Bonferroni approach reveals that interpretation may actually be more problematic than with a global outcome. To illustrate the problem with the Bonferroni method, imagine two primary endpoints, and suppose that the evidence for a treatment effect was identical for each variable with univariate P-values of .01, except that the direction of the treatment effect favored the experimental treatment for one endpoint and the placebo treatment for the other. Using the Bonferroni approach, the experimental treatment is considered efficacious, P = .02. However, using the rank sum method for defining a global outcome, the mean difference between groups in rank sum scores would be 0, which appropriately reflects that no overall benefit was associated with either treatment. The problem with the Bonferroni approach is that it fails to rely on a comprehensive assessment of the patient. This failure to use important information is reflected in lower power (See Power) for the Bonferroni approach if the experimental treatment has a beneficial effect across all primary endpoints. The ability of the various tests to control the overall type I error rate (falsely declaring efficacy when none exists) at the specified level of .05 and the power of the procedures to identify treatment effects is shown in Tables 1 and 2, respectively [adapted from a simulation study (2)]. In these tables, n indicates the sample size in
Table 1. Observed Type I Error Rates for Various Tests

                                             Type I Error Rate*
Distribution   Correlation    n    K     Rank-sum    GLS     Least Squares
Normal         Equal          5    2     .045        .041    .062
                              20   2     .050        .050    .053
                              20   5     .051        .059    .045
                              50   5     .048        .058    .053
                              5    50    .041        —       .061
Normal         Unequal        20   5     .047        .047    .060
                              50   5     .053        .051    .049
                              5    50    .050        —       .053
Skewed         Equal          5    2     .043        .041    .045
                              20   2     .062        .063    .064
                              20   5     .056        .064    .063
                              50   5     .035        .038    .036
                              5    50    .052        —       .054
Outliers       Equal          5    2     .047        .021    .024
                              20   2     .063        .032    .032
                              20   5     .051        .027    .028
                              50   5     .052        .036    .037
                              5    50    .053        —       .028

*α = .05. Data from O'Brien (2).
Table 2. Comparison of Power for Various Tests

                                                                     Power*
Treatment Effect   Distribution   Correlation    n    K    Rank-sum   GLS    Least squares   Bonferroni   T2
Equal              Normal         Equal          20   5    .64        .67    .67             .52          .26
                                                 50   5    .91        .92    .92             .85          .66
                                                 5    50   .28        —      .31             .17          —
Staggered          Normal         Equal          20   5    .32        .35    .32             .34          .20
                                                 50   5    .62        .64    .64             .69          .58
                                                 5    50   .14        —      .17             .10          —
Equal              Normal         Unequal        20   5    .62        .74    .64             .51          .33
                                                 50   5    .92        .97    .93             .84          .78
                                                 5    50   .18        —      .18             .06          —
Equal              Skewed         Equal          20   5    .77        .66    .66             .57          .29
                                                 50   5    .99        .93    .93             .86          .67
                                                 5    50   .38        —      .37             .16          —
Equal              Outliers       Equal          20   5    .23        .07    .08             .02          .05
                                                 50   5    .42        .08    .08             .02          .05
                                                 5    50   .11        —      .06             .01          —

*α = .05. Data from O'Brien (2).
each treatment group, and K represents the number of univariate primary endpoints. As expected, all procedures control the overall type I error rate when sampling from a normal distribution, but only the rank sum test does so reliably for non-normal distributions (Table 1). Simulations to compare power (Table 2) assumed that the treatment improved all endpoints equally (the case for which the rank sum, least squares, and GLS tests should perform well) except for a "staggered effect size," in which the effects were 1/K, 2/K, . . . , K/K for endpoints 1, 2, . . . , K, respectively. As expected, the Bonferroni method had lower power in the equal effects situation, but it can provide greater power when effects vary among the endpoints, depending on the variability, the number of endpoints, and the sample size. The T2 test has uniformly low power.
6.3 Composite Endpoints
Lefkopoulou and Ryan (19) compared two different methods to assess a treatment's global effect on multiple binary outcomes. One is the test of composite outcomes (referred to as a combined test) that collapses multiple outcomes into a single binary outcome: a success is observed if success on any one of the multiple binary outcomes is observed (See Composite Endpoints). The other is the 1-degree-of-freedom GST that assumes a common odds ratio. The GST can be constructed using GEE (See Generalized Estimating Equations). When the response proportions are small (less than 50%), both the composite (combined) test and the GST are approximately equally efficient. However, as the response proportions increase, the GST becomes much more efficient than the composite (combined) test (19). When a combined composite endpoint is used (any one of a set of outcomes, or any two, etc.), information from all outcomes is not taken into account. Someone with a stroke and a myocardial infarction (MI) would be rated the same as someone having only an MI, and an increased risk of stroke, for example, could be missed. See Composite Outcomes (8) for a discussion of the limitations of the composite outcome and the ACR criteria used in the study of rheumatoid arthritis. However, computation of composite outcomes generally does not
require a common dose assumption. Also, for outcomes such as the ACR criteria, individual subjects are classified as successes or failures, which is not generally the case for GSTs.
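To make the contrast concrete, the short sketch below (invented data and hypothetical variable names) collapses several binary outcomes into an "any event" composite and shows the per-subject event counts that the collapsed endpoint discards; it is an illustration of the information-loss point above, not an implementation of any particular published test.

```python
# Sketch of a collapsed composite endpoint for multiple binary outcomes.
# A subject with both a stroke and an MI contributes exactly the same as
# one with an MI alone, which is the information loss discussed above.
import numpy as np

# rows = subjects, columns = (stroke, MI, death); 1 = event occurred (invented)
events = np.array([
    [1, 1, 0],   # stroke and MI
    [0, 1, 0],   # MI only
    [0, 0, 0],   # no event
    [1, 0, 0],   # stroke only
])

composite = events.max(axis=1)       # "any event" composite: 1, 1, 0, 1
n_events = events.sum(axis=1)        # per-subject event count: 2, 1, 0, 1

print("composite endpoint:", composite)
print("events per subject:", n_events)   # distinguishes subject 0 from subject 1
```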
7 EXAMPLES OF THE APPLICATION OF GST
The NINDS t-PA Stroke Group performed two randomized, double-blind, placebo-controlled trials for patients with acute ischemic stroke (1). In both trials, treatment success was defined as a "consistent and persuasive difference" in the proportion of patients who achieved favorable outcomes on the Barthel Index, Modified Rankin Scale, Glasgow Outcome Scale, and National Institutes of Health Stroke Scale. Four primary outcomes were used because a positive result from any single outcome was not believed to provide sufficient evidence of treatment efficacy. The Trial Coordinating Center proposed the use of a GST. All four outcomes were dichotomized to represent an outcome of normal or mild disability versus more severe disability or death. Trial investigators believed the assumption of a common dose effect to be valid for all four binary outcome measures. Table 3 gives the odds ratios for individual tests of each outcome and the 95% confidence intervals computed using a Mantel-Haenszel approach. The GST was computed using the method of Lefkopoulou and Ryan for binary outcomes (19). The GTE (odds ratio) for the GST is 1.73. No single outcome would be considered significant based on the Bonferroni adjustment. However, because the GST indicated an overall favorable outcome (P = 0.008) and was considered a priori to provide weak protection of alpha, tests of the individual outcomes were conducted at the 0.05 level. Using these criteria, all individual outcomes indicated a benefit of treatment (see Table 3). Tilley et al. (8) used GSTs to analyze clinical trials in rheumatoid arthritis. O'Brien (2) used a GST in the analysis of a randomized trial comparing two therapies, experimental and conventional, for the treatment of diabetes. The objective of the study was to determine whether the experimental therapy resulted in better nerve function, as measured by 34 electromyographic (EMG) variables, than the
Table 3. NINDS t-PA Stroke Trial, Part II Data*

                       Proportion With Favorable Outcome (%)
Outcome            t-PA (n = 168)   Placebo (n = 165)   OR     95% CI      P
Barthel            50               38                  1.63   1.06–2.49   .026
Modified Rankin    39               26                  1.68   1.09–2.59   .019
Glasgow            44               32                  1.64   1.06–2.53   .025
NIH Stroke Scale   31               20                  1.72   1.05–2.84   .033
GST*               —                —                   1.73   1.16–2.60   .008

* Global statistical test for binary outcomes developed by Lefkopoulou and Ryan (19). Data from the NINDS t-PA Stroke Trial (1).
standard therapy. Six subjects were randomized to standard therapy, and five subjects were randomized to experimental therapy. Despite the small sample size, the objective of the study was viewed primarily as confirmatory rather than exploratory. The medical question at issue was controversial, which required an overall quantitative and objective probability statement. There were six EMG variables for which the difference between groups was statistically significant (P < .05); the two smallest P-values were P = .002 and P = .015. The treated group did better on 28 of 34 variables, with P > .50 as a criterion; this result indicated the type of main effect for which the authors had hoped. Although these results seem to support the hypothesis of a beneficial effect associated with the experimental group, an overall test was needed. A T2 test was not appropriate. Application of a multiple-comparison-type per-experiment error rate is also meaningless here because of the small sample size relative to the large number of variables. When applied to all the data, the GST yielded a P-value of .033 (2).
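The following minimal sketch, with simulated data rather than the EMG measurements from the trial above, illustrates a rank-sum-type global test in the spirit of O'Brien (2): each outcome is ranked across all subjects, the ranks are summed within subject, and the per-subject rank sums are compared between groups. The function name and data are hypothetical, and the t-test on rank sums is only one common choice for the final comparison.

```python
# Minimal sketch of a rank-sum-type global test: rank each outcome over all
# subjects (larger = better), sum the K ranks within subject, and compare the
# per-subject rank sums between groups.
import numpy as np
from scipy import stats

def rank_sum_global_test(y_treat, y_ctrl):
    """y_treat, y_ctrl: (subjects x K) arrays, outcomes oriented so larger = better."""
    pooled = np.vstack([y_treat, y_ctrl])
    ranks = np.apply_along_axis(stats.rankdata, 0, pooled)   # rank each outcome column
    rank_sum = ranks.sum(axis=1)                             # per-subject rank sum
    n_t = y_treat.shape[0]
    return stats.ttest_ind(rank_sum[:n_t], rank_sum[n_t:])

# Toy illustration with a small sample and many outcomes (invented data)
rng = np.random.default_rng(1)
treated = rng.normal(0.8, 1.0, size=(5, 10))
control = rng.normal(0.0, 1.0, size=(6, 10))
print(rank_sum_global_test(treated, control))
```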
8 CONCLUSIONS
Global statistical tests and associated GTEs summarize a treatment’s global benefit without the need of performing multiple univariate tests for individual outcomes. In contrast to composite tests, some GSTs take the correlation among outcomes into account. When a treatment shows benefit on each of the single outcomes, the GST often has a higher power than univariate tests because of increased information included in the GST. When a
treatment shows strong benefit on some outcomes and harm on other outcomes, the GST generally loses power compared with the Bonferroni adjustment. This loss of power can also be viewed as an advantage of the GST, because investigators will be reluctant to accept a treatment if it shows a strong harmful effect on equally important outcomes. The GST approach to analyzing multiple primary outcomes is an emerging area in both statistical research and clinical applications. A variety of GSTs is available to address diverse situations. We believe that GSTs will play a significant role in medical research and future scientific findings.
REFERENCES
1. The National Institute of Neurological Disorders and Stroke rt-PA Stroke Study Group, Tissue plasminogen activator for acute ischemic stroke. N. Engl. J. Med. 1995; 333: 1581–1587. 2. P. C. O'Brien, Procedures for comparing samples with multiple endpoints. Biometrics 1984; 40: 1079–1087. 3. S. J. Pocock, N. L. Geller, and A. A. Tsiatis, The analysis of multiple endpoints in clinical trials. Biometrics 1987; 43: 487–498. 4. L. Hothorn, Multiple comparisons in long-term toxicity studies. Environmental Health Perspectives 1994; 102(suppl 1): 33–38. 5. K. D. Kaufman, E. A. Olsen, D. Whiting, R. Savin, R. DeVillez, W. Bergfeld, et al., Finasteride in the treatment of men with androgenetic alopecia. Finasteride Male Pattern Hair Loss Study Group. J. Am. Acad. Dermatol. 1998; 39: 578–589. 6. D. K. Li, G. J. Zhao, and D. W. Paty, Randomized controlled trial of interferon-beta-1a in secondary progressive MS: MRI results. Neurology 2001; 56: 1505–1513.
7. R. S. Shames, D. C. Heilbron, S. L. Janson, J. L. Kishiyama, D. S. Au, and D. C. Adelman, Clinical differences among women with and without self-reported perimenstrual asthma. Ann. Allergy Asthma Immunol. 1998; 81: 65–72. 8. B. C. Tilley, S. R. Pillemer, S. P. Heyse, S. Li, D. O. Clegg, and G. S. Alarcon, Global statistical tests for comparing multiple outcomes in rheumatoid arthritis trials. MIRA Trial Group. Arthritis Rheum. 1999; 42: 1879–1888. 9. P. Huang, R. F. Woolson, and P. C. O'Brien, A rank-based sample size method for multiple outcomes in clinical trials. Stat. Med. In press. 10. P. Huang, B. C. Tilley, R. F. Woolson, and S. Lipsitz, Adjusting O'Brien's test to control type I error for the generalized nonparametric Behrens-Fisher problem. Biometrics 2005; 61: 532–539. 11. L. Acion, J. J. Peterson, S. Temple, and S. Arndt, Probabilistic index: an intuitive nonparametric approach to measuring the size of treatment effects. Stats. Med. 2006; 25: 591–602. 12. D. I. Tang, N. L. Geller, and S. J. Pocock, On the design and analysis of randomized clinical trials with multiple endpoints. Biometrics 1993; 49: 23–30. 13. D. Follmann, Multivariate tests for multiple endpoints in clinical trials. Stats. Med. 1995; 14: 1163–1175. 14. D. Tang, C. Gnecco, and N. Geller, Design of group sequential clinical trials with multiple endpoints. J. Am. Stat. Assoc. 1989; 84: 776–779. 15. D. Tang, C. Gnecco, and N. Geller, An approximate likelihood ratio test for a normal mean vector with nonnegative components with application to clinical trials. Biometrika 1989; 76: 577–583. 16. X. Lin, L. Ryan, M. Sammel, D. Zhang, C. Padungtod, and X. Xu, A scaled linear mixed model for multiple outcomes. Biometrics 2000; 56: 593–601. 17. D. Follmann, A simple multivariate test for one-sided alternatives. J. Am. Stat. Assoc. 1996; 91: 854–861. 18. M. Lefkopoulou, D. Moore, and L. Ryan, The analysis of multiple correlated binary outcomes: application to rodent teratology experiments. J. Am. Stat. Assoc. 1989; 84: 810–815. 19. M. Lefkopoulou and L. Ryan, Global tests for multiple binary outcomes. Biometrics 1993; 49: 975–988. 20. M. Lu and B. C. Tilley, Use of odds ratio or relative risk to measure a treatment effect in
clinical trials with multiple correlated binary outcomes: data from the NINDS t-PA Stroke Trial. Stats. Med. 2001; 20: 1891–1901. 21. J. Legler, M. Lefkopoulou, and L. Ryan, Efficiency and power of tests for multiple binary outcomes. J. Am. Stat. Assoc. 1995; 90: 680–693. 22. D. A. Bloch, T. L. Lai, and P. Tubert-Bitter, One-sided tests in clinical trials with multiple endpoints. Biometrics 2001; 57: 1039–1047. 23. C. Jennison and B. Turnbull, Group Sequential tests for bivariate response: interim analysis of clinical trials with both efficacy and safety endpoints. Biometrics 1993; 49: 741–752. 24. C. Jennison and B. Turnbull, Group Sequential Methods with Applications to Clinical Trials. 2000. CRC Press Inc., Boca Raton, FL. 25. J. Rochon, Application of GEE procedures for sample size calculations in repeated measures experiments. Stats. Med. 1998; 17: 1643–1658. 26. R. Falk, Hommel’s Bonferroni-type inequality for unequally spaced levels. Biometrika 1989; 76: 189–191. 27. S. Holm, A simple sequentially rejective multiple test procedure. Scand. J. Stats. 1979; 6: 65–70. 28. W. Lehmacher, G. Wassmer, and P. Reitmeir, Procedures for two-sample comparisons with multiple endpoints controlling the experimentwise error rate. Biometrics 1991; 47: 511–521. 29. R. Marcus, E. Peritz, and K. R. Gabriel On closed testing procedures with special references to ordered analysis of variance. Biometrika 1976; 63: 655–660. 30. M. R. Kosorok, S. Yuanjun, and D. L. DeMets, Design and analysis of group sequential clinical trials with multiple primary endpoints. Biometrics 2004; 60: 134–145. 31. C. Khedouri, Correcting Distortion in Global Statistical Tests With Application to Psychiatric Rehabilitation. Charleston, SC: Medical University of South Carolina, 2004. 32. T. Karrison and P. O’Brien, A rank-sum-type test for paired data with multiple endpoints. J. Appl. Stat. 2004; 31: 229–238. 33. N. Freemantle, M. Calvert, J. Wood, J. Eastaugh, and C. Griffin, Composite outcomes in randomized trials: greater precision but with greater uncertainty? JAMA 2003; 289: 2554–2559.
CROSS-REFERENCES
Type I Error
Mixed Effect Models
Multiple Comparisons
Composite Variables (Endpoints or Indices)
Generalized Estimating Equations
Composite Outcomes
GOLD STANDARD
MARILYN A. AGIN
Pfizer Global Research and Development, Ann Arbor, Michigan
In clinical trials, the concept of a gold standard is most often applied to diagnostic tests or methods of measurement that are considered to be the best in some general sense among all other tests or methods available. The gold standard test or measurement may be error-free or have an acceptably small error rate. New tests or measurements are compared with the gold standard. Closely related are the agreement of two or more types of tests or measurements when neither is a gold standard, and the reproducibility of a given test result or measurement. This article provides an overview of gold standard tests and measurements as used in clinical trials.
1 THE GOLD STANDARD
In economics, the gold standard refers to a monetary standard under which the currency of a country is equal in value to and redeemable by a certain quantity of gold (1). Currencies of different countries can thus be compared with each other by assessing their relative value in gold. The United States was on the gold standard from time to time until 1971 (2). Today, informal usage of the term ‘‘gold standard’’ has expanded beyond economics to denote a practice or principle that is a model of excellence with which things of the same class can be compared. In clinical trials, the concept of a gold standard is most often applied to diagnostic tests or methods of measurement that are considered to be the best in some general sense among all other tests or methods available. New tests or measurements are compared with the gold standard. Closely related are the agreement of two or more types of tests or measurements when neither is a gold standard, and the reproducibility of a given test result or measurement. This article provides an overview of gold standard tests and measurements as used in clinical trials.
2 DIAGNOSTIC TESTS
Many diagnostic tests have a binary response. An error-free, binary diagnostic test differentiates perfectly between patients who have a disease and those who do not, and it is referred to as a "perfect" gold standard test. Perfect gold standard tests are usually not available or not feasible because of technical, economic, or implementation constraints. Instead, investigators use tests or procedures with acceptably low error rates. An example is the use of clinical signs and symptoms, enzyme assays, and electrocardiograms instead of direct examination of the tissue of the heart to determine whether a patient has had a myocardial infarction. If a test is the best available, even though it may be imperfect and subject to error, it is called a gold standard test and is used as the reference with which other tests are compared. When a diagnostic test is not perfect, other characteristics of the test must be considered before declaring it a gold standard. Sensitivity is the conditional probability that the test will be positive given that the subject has the disease. Specificity is the conditional probability that the test will be negative given that the subject does not have the disease (3, p. 30). Very high values for both sensitivity and specificity are desirable if the test is to be considered useful. A test that has high sensitivity will be good at detecting a disease if it is present, and a test with high specificity will be good at ruling out a disease if it is not. For example, Jobst et al. (4) examined various tests used to diagnose Alzheimer's disease. The study population consisted of normal elderly subjects as well as of subjects with non-Alzheimer's disease dementias or Alzheimer's disease that had been histopathologically confirmed (the gold standard test). A clinical diagnosis of probable Alzheimer's disease according to the National Institute of Neurological and Communicative Disorders and Stroke–Alzheimer's Disease and Related Disorders Association (NINCDS-ADRDA) had only 49% sensitivity in the Alzheimer's subjects but 100% specificity in the subjects who did not have Alzheimer's disease.
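As a numerical aside, the sketch below computes sensitivity and specificity from a hypothetical 2 × 2 table and, under an assumed prevalence, the corresponding positive and negative predictive values via Bayes' theorem (taken up in the next paragraph). All counts, the prevalence value, and the function name are invented for illustration and do not come from any study.

```python
# Sensitivity, specificity, and (under an assumed prevalence) predictive values
# for a binary diagnostic test. All numbers are hypothetical.

def test_characteristics(tp, fn, fp, tn, prevalence):
    sensitivity = tp / (tp + fn)      # P(test positive | disease present)
    specificity = tn / (tn + fp)      # P(test negative | disease absent)
    # Bayes' theorem gives the predictive values at the assumed prevalence
    ppv = sensitivity * prevalence / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv = specificity * (1 - prevalence) / (
        specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)
    return sensitivity, specificity, ppv, npv

# Hypothetical counts: 100 diseased subjects (49 test positive) and
# 100 non-diseased subjects (95 test negative), assumed prevalence 10%
print(test_characteristics(tp=49, fn=51, fp=5, tn=95, prevalence=0.10))
```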
What a clinician usually wants to know, however, is the probability that a patient has a disease given that a test is positive or the probability that a patient does not have a disease given that a test is negative. If the proportion of patients in the population who have the disease is known, then Bayes’ theorem can be used to estimate these conditional probabilities retrospectively (3, p. 12). Most often, this proportion is not known and must be approximated. Other important aspects of a diagnostic test are its error rates. A false-positive error means that the test result was positive but the subject does not have the disease. A falsenegative error means that the test result was negative but the subject does have the disease. The error rate of the reference test is sometimes overlooked in evaluating the new test, and this can lead to bias in estimating the errors in the new test. When declaring an imperfect test to be a gold standard, the purpose for which the test is administered should be considered along with the sensitivity, specificity, and error rates. Sometimes two tests are available, and one or neither is considered a gold standard. When comparing a new test with a gold standard, the objective is to be able to declare the new test almost as good as (or no worse than) the gold standard. The null hypothesis should state that the two tests are different. The alternative hypothesis should state that the new test is as good as (equivalence) or no worse than (noninferiority) the gold standard since this is what one wishes to conclude. If the two statistical hypotheses had been reversed, then no claim could have been made about the new test being as good as or no worse than the gold standard. Not being able to reject the null hypothesis is not evidence that it is true, i.e., an ‘‘absence of evidence is not evidence of absence’’ (5). If neither test is a gold standard, then the alternative hypothesis should state that the two tests are equivalent or similar. A practical difficulty encountered when testing for equivalence or noninferiority is quantifying how similar the tests must be in order to claim they are not different. Chen et al. (6) propose a method to test the hypothesis of noninferiority with respect to sensitivity and specificity when
the response is binary. The new test is compared with a reference test and with a perfect gold standard test under retrospective and prospective sampling plans. A latent continuous or categorical variable is one that cannot be observed directly (7). In clinical trials, when a perfect gold standard test does not exist, then the true disease state can be thought of as a latent variable. For example, postmortem examination of tissue may be the only way to detect a certain type of pneumonia or a recurrence of cancer so the true disease state is latent as long as the patient is alive. If two imperfect diagnostic tests are used, a common statistical assumption is that the results of the tests are conditionally independent given the latent variable. 3 MEASUREMENT METHODS As with diagnostic tests, clinical measurements often cannot be made directly so indirect methods are used. For example, the duration of ventricular depolarization and repolarization of the heart is evaluated indirectly by measuring the QT interval on an electrocardiogram (ECG). Two laboratories using different methods of measurement on the same ECG (8), a technician reading the same ECG twice (9), or two different computer algorithms applied to the same ECG (10) could produce two different values for the QT interval. Since the true value of the QT interval is unknown, no method provides an unequivocally correct measurement (perfect gold standard) and the degree to which the measurements agree must be assessed. Since the act of measuring almost always involves some inherent measurement error or subjective judgment, a gold standard method of measurement is usually not perfect but is widely accepted and the reference to which other methods are compared. For example, an established assay for determining the plasma concentration of a drug is usually the reference for a new assay. The manner in which two or more sets of measurements agree or disagree is important since tests of agreement are usually specific to certain types of disagreement. For example, Pearson’s correlation coefficient
indicates only the linear relationship between the two sets of measurements. The correlation could be very strong, but one measurement could be consistently longer than another, or the measurements could follow a nonlinear relationship very closely but with little correlation. A paired t-test will detect only whether the means of the two measurement groups are different and will not provide any information about the marginal distributions of each measurement. Regression analysis, Lin's (11) concordance correlation coefficient, and the intraclass correlation coefficient (12) address other types of agreement and reproducibility. Altman and Bland (13) and Bland and Altman (14,15) provide a clear explanation of the problem of comparing two methods of measurement and the statistical principles of a graphical method for assessing agreement. The differences between measurements on the same observational unit are plotted against their averages. The within- and between-unit variability and the dependence of the difference between measurements on the mean are examined. Bartko (16) expands on Altman and Bland's analyses by suggesting a bivariate confidence ellipse to amplify dispersion. St. Laurent (17) and Harris et al. (18) propose models and estimators for assessing agreement of one or more approximate methods of measurement with an accepted gold standard. Agreement is assessed by a ratio of variances equal to the square of the correlation between the approximate and the gold standard measurements. This correlation depends on the variability in the gold standard measurements and is identical in form to the intraclass correlation coefficient used when comparing two approximate methods. An acceptable degree of agreement depends on the error in the approximate method and can be defined for the specific clinical application under investigation.
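The sketch below reduces the graphical approach attributed above to Altman and Bland to its numerical core, the mean difference between two methods with approximate 95% limits of agreement, and also computes Lin's concordance correlation coefficient from its standard formula. The data and names are invented, and the code is only an illustrative reading of these methods, not the authors' implementations.

```python
# Mean difference and approximate 95% limits of agreement between two
# measurement methods, plus Lin's concordance correlation coefficient.
import numpy as np

def limits_of_agreement(x, y):
    d = x - y
    bias = d.mean()
    sd = d.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

def concordance_ccc(x, y):
    # Lin's CCC: 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2)
    s_xy = np.cov(x, y)[0, 1]
    return 2 * s_xy / (x.var(ddof=1) + y.var(ddof=1) + (x.mean() - y.mean()) ** 2)

# Invented QT interval readings (ms) from two hypothetical measurement methods
rng = np.random.default_rng(2)
method_a = rng.normal(400.0, 20.0, size=50)
method_b = method_a + rng.normal(5.0, 8.0, size=50)   # small systematic shift plus noise

print(limits_of_agreement(method_a, method_b))
print(concordance_ccc(method_a, method_b))
```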
4 CONCLUSION
When assessing diagnostic tests or methods of measurement, the first step should be to plot the data from each test or method against the others. Understanding how the tests or
sets of measurements agree or differ and the potential for incorrect conclusions can help to determine how to improve a test or measurement, assess the results statistically, and provide evidence for replacing a former imperfect or accepted gold standard test or method of measurement with a new one. REFERENCES 1. The American Heritage College Dictionary, 3rd ed. New York: Houghton Mifflin Company, 1997, p. 585. 2. World Book Encyclopedia Online. 2004. World Book, Inc. 3. A. Agresti, Categorical Data Analysis. New York: Wiley, 1990. 4. K. A. Jobst, L. P. D. Barnetson, and B. J. Shepstone, Accurate Prediction of Histologically Confirmed Alzheimer’s Disease and the Differential Diagnosis of Dementia: The Use of NINCDS-ADRDA and DSM-III-R Criteria, SPECT, X-Ray CT, and Apo E4 in Medial Temporal Lobe Dementias. International Psychogeriatrics 1998; 10: 271-302. Published online by Cambridge University Press 10Jan2005. 5. D. G. Altman and J. M. Bland, Absence of evidence is not evidence of absence. BMJ 1995; 311: 485. 6. J. J. Chen, H. Hsueh, and J. Li, Simultaneous non-inferiority test of sensitivity and specificity for two diagnostic procedures in the presence of a gold standard. Biomet. J. 2003; 45: 47-60. 7. P. F. Lazarsfeld, The logical and mathematical foundation of latent structure analysis. In: S. Stouffer (ed.), Measurement and Prediction. Princeton, NJ: Princeton University Press, 1950. 8. S. Patterson, M. A. Agin, R. Anziano, T. Burgess, C. Chuang-Stein, A. Dmitrienko, G. Ferber, M. Geraldes, K. Ghosh, R. Menton, J. Natarajan, W. Offen, J. Saoud, B. Smith, R. Suresh, and N. Zariffa, Investigating druginduced QT and QTc prolongation in the clinic: A review of statistical design and analysis considerations: Report from the Pharmaceutical Research and Manufacturers of America QT Statistics Expert Team. Drug Inf J. 2005; 39: 243-264. 9. N. Sarapa, J. Morganroth, J. P. Couderc, S. F. Francom, B. Darpo, J. C. Fleishaker, J. D. McEnroe, W. T. Chen, W. Zareba, A. J. Moss, Electrocardiographic identification of
drug-induced QT prolongation: Assessment by different recording and measurement methods. Ann Noninvas. Electrocardiol. 2004; 9(1): 48-57. 10. J. L. Willems, P. Arnaud, J. H. van Bemmel, P. J. Bourdillon, C. Brohet, S. Dalla Volta, J. D. Andersen, R. Degani, B. Denis, and M. Demeester, Assessment of the performance of electrocardiographic computer programs with the use of a reference data base. Circulation 1985; 71(3): 523-534. 11. L. I. Lin, A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989; 45(1): 255-268. 12. P. E. Shrout and J. L. Fleiss, Intraclass correlations: Uses in assessing rater reliability. Psychol. Bull. 1979; 2: 420-428. 13. D. G. Altman and J. M. Bland, Measurement in medicine: The analysis of method comparison studies. Statistician 1983; 32: 307-317. 14. J. M. Bland and D. G. Altman, Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;(February): 307-310. 15. J. M. Bland and D. G. Altman, Measuring agreement in method comparison studies. Stat. Methods Med. Res. 1999; 8:135-160. 16. J. J. Bartko, Measures of agreement: a single procedure. Stat. Med. 1994; 13:737-745. 17. R. T. St. Laurent, Evaluating agreement with a gold standard in method comparison studies. Biometrics 1998; 54: 537-545. 18. I. R. Harris, B. D. Burch, and R. T. St. Laurent, A blended estimator for a measure of agreement with a gold standard. J. Agricult. Biol. Environ. Stat. 2001; 6: 326-339. 19. S. D. Walter and L. M. Irwig, Estimation of test error rates, disease prevalence and relative risk from misclassified data: A review. Clin. Epidemiol. 1988; 41: 923-937. 20. S. L. Hui and X. H. Zhou, Evaluation of diagnostic tests without gold standards. Stat. Methods Med. Res. 1998; 7: 354-370. 21. L. Joseph, Reply to comment on ‘‘Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard’’. Am. J. Epidemiol. 1997; 145: 291-291.
22. N. Dendukuri and L. Joseph, Bayesian approaches to modeling the conditional dependence between multiple diagnostic tests. Biometrics 2001; 57: 208-217. 23. N. Dendukuri, E. Rahme, P. Bélisle, and L. Joseph, Bayesian sample size determination for prevalence and diagnostic test studies in the absence of a gold standard test. Biometrics 2004; 60: 388-397. 24. J. L. Fleiss, The Design and Analysis of Clinical Experiments. New York: Wiley, 1999.
FURTHER READING
Walter and Irwig (19) review publications on error rates of diagnostic tests. Hui and Zhou (20) review statistical methods for estimating sensitivity and specificity of diagnostic tests when one or more new diagnostic tests are evaluated and no perfect gold standard test exists, with or without an imperfect gold standard test as a reference. They also discuss models that relax the assumption of conditional independence between tests given the true disease status. Joseph (21), Dendukuri and Joseph (22), and Dendukuri et al. (23) use a Bayesian approach to account for the potential dependence between the tests instead of assuming conditional independence. Dallal's website http://www.tufts.edu/∼gdallal/compare.html provides an excellent introduction to the problem of comparing two measurement devices. Fleiss (24, Chapter 1) provides a comprehensive overview of the reliability of quantitative measurements.
CROSS-REFERENCES
False Positive – False Negative
Inter-rater reliability
Intra-rater reliability
Intraclass Correlation Coefficient
Sensitivity
Specificity
Type I Error
Type II Error
GOOD CLINICAL PRACTICE (GCP)
DAVID B. BARR
Kendle International, Cincinnati, Ohio
The ultimate goal of most clinical investigations is to obtain marketing approval of the product. In the case of a drug product, the Food and Drug Administration (FDA) requires the submission of a New Drug Application (NDA) that consists of a compilation of all clinical and other test data from the clinical studies plus extensive information on the development, manufacture, and testing of the drug product. This document is often massive, running in excess of 100 volumes. The FDA reviews this information. The review often includes meetings with the sponsor and may include a request for additional testing by the sponsor. Approval is for a specific drug product: the chemical entity and formulation, the patient population in which it is approved for use, and the indications or disease state for which it is approved. For a sponsor to add other indications or patient populations to the drug labeling usually requires additional clinical studies to support these additional uses.
1 HUMAN RIGHTS AND PROTECTIONS
The need to protect human subjects during clinical studies is paramount, and the need for these protections is, in large part, based on abuses that had occurred in many countries including the medical experiments conducted by the Nazis and certain studies conducted in other countries without the full knowledge of the subjects. The rights of the subjects in a clinical trial are based on the principles stated in Declaration of Helsinki in 1964. Good clinical investigations must be conducted in accordance with these stated ethical principles. The rights, safety, and well-being of the trial subjects are paramount. Clinical research on human subjects must conform to generally accepted scientific principles and should be
based on adequately performed laboratory and animal experimentation as well as on a thorough knowledge of the scientific literature. The right of the research subject to safeguard his/her integrity must always be respected. Physicians should abstain from engaging in research projects that involve human subjects unless they are satisfied that the hazards involved are predictable and should cease any investigations if the hazards are found to outweigh the potential benefits. Clinical tests in humans are performed under many regulatory controls. Currently, these controls include the regulations promulgated under the Food Drug and Cosmetic Act, European Union rules, and guidelines issued by the International Conference on Harmonization (ICH) (1). These regulations and guidances are collectively termed "Good Clinical Practices." The testing of medical products, which include pharmaceuticals, biologicals, and medical devices, on humans is regulated by Good Clinical Practice standards (i.e., regulations and guidances). These standards are applied to the conduct of these tests or clinical investigations. International regulatory bodies require that these standards be met for the medical products to be approved for use. The experimental protocol must be reviewed by an independent body [Independent Ethics Committee (IEC) or Institutional Review Board (IRB)] constituted of medical professionals and nonmedical members, whose responsibility is to ensure the protection of the rights, safety, and well-being of human subjects involved in a clinical investigation. This independent body provides continuing review of the trial protocol and amendments and of all materials to be used in obtaining and documenting the informed consent of the trial subjects.
2 INFORMED CONSENT
Subjects must volunteer to enter a clinical investigation and must be fully informed of their rights and a description of the study that includes risks and benefits. The investigator is required to provide each subject with
a written consent form approved by the IRB, which includes the following basic elements: a statement that the study involves research, the expected duration of the subject’s participation, a description of the procedures to be followed, identification of any procedures that are experimental, a description of any reasonably foreseeable risks or discomforts, a disclosure of any alternative procedures or courses of treatment that might be advantageous to the subject, an explanation of who to contact, and a statement that participation is voluntary. This consent form must be given to the subject, explained, and signed by the subject or his/her representative [21 CFR 50.25 -50.27 (2)]. Informed consent must be signed prior to initiation of the subject into the study. 3
INVESTIGATIONAL PROTOCOL
The clinical investigation must be described in a written protocol, which includes: a summary of information from both nonclinical studies that potentially have clinical significance and from clinical trials that have been conducted; a summary of the known and potential risks and benefits; information about the investigational product(s), route of administration, dosage, dosage regimen, and treatment periods; the population to be studied; references to literature and data that are relevant and provide background material for the trial; a description of the trial design, a statement of primary endpoints and any secondary endpoints, measures to be taken to minimize bias, discontinuation criteria, investigational product accountability procedures, identification of data to be recorded directly on the Case Report Form (CRF); criteria for the selection and withdrawal of subjects; description of subject treatment including the name(s) of all products, dosing schedule, route/mode of administration, treatment period(s) including follow-up period(s); permitted and prohibited medications, procedures for monitoring subject compliance; methods of assessments of efficacy and safety; statistical methods; quality control and quality assurance procedures; data handling and record-keeping requirements; and any supplements to the protocol.
4 INVESTIGATOR'S BROCHURE
The Investigator's Brochure (IB) is a compilation of the relevant clinical and nonclinical data on the investigational product(s). The IB must be provided to IRBs and IECs for their evaluation of whether the state of knowledge of the drug justifies the risk/benefit balance for the clinical trial. The IB is given to each investigator to provide the clinical team with an understanding of the study details and to facilitate their compliance with the protocol.
5 INVESTIGATIONAL NEW DRUG APPLICATION
FDA requires the submission of an Investigational New Drug Application (IND) if a sponsor intends to conduct most clinical investigations (some exemptions apply; see Reference 2). The IND is required to contain information on the sponsor and the investigational drug. The IND also contains brief descriptions of the drug, any relevant previous human experience with the drug, the overall investigational plan, nonclinical studies and safety information, protocol(s), and the IB. INDs filed for Phase 1 studies are relatively brief, at about one or two volumes, whereas those intended for Phases 2 and 3 typically contain detailed protocols that describe all aspects of the study. The IND must be approved before any human trial is initiated. Many other countries require similar applications.
6 PRODUCTION OF THE INVESTIGATIONAL DRUG
Investigational products should be manufactured, handled, and stored in accordance with applicable current Good Manufacturing Practices (cGMPs) and used in accord with the protocol. It is important that all product formulation, manufacturing process, specification, and test method development activities for pharmaceutical ingredients and products that support submissions of the IND and marketing applications be properly conducted and documented. Such information
may be requested for examination by FDA during preapproval inspections. All development activities should be performed applying the principles of Good Scientific Practices (GSPs) to ensure accuracy and reproducibility of data and other information. Examples of GSPs include accurate and complete documentation of all research and development activities; use of properly maintained and calibrated equipment; use of reagents, standards, and solutions that meet compendia standards or internal specifications; use of standard operating procedures, protocols, or guidelines; and performance of activities by appropriately qualified and trained personnel. At each stage of development, critical parameters should be defined for manufacturing processes (e.g., temperature, time) and product specifications (e.g., pH, viscosity). Test results and other data collected on product quality characteristics should be obtained using credible analytical methods. Critical parameters from production processes for product stability should be performed according to written procedures or prospective protocols typically with approval by the company’s quality unit. Changes in the product, process, specifications, and test methods should be controlled and documented. All data/information, whether favorable or unfavorable, should be recorded and included in development history documents. 7
CLINICAL TESTING
Clinical testing is performed according to the written protocols and procedures. A well-designed and well-executed clinical investigation consists of many individual tests, stages, and subjects. The trial must be in compliance with the approved protocol and all pertinent regulations. Trials must be scientifically sound and described in a clear, detailed experimental protocol. Biomedical research that involves human subjects should be conducted only by scientifically qualified persons and under the supervision of a clinically competent medical person. In 1962, the Kefauver-Harris amendments to the Food Drug and Cosmetic Act (FD&C Act) increased FDA's regulatory authority
over the clinical testing of new drugs. These amendments required that new drugs be approved by the FDA and that the application include "full reports of investigations which have been made to show whether or not such drug is safe for use and whether such drug is effective in use . . . " [FD&C Act section 505 (3)]. Once compounds have been identified as potential drugs, preclinical testing begins. Clinical testing of new drugs and biologicals is conducted only after extensive preclinical animal testing to determine the safety of the potential drug. Currently, typical clinical testing consists of several phases, tests, and human subjects. Many tests may be required for these experimental products. For instance, in pharmaceuticals, several tests are needed to permit the development of estimates of the safe starting dose for use in clinical investigations. Tests are generally conducted for: single and repeated dose toxicity studies, reproductive toxicity, carcinogenic potential, genotoxicity, local tolerance, and pharmacokinetics. The ICH and several regulatory bodies have published guidelines on the conduct of these studies. Additionally, regulatory requirements control many of these studies, for example, FDA's Good Laboratory Practice regulations [GLPs at 21 CFR 58 (4)]. Once animal safety studies are conducted and show relative safety, the experimental medical product may then begin testing in humans. An experimental drug requires a sponsor. The FDA defines a sponsor in the code of federal regulations [21 CFR 312.3 (5)] as "a person who takes responsibility for and initiates a clinical investigation. The sponsor may be an individual or a pharmaceutical company, governmental agency, academic institution, private organization, or other organization." Before testing may be conducted on humans, an Investigational New Drug (IND) or Investigational Device Exemption (IDE) application needs to be approved by regulatory bodies such as the FDA. Preapproval clinical testing is generally performed in three phases; each successive phase involves more subjects. These phases are conducted in accord with regulations [21 CFR 312.21 (5)], not by requirements of law.
Some clinical testing may also be performed post-approval. Phase One studies are closely controlled and monitored studies performed primarily to assess the safety of the drug in a few patients (typically 20–80) or healthy subjects. These studies are designed to determine the actions of the drug in the human body: absorption, metabolism, and pharmacological actions, including degradation and excretion of the drug, side effects associated with increasing doses, and, if possible, early evidence on effectiveness. These studies may also investigate the best method of administration and the safest dosage as well as side effects that occur as dosage levels are increased. Phase Two studies are usually the initial, well-monitored, and well-controlled studies conducted to evaluate the effectiveness of the drug for a particular indication or condition. These studies generally involve a limited number of subjects, up to several hundred, who have the disease or condition under study. These studies are typically conducted to evaluate the efficacy of the drug, determine the dose range(s), determine drug interactions that may occur, confirm the safety of the drug, and compare it with similar approved drugs. These studies may or may not be blinded. Phase Three studies are typically the last studies performed prior to filing the NDA with the regulatory bodies (e.g., FDA or the European Agency for the Evaluation of Medicinal Products). These studies are expanded controlled and uncontrolled trials. They are performed after preliminary evidence that suggests effectiveness of the drug has been obtained. These studies are intended to gather the additional information about effectiveness and safety that is needed to evaluate the overall risk/benefit relationship of the drug and to provide an adequate basis for physician labeling. These trials are often multicenter studies with several thousand subjects. These studies provide the statistically significant data for the efficacy of the drug, assess the safety data, and determine the final dosage(s) and dosage forms. They are typically conducted as double-blinded studies against similar approved drugs and/or placebos.
Phase Four studies are postmarketing studies that may be conducted by the sponsor and often required by FDA concurrent with marketing approval to delineate additional information about the risks, benefits, and optimal use of the drug. These studies include those studying different doses or schedules of administration, the use of the drug in other patient populations or stages of disease, or usage over a longer time period (see Fig. 1). Good Clinical Practice standards include most processes and organizations involved in the conduct of a pharmaceutical clinical study, such as the following: • Sponsors • Monitors • Contract Research Organizations
(CROs) • Institutional Review Boards (IRBs) or
Independent Ethics Committees (IECs) • Investigators
FDA regulations that govern the clinical investigations are found in Title 21 of the Code of Federal Regulations and in Guidelines promulgated by FDA and other regulatory bodies as well as those developed and issued by ICH. The ICH is an organization whose membership includes the regulators and Industry members from the EU, Japan, and the United States. As such, the ICH guidelines are in use in all of the EU countries, Japan and the United States and are widely used as guidelines worldwide. 8 SPONSORS The sponsor of an experimental drug initiates the studies and assumes the ultimate responsibility for compliance with all legal and regulatory requirements. The sponsor is responsible for the selection of all the parties utilized in the conduct of the clinical investigation including the monitors and investigators. According to FDA ‘‘The sponsor may be an individual or a pharmaceutical company, governmental agency, academic institution, private organization, or other organization.’’ The 21 Code of Federal Regulations, part
[Figure 1. Compound success rates by stage of development: of 5,000 to 10,000 compounds screened in discovery (2 to 10 years), about 250 enter preclinical (laboratory and animal) testing, 5 enter clinical testing (Phase I: 20 to 80 volunteers, used to determine safety and dosage; Phase II: 100 to 300 volunteers, monitored for efficacy and side effects; Phase III: 1,000 to 5,000 volunteers, monitored for adverse reactions to long-term use), and 1 gains FDA approval after FDA review, followed by postmarket testing, over a timeline of roughly 16 years.]
312 (5) (drugs) and part 812 (6) (devices) contain the regulations that include the sponsors' obligations. The obligations of sponsors include the following duties:
1. Ensure that all clinical studies are conducted according to the regulations and an approved study protocol. (Both the FDA and an IRB must approve the study protocol.)
2. Obtain regulatory approval, where necessary, before studies begin.
3. Manufacture and label investigational products appropriately.
4. Initiate, withhold, or discontinue clinical investigations as required.
5. Refrain from commercialization of investigational products.
6. Select qualified investigators to conduct the studies.
7. Provide the investigators with the information they need to conduct an investigation properly (training, written guidance including the investigator's brochure).
8. Ensure proper monitoring of the clinical investigation.
9. Ensure that the clinical investigation is conducted in accordance with the general investigational plan and protocols contained in the IND.
10. Maintain a current IND with respect to the investigations.
11. Evaluate and report adverse experiences.
12. Ensure that FDA and all participating investigators are promptly informed of significant new adverse effects (SADEs) or risks with respect to the drug.
Sponsors are responsible for reviewing data from the studies and from all other sources as soon as the materials are received to ensure that the safety of the subjects is not compromised. In any case where the investigator (or investigator site) has deviated from the approved protocol or regulations, it is the responsibility of the sponsor to ensure that the site is either brought into compliance or, if compliance cannot be attained, to stop the investigation and assure the return of all clinical supplies.
9 CONTRACT RESEARCH ORGANIZATION
A Contract Research Organization (CRO) is defined by FDA [21 CFR 312.3(b) (5)] as
‘‘a person that assumes, as an independent contractor with the sponsor, one or more of the obligations of a sponsor, e.g. design of a protocol, selection or monitoring of investigations, evaluation of reports, and preparation of materials to be submitted to the Food and Drug Administration.’’ The transfers of obligations to a CRO must be described in writing. If all obligations are not transferred, then the sponsor is required to describe each of the obligations being assumed by the CRO. If all obligations are transferred, then a general statement that all obligations have been transferred is acceptable [21 CFR 312.52 (5)] 10
MONITORS
Monitors are persons who oversee the progress of a clinical investigation and ensure that it is conducted, recorded, and reported accurately. The purposes of monitoring are to verify that the rights and well-being of the subjects are protected; the reported trial data are accurate, complete, and verifiable from source documents; and the conduct of the trial complies with the currently approved protocol amendments, GCP, and all applicable regulatory requirements. It is the responsibility of the sponsor to ensure that the clinical investigations are monitored adequately. Monitors are selected by the sponsor and must be adequately trained; have the scientific and/or clinical knowledge needed to monitor the trial adequately; and be thoroughly familiar with the investigational product, protocol, informed consent form, and any other materials presented to the subjects, the sponsors Standard Operating Procedures (SOPs), GCP, and all other applicable regulatory requirements. The extent and nature of monitoring should be set by the sponsor and must be based on multiple considerations such as the objective, purpose, design, complexity, blinding, size, and endpoints of the trial. Typically, on-site monitoring occurs before, during, and after the trial. 11
INVESTIGATORS
Investigators are the persons who actually conduct the clinical investigation (i.e., under
whose immediate direction the drug is administered or dispensed to a subject). In the event a team of individuals conducts an investigation, the investigator is the responsible leader of the team. "Subinvestigator" includes any other individual member of that team [21 CFR 312.3(b) (5)]. Investigators have several responsibilities, which include the following:
1. To ensure the investigation is conducted according to all applicable regulations and the investigational plan
2. To ensure the protection of the subjects' rights, safety, and welfare
3. To ensure control of the drugs under investigation
12 DOCUMENTATION Acceptable documentation, from the sponsor, monitor, investigator, IRB, and so on, is critical for ensuring the success of clinical investigations. Sponsors and monitors must document their procedures and policies. SOPs are commonly used and must cover all aspects of their operations including Quality Assurance and monitoring. Investigators also require written procedures and CRFs. It is essential for investigators to assure the CRFs are up to date and complete and to have source documents (laboratory reports, hospital records, patient charts, etc.) that support the CRFs. All documentation in support of an NDA must be retained. FDA regulations require that for investigational drugs, all records and reports required during the studies must be retained for at least 2 years after the marketing application is approved. If the application is not approved, all applicable records should be retained for 2 years after the shipment of the study drug is discontinued and the FDA notified. The quality of the documentation is always critical. Records, including the CRFs must be legible; corrections should be cross outs with dates and initials and never obliterate the original data. The records should be controlled and retrievable for the monitors, FDA inspection, and other quality assurance
needs. Records must be credible and complete. They need to convince third parties (monitors, auditors, FDA investigators, and reviewers) of their merit. The presence of accurate and reliable data and information in an application submitted to the FDA for scientific review and approval is essential. If a submission is misleading because of the inclusion of incorrect or misleading data, the FDA may impose the Application Integrity Policy (AIP). The AIP deals with applications that contain unreliable or inaccurate data. This policy enables the agency to suspend review of an applicant's pending applications that are in the review and approval process at FDA when a pattern of data integrity problems has been found in one or more of those applications and those data integrity problems are determined to be material to the review. Where data are collected electronically, such as in the use of electronic patient diaries, the FDA regulations on Electronic Records and Electronic Signatures [21 CFR Part 11 (7)] must be followed. These rules require many controls to ensure the authenticity, integrity, and confidentiality of electronic records. The required controls include (among many others) validation of the system(s) to assure accuracy, reliability, and the ability to discern invalid or altered records.
13 CLINICAL HOLDS
A clinical hold is an order issued by the FDA to the sponsor to delay a proposed clinical investigation or to suspend an ongoing investigation. Clinical holds may apply to one or more of the investigations covered by an IND. When a proposed study is placed on clinical hold, subjects may not be given the investigational drug. When an ongoing study is placed on clinical hold, no new subjects may be recruited to the study and placed on the investigational drug; in the interest of patient safety, patients already in the study should be taken off therapy that involves the investigational drug unless specifically permitted by the FDA (2). Clinical holds are imposed when the FDA finds issues that may negatively affect patient safety, which include the following:
1. Human subjects are or would be exposed to an unreasonable and significant risk of illness or injury.
2. The investigator(s) are not qualified.
3. The investigator's brochure is misleading, erroneous, or materially incomplete.
4. The IND does not contain sufficient information to assess the risks to the subjects.

For Phase 2 and 3 studies, clinical holds may also be imposed if the plan or protocol is clearly deficient in design to meet its stated objectives.
14 INSPECTIONS/AUDITS
Audits and regulatory inspections are conducted to ensure that clinical investigations are conducted correctly, that data are accurately recorded and reported, and that all procedures and applicable rules and regulations are followed. During the course of the study, the sponsor's Clinical Quality Assurance staff generally conducts audits. These audits cover the investigator sites, data collection, and other sites that are pertinent to the clinical investigation. Any problems or issues revealed during the course of these audits typically result in corrections that enable the study to proceed without major incidents or delays. Regulatory inspections, such as those conducted by the FDA, typically take place after the sponsor files for marketing approval, such as an NDA (regulatory bodies may conduct inspections during the course of the study, although this is generally done only when an issue develops). Regulatory bodies typically inspect a sampling of clinical sites from any given marketing application. Inspections of other facilities, such as sponsors, monitors, Institutional Review Boards, and so on, are typically performed from a sampling of all such facilities in their establishment inventory. The FDA has established a Bioresearch Monitoring Program that is used to provide oversight of the conduct of clinical studies. FDA has several Compliance Programs (CPs)
that provide guidance and specific instruction for inspections of:

Investigators (CP7348.811)
Sponsors/Monitors (CP7348.810)
Institutional Review Boards (CP7348.809)
Nonclinical Laboratories (CP7348.808)

These CPs are available at www.fda.gov/ora/cpgm.

REFERENCES

1. International Conference on Harmonization (ICH-E6), Good Clinical Practice: Consolidated Guidance, 1996. Available: www.ich.org.
2. 21 Code of Federal Regulations, Part 50, Protection of Human Subjects.
3. FD&C Act, Section 505.
4. 21 Code of Federal Regulations, Part 58.
5. 21 Code of Federal Regulations, Part 312, Investigational New Drug Application.
6. 21 Code of Federal Regulations, Part 812.
7. 21 Code of Federal Regulations, Part 11, Electronic Records and Signatures.
FURTHER READING

Guideline for Industry Clinical Safety Data Management: Definition and Standards for Expedited Reporting (ICH-E2A), Federal Register, March 1, 1995 (60 FR 11284).
W. K. Sietsema, Preparing the New Drug Application: Managing Submissions Amid Changing Global Requirements. FDAnews, Washington, DC, 2006.
W. K. Sietsema, Strategic Clinical Development Planning: Designing Programs for Winning Products. FDAnews, Washington, DC, 2005.
Food and Drug Administration. Available: http://www.fda.gov.
Food and Drug Administration, Center for Drug Evaluation and Research. Available: http://www.fda.gov/cder.
Department of Health and Human Services. Available: http://www.os.dhhs.gov.
European Union. Available: http://www.europa.eu.int.
International Conference on Harmonization. Available: http://www.ich.org.
GOOD LABORATORY PRACTICE (GLP)
The Good Laboratory Practice (GLP) regulations establish standards for the conduct and the reporting of nonclinical laboratory studies and are intended to assure the quality and the integrity of safety data submitted to the FDA. The FDA relies on documented adherence to GLP requirements by nonclinical laboratories in judging the acceptability of safety data submitted in support of research and/or marketing permits. The FDA has implemented a program of regular inspections and data audits to monitor laboratory compliance with the GLP requirements.
The Federal Food, Drug, and Cosmetic Act and the Public Health Service Act require that sponsors of Food and Drug Administration (FDA)-regulated products submit evidence of their product's safety in research and/or marketing applications. These products include food and color additives, animal drugs, human drugs and biological products, human medical devices, diagnostic products, and electronic products. The FDA uses the data to answer questions regarding:

• The toxicity profile of the test article.
• The observed no-adverse-effect dose level in the test system.
• The risks associated with clinical studies that involve humans or animals.
• The potential teratogenic, carcinogenic, or other adverse effects of the test article.
• The level of use that can be approved.

The importance of nonclinical laboratory studies to the FDA's public health decisions demands that they be conducted according to scientifically sound protocols and with meticulous attention to quality. In the 1970s, FDA inspections of nonclinical laboratories revealed that some studies submitted in support of the safety of regulated products had not been conducted in accord with acceptable practice; accordingly, data from such studies were not always of the quality and integrity needed to assure product safety. As a result of these findings, the FDA promulgated the Good Laboratory Practice (GLP) Regulations, 21 CFR (Code of Federal Regulations) Part 58, on December 22, 1978 (43 FR (Federal Register) 59986). The regulations became effective in June 1979.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/ora/compliance_ref/bimo/7348_808/part_I.html) by Ralph D'Agostino and Sarah Karl.
GOODNESS OF FIT
G. M. TALLIS

In general, the term ''goodness of fit'' is associated with the statistical testing of hypothetical models with data. Examples of such tests abound and are to be found in most discussions on inference*, least-squares* theory, and multivariate analysis*. This article concentrates on those tests that examine certain features of a random sample to determine if it was generated by a particular member of a class of cumulative distribution functions* (CDFs). Such exercises fall under the broad heading of hypothesis testing*. However, the feature that tends to characterize these ''goodness-of-fit tests'' is their preoccupation with the sample CDF, the population CDF, and estimates of it. More specifically, let $X_1, X_2, \ldots, X_n$ be a random sample generated by CDF $G_X(x)$. It is required to test

$$ H_0\colon G_X(x) = F_X(x, \theta), \qquad \theta \in \Omega, \qquad (1) $$

where $\theta$ is a $q$-dimensional vector of parameters belonging to the parameter space $\Omega$. If $\theta$ is fixed at some value $\theta_0$, say, then $F_X(x, \theta_0) = F_X(x)$ is fully specified and $H_0$ is simple. Otherwise, the hypothesis states that $G_X(x)$ is some unspecified member of a family of CDFs and is composite*. As an example, consider the normal* family $N_X(x; \theta)$, $\theta = (\theta_1, \theta_2)$, where $\theta_1$ is the mean and $\theta_2$ the variance of $N_X$. In this case $\Omega = (-\infty, \infty) \times (0, \infty)$ and it might be required to test whether or not a sample was generated by $N_X(x; \theta)$ for some unknown $\theta \in \Omega$. Intuitively, and in fact, this is an intrinsically more difficult problem than testing whether the sample was generated by a particular normal CDF with known mean and variance. The latter case can always be reduced to the standard situation of testing $G_X(x) = N(x; \theta_0)$, $\theta_0 = (0, 1)$. Most useful tests are parameter-free; i.e., the distribution of the test statistics does not depend on $\theta$. Among such tests are found both parametric and nonparametric tests which are either distribution specific or distribution-free. Since tests may require $F(x, \theta)$ to be continuous or a step function, later discussion will deal with continuous and discrete $X_i$ separately.

There has been a recent resurgence of interest in the theory of goodness-of-fit tests. Technical advances have been made with some of the older tests, while new tests have been proposed and their power properties examined. This progress can be attributed in part to the availability of mathematical development in the theory of probability* and stochastic processes*. However, it is also in large measure due to the advent of the high-speed computer, the associated numerical technology, and the increased demand for statistical services. This article can only summarize some of the available results and refer the reader to special sources for further detail. Many statistical texts have introductory chapters on goodness-of-fit testing. For example, Kendall and Stuart (18, Vol. 2) and Lindgren (21) contain pertinent material lucidly presented. Pearson and Hartley (25) also contains accounts of specific tests illustrated by numerical examples. The following general notation will be adopted, additional special symbols being introduced as required:

Probability density function* (PDF) corresponding to $F_X(x, \theta)$ and $G_X(x)$ (when they exist): $f_X(x, \theta)$, $g_X(x)$.
Order statistics*: $X_1 \le X_2 \le \cdots \le X_n$.
Expected values of order statistics: $E[X_i] = \eta_i$.
Sample CDF: $G_n(x) = [\text{no. of } X_i \le x]/n$.
A chi-square random variable with $d$ degrees of freedom: $\chi^2(d)$.
$100(1-\alpha)$ percentile of the chi-square distribution* with $d$ degrees of freedom: $\chi^2_{1-\alpha}(d)$.
The uniform density on $[0, 1]$: $U[0, 1]$.
If $X_n$ is a sequence of random variables, then $X_n \xrightarrow{L} \chi^2(d)$ will indicate convergence in law to a chi-square distribution with $d$ degrees of freedom; if $X_n$ is a sequence of random vectors, then $X_n \xrightarrow{L} N(\mu, \Sigma)$ will indicate convergence* in law to a normal distribution with mean vector $\mu$ and covariance matrix* $\Sigma$.

1 DISCRETE RANDOM VARIABLES
1.1 Simple H0

Suppose that $X$ is a discrete random variable, $\Pr\{X = x\} = f_X(x, \theta)$ for $\theta \in \Omega$ and $x \in \mathcal{X}$, where $\mathcal{X}$ is a finite or countable set of real numbers. By suitable definition, categorical data* with no evident numerical structure can be brought within this framework. The simplest case is where $\mathcal{X} = \{1, 2, \ldots, k\}$, $f_X(j, \theta_0) = f_j$ is fully specified, and it is required to test $H_0\colon g_X(j) = f_j$. Let $N_j$ be the number of $X_i$ in the sample such that $X_i = j$; then the probability under $H_0$ of obtaining the particular outcome vector $\mathbf{n} = (n_1, n_2, \ldots, n_k)$, where $\sum_j n_j = n$, is

$$ P(\mathbf{n}) = \frac{n!}{\prod_j n_j!} \prod_j f_j^{n_j}. $$

An exact test of $H_0$ can, in principle, be constructed as follows:

1. Calculate $P(\mathbf{m})$ for all possible outcome vectors, $\mathbf{m}$.
2. Order the $P(\mathbf{m})$.
3. Sum all $P(\mathbf{m})$ which are less than or equal to $P(\mathbf{n})$.
4. Reject $H_0$ at level $\alpha$ if the cumulative probability of 3 is less than or equal to $\alpha$.

This is known as the multinomial test of goodness of fit, and the necessary calculations must be carried out on a computer. Even then, this is only practicable if $n$ and $k$ are small. Fortunately, for large $n$, there are ways around the distributional problem. Likelihood ratio theory can be invoked; the likelihood ratio test* of $H_0\colon g_X(j) = f_j$ against the alternative $H_1\colon g_X(j) \ne f_j$ is formed from the ratio $\Lambda = \prod_{j=1}^{k} (n f_j / N_j)^{N_j}$. It is known that

$$ -2 \ln \Lambda = -2 \sum_{j=1}^{k} N_j \left[ \ln(n f_j) - \ln N_j \right] \xrightarrow{L} \chi^2(k - 1). \qquad (2) $$

The null hypothesis is rejected at level $\alpha$ if the calculated value of $-2 \ln \Lambda$ exceeds $\chi^2_{1-\alpha}(k - 1)$.
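For readers who wish to experiment, the short sketch below (plain Python; the counts and null probabilities are made up for illustration and are not from this article) enumerates every outcome vector to carry out the exact multinomial test described above and also evaluates the likelihood ratio statistic $-2\ln\Lambda$.

```python
from itertools import combinations
from math import exp, lgamma, log

def log_pmf(counts, probs):
    """log P(n) for the multinomial: log n! - sum log n_j! + sum n_j log f_j."""
    n = sum(counts)
    out = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    return out + sum(c * log(p) for c, p in zip(counts, probs) if c > 0)

def outcomes(n, k):
    """All vectors (n_1, ..., n_k) of nonnegative integers summing to n (stars and bars)."""
    for bars in combinations(range(n + k - 1), k - 1):
        cells, prev = [], -1
        for b in bars:
            cells.append(b - prev - 1)
            prev = b
        cells.append(n + k - 1 - prev - 1)
        yield tuple(cells)

def exact_multinomial_pvalue(observed, probs):
    """Sum P(m) over all outcomes m that are no more probable than the observed vector."""
    lp_obs = log_pmf(observed, probs)
    total = 0.0
    for m in outcomes(sum(observed), len(observed)):
        lp = log_pmf(m, probs)
        if lp <= lp_obs + 1e-12:
            total += exp(lp)
    return total

def lrt_statistic(observed, probs):
    """-2 ln Lambda = -2 sum_j N_j [ln(n f_j) - ln N_j]; compare with chi-square on k - 1 df."""
    n = sum(observed)
    return -2.0 * sum(N * (log(n * f) - log(N)) for N, f in zip(observed, probs) if N > 0)

observed = (6, 9, 5)           # hypothetical counts in k = 3 cells, n = 20
probs = (0.25, 0.50, 0.25)     # fully specified null cell probabilities f_j
print("exact multinomial p-value:", round(exact_multinomial_pvalue(observed, probs), 4))
print("-2 ln Lambda:", round(lrt_statistic(observed, probs), 3))
```

Because the enumeration grows rapidly with $n$ and $k$, the exact calculation is practical only for small problems, as the text notes.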
A very old test, dating back to the beginning of the twentieth century, is based on

$$ X^2 = \sum_{j=1}^{k} (N_j - n f_j)^2 / (n f_j). \qquad (3) $$

This is known as Pearson's chi-square and has the same limiting distribution as (2). Since $N_j$ is the observed number of $X_i$ in the sample with $X_i = j$, $(O_j)$, and $E[N_j] = n f_j = (E_j)$, (3) is sometimes written in the form of a mnemonic,

$$ X^2 = \sum_{j=1}^{k} (O_j - E_j)^2 / E_j. \qquad (4) $$

Not only do (2) and (3) share the same limiting central distribution, but they are also asymptotically equivalent in probability. However, since $X^2$ is a direct measure of agreement between observation and expectation under $H_0$, it has some intuitive appeal not shared by (2). Both (2) and (3) give asymptotic tests which tend to break down if the $n f_j$ are too small. A common rule is that all $n f_j$ should be greater than 1 and that 80% of them should be greater than or equal to 5. These conditions are sometimes hard to meet in practice. For a general discussion and further references, see Horn (16).

Radlow and Alf (29) point out that a direct comparison of $X^2$ with the multinomial test may be unjustified. The latter test orders experimental outcomes, $\mathbf{m}$, in terms of $P(\mathbf{m})$ instead of ordering them in terms of discrepancies from $H_0$. It is suggested that $X^2$ should be compared with the following exact procedure:

1. Calculate $P(\mathbf{m})$ for all possible outcomes $\mathbf{m}$.
2. Calculate $X^2$ for each $\mathbf{m}$ based on $H_0$, $X^2(\mathbf{m})$.
3. Sum $P(\mathbf{m})$ for which $X^2(\mathbf{m}) \ge X^2(\mathbf{n})$, the observed $X^2$ value.
4. Reject $H_0$ at level $\alpha$ if this sum does not exceed $\alpha$.
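The comparison suggested by Radlow and Alf can be sketched in a few lines. The example below uses hypothetical counts and assumes SciPy is available for the chi-square tail probability; it orders outcomes by $X^2$ and contrasts the resulting exact p-value with the usual $\chi^2(k-1)$ approximation.

```python
from math import exp, lgamma, log
from scipy.stats import chi2

def outcomes(n, k):
    """All vectors of k nonnegative integers summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in outcomes(n - first, k - 1):
            yield (first,) + rest

def log_pmf(m, probs):
    n = sum(m)
    out = lgamma(n + 1) - sum(lgamma(c + 1) for c in m)
    return out + sum(c * log(p) for c, p in zip(m, probs) if c > 0)

def pearson_x2(m, probs):
    n = sum(m)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(m, probs))

observed = (6, 9, 5)            # hypothetical counts; the expected counts are (5, 10, 5)
probs = (0.25, 0.50, 0.25)
x2_obs = pearson_x2(observed, probs)

# Exact p-value with outcomes ordered by X^2 (the procedure just described), and the
# chi-square approximation on k - 1 = 2 degrees of freedom.
exact_p = sum(exp(log_pmf(m, probs))
              for m in outcomes(sum(observed), len(observed))
              if pearson_x2(m, probs) >= x2_obs - 1e-12)
print("X^2 =", round(x2_obs, 3))
print("exact p (X^2 ordering):", round(exact_p, 4))
print("asymptotic p:", round(chi2.sf(x2_obs, 2), 4))
```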
Numerical comparisons of this exact test with $X^2$ showed that the two agreed remarkably well, even for small $n$. The agreement of the exact multinomial test with $X^2$, on the other hand, was poor. Thus care must be exercised in the assessment of the performances of large-sample tests using small $n$. Appropriate baseline exact tests must be used for comparisons.

Another procedure is the discrete Kolmogorov–Smirnov* goodness-of-fit test (5). Let

$$ D^- = \max_x (F_X(x) - G_n(x)), \qquad D^+ = \max_x (G_n(x) - F_X(x)), \qquad D = \max(D^-, D^+); $$

then $D^-$, $D^+$, and $D$ test, respectively, $H_1\colon G_X(x) \le F_X(x)$, $H_1\colon G_X(x) \ge F_X(x)$, and $H_1\colon G_X(x) \ne F_X(x)$. A discussion of the application of these statistics is given in Horn (16), where their efficiency relative to $X^2$ is discussed and a numerical example given. For more recent asymptotic results, see Wood and Altavela (39).

1.2 Composite H0

When $H_0\colon G_X(x) = F_X(x, \theta)$, $\theta \in \Omega$, is to be tested, $\theta$ must be estimated and the theory becomes more elaborate. However, provided that asymptotically efficient estimators $\hat{\theta}_n$ are used, tests (2) and (3) extend in a natural way and continue to be equivalent in probability. More specifically, since $X_i$ is assumed discrete, put $H_0\colon g_X(j) = f_j(\theta)$ and $H_1\colon g_X(j) \ne f_j(\theta)$, $\theta \in \Omega$. Let $\hat{\theta}_n$ be as above; e.g., $\hat{\theta}_n$ could be the maximum likelihood estimator* (MLE) for $\theta$ under $H_0$. Then under $H_0$,

$$ -2 \ln \Lambda(\hat{\theta}_n) = -2 \sum_{j=1}^{k} N_j \left[ \ln(n f_j(\hat{\theta}_n)) - \ln N_j \right] \xrightarrow{L} \chi^2(k - q - 1), $$

$$ X^2(\hat{\theta}_n) = \sum_{j=1}^{k} (N_j - n f_j(\hat{\theta}_n))^2 / (n f_j(\hat{\theta}_n)) \xrightarrow{L} \chi^2(k - q - 1). $$

The philosophy adopted in the preceding subsection for rejecting $H_0$ is used here, the critical level being $\chi^2_{1-\alpha}(k - q - 1)$.

Although (4) and (8) are the standard tests recommended in textbooks and have received the most attention by practitioners and theoreticians, there are others. For example, a general class of goodness-of-fit tests can be based on quadratic form* theory for multinormally distributed random variables. Under sufficient regularity conditions the following results can be established by routine methods of probability calculus. Put

$$ \bar{N} = (\bar{N}_1, \bar{N}_2, \ldots, \bar{N}_k), \qquad \bar{N}_i = N_i / n, \qquad f(\theta) = (f_1(\theta), f_2(\theta), \ldots, f_k(\theta)); $$

then

$$ \sqrt{n}\,(\bar{N} - f(\theta)) \xrightarrow{L} N(0, V), $$

where $V = [v_{ij}]$, $v_{ii} = f_i(\theta)(1 - f_i(\theta))$, $v_{ij} = -f_i(\theta) f_j(\theta)$, $i \ne j$, and rank$(V) = k - 1$. Now suppose that $\theta^*_n$ is any estimator for $\theta$ which can be expressed in the locally, suitably regular functional form $\theta^*_n = g(\bar{N})$. Let $D = [d_{ij}]$, $d_{ij} = \partial g_i(\bar{N}) / \partial \bar{N}_j$, $i = 1, 2, \ldots, q$; $j = 1, 2, \ldots, k$, and $Q = [q_{rs}]$, $q_{rs} = \partial f_r(\theta) / \partial \theta_s$, $r = 1, 2, \ldots, k$; $s = 1, 2, \ldots, q$. Then

$$ \sqrt{n}\,(\bar{N} - f(\theta^*_n)) \xrightarrow{P} \sqrt{n}\,(I - QD)(\bar{N} - f(\theta)) \xrightarrow{L} N(0, \Sigma), $$

where $\Sigma = (I - QD) V (I - QD)'$. If $\Sigma^g$ is any generalized inverse* of $\Sigma$ (i.e., $\Sigma \Sigma^g \Sigma = \Sigma$), it then follows that, under $H_0$,

$$ Q_n(\theta^*_n) = n\,(\bar{N} - f(\theta^*_n))'\, \Sigma^g\, (\bar{N} - f(\theta^*_n)) \xrightarrow{L} \chi^2(k - q - 1). $$

The power of tests such as (5), which include (8) as a special case, can also be examined using sequences of local alternatives, similar arguments, and the noncentral chi-square* distribution. However, such studies are of limited use, as tests of fit should detect broad alternatives, a performance feature that can be checked only by computer simulations. Note that $\Sigma^g$ may be a function of $\theta$, which can be replaced by $\theta^*_n$ or any other consistent estimator for $\theta$ without affecting the asymptotic distribution of $Q_n(\theta^*_n)$. For an early, rigorous account of the theory of chi-square tests*, see Cramér (6). A modern and comprehensive treatment of this theory with many ramifications is given by Moore and Spruill (24). Their paper is technical and covers cases where the $X_i$ are random vectors and $\theta$ is estimated by a variety of methods. An easy-to-read overview of this work is given by Moore (23).
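As an illustration of the grouped, composite case, the sketch below fits a hypothetical Poisson model to counts grouped into $k = 5$ cells by minimizing $X^2(\theta)$ computed from the grouped data (a coarse grid search stands in for a proper optimizer), and refers $X^2(\hat{\theta}_n)$ to $\chi^2(k - q - 1)$. The data and model are invented for illustration; NumPy and SciPy are assumed.

```python
import numpy as np
from scipy.stats import chi2, poisson

# Hypothetical counts grouped into k = 5 cells {0, 1, 2, 3, 4+}; one parameter (q = 1) to estimate.
observed = np.array([12, 30, 28, 18, 12])
n, k, q = observed.sum(), len(observed), 1

def cell_probs(lam):
    """Grouped Poisson cell probabilities f_j(theta) for cells 0, 1, 2, 3, and 4+."""
    p = poisson.pmf(np.arange(4), lam)
    return np.append(p, 1.0 - p.sum())

def x2(lam):
    expected = n * cell_probs(lam)
    return float(np.sum((observed - expected) ** 2 / expected))

# Minimum chi-square computed from the grouped data; any asymptotically efficient
# estimator based on the grouped counts leads to the same chi-square(k - q - 1) limit.
grid = np.linspace(0.5, 5.0, 2001)
lam_hat = grid[np.argmin([x2(g) for g in grid])]
stat, df = x2(lam_hat), k - q - 1
print(f"lambda_hat = {lam_hat:.3f}, X^2 = {stat:.3f}, df = {df}, p = {chi2.sf(stat, df):.4f}")
```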
2 CONTINUOUS RANDOM VARIABLES
The testing of goodness-of-fit hypotheses when $f_X(x, \theta)$ is a continuous function of $x$ introduces features not exhibited by the discrete tests discussed in the section ''Discrete Random Variables.'' However, by suitable constructions some of the latter tests remain useful. These points will be expanded on below.

2.1 Simple H0

In order to apply results of the section ''Discrete Random Variables,'' partition the real line into $k \ge 2$ sets:

$$ I_1 = (-\infty, a_1], \quad I_2 = (a_1, a_2], \quad \ldots, \quad I_k = (a_{k-1}, \infty). $$

To test $H_0\colon g_X(x) = f_X(x)$, let $N_j$ be the number of $X_i \in I_j$ in the sample and put $p_j = \int_{I_j} f_X(x)\,dx$. Then, under $H_0$, the $N_j$ have a multinomial distribution* with parameters $p_j$, $n = \sum_{j=1}^{k} N_j$, and any of the tests from the discussion of simple $H_0$ in the section ''Discrete Random Variables'' can be applied.

Clearly, making a situation discrete which is essentially continuous leads to a loss of precision. The actual values of the $X_i$ are suppressed and only their relationship with the $I_j$ is used in the tests. The $k$ classes are usually chosen to keep the $n p_j$ acceptably high. In order to achieve some standardization, it seems reasonable to use $p_j = k^{-1}$ and to determine the $a_i$ by the equations $F_X(a_1) = k^{-1}$, $F_X(a_2) - F_X(a_1) = k^{-1}$, etc. (see Kendall and Stuart (18)). Nevertheless, there remains an essential nonuniqueness aspect to the tests. Given the same set of data, different statisticians can reach different conclusions using the same general procedures. In fact, these tests condense the data and examine whether or not $g_X(x)$ is a member of the particular class of density functions with given content $p_j$ for $I_j$. Despite these drawbacks, the approach outlined above has enjoyed wide support and is most commonly used in practice.

The method of condensation of data presented above when $X$ is a continuous random variable may also have to be practiced when $X$ is discrete. In this case subsets of $\mathcal{X}$ are used in place of individual elements to achieve cell expectations sufficiently large to render the asymptotic distribution theory valid.

A useful way of visually checking the adequacy of $H_0$ is to examine the order statistics $X_1, X_2, \ldots, X_n$. Since $f_X(x)$ is fully specified, $E[X_i] = \eta_i$ can be calculated and plotted against $X_i$. If $H_0$ holds, this plot should be roughly linear. There are analytical counterparts to the simple order statistics plots. Let $0 < \lambda_1 < \lambda_2 < \cdots < \lambda_k < 1$, $n_i = [n \lambda_i] + 1$, where $[x]$ is the greatest integer less than or equal to $x$, and consider $X_{n_i}$, $i = 1, 2, \ldots, k$. Under suitable regularity conditions on $f_X(x)$,

$$ Y^2 = n \sum_{i=1}^{k} \{ [F_X(X_{n_i}) - F_X(X_{n_{i-1}})] - p_i \}^2 \, p_i^{-1} \xrightarrow{L} \chi^2(k - 1), $$

where $p_i = \lambda_i - \lambda_{i-1}$ (4). This bypasses the problem of constructing intervals $I_j$ and uses part of the natural ordering of the sample.
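A minimal sketch of the equiprobable-class construction described above is given below, assuming NumPy and SciPy; it tests a fully specified exponential null with $k = 8$ classes of content $p_j = k^{-1}$, with simulated data standing in for real observations.

```python
import numpy as np
from scipy.stats import chi2, expon

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=200)            # simulated data; H0: exponential with mean 1

k = 8                                               # equiprobable classes, p_j = 1/k
edges = expon.ppf(np.arange(1, k) / k, scale=1.0)   # boundaries a_1, ..., a_{k-1} with F_X(a_j) = j/k
counts = np.bincount(np.searchsorted(edges, x), minlength=k)
expected = len(x) / k

x2 = float(np.sum((counts - expected) ** 2 / expected))
print("X^2 =", round(x2, 3), " p =", round(chi2.sf(x2, k - 1), 4))   # df = k - 1 for a simple H0
```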
A number of tests make specific use of the sample CDF, $G_n(x)$, for testing $H_0\colon G_X(x) = F_X(x)$. Some of these are now listed.

2.1.1 Kolmogorov–Smirnov* Statistics, $D_n^-$, $D_n^+$, and $D_n$. Define

$$ D_n^- = \sup_x [F_X(x) - G_n(x)], \qquad D_n^+ = \sup_x [G_n(x) - F_X(x)], \qquad D_n = \max(D_n^-, D_n^+). $$

Then $D_n^-$ and $D_n^+$ can be used to test $H_0$ against the one-sided alternatives $H_1\colon G_X(x) \le F_X(x)$ and $H_1\colon G_X(x) \ge F_X(x)$, respectively, while $D_n$ tests $H_1\colon G_X(x) \ne F_X(x)$. The CDFs of the three statistics are known exactly and are independent of $F_X(x)$ (8). To see this, let $U = F_X(X)$; then $D_n^+ = \sup_{0 \le u \le 1} [G_n(u) - u]$, etc. The most useful set of tables is given by Pearson and Hartley (25), who also include some numerical examples. A derivation of the asymptotic distributions of $D_n^+$ and $D_n$ can be based on the stochastic process

$$ y_n(t) = \sqrt{n}\,(G_n(t) - t), \qquad 0 \le t \le 1, \qquad (5) $$

which has zero mean and

$$ C[y(s), y(t)] = \min(s, t) - st, \qquad 0 \le s, t \le 1. \qquad (6) $$

The central limit theorem* ensures that $[y_n(t_1), y_n(t_2), \ldots, y_n(t_k)]$ is asymptotically multinormal with null mean vector and the above covariance structure. Thus the finite-dimensional distributions of $y_n(t)$ converge to those of $y(t)$, tied-down Brownian motion*. Intuitively, the distributions of $\sup_t y_n(t)$ and $\sup_t |y_n(t)|$ will tend to those of $\sup_t y(t)$ and $\sup_t |y(t)|$. This can be verified using the theory of weak convergence. The two crossing problems thus generated can be solved to yield the desired limiting CDFs (8). For a different approach, see Feller (13). It is interesting that these investigations show $4n(D_n^+)^2$ to be asymptotically distributed as $\chi^2(2)$.

2.1.2 Cramér–von Mises Test*. Let

$$ W_n^2 = n \int_{-\infty}^{\infty} [G_n(x) - F_X(x)]^2 \, dx; \qquad (7) $$

then $W_n^2$ is a measure of the agreement between $G_n(x)$ and $F_X(x)$ for all $x$ and is known as the Cramér–von Mises statistic. By means of the probability transformation $U = F_X(X)$, (7) can be written

$$ W_n^2 = n \int_0^1 [G_n(u) - u]^2 \, du, \qquad (8) $$

emphasizing that this test is also distribution-free. The CDF of $W_n^2$ is not known for all $n$ but has been approximated; the asymptotic distribution is derived in Durbin (8). For easy-to-use tables, see Pearson and Hartley (25).

2.1.3 Tests Related to the Cramér–von Mises Test. Various modifications of $W_n^2$ are used for specific purposes. For instance, a weight function $\psi(t)$ can be introduced to give $\int_0^1 [G_n(t) - t]^2 \psi(t)\,dt$ as a test statistic. When $\psi(t) = [t(1-t)]^{-1}$, the resulting statistic is called the Anderson–Darling statistic, $A_n^2$, and leads to the Anderson–Darling test*. Since $E[n[G_n(t) - t]^2] = t(1-t)$, this weights discrepancies by the reciprocal of their standard deviations and puts more weight in the tails of the distribution, a feature that may be important. The same remarks made for $W_n^2$ apply to $A_n^2$.

A number of scientific investigations yield data in the form of directions, and it may be required to test the hypothesis that these are orientated at random. Since each direction is represented by an angle measured from a fixed position $P$, such data can be represented as points on a unit circle. The test then concerns the randomness of the distribution of the points on this circle. Watson (38) introduced the statistic

$$ U_n^2 = n \int_0^1 \left[ G_n(t) - t - \overline{G_n(t) - t} \right]^2 dt, \qquad (9) $$

where $\overline{G_n(t) - t} = \int_0^1 [G_n(t) - t]\,dt$. It can be shown that $U_n^2$ is independent of the choice of $P$. The asymptotic distribution of $U_n^2$ is known (8) and appropriate tables may be found in Pearson and Hartley (25).
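For computation, the supremum and integral statistics above reduce, after the probability integral transformation, to simple functions of the ordered values $U_{(1)} \le \cdots \le U_{(n)}$. The sketch below uses the standard computing formulas for $D_n$, $W_n^2$, and $A_n^2$ (these closed forms are well-known results, not taken from this article), with simulated data and a fully specified normal null.

```python
import numpy as np
from scipy.stats import norm

def edf_statistics(x, cdf):
    """D_n, W_n^2, and A_n^2 for a fully specified continuous null, computed from the
    ordered probability-integral-transform values U_(1) <= ... <= U_(n)."""
    u = np.sort(cdf(np.asarray(x)))
    n = len(u)
    i = np.arange(1, n + 1)
    d_plus = np.max(i / n - u)
    d_minus = np.max(u - (i - 1) / n)
    d_n = max(d_plus, d_minus)
    w2 = float(np.sum((u - (2 * i - 1) / (2 * n)) ** 2) + 1.0 / (12 * n))
    a2 = float(-n - np.mean((2 * i - 1) * (np.log(u) + np.log(1.0 - u[::-1]))))
    return d_n, w2, a2

rng = np.random.default_rng(2)
x = rng.normal(size=100)                      # simulated data; H0: standard normal (theta fixed)
d_n, w2, a2 = edf_statistics(x, norm.cdf)
print(f"D_n = {d_n:.4f}, W_n^2 = {w2:.4f}, A_n^2 = {a2:.4f}")
```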
Under $H_0\colon G_X(x) = F_X(x)$, the variables $U_i = F_X(X_i)$ are distributed as $U[0, 1]$. Hence $\bar{G}_n = \int_0^1 G_n(u)\,du = 1 - \bar{U}$ has expectation $\tfrac{1}{2}$, variance $(12n)^{-1}$, and tends rapidly to normality. This provides a direct large-sample test of $H_0$, although exact significance points are available (34).

Tests related to $D_n^-$, $D_n^+$, $D_n$, and $W_n^2$ have been proposed by Riedwyl (31). He defines the $i$th discrepancy as $d_i = F(X_i) - F_n(X_i)$ and examines tests based on $\sum_1^n d_i$, $\sum_1^n |d_i|$, $\sum_1^n d_i^2$, $\max_i d_i$, $\max_i |d_i|$, etc. Some pertinent exact and asymptotic results are given. Hegazy and Green (15) considered tests based on the forms $T_1 = n^{-1} \sum_1^n |X_i - \nu_i|$ and $T_2 = n^{-1} \sum_1^n (X_i - \nu_i)^2$, where $\nu_i = \eta_i$ or $\nu_i = \xi_i$, the mode of $X_i$. Tests of the hypothesis $H_0\colon G_X(x) = F_X(x)$ can be reduced as shown above to testing whether or not $U_i = F_X(X_i)$ is distributed $U[0, 1]$. Thus $\eta_i = i/(n+1)$ and $\xi_i = (i-1)/(n-1)$. The powers of these $T$ tests were examined against normal*, Laplace*, exponential*, and Cauchy* alternatives and compared with the powers of other tests. The conclusion was that $T_1$ and $T_2$ have similar performances and that it is slightly better to use $\xi_i$ than $\eta_i$. These $T$ statistics generally compare favorably with the tests just described, or minor modifications of them. Hegazy and Green (15) provide an extensive bibliography of other studies of power of goodness-of-fit tests.

2.2 Composite H0

The most common hypothesis that requires testing is $H_0\colon G_X(x) = F_X(x, \theta)$ for some $\theta \in \Omega$. The introduction of nuisance parameters* creates new technical difficulties which can only be touched on briefly here. In general, however, the same form of tests as those just presented are used, with modifications.

In order to make use of the results in the discussion of composite $H_0$ in the section ''Discrete Random Variables,'' $k$ intervals are introduced as in the preceding subsection. The interval contents are functions of $\theta$, $p_j(\theta) = \int_{I_j} f(x, \theta)\,dx$, and if $N_j$ is the number of $X_i$ in $I_j$, a multinomial system is generated, the parameters being functions of the unknown $\theta$. The whole problem may now be treated by the methods of the section on discrete variables, and the same comment concerning loss of information and nonuniqueness due to grouping applies.

A number of special points need emphasis. The estimation of $\theta$ must be made from the data in the grouped state if the distribution theory of the section on discrete variables is to hold. For instance, $\theta$ estimated from the $X_i$ and $f(x, \theta)$ should not be used in the $X^2(\theta)$ statistic. Doing so results in a limiting CDF which depends on $\theta$ and a conservative test if the $\chi^2_{1-\alpha}(k - q - 1)$ significance level is used. Since $\theta$ is not known, there is some difficulty defining the intervals $I_j$. In general, the boundaries of the intervals are functions of $\theta$; Moore and Spruill (24) have shown that, provided that consistent estimators of the boundary values are used, the asymptotic results (4), (8), and (5) remain valid if the random intervals are used as if they were the true ones. For example, reconsider the problem of testing $H_0\colon G_X(x) = N_X(x, \theta)$. Consistent estimators of $\theta_1 = \mu$ and $\theta_2 = \sigma^2$ are $\bar{X} = \sum_1^n X_i / n$ and $S^2 = \sum_1^n (X_i - \bar{X})^2 / (n - 1)$, and it is appropriate that the $I_j$ be constructed with $\bar{X}$ and $S$ in place of $\mu$ and $\sigma$ to ensure approximate contents of $k^{-1}$. Using these estimated intervals, the procedure requires that $\mu$ and $\sigma^2$ be estimated efficiently, by maximum likelihood, for instance, and the tests applied in the usual way.

A test developed by Moore (22) and suggested by Rao and Robson (30) has interesting flexibility and power potential. Let $V_n(\theta)$ be a $k$-vector with $i$th component $(N_i - n f_i(\theta)) / \sqrt{n f_i(\theta)}$, $B(\theta)$ a $k \times q$ matrix with elements $p_i(\theta)^{-1/2}\, \partial p_i(\theta) / \partial \theta_j$, and $J(\theta)$ the usual information matrix for $F_X(x, \theta)$. Define the statistic

$$ T_n(\hat{\theta}_n) = V_n(\hat{\theta}_n)' \left[ I - B(\hat{\theta}_n) J^{-1}(\hat{\theta}_n) \{ B(\hat{\theta}_n) \}' \right]^{-1} V_n(\hat{\theta}_n), $$

where $\hat{\theta}_n$ is the ungrouped MLE for $\theta$; then $T_n(\hat{\theta}_n) \xrightarrow{L} \chi^2(k - 1)$.

The problem of estimating intervals can be bypassed by the use of quantile* statistics. Define $\lambda_i$ and $p_i$ as for (6), and the statistic

$$ Y_n^2(\theta) = n \sum_{i=1}^{k} \{ [F_X(X_{n_i}, \theta) - F_X(X_{n_{i-1}}, \theta)] - p_i \}^2 \, p_i^{-1}. $$

If $\theta = \tilde{\theta}_n$ minimizes $Y_n^2(\theta)$, then $Y_n^2(\tilde{\theta}_n) \xrightarrow{L} \chi^2(k - q - 1)$. Alternatively, the following test is available. Put $N_i^* = X_{n_i}$ and let $N^*$ be the $(k \times 1)$ vector of the $N_i^*$; then it is well known that

$$ \sqrt{n}\,(N^* - \nu) \xrightarrow{L} N(0, V), $$

where $\nu_i$ is defined by $F_X(\nu_i) = \lambda_i$ and $v_{ij} = \lambda_i (1 - \lambda_j) [f_X(\nu_i) f_X(\nu_j)]^{-1}$, $i \le j$. In general, both $\nu$ and $V$ are functions of the unknown $\theta$, so define

$$ A_n(\theta) = n\,(N^* - \nu(\theta))'\, V^{-1}(\theta)\, (N^* - \nu(\theta)) $$

and choose $\theta = \theta_n^*$ to minimize $A_n(\theta)$. Then $A_n(\theta_n^*) \xrightarrow{L} \chi^2(k - q)$, $k > q$. If $q = 2$, and $\theta_1$ and $\theta_2$ are location and scale parameters, respectively, an explicit expression exists for $\theta_n^*$. The matrix $V$ for the standardized variable $(X - \theta_1)/\theta_2$ can be used in (12), and a single matrix inversion is needed to complete the test (37).

The tests described in the discussion of simple $H_0$ do not extend readily to composite hypotheses. In general, for the cases considered and reported in the literature to date, the resulting tests are not distribution-free but depend on $F(x, \theta)$ and on the method used to estimate $\theta$, $\hat{\theta}$. This is because the CDF has a different limiting distribution when the parameters are estimated from that which results when the null hypothesis is simple (9). Hence tables of critical values constructed for simple hypothesis cases cannot be used for testing composite hypotheses. In fact, different critical values are needed for each hypothesis tested; the tests are carried out replacing $F_X(x)$ by $F(x, \hat{\theta})$ in the expressions of the preceding section.

2.2.1 Kolmogorov–Smirnov Statistics $D_n^-$, $D_n^+$, $D_n$. A technique for obtaining exact critical values was developed by Durbin (10), who applied it to obtain values for testing the composite hypothesis of exponentiality

$$ H_0\colon f(x, \theta) = \theta^{-1} \exp(-x/\theta), \qquad 0 < x, \quad \theta \in (0, \infty). $$

The technique is complicated, however, and has not been applied to other cases.
By a variety of techniques, including Monte Carlo methods*, Stephens (35) has given procedures for finding accurate critical values for testing composite hypotheses involving the normal and the exponential distributions. These procedures are also described by Pearson and Hartley (25). For a treatment of this problem using sufficient statistics*, see Kumar and Pathak (20).

2.2.2 Cramér–von Mises Statistic* $W_n^2$. No technique is yet available for obtaining exact critical values of $W_n^2$ for testing composite hypotheses. The first accurate calculations of asymptotic significance points for testing exponentiality and normality were made by Durbin et al. (12). Further extensions and related results were given by Stephens (36). Again, methods of obtaining good approximations to finite-sample critical values for tests of exponentiality and normality are given by Stephens (35).

2.2.3 Tests Related to the Cramér–von Mises Tests. Similar treatments to those of $W_n^2$ are given to $A_n^2$ and $U_n^2$ by Stephens (35, 36) and Durbin et al. (12) for testing exponentiality and normality.

In summary, then, the development of tests of fit for composite hypotheses using the sample CDF has centered largely on the exponential and the normal distributions. The most useful reference for the practitioner is Stephens (35), where tables cater for most of the common tests when the hypothesis is simple, and for the composite hypothesis cases mentioned above. If the data are censored*, see Pettitt and Stephens (27) and Dufour and Maag (7).
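In practice, composite-hypothesis critical values of this kind are often approximated by simulation. The sketch below, written in the spirit of the Monte Carlo approach just described (but not a reproduction of Stephens' procedures), simulates the null distribution of $D_n$ for the exponential case with $\theta$ estimated by the sample mean; only NumPy is assumed.

```python
import numpy as np

def ks_statistic(x, cdf):
    u = np.sort(cdf(np.asarray(x)))
    n = len(u)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - u), np.max(u - (i - 1) / n))

def ks_exponential_test(x, n_sim=5000, seed=0):
    """Composite test of exponentiality with theta estimated by the sample mean.
    Because the null distribution of D_n differs from the simple-hypothesis case,
    it is approximated by simulating samples from a fitted exponential; the statistic
    is scale-free here, so simulating with mean 1 is enough."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    d_obs = ks_statistic(x, lambda t: 1.0 - np.exp(-t / x.mean()))
    d_sim = np.empty(n_sim)
    for b in range(n_sim):
        y = rng.exponential(scale=1.0, size=n)
        d_sim[b] = ks_statistic(y, lambda t: 1.0 - np.exp(-t / y.mean()))
    return d_obs, float(np.mean(d_sim >= d_obs))

rng = np.random.default_rng(3)
sample = rng.gamma(shape=2.0, scale=1.0, size=80)   # hypothetical data to be tested for exponentiality
d_obs, p_value = ks_exponential_test(sample)
print(f"D_n = {d_obs:.4f}, Monte Carlo p-value = {p_value:.3f}")
```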
3 FURTHER TESTS AND CONSIDERATIONS
In this final section a few other special tests and some additional ideas impinging on goodness of fit will be mentioned.

3.1 Two Sample Tests

Let $X_1, X_2, \ldots, X_m$ and $Y_1, Y_2, \ldots, Y_n$, $m \le n$, be random samples from two different populations with continuous CDFs $F_X(x)$ and $F_Y(y)$ and sample CDFs $F_X^m(t)$ and $F_Y^n(t)$.
In analogy with the discussion on the Kolmogorov–Smirnov statistics $D_n^-$, $D_n^+$, and $D_n$ pertaining to simple $H_0$ in the section ''Continuous Random Variables,'' the hypothesis $H_0\colon F_X(t) = F_Y(t)$ can be tested against the alternatives $H_1\colon F_X(t) \le F_Y(t)$, $H_1\colon F_X(t) \ge F_Y(t)$, and $H_1\colon F_X(t) \ne F_Y(t)$ by the respective statistics

$$ D_{mn}^- = \sup_t [F_Y^n(t) - F_X^m(t)], \qquad D_{mn}^+ = \sup_t [F_X^m(t) - F_Y^n(t)], \qquad D_{mn} = \max[D_{mn}^-, D_{mn}^+]. $$

The exact distributions of these statistics are known; for finite sample critical points of $D_{mn}$, see Pearson and Hartley (25). For further references to tabulations, see Steck (33). If the statistics above are multiplied by $[mn(m+n)^{-1}]^{1/2}$, limiting distributions exist which are the same as those for $\sqrt{n}\,D_n^-$, $\sqrt{n}\,D_n^+$, and $\sqrt{n}\,D_n$. Similar modifications can be made to the Cramér–von Mises statistic to cater for two sample tests; again, see Durbin (8).

3.2 Tests of Departure from Normality*

In view of the central role of the normal distribution in statistical theory and practice, a great deal of effort has been spent in developing tests of normality (see the first section of this article). Some of these tests have been dealt with in previous sections; only special tests tailored for the normal distribution will be covered here. Let $m_r = n^{-1} \sum_1^n (X_i - \bar{X})^r$ and $S^2 = n m_2 (n-1)^{-1}$; then the statistics $\sqrt{b_1} = m_3 / S^3$ and $b_2 = m_4 / S^4$ measure skewness and kurtosis in the sample. If the population from which the sample is drawn is normal, $\sqrt{b_1}$ and $b_2$ should be near 0 and 3, respectively, and departure from these values is evidence to the contrary.

Both $\sqrt{b_1}$ and $b_2$ have been examined separately, jointly, and as a linear combination (see Pearson and Hartley (25) and Pearson et al. (26)). Both were compared with other tests for power; a variety of skewed and leptokurtic distributions were used as alternatives. The picture is somewhat confused, due in part to the wide spectrum of alternative distributions used and to the use of small numbers of Monte Carlo trials to establish the power properties. Nevertheless, a general pattern emerged; singly, the two statistics are useful for detecting departures from specific types of alternatives, and in combination they are reasonably robust against a large variety of alternatives.
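The moment statistics of this subsection are easy to compute directly; the sketch below evaluates $\sqrt{b_1}$ and $b_2$ exactly as defined above, using simulated data purely for illustration.

```python
import numpy as np

def skewness_kurtosis(x):
    """sqrt(b1) = m3 / S^3 and b2 = m4 / S^4, with m_r = n^{-1} sum (X_i - Xbar)^r and
    S^2 = n m_2 / (n - 1), as defined in the text."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    m2, m3, m4 = (d ** 2).mean(), (d ** 3).mean(), (d ** 4).mean()
    s2 = n * m2 / (n - 1)
    return m3 / s2 ** 1.5, m4 / s2 ** 2

rng = np.random.default_rng(4)
root_b1, b2 = skewness_kurtosis(rng.normal(size=500))
print(f"sqrt(b1) = {root_b1:.3f} (near 0 under normality), b2 = {b2:.3f} (near 3)")
```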
3.3 Wilks–Francia Test*
There are fruitful extensions to the technique of plotting order statistics* against expectation as introduced in the discussion of simple $H_0$ in the section ''Continuous Random Variables.'' Let the hypothesis be

$$ H_0\colon G(x, \theta) = F((x - \theta_1)/\theta_2), \qquad (10) $$

i.e., $G$ is determined up to a location and scale parameter. Then a plot of $X_i$ against the expectation of the standardized order statistics, $\eta_i$, should lie near the line $\theta_1 + \theta_2 \eta_i$ under $H_0$. Now, the unweighted least-squares estimator for $\theta_2$ is $\tilde{\theta}_2 = \sum_1^n X_i \eta_i / \sum_1^n \eta_i^2$, and the residual sum of squares is

$$ R_n^2 = \sum_1^n (X_i - \bar{X} - \tilde{\theta}_2 \eta_i)^2 = \sum_1^n (X_i - \bar{X})^2 - \left( \sum_1^n b_i X_i \right)^2, $$

where $b_i = \eta_i / (\sum_1^n \eta_i^2)^{1/2}$. Dividing both sides by $\sum_1^n (X_i - \bar{X})^2$ to remove the scale effect yields

$$ R_n^2 \left[ \sum_1^n (X_i - \bar{X})^2 \right]^{-1} = 1 - \frac{\left( \sum_1^n b_i X_i \right)^2}{\sum_1^n (X_i - \bar{X})^2} = 1 - W_n. $$

Then $W_n$ is the Wilks–Francia test statistic and measures the departure of the order statistics from their expectations; it has been used to test normality specifically, but it clearly enjoys a wider application. To carry out the test, tables of $\eta_i$ are required as well as critical points; reference is made to the original paper by Shapiro and Francia (32). Note that small values of $W_n$ are significant and that the test has been shown to be consistent. An asymptotic distribution for the test has been established (see Durbin (11), where further tests of fit using order statistics are discussed).

For completeness, it is pointed out that there exist goodness-of-fit procedures using the differences, or spacings*, between successive order statistics. Some of these tests are reviewed by Pyke (28), who developed the limiting distributions of functions of spacings and certain general limit theorems. More recent work in this area is reported by Kale (17) and Kirmani and Alam (19).
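A small sketch of the $W_n$ computation defined above follows. Note that the expected standardized order statistics $\eta_i$ are approximated here by Blom's scores $\Phi^{-1}((i - 3/8)/(n + 1/4))$; that approximation is an assumption of the sketch, not part of the original test description, which relies on tabulated $\eta_i$.

```python
import numpy as np
from scipy.stats import norm

def wilks_francia(x):
    """W_n = (sum b_i X_(i))^2 / sum (X_i - Xbar)^2 with b_i = eta_i / (sum eta_j^2)^(1/2).
    The expected standard normal order statistics eta_i are approximated by Blom's
    scores Phi^{-1}((i - 3/8) / (n + 1/4)), an approximation assumed for this sketch."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    eta = norm.ppf((np.arange(1, n + 1) - 0.375) / (n + 0.25))
    b = eta / np.sqrt(np.sum(eta ** 2))
    return float(np.sum(b * x) ** 2 / np.sum((x - x.mean()) ** 2))

rng = np.random.default_rng(5)
print(round(wilks_francia(rng.normal(size=100)), 4))   # near 1 under normality; small values are significant
```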
4 FINAL REMARKS
A few general considerations are worth raising. The single sample procedures outlined in previous sections deal with the problem of testing $H_0\colon G_X(x) = F_X(x, \theta)$, where $\theta$ is either fixed or is specified only up to a set $\Omega$. If $\theta$ is not fixed, an estimator $\hat{\theta}$ is substituted for it in $F$ and the concordance of this estimated model with the data assessed. It is important in carrying out tests of fit not to lose sight of the fundamental purpose of the exercise. For example, tests of normality are often required as an intermediate step to further analyses. Alternatively, the performance of specific statistical processes, such as a random number generator, may need to be checked against specification. In these instances, the philosophy of using $F$, or a good estimate of it, to test against available data seems entirely reasonable. A different situation is generated if predictions* are required. In this case an estimate of $F$ is to be used to predict future outcomes of the random variable $X$. It is possible that $F_X(x, \hat{\theta})$ may allow satisfactory predictions to be made, especially if the model is appropriate and $\hat{\theta}$ is based on a large sample. But there may be other candidates which would do a better job of prediction than $F_X(x, \hat{\theta})$, such as to set up a measure of divergence of one PDF from another (see Ali and Silvey (3)) and then to try to find that PDF, based on the data, which comes closest to the estimated PDF. This treatment may need Bayesian arguments to construct predictive densities (2).
More specifically, let $f_X(x, \theta)$ be the density which is to be estimated and introduce the weight function $p(\theta \mid z)$ on $\Omega$ based on data $z$. Put

$$ q_X(x \mid z) = \int_{\Omega} p(\theta \mid z)\, f_X(x, \theta)\, d\theta; \qquad (11) $$
then $q_X$ is called a predictive density for $f_X$. On the other hand, for any estimator for $\theta$ based on $z$, $\hat{\theta}(z)$, $f_X(x, \hat{\theta}(z))$ is called an estimative density. Using the Kullback–Leibler directed measure of divergence, Aitchison (2) showed that $q_X(x \mid z)$ is optimal in the sense that it is closer to $f_X$ than any other competing density, in particular $f_X(x, \hat{\theta}(z))$. Although this result may depend on the divergence measure used, it shows that $f_X(x, \hat{\theta}(z))$ may not always be the appropriate estimator for $f_X(x, \theta)$. A chi-squared type of goodness-of-fit test for the predictive density has been developed by Guteman (14).

REFERENCES

1. References are classified as follows: (A), applied; (E), expository; (R), review; (T), theoretical.
2. Aitchison, J. (1975). Biometrika, 62, 547–554. (T)
3. Ali, S. M. and Silvey, S. D. (1966). J. R. Statist. Soc. B, 28, 131–142. (T)
4. Bofinger, E. (1973). J. R. Statist. Soc. B, 35, 277–284. (T)
5. Conover, W. J. (1972). J. Amer. Statist. Ass., 67, 591–596. (T)
6. Cramér, H. (1945). Mathematical Methods of Statistics. Princeton University Press, Princeton, N.J. (E)
7. Dufour, R. and Maag, U. R. (1978). Technometrics, 20, 29–32. (A)
8. Durbin, J. (1973). Distribution Theory for Tests Based on the Sample Distribution Function. Reg. Conf. Ser. Appl. Math. SIAM, Philadelphia. (T, R)
9. Durbin, J. (1973). Ann. Statist., 1, 279–290. (T)
10. Durbin, J. (1975). Biometrika, 62, 5–22. (T)
11. Durbin, J. (1977). Goodness-of-fit tests based on the order statistics. Trans. 7th Prague Conf. Inf. Theory, Statist. Decision Functions, Random Processes / 1974 Eur. Meet. Statist., Prague, 1974, Vol. A, 109–118. (R)
12. Durbin, J., Knott, M., and Taylor, C. C. (1975). J. R. Statist. Soc. B, 37, 216–237. (T) 13. Feller, W. (1948). Ann. Math. Statist., 19, 177. (T) 14. Guteman, I. (1967). J. R. Statist. Soc. B, 29, 83–100. (T) 15. Hegazy, Y. A. S. and Green, J. R. (1975). Appl. Statist., 24, 299–308. (A) 16. Horn, S. D. (1977). Biometrics, 33, 237–248. (A, R) 17. Kale, B. K. (1969). Sankhya¯ A, 31, 43–48. (T) 18. Kendall, M. G. and Stuart, A. (1973). The Advanced Theory of Statistics, Vol. 2: Inference and Relationship, 3rd ed. Hafner Press, New York. (E) 19. Kirmani, S. N. U. A. and Alam, S. N. (1974). Sankhya¯ A, 36, 197–203. (T) 20. Kumar, A. and Pathak, P. K. (1977). Scand. Statist. J., 4, 39–43. (T) 21. Lindgren, B. W. (1976). Statistical Theory, 3rd ed. Macmillan, New York. (E) 22. Moore, D. S. (1977). J. Amer. Statist. Ass., 72, 131–137. (T) 23. Moore, D. S. (1979). In Studies in Statistics, R. V. Hogg, ed. Mathematical Association of America, Washington, D.C., pp. 66–106. (T, E) 24. Moore, D. S. and Spruill, M. C. (1975). Ann. Statist., 3, 599–616. (T) 25. Pearson, E. S. and Hartley, H. O. (1972). Biometrika Tables for Statisticians, Vol. 2. Cambridge University Press, Cambridge. (E) 26. Pearson, E. S., D’Agostino, R. B., and Bowman, K. O. (1977). Biometrika, 64, 231–246. (A) 27. Pettitt, A. N. and Stephens, M. A. (1976). Biometrika, 63, 291–298. (T) 28. Pyke, R. (1965). J. R. Statist. Soc. B, 27, 395–436. (T) 29. Radlow, R. and Alf, E. F. (1975). J. Amer. Statist. Ass., 70, 811–813. (A) 30. Rao, K. C. and Robson, D. S. (1974). Commun. Statist., 3, 1139–1153. (T) 31. Riedwyl, H. (1967). J. Amer. Statist. Ass., 62, 390–398. (A) 32. Shapiro, S. S. and Francia, R. S. (1972). J. Amer. Statist. Ass., 67, 215–216. (A) 33. Steck, G. P. (1969). Ann. Math. Statist., 40, 1449–1466. (T) 34. Stephens, M. A. (1966). Biometrika, 53, 235–240. (T) 35. Stephens, M. A. (1974). J. Amer. Statist. Soc., 69, 730–743. (A, E)
36. Stephens, M. A. (1976). Ann. Statist., 4, 357–369. (T) 37. Tallis, G. M. and Chesson, P. (1976). Austr. J. Statist., 18, 53–61. (T) 38. Watson, G. S. (1961). Biometrika, 48, 109–114. (T) 39. Wood, C. L. and Altavela, M. M. (1978). Biometrika, 65, 235–239. (T)
GROUP-RANDOMIZED TRIALS
DAVID M. MURRAY
The Ohio State University
Division of Epidemiology
School of Public Health
Columbus, Ohio

1 INTRODUCTION

Group-randomized trials (GRTs) are comparative studies used to evaluate interventions that operate at a group level, manipulate the physical or social environment, or cannot be delivered to individuals (1). Examples include school-, worksite-, and community-based studies designed to improve the health of students, employees, or residents. Four characteristics distinguish the GRT from the more familiar randomized clinical trial (RCT) (1). First, the unit of assignment is an identifiable group; such groups are not formed at random, but rather through some physical, social, geographic, or other connection among their members. Second, different groups are assigned to each condition, creating a nested or hierarchical structure for the design and the data. Third, the units of observation are members of those groups so that they are nested within both their condition and their group. Fourth, usually only a limited number of groups is assigned to each condition.

Together, these characteristics create several problems for the design and analysis of GRTs. The major design problem is that a limited number of often heterogeneous groups makes it difficult for randomization to distribute potential sources of confounding evenly in any single realization of the experiment, which increases the need to employ design strategies that will limit confounding and analytic strategies to deal with confounding where it is detected. The major analytic problem is that an expectation exists for positive intraclass correlation (ICC) among observations on members of the same group (2). That ICC reflects an extra component of variance attributable to the group above and beyond the variance attributable to its members. This extra variation will increase the variance of any group-level statistic beyond what would be expected with random assignment of members to conditions. Moreover, with a limited number of groups, the degrees of freedom (df) available to estimate group-level statistics are limited. Any test that ignores either the extra variation or the limited df will have a Type I error rate that is inflated (3). This problem will only worsen as the ICC increases (4–6).

As a result of these problems, RCTs are preferred over GRTs whenever randomization of individual participants is possible. However, individual randomization is not always possible, especially for many public health interventions that operate at a group level, manipulate the physical or social environment, or cannot be delivered to individuals. Just as the RCT is the gold standard in public health and medicine when allocation of individual participants is possible, the GRT is the gold standard when allocation of identifiable groups is necessary.

The purpose of this article is to put GRTs in context in terms of other kinds of designs and in terms of the terminology used in other fields, to summarize their development in public health, to characterize the range of public health research areas that now employ GRTs, to characterize the state of practice with regard to the design and analysis of GRTs, to consider their future in public health research, and to review the steps required to plan a new GRT.
2 GROUP-RANDOMIZED TRIALS IN CONTEXT GRTs represent a subset of a larger class of designs often labeled nested, hierarchical, multilevel, or clustered designs. Units of observation are nested within identifiable groups or clusters, which are in turn nested within study conditions. This description defines a hierarchy of at least three levels in the design: units of observation, units of assignment, and study conditions. More complex designs may have even more levels. For example, in cohort or repeated measures designs, repeat observations are further nested within units of observation.
As used here, the label group-randomized trial refers to a design in which identifiable groups are assigned to study conditions for the express purpose of assessing the impact of one or more interventions on one or more endpoints. The terms nested, hierarchical, multilevel, and clustered designs can be used more broadly to refer to any dataset that has a hierarchical structure, and these more general terms are often used to characterize observational studies as well as comparative studies. Many examples of observational and comparative hierarchical designs can be found in education, where students are nested within classrooms, which are nested within schools, which are nested within school districts, which are nested within communities, and so on. Investigators in education often refer to such designs as hierarchical or multilevel designs (7–10). Other examples can be found in survey sampling, and in disciplines that employ surveys, such as epidemiology, sociology, and demography. In these disciplines, cluster sampling is a commonly used technique (2). Cluster-sampling designs can be a good way to limit cost when the investigator lacks a complete enumeration of the population of interest and does not want to expend the resources required to generate such an enumeration. As simple random sampling is impossible without a complete enumeration, clusters such as blocks or neighborhoods or other identifiable groups are enumerated and sampled in a first stage, followed by enumeration and sampling of individuals within the selected clusters in a second stage. Applied properly, cluster-sampling methods can yield unbiased estimates of population rates or means at a lower cost than would have been the case with simple random sampling. Unfortunately, cluster sampling invariably leads to increased variation and often to limited degrees of freedom. These problems are well known to survey-sampling statisticians (2, 11–13). Biostatisticians often use the term cluster-randomization study to refer to a group-randomized trial (14, 15). This terminology is based on the fact that an identifiable group is a cluster. It borrows from the terminology of survey sampling. With the
broad definition given to ‘‘group’’ in this text, the phrases cluster-randomization study and group-randomized trial are equivalent. Epidemiologists have often used the terms community trial and community-intervention trial (16–18). These terms emerged from the community-based heart disease prevention studies of the late 1970s and the 1980s (19–23). None of those studies were randomized trials, but all involved whole communities as the unit of assignment with collection of data from individuals within those communities. Community trial is an attractive label, because it includes both randomized designs and nonrandomized designs. However, it is often thought to refer only to studies that involve whole communities (e.g., Reference 24), and so creates confusion when applied to studies involving other identifiable groups. 3 THE DEVELOPMENT OF GROUP-RANDOMIZED TRIALS IN PUBLIC HEALTH GRTs gained attention in public health in the late 1970s with the publication of a symposium on coronary heart disease prevention trials in the American Journal of Epidemiology (3, 21, 25–27). Cornfield’s paper in particular has become quite famous among methodologists working in this area, as it identified the two issues that have vexed investigators who employ GRTs from the outset: extra variation and limited degrees of freedom. The last 25 years have witnessed dramatic growth in the number of GRTs in public health and dramatic improvements in the quality of the design and analysis of those trials. Responding directly to Cornfield’s warning, Donner and colleagues at the University of Western Ontario published a steady stream of papers on the issues of analysis facing group-randomized trials beginning in the early 1980s. Murray and colleagues from the University of Minnesota began their examination of the issues of design and analysis in group-randomized trials in the mid-1980s. Other investigators from the National Institutes of Health, the University of Washington, the New England Research Institute, and elsewhere added to
this growing literature in public health, especially in the 1990s. By the late 1980s and early 1990s, many group-randomized trials were under way that were of very high quality in terms of their methods of design and analysis. Examples include the Community Intervention Trial for Smoking Cessation (COMMIT) (28), the Working Well Trial (29), and the Child and Adolescent Trial for Cardiovascular Health (CATCH) (30). These improvements occurred as investigators and reviewers alike gradually came to understand the special issues of design and analysis that face grouprandomized trials and the methods required to address them. Unfortunately, the improvements were not represented in all group-randomized trials. Even in the 1990s, grants were funded and papers were published based on poor designs and poor analyses. Simpson et al. (31) reviewed GRTs that were published between 1990 and 1993 in the American Journal of Public Health (AJPH) and in Preventive Medicine (Prev Med). They reported that fewer than 20% dealt with the design and analytic issues adequately in their sample size or power analysis and that only 57% dealt with them adequately in their analysis. In 1998, the first textbook on the design and analysis of GRTs appeared (1). It detailed the design considerations for the development of GRTs, described the major approaches to their analysis both for Gaussian data and for binary data, and presented methods for power analysis applicable to most GRTs. The second textbook on the design and analysis of GRTs appeared in 2000 (15). It provided a good history on GRTs, examined the role of informed consent and other ethical issues, focused on extensions of classical methods, and included material on regression models for Gaussian, binary, count, and time-to-event data. Other textbooks on analysis methods germane to GRTs appeared during the same period (10, 32, 33). Murray et al. recently reviewed a large number of articles on new methods relevant to the design and analysis of GRTs published between 1998 and 2003 (34). Of particular importance for investigators planning GRTs has been the increased availability of estimates of ICC for a variety of
endpoints in a variety of groups. Donner and Klar reported a number of ICCs in their text (15). Murray and Blitstein (35) identified more than 20 papers published before 2002 that reported ICCs, whereas Murray et al. identified a similar number of papers published between 1998 and 2003 that reported ICCs (34). Murray and Blitstein also reported on a pooled analysis of ICCs from worksite, school, and community studies. They confirmed earlier reports that the adverse impact of positive ICC can be reduced either by regression adjustment for covariates or by taking advantage of over-time correlation in a repeated-measures analysis. More recently, Janega et al. provided additional documentation that standard errors for intervention effects from end-of-study analyses that reflect these strategies are often different from corresponding standard errors estimated from baseline analyses that do not (36, 37). As the ICC of concern in any analysis of an intervention effect is the ICC as it operates in the primary analysis (1), these findings reinforce the need for investigators to use estimates in their power analysis that closely reflect the endpoints, the target population, and the primary analysis planned for the trial.
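To make the variance-inflation point concrete, the sketch below computes the design effect $1 + (m - 1)\,\mathrm{ICC}$ and a rough number of groups per condition. The planning values are hypothetical, and the two-sided normal approximation deliberately ignores the limited-df correction and analysis-model detail that a real GRT power analysis would include.

```python
import math

def design_effect(m, icc):
    """Variance inflation factor 1 + (m - 1) * ICC for groups of m members each."""
    return 1.0 + (m - 1.0) * icc

def groups_per_condition(delta, sigma, m, icc, z_alpha=1.959964, z_power=0.841621):
    """Rough number of groups per condition to detect a difference in means `delta`
    with member-level standard deviation `sigma`, two-sided alpha = 0.05, power = 0.80.
    A normal-approximation sketch only: real planning would also adjust for the
    limited degrees of freedom and mirror the planned primary analysis."""
    var_condition_mean = (sigma ** 2 / m) * design_effect(m, icc)
    g = 2.0 * (z_alpha + z_power) ** 2 * var_condition_mean / delta ** 2
    return math.ceil(g)

# Hypothetical planning values: 100 members per group and ICC = 0.02.
print("design effect:", design_effect(100, 0.02))                                      # 2.98
print("groups per condition:", groups_per_condition(delta=0.25, sigma=1.0, m=100, icc=0.02))
```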
4 THE RANGE OF GRTS IN PUBLIC HEALTH
Varnell et al. examined GRTs published between 1998 and 2002 in the AJPH and in Prev Med (38). They found 58 GRTs published in just those two journals during that 5-year period. That rate of 11.6 articles per year was double the publication rate observed for GRTs in the same journals between 1990 and 1993 in the earlier review by Simpson et al. (31). Those trials were conducted for a variety of primary outcomes. The largest fraction (25.9%) targeted smoking prevention or cessation, with trials focused on dietary representing the next largest fraction (20.7%). Studies designed to evaluate screening programs, particularly for cancer, represented the next largest fraction (12.1%). Other studies focused on alcohol or drug use, or on a combination of tobacco, alcohol, and drug use (8.6%); on sun protection (5.2%); on physical or sexual abuse (3.4%); on physician preventive practices (3.4%); on work place health
and safety measures (3.4%); or on multiple health outcomes (8.6%). Those trials were also conducted in a variety of settings. The largest fraction (29.3%) were conducted in schools or colleges. Worksites accounted for 19.0% of the trials. Medical practices accounted for 15.5%, whereas communities accounted for 12.1%. Trials were also conducted in housing projects (5.2%) or churches (5.2%), and 13.8% were conducted in other settings. The size of the group used in these GRTs varied considerably. Almost 30% were conducted in groups having more than 100 members, 24.1% in groups having 51 to 100 members, 32.8% in groups having 10 to 50 members, and 13.8% in groups having fewer than 10 members. Most of these trials employed a pretest-post-test design (55.2%), whereas 13.8% relied on a post-test-only design; 29.3% employed designs having three time points; and 12% involved more than three time points. Most of the trials employed cohort designs (63.8%), whereas 20.7% were crosssectional and 15.5% used a combination of cohort and cross-sectional designs. As this pattern suggests, GRTs are applicable to a wide variety of problems within public health and may be used in a wide variety of settings. It remains the best design available whenever the intervention operates at a group level, manipulates the physical or social environment, or cannot be delivered to individuals. 5 CURRENT DESIGN AND ANALYTIC PRACTICES IN GRTS IN PUBLIC HEALTH As noted above, two textbooks now exist that provide guidance on the design and analysis of GRTs (1, 15), as well as more recent summary papers focused on these issues (39–43). These sources identify several analytic approaches that can provide a valid analysis for GRTs. In most, the intervention effect is defined as a function of a condition-level statistic (e.g., difference in means, rates, or slopes) and assessed against the variation in the corresponding grouplevel statistic. These approaches included mixed-model ANOVA/ANCOVA for designs
having only one or two time intervals, random coefficient models for designs having three or more time intervals, and randomization tests as an alternative to the modelbased methods. Other approaches are generally regarded as invalid for GRTs because they ignore or misrepresent a source of random variation, which include analyses that assess condition variation against individual variation and ignore the group, analyses that assess condition variation against individual variation and include the group as a fixed effect, analyses that assess the condition variation against subgroup variation, and analyses that assess condition variation against the wrong type of group variation. Still other strategies may have limited application for GRTs. Application of fixed-effect models with post hoc correction for extra variation and limited df assumes that the correction is based on an appropriate ICC estimate. Application of survey-based methods or generalized estimating equations (GEE) and the sandwich method for standard errors requires a total of 40 or more groups in the study, or a correction for the downward bias in the sandwich estimator for standard errors when fewer than 40 groups exist in the study (34). Varnell et al. recently reported on the state of current practice in GRTs with regard to design and analytic issues for the 58 GRTs published between 1998 and 2002 in the AJPH and Prev Med (38). They reported that only 15.5% provided evidence that they dealt with the design and analytic issues adequately in their sample size or power analysis, either in the published paper or in an earlier background paper. To qualify as providing evidence, the paper or background paper had to report the ICC estimate expected to apply to the primary analysis in the study, the variance components used to calculate that ICC, or a variance inflation factor calculated from that ICC. The vast majority of GRTs published in these two journals did not provide such evidence. More surprising, the proportion reporting such evidence was actually somewhat lower than reported by Simpson et al. when they reviewed the same journals for the period between 1990 and 1993 (31). Varnell et al. noted that it was possible that many of
these studies performed such power calculations but did not report them. However, 27 (46%) of the reviewed studies had fewer than 10 groups per condition, and of that number, only one reported evidence of estimating sample size using methods appropriate for GRTs. Varnell et al. concluded that it was likely that many investigators reporting small GRTs planned their study without considering issues of sample size adequately. Varnell et al. reported that only 54.4% of the GRTs published between 1998 and 2002 in the AJPH and Prev Med dealt with design and analytic issues adequately in their analysis (38). Most of the studies that reported using only appropriate methods used mixed-model regression methods (61.2%), whereas others used two-stage methods (35.4%) or GEE with more than 40 groups (6.4%). Of the remaining studies, 26.3% reported a combination of appropriate and inappropriate methods, whereas 19.3% reported using only inappropriate methods. The most widely used inappropriate methods were analysis at an individual level, ignoring group-level ICC; analysis at a subgroup level, ignoring group-level ICC; and use of GEE or another asymptotically robust method with fewer than 40 groups and no correction for the downward bias identified in those methods under those conditions. Varnell et al. reported appreciable differences between the AJPH and Prev Med with regard to these patterns (38). For example, among the 27 GRTs published in AJPH, 66.7% reported only analyses taking ICC into account properly, compared with 43.3% in Prev Med. At the same time, only 14.5% of the GRTs published in AJPH reported only inappropriate analyses, compared with 23.3% in Prev Med. The range of practices across other journals is likely to be even wider. Particularly surprising in this review was that 33.3% of the studies reviewed reported analyses considered inappropriate well before the studies were published, which suggests that some investigators, reviewers, and journal editors have not yet taken to heart the long-standing warnings against analysis at an individual or subgroup level that ignores the group-level ICC, or an analysis that includes group as a fixed effect.
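A minimal simulation sketch can make this contrast concrete. It is not taken from the review summarized here; it assumes Python with numpy, pandas, and statsmodels, simulates a hypothetical two-condition GRT with a modest group-level ICC, and compares an individual-level OLS analysis that ignores the group with a mixed model that treats condition as a fixed effect and group as a random effect, which is one of the valid approaches described above. All names and parameter values are illustrative.

```python
# Illustrative sketch (not from the reviewed studies): simulate a GRT with a
# group-level ICC and compare an individual-level OLS analysis (inappropriate,
# ignores the group) with a mixed model using group as a random effect.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2024)

def simulate_grt(groups_per_cond=10, members_per_group=50, icc=0.02,
                 effect=0.0, total_var=1.0):
    """Simulate one two-condition GRT with a random group effect."""
    sigma2_g = icc * total_var            # between-group variance component
    sigma2_e = (1 - icc) * total_var      # within-group (member-level) variance
    rows = []
    for cond in (0, 1):
        for g in range(groups_per_cond):
            group_id = f"c{cond}_g{g}"
            group_effect = rng.normal(0, np.sqrt(sigma2_g))
            y = (effect * cond + group_effect
                 + rng.normal(0, np.sqrt(sigma2_e), members_per_group))
            rows.append(pd.DataFrame({"y": y, "condition": cond, "group": group_id}))
    return pd.concat(rows, ignore_index=True)

df = simulate_grt()

# Inappropriate: individual-level analysis that ignores the group.
ols_fit = smf.ols("y ~ condition", data=df).fit()

# Appropriate: mixed model with condition as fixed effect and group as random effect.
mixed_fit = smf.mixedlm("y ~ condition", data=df, groups=df["group"]).fit()

print("OLS   SE(condition):", round(ols_fit.bse["condition"], 4))
print("Mixed SE(condition):", round(mixed_fit.bse["condition"], 4))
# With a nonzero ICC, the OLS standard error is too small, which inflates the
# Type I error rate; the mixed-model standard error reflects the group-level variation.
```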
Indeed, investigators continued to put forward false arguments to justify the use of inappropriate methods, including claims that appropriate methods could obscure important clinical changes, or that the observed ICCs were small and therefore ignorable. Such arguments ignore the methodological realities of the GRT and represent exactly the kind of self-deception that Cornfield warned against more than 25 years ago (3). 6 THE FUTURE OF GROUP-RANDOMIZED TRIALS Publication of disappointing results for several large GRTs in the mid-1990s, such as the Stanford Five City Project (44), the Minnesota Heart Health Program (45), the Pawtucket Heart Health Program (20), and COMMIT (46, 47), led some to question the value of GRTs in general and of community trials in particular. As noted at the time, it is both shortsighted and impractical to question GRTs in general based on disappointing results even from a number of large and expensive trials (48). Whenever the investigator wants to evaluate an intervention that operates at a group level, manipulates the social or physical environment, or cannot be delivered to individuals, the GRT is the best comparative design available. However, many challenges exist to the design and analysis of GRTs, and the investigator must take care to understand those challenges and the strategies that are available to meet them: The question is not whether to conduct GRTs, but rather how to do them well (1). Certainly, no decline in the number of GRTs has occurred since the mid-1990s. As reported by Varnell et al., the annual rate of publication of GRTs in AJPH and Prev Med during 1998 to 2002 was double that for the period 1990 to 1993. The number of GRTs proposed to NIH has also increased considerably over the past 10 to 15 years, and a study section now exists at NIH that counts GRT applications as one of its most common design types (Community Level Health Promotion, formerly known as SNEM-1). Even so, many challenges facing GRTs remain. For example, no question exists that
it is harder to change the health behavior and risk profile of a whole community than it is to make similar changes in smaller identifiable groups such as those at worksites, physician practices, schools, and churches. And although no quantitative analysis has been published, it seems that the magnitude of the intervention effects reported for GRTs has been greater for trials that involved smaller groups than for trials involving such large aggregates as whole communities. With smaller groups, it is possible to include more groups in the design, thereby improving the validity of the design and the power of the trial. With smaller groups, it is easier to focus intervention activities on the target population. With smaller groups, the cost and difficulty of the implementation of the study generally are reduced. For these and similar reasons, future group-randomized trials may do well to focus on more and smaller identifiable groups rather than on whole cities or larger aggregates. Indeed, that pattern is evidenced in a number of recent trials (49–54). At the same time, positive effects have been reported for some studies involving very large aggregates. An example is the ASSIST project from the National Cancer Institute (55). Although this trial was not randomized, 17 states participated in the ASSIST intervention to reduce smoking, and the remaining 33 states plus the District of Columbia served as control sites. As such, the units of assignment in the ASSIST trial were even larger aggregates than are used in most GRTs. The encouraging results from that trial confirm that it is possible to successfully deliver an intervention even in very large aggregates. Another challenge is simply the difficulty in developing interventions strong enough to change the health behaviors of the target populations. This point is not new, but it is one that has been well known to investigators working in GRTs for some time (56–58). The methods for the design and analysis of GRTs have evolved considerably from the 1970s and 1980s; however, interventions continue to be employed that often prove ineffective. One of the problems for some time has been that interventions are proposed that lack even preliminary evidence of efficacy (42).
Efficacy trials in health promotion and disease prevention often are begun without the benefit of prototype studies, and often even without the benefit of adequate pilot studies, which has happened in large part because the funding agencies have been reluctant to support pilot and prototype studies, preferring instead to fund efficacy and effectiveness trials. Unfortunately, the interventions that lead to GRTs tend to be more complicated than those in other areas or those that lead to clinical trials. As such, it is even more important to subject them to adequate testing in pilot and prototype studies. These earlier phases of research can uncover important weaknesses in the intervention content or implementation methods. Moving too quickly to efficacy trials risks wasting substantial time and resources on interventions that could have been substantially improved through the experience gained in those pilot and prototype studies. Hopefully, the funding agencies will recognize this point and begin to provide better support for pilot and prototype studies. The R21 mechanism at NIH is well-suited for that kind of pilot and feasibility work. Prototype studies will typically be based on only one or two groups per condition, and so are particularly problematic if the investigator wants to make causal inferences relating the intervention as delivered to the outcomes as observed. Studies based on only one group per condition cannot estimate variation due to the group independently of variation due to condition (59). Studies based on only a few groups per condition cannot estimate that component of variance with much accuracy. Some of the methods described earlier as having limited application to GRTs can be used, but all would require strong assumptions. For example, application of a post hoc correction would require the strong assumption that the external estimate of ICC is valid for the data at hand. Application of the subgroup or batch analysis would require the strong assumption that the subgroup captures the group variance. Does this mean that prototype studies should not be conducted? The answer is clearly no, and, more to the point, the answer must be that such studies simply should not be used to make causal inferences relating
the intervention and the outcome. Studies involving only one or two groups per condition are prototype studies, not efficacy trials, and must be analyzed with that in mind. With only one or two groups per condition, the investigator can estimate the magnitude of the intervention effect, but it will not be possible to estimate a standard error for that effect with any degree of accuracy. As a result, it will not be possible to put a narrow confidence bound around the intervention effect or to draw any conclusions about the statistical significance of that effect. Even so, if the effect is much smaller than expected, that finding should push the investigators to rework their intervention, as it is not likely that a reasonably sized efficacy trial would show such a small intervention effect to be significant. If the effect is as large as expected or larger, that finding should provide good support to the investigator in seeking support for the efficacy trial required to establish the causal link between the intervention and the outcome. Methodological challenges remain as well. For example, a number of recent studies have documented the downward bias in the sandwich estimator used in GEE when fewer than 40 groups exist in the study (4, 60, 61). Some of these studies, and others, have proposed corrections for that estimator (61–66). Unfortunately, none of these corrections appear in the standard software packages, so they are relatively unavailable to investigators who analyze GRTs. Absent an effective correction, the sandwich estimator will have an inflated Type I error rate in GRTs having fewer than 40 groups, and investigators who employ this approach continue to risk overstating the significance of their findings. As another example, a number of recent studies have proposed methods for survival analysis that could be applied to data from GRTs (15, 67–71). Some of these methods involved use of the sandwich estimator, and so would be subject to the same concern as noted above for GEE. None of the methods appear in the standard software packages, so they also are relatively unavailable to investigators who analyze GRTs. As a third example, permutation tests have been advocated over model-based methods because they require fewer assumptions.
At the same time, they tend to have lower power. To overcome this problem, Braun and Feng developed an optimal randomization test that had nominal size and better power than alternative randomization tests or GEE, although it was still not as powerful as the model-based analysis when the model was specified correctly (72). Additional research is needed to compare Braun and Feng's optimal randomization test and model-based methods under model misspecification. Every reason exists to expect that continuing methodological improvements will lead to better trials. Evidence also exists that better trials tend to have more satisfactory results. For example, Rooney and Murray presented the results of a meta-analysis of group-randomized trials in the smoking-prevention field (73). One of the findings was that stronger intervention effects were associated with greater methodological rigor. Stronger intervention effects were reported for studies that planned from the beginning to employ the unit of assignment as the unit of analysis, that randomized a sufficient number of assignment units to each condition, that adjusted for baseline differences in important confounding variables, that had extended follow-up, and that had limited attrition. One hopes that such findings will encourage the use of good design and analytic methods. A well-designed and properly executed GRT remains the method of choice in public health and medicine when the purpose of the study is to establish the efficacy or effectiveness of an intervention that operates at a group level, manipulates the social or physical environment, or cannot be delivered to individuals. However, efficacy and effectiveness trials should occur in proper sequence. They should occur only after pilot studies have established the feasibility and acceptability of the materials and protocols for the intervention and evaluation. They should occur only after prototype studies have shown that the magnitude of the intervention effect is large enough to warrant the larger trials. Efficacy and effectiveness trials should be large enough to ensure sufficient power, with groups assigned at random from within well-matched or stratified sets to protect against bias. Investigators should measure exposure
and other process variables in all conditions. They should select a model for the analysis that reflects the design of the study and the nature of the endpoints. Importantly, investigators should be cautious of strategies that appear to easily solve or avoid the design and analytic problems that are inherent in group-randomized trials, for those methods are likely to prove to be inappropriate. 7 PLANNING A NEW GROUP-RANDOMIZED TRIAL The driving force behind any GRT must be the research question. The question will be based on the problem of interest and will identify the target population, the setting, the endpoints, and the intervention. In turn, those factors will shape the design and analytic plan. Given the importance of the research question, the investigators must take care to articulate it clearly. Unfortunately, that does not always happen. Investigators may have ideas about the theoretical or conceptual basis for the intervention, and often even clearer ideas about the conceptual basis for the endpoints. They may even have ideas about intermediate processes. However, without very clear thinking about each of these issues, the investigators may find themselves at the end of the trial unable to answer the question of interest. To put themselves in a position to articulate their research question clearly, the investigators should first document thoroughly the nature and extent of the underlying problem and the strategies and results of previous efforts to remedy that problem. A literature review and correspondence with others working in the field are ingredients essential to that process, as the investigators should know as much as possible about the problem before they plan their trial. Having become experts in the field, the investigators should choose the single question that will drive their GRT. The primary criteria for choosing that question should be: (1) Is it important enough to do?, and (2) Is this the right time to do it? Reviewers will ask both questions, and the investigators must be able to provide well-documented answers.
Most GRTs seek to prevent a health problem, so that the importance of the question is linked to the seriousness of that problem. The investigators should document the extent of the problem and the potential benefit from a reduction in that problem. The question of timing is also important. The investigators should document that the question has not been answered and that the intervention has a good chance to improve the primary endpoint in the target population, which is most easily done when the investigators are thoroughly familiar with previous research in the area; when the etiology of the problem is well known; when a theoretical basis exists for the proposed intervention; when preliminary evidence exists on the feasibility and efficacy of the intervention; when the measures for the dependent and mediating variables are well-developed; when the sources of variation and correlation as well as the trends in the endpoints are well understood; and when the investigators have created the research team to carry out the study. If that is not the state of affairs, then the investigators must either invest the time and energy to reach that state or choose another question. Once the question is selected, it is very important to put it down on paper. The research question is easily lost in the day-to-day details of the planning and execution of the study, and because much time can be wasted in pursuit of issues that are not really central to the research question, the investigators should take care to keep that question in mind. 7.1 The Research Team Having defined the question, the investigators should determine whether they have sufficient expertise to deal with all the challenges that are likely to develop as they plan and execute the trial. They should identify the skills that they do not have and expand the research team to ensure that those skills are available. All GRTs will need expertise in research design, data collection, data processing and analysis, intervention development, intervention implementation, and project administration. As the team usually will need to convince a funding agency that they are appropriate
for the trial, it is important to include experienced and senior investigators in key roles. No substitute exists for experience with similar interventions, in similar populations and settings, using similar measures, and similar methods of data collection and analysis. As those skills are rarely found in a single investigator, most trials will require a team, with responsibilities shared among its members. Most teams will remember the familiar academic issues (e.g., statistics, data management, intervention theory), but some may forget the very important practical side of trials involving identifiable groups. However, to forget the practical side is a sure way to get into trouble. For example, a school-based trial that does not include on its team someone who is very familiar with school operations is almost certain to get into trouble with the schools. A hospital-based trial that does not include on its team someone who is very familiar with hospital operations is almost certain to get into trouble with the hospitals. And the same can be said for every other type of identifiable group, population, or setting that might be used. 7.2 The Research Design The fundamentals of research design apply to GRTs as well as to other comparative designs. As they are discussed in many familiar textbooks (11, 74–77), they will be reviewed only briefly here. Additional information may be found in two recent textbooks on the design and analysis of GRTs (1, 15). The goal in the design of any comparative trial is to provide the basis for valid inference that the intervention as implemented caused the result(s) as observed. Overall, three elements are required: (1) control observations must exist, (2) bias in the estimate of the intervention effect must be minimal, and (3) that estimate must have sufficient precision. The nature of the control observations and the way in which the groups are allocated to treatment conditions will determine in large measure the level of bias in the estimate of the intervention effect. Bias exists whenever the estimate of the intervention effect is different from its true value. If that bias is substantial, the investigators will be misled
about the effect of their intervention, as will the other scientists and policy makers who use their work. Even if adequate control observations are available so that the estimate of the intervention effect is unbiased, the investigator should know whether the effect is greater than would be expected by chance, given the level of variation in the data. Statistical tests can provide such evidence, but their power to do so will depend heavily on the precision of the intervention effect estimate. As the precision improves, it will be easier to distinguish true effects from the underlying variation in the data.
7.3 Potential Design Problems and Methods to Avoid them For GRTs, the four sources of bias that are particularly problematic and should be considered during the planning phase are selection, differential history, differential maturation, and contamination. Selection bias refers to baseline differences among the study conditions that might explain the results of the trial. Bias because of differential history refers to some external influence that operates differentially among the conditions. Bias because of differential maturation reflects uneven secular trends among the groups in the trial favoring one condition or another. These first three sources of bias can either mask or mimic an intervention effect, and all three are more likely given either nonrandom assignment of groups or random assignment of a limited number of groups to each condition. The first three sources of bias are best avoided by randomization of a sufficient number of groups to each study condition, which will increase the likelihood that potential sources of bias are distributed evenly among the conditions. Careful matching or stratification can increase the effectiveness of randomization, especially when the number of groups is small. As a result, all GRTs planned with fewer than 20 groups per condition would be well served to include careful matching or stratification before randomization.
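A short sketch of such an allocation scheme follows. It is illustrative rather than a protocol from any cited trial; it assumes Python, and the group names and the baseline matching values are hypothetical. Groups are sorted on a baseline matching variable, adjacent groups are paired, and one member of each pair is randomly assigned to each condition.

```python
# Illustrative sketch (not from the original article): randomize identifiable
# groups to two conditions from within matched pairs, as recommended above
# when the number of groups is small. Group names and values are hypothetical.
import random

def matched_pair_randomization(groups, match_value, seed=42):
    """Sort groups on a baseline matching variable, form adjacent pairs, and
    randomly assign one member of each pair to intervention, the other to control.
    With an odd number of groups, the last (unpaired) group is left unassigned."""
    rng = random.Random(seed)
    ordered = sorted(groups, key=match_value)
    assignment = {}
    for i in range(0, len(ordered) - 1, 2):
        pair = [ordered[i], ordered[i + 1]]
        rng.shuffle(pair)
        assignment[pair[0]] = "intervention"
        assignment[pair[1]] = "control"
    return assignment

# Example: 12 hypothetical schools matched on baseline smoking prevalence.
schools = {f"school_{i:02d}": prev for i, prev in enumerate(
    [0.12, 0.18, 0.09, 0.22, 0.15, 0.11, 0.20, 0.17, 0.13, 0.10, 0.19, 0.16])}
alloc = matched_pair_randomization(schools, match_value=schools.get)
for school in sorted(schools, key=schools.get):
    print(school, round(schools[school], 2), alloc[school])
```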
The fourth source of bias is caused by contamination, which occurs when intervention-like activities find their way into the comparison groups; it can bias the estimate of the intervention effect toward the null hypothesis. Randomization will not protect against contamination; although investigators can control access to their intervention materials, they can often do little to prevent the outside world from introducing similar activities into their control groups. As a result, monitoring exposure to activities that could affect the trial's endpoints in both the intervention and comparison groups is especially important in GRTs, which will allow the investigators to detect and respond to contamination if it occurs. Objective measures and evaluation personnel who have no connection to the intervention are also important strategies to limit bias. Finally, analytic strategies, such as regression adjustment for confounders, can be very helpful in dealing with any observed bias. 7.4 Potential Analytic Problems and Methods to Avoid them The two major threats to the validity of the analysis of a GRT that should be considered during the planning phase are misspecification of the analytic model and low power. Misspecification of the analytic model will occur if the investigator ignores or misrepresents a measurable source of random variation, or misrepresents the pattern of any over-time correlation in the data. To avoid model misspecification, the investigator should plan the analysis concurrent with the design, plan the analysis around the primary endpoints, anticipate all sources of random variation, anticipate the error distribution for the primary endpoint, anticipate patterns of over-time correlation, consider alternate structures for the covariance matrix, consider alternate models for time, and assess potential confounding and effect modification. Low power will occur if the investigator employs a weak intervention, has insufficient replication, has high variance or ICC in the endpoints, or has poor reliability of intervention implementation. To avoid low power,
investigators should plan a large enough study to ensure sufficient replication, choose endpoints with low variance and ICC, employ matching or stratification before randomization, employ more and smaller groups instead of a few large groups, employ more and smaller surveys or continuous surveillance instead of a few large surveys, employ repeat observations on the same groups or on the same groups and members, employ strong interventions with good reach, and maintain the reliability of intervention implementation. In the analysis, investigators should employ regression adjustment for covariates, model time if possible, and consider post hoc stratification. 7.5 Variables of Interest and their Measures The research question will identify the primary and secondary endpoints of the trial. The question may also identify potential effect modifiers. It will then be up to the investigators to anticipate potential confounders and nuisance variables. All these variables must be measured if they are to be used in the analysis of the trial. In a clinical trial, the primary endpoint is a clinical event, chosen because it is easy to measure with limited error and is clinically relevant (75). In a GRT, the primary endpoint need not be a clinical event, but it should be easy to measure with limited error and be relevant to public health. In both RCTs and GRTs, the primary endpoint, together with its method of measurement, must be defined in writing before the start of the trial. The endpoint and its method of measurement cannot be changed after the start of the trial without risking the validity of the trial and the credibility of the research team. Secondary endpoints should have similar characteristics and also should be identified before the start of the trial. In a GRT, an effect modifier is a variable whose level influences the effect of the intervention. For example, if the effect of a school-based drug-use prevention program depends on the baseline risk level of the student, then baseline risk is an effect modifier. Effect modification can be seen intuitively by looking at separate intervention effect estimates for the levels of the effect modifier. If they differ to
a meaningful degree, then the investigator has evidence of possible effect modification. A more formal assessment is provided by a statistical test for effect modification, which is accomplished by including an interaction term between the effect modifier and condition in the analysis and testing the statistical significance of that term. If the interaction is significant, then the investigator should present the results separately for the levels of the effect modifier. If not, the interaction term is deleted and the investigator can continue with the analysis. Proper identification of potential effect modifiers comes through a careful review of the literature and from an examination of the theory of the intervention. Potential effect modifiers must be measured as part of the data-collection process so that their role can later be assessed. A confounder is related to the endpoint, not on the causal pathway, and unevenly distributed among the conditions; it serves to bias the estimate of the intervention effect. No statistical test for confounding exists; instead, it is assessed by comparing the unadjusted estimate of the intervention effect to the adjusted estimate of that effect. If, in the investigator’s opinion, a meaningful difference exists between the adjusted and unadjusted estimates, then the investigator has an obligation to report the adjusted value. It may also be appropriate to report the unadjusted value to allow the reader to assess the degree of confounding. The adjusted analysis will not be possible unless the potential confounders are measured. Proper identification of potential confounders also comes through a careful review of the literature and from an understanding of the endpoints and the study population. The investigators must take care in the selection of potential confounders to select only confounders and not mediating variables. A confounder is related to the endpoint and unevenly distributed in the conditions, but is not on the causal pathway between the intervention and the outcome. A mediating variable has all the characteristics of a confounder, but is on the causal pathway. Adjustment for a mediating variable, in the false belief that it is a confounder, will bias the estimate of the intervention effect toward the null hypothesis.
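The interaction-term test described above can be sketched in code. This is an illustrative example rather than an analysis from any cited trial; it assumes Python with numpy, pandas, and statsmodels, uses a simulated data set, and invents a moderator named baseline_risk for the purpose of the example.

```python
# Illustrative sketch (not from the original article): test for effect modification
# by adding a condition-by-moderator interaction to a mixed model with a random
# group effect. The data and the variable name `baseline_risk` are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n_groups, m = 20, 40
df = pd.DataFrame({
    "group": np.repeat(np.arange(n_groups), m),
    "condition": np.repeat(np.arange(n_groups) % 2, m),   # alternate group assignment
    "baseline_risk": rng.binomial(1, 0.3, n_groups * m),  # 1 = high baseline risk
})
group_effect = rng.normal(0, 0.15, n_groups)[df["group"]]
# Simulated outcome: the intervention works mainly for high-risk members (interaction).
df["y"] = (0.1 * df["condition"] + 0.4 * df["condition"] * df["baseline_risk"]
           + group_effect + rng.normal(0, 1, len(df)))

fit = smf.mixedlm("y ~ condition * baseline_risk", data=df,
                  groups=df["group"]).fit()
print(fit.summary())
# If the condition:baseline_risk term is significant, report intervention effects
# separately by baseline risk level; otherwise drop the interaction and refit.
```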
Similarly, the investigator must take care to avoid selecting as potential confounders variables that may be affected by the intervention. Such variables will be proxies for the intervention itself, and adjustment for them will also bias the estimate of the intervention effect toward the null hypothesis. An effective strategy to avoid these problems is to restrict confounders to variables measured at baseline. Such factors cannot be on the causal pathway, nor can their values be influenced by an intervention that has not been delivered. Investigators may also want to include variables measured after the intervention has begun, but will need to take care to avoid the problems described above. Nuisance variables are related to the endpoint, not on the causal pathway, but evenly distributed among the conditions. They cannot bias the estimate of the intervention effect, but they can be used to improve precision in the analysis. A common method is to make regression adjustment for these factors during the analysis so as to reduce the standard error of the estimate of the intervention effect, thereby improving the precision of the analysis. Such adjustment will not be possible unless the nuisance variables are measured. Proper identification of potential nuisance variables also comes from a careful review of the literature and from an understanding of the endpoint. The cautions described above for the selection of potential confounding variables apply equally well to the selection of potential nuisance variables. 7.6 The Intervention No matter how well designed and evaluated a GRT may be, strengths in design and analysis cannot overcome a weak intervention. Although the designs and analyses employed in GRTs were fair targets for criticism during the 1970s and 1980s, the designs and analyses employed more recently have improved, with many examples of very well-designed and carefully analyzed trials. Where intervention effects are modest or short-lived, even in the presence of good design and analytic strategies, investigators must take a hard look at the intervention and question whether it was strong enough.
One of the first suggestions for developing the research question was that the investigators become experts on the problem that they seek to remedy. If the primary endpoint is cigarette smoking among ninth graders, then the team should seek to learn as much as possible about the etiology of smoking among young adolescents. If the primary endpoint is obesity among Native American children, then the team should seek to learn as much as possible about the etiology of obesity among those young children. If the primary endpoint is delay time in reporting heart attack symptoms, then the team should seek to learn as much as possible about the factors that influence delay time. And the same can be said for any other endpoint. One of the goals of developing expertise in the etiology of the problem is to identify points in that etiology that are amenable to intervention. Critical developmental stages may exist, or critical events or influences that trigger the next step in the progression, or it may be possible to identify critical players in the form of parents, friends, coworkers, or others who can influence the development of that problem. Without careful study of the etiology, the team will largely be guessing and hoping that their intervention is designed properly. Unfortunately, guessing and hoping rarely lead to effective interventions. Powerful interventions are guided by good theory on the process for change, combined with a good understanding of the etiology of the problem of interest. Poor theory will produce poor interventions and poor results, which was one of the primary messages from the community-based heart disease prevention studies, where the intervention effects were modest, generally of limited duration, and often within chance levels. Fortmann et al. noted that one of the major lessons learned was how much was not known about how to intervene in whole communities (56). Theory that describes the process of change in individuals may not apply to the process of change in identifiable groups. If it does, it may not apply in exactly the same way. Good intervention for a GRT will likely need to combine theory about individual change with theory about group processes and group change.
A good theoretical exposition will also help identify channels for the intervention program. For example, strong evidence exists that recent immigrants often look to long-term immigrants of the same cultural group for information on health issues. This fact has led investigators to try to use those long-term immigrants as change agents for the more recent immigrants. A good theoretical exposition will often indicate that the phenomenon is the product of multiple influences and so suggest that the intervention operate at several different levels. For example, obesity among schoolchildren appears to be influenced most proximally by their physical activity levels and by their dietary intake. In turn, their dietary intake is influenced by what is served at home and at school, and their physical activity is influenced by the nature of their physical activity and recess programs at school and at home. The models provided by teachers and parents are important both for diet and physical activity. This multilevel etiology suggests that interventions be directed at the school foodservice, physical education, and recess programs; at parents; and possibly at the larger community. As noted earlier, GRTs would benefit by following the example of clinical trials, where some evidence of feasibility and efficacy of the intervention is usually required before launching the trial. When a study takes several years to complete and costs hundreds of thousands of dollars or more, that expectation seems very fair. Even shorter and less expensive GRTs would do well to follow that advice. 7.7 Power A detailed exposition on power for GRTs is beyond the scope of this article. Excellent treatments exist, and the interested reader is referred to those sources for additional information. Chapter 9 in the Murray text provides perhaps the most comprehensive treatment of detectable difference, sample size, and power for GRTs (1). Even so, a few points bear repeating here. First, the increase in between-group variance because of the ICC in the simplest analysis is calculated as 1 + (m − 1)ICC,
where m is the number of members per group; as such, ignoring even a small ICC can underestimate standard errors if m is large. Second, although the magnitude of the ICC is inversely related to the level of aggregation, it is independent of the number of group members who provide data. For both of these reasons, more power is available given more groups per condition with fewer members measured per group than given just a few groups per condition with many members measured per group, no matter the size of the ICC. Third, the two factors that largely determine power in any GRT are the ICC and the number of groups per condition. For these reasons, no substitute exists for a good estimate of the ICC for the primary endpoint, the target population, and the primary analysis planned for the trial, and it is unusual for a GRT to have adequate power with fewer than 8 to 10 groups per condition. Finally, the formula for the standard error for the intervention effect depends on the primary analysis planned for the trial, and investigators should take care to calculate that standard error, and power, based on that analysis.
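The following sketch illustrates these points numerically. It is not reproduced from the sources cited here; it assumes Python, applies the variance inflation factor 1 + (m − 1)ICC described above, and uses a simple normal-approximation formula for the number of groups per condition needed to detect a difference in means. It should be treated as a rough planning aid, not a replacement for the more exact methods in the cited texts, which are based on the t distribution and on the analysis actually planned for the trial.

```python
# Illustrative sketch (simplified approximation, not the article's method):
# approximate groups per condition for a GRT comparing two means.
from math import ceil
from statistics import NormalDist

def groups_per_condition(delta, sd, icc, m, alpha=0.05, power=0.80):
    """Approximate groups per condition to detect a mean difference `delta`
    with member-level standard deviation `sd`, ICC `icc`, and `m` members
    measured per group (two-sided test, posttest-only comparison of means)."""
    deff = 1 + (m - 1) * icc                      # design effect / variance inflation
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    g = 2 * (z ** 2) * (sd ** 2) * deff / (m * delta ** 2)
    return ceil(g)

# Example: detect a 0.25 SD difference with ICC = 0.02 at several group sizes.
for m in (20, 50, 100):
    g = groups_per_condition(delta=0.25, sd=1.0, icc=0.02, m=m)
    print(f"{m} members/group -> {g} groups/condition")
```

With these illustrative inputs, the approximation lands near the 8 to 10 groups per condition mentioned above when 50 or more members are measured per group, and it shows how quickly the requirement grows when fewer members are measured per group.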
8 ACKNOWLEDGEMENT
The material presented here draws heavily on work published previously by David M. Murray and his colleagues (1, 34, 38, 43). Readers are referred to those sources for additional information.
REFERENCES

1. D. M. Murray, Design and Analysis of Group-Randomized Trials. New York: Oxford University Press, 1998.
2. L. Kish, Survey Sampling. New York: John Wiley & Sons, 1965.
3. J. Cornfield, Randomization by group: a formal analysis. Amer. J. Epidemiol. 1978; 108(2): 100–102.
4. D. M. Murray, P. J. Hannan, and W. L. Baker, A Monte Carlo study of alternative responses to intraclass correlation in community trials: is it ever possible to avoid Cornfield's penalties? Eval. Rev. 1996; 20(3): 313–337.
5. D. M. Murray and R. D. Wolfinger, Analysis issues in the evaluation of community trials: progress toward solutions in SAS/STAT MIXED. J. Community Psychol. CSAP Special Issue 1994: 140–154.
6. D. M. Zucker, An analysis of variance pitfall: the fixed effects analysis in a nested design. Educ. Psycholog. Measur. 1990; 50: 731–738.
7. A. S. Bryk and S. W. Raudenbush, Hierarchical Linear Models: Applications and Data Analysis Methods. Newbury Park, CA: Sage Publications, 1992.
8. H. Goldstein, Multilevel Models in Educational and Social Research. New York: Oxford University Press, 1987.
9. I. Kreft and J. De Leeuw, Introducing Multilevel Modeling. London: Sage Publications, 1998.
10. S. W. Raudenbush and A. S. Bryk, Hierarchical Linear Models, 2nd ed. Thousand Oaks, CA: Sage Publications, 2002.
11. L. Kish, Statistical Design for Research. New York: John Wiley & Sons, 1987.
12. E. L. Korn and B. I. Graubard, Analysis of Health Surveys. New York: John Wiley & Sons, 1999.
13. C. J. Skinner, D. Holt, and T. M. F. Smith, Analysis of Complex Surveys. New York: John Wiley & Sons, 1989.
14. A. Donner, N. Birkett, and C. Buck, Randomization by cluster: sample size requirements and analysis. Amer. J. Epidemiol. 1981; 114(6): 906–914.
15. A. Donner and N. Klar, Design and Analysis of Cluster Randomization Trials in Health Research. London: Arnold, 2000.
16. COMMIT Research Group, Community Intervention Trial for Smoking Cessation (COMMIT): summary of design and intervention. J. Natl. Cancer Inst. 1991; 83(22): 1620–1628.
17. A. Donner, Symposium on community intervention trials. Amer. J. Epidemiol. 1995; 142(6): 567–568.
18. R. V. Luepker, Community trials. Prevent. Med. 1994; 23: 602–605.
19. H. Blackburn, Research and demonstration projects in community cardiovascular disease prevention. J. Public Health Policy 1983; 4(4): 398–420.
20. R. A. Carleton et al., The Pawtucket Heart Health Program: community changes in cardiovascular risk factors and projected disease risk. Amer. J. Public Health 1995; 85(6): 777–785.
21. J. W. Farquhar, The community-based model of life style intervention trials. Amer. J. Epidemiol. 1978; 108(2): 103–111.
22. J. W. Farquhar et al., The Stanford five-city project: design and methods. Amer. J. Epidemiol. 1985; 122(2): 323–324.
23. D. R. Jacobs et al., Community-wide prevention strategies: evaluation design of the Minnesota Heart Health Program. J. Chronic Dis. 1986; 39(10): 775–788.
24. D. E. Lilienfeld and P. D. Stolley, Foundations of Epidemiology, 3rd ed. New York: Oxford University Press, 1994.
25. S. B. Hulley, Symposium on CHD prevention trials: design issues in testing life style intervention. Amer. J. Epidemiol. 1978; 108(2): 85–86.
26. R. Sherwin, Controlled trials of the diet-heart hypothesis: some comments on the experimental unit. Amer. J. Epidemiol. 1978; 108(2): 92–99.
27. S. L. Syme, Life style intervention in clinic-based trials. Amer. J. Epidemiol. 1978; 108(2): 87–91.
28. M. H. Gail et al., Aspects of statistical design for the Community Intervention Trial for Smoking Cessation (COMMIT). Controlled Clin. Trials 1992; 13: 6–21.
29. D. B. Abrams et al., Cancer control at the workplace: the Working Well Trial. Prevent. Med. 1994; 23: 15–27.
30. D. M. Zucker et al., Statistical design of the Child and Adolescent Trial for Cardiovascular Health (CATCH): implication of cluster randomization. Controlled Clin. Trials 1995; 16: 96–118.
31. J. M. Simpson, N. Klar, and A. Donner, Accounting for cluster randomization: a review of primary prevention trials, 1990 through 1993. Amer. J. Public Health 1995; 85(10): 1378–1383.
32. H. Brown and R. Prescott, Applied Mixed Models in Medicine. Chichester, UK: John Wiley & Sons, 1999.
33. C. E. McCulloch and S. R. Searle, Generalized, Linear and Mixed Models. New York: John Wiley & Sons, 2001.
34. D. M. Murray, S. P. Varnell, and J. L. Blitstein, Design and analysis of group-randomized trials: a review of recent methodological developments. Amer. J. Public Health 2004; 94(3): 423–432.
35. D. M. Murray and J. L. Blitstein, Methods to reduce the impact of intraclass correlation in group-randomized trials. Eval. Rev. 2003; 27(1): 79–103.
36. J. B. Janega et al., Assessing intervention effects in a school-based nutrition intervention trial: which analytic model is most powerful? Health Educ. Behav. 2004; 31(6): 756–774.
37. J. B. Janega et al., Assessing the most powerful analysis method for school-based intervention studies with alcohol, tobacco and other drug outcomes. Addict. Behav. 2004; 29(3): 595–606.
38. S. Varnell et al., Design and analysis of group-randomized trials: a review of recent practices. Amer. J. Public Health 2004; 94(3): 393–399.
39. Z. Feng et al., Selected statistical issues in group randomized trials. Annu. Rev. Public Health 2001; 22: 167–187.
40. Z. Feng and B. Thompson, Some design issues in a community intervention trial. Controlled Clin. Trials 2002; 23: 431–449.
41. N. Klar and A. Donner, Current and future challenges in the design and analysis of cluster randomization trials. Stat. Med. 2001; 20: 3729–3740.
42. D. M. Murray, Efficacy and effectiveness trials in health promotion and disease prevention: design and analysis of group-randomized trials. In: N. Schneiderman et al. (eds.), Integrating Behavioral and Social Sciences with Public Health. Washington, DC: American Psychological Association, 2000, pp. 305–320.
43. D. M. Murray, Statistical models appropriate for designs often used in group-randomized trials. Stat. Med. 2001; 20: 1373–1385.
44. J. W. Farquhar et al., Effects of communitywide education on cardiovascular disease risk factors: the Stanford Five-City Project. JAMA 1990; 264(3): 359–365.
45. R. V. Luepker et al., Community education for cardiovascular disease prevention: risk factor changes in the Minnesota Heart Health Program. Amer. J. Public Health 1994; 84(9): 1383–1393.
46. COMMIT Research Group, Community Intervention Trial for Smoking Cessation (COMMIT): I. Cohort results from a four-year community intervention. Amer. J. Public Health 1995; 85(2): 183–192.
47. COMMIT Research Group, Community Intervention Trial for Smoking Cessation (COMMIT): II. Changes in adult cigarette smoking prevalence. Amer. J. Public Health 1995; 85(2): 193–200.
48. M. Susser, Editorial: the tribulations of trials—intervention in communities. Amer. J. Public Health 1995; 85(2): 156–158.
49. M. Ausems et al., Short-term effects of a randomized computer-based out-of-school smoking prevention trial aimed at elementary schoolchildren. Prevent. Med. 2002; 34: 581–589.
50. D. Lazovich et al., Effectiveness of a worksite intervention to reduce an occupational exposure: the Minnesota Wood Dust Study. Amer. J. Public Health 2002; 92(9): 1498–1505.
51. J. A. Mayer et al., Promoting skin cancer prevention counseling by pharmacists. Amer. J. Public Health 1998; 88(7): 1096–1099.
52. M. J. Rotheram-Borus, M. B. Lee, M. Gwadz, and B. Draimin, An intervention for parents with AIDS and their adolescent children. Amer. J. Public Health 2001; 91(8): 1294–1302.
53. J. Segura et al., A randomized controlled trial comparing three invitation strategies in a breast cancer screening program. Prevent. Med. 2001; 33: 325–332.
54. L. I. Solberg, T. E. Kottke, and M. L. Brekke, Will primary care clinics organize themselves to improve the delivery of preventive services? A randomized controlled trial. Prevent. Med. 1998; 27: 623–631.
55. F. A. Stillman et al., Evaluation of the American Stop Smoking Intervention Study (ASSIST): a report of outcomes. J. Natl. Cancer Inst. 2003; 95(22): 1681–1691.
56. S. P. Fortmann et al., Community intervention trials: reflections on the Stanford Five-City Project experience. Amer. J. Epidemiol. 1995; 142(6): 576–586.
57. T. D. Koepsell et al., Invited commentary: symposium on community intervention trials. Amer. J. Epidemiol. 1995; 142(6): 594–599.
58. D. M. Murray, Design and analysis of community trials: lessons from the Minnesota Heart Health Program. Amer. J. Epidemiol. 1995; 142(6): 569–575.
59. S. Varnell, D. M. Murray, and W. L. Baker, An evaluation of analysis options for the one group per condition design: can any of the alternatives overcome the problems inherent in this design? Eval. Rev. 2001; 25(4): 440–453.
60. Z. Feng, D. McLerran, and J. Grizzle, A comparison of statistical methods for clustered data analysis with Gaussian error. Stat. Med. 1996; 15: 1793–1806.
61. M. D. Thornquist and G. L. Anderson, Small sample properties of generalized estimating equations in group-randomized designs with Gaussian response. Presented at the 120th Annual APHA Meeting, Washington, DC, 1992.
62. R. M. Bell and D. F. McCaffrey, Bias reduction in standard errors for linear regression with multi-stage samples. Survey Methodol. 2002; 28(2): 169–181.
63. M. Fay and P. Graubard, Small-sample adjustments for Wald-type tests using sandwich estimators. Biometrics 2001; 57: 1198–1206.
64. G. Kauermann and R. J. Carroll, A note on the efficiency of sandwich covariance matrix estimation. J. Amer. Stat. Assoc. 2001; 96(456): 1387–1396.
65. L. A. Mancl and T. A. DeRouen, A covariance estimator for GEE with improved small-sample properties. Biometrics 2001; 57: 126–134.
66. W. Pan and M. M. Wall, Small-sample adjustments in using the sandwich variance estimator in generalized estimating equations. Stat. Med. 2002; 21: 1429–1441.
67. T. Cai, S. C. Cheng, and L. J. Wei, Semiparametric mixed-effects models for clustered failure time data. J. Amer. Stat. Assoc. 2002; 95(458): 514–522.
68. R. J. Gray and Y. Li, Optimal weight functions for marginal proportional hazards analysis of clustered failure time data. Lifetime Data Anal. 2002; 8: 5–19.
69. D. Hedeker, O. Siddiqui, and F. B. Hu, Random-effects regression analysis of correlated grouped-time survival data. Stat. Meth. Med. Res. 2000; 9: 161–179.
70. M. R. Segal, J. M. Neuhaus, and I. R. James, Dependence estimation for marginal models of multivariate survival data. Lifetime Data Anal. 1997; 3: 251–268.
71. K. K. Yau, Multilevel models for survival analysis with random effects. Biometrics 2001; 57: 96–102.
72. T. Braun and Z. Feng, Optimal permutation tests for the analysis of group randomized trials. J. Amer. Stat. Assoc. 2001; 96: 1424–1432.
73. B. L. Rooney and D. M. Murray, A meta-analysis of smoking prevention programs after adjustments for errors in the unit of analysis. Health Educ. Quart. 1993; 23(1): 48–64.
74. R. E. Kirk, Experimental Design: Procedures for the Behavioral Sciences, 2nd ed. Belmont, CA: Brooks/Cole Publishing Company, 1982.
75. C. L. Meinert, Clinical Trials. New York: Oxford University Press, 1986.
76. W. R. Shadish, T. D. Cook, and D. T. Campbell, Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston, MA: Houghton Mifflin Company, 2002.
77. B. J. Winer, D. R. Brown, and K. Michels, Statistical Principles in Experimental Design. New York: McGraw-Hill, 1991.
GROUP SEQUENTIAL DESIGNS

GERNOT WASSMER
Institute for Medical Statistics, Informatics, and Epidemiology, University of Cologne, Cologne, Germany

1 INTRODUCTION

In clinical research there is a great interest in interim analyses for ethical, economic, and organizational or administrative reasons. Interim analyses are performed within sequential designs that offer the possibility to stop a trial early with a statistically significant test result. Such a trial will possibly need fewer patients than a trial with a fixed sample size, where a test decision can be made only at the end of the trial. A therapy that was shown to be superior can be applied earlier, and the inferior therapy can be replaced by the better one. Furthermore, in interim analyses the quality of the performance of the trial can be assessed and improved when necessary. Interim analyses are also carried out for the assessment of the safety of the treatments under consideration, and the observation of serious adverse events can lead to an early stopping of the trial. The appointment of an Independent Data and Safety Monitoring Board (DSMB) in a clinical trial is recommended so that these issues are addressed according to generally accepted standards of Good Clinical Practice (GCP) (1–3). Another issue is the redesign of the trial (e.g., sample size reassessment), but this was not originally intended and is not generally possible with "classical" group sequential designs (see Flexible Design, Interim Analysis, Stopping Boundaries).

Sequential designs were originally developed for the analysis of observations as soon as they become available. In group sequential designs, statistical inference is performed after the observation of groups of subjects, which is appealing, especially in medical research, because such a trial does not need to be continuously monitored but can be organized in several stages. After each stage, an interim analysis is performed in which a decision is made to stop or to continue the trial. In particular, a trial can be stopped early with the rejection of a hypothesis H0. A trial can also be stopped for futility if the data indicate that H0 probably cannot be rejected at the planned end of the trial. As a result of the repeated testing at the interim analyses, the Type I error rate α will be inflated. Group sequential designs offer the possibility to account for this kind of multiplicity and provide test procedures that control the Type I error rate. In the simplest case, the maximum number of stages, K, to be performed and the number of observations per stage n1, . . . , nK are fixed in advance. For example, in a two-stage design (i.e., K = 2), a prespecified number n1 of subjects are observed in the first stage. If the P-value for testing H0 falls below a specified boundary α1, H0 can be rejected and no further subjects need to be recruited. If the P-value is too large (e.g., greater than 0.50), it is unlikely that a statistically significant result will be obtained after observation of the second-stage data. Hence, further recruitment of patients will be stopped and, of course, H0 cannot be rejected. If it is decided to continue the trial, in the second stage a prespecified number of subjects n2 are observed. If the P-value calculated from the total dataset (i.e., n1 + n2 observations) falls below a boundary α2, H0 is rejected; otherwise H0 is not rejected. As a result of the interim look, a false rejection of H0 can occur in the interim as well as in the final analysis, and two tests of significance are applied to the accumulating data. Hence, if the P-values obtained from the first and from both stages, respectively, are compared with the unadjusted level α, the overall Type I error rate exceeds α. If α1 = α2 = α/2, the procedure controls α because of the Bonferroni correction. The choice of more sophisticated boundaries is one important and fundamental topic in the theory of group sequential test designs.

Peter Armitage was the first to adopt the sequential testing approach in clinical research. The first edition of his book Sequential Medical Trials was published in 1960 (4). The concept of repeated significance tests and its application in clinical trials was initiated by him. Armitage et al. (5) introduced the recursive integration formula, which enables the implementation of the repeated significance testing approach. Pocock (6) and O'Brien and Fleming (7), on the other hand, gave the major impetus for the development of group sequential test procedures that are widely used today, especially in clinical research. For a short historical review of the early work on group sequential procedures, see Jennison and Turnbull (References 8 and 9, Section 1.3). Ghosh (10) provides a history of sequential analysis in general, taking into account the developments beginning in the seventeenth century.

The development of group sequential designs was, for the most part, within the repeated significance testing approach. This approach is conceptually different from the "purely" sequential approach. With the latter, both types of error probabilities are under control and used to determine the stopping boundaries of the sequential test procedure. The comprehensive theoretical development of these procedures owes much to the optimality of the sequential probability ratio test (SPRT), and to the derivation of analytic expressions for the decision regions and certain test characteristics. The SPRT was initiated by Abraham Wald (11). It is an optimum test in the sense that the expected sample size is minimized under both the null hypothesis and the alternative hypothesis. Several textbooks contain the theoretical developments in sequential designs (e.g., References 12–15). Theoretical research on repeated significance tests was also conducted. A series of papers were concerned with finding a bound for the critical value and with an approximation for the power and the expected sample size. Much of this work is a result of the research group of David Siegmund, in which many of the theoretical results were obtained from renewal theory. Essential developments and the mathematical background are provided in the book by Siegmund (13). An important part in the development of sequential designs was concerned with "triangular plans," in which the stopping region for the sum of the observations is defined
by two straight lines that cross at the end of the trial. These are "closed" or "truncated" plans in which a decision to stop the trial is fixed at some maximum amount of information. Much of the theoretical development is concerned with the overshooting problem that occurs when the test statistic is not on the boundary. Groups of observations as well as continuous monitoring can be taken into account. Mathematical sophistication has led to the "Christmas tree correction," which turns out to be quite accurate in many cases of practical importance (16). Whitehead (15) provides a comprehensive overview of sequential methods with special reference to the triangular plans in clinical trials. In a review article, Whitehead (17) argues against the distinction of group sequential methods from the wider family of sequential methods. This is certainly true, but the development and investigation of group sequential designs was in some sense separated from the rigorous mathematical derivation of results within the sequential theory, which is also a result of the rapid development of computer power that made the computations of the recursive numerical integral introduced by Armitage et al. (5) possible. Easy-to-use computer programs are available today to investigate the characteristics of the procedures numerically. The group sequential designs introduced by Pocock and O'Brien and Fleming stand for the "classical group sequential designs." The basic notions of these designs and some of their generalizations are discussed in the following section.

2 CLASSICAL DESIGNS

Pocock (6) proposed two-sided tests for normal responses with known variance assuming that the number of observations is equal between the K stages (i.e., n1 = . . . = nK = n, where nk denotes the sample size per stage of the trial). This situation applies to testing the difference of means in a parallel group design as well as to testing a single mean in a one-sample design (or a design with paired observations). In a group sequential test design, the hypothesis can be rejected
at the kth stage and the trial is terminated after that stage if the P-value is smaller than or equal to a value αk. The adjusted nominal significance levels αk, k = 1, . . . , K, are determined such that the overall probability of a false rejection of H0 in the sequential scheme does not exceed α. This condition can also be expressed in terms of adjusted critical bounds uk, k = 1, . . . , K, for the standardized test statistics Z*k of the accumulating data. Any sequence of critical bounds u1, . . . , uK that fulfills

PH0(|Z*1| ≥ u1 or . . . or |Z*K| ≥ uK) = α    (1)
defines a valid level-α test procedure for the two-sided test design. The statistics Z*1, . . . , Z*K are correlated as each Z*k incorporates all observations from the first k stages, k = 1, . . . , K. Therefore, the probability on the left-hand side in Equation (1) must in principle be computed with the multivariate normal integral, and the critical values u1, . . . , uK are found by a numerical search. Each statistic Z*k (k > 1), however, can be written as a sum of independent statistics. As a result of this independent increments structure, one can use the recursive integration formula of Armitage et al. (5) to facilitate the computation of the multiple integral and the determination of the critical values. Pocock’s design is characterized by assuming constant critical values u1 = . . . = uK = u or, equivalently, constant adjusted nominal significance levels α1 = . . . = αK over the stages of the trial. O’Brien and Fleming’s design, on the other hand, is defined through monotonically decreasing critical values uk given by uk = c/√k, where c is a suitable constant such that Equation (1) holds. Therefore, this approach postulates more conservative nominal significance levels for early stages, which yields an increasing sequence of adjusted significance levels. In Table 1, the nominal adjusted levels αk along with the corresponding critical values uk for the standardized test statistic are presented for both designs for α = 0.05 and K = 2, 3, 4, 5. These choices of K are in common use and the figures in the table can be immediately used for application. Figure 1 graphically illustrates the decision regions of a four-stage
O’Brien and Fleming’s and Pocock’s design in terms of the critical values uk. As K increases, the necessary adjustment becomes stronger. For Pocock’s design, the improvement of using an adjusted bound satisfying Equation (1) in place of the simple Bonferroni correction is obvious: Instead of using α/K at each stage of the test procedure, one can use the less stringent adjusted significance levels αk of Table 1. In O’Brien and Fleming’s design, strong evidence of an effect is required to stop the trial at early stages, but it is easier to reject H0 later on. As a consequence, the last-stage critical value is close to the critical value of the two-sided fixed sample size design (i.e., no interim analysis). For α = 0.05, the latter is given by zα/2 = 1.96, where zα/2 denotes the upper α/2 quantile of the standard normal distribution. In other words, the price to pay in terms of having to use a more conservative level of the Type I error rate is low for interim looks when using O’Brien and Fleming’s test design. Even stricter criteria for interim looks were earlier independently proposed by Haybittle (18) and Peto et al. (19). They suggested using u1 = . . . = uK−1 = 3 and uK = zα/2. The actual Type I error rate of this procedure exceeds the nominal level α, but the excess is small, and one can simply adopt the approach to adjust the critical value uK such that the Type I error rate of the procedure is maintained. For example, for K = 5 and α = 0.05, the adjusted critical level for the last stage is given by 1.990, which only slightly exceeds 1.960 and is considerably smaller than the values 2.413 and 2.040 required by the Pocock and the O’Brien and Fleming design (Table 1). Different group sequential test designs can be compared with respect to the necessary maximum sample size and the average sample size or average sample number (ASN). Usually, one fixes an effect size ϑ* and looks for the sample size such that the probability of rejecting H0 is equal to some desired power 1 − β. Given ϑ*, the probability of rejecting H0 increases as the sample sizes n per stage increase. Hence, in a two-sided design with K stages, the sample size per stage, n, to achieve power 1 − β for a group sequential design at a prespecified effect size ϑ* is found
Table 1. Adjusted Nominal Significance Levels αk and Critical Values uk, k = 1, . . . , K, for Pocock’s and O’Brien and Fleming’s Design (Two-Sided α = 0.05)

             Pocock                 O’Brien and Fleming
K    k       αk         uk          αk         uk
2    1       0.02939    2.178       0.00517    2.797
     2       0.02939    2.178       0.04799    1.977
3    1       0.02206    2.289       0.00052    3.471
     2       0.02206    2.289       0.01411    2.454
     3       0.02206    2.289       0.04507    2.004
4    1       0.01821    2.361       0.00005    4.049
     2       0.01821    2.361       0.00420    2.863
     3       0.01821    2.361       0.01942    2.337
     4       0.01821    2.361       0.04294    2.024
5    1       0.01581    2.413       0.00001    4.562
     2       0.01581    2.413       0.00126    3.226
     3       0.01581    2.413       0.00845    2.634
     4       0.01581    2.413       0.02256    2.281
     5       0.01581    2.413       0.04134    2.040
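The entries of Table 1 can be checked numerically. The sketch below is a crude Monte Carlo approximation (not the recursive integration formula of Armitage et al. (5)): it simulates the correlated statistics Z*1, . . . , Z*K under H0 and estimates the crossing probability in Equation (1) for the K = 5 boundaries of Table 1. The simulation size and seed are arbitrary illustrative choices.

```python
import numpy as np

def crossing_prob(u, n_sim=200_000, seed=1):
    """Estimate P_H0(|Z*_1| >= u_1 or ... or |Z*_K| >= u_K) by simulation."""
    K = len(u)
    rng = np.random.default_rng(seed)
    increments = rng.standard_normal((n_sim, K))            # independent stage-wise data
    z = np.cumsum(increments, axis=1) / np.sqrt(np.arange(1, K + 1))
    return float(np.mean((np.abs(z) >= np.asarray(u)).any(axis=1)))

pocock = [2.413] * 5                                         # Table 1, K = 5
obf = [2.040 * np.sqrt(5 / k) for k in range(1, 6)]          # u_k = c / sqrt(k), c = 2.040 * sqrt(5)
print(crossing_prob(pocock), crossing_prob(obf))             # both should be close to 0.05
```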
Figure 1. Decision regions for a two-sided four-stage group sequential design according to O’Brien and Fleming (OBF) and Pocock (P), respectively (α = 0.05).
numerically by solving

Pϑ*(|Z*1| ≥ u1 or . . . or |Z*K| ≥ uK) = 1 − β    (2)

for n. The maximum sample size is N = Kn. The ASN given ϑ* becomes

ASNϑ* = n + n Pϑ*(|Z*1| < u1) + n Pϑ*(|Z*1| < u1, |Z*2| < u2) + . . . + n Pϑ*(|Z*1| < u1, . . . , |Z*K−1| < uK−1)    (3)

The ASN can also be calculated under some other effect size. If the parameter specified in the alternative is true, ASNϑ* specifies
the expected sample size if the assumption made in the sample size calculation is true. In the case of normally distributed observations, it can be shown that N and the corresponding ASNϑ* are inversely proportional to ϑ*^2, in analogy to the fixed sample size design. In particular, it suffices to provide tables that specify the maximum and average sample size relative to the fixed sample size design. In Table 2, the values of the relative change are provided for two-sided α = 0.05 and different values of K and 1 − β. The factor N/nf is called the inflation factor of a particular design because it specifies how much the maximum sample size in the group sequential design must be inflated relative to the fixed sample size design. Comprehensive tables for the inflation factor in different designs can be found in Reference 9. The inflation factor can be used to perform sample size calculations for a group sequential test design in very different situations. For example, let the necessary sample size in a fixed sample design at given α = 0.05 and 1 − β = 0.80 be calculated as, say, nf = 54. If one wants to use a five-stage Pocock design, the maximum sample size necessary to achieve power 80% is the next integer larger than 1.228 × 54 = 66.3. That is, 67 observations need to be suitably distributed among the stages of the trial. Clearly, the assumption of equal sample sizes between the stages cannot exactly be fulfilled in most cases. It can be shown, however, that a slight departure from this assumption has virtually no effect on the test characteristics. That is, it might be reasonable in this case to start with n1 = 14 observations, and the subsequent sample sizes are n2 = 14, n3 = n4 = n5 = 13. The average sample size of this design is 0.799 × 54 = 43.1, which means that a substantial reduction in sample size as compared with the fixed sample design is expected in this trial. A further important test characteristic is the probability of rejecting the null hypothesis at a specific stage of the trial. In this example, if the alternative is true, the probabilities of rejecting H0 in the first, second, third, fourth, and fifth stage are given by 0.153, 0.205, 0.187, 0.148, and 0.107, respectively. That is, there is some chance of stopping the trial with the rejection of H0 at very early stages, saving time and money. On the other
hand, if one wants to use an O’Brien and Fleming design, the maximum and average sample size are 1.028 × 54 = 55.5 and 0.818 × 54 = 44.2, respectively. The corresponding probabilities of rejecting H0 in one of the five stages of the trial are 0.001, 0.076, 0.259, 0.276, and 0.188, respectively. That is, it is very unlikely with this design to reject H0 at early stages, but the maximum sample size is only slightly above the sample size in a fixed sample design. When choosing a design, one must weigh these considerations against aspects other than the reduction in sample size. It is an essential characteristic of the classical designs, however, that the particular design chosen must be thoroughly discussed before the start of the trial. In the ongoing trial, this design cannot be changed (see Flexible Designs). For given power, the maximum sample size in O’Brien and Fleming’s design is only slightly larger than the sample size necessary in a fixed sample size design. On the other hand, the ASN under the specified alternative, ASNϑ*, is smaller in Pocock’s design. For an increasing number of stages, K, Table 2 shows that ASNϑ* decreases and the maximum number of observations increases. Although the decrease is not monotonic in general (in fact, the ASNϑ* in Pocock’s design starts increasing again slightly for very large K), most of the reduction in ASNϑ* is already reached for K = 5. This result was first shown in References 20 and 21 and provides additional reasoning for not performing too many stages. From an organizational perspective, even fewer than four interim analyses (one or two, say) might be preferred. Pocock’s and O’Brien and Fleming’s designs were the starting point for many other types of designs developed during the last two decades. In a subsequent article, Pocock (21) found optimum critical values that, given K, α, and 1 − β, minimize the expected sample size under the alternative. These critical values were found by a K-dimensional grid search. For example, if K = 5, α = 0.05, and 1 − β = 0.95, the optimum critical values are given
Table 2. Average Sample Size ASNϑ* and Maximum Sample Size N Relative to the Sample Size nf in a Fixed Sample Size Design for Pocock’s and O’Brien and Fleming’s Design (Two-Sided α = 0.05)

                 Pocock                  O’Brien and Fleming
K    1 − β       ASNϑ*/nf    N/nf        ASNϑ*/nf    N/nf
2    0.80        0.853       1.110       0.902       1.008
     0.90        0.776       1.100       0.851       1.007
     0.95        0.718       1.093       0.803       1.007
3    0.80        0.818       1.166       0.856       1.017
     0.90        0.721       1.151       0.799       1.016
     0.95        0.649       1.140       0.751       1.015
4    0.80        0.805       1.202       0.831       1.024
     0.90        0.697       1.183       0.767       1.022
     0.95        0.619       1.170       0.716       1.021
5    0.80        0.799       1.228       0.818       1.028
     0.90        0.685       1.206       0.750       1.026
     0.95        0.602       1.191       0.696       1.025
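As a small planning illustration of the inflation factors in Table 2, the sketch below reproduces the worked example discussed in the text (fixed-sample size nf = 54, five-stage designs). The helper name and the rule for spreading the remainder over the early stages are illustrative choices, not a fixed convention.

```python
import math

def group_sequential_n(nf, inflation, K):
    """Maximum sample size and a near-equal split across K stages."""
    n_max = math.ceil(nf * inflation)          # e.g., ceil(1.228 * 54) = 67
    base, extra = divmod(n_max, K)             # spread the remainder over early stages
    return n_max, [base + (1 if k < extra else 0) for k in range(K)]

print(group_sequential_n(54, 1.228, 5))        # Pocock: (67, [14, 14, 13, 13, 13])
print(group_sequential_n(54, 1.028, 5))        # O'Brien-Fleming: (56, [12, 11, 11, 11, 11])
```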
by

(u1, u2, u3, u4, u5) = (2.446, 2.404, 2.404, 2.404, 2.396)

For K = 5, α = 0.05, and 1 − β = 0.50, these values are given by

(u1, u2, u3, u4, u5) = (3.671, 2.884, 2.573, 2.375, 2.037)

That is, for high power values, a design with constant critical levels over the stages is (nearly) optimum with respect to the average sample size. For low power values, on the other hand, a design with decreasing critical values, like O’Brien and Fleming’s design, should be used. Wang and Tsiatis (22) suggested that the optimum critical values can approximately be found within the Δ-class of critical values uk given by

uk = c(K, α, Δ) k^(Δ − 0.5), k = 1, . . . , K    (4)

where c(K, α, Δ) is a suitably chosen constant. They showed that the optimized ASNϑ* is only negligibly larger when searching within the one-parameter family of critical values defined through Equation (4) as compared with the K-dimensional search. For Δ = 0 and Δ = 0.5, one obtains O’Brien and Fleming’s and Pocock’s design, respectively. The Wang and Tsiatis Δ-class family offers a wide class of boundaries with different shapes. The critical values and the properties of the tests with intermediate values of Δ were extensively tabulated (9, 23) or can be found using a suitable software package (24, 25). One-sided test designs were also considered in the literature. However, the issues involved in the question of whether a test should be two-sided or one-sided become more complex when considering multistage designs (26). Nevertheless, it is conceptually straightforward and occasionally reasonable to define one-sided tests in the context of group sequential designs. A set of critical values u1, . . . , uK satisfying

PH0(Z*1 ≥ u1 or . . . or Z*K ≥ uK) = α    (5)
defines a one-sided group sequential test design. The power and the ASN are defined analogously. It is interesting that the critical values defined through Equation (5) are virtually (but not exactly) identical to the two-sided critical values defined through Equation (1) at significance level 2α, which is analogous to the fixed sample design. As the difference of the exact and the approximate critical values is less than 0.000001 for all
practically relevant situations, it suffices to consider only one case. DeMets and Ware (27, 28) considered the issue of stopping the trial for futility, which becomes evident in the one-sided setting. If, for example, the one-sided P-value at some stage k is greater than 0.50, the observed effect points in the direction opposite to the alternative. Hence, there may be no reasonable chance of obtaining a significant result at the end of the trial. In the planning phase of the trial, it can be decided that in this case the trial will be stopped. Taking into account the stopping for futility option, the critical values differ from those of the original design defined through Equation (5). Indeed, the critical values are somewhat smaller and, most importantly, if the null hypothesis is true, the ASN is reduced considerably. DeMets and Ware considered various choices of stopping for futility boundaries, including the constant boundary (i.e., stopping the trial if the P-value exceeds a constant α0 > α). Stopping the trial in favor of H0 was also considered in the context of two-sided tests. Emerson and Fleming (29) proposed one-sided and two-sided group sequential designs in which both error probabilities are under control. These designs were termed ‘‘symmetric designs’’ and were generalized by Pampallona and Tsiatis (30). Within the Δ-class of boundaries, they found designs that are optimal with respect to the average sample size under the alternative hypothesis. The application of the group sequential test designs discussed so far is also possible for other designs. In simulation studies (6), it was shown that the use of the critical values derived for normally distributed observations with known variance provides sufficient control of the nominal level if the variance is unknown and the t-test situation applies. Furthermore, in designs with binary or exponential response or many other types of response variables, the boundaries derived for the normal case can be applied. Generally, if the sample size is not too small, the Type I error rate is approximately equal to α if the increments are independent and approximately normal, which is also (approximately) true for studies with a survival endpoint, certain longitudinal designs, and
for nonparametric testing situations, which makes group sequential designs widely applicable in clinical trials (e.g., References 31–34). Some work was also done on the exact calculation for group sequential trials in other designs such as trials comparing two treatments with dichotomous response (35); exact calculations for the t-test, χ2, and F-test situation (36); or exact group sequential permutational tests (37).

3 THE α-SPENDING FUNCTION APPROACH
The test procedures discussed in the last section were originally designed for a fixed maximum number of stages and equally sized groups of observations per stage. In practice, however, the latter assumption is rarely exactly fulfilled. The schedule of interim analyses can at best ensure that the stage sizes are roughly the same, so in clinical trial practice the strict assumption of equal sample sizes may prove impractical. The size and power requirements are not greatly affected, however, if this assumption is not grossly violated. Nevertheless, situations exist with a substantial effect on size and power (38). Hence, the importance of more flexible procedures is obvious. One can obtain valid tests by relaxing the strict assumption of equal sample sizes between the stages. First of all, one may prespecify sample sizes n1, . . . , nK, which are not necessarily equal. For example, consider a four-stage design in which the first interim analysis should take place after observation of 10% of all planned sampling units, the second interim analysis after 40%, and the third interim analysis after 70% of observations. With these so-called information rates tk = (n1 + . . . + nk)/N specified in advance, it is possible to derive the maximum sample size, N, such that the procedure has desired power 1 − β, given K, α, and the desired boundary shape. The critical values and the test characteristics differ from those derived with equally spaced stages. One can even optimize a group sequential plan with regard to, say, the ASN under the alternative considering varying information rates and varying boundary shapes (39, 40). The α-spending function or use function approach, proposed by Lan and DeMets (41)
and extended by Kim and DeMets (42), is conceptually different. The idea is to specify the amount of significance level spent up to an interim analysis rather than the shape of the adjusted critical levels. One uses a function α*(tk) that specifies the cumulative Type I error rate spent at the time point tk of the kth analysis. α*(tk) is a non-decreasing function with α*(0) = 0 and α*(1) = α that must be fixed before the start of the trial. As above, tk represents the information rate that develops from the specific course of the study (43). The information rates need not be prespecified before the actual course of the trial but must be chosen independently of the observed effect sizes. Consequently, the number of observations at the kth analysis and the maximum number K of analyses are flexible. In the two-sided case, the critical value for the first analysis is given by u1 = Φ−1(1 − α*(t1)/2), where t1 = n1/N and Φ−1(·) denotes the inverse of the standard normal cumulative distribution function. The critical values for the remaining stages are then computed successively. That is, once uk−1 is calculated, the critical value at stage k is calculated through

πk = PH0(|Z*1| < u1, . . . , |Z*k−1| < uk−1, |Z*k| ≥ uk) = α*(tk) − α*(tk−1)    (6)
πk denotes the probability of a Type I error at stage k and, therefore, π1 + . . . + πK = α. The one-sided case is treated analogously. In this way, the sequence of critical values u1, . . . , uK that defines a group sequential test is not specified in advance but results from the information rates observed during the course of the trial. The use function approach is particularly attractive if the interim analyses are planned at specific time points rather than after a specific number of evaluable observations has been obtained. In particular, the sample size per stage may be an unknown quantity that will be observed only when the interim analysis is performed. With this approach, even the number of interim analyses need not be fixed in advance. Instead, a maximum amount of information must be specified. In the simplest case, it is the maximum sample size N of
the trial. Through the use of a specified use function, the significance level spent up to this information is fixed in advance, which enables the calculation of the adjusted levels. The overall significance level α is maintained if the study stops whenever tk = tK = 1. The actually observed number of observations, however, may fall short of or exceed N. To account for the latter case (i.e., tK > 1), the α-spending function α̃*(tk) = min(α, α*(tk)) might be used to account for random overrunning. If the study stops with a smaller maximum sample size than anticipated (i.e., tK < 1), then setting α̃*(tK) = α forces the procedure to fully exhaust the level α up to the last stage (44). An important application of group sequential designs is in trials where the endpoint is the time to an event (e.g., survival data). It was shown by Tsiatis (34) that the usual logrank test can be embedded into the group sequential design. The information is the observed number of events (i.e., deaths), and the use function approach turns out to be a very useful and flexible instrument for analyzing such trials in a sequential design (32, 45). A number of proposals have been made in the literature for the form of the function α*(tk). The α-spending functions

α*1(tk) = α ln(1 + (e − 1) · tk)    (7)

and

α*2(tk) = 2(1 − Φ(zα/2/√tk))  (one-sided case)
α*2(tk) = 4(1 − Φ(zα/4/√tk))  (two-sided case)    (8)

approximate Pocock’s and O’Brien and Fleming’s group sequential boundaries, respectively. Kim and DeMets (42) proposed a one-parameter family of α-spending functions

α*3(ρ, tk) = α tk^ρ    (9)

The spending functions in Equations (7)–(9) are illustrated in Fig. 2. Notice that the use of constant adjusted significance levels (i.e., Pocock’s design) does not correspond to a linear α-spending function. Instead, assuming a linear α-spending function (i.e., using α*3(ρ) with ρ = 1) and equally spaced
Figure 2. Examples of α-spending functions. α1∗ and α2∗ approximate Pocock’s and O’Brien and Fleming’s group sequential boundaries, respectively. Kim and DeMets α-spending function class α3∗ (ρ) is shown for ρ = 1, 1.5, 2.
stage sizes yields somewhat decreasing critical values. Hwang et al. (46) introduced a one-parameter family of α-spending functions and showed that the use of this family yields approximately optimal plans similar to the class of Wang and Tsiatis (22). Optimum tests adopting the use function approach were also found by Jennison (47). Similar approaches were proposed by Fleming et al. (48) and Slud and Wei (33). They are defined in terms of the Type I error rates πk in Equation (6) by specifying π1, . . . , πK summing to α, which is different from the spending function approach that specifies πk as a function of the information rate tk. Nevertheless, it requires the same technique for computing the decision regions. It is tempting to use the results of an interim analysis to modify the schedule of interim looks, particularly with the α-spending function approach because the maximum number of analyses need not be prespecified. For example, if the test result is very near to showing significance, it could be decided to conduct the next interim analysis earlier than originally planned. However, such a data-driven analysis strategy is not allowed for the α-spending function approach. Cases exist in which the Type I error rate is seriously inflated, as was shown by several
authors (8, 38). In this case, therefore, one should use adaptive or flexible designs that are designed specifically for a data-driven analysis strategy and offer an even larger degree of flexibility (see Flexible Designs). The α-spending function approach can be generalized in several ways. First, it is easy to implement an asymmetric procedure for the determination of an upper and a lower bound in a two-sided test situation. Then, two α-spending functions must be given, specifying the Type I error rate for the lower and the upper bound, respectively. Second, for planning purposes, it is quite natural to consider a function describing how the power of the procedure should be spent during the stages of the study, known as the power spending function. Approaches that use the power spending function are described in References 49–51.

4 POINT ESTIMATES AND CONFIDENCE INTERVALS

An important field of research in group sequential designs was concerned with parameter estimation. Through the use of a stopping rule (i.e., the possibility of stopping a trial early with the rejection (or acceptance) of the null hypothesis), point estimates that are derived for the fixed sample size case (e.g., maximum likelihood estimates) are biased. Hence, in the long run, one is faced with overestimation or underestimation of the true parameter. Point estimates were proposed that correct for the (overall) estimation bias through numerical methods (e.g., References 52–60). Two conceptually different methods for the calculation of confidence intervals were considered in the literature. The first method enables the calculation after the trial has stopped and a final decision of rejection or acceptance of the null hypothesis has been reached (e.g., References 59, 61–66). This approach requires strict adherence to the stopping rule and depends on the ordering of the sample space, which means that it has to be decided whether, for example, an observed effect leading to the rejection of the null hypothesis in the first interim analysis is ‘‘more extreme’’ than an effect that is larger but observed in the second interim analysis. If so, the resulting ordering rates effects obtained in earlier
stages of the trial as stronger than effects obtained later on. However, other orderings exist that are reasonable choices too, and no general agreement exists over which ordering is the best. Generally, it is possible to derive confidence intervals based on such orderings with confidence level exactly equal to 1 − α. Bias-adjusted point estimates can be derived as the 50% confidence limit, yielding median-unbiased estimates. Furthermore, dependent on the ordering of the sample space, it is also possible to obtain overall P-values after completion of the trial (67). The second method merely takes into account the multiplicity that develops from the repeated looks at the data. The resulting intervals are called Repeated Confidence Intervals (RCIs). They are defined as a sequence of intervals Ik for which the coverage probability is simultaneously fulfilled, i.e.,

Pϑ(ϑ ∈ Ik, k = 1, . . . , K) = 1 − α

where ϑ is the unknown parameter. This concept was introduced by Jennison and Turnbull (68, 69) [see also Lai (70)]. The calculation of RCIs is straightforward. When considering the family of group sequential tests for testing H0: ϑ = ϑ0, one simply has to find, at given stage k, the values ϑ0 that do not lead to a rejection of H0. Typically, these values are easy to find, and closed-form analytical expressions exist in many cases. In more complicated situations, the RCI can be found iteratively. For example, when considering the mean of normally distributed observations, the sequence of RCIs is given by the sequence of intervals

Ik = [x̄k − uk SEM; x̄k + uk SEM], k = 1, . . . , K

where x̄k is the observed mean of observations up to stage k, SEM denotes the standard error of the mean, and u1, . . . , uK are the adjusted critical values corresponding to a particular group sequential test design. The RCIs are wider than the confidence interval in a fixed sample design because they take into account the repeated looks at the accumulating data. For a design with larger critical values at early stages (e.g., O’Brien and Fleming’s design), these intervals are
fairly wide in the first stage but become narrower as the study progresses. The RCIs can be calculated at each interim analysis. Based on the RCI, it might be decided to terminate or to continue the trial. For example, consider testing the hypothesis H0: ϑ = 0. H0 can be rejected at stage k if the RCI at stage k does not contain ϑ = 0. Even if it is decided to continue the trial in this case, RCIs can still be calculated at subsequent stages and might be used for the decision process. The RCIs are independent of the adherence to the stopping rule and can therefore also be calculated while the study is ongoing. As a consequence, these intervals are conservative if not all stages of the trial were actually performed.

5 SUPPLEMENTS

Several important issues are not addressed in this article; the reader is referred to the literature for further study. Incorporating covariate information is considered in References 71–73. In trials with multiple endpoints or in multi-armed trials, such as dose-response trials, the multiplicity occurs in two dimensions. First, the study is performed sequentially and one has to account for the multiple looks effect. Second, at each stage of the procedure, the need exists to perform multiple tests on several hypotheses concerning single endpoints, or one wants to perform multiple comparisons between the treatment groups. Furthermore, it should be possible to drop inferior treatments in a multi-armed trial. In multi-armed trials, group sequential methods can be applied, too, and some current work exists in this area (e.g., References 74–80). Parametric and nonparametric group sequential tests for multiple endpoints were proposed in, for example, References 81–86. Finally, group sequential designs in equivalence trials were discussed, among other designs, in References 40 and 87–90. Group sequential designs require specialized software. Software packages are available that are specifically designed for planning and analyzing a sequentially planned trial (24, 25, 91), which helps to implement a group sequential design correctly and ensures an accurate analysis.
For advanced users, it is also possible to perform the relevant calculations using standard statistical software packages. SAS/IML (92) as well as S-PLUS (93) provide modules for the calculation and the assessment of group sequential test designs.
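As a rough illustration of such calculations with a general-purpose language, the following sketch approximates the two-sided critical values of the α-spending approach in Equation (6) by Monte Carlo simulation, using the O’Brien–Fleming-type spending function of Equation (8). It is only a sketch of the idea, not the exact numerical integration implemented in the dedicated packages cited above; the function names, simulation size, and seed are arbitrary illustrative choices.

```python
import numpy as np
from math import erf, sqrt

def Phi(x):                                                # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def spending_boundaries(t, alpha_spend, n_sim=400_000, seed=7):
    """Two-sided u_k such that the level newly spent at stage k equals
    alpha_spend(t_k) - alpha_spend(t_{k-1}), as in Equation (6)."""
    t = np.asarray(t, dtype=float)
    rng = np.random.default_rng(seed)
    steps = np.diff(np.concatenate(([0.0], t)))            # information increments
    incr = rng.standard_normal((n_sim, len(t))) * np.sqrt(steps)
    z = np.cumsum(incr, axis=1) / np.sqrt(t)               # standardized statistics Z*_k
    alive = np.ones(n_sim, dtype=bool)                     # paths that have not stopped yet
    u, spent = [], 0.0
    for k in range(len(t)):
        pi_k = alpha_spend(t[k]) - spent                   # level to spend at stage k
        spent += pi_k
        abs_z = np.abs(z[alive, k])
        u_k = np.quantile(abs_z, 1.0 - pi_k * n_sim / alive.sum())
        u.append(round(float(u_k), 3))
        alive[alive] = abs_z < u_k                         # stop the paths that crossed
    return u

alpha = 0.05
z_a4 = 2.241                                               # upper alpha/4 quantile for alpha = 0.05
obf_spend = lambda tk: 4.0 * (1.0 - Phi(z_a4 / sqrt(tk)))  # Equation (8), two-sided case
print(spending_boundaries([0.25, 0.50, 0.75, 1.00], obf_spend))
# yields strongly decreasing boundaries whose last value is close to the fixed-sample 1.96
```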
REFERENCES 1. P. Armitage, Interim analysis in clinical trials. Stat. Med. 1991; 10: 925–937. 2. K. McPherson, Sequential stopping rules in clinical trials. Stat. Med. 1990; 9: 595–600. 3. A. J. Sankoh, Interim analysis: an update of an FDA reviewer’s experience and perspective. Drug Inf. J. 1999; 33: 165–176. 4. P. Armitage, Sequential Medical Trials. 2nd ed. Oxford: Blackwell, 1975. 5. P. Armitage, C. K. McPherson, and B. C. Rowe, Repeated significance tests on accumulating data. J. Roy. Stat. Soc. A 1969; 132: 235–244. 6. S. J. Pocock, Group sequential methods in the design and analysis of clinical trials. Biometrika 1977; 64: 191–199. 7. P. C. O’Brien and T. R. Fleming, A multiple testing procedure for clinical trials. Biometrics 1979; 35: 549–556. 8. C. Jennison and B. W. Turnbull, Group sequential tests and repeated confidence intervals. In: B. K. Ghosh and P. K. Sen (eds.), Handbook of Sequential Analysis. New York: Marcel Dekker, 1991, pp. 283–311. 9. C. Jennison and B. W. Turnbull, Group Sequential Methods with Applications to Clinical Trials. Boca Raton, FL: Chapman & Hall, 2000.
16. J. Whitehead and I. Stratton, Group sequential clinical trials with triangular rejection regions. Biometrics 1983; 39: 227–236. 17. J. Whitehead, Sequential methods. In: C. K. Redmond and T. Colton (eds.), Biostatistics in Clinical Trials. Chichester: Wiley, 2001, pp. 414–422. 18. J. L. Haybittle, Repeated assessments of results in clinical trials of cancer treatment. Brit. J. Radiol. 1971; 44: 793–797. 19. R. Peto, M. C. Pike, P. Armitage, N. E. Breslow, D. R. Cox, S. V. Howard, N. Mantel, K. McPherson, J. Peto, and P. G. Smith, Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Brit. J. Cancer 1976; 34: 585–612. 20. K. McPherson, On choosing the number of interim analyses in clinical trials. Stat. Med. 1982; 1: 25–36. 21. S. J. Pocock, Interim analyses for randomized clinical trials: the group sequential approach. Biometrics 1982; 38: 153–162. 22. S. K. Wang and A. A. Tsiatis, Approximately optimal one-parameter boundaries for group sequential trials. Biometrics 1987; 43: 193–199. 23. G. Wassmer, Statistische Testverfahren für gruppensequentielle und adaptive Pläne in klinischen Studien. Theoretische Konzepte und deren praktische Umsetzung mit SAS. Köln: Verlag Alexander Mönch, 1999. 24. Cytel Software Corporation, EaSt 2000: Software for the Design and Interim Monitoring of Group Sequential Clinical Trials. Cambridge, MA: Cytel, 2000. 25. G. Wassmer and R. Eisebitt, ADDPLAN: Adaptive Designs - Plans and Analyses. Cologne: ADDPLAN GmbH, 2002.
10. B. K. Ghosh, A brief history of sequential analysis. In: B. K. Ghosh and P. K. Sen (eds.), Handbook of Sequential Analysis. New York: Marcel Dekker, 1991, pp. 1–19.
26. P. C. O’Brien, Data and safety monitoring. In: P. Armitage and T. Colton (eds.), Encyclopedia of Biostatistics. Chichester: Wiley, 1998, pp. 1058–1066.
11. A. Wald, Sequential Analysis. New York: Wiley, 1947.
27. D. L. DeMets and J. H. Ware, Group sequential methods for clinical trials with a one-sided hypothesis. Biometrika 1980; 67: 651–660.
12. B. K. Ghosh, Sequential Tests of Statistical Hypotheses. Reading, MA: Addison-Wesley, 1970. 13. D. Siegmund, Sequential Analysis. New York: Springer, 1985.
28. D. L. DeMets and J. H. Ware, Asymmetric group sequential boundaries for monitoring clinical trials. Biometrika 1982; 69: 661–663.
14. G. B. Wetherill, Sequential Methods in Statistics. London: Chapman and Hall, 1975.
29. S. S. Emerson and T. R. Fleming, Symmetric group sequential test designs. Biometrics 1989; 45: 905–923.
15. J. Whitehead, The Design and Analysis of Sequential Clinical Trials, rev. 2nd ed. Chichester: Wiley, 1997.
30. S. Pampallona and A. A. Tsiatis, Group sequential designs for one-sided and two-sided hypothesis testing with provision for early
stopping in favour of the null hypothesis. J. Stat. Plan. Inf. 1994; 42: 19–35.
31. D. L. DeMets and M. H. Gail, Use of logrank tests and group sequential methods at fixed calender times. Biometrics 1985; 41: 1039–1044. 32. K. K. G. Lan and J. M. Lachin, Implementation of group sequential logrank tests in a maximum duration trial. Biometrics 1990; 46: 759–770. 33. E. Slud and L. J. Wei, Two-sample repeated significance tests based on the modified Wilcoxon statistic. J. Amer. Stat. Ass. 1982; 77: 862–868. 34. A. A. Tsiatis, Repeated significance testing for a general class of statistics used in censored survival analysis. J. Amer. Stat. Ass. 1982; 77: 855–861. 35. D. Y. Lin, L. J. Wei, and D. L. DeMets, Exact statistical inference for group sequential trials. Biometrics 1991; 47: 1399–1408. 36. C. Jennison and B. W. Turnbull, Exact calculations for sequential t, chi-square and F tests. Biometrika 1991; 78: 133–141. 37. C. R. Mehta, N. Patel, P. Senchaudhuri, and A. A. Tsiatis, Exact permutational tests for group sequential clinical trials. Biometrics 1994; 50: 1042–1053. 38. M. A. Proschan, D. A. Follmann, and M. A. Waclawiw, Effects on assumption violations on type I error rate in group sequential monitoring. Biometrics 1992; 48: 1131–1143. 39. E. H. Brittain and K. R. Bailey, Optimization of multistage testing times and critical values in clinical trials. Biometrics 1993; 49: 763–772.
45. K. Kim and A. A. Tsiatis, Study duration for clinical trials with survival response and early stopping rule. Biometrics 1990; 46: 81–92. 46. I. K. Hwang, W. J. Shih, and J. S. DeCani, Group sequential designs using a family of Type I error probability spending functions. Stat. Med. 1990; 9: 1439–1445. 47. C. Jennison, Efficient group sequential tests with unpredictable group sizes. Biometrika 1987; 74: 155–165. 48. T. R. Fleming, D. P. Harrington, and P. C. O’Brien, Designs for group sequential trials. Contr. Clin. Trials 1984; 5: 348–361. 49. P. Bauer, The choice of sequential boundaries based on the concept of power spending. Biom. Inform. Med. Biol. 1992; 23: 3–15. 50. M. N. Chang, I. K. Hwang, and W. J. Shih, Group sequential designs using both type I and type II error probability spending functions. Comm. Stat. Theory Meth. 1998; 27: 1323–1339. 51. S. Pampallona, A. A. Tsiatis, and K. Kim, Interim monitoring of group sequential trials using spending functions for the type I and type II error probabilities. Drug Inf. J. 2001; 35: 1113–1121. 52. S. S. Emerson, Computation of the uniform minimum variance unbiased estimator of the normal mean following a group sequential trial. Comput. Biomed. Res. 1993; 26: 68–73. 53. S. S. Emerson and T. R. Fleming, Parameter estimation following group sequential hypothesis testing. Biometrika 1990; 77: 875–892. 54. S. S. Emerson and J. M. Kittelson, A computationally simpler algorithm for the UMVUE of a normal mean following a sequential trial. Biometrics 1997; 53: 365–369.
40. H-H. Müller and H. Schäfer, Optimization of testing times and critical values in sequential equivalence testing. Stat. Med. 1999; 18: 1769–1788.
55. K. Kim, Point estimation following group sequential tests. Biometrics 1989; 45: 613–617.
41. K. K. G. Lan and D. L. DeMets, Discrete sequential boundaries for clinical trials. Biometrika 1983; 70: 659–663.
56. A. Liu and W. J. Hall, Unbiased estimation following a group sequential test. Biometrika 1991; 86: 71–78.
42. K. Kim and D. L. DeMets, Design and analysis of group sequential tests based on the Type I error spending rate function. Biometrika 1987; 74: 149–154.
57. J. C. Pinheiro and D. L. DeMets, Estimating and reducing bias in group sequential designs with Gaussian independent increment structure. Biometrika 1997; 84: 831–845.
43. K. K. G. Lan and D. L. DeMets, Group sequential procedures: calender versus information time. Stat. Med. 1989; 8: 1191–1198.
58. E. Skovlund and L. Walløe, Estimation of treatment difference following a sequential clinical trial. J. Amer. Stat Ass. 1989; 84: 823–828.
44. K. Kim, H. Boucher, and A. A. Tsiatis, Design and analysis of group sequential logrank tests in maximum duration versus information trials. Biometrics 1995; 51: 988–1000.
59. S. Todd, J. Whitehead, and K. M. Facey, Point and interval estimation following a sequential clinical trial. Biometrika 1996; 83: 453–461.
GROUP SEQUENTIAL DESIGNS 60. J. Whitehead, On the bias of maximum likelihood estimation following a sequential test. Biometrika 1986; 73: 573–581. 61. M. N. Chang and P. C. O’Brien, Confidence intervals following group sequential tests. Contr. Clin. Trials 1986; 7: 18–26. 62. D. S. Coad and M. B. Woodroofe, Corrected confidence intervals after sequential testing with applications to survival analysis. Biometrika 1996; 83: 763–777. 63. D. E. Duffy and T. J. Santner, Confidence intervals for a binomial parameter based on multistage tests. Biometrics 1987; 43: 81–93. 64. K. Kim and D. L. DeMets, Confidence intervals following group sequential tests in clinical trials. Biometrics 1987; 43: 857–864. 65. G. L. Rosner and A. A. Tsiatis, Exact confidence intervals following a group sequential trial: a comparison of methods. Biometrika 1988; 75: 723–729. 66. A. A. Tsiatis, G. L. Rosner, and C. R. Mehta, Exact confidence intervals following a group sequential test. Biometrics 1984; 40: 797–803. 67. K. Fairbanks and R. Madsen, P values for tests using a repeated significance test design. Biometrika 1982; 69: 69–74. 68. C. Jennison and B. W. Turnbull, Repeated confidence intervals for group sequential clinical trials. Contr. Clin. Trials 1984; 5: 33–45. 69. C. Jennison and B. W. Turnbull, Interim analysis: the repeated confidence interval approach. J. R. Statist. Soc. B 1989; 51: 305–361. 70. T. L. Lai, Incorporating scientific, ethical and economic considerations into the design of clinical trials in the pharmaceutical industry: a sequential approach. Comm. Stat. Theory Meth. 1984; 13: 2355–2368. 71. C. Jennison and B. W. Turnbull, Group sequential analysis incorporating covariate information. J. Amer. Stat. Ass. 1997; 92: 1330–1341. 72. D. O. Scharfstein, A. A. Tsiatis, and J. M. Robins, Semiparametric efficiency and its implication on the design and analysis of group-sequential studies. J. Amer. Stat. Ass. 1997; 92: 1342–1350. 73. A. A. Tsiatis, G. L. Rosner, and D. L. Tritchler, Group sequential tests with censored survival data adjusting for covariates. Biometrika 1985; 365–373. 74. D. A. Follmann, M. A. Proschan, and N. L. Geller, Monitoring pairwise comparisons in multi-armed clinical trials. Biometrics 1994; 50: 325–336.
75. N. L., Geller M. A. Proschan, and D. A. Follmann, Group sequential monitoring of multiarmed clinical trials. Drug Inf. J. 1995; 29: 705–713. 76. M. Hellmich, Monitoring clinical trials with multiple arms. Biometrics 2001; 57: 892–898. 77. M. Hughes, Stopping guidelines for clinical trials with multiple treatments. Stat. Med. 1993; 12: 901–915. 78. W. Liu, A group sequential procedure for allpairwise comparisons of k treatments based on range statistics. Biometrics 1995; 51: 946–955. 79. K. J. Lui, A simple generalization of the O’Brien and Fleming group sequential test procedure to more than two treatment groups. Biometrics 1993; 49: 1216–1219. 80. M. A. Proschan, D. A. Follmann, and N. L. Geller, Monitoring multi-armed trials. Stat. Med. 1994; 13: 1441–1452. 81. C. Jennison and B. W. Turnbull, Group sequential tests for bivariate response: interim analyses of clinical trials with both efficacy and safety endpoints. Biometrics 1993; 49: 741–752. 82. J. M. Lachin, Group sequential monitoring of distribution-free analyses of repeated measures. Stat. Med. 1997; 16: 653–668. 83. S. J. Lee, K. Kim, and A. A. Tsiatis, Repeated significance testing in longitudinal clinical trials. Biometrika 1996; 83: 779–789. 84. J. Q. Su and J. M. Lachin, Group sequential distribution-free methods for the analysis of multivariate observations. Biometrics 1992; 48: 1033–1042. 85. D-I. Tang, C. Gnecco, and N. L. Geller, Design of group sequential clinical trials with multiple endpoints. J. Amer. Stat. Ass. 1989; 84: 776–779. 86. S. Todd, Sequential designs for monitoring two endpoints in a clinical trial. Drug Inf. J. 1999; 33: 417–426. 87. S. Durrlemann and R. Simon, Planning and monitoring of equivalence studies. Biometrics 1990; 46: 329–336. 88. C. Jennison and B. W. Turnbull, Sequential equivalence testing and repeated confidence intervals, with application to normal and binary responses. Biometrics 1993; 49: 31–43. 89. J. M. Kittelson and S. S. Emerson, A unifying family of group sequential test designs. Biometrics 1999; 55: 874–882. 90. J. Whitehead, Sequential designs for equivalence studies. Stat. Med. 1996; 15: 2703–2715.
91. MPS Research Unit, PEST 4: Operating Manual. Reading, UK: University of Reading, 2000. 92. SAS Institute, Inc., SAS/IML Software: Changes and Enhancements through Release 6. 11. Cary, NC: SAS Institute, Inc., 1995. 93. Mathsoft Inc., S-Plus 2000. Seattle, WA: MathSoft, 2000.
HAZARD RATE

MITCHELL H. GAIL
National Cancer Institute, Bethesda, MD, USA
The hazard rate at time t of an event is the limit

λ(t) = lim Δ↓0 Δ−1 Pr(t ≤ T < t + Δ | t ≤ T),

where T is the exact time to the event. Special cases and synonyms of hazard rate, depending on the event in question, include force of mortality (where the event is death), instantaneous incidence rate, incidence rate, and incidence density (where the event is disease occurrence). For events that can only occur once, such as death or first occurrence of an illness, the probability that the event occurs in the interval [0, t) is given by 1 − exp(−∫_0^t λ(u) du). The quantity ∫_0^t λ(u) du is known as the cumulative hazard. Often, the theoretical hazard rate λ(u) is estimated by dividing the number of events that arise in a population in a short time interval by the corresponding person-years at risk. The various terms, hazard rate, force of mortality, incidence density, person-years incidence rate, and incidence rate are often used to denote estimates of the corresponding theoretical hazard rate.
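As a rough numerical illustration of the person-years estimate described above (with made-up numbers, not data from any study), the following sketch divides event counts by person-years in yearly intervals and converts the resulting cumulative hazard into a survival probability.

```python
import math

events_per_interval = [12, 9, 7]        # deaths observed in three consecutive 1-year intervals
person_years = [980.0, 850.0, 730.0]    # person-years at risk in each interval

hazard = [d / py for d, py in zip(events_per_interval, person_years)]   # per-year hazard estimates
cumulative_hazard = sum(h * 1.0 for h in hazard)                        # interval width = 1 year
print(hazard)
print(math.exp(-cumulative_hazard))     # estimated probability of remaining event-free at 3 years
```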
HAZARD RATIO

CLAUDIA SCHMOOR and ERIKA GRAF
University Hospital Freiburg, Center of Clinical Trials, Freiburg, Germany

1 INTRODUCTION

Hazard ratios are useful measures to compare groups of patients with respect to a survival outcome. In the context of clinical trials, ‘‘survival outcome’’ or ‘‘survival time’’ are used generically to denote the time ‘‘survived’’ from a specified time origin, like diagnosis of cancer, start of treatment, or first infarction, to a specific event of interest, like death, recurrence of cancer, or reinfarction (see survival analysis). Broadly speaking, the hazard rate quantifies the risk of experiencing an event at a given point in time. The hazard ratio at that time is the ratio of this risk in one population divided by the risk in a second population chosen as reference group. The hazard ratio between two populations can vary over time. For a given point in time, hazard ratios below 1.0 indicate that the risk of the event is lower in the first group than the risk in the reference group, whereas a hazard ratio of 1.0 indicates equal risks, and values above 1.0 imply a higher risk in the first group compared with the reference population. Often, it is plausible to assume proportional hazards (see Cox’s proportional hazards model) (i.e., to assume that although the respective hazard rates in each group may change in time, their ratio is constant over time). In this case, the constant hazard ratio adequately summarizes the relative survival experience in the two groups over the whole time axis. It describes the size of the difference between groups, similarly as a risk ratio or relative risk, which is used to compare proportions. However, the interpretation is different, as will be explained later. In the sequel, a formal definition of the hazard ratio is first provided, and the interpretation of hazard rate, hazard ratio, and risk ratio is illustrated by comparing two theoretical populations. Then, using data from a hypothetical clinical trial, the calculation of various estimators is demonstrated for the hazard ratio and comments are made on their mathematical properties. Some references for further reading conclude the article.

2 DEFINITIONS

Mathematically, the hazard rate of a population is defined as

λ(t) = −(1/S(t)) · d/dt S(t) = −d/dt (log S(t))    (1)
where S(t) denotes the survival function of the population (i.e., the proportion surviving beyond time t). Thus, λ(t) is the slope or instantaneous rate at which the population diminishes in the time point t, rescaled by the factor 1/S(t) to the surviving proportion S(t) of the population. Therefore, λ(t) is also called ‘‘instantaneous failure rate.’’ ‘‘Failure rate,’’ ‘‘force of mortality,’’ ‘‘hazard (rate) function,’’ or ‘‘hazard’’ have also been used to denote λ(t). λ(t) is approximately equal to the proportion of individuals that will die within one time unit from t, out of the subgroup of those that are still alive in t. Either S(t) or λ(t) may be used to characterize the survival distribution of a population, so that if one is known, the other is automatically determined. If two populations A and B with hazard rates λA (t) and λB (t) are now considered, their hazards may be compared by forming the so-called relative hazard function θ (t) =
λB(t) / λA(t)    (2)
θ (t) is the hazard ratio (or relative hazard) of population B compared with population A at time t. In cases where the relative hazard function θ (t) of the two populations does not depend on t, the constant ratio θ (t) = θ is called the hazard ratio (or relative hazard) of population B compared to population A. If the hazard ratio θ (t) = θ is constant, a relationship can be established between
the hazard ratio and the two survival functions: Using Equation (1), it can be shown by mathematical transformations that for all t θ=
log[SB(t)] / log[SA(t)]    (3)
Conversely, SB(t) can then be expressed as a function of SA(t) and the constant hazard ratio θ:

SB(t) = SA(t)^θ   for all t,    (4)
following Equation (3). In contrast, the risk ratio (or relative risk) of population B versus population A at time t, RR(t) =
[1 − SB(t)] / [1 − SA(t)],    (5)
compares cumulative risks of dying up to time t.

3 ILLUSTRATION OF HAZARD RATE, HAZARD RATIO, AND RISK RATIO

To illustrate the interpretation of survival functions, hazard rates, hazard ratios, risk ratios (or relative risks), and the relationship between them, two hypothetical populations A and B are considered. Assume that the hypothetical control population A has a survival rate that can be described by the function SA(t) = exp(−(0.3t)^1.3). The corresponding survival rates of this population at yearly intervals are displayed in Table 1: 81.1% survive the first year from diagnosis, 59.8% survive the first two years, and so on, until after 10 years nearly the entire population has died. Now assume a new treatment is introduced, to the effect of reducing, at all time points t, the instantaneous death rate by 50% in the corresponding population B, which implies a constant hazard ratio of θ = 0.5 for population B versus population A as in Equation (3), so that SB(t) = SA(t)^0.5, following Equation (4). As a result, survival improves to 81.1%^0.5 = 90.1% after 1 year, to 59.8%^0.5 = 77.3% after 2 years, and so on. The hazard rates are also shown in Table 1. Note that although they increase over time, their ratio for population B versus population
A is indeed constant over time and equal to θ = 0.5. In contrast, the risk ratio displayed in the last column of Table 1 strongly depends on t. Here, the absolute risks of dying up to time t are being compared for population B versus population A. For example, after 4 years from diagnosis, 100% − 28.2% = 71.8% have died in the control population A, compared with 100% − 53.1% = 46.9% in population B, yielding a risk ratio of 46.9/71.8 = 0.653 for treatment versus control at t = 4. In other words, treatment decreases the risk of dying within 4 years from diagnosis by a factor of 0.653. This risk ratio is higher than the hazard ratio of 0.5, and it has a different interpretation. Recall that at all times t the hazard rate can roughly be approximated by the proportion of patients dying within 1 year among the patients still alive at the beginning of the year. S(t) − S(t + 1) is calculated to see how many patients die within 1 year from t, and the result is divided by S(t) to rescale this rate to the patients still alive at time t. For example, to approximate the hazard rate at t = 4 in the control population A, note that out of the 28.2% who survive to the beginning of year 4, SA(4) − SA(5) = 9.8% die in the following year, so that λA(4) ≈ 9.8/28.2 = 0.347, as a crude approximation (the true value is 0.412, see Table 1). Obviously, this method is too inaccurate to be of any practical relevance. More precise approximations could be obtained by using smaller time units Δ like weeks, days, or seconds instead of years, so that, as Δ → 0, the approximation would yield the true hazard rate at t. However, the rough approximation approach illustrates that at all times t the hazard ratio compares the two populations with respect to the instantaneous risks of death at t only within the individuals surviving up to t. At each t, the current death rate is calculated with respect to a new reference population, consisting of the individuals still at risk of death at time t. In contrast, risk ratios compare cumulative risks of dying up to time t. Although the hazard ratio function of two populations can either vary in time or remain constant over time, the risk ratio necessarily has to vary over time. From Equation (5), one can see that a constant risk ratio (RR(t) =
Table 1. Survival and Hazard Rates in Two Hypothetical Populations A and B: Comparing Survival Experience Based on Hazard Ratio and Risk Ratio

Years from     Proportion surviving t                          Hazard rate in t     Hazard ratio     Risk ratio
diagnosis t    SA(t) = exp(−(0.3t)^1.3)   SB(t) = SA(t)^0.5    λA(t)      λB(t)     in t, θ(t)       in t, RR(t)
 0             100.0%                     100.0%               0          0         —                —
 1              81.1%                      90.1%               0.272      0.136     0.5              0.526
 2              59.8%                      77.3%               0.335      0.167     0.5              0.564
 3              41.8%                      64.7%               0.378      0.189     0.5              0.607
 4              28.2%                      53.1%               0.412      0.206     0.5              0.653
 5              18.4%                      42.9%               0.440      0.220     0.5              0.700
 6              11.7%                      34.2%               0.465      0.233     0.5              0.745
 7               7.3%                      26.9%               0.487      0.244     0.5              0.788
 8               4.4%                      21.0%               0.507      0.254     0.5              0.826
 9               2.6%                      16.2%               0.525      0.263     0.5              0.860
10               1.5%                      12.4%               0.542      0.271     0.5              0.889
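For readers who wish to reproduce Table 1, the short sketch below recomputes its columns from the assumed survival function SA(t) = exp(−(0.3t)^1.3), the constant hazard ratio θ = 0.5, and the resulting SB(t) = SA(t)^0.5. The closed-form hazard λA(t) = 1.3 · 0.3^1.3 · t^0.3 follows from differentiating the cumulative hazard; the script is only an illustration of the definitions above.

```python
import math

theta = 0.5
for t in range(0, 11):
    S_A = math.exp(-(0.3 * t) ** 1.3)
    S_B = S_A ** theta                                   # Equation (4)
    lam_A = 1.3 * 0.3 ** 1.3 * t ** 0.3 if t > 0 else 0.0
    lam_B = theta * lam_A                                # constant hazard ratio
    RR = (1 - S_B) / (1 - S_A) if t > 0 else float("nan")  # Equation (5)
    print(f"t={t:2d}  S_A={S_A:6.1%}  S_B={S_B:6.1%}  "
          f"lam_A={lam_A:5.3f}  lam_B={lam_B:5.3f}  RR={RR:5.3f}")
```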
Table 2. Comparison of Hazard Ratio and Risk Ratio

Interpretation of the quantities compared in the ratio?
  Hazard ratio: instantaneous risk of death in t among those surviving up to time t
  Risk ratio: cumulative risk of death up to time t
Ratio constant function of time t?
  Hazard ratio: yes or no
  Risk ratio: no
Useful summary measure across time axis?
  Hazard ratio: yes, if roughly constant in time t
  Risk ratio: no
Useful to compare survival rates at a specific time t?
  Hazard ratio: no
  Risk ratio: yes
RR) would imply that the proportions surviving in the two populations are multiples of each other over the whole time axis: SB (t) = RR SA (t), which is virtually impossible to occur in real life examples unless the survival curves are identical (RR = 1), because it would mean that one of the two populations cannot start with 100% of the individuals at risk in t = 0. Therefore, in survival analysis, estimation of the hazard ratio will provide a suitable summary measure of the relative survival experiences in two populations over the whole time axis whenever the assumption of a roughly constant relative hazard function seems plausible. Estimation of a risk ratio only makes sense if, for some reason, the populations should be compared with each other at one particular point in time. Table 2 summarizes the properties of hazard ratio and risk ratio.
4 EXAMPLE ON THE USE AND USEFULNESS OF HAZARD RATIOS

The estimation of the hazard ratio is illustrated by means of the example of a clinical trial (hypothetical data) introduced in a seminal paper by Peto et al. (1). In this hypothetical example, 25 patients were randomized between two treatment arms A and B, and the effect of treatment on the survival time of the patients shall be analyzed. Additionally, as a covariate, the renal function of the patients at the time of randomization is known, which may also have an influence on survival. Not all patients were observed until death and therefore their survival time was censored at the end of follow-up (censoring). The survival time is calculated as number of days from randomization to death if the patient died during follow-up, or as number of days from randomization to the end
Table 3. Data of Patients of Hypothetical Study

Patient number    Treatment(1)    Renal function(2)    Survival time(3)    Survival time censored?(4)
 1                A               I                       8               N
 2                B               N                     180               N
 3                B               N                     632               N
 4                A               N                     852               Y
 5                A               I                      52               N
 6                B               N                    2240               Y
 7                A               N                     220               N
 8                A               I                      63               N
 9                B               N                     195               N
10                B               N                      76               N
11                B               N                      70               N
12                A               N                       8               N
13                B               I                      13               N
14                B               N                    1990               Y
15                A               N                    1976               Y
16                B               I                      18               N
17                B               N                     700               N
18                A               N                    1296               Y
19                A               N                    1460               Y
20                B               N                     210               N
21                A               I                      63               N
22                A               N                    1328               Y
23                B               N                    1296               N
24                A               N                     365               Y
25                B               I                      23               N

(1) Treatment group (A or B). (2) Renal function (N: normal or I: impaired). (3) Time between randomization and death or end of follow-up (survival time) in days. (4) Indicator whether the patient was observed until death (survival time censored = N) or not (survival time censored = Y).
of follow-up. Additionally, an indicator is necessary giving the information whether the calculated survival time is censored or uncensored. Table 3 shows the data for the 25 study patients.

5 AD HOC ESTIMATOR OF THE HAZARD RATIO

The effect of treatment on the survival time of the patients can be characterized quantitatively through an estimate of the hazard ratio between the treatment groups A and B. In doing so, it is assumed implicitly that the hazard ratio is constant in time. At first, the recorded survival times in both treatment groups are reviewed (Table 4). For a calculation of the hazard ratio estimator, the survival times have to be arranged in the following way: For each time point,
when one or more deaths occurred (death times = uncensored survival times), the number of patients who died at that time and the number of patients alive just before this time point are needed, separately for both treatment groups. This information is displayed in Table 5. In total, the observed number of deaths in treatment group i (i = A, B) is given by

O_i = \sum_{j=1}^{J} d_{ij}

Under the hypothesis that the hazard or instantaneous risk of dying in treatment group A is identical to the instantaneous risk of dying in treatment group B at all times, the expected number of deaths in treatment group i (i = A, B) is given by

E_i = \sum_{j=1}^{J} d_j \frac{n_{ij}}{n_j}

The reasoning behind this calculation is as follows: If the hazards are identical in both groups and a total of dj deaths are observed at time point tj, then one expects dj nij/nj out of these deaths to come from group i, because nij/nj is the proportion of patients of group i among the patients at risk of death at time point tj. The total number in treatment group i is obtained by summation over all time points. Table 6 shows the required arrangement of the data and the calculations for the example, which results in OA = 6, OB = 11, EA = 8.338, and EB = 8.662. Note that the sum of observed deaths and the sum of expected deaths is the same: OA + OB = EA + EB. This relationship provides a simple check for the correctness of the calculations. An ad hoc estimator of the hazard ratio between treatment group A and treatment group B is then given by the ratio of the two relative proportions of observed to expected deaths:

HR_1 = \frac{O_A / E_A}{O_B / E_B}

which may be interpreted as the ratio of the relative death rates in the treatment groups (1, 2). In this example, this estimator gives a hazard ratio of HR1 = (6/8.338)/(11/8.662) = 0.5666 (i.e., the hazard rate or instantaneous risk of dying in treatment group A is estimated as about 57% of that in treatment group B).

6 CONFIDENCE INTERVAL OF THE AD HOC ESTIMATOR

If no difference existed in the risk of dying between the treatment groups, the hazard ratio would be equal to 1. In order to give information about the precision of the estimator, and in order to judge whether the observed difference between treatment groups is a result of chance, a confidence interval for the hazard ratio should be calculated. With this confidence interval, the hypothesis θ = 1 of no treatment effect (i.e., no difference of the hazard functions of treatment groups A and B) may also be tested. This test may be performed at significance level α by calculating a 100(1 − α)% confidence interval. If this interval excludes 1, it may be concluded that at level α a significant difference between treatment groups exists.
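The observed and expected numbers of deaths and the ad hoc estimator can be reproduced with a short, self-contained Python sketch. The data are those of Table 3; the function name `observed_expected` is an illustrative choice, and the standard-error formula sqrt(1/E_A + 1/E_B) used for the interval anticipates the formula introduced in the confidence-interval discussion below.

```python
import math

# Data of Table 3: (treatment, renal function, survival time in days, died?)
patients = [
    ("A", "I", 8, True),    ("B", "N", 180, True),   ("B", "N", 632, True),
    ("A", "N", 852, False), ("A", "I", 52, True),    ("B", "N", 2240, False),
    ("A", "N", 220, True),  ("A", "I", 63, True),    ("B", "N", 195, True),
    ("B", "N", 76, True),   ("B", "N", 70, True),    ("A", "N", 8, True),
    ("B", "I", 13, True),   ("B", "N", 1990, False), ("A", "N", 1976, False),
    ("B", "I", 18, True),   ("B", "N", 700, True),   ("A", "N", 1296, False),
    ("A", "N", 1460, False),("B", "N", 210, True),   ("A", "I", 63, True),
    ("A", "N", 1328, False),("B", "N", 1296, True),  ("A", "N", 365, False),
    ("B", "I", 23, True),
]

def observed_expected(data):
    """Observed and expected deaths per group, as arranged in Tables 5 and 6."""
    death_times = sorted({t for (_, _, t, died) in data if died})
    O = {"A": 0.0, "B": 0.0}
    E = {"A": 0.0, "B": 0.0}
    for tj in death_times:
        # number at risk just before tj and number of deaths at tj, per group
        n = {g: sum(1 for (grp, _, t, _) in data if grp == g and t >= tj) for g in "AB"}
        d = {g: sum(1 for (grp, _, t, died) in data if grp == g and t == tj and died) for g in "AB"}
        dj, nj = d["A"] + d["B"], n["A"] + n["B"]
        for g in "AB":
            O[g] += d[g]
            E[g] += dj * n[g] / nj
    return O["A"], O["B"], E["A"], E["B"]

O_A, O_B, E_A, E_B = observed_expected(patients)
HR1 = (O_A / E_A) / (O_B / E_B)
se = math.sqrt(1 / E_A + 1 / E_B)
ci = (math.exp(math.log(HR1) - 1.96 * se), math.exp(math.log(HR1) + 1.96 * se))
print(O_A, O_B, round(E_A, 3), round(E_B, 3), round(HR1, 4),
      tuple(round(c, 3) for c in ci))
# 6.0 11.0 8.338 8.662 0.5667 (0.219, 1.467)
# (the text's 0.5666 uses the rounded values 8.338 and 8.662)
```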
Table 4. Survival Times by Treatment Group (+: Censored Survival Time)

Treatment group A: 8, 8, 52, 63, 63, 220, 365+, 852+, 1296+, 1328+, 1460+, 1976+
Treatment group B: 13, 18, 23, 70, 76, 180, 195, 210, 632, 700, 1296, 1990+, 2240+
Table 5. Arrangement of Data for Hazard Ratio Estimation

Death time   Number of deaths              Number at risk
             Group A   Group B   Total     Group A   Group B   Total
t1           dA1       dB1       d1        nA1       nB1       n1
t2           dA2       dB2       d2        nA2       nB2       n2
t3           dA3       dB3       d3        nA3       nB3       n3
...          ...       ...       ...       ...       ...       ...
tJ           dAJ       dBJ       dJ        nAJ       nBJ       nJ
Total        OA        OB        OA + OB

tj = ordered death times (uncensored survival times) in ascending order; j = 1, ..., J
dij = observed number of deaths at time tj in treatment group i; i = A, B; j = 1, ..., J
dj = observed total number of deaths at time tj; j = 1, ..., J
nij = number of patients being alive in treatment group i just before time tj (called 'number at risk'); i = A, B; j = 1, ..., J
nj = total number of patients being alive just before time tj; j = 1, ..., J
J = number of death times
A confidence interval for the hazard ratio may be calculated by considering the logarithm of the hazard ratio, because this is approximately normally distributed. A 100(1 − α)% confidence interval for the logarithm of the hazard ratio is given by

[log(HR_1) − z_{1−α/2} SE(log(HR_1)), log(HR_1) + z_{1−α/2} SE(log(HR_1))]

where z_{1−α/2} is the (1 − α/2)-quantile of the standard normal distribution. For a 95% confidence interval, this quantity is equal to 1.96 (3). SE(log(HR_1)) is an estimator for the standard error of the logarithm of the estimated hazard ratio and is given by

SE(log(HR_1)) = \sqrt{1/E_A + 1/E_B}

A 100(1 − α)% confidence interval for the hazard ratio is then given by

[exp(log(HR_1) − z_{1−α/2} SE(log(HR_1))), exp(log(HR_1) + z_{1−α/2} SE(log(HR_1)))]

In this example, the logarithm of the hazard ratio is equal to −0.5681, the standard error of the logarithm of the estimated hazard ratio is equal to 0.4852, and the 95% confidence interval of the hazard ratio estimator HR1 is equal to

[exp(−0.5681 − 1.96 × 0.4852), exp(−0.5681 + 1.96 × 0.4852)] = [0.219, 1.467]

7 AD HOC ESTIMATOR STRATIFIED FOR THE COVARIATE RENAL FUNCTION

In the previous estimation of the hazard ratio, a simple two-group comparison of treatment A and B was performed without taking into account covariate information of the patients (i.e., in this example, the renal function). Although the patients were assigned randomly to the treatment groups, the proportion of patients with impaired renal function is not exactly identical in treatment groups A and B (group A: 4 out of 12 (33%) patients with impaired renal function; group B: 3 out of 13 (23%) patients with impaired renal function). Thus, if the covariate renal function had an influence on survival, it should be taken into account in the analysis of the effects of treatment on survival.
Table 6. Arrangement of Data of the Hypothetical Study for Hazard Ratio Estimation

Death      Number of deaths          Number at risk            Expected number of deaths
time tj    dAj    dBj    dj          nAj    nBj    nj          Group A (dj nAj/nj)   Group B (dj nBj/nj)
8          2      0      2           12     13     25          0.960                 1.040
13         0      1      1           10     13     23          0.435                 0.565
18         0      1      1           10     12     22          0.455                 0.545
23         0      1      1           10     11     21          0.476                 0.524
52         1      0      1           10     10     20          0.500                 0.500
63         2      0      2           9      10     19          0.947                 1.053
70         0      1      1           7      10     17          0.412                 0.588
76         0      1      1           7      9      16          0.438                 0.562
180        0      1      1           7      8      15          0.467                 0.533
195        0      1      1           7      7      14          0.500                 0.500
210        0      1      1           7      6      13          0.538                 0.462
220        1      0      1           7      5      12          0.583                 0.417
632        0      1      1           5      5      10          0.500                 0.500
700        0      1      1           5      4      9           0.556                 0.444
1296       0      1      1           4      3      7           0.571                 0.429
Total      6      11     17                                     8.338                 8.662
Table 7. Arrangement of Data of the Hypothetical Study for Hazard Ratio Estimation Stratified for Renal Function

Renal function normal:
Death      Number of deaths          Number at risk            Expected number of deaths
time tj    dAj    dBj    dj          nAj    nBj    nj          Group A (dj nAj/nj)   Group B (dj nBj/nj)
8          1      0      1           8      10     18          0.444                 0.556
70         0      1      1           7      10     17          0.412                 0.588
76         0      1      1           7      9      16          0.438                 0.562
180        0      1      1           7      8      15          0.467                 0.533
195        0      1      1           7      7      14          0.500                 0.500
210        0      1      1           7      6      13          0.538                 0.462
220        1      0      1           7      5      12          0.583                 0.417
632        0      1      1           5      5      10          0.500                 0.500
700        0      1      1           5      4      9           0.556                 0.444
1296       0      1      1           4      3      7           0.571                 0.429
Total      2      8      10                                     5.009                 4.991

Renal function impaired:
Death      Number of deaths          Number at risk            Expected number of deaths
time tj    dAj    dBj    dj          nAj    nBj    nj          Group A (dj nAj/nj)   Group B (dj nBj/nj)
8          1      0      1           4      3      7           0.571                 0.429
13         0      1      1           3      3      6           0.500                 0.500
18         0      1      1           3      2      5           0.600                 0.400
23         0      1      1           3      1      4           0.750                 0.250
52         1      0      1           3      0      3           1.000                 0.000
63         2      0      2           2      0      2           2.000                 0.000
Total      4      3      7                                      5.421                 1.579
For an analysis of the effect of the renal function on survival, one may use the same methods as described above. A simple two-group comparison of patients with impaired and with normal renal function is performed by estimating the hazard ratio with a corresponding 95% confidence interval. The calculations are not outlined in detail; only the result is shown. Using the notation Oi and Ei, i = I, N, for the observed and expected number of deaths in patients with impaired and normal renal function, the results are as follows: OI = 7, ON = 10, EI = 1.6, and EN = 15.4, which results in an estimated hazard ratio of patients with impaired renal function versus patients with normal renal function of 6.7375 with a 95% confidence interval of [1.3227, 34.319]. Thus, impaired renal function has a strong negative influence on survival, because the confidence interval is far from 1.
This result indicates that the analysis of the effect of treatment on survival should be corrected for the renal function of the patients. For this purpose, a so-called stratified analysis will be performed, which means the calculations of observed and expected numbers of deaths in treatment groups A and B described above are done separately in patients with impaired and normal renal function and combined thereafter to get a hazard ratio estimator of treatment group A versus B. The calculations are shown in Table 7. The stratified calculation leads to the following numbers for observed and expected numbers in treatment groups A and B: OA = 2 + 4 = 6, OB = 8 + 3 = 11, EA = 5.009 + 5.421 = 10.43, and EB = 4.991 + 1.579 = 6.57. Inserting these numbers in the same formulas as used for the unstratified estimator (see above), the hazard ratio estimator of treatment group A versus treatment group
B stratified for renal function is then HR2 = (6/10.43)/(11/6.57) = 0.3436 with a 95% confidence interval of [0.129, 0.912]. With the stratified analysis, one may now conclude that treatment A is superior to treatment B with respect to survival, because the confidence interval does not include 1. As mentioned above, a strong effect of the renal function exists on survival, and the proportion of patients with impaired renal function is slightly higher in treatment group A than in treatment group B. The stratified analysis adjusts for this imbalance and, therefore, is able to show more clearly the superiority of treatment A as compared with treatment B with respect to survival.
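The stratified estimate and its confidence interval can be checked with a few lines of Python, using the stratum totals of Table 7 and the same standard-error formula sqrt(1/E_A + 1/E_B) as for the unstratified estimator (this choice of formula is an assumption that reproduces the interval quoted in the text).

```python
import math

# Stratum totals taken from Table 7
strata = {
    "normal":   {"O_A": 2, "O_B": 8, "E_A": 5.009, "E_B": 4.991},
    "impaired": {"O_A": 4, "O_B": 3, "E_A": 5.421, "E_B": 1.579},
}

O_A = sum(s["O_A"] for s in strata.values())
O_B = sum(s["O_B"] for s in strata.values())
E_A = sum(s["E_A"] for s in strata.values())
E_B = sum(s["E_B"] for s in strata.values())

HR2 = (O_A / E_A) / (O_B / E_B)
se = math.sqrt(1 / E_A + 1 / E_B)
ci = (math.exp(math.log(HR2) - 1.96 * se), math.exp(math.log(HR2) + 1.96 * se))
print(round(HR2, 4), tuple(round(c, 3) for c in ci))
# 0.3436 (0.129, 0.912)
```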
8 PROPERTIES OF THE AD HOC ESTIMATOR
Although it does not have optimal statistical properties, the ad hoc estimator of the hazard ratio above has gained popularity because of its simplicity. One should remark that this estimate is not "consistent" for a true hazard ratio different from 1. With increasing sample size, this estimate does not necessarily converge to the true hazard ratio, which is an important property usually required for a good statistical estimator. Thus, this ad hoc estimate should only be used to get a rough impression of the size of the difference between treatment groups; in general, estimators with better mathematical properties should be applied. One such class of improved estimators is the class of so-called generalized rank estimators.

9 CLASS OF GENERALIZED RANK ESTIMATORS OF THE HAZARD RATIO

The class of so-called generalized rank estimators of the hazard ratio is defined by

HR_{GR} = \frac{\sum_{j=1}^{J} w_j \, d_{Aj}/n_{Aj}}{\sum_{j=1}^{J} w_j \, d_{Bj}/n_{Bj}}
which is the ratio of the sum of weighted death rates between treatment groups, weighted by some known factor wj . The construction of a statistical test based on the
class of generalized rank estimators leads to the class of so-called generalized rank tests. For different choices of the weight factor wj , different well-known two-sample tests for survival data result. Different weights can be used to weight deaths differently according to whether they occur early or late in time. Therefore, the resulting estimators have different statistical properties depending on the form of the true hazard functions. In particular, the weight factor
w_j = \frac{n_{Aj} n_{Bj}}{n_j}

results in the well-known logrank test. Under the proportional hazard assumption (i.e., an underlying constant hazard ratio), the logrank test is the most efficient test in the class of generalized rank tests, which means the logrank test is the most powerful test in this class. The generalized rank estimator with this so-called logrank weight factor will then be denoted by HR3 and can be expressed as

HR_3 = \frac{\sum_{j=1}^{J} \frac{n_{Aj} n_{Bj}}{n_j} \cdot \frac{d_{Aj}}{n_{Aj}}}{\sum_{j=1}^{J} \frac{n_{Aj} n_{Bj}}{n_j} \cdot \frac{d_{Bj}}{n_{Bj}}} = \frac{\sum_{j=1}^{J} \frac{n_{Bj}}{n_j} d_{Aj}}{\sum_{j=1}^{J} \frac{n_{Aj}}{n_j} d_{Bj}}
For this example, the necessary calculations are displayed in Table 8. The generalized rank estimator with logrank weights results in

HR_3 = \frac{3.009}{5.347} = 0.5627
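The simplified right-hand form of the estimator makes the calculation easy to script. The following minimal sketch uses the death-time summaries of Table 6; it is an illustration, not part of any standard package.

```python
# Death-time summaries taken from Table 6: (t_j, d_Aj, d_Bj, n_Aj, n_Bj)
rows = [
    (8, 2, 0, 12, 13), (13, 0, 1, 10, 13), (18, 0, 1, 10, 12), (23, 0, 1, 10, 11),
    (52, 1, 0, 10, 10), (63, 2, 0, 9, 10), (70, 0, 1, 7, 10), (76, 0, 1, 7, 9),
    (180, 0, 1, 7, 8), (195, 0, 1, 7, 7), (210, 0, 1, 7, 6), (220, 1, 0, 7, 5),
    (632, 0, 1, 5, 5), (700, 0, 1, 5, 4), (1296, 0, 1, 4, 3),
]

numerator = sum(nB / (nA + nB) * dA for (_, dA, dB, nA, nB) in rows)
denominator = sum(nA / (nA + nB) * dB for (_, dA, dB, nA, nB) in rows)
HR3 = numerator / denominator
print(round(numerator, 3), round(denominator, 3), round(HR3, 4))
# 3.009 5.347 0.5628 (0.5627 in the text, which uses rounded intermediate values)
```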
The calculation of a confidence interval for the hazard ratio based on this estimator is not as easy as for the ad hoc estimator given above. The formula is rather complicated, so it is omitted here, and only the result is presented: [0.214, 1.477]. It is also possible, analogously to the stratified ad hoc estimator, to calculate a generalized rank estimator with logrank weights that is stratified for renal function. This calculation is not presented here, but another possibility for stratified estimation is presented later in this article.
As mentioned above, under the proportional hazard assumption, the logrank test is the most efficient test in the class of generalized rank tests. Accordingly, the corresponding hazard ratio estimator with logrank weights is the most efficient estimator in the class of generalized rank estimators, which means it has smaller variance than any other generalized rank estimator. Another estimator of the hazard ratio that has even better mathematical properties than the logrank estimator is the estimator from Cox's proportional hazard model.
10 ESTIMATION OF THE HAZARD RATIO WITH COX’S PROPORTIONAL HAZARD MODEL Two reasons exist to prefer Cox’s proportional hazard model (Cox model) for estimation of the hazard ratio over the methods presented so far. The most important reason is the fact that a regression model like the Cox model allows one to analyze the effect of different factors on survival simultaneously. The Cox model is the most prominent multiple regression model in survival analysis.
The use of multiple regression models is necessary whenever the effects of other patient characteristics besides treatment, so-called prognostic factors or covariates, are of additional interest. Reasons for this interest may either stem from the necessity of adjusting the effect of treatment for other factors, as in the example above, or from the desire to study the effect of these factors on survival per se. Another reason is that the estimator of the hazard ratio derived from the Cox model has the best mathematical properties if the proportional hazards assumption is true. It is the most efficient estimator (i.e., it has the smallest variance). In the Cox model, treatment and other factors are assumed to affect the hazard function in a multiplicative way, which implies the proportional hazards assumption (i.e., a constant hazard ratio over time). For an illustration of the Cox model with the example of the hypothetical study, an indicator X 1 for the randomized treatment (X 1 = 1, if treatment A is allocated, X 1 = 0, if treatment B is allocated) and an indicator X 2 for the factor renal function (X 2 = 1, if renal
Table 8. Calculation of the Generalized Rank Estimator with Logrank Weights HR3 for the Hazard Ratio Between Treatment Groups

Death      Weight            Death rate              Numerator term          Denominator term
time tj    nAj nBj / nj      Group A     Group B     (weight × dAj/nAj)      (weight × dBj/nBj)
8          6.240             0.167       0           1.042                   0
13         5.652             0           0.077       0                       0.435
18         5.455             0           0.083       0                       0.455
23         5.238             0           0.091       0                       0.476
52         5.000             0.100       0           0.500                   0
63         4.737             0.222       0           1.053                   0
70         4.118             0           0.100       0                       0.412
76         3.938             0           0.111       0                       0.438
180        3.733             0           0.125       0                       0.467
195        3.500             0           0.143       0                       0.500
210        3.231             0           0.167       0                       0.539
220        2.917             0.143       0           0.417                   0
632        2.500             0           0.200       0                       0.500
700        2.222             0           0.250       0                       0.556
1296       1.714             0           0.333       0                       0.571
Total                                                3.009                   5.347
function is impaired, and X2 = 0, if renal function is normal) are introduced. In the simplest situation of a Cox model, the hazard function depends on only one factor. If it is assumed that the hazard function only depends on the randomized treatment, the Cox model is formulated as

λ(t|X_1) = λ_0(t) · exp(β_1 X_1)

The so-called regression coefficient β1 represents the effect of X1 on the hazard function. With the notation used in the introduction, this can easily be seen by calculating the ratio of the hazard functions θ(t) of patients with X1 = 1 and X1 = 0:

θ(t) = \frac{λ_A(t)}{λ_B(t)} = \frac{λ(t|X_1 = 1)}{λ(t|X_1 = 0)} = \frac{λ_0(t) · exp(β_1)}{λ_0(t)} = exp(β_1)

which shows that the hazard ratio of treatment group A versus treatment group B is equal to exp(β1) in the formulation of the Cox model, and reflects the fact that the hazard ratio in the Cox model is constant over time t. Thus, a Cox model including only one factor yields, as an alternative to the ad hoc estimator HR1 and the generalized rank estimator with logrank weights HR3, another possibility for estimation of the hazard ratio between treatment groups. One drawback of using the Cox model for estimation of the hazard ratio is that the calculations cannot be performed by hand. The regression coefficient β1 is estimated from empirical patient data by the so-called maximum partial likelihood procedure. This method requires complicated mathematical iteration procedures, which can only be done by appropriate computer software like the procedure PHREG of the Statistical Analysis System (SAS). In the example, the calculation results in a value of −0.5728 for β1, and consequently the hazard ratio of treatment group A versus treatment group B from the Cox model is estimated as HR4 = exp(−0.5728) = 0.5639. A confidence interval for the hazard ratio based on the Cox model can again be calculated by considering the logarithm of the hazard ratio log(HR4) = β1, because it is approximately normally distributed. A 100(1 − α)% confidence interval for the hazard ratio is then, as in the above situation of the ad hoc estimation, given by

[exp(log(HR_4) − z_{1−α/2} SE(log(HR_4))), exp(log(HR_4) + z_{1−α/2} SE(log(HR_4)))]

where z_{1−α/2} is again the (1 − α/2)-quantile of the standard normal distribution (for a 95% confidence interval equal to 1.96). SE(log(HR4)) is an estimator for the standard error of the logarithm of the estimated hazard ratio. In the Cox model, no explicit formula can be given for this quantity, and it has to be calculated with appropriate computer software, too. In the example, the estimated standard error is SE(log(HR4)) = 0.5096, resulting in a 95% confidence interval of the hazard ratio estimator HR4 of [0.208, 1.531]. If one takes the renal function into account in the analysis of the treatment effect, similarly to the calculation of the ad hoc estimator stratified for renal function shown above, it can be done very easily with the Cox model. Depending on the randomized treatment and on the renal function, using the indicators X1 and X2, the hazard function is now formulated as

λ(t|X_1, X_2) = λ_0(t) · exp(β_1 X_1 + β_2 X_2)

The unknown regression coefficient β1 again represents the effect of X1 on the hazard function, as can be seen easily by calculating the ratio of the hazard functions θ(t) of patients with X1 = 1 and X1 = 0:

θ(t) = \frac{λ_A(t)}{λ_B(t)} = \frac{λ(t|X_1 = 1, X_2)}{λ(t|X_1 = 0, X_2)} = \frac{λ_0(t) · exp(β_1 + β_2 X_2)}{λ_0(t) · exp(β_2 X_2)} = exp(β_1)

Again, one can see that the hazard ratio of treatment group A versus treatment group B is equal to exp(β1). Similarly, the ratio of the hazard functions of patients with an impaired renal function versus patients with a normal renal function is given by

\frac{λ(t|X_1, X_2 = 1)}{λ(t|X_1, X_2 = 0)} = \frac{λ_0(t) · exp(β_1 X_1 + β_2)}{λ_0(t) · exp(β_1 X_1)} = exp(β_2)
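The text uses SAS PROC PHREG for the fit. As a hedged alternative, the sketch below assumes the third-party Python packages pandas and lifelines are available and reuses the `patients` list from the earlier sketch that encodes Table 3. With such a small data set the fit may emit convergence warnings, and because of differences in the handling of tied death times the estimates will only approximately match those reported in Table 9.

```python
import pandas as pd
from lifelines import CoxPHFitter

# `patients` is the list of (treatment, renal function, time, died?) tuples
# defined in the earlier sketch that encodes Table 3.
df = pd.DataFrame({
    "time": [t for (_, _, t, _) in patients],
    "event": [int(died) for (_, _, _, died) in patients],
    "treat_A": [int(grp == "A") for (grp, _, _, _) in patients],       # indicator X1
    "impaired": [int(renal == "I") for (_, renal, _, _) in patients],  # indicator X2
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
print(cph.summary[["coef", "exp(coef)", "se(coef)"]])
# The coefficient for treat_A should be close to -1.24 (hazard ratio about 0.29)
# and the coefficient for impaired close to 4.1; cf. Table 9.
```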
Table 9. Result of Simultaneous Analysis of Effects of Treatment and Renal Function on Survival with the Cox Model

Factor                                   log(HR)    SE(log(HR))   Hazard Ratio   95% confidence interval
Treatment (A vs. B)                      −1.2431    0.5993        0.2885         [0.089, 0.934]
Renal function (impaired vs. normal)      4.1055    1.1645        60.673         [6.191, 594.62]
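The hazard ratio and confidence interval columns of Table 9 follow directly from the coefficient and standard error columns; the short check below is illustrative only and simply exponentiates the reported values.

```python
import math

z = 1.96  # 97.5% quantile of the standard normal distribution
coefficients = {  # log(HR) and SE(log(HR)) as reported in Table 9
    "Treatment (A vs. B)": (-1.2431, 0.5993),
    "Renal function (impaired vs. normal)": (4.1055, 1.1645),
}

for factor, (log_hr, se) in coefficients.items():
    hr = math.exp(log_hr)
    ci = (math.exp(log_hr - z * se), math.exp(log_hr + z * se))
    print(factor, round(hr, 3), tuple(round(x, 3) for x in ci))
# Treatment: about 0.289 with interval (0.089, 0.934); renal function: about
# 60.7 with interval roughly (6.2, 594), matching Table 9 up to rounding.
```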
Table 10. Overview of the Results of the Different Estimators for the Hazard Ratio of Treatment Group A vs. Treatment Group B

Estimator                                          Accounts for renal function   Notation   Hazard Ratio   95% confidence interval
Ad hoc estimator                                   no                            HR1        0.5666         [0.219, 1.467]
Ad hoc estimator stratified for renal function     yes                           HR2        0.3436         [0.129, 0.912]
Generalized rank estimator with logrank weights    no                            HR3        0.5627         [0.214, 1.477]
Cox model                                          no                            HR4        0.5639         [0.208, 1.531]
Cox model adjusted for renal function              yes                           HR5        0.2885         [0.089, 0.934]
showing that the hazard ratio of patients with impaired renal function versus patients with normal renal function is equal to exp(β2). The important difference from the model above, where X2 was not included, is that the regression coefficients β1 and β2 will now be estimated simultaneously from the data. In the Cox model, this analysis of the treatment effect is called adjusted for the renal function. Besides the stratified ad hoc estimator HR2 and the stratified generalized rank estimator, this Cox model estimator of the treatment effect represents another possibility for taking the renal function of the patients into account when estimating the hazard ratio between treatment groups. While in the stratified approaches renal function is only included in order to enhance the estimator of the treatment effect, in the Cox model, one has the additional benefit that its effect on survival can be studied simultaneously with the effect of treatment. Table 9 shows the result of this analysis. The estimator of the hazard ratio between treatment groups from this analysis is then HR5 = exp(−1.2431) = 0.2885 with a 95% confidence interval of [0.089, 0.934]. Table 10 gives an overview of the results of the different estimators for the hazard ratio of treatment group A versus treatment group B presented in this article.
The estimators taking the renal function of the patients into account, HR2 and HR5, come to the same conclusion that treatment A is superior to treatment B with respect to survival, because the confidence intervals do not include 1.

11 FURTHER READING
Definition and estimation of hazard ratios belong to the broader subject of survival analysis. In a series of 4 papers, Clark et al. (4–7) give an excellent, comprehensive introduction with great emphasis on clinical applications. Girling et al. (2) give a more basic introduction in the general context of clinical trials in their chapter 9.3.4, and a simple formula for sample size calculation in chapter 5.4.3. For those that aim to understand the subject in depth, including the more technical aspects even with limited statistical experience, the monograph by Marubini and Valsecchi (8) is recommended.

REFERENCES

1. R. Peto et al., Design and analysis of randomized clinical trials requiring prolonged observation of each patient. II. Analysis and examples. Brit. J. Cancer 1977; 35: 1–39.
2. D. J. Girling et al., Clinical Trials in Cancer: Principles and Practice. Oxford: Oxford University Press, 2003.
3. D. G. Altman, Practical Statistics for Medical Research. London: Chapman & Hall, 1991.
4. T. G. Clark, M. J. Bradburn, S. B. Love, and D. G. Altman, Survival analysis part I: basic concepts and first analyses. Brit. J. Cancer 2003; 89: 232–238.
5. M. J. Bradburn, T. G. Clark, S. B. Love, and D. G. Altman, Survival analysis part II: multivariate data analysis–an introduction to concepts and methods. Brit. J. Cancer 2003; 89: 431–436.
6. M. J. Bradburn, T. G. Clark, S. B. Love, and D. G. Altman, Survival analysis part III: multivariate data analysis–choosing a model and assessing its adequacy and fit. Brit. J. Cancer 2003; 89: 605–611.
7. T. G. Clark, M. J. Bradburn, S. B. Love, and D. G. Altman, Survival analysis part IV: further concepts and methods in survival analysis. Brit. J. Cancer 2003; 89: 781–786.
8. E. Marubini and M. G. Valsecchi, Analysing Survival Data from Clinical Trials and Observational Studies. Chichester: John Wiley & Sons, 1995.
Heritability

Before discussing what genetic heritability is, it is important to be clear about what it is not. For a binary trait, such as whether or not an individual has a disease, heritability is not the proportion of disease in the population attributable to, or caused by, genetic factors. For a continuous trait, genetic heritability is not a measure of the proportion of an individual's score attributable to genetic factors. Heritability is not about cause per se, but about the causes of variation in a trait across a particular population.
Definitions

Genetic heritability is defined for a quantitative trait. In general terms it is the proportion of variation attributable to genetic factors. Following a genetic and environmental variance components approach, let Y have a mean µ and variance σ^2, which can be partitioned into genetic and environmental components of variance, such as additive genetic variance σ_a^2, dominance genetic variance σ_d^2, common environmental variance σ_c^2, individual specific environmental variance σ_e^2, and so on. Genetic heritability in the narrow sense is defined as

σ_a^2 / σ^2,    (1)

while genetic heritability in the broad sense is defined as

σ_g^2 / σ^2,    (2)

where σ_g^2 includes all genetic components of variance, including perhaps components due to epistasis (gene–gene interactions; see Genotype) [3]. In addition to these random genetic effects, the total genetic variation could also include that variation explained when the effects of measured genetic markers are modeled as a fixed effect on the trait mean. The concept of genetic heritability, which is really only defined in terms of variation in a quantitative trait, has been extended to cover categorical traits by reference to a genetic liability model. It is assumed that there is an underlying, unmeasured continuous "liability" scale divided into categories by "thresholds". Under the additional assumption that
the liability follows a normal distribution, genetic and environmental components of variance are estimated from the pattern of associations in categorical traits measured in relatives. The genetic heritability of the categorical trait is then often defined as the genetic heritability of the presumed liability (latent variable), according to (1) and (2).
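As a small numerical illustration of definitions (1) and (2), the sketch below computes narrow- and broad-sense heritability from a set of purely hypothetical variance components; the numbers are illustrative assumptions, and σ_g^2 is simplified to σ_a^2 + σ_d^2 (ignoring epistatic components).

```python
# Purely hypothetical variance components (illustrative numbers only)
var_a = 0.30  # additive genetic variance, sigma_a^2
var_d = 0.05  # dominance genetic variance, sigma_d^2
var_c = 0.15  # common environmental variance, sigma_c^2
var_e = 0.50  # individual-specific environmental variance (incl. measurement error)

total = var_a + var_d + var_c + var_e
h2_narrow = var_a / total             # Equation (1)
h2_broad = (var_a + var_d) / total    # Equation (2), assuming sigma_g^2 = sigma_a^2 + sigma_d^2
print(round(h2_narrow, 3), round(h2_broad, 3))  # 0.3 0.35
```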
Comments

There is no unique value of the genetic heritability of a characteristic. Heritability varies according to which factors are taken into account in specifying both the mean and the total variance of the population under consideration. That is to say, it is dependent upon modeling of the mean, and of the genetic and environmental variances and covariances (see Genetic Correlations and Covariances). Moreover, the total variance and the variance components themselves may not be constants, even in a given population. For example, even if the genetic variance actually increased with age, the genetic heritability would decrease with age if the variation in nongenetic factors increased with age more rapidly. That is to say, genetic heritability and genetic variance can give conflicting impressions of the "strength of genetic factors".

Genetic heritability will also vary from population to population. For example, even if the heritability of a characteristic in one population is high, it may be quite different in another population in which there is a different distribution of environmental influences.

Measurement error in a trait poses an upper limit on its genetic heritability. Therefore traits measured with large measurement error cannot have substantial genetic heritabilities, even if variation about the mean is completely independent of environmental factors. By the definitions above, one can increase the genetic heritability of a trait by measuring it more precisely, for example by taking repeat measurements and averaging, although strictly speaking the definition of the trait has been changed also. A trait that is measured poorly (in the sense of having low reliability) will inevitably have a low heritability because much of the total variance will be due to measurement error (σ_e^2). However, a trait with relatively little measurement error will have a high heritability if all the nongenetic factors are known and taken into account in the modeling of the mean.
Fisher [1] recognized these problems and noted that whereas . . . the numerator has a simple genetic meaning, the denominator is the total variance due to errors of measurement [including] those due to uncontrolled, but potentially controllable environmental variation. It also, of course contains the genetic variance . . . Obviously, the information contained in [the genetic variance] is largely jettisoned when its actual value is forgotten, and it is only reported as a ratio to this hotch-potch of a denominator.
and Vandenburg’s F = 1/[1 − σa2 /σ 2 )] [6]. Furthermore, the statistical properties of these estimators do not appear to have been studied.
References [1] [2]
[3] [4]
Historically, other quantities have also been termed heritabilities, but it is not clear what parameter is being estimated, e.g. Holzinger’s H = (rMZ − rDZ ) (the correlation between monozygotic twins minus the correlation between dizygotic twins) (see Twin Analysis) [2], Nichol’s H R = 2(rMZ − rDZ )/rMZ [5], the E of Neel & Schull [4] based on twin data alone,
[5]
[6]
Fisher, R.A. (1951). Limits to intensive production in animals, British Agricultural Bulletin 4, 217–218. Holzinger, K.J. (1929). The relative effect of nature and nurture influences on twin differences, Journal of Educational Psychology 20, 245–248. Lush, J.L. (1948). Heritability of quantitative characters in farm animals, Suppl. Hereditas 1948, 256–375. Neel, J.V. & Schull, W.J. (1954). Human Heredity. University of Chicago Press, Chicago. Nichols, R.C. (1965). The National Merit twin study, in Methods and Goals in Human Behaviour Genetics, S.G. Vandenburg, ed. Academic Press, New York. Vandenberg, S.G. (1966). Contributions of twin research to psychology, Psychological Bulletin 66, 327–352.
JOHN L. HOPPER
HISTORICAL CONTROL
NEAL THOMAS
Pfizer Inc., Global Research and Development, New London, Connecticut

1 HISTORICAL CONTROLS AND BIAS

The use of historical data from previously completed clinical trials, epidemiological surveys, and administrative databases to estimate the effect of placebo or standard-of-care treatment is one approach for assessing the benefit of a new therapy. Study designs without collection of concurrent control data have been largely replaced by randomized clinical trials (RCTs) with concurrently collected control data in many clinical settings. Studies comparing a new treatment with historical experience are frequently criticized and regarded by many researchers as unreliable for any purpose other than hypothesis generation because the historical controlled trial (HCT) may be subject to unknown sources of bias such as the following:

1. Changes in the patients over time due to changing treatment patterns. Improvements in diagnostic tests may also change the staging of patients receiving treatment.
2. Changes in the measurement of the disease. These changes could be due to evolving laboratory tests or shifts in clinical rating assessments.
3. Changes in the use of concomitant medications and treatments.
4. The lack of blinding when collecting new clinical data for comparison with a known target (control) performance is another source of potential bias. Because a new treatment is typically thought to be superior in order to merit evaluation, the lack of blinding may be partially responsible for the perception that HCTs are often biased in favor of new treatments.

Numerous articles focus on HCT designs, especially in the area of oncology. Some authors avoid explicit discussion of historical controls in their studies by comparing their results with an (assumed) standard response rate without acknowledgment of how the reference rate was established. Articles with references to the methodological literature have previously appeared in other encyclopedia articles by Gehan (3) as well as by Beach and Baron (4). Other common references to the comparative observational study literature are Cook and Campbell (5), Rosenbaum (6), and Rothman and Greenland (7).

2 METHODS TO IMPROVE THE QUALITY OF COMPARISONS WITH HISTORICAL DATA

There has been more acceptance of HCTs in therapeutic areas in which it is difficult to enroll patients in RCTs. HCTs are common in early development trials in oncology. The Simon two-stage design is an example of a commonly used HCT design with tumor response as the primary endpoint (8). Statisticians and other authors involved in these therapeutic areas are more positive regarding the value of HCTs, provided that they are well designed (3). Pocock (9) notes several features of a well-designed HCT:

1. The control group has received the precisely defined treatment in a recent previous study.
2. Criteria for eligibility and evaluation must be the same.
3. Prognostic factors should be known and be the same for both treatment groups.
4. No unexplained indications lead one to expect different results.

Covariate adjustment methods [Snedecor and Cochran (10)] can be applied in HCTs provided appropriately detailed patient-level data are available from earlier trials. If a large pool of patient-level data is available, matching methods (6,7) may also be used. These adjustments are examples of direct standardization in the observational study
literature. When patient-level data are not available, it may still be possible to assess the potential for bias using summaries of baseline covariate information. Specialized methods for covariate adjustments and differences in exposure and follow-up time have been developed for time-to-event endpoints (11).

3 BAYESIAN METHODS TO COMBINE CONCURRENT AND HISTORICAL CONTROLS

Bayesian statistical inference provides a method to combine data from historical and concurrently randomized controls (9,12). Within the Bayesian framework, it is possible to make explicit judgments about the relevance of past data accounting for potential bias. For example, a 1000-patient historical study may be regarded as equivalent to 100 concurrently randomized patients due to potentially large biases. This approach is ethically appealing because it can reduce the number of patients receiving placebo, while retaining the features of a randomized, blinded clinical trial. There has been extensive methodological development and application of this approach in preclinical studies using animal laboratory data (13–15). Because the animals are bred and maintained in relatively homogeneous conditions for several generations, it is possible to both limit and estimate the variability between cohorts over time. The variability between cohorts provides a more objective basis for weighting data from past cohorts with concurrently randomized controls.

4 EFFECTS BASED ON TIME-SERIES ESTIMATES

Time series of event rates in large populations, often based on large administrative or federal epidemiological data collections, are another source of historical control information. Because detailed data about treatments and baseline conditions are most likely missing, time series are the least reliable source of historical control information. Using time series statistical modeling (5,6), the response to prevailing treatments during a past time period can be projected to more recent time
periods, presuming no change in treatments has occurred. The projections are then compared with recent response rates after major changes in common treatments or policies. This methodological approach is common in marketing research. There are also applications for assessing the effects of medical treatments and environmental exposures. One famous example is the increasing rates of lung cancer that were associated with increased smoking rates in the early part of the twentieth century, which was cited as one source of evidence for the effect of cigarette smoking on lung cancer (16).

REFERENCES

1. W. Cochran and D. Rubin, Controlling bias in observational studies: A review. Sankhya, Series A 1973; 35: 417–446.
2. P. Rosenbaum and D. Rubin, The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70: 41–55.
3. E. Gehan, Historical controls. In: Encyclopedia of Statistical Sciences, vol. 3. New York: Wiley, 1983.
4. M. Beach and J. Baron, Bias from historical controls. In: Encyclopedia of Biostatistics, vol. 1. New York: Wiley, 1998.
5. T. Cook and D. Campbell, Quasi-Experimentation. Boston, MA: Houghton-Mifflin, 1979.
6. P. Rosenbaum, Observational Studies. New York: Springer-Verlag, 1995.
7. K. Rothman and S. Greenland, Modern Epidemiology, 2nd ed. Philadelphia, PA: Lippincott-Raven, 1998.
8. R. Simon, Optimal two-stage designs for phase II clinical trials. Controll. Clin. Trials 1989; 10: 1–10.
9. S. Pocock, The combination of randomized and historical controls in clinical trials. J. Chron. Dis. 1976; 29: 175–188.
10. G. Snedecor and W. Cochran, Statistical Methods, 8th ed. Ames, IA: Iowa State University Press, 1989.
11. N. Keiding, Historical controls in survival analysis. In: Encyclopedia of Biostatistics, vol. 3. New York: Wiley, 1998.
12. D. Berry and D. Stangl, Bayesian Biostatistics. New York: Marcel Dekker, 1996.
13. A. Dempster, M. Selwyn, and W. Weeks, Combining historical and randomized controls for assessing trends in proportions. J. Am. Stat. Assoc. 1983; 78: 221–227.
14. R. Tamura and S. Young, The incorporation of historical control information in tests of proportions: simulation study of Tarone's procedure. Biometrics 1988; 42: 221–227.
15. R. Tarone, The use of historical control information in testing for a trend in proportions. Biometrics 1982; 38: 215–220.
16. R. Doll and B. Hill, Smoking and carcinoma of the lung. Brit. Med. J. 1950: 739–747.
HYPOTHESIS
GERD ROSENKRANZ
Biostatistics and Statistical Reporting, Novartis Pharma AG, Basel, Switzerland

1 SCIENTIFIC HYPOTHESES

A hypothesis is "an idea or a suggestion put forward as a starting point for reasoning and explanation" (1). Hypotheses cannot be generated in a mechanistic fashion from empirical data: their invention is a creative process. In the course of this process, one tries to discover regularities or structures in a set of observations in an attempt to explain or organize the findings. This process is therefore inductive in nature (i.e., an attempt to derive general rules from one or more individual cases). The verification of hypotheses, however, needs a different approach. Popper (2) claimed that a necessary criterion for hypotheses to be scientific is that they are testable. Only those hypotheses that have stood up against relevant tests can be assigned a certain degree of validity and credibility. In fact, he emphasized that, strictly speaking, scientific hypotheses can only be falsified, but never become truly verified: the possibility always exists that future observations contradict a hypothesis.

An example of a scientific hypothesis that passed several tests before it was eventually falsified is Newton's theory of gravitation. It is based on the hypothesis that the gravitational forces between two bodies are proportional to the product of their masses divided by the square of their distance. One implication of this theory is that planets follow elliptical orbits on their way around the sun. When irregularities of the orbit of Uranus were observed in the nineteenth century, it first looked as if this phenomenon could not be explained by Newton's theory. However, if one postulates the existence of another (at that point in time) unknown planet that disturbs the movements of Uranus, Newton's theory would apply again. This reasoning led to the discovery of Neptune in 1846, which supported Newton's theory of gravitation. The theory, however, failed to explain irregularities of the orbit of Mercury, which could finally be understood using Einstein's theory of relativity.

According to Popper (2) and Hempel (3), the process of falsifying hypotheses has to be deductive rather than inductive. Only a deductive approach enables science to sort out incorrect hypotheses and to keep only those that passed multiple attempts to falsify them. The falsification process works in principle as follows: Let H denote a hypothesis and Ii a series of implications such that

H −→ I1 −→ . . . −→ Ik

where "−→" stands for "implies." If any of the implications turns out to be incorrect, then H is incorrect as well on purely logical grounds. In the contrary case, if Ik is true, then it cannot be concluded that H is true as well. It is easily seen that true implications can be derived from a wrong hypothesis, but not the other way around. In this sense, deductive reasoning provides firm criteria for the verification of scientific hypotheses.

The deductive method also helps to identify potential traps in hypothesis testing. Two examples are considered. First, assume that I is an implication of hypothesis H if a further assumption A holds; for example, H ∧ A −→ I. The symbol "∧" stands for the logical "and." If I can be proven to be incorrect, one can only conclude that H and A are not both correct. In particular, the falsification of H is only possible if assumption A is known to be true. Often, two or more competing hypotheses exist. In such a situation, it would be desirable to decide which one is correct. Unfortunately, this decision is not possible in general: Either one can falsify one or both, but the falsification of one does not imply the correctness of the other with the exception of the entirely trivial case in which one claims exactly the contrary of the other.
2 STATISTICAL HYPOTHESES

2.1 An Introductory Example

The deductive framework has been successful in testing (i.e., falsifying) hypotheses that claim to be correct under all circumstances (like the theory of gravitation). Such hypotheses are called general or universal. However, hypotheses exist that are not universal but probabilistic in nature. These hypotheses will be called statistical hypotheses. As an example, consider the hypothesis that a vaccine is effective. Although an effective vaccine increases the likelihood of being protected against a specific infection, a vaccine can never be completely protective. As a consequence, if some vaccinees become infected despite vaccination, it is generally not possible to claim that the vaccine is not protective at all: It can, nevertheless, protect the majority of vaccinees. This example demonstrates that, unlike universal hypotheses, probabilistic hypotheses cannot be falsified by a few counterexamples. A way to approach testability of statistical hypotheses is to make use of the law of large numbers: If the proportion of infected individuals among a large number of vaccinees is small, the hypothesis that the vaccine is protective is supported. If, on the contrary, the proportion of infected individuals is large, the hypothesis might be regarded as not acceptable. It needs to be emphasized, however, that regardless of how many vaccinated persons have been monitored to detect an infection, no complete certainty exists that the results look the same in the non-observed individuals. To look at the matter more closely, let E be an event of interest, like the breakthrough of an infection in a vaccinated individual, and let p = Pr[E], the probability that E occurs. Consider the hypothesis that this probability equals some number 0 < p0 < 1, or formally, H : p = p0. To test this hypothesis, consider performing n experiments or randomly sampling n objects and counting the number of events. If H is true and the experiments are independent (i.e., the occurrence of an event in one experiment is not predictive of an event in another), the probability to observe k events is given by

Pr[k|H, n] = \binom{n}{k} p_0^k (1 − p_0)^{n−k}    (1)
In the context of the vaccination example, one could determine a critical number 0 ≤ c(H, n) ≤ n such that if k ≥ c(H, n), one would reject the hypothesis because too many vaccinees would become infected. Otherwise, the hypothesis would have passed the test. The number c(H, n) can be reasonably defined such that the probability of the number of events to exceed c(H, n) is small (e.g., less than some 0 < α < 1) if H is true. Hence, c(H, n) is defined such that
Pr[reject H|H] = \sum_{j=c(H,n)}^{n} \binom{n}{j} p_0^j (1 − p_0)^{n−j} ≤ α    (2)

Although a test was found to reject H, the criterion derived above has been developed under additional assumptions: One assumption is that the experiments are independent. Another requirement is that the experiments should be identically distributed (i.e., come from a single, common distribution); otherwise it is not clear whether Pr[E] = p0 holds for all experiments. Hence, strictly speaking, if k ≥ c(H, n) holds, one has only found that Pr[E] = p0, the repeatability, and the independence of the experiments cannot all be true at the same time. The design and conduct of the experiments have to ensure that the assumptions of identically distributed and independent experiments are met in order to draw conclusions about the hypothesis of interest. The basic idea of statistical hypothesis tests is to identify events that are unlikely under a hypothesis H and to reject H when such events occur. It remains to be shown whether this approach delivers optimal tests or whether tests exist that enable H to be rejected with a higher probability given the same information. The key for answering this question is to compare the hypothesis H with another hypothesis K, which cannot be true at the same time; for example, K : p = p1 for some p1 > p0. The probability to reject H if K
is true is then given by

Pr[reject H|K] = \sum_{j=c(H,n)}^{n} \binom{n}{j} p_1^j (1 − p_1)^{n−j}    (3)
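As an illustration of how the critical number c(H, n) and the probabilities in Equations (2) and (3) can be evaluated in practice, the following sketch uses the binomial distribution from SciPy. The numerical values of n, p0, p1, and α are hypothetical choices for illustration only; they are not taken from the article.

```python
from scipy.stats import binom

# Hypothetical design values for illustration (not taken from the article)
n, p0, p1, alpha = 100, 0.10, 0.20, 0.025

# Smallest critical number c with P[X >= c | p0] <= alpha, where X ~ Binomial(n, p0);
# binom.sf(k - 1, n, p) equals P[X >= k]
c = next(k for k in range(n + 1) if binom.sf(k - 1, n, p0) <= alpha)

level = binom.sf(c - 1, n, p0)  # attained type I error, cf. Equation (2)
power = binom.sf(c - 1, n, p1)  # probability to reject H under p1, cf. Equation (3)
print(c, round(level, 4), round(power, 4))
```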
Although this probability depends on p1, the test based on k ≥ c(H, n) can be shown to be optimal for all p1 > p0.

2.2 The Structure of a Statistical Test

In the previous section, the idea of a statistical hypothesis and a test to reject it was described for a specific situation. It is worthwhile to discuss some general ideas of hypothesis testing before moving on to more complex hypotheses. The first step in constructing a statistical test is to summarize the information contained in data x = (x1, . . . , xn) in an adequate way by some function T(x). In the previous section, the data were 0 or 1 depending on whether an event occurred, and the statistic was the number of events (i.e., T(x) = \sum_{i=1}^{n} x_i). Next, one has to identify a rejection region S such that the hypothesis H will be rejected if T ∈ S. In the example above, S = {k; k ≥ c(H, n)}. T and S have to fulfill the following requirements. To reduce the risk to reject H in case it is correct, one requests

Pr[T ∈ S|H] ≤ α    (4)
for some small 0 < α < 1, which is called the level of the test. Second, under no alternative K should the probability of rejection be less than α, otherwise alternatives would exist under which the rejection of H is less likely than under H itself: Pr[T ∈ S|K ] ≥ α
(5)
Such tests are called unbiased. Ideally, one would also like to control the risk to fail to reject the hypothesis if it is incorrect, which is generally not possible. Instead one aims at finding an optimal test in the sense that for an alternative K, Pr[T ∈ S|K ] = max
(6)
Tests that fulfill Equation (6) for all alternatives K are called uniformly most powerful (UMP). As shown above, the test based on k ≥ c(H, n) is UMP for testing H : p = p0 against K : p = p1 for all p1 > p0 . UMP tests are always unbiased and exist for a wide variety of hypotheses (see Reference 4). Just as a test for a universal hypothesis aims to falsify the hypothesis, a test of a statistical hypothesis intends to reject it. However, some differences exist. A falsified universal hypothesis is falsified once for all, whereas some uncertainty about the correctness of the decision to reject a statistical hypothesis always exists. Second, a universal hypothesis can, in principle, be falsified by a single counterexample. To reject a statistical hypothesis, sufficient evidence is needed that contradicts the hypothesis, the degree of which is determined by the level of a statistical test. The choice of the level is often a matter of debate. When tables of critical values were produced in the beginning of the last century, α = 0.05 or α = 0.01 were used to limit the number of tables. Although these selections are somewhat arbitrary, α = 0.05 has been widely accepted. If the statistical test is in one direction, as in the example above, a level of α = 0.025 is often required. Applied to clinical studies, it would allow one out of 40 inefficacious drugs to pass clinical testing successfully. As this result does not look acceptable to health authorities, they tend to require two successful clinical studies (at least for certain indications) before a drug can become approved. For a level of 2.5% for an individual study, this result would correspond to α = 0.000625 if one big study is conducted instead of two separate trials (see Reference 5). Discussions can also develop concerning the size of the power of a statistical test. It was said above that, if possible, one would perform an optimal test (i.e., one that maximizes the probability to reject H if a specific alternative K is true). Apart from further assumptions, the power of a test depends on K and on the sample size (i.e., the number of independent data points). Theoretically, the probability to reject H can be made close to one for a specific K by increasing the sample size, as can be concluded from Equations
(2) and (3) for the vaccination example. This is often not feasible from a resource point of view, nor is it desirable from a scientific point of view, because one wants to avoid that alternatives that differ from H in only an irrelevant manner lead to rejection of H.

3 SPECIFIC HYPOTHESES IN CLINICAL STUDIES

As explained above, the intention behind testing of statistical hypotheses is to reject a specific hypothesis under certain criteria that limit the uncertainty that the decision taken is correct. Hypothesis testing is therefore primarily applied in phase III studies. The role of these studies in the clinical development process is to confirm the results of previous findings by rejecting testable hypotheses. Dose finding studies or studies that intend to verify the biological concept behind a potential effect need different statistical methods. In the following, some aspects of hypothesis testing in confirmatory clinical studies will be addressed.

3.1 Superiority, Equivalence, and Non-Inferiority

The formulation of an appropriate statistical hypothesis depends on the objective of the clinical study. If, for example, a clinical study is conducted to demonstrate the superiority of a new treatment over a comparator, it is most reasonable to consider testing the hypothesis that the new treatment is worse or at best equally efficacious. This approach enables one to control the risk to reject the hypothesis if it is true by selecting the level of the test. In the following, it is assumed that the difference in effect between two treatments can be described in terms of a parameter θ, which is zero in case of equal effects and positive in case of superiority. For example, one can think of θ to represent the difference in the reduction of blood pressure or the log-odds ratio of two success rates. Demonstrating superiority then means to be able to reject

H : θ ≤ 0 in favor of K : θ > 0    (7)
Circumstances exist where the intention is not to demonstrate superiority but to provide evidence for similarity. If a new formulation of a drug becomes available, it can become necessary to figure out whether the bioavailability of the new formulation is comparable with that of the existing one. If so, doses and dose regimen of the existing formulation would also apply for the new formulation. In this situation, it is preferable to try to reject the hypothesis that the bioavailability of the two formulations is different. Ideally, one would like to have a level α test of H : θ ≠ 0 versus K : θ = 0 that would reject with a probability >α only if θ is exactly zero. Unfortunately, it is not possible to construct such a test. A solution is to select an equivalence margin δ > 0 and to aim to reject

H : θ ∉ (−δ, δ) in favor of K : θ ∈ (−δ, δ)    (8)

Another case of interest is the development of a new drug that is not necessarily more efficacious than an existing one but has other advantages, for example, a better safety profile. In this case, one would like to demonstrate that the efficacy of the new drug is comparable with the one that exists without excluding the possibility that it can be even better. At first glance, one would like to reject H : θ < 0 if K : θ ≥ 0 is true. The test would reject for values of θ being 0 or greater. Although this test looks very desirable, it is generally not possible to construct a test with a level ≤ α for all θ < 0 and >α for θ ≥ 0. As before, it is necessary to select a δ > 0, the non-inferiority margin, and to set out to reject

H : θ ≤ −δ in favor of K : θ > −δ    (9)
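A common way to carry out a test of the non-inferiority hypothesis (9) is a one-sided test based on an approximately normal estimate of θ. The sketch below is a minimal illustration under that normal approximation; the summary statistics and the margin are hypothetical values chosen only to show the mechanics.

```python
from scipy.stats import norm

# Hypothetical summary statistics (illustrative only): estimated treatment
# difference theta_hat, its standard error, and a non-inferiority margin delta.
theta_hat, se, delta, alpha = -0.3, 0.6, 2.0, 0.025

z = (theta_hat + delta) / se            # test statistic for H: theta <= -delta
reject_H = z > norm.ppf(1 - alpha)      # reject in favor of K: theta > -delta
lower_bound = theta_hat - norm.ppf(1 - alpha) * se
print(round(z, 2), reject_H, round(lower_bound, 2))
# Here z is about 2.83 > 1.96, so H is rejected; equivalently, the one-sided
# lower confidence bound (about -1.48) lies above -delta = -2.
```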
Although UMP unbiased tests exist for both hypotheses (8) and (9) in many relevant situations, some new issues come up for bioequivalence and non-inferiority hypotheses. First, the equivalence or non-inferiority margin has to be defined. For bioequivalence studies, where the ratio of pharmacokinetic parameters like area under the concentration curve or maximum concentration is of concern, general agreement exists to define δ such that ratios
between 0.80 and 1.25 (or the logarithms of the ratio between −0.22314 and 0.22314) belong to K. The selection of a non-inferiority margin for clinical studies is more difficult for several reasons. A study that compares a new drug with an active drug with the intention to demonstrate non-inferiority of the new one should also demonstrate that the new drug is superior to placebo, which is not an issue when placebo, the new compound, and the active comparator are tested in the same study in which both superiority over placebo and non-inferiority compared with an active compound can be demonstrated. However, in disease areas where placebo is not an option, only indirect comparisons with placebo are possible. Under these circumstances, the effect of the active comparator over placebo has to be determined from historical data through a meta-analysis. As a consequence, doubts might develop about whether the active comparator would still be superior to placebo in the setting of the present trial. If these doubts are justified, the study runs the risk to demonstrate noninferiority with respect to placebo. To some extent, this concern can be met by designing the non-inferiority study such that it is similar in design to previous studies of the active comparator in regard to patient population, co-medication, doses and regimen, and so on. Although a straightforward proposal is to set δ to a fraction of the difference between an active comparator and placebo, a non-inferiority margin needs to be justified on a case-by-case basis. An intrinsic issue of equivalence or noninferiority studies is that their credibility cannot be protected by methods like randomization and blinding (6). For example, a transcription error in the randomization code would make results under treatment and control look more similar. If this happens too often in a superiority study, the objective of the trial would be compromised, whereas these errors would favor the desired outcome in an non-inferiorty study. The impact of these intrinsic problems is somewhat lowered by the fact that clinical studies have usually more than one objective. Non-inferiority studies need to demonstrate advantages of a new drug in other areas like safety or tolerability,
otherwise little reason would exist to prefer the new medication over an existing one. To achieve these additional objectives, correct experimentation is required to maximize the chance of demonstrating an improvement.

3.2 Multiple Hypotheses

Clinical studies often have more than one objective so that more than one hypothesis is to be tested. In the context of non-inferiority, it was already mentioned that basically two hypotheses are of interest: superiority over placebo and non-inferiority in regard to an active comparator. In other cases, more than one dose of a new drug is to be compared with a control. In many studies, a series of secondary objectives follow the primary objective or more than one primary objective is of concern. A general issue with multiple questions is that if more than one hypothesis is to be tested, more than one can be falsely rejected. As a consequence, the level set for a test of an individual hypothesis needs to be smaller to control the overall error rate for a set of hypotheses. If the combination of hypotheses is what really matters, the definition of the level of a hypothesis test and, consequently, its power need some reconsideration. Instead of going through a general discussion of the multiplicity problem, a series of examples that frequently occur in clinical trials is presented. Interested readers are referred to References 7 and 8. First, consider a study that is to compare two doses of a new drug against placebo. Let θi denote the differences over placebo for dose i = 1, 2, Hi : θi ≤ 0 and Ki : θi > 0, and Ti the corresponding level α tests with critical regions Si. The objective is to demonstrate that at least one dose is superior over placebo to claim efficacy of the new drug. In statistical terms, one would like to reject the hypothesis that neither dose is superior over placebo or to reject H = H1 ∧ H2 in favor of K = K1 ∨ K2 (10) Note that the symbol "∨" stands for "one or the other or both." For the level λ = Pr[T1 ∈ S1 ∨ T2 ∈ S2 |H1 ∧ H2] of the corresponding
test, α ≤ λ ≤ Pr[T1 ∈ S1 |H1 ] + Pr[T2 ∈ S2 |H2 ] = 2α (11) Hence, the level of the overall test can be somewhere between α and 2α. The simplest way to remedy the situation is to make T i level α/2 tests. Alternatively, one can apply some more powerful multiple comparison methods (see References 7–8). Combinations of several drugs are often required for the treatment of chronic diseases like hypertension and diabetes. To demonstrate the usefulness of combinations, one has to demonstrate that the efficacy of a combination is superior to the efficacy of each of its components. The hypothesis to be rejected is that the combination is not better than the best of the individual components against the alternative that it is better than each of them. Using notation from above, one intends to reject H = H1 ∨ H2 in favor of K = K1 ∧ K2 (12) As a win only exists if both hypotheses can be rejected, no adjustment of the level of the individual tests is required for Pr[T1 ∈ S1 ∧ T2 ∈ S2 |H ] ≤ Pr[T1 ∈ S1 |H1 ] = α Unfortunately, it is not possible to exhaust the significance level any better without further assumptions. One plausible assumption could be that the combination cannot be much better than one component but worse than the other. However, as pointed out earlier, if H is rejected under the additional assumption only, one could only conclude that H and the assumption cannot be correct simultaneously. Sometimes, a hierarchy exists between two objectives of a clinical study such that the achievement of one is only worthwhile if the other one can be achieved as well. An example is a non-inferiority study that intends to claim a superior safety profile for the new drug. In this setting, the better safety profile would be irrelevant if efficacy of the new compound would be inferior to that of an active comparator. If H1 stands for the noninferiority hypothesis and H2 for the safety
Combinations of several drugs are often required for the treatment of chronic diseases like hypertension and diabetes. To demonstrate the usefulness of combinations, one has to demonstrate that the efficacy of a combination is superior to the efficacy of each of its components. The hypothesis to be rejected is that the combination is not better than the best of the individual components against the alternative that it is better than each of them. Using notation from above, one intends to reject

H = H1 ∨ H2 in favor of K = K1 ∧ K2    (12)

As a win only exists if both hypotheses can be rejected, no adjustment of the level of the individual tests is required, for

Pr[T1 ∈ S1 ∧ T2 ∈ S2 | H ] ≤ Pr[T1 ∈ S1 | H1 ] = α

Unfortunately, it is not possible to exhaust the significance level any better without further assumptions. One plausible assumption could be that the combination cannot be much better than one component but worse than the other. However, as pointed out earlier, if H is rejected under the additional assumption only, one could only conclude that H and the assumption cannot be correct simultaneously. Sometimes, a hierarchy exists between two objectives of a clinical study such that the achievement of one is only worthwhile if the other one can be achieved as well. An example is a non-inferiority study that intends to claim a superior safety profile for the new drug. In this setting, the better safety profile would be irrelevant if efficacy of the new compound were inferior to that of an active comparator. If H1 stands for the non-inferiority hypothesis and H2 for the safety hypothesis, one tries first to reject H1 . Only if H1 has been rejected does one attempt to reject H2 , which implies

λ = Pr[T1 ∈ S1 | H1 ] + Pr[T2 ∈ S2 ∧ T1 ∉ S1 | H ] = Pr[T1 ∈ S1 | H1 ] = α    (13)
Hence, an overall level α test can be obtained if the tests of the individual hypotheses have level α. In fact, Equation (13) shows that the only situation under which the test T1 ∈ S1 ∨ T2 ∈ S2 has level α is that T2 can only reject if T1 can. (Of course, the roles of T1 and T2 can be reversed.) The same reasoning can be applied if H2 is the hypothesis of no difference in efficacy. Thus, a test for non-inferiority can be followed by a level α test for superiority after non-inferiority has been demonstrated without affecting the overall level. The downside of the hierarchical approach is that if the first hypothesis cannot be rejected, there is no way to reject the second one. In the absence of a clear hierarchy among hypotheses, as in a study that compares two doses against a control, hierarchical testing should be avoided. A final remark concerns the power of tests of multiple hypotheses. For the two doses against a control study [Equation (10)], the power is given by Pr[T1 ∈ S1 ∨ T2 ∈ S2 | K ], which is usually greater than the power of the individual tests. Hence, even if one halves the level of the individual tests to achieve an acceptable overall level, a loss in power does not necessarily occur. Only if one is also interested in achieving a certain power for each of the individual hypotheses does the sample size of the study have to be increased. For the combination study, the power of T1 ∈ S1 ∧ T2 ∈ S2 is clearly smaller than the power of each individual test. However, the power of the combination test is slightly higher than the product of the powers of the individual tests because the tests are always correlated, since the data from the combination are used in both. For the hierarchical procedure, the power of the first test is unaffected, but some loss in power for the second test exists because the second hypothesis will only be considered if the first could be rejected:

Pr[T2 ∈ S2 | K ] = Pr[T2 ∈ S2 | T1 ∈ S1 , K ] Pr[T1 ∈ S1 | K ] ≤ Pr[T1 ∈ S1 | K ]

The worst case obtains when both tests are independent.
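As a numerical check of the claim that halving the individual levels need not cost power overall, the sketch below compares the power of a single one-sided level α test with the disjunctive power of two tests, each run at level α/2, assuming (purely for illustration) independent, normally distributed test statistics with a common shift δ; the value of δ is hypothetical.

```python
from statistics import NormalDist

# Disjunctive power of two level-alpha/2 tests versus one level-alpha test,
# assuming independent N(delta, 1) test statistics (illustrative only).
norm = NormalDist()
alpha, delta = 0.05, 2.5                      # hypothetical design values

z_single = norm.inv_cdf(1 - alpha)            # critical value at level alpha
z_half = norm.inv_cdf(1 - alpha / 2)          # critical value at level alpha/2

power_single = 1 - norm.cdf(z_single - delta)        # one dose, level alpha
power_union = 1 - norm.cdf(z_half - delta) ** 2      # reject at least one of two

print(f"Power of a single level-{alpha} test:          {power_single:.3f}")
print(f"Disjunctive power of two level-{alpha/2} tests: {power_union:.3f}")
```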
REFERENCES

1. A. S. Hornby, E. V. Gatenby, and H. Wakefield, The Advanced Learner’s Dictionary of Current English. London: Oxford University Press, 1968.
2. K. Popper, The Logic of Scientific Discovery. London: Routledge, 2002.
3. C. G. Hempel, Philosophy of Natural Sciences. Englewood Cliffs, NJ: Prentice Hall, 1969.
4. E. L. Lehmann, Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1986.
5. L. Fisher, One large, well-designed, multicenter study as an alternative to the usual FDA paradigm. Drug Inform. J. 1999; 33: 265–271.
6. S. Senn, Inherent difficulties with active controlled equivalence trials. Stat. Med. 1993; 12: 2367–2375.
7. Y. Hochberg and A. C. Tamhane, Multiple Comparison Procedures. New York: Wiley, 1987.
8. P. H. Westfall and S. S. Young, Resampling-Based Multiple Testing. New York: Wiley, 1993.
HYPOTHESIS TESTING
NICOLE M. LAVALLEE
PROMETRIKA, LLC, Cambridge, MA

MIGANUSH N. STEPANIANS
PROMETRIKA, LLC, Cambridge, MA

Hypothesis testing is one of the two main areas of statistical inference, the other being parameter estimation. Although the objective of parameter estimation is to obtain point estimates or interval estimates (i.e., confidence intervals) for population parameters, hypothesis testing is used when the objective is to choose between two competing alternative theories or hypotheses regarding population parameters (e.g., population means, population proportions, population standard deviations, etc.). For example, the makers of a new antihypertensive drug believe that their medication will decrease systolic blood pressure, on average, by at least 20 mmHg in patients with moderate hypertension. The possible outcomes to this question can be specified as dichotomous: either it does or it does not. To test this claim, a statistical hypothesis test can be performed. To establish firm evidence of safety or efficacy of a new drug, clinical development plans ordinarily include at least one clinical trial that is designed to test prespecified hypotheses about population parameters that reflect the safety and/or efficacy of that drug.

1 SPECIFICATION OF THE HYPOTHESES

The two competing theories being compared in a hypothesis test are referred to as the null hypothesis and the alternative hypothesis. The alternative hypothesis, which is also called the research hypothesis, is the statement that the researcher is trying to prove, for example, that the new drug provides greater efficacy than a placebo. The null hypothesis is the antithesis of the research hypothesis (e.g., the new drug provides no greater efficacy than placebo). Statistical hypothesis testing is the process of evaluating the sample data to determine whether the null hypothesis is false. If based on data from the clinical trial the null hypothesis is shown to be improbable, then one concludes that the evidence supports the research hypothesis. The null hypothesis is denoted by H0 , whereas the alternative hypothesis is denoted by Ha or H1 (1, 2). Hypothesis tests can be used to test hypotheses about a single population parameter, or they can be used to compare parameters from two or more populations. Hypothesis tests can also be one-sided or two-sided depending on the nature of the alternative hypothesis. For example, a one-sided test regarding a single population mean would be specified as

H0 : µ ≤ µ0 versus Ha : µ > µ0

or

H0 : µ ≥ µ0 versus Ha : µ < µ0

whereas a two-sided test would be specified as

H0 : µ = µ0 versus Ha : µ ≠ µ0

Similarly, a one-sided test comparing two population means would be specified as

H0 : µ1 ≤ µ2 versus Ha : µ1 > µ2

or

H0 : µ1 ≥ µ2 versus Ha : µ1 < µ2

whereas a two-sided test would be specified as

H0 : µ1 = µ2 versus Ha : µ1 ≠ µ2 .

The choice between a one-sided and two-sided alternative hypothesis depends on what the researcher is trying to demonstrate. For example, if the objective of a given trial is to show that there is a decrease in systolic blood pressure from baseline to the end of the trial, then a one-sided test regarding the mean change from baseline in systolic blood pressure will be performed. The one-sided
alternative hypothesis can be stated as Ha : µ < 0, where µ is the population mean change from baseline in systolic blood pressure. Similarly, if the objective is to show that mean systolic blood pressure for patients receiving a given medication (µ1 ) is lower than the mean for those receiving placebo (µ2 ), then you would choose a one-sided alternative hypothesis (Ha : µ1 < µ2 ). If the objective is simply to show a difference between the two treatments without specifying which is better, then you would choose a two-sided alternative (Ha : µ1 ≠ µ2 ). It should be noted that although a one-sided alternative hypothesis appears appropriate when the objective is to show efficacy relative to placebo, regulatory guidelines and industry conventions are that the test of superiority of an active drug over placebo should be specified as a two-sided test. This is done to make the test consistent with the two-sided confidence intervals, which are appropriate for assessing the difference between two treatments (3).
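The convention described above can be illustrated numerically. The sketch below uses hypothetical summary data to show that the two-sided P-value is twice the one-sided P-value for the same observed statistic, and that the two-sided test at α = 0.05 agrees with whether the 95% confidence interval for the treatment difference excludes zero.

```python
from statistics import NormalDist

# Hypothetical summary data: difference in mean SBP change (drug - placebo)
# and its standard error, as might arise from a large two-arm trial.
norm = NormalDist()
diff, se = -6.0, 2.8            # illustrative values only

z = diff / se
p_one_sided = norm.cdf(z)                       # Ha: drug lowers SBP more than placebo
p_two_sided = 2 * (1 - norm.cdf(abs(z)))        # Ha: the treatments differ

z_crit = norm.inv_cdf(0.975)                    # 1.96 for a 95% confidence interval
ci = (diff - z_crit * se, diff + z_crit * se)

print(f"one-sided P = {p_one_sided:.4f}, two-sided P = {p_two_sided:.4f}")
print(f"95% CI for the difference: ({ci[0]:.2f}, {ci[1]:.2f})")
print("two-sided test rejects at 0.05:", p_two_sided < 0.05,
      "| CI excludes 0:", not (ci[0] <= 0 <= ci[1]))
```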
2 ERRORS IN HYPOTHESIS TESTING
To reject the null hypothesis, we want to have a reasonable level of certainty that the alternative hypothesis is true. Since we can never know with complete certainty whether the alternative hypothesis is true, two types of errors can be made in hypothesis testing. One mistake would be to reject H0 when it is true, which is referred to as a Type I error (or false positive). A different mistake would be to fail to reject H0 when the alternative is true, which is referred to as a Type II error (or false negative). The possible results of a hypothesis test are summarized in Table 1. Consider the following example. The efficacy of a new drug is being assessed in a placebo-controlled trial. The null hypothesis is that there is no difference between the new drug and the placebo, on average, with respect to a particular outcome measure. The alternative hypothesis is that the new drug is different from the placebo with respect to the particular outcome measure. A Type I error would occur if it was concluded that there was a difference between the new drug and the placebo, when in fact there was not. A Type
II error would occur if it was concluded that there was no difference between the new drug and the placebo, when there actually was. A Type I error is generally considered more serious than a Type II error. This is particularly true in clinical trials, because it is important to guard against making incorrect claims regarding the potential benefit of a new treatment. Therefore, hypothesis tests are constructed to ensure a low probability of making a Type I error. The probability of a Type I error is called the significance level of the test and is denoted by α. Most often in clinical trials the significance level is set at α = 0.05, so that there is only a 5% chance of making a Type I error. The probability of a Type II error is denoted by β. Ideally, one would want to limit both the probability of a Type I error (α) and the probability of a Type II error (β). However, for any sample of a given size, the Type I error and the Type II error are inversely related; as the risk of one decreases, the risk of the other increases. The probability of a Type II error, β, can be reduced while maintaining the probability of a Type I error at a fixed level of α by increasing the sample size of the experiment. Increasing the sample size reduces the sampling variability, and therefore, it increases the probability that the test will reject the null hypothesis when the alternative hypothesis is true. This probability is referred to as the power of the test and is equal to 1 −β. The goal of hypothesis testing is to determine whether there is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis at a specific significance level (i.e., with a specified probability of committing a Type I error). If there is insufficient evidence, then we fail to reject the null hypothesis. Failing to reject the null hypothesis does not mean that the null hypothesis is accepted, however, only that there was not sufficient evidence from the sample to disprove it. For a given hypothesis test, α is always known, since it is preset at a certain level; however, the determination of β is complex, and one cannot readily specify the probability of committing a Type II error. Hence, it is recommended to reserve judgment and to avoid declaring that the null
Table 1. Possible Hypothesis Test Outcomes

                                                 Truth
Conclusion from Hypothesis Test     H0 is true            H0 is false
Do Not Reject H0                    Correct Decision      Type II Error
Reject H0                           Type I Error          Correct Decision
hypothesis is accepted unless the probability of a Type II error can be provided.

3 TEST STATISTICS AND DETERMINATION OF STATISTICAL SIGNIFICANCE

The decision of whether to reject the null hypothesis is made based on the evidence provided from the sample data. A test statistic is a value calculated from the sample data that summarizes the information available about the population parameter(s), and it is used to decide whether to reject the null hypothesis. The choice of the test statistic depends on the hypotheses of interest, the sample size, and the assumed distribution of the underlying population. For example, if we are performing a test regarding the mean of a normally distributed population with known variance, the appropriate test statistic is a function of the sample mean X as well as of the sample size n and the population standard deviation σ . Consider a hypothesis test of the form H0 : µ ≤ µ0 versus Ha : µ > µ0 . It can be shown that the test statistic

Z = (X − µ0 ) / (σ/√n)
has a standard normal distribution under the null hypothesis. When the population variance is unknown but the sample size is large (a common threshold used is n > 30), test statistics of the above form can be calculated by substituting the sample standard deviation s for σ and can be assumed to have a standard normal distribution even if the underlying population is not normally distributed. The basic premise of a hypothesis test is that we will assume H0 is true unless the data provide sufficient evidence to the contrary.
However, we must specify what constitutes sufficient evidence before performing the test. Intuitively, for the above hypotheses, a sample mean, X, that is much larger than µ0 would provide evidence in favor of Ha . Under the assumption that µ = µ0 , the test statistic Z would have a positive value and fall in the right tail of the standard normal distribution. The probability of obtaining a test statistic as extreme or more extreme than the observed value calculated under the assumption that H0 is true is referred to as the P-value. Intuitively, if the P-value is very small, then it can be concluded that the data support Ha . Based on knowledge of the distribution of the test statistic, we can obtain the P-value. The smaller the P-value, the stronger the evidence that H0 is unlikely to be true. In hypothesis testing, we reject the null hypothesis if the P-value of the test is less than the prespecified significance level α. In clinical trials, when the null hypothesis is rejected in favor of the research hypothesis of interest, then it is common to state that statistical significance has been reached. The set of values for the test statistic that would cause the null hypothesis to be rejected is called the critical region. The boundary of the critical region is called the critical value. The test statistic must be more extreme than this value (i.e., further in the tail of the distribution of the test statistic) in order to reject the null hypothesis. The critical value(s) depends on the significance level for the test and whether the test is one-sided or two-sided. For one-sided hypothesis tests, there will be one critical value in the lower or upper tail of the distribution. For two-sided tests, there will be two critical values, marking the rejection regions in both tails of the distribution (provided that the test statistic has a symmetric distribution under the null hypothesis).
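As a small illustration of the relationship between the P-value and the critical region described above, the following sketch performs a one-sided Z test on hypothetical summary statistics and shows that the two decision rules (P-value below α, test statistic beyond the critical value) agree.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_z_test(xbar, mu0, sd, n, alpha=0.05):
    """One-sided Z test of H0: mu <= mu0 versus Ha: mu > mu0 (large-sample sketch)."""
    norm = NormalDist()
    z = (xbar - mu0) / (sd / sqrt(n))
    p_value = 1 - norm.cdf(z)              # area to the right of the observed statistic
    z_crit = norm.inv_cdf(1 - alpha)       # boundary of the critical region
    return z, p_value, z_crit

# Hypothetical summary statistics (for illustration only).
z, p, z_crit = one_sided_z_test(xbar=23.1, mu0=20, sd=14, n=60)
print(f"Z = {z:.3f}, critical value = {z_crit:.3f}, P = {p:.4f}")
print("reject by P-value:", p < 0.05, "| reject by critical region:", z > z_crit)
```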
4 EXAMPLES
We illustrate the concepts and techniques of hypothesis testing by two examples. The first example demonstrates how to perform a test on a single population mean, and the second example compares the means from two independent samples.

4.1 One-Sample Test

A new drug has been developed to treat hypertension. A Phase II trial is conducted to test whether treatment with the new drug will lower systolic blood pressure (SBP) by more than 20 mmHg, on average, in patients with moderate hypertension who are treated for two months. A random sample of 50 patients from the population the drug is intended to treat is enrolled in the trial. For each patient, the change in SBP from baseline to the end of the treatment period is determined. This is the outcome measure on which the test will be conducted. The null and alternative hypotheses can be stated as

H0 : µ ≤ 20
Ha : µ > 20

The significance level of the test is specified as α = 0.05. The sample mean and standard deviation are determined to be 26 and 13, respectively. The test statistic is equal to

Z = (X − µ0 ) / (S/√n) = (26 − 20) / (13/√50) = 3.264

Since under the null hypothesis this test statistic follows a standard normal distribution, the critical value is z0.05 = 1.645. Figure 1 depicts the critical region for this test, corresponding to 5% of the area under the standard normal curve. H0 will be rejected for all test statistics that are greater than 1.645. Since Z > 1.645, we reject H0 at a 0.05 level of significance and conclude that the new drug will lower SBP by more than 20 mmHg, on average, in moderate hypertensive patients. The probability of obtaining a test statistic of this magnitude or larger, under the assumption that H0 is true, is P = 0.00055. The P-value can be calculated using a statistical software package and is equal to the area under the standard normal probability density function corresponding to Z ≥ 3.264.

4.2 Two-Sample Test

Now suppose that instead of testing the within-patient change in SBP, the primary objective of another study is to test whether patients treated with the new drug experience a significantly greater decrease in SBP than patients treated with placebo. The null and alternative hypotheses can be stated as

H0 : µ1 = µ2 , or equivalently, µ1 − µ2 = 0
Ha : µ1 ≠ µ2 , or equivalently, µ1 − µ2 ≠ 0

where µ1 represents the mean change for the population of patients treated with the new drug and µ2 represents the mean change for the population of patients treated with placebo. For a test of two independent means, it can be shown that the test statistic

Z = (X1 − X2 ) / √(S1²/n1 + S2²/n2 )
Figure 1. Standard normal curve with one-sided rejection region.
where X1 and X2 are the sample means corresponding to µ1 and µ2 , respectively, and S1 and S2 are the corresponding sample standard deviations. The test statistic Z has a standard normal distribution under the null hypothesis given that both n1 and n2 are large. For this trial, 80 patients are enrolled and randomized in equal numbers to the active drug and placebo. At the end of the trial, the sample means and standard deviations are calculated for each group. The test statistic is calculated as

Z = (X1 − X2 ) / √(S1²/n1 + S2²/n2 ) = (22 − 16) / √(13²/40 + 12²/40) = 2.145

The critical value for this test is z0.025 = 1.96. Figure 2 depicts the critical region for this test, corresponding to 2.5% of the area under the standard normal curve in each tail. H0 will be rejected for all test statistics that are less than −1.96 or greater than 1.96. Since Z > 1.96, we reject H0 at a 0.05 level of significance and conclude that the new drug lowers SBP significantly more than the placebo after two months of treatment. This result is significant with a P-value of 0.0320. The P-value is equal to the area under the standard normal probability density function corresponding to the combined areas for Z > 2.145 and Z < −2.145. For the above examples of one and two sample tests of means, each had sufficiently large sample sizes, so that a test statistic could be constructed that followed a standard normal distribution.
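The two calculations above can be reproduced in a few lines of code; the sketch below is only a check of the arithmetic reported in the examples, with the summary statistics taken directly from the text.

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

# One-sample test (Section 4.1): xbar = 26, mu0 = 20, s = 13, n = 50
z1 = (26 - 20) / (13 / sqrt(50))
p1 = 1 - norm.cdf(z1)                          # one-sided P-value
print(f"one-sample: Z = {z1:.3f}, P = {p1:.5f}")     # Z = 3.264, P = 0.00055

# Two-sample test (Section 4.2): means 22 and 16, SDs 13 and 12, n = 40 per arm
z2 = (22 - 16) / sqrt(13**2 / 40 + 12**2 / 40)
p2 = 2 * (1 - norm.cdf(abs(z2)))               # two-sided P-value
print(f"two-sample: Z = {z2:.3f}, P = {p2:.4f}")     # Z = 2.145, P = 0.0320
```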
Note that in the case of small sample sizes (<30 per group), hypothesis tests of means can be constructed where the test statistic follows a t-distribution under the null hypothesis.

REFERENCES

1. R. J. Larsen and M. L. Marx, An Introduction to Mathematical Statistics and Its Applications, 3rd ed. Englewood Cliffs, NJ: Prentice-Hall, 2000.
2. J. Neter, W. Wasserman, and G. A. Whitmore, Applied Statistics, 4th ed. Newton, MA: Allyn and Bacon, 1992.
3. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline. Statistical Principles for Clinical Trials: E9. 1998.
FURTHER READING

M. Bland, An Introduction to Medical Statistics, 3rd ed. New York: Oxford University Press, 2000.
G. Casella and R. L. Berger, Statistical Inference, 2nd ed. Belmont, CA: Wadsworth, 2001.
R. V. Hogg and A. T. Craig, Introduction to Mathematical Statistics, 6th ed. Englewood Cliffs, NJ: Prentice-Hall, 2004.
L. Ott and W. Mendenhall, Understanding Statistics, 6th ed. Boston, MA: Duxbury Press, 1994.
S. J. Pocock, Clinical Trials, A Practical Approach. New York: Wiley, 1983.
G. W. Snedecor and W. G. Cochran, Statistical Methods, 8th ed. Ames, IA: Iowa State University Press, 1991.
G. A. Walker, Common Statistical Methods for Clinical Research with SAS Examples. San Diego, CA: Collins-Wellesley Publishing, 1996.
Figure 2. Standard normal curve with two-sided rejection region.
CROSS-REFERENCES

Critical Value
One-sided Test vs. Two-sided Test
P-value
Type I Error, Type II Error
IDENTIFYING THE MOST SUCCESSFUL DOSE (MSD) IN DOSE-FINDING STUDIES
JOHN O’QUIGLEY
Université Pierre et Marie Curie Paris VI, 75005 Paris, France

SARAH ZOHAR
CIC 9504 INSERM, Centre d’Investigation Clinique, Hôpital Saint-Louis, Paris, France
U717 INSERM, Paris, France

1 MOTIVATION

O’Quigley et al. (1) proposed designs that take account of efficacy of treatment and toxic side effects simultaneously. In HIV studies, the toxicity is often the ability to tolerate the treatment, whereas in the cancer setting, this is more likely to be the presence or not of serious side effects. For HIV, efficacy will be seen readily and is most often a measure of the impact of treatment on the viral load. For cancer, it may take longer to see any effects of efficacy, although it is becoming increasingly common to make use of biological markers and other pharmacological parameters that can be indicative of a positive effect of treatment. Non-biological measurements may also be used, and in the light of developments in assessing the extent of tumor regression, for example, it is sometimes possible to use measures that previously were considered to be less reliable. Several articles (2–4) take a Bayesian approach to the question and base their findings on an efficacy–toxicity tradeoff. The approach of O’Quigley et al. can also take advantage of any prior information that is available. To do this, we would need to put prior distributions on both the toxicity and efficacy parameters and then use the Bayes formula in much the same way as used by Whitehead et al. (5). Ivanova (6) considered the algorithms of O’Quigley et al. (1), together with some further modification. The problem of dose finding in which there are two correlated outcomes we wish to control for has been studied by Braun (7). This differs from the main objective here, where we wish to consider simultaneously two outcomes, one of which is viewed as positive, one as negative, and both are believed to increase with increasing dose.

2 DESIGNS

We take Y and V to be binary random variables (0,1), where Y = 1 denotes a toxicity, Y = 0 a nontoxicity, V = 1 a response, and V = 0 a nonresponse. The dose level of a new drug is to be taken from a panel of k discrete dose levels, {d1 , . . . , dk }, with actual toxicity rates assumed to be monotonic and increasing. The dose level assigned to the jth patient (j ≤ n with n the total number of patients), Xj , can be viewed as a random variable taking discrete values xj , where xj ∈ {d1 , . . . , dk }. The probability of toxicity at the dose level Xj = xj is defined by R(xj ) = Pr(Yj = 1|Xj = xj ). The probability of response given no toxicity at dose level Xj = xj is defined by Q(xj ) = Pr(Vj = 1|Xj = xj , Yj = 0), so that P(di ) = Q(di ){1 − R(di )} is the probability of success. It is this quantity P that we would like to maximize over the range of doses available in the study, and a successful trial identifies the dose level l such that P(dl ) > P(di ) for all i ≠ l. We call such a dose the most successful dose (MSD), and our purpose is, rather than find the MTD, to find the MSD. The relationship between toxicity and dose (xj ) and the relationship between response given no toxicity and dose are modeled through the use of two one-parameter models (1). The reasoning behind the use of underparameterized models is given in Shen and O’Quigley (8). Roughly, the idea is that, at the MTD itself, only a single parameter is required to specify the probability of toxicity. Since sequential designs using updating will tend to settle on a single level, it is usually enough to work with a model only rich enough to characterize correctly the probabilities at a single level. Underparameterization, on the
other hand, means that extrapolation can be poor. However, since extrapolation is mostly only local and only used to provide rough guidance, this is a lesser concern than that of accurate local estimation. The same ideas carry through when modeling the most successful dose. Note that R(di ) and Q(di ) refer to exact, usually unknown, probabilities. We use the model-based equivalents of these, ψ and φ, respectively, with these quantities coinciding when the working model is correct. Generally, though, we are interested in the case where the working model is only a first approximation to a more complex reality, so that we do not expect R(di ) and ψ to coincide at more than one of the available levels. The same applies to Q(di ) and φ. Equality will hold, though, for at least one level. Understanding the equality sign in this sense, i.e., in the sense of a working model, we can then write

R(di ) = ψ(di , a) = αi^a and Q(di ) = φ(di , b) = βi^b

where 0 < α1 < · · · < αk < 1, 0 < a < ∞, 0 < β1 < · · · < βk < 1, and 0 < b < ∞. After the inclusion of j patients, R(di ), Q(di ), and P(di ) are estimated using maximum likelihood (11). The methods are robust to the choice of the αi (and, by extension, the choice of the βi ), so that these arbitrary choices have a very weak impact on operating characteristics. For large samples, under very broad conditions (9), the choice will have no impact at all. Our practical suggestion would be to divide roughly the interval (0,1) into comparable segments. For four dose levels, for example, we could choose both the αi and the βi to be 0.2, 0.4, 0.6, and 0.8. Note that, in all events, the designs remain invariant to power transformations on these values, so that the operating characteristics for small samples based on the above model are exactly identical to, say, the square of the model, i.e., taking αi and βi to be equal to 0.04, 0.16, 0.36, and 0.64. The (j + 1)th patient will be included at the dose level xj+1 , which maximizes the estimated probability of success, so that P̂(xj+1 ) > P̂(di ) for i = 1, . . . , k, xj+1 ≠ di (1).
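The dose-allocation rule just described can be sketched in a few lines: given current estimates â and b̂ of the two working-model parameters, compute P̂(di) = φ(di, b̂){1 − ψ(di, â)} at each level and assign the next patient to the maximizing level. The skeleton values of αi and βi below follow the practical suggestion in the text, while the fitted parameter values are purely hypothetical placeholders for whatever maximum likelihood estimation would return.

```python
# Sketch of MSD selection under the two one-parameter working models
# psi(d_i, a) = alpha_i ** a (toxicity) and phi(d_i, b) = beta_i ** b
# (response given no toxicity); a_hat and b_hat are hypothetical estimates.
alphas = [0.2, 0.4, 0.6, 0.8]     # working-model skeleton for toxicity
betas = [0.2, 0.4, 0.6, 0.8]      # working-model skeleton for response
a_hat, b_hat = 0.8, 1.4           # placeholder maximum likelihood estimates

def success_prob(i):
    r_hat = alphas[i] ** a_hat            # estimated toxicity probability
    q_hat = betas[i] ** b_hat             # estimated response probability given no toxicity
    return q_hat * (1 - r_hat)            # P_hat(d_i) = Q_hat(d_i) * {1 - R_hat(d_i)}

p_hat = [success_prob(i) for i in range(len(alphas))]
msd = max(range(len(p_hat)), key=p_hat.__getitem__)
print("estimated P(d_i):", [round(p, 3) for p in p_hat])
print("next patient treated at dose level", msd + 1)    # current MSD estimate
```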
To maximize the log likelihood function, we must have heterogeneity in patient observation in terms of toxicity and response. Some initial escalating scheme is then required, and here, we base this on a simple up-and-down algorithm that includes patients in groups of 3 at a time. If all three experience no toxicity, then the dose level is escalated. If a single toxicity is observed, then we remain at the same dose level. If two or more toxicities are observed, then the dose level is lowered. We continue in this way until the first response is encountered. As soon as we have both a response and a toxicity, we are in a position to fit the parameters of the underparameterized model. The estimated parameters enable us to reconstruct the working dose-toxicity and dose-response relationships. From these we can target, for each subsequent inclusion or group of inclusions, the dose believed to be the most successful (MSD). Usually, from this point, patients are then allocated one at a time although, of course, it is possible to continue with grouped inclusion, based on the model, if desired. In estimating the MSD, all available information is taken into account via the use of our two-parameter model. Apart from unusual situations, this model will be underparameterized in practice. This approach, from an operational standpoint, represents the most obvious extension to the usual CRM strategy for dealing with toxicity alone. It works well as judged by the simulations of Zohar and O’Quigley (11) across a broad range of situations. Instead of the above prescription, O’Quigley et al. (1) suggested a compromise solution in which a CRM (10) model is used to target the rate of toxicity alone. This rate is no longer fixed, however, and can be changed in the course of the study. The solution is likely to be of greatest utility in those cases where we anticipate little or negligible toxicity. This is often the case for HIV studies and is becoming more the case for cancer studies where we work with non-cytotoxic agents. Initially we target a low rate of toxicity, keeping track of the overall success rate. Should the overall success rate be below some aimed-for minimum amount, then we will allow the targeted rate of toxicity to be increased. We continue in this way until we either find some level with an overall success rate that we can take to be acceptable or until all levels have been eliminated. Specifically, a likelihood CRM (10,12) approach is used to find the dose level associated with the target probability of toxicity
θ . After the inclusion of j − 1 patients, the jth patient will be treated at the dose level xj , which is associated with the estimated toxicity probability closest to the target, such that |ψ(xj , âj ) − θ | < |ψ(di , âj ) − θ | with xj ∈ {d1 , . . . , dk } and ψ(di , a) = αi^a . Again, we use a maximum likelihood approach, and so, in order to maximize the log likelihood function, we must have heterogeneity among the toxicity observations, i.e., at least one toxicity together with at least one nontoxicity. An up-and-down dose allocation procedure is used until this heterogeneity is observed. Unlike the approach based on an underparameterized model, in which, before the model is fitted, we need heterogeneity both in the toxicities and in the responses, here we only need heterogeneity in the toxicities. We do not need to consider any de-escalation, and in the particular simulation here, we use inclusion of three patients at a time until the first toxicity is encountered. Following this, patients are typically allocated one at a time although, once again, if desired, grouped inclusions require no real additional work. This design uses the reestimated dose-toxicity relationship for the dose allocation scheme, but as the trial progresses and our inferences are updated, more information about the rate of success becomes available. At each dose level, the probability of success is estimated by

P̂i = Σl≤j vl I(xl = di , yl = 0) / Σl≤j I(xl = di )

As inclusions increase, we can base a decision on whether to continue, to stop and recommend the current level, or to reject all levels on the value of P̂i .

2.1 Criteria for Stopping

As the trial proceeds and information is gathered about success at the dose level di , a decision is taken based on a sequential probability ratio test (13). The hypotheses to be compared are naturally composite, but for operational simplicity, we compare two point hypotheses, modifying the parameters of these in order to obtain desirable operating characteristics. The first composite hypothesis is that the success rate is greater than p1 at di . The second composite hypothesis is that the success rate is lower than p0 at this dose level. A conclusion in favor of
a success rate greater than p1 leads the trial to end with di recommended. By contrast, a conclusion in favor of a success rate lower than p0 at di leads us to remove that dose level and lower ones, and at the same time, the target toxicity probability is increased from θ to θ + Δθ (until some upper maximum target toxicity rate is reached). The trial then continues using those dose levels that have not been eliminated. The sequential probability ratio test (SPRT), when treating these as point hypotheses, can be expressed as

log[ε2 /(1 − ε1 )] < (ni − wi ) log[(1 − p1 )/(1 − p0 )] + wi log[p1 /p0 ] < log[(1 − ε2 )/ε1 ]
where ni is the number of patients included at dose level di and wi is the number of successes (response with no toxicity) at di . The sequential Type I and Type II error rates are ε1 and ε2 . For fixed sample size, the dose level recommended at the end of the trial for all approaches is the dose level that would have been given to the (n + 1)th patient.
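A minimal sketch of this stopping rule is given below: after each inclusion it compares the SPRT statistic with the two boundaries and reports whether to recommend the level, eliminate it, or continue. The design parameters (p0, p1, ε1, ε2) and the counts are hypothetical illustration values, not those of any particular trial.

```python
from math import log

def sprt_decision(n_i, w_i, p0=0.15, p1=0.35, eps1=0.1, eps2=0.1):
    """SPRT check of the success rate at one dose level (illustrative values).

    Returns 'recommend' if the evidence favors a rate above p1,
    'eliminate' if it favors a rate below p0, and 'continue' otherwise.
    """
    stat = w_i * log(p1 / p0) + (n_i - w_i) * log((1 - p1) / (1 - p0))
    lower = log(eps2 / (1 - eps1))      # crossing below favors rates lower than p0
    upper = log((1 - eps2) / eps1)      # crossing above favors rates greater than p1
    if stat >= upper:
        return "recommend"
    if stat <= lower:
        return "eliminate"
    return "continue"

# Example: 12 patients treated at the level, 7 successes (response, no toxicity) so far.
print(sprt_decision(n_i=12, w_i=7))
```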
3 GENERAL OBSERVATIONS

Identification of the MSD, as opposed to the MTD, provides a step for the statistical methodology toward current clinical practice. Here, even if not always well used, it is now common to record information on efficacy at the same time as that on toxicity. Many potential new treatments can fail because either they showed themselves to be too toxic or, at an acceptable level of toxicity, they failed to show adequate response. The classic format for a dose finding study in cancer has been to ignore information relating to response or efficacy. It has been argued that, in some sense, toxicity itself is serving a role as a surrogate for efficacy—the more toxic the treatment, the more effect we anticipate seeing. However, this is clearly limited. We might see quite a lot of toxicity and yet little effect. We might also be able to achieve a satisfactory effect at lower rates of toxicity than are commonly used in this setting. Some of the newer biological formulations may not even be well described by the simple paradigm that more toxicity equates to more effect. Before seeing any toxicity, it might be required to increase the given levels of dose by orders of magnitude, so that it becomes increasingly likely that the methodology of dose finding will need to include more and more information relating to the desired outcome and not just the toxic side effects. To give the best chance to any promising new therapy, it becomes vital for the investigators to be able to identify, during the early development, the level of drug, the MSD, which the evidence shows provides the best chance of overall success. The work presented here is broad enough for many situations that can originate in the cancer context. Nonetheless, it would be desirable, before undertaking any study using these methods, to carry out simulations of operating characteristics at the kind of sample sizes that can be realistically envisioned. This is not difficult and can also provide an informative background to the necessary discussions that will take place among the clinicians, the pharmacologists, the nursing staff, and the statisticians involved. In the context studied here, that of jointly finding a dose based on both toxicity and response, it is clear that the concept of the maximum tolerated dose (MTD) is not fully adequate. Attention instead focuses on the MSD, and our belief is that the MSD is likely to become of increasing interest in the early development of new therapeutics. In situations where we expect to see very little or almost no toxicity, or in situations where we expect to see very little response or response rates no greater than, say, 5%, then current designs focusing on only one endpoint are likely to perform quite adequately. The greatest gains with the suggested methods are likely to be found when there are non-negligible effects of both toxicity and efficacy and, in particular, where we weight them in a similar way; i.e., we are most interested in whether a treatment can be considered a failure, either due to toxicity or due to lack of efficacy. Then it makes sense to look for the dose that maximizes the overall success rate (MSD). The proposed methods can be considered in conjunction with several others currently available. The choice of the most appropriate method depends, as always, on the precise
context under investigation. Although, for the situations described in the above paragraph, existing designs would seem to require no modification, it could still be argued that by including a toxicity criterion in an efficacy dose finding design, we add a safeguard at little extra cost. One difficulty, not addressed here, is that there may be different precisions in our measurement of efficacy when compared with our measurement of toxicity, the latter tending to be typically more objective.

REFERENCES

1. J. O’Quigley, M. D. Hughes, and T. Fenton, Dose-finding designs for HIV studies. Biometrics 2001; 57(4): 1018–1029.
2. P. F. Thall and K. E. Russell, A strategy for dose-finding and safety monitoring based on efficacy and adverse outcomes in phase I/II clinical trials. Biometrics 1998; 54(1): 251–264.
3. P. F. Thall, E. H. Estey, and H. G. Sung, A new statistical method for dose-finding based on efficacy and toxicity in early phase clinical trials. Invest New Drugs 1999; 17(2): 155–167.
4. P. F. Thall and J. D. Cook, Dose-finding based on efficacy-toxicity trade-offs. Biometrics 2004; 60(3): 684–693.
5. J. Whitehead, S. Patterson, D. Webber, S. Francis, and Y. Zhou, Easy-to-implement Bayesian methods for dose-escalation studies in healthy volunteers. Biostatistics 2001; 2(1): 47–61.
6. A. Ivanova, A new dose-finding design for bivariate outcomes. Biometrics 2003; 59(4): 1001–1007.
7. T. M. Braun, The bivariate continual reassessment method: Extending the CRM to phase I trials of two competing outcomes. Control Clin. Trials 2002; 23(3): 240–256.
8. L. Z. Shen and J. O’Quigley, Using a one-parameter model to sequentially estimate the root of a regression function. Computat. Stat. Data Anal. 2000; 34(3): 357–369.
9. L. Z. Shen and J. O’Quigley, Consistency of continual reassessment method under model misspecification. Biometrika 1996; 83: 395–405.
10. J. O’Quigley, M. Pepe, and L. Fisher, Continual reassessment method: A practical design for phase 1 clinical trials in cancer. Biometrics 1990; 46(1): 33–48.
11. S. Zohar and J. O’Quigley, Identifying the most successful dose (MSD) in dose-finding studies in cancer. Pharm. Stat. 2006; 5(3): 187–199.
12. J. O’Quigley and L. Z. Shen, Continual reassessment method: A likelihood approach. Biometrics 1996; 52(2): 673–684.
13. G. B. Wetherill and K. D. Glazebrook, Sequential Methods in Statistics, 3rd ed. London, U.K.: Chapman and Hall, 1986.
IMAGING SCIENCE IN MEDICINE
WILLIAM R. HENDEE
Medical College of Wisconsin, Milwaukee, WI

1 INTRODUCTION

Natural science is the search for ‘‘truth’’ about the natural world. In this definition, truth is defined by principles and laws that have evolved from observations and measurements about the natural world that are reproducible through procedures that follow universal rules of scientific experimentation. These observations reveal properties of objects and processes in the natural world that are assumed to exist independently of the measurement technique and of our sensory perceptions of the natural world. The purpose of science is to use these observations to characterize the static and dynamic properties of objects, preferably in quantitative terms, and to integrate these properties into principles and, ultimately, laws and theories that provide a logical framework for understanding the world and our place in it. As a part of natural science, human medicine is the quest for understanding one particular object, the human body, and its structure and function under all conditions of health, illness, and injury. This quest has yielded models of human health and illness that are immensely useful in preventing disease and disability, detecting and diagnosing conditions of illness and injury, and designing therapies to alleviate pain and suffering and restore the body to a state of wellness or, at least, structural and functional capacity. The success of these efforts depends on our depth of understanding of the human body and on the delineation of effective ways to intervene successfully in the progression of disease and the effects of injuries. Progress in understanding the body and intervening successfully in human disease and injury has been so remarkable that the average life span of humans in developed countries is almost twice that expected a century ago. Greater understanding has occurred at all levels from the atomic through molecular, cellular, and tissue to the whole body and the social influences on disease patterns. At present, a massive research effort is focused on acquiring knowledge about genetic coding (the Human Genome Project) and the role of genetic coding in human health and disease. This effort is progressing at an astounding rate and gives rise to the belief among many medical scientists that genetics and bioinformatics (mathematical modeling of biological information, including genetic information) are the major research frontiers of medical science for the next decade or longer. The human body is an incredibly complex system. Acquiring data about its static and dynamic properties yields massive amounts of information. One of the major challenges to researchers and clinicians is the question of how to acquire, process, and display vast quantities of information about the body, so that the information can be assimilated, interpreted, and used to yield more useful diagnostic methods and therapeutic procedures. In many cases, the presentation of information as images is the most efficient approach to this challenge. As humans, we understand this efficiency; from our earliest years we rely more heavily on sight than on any other perceptual skill in relating to the world around us. Physicians also increasingly rely on images to understand the human body and intervene in the processes of human illness and injury. The use of images to manage and interpret information about biological and medical processes is certain to continue to expand in clinical medicine and also in the biomedical research enterprise that supports it. Images of a complex object such as the human body reveal characteristics of the object such as its transmissivity, opacity, emissivity, reflectivity, conductivity, and magnetizability, and changes in these characteristics with time. Images that delineate one or more of these characteristics can be analyzed to yield information about underlying properties of the object, as depicted in Table 1. For example, images (shadowgraphs) created by X rays transmitted through a region of the body
reveal intrinsic properties of the region such as its effective atomic number Z, physical density (grams/cm3 ) and electron density (electrons/cm3 ). Nuclear medicine images, including emission computed tomography (ECT) where pharmaceuticals release positrons [positron emission tomography (PET)] and single photons [single photon emission computed tomography (SPECT)], reveal the spatial and temporal distribution of target-specific pharmaceuticals in the human body. Depending on the application, these data can be interpreted to yield information about physiological processes such as glucose metabolism, blood volume, flow and perfusion, tissue and organ uptake, receptor binding, and oxygen utilization. In ultrasonography, images are produced by capturing energy reflected from interfaces in the body that separate tissues that have different acoustic impedances, where the acoustic impedance is the product of the physical density and the velocity of ultrasound in the tissue. Magnetic resonance imaging (MRI) of relaxation characteristics following magnetization of tissues can be translated into information about the concentration, mobility, and chemical bonding of hydrogen and, less frequently, other elements present in biological tissues. Maps of the electrical field (electroencephalography) and the magnetic field (magnetoencephalography) at the surface of the skull can be analyzed to identify areas of intense electrical activity in the brain. These and other techniques that use
the energy sources listed in Table 1 provide an array of imaging methods useful for displaying structural and functional information about the body that is essential to improving human health by detecting and diagnosing illness and injury. The intrinsic properties of biological tissues that are accessible by acquiring and interpreting images vary spatially and temporally in response to structural and functional changes in the body. Analysis of these variations yields information about static and dynamic processes in the human body. These processes may be changed by disease and disability, and identification of the changes through imaging often permits detecting and delineating the disease or disability. Medical images are pictures of tissue characteristics that influence the way energy is emitted, transmitted, reflected, etc. by the human body. These characteristics are related to, but not the same as, the actual structure (anatomy), composition (biology and chemistry), and function (physiology and metabolism) of the body. Part of the art of interpreting medical images is to bridge the gap between imaging characteristics and clinically relevant properties that aid in diagnosing and treating disease and disability.

2 ADVANCES IN MEDICAL IMAGING

Advances in medical imaging have been driven historically by the ‘‘technology push’’ principle. Especially influential have been
Table 1. Energy Sources and Tissue Properties Employed in Medical Imaging

Image Sources: • X rays • γ rays • Visible light • Ultraviolet light • Annihilation radiation • Electric fields • Magnetic fields • Infrared • Ultrasound • Applied voltage

Image Influences: • Mass density • Electron density • Proton density • Atomic number • Velocity • Pharmaceutical location • Current flow • Relaxation • Blood volume/flow • Oxygenation level of blood • Temperature • Chemical state

Image Properties: • Transmissivity • Opacity • Emissivity • Reflectivity • Conductivity • Magnetizability • Resonance absorption
imaging developments in other areas, particularly in the defense and military sectors, that have been imported into medicine because of their potential applications in detecting and diagnosing human illness and injury. Examples include ultrasound developed initially for submarine detection (sonar), scintillation detectors and reactorproduced isotopes (including 131 I and 60 Co) that emerged from the Manhattan Project (the United States World War II effort to develop the atomic bomb), rare-earth fluorescent compounds synthesized initially in defense and space research laboratories, electrical conductivity detectors for detecting rapid blood loss on the battlefield, and the evolution of microelectronics and computer industries from research funded initially for security and surveillance, defense, and military purposes. Basic research laboratories have also provided several imaging technologies that have migrated successfully into clinical medicine. Examples include reconstruction mathematics for computed tomographic imaging and nuclear magnetic resonance techniques that evolved into magnetic resonance imaging and spectroscopy. The migration of technologies from other arenas into medicine has not always been successful. For example, infrared detection devices developed for night vision in military operations have so far not proven useful in medicine, despite early enthusiasm for infrared thermography as an imaging method for early detection of breast cancer. Today the emphasis in medical imaging is shifting from a ‘‘technology push’’ approach toward the concept of ‘‘biological/clinical pull.’’ This shift in emphasis reflects a deeper understanding of the biology underlying human health and disease and a growing demand for accountability and proven usefulness of technologies before they are introduced into clinical medicine. Increasingly, unresolved biological questions important in diagnosing and treating human disease and disability are used as an incentive for developing new imaging methods. For example, the function of the human brain and the causes and mechanisms of various mental disorders such as dementia, depression, and schizophrenia are among the greatest biological enigmas that confront biomedical
scientists and clinicians. A particularly fruitful method for penetrating this conundrum is the technique of functional imaging that employs tools such as ECT and MRI. Functional magnetic resonance imaging (fMRI) is especially promising as an approach to unraveling some of the mysteries of human brain function in health and in various conditions of disease and disability. Another example is the use of X-ray computed tomography and magnetic resonance imaging as feedback mechanisms to shape and guide the optimized deployment of radiation beams for cancer treatment. The growing use of imaging techniques in radiation oncology reveals an interesting and rather recent development. Until about three decades ago, the diagnostic and therapeutic applications of ionizing radiation were practiced by a single medical specialty. In the late 1960s, these applications began to separate into distinct medical specialties, diagnostic radiology and radiation oncology, that have separate training programs and clinical practices. Today, imaging is used extensively in radiation oncology to characterize the cancers to be treated, design the plans of treatment, guide the delivery of radiation, monitor the response of patients to treatment, and follow patients over the long term to assess the success of therapy, the occurrence of complications, and the frequency of recurrence. The process of accommodating this development in the training and practice of radiation oncology is encouraging a closer working relationship between radiation oncologists and diagnostic radiologists. 3 EVOLUTIONARY DEVELOPMENTS IN IMAGING Six major developments are converging today to raise imaging to a more prominent role in biological and medical research and in the clinical practice of medicine (1): • The ever-increasing sophistication of the
biological questions that can be addressed as knowledge expands and understanding grows about the complexity of the human body and its static and dynamic properties.
• The ongoing evolution of imaging technologies and the increasing breadth and depth of the questions that these technologies can address at ever more fundamental levels.
• The accelerating advances in computer technology and information networking that support imaging advances such as three- and four-dimensional representations, superposition of images from different devices, creation of virtual reality environments, and transportation of images to remote sites in real time.
• The growth of massive amounts of information about patients that can best be compressed and expressed by using images.
• The entry into research and clinical medicine of young persons who are amazingly facile with computer technologies and comfortable with images as the principal pathway to acquiring and displaying information.
• The growing importance of images as an effective means to convey information in visually oriented developed cultures.
A major challenge confronting medical imaging today is the need to exploit this convergence of evolutionary developments efficiently to accelerate biological and medical imaging toward the realization of its true potential. Images are our principal sensory pathway to knowledge about the natural world. To convey this knowledge to others, we rely on verbal communications that follow accepted rules of human language, of which there are thousands of varieties and dialects. In the distant past, the acts of knowing through images and communicating through languages were separate and distinct processes. Every technological advance that brought images and words closer, even to the point of convergence in a single medium, has had a major cultural and educational impact. Examples of such advances include the printing press, photography, motion pictures, television, video games, computers, and information networking. Each of these technologies has enhanced the shift from using words to communicate information toward a more efficient synthesis of images to provide insights and words
to explain and enrich insights (2). Today, this synthesis is evolving at a faster rate than ever before, as evidenced, for example, by the popularity of television news and documentaries and the growing use of multimedia approaches to education and training. A two-way interchange of information is required to inform and educate individuals. In addition, flexible means are needed for mixing images and words and their rate and sequence of presentation to capture and retain the attention, interest, and motivation of persons engaged in the educational process. Computers and information networks provide this capability. In medicine, their use in association with imaging technologies greatly enhances the potential contribution of medical imaging to resolving patient problems in the clinical setting. At the beginning of the twenty-first century, the six evolutionary developments listed before provide the framework for major advances in medical imaging and its contributions to improvements in the health and well-being of people worldwide. 3.1 Molecular Medicine Medical imaging has traditionally focused on acquiring structural (anatomic) and, to a lesser degree, functional (physiological) information about patients at the organ and tissue levels. This focus has nurtured the correlation of imaging findings with pathological conditions and led to enhanced detection and diagnosis of human disease and injury. At times, however, detection and diagnosis occur at a stage in the disease or injury where radical intervention is required and the effectiveness of treatment is compromised. In many cases, detection and diagnosis at an earlier stage in the progression of disease and injury are required to improve the effectiveness of treatment and enhance the well-being of patients. This objective demands that medical imaging refocus its efforts from the organ and tissue levels to the cellular and molecular levels of human disease and injury. Many scientists believe that medical imaging is well positioned today to experience this refocusing as a benefit of knowledge gained at the research frontiers of molecular biology and genetics. This benefit is
often characterized as the entry of medical imaging into the era of molecular medicine. Examples include the use of magnetic resonance to characterize the chemical composition of cancers, emission computed tomography to display the perfusion of blood in the myocardium, and microfocal X-ray computed tomography to reveal the microvasculature of the lung. Contrast agents are widely employed in X ray, ultrasound, and magnetic resonance imaging techniques to enhance the visualization of properties correlated with patient anatomy and physiology. Agents in wide use today localize in tissues either by administration into specific anatomic compartments such as the gastrointestinal or vascular systems or by reliance on nonspecific changes in tissues such as increased capillary permeability or alterations in the extracellular fluid space. These localization mechanisms frequently do not yield a sufficient concentration differential of the agent to reveal subtle tissue differences associated with the presence of an abnormal condition. New contrast agents are needed that exploit growing knowledge about biochemical receptor systems, metabolic pathways, and ‘‘antisense’’ (variant DNA) molecular technologies to yield concentration differentials sufficient to reveal subtle variations among various tissues that may reflect the presence of pathological conditions. Another important imaging application of molecular medicine is using imaging methods to study cellular, molecular, and genetic processes. For example, cells may be genetically altered to attract metal ions that (1) alter the magnetic susceptibility, thereby permitting their identification by magnetic resonance imaging techniques; or (2) are radioactive and therefore can be visualized by nuclear imaging methods. Another possibility is to transect cells with genetic material that causes expression of cell surface receptors that can bind radioactive compounds (3). Conceivably this technique could be used to tag affected cells and monitor the progress of gene therapy. Advances in molecular biology and genetics are yielding new knowledge at an astonishing rate about the molecular and genetic infrastructure that underlie the static and
dynamic processes of human anatomy and physiology. This new knowledge is likely to yield increasingly specific approaches to using imaging methods to visualize normal and abnormal tissue structure and function at increasingly microscopic levels. These methods will in all likelihood lead to further advances in molecular medicine. 3.2 Human Vision Images are the product of the interaction of the human visual system with its environment. Any analysis of images, including medical images, must include at least a cursory review of the process of human vision. This process is outlined here; a more detailed treatment of the characteristics of the ‘‘end user’’ of images is provided in later sections of this Encyclopedia. 3.2.1 Anatomy and Physiology of the Eye. The human eye, diagrammed in Fig. 1, is an approximate sphere that contains four principal features: the cornea, iris, lens, and retina. The retina contains photoreceptors that translate light energy into electrical signals that serve as nerve impulses to the brain. The other three components serve as focusing and filtering mechanisms to transmit a sharp, well-defined light image to the retina. 3.2.1.1 Tunics. The wall of the eye consists of three layers (tunics) that are discontinuous in the posterior portion where the optic nerve enters the eye. The outermost tunic is a fibrous layer of dense connective tissue that includes the cornea and the sclera. The cornea comprises the front curved surface of the eye, contains an array of collagen fibers and no blood vessels, and is transparent to visible light. The cornea serves as a coarse focusing element to project light onto the observer’s retina. The sclera, or white of the eye, is an opaque and resilient sheath to which the eye muscles are attached. The second layer of the wall is a vascular tunic termed the uvea. It contains the choroid, ciliary body, and iris. The choroid contains a dense array of capillaries that supply blood to all of the tunics. Pigments in the choroid reduce internal light reflection that would otherwise blur the images. The ciliary body
contains the muscles that support and focus the lens. It also contains capillaries that secrete fluid into the anterior segment of the eyeball. The iris is the colored part of the eye that has a central aperture termed the pupil. The diameter of the aperture can be altered by the action of muscles in the iris to control the amount of light that enters the posterior cavity of the eye. The aperture can vary from about 1.5 to 8 mm.

3.2.1.2 Chambers and Lens. The anterior and posterior chambers of the eye are filled with fluid. The anterior chamber contains aqueous humor, a clear plasma-like fluid that is continually drained and replaced. The posterior chamber is filled with vitreous humor, a clear viscous fluid that is not replenished. The cornea, aqueous and vitreous humors, and the lens serve collectively as the refractive media of the eye. The lens of the eye provides the fine focusing of incident light onto the retina. It is a convex lens whose thickness can be changed by action of the ciliary muscles. The index of refraction of the lens is close to that of the surrounding fluids in which it is suspended, so it serves as a fine-focusing adjustment to the coarse focusing function of the cornea. The process of accommodation by which near objects are brought into focus is achieved by contraction of the ciliary muscles. This contraction causes the elastic lens to bow forward into the aqueous humor, thereby increasing its thickness. Accommodation is accompanied by constriction of the pupil, which increases the depth of field of the eye. The lens loses its flexibility from aging and is unable to accommodate, so that near objects cannot be focused onto the retina. This is the condition of presbyopia, in which reading glasses are needed to supplement the focusing ability of the lens. Clouding of the lens by aging results in diminution of the amount of light that reaches the retina. This condition is known as a lens cataract; when severe enough, it makes the individual a candidate for surgical replacement of the lens, often with an artificial lens.

3.2.1.3 Retina. The innermost layer of the eye is the retina, which is composed of two components, an outer monolayer of pigmented cells and an inner neural layer of photoreceptors. Because considerable processing of visual information occurs in the retina, it often is thought of more as a remote part of the brain rather than as simply another component of the eye. There are no photoreceptors where the optic nerve enters the eye, creating a blind spot. Near the blind spot is the macula lutea, an area of about 3 mm² over which the retina is especially thin. Within the macula lutea is the fovea centralis, a slight depression about 0.4 mm in diameter. The fovea is on the optical axis of the eye and is the area where the visual cones are concentrated to yield the greatest visual acuity.
Figure 1. Horizontal section through the human eye (from Ref. 4, with permission).
The retina contains two types of photoreceptors, termed rods and cones. Rods are distributed over the entire retina, except in the blind spot and the fovea centralis. The retina contains about 125 million rods, or about 10⁵/mm². Active elements in the rods (and in the cones as well) are replenished throughout an individual's lifetime. Rods have a low but variable threshold to light and respond to very low intensities of incident light. Vision under low illumination levels (e.g., night vision) is attributable almost entirely to rods. Rods contain the light-sensitive pigment rhodopsin (visual purple), which undergoes chemical reactions (the rhodopsin cycle) when exposed to visible light. Rhodopsin consists of a lipoprotein called opsin and a chromophore (a light-absorbing chemical compound called 11-cis-retinal) (5). The chemical reaction begins with the breakdown of rhodopsin and ends with the recombination of the breakdown products into rhodopsin. The recovery process takes 20–30 minutes, which is the time required to accommodate to low levels of illumination (dark adaptation). The process of viewing with rods is known as "scotopic" vision. The rods are maximally sensitive to light of about 510 nm in the blue–green region of the visible spectrum. Rods have no mechanisms to discriminate different wavelengths of light, and vision under low illumination conditions is essentially "colorblind." More than 100 rods are connected to each ganglion cell, and the brain has no way of discriminating among these photoreceptors to identify the origin of an action potential transmitted along the ganglion. Hence, rod vision is associated with relatively low visual acuity in combination with high sensitivity to low levels of ambient light. The retina contains about 7 million cones that are packed tightly in the fovea and diminish rapidly across the macula lutea. The density of cones in the fovea is about 140,000/mm². Cones are maximally sensitive to light of about 550 nm in the yellow–green portion of the visible spectrum. Cones are much less sensitive to light than rods (by a factor of about 10⁴), but in the fovea there is a 1:1 correspondence between cones and ganglion cells, so that
visual acuity is very high. Cones are responsible for color vision through mechanisms that are imperfectly understood at present. One popular theory of color vision proposes that three types of cones exist; each has a different photosensitive pigment that responds maximally to a different wavelength (450 nm for ‘‘blue’’ cones, 525 nm for ‘‘green’’ cones, and 555 nm for ‘‘red’’ cones). The three cone pigments share the same chromophore as the rods; their different spectral sensitivities result from differences in the opsin component. 3.2.2 Properties of Vision. For two objects to be distinguished on the retina, light rays from the objects must define at least a minimum angle as they pass through the optical center of the eye. The minimum angle is defined as the visual angle. The visual angle, expressed in units of minutes of arc, determines the visual acuity of the eye. A rather crude measure of visual acuity is provided by the Snellen chart that consists of rows of letters that diminish in size from top to bottom. When viewed from a distance of 20 feet, a person who has normal vision can just distinguish the letters in the eighth row. This person is said to have 20:20 vision (i.e., the person can see from 20 feet what a normal person can see at the same distance). At this distance, the letters on the eighth row form a visual angle of 1 minute of arc. An individual who has excellent vision and who is functioning in ideal viewing conditions can achieve a visual angle of about 0.5 minutes of arc, which is close to the theoretical minimum defined by the packing density of cones on the retina (6). A person who has 20:100 vision can see at 20 feet what a normal person can see at 100 feet. This individual is considered to have impaired visual acuity. Other more exact tests administered under conditions of uniform illumination are used for actual clinical diagnosis of visual defects. If the lettering of a Snellen chart is reversed (i.e., white letters on a black chart, rather than black letters on a white chart), the ability of observers to recognize the letters from a distance is greatly impaired. The eye is extremely sensitive to small amounts of light. Although the cones do not
respond at illumination levels below a threshold of about 0.001 cd/m², rods are much more sensitive and respond to just a few photons. For example, as few as 10 photons can generate a visual stimulus in an area of the retina where rods are at high concentration (7).

Differences in signal intensity that can just be detected by the human observer are known as just noticeable differences (JND). This concept applies to any type of signal, including light, that can be sensed by the observer. The smallest difference in signal that can be detected depends on the magnitude of the signal. For example, we may be able to discern the brightness difference between one and two candles, but we probably cannot distinguish the difference between 100 and 101 candles. This observation was quantified by the work of Weber, who demonstrated that the JND is directly proportional to the intensity of the signal. This finding was quantified by Fechner as

dS = k (dI/I),   (1)

where I is the intensity of stimulus, dS is an increment of perception (termed a limen), and k is a scaling factor. The integral form of this expression is known as the Weber–Fechner law:

S = k (log I) + C,   (2)

or, by setting C = −k(log I0),

S = k log(I/I0).   (3)

This expression states that the perceived signal S varies with the logarithm of the relative intensity. The Weber–Fechner law is similar to the equation for expressing the intensity of sound in decibels and provides a connection between the objective measurement of sound intensity and the subjective impression of loudness. A modification of the Weber–Fechner law is known as the power law (6). In this expression, the relationship between a stimulus and the perceived signal can be stated as

dS/S = n (dI/I),   (4)

which, when integrated, yields

log S = n(log I) + K   (5)

and, when K is written as −n(log I0),

S = (I/I0)^n,   (6)
where I0 is a reference intensity. The last expression, known as the power law, states that the perceived signal S varies with the relative intensity raised to the power n. The value of the exponent n has been determined by Stevens for a variety of sensations, as shown in Table 2.

Table 2. Exponents in the Power Law for a Variety of Psychophysical Responses*

Perceived Quantity   Exponent   Stimulus
Loudness             0.6        Binaural
Brightness           0.5        Point source
Smell                0.55       Coffee
Taste                1.3        Salt
Temperature          1.0        Cold on arm
Vibration            0.95       60 Hz on finger
Duration             1.1        White noise stimulus
Pressure             1.1        Static force on palm
Heaviness            1.45       Lifted weight
Electric shock       3.5        60 Hz through fingers

*From Ref. 7.
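As a rough numerical illustration of the power law, the Python sketch below maps a few stimulus intensities to perceived magnitudes using exponents from Table 2; the reference intensity and the stimulus values are arbitrary choices for the example, not quantities taken from the text.

```python
import numpy as np

def perceived_signal(intensity, i0, n):
    """Stevens power law: S = (I / I0) ** n."""
    return (np.asarray(intensity, dtype=float) / i0) ** n

# Exponents taken from Table 2; stimulus values are arbitrary illustrations.
exponents = {"brightness": 0.5, "loudness": 0.6, "electric shock": 3.5}
i0 = 1.0                                  # reference intensity (arbitrary units)
intensities = np.array([1.0, 2.0, 10.0, 100.0])

for quantity, n in exponents.items():
    s = perceived_signal(intensities, i0, n)
    print(f"{quantity:>14}: relative intensities {intensities} -> perceived {np.round(s, 2)}")
```

The large exponent for electric shock, compared with the fractional exponents for brightness and loudness, shows how steeply perceived magnitude can grow with stimulus intensity for some sensations.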
3.3 Image Quality

The production of medical images relies on intercepting some form of radiation that is transmitted, scattered, or emitted by the body. The device responsible for intercepting the radiation is termed an image receptor (or radiation detector). The purpose of the image receptor is to generate a measurable signal as a result of energy deposited in it by the intercepted radiation. The signal is often, but not always, an electrical signal that can be measured as an electrical current or voltage pulse. Various image receptors and their uses in specific imaging applications are described in the following sections. In describing the properties of a medical image, it is useful to define certain image characteristics. These characteristics and their definitions change slightly from one type of imaging process to another, so a model is needed to present them conceptually. X-ray projection imaging is the preferred model because this process accounts for more imaging procedures than any other imaging method used in medicine. In X-ray imaging, photons transmitted through the body are intercepted by an image receptor on the side of the body opposite from the X-ray source. The probability of interaction of a photon of energy E in the detector is termed the quantum detection efficiency η (8). This parameter is defined as

η = 1 − e^(−µ(E)t),   (7)

where µ(E) is the linear attenuation coefficient of the detector material that intercepts X rays incident on the image receptor and t is the thickness of the material. The quantum detection efficiency can be increased by making the detector thicker or by using materials that absorb X rays more readily (i.e., have a greater attenuation coefficient µ(E) because they have a higher mass density or atomic number). In general, η is greater at lower X-ray energies and decreases gradually with increasing energy. If the absorbing material has an absorption edge in the energy range of the incident X rays, however, the value of η increases dramatically for X-ray energies slightly above the absorption edge. Absorption edges are depicted in Fig. 2 for three detectors (Gd2O2S, YTaO4, and CaWO4) used in X-ray imaging.

3.3.1 Image Noise. Noise may be defined generically as uncertainty in a signal due to random fluctuations in the signal. Noise is present in all images. It is a result primarily of forming the image from a limited amount of radiation (photons). This contribution to image noise, referred to as quantum mottle, can be reduced by using more radiation to form the image. However, this approach also increases the exposure of the patient to radiation. Other influences on image noise include the intrinsically variable properties of the tissues represented in the image, the type of receptor chosen to acquire the image, the image receptor, processing and display electronics, and the amount of scattered radiation that contributes to the image. In most
instances, quantum mottle is the dominant influence on image noise. In an image receptor exposed to N0 photons, the image is formed with ηN0 photons, and the photon image noise σ can be estimated as σ = (ηN0)^(1/2). The signal-to-noise ratio (SNR) is

SNR = ηN0/(ηN0)^(1/2)   (8)
    = (ηN0)^(1/2).   (9)
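A minimal sketch of how Equations (7)–(9) might be evaluated numerically is given below; the attenuation coefficient, detector thickness, and photon count are illustrative assumptions rather than values for any particular detector.

```python
import math

def quantum_detection_efficiency(mu, t):
    """Equation (7): eta = 1 - exp(-mu * t), with mu in 1/mm and t in mm."""
    return 1.0 - math.exp(-mu * t)

def quantum_snr(eta, n0):
    """Equations (8)-(9): SNR = eta*N0 / sqrt(eta*N0) = sqrt(eta*N0)."""
    return math.sqrt(eta * n0)

# Illustrative values only (not taken from the text or from Fig. 2).
mu = 0.5      # linear attenuation coefficient of the detector material (1/mm)
t = 0.4       # detector thickness (mm)
n0 = 1.0e5    # photons incident per image element

eta = quantum_detection_efficiency(mu, t)
print(f"eta = {eta:.3f}, SNR = {quantum_snr(eta, n0):.1f}")
# Halving the detected photons lowers the SNR by a factor of sqrt(2).
print(f"SNR at N0/2 = {quantum_snr(eta, n0 / 2):.1f}")
```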
A reduction in either the quantum detection efficiency η of the receptor or the number of photons N0 used to form the image yields a lower signal-to-noise ratio and produces a noisier image. This effect is illustrated in Fig. 3. A complete analysis of signal and noise propagation in an imaging system must include consideration of the spatial-frequency dependence of both the signal and the noise. The propagation of the signal is characterized by the modulation transfer function (MTF), and the propagation of noise is described by the Wiener (noise power) spectrum W(f). A useful quantity for characterizing the overall performance of an imaging system is its detective quantum efficiency DQE(f), where (f) reveals that DQE depends on the frequency of the signal. This quantity describes the efficiency with which an imaging system transfers the signal-to-noise ratio of the radiative pattern that emerges from the patient into an image to be viewed by an observer. An ideal imaging system has a DQE(f) = η at all spatial frequencies. In actuality, DQE(f) is invariably less than η, and the difference between DQE(f) and η becomes greater at higher spatial frequencies. If DQE = 0.1η at a particular frequency, then the imaging system performs at that spatial frequency as if the number of photons were reduced to 1/10. Hence, the noise would increase by 10^(1/2) at that particular frequency.

Figure 2. Attenuation curves for three materials used in X-ray intensifying screens (from Ref. 9, with permission).

3.3.2 Spatial Resolution. The spatial resolution of an image is a measure of the smallest visible interval in an object that can be seen in an image of the object. Greater spatial resolution means that smaller intervals can be visualized in the image, that is, greater spatial resolution yields an image that is sharper. Spatial resolution can be measured and expressed in two ways: (1) by a test object
that contains structures separated by various intervals, and (2) by a more formal procedure that employs the modulation transfer function (MTF). A simple but often impractical way to describe the spatial resolution of an imaging system is by measuring its point-spread function (PSF; Fig. 4). The PSF(x,y) is the acquired image of an object that consists of an infinitesimal point located at the origin, that is, for an object defined by the coordinates (0,0). The PSF is the function that operates on what would otherwise be a perfect image to yield an unsharp (blurred) image. If the extent of unsharpness is the same at all locations in the image, then the PSF has the property of being ‘‘spatially invariant,’’ and the relationship of the image to the object (or
perfect image) is

Image(x, y) = PSF(x, y) ⊗ object(x, y),   (10)

where the "⊗" indicates a mathematical operation referred to as "convolution" between the two functions. This operation can be stated as

Image(x, y) = ∫∫ PSF(x − u, y − v) object(u, v) du dv.   (11)

Figure 3. Illustration of quantum mottle. As the illumination of the image increases, quantum mottle decreases and the clarity of the image improves, as depicted in these classic photographs (from Ref. 10).

Figure 4. The point-spread function PSF(x,y).
The convolution effectively smears each value of the object by the PSF to yield the image. The convolution (blurring) operation can be expressed by a functional operator, S[ . . . ], such that PSF(x, y) = S[point(x, y)]
(12)
where S[ . . . ] represents the blurring operator, referred to as the linear transform of the system.
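The discrete analog of the convolution in Equations (10) and (11) can be sketched as follows; the Gaussian PSF and the simple point-plus-edge object are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_psf(size=11, sigma=1.5):
    """A small, normalized Gaussian point-spread function (illustrative only)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    psf = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return psf / psf.sum()

# "Object": a perfect image containing a single bright point and a bright edge.
obj = np.zeros((64, 64))
obj[20, 20] = 1.0          # point object -> its image is the PSF itself
obj[:, 40:] = 0.5          # edge object  -> its image is a blurred edge

psf = gaussian_psf()
image = convolve2d(obj, psf, mode="same", boundary="fill")  # discretized Equation (11)

print("object max:", obj.max(), "blurred image max:", round(float(image.max()), 3))
```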
The modulation transfer function MTF(m,n) is obtained from the PSF(x,y) by using the two-dimensional Fourier transform F:

MTF(m, n) = F[PSF(x, y)],   (13)

where (m,n) are the conjugate spatial frequency variables for the spatial coordinates (x,y). This expression of MTF is not exactly correct in a technical sense. The Fourier transform of the PSF is actually termed the system transfer function, and the MTF is the normalized absolute value of the magnitude of this function. When the PSF is real and symmetrical about the x and y axes, the absolute value of the Fourier transform of the PSF yields the MTF directly (11). MTFs for representative X-ray imaging systems (intensifying screen and film) are shown in Fig. 5.

The PSF and the MTF, which is in effect the representation of the PSF in frequency space, are important descriptors of spatial resolution in a theoretical sense. From a practical point of view, however, the PSF is not very helpful because it can be generated and analyzed only approximately. The difficulty with the PSF is that the source must be essentially a singular point (e.g., a tiny aperture in a lead plate exposed to X rays or a minute source of radioactivity positioned at some distance from a receptor). This condition allows only a few photons (i.e., a very small amount of radiation) to strike the receptor, and very long exposure times are required to acquire an image without excessive noise. In addition, measuring and characterizing the PSF present difficult challenges.

One approach to overcoming the limitations of the PSF is to measure the line-spread function (LSF). In this approach, the source is represented as a long line of infinitesimal width (e.g., a slit in an otherwise opaque plate or a line source of radioactivity). The LSF can be measured by a microdensitometer that is scanned across the slit in a direction perpendicular to its length. As for the point-spread function, the width of the line must be so narrow that it does not contribute to the width of the image. If this condition is met, the width of the image is due entirely to unsharpness contributed by the imaging system. The slit (or line source) is defined mathematically as

line(x) = ∫_−∞^+∞ point(x, y) dy.   (14)

The line-spread function LSF(x) results from the blurring operator for the imaging system operating on a line source (Fig. 6), that is,

LSF(x) = S[line(x)] = S[∫_−∞^+∞ point(x, y) dy] = ∫_−∞^+∞ S[point(x, y)] dy   (15)

LSF(x) = ∫_−∞^+∞ PSF(x, y) dy,   (16)

that is, the line-spread function LSF is the point-spread function PSF integrated over the y dimension. The MTF of an imaging system can be obtained from the Fourier transform of the LSF:

F[LSF(x)]   (17)
= ∫_−∞^+∞ LSF(x) exp(−2πimx) dx   (18)
= ∫_−∞^+∞ ∫_−∞^+∞ PSF(x, y) exp(−2πimx) dy dx   (19)
= ∫_−∞^+∞ ∫_−∞^+∞ PSF(x, y) exp[−2πi(mx + ny)] dy dx, with n = 0   (20)
= F[PSF(x, y)] evaluated at n = 0 = MTF(m, 0),   (21)

that is, the Fourier transform of the line-spread function is the MTF evaluated in one dimension. If the MTF is circularly symmetrical, then this expression describes the MTF completely in the two-dimensional frequency plane. One final method of characterizing the spatial resolution of an imaging system is by using the edge-response function STEP(x,y).
Figure 5. Point-spread (top) and modulation-transfer (bottom) functions for fast and medium-speed CaWO4 intensifying screens (from Ref. 9, with permission).

Figure 6. The line-spread function is the image of an ideal line object, where S represents the linear transform of the imaging system (from Ref. 11, with permission).
In this approach, the imaging system is presented with a source that transmits radiation on one side of an edge and attenuates it completely on the other side. The transmission is defined as

STEP(x, y) = 1 if x ≥ 0, and 0 if x < 0.   (22)

This function can also be written as

STEP(x, y) = ∫_−∞^x line(x) dx.   (23)

The edge-spread function ESF(x) can be computed as

ESF(x) = S[STEP(x, y)] = S[∫_−∞^x line(x) dx]   (24)
       = ∫_−∞^x S[line(x)] dx   (25)
       = ∫_−∞^x LSF(x) dx.   (26)

This relationship, illustrated in Fig. 7, shows that the LSF(x) is the derivative of the edge-response function ESF(x):

LSF(x) = d[ESF(x)]/dx.   (27)
Figure 7. The edge-spread function is derived from the image of an ideal step function, where S represents the linear transform of the imaging system (from Ref. 11, with permission).
FP
O
The relationship between the ESF and the LSF is useful because one can obtain a microdensitometric scan of the edge to yield an
edge-spread function. The derivative of the ESF yields the LSF, and the Fourier transform of the LSF provides the MTF in one dimension.
15
FP
O
IMAGING SCIENCE IN MEDICINE
Figure 8. Different levels of contrast in an image (from Ref. 12, with permission)
body consists of three different body tissues: fat, muscle, and bone. Air is also present in the lungs, sinuses, and gastrointestinal tract, and a contrast agent may have been used to accentuate the attenuation of X rays in a particular region. The chemical composition of the three body tissues, together with their percentage mass composition, are shown in Table 3. Selected physical properties of the tissues are included in Table 4, and the mass attenuation coefficients for different tissues as a function of photon energy are shown in Fig. 9. In Table 4, the data for muscle are also approximately correct for other soft tissues such as collagen, internal organs (e.g., liver
and kidney), ligaments, blood, and cerebrospinal fluid. These data are very close to the data for water, because soft tissues, including muscle, are approximately 75% water, and body fluids are 85% to nearly 100% water. The similarity of these tissues suggests that conventional X-ray imaging yields poor discrimination among them, unless a contrast agent is used to accentuate the differences in X-ray attenuation. Because of the presence of low atomic number (low Z) elements, especially hydrogen, fat has a lower density and effective atomic number compared with muscle and other soft tissues. At less than 35 keV, X
Table 3. Elemental Composition of Tissue Constituents* (% composition by mass)

Element      Adipose Tissue   Muscle (Striated)   Water   Bone (Femur)
Hydrogen     11.2             10.2                11.2    8.4
Carbon       57.3             12.3                        27.6
Nitrogen     1.1              3.5                         2.7
Oxygen       30.3             72.9                88.8    41.0
Sodium                        0.08
Magnesium                     0.02                        0.2
Phosphorus                    0.2                         7.0
Sulfur       0.06             0.5                         0.2
Potassium                     0.3
Calcium                       0.007                       14.7

*From Ref. 11, with permission.

Table 4. Properties of Tissue Constituents of the Human Body*

Material   Effective Atomic Number   Density (kg/m³)    Electron Density (electrons/kg)
Air        7.6                       1.29               3.01 × 10²⁶
Water      7.4                       1.00 × 10³         3.34 × 10²⁶
Muscle     7.4                       1.00 × 10³         3.36 × 10²⁶
Fat        5.9–6.3                   0.91 × 10³         3.34–3.48 × 10²⁶
Bone       11.6–13.8                 1.65–1.85 × 10³    3.00–3.10 × 10²⁶

*From Ref. 11, with permission.
Figure 9. Mass attenuation coefficient of tissues.
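To illustrate how differences in attenuation translate into subject contrast, the sketch below computes the transmitted fraction e^(−µt) for two tissues; the attenuation coefficients and thickness are purely illustrative assumptions, not values read from Fig. 9 or the tables.

```python
import math

def transmission(mu, thickness_cm):
    """Fraction of incident photons transmitted through a uniform slab: exp(-mu * t)."""
    return math.exp(-mu * thickness_cm)

# Illustrative linear attenuation coefficients (1/cm) at a low photon energy.
mu_muscle, mu_fat = 0.40, 0.30
thickness = 5.0  # cm of tissue in the beam path

t_muscle = transmission(mu_muscle, thickness)
t_fat = transmission(mu_fat, thickness)

# One simple definition of subject contrast between the two transmitted signals.
contrast = (t_fat - t_muscle) / t_fat
print(f"transmitted: fat {t_fat:.3f}, muscle {t_muscle:.3f}, contrast {contrast:.2f}")
```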
Because of the presence of low atomic number (low Z) elements, especially hydrogen, fat has a lower density and effective atomic number compared with muscle and other soft tissues. At less than 35 keV, X rays interact in fat and soft tissues predominantly by photoelectric interactions that vary with Z³ of the tissue. This dependence provides higher image contrast among tissues of slightly different composition (e.g., fat and muscle) when low energy X rays are used, compared with that obtained from higher energy X rays that interact primarily by Compton interactions that do not depend on atomic number. Low energy X rays are used to accentuate subtle differences in soft tissues (e.g., fat and other soft tissues) in applications such as breast imaging (mammography), where the structures within the object (the breast) provide little intrinsic contrast. When images are desired of structures of high intrinsic contrast (e.g., the chest where bone, soft tissue, and air are present), higher energy X rays are used to suppress X-ray attenuation in bone, which otherwise would create shadows in the image that could hide underlying soft-tissue pathology.

In some accessible regions of the body, contrast agents can be used to accentuate tissue contrast. For example, iodine-containing contrast agents are often injected into the circulatory system during angiographic imaging of blood vessels. The iodinated agent is water-soluble and mixes with the blood to increase its attenuation compared with surrounding soft tissues. In this manner, blood vessels can be seen that are invisible in X-ray images without a contrast agent. Barium is another element that is used to enhance contrast, usually in studies of the gastrointestinal (GI) tract. A thick solution of a barium-containing
compound is introduced into the GI tract by swallowing or enema, and the solution outlines the borders of the GI tract to permit visualization of ulcers, polyps, ruptures, and other abnormalities. Contrast agents have also been developed for use in ultrasound (solutions that contain microscopic gas bubbles that reflect sound energy) and magnetic resonance imaging (solutions that contain gadolinium that affects the relaxation constants of tissues). 3.3.4 Integration of Image Noise, Resolution and Contrast—The Rose Model. The interpretation of images requires analyzing all of the image’s features, including noise, spatial resolution, and contrast. In trying to understand the interpretive process, the analysis must also include the characteristics of the human observer. Collectively, the interpretive process is referred to as ‘‘visual perception.’’ The study of visual perception has captured the attention of physicists, psychologists, and physicians for more than a century—and of philosophers for several centuries. A seminal investigation of visual perception, performed by Albert Rose in the 1940s and 1950s, yielded the Rose model of human visual perception (13). This model is fundamentally a probabilistic analysis of detection thresholds in low-contrast images. Rose’s theory states that an observer can distinguish two regions of an image, called ‘‘target’’ and ‘‘background,’’ only if there is enough information in the image to permit making the distinction. If the signal is
assumed to be the difference in the number of photons used to define each region and the noise is the statistical uncertainty associated with the number of photons in each region, then the observer needs a certain signal-to-noise ratio to distinguish the regions. Rose suggested that this ratio is between 5 and 7. The Rose model can be quantified by a simple example (11) that assumes that the numbers of photons used to image the target and background are Poisson distributed and that the target and background yield a low-contrast image in which

N = number of photons that define the target ≈ number of photons that define the background
ΔN = signal = difference in the number of photons that define target and background
A = area of the target = area of the background region
C = contrast of the signal compared with background

The contrast between target and background is related to the number of detected photons N and the difference ΔN between the number of photons that define the target and the background:

C = ΔN/N   (28)

Signal = ΔN = CN

For Poisson-distributed events, noise = (N)^(1/2), and the signal-to-noise ratio (SNR) is

SNR = signal/noise = CN/(N)^(1/2)   (29)
    = C(N)^(1/2) = C(ΦA)^(1/2),   (30)

where Φ is the photon fluence (number of photons detected per unit area) and A is the area of the target or background. Using the experimental data of his predecessor Blackwell, Rose found that the SNR has a threshold in the range of 5–7 for differentiating a target from its background. The Rose model is depicted in Fig. 3.
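A minimal numerical sketch of the Rose criterion of Equation (30) follows; the fluence, target area, and contrast values are arbitrary illustrations.

```python
import math

def rose_snr(contrast, fluence, area):
    """Equation (30): SNR = C * sqrt(Phi * A) for a low-contrast target."""
    return contrast * math.sqrt(fluence * area)

# Illustrative values: fluence in photons/mm^2, area in mm^2, contrast as a fraction.
fluence = 1.0e4
area = 1.0
for contrast in (0.01, 0.05, 0.10):
    snr = rose_snr(contrast, fluence, area)
    visible = snr >= 5.0   # Rose threshold of roughly 5-7
    print(f"C = {contrast:.2f}: SNR = {snr:.1f}, above Rose threshold: {visible}")
```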
Figure 10. Results of tests using the contrast-detail phantom of Fig. 2–14 for high-noise and low-noise cases. Each dashed line indicates combinations of size and contrast of objects that are just barely visible above the background noise (from Ref. 9, with permission).

A second approach to integrating resolution, contrast, and noise in image perception involves using contrast-detail curves. This method reveals the threshold contrast needed to perceive an object as a function of its diameter. Contrast-detail curves are shown in Fig. 10 for two sets of images; one was acquired at a relatively high signal-to-noise ratio (SNR), and the other at a lower SNR. The curves illustrate the intuitively obvious
conclusion that images of large objects can be seen at relatively low contrast, whereas smaller objects require greater contrast to be visualized. The threshold contrast curves begin in the upper left corner of the graph (large objects [coarse detail], low contrast) and end in the lower right corner (small objects [fine detail], high contrast). Contrast-detail curves can be used to compare the performance of two imaging systems or the same system under different operating conditions. When the performance data are plotted, as shown in Fig. 10, the superior imaging system is one that encompasses the most visible targets or the greatest area under the curve.

3.4 Image Display/Processing

Conceptually, an image is a two- (or sometimes three-) dimensional continuous distribution of a variable such as intensity or brightness. Each point in the image is an intensity value; for a color image, it may be a vector of three values that represent the primary colors red, green, and blue. An image includes a maximum intensity and a minimum intensity and hence is bounded by finite intensity limits as well as by specific spatial limits. For many decades, medical images were captured on photographic film, which provided a virtually continuous image limited only by the discreteness of image noise and film granularity. Today, however, many if not most medical images are generated by computer-based methods that yield digital images composed of a finite number of numerical values of intensity. A two-dimensional medical image may be composed of J rows of K elements, where each element is referred to as a picture element or pixel, and a three-dimensional image may consist of L slices, where each slice contains J rows of K elements. The three-dimensional image is made up of volume elements or voxels; each voxel has an area of one pixel and a depth equal to the slice thickness. The size of pixels is usually chosen to preserve the desired level of spatial resolution in the image. Pixels are almost invariably square, whereas the depth of a voxel may not correspond to its width and length. Interpolation is frequently used
to adjust the depth of a voxel to its width and length. Often pixels and voxels are referred to collectively as image elements or elements. Digital images are usually stored in a computer so that each element is assigned to a unique location in computer memory. The elements of the image are usually stored sequentially, starting with elements in a row, then elements in the next row, etc., until all of the elements in a slice are stored; then the process is repeated for elements in the next slice. There is a number for each element that represents the intensity or brightness at the corresponding point in the image. Usually, this number is constrained to lie within a specific range starting at 0 and increasing to 255 [8 bit (1 byte) number], 65,535 [16 bit (2 byte) number], or even 4,294,967,295 [32 bit (4 byte) number]. To conserve computer memory, many medical images employ an 8-bit intensity number. This decision may require scaling the intensity values, so that they are mapped over the available 256 (0–255) numbers within the 8-bit range. Scaling is achieved by multiplying the intensity value Ii in each pixel by 255/Imax to yield an adjusted intensity value to be stored at the pixel location. As computer capacity has grown and memory costs have decreased, the need for scaling to decrease memory storage has become less important and is significant today only when large numbers of high-resolution images (e.g., X-ray planar images) must be stored. To display a stored image, the intensity value for each pixel in computer memory is converted to a voltage that is used to control the brightness at the corresponding location on the display screen. The intensity may be linearly related to the voltage. However, the relationship between voltage and screen brightness is a function of the display system and usually is not linear. Further, it may be desirable to alter the voltage:brightness relationship to accentuate or suppress certain features in the image. This can be done by using a lookup table to adjust voltage values for the shades of brightness desired in the image, that is, voltages that correspond to intensity values in computer memory are adjusted by using a lookup table to other voltage values that yield the desired distribution of image brightness. If a number of lookup
tables are available, the user can select the table desired to illustrate specific features in the image. Examples of brightness:voltage curves obtained from different lookup tables are shown in Fig. 11. The human eye can distinguish brightness differences of about 2%. Consequently, a greater absolute difference in brightness is required to distinguish bright areas in an image compared with dimmer areas. To compensate for this limitation, the voltage applied to the display screen may be modified by the factor e^(kV) to yield an adjusted brightness that increases with the voltage V. The constant k can be chosen to provide the desired contrast in the displayed images.

Figure 11. (a) A linear display mapping; (b) a nonlinear display to increase contrast (from Ref. 6, with permission).

3.4.1 Image Processing. Often pixel data are mapped onto a display system in the manner described before. Sometimes, however, it is desirable to distort the mapping to accentuate certain features of the image. This process, termed image processing, can be used to smooth images by reducing their noisiness, accentuate detail by sharpening edges in the image, and enhance contrast in selected regions to reveal features of interest. A few techniques are discussed here as examples of image processing.

Figure 12. (a) Representative image histogram; (b) intensity-equalized histogram.

In many images, the large majority of pixels have intensity values that are clumped closely together, as illustrated in Fig. 12a. Mapping these values onto the display, either directly or in modified form as described earlier, is inefficient because there are few bright or dark pixels to be displayed. The process of histogram equalization improves the efficiency by expanding the contrast range within which most pixels fall, and by compressing the range at the bright and dark ends where few pixels have intensity values. This method can make subtle differences in intensity values among pixels more visible. It is useful when the pixels at the upper and lower ends of the intensity range are not important. The process of histogram equalization is illustrated in Fig. 12b. Histogram
equalization is also applicable when the pixels are clumped at the high or low end of intensity values.

All images contain noise as an intrinsic product of the imaging process. Features of an image can be obscured by noise, and reducing the noise is sometimes desired to make such features visible. Image noise can be reduced by image smoothing (summing or averaging intensity values) across adjacent pixels. The selection and weighting of pixels for averaging can be varied among several patterns; representative techniques are included here as examples of image smoothing. In the portrayal of a pixel and its neighbors shown in Fig. 13, the intensity value of the pixel (j,k) can be replaced by the average intensity of the pixel and its nearest neighbors (6). This method is then repeated for each pixel in the image. The nearest neighbor approach is a "filter" to reduce noise and yield a smoothed image. The nearest neighbor approach is not confined to a set number of pixels to arrive at an average pixel value; for example, the array of pixels shown in Fig. 13 could be reduced from 9 to 5 pixels or increased from 9 to 25 pixels, in arriving at an average intensity value for the central pixel. An averaging of pixel values, in which all of the pixels are averaged by using the same weighting to yield a filtered image, is a simple approach to image smoothing.

Figure 13. The nearest neighbor approach to image smoothing.

The averaging process can be modified so that the intensity values of some pixels are weighted more
heavily than others. Weighted filters usually emphasize the intensity value of the central pixel (the one whose value will be replaced by the averaged value) and give reduced weight to the surrounding pixels in arriving at a weighted average. An almost universal rule of image smoothing is that when smoothing is employed, noise decreases, but unsharpness increases (i.e., edges are blurred). In general, greater smoothing of an image to reduce noise leads to greater blurring of edges as a result of increased unsharpness. Work on image processing often is directed at achieving an optimum balance between increased image smoothing and increased image unsharpness to reveal features of interest in specific types of medical images. The image filtering techniques described before are examples of linear filtering. Many other image-smoothing routines are available, including those that employ ‘‘nonlinear’’ methods to reduce noise. One example of a nonlinear filter is replacement of a pixel intensity by the median value, rather than the average intensity, in a surrounding array of pixels. This filter removes isolated noise spikes and speckle from an image and can help maintain sharp boundaries. Images can also be smoothed in frequency space rather than in real space, often at greater speed. 3.4.2 Image Restoration. Image restoration is a term that refers to techniques to remove or reduce image blurring, so that the image is ‘‘restored’’ to a sharper condition that is more representative of the object. This
technique is performed in frequency space by using Fourier transforms for both the image and the point-spread function of the imaging system. The technique is expressed as

O(j, k) = I(j, k) / P(j, k),   (31)
where I (j,k) is the Fourier transform of the image, P (j,k) is the Fourier transform of the point-spread function, and O (j,k) is the Fourier transform of the object (in three-dimensional space, a third spatial dimension l would be involved). This method implies that the unsharpness (blurring) characteristics of the imaging device can be removed by image processing after the image has been formed. Although many investigators have pursued image restoration with considerable enthusiasm, interest has waned in recent years because two significant limitations of the method have surfaced (6). The first is that the Fourier transform P (j,k) can be zero for certain values of (j,k), leading to an undetermined value for O (j,k). The second is that image noise is amplified greatly by the restoration process and often so overwhelms the imaging data that the restored image is useless. Although methods have been developed to reduce these limitations, the conclusion of most attempts to restore medical images is that it is preferable to collect medical images at high resolution, even if sensitivity is compromised, than to collect the images at higher sensitivity and lower resolution, and then try to use image-restoration techniques to recover image resolution. 3.4.3 Image Enhancement. The human eye and brain act to interpret a visual image principally in terms of boundaries that are presented as steep gradients in image brightness between two adjacent regions. If an image is processed to enhance these boundary (edge) gradients, then image detail may be more visible to the observer. Edge-enhancement algorithms function by disproportionately increasing the high-frequency components of the image. This approach also tends to enhance image noise, so that edge-enhancement algorithms are often used together with an image-smoothing filter to suppress noise.
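The smoothing, median filtering, and edge-enhancement operations described above can be sketched with standard SciPy filters, as below; the synthetic image, filter sizes, and unsharp-mask weight are arbitrary illustrative choices rather than settings from the text.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

# A synthetic noisy image: a bright square on a dark background plus added noise.
image = np.zeros((64, 64))
image[24:40, 24:40] = 100.0
noisy = image + rng.normal(scale=10.0, size=image.shape)

smoothed = ndimage.uniform_filter(noisy, size=3)    # nearest-neighbor (mean) smoothing
median = ndimage.median_filter(noisy, size=3)       # nonlinear filter; preserves edges better
blurred = ndimage.gaussian_filter(noisy, sigma=2.0)
unsharp = noisy + 1.5 * (noisy - blurred)           # simple unsharp-mask edge enhancement

for name, img in [("noisy", noisy), ("mean", smoothed), ("median", median), ("unsharp", unsharp)]:
    print(f"{name:>7}: std of background = {img[:16, :16].std():.1f}")
```

The printed background standard deviations illustrate the trade-off noted above: smoothing lowers the noise, whereas edge enhancement amplifies it.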
4 CONCLUSION
The use of images to detect, diagnose, and treat human illness and injury has been a collaboration among physicists, engineers, and physicians since the discovery of X rays in 1895 and the first applications of X-ray images to medicine before the turn of the twentieth century. The dramatic expansion of medical imaging during the past century and the ubiquitous character of imaging in all of medicine today have strengthened the linkage connecting physics, engineering, and medical imaging. This bond is sure to grow even stronger as imaging develops as a tool for probing the cellular, molecular, and genetic nature of disease and disability during the first few years of the twenty-first century. Medical imaging offers innumerable challenges and opportunities to young physicists and engineers interested in applying their knowledge and insight to improving the human condition.

REFERENCES

1. W. R. Hendee, Rev. Mod. Phys. 71(2), Centenary, S444–S450 (1999).
2. R. N. Beck, in W. Hendee and J. Trueblood, eds., Digital Imaging, Medical Physics, Madison, WI, 1993, pp. 643–665.
3. J. H. Thrall, Diagn. Imaging (Dec.), 23–27 (1997).
4. W. R. Hendee, in W. Hendee and J. Trueblood, eds., Digital Imaging, Medical Physics, Madison, WI, 1993, pp. 195–212.
5. P. F. Sharp and R. Philips, in W. R. Hendee and P. N. T. Wells, eds., The Perception of Visual Information, Springer, NY, 1997, pp. 1–32.
6. B. H. Brown et al., Medical Physics and Biomedical Engineering, Institute of Physics Publishing, Philadelphia, 1999.
7. S. S. Stevens, in W. A. Rosenblith, ed., Sensory Communication, MIT Press, Cambridge, MA, 1961, pp. 1–33.
8. J. A. Rowlands, in W. R. Hendee, ed., Biomedical Uses of Radiation, Wiley-VCH, Weinheim, Germany, 1999, pp. 97–173.
9. A. B. Wolbarst, Physics of Radiology, Appleton and Lange, Norwalk, CT, 1993.
10. A. Rose, Vision: Human and Electronic, Plenum Publishing, NY, 1973.
11. B. H. Hasegawa, The Physics of Medical X-Ray Imaging, 2nd ed., Medical Physics, Madison, WI, 1991, p. 127.
12. W. R. Hendee, in C. E. Putman and C. E. Ravin, eds., Textbook of Diagnostic Imaging, 2nd ed., W. B. Saunders, Philadelphia, 1994, pp. 1–97.
13. A. Rose, Proc. Inst. Radio Engineers 30, 293–300 (1942).
IMPUTATION
SAMANTHA R. COOK and DONALD B. RUBIN
Harvard University

1 INTRODUCTION

Missing data are a common problem with large databases in general and with clinical and health-care databases in particular. Subjects in clinical trials may fail to provide data at one or more time points or may drop out of a trial altogether, for reasons including lack of interest, untoward side effects, and change of geographical location. Data may also be "missing" due to death, although the methods described here are generally not appropriate for such situations. An intuitive way to handle missing data is to fill in (i.e., impute) plausible values for the missing values, thereby creating completed datasets that can be analyzed using standard complete-data methods. The past 25 years have seen tremendous improvements in the statistical methodology for handling incomplete datasets. After briefly discussing missing data mechanisms, the authors present some common imputation methods, focusing on multiple imputation (1). They then discuss computational issues, present some examples, and conclude with a short summary.

2 MISSING DATA MECHANISMS

A missing data mechanism is a probabilistic rule that governs which data will be observed and which will be missing. Rubin (2) and Little and Rubin (3) describe three types of missing data mechanisms. Missing data are missing completely at random (MCAR) if missingness is independent of both observed and missing values of all variables. MCAR is the only missing data mechanism for which "complete case" analysis (i.e., restricting the analysis to only those subjects with no missing data) is generally acceptable. Missing data are missing at random (MAR) if missingness depends only on observed values of variables and not on any missing values; for example, if the value of blood pressure at time two is more likely to be missing when the observed value of blood pressure at time one was normal, regardless of the value of blood pressure at time two. If missingness depends on the values that are missing, even after conditioning on all observed quantities, the missing data mechanism is not missing at random (NMAR). Missingness must then be modeled jointly with the data—the missingness mechanism is "nonignorable." The specific imputation procedures described here are most appropriate when the missing data are MAR and ignorable (see References 2 and 3 for details). Imputation (multiple) can still be validly used with nonignorable missing data, but it is more challenging to do it well.

3 SINGLE IMPUTATION

Single imputation refers to imputing one value for each missing datum. Singly imputed datasets are straightforward to analyze using complete-data methods, which makes single imputation an apparently attractive option with incomplete data. Little and Rubin (3, p. 72) offer the following guidelines for creating imputations. They should be (1) conditional on observed variables; (2) multivariate, to reflect associations among missing variables; and (3) drawn from predictive distributions rather than set equal to means, to ensure that variability is reflected. Unconditional mean imputation, which replaces each missing value with the mean of the observed values of that variable, meets none of the three guidelines listed above. Conditional mean imputation can satisfy the first guideline by filling in missing values with means calculated within cells defined by variables such as gender and/or treatment arm, but it does not meet the second or third guidelines. Regression imputation can satisfy the first two guidelines by replacing
the missing values for each variable with the values predicted from a regression (e.g., least squares) of that variable on other variables. Stochastic regression imputation adds random noise to the value predicted by the regression model, and when done properly can meet all three guidelines. Hot deck imputation replaces each missing value with a random draw from a donor pool of observed values of that variable; donor pools are selected, for example, by choosing individuals with complete data who have ‘‘similar’’ observed values to the subject with missing data, e.g., by exact matching or using a distance measure on observed variables to define similar. Hot deck imputation, when done properly, can also satisfy all three of the guidelines listed above. Even though analyzing a singly imputed dataset with standard techniques is easy, such an analysis will nearly always result in estimated standard errors that are too small, confidence intervals that are too narrow, and P-values that are too significant, regardless of how the imputations were created. Thus, single imputation is almost always statistically invalid. However, the multiple version of a single imputation method will be valid if the imputation method is ‘‘proper.’’ Proper imputations satisfy the three criteria of Little and Rubin.
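As an illustration of how conditional mean, stochastic regression, and hot deck imputation differ in practice, the sketch below imputes a single incomplete variable y from a complete covariate x; the simulated data, the 30% missingness rate, and the five-donor pool are assumptions of the example, not part of the source.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: y depends on x; roughly 30% of y is missing at random given x.
n = 200
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)
missing = rng.random(n) < 0.3
y_obs = np.where(missing, np.nan, y)

obs = ~np.isnan(y_obs)
beta1, beta0 = np.polyfit(x[obs], y_obs[obs], 1)            # least-squares fit on complete cases
resid_sd = np.std(y_obs[obs] - (beta0 + beta1 * x[obs]))

pred = beta0 + beta1 * x[~obs]
y_mean_imp = pred                                            # conditional mean imputation
y_stoch_imp = pred + rng.normal(scale=resid_sd, size=pred.size)  # stochastic regression imputation

# Hot deck: for each missing y, draw from a pool of the 5 donors with the closest observed x.
donors_x, donors_y = x[obs], y_obs[obs]
y_hotdeck_imp = np.array([
    donors_y[np.argsort(np.abs(donors_x - xi))[:5]][rng.integers(5)]
    for xi in x[~obs]
])

print("SD of imputed values:",
      round(np.std(y_mean_imp), 2), round(np.std(y_stoch_imp), 2), round(np.std(y_hotdeck_imp), 2))
```

The mean-imputed values show the least spread, which is exactly the understatement of variability that the guidelines above warn against.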
4 PROPERLY DRAWN SINGLE IMPUTATIONS

Let Y represent the complete data; i.e., all data that would be observed in the absence of missing data, and let Y = {Yobs, Ymis}, the observed and missing values, respectively. For notational simplicity, assume ignorability of the missing data mechanism. Also let θ represent the (generally vector-valued) parameter associated with an appropriate imputation model, which consists of both a sampling distribution on Y governed by θ, p(Y|θ), and a prior distribution on θ, p(θ). A proper imputation is often most easily obtained as a random draw from the posterior predictive distribution of the missing data given the observed data, which formally can be written as

p(Ymis|Yobs) = ∫ p(Ymis, θ|Yobs) dθ = ∫ p(Ymis|Yobs, θ) p(θ|Yobs) dθ.
If the missing data follow a monotone pattern (defined below), this distribution is straightforward to sample from. When missing data are not monotone, iterative computational methods are generally necessary, as described shortly.

5 PROPERLY DRAWN SINGLE IMPUTATIONS—THEORY WITH MONOTONE MISSINGNESS

A missing data pattern is monotone if the rows and columns of the data matrix can be sorted in such a way that an irregular staircase separates Yobs and Ymis. Figures 1 and 2 illustrate monotone missing data patterns. Let Y0 represent fully observed variables, Y1 the incompletely observed variable with the fewest missing values, Y2 the variable with the second fewest missing values, and so on. Proper imputation with a monotone missing data pattern begins by fitting an appropriate model to predict Y1 from Y0 and then using this model to impute the missing values in Y1. For example, fit a regression of Y1 on Y0 using the units with Y1 observed, draw the regression parameters from their posterior distribution, and then draw the missing values of Y1 given these parameters and Y0. Next impute the missing values for Y2 using Y0 and the observed and imputed values of
Y1. Continue in this manner until all missing values have been imputed. The collection of imputed values is a proper imputation of the missing data, Ymis, and the collection of univariate prediction models is the implied full imputation model. When missing data are not monotone, this method of imputation as described cannot be used.

Figure 1. Pattern of missing data for the Intergel trial.

Figure 2. Pattern of missing data for the Genzyme trial.
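A schematic sketch of this sequential procedure for two incomplete variables with a monotone pattern is given below; the simulated data are an assumption of the example, only the preceding variable is used as a predictor at each step, and the parameter draw is simplified to residual noise, so this illustrates the sequence of steps rather than a fully proper imputation.

```python
import numpy as np

rng = np.random.default_rng(2)

def impute_from_regression(x_pred, x_fit, y_fit, rng):
    """Fit y ~ x on complete cases and draw imputations as prediction + residual noise.
    (A fully proper version would also draw the regression parameters from their posterior.)"""
    slope, intercept = np.polyfit(x_fit, y_fit, 1)
    resid_sd = np.std(y_fit - (intercept + slope * x_fit))
    return intercept + slope * x_pred + rng.normal(scale=resid_sd, size=len(x_pred))

# Simulated monotone pattern: y0 fully observed; y1 missing for some units;
# y2 missing for a superset of those units (the "staircase").
n = 150
y0 = rng.normal(size=n)
y1 = 1.0 + 0.8 * y0 + rng.normal(scale=0.5, size=n)
y2 = -0.5 + 0.6 * y1 + rng.normal(scale=0.5, size=n)
m1 = rng.random(n) < 0.2
m2 = m1 | (rng.random(n) < 0.2)
y1[m1] = np.nan
y2[m2] = np.nan

# Step 1: impute Y1 from Y0 using the units with Y1 observed.
y1[m1] = impute_from_regression(y0[m1], y0[~m1], y1[~m1], rng)
# Step 2: impute Y2 from the (observed and imputed) Y1, fitting on units with Y2 observed.
y2[m2] = impute_from_regression(y1[m2], y1[~m2], y2[~m2], rng)

print("remaining missing values:", np.isnan(y1).sum() + np.isnan(y2).sum())
```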
6 PROPERLY DRAWN SINGLE IMPUTATIONS—THEORY WITH NONMONOTONE MISSINGNESS

Creating imputations when the missing data pattern is nonmonotone generally involves iteration because the distribution p(Ymis|Yobs) is often difficult to draw from directly. However, the data augmentation algorithm [DA; Tanner and Wong (4)] is often straightforward to implement. Briefly, DA involves iterating between sampling missing data given a current draw of the model parameters and sampling model parameters given a current draw of the missing data. The draws of Ymis form a Markov Chain whose stationary distribution is p(Ymis|Yobs). Thus, once the Markov Chain has reached approximate convergence, a draw of Ymis from DA is effectively a proper single imputation of the missing data from the target distribution p(Ymis|Yobs). Many programs discussed in Section 9 use DA to impute missing values. Other algorithms that use Markov Chain Monte Carlo methods for imputing missing values include Gibbs sampling and Metropolis–Hastings. See Gelman et al. (5, Chapter 11) for more details.

As discussed, analyzing a singly imputed dataset using complete-data methods usually leads to anticonservative results because imputed values are treated as if they were known, thereby underestimating uncertainty. Multiple imputation corrects this problem, while retaining the advantages of single imputation.

7 MULTIPLE IMPUTATION
Described in detail in Rubin (6), multiple imputation (MI) is a Monte Carlo technique that replaces the missing values Ymis with m > 1 plausible values, {Ymis,1, . . . , Ymis,m}, and therefore reveals uncertainty in the imputed values. Each set of imputations creates a completed dataset, thereby creating m completed datasets: Y(1) = {Yobs, Ymis,1}, . . . , Y(l) = {Yobs, Ymis,l}, . . . , Y(m) = {Yobs, Ymis,m}. Typically m is fairly small: m = 5 is a standard number of imputations to use. Each of the m completed datasets is then analyzed as if there were no missing data, and the results are combined using simple rules described shortly. Obtaining proper MIs is no more difficult than obtaining a single proper imputation—the process for obtaining a proper single imputation is simply repeated independently m times.

8 COMBINING RULES FOR PROPER MULTIPLE IMPUTATIONS

As in Rubin (6) and Schafer (7), let Q represent the estimand of interest (e.g., the mean of a variable, a relative risk, the intention-to-treat effect, etc.), let Qest represent the complete-data estimator of Q (i.e., the quantity calculated treating all imputed values of Ymis as known observed data), and let U represent the estimated variance of Qest − Q. Let Qest,l be the estimate of Q based on the lth imputation of Ymis with associated variance Ul; that is, the estimate of Q and associated variance are based on the complete-data analysis of the lth completed dataset, Yl = {Yobs, Ymis,l}. The MI estimate of Q is simply the average of the m estimates: QMIest = (1/m) Σ_{l=1}^{m} Qest,l. The estimated variance of QMIest − Q is found by combining between and within imputation variance, as with the analysis of variance: T = Uave + (1 + m⁻¹)B, where Uave = (1/m) Σ_{l=1}^{m} Ul is the within imputation variance
and B = (1/(m − 1)) Σ_{l=1}^{m} (Qest,l − QMIest)² is the between imputation variance. The quantity T^(−1/2)(Q − QMIest) follows an approximate t distribution with degrees of freedom ν = (m − 1)(1 + Uave/((1 + m⁻¹)B))². See Rubin and Schenker (8) for additional methods for combining vector-valued estimates, significance levels, and likelihood ratio statistics and Barnard and Rubin (9) for an improved expression for ν with small complete data sets.

9 COMPUTATION FOR MULTIPLE IMPUTATION

Many standard statistical software packages now have built-in or add-on functions for MI. The S-plus libraries NORM, CAT, MIX, and PAN, for analyzing continuous, categorical, mixed, and panel data, respectively, are freely available [(7), http://www.stat.psu.edu/~jls/], as is MICE [(10), http://web.inter.nl.net/users/S.van.Buuren/mi/html/mice.htm], which uses regression models to impute all types of data. SAS now has procedures PROC MI and PROC MIANALYZE; in addition, IVEware (11) is freely available and can be called using SAS (http://support.sas.com/rnd/app/da/new/dami.html). New software packages have also been developed specifically for multiple imputation, for example, the commercially available SOLAS (http://www.statsol.ie/solas/solas.htm), which has been available for years and is most appropriate for datasets with a monotone or nearly monotone pattern of missing data, and the freely available NORM, a standalone Windows version of the S-plus function NORM (http://www.stat.psu.edu/~jls/). For more information, see www.multiple-imputation.com or Horton and Lipsitz (12).
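The combining rules of Section 8 are simple to apply directly; the sketch below pools m completed-data estimates and variances (arbitrary illustrative numbers, not trial results) into the MI estimate, the total variance T, and the degrees of freedom ν.

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Combine m completed-data estimates and variances using the rules in Section 8."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                                   # MI point estimate
    u_bar = u.mean()                                   # within-imputation variance
    b = q.var(ddof=1)                                  # between-imputation variance
    t = u_bar + (1.0 + 1.0 / m) * b                    # total variance
    nu = (m - 1) * (1.0 + u_bar / ((1.0 + 1.0 / m) * b)) ** 2   # degrees of freedom
    return q_bar, t, nu

# Illustrative completed-data results from m = 5 imputations.
estimates = [1.10, 1.25, 0.98, 1.18, 1.05]
variances = [0.040, 0.042, 0.038, 0.041, 0.039]

q_bar, t, nu = rubin_combine(estimates, variances)
print(f"MI estimate = {q_bar:.3f}, total variance T = {t:.4f}, df = {nu:.1f}")
```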
10 EXAMPLE: LIFECORE
Intergel is a solution developed by Lifecore Corporation to prevent surgical gynecological adhesions. A double-blind, multicenter randomized trial was designed for the U.S. Food and Drug Administration (FDA) to determine whether Intergel significantly reduces the formation of adhesions after a first surgery. The data collection procedure for this study
was fairly intrusive: Patients had to undergo a minor abdominal surgery weeks after the first surgery in order for doctors to count the number of gynecological adhesions. The trial, therefore, suffered from missing data because not all women were willing to have another surgery, despite having initially agreed to do so. The original proposal from the FDA for imputing the missing values was to fill in the worst possible value (defined to be 32 adhesions; most patients with observed data had 10 or fewer) for each missing datum, which should lead to "conservative" results because there were more missing data in the treatment arm than in the placebo arm. This method ignores observed data when creating imputations; for example, one woman in the treatment group refused the second-look surgery because she was pregnant, which would have been impossible with more than a few gynecological adhesions. Furthermore, because the imputed values are so much larger than the observed values, the standard errors based on these worst-possible-value imputations were inflated, making it unlikely that significant results could be obtained even when the two treatments truly differed. Figure 1 displays the general pattern of monotone missing data in this case, with X representing covariates, Y(0) outcomes under placebo, and Y(1) outcomes under Intergel; the question marks represent missing values. Colton et al. (13) instead used an MI hot-deck procedure to impute the missing values. Donor pools were created within cells defined by treatment group, treatment center, and baseline seriousness of adhesions, which were observed for all patients: For each patient whose outcome was missing, the donor pool consisted of the two patients in the same treatment group and treatment center who had the closest baseline adhesion scores. Each value in the donor pool was used as an imputation. Formally this method is improper, but the limited donor pools should still make the method conservative because the matches are not as close as they would be with bigger sample sizes.
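As an illustration of the donor-pool idea, here is a minimal Python sketch, not the authors' actual program. The data frame column names (treatment, center, baseline, adhesions) and the two-donor pool size are assumptions drawn from the description above.

```python
import numpy as np
import pandas as pd

def hot_deck_impute(df, n_imputations=2, seed=0):
    """Donor-pool hot-deck imputation sketched for the Intergel-style setting.

    For each patient with a missing adhesion count, the donor pool is the
    two observed patients in the same treatment group and center with the
    closest baseline adhesion scores; each donor's observed outcome supplies
    one imputed dataset.
    """
    rng = np.random.default_rng(seed)
    completed = [df.copy() for _ in range(n_imputations)]
    for idx, row in df[df["adhesions"].isna()].iterrows():
        pool = df[
            (df["treatment"] == row["treatment"])
            & (df["center"] == row["center"])
            & df["adhesions"].notna()
        ].copy()
        if pool.empty:
            continue  # no donors available in this cell; leave missing
        # the donors with the closest baseline adhesion scores
        pool["dist"] = (pool["baseline"] - row["baseline"]).abs()
        donors = pool.nsmallest(n_imputations, "dist")["adhesions"].to_numpy()
        rng.shuffle(donors)  # arbitrary assignment of donors to imputations
        for l in range(n_imputations):
            completed[l].loc[idx, "adhesions"] = donors[l % len(donors)]
    return completed
```

Each element of the returned list is one completed dataset; the analysis of interest is then run on each and the results combined with the rules of Section 8.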
11 EXAMPLE: GENZYME
Fabrazyme is a drug developed by Genzyme Corporation to treat Fabry's disease, a rare and serious genetic disease that progressively damages the kidneys, so that serum creatinine rises as the disease advances. Preliminary results from a Phase 3 FDA trial of Fabrazyme versus placebo showed that the drug appeared to work well in patients in their 30s, who were not yet severely ill, in the sense that it substantially lowered their serum creatinine. A similar Phase 4 trial involved older patients who were more seriously ill. As there is no other fully competitive drug, it was desired to make Fabrazyme commercially available early, which would allow patients randomized to placebo to begin taking Fabrazyme but would create missing outcome data among placebo patients after they began taking Fabrazyme. The study had staggered enrollment, so that the number of monthly observations of serum creatinine for each placebo patient depended on his time of entry into the study. Figure 2 illustrates the general pattern of monotone missing data with the same length follow-up for each patient. Again, X represents baseline covariates, Y(0) represents repeated measures of serum creatinine for placebo patients, and Y(1) represents repeated measures of serum creatinine for Fabrazyme patients. To impute the missing outcomes under placebo, a complex hierarchical Bayesian model was developed for the progression of serum creatinine in untreated Fabry patients. In this model, inverse serum creatinine varies linearly and quadratically in time, and the prior distribution for the quadratic trend in placebo patients is obtained from the posterior distribution of the quadratic trend in an analogous model fit to a historical database of untreated Fabry patients. Thus, the historical patients' data influence the imputations of the placebo patients' data only subtly, via the prior distribution on the quadratic trend parameters. Although the model-fitting algorithm is complex, it is straightforward to use the algorithm to obtain draws from p(θ | Yobs) for the placebo patients, and then to draw Ymis conditional on the drawn value of θ, where θ represents all model parameters. Drawing
the missing values in this way creates a sample from p(Ymis | Yobs) and thus an imputation for the missing values in the placebo group. The primary analysis will consider the time to an event, defined as either a clinical event (e.g., kidney dialysis, stroke) or a substantial increase in serum creatinine relative to baseline. The analysis will be conducted on each imputed dataset and the results will be combined (as outlined earlier) to form a single inference.
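The hierarchical model itself is beyond the scope of a short sketch, but the two-step logic of a proper Bayesian imputation (draw θ from its posterior given the observed data, then draw Ymis given θ) can be illustrated with a deliberately simple normal model. Everything below, including the vague-prior posterior, is a toy stand-in rather than the model used in the trial.

```python
import numpy as np

def draw_missing_normal(y_obs, n_mis, m=5, seed=0):
    """Two-step proper imputation under a simple normal model.

    Step 1 draws parameters theta = (mu, sigma^2) from p(theta | Y_obs)
    (here the standard posterior under a vague prior); step 2 draws
    Y_mis from p(Y_mis | theta). Repeating m times gives m imputations.
    """
    rng = np.random.default_rng(seed)
    y_obs = np.asarray(y_obs, dtype=float)
    n = len(y_obs)
    ybar, s2 = y_obs.mean(), y_obs.var(ddof=1)

    imputations = []
    for _ in range(m):
        # Step 1: sigma^2 | Y_obs ~ (n-1) s^2 / chi^2_{n-1}, then mu | sigma^2, Y_obs
        sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)
        mu = rng.normal(ybar, np.sqrt(sigma2 / n))
        # Step 2: draw the missing values given the drawn parameters
        imputations.append(rng.normal(mu, np.sqrt(sigma2), size=n_mis))
    return imputations
```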
12 EXAMPLE: NMES
The National Medical Expenditure Survey (NMES) collects data on medical costs from a random sample of the U.S. population. The data include medical expenditures, background information, and demographic information. Multiple imputation for NMES was more complicated than in the previous two examples because the missing data pattern was not monotone. Figure 3 shows a simplification of the missing data pattern for NMES; if Y1 were fully observed, the missing data pattern would be monotone.

Figure 3. Illustrative display of the pattern of missing data in NMES (question marks indicate missing values in Y1, Y2, and Y3).

Rubin (14) imputed the missing data in NMES by capitalizing on the simplicity of imputation for monotone missing data: first impute the missing values that destroy the monotone pattern (the "nonmonotone missing values"), then proceed as if the missing data pattern were in fact monotone, and then iterate this process. More specifically, after choosing starting values for the missing data, iterate between the following two steps: (1) Regress each variable with any nonmonotone missing values (i.e., Y1) on all other variables (i.e., Y0, Y2, Y3), treating the current imputations as true values, but use this regression to impute only the nonmonotone missing values. (2) Impute the remaining missing values in the monotone pattern: first impute the variable with the fewest missing values (Y2 in Fig. 3), then the variable with the second fewest missing values (Y3 in Fig. 3), and so on, treating the nonmonotone missing values filled in step 1 as known. This process was repeated five times to create five sets of imputations in the NMES example.
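The two-step iteration can be sketched in Python as follows. This is an illustrative sketch with hypothetical column roles (Y0, Y1, Y2, Y3 as in Fig. 3), and it uses simple normal linear regressions with point-estimated coefficients plus residual noise; the actual NMES imputation used richer models and fully proper draws.

```python
import numpy as np
import pandas as pd

def impute_near_monotone(df, nonmonotone_cols, monotone_cols, n_iter=10, seed=0):
    """Iterative imputation for a nearly monotone missing data pattern.

    nonmonotone_cols (e.g., ["Y1"]): variables whose missing values destroy
    the monotone pattern. monotone_cols (e.g., ["Y2", "Y3"]): remaining
    incomplete variables ordered from fewest to most missing values.
    Returns one completed dataset; rerun with different seeds for MI.
    """
    rng = np.random.default_rng(seed)
    data = df.copy()
    miss = df.isna()                  # remember which cells were missing
    data = data.fillna(df.mean())     # starting values: column means

    def draw(target, predictors):
        # Regress target on predictors among observed rows, then fill the
        # missing rows with predictions plus residual noise. (A fully
        # proper version would also draw the regression parameters.)
        obs = ~miss[target]
        X_obs = np.column_stack([np.ones(obs.sum()), data.loc[obs, predictors].to_numpy()])
        y = data.loc[obs, target].to_numpy()
        beta, *_ = np.linalg.lstsq(X_obs, y, rcond=None)
        resid_sd = np.std(y - X_obs @ beta, ddof=1)
        X_mis = np.column_stack(
            [np.ones(miss[target].sum()), data.loc[miss[target], predictors].to_numpy()]
        )
        data.loc[miss[target], target] = X_mis @ beta + rng.normal(0, resid_sd, size=len(X_mis))

    for _ in range(n_iter):
        # Step 1: impute only the nonmonotone missing values from all other variables
        for col in nonmonotone_cols:
            draw(col, [c for c in data.columns if c != col])
        # Step 2: impute the monotone pattern sequentially, fewest missing first
        for j, col in enumerate(monotone_cols):
            draw(col, [c for c in data.columns if c not in monotone_cols[j:]])
    return data
```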
13 SUMMARY
Multiple imputation is a flexible tool for handling incomplete datasets. MIs are often straightforward to create using computational procedures such as DA or using the special MI software that is now widely available. Moreover, the results from the individually analyzed completed datasets are easy to combine into a single MI inference. Although MI is Bayesianly motivated, many MI procedures have been shown to have excellent frequentist properties (15). In small samples, the impact of the prior distribution on conclusions can be assessed by creating MIs using several different prior specifications. Furthermore, although only MAR procedures have been considered here, missing data arising from an NMAR mechanism may be multiply imputed by jointly modeling the data and the missingness mechanism; in some cases, results are insensitive to reasonable missingness models, and the missing data can then be effectively treated as being MAR (6). Rubin (6), Schafer (7), and Little and Rubin (3) are excellent sources for more detail on the ideas presented here, the last two being less technical and more accessible than the first.

REFERENCES

1. D. B. Rubin, Multiple imputations in sample surveys: a phenomenological Bayesian approach to nonresponse (with discussion). ASA Proc. Section on Survey Research Methods, 1978: 20–34.
2. D. B. Rubin, Inference and missing data. Biometrika 1976; 63: 581–590.
3. R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data. New York: Wiley, 2002.
4. M. A. Tanner and W. H. Wong, The calculation of posterior distributions by data augmentation (with discussion). J. Amer. Stat. Assoc. 1987; 82: 528–550.
5. A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis. London: Chapman & Hall, 1995.
6. D. B. Rubin, Multiple Imputation for Nonresponse in Surveys. New York: Wiley, 1987.
7. J. L. Schafer, Analysis of Incomplete Multivariate Data. London: Chapman & Hall, 1997.
8. D. B. Rubin and N. Schenker, Multiple imputation in health-care databases: an overview and some applications. Stat. Med. 1991; 10: 585–598.
9. J. Barnard and D. B. Rubin, Small-sample degrees of freedom with multiple imputation. Biometrika 1999; 86: 948–955.
10. S. van Buuren, H. C. Boshuizen, and D. L. Knook, Multiple imputation of missing blood pressure covariates in survival analysis. Stat. Med. 1999; 18: 681–694.
11. T. E. Raghunathan, J. M. Lepkowski, J. van Hoewyk, and P. Solenberger, A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodol. 2001; 27: 85–95.
12. N. J. Horton and S. R. Lipsitz, Multiple imputation in practice: comparison of software packages for regression models with missing variables. Amer. Statistician 2001; 55: 244–254.
13. T. Colton, S. Piantadosi, and D. B. Rubin, Multiple imputation for second-look variables based on Intergel pivotal trial data. Report submitted to FDA, 2001.
14. D. B. Rubin, Nested multiple imputation of NMES via partially incompatible MCMC. Statistica Neerlandica 2003; 57: 3–18.
15. D. B. Rubin, Multiple imputation after 18+ years (with discussion). J. Amer. Stat. Assoc. 1996; 91: 473–520.
INCOMPETENT SUBJECTS AND PROXY CONSENT

LEONARD H. GLANTZ
Boston University School of Public Health, Boston, Massachusetts

Research with subjects who are incompetent presents more serious ethical challenges than any other category of research. By definition, incompetent research subjects cannot consent to their participation in research, and consent is an essential element of the protection of the rights of human subjects. The Nuremberg Code proclaims that the informed and voluntary consent of the human subject is "absolutely essential." However, requiring consent of every human subject would preclude research on certain classes of people.

1 DEFINITION OF INCOMPETENCE

Both legal and ethical principles presume that every adult is competent to make important decisions about themselves. This presumption is essential to maximize the autonomy of individuals and enable them to act on what they believe is in their own best interests. It is based on the belief that adults can weigh the benefits, risks, and alternatives to proposed actions and decisions. However, once it is determined that adults cannot weigh these considerations, they are deemed to be incompetent and are deprived of their decision-making capacity. Adults may be incompetent as a result of serious dementia, serious developmental disabilities, certain types of psychosis (not all psychotic people are incompetent), and the influence of mind-altering substances (once the influence of such substances disappears, the person will again be competent). Incompetence is determined by establishing the inability to understand the risks, benefits, and alternatives of particular activities. For example, a person may be incompetent to make a will because a mental illness makes them incapable of recognizing that they have children (a basic requirement to competently make a will), but this person is entirely competent to make a medical treatment decision because she understands the risks, benefits, and alternatives of the proposed medical procedure. Children are presumed to be incompetent, and important decisions, such as medical treatment decisions, are generally made on their behalf by their parents.

2 CONSIDERATIONS IN ENROLLING INCOMPETENT SUBJECTS

The use of incompetent subjects in research is inherently suspect because they cannot consent voluntarily to participation, which is generally regarded as an essential ethical and legal condition for research participation. To avoid this problem, incompetent subjects should not be included in a research project unless it is impossible to conduct the research with competent subjects. For example, there is no reason to conduct research on incompetent subjects for a new antibiotic that will be used in the general population. However, research to test the efficacy of a treatment for the alleviation of serious dementia or to develop a treatment for premature newborns could not use competent subjects. Even where good scientific reasons are provided for using incompetent subjects, serious ethical challenges remain. One can think of research as existing on a spectrum from pure research, in which the sole goal of the research is the creation of knowledge, to the last stages of clinical research, in which substantial data support the conclusion that the use of the test material holds out a realistic prospect of direct benefit to the subject. For example, substantial ethical differences exist between exposing an incompetent subject to the early toxicity testing of a drug, in which subjects are used to determine the negative effects of different doses, and administering that drug at the end of the testing process, when it is designed to determine whether subjects will benefit from the use of the drug. Although research and treatment are never the same thing, the closer the research resembles treatment, the more permissible the use of incompetent subjects becomes. Although it is important for researchers to consider their own ethical obligations not to exploit the decisional incapacity of subjects, important considerations also apply to surrogate decision makers who permit the use of their wards in research programs. Both parents and legal guardians are supposed to act in ways that protect their children or wards from unnecessary harm. Thus, some ethicists believe surrogate decision makers may never permit their wards to be subjects in research that presents risk of harm with no corresponding benefits. Other ethicists believe that surrogates may give such permission if the risks are trivial and the benefits to knowledge are potentially important. For these ethicists, the issue is often how to define a trivial risk. Noninvasive interactions that have no risk for research subjects, such as taking hair clippings or weighing children, are not problematic for either group of ethicists. But drawing blood, conducting skin biopsies, or taking spinal fluid, although all perfectly routine in the clinical setting, are more problematic in the research setting, in which the subject can receive no benefit from the physical intrusions. It is also important to recognize that investigators and surrogates are obligated to protect incompetent subjects from fear, distress, and discomfort. For example, whereas an MRI procedure may present no physical risks, many people find the noise and claustrophobia that are inherent in the process to be frightening. Taking a demented person from their home and subjecting them to an MRI can cause substantial fear and distress that may not be ethically justified.
3 SUBJECT ASSENT
Even in the instance where a person is incompetent to give consent, it is important to try to obtain the incompetent subject's assent to a procedure. Even where subjects cannot understand the reasons for a procedure, they may understand the procedure itself. For example, a person might not understand why a doctor wants to take blood from them, but they can understand that it involves putting a needle in their arm. Although no benefit is provided to the person for blood withdrawal, it is widely accepted that the person's permission for the intrusion should be obtained. The federal research regulations require this permission for research in children when the Institutional Review Board (IRB) determines that the children in the study can give assent and the research holds out no prospect of direct benefit to the child.

4 SUBJECT DISSENT

For informed consent to be valid, it must be given by a competent person. However, an incompetent person can express dissent. Using the MRI procedure discussed above as an example, if the demented person seems to object to being placed in the device, either verbally or through her actions, it is widely accepted that such dissent should be respected. As a matter of ethics and fairness, there is a significant difference between performing a procedure on someone who does not consent but is compliant and forcibly intruding on a subject who objects.

5 RESEARCH ADVANCED DIRECTIVES

Because of the ethical issues presented by research with incompetent subjects, it has been suggested that potentially incompetent research subjects give advance consent to being a research subject for specified classes of research. This consent is similar to the use of "living wills" or advance directives that people execute to determine their medical care should they become incompetent to make medical decisions in the future. Research advance directives might prove to be useful where the condition being studied is progressive and will likely lead to incompetence in the future. For example, a person with the genetic markers for Huntington Disease might be willing to enroll in longitudinal research on the condition and may execute an advance directive that permits researchers to conduct physically intrusive procedures, such as spinal taps, when she becomes incompetent. As a result, the research that is conducted on this person when he/she becomes
incompetent will be based on his/her consent, not the consent of a surrogate. Researchers have little experience with such documents, so one cannot know their effectiveness in addressing the difficult ethical questions presented by research on incompetent subjects. For example, if a person who has executed a research advance directive dissents to participation in a procedure when he/she is incompetent, should the previously executed document or the current dissent govern?

FURTHER READING

G. J. Annas and L. H. Glantz, Rules for research in nursing homes. New Engl. J. Med. 1986; 315: 1157–1158.
R. Dresser, Research involving persons with mental disabilities: a review of policy issues. In National Bioethics Advisory Commission, Research Involving Persons with Mental Disorders that May Affect Decisionmaking Capacity, Vol. II, 1999, pp. 5–28. Available: http://bioethics.georgetown.edu/nbac/pubs.html.
M. J. Field and R. E. Behrman, eds., Ethical Conduct of Clinical Research Involving Children. Washington, DC: National Academy Press, 2004.
L. H. Glantz, Nontherapeutic research with children: Grimes v. Kennedy Krieger Institute. Am. J. Public Health 2002; 92: 4–7.
M. A. Grodin and L. H. Glantz, eds., Children as Research Subjects: Science, Ethics and Law. New York: Oxford University Press, 1994.
National Bioethics Advisory Commission, Research Involving Persons with Mental Disorders that May Affect Decisionmaking Capacity, Vol. 1, 1999.
The Department of Health and Human Services, Protection of Human Subjects, Title 45 Code of Federal Regulations, Part 46, Subpart D, 1991.
CROSS-REFERENCES

Informed Consent Process, Forms, and Assent
INDEPENDENT ETHICS COMMITTEE (IEC)

An Independent Ethics Committee (IEC) is an independent body (a review board or a committee; institutional, regional, national, or supranational) that is composed of medical/scientific professionals and nonmedical/nonscientific members. Its responsibility is to ensure the protection of the rights, safety, and well-being of human subjects involved in a trial and to provide public assurance of that protection by, among other things, reviewing and approving/providing a favorable opinion on the trial protocol, the suitability of the investigator(s), the facilities, and the methods and material to be used in obtaining and documenting informed consent of the trial subjects. The legal status, composition, function, operations, and regulatory requirements that pertain to Independent Ethics Committees may differ among countries, but they should allow the Independent Ethics Committee to act in agreement with Good Clinical Practice (GCP) as described in the ICH E6 guideline.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
INFERENCE, DESIGN-BASED VS. MODEL-BASED

Design-based inference and model-based inference are alternative conceptual frameworks for addressing statistical questions from many types of investigations. These include:

1. Experimental studies of randomly allocated subjects
2. Historical (observational) and followup∗ studies of all subjects in a fortuitous, judgmental, or natural population
3. Sample surveys∗ of randomly selected subjects

For these situations and others, there is interest in the extent of generality to which conclusions are expressed and the rationale by which they are justified. Some of the underlying inference issues can be clarified by directing attention at the sampling processes for data collection and the assumptions necessary for the data plausibly to represent a defined target population. A statistical analysis whose only assumptions are random selection of observational units or random allocation of units to experimental conditions may be said to generate design-based inferences; i.e., design-based inferences are equivalent to randomization∗-based inferences as discussed by Kempthorne [20–22], Kish [24, Chap. 14], Lehmann [27, pp. 55–57], and others. Also, such inferences are often said to have internal validity∗ (see Campbell and Stanley [3]) when the design is adequate to eliminate alternative explanations for the observed effects other than the one of interest. In this sense, internal validity requires only that the sampled population∗ and the target population∗ be the same. Alternatively, if assumptions external to the study design are required to extend inferences to the target population, then statistical analyses based on postulated probability distributional forms (e.g., binomial, normal, Poisson, Weibull, etc.) or other stochastic processes yield model-based inferences. These can be viewed as encompassing Bayesian inferences∗ and superpopulation inferences∗ to the extent that the validity of the claimed generality is model dependent∗ via its sensitivity to model misspecifications∗. Also, it is possible to regard Bayesian inferences and superpopulation inferences as providing some unity to the role of design-based and model-based considerations. Thus a focus of distinction here between design-based and model-based inference is the population to which results are generalized rather than the nature of statistical methods. Models can be useful conceptually in either context; also they can shed light on the robustness∗ of inferences to their underlying assumptions. The related issue of external validity includes substantive justification for the area of application and statistical evaluation of the plausibility of model assumptions. For other pertinent discussion, see Deming [10, Chap. 7], Fisher [12], Godambe and Sprott [15], Johnson and Smith [18], Kempthorne and Folks [23, Chap. 17], Namboodiri [29], and INFERENCE, STATISTICAL. The distinctions between design-based inference and model-based inference can be expressed in clearest terms for comparative experimental studies (e.g., multicenter clinical trials∗). Typically, these involve a set of blocks (or sites) which are selected on a judgmental basis. Similarly, the experimental units may be included according to convenience or availability. Thus, these subjects constitute a fixed set of finite local study populations. When they are randomly assigned to two or more treatment groups, corresponding samples are obtained for the potential responses of all subjects under study for each of the respective treatments. By virtue of the research design, randomization model methods (e.g., Kruskal-Wallis tests∗) in CHI-SQUARE TEST —I can be used to obtain design-based inferences concerning treatment comparisons without any external assumptions. Illustrative examples are given in CHI-SQUARE TEST —I and LOG-RANK SCORES. A limitation of design-based inferences for experimental studies is that formal conclusions are restricted to the finite population of subjects that actually received treatment.
For agricultural crop studies and laboratory animal studies undertaken at local facilities, such issues merit recognition in a strict sense. However, for medical clinical trials∗ undertaken by multiple investigators at geographically diverse locations, it often may be plausible to view the randomized patients as conceptually representative of those with similar characteristics in some large target population of potential patients. In this regard, if sites and subjects had been selected at random from larger eligible sets, then models with random effects provide one possible way of addressing both internal and external validity considerations. However, such an approach may be questionable if investigators and/or patients were not chosen by a probability sampling mechanism. In this more common situation, one important consideration for confirming external validity is that sample coverage include all relevant subpopulations; another is that treatment differences be homogeneous across subpopulations. More formally, probability statements are usually obtained via assumptions that the data are equivalent to a stratified simple random sample from the partition of this population into homogeneous groups according to an appropriate set of explanatory variables. This stratification is necessary because the patients included in a study may overrepresent certain types and underrepresent others, even though those of each of the respective types might be representative of the corresponding target subpopulations. For categorical (or discrete) response measures, the conceptual sampling process described here implies the product multinomial distribution. As a result, model-based inferences concerning treatment comparisons and their interactions with the explanatory variable stratification can be obtained by using maximum likelihood or related methods as discussed in CHI-SQUARE TEST —I and LOGLINEAR MODELS IN CONTINGENCY TABLES. Illustrative examples are given in LOGISTIC REGRESSION. In a similar spirit, least-squares methods can be used for model-based inferences when continuous response variables have approximately normal distributions with common variance within the respective strata; and analogous procedures are applicable to other distributional structures (e.g.,
see Cox [9], McCullagh [28], and Nelder and Wedderburn [30]). The principal advantages of model-based inferences for such situations are their more general scope and the comprehensive information they provide concerning relationships of response measures to treatment and stratification variables. Contrarily, their principal limitation is that subjects in a study may not represent any meaningful population beyond themselves. See Fisher [13], Kempthorne [22], Neyman et al. [31], and Simon [37] for further discussion. For historical (observational) studies, model-based inferences are usually emphasized because the target population is more extensive than the fortuitous, judgmental, or naturally defined group of subjects included. Also, their designs do not involve either random allocation or random selection, as illustrated by the following examples:

1. A study of driver injury relative to vehicle size, vehicle age, and vehicle model year for all police-reported automobile accidents in North Carolina during 1966 or 1968–1972 (see Koch et al. [26])
2. A nonrandomized prospective study to compare the experience of patients receiving a new treatment with that of a historical control population (see Koch et al. [26])
3. A nonrandomized study to compare nine treatments for mastitis in dairy cows relative to their pretreatment status (see CHI-SQUARED TESTS —II)
4. Market research studies involving quota sampling∗ as opposed to random selection (see Kalton [19])

The assumptions by which the subjects are considered representative of the target population and the methods used for analysis are similar to those previously described for experimental studies. Otherwise, design-based inferences are feasible for historical studies through tests of randomization as a hypothesis in its own right, but their use should be undertaken cautiously; specific illustrations are given in CHI-SQUARE TEST —I and Koch et al. [26]. More extensive discussion of various aspects of inference for observational studies appears in Anderson
et al. [1], Breslow and Day [2], Cochran [8], Fairley and Mosteller [11], and Kleinbaum et al. [25]. Design-based inferences are often emphasized for sample surveys because the target population is usually the same as that from which subjects have been randomly selected. They are obtained by the analysis of estimates for population averages or ratios and their estimated covariance matrix which are constructed by means of finite population sampling methodology. An illustrative example is given in CHI-SQUARE TEST —I. For sample surveys, the probabilistic interpretation of design-based inferences such as confidence intervals is in reference to repeated selection from the finite population via the given design. In contrast, model-based inferences are obtained from a framework for which the target population is a superpopulation with assumptions characterizing the actual finite population as one realization; and so their probabilistic interpretation is in reference to repetitions of the nature of this postulated sampling process. The latter approach can be useful for situations where the subjects in a sample survey are not necessarily from the target population of interest. For example, Clarke et al. [6] discuss the evaluation of several pretrial release programs for a stratified random sample of 861 defendants in a population of 2,578 corresponding to January-March 1973 in Charlotte, North Carolina. Since the entire population here is a historical sample, any sample of it is also necessarily a historical sample. Thus issues of model-based inference as described for historical studies would be applicable. Another type of example involves prediction to a date later than that at which the survey was undertaken; e.g., Cassel et al. [5] studied prediction of the future use of a bridge to be constructed in terms of number of vehicles. Otherwise, it can be noted that statistical methods for design-based inferences often are motivated by a linear model; e.g., a rationale for ratio estimates involves regression through the origin. A more general formulation for which a linear model underlies the estimator and its estimated variance is given in Särndal [35,36]. Additional discussion concerning aspects of design-based or model-based approaches to
sample survey data or their combination is given in Cassel et al. [4], Cochran [7], Fuller [14], Hansen et al. [16], Hartley and Sielken [17], Royall [32], Royall and Cumberland [33], Särndal [34], Smith [38], and LABELS. The distinction between design-based inference and model-based inference may not be as clear cut as the previous discussion might have suggested. For example, some type of assumption is usually necessary in order to deal with missing data; and stratification undertaken purely for convenient study management purposes (rather than statistical efficiency) is sometimes ignored. Also, a model-based approach may be advantageous for estimation for subgroups with small sample sizes (i.e., small domain estimation; see Kalton [19]). For these and other related situations, the issue of concern is the robustness∗ of inferences to assumptions. In summary, design-based inferences involve substantially weaker assumptions than do model-based inferences. For this reason, they can provide an appropriate framework for policy-oriented purposes in an adversarial setting (e.g., legal evidence). A limitation of design-based inferences is that their scope might not be general enough to encompass questions of public or scientific interest for reasons of economy or feasibility. Of course, this should be recognized as inherent to the design itself (or the quality of its implementation) rather than the rationale for inference. In such cases, model-based inferences may provide relevant information given that the necessary assumptions can be justified. It follows that design-based inference and model-based inference need not be seen as competing conceptual frameworks; either they can be interpreted as directed at different target populations and thereby at different statistical questions (e.g., experimental studies), or their synthesis is important to dealing effectively with the target population of interest (e.g., sample surveys).
Acknowledgments
The authors would like to thank Wayne Fuller, Peter Imrey, Graham Kalton, Oscar Kempthorne, Jim Lepkowski, Carl Särndal, and Richard Simon for helpful comments relative to the preparation of this entry. It should be noted that they may not share the views expressed here. This research was partially supported by the U.S. Bureau of the Census through Joint Statistical Agreement JSA80-19; but this does not imply any endorsement.
REFERENCES

1. Anderson, S., Auquier, A., Hauck, W. W., Oakes, D., Vandaele, W., and Weisberg, H. I. (1980). Statistical Methods for Comparative Studies. Wiley, New York.
2. Breslow, N. E. and Day, N. E. (1980). Statistical Methods in Cancer Research, 1: The Analysis of Case Control Studies. International Agency for Research on Cancer, Lyon.
3. Campbell, D. T. and Stanley, J. C. (1963). Handbook on Research on Teaching. Rand McNally, Chicago, pp. 171–246. (Experimental and quasi-experimental designs for research on teaching.)
4. Cassel, C. M., Särndal, C. E., and Wretman, J. H. (1977). Foundations of Inference in Survey Sampling. Wiley, New York.
5. Cassel, C. M., Särndal, C. E., and Wretman, J. H. (1979). Scand. J. Statist., 6, 97–106. (Prediction theory for finite populations when model-based and design-based principles are combined.)
6. Clarke, S. H., Freeman, J. L., and Koch, G. G. (1976). J. Legal Stud., 5(2), 341–385. (Bail risk: a multivariate analysis.)
7. Cochran, W. G. (1946). Ann. Math. Statist., 17, 164–177. (Relative accuracy of systematic and stratified random samples for a certain class of populations.)
8. Cochran, W. G. (1972). Statistical Papers in Honor of George W. Snedecor, T. A. Bancroft, ed. Iowa State University Press, Ames, Iowa, pp. 77–90.
9. Cox, D. R. (1972). J. R. Statist. Soc. B, 34, 187–220. [Regression models and life tables (with discussion).]
10. Deming, W. E. (1950). Some Theory of Sampling. Wiley, New York.
11. Fairley, W. B. and Mosteller, F. (1977). Statistics and Public Policy. Addison-Wesley, Reading, Mass.
12. Fisher, R. A. (1925). Proc. Camb. Philos. Soc., 22, 700–725. (Theory of statistical estimation.)
13. Fisher, R. A. (1935). The Design of Experiments. Oliver & Boyd, Edinburgh (rev. ed., 1960).
14. Fuller, W. A. (1975). Sankhya C, 37, 117–132. (Regression analysis for sample survey.)
15. Godambe, V. P. and Sprott, D. A. (1971). Foundations of Statistical Inference. Holt, Rinehart and Winston, Toronto.
16. Hansen, M. H., Madow, W. G., and Tepping, B. J. (1978). Proc. Survey Res. Meth. Sec., Amer. Statist. Ass., pp. 82–107. [On inference and estimation from sample surveys (with discussion).]
17. Hartley, H. O. and Sielken, R. L. (1975). Biometrics, 31, 411–422. (A ''super-population viewpoint'' for finite population sampling.)
18. Johnson, N. L. and Smith, H., eds. (1969). New Developments in Survey Sampling. Wiley, New York.
19. Kalton, G. (1983). Bull. Int. Statist. Inst. (Models in the practice of survey sampling.)
20. Kempthorne, O. (1952). The Design and Analysis of Experiments. Wiley, New York.
21. Kempthorne, O. (1955). J. Amer. Statist. Ass., 50, 946–967. (The randomization theory of experimental inference.)
22. Kempthorne, O. (1979). Sankhya B, 40, 115–145. (Sampling inference, experimental inference, and observation inference.)
23. Kempthorne, O. and Folks, L. (1971). Probability, Statistics, and Data Analysis. Iowa State University Press, Ames, Iowa.
24. Kish, L. (1965). Survey Sampling. Wiley, New York.
25. Kleinbaum, D. G., Kupper, L. L., and Morgenstern, H. (1982). Epidemiologic Research: Principles and Quantitative Methods. Lifetime Learning Publication, Belmont, Calif.
26. Koch, G. G., Gillings, D. B., and Stokes, M. E. (1980). Annu. Rev. Public Health, 1, 163–225. (Biostatistical implications of design, sampling, and measurement to health science data analysis.)
27. Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San Francisco.
28. McCullagh, P. (1980). J. R. Statist. Soc. B, 42, 109–142. (Regression models for ordinal data.)
29. Namboodiri, N. K. (1978). Survey Sampling and Measurement. Academic Press, New York.
30. Nelder, J. A. and Wedderburn, R. W. M. (1972). J. R. Statist. Soc. A, 135, 370–384. (Generalized linear models.)
31. Neyman, J., Iwaskiewicz, K., and Kolodziejczyk, S. (1935). J. R. Statist. Soc. (Suppl. 1), 2, 107–154. (Statistical problems in agricultural experimentation.)
32. Royall, R. M. (1976). Amer. J. Epidemiol., 104, 463–473. (Current advances in sampling theory: implications for human observational studies.)
33. Royall, R. M. and Cumberland, W. G. (1981). J. Amer. Statist. Ass., 76, 66–77. (An empirical study of the ratio estimator and estimators of its variance.)
34. Särndal, C. E. (1978). Scand. J. Statist., 5, 27–52. (Design-based and model-based inference in survey sampling.)
35. Särndal, C. E. (1980). Biometrika, 67, 639–650. (On π-inverse weighting vs. best linear unbiased weighting in probability sampling.)
36. Särndal, C. E. (1982). J. Statist. Plann. Infer., 7, 155–170.
37. Simon, R. (1979). Biometrics, 35, 503–512. (Restricted randomization designs in clinical trials.)
38. Smith, T. M. F. (1976). J. R. Statist. Soc. A, 139, 183–204. [The foundations of survey sampling: a review (with discussion).]
GARY G. KOCH DENNIS B. GILLINGS
INFORMED CONSENT PROCESS, FORMS, AND ASSENT

LEONARD H. GLANTZ
Boston University School of Public Health, Boston, Massachusetts

Obtaining the informed consent of human subjects is an integral part of the overall obligations and duties inherent in the researcher role. Researchers are obliged to protect both the rights and the welfare of research subjects. The welfare of research subjects is protected by using the safest and least invasive methods possible to obtain the necessary information from subjects. For example, if the research involves obtaining blood from a subject, then the researchers should use the fewest needle sticks possible. Protecting the rights of research subjects requires empowering potential subjects to decide freely and intelligently whether they wish to become research subjects and to decide whether they wish to continue to be subjects once they have been enrolled. In this regard, the doctrine of informed consent was created to protect and enhance the potential subjects' autonomy to decide whether to participate in a researcher's project. Obtaining the truly informed consent of the potential subjects may well conflict with a researcher's goals. Obviously, the researcher hopes to recruit a sufficient number of subjects in a timely manner, but fully informing potential subjects of the risks and discomforts, particularly if no direct benefit accrues to the subjects, may result in a potential subject's refusal to participate. Indeed, if a researcher obtains a 100% participation rate, it may be the result of subjects not fully understanding the risks and discomforts that are an inherent part of the research. Because of this conflict, the potential subject's informed consent is required to be obtained by law and regulation in addition to professional ethics. However, in obtaining the subject's informed consent, it is useful for the ethical researcher to be aware of this potential conflict.

1 THE PURPOSE FOR INFORMED CONSENT

The term informed consent would seem to be redundant. One may well ask how one can give consent to an undertaking about which they are not informed. The term is a response to the reality that, before the introduction of the doctrine of informed consent, patients often gave their consent to surgeries or other medical interventions without being provided with information about the benefits, risks, and alternatives of an intervention a physician proposed to perform. In one of the earliest decisions that mandated physicians to obtain the informed consent of patients before performing surgery, the court explained that the purpose of the doctrine is "to enable the patient to chart his course knowledgeably . . ." (1), and it found that, to do so, the patient needed to have reasonable familiarity with the therapeutic alternatives to the proposed procedure and the inherent risks of the procedure and the alternatives. The court also explained that because the physician is an expert, he has knowledge of the risks inherent in the proposed procedure and the probability of success, but that the weighing of the risks is not an expert skill; it is a "nonmedical judgment reserved to the patient alone." Similarly, the decision whether to become a research subject lies entirely in the hands of the potential subject, and without adequate information, which the potential subject can obtain only from the researcher, the subject's consent would be uninformed. It is this reality that forms the foundation for both the ethical and legal obligation of the researcher to obtain a potential subject's informed consent.
2 THE NUREMBERG CODE AND INFORMED CONSENT

In the research context, the clearest statement of the obligation to obtain consent is found in the Nuremberg Code. The Nuremberg Code was created as part of the judgment in the post-World War II criminal trials of Nazi physicians who conducted atrocious
experiments on concentration camp inmates. The first provision of the code states (2):

The voluntary consent of the human subject is absolutely essential. This means that the person involved should have legal capacity to give consent; should be so situated as to be able to exercise free power of choice, without the intervention of any element of force, fraud, deceit, duress, over-reaching, or other ulterior form of constraint or coercion; and should have sufficient knowledge and comprehension of the elements of the subject matter involved as to enable him to make an understanding and enlightened decision. This latter element requires that before the acceptance of an affirmative decision by the experimental subject there should be made known to him the nature, duration, and purpose of the experiment; the method and means by which it is to be conducted; all inconveniences and hazards reasonably to be expected; and the effects upon his health or person which may possibly come from his participation in the experiment.
This section of the code sets forth the goals and purposes of informed consent in the research context. It requires that a subject's consent be both informed and voluntary, requires that both inconveniences and hazards that reasonably are to be expected be disclosed, and clearly states that the goal of the process is to enable the potential subjects to make "an understanding and enlightened decision." The Nuremberg Code says nothing about consent forms and does not require them. Rather, it requires that the informed consent process be designed to enlighten potential subjects so they are in a position to decide voluntarily to enroll as a subject or to reject the investigator's offer. It is the process of informed consent, not a consent form, that is essential to a well-informed decision. Indeed, the fact that the informed consent process and obtaining written informed consent are two separate matters is stated clearly in the federal research regulations. To approve research, the regulations require that the institutional review board (IRB) determine that "informed consent will be sought from each prospective subject or the subject's legally authorized representative" and separately require the IRB to determine
that "informed consent will be appropriately documented . . ." (3).

3 FEDERAL REGULATIONS AND INFORMED CONSENT

Similar to the Nuremberg Code, the federal regulations require that consent be obtained in circumstances that provide the potential subject "sufficient opportunity to consider whether or not to participate and that minimize the possibility of coercion or undue influence." Furthermore, the regulations require that the language used by the investigator when attempting to obtain a potential subject's informed consent must be "understandable to the subject." The regulations, similar to the Nuremberg Code, require that the "basic elements of informed consent" include the following:

1. a statement that the study involves research, an explanation of the purposes of the research and the expected duration of the subject's participation, a description of the procedures to be followed, and identification of any procedures that are experimental;
2. a description of any reasonably foreseeable risks or discomforts to the subject;
3. a description of any benefits to the subject or to others which reasonably may be expected from the research;
4. a disclosure of appropriate alternative procedures or courses of treatment, if any, that might be advantageous to the subject;
5. a statement describing the extent, if any, to which confidentiality of records identifying the subject will be maintained;
6. for research involving more than minimal risk, an explanation as to whether any compensation and an explanation as to whether any medical treatments are available if injury occurs, and, if so, what they consist of or where more information may be obtained; and
...
8. a statement that participation is voluntary, refusal to participate will involve
no penalty or loss of benefits to which the subject is otherwise entitled, and the subject may discontinue participation at any time without penalty or loss of benefits to which the subject is otherwise entitled.

Additional elements of informed consent are required in certain circumstances when appropriate. The federal rules also authorize the IRB to waive some or all elements of informed consent in limited circumstances when the investigator makes such a request. Such a waiver is authorized when "(1) the research involves no more than minimal risk to the subjects; (2) the waiver or alteration will not adversely affect the rights and welfare of the subjects; (3) the research could not practicably be carried out without the waiver or alteration; and (4) whenever appropriate, the subjects will be provided with additional pertinent information after participation."

4 DOCUMENTATION OF INFORMED CONSENT

A separate section of the regulations requires that informed consent be documented, with certain limited exceptions, and requires that the consent form, which must be signed by the subject or the subject's legally authorized representative, contain the elements of the informed consent process. It is often not recognized or understood that obtaining the signature of a potential subject on an informed consent form does not mean that the person actually has been adequately informed and actually has given a knowledgeable and voluntary consent. The form is meant solely to memorialize the consent process, not to substitute for that process. Unfortunately, over time, both investigators and institutional review boards have focused excessively on the forms and have not emphasized the importance of the process itself. There are several reasons for this. The first reason is that the only "contact" IRBs have with subjects is through the consent forms, and therefore they attempt to fulfill their obligation to protect the rights of subjects by focusing on the form. The second reason is
that a signed consent form is what investigators use to demonstrate that they have obtained the informed consent of subjects. Without somebody sitting in the room with the investigator and a potential subject, no other way is available for the investigator to demonstrate that she has met her obligation to obtain informed consent. Finally, when the federal regulatory authorities, the Office for Human Research Protections and the Food and Drug Administration, audit institutions and their IRBs for compliance with federal regulations, they focus almost entirely on the paperwork that is involved in approving research, including informed consent forms. Because of the concern that federal auditors might determine after the fact that a problem did occur with the informed consent form, the forms have become longer and more complicated so that they seem to be more complete. Furthermore, institutions have come to view these forms as ways of fending off potential litigation by subjects, even in the absence of any evidence that this fear is realistic. One major difficulty in creating a useful form is presenting complicated information in easy-to-understand language. Indeed, other than consent forms, investigators never have to write in language a layperson can understand, and therefore they have little or no experience in doing so. It is not surprising that investigators tend to use the jargon and abbreviations of their professions that are so readily understood by them, their assistants, and their colleagues in the study. The length and complexity of the forms and the use of jargon raise questions as to the capacity of the investigator (and IRBs) to adequately explain the nature of the research to potential subjects in the context of the informed consent process.

5 IMPROVING THE PROCESS OF INFORMED CONSENT

One challenge of the informed consent process is that it is neither medical nor scientific in nature. Rather, the informed consent process should be an educational undertaking; it requires careful thinking about what the "student" (subject) needs to know and how the "teacher" (investigator) can impart that
knowledge in the best way. Teachers use not just reading material but also visual aids and assignments. For example, for certain types of research, the informed consent process can involve watching a video or even assigning potential subjects materials that they will discuss with the investigator when they come back at a later time. It has also been recommended that investigators test the knowledge of potential subjects through the use of oral or written quizzes after the investigator informs the potential subject of the risks, benefits, and alternatives. If it is determined that the potential research subject is not aware of certain material matters, then the investigator can go back and explain those matters further until the potential subject is properly informed. If the informed consent form is going to be used as part of the teaching materials that will be provided to the potential subject, then it is essential that it be understandable to a layperson. Investigators will need to use common terms that may seem to them to be less precise but in fact will be more informative from the perspective of a layperson. For example, "catheters" are actually "thin tubes," "extremities" are actually "arms and legs," and "chemotherapy" is actually "medicines to try to treat your cancer." In some studies, subjects may be given several drugs, and one often sees an exhaustive list of potential side effects for each drug. But if the subject is receiving three drugs and all of them may cause nausea, then there is no need to say this three separate times; which drug causes the nausea in a particular subject is not of interest to the subject, it is the potential for nausea that matters. In sum, better consent processes will come about only if investigators seriously consider how to educate potential subjects fully about the risks, discomforts, potential benefits, and alternatives involved in becoming a research subject. Instead of thinking about this as a technical matter, investigators might find it useful to think about how they would go about informing a close family member who is considering entering a research project. If investigators could accomplish this, then they would be in compliance with legal and ethical requirements.
6 ASSENT

Assent is a term used in situations where one cannot obtain true informed consent from the potential research subject because of immaturity or incompetence. In such circumstances, the researcher needs to obtain the acquiescence of the subject to participate in research. The goal is to try to ensure that research is not conducted forcibly on objecting subjects. However, assent is more than the lack of objection. The federal regulations define assent as the "affirmative agreement to participate in research. Mere failure to object should not, absent affirmative agreement, be construed as assent." In the federal regulations, the term "assent" is used in the context of children, who cannot give legally binding consent to become research subjects. Rather, the regulations state that if parents give "permission" and children "assent" to become a research subject, then the research may be conducted. Although no specific regulations have been developed on this topic, this process should also apply to subjects who are incompetent to provide consent. The incompetent person's legally authorized representative would provide permission, and the incompetent person would give assent. When participation in research is likely to provide a direct benefit to an incompetent person, their lack of assent is not considered a barrier to their enrollment in the research.

REFERENCES

1. Cobbs v. Grant, 8 Cal. 3d 229 (California Supreme Court, 1972).
2. Trials of War Criminals before the Nuremberg Military Tribunals under Control Council Law No. 10, vol. 2. Nuremberg, October 1946–April 1949. Washington, D.C.: U.S. Government Printing Office (n.d.), pp. 181–182.
3. The Department of Health and Human Services, Protection of Human Subjects, Title 45 Code of Federal Regulations, Part 46 (1991).
FURTHER READING

G. J. Annas and M. A. Grodin (eds.), The Nazi Doctors and the Nuremberg Code. New York: Oxford University Press, 1992.
J. W. Berg, P. S. Appelbaum, L. S. Parker, and C. W. Lidz, Informed Consent: Legal Theory and Clinical Practice. New York: Oxford University Press, 2001.
Office for Human Research Protections, Policy Guidances. Available: http://www.hhs.gov/ohrp/policy/index.html.
The informed consent process. In D. D. Federman, K. E. Hanna, and L. L. Rodriguez (eds.), Responsible Research: A Systems Approach to Protecting Research Participants. Washington, D.C.: National Academy Press, 2002, pp. 119–127.
INSTITUTION

CURTIS L. MEINERT
The Johns Hopkins University Bloomberg School of Public Health, Center for Clinical Trials, Baltimore, Maryland

institution: An established organization or corporation (as a college or university) especially of a public character (1).
institution: Any public or private entity or agency (including federal, state, and other agencies) (2).
institution: Any public or private entity or agency or medical or dental facility where clinical trials are conducted (3).
institution: A place where a study is undertaken; usually a hospital or similar establishment (4).
institution: An established organization, corporation, or agency, especially one that has a public character (5).
institute: An organization for promotion of a cause; an educational institution and especially one devoted to technical fields (1).
agency: An administrative unit through which power or influence is exerted to achieve some stated end or to perform some specified function (5).
principal investigator (PI): [research] 1. The person having responsibility for conduct of the research proposed in a grant application submitted to the National Institutes of Health; such a person in any funding application submitted to the NIH, whether for grant or contract funding; such a person named on any funding proposal, regardless of funding source. 2. The person in charge of a research project; the lead scientist on a research project. 3. The head of a center in a multicenter study. 4. The chair of a multicenter study. 5. The head of a clinical center in a multicenter trial. Usage note: Avoid in the sense of defn 2 in settings having multiple "principal investigators"; use center director and chair of study to designate positions of leadership in multicenter settings.
sponsor: 1. A person or agency that is responsible for funding a designated function or activity; sponsoring agency. 2. A person or agency that plans and carries out a specified project or activity. 3. The person or agency named in an Investigational New Drug Application or New Drug Application; usually a drug company or person at such a company, but not always (as with an INDA submitted by a representative of a research group proposing to carry out a phase III or phase IV drug trial not sponsored by a drug company). 4. A firm or business establishment marketing a product or service (5).
study center: [trials] 1. Data collection site; study clinic. 2. Data collection or data generation site. 3. The center from which activities are directed; coordinating center; project office. 4. An operational unit in the structure of a study, especially a multicenter structure, separate and distinct from other such units in the structure, responsible for performing specified functions in one or more stages of the study (e.g., a clinical center or resource center) (5).
In the parlance of clinical trials, most commonly used in reference to the corporate entity housing one or more study centers involved in carrying out a trial (e.g., a university housing such a study center), it is typically the fiscal officer of the institution who has legal authority for receiving and use of funds awarded to the institution for a study, consistent with the needs and dictates of the scientist responsible for the study center within the institution— usually referred to as principal investigator (see usage note for principal investigator). The most common usage in adjective form is in institutional review board (IRB). Most academic institutions engaged in research on human beings have one or more standing IRBs comprised of people from within and outside the institution. Investigators in the institution are obliged to submit all proposal
Wiley Encyclopedia of Clinical Trials, Copyright © 2007 John Wiley & Sons, Inc.
1
2
INSTITUTION
involving human beings to their IRB of record and may not initiate such research until or unless approved by that IRB and must thereafter maintain approval by submission of renewal requests not less frequently than annually, as dictated by the IRB. Typically, the term refers to boards within an investigators institution, but it is also used to refer to boards located outside the institution, including commercial IRBs. Virtually all multicenter studies are multiinstitutional in that they involve two institutions, such as recruiting and treating sites and a data center or coordinating center. REFERENCES 1. Merriam-Webster’s Collegiate Dictionary, 10th ed. Springfield, MA: Merriam-Webster, Inc., 2001. 2. Office for Human Research Protection, Code of Federal Regulations, Title 45: Public Welfare, Part 46: Protection of Human Subjects. Bethesda, MD: Department of Health and Human Services, National Institutes of Health, (revised) June 18, 1991. 3. International Conference on Harmonisation, E6 Good Clinical Practice. Washington, DC: U.S. Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research (CDER), and the Center for Biologics Evaluation and Research (CBER), April 1996. 4. S. Day, Dictionary for Clinical Trials. Chichester: Wiley, 1999. 5. C. L. Meinert, Clinical Trials Dictionary: Terminology and Usage Recommendations. Baltimore, MD: The Johns Hopkins Center for Clinical Trials, 1996.
INSTITUTIONAL AND INDEPENDENT REVIEW BOARDS
SUSAN S. FISH
Boston University Schools of Public Health and Medicine, Boston, Massachusetts

1 PURPOSE OF IRBs

The purpose of an IRB is to protect the rights and welfare of human research subjects (45 CFR 46, 21 CFR 56). An IRB provides external and independent review of research that involves human subjects, with a focus on the ethical and regulatory issues. The IRB reviews protocols prior to their implementation and during the course of the research. In the United States, independent review of research was first required in the mid-1960s for research that was supported by the U.S. Public Health Service. However, the regulations that now guide clinical research review were initially promulgated in 1981 and have been amended periodically since then. Many institutions in the United States required independent review of research prior to requirements from various federal agencies, and some IRBs have been in existence since 1966. Currently, IRB review is required for all studies that are either conducted or funded by various federal agencies as well as for all studies performed to support a marketing application to the Food and Drug Administration (FDA) for a drug, device, or biologic agent. The regulations initially promulgated by the Department of Health and Human Services (45 CFR 46) were eventually adopted by 17 federal agencies, and have thus become known as the "Common Rule." Both the Common Rule and the FDA regulations (21 CFR 50 and 56) regulate IRBs for the research for which their agencies are responsible. Although these two sets of regulations are different and occasionally in conflict with each other, the structure and function of IRBs described in both sets of regulations are, for the most part, identical.

1.1 Federalwide Assurance (FWA)

Although federal regulations guide IRB structure and function, these regulations are intended as a minimum standard rather than a maximum. Any IRB can have policies that exceed the minimum standard set by regulation. In fact, many IRBs have policies that apply to all human subjects research at their institutions, whether or not the research is federally funded or FDA regulated. An FWA is an agreement between an institution and the U.S. Department of Health and Human Services' Office for Human Research Protections that specifies how the institution complies with applicable law and regulations. In this FWA, an institution may state that it applies federal regulations only to federally funded research, or it may state that it applies the regulations and its policies to all research on human subjects at its institution.

2 STRUCTURE OF AN IRB

2.1 Membership of an IRB

The structure of IRBs is guided by federal regulations, which describe the minimum requirements of membership. IRBs, by regulation, consist of at least five members. However, many IRBs have many more than five members; boards of 15–25 members are not uncommon. These members must have varying backgrounds so that they have the expertise to review the types of research submitted to that IRB. The membership must be diverse in terms of race, gender, cultural background, and profession. In addition, the membership must be familiar with and sensitive to community attitudes, to the communities in which the research takes place and from which research subjects are recruited, and to the research community that submits the research for review. These sensitivities are intended to "promote respect for (the IRB's) advice and counsel in safeguarding the rights and welfare of human subjects." The membership should also include members with the knowledge and understanding of institutional commitments, applicable state and federal laws and regulations, and professional standards in the disciplines of research
that is overseen by that IRB. The regulations go on to explain that if an IRB regularly reviews research that involves a certain category of subjects who might be vulnerable to coercion or undue influence concerning research participation, then the IRB membership should include at least one member who has knowledge about and experience in working with this group of people. An example of such a vulnerable group is children and adolescents. If an IRB regularly reviews biomedical studies in the pediatric population, then a pediatrician or pediatric nurse might be an IRB member who meets this requirement and has expertise with this population of subjects. If an IRB reviews educational studies that take place in the school system, then a teacher might meet this requirement. On any IRB, there must be at least one member whose "primary concerns" are in nonscientific areas and at least one member who focuses on scientific areas. In reality, most IRBs have many scientific members and a few nonscientific members. The nonscientific member(s) can provide a balancing perspective for the enthusiasm of the scientist members and can provide the perspective of the research subjects. Most IRB members are drawn from the institution whose research the IRB oversees. The regulations also require that at least one member be unaffiliated with the institution; this member serves as a check and balance on potential institutional conflict of interest and can bring to the table some awareness of broader community standards. When an IRB is faced with reviewing a study that requires expertise that is not possessed by its membership, the IRB may and should obtain consultation with one or more individuals who can supply that expertise. The consultation may be in a scientific area (e.g., xenotransplantation), a methodologic area (e.g., qualitative research methodology), or a cultural setting in which the research may take place (e.g., orphans in rural Zambia). Although consultants may not vote as members of the IRB, their input can be invaluable to the understanding of the research by the IRB members who do vote. Consultation should be sought liberally by the IRB.
Membership on an IRB may be a volunteer activity or a paid activity. The duration of membership varies among IRBs. The IRB Chair facilitates meetings, conducts business, and has the authority and responsibility to act for the IRB in certain situations. What is consistent across IRBs is that IRB members are intelligent and dedicated people who take their role in protecting human research subjects quite seriously.
2.2 Conflict of Interest in the IRB The diversity of the IRB membership guards against the influence of potential conflicts of interest. Although IRB decisions are made by a majority vote of members present, most decisions are the result of near-consensus. It is unlikely that one person with an undisclosed conflict could single-handedly affect the outcome of a decision. However, an IRB must have a policy that defines a conflict of interest, and federal regulations prohibit an IRB member with a conflicting interest from participating in the review of a protocol. One conflict of interest that is almost universally recognized is that of being a researcher on a protocol under review. When an IRB member’s own protocol is being reviewed by the board, that member/researcher may not participate in the review except to provide information. In this case, the member/researcher must recuse him/herself from the discussion and vote and leave the meeting room during that time.
2.3 Authority of the IRB The IRB’s authority is assigned by the institution. Once that authority is assigned, it cannot be overridden by any part of the institution. IRB approval is necessary, but not sufficient, for conduct of a research study using human subjects. In other words, if an IRB approves a study, another body in the institution may prevent the project from being conducted. However, if an IRB does not approve a study, then it cannot be conducted. No body within an institution can overturn disapproval by an IRB.
3 ACCREDITATION OF HUMAN RESEARCH PROTECTION PROGRAMS

An IRB is part of a larger human research protection program (HRPP), which also includes the institution, the researchers, the sponsors, and others involved in the clinical research enterprise. Voluntary accreditation of HRPPs has developed in the last decade as a means of identifying those programs that not only meet the minimal federal regulations, but also strive toward best practices in implementing human research ethics. Some IRBs are accredited whereas many others are not.
4 FUNCTION
IRB function is described in its policies and procedures, which are based on ethical principles and regulatory requirements. Each IRB will develop its own policies and procedures, and thus functioning differs among IRBs. However, some functions are common across IRBs. Research protocols that are not reviewed by expedited procedures (45 CFR 46.110) must be reviewed at a regular meeting of the members of the IRB. For official action to be taken at an IRB meeting, a quorum of members must be present, which includes at least one nonscientific member. According to regulations, the written procedures must contain, at a minimum, policies for the following:
• Conducting initial review of a protocol
• Conducting continuing review of a protocol
• Reporting IRB findings and actions to the investigator
• Reporting IRB findings and actions to the institution
• Determining which projects must be reviewed more often than annually
• Determining which projects should be audited
• Ensuring that proposed changes to the research be submitted and approved prior to their implementation, except as needed to protect a subject from immediate harm
• Ensuring prompt reporting to the IRB, to the institution, and to funding agencies of unanticipated problems involving risks to subjects or others
• Ensuring prompt reporting to the IRB, to the institution, and to funding agencies of serious or continuing noncompliance with federal regulations or policies of the IRB
• Ensuring prompt reporting to the institution and to funding agencies of suspensions or terminations of IRB approval of a protocol

The approval of a protocol by an IRB is required before a researcher may begin to conduct a study. The approval period for protocols may vary depending on the risks of the research and on other factors, but the maximum duration of approval is one year. Prior to the expiration of approval, the IRB must conduct continuing review of the ongoing protocol. The study cannot be continued beyond its approval expiration date unless the IRB has reapproved the study at its continuing review and issued a new approval expiration date.

4.1 Criteria for IRB Approval

For an IRB to approve a research study, it must assure that certain criteria are met. Unless all criteria are met, the IRB cannot issue an approval. These criteria are derived directly from the three Belmont principles: respect for persons, beneficence, and justice (see "Belmont Report"). Both initial review and continuing review require that these same criteria have been satisfied. As with all federal regulations, these are minimal criteria; any IRB may include additional criteria to be met. These minimal criteria are listed in the federal regulations (45 CFR 46.111; 21 CFR 56.111).
1. Risks to subjects are minimized: (1) by using procedures that are consistent with sound research design and that do not unnecessarily expose subjects to risk and (2) whenever appropriate, by using procedures already being performed on the subjects for diagnostic or treatment purposes.
2. Risks to subjects are reasonable in relation to anticipated benefits, if any, to subjects, and the importance of the knowledge that may reasonably be expected to result (see "Benefit-Risk Assessment" and "Risk-Benefit Analysis").
3. Selection of subjects is equitable.
4. Informed consent will be sought from each prospective subject or the subject's legally authorized representative (see "Informed Consent," "informed consent process," "legally authorized representative," and "assent").
5. Informed consent will be appropriately documented (see "Informed Consent Form").
6. When appropriate, the research plan makes adequate provision for monitoring the data collected to ensure the safety of subjects (see "Data Safety Monitoring Board, DSMB" and "Data Monitoring Committee").
7. When appropriate, adequate provisions are implemented to protect the privacy of subjects and to maintain the confidentiality of data.
8. When some or all subjects are likely to be vulnerable to coercion or undue influence, additional safeguards have been included in the study to protect the rights and welfare of these subjects (see "Vulnerable Subjects").

For an IRB to conduct a thorough review of the research once it has been implemented, the IRB has the authority to audit or observe the research, which includes the informed consent process (see "Informed Consent Process"). Various IRBs perform these observations in different ways, but the authority for this activity exists for all IRBs. Researchers must be informed in writing as to the decision of the IRB concerning their protocol. If any changes to the protocol are required by the IRB so that it can approve the protocol, then those required changes must be specified in writing to the researcher. If a protocol is not approved, then the reasons for that decision must be communicated to the researcher in writing.
4.2 IRB Staff

Most IRBs have staff support, although it can take a variety of forms. At one extreme is the IRB that has a part-time secretary/administrator who may be responsible for answering phones, typing letters to investigators, and supporting the IRB meeting by assuring that all members have the documents needed to review protocols. At the other extreme is an IRB office staffed with many certified IRB professionals who may review protocols for science, ethics, and regulatory compliance, as well as educate researchers about these same topics.

5 DIFFERENCES BETWEEN INSTITUTIONAL AND INDEPENDENT REVIEW BOARDS

Federal regulations prescribe much of the structure and function of IRBs. When these regulations were written in 1981, most research was conducted in a single academic medical center. In the subsequent years, many changes have occurred in the research environment, which include clinical research being conducted in free-standing clinics and private doctors' offices that are not affiliated with academic medical centers. The need for review of studies conducted in these other sites generated independent review boards that are not affiliated with a particular institution. In addition, the complexity of multiple IRB reviews for a multicenter study, in which each IRB makes different demands of the protocol but for scientific validity a protocol must be conducted identically at each site, has led sponsors to attempt to undergo a single IRB review using an independent IRB whose approval applies to all sites. Other terms for these unaffiliated boards are commercial IRBs, central IRBs, for-profit IRBs, and noninstitutional IRBs.

5.1 Institutional Review Boards

Institutional IRBs report to a high-ranking administrative official within the institution, usually a dean of a school or the president of a hospital. Membership on the IRB consists mostly of members of the institution, with the exception of the nonaffiliated member(s) and possibly the nonscientist member(s). The members are usually volunteers,
and some may be minimally compensated for time or may be relieved from other responsibilities at the institution. The IRB may interact with other committees that review the research proposals, which include the scientific review committee, radiation safety committee, institutional biosafety committee, resource utilization committee, conflict of interest committee, grants administration, and others. Institutional IRBs have oversight for the research conducted at their institution(s) and by their investigators. Board members and IRB staff frequently know the researchers and work with them. Thus, the IRB must confront the issue of individual nonfinancial conflict of interest. Members and staff have relationships with the researchers, either positive or negative, that can affect the objectivity of their review and decision concerning a particular protocol. But the strength of the system is that the entire board can help minimize the impact of any potential relationship conflict; the entire board participates in the decision to approve or disapprove a protocol. Institutional IRBs are confronted with institutional conflict of interest issues. Research that is approved and conducted will benefit the institution of which many members are employees. Although rarely does pressure come from the institution to approve a particular study, the research effort at the institution benefits all members of that hospital or school, both in reputation and financially. Thus, a subconscious pressure to approve protocols may be placed on the board, and this pressure must be guarded against.

5.2 Independent Review Boards

Independent IRBs traditionally are not affiliated with an academic institution or hospital performing the research, although they may be affiliated with a contract research organization or site management organization that performs research (see "Contract Research Organization"). As suggested by their name, independent IRBs are usually independent of the researchers and independent of any other committees that may review the research. Their members are frequently paid, rather than being volunteers, because participation on this committee is not part of their academic citizenship responsibility but rather is an extracurricular activity. Independent IRBs have oversight for a research protocol that may be conducted at many sites nationally or internationally. Board members and IRB staff do not work with the researchers, and frequently they are not familiar with the location or site at which the research will be conducted. Although the relationship conflicts of interest do not exist, the board members and staff are not familiar with the researcher and are not aware of the skills and ethics of a particular researcher beyond what might be available through review of a curriculum vitae. The more subtle characteristics of the researcher are unknown to the IRB when it makes a decision to approve or disapprove a protocol. Independent IRBs must deal with a different type of institutional conflict of interest. These IRBs receive direct income from a given protocol review. Although the payment does not hinge on approval of the protocol, if an IRB repeatedly disapproves protocols from a given sponsor, then that sponsor is likely to use another independent IRB, which decreases the overall income of the IRB. Thus, a subconscious pressure to approve protocols may be placed on the board, and this pressure must be guarded against.

REFERENCES

1. Code of Federal Regulations, Title 45A [45 CFR 46], Department of Health and Human Services; Part 46—Protection of Human Subjects. Updated 1 October, 1997.
2. Code of Federal Regulations, Title 21, Chapter 1 [21 CFR 50], Food and Drug Administration, DHHS; Part 50—Protection of Human Subjects. Updated 1 April, 2007.

FURTHER READING

E. A. Bankert and R. J. Amdur, Institutional Review Board: Management and Function. Sudbury, MA: Jones and Bartlett Publishers, Inc., 2006.
IRB: A Review of Human Subjects Research—a periodical dedicated to IRB issues. Published by The Hastings Center, Garrison, NY.
IRB Forum, an Internet discussion board focused on IRB issues. Available: http://www.irbforum.org/.
USDHHS, Office of Human Research Protection Policy Guidance. Available: http://www.hhs.gov/ohrp/policy/index.html#topics.
USFDA Guidances, Information Sheets, and Important Notices on Good Clinical Practice in FDA-Regulated Clinical Trials. Available: http://www.fda.gov/oc/gcp/guidance.html.
INSTITUTIONAL REVIEW BOARDS (IRB)
Institutional Review Boards (IRBs) ensure the rights and the welfare of people who participate in clinical trials both before and during the trial. At hospitals and research institutions throughout the country, IRBs make sure that participants are fully informed and have given their written consent before studies ever begin. The U.S. Food and Drug Administration (FDA) monitors IRBs to protect and ensure the safety of participants in medical research. The purpose of IRB review is to ensure, both in advance and by periodic review, that appropriate steps are taken to protect the rights and welfare of human participants in research. To accomplish this purpose, IRBs use a group process to review research protocols and the related materials such as informed consent documents and investigator brochures. If an IRB determines that an investigation involves a significant risk device, it must notify the investigator and, if appropriate, the sponsor. The sponsor may not begin the investigation until approved by the FDA. Under FDA regulations, an IRB is an appropriately constituted group that has been formally designated to review and monitor biomedical research involving human subjects. In accordance with FDA regulations, an IRB has the authority to approve, require modifications in (to secure approval), or disapprove research. This group review serves an important role in the protection of the rights, safety, and welfare of human research subjects. An IRB must be composed of no less than five experts and lay people with varying backgrounds to ensure a complete and adequate review of activities commonly conducted by research institutions. In addition to possessing the professional competence needed to review specific activities, an IRB must be able to ascertain the acceptability of applications and proposals in terms of institutional commitments and regulations, applicable law, standards of professional conduct and practice, and community attitudes. Therefore, IRBs must be composed of people whose concerns are in relevant areas. Currently, the FDA does not require IRB registration. The institutions where the study is to be conducted should be contacted to determine whether they have their own IRB. If the study is conducted at a site that does not have its own IRB, the investigators should be queried to see if they are affiliated with an institution with an IRB that would be willing to act as the IRB for that site in the study. Independent IRBs can be contracted to act as the IRB for a site without its own IRB. An IRB can be established in accordance with 21 Code of Federal Regulations (CFR) 56, and IRBs must comply with all applicable requirements of the IRB regulation and the Investigational Device Exemption (IDE) regulation (21 CFR Part 812) in reviewing and approving device investigations involving human testing. The FDA does periodic inspections of IRB records and procedures to determine compliance with the regulations. The FDA Office of Health Affairs provides guidelines, including an "IRB Operations and Clinical Requirements" list, to help IRBs carry out their responsibilities for protection of research subjects. The topic of IRBs is also addressed in the Federal Register (March 13, 1975) and the Technical Amendments concerning "Protection of Human Subjects" (45 CFR Part 46).

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/handbook/irb.htm, http://www.fda.gov/cder/about/smallbiz/humans.htm, and http://www.fda.gov/cdrh/devadvice/ide/irb.shtml) by Ralph D'Agostino and Sarah Karl.
INTEGRATED DATABASE
SUSAN J. KENNY
Inspire Pharmaceuticals, Inc., Durham, North Carolina

1 THEORY AND CONSTRUCTION OF INTEGRATED DATABASES

Data integration is the problem of combining data that resides at different sources and providing the user with a unified view of these data (1). Integrated databases are widely used in many professions, including health care, epidemiology, postmarketing surveillance, and disease registries. The content of this article is restricted to the concepts, applications, and best practice as applied specifically to clinical trials.

1.1 The Purpose of Integrated Databases

The primary purpose for creating an integrated database of clinical trial data is to have a database with a greater sample size so that potential differences among subpopulations of interest can be evaluated. These important differences may exist in the individual studies, but the smaller sample size may not yield enough statistical power to identify them adequately.

1.2 The History of Integrated Databases

The practice of combining heterogeneous data sources into a unified view is not new. With the introduction of computers into most business sectors in the 1960s, the use of electronic databases to store information became a standard practice. This practice led to an expansion of knowledge in the academic areas of database technology and computer science. Database technology was introduced in the late 1960s to support business applications. As the number of these applications and data repositories grew, the need for integrated data became more apparent. The first approach to integration was developed around 1980 (2), and over the past 20 years a rich history of research in the area of data integration and a rapid sophistication of methods has developed.

Concurrent with these methods for data integration was the development of principles of data warehousing. The classic approach to data warehousing involves three steps: data extraction, transformation, and loading. It is within the second step of transformation that data integration is performed. Here, data are altered, if necessary, so that data values in the data warehouse are represented in a standard and consistent format. A simple example of this alteration is the transformation of laboratory values into standard international units. Often the term data warehouse is used interchangeably with integrated data. However, in a theoretical sense, they are not the same because data warehousing encompasses the entire process of designing, building, and accessing data in real time. Integrated data may be a component of a data warehouse, but it can also exist as a separate entity and represents a harmonized view of data from multiple sources.

1.3 Constructing an Integrated Database

The goal of an integrated database is to give the user the ability to interact with one large system of information that in actuality came from multiple sources. Thus, users must be provided with a homogeneous view of the data despite the reality that the data may have been represented originally in several forms. Before the creation of an integrated database, it must be established that the data to be integrated were collected in a similar manner so that the resulting integrated data accurately reflect the true associations between data elements. If studies used substantially different case report forms, had different inclusion/exclusion criteria, used different endpoint measurements, or had vastly different measures of statistical variability, then the use of an integrated database that is built from disparate data may lead to spurious or statistically biased conclusions.

The first task in constructing an integrated database is to create a desired schema to describe the structure of the integrated database, the variables it will contain, and how the values of these variables will be represented. A schema is akin to a blueprint that will guide the construction of the desired outcome. As a simple example, a schema to integrate the demographic data from three different clinical trials might specify that there will be one record per subject and that variables for age, age group category, sex, race, and survival status will be included. Associated with the database schema is a metadata dictionary to describe the attributes for each variable. The attributes would include the variable name, variable label, variable type (character or numeric), and the precision for numeric variables. In addition, a list of the acceptable values for each of the character variables and a range of acceptable values for numeric variables would be provided. The need for management of metadata is not limited to integrated databases, and various authors have presented approaches to sound practices of this important step (3,4).

One of the most important steps in creating an integrated database is to specify clearly how variable values would be harmonized, or remapped, in the event that the originally collected values were not similar. This harmonization ranges from the simplistic, such as conversion of height to a standard unit, to the more difficult, such as the remapping of the recording of the states of disease progression. The recoding of adverse event data across several studies so that all verbatim terms are coded using the same version of the coding dictionary is another example of a type of harmonization that commonly occurs. It is important to document the transformations clearly because the integrity of the integrated data rests on the validation of this process. As new data are added to an existing integrated database, this documentation may need to be updated to reflect the remapping decision for any new and not previously seen data values. A consequence of harmonization is the loss of detail on an individual study basis. Depending on the extent of harmonization, it may not be possible to regenerate the original results from individual studies. Therefore, a trade-off occurs between standardization and the retention of important study details. Well-documented metadata should support the efforts of the user of the integrated database to trace the path from the original data to the integrated, mapped data and clearly understand how important differences may have occurred.
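To make the harmonization and metadata steps concrete, the following sketch applies a metadata dictionary and a set of documented remapping rules to demographic records from two hypothetical studies. The variable names, code lists, and unit-conversion factors are illustrative assumptions, not part of any published standard or of the sources cited in this article.

```python
# Minimal sketch of metadata-driven harmonization for an integrated database.
# All variable names, code lists, and conversion factors are hypothetical.

# Target metadata dictionary: attributes and acceptable values for each variable.
METADATA = {
    "age":    {"type": "numeric", "label": "Age at randomization (years)"},
    "sex":    {"type": "character", "values": {"M", "F"}},
    "height": {"type": "numeric", "label": "Height (cm)"},
}

# Per-study remapping rules documented alongside the metadata.
SEX_MAP = {"1": "M", "2": "F", "MALE": "M", "FEMALE": "F", "M": "M", "F": "F"}
HEIGHT_TO_CM = {"cm": 1.0, "in": 2.54}   # convert all heights to centimeters

def harmonize(record: dict, study_id: str) -> dict:
    """Map one raw study record into the integrated-database representation."""
    out = {"study_id": study_id, "subject_id": record["subject_id"]}
    out["age"] = float(record["age"])
    out["sex"] = SEX_MAP[str(record["sex"]).strip().upper()]
    out["height"] = round(float(record["height"]) * HEIGHT_TO_CM[record["height_unit"]], 1)
    # Validate against the metadata dictionary before loading.
    assert out["sex"] in METADATA["sex"]["values"]
    return out

# Two studies that recorded the same concepts differently.
study_a = [{"subject_id": "A-001", "age": 54, "sex": "1", "height": 68, "height_unit": "in"}]
study_b = [{"subject_id": "B-117", "age": 61, "sex": "F", "height": 170, "height_unit": "cm"}]

integrated = [harmonize(r, s) for s, rows in [("A", study_a), ("B", study_b)] for r in rows]
print(integrated)
```

In practice the remapping rules themselves become part of the documented metadata, so that a user can trace each integrated value back to the value originally collected.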
1.4 The Benefit of Standards for Integration

In an ideal setting, whether data from an individual clinical study eventually will be included in an integrated database would be known at the start of the study. However, this case is not practical because the decision to include data in an integrated database will likely depend on the success of the individual study and on the success of the entire product development. Therefore, the integration of data is most often done retrospectively, after the data have been collected and perhaps long after the study has been completed. This situation increases the likelihood that variables and values will need to be remapped to harmonize them with other data sources. In recent years, several initiatives to standardize clinical trial data have developed. The adoption of standards should serve to streamline any integration effort because it will reduce the variability in the individual study metadata. The Clinical Data Interchange Standards Consortium (CDISC) was founded in 1999 with the mission to "develop and support global, platform-independent data standards that enable information system interoperability to improve medical research and related areas of healthcare" (5). CDISC has developed standard metadata models for the representation of clinical trial data. The use of CDISC standards for each individual clinical trial at the beginning of the trial should result in a substantial reduction in the effort needed to integrate multiple studies into one integrated database. In a similar vein, Health Level Seven (HL7) is an accredited standards developing organization that operates in the healthcare arena with a mission to create standards for the exchange, management, and integration of electronic healthcare information. HL7 is working toward the creation of a standard vocabulary that will enable the exchange of clinical data and information so that a shared, well-defined, and unambiguous knowledge of the meaning of the data
transferred is found (6). Both of these standards organizations have been referenced in the U.S. Food and Drug Administration’s (FDA’s) Critical Path Initiatives List (7,8) and undoubtedly will play an increased role in clinical trial data management in the future and result in the streamlining of integration efforts. Leading by example, the National Cancer Institute established the National Cancer Institute Center for Bioinformatics (NCICB) (9) in 2001 with the purpose to develop and provide interoperable biomedical informatics infrastructure tools and data that are needed by the biomedical communities for research, prevention, and care. Recognizing the problem of integrating data when multiple ways are available to describe similar or identical concepts, NCICB has embarked on developing a repository of common data elements, associated metadata, and a standard vocabulary. These efforts provide a robust example of how data integration can be streamlined and a solid vision for the future management of clinical trial data that can be applied to many therapeutic areas. 2 APPLICATION OF INTEGRATED DATABASES IN CLINICAL TRIAL RESEARCH Integrated databases have been used in clinical trial research for several years. In the classic sense, integrated databases are created for both regulatory purposes and postmarketing surveillance. With the rapid expansion of the field of bioinformatics, the use of integrated databases has expanded beyond this classic application. 2.1 Classic Applications The integration of data, especially the data related to patient safety, is required by most regulatory agencies when the drug is submitted for marketing approval. For submissions made to the U.S. Food and Drug Administration, the guidelines recommend that an application have sections entitled Integrated Summary of Efficacy and Integrated Summary of Safety (ISS) (10). In the ISS section, safety data from all available studies are integrated so that analyses of measures of
safety can be performed for the entire patient population exposed to the drug compound. Integrated databases, therefore, are often initially created as part of a new drug application. Using the integrated database of adverse events, sponsor companies can determine the most commonly occurring adverse events across the entire patient population. This information becomes part of the package insert and is used as a reference by informed patients, prescribing physicians, and marketing campaigns. In addition to defining the most common adverse events, the use of the integrated database to evaluate serious adverse effects that would be too rare to be detected in a single study is of special interest. Often a sponsor may not be aware of potentially serious adverse effects or subpopulations that may have a differential response to treatment until the data are integrated and the sample size is large enough to detect these important differences. Using the integrated database, both sponsor companies and regulatory agencies can explore the safety profile across various subpopulations, such as age groups, racial or ethnic groups, and gender. By employing data standards within a study project, data integration can be done as each clinical trial is completed rather than all at once just before submitting a new drug application. Sponsor companies profit from creating an integrated database as studies complete because it can reduce the time to submission, and it can highlight important differences between treatment groups or subpopulations earlier in the product development.
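As a simple illustration of the kind of pooled safety summary described above, the sketch below combines adverse event counts from several hypothetical studies and reports crude incidence by treatment group and age subgroup. The study identifiers, counts, and subgroup labels are invented for illustration.

```python
# Minimal sketch of pooling adverse event data across studies for an
# ISS-style summary. All study names, terms, and counts are hypothetical.
from collections import defaultdict

# One record per (study, treatment arm, age subgroup): subjects exposed and
# subjects reporting the adverse event of interest.
records = [
    {"study": "301", "arm": "drug",    "age": "<65",  "n": 220, "events": 31},
    {"study": "301", "arm": "placebo", "age": "<65",  "n": 215, "events": 18},
    {"study": "302", "arm": "drug",    "age": ">=65", "n": 150, "events": 27},
    {"study": "302", "arm": "placebo", "age": ">=65", "n": 148, "events": 12},
    {"study": "303", "arm": "drug",    "age": "<65",  "n": 310, "events": 40},
    {"study": "303", "arm": "placebo", "age": "<65",  "n": 305, "events": 25},
]

# Pool counts over studies within each (arm, age subgroup) cell.
pooled = defaultdict(lambda: {"n": 0, "events": 0})
for r in records:
    cell = pooled[(r["arm"], r["age"])]
    cell["n"] += r["n"]
    cell["events"] += r["events"]

for (arm, age), cell in sorted(pooled.items()):
    rate = 100.0 * cell["events"] / cell["n"]
    print(f"{arm:7s} {age:5s}  {cell['events']:3d}/{cell['n']:4d}  ({rate:4.1f}%)")
```

Because crude pooling of this kind can be confounded by differences among the studies, stratified or model-based summaries are often preferred in practice.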
In addition to creating integrated databases to fulfill regulatory requirements, many sponsors maintain an integrated database that contains results from late phase and postmarketing trials. The medical literature is replete with published examples of the use of integrated databases to monitor the safety and tolerability of a marketed drug. Many of these articles describe the use of integrated data specifically for postmarketing safety surveillance (11–16), and the sample size can be quite large when the drug compound under study is widely prescribed (17). The use of integrated databases
is not restricted to late phase or postmarketing trials and can serve useful purposes in designing new trials and data collection instruments (18). 2.2 Recent Advances As the collection of data from clinical trials becomes less reliant on paper and increasingly collected via electronic means, the types of data that can be integrated has expanded, especially data from digital images. Constructing a database of images is inherently different than a database of textual values, and this difference impacts the design of the database and use of the data. New database strategies have been developed to address these issues (19,20). A recent study (21) has enumerated the issues associated with building an integrated database that contains the results of quantitative image analysis obtained from digital magnetic resonance images (MRIs). The authors present a workflow management system that uses a centralized database coupled with database applications that allow for seamless integration of the analysis of MRIs. This process could be generalized and applied to any clinical trial that uses image-based measurements. In an effort to improve drug safety programs, the FDA has recently embarked on several projects to discover better methods to predict cardiovascular risk of drugs. In two of these projects, the FDA has collaborated with nongovernment entities, Mortara Instruments, Inc. and Duke Clinical Research Institute, to design and build a repository to hold digital electrocardiograms. The use of this integrated warehouse will facilitate regulatory review and research and will aid in the development of tools to be used for the evaluation of cardiac safety of new drugs (22). The explosion of knowledge of the areas of genomics and proteomics has fostered the development of integrated databases that can be used to support clinical research. Matthew and colleagues (23) give a thorough review of the challenges that are being addressed in this new era of computational biology as applied to translational cancer research. By creating an integrated database
of gene expression and drug sensitivity profiles, Tsunoda and colleagues (24) could identify genes with expression patterns that showed significant correlation to patterns of drug responsiveness. Other authors have developed integrated databases of ‘‘omics’’ information that enables researchers working in the field of anticancer drug discovery to explore and analyze information in a systematic way (25). An example of the software infrastructure used to build biological data warehouse for the integration of bioinformatics data and for the retrieval of this information has been described and made freely available to the public (26). REFERENCES 1. M. Lenzerini, Data Integration: a theoretical perspective. Proceedings of the 21st ACM-SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. 2002: 243–246. 2. A. Hurson and M. Bright, Multidatabase systems: An advanced concept in handling distributed data. Advances in Computers. 1991; 32: 149–200. 3. C. Brandt, R. Gadagkar, C. Rodriguez, and P. Nadkarni, Managing complex change in clinical study metadata. J. Am. Med. Inform. Assoc. 2004; 11(5): 380–391. 4. A. Vaduva and T. Vetterli, Meta data management for data warehousing: an overview. Int. J. Cooperative Inform. Syst. 2001; 10(3): 273–298. 5. Clinical Data Interchange Standards Consortium (CDISC). Available: www.cdisc.org. 6. Health Level Seven (HL7). Available: www.hl7.org. 7. U. S. Food and Drug Administration (FDA). Challenge and opportunity on the critical path to new medical products. March 2004. Available: http://www.fda.gov/ oc/initiatives/criticalpath/whitepaper.pdf. 8. U. S. Food and Drug Administration (FDA). Critical path opportunities list. March 2006. Available: http://www.fda.gov/oc/ initiatives/criticalpath/reports/opp list.pdf. 9. National Cancer Institute Center for Bioinformatics (NCICB). Available: http:// ncicb.nci.nih.gov/. 10. U. S. Food and Drug Administration (FDA). Guideline for the format and content of the clinical and statistical sections of an
application. July 1988. Available: http://www.fda.gov/cder/guidance/statnda.pdf.
11. D. Payen, A. Sablotzk, P. Barie, G. Ramsay, S. Lowry, M. Williams, S. Sarwat, J. Northrup, P. Toland, and F. V. McL. Booth, International integrated database for the evaluation of severe sepsis and drotrecogin alfa (activated) therapy: analysis of efficacy and safety in a large surgical cohort. Surgery. 2006; 140(5): 726–739.
12. A. Gottlieb, C. Leonardi, B. Goffe, J. Ortonne, P. van der Kerhof, R. Zitnik, A. Nakanishi, and A. Jahreis, Etanercept monotherapy in patients with psoriasis: a summary of safety, based on an integrated multistudy database. J. Am. Acad. Dermatol. 2006; 54(3 Suppl 2): S92–S100.
13. D. Hurley, C. Turner, I. Yalcin, L. Viktrup, and S. Baygani, Duloxetine for the treatment of stress urinary incontinence in women: an integrated analysis of safety. Eur. J. Obstet. Gynecol. Reprod. Biol. 2006; 125(1): 120–128.
14. R. Fleischmann, S. Baumgartner, M. Weisman, T. Liu, B. White, and P. Peloso, Long term safety of etanercept in elderly subjects with rheumatic diseases. Ann. Rheum. Dis. 2006; 65(3): 379–384.
15. C. Martin and C. Pollera, Gemcitabine: safety profile unaffected by starting dose. Int. J. Clin. Pharmacol. Res. 1996; 16(1): 9–18.
16. J. Rabinowitz, I. Katz, P. De Deyn, H. Brodaty, A. Greenspan, and M. Davidson, Behavioral and psychological symptoms in patients with dementia as a target for pharmacotherapy with risperidone. J. Clin. Psychiatry. 2004; 65(10): 1329–1334.
17. J. Shepherd, D. Vidt, E. Miller, S. Harris, and J. Blasetto, Safety of rosuvastatin: update of 16,876 rosuvastatin-treated patients in a multinational clinical trial program. Cardiology. 2007; 107(4): 433–443.
18. A. Farin and L. Marshall, Lessons from epidemiologic studies in clinical trials of traumatic brain injury. Acta Neurochir. Suppl. 2004; 89: 101–107.
19. H. Tagare, C. Jaffe, and J. Duncan, Medical image databases: a content-based retrieval approach. J. Am. Med. Inform. Assoc. 1997; 4(3): 184–198.
20. C. Feng, D. Feng, and R. Fulton, Content-based retrieval of dynamic PET functional images. IEEE Trans. Inf. Technol. Biomed. 2000; 4(2): 152–158.
21. L. Liu, D. Meir, M. Polgar-Turcsanyi, P. Karkocha, R. Bakshi, and C. R. Guttman, Multiple sclerosis medical image analysis and information management. J. Neuroimaging. 2005; 15(4 Suppl): 103S–117S.
22. U. S. Food and Drug Administration. The Future of Drug Safety – Promoting and Protecting the Health of the Public. January 2007. Available: http://www.fda.gov/oc/reports/iom013007.pdf.
23. J. Matthew, B. Taylor, G. Bader, S. Pyarajan, M. Antoniotti, A. Chinnaiyan, C. Sander, J. Buarkoff, and B. Mishra, From bytes to bedside: data integration and computational biology for translational cancer research. PLOS Computational Biology. 2007; 3(2): 153–163.
24. D. Tsunoda, O. Kitahara, R. Yanagawa, H. Zembutsu, T. Katagiri, K. Yamazaki, Y. Nakamura, and T. Yamori, An integrated database of chemosensitivity to 55 anticancer drugs and gene expression profiles of 39 human cancer cell lines. Cancer Res. 2002; 62(4): 1139–1147.
25. F. Kalpakov, V. Poroikov, R. Sharipov, Y. Kondrakhin, A. Zakharov, A. Lagunin, L. Milanesi, and A. Kel, CYCLONET – an integrated database on cell cycle regulation and carcinogenesis. Nucleic Acids Res. 2007; 35(database issue): D550–556.
26. S. Shah, Y. Huang, T. Xu, M. Yuen, J. Ling, and B. Ouellette, Atlas – a data warehouse for integrative bioinformatics. BMC Bioinformatics. 2005; 6: 34.
CROSS-REFERENCES
Clinical Data Management
Integrated Summary of Safety Information
Integrated Summary of Effectiveness Data
Bioinformatics
Data Mining of Health System Data
INTENTION-TO-TREAT ANALYSIS
JOHN M. LACHIN
The George Washington University, Washington, DC

A clinical trial of a new therapy (agent, intervention, diagnostic procedure, etc.) has many objectives. One objective is to evaluate whether the therapy has the intended biological or physiologic effect, which is often termed pharmacologic efficacy in relation to a new pharmaceutical agent. Another is to evaluate the pragmatic use of the therapy in clinical practice, which is often termed simply effectiveness. The intention-to-treat principle refers to a strategy for the design, conduct, and analysis of a clinical trial aimed at assessing the latter, which is the pragmatic effectiveness of a therapy in clinical practice. An analysis for the assessment of pharmacologic efficacy generally excludes subjects who either did not comply with the assigned therapy or could not tolerate it because of adverse effects. Such analyses are often termed per-protocol, efficacy subset, or evaluable subset analyses and involve post-hoc subset selection or post-hoc exclusions of randomized subjects. Conversely, for the assessment of effectiveness in an intent-to-treat analysis, follow-up data for all randomized subjects should be obtained and included in the analysis. This design is the essence of the intention-to-treat principle.

1 MISSING INFORMATION

1.1 Background

The intention-to-treat principle evolved from the evaluation by regulatory officials at the Food and Drug Administration (FDA) (1), as well as scientists from the National Institutes of Health (2) and academia (3), of clinical trials in which post-hoc subset selection criteria were applied (4). In some cases, the data are collected but are excluded at the time of analysis. More often, the protocol allows subjects to terminate follow-up, and thus allows subject data to be excluded (not collected), such that it is only possible to perform an analysis using a post-hoc selected subset. The essential concern with such subset selection, or equivalently post-hoc exclusions, is that the resulting subset may be susceptible to various forms of bias (3,5–7). A review is provided in Reference 8. Many who champion the efficacy subset analysis approach argue that statistical techniques may be applied to provide an unbiased analysis under certain assumptions (9). The essential statistical issue is the extent to which it can be assumed that missing data and omitted data do not introduce a bias under the missing information principle (10) (i.e., that missing/omitted data are ignorable).

1.2 Ignorable Missing Data

Missing data refers to data that are hypothetically obtainable from a subject enrolled in a trial, but that are not obtained. The hypothetically obtainable data consists of every possible observation that could be obtained from a subject from the point of initial screening and randomization to the prespecified scheduled end of follow-up for that subject. In some trials, the prespecified end of study for a subject is a fixed period, such as 1 year of treatment and follow-up. In other trials, the prespecified end may depend on when the subject enters the trial, such as the case where patient entry is staggered over one period but there is a prespecified date on which all treatment and follow-up will end. If recruitment is conducted over a 3-year period and the total study duration is 5 years, then the first subject randomized hypothetically can contribute 5 years of data collection whereas the last can only contribute 2 years. Thus, for every clinical trial design, each randomized subject has an associated hypothetical complete set of data that could be collected, and the aggregate over all subjects is the set of hypothetically obtainable data. Data may be missing for many possible reasons or mechanisms, some of which may be ignorable, occur randomly, or occur completely by chance. Missing data that develop from an ignorable mechanism are called
missing completely at random (MCAR) (10), in the sense that the unobserved value of the missing observation is statistically independent of other potentially observable information, most importantly including the treatment group assignment. In the context of survival analysis, the equivalent assumption is censoring at random. Few missing data mechanisms satisfy the MCAR criterion on face value. One is administratively missing data whereby data from a subject cannot be obtained because of administrative curtailment of follow-up. This situation usually applies in the setting of staggered patient entry with a fixed study end date, such as the last patient entered in the above example for whom at most 2 years of data could be collected before study end, and for whom the assessments during years 3–5 would be administratively missing. Missing data may occur for many other reasons. Some subjects may die while participating in the study. Others may be withdrawn from therapy because of poor compliance, lack of evidence of a therapeutic effect, or an adverse effect of therapy, and concurrently withdrawn from the study (i.e., with no additional follow-up). Others may be lost to follow-up (so-called dropouts) because of withdrawal of consent, moving away, incarceration, loss of interest, and so on. The fundamental issue is whether these mechanisms can be claimed to be ignorable. Of these, losses to follow-up are often considered to be missing completely at random, but missing data from such subjects would only be ignorable when effects of treatment, either positive or negative, played no role in the decision to curtail follow-up. This situation might be plausible if examination of the timing and characteristics of such losses are equivalent among the groups, and the characteristics of losses versus those who were not lost to follow-up are equivalent. If differences are detected, then it is possible or even likely that the missing data from such subjects is not ignorable. If baseline covariate differences are detected between those lost to follow-up versus those not, then a sensitivity analysis that compares groups adjusted for those covariates might provide a less-biased comparison of the treatment groups.
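One way to carry out the comparisons just described is to model an indicator of loss to follow-up as a function of treatment assignment and baseline covariates. The sketch below does this with a logistic regression on simulated data; the variable names and the data-generating assumptions are hypothetical, and the model is only a diagnostic.

```python
# Minimal sketch of checking whether losses to follow-up are related to
# treatment or baseline covariates. Data are simulated; all names are
# hypothetical, and this check cannot prove that missing data are ignorable.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),      # 1 = active, 0 = control
    "age": rng.normal(55, 10, n),
    "severity": rng.normal(0, 1, n),     # baseline disease severity
})
# Simulate dropout whose probability depends on treatment and severity.
p_drop = 1 / (1 + np.exp(-(-2.0 + 0.6 * df["treat"] + 0.5 * df["severity"])))
df["lost"] = rng.binomial(1, p_drop)

# Logistic regression of the missingness indicator on treatment and covariates.
fit = smf.logit("lost ~ treat + age + severity", data=df).fit(disp=False)
print(fit.summary())
```

A significant association between missingness and treatment or covariates argues against ignorability; a nonsignificant one does not prove it, which is the limitation noted above.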
On-study deaths in some circumstances may also be claimed to be ignorable, as in a study of a disease or condition (e.g., topical skin therapy) that has a negligible risk of mortality. However, even in such cases, deaths may not be ignorable if the treatment itself may adversely affect vital organs, such as through hepatotoxicity (liver toxicity). The latter could be a hypothetical concern for any agent that is known to be metabolized and excreted by the liver. Drugs have been discovered with none of the other preclinical (i.e., animal) findings or clinical findings (e.g., elevated liver enzymes) that would trigger suspicion of hepatotoxicity, but in later trials or clinical practice are shown to pose a risk of possibly fatal hepatotoxicity in some patients. Clearly, subjects withdrawn from treatment and follow-up because of a specific adverse effect would only be ignorable when the adverse effect can be claimed to be statistically independent of the treatment assignment. This result can virtually never be proven. Other subjects withdrawn from treatment and follow-up because of insufficient therapeutic effect are clearly not ignorable. The fundamental issue is that in all of these cases, it cannot be proven that the missing data or the mechanism(s) for missing data are ignorable or missing completely at random.

1.3 Conditionally Ignorable Missing Data

Many statistical methods require the assumption that missing data are missing completely at random to provide an unbiased analysis. However, other methods provide an unbiased analysis under the assumption that missing data are missing at random (MAR). This terminology is somewhat of a misnomer. Under MAR, it is assumed that the missing data are in fact nonignorable in the sense that the probability of being missing may depend on the unobserved value. However, MAR assumes that this dependence is reflected in other information that has been observed. Thus under MAR, it is assumed that the missing data are conditionally independent of the unobserved value, conditioning on the other information that has been observed, including treatment assignment.
Clearly, this is a big assumption. For example, a longitudinal analysis that uses a mixed model implicitly assumes that the data that are missing at a follow-up visit are a function of the baseline characteristics and other follow-up observations that preceded it. The treatment group comparison can then be claimed to be unbiased if this relationship applies (i.e., the structural and random model components are correctly specified), and the important covariates have been measured and observed. However, these assumptions cannot be verified.
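For concreteness, the sketch below fits the kind of longitudinal random-effects model alluded to above to simulated data in which later visits are missing at random, with the probability of dropout depending on the previously observed response. The variable names, the random-intercept specification, and the dropout mechanism are illustrative assumptions, not a prescribed analysis.

```python
# Minimal sketch of a longitudinal mixed-model analysis that is valid under
# MAR given a correctly specified model. Simulated data; all names and the
# dropout mechanism are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_subj, visits = 200, [1, 2, 3, 4]
rows = []
for i in range(n_subj):
    treat = i % 2                      # alternating assignment, for illustration
    baseline = rng.normal(50, 8)
    b_i = rng.normal(0, 3)             # subject-level random intercept
    prev = baseline
    for v in visits:
        y = baseline - (2.0 if treat else 1.0) * v + b_i + rng.normal(0, 2)
        # MAR-style dropout: missingness depends on the previously observed
        # response, not on the current unobserved value.
        if v > 1 and rng.random() < 0.3 / (1 + np.exp(-(prev - 50) / 5)):
            break
        rows.append({"subject": i, "treat": treat, "visit": v,
                     "baseline": baseline, "y": y})
        prev = y

df = pd.DataFrame(rows)
# Random-intercept model; fixed effects for treatment-by-visit and baseline.
model = smf.mixedlm("y ~ visit * treat + baseline", data=df, groups=df["subject"])
print(model.fit().summary())
```

The treatment-by-visit term estimates the difference in trends over time; its validity under MAR rests on the fixed and random components being correctly specified, which, as noted above, cannot be verified from the observed data.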
In conclusion, although statistical methods can provide an unbiased analysis of virtually any data structure with missing data when certain assumptions apply, and although those assumptions, either MCAR or MAR conditionally, can be tested and rejected, those assumptions can never be proven to apply, and the resulting analysis can never be proven to be unbiased. Lachin (8) summarized the issue by stating that the only incontrovertibly unbiased study is one in which all randomized patients are evaluated and included in the analysis, assuming that other features of the study are also unbiased. This is the essence of the intent-to-treat philosophy. Any analysis which involves post hoc exclusions of information is potentially biased and potentially misleading.
1.4 Potential for Bias

If the data that are missing arise from nonignorable mechanisms, then the missing data can introduce substantial bias in the results. Lachin (8) presents a model for the assessment of the possible bias for an analysis of the difference between two proportions and the resulting inflation in the type I error probability α. As the proportion of subjects excluded from the analysis increases, the maximum possible bias and the resulting α increase, as expected. Furthermore, as the total sample size increases, the possible bias and α with a given fraction of missing data increase. Consider the simplest case in which all control group subjects are followed, but a fraction of the treated group has missing data, for example 20%. For a sample size of 200, a bias of 0.05 leads to an α = 0.14086 whereas for N = 800, the same bias leads to an α = 0.38964. Various analyses can be performed to assess whether the missing at random (conditionally) assumption applies, such as comparing the characteristics of those with missing versus observed data, in aggregate and within groups, and comparing the characteristics of those with missing data between groups. The null hypothesis of such tests is that the missing data are conditionally ignorable within the context of a particular model. The alternative hypothesis is that they are not ignorable, and the resulting analysis is biased. Thus, such tests can reject the null hypothesis of ignorable missing data in favor of the alternative of nonignorable missing data, but such tests cannot prove that missing data are ignorable.
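The qualitative behavior described above can be sketched with a simple calculation. The code below assumes, as a simplification, that the missing-data bias acts as a pure location shift of the standardized test statistic for the difference of two proportions; it will not reproduce Lachin's exact figures, which come from his more detailed maximum-bias model, but it shows the same pattern of inflation growing with sample size. scipy is an assumed dependency.

import numpy as np
from scipy.stats import norm

def attained_alpha(bias, se, alpha=0.05):
    """Type I error of a two-sided z-test whose estimate is shifted by 'bias'."""
    z = norm.ppf(1 - alpha / 2)
    shift = bias / se
    return norm.cdf(-z + shift) + norm.cdf(-z - shift)

# Hypothetical illustration: equal allocation, success probability 0.5 in both
# groups under the null, total sample size N, and a bias of 0.05 in the
# estimated difference of proportions.
for N in (200, 800):
    se = np.sqrt(2 * 0.5 * 0.5 / (N / 2))
    print(N, round(attained_alpha(bias=0.05, se=se), 3))
# The attained level exceeds 0.05 in both cases and is larger for the larger N,
# because the same bias is more 'significant' when the standard error is small.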
2 THE INTENTION-TO-TREAT DESIGN
Thus, an intention-to-treat design makes every attempt to ensure complete follow-up and collection of outcome data for every participant from the time of randomization to the scheduled completion of study, regardless of other developments such as noncompliance or adverse effects of therapy. The International Conference on Harmonization (ICH) document Guidance on Statistical Principles for Clinical Trials (11), provides the following description of the intention-to-treat principle: The principle that asserts that the effect of a treatment policy can best be assessed by evaluating on the basis of the intention to treat a subject (i.e. the planned treatment regimen) rather than the actual treatment given. It has the consequence that subjects allocated to a treatment group should be followed up, assessed and analyzed as members of that group irrespective of their compliance with the planned course of treatment.
This guidance also states (Section 5.2.1): The intention-to-treat principle implies that the primary analysis should include all randomized subjects. Compliance with this principle would necessitate complete follow-up of all randomized subjects for study outcomes.
The ICH Guidance on General Considerations for Clinical Trials (12, Section 3.2.2) also states: The protocol should specify procedures for the follow-up of patients who stop treatment prematurely.
These principles are the essence of the intent-to-treat design. To conduct a study that provides high confidence that it is unbiased, the extent of missing data must be minimized.

2.1 Withdrawal from Treatment versus Withdrawal from Follow-up

Every study protocol should include a provision for the investigator to withdraw a subject from the randomly assigned therapy because of possible adverse effects of therapy. However, the intent-to-treat design requires that such subjects should not also be withdrawn from follow-up. Thus, the protocol should distinguish between withdrawal from treatment versus withdrawal from follow-up. To the extent possible, every subject randomized should be followed as specified in the protocol. In fact, it would be advisable that the study protocol not include provision for the investigator to withdraw a subject from the study (i.e., follow-up). The only exceptions to this policy might be the death or incapacitation of a subject, or the subject's withdrawal of consent. In some studies, time to a major clinical event is the primary outcome, such as a major cardiovascular adverse event or overall survival. In these cases, even if the subject withdraws consent to continue follow-up visits, it would also be important to ask the patient to consent to ascertainment of major events and/or vital status. Furthermore, in long-term studies, patients may withdraw consent, and then later be willing to resume follow-up and possibly the assigned treatment where not contraindicated. To allow patient participation to the extent possible, subjects should not be labeled as "dropouts" while the study is underway. Subjects who withdraw consent or who do not maintain contact may be termed inactive, temporarily, with the understanding that
any such subject may later become active. The designation of "dropout" or "lost to follow-up" should only be applied after the study has been completed.

2.2 Investigator and Subject Training/Education

Unfortunately, many investigators have participated in clinical trials that were not designed in accordance with this principle. Even though the protocol may state that all patients should continue follow-up to the extent possible, investigators may fail to comply. In studies that have successfully implemented continued follow-up, extensive education of physicians, nurse investigators, and patients has been provided. In the Diabetes Control and Complications Trial (1983–1993), patients and investigators received intensive education on the components of the trial and the expectation that they would continue follow-up (13). Of the 1441 subjects randomized into the study, during the 10 years of study only 32 subjects were declared inactive at some point in time. Of these subjects, 7 later resumed treatment and follow-up. Among the 1330 surviving subjects at study end, only 8 subjects did not complete a study closeout assessment visit. During the trial, 155 of the 1441 patients deviated from the originally assigned treatment for some period (were noncompliant). Virtually all subjects continued to attend follow-up assessment visits and most resumed the assigned therapy later. The DCCT was unusual because two multifaceted insulin therapies were being compared in subjects with type 1 diabetes who must have insulin therapy to sustain life. Therefore, withdrawal from insulin therapy was not an option. In a study comparing a drug versus placebo, the issues are different. The Diabetes Prevention Program (14) included comparison of metformin versus placebo for preventing the development of overt diabetes among patients with impaired glucose tolerance. Metformin is an approved antihyperglycemic therapy for treatment of type 2 diabetes with known potential adverse effects that require discontinuation of treatment in about 4% of patients, principally
because of gastrointestinal effects. The following is the text provided to the investigators to explain to the patient why the study desired to continue follow-up after therapy was withdrawn due to an adverse effect: When we designed the study we knew that a fraction of patients would not be able to tolerate metformin. You were told this when you agreed to participate in the study. However, we cannot tell beforehand which participants will be able to take metformin, and which will not. In order to answer the DPP study question as to whether any treatment will prevent diabetes, every participant randomized into the study is equally important. Thus, even though you will not be taking a metformin pill, it is just as important for us to know if and when you develop diabetes as it is for any other participant. That’s why it is just as important to the study that you attend your outcome assessment visits in the future as it was when you were taking your pills.
2.3 The Intent-to-Treat Analysis

An intent-to-treat analysis refers to an analysis that includes all available data for all randomized subjects. However, for an intent-to-treat analysis to comply with the intention-to-treat principle, all "available" data should represent a high fraction of all potentially observable data. Thus, an analysis of all randomized subjects, in which a high fraction have incomplete assessments and missing data, deviates from the intention-to-treat principle in its design and/or implementation, and thus it is possibly biased.

2.4 Intent-to-Treat Subset Analysis

In many studies, multiple analyses are conducted in different sets of subjects. The intention-to-treat "population" is often defined to include all subjects randomized who had at least one dose of the study medication. However, unless the protocol specifies systematic follow-up of all subjects, and a high fraction of the potentially obtainable data is actually obtained, then an analysis of the intention-to-treat population is simply another post-hoc selected subgroup analysis that is susceptible to bias because of nonignorable or informatively missing data.
2.5 LOCF Analysis

In an attempt to reconstruct a complete data set from incomplete data, a simplistic approach that is now commonly employed is an analysis using the last observation carried forward (LOCF) for each subject with missing follow-up measurements. This method is popular because it makes it seem as though no substantial data are missing. However, the technique is statistically flawed (15,16). LOCF values would not be expected to provide the same level of information as values actually measured and would not be expected to follow the same distribution. Furthermore, such values will distort the variance/covariance structure of the data so that any confidence intervals or P-values are biased in favor of an optimistic assessment in that the sample size with LOCF values is artificially inflated and the variance of the measures is artificially deflated. The LOCF method has no formal statistical basis and has been soundly criticized by statisticians.

2.6 Structurally Missing Data

In some studies, the primary outcome is the observation of a possibly right-censored event time and a secondary outcome is a mechanistic or other longitudinal measure. Often, the follow-up of a subject is terminated when the index event occurs, which causes all subsequent mechanistic measurements to be missing structurally. For example, in the Diabetes Prevention Program, the primary outcome was the time to the onset of type 2 diabetes, and measures of insulin resistance and insulin secretory capacity were obtained up to the time of diabetes or the end of study (or loss to follow-up). Thus, these mechanistic measures were missing beyond the time of diabetes, and it is not possible to conduct a straightforward intention-to-treat analysis to describe the long-term effect of each initial therapy on these mechanistic measures (e.g., the difference between groups at, say, 4 years in the total cohort of those entered into the study).

2.7 Worst Rank Analyses

In some cases, it may be plausible to assume that subjects with missing data because of
a prior index event are actually worse (or better) than all those who continue follow-up for a particular measure. For example, subjects who have died can be assumed to have a worse quality of life than any subject who survives. In this case, the subjects with structurally missing data because of such an index event can be assigned a "worst rank" (i.e., a rank worse than that of any of the measures actually observed) (17).

3 EFFICIENCY OF THE INTENT-TO-TREAT ANALYSIS

3.1 Power

The intent-to-treat design, which is necessary to conduct a true intent-to-treat analysis, requires the follow-up of all patients, which includes those who failed to comply with the therapy or those who were withdrawn from therapy by personal choice or by the investigator. Thus, the treatment effect observed may be diluted compared with a setting in which all subjects receive the therapy and are fully compliant. However, that comparison is specious. Virtually every agent or intervention will not be applied optimally in every subject in clinical practice. Therefore, an analysis aimed at the treatment effectiveness under optimal conditions is not relevant to clinical practice. Such an assessment, however, is of interest as a reflection of pharmacologic activity, which is the underlying mechanism by which a treatment is purported to have a clinical effect. This mechanism is the justification for the so-called "efficacy subset" or "per-protocol" analysis often conducted in pharmaceutical trials. The phrase "per-protocol" is used when the protocol itself specifies the post-hoc exclusions of patients and patient data based on compliance, adverse effects, or other factors that indicate suboptimal treatment. Such a subset analysis is highly susceptible to bias. Nevertheless, it is instructive to compare the power of an intention-to-treat design and analysis versus an efficacy subset analysis when it is assumed that the latter is unbiased. Using the test for proportions, Lachin (8) showed that there is a trade-off between the increasing sample size in the intent-to-treat analysis versus the larger expected
treatment effect in the efficacy subset analysis. However, in some settings the intent-to-treat design and analysis may be more powerful, especially when the treatment may have some long-term beneficial effects that persist beyond the period of therapy. For example, if the treatment arrests progression of the disease, then a subject treated for a period of time who is then withdrawn may still be expected to have a more favorable outcome long-term compared with a subject originally assigned to control. In this case, the intention-to-treat analysis can have substantially more power than the efficacy subset analysis, and it has the advantage that it is far less susceptible to bias. These issues were borne out in the analysis of the study of tacrine in the treatment of Alzheimer's disease (18), in which 663 subjects were entered and 612 completed follow-up. However, only 253 continued on treatment. Thus, the intent-to-treat analysis of the 612 evaluated could be compared with the efficacy subset analysis of the 253 "on-therapy completers" as shown in Reference 8. For some outcomes, the intention-to-treat analysis produced results that were indeed significant whereas the efficacy subset analysis was not.

3.2 Sample Size

In the planning of a study, it is common to adjust the estimate of the sample size for losses to follow-up. For a simple analysis of means or proportions, the sample size computed assuming complete follow-up is inflated to allow for losses. For example, to adjust for 20% losses to follow-up (subjects with missing outcome data), the sample size is inflated by the factor 1/0.8, or by 25%. Such an adjustment allows for the loss of information because of losses to follow-up or missing data. It does not adjust for the potential bias introduced by such losses if the mechanism for the missing data is informative.
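A one-line version of this adjustment (the 1/0.8 factor is the worked example from the text; the function name is arbitrary):

import math

def inflate_for_losses(n_complete, loss_rate):
    """Inflate a complete-data sample size to allow for an expected loss rate.
    This restores the expected amount of information only; it does not protect
    against bias when the losses are informative."""
    return math.ceil(n_complete / (1 - loss_rate))

print(inflate_for_losses(400, 0.20))   # 500 subjects, i.e., a 25% inflation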
4 COMPLIANCE ADJUSTED ANALYSES

The efficacy subset analysis is a crude attempt to assess the treatment effect had
all subjects remained fully compliant. However, if an intention-to-treat design is implemented, then noncompliance and the degree of compliance become outcome measures, which make it possible to conduct analyses that assess treatment group differences in the primary outcomes while taking into account the differences in compliance. Analyses could also be conducted to estimate the treatment group difference for any assumed degree of compliance in the treatment groups (19–21). However, these methods can only be applied when study outcomes are measured in all subjects, or a representative subset, which includes those who are noncompliant or who are withdrawn from therapy. These methods cannot be applied in cases where follow-up is terminated when a subject is withdrawn from therapy because of noncompliance or other factors.

5 CONCLUSION
The intention-to-treat principle encourages the collection of complete follow-up data to the extent possible under an intention-to-treat design and the inclusion of all data collected for each subject in an intention-to-treat analysis. An analysis of a so-called intention-to-treat population of all subjects randomized, but without systematic follow-up, is simply another type of post-hoc subset analysis that is susceptible to bias.

REFERENCES

1. R. Temple and G. W. Pledger, The FDA's critique of the Anturane Reinfarction Trial. N. Engl. J. Med. 1980; 303: 1488–1492.
2. D. L. DeMets, L. M. Friedman, and C. D. Furberg, Counting events in clinical trials (letter to the editor). N. Engl. J. Med. 1980; 302: 924.
3. L. M. Friedman, C. D. Furberg, and D. L. DeMets, Fundamentals of Clinical Trials, 3rd ed. New York: Springer; 1998.
4. D. L. Sackett and M. Gent, Controversy in counting and attributing events in clinical trials. N. Engl. J. Med. 1979; 301: 1410–1412.
5. G. S. May, D. L. DeMets, L. M. Friedman, C. Furberg, and E. Passamani, The randomized clinical trial: bias in analysis. Circulation 1981; 64: 669–673.
6. P. Armitage, Controversies and achievements in clinical trials. Control. Clin. Trials 1984; 5: 67–72.
7. R. Peto, M. C. Pike, P. Armitage, et al., Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br. J. Cancer 1976; 34: 585–612.
8. J. M. Lachin, Statistical considerations in the intent-to-treat principle. Control. Clin. Trials 2000; 21: 167–189.
9. L. B. Sheiner and D. B. Rubin, Intention-to-treat analysis and the goals of clinical trials. Clin. Pharmacol. Ther. 1995; 57: 6–15.
10. R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data. New York: Wiley; 1987.
11. Food and Drug Administration, International Conference on Harmonization: Guidance on statistical principles for clinical trials. Federal Register September 16, 1998; 63: 49583–49598.
12. Food and Drug Administration, International Conference on Harmonization: Guidance on general considerations for clinical trials. Federal Register December 17, 1997; 62: 66113–66119.
13. Diabetes Control and Complications Trial Research Group, Implementation of a multicomponent process to obtain informed consent in the Diabetes Control and Complications Trial. Control. Clin. Trials 1989; 10: 83–96.
14. Diabetes Prevention Program Research Group, The Diabetes Prevention Program: design and methods for a clinical trial in the prevention of type 2 diabetes. Diabetes Care 1999; 22: 623–634.
15. G. Verbeke, G. Molenberghs, L. Bijnens, and D. Shaw, Linear Mixed Models in Practice. New York: Springer, 1997.
16. F. Smith, Mixed-model analysis of incomplete longitudinal data from a high-dose trial of tacrine (Cognex) in Alzheimer's patients. J. Biopharm. Stat. 1996; 6: 59–67.
17. J. M. Lachin, Worst rank score analysis with informatively missing observations in clinical trials. Control. Clin. Trials 1999; 20: 408–422.
18. M. J. Knapp, D. S. Knopman, P. R. Solomon, et al., A 30-week randomized controlled trial of high-dose tacrine in patients with Alzheimer's disease. J. Am. Med. Assoc. 1994; 271: 985–991.
19. J. Rochon, Supplementing the intent-to-treat analysis: Accounting for covariates observed postrandomization in clinical trials. J. Am. Stat. Assoc. 1995; 90: 292–300.
20. B. Efron and D. Feldman, Compliance as an explanatory variable in clinical trials. J. Am. Stat. Assoc. 1991; 86: 9–25.
21. J. W. Hogan and N. M. Laird, Intention-to-treat analyses for incomplete repeated measures data. Biometrics 1996; 52: 1002–1017.
FURTHER READING

Principles of Clinical Trial Design and Analysis:

D. G. Altman, K. F. Schulz, D. Moher, M. Egger, F. Davidoff, D. Elbourne, P. C. Gøtzsche, and T. Lang, for the CONSORT Group, The revised CONSORT statement for reporting randomized trials: explanation and elaboration. Ann. Intern. Med. 2001; 134: 663–694.
P. Armitage, Controversies and achievements in clinical trials. Control. Clin. Trials 1984; 5: 67–72.
D. L. DeMets, Statistical issues in interpreting clinical trials. J. Intern. Med. 2004; 255: 529–537.
P. W. Lavori and R. Dawson, Designing for intent to treat. Drug Informat. J. 2001; 35: 1079–1086.
G. S. May, D. L. DeMets, L. M. Friedman, C. Furberg, and E. Passamani, The randomized clinical trial: bias in analysis. Circulation 1981; 64: 669–673.
R. Peto, M. C. Pike, P. Armitage, et al., Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br. J. Cancer 1976; 34: 585–612.
D. Schwartz and J. Lellouch, Explanatory and pragmatic attitudes in therapeutic trials. J. Chron. Dis. 1967; 20: 637–648.

Case Studies

G. Chene, P. Moriat, C. Leport, R. Hafner, L. Dequae, I. Charreau, J. P. Aboulker, B. Luft, J. Aubertin, J. L. Vilde, and R. Salamon, Intention-to-treat vs. on-treatment analyses from a study of pyrimethamine in the primary prophylaxis of toxoplasmosis in HIV-infected patients. ANRS 005/ACTG 154 Trial Group. Control. Clin. Trials 1998; 19: 233–248.
The Coronary Drug Project Research Group, Influence of adherence to treatment and response of cholesterol on mortality in the Coronary Drug Project. N. Engl. J. Med. 1980; 303: 1038–1041.
P. W. Lavori, Clinical trials in psychiatry: should protocol deviation censor patient data (with Discussion). Neuropsychopharm 1992; 6: 39–63.
P. Peduzzi, J. Wittes, K. Detre, and T. Holford, Analysis as-randomized and the problem of non-adherence: An example from the Veterans Affairs Randomized Trial of Coronary Artery Bypass Surgery. Stat. Med. 1993; 15: 1185–1195.
C. Redmond, B. Fisher, and H. S. Wieand, The methodologic dilemma in retrospectively correlating the amount of chemotherapy received in adjuvant therapy protocols with disease-free survival. Cancer Treat. Rep. 1983; 67: 519–526.

Alternative Views

J. H. Ellenberg, Intent-to-treat analysis versus as-treated analysis. Drug Informat. J. 1996; 30: 535–544.
M. Gent and D. L. Sackett, The qualification and disqualification of patients and events in long-term cardiovascular clinical trials. Thrombos. Haemostas. 1979; 41: 123–134.
E. Goetghebeur and T. Loeys, Beyond intention-to-treat. Epidemiol. Rev. 2002; 24: 85–90.
Y. J. Lee, J. H. Ellenberg, D. G. Hirtz, and K. B. Nelson, Analysis of clinical trials by treatment actually received: is it really an option? Stat. Med. 1991; 10: 1595–1605.
R. J. A. Little, Modeling the drop-out mechanism in repeated-measure studies. J. Am. Stat. Assoc. 1995; 90: 1112–1121.
L. B. Sheiner, Is intent-to-treat analysis always (ever) enough? Br. J. Clin. Pharmacol. 2002; 54: 203–211.
CROSS-REFERENCES

Adequate and Well-Controlled Trial
Adherence
Analysis Data Set
Analysis Population
Completer Analysis
Causal Inference
CONSORT
Diabetes Control and Complications Trial
Diabetes Prevention Program
Effectiveness
Evaluable Population
Full Analysis Set
Good Statistics Practice
International Conference on Harmonization (ICH)
Last Observation Carried Forward (LOCF)
Lost-to-Follow-Up
Missing Values
Per Protocol Set Analysis
Protocol Deviations
Protocol Deviators
Protocol Violators
Responder Analysis
Subset
Withdrawal from Study
Withdrawal from Treatment
INTERACTION MODEL
SVEND KREINER, University of Copenhagen, Copenhagen, Denmark

Interaction models for categorical data are loglinear models describing association among categorical variables. They are called interaction models because of the analytic equivalence of loglinear Poisson regression models describing the dependence of a count variable on a set of categorical explanatory variables and loglinear models for contingency tables based on multinomial or product multinomial sampling. The term is, however, somewhat misleading, because the interpretation of parameters from the two types of models is very different. Association models would probably be a better name. Instead of simply referring the discussion of interaction and association models to the section on loglinear models, we will consider these models in terms of the types of problems that one could address in connection with analysis of association. The first problem is a straightforward question of whether or not variables are associated. To answer this question, one must first define association and dissociation in multivariate frameworks and, secondly, define multivariate models in which these definitions are embedded. This eventually leads to a family of so-called graphical models that can be regarded as the basic type of interaction or association model. The second problem concerns the properties of the identified associations. Are associations homogeneous or heterogeneous across levels of other variables? Can the strength of association be measured and in which way? To solve these problems, one must first decide upon a natural measure of association among categorical variables and, secondly, define a parametric structure for the interaction models that encapsulates this measure. Considerations along these lines eventually lead to the family of hierarchical loglinear models for nominal data and models simplifying the basic loglinear terms for ordered categorical data.

1 GRAPHICAL INTERACTION MODELS

What is meant by association between two variables? The most general response to this question is indirect. Two variables are dissociated if they are conditionally independent given the rest of the variables in the multivariate framework in which the two variables are embedded. Association then simply means that the two variables are not dissociated. Association in this sense is, of course, not a very precise statement. It simply means that conditions exist under which the two variables are not independent. Analysis of association will typically have to go beyond the crude question of whether or not association is present, to find out what characterizes the conditional relationship—for instance, whether it exists only under certain conditions, whether it is homogeneous, or whether it is modified by outcomes on some or all the conditioning variables. Despite the inherent vagueness of statements in terms of unqualified association and dissociation, these statements nevertheless define elegant and useful models that may serve as the natural first step for analyses of association in multivariate frames of inference. These so-called graphical models are defined and described in the subsections that follow.
1.1 Definition

A graphical model is defined by a set of assumptions concerning pairwise conditional independence given the rest of the variables of the model. Consider, for instance, a model containing six variables, A to F. The following set of assumptions concerning pairwise conditional independence defines four constraints for the joint distribution Pr(A, B, C, D, E, F). The family of probability distributions satisfying these constraints is a graphical model:

A ⊥ C|BDEF ⇔ Pr(A, C|BDEF) = Pr(A|BDEF) Pr(C|BDEF),
A ⊥ D|BCEF ⇔ Pr(A, D|BCEF) = Pr(A|BCEF) Pr(D|BCEF),
B ⊥ E|ACDF ⇔ Pr(B, E|ACDF) = Pr(B|ACDF) Pr(E|ACDF),
C ⊥ E|ABDF ⇔ Pr(C, E|ABDF) = Pr(C|ABDF) Pr(E|ABDF).
Interaction models defined by conditional independence constraints are called "graphical interaction models", because the structure of these models can be characterized by so-called interaction graphs, where variables are represented by nodes connected by undirected edges if and only if association is permitted between the variables. The graph shown in Figure 1 corresponds to the set of conditional independence constraints above, because there are no edges connecting A to C, A to D, B to E, and C to E.

Figure 1. An interaction graph

Interaction graphs are visual representations of complex probabilistic structures. They are, however, also mathematical models of these structures, in the sense that one can describe and analyze the interaction graphs by concepts and algorithms from mathematical graph theory and thereby infer properties of the probabilistic model. This connection between probability theory and mathematical graph theory is special to the graphical models. The key notion here is conditional independence, as discussed by Dawid (6). While the above definition requires that the set of conditioning variables always includes all the other variables of the model, the results described below imply that conditional independence may sometimes be obtained if one conditions with certain subsets of variables.

Graphical models for multidimensional tables were first discussed by Darroch et al. (5). Since then, the models have been extended both to continuous and mixed categorical and continuous data and to regression and block recursive models. Whittaker (9), Edwards (7), Cox & Wermuth (4), and Lauritzen (8) present different accounts of the theory of graphical models. The sections below summarize some of the main results from this theory.

1.2 The Separation Theorem
The first result connects the concept of graph separation to conditional independence. First, we present a definition: a subset of nodes in an undirected graph separates two specific nodes, A and B, if all paths connecting A and B intersect the subset. In Figure 1, (B, D, F) separates A and C, as does (B, E, F). E and C are separated by both (A, D, F) and (B, D, F). The connection between graph separation and conditional independence is given by the following result, sometimes referred to as the separation theorem.

1.2.1 Separation Theorem. If variables A and B are conditionally independent given the rest of the variables of a multivariate model, A and B will be conditionally independent given any subset of variables separating A and B in the interaction graph of the model.

The four assumptions on pairwise conditional independence defining the model shown in Figure 1 generate six minimal separation hypotheses:

A ⊥ C|BDF,
A ⊥ C|BEF,
A ⊥ D|BEF,
B ⊥ E|ADF,
C ⊥ E|ADF,
C ⊥ E|BDF.
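These statements can be verified mechanically from the graph. The sketch below uses the networkx package (an assumed dependency, not part of the original article) to build the Figure 1 graph and test whether deleting a candidate separating set leaves any path between two nodes.

import networkx as nx

# Figure 1 graph: all pairs among A-F except the four missing edges.
nodes = list("ABCDEF")
missing = {frozenset(p) for p in [("A", "C"), ("A", "D"), ("B", "E"), ("C", "E")]}
G = nx.Graph()
G.add_edges_from(
    (u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]
    if frozenset((u, v)) not in missing
)

def separates(sep, a, b, graph=G):
    """True if removing the nodes in 'sep' leaves no path between a and b."""
    H = graph.copy()
    H.remove_nodes_from(sep)
    return not nx.has_path(H, a, b)

print(separates({"B", "D", "F"}, "A", "C"))   # True
print(separates({"B", "E", "F"}, "A", "C"))   # True
print(separates({"A", "D", "F"}, "C", "E"))   # True
print(separates({"B", "D"}, "A", "C"))        # False: the path A-F-C remains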
1.3 Closure and Marginal Models

It follows from the separation theorem that graphical models are closed under marginalization, in the sense that some of the independence assumptions defining the model transfer to marginal models. Collapsing, for instance, over variable C of the model shown in Figure 1 leads to a
graphical model defined by conditional independence of A and D and of B and E, respectively, because the marginal model contains separators for both AD and BE (Figure 2).

Figure 2. An interaction graph obtained by collapsing the model defined by Figure 1 over variable C

1.4 Loglinear Representation of Graphical Models for Categorical Data

No assumptions have been made so far requiring variables to be categorical. If all variables are categorical, however, the results may be considerably strengthened both with respect to the type of model defined by the independence assumptions of graphical models and in terms of the available information on the marginal models. The first published results on graphical models (5) linked graphical models for categorical data to loglinear models: A graphical model for a multidimensional contingency table without structural zeros is loglinear with generators defined by the cliques of the interaction graph.

This result is an immediate consequence of the fact that any model for a multidimensional contingency table has a loglinear expansion. Starting with the saturated model, one removes all loglinear terms containing two variables assumed to be conditionally independent. The loglinear terms remaining after all the terms relating to one or more of the independence assumptions of the model have been deleted define a hierarchical loglinear model with parameters corresponding to each of the completely connected subsets of nodes in the graph. The interaction graph for the model shown in Figure 1 has four cliques, BCDF, ABF, AEF, and DEF, corresponding to a loglinear model defined by one four-factor interaction and three three-factor interactions.
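The cliques, and hence the generators of the loglinear model, can be read off the graph automatically; a short sketch, again using networkx as an assumed dependency:

import networkx as nx

# Figure 1 graph: every edge among A-F except AC, AD, BE, and CE.
G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("A", "E"), ("A", "F"), ("B", "C"), ("B", "D"), ("B", "F"),
    ("C", "D"), ("C", "F"), ("D", "E"), ("D", "F"), ("E", "F"),
])

# Maximal cliques correspond to the generators of the loglinear model.
cliques = sorted((sorted(c) for c in nx.find_cliques(G)), key=lambda c: (-len(c), c))
print(cliques)
# [['B', 'C', 'D', 'F'], ['A', 'B', 'F'], ['A', 'E', 'F'], ['D', 'E', 'F']]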
1.5 Separation and Parametric Collapsibility

While conceptually very simple, graphical models are usually complex in terms of loglinear structure. The problems arising from the complicated parametric structure are, however, to some degree compensated for by the collapsibility properties of the models. Parametric collapsibility refers to the situation in which model terms of a complete model are unchanged when the model is collapsed over one or more variables. Necessary conditions implying parametric collapsibility of loglinear models are described by Agresti [1, p. 151] in terms that translate into the language of graphical models: Suppose the variables of a graphical model of a multidimensional contingency table are divided into three groups. If there are no edges connecting variables of the first group with connected components of the subgraph of variables from the third group, then model terms among variables of the first group are unchanged when the model is collapsed over the third group of variables.
Parametric collapsibility is connected to separation in two different ways. First, parametric collapsibility gives a simple proof of the separation theorem, because a vanishing two-factor term in the complete model also vanishes in the collapsed model if the second group discussed above contains the separators for the two variables. Secondly, separation properties of the interaction graph may be used to identify marginal models permitting analysis of the relationship between two variables. If one first removes the edge between the two variables, A and B, and secondly identifies separators for A and B in the graph, then the model is seen to be parametric collapsible on to the model containing A
and B and the separators with respect to all model terms relating to A and B. The results are illustrated in Figure 3, where the model shown in Figure 3(a) is collapsed on to marginal models for ABCD and CDEF. The separation theorem is illustrated in Figure 3(b). All terms relating to A and B vanish in the complete model. The model satisfies the condition for parametric collapsibility, implying that these parameters also vanish in the collapsed model. The second property for the association between E and F is illustrated in Figure 3(c). C and D separate E and F in the graph from which the EF edge has been removed. It follows, therefore, that E and F cannot be linked to one and the same connected component of the subgraph for the variables over which the table has been collapsed. The model is therefore parametric collapsible on to CDEF with respect to all terms pertaining to E and F.

1.6 Decomposition and Reducibility

Parametric collapsibility defines situations in which inference on certain loglinear terms may be performed in marginal tables because these parameters are unchanged in the marginal tables. Estimates of, and test statistics for, these parameters calculated in the marginal tables will, however, in many cases differ from those obtained from the complete table. Conditions under which calculations give the same results may, however, also be stated in terms of the interaction graphs. An undirected graph is said to be reducible if it partitions into three sets of nodes, X, Y, and Z, such that Y separates the nodes of X from those of Z and the nodes of Y are completely connected. If the interaction graph meets the condition of reducibility, it is said to decompose into two components, X + Y and Y + Z. The situation is illustrated by the graph in Figure 4, which decomposes into two components, ABCD and CDEF. It is easily seen that reducibility as defined above implies parametric collapsibility with respect to the parameters of X and Z, respectively. It can also be shown, however, that likelihood-based estimates and test statistics obtained by analysis of the collapsed tables are exactly the same as those obtained from the complete table.
Figure 3. Collapsing the model given in (a) illustrates the separation theorem for A and B (b), and parametric collapsibility with respect to E and F (c)
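The reducibility condition can also be checked directly. The sketch below constructs a hypothetical graph consistent with the description of Figure 4 (the article does not list its edges): two overlapping complete sets, ABCD and CDEF, sharing the complete separator CD. networkx is again an assumed dependency.

import networkx as nx
from itertools import combinations

G = nx.Graph()
G.add_edges_from(combinations("ABCD", 2))   # hypothetical edges, see lead-in
G.add_edges_from(combinations("CDEF", 2))

def decomposes(graph, Y):
    """Check reducibility: Y completely connected and graph minus Y disconnected."""
    Y = set(Y)
    complete = all(graph.has_edge(u, v) for u, v in combinations(sorted(Y), 2))
    H = graph.copy()
    H.remove_nodes_from(Y)
    parts = [set(c) | Y for c in nx.connected_components(H)]
    return complete and len(parts) > 1, parts

ok, components = decomposes(G, {"C", "D"})
print(ok)          # True
print(components)  # the two components ABCD and CDEF (in some order)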
1.7 Regression Models and Recursive Models

So far, the discussion has focused on models for the joint distribution of variables. The models can, however, without any problems, be extended first to multidimensional regression models describing the conditional distribution of a vector of dependent variables given another vector of explanatory variables and, secondly, to block recursive systems of variables. In the first case, the model will be based on independence assumptions relating to either two dependent variables
or one dependent and one independent variable. In the second case, recursive models have to be formulated as a product of separate regression models for each recursive block, conditionally given variables in all prior blocks. To distinguish between symmetric and asymmetric relationships, edges between variables in different recursive blocks are replaced by arrows in the interaction graph.

Figure 4. An interaction graph of a reducible model

2 PARAMETRIC STRUCTURE: HOMOGENEOUS OR HETEROGENEOUS ASSOCIATION

The limitations of graphical models for contingency tables lie in the way in which they deal with higher-order interactions. The definition of the graphical models implies that higher-order interactions may exist if more than two variables are completely connected. It is therefore obvious that an analysis of association by graphical models can never be anything but the first step of an analysis of association. The graphical model will be useful in identifying associated variables and marginal models where associations may be studied, but sooner or later one will have to address the question of whether or not these associations are homogeneous across levels defined by other variables and, if not, which variables modify the association. The answer to the question of homogeneity of associations depends on the type of measure that one uses to describe or measure associations. For categorical data, the natural measures of association are measures based on the so-called cross product ratios (2). The question therefore reduces to a question of whether or not cross product ratios are constant across different levels of other variables, thus identifying loglinear models as the natural framework within which these problems should be studied.
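As a concrete version of this question, the cross product (odds) ratio can be computed at each level of a conditioning variable and inspected for constancy; in loglinear terms, constancy of these ratios corresponds to the absence of the corresponding three-factor interaction. The counts below are hypothetical and serve only to illustrate the calculation.

import numpy as np

# Hypothetical 2 x 2 x K table: table[k] is the 2 x 2 table at level k of C.
table = np.array([
    [[30, 20], [15, 35]],
    [[60, 40], [28, 72]],
    [[25, 25], [24, 26]],
], dtype=float)

def cross_product_ratio(t):
    """Cross product ratio ad/(bc) of a 2 x 2 table."""
    (a, b), (c, d) = t
    return (a * d) / (b * c)

for k, t in enumerate(table, start=1):
    print(f"level {k}: cross product ratio = {cross_product_ratio(t):.2f}")
# Roughly constant ratios are consistent with a homogeneous-association
# (no three-factor interaction) loglinear model; clearly different ratios
# indicate that C modifies the association.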
3 ORDINAL CATEGORICAL VARIABLES

In the not unusual case of association between ordinal categorical variables, the same types of argument apply against the hierarchical loglinear models as against the graphical models. Loglinear models are basically interaction models for nominal data, and, as such, they will give results that are too crude and too imprecise for ordinal categorical data. The question of whether or not the association between two variables is homogeneous across levels of conditioning variables can, for ordinal variables, be extended to a question of whether or not the association is homogeneous across the different levels of the associated variables. While not abandoning the basic loglinear association structure, the answer to this question depends on the further parameterization of the loglinear terms of the models. We refer to a recent discussion of these problems by Clogg & Shihadeh (3).
4 DISCUSSION
The viewpoint taken here on the formulation of interaction models for categorical data first defines the family of graphical models as the basic type of models for association and interaction structure. Loglinear models are, from this viewpoint, regarded as parametric graphical models, meeting certain assumptions on the nature of associations not directly captured by the basic graphical models. Finally, different types of models for ordinal categorical data represent yet further attempts to meet assumptions relating specifically to the ordinal nature of the variables.
REFERENCES

1. Agresti, A. (1990). Categorical Data Analysis. Wiley, New York.
2. Altham, P. M. E. (1970). The measurement of association of rows and columns for an r × s contingency table, Journal of the Royal Statistical Society, Series B 32, 63–73.
3. Clogg, C. & Shihadeh, E. S. (1994). Statistical Models for Ordinal Variables. Sage, Thousand Oaks.
4. Cox, D. R. & Wermuth, N. (1996). Multivariate Dependencies. Models, Analysis and Interpretation. Chapman & Hall, London.
5. Darroch, J. N., Lauritzen, S. L. & Speed, T. P. (1980). Markov fields and log-linear models for contingency tables, Annals of Statistics 8, 522–539.
6. Dawid, A. P. (1979). Conditional independence in statistical theory, Journal of the Royal Statistical Society, Series B 41, 1–15.
7. Edwards, D. (1995). Introduction to Graphical Modelling. Springer-Verlag, New York.
8. Lauritzen, S. L. (1996). Graphical Models. Oxford University Press, Oxford.
9. Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. Wiley, Chichester.
INTERIM ANALYSES
SUSAN TODD, The University of Reading, Medical and Pharmaceutical Statistics Research Unit, Reading, Berkshire, United Kingdom

1 INTRODUCTION

The term "interim analysis" can, in its broadest sense, be used in relation to any evaluation of data undertaken during an ongoing trial. Whether examination of the data presents ethical and analytical challenges depends on the purpose of the analysis. Some routine monitoring of trial progress, usually blinded to treatment allocation, is often undertaken as part of a phase III trial. This monitoring can range from simple checking of protocol compliance and the accurate completion of record forms to monitoring adverse events in trials of serious conditions so that prompt action can be taken. Such monitoring may be undertaken in conjunction with a data and safety monitoring board (DSMB), established to review the information collected. As no direct comparison of the treatments in terms of their benefit is undertaken at such interim analyses, special methodology is not required. A second purpose of interim analyses is to undertake a sample size review, in which the interim analysis is used to estimate one or more nuisance parameters (for example, σ² in the case of normally distributed data), and this information is used to determine the sample size required for the remainder of the trial. Sample size re-estimation based on the estimation of nuisance parameters, particularly on the variance of normally distributed observations, was proposed by Gould and Shih (1). A review of the methodology is given by Gould (2). A sample size review can be undertaken using data pooled over treatments in order to avoid any breaking of blindness. Simulations reported in the papers cited above show that such a review has a negligible effect on the statistical properties of the overall test, and so this effect is usually ignored.

The alternative, and possibly the most familiar, purpose for interim analyses in a clinical trial is to allow interim assessment of treatment differences. Valid interim comparisons of the treatments and their use in deciding whether to stop a trial will be the main focus of the rest of this article. When using the term interim analysis hereafter, it is assumed that treatment comparisons are being conducted. The traditional approach to conducting phase III clinical trials has been to calculate a single fixed sample size in advance of the study, which depends on specified values of the significance level, power, and the treatment advantage to be detected. Data on all patients are then collected before any formal analyses are performed. Such a framework is logical when observations become available simultaneously, as in an agricultural field trial; but it may be less suitable for medical studies, in which patients are recruited over time and data are available sequentially. Here, results from patients who enter the trial early on are available for analysis while later patients are still being recruited. It is natural to be interested in such results. "However, the handling of treatment comparisons while a trial is still in progress poses some tricky problems in medical ethics, practical organisation and statistical analysis" (see Reference 3, Section 10.1), as discussed below. In methodological terms, the approach presented in this article is known as the frequentist approach and is the most widely used framework for clinical trials. An alternative school of thought, not discussed here but mentioned for completeness, is the Bayesian approach as described by, for example, Spiegelhalter et al. (4).
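For the sample size review described earlier in this introduction, a minimal sketch for normally distributed data is given below: the variance is re-estimated from the pooled (blinded) interim data and the usual fixed sample size formula is re-evaluated. The numbers and names are hypothetical, and the calculation is only indicative of the approach of Gould and Shih (1), not a reproduction of it; scipy is an assumed dependency.

import numpy as np
from scipy.stats import norm

def n_per_arm(sigma, delta, alpha=0.05, power=0.9):
    """Usual fixed sample size per arm for a two-sample comparison of means."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return int(np.ceil(2 * (sigma * (z_a + z_b) / delta) ** 2))

# Blinded interim data pooled over both treatment arms (hypothetical values).
rng = np.random.default_rng(3)
interim = rng.normal(10.0, 4.0, size=120)

sigma_hat = interim.std(ddof=1)   # pooled SD, ignoring treatment labels;
                                  # slightly inflated by any true treatment effect
print(n_per_arm(sigma=sigma_hat, delta=2.0))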
2 OPPORTUNITIES AND DANGERS OF INTERIM ANALYSES

The most compelling reason for monitoring trial data for treatment differences at interim analyses is that, ethically, it is desirable to terminate or modify a trial when evidence has emerged that one treatment is clearly
superior to the other, which is particularly important when life-threatening diseases are being studied. Alternatively, the data from a trial may support the conclusion that the experimental treatment and the control do not differ by some pre-determined clinically relevant magnitude, in which case it is likely to be desirable, both ethically and economically, to stop the study for futility and divert resources elsewhere. Finally, if information in a trial is accruing more slowly than expected, perhaps because the actual event or response rate observed in a trial is less than that anticipated when the trial was planned, then extension of enrollment may be appropriate, until a large enough sample has been recruited. Unfortunately, multiple analyses of accumulating data lead to problems in the interpretation of results (5). The main problem occurs when significance testing is undertaken at the various interim looks. Even if the treatments are actually equally effective, the more often one analyzes the accumulating data, the greater the chance of eventually and wrongly detecting a difference. Armitage et al. (6) were the first to numerically compute the extent to which the type I error probability (the probability of incorrectly declaring the experimental treatment as different from the control) is increased over its intended level if a standard hypothesis test is conducted at that level at each of a series of interim analyses. They studied the simplified setting of a comparison of normal data with known variance and set the significance level (or type I error probability) for each analysis to be 5%. If just one analysis is performed, then the error rate is 5% as planned. If one interim analysis and one final analysis are performed, the error rate rises to 8%. If four interim analyses and a final analysis are undertaken, this figure is 14%. This inflation continues if more interim analyses are performed. In order to make use of the advantages of monitoring the treatment difference, methodology is required to maintain the overall type I error rate at the planned level. The term ‘‘sequential trial’’ is most commonly used to describe a trial that uses this special methodology. Such trials implement pre-specified stopping rules to accurately maintain error rates.
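The figures quoted from Armitage et al. (6) can be checked by a short simulation for normally distributed data with known variance, equal group sizes between looks, and a two-sided 5% test applied at every look (the setup is assumed to mirror the description above, not their exact calculation):

import numpy as np

def overall_type1_error(n_looks, alpha_crit=1.959964, n_sims=200_000, seed=0):
    """Probability of at least one 'significant' two-sided z-test over the looks
    when there is truly no treatment difference."""
    rng = np.random.default_rng(seed)
    # Independent standard normal increments between looks; the cumulative
    # z-statistic at look k is their running sum divided by sqrt(k).
    inc = rng.normal(size=(n_sims, n_looks))
    z = np.cumsum(inc, axis=1) / np.sqrt(np.arange(1, n_looks + 1))
    return np.any(np.abs(z) > alpha_crit, axis=1).mean()

for k in (1, 2, 5):
    print(k, round(overall_type1_error(k), 3))
# Approximately 0.05, 0.08, and 0.14, matching the inflation described above.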
A second problem associated with the conduct of interim analyses that involve treatment comparisons concerns the final analysis of the trial. When data are inspected at interim looks, the analysis appropriate for fixed sample size studies is no longer valid. Quantities such as P-values, point estimates, and confidence intervals are still well defined, but new methods of calculation are required. If a traditional analysis is performed at the end of a trial that stops because the experimental treatment is found better than control, the P-value will be too small (too significant), the point estimate too large, and the confidence interval too narrow. Again, special techniques are required.

3 THE DEVELOPMENT OF TECHNIQUES FOR CONDUCTING INTERIM ANALYSES

It was in the context of quality control experiments that special methods for conducting interim analyses were first implemented. Manufactured items are often inspected one at a time, with a view to accepting or rejecting a batch in terms of its quality. Double sampling or, in the terms of this article, the use of a single interim analysis, was introduced to quality control by Dodge and Romig (7). The Second World War then provided the setting for the development of the sequential probability ratio test by Wald (8) in the United States and by Barnard (9) in the United Kingdom. In this procedure, an interim analysis is conducted after the inspection of every individual item. Sequential medical plans for comparing two sets of binary responses were introduced by Bross (10), and Kilpatrick and Oldham (11) applied the sequential t-test to a comparison of bronchial dilators. By 1960, enough accumulated theory and practice existed for the first edition of Armitage's book on Sequential Medical Trials. During the 1970s, work in this area prescribed designs whereby traditional test statistics, such as the t-statistic or the chi-squared statistic, were monitored after each patient's response was obtained. Examples can be found in the book by Armitage (12). The principal limitation was, obviously, the need to perform an interim analysis so frequently, which was eventually overcome to
allow interim analyses to be performed after groups of patients. Work by Pocock (13) and O'Brien and Fleming (14) allowed inspections after the responses from predefined sets of patients. A more flexible approach leading on from these early repeated significance tests, referred to as the alpha-spending method, was proposed by Lan and DeMets (15) and extended by Kim and DeMets (16). An alternative flexible method, sometimes referred to as the boundaries approach, encompasses a collection of designs based on straight-line boundaries, which builds on work that has steadily accumulated since the 1940s. This approach is discussed by Whitehead (17), and the best known and most widely implemented design within this framework is the triangular test (18).

4 METHODOLOGY FOR INTERIM ANALYSES

In his 1999 paper, Whitehead (19) lists the key ingredients required to conduct a trial incorporating interim analyses for assessing treatment differences:

• A parameter that expresses the advantage of the experimental treatment over control, which is an unknown population characteristic about which hypotheses will be posed and of which estimates will be sought.
• A statistic that expresses the advantage of experimental over control apparent from the sample of data available at an interim analysis, and a second statistic that expresses the amount of information about the treatment difference contained in the sample.
• A stopping rule that determines whether the current interim analysis should be the last, and, if so, whether it is to be concluded that the experimental is better than or worse than control or that no treatment difference has been established.
• A method of analysis, valid for the specific design used, giving a P-value and point and interval estimates at the end of the trial.
The first two ingredients are common to both fixed sample size and sequential studies, but are worth revisiting for completeness. The second two are solutions to the particular problems of preserving error rates and obtaining a valid analysis on completion of a study in which interim analyses have been conducted. Any combination of choices for the four ingredients is possible, but, largely for historical reasons, particular combinations preferred by authors in the field have been extensively developed, incorporated into software, and used in practice. Each of the four ingredients will now be considered in turn.

4.1 The Treatment Effect Parameter

As with a fixed sample size study, the first stage in designing a phase III clinical trial incorporating interim analyses is to establish a primary measure of efficacy. The authority of any clinical trial will be greatly enhanced if a single primary response is specified in the protocol and is subsequently found to show significant benefit of the experimental treatment. The choice of response should depend on such criteria as clinical relevance, familiarity to clinicians, and ease of obtaining accurate measurements. An appropriate choice for the associated parameter measuring treatment difference can then be made, which should depend on such criteria as interpretability, for example, whether a measurement based on a difference or a ratio is more familiar, and precision of the resulting analysis. If the primary response is a continuous measure such as the reduction in blood pressure after 1 month of antihypertensive medication, then the difference in true (unknown) means is of interest. If a binary variable is being considered, such as the occurrence (or not) of deep vein thrombosis following hip replacement, the log-odds ratio may be the parameter of interest. Finally, suppose that in a clinical trial the appropriate response is identified as survival time following treatment for cancer, then a suitable parameter of interest might be the log-hazard ratio.

4.2 Test Statistics for Use at Interim Analyses

At each interim analysis, a sequential test monitors a statistic that summarizes the
current difference between the experimental treatment and control, in terms of the chosen parameter of interest. If the value of this statistic lies outside some specified critical values, the trial is stopped and appropriate conclusions can be drawn. The timing of the interim looks can be defined directly in terms of number of patients, or more flexibly in terms of information. It should be noted that the test statistic measuring treatment difference may increase or decrease between looks, but the statistic measuring information will always increase. Statisticians have developed flexible ways of conducting sequential trials allowing for the number and the timing of interim inspections. Whitehead (17) describes monitoring a statistic measuring treatment difference, known in technical terms as the efficient score, and schedules the interim looks in terms of a second statistic approximately proportional to study sample size, known as observed Fisher's information. Jennison and Turnbull (20) employ a direct estimate of the treatment difference itself as the test statistic of interest and record inspections in terms of a function of its standard error. Other choices are also possible.

4.3 Stopping Rules at Interim Analyses

As highlighted above, a sequential test compares the test statistic measuring treatment difference with appropriate critical values at each interim analysis. These critical values form a stopping rule or boundary for the trial. At any interim analysis in the trial, if the boundary is crossed, the study is stopped and an appropriate conclusion drawn. If the statistic stays within the boundary, then not enough evidence exists to come to a conclusion at present and a further interim look should be taken. It is possible to look after every patient or to have just one or two interim analyses. When interim analyses are performed after groups of patients, the trial may be referred to as a "group sequential trial." A design incorporating inspections after every patient may be referred to as a "fully sequential test." The advantage of looking after every patient is that a trial can be stopped as soon as an additional patient response results in the boundary being crossed. In
contrast, performing just one or two looks reduces the potential for stopping, and hence delays it. However, the logistics of performing interim analyses after groups of subjects are far easier to manage. In practice, planning for between four and eight interim analyses appears sensible. Once it had been established that a problem existed with inflating the type I error when using traditional tests and the usual fixed sample size critical values, designs were suggested that adjusted for this problem. It is the details of the derivation of the stopping rule that introduce much of the variety of sequential methodology. In any trial, the important issues to focus on are the desirable reasons for stopping or continuing a study at an interim analysis. Reasons for stopping may include evidence that the experimental treatment is obviously better than the control, evidence that the experimental treatment is already obviously worse than the control, or alternatively it may be established that little chance exists of showing that the experimental treatment is better than the control. Reasons for continuing may include belief that a moderate advantage of the experimental treatment is likely and it is desired to estimate this magnitude carefully or, alternatively, evidence may exist that the event rate is low and more patients are needed to achieve power. Criteria such as these will determine the type of stopping rule that is appropriate for the study under consideration. Stopping rules are now available for testing superiority, non-inferiority, equivalence, and even safety aspects of clinical trials. Some designs even aim to deal with both efficacy and safety aspects in a combined stopping rule, but these designs are more complex. As an example, consider a clinical trial conducted by the Medical Research Council Renal Cancer Collaborators between 1992 and 1997 (21). Patients with metastatic renal carcinoma were randomly assigned to treatment with either the hormone therapy, oral medroxyprogesterone acetate (MPA), or the biological therapy, interferon-α. The primary endpoint in the study was survival time and the treatment difference was measured by the log-hazard ratio. It was decided that if a difference in 2-year survival from 20%
on MPA to 32% on interferon-α (log-hazard ratio −0.342) was present, then a significant treatment difference at the two-sided 5% significance level should be detected with 90% power. The use of interferon-α was experimental and this treatment is known to be both costly and toxic. Consequently, its benefits over MPA needed to be substantial to justify its wider use. A stopping rule was required to satisfy the following two requirements: early stopping if data showed a clear advantage of interferon-α over MPA and early stopping if data showed no worthwhile advantage of interferon-α, which suggested use of an asymmetric stopping rule. Such a rule handles both of these aspects. The design chosen was the triangular test (17), similar in appearance to the stopping rule in Fig. 1. Interim analyses were planned every 6 months from the start of the trial. The precise form of the stopping rule was defined, as is the sample size in a fixed sample size trial, by consideration of the significance level, power, and desired treatment advantage, with reference to the primary endpoint. 4.4 Analysis following a Sequential Trial Once a sequential trial has stopped, a final analysis should be performed. The interim analyses determine only whether stopping should take place; they do not provide a complete interpretation of the data. An appropriate final analysis must take account of the fact that a sequential design was used. Unfortunately, many trials that have been terminated at an interim analysis are finally reported with analyses that take no statistical account of the inspections made (22). In a sequential trial, the meaning and interpretation of data summaries such as significance levels, point estimates, and confidence intervals remain as for fixed sample size trials. However, various alternative valid methods of calculation have been proposed. These methods can sometimes lead to slightly different results when applied to the same set of data. The user of a computer package may accept the convention of the package and use the resulting analysis without being concerned about the details of calculation. Those who wish to develop a deeper understanding
of statistical analysis following a sequential trial are referred to Chapter 5 of Whitehead (17) and Chapter 8 of Jennison and Turnbull (23). The generally accepted method is based on orderings methodology, whereby potential outcomes in a trial are ordered by degree of support for the alternative hypothesis. The original, and most successful, form of ordering was introduced by Fairbanks and Madsen (24) and explored further by Tsiatis et al. (25). 5 AN EXAMPLE: STATISTICS FOR LAMIVUDINE The effectiveness of lamivudine for preventing disease progression among patients with chronic hepatitis B and advanced fibrosis or cirrhosis is unknown and prompted the conduct of a large survival study (26). An efficient trial methodology for reaching a reliable conclusion with as few subjects as possible was required. The annual rate of disease progression was assumed to be 20% for placebo and a reduction of one-third (to 13.3%) for the lamivudine group was taken as a clinically relevant treatment effect. It was desired to detect this with power 0.9. A significance level of 0.05 was specified. When the objectives of the trial were considered in detail, interim analyses using an appropriate stopping rule were planned. The methodology followed was the boundaries approach, as discussed by Whitehead (17), and the triangular test was selected as an appropriate design. Clinically compensated chronic hepatitis B patients with histologically confirmed fibrosis or cirrhosis were randomized 2:1 to receive lamivudine (100 mg/day) or placebo for up to 5 years. Overall, 651 patients were randomized at 41 sites in nine Asia-Pacific countries, 436 to lamivudine and 215 to placebo. These people were then followed up for evidence of disease progression. An independent data and safety monitoring board was established to study the progress of the trial at interim analyses. At the first interim analysis, 36 patients had experienced disease progression. The first point plotted on Fig. 1 (x) represents those data. The statistic Z signifies the advantage seen so far on lamivudine and is the efficient score for the log-hazard ratio.
Figure 1. Sequential plot for the trial of lamivudine. Z: Efficient score statistic measuring the advantage of lamivudine over placebo. V: Statistic measuring the information on which the comparison is based.
Here, Z is calculated adjusting for five covariates: center, gender, baseline fibrosis staging score, baseline Childs-Pugh score, and baseline ALT values. An unadjusted Z would be the usual log-rank statistic for survival data. The statistic V measures the information on which the comparison is based, which is the variance of Z and is approximately one-quarter the number of events. The inner dotted boundaries, known as the Christmas tree correction for discrete looks, form the stopping boundary: once this boundary is reached, the trial is complete. Crossing the upper boundary results in a positive trial conclusion. At the second interim analysis, a total of 67 patients had experienced disease progression. The upper boundary was reached and the trial was stopped. When the results from data that had accumulated between the second interim analysis and confirmation of stopping were added, the final ''overrunning'' point was added to the plot. A final overrunning analysis of the data was conducted using the computer package PEST. The P-value obtained was 0.0008, a highly significant result in favor of lamivudine. By using a series of interim looks, the design allowed a strong positive conclusion to be drawn after only 71 events had been observed.
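To make the quantities plotted in Fig. 1 concrete, the following minimal sketch (in Python; not the trial's actual PEST analysis, which also adjusted for covariates) computes the unadjusted efficient score Z and the information V from raw survival data. The function name and the simulated data are illustrative assumptions.

```python
import numpy as np

def logrank_score_information(time, event, group):
    """Unadjusted efficient score Z (log-rank score) and information V.
    group is 1 for the experimental arm, 0 for control; event is 1 if observed."""
    time, event, group = map(np.asarray, (time, event, group))
    z = v = 0.0
    for t in np.unique(time[event == 1]):           # distinct event times
        at_risk = time >= t
        n = at_risk.sum()
        n1 = (at_risk & (group == 1)).sum()         # at risk on the experimental arm
        d = ((time == t) & (event == 1)).sum()      # events at time t
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        z += d * n1 / n - d1                        # expected minus observed: positive
        if n > 1:                                   # values favor the experimental arm
            v += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return z, v

# Simulated data with no treatment effect: Z fluctuates around 0 and V is roughly
# one quarter of the number of events under 1:1 allocation, as noted above.
rng = np.random.default_rng(1)
time = rng.exponential(1.0, 200)
event = np.ones(200, dtype=int)
group = np.repeat([0, 1], 100)
print(logrank_score_information(time, event, group))
```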
6 INTERIM ANALYSES IN PRACTICE
Interim analyses are frequently being implemented in clinical trials. Early examples can
be found in the proceedings of two workshops, one on practical issues in data monitoring sponsored by the U.S. National Institutes of Health (NIH) held in 1992 (27) and the other on early stopping rules in cancer clinical trials held at Cambridge University in 1993 (28). The medical literature also demonstrates the use of interim analyses. Examples of such studies include trials of lithium gamolenate in pancreatic cancer (29), of sildenafil (Viagra) in erectile dysfunction (30), and of implanted defibrillators in coronary heart disease (31), as well as the example detailed in Section 5 above and many others. Two books dealing exclusively with the implementation of sequential methods in clinical trials are those by Whitehead (17) and Jennison and Turnbull (23). In addition, three major commercial software packages are currently available. The package PEST (32) is based on the boundaries approach. The package EaSt (33) implements the boundaries of Wang and Tsiatis (34) and Pampallona and Tsiatis (35). The S+SeqTrial module (36) is an add-on to the S-Plus package. If a trial is to compare two treatments with respect to a single primary endpoint, with the objective of discovering whether one treatment is superior, non-inferior, or equivalent to the other, then it is extremely likely that a suitable sequential method exists. Infeasibility and unfamiliarity are therefore no longer valid reasons for avoiding interim analyses and stopping rules in such trials, if ethical or economic purposes would be served by them.
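As an illustration of the kind of calculation such packages automate, the sketch below derives one-sided efficacy boundaries for a group sequential design from an alpha-spending function by recursive numerical integration. It is a generic sketch of the spending approach, not the algorithm of PEST, EaSt, or S+SeqTrial; the function names, the grid settings, and the O'Brien-Fleming-type spending function are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

def of_spending(t, alpha=0.025):
    """Lan-DeMets O'Brien-Fleming-type spending function (one-sided alpha)."""
    t = np.asarray(t, dtype=float)
    return 2.0 * (1.0 - norm.cdf(norm.ppf(1.0 - alpha / 2.0) / np.sqrt(t)))

def gs_efficacy_bounds(info_frac, alpha=0.025, spend=of_spending,
                       lo=-10.0, hi=10.0, n_grid=2001):
    """One-sided efficacy boundaries (on the Z scale) at each interim look,
    found by recursive numerical integration over the score statistic S_k,
    which under H0 is N(0, t_k) with independent increments."""
    t = np.asarray(info_frac, dtype=float)
    cum = spend(t, alpha)                        # cumulative alpha to be spent
    s = np.linspace(lo, hi, n_grid)
    h = s[1] - s[0]
    b = np.sqrt(t[0]) * norm.ppf(1.0 - cum[0])   # first look: P(S_1 > b) = cum[0]
    bounds = [b / np.sqrt(t[0])]
    f = norm.pdf(s, scale=np.sqrt(t[0]))         # sub-density of S_1 ...
    f[s > b] = 0.0                               # ... restricted to "not yet stopped"
    for k in range(1, len(t)):
        delta = t[k] - t[k - 1]
        kernel = norm.pdf(s[:, None] - s[None, :], scale=np.sqrt(delta))
        f = kernel @ f * h                       # sub-density of S_k, no earlier stop
        tail = np.cumsum(f[::-1])[::-1] * h      # P(S_k > s_i and no earlier stop)
        b = np.interp(cum[k] - cum[k - 1], tail[::-1], s[::-1])
        bounds.append(b / np.sqrt(t[k]))
        f[s > b] = 0.0
    return np.array(bounds)

print(gs_efficacy_bounds([0.25, 0.5, 0.75, 1.0]))   # roughly [4.33, 2.96, 2.36, 2.01]
```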
When planning to include interim analyses in any clinical trial, the implications of introducing a stopping rule need to be thought out carefully in advance of the study. In addition, all parties involved in the trial should be consulted on the choice of clinically relevant difference, specification of an appropriate power requirement, and the selection of a suitable stopping rule. As part of the protocol for the study, the operation of any sequential procedure should be clearly described in the statistical section. Decision making as part of a sequential trial is both important and time sensitive. A decision taken to stop a study not only affects the current trial, but also is likely to affect future trials planned in the same therapeutic area. However, continuing a trial too long puts participants at unnecessary risk and delays the dissemination of important information. It is essential to make important ethical and scientific decisions with confidence. Wondering whether the data supporting interim analyses are accurate and up-to-date is unsettling and makes the decision process harder. It is, therefore, necessary for the statistician performing the interim analyses to have timely and accurate data. Unfortunately, a trade-off exists: it takes time to ensure accuracy. Potential problems can be alleviated if data for interim analyses are reported separately from the other trial data, as part of a fast-track system. A smaller volume of data can be validated more quickly. If timeliness and accuracy are not in balance, not only may real-time decisions be made on old data, but more seriously, differential reporting may lead to inappropriate study conclusions. If a DSMB is appointed, one of its roles should be to scrutinize any proposed sequential stopping rule before the start of the study and to review the protocol in collaboration with the trial Steering Committee. The procedure for undertaking the interim analyses should also be finalized in advance of the trial start-up. The DSMB would then review results of the interim analyses as they are reported. Membership of the DSMB and its relationship with other parties in a clinical trial have been considered in Reference 27 and by Ellenberg et al. (37). It is important that the interim results of an ongoing trial are not
circulated widely as it may have an undesirable effect on the future progress of the trial. Investigators' attitudes will clearly be affected by whether a treatment looks good or bad as the trial progresses. It is usual for the DSMB to be supplied with full information and, ideally, the only other individual to have knowledge of the treatment comparison would be the statistician who performs the actual analyses.
7 CONCLUSIONS
Use of interim analyses in phase III clinical trials is not new, but it is probably the more recent theoretical developments, together with availability of software, which have precipitated their wider use. The methodology is flexible as it enables choice of a stopping rule from a number of alternatives, allowing the trial design to meet the study objectives. One important point is that a stopping rule should not govern the trial completely. If external circumstances change the appropriateness of the trial or assumptions made when choosing the design are suspected to be false, it can and should be overridden, with the reasons for doing so carefully documented. Methodology for conducting a phase III clinical trial sequentially has been extensively developed, evaluated, and documented. Error rates can be accurately preserved and valid inferences drawn. It is important that this fact is recognized and that individuals contemplating the use of interim analyses conduct them correctly. Regulatory authorities, such as the Food and Drug Administration (FDA), do not look favorably on evidence from trials incorporating unplanned looks at data. In the United States, the FDA (38) published regulations for NDAs, which included the requirement that the analysis of a phase III trial ‘‘assess . . . the effects of any interim analyses performed.’’ The FDA guidelines were updated by publication of ‘‘ICH Topic E9: Notes for Guidance on Statistical Principles for Clinical Trials’’ (39). Section 3 of this document discusses group sequential designs and Section 4 covers trial conduct including trial monitoring, interim analysis, early stopping, sample size adjustment, and the role of an independent DSMB. With such
acknowledgement from regulatory authorities, interim analyses are likely to become even more commonplace. REFERENCES 1. A. L. Gould and W. J. Shih, Sample size re-estimation without unblinding for normally distributed data with unknown variance. Commun. Stat. – Theory Meth. 1992; 21: 2833–3853. 2. A. L. Gould, Planning and revising the sample size for a trial. Stat. Med. 1995; 14: 1039–1051. 3. S. J. Pocock, Clinical Trials: A Practical Approach. New York: Wiley, 1983. 4. D. J. Spiegelhalter, K. R. Abrams, and J. P. Myles, Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Chichester: Wiley, 2004. 5. K. McPherson, Statistics: the problem of examining accumulating data more than once. N. Engl. J. Med. 1974; 28: 501–502. 6. P. Armitage, C. K. McPherson, and B. C. Rowe, Repeated significance tests on accumulating data. J. Royal Stat. Soc. Series A 1969; 132: 235–244. 7. H. F. Dodge and H. C. Romig, A method of sampling inspection. Bell Syst. Tech. J. 1929; 8: 613–631. 8. A. Wald, Sequential Analysis. New York: Wiley, 1947. 9. G. A. Barnard, Sequential test in industrial statistics. J. Royal Stat. Soc. 1946;(Suppl 8): S1–S26. 10. I. Bross, Sequential medical plans. Biometrics 1952; 8: 188–205. 11. G. S. Kilpatrick and P. D. Oldham, Calcium chloride and adrenaline as bronchial dilators compared by sequential analysis. Brit. Med. J. 1954; ii: 1388–1391. 12. P. Armitage, Sequential Medical Trials, 2nd ed. Oxford: Blackwell, 1975. 13. S. J. Pocock, Group sequential methods in the design and analysis of clinical trials. Biometrika 1977; 64: 191–199. 14. P. C. O’Brien and T. R. Fleming, A multiple testing procedure for clinical trials. Biometrics 1979; 35: 549–556. 15. K. K. G. Lan and D. L. DeMets, Discrete sequential boundaries for clinical trials. Biometrika 1983; 70: 659–663. 16. K. Kim and D. L. DeMets, Design and analysis of group sequential tests based on the type I
error spending rate function. Biometrika 1987; 74: 149–154. 17. J. Whitehead, The Design and Analysis of Sequential Clinical Trials, rev. 2nd ed. Chichester: Wiley, 1997. 18. J. Whitehead, Use of the triangular test in sequential clinical trials. In: J. Crowley (ed.), Handbook of Statistics in Clinical Oncology. New York: Dekker, 2001, pp. 211–228. 19. J. Whitehead, A unified theory for sequential clinical trials. Stat. Med. 1999; 18: 2271–2286 20. C. Jennison and B. W. Turnbull, Group sequential analysis incorporating covariate information. J. Amer. Stat. Assoc. 1997; 92: 1330–1341. 21. Medical Research Council Renal Cancer Collaborators, Interferon-α and survival in metastatic renal carcinoma: early results of a randomised controlled trial. Lancet 1999; 353: 14–17. 22. K. M. Facey and J. A. Lewis, The management of interim analyses in drug development. Stat. Med. 1998; 17: 1801–1809. 23. C. Jennison and B. W. Turnbull, Group Sequential Methods with Applications to Clinical Trials. Boca Raton, FL: Chapman and Hall/CRC, 2000. 24. K. Fairbanks and R. Madsen, P values for tests using a repeated significance test design. Biometrika 1982; 69: 69–74. 25. A. A. Tsiatis, G. L. Rosner, and C. R. Mehta, Exact confidence intervals following a group sequential test. Biometrics 1984; 40: 797–803. 26. Y. F. Liaw et al. on behalf of the CALM study group, Effects of lamivudine on disease progression and development of hepatocellular carcinoma in advanced chronic hepatitis B: a prospective placebo-controlled clinical trial. N. Engl. J. Med. 2004; 351: 1521–1531. 27. S. Ellenberg, N. Geller, R. Simon, and S. Yusuf (eds.), Proceedings of ‘Practical issues in data monitoring of clinical trials’. Stat. Med. 1993; 12: 415–616. 28. R. L. Souhami and J. Whitehead (eds.), Workshop on early stopping rules in cancer clinical trials. Stat. Med. 1994; 13: 1289–1499. 29. C. D. Johnson et al., Randomized, dose-finding phase III study of lithium gamolenate in patients with advanced pancreatic adenocarcinoma. Brit. J. Surg. 2001; 88: 662–668. 30. F. A. Derry et al., Efficacy and safety of oral sildenafil (viagra) in men with erectile dysfunction caused by spinal cord injury. Neurology 1998; 51: 1629–1633.
31. A. J. Moss et al., Improved survival with implanted defibrillator in patients with coronary disease at high risk of ventricular arrhythmia. N. Engl. J. Med. 1996; 335: 1933–1940. 32. MPS Research Unit, PEST 4: Operating Manual. Reading, UK: The University of Reading, 2000. 33. Cytel Software Corporation, EaSt 2000: A Software Package for the Design and Interim Monitoring of Group-Sequential Clinical Trials. Cambridge, MA: Cytel Software Corporation, 2000. 34. S. K. Wang and A. A. Tsiatis, Approximately optimal one-parameter boundaries for group sequential trials. Biometrics 1987; 43: 193–199. 35. S. Pampallona and A. A. Tsiatis, Group sequential designs for one-sided and two-sided hypothesis testing with provision for early stopping in favor of the null hypothesis. J. Stat. Plan. Inference 1994; 42: 19–35.
36. MathSoft Inc., S-Plus 2000. Seattle, WA: MathSoft Inc., 2000. 37. S. Ellenberg, T. R. Fleming, and D. L. DeMets, Data Monitoring Committees in Clinical Trials: A Practical Perspective. Chichester: Wiley, 2002. 38. US Food and Drug Administration Guideline for the Format and Content of the Clinical and Statistical Sections of an Application. Rockville, MD: FDA. http://www.fda.gov/cder/guidance/statnda.pdf 1988. 39. International Conference on Harmonisation Statistical Principles for Clinical Trials, Guideline E9. http://www.ich. org/LOB/media/MEDIA485.pdf 1998.
INTERIM CLINICAL TRIAL/STUDY REPORT An Interim Clinical Trial/Study Report is a report of intermediate results and their evaluation based on analyses performed during the course of a trial.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
INTERNATIONAL CONFERENCE ON HARMONISATION OF TECHNICAL REQUIREMENTS FOR REGISTRATION OF PHARMACEUTICALS FOR HUMAN USE (ICH) The International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) is a cooperative effort between the drug regulatory authorities and innovative pharmaceutical company professional organizations of the European Union, Japan, and the United States to reduce the need to duplicate the testing conducted during the research and development of new medicines. Through harmonization of the technical guidelines and requirements under which drugs for human use are approved within the participating nations, ICH members seek more economical use of human, animal, and material resources and the elimination of delay in availability of new drugs, while maintaining the quality, safety, and effectiveness of regulated medicines.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/handbook/) by Ralph D’Agostino and Sarah Karl.
International Studies of Infarct Survival (ISIS) The ISIS began in 1981 as a collaborative worldwide effort to evaluate the effects of several widely available and practical treatments for acute myocardial infarction (MI). The ISIS Collaborative Group randomized more than 130 000 patients into four large simple trials assessing the independent and synergistic effects of beta-blockers, thrombolytics, aspirin, heparin, converting enzyme inhibitors, oral nitrates, and magnesium in the treatment of evolving acute myocardial infarction (Table 1). More than 20 countries participated in these trials, which were coordinated worldwide by investigators in Oxford, England.
ISIS-1: Atenolol in Acute MI [1] Beta-blocking agents reduce the heart rate and blood pressure, as well as their product, inhibit the effects of catecholamines, and increase thresholds for ventricular fibrillation. Thus, it is not surprising that beta-blockers were among the first agents to be evaluated in randomized trials of evolving acute MI. Even by 1981, the available trials of beta-blocking agents for acute infarction were too small to demonstrate a significant benefit. However, based on an overview of the available evidence (see Meta-analysis of Clinical Trials), it was judged that the prevention of even one death per 200 patients treated with beta-blockers (see Number Needed to Treat (NNT)) would represent a worthwhile addition to usual care. Unfortunately, detecting such an effect would require the randomization of over 15 000 patients. It was toward this end that the First International Study of Infarct Survival (ISIS-1) trial was formed.
In a collaborative effort involving 245 coronary care units in 11 countries, the ISIS-1 trial randomized 16 027 patients with suspected acute MI to a regimen of intravenous atenolol versus no betablocker therapy. Patients assigned to active treatment received an immediate intravenous injection of 5–10 mg atenolol, followed by 100 mg/day orally for seven days. Similar agents were avoided in those assigned at random to no beta-blocker therapy unless it was believed to be medically indicated. As in the subsequent ISIS collaborations, all other treatment decisions were at the discretion of the responsible physician. During the seven-day treatment period in which atenolol was given, vascular mortality was significantly lower in the treated group (3.89% vs. 4.57%, P < 0.04), representing a 15% mortality reduction. Almost all of the apparent benefit was observed in days 0 to 1 during which time there were 121 deaths in the atenolol group as compared with 171 deaths in the control group. The early mortality benefit attributable to atenolol was maintained at 14 days and at the end of one year follow-up (10.7% atenolol vs. 12.0% control). Treatment did not appear to decrease infarct size substantially, although the ability of a large and simple trial such as ISIS-1 to assess such a reduction was limited. Despite its large size, the 95% confidence limits of the risk reductions associated with atenolol in ISIS-1 were wide and included relative risk reductions between 1% and 25%. However, an overview that included ISIS1 and 27 smaller completed trials of beta-blockade suggested a similar sized mortality reduction (14%). When a combined endpoint of mortality, nonfatal cardiac arrest and nonfatal reinfarction was considered from all available trials, the 10%–15% reduction persisted with far narrower confidence limits. Taken together, these data suggest that early treatment of 200 acute MI patients with beta-blocker therapy
Table 1. The International Studies of Infarct Survival (ISIS)

Trial    Year completed    Agents studied                                                                Patients randomized
ISIS-1   1985              Atenolol vs. control                                                          16 027
ISIS-2   1988              Streptokinase vs. placebo; aspirin vs. placebo                                17 187
ISIS-3   1991              Streptokinase vs. tPA vs. APSAC; aspirin + SC heparin vs. aspirin             41 299
ISIS-4   1993              Captopril vs. placebo; oral mononitrate vs. placebo; magnesium vs. control    58 050
would lead to avoidance of one reinfarction, one cardiac arrest, and one death during the initial seven-day period. Unfortunately, beta-blocker use in the setting of acute MI remains suboptimal, with utilization rates ranging from about 30% in the US to less than 5% in the UK. This underutilization appears related in part to poor physician education. In the GUSTO-1 trial, beta-blockers were encouraged by the study protocol and almost 50% of all patients received the drugs without any apparent increase in adverse effects.
ISIS-2: Streptokinase and Aspirin in Acute MI [2] As with beta-blockers, data from randomized trials of thrombolytic therapy completed prior to 1985 did not yield truly reliable results. Indeed, the largest of the early studies enrolled 750 patients, a totally inadequate sample size to detect the most plausible 20%–25% reduction in mortality. Given this situation, the Second International Study of Infarct Survival (ISIS-2) was designed to test directly in a single randomized, double-blind, placebo-controlled trial (see Blinding or Masking) the risks and benefits of streptokinase and aspirin in acute MI. To accomplish this goal, the ISIS-2 collaborative group randomized 17 187 patients presenting within 24 hours of symptom onset using a 2 × 2 factorial design to one of four treatment groups: 1.5 million units of intravenous streptokinase over 60 minutes; 162.5 mg/day of oral aspirin for 30 days; both active treatments; or neither. In brief, the primary endpoint (see Outcome Measures in Clinical Trials) of the trial, total vascular mortality, was reduced 25% by streptokinase alone (95% CI, −32% to −18%, P < 0.0001) and 23% by aspirin alone (95% CI, −30% to −15%, P < 0.00001). Patients allocated to both agents had a 42% reduction in vascular mortality (95% CI, −50% to −34%, P < 0.00001), indicating that the effects of streptokinase and aspirin are largely additive. When treatment was initiated within six hours of the onset of symptoms, the reduction in total vascular mortality was 30% for streptokinase, 23% for aspirin, and 53% for both active agents. For aspirin, the mortality benefit was similar when the drug was started 0–4 hours (25%), 5–12 hours (21%), or 13–24 hours (21%) after the onset of
clinical symptoms. Aspirin use also resulted in highly significant reductions for nonfatal reinfarction (49%) and nonfatal stroke (46%). As regards side-effects, for bleeds requiring transfusion, there was no significant difference between the aspirin and placebo groups (0.4% vs. 0.4%), although there was a small absolute increase of minor bleeds among those allocated to aspirin (0.6%, P < 0.01). For cerebral hemorrhage, there was no difference between the aspirin and placebo groups. For streptokinase, those randomized within four hours of pain onset experienced the greatest mortality reduction, although statistically significant benefits were present for patients randomized throughout the 24 hour period. As expected, there was an excess of confirmed cerebral hemorrhage with streptokinase (7 events vs. 0; 2P < 0.02), all of which occurred within one day of randomization. Reinfarction was slightly more common among those assigned streptokinase alone, but this difference was not statistically significant. Furthermore, aspirin abolished the excess reinfarction attributable to streptokinase. In addition to demonstrating the independent as well as synergistic effects of streptokinase and aspirin, ISIS-2 also supplied important information concerning which patients to treat. Because the ISIS-2 entry criteria were broad, the trial included the elderly, patients with left bundle branch block, and those with inferior as well as anterior infarctions. In each of these subgroups, clear mortality reductions were demonstrated. Thus, in addition to changing radically the premise that thrombolysis should be avoided in patients already on aspirin, the ISIS-2 trial was largely responsible for widening the eligibility criteria for patients who would benefit from thrombolytic therapy.
ISIS-3: Streptokinase vs. APSAC vs. tPA and Subcutaneous Heparin vs. No Heparin in Acute MI [3] While ISIS-2 (streptokinase), the first Gruppo Italiano per lo Studio della Sopravvivenza nell'Infarto miocardico trial (GISSI-1, streptokinase), the APSAC Intervention Mortality Study (AIMS, anisoylated plasminogen streptokinase activator complex [APSAC]), and the Anglo–Scandinavian Study of Early Thrombolysis (ASSET, tissue plasminogen activator [tPA]) all documented clear mortality benefits for
thrombolysis, they did not provide information that allowed for directly comparing these agents. It was also unclear whether patients given aspirin would further benefit from the addition of heparin. These questions were the focus of the Third International Study of Infarct Survival (ISIS-3). In brief, the ISIS-3 collaborative group randomized 41 299 patients to streptokinase, APSAC, and tPA. Patients presenting within 24 hours of the onset of evolving acute MI and with no clear contraindication to thrombolysis were assigned randomly to IV streptokinase (1.5 MU over one hour), IV tPA (duteplase, 0.50 million U/kg over four hours), or IV APSAC (30 U over three minutes). All patients received daily aspirin (162.5 mg), with the first dose crushed or chewed in order to achieve a rapid clinical antithrombotic effect. In addition, half were randomly assigned to receive subcutaneous heparin (12 500 IU twice daily for seven days), beginning four hours after randomization. ISIS-3 demonstrated no differences in mortality between the three thrombolytic agents. Specifically, among the 13 780 patients randomized to streptokinase, there were 1455 deaths (10.6%) within the initial 35-day follow-up period as compared with 1448 deaths (10.5%) among the 13 773 patients randomized to APSAC and 1418 deaths (10.3%) among the 13 746 randomized to tPA. Long-term survival was also virtually identical for the three agents at both three and six months. With regard to in-hospital clinical events, cardiac rupture, cardiogenic shock, heart failure requiring treatment, and ventricular fibrillation were similar for the three agents. For nonfatal reinfarction, there was a reduction with tPA; streptokinase- and APSAC-allocated patients had higher rates of allergy and hypotension requiring treatment. Streptokinase produced fewer noncerebral bleeds than either APSAC or tPA. While there were no major differences between thrombolytic agents in terms of lives saved or serious in-hospital clinical events, significant differences were found in ISIS-3 for rates of total stroke and cerebral hemorrhage. Specifically, there were 141 total strokes in the streptokinase group as compared with 172 and 188 in the APSAC and tPA groups, respectively. For cerebral hemorrhage, there were 32 events (two per 1000) in the streptokinase group as compared with 75 (five per 1000) in the APSAC group and 89 (seven per 1000) in the tPA group. While
the absolute rates of cerebral hemorrhage for all three agents were low, this apparent advantage for streptokinase was highly statistically significant (P < 0.0001 for streptokinase vs. APSAC, P < 0.00001 for streptokinase vs. tPA). With regard to the addition of delayed subcutaneous heparin to thrombolytics, there was no reduction in the prespecified endpoint of 35-day mortality. During the scheduled seven-day period of heparin use, there were slightly fewer deaths in the aspirin plus heparin group compared with the aspirin group alone, a difference of borderline significance. There was, however, a small but significant excess of strokes deemed definite or probable cerebral hemorrhages among those allocated aspirin plus heparin (0.56% vs. 0.40%, P < 0.05). In contrast, reinfarction was more common among those randomized to aspirin alone as compared with those receiving aspirin plus subcutaneous heparin.
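The quoted significance levels can be checked approximately from the counts above with a simple two-proportion z-test; the sketch below is an illustrative recalculation, not the trial's published analysis.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions (pooled SE)."""
    p1, p2, p = x1 / n1, x2 / n2, (x1 + x2) / (n1 + n2)
    z = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return z, 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Cerebral hemorrhage: tPA and APSAC each compared with streptokinase
print(two_prop_z(89, 13746, 32, 13780))   # P well below 0.00001
print(two_prop_z(75, 13773, 32, 13780))   # P below 0.0001
```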
ISIS-4: Angiotensin Converting Enzyme Inhibition, Nitrate Therapy, and Magnesium in Acute MI [4] In 1991 the ISIS collaboration chose to investigate several other promising but unproven approaches to the treatment of acute MI. Specifically, the Fourth International Study of Infarct Survival (ISIS-4) sought to examine treatment strategies that would benefit both high- and low-risk patients presenting with acute MI, not simply those who are eligible for thrombolysis. To attain this goal, the ISIS collaborative group chose to study three promising agents: a twice daily dose of the angiotensin converting enzyme (ACE) inhibitor captopril for 30 days, a once daily dose of controlled release mononitrate for 30 days, and a 24-hour infusion of intravenous magnesium. As was true in each of the preceding ISIS trials, the available data were far too limited to allow reliable clinical recommendations concerning these therapies. For example, while ACE inhibiting agents had been shown to be successful in reducing mortality in patients with congestive heart failure and in patients a week or two past acute infarction, it was unclear whether these agents provided a net benefit for all patients in the setting of evolving acute MI. Similarly, while nitrates were often used in evolving MI because of their ability to reduce myocardial
afterload and potentially limit infarct size, barely 3000 patients had received intravenous nitroglycerin in randomized trials and even fewer patients had been studied on oral nitrate preparations. Finally, because of its effects on calcium regulation, arrhythmia thresholds, and tissue preservation, magnesium therapy had often been considered as an adjunctive therapy for acute infarction even though no data from a randomized trial of even modest size had been available. Based on statistical overviews, the ISIS investigators estimated that each of these therapies had the potential to reduce mortality in acute infarction by as much as 15%–20%. However, because many patients presenting with acute infarction were treated with thrombolytic therapy and aspirin, mortality rates at one month were expected to be as low as 7%–8%. Thus, reliably assessing whether these potentially important clinical effects were real required the randomization of a very large number of patients, perhaps as many as 60 000. To achieve this goal, a 2 × 2 × 2 factorial design was employed in which patients were randomized first to captopril or captopril placebo, then to mononitrate or mononitrate placebo, and then to magnesium or magnesium control. Thus, it was possible in the trial for any given patient to receive all three active agents, no active agents, or any combination.
Captopril Use of the ACE inhibitor captopril was associated with a significant 7% decrease in five-week mortality (2088 [7.19%] deaths among patients assigned to captopril vs. 2231 [7.69%] deaths among those assigned to placebo), which corresponds to an absolute difference of 4.9 ± 2.2 fewer deaths per 1000 patients treated for one month. The absolute benefits appeared to be larger (possibly as high as 10 fewer deaths per 1000) in some higher-risk groups, such as those presenting with heart failure or a history of MI. The survival advantage appeared to be maintained at 12 months. In terms of side-effects, captopril produced no excess of deaths on days 0–1, even among patients with low blood pressure at entry. It was associated with an increase of 52 patients per 1000 in hypotension considered severe enough to require termination of study treatment, of five per 1000 in reported cardiogenic shock, and of five per 1000 in some degree of renal dysfunction.
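As a rough check on the figures just quoted, the short sketch below reproduces the ''4.9 ± 2.2 fewer deaths per 1000'' from the captopril and placebo death counts; the per-arm denominators (roughly half of the 58 050 patients in each group) are assumed for illustration.

```python
from math import sqrt

# Assumed per-arm denominators (about half of the 58 050 ISIS-4 patients each);
# the death counts are those quoted above.
n_capt, n_plac = 29028, 29022
d_capt, d_plac = 2088, 2231
p_capt, p_plac = d_capt / n_capt, d_plac / n_plac
arr = p_plac - p_capt                                   # absolute risk reduction
se = sqrt(p_capt * (1 - p_capt) / n_capt + p_plac * (1 - p_plac) / n_plac)
print(f"{1000 * arr:.1f} +/- {1000 * se:.1f} fewer deaths per 1000")   # about 4.9 +/- 2.2
```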
Mononitrate Use of mononitrate was not associated with any significant improvements in outcomes. There was no significant reduction in overall five-week mortality, nor were there reductions in any subgroup examined (including those not receiving short-term nonstudy intravenous or oral nitrates at entry). Continued follow-up did not indicate any later survival advantage. Somewhat fewer deaths on days 0–1 were reported among individuals allocated to active treatment, which is reassuring about the safety of using nitrates early in evolving acute MI. The only significant side-effect of the mononitrate regimen was an increase in hypotension of 15 per 1000 patients.
Magnesium As with mononitrate, use of magnesium was not associated with any significant improvements in outcomes, either in the entire group or any subgroups examined (including those treated early or late after symptom onset or in the presence or absence of fibrinolytic or antiplatelet therapies, or those at high risk of death). Further follow-up did not indicate any later survival advantage. In contrast to some previous small trials, there was a significant excess of heart failure with magnesium of 12 patients per 1000, as well as an increase of cardiogenic shock of five patients per 1000 during or just after the infusion period. Magnesium did not appear to have a net adverse effect on mortality on days 0–1. In terms of side-effects, magnesium was associated with an increase of 11 patients per 1000 in hypotension considered severe enough to require termination of the study treatment, of three patients per 1000 in bradycardia, and of three patients per 1000 in a cutaneous flushing or burning sensation. Because of its size, ISIS-4 provided reliable evidence about the effects of adding each of these three treatments to established treatments for acute MI. Collectively, GISSI-3, several smaller studies, and ISIS-4 have demonstrated that, for a wide range of patients without clear contraindications, ACE inhibitor therapy begun early in evolving acute MI prevents about five deaths per 1000 in the first month, with somewhat greater benefits in higher-risk patients. The benefit from one month of ACE inhibitor therapy persists for at least the first year. Oral nitrate
therapy, while safe, does not appear to produce a clear reduction in one-month mortality. Finally, intravenous magnesium was ineffective at reducing one-month mortality.
Conclusion Because of their simplicity, large size, and strict use of mortality as the primary endpoint, the ISIS trials have played a critical substantive role in establishing rational treatment plans for patients with acute MI. Methodologically, they have clearly demonstrated the utility of large simple randomized trials. Three principles guided the design and conduct of the ISIS trials. The first was the belief that a substantial public health benefit would result from the identification of effective, widely practical treatment regimens that could be employed in almost all medical settings, as opposed to those that can be administered only at specialized tertiary care facilities. For this reason, the ISIS investigations focused on strategies to decrease mortality which, in and of themselves, did not require cardiac catheterization or other invasive procedures for either diagnostic or therapeutic purposes. The second principle was that the benefits of truly effective therapies would be applicable to a wide spectrum of patients with diverse clinical presentations. Thus, the entry criteria for the ISIS trials were intentionally broad and designed to mimic the reality all health care providers encounter when deciding whether or not to initiate a given treatment plan. This is one reason that the ISIS trials focused on evolving acute MI in the view of the responsible physician. The third and perhaps most important principle was that most new therapies confer small to moderate benefits, on the order of 10%–30%. While such benefits on mortality are clinically very meaningful, these effects can be detected reliably only by randomized trials involving some tens of thousands of patients. Thus, the ISIS protocols were
streamlined to maximize randomization and minimize interference with the responsible physician’s choice of nonprotocol therapies and interventions. Nonetheless, by selectively collecting the most important entry and follow-up variables that relate directly to the efficacy or adverse effects of the treatment in question, the ISIS trials yielded reliable data for providing a rational basis for patient care. By limiting paperwork and not mandating protocol-driven interventions, the ISIS approach proved to be remarkably cost-effective. Indeed, the large ISIS trials were conducted at a small fraction of the usual cost of other smaller trials which, because of their inadequate sample sizes, failed to demonstrate either statistically significant effects or informative null results.
References

[1] ISIS-1 (First International Study of Infarct Survival) Collaborative Group (1986). Randomised trial of intravenous atenolol among 16 027 cases of suspected acute myocardial infarction: ISIS-1. Lancet 2, 57–65.
[2] ISIS-2 (Second International Study of Infarct Survival) Collaborative Group (1988). Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17 187 cases of suspected acute myocardial infarction: ISIS-2. Lancet 2, 349–360.
[3] ISIS-3 (Third International Study of Infarct Survival) Collaborative Group (1992). ISIS-3: Randomised comparison of streptokinase vs. tissue plasminogen activator vs. anistreplase and of aspirin plus heparin vs. aspirin alone among 41 299 cases of suspected acute myocardial infarction: ISIS-3. Lancet 339, 753–770.
[4] ISIS-4 (Fourth International Study of Infarct Survival) Collaborative Group (1995). ISIS-4: a randomised factorial trial assessing early oral captopril, oral mononitrate, and intravenous magnesium sulphate in 58 050 patients with suspected acute myocardial infarction: ISIS-4. Lancet 345, 669–685.
CHARLES H. HENNEKENS & P.J. SKERRETT
INTER-RATER RELIABILITY
HELENA CHMURA KRAEMER, Stanford University, Palo Alto, California

1 DEFINITION
The classic model that underlies reliability stipulates that a rating of unit i, i = 1, 2, . . . , can be expressed as X_ij = ξ_i + ε_ij, where ξ_i is that unit's ''true value'' (i.e., that free of rater error), and ε_ij is the error made by the jth independent rater sampled from the population of raters (1). The inter-rater reliability coefficient is defined as:

ρ = σ²_ξ / σ²_X

where σ²_ξ is the variance of the ξ_i in the population of interest, and σ²_X that of a single observation per subject. Thus, in a sense, reliability relates to a signal-to-noise ratio, where σ²_ξ relates to ''signal'' and σ²_X combines ''signal'' and ''noise.'' According to this definition, the reliability coefficient is zero if and only if subjects in the population are homogeneous in whatever X measures. This situation should almost never pertain when considering measures for use in randomized clinical trials. Consequently, testing the null hypothesis that ρ = 0 is virtually never of interest, although admittedly such tests are often seen in the research literature. Instead, the tasks of greatest interest to clinical research are (1) obtaining a confidence interval for ρ, (2) judging the adequacy of ρ, and (3) considering how to improve ρ.

2 THE IMPORTANCE OF RELIABILITY IN CLINICAL TRIALS
Concern about inter-rater reliability in randomized clinical trials stems from the fact that a lack of reliability results in a lack of precision in estimation of population parameters as well as a loss of power in testing hypotheses. For example, suppose the response to treatment in a clinical trial were given by

X_ij = ξ_i + ε_ij

where in the first treatment group, T1, the mean of ξ_i is µ1 and in the other treatment group, T2, it is µ2, and the assumptions underlying the two-sample t-test hold. With N subjects randomly assigned to T1 and T2, the power to detect a deviation from the null hypothesis of equivalence of T1 and T2 using the outcome measure X is determined by the effect size:

δ(ρ) = (µ1 − µ2)/(σ²_X)^(1/2) = √ρ (µ1 − µ2)/σ_ξ = √ρ δ     (1)

Because the sample size necessary to achieve specified power for a two-sample t-test is inversely related to the square of the effect size, the sample size necessary to achieve given power is inversely related to the reliability coefficient. Thus, in a situation in which one would need 100 subjects per group with a perfectly reliable measure, one would need 200 subjects if the measure had a reliability of .5, and 500 subjects if the measure had a reliability of .2. Similarly, suppose two variables X and Y were to be correlated. Then the product moment correlation coefficient between X and Y can be shown to be:

Corr(X, Y) = Corr(ξ, ν)(ρ_X ρ_Y)^(1/2)

where ξ and ν are the ''true values'' of X and Y, and ρ_X and ρ_Y are their reliabilities. Attenuation of correlation is therefore always associated with unreliability of measurement, and the sample size needed to achieve a certain power in testing whether the observed correlation deviates from zero is inversely proportional to the product of the two reliability coefficients. However, if the only effect of unreliability of measurement were to decrease the power of statistical tests, researchers could compensate for unreliability by increasing sample size. Increasing the sample size is not always feasible, and even when feasible, it is not easy: it increases the time and cost of performing clinical trials, increases the difficulty of obtaining adequate funding, and delays the process of settling important clinical questions. Yet, all these processes
could be managed. However, the effect of unreliability of measurement also leads to attenuation of effect sizes. Thus, it is possible to increase sample size enough to declare the treatment difference statistically significant, only to have the effect size indicate that the treatment difference is unlikely to be of clinical significance.
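The arithmetic behind these statements is simple enough to state as code; the following is a minimal sketch (the function names are ours) that reproduces the 100/200/500 subjects-per-group example above and the attenuation of the effect size from equation (1).

```python
from math import sqrt

def n_per_group(n_perfect, rho):
    """Subjects per group needed with reliability rho, relative to a perfectly reliable measure."""
    return n_perfect / rho

def attenuated_effect_size(delta, rho):
    """Effect size observable with an outcome of reliability rho, as in equation (1)."""
    return sqrt(rho) * delta

for rho in (1.0, 0.5, 0.2):
    print(rho, n_per_group(100, rho), attenuated_effect_size(0.5, rho))
    # prints 100, 200, and 500 subjects per group, with effect sizes 0.50, 0.35, 0.22
```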
3 HOW LARGE A RELIABILITY COEFFICIENT IS LARGE ENOUGH?
No standards guarantee that inter-rater reliability is ''large enough,'' although ''rules of thumb'' have been suggested. For example, ρ > .8 might be considered ''almost perfect,'' .6 to .8 ''substantial,'' .4 to .6 ''moderate,'' .2 to .4 ''low,'' and under .2 ''unacceptable'' (2). In fact, relatively few clinical measurements have reliability above .8, and many have reliability under .4 (3,4). Moreover, many medical measures, even those in common use, have never been assessed for reliability.

4 DESIGN AND ANALYSIS OF RELIABILITY STUDIES
The design to assess the inter-rater reliability of a measurement requires sampling N subjects from the population of interest, then sampling M independent (''blinded'') raters of each subject at the same time (to avoid additional variance caused by inconsistency of the subject's responses over time) from the population of raters to which the results are meant to apply. It is important to realize that the reliability of any measure may vary from one clinical population to another. For example, a diagnostic procedure may be reliable in a clinical population but unreliable in a community population. At the same time, a diagnostic procedure may be reliable when done by well-trained physicians in a clinical population but unreliable when done by technicians (or vice versa) in the same population. Consequently, having a representative sample of patients from the relevant population and a representative sample of raters from that relevant population is crucial. Because of the practical difficulty of having multiple simultaneous raters of a subject at one time, most often M equals 2. Raters are randomly allocated to positions in the analysis to assure that any disagreement between raters observing one subject is attributed to error, and raters must be ''blinded'' to each other's ratings to ensure independence of errors. It is the design of the reliability study that assures that what is estimated adequately indicates reliability.

5 ESTIMATION OF THE RELIABILITY COEFFICIENT—PARAMETRIC
When it is assumed that ξ_i and ε_ij are independent of each other, the intraclass correlation coefficient (q.v.) is generally used to estimate ρ. In practice, the easiest approach is to use a two-way ANOVA (N subjects by M raters). Then

(1 + (M − 1)r)/(1 − r) ≈ [(1 + (M − 1)ρ)/(1 − ρ)] F_(N−1),(M−1)(N−1)

Easier to use in applications is an extension of Fisher's z-transformation. Here, for M ≥ 2,

r = (F_S − 1)/(F_S + M − 1)

where F_S is the F-statistic to test for the effect of subjects (5). Under these assumptions,

z_M(r) = .5 ln[(1 + (M − 1)r)/(1 − r)]

is approximately normally distributed with mean z_M(ρ) and variance 1/(h − 1), where

h = 2(M − 1)(N − 1)/M

In particular, when M = 2, this means that

.5 ln[(1 + r)/(1 − r)]

is approximately normally distributed with mean .5 ln[(1 + ρ)/(1 − ρ)] and variance 1/(N − 2). If a product moment correlation coefficient between the two ratings had been used instead, the variance of the z-transformation would have been 1/(N − 3), which
indicates the minor loss of power associated with using an intraclass correlation rather than a product moment correlation coefficient applied to data generated in a reliability design. With the above information, confidence intervals for the intraclass correlation coefficient can be computed, presented, and evaluated for adequacy. The intraclass correlation coefficient has generated some confusion. Several different forms of the intraclass correlation coefficient do not, in general, lead to the same conclusion (6–9). However, in a reliability study, when the multiple raters per subject are randomly assigned to the m positions, all the various forms do estimate the same population parameter, although not necessarily equally efficiently. The problem related to the choice of intraclass correlation coefficient therefore need not confuse reliability estimation.
6 ESTIMATION OF THE RELIABILITY COEFFICIENT—NONPARAMETRIC
The above distribution theory for the intraclass correlation coefficient is robust to deviations from the normality assumptions, but if there is serious deviation for ordinal X, a nonparametric alternative is related to Kendall's Coefficient of Concordance, W (10). If the N ratings in the jth position (j = 1, 2, . . . , M) are rank-ordered (tied ranks averaged), application of the above ANOVA leads to a nonparametric inter-rater reliability coefficient r_KI = (MW − 1)/(M − 1). The distribution of r_KI is well approximated by that of r when some nonlinear transformation of X_ij exists that satisfies the normality assumptions (11). Otherwise, bootstrap methods (12–14) might be used to obtain confidence intervals for the reliability coefficient.
7 ESTIMATION OF THE RELIABILITY COEFFICIENT—BINARY
One example of what would be considered a very serious deviation from the assumptions that underlie the parametric intraclass correlation coefficient occurs if X_ij is binary (0 or 1), for then the within subject variance
usually depends on the true value. In that situation, the intraclass kappa coefficient estimates the inter-rater reliability coefficient (15). To compute the intraclass kappa coefficient, the above ANOVA may be applied to the 0/1 data (16). It is the distribution of the resulting reliability coefficient (the intraclass kappa) that is affected, not its computation. For M > 2, bootstrap methods are recommended to obtain confidence intervals, for the distribution depends on unknown higher order moments of ξ_i = E_j(X_ij). For M = 2, the asymptotic distribution of the intraclass kappa is known (16), but unless sample size is very large, bootstrap methods are still preferred.
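The following sketch pulls the preceding formulas together: the intraclass correlation computed from a two-way ANOVA via r = (F_S − 1)/(F_S + M − 1), with a confidence interval from the z_M transformation. Applied to 0/1 ratings, the same computation yields the intraclass kappa, as noted above. The example data and the function name are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def icc_with_ci(x, conf=0.95):
    """Intraclass correlation from a two-way ANOVA (N subjects x M raters),
    with a confidence interval based on the z_M transformation described above."""
    x = np.asarray(x, dtype=float)
    n, m = x.shape
    grand = x.mean()
    ms_subj = m * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + grand
    ms_err = (resid ** 2).sum() / ((n - 1) * (m - 1))
    f_s = ms_subj / ms_err                       # F-statistic for the subject effect
    r = (f_s - 1.0) / (f_s + m - 1.0)
    z = 0.5 * np.log((1.0 + (m - 1.0) * r) / (1.0 - r))
    h = 2.0 * (m - 1.0) * (n - 1.0) / m
    half = norm.ppf(0.5 + conf / 2.0) / np.sqrt(h - 1.0)
    back = lambda zz: (np.exp(2.0 * zz) - 1.0) / (np.exp(2.0 * zz) + m - 1.0)
    return r, (back(z - half), back(z + half))

# Two blinded raters (M = 2) scoring N = 6 subjects; with 0/1 ratings the same
# computation gives the intraclass kappa.
ratings = np.array([[3, 4], [5, 5], [2, 3], [4, 4], [1, 2], [5, 4]])
print(icc_with_ci(ratings))   # r is about 0.82, with a wide interval because N is small
```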
8 ESTIMATION OF THE RELIABILITY COEFFICIENT—CATEGORICAL
Finally, suppose that X_ij represents more than two nonordered categories (ordered categories are ordinal data for which r_KI is more suited). An intraclass kappa appropriate for this situation estimates

Σ_k P_k(1 − P_k)κ_k / Σ_k P_k(1 − P_k)
where P_k is the probability of falling into category k, k = 1, 2, . . . , K, and κ_k is the intraclass kappa coefficient for the binary variable in which each subject is classified either in category k or not in category k (17). However, as can be observed from this formula, it is possible that the multicategory intraclass kappa is near zero when some individual categories are nearly perfectly reliable, or conversely, that the multicategory intraclass kappa is high even when some rarer categories have near-zero reliability. For this reason, for multicategory X_ij, it is recommended that the intraclass kappa be computed and evaluated for each of the categories separately (18). Then, if some individual categories are unreliable and others acceptably reliable, attention can be focused on improving the reliability of the flawed categories, either by redefinition, perhaps by combining categories that are ill distinguished from each other, or by retraining raters.
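A small numerical illustration of this weighted average (the category probabilities and per-category kappas are hypothetical) shows how a rare, unreliable category can hide behind a respectable overall value:

```python
def multicategory_kappa(p, kappa):
    """Weighted average of per-category intraclass kappas, with weights P_k(1 - P_k)."""
    w = [pk * (1.0 - pk) for pk in p]
    return sum(wk * kk for wk, kk in zip(w, kappa)) / sum(w)

# Two common, reliable categories and one rare category with near-zero reliability:
print(multicategory_kappa([0.6, 0.3, 0.1], [0.9, 0.8, 0.05]))   # about 0.72 overall
```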
9 STRATEGIES TO INCREASE RELIABILITY (SPEARMAN-BROWN PROJECTION)
What if the reliability coefficient estimated by the above methods is not high enough to satisfy the standards set by the designers of the randomized clinical trial? Then, several strategies might be considered to improve the inter-rater reliability of a measure prior to its use in an RCT: better training of raters and clarification of the measurement system. Alternatively, one can always propose to average multiple independent ratings for each subject in the RCT (a common practice with assay procedures, in which three independent assay results are often averaged per tissue sample). Under the classic model, the reliability of the average of m ratings per subject, ρ_m, is given by the Spearman-Brown projection formula:

ρ_m = mρ / [(m − 1)ρ + 1]

Thus, for example, to raise the reliability of a measure from ρ to ρ_m, one would need the average of m independent raters for each subject, where:

m = ρ_m(1 − ρ) / [(1 − ρ_m)ρ]
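A small worked illustration of these two formulas (the function names are ours): averaging three independent ratings raises a reliability of .5 to .75, and reaching .8 from .5 requires averaging four.

```python
def spearman_brown(rho, m):
    """Reliability of the average of m independent ratings per subject."""
    return m * rho / ((m - 1) * rho + 1)

def raters_needed(rho, rho_target):
    """Number of ratings to average so a measure of reliability rho reaches rho_target."""
    return rho_target * (1 - rho) / ((1 - rho_target) * rho)

print(spearman_brown(0.5, 3))    # 0.75
print(raters_needed(0.5, 0.8))   # 4.0
```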
10 OTHER TYPES OF RELIABILITIES
Inter-observer reliability is one of several types of reliabilities that might be of interest, depending on which sources of error are of interest in determining the accuracy of a measurement. Two others of considerable importance are test-retest reliability and intra-rater reliability. For test-retest reliability, the multiple ratings per subjects are sampled over a span of time short enough that the subject characteristic of interest is unlikely to change but long enough that the subject inconsistency in expressing that characteristic can be included in the error (e.g., over several weeks for the diagnosis of a long-lasting condition). Once again, to ensure the independence of errors within each subject, the ratings are
preferably made by different raters. Because the error included in test-retest reliability combines both rater error and error caused by subject inconsistency, test-retest reliability is generally lower than inter-observer reliability, which includes only rater errors. Because inconsistency of a subject's expression of the characteristic of interest is often the major source of error, and thus of attenuation of power in statistical hypothesis testing, or attenuation of effect size in estimation, test-retest reliability provides greater assurance of accuracy of measurement than does inter-observer reliability. For intra-rater reliability, the multiple ratings per subject are obtained by multiple independent measurements made by the same rater on the same observation. For example, one might sample N cancer patients, obtain a tumor tissue sample from each, divide that sample into M subsamples, each mounted separately for pathological examination, without any subject labels. Then, one rater would examine all unlabeled MN tissue samples in random order and classify each into the type or stage of cancer. The question is to what extent the rater classifies the M tissue samples from each patient the same. The error of interest now develops only from the inconsistencies within a rater, which is one component of the inconsistencies from one rater to another that is reflected in inter-observer unreliability. Consequently, one would expect that intra-observer reliability would be higher than inter-observer reliability, which, in turn, would be higher than test-retest reliability. The discrepancies between these reliability coefficients provide clues as to the sources of error of measurement: from inconsistencies within each rater, from inconsistencies between one rater and another, and from inconsistencies in the patient's expression of the characteristic.
REFERENCES
1. F. M. Lord and M. R. Novick, Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley Publishing Company, Inc., 1968.
2. J. R. Landis and G. G. Koch, The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159–174.
3. L. M. Koran, The reliability of clinical methods, data and judgments, part 1. N. Engl. J. Med. 1975; 293: 642–646.
4. L. M. Koran, The reliability of clinical methods, data and judgments, part 2. N. Engl. J. Med. 1975; 293: 695–701.
5. E. A. Haggard, Intraclass Correlation and the Analysis of Variance. New York: Dryden Press, 1958.
6. J. J. Bartko, The intraclass correlation coefficient as a measure of reliability. Psychol. Reports 1966; 19: 311.
7. J. J. Bartko, Corrective note to: The intraclass correlation coefficient as a measure of reliability. Psychol. Reports 1974; 34.
8. J. J. Bartko, On various intraclass correlation reliability coefficients. Psychol. Bull. 1976; 83: 762–765.
9. P. E. Shrout and J. L. Fleiss, Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 1979; 86: 420–428.
10. M. Kendall and J. D. Gibbons, Rank Correlation Methods, 5th ed. New York: Oxford University Press, 1990.
11. H. C. Kraemer, The small sample non-null properties of Kendall's Coefficient of Concordance for normal populations. J. Am. Stat. Assoc. 1976; 71: 608–613.
12. C. E. Lunneborg, Estimating the correlation coefficient: the bootstrap. Psychol. Bull. 1985; 98: 209–215.
13. J. L. Rasmussen, Estimating correlation coefficients: bootstrap and parametric. Psychol. Bull. 1987; 101: 136–139.
14. M. J. Strube, Bootstrap Type I error rates for the correlation coefficient: an examination of alternate procedures. Psychol. Bull. 1988; 104: 290–292.
15. H. C. Kraemer, V. S. Periyakoil, and A. Noda, Tutorial in biostatistics: kappa coefficients in medical research. Stat. Med. 2002; 21: 2109–2129.
16. J. L. Fleiss, Statistical Methods for Rates and Proportions. New York: John Wiley & Sons, 1981.
17. H. C. Kraemer, Ramifications of a population model for κ as a coefficient of reliability. Psychometrika 1979; 44: 461–472.
18. H. C. Kraemer, Measurement of reliability for categorical data in medical research. Stat. Methods Med. Res. 1992; 1: 183–199.

CROSS-REFERENCES
Analysis of Variance (ANOVA)
Bootstrapping
Categorical Variables
Correlation
Intraclass Correlation Coefficient
Kappa Statistic
Reliability Analysis
Repeated Measurements
Type II Error (False Negative)
5
INTERVAL CENSORED
THOMAS A. GERDS and CAROLINA MEIER-HIRMER
Institute for Medical Biometry and Medical Informatics, University Hospital Freiburg, and Center for Data Analysis and Modeling, Freiburg, Germany

1 CENSORING

In this article, the response is always the time of an event, the occurrence of which becomes known at examination times. Some special cases shall be distinguished. One speaks of left censoring if the event of interest occurred before the examination time and of right censoring if the event did not occur until the examination time. The situation with only one examination for each patient is called "case 1" interval censoring, and the resulting observations are often called current status data. Left and right censoring can be generalized to "case k" interval censoring for situations in which the information from exactly k examinations is available for each patient (k is a positive integer). Since in clinical practice the number of examination times typically differs among patients, most frequently one has to deal with "mixed case" interval censoring. This term refers to situations in which some observations are exact event times, some are right or left censored, and others are genuinely censored to intervals. It is important to emphasize that the name "interval censoring" is often used to describe data consisting of such a mixture in general. Using artificial data, Fig. 1 and Table 1 demonstrate how interval-censored observations are obtained from the longitudinal data of the examination process. Note that although patients 2 and 4 have the same event time, the observed intervals differ considerably.
A note of caution: Complicated censoring schemes arise in medical practice as well as in other fields (see the statistical literature), but the connection between the theoretical results and the applications is not yet well developed in all cases nor easily available. Moreover, computer programs are not generally available. As a result, ad hoc methods that may cause biased conclusions are still in use. For instance, a potentially biased analysis would result from using the Kaplan-Meier estimator or the Cox regression model after transforming the interval-censored observations into the right-censoring situation. Replacing intervals with their middle or maximal point approximates the true results only in exceptional cases, for instance, when the observed intervals are generally small and when the accuracy needed in the specific problem is low.
Figure 1. Interval-censored observations (thick lines) corresponding to the artificial data in Table 1. Filled dots in the intervals represent the (unobservable) true event times. The respective type of censoring (right, interval, left, exact time) is marked on the left axis of the diagram; the time axis runs from 0 to 54. Although observations 2 and 4 have the same true event time, the observed intervals differ considerably.
Table 1. Illustration of How to Obtain Interval-Censored Observations from Hypothetical Examination Processes

Patient 1: examination times 0, 12, 30, 42, 48; event observed NO, NO, NO, NO, NO; true event time 60; censored observation [48, ∞)
Patient 2: examination times 0, 12, 42; event observed NO, NO, YES; true event time 36; censored observation [12, 42]
Patient 3: examination time 0; event observed YES; true event time -10; censored observation (-∞, 0]
Patient 4: examination times 0, 12, 30, 54; event observed NO, NO, NO, YES; true event time 36; censored observation [30, 54]
Patient 5: examination times 0, 30; event observed NO, YES; true event time 30; censored observation 30 (exact event time)

2 CLASSIFICATION AND EXAMPLES

In this section, the types of interval censoring are classified and illustrated more formally. For the special cases of left and right censoring, see the article on survival analysis.

"Case 1" interval censoring: The observations are also called current status data. Here the information of one examination is available for each patient, and it is known whether the event occurred before or after the date of examination. As a result, the observations are either left censored or right censored. Cross-sectional studies in clinical or epidemiological projects often result in current status data. Another example is tumor incidence data in animal experiments where independent individuals are exposed to carcinogens. Typically, a valid histological diagnosis of the onset of a tumor is only possible after death. Thus, the day of death is the only examination time revealing whether a tumor has grown. The limitation of detecting a tumor only if it exceeds a critical size is a technical reason for censoring. For tumor incidence experiments, the censoring occurs because of the inability to measure the event of interest exactly (occult tumors). For cross-sectional studies, the censoring is introduced by the study design.

"Case k" interval censoring: For each patient, the results from k examinations are available. As the same number of examinations is almost never available for all patients, "case k" interval censoring (for k greater than one) occurs sparsely in medical practice. An example is the supervision of children learning to speak, where learned words are collected on a fixed number of questionnaires for each child. However, if single questionnaires are missing for some children, and hence the number of available examinations differs, the data have to be considered as "mixed case" interval censored.

"Mixed case" interval censoring: For each patient, the event status is known at a differing number of examination
times. Therefore, ‘‘mixed case’’ interval censored data typically consist of a mixture of interval-censored and rightcensored observations. Sometimes they include left-censored observations and even exact time points, if the event happens to occur at the day of examination. An example for mixed case interval censoring is given by breast cancer studies where the event of interest is the first occurrence of breast retraction. Shortly after therapy, which can be a combination of radiotherapy and chemotherapy, the time interval between two adjacent examinations is
typically small but lengthens as the recovery progresses. The exact time of retraction is only known to fall into the interval between two visits or after the last examination time.

Double interval censoring: If the variable of interest is the duration between two events, then both the start point and the endpoint of the duration can be interval censored. A well-known example is the incubation time between HIV infection and AIDS diagnosis: the infection time is interval censored between the last negative and the first positive antibody test, and the time to AIDS diagnosis is right censored when AIDS is not diagnosed within the study time.

Double censoring: Double censoring refers to situations in which the observations are either exact event times, left censored, or right censored. Double censoring is therefore a special case of mixed case interval censoring.

Interval-censored covariates: In the framework of multistate models, the influence of an intermediate event on the main outcome variable is sometimes of interest. For instance, if the endpoint of interest is death, then the recurrence of a tumor can be an influential intermediate event. In this example, the occurrence of the intermediate event is an interval-censored covariate. Situations in which the time of the intermediate event is interval censored occur frequently in such frameworks, in particular for illness-death models.
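To make the mechanics of these censoring types concrete, the following minimal Python sketch (a hypothetical helper, not part of the original article) turns an examination process, that is, a list of visit times together with whether the event had already occurred at each visit, into the observed censoring interval, reproducing the logic of Table 1.

```python
from math import inf

def censoring_interval(times, statuses):
    """times: increasing examination times; statuses: True if the event
    had already occurred by that examination.
    Returns the observed (left, right) interval for the event time."""
    left, right = -inf, inf
    for t, occurred in zip(times, statuses):
        if occurred:
            right = t          # first examination at which the event is seen
            break
        left = t               # last examination without the event
    return left, right         # (-inf, t]: left censored; [t, inf): right censored

# Hypothetical records in the spirit of Table 1.
records = {
    1: ([0, 12, 30, 42, 48], [False, False, False, False, False]),
    2: ([0, 12, 42],         [False, False, True]),
    3: ([0],                 [True]),
    4: ([0, 12, 30, 54],     [False, False, False, True]),
    5: ([0, 30],             [False, True]),
}
for pid, (t, s) in records.items():
    print(pid, censoring_interval(t, s))
```

Note that an exact event time (patient 5 in Table 1) cannot be recognized from the YES/NO statuses alone; it appears here as the interval [0, 30], and extra information that the event occurred on the examination day itself would be needed to record it as exact.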
3 STUDY DESIGN AND STATISTICAL MODELING

For interval-censored data resulting from periodic follow-up in a clinical trial or longitudinal study, the information content is generally greater the smaller the intervals between adjacent examinations are. The length of the observed intervals evidently influences the power of statistical methods, and it is important for the significance of statistical
analysis to gather as much information as possible. However, acquisition on a dense time schedule or even continuous in time can be prohibited for various reasons. Financial costs or stress of patients are typical factors that limit the accuracy of measurement. The statistical model building for censored data proceeds in two steps: In a first step, the survival model is specified, that is, a family of probability distributions that includes the underlying survival function. The second step deals with the dependence of the event time on the censoring scheme. Sometimes it is important to specify a probabilistic model also for the distribution of the censoring process. For convenience, it is often assumed that the censoring is independent of the event time. Survival models can be parametric, semiparametric, or nonparametric. Besides the differences of the statistical methods used for the three model classes, one must be aware of the well-known tradeoff between bias and variance, which may differ among the three approaches: On the one hand, if the parametric model assumptions are violated, the statistical inference can be misleading due to biased estimates. On the other hand, with semiparametric or nonparametric models, the power of the statistical methods is typically low for small or moderate sample sizes that are often present in clinical studies. In the presence of covariates, a suitable survival regression model has to be specified. For instance, the proportional hazards model, the proportional odds model, and the accelerated failure time model are frequently used regression models that have extensions to interval-censored data structures. The task of modeling the censoring mechanism has a similar impact: Strong assumptions on the distribution of the censoring mechanism can result in biased inference, whereas allowing general censoring schemes may lead to low statistical power of the statistical procedures. As noted, most statistical procedures for interval-censored data assume that the examination times are independent of the event time. This assumption is satisfied for externally scheduled examination times. However, there are situations in which the examination process is not independent of the
event time. The random censorship assumption is often violated when the data arise from a serial screening and the timing of screening depends on the patient’s health status. Or, for time-to-infection data, if the infection can be suspected after certain undertakings or accidents, cool-headed patients would likely urge a laboratory testing. Then, the infection time and the examination time are not independent. 4
STATISTICAL ANALYSIS
Several characteristics of survival distributions are of interest in clinical trials: the survival probability function, the difference or ratio of the survival probabilities in two (treatment) groups, and the influence of covariates on the survival probabilities, to name the most important ones. Under the burden of complicated censoring schemes, for each of these cases, a valid estimation method is needed. The methods for interval-censored data are nonstandard and need advanced techniques. Consistent estimators may be available only under restrictive assumptions on the censoring mechanism, and the distributional properties of estimates, confidence intervals, or valid testing procedures are only approximately constructed. To this end, one should note that the development of statistical methods and their distributional properties for interval-censored data are at present not fully developed and research is an ongoing process. In particular, examples with appropriate statistical treatment of dependent interval censoring are only very recent; see References 1 and 2. There is also ongoing work in mathematical statistics. The mathematically interested reader is referred to References 3 and 4 and the references given therein. In the remaining section, some established statistical methods are named for which it is assumed that the censoring is independent of the event time. The inference in parametric survival models is relatively straightforward. As a consequence of the independence assumption, likelihood based methods are applicable; see Reference 5. In particular, the maximum likelihood estimator has the usual properties, as are consistency and the usual convergence
rate n1/2 , where n is the sample size. Software for the parametric inference should be available for most standard statistic programs. For the nonparametric estimation of the survival probability function, the so-called nonparametric maximum likelihood estimator (NPMLE) can be used. The estimator is related to the familiar Kaplan–Meier estimator, which is the nonparametric maximum likelihood estimator for right-censored data. However, the Kaplan–Meier estimator cannot be applied directly and only in exceptional cases to interval-censored data; see the note of caution at the end of Section (1). NPMLE for interval-censored data is not uniquely defined: Any function that jumps the appropriate amount in the so-called equivalence sets represents a NPMLE; see Reference 6. Briefly, the equivalence sets are found by ordering all unique left-hand limits and all unique right-hand limits of all observed intervals in a dataset; see Reference 7 for details. Outside the equivalence sets, the nonparametric estimator defines constant survival probability and the graph is horizontal in these parts. Although the graph of NPMLE is formally undefined in the equivalence sets, some authors visualize NPMLE as if it was a step function, some interpolate between the horizontal parts, and others leave the graph indefinite outside the horizontal parts. Technical differences occur with the computation of NPMLE for the different types of censoring: For ‘‘case 1’’ interval-censored data, NPMLE is given by an explicit formula, (8). For the other types of interval censoring, NPMLE has to be computed recursively; see References 6, and (8–10). For instance, the self-consistency equations developed by Turnbull (9) yield an algorithm that is a special case of the EM-algorithm. Turnbull’s algorithm is implemented in some major statistical software packages (SAS, Splus). The more recently suggested algorithms achieve improvement concerning stability and computational efficiency (8, 11, 12). Unlike the Kaplan–Meier estimator, the NPMLE for interval-censored data converges at a rate slower than n1/2 , where n is the sample size. In particular, the distribution of this survival function estimator cannot be approximated by a Gaussian distribution. However, at least for ‘‘case 1’’ and ‘‘case 2’’
interval censoring, the asymptotic distribution of NPMLE has been derived in Reference 8. In some cases, the bootstrap provides an alternative method for approximating the distribution of NPMLE (3). By using such tools, confidence intervals for the survival probability at a fixed time can be constructed. Nonparametric tests for the two-group comparison have been proposed in References 12 and 13.
Semiparametric models are prominent for the analysis of regression problems in survival analysis. The frequently used regression models that have extensions for interval-censored data structures are the proportional hazards model (14, 15), the proportional odds model (16), and the accelerated failure time model (3). These model classes have semiparametric and parametric subclasses. The main difference is that the estimators of the survival curve in the semiparametric models behave like the NPMLE; i.e., the convergence rate is slow and the distribution is not approximately Gaussian.
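As an illustration of the self-consistency iterations attributed to Turnbull (9) above, the following Python sketch computes a nonparametric estimate of the event-time distribution from interval-censored data. It is a simplified, hypothetical implementation: candidate support points are taken to be the observed interval endpoints rather than Turnbull's equivalence sets, and only a crude convergence check is included.

```python
import numpy as np

def turnbull_npmle(intervals, max_iter=1000, tol=1e-8):
    """Self-consistency (EM) iterations for interval-censored data.
    intervals: list of (left, right) pairs; use np.inf for right censoring."""
    intervals = np.asarray(intervals, dtype=float)          # shape (n, 2)
    support = np.unique(intervals)                          # candidate mass points
    # A[i, j] = 1 if support point j lies inside observed interval i
    A = ((intervals[:, [0]] <= support) & (support <= intervals[:, [1]])).astype(float)
    p = np.full(support.size, 1.0 / support.size)           # initial masses
    for _ in range(max_iter):
        denom = A @ p                                       # current P(T in interval i)
        p_new = (A * p).T @ (1.0 / denom) / len(intervals)  # E- and M-step combined
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    return support, p, 1.0 - np.cumsum(p)                   # support, masses, survival

# A handful of the Treatment 1 observations listed in Table 2 (months).
obs = [(0, 5), (4, 11), (17, 25), (36, 44), (15, np.inf), (46, np.inf)]
support, mass, surv = turnbull_npmle(obs)
for t, pr, sv in zip(support, mass, surv):
    print(t, round(pr, 3), round(sv, 3))
```

A real analysis would use the complete data set and would report the survival curve only outside the equivalence sets, where the NPMLE is uniquely defined.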
Table 2. Breast Cancer Retraction Data in Two Treatment Arms. The Time of First Occurrence of Breast Retraction Lies Between the Left and the Right Endpoint of the Intervals; Right-Censored Observations have ∞ as the Right Endpoint Treatment 1 (n = 46)
[0, 5], [0, 7], [0, 8], [4, 11], [5, 11], [5, 12], [6, 10], [7, 14], [7, 16], [11, 15], [11, 18], [17, 25], [17, 25], [18, 26], [19, 35], [25, 37], [26, 40], [27, 34], [36, 44], [36, 48], [37, 44], [15, ∞), [17, ∞), [18, ∞), [22, ∞), [24, ∞), [24, ∞), [32, ∞), [33, ∞), [34, ∞), [36, ∞), [36, ∞), [37, ∞), [37, ∞), [37, ∞), [38, ∞), [40, ∞), [45, ∞), [46, ∞), [46, ∞), [46, ∞), [46, ∞), [46, ∞), [46, ∞), [46, ∞), [46, ∞)
Treatment 2 (n = 48)
[0, 5], [0, 22], [4, 8], [4, 9], [5, 8], [8, 12], [8, 21], [10, 17], [10, 35], [11, 13] [11, 17], [11, 20], [12, 20], [13, 39], [14, 17], [14, 19], [15, 22], [16, 20] [16, 24], [16, 24], [16, 60], [17, 23], [17, 26], [17, 27], [18, 24], [18, 25] [19, 32], [22, 32], [24, 30], [24, 31], [30, 34], [30, 36], [33, 40], [35, 39] [44, 48], [11, ∞), [11, ∞), [13, ∞), [13, ∞), [13, ∞), [21, ∞), [23, ∞) [31, ∞), [32, ∞), [34, ∞), [34, ∞), [35, ∞), [48, ∞)
Figure 2. Graphical representation of the breast deterioration data in the two treatment groups (Treatment 1 and Treatment 2; months on the horizontal axis, patients on the vertical axis). Right-censored observations are shown at the end of each treatment group.
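A diagram of this kind can be produced with a few lines of code. The sketch below is a hypothetical illustration rather than the original figure code: it sorts the interval-censored observations by interval length, appends the right-censored ones, and draws each observation as a horizontal segment.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_intervals(intervals, upper=65, ax=None):
    """Draw interval-censored observations; right-censored ones are dashed."""
    ax = ax or plt.gca()
    finite = sorted([iv for iv in intervals if np.isfinite(iv[1])],
                    key=lambda iv: iv[1] - iv[0])
    infinite = sorted([iv for iv in intervals if not np.isfinite(iv[1])])
    for k, (lo, hi) in enumerate(finite + infinite):
        right = hi if np.isfinite(hi) else upper   # truncate right-censored segments
        ax.hlines(k + 1, lo, right,
                  linestyles="solid" if np.isfinite(hi) else "dashed")
    ax.set_xlabel("Months")
    ax.set_ylabel("Patients")
    return ax

# a few of the Treatment 1 observations from Table 2
obs = [(0, 5), (4, 11), (17, 25), (36, 44), (15, np.inf), (46, np.inf)]
plot_intervals(obs)
plt.show()
```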
In the case of a parametric survival model, the usual properties are retained for the maximum likelihood estimators of the covariate effects. The hypothesis of zero covariate effects can then be tested, e.g., under the proportional hazard assumption (14); see also References 17 and 18 for related test statistics.

5 WORKED EXAMPLE

In this section, an illustration of the mainstream statistical methods for interval censoring is given. The data are taken from the overview article (7), where the reader also finds a comprehensive statistical analysis and a comparison of statistical methods. The source of the data is clinical studies on the cosmetic effect of different treatments of breast cancer (19, 20). Women with early breast cancer received a breast-conserving excision followed by either radiotherapy alone or a combination of radiotherapy and chemotherapy. The event of interest was the first occurrence of breast retraction. The time interval between two adjacent examinations was on average 4-6 months, stretching wider as the recovery progressed. In what follows, treatment 1 corresponds to radiotherapy alone and treatment 2 to a combination of radiotherapy and chemotherapy. It was suggested that additional chemotherapy leads to earlier breast retraction. The data consist of a mixture of interval-censored and right-censored observations. A complete listing taken from Reference 7 is presented in Table 2. A special diagram has been introduced in Reference 12 for the graphical representation of interval-censored data; see Fig. 2. In each treatment group, the interval-censored part of the data is sorted by the length of the observed interval and the right-censored part by the time of the last examination.
Figure 3 compares the parametric estimate of the survival curve in the Weibull survival model to the NPMLE. In addition, the Kaplan-Meier estimator was computed by treating the center of the 56 closed intervals, where the right endpoint is not ∞, as if these observations were exact. The graph of the Kaplan-Meier estimator, which is only an ad hoc method in this situation, is also displayed in Fig. 3. All estimation methods show late differences among the survival probabilities in the treatment groups. The graphs of the competing methods are quite close for treatment 1, but they differ for treatment 2.

Figure 3. Comparison of estimated survival curves (NPMLE, Weibull, Kaplan-Meier) in the treatment groups of the breast deterioration data; survival probability (0 to 1) is plotted against months (0 to 60). The respective upper curves belong to treatment 1 for all methods of estimation.
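For the fully parametric analysis referred to above, the Weibull model can be fitted by maximizing the interval-censored log-likelihood, in which each observation contributes log[S(L) - S(R)], with S(∞) = 0 for right-censored observations. The following Python sketch is an illustration under these assumptions, not the code used for the article; exact event times would additionally require a density contribution, and a real analysis would use the complete data of Table 2 rather than the handful of intervals shown.

```python
import numpy as np
from scipy.optimize import minimize

def weibull_surv(t, shape, scale):
    """Weibull survival function, with S(inf) = 0."""
    t = np.asarray(t, dtype=float)
    s = np.zeros_like(t)
    finite = np.isfinite(t)
    s[finite] = np.exp(-(t[finite] / scale) ** shape)
    return s

def neg_loglik(params, left, right):
    shape, scale = np.exp(params)                 # optimize on the log scale
    prob = weibull_surv(left, shape, scale) - weibull_surv(right, shape, scale)
    return -np.sum(np.log(np.clip(prob, 1e-300, None)))

# a few interval- and right-censored observations (months), for illustration only
left  = np.array([0.0, 4.0, 17.0, 36.0, 15.0, 46.0])
right = np.array([5.0, 11.0, 25.0, 44.0, np.inf, np.inf])

fit = minimize(neg_loglik, x0=np.log([1.0, 30.0]), args=(left, right),
               method="Nelder-Mead")
shape_hat, scale_hat = np.exp(fit.x)
print(round(shape_hat, 3), round(scale_hat, 3))
```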
To test the hypothesis of no treatment effect, the nonparametric tests proposed in References 12 and 13 are compared with the test under the proportional hazard assumption given in Reference 14 and with the test in a fully parametric survival model (Weibull family). The resulting P-values presented in Table 3 are different. However, in this example, all methods show significant differences among the treatment arms.

Table 3. Test Results for the Treatment Effect in the Breast Deterioration Data

Test                  P-value
Finkelstein (14)      0.004
Dümbgen et al. (12)   0.0028
Sun (13)              0.0043
Parametric            0.0012

REFERENCES

1. R. A. Betensky, On nonidentifiability and noninformative censoring for current status data. Biometrika 2000; 218-221.
2. D. M. Finkelstein, W. B. Goggins, and D. A. Schoenfeld, Analysis of failure time data with dependent interval censoring. Biometrics 2002; 298-304.
3. J. Huang and J. A. Wellner, Interval censored survival data: a review of recent progress. Proc. First Seattle Symposium in Biostatistics: Survival Analysis, 1997.
4. J. Sun, Interval censoring. In: P. Armitage and T. Colton (eds.), Encyclopedia of Biostatistics. New York: Wiley, 2002: 2090-2095.
5. J. P. Klein and M. L. Moeschberger, Survival Analysis: Techniques for Censored and Truncated Data. Statistics for Biology and Health. New York: Springer, 1997.
6. R. Peto, Experimental survival curves for interval-censored data. Appl. Stat. 1973; 22: 86-91.
7. J. C. Lindsey and L. M. Ryan, Tutorial in biostatistics: methods for interval-censored data. Stat. Med. 1998; 17: 219-238.
8. P. Groeneboom and J. A. Wellner, Information Bounds and Nonparametric Maximum Likelihood Estimation, vol. 19 of DMV-Seminar. New York: Birkhäuser, 1992.
9. B. W. Turnbull, The empirical distribution function with arbitrarily grouped, censored and truncated data. J. R. Stat. Soc. Series B 1976; 38: 290-295.
10. A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. Series B 1977; 39: 1-38.
11. G. Jongbloed, The iterative convex minorant algorithm for nonparametric estimation. J. Comput. Graphical Stat. 1998; 7: 310-321.
12. L. Dümbgen, S. Freitag, and G. Jongbloed, Estimating a unimodal distribution from interval-censored data. Technical report, University of Bern, 2003.
13. J. Sun, A nonparametric test for interval-censored failure time data with application to AIDS studies. Stat. Med. 1996; 15: 1387-1395.
14. D. M. Finkelstein, A proportional hazards model for interval-censored failure time data. Biometrics 1986; 42: 845-854.
15. J. Huang, Efficient estimation for the proportional hazards model with interval censoring. Ann. Stat. 1996; 24: 540-568.
16. A. J. Rossini and A. A. Tsiatis, A semiparametric proportional odds regression model for the analysis of current status data. J. Amer. Stat. Assoc. 1996; 91: 713-721.
17. M. P. Fay, Rank invariant tests for interval censored data under the grouped continuous model. Biometrics 1996; 52: 811-822.
18. G. R. Petroni and R. A. Wolfe, A two-sample test for stochastic ordering with interval-censored data. Biometrics 1994; 50: 77-87.
19. G. F. Beadle, S. Come, C. Henderson, B. Silver, and S. A. H. Hellman, The effect of adjuvant chemotherapy on the cosmetic results after primary radiation treatment for early stage breast cancer. Int. J. Radiation Oncol. Biol. Phys. 1984; 10: 2131-2137.
20. G. F. Beadle, J. R. Harris, B. Silver, L. Botnick, and S. A. H. Hellman, Cosmetic results following primary radiation therapy for early breast cancer. Cancer 1984; 54: 2911-2918.
INTRACLASS CORRELATION COEFFICIENT
MARION K. CAMPBELL, University of Aberdeen, Aberdeen, UK
JEREMY M. GRIMSHAW, Ottawa Health Research Institute, Ottawa, Canada
GILDA PIAGGIO, World Health Organisation, Geneva, Switzerland

1 INTRODUCTION

The randomized controlled trial is the design of choice for evaluating healthcare interventions (1,2). Most commonly, randomization takes place at the level of the patient, in which individual patients are allocated to the different arms of the trial. Randomization by individual minimizes bias in study results and provides maximal efficiency regarding the precision of estimates and power of statistical tests. In some situations, however, allocation on an individual patient basis is problematic, for example, if there is a risk of contamination of interventions across trial groups (i.e., the unintentional transfer of elements of one intervention to participants allocated to another arm of the trial) (3). For example, if some patients in an inpatient ward were allocated randomly to receive charts to record symptoms to trigger particular actions and some patients were not, then it would be difficult to prevent those patients given the charts from talking to those patients not assigned charts. This latter group might then begin to note their symptoms, even without charts, thereby potentially distorting the results of the trial. In such situations, randomizing at some "higher" level may be desirable to obtain a more reliable estimate of the treatment effect (4). In this case, randomization by hospital ward would have been more appropriate. Similarly, when trials that evaluate dietary interventions are being designed, families are often randomized as an entire unit to avoid the possibility of different family members being assigned to different dietary interventions. This form of randomization at a higher level than the individual is known as a cluster-randomized (or group-randomized) design. Although randomization occurs at this higher level, observations are usually still made at the level of the individual. Cluster trials can adopt a completely randomized design (in which clusters are randomized to each intervention without restriction), a stratified design (in which similar clusters are grouped into strata and randomization to the interventions takes place within each stratum), or a paired design (in which clusters are paired and one cluster from each pair is randomized to each intervention).
Adoption of a cluster-randomized design, however, has implications for the design, conduct, and analysis of such studies. A fundamental assumption of an individually randomized trial is that the outcome for an individual patient is completely unrelated to that for any other patient (i.e., they are "independent"). This assumption no longer holds when cluster randomization is adopted, because patients within any one cluster are more likely to respond in a similar manner. For example, the management of patients in a single hospital is more likely to be consistent than management across several hospitals. This correlation among individuals within clusters has to be taken into account when planning and analyzing the cluster trial design. The statistical measure of this correlation is known as the intraclass or intracluster correlation coefficient (ICC). In this article, the measurement and impact of the ICC is discussed. The most common trial design, the completely randomized design, is assumed throughout.
2 THE INTRACLUSTER (OR INTRACLASS) CORRELATION COEFFICIENT Individuals within a cluster are usually more similar than individuals from different clusters. Because individuals within a cluster receive the same intervention and those who receive different interventions belong to different clusters, randomizing clusters implies
a loss in efficiency compared with randomizing individuals. The variability between units that receive the same intervention, which is the basis for estimating the uncontrolled variation, is thus greater than it would have been between individuals receiving the same intervention had individuals been randomized. The similarity between individuals within clusters is measured by the intracluster correlation coefficient (which is commonly referred to by the Greek letter rho, ρ).

2.1 Definition of the ICC

The ICC can be defined in general as the correlation between pairs of subjects within clusters, and thus it can vary between -1/(j - 1) and 1, where j is the number of clusters (5). However, the most commonly used definition of the ICC, which is appropriate within the cluster trial context, is based on the analysis of variance and specifically on the relationship of the between-cluster to the within-cluster variance. Using this definition, the ICC is calculated as the proportion of the total variation in the outcome that can be attributed to the difference between clusters:

ρ = σb² / (σb² + σw²)

where σb² is the between-cluster variance component and σw² is the within-cluster variance component (6). In this context, therefore, the ICC takes a value between 0 and 1.

2.2 Accounting for the ICC in Sample-Size Calculations

The presence of clustering within trial data has a direct impact on sample-size requirements for a trial, primarily because standard sample-size calculations assume that data within the trial are independent (which is not the case in a cluster trial). The size of the ICC has a direct impact on the efficiency of the sample size within the clustered design. For example, for a completely randomized cluster trial design, to achieve the equivalent power of a patient-randomized trial, standard sample-size estimates need to be inflated by a factor

1 + (n - 1)ρ
to accommodate the clustering effect, where n is the average cluster size and ρ is the estimated ICC (assuming the clusters are of a similar size) (7). (It is important to note that cluster size can be defined as either the number of available individuals within the cluster or the number of individuals within the cluster selected to take part in the trial.) This inflation factor is commonly known as the "design effect" or the "variance inflation factor" (8). The impact of randomization by cluster on the required sample size can be substantial, and as both the ICC and the cluster size influence the calculation, the design effect can be substantial even when the ICC seems numerically small. Table 1 shows the example of a trial designed to detect a change from 40% to 60% in a dichotomous outcome, with 80% power and a 5% significance level. For an individually randomized trial, a sample size of 194 patients would be required to detect this level of difference. If, however, cluster randomization was required because of the potential for contamination, sample-size estimates would have to be inflated to account for the ICC and the cluster size. Table 1 shows that, even at relatively modest levels of ICC (say ρ = 0.01), a doubling of the original estimate would not be unusual. Sample-size calculators have been developed to aid in sample size calculations for such trials, which also allow trade-offs between cluster sizes and number of clusters to be assessed (9,10).

Table 1. Impact of the ICC on Sample-Size Requirements
Example: to detect a change from 40% to 60%, with 80% power and a 5% significance level, assuming different levels of ICC and cluster size (under individual randomization a total of 194 patients would be required). Entries are the number of clusters (total number of individuals) required.

ICC     Cluster size 10    Cluster size 20    Cluster size 50    Cluster size 100
0.01    22 (220)           12 (240)           6 (300)            4 (400)
0.05    30 (300)           20 (400)           14 (700)           12 (1200)
0.10    38 (380)           30 (600)           24 (1200)          22 (2200)
0.15    46 (460)           38 (760)           34 (1700)          32 (3200)
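The arithmetic behind Table 1 can be reproduced with a few lines of code. The sketch below is an illustration, not one of the published calculators of references 9 and 10: it inflates the individually randomized requirement of 194 by the design effect and rounds up to an even number of clusters so that both arms receive whole clusters. That rounding convention is an assumption made here, but it appears to match the tabulated values.

```python
import math

def cluster_trial_size(n_individual, cluster_size, icc):
    """Inflate an individually randomized sample size for a completely
    randomized cluster design and convert it to whole clusters."""
    deff = 1 + (cluster_size - 1) * icc            # design effect
    inflated = n_individual * deff
    clusters = math.ceil(inflated / cluster_size)  # round up to whole clusters
    if clusters % 2:                               # keep an even split across two arms
        clusters += 1
    return clusters, clusters * cluster_size

for icc in (0.01, 0.05, 0.10, 0.15):
    row = [cluster_trial_size(194, m, icc) for m in (10, 20, 50, 100)]
    print(icc, row)
```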
2.3 Methods for Estimating the ICC

The aim of this section is to provide a brief overview for the reader of the most common approaches to estimating the ICC, providing references for additional detail and methods. Some of these estimation procedures can return ICC values of less than zero, but in such cases the ICC estimate should be truncated at zero. For the estimation of an ICC from continuous data, the most common method adopted is the analysis of variance (ANOVA) approach (6,11,12). This method draws on the fact that, as alluded to above, an ICC can be described as the relationship between the two components of variation in outcome in a cluster trial: the variation among individuals within a cluster and the variation between clusters. These two sources of variation can be identified separately using a standard ANOVA approach, by adopting a one-way ANOVA with the cluster-level variable as a random factor (13). For the ANOVA approach, assuming a two-level model (i.e., cluster and individual), the appropriate model can be expressed as

yij = µ + βj + eij

where yij is the response of the ith individual in the jth cluster, µ is the overall mean response, βj is the random effect of the jth cluster, and eij is the random error. It is assumed that the βj are independently and identically distributed with zero mean and constant variance σb², and that the eij are similarly independently and identically distributed with zero mean and constant variance σw². The following variance component estimates are then identified from the ANOVA table:

σ̂b² = (MSB - MSW) / n0 and σ̂w² = MSW

where MSB is the between-cluster mean square, MSW is the within-cluster mean square, and n0 is the "average" cluster size.
The "average" cluster size is calculated using

n0 = [N - (Σ nj²)/N] / (J - 1)

where J is the number of clusters, N is the total number of individuals, and nj is the number of individuals in the jth cluster. This version of the "average" cluster size is used over the arithmetic average, as the arithmetic mean estimate can lead to underestimation of the between-cluster variance component when the cluster sizes vary. The ICC can then be estimated as

ρ̂ = σ̂b² / (σ̂b² + σ̂w²)
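A minimal sketch of this ANOVA estimator is given below. The function and data are hypothetical illustrations of the formulas above rather than production code; it accepts clusters of unequal size, uses the "average" cluster size n0, and truncates a negative between-cluster component at zero.

```python
import numpy as np

def anova_icc(clusters):
    """clusters: list of 1-D arrays, one per cluster, of individual outcomes."""
    sizes = np.array([len(c) for c in clusters], dtype=float)
    J, N = len(clusters), sizes.sum()
    grand_mean = np.concatenate(clusters).mean()
    means = np.array([np.mean(c) for c in clusters])
    ss_between = np.sum(sizes * (means - grand_mean) ** 2)
    ss_within = sum(np.sum((np.asarray(c) - m) ** 2) for c, m in zip(clusters, means))
    msb = ss_between / (J - 1)                      # between-cluster mean square
    msw = ss_within / (N - J)                       # within-cluster mean square
    n0 = (N - np.sum(sizes ** 2) / N) / (J - 1)     # "average" cluster size
    sigma_b2 = max((msb - msw) / n0, 0.0)           # truncate negative estimates at zero
    return sigma_b2 / (sigma_b2 + msw)

# hypothetical clusters of unequal size
rng = np.random.default_rng(0)
data = [rng.normal(loc=mu, scale=1.0, size=k) for mu, k in [(0.0, 8), (0.5, 12), (1.0, 10)]]
print(round(anova_icc(data), 3))
```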
Most random effects modeling packages also produce estimates for these betweencluster σˆ b2 and within-cluster σˆ w2 components of variation directly as part of their standard output. These estimates can then be used directly in the formula to estimate the ICC. Random effects modeling also allows estimation of the ICC when covariates need to be accounted for in the model. Donner and Koval (11) note that using this derivation of the ICC requires no assumption that the dependent data be normally distributed. Donner and Koval (6) also describe an alternative approach to calculating the ICC from continuous data that can be calculated by assuming that the observations yij , are distributed about the same mean µ and variance σ 2 in such a way that two observations
within the same cluster have a common correlation ρ. This result adopts a maximum likelihood estimator (MLE) approach. Donner and Koval note that in samples of equal size n, this expression of the ICC is equivalent to the Pearson product-moment correlation as computed over all pairs of observations that can be constructed within clusters. In this article, Donner and Koval compared the performance of ANOVA and maximum likelihood estimates using Monte Carlo simulation. The results of the simulation suggested that for low to moderate ICCs (< 0.5), the ANOVA method performed well and provided accurate estimates of the ICC, especially when single-member clusters were excluded from the analysis. However, if the ICC was large (> 0.5), or completely unknown, then the maximum likelihood approach tended to produce more accurate estimates. For the estimation of an ICC from binary data, the ANOVA approach is again most commonly used, as it has also been shown not to require any strict distributional assumptions (14,11,12). Although the ANOVA estimate of the ICC may be used with dichotomous data, however, standard methods to calculate confidence intervals around this estimate do not hold in the dichotomous case. Moore and Tsiatis (15) and Feng and Grizzle (16) describe an alternative approach for use in the binary case—the moment estimator for dichotomous data. The moment estimator is based on an extension of the generalized Pearson statistic to estimate overdispersion. Ridout et al. (17) reviewed the performance of several methods for calculating ICCs from binary data and showed that the ANOVA method performed well. Feng and Grizzle (16) also showed that the ANOVA method and the moment estimator method were similar, but they noted that bias increases as the estimate of ρ increases and as the prevalence nears zero or one (estimates were lower than they should be).
3 THE MAGNITUDE OF THE ICC
As has been described above, robust estimates of ICCs are required for informed sample size calculations to be made.
3.1 What is the Magnitude of the ICC? Estimates on ICC size are available from several settings and for several outcomes. The ICC estimate varies depending on the type of cluster under examination (e.g., community, hospital, family, and so on). As the definition of the ICC suggests, ICCs are likely to be high if the variation between clusters is high and the variation within clusters is low. Gulliford et al. (18) published a range of ICCs suitable for use in community intervention studies based on a systematic analysis of the 1994 English Health Survey. This research showed that for community intervention trials where communities were at a regional or district level ICCs were typically lower than 0.01. However, as the definition of ‘‘community’’ was varied, so the typical range of ICC values varied. Where communities were defined at postcode/zipcode level, ICCs were generally lower than 0.05. When community was defined at a household level, however, ICCs ranged up to 0.3. This finding highlights well the contribution of the between-to-within cluster variation to the estimate of the ICC. In small units, such as households, within-cluster variation tends to be small so the between-cluster variation dominates, which leads to a higher estimate of ICC. For larger clusters, however, withincluster variation is relatively large; hence, the effect of the between-cluster variation is lessened, which leads to a smaller ICC estimate. Todd et al. (19) also showed how different design features could affect the size of the ICC in community-intervention trials. They showed that effective matching of clusters resulted in a marked lowering of the ICC. As the community trials referred to by Todd et al. (19) all involved large cluster sizes, however, reported ICCs, were generally lower than 0.01 irrespective of the chosen design. Individual community-intervention studies have also published several ICCs. For example, Hannan et al. (20) published ICCs from the Minnesota Health Program—these ranged typically from 0.002 to 0.012. A range of ICCs from other settings has also been identified. Murray and Blitstein (21) highlighted more than 20 papers that had reported ICCs from a range of settings,
which included workplace and school-based studies. Smeeth and Ng (22) provided a comprehensive analysis of ICCs for cluster trials that involved older people in primary care and observed ICCs that ranged from very small (<0.01) to 0.05. Parker et al. (23), Adams et al. (24), and Cosby et al. (25) also reported ICCs from primary care settings and observed that although most ICCs were very small some were higher (up to 0.2). Campbell et al. (26) provided estimates of the magnitude of ICCs in implementation research settings, and again they showed evidence of higher ICCs (up to 0.3) for process outcomes. Although several sources are available for ICC estimates, they are often based on a relatively small number of observations, and hence, confidence intervals around these estimates are often wide. 3.2 What Affects the Magnitude of the ICC? Campbell et al. (27) sought to identify potential determinants of the magnitude of ICCs in cluster trials. They examined a range of hypotheses (that had been generated through a survey of statisticians involved in the design of cluster trials and known experts in the field) across a range of datasets that involved 220 ICC estimates. The authors concluded that many factors affected the size of the ICC. In particular, the results suggested that if an ICC relates to a ‘‘process’’ rather than an ‘‘outcome’’ variable, then the ICC would be higher (the multivariate model suggested that, on average, the ICC would be 0.05 higher for a process variable). This result is likely observed because patient biological and compliance variability modifies the effect of interventions. Thus, although the process may be consistent within clusters, the clinical outcome within cluster is likely to be more variable. For example, although clinicians and organizations may practice according to similar guidelines (the process measure), patients are likely to be much more variable in how they comply and respond to that treatment guidance (the outcome measure). Similarly, the data showed that if an ICC represents data from secondary rather than community-based care, then the size of the ICC increased (on average 0.01 higher for a secondary care ICC compared with
a community-based ICC). Although numerically small, such differences can have a substantial effect on sample size, especially when the average cluster size is large. The influence of the other factors examined was less clear cut. For ICCs related to dichotomous variables, ICCs were shown to be higher for variables with mid-range prevalences compared with those for low- or high-prevalence variables. These differences are likely to be caused partly by known biases which can occur in the estimation of ICCs when the prevalence nears zero or one. Feng and Grizzle (16) showed that bias increases in the estimate of the ICC as the prevalence nears zero or one (ICCs were underestimated). This was particularly the case when ICCs are estimated from datasets with small numbers of clusters and small cluster sizes. This feature may also be observed because when prevalence is low, there is less scope for a high ICC unless several clusters have a prevalence that is relatively high compared with the others. Earlier research (13,28–30) has also shown that the larger the cluster size, the lower the observed ICC. It has been noted, however, that whereas the magnitude of the ICC tends to decline with cluster size, it does so at a relatively slow rate (8). 4
REPORTING THE ICC
Although the reporting of ICCs would seem to be increasing, there is little consistency in the information presented to describe the ICC. Empirically informed guidance on the appropriate reporting of the trial ICC estimate was published by Campbell et al. (31). This research suggested that three dimensions should be routinely reported to aid the interpretation of the ICC: • a description of the dataset (including
characteristics of the outcome and the intervention)—including information on the demographic distribution within and between clusters, a description of the outcome, and a structured description of the intervention. • information on how the ICC was calculated, which includes what statistical
method had been used to calculate the ICC, the software program used to calculate the ICC, what data were used to calculate the ICC (e.g., whether from control data and/or intervention data, preintervention, and/or post-intervention data, etc.), and whether any adjustment had been made for covariates • information on the accuracy of the ICC (including confidence intervals where possible) These recommendations were adopted in the CONSORT statement for the reporting of cluster randomized trials (32). 5 ACCOUNTING FOR THE ICC IN ANALYSIS The correlation between individuals should be accounted for in the analysis of cluster trials. If this correlation is ignored, then the analysis will yield P-values that are artificially low and confidence intervals that will be over-narrow, which increases the chances of identifying spuriously significant findings and misleading conclusions (8,33). Three main approaches can be used for the analysis of data that are generated from a clustered design—analysis at the cluster level, adjustment of standard tests to accommodate for the clustering in the data, or more advanced analysis using the data from both the cluster and the individual level (13,34). The traditional approach to the analysis of cluster-randomized trials was the cluster level approach, which consists of calculating a summary measure for each cluster, such as a cluster mean or proportion, and conducting a standard analysis with the cluster as the unit of analysis. This approach is appropriate when the inference is intended to be at the cluster level. When the inference is intended to be at the individual level, the traditional approach is no longer appropriate. Given more recent advances in statistical computing, the approach adopted in recent years in this situation has been to use the data available at both the individual and cluster level. These techniques, such as random effects (or multilevel modeling) and generalized estimating equations, incorporate the hierarchical
nature of the data directly into the analysis structure. They also allow the incorporation of individual and cluster-level characteristics and can handle more complex correlation structures. They do, however, require a larger number of clusters to be available for analysis; for example, at least 20 clusters per group are recommended for the use of generalized estimating equations (33). The most appropriate analysis option depends on several factors, which include factors such as the research question, including the level at which inference will be made, whether individual or cluster, the study design, the type of outcome (whether binary or continuous), and the number and size of clusters available. A worked example of the three approaches is presented by Mollison et al. (34). 6 SUMMARY In this article, we have shown that the ICC is of fundamental importance to the planning and execution of a cluster-randomized trial. It is the formal statistical measure of the impact of clustering and can have a substantial effect on sample-size requirements (even when the ICC estimate is numerically small). Different types of outcomes are likely to have different magnitudes of ICC, and it is important that this factor is considered when planning and analyzing any cluster-randomized trial. In this article, we concentrated our discussion on completely randomized cluster designs as they are the most commonly adopted designs in practice. In a completely randomized cluster design, the calculation and interpretation of the ICC is at its most straightforward— which is appropriate for an introductory article such as this. Variations to the calculation of the ICC and the adjustment for sample size calculations are available for more complex designs such as the stratified randomized cluster trial designs and the pair-matched cluster trial design. Readers should refer to more specialized texts (8,33) for discussion of these cases. REFERENCES 1. S. J. Pocock, Clinical Trials: A Practical Approach. Chichester, UK: Wiley, 1983.
2. B. Sibbald and M. Roland, Understanding controlled trials: why are randomized controlled trials important? BMJ 1998; 316: 201.
3. M. R. Keogh-Brown, M. O. Bachmann, L. Shepstone, C. Hewitt, A. Howe, C. R. Ramsay, F. Song, J. N. V. Miles, D. J. Torgerson, S. Miles, D. Elbourne, I. Harvey, and M. J. Campbell, Contamination in trials of educational interventions. Health Technol. Assess. 2007; 11: 43.
4. P. M. Fayers, M. S. Jordhoy, and S. Kaasa, Cluster-randomized trials. Palliative Med. 2002; 26: 69-70.
5. R. A. Fisher, Statistical Methods for Research Workers, 12th ed. Edinburgh, UK: Oliver and Boyd, 1954.
6. A. Donner and J. J. Koval, The estimation of intraclass correlation in the analysis of family data. Biometrics 1980; 36: 19-25.
7. A. Donner, N. Birkett, and C. Buck, Randomization by cluster. Sample size requirements and analysis. Am. J. Epidemiol. 1981; 114: 906-914.
8. A. Donner and N. Klar, Design and Analysis of Cluster Randomization Trials in Health Research, 1st ed. London: Arnold, 2000.
9. A. P. Y. Pinol and G. Piaggio, ACLUSTER Version 2.0. Geneva, Switzerland: World Health Organisation, 2000. Available: www.updatesoftware.com/publications/acluster.
10. M. K. Campbell, S. Thomson, C. R. Ramsay, G. S. MacLennan, and J. M. Grimshaw, Sample size calculator for cluster randomized trials. Comp. Biol. Med. 2004; 34: 113-125.
11. A. Donner and J. J. Koval, Design considerations in the estimation of intraclass correlation. Ann. Hum. Genet. 1982; 46: 271-277.
12. D. M. Murray, B. L. Rooney, P. J. Hannan, A. V. Peterson, D. V. Ary, A. Biglan, G. J. Botvin, R. I. Evans, B. R. Flay, R. Futterman, J. G. Getz, P. M. Marek, M. Orlandi, M. A. Pentz, C. L. Perry, and S. P. Schinke, Intraclass correlation among common measures of adolescent smoking: estimates, correlates, and applications in smoking prevention studies. Am. J. Epidemiol. 1994; 140: 1038-1050.
13. O. C. Ukoumunne, M. C. Gulliford, S. Chinn, J. A. C. Sterne, and P. G. J. Burney, Methods for evaluating area-wide and organization based interventions in health and health care: a systematic review. Health Technol. Assess. 2000; 3: 5.
14. A. Donner and A. Donald, Analysis of data arising from a stratified design with the cluster as unit of randomization. Stat. Med. 1987; 6: 43-52.
15. D. F. Moore and A. Tsiatis, Robust estimation of the variance in moment methods for extra-binomial and extra-poisson variation. Biometrics 1991; 47: 383–401. 16. Z. Feng and J. E. Grizzle, Correlated binomial variates: properties of estimator of intraclass correlation and its effect on sample size calculation. Stat. Med. 1992; 11: 1607–1614. 17. M. S. Ridout, C. G. B. Demetrio, and D. Firth, Estimating intraclass correlation for binary data. Biometrics 1999; 55: 137–148. 18. M. Gulliford, O. Ukuomunne, and S. Chinn, Components of variance and intraclass correlations for the design of community based surveys and intervention studies: data from the health survey for England 1994. Am. J. Epidemiol. 1999; 149: 876–883. 19. J. Todd, L. Carpenter, X. Li, J. Nakiyingi, R. Gray, and R. Hayes, The effects of alternative study designs on the power of community randomized trials: evidence from three studies of human immunodeficiency virus prevention in East Africa. Internat. J. Epidemiol. 2003; 32: 755–762. 20. P. J. Hannan, D. M. Murray, D. R. Jacobs, and P. G. McGovern, Parameters to aid in the design and analysis of community trials: intraclass correlations from the Minnesota Heart Health Program. Epidemiology 1994; 5: 88–95. 21. D. M. Murray and J. L. Blitstein Methods to reduce the impact of intraclass correlation in group-randomized trials. Eval. Rev. 2003; 27: 79–103. 22. L. Smeeth and E. S. Ng, Intraclass correlation coefficients for cluster randomized trials in primary care: data from the MRC trial of the assessment and management of older people in the community. Control. Clin. Trials 2002; 23: 409–421. 23. D. R. Parker, E. Evangelou, and C. B. Eaton, Intraclass correlation coefficients for cluster randomized trials in primary care: the cholesterol education and research trial (CEART). Contemp. Clin. Trials 2005; 26: 260–267. 24. G. Adams, M. C. Gulliford, O. C. Ukoumunne, S. Eldridge, S. Chinn, and M. J. Campbell, Patterns of intra-cluster correlation from primary care research to inform study design and analysis. J. Clin. Epidemiol. 2004; 57: 785–794. 25. R. H. Cosby, M. Howard, J. Kacorowski, A. R. Willan, and J. W. Sellors, Randomizing patients by family practice: sample size estimation, intracluster correlation and data analysis. Family Practice 2003; 20: 77–82.
26. M. K. Campbell, J. M. Mollison, and J. M. Grimshaw, Cluster trials in implementation research: estimation of intracluster correlation coefficients and sample size. Stat. Med. 2000; 20: 391–399. 27. M. K. Campbell, P. M. Fayers, and J. M. Grimshaw, Determinants of the intracluster correlation coefficient in cluster randomized trials. Clinical Trials 2005; 2: 99–107. 28. A. Donner, An empirical study of cluster randomization. Internat. J. Epidemiol. 1982; 11: 283–286. 29. W. W. Hauck, C. L. Gilliss, A. Donner, and S. Gortner, Randomization by cluster. Nursing Res. 1991; 40: 356–358. 30. O. Siddiqui, D. Hedeker, B. R. Flay, and F. B. Hu, Intraclass correlation estimates in a school-based smoking prevention study. Am. J. Epidemiol. 1996; 144: 425–433. 31. M. K. Campbell, J. M. Grimshaw, and D. R. Elbourne, Intracluster correlation coefficients in cluster randomized trials: empirical insights into how should they be reported. BMC Med. Res. Methodol. 2004; 4: 9. 32. M. K. Campbell, D. R. Elbourne, and D. G. Altman, for the CONSORT group, The CONSORT statement: extension to cluster randomized trials. BMJ 2004; 328: 702–708. 33. D. M. Murray, The Design and Analysis of Group Randomized Trials. Oxford, UK: Oxford University Press, 1998. 34. J. Mollison, J. A. Simpson, M. K. Campbell, and J. M. Grimshaw, Comparison of analytic methods for cluster randomized trials: an example from a primary care setting. J. Epidemiol. Biostat. 2000; 5: 339–348.
FURTHER READING S. M. Eldridge, D. Ashby, G. S. Feder, A. R. Rudnicka, and O. C. Ukoumunne, Lessons for cluster randomized trials in the twenty-first century: a systematic review of trials in primary care. Clinical Trials 2004; 1: 80–90. S. Killip, Z. Mahfoud, and K. Pearce, What is an intracluster correlation coefficient? Crucial concepts for primary care researchers. Ann. Family Med. 2004; 2: 204–208.
CROSS-REFERENCES Cluster randomization Group randomized trials
INTRARATER RELIABILITY
KILEM L. GWET
STATAXIS Consulting, Gaithersburg, Maryland

The notion of intrarater reliability will be of interest to researchers concerned about the reproducibility of clinical measurements. A rater in this context refers to any data-generating system, which includes individuals and laboratories; intrarater reliability is a metric for a rater's self-consistency in the scoring of subjects. The importance of data reproducibility stems from the need for scientific inquiries to be based on solid evidence. Reproducible clinical measurements are recognized as representing a well-defined characteristic of interest. Reproducibility is a source of concern because of the extensive manipulation of medical equipment in test laboratories and the complexity of the judgmental processes involved in clinical data gathering. Grundy (1) stresses the importance of choosing a good laboratory when measuring cholesterol levels to ensure their validity and reliability. This article discusses some basic methodological aspects related to intrarater reliability estimation. For continuous data, the intraclass correlation (ICC) is the measure of choice and will be discussed in the section entitled "Intrarater Reliability for Continuous Scores." For nominal data, the kappa coefficient of Cohen (2) and its many variants are the preferred statistics, and they are discussed in the section entitled "Nominal Scale Score Data." The last section is devoted to some extensions of kappa-like statistics aimed at intrarater reliability coefficients for ordinal and interval data.

1 INTRARATER RELIABILITY FOR CONTINUOUS SCORES

A continuous clinical measurement, such as blood pressure, will be considered reproducible if repeated measures taken by the same rater under the same conditions show a rater variation that is negligible compared with the subject variation. The ICC is the ratio of the between-subject variation (BSV) to the total variation [i.e., the sum of the BSV and the within-subject variation (WSV)], and it is the statistical measure most researchers have adopted for quantifying the intrarater reliability of continuous data. Note that the ICC reaches its maximum value of 1 when the WSV (i.e., the average variation for a subject) reaches its lower bound of 0, a situation indicating that any variation in the data is because the subjects are different and not because the rater is being inconsistent. Used previously as a measure of reliability by Ebel (3) and Bartko (4), the ICC has proved to be a valid measure of raters' self-consistency. Shrout and Fleiss (5) discuss various forms of the ICC as a measure of inter-rater reliability, which quantifies the extent of agreement between raters, as opposed to intrarater reliability, which measures self-consistency. The selection of a particular version of the Shrout-Fleiss ICCs is dictated by the design adopted for the intrarater reliability study. Lachin (6) also discusses various techniques for evaluating the quality of clinical trial data, which include the ICC among others.

1.1 Defining Intrarater Reliability

Ratings in a typical intrarater reliability study that involves m subjects and n replicates per subject are conveniently organized as shown in Table 1. The entry yij represents the ith replicate score that the rater assigned to subject j. This table may be transposed if the number of subjects is very large. The relationship between the ICC and analysis of variance (ANOVA) techniques motivated the proposed disposition of rows and columns of Table 1. The WSV is the average of the m subject-level variances Sj² calculated over the n replicates. More formally, the WSV is defined as follows:

WSV = (1/m) Σ Sj² (sum over j = 1, ..., m),   where Sj² = [1/(n - 1)] Σ (yij - ȳ·j)² (sum over i = 1, ..., n),   (1)

and ȳ·j is the average of the n replicate observations related to subject j.
Table 1. Scores Assigned to m Subjects with n Replicates per Subject

Observation    Subject 1    Subject 2    ...    Subject j    ...    Subject m
1              y11          y12          ...    y1j          ...    y1m
2              y21          y22          ...    y2j          ...    y2m
...            ...          ...          ...    ...          ...    ...
i              yi1          yi2          ...    yij          ...    yim
...            ...          ...          ...    ...          ...    ...
n              yn1          yn2          ...    ynj          ...    ynm
and \bar{y}_{\cdot j} is the average of the n replicate observations related to subject j. The BSV is obtained by taking the variance of the m subject-level mean scores, trimmed of the fraction of the WSV it contains. It is formally defined as follows:

\mathrm{BSV} = \mathrm{BSV}_o - \frac{\mathrm{WSV}}{n}, \quad \text{where } \mathrm{BSV}_o = \frac{1}{m-1} \sum_{j=1}^{m} \left( \bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot} \right)^2   (2)

and \bar{y}_{\cdot\cdot} is the overall mean of all mn observations. Using Equations (1) and (2), we define the intrarater reliability coefficient \hat{\gamma} (read "gamma hat," the hat indicating an estimate of the "true" parameter \gamma to be defined later) as follows:

\hat{\gamma} = \frac{\mathrm{BSV}}{\mathrm{BSV} + \mathrm{WSV}}   (3)

To illustrate the calculation of the ICC, let us consider the cholesterol level data of Table 2, which represent cholesterol levels taken from 10 individuals who participated in the 2005 National Health and Nutrition Examination Survey (7). For the sake of illustration, I assume that the data were collected on two occasions (times 1 and 2) by the same laboratory. Although Equations (1)–(3) can be used to obtain the ICC from the Table 2 data, the more convenient approach for computing the ICC will generally be to use an ANOVA procedure, either from Microsoft Excel (Microsoft Corporation, Redmond, WA) or from a standard statistical package such as SPSS (SPSS Inc., Chicago, IL) or SAS (SAS Institute Inc., Cary, NC). The ANOVA analysis will produce two mean squares (MS), known as the Mean Square for Treatments (MS_T), which is also referred to as the Mean Square for the model, and the Mean Square for Error (MS_E). The ICC can be expressed as a function of the two mean squares as follows:

\hat{\gamma} = \frac{MS_T - MS_E}{MS_T + (n-1) MS_E}   (4)

Using MS Excel's Analysis ToolPak and the cholesterol data, I created the output shown in Table 3, known as the ANOVA table, in which the column labeled "MS" contains the two mean squares (MS_T = 1323.56 and MS_E = 18.25) needed to compute the ICC.
Table 2. Total Cholesterol Measures (in mg/dL) Taken on 10 Subjects with 2 Replicates per Subject

Subject    1     2     3     4     5     6     7     8     9     10
Time 1     152   202   160   186   207   205   160   188   147   151
Time 2     155   210   156   200   214   209   163   189   146   153

Source: Reference 7.
Table 3. ANOVA Table Created with MS Excel from Table 2 Cholesterol Data

Source of Variation    SS           df    MS         F        P-value    F crit
Between Groups         11,912.05     9    1323.56    72.52    6.4E-08    3.02
Within Groups             182.50    10      18.25
Total                  12,094.55    19
Therefore, the intrarater reliability associated with the Table 2 data is computed as follows:

\hat{\gamma} = \frac{1323.56 - 18.25}{1323.56 + (2 - 1) \times 18.25} = 0.973
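For readers who prefer a scripted calculation to a spreadsheet, the following is a minimal sketch (not part of the original article) of the Equation (4) computation in Python, assuming NumPy and using the Table 2 cholesterol values.

import numpy as np

# rows = the two measurement occasions, columns = the 10 subjects (Table 2)
y = np.array([
    [152, 202, 160, 186, 207, 205, 160, 188, 147, 151],   # time 1
    [155, 210, 156, 200, 214, 209, 163, 189, 146, 153],   # time 2
], dtype=float)

n, m = y.shape                     # n replicates per subject, m subjects
subj_means = y.mean(axis=0)
grand_mean = y.mean()

ms_t = n * np.sum((subj_means - grand_mean) ** 2) / (m - 1)   # Mean Square for Treatments
ms_e = np.sum((y - subj_means) ** 2) / (m * (n - 1))          # Mean Square for Error

icc = (ms_t - ms_e) / (ms_t + (n - 1) * ms_e)                 # Equation (4)
print(round(ms_t, 2), round(ms_e, 2), round(icc, 3))          # 1323.56 18.25 0.973

The mean squares reproduce the Excel ANOVA output in Table 3, and the final value matches the hand calculation above.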
Consequently, the subject factor accounts for 97.3% of the total variation observed in the cholesterol data of Table 2, which is an indication of high intrarater reliability. This estimate gives us a sense of the reproducibility of cholesterol levels. However, it also raises questions about its accuracy and the steps that could be taken to improve it. These issues can only be addressed within a well-defined framework for statistical inference.

1.2 Statistical Inference

The primary objective of this section is to present a framework for statistical inference that will help answer the following fundamental questions about the intrarater reliability estimate:

• Is the obtained ICC sufficiently accurate?
• Can the obtained ICC be considered valid?
• Have we used a sufficiently large number of replicates?
• Have we used a sufficiently large number of subjects?
• Can the data be collected by multiple raters?

These questions can be addressed only if the theoretical framework provides the following two key components:

1. The definition of a population parameter γ (i.e., gamma without the hat) that represents the "true" unknown intrarater reliability being measured
2. Methods for evaluating the precision of the proposed statistic \hat{\gamma} with respect to the parameter of interest γ

Let us consider the simple scenario in which y_ij results from the additive effect of a common score µ, the subject effect t_j, and an error ε_ij committed on the ith replicate score of subject j. This relation is mathematically expressed as follows:

y_{ij} = \mu + t_j + \varepsilon_{ij}, \quad i = 1, \ldots, n \text{ and } j = 1, \ldots, m   (5)

This is a single-factor ANOVA model that goes along with the following assumptions:

• The m subjects that participate in the intrarater reliability study form a random sample selected from a larger population of subjects of interest. Moreover, t_j is a normally distributed random variable with mean 0 and variance \sigma_t^2 > 0.
• The error \varepsilon_{ij} is a normally distributed random variable with mean 0 and variance \sigma^2 > 0, and it is independent of t_j.

A small value for \sigma^2 will lead to a small variation between replicate scores, which in turn should lead to a high intrarater reliability. Therefore, the theoretical parameter that represents the intrarater reliability, γ, is defined as follows:

\gamma = \frac{\sigma_t^2}{\sigma_t^2 + \sigma^2}   (6)

which is one of the parameters studied by McGraw and Wong (8).
It follows from Equation (5) that the denominator of γ is the total variation in the scores, so Equation (6), the most popular form of the ICC found in the statistical literature, represents the portion of total variance that is accounted for by the between-subject variance. The statistic \hat{\gamma} of Equation (3) is a sample-based approximation of γ that is widely accepted in the statistical literature.

We now present a method for evaluating the precision of the statistic \hat{\gamma}. How close is \hat{\gamma} to γ? To answer this question, we construct a 95% confidence interval around γ, that is, a range of values expected to contain the unknown γ with 95% certainty. Constructing a 95% confidence interval for γ requires the calculation of the 2.5th and 97.5th percentiles of the F distribution with m − 1 and m(n − 1) degrees of freedom. These two percentiles are denoted by F_{0.975,m-1,m(n-1)} and F_{0.025,m-1,m(n-1)}, respectively, where the subscripts 0.975 = 1 − (1 − 0.95)/2 and 0.025 = (1 − 0.95)/2 refer to upper-tail areas. Although textbooks' statistical tables provide these percentiles, they are also readily obtained from MS Excel: "=FINV(0.025,m−1,m*(n−1))" gives the 97.5th percentile, whereas "=FINV(0.975,m−1,m*(n−1))" gives the 2.5th percentile. Let F_o be defined as follows:

F_o = \frac{\mathrm{BSV}_o}{\mathrm{WSV}/n} = \frac{MS_T}{MS_E}   (7)

The 95% confidence interval for γ is obtained by noting that P(L_{95} \leq \gamma \leq U_{95}) = 0.95, where

L_{95} = \frac{F_o / F_{0.025,m-1,m(n-1)} - 1}{F_o / F_{0.025,m-1,m(n-1)} + (n - 1)} \quad \text{and} \quad U_{95} = \frac{F_o / F_{0.975,m-1,m(n-1)} - 1}{F_o / F_{0.975,m-1,m(n-1)} + (n - 1)}   (8)

The 95% confidence interval width is given by:

W_{95} = U_{95} - L_{95}   (9)

The design of an intrarater reliability study must aim at minimizing the magnitude of the confidence interval width. The optimal values for the number m of subjects and the number n of replicates per subject are those that minimize W_{95} of Equation (9).
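A minimal sketch (not part of the original article) of the Equation (7)–(9) computation in Python, assuming SciPy for the F quantiles; the function name and the rounding in the example call are illustrative.

from scipy.stats import f

def icc_confint(ms_t, ms_e, m, n, alpha=0.05):
    f0 = ms_t / ms_e                              # Equation (7)
    df1, df2 = m - 1, m * (n - 1)
    f_hi = f.ppf(1 - alpha / 2, df1, df2)         # 97.5th percentile
    f_lo = f.ppf(alpha / 2, df1, df2)             # 2.5th percentile
    lower = (f0 / f_hi - 1) / (f0 / f_hi + n - 1)
    upper = (f0 / f_lo - 1) / (f0 / f_lo + n - 1)
    return lower, upper

# cholesterol example: MS_T = 1323.56, MS_E = 18.25, m = 10, n = 2
print(icc_confint(1323.56, 18.25, m=10, n=2))     # roughly (0.90, 0.99)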
1.3 Optimizing the Design of the Intrarater Reliability Study

An intrarater reliability study is well designed if the number of observations is determined so as to minimize the confidence interval length and the experimental error is kept small. This section addresses the following two questions:

1. What is the optimal number m of subjects and number n of replicates per subject?
2. Can the intrarater reliability study involve two raters or more?

1.3.1 Sample Size Determination. Finding the optimal number of subjects and replicates per subject based on the confidence interval length is the approach Giraudeau and Mary (9) used to propose guidelines for planning a reproducibility study. Let ω_{95} be the expected width of the 95% confidence interval associated with γ. Note that \sigma^2 F_o/(\sigma^2 + n\sigma_t^2) = F_o/(1 + n\gamma/(1 - \gamma)) follows the F distribution with m − 1 and m(n − 1) degrees of freedom, where F_o is defined by Equation (7). Because W_{95} defined by Equation (9) is a function of F_o, its expected value ω_{95} is a function of γ. The relationship between ω_{95} and γ is depicted in Figs. 1, 2, and 3 for various values of m and n. The ω_{95} values are calculated using a Monte Carlo simulation approach because of the difficulty of deriving a mathematical expression for the probability distribution of W_{95}. For values of γ that vary from 0 to 1 in steps of 0.05, and for various combinations of (m, n), we simulated 10,000 observations from the F distribution with m − 1 and m(n − 1) degrees of freedom and calculated 10,000 confidence intervals using Equation (8). The mean length of the 10,000 intervals was used as an estimate of ω_{95}. Each of the three figures contains two plots, and each plot shows how different values of m and n affect the relationship between γ and ω_{95} for a fixed total number of observations mn. For the two plots of Fig. 1, the total sample sizes mn are 20 and 40. For Fig. 2, the total sample sizes are 60 and 80, whereas Fig. 3's plots are based on sample sizes of 100 and 120.
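A minimal sketch (not part of the original article) of this Monte Carlo calculation in Python, assuming NumPy and SciPy, and assuming that the simulated F_o values are obtained by scaling F(m − 1, m(n − 1)) variates by 1 + nγ/(1 − γ), as noted above; the function name and the example call are illustrative.

import numpy as np
from scipy.stats import f

def expected_width(gamma, m, n, reps=10_000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    df1, df2 = m - 1, m * (n - 1)
    f_hi = f.ppf(1 - alpha / 2, df1, df2)                 # 97.5th percentile
    f_lo = f.ppf(alpha / 2, df1, df2)                     # 2.5th percentile
    f0 = (1 + n * gamma / (1 - gamma)) * rng.f(df1, df2, size=reps)
    lower = (f0 / f_hi - 1) / (f0 / f_hi + n - 1)         # Equation (8)
    upper = (f0 / f_lo - 1) / (f0 / f_lo + n - 1)
    return np.mean(upper - lower)                         # estimate of omega_95

print(expected_width(0.8, m=20, n=2))    # one point on the mn = 40 curve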
[Figure 1. Expected width of the 95% confidence interval as a function of γ for m and n values that correspond to mn = 20 [curves for (m,n) = (10,2) and (5,4)] and mn = 40 [curves for (m,n) = (20,2), (10,4), and (8,5)].]

[Figure 2. Expected width of the 95% confidence interval as a function of γ for m and n values that correspond to mn = 60 [curves for (m,n) = (30,2), (20,3), (15,4), and (12,5)] and mn = 80 [curves for (m,n) = (40,2), (20,4), (16,5), and (10,8)].]

[Figure 3. Expected width of the 95% confidence interval as a function of γ for m and n values that correspond to mn = 100 [curves for (m,n) = (50,2), (25,4), (20,5), and (10,10)] and mn = 120 [curves for (m,n) = (60,2), (40,3), (30,4), and (24,5)].]
All three figures tend to indicate that for high intrarater reliability coefficients (i.e., greater than 0.5) and a fixed total number of observations mn, using 2, 3, or at most 4 replicates per subject is sufficient to obtain the most efficient intrarater reliability coefficient. Having more than four replicates is likely to lead to a loss of precision. For smaller γ values, the recommendation is to use four or five replicates per subject. One would also note that if the "true" value of the intrarater reliability is smaller than 0.80, then its estimate will generally not be very precise.

1.3.2 Blocking the Rater Factor. If two raters or more are used in a completely randomized intrarater reliability experiment, the resulting coefficient will be inaccurate. In a completely randomized design, subjects and replicates are assigned randomly to different raters. Consequently, the rater effect will increase the experimental error, which thereby decreases the magnitude of the intrarater reliability coefficient. This problem can be resolved by designing the experiment so that the rater effect can be measured and separated from the experimental error. A design that permits this requires each rater to rate all subjects and provide the same number of replicates per subject. Under this design, referred to as a Randomized Block Design (RBD), the data are gathered by block (i.e., by rater in this case) with random rating within the block, and they would be organized as shown in Table 4, where r is the number of raters. In Table 4, y_ijk represents the kth replicate observation on subject i provided by rater j. The intrarater reliability coefficient \hat{\gamma} under an RBD design is still defined by Equation (3), with the exception that the within-subject variation (WSV) and the between-subject variation (BSV) are defined as follows:

\mathrm{WSV} = \frac{1}{mr} \sum_{i=1}^{m} \sum_{j=1}^{r} S_{ij}^2, \quad \text{and} \quad \mathrm{BSV} = \mathrm{BSV}_o - \frac{\mathrm{WSV}}{nr},   (10)

where

S_{ij}^2 = \frac{1}{n-1} \sum_{k=1}^{n} (y_{ijk} - \bar{y}_{ij\cdot})^2 \quad \text{and} \quad \mathrm{BSV}_o = \frac{1}{m-1} \sum_{i=1}^{m} (\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot})^2;

\bar{y}_{ij\cdot} is the average of the n scores rater j assigned to subject i, \bar{y}_{i\cdot\cdot} is the average of the nr scores assigned to subject i, and \bar{y}_{\cdot\cdot\cdot} is the overall mean score.
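A minimal sketch (not part of the original article) of the Equation (10) computation in Python, assuming NumPy and a complete data set stored as an array of shape (m, r, n); the simulated scores in the example are hypothetical.

import numpy as np

def rbd_icc(y):
    y = np.asarray(y, dtype=float)
    m, r, n = y.shape
    cell_var = y.var(axis=2, ddof=1)          # S_ij^2 for each (subject, rater) cell
    wsv = cell_var.mean()                     # average of the m*r cell variances
    subj_means = y.mean(axis=(1, 2))          # subject-level means
    bsv_o = np.sum((subj_means - y.mean()) ** 2) / (m - 1)
    bsv = bsv_o - wsv / (n * r)
    return bsv / (bsv + wsv)                  # Equation (3) with the Equation (10) pieces

# hypothetical data: 4 subjects, 2 raters, 3 replicates per subject and rater
rng = np.random.default_rng(1)
true_levels = np.array([100.0, 120.0, 140.0, 160.0])[:, None, None]
scores = true_levels + rng.normal(scale=5.0, size=(4, 2, 3))
print(round(rbd_icc(scores), 2))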
Table 4. Intrarater Reliability Data on m Subjects with r Raters and n Replicates per Subject and per Rater

Subject    Rater 1               ...    Rater j               ...    Rater r
1          y111, ..., y11n       ...    y1j1, ..., y1jn       ...    y1r1, ..., y1rn
2          y211, ..., y21n       ...    y2j1, ..., y2jn       ...    y2r1, ..., y2rn
...        ...                   ...    ...                   ...    ...
i          yi11, ..., yi1n       ...    yij1, ..., yijn       ...    yir1, ..., yirn
...        ...                   ...    ...                   ...    ...
m          ym11, ..., ym1n       ...    ymj1, ..., ymjn       ...    ymr1, ..., ymrn
Note that the WSV is obtained by averaging the mr sample variances calculated at the cell level. Equation (10) offers the advantage of removing any influence of inter-rater variation when calculating the intrarater reliability. The number of replicates in an RBD design may vary by rater and by subject; we assume it to be fixed in this section for the sake of simplicity. Although a single rater is sufficient to carry out an intrarater reliability experiment, the use of multiple raters may be recommended for burden reduction or for convenience. The techniques and the inferential framework discussed in this section work well for continuous data, such as the cholesterol levels, but they are not suitable for nominal data. In the next section, I present some techniques specifically developed for nominal data.

2 NOMINAL SCALE SCORE DATA

Although the ICC is effective for quantifying the reproducibility of continuous data, nominal data raise new statistical problems that warrant the use of alternative methods. Rating subjects on a nominal scale amounts to classifying them into one of q possible response categories. The discrete nature of the data has the following two implications:

1. The notion of reproducibility is exact. A response category is reproduced when the initial and the replicate categories are identical; unlike continuous data, nominal data are not subject to random measurement errors.
2. A rater may classify a subject on two occasions into the exact same category by pure chance, with a probability that is non-negligible.

Table 5 shows the distribution of 100 individuals with identified pulmonary abnormalities examined by a medical student on two occasions. On both occasions, the medical student found the same 74 individuals with pulmonary abnormalities and the same 15 individuals without any abnormalities. However, the student disagreed with himself on 11 individuals. These data, which are a modification of the results reported by Mulrow et al. (10), show how analysts may organize intrarater reliability data, and they are used later in this section for illustration purposes.

For intrarater reliability experiments based on two replicates per subject, analysts may organize the observations as shown in Table 6, where m is the number of subjects rated and m_kl is the number of subjects classified in category k on the first occasion and in category l on the second occasion. If the experiment uses three replicates per subject or more, then a more convenient way to organize the data is shown in Table 7, where n is the number of replicates per subject and n_ik is the number of times subject i is classified into category k.

2.1 Intrarater Reliability: Single Rater and Two Replications

When ratings come from a simple experiment based on a single rater, two replicates per subject, and two categories, such as described in Table 5, the kappa coefficient of Cohen (2) or an alternative kappa-like statistic may be used to estimate the intrarater reliability coefficient.
Table 5. Distribution of 100 Subjects with Respect to the Presence of Pulmonary Abnormalities Observed on Two Occasions by a Medical Student

                         Second Observation
First Observation    Present    Absent    Total
Present                  74          1       75
Absent                   10         15       25
Total                    84         16      100
The medical student who generated the Table 5 data could have obtained some of the 89 matches by pure chance, because the number of response categories is limited to two. Consequently, 89% will overestimate the student's self-consistency. Cohen's (2) solution to this problem was a chance-corrected agreement measure \hat{\gamma}_\kappa, known in the literature as kappa and defined as follows:

\hat{\gamma}_\kappa = \frac{p_a - p_e}{1 - p_e}   (11)

where, for the Table 5 data, p_a = (74 + 15)/100 = 0.89 is the overall agreement probability and p_e = 0.75 × 0.84 + 0.25 × 0.16 = 0.67 is the chance-agreement probability. Consequently, the kappa coefficient that measures the medical student's intrarater reliability is \hat{\gamma}_\kappa ≈ 0.67. According to the Landis and Koch (11) benchmark, a kappa value of this magnitude is deemed substantial.

In a more general setting with m subjects, two replicates per subject, and an arbitrary number q of response categories (see Table 6), the kappa coefficient of Cohen (2) is still defined by Equation (11), except that the overall agreement probability p_a and the chance-agreement probability p_e are respectively defined as follows:

p_a = \sum_{k=1}^{q} p_{kk}, \quad \text{and} \quad p_e = \sum_{k=1}^{q} p_{k+} p_{+k}   (12)

where p_{kk} = m_{kk}/m, p_{+k} = m_{+k}/m, and p_{k+} = m_{k+}/m. The overall agreement probability is the proportion of subjects classified into the exact same category on both occasions (i.e., the diagonal of Table 6). The kappa coefficient will at times yield unduly low values when the ratings suggest high reproducibility. Cicchetti and Feinstein (12), as well as Feinstein and Cicchetti (13), have studied these unexpected results, known in the literature as the kappa paradoxes. Several alternative, more paradox-resistant coefficients are discussed by Brennan and Prediger (14). A Brennan-Prediger alternative denoted by \hat{\gamma}_{GI}, often referred to as the G-Index (GI), should be considered by practitioners and is defined as follows:

\hat{\gamma}_{GI} = \frac{p_a - 1/q}{1 - 1/q}   (13)

Applied to the Table 5 data, the Brennan-Prediger coefficient becomes \hat{\gamma}_{GI} = (0.89 − 0.5)/(1 − 0.5) = 0.78, which is higher than the kappa.
Table 6. Distribution of m Subjects by Response Category and Replication Number

First-Replication    Second-Replication Response Category
Category             1      ...    k      ...    q      Total
1                    m11    ...    m1k    ...    m1q    m1+
...                  ...    ...    ...    ...    ...    ...
k                    mk1    ...    mkk    ...    mkq    mk+
...                  ...    ...    ...    ...    ...    ...
q                    mq1    ...    mqk    ...    mqq    mq+
Total                m+1    ...    m+k    ...    m+q    m
Aickin (15) presents an interesting discussion of kappa-like intrarater reliability coefficients and suggests the use of his α coefficient. The α coefficient is based on a sound theory and uses maximum likelihood estimation of some of its components, obtained with a computationally intensive iterative algorithm. Gwet (16) proposed the AC1 statistic as a simple way to resolve the kappa paradoxes. The AC1 coefficient is defined as follows:

\hat{\gamma}_{AC1} = \frac{p_a - p_e^{(1)}}{1 - p_e^{(1)}}   (14)

where p_a is defined by Equation (12) and the chance-agreement probability is as follows:

p_e^{(1)} = \frac{1}{q-1} \sum_{k=1}^{q} p_k (1 - p_k), \quad \text{where } p_k = (p_{k+} + p_{+k})/2   (15)

For the Table 5 data, p_1 = (0.75 + 0.84)/2 = 0.795 and p_2 = (0.25 + 0.16)/2 = 0.205 = 1 − p_1. Consequently, Gwet's chance-agreement probability is p_e^{(1)} = 2 × 0.795 × 0.205 = 0.32595. The AC1 statistic is then given by \hat{\gamma}_{AC1} = (0.89 − 0.32595)/(1 − 0.32595) ≈ 0.84, which is more in line with the observed extent of reproducibility. Gwet (16) extensively discusses the statistical properties of the AC1 statistic as well as the origins of the kappa paradoxes.
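A minimal sketch (not part of the original article) of Equations (11)–(15) in Python, assuming NumPy and using the Table 5 counts; variable names are illustrative.

import numpy as np

counts = np.array([[74, 1],
                   [10, 15]], dtype=float)            # Table 5

p = counts / counts.sum()
pa = np.trace(p)                                      # overall agreement probability
row, col = p.sum(axis=1), p.sum(axis=0)
q = counts.shape[0]

pe_kappa = np.sum(row * col)                          # Equation (12)
kappa = (pa - pe_kappa) / (1 - pe_kappa)              # Equation (11)

g_index = (pa - 1 / q) / (1 - 1 / q)                  # Equation (13)

pk = (row + col) / 2                                  # Equation (15)
pe_ac1 = np.sum(pk * (1 - pk)) / (q - 1)
ac1 = (pa - pe_ac1) / (1 - pe_ac1)                    # Equation (14)

print(round(kappa, 2), round(g_index, 2), round(ac1, 2))   # 0.67 0.78 0.84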
2.2 Intrarater Reliability: Single Rater and Multiple Replications

Using more than two replicates per subject can improve the precision of an intrarater reliability coefficient. The techniques discussed in this section generalize those of the previous section, and they are suitable for analyzing Table 7 data that involve three replicates or more per subject. All kappa-like statistics presented in the previous section can still be used with Table 7 data. However, the overall agreement probability p_a is now defined as the probability that two replicates randomly chosen from the n replicates associated with a randomly selected subject are identical. More formally, p_a is defined as follows:

p_a = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{q} \frac{n_{ik}(n_{ik} - 1)}{n(n - 1)}   (16)

Concerning the calculation of the chance-agreement probability, several versions have been proposed in the literature, most of which are discussed by Conger (17) in the context of inter-rater reliability rather than intrarater reliability. Fleiss (18) suggested that the chance-agreement probability be estimated as follows:

p_e^{(F)} = \sum_{k=1}^{q} p_k^2, \quad \text{where } p_k = \frac{1}{m} \sum_{i=1}^{m} \frac{n_{ik}}{n}   (17)

Note that p_k represents the relative number of times that a subject is classified into category k. Fleiss' generalized kappa is then given by:

\hat{\gamma}_F = (p_a - p_e^{(F)}) / (1 - p_e^{(F)})
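A minimal sketch (not part of the original article) of Equations (16) and (17) and of Fleiss' generalized kappa in Python, assuming NumPy and a subject-by-category count table laid out as in Table 7; the data in the example call are hypothetical.

import numpy as np

def fleiss_generalized_kappa(nik):
    nik = np.asarray(nik, dtype=float)
    m, q = nik.shape
    n = nik[0].sum()                          # replicates per subject (assumed constant)
    pa = np.mean(np.sum(nik * (nik - 1), axis=1) / (n * (n - 1)))   # Equation (16)
    pk = nik.sum(axis=0) / (m * n)            # Equation (17)
    pe = np.sum(pk ** 2)
    return (pa - pe) / (1 - pe)

# hypothetical data: 5 subjects, 3 replicates each, 3 categories
print(round(fleiss_generalized_kappa([[3, 0, 0], [2, 1, 0], [0, 3, 0],
                                      [1, 1, 1], [0, 0, 3]]), 3))    # ≈ 0.493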
Table 7. Frequency Distribution of mn Observations by Subject and Response Category

           Response Category
Subject    1      ...    k      ...    q      Total
1          n11    ...    n1k    ...    n1q    n
...        ...    ...    ...    ...    ...    ...
i          ni1    ...    nik    ...    niq    n
...        ...    ...    ...    ...    ...    ...
m          nm1    ...    nmk    ...    nmq    n
Total      n+1    ...    n+k    ...    n+q    mn
Conger (17) criticized Fleiss' generalized kappa statistic for not reducing to Cohen's kappa when the number of replicates is limited to two and proposed the following chance-agreement probability:

p_e^{(C)} = \sum_{k=1}^{q} p_k^2 - \sum_{k=1}^{q} s_k^2 / n   (18)

where s_k^2 is the sample variance

s_k^2 = \frac{1}{n-1} \sum_{j=1}^{n} (\theta_{jk} - \bar{\theta}_{\cdot k})^2   (19)

where \theta_{jk} = m_{jk}/m is the proportion of subjects classified into category k on the jth occasion, and \bar{\theta}_{\cdot k} is the average of these n values. To compute the variances s_k^2, k = 1, \ldots, q, it can be useful to organize the ratings as in Table 8.

Both the Fleiss and Conger versions of kappa are vulnerable to the paradox problem previously discussed, and they yield reasonable intrarater reliability coefficients only when p_k, the propensity for classification in category k, remains fairly constant from category to category. The generalized version of the AC1 statistic of Gwet (16) is a more paradox-resistant alternative to kappa; it is based on Equation (14), with the exception that the chance-agreement probability is defined as follows:

p_e^{(1)} = \frac{1}{q-1} \sum_{k=1}^{q} p_k (1 - p_k)   (20)

where p_k is defined as in Equation (17). The situation where intrarater reliability data are collected by multiple raters may occur in practice, and it should be dealt with using special methods that eliminate the impact of inter-rater variation. The general approach consists of averaging various probabilities calculated independently for each rater, as previously discussed.

2.3 Statistical Inference

For an intrarater reliability coefficient to be useful, it must be computed with an acceptable level of precision; this notion can be defined and measured only within a formal framework for statistical inference. This section gives an overview of the main inferential approaches proposed in the literature and provides references for further inquiries. Several authors have proposed frameworks for statistical inference based on various theoretical models. Kraemer et al. (19) review many models that have been proposed in the literature. Kraemer (20) proposed a model under which the kappa coefficient can be interpreted as an intraclass correlation. Donner and Eliasziw (21), Donner and Klar (22), and Aickin (15) have proposed different models that may be useful in different contexts. This model-based approach poses two important problems for practitioners. The first problem stems from the difficulty of knowing which model is most appropriate for a particular situation. The second problem is the dependency of inferential procedures on the validity of the hypothesized model. Fortunately, a different approach to inference, based on finite population sampling and widely used in the social sciences, can resolve both problems.
Table 8. Frequency Distribution of mn Observations by Replicate Number and Response Category

               Response Category
Replication    1      ...    k      ...    q      Total
1              m11    ...    m1k    ...    m1q    m
...            ...    ...    ...    ...    ...    ...
j              mj1    ...    mjk    ...    mjq    m
...            ...    ...    ...    ...    ...    ...
n              mn1    ...    mnk    ...    mnq    m
Total          m+1    ...    m+k    ...    m+q    mn
The randomization approach, or design-based inference, is a statistical inference framework in which the underlying random process is created by the random selection of m subjects out of a predefined finite population of M subjects of interest. This approach is described in textbooks such as Kish (23) and Cochran (24), and it has been used extensively in the context of inter-rater reliability assessment by Gwet (16, 25). The variances of many of the intrarater reliability coefficients presented in the second section can be found in Gwet (16, 25).

The first two sections present various approaches for evaluating the reproducibility of continuous and nominal data. These approaches are not recommended for ordinal or interval data, although ordinal clinical measurements, such as the presence (no, possible, probable, definite) of a health condition as read on a radiograph, are commonplace. The objective of the next section is to present a generalization of kappa suitable for use with ordinal and interval data.

3 ORDINAL AND INTERVAL SCORE DATA
Berry and Mielke (26), as well as Janson and Olsson (27, 28), have generalized the kappa statistic to handle ordinal and interval data. In addition to being applicable to ordinal and interval data, these extensions can analyze multivariate data of subjects rated on multiple characteristics. Although Berry and Mielke (26) deserve credit for introducing the notions of vector score and Euclidean distance behind these extensions, Janson and Olsson (27) improved and expanded them substantially.

Let us consider a simple intrarater reliability study in which a rater must rate all five subjects (m = 5) on two occasions (n = 2) on a three-level nominal scale (q = 3). If the rater classifies subject 1 into category 2 on the first occasion, then the corresponding score can be represented as a vector a_11 = (0, 1, 0), with the digit "1" in the second position indicating the category number into which the subject is classified. The vector score associated with the classification of subject 1 into category 3 on the second occasion is a_12 = (0, 0, 1). The squared Euclidean distance between a_11 and a_12 is obtained by summing all three squared differences between the elements of both vectors and is given by:

d^2(a_{11}, a_{12}) = (0 - 0)^2 + (0 - 1)^2 + (1 - 0)^2 = 2

Following Janson and Olsson (27), Cohen's kappa coefficient can be re-expressed as follows:

\hat{\gamma}_{JO} = 1 - \frac{\frac{1}{5} \sum_{i=1}^{5} d^2(a_{i1}, a_{i2})}{\frac{1}{5^2} \sum_{i=1}^{5} \sum_{j=1}^{5} d^2(a_{i1}, a_{j2})}   (21)

The kappa coefficient as written in Equation (21) depends solely on several distance functions. Its generalization relies on the distance function's ability to handle ordinal and interval data. If the scoring is carried out on a three-level ordinal scale, then each score will represent a single rank instead of a three-dimensional vector of 0s and 1s. If the categories in Table 6 are ordinal, then Equation (21) can be adapted to that data and yields the following more efficient kappa coefficient:

\hat{\gamma}_{JO} = 1 - \frac{\sum_{k=1}^{q} \sum_{l=1}^{q} p_{kl} (k - l)^2}{\sum_{k=1}^{q} \sum_{l=1}^{q} p_{k+} p_{+l} (k - l)^2}   (22)

To illustrate the use of kappa with ordinal scores, let us consider the Table 9 data, which are a modification of the initial chest radiograph data that Albaum et al. (29) analyzed. A radiologist examined 100 initial chest radiographs on two occasions to determine the presence of a radiographic pulmonary infiltrate. The four levels of the measurement scale for this experiment are "No," "Possible," "Probable," and "Definite." Because classifications of radiographs into the "Probable" and "Definite" categories agree more often than those in the "No" and "Definite" categories, the use of the classic kappa of Cohen (2) will dramatically underestimate the intrarater reliability. Cohen's kappa for Table 9 is given by \hat{\gamma}_\kappa = (0.57 − 0.3151)/(1 − 0.3151) ≈ 0.37.
Table 9. Distribution of 100 Subjects by Presence of Radiographic Pulmonary Infiltrate and Assessment Time

Radiographic Assessment    Radiographic Assessment in Time 2
in Time 1                  No    Possible    Probable    Definite    Total
No                          6        7           2           1         16
Possible                    2        7           6           2         17
Probable                    2        4           7           5         18
Definite                    1        4           7          37         49
Total                      11       22          22          45        100
The generalized version of kappa based on Equation (22) yields an intrarater reliability coefficient of \hat{\gamma}_{JO} = 1 − 0.89/2.41 = 0.63. This generalized version of kappa thus yields a substantially higher intrarater reliability coefficient and accounts for partial agreement in a more effective way.
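A minimal sketch (not part of the original article) of the Equation (22) computation in Python, assuming NumPy and using the Table 9 counts with squared-distance weights (k − l)^2 on the ordered categories.

import numpy as np

counts = np.array([[ 6,  7,  2,  1],
                   [ 2,  7,  6,  2],
                   [ 2,  4,  7,  5],
                   [ 1,  4,  7, 37]], dtype=float)    # Table 9

p = counts / counts.sum()
row, col = p.sum(axis=1), p.sum(axis=0)
k = np.arange(counts.shape[0])
d2 = (k[:, None] - k[None, :]) ** 2                   # squared rank distances

observed = np.sum(p * d2)                             # numerator of Equation (22)
expected = np.sum(np.outer(row, col) * d2)            # denominator of Equation (22)
kappa_ordinal = 1 - observed / expected
print(round(kappa_ordinal, 2))                        # 0.63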
4 CONCLUDING REMARKS
This article introduces the notion of intrarater reliability for continuous, nominal, ordinal, and interval data. Although the intraclass correlation coefficient is the measure of choice for continuous data, kappa and kappa-like measures defined by Equations (11), (13), (14), and (18)–(20) are generally recommended for nominal data. The extension of kappa to ordinal data is more efficient than the classic kappa when the data are ordinal, and it is an important addition to the kappa literature.

The literature on inter-rater reliability is far more extensive than that on intrarater reliability, particularly for discrete data, which is explained partially by the tendency researchers have to underestimate the importance of data reproducibility. Although many techniques have been developed to measure inter-rater reliability, very few specifically address the problem of intrarater reliability. In this article, we have adapted some inter-rater reliability estimation procedures so that they can be used for computing intrarater reliability coefficients. Unlike inter-rater reliability experiments, which involve multiple raters, multiple subjects, and a single replicate per subject, intrarater reliability experiments typically involve a single rater and several replicates per subject. Consequently, inter-rater reliability methods have been modified by considering the replicates as ratings from different independent raters.

Several authors, such as Fleiss and Cohen (30), Kraemer (20), and others, have attempted to interpret kappa as well as other kappa-like reliability measures as a form of intraclass correlation under certain conditions. The main justification for this effort stems from the need to link kappa to a population parameter and to create a framework for statistical inference. So far, no clear-cut theory can establish such a link in a broad setting. However, the connection of kappa to the intraclass correlation is unnecessary for a good statistical inference framework. A satisfactory solution to this problem is the use of the finite population inference framework discussed in Gwet (16, 25).

REFERENCES

1. S. M. Grundy, Second report of the expert panel on detection, evaluation, and treatment of high blood cholesterol in adults (Adult Treatment Panel II). National Institutes of Health, NIH Publication No. 93-3095, 1993.
2. J. Cohen, A coefficient of agreement for nominal scales. Educat. Psychol. Measurem. 1960; 20: 37–46.
3. R. L. Ebel, Estimation of the reliability of ratings. Psychometrika 1951; 16: 407–424.
4. J. J. Bartko, The intraclass correlation coefficient as a measure of reliability. Psychol. Reports 1966; 19: 3–11.
5. P. E. Shrout and J. L. Fleiss, Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 1979; 86: 420–428.
6. J. M. Lachin, The role of measurement reliability in clinical trials. Clin. Trials 2004; 1: 553–566.
7. National Health and Nutrition Examination Survey, 2005.
8. K. O. McGraw and S. P. Wong, Forming inferences about some intraclass correlation coefficients. Psychol. Methods 1996; 1: 30–46.
9. B. Giraudeau and J. Y. Mary, Planning a reproducibility study: how many subjects and how many replicates per subject for an expected width of the 95 per cent confidence interval of the intraclass correlation coefficient. Stats. Med. 2001; 20: 3205–3214.
10. C. D. Mulrow, B. L. Dolmatch, E. R. Delong, J. R. Feussner, M. C. Benyunes, J. L. Dietz, S. K. Lucas, E. D. Pisano, L. P. Svetkey, B. D. Volpp, R. E. Ware, and F. A. Neelon, Observer variability in the pulmonary examination. J. Gen. Intern. Med. 1986; 1: 364–367.
11. J. R. Landis and G. G. Koch, The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159–174.
12. D. V. Cicchetti and A. R. Feinstein, High agreement but low kappa: II. Resolving the paradoxes. J. Clin. Epidemiol. 1990; 43: 551–558.
13. A. R. Feinstein and D. V. Cicchetti, High agreement but low kappa: I. The problems of two paradoxes. J. Clin. Epidemiol. 1990; 43: 543–549.
14. R. L. Brennan and D. J. Prediger, Coefficient kappa: some uses, misuses, and alternatives. Educat. Psychol. Measurem. 1981; 41: 687–699.
15. M. Aickin, Maximum likelihood estimation of agreement in the constant predictive probability model, and its relation to Cohen's kappa. Biometrics 1990; 46: 293–302.
16. K. L. Gwet, Computing inter-rater reliability and its variance in the presence of high agreement. Br. J. Mathemat. Stat. Psychol. 2008. In press.
17. A. J. Conger, Integration and generalization of kappas for multiple raters. Psychol. Bull. 1980; 88: 322–328.
18. J. L. Fleiss, Measuring nominal scale agreement among many raters. Psychol. Bull. 1971; 76: 378–382.
19. H. C. Kraemer, V. S. Periyakoil, and A. Noda, Kappa coefficients in medical research. Stats. Med. 2002; 21: 2109–2129.
20. H. C. Kraemer, Ramifications of a population model for κ as a coefficient of reliability. Psychometrika 1979; 44: 461–472.
21. A. Donner and M. A. Eliasziw, A hierarchical approach to inferences concerning interobserver agreement for multinomial data. Stats. Med. 1997; 16: 1097–1106.
22. A. Donner and N. Klar, The statistical analysis of kappa statistics in multiple samples. J. Clin. Epidemiol. 1996; 49: 1053–1058.
23. L. Kish, Survey Sampling. New York: Wiley, 1965.
24. W. G. Cochran, Sampling Techniques, 3rd ed. New York: Wiley, 1977.
25. K. L. Gwet, Variance estimation of nominal-scale inter-rater reliability with random selection of raters. Psychometrika 2008. In press.
26. K. J. Berry and P. W. Mielke Jr., A generalization of Cohen's kappa agreement measure to interval measurement and multiple raters. Educat. Psychol. Measurem. 1988; 48: 921–933.
27. H. Janson and U. Olsson, A measure of agreement for interval or nominal multivariate observations. Educat. Psychol. Measurem. 2001; 61: 277–289.
28. H. Janson and U. Olsson, A measure of agreement for interval or nominal multivariate observations by different sets of judges. Educat. Psychol. Measurem. 2004; 64: 62–70.
29. M. N. Albaum, L. C. Hill, M. Murphy, Y. H. Li, C. R. Fuhrman, C. A. Britton, W. N. Kapoor, and M. J. Fine, PORT Investigators, Interobserver reliability of the chest radiograph in community-acquired pneumonia. CHEST 1996; 110: 343–350.
30. J. L. Fleiss and J. Cohen, The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educat. Psychol. Measurem. 1973; 33: 613–619.
FURTHER READING

D. C. Montgomery, Design and Analysis of Experiments. New York: John Wiley & Sons, 2004.
D. T. Haley, Using a New Inter-rater Reliability Statistic, 2007. Available: http://computingreports.open.ac.uk/index.php/2007/200716.
ANOVA Using MS Excel. Available: http://higheredbcs.wiley.com/legacy/college/mann/0471755303/excel manual/ch12.pdf.
CROSS-REFERENCES

Inter-Rater Reliability
Intraclass Correlation Coefficient
Kappa Statistic
Weighted Kappa
Analysis of Variance (ANOVA)
INVESTIGATIONAL DEVICE EXEMPTION (IDE)
An Investigational Device Exemption (IDE) refers to the regulations under 21 CFR (Code of Federal Regulations) 812. An approved IDE means that the Institutional Review Board (IRB) [and the Food and Drug Administration (FDA) for significant risk devices] has approved the sponsor's study application and that all requirements under 21 CFR 812 are met. An IDE allows the investigational device to be used in a clinical study to collect the safety and effectiveness data required to support a Premarket Approval (PMA) application or a Premarket Notification [510(k)] submission to the FDA. Clinical studies are most often conducted to support a PMA; only a small percentage of 510(k)s require clinical data to support the application. Investigational use also includes clinical evaluation of certain modifications or new intended uses of legally marketed devices. All clinical evaluations of investigational devices, unless exempt, must have an approved IDE before the study is initiated.

Clinical evaluation of devices that have not been cleared for marketing requires:

• an IDE approved by an IRB (if the study involves a significant risk device, the IDE must also be approved by the FDA);
• informed consent from all patients;
• labeling for investigational use only;
• monitoring of the study; and
• required records and reports.

An approved IDE permits a device to be shipped lawfully for the purpose of conducting investigations of the device without complying with other requirements of the Food, Drug, and Cosmetic Act (the Act) that would apply to devices in commercial distribution. Sponsors do not need to submit a PMA or Premarket Notification 510(k), register their establishment, or list the device while the device is under investigation. Sponsors of IDEs are also exempt from the Quality System (QS) Regulation except for the requirements for design control.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cdrh/devadvice/ide/index.shtml) by Ralph D'Agostino and Sarah Karl.
INVESTIGATIONAL NEW DRUG APPLICATION PROCESS (IND)
In the United States, current federal law requires that a drug be the subject of an approved marketing application before it is transported or distributed across state lines. Because a sponsor will probably want to ship the investigational drug to clinical investigators in many states, it must seek an exemption from that legal requirement. The Investigational New Drug (IND) process (Fig. 1) is the means through which the sponsor technically obtains this exemption from the U.S. Food and Drug Administration (FDA).

During a new drug's early preclinical development, the sponsor's primary goal is to determine whether the product is reasonably safe for initial use in humans and whether the compound exhibits pharmacologic activity that justifies commercial development. When a product is identified as a viable candidate for further development, the sponsor then focuses on collecting the data and information necessary to establish that the product will not expose humans to unreasonable risks when used in limited, early-stage clinical studies.

The FDA's role in the development of a new drug begins after the drug's sponsor (usually the manufacturer or potential marketer) has screened the new molecule for pharmacologic activity and acute toxicity potential in animals, and wants to test its diagnostic or therapeutic potential in humans. At that point, the molecule changes in legal status under the Federal Food, Drug, and Cosmetic Act and becomes a new drug subject to specific requirements of the drug regulatory system.

There are three IND types:

1. An Investigator IND is submitted by a physician who both initiates and conducts an investigation and under whose immediate direction the investigational drug is administered or dispensed. A physician might submit a research IND to propose studying an unapproved drug or an approved product for a new indication or in a new patient population.
2. An Emergency Use IND allows the FDA to authorize use of an experimental drug in an emergency situation that does not allow time for submission of an IND in accordance with 21 Code of Federal Regulations (CFR) Section 312.23 or Section 312.24. It is also used for patients who do not meet the criteria of an existing study protocol or when an approved study protocol does not exist.
3. A Treatment IND is submitted for experimental drugs that show promise in clinical testing for serious or immediately life-threatening conditions while the final clinical work is conducted and the FDA review takes place.

There are two IND categories:

1. Commercial
2. Research (noncommercial)

The IND application must contain information in three broad areas:

1. Animal Pharmacology and Toxicology Studies. Preclinical data to permit an assessment of whether the product is reasonably safe for initial testing in humans. Also included is any previous experience with the drug in humans (often foreign use).
2. Manufacturing Information. Information pertaining to the composition, manufacturer, stability, and controls used for manufacturing the drug substance and the drug product. This information is assessed to ensure that the company can adequately produce and supply consistent batches of the drug.
3. Clinical Protocols and Investigator Information. Detailed protocols for proposed clinical studies to assess whether the initial-phase trials will expose study participants to unnecessary risks. Also, information must be provided on the qualifications of clinical investigators—professionals (generally physicians) who oversee the administration of the experimental agent—to assess whether they are qualified to fulfill their clinical trial duties. Finally, commitments must be made to obtain informed consent from the research participants, to obtain review of the study by an institutional review board, and to adhere to the investigational new drug regulations.

Once the IND has been submitted, the sponsor must wait 30 calendar days before initiating any clinical trials. During this time, the FDA has an opportunity to review the IND for safety to ensure that research participants will not be subjected to unreasonable risk.

[Figure 1. Flowchart of the Investigational New Drug process: the applicant (drug sponsor) submits the IND; CDER conducts medical, chemistry, pharmacology/toxicology, and statistical reviews; a safety review determines whether the study may proceed or whether a clinical hold decision is made; the sponsor is notified of any deficiencies, and the study is ongoing while the sponsor answers them.]

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/regulatory/applications/ind page 1.htm) by Ralph D'Agostino and Sarah Karl.
INVESTIGATIONAL PRODUCT
An Investigational Product is a pharmaceutical form of an active ingredient or placebo being tested or used as a reference in a clinical trial, including a product with a marketing authorization when used or assembled (formulated or packaged) in a way different from the approved form, when used for an unapproved indication, or when used to gain further information about an approved use.

Responsibility for investigational product(s) accountability at the trial site(s) rests with the investigator/institution. Where allowed/required, the investigator/institution may/should assign some or all of the duties of the investigator/institution for investigational product(s) accountability at the trial site(s) to an appropriate pharmacist or another appropriate individual who is under the supervision of the investigator/institution. The investigator/institution and/or a pharmacist or other appropriate individual, who is designated by the investigator/institution, should maintain records of the delivery of the product to the trial site, the inventory at the site, the use by each subject, and the return to the sponsor or alternative disposition of unused product(s). These records should include dates, quantities, batch/serial numbers, expiration dates (if applicable), and the unique code numbers assigned to the investigational product(s) and trial subjects. Investigators should maintain records that document adequately that the subjects were provided the doses specified by the protocol and should reconcile all investigational product(s) received from the sponsor. The investigational product(s) should be stored as specified by the sponsor and in accordance with applicable regulatory requirement(s).

The investigator should ensure that the investigational product(s) are used only in accordance with the approved protocol. The investigator, or a person designated by the investigator/institution, should explain the correct use of the investigational product(s) to each subject and should check, at intervals appropriate for the trial, that each subject is following the instructions properly.

The sponsor is responsible for supplying the investigator(s)/institution(s) with the investigational product(s). The sponsor should not supply an investigator/institution with the investigational product(s) until the sponsor obtains all required documentation [e.g., approval/favorable opinion from the IRB (Institutional Review Board)/IEC (Independent Ethics Committee) and regulatory authority(ies)]. The sponsor should ensure that written procedures include instructions that the investigator/institution should follow for the handling and storage of investigational product(s) for the trial and documentation thereof. The procedures should address adequate and safe receipt, handling, storage, dispensing, retrieval of unused product from subjects, and return of unused investigational product(s) to the sponsor [or alternative disposition if authorized by the sponsor and in compliance with the applicable regulatory requirement(s)].

The sponsor should:

• Ensure timely delivery of investigational product(s) to the investigator(s);
• Maintain records that document shipment, receipt, disposition, return, and destruction of the investigational product(s);
• Maintain a system for retrieving investigational products and documenting this retrieval (e.g., for deficient product recall, reclaim after trial completion, and expired product reclaim); and
• Maintain a system for the disposition of unused investigational product(s) and for the documentation of this disposition.

The sponsor should also:

• Take steps to ensure that the investigational product(s) are stable over the period of use; and
• Maintain sufficient quantities of the investigational product(s) used in the trials to reconfirm specifications, should this become necessary, and maintain records of batch sample analyses and characteristics. To the extent that stability permits, samples should be retained either until the analyses of the trial data are complete or as required by the applicable regulatory requirement(s), whichever represents the longer retention period.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D'Agostino and Sarah Karl.
INVESTIGATOR
An Investigator is a person responsible for conducting the clinical trial at a trial site. If a trial is conducted by a team of individuals at a trial site, the investigator is the responsible leader of the team and may be called the principal investigator.

The sponsor is responsible for selecting the investigator(s)/institution(s). If a coordinating committee and/or coordinating investigator(s) are to be used in multicenter trials, their organization and/or selection are the responsibility of the sponsor. Before entering an agreement with an investigator/institution to conduct a trial, the sponsor should provide the investigator(s)/institution(s) with the protocol and an up-to-date Investigator's Brochure and should provide sufficient time for the investigator/institution to review the protocol and the information provided. The sponsor should obtain the agreement of the investigator/institution:

• To conduct the trial in compliance with Good Clinical Practice (GCP), with the applicable regulatory requirement(s), and with the protocol agreed to by the sponsor and given approval/favorable opinion by the IRB (Institutional Review Board)/IEC (Independent Ethics Committee);
• To comply with procedures for data recording/reporting;
• To permit monitoring, auditing, and inspection; and
• To retain the essential documents that should be in the investigator/institution files until the sponsor informs the investigator/institution that these documents are no longer needed.

The sponsor and the investigator/institution should sign the protocol or an alternative document to confirm this agreement.

The investigator(s) should be qualified by education, training, and experience to assume responsibility for the proper conduct of the trial; should meet all the qualifications specified by the applicable regulatory requirement(s); and should provide evidence of such qualifications through an up-to-date curriculum vitae and/or other relevant documentation requested by the sponsor, the IRB/IEC, and/or the regulatory authority(ies). The investigator should be thoroughly familiar with the appropriate use of the investigational product(s) as described in the protocol, in the current Investigator's Brochure, in the product information, and in other information sources provided by the sponsor. The investigator should be aware of, and should comply with, GCP and the applicable regulatory requirements. The investigator/institution should permit monitoring and auditing by the sponsor and inspection by the appropriate regulatory authority(ies). The investigator should maintain a list of appropriately qualified persons to whom the investigator has delegated significant trial-related duties.

The investigator should be able to demonstrate (e.g., based on retrospective data) a potential for recruiting the required number of suitable subjects within the agreed recruitment period. The investigator should have sufficient time to properly conduct and complete the trial within the agreed trial period. The investigator should have available an adequate number of qualified staff and adequate facilities for the foreseen duration of the trial to conduct the trial properly and safely. The investigator should ensure that all persons assisting with the trial are adequately informed about the protocol, the investigational product(s), and their trial-related duties and functions.

During and following the participation of a subject in a trial, the investigator/institution should ensure that adequate medical care is provided to the subject for any adverse events, including clinically significant laboratory values, related to the trial. The investigator/institution should inform a subject when medical care is needed for intercurrent illness(es) of which the investigator becomes aware. It is recommended that the investigator inform the subject's primary physician about the subject's participation in the trial if the subject has a primary physician and if the subject agrees to the primary physician being informed. Although a subject is not obliged to give his/her reason(s) for withdrawing prematurely from a trial, the investigator should make a reasonable effort to ascertain the reason(s), while fully respecting the rights of the subject.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D'Agostino and Sarah Karl.
INVESTIGATOR/INSTITUTION
KENNETH A. GETZ
Tufts Center for the Study of Drug Development, Boston, Massachusetts

Overall spending on clinical research grants to investigative sites exceeded $20 billion in 2007. Government agencies and foundations contribute $9.9 billion—approximately 50%—of the total amount spent on clinical research grants. However, these sponsors support a broad variety of activities that include patient-oriented research, epidemiologic and behavioral studies, outcomes research, and health services research (1). In 2007, government agencies and foundations spent approximately $1.9 billion on clinical trials of new medical interventions. The lion's share of spending on clinical trials came from private sector sources, including pharmaceutical, biotechnology, and medical device manufacturers. Industry spent more than $10 billion on clinical trial grants to investigative sites in 2007 (2). Given the high failure rates of new molecular entities—which include both traditional chemical entities and biologics—under clinical investigation, the U.S. Department of Commerce estimates that industry sponsors spend more than $1 billion in capitalized costs to bring a single new drug through the 17-year R&D cycle and into the marketplace (3, 4). This figure includes the sponsor's costs to develop drugs that fail to demonstrate efficacy and present a poor safety profile.

1 LANDSCAPE SIZE AND STRUCTURE

Today, the market for clinical research grants is composed of three primary segments: academic medical centers (AMCs), part-time investigative sites, and dedicated investigative sites. The part-time segment consists of physicians in private, community settings who derive most (85% or more) of their annual revenue from clinical practice. In 2007, this highly fragmented and transient group surpassed AMCs as the largest market segment, with a 38% share of total clinical trial grants. Nearly two thirds of the 9,000 investigative sites that conduct clinical trials in the United States annually fall into the part-time segment. Select sites in this segment are conducting as many as 25 or 30 clinical trials each year, whereas other sites may conduct less than one trial annually. As industry sponsors continue to favor conducting their clinical trials in actual-use, community-based settings, and given the relatively low clinical trial operating costs within part-time sites, this segment is well positioned to continue to grow faster than its peers (5).

As Figure 1 shows, AMCs receive nearly all clinical research grants from government agencies and from foundations, and approximately 36% of all clinical trial grants from private sector pharmaceutical, biotechnology, and medical device companies (1, 6). The dedicated investigative site segment captures 26% of the market for clinical trial grants from industry. Dedicated sites derive nearly 100% of their annual revenue from clinical trial grants. This segment has undergone profound structural changes over the past decade and now includes a wide variety of private entities, including site management organizations, stand-alone dedicated research facilities, and managed site networks. These various entities will be discussed shortly (1).

2 SEGMENT PROFILE: AMCs

Until the early 1990s, AMCs were the primary and predominant home of industry-sponsored clinical trials. As these programs became larger, more complex, and costly, industry sponsors grew tired of the inherent inefficiencies of working with academia, including protracted contractual and budget negotiations, bureaucratic and slow-moving institutional review boards (IRBs), and higher relative cost associated with poorer performance (7).
[Figure 1. Privatization of US clinical investigators. Chart values shown: 70%, 51%, 41%, 36%.]
A select group of institutions—including the University of Kansas Medical Center, Columbia-Presbyterian Medical Center, Johns Hopkins University School of Medicine, University of Rochester Medical School, New York University Medical Center, and George Washington Medical Center—centralized their administrative clinical research functions to compete with community-based investigative sites. Institutional culture, philosophy, and politics made it difficult to implement and establish these offices. Central clinical trial offices serve faculty and study personnel by freeing up investigator and staff time; improving business development and marketing; and accelerating contract, budget, and IRB approval processes. Specific initiatives by central offices included the following: (1) building databases of a given institution's clinical research capabilities and patient populations to respond quickly to sponsor inquiries, (2) working with the IRB to expedite the approval process, (3) obtaining and negotiating new contracts on behalf of faculty, (4) training study coordinators and investigators, (5) providing patient recruitment support, and (6) tracking study performance. Today, most (70%) of the nearly 130 AMCs have established central offices to assist in supporting clinical research activities at their institutions (6).
report that they receive poor returns on infrastructure investment, minimal institutional support, and limited incentive given renewed emphasis on the more prestigious government-funded clinical research programs as NIH funding surged (1). During the most recent 5-year period between 2002 and 2007, for example, total NIH funding grew nearly 4% annually to $28.8 billion in FY2007. Moving forward, however, NIH funding growth is expected to level off, and the NIH is beginning to phase out its long-established General Clinical Research Center (GCRC) program to make room for the new Clinical and Translational Science Awards (CTSA) model. The result will be an intensification of competition specifically for NIH clinical research grants (2). Many AMCs are turning their sights again toward industry-sponsored clinical trials as a vital source of much-needed financial resources in the near future. Fortunately, many clinical research professionals within pharmaceutical and biotechnology companies remain interested in forging and building stronger clinical research relationships with academia. Whereas many AMCs are considering whether to make a more concerted push to expand their involvement in industry-funded clinical research activity, others have been implementing new strategies and tactics designed to improve their ability to compete
for industry study grants. These strategies include implementing services to expedite study start-up (e.g., contract and budget approvals, IRB review and approval), to provide higher quality training of study investigators and staff, and to accelerate the collection and management of clinical research data. Rising financial pressures will prompt many AMCs to expand their involvement in industry-funded clinical trials over the next several years. The NIH Roadmap for Medical Research and its push to establish Clinical and Translational Science Awards will serve as a catalyst in driving up AMCs' commitment to industry-sponsored clinical trials (6). By 2012, the NIH's CTSA initiative will provide approximately $500 million annually to 60 AMCs to encourage more effective interaction and communication between clinical research and medical practice.

3 SEGMENT PROFILE: PART-TIME INVESTIGATIVE SITES

Part-time investigative sites are a vital component of the study conduct services engine, yet this group of investigative sites is the most difficult to define given its highly fragmented nature. Investigative sites in this segment are very transient, with extremely variable clinical research infrastructure and experience. Top sites in this segment conduct as many as 25 or 30 clinical trials each year, whereas bottom sites may do fewer than one annually. Although 40% of this segment has full-time coordinators, 31% of the segment does not have a single research nurse on staff. Turnover is the highest among investigators in this segment. One of every three part-time investigators who file the FDA-required Form 1572 discontinues involvement once the trial has ended (8). Sponsors have long favored placing a significant portion of their phase II and III clinical studies in actual-use settings. With rapidly expanding pipelines of promising new chemical entities, part-time sites can expect to see increasing numbers of annual clinical trial opportunities in the future. Additionally, industry anticipates that the demand for clinical research centers will exceed supply within the next several years. Given their
more flexible involvement in conducting clinical trials, part-time investigative sites play a valuable role in providing much-needed capacity. With typically low overhead and minimal fixed costs, part-time sites can play a broader role as a variable study conduct resource that can transition flexibly into and out of clinical projects on demand. In essence, part-time sites may find larger opportunities to act as contract study conduct partners with dedicated clinical research centers. The dramatic growth in post-approval research spending represents another important trend that will increase the demand for part-time investigators. Post-approval studies are now conducted by sponsor companies for a wide variety of purposes: to extend knowledge about efficacy, safety, and outcomes within actual-use settings; to position newly launched drugs against standard and competing therapies; to prime the market and stimulate drug use; to observe prescribing behaviors and patient experiences; and to extend drug life cycles by targeting new populations and new markets. The FDA is also increasing its pressure on companies to continue the testing of new therapies after they are launched. Typically, phase IV research programs are conducted in community-based, clinical practice settings. Part-time sites report modest growth in year-over-year revenue and relatively high profitability, largely because of the flexible and transient way that part-time sites approach conducting clinical trials. Although many very experienced part-time investigators with established clinical trial operations exist, as a group this segment is composed largely of investigative sites with wide variability in operating structure, experience, and performance. Annual revenue reported by part-time investigative sites ranges from $25,000 to as much as $1,000,000–$2,000,000 annually (1). Most (83%) part-time investigative sites report that they primarily focus on one or two specialty research areas. Few part-time sites consider themselves multispecialty research centers. Nearly half of all part-time investigative sites operate out of a group or network of
practicing physicians. Many sites in this segment report that several physicians within their provider network have been involved in conducting at least one clinical trial during the past five years. On average, part-time sites have one study coordinator on staff. The average age of a principal investigator in a part-time investigative site setting is 50 years. Most investigators are getting involved in industry-sponsored clinical research at later stages in their professional careers. These individuals are more established and more financially secure. Less than 10% of part-time principal investigators who conduct industry-sponsored clinical trials are under the age of 40. An estimated 45% of the more than 40,000 active clinical investigators—some 16,000 to 18,000 people—in any given year are conducting their clinical trials on a part-time basis in independent research centers (1). Operating profits among part-time sites vary widely, largely as a function of research infrastructure. Some of the largest part-time sites report operating profits as low as 5–7% of net revenue. As expected, part-time sites that conducted only a few clinical trials in 2006 report operating profits as high as 60–70% of grant revenue. Almost 20 cents of every clinical grant award dollar is profit for the part-time site. Minimal infrastructure and integrated and shared resources help part-time sites maintain higher relative operating margins. In addition, these sites are well positioned to handle the typical cash flow problems that result from late sponsor and CRO (contract research organization) payments, because they have a steady clinical practice revenue stream. But it is certainly not the economics alone that attract practicing physicians to conduct clinical trials. These physicians are drawn to the science of new drug development, and they want to offer their patients access to medicines on the frontiers of medical knowledge. Both small and large part-time research centers have unique needs that must be addressed if this segment is to meet increasing demand and performance standards. To reach a comparable level of performance across this segment, experienced and new
entrant part-time sites will need to improve their abilities to start studies and enroll patients, to comply with evolving federal guidelines and regulations, to use emerging electronic solutions technologies more effectively, and to improve their operating efficiencies. At this time, part-time sites largely learn to conduct clinical research through actual experience. A more systematic and uniform approach, integrated into health professional training and supported by research sponsors, will likely be necessary. Ultimately, this study conduct services segment holds great promise in meeting growing demand and in offering much-needed capacity.

4 SEGMENT PROFILE: DEDICATED INVESTIGATIVE SITES

An estimated 500 dedicated investigative sites operate in the United States currently. Approximately two thirds of this total are stand-alone research centers, and close to 35% of all dedicated sites are part of a network (1). Dedicated sites are typically multispecialty and are relatively sophisticated in terms of their approach to managing their clinical research activity—from financial management to staffing and recordkeeping. They conduct an average of 30 clinical research studies per year. Similar to part-time community-based investigative sites, dedicated sites interact with centralized IRBs for human subject protection oversight. Most dedicated sites—nearly 90%—report that they are profitable, with net operating margins of 11–15%. Dedicated investigative sites do not have a clinical practice from which to draw patients for their clinical trials. Instead, dedicated sites rely heavily on advertising and promotion to attract study volunteers. Approximately 40% of dedicated sites report that they have a full- or part-time patient recruitment specialist on staff (1). In the mid-1990s, a group of dedicated sites formed to meet growing sponsor demands for clinical trial cost and time efficiency and improved data. The conceptual promise offered by these new Site Management Organizations (SMOs) was compelling:
• Centralized clinical research operations
• Standardized contracts and operating procedures
• Trained and accredited staff
• New technologies to manage information and to track performance
• Systematic management of patient recruitment and retention
• Systematic management of clinical data
• Streamlined regulatory and legal review and approval processes
• Reduced fixed costs to offer more competitive pricing
• Applied business and management principles
But since their introduction 15 years ago, most SMOs have struggled to deliver on these conceptual promises, and they have been through a wide variety of incarnations. The first SMOs emerged under a cloud of scandal when Future HealthCare, an early entrant, was indicted for manipulating its financial records to influence the capital markets. The mid-1990s saw a wave of new entrants—owned-site and affiliation model SMOs offering single- and multispecialty expertise—including Affiliated Research Centers (ARC), Clinical Studies Limited (CSL), Collaborative Clinical Research, Hill Top Research, Health Advance Institute (HAI), InSite Clinical Trials, Protocare, Integrated Neuroscience Consortium (INC), Rheumatology Research International (RRI), and several hybrid CRO-SMOs such as Clinicor, MDS Harris, and Scirex. In the late 1990s, SMOs entered their most active period of venture capital fund raising and aggressively pursued expansion and diversification strategies. Collaborative Clinical raised $42 million in a 1996 initial public offering. ARC, INC, InSite, and HAI each closed rounds of venture capital financing. Phymatrix, a $200 million physician's practice management group, acquired CSL for $85 million. Having raised more than $14 million in 1999, Radiant Research purchased peer SMO Hill Top Research. And nTouch Research (formerly Novum) raised $8 million in venture capital funding to double the size of its investigative site network through the acquisition of HAI.
By the early 2000s, many SMOs had exited the market, whereas others diversified their services even more, venturing into traditional CRO and patient recruitment offerings. To name but a few: after a public offering, Collaborative Clinical Research renamed itself Datatrak and announced that it would be exiting the SMO business to become a provider of electronic clinical trial technology solutions; Integrated Neuroscience Consortium, MDS, and RRI focused attention on offering CRO services; The Essential Group (formerly ARC) abandoned its SMO services to focus on offering contract patient recruitment services; in 2005 nTouch Research was acquired by CRO Accelovance; InSync Research shut down its operations in 2000 after selling four of its seven sites to Radiant Research; and Clincare, which had hoped to expand its regional network of eight owned sites, also exited the business in 2000. In late 2001, ICSL sold its SMO assets to Comprehensive Neuroscience. And in 2003, Radiant Research completed the acquisition of Protocare—another top-five SMO—thereby expanding Radiant's network to nearly 60 sites. Radiant Research has clearly demonstrated that it is an outlier—it is growing and operating profitably among traditional SMOs that have either exited or struggled to survive in this market. Today, Radiant owns and operates more than 30 dedicated sites, and it has 550 employees, including 225 study coordinators. Industry observers have concluded that, with the exception of Radiant Research, traditional SMOs failed to demonstrate and deliver the value of centralized investigative site operating controls. SMO management structures were cumbersome and challenged the organizations' ability to operate profitably for a sustainable period of time. Traditional SMOs struggled to compete for sufficient levels of new business and to manage positive cash flow, and they failed to achieve revenue and earnings growth that would satisfy their investors. SMO insiders express the failures of traditional SMOs differently: many believe that research sponsors failed to use the SMO properly by neglecting to empower the SMO to manage an entire study.
The past several years have seen strong growth in new structures among managed site networks. The Tufts Center for the Study of Drug Development has been tracking this development. These new players tend to operate regionally, and they have extremely lean centralized operations. They are composed of small networks of sites—on average, five—that are connected loosely through minimal standardized management services provided across the network. Basic management services include contract, budget, and regulatory assistance. Although many decentralized site networks have established exclusive arrangements with their investigators, they encourage autonomy and direct interaction with study sponsors. As a result, they address sponsors' "investigator-centric" preferences while offering minimal—although essential—operating support. Examples of decentralized site networks include Pivotal Research, Benchmark Research, ResearchAcrossAmerica, and RxTrials. Decentralized site networks are building momentum: they generated $215 million in study grant revenue in 2006, and they are growing 9.3% annually. They seem well positioned to capture growing market share over the next several years while containing operating costs (1).
5 THE LANDSCAPE MOVES OVERSEAS

Recently, a dramatic shift has occurred in the use of ex–U.S.-based investigative sites. As Figure 2 indicates, major pharmaceutical and biotechnology companies have been placing an increasing proportion of their clinical trials at offshore locations throughout Central and Eastern Europe, Latin America, and parts of Asia. Several sponsors report that they now routinely place most of their clinical trials among ex–U.S.-based investigative sites. This shift has major implications not only for U.S.-based investigative sites, but also for sponsors and for CROs who are looking to optimize their relationships with an increasingly global network of investigators. It is widely accepted that conducting trials overseas is far less expensive and that a healthy supply of Western-trained physicians is located there, eager to serve as investigators and study coordinators. Drug shipment issues have largely been overcome, and the adoption of e-clinical technology solutions has helped to address operating support issues that used to bottleneck overseas projects. Most importantly, the abundance of treatment-naïve patients abroad has translated into speedier clinical trial recruitment and improved retention rates. In the developing world, vast numbers of patients suffer
Figure 2. An increasingly global mix of FDA-regulated clinical investigators.
from diseases that range from the familiar to the exotic. These patients are often eager to enroll in clinical trials at the request of their physicians and to gain access to free medical care, tests, and complimentary therapies that they could otherwise not afford. In an analysis of Form 1572 filings with the FDA, the Tufts Center for the Study of Drug Development found that most investigative sites still hail from the United States, but the geographic mix is changing rapidly. A decade ago, less than 1 in 10 FDA-regulated investigators was based outside the U.S.; in 2006, 40% of FDA-regulated investigators were ex-U.S. The Tufts Center also found that, whereas the overall number of principal investigators within the United States has been growing by 5.5% annually since 2002, the number of FDA-regulated principal investigators based outside the US has been growing by more than 15% annually during that same time period (8). Major pharmaceutical companies have already tipped the balance in favor of ex-U.S. investigative sites, or they are planning to do so within the next couple of years. Wyeth, for example, reports that it does more than half of its clinical trials among ex–U.S.-based investigative sites. In 2007, nearly 60% of all investigative sites recruited by Novartis were ex-U.S. Mid-sized P&G Pharmaceuticals, which now conducts more than one third of its trials among ex-U.S. sites, says that it is routinely looking to place trials overseas, often as replacements for traditionally U.S.-based investigators. And GlaxoSmithKline said that it conducted 29% of its clinical trials abroad in 2006 and hopes to increase that figure to 50% of its clinical trials by the end of 2006. Merck is conducting significantly more studies internationally than in the past. More than 40% of its investigative sites are now based outside the U.S. in regions that include Central and Eastern Europe, Latin America, Australia, New Zealand, and, increasingly, the Asia-Pacific (8). As clinical trial volume and scope continue to increase, an increasingly global community of investigative sites seems ready to meet drug development capacity needs. For U.S.-based investigative sites, this trend signals that competition for a smaller number
of trials will intensify domestically. Research sponsors have already been turning to more sophisticated criteria (e.g., historical performance metrics and details on access to specific patient populations) to justify the higher relative costs of U.S.-based clinical investigators. Competition among ex-U.S. investigators is also intensifying. As these sites take on more operating infrastructure, they will need to sustain a larger stream of clinical trial activity. Clinical research managers who represent different global regions within biopharmaceutical companies must compete with their own colleagues to gain a larger share of their organization's finite clinical trial work. At this time, Eastern Europe and Latin America tend to be the favored regions. Growing use of ex–U.S.-based investigative sites is also expected to increase outsourcing to contract service providers well positioned within markets abroad and offering significantly lower operating expenses. Given the added complexity of simultaneously conducting clinical trials internationally, several sponsors report a heavier reliance on CROs to meet more aggressive development timelines. Global phase I clinical trial activity is one area that is beginning to migrate back to the United States and Canada, largely due to a weaker US dollar combined with tighter global regulatory requirements. During the next several years, phase I investigative sites can expect intensifying competition and consolidation among CRO-owned and community-based investigator-owned facilities. The shift toward increasing use of investigative sites outside the U.S., particularly in phase II and III clinical research studies, will likely attract public scrutiny. With mounting political pressure, the FDA is expected to weigh in with new restrictions and reforms. Media and policy-makers have already identified this trend as a potentially explosive issue. This issue will fuel the fire of an already damaged and eroded relationship with the public. Despite data suggesting that, regardless of where clinical trials are conducted, the U.S. market typically gains early access to new medical innovations, the media tends to depict the growing prevalence of ex–U.S.-based investigative sites as
profit-motivated and exploitive of vulnerable global markets. It is critical for research sponsors to educate the public and policymakers proactively about the full rationale behind broad-based global development programs. New drug approval in the United States requires that sponsors submit an application that contains substantial evidence of effectiveness based on the results of adequate and well-controlled studies. In a few cases, a new drug has been approved for the U.S. market based on data from well-controlled studies conducted solely by investigative sites abroad. With regard to the exact mix of U.S.-based and ex–U.S.-based investigative sites, the FDA remains noncommittal: no minimum number of U.S. patients is required in a New Drug Application. Sponsors may design and conduct their studies in any manner provided that they demonstrate, via prospectively planned analyses, that the ex-U.S. patients are similar to U.S. patients, both in terms of pretreatment disease characteristics and treatment outcomes. Now and in the near future, sponsors and the FDA will continue to evaluate studies for their applicability to patient populations and medical care settings in the United States. Select diseases and their respective patient affinity groups may vie to keep some clinical trials within the United States, given wide variations in diet, lifestyle, and health care consumption. Among these are experimental treatments for age-related illnesses (such as Alzheimer's, Parkinson's, and ALS) and for gastrointestinal and endocrine disorders. The investigative site landscape is poised to continue to evolve substantially. Research sponsors increasingly seek more efficient and productive collaborations with study conduct providers. As the cost and the duration of clinical research studies continue to increase, ample opportunities exist for investigative sites to improve. Streamlined operating processes, improved financial controls, increased adoption of integrated clinical research data collection and management technologies, and the implementation of more effective patient recruitment and retention practices are but a few of the key areas in which successful investigative sites will excel and differentiate.
REFERENCES
1. J. DiMasi et al., The price of innovation: new estimates of drug development costs. J. Health Econ. 2003; 22: 151–185.
2. J. DiMasi, Risk in new drug development: approval success rates for investigational drugs. Clin. Pharmacol. Ther. 2001; 69: 297–307.
3. K. Getz, The Market for Site Management Organizations and Clinical Trial Conduct Services. Dorland Medical and Healthcare Marketplace. 2003; 18: 216–218.
4. K. Getz, Number of active investigators in FDA-regulated clinical trials drop. The Impact Report, The Tufts Center for the Study of Drug Development at Tufts University. 2005; 7(3).
5. K. Getz, The Evolving SMO in the United States. New York: Marcel Dekker, 2004.
6. H. Moses et al., Financial anatomy of biomedical research. JAMA 2005; 1333–1342.
7. R. Rettig, The industrialization of clinical research. Health Aff. 2000; 129–146.
8. N. Sung et al., Central challenges facing the national clinical research enterprise. JAMA 2003; 289: 1278–1287.
INVESTIGATOR’S BROCHURE
The Investigator's Brochure (IB) is a compilation of the clinical and nonclinical data on the investigational product(s) that are relevant to the study of the product(s) in human subjects. Its purpose is to provide the investigators and others involved in the trial with the information to facilitate their understanding of the rationale for, and their compliance with, many key features of the protocol, such as the dose, dose frequency/interval, methods of administration, and safety monitoring procedures. The IB also provides insight to support the clinical management of the study subjects during the course of the clinical trial. The information should be presented in a concise, simple, objective, balanced, and nonpromotional form that enables a clinician or potential investigator to understand it and make his/her own unbiased risk–benefit assessment of the appropriateness of the proposed trial. For this reason, a medically qualified person generally should participate in the editing of an IB, but the contents of the IB should be approved by the disciplines that generated the described data.

This guideline delineates the minimum information that should be included in an IB and provides suggestions for its layout. It is expected that the type and extent of information available will vary with the stage of development of the investigational product. If the investigational product is marketed and its pharmacology is widely understood by medical practitioners, then an extensive IB may not be necessary. Where permitted by regulatory authorities, a basic product information brochure, package leaflet, or labeling may be an appropriate alternative, provided that it includes current, comprehensive, and detailed information on all aspects of the investigational product that might be of importance to the investigator. If a marketed product is being studied for a new use (i.e., a new indication), then an IB specific to that new use should be prepared.

The IB should be reviewed at least annually and revised as necessary in compliance with the written procedures of the sponsor. More frequent revision may be appropriate depending on the stage of development and the generation of relevant new information. However, in accordance with Good Clinical Practice (GCP), relevant new information may be so important that it should be communicated to the investigators, and possibly to the Institutional Review Boards (IRBs)/Independent Ethics Committees (IECs) and/or regulatory authorities, before it is included in a revised IB.

Generally, the sponsor is responsible for ensuring that an up-to-date IB is made available to the investigator(s), and the investigators are responsible for providing the up-to-date IB to the responsible IRBs/IECs. In the case of an investigator-sponsored trial, the sponsor-investigator should determine whether a brochure is available from the commercial manufacturer. If the investigational product is provided by the sponsor-investigator, then he or she should provide the necessary information to the trial personnel. In cases where preparation of a formal IB is impractical, the sponsor-investigator should provide, as a substitute, an expanded background information section in the trial protocol that contains the minimum current information described in this guideline.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D'Agostino and Sarah Karl.
KAPLAN–MEIER PLOT
NORBERT HOLLÄNDER
University Hospital of Freiburg, Institute of Medical Biometry and Medical Informatics, Freiburg, Germany

1 INTRODUCTION

The Kaplan–Meier plot or Kaplan–Meier curve is the standard method to describe and compare graphically the overall survival of groups of patients. This method is applied to survival time, which is defined as the time from treatment start or date of randomization to death or last date of follow-up, and, more generally, to any type of time-to-event data, such as time-to-disease progression, time-to-deterioration, or time-to-first toxicity. Sometimes, the time-to-event is also called failure time. In the sequel, the term "survival time" is used as a synonym for any time-to-event. In addition to its broad applicability, the Kaplan–Meier plot has further advantages: It displays without bias nearly all information concerning survival time obtained in the dataset. Furthermore, with the Kaplan–Meier plot, the outcome of a trial can easily be explained to clinicians. Therefore, the Kaplan–Meier plot is omnipresent in the clinical trials literature. No trial investigating survival time should be published without showing the curve. The Kaplan–Meier plot is the graphical presentation of the nonparametric Kaplan–Meier estimator (K–M estimator) of the survival curve. The K–M estimator generalizes the empirical distribution function of a sample in the presence of censoring. Its name is derived from the seminal paper of E. L. Kaplan and P. Meier entitled "Nonparametric estimation from incomplete observations" in the Journal of the American Statistical Association (1). As a result of the importance of the Kaplan–Meier plot, this paper has become one of the most cited statistical papers. Sometimes, the K–M estimator is also denoted as the product-limit estimator, which was the name originally used by Kaplan and Meier.

In its original form, the K–M estimator is applicable to right-censored data, meaning that, for some patients, it is only known that their true survival time exceeds a certain censoring time. Right censoring is exclusively considered here, and extensions of the K–M estimator to other types of censoring are merely mentioned at the end of this article. Different types of censoring are described elsewhere in the encyclopedia. Nearly all textbooks on survival analysis discuss the K–M estimator in detail. The classic book by Kalbfleisch and Prentice (2), a standard reference on survival analysis for many years, increased the popularity of the K–M estimator substantially. The more practically oriented reader may refer to the textbooks by Marubini and Valsecchi (3) or Klein and Moeschberger (4). Rigorous derivations of the statistical properties of the Kaplan–Meier estimator are provided, for example, in the book by Andersen et al. (5). The practical analysis can be performed using any sound statistical software package. For instance, SAS provides the procedure PROC LIFETEST, SPSS the procedure KM, R/S-Plus the procedure SURVFIT, and STATA the procedure STS GRAPH. Next, the K–M estimator is described, including an estimation of its variance and the calculation of confidence intervals. The application of the plot is illustrated using data from a particular clinical study. Then, the estimation of the median survival time and of survival probabilities at specific time points is discussed. Furthermore, practical notes are provided on the use of the K–M estimator for the analysis of clinical survival data and its interpretation. Finally, a few additional topics are considered.

2 ESTIMATION OF SURVIVAL FUNCTION

Consider the situation in which the survival time is studied for a homogeneous population. The survival function S(t) = Pr(T > t) is the basic quantity to describe time-to-event data, where S(t) is defined as the probability of an individual surviving beyond time t. Censoring is
usually assumed to be independent from the occurrence of the event (e.g., death), which means that the additional knowledge of censoring before any time t does not alter the risk of failure at t. Let n denote the sample size of a study (= number of patients) and assume that deaths are observed at k different time points with k ≤ n. To calculate the K–M estimator, the data have to be arranged as in Table 1: The observed death times are ordered by ascending values t_1 < t_2 < ... < t_k. The symbol d_j denotes the number of deaths at time point t_j (d_j ≥ 1), and r_j is the number of patients "at risk" just before time point t_j (j = 1, ..., k). The latter contains all patients without an observed event just before time point t_j. "At risk" just before t_j are, therefore, all patients dying at t_j or thereafter and those censored at t_j or thereafter. Censored observations are given in Table 1 indirectly only by the number of patients "at risk": between t_{j-1} and t_j, r_{j-1} − r_j − d_{j-1} censored survival times exist. The K–M estimator for the survival distribution function S(t) is then obtained by

$$\hat{S}(t) = \prod_{t_j \le t} \left( 1 - \frac{d_j}{r_j} \right), \qquad j = 1, \ldots, k,$$

where r_j is the number of patients at risk (i.e., alive and not censored) just before time t_j and the product is taken over all observed death times t_j less than or equal to t. The variance of the K–M estimator is commonly estimated by the so-called Greenwood formula:

$$\widehat{\mathrm{Var}}(\hat{S}(t)) = \hat{S}(t)^2 \sum_{t_j \le t} \frac{d_j}{r_j (r_j - d_j)}$$
Table 1. Arrangement of Time-to-Event Data

Time points | Number of patients "at risk" | Number of deaths
t_1         | r_1                          | d_1
t_2         | r_2                          | d_2
...         | ...                          | ...
t_k         | r_k                          | d_k
where the sum is taken over all observed death times t_j less than or equal to t. This formula allows the calculation of the standard deviation of Ŝ(t) at each time point t_j. In large samples, the K–M estimator, evaluated at a given time t, is approximately normally distributed, so that a standard 100(1 − α)% pointwise confidence interval for S(t) can be calculated by

$$\hat{S}(t) \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{S}(t))} \qquad (1)$$
where z_{1−α/2} denotes the 1 − α/2 quantile of the standard normal distribution. The confidence interval, which is routinely calculated by several statistical packages, is appropriate for studies with n = 200 or more patients with k > 100 observed deaths (6). However, for small sample sizes (e.g., n = 50), Equation (1) serves as a rough approximation only and may be improved (see Practical Note 7). If the upper confidence bound is larger than one, it is generally set equal to one, and the lower confidence bound is set equal to zero if it is negative.

2.1 An Example

The computation of the K–M estimator is illustrated with data on remission durations of leukemia patients. This dataset is taken from a controlled clinical trial published by Freireich et al. (7) and has been used previously by several authors (e.g., References 3 and 4) for illustration. In the clinical trial, it was investigated whether patients who had achieved complete remission after some induction treatment could benefit from further treatment. Here, only the patients treated with 6-mercaptopurine (6-MP) chemotherapy are considered. The time-to-event of interest is the time from complete remission to relapse. For the 21 patients of the 6-MP group, the observed remission durations or right-censored times are recorded in weeks. These times are, rearranged in ascending order: 6, 6, 6, 6*, 7, 9*, 10, 10*, 11*, 13, 16, 17*, 19*, 20*, 22, 23, 25*, 32*, 32*, 34*, and 35*, where * indicates a censored observation. Table 2 illustrates the single steps of the computation, and Fig. 1 displays the resulting Kaplan–Meier plot including the standard
pointwise 95% confidence intervals derived from Equation (1).

Figure 1. Kaplan–Meier plot with standard pointwise 95% confidence interval for the leukemia data. (The vertical axis shows the survival probability, i.e., the probability of staying in remission; the horizontal axis shows the time since complete remission in weeks. Censored observations are marked by a symbol, and the 95% confidence interval for the median survival time is indicated.)

Estimated survival probabilities Ŝ(t), describing, in this example, the probability of staying in complete remission, are obtained for each time point t since complete remission. At time point t = 0, all patients are in complete remission and the estimated survival probability is, therefore, Ŝ(0) = 1. After 12 weeks, for example, the estimated probability of still being in remission is Ŝ(12) = 0.753, with a 95% confidence interval of [0.564; 0.942] (see the code sketch after the practical notes below).

2.2 Practical Notes

1. The K–M estimator of the survival curve is a step function with jumps at the observed uncensored survival times. The size of these jumps depends not only on the number of events observed at each event time t_j, but also on the pattern of the censored observations before t_j. If and only if all observed times are uncensored, the K–M estimator is equal to one minus the empirical distribution function and jumps by the amount 1/n if n is the sample size. Note that, in the latter special case, the K–M
plot and the respective software are still applicable.
2. If both an event and a censored observation are observed at exactly the same time point, the event is assumed to have occurred first. This assumption is crucial for the mathematical analysis but convenient from the practical point of view.
3. To calculate the K–M estimator, the time is partitioned into a sequence of time intervals, where the number and lengths of intervals are determined by the observed uncensored survival times in the data. In the example provided, the intervals are 0 < t ≤ 6 weeks, 6 < t ≤ 7 weeks, 7 < t ≤ 10 weeks, and so on. A similar approach for estimating survival curves is the so-called life-table estimator, also known as the actuarial method. With this approach, the time is partitioned into a data-independent fixed sequence of intervals that are almost always of equal lengths. For instance, in life-table analysis for human populations, the length of each interval is usually
1 year. Once the set of intervals has been chosen, the construction of the estimator follows the basic idea used for the Kaplan–Meier estimator. More details on various types of life-table estimators can be found in the book by Lee (8). In clinical trials in which individual data on survival time are available, the K–M estimator should be preferred because it is more precise.
4. The K–M estimator is a product of conditional probabilities. For each time interval (see Practical Note 3), the probability of being a survivor at the end of the interval is estimated on the condition that the subject was a survivor at the beginning of the interval. This conditional probability is denoted as "factor contributed to the K–M estimator" in Table 2. The survival probability Ŝ(t) at some time point t is calculated as the product of the conditional probabilities of surviving each time interval up to time point t, which is done recursively as described in the second footnote of Table 2.
5. The K–M estimator is well defined for all time points less than the largest observed survival or censoring time t_last. If t_last corresponds to an event
(i.e., an uncensored observation), then the estimated survival curve is zero beyond this point. If t_last corresponds to a censored observation, as in the above example, where t_last = 35 weeks, the K–M estimator should be considered undefined beyond t_last, because it is not known when this last survivor would have died had he or she not been censored. Therefore, a correct Kaplan–Meier plot should end at this last observed time point, and the curve should not be prolonged to the right (right-censored!).
6. To improve the interpretability of results, the K–M plot should contain information on the censoring distribution, which is especially relevant if many patients are censored after the last observed death; thereafter the estimated survival probability remains constant until the last censored observation. In Fig. 1, censored observations are marked by a symbol; 5 patients are censored after the last observed death, with censoring times of 25*, 32*, 32*, 34*, and 35* weeks. Considering the K–M plot without this information, one would only have known that the last observed censored time is 35*
Table 2. Illustration of the Computation of the Kaplan–Meier Estimator with Standard Pointwise 95% Confidence Bounds for the Leukemia Data of Freireich et al. (7)

j | Ordered distinct failure time t_j | Patients at risk§ just before t_j, r_j | Deaths at t_j, d_j | Factor contributed to the K–M estimator, 1 − d_j/r_j | K–M estimator# Ŝ(t) for all t in [t_j, t_{j+1}) | Standard deviation based on the Greenwood variance estimator | Standard pointwise 95% confidence interval for S(t)
1 | 6  | 21 | 3 | 18/21 | 18/21 = 0.857         | 0.0764 | [0.707; 1.000]
2 | 7  | 17 | 1 | 16/17 | 0.857 × 16/17 = 0.807 | 0.0869 | [0.636; 0.977]
3 | 10 | 15 | 1 | 14/15 | 0.807 × 14/15 = 0.753 | 0.0963 | [0.564; 0.942]
4 | 13 | 12 | 1 | 11/12 | 0.753 × 11/12 = 0.690 | 0.1068 | [0.481; 0.900]
5 | 16 | 11 | 1 | 10/11 | 0.690 × 10/11 = 0.627 | 0.1141 | [0.404; 0.851]
6 | 22 | 7  | 1 | 6/7   | 0.627 × 6/7 = 0.538   | 0.1282 | [0.286; 0.789]
7 | 23 | 6  | 1 | 5/6   | 0.538 × 5/6 = 0.448   | 0.1346 | [0.184; 0.712]

§ Note that not only d_j but also the censored observations 6*, 9*, 10*, 11*, 17*, 19*, 20*, 25*, 32*, 32*, 34*, and 35* change the set of patients at risk.
# At time point t = 0, all patients are in remission, leading to Ŝ(0) = 1; for t ≥ t_1 the K–M estimator Ŝ(t) is calculated recursively: Ŝ(t_1) = Ŝ(0) · (1 − d_1/r_1), Ŝ(t_2) = Ŝ(t_1) · (1 − d_2/r_2), etc.
weeks and that the remaining 4 censored observations (this number would usually be unknown too) were somewhere between 23 weeks and 35 weeks. Thus, without any information on the censoring distribution, the K–M plot should be interpreted only up to the last observed death. For large samples, information concerning the censoring distribution could be given by adding the number of patients "at risk" at equidistant time points to the time axis.
7. For small samples, better pointwise confidence intervals can be constructed using transformations of Ŝ(t) such as log(−log(Ŝ(t))) (6). Based on this transformation, the resulting 100(1 − α)% confidence interval is not symmetric about the K–M estimator of the survival function. In the example provided, one would obtain a 95% confidence interval for the probability of still being in remission after 12 weeks, S(12), of [0.503; 0.923], in contrast to that reported above, which was [0.564; 0.942] around 0.753.
8. It should be noted that alternatives to the Greenwood formula for variance estimation exist (see, for example, Reference 9). Using these alternatives may affect the pointwise confidence intervals.
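To make the calculations behind Table 2 concrete, here is a minimal Python sketch, assuming the 6-MP remission times listed in the example above with an event indicator of 1 for relapse and 0 for censoring. It implements the product-limit formula, the Greenwood variance, and the standard pointwise confidence interval of Equation (1). The function name km_estimate and the output layout are illustrative choices, not part of the original article, and the sketch is not a substitute for the validated procedures (PROC LIFETEST, SURVFIT, etc.) mentioned in the introduction.

```python
# Minimal sketch of the Kaplan-Meier estimator with Greenwood variance,
# using the 6-MP remission times from the example (weeks; event = 1 means relapse).
import math

times  = [6, 6, 6, 6, 7, 9, 10, 10, 11, 13, 16, 17, 19, 20, 22, 23, 25, 32, 32, 34, 35]
events = [1, 1, 1, 0, 1, 0, 1,  0,  0,  1,  1,  0,  0,  0,  1,  1,  0,  0,  0,  0,  0]

def km_estimate(times, events, z=1.96):
    """Return a list of (t_j, r_j, d_j, S_hat, std_err, lower, upper) rows."""
    rows = []
    s_hat, greenwood_sum = 1.0, 0.0
    for t in sorted(set(t for t, e in zip(times, events) if e == 1)):
        r = sum(1 for x in times if x >= t)                                # at risk just before t_j
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)     # deaths at t_j
        s_hat *= 1.0 - d / r                                               # product-limit step
        greenwood_sum += d / (r * (r - d))                                 # Greenwood formula term
        se = s_hat * math.sqrt(greenwood_sum)
        lower = max(0.0, s_hat - z * se)                                   # truncate CI at [0, 1]
        upper = min(1.0, s_hat + z * se)
        rows.append((t, r, d, s_hat, se, lower, upper))
    return rows

for t, r, d, s, se, lo, hi in km_estimate(times, events):
    print(f"t={t:>2}  r={r:>2}  d={d}  S={s:.3f}  se={se:.4f}  95% CI=[{lo:.3f}; {hi:.3f}]")
# The row for t=10 gives S=0.753 with CI [0.564; 0.942]; this value holds for all
# t in [10, 13), e.g., t = 12 weeks as quoted in the example.
```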
2.3 Median Survival Time

For the presentation of survival data with censoring, the entire estimated survival curve Ŝ(·), together with standard errors or confidence intervals, is usually the best choice. Nevertheless, summary statistics such as location estimates are sometimes also useful. Besides the presentation of survival probabilities for a specific time point t, the median, or 0.5 quantile, is a useful estimator of location. The median survival time t̃_0.5 is determined by the relation S(t̃_0.5) = 0.5. Notice that the K–M estimator is a step function and, hence, does not necessarily attain the value 0.5. Therefore, t̃_0.5 is usually defined as the smallest time t for which Ŝ(t) ≤ 0.5 (i.e., the time t where Ŝ(t) jumps from
a value greater than 0.5 to a value less than or equal to 0.5). A commonly used method to calculate confidence intervals for the median survival time t̃_0.5, proposed by Brookmeyer and Crowley (10), is based on a modification of the construction of pointwise confidence intervals for S(t) described above. Based on the standard confidence interval in Equation (1), a 100(1 − α)% confidence interval for t̃_0.5 is the set of all time points t that satisfy the following condition:

$$-z_{1-\alpha/2} \le \frac{\hat{S}(t) - 0.5}{\sqrt{\widehat{\mathrm{Var}}(\hat{S}(t))}} \le z_{1-\alpha/2}.$$

The resulting confidence interval for t̃_0.5 can easily be obtained from the pointwise confidence intervals for S(t): The lower confidence bound is the smallest t-value for which the lower confidence bound of S(t) is less than or equal to 0.5, and the upper confidence bound for t̃_0.5 is the smallest t-value for which the upper confidence bound of S(t) is less than or equal to 0.5. Therefore, one can easily determine the corresponding confidence interval from the graph of the Kaplan–Meier plot and its pointwise 95% confidence band (see Fig. 1). The median survival time (i.e., the median duration from complete remission to relapse) for leukemia patients treated with 6-MP is 23 weeks. The corresponding lower confidence bound is 13 weeks. As the upper confidence bound for S(t) does not fall below the value 0.5, one may set the upper confidence bound for t̃_0.5 to infinity (= ∞).

2.4 More Practical Notes

9. If the estimated survival probability is larger than 0.5 at any time point t (which may occur when considering patients with a good prognosis), the median survival time t̃_0.5 cannot be calculated.
10. Instead of the median t̃_0.5, the K–M estimator can also be used to provide estimates of other quantiles of the survival time distribution. Recall that the p-quantile t̃_p is estimated by the smallest time t at which Ŝ(t) is less than or equal to 1 − p. Analogous to t̃_0.5, the corresponding confidence
interval for t̃_p can be obtained from the pointwise confidence interval of S(t).
11. Using a transformation to calculate pointwise confidence intervals for S(t) (see Practical Note 7), the confidence interval for t̃_p should be based on the same transformation and, therefore, may differ from the approach described above. In the example provided, the 95% confidence interval for t̃_0.5 does not change.
12. Use of the median survival time has obvious advantages as compared with the mean, which is highly sensitive to the right tail of the survival distribution, where estimation tends to be imprecise because of censoring. The mean survival time µ = ∫_0^∞ S(t) dt can naturally be estimated by substituting the K–M estimator Ŝ(t) for S(t). This estimator is appropriate only when t_last corresponds to a death (see Practical Note 5) and censoring is light. Consequently, it should be used with caution.
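The rules just described for the median and its confidence interval can be expressed in a few lines of code. The sketch below, again for the 6-MP example and with illustrative naming, recomputes the Kaplan–Meier table and then reads off the estimated median and its confidence bounds as the smallest event times at which the estimate, the lower pointwise bound, and the upper pointwise bound fall to 0.5 or below.

```python
# Median survival time and its confidence interval read off the K-M table
# (continuing the 6-MP example; reproduced here so the snippet runs on its own).
import math

times  = [6, 6, 6, 6, 7, 9, 10, 10, 11, 13, 16, 17, 19, 20, 22, 23, 25, 32, 32, 34, 35]
events = [1, 1, 1, 0, 1, 0, 1,  0,  0,  1,  1,  0,  0,  0,  1,  1,  0,  0,  0,  0,  0]

rows, s, gw = [], 1.0, 0.0
for t in sorted(set(t for t, e in zip(times, events) if e == 1)):
    r = sum(1 for x in times if x >= t)
    d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
    s *= 1.0 - d / r
    gw += d / (r * (r - d))
    se = s * math.sqrt(gw)
    rows.append((t, s, s - 1.96 * se, s + 1.96 * se))   # (t_j, S_hat, lower, upper)

def smallest_time(rows, col):
    """Smallest event time at which the chosen column falls to 0.5 or below."""
    for row in rows:
        if row[col] <= 0.5:
            return row[0]
    return math.inf            # bound never reached: report infinity, as in the text

median = smallest_time(rows, 1)   # smallest t with S_hat(t) <= 0.5
ci_low = smallest_time(rows, 2)   # smallest t with lower pointwise bound <= 0.5
ci_up  = smallest_time(rows, 3)   # smallest t with upper pointwise bound <= 0.5
print(f"median = {median} weeks, 95% CI = [{ci_low}; {ci_up}]")
# Expected output for these data: median = 23 weeks, 95% CI = [13; inf]
```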
3 ADDITIONAL TOPICS

1. The pointwise confidence intervals described above are valid only at a fixed single time point t_0 (in the example, t_0 = 12 weeks was considered). Although these intervals are plotted by many statistical software packages in addition to the K–M estimator as a function of time (see Fig. 1), the curves cannot be interpreted as a confidence band with, for example, 95% confidence that the entire survival function lies within the band. Proper confidence bands can be derived from the weak convergence of √n(Ŝ − S)/S to a mean-zero Gaussian martingale. Two important types of such confidence bands are the bounds of Hall and Wellner (11) and the equal precision bands (12).
2. Besides right-censoring, other kinds of data incompleteness, such as left truncation, may be present in survival analysis. In epidemiologic applications, for
example, individuals are often not followed from time zero (e.g., birth, if age is the relevant time scale), but only from a later entry time (conditional on survival until this entry time). For such left-truncated data, the usual K–M estimator may be calculated with a modified risk set, but it is of little practical use, because the estimates Ŝ(t) have large sampling errors. More useful is the estimation of the conditional survival distribution function (see, for example, Reference 4).
3. Left-censoring occurs when some individuals have experienced the event of interest before the start of the observation period. For samples that include both left-censoring and right-censoring, a modified K–M estimator of the survival function has been suggested by Turnbull (13).
4. Besides the event of interest (e.g., relapse) and right-censored observations, other events (e.g., non-disease-related deaths) may occur. Thus, for patients dying before relapse, the event of interest, namely relapse, could not be observed. Therefore, it is said that the two events, death and relapse, compete with each other. In such so-called competing risk situations, the occurrence of the competing risk (here the non-disease-related death) is often inadequately treated as a right-censored observation, and the survival function for the event of interest is estimated by the K–M estimator. However, as illustrated, for example, by Schwarzer et al. (14), the use of the K–M estimator is not appropriate in competing risk situations: If the K–M estimator is calculated both for the time to the event of interest and for the time to non-disease-related death, with the corresponding competing risk (death in the first case, the event of interest in the second case) treated as a right-censored observation, the sum of the two estimated survival probabilities at a fixed time point t_0 may be larger than 1. However, as the two events are mutually exclusive, the sum must not exceed 1. For competing risks, the estimation of survival functions based on
cumulative incidences is appropriate, however (14, 15).
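To illustrate the appropriate alternative for competing risks, the following sketch implements the usual nonparametric cumulative incidence estimator, which accumulates Ŝ(t_j−) · d_kj / r_j over the event times, where Ŝ is the Kaplan–Meier estimator of being free of any event (see References 14 and 15). The two-cause dataset below is hypothetical and serves only to show that the estimated cumulative incidences, unlike one minus two cause-specific K–M curves, cannot sum to more than one.

```python
# Sketch of the nonparametric cumulative incidence estimator for competing risks.
# Event codes: 0 = censored, 1 = event of interest (e.g., relapse),
#              2 = competing event (e.g., non-disease-related death).
# The small dataset below is hypothetical and for illustration only.
times  = [2, 3, 3, 5, 6, 7, 8, 10, 12, 14]
causes = [1, 2, 0, 1, 1, 0, 2, 1,  0,  2]

def cumulative_incidence(times, causes, cause_of_interest):
    """Return a list of (t_j, CIF) values at the distinct event times (any cause)."""
    cif, km_all = 0.0, 1.0      # km_all is the K-M estimate of being free of ANY event
    out = []
    for t in sorted(set(t for t, c in zip(times, causes) if c != 0)):
        r = sum(1 for x in times if x >= t)                                   # at risk just before t
        d_k = sum(1 for x, c in zip(times, causes) if x == t and c == cause_of_interest)
        d_all = sum(1 for x, c in zip(times, causes) if x == t and c != 0)
        cif += km_all * d_k / r                                               # increment of the CIF
        km_all *= 1.0 - d_all / r                                             # update overall K-M
        out.append((t, cif))
    return out

cif1 = cumulative_incidence(times, causes, 1)
cif2 = cumulative_incidence(times, causes, 2)
for (t, f1), (_, f2) in zip(cif1, cif2):
    print(f"t={t:>2}  CIF(cause 1)={f1:.3f}  CIF(cause 2)={f2:.3f}  sum={f1 + f2:.3f}")
# At every event time, the two cumulative incidences sum to 1 minus the overall
# (any-event) K-M estimate, so their sum can never exceed 1.
```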
REFERENCES
1. E. L. Kaplan and P. Meier, Non-parametric estimation from incomplete observations. J. Amer. Stat. Assoc. 1958; 53: 457–481.
2. J. Kalbfleisch and R. Prentice, The Statistical Analysis of Failure Time Data. New York: Wiley, 1980.
3. E. Marubini and M. G. Valsecchi, Analysing Survival Data from Clinical Trials and Observational Studies. Chichester: Wiley, 1995.
4. J. P. Klein and M. L. Moeschberger, Survival Analysis: Techniques for Censored and Truncated Data. New York: Springer, 1997.
5. P. K. Andersen, O. Borgan, R. D. Gill, and N. Keiding, Statistical Methods Based on Counting Processes. New York: Springer, 1992.
6. Ø. Borgan and K. Liestøl, A note on confidence intervals and bands for the survival function based on transformations. Scand. J. Stat. 1990; 17: 35–41.
7. E. J. Freireich et al., The effect of 6-mercaptopurine on the duration of steroid-induced remissions in acute leukaemia: a model for evaluation of other potentially useful therapy. Blood 1963; 21: 699–716.
8. E. T. Lee, Statistical Methods for Survival Data Analysis. New York: Wiley, 1992.
9. O. O. Aalen and S. Johansen, An empirical transition matrix for nonhomogeneous Markov chains based on censored observations. Scand. J. Stat. 1978; 5: 141–150.
10. R. Brookmeyer and J. J. Crowley, A confidence interval for the median survival time. Biometrics 1982; 38: 29–41.
11. W. Hall and J. A. Wellner, Confidence bands for a survival curve from censored data. Biometrika 1980; 67: 133–143.
12. V. N. Nair, Confidence bands for survival functions with censored data: a comparative study. Technometrics 1984; 14: 265–275.
13. B. W. Turnbull, Nonparametric estimation of the survivorship function with doubly censored data. J. Amer. Stat. Assoc. 1974; 69: 169–173.
14. G. Schwarzer, M. Schumacher, T. B. Maurer, and P. E. Ochsner, Statistical analysis of failure times in total joint replacement. J. Clin. Epidemiol. 2001; 54: 997–1003.
15. T. A. Gooley, W. Leisenring, J. Crowley, and B. E. Storer, Estimation of failure probabilities in the presence of competing risks: new representations of old estimators. Stat. Med. 1999; 18: 695–706.
KAPPA
RICHARD J. COOK
University of Waterloo, Waterloo, Ontario, Canada
In medical research it is frequently of interest to examine the extent to which results of a classification procedure concur in successive applications. For example, two psychiatrists may separately examine each member of a group of patients and categorize each one as psychotic, neurotic, suffering from a personality disorder, or healthy. Given the resulting data, questions may then be posed regarding the diagnoses of the two psychiatrists and their relationship to one another. The psychiatrists would typically be said to exhibit a high degree of agreement if a high percentage of their diagnoses concurred, and poor agreement if they often made different diagnoses. In general, this latter outcome could arise if the categories were ill-defined, the criteria for assessment were different for the two psychiatrists, or their ability to examine these criteria differed sufficiently, possibly as a result of different training or experience. Poor empirical agreement might therefore lead to a review of the category definitions and diagnostic criteria, or possibly retraining with a view to improving agreement and hence consistency of diagnoses and treatment. In another context, one might have data from successive applications of a test for dysplasia or cancer from cervical smears. If the test indicates normal, mild, moderate, or severe dysplasia, or cancer, and the test is applied at two time points in close proximity, ideally the results would be the same. Variation in the method and location of sampling as well as variation in laboratory procedures may, however, lead to different outcomes. In this context, one would say that there is empirical evidence that the test is reliable if the majority of the subjects are classified in the same way for both applications of the test. Unreliable tests would result from the sources of variation mentioned earlier. Again, empirical evidence of an unreliable test may lead to refinements of the testing procedure.
1 THE KAPPA INDEX OF RELIABILITY FOR A BINARY TEST

For convenience, consider a diagnostic testing procedure generating a binary response variable T indicating the presence (T = 1) or absence (T = 2) of a particular condition. Suppose this test is applied twice in succession to each subject in a sample of size n. Let T_k denote the outcome for the kth application, with the resulting data summarized in the two-by-two table (Table 1), where x_ij denotes the frequency at which T_1 = i and T_2 = j, x_{i·} = Σ_{j=1}^{2} x_{ij}, and x_{·j} = Σ_{i=1}^{2} x_{ij}, i = 1, 2, j = 1, 2.

Table 1.

       | T2 = 1 | T2 = 2 | Total
T1 = 1 | x11    | x12    | x1·
T1 = 2 | x21    | x22    | x2·
Total  | x·1    | x·2    | x·· = n

Assuming that test results on different subjects are independent, conditioning on n leads to a multinomial distribution for the outcome of a particular table with

$$f(x; p) = \binom{n}{x_{11}\; x_{12}\; x_{21}\; x_{22}} \prod_{i=1}^{2} \prod_{j=1}^{2} p_{ij}^{x_{ij}},$$
where x = (x_11, x_12, x_21, x_22)′, p = (p_11, p_12, p_21, p_22)′, and p_22 = 1 − p_11 − p_12 − p_21. Let p_{i·} = Σ_{j=1}^{2} p_{ij} and p_{·j} = Σ_{i=1}^{2} p_{ij}. Knowledge of p would correspond to a complete understanding of the reliability of the test. Since knowledge of p is generally unattainable and estimation of p does not constitute a sufficient data reduction, indices of reliability/agreement typically focus on estimating one-dimensional functions of p. A natural choice is p_0 = Σ_{i=1}^{2} p_{ii}, the probability of raw agreement, which is estimated as p̂_0 = Σ_{i=1}^{2} x_{ii}/n. If p_0 = 1, then the test is completely reliable since the probability of observing discordant test results is zero. Similarly, if p̂_0 is close to unity, then it suggests that the outcomes of the two
applications concurred for the vast majority of the subjects. However, several authors have expressed reluctance to base inferences regarding reliability on the observed level of raw agreement (see (3) and references cited therein). The purported limitations of p̂_0 as a measure of reliability stem from the fact that p_0 reflects both "chance" agreement and agreement over and above that which would be expected by chance. The agreement expected by chance, which we denote by p_e, is computed on the basis of the marginal distribution, defined by p_{1·} and p_{·1}, and under the assumption that the outcomes of the two tests are independent conditional on the true status. Specifically, p_e = Σ_{i=1}^{2} p_{i·} p_{·i} is estimated by p̂_e = Σ_{i=1}^{2} x_{i·} x_{·i}/n². To address concerns regarding the impact of nonnegligible chance agreement, Cohen (3) defined the index kappa, which takes the form

$$\kappa = \frac{p_0 - p_e}{1 - p_e},$$
and indicated that it can be interpreted as reflecting ‘‘the proportion of agreement after chance agreement is removed from consideration’’. This can be seen by noting that p0 − pe is the difference in the proportion of raw agreement and the agreement expected by chance, this being the agreement arising due to factors not driven by chance. If p0 − pe > 0, then there is agreement arising from nonchance factors; if p0 − pe = 0, then there is no additional agreement over that which one would expect based on chance; and if p0 − pe < 0, then there is less agreement than one would expect by chance. Furthermore, 1 − pe is interpreted by Cohen (3) as the proportion ‘‘of the units for which the hypothesis of no association would predict disagreement between the judges’’. Alternatively, this can be thought of as the maximum possible agreement beyond that expected by chance. An estimate of κ, denoted κ, ˆ is referred to as the kappa statistic and may be obtained by replacing p0 and pe with their corresponding point estimates, giving κˆ =
pˆ 0 − pˆ e . 1 − pˆ e
(1)
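As a concrete illustration of (1), the following minimal Python sketch computes κ̂ from a table of paired test results; the two-by-two counts are hypothetical and the function name is ours, not part of any standard library.

import numpy as np

def cohen_kappa(table):
    # table[i, j] = number of subjects with T1 = i + 1 and T2 = j + 1
    x = np.asarray(table, dtype=float)
    n = x.sum()
    p0_hat = np.trace(x) / n                                  # observed (raw) agreement
    pe_hat = (x.sum(axis=1) * x.sum(axis=0)).sum() / n ** 2   # chance-expected agreement
    return (p0_hat - pe_hat) / (1 - pe_hat)

# hypothetical test-retest counts for n = 100 subjects
print(cohen_kappa([[40, 10],
                   [ 5, 45]]))   # 0.70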
Table 2.
           T_2 = 1   T_2 = 2   T_2 = 3   ...   T_2 = R   Total
T_1 = 1    x_11      x_12      x_13      ...   x_1R      x_1·
T_1 = 2    x_21      x_22      x_23      ...   x_2R      x_2·
T_1 = 3    x_31      x_32      x_33      ...   x_3R      x_3·
...        ...       ...       ...       ...   ...       ...
T_1 = R    x_R1      x_R2      x_R3      ...   x_RR      x_R·
Total      x_·1      x_·2      x_·3      ...   x_·R      x_·· = n
2 THE KAPPA INDEX OF RELIABILITY FOR MULTIPLE CATEGORIES

When the classification procedure of interest has multiple nominal categories, assessment of agreement becomes somewhat more involved. Consider a diagnostic test with R possible outcomes and let T_k denote the outcome of the kth application of the test, k = 1, 2. Then T_k takes values on {1, 2, 3, . . ., R} and interest lies in assessing the extent to which these outcomes agree for k = 1 and k = 2. An R × R contingency table may then be constructed (see Table 2), where again x_ij denotes the frequency with which the first application of the test led to outcome i and the second led to outcome j, i = 1, 2, . . ., R, j = 1, 2, . . ., R. A category-specific measure of agreement may be of interest to examine the extent to which the two applications tend to lead to consistent conclusions with respect to outcome r, say. In this problem there is an implicit assumption that the particular nature of any disagreements is not of interest. One can then collapse the R × R table to a 2 × 2 table constructed by cross-classifying subjects with binary indicators such that T_k = 1 if outcome r was selected at the kth application, and T_k = 2 otherwise, k = 1, 2. A category-specific kappa statistic can then be constructed in the fashion indicated earlier. This can be repeated for each of the R categories, giving R such statistics. In addition to these category-specific measures, however, an overall summary index of agreement is often of interest.

The kappa statistic in (1) is immediately generalized for the R × R (R > 2) table as follows. Let p_ij denote the probability that T_1 = i and T_2 = j, one of the R^2 multinomial probabilities, p_i· = Σ_{j=1}^{R} p_ij, and p_·j = Σ_{i=1}^{R} p_ij, i = 1, 2, . . ., R, j = 1, 2, . . ., R. Then, as before, p̂_ij = x_ij/n, p̂_i· = x_i·/n, p̂_·j = x_·j/n, p̂_0 = Σ_{i=1}^{R} p̂_ii, p̂_e = Σ_{i=1}^{R} p̂_i· p̂_·i, and the overall kappa statistic takes the same form as in (1). This overall kappa statistic can equivalently be written as a weighted average of category-specific kappa statistics (6).

The kappa statistic has several properties that are widely considered to be attractive for measures of agreement. First, when the level of observed agreement, reflected by p̂_0, is equal to the level of agreement expected by chance (p̂_e), κ̂ = 0. Secondly, κ̂ takes on its maximum value of 1 if and only if there is perfect agreement (i.e., p̂_0 = 1 arising from a diagonal table). Thirdly, the kappa statistic is never less than −1. The latter two features require further elaboration, however, as the actual upper and lower limits on κ̂ are functions of the marginal frequencies. In particular, κ̂ takes on the value 1 only when the marginal frequencies are exactly equal and all off-diagonal cells are zero. Values less than 1 occur when the marginal frequencies are the same but there are different category assignments in the table or, more generally, when the marginal frequencies differ (when the marginal frequencies differ there are necessarily nonzero off-diagonal cells and hence some disagreements). It is natural then to expect the kappa statistic for such a table to be less than unity. Cohen (3) shows that the maximum possible value of κ̂ takes the form

κ̂_M = [x_·· Σ_{i=1}^{R} min(x_i·, x_·i) − Σ_{i=1}^{R} x_i· x_·i] / [x_··^2 − Σ_{i=1}^{R} x_i· x_·i],   (2)

and argues that this is intuitively reasonable since differences in the marginal frequencies necessarily lead to a reduction in the level of agreement and hence κ̂. Cohen then suggests that if one is interested in assessing the proportion of the agreement permitted by the margins (correcting for chance), then one computes κ̂/κ̂_M. We return to the topic of marginal frequencies and their influence on the properties of κ later in the article.

If the marginal frequencies for the two tests are uncorrelated (as measured by the product–moment correlation of the margins (3)), then the lower bound for κ̂ is κ̂_L = −(R − 1)^{−1}. When the marginal frequencies are negatively correlated, κ̂_L > −(R − 1)^{−1}. However, when the marginal frequencies are positively correlated, κ̂_L < −(R − 1)^{−1}. It is only as the number of categories reduces to two, the correlation of the marginal frequencies approaches 1, and the variances of the marginal frequencies increase, that κ̂_L approaches −1 (3).

Having computed a kappa statistic for a given contingency table it is natural to want to characterize the level of agreement in descriptive terms. Landis & Koch (11) provide ranges that suggest, beyond what one would expect by chance, 0.75 < κ̂ typically represents excellent agreement, 0.40 < κ̂ < 0.75 fair to good agreement, and κ̂ < 0.40 poor agreement. While there is some appeal to this convenient framework for the interpretation of κ̂, caution is warranted.

Frequently, it will be of interest to construct confidence intervals for the index kappa. Fleiss et al. (8) derive an approximate large-sample estimate for the variance of κ̂, var(κ̂), as

var(κ̂) = {Σ_{i=1}^{R} p̂_ii [1 − (p̂_i· + p̂_·i)(1 − κ̂)]^2 + (1 − κ̂)^2 Σ_{i≠j} p̂_ij (p̂_·i + p̂_j·)^2 − [κ̂ − p̂_e(1 − κ̂)]^2} / [x_··(1 − p̂_e)^2],   (3)

and Fleiss (6) recommends carrying out tests (see Hypothesis Testing) and constructing confidence intervals by assuming approximate normality of (κ̂ − κ)/[var(κ̂)]^{1/2} and proceeding in the standard fashion. For tests regarding the null hypothesis H_0: κ = 0, an alternate variance estimate may be derived from (3) by substituting 0 for κ̂, and p̂_i· p̂_·j for p̂_ij, giving

var_0(κ̂) = {Σ_{i=1}^{R} p̂_i· p̂_·i [1 − (p̂_i· + p̂_·i)]^2 + Σ_{i≠j} p̂_i· p̂_·j (p̂_·i + p̂_j·)^2 − p̂_e^2} / [x_··(1 − p̂_e)^2],   (4)

with tests carried out as described above.
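The next sketch, again with hypothetical counts, computes the overall κ̂ for an R × R table together with the large-sample variance in (3) and a Wald-type confidence interval; it is a direct transcription of the formulas above rather than a reference implementation.

import numpy as np
from scipy.stats import norm

def kappa_with_ci(table, level=0.95):
    x = np.asarray(table, dtype=float)
    n = x.sum()
    p = x / n
    p_row, p_col = p.sum(axis=1), p.sum(axis=0)   # p_i. and p_.j
    p0 = np.trace(p)
    pe = (p_row * p_col).sum()
    kappa = (p0 - pe) / (1 - pe)
    R = p.shape[0]
    # equation (3): Fleiss, Cohen & Everitt large-sample variance of kappa-hat
    term1 = sum(p[i, i] * (1 - (p_row[i] + p_col[i]) * (1 - kappa)) ** 2 for i in range(R))
    term2 = (1 - kappa) ** 2 * sum(p[i, j] * (p_col[i] + p_row[j]) ** 2
                                   for i in range(R) for j in range(R) if i != j)
    term3 = (kappa - pe * (1 - kappa)) ** 2
    var = (term1 + term2 - term3) / (n * (1 - pe) ** 2)
    z = norm.ppf(0.5 + level / 2)
    half = z * np.sqrt(var)
    return kappa, (kappa - half, kappa + half)

print(kappa_with_ci([[30,  4,  1],
                     [ 5, 25,  5],
                     [ 2,  6, 22]]))   # hypothetical three-category counts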
3 THE WEIGHTED KAPPA INDEX
The discussion thus far has focused on situations in which the test serves as a nominal classification procedure (e.g., as in the psychiatric diagnosis example at the beginning of the article). In such settings, since there is no natural ordering to the outcomes, any disagreements are often considered to be equally serious and the methods previously described are directly applicable. In some circumstances with nominal scales, however, certain types of disagreements are more serious than others and it is desirable to take this into account. Furthermore, when the outcome is ordinal (as in the cervical cancer screening example), it is often of interest to adopt a measure of agreement that treats disagreements in adjacent categories as less serious than disagreements in more disparate categories. For the test based on cervical smears designed to classify the condition of the cervix as healthy, mildly, moderately, or severely dysplastic, or cancerous, if on one occasion the test suggested mild dysplasia and on another moderate, this type of disagreement would be considered less serious than if a cervix previously diagnosed as cancerous was subsequently classified as mildly dysplastic. In general, the seriousness reflects clinical implications for treatment and the consequences of wrong decisions. Weighted versions of the kappa statistic were derived by Cohen (4) to take into account the additional structure arising from ordinal measures or from nominal scales in which certain types of disagreement are of more importance than others. In particular, the objective of adopting a weighted kappa
statistic is to allow "different kinds of disagreement" to be differentially weighted in the construction of the overall index. We begin by assigning a weight to each of the R^2 cells; let w_ij denote the weight for cell (i, j). These weights may be determined quite arbitrarily, but it is natural to restrict 0 ≤ w_ij ≤ 1, set w_ii to unity to give exact agreement maximum weight, and set 0 ≤ w_ij < 1 for i ≠ j, so that all disagreements are given less weight than exact agreement. The selection of the weights plays a key role in the interpretation of the weighted kappa statistic and also impacts the corresponding variance estimates, prompting Cohen (4) to suggest these be specified prior to the collection of the data. Perhaps the two most common sets of weights are the quadratic weights, with w_ij = 1 − (i − j)^2/(R − 1)^2, and the so-called Cicchetti weights, with w_ij = 1 − |i − j|/(R − 1) (1,2). The quadratic weights tend to weight disagreements just off the main diagonal more highly than Cicchetti weights, and the relative weighting of disagreements farther from the main diagonal is also higher with the quadratic weights. Clearly, these two weighting schemes share the minimal requirements cited above. The weighted kappa statistic then takes the form

κ̂^(w) = (p̂_0^(w) − p̂_e^(w)) / (1 − p̂_e^(w)),   (5)

where p̂_0^(w) = Σ_{i=1}^{R} Σ_{j=1}^{R} w_ij p̂_ij and p̂_e^(w) = Σ_{i=1}^{R} Σ_{j=1}^{R} w_ij p̂_i· p̂_·j. If w̄_i· = Σ_{j=1}^{R} p̂_·j w_ij and w̄_·j = Σ_{i=1}^{R} p̂_i· w_ij, then the large-sample variance of κ̂^(w) is estimated by

var(κ̂^(w)) = {Σ_{i=1}^{R} Σ_{j=1}^{R} p̂_ij [w_ij − (w̄_i· + w̄_·j)(1 − κ̂^(w))]^2 − [κ̂^(w) − p̂_e^(w)(1 − κ̂^(w))]^2} / [x_··(1 − p̂_e^(w))^2],   (6)

and, as before, tests and confidence intervals may be carried out and derived in the standard fashion assuming asymptotic normality of the quantity (κ̂^(w) − κ^(w))/[var(κ̂^(w))]^{1/2}. As in the unweighted case, a variance estimate appropriate for testing H_0: κ^(w) = 0 may be derived by substituting p̂_i· p̂_·j for p̂_ij, and 0 for κ̂^(w), in (6).

We note in passing that the weighted kappa with quadratic weights has been shown to bear connections to the intraclass correlation coefficient. Suppose that with an ordinal outcome the categories are assigned the integers 1 through R from the "lowest" to "highest" categories, respectively, and assignment to these categories is taken to correspond to a realization of the appropriate integer value. Fleiss & Cohen (7) show that the intraclass correlation coefficient computed by treating these integer responses as coming from a Gaussian general linear model for a two-way analysis of variance is asymptotically equivalent to the weighted kappa statistic with quadratic weights.
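A small sketch of (5) with the two weighting schemes just described follows; the ordinal three-category counts are hypothetical, and the keyword "cicchetti" is simply our label for the linear weights.

import numpy as np

def weighted_kappa(table, weights="quadratic"):
    x = np.asarray(table, dtype=float)
    R, n = x.shape[0], x.sum()
    p = x / n
    i, j = np.indices((R, R))
    if weights == "quadratic":
        w = 1 - (i - j) ** 2 / (R - 1) ** 2
    else:                                   # Cicchetti (linear) weights
        w = 1 - np.abs(i - j) / (R - 1)
    p0_w = (w * p).sum()                                        # weighted observed agreement
    pe_w = (w * np.outer(p.sum(axis=1), p.sum(axis=0))).sum()   # weighted chance agreement
    return (p0_w - pe_w) / (1 - pe_w)

ratings = [[20,  5,  1],
           [ 4, 18,  6],
           [ 1,  5, 15]]    # hypothetical ordinal test-retest counts
print(weighted_kappa(ratings, "quadratic"), weighted_kappa(ratings, "cicchetti"))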
4 THE KAPPA INDEX FOR MULTIPLE OBSERVERS

Thus far we have restricted consideration to the case of two applications of the classification procedure (e.g., two successive applications of a diagnostic test, two physicians carrying out successive diagnoses, etc.). In many situations, however, there are multiple (>2) applications and interest lies in measuring agreement on the basis of several applications. Fleiss (5) considered the particular problem in which a group of subjects was examined and classified by a fixed number of observers, but where it was not necessarily the same set of observers carrying out the assessments for each patient. Moreover, Fleiss (5) assumed that it was not possible to identify which observers were involved in examining the patients. For this problem, we require some new notation. Let M denote the number of subjects, N denote the number of observers per subject, and R denote the number of categories as before. Therefore, NM classifications are to be made. Let n_ij denote the number of times the ith subject was assigned to the jth category. A measure of overall raw agreement for the assignments on the ith subject is given by

q̂_i = Σ_{j=1}^{R} n_ij(n_ij − 1) / [N(N − 1)],   (7)

which can be interpreted as follows. With N observers per subject there are C(N, 2) possible pairs of assignments. There are C(n_ij, 2) which agree on category j and hence a total of Σ_{j=1}^{R} C(n_ij, 2) pairs of assignments which concur altogether for the ith subject. Thus, (7) simply represents the proportion of all paired assignments on the ith subject for which there was agreement on the category. The overall measure of raw observed agreement over all subjects is then given by q̂_0 = M^{−1} Σ_{i=1}^{M} q̂_i, which equals

q̂_0 = Σ_{i=1}^{M} Σ_{j=1}^{R} n_ij^2 / [MN(N − 1)] − 1/(N − 1).   (8)

As before, however, some agreement would be expected among the observers simply by chance, and the kappa statistic in this setting corrects for this. The expected level of agreement is computed by noting that

p̂_j = Σ_{i=1}^{M} n_ij / (MN)

is the sample proportion of all assignments made to category j, with Σ_{j=1}^{R} p̂_j = 1. So if pairs of observers were simply assigning subjects to categories at random and independently, one can estimate that they would be expected to agree according to

p̂_e = Σ_{j=1}^{R} p̂_j^2,   (9)

and then the kappa statistic is computed by correcting for chance in the usual way as

κ̂ = (q̂_0 − p̂_e)/(1 − p̂_e).   (10)

The sample variance for (10) is derived by Fleiss et al. (9) to be

var(κ̂) = 2 {[Σ_{j=1}^{R} p_j(1 − p_j)]^2 − Σ_{j=1}^{R} p_j(1 − p_j)(1 − 2p_j)} / {MN(N − 1)[Σ_{j=1}^{R} p_j(1 − p_j)]^2},   (11)

and is typically used for tests or interval estimation in the standard fashion.

When the same set of raters assesses all subjects and individual raters' scores are known, it is not possible to use the results of Fleiss (5) without ignoring the rater-specific assignments. For this context, Schouten (13) proposed the use of indices based on weighted sums of pairwise measures of observed and expected levels of agreement. In particular, for a given pair of raters and a given pair of categories, observed and expected measures of agreement may be computed as earlier. Then, for each pair of raters, a measure of overall observed agreement may be obtained by taking a weighted average of such measures over all pairwise combinations of categories. Given a corresponding measure of expected agreement, an overall kappa statistic can be computed in the usual fashion. Schouten (13) then described how to obtain kappa statistics reflecting agreement over all observers, agreement between a particular observer and the remaining observers, and agreement within and between subgroups of observers.
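The following sketch implements (7)–(11) for the Fleiss setting in which only the number of assignments to each category is known for each subject; the 6 × 3 table of counts is hypothetical.

import numpy as np

def fleiss_kappa(counts):
    # counts[i, j] = number of the N observers assigning subject i to category j
    nij = np.asarray(counts, dtype=float)
    M, R = nij.shape
    N = int(nij[0].sum())                   # assumes N ratings for every subject
    q_i = (nij * (nij - 1)).sum(axis=1) / (N * (N - 1))          # equation (7)
    q0 = q_i.mean()                                              # equation (8)
    p_j = nij.sum(axis=0) / (M * N)
    pe = (p_j ** 2).sum()                                        # equation (9)
    kappa = (q0 - pe) / (1 - pe)                                 # equation (10)
    s = (p_j * (1 - p_j)).sum()
    var = 2 * (s ** 2 - (p_j * (1 - p_j) * (1 - 2 * p_j)).sum()) / (M * N * (N - 1) * s ** 2)  # equation (11)
    return kappa, var

counts = [[4, 0, 0],
          [2, 2, 0],
          [1, 2, 1],
          [0, 4, 0],
          [0, 1, 3],
          [3, 1, 0]]        # 6 subjects, N = 4 observers, R = 3 categories
print(fleiss_kappa(counts))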
5 GENERAL REMARKS
MaClure & Willett (12) provide a comprehensive review and effectively highlight a number of limitations of the kappa statistics. In particular, they stress that for ordinal data derived from categorizing underlying continuous responses, the kappa statistic depends heavily on the often arbitrary category definitions, raising questions about interpretability. They also suggest that the
use of weights, while attractive in allowing for varying degrees of disagreement, introduces another component of subjectivity into the computation of kappa statistics. Perhaps the issue of greatest debate is the so-called prevalence, or base-rate, problem of kappa statistics. Several other authors have examined critically the properties and interpretation of kappa statistics (10,14,15), and the debate of the merits and demerits continues unabated. Despite the apparent limitations, the kappa statistic enjoys widespread use in the medical literature and has been the focus of considerable statistical research.

REFERENCES

1. Cicchetti, D. V. (1972). A new measure of agreement between rank ordered variables, Proceedings of the American Psychological Association 7, 17–18.
2. Cicchetti, D. V. & Allison, T. (1973). Assessing the reliability of scoring EEG sleep records: an improved method, Proceedings and Journal of the Electro-physiological Technologists' Association 20, 92–102.
3. Cohen, J. (1960). A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20, 37–46.
4. Cohen, J. (1968). Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit, Psychological Bulletin 70, 213–220.
5. Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters, Psychological Bulletin 76, 378–382.
6. Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions, 2nd Ed. Wiley, New York.
7. Fleiss, J. L. & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability, Educational and Psychological Measurement 33, 613–619.
8. Fleiss, J. L., Cohen, J. & Everitt, B. S. (1969). Large sample standard errors of kappa and weighted kappa, Psychological Bulletin 72, 323–327.
9. Fleiss, J. L., Nee, J. C. M. & Landis, J. R. (1979). Large sample variance of kappa in the case of different sets of raters, Psychological Bulletin 86, 974–977.
10. Kraemer, H. C. & Bloch, D. A. (1988). Kappa coefficients in epidemiology: an appraisal of a reappraisal, Journal of Clinical Epidemiology 41, 959–968.
11. Landis, J. R. & Koch, G. G. (1977). The measurement of observer agreement for categorical data, Biometrics 33, 159–174.
12. MaClure, M. & Willett, W. C. (1987). Misinterpretation and misuse of the kappa statistic, American Journal of Epidemiology 126, 161–169.
13. Schouten, H. J. A. (1982). Measuring pairwise interobserver agreement when all subjects are judged by the same observers, Statistica Neerlandica 36, 45–61.
14. Thompson, W. D. & Walter, S. D. (1988). A reappraisal of the kappa coefficient, Journal of Clinical Epidemiology 41, 949–958.
15. Thompson, W. D. & Walter, S. D. (1988). Kappa and the concept of independent errors, Journal of Clinical Epidemiology 41, 969–970.
KEFAUVER–HARRIS DRUG AMENDMENTS
In 1962, news reports about how Food and Drug Administration (FDA) Medical Officer Frances O. Kelsey, M.D., Ph.D., had kept the drug thalidomide off the U.S. market aroused public interest in drug regulation. Thalidomide had been marketed as a sleeping pill by the German firm Chemie Grunenthal, and it was associated with the birth of thousands of malformed babies in Western Europe.

"In the years before 1962, Senator Estes Kefauver had held hearings on drug costs, the sorry state of science supporting drug effectiveness, and the fantastic claims made in labeling and advertising," Temple says. "Well-known clinical pharmacologists explained the difference between well-controlled studies and the typical drug study. With the [Food, Drug and Cosmetic] FD&C Act 'in play' because of thalidomide, Congress had the opportunity to make major changes."

In October 1962, Congress passed the Kefauver–Harris Drug Amendments to the Federal FD&C Act. Before marketing a drug, firms now had to not only prove safety, but also they had to provide substantial evidence of effectiveness for the product's intended use. Temple says, "That evidence had to consist of adequate and well-controlled studies, a revolutionary requirement." "Also critically, the 1962 amendments required that the FDA specifically approve the marketing application before the drug could be marketed, another major change."

The Kefauver–Harris Drug Amendments also asked the Secretary to establish rules of investigation of new drugs, which include a requirement for the informed consent of study subjects. The amendments also formalized good manufacturing practices, required that adverse events be reported, and transferred the regulation of prescription drug advertising from the Federal Trade Commission to the FDA.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/fdac/features/2006/106 cder.html) by Ralph D'Agostino and Sarah Karl.
LAN-DEMETS: ALPHA SPENDING FUNCTION
DAVID L. DEMETS
University of Wisconsin–Madison, Madison, Wisconsin

K. K. GORDON LAN
Johnson & Johnson, Raritan, New Jersey
The randomized control clinical trial (RCT) is the standard method for the definitive evaluation of the benefits and risks of drugs, biologics, devices, procedures, diagnostic tests, and any intervention strategy. Good statistical principles are critical in the design and analysis of these RCTs (1,2). RCTs also depend on interim analysis of accumulating data to monitor for early evidence of benefit, harm, or futility. This interim analysis principle was established early in the history of RCTs (3) and was implemented in early trials such as the Coronary Drug Project (4,5). Evaluation of the interim analysis may require the advice of an independent data monitoring committee (DMC) (6,7), including certain trials under regulatory review (8,9). However, although ethically and scientifically compelling, interim repeated analysis of accumulating data has the statistical consequence of increased false positive claims unless special steps are taken. The issue of sequential analysis has a long tradition (10,11) and has received special attention for clinical trials (12,13). In particular, increasing the frequency of interim analysis can substantially increase the Type I error rate if the same criteria are used for each interim analysis (13). This increase was demonstrated in the Coronary Drug Project, which used sequential analysis for monitoring several treatment arms compared with a placebo (4). Most of the classical sequential methods assumed continuous analysis of accumulating data, a practice not realistic for most RCTs. Rather than continuous monitoring, most clinical trials review accumulating data periodically after additional data has been collected.
Assume that the procedure is a one-sided test but that the process can be easily generalized to introduce two one-sided symmetric or asymmetric boundaries. In general, a test statistic Z(j), j = 1, 2, 3, . . ., J, is computed at each successive interim analysis. In the large-sample case and under the null hypothesis, the Z(j)s are standard N(0,1). At each analysis, the test statistic is compared with a critical value Zc(j). The trial would continue as long as the test statistic does not exceed the critical value. That is, continue the trial as long as

Z(j) < Zc(j) for j = 1, 2, 3, . . ., J − 1.

Otherwise, the trial might be considered for termination. We would fail to reject the null hypothesis if Z(j) < Zc(j) for all j (j = 1, 2, . . ., J). We would reject the null hypothesis if, at any interim analysis,

Z(j) ≥ Zc(j) for j = 1, 2, . . ., J.

Peto et al. (14) recommended using a very conservative critical value for each interim analysis, say a standardized value of Zc(j) = 3.0 for all j (j = 1, 2, . . ., J), such that the impact on the overall Type I error would be minimal. In 1977, Pocock (15) published a paper based on the earlier work of Armitage and colleagues (13) that formally introduced the idea of a group sequential approach. This modification developed a more conservative critical value than the naïve one (e.g., 1.96 for a one-sided Type I error of 0.025) to be used at each analysis such that the overall Type I error was controlled. For example, if a total of 5 interim analyses were to be conducted with an intended Type I error of 0.025, then Zc(j) = 2.413 would be used at each interim analysis (j = 1, 2, . . ., 5). Note that this final critical value is much larger than the standard critical value. In 1979, O'Brien and Fleming (16) introduced an alternative group sequential boundary for evaluating interim analyses. In this approach, the critical values change with each interim analysis, starting with a very conservative (i.e., large) value and shrinking to a final value close to the nominal critical value at the scheduled
completion. The exact form for each critical value is Zc(j) = Z_OBF(J) √(J/j). In this case, for the same 5 interim analyses and an overall Type I error of 0.025, the Z_OBF(5) value is 2.04, which makes the 5 critical values 2.04 √(5/j) for j = 1, 2, . . ., 5, or (4.56, 3.23, 2.63, 2.28, and 2.04). Both of these latter models assume an equal increment in information between analyses and that the number of interim analyses J is fixed in advance. These three group sequential boundaries have been widely used, and examples are shown in Fig. 1. In fact, the OBF group sequential method was used in the Beta-Blocker Heart Attack Trial (BHAT) (17), which terminated early because of an overwhelming treatment benefit for mortality. In 1987, Wang and Tsiatis generalized the idea of Pocock and O'Brien-Fleming and introduced a family of group sequential boundaries. For given α, J, and a shape parameter φ, a constant C is chosen so that the probability, under the null hypothesis, that Z(j) ≥ C(J/j)^φ for some j = 1, 2, . . ., J is equal to α. The choice of φ = 0.5 yields the OBF boundary, and φ = 0 yields the Pocock boundary.

1 ALPHA SPENDING FUNCTION MOTIVATION

The BHAT trial was an important factor in the motivation for the alpha spending function approach to group sequential monitoring. BHAT was a cardiovascular trial that evaluated a betablocker class drug to reduce mortality following a heart attack (17). An independent DMC reviewed the data periodically, using the OBF group sequential boundaries as a guide. A beneficial mortality trend emerged early in the trial and continued to enlarge with subsequent evaluations. At the sixth of a planned seven interim analyses, the logrank test statistic crossed the OBF boundary. After careful examination of all aspects, the DMC recommended that the trial be terminated, approximately 1 year earlier than planned. However, although the OBF sequential boundaries were used, the assumptions of these models were not strictly met. The increment in the number of deaths between DMC meetings was not equal. Furthermore, additional interim analyses were contemplated although not done. This experience suggested the need for more flexible
sequential methods for evaluating interim results. Neither the number nor the timing of interim analyses can be guaranteed in advance. A DMC may need to add additional interim analyses as trends that suggest benefit or harm emerge. As described by Ellenberg, Fleming, and DeMets (7), many factors must be considered before recommendations for early termination are made, and an additional interim analysis may be necessary to confirm or more fully evaluate these issues. Thus, the need for a flexible group sequential method seemed compelling.

Figure 1. Upper boundary values corresponding to the α1(t*) spending function for α = 0.05 at information fractions t* = 0.25, 0.50, 0.75, and 1.0, and for a truncated version at a critical value of 3.0.

2 THE ALPHA SPENDING FUNCTION

The initial (or classical) group sequential boundaries are formed by choosing boundary values such that the sum of the probability of exceeding those critical values during the interim analyses is exactly the specified alpha level set in the trial design, assuming the null hypothesis of no intervention effect. That is, the total available alpha is allocated or "spent" over the prespecified times of interim analyses. The alpha spending function proposed by Lan and DeMets (18) allocated the alpha level over the interim analyses by a continuous monotone function, α(t), where t is the information fraction, 0 ≤ t ≤ 1. Here t could be the fraction of target patients recruited (n/N) or the fraction of targeted deaths observed (d/D) at the time of the interim analysis. In general, if the total information for the trial design is I, then at the j-th analysis, the information fraction is tj = Ij/I. The total expected information I should have been determined by the trial design if properly done. The function α(t) is defined such that α(0) = 0 and α(1) = α. Boundary values Zc(j), which correspond to the α-spending function α(t), can be determined successively so that, under the null hypothesis,

P0{Z(1) ≥ Zc(1), or Z(2) ≥ Zc(2), or . . ., or Z(j) ≥ Zc(j)} = α(tj),   (1)

where {Z(1), . . ., Z(j)} represent the test statistics from the interim analyses 1, . . ., j. The specification of α(t) will create a boundary of critical values for interim test statistics, and we can specify functions that approximate O'Brien–Fleming or Pocock boundaries as follows:

α1(t) = 2 − 2Φ(Z_{α/2}/√t)   (O'Brien–Fleming type)
α2(t) = α ln(1 + (e − 1)t)   (Pocock type)

where Φ denotes the standard normal cumulative distribution function. The shape of the alpha spending function is shown in Fig. 2 for both of these boundaries. Two other families of spending functions proposed (19,20) are

α(θ, t) = α t^θ for θ > 0,
α(γ, t) = α[(1 − e^{−γt})/(1 − e^{−γ})] for γ ≠ 0.

The increment α(tj) − α(tj−1) represents the additional amount of alpha or Type I error probability that can be used at the jth analysis. In general, to solve for the boundary values Zc(j), we need to obtain the multivariate distribution of Z(1), Z(2), . . ., Z(J). In the cases to be discussed, the distribution is asymptotically multivariate normal with covariance structure Σ = (σjk), where

σjk = cov(Z(j), Z(k)) = √(tj/tk) = √(ij/ik) for j ≤ k,

where ij and ik are the amount of information available at the j-th and k-th data
monitoring, respectively. Note that at the jth data monitoring, ij and ik are observable and σjk is known even if I (total information) is unknown. However, if I is not known during interim analysis, we must estimate I by Î and tj by t̂j = xj/Î, so that we can estimate α(tj) by α(t̂j). If these increments have an independent distributional structure, which is often the case, then derivation of the values of the Zc(j) from the chosen form of α(t) is relatively straightforward using Equation (1) and the methods of Armitage et al. (21,22). If the sequentially computed statistics do not have an independent increment structure, then the derivation of the Zc(j) involves a more complicated numerical integration and sometimes is estimated by simulation. However, as discussed later, for the most frequently used test statistics, the independent increment structure holds. This formulation of the alpha spending function provides two key flexible features. Neither the timing nor the total number of interim analyses has to be fixed in advance. The critical boundary value at the j-th analysis depends only on the information fraction tj and the previous j − 1 information fractions, t1, t2, . . ., tj−1, and the specific spending function being used. However, once an alpha spending function has been chosen before the initiation of the trial, that spending function must be used for the duration of the trial. A DMC can change the frequency of the interim analyses as trends emerge without appreciably affecting the overall α level (23,24). Thus,
it is difficult to abuse the flexibility of this approach. The timing and spacing of interim analyses using the alpha spending function approach have been examined (19,25–27). For most trials, two early analyses with less than 50% of the information fraction are adequate. An early analysis, say at 10%, is often useful to make sure that all of the operational and monitoring procedures are in order. In rare cases, such early interim reviews can identify unexpected harm, such as in the Cardiac Arrhythmia Suppression Trial (28), which terminated early for increased mortality at 10% of the information fraction using an alpha spending function. A second early analysis at 40% or 50% of the information fraction can also identify strong, convincing trends of benefit, as in two trials that evaluated beta blocker drugs in chronic heart failure (29,30). Both trials terminated early at approximately 50% of the information fraction with mortality benefits.

Figure 2. Comparison of spending functions α1(t*), α2(t*), and α3(t*) at information fractions t* = 0.2, 0.4, 0.6, 0.8, and 1.0.

Computation of the alpha spending function can be facilitated by available software on the web (www.biostat.wisc.edu/landemets)
or by commercial software packages (www.cytel.com/Products/East/default.asp).

3 APPLICATION OF THE ALPHA SPENDING FUNCTION

Initial development of group sequential boundaries was for comparison of proportions or means (15,16,26). In these cases, the increments in information are represented by additional groups of subjects and their responses to the intervention. For comparing means or proportions, the information fraction t can be estimated by n/N, the observed sample size divided by the expected sample size. However, later work expanded the use to other common statistical procedures. Tsiatis and colleagues (31,32) demonstrated that sequential logrank test statistics and the general class of rank statistics used in censored survival data had the independent increment structure that made the application to group sequential boundaries straightforward. Later, Kim and Tsiatis (33)
demonstrated that the alpha spending function approach for sequential logrank tests was also appropriate. In this case, the information fraction is approximated by d/D, the number of observed events or deaths divided by the expected (designed-for) number of events or deaths (34). Application of the alpha spending function for logrank tests has been used in several clinical trials (e.g., 28–30).

Table 1. Comparison of boundaries using spending functions with Pocock (P) and O'Brien–Fleming (OBF) methods (α = 0.05, t* = 0.2, 0.4, 0.6, 0.8, and 1.0)

t*     α1(t*)   OBF    α2(t*)   P
0.2    4.90     4.56   2.44     2.41
0.4    3.35     3.23   2.43     2.41
0.6    2.68     2.63   2.41     2.41
0.8    2.29     2.28   2.40     2.41
1.0    2.03     2.04   2.36     2.41

Group sequential procedures including the alpha spending function have also been applied to longitudinal studies using a linear random effects model (35,36). Longitudinal studies have also been evaluated using generalized estimating equations (37). In a typical longitudinal clinical trial, subjects are added over time, and more observations are gathered for each subject during the course of the trial. One statistic commonly used is to evaluate the rate of change by essentially computing the slope of the observations for each subject and then taking a weighted average of these slopes over the subjects in each intervention arm. The sequential test statistics for comparison of slopes using the alpha spending function must take into account their distribution. If the information fraction is defined in terms of the Fisher information (i.e., the inverse of the variance of the slopes), then the increments in the test statistic are independent, and the alpha spending function can be applied directly (38). The total expected information may not be known exactly, but it often can be estimated. Wu and Lan (36) provide other approaches to estimate the information fraction in this setting. Scharfstein and Tsiatis (39) demonstrated that any class of test statistics that satisfies specific likelihood function criteria will have this property and thus can be used directly in this group sequential setting.

4 CONFIDENCE INTERVALS AND ESTIMATION

Confidence intervals for an unknown parameter θ following early stopping can be computed by using the same ordering of the sample space described by Tsiatis et al. (32) and by using a process developed by Kim and DeMets (25,40) for the alpha spending function procedures. The method can be briefly
summarized as follows: a 1 − γ lower confidence limit is the smallest value of θ for which an event at least as extreme as the one observed has a probability of at least γ. A similar statement can be made for the upper limit. For example, if the first time the Z-value exceeds the boundary is at tj, with the observed Z*(j) ≥ Zc(j), then the upper θU and lower θL confidence limits are

θU = sup{θ: Pθ{Z(1) ≥ Zc(1), or · · ·, or Z(j − 1) > Zc(j − 1), or Z(j) ≥ Z*(j)} ≤ 1 − γ}

and

θL = sup{θ: Pθ{Z(1) ≥ Zc(1), or · · ·, or Z(j − 1) ≥ Zc(j − 1), or Z(j) ≥ Z*(j)} ≥ 1 − γ}.

Confidence intervals obtained by this process will have coverage closer to 1 − γ than naïve confidence intervals using θ̂ ± Z_{γ/2} SE(θ̂).

As an alternative to computing confidence intervals after early termination, Jennison and Turnbull (41) have advocated the calculation of repeated confidence intervals. This calculation is achieved by inverting a sequential test to obtain the appropriate coefficient Z*_{α/2} in the general form for the confidence interval, θ̂ ± Z*_{α/2} SE(θ̂). This inversion can be achieved when the sequential test is based on an alpha spending function. If we compute the interim analyses at the tj, obtaining corresponding critical values Zc(j), then the repeated confidence intervals are of the form

θ̂j ± Zc(j) SE(θ̂j),

where θ̂j is the estimate for the parameter θ at the j-th analysis.

Methodology has also been developed to obtain adjusted estimates for the intervention effect (42–47). Clinical trials that terminate early are prone to exaggerate the magnitude of the intervention effect. These methods shrink the observed estimate closer to the null. The size of the adjustments may depend on the specific sequential boundary employed. Conservative boundaries such as those proposed by Peto or O'Brien and Fleming generally require less adjustment, and the naïve point estimate and confidence intervals may be quite adequate. Another issue is the relevance of the estimate to clinical practice. The population sample studied is usually not a representative sample of current and future practice. Subjects were those who passed all of the inclusion and exclusion criteria and volunteered to participate. Early subjects may differ from later subjects as experience is gained with the intervention. Thus, the intervention effect estimate may represent populations like the one studied, the only solid inference, but may be less relevant to how the intervention will actually be used. Thus, complex adjustments may not be as useful.
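As a brief illustration, the sketch below forms repeated confidence intervals of the form θ̂j ± Zc(j) SE(θ̂j); the interim estimates, standard errors, and the three O'Brien–Fleming-type boundary values are hypothetical numbers chosen only to show the arithmetic.

def repeated_cis(estimates, std_errors, boundaries):
    # Jennison-Turnbull-style repeated confidence intervals at each interim analysis
    return [(est - zc * se, est + zc * se)
            for est, se, zc in zip(estimates, std_errors, boundaries)]

# hypothetical interim treatment-effect estimates with their standard errors,
# paired with the first three boundary values of a five-look OBF-type design
estimates = [0.30, 0.26, 0.24]
std_errors = [0.20, 0.14, 0.11]
boundaries = [4.56, 3.23, 2.63]
print(repeated_cis(estimates, std_errors, boundaries))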
5 TRIAL DESIGN
If any trial is planning to have interim analyses for monitoring for benefit or harm, then that plan must be taken into account in the design. The reason is that group sequential methods will impact the final critical value, and thus power, depending on which boundary is used. For the alpha spending function approach, the specific alpha spending function must be chosen in advance. In addition, for planning purposes, the anticipated number of interim analyses must be estimated. This number does not have to be adhered to in the application, but it is necessary for the design. Variation from this number in the application will not practically affect the power of the trial. Thus, the design strategy for the alpha spending function is similar to the strategy described by Pocock for the initial group sequential methods (15). The key factor when the sample size is computed is to take into consideration the critical value at the last analysis, when the information fraction is 1.0. One simple approach is to use this new critical value in the standard sample size formula. This estimate will reasonably approximate the more exact approach described below. To illustrate, consider a trial that is comparing failure rates of successive groups of subjects. Here,

H0: pC − pT = 0
HA: pC − pT = δ ≠ 0,
where pC and pT denote the unknown response rates in the control and new-treatment groups, respectively. We would estimate the unknown parameters by p̂C and p̂T, the observed event rates in our trial. For a reasonably large sample size, we often use the following test statistic to compare event rates:

Z = (p̂C − p̂T) / √[p̂(1 − p̂)(1/mC + 1/mT)],

where p̂ is the combined event rate across treatment groups. For sufficiently large n, where n = mC = mT, this statistic has an approximate normal distribution with mean Δ and unit variance, where Δ = 0 under the null hypothesis H0. In this case, assuming equal sample size (n) per group in each arm,

Δ = √n (pC − pT)/√[2p(1 − p)] = √n δ/√[2p(1 − p)],

where p = (pC + pT)/2. It follows that

n = 2Δ^2 p(1 − p)/δ^2.
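The pooled two-sample Z statistic and the drift parameter Δ described above can be written out directly; the event counts in the example call are hypothetical.

import numpy as np

def two_prop_z(xC, mC, xT, mT):
    # pooled-variance comparison of two event rates, as in the text
    pC_hat, pT_hat = xC / mC, xT / mT
    p_hat = (xC + xT) / (mC + mT)
    return (pC_hat - pT_hat) / np.sqrt(p_hat * (1 - p_hat) * (1 / mC + 1 / mT))

def drift(n, pC, pT):
    # Delta = sqrt(n) * (pC - pT) / sqrt(2 * p * (1 - p)), with p the average rate
    p = (pC + pT) / 2
    return np.sqrt(n) * (pC - pT) / np.sqrt(2 * p * (1 - p))

print(two_prop_z(30, 100, 18, 100))   # hypothetical counts: 30/100 vs. 18/100 events
print(drift(20.5, 0.6, 0.4))          # approximately 1.28, as in the example below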
To design our studies, we evaluate the equation above for n, the sample size per treatment per sequential group. Because the plan is to have J groups, each of size 2n, the total sample size 2N equals 2nJ. Now, to obtain the sample size in the context of the alpha spending function, we proceed as follows:

1. For planning purposes, estimate the number of planned interim analyses J at equally spaced increments of information (i.e., 2n subjects). It is also possible to specify unequal increments, but equal spacing is sufficient for design purposes.
2. Obtain the boundary values of the J interim analyses under the null hypothesis H0 to achieve a prespecified overall alpha level, α, for a specific spending function α(t).
3. For the boundary obtained, obtain the value of Δ that achieves a desired power (1 − β).
4. Determine the value of n, which determines the total sample size 2N = 2nJ.
5. Having computed these design parameters, one may conduct the trial with interim analyses based on the information fraction tj, approximated by tj = (number of subjects observed)/2N at the jth analysis (38). The number of actual interim analyses may not be equal to J, but the alpha level and the power will be affected only slightly (26).

As a specific example, consider using an O'Brien–Fleming-type alpha spending function α1(t) with a one-sided 0.025 alpha level and 0.90 power at equally spaced increments t = 0.2, 0.4, 0.6, 0.8, and 1.0. Using previous publications (16) or available computer software, we obtain boundary values 4.56, 3.23, 2.63, 2.28, and 2.04. Using these boundary values and available software, we find that Δ = 1.28 provides the desired power of 0.90. If we specify pC = 0.6 and pT = 0.4 (p = 0.5) under the alternative hypothesis, then we can obtain a sample size as follows. For Δ = 1.28,

n = 2(1.28)^2(0.5)(0.5)/(0.2)^2 = 20.5,

and we have a total sample size of 2(21)(5) = 210 subjects. We can then proceed to conduct interim analyses at information fractions tj equal to the observed number of subjects divided by 210. Similar formulations can be developed for the comparison of means, repeated measures, and survival analysis (48). However, for most applications, the standard sample size formulas with the new alpha spending function final critical value will be a very good approximation.
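To make the design recipe concrete, the sketch below derives one-sided boundary values from the O'Brien–Fleming-type spending function α1(t) by numerically solving Equation (1) look by look, and then reproduces the sample-size arithmetic of the example. This is our own illustration (using SciPy's multivariate normal CDF and a root finder), not the Lan-DeMets software cited above, and the later boundary values are accurate only to roughly two decimals with the default integration tolerance.

import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def obf_spending(t, alpha):
    # alpha_1(t) = 2 - 2*Phi(Z_{alpha/2} / sqrt(t)) for a one-sided level alpha
    return 2.0 - 2.0 * norm.cdf(norm.ppf(1.0 - alpha / 2.0) / np.sqrt(t))

def spending_boundaries(t, alpha, spend=obf_spending):
    # Solve Equation (1) sequentially: cumulative crossing probability = spend(t_j)
    t = np.asarray(t, dtype=float)
    cov = np.sqrt(np.minimum.outer(t, t) / np.maximum.outer(t, t))   # sigma_jk = sqrt(t_j/t_k)
    bounds = []
    for j in range(len(t)):
        target = spend(t[j], alpha)
        if j == 0:
            bounds.append(norm.isf(target))   # first look: ordinary normal tail probability
            continue
        def excess(c):
            upper = np.array(bounds + [c])
            # P(no crossing through look j) = P(Z(1) < c1, ..., Z(j) < cj)
            p_no_cross = multivariate_normal.cdf(upper, mean=np.zeros(j + 1),
                                                 cov=cov[:j + 1, :j + 1])
            return (1.0 - p_no_cross) - target
        bounds.append(brentq(excess, 1.0, 10.0))
    return np.array(bounds)

t = [0.2, 0.4, 0.6, 0.8, 1.0]
print(np.round(spending_boundaries(t, alpha=0.025), 2))   # close to the alpha_1 column of Table 1

# sample-size arithmetic from the worked example (Delta = 1.28, pC = 0.6, pT = 0.4)
Delta, pC, pT = 1.28, 0.6, 0.4
p, delta = (pC + pT) / 2, pC - pT
n = 2 * Delta ** 2 * p * (1 - p) / delta ** 2
print(n, 2 * int(np.ceil(n)) * len(t))                    # 20.48 per arm per stage, 210 in total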
6 CONCLUSIONS
The alpha spending function approach for group sequential interim analysis has provided the necessary flexibility that allows
data monitoring committees to fulfill their task. DMCs can adjust their analysis as data accumulates and trends emerge. As long as the alpha spending function is specified in advance, there is little room for abuse. Many trials sponsored by industry and government have successfully used this approach. Although the decision to terminate any trial early, for benefit or harm, is a very complex decision process, the alpha spending function can be an important factor in that process.

REFERENCES

1. L. Friedman, C. Furberg, and D. DeMets, Fundamentals of Clinical Trials. Littleton, MA: John Wright – PSG Inc., 1981.
2. S. J. Pocock, Clinical Trials: A Practical Approach. New York: Wiley, 1983.
3. Heart Special Project Committee. Organization, review and administration of cooperative studies (Greenberg Report): a report from the Heart Special Project Committee to the National Advisory Council, May 1967. Control. Clin. Trials 1988; 9:137–48.
4. P. L. Canner, Monitoring treatment differences in long-term clinical trials. Biometrics 1977; 33:603–615.
5. Coronary Drug Project Research Group. Practical aspects of decision making in clinical trials: The Coronary Drug Project as a case study. Control. Clin. Trials 1982; 9:137–148.
6. D. L. DeMets, L. Friedman, and C. D. Furberg, Data Monitoring in Clinical Trials: A Case Studies Approach. New York: Springer Science + Business Media, 2005.
7. S. Ellenberg, T. Fleming, and D. DeMets, Data Monitoring Committees in Clinical Trials: A Practical Perspective. West Sussex, UK: John Wiley & Sons, Ltd., 2002.
8. ICH Expert Working Group: International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline. Statistical principles for clinical trials. Stats. Med. 1999; 18:1905–1942.
9. U.S. Department of Health and Human Services. Food and Drug Administration. Docket No. 01D–0489. Guidance for Clinical Trial Sponsors on the Establishment and Operations of Clinical Trial Data Monitoring Committees. Federal Register 66:58151–58153, 2001. Available: http://www.fda.gov/OHRMS/DOCKETS/98fr/112001b.pdf.
10. F. J. Anscombe, Sequential medical trials. Journal of the American Statistical Association 1963; 58:365–383.
11. I. Bross, Sequential medical plans. Biometrics 1952; 8:188–205.
12. P. Armitage, Sequential Medical Trials, 2nd ed. New York: John Wiley and Sons, 1975.
13. P. Armitage, C. K. McPherson, and B. C. Rowe, Repeated significance tests on accumulating data. J. Royal Stat. Soc. Series A 1969; 132:235–244.
14. R. Peto, M. C. Pike, P. Armitage, et al., Design and analysis of randomized clinical trials requiring prolonged observations of each patient. 1. Introduction and design. Br. J. Cancer 1976; 34:585–612.
15. S. J. Pocock, Group sequential methods in the design and analysis of clinical trials. Biometrika 1977; 64:191–199.
16. P. C. O'Brien and T. R. Fleming, A multiple testing procedure for clinical trials. Biometrics 1979; 35:549–556.
17. Beta-Blocker Heart Attack Trial Research Group. A randomized trial of propranolol in patients with acute myocardial infarction. I. Mortality results. J. Amer. Med. Assoc. 1982; 247:1707–1714.
18. K. K. G. Lan and D. L. DeMets, Discrete sequential boundaries for clinical trials. Biometrika 1983; 70:659–663.
19. K. Kim and D. L. DeMets, Design and analysis of group sequential tests based on the type I error spending rate function. Biometrika 1987; 74:149–154.
20. I. K. Hwang, W. J. Shih, and J. S. DeCani, Group sequential designs using a family of type I error probability spending function. Stats. Med. 1990; 9:1439–45.
21. K. K. G. Lan and D. L. DeMets, Group sequential procedures: Calendar versus information time. Stats. Med. 1989; 8:1191–1198.
22. D. M. Reboussin, D. L. DeMets, K. M. Kim, and K. K. G. Lan, Computations for group sequential boundaries using the Lan–DeMets spending function method. Control. Clin. Trials 2000; 21:190–207.
23. M. A. Proschan, D. A. Follman, and M. A. Waclawiw, Effects of assumption violations on type I error rate in group sequential monitoring. Biometrics 1992; 48:1131–1143.
24. K. K. G. Lan and D. L. DeMets, Changing frequency of interim analyses in sequential monitoring. Biometrics 1989; 45(3):1017–1020.
25. K. Kim and D. L. DeMets, Confidence intervals following group sequential tests in clinical trials. Biometrics 1987; 4:857–864.
26. K. Kim and D. L. DeMets, Sample size determination for group sequential clinical trials with immediate response. Stats. Med. 1992; 11:1391–1399.
27. Z. Li and N. L. Geller, On the choice of times for data analysis in group sequential trials. Biometrics 1991; 47:745–750.
28. Cardiac Arrhythmia Suppression Trial (CAST) Investigators. Preliminary report: Effect of encainide and flecainide on mortality in a randomized trial of arrhythmia suppression after myocardial infarction. N. Engl. J. Med. 1989; 321:406–412.
29. MERIT-HF Study Group. Effect of metoprolol CR/XL in chronic heart failure: Metoprolol CR/XL randomised intervention trial in congestive heart failure. Lancet 1999; 353:2001–2007.
30. M. Packer, A. J. S. Coats, M. B. Fowler, H. A. Katus, H. Krum, P. Mohacsi, J. L. Rouleau, M. Tendera, A. Castaigne, C. Staiger, et al., for the Carvedilol Prospective Randomized Cumulative Survival (COPERNICUS) Study Group. Effect of carvedilol on survival in severe chronic heart failure. New Engl. J. Med. 2001; 334:1651–1658.
31. A. A. Tsiatis, Repeated significance testing for a general class of statistics used in censored survival analysis. J. Am. Stat. Assoc. 1982; 77:855–861.
32. A. A. Tsiatis, G. L. Rosner, and C. R. Mehta, Exact confidence intervals following a group sequential test. Biometrics 1984; 40:797–803.
33. K. Kim and A. A. Tsiatis, Study duration for clinical trials with survival response and early stopping rule. Biometrics 1990; 46:81–92.
34. K. K. G. Lan and J. Lachin, Implementation of group sequential logrank tests in a maximum duration trial. Biometrics 1990; 46:759–770.
35. J. W. Lee and D. L. DeMets, Sequential comparison of changes with repeated measurement data. J. Am. Stat. Assoc. 1991; 86:757–762.
36. M. C. Wu and K. K. G. Lan, Sequential monitoring for comparison of changes in a response variable in clinical trials. Biometrics 1992; 48:765–779.
37. S. J. Gange and D. L. DeMets, Sequential monitoring of clinical trials with correlated categorical responses. Biometrika 1996; 83:157–167.
38. K. K. G. Lan, D. M. Reboussin, and D. L. DeMets, Information and information fractions for design and sequential monitoring of clinical trials. Communicat. Stat.–Theory Methods 1994; 23:403–420.
39. D. O. Scharfstein, A. A. Tsiatis, and J. M. Robins, Semiparametric efficiency and its implication on the design and analysis of group-sequential studies. J. Am. Stat. Assoc. 1997; 92:1342–1350.
40. K. Kim, Point estimation following group sequential tests. Biometrics 1989; 45:613–617.
41. C. Jennison and B. W. Turnbull, Interim analyses: The repeated confidence interval approach. J. Royal Stat. Soc., Series B 1989; 51:305–361.
42. S. S. Emerson and T. R. Fleming, Parameter estimation following group sequential hypothesis testing. Biometrika 1990; 77:875–892.
43. M. D. Hughes and S. J. Pocock, Stopping rules and estimation problems in clinical trials. Stats. Med. 1981; 7:1231–1241.
44. Z. Li and D. L. DeMets, On the bias of estimation of a Brownian motion drift following group sequential tests. Stat. Sinica 1999; 9:923–937.
45. J. C. Pinheiro and D. L. DeMets, Estimating and reducing bias in group sequential designs with Gaussian independent structure. Biometrika 1997; 84:831–843.
46. D. Siegmund, Estimation following sequential tests. Biometrika 1978; 65:341–349.
47. J. Whitehead, On the bias of maximum likelihood estimation following a sequential test. Biometrika 1986; 73:573–581.
48. D. L. DeMets and K. K. G. Lan, The alpha spending function approach to interim data analyses. In: P. Thall (ed.), Recent Advances in Clinical Trial Design and Analysis. Dordrecht, The Netherlands: Kluver Academic Publishers, 1995, pp. 1–27.
FURTHER READING

M. N. Chang and P. C. O'Brien, Confidence intervals following group sequential tests. Control. Clin. Trials 1986; 7:18–26.
T. Cook and D. L. DeMets, Statistical Methods in Clinical Trials. Boca Raton, FL: CRC Press/Taylor & Francis Co., 2007.
D. L. DeMets, Data monitoring and sequential analysis – An academic perspective. J. Acq. Immune Def. Syn. 1990; 3(Suppl 2):S124–S133.
D. L. DeMets, R. Hardy, L. M. Friedman, and K. K. G. Lan, Statistical aspects of early termination in the Beta-Blocker Heart Attack Trial. Control. Clin. Trials 1984; 5:362–372.
D. L. DeMets and K. K. G. Lan, Interim analysis: the alpha spending function approach. Stats. Med. 1994; 13:1341–1352.
D. L. DeMets, Stopping guidelines vs. stopping rules: A practitioner's point of view. Communicat. Stats.–Theory Methods 1984; 13:2395–2417.
T. R. Fleming and D. L. DeMets, Monitoring of clinical trials: issues and recommendations. Control. Clin. Trials 1993; 14:183–197.
J. L. Haybittle, Repeated assessment of results in clinical trials of cancer treatment. Brit. J. Radiol. 1971; 44:793–797.
K. K. G. Lan, W. F. Rosenberger, and J. M. Lachin, Sequential monitoring of survival data with the Wilcoxon statistic. Biometrics 1995; 51:1175–1183.
J. W. Lee, Group sequential testing in clinical trials with multivariate observations: a review. Stats. Med. 1994; 13:101–111.
C. L. Meinert, Clinical Trials: Design, Conduct, and Analysis. New York: Oxford University Press, 1986.
S. Piantadosi, Clinical Trials: A Methodologic Perspective, 2nd ed. Hoboken, NJ: John Wiley and Sons, Inc., 2005.
S. J. Pocock, Statistical and ethical issues in monitoring clinical trials. Stats. Med. 1993; 12:1459–1469.
S. J. Pocock, When to stop a clinical trial. Br. Med. J. 1992; 305:235–240.
E. Slud and L. J. Wei, Two-sample repeated significance tests based on the modified Wilcoxon statistic. J. Am. Stat. Assoc. 1982; 77:862–868.
A. Wald, Sequential Analysis. New York: John Wiley and Sons, 1947.
LARGE, SIMPLE TRIALS
MARY A. FOULKES
Center for Biologics Evaluation and Research, U.S. Food and Drug Administration, Rockville, Maryland

1 LARGE, SIMPLE TRIALS
Since the launch of the first randomized clinical trials in the middle of the twentieth century, very large randomized trials have had immediate, major, and ubiquitous impacts on clinical practice and public health policy. For example, within hours of the announcement of the results of the first oral polio vaccine field trial, the clarity of those results led to the regulatory approval of the vaccine and then to the rapid immunization of millions. That field trial was large (almost half a million children), simple (capture of incident cases of paralytic polio through public health records), randomized, and double blind (1,2). The conclusions of that trial were unambiguous, leading to a seismic revolution in the fight against polio. This article describes the characteristics that differentiate large, simple trials, sometimes called "megatrials" (3), from other types of trial: addressing important clinical questions that can have substantive impact, sizing the trial appropriately to the question, relying on the bias control of randomization, and minimizing the data collection and other trial processes.

2 SMALL BUT CLINICALLY IMPORTANT OBJECTIVE

The clinical effect size of interest dictates the number of subjects needed (or events observed) in any randomized, controlled trial. To identify a relatively small or moderate but clinically meaningful effect, the number of subjects needed (or number of events observed) in the trial may be very large (4–6). There are numerous instances where clinical effects that are limited in size can nevertheless be used to substantial benefit from a population perspective. For example, in cardiovascular disease, the demonstration of a small but meaningful
impact of daily aspirin reduces the risk of myocardial infarction. When carefully evaluated with precise estimates of effect and applied judiciously for maximum impact, relatively inexpensive interventions can produce disproportionate health benefits. Rare effects, especially those that are differential by treatment group, are difficult to detect or to distinguish from background rates in conventional studies. The demonstrated efficacy of a new drug or other medical product is established at a time when the limited clinical experience leaves the safety profile not yet fully defined. A large, simple trial with sufficient power to detect orders of magnitude differences in rare event rates can be very important (7). Serious adverse events that are not of concern at the start of the study may be explored using large health-care databases. Mitchell and Lesko (8) suggested that such was the case for the use of pediatric ibuprofen as an antipyretic. If such a difference is estimated reliably, health policy and practices can be developed to manage the risk while preserving the benefit. The need for larger trials to detect adverse event rates as low as 1 in 1000 or less has been emphasized for the full evaluation of new vaccines (9–11), highlighting the limitations of information derived from smaller trials with the ability to detect rates of 1 in 100 or higher. Similarly, large, simple trials have been used to address primary prevention questions (12). Large, simple trials have posed practical treatment questions relevant to a wide range of patients. If personalized effects applicable only to a single individual, characterized by gene expression analysis, are considered as one end of a spectrum, then large, simple trial issues would be at the opposite end of that spectrum. Cardiovascular disease research has been an active area for the application of large, simple clinical trials (13–15).

2.1 Minimal Detectable Difference

The difference to be estimated between the treatment groups needs to be clinically important when balanced against potential side effects, and practical enough to alter clinical practice. These differences could have major
public health impact when the interventions might ultimately be available to millions of patients. For example, the VITATOPS trial (16) aims to investigate whether a 15% relative reduction in the rate of stroke, myocardial infarction, or death from any vascular cause in patients with recent transient ischemic attack or stroke can result from a multivitamin intervention. The VITATOPS trial is expected to enroll approximately 8,000 patients, with a trial completion in 2010. It is important for the progress of medicine and public health that small to moderate (but real) benefits be reliably distinguished from no benefit, and this distinction requires large trials.

2.2 Sample Size

Based on the minimal detectable difference, the sample size should be sufficient to reliably estimate the difference between treatment groups. When these differences are assumed to be relatively small, the resulting sample size estimates are large, in the tens of thousands of patients. The large sample size is necessary "not only to be correct but also to be convincing," argues Peto (17). The sample size estimation process should focus on minimizing the risk of a false-negative result, a type II error. A trial of sufficient size will also limit the width of the confidence interval for the effect estimate, reducing uncertainty in applying the trial results.

2.3 Cost and Time

Annual health-care expenditures, both direct and indirect, can be substantially impacted by small but important health-care advances that influence large numbers of people, as with cardiovascular disease, diabetes, or many other diseases. Effective preventive measures can also have major health-care cost benefits. The resource costs of a large trial, including time spent on protocol-mandated follow-up procedures and data collection, can be minimized. If trial designs are stripped to the essentials, once the intervention is delivered, the time and cost of the trial are a function of the primary outcome measure. For a large, simple trial to be something other than a conventional trial
multiplied by orders of magnitude in sample size, cost, time, and effort, every action and every requirement of the trial should be questioned and only maintained as part of the trial if the contribution to the end result is substantial. This implies judicious inclusion, if at all, of extraneous substudies, ancillary studies, and other nonessential additions. Another costly aspect of trials that could be avoided in large, simple trials is the overhead associated with reliance on tertiary care institutions or clinical research institutions as the clinical sites for a trial. These may provide ideal sites for early clinical studies of complex interventions, or for studies requiring extensive clinical work-up to establish eligibility, but the usual health-care settings are the sites for large, simple trials. These usual health-care settings are also the ultimate site of application of large, simple trial results. 3 ELIGIBILITY Simple eligibility will allow easy, rapid identification and enrollment of eligible patients. The eligibility screen should be easily interpretable not only to those enrolling patients into the trial, but ultimately to those who may use the results if applicable. To be most widely interpretable, such eligibility requirements should not rely on the highest end ‘‘high-tech’’ diagnostics, which might not be widely available or may be completely unavailable outside the trial. 3.1 Simple, Broad In contrast to many clinical trials with very specific, thoroughly documented eligibility criteria, a large, simple trial engulfs a broad swath of patients with the general indication (18). Patients may enter from many different sites, from many different treating clinicians and practices, and from many different health-care systems, so the eligibility criteria need to be clear, interpretable, and consistent. Minimal testing should be required to determine eligibility. The exclusions should be a function of contraindications to any of the interventions. Patients thought to be at high risk of adverse events or with an independent life-threatening disease are generally
excluded. In the Fourth International Study of Infarct Survival (ISIS-4) trial, for example, all patients presenting within 24 hours of a suspected acute myocardial infarction (with no clear indications for or contraindications to any one of the trial treatments) were eligible (19).

3.2 Multicenter and Multinational

To enroll large numbers of patients rapidly, these trials are necessarily multicenter trials, and they are often conducted in many countries simultaneously. This adds to the generalizability of the results, but it also requires that definitions, eligibility screening, intervention delivery, and data capture be coordinated and harmonized multinationally. The uncertainty principle (20), which requires substantial uncertainty about which intervention is more beneficial, can easily be satisfied to different degrees across countries, across health-care systems, and over time. Although internal validity would not be affected, such variation across countries and over time may affect enrollment and the ultimate generalizability of the results. One example of accommodating such variation across countries was the Options in Management with Antiretrovirals (OPTIMA) trial (21), a factorial design in which some countries enrolled to both factors and some to only one factor.
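Before turning to randomized assignment, the sample-size reasoning of section 2.2 can be made concrete with a short sketch. The code below uses the standard normal-approximation formula for comparing two proportions; the event rate, the 15% relative reduction (the effect size of the VITATOPS example), and the error rates are illustrative assumptions, not figures taken from any trial cited here.

```python
from math import sqrt
from statistics import NormalDist

def two_arm_sample_size(p_control, relative_reduction, alpha=0.05, power=0.90):
    """Approximate per-arm sample size for comparing two proportions
    (normal approximation, two-sided test). All inputs are illustrative."""
    p_treat = p_control * (1 - relative_reduction)
    p_bar = (p_control + p_treat) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # ~1.28 for 90% power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_control * (1 - p_control)
                                 + p_treat * (1 - p_treat))) ** 2
    return numerator / (p_control - p_treat) ** 2

# Assumed 10% control-group event rate and a 15% relative reduction:
n_per_arm = two_arm_sample_size(0.10, 0.15)
print(f"about {round(n_per_arm):,} patients per arm")  # several thousand per arm
```

Even with these modest assumptions the answer runs to thousands of patients per arm, which is why small but real effects can only be demonstrated reliably by very large trials.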
4 RANDOMIZED ASSIGNMENT
The strength of a large, simple trial over any similar-sized comparison of differently treated groups of individuals is the a priori assignment to treatment group by chance. Randomization ensures that the treatment groups are similar in both measured and unmeasured factors, so that the treatment can be assumed to be the primary difference between the groups. Randomization addresses selection bias in the assignment of patients to groups. Its strength has been widely articulated (22) and underpins inferences from large-scale randomized evidence, and from large, simple trials in particular. Randomization also provides a method of controlling confounding, by both known and unknown confounders. With or without stratification, randomization
is relied on to control confounding where the simple character of the trial would be compromised by other methods such as eligibility restriction.

4.1 Simple Process

The actual randomization process should be as simple as possible. This process has benefited substantially from global telecommunications, networks, and interactive websites. It is possible for a single, central organization to manage and execute the randomization process. The randomization process is often the only point of capture for baseline information, so only the most important and clearly defined baseline factors (e.g., gender) should be recorded at this point.

4.2 Blinding

Many clinical trials are blinded, but the incorporation of blinding into the conduct of a large, simple trial must be carefully weighed. To effect blinding, the interventions must be sufficiently similar that the large, simple nature of the trial is not compromised. If a blinded trial might require an extra intravenous line, as might have been the case had the Global Utilization of Streptokinase and t-PA for Occluded Arteries (GUSTO) trial (23) been blinded, the large size and the simple conduct might have been impossible. The Francis Field Trial (1), on the other hand, was blinded and involved delivery of identical placebo or polio vaccine; the large, simple nature of that trial was not compromised by blinding. Blinding of the outcome evaluation or adjudication, however, does not depend on blinding the patients to their randomized treatment groups. The integrity of trial results can be strengthened by evaluation by blinded reviewers. When the primary outcome is all-cause mortality, blinded reviewers and adjudication are unnecessary.

4.3 Intervention Administered/Compliance

The typical interventions chosen for evaluation in a large, simple trial are practical to deliver, such as the oral polio vaccine, and relatively short-term, such as a few hours of intravenous thrombolytic agent infusion.
Longer-term or self-administered interventions raise issues of compliance. As with all trials, achieving and maintaining high compliance is important; the lower the compliance, the more the observed effects of the interventions converge and the harder any real difference is to detect. The usual methods of maintaining and improving compliance (minimizing the length and complexity of the intervention, selecting patients expected to comply, maintaining frequent follow-up contact, and designing simple measures of compliance) should be applied to the extent possible while maintaining the large, simple character of the trial (24).

4.4 Patient Management

All other aspects of patient management, other than the randomized comparison, should mimic local clinical practice and should be at the discretion of the treating clinician. Although this may introduce wide variation in the ancillary treatments, the results over a very large group of patients and clinicians will reflect the heterogeneity in practice. There should be no additional record keeping, laboratory tests, or other investigations and no extra follow-up visits.
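Before turning to outcome measures, the simple central randomization process described in section 4.1 can be sketched in a few lines. The block size, treatment labels, and seed below are arbitrary choices for illustration, not part of any trial cited in this article.

```python
import random

def permuted_block_assignments(n_patients, block_size=4, arms=("A", "B"), seed=2007):
    """Generate treatment assignments in randomly permuted blocks so that the
    arms stay balanced after every completed block (illustrative sketch)."""
    assert block_size % len(arms) == 0, "block size must be a multiple of the number of arms"
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_patients:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)          # permute the treatments within the block
        assignments.extend(block)
    return assignments[:n_patients]

# A central coordinating office could pre-generate the list and release one
# assignment per enrolled patient, recording only a few baseline factors.
print(permuted_block_assignments(10))
```

The point of the sketch is that the machinery itself is trivial; what keeps the trial simple is refusing to attach extra data capture to this single point of contact.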
5 OUTCOME MEASURES
The primary trial endpoint or outcome measure in a large, simple trial should be directly relevant to the subject and easily and objectively measured with minimal resources. The outcome measure for a given trial substantially affects how large and simple the trial can feasibly be. It also affects the reliability of the effect estimates and the relevance of the trial results.

5.1 Objectively Determined

Survival, or mortality, is often the primary outcome measure in large, simple trials. It has many advantages: it is an easily ascertained, objectively determined, binary, absorbing state. With the availability of health-care databases and other national databases, information on survival status, including date, location, and cause of death, can often be captured without active follow-up efforts. Outcomes other than mortality can also be objectively determined.
5.2 Minimal Missing

The amount and detail of the data collected, particularly after baseline, should be limited. The clear necessity of each data item will help to ensure that the participating patients, clinicians, and others consistently provide the limited data requested, thus minimizing the amount of essential data that is missing.

5.3 Minimal Follow-up

Depending on the primary endpoint of the trial, frequent, lengthy, labor-intensive, and high-tech direct medical assessments are not necessary in clinical trials. If the relevant primary endpoint can be adequately captured by indirect contact, clinical follow-up efforts can be minimized. Mail and telephone follow-up contacts have been employed, and e-mail could similarly be used. Lubsen and Tijssen (25) have argued that simple protocols with minimal follow-up contact are appropriate only when the mechanism of action and the side-effect profile are well understood. In general, minimal follow-up contact does not provide information on intermediate outcomes and does not contribute to a greater understanding of the mechanism of action.

5.4 Limited and Automated Data Capture

The availability of health-care databases and other sources of automated data further simplifies the capture of outcome data. The use of automated means to transmit data as effortlessly as possible also contributes to the completeness of the resulting data. The Francis Field Trial (1) used the existing system of public health surveillance and the required reporting of incident cases. National health systems can be used to provide hospitalization or mortality information. In the United Kingdom, this is facilitated when trial participants are "flagged" in the National Health Service (NHS) central register. Horsburgh (26) has described the infrastructure requirements for a public health surveillance system to capture a common disease outcome (e.g., tuberculosis). The practice of trying to answer as many questions as possible within a single trial is the antithesis of the simplicity required for
a large, simple trial. A trial that attempts to address quality of life, economic issues, and other questions simultaneously is not feasible on a large-sample scale. The capture of data items extraneous to the primary central question must be avoided (27). The design and conduct of a large, simple trial must be diligently and vigilantly preserved; otherwise, the trial will become too complex to be either large or simple. If the simplicity of the follow-up and data capture procedures is maintained, without a routine comprehensive assessment of adverse events, and randomization is relied on to control biases, then differences between treatment groups in the rates of unexpected and rare adverse events can be evaluated by using large health-care databases, such as those used in pharmacoepidemiology. These health-care databases could include the U.S. Centers for Medicare and Medicaid Services (CMS) database, the U.K. and Italian General Practice Research Databases, the German MediPlus database, various health maintenance organization databases, and many more. Serious adverse events, hospitalizations, and deaths can be captured from these sources rather than by sacrificing the simplicity of the large, simple trial.

5.5 Subgroups

The expectation in a large, simple trial is that the primary effect may vary by subgroup but is not likely to be reversed (28). The direction and consistency of the effect across subgroups is one rationale for avoiding the collection of data to identify subgroups. Gray et al. (28) concluded that "most studies could be much larger, and of so much greater scientific value, if they had wider entry criteria and collected ten times less data." Trial designers have warned for many years that "marked heterogeneity of the treatment comparison in different strata can arise by chance more easily than would intuitively be expected" (29). The pitfalls of subgroup analyses are widely known (30); the ISIS-2 trial report provides a classic example, an analysis of effects across astrological subgroups (31). Qualitative rather than quantitative differences, though unexpected and uncommon, should be of concern.
Large, simple trials can include planned, small substudies, for example, at only certain sites, to address more specific hypotheses that require more complex follow-up information. These substudies would have no impact on the overall simplicity of the trial for the majority of patients.

5.6 Impact on Practice

An additional outcome measure of a large, simple trial is its timely and pervasive impact on clinical practice. When a clear and convincing result on an important question shows a real effect that benefits the patient, that result should be widely reported and disseminated. The external influences on clinical decisions (e.g., financial) often respond rapidly to these results, accelerating their uptake. The Gruppo Italiano per lo Studio della Streptochinasi nell'Infarto Miocardico 1 (GISSI-1) trial result, for example, became common (near universal) practice in Italian coronary care units within 1 year of reporting (32). Contrast this rapid pace of uptake with the control of hypertension: despite numerous trials showing the positive effects of blood pressure control, according to a recent survey of the U.S. population, control of hypertension is stagnant at 34% (33). Because most individuals with elevated blood pressure are not being adequately treated, the demonstration of a treatment effect within randomized controlled trials is only one piece of a larger, more complex set of conditions needed to change practice. Even trials that have already begun to enroll can benefit from further simplification, as seen in a trial of trastuzumab (Herceptin) (34). The simplification in this case involved broadening the eligibility criteria to reflect the actual target patient population more closely, streamlining study procedures (reducing the frequency and duration of procedures), and reconsidering and eliminating the placebo (and with it any need for blinding). As a result of these simplifications, the enrollment rate accelerated, the risks of infections and other complications of multiple infusions dropped, and the resources needed to enroll and follow each patient were cut. These aspects of trial design and conduct can be simplified in many trials, making a large trial feasible.
6 CONCLUSIONS
By design, bias and confounding are minimized in randomized trials, and substantial power to detect moderate-sized effects is afforded by large sample sizes. These attributes lead to relevant, usable information that is generalizable to a broad population. Large, simple trials are not appropriate to every clinical question, but they can certainly contribute to clinical practice and public health policy. Practice and policy decisions to be addressed in the future will include decision-making based on knowledge of small but important clinical effects. Large, simple trials have a place in the full range of clinical study designs. REFERENCES 1. T. Francis Jr., Evaluation of the 1954 Field Trial of Poliomyelitis Vaccine: Final Report. 1957. Ann Arbor, MI: Edwards Brothers/National Foundation for Infantile Paralysis; School of Public Health, University of Michigan, 1957. 2. T. Francis, Jr., An evaluation of the 1954 Poliomyelitis Vaccine Trials—summary report. Am J Public Health. 1955; 45: 1–63. 3. D. Heng, Megatrials of drug treatment: strengths and limitations. Ann Acad Med Singap. 2000; 29: 606–609. 4. S. Yusuf, R. Collins, and R. Peto. Why do we need some large, simple randomized trials? Stat Med. 1984; 3: 409–420. 5. C. Baigent, The need for large-scale randomized evidence. Br J Clin Pharmacol. 1997; 43: 349–353. 6. R. M. Califf and D. L. DeMets, Principles from clinical trials relevant to clinical practice: part I. Circulation. 2002; 106: 1015–1021. 7. R. Temple, Meta-analysis and epidemiologic studies in drug development and postmarketing surveillance. JAMA. 1999; 281: 841–844. 8. A. A. Mitchell and S. M. Lesko, When a randomized controlled trial is needed to assess drug safety. The case of paediatric ibuprofen. Drug Saf. 1995; 13: 15–24. 9. B. L. Strom, How the US drug safety system should be changed. JAMA. 2006; 295: 2072–2075. 10. S. S. Ellenberg, M. A. Foulkes, K. Midthun, and K. L. Goldenthal, Evaluating the safety of
new vaccines: summary of a workshop. Am J Public Health. 2005; 95: 800–807. 11. J. Clemens, R. Brenner, M. Rao, N. Tafari, and C. Lowe, Evaluation of new vaccines for developing countries. Efficacy or effectiveness? JAMA. 1996; 275: 390–397. 12. J. E. Buring and C. H. Hennekens, The contributions of large, simple trials to prevention research. Prev Med. 1994; 23: 595–598. 13. C. D. Furberg, S. Yusuf, and T. J. Thom, Potential for altering the natural-history of congestive heart-failure—need for large clinical trials. Am J Cardiol. 1985; 55: A45–A47. 14. S. Yusuf and R. Garg, Design, results, and interpretation of randomized, controlled trials in congestive-heart-failure and leftventricular dysfunction. Circulation. 1993; 87: 115–121. 15. M. D. Flather, M. E. Farkouh, and S. Yusuf. Large simple trials in cardiovascular disease: their impact on medical practice. In: R. M. Califf, B. Mark, and G. S. Wagner (eds.), Acute Coronary Care. St. Louis, MO: Mosby, 1995, pp. 131–144. 16. VITATOPS Trial Study Group. The VITATOPS (Vitamins to Prevent Stroke) Trial: rationale and design of an international, large, simple, randomized trial of homocysteine-lowering multivitamin therapy in patients with recent transient ischaemic attack or stroke. Cerebrovasc Dis. 2002; 13: 120–126. 17. R. Peto, Clinical trial methodology. Biomedicine. 1978; 28(Special): 24–36. 18. S. Yusuf, P. Held, K. K. Teo, et al. Selection of patients for randomized controlled trials: implications of wide or narrow eligibility criteria. Stat Med. 1990; 9: 73–86. 19. ISIS-4 (Fourth International Study of Infarct Survival) Collaborative Group. A randomised factorial trial assessing early oral captopril, oral mononitrate, and intravenous magnesium sulphate in 58,050 patients with suspected acute myocardial infarction. Lancet. 1995; 345: 669–685. 20. R. Peto and C. Baigent, Trials: the next 50 years. Large scale randomized evidence of moderate benefits. BMJ. 1998; 317: 1170–1171. 21. T. C. Kyriakides, A. Babiker, and J. Singer, et al. An open-label randomized clinical trial of novel therapeutic strategies for HIV-infected patients in whom antiretroviral therapy has failed: rationale and design of OPTIMA trial. Control Clin Trials. 2003; 24: 481–500.
LARGE, SIMPLE TRIALS 22. R. Peto, R. Collins, and R. Gray, Largescale randomized evidence—large, simple trials and overviews of trials. J Clin Epidemiol. 1995; 48: 23–40. 23. Global Utilization of Streptokinase and t-PA for Occulted Arteries (GUSTO) Investigators. An international randomized trial comparing four thrombolytic strategies for acute myocardial infarction. N Engl J Med. 1993; 329: 673–682. 24. C. Hennekens and J. E. Buring, Need for large sample sizes in randomized trials. Pediatrics. 1987; 79: 569–571. 25. J. Lubsen and J. G. P. Tijssen, Large trials with simple protocols: indications and contraindications. Control Clin Trials. 1989; 10: 151S–160S. 26. C. R. Horsburgh, A large, simple trial of a tuberculosis vaccine. Clin Infect Dis. 2000; 30 (Suppl 3): S213–216. 27. S. Yusuf, Randomised controlled trials in cardiovascular medicine: past achievements, future challenges. BMJ. 1999; 319: 564–568. 28. R. Gray, M. Clarke, R. Collins, and R. Peto, Making randomized trials larger: a simple solution? Eur J Surg Oncol. 1995; 21: 137–139. 29. R. Peto, M. C. Pike, P. Armitage, N. E. Breslow, D. R. Cox, et al. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. II. Analysis and examples. Br J Cancer. 1977; 35: 1–39. 30. S. Yusuf, J. Wittes, J. Probstfield, and H. A. Tyroler, Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. JAMA. 1991; 266: 93–98. 31. ISIS-2 (Second International Study of Infarct Survival) Collaborative Group. Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17,187 cases of suspected acute myocardial infarction: ISIS-2. Lancet. 1988; 2: 349–360. 32. G. Tognoni, M. G. Franzosi, S. Garattini, and A. Maggioni, The case of GISSI in changing the attitudes and practice of Italian cardiologists. Stat Med. 1990; 9: 17–27. 33. A. V. Chobanian G. L. Bakris, H. R. Black, W. C. Cushman, L. A. Green, et al., for the National Heart, Lung, and Blood Institute Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure; National High Blood Pressure Education Program Coordinating Committee. The seventh report of the Joint National Committee on Prevention, Detection, Evaluation,
and Treatment of High Blood Pressure: the JNC 7 report. JAMA. 2003; 289: 2560–2572. 34. T. R. Fleming, Issues in the design of clinical trials: insights from the trastuzumab (Herceptin) experience. Semin Oncol. 1999; 26: 102–107.
FURTHER READING Selected Examples of Large, Simple Trials ALLHAT Collaborative Research Group. Major cardiovascular events in hypertensive patients randomized to doxazosin vs chlorthalidone: the antihypertensive and lipid-lowering treatment to prevent heart attack trial (ALLHAT). JAMA. 2000; 283: 1967–1975. M. L. Barreto, L. C. Rodrizues, S. S. Cunha, S. Pereira, et al. Design of the Brazilian BCGREVAC trial against tuberculosis: a large, simple randomized community trial to evaluate the impact on tuberculosis of BCG revaccination at school age. Control Clin Trials. 2002; 23: 540–553. S. Black, H. Shinefield, M. A. Fireman, et al. Efficacy, safety and immunogenicity of the heptavalent pneumococcal conjugate vaccine in children. Pediatr Infect Dis J. 2000; 19: 187–195. Chinese Acute Stroke Trial (CAST) Collaborative Group. CAST: a randomised placebo-controlled trial of early aspirin use in 20,000 patients with acute ischaemic stroke. Lancet. 1997: 349: 1641–1649. R. Collins, R. Peto, and J. Armitage. The MRC/BHF Heart Protection Study: preliminary results. Int J Clin Pract. 2002; 56: 53–56. CRASH Trial Pilot Study Collaborative Group. The MRC CRASH Trial: study design, baseline data, and outcome in 1000 randomised patients in the pilot phase. Emerg Med J. 2002; 19: 510–514. Digitalis Investigation Group (DIG). The effect of digoxin on mortality and morbidity in patients with heart failure. N Engl J Med. 1997; 336: 525–533. Gruppo Italiano per lo Studio della Streptochinasi nell’Infarto miocaridico (GISSI). Effectiveness of intravenous thrombolytic treatment in acute myocardial infarction. Lancet. 1986; 1: 397–402. P. Edwards, M. Aragno, L. Balica, and R. Cottingham, Final results of the MRC CRASH, a randomized placebo-controlled trial of intravenous corticosteroid in adults with head injury—outcomes at 6 months. Lancet. 2005; 365: 1957–1959.
Gruppo Italiano per lo Studio della Streptochinasi nell’Infarto Miocaridico (GISSI). GISSI-2: a factorial randomized trial of alteplase versus streptokinase and heparin versus no heparin among 12,490 patients with acute myocardial infarction. Lancet. 1990; 336: 65–71. HOPE Study Investigators. The HOPE (Heart Outcomes Prevention Evaluation) Study: the design of a large, simple randomized trial of an angiotensin-converting enzyme inhibitor (ramipril) and vitamin E in patients at high risk of cardiovascular events. Cardiovasc Med. 1996; 12: 127–137. International Stroke Trial Collaborative Group. International Stroke Trial (IST): a randomised trial of aspirin, subcutaneous heparin, both or neither among 19,435 patients with acute ischaemic stroke. Lancet. 1997; 349: 1569–1581. ISIS-1 (First International Study of Infarct Survival) Collaborative Group. Randomised trial of intravenous atenolol among 16,027 cases of suspected acute myocardial infarction: ISIS-1. Lancet. 1986; 2: 57–66. ISIS-3 (Third International Study of Infarct Survival) Collaborative Group. ISIS-3: a randomised trial of streptokinase vs tissue plasminogen activator vs anistreplase and of aspirin plus heparin vs aspirin alone among 41,299 cases of suspected acute myocardial infarction. Lancet. 1992; 339: 753–770.
S. M. Lesko and A. A. Mitchell, An assessment of the safety of pediatric ibuprofen. A practitionerbased randomized clinical trial. JAMA. 1995; 273: 929–933. MAGIC Steering Committee. Rationale and design of the magnesium in coronaries (MAGIC) study: a clinical trial to reevaluate the efficacy or early administration of magnesium in acute myocardial infarction. Am Heart J. 2000; 139: 10–14. SOLVD Investigators. Studies of left ventricular dysfunction (SOLVD)—rationale, design and methods: two trials that evaluate the effect of enalapril in patients with reduced ejection fraction. Am J Cardiol. 1990; 66: 315–322. E. Van Ganse, J. K. Jones, N. Moore, J. M. Le Parc, R. Wall, and H. Schneid, A large simple clinical trial prototype for assessment of OTC drug effects using patient-reported data. Pharmacoepidemiol Drug Saf. 2005; 14: 249–255.
CROSS-REFERENCES Effect size Eligibility Expanded safety trials Hypothesis Sample size
LINEAR MODEL

DR. IRIS BURKHOLDER, DR. LUTZ EDLER
German Cancer Research Center, Department of Biostatistics, Heidelberg, Germany

1 INTRODUCTION

Linear Models for simple and multiple regression, Analysis of Variance (ANOVA), and Analysis of Covariance (ANCOVA) have broad and often direct application in clinical trials. Various extensions and generalizations of the Linear Model have been developed more recently, often motivated by medical research problems. Most prominent are the loglinear models for the evaluation of count data; the logistic regression model for the evaluation of the influence of clinical factors on a dichotomous outcome such as, for example, cancer incidence; and, last but not least, the Cox regression model for the analysis of survival data. The Linear Model, also called General Linear Model (GLM) when extended to incorporate classes of predictive factors, is built on the assumption of response data following a Gaussian (Normal) distribution. In contrast, extensions of the GLM no longer require this assumption but are either based on other distributions or allow more general classes of distributions. The Generalized Linear Model (not to be confused with the General Linear Model) became an extremely valuable and versatile modeling approach for more general types of data (distributions from the exponential family), including the General Linear Model as a special case. Further extensions have been models for repeated measurements and the Generalized Estimating Equation (GEE) approach. It is important to note that central features of the Linear Model were retained in those extensions. Features of similarity and analogy include such important components as model building, variable selection, maximum likelihood technique, residual analysis, explained variation, and goodness of fit. Understanding the basic structure of a (General) Linear Model is extremely helpful in the practical application of statistical models and the interpretation of their results. This article presents the Linear Model in its general form; discusses its three major subtypes, namely Linear Regression, Analysis of Variance (ANOVA), and Analysis of Covariance (ANCOVA); and explains how the extensions are related to the Linear Model and how they are interrelated among themselves. The aim is therefore to provide a comprehensive overview. For specific modeling approaches, refer to the respective articles on Linear Regression, ANOVA, and Generalized Linear Models.

2 LINEAR MODEL

The Linear Model represents the relationship between a continuous response variable y and one or more predictor variables X in the form:

y = Xβ + ε    (1)
where
• y is a vector of observations of the response variable
• X is the design matrix determined by the predictors
• β is a vector of unknown parameters to be estimated
• ε is a vector of random disturbances, independent of each other and usually having a normal distribution with mean zero and variance σ².

Remarks:
1. As an example, consider as response variable y the blood pressure (mmHg) of a patient, and as predictor variables x1 = gender, x2 = age, x3 = smoking status, and x4 = treatment status (beta-blocker therapy). The variables gender, smoking status, and treatment status are measured on a qualitative
scale, whereas age is a quantitative variable.
2. It is very common to specify by subscripts the dimensions of the quantities of model Equation (1),

y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1},

indicating that y is a column vector of the n single observations (y1, ..., yn) and X is a matrix (x_{ij}), i = 1, ..., n, j = 1, ..., p, with n rows and p columns. Notice that n is the sample size of the respective clinical study.
3. Model Equation (1) assumes ε ∼ N(0, σ²), a normal distribution with mean zero and the same variance σ² for each observation. More general models exist that allow a different variance σ_i² for the individual observations (heteroscedasticity). Then one writes ε = ε_{n×1} ∼ N(0, Σ), where 0 = (0, ..., 0) is a vector of length n with all components equal to zero and where Σ is a diagonal matrix with σ1², ..., σn² in the diagonal and all other elements equal to zero (Σ = diag(σ1², ..., σn²)).

Three subtypes of classic Linear Models can be distinguished by the composition of the design matrix X (see Fig. 1). The Linear Model leads to
– Regression analysis, if X contains only quantitative values (covariates),
– Analysis of variance, if X contains only categorical variables,
– Analysis of covariance, if X contains quantitative as well as categorical variables.

2.1 Simple and Multiple Regression

Linear regression is a technique for modeling a continuous response or outcome variable as a linear combination of one (simple regression) or several (multiple regression) potential continuous explanatory factors. The general model equation for p explanatory variables is given by

y_i = β0 + β1 x_{i1} + ... + βp x_{ip} + ε_i,  i = 1, ..., n    (2)

where
• y_i denotes the ith observed response
• β0 is a constant intercept
• β_j, j = 1, ..., p, are the unknown regression parameters
• x_{ij}, j = 1, ..., p, are the values of the p explanatory variables
• ε_i is the ith error term.

Remarks:
1. If all p explanatory variables are zero (i.e., if they add no information), Equation (2) reduces to the constant regression y = β0 + ε. In simple regression, y_i = β0 + β1 x_i + ε_i, β0 is exactly the intercept on the y-axis and β1 the slope of the regression line.
2. Details about estimation of the regression parameters and interpretation of the parameters can be found in the article on Linear Regression Analysis.

2.1.1 Historical Example of Simple Linear Regression. The British scientist Sir Francis Galton (1) was studying the inheritance of physical characteristics. In particular, he wondered if he could predict a boy's adult height based on the height of his father. He plotted the heights of fathers and the heights of their sons for a number of father-son pairs and then tried to fit a straight line through the data. Let
• y be the height of the first, fully grown, son (continuous response) and
• x be the height of the father (the continuous predictor).
One can say that, in mathematical terms, Galton wanted to determine the intercept constant β0 and the regression parameter β1. He observed that when choosing a group of parents of a given height, the mean height of their children was closer to the mean height of the population than is the given height. In other words, tall parents tend to be taller than their children and short parents tend to be shorter. Galton termed this phenomenon "regression towards mediocrity,"
meaning ‘‘going back towards the average.’’ The mean height of the children was closer to the mean height of all children than the mean height of their parents was to the mean height of all parents. In mathematical terms, unless X and Y are exactly linearly related for a given value of X, the predicted value of Y is always fewer standard deviations from its mean than is X from its mean. Regression toward the mean appears unless X and Y are perfectly correlated, so it always occurs in practice. 2.2 Analysis of Variance (ANOVA) The Analysis of variance is used to uncover the main and interaction effects of categorical independent variables (factors) X on the dependent variable y. A main effect is the direct effect of an independent variable on the dependent variable. An interaction effect is the joint effect of two or more independent variables on the dependent variable. Depending on the number of factors used in an ANOVA model, one distinguishes one-way ANOVA and multi-way (p-way) ANOVA models. The one-way ANOVA investigates the effects of one categorical factor on the dependent variable. If that factor has k different values (features), one can test for differences of the dependent variable between the groups defined by those different features. A dichotomous categorical variable can be included as predictor variable in the ANOVA model by creating a dummy variable that is 0 if the characteristic is absent and 1 if it is present. For example, a dummy variable representing the sex of patients in a clinical trial can take the value 0 for females and 1
for males. The coefficient then represents the difference in mean between male and female. Variables with more than two categories cannot simply be inserted as one factor in the ANOVA, unless it can be assumed that the categories are ordered in the same way as their codes, and that adjoining categories are in some sense the same distance apart (i.e., the variable possesses a metric scale). As this assumption is very strong, a set of dummy variables is created instead to represent the predictive factor. The number of dummy variables is the number of categories of that factor minus one. For example, in patients with nonsmall-cell lung cancer, three different histologic tumor types are usually distinguished: adenocarcinoma, squamous cell carcinoma, and large cell carcinoma. The variable histologic tumor type could be included in an ANOVA model by creating two dummy variables: histo1 = 1 if patient has adenocarcinoma, 0 otherwise, and histo2 = 1 if patient has squamous cell carcinoma, 0 otherwise.
Obviously, the patient has large cell carcinoma if histo1 = 0 and histo2 = 0. Therefore, the two dummy variables histo1 and histo2 characterize each patient of that study. The mathematical model that describes the relationship between the response and a categorical factor for the one-way ANOVA is
[Figure 1. Differences between the three subtypes of Linear Models: regression analysis (only quantitative covariates), analysis of variance (only categorical variables), and analysis of covariance (quantitative covariates and categorical variables).]
given by

y_ij = μ + τ_i + ε_ij    (3)

where
• y_ij represents the jth observation on the ith level of the categorical factor
• μ is the common effect for the whole population
• τ_i denotes the effect of the ith level
• ε_ij is the random error present in the jth observation on the ith level; the errors are assumed to be normally and independently distributed with mean zero and variance σ².

2.2.1 Example: One-way ANOVA. In four groups of patients, measures of gut permeability are obtained (2) and the differences between the groups are to be investigated. In the one-way ANOVA model, let
• y_ij be the continuous response variable indicating the result of the permeability test for the jth individual in the ith patient group (i = 1, ..., 4)
• τ_i denote the effect of the ith patient group (i = 1, ..., 4).
The basic idea of a one-way ANOVA consists in considering the overall variability in the data and partitioning it into two parts: between-group and within-group variability. If the between-group variability is much larger than the within-group variability, this suggests that the differences between the groups are real, not just random noise.

Two-way ANOVA analyzes one dependent variable in terms of the categories (groups) formed by two independent factors; this generalizes to p-way ANOVA, which deals with p independent factors. It should be noted that as the number of independent factors increases, the number of potential interactions proliferates. Two independent factors have only a single first-order interaction. The two-way ANOVA model that describes the relationship between the response, two categorical factors, and their interaction is given by

y_ijk = μ + τ_i + δ_j + (τδ)_ij + ε_ijk    (4)

where
• y_ijk represents the kth observation on the ith level of the first categorical factor and on the jth level of the second categorical factor
• μ is the common effect for the whole population
• τ_i denotes the effect of the ith level of the first categorical factor
• δ_j denotes the effect of the jth level of the second categorical factor
• (τδ)_ij represents the interaction between the ith level of the first categorical factor and the jth level of the second categorical factor
• ε_ijk is the random error present in the kth observation on the ith level of the first categorical factor and on the jth level of the second categorical factor; the errors are assumed to be normally and independently distributed with mean zero and variance σ².
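As a sketch of how dummy coding and the one-way ANOVA model (3) translate into the matrix form y = Xβ + ε, the code below builds a design matrix for a three-level factor (mirroring the histo1/histo2 dummies above, with the first level as reference) and solves the least-squares problem with NumPy. The response values are made up for illustration.

```python
import numpy as np

# Fictitious responses for three groups (e.g., three histologic types)
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y = np.array([5.1, 4.9, 5.3, 6.8, 7.1, 6.9, 9.0, 8.7, 9.2])

# Design matrix X: intercept plus two dummy variables (reference coding;
# group 0 is the reference level).
X = np.column_stack([
    np.ones(len(y)),
    (groups == 1).astype(float),
    (groups == 2).astype(float),
])

# Least-squares estimates: intercept = mean of the reference group,
# remaining coefficients = differences of the other group means from it.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # [mean of group 0, group 1 - group 0, group 2 - group 0]
```

The reference coding used here differs from the effect parameterization μ + τ_i written in Equations (3) and (4), but it is the same model: only the meaning of the individual coefficients changes.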
2.2.2 Example: Interaction in Two-way ANOVA. Consider an experiment to determine the effect of exercise and diet on cholesterol levels. Both factors are categorical variables with two levels. For each factor, a dummy variable is created; for example, the dummy variable exercise is 0 if no exercise is done and 1 if the patient does exercise, and the dummy variable diet is 0 if the patient is on a normal diet and 1 if the patient is on a low-fat diet. Besides the two main effects of the two factors of interest, the design matrix X contains a column representing the interaction between exercise and diet. In this example, the interaction column could be formed by combining the two single columns of the main effects. The model equation is, therefore, given by
• y, the continuous response describing the cholesterol level
• μ, the common effect for the whole population
• τ_1, the effect of no exercise, and τ_2, the effect of exercise
• δ_1, the effect of normal diet, and δ_2, the effect of low-fat diet
• (τδ), the interaction effects: (τδ)_11 no exercise and normal diet, (τδ)_12 no exercise and low-fat diet, (τδ)_21 exercise and normal diet, and (τδ)_22 exercise and low-fat diet.
Three independent factors give rise to three first-order interactions and one second-order interaction; four independent factors have six first-order interactions, three second-order interactions, and one third-order interaction. As the number of interactions increases, it may become extremely difficult to interpret the model outcome. More information about one-way and two-way ANOVA and hypothesis testing can be found in the article on Analysis of Variance ANOVA.

2.3 Analysis of Covariance (ANCOVA)

The analysis of covariance (ANCOVA) is a technique that combines elements of regression and variance analysis. It involves fitting a model where some elements are effects corresponding to levels of factors and interactions, in the manner of analysis of variance, and others are regression-style coefficients. ANCOVA compares regression within several groups. Of main interest is the explanation of the relationship between the response and the quantitative variable within each group. The general model equation is given by

y = Xα + Zβ + ε    (5)

where
• y is the vector of observations of the response variable
• X is the matrix of dummy (0,1) variables
• α is the parameter vector of the general mean and the effects corresponding to levels of factors and their interactions
• Z contains the values of the covariates (the "regression part")
• β is the parameter vector for the regression-style coefficients of the covariates
• ε is the error term; the errors are assumed to be independently and identically normally N(0, σ²) distributed.

2.3.1 Example. The relationship between salt intake and blood pressure may be investigated for male and female patients. ANCOVA can be used to analyze whether the relationship between salt intake and blood pressure holds for both sexes. The model equation is given by
• the continuous response y describing the blood pressure
• the design matrix X containing the categorical covariate sex, and
• the matrix Z containing the continuous covariate of daily salt intake.
ANCOVA allows one to test more precisely whether the treatment (group) means are equal. It can also be used to study the linear relationship between the dependent and the quantitative variable within each group.

3 GENERALIZATIONS OF THE LINEAR MODEL

Generalized Linear Models are an extension of linear modeling that allows models to be fit to data that follow probability distributions other than the normal distribution, such as the Poisson, binomial, multinomial, and so on. Three extensions that have a broad range of application (see Fig. 2) are presented here: the loglinear model for Poisson-distributed data, logistic regression for binomially distributed data, and Cox regression for survival data. For a complete discussion, consult the article on Generalized Linear Models.
3.1 Loglinear Models

Loglinear analysis is an extension of the two-way contingency table, in which the conditional relationship between two categorical variables is analyzed by taking the natural logarithm of the cell frequencies within a contingency table, assuming Poisson-distributed data. No distinction is made between dependent and independent variables; therefore, loglinear models can only analyze the association between variables. Loglinear modeling is an analogue to
[Figure 2. Extensions of the classical Linear Model (Generalized Linear Models): the loglinear model for Poisson-distributed data, logistic regression for binomially distributed data, and Cox regression for survival data.]
multiple regression for categorical variables, and it can be applied to involve not only two but also three or more variables, corresponding to a multiway contingency analysis. The loglinear model equation for a 2 × 2 × 2 contingency table is given by

ln(F_ijk) = μ + λ_i^A + λ_j^B + λ_k^C + λ_ij^AB + λ_ik^AC + λ_jk^BC + λ_ijk^ABC    (6)

where
• ln(F_ijk) is the natural logarithm of the expected cell frequency of the cases for cell ijk in the 2 × 2 × 2 contingency table, where i indexes the first variable A, j the second variable B, and k the third variable C of this three-dimensional table
• μ is the overall mean of the natural logarithm of the expected frequencies
• λ_i^A is the main effect for variable A
• λ_j^B is the main effect for variable B
• λ_k^C is the main effect for variable C
• λ_ij^AB is the two-way interaction effect for variables A and B
• λ_ik^AC is the two-way interaction effect for variables A and C
• λ_jk^BC is the two-way interaction effect for variables B and C
• λ_ijk^ABC is the three-way interaction effect for variables A, B, and C.

3.1.1 Example. Suppose one is interested in the relationship between heart disease, sex, and body weight. The continuous variable body weight is broken down into two discrete categories: not overweight and overweight. The variables heart disease and sex
are dichotomous. Then, the three-dimensional contingency table looks like the table below (where some fictitious numbers have been inserted):

Body Weight       Sex      Heart Disease: no   Heart Disease: yes   Total
Not overweight    Male            40                  30              70
                  Female          20                  30              50
                  Total           60                  60             120
Overweight        Male            10                  65              75
                  Female           5                  25              30
                  Total           15                  90             105
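To illustrate the fitting idea described next, the sketch below computes the expected cell frequencies for the table above under the mutual-independence loglinear model [A][B][C], that is, main effects only with all λ interaction terms in Equation (6) set to zero. Comparing these expected counts with the observed ones shows how far that simple model is from the data; the code is a minimal illustration, not a full model search.

```python
import numpy as np

# Observed 2 x 2 x 2 table: axes are body weight (not overweight, overweight),
# sex (male, female), heart disease (no, yes), with the counts from the table above.
observed = np.array([
    [[40, 30], [20, 30]],
    [[10, 65], [ 5, 25]],
])
n = observed.sum()

# One-way margins for each variable
weight_margin = observed.sum(axis=(1, 2))   # totals by body weight
sex_margin = observed.sum(axis=(0, 2))      # totals by sex
disease_margin = observed.sum(axis=(0, 1))  # totals by heart disease

# Expected counts under mutual independence: E_ijk = n_i.. * n_.j. * n_..k / n^2
expected = (weight_margin[:, None, None]
            * sex_margin[None, :, None]
            * disease_margin[None, None, :]) / n ** 2

print(np.round(expected, 1))
```

A model that includes interaction terms would bring the expected frequencies closer to the observed ones, at the cost of extra parameters, which is exactly the trade-off behind choosing the most parsimonious model that fits.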
The basic strategy in loglinear modeling involves fitting models to the observed frequencies in the cross-tabulation of categorical variables. The model can then be represented by a set of expected frequencies that may or may not resemble the observed frequencies. The most parsimonious model that fits the data is chosen, and the effect parameters for the variables and their interactions can be estimated (3).

3.2 Logistic Regression

The logistic regression model is used when the response variable y of interest takes on only two values. Possible situations include studies in which subjects are alive or dead or have or do not have a particular characteristic (e.g., a specific disease). Denote the event y = 1 when the subject has the characteristic of interest and y = 0 when the subject does
not. In simple logistic regression, the subject has a single predictor variable x, which can take any form (continuous, discrete, or dichotomous). The logistic regression model then relates the probability P(y = 1) to the predictor via

P(y = 1) = exp(β0 + β1x) / [1 + exp(β0 + β1x)].    (7)

This model has a convenient representation in terms of the odds of the event y = 1 as

odds(y = 1) = P(y = 1) / P(y = 0) = exp(β0 + β1x),    (8)

which means the log odds is simply the linear function β0 + β1x, where the parameter β1 is of primary interest. This parameter, often called the slope parameter, controls the degree of association between the response and the predictor variable.
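A small numerical illustration of Equations (7) and (8), with arbitrary example values of β0, β1, and x (not estimates from any real data): the odds ratio for a one-unit increase in x equals exp(β1).

```python
from math import exp

beta0, beta1 = -2.0, 0.8   # illustrative coefficients only

def prob(x):
    """P(y = 1 | x) as in Equation (7)."""
    return exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))

def odds(x):
    """odds(y = 1 | x) as in Equation (8)."""
    return prob(x) / (1 - prob(x))

for x in (0, 1, 2):
    print(f"x={x}: P(y=1)={prob(x):.3f}, odds={odds(x):.3f}")

# The odds ratio for a one-unit increase in x equals exp(beta1):
print(odds(1) / odds(0), exp(beta1))  # both about 2.226
```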
3.3 Cox Regression

In many applications of survival analysis in clinical trials, the interest focuses on how covariates may affect the outcome. The proportional hazards regression model of Cox (4) has become the most important model for survival data. The basic endpoint of a patient, say i, is his/her survival time T_i. No specific distribution is assumed for this response variable; instead, one considers his/her hazard function h_i(t), which describes the instantaneous risk of failure at time t. Formally,

h_i(t) = lim_{Δt→0} (1/Δt) P(t ≤ T ≤ t + Δt | T ≥ t),    (9)

that is, the probability of failure at time t given no failure up to this time point. In its simplest form, which is sufficient for many practical uses, the hazard function for patient i is given by

h_i(t) = h_0(t) exp(β1 x_{i1} + ··· + βk x_{ik})    (10)

where
• h_0(t) is the baseline hazard function, describing the hazard for patients with all covariates equal to 0
• β = (β1, ..., βk) is the vector of the k unknown regression coefficients
• X_i = (x_{i1}, ..., x_{ik}) is the vector of the covariates for patient i.

Cox regression allows one to estimate the regression coefficients that best predict the observed survival. More details on estimation procedures are given in the article on Cox Regression.

REFERENCES

1. F. Galton, Regression toward mediocrity in hereditary stature. J. Anthropologic. Inst. 1886; 15: 246–263.
2. M. Bland, An Introduction to Medical Statistics. Oxford: Oxford University Press, 1995.
3. D. Knoke and P. J. Burke, Log-Linear Models. Newbury Park, CA: Sage Publications, 1980.
4. D. R. Cox, Regression models and life tables (with discussion). J. Royal Stat. Soc. B 1972; 34: 187–220.
FURTHER READING

A. Agresti, An Introduction to Categorical Data Analysis. New York: John Wiley & Sons, 1996.
J. M. Bland and D. G. Altman, Regression towards the mean. Brit. Med. J. 1994; 308: 1499.
P. McCullagh and J. A. Nelder, Generalized Linear Models. New York: Chapman and Hall, 1983.
S. R. Searle, Linear Models. New York: John Wiley & Sons, 1971.
LOGISTIC REGRESSION

STANLEY LEMESHOW
Ohio State University, Columbus, OH, USA

DAVID W. HOSMER Jr
University of Massachusetts, Amherst, MA, USA

The goal of a logistic regression analysis is to find the best fitting and most parsimonious, yet biologically reasonable, model to describe the relationship between an outcome (dependent or response variable) and a set of independent (predictor or explanatory) variables. What distinguishes the logistic regression model from the linear regression model is that the outcome variable in logistic regression is categorical and most usually binary or dichotomous. In any regression problem the key quantity is the mean value of the outcome variable, given the value of the independent variable. This quantity is called the conditional mean and will be expressed as E(Y|x), where Y denotes the outcome variable and x denotes a value of the independent variable. In linear regression we assume that this mean may be expressed as an equation linear in x (or some transformation of x or Y), such as E(Y|x) = β0 + β1x. This expression implies that it is possible for E(Y|x) to take on any value as x ranges between −∞ and +∞. Many distribution functions have been proposed for use in the analysis of a dichotomous outcome variable. Cox & Snell (2) discuss some of these. There are two primary reasons for choosing the logistic distribution. These are: (i) from a mathematical point of view it is an extremely flexible and easily used function, and (ii) it lends itself to a biologically meaningful interpretation. To simplify notation, let π(x) = E(Y|x) represent the conditional mean of Y given x. The logistic regression model can be expressed as

π(x) = exp(β0 + β1x) / [1 + exp(β0 + β1x)].    (1)

The logit transformation, defined in terms of π(x), is as follows:

g(x) = ln[π(x) / (1 − π(x))] = β0 + β1x.    (2)

The importance of this transformation is that g(x) has many of the desirable properties of a linear regression model. The logit, g(x), is linear in its parameters, may be continuous, and may range from −∞ to +∞ depending on the range of x. The second important difference between the linear and logistic regression models concerns the conditional distribution of the outcome variable. In the linear regression model we assume that an observation of the outcome variable may be expressed as y = E(Y|x) + ε. The quantity ε is called the error and expresses an observation's deviation from the conditional mean. The most common assumption is that ε follows a normal distribution with mean zero and some variance that is constant across levels of the independent variable. It follows that the conditional distribution of the outcome variable given x is normal with mean E(Y|x), and a variance that is constant. This is not the case with a dichotomous outcome variable. In this situation we may express the value of the outcome variable given x as y = π(x) + ε. Here the quantity ε may assume one of two possible values. If y = 1, then ε = 1 − π(x) with probability π(x), and if y = 0, then ε = −π(x) with probability 1 − π(x). Thus, ε has a distribution with mean zero and variance equal to π(x)[1 − π(x)]. That is, the conditional distribution of the outcome variable follows a binomial distribution with probability given by the conditional mean, π(x).

1 FITTING THE LOGISTIC REGRESSION MODEL

Suppose we have a sample of n independent observations of the pair (xi, yi), i = 1, 2, ..., n, where yi denotes the value of a dichotomous outcome variable and xi is the value of the independent variable for the ith subject. Furthermore, assume that the outcome
variable has been coded as 0 or 1 representing the absence or presence of the characteristic, respectively. To fit the logistic regression model (1) to a set of data requires that we estimate the values of β 0 and β 1 , the unknown parameters. In linear regression the method used most often to estimate unknown parameters is least squares. In that method we choose those values of β 0 and β 1 that minimize the sum of squared deviations of the observed values of Y from the predicted values based upon the model. Under the usual assumptions for linear regression the least squares method yields estimators with a number of desirable statistical properties. Unfortunately, when the least squares method is applied to a model with a dichotomous outcome the estimators no longer have these same properties. The general method of estimation that leads to the least squares function under the linear regression model (when the error terms are normally distributed) is maximum likelihood. This is the method used to estimate the logistic regression parameters. In a very general sense the maximum likelihood method yields values for the unknown parameters that maximize the probability of obtaining the observed set of data. To apply this method we must first construct a function called the likelihood function. This function expresses the probability of the observed data as a function of the unknown parameters. The maximum likelihood estimators of these parameters are chosen to be those values that maximize this function. Thus, the resulting estimators are those that agree most closely with the observed data. If Y is coded as 0 or 1, then the expression for π (x) given in (1) provides (for an arbitrary value of β = (β 0 , β 1 ), the vector of parameters) the conditional probability that Y is equal to 1 given x. This will be denoted Pr(Y = 1|x). It follows that the quantity 1 − π (x) gives the conditional probability that Y is equal to zero given x, Pr(Y = 0|x). Thus, for those pairs (xi , yi ), where yi = 1, the contribution to the likelihood function is π (xi ), and for those pairs where yi = 0, the contribution to the likelihood function is 1 − π (xi ), where the quantity π (xi ) denotes the value of π (x) computed at xi . A convenient way to express the contribution to the likelihood function for
the pair (xi, yi) is through the term

ξ(xi) = π(xi)^{yi} [1 − π(xi)]^{1−yi}.    (3)

Since the observations are assumed to be independent, the likelihood function is obtained as the product of the terms given in (3) as follows:

l(β) = ∏_{i=1}^{n} ξ(xi).    (4)

The principle of maximum likelihood states that we use as our estimate of β the value that maximizes the expression in (4). However, it is easier mathematically to work with the log of (4). This expression, the log likelihood, is defined as

L(β) = ln[l(β)] = Σ_{i=1}^{n} {yi ln[π(xi)] + (1 − yi) ln[1 − π(xi)]}.    (5)

To find the value of β that maximizes L(β) we differentiate L(β) with respect to β0 and β1 and set the resulting expressions equal to zero. These equations are as follows:

Σ_{i=1}^{n} [yi − π(xi)] = 0    (6)

and

Σ_{i=1}^{n} xi [yi − π(xi)] = 0,    (7)
and are called the likelihood equations. In linear regression, the likelihood equations, obtained by differentiating the sum of squared deviations function with respect to β, are linear in the unknown parameters, and thus are easily solved. For logistic regression the expressions in (6) and (7) are nonlinear in β 0 and β 1 , and thus require special methods for their solution. These methods are iterative in nature and have been programmed into available logistic regression software. McCullagh & Nelder (6) discuss the iterative methods used by most programs. In particular, they show that the
solution to (6) and (7) may be obtained using a generalized weighted least squares procedure. The value of β given by the solution to (6) and (7) is called the maximum likelihood estimate, denoted as β̂. Similarly, π̂(xi) is the maximum likelihood estimate of π(xi). This quantity provides an estimate of the conditional probability that Y is equal to 1, given that x is equal to xi. As such, it represents the fitted or predicted value for the logistic regression model. An interesting consequence of (6) is that

Σ_{i=1}^{n} yi = Σ_{i=1}^{n} π̂(xi).
That is, the sum of the observed values of y is equal to the sum of the predicted (expected) values. After estimating the coefficients, it is standard practice to assess the significance of the variables in the model. This usually involves testing a statistical hypothesis to determine whether the independent variables in the model are ‘‘significantly’’ related to the outcome variable. One approach to testing for the significance of the coefficient of a variable in any model relates to the following question. Does the model that includes the variable in question tell us more about the outcome (or response) variable than does a model that does not include that variable? This question is answered by comparing the observed values of the response variable with those predicted by each of two models; the first with and the second without the variable in question. The mathematical function used to compare the observed and predicted values depends on the particular problem. If the predicted values with the variable in the model are better, or more accurate in some sense, than when the variable is not in the model, then we feel that the variable in question is ‘‘significant’’. It is important to note that we are not considering the question of whether the predicted values are an accurate representation of the observed values in an absolute sense (this would be called goodness of fit). Instead, our question is posed in a relative sense. For the purposes of assessing the significance of an independent variable we compute
the value of the following statistic:

\[ G = -2\,\ln\!\left[\frac{\text{likelihood without the variable}}{\text{likelihood with the variable}}\right]. \]  (8)

Under the hypothesis that β1 is equal to zero, the statistic G will follow a chi-square distribution with one degree of freedom. The calculation of the log likelihood and this generalized likelihood ratio test are standard features of any good logistic regression package. This makes it possible to check for the significance of the addition of new terms to the model as a matter of routine. In the simple case of a single independent variable, we can first fit a model containing only the constant term. We can then fit a model containing the independent variable along with the constant. This gives rise to a new log likelihood. The likelihood ratio test is obtained by multiplying the difference between the log likelihoods of the two models by −2.

Another test that is often carried out is the Wald test, which is obtained by comparing the maximum likelihood estimate of the slope parameter, β̂1, with an estimate of its standard error. The resulting ratio

\[ W = \frac{\hat\beta_1}{\widehat{\mathrm{se}}(\hat\beta_1)}, \]

under the hypothesis that β1 = 0, follows a standard normal distribution. Standard errors of the estimated parameters are routinely printed out by computer software. Hauck & Donner (3) examined the performance of the Wald test and found that it behaved in an aberrant manner, often failing to reject when the coefficient was significant. They recommended that the likelihood ratio test be used. Jennings (5) has also looked at the adequacy of inferences in logistic regression based on Wald statistics. His conclusions are similar to those of Hauck & Donner. Both the likelihood ratio test, G, and the Wald test, W, require the computation of the maximum likelihood estimate for β1. For a single variable this is not a difficult or costly computational task. However, for large data sets with many variables, the iterative computation needed to obtain the maximum likelihood estimates can be considerable.
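As a hedged illustration of these computations (not from the source; the toy data frame and variable names below are invented for the sketch), the log likelihood can be maximized directly with optim(), and the same estimates, together with the statistics G and W, are produced by R's glm():

# Sketch with a fabricated toy data set; in practice d would hold the study data.
set.seed(1)
d <- data.frame(x = rnorm(100), y = rbinom(100, 1, 0.4))

negloglik <- function(beta, x, y) {          # minus the log likelihood (5)
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log(1 + exp(eta)))
}
optim(c(0, 0), negloglik, x = d$x, y = d$y)$par   # maximum likelihood estimates

fit1 <- glm(y ~ x, family = binomial, data = d)   # same fit via iterative weighted least squares
fit0 <- glm(y ~ 1, family = binomial, data = d)   # constant-only model

G <- as.numeric(2 * (logLik(fit1) - logLik(fit0)))  # likelihood ratio statistic (8)
pchisq(G, df = 1, lower.tail = FALSE)               # its chi-square P value

coef(summary(fit1))["x", "z value"]                 # Wald statistic W for the slope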
The logistic regression model may be used with matched study designs. Fitting conditional logistic regression models requires modifications, which are not discussed here. The reader interested in the conditional logistic regression model may find details in [4, Chapter 7].

2 THE MULTIPLE LOGISTIC REGRESSION MODEL

Consider a collection of p independent variables which will be denoted by the vector x = (x1, x2, ..., xp). Assume for the moment that each of these variables is at least interval scaled. Let the conditional probability that the outcome is present be denoted by Pr(Y = 1|x) = π(x). Then the logit of the multiple logistic regression model is given by

\[ g(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p, \]  (9)

in which case

\[ \pi(x) = \frac{\exp[g(x)]}{1 + \exp[g(x)]}. \]  (10)

If some of the independent variables are discrete, nominal scaled variables such as race, sex, treatment group, and so forth, then it is inappropriate to include them in the model as if they were interval scaled. In this situation a collection of design variables (or dummy variables) should be used. Most logistic regression software will generate the design variables, and some programs have a choice of several different methods. In general, if a nominal scaled variable has k possible values, then k − 1 design variables will be needed. Suppose, for example, that the jth independent variable xj has kj levels. The kj − 1 design variables will be denoted as Dju and the coefficients for these design variables will be denoted as βju, u = 1, 2, ..., kj − 1. Thus, the logit for a model with p variables, the jth of which is discrete, is

\[ g(x) = \beta_0 + \beta_1 x_1 + \cdots + \sum_{u=1}^{k_j - 1} \beta_{ju} D_{ju} + \beta_p x_p. \]

3 FITTING THE MULTIPLE LOGISTIC REGRESSION MODEL

Assume that we have a sample of n independent observations of the pair (xi, yi), i = 1, 2, ..., n. As in the univariate case, fitting the model requires that we obtain estimates of the vector β = (β0, β1, ..., βp). The method of estimation used in the multivariate case is the same as in the univariate situation, i.e. maximum likelihood. The likelihood function is nearly identical to that given in (4), with the only change being that π(x) is now defined as in (10). There are p + 1 likelihood equations, which are obtained by differentiating the log likelihood function with respect to the p + 1 coefficients. The likelihood equations that result may be expressed as follows:

\[ \sum_{i=1}^{n} [y_i - \pi(x_i)] = 0 \]

and

\[ \sum_{i=1}^{n} x_{ij}\,[y_i - \pi(x_i)] = 0 \]
for j = 1, 2, ..., p. As in the univariate model, the solution of the likelihood equations requires special purpose software, which may be found in many packaged programs. Let β̂ denote the solution to these equations. Thus, the fitted values for the multiple logistic regression model are π̂(xi), the value of the expression in (10) computed using β̂ and xi. Before proceeding further we present an example that illustrates the formulation of a multiple logistic regression model and the estimation of its coefficients.

4 EXAMPLE

To provide an example of fitting a multiple logistic regression model, consider the data for the low birth weight study described in Appendix 1 of Hosmer & Lemeshow (4). The code sheet for the data set is given in Table 1. The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 g). In this study data were collected on
189 women; n1 = 59 of them delivered low birth weight babies and n0 = 130 delivered normal birth weight babies. In this example the variable race has been recoded using the two design variables shown in Table 2. FTV was recoded to 0 = some, 1 = none, and PTL was recoded to 0 = none, 1 = one or more. The two newly coded variables are called FTV01 and PTL01. The results of fitting the logistic regression model to these data are given in Table 3. In Table 3 the estimated coefficients for the two design variables for race are indicated in the lines denoted by "RACE 1" and "RACE 2". The estimated logit is given by

ĝ(x) = 0.545 − 0.035 × AGE − 0.015 × LWT + 0.815 × SMOKE + 1.824 × HT + 0.702 × UI + 1.202 × RACE 1 + 0.773 × RACE 2 + 0.121 × FTV01 + 1.237 × PTL01.

The fitted values are obtained using the estimated logit, ĝ(x), as in (10).

Table 1. Code Sheet for the Variables in the Low Birth Weight Data Set

Variable                                                                                     Abbreviation
Identification code                                                                          ID
Low birth weight (0 = birth weight ≥ 2500 g, 1 = birth weight < 2500 g)                      LOW
Age of the mother in years                                                                   AGE
Weight in pounds at the last menstrual period                                                LWT
Race (1 = white, 2 = black, 3 = other)                                                       RACE
Smoking status during pregnancy (1 = yes, 0 = no)                                            SMOKE
History of premature labor (0 = none, 1 = one, etc.)                                         PTL
History of hypertension (1 = yes, 0 = no)                                                    HT
Presence of uterine irritability (1 = yes, 0 = no)                                           UI
Number of physician visits during the first trimester (0 = none, 1 = one, 2 = two, etc.)    FTV
Birth weight (g)                                                                             BWT

Table 2. Coding of Design Variables for RACE

              Design variables
RACE          RACE 1    RACE 2
White         0         0
Black         1         0
Other         0         1
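The low birth weight data are distributed with several statistics packages. Assuming the copy shipped as birthwt in the R package MASS is the same data set (an assumption; that copy uses lowercase variable names), a sketch of the fit reported in Table 3 is:

# Sketch, assuming MASS::birthwt holds the low birth weight data described above.
library(MASS)
bw <- birthwt
bw$race  <- factor(bw$race, levels = c(1, 2, 3),
                   labels = c("white", "black", "other"))  # white = reference cell
bw$ftv01 <- as.integer(bw$ftv == 0)    # 1 = no first-trimester physician visits
bw$ptl01 <- as.integer(bw$ptl >= 1)    # 1 = one or more premature labors

full <- glm(low ~ age + lwt + smoke + ht + ui + race + ftv01 + ptl01,
            family = binomial, data = bw)
summary(full)     # coefficients, standard errors, z statistics, P values
logLik(full)      # log likelihood of the fitted model

The two dummy variables that glm() creates for the race factor play the role of RACE 1 and RACE 2; the estimates should agree with Table 3 to rounding, provided this copy of the data matches the one used in the text.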
5 TESTING FOR THE SIGNIFICANCE OF THE MODEL

Once we have fit a particular multiple (multivariate) logistic regression model, we begin the process of assessment of the model. The first step in this process is usually to assess the significance of the variables in the model. The likelihood ratio test for overall significance of the p coefficients for the independent variables in the model is performed based on the statistic G given in (8). The only difference is that the fitted values, π̂, under the model are based on the vector containing p + 1 parameters, β̂. Under the null hypothesis that the p "slope" coefficients for the covariates in the model are equal to zero, the distribution of G is chi-square with p degrees of freedom.

As an example, consider the fitted model whose estimated coefficients are given in Table 3. For that model the value of the log likelihood is L = −98.36. A second model, fit with the constant term only, yields L = −117.336. Hence G = −2[(−117.34) − (−98.36)] = 37.94 and the P value for the test is Pr[χ²(9) > 37.94] < 0.0001 (see Table 3). Rejection of the null hypothesis (that all of the coefficients are simultaneously equal to zero) has an interpretation analogous to that in multiple linear regression; we may conclude that at least one, and perhaps all p, coefficients are different from zero.

Table 3. Estimated Coefficients for a Multiple Logistic Regression Model Using All Variables From the Low Birth Weight Data Set

Logit estimates                      Number of obs. = 189
Log likelihood = −98.36              χ²(9) = 37.94,  Prob > χ² = 0.0000

Variable    Coeff.    Std. error    z         P>|z|    [95% conf. interval]
AGE         −0.035    0.039         −0.920    0.357    −0.111    0.040
LWT         −0.015    0.007         −2.114    0.035    −0.029    −0.001
SMOKE        0.815    0.420          1.939    0.053    −0.009    1.639
HT           1.824    0.705          2.586    0.010     0.441    3.206
UI           0.702    0.465          1.511    0.131    −0.208    1.613
RACE 1       1.202    0.534          2.253    0.024     0.156    2.248
RACE 2       0.773    0.460          1.681    0.093    −0.128    1.674
FTV01        0.121    0.376          0.323    0.746    −0.615    0.858
PTL01        1.237    0.466          2.654    0.008     0.323    2.148
cons         0.545    1.266          0.430    0.667    −1.937    3.027

Before concluding that any or all of the coefficients are nonzero, we may wish to look at the univariate Wald test statistics, Wj = β̂j / ŝe(β̂j). These are given in the fourth column (labeled z) in Table 3. Under the hypothesis that an individual coefficient is zero, these statistics will follow the standard normal distribution. Thus, the value of these statistics may give us an indication of which of the variables in the model may or may not be significant. If we use a critical value of 2, which leads to an approximate level of significance (two-tailed) of 0.05, then we would conclude that the variables LWT, SMOKE, HT, PTL01, and possibly RACE are significant, while AGE, UI, and FTV01 are not significant.

Considering that the overall goal is to obtain the best fitting model while minimizing the number of parameters, the next logical step is to fit a reduced model, containing only those variables thought to be significant, and compare it with the full model containing all the variables. The results of fitting the reduced model are given in Table 4. The difference between the two models is the exclusion of the variables AGE, UI, and FTV01 from the full model. The likelihood ratio test comparing these two models is obtained using the definition of G given in (8). It has a distribution that is chi-square with three degrees of freedom under the hypothesis that the coefficients for the variables excluded are equal to zero. The value of the test statistic comparing the models in Tables 3 and 4 is G = −2[(−100.24) − (−98.36)] = 3.76 which, with three degrees of freedom, has a P value of P[χ²(3) > 3.76] = 0.2886. Since the P value is large, exceeding 0.05, we conclude that the reduced model is as good as the full model. Thus there is no advantage to including AGE, UI, and FTV01 in the model. However, we must not base our models entirely on tests of statistical significance. Numerous other considerations should influence our decision to include or exclude variables from a model.
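Continuing the hedged birthwt sketch above, the reduced model of Table 4 and the three-degree-of-freedom likelihood ratio comparison can be written as:

# Reduced model (AGE, UI, FTV01 dropped) and the likelihood ratio statistic G.
reduced <- glm(low ~ lwt + smoke + ht + race + ptl01,
               family = binomial, data = bw)
G <- as.numeric(2 * (logLik(full) - logLik(reduced)))
pchisq(G, df = 3, lower.tail = FALSE)      # P value of the comparison
anova(reduced, full, test = "LRT")         # equivalent comparison via anova()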
6 INTERPRETATION OF THE COEFFICIENTS OF THE LOGISTIC REGRESSION MODEL

After fitting a model the emphasis shifts from the computation and assessment of significance of estimated coefficients to interpretation of their values. The interpretation of any fitted model requires that we can draw practical inferences from the estimated coefficients in the model. The question addressed is: What do the estimated coefficients in the model tell us about the research questions that motivated the study? For most models this involves the estimated coefficients for the independent variables in the model.
Table 4. Estimated Coefficients for a Multiple Logistic Regression Model Using the Variables LWT, SMOKE, HT, PTL01 and RACE From the Low Birth Weight Data Set

Logit estimates                      Number of obs. = 189
Log likelihood = −100.24             χ²(6) = 34.19,  Prob > χ² = 0.0000

Variable    Coeff.    Std. error    z         P>|z|    [95% conf. interval]
LWT         −0.017    0.007         −2.407    0.016    −0.030    −0.003
SMOKE        0.876    0.401          2.186    0.029     0.091    1.661
HT           1.767    0.708          2.495    0.013     0.379    3.156
RACE 1       1.264    0.529          2.387    0.017     0.226    2.301
RACE 2       0.864    0.435          1.986    0.047     0.011    1.717
PTL01        1.231    0.446          2.759    0.006     0.357    2.106
cons         0.095    0.957          0.099    0.921    −1.781    1.970

The estimated coefficients for the independent variables represent the slope or rate of change of a function of the dependent variable per unit of change in the independent variable. Thus, interpretation involves two issues: (i) determining the functional relationship between the dependent variable and the independent variable, and (ii) appropriately defining the unit of change for the independent variable. For a linear regression model we recall that the slope coefficient, β1, is equal to the difference between the value of the dependent variable at x + 1 and the value of the dependent variable at x, for any value of x. In the logistic regression model β1 = g(x + 1) − g(x). That is, the slope coefficient represents the change in the logit for a change of one unit in the independent variable x. Proper interpretation of the coefficient in a logistic regression model depends on being able to place meaning on the difference between two logits. Consider the interpretation of the coefficients for a univariate logistic regression model for each of the possible measurement scales of the independent variable.

7 DICHOTOMOUS INDEPENDENT VARIABLE

Assume that x is coded as either 0 or 1. Under this model there are two values of π(x) and equivalently two values of 1 − π(x). These values may be conveniently displayed in a 2 × 2 table, as shown in Table 5.
The odds of the outcome being present among individuals with x = 1 is defined as π(1)/[1 − π(1)]. Similarly, the odds of the outcome being present among individuals with x = 0 is defined as π(0)/[1 − π(0)]. The odds ratio, denoted by ψ, is defined as the ratio of the odds for x = 1 to the odds for x = 0, and is given by

\[ \psi = \frac{\pi(1)/[1 - \pi(1)]}{\pi(0)/[1 - \pi(0)]}. \]  (11)

The log of the odds ratio, termed the log odds ratio, or log odds, is

\[ \ln(\psi) = \ln\!\left\{ \frac{\pi(1)/[1 - \pi(1)]}{\pi(0)/[1 - \pi(0)]} \right\} = g(1) - g(0), \]

which is the logit difference, where the log of the odds is called the logit and, in this example, these are

\[ g(1) = \ln\!\left[\frac{\pi(1)}{1 - \pi(1)}\right] \quad \text{and} \quad g(0) = \ln\!\left[\frac{\pi(0)}{1 - \pi(0)}\right]. \]

Table 5. Values of the Logistic Regression Model When the Independent Variable is Dichotomous

                           Independent variable X
Outcome variable Y         x = 1                                          x = 0
y = 1                      π(1) = exp(β0 + β1) / [1 + exp(β0 + β1)]       π(0) = exp(β0) / [1 + exp(β0)]
y = 0                      1 − π(1) = 1 / [1 + exp(β0 + β1)]              1 − π(0) = 1 / [1 + exp(β0)]
Total                      1.0                                            1.0

Using the expressions for the logistic regression model shown in Table 5, the odds ratio is

\[ \psi = \frac{\dfrac{\exp(\beta_0 + \beta_1)}{1 + \exp(\beta_0 + \beta_1)} \Big/ \dfrac{1}{1 + \exp(\beta_0 + \beta_1)}}{\dfrac{\exp(\beta_0)}{1 + \exp(\beta_0)} \Big/ \dfrac{1}{1 + \exp(\beta_0)}} = \frac{\exp(\beta_0 + \beta_1)}{\exp(\beta_0)} = \exp(\beta_1). \]

Hence, for logistic regression with a dichotomous independent variable

\[ \psi = \exp(\beta_1), \]  (12)

and the logit difference, or log odds, is ln(ψ) = ln[exp(β1)] = β1. This fact concerning the interpretability of the coefficients is the fundamental reason why logistic regression has proven such a powerful analytic tool for epidemiologic research.

A confidence interval (CI) estimate for the odds ratio is obtained by first calculating the endpoints of a confidence interval for the coefficient β1, and then exponentiating these values. In general, the endpoints are given by

\[ \exp[\hat\beta_1 \pm z_{1-\alpha/2} \times \widehat{\mathrm{se}}(\hat\beta_1)]. \]

Because of the importance of the odds ratio as a measure of association, point and interval estimates are often found in additional columns in tables presenting the results of a logistic regression analysis.

In the previous discussion we noted that the estimate of the odds ratio was ψ̂ = exp(β̂1). This is correct when the independent variable has been coded as 0 or 1. This type of coding is called "reference cell" coding. Other coding could be used. For example, the variable may be coded as −1 or +1. This type of coding is termed "deviation from means" coding. Evaluation of the logit difference shows that the odds ratio is then calculated as ψ̂ = exp(2β̂1), and if an investigator were simply to exponentiate the coefficient from the computer output of a logistic regression analysis, the wrong estimate of the odds ratio would be obtained. Close attention should be paid to the method used to code design variables. The method of coding also influences the calculation of the endpoints of the confidence interval. With deviation from means coding, the estimated standard error needed for confidence interval estimation is ŝe(2β̂1), which is 2 × ŝe(β̂1). Thus the endpoints of the confidence interval are

\[ \exp[2\hat\beta_1 \pm z_{1-\alpha/2} \times 2 \times \widehat{\mathrm{se}}(\hat\beta_1)]. \]

In summary, for a dichotomous variable the parameter of interest is the odds ratio. An estimate of this parameter may be obtained from the estimated logistic regression coefficient, regardless of how the variable is coded or scaled. This relationship between the logistic regression coefficient and the odds ratio provides the foundation for our interpretation of all logistic regression results.

8 POLYTOMOUS INDEPENDENT VARIABLE

Suppose that instead of two categories the independent variable has k > 2 distinct values. For example, we may have variables that denote the county of residence within a state, the clinic used for primary health care within a city, or race. Each of these variables has a fixed number of discrete outcomes and the scale of measurement is nominal.
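Before turning to the polytomous case, the coding issue just described can be checked numerically. The sketch below (toy data and variable names are invented, not from the source) fits the same dichotomous covariate under 0/1 and −1/+1 coding and recovers the same odds ratio as exp(β̂1) and exp(2β̂1), respectively.

# Sketch: reference cell (0/1) versus deviation from means (-1/+1) coding.
set.seed(2)
toy <- data.frame(y = rbinom(200, 1, 0.4), z01 = rbinom(200, 1, 0.5))
toy$zdev <- ifelse(toy$z01 == 1, 1, -1)

f01  <- glm(y ~ z01,  family = binomial, data = toy)
fdev <- glm(y ~ zdev, family = binomial, data = toy)

b01 <- coef(f01)["z01"];  se01 <- sqrt(vcov(f01)["z01", "z01"])
exp(b01); exp(b01 + c(-1, 1) * qnorm(0.975) * se01)            # OR and 95% CI, 0/1 coding

bdev <- coef(fdev)["zdev"]; sedev <- sqrt(vcov(fdev)["zdev", "zdev"])
exp(2 * bdev); exp(2 * bdev + c(-1, 1) * qnorm(0.975) * 2 * sedev)  # same OR and CI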
Suppose that in a study of coronary heart disease (CHD) the variable RACE is coded at four levels, and that the cross-classification of RACE by CHD status yields the data presented in Table 6. These data are hypothetical and have been formulated for ease of computation. The extension to a situation where the variable has more than four levels is not conceptually different, so all the examples in this section use k = 4. At the bottom of Table 6 the odds ratio is given for each race, using white as the reference group. For example, for hispanic the estimated odds ratio is (15 × 20)/(5 × 10) = 6.0. The log odds ratios are given in the last row of Table 6. This display is typical of what is found in the literature when there is a perceived referent group to which the other groups are to be compared.

Table 6. Cross-Classification of Hypothetical Data on RACE and CHD Status for 100 Subjects

CHD status        White    Black          Hispanic       Other          Total
Present           5        20             15             10             50
Absent            20       10             10             10             50
Total             25       30             25             20             100

Odds ratio (ψ̂)    1.0      8.0            6.0            4.0
95% CI                     (2.3, 27.6)    (1.7, 21.3)    (1.1, 14.9)
ln(ψ̂)             0.0      2.08           1.79           1.39

These same estimates of the odds ratio may be obtained from a logistic regression program with an appropriate choice of design variables. The method for specifying the design variables involves setting all of them equal to zero for the reference group, and then setting a single design variable equal to one for each of the other groups. This is illustrated in Table 7.

Table 7. Specification of the Design Variables for RACE Using White as the Reference Group

                  Design variables
RACE (code)       D1      D2      D3
White (1)         0       0       0
Black (2)         1       0       0
Hispanic (3)      0       1       0
Other (4)         0       0       1

Use of any logistic regression program with design variables coded as shown in Table 7 yields the estimated logistic regression coefficients given in Table 8.

Table 8. Results of Fitting the Logistic Regression Model to the Data in Table 6 Using the Design Variables in Table 7

Variable    Coeff.    Std. error    z         P>|z|    [95% conf. interval]
RACE 1       2.079    0.632          3.288    0.001     0.840    3.319
RACE 2       1.792    0.645          2.776    0.006     0.527    3.057
RACE 3       1.386    0.671          2.067    0.039     0.072    2.701
cons        −1.386    0.500         −2.773    0.006    −2.367    −0.406

Variable    Odds ratio    [95% conf. interval]
RACE 1      8             2.32     27.63
RACE 2      6             1.69     21.26
RACE 3      4             1.07     14.90

A comparison of the estimated coefficients in Table 8 with the log odds in Table 6 shows that ln[ψ̂(black, white)] = β̂11 = 2.079, ln[ψ̂(hispanic, white)] = β̂12 = 1.792, and ln[ψ̂(other, white)] = β̂13 = 1.386. In the univariate case the estimates of the standard errors found in the logistic regression output are identical to the estimates obtained using the cell frequencies from the contingency table. For example, the estimated standard error of the estimated coefficient for design variable (1), β̂11, is 0.6325 = (1/5 + 1/20 + 1/20 + 1/10)^{1/2}. A derivation of this result appears in Bishop et al. (1).

Confidence limits for odds ratios may be obtained as follows:

\[ \hat\beta_{1j} \pm z_{1-\alpha/2} \times \widehat{\mathrm{se}}(\hat\beta_{1j}). \]

The corresponding limits for the odds ratio are obtained by exponentiating these limits as follows:

\[ \exp[\hat\beta_{1j} \pm z_{1-\alpha/2} \times \widehat{\mathrm{se}}(\hat\beta_{1j})]. \]

9 CONTINUOUS INDEPENDENT VARIABLE

When a logistic regression model contains a continuous independent variable, interpretation of the estimated coefficient depends on how it is entered into the model and the particular units of the variable. For purposes of developing the method to interpret the coefficient for a continuous variable, we assume that the logit is linear in the variable. Under the assumption that the logit is linear in the continuous covariate, x, the equation for the logit is g(x) = β0 + β1x. It follows that the slope coefficient, β1, gives the change in the log odds for an increase of 1 unit in x, i.e. β1 = g(x + 1) − g(x) for any value of x.

Most often a change of 1 unit will not be biologically very interesting. For example, an increase of 1 year in age or of 1 mmHg in systolic blood pressure may be too small to be considered important. A change of 10 years or 10 mmHg might be considered more useful. However, if the range of x is from zero to one, as might be the case for some created index, then a change of 1 is too large and a change of 0.01 may be more realistic. Hence, to provide a useful interpretation for continuous scaled covariates we need to develop a method for point and interval estimation for an arbitrary change of c units in the covariate.

The log odds for a change of c units in x is obtained from the logit difference g(x + c) − g(x) = cβ1, and the associated odds ratio is obtained by exponentiating this logit difference, ψ(c) = ψ(x + c, x) = exp(cβ1). An estimate may be obtained by replacing β1 with its maximum likelihood estimate, β̂1. An estimate of the standard error needed for confidence interval estimation is obtained by multiplying the estimated standard error of β̂1 by c. Hence the endpoints of the 100(1 − α)% CI estimate of ψ(c) are

\[ \exp[c\hat\beta_1 \pm z_{1-\alpha/2}\, c\, \widehat{\mathrm{se}}(\hat\beta_1)]. \]

Since both the point estimate and endpoints of the confidence interval depend on the choice of c, the particular value of c should be clearly specified in all tables and calculations.
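As a hedged continuation of the birthwt sketch from the example section, odds ratios and Wald-type confidence intervals for the fitted coefficients, including one for a c-unit change in the continuous covariate LWT, could be computed as follows (the choice c = 10 is illustrative, not from the source):

# Odds ratios and 95% CIs from the reduced low birth weight model fitted earlier.
est <- coef(reduced)
se  <- sqrt(diag(vcov(reduced)))
cbind(OR    = exp(est),
      lower = exp(est - qnorm(0.975) * se),
      upper = exp(est + qnorm(0.975) * se))   # includes the factor (reference cell) terms

# Odds ratio for a change of c = 10 units in LWT: exp(c * beta), se = c * se(beta).
c10 <- 10
exp(c10 * est["lwt"])
exp(c10 * est["lwt"] + c(-1, 1) * qnorm(0.975) * c10 * se["lwt"])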
10 MULTIVARIATE CASE

Often logistic regression analysis is used to adjust statistically the estimated effects of each variable in the model for differences in the distributions of and associations among the other independent variables. Applying this concept to a multiple logistic regression model, we may surmise that each estimated coefficient provides an estimate of the log odds adjusting for all other variables included in the model.

The term confounder is used by epidemiologists to describe a covariate that is associated with both the outcome variable of interest and a primary independent variable or risk factor. When both associations are present the relationship between the risk factor and the outcome variable is said to be confounded. The procedure for adjusting for confounding is appropriate when there is no interaction. If the association between the covariate and an outcome variable is the same within each level of the risk factor, then there is no interaction between the covariate and the risk factor. When interaction is present, the association between the risk factor and the outcome variable differs, or depends in some way on the level of the covariate. That is, the covariate modifies the effect of the risk factor. Epidemiologists use the term effect modifier to describe a variable that interacts with a risk factor.

The simplest and most commonly used model for including interaction is one in which the logit is also linear in the confounder for the second group, but with a
different slope. Alternative models can be formulated which would allow for other than a linear relationship between the logit and the variables in the model within each group. In any model, interaction is incorporated by the inclusion of appropriate higher order terms.

An important step in the process of modeling a set of data is to determine whether or not there is evidence of interaction in the data. Tables 9 and 10 present the results of fitting a series of logistic regression models to two different sets of hypothetical data. The variables in each of the data sets are the same: SEX, AGE, and CHD. In addition to the estimated coefficients, the log likelihood for each model and minus twice the change (deviance) is given. Recall that minus twice the change in the log likelihood may be used to test for the significance of coefficients for variables added to the model. An interaction is added to the model by creating a variable that is equal to the product of the value of the sex and the value of age.

Examining the results in Table 9 we see that the estimated coefficient for the variable SEX changed from 1.535 in model 1 to 0.979 when AGE was added in model 2. Hence, there is clear evidence of a confounding effect owing to age. When the interaction term "SEX × AGE" is added in model 3 we see that the change in the deviance is only 0.52 which, when compared with the chi-square distribution with one degree of freedom, yields a P value of 0.47, which clearly is not significant. Note that the coefficient for sex changed from 0.979 to 0.481. This is not surprising since the inclusion of an interaction term, especially when it involves a continuous variable, will usually produce fairly marked changes in the estimated coefficients of dichotomous variables involved in the interaction. Thus, when an interaction term is present in the model we cannot assess confounding via the change in a coefficient. For these data we would prefer to use model 2, which suggests that age is a confounder but not an effect modifier.

The results in Table 10 show evidence of both confounding and interaction due to age. Comparing model 1 with model 2 we see that the coefficient for sex changes from 2.505 to 1.734. When the age by sex interaction is added to the model we see that the deviance
is 4.06, which yields a P value of 0.04. Since the deviance is significant, we prefer model 3 over model 2, and should regard age as both a confounder and an effect modifier. The net result is that any estimate of the odds ratio for sex should be made with respect to a specific age.

Hence, we see that determining if a covariate, X, is an effect modifier and/or a confounder involves several issues. Determining effect modification status involves the parametric structure of the logit, while determination of confounder status involves two things. First, the covariate must be associated with the outcome variable. This implies that the logit must have a nonzero slope in the covariate. Secondly, the covariate must be associated with the risk factor. In our example this might be characterized by having a difference in the mean age for males and females. However, the association may be more complex than a simple difference in means. The essence is that we have incomparability in our risk factor groups. This incomparability must be accounted for in the model if we are to obtain a correct, unconfounded estimate of effect for the risk factor.

In practice, the confounder status of a covariate is ascertained by comparing the estimated coefficient for the risk factor variable from models containing and not containing the covariate. Any "biologically important" change in the estimated coefficient for the risk factor would dictate that the covariate is a confounder and should be included in the model, regardless of the statistical significance of the estimated coefficient for the covariate. On the other hand, a covariate is an effect modifier only when the interaction term added to the model is both biologically meaningful and statistically significant. When a covariate is an effect modifier, its status as a confounder is of secondary importance since the estimate of the effect of the risk factor depends on the specific value of the covariate.

The concepts of adjustment, confounding, interaction, and effect modification may be extended to cover situations involving any number of variables on any measurement scale(s). The principles for identification and inclusion of confounder and interaction variables into the model are the same regardless
of the number of variables and their measurement scales.

Table 9. Estimated Logistic Regression Coefficients, Log Likelihood, and the Likelihood Ratio Test Statistic (G) for an Example Showing Evidence of Confounding But no Interaction

Model    Constant    SEX      AGE      SEX × AGE    Log likelihood    G
1        −1.046      1.535                           −61.86
2        −7.142      0.979    0.167                  −49.59            24.54
3        −6.103      0.481    0.139    0.059         −49.33            0.52

Table 10. Estimated Logistic Regression Coefficients, Log Likelihood, and the Likelihood Ratio Test Statistic (G) for an Example Showing Evidence of Confounding and Interaction

Model    Constant    SEX      AGE      SEX × AGE    Log likelihood    G
1        −0.847      2.505                           −52.52
2        −6.194      1.734    0.147                  −46.79            11.46
3        −3.105      0.047    0.629    0.206         −44.76            4.06

Much of this article has been abstracted from (4). Readers wanting more detail on any topic should consult this reference.

REFERENCES

1. Bishop, Y. M. M., Fienberg, S. E. & Holland, P. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press, Boston.
2. Cox, D. R. & Snell, E. J. (1989). The Analysis of Binary Data, 2nd Ed. Chapman & Hall, London.
3. Hauck, W. W. & Donner, A. (1977). Wald's test as applied to hypotheses in logit analysis. Journal of the American Statistical Association 72, 851–853.
4. Hosmer, D. & Lemeshow, S. (2000). Applied Logistic Regression, 2nd Ed. Wiley, New York.
5. Jennings, D. E. (1986). Judging inference adequacy in logistic regression. Journal of the American Statistical Association 81, 471–476.
6. McCullagh, P. & Nelder, J. A. (1983). Generalized Linear Models. Chapman & Hall, London.
LOG-RANK TEST

ALAN HOPKINS
Theravance Inc., South San Francisco, California

Calculation of the log-rank test is described, and an example is provided. Conditions under which the test is valid are discussed and testing of assumptions is addressed. Generalizations of the log-rank test are described, including its relationship to the Cox regression model. Sample size calculation for the log-rank test is discussed, and computing software is described.

Suppose T is a continuous, non-negative random variable representing survival times from point of randomization. The distribution of censored event times can be estimated by the survival function S(t). The survival function estimates the probability of surviving to time t, so S(t) = Pr(T > t). The hazard function is directly related to the survival function through the relationship λ(t) = −d log S(t)/dt. The hazard rate is the conditional probability of death in a small interval [t, t + dt) given survival to the beginning of the interval. With censored data, we cannot observe events on all subjects, so we only observe a censoring time. Thus, for each subject, we have data on observation time, an indicator of whether an event was observed at the last observation time, and treatment group. With this information, we can compare hazard functions in intervals containing observed death times and calculate a global test statistic for comparing the observed survival distributions among several groups. The log-rank test was originally proposed by Mantel (1), and it is equivalent under certain circumstances to the Cox (2) regression model.

0.0.1 Hypothesis. Suppose we have K groups and we denote the survival functions associated with each group as S1(t), ..., SK(t). The null hypothesis can be expressed as

H0: S1(t) = S2(t) = ... = SK(t),  for t ≥ 0.

The null hypothesis can also be stated using the hazard function; it is equivalent to comparing the hazard rates:

H0: λ1(t) = λ2(t) = ... = λK(t),  for t ≥ 0.  (1)

The alternative hypothesis usually of interest is that the survival function for one group is stochastically larger or smaller than the survival functions for the other groups:

Ha: Sk(t) ≥ Sk′(t) or Sk(t) ≤ Sk′(t) for some k ≠ k′, with strict inequality for some t.

0.0.2 Assumptions. The log-rank test is nonparametric in the sense that no underlying probability distribution is assumed for the survival function. The log-rank test assumes that the censoring process is unrelated to survival times or to the treatment groups themselves (independent censoring) and that the survival times are from the same distribution for subjects recruited early or late in the clinical trial (stationarity). Observations are assumed to be independent. Special methods are required if recurrent events are observed on a single individual (e.g., multiple infections).

0.0.3 Inference. Let t1 < t2 < ··· < tL represent the ordered distinct failure times in the combined groups of observations. At time ti, dij events are observed in the jth sample out of Rij individuals at risk just prior to ti. Here

\[ d_i = \sum_{j=1}^{K} d_{ij} \]

represents the total number of deaths at ti and

\[ R_i = \sum_{j=1}^{K} R_{ij} \]

the number of subjects at risk at ti. We can represent the data at time ti as shown in Table 1.

The test of hypothesis (1) is based on weighted deviations of the estimated hazard functions for each group from the overall estimated hazard rate among all data combined. If the null hypothesis is true, then an estimator of the expected hazard rate in the jth population under H0 is the pooled sample estimator of the hazard rate di/Ri. An estimate of the hazard rate for the jth sample
is dij/Rij. To compare survival distributions, we take a weighted average of deviations across all failure times. The test is based on statistics of the form

\[ v_j = \sum_{i=1}^{L} W(t_i)\left( d_{ij} - R_{ij}\,\frac{d_i}{R_i} \right), \qquad j = 1, \ldots, K, \]

where W(t) is a positive weight function. vj is the sum over all event times of the difference in observed and conditionally expected events for group j. This quantity has a product hypermultinomial distribution with covariance matrix

\[ V_{jg} = \sum_{i=1}^{L} W(t_i)^2\, d_i\, \frac{R_{ij}}{R_i}\left( \delta_{jg} - \frac{R_{ig}}{R_i} \right) \frac{R_i - d_i}{R_i - 1}, \qquad j, g = 1, \ldots, K, \]

where δjg = 1 when j = g and 0 otherwise. Let v = (v1, v2, ..., vK)^T. Then a test statistic for hypothesis (1) is the quadratic form

\[ X^2 = v^{T} V^{-} v, \]  (2)

where V^− is a generalized inverse. The components of v are linearly dependent and sum to zero, so the variance-covariance matrix has maximum rank K − 1. The overall test statistic can be constructed using any K − 1 components of v and the corresponding elements of the variance-covariance matrix. Therefore, if the last row and column of V are deleted to give V_{K−1} and v_{K−1} = (v1, v2, ..., v_{K−1})^T, then the overall log-rank test statistic is

\[ X^2 = v_{K-1}^{T}\, V_{K-1}^{-1}\, v_{K-1}, \]  (3)

where V_{K−1}^{−1} is an ordinary inverse. The distribution of the weighted log-rank statistic is chi-squared with K − 1 degrees of freedom. Using W(ti) = 1 gives the widely used log-rank test. Alternative weights will be discussed in a later section.

Table 1. Layout and Notation for the K-Group Log-Rank Test

Time ti      Group 1      Group 2      ...    Group K      Total
Deaths       di1          di2          ...    diK          di
Survivors    Ri1 − di1    Ri2 − di2    ...    RiK − diK    Ri − di
At risk      Ri1          Ri2          ...    RiK          Ri

Since the log-rank statistic as presented here sums across multiple failure times, the tables used are not independent, which precludes use of standard methods to derive the asymptotic distribution of the statistic. The asymptotic properties of the log-rank test were rigorously developed using counting process techniques. For details of this approach, see Fleming and Harrington (3) or Kalbfleisch and Prentice (4).

0.0.4 A Special Case (K = 2) and W(ti) = 1. Often a clinical trial consists of only two treatment groups. In this case, the computations are simplified. We may write the two-sample log-rank test as

\[ Z_{LR} = \frac{\displaystyle\sum_{i=1}^{L}\left( d_{i1} - R_{i1}\,\frac{d_i}{R_i} \right)}{\sqrt{\displaystyle\sum_{i=1}^{L} \frac{R_{i1}}{R_i}\left(1 - \frac{R_{i1}}{R_i}\right)\left(\frac{R_i - d_i}{R_i - 1}\right) d_i}}, \]

which has approximately a standard normal distribution under the null hypothesis for large samples.

0.0.5 Relationship of the Log-rank Statistic to the Cox Regression Model. The log-rank test is closely related to the Cox proportional hazards regression model. Let z^T = (z1, ..., zp) represent p covariates on a given subject. In the case of the log-rank test, z would be indicator variables for treatment groups. The proportional hazards regression model is λ(t|z) = λ0(t) exp(β^T z), where λ0(t) is the baseline hazard corresponding to z^T = (0, ..., 0) and β is a vector of regression coefficients. The likelihood for the Cox regression model is simply

\[ L(\beta) = \prod_{i=1}^{L} \frac{\exp(\beta^{T} z_i)}{\displaystyle\sum_{j \in D_i} \exp(\beta^{T} z_j)}, \]  (4)
where Di is the set of subjects at risk at time ti. The efficient score for Equation (4) is given by U(β) = ∂/∂β log L(β), and its covariance by the inverse of I(β) = −∂²/∂β² log L(β). Then the score statistic is U′(0) I⁻¹(0) U(0), which has a chi-squared distribution with p − 1 degrees of freedom. This statistic is equivalent to the log-rank test when there are no tied survival times.

0.0.6 Power. The log-rank test is most powerful when the survival curves are proportional. This occurs when one survival function is consistently greater than the other over the study period. The log-rank test is the most powerful nonparametric test to detect proportional hazards alternatives. If the hazard functions cross, then there may be very little power to detect differences between the survival curves.

One easy way to assess the proportionality assumption is to plot the Kaplan-Meier survival curves. If the survival curves cross, then the proportionality assumption is not met. Alternatively, a plot of the estimated survival curves on a log(−log) scale gives a constant vertical shift of the two curves, by an amount equal to the log of the hazard ratio, if the hazards are proportional. A more rigorous approach to checking the proportionality assumption is to use a statistical test based on a Cox regression model. Proportionality fails when there is an interaction between treatments and time. Introduction of a time-dependent interaction can be used to test formally for nonproportional hazards with the Cox regression model. Therneau and Grambsch (5) describe using residuals from Cox regressions to identify deviations from the proportional hazards assumption.

1 EXAMPLE: DISEASE-FREE SURVIVAL FOR ACUTE MYELOGENOUS LEUKEMIA AFTER BONE MARROW TRANSPLANTATION

Klein and Moeschberger (6) provide a dataset containing 101 patients who received bone marrow transplantation after chemotherapy for acute myelogenous leukemia. Transplants were either allogenic (from the patient's sibling) or autologous (from the patient's own marrow harvested prior to chemotherapy).
The event time was based on relapse or death, whichever occurred first. The R software (7) package KMsurv contains this dataset, called alloauto. Each patient in the dataset has a sequence number, a leukemia-free survival time (in months), an indicator for censoring (0 = yes, 1 = no), and an indicator for type of bone marrow transplant (1 = allogenic and 2 = autologous). There are 101 subjects in the dataset and 50 leukemia relapses. Of the 101 patients, 50 had allogenic transplants and 51 had autologous transplantation. An R script for this example is in Table 2. survfit calculates the Kaplan-Meier curve. The plot command gives the Kaplan-Meier curves shown in Fig. 1.

Table 2. R Script for Kaplan-Meier Plot and Log-Rank Test

library(survival)
data(alloauto, package = "KMsurv")
my.fit <- survfit(Surv(time, delta) ~ type, data = alloauto)
plot(my.fit, xlab = "Time (Months)",
     ylab = "Leukemia-Free Survival Probability", lty = c(6, 1))
legend(5, 0.35, c("allogenic", "autologous"), lty = c(6, 1))  # line types match plot()
survdiff(Surv(time, delta) ~ type, data = alloauto)

[Figure 1. Kaplan-Meier survival curves for autologous and allogenic bone marrow transplants. Y-axis: Leukemia-Free Survival Probability (0.0–1.0); x-axis: Time (Months), 0–60.]

The allogenic transplant survival is initially higher than the autologous transplants. This trend reverses itself as the survival functions cross at about 12 months, casting doubt on the proportional hazards assumption. Vertical dashes on the survival functions represent censored observations. Finally, the survdiff command calculates the log-rank test shown in Table 3. Although there is separation of the two survival curves late in the time axis, the log-rank test does not yield a P < 0.05. Differences in the survival functions summed over time decrease the magnitude of the log-rank statistic when the survival functions cross. The sum of the quantities (O−E)^2/E in Table 3 is a conservative approximation to the actual log-rank chi-squared statistic and is produced in the output for information purposes only.

Table 3. Results of the Log-Rank Test for Bone Marrow Transplant Patients

            N     Observed    Expected    (O−E)^2/E    (O−E)^2/V
type = 1    50    22          24.2        0.195        0.382
type = 2    51    28          25.8        0.182        0.382

Chisq = 0.4 on 1 degree of freedom, p = 0.537

1.0.7 The Stratified Log-Rank Test. Sometimes one may know that survival is not proportional among all subjects but is related to a nuisance factor. Heterogeneous populations can sometimes be stratified into homogeneous groups for analysis purposes, eliminating the nuisance source of variation. Stratified analysis can be applied to the log-rank test. This process is appropriate for situations where the proportional hazards assumption breaks down in known subgroups. For example, the hazard rate may be different for a disease outcome depending on disease burden at baseline. In that case, it may be possible to define strata within which the proportional hazards assumption is more viable. The elements of the log-rank test can be computed separately for each stratum and then combined for an overall test. Let the strata be indexed h = 1, ..., s and let v(h) and V(h) represent the stratum-specific components of the log-rank statistic. Then the stratified test statistic is expressed as (4):

\[ \chi^2_{K-1} = \left( \sum_{h=1}^{s} v^{(h)} \right)^{\!T} \left( \sum_{h=1}^{s} V^{(h)} \right)^{\!-1} \left( \sum_{h=1}^{s} v^{(h)} \right). \]

1.1 Choice of Weights

The log-rank test is a special case of more general methods for comparing survival distributions with different weights. Careful selection of weights can emphasize differences between certain regions of the survival curves. Several weighting schemes are summarized in Table 4. Weights equal to the number of subjects at risk Ri at each time ti were proposed by Gehan (8) for the two-group setting and by Breslow (9) for multiple groups. W(ti) = Ri weights early portions of the survival functions more heavily than later portions. Tarone and Ware (10) proposed using weights that are a function of Ri, such as √Ri. Peto and Peto (11) proposed a weighting scheme based on an estimate of the common survival function. Anderson et al. (12) recommended a slight modification to the Peto–Peto weights.
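The weighted and stratified comparisons discussed here can be reproduced in R for the alloauto data; the sketch below is not from the source. survdiff()'s rho argument covers only a one-parameter family of weights (rho = 0 is the ordinary log-rank test, rho = 1 a Peto–Peto-type test that emphasizes early differences); the two-parameter G(ρ,γ) tests listed in Table 4 require other software (the FHtest package is one possibility). The final lines show the Therneau–Grambsch check of proportional hazards mentioned earlier.

# Sketch: weighted log-rank tests and a proportional hazards check in R.
library(survival)
data(alloauto, package = "KMsurv")

survdiff(Surv(time, delta) ~ type, data = alloauto, rho = 0)  # ordinary log-rank
survdiff(Surv(time, delta) ~ type, data = alloauto, rho = 1)  # early-weighted test
# A stratified log-rank test would add + strata(stratum.variable) to the formula
# (stratum.variable is a hypothetical name; alloauto has no stratification factor).

fit <- coxph(Surv(time, delta) ~ type, data = alloauto)
cox.zph(fit)   # scaled Schoenfeld residual test; a small P suggests nonproportionality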
Table 4. Common Weights for Weighted Log-Rank Statistics

Weight                                               Test
1.0                                                  Log-rank (1)
Ri                                                   Gehan-Breslow-Wilcoxon (8, 9)
√Ri                                                  Tarone-Ware (10)
Ŝ(t) = ∏_{ti ≤ t} [1 − di/(Ri + 1)]                  Peto-Peto-Wilcoxon (11)
Ŝ(t) Ri/(Ri + 1)                                     Modified Peto (12)
Ŝ(t)^ρ [1 − Ŝ(t)]^γ,  ρ ≥ 0, γ ≥ 0                   Gρ, Gρ,γ class (13, 3)
Fleming and Harrington (13) proposed a general class of tests Gρ,γ with weight function

\[ W_{\rho,\gamma}(t_i) = \hat S(t_{i-1})^{\rho}\,[1 - \hat S(t_{i-1})]^{\gamma}, \qquad \rho \ge 0,\ \gamma \ge 0, \]

where Ŝ(t) is the product-limit estimator based on the combined sample. Choice of ρ and γ can provide flexibility in selecting a region of the survival curves for weighting differences among the curves. Of course, Wρ,γ(ti) = 1 when ρ = γ = 0 and we have the ordinary log-rank test. For ρ = 1 and γ = 0, more weight is given to early differences between the survival functions. For ρ = 0 and γ > 0, more weight is given to departures between survival functions observed later during the observation period.

Table 5. Log-Rank Statistics for Bone Marrow Transplant Data Calculated Using SAS PROC LIFETEST

Line No.    Test             Chi-Square    DF    P-Value
1           Log-rank         0.3816        1     0.5368
2           Wilcoxon         0.0969        1     0.7556
3           Tarone           0.0039        1     0.9501
4           Peto             0.0000        1     0.9956
5           Modified Peto    0.0007        1     0.9791
6           G(1,0)           0.0008        1     0.9771
7           G(0,1)           4.2026        1     0.0404
8           G(0,2)           5.9276        1     0.0149
9           G(1,1)           2.9600        1     0.0853

Table 5 shows results for the bone marrow transplant data for various weighting schemes. Statistics were calculated using SAS procedure LIFETEST (14). The log-rank test lacks power with its equal weighting in this nonproportional hazards dataset. Tests in line numbers 2–6 are even less sensitive, since they weight the early portion of the curves where there is the least difference. The G-class statistics that weight the later region of the survival curves (lines 7–9) give the
smallest P-values for this dataset. The SAS code for calculating these statistics is given in Table 6. The code test = (all) calculates the statistics in lines 1–6 of Table 5.

Table 6. SAS v9.1 Code to Calculate the P-Values for Weighted Log-Rank Statistics

proc lifetest notable data = alloauto;
   time time*delta(0);
   strata / group = type test = (all);
run;
proc lifetest notable data = alloauto;
   time time*delta(0);
   strata / group = type test = ( fleming(0, 1) );
proc lifetest notable data = alloauto;
   time time*delta(0);
   strata / group = type test = ( fleming(1, 1) );
proc lifetest notable data = alloauto;
   time time*delta(0);
   strata / group = type test = ( fleming(0, 2) );
run;

1.2 Sample Size and Power

The power of the log-rank test depends on the number of events during the course of a clinical study and not on the total number of subjects enrolled. Typically one estimates the number of events required and then makes assumptions about the accrual rate, length of the intake period, and length of follow-up to estimate the number of subjects required to observe a specific number of events for a given power requirement.

In the two-sample log-rank proportional hazards setting, assume the alternative hypothesis HA: S1(t) = S(t)^θ. For the two-sided log-rank test with level α = 0.05 and power 1 − β, Schoenfeld (15) showed that the total sample size required is

\[ n = \frac{(z_{\alpha/2} + z_{\beta})^2}{(\ln\theta)^2\, p\,(1 - p)\, P_d}, \]

where p is the proportion of subjects in group 1 and Pd represents the cumulative proportion of events in both groups by the end of the study. Generally the calculation of Pd will require assumptions about the survival distributions in each group. Pd could be estimated by assuming an exponential survival function or through simulation based on θ.

In the real world of study design, the number of events observed depends on the rate of accrual, the length of the intake period, the dropout rate, and the length of follow-up. Lakatos (16) developed a general method for calculating sample size for the log-rank test that relaxes the proportionality assumption and allows arbitrary specification of the survival functions. Lakatos and Lan (17) reviewed the literature on different methods for estimating sample sizes and compared them with computer simulations. See also "Sample Size Calculation for Comparing Time-to-Event Data" in this encyclopedia. Use of a computer is recommended for complicated designs.

The SAS procedure POWER (14) can calculate power and sample size for the log-rank test comparing two survival curves using the Lakatos method. The survival functions can be proportional hazards models, piecewise linear curves with proportional hazards, or arbitrary piecewise linear curves. The software allows specification of uniform accrual periods and a follow-up time. The EastSurv (18) program allows for calculation of sample size and power along with interim analysis planning. EastSurv permits
use of time-varying accrual patterns and modeling of dropout rates and piecewise exponential hazards.

REFERENCES

1. N. Mantel, Evaluation of survival data and two new rank order statistics arising in its construction. Cancer Chemotherapy Rep. 1966; 50: 163–170.
2. D. R. Cox, Regression models and life tables. J. Roy. Stat. Soc. 1972; B34: 187–220.
3. T. R. Fleming and D. P. Harrington, Counting Processes and Survival Analysis. New York: Wiley, 1991.
4. J. D. Kalbfleisch and R. L. Prentice, Statistical Analysis of Failure Time Data, 2nd ed. New York: Wiley, 2002.
5. T. M. Therneau and P. M. Grambsch, Modeling Survival Data: Extending the Cox Model. New York: Springer, 2000.
6. J. P. Klein and M. L. Moeschberger, Survival Analysis: Techniques for Censored and Truncated Data, 2nd ed. New York: Springer, 2003.
7. R Development Core Team, R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, 2007. Available: http://www.R-project.org.
8. E. A. Gehan, A generalized Wilcoxon test for comparing arbitrarily singly censored samples. Biometrika 1965; 52: 203–223.
9. N. Breslow, A generalized Kruskal-Wallis test for comparing K samples subject to unequal patterns of censorship. Biometrika 1970; 57: 579–594.
10. R. E. Tarone and J. Ware, On distribution-free tests for equality of survival distributions. Biometrika 1977; 64: 156–160.
11. R. Peto and J. Peto, Asymptotically efficient rank invariant test procedures (with discussion). J. Roy. Stat. Soc. A 1972; 135: 186–206.
12. P. K. Anderson, O. Borgan, R. D. Gill, and N. Keiding, Linear nonparametric tests for comparison of counting processes with application to censored survival data (with discussion). Int. Stat. Rev. 1982; 50: 219–258.
13. D. P. Harrington and T. R. Fleming, A class of rank test procedures for censored survival data. Biometrika 1982; 69: 553–566.
14. SAS Institute Inc., SAS/STAT 9.1 User's Guide. Cary, NC: SAS Institute Inc.
15. D. Schoenfeld, The asymptotic properties of nonparametric tests for comparing survival distributions. Biometrika 1981; 68: 316–319.
16. E. Lakatos, Sample sizes based on the logrank statistic in complex clinical trials. Biometrics 1988; 44: 229–241.
17. E. Lakatos and K. K. G. Lan, A comparison of sample size methods for the logrank statistic. Stat. Med. 1992; 11: 179–191.
18. Cytel Software, Inc., EastSurv User Manual. Cambridge, MA: Cytel, Inc., 2005.
CROSS-REFERENCES

Censoring
Kaplan-Meier Plots
Cox proportional hazard model
Hazard rate
Hazard ratio
Sample Size Calculation for Comparing Time-to-Event Data
Stratified Analysis
Survival Analysis
LONGITUDINAL DATA

HUNG-IR LI
Allergan Inc., Irvine, California

1 DEFINITION

Longitudinal data are results from longitudinal studies and essentially consist of outcome measures or responses collected repeatedly over time for each subject included in the study (see also Repeated Measures). This is in contrast to conventional cross-sectional studies, in which subsets of subjects are selected from a population and evaluated all at one point of time. Therefore, longitudinal data link to multiple sets of cross-sectional data over a series of time points.

2 LONGITUDINAL DATA FROM CLINICAL TRIALS

Longitudinal data could be collected from prospective or retrospective observational studies in epidemiologic or public health research to characterize disease progression or risk factors for certain epidemics. Penninx et al. (1) reported a community-based study that evaluated 2847 subjects with or without major depression and followed them for 4 years for the diagnosis of cardiac disease and cardiac mortality, to assess the association between cardiac mortality and major depression. Most of the longitudinal data in the clinical research and development settings are collected through prospectively designed longitudinal studies under specific regulatory requirements. For example, a randomized, controlled trial was conducted in 1996-1997 to evaluate the effectiveness of gabapentin for treatment of postherpetic neuralgia (PHN), as reported by Rowbotham et al. (2). A total of 229 patients found to have PHN were assigned at random (see also Randomization) to either gabapentin or placebo in an 8-week clinical trial. All patients were evaluated for pain intensity and other measures before any treatment and then at weeks 2, 4, and 8 during the 8-week treatment period.

2.1 Design

The study design sets the scope and features of the resulting longitudinal data. Prospectively designed longitudinal clinical trials describe in protocols the objectives and hypotheses of the trials, criteria for subject selection, the assignment of treatments to subjects (see also Randomization), and what, when, and how the outcome measures are collected. Most importantly, how the collected data, including background information such as medical history and prior medication as well as the outcome measures or responses at the pretreatment and posttreatment visits, will be analyzed needs to be specified.

Longitudinal data from a parallel design clinical trial such as the gabapentin study on PHN cited above will consist of 3 average daily pain scores for each patient in the treatment groups, as specified in the study protocol. If a crossover design is used, each subject will receive different treatments in different periods with outcome measures repeated across time within each period (see also Crossover Design). The data will consist of a set of records for each treatment received by each subject. Since the total number of data points increases and each subject could serve as his own control in assessing the changes due to treatments, the crossover design usually requires a smaller set of subjects to collect an adequate amount of data for making valid scientific inferences. However, the multiple periods and washout time necessary in between periods may lengthen the total duration of the study.

2.2 Time—Occasions of Measurements

The desired study duration and the frequency of the repeated measures/assessments define the time metrics and/or visits. Days, weeks, months, and years are choices in clinical trials. These depend on the objectives of the longitudinal trials (e.g., quick onset or maintenance of the treatment effect), the nature of
the target disease or medical condition (e.g., course of changes), and the regulatory agency requirements of clinically meaningful endpoints (e.g., reduction of weekly average pain score after 12 weeks of treatment on painful diabetic neuropathy). The overall cost and needed resources could well play a key role in the decision of time metrics, but the nature of the target disease and/or medical condition and the regulatory requirement of clinical meaningfulness are pivotal. For example, a phase III study on panic disorder may be required to have weekly evaluations for 8 weeks instead of 4 weeks because the disorder itself has a cyclic fluctuation of panic attacks within a month. For pain-relief studies, the assessment may need to take place daily or weekly for the first period and then be followed by monthly assessments to evaluate quick onset of symptom improvement as well as the maintenance of the treatment effectiveness. Selecting a sensible time metric so that the cadence is most useful for the outcome measures or responses is very important for longitudinal studies.

2.3 Type of Responses/Measures/Outcomes

Similar to the time metrics, the selection of outcome measures or responses is dictated by the questions to be answered by the longitudinal clinical trials (the objectives), the nature of the disease, and the regulatory requirement. These could be readings from a validated objective test, such as blood pressure or the QT interval from EKG monitoring, patient-reported scales, or diaries such as a daily pain score on an 11-point Likert scale for pain associated with post-herpetic neuralgia. In general, they can be categorized into the following types, which are pivotal to the selection of the statistical methods for analyses. The measures used in phase III longitudinal clinical trials are usually gold standards for the diagnosis of the disease and trajectories of clinical benefit (efficacy) or risk (safety) acceptable to regulatory agencies.

Continuous: Systolic and diastolic blood pressures in a cardiopulmonary or hypertension interventional study are good examples of continuous measures. This is the most popular type of measure that statistical models were built around for years. The average pain
score in the gabapentin trial on PHN mentioned above is another example. Count: The occurrence of episodes within a certain period of time, such as the number of panic attacks per day in a panic disorder study, falls into this type. Other examples include exacerbation counts in a multiple sclerosis study. Categorical: Mortality is a typical dichotomous categorical response in oncology studies. The tracking of certain medical conditions or adverse events is dichotomous as well for risk assessment of medical treatments. The investigator-rated clinical global impression of change (worsened, no change, minimally improved, moderately improved, or much improved) in the gabapentin trial mentioned earlier is an ordinal categorical measure. Nominal categorical outcome measures or responses have multiple categories, but the categories are not ordered. Rating of a status statement with choices of true, false, or other is considered a nominal outcome measure but is less applicable for the evaluation of efficacy or safety in clinical trials. 3 ADVANTAGES People may wonder why clinical research should invest higher costs and more resources to do longitudinal investigations. What are the merits of choosing these designs over conventional cross-sectional studies? The need to assess changes over time for subjects with or without medical treatments is frequently the primary reason. In addition, longitudinal studies can separate individual changes over time from the effect of baseline differences among subjects, sometimes called cohort effects in epidemiology studies, as cited in Pahwa and Blaire (3). Longitudinal studies also result in more observations/data than cross-sectional studies with the same number of subjects. Zeger and Liang (4) used one simple example of a longitudinal study to demonstrate the increase in power and efficiency of the analysis compared with a cross-sectional study. The gain is greater when the correlation is properly accounted for, because each subject serves as his or her own control and the intersubject variability is reduced. Longitudinal data also
allow statistical analyses that are robust to problems of model misspecification, such as the selection bias common in observational studies, according to Zeger and Liang (4). Jadhav and Gobburu (5) also used data from pivotal trials (see Pivotal Trials and NDA) for treatment of Parkinson's disease to explore the efficiency of longitudinal data analysis in comparison with a cross-sectional analysis. The longitudinal data analysis was concluded to be more powerful and less sensitive to different types of unbalanced data, whereas the cross-sectional analysis required either excluding the incomplete data (losing information) or imputing the missing data to make the data balanced. When considerable missing data occurred due to dropouts, the chance of falsely claiming effectiveness (see Type I Error) using the cross-sectional analysis with last observation carried forward (see LOCF) was inflated, as concluded in the case comparison.
4 CHALLENGES
Two major issues require more sophisticated statistical models and/or methods than some common approaches in analyzing longitudinal data. Correlated Data: The repeated measures collected from the same subject are correlated. In other words, the data are no longer independent of each other, as required by the traditional linear regression model in assessing the association between the outcome measures or responses and the explanatory factors, including the controlled medical treatment (see Regression). Ignoring the correlation will result in inefficient parameter estimates and in inconsistent estimates of precision, leading to invalid scientific inference. Missing Data: No matter how well the clinical trials are designed, conducted, and monitored, it is likely that one or more of the measures for some subjects are not repeated because of investigation site administrative error, subject noncompliance (missed visit, skipped questions, or invalid answers to questions), subjects lost to follow-up, or subjects discontinuing prematurely (see also Dropout). The longitudinal data collected then have an unequal number of observations across individuals and thus become unbalanced, as pointed out by Liang and Zeger (6). Little and Rubin (7) have characterized missing data according to the reasons for missingness into the following three mechanisms, with mathematical and probability notations, and have provided suggestions for statistical solutions. Diggle et al. (8) and Hedeker and Gibbons (9) have further elaborated these mechanisms with examples and simulations, along with rich lists of references in various applications.
• Missing Completely at Random (MCAR): The occurrence of the specific missing data point is simply by chance. It could equally likely happen to any subject at any time and is independent of the data observed before and of the data observed after. Missing data caused by administrative errors or dropouts due to accidental death are good examples of MCAR.
• Missing at Random (MAR): The occurrence of the specific missing data point is independent of the data that would have been observed. A good example is missing values that result from subjects being discontinued from the study when a laboratory test value is below or beyond a certain level as specified in the protocol.
• Missing Not at Random (MNAR): The occurrence of the specific missing data point is dependent on the data that would have been observed. Missing values caused by dropouts due to lack of efficacy or adverse events fall into this type. This type of missing mechanism is considered informative and nonignorable, in contrast to the ignorable MCAR and MAR mechanisms.
5 ANALYSIS OF LONGITUDINAL DATA
The statistical methods for the analysis of longitudinal data need to be chosen based on the design of the longitudinal studies, the type of the outcome measures, the volume of the data, and the issues of the data that need to be addressed, as explained above, so that the results of the analyses capitalize on the
richness of the longitudinal data, answer all intended questions, and apply to a more general population. Without appropriate methods, the inference from the longitudinal data could be invalid and misleading, and the data wasted. Zeger and Liang (4) provided an overview of methods for the analysis of longitudinal data in 1992. The most commonly used method is the regression approach, in which the outcome measures or responses are related to explanatory factors such as exposure to risk factors or medical treatment (see Regression). The full scope of the statistical methods established since then for longitudinal data analysis, suitable for different types of responses/measures and with their respective flexibility and limitations in addressing data correlation and missing data, can be found in Fitzmaurice et al. (10), Diggle et al. (8), and Hedeker and Gibbons (9). As mentioned in Hedeker and Gibbons (9), the mixed-effect regression model (MRM) (see Mixed-effect Model) has become very popular for continuous longitudinal data. This approach models the within-subject data correlation jointly with the regression of the outcome measures or responses on the explanatory factors by assuming that the regression coefficients differ from subject to subject and are random values from a specified underlying distribution. Other explanatory factors included in the model are considered the same across subjects and are thus fixed effects, so the model becomes a mixed-effect model. The data correlation within each subject is accounted for by the subject-specific regression coefficients. These models also allow for unbalanced data due to the presence of missing data or unequally spaced measurements across time. Similarly, the mixed-effect logistic regression model is a popular choice for the analysis of dichotomous data, whereas the mixed-effect proportional odds model is quite common for ordinal longitudinal data. Furthermore, mixed-effect regression models have been developed for nominal data and count data. It is important to note that for confirmatory longitudinal phase III studies (see Phase III Trials or Pivotal Trials), the selected statistical models/methods, similar to the selections of the measures/responses/
outcomes and time metrics, need to be discussed with the regulatory agency and prespecified in the study protocol in detail. It is also likely that simple approaches, such as cross-sectional analyses for each time point with LOCF in the presence of missing data, or analysis of covariance of the change from baseline to the posttreatment time points, may be required along with the model-based analyses.
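As a minimal illustration of these two kinds of analysis (not part of the original article), the sketches below use Python; the data file and the column names subject, week, treatment, and pain are hypothetical assumptions, and a real confirmatory analysis would follow a prespecified protocol rather than this outline. The first sketch fits a mixed-effect regression model with a random intercept and slope per subject using the statsmodels formula interface:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format longitudinal data: one row per subject per visit,
# with columns 'subject', 'week', 'treatment' (0/1), and 'pain' (the outcome).
df = pd.read_csv("longitudinal_pain.csv")

# Mixed-effect regression model: fixed effects for treatment, time, and their
# interaction; a random intercept and a random slope for time within each
# subject account for the correlation among a subject's repeated measures.
mrm = smf.mixedlm("pain ~ treatment * week", data=df,
                  groups=df["subject"], re_formula="~week")
print(mrm.fit(method="lbfgs").summary())
```

The second sketch shows the kind of simple LOCF-based cross-sectional comparison mentioned above, included only because such analyses are sometimes requested, not as a recommended approach:

```python
# Last observation carried forward (LOCF): within each subject, sort by visit
# and carry the last non-missing value forward into later (missing) visits.
df = df.sort_values(["subject", "week"])
df["pain_locf"] = df.groupby("subject")["pain"].ffill()

# Cross-sectional comparison of treatment groups at the final scheduled visit.
final = df[df["week"] == df["week"].max()]
print(final.groupby("treatment")["pain_locf"].mean())
```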
REFERENCES
1. B. W. J. H. Penninx, A. T. F. Beekman, A. Honig, D. J. H. Deeg, R. A. Schoevers, J. T. M. van Eijk, and W. van Tilburg, Depression and cardiac mortality: Results from a community-based longitudinal study. Arch Gen Psychiatry 2001; 58: 221–227.
2. M. Rowbotham, et al., Gabapentin for the treatment of postherpetic neuralgia: A randomized controlled trial. JAMA 1998; 280: 1837–1842.
3. P. Pahwa and T. Blaire, Statistical models for the analysis of longitudinal data. CACRC Newslett. April 2002.
4. S. L. Zeger and K. Y. Liang, An overview of methods for the analysis of longitudinal data. Stat. Med. 1992; 11: 1825–1839.
5. P. R. Jadhav and J. V. S. Gobburu, Model-based longitudinal data analysis can lead to more efficient drug development: A case study. 2005. FDA Science, The Critical Path From Concept to Consumer, Innovative Approaches to Design and Evaluation of Clinical Trials, 11th Annual FDA Science Forum.
6. K. Y. Liang and S. L. Zeger, Longitudinal data analysis using generalized linear models. Biometrika 1986; 73: 13–22.
7. R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data. 2nd ed. New York: Wiley, 2002.
8. P. J. Diggle, P. Heagerty, K. Y. Liang, and S. L. Zeger, Analysis of Longitudinal Data. 2nd ed. New York: Oxford University Press, 2002.
9. D. Hedeker and R. D. Gibbons, Longitudinal Data Analysis. Hoboken, NJ: Wiley, 2006.
10. G. M. Fitzmaurice, N. M. Laird, and J. H. Ware, Applied Longitudinal Analysis. Hoboken, NJ: Wiley, 2006.
FURTHER READING
T. W. Taris, A Primer in Longitudinal Data Analysis. London, UK: Sage, 2000.
J. H. Ware, Linear models for the analysis of longitudinal studies. Am. Stat. 1985; 39(2): 95–101.
CROSS-REFERENCES Adverse Event Clinical Trial/Study Crossover Design Dropout Mixed-effect Model Last Observation Carried Forward (LOCF) New Drug Application (NDA) Phase III Trials Pivotal Trials Quality of Life Randomization Regression Repeated Measures Safety
MANUAL OF POLICIES AND PROCEDURES (MAPPS) The Manual of Policies and Procedures (MaPPs) comprises the approved instructions for internal practices and procedures followed by U.S. Food and Drug Administration's Center for Drug Evaluation and Research (CDER) staff to help standardize the new drug review process and other activities. The MaPPs also defines external activities, and it is available for the public to review, providing a better understanding of office policies, definitions, staff responsibilities, and procedures.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/regulatory/applications/ ind page 1.htm#MaPPs) by Ralph D’Agostino and Sarah Karl.
MASKING
RAFE MICHAEL JOSEPH DONAHUE Vanderbilt University Medical Center, Nashville, Tennessee
Blinding, sometimes also called masking, is generally taken to mean the concealment of information about which of several treatments a subject in a clinical trial is receiving. As opposed to allocation concealment, a term used to describe the prevention of foreknowledge of the treatments that will be assigned to future subjects, blinding refers to efforts that take place after treatment assignment has been made (1). The International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) guidelines note that blinding is an important means of reducing or minimizing the risk of biased study outcomes (2). Blinding in clinical trials is not a recent methodological advance, as researchers have used blinding since the late eighteenth century. Schulz and colleagues (3) report that Lavoisier and Franklin used blinding to examine claims made for mesmerism (the healing power of magnetism); Franklin is said to have used actual blindfolds on trial participants to conceal treatment information. In the early twentieth century, German researchers looked to blinding as a method of reducing bias, whereas in the United States and Britain, blinding was used as a way to reduce attrition. By the end of the 1930s, British and U.S. researchers began to acknowledge the value of blinding in avoiding bias (3). Kaptchuk (4) provides an historical account of blinding. Although blinding should help to reduce bias in clinical trials (5), its average effect seems weaker than the effect of allocation concealment (6).
1 DISTINGUISHING BETWEEN METHODS AND RESULTS
As an effort to reduce biases that can occur in clinical trials, a difference must be noted between the efforts or processes that are undertaken to create and maintain blinding and the actual results or outcomes of such efforts. Although certain practices (sealed envelopes, tamper-evident packaging, etc.) are used in efforts aimed at both (preassignment) allocation concealment and (postassignment) blinding, such practices can confirm only that the process of allocation concealment or blinding has been ensured; they cannot assure us of the result or outcome of such efforts (7). Reporting the results of clinical trials, then, necessitates that both the blinding methods that were in place be described in detail and the results thereof also examined (7,8).
2 GROUPS THAT MAY BE BLINDED AND BLINDING CONSEQUENCES FOR EACH
Probably no fewer than four distinct groups of individuals can or should be blinded in a clinical trial: subjects, investigators (people directly involved in the day-to-day details of the trial), outcome assessors (people measuring outcomes, either subjective or objective, on the subjects and possibly also involved in day-to-day details of the trial), and data handlers (data managers, monitors, analysts, statisticians, and the like), although some sources actively consider only the first three groups. Still other sources consider primarily only subjects and clinicians as the active participants in the trial for whom blinding is an issue. Regardless of the classifications and names assigned to these groups, persons involved with the trial data have potential biases that can be managed (or attempted to be managed) via blinding. Subjects need to remain blinded for several reasons. Knowing that they have received a heralded new treatment, a standard treatment, a treatment with an unknown benefit, or a placebo can impact a subject's psychological state and hence the evaluation of the treatment. Knowledge of the treatment can affect the psychological or physical responses by the subjects (3). Cooperation with the demands of the trial can also be impacted. Retention and compliance, both with visit schedules and dosing, can be influenced by knowledge of the treatment.
Trial investigators, loosely defined as the individuals involved with carrying on the day-to-day details of the trial, can be influenced by knowledge of the treatments to which the subjects have been assigned. Attending physicians and nurses and anyone interacting with the participants throughout the trial are known to have their perceptions directly transferred to the subjects (9). Furthermore, their knowledge of the treatments may influence ancillary treatment for specific subjects, and/or promote or discourage continuation in the trial on the basis of the knowledge of the treatment assignment (3). A temptation exists for a caregiver to adjust care and attention in those subjects that he or she perceives to be at a disadvantage because of an inferior treatment (10). Outcome assessors, certainly those making subjective assessments, are at risk of making biased assessments if they are aware of the treatment the subject received. Interview-based scores, for example, a depression rating scale, would certainly be suspect if the interviewer and/or the assessor were aware of the subject's treatment assignment. Obviously, the more objective the endpoint, the less observer bias can influence the results. At the extreme, an outcome such as death carries little room for bias; however, even death can be biased in a clinical trial if the death must be attributed to a specific cause to qualify as an outcome. Data handlers, if aware of treatment assignments, can also impart bias into the data. Which cases to review in an audit, which values to consider aberrant for purposes of clarification, and even which analyses to carry out on the data collection as a whole can be influenced by knowledge of the treatment assignment. Unblinded data review groups, such as data safety monitoring boards, can instill bias into a trial if they share knowledge they hold. Any positive or negative signal could impact future enrollment of certain types of subjects, thus biasing the later stages of a trial.
3 LEVELS OF BLINDING
Although the terms ''open-label,'' ''single-blind,'' and ''double-blind'' are used fre-
quently, they are often used without providing a precise definition; furthermore, the literature does not provide a consensus on their explicit definitions (3). Open-label trials are those trials in which all above-listed parties share complete knowledge of the treatment assignments. As such, these trials may be open to all the biases listed above. Single-blind trials are trials in which one group listed above (subjects, investigators, outcome assessors, or data handlers) is kept without knowledge of the treatment assignments of the patients. Typically, this group will be the subjects, meaning that the investigators, outcome assessors, and data handlers all will have knowledge of the treatment assignments, but the subjects will not. Senn (11) presents arguments that single-blind placebo run-ins are unethical and scientifically and statistically unnecessary and adopts the motto ''no randomization, no placebo,'' which implies that placebos ought to be used only when randomization has dictated their assignment. The term double-blind denotes a situation where two of the groups are blinded; however, it can also indicate a scenario where all the individuals involved in the trial are blinded. As such, some have held that this term can be misleading (3). However the term is used, a description of exactly who was kept ignorant of treatment assignment (and how this was attempted and how successful those attempts were) should be noted in the report of the trial. The ICH definition of single-blinding refers to subjects being unaware; ICH double-blinding refers to ''subjects, investigators, monitors, and, in some cases, data analysts'' as well (2). Triple-blinding as a term has been used to mean a trial where subjects, investigators, and outcome assessors are blinded, along with data analysts. The term ''quadruple-blind'' has been used rarely to mean all four groups (subjects, investigators, outcome assessors, and data handlers) are kept ignorant (3). The Consolidated Standards of Reporting Clinical Trials (CONSORT) guidelines encourage the reporting of the blinding in the trial to specify ''whether or not participants,
those administering the interventions, and those assessing the outcomes were aware of group assignment [and] if not, how the success of masking was assessed'' (8). As such, ambiguous descriptors, such as single-blind, double-blind, and triple-blind, are discouraged, and details concerning exactly who was kept ignorant, how they were kept that way, and how that was determined, are encouraged. 4 TECHNIQUES USED IN ATTEMPTS TO BLIND Double-blinding, at least under the definition that double-blinding includes subjects, investigators, and outcome assessors, is a goal of any trial, although such a goal might not always be attainable. In a comparator trial, treatments that match (with regard to method of treatment, and the size, shape, color, etc. of medications) make for the most effective attempt at blinding. In the case of two oral tablets, say one red and one blue, a double-dummy technique can be used: Matching placebo tablets, one red and one blue, each completely matching its active counterpart, can be constructed. A subject's treatment is then a pair of tablets, either red active and blue placebo or red placebo and blue active. A similar technique can be used with many medications and treatments: Placebos to match the treatment (be they tablets, capsules, transdermal patches, injections, or even bandages or incisions) can be designed. Each subject then gets one active and one placebo of each type. In this scheme, of course, some players in the trial may necessarily be unblinded; for example, in a trial that compares a surgical treatment to an oral medication, the surgeon performing the procedure will certainly know if the surgery was real or not. In such a trial then, a blinded assessor is a must. Blinded readers in this sense and others (radiology or chart reviews, examinations of before-and-after photographs, etc.) are a practical way of maintaining blinding, at least at the level of evaluation. If an intervention cannot be blinded, then efforts must be made at least to mask the identity of the active treatment. As an
example, a trial that gives out health education pamphlets to the treated group should give out at least something to the control group, lest they come to know that they are actually the control group. This strategy can be used even when it is not possible to make the two groups look identical and would address at least some biases. Senn (11) discusses trials in which complete blinding is not possible, as in a case where effects of the treatment are readily discernable. Senn refers to such a situation as veiling. In such a case, a subject might be able to determine that some treatments could be ruled out, but others might still be in play. Analysis of such designs requires special considerations discussed in that article. 5 MEASURING EFFECTIVENESS OF BLINDING Although valiant attempts can be made to keep all necessary parties in a clinical trial ignorant of treatment assignments, just attempting to maintain the blind does not guarantee the desired result. Measuring, or at least attempting to measure, the result of the efforts to maintain the blind needs to be performed. Investigators should assess the success of the blinding by directly asking the individuals who were to be blinded which intervention they think was administered. If the blinding was successful, then this determination should be no better than is achievable just by chance while guessing (3). Of course, side effects can provide hints about treatment assignment. Furthermore, not all individuals may feel free to express what they experienced or did to gain the knowledge they held; they might not want to expose some deliberate unblinding behavior. As such, measuring the success of the blinding effort will be met with some difficulties in interpretation (3). It is better, however, to report attempts at such interpretations and measurements, with all their foibles and frailties, than not to attempt them at all. At the least, investigators need to report any failure of the blinding procedure, such as use of nonidentical placebo or active preparations (3).
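As a purely hypothetical illustration of that idea (the counts below are invented, and more formal approaches such as the blinding index of Bang et al., listed under Further Reading, are usually preferred), one simple check is a chi-square test of independence between the actual assignment and the participants' guesses:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical end-of-trial blinding questionnaire: rows are the actual
# assignment, columns are the guesses ("active", "placebo", "don't know").
counts = np.array([[60, 45, 45],    # subjects actually on active treatment
                   [40, 62, 48]])   # subjects actually on placebo

# If blinding was preserved, guesses should be unrelated to the actual
# assignment, so the test should not reject independence.
chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.3f}")
```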
When randomization is restricted, as is the case when treatments are randomized within small blocks of, say, six or eight patients at each site, premature unblinding can result in blown allocation concealment. Berger and Exner (12) discuss ways to test for selection bias in such a scenario, and no evaluation of the success of a masking scheme is complete without such an analysis.
6 UNBLINDING
Before a trial begins, procedures must be in place to dictate how unblinding will take place for both unanticipated situations (such as issues with patient safety) and for situations surrounding the planned analyses of the final study data. Typically, unblinding is carried out when a patient’s safety is at risk and knowledge of the treatment assignment is necessary for caring for the patient. If, however, simply ceasing study treatment is an option to care for the patient, then it is probably not necessary to unblind the treatment assignment. Of course, audit trails and appropriate documentation of all planned and unplanned unblindings should be created at the time unblinding occurs (1).
REFERENCES
1. P. M. Forder, V. J. Gebski, and A. C. Keech, Allocation concealment and blinding: when ignorance is bliss. Med. J. Aust. 2005; 182: 87–89.
2. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH), ICH Guideline E8: General considerations for clinical trials. ICH web site. Available: www.ich.org. Accessed June 22, 2006.
3. K. F. Schulz, I. Chalmers, and D. G. Altman, The landscape and lexicon of blinding in randomized trials. Ann. Intern. Med. 2002; 136: 254–259.
4. T. J. Kaptchuk, Intentional ignorance: a history of blind assessment and placebo controls in medicine. Bull. Hist. Med. 1998; 72: 389–433.
5. P. Jüni, D. G. Altman, and M. Egger, Systematic reviews in health care: assessing the quality of controlled clinical trials. BMJ. 2001; 323: 42–46.
6. K. F. Schulz, I. Chalmers, R. J. Hayes, and D. G. Altman, Empirical evidence of bias. Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA. 1995; 273: 408–412.
7. V. W. Berger and C. A. Christophi, Randomization technique, allocation concealment, masking, and susceptibility of trials to selection bias. J. Mod. Appl. Stat. Methods. 2003; 2: 80–86.
8. D. Moher, K. F. Schulz, and D. G. Altman, The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. Ann. Intern. Med. 2001; 134: 657–662.
9. S. Wolf, Effects of suggestion and conditioning on action of chemical agents in human subjects—pharmacology of placebos. J. Clin. Invest. 1950; 29: 100–109.
10. A. K. Akobeng, Understanding randomised controlled trials. Arch. Dis. Child. 2005; 90: 840–844.
11. S. J. Senn, A personal view of some controversies in allocating treatment to patients in clinical trials. Stat. Med. 1995; 14: 2661–2674.
12. V. W. Berger and D. V. Exner, Detecting selection bias in randomized clinical trials. Control. Clin. Trials. 1999; 20(4): 319–327.
FURTHER READING
H. J. Bang, L. Y. Ni, and C. E. Davis, Assessment of blinding in clinical trials. Control. Clin. Trials. 2004; 25(2): 143–156.
V. W. Berger, Allocation concealment and blinding: when ignorance is bliss. Med. J. Aust. 2005; 183: 165.
V. W. Berger, A. Ivanova, and M. Deloria-Knoll, Minimizing predictability while retaining balance through the use of less restrictive randomization procedures. Stat. Med. 2003; 22: 3017–3028.
S. J. Day and D. G. Altman, Statistics notes: blinding in clinical trials and other studies. BMJ. 2000; 321: 504.
R. Kunz and A. D. Oxman, The unpredictability paradox: review of empirical comparisons of randomised and non-randomised clinical trials. BMJ. 1998; 317: 1185–1190.
S. J. Senn, Turning a blind eye—Authors have blinkered view of blinding. Brit. Med. J. 2004; 328(7448): 1135–1136.
CROSS-REFERENCES Placebo Allocation concealment Randomization codes Randomization envelopes Randomization methods
MAXIMUM DURATION AND INFORMATION TRIALS
KYUNGMANN KIM Biostatistics and Medical Informatics, University of Wisconsin, Madison, Wisconsin
In clinical trials in which a response variable is observed more or less instantaneously, the statistical information of the test statistic is proportional to the number of study subjects, which is directly related to the duration of the study in calendar time and the rate by which subjects are enrolled and randomized into the study. However, in clinical trials in which a response variable is observed over follow-up as time to event, the statistical information is not directly related to the number of subjects but rather to the number of events of interest. Regardless of the type of response variable, the efficient score is asymptotically distributed as
U ∼ N(θI, I),
where I is Fisher's information and θ is the canonical parameter representing treatment effect, according to likelihood theory under the assumption of a relatively large sample and a small effect size of treatment. A one-sided fixed-sample test of the null hypothesis H0: θ = 0, indicating no treatment difference, with a significance level α to detect the alternative hypothesis H1: θ = θ1 > 0 with power 1 − β requires Pr(U ≥ cα | θ = 0) = α and Pr(U ≥ cα | θ = θ1) = 1 − β. This determines the information content or the information horizon of the study as
Ifixed = (zα + zβ)² / θ1²    (1)
and the critical value of the test as
cα = zα (zα + zβ) / θ1,
where zα and zβ are upper α and β quantiles of the standard normal distribution. This test is asymptotically equivalent to the familiar Wald test
Z = (θ̂ − θ) / √Var(θ̂),
which is asymptotically N(0, 1), where Var(θ̂) = Ifixed⁻¹. In typical clinical trials for chronic diseases such as cancer, patients are entered into the study during the accrual period (0, sa), during which patients are enrolled serially, in a pattern known as staggered entry, and then are followed until the occurrence of an event of interest such as death of the patient or disease progression or until the time of study closure, subject to random loss to follow-up. The period from the time when the last patient enters the study to the time of study closure—that is, (sa, sa + sf)—is known as the follow-up period, during which no new patients are entered; the length of this period is known as the minimum follow-up time for the study. The study duration consists of the accrual duration and the follow-up duration. In such clinical trials, a response variable is time to event, with possible censoring of events due to random loss to follow-up or study closure. Many clinical trials with time-to-event data as the primary outcome have been designed as duration trials, simply because studies are often planned in terms of calendar time. The difference between calendar time and information in chronic disease clinical trials in which time to event is a response variable leads to differences in how such studies are designed and analyzed. Lan and DeMets (1) were perhaps the first to formally recognize the difference between the two.
1 TWO PARADIGMS: DURATION VERSUS INFORMATION
According to Lan and DeMets (1), there are two paradigms for the design of clinical trials with time-to-event data: duration trials and information trials. They are distinguished
by the way the end of the study is defined. In the former, the study is concluded when data are accumulated over a fixed duration of follow-up on a specified number of subjects, a sample size in a traditional sense. In the latter, a study is concluded when a prespecified amount of statistical information has been obtained. For example, with the logrank test for comparison of time-to-event data in chronic disease clinical trials, the operating characteristics of the study do not depend directly on the number of subjects enrolled in the study, which is a function of the duration of the accrual period in calendar time and the enrollment rate of subjects into the study, but rather depend on the number of events of interest, which is directly proportional to the statistical information. Thus, the study design often specifies the duration of the accrual and follow-up periods to ensure that a necessary number of events are observed during the study. Either the accrual duration or the follow-up duration is fixed, and the necessary duration of the other period is determined so that the required number of events are ultimately observed during the study. Determination of the study duration for a fixed-sample study has been investigated by Bernstein and Lagakos (2) and Rubinstein et al. (3). With the duration trial, the information of the test statistic at study closure is random, and there is no guarantee that the required amount of statistical information specified in equation 1 will be obtained if the duration design is strictly adhered to in the analysis. With the information trial, however, the statistical information specified for the design in equation 1 is obtained exactly, and as a result the operating characteristic of the statistical test can be maintained exactly as specified in the design. The calendar time of study closure will be random. 2 SEQUENTIAL STUDIES: MAXIMUM DURATION VERSUS INFORMATION TRIALS If it is desirable to monitor the data periodically during the course of a study, group sequential designs or designs based on triangular tests can be used. Group sequential methods such as those of Pocock (4) and O'Brien
and Fleming (5) were developed to maintain the type I error probability at a desired overall significance level despite repeated significance tests. Both methods assume that the number of interim analyses is specified in advance and that the interim analyses are performed after equal increments of statistical information. For monitoring of time-to-event data, one needs flexibility because these two assumptions are often not met. This flexibility can be achieved by using the error spending function introduced by Lan and DeMets (6). When computed at calendar time s, the efficient score is approximately distributed as
U(s) ∼ N(θI(s), I(s)),
as in the fixed-sample study. With the logrank test, which is the efficient score under the proportional hazards model, θ is the log hazard ratio of control λc to experimental λe; that is, θ = log(λc/λe), and I(s) is the asymptotic variance of the logrank statistic. The asymptotic variance I(s) is closely related to the expected number of events ε(s) by I(s) ≈ σz² ε(s), where σz² is the variance of the treatment indicator Z. As will be shown later, the expression for ε(s) can be derived based on some parametric assumptions regarding the length of the accrual period in calendar time, the enrollment rate, and the distribution for time to event. Assume that patient accrual is uniform during the accrual period (0, sa) with a constant accrual rate A, the average number of patients accrued per unit time; that allocation to treatment is by simple randomization, with possibly unequal allocation between two treatments, control (c) and experimental (e); that events occur with constant hazard rates λv, v = c, e; and that random censoring occurs with common constant hazard rate ν. Finally, assume, at least tentatively, that accumulating data will be analyzed after equal increments of information for a maximum of K times. When designing a group sequential study, it is reasonable to assume a prespecified maximum number of analyses K at equal increments of statistical information. With a group sequential design, it is known that the information horizon of the study has to be
inflated from that for the fixed-sample study as compensation for the possibility of early stopping for treatment difference. Given K, α, and β, and a group sequential design, the amount of required inflation in sample size is called the inflation factor F by Kim et al. (7). Given the information horizon for the corresponding fixed-sample study determined by equation 1, the maximum information (i.e., the information content for the group sequential study) is determined by
Imax = Ifixed × F.    (2)
Given the maximum information for the group sequential design, the necessary maximum expected number of events to be observed by the end of the study at calendar time sK (i.e., at the last analysis K) is determined by εmax ≈ Imax/σz². Then the length of study sK = sa + sf is determined to satisfy ε(1)(sK) = (1 − µz) εe(sK) + µz εc(sK) = εmax, where µz is the mean of treatment indicator Z for the experimental treatment and εv(s) is the expected number of events by time s when all patients are given treatment v. In other words, the maximum duration of the study is determined by sK = ε(1)⁻¹(εmax), subject to
εmax/A ≤ sa ≤ ε(1)⁻¹(εmax).
These inequalities ensure that the accrual duration is long enough—but no longer than necessary—for the required maximum number of events. The expected number of events εv(s) can be evaluated by double integration with respect to the density function for time to event and the uniform density function for patient entry. For example, under exponential time to event with hazard rates λv, v = c, e, and exponential random censoring with common constant hazard rate ν, the expected number of events by time s if all the patients in the study are given treatment v is
εv(s) = A (λv/λ̄v) { s ∧ sa − [exp{−λ̄v(s − sa)+} − exp(−λ̄v s)] / λ̄v },
where λ̄v = λv + ν, s ∧ sa is the smaller of s and sa, and x+ = x if x is positive and 0 otherwise. The role of the exponential distribution is simply to provide a calculation for the expected number of events. The above formula can be generalized to other time-to-event distributions, and the arguments extend naturally to other parametric proportional hazards models after suitable transformation of the time scale. As noted above, once the study duration is fixed, the maximum expected number of events by the end of the trial can be estimated as ε(1)(sa + sf). Hence, one may choose to fix the trial duration, sa + sf, or, equivalently, the total number of events to be observed. Although interim analyses are scheduled at regular intervals in calendar time, they depend on the information. By analogy with the fixed-sample study, there are also two paradigms for design of group sequential trials (1): the maximum duration trial, in which the maximum duration of the study is fixed, or the maximum information trial, in which the maximum information of the study is fixed. As such, the two designs again differ in how the end of the study is defined. The design procedure for maximum duration trials with time-to-event data has been investigated by Kim and Tsiatis (8), and that for maximum information trials has been proposed by Kim et al. (9). A maximum duration design specifies the end of the study in terms of the study duration, and a maximum information design specifies the end of the study in terms of the maximum information. In a maximum duration trial, the study is concluded either due to early stopping for treatment difference or when the follow-up reaches the prespecified calendar time. In a maximum information trial, the study is concluded either due to early stopping for treatment difference or when a prespecified maximum information is reached. Sequential clinical trials with time-to-event data are often designed as maximum duration trials in which a specified number of subjects enrolled during the accrual period are evaluated over the follow-up period. As such, the maximum duration—the accrual duration plus the follow-up duration—of a clinical trial is fixed at the design stage.
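To make these calculations concrete, the following Python sketch (an illustration only) evaluates the exponential-model expression for εv(s) given above and solves ε(1)(sK) = εmax numerically for the study duration; the accrual rate, hazard rates, 1:1 allocation, absence of censoring, and target number of events are assumptions chosen to loosely echo the example in the next section, not values taken from any particular trial.

```python
import numpy as np
from scipy.optimize import brentq

def expected_events(s, hazard, accrual_rate, s_accrual, censor_hazard=0.0):
    """Expected events by calendar time s if all patients receive a treatment
    with constant hazard `hazard`, under uniform accrual at `accrual_rate`
    over (0, s_accrual) and exponential censoring at `censor_hazard`."""
    lam = hazard + censor_hazard
    followed = min(s, s_accrual)
    tail = (np.exp(-lam * max(s - s_accrual, 0.0)) - np.exp(-lam * s)) / lam
    return accrual_rate * (hazard / lam) * (followed - tail)

# Assumed design parameters: 60 patients/year for 4 years, 1:1 allocation,
# median survival 10 vs. 15 months (log hazard ratio log 1.5), no censoring.
A, s_a = 60.0, 4.0
lam_c = np.log(2) / (10 / 12)          # control hazard, per year
lam_e = np.log(2) / (15 / 12)          # experimental hazard, per year
events_needed = 200.0                  # assumed maximum number of events

def events_shortfall(s):
    return (0.5 * expected_events(s, lam_c, A, s_a)
            + 0.5 * expected_events(s, lam_e, A, s_a)) - events_needed

# Solve epsilon_(1)(s_K) = events_needed for the study duration s_K (years).
s_K = brentq(events_shortfall, s_a, 20.0)
print(f"study duration: {s_K:.2f} years "
      f"({s_K - s_a:.2f} years of follow-up after accrual ends)")
```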
Because the maximum information to be observed during the study is unknown at the time of an interim analysis, the information times have to be estimated. When monitoring a clinical trial with time-to-event data using the logrank test, the information time at an interim analysis is proportional to the maximum expected number of events to be observed during the study and is estimated by the number of events observed at the time of the interim analysis divided by the maximum number of events expected by the end of the study. The denominator of this fraction is a random quantity and must be estimated. There are at least two candidates for the denominator, one under the null hypothesis of no treatment difference and the other under the specified alternative hypothesis; therefore, there are at least two candidates for the information time:
t̂j = (number of events observed at the interim analysis) / (maximum number of events expected at the end of the trial under Hj),
j = 0, 1. This problem is due to the fact that the asymptotic variance of the logrank statistic depends on the treatment effect, a situation similar to the comparison of proportions. This uncertainty about the information horizon complicates interim analyses, as it necessitates some adjustments in determining group sequential boundaries to maintain the type I error probability at a specified significance level. Kim et al. (9) proposed the following convention for estimation of the information time:
t̂k,j = εk/εmax,j if k < K and εk ≤ εmax,j, and t̂k,j = 1 otherwise,
where εk denotes the observed number of events at calendar time sk of the k-th interim analysis for k = 1, . . ., K, and εmax,j denotes the information horizon under Hj for j = 0, 1. By setting the information time equal to 1 at the last analysis, the type I error probability is always maintained using the error spending function. In maximum information trials, the information horizon is determined under the
specified alternative hypothesis to achieve a desired power given other design parameters, and the information time is estimated by t̂1 and is always unbiased. The net effect is that computation of group sequential boundaries becomes straightforward; furthermore, not only the significance level but also the power of group sequential tests is maintained exactly as specified during the design of the study. Therefore, from a statistical point of view, a maximum information trial design is preferable. The end of the study is, however, defined as the random calendar time when the information horizon is realized. 3 AN EXAMPLE OF A MAXIMUM INFORMATION TRIAL For many years, radiotherapy has been the treatment of choice for patients with regional stage III non-small cell lung cancer, giving a median survival of 9 to 11 months and a 3-year survival rate of less than 10%. In attempts to improve survival in these patients, clinical researchers in the early 1980s considered the possibility that radiotherapy alone might not be sufficient to eradicate micrometastatic disease. At that time, there was also some evidence that platinum-based chemotherapy was beneficial in terms of survival in patients with more advanced disease. Therefore, various systemic approaches to the treatment of stage III patients were proposed, including chemotherapy in conjunction with radiotherapy. The Cancer and Leukemia Group B 8433 (CALGB 8433) study was developed in 1983 to compare the standard control treatment with radiotherapy alone with an experimental treatment using two courses of combination chemotherapy given before radiotherapy in these patients. The primary objective of the CALGB 8433 study was to compare survival with the experimental treatment to that with the control treatment. Under the proportional hazards assumption, survival data can be modeled based on the hazard function λ(t|z) = λ0(t) exp(θz), where λ0(t) represents the baseline hazard at time t, z is a treatment indicator, and
θ is the log hazard ratio. The null hypothesis of no treatment difference is H0: θ = 0, and the alternative hypothesis that the two treatments differ with respect to survival is H1: θ ≠ 0. The sample size was obtained to achieve 80% power (1 − β) to detect a log hazard ratio of θ = 0.405 using the logrank test at a two-sided significance level of α = 0.05. This log hazard ratio represents a 50% difference in median survival between the two treatments. Therefore, a total of 190 deaths on the two arms was required:
εfixed = 4(zα/2 + zβ)² / θ1² = 4(1.96 + 0.84)² / (log 1.5)² ≈ 190.
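A short calculation reproducing this fixed-sample requirement (a sketch only; with unrounded normal quantiles the value is slightly above 190, which the article rounds down):

```python
from math import log
from scipy.stats import norm

alpha, power = 0.05, 0.80
theta1 = log(1.5)                  # log hazard ratio: 50% difference in median survival
z_alpha = norm.ppf(1 - alpha / 2)  # about 1.96 (two-sided 5% level)
z_beta = norm.ppf(power)           # about 0.84

events_fixed = 4 * (z_alpha + z_beta) ** 2 / theta1 ** 2
print(f"required number of deaths: {events_fixed:.1f}")  # about 191, reported as 190
```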
The fixed sample size was determined by assuming that the final analysis of the study would take place after 80% of the patients had died, so 190/0.8 or approximately 240 patients were required. Based on previous experience in the same patient population, about 60 patients were expected to be accrued each year. Therefore, the study was anticipated to have about 4 years of accrual with possibly 6 months to 1 year of additional follow-up to obtain 80% deaths among 240 patients. When this study was developed initially in 1983, this fixed sample size was used, and there was no provision for early stopping for treatment difference. The CALGB policies for interim monitoring for possible
early termination were amended in 1986, coinciding with the emergence of treatment differences in CALGB 8433. At the time of the first interim monitoring, a conservative error spending function α4*(t) = αt^1.5 by Kim and DeMets (10) was chosen for formal sequential tests to take advantage of its flexible nature. This error spending function was known to generate group sequential boundaries similar to those of O'Brien and Fleming (5), but not quite as conservative early on. Also, because formal interim monitoring was going to be used, it was decided that the final analysis would be performed with more ''information'' than for the fixed-sample size analysis in order to maintain the same power. This group sequential test at 80% power has an inflation factor of F = 1.05 over the corresponding fixed-sample size design. Therefore, the maximum number of deaths required by the end of the study was inflated to εmax = 190 × 1.05 ≈ 200 in accordance with equation 2 and the proportionality between the number of events and the information. Table 1 summarizes the monitoring process in the study. Based on the group sequential test and the error spending function α4*(t), the study was closed in March of 1987. Although only 28% of the total number of deaths had been obtained, 163 (68%) of the total number of patients had been accrued by the time of study termination. More importantly, the results of this trial were published 2 years earlier than originally anticipated.
Table 1. Summary of the monitoring process in CALGB 8433.

                                                    Nominal P-value
Analysis date   Percent of    Logrank     Truncated
                information   P-value     O'Brien-Fleming(a)   Pocock    α4*(t)
Sep 1985        5%            NA          0.0013               0.0041    0.0006
Mar 1986        8%            0.021       0.0013               0.0034    0.0007
Aug 1986        18%           0.0071      0.0013               0.0078    0.0027
Oct 1986        22%           0.0015      0.0013               0.0061    0.0026
Mar 1987        29%           0.0015(b)   0.0013               0.0081    0.0042

(a) The standard O'Brien-Fleming boundary would give a nominal P-value less than 0.0001 at each interim analysis shown.
(b) The P-value of 0.0008 from the Cox model was used in the decision for early termination of the study.
Note: NA, not applicable.
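For reference, the cumulative type I error allocated by the spending function α4*(t) = αt^1.5 at the observed information fractions in Table 1 can be computed directly, as in the sketch below. This shows only how much α has been spent by each look; the nominal critical P-values in the table additionally require solving for the group sequential boundaries under the joint distribution of the sequentially computed statistics, which is not attempted here.

```python
import numpy as np

alpha = 0.05
info_fraction = np.array([0.05, 0.08, 0.18, 0.22, 0.29])  # from Table 1

# Error spending function alpha_4*(t) = alpha * t**1.5 (Kim and DeMets).
cumulative = alpha * info_fraction ** 1.5
incremental = np.diff(np.concatenate(([0.0], cumulative)))

for t, cum, inc in zip(info_fraction, cumulative, incremental):
    print(f"t = {t:.2f}: cumulative alpha spent = {cum:.5f}, "
          f"spent at this look = {inc:.5f}")
```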
REFERENCES
1. K. K. G. Lan and D. L. DeMets, Group sequential procedures: calendar versus information time. Stat Med. 1989; 8: 1191–1198.
2. D. Bernstein and S. W. Lagakos, Sample size and power determination for stratified clinical trials. J Stat Comput Simulat. 1978; 8: 65–73.
3. L. V. Rubinstein, M. H. Gail, and T. J. Santner, Planning the duration of a comparative clinical trial with loss to follow-up and a period of continued observation. J Chronic Dis. 1981; 34: 469–479.
4. S. J. Pocock, Group sequential methods in the design and analysis of clinical trials. Biometrika. 1977; 64: 191–199.
5. P. C. O'Brien and T. R. Fleming, A multiple testing procedure for clinical trials. Biometrics. 1979; 35: 549–556.
6. K. K. G. Lan and D. L. DeMets, Discrete sequential boundaries for clinical trials. Biometrika. 1983; 70: 659–663.
7. K. Kim, A. A. Tsiatis, and C. R. Mehta, Computational issues in information-based group sequential clinical trials. J Jpn Soc Comput Stat. 2003; 15: 153–167.
8. K. Kim and A. A. Tsiatis, Study duration and power consideration for clinical trials with survival response and early stopping rule. Biometrics. 1990; 46: 81–92.
9. K. Kim, H. Boucher, and A. A. Tsiatis, Design and analysis of group sequential logrank tests in maximum duration versus information trials. Biometrics. 1995; 51: 988–1000.
10. K. Kim and D. L. DeMets, Design and analysis of group sequential tests based on the type I error spending rate function. Biometrika. 1987; 74: 149–154.
CROSS-REFERENCES
MAXIMUM TOLERABLE DOSE
SYLVIE CHEVRET Inserm France
The ''Maximum Tolerable Dose'' (MTD), also known as the ''Maximum Tolerated Dose'' or the ''Maximally Tolerated Dose,'' is used both in animal toxicology and in early phases of clinical trials, mostly in life-threatening diseases such as cancer and AIDS. In toxicology, the MTD has been defined operationally as the highest daily dose of a chemical that does not cause overt toxicity in laboratory mice or rats. Similarly, in Phase I cancer clinical trials, the MTD of a new agent has been defined as the highest dose level of a potential therapeutic agent at which the patients have experienced an acceptable level of dose-limiting toxicity (DLT) or that does not cause unacceptable side effects. In both settings, these definitions rely on imprecise notions such as ''overt,'' ''acceptable,'' or ''unacceptable'' toxicity. Moreover, the definition of dose-limiting toxicities has been found to be highly variable across published Phase I studies (1). This lack of consensus is observed also when dealing with the methods for establishing the MTD, with wide variations in the designs and the statistical methods of estimation. Finally, the concept of the MTD itself has come to be criticized and is even controversial. We summarize the main concepts underlying the choice of the MTD as the main endpoint in cancer Phase I clinical trials and the main approaches used in establishing the MTD of a new drug. More discussion with regard to its currently reported limits is provided in the last section, with some proposed alternate endpoints.
1 BASIC CONCEPTS AND DEFINITIONS
The MTD is often defined as the dose that produces an ''acceptable'' level of toxicity (2); the dose that, if exceeded, would put patients at ''unacceptable'' risk for toxicity; or the dose that produces a certain frequency of (medically unacceptable) reversible, dose-limiting toxicity (DLT) within the treated patient population. DLT includes host effects up to the point that is acceptable to the patient, based on several severity grading scales of adverse events, such as the Common Toxicity Criteria Grades developed by the National Cancer Institute of the United States or those developed by the World Health Organization. For instance, in oncology, DLT is usually defined as any nonhematological grade III or grade IV toxicity (except alopecia, nausea, vomiting, or fever, which can be rapidly controlled with appropriate measures); absolute neutrophil count < 500/ml for at least 7 days; febrile neutropenia (absolute neutrophil count < 500/ml for at least 3 days and fever above 38.5°C for 24 hours); or thrombocytopenia grade IV. Nevertheless, as stated above, the definition of dose-limiting toxicities differs widely across Phase I trials (1).
1.1 Underlying Assumptions
Determining the optimal dose of a new compound for subsequent testing in Phase II trials is the main objective of cancer Phase I clinical trials. With cytotoxic agents, this dose typically corresponds to the highest dose associated with an acceptable level of toxicity, based on the underlying assumption that stems from the work of Skipper et al. (3) that the higher the dose, the greater the likelihood of drug efficacy. In addition to the relationship between dose and antitumor response, cytotoxic agents also exhibit a dose–toxicity relationship. Thus, dose-related toxicity is regarded as a surrogate for efficacy. In other words, it is assumed that dose-response curves for toxicity and efficacy are parallel or, simply expressed, ''the more pain, the more gain.'' These findings have yielded the concept of the maximum tolerable or tolerated dose (MTD) as the main objective of Phase I trials. Of note, the recommended dose level for further Phase II trials is either considered as synonymous with the MTD or, in two thirds of the Phase I trials as reported recently (1), chosen as one dose level below the selected MTD.
2 HOW IS THE MTD ESTABLISHED?
In Phase I oncology trials conducted over the past few decades, the MTD has usually been
estimated by the traditional 3 + 3 escalation rule, which traces back to the late 1960s (4). The maximum tolerated dose is thus determined by testing increasing doses on different groups of people until the highest dose with acceptable side effects is found (see ''Phase I trials'' for more details). 2.1 The MTD is Interpreted Differently According to the Design Numerous Phase I dose finding clinical trials are conducted every day to find the ''maximum tolerated dose'' (MTD) of a cancer treatment. Although various innovative statistical designs for Phase I clinical trials have been proposed in the literature, the traditional 3 + 3 design is still widely used because of its algorithm-based simplicity in logistics for the clinical investigators to carry out. Actually, based on a recent review of Phase I trials of single-agent cytotoxics published between 1993 and 1995, the MTD was usually defined as the dose level at which > 2/6 patients experienced DLT, but several studies required 3–4/6 patients (1). Such algorithm- or rule-based Phase I designs treat the MTD as being identifiable from the data, so that in this setting the MTD is a statistic rather than a parameter (5). Indeed, these designs involve no statistical estimation after the trial, although statistical quantities of interest have recently been estimated from such trial data (6). By design, the MTD estimate relies heavily on the actual cohort of patients treated and the order in which they enter the study, and it has poor properties. It appeared in the early 1990s that the MTD should be thought of as a population characteristic rather than as a sample statistic. The MTD as a percentile of the dose–toxicity relationship seemed the most commonly used definition of the MTD in so-called model-based designs. Most proposals consisted in establishing a mathematical model for the probability of DLT over the dose scale, iteratively fitted from the data after each patient inclusion using Bayesian inference (7,8). Comparisons show that the Bayesian methods are much more reliable than the conventional algorithm for selecting the MTD (9). Non-Bayesian methods were also proposed, with similar properties (10).
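Because the rule is stated operationally, it is easy to express as an algorithm. The sketch below implements one common variant (definitions vary across trials, as noted above) applied retrospectively to DLT outcomes observed in cohorts of three; the input data are hypothetical.

```python
def three_plus_three(toxicity_outcomes):
    """One common variant of the 3 + 3 escalation rule.

    toxicity_outcomes[k] lists 0/1 DLT indicators for patients treated at dose
    level k (first cohort of 3, optionally expanded to 6).  Returns the index
    of the recommended dose, i.e. one level below the first dose where >= 2 of
    3-6 patients had a DLT, or None if even the lowest dose is too toxic.
    """
    level = 0
    while level < len(toxicity_outcomes):
        outcomes = toxicity_outcomes[level]
        dlt_first_three = sum(outcomes[:3])
        if dlt_first_three == 0:
            level += 1                               # 0/3 DLT: escalate
        elif dlt_first_three == 1 and sum(outcomes[:6]) <= 1:
            level += 1                               # 1/6 DLT after expansion: escalate
        else:
            return level - 1 if level > 0 else None  # too toxic: stop
    return len(toxicity_outcomes) - 1                # escalated through all planned doses

# Hypothetical DLT outcomes per dose level; the rule stops at level 2, so the
# recommended dose is level 1.
print(three_plus_three([[0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 1, 0], [0, 0, 0]]))
```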
3 EXTENDING THE MTD? 3.1 Maximum Tolerated Schedule Most Phase I clinical trials are designed to determine a maximum tolerated dose for one initial administration or treatment course of a cytotoxic experimental agent. Toxicity usually is defined as the indicator of whether one or more particular adverse events occur within a short time period from the start of therapy. However, physicians often administer an agent to the patient repeatedly and monitor long-term toxicity caused by cumulative effects. A new method for such settings has been proposed, the goal of which is to determine a maximum tolerated schedule (MTS) rather than a conventional MTD (11,12). 3.2 Overdose Control More recently, some researchers have considered the percentile definition of the MTD inadequate in addressing the ethical question of protecting the patients from severe toxicity. This consideration led to the suggestion by Babb and colleagues (13) that the MTD should be chosen so that, with high probability, toxicity does not exceed the tolerable level, which imposes a safety constraint referred to as overdose control. 3.3 Most Successful Dose Objective responses observed in Phase I trials are important for determining the future development of an anticancer drug (14). Thus, much recent interest has developed in Phase I/II dose finding designs in which information on both toxicity and efficacy is used. These designs concern, for instance, dose finding in HIV, in which information on both toxicity and efficacy is almost immediately available. Recent cancer studies are beginning to fall under this same heading, in which toxicity can be evaluated quickly and, in addition, we can rely on biological markers or other measures of tumor response. Unlike the classic Phase I dose finding design in which the aim is to identify the MTD, the Phase I/II dose finding study aims to locate the most successful dose (MSD) (i.e., the dose that maximizes the product of the probability of seeing no toxicity together with the probability of seeing
a therapeutic response). For a dose finding study in cancer, the MSD, among a group of available doses, is the dose at which the overall success rate is the highest (15,16). Similar proposals have also been published, based on a bivariate modeling of toxicity and efficacy (17,18). 3.4 Patient-Specific Optimal Dose Because Phase I trials are small studies, the maximum tolerated dose of a new drug may not be established precisely for any individual. New paradigms for the clinical evaluation of new cancer therapies have been proposed. One proposal entails adjusting the search for the optimal dose on the basis of measurable patient characteristics to obtain a personalized treatment regimen (19). Accelerated titration (i.e., rapid intrapatient drug dose escalation) designs also seem to provide a substantial increase in the information obtained with regard to interpatient variability or cumulative toxicity (20). 3.5 Different Optimal Doses: The Example of Targeted Agents in Cancer Advances in molecular biology have led to a new generation of anticancer agents that inhibit aberrant and cancer-specific proliferative and antiapoptotic pathways. These agents may be cytostatic and may produce relatively minimal organ toxicity compared with standard cytotoxics. Thus, these new, targeted anticancer agents have posed challenges to the current Phase I paradigm of dose selection based on toxicity. Indeed, traditional trial designs and endpoints may not be adequate for developing contemporary targeted drugs. Notably, increasing the drug dose to toxicity seems unnecessary for drug effect, which makes the use of the MTD as a surrogate of effective dose inappropriate in the Phase I setting. To accommodate these new drugs, the concept of an optimal biologic dose, defined as a dose that reliably inhibits a drug target or achieves a target plasma concentration, has been reported as desirable and appropriate for the Phase I study of mechanism-based, relatively nontoxic novel agents. This concept should rely on pharmacodynamic data in addition to toxicity (21).
In these early-phase dose finding clinical trials with monotone biologic endpoints, such as biological measurements, laboratory values of serum levels, and gene expression, a specific objective is to identify the minimum dose that exhibits adequate drug activity, shifting the mean of the endpoint away from that of a zero dose; this is the so-called minimum effective dose (22). Stepwise test procedures for dose finding have been well studied in the context of nonhuman studies in which the sampling plan is done in one stage (23). This development has fueled interest in alternatives to toxicity as a surrogate endpoint in Phase I clinical trials, although no consensus has been reached. Indeed, the optimal biologic dose has rarely formed the basis of dose selection. This situation is exemplified in a recent overview of 60 Phase I trials that involved 31 single agents representative of the most common targets of interest in the oncology literature: 60% still used toxicity, whereas only 13% used pharmacokinetic data as endpoints for selection of the recommended Phase II dose (24). Finally, the selected dose should incorporate the fact that wide variations will be found in steady-state drug levels in patients (25).
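To make the contrast among these dose-selection criteria concrete, the following minimal Python sketch compares a percentile-style MTD with the most successful dose (MSD) described in Section 3.3. The dose grid and the toxicity and response probabilities are hypothetical illustrations, not estimates from any trial or method cited in this article.

```python
# Hypothetical per-dose probability estimates (illustration only).
doses = [10, 20, 40, 60, 80]             # dose levels, e.g., mg
p_tox = [0.05, 0.12, 0.25, 0.38, 0.55]   # estimated P(dose-limiting toxicity)
p_resp = [0.10, 0.20, 0.35, 0.45, 0.50]  # estimated P(therapeutic response)
target_tox = 0.33                        # acceptable toxicity rate

# Percentile-style MTD: highest dose whose estimated toxicity probability
# does not exceed the target rate.
mtd = max(d for d, t in zip(doses, p_tox) if t <= target_tox)

# Most successful dose: maximize P(no toxicity) * P(response).
success = [(1 - t) * r for t, r in zip(p_tox, p_resp)]
msd = doses[max(range(len(doses)), key=lambda i: success[i])]

print(f"MTD (highest dose with toxicity <= {target_tox}): {mtd}")
print(f"MSD (maximizes P(no toxicity) * P(response)): {msd}")
```

In practice, such probabilities would be estimated sequentially from a dose-toxicity and dose-efficacy model during the trial rather than fixed in advance; the sketch only illustrates how the two criteria can select different doses.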
4 CONCLUDING REMARKS
Phase I clinical trials are designed to identify an appropriate dose for experimentation in Phase II and III studies. Assuming that efficacy and toxicity are closely related, the main objective of these trials has been to find the highest dose level of a potential therapeutic agent at which patients experience an acceptable level of dose-limiting toxicity. Efforts in Phase I studies over the past several years have focused on efficient estimation of the maximum tolerated dose. Recently, the MTD concept has become controversial, in part because of difficulties in extrapolating findings to the whole treatment course and in applying population findings to individuals. Some extensions have been proposed to better address these issues. In addition, improved understanding of the biology of cancer has led to the identification of new molecular targets and the development of pharmacologic agents that hold
promise for greater tumor selectivity than traditional cytotoxic agents. This development has called into question the use of the MTD as a surrogate endpoint for efficacy. Nevertheless, increased research efforts should be spent on the prospective evaluation and validation of novel biologic endpoints and innovative clinical designs so that promising targeted agents can be effectively developed to benefit the care of cancer patients. A need exists for an improved definition of the optimal biologic dose (26). Finally, allowing accelerated drug development through combined Phase I/II and Phase II/III clinical trial designs is a promising research area in the near future.
REFERENCES

1. S. Dent and E. Eisenhauer, Phase I trial design: are new methodologies being put into practice? Ann. Oncol. 1996; 7(6): 561–566.
2. B. Storer, Phase I trials. In: T. Redmond (ed.), Biostatistics in Clinical Trials. Chichester: John Wiley & Sons, 2001, pp. 337–342.
3. H. Skipper, F. J. Schabel, and W. Wilcox, Experimental evaluation of potential anticancer agents XIII: on criteria and kinetics associated with "curability" of experimental leukemia. Cancer Chemother. Rep. 1964; 35: 1.
4. M. Schneiderman, Mouse to man: statistical problems in bringing a drug to clinical trial. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, CA: University of California Press, 1967.
5. W. F. Rosenberger and L. M. Haines, Competing designs for phase I clinical trials: a review. Stat. Med. 2002; 21(18): 2757–2770.
6. Y. Lin and W. Shih, Statistical properties of the traditional algorithm-based designs for phase I cancer clinical trials. Biostatistics. 2001; 2(2): 203–215.
7. J. O'Quigley, M. Pepe, and L. Fisher, Continual reassessment method: a practical design for phase 1 clinical trials in cancer. Biometrics. 1990; 46(1): 33–48.
8. J. Whitehead and D. Williamson, Bayesian decision procedures based on logistic regression models for dose-finding studies. J. Biopharm. Stat. 1998; 8(3): 445–467.
9. P. Thall and S. Lee, Practical model-based dose-finding in phase I clinical trials: methods based on toxicity. Int. J. Gynecol. Cancer. 2003; 13(3): 251–261.
10. J. O'Quigley and L. Shen, Continual reassessment method: a likelihood approach. Biometrics. 1996; 52(2): 673–684.
11. T. Braun et al., Simultaneously optimizing dose and schedule of a new cytotoxic agent. Clin. Trials. 2007; 4(2): 113–124.
12. T. Braun, Z. Yuan, and P. Thall, Determining a maximum-tolerated schedule of a cytotoxic agent. Biometrics. 2005; 61(2): 335–343.
13. J. Babb, A. Rogatko, and S. Zacks, Cancer phase I clinical trials: efficient dose escalation with overdose control. Stat. Med. 1998; 17(10): 1103–1120.
14. I. Sekine et al., Relationship between objective responses in phase I trials and potential efficacy of non-specific cytotoxic investigational new drugs. Ann. Oncol. 2002; 13(8): 1300–1306.
15. J. O'Quigley, M. Hughes, and T. Fenton, Dose-finding designs for HIV studies. Biometrics. 2001; 57(4): 1018–1029.
16. S. Zohar and J. O'Quigley, Optimal designs for estimating the most successful dose. Stat. Med. 2006; 25(24): 4311–4320.
17. P. F. Thall, E. H. Estey, and H.-G. Sung, A new statistical method for dose-finding based on efficacy and toxicity in early phase clinical trials. Investigational New Drugs. 1999; 17: 155–167.
18. Y. Zhou et al., Bayesian decision procedures for binary and continuous bivariate dose-escalation studies. Pharm. Stat. 2006; 5(2): 125–133.
19. A. Rogatko et al., New paradigm in dose-finding trials: patient-specific dosing and beyond phase I. Clin. Cancer Res. 2005; 11(15): 5342–5346.
20. R. Simon et al., Accelerated titration designs for phase I clinical trials in oncology. J. Natl. Cancer Inst. 1997; 89(15): 1138–1147.
21. D. Kerr, Phase I clinical trials: adapting methodology to face new challenges. Ann. Oncol. 1994; 5(S4): 67–70.
22. S. Kummar et al., Drug development in oncology: classical cytotoxics and molecularly targeted agents. Br. J. Clin. Pharmacol. 2006; 62(1): 15–26.
23. M. Polley and Y. Cheung, Two-stage designs for dose-finding trials with a biologic endpoint using stepwise tests. Biometrics. 2007.
24. W. Parulekar and E. Eisenhauer, Phase I trial design for solid tumor studies of targeted, noncytotoxic agents: theory and practice. J. Natl. Cancer Inst. 2004; 96(13): 990–997.
25. A. Adjei, The elusive optimal biologic dose in phase I clinical trials. J. Clin. Oncol. 2006; 24(25): 4054–4055.
26. B. Ma, C. Britten, and L. Siu, Clinical trial designs for targeted agents. Hematol. Oncol. Clin. North Am. 2002; 16(5): 1287–1305.
CROSS-REFERENCES
Phase I Trials
Therapeutic Dose Range
METADATA

DAVID H. CHRISTIANSEN
Christiansen Consulting, Boise, Idaho

1 INTRODUCTION

Metadata (also "meta data" or "meta-data") are commonly defined as "data about data." Clinical trial metadata are concerned with describing data originating from or related to clinical trials, including datasets and statistical analyses performed on the datasets. Clinical trial metadata may be in the form of a separate written document, may be linked electronically to a document or dataset, or may be integrated into a dataset as part of the definition of the data fields or variables. Metadata may be accessed by statisticians performing analyses on the data and by other scientists reviewing or using the data and results. Increasingly, metadata included in computerized datasets (machine-readable metadata) can also be used by statistical software and other computer applications to present or use the data in an appropriate manner, based on the metadata description. For machine-readable metadata to be used by a computer application, standards for the format and content must exist for both the metadata and the application that reads the metadata. Metadata are an important component of the documentation required for regulatory submissions and should provide a clear and concise description of the data collected and the analyses performed.

2 HISTORY/BACKGROUND

The term metadata was coined in 1969 by Jack E. Kelly. Although "Metadata" was registered as a trademark in 1986 by The Metadata Company (1), the generic "metadata" is commonly used in many disciplines, including computer science, database administration, geographic science, and clinical data management.

2.1 A Metadata Example

The importance of metadata can be illustrated by the following example, adapted from actual events. Consider the two datasets shown in Table 1. Each dataset contains the same variables but has different data values. With only cryptic variable names and no additional metadata, it would be almost impossible to determine what the data values represent without additional documentation. If we add context by stating that these datasets represent rocket burn instructions for placing a satellite in orbit and that the last three variables represent distance, speed, and force, then a knowledgeable rocket scientist may be able to infer what the data points could represent. In fact, that inference, based on inadequate metadata, led to a very costly mistake. The Mars Climate Orbiter was launched in December 1998 and was scheduled to go into orbit around Mars 9 months later. The $150 million satellite failed to reach orbit because the force calculations were "low by a factor of 4.45 (1 pound force = 4.45 Newtons), because the impulse bit data contained in the AMD file was delivered in lb-sec instead of the specified and expected units of Newton-sec" (2). In other words, a data file in English units (pounds) was input into a program that expected metric units (Newtons). Table 2 illustrates the same two datasets with additional metadata, including meaningful variable names, labels, and units. Note that the units used in each dataset are now clearly identified in the column header metadata. Although it cannot be said with certainty that this metadata would have prevented the error, it would have increased the probability that the error would have been found by project personnel. The use of machine-readable or "parsable" metadata would allow computer programs to be used to check for compatible units.
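The closing point of this example, that parsable metadata lets software verify units before data are used, can be sketched in a few lines of Python. The column names, metadata layout, and simplified check below are illustrative assumptions, not part of any metadata standard discussed later in this article.

```python
# Hypothetical machine-readable column metadata for a burn-instructions file.
LBF_TO_NEWTON = 4.45   # conversion factor quoted in the mishap report

column_metadata = {
    "FORCE": {"label": "Force", "units": "lbf"},   # units declared by the file
}
expected_units = {"FORCE": "N"}                    # units the program expects

def check_units(metadata, expected):
    """Return messages for columns whose declared units differ from the expected units."""
    problems = []
    for column, wanted in expected.items():
        declared = metadata[column]["units"]
        if declared != wanted:
            problems.append(f"{column}: file declares '{declared}', program expects '{wanted}'")
    return problems

for message in check_units(column_metadata, expected_units):
    print("Unit mismatch:", message)

# With the metadata available, the value could be converted instead of misused.
print("143.878 lbf =", round(143.878 * LBF_TO_NEWTON, 1), "N")   # roughly 640 N, as in Table 2
```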
Table 1. Example Datasets with No Metadata

Dataset 1
Event | Time              | D           | S      | F
Begin | 9/23/99 02:01:00  | 121,900,000 | 12,300 | 143.878
End   | 9/23/99 02:17:23  |             | 9,840  |

Dataset 2
Event  | Time               | D           | S   | F
Start  | 19990923 05:01:00  | 196,200,000 | 5.5 | 640
Finish | 19990923 05:17:23  |             | 4.4 |

Table 2. Mars Climate Orbiter Burn Instructions with Metadata

Dataset 1 - Mars Orbit Insertion Burn in English Units
Event | M/D/Y HH:MM:SS Pacific Daylight Time (Earth Receive Time, 10 min. 49 sec. Delay) | Distance (miles) | Speed (miles/hr) | Force (Pounds)
Begin | 9/23/99 02:01:00  | 121,900,000 | 12,300 | 143.878
End   | 9/23/99 02:17:23  |             | 9,840  |

Dataset 2 - Mars Orbit Insertion Burn in Metric Units
Event  | YYYYMMDD EDT (Earth Receive Time, 10 min. 49 sec. Delay) | Distance (km) | Speed (km/sec) | Force (Newtons)
Start  | 19990923 05:01:00  | 196,200,000 | 5.5 | 640
Finish | 19990923 05:17:23  |             | 4.4 |

2.2 Geospatial Data

Geographic science is an example of a discipline with well-defined metadata. In 1990, the Office of Management and Budget established the Federal Geographic Data Committee as an interagency committee to promote the coordinated development, use, sharing, and dissemination of geospatial data on a national basis (3). In addition to detailed metadata and tools, the group also provides informational and educational materials that may be useful for motivating the development and use of metadata. For example, the U.S. Geologic Survey, in a document titled "Metadata in Plain Language" (4), poses the following questions about geospatial data:

1. What does the dataset describe?
2. Who produced the dataset?
3. Why was the dataset created?
4. How was the dataset created?
5. How reliable are the data; what problems remain in the dataset?
6. How can someone get a copy of the dataset?
7. Who wrote the metadata?
Although these questions may have a different context and different emphasis in clinical trials, they can provide background information for our discussion of clinical trial metadata.

2.3 Research Data and Statistical Software

Research data management refers to the design, collection, editing, processing, analysis, and reporting of data that result from a research project such as a clinical trial. The characteristics and requirements for research data management activities are different from those of a commercial data management system. Because clinical trials are experimental by nature, the resulting datasets are unique, contain many different variables, and require a high degree of study-specific documentation, including metadata. Conversely, commercial data systems, such as payroll or credit card billing, are characterized by very stable systems that perform well-defined functions on a relatively small
number of standard variables. These differences make it difficult to use commercial data systems, such as databases and reporting systems, for research data. These unique requirements drove the development of statistical software specifically for research data over the last 30 years. The most commonly used systems all have some type of metadata at both the dataset and variable levels. These metadata allow the researchers conducting the studies to describe the structure and content of the datasets clearly. The ability to describe the data clearly and unambiguously is important in the analysis and reporting of the study results, both within an organization and for regulatory review and approval of new treatments.
2.4 Electronic Regulatory Submission

The International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) (5) has developed a Common Technical Document (CTD) that "addresses the organization of the information to be presented in registration applications for new pharmaceuticals (including biotechnology-derived products) in a format that will be acceptable in all three ICH regions (Japan, Europe and United States)" (6). The CTD outline is, in fact, metadata for the regulatory submission. It provides information about the submission in a structured format that facilitates both the creation and review of the submission. Because the requirements for submission of clinical data vary from country to country, the CTD does not specifically address electronic submission of clinical data. The U.S. Food and Drug Administration (FDA), however, defined metadata requirements for clinical trial datasets submitted as part of its treatment approval processes in 1999 (7). The Clinical Data Interchange Standards Consortium (CDISC) enhanced the FDA metadata by adding metadata attributes at the dataset and variable levels (8). Since that time, the FDA and CDISC have collaborated on more standards, resulting in the FDA referencing the CDISC standard for clinical domain data as an acceptable format (9,10). The FDA and CDISC currently are working on a similar standard for analysis datasets (11,12).

3 DATASET METADATA

In the context of clinical trials and regulatory submissions, the most useful metadata refer to the datasets that contain the trial results. The datasets can be generally classified as those data collected during the execution of the trial (tabulation data or observed data) or data derived from the observed data for the purpose of display or analysis (analysis data). Dataset metadata can be classified at various levels as described below. The metadata attributes described are discussed by one or more of the following standards or organizations: CDISC Tabulation (10), CDISC Analysis (11), FDA (13), and SAS (14).

3.1 Dataset-Level Metadata

A dataset is a computer file structured in a predictable format that can be read and processed by a computer application or program, typically a statistical system such as SAS (SAS Institute, Cary, NC), S-Plus (Insightful Corporation, Seattle, WA), or SPSS (SPSS Corporation, Chicago, IL). In the discussion here, SAS terminology will be used, but similar concepts and attributes exist in all statistical languages and systems. Dataset-level metadata describes the content, structure, and use of the dataset by describing its physical and logical attributes. Some of these attributes relate to the dataset itself, whereas others are dependent on the context in which the dataset is used. The description and use of these attributes as they relate to regulatory submissions are shown below.
3.1.1 Dataset Name. – Unique dataset name for this file. FDA and CDISC have naming conventions for some clinical domain and analysis datasets. (FDA, CDISC, and SAS)

3.1.2 Description (Label). – A more detailed description of the content of the dataset. (FDA, CDISC, and SAS)

3.1.3 Location. – The relative physical file location in an electronic submission. (FDA and CDISC)

3.1.4 Structure. – The shape of the dataset or the level of detail represented by each
row or record. Structure can range from very horizontal (one record per subject) to very vertical (one record per subject per visit per measurement). It is recommended that structure be defined as "one record per ..." rather than with the ambiguous terms normalized or denormalized, horizontal or vertical, tall or short, skinny or fat, and so forth. (CDISC)

3.1.5 Purpose. – Definition of the type of dataset as tabulation or analysis. (CDISC)

3.1.6 Key Fields. – Variables used to uniquely identify and index records or observations in the dataset. (CDISC)

3.1.7 Merge Fields. – A subset of key fields that may be used to merge or join SAS datasets. (CDISC Analysis)

3.1.8 Analysis Dataset Documentation. – Written documentation that includes descriptions of the source datasets, processing steps, and scientific decisions pertaining to creation of the dataset. Analysis dataset creation programs may also be included. (CDISC Analysis)

3.1.9 Rows. – The number of data records or observations in the dataset. (SAS)

3.1.10 Columns. – The number of variables or fields in the dataset. For most analysis datasets, each column represents the measurement value of some characteristic of the subject, such as sex, age, or weight at a visit. (SAS)

3.2 Variable-Level Metadata

Each column or variable in a dataset has certain attributes that describe the content and use of the variable. These variable attributes are usually consistent for all observations within a dataset. That is, each column has the same attributes for all rows of the dataset. This rectangular data structure is required for statistical analysis by the vast majority of statistical analysis software and is the natural structure for CDISC Analysis datasets. Many CDISC Tabulation datasets, however, have a structure of one record per measurement. Because different measurements have
different attributes, metadata at the variable level alone are not adequate. See the section on value-level metadata below for a discussion of this issue.

3.2.1 Variable Name. – Unique name for the variable. A variable name should be consistent across datasets and studies within a regulatory submission. The data values of a variable should also be consistent in definition and units across all datasets within a submission. For example, if AGE is recorded in years in one study and months in another, then either AGE must be converted to, say, years in the second study or a different variable name, say AGEMON, must be used for age in months. (FDA, CDISC, and SAS)

3.2.2 Variable Label. – A more detailed description of the variable. This description may be used by some software and review tools. (FDA, CDISC, and SAS)

3.2.3 Type. – Description of how the data values for this variable are stored. Current conventions specify only a character string (CHAR) or a numeric value (NUM). These two types are consistent with SAS and other software, but additional types such as floating point, integer, binary, and date/time are used by some systems. (FDA, CDISC, and SAS)

3.2.4 Length. – The number of bytes allocated to store the data value. For CHAR variables, this number is the length of the character string. SAS and other software define this attribute as the number of bytes allocated to store the numeric value in some internally defined form, typically floating-point hexadecimal. This length is not the number of digits used to display a number; see Format below. (SAS)

3.2.5 Format. – Description of how a variable value is displayed. Formats can be general, defining the width of a display field and the number of decimal digits displayed for a number. They can also provide code list values (1 = "Female"; 2 = "Male") or display numerically coded values, such as SAS dates, in a human-readable form. (FDA, CDISC, and SAS)
CDISC Tabulation metadata defines this code list attribute as "Controlled Terms." These terms are standard values for specific variables in certain clinical domains.

3.2.6 Role. – Description of how a variable may be used in a particular dataset. A variable may be assigned multiple roles. The role attribute is used by CDISC in two distinct ways. First, CDISC Tabulation datasets have a set of specific roles designed to describe a variable in the context of the proposed FDA JANUS database (15).

• Identifier variables, usually keys, identify the study, subject, domain, or sequence number of the observation.
• Topic variables specify the focus of the observation, typically the name of a measurement or lab test in a one-record-per-measurement structured dataset.
• Timing variables specify some chronological aspect of a record, such as visit number or start date.
• Qualifier variables define a value with text, units, or data quality. Qualifiers are often used to store variables not otherwise allowed in tabulation datasets.

Role attributes in CDISC Analysis datasets have a different emphasis than those in the tabulation models. Analysis roles focus on providing information useful to the statistical review rather than specification for the still-to-be-developed JANUS database. Because the primary goal of analysis roles is clear communication of the statistical analysis performed by the sponsor, the values of the role attribute are open-ended and can be extended as needed to meet this goal. The following roles have been identified by FDA and CDISC through the Analysis Dataset Model group (16). It should be noted that these definitions are still under development and may be subject to change.

• Selection variables are frequently used to subset, sort, or group data for reporting, displaying, or analysis. Common selection variables include treatment group, age, sex, and race. Specific study designs, drug indications, and endpoints may have specific selection variables as well. For example, a hypertension trial may identify baseline blood pressure measurements as selection variables of interest. Flag variables identifying analysis populations such as "per protocol" or "intent to treat" are also commonly identified as selection variables.
• Analysis variables relating to major study objectives or endpoints may be identified to assist reviewers. This identification may be especially useful in complex studies in which it may not be clear from the variable name which variable is the primary endpoint.
• Support variables are identified as useful for background or reference. For example, the study center identifier may be used to group subjects by center, but the study name or investigator name would provide supporting information.
• Statistical roles such as covariate, censor, endpoint, and so forth may also be useful for specific analyses and study designs. These roles may be added as needed to improve clear communication between the study authors and the reviewers.
3.2.7 Origin. – Describes the point of origin of a variable. CDISC Tabulation datasets allow CRF, "derived," and "sponsor defined" values of origin. This attribute usually refers to the first occurrence of a value in a clinical study and does not change if a variable is added to another file such as an analysis dataset.

3.2.8 Source. – In a CDISC Analysis dataset, source provides information about how a variable was created and defines its immediate predecessor. For example, an analysis dataset may be created by merging two or more source datasets. The variable SEX from the demographics domain dataset DM would have a source of "DM.SEX". This convention of specifying the immediate predecessor defines an audit trail back to the original value, no matter how many generations of datasets are created. Derived variables in
analysis datasets may include a code fragment to define the variable or may hyperlink to more extensive documentation or programs.
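As a rough illustration of how the dataset-level attributes of Section 3.1 and the variable-level attributes of Section 3.2 fit together, the sketch below holds them in an ordinary Python dictionary. The dataset name, variable list, and attribute spellings are assumptions for illustration only; this is not the Define.XML schema or any CDISC file format.

```python
# Hypothetical dataset-level and variable-level metadata for an analysis dataset.
vital_signs_analysis = {
    "name": "ADVS",                          # assumed dataset name
    "label": "Vital Signs Analysis",
    "structure": "one record per subject",
    "purpose": "analysis",
    "key_fields": ["USUBJID"],
    "variables": {
        "USUBJID": {"label": "Subject ID", "type": "Char", "role": "Identifier"},
        "HEIGHT": {"label": "Height", "type": "Num", "format": "3.0",
                   "source": "VS.VSSTRESN (where VS.VSTESTCD = 'HEIGHT')"},
        "BMI": {"label": "Body Mass Index", "type": "Num", "format": "5.2",
                "origin": "Derived", "source": "WEIGHT / (0.01 * HEIGHT) ** 2"},
    },
}

def describe(ds):
    """Print a compact, human-readable summary generated from the metadata alone."""
    print(f"{ds['name']}: {ds['label']} ({ds['structure']})")
    for name, attrs in ds["variables"].items():
        print(f"  {name:8s} {attrs.get('type', '?'):4s} {attrs.get('label', '')}")

describe(vital_signs_analysis)
```

Holding the attributes in a machine-readable structure is what allows tools, rather than people, to produce data definition documents and consistency checks from the same source.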
3.3 Value-Level Metadata

Many CDISC Tabulation (SDTM) datasets have a structure of one record per subject per time point per measurement. This structure means that attributes such as name, label, type, and format vary from record to record because different measurements appear in different records. To provide clear communication of the content and use of such a dataset, each measurement name (test code) must have its own metadata. For example, the CDISC Tabulation model defines a Vital Signs dataset with measurements for height, weight, and frame size. Table 3 illustrates selected portions of such a dataset and its metadata. Some CDISC attributes have been changed or omitted to simplify the example. Note that the data values of HEIGHT, WEIGHT, and FRMSIZE are stored in the same column, and no useful metadata identifies format, variable type, role, or origin. This dataset is difficult to understand and cannot be used with statistical software without additional programming. This file structure requires additional metadata for each value of the vital signs test, as shown in Table 4. The value-level metadata defines the attributes needed to transpose the tabulation dataset into a CDISC Analysis dataset, typically with a structure of one record per subject. This dataset, shown in Table 5, contains the data points from the tabulation dataset, but in a different structure. Values of SEX and the derived variable Body Mass Index (BMI) have been added to illustrate that analysis datasets can include data from several sources. Note that now each measurement is represented by a variable, and each variable can have its own metadata attributes. This dataset structure is preferred by most statisticians and can be used directly by most statistical packages.
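The transposition that value-level metadata supports can be illustrated with a short, self-contained Python sketch (no statistical package required). The records mirror the hypothetical vital signs example in Tables 3-5; the dictionary-based layout is an illustrative assumption, not a CDISC structure.

```python
# One-record-per-subject-per-measurement tabulation records (illustrative).
tabulation = [
    {"USUBJID": "00001", "VSTESTCD": "HEIGHT", "VSSTRESN": 165.0, "VSSTRESC": ""},
    {"USUBJID": "00001", "VSTESTCD": "WEIGHT", "VSSTRESN": 56.1, "VSSTRESC": ""},
    {"USUBJID": "00001", "VSTESTCD": "FRMSIZE", "VSSTRESN": None, "VSSTRESC": "Small"},
]

# Value-level metadata: which result column carries each test's value.
value_level = {"HEIGHT": "VSSTRESN", "WEIGHT": "VSSTRESN", "FRMSIZE": "VSSTRESC"}

# Transpose to one record per subject, one column per measurement.
analysis = {}
for rec in tabulation:
    subj = analysis.setdefault(rec["USUBJID"], {"USUBJID": rec["USUBJID"]})
    test = rec["VSTESTCD"]
    subj[test] = rec[value_level[test]]          # pick numeric or character result

# Derive BMI = weight (kg) / height (m)^2, as in Table 5.
for subj in analysis.values():
    subj["BMI"] = round(subj["WEIGHT"] / (0.01 * subj["HEIGHT"]) ** 2, 2)

print(analysis["00001"])
# {'USUBJID': '00001', 'HEIGHT': 165.0, 'WEIGHT': 56.1, 'FRMSIZE': 'Small', 'BMI': 20.61}
```

Without the value-level mapping, a program has no reliable way to know whether a given test's result lives in the numeric or the character result column.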
Table 3. Vital Signs Tabulation Dataset
Vital Signs Findings - 1 record per subject per measurement

Name   | USUBJID    | VSTESTCD      | VSTEST           | VSSTRESN        | VSSTRESC
Label  | Subject ID | VS Short Name | VS Name          | Numeric Result  | Character Result
Type   | Char       | Char          | Char             | Numeric         | Char
Format |            |               |                  | ?               | ?
Origin | Sponsor    | CRF           | Sponsor          | CRF             | CRF
Role   | Identifier | Topic         | Qualifier        | Qualifier       | Qualifier
Data   | 00001      | HEIGHT        | Height in cm     | 165             |
       | 00001      | WEIGHT        | Weight in kg     | 56.1            |
       | 00001      | FRMSIZE       | Frame Size       |                 | Small

Table 4. Vital Signs Tabulation Value-Level Metadata
Vital Signs Value-Level Metadata for VSTESTCD

Value (VSTESTCD) | Label (VSTEST) | Type | Format or Controlled Terms | Origin
HEIGHT           | Height in cm   | Num  | 3.0                        | VSSTRESN
WEIGHT           | Weight in kg   | Num  | 5.1                        | VSSTRESN
FRMSIZE          | Frame size     | Char | Small, Medium, Large       | VSSTRESC

Table 5. Vital Signs Analysis Dataset
Vital Signs Analysis - 1 record per subject

Name   | USUBJID    | SEX                  | HEIGHT   | WEIGHT   | BMI      | FRMSIZE
Label  | Subject ID | Sex                  |          |          |          |
Type   | Char       | Char                 | Num      | Num      | Num      | Char
Format |            | Female, Male         | 3.0      | 5.1      | 5.2      | Small, Medium, Large
Origin | Sponsor    | CRF                  | CRF      | CRF      | Derived  | CRF
Source | VS.USUBJID | DM.SEX               | VS.VSSTRESN (where VS.VSTESTCD = "HEIGHT") | VS.VSSTRESN (where VS.VSTESTCD = "WEIGHT") | WEIGHT/(0.01*HEIGHT)**2 | VS.VSSTRESC (where VS.VSTESTCD = "FRMSIZE")
Role   | Identifier | Qualifier, Selection | Analysis | Analysis | Analysis | Analysis
Data   | 00001      | Female               | 165      | 56.1     | 20.61    | Small

3.4 Item-Level Metadata

Item-level metadata refers to the attributes of an individual data value or cell in a rectangular data matrix. That is, it refers to a value for the crossing of a specific variable (column) and observation (row). Item-level metadata are typically used to describe the quality of that particular measurement, for example, a partial date in which the day portion is missing and is imputed or a lipid measurement
in which a frozen sample was accidentally allowed to thaw. In each of these cases, a value exists, but it may be considered to be of lesser quality than a regular measurement. Because of the complex and expensive nature of clinical trials, it is not always practical to discard such measurements, nor is it scientifically valid to treat them as complete. The identification of these data items is especially important for accurate statistical analysis in regulatory submissions. In a discussion of quality assurance for submission of analysis datasets, one FDA statistical director stated that it was desirable to provide a "clear description of what was done to each data element: edits, imputations, partial missing ..." (17). Historically, the concept of status attributes for data items was used in the 1970s by the Lipid Research Clinics Program to identify missing values and record the editing status of each variable (18). Clinical research data management systems may also have similar features, but this information is not easily exported to statistical reporting and analysis programs. Currently, the most common method of identifying item-level quality involves the addition of a separate variable that contains the status value. These "status flags" are cumbersome to maintain and can increase the size of datasets considerably. The development of statistical datasets using the eXtensible Markup Language (XML) (19) has the potential to provide the more flexible structure required
for integrating item-level metadata into clinical datasets, but the tools required do not exist at this time. An audit file system for regulatory statistical reviewers was proposed by an FDA statistical director as a "file describing the changes or edits made during the data management or cleaning of the data," that is, a file that provides a link or audit trail between the original and submitted data values by providing metadata related to the edits, including the following attributes (20):

• Patient, observation, visit, variable, and other identifiers
• Original and submitted values
• Qualifiers describing the change, such as who, when, and why
• Edit codes describing the action taken, such as empty (not recorded), completed, replaced, confirmed, or suspicious but not collectable

It is interesting to note that many of the edit codes proposed by the FDA director are similar to the 1970s system described above. It should also be noted that current data management systems do have audit trails, but they typically cannot extract and submit the data in a form that is useable by a regulatory reviewer.
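A minimal sketch of the "status flag" approach described above is shown below; the variable names, flag codes, and records are hypothetical and are not drawn from any standard or submission.

```python
# Companion "status flag" variables recording the quality of individual data items.
records = [
    # Day of month was missing and imputed; the flag says so.
    {"USUBJID": "00001", "LBDT": "2001-06-15", "LBDT_STAT": "DAY IMPUTED"},
    # Sample thawed before assay; value kept but flagged as lower quality.
    {"USUBJID": "00002", "LBVAL": 182.0, "LBVAL_STAT": "SAMPLE THAWED"},
    # Clean record: empty status flag.
    {"USUBJID": "00003", "LBVAL": 150.0, "LBVAL_STAT": ""},
]

def flagged(recs, variable):
    """Return records whose status flag for `variable` is non-empty."""
    key = variable + "_STAT"
    return [r for r in recs if r.get(key)]

# A sensitivity analysis might be rerun after excluding flagged values.
print(len(flagged(records, "LBVAL")), "record(s) with a flagged LBVAL")
```

The sketch also makes the drawback visible: every flagged variable needs a companion column, which is exactly the maintenance and dataset-size burden noted above.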
4 ANALYSIS RESULTS METADATA
Analysis results metadata define the attributes of a statistical analysis performed
on clinical trial data. Analyses may be tables, listings, or figures included in a study report or regulatory submission. Analyses may also be statistical statements in a report, for example, "The sample size required to show a 20% improvement in the primary endpoint is 200 subjects per treatment arm" or "The active treatment demonstrated a 23% reduction in mortality (p = 0.023) as compared to placebo." Analysis results metadata are designed to provide the reader or reviewer with sufficient information to evaluate the analysis performed. Inclusion of such metadata in FDA regulatory submissions was proposed in 2004 (21) and is included in the CDISC Analysis Data Model V 2.0 (22). By providing this information in a standard format in a predictable location, reviewers can link from a statistical result to metadata that describes the analysis, the reason for performing the analysis, and the datasets and programs used to generate the analysis. Note that analysis results metadata are not part of an analysis dataset, but one attribute of analysis results metadata describes the analysis datasets used in the analysis.

• Analysis Name. – A unique identifier for this analysis. Tables, figures, and listings may incorporate the name and number (Fig. 4 or Table 2.3). Conventions for this name may be sponsor-specific to conform to Standard Operating Procedures (SOPs).
• Description. – Additional text describing the analysis. This field could be used to search for a particular analysis or result.
• Reason. – Planned analyses should be linked to the Statistical Analysis Plan (SAP). Other reasons would include data driven, exploratory, requested by FDA, and so forth.
• Dataset(s). – Names of datasets used in the analysis. If datasets are part of a submission, then a link to the dataset location should be provided.
• Documentation. – A description of the statistical methodology, software used for computation, unexpected results, or any other information to provide the reviewer with a clear description of the analysis performed. Links to the SAP, external references, or analysis programs may also be included.
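The following sketch shows one way such an analysis-results-metadata record could be held and rendered for review. The field names, dataset names (ADSL, ADTTE), and program name are assumptions chosen for illustration; they do not reproduce the CDISC Analysis Data Model specification.

```python
# Hypothetical analysis results metadata for one table in a submission.
analysis_results_metadata = {
    "analysis_name": "Table 2.3",
    "description": "Primary efficacy: treatment comparison of mortality",
    "reason": "Pre-specified in the Statistical Analysis Plan, Section 5.1",
    "datasets": ["ADSL", "ADTTE"],            # assumed analysis dataset names
    "documentation": (
        "Cox proportional hazards model stratified by center; "
        "see program primary_efficacy.sas (hypothetical name)"
    ),
}

def render(meta):
    """Emit the record as a predictable plain-text block a reviewer could scan."""
    for field, value in meta.items():
        if isinstance(value, list):
            value = ", ".join(value)
        print(f"{field.replace('_', ' ').title():15s}: {value}")

render(analysis_results_metadata)
```

Because every analysis is described with the same fields in the same place, a reviewer (or a program) can navigate from a reported result back to its reason, datasets, and documentation.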
5 REGULATORY SUBMISSION METADATA

5.1 ICH Electronic Common Technical Document

In addition to the Common Technical Document described earlier, the ICH has also developed a specification for an electronic Common Technical Document (eCTD), thus defining machine-readable metadata for regulatory submissions. The eCTD is defined to serve as an interface for industry-to-agency transfer of regulatory information while at the same time taking into consideration the facilitation of the creation, review, lifecycle management, and archival of the electronic submission (23). The eCTD uses XML (24) to define the overall structure of the document. The purpose of this XML backbone is two-fold: "(1) to manage meta-data for the entire submission and each document within the submission and (2) to constitute a comprehensive table of contents and provide corresponding navigation aids" (25). Metadata at the submission level include information about the submitting and receiving organization, manufacturer, publisher, ID and kind of the submission, and related data items. Examples of metadata at the document level are versioning information, language, descriptive information such as document names, and checksums used to ensure accuracy.

5.2 FDA Guidance on eCTD Submissions

The FDA has developed a guidance for electronic submission based on the ICH eCTD backbone. As discussed earlier, the ICH does not define detailed specifications for submission of clinical data within the eCTD, but it does provide a "place-holder" for such data in guideline E3, "Structure and Content of Clinical Study Reports," as Appendix 16.4, with the archaic term "INDIVIDUAL PATIENT DATA LISTINGS (US ARCHIVAL LISTINGS)" (26). The FDA eCTD specifies that submitted datasets should be organized as follows (27):

Individual Patient Data Listings (CRTs)
• Data tabulations
  – Data tabulations datasets
  – Data definitions
  – Annotated case report form
• Data listing
  – Data listing datasets
  – Data definitions
  – Annotated case report form
• Analysis datasets
  – Analysis datasets
  – Analysis programs
  – Data definitions
  – Annotated case report form
• Subject profiles
• IND safety reports
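Because the layout above is predictable, it can be checked programmatically before a submission is assembled. The sketch below is purely illustrative; the root path and folder names are assumptions and do not reproduce the exact eCTD directory specification.

```python
# Verify that an assumed submission folder tree contains the expected subfolders.
from pathlib import Path

EXPECTED = [
    "datasets/tabulations",
    "datasets/listings",
    "datasets/analysis",
]

def missing_folders(root):
    """Return expected subfolders that are absent under `root`."""
    root = Path(root)
    return [sub for sub in EXPECTED if not (root / sub).is_dir()]

if __name__ == "__main__":
    for sub in missing_folders("submission_root"):   # hypothetical root folder
        print("missing:", sub)
```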
The FDA Study Data Specification document (28) defines tabulation and analysis datasets and refers to the CDISC Data Definition Specification (Define.XML) for machine-readable dataset metadata in XML (29). This machine-readable metadata and the ICH eCTD are key elements in providing clear communication of the content and structure of clinical datasets and regulatory submissions. These metadata standards allow the regulatory agencies, software developers, and drug developers to create and use standard tools for creating, displaying, and reviewing electronic submissions and clinical datasets. The FDA has developed a viewer that uses the ICH eCTD backbone to catalog and view the components of a submission, thus providing FDA reviewers with a powerful tool to view, manage, and review submissions. Software developers have used the Define.XML standard metadata to develop tools for compiling and viewing patient profiles and viewing tabulation datasets. SAS Institute has developed software to generate XML-based datasets with Define.XML metadata (30) and viewing tools for review and analysis. These first steps demonstrate the power of having clearly defined metadata for clinical research. The adoption and additional specification of these metadata standards will provide the basis for the development of a new generation of tools for review and analysis. Future developments may include protocol authoring tools, Statistical Analysis Plan templates, eCRF and CRF automated database design, automated analysis and reporting, and submission assembly. This tool development by government, drug developers, and software providers will contribute to drug development and approval by enhancing the clear communication of the content and structure of clinical trial data and documents.
REFERENCES

1. U.S. Trademark Registration No. 1,409,260.
2. Mars Climate Orbiter Mishap Investigation Board Phase I Report, 1999: 13. Available: ftp.hq.nasa.gov/pub/pao/reports/1999/MCO report.pdf.
3. The Federal Geographic Data Committee. Available: www.fgdc.gov/.
4. U.S. Geologic Survey, Metadata in Plain Language. Available: geology.usgs.gov/tools/metadata/tools/doc/ctc/.
5. International Conference on Harmonisation. Available: www.ich.org.
6. International Conference on Harmonisation, Organization of the Common Technical Document for the Registration of Pharmaceuticals for Human Use M4, 2004. Available: www.ich.org/LOB/media/MEDIA554.pdf.
7. Providing Regulatory Submissions in Electronic Format - NDAs, FDA Guidance, 1999.
8. D. H. Christiansen and W. Kubick, CDISC Submission Metadata Model, 2001. Available: www.cdisc.org/standards/SubmissionMetadataModelV2.pdf.
9. FDA Study Data Specification, 2006. Available: www.fda.gov/cder/regulatory/ersr/Studydata-v1.3.pdf.
10. CDISC Study Data Tabulation Model Version 1.1, 2005. Available: www.cdisc.org/models/sds/v3.1/index.html.
11. CDISC Analysis Data Model: Version 2.0, 2006. Available: www.cdisc.org/pdf/ADaMdocument v2.0 2 Final 2006-08-24.pdf.
12. FDA Future Guidance List, p. 2. Available: www.fda.gov/cder/guidance/CY06.pdf.
13. Providing Regulatory Submissions in Electronic Format - Human Pharmaceutical Product Applications and Related Submissions Using the eCTD Specifications, FDA Guidance, 2006. Available: www.fda.gov/cder/guidance/7087rev.pdf.
14. SAS 9.1.3 Language Reference: Concepts. Cary, NC: SAS Institute Inc., 2005, pp. 475–476.
15. JANUS Project Description, NCI Clinical Research Information Exchange, 2006. Available: crix.nci.nih.gov/projects/janus/.
16. D. H. Christiansen and S. E. Wilson, Submission of analysis datasets and documentation: scientific and regulatory perspectives. PharmaSUG 2004 Conference Proc., San Diego, CA, 2004: Paper FC04, p. 3.
17. S. E. Wilson, Submission of analysis datasets and documentation: regulatory perspectives. PharmaSUG 2004 Conference Proc., San Diego, CA, 2004.
18. W. C. Smith, Correction of data errors in a large collaborative health study. Joint Statistical Meetings presentation, Atlanta, GA, 1975.
19. SAS 9.1.3 XML LIBNAME Engine: User's Guide. Cary, NC: SAS Institute Inc., 2004.
20. S. E. Wilson, Clinical data quality: a regulator's perspective. DIA 38th Annual Meeting presentation, Chicago, IL, 2002: 20–21. Available: www.fda.gov/cder/present/DIA62002/default.htm.
21. D. H. Christiansen and S. E. Wilson, Submission of analysis datasets and documentation: scientific and regulatory perspectives. PharmaSUG 2004 Conference Proc., San Diego, CA, 2004: Paper FC04, p. 5.
22. CDISC Analysis Data Model: Version 2.0, 2006, p. 22. Available: www.cdisc.org/pdf/ADaMdocument v2.0 2 Final 2006-08-24.pdf.
23. ICH M2 EWG Electronic Common Technical Document Specification V 3.2, 2004, p. 1. Available: estri.ich.org/eCTD/eCTD Specification v3 2.pdf.
24. World Wide Web Consortium (W3C) Extensible Markup Language (XML). Available: www.w3.org/XML/.
25. ICH M2 EWG Electronic Common Technical Document Specification V 3.2, 2004: Appendix 1, p. 1-1. Available: estri.ich.org/eCTD/eCTD Specification v3 2.pdf.
26. ICH Structure and Content of Clinical Study Reports E3, p. 29. Available: www.ich.org/LOB/media/MEDIA479.pdf.
27. Providing Regulatory Submissions in Electronic Format - Human Pharmaceutical Product Applications and Related Submissions Using the eCTD Specifications, FDA Guidance, 2006. Available: www.fda.gov/cder/guidance/7087rev.pdf.
28. FDA Study Data Specifications. Available: www.fda.gov/cder/regulatory/ersr/Studydata-v1.3.pdf.
29. CDISC Case Report Tabulation Data Definition Specification (define.xml). Available: www.cdisc.org/models/def/v1.0/CRT DDSpecification1 0 0.pdf.
30. SAS 9.1.3 XML LIBNAME Engine: User's Guide. Cary, NC: SAS Institute Inc., 2004, pp. 27, 39–43.
CROSS-REFERENCES
Biostatistics
Statistical Analysis Plan
Electronic Submission of NDA
International Conference on Harmonisation (ICH)
Good Programming Practice
METHODS FOR CONDUCT OF RIGOROUS GROUP-RANDOMIZATION

ARTHUR V. PETERSON Jr.
Member, Public Health Sciences, Fred Hutchinson Cancer Research Center, and Professor of Biostatistics, University of Washington

1 INTRODUCTION

Just as with trials randomized by the individual, the randomized assignment in group-randomized trials (GRTs) provides two invaluable advantages that enable unambiguous conclusions to be reached from the trial: avoidance of selection and accidental biases (and consequent biases in the reported effect sizes), and provision of a basis for the statistical inferences that draw conclusions from the study. Moreover, the basis for statistical inference that randomization provides does not rely on any distributional or other assumptions.

To realize the invaluable advantages that randomization enables, careful attention to trial conduct is needed from beginning to end of the GRT. For example, both sample size determination in the design phase at the start of the GRT and choice of statistical methods in the analysis phase at the end of the GRT must account for the intraclass correlation and the imprecision of its estimate. Likewise, achieving the scientific rigor enabled by the randomization in GRTs requires attention during the period of trial execution. There are four key requirements: (1) maintaining research collaboration with each recruited group for the duration of the trial, (2) maintaining the randomization assignment of intervention or control condition for each group throughout the trial, (3) achieving a high location rate for study participants at endpoint, and (4) achieving a high response rate at endpoint data collection.

Special challenges to meeting these requirements are inherent to the very nature of GRTs. These challenges generally arise from factors pertaining to the group-directed intervention and the nature of the group. The group-targeted interventions in GRTs are commonly prevention interventions (disease prevention or risk-factor prevention), and they are commonly targeted at relatively healthy populations. These two features have big consequences for GRTs. First, the rate of endpoint (disease or risk factor) occurrence is relatively low. As a result, GRTs generally require a moderate number of groups (10-50) and typically have large numbers of study participants (1,000-10,000+). GRTs are also typically long (5-20+ years), which makes follow-up of study participants and maintaining research relationships with collaborating organizations especially challenging. Also, the large size and long duration typical of most GRTs mean an expensive trial and thus a low tolerance for degradation of scientific rigor. In addition, the healthy nature of the study population in most GRTs means that (1) research participation may have low salience or importance to study participants, and (2) the intervention must be designed to be easy to do. As a result of (1), there are special challenges for achieving high response rates for endpoint data collection, and for intervention compliance among those in the experimental organizations. As a result of (2), intervention contamination in the control organizations can easily happen, and so avoiding it is a challenge.

It is critical for scientific rigor that these challenges be overcome. To do so, the challenges must first be recognized. Then, appropriate methods for addressing them must be planned in advance. Finally, these methods must be meticulously implemented throughout the trial. For each of the four requirements for trial execution (maintaining the research collaboration, maintaining the randomized assignment, achieving a high location rate, and achieving a high response rate), this article summarizes the challenges and some principles for addressing them and provides examples of methods for applying these principles in the conduct of GRTs.
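As a concrete illustration of why the intraclass correlation (ICC) must be accounted for in sample size determination, the following sketch applies the standard design-effect formula, DEFF = 1 + (m - 1) x ICC for groups of size m. The numbers are hypothetical, and the calculation deliberately ignores the additional allowance needed for the imprecision of the ICC estimate mentioned above.

```python
# Standard design-effect arithmetic for cluster (group) randomization;
# not taken from this article, shown only to illustrate the ICC's impact.
def design_effect(cluster_size, icc):
    """Variance inflation factor relative to individual randomization."""
    return 1 + (cluster_size - 1) * icc

def groups_per_arm(n_individual, cluster_size, icc):
    """Groups per arm needed to match an individually randomized sample size."""
    inflated_n = n_individual * design_effect(cluster_size, icc)
    return inflated_n / cluster_size

# Hypothetical inputs: 400 subjects per arm would suffice under individual
# randomization; groups contribute 100 participants each; ICC = 0.02.
print(round(design_effect(100, 0.02), 2))        # 2.98: variance nearly triples
print(round(groups_per_arm(400, 100, 0.02), 1))  # about 11.9 groups per arm
```

Even a small ICC inflates the required sample size substantially when groups are large, which is why GRTs need a moderate number of groups and large numbers of participants.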
2 MAINTAINING THE RESEARCH COLLABORATIONS FOR THE DURATION OF THE TRIAL

Maintaining the research collaborations with participating organizations is essential for maintaining scientific rigor. Because an organization's dropping out of the study is not determined at random, but is instead determined by selection due to circumstances, it introduces selection bias into the results and weakens the (randomization-generated) basis for statistical inference. Moreover, the consequences of one organization dropping out from a GRT are far more severe than one study participant dropping out from an individual-randomized trial. Because the number of randomized units in a GRT is small or moderate (10-50 organizations), compared with the typical 100-400 individuals in an individual-randomized trial, loss of even one organization can severely impact the integrity of the trial. Thus, strong efforts are needed throughout the trial to avoid organizational dropout and its associated degradation of scientific rigor. Presented in Table 1 are challenges inherent in GRTs for maintaining research collaboration with participating organizations and principles for addressing the challenges. Of special note is the need to minimize, by
design, the research burden on the organization. Making participation easy for the organization directly addresses challenge #1—participation in research is not a goal of collaborating organizations. It also makes it more possible to overcome challenges #2 and #4—changes in personnel and organizations’ priorities over time. Examples of methods for maintaining research collaboration include the following: (1) Learn about the organizations in advance; identify key stakeholders; and demonstrate knowledge and interest in stakeholders and in their organization, needs, priorities, and challenges. (2) Conduct well-organized, efficient, and highly motivating recruitment meeting(s) with key stakeholders in each collaborating organization. Not only does this meeting serve to motivate potential collaborators, but also it provides them with first-hand evidence that the research group is capable and competent. (3) Begin maintaining the collaboration during recruitment, for example, by emphasizing the research question and randomized design so that any decision to participate is an informed and lasting one, and by working as a team to prepare for collaboration in the event that the organization decides to join the study. (4) Make excellent customer service an integral part of all trial procedures. For example, respond quickly to organizations’ inquiries and requests, and do what you say you will do. (5) Be visible: Communicate regularly with collaborating organizations to update
Table 1. Maintenance of Research Collaborations with Participating Organizations

Requirement: Maintain the research collaborations (support for all research activities) with all participating organizations for the duration of the trial

Challenges
1. Participation in research is not a goal of collaborating organizations, which are busy with their own priorities and mandates.
2. Organizations have multiple stakeholders to be kept informed and satisfied.
3. Organizations' priorities or circumstances change over time.
4. Turnover of personnel within the collaborating organizations over time.

Principles
1. Design the trial and activities to minimize the research burden on the organization.
2. Sell the research.
3. Emphasize the essential role of the organizations as research partners.
4. Meet their needs to be informed, motivated, valued, and to have expectations met.
5. Make maintaining the research collaborations the top priority for the research group.
progress, to alert them to upcoming activities, and to express appreciation for their support. Offer to present in person to key stakeholders on the trial and its progress. (6) Incorporate the principles of maintaining collaborative relationships into all project activities and protocols. (7) Identify and develop simple, low-cost procedures, such as periodic progress-report letters and Secretary Day cards, that serve the sole purpose of building and maintaining collaborative relationships. (8) Coordinate and document all contacts with collaborating organizations.

3 MAINTAINING THE RANDOMIZED ASSIGNMENT

The essential scientific advantages that randomization enables, avoidance of bias and a basis for statistical inference, depend on maintaining each organization's randomized assignment to either the experimental or the control condition. Thus, strong efforts are warranted throughout the trial to minimize at all times the risk of straying from the randomized assignment. Conformance to the randomized assignment could be violated by (1) an organization's nonacceptance, or poor understanding, of the concept of randomized assignment at the start of the collaboration, during recruitment; (2) an organization's nonacceptance of the actual assignment when randomization is performed and the assignment is communicated; (3) intervention contamination in organizations assigned to the control condition; and (4) implementation noncompliance in organizations assigned to the experimental condition. Accordingly, maintaining the randomized assignment requires attention to each aspect. Presented in Table 2 are, for each of these four aspects, the challenges inherent in GRTs for maintaining the randomization and the principles for addressing the challenges. Of special note is the importance of learning about changes in key personnel and motivating the new stakeholders about the crucial role of randomization. A change in key personnel could conceivably cause challenges to occur in all four aspects of maintaining the randomized assignment. GRT managers must be proactive to overcome such challenges.
Examples of methods for maintaining the randomized assignment for the duration of the trial include the following: (1) During recruitment, sell the randomized assignment as the crux of the experimental design and the key to the research's ability to attain scientific rigor and unambiguous conclusions at the trial's end. (For example, "Because of the randomized assignment, any difference in outcome between the control and experimental groups can be attributed to the intervention, and not to other factors that differ among participating organizations.") (2) Have the principal investigator be the one to notify the organization, so that the importance of the randomization is emphasized. (3) Emphasize at every verbal and written contact the importance of the research and the critical role of the randomized assignment in obtaining unambiguous study conclusions. (4) Control access to intervention materials. (5) Make provision whenever practical for control organizations to receive intervention materials after the trial is finished. (6) Avoid choosing as the experimental unit entities that communicate easily with each other, for example, classrooms within a school. (7) In all communications with key personnel in control organizations, reinforce the essential role of control organizations in the research and express gratitude for their support of this often less-than-glamorous role. (8) Try to design the intervention to also meet the existing goals of the organization. (9) Incorporate behavior change theory and principles into the provider training to motivate providers to want to implement the intervention. (10) Meet with implementers periodically to provide visible support and appreciation of their key role and to obtain their critical feedback on the quality of the intervention and its implementation. (11) Publicly acknowledge and thank implementers for their contributions. (12) Use both provider self-reports and project staff observations to monitor provider implementation.

Table 2. Maintenance of the Randomized Assignment

Requirement: Maintain the randomized assignment of experimental/control condition for each participating organization for the duration of the study

Aspect #1: Acceptance (at recruitment) of the concept of randomized assignment

Challenges
1. Randomization is not the usual way organizations make decisions.
2. Each organization tends to have its own preference/hope for the experimental condition that would be best for them.
3. Benefits of the control condition may not be clearly evident to participating organizations.
4. Changes in organizations and key personnel may threaten the understanding and/or support of the randomized assignment.

Principles
1. Motivate randomization as the key element for the study's scientific rigor.
2. Emphasize that it is unknown whether the intervention is effective or not; indeed, the whole purpose of the trial is to find that out.
3. Emphasize importance of both intervention and control organizations to the success of the research.
4. Learn about changes in key personnel and inform them promptly about the trial.

Aspect #2: Acceptance (at randomization) of the communication of the actual randomized assignment

Challenges
1. Organization stakeholders of the randomization may anticipate the randomization result.
2. Changing personnel and circumstances could heighten preference.

Principles
5. Make the performance of the randomization a big deal (it is), and witnessed.
6. Inform organization immediately.
7. Apply principles 1-4 above, again.

Aspect #3: Avoidance of contamination in control organizations

Challenges
1. Participation in study, even as control, may enhance interest in the scientific question, and hence the intervention.
2. Prevention interventions are necessarily easy to do, and so may be easy for control to adopt.

Principles
8. Minimize the opportunity for purloining the intervention.
9. Reinforce the essential role of control organizations in the research.

Aspect #4: Implementation compliance in intervention organizations

Challenges
1. Implementers within the organizations may be reluctant to add to their existing activities.
2. Intervention activities may mean some extra work, often unrewarded by organizations.

Principles
10. Make the intervention easy to do, enjoyable, and useful to them.
11. Motivate and train providers to implement the intervention.
12. Maintain strong, collaborative relationships with implementers as research partners.

4 LOCATING STUDY PARTICIPANTS

Because losses to follow-up (failure to obtain endpoint data from individual study participants in participating organizations) are selected (not random), each loss degrades the scientific integrity of the study: It introduces bias into the results and degrades the basis for inference that the randomization enabled. Thus, to maximize the scientific
rigor of the study, informed and aggressive measures must be taken to minimize loss to follow-up. Because successfully following up a study participant requires both locating the study
participant and, once located, obtaining the endpoint data (compliant response to the endpoint survey), both need to be addressed and successfully accomplished. The first requirement is covered in this section; the second in the next section. Successfully locating a study participant is defined as obtaining an up-to-date address and/or telephone number (where the study participant can be contacted for collecting the trial's endpoint data). Presented in Table 3 are the challenges in locating study participants and principles for addressing the challenges. Examples of methods for applying these principles (see Refs. 1-3) include the following: (1) Collect at the start of the trial the names, addresses, and telephone numbers of study participants, their parents, and close friends or relatives who would know their whereabouts if they moved away. (2) First activities for updating addresses: periodic "soft-tracking" letters sent to study participants (which ask for no reply from the participant) that use the U.S. Postal Service's "Address Service Requested" endorsement. (3) Use the National Change of Address Service to obtain
new addresses of study participants and parents, close friends, and relatives. (4) Contact parents, friends, and relatives, as needed. (5) Keep detailed records of all directory information changes and all contact attempts: date and time, whom contacted, relationship to study participant, and response/information obtained. (6) Use Directory Assistance (from more than one telephone company). (7) Use online people-search engines on the Internet. (8) Use Telephone Append & Verify, a service arranged by the U.S. Postal Service in cooperation with telephone companies to provide phone numbers that go with specific addresses. (9) If all else fails, use publicly or privately available databases. Deserving special emphasis is method (4)—contacting parents, friends, and relatives. Although time-consuming, it is the method most likely to yield accurate information.
Table 3. Locating Study Participants at Endpoint
Requirement: Locate at endpoint the study participants enrolled at the start
Challenges
1. The long-term nature of GRTs (prevention studies, rare endpoints) requires follow-up over many years.
2. Our society is highly mobile.
3. Participation is not a high priority in the lives of study participants.
4. Organizations may not keep track of the new addresses of those who leave.
5. The USPS will forward mail to a new address for only 6 months following a move, and will provide a new address to a sender for only 12 months.
6. Telephone companies provide new telephone numbers for only a limited period of time (6 months to 1 year).
7. Our society is skeptical about contact from unknown callers.
Principles
1. Collect locator information at the outset of the trial.
2. Update locator information periodically throughout the trial.
3. Use a friendly (yet professional) approach with study participants at all times and in all circumstances.
4. Keep a record for each study participant of all tracking attempts.
5. Don’t limit the number of tracking attempts.
6. Use multiple modes (mail, telephone, email, cell phones).
5 HIGH RESPONSE RATE TO ENDPOINT SURVEY
Together with achieving high location rates, achieving high response rates to the endpoint survey is necessary for successful follow-up of study participants. In contrast to the other requirements for the conduct of GRTs, the challenges for achieving high response rates are common to both group-randomized trials and individual-randomized trials. The challenges, and the principles and methods for addressing them, have been well established over many years [see Don Dillman’s useful text (4)]. Shown in Table 4 are the challenges, inherent in GRTs and other trials, for achieving high response rates for the endpoint survey, and principles for addressing these challenges. Some examples of methods for achieving these follow. To build trust, when contacting participants, always (1) provide the name of the research organization, (2) remind the participant about his/her past participation, and (3) assure participants that their participation in the survey is voluntary and confidential. To make surveys easy to do, (4) keep the questionnaire short (<20 minutes to complete), design it to have an inviting appearance, and include a memorable cover design. (5) Keep the survey at a low reading level, with easy skip patterns, and with a salient first question to interest participants. (6) Make multiple contacts to initial nonresponders, using a variety of communication modes—mail, telephone, and others. To motivate participants, (7) emphasize the importance of the research and their personal participation; (8) use prepaid monetary incentives and, as needed, promised monetary incentives; and (9) develop a highly motivating cover letter that is short, to-the-point, and signed. To be opportunistic, (10)
administer the survey whenever contact with a study participant is made (i.e., perform tracking and survey administration as one step whenever opportunity arises).
Table 4. High Response Rate at Endpoint Survey
Requirement: Obtain responses from located study participants
Challenges
1. Research participation is not a high priority in the lives of study participants.
2. Most people in our society live busy lives.
3. Our society is skeptical about contact from unknown callers.
Principles
1. Build trust.
2. Make survey completion easy and interesting to do.
3. Motivate participants to want to comply with requests for information.
4. Be pleasantly persistent.
5. Be opportunistic: administer the survey concurrently with successfully locating the study participant (the ‘‘Bird-in-the-Hand’’ principle).
6 REPORTS OF SCIENTIFIC RIGOR
Various trials have illustrated that the inherent challenges of GRTs can be addressed and overcome to achieve the critical requirements for rigorous trial conduct, even when working with multiple organizations. For example, Murray et al. (5) successfully followed up (located and surveyed) 93% of 7124 adolescent study participants 5–6 years after baseline. The Community Intervention Trial (6) successfully maintained collaboration with 22 sizeable communities (populations of 50,000–250,000) for 7 years. The Hutchinson Smoking Prevention Project (HSPP) GRT (7, 8), with 40 collaborating school districts and 8388 study participants, reported (1) maintenance of collaboration by 100% of the 40 school districts for the 12-year duration of the trial; (2) maintenance of the randomized assignment: (a) 100% acceptance (at recruitment) of the concept of randomization, (b) 100% acceptance of the actual randomized assignment of intervention vs. control, (c) 100% avoidance of contamination in the control group, and (d) 99%+ implementation, with 86% implementation fidelity, in the experimental group; (3) 96.6% of study participants located 12 years after enrollment in the study; and (4) a 97.6% data collection response rate among those located. Such reports provide encouraging evidence that proper attention to the challenges and principles for conducting GRTs, together with appropriate methods for conducting the trial, can achieve high levels of scientific rigor.
7 DISCUSSION AND CONCLUSIONS
GRTs are essential for investigating the effectiveness of group-delivered interventions that are so important for public health promotion and disease prevention. The important role and significant monetary costs of GRTs require special attention during both design and execution of the trial to achieve the scientific rigor needed to obtain unambiguous conclusions. Randomization enables scientific rigor, but it cannot guarantee it without the support of careful implementation of trial methods that overcome challenges inherent in GRTs. Essential for realizing the rigor that the design of the GRT enables are the methods used to conduct the trial. The goal of achieving scientific rigor during the conduct of GRTs seems achievable through a combination of (1) attention to the special challenges inherent in GRTs; (2) recognition of the requirements and principles for rigorous trial conduct; (3) advance planning to adopt proven methods, or as needed to develop new ones; and (4) commitment to rigor and meticulous execution of the methods. The importance of informed and persistent follow-up cannot be overemphasized. In this connection, it should be noted that various ‘‘attrition analyses’’— for example, demonstrating that the fraction of lost-to-follow-up is the same in the experimental and control conditions, or that the baseline characteristics of those lost to follow-up do not differ between experimental and controls— cannot mitigate the degradation of scientific rigor that loss-to-follow-up causes. Such attrition analyses cannot overcome the fundamental problem that the endpoints are unknown for those lost to follow-up, and they cannot be inferred from the data on those successfully followed, because of the selected nature of those lost to follow-up. Also, various imputation methods that attempt to minimize the
bias that results from loss-to-follow-up by imputing values for the unknown outcomes rely on unverifiable assumptions and should not be viewed as a substitute for aggressive follow-up of study participants. Some main themes are evident in the collection of principles and methods for conducting GRTs: (1) It is important to identify the challenges at the start of a GRT, and to develop strategies and methods for addressing them. (2) Motivating individuals is important for more than the intervention; it is useful wherever cooperation from people is needed: in recruiting organizations and maintaining their collaboration, in designing the (intervention) provider training, and in designing the research activities (tracking, data collection). (3) Dedicated customer service in all phases of the trial has great benefits for enhancing the scientific rigor of the research. Both the cost and importance of GRTs are high. Thus, a high degree of attention to scientific rigor is warranted, both to realize the return on the sizeable investments (in money, time, and scientific careers) that GRTs entail and, more important, to achieve the unambiguous scientific conclusions needed to advance disease prevention and health promotion for our citizenry. How might the challenges, principles, and methods of GRTs change in the future? It can be expected that the challenges will not decrease, because they derive from characteristics inherent in GRTs: the prevention nature of the intervention, the healthy nature of the population, and the nature of organizations. Indeed, the challenges may well increase if our society continues in the direction of its current trends—being busy, mobile, and generally skeptical about requests for time and information. As for the principles for addressing the challenges, it can be expected that they will develop and change as additional experience with GRTs is gained, as new technologies develop, new services appear, old ones close, and new regulations are enacted, and as researchers continue to identify creative and effective solutions to the challenges encountered in the rigorous conduct of GRTs.
REFERENCES
1. V. R. A. Call, L. B. Otto, and K. I. Spenner, Tracking Respondents: A Multi-Method Approach. Lexington, MA: Lexington Books, 1982.
2. P. L. Ellickson, Getting and keeping schools and kids for education studies. J. Community Psychol. 1994: 102–116.
3. P. L. Pirie, S. J. Thompson, S. L. Mann, A. V. Peterson, D. M. Murray, B. R. Flay, and J. A. Best, Tracking and attrition in longitudinal school-based smoking prevention research. Preventive Med. 1989; 18: 249–256.
4. D. A. Dillman, Mail and Internet Surveys: The Tailored Design Method. 2nd ed. New York: Wiley, 2000.
5. D. M. Murray, P. L. Pirie, R. V. Luepker, and U. Pallonen, Five- and six-year follow-up results from four seventh-grade smoking prevention strategies. J. Behav. Med. 1989; 12: 207–218.
6. The COMMIT Research Group, Community Intervention Trial for Smoking Cessation (COMMIT): I. Cohort results from a four-year community intervention. Amer. J. Public Health 1995; 85: 183–192.
7. A. V. Peterson, K. A. Kealey, S. L. Mann, P. M. Marek, and I. G. Sarason, Hutchinson Smoking Prevention Project (HSPP): A long-term randomized trial in school-based tobacco use prevention. Results on smoking. J. Natl. Cancer Inst. 2000; 92: 1979–1991.
8. A. V. Peterson, S. L. Mann, K. A. Kealey, and P. M. Marek, Experimental design and methods for school-based randomized trials: Experience from the Hutchinson Smoking Prevention Project (HSPP). Controlled Clinical Trials 2000; 21: 144–165.
MICROARRAY
GRIER P. PAGE
University of Alabama at Birmingham, Department of Biostatistics, School of Public Health, Birmingham, Alabama
XIANGQIN CUI
University of Alabama at Birmingham, Department of Medicine, School of Medicine, and Department of Biostatistics, School of Public Health, Birmingham, Alabama
In this review, we highlight some issues an investigator must consider when thinking about using expression arrays in a clinical trial. We present the material in a series of sections. The first section highlights previous uses of microarrays in clinical trials. The second section discusses expression microarrays, whereas other uses for microarrays are discussed in the next section. Generating high-quality microarray data for these applications requires careful consideration of experimental design and conduct. The next sections highlight the steps in the planning and analysis of a microarray study. The fourth section discusses defining the objectives of the study, followed by experimental design, data extraction from images, microarray informatics, single-gene analysis, data annotation, analyses beyond a single gene, microarray result validation, and finally conclusions.
1 INTRODUCTION
Microarrays have been described as ‘‘the hottest thing in biology and medicine since the advent of the polymerase chain reaction a decade ago’’ (Anonymous, 2001). This technology emerged circa 1996 (1,2) and had its first high-profile uses in 1998 and 1999 (3–5). What is ‘‘hot’’ is that microarrays allow the simultaneous measurement of mRNA expression levels for thousands of genes in a sample. DNA microarray studies have already been published for a variety of species, including human (6), mouse (7), and rat (8), and have made major contributions to clinical trials (9,10), basic science, cancer, diabetes, and nutrition, to name just a few of the uses. In the future, we envision the impact of microarrays on research increasing as they are increasingly applied in comparative genomic hybridization (11), genotyping (12), phage display (13), tissue arrays (14), tiling arrays (15), and especially clinical trials. Microarrays have been a part of clinical trials essentially since microarrays were developed. They have been used to differentiate lymphatic cancers, breast cancers, and small blue cell cancers, to name a few examples. They have been used to identify new disease mechanisms for diseases such as diabetes (16), to predict response to drugs (17,18), and to diagnose diseases (19). However, many of these studies, especially the early ones, have not yielded reproducible or generalizable results (20), because of a lack of understanding of the need for careful experimental design and conduct of microarray studies. Clinical trials and microarray studies on clinical samples have been a driving factor in the development and maturation of microarray technology. Whereas most microarray studies have been small(ish) basic science experiments, microarray studies on clinical trial samples, although less numerous, have tended to have many samples, which has provided sufficient power and information to identify the issues with microarrays that we discuss in the rest of this article. The identification of these issues has allowed them to be controlled for and reduced or eliminated. In addition, regulatory and U.S. Food and Drug Administration (FDA) requirements drove the development of the MicroArray Quality Control (MAQC) consortium, which established the validity and reproducibility of array technology. Now that researchers understand the methods and techniques needed to run a microarray study that generates valid results, microarrays can provide great insights into clinical trials. Some potential uses include determining why certain people respond or do not respond to a
treatment, to understand the mechanisms of action of a drug or treatment, to identify new drug targets and, ultimately, to enable personalized medicine. However, microarrays are only successful with careful design, conduct, analysis, and replication. This success is best exemplified by the recent approval by the FDA of the first microarray-based device, the MammaPrint (Molecular Profile Institute, Phoenix, AZ) array.
1.1 MammaPrint
MammaPrint is the first commercial product approved (February 2007) by the FDA for disease prognosis. This product uses an Agilent-based microarray that contains a 70-gene breast cancer signature to classify unfixed tumors from node-negative women under age 61 as low or high risk for recurrence of the disease. The 70-gene breast cancer signature was initially developed by van ’t Veer et al. in a 2002 paper in Nature. The initial study of 78 people identified a signature from a genome-wide expression array (21). That signature was established in a study of 151 people (22). A diagnostic array was then developed from the genome-wide array. This array contains 232 probes that query the 70 signature genes, 289 probes for hybridization and printing quality control, as well as 915 normalization probes (23). This diagnostic array was validated in a study of 302 patients (24). The 70-gene profile was developed in the initial study by identifying the genes most associated with good outcome, called the ‘‘good prognosis template,’’ and then validated with ‘‘leave-one-out’’ cross-validation. For new samples, the 70 profile genes are correlated to the ‘‘good prognosis template’’ with a cosine correlation. A value higher than 0.4 results in a patient being assigned to the good-profile group; otherwise, the patient is assigned to the poor-profile group. MammaPrint was approved by the FDA because of several important steps the group took, which are explored in greater detail below. These steps included the use of a large sample size, replication samples, a good statistical analysis plan, careful quality control, and the results of the MAQC.
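The classification rule just described can be sketched in a few lines of code. This is a toy illustration only: the actual 70-gene signature and ‘‘good prognosis template’’ are proprietary, so the template and patient profile below are simulated stand-ins; only the cosine-correlation rule with the 0.4 cutoff comes from the text.

```python
# Sketch of a MammaPrint-style assignment: correlate a 70-gene profile
# with a "good prognosis template" and apply the 0.4 cutoff.
import numpy as np

def cosine_correlation(x, y):
    """Cosine (uncentered) correlation between two expression vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def classify_profile(patient_profile, good_prognosis_template, threshold=0.4):
    """Return 'good' if the cosine correlation exceeds the cutoff, else 'poor'."""
    c = cosine_correlation(patient_profile, good_prognosis_template)
    return ("good" if c > threshold else "poor", c)

rng = np.random.default_rng(0)
template = rng.normal(size=70)                         # stand-in for the 70-gene template
patient = template + rng.normal(scale=0.8, size=70)    # simulated patient profile
label, c = classify_profile(patient, template)
print(f"cosine correlation = {c:.2f} -> {label}-prognosis group")
```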
2 WHAT IS A MICROARRAY?
Usually, a microarray is a solid substrate (usually glass) on which many different sense or antisense (which depends on the technology) cRNAs or cDNAs have been spotted in specific locations; some technologies instead attach the probes to beads. Normally, many different cDNAs are spotted onto a single array in a grid pattern, and the cDNAs are taken from genes known to code for poly A+ RNA or from EST sequences. First, RNA from a tissue sample is extracted and labeled with fluorescent dyes or radioactive nucleotides. This labeled RNA is then hybridized to the cRNA or cDNA immobilized on the array. The labeled RNA binds to its complementary sequence in approximate proportion to the amount of each mRNA transcript in a sample. The amount of radioactivity or fluorescence can be measured, which allows estimation of the amount of RNA for each transcript in the sample. Alternative uses of microarrays are described in a later section.
2.1 Types of Expression Microarrays
A high-quality microarray study requires high-quality microarrays. The manufacture of high-quality spotted arrays (both long oligos and cDNA) is very time and resource intensive. Failure to generate high-quality arrays can lead to the arrays being no more than an expensive random number generator. The particulars of how to manufacture high-quality arrays are beyond the scope of this review (25). A variety of different microarray technologies, depending on the species being studied, may be available for use. In general, the technologies can be broken down into three general groups: short oligonucleotides (Affymetrix, NimbleGen), long oligonucleotides (Agilent, Illumina, Amersham), and ‘‘spotted’’ cDNA amplicons (NIA, Stanford, etc.). Each particular technology has its advantages and disadvantages. The short oligos have certain advantages in that they are manufactured to (usually) high standards. Because the arrays use the sum of several probes, the quantification of RNA levels seems to be robust. In addition, the probe sequences are known. Also, some
factors could be of disadvantage; specifically they are not particularly robust to polymorphisms. This factor can be a problem if one is working in an atypical species for which an array is not available. Short oligo arrays tend to be relatively expensive with a limited number of species available. However, NimbleGen seems to offer relatively inexpensive custom arrays for any species for which NimbleGen can be provided with sequence information. Long oligos arrays (50–80 mers) offer advantages in that they are robust to some polymorphisms, can often be used across related species, the sequence is known, the spots can be applied at a known concentration, and probes are single stranded. However, long oligos have a high cost of synthesizing and are nonrenewable, but they are often less expensive than short oligo platforms. The complete genome sequence is required to generate robust (unique) oligos with minimal cross-hybridization. Agilent has a very flexible format that allows for the development of highly customized arrays. cDNA amplicons or spotted arrays (4) use sets of plasmids of specific cDNAs, which may be full- or partial-length sequences to generate the sequence to be spotted on an array. The cDNA microarrays allow investigators to print any genes from any species for which a clone can be obtained. cDNA and some long oligo arrays microarrays allow for two samples (usually labeled Cy3 and Cy5) to be hybridized to a single array at a time. The array’s design can be changed rapidly, and the cost per array is low once the libraries have been established. Also, the cDNAs are a renewable supply. Several disadvantages of this method include a variable amount of DNA spotted in each spot and a high (10–20%) drop-out rates caused by failed PCR, failed spots, and possible contamination and cross-hybridization with homologous sequences. As a result, many cDNA arrays are of poor quality. Although microarray technology was originally developed to measure gene expression arrays (26), it has been modified for various other types of applications described in the next section.
2.2 Microarrays Can Generate Reproducible Results
After a series of high-profile papers (27) indicating that different microarray technologies did not generate reproducible results, the FDA led a project entitled the MAQC project. The purpose of the MAQC project was to provide an assessment of the reproducibility, precision, and accuracy of a variety of human microarrays. The MAQC is a study of seven microarray platforms (Applied Biosystems, Affymetrix, Agilent, Eppendorf, GE Healthcare, Illumina, and the NCI Operon array), with each being tested at 3–6 sites and with at least three types of validation (TaqMan, StaRT-PCR, and QuantiGene). Approximately 150 investigators in seven organizations have been involved. In a series of papers in the September 2006 Nature Biotechnology (28–31), the authors revealed that although a range of reproducibility was observed across platform types, microarrays, when conducted with high-quality experimental design, generate highly reproducible results both across platforms and relative to alternative technologies such as reverse transcription polymerase chain reaction (RT-PCR). The most reproducible technologies in both the intrasite and intersite studies were those from Affymetrix, Illumina, and Agilent. These reports greatly increased the confidence of the scientific community in the reliability of microarrays and also encouraged the FDA to approve microarray-based prognostic tools.
3 OTHER ARRAY TECHNOLOGIES
Although microarrays were initially used for expression quantification, they are now applied to the analysis of other types of data. All these types of arrays could potentially be included in a clinical trial.
3.1 Genotyping Using Expression Microarrays
Instead of using mRNA as the hybridization target, DNA can be hybridized to arrays designed for measuring gene expression (expression arrays), such as the Affymetrix expression chips, which have redundant probes for each gene. Because only one copy is in
the genome for most genes, the hybridization difference between two DNA samples will just reflect the amount of DNA used in each sample for most genes. Therefore, redundant probes from the same gene should give similar fold changes across samples unless there is a polymorphism between one probe and the target in a sample. The fold change computed from the probe that contains the polymorphism will be different from those obtained from the other probes in the same gene. This mechanism can be used to identify polymorphisms between samples, and the polymorphism revealed by this method is called a single feature polymorphism (SFP) (32–35). Hybridizations with mRNA as the target can also be used to identify DNA polymorphisms and genotype samples. Similarly, a polymorphism between the two compared samples will cause the affected probe to show a different fold change across samples compared with the nearby probes in the same gene (36). However, this strategy can be affected by some RNA expression properties, such as alternative splicing and RNA degradation.
3.2 Splicing Arrays
When introns are removed (splicing) during transcription, different splicing products can be generated from the same gene because of the alternative usage of introns and exons (alternative splicing). Alternative splicing is becoming recognized as one of the most important mechanisms in the control of gene expression because it generates structurally and functionally distinct proteins. In addition, alternatively spliced transcripts may regulate the stability and function of corresponding transcripts. It is believed that >50% of all human genes are subject to alternative splicing (37). Although the machinery of splicing is well characterized, how splice sites are selected to generate alternative transcripts remains poorly understood. For most alternative splicing events, their functional significance remains to be determined. In recent years, many researchers have been characterizing alternative splicing using microarrays with probes from exons, introns, and intron–exon junctions. The basic idea is that the discordance among multiple probes
from the same gene indicates differences in the transcripts themselves across samples. For example, skipping of an exon in one sample will cause the probes that hybridize to the skipped exon or the related exon–intron junctions to show a different fold change compared with the other probes from the same gene. Hu et al. (38) used a custom oligo array of 1600 rat genes with 20 pairs of probes for each gene to survey various rat tissues. A total of 268 genes were found to have alternative splicing, and 50% of them were confirmed by RT-PCR. Clark et al. (39) monitored global RNA splicing in mutants of yeast RNA processing factors using exon, intron, and exon-junction probes. Two of 18 mutants tested were confirmed by RT-PCR. In addition, they could cluster the mutants based on the indices for intron exclusion and junction formation to infer the function of the associated mutant factors. In humans, Johnson et al. (40) analyzed the global alternative splicing of 10,000 multiexon genes in 52 normal tissues using probes from every exon–exon junction and estimated that at least 74% of human multiexon genes are alternatively spliced. They validated a random sample of 153 positives using RT-PCR and successfully confirmed 73. Other similar experiments were also conducted in humans to investigate alternative splicing in tumor cell lines (41) and various tissues (42), but on a smaller scale.
3.3 Exon Array
Most current designs of short oligo arrays have probes concentrated at the 3′ end of genes because of the better EST support and the commonly used reverse transcription methods starting from the 3′ end. For sequenced genomes, the gene structure is annotated. Every exon, instead of just the 3′ end, can be represented on the array, which is called an exon array. An exon array provides a better representation of the genes. It provides not only information on the gene expression level but also an opportunity to detect alternative splicing by examining the relative expression of the exons across samples. If the relative expression of different exons of the same gene does not agree, then alternative splicing is indicated. It can detect some types of alternative
splicing such as exon skipping, but it is not sensitive to some types of alternative splicing, such as intron retention. In addition to detecting gene expression and alternative splicing, exon arrays can also be used to detect gene copy number difference through comparative genomic hybridization (43).
3.4 Tiling Array—Including Methylation Arrays
The current knowledge of genes and gene expression is mainly based on the study of expressed mRNA and computational analysis of the EST collections. The availability of full genome sequences for several organisms and the advance of microarray technology provide a means, the so-called tiling array, to survey the whole genome for gene expression (44). Tiling arrays contain probes that cover the whole genome (or a few chromosomes) in an overlapping or head-to-tail fashion, or with small gaps. In theory, the expression of any region in the genome can be read out from the expression intensity of the probes. Tiling arrays have been developed for several organisms, including human (45,46), Drosophila (47), Arabidopsis (48,49), and rice (50). The results from the tiling experiments showed that tiling arrays can detect many more expressed regions in the genome than were previously known; however, because of the lack of reproducibility, many false positives may be observed (51,52). In addition, tiling arrays can be used to survey the genome for copy numbers using DNA instead of mRNA as the hybridization target (53). A similar idea has also been used to detect epigenetic modifications of the genome, such as DNA methylation (54,55) and histone acetylation (56). For detecting DNA methylation, the arrays contain probes that correspond to each fragment produced by methylation-sensitive restriction enzymes. After digestion with these enzymes, the different fragments from hypermethylated or hypomethylated regions are enriched by specific adapters and PCR schemes. The enriched samples are then hybridized to tiling arrays or to arrays that cover just the CpG islands in the genome (57).
3.5 SNP Chip
A major task in dissecting the genetic factors that affect complex human disease is genotyping each individual in the study population. The most common polymorphism across individuals is the single nucleotide polymorphism (SNP). Special microarrays are designed for genotyping SNPs, such as the Affymetrix 10k, 100k, and 500k SNP chips and the Illumina SNP chips. These chips have probes centered on the SNP site and have perfect match probes for each allele as well as corresponding mismatch probes (58–60). Each array platform can genotype 10,000 to 500,000 SNPs in a single hybridization. Although each hybridization is relatively expensive, the cost per genotype is low (61,62).
3.6 ChIP-on-Chip
Microarrays can also be used to characterize the binding of proteins to DNA and thereby define gene regulation elements in the genome. The DNA sequences with proteins bound are enriched using chromatin immunoprecipitation (ChIP) and are then compared with regular DNA samples to detect the enrichment of some DNA elements. These DNA elements are associated with protein binding and tend to be involved in the regulation of gene expression and other features related to the binding protein. The microarrays used in ChIP-chip studies are genomic arrays such as tiling arrays. However, because of the large number of arrays that it takes to cover the genome, some special arrays are designed to cover certain regions of the genome, such as the promoter regions, or are designed with large DNA probes (several kb or Mb BAC clones). However, the resolution of the protein binding position decreases as the probe size increases. For reviews, see References 63–65.
3.7 Protein Arrays
So far, most applications of microarrays in biological research are DNA arrays that analyze mRNA expression and DNA genotyping. However, the functional units for most genes are proteins instead of mRNA, which is just a messenger. Researchers have put forth a
large effort to increase the throughput of protein-based assays, such as the proteome technologies. Considerable effort has been spent in developing protein arrays; however, because of the challenging nature of proteins, there has been less success in producing and applying protein arrays. Nonetheless, the research and development of protein arrays is still an active field, with the hope of success similar to that of DNA microarrays in the near future (66,67). Based on the specific use, two major types of protein arrays are available. One type is the analytical array, in which antibodies are arrayed on a solid surface to detect the corresponding proteins in the hybridization solution (68–70). Another type is more functionally oriented and detects protein–protein, protein–DNA, and protein–small molecule interactions. For the protein–protein interaction arrays, proteins are arrayed on the surface, and they can interact and bind to their interaction partners in the hybridization solution (71,72). Protein–DNA interaction arrays are used for detecting the binding sites of some proteins in the genome (73,74). The protein–small molecule interaction arrays are used to identify substrates for proteins (75), drug targets (76), and immune responses to proteins (77–79).
4 DEFINE OBJECTIVES OF THE STUDY
In the next few sections, we will discuss the steps of a microarray study and what an investigator needs to be aware of before, during, and after starting a clinical trial with microarrays. The first step of any experiment, microarray or otherwise, is to define a hypothesis. Although it may seem perfectly obvious to state this, the objectives of a study are likely to be met if that study is designed in such a way that it is consistent with meeting those objectives, and a study can only be designed to meet the objectives if the objectives can be articulated clearly prior to initiating the study. It is not uncommon for investigators who have conducted a microarray study to be unable to state the objective(s) of their study clearly (e.g., to identify genes that are differentially expressed in
response to a treatment or genes that may be responsible for differential response to treatments). Often, investigators have inadvertently allowed themselves to go down the slippery slope from hypothesis-generating studies to studies that might be termed ‘‘objective generating.’’ By objective generating, we mean studies in which the investigator has initiated a microarray study without a clear conception of the exact objectives of the study, in the sole hope that, by some mysterious process, the mass of data from the microarray study will make all things clear. We do not believe that such outcomes are likely. However, that is not to say that the experimental objectives may not include the generation of new objectives/hypotheses on interesting pathways/genes that may be highlighted as a result of the microarray study, or that they may not be very broadly defined. Thus, we urge investigators to articulate clearly what they hope to obtain from microarray studies so that the studies can be designed to meet those objectives from the beginning; in other words, researchers should make a hypothesis.
5 EXPERIMENTAL DESIGN FOR MICROARRAY
After stating a hypothesis, the most important step of an experiment is the experimental design. If an experiment is designed well, then the choice of analytical methods will be defined and the analysis plan will follow. Several essential principles must be kept in mind when designing a microarray experiment. These include avoidance of experimental artifacts; elimination of bias via use of a simultaneous control group, randomization, and (potentially) blinding; and reduction of sampling error via use of replication, balanced design, and (where appropriate) blocking.
5.1 Avoidance of Experimental Artifacts
Microarrays and most laboratory techniques are liable to nonbiological sources of variation, including run day, chip lot, reagent lot, day of extraction, the clinic the samples came from, and personnel (‘‘post doc’’ effects). In many cases, these sources of variation are larger than
the biological variation (80–84). If nonbiological and biological differences are confounded, then the experiment can be essentially meaningless. Thus, careful consideration and identification of all such factors must take place before starting a study. These factors must then be eliminated by making the experimental conduct more homogeneous or controlled by randomization and blocking.
5.2 Randomization, Blocking, and Blinding
Although blocking can be used to control for measured or known confounding factors, such as the number of samples that can be run in a day, randomization of varieties/samples/groups and random sampling of populations are very useful for reducing the effect of unmeasured confounding factors (81,85,86), such as differences in weather and interruptions in sample delivery. Microarray experiments can require multiple levels of randomization and blocking to help minimize unanticipated biases. For example, if only four samples can be processed in a day and there are two experimental groups, then two samples from each treatment group can be run each day (blocking), but the two samples are randomly selected from all samples in that experimental group. Proper randomization and blocking can greatly reduce the bias of studies. Blinding should, of course, be a part of the conduct of any clinical trial, but blinding may also be appropriate in array studies at the sample collection and processing steps. Unintentional biases can be introduced by individuals collecting samples; for example, margins may be cleaned more carefully in one group compared to another. In addition, more care may be paid to the processing of one treatment group over another. All of these may cause bias, and if possible blinding should be used in microarray experiments.
5.3 Replication
Individuals within a population vary, and all measurement tools such as microarrays measure with some error; thus, a single sample cannot be used to make generalizable inferences about a group or population. Replication of microarrays is needed at several levels.
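As a concrete illustration of the blocking-and-randomization scheme described in Section 5.2 (four arrays per processing day, two randomly chosen samples from each of two groups per day), a minimal sketch follows. The sample labels and group sizes are hypothetical.

```python
# Sketch of a randomized block design for sample processing:
# each day is a block of four arrays, two from each group,
# with samples and within-day processing order randomized.
import random

random.seed(42)
treated = [f"T{i:02d}" for i in range(1, 13)]   # hypothetical treated samples
control = [f"C{i:02d}" for i in range(1, 13)]   # hypothetical control samples
random.shuffle(treated)
random.shuffle(control)

schedule = []
for day in range(len(treated) // 2):
    block = treated[2 * day:2 * day + 2] + control[2 * day:2 * day + 2]
    random.shuffle(block)                        # randomize processing order within the day
    schedule.append((f"day {day + 1}", block))

for day, block in schedule:
    print(day, block)
```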
5.3.1 Types of Replication.
Replication in the context of microarrays can be incorporated at several levels: (R1) gene-to-gene, where genes can be spotted multiple times per array; (R2) array-to-array, where mRNA samples can be used on multiple arrays and each array is hybridized separately; and (R3) subject-to-subject, where mRNA samples can be taken from multiple individuals to account for inherent biological variability. The first two types of replication are referred to as technical replication. The first type measures within-array variation, whereas the second type measures between-array variation. These types of replication are important for assessing the measurement error and reproducibility of microarray experiments and are extremely useful for improving the precision of measurements. On the other hand, the third type of replication allows us to assess the biological variation within populations and thereby to make biologically interesting observations. R1 technical replicates cannot substitute for biological replicates (R3). Although R2 technical replicates have a specific role when the cost of samples is far larger than that of arrays, an experiment cannot be run only with R2 replicates, and biologically generalizable results cannot be obtained from one (87).
5.3.2 Replication, Power, and Sample Size.
Sample size has a major impact on how confidently genes can be declared either differentially expressed (sensitivity and power) or not differentially expressed (specificity) (80,88). Sample sizes can be determined in a variety of ways. One way is to use traditional statistical power analysis programs such as PS, which require the following inputs: power (1-beta), significance (alpha), a measure of variation (standard deviation), and a detectable difference (delta). As an example, detecting a 1/2 standard deviation (SD) difference with 80% power at a Bonferroni-corrected significance level of α = 0.05 requires a sample size of over 250 per group (a worked sketch appears after Section 5.5), which is not normally achievable in microarray experimentation for budgetary reasons. Another approach, which we believe is more appropriate, is to choose the sample size based on control of the false discovery rate (FDR) (89) and the expected discovery rate (EDR). The FDR is an estimate of the expected proportion of
genes declared significant that are in fact not differentially expressed [i.e., that are ‘‘false discoveries’’ (90,91)]. The EDR is the expected proportion of genes in which true differences between conditions exist that are found to be significantly different. This approach has been developed and applied in the PowerAtlas (92–94) (www.poweratlas.org). In addition, the PowerAtlas allows an investigator either to upload their own pilot data or to choose from among over 1000 public microarray experiments to use as pilot data for estimating sample size.
5.4 Practice, Practice, Practice
No experimenter runs every step of a microarray experiment perfectly the first time. A learning curve is observed for all steps, and the learning process is a confounding factor. Training at all steps is necessary, from sample collection to RNA processing, hybridization, and analysis. Thus, all the individual steps should be practiced before running an experiment, and new people who will handle samples need to be trained to sufficient standards before they run ‘‘real’’ samples. Resources spent on training are not wasted, and training samples should not be included in a ‘‘real’’ experiment.
5.5 Strict Experimental Practices
Because microarray experiments are liable to many nonbiological sources of error, it is critical to conduct microarray studies that follow very strictly defined protocols. For example, know exactly what types of samples are and are not acceptable, what cut of the samples is needed, what protocol will be used to extract samples, what analyses will be used, what constitutes good-quality RNA, and so on before a study is started. Consider a microarray study like a clinical trial, in which the researcher must disclose all steps in full before starting the trial. Deviations from these protocols are strongly discouraged for fear of introducing biases.
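The sample-size figure quoted in Section 5.3.2 can be reproduced with a minimal sketch of the Bonferroni-style power calculation. The 20,000-gene array size is an assumption for illustration; the PS program and the FDR/EDR-based PowerAtlas cited above are the tools the text actually names.

```python
# Sketch of the Section 5.3.2 calculation: two-sided two-sample t-test,
# 80% power, effect size of 1/2 SD, Bonferroni-corrected alpha.
from statsmodels.stats.power import TTestIndPower

n_genes = 20_000                        # assumed number of genes/tests on the array
alpha_bonferroni = 0.05 / n_genes       # Bonferroni-corrected per-gene alpha
n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,                    # difference of 1/2 standard deviation
    alpha=alpha_bonferroni,
    power=0.80,
    alternative="two-sided",
)
print(f"required sample size: about {n_per_group:.0f} per group")
# -> roughly 250 per group, matching the figure quoted in the text
```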
6 DATA EXTRACTION
Once a microarray experiment has been conducted and the image of an array is obtained,
several steps must occur to convert the image to analyzable data, and the methods are specific to each technology.
6.1 Image Processing from cDNA and Long Oligo Arrays
Image processing consists of three steps: addressing, segmentation, and information extraction.
6.1.1 Gridding/Addressing.
Typically, a microarray is an ordered array of spots with constant separation between rows and columns; the grid and spot layout should be the same throughout the microarray. Addressing is the process of finding the location of spots on the microarray, or assigning coordinates to each spot on the microarray. However, the spotting is rarely perfect, and variations must be dealt with. Although software usually does a good job, manual review and occasional intervention result in better data, but this is a very time-consuming process.
6.1.2 Segmentation.
Segmentation is the most important and also the most difficult part of the image analysis. In this process, each image pixel is classified as either signal or background noise. The popular methods of segmentation use fixed circles, adaptive circles, adaptive shapes, or the histogram method. The first two methods provide the means to separate the circular spots from the background by clearly defining the boundaries of the spots. A variety of comparisons of the methods has been published, with no clear winner (95,96).
6.1.3 Information Extraction.
In the final step of image analysis, the mean and median values of the spot intensities and the background intensities are calculated. Usually, correlations between spot intensities, the percentage of spots without any signal and their distribution, the signal-to-noise ratio (SNR) for each spot, and the variation of the pixel intensities are also calculated. The spot intensities are then measured as the sum of intensities of all the pixels inside the spot.
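A minimal sketch of the segmentation and information-extraction steps just described follows. The image tile and fixed-circle foreground mask are simulated; real pipelines operate on scanned array images, and the summary statistics shown (spot sum, mean, median, background, and SNR) follow the quantities listed above.

```python
# Summarize one spot from a simulated image tile given a segmentation mask.
import numpy as np

def extract_spot(tile, mask):
    fg = tile[mask]          # pixels classified as spot signal
    bg = tile[~mask]         # pixels classified as local background
    return {
        "fg_sum": float(fg.sum()),
        "fg_mean": float(fg.mean()),
        "fg_median": float(np.median(fg)),
        "bg_median": float(np.median(bg)),
        "snr": float((fg.mean() - bg.mean()) / (bg.std() + 1e-9)),
    }

rng = np.random.default_rng(1)
tile = rng.normal(loc=100, scale=10, size=(21, 21))       # background noise
yy, xx = np.mgrid[:21, :21]
mask = (yy - 10) ** 2 + (xx - 10) ** 2 <= 6 ** 2          # fixed-circle segmentation
tile[mask] += 900                                          # add simulated spot signal
print(extract_spot(tile, mask))
```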
6.2 Image Analysis of Affymetrix GeneChip Microarrays
Affymetrix GeneChips are the most widely used oligonucleotide arrays. Unlike the other systems, the Affymetrix system represents each gene as 11–20 probe pairs. Each probe pair is composed of a 25 base pair perfect match (PM) probe that represents the gene’s known sequence and a mismatch (MM) probe that differs from the PM by the middle base. The expression level is some function of the average difference in intensity between PM and MM over the probe pairs. Several algorithms have been developed for averaging the probe pairs to yield a final quantification. These include dChip (97), GCRMA-EB and GCRMA-MLE (98), MAS5 (99), PDNN (100), and RMA (101,102), all of which have different measurement properties, and it is not yet clear which is best (103). Other technologies such as Illumina and NimbleGen have their own image analysis steps as well.
6.3 Normalization of DNA Data
One of the early and near universal steps in a microarray study is the use of a technique called either normalization or transformation. Normalization has at least two purposes: to adjust microarray data for effects that develop from variation in the technology rather than from biological differences between the RNA samples or between the printed probes (104), and ‘‘aiding in the analysis of the data by bending the data nearer the Procrustean bed of the assumptions underlying conventional analyses’’ (105), which allows for reliable statistical and biological analyses. The former is really more an adjustment for measured covariates such as dye biases, whereas the latter is the true meaning of normalization. Either way, a wide variety of methods has been developed for all meanings of normalization, including several varieties of linear models (106), loess (104), quantile-quantile (107), log2 transformation, and others (108,109). Normalization is usually required in cDNA microarray experiments to reduce dye biases. This area still requires active research, and it is not clear which methods are appropriate for each chip and experimental design.
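As one concrete example of the normalization techniques cited above, a minimal quantile-normalization sketch follows. The data are simulated log2 intensities, and this is only one of the many normalization methods mentioned (loess, linear models, and others).

```python
# Quantile normalization: force every array to share the same intensity
# distribution by averaging across arrays at each rank.
import numpy as np

def quantile_normalize(x):
    """x: genes x arrays matrix; returns a quantile-normalized copy."""
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)   # rank of each value within its array
    mean_by_rank = np.sort(x, axis=0).mean(axis=1)      # reference distribution
    return mean_by_rank[ranks]

rng = np.random.default_rng(2)
arrays = rng.lognormal(mean=6, sigma=1, size=(1000, 4))  # simulated raw intensities
arrays[:, 2] *= 1.5                                      # simulate one brighter array
norm = quantile_normalize(np.log2(arrays))
print(np.round(norm.mean(axis=0), 3))                    # column means now agree
```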
7 MICROARRAY INFORMATICS
For many investigators, microarrays will present the largest data storage, analysis, and informatics hurdle they will face in their research careers. Microarrays generate huge amounts of data, which can make data storage and handling difficult. In addition, data reporting standards are being enforced by many journals for publications (110,111), and the NIH has started to make data sharing mandatory for certain grants. Microarray experiments generate volumes of data that many biological researchers may not be accustomed to. A single Affymetrix chip will generate about 50 MB of data. After initial processing, each chip will provide thousands to tens of thousands of numbers per array for analysis. After analysis, summary statistics, such as changes in expression and associated significance probabilities, will be available for all genes on the chips. Sorting through significance tests for tens of thousands of genes ‘‘manually’’ and trying to deduce a biological meaning is a Sisyphean task because of the dimensionality of the data and the speed at which new information is generated. These data can be overwhelming. Before an experiment has begun, consideration should be paid to how data are stored, viewed, and interpreted (112).
8 STATISTICAL ANALYSIS
Three types of single-gene analyses are typically conducted on microarray data. Class prediction involves building models to predict the group to which samples should be assigned. This method is often used in clinical trials, for example, to develop profiles that predict poor prognosis of cancer (21) or to differentiate among pathologically similar samples (113). The second set of analyses is class discovery, which involves the unsupervised analysis of data to identify previously unknown relationships between genes or samples. The final type is class differentiation, which usually involves inferential statistical analysis.
8.1 Class Prediction Analysis
We use the term prediction to define the construction of a model that uses gene expression
experiments to classify objects into preexisting known classes, to develop indexes that can serve as biomarkers, or to predict to which class a sample should be assigned (114,115). Many methods can be used to construct such scores (116–118), and it is not clear which technique is best, but the goal of all methods is to find the best compromise between complexity and simplicity in the model. As the prediction model becomes more and more complex by using more and more of the sample information, its apparent predictive ability in the sample at hand will increase; however, the sample data contain not only information about the true structure of the data but also ‘‘noise’’ caused by sampling variation. Thus, great care must be taken in the model building (119–121). To build models that will predict new samples well, one must build cross-validated models. Cross-validation requires that one have sufficient data to hold some data out of the estimation process so that one can subsequently check how well the model predicts the held-back data. For cross-validation to be accurate, the held-back data used in the cross-validation must not have been used in selecting the structure of the model used for prediction or the parameters that go into that model (122,123). This requirement has often been violated in microarray research.
8.2 Class Discovery Analysis
Since Eisen et al. (4) first applied hierarchical clustering to microarray data analysis in 1998, cluster analysis has emerged as a prevalent tool for the exploration and visualization of microarray data. A variety of cluster methods are available, which include hierarchical and nonhierarchical methods. Among hierarchical methods are agglomerative and divisive methods. For the hierarchical methods, different ways can be used to measure distance between genes, including Pearson’s correlation, Euclidean distance, and Kendall’s tau, as well as a variety of methods for linking genes based on their distance, including average, single, complete, and median linkage. Several nonhierarchical methods exist, including K-nearest neighbors, self-organizing maps, and related techniques such as
support vector machines and singular value decomposition. Each method and approach has its own positive and negative aspects that should be evaluated. Clustering is a commonly used tool for microarray data analysis, but unlike other statistical methods, it has no theoretical foundation that provides the correct answer. This problem leads directly to several related criticisms of cluster analysis. First, cluster algorithms are guaranteed to produce clusters from data, no matter what kind of data has been used. Second, different methods can produce drastically different results, and the search for the best choice among them has just begun (124,125). Third, no valid method is available to establish the number of clusters in nonhierarchical cluster analysis. Therefore, caution is required when performing such analyses, and one should avoid overinterpreting the results; however, cluster analysis is good for providing exploratory, descriptive analyses and concise displays of complex data.
8.3 Class Differentiation Analysis
One of the main tasks in analyzing microarray data is to determine which genes are differentially expressed between two or more groups of samples. This type of analysis is a conventional hypothesis test. For making inferences, virtually any statistical method can be used, including the t-test, analysis of variance, and linear models. A variety of Bayesian methods and information-borrowing approaches, such as Cyber-T (126) and SAM (90), have been developed. Because of the small sample sizes, it is often useful to employ variance-shrinkage-based methods for more robust estimation (127,128).
8.3.1 Adjusting for Multiple Testing.
Because each microarray can contain thousands of genes, some adjustment for multiple testing is required to avoid many false positive results. One way is to control the family-wise error rate (FWE), which is the probability of wrongly declaring at least one gene as differentially expressed. A Bonferroni correction is a method to adjust P-values from independent tests. Permutation methods can be used to control the FWE in the presence of nonindependent tests (129). Another approach
to address the problem of multiple testing is the false discovery rate (FDR), which is the proportion of false positives among all genes initially identified as being differentially expressed (89,130). In addition, a variety of Bayesian and alternative FDR methods has been developed (91).
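A minimal sketch of the gene-by-gene class differentiation analysis and the multiple-testing adjustments described in Sections 8.3 and 8.3.1 follows. The data are simulated, and the Bonferroni and Benjamini–Hochberg FDR procedures shown are only two of the adjustment methods cited.

```python
# Per-gene two-group t-tests with Bonferroni and FDR adjustment.
# Simulated data: 2000 genes, 8 vs. 8 samples, first 100 genes truly different.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(3)
n_genes, n_per_group = 2000, 8
group1 = rng.normal(size=(n_genes, n_per_group))
group2 = rng.normal(size=(n_genes, n_per_group))
group2[:100] += 1.5                          # true differences for 100 genes

t, p = stats.ttest_ind(group1, group2, axis=1)

bonf_sig = multipletests(p, alpha=0.05, method="bonferroni")[0]
fdr_sig = multipletests(p, alpha=0.05, method="fdr_bh")[0]
print("Bonferroni hits:", bonf_sig.sum(), " BH-FDR hits:", fdr_sig.sum())
```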
9 ANNOTATION
The gene-by-gene statistical analysis of microarray data is not the end of a study by any stretch of the imagination. The next step is the annotation of the expression data. The amount of information about the functions of genes is beyond what any one person can know. Consequently, it is useful to pull in information on what others have discovered about genes to interpret an expression study fully and correctly. A variety of tools can be used, such as array manufacturers’ web sites, KEGG (Kyoto Encyclopedia of Genes and Genomes) (131,132), Gene Index (133,134), Entrez Gene, MedMiner (135), DAVID (Database for Annotation, Visualization and Integrated Discovery), and Gene Ontology (136). Each database and tool has slightly different data, and one should use multiple databases when annotating. Also be aware that databases can differ with respect to the same information.
10 PATHWAY, GO, AND CLASS-LEVEL ANALYSIS TOOLS
Analysis of microarray experiments should not stop at a single gene; rather, several approaches can be used to get a picture beyond a single gene. These tools are called by a variety of names, including pathway analysis, gene class testing, global testing, entrez testing, or GO (Gene Ontology) analysis. The goal of all these tools is to relate the expression data to other attributes, such as cellular localization, biological process, molecular function, or a pathway, for individual genes or groups of related genes. The most common way to analyze a gene list functionally is to gather information from the literature or from databases that cover the whole genome. In recent years, many
tools have been developed to assess the statistical significance of association of a list of genes with GO annotation terms, and new ones are released regularly (137). There has been extensive discussion of the most appropriate methods for the class-level analysis of microarray data (138–140). The methods and tools are based on different methodological assumptions. Two key points must be considered: (1) whether the method uses gene sampling or subject sampling, and (2) whether the method uses competitive or self-contained procedures. The subject sampling methods are preferred, and the competitive versus self-contained debate continues. Gene sampling methods base their calculation of the P-value for the gene set on a distribution in which the gene is the unit of sampling, whereas the subject sampling methods take the subject as the sampling unit. The latter is preferable because the subject, not the gene, is typically the unit of randomization in a study (141–143). Competitive tests, which encompass most existing tools, test whether a gene class, defined by a specific GO term, pathway, or similar grouping, is overrepresented in the list of differentially expressed genes compared to a reference set of genes. A self-contained test compares the gene set with a fixed standard that does not depend on the measurements of genes outside the gene set. Goeman et al. (144,145), Mansmann and Meister (141), and Tomfohr et al. (143) applied the self-contained methods. These methods are also implemented in SAFE and Globaltest. Another important aspect of ontological analysis, regardless of the tool or statistical method, is the choice of the reference gene list against which the list of differentially regulated genes is compared. Inappropriate choice of reference genes may lead to false functional characterization of the differentiated gene list. Khatri and Draghici (146) pointed out that only the genes represented on the array should be used as the reference list, instead of the whole genome, as is common practice. In addition, correct, up-to-date, and complete annotation of genes with GO terms is critical; the competitive and gene-sampling-based procedures tend to have better and more complete databases. GO allows annotation of genes at different levels of abstraction due to the directed acyclic graph structure
GO allows annotation of genes at different levels of abstraction because of the directed acyclic graph structure of the GO. In this hierarchical structure, each term can have one or more child terms as well as one or more parent terms. For instance, the same gene list may be annotated with a more general GO term such as "cell communication" at a higher level of abstraction, whereas the lowest level provides a more specific ontology term such as "intracellular signaling cascade." It is important to integrate the hierarchical structure of the GO into the analysis because different levels of abstraction usually give different P-values. The large number (hundreds or thousands) of tests performed during ontological analysis may lead to spurious associations just by chance, so correction for multiple testing is a necessary step. Other analyses look beyond single genes, such as coexpression (147), network analysis (148,149), and promoter and transcriptional regulation (150,151).
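As an illustration of the competitive style of test described above, the following sketch scores gene sets for overrepresentation among differentially expressed genes with Fisher's exact test, restricts the reference list to genes on the array, and applies a Benjamini-Hochberg correction across the gene sets. The gene identifiers and set definitions are made up for illustration; this is not the implementation used by any specific tool.

```python
# A minimal sketch of a competitive gene-class (overrepresentation) test:
# for each GO term or pathway, a 2x2 table compares differentially expressed
# genes with the reference list of all genes on the array. The gene sets
# below are hypothetical placeholders.
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

array_genes = {f"gene{i}" for i in range(1, 2001)}   # reference: genes on the array
de_genes = {f"gene{i}" for i in range(1, 101)}       # differentially expressed genes
gene_sets = {                                        # hypothetical GO/pathway sets
    "GO:cell_communication": {f"gene{i}" for i in range(50, 400)},
    "GO:intracellular_signaling": {f"gene{i}" for i in range(80, 160)},
}

terms, pvals = [], []
for term, members in gene_sets.items():
    members = members & array_genes              # restrict to genes actually on the array
    a = len(de_genes & members)                  # DE and in set
    b = len(de_genes - members)                  # DE, not in set
    c = len(members - de_genes)                  # not DE, in set
    d = len(array_genes - de_genes - members)    # not DE, not in set
    _, p = fisher_exact([[a, b], [c, d]], alternative="greater")
    terms.append(term)
    pvals.append(p)

# Correct for the many gene sets tested (Benjamini-Hochberg false discovery rate).
reject, p_adj, _, _ = multipletests(pvals, method="fdr_bh")
for term, p, q, sig in zip(terms, pvals, p_adj, reject):
    print(f"{term}: p={p:.3g}, FDR-adjusted p={q:.3g}, enriched={sig}")
```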
11 VALIDATION OF MICROARRAY EXPERIMENTS

A plethora of factors, including biological and technical factors, inherent characteristics of different array platforms, and processing and analytical steps, can affect the results of a typical microarray experiment (152). Thus, several journals now require some sort of validation for a paper to be published. Sometimes it is possible to confirm the outcome without doing any additional laboratory-based analysis; for example, array results can be compared with information available in the literature or in expression databases such as GEO (153). However, such in silico validation is not always possible or appropriate, and other techniques such as RT-PCR, SAGE (154), and proteomics are used. Many studies, however, merely conduct technical validation of microarray results. Such validation may have been appropriate before the MAQC results established the technical validity of expression measurements. Thus, in our opinion, if a microarray study is well planned, routine technical validation of the array results is not needed; the verification that investigators pursue should advance their hypotheses rather than serve as arbitrary technical validation of selected genes.

12 CONCLUSIONS

When coupled with good experiment design, a high-quality analysis, and thorough interpretation, microarray technology has matured to the point where it can generate extremely valuable information. Microarrays can be used to provide greater understanding of the disease being studied, to develop profiles to predict response to compounds, and to predict side effects or poor outcomes. In the near future, microarrays may be used to determine which treatments a person is most likely to respond to. We hope this article will help investigators in their use of microarrays.

13 ACKNOWLEDGEMENT

The contents here were developed over many years in discussion with many investigators at UAB and around the world, including David Allison, Steve Barnes, Lang Chen, Jode Edwards, Gary Gadbury, Issa Coulibaly, Kyoungmi Kim, Tapan Mehta, Prinal Patal, Mahyar Sabripour, Jelai Wang, Hairong Wei, Richard Weindruch, Stanislav Zakharkin, and Kui Zhang. The work could not have been conducted without their thoughts and input. GPP and XQ were supported by NIH grants AT100949, AG020681, and ES 012933 and GPP by NSF grant 0501890.

REFERENCES

1. M. Chee, R. Yang, E. Hubbell, A. Berno, Z. C. Huang, D. Stern, J. Winkler, D. J. Lockhart, M. S. Morris, and S. P. A. Fodor, Accessing genetic information with high-density DNA arrays. Science 1996; 274: 610–614.
2. D. J. Lockhart, H. Ding, M. Byrne, M. T. Follettie, M. V. Gallo, M. A. S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Horton, and E. L. Brown, Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotech. 1996; 14: 1675–1680.
3. C.-K. Lee, R. G. Kloop, R. Weindruch, and T. A. Prolla, Gene expression profile of aging and its restriction by caloric restriction. Science 1999; 285: 1390–1393.
4. M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, Cluster analysis and display of genome-wide expression patterns.
Proc. Natl. Acad. Sci. U.S.A. 1998; 95: 14863–14868.
5. C. M. Perou, S. S. Jeffrey, R. M. van de, C. A. Rees, M. B. Eisen, D. T. Ross, A. Pergamenschikov, C. F. Williams, S. X. Zhu, J. C. Lee, D. Lashkari, D. Shalon, P. O. Brown, and D. Botstein, Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc. Natl. Acad. Sci. U.S.A. 1999; 96: 9212–9217.
6. M. A. Ginos, G. P. Page, B. S. Michalowicz, K. J. Patel, S. E. Volker, S. E. Pambuccian, F. G. Ondrey, G. L. Adams, and P. M. Gaffney, Identification of a gene expression signature associated with recurrent disease in squamous cell carcinoma of the head and neck. Cancer Res. 2004; 64: 55–63.
7. Y. Higami, T. D. Pugh, G. P. Page, D. B. Allison, T. A. Prolla, and R. Weindruch, Adipose tissue energy metabolism: altered gene expression profile of mice subjected to long-term caloric restriction. FASEB J. 2003; 8: 415–417.
8. S. O. Zakharkin, K. Kim, T. Mehta, L. Chen, S. Barnes, K. E. Scheirer, R. S. Parrish, D. B. Allison, and G. P. Page, Sources of variation in Affymetrix microarray experiments. BMC Bioinformat. 2005; 6: 214.
9. J. C. Lacal, How molecular biology can improve clinical management: the MammaPrint experience. Clin. Transl. Oncol. 2007; 9: 203.
10. S. Mook, L. J. van't Veer, E. J. Rutgers, M. J. Piccart-Gebhart, and F. Cardoso, Individualization of therapy using Mammaprint: from development to the MINDACT Trial. Cancer Genom. Proteom. 2007; 4: 147–155.
11. J. Zhao, J. Roth, B. Bode-Lesniewska, M. Pfaltz, P. U. Heitz, and P. Komminoth, Combined comparative genomic hybridization and genomic microarray for detection of gene amplifications in pulmonary artery intimal sarcomas and adrenocortical tumors. Genes Chromos. Cancer 2002; 34: 48–57.
12. K. L. Gunderson, F. J. Steemers, G. Lee, L. G. Mendoza, and M. S. Chee, A genome-wide scalable SNP genotyping assay using microarray technology. Nat. Genet. 2005; 37: 549–554.
13. L. Cekaite, O. Haug, O. Myklebost, M. Aldrin, B. Ostenstad, M. Holden, A. Frigessi, E. Hovig, and M. Sioud, Analysis of the humoral immune response to immunoselected phage-displayed peptides by a microarray-based method. Proteomics 2004; 4: 2572–2582.
14. C. Gulmann, D. Butler, E. Kay, A. Grace, and M. Leader, Biopsy of a biopsy: validation of immunoprofiling in gastric cancer biopsy tissue microarrays. Histopathology 2003; 42:70–6. 15. T. C. Mockler, S. Chan, A. Sundaresan, H. Chen, S. E. Jacobsen, and J. R. Ecker, Applications of DNA tiling arrays for wholegenome analysis. Genomics 2005; 85: 1–15. 16. V. K. Mootha, C. M. Lindgren, K. F. Eriksson, A. Subramanian, S. Sihag, J. Lehar, P. Puigserver, E. Carlsson, M. Ridderstrale, E. Laurila, N. Houstis, M. J. Daly, N. Patterson, J. P. Mesirov, T. R. Golub, P. Tamayo, B. Spiegelman, E. S. Lander, J. N. Hirschhorn, D. Altshuler, and L. C. Groop, PGC-1alpharesponsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 2003; 34: 267–273. 17. L. Cabusora, E. Sutton, A. Fulmer, and C. V. Forst, Differential network expression during drug and stress response. Bioinformatics 2005; 21: 2898–2905. 18. J. M. Naciff, M. L. Jump, S. M. Torontali, G. J. Carr, J. P. Tiesman, G. J. Overmann, and G. P. Daston, Gene expression profile induced by 17alpha-ethynyl estradiol, bisphenol A, and genistein in the developing female reproductive system of the rat. Toxicol. Sci. 2002; 68: 184–199. 19. Y. Tang, D. L. Gilbert, T. A. Glauser, A. D. Hershey, and F. R. Sharp, Blood gene expression profiling of neurologic diseases: a pilot microarray study. Arch. Neurol. 2005; 62: 210–215. 20. J. P. Ioannidis, Microarrays and molecular research: noise discovery? Lancet 2005; 365: 454–455. 21. L. J. ’t Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. Hart, M. Mao, H. L. Peterse, K. K. van der, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend, Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002; 415: 530–536. 22. M. J. van de Vijver, Y. D. He, L. J. van’t Veer, H. Dai, A. A. Hart, D. W. Voskuil, G. J. Schreiber, J. L. Peterse, C. Roberts, M. J. Marton, M. Parrish, D. Atsma, A. Witteveen, A. Glas, L. Delahaye, T. van der Velde, H. Bartelink, S. Rodenhuis, E. T. Rutgers, S. H. Friend, and R. Bernards, A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med. 2002; 347: 1999–2009.
23. Glas A. M., A. Floore, L. J. Delahaye, A. T. Witteveen, R. C. Pover, N. Bakx, J. S. LahtiDomenici, T. J. Bruinsma, M. O. Warmoes, R. Bernards, L. F. Wessels, and L. J. van’t Veer, Converting a breast cancer microarray signature into a high-throughput diagnostic test. BMC Genom. 2006; 7: 278. 24. M. Buyse, S. Loi, L. van’t Veer, G. Viale, M. Delorenzi, A. M. Glas, M. S. d’Assignies, J. Bergh, R. Lidereau, P. Ellis, A. Harris, J. Bogaerts, P. Therasse, A. Floore, M. Amakrane, F. Piette, E. Rutgers, C. Sotiriou, F. Cardoso, and M. J. Piccart, Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J. Natl. Cancer Inst. 2006; 98: 1183–1192.
31. W. Tong, A. B. Lucas, R. Shippy, X. Fan, H. Fang, H. Hong, M. S. Orr, T. M. Chu, X. Guo, P. J. Collins, Y. A. Sun, S. J. Wang, W. Bao, R. D. Wolfinger, S. Shchegrova, L. Guo, J. A. Warrington, and L. Shi, Evaluation of external RNA controls for the assessment of microarray performance. Nat. Biotechnol. 2006; 24: 1132–1139. 32. J. O. Borevitz, D. Liang, D. Plouffe, H. S. Chang, T. Zhu, D. Weigel, C. C. Berry, E. Winzeler, and J. Chory, Large-scale identification of single-feature polymorphisms in complex genomes. Genome Res. 2003; 13: 513–523.
25. W. Zhang, I. Shmulevich, and J. Astola, Microarray Quality Control. 2004. John Wiley & sons, Inc., Hoboken, NJ.
33. X. Cui, J. Xu, R. Asghar, P. Condamine, J. T. Svensson, S. Wanamaker, N. Stein, M. Roose, and T. J. Close, Detecting singlefeature polymorphisms using oligonucleotide arrays and robustified projection pursuit. Bioinformatics 2005; 21: 3852–3858.
26. M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995; 270: 467–470.
34. N. Rostoks, J. Borevitz, P. Hedley, J. Russell, S. Mudie, J. Morris, L. Cardle, D. Marshall, and R. Waugh, Single-feature polymorphism discovery in the barley transcriptome. Genome Biol. 2005; 6:R54.
27. P. K. Tan, T. J. Downey, E. L. Spitznagel Jr, P. Xu, D. Fu, D. S. Dimitrov, R. A. Lempicki, B. M. Raaka, and M. C. Cam, Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res. 2003; 31: 5676–5684.
35. E. A. Winzeler, C. I. Castillo-Davis, G. Oshiro, D. Liang, D. R. Richards, Y. Zhou, and D. L. Hartl, Genetic diversity in yeast assessed with whole-genome oligonucleotide arrays. Genetics 2003; 163: 79–89.
28. T. A. Patterson, E. K. Lobenhofer, S. B. Fulmer-Smentek, P. J. Collins, T. M. Chu, W. Bao, H. Fang, E. S. Kawasaki, J. Hager, I. R. Tikhonova, S. J. Walker, L. Zhang, P. Hurban, F. de Longueville, J. C. Fuscoe, W. Tong, L. Shi, and R. D. Wolfinger, Performance comparison of one-color and two-color platforms within the Microarray Quality Control (MAQC) project. Nat. Biotechnol. 2006; 24: 1140–1150.
36. J. Ronald, J. M. Akey, J. Whittle, E. N. Smith, G. Yvert, and L. Kruglyak, Simultaneous genotyping, gene-expression measurement, and detection of allele-specific expression with oligonucleotide arrays. Genome Res. 2005; 15: 284–291. 37. P. A. Sharp, The discovery of split genes and RNA splicing. Trends Biochem. Sci. 2005; 30: 279–281. 38. G. K. Hu, S. J. Madore, B. Moldover, T. Jatkoe, D. Balaban, J. Thomas, and Y. Wang, Predicting splice variant from DNA chip expression data. Genome Res. 2001; 11: 1237–1245.
29. L. Shi, L. H. Reid, W. D. Jones, R. Shippy, J. A. Warrington, S. C. Baker, P. J. Collins, F. de Longueville, E. S. Kawasaki, and K. Y. Lee, The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol. 2006; 24: 1151–1161.
39. T. A. Clark, C. W. Sugnet, M. Ares Jr., Genomewide analysis of mRNA processing in yeast using splicing-specific microarrays. Science 2002; 296: 907–910.
30. R. Shippy, S. Fulmer-Smentek, R. V. Jensen, W. D. Jones, P. K. Wolber, C. D. Johnson, P. S. Pine, C. Boysen, X. Guo, E. Chudin, et al., Using RNA sample titrations to assess microarray platform performance and normalization techniques. Nat. Biotechnol. 2006; 24: 1123–1131.
40. J. M. Johnson, J. Castle, P. Garrett-Engele, Z. Kan, P. M. Loerch, C. D. Armour, R. Santos, E. E. Schadt, R. Stoughton, and D. D. Shoemaker, Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 2003; 302: 2141–2144.
MICROARRAY 41. A. Relogio, C. Ben Dov, M. Baum, M. Ruggiu, C. Gemund, V. Benes, R. B. Darnell, and J. Valcarcel, Alternative splicing microarrays reveal functional expression of neuronspecific regulators in Hodgkin lymphoma cells. J. Biol. Chem. 2005; 280: 4779–4784. 42. K. Le, K. Mitsouras, M. Roy, Q. Wang, Q. Xu, S. F. Nelson, and C. Lee, Detecting tissue-specific regulation of alternative splicing as a qualitative change in microarray data. Nucleic Acids Res. 2004; 32:e180. 43. P. Dhami, A. J. Coffey, S. Abbs, J. R. Vermeesch, J. P. Dumanski, K. J. Woodward, R. M. Andrews, C. Langford, and D. Vetrie, Exon array CGH: detection of copy-number changes at the resolution of individual exons in the human genome. Am. J. Hum. Genet. 2005; 76: 750–762. 44. T. C. Mockler, S. Chan, A. Sundaresan, H. Chen, S. E. Jacobsen, and J. R. Ecker, Applications of DNA tiling arrays for wholegenome analysis. Genomics 2005; 85: 1–15. 45. P. Kapranov, S. E. Cawley, J. Drenkow, S. Bekiranov, R. L. Strausberg, S. P. Fodor, and T. R. Gingeras, Large-scale transcriptional activity in chromosomes 21 and 22. Science 2002; 296: 916–919. 46. D. Kampa, J. Cheng, P. Kapranov, M. Yamanaka, S. Brubaker, S. Cawley, J. Drenkow, A. Piccolboni, S. Bekiranov, G. Helt, H. Tammana, and T. R. Gingeras, Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Res. 2004; 14: 331–342. 47. M. Hild, B. Beckmann, S. A. Haas, B. Koch, V. Solovyev, C. Busold, K. Fellenberg, M. Boutros, M. Vingron, F. Sauer, J. D. Hoheisel, and R. Paro, An integrated gene annotation and transcriptional profiling approach towards the full gene content of the Drosophila genome. Genome Biol. 2003; 5:R3. 48. K. Yamada, J. Lim, J. M. Dale, H. Chen, P. Shinn, C. J. Palm, A. M. Southwick, H. C. Wu, C. Kim, M. Nguyen, et al., Empirical analysis of transcriptional activity in the arabidopsis genome. Science 2003; 302: 842–846. 49. V. Stolc, M. P. Samanta, W. Tongprasit, H. Sethi, S. Liang, D. C. Nelson, A. Hegeman, C. Nelson, D. Rancour, S. Bednarek, E. L. Ulrich, Q. Zhao, R. L. Wrobel, C. S. Newman, B. G. Fox, G. N. Phillips Jr, J. L. Markley, and M. R. Sussman, Identification of transcribed sequences in Arabidopsis thaliana by using high-resolution genome
tiling arrays. Proc. Natl. Acad. Sci. U.S.A. 2005; 102: 4453–4458. 50. L. Li, X. Wang, V. Stolc, X. Li, D. Zhang, N. Su, W. Tongprasit, S. Li, Z. Cheng, J. Wang, and X. W. Deng, Genome-wide transcription analyses in rice using tiling microarrays. Nat. Genet. 2006; 38: 124–129. 51. J. M. Johnson, S. Edwards, D. Shoemaker, and E. E. Schadt, Dark matter in the genome: evidence of widespread transcription detected by microarray tiling experiments. Trends Genet. 2005; 21: 93–102. 52. T. E. Royce, J. S. Rozowsky, P. Bertone, M. Samanta, V. Stolc, S. Weissman, M. Snyder, and M. Gerstein, Issues in the analysis of oligonucleotide tiling microarrays for transcript mapping. Trends Genet. 2005; 21: 466–475. 53. A. E. Urban, J. O. Korbel, R. Selzer, T. Richmond, A. Hacker, G. V. Popescu, J. F. Cubells, R. Green, B. S. Emanuel, M. B. Gerstein, S. M. Weissman, and M. Snyder, High-resolution mapping of DNA copy alterations in human chromosome 22 using high-density tiling oligonucleotide arrays. Proc. Natl. Acad. Sci. U.S.A. 2006; 103: 4534–4539. 54. A. Schumacher, P. Kapranov, Z. Kaminsky, J. Flanagan, A. Assadzadeh, P. Yau, C. Virtanen, N. Winegarden, J. Cheng, T. Gingeras, and A. Petronis, Microarray-based DNA methylation profiling: technology and applications. Nucleic Acids Res. 2006; 34: 528–542. 55. X. Zhang, J. Yazaki, A. Sundaresan, S. Cokus, S. W. L. Chan, H. Chen, I. R. Henderson, P. Shinn, M. Pellegrini, S. E. Jacobsen, and J. J S. Ecker, Genome-wide highresolution mapping and functional analysis of DNA methylation in arabidopsis. Cell 2006; 126: 1189–1201. 56. C. L. Liu, T. Kaplan, M. Kim, S. Buratowski, S. L. Schreiber, N. Friedman, and O. J. Rando, Single-nucleosome mapping of histone modifications in S. cerevisiae. PLoS Biol. 2005; 3:e328. 57. A. Schumacher, P. Kapranov, Z. Kaminsky, J. Flanagan, A. Assadzadeh, P. Yau, C. Virtanen, N. Winegarden, J. Cheng, T. Gingeras, and A. Petronis, Microarray-based DNA methylation profiling: technology and applications. Nucleic Acids Res. 2006; 34: 528–542. 58. G. C. Kennedy, H. Matsuzaki, S. Dong, W. Liu, J. Huang, G. Liu, X. Su, M. Cao, W.
Chen, J. Zhang, et al., Large-scale genotyping of complex DNA. Nat. Biotech. 2003; 21: 1233–1237.
59. G. C. Kennedy, H. Matsuzaki, S. Dong, W. Liu, J. Huang, G. Liu, X. Su, M. Cao, W. Chen, J. Zhang, et al., Large-scale genotyping of complex DNA. Nat. Biotech. 2003; 21: 1233–1237. 60. H. Matsuzaki, S. Dong, H. Loi, X. Di, G. Liu, E. Hubbell, J. Law, T. Berntsen, M. Chadha, H. Hui, et al., Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat. Methods 2004; 1: 109–111. 61. S. John, N. Shephard, G. Liu, E. Zeggini, M. Cao, W. Chen, N. Vasavda, T. Mills, A. Barton, A. Hinks, S. Eyre, et al., Whole-genome scan, in a complex disease, using 11,245 single-nucleotide polymorphisms: comparison with microsatellites. Am. J. Hum. Genet. 2004; 75: 54–64. 62. C. I. Amos, W. V. Chen, A. Lee, W. Li, M. Kern, R. Lundsten, F. Batliwalla, M. Wener, E. Remmers, D. A. Kastner, L. A. Criswell, M. F. Seldin, and P. K. Gregersen, High-density SNP analysis of 642 Caucasian families with rheumatoid arthritis identifies two new linkage regions on 11p12 and 2q33. Genes Immun. 2006; 7: 277–286. 63. Buck MJ, Lieb JD. ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics 2004; 83: 349–360. 64. J. Wu, L. T. Smith, C. Plass, and T. H. M. Huang, ChIP-chip comes of age for genomewide functional analysis. Cancer Res. 2006; 66: 6899–6902. 65. M. L. Bulyk, DNA microarray technologies for measuring protein-DNA interactions. Curr. Opin. Biotechnol. 2006; 17: 422–430. 66. C. S. Chen and H. Zhu, Protein microarrays. Biotechniques 2006; 40: 423, 425, 427. 67. P. Bertone and M. Snyder Advances in functional protein microarray technology. FEBS J. 2005; 272: 5400–5411. 68. B. B. Haab, M. J. Dunham, and P. O. Brown, Protein microarrays for highly parallel detection and quantitation of specific proteins and antibodies in complex solutions. Genome Biol. 2001; 2:RESEARCH0004. 69. A. Sreekumar, M. K. Nyati, S. Varambally, T. R. Barrette, D. Ghosh, T. S. Lawrence, and A. M. Chinnaiyan, Profiling of cancer cells using protein microarrays: discovery of novel radiation-regulated proteins. Cancer Res. 2001; 61: 7585–7593.
70. B. Schweitzer, S. Roberts, B. Grimwade, W. Shao, M. Wang, Q. Fu, Q. Shu, I. Laroche, Z. Zhou, V. T. Tchernev, J. Christiansen, M. Velleca, and S. F. Kingsmore, Multiplexed protein profiling on microarrays by rollingcircle amplification. Nat. Biotechnol. 2002; 20: 359–365. 71. H. Zhu, M. Bilgin, R. Bangham, D. Hall, A. Casamayor, P. Bertone, N. Lan, R. Jansen, S. Bidlingmaier, T. Houfek, et al., Global analysis of protein activities using proteome chips. Science 2001; 293: 2101–2105. 72. M. Arifuzzaman, M. Maeda, A. Itoh, K. Nishikata, C. Takita, R. Saito, T. Ara, K. Nakahigashi, H. C. Huang, A. Hirai, et al., Large-scale identification of protein-protein interaction of Escherichia coli K-12. Genome Res. 2006; 16: 686–691. 73. S. W. Ho, G. Jona, C. T. L. Chen, M. Johnston, and M. Snyder, Linking DNA-binding proteins to their recognition sequences by using protein microarrays. Proc. Natl. Acad. Sci. U.S.A. 2006; 103: 9940–9945. 74. D. A. Hall, H. Zhu, X. Zhu, T. Royce, M. Gerstein, and M. Snyder, Regulation of gene expression by a metabolic enzyme. Science 2004; 306: 482–484. 75. T. Feilner, C. Hultschig, J. Lee, S. Meyer, R. G. H. Immink, A. Koenig, A. Possling, H. Seitz, A. Beveridge, D. Scheel, et al., High throughput identification of potential arabidopsis mitogen-activated protein kinases substrates. Molec. Cell. Proteom. 2005; 4: 1558–1568. 76. H. Du, M. Wu, W. Yang, G. Yuan, Y. Sun, Y. Lu, S. Zhao, Q. Du, J. Wang, S. Yang, et al., Development of miniaturized competitive immunoassays on a protein chip as a screening tool for drugs. Clin. Chem. 2005; 51: 368–375. 77. A. Lueking, O. Huber, C. Wirths, K. Schulte, K. M. Stieler, U. Blume-Peytavi, A. Kowald, K. Hensel-Wiegel, R. Tauber, H. Lehrach, et al., Profiling of alopecia areata autoantigens based on protein microarray technology. Molec. Cell. Proteom. 2005; 4: 1382–1390. 78. W. H. Robinson, C. DiGennaro, W. Hueber, B. B. Haab, M. Kamachi, E. J. Dean, S. Fournel, D. Fong, M. C. Genovese, H. E. de Vegvar, et al., Autoantigen microarrays for multiplex characterization of autoantibody responses. Nat. Med. 2002; 8: 295–301. 79. A. Lueking, A. Possling, O. Huber, A. Beveridge, M. Horn, H. Eickhoff, J. Schuchardt, H. Lehrach, and D. J. Cahill, A nonredundant human protein chip for antibody
screening and serum profiling. Molec. Cell. Proteom. 2003; 2: 1342–1349.
80. M. T. Lee, F. C. Kuo, G. A. Whitmore, and J. Sklar, Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc. Natl. Acad. Sci. U.S.A. 2000; 97: 9834–9839.
81. M. K. Kerr and G. A. Churchill, Statistical design and the analysis of gene expression microarray data. Genet. Res. 2001; 77: 123–128.
82. K. Mirnics, Microarrays in brain research: the good, the bad and the ugly. Nat. Rev. Neurosci. 2001; 2: 444–447.
83. K. R. Coombes, W. E. Highsmith, T. A. Krogmann, K. A. Baggerly, D. N. Stivers, and L. V. Abruzzo, Identifying and quantifying sources of variation in microarray data using high-density cDNA membrane arrays. J. Comput. Biol. 2002; 9: 655–669.
84. Y. Woo, J. Affourtit, S. Daigle, A. Viale, K. Johnson, J. Naggert, and G. Churchill, A comparison of cDNA, oligonucleotide, and Affymetrix GeneChip gene expression microarray platforms. J. Biomol. Tech. 2004; 15: 276–284.
85. D. Rubin, Practical implications of modes of statistical inference for causal effects and the critical role of the assignment mechanism. Biometrics 1991; 47: 1213–1234.
86. M. K. Kerr and G. A. Churchill, Experimental design for gene expression microarrays. Biostatistics 2001; 2: 183–201.
87. M. F. Oleksiak, G. A. Churchill, and D. L. Crawford, Variation in gene expression within and among natural populations. Nat. Genet. 2002; 32: 261–266.
88. D. B. Allison and C. S. Coffey, Two-stage testing in microarray analysis: what is gained? J. Gerontol. A Biol. Sci. Med. Sci. 2002; 57: B189–B192.
89. Y. Benjamini and Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Royal Stat. Soc. Series B 1995; 57: 289–300.
90. V. G. Tusher, R. Tibshirani, and G. Chu, Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U.S.A. 2001; 98: 5116–5121.
91. D. Allison, G. Gadbury, M. Heo, J. R. Fernandez, C. K. Lee, T. A. Prolla, and R. Weindruch, A mixture model approach for the analysis of microarray gene expression data. Computat. Stat. Data Anal. 2002; 39: 1–20.
92. G. L. Gadbury, G. Xiang, J. Edwards, G. Page, and D. B. Allison, The role of sample size on measures of uncertainty and power. In D. B. Allison, J. W. Edwards, T. M. Beasley, and G. Page (eds.), DNA Microarrays and Related Genomics Techniques. Boca Raton, FL: CRC Press, 2005, pp. 51–60. 93. G. Gadbury, G. Page, J. Edwards, T. Kayo, R. Weindruch, P. A. Permana, J. Mountz, and D. B. Allison, Power analysis and sample size estimation in the age of high dimensional biology. Stat. Meth. Med. Res. 2004; 13: 325–338. 94. G. P. Page, J. W. Edwards, G. L. Gadbury, P. Yelisetti, J. Wang, P. Trivedi, and D. B. Allison, The PowerAtlas: a power and sample size atlas for microarray experimental design and research. BMC Bioinformat. 2006; 7: 84. 95. R. Nagarajan, Intensity-based segmentation of microarray images. IEEE Trans. Med. Imag. 2003; 22: 882–889. 96. Q. Li, C. Fraley, R. E. Bumgarner, K. Y. Yeung, and A. E. Raftery, Donuts, scratches and blanks: robust model-based segmentation of microarray images. Bioinformatics 2005. 97. C. Li and W. W. Hung, Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol. 2001; 2: 32–35. 98. Z. Wu and R. A. Irizarry, Stochastic models inspired by hybridization theory for short oligonucleotide arrays. J. Comput. Biol. 2005; 12: 882–893. 99. E. Hubbell, W. M. Liu, and R. Mei, Robust estimators for expression analysis. Bioinformatics 2002; 18: 1585–1592. 100. L. Zhang, L. Wang, A. Ravindranathan, and M. F. Miles, A new algorithm for analysis of oligonucleotide arrays: application to expression profiling in mouse brain regions. J. Mol. Biol. 2002; 317: 225–235. 101. R. A. Irizarry, B. M. Bolstad, F. Collin, L. M. Cope, B. Hobbs, T. P. Speed, Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003; 31:e15. 102. R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf, T. P. Speed, Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003; 4: 249–264. 103. K. Shedden, W. Chen, R. Kuick, D. Ghosh, J. Macdonald, K. R. Cho, T. J. Giordano, S. B. Gruber, E. R. Fearon, J. M. Taylor, and
S. Hanash, Comparison of seven methods for producing Affymetrix expression scores based on False Discovery Rates in disease profiling data. BMC Bioinformat. 2005; 6: 26.
104. G. K. Smyth and T. Speed, Normalization of cDNA microarray data. Methods 2003; 31: 265–273. 105. J. Tukey, On the comparative anatomy of transformation. Ann. Mathemat. Statist. 1964; 28: 602–632. 106. R. D. Wolfinger, G. Gibson, E. D. Wolfinger, L. Bennett, H. Hamadeh, P. Bushel, C. Afshari, and R. S. Paules, Assessing gene significance from cDNA microarray expression data via mixed models. J. Comput. Biol. 2001; 8: 625–637. 107. B. M. Bolstad, R. A. Irizarry, M. Astrand, and T. P. Speed, A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003; 19: 185–193. 108. B. P. Durbin, J. S. Hardin, D. M. Hawkins, and D. M. Rocke, A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 2002; 18(suppl 1):S105–S110. 109. B. P. Durbin and D. M. Rocke, Variancestabilizing transformations for two-color microarrays. Bioinformatics 2004; 20: 660–667. 110. C. A. Ball, G. Sherlock, H. Parkinson, P. Rocca-Sera, C. Brooksbank, HC. Causton, D. Cavalieri, T. Gaasterland, P. Hingamp, F. Holstege, et al., Standards for microarray data. Science 2002, 298: 539. 111. C. A. Ball, G. Sherlock, H. Parkinson, P. Rocca-Sera, C. Brooksbank, H. C. Causton, D. Cavalieri, T. Gaasterland, P. Hingamp, F. Holstege, et al., An open letter to the scientific journals. Bioinformatics 2002; 18: 1409. 112. K. H. Cheung, K. White, J. Hager, M. Gerstein, V. Reinke, K. Nelson, P. Masiar, R. Srivastava, Y. Li, J. Li, J. Li, et al., YMD: A microarray database for large-scale gene expression analysis. Proc. AMIA Symp. 2002; 140–144. 113. C. Baer, M. Nees, S. Breit, B. Selle, A. E. Kulozik, K. L. Schaefer, Y. Braun, D. Wai, and C. Poremba, Profiling and functional annotation of mRNA gene expression in pediatric rhabdomyosarcoma and Ewing’s sarcoma. Int. J. Cancer 2004; 110: 687–694. 114. R. L. Somorjai, B. Dolenko, and R. Baumgartner, Class prediction and discovery using
gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics 2003; 19: 1484–1491. 115. M. D. Radmacher, L. M. McShane, and R. Simon, A paradigm for class prediction using gene expression profiles. J. Comput. Biol. 2002; 9: 505–511. 116. M. Ringner and C. Peterson, Microarraybased cancer diagnosis with artificial neural networks. Biotechnology 2003(suppl): 30–35. 117. U. M. Braga-Neto and E. R. Dougherty, Is cross-validation valid for small-sample microarray classification? Bioinformatics 2004; 20: 374–380. 118. C. Romualdi, S. Campanaro, D. Campagna, B. Celegato, N. Cannata, S. Toppo, G. Valle, and G. Lanfranchi, Pattern recognition in gene expression profiling using DNA array: a comparative study of different statistical methods applied to cancer classification. Hum. Mol. Genet. 2003; 12: 823–836. 119. R. Simon and M. D. Radmacher, and K. Dobbin, Design of studies using DNA microarrays. Genet. Epidemiol. 2002; 23: 21–36. 120. R. Simon, M. D. Radmacher, K. Dobbin, and L. M. McShane, Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J. Natl. Cancer Inst. 2003; 95: 14–18. 121. R. Simon, Diagnostic and prognostic prediction using gene expression profiles in highdimensional microarray data. Br. J. Cancer 2003; 89: 1599–1604. 122. C. Ambroise and G. J. McLachlan, Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. U.S.A. 2002; 99: 6562–6566. 123. R. Simon, M. D. Radmacher, K. Dobbin, and L. M. McShane, Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J. Natl. Cancer Inst. 2003; 95: 14–18. 124. N. R. Garge, G. P. Page, A. P. Sprague, B. S. Gorman, and D. B. Allison, Reproducible clusters from microarray research: whither? BMC Bioinformat. 2005; 6(suppl 2):S10. 125. S. Datta and S. Datta, Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 2003; 19: 459–466. 126. P. Baldi and A. D. Long, A Bayesian framework for the analysis of microarray expression data: regularized t -test and statistical inferences of gene changes. Bioinformatics 2001; 17: 509–519.
MICROARRAY 127. D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, Microarray data analysis: from disarray to consolidation and consensus. Nat. Rev. Genet. 2006; 7: 55–65. 128. X. Cui, J. T. Hwang, J. Qiu, N. J. Blades, and G. A. Churchill, Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics 2005; 6: 59–75. 129. P. H. Westfall, D. V. Zaykin, and S. S. Young, Multiple tests for genetic effects in association studies. Methods Molec. Biol. 2002; 184: 143–168. 130. Y. Benjamini, D. Drai, G. Elmer, N. Kafkafi, and I. Golani, Controlling the false discovery rate in behavior genetics research. Behav. Brain Res. 2001; 125: 279–284. 131. M. Kanehisa, S. Goto, M. Hattori, K.F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa, From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006; 34:D354–D357. 132. X. Mao, T. Cai, J. G. Olyarchuk, and L. Wei, Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary. Bioinformatics 2005; 21: 3787–3793. 133. Y. Lee, R. Sultana, G. Pertea, J. Cho, S. Karamycheva, J. Tsai, B. Parvizi, F. Cheung, V. Antonescu, J. White, et al., Crossreferencing eukaryotic genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res. 2002; 12: 493–502. 134. Y. Lee, J. Tsai, S. Sunkara, S. Karamycheva, G. Pertea, R. Sultana, V. Antonescu, A. Chan, F. Cheung, and J. Quackenbush, The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes. Nucleic Acids Res. 2005; 33:D71–D74. 135. L. Tanabe, U. Scherf, L. H. Smith, J. K. Lee, L. Hunter, and J. N. Weinstein, MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechnology 1999; 27: 1210–1217. 136. M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, et al., Gene ontology: tool for the unification of biology (In Process Citation). Nat. Genet. 2000; 25: 25–29. 137. P. Khatri and S. Draghici, Ontological analysis of gene expression data: current tools,
limitations, and open problems. Bioinformatics 2005; 21: 3587–3595.
138. J. J. Goeman and P. Buhlmann, Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007; 23: 980–987.
139. I. Rivals, L. Personnaz, L. Taing, and M. C. Potier, Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics 2007; 23: 401–407.
140. D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, Microarray data analysis: from disarray to consolidation and consensus. Nat. Rev. Genet. 2006; 7: 55–65.
141. U. Mansmann and R. Meister, Testing differential gene expression in functional groups. Goeman's global test versus an ANCOVA approach. Methods Inf. Med. 2005; 44: 449–453.
142. V. K. Mootha, C. M. Lindgren, K. F. Eriksson, A. Subramanian, S. Sihag, J. Lehar, P. Puigserver, E. Carlsson, M. Ridderstrale, E. Laurila, et al., PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 2003; 34: 267–273.
143. J. Tomfohr, J. Lu, and T. B. Kepler, Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformat. 2005; 6: 225.
144. J. J. Goeman, S. A. van de Geer, F. de Kort, and H. C. van Houwelingen, A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004; 20: 93–99.
145. J. J. Goeman, J. Oosting, A. M. Cleton-Jansen, J. K. Anninga, and H. C. van Houwelingen, Testing association of a pathway with survival using gene expression data. Bioinformatics 2005; 21: 1950–1957.
146. P. Khatri and S. Draghici, Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 2005; 21: 3587–3595.
147. P. Zimmermann, M. Hirsch-Hoffmann, L. Hennig, and W. Gruissem, GENEVESTIGATOR. Arabidopsis microarray database and analysis toolbox. Plant Physiol. 2004; 136: 2621–2632.
148. A. de la Fuente, P. Brazhnik, and P. Mendes, Linking the genes: inferring quantitative gene networks from microarray data. Trends Genet. 2002; 18: 395–398.
149. S. Imoto, T. Higuchi, T. Goto, K. Tashiro, S. Kuhara, and S. Miyano, Combining microarrays and biological knowledge for estimating
gene networks via Bayesian networks. J. Bioinform. Comput. Biol. 2004; 2: 77–98. 150. Z. S. Qin, L. A. McCue, W. Thompson, L. Mayerhofer, C. E. Lawrence, and J. S. Liu, Identification of co-regulated genes through Bayesian clustering of predicted regulatory binding sites. Nat. Biotechnol. 2003. 151. B. Xing and M. J. van der Laan, A statistical method for constructing transcriptional regulatory networks using gene expression and sequence data. J. Comput. Biol. 2005; 12: 229–246. 152. D. Murphy, Gene expression studies using microarrays: principles, problems, and prospects. Adv. Physiol. Educ. 2002; 26: 256–270. 153. R. Edgar, M. Domrachev, and A. E. Lash, Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002; 30: 207–210. 154. R. Tuteja and N. Tuteja, Serial analysis of gene expression (SAGE): unraveling the bioinformatics tools. BioEssays 2004; 26: 916–922.
MINIMUM EFFECTIVE DOSE (MINED)
NEAL THOMAS and NAITEE TING
Global Research and Development, Pfizer Inc., New London, Connecticut

The International Conference on Harmonization (ICH) E4 Guideline for Industry (1), the primary source of regulatory guidance for dose response studies, provides a concise definition of a minimum effective dose (MinED or MED):

Historically, drugs have often been initially marketed at what were later recognized as excessive doses (i.e., doses well onto the plateau of the dose-response curve for the desired effect), sometimes with adverse consequences (e.g., hypokalemia and other metabolic disturbances with thiazide-type diuretics in hypertension). This situation has been improved by attempts to find the smallest dose with a discernible useful effect or a maximum dose beyond which no further beneficial effect is seen, but practical study designs do not exist to allow for precise determination of these doses. Further, expanding knowledge indicates that the concepts of minimum effective dose and maximum useful dose do not adequately account for individual differences and do not allow a comparison, at various doses, of both beneficial and undesirable effects. Any given dose provides a mixture of desirable and undesirable effects, with no single dose necessarily optimal for all patients.

The definition of the MinED as the "smallest dose with a discernible useful effect" requires that we further define a discernible effect and a useful effect. The MinED is implicitly defined by population summaries, as indicated by the latter portion of the ICH passage. Methodological development of the MinED has focused on operational definitions of "discernible" and "useful," and the distinction between population and individual effects. Although less commonly discussed, the guidelines assign similar importance to establishing a "maximum useful dose," which is sometimes called the maximum effective dose (MaxED).

1 INDIVIDUAL AND POPULATION DOSE RESPONSE CURVES

Figure 1 distinguishes between individual dose-response relationships (the three thinner curves representing three different individual subjects) and the (single, thicker) population average dose-response relationship. Because of intersubject variability, different subjects may respond to the same drug in different ways (2). The definitions of the MinED in the statistics literature focus on the population dose-response curve. The most common definition of the MinED is the lowest dose that produces a specified average difference in response from placebo or another comparator treatment. In the case of a binary outcome, the specified average response is a proportion. Bretz, Hothorn, and Hsu (3) define a MinED as the lowest dose that achieves a specified ratio of the mean response divided by the comparator mean response. None of the available literature appears to define the MinED for time-to-event outcomes analyzed by methods such as Cox regression.
Figure 2 displays a typical monotonically increasing dose-response curve (left side of the plot) with a hypothetical MinED and MaxED indicated by vertical lines. The MaxED is placed at a dose that nearly achieves the maximum mean response. There does not appear to be a broadly accepted definition of the MaxED, but it is common in pharmacometric practice to target no more than 90% of the maximum effect because dose-response curves often increase slowly to the maximum effect, so much higher doses are required to yield small additional improvements in efficacy. The MinED is the dose that produces a minimum clinically important difference (MCID) from the control. The curve on the right side of Figure 2 is a population dose-response curve for some measure of toxicity. A somewhat optimistic setting is displayed, with toxicity increasing for doses above the MinED.
Figure 1. Individual and average dose-response curves. (Response versus dose for three individual subjects and the population average curve.)
Figure 2. Dose-response for efficacy and toxicity. (Efficacy and toxicity response curves versus dose, with the MCID, MinED, and MaxED indicated.)
Because of subject variability around the average dose response, even in this setting some patients may not achieve the MinED without experiencing adverse events. In less optimistic settings, the MinED cannot be evaluated because of toxicities occurring at lower doses.
2 OPERATIONAL DEFINITIONS OF USEFUL EFFECT

It is difficult to achieve consensus about the smallest magnitude of an effect that is useful. The useful effect is often based on
precedent or convention within a therapeutic area. There can be different sources for the precedent, including the estimated effect of a well-established therapy that has been accepted as useful. For example, in studying chronic eye diseases such as age-related macular degeneration or diabetic macular edema, a loss (or gain) of three lines (or, equivalently, 15 letters) in vision based on the Early Treatment Diabetic Retinopathy Study (ETDRS) chart is a common standard for
visual acuity (4). Another example would be a two-point mean difference on the International Prostate Symptoms Scale for a useful effect in benign prostate hyperplasia (5).
Another useful concept is the minimum clinically important difference (MCID). This is the magnitude of the treatment response exceeding the placebo response that is large enough to be perceived by a subject. It is a commonly used concept in outcomes research. There are two general methods to study this issue (6, 7): (1) the anchor-based method, which correlates response on a clinical endpoint with responses to questions about perceptions of improvement (or worsening); and (2) the distribution-based method, which calculates the between-person standard deviation unit and uses a multiple of this unit as the MCID.
A similar problem arises when selecting a tolerance interval width for noninferiority and equivalence trials, which typically arise when comparing two active drugs rather than a drug and placebo. Achieving consensus on an acceptable difference between active drugs is also difficult in this setting, as has been discussed elsewhere (8, 9).

3 OPERATIONAL DEFINITIONS OF DISCERNIBLE EFFECTS

A discernible effect has been interpreted by several investigators as a statistically significant difference between a dose group and placebo. Ruberg (10) provides another way of restating the "smallest dose with a discernible useful effect": "The MinED is the lowest dose producing a clinically important response that can be declared statistically significantly different from the placebo response." This definition of a discernible effect involving statistical significance depends on the size and design of the dose-response study. Two general approaches have been used to establish statistical differences: (1) pairwise testing of doses versus placebo, with adjustment for multiple comparisons; and (2) dose-response modeling, with reporting sometimes based on Bayesian probabilities rather than P-values. Sequential applications of trend tests corrected for multiple comparisons are a compromise between the pairwise
test-based approaches and estimation-based approaches. The testing-based approaches appear to be more commonly used.
Pairwise comparisons of each dose group to placebo can be performed that preserve type I error without requiring any assumptions about the ordering of dose groups. To maintain a prespecified type I error, the alpha level of each test must be adjusted (multiple comparison adjustment). Dunnett (11) is a highly referenced method in this setting. Numerous other multiple comparison methods have been proposed that have improved power for some implicitly defined dose-response curves (12–14). These methods are based on sequential hypothesis testing, typically beginning with tests for the highest dose. The lowest statistically significant dose that is also clinically useful is selected as the MinED.
Testing (and estimation) methods have also been developed that assume an increasing order of response with dose but do not assume a specific form for the dose-response curve. These methods, such as that of Williams (15, 16), are called isotonic inference, isotonic regression, or order-restricted inference (10, 17–20). Tests based on contrasts (weighted combinations of dose group mean responses) derived from dose-response curves have also been proposed (10, 21, 22). These approaches, like those from isotonic inference, can increase the power to achieve statistical significance by using information about the likely shape of the dose-response curve. The contrast tests are also implemented as sequential tests, which proceed as long as statistical significance is achieved. The MinED is derived from the statistically significant contrast that includes the lowest maximum dose; the maximum dose from that contrast is taken as the MinED. Bauer (23) and Hsu and Berger (13) note that these methods can have inflated type I error for an ineffective high dose when the dose-response curve is not monotonic. Hsu and Berger (13) show that sequential tests that preserve the familywise error rate take the form of directional pairwise comparisons with the comparator group.
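As a concrete illustration of the pairwise-testing approach, the sketch below simulates dose-group data, compares each dose with placebo using Dunnett's many-to-one procedure (scipy.stats.dunnett, available in SciPy 1.11 or later), and steps down from the highest dose to pick the lowest dose that is both statistically significant and exceeds a clinically useful difference. The doses, effect sizes, significance level, and MCID are hypothetical; this is a sketch of the general idea, not any specific published procedure.

```python
# A minimal sketch of selecting a MinED by pairwise comparisons with placebo
# (Dunnett-type adjustment), assuming SciPy >= 1.11 for scipy.stats.dunnett.
# All doses, responses, and the MCID value below are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
doses = [10, 25, 50, 100]                       # mg, hypothetical dose groups
placebo = rng.normal(0.0, 1.0, size=40)         # simulated placebo responses
groups = [rng.normal(mu, 1.0, size=40)          # simulated dose-group responses
          for mu in (0.2, 0.6, 0.9, 1.0)]

mcid = 0.5                                      # hypothetical clinically useful difference
res = stats.dunnett(*groups, control=placebo)   # many-to-one comparisons versus placebo

# Step down from the highest dose; stop at the first dose that is not
# statistically significant, and require the observed difference to exceed the MCID.
mined = None
for dose, sample, p in sorted(zip(doses, groups, res.pvalue), reverse=True):
    if p >= 0.05:
        break
    if sample.mean() - placebo.mean() >= mcid:
        mined = dose
print("Selected MinED:", mined)
```

As the surrounding text notes, a procedure of this kind can only ever return one of the doses actually included in the design.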
As noted in Tamhane et al. (22), all of the testing procedures select a MinED from only those doses included in the dose-response study design; as a consequence, they can be very dependent on the selection of the doses included in the design. They noted this fact and offered a different naming convention: "what any test procedure finds is a minimum detectable dose (MinDD)." Most of the procedures noted here require higher doses to be statistically significant before lower doses can be declared significant, and, as a consequence, they tend to report higher values for the MinED.
The MinED can also be selected by inverting an estimated dose-response curve to find a dose with expected response equal to a target value, typically the smallest useful dose (24). The inversion of the dose-response curve may be preceded by a global test for a dose-response trend. A significant global trend test, however, is not equivalent to statistical significance of individual doses, as is typically required in most testing approaches. Bretz, Pinheiro, and Branson (25) describe several criteria for selecting a MinED based on inversion of the dose-response curve. The MinED estimated from a dose-response curve is likely to be a dose not included in the study. Ruberg (10) describes another method of estimating the MinED using the logistic Emax curve. This approach is based on methods for determining the minimum detectable concentration in biological assays (26, 27). Unlike all of the other methods noted here, this approach is based on selecting a dose that could be consistently differentiated from placebo in future trials of a specified size.
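The model-based alternative can be sketched in a few lines: fit a three-parameter Emax curve to observed dose-group means and invert it at a target effect equal to the placebo response plus the smallest useful difference. The dose levels, responses, and target effect below are hypothetical, and the sketch is illustrative rather than a reproduction of the cited procedures.

```python
# A minimal sketch of estimating a MinED by inverting a fitted Emax curve.
# The dose levels, mean responses, and target effect below are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def emax(dose, e0, emax_, ed50):
    """Three-parameter Emax model: E(d) = E0 + Emax * d / (ED50 + d)."""
    return e0 + emax_ * dose / (ed50 + dose)

dose = np.array([0.0, 10.0, 25.0, 50.0, 100.0])       # mg, hypothetical
mean_resp = np.array([0.05, 0.30, 0.62, 0.85, 0.95])  # hypothetical group means

params, _ = curve_fit(emax, dose, mean_resp, p0=[0.0, 1.0, 20.0])
e0_hat, emax_hat, ed50_hat = params

delta = 0.5                       # hypothetical clinically useful effect over placebo
# Invert E(d) - E0 = delta  =>  d = delta * ED50 / (Emax - delta)
if emax_hat > delta:
    mined = delta * ed50_hat / (emax_hat - delta)
    print(f"Model-based MinED: {mined:.1f} mg")
else:
    print("Target effect exceeds the estimated maximum effect; no MinED exists.")
```

Unlike the testing-based procedures, the dose returned here will generally fall between the doses actually studied.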
REFERENCES

1. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline: E4 Dose-Response Information to Support Drug Registration. Current Step 4 version, March 10, 1994. Available at: http://www.ich.org/LOB/media/MEDIA480.pdf
2. N. Holford and L. Sheiner, Understanding the dose-effect relationship: clinical application of pharmacokinetic–pharmacodynamic models. Clin Pharmacokinet. 1981; 6: 429–453.
3. F. Bretz, L. Hothorn, and J. Hsu, Identifying effective and/or safe doses by stepwise confidence intervals for ratios. Stat Med. 2003; 22: 847–858.
4. U.S. Food and Drug Administration, Center for Drug Evaluation and Research. Joint meeting of the ophthalmic drugs subcommittee of the dermatologic and ophthalmic drugs advisory committee and the endocrine and metabolic drugs advisory committee, March 11, 1998. http://www.fda.gov/cder/foi/adcomm/98/jnt doac emdac 031198 ag ques.pdf
5. F. Desgrandchamps, Importance of individual response in symptom score evaluation. Eur Urol. 2001; 40(Suppl 3): 2–7.
6. G. H. Guyatt, D. Osoba, A. W. Wu, K. W. Wyrwich, G. R. Norman, and the Clinical Significance Consensus Meeting Group. Methods to explain the clinical significance of health status measures. Mayo Clin Proc. 2002; 77: 371–383.
7. M. A. G. Sprangers, C. M. Moinpour, T. J. Moynihan, D. L. Patrick, D. A. Revicki, and the Clinical Significance Consensus Meeting Group. Assessing meaningful change in quality of life over time: a users' guide for clinicians. Mayo Clin Proc. 2002; 77: 561–571.
8. Committee for Medicinal Products for Human Use (CHMP), European Medicines Agency. Guideline on the Choice of the Non-Inferiority Margin. London, UK, July 27, 2005. Available at: http://www.emea.europa.eu/pdfs/human/ewp/215899en.pdf
9. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline: E10 Choice of Control Group and Related Issues in Clinical Trials. Current Step 4 version, July 20, 2000. Available at: http://www.ich.org/LOB/media/MEDIA486.pdf
10. S. Ruberg, Dose response studies. II. Analysis and interpretation. J Biopharm Stat. 1995; 5: 15–42.
11. C. Dunnett, A multiple comparison procedure for comparing several treatments to control. J Am Stat Assoc. 1955; 50: 1096–1121.
12. Y. Hochberg and A. Tamhane, Multiple Comparison Procedures. New York: Wiley, 1987.
13. J. Hsu and R. Berger, Stepwise confidence intervals without multiplicity adjustment for dose-response and toxicity studies. J Am Stat Assoc. 1999; 94: 468–482.
14. A. Tamhane and B. Logan, Multiple comparison procedures in dose response studies. In:
N. Ting (ed.), Dose-Finding in Drug Development. New York: Springer, 2006, pp 172–183.
15. D. Williams, A test for differences between treatment means when several dose levels are compared with a zero dose level. Biometrics. 1971; 27: 103–117.
16. D. Williams, The comparison of several dose levels with a zero dose control. Biometrics. 1972; 28: 519–531.
17. T. Robertson, F. Wright, and R. Dykstra, Order Restricted Inference. New York: Wiley, 1988.
18. D. J. Bartholomew, Isotonic inference. In: Encyclopedia of Statistical Sciences, Vol. 4. New York: Wiley, 1983, pp 260–265.
19. C. Hirotsu, Isotonic inference. In: Encyclopedia of Biostatistics, Vol. 3. New York: Wiley, 1999, pp 2107–2115.
20. J. W. McDonald, Isotonic regression. In: Encyclopedia of Biostatistics, Vol. 3. New York: Wiley, 1999, pp 2115–2116.
21. S. Ruberg, Contrasts for identifying the minimum effective dose. J Am Stat Assoc. 1989; 84: 816–822.
22. A. Tamhane, Y. Hochberg, and C. Dunnett, Multiple test procedures for dose finding. Biometrics. 1996; 52: 21–37.
23. P. Bauer, A note on multiple testing procedures in dose finding. Biometrics. 1997; 53: 1125–1128.
24. T. G. Filloon, Estimating the minimum therapeutically effective dose of a compound via regression modelling and percentile estimation (Disc: p933-933). Stat Med. 1995; 14: 925–932.
25. F. Bretz, J. Pinheiro, and M. Branson, Combining multiple comparisons and modeling techniques in dose-response studies. Biometrics. 2005; 61: 738–748. 26. M. Davidian, R. J. Carroll, and W. Smith, Variance functions and the minimum detectable concentration in assays. Biometrika. 1988; 75: 549–556. 27. D. Rodbard, Statistical estimation of the minimal detectable concentration (‘‘sensitivity’’) of radioligand assays. Anal Biochem. 1978; 90: 1–12.
CROSS-REFERENCES
MINISTRY OF HEALTH, LABOUR AND WELFARE (MHLW) OF JAPAN
SHUNSUKE ONO
Graduate School of Pharmaceutical Sciences, The University of Tokyo, Tokyo, Japan

As an important component of pharmaceutical research and development (R&D) activities, clinical trials in Japan are implemented under various regulations of the Pharmaceutical Affairs Law (PAL) and related ordinances and guidelines issued by the Ministry of Health, Labour and Welfare (MHLW).

1 MINISTRY OF HEALTH, LABOUR AND WELFARE (MHLW)

The Ministry of Health, Labour and Welfare (MHLW), which was originally established in 1938, is a government agency responsible for public health, social security, and labor issues in Japan. In 2001, two ministries, the Ministry of Health and Welfare and the Ministry of Labour, were merged into the current MHLW in an effort to reorganize and streamline government ministries. The MHLW consists of the Minister's Secretariat, 11 bureaus, and affiliated institutions, councils, local branch offices, and external organizations. For the updated organizational structure, see the Ministry's website at http://www.mhlw.go.jp/english/index.html (in English).
Issues related to pharmaceuticals and medical devices are under the control of two bureaus in the MHLW, the Pharmaceutical and Food Safety Bureau (PFSB) and the Health Policy Bureau (HPB). The PFSB is the office in charge of enactment, amendment, and implementation of the Pharmaceutical Affairs Law (PAL), the backbone of Japanese pharmaceutical regulation. The objectives of the PFSB are the implementation of regulatory rules and related guidelines rather than the promotion of R&D policies. A division in the HPB, the Research and Development Division, is responsible for R&D policy issues in health care. The philosophy of strict separation between regulation (i.e., police) and R&D promotion (i.e., guardian) is an important feature of the Japanese system. This reflects some historical mishaps related to blood products and human immunodeficiency virus in the 1990s, in which the Ministry was severely criticized for its unclear decision-making process.
Japan has universal health insurance, and all clinical trials submitted for investigational new drug notification (IND), with the exception of phase I studies involving healthy volunteers, must be done under the health insurance scheme. The financial rules for reimbursement and copayment in IND-submitted trials are determined by the Health Insurance Bureau of the MHLW.
The MHLW is responsible for national policy making and final decisions on IND and NDA approvals. Day-to-day scientific evaluation of actual INDs and NDAs, safety review and pharmacovigilance after marketing approval, and good manufacturing practice (GMP) inspections are handled by an external agency, the Pharmaceuticals and Medical Devices Agency (PMDA).

2 PHARMACEUTICALS AND MEDICAL DEVICES AGENCY (PMDA)

The PMDA is an incorporated administrative agency established in 2004. It is not a government agency per se but rather was established based on the law for the Incorporated Administrative Agency. It is therefore a quasi-governmental agency, and its objectives and operations are specifically determined by the law and related ordinances. The PMDA took over the operations and services of three previous agencies: the Pharmaceuticals and Medical Devices Evaluation Center (PMDEC), the Organization for Pharmaceutical Safety and Research (OPSR), and part of the Japan Association for the Advancement of Medical Equipment (JAAME) (1).

2.1 Objectives of the PMDA

The PMDA has three major objectives.
1. Consultation, review, and related activities. The PMDA conducts scientific
review for submitted NDAs and INDs, and conveys its review results to the MHLW, the final decision-making agency, in the form of review reports. It also offers consultation services for the industry and clinical researchers on a fee-for-service basis (see section 2.4).
2. Postmarketing safety operations. The PMDA collects, analyzes, and disseminates information on the quality, efficacy, and safety of pharmaceuticals and medical devices on the market.
3. Adverse health-effect relief services. The PMDA provides services related to payment for medical expenses and disability pensions to patients affected by adverse drug reactions. These services were originally performed by the OPSR.
Before the PMDA was established in 2004, review and consultation were provided by two separate organizations. The PMDEC undertook scientific review activities to evaluate NDAs and INDs submitted to the MHLW, and the OPSR provided consultation services for the industry for predetermined fees. The NDA and IND review activities are official responsibilities, for which delegation to the PMDA is stipulated by the PAL, but the consultation services are not directly based on the PAL. This is in apparent contrast to the regulatory apparatus in the United States, where sponsors can ask for various types of meetings as part of the overall review (2). However, the basic services that the regulatory agencies in both countries offer to sponsors (i.e., scientific advice on a timely basis) are the same.
The PMDA website (http://www.pmda.go.jp/) provides more detailed information on the scope of its operations.

2.2 Organization of the PMDA

Efficacy, safety, and quality issues of pharmaceuticals (including over-the-counter drugs) and medical devices are handled by eleven review-related offices in the PMDA. The Offices of New Drugs I, II, III, and IV and the Offices of Biologics I and II are responsible for conducting scientific review of NDAs and INDs for pharmaceutical products. In each office, several review teams are organized, each focusing on specific therapeutic categories (Table 1). As of October 2007, there were ten review teams operating in the PMDA. A review team consists of project managers; experts in chemistry, pharmacology and pharmacokinetics, and toxicology; physicians; and biostatisticians. The total number of review staff, including the reviewers and support staff involved in review activities, is about 200, one-tenth of that of the U.S. Food and Drug Administration (as of 2007). Although the human resources for review have been expanding gradually, there is still a significant gap between the United States and Japan. Insufficient opportunities for consultation services and delays in new drug approval times have often been attributed to this lack of review resources.

2.3 IND and NDA Review by the PMDA

For each IND and NDA, a specific review team is assigned at the time of submission. Assignment is based on the therapeutic
area and class of the drug. When biological products are assigned to a nonbiologic team, several reviewers from the Office of Biologics always join the review team to scrutinize the quality issues. The focus of review is different between INDs and NDAs. The IND review is to check the validity of initiating clinical trials, paying particular attention to safety concerns. The PAL requires that the IND review be finished within 30 days. The scope of the NDA review, on the other hand, is much broader, and it commonly takes one or more years for nonpriority review products. Figure 1 presents the review process for typical pharmaceutical NDAs. Review results prepared by review teams as well as the summary documents (i.e., Common Technical Documents [CTD] Module 2) submitted by the drug companies are published publicly on an Internet website in Japanese (http://www.info.pmda.go.jp/).
2.4 Consultation Services Provided by the PMDA

The PMDA provides various consultation services on a fee-for-service basis. The fees vary, depending on the type of consultation. For example, the fee for a consultation after a phase II trial (i.e., the planning stage for phase III studies) is about 6 million yen (U.S. $55,000).

The consultation services were started in 1997 to meet the needs of the industry. Before that time, there were no official opportunities in which the industry could discuss scientific and regulatory issues in specific clinical development plans with Japanese regulators. Any sponsor (i.e., drug companies or physicians in sponsor-investigator trials) can apply for a consultation. Applications must be submitted 3 months ahead of the meeting. Because of insufficient review capacity within the agency, applications in crowded therapeutic areas are prioritized according to a point table. Some applicants with low priority points (e.g., a trial for an additional new indication) sometimes must wait several months in the queue. Though the PMDA explains that this ''rationing'' of services is inevitable given its current lack of sufficient review resources, the industry insists that the situation should be improved as soon as possible by adopting a fee system such as that of the U.S. Prescription Drug User Fee Act (PDUFA).

Several types of consultations are offered to meet needs at different development stages. Common times for consultation include before phase I, pre-early phase II, pre-late phase II, after phase II, and before NDA submission. The sponsors can ask for complementary consultation(s) for a lower fee if further discussions or advice are considered necessary.

The sponsor and PMDA discuss design issues (e.g., choice of endpoints, sample size, statistical analysis), conduct issues (e.g., good clinical practice [GCP] compliance), and all other issues pertinent to the clinical trials of interest. To clarify questions and concerns, preliminary discussions can be held several times by facsimile and over the phone. The
[Figure 1. New Drug Application (NDA) review process in Japan. The flow diagram links the applicant (pharmaceutical company), NDA submission, PMDA compliance review and team review, interviews, a hearing, an expert meeting, the review report, consultation and support from the PAFSC (special members) and the PAC and its committee, advice to the MHLW PMSB, and the final decision on approval.]
PMDA and the sponsor commonly hold a final meeting to conclude the discussions. In cases in which all the questions and concerns are settled before the final meeting, the meeting can be canceled. The results of discussions are documented by the reviewers of the PMDA. The sponsor can comment on the results before they are finalized. All consultation result documents must be attached to an NDA submission in CTD Module 1. 3 REGULATORY STRUCTURE FOR CLINICAL TRIALS Most clinical trials in Japan, both industry sponsored and investigator sponsored, are done under the regulation of the PAL. For clinical trials outside the scope of the PAL, the MHLW issued a general ethics guideline for clinical research in 2003. 3.1 Definition of Clinical Trials The PAL requires submission of IND notification for a clinical trial that could be part of a clinical data package in an NDA. By definition, ‘‘sponsor’’ would include the physician who plans to conduct the clinical trials as the sponsor-investigator, but most of the trials for which IND notifications are submitted are sponsored by pharmaceutical companies. The regulatory scheme of sponsor-investigator trials was introduced comparatively recently. The PAL was amended to incorporate the definition of sponsor-investigator trials in
2002, and domestic GCP guidelines were also amended in 2003 to reflect the changes in the PAL. Before that, there were, of course, clinical trials implemented spontaneously by clinical researchers. However, they were not given official regulatory status and thus were not allowed to be used for Japanese NDA submissions. Since the introduction of the sponsor-investigator trials, 21 sponsor-investigator IND notifications have been submitted (as of November 2006). In contrast to the IND-related requirements in the United States, the definition of the Japanese IND requirement puts more emphasis on the intention for future NDA submission of pharmaceutical products. This leads to limited IND submissions from academic researchers, who are not necessarily interested in pharmaceutical NDAs. The number of recently submitted IND notifications is shown in Figure 2. The reasons for the drastic changes observed are explained in section 3.3. The PAL stipulates that IND notifications for substances for which clinical trials are done for the first time in Japan should be reviewed by the PMDA within 30 days. A review team assigned to the IND notification checks the IND; if they identify critical problems that could harm study participants, they issue an order to hold the trial. In such a case, the sponsor cannot start the trial until the problems are settled. This mechanism of clinical hold is similar to that of the United States.
[Figure 2. Annual Investigational New Drug (IND) notifications submitted to the Ministry of Health, Labour and Welfare (MHLW), 1993 to 2004, showing counts of first INDs and all INDs.]
3.2 GCP and Related Regulations in Japan

The conduct of clinical trials for which IND notifications are submitted must satisfy the requirements of the GCP guideline and other pertinent ordinances. The Japanese GCP guideline is a Ministerial ordinance under the PAL, and serious violations of the guideline can be subject to criminal punishment. It should be noted that the scope of penalties in the GCP is only for the sponsors; all of the other participants on the investigator's side (e.g., physicians as investigators or subinvestigators, heads of hospitals, and research nurses) are exempt from the penalties. This exemption from punishment is a distinctive feature of Japanese GCP regulation. However, physicians could be punished in sponsor-investigator trials if a severe violation was committed when they played the role of the sponsor. The PMDA is responsible for implementation of on-site GCP inspections (see section 3.5). The lack of punishment on the investigator's side makes it virtually impossible to publish a blacklist of investigators who have been involved in serious GCP violations in the past.

3.3 Drastic Changes in Japanese Clinical Trial Market

As Figure 2 shows clearly, a drastic decline in the number of domestic commercial trials has been observed. This was caused by several factors on both the demand and supply sides (3). The MHLW's health insurance policies in the 1980s and 1990s, which included tightfisted price-setting rules for newly approved drugs and historical price cutting for existing drugs, seemed to have a negative impact on R&D activities in Japan (4). In addition to the economic disincentives, the globalization of R&D activities also has reduced the demand for Japanese trials. The International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) E5 guideline, called the ''bridging guideline,'' was accepted by the United States, the European Union, and Japan, and was implemented in 1998; this accelerated the trend toward global use of foreign clinical data (5). At the same time, the
MHLW abolished the longstanding requirements for the inclusion of Japanese pharmacokinetic studies, a Japanese dose-response study, and at least one Japanese confirmatory phase III study in the clinical data package. All these changes significantly reduced the demand for Japanese trials. Drastic changes also occurred on the supply side of clinical trials. After several years of discussions by the ICH expert working group, ICH-GCP was implemented in Japan in 1997. The new GCP guideline, for the first time in Japanese history, introduced Western components of clinical trials such as intensive monitoring and auditing through source data verification, support staff such as research nurses and pharmacists, contract research organizations, and independent data monitoring committees. These changes on the supply side, in conjunction with the changes on the demand side, caused a reduction in the number of trials in the Japanese market. The cost of Japanese trials skyrocketed accordingly. Most academic institutions and national hospitals adopted similar fee-setting tables, but the fees are far from similar because overhead costs vary significantly among institutions (see section 3.4).

3.4 Expensive Japanese Trials for the Industry

The Japanese pharmaceutical R&D environment, and the clinical environment in particular, has been discussed in light of launch delays in Japan. Several features of the health-care delivery system have been implicated in this delay. Because virtually all Japanese clinical trials (except typical phase I trials) are done under the universal health insurance scheme, investigators as well as study participants face significant insurance-related red tape in contracts, in-house procedures, and payment. For the sponsor, the cost of Japanese trials (i.e., payment from a drug company to a hospital) is a serious concern. It has been reported that the payments per subject are much higher in Japan than in the United States, European countries, and most Asian countries. In Japan, the sponsors are not allowed to pay investigators directly because the investigators (i.e., physicians) are regarded as employees of the hospital.
Instead, clinical trial fees go directly to the hospitals. As previously mentioned, in many public and academic hospitals, the clinical trial fee for each trial is determined based on a point table and a matrix used to calculate the fee basis, taking into consideration the design and conditions of the trial. The basic components of point tables are similar nationwide, but actual fees vary to a great extent because of variable overhead costs. In general, hospitals have a much stronger position in the fee-setting process because they are the customers for the drug companies' marketed products. The close tie between R&D activities and postmarketing business characterizes Japanese clinical development. Also, the limited availability of clinical experts in some therapeutic fields makes it possible for them to exert pricing power as monopolists or oligopolists in providing clinical trials. All these features seem to be associated with the exceptionally high prices of clinical trials in Japan.

3.5 Quality of Japanese Trials

Until the ICH-GCP was introduced, Japanese trials had a generally poor reputation for quality. Even before the ICH-GCP introduction, GCP guidance had been issued in 1989, but it was not a legally binding ordinance based on the PAL. Western components of clinical trials such as source data verification (SDV) monitoring and auditing, research nurses, and strict rules for informed consent were introduced in Japan for the first time in 1997, along with the ICH-GCP. The actual reports of GCP inspections are not publicly available, but the PMDA routinely publishes summary findings for educational purposes and provides materials on the quality of Japanese trials (6). Before the introduction of the ICH-GCP, the most obvious deficiencies were errors in case report forms (CRFs). Some errors were serious (e.g., fraudulent discrepancies from medical records), but most were trivial, such as typographical errors. Surprisingly, SDV was not officially done before the ICH-GCP in Japan because there was no regulatory basis to allow SDV with appropriate protection of personal information. The ICH-GCP
made it possible for drug companies to conduct intensive monitoring through SDV. Also, research nurses and pharmacists employed at clinical institutions started operations to support investigators and drug companies. These improvements in the research environment reduced CRF deficiencies drastically and thus increased the accuracy of the data. Some types of deficiencies have not decreased, however. According to the published summary of inspection results, the number of protocol deviations has not declined since the ICH-GCP introduction (6). Japanese investigators participate in clinical trials as employees of hospitals and within the restrictions of health insurance regulations. Their motivation is not necessarily high because clinical trials have historically been considered low-priority chores that their supervisors (e.g., professors, directors) order them to do to acquire research grants from drug companies. Academic incentives are not effective for many physicians because they know Japanese commercial trials are rarely accepted in prestigious medical journals. This mindset has been changing, but only gradually. Recent discussions on trial quality often focus on burdens related to handling documents. For example, there has been widespread misinterpretation of the scope of essential documents. Too much attention to paper handling is likely to increase the cost of monitoring activities and the cost of clinical trials, accordingly. 4 GOVERNMENT EFFORTS TO INVITE NEW DRUGS TO JAPAN For several reasons, some drugs already marketed in the United States and European Union are not introduced in Japan (4). Even when they are finally introduced in Japan, significant delays in market introduction are quite common. This situation, called ‘‘new drug lag,’’ has attracted a great deal of public attention, especially in cancer therapy. That many drugs and therapeutic regimens already approved in Western countries are unavailable or are not approved in Japan has caused Japanese policy-makers to establish regulatory schemes to bring those drug therapies to Japan.
4.1 Study Group on Unapproved Drugs (MHLW since January 2005) The Study Group on Unapproved Drugs started in January 2005 to oversee the situation of Japanese new drug lag, to evaluate the need for introducing unapproved drugs in Japan, and to lead these drugs to the clinical development stage. Ten formal meetings have been held since the group’s establishment, and development and approval strategies for 33 drugs have been discussed (as of November 2006). Of the 33 drugs, 20 drugs were cancer drugs, and nine were pediatric drugs. The study group also makes recommendations about clinical trials for safety confirmation that should be performed for drugs that have already been submitted for NDA because such trials could provide opportunities for patients to access these unapproved drugs during the NDA review. Compassionate use of investigational or unapproved drugs has not been legitimized in Japan. 4.2 Study Group on Cancer Combination Therapy (MHLW since January 2004) The Study Group on Cancer Combination Therapy specifically focuses on unapproved use (e.g., indications) or unapproved regimens of cancer drugs for which Japanese approval has already been obtained for some other indication(s) or regimen. Because drug companies would have to bear substantial costs to conduct clinical trials to expand these indications, they would rather maintain the indications unchanged and leave the decision on how to use the drugs to individual oncologists. As a result, many cancer drugs have been used for unapproved conditions or in unapproved regimens in Japan. Lack of flexibility in Japanese health insurance reimbursement decisions makes the situation more serious. For pharmaceutical products, reimbursement is strictly based on the conditions (i.e., indication, dose, regimen, patient type) of approval under the PAL. Japanese insurance bodies basically are not allowed to make their own decisions on reimbursement under the universal health-care insurance, a marked contrast to the diversified decisions by the U.S. insurance bodies.
To fill the gap between Japan and the United States/European Union, this MHLW study group investigates the current Japanese situation and also collects clinical evidence available in foreign countries. When the investigation shows that a given drug use is considered ''standard'' in the therapeutic field, the MHLW issues its approval based on those prior investigations and foreign evidence. In 2005, expanded indications were approved for 30 cancer drugs.

5 GOVERNMENT EFFORTS TO PROMOTE CLINICAL TRIALS IN JAPAN

The MHLW and related government agencies have been implementing strategies to boost the number and quality of Japanese clinical trials.

5.1 Three-Year Clinical Trial Promotion Plan (MHLW and MEXT, 2003–2005)

The MHLW and the Ministry of Education, Culture, Sports, Science and Technology (MEXT) jointly executed a set of programs to promote Japanese clinical trials, the Three-Year Clinical Trial Promotion Plan. Setting up several large networks for clinical trials was one of the important objectives of this plan. Research grants to establish networks and to implement some sponsor-investigator trials as model programs were managed by the Center for Clinical Trials of the Japan Medical Association. As of September 2006, there were 1,212 clinical institutions participating in the networks. Training programs for clinical research coordinators (e.g., research nurses and pharmacists) were provided regularly under this plan. Several symposia were also held to raise public awareness of clinical trials, because it is believed that Japanese patients, who receive medical services under the universal health insurance without substantial financial burdens, are not necessarily interested in clinical trials. The national hospitals and research centers directly managed by the MHLW have been promoting harmonization of trial contract documents. These streamlining efforts also are intended to reduce the administrative costs of trials.
The MHLW plans to continue these activities for the next 5 years beginning with the fiscal year 2007. 5.2 Study Group on Faster Access to Innovative Drugs (MHLW, 2006) The MHLW convened an expert panel, the Study Group on Faster Access to Innovative Drugs, to discuss the bottlenecks to the introduction of new drugs in Japan in October 2006. This study group will focus on the principles of new drug approval, postmarketing safety measures, and enhancement of the PMDA review. It is expected that the discussion results will be reflected in upcoming amendments of existing regulations. REFERENCES 1. Y. Fujiwara and K. Kobayashi, Oncology drug clinical development and approval in Japan: the role of the pharmaceuticals and medical devices evaluation center (PMDEC). Crit Rev Oncol Hematol. 2002; 42: 145–155. 2. Center for Drug Evaluation and Research, U.S. Food and Drug Administration. Training and Communications: Formal meetings between CDER and CDER’s external constituents. MAPP 4512.1. Available at: http://web.archive.org/web/20050921092516 /http://www.fda.gov/cder/mapp/4512-1.pdf 3. S. Ono and Y. Kodama, Clinical trials and the new good clinical practice guideline in Japan. Pharmacoeconomics. 2000; 18: 125–141.
4. P. M. Danzon, The impact of price regulation on the launch delay of new drugs—evidence from twenty-five major markets in the 1990s. Health Econ. 2005; 14: 269–292. 5. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline: E5(R1) Ethnic Factors in the Acceptability of Foreign Clinical Data. Step 4 version, February 5, 1998. Available at: http://www.ich.org/LOB/media/ MEDIA481.pdf 6. K. Saito, Y. Kodama, S. Ono, M. Mutoh, S. Kawashima, and A. Fujimura, Quality of Japanese trials estimated from Good Clinical Practice auditing findings. Am J Ther. 2006; 13: 127–133.
FURTHER READING For more detailed information about Japanese regulations, see Japan Pharmaceutical Manufacturers Association (JPMA), Pharmaceutical Administration and Regulations in Japan, March 2006. Available from http://www.jpma.or.jp/english/parj/0607.html
CROSS-REFERENCES Good clinical practice (GCP) International Conference on Harmonization (ICH) Investigational New Drug Application (IND) Regulatory authorities Food and Drug Administration (FDA)
MIN TEST
EUGENE M LASKA
MORRIS J MEISNER
Nathan Kline Institute for Psychiatric Research, New York University School of Medicine, Orangeburg, New York

Multiple parameter testing arises in many if not most clinical trials. There are very few diseases for which treatments are directed at a single targeted endpoint. In many cases, the evidence required to demonstrate that a new treatment is effective requires the simultaneous comparison of many outcome measures. This translates to formal hypotheses involving inequalities among multiple parameters. One common case is testing whether the parameters are all positive. This arises, for example, when it is desired to show that a combination treatment is superior to each component; that a treatment is equivalent or noninferior to a control; or that a dose of a combination is synergistic. In such cases, the Min test is the test of choice. We introduce this test as well as present and discuss its properties and illustrative applications.

1 THE MIN TEST

The Min test is used to test a null hypothesis such as Ho: υi ≤ 0 for at least one i versus H1: υi > 0 for all i, i = 1, 2, . . . , K. This is sometimes called the sign-testing problem. If ti is an α-level test of υi ≤ 0 that rejects Hoi: υi ≤ 0 if the statistic is larger than its critical value ciα, then the Min test rejects the global null hypothesis Ho if min[ti − ciα] > 0. Alternatively, if pi is the observed P-value of the i-th test statistic, then the Min test rejects the null if the largest of the pi is less than α. This test is also called the simple intersection union test (simple IUT or SIUT) because the form of the null hypothesis is the union of K ''elementary'' hypotheses, and the alternative is the intersection of the complements of the elementary hypotheses. The simple IUT was first described by Lehmann (1), was given the name by Gleser (2), and was further studied by Berger (3–5) and others. Saikali and Berger (6) have pointed out that the Min test and the simple IUT are alternative names for the same test. Note that the elementary tests are performed at the same α level as the desired overall size of the test, and the test is valid even if the individual test statistics are correlated. Here, validity means that the probability of a type 1 error is at most α. The conclusion that the test has the claimed probability of a type 1 error follows only if inference is restricted to accepting or rejecting the specified global null and alternative hypotheses previously discussed. No consideration in calculating the size of the test is given to other errors that could arise if other inferences are drawn. For example, suppose that the global null is not rejected but one or more of the component tests reject their elementary null Hoi. If it were desired to reach conclusions about the corresponding parameters, it would be necessary to take into account the possibility of additional errors in order to control the familywise error rate. There are many ways to control the multiplicity problem, but reporting on significant findings in the computation of the components of the Min test is not one of them. After failing to find ti − ciα > 0 for all i, the only statistically valid statement is that there is insufficient evidence to conclude that all of the parameters are positive. The test statistic ti can be parametric, such as a t-test, or nonparametric, such as a rank test. The parameters appearing in the hypotheses may represent a single outcome measure obtained under different experimental conditions, or the parameters could arise from a multivariate outcome. Both cases occur when testing whether a combination comprising several treatments is efficacious.
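As a concrete illustration of the decision rule just described, the following minimal Python sketch applies the Min test to a set of one-sided component P-values. The function name, the significance level, and the P-values are hypothetical and are used only for illustration; they are not part of the original article.

```python
# Minimal sketch of the Min test (simple IUT) decision rule: reject the
# global null only if every elementary one-sided test rejects at level alpha,
# which is equivalent to requiring max(p_i) < alpha.

def min_test_rejects(p_values, alpha=0.05):
    """Return True if the Min test rejects the global null hypothesis."""
    return max(p_values) < alpha

# Hypothetical one-sided P-values for K = 3 elementary hypotheses.
print(min_test_rejects([0.012, 0.034, 0.021]))  # True: every component rejects
print(min_test_rejects([0.012, 0.034, 0.060]))  # False: one component fails to reject
```

Note that no multiplicity adjustment is applied to the component P-values, in line with the property that the elementary tests are performed at the same α level as the overall test.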
1.1 Power The statistical power (probability of rejecting the null hypothesis) of an α-level Min test for K = 2 depends on the primary parameter δ, the minimum of υ 1 and υ 2 , and on the nuisance parameter γ , the difference between
υ 1 and υ 2 . If the parameters υ i are means, given any fixed value of δ, the power of the α-level Min test increases as a function of the absolute value of γ . Therefore, given any value of the primary parameter, the α-level Min test has the smallest power when the nuisance parameter is zero, that is, when υ1 = υ2. In determining the sample size for a study, in the absence of prior knowledge to the contrary, a prudent statistician will assume that the means are equal to ensure that there is sufficient power for rejecting the null hypothesis. For example, for two normal random variables with effect sizes υ 1 /σ = υ 2 /σ = 0.5, to achieve 90% power to reject the null, the sample size per group is 84 based on a t-test and 97 per group based on a Wilcoxon test. If υ 1 /σ were larger than 0.5, then γ would be positive, and the power of the Min test would be greater than 0.90. The size of a test, α, is the supremum of the power taken over all values of the parameters that lie in the null space. For the sign-testing problem, where the parameters are the means of a bivariate normal, the parameter values that define α occur in the limit of (υ 1 , 0) and (0, υ 2 ) as υ 1 or υ 2 approaches positive infinity. A test is said to be biased if there is a parameter point in the alternative hypothesis space (so that the null should be rejected) for which the power to reject is smaller than the size of the test. Thus, biased tests reject the null more frequently when the null is true than when it is false. Unfortunately, for parameters in the alternative space that lie in the neighborhood of the origin, the size of the Min test is very small. Therefore, the test is biased, and the power is low near the origin. The power of the Min test has been considered under different distributional assumptions by many investigators, including Laska and Meisner (7, 8), Patel (9), Hung et al. (10), Hung (11), Wang and Hung (12), Horn et al. (13) Sidik and Jonkman(14), and Kong et al. (15, 16). The tables or formulas in their publications may be consulted to obtain sample sizes for specific circumstances. 1.2 Optimality Properties If the test statistics are restricted to the class of monotone functions of the test statistics
of the elementary hypotheses (T1, T2), then, under some mild regularity conditions, the uniformly most powerful test of Ho is the Min test (1, 7). A test is said to be monotone in (T1, T2) if, whenever (a, b) lies in the critical region and a′ ≥ a and b′ ≥ b, the point (a′, b′) also lies in the critical region. That is, larger values of the test statistics provide greater evidence in support of rejection. The sign-testing problem was considered by Inada (17) for the bivariate normal and Sasabuchi (18) for the multivariate normal; they found that the Min test is the likelihood ratio test (LRT) under the assumption of equal sample sizes per group and known variance. They specified the null as Ho: {υ1 = 0, υ2 > 0} or {υ1 > 0, υ2 = 0} rather than Ho: υ1 ≤ 0 or υ2 ≤ 0, but later Berger (4) showed that the LRT is the same for both formulations of the null. When the sample sizes are unequal, however, Saikali and Berger (6) showed that the two tests are not the same. The rejection region of the LRT is a proper subset of the rejection region of the Min test, which is therefore uniformly more powerful. However, the Min test is biased and has low power when the parameters are close to the origin, and the size of the test α is attained only in the limit. In an effort to remedy this situation, Berger (4), Liu and Berger (19), and McDermott and Wang (20) enlarged the rejection region near the origin without changing the α level and obtained tests that are uniformly more powerful than the Min test. But this result does not come without a price. In particular, the new tests are not monotone. Laska and Meisner (7) have argued that in a regulatory setting, such a property would be unacceptable. Imagine the consequences of one randomized, controlled clinical trial with less evidence for rejection of the null being declared a pivotal study by the regulatory body while another study with more evidence does not reach the nominal level of statistical significance. Additionally, these ''improved'' new tests all contain points in the rejection region that are arbitrarily close to the origin, which of course is not good evidence against the null hypothesis. Perlman and Wu (21) have argued strongly against the use of these new tests. Hung (22) has cautioned that their value in increasing power must be carefully
weighed against the shortcomings already discussed.

2 COMMON APPLICATIONS OF THE MIN TEST

The Min test is useful in many areas of application, but its most common use has been in testing whether a combination treatment is efficacious. Such treatments are widely used, and, even though their components may be effective when used individually, it does not follow that the combination consisting of these components is useful. To market a combination treatment, the U.S. Food and Drug Administration (U.S. FDA 21CFR300.50) requires demonstration that ''each component makes a contribution to the claimed effects . . . '' (23, 24). In addition, the European Agency for the Evaluation of Medicinal Products (EMEA) also requires that the benefit/risk assessment of the combination equal or exceed that of each of its components (25).

2.1 Combination Treatments: Single Endpoint

There are many ways to interpret the concept of ''contribution.'' If all of the components of the combination treat the same symptom, such as pain, then the requirement that each component ''contribute'' is interpreted as requiring that the combination be superior to each of its components in terms of effectiveness. Clearly, if the combination is not superior to one of its ingredients, then, other issues such as adverse events aside, the combination has no advantage over that ingredient. Suppose there are K components in the combination treatment. Then here, υi is the difference between the mean effect of the combination and the mean effect of component treatment i. If ti is a one-sided test that rejects Hoi: υi ≤ 0 if the statistic is larger than its critical value ciα, then the Min test may be used to demonstrate that the components of a combination contribute to the claimed effects.

2.2 Combination Treatments: Multiple Endpoints

A combination treatment may be designed to treat many symptoms, such as pain and
sleeplessness. For simplicity, suppose there are two components in the combination treatment, each of which may have an effect on J outcome measures. Let υij be the difference between the mean effect of the combination and the mean effect of component treatment i on the j-th outcome measure. The notion of contribution has multiple interpretations. Laska et al. (26) gave the definitions for uniformly best, comparable to the best, and admissible, and Tamhane and Logan (27) provided the definition for locally best:

1. Uniformly best. The combination treatment is superior to each of its component treatments on all J endpoints.
2. Locally best. For each endpoint, the combination treatment is at least as good as (noninferior to) the component treatment that is best for that endpoint and superior to each treatment on at least one endpoint.
3. Comparable to the best. For each endpoint, the combination treatment is at least as good as (noninferior to) the component treatment that is best for that endpoint.
4. Admissible. The combination treatment is superior to each component treatment on at least one endpoint.

Laska et al. (26) gave a general theory for these hypothesis formulations and applied the theory when the random variables are multivariate normal and when the tests of the elementary hypotheses are rank tests. If tij is a one-sided test that rejects Hoij: υij ≤ 0 when the statistic is larger than its critical value cijα, then the Min test may be used to demonstrate that the combination is uniformly best. Under mild conditions, it too is uniformly most powerful (UMP) among the class of monotone functions of the test statistics tij. Tamhane and Logan (27) describe a test for demonstrating locally best. The alternative hypothesis is the intersection of the superiority alternative and the noninferiority alternative. For each outcome j, a noninferiority margin ej must be chosen to meaningfully reflect clinical indifference. If sij is
an α/J-level test of υij ≤ 0 that rejects Hoij: υij ≤ 0 if the statistic is larger than its critical value c*ijα/J, and tij is an α-level test of υij ≤ −ej that rejects Hoij: υij ≤ −ej if the statistic is larger than its critical value cijα, then the test rejects Ho if simultaneously the Min test given by min[tij − cijα] and max[sij − c*ijα/J] are both positive. The test uses Roy's (28) union intersection (UI) principle and the Min test if the endpoint on which the combination is superior is not specified in advance. If j* is the endpoint specified in advance for testing superiority, then the test is a Min test composed of min[tij − cijα] > 0 and max[sij* − c*ij*α] > 0. To test the hypothesis of comparable to the best, Laska et al. (26) show that the Min test once again is a UMP test among the class of monotone functions of the test statistics tij. The component test tij rejects Hoij: υij ≤ −ej if the corresponding test statistic is larger than its critical value. These same tests were later used by Tamhane and Logan (27) in the noninferiority portion of their test for locally best. For the admissibility hypothesis, Laska et al. (26) compare the combination with each component treatment to determine if there is at least one endpoint where it is superior to that component. Inference about each component i involves J comparisons, so the size of these J tests must be adjusted to achieve an α-level test for the component. The proposed adjustment was based on the Bonferroni method, so the test for component i is max[sij − c*ijα/J] > 0. The test in Laska et al. (26) then is min[max[sij − c*ijα/J]] > 0, where the maximum is taken over endpoints j and the minimum over components i. Westfall et al. (29) observed that the power of the test could be improved by replacing Bonferroni with Holm (30) and still more by using the Simes (31) and Hommel (32) approaches instead of Bonferroni. It is interesting to note that some of the definitions of contribution listed above have inclusion relations. If a combination is uniformly best, then it is locally best. Similarly, a locally best combination is comparable to the best, and it is admissible. There is no inclusion relationship between the latter two. Thus, in any application, based on closed testing, the admissibility hypothesis may be tested first; and if it is not rejected, the locally
best hypothesis may be tested; and if it is not rejected, then uniformly best can be tested. These tests are all conducted at the stated α level without adjustment for multiplicity. Alternatively, the comparable to the best hypothesis may be tested; and if not rejected, the locally best hypothesis may be tested; and if not rejected, uniformly best may be tested. This may not be prudent unless a specific endpoint for locally best is chosen in advance; otherwise, power is lost in seeking the endpoint where the combination is best. This power loss occurs because part of the overall type 1 error in testing locally best is spent in the contrast for each of the endpoints. Thus, if the goal is to test for uniformly best, there is no cost to either first testing for admissibility or first testing for comparable to the best. 2.3 Combination Treatments: Multiple Doses, Univariate Effect Particularly for antihypertensive treatments but for other indications as well, both the United States and the European Union require that dose-ranging studies be conducted, preferably using a factorial design. In such a design, patients are randomly allocated to receive one of the doses of drug A and one of the doses of drug B, which includes the possibility of the zero dose, placebo. Hung et al. (33, 34) considered the statistical problem of identifying which dose combinations have the property that both components make a contribution. These investigators recognized that such a design also yields information on the dose response of the combination. Just as for combination drugs treating multiple symptoms, the definition of ‘‘contributing’’ has many interpretations. Hung et al. (33) described contributing in both a weak and a strong sense. A combination exhibits global superiority in a weak sense if the average effect of the combination taken over all of the nonzero doses is superior to the average effect of each of the component treatments taken over its corresponding individual non-zero doses. A combination exhibits global superiority in a strong sense if there is at least one non-zero dose combination that is superior to both of its components. Notice that demonstration of weak superiority does
not guarantee that there is a dose combination that is superior to its components. Hung et al. (10) developed two tests for global superiority. The α-level AVE test is an average of the Min test statistics examining the contribution of each combination relative to its components taken over all of the non-zero dose combinations under study. The MAX test examines the largest Min test statistic. Both are one-sided α-level tests. Hung (35) extended the tests to incomplete and unbalanced factorial designs in which not all dose combinations are studied and the sample sizes of the dose pairs are not equal. For the local problem of identifying which explicit dose pairs are contributing, they recommend using simultaneous Min tests adjusted according to the Hochberg (36) procedure to maintain the familywise error rate at α. Alternative multiple testing methods that protect the familywise error rate for identifying effective dose combinations were investigated by Buchheister and Lehmacher (37) and Lehmacher (38). They proposed procedures based on linear contrast tests and on closed testing procedures. 2.4 Synergy In some contexts, it is desirable to determine whether the components of a combination are additive, synergistic, or antagonistic at specific doses. These concepts specify whether the effect of the combination is, respectively, the same as, greater than, or less than expected on the basis of the effects of its components (39). Another concept is the notion of therapeutic synergy. A combination is said to be therapeutically synergistic if its effect is larger than the maximum effect achievable by any of its components over all doses in its therapeutic range. Laska et al. (40) proposed a simple approach to assessing synergy in a combination of two drugs that does not require modeling a response surface or marginal doseresponse curves. Their experimental design and test are based on the concept of an isobologram (39) to establish sufficient conditions for synergy at a specified dose (x1 , x2 ) of the combination. An isobole is the set of dose pairs from the two treatments, all of which have the same expected responses.
The design calls for studying the combination and a single dose of each drug. Suppose that x1e and x2e are the doses of drug 1 and drug 2, respectively, that produce the same level of response, say e. The potency ratio at e is defined to be ρ(e) = x2e /x1e . In many instances, ρ(e) is well studied, and an estimate r is available. Then, to investigate whether the combination is synergistic at (x1 , x2 ), N subjects are randomly assigned to each of three dose combinations: (x1 + x2 /r, 0), (0, r x1 + x2 ), and (x1 , x2 ). The Min test is used to see if the following two conditions hold simultaneously: 1. The response at the combination dose (x1 , x2 ) is greater than the response to drug 1 alone at dose (x1 + x2 /r, 0). 2. The response at the combination dose of interest is greater than the response to drug 2 alone at dose (0, r x1 + x2 ). If both these conditions are true, then synergy can be claimed at the combination dose of interest. Recently, Feng and Kelly (41) generalized this approach to three or more drugs and studied the power of the procedure. To test if a combination is therapeutically synergistic at a given dose pair (x1 , x2 ), its effect must be shown to be superior to the maximum effect achieved by its components at any dose. Suppose that xe1∗ and xe2∗ are the known doses of drug 1 and drug 2 that produce the maximum level of response, say e1 * and e2 *, respectively. Then, component i at dose xei∗ and the combination must be studied, and the Min test is an optimal test. 2.5 Bioequivalence, Equivalence and Noninferiority Two formulations of the same drug are said to be bioequivalent if they are absorbed into the blood and become available at the drug action site at about the same rate and concentration. Suppose υ T = mean blood concentration of a test treatment and υ S = mean blood concentration of a standard treatment. Then, to demonstrate bioequivalence the null hypothesis H0 : υT − υS ≥ δ or υT − υS ≤ −δ must be rejected in favor of HA : − δ < υT − υS < δ
which is equivalent to HA : − δ < υT − υS and υT − υS < δ Here, δ > 0 is a prespecified tolerance limit. More generally, this representation of equivalence may be used for any clinically relevant outcome measure where δ demarks a zone of clinical indifference. Bioequivalence refers to measures of blood concentration of the drug, whereas equivalence refers to outcomes that measure the impact on the therapeutic target. The Min test may be used to test the null hypothesis. In the bioequivalence context, the test is often referred to as the TOST because it is composed of two one-sided α-level tests (42). Kong et al. (15) considered a multivariate setting and studied a method for demonstrating equivalence or noninferiority of two treatments on each endpoint under the assumption of multivariate normality. The test statistic is a Min test that takes into account a noninferiority margin ej for each outcome j that reflects meaningful clinical indifference. The form of the test and the form for testing whether a combination treatment is uniformly best are the same. After developing the distributional properties, the investigators used a simulation to examine the power of the test under different configurations of the means and covariance matrix. More recently, Kong et al. (16) considered a similar scenario in which the endpoints are distributed binomially.
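As an illustration of the TOST form of the Min test, the sketch below applies two one-sided paired t-tests to hypothetical within-subject log-scale differences such as might arise in a crossover bioequivalence study. The tolerance limit of 0.223 (approximately log 1.25), the simulated values, and the function name are assumptions made for this example and are not taken from the article.

```python
import numpy as np
from scipy import stats

def tost_p_value(differences, delta):
    """TOST for equivalence on within-subject differences: the larger of the
    two one-sided P-values, i.e., the Min test applied to the two components."""
    d = np.asarray(differences, dtype=float)
    n = d.size
    se = d.std(ddof=1) / np.sqrt(n)
    t_lower = (d.mean() + delta) / se          # tests H0: mean difference <= -delta
    t_upper = (d.mean() - delta) / se          # tests H0: mean difference >= +delta
    p_lower = 1.0 - stats.t.cdf(t_lower, df=n - 1)
    p_upper = stats.t.cdf(t_upper, df=n - 1)
    return max(p_lower, p_upper)

rng = np.random.default_rng(1)
log_diff = rng.normal(0.02, 0.20, 24)          # hypothetical log(test) - log(standard)
print(tost_p_value(log_diff, delta=0.223) < 0.05)  # expected to print True here
```

Equivalence is concluded only when both one-sided tests reject, which is exactly the Min test logic described above.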
REFERENCES 1. E. L. Lehmann, Testing multiparameter hypotheses. Ann Math Stat. 1952; 23: 541–552. 2. L. J. Gleser, On a theory of intersection-union tests [abstract]. Inst Math Stat Bull. 1973; 2: 233. 3. R. L. Berger, Multiparameter hypothesis testing and acceptance sampling. Technometrics. 1982; 24: 295–300. 4. R. L. Berger, Uniformly more powerful tests for hypotheses concerning linear inequalities and normal means. J Am Stat Assoc. 1989; 84: 192–199.
5. R. L. Berger, Likelihood ratio tests and intersection-union tests In: S. Panchapakesan and N. Balakrishnan (eds.), Advances in Statistical Decision Theory and Applications Boston: Birhauser, 1997, pp. 225–237. 6. K. G. Saikali and R. L. Berger, More powerful tests for the sign testing problem. J Stat Plan Inference. 2002; 107: 187–205. 7. E. M. Laska and M. Meisner, Testing whether an identified treatment is best. Biometrics. 1989; 45: 1139–1151. 8. E. M. Laska and M. Meisner, Testing whether an identical treatment is best: the combination problem In: Proceedings of the Biopharmaceutical Section of the American Statistical Association Alexandria, VA: American Statistical Association, 1986, pp. 163–170. 9. H. I. Patel, Comparison of treatments in a combination therapy trial. J Biopharm Stat. 1991; 1: 171–183. 10. H. M. Hung, G. Y. Chi, and R. J. Lipicky, Testing for the existence of a desirable dose combination. Biometrics. 1993; 49: 85–94. 11. H. M. Hung, Two-stage tests for studying monotherapy and combination therapy in twoby-two factorial trials. Stat Med. 1993; 12: 645–660. 12. S. J. Wang and H. M. Hung, Large sample tests for binary outcomes in fixed-dose combination drug studies. Biometrics. 1997; 53: 498–503. 13. M. Horn, R. Vollandt, and C. W. Dunnett, Sample size determination for testing whether an identified treatment is best. Biometrics. 2000; 56: 879–881. 14. K. Sidik and J. N. Jonkman, Sample size determination in fixed dose combination drug studies. Pharm Stat. 2003; 2: 273–278. 15. L. Kong, R. C. Kohberger, and G. G. Koch, Type I error and power in noninferiority/equivalence trials with correlated multiple endpoints: an example from vaccine development trials. J Biopharm Stat. 2004; 14: 893–907. 16. L. Kong, G. G. Koch, T. Liu, and H. Wang, Performance of some multiple testing procedures to compare three doses of a test drug and placebo. Pharm Stat. 2005; 4: 25–35. 17. K. Inada, Some bivariate tests of composite hypotheses with restricted alternatives. Rep Fac Sci Kagoshima Univ (Math Phys Chem). 1978; 11: 25–31. 18. S. Sasabuchi, A test of a multivariate normal mean with composite hypotheses determined by linear inequalities. Biometrika. 1980; 67: 429–439.
MIN TEST 19. H. Liu and R. L. Berger, Uniformly more powerful, one sided tests for hypotheses about linear inequalities. Ann Stat. 1995; 72: 23–55. 20. M. P. McDermott and Y. Wang, Construction of uniformly more powerful tests for hypotheses about linear inequalities. J Stat Plan Inference. 2002; 107: 207–217. 21. M. D. Perlman and L. Wu, The emperor’s new tests (with discussion). Stat Sci. 1999; 14: 355–369. 22. H. M. Hung, Combination drug clinical trial In: S. C. Chow (ed.), Encyclopedia of Biopharmaceutical Statistics, Revised and Expanded 2nd ed. New York: Marcel Dekker, 2003, pp. 210–213. 23. U.S. Food and Drug Administration, Department of Health and Human Services. Code of Federal Regulations, Title 21 Food and Drugs. Volume 5, Part 300.50: Fixedcombination prescription drugs for humans. Revised as of April 1, 1999. Available at: http://www.access.gpo.gov/nara/cfr/waisidx 99 /21cfr300 99.html 24. H. M. Leung and R. O’Neill, Statistical assessment of combination drugs: a regulatory view In: Proceedings of the Biopharmaceutical Section of the American Statistical Association Alexandria, VA: American Statistical Association, 1986, pp. 33–36. 25. European Agency for the Evaluation of Medicinal Products, Human Medicines Evaluation Unit, Committee for Proprietary Medicinal Products (CPMP). Note for Guidance on Fixed Combination Medicinal Products. CPMP/EWP/240/95. April 17, 1996. Available at: http://www.emea.europa.eu/ pdfs/human/ewp/024095en.pdf 26. E. M. Laska, D. I. Tang, and M. Meisner, Testing hypotheses about an identified treatment when there are multiple endpoints. J Am Stat Assoc. 1992; 87: 825–831. 27. A. C. Tamhane and B. R. Logan, A superiority-equivalence approach to one-sided tests on multiple endpoints in clinical trials. Biometrika. 2004; 91: 715–727. 28. S. N. Roy, On a heuristic method of test construction and its use in multivariate analysis. Ann Math Stat. 1953; 24: 220–238. 29. P. H. Westfall, S. Y. Ho, and B. A. Prillaman, Properties of multiple intersection-union tests for multiple endpoints in combination therapy trials. J Biopharm Stat. 2001; 11: 125–138. 30. S. Holm, A simple sequentially rejective test procedure. Scand J Stat. 1979; 6: 65–70.
31. R. J. Simes, An improved Bonferroni procedure for multiple tests of significance. Biometrika. 1986; 73: 751–754. 32. G. Hommel, A comparison of two modified Bonferroni procedures. Biometrika. 1988; 75: 383–386. 33. H. M. Hung, T. H. Ng, G. Y. Chi, and R. J. Lipicky, Response surface and factorial designs for combination antihypertensive drugs. Drug Inf J. 1990; 24: 371–378. 34. H. M. Hung, G. Y. Chi, and R. J. Lipicky, On some statistical methods for analysis of combination drug studies. Commun Stat Theory Methods. 1994; A23: 361–376. 35. H. M. Hung, Evaluation of a combination drug with multiple doses in unbalanced factorial design clinical trials. Stat Med. 2000; 19: 2079–2087. 36. Y. Hochberg, A sharper Bonferroni procedure for multiple tests of significance. Biometrika. 1988; 75: 800–802. 37. B. Buchheister and W. Lehmacher, Establishing desirable dose combinations with multiple testing procedures In: N. Victor, M. Blettner, L. Edler, R. Haux, P. Knaup-Gregori, et al. (eds.), Medical Informatics, Biostatistics and Epidemiology for Efficient Health Care and Medical Research Munchen, Germany: Urban & Vogel, 1999, pp. 18–21. 38. B. Buchheister and W. Lehmacher, Multiple testing procedures for identifying desirable dose combinations in bifactorial designs. GMS Med Inform Biom Epidemiol. 2006; 2 (2): Dec07. 39. M. C. Berenbaum, What is synergy?. Pharmacol Rev. 1989; 41: 93–141. 40. E. M. Laska. M. Meisner, and C. Siegel, Simple designs and model-free tests for synergy. Biometrics. 1994; 50: 834–841. 41. P. Feng and C. Kelly, An extension of the model-free test to test synergy in multiple drug combinations. Biometric J. 2004; 3: 293–304. 42. R. L. Berger and J. C. Hsu, Bioequivalence trials, intersection-union tests and equivalence confidence sets. Stat Sci. 1996; 11: 283– 319.
FURTHER READING Office of Combination Products, U.S. Food and Drug Administration Website, at http://www.fda.gov/oc/combination/
CROSS-REFERENCES Bioavailability Combination therapy Multiple comparisons Multiple endpoints Noninferiority
MISSING DATA
GEERT MOLENBERGHS
Universiteit Hasselt, Center for Statistics, Diepenbeek, Belgium

EMMANUEL LESAFFRE
Catholic University of Leuven, Leuven, Belgium

1 INTRODUCTION

Data from longitudinal studies in general, and from clinical trials in particular, are prone to incompleteness. As incompleteness usually occurs for reasons outside of the control of the investigators and may be related to the outcome measurement of interest, it is generally necessary to reflect on the process governing incompleteness. Only in special but important cases is it possible to ignore the missingness process. When patients are examined repeatedly in a clinical trial, missing data can occur for various reasons and at various visits. When missing data result from patient dropout, the missing data have a monotone pattern. Nonmonotone missingness occurs when there are intermittent missing values as well. The focus here will be on dropout. Reasons typically encountered are adverse events, illness not related to study medication, uncooperative patient, protocol violation, ineffective study medication, loss to follow-up, and so on. When referring to the missing-value, or nonresponse, process, we will use the terminology of Little and Rubin (1). A nonresponse process is said to be missing completely at random (MCAR) if the missingness is independent of both unobserved and observed data and missing at random (MAR) if, conditional on the observed data, the missingness is independent of the unobserved measurements. A process that is neither MCAR nor MAR is termed nonrandom (MNAR). In the context of likelihood inference, and when the parameters describing the measurement process are functionally independent of the parameters describing the missingness process, MCAR and MAR are ignorable, whereas a nonrandom process is nonignorable. Thus, under ignorable dropout, one can literally ignore the missingness process and nevertheless obtain valid estimates of, say, the treatment effect. The above definitions are conditional on including the correct set of covariates in the model. An overview of the various mechanisms, and their (non-)ignorability under likelihood, Bayesian, or frequentist inference, is given in Table 1.

Table 1. Overview of Missing Data Mechanisms
MCAR (missing completely at random): ignorable for likelihood and Bayesian inference; ignorable for frequentist inference
MAR (missing at random): ignorable for likelihood and Bayesian inference; non-ignorable for frequentist inference
MNAR (missing not at random): non-ignorable for likelihood and Bayesian inference; non-ignorable for frequentist inference

Let us first consider the case in which only one follow-up measurement per patient is made. When dropout occurs in a patient, leaving the investigator without follow-up measures, one is usually forced to discard such a patient from analysis, thereby violating the intention to treat (ITT) principle, which stipulates that all randomized patients should be included in the primary analysis and according to the randomization scheme. Of course, the effect of treatment can be investigated under extreme assumptions, such as, for example, a worst-case and a best-case scenario, but such scenarios are most often not really helpful. The focus of this article will be on analysis techniques for repeated measurements studies. Early work regarding missingness focused on the consequences of the induced lack of balance of deviations from the study design (2, 3). Later, algorithmic developments took place, such as the expectation-maximization algorithm (EM) (4) and multiple imputation (5). These have brought likelihood-based ignorable analysis within reach of a large class of designs and models. However, they usually require extra programming in addition to available standard statistical software. In the meantime, however, clinical trial practice has put a strong emphasis on methods such as complete case analysis (CC) and last observation carried forward (LOCF) or other simple forms of imputation. Claimed advantages include computational simplicity, no need for a full longitudinal model analysis (e.g., when the scientific question is in terms of the last planned measurement occasion only), and for LOCF, compatibility
with the ITT principle. However, a CC analysis assumes MCAR, and the LOCF analysis makes peculiar assumptions about the (unobserved) evolution of the response, underestimates the variability of the response, and ignores the fact that imputed values are not real data. On the other hand, a likelihood-based longitudinal analysis requires only MAR, uses all data (obviating the need for both deleting and filling in data), and is consistent with the ITT principle. Furthermore, it can also be shown that the incomplete sequences contribute to estimands of interest (treatment effect at the end of the study), even early dropouts. For continuous responses, the linear mixed model is popular and is a direct extension of analysis of variance (ANOVA) and MANOVA approaches, but more broadly valid in incomplete data settings. For categorical responses and count data, so-called marginal (e.g., generalized estimating equations, GEEs) and random-effects (e.g., generalized linear mixed-effects models, GLMMs) approaches are in use. Although GLMM parameters can be fitted using maximum likelihood, the same is not true for the frequentist GEE method, but modifications have been proposed to accommodate the MAR assumption (6). Finally, MNAR missingness can never be fully ruled out based on the observed data only. It is argued that, rather than going either for discarding MNAR models entirely or for placing full faith in them, a sensible compromise is to make them a component of a sensitivity analysis.
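The practical difference among these mechanisms can be made concrete with a small simulation. In the sketch below, the outcome model, the logistic dropout models, and all parameter values are illustrative assumptions rather than material from the article; it shows that a complete case estimate of the mean at the second occasion is approximately unbiased under MCAR but not under MAR or MNAR.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 100_000

# Two correlated post-baseline measurements per patient; E(y2) = 0 by construction.
y1 = rng.normal(0.0, 1.0, n)
y2 = 0.6 * y1 + rng.normal(0.0, 0.8, n)

def dropout_probability(mechanism, y1, y2):
    """Probability that the second measurement is missing."""
    if mechanism == "MCAR":   # unrelated to any outcome
        return np.full(y1.shape, 0.3)
    if mechanism == "MAR":    # depends only on the observed first measurement
        return 1.0 / (1.0 + np.exp(-(y1 - 0.5)))
    if mechanism == "MNAR":   # depends on the possibly unobserved second measurement
        return 1.0 / (1.0 + np.exp(-(y2 - 0.5)))
    raise ValueError(mechanism)

for mechanism in ("MCAR", "MAR", "MNAR"):
    missing = rng.uniform(size=n) < dropout_probability(mechanism, y1, y2)
    cc_mean = y2[~missing].mean()   # complete case estimate of E(y2)
    print(f"{mechanism}: complete case mean of y2 = {cc_mean:+.3f} (true value 0)")
```

Under MAR, an analysis that conditions on the observed first measurement (for example, a likelihood-based longitudinal model) recovers valid estimates, which is the sense in which MAR is ignorable for likelihood inference.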
2 METHODS IN COMMON USE
We will focus on two relatively simple methods that have been and still are in extensive use. A detailed account of simple methods to handle missingness is given in Verbeke and Molenberghs (7, 8).
2.1 Complete Case Analysis

A complete case analysis includes only those cases for which all measurements were recorded. This method has obvious advantages. It is very simple to describe, and because the data structure is as would have resulted from a complete experiment, standard statistical software can be used without additional work. Furthermore, as the entire estimation is performed on the same subset of completers, there is a common basis for inference. Unfortunately, the method suffers from severe drawbacks. First, there is nearly always a substantial loss of information, and the impact on precision and power is dramatic. Furthermore, such an analysis will only be representative for patients who remain on study. Of course, a complete case analysis could have a role as an auxiliary analysis, especially if a scientific question relates to it. A final important issue is that a complete case analysis is only valid when the missingness mechanism is MCAR; severe bias can result when the missingness mechanism is MAR but not MCAR. This bias can go both ways, i.e., either overestimating or underestimating the true effect.

2.2 Last Observation Carried Forward

A method that has received a lot of attention (9–11) is last observation carried forward (LOCF). As noted, in the LOCF method, whenever a value is missing, the last observed value is substituted. For the LOCF approach, the MCAR assumption is necessary but not sufficient for an unbiased estimate. Indeed, it further assumes that subjects' responses would have been constant from the last observed value to the endpoint of the trial. These conditions seldom hold (8). In a clinical trial setting, one might believe that the response profile changes as soon as a patient goes off treatment and even that it would flatten; however, the constant profile assumption is even stronger. Therefore, carrying observations forward may bias estimates of treatment effects and underestimate the associated standard errors (8, 12–16). Furthermore, this method artificially increases the amount of information in the data by treating imputed and actually observed values on an equal footing.

Despite its shortcomings, LOCF has been the longstanding method of choice for the primary analysis in clinical trials because of its simplicity, ease of implementation, and the belief that the potential bias from carrying observations forward leads to a ''conservative'' analysis in comparative trials. An analysis is called conservative when it leads to no treatment difference, whereas in fact there is a treatment difference. However, reports of anti-conservative or liberal behavior of LOCF are common (17–21), which means that a LOCF analysis can create a treatment effect when none exists. Thus, the statement that LOCF analysis has been used to provide a conservative estimate of treatment effect is unacceptable.

Historically, an important motivation behind the simpler methods was their simplicity. Indeed, the main advantage, shared with complete case analysis, is that complete data software can be used. However, with the availability of commercial software tools, such as, for example, the SAS procedures MIXED and NLMIXED and the S-Plus and R nlme libraries, this motivation no longer applies. It is often quoted that LOCF or CC, although problematic for parameter estimation, produces randomization-valid hypothesis testing, but this is questionable. First, in a CC analysis, partially observed data are selected out, with probabilities that may depend on post-randomization outcomes, thereby undermining any randomization justification. Second, if the focus is on one particular time point, e.g., the last one scheduled, then LOCF plugs in data. Such imputations, apart from artificially inflating the information content, may deviate in complicated ways from the underlying data (17). Third, although the size of a randomization-based LOCF test may reach its nominal size under the null hypothesis of no difference in treatment profiles, there will be other regions of the alternative space where the power of the LOCF test procedure is equal to its size, which is completely unacceptable.

Table 1. Overview of Missing Data Mechanisms

Acronym   Description                     Likelihood/Bayesian   Frequentist
MCAR      missing completely at random    ignorable             ignorable
MAR       missing at random               ignorable             non-ignorable
MNAR      missing not at random           non-ignorable         non-ignorable

3 AN ALTERNATIVE APPROACH TO INCOMPLETE DATA

A graphical illustration is first provided, using an artificial example, of the various simple methods that have been considered, and then so-called direct likelihood analysis is discussed.

3.1 Illustration of Simple Methods

Take a look at an artificial but insightful example, depicted in Fig. 1, which displays the results of the traditional methods, CC and LOCF, next to the result of an MAR method. In this example, the mean response is supposed to be linear. For both groups (completers and dropouts), the slope is the same, but their intercepts differ. Patients with incomplete observations dropped out halfway through the study, e.g., because they reached a certain level of the outcome. This is clearly an MAR missingness mechanism. Using a method valid under the MAR assumption yields the correct mean profile, namely a straight line centered between the mean profiles of the completers and incompleters. If one were to perform a CC analysis, the fitted profile would coincide with the mean profile of the complete cases (bold line). Next, under LOCF, data are imputed (dashed line), and the resulting fitted profile is the bold dashed line. Clearly, both traditional methods produce an incorrect result. Furthermore, in a traditional available case analysis (AC), one makes use of the information actually available. One such set of estimators could be the treatment-specific means at the designed measurement occasions. With a decreasing sample size over time, means later in time would be calculated using fewer subjects than means earlier in time. Figure 1 shows a dramatic instance of this approach, due to the extreme nature of this illustrative example. The key message is that such an approach cannot remove major sources of bias.
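The sketch below mimics this artificial situation in a few lines of Python; the group intercepts, dropout rule, and visit grid are all assumptions chosen for illustration, and only the simple summaries (CC, AC, LOCF) are computed, not the full MAR analysis.

```python
# A minimal simulation, in the spirit of the artificial example above, of how
# CC, LOCF, and available-case (AC) summaries distort the mean profile when
# dropout is MAR. All names and parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2007)
n, times = 200, np.arange(1, 11)             # 200 subjects, 10 planned visits

# Two latent groups with the same slope but different intercepts
intercepts = rng.choice([2.0, 6.0], size=n)  # low intercept ~ "completers"
y = intercepts[:, None] + 0.5 * times[None, :] + rng.normal(0, 0.5, (n, len(times)))

# MAR dropout: subjects whose observed value at visit 5 exceeds a threshold
# drop out halfway through the study (missingness depends on observed data only)
dropout = y[:, 4] > 6.0
y_obs = y.copy()
y_obs[dropout, 5:] = np.nan

true_mean = y.mean(axis=0)                   # benchmark: full data
cc_mean = y_obs[~dropout].mean(axis=0)       # complete case
ac_mean = np.nanmean(y_obs, axis=0)          # available case
locf = y_obs.copy()
for t in range(1, len(times)):               # carry last observed value forward
    locf[:, t] = np.where(np.isnan(locf[:, t]), locf[:, t - 1], locf[:, t])
locf_mean = locf.mean(axis=0)

print("visit 10:  true %.2f  CC %.2f  AC %.2f  LOCF %.2f"
      % (true_mean[-1], cc_mean[-1], ac_mean[-1], locf_mean[-1]))
```

Running it shows the CC, AC, and LOCF means at the last visit all falling well below the true mean, in line with the message of Fig. 1.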
[Figure 1 appears here: mean profiles labeled Comp.Obs., Inc.Obs., Unobserved, LOCF ''data'', CC, LOCF, MAR, and AC, plotted against time.]
Figure 1. Artificial situation illustrating the results of the traditional MCAR methods (CC and LOCF) next to the result of the direct likelihood method.
3.2 Direct Likelihood Analysis

For continuous outcomes, Verbeke and Molenberghs (8) describe likelihood-based mixed-effects models, which are valid under the MAR assumption. Indeed, for longitudinal studies where missing data are involved, a mixed model only requires that the missing data are MAR. As opposed to the traditional techniques, mixed-effects models permit the inclusion of subjects with missing values at some time points (both dropout and intermittent missingness). This likelihood-based MAR analysis is also termed likelihood-based ignorable analysis or, as used in the remainder of this article, a direct likelihood analysis. In such a direct likelihood analysis, the observed data are used without deletion or imputation. In so doing, appropriate adjustments are made to parameters at times when data are incomplete, owing to the within-patient correlation. Thus, even when interest lies, for example, in a comparison between the two treatment groups at the last occasion, such a full longitudinal analysis is a good approach, because
the fitted model can be used as the basis for inference at the last occasion. In many clinical trials, the repeated measures are balanced in the sense that a common (and often limited) set of measurement times is considered for all subjects, which allows the a priori specification of a ''saturated'' model, for example, a full group-by-time interaction for the fixed effects combined with an unstructured covariance matrix. Such a model specification is sometimes termed mixed-effects model repeated-measures analysis (MMRM) (11). Thus, MMRM is a particular form of a linear mixed model, relevant for acute phase confirmatory clinical trials, fitting within the direct likelihood paradigm. Moreover, for complete data this direct likelihood MMRM analysis coincides with the analysis of variance (ANOVA) and multivariate analysis of variance (MANOVA) approaches, but it remains valid more generally when the data are incomplete. This is a strong answer to the common criticism that a direct likelihood method makes strong assumptions. Indeed, its coincidence with MANOVA for data without missingness shows that the assumptions
made are very mild. Therefore, it constitutes a very promising alternative to CC and LOCF. When a relatively large number of measurements is made within a single subject, the full power of random-effects modeling can be used (8). The practical implication is that a software module with likelihood estimation facilities and with the ability to handle incompletely observed subjects manipulates the correct likelihood, providing valid parameter estimates and likelihood ratio values.

A few cautionary remarks are warranted. First, when at least part of the scientific interest is directed toward the nonresponse process, obviously both processes need to be considered. Under MAR, both questions can be answered separately, which implies that a conventional method can be used to study questions in terms of the outcomes of interest, such as treatment effect and time trend, after which a separate model can be considered to study missingness. Second, likelihood inference is often surrounded with references to the sampling distribution (e.g., to construct measures of precision for estimators and for statistical hypothesis tests (22)). The practical implication is that standard errors and associated tests, when based on the observed rather than the expected information matrix and given that the parametric assumptions are correct, are valid. Third, it may be hard to rule out the operation of an MNAR mechanism. This point was brought up in Section 1 and will be discussed further in Section 5.

4 ILLUSTRATION: ORTHODONTIC GROWTH DATA

As an example, we use the orthodontic growth data, introduced by Potthoff and Roy (23) and used by Jennrich and Schluchter (24). The data have the typical structure of a clinical trial and are simple yet illustrative. They contain growth measurements for 11 girls and 16 boys. For each subject, the distance from the center of the pituitary to the maxillary fissure was recorded at ages 8, 10, 12, and 14. Figure 2 presents the 27 individual profiles. Little and Rubin (1) deleted 9 of the [(11 + 16) × 4] measurements, rendering 9
5
incomplete subjects, which, even though it is a somewhat unusual practice, has the advantage of allowing a comparison between the incomplete data methods and the analysis of the original, complete data. Deletion is confined to the age 10 measurements and, roughly speaking, the complete observations at age 10 are those with a higher measurement at age 8. Some emphasis will be placed on ages 8 and 10, the typical dropout setting, with age 8 fully observed and age 10 partially missing.

The simple methods and the direct likelihood method from Sections 2 and 3 are now compared using the growth data. For this purpose, a linear mixed model is used, assuming an unstructured mean, i.e., a separate mean for each of the eight age × sex combinations, together with an unstructured covariance structure, and using maximum likelihood (ML) as well as restricted maximum likelihood (REML). The mean profiles of the linear mixed model using maximum likelihood for all four datasets, for boys, are given in Fig. 3. The girls' profiles are similar and hence not shown. Next to this longitudinal approach, a full MANOVA analysis and a univariate ANOVA analysis will be considered, i.e., one per time point. For all of these analyses, Table 2 shows the estimates and standard errors for boys at ages 8 and 10, for the original data and all available incomplete data, as well as for the CC and the LOCF data.

First, the group means for the boys in the original dataset in Fig. 3 are considered; a relatively straight line is observed, so there seems to be a clear linear trend in the mean profile. In a complete case analysis of the growth data, the 9 subjects that lack one measurement are deleted, resulting in a working dataset with 18 subjects. This implies that 27 available measurements will not be used for analysis, a severe penalty on a relatively small dataset. Observing the profiles for the CC dataset in Fig. 3, all group means increased relative to the original dataset, but mostly so at age 8. The net effect is that the profiles overestimate the average distance. For the LOCF dataset, the 9 subjects that lack a measurement at age 10 are completed by imputing the age 8 value.
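A direct likelihood fit of this kind can be sketched with standard software. The snippet below is only an approximation of the analysis described here: it pulls the Potthoff–Roy data from the Rdatasets mirror (an assumption about data access), deletes nine age 10 values at random rather than following the Little and Rubin pattern, and uses a random intercept per child instead of the fully unstructured covariance, because statsmodels' MixedLM does not offer the latter.

```python
# Hedged sketch of a direct likelihood (ML) analysis of incomplete growth data.
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

growth = sm.datasets.get_rdataset("Orthodont", "nlme").data  # needs internet

# Make some age-10 measurements missing (chosen at random purely for
# illustration; not the actual Little and Rubin deletion pattern).
rng = np.random.default_rng(1)
drop_ids = rng.choice(growth["Subject"].unique(), size=9, replace=False)
incomplete = growth[~((growth["Subject"].isin(drop_ids)) & (growth["age"] == 10))]

# Unstructured mean: one fixed-effect cell per age-by-sex combination,
# fitted by maximum likelihood (reml=False) on all available rows.
model = smf.mixedlm("distance ~ C(age) * Sex", incomplete,
                    groups=incomplete["Subject"])
fit = model.fit(reml=False)
print(fit.summary())
```

The point of the sketch is that the incomplete children remain in the analysis and contribute to the age 10 estimates through the within-child correlation.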
[Figure 2 appears here: panel titled ''Orthodontic Growth Data Profiles and Means,'' with the vertical axis labeled Distance and the horizontal axis labeled Age in Years.]

Figure 2. Orthodontic growth data. Raw and residual profiles. (Girls are indicated with solid lines. Boys are indicated with dashed lines.)
[Figure 3 appears here: mean distance-by-age profiles for the original data, CC, LOCF, direct likelihood (fitted), and direct likelihood (observed), with the vertical axis labeled Distance and the horizontal axis labeled Age.]

Figure 3. Orthodontic growth data. Profiles for the original data, CC, LOCF, and direct likelihood for boys.
It is clear that this procedure will affect the apparently increasing linear trend found for the original dataset. Indeed, the imputation procedure forces the means at ages 8 and 10 to be more similar, thereby destroying the linear relationship. Hence, a simple, intuitively appealing interpretation of the trends is made impossible.

In the case of direct likelihood, two profiles can now be observed: one for the observed means and one for the fitted means. These two coincide at all ages except age 10. As mentioned, the complete observations at age 10 are those with a higher measurement at age 8. Due to the within-subject correlation, they are the ones with a higher measurement at age 10 as well, and therefore the fitted model corrects in the appropriate direction. The consequences of this are very important. Although it may seem that the fitted means do not follow the observed means all that well, this nevertheless is precisely what should be observed. Indeed, as the observed means are based on a nonrandom subset of the data, the
fitted means take into account all observed data points, including the information carried by the age 8 measurements of the children whose age 10 value is missing. As an aside, note that, in the case of direct likelihood, the observed average at age 10 coincides with the CC average, whereas the fitted average does not coincide with anything else. Indeed, if the model specification is correct, then a direct likelihood analysis produces a consistent estimator for the average profile, as if nobody had dropped out. Of course, this effect might be blurred in relatively small datasets due to small-sample variability. Irrespective of the small-sample behavior encountered here, the validity under MAR and the ease of implementation are good arguments that favor this direct likelihood analysis over other techniques.
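The direction of this correction can be seen from a simple conditional-mean argument; the bivariate normal notation below is introduced here only to make the point explicit and is not part of the original analysis.

```latex
% Why the fitted age 10 mean is pulled below the observed age 10 mean when
% the completers have above-average age 8 values (illustrative notation).
\[
  E\!\left[Y_{10} \mid Y_{8} = y_{8}\right]
  = \mu_{10} + \frac{\sigma_{8,10}}{\sigma_{8,8}}\left(y_{8} - \mu_{8}\right).
\]
```

With a positive covariance between the two ages, children who drop out (and tend to have below-average age 8 values) are predicted to have below-average age 10 values, so the likelihood-based estimate of the age 10 mean lies below the mean of the observed age 10 measurements, exactly the correction visible in Table 2.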
Table 2. Orthodontic Growth Data. Comparison of Analyses Based on Means at Completely Observed Age 8 and Incompletely Observed Age 10 Measurement

Method                          Boys at Age 8    Boys at Age 10

Original Data
  Direct likelihood, ML         22.88 (0.56)     23.81 (0.49)
  Direct likelihood, REML       22.88 (0.58)     23.81 (0.51)
  MANOVA                        22.88 (0.58)     23.81 (0.51)
  ANOVA per time point          22.88 (0.61)     23.81 (0.53)

All Available Incomplete Data
  Direct likelihood, ML         22.88 (0.56)     23.17 (0.68)
  Direct likelihood, REML       22.88 (0.58)     23.17 (0.71)
  MANOVA                        24.00 (0.48)     24.14 (0.66)
  ANOVA per time point          22.88 (0.61)     24.14 (0.74)

Complete Case Analysis
  Direct likelihood, ML         24.00 (0.45)     24.14 (0.62)
  Direct likelihood, REML       24.00 (0.48)     24.14 (0.66)
  MANOVA                        24.00 (0.48)     24.14 (0.66)
  ANOVA per time point          24.00 (0.51)     24.14 (0.74)

Last Observation Carried Forward Analysis
  Direct likelihood, ML         22.88 (0.56)     22.97 (0.65)
  Direct likelihood, REML       22.88 (0.58)     22.97 (0.68)
  MANOVA                        22.88 (0.58)     22.97 (0.68)
  ANOVA per time point          22.88 (0.61)     22.97 (0.72)
Now compare the different methods by means of Table 2, which shows the estimates and standard errors for boys at ages 8 and 10, for the original data and all available incomplete data, as well as for the CC data and the LOCF data. Table 2 shows some interesting features. For all four analysis methods, a CC analysis gives an upwardly biased estimate for both ages. This result is obvious, because the complete observations at age 10 are those with a higher measurement at age 8, as shown before. The LOCF analysis gives a correct estimate for the average outcome for boys at age 8, which is not surprising because there were no missing observations at this age. As noted, its estimate for boys at age 10 is biased downward. When the incomplete data are analyzed, we see from Table 2 that direct likelihood produces good estimates. The MANOVA and ANOVA per time point analyses give an overestimation of the average at age 10, as in the CC analysis. Furthermore, the MANOVA analysis also yields an overestimation of the average at age 8, again the same as in the CC analysis. Thus, direct likelihood shares the elegant and appealing features of ANOVA and MANOVA for fully observed data, but it is superior with incompletely observed profiles.
5 SENSITIVITY ANALYSIS
When there is residual doubt about the plausibility of MAR, one can conduct a sensitivity analysis. Although many proposals have been made, this is still an active area of research. Obviously, several MNAR models can be fitted, provided one is prepared to approach formal aspects of model comparison with due caution. Such analyses can be complemented with appropriate (global and/or local) influence analyses (25). Another route is to construct pattern-mixture models, where the measurement model is considered conditional upon the observed dropout pattern, and to compare the conclusions with those obtained from the selection model framework, where the reverse factorization is used (26, 27). Alternative sensitivity analysis frameworks are provided by Robins et al. (28), by Forster and Smith (29), who present a Bayesian sensitivity analysis, and by Raab and Donnelly (30). A further paradigm, useful for sensitivity analysis, is the shared parameter model, where common latent or
random effects drive both the measurement process and the process governing missingness (31, 32). Nevertheless, ignorable analyses may provide reasonably stable results, even when the assumption of MAR is violated, in the sense that such analyses constrain the behavior of the unseen data to be similar to that of the observed data. A discussion of this phenomenon in the survey context has been given by Rubin et al. (33). These authors first argue that, in well-conducted experiments (some surveys and many confirmatory clinical trials), the assumption of MAR is often to be regarded as a realistic one. Second, and very important for confirmatory trials, an MAR analysis can be specified a priori without additional work relative to a situation with complete data. Third, although MNAR models are more general and explicitly incorporate the dropout mechanism, the inferences they produce are typically highly dependent on the untestable and often implicit assumptions built in regarding the distribution of the unobserved measurements given the observed ones. The quality of the fit to the observed data need not reflect at all the appropriateness of the implied structure governing the unobserved data. Based on these considerations, it is recommended that ignorable likelihood-based methods or appropriately modified frequentist methods be used for primary analysis purposes. To explore the impact of deviations from the MAR assumption on the conclusions, one should ideally conduct a sensitivity analysis (8).

6 CONCLUSION

In conclusion, a direct likelihood analysis is preferable because it uses all available information, without the need to delete or to impute measurements or entire subjects. It is theoretically justified whenever the missing data mechanism is MAR, which is a more relaxed assumption than the MCAR assumption required for the simple analyses (CC, LOCF). There is no statistical information distortion, because observations are neither removed (as in a CC analysis) nor added (as in a LOCF analysis). Software is available, such that no
additional programming is involved to perform a direct likelihood analysis. It is very important to realize that, for complete sets of data, direct likelihood, especially with the REML estimation method, is identical to MANOVA (see Table 2). Given the classic robustness of MANOVA, and its close agreement with ANOVA per time point, this provides an extra basis for direct likelihood. Indeed, it is not as assumption-driven as is sometimes believed. This, together with the validity of direct likelihood under MAR (and hence its divergence from MANOVA and ANOVA for incomplete data), provides a strong basis for the direct likelihood method.
7 ACKNOWLEDGMENTS
The authors gratefully acknowledge support from Fonds Wetenschappelijk Onderzoek-Vlaanderen Research Project G.0002.98 ''Sensitivity Analysis for Incomplete and Coarse Data'' and from the Belgian IUAP/PAI network ''Statistical Techniques and Modeling for Complex Substantive Questions with Complex Data.'' We are thankful to Eli Lilly for kind permission to use their data.

REFERENCES

1. R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data. New York: John Wiley & Sons, 2002.
2. A. Afifi and R. Elashoff, Missing observations in multivariate statistics I: Review of the literature. J. Am. Stat. Assoc. 1966; 61: 595–604.
3. H. O. Hartley and R. Hocking, The analysis of incomplete data. Biometrics 1971; 27: 783–808.
4. A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Stat. Soc. Series B 1977; 39: 1–38.
5. D. B. Rubin, Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons, 1987.
6. J. M. Robins, A. Rotnitzky, and L. P. Zhao, Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J. Am. Stat. Assoc. 1995; 90: 106–121.
7. G. Verbeke and G. Molenberghs, Linear Mixed Models in Practice: A SAS-Oriented Approach. Lecture Notes in Statistics 126. New York: Springer-Verlag, 1997.
8. G. Verbeke and G. Molenberghs, Linear Mixed Models for Longitudinal Data. New York: Springer-Verlag, 2000.
9. O. Siddiqui and M. W. Ali, A comparison of the random-effects pattern mixture model with last observation carried forward (LOCF) analysis in longitudinal clinical trials with dropouts. J. Biopharm. Stat. 1998; 8: 545–563.
10. C. H. Mallinckrodt, W. S. Clark, R. J. Carroll, and G. Molenberghs, Assessing response profiles from incomplete longitudinal clinical trial data under regulatory considerations. J. Biopharm. Stat. 2003; 13: 179–190.
11. C. H. Mallinckrodt, T. M. Sanger, S. Dube, D. J. Debrota, G. Molenberghs, R. J. Carroll, W. M. Zeigler Potter, and G. D. Tollefson, Assessing and interpreting treatment effects in longitudinal clinical trials with missing data. Biol. Psychiatry 2003; 53: 754–760.
12. R. D. Gibbons, D. Hedeker, I. Elkin, D. Waternaux, H. C. Kraemer, J. B. Greenhouse, M. T. Shea, S. D. Imber, S. M. Sotsky, and J. T. Watkins, Some conceptual and statistical issues in analysis of longitudinal psychiatric data. Arch. Gen. Psychiatry 1993; 50: 739–750.
13. A. Heyting, J. Tolboom, and J. Essers, Statistical handling of dropouts in longitudinal clinical trials. Stat. Med. 1992; 11: 2043–2061.
14. P. W. Lavori, R. Dawson, and D. Shera, A multiple imputation strategy for clinical trials with truncation of patient data. Stat. Med. 1995; 14: 1913–1925.
15. C. H. Mallinckrodt, W. S. Clark, and R. D. Stacy, Type I error rates from mixed-effects model repeated measures versus fixed effects analysis of variance with missing values imputed via last observation carried forward. Drug Inform. J. 2001; 35(4): 1215–1225.
16. C. H. Mallinckrodt, W. S. Clark, and R. D. Stacy, Accounting for dropout bias using mixed-effects models. J. Biopharm. Stat. 2001; 11(1 & 2): 9–21.
17. M. G. Kenward, S. Evans, J. Carpenter, and G. Molenberghs, Handling missing responses: Time to leave Last Observation Carried Forward (LOCF) behind. Submitted for publication.
18. G. Molenberghs, H. Thijs, I. Jansen, C. Beunckens, M. G. Kenward, C. Mallinckrodt, and R. J. Carroll, Analyzing incomplete longitudinal clinical trial data. Biostatistics 2004; 5: 445–464.
19. C. H. Mallinckrodt, J. G. Watkin, G. Molenberghs, and R. J. Carroll, Choice of the primary analysis in longitudinal clinical trials. Pharm. Stat. 2004; 3: 161–169.
20. R. J. A. Little and L. Yau, Intent-to-treat analysis in longitudinal studies with dropouts. Biometrics 1996; 52: 1324–1333.
21. G. Liu and A. L. Gould, Comparison of alternative strategies for analysis of longitudinal trials with dropouts. J. Biopharm. Stat. 2002; 12: 207–226.
22. M. G. Kenward and G. Molenberghs, Likelihood based frequentist inference when data are missing at random. Stat. Sci. 1998; 12: 236–247.
23. R. F. Potthoff and S. N. Roy, A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika 1964; 51: 313–326.
24. R. I. Jennrich and M. D. Schluchter, Unbalanced repeated measures models with structured covariance matrices. Biometrics 1986; 42: 805–820.
25. G. Verbeke, G. Molenberghs, H. Thijs, E. Lesaffre, and M. G. Kenward, Sensitivity analysis for non-random dropout: A local influence approach. Biometrics 2001; 57: 7–14.
26. B. Michiels, G. Molenberghs, L. Bijnens, T. Vangeneugden, and H. Thijs, Selection models and pattern-mixture models to analyze longitudinal quality of life data subject to dropout. Stat. Med. 2002; 21: 1023–1041.
27. H. Thijs, G. Molenberghs, B. Michiels, G. Verbeke, and D. Curran, Strategies to fit pattern-mixture models. Biostatistics 2002; 3: 245–265.
28. J. M. Robins, A. Rotnitzky, and D. O. Scharfstein, Semiparametric regression for repeated outcomes with non-ignorable non-response. J. Am. Stat. Assoc. 1998; 93: 1321–1339.
29. J. J. Forster and P. W. Smith, Model-based inference for categorical survey data subject to non-ignorable non-response. J. Roy. Stat. Soc. Series B 1998; 60: 57–70.
30. G. M. Raab and C. A. Donnelly, Information on sexual behaviour when some data are missing. Appl. Stat. 1999; 48: 117–133.
31. M. C. Wu and K. R. Bailey, Estimation and comparison of changes in the presence of informative right censoring: Conditional linear model. Biometrics 1989; 45: 939–955.
32. M. C. Wu and R. J. Carroll, Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process. Biometrics 1988; 44: 175–188.
33. D. B. Rubin, H. S. Stern, and V. Vehovar, Handling ''don't know'' survey responses: The case of the Slovenian plebiscite. J. Am. Stat. Assoc. 1995; 90: 822–828.
MONITORING

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D'Agostino and Sarah Karl.

The purposes of trial monitoring are to verify that:

• The rights and well-being of human subjects are protected.
• The reported trial data are accurate, complete, and verifiable from source documents.
• The conduct of the trial is in compliance with the currently approved protocol/amendment(s), with Good Clinical Practice (GCP), and with applicable regulatory requirement(s).

Selection and Qualifications of Monitors

• Monitors should be appointed by the sponsor.
• Monitors should be trained appropriately and should have the scientific and/or clinical knowledge needed to monitor the trial adequately. A monitor's qualifications should be documented.
• Monitors should be thoroughly familiar with the investigational product(s), the protocol, the written informed consent form and any other written information to be provided to subjects, the sponsor's Standard Operation Procedures (SOP), GCP, and the applicable regulatory requirement(s).

The sponsor should ensure that the trials are adequately monitored. The sponsor should determine the appropriate extent and nature of monitoring. The determination of the extent and the nature of monitoring should be based on considerations such as the objective, purpose, design, complexity, blinding, size, and endpoints of the trial. In general, a need exists for on-site monitoring, before, during, and after the trial; however, in exceptional circumstances, the sponsor may determine that central monitoring in conjunction with procedures such as investigators' training and meetings, and extensive written guidance, can assure appropriate conduct of the trial in accordance with GCP. Statistically controlled sampling may be an acceptable method for selecting the data to be verified.

The monitor(s), in accordance with the sponsor's requirements, should ensure that the trial is conducted and documented properly by carrying out the following activities when relevant and necessary to the trial and the trial site:

• Acting as the main line of communication between the sponsor and the investigator.
• Verifying that the investigator has adequate qualifications and resources and these remain adequate throughout the trial period, and that the staff and facilities, which include laboratories and equipment, are adequate to conduct the trial in a safe and proper manner, and these remain adequate throughout the trial period.
• Verifying, for the investigational product(s):
  • That storage times and conditions are acceptable, and that supplies are sufficient throughout the trial.
  • That the investigational product(s) are supplied only to subjects who are eligible to receive it and at the protocol specified dose(s).
  • That subjects are provided with necessary instruction on properly using, handling, storing, and returning the investigational product(s).
  • That the receipt, use, and return of the investigational product(s) at the trial sites are controlled and documented adequately.
  • That the disposition of unused investigational product(s) at the trial sites complies with applicable regulatory requirement(s) and is in accordance with the sponsor's authorized procedures.
• Verifying that the investigator follows the approved protocol and all approved amendment(s), if any.
• Verifying that written informed consent was obtained before each subject's participation in the trial.
• Ensuring that the investigator receives the current Investigator's Brochure, all documents, and all trial supplies needed to conduct the trial properly and to comply with the applicable regulatory requirement(s).
• Ensuring that the investigator and the investigator's trial staff are adequately informed about the trial.
• Verifying that the investigator and the investigator's trial staff are performing the specified trial functions, in accordance with the protocol and any other written agreement between the sponsor and the investigator/institution, and have not delegated these functions to unauthorized individuals.
• Verifying that the investigator is enrolling only eligible subjects.
• Reporting the subject recruitment rate.
• Verifying that source data/documents and other trial records are accurate, complete, kept up-to-date, and maintained.
• Verifying that the investigator provides all the required reports, notifications, applications, and submissions, and that these documents are accurate, complete, timely, legible, dated, and identify the trial.
• Checking the accuracy and completeness of the Case Report Form (CRF) entries, source data/documents, and other trial-related records against each other. The monitor specifically should verify that:
  • The data required by the protocol are reported accurately on the CRFs and are consistent with the source data/documents.
  • Any dose and/or therapy modifications are well documented for each of the trial subjects.
  • Adverse events, concomitant medications, and intercurrent illnesses are reported in accordance with the protocol on the CRFs.
  • Visits that the subjects fail to make, tests that are not conducted, and examinations that are not performed are clearly reported as such on the CRFs.
  • All withdrawals and dropouts of enrolled subjects from the trial are reported and explained on the CRFs.
• Informing the investigator of any CRF entry error, omission, or illegibility. The monitor should ensure that appropriate corrections, additions, or deletions are made, dated, explained (if necessary), and initialed by the investigator or by a member of the investigator's trial staff who is authorized to initial CRF changes for the investigator. This authorization should be documented.
• Determining whether all adverse events (AEs) are appropriately reported within the time periods required by GCP, the protocol, the IRB (Institutional Review Board)/IEC (Independent Ethics Committee), the sponsor, and the applicable regulatory requirement(s), and as indicated in the International Conference on Harmonisation (ICH) Guideline for Clinical Safety Data Management: Definitions and Standards for Expedited Reporting.
• Determining whether the investigator is maintaining the essential documents.
• Communicating deviations from the protocol, SOPs, GCP, and the applicable regulatory requirements to the investigator and taking appropriate action designed to prevent recurrence of the detected deviations.

The monitor(s) should follow the sponsor's established written SOPs as well as those procedures that are specified by the sponsor for monitoring a specific trial.
Monitoring Report

• The monitor should submit a written report to the sponsor after each trial-site visit or trial-related communication.
• Reports should include the date, site, name of the monitor, and name of the investigator or other individual(s) contacted.
• Reports should include a summary of what the monitor reviewed and the monitor's statements concerning the significant findings/facts, deviations and deficiencies, conclusions, actions taken or to be taken, and/or actions recommended to secure compliance.
• The review and follow-up of the monitoring report by the sponsor should be documented by the sponsor's designated representative.
MONOTHERAPY

JANET DARBYSHIRE
London, United Kingdom

1 DEFINITION

A trial that evaluates a monotherapy would most often test a single drug, but the definition could be extended to include other interventions such as immunotherapy or even a different modality such as radiotherapy that contain a single element. The issues around trial design are essentially the same and, therefore, this section focuses on monotherapy trials that evaluate a drug.

The simplest and most efficient way to evaluate a new agent is to compare it with no treatment or a placebo in a classic randomized parallel group design, assuming that its safety and activity have already been demonstrated in Phase I and II trials. If the drug is to be clinically useful, the effect compared with no treatment will be relatively large and, therefore, the trial may not need to include many patients. The duration of the trial will be driven by the disease, in particular its natural history, and the primary outcome measure. A classic example of a new monotherapy was the drug zidovudine (ZDV), the first drug to be evaluated for the treatment of HIV infection. In patients with advanced HIV infection, ZDV was shown to be substantially and significantly better than placebo in preventing death over a short period (1). However, subsequent trials, also placebo-controlled, of ZDV at a much earlier stage of HIV infection showed that the effect was short-lived because of the emergence of resistance to ZDV (2).

In a highly fatal disease or a disease with a high incidence of other clear-cut clinical endpoints, the benefits of the first potent therapy may even be clearly demonstrated without a trial, the often quoted ''penicillin-like'' effect. In such circumstances, it may be difficult to withhold a new drug that looks promising from individuals, and one approach could be to compare immediate therapy with therapy delayed for an appropriate length of time such that every patient has the opportunity to receive the drug and yet a sufficient comparison period exists to assess the effect on a short-term outcome (as well as the opportunity to compare the effect of the different durations imposed by the delayed start of treatment on longer-term outcomes). The comparison of immediate versus delayed therapy may also be appropriate when it is unclear when to start therapy, a concept explored in the Concorde trial (3).

Once an effective therapy is available for a disease, it becomes much more difficult to assess new therapies, as it is usually no longer ethical to compare them with an untreated group (which may receive placebo). However, this is not always the case; for example, it may be appropriate to evaluate a new bronchodilator for acute asthma by comparison with a placebo as long as treatment is available if therapy is not effective. An alternative way to evaluate certain new monotherapy regimens is in a crossover trial. However, this approach is only practicable if the disease is chronic, returns to a similar baseline when therapy is stopped, and no prolonged effect of the therapy exists after it is stopped.

To evaluate a new monotherapy when an existing therapy exists and an untreated group is not ethically acceptable, the randomized comparison is usually with the existing standard therapy. The new therapy may be expected to be more effective than the existing therapy or of similar efficacy but with less toxicity, and this will influence whether the trial is designed as a superiority or an equivalence (or non-inferiority) trial. In either case, the trial will need to be larger than if the new drug were compared with no treatment, as the difference between the efficacy of the new and old drug is likely to be much smaller than between drug and no treatment. An example of a monotherapy trial comparing a new drug with existing therapy is the evaluation of the second anti-HIV drug didanosine (ddI), which was compared directly with ZDV (4).
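The sample-size point above can be made concrete with a rough normal-approximation calculation; the event rates, power, and significance level below are illustrative assumptions and are not taken from any of the trials cited in this article.

```python
# A rough sketch of why a head-to-head comparison with an active control needs
# more patients than a placebo comparison. All rates are illustrative.
from math import ceil
from scipy.stats import norm

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for comparing two proportions."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return ceil((z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)

# New drug versus no treatment / placebo: a large absolute difference
print(n_per_arm(0.40, 0.20))   # about 80 patients per arm
# New drug versus an effective standard: a much smaller difference
print(n_per_arm(0.20, 0.15))   # about 900 patients per arm
```

With these illustrative rates, the placebo comparison needs on the order of 80 patients per arm, whereas the active-control comparison needs roughly 900 per arm, an order-of-magnitude increase driven entirely by the smaller difference to be detected.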
Such trials are, wherever possible, blinded (or masked), either by making the two drugs indistinguishable or, if this is not possible, by employing the double-dummy technique. The selection of the comparison therapy is crucial. For example, if a new antibiotic is being evaluated, the choice of an inappropriate comparator may make the new drug appear better than it really is. Alternatively, if the comparator happens to be more potent against the likely infecting organisms, a drug that is useful in other infections may appear ineffective.

In certain circumstances, it may be necessary to assess a new drug that will ultimately be used only as part of a combination therapy regimen. In these circumstances, the new drug may be given in addition to the conventional therapy (which itself may be a single therapy or a combination). Again, the basic comparison is with a group that only received the conventional therapy and, wherever possible, the trial is blinded or masked. Many examples of this comparison exist, such as the ESPRIT trial, which is evaluating Interleukin-2 (IL-2) on a background of antiretroviral therapy (5), and the first trial of a drug from a new class in HIV infection, ritonavir, a protease inhibitor (6). One of the difficulties of assessing a drug by adding it to, or substituting it for, one component of a combination regimen is that additive, synergistic, or antagonistic activity may exist between the drugs. Further, an active new drug may add little to a potent treatment and therefore be rejected as inactive. New drugs that will never be used as monotherapy, such as in HIV and tuberculosis (TB), are sometimes evaluated for the first time as monotherapy in short-term studies based on laboratory markers. These studies are used to determine whether the new drug has any activity, ideally comparing it with other drugs concurrently or, alternatively, using historical controls. For both TB and HIV, issues around such designs exist because of concerns about the emergence of drug resistance, and therefore innovative designs are needed to address this increasingly important issue. An efficient approach to the evaluation of two drugs, which also explores the effect of giving them together, is the factorial design. This design relies on the assumption that there will not be an interaction between
them. If a possibility of an interaction exists, whether positive or negative, then it may be an important finding, and the factorial design is still the optimal approach to evaluating it. However, the sample size will need to be sufficiently large to assess interactions adequately. If monotherapy regimens fail, a number of options exist that may need to be explored depending on the disease, the available drugs, and issues such as cross-resistance or interactions. For example, in early HIV trials, a new drug was often added to ZDV or, alternatively, patients were switched from one monotherapy to another. No single optimal approach to deciding how to use new drugs exists, but a clear need exists to consider all the relevant disease- and drug-related factors.

REFERENCES

1. M. A. Fischl, D. D. Richman, M. H. Grieco, M. S. Gottlieb, P. A. Volberding, O. L. Laskin, et al., The efficacy of azidothymidine (AZT) in the treatment of patients with AIDS and AIDS related complex. A double-blind, placebo controlled trial. N. Engl. J. Med. 1987; 317: 185–191.
2. HIV Trialists' Collaborative Group, Zidovudine, didanosine, and zalcitabine in the treatment of HIV infection: meta-analyses of the randomised evidence. Lancet 1999; 353: 2014–2025.
3. Concorde Coordinating Committee, Concorde: MRC/ANRS randomised double-blind controlled trial of immediate and deferred zidovudine in symptom-free HIV infection. Lancet 1994; 343: 871–882.
4. J. O. Kahn, S. W. Lagakos, D. D. Richman, A. Cross, C. Pettinelli, S-H. Liou, et al., A controlled trial comparing continued zidovudine with didanosine in human immunodeficiency virus infection. N. Engl. J. Med. 1992; 327: 581–587.
5. S. Emery, D. I. Abrams, D. A. Cooper, J. H. Darbyshire, H. C. Lane, J. D. Lundgren, and J. D. Neaton, The evaluation of subcutaneous Proleukin (interleukin-2) in a randomized international trial: rationale, design, and methods of ESPRIT. Controlled Clin. Trials 2002; 23: 198–220.
6. D. W. Cameron, M. Heath-Chiozzi, S. Danner, C. Cohen, S. Kravcik, C. Maurath, et al., Randomised placebo controlled trial of ritonavir in advanced HIV-1 disease. Lancet 1998; 351(9102): 543–549.
MOTHER TO CHILD HUMAN IMMUNODEFICIENCY VIRUS TRANSMISSION TRIALS

DAVID E. SHAPIRO
Center for Biostatistics in AIDS Research, Harvard School of Public Health, Boston, Massachusetts

Mother-to-child transmission (MTCT) of human immunodeficiency virus type 1 (HIV) can occur through three major routes: through the placental barrier during gestation (in utero and antepartum), via contact with maternal bodily fluids during labor and delivery (intrapartum), and by ingestion of breast milk after delivery (postpartum); most MTCT is believed to occur close to the time of or during childbirth (1). Infants who have a positive HIV test within the first 72 hours of life are presumed to have been infected in utero, and those who are HIV-negative within the first 72 hours and HIV-positive thereafter are presumed to have been infected close to or during delivery (or via early breastfeeding, if the infant breastfeeds).

Prevention of MTCT (PMTCT) is one of the most successful areas in HIV clinical research. Interventions to prevent MTCT have been developed for two different settings: the United States and other developed countries with access to medications, medical infrastructure, and safe replacement feeding so that breastfeeding can be discouraged; and resource-limited countries with limited access to medications and clean water, and where breastfeeding is critical (2). In the absence of intervention, the risk of MTCT is approximately 15–25% during pregnancy through delivery (3), and an additional 0.9% per month during breastfeeding (4). With combination antiretroviral therapy during pregnancy and delivery, MTCT risk can be reduced to below 2% in the absence of breastfeeding (5,6); several clinical trials of interventions to reduce MTCT during breastfeeding are ongoing (7). This article describes four important PMTCT randomized trials, including two trials of key interventions used in developed countries and two trials of interventions feasible in resource-limited settings (Table 1). For each trial, the objectives, study design, results, and conclusions are summarized, along with implications for future patients and questions for additional study.

1 THE PEDIATRIC AIDS CLINICAL TRIALS GROUP 076 TRIAL
The seminal Pediatric AIDS Clinical Trials Group (PACTG) 076 trial, which was sponsored by the U.S. National Institutes of Health (NIH), was the first Phase III trial to establish the efficacy of an antiretroviral drug to prevent MTCT of HIV. As of October 2007, the PACTG 076 zidovudine (ZDV) regimen is still recommended in the United States and other developed countries (8).

1.1 Objectives

Based on animal models of retroviral infection, it was hypothesized that ZDV might prevent MTCT either by reducing the circulating HIV viral load in the mother and thereby reducing exposure of the fetus to HIV in utero and during delivery, or by accumulating therapeutic concentrations in the fetus and infant that could provide protection during and after exposure to HIV, or both (9). Phase I studies in pregnant women suggested that ZDV was safe when used for short periods and that it crossed the placenta well (9). The primary objectives of the trial were to assess the efficacy and safety of ZDV for the prevention of MTCT of HIV.

1.2 Study Design

The PACTG 076 study was a randomized, double-blind, placebo-controlled trial conducted in the United States and France from 1991 to 1994. HIV-infected pregnant women who were between 14 and 34 weeks gestation, had CD4+ T-lymphocyte counts exceeding 200/mm3, and had no indications for antiretroviral therapy were enrolled. The study regimen consisted of oral ZDV five times daily during pregnancy, intravenous ZDV during labor and delivery, and oral ZDV
Table 1. Summary of the Study Interventions in Four Important PMTCT Randomized Trials

PACTG 076 (US, France; 1991–4)
  Antepartum: Arm 1: ZDV (oral, 5x/day from 14 weeks gestation); Arm 2: placebo
  Intrapartum: Arm 1: ZDV (intravenous); Arm 2: placebo
  Postpartum, mother: No antiretrovirals
  Postpartum, infant: Arm 1: ZDV (oral, 4x/day for 6 weeks); Arm 2: placebo

EMDC (Europe; 1993–1998)
  Antepartum: Non-study antiretrovirals (mainly ZDV)
  Intrapartum: Arm 1: elective cesarean delivery; Arm 2: vaginal delivery
  Postpartum, mother: Non-study antiretrovirals (mainly ZDV)
  Postpartum, infant: Non-study antiretrovirals (mainly ZDV)

HIVNET 012 (Uganda; 1997–1999)
  Antepartum: No antiretrovirals
  Intrapartum: Arm 1: NVP (oral, single-dose); Arm 2: ZDV (oral); Arm 3: placebo (stopped Feb. 1998)
  Postpartum, mother: No antiretrovirals
  Postpartum, infant: Arm 1: NVP (oral, single dose on day 2–3 of age); Arm 2: ZDV (oral, 2x/day for 1 week); Arm 3: placebo (stopped Feb. 1998)

MASHI (Botswana; 2001–2003)
  Antepartum: ZDV (oral, 2x/day from 34 weeks gestation)∗
  Intrapartum: First randomization∗∗: NVP (oral, single-dose) versus placebo; ZDV (oral)
  Postpartum, mother: No antiretrovirals∗
  Postpartum, infant: Second randomization: breastfeeding with ZDV (oral, 3x/day until age 6 months) versus formula feeding; NVP∗∗ (oral, single-dose) plus ZDV (oral, 2x/day for 4 weeks)

∗ In the revised design, women received combination antiretroviral therapy if required per Botswana national guidelines.
∗∗ In the initial design, the first randomization was to maternal/infant single dose NVP versus maternal/infant placebo.
to the infant for 6 weeks. The women were followed until 6 months postpartum, and the infants were followed until 18 months of age. The primary efficacy outcome measure was the MTCT rate at 18 months of age, as estimated by the Kaplan-Meier method to permit inclusion of infants who had incomplete follow-up. The target sample size was 748 mother–infant pairs (636 assessable) to provide 80% power to detect a reduction in MTCT from 30% in the placebo arm to 20% in the ZDV arm, with a two-sided, 0.05 Type I error. The trial was to be monitored by an independent Data and Safety Monitoring Board (DSMB) at least annually for study progress and safety; three interim efficacy analyses were planned, with an O’Brien– Fleming stopping boundary. In
February 1994, the first interim efficacy analysis revealed a dramatic reduction in MTCT in the ZDV arm. The interim results were released immediately, enrollment was discontinued, all study participants who received the blinded study drug were offered ZDV, and study follow-up was completed as planned (9).

2 RESULTS

In the final efficacy analysis based on complete follow-up of 402 assessable mother–infant pairs, MTCT rates were 7.6% and 22.6% in the ZDV and placebo groups, respectively (10), which represented a 66% reduction in overall MTCT risk and confirmed the results of the interim efficacy analysis (9).
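The 18-month transmission rates above are Kaplan-Meier estimates, which is what lets infants with incomplete follow-up contribute to the primary analysis. The toy calculation below shows the mechanics; the follow-up ages and event indicators are invented for illustration and are not PACTG 076 data.

```python
# Illustrative Kaplan-Meier estimate of the transmission risk by 18 months.
# The data below are made up for illustration only.
import numpy as np

# follow-up age in months and event flag (1 = HIV infection detected,
# 0 = censored, i.e., last HIV-negative test before follow-up ended)
age   = np.array([ 2,  4,  6,  6,  9, 12, 12, 15, 18, 18, 18, 18])
event = np.array([ 1,  0,  1,  0,  0,  1,  0,  0,  0,  0,  0,  0])

def km_event_probability(time, status, horizon):
    """1 - S(horizon) from the Kaplan-Meier product-limit estimator."""
    surv = 1.0
    for t in np.unique(time[status == 1]):
        if t > horizon:
            break
        at_risk = np.sum(time >= t)          # still under follow-up just before t
        events = np.sum((time == t) & (status == 1))
        surv *= 1.0 - events / at_risk
    return 1.0 - surv

print("Estimated transmission risk by 18 months: %.3f"
      % km_event_probability(age, event, 18))
```

Infants censored before 18 months still appear in the risk sets of the earlier event times, which is exactly why they are not simply discarded.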
ZDV significantly reduced MTCT both in utero and intrapartum (11). However, ZDV reduced maternal viral load only slightly (median reduction, 0.24 log), and after adjustment for the baseline maternal viral load and CD4+ cell count, the reduction in viral load from baseline to delivery was not significantly associated with MTCT risk (10). ZDV was well tolerated and minimal short-term toxic effects were observed, other than significantly lower hemoglobin levels in ZDV-exposed infants during the first 6 weeks of life (9,12). No differences between groups were observed with respect to adverse pregnancy outcomes, uninfected infant growth and immune parameters, or HIV disease progression (12,13). The development of viral mutations conferring resistance to ZDV was rare (14).

2.1 Conclusions and Questions for Additional Study

The PACTG 076 results suggested that a maternal–infant ZDV regimen could dramatically reduce MTCT of HIV with little short-term toxicity in pregnant women with mildly symptomatic HIV disease. Within 2 months of the release of the initial PACTG 076 results, the U.S. Public Health Service (USPHS) issued interim guidance to support the use of the PACTG 076 ZDV regimen, and 4 months later, they issued more extensive guidance (15,16). Epidemiologic studies have subsequently demonstrated large decreases in MTCT with incorporation of the PACTG 076 regimen into general clinical practice (8). Two important questions for additional study included the regimen's long-term safety and whether it would be efficacious in women with more advanced HIV disease. Long-term follow-up studies of the PACTG 076 mothers (17) and infants (18,19) observed no major adverse effects of ZDV within the first few years after delivery, although subsequently a possible association of in utero ZDV or ZDV/lamivudine (3TC) exposure with mitochondrial dysfunction among uninfected infants has been found in some studies but not others (8). The efficacy of ZDV among women with more advanced HIV disease was demonstrated in PACTG 185, which was a randomized trial in the United States of passive
immunization for the prevention of MTCT of HIV among pregnant HIV-infected women who had CD4+ counts below 500/mm3 and who were receiving ZDV for maternal health. Enrollment was discontinued after the first interim efficacy analysis revealed an unexpectedly low MTCT rate of 4.8%, which substantially increased the sample size required to achieve the primary objective (20). Questions for additional study regarding modifications of the PACTG 076 regimen were in opposite directions in developed and resource-limited countries. In developed countries, the primary question was whether intensifying the PACTG 076 regimen by adding other antiretrovirals to reduce viral load even more would increase efficacy. A nonrandomized study in France showed that adding 3TC to the PACTG 076 regimen starting at 32-weeks gestation could reduce MTCT to below 2% (21), and other studies subsequently observed similar results with other combination antiretroviral regimens (5,6). In resource-limited settings, however, the PACTG 076 regimen was already too complex and expensive to be feasible, so the key question was whether the regimen could be shortened or simplified, yet still reduce MTCT risk. The relative importance of the maternal and infant components of the ZDV regimen could not be determined from PACTG 076 because mothers and infants received the same treatment assignment. Interestingly, a 2 × 2 factorial design, which could have addressed this question, was considered for PACTG 076 but ultimately rejected because of concerns that the maternal and infant components would not have independent influence on MTCT risk, which would reduce the statistical power of the factorial design (22). Subsequently, a randomized, placebo-controlled trial in Thailand found that a simplified, short-course (SC) maternal ZDV regimen (oral ZDV twice daily starting from 36-weeks gestation and oral ZDV during labor) with no infant treatment could reduce MTCT by 50%, from 19% to 9% (23). Another Thai trial with a 2 × 2 factorial design compared long versus short maternal ZDV (starting from 28- or 36-weeks gestation) and long versus short infant ZDV (for 6 weeks or 3 days). At the first interimefficacy analysis, the short–short arm was discontinued because it was inferior to the
long–long arm (10.5% versus 4.1% MTCT), but at the end of the trial the long–long, long–short, and short–long arms had equivalent MTCT (6.5% versus 4.7% versus 8.6%); the rate of in utero MTCT was higher with short versus long antepartum ZDV, which suggested that longer treatment of the infant cannot substitute for longer treatment of the mother (24). A trial in Africa among breastfeeding HIV-infected women showed that a short-course combination regimen of ZDV and 3TC starting from 36 weeks gestation, orally during delivery, and for 1 week after delivery to the mother and infant, reduced MTCT at age 6 weeks by approximately 63% compared with placebo (25).

3 THE EUROPEAN MODE OF DELIVERY TRIAL

Because most MTCT is thought to occur during labor and delivery, interventions at the time of delivery could potentially reduce MTCT risk substantially, especially for women who have little or no prenatal care, are diagnosed as HIV-infected late in pregnancy, or have high viral load near the time of delivery. In the early 1990s, when results of some observational studies but not others suggested that elective cesarean-section (ECS) delivery before membrane rupture and before labor could reduce MTCT compared with vaginal delivery, the European Mode of Delivery Collaboration (EMDC) conducted a randomized trial that demonstrated the efficacy of ECS (26).

3.1 Objectives

It was hypothesized that ECS might reduce MTCT risk by avoiding direct contact with maternal vaginal secretions and infected blood in the birth canal and by reducing influx of maternal blood during uterine contractions. The primary objectives of the EMDC trial were to assess the relative risks and benefits of ECS versus vaginal delivery overall and in subgroups defined according to ZDV use and viral load (26).

3.2 Study Design

HIV-infected pregnant women who were at 34–36 weeks gestation and had no indication for or contraindication to ECS were
enrolled in Italy from 1993 to 1998, and in France, the United Kingdom, Spain, Switzerland, and Sweden from 1995 to 1998. Women were randomized to ECS at 38 weeks gestation or vaginal delivery. Women assigned to the ECS group who went into labor before 38 weeks gestation were delivered by cesarean section if labor was diagnosed before the start of the second stage. Women assigned to vaginal delivery waited for spontaneous labor unless a clinical decision was made for cesarean section. The primary efficacy outcome measure was MTCT by age 18 months. The original planned sample size was about 450 women, based on an anticipated MTCT rate of 15% in the vaginal delivery group and an estimated 50% reduction associated with ECS. With the publication of the PACTG 076 results, the assumed MTCT rate in the vaginal delivery group was decreased to 8%, which increased the required sample size to 1200 women. No interim analyses were planned. However, in March 1998, when the initially planned sample size was reached, and with increasing evidence from observational studies of a protective effect of ECS that was greater than previously suggested, an interim efficacy analysis was conducted and enrollment was discontinued because of a significant difference in MTCT between the ECS and vaginal delivery groups (26).

3.3 Results

In all, 70% of women in the ECS group and 58% in the vaginal-delivery group received antiretroviral therapy during pregnancy, generally the PACTG 076 ZDV regimen. Overall, 88.3% of the women assigned to the ECS group delivered by cesarean section (4.3% of which were emergency procedures), and 11.7% of women delivered vaginally. In this study, 73.2% of the women assigned to the vaginal delivery group delivered vaginally and 26.8% delivered by cesarean section (54% of which were emergency procedures). In an intent-to-treat analysis, MTCT rates were 1.8% in the ECS group and 10.5% in the vaginal delivery group (P < 0.001). MTCT rates according to actual mode of delivery were 2.4% with ECS, 10.2% with vaginal delivery, and 8.8% with emergency cesarean section (after membrane rupture or
onset of labor). Few postpartum complications occurred and no serious adverse events occurred in either group (26).
3.4 Conclusions and Questions for Additional Study

The results of both the intent-to-treat and as-delivered analyses suggested that ECS significantly lowers MTCT without a significantly increased risk of complications in women who received no antiretroviral therapy or only ZDV during pregnancy. Interpretation was somewhat complicated, however, because more women in the ECS group received ZDV, and substantial numbers of women did not deliver according to their assigned mode. Nonetheless, after publication of the results of the EMDC trial and an individual patient data meta-analysis of 15 prospective cohort studies that also observed a protective effect of ECS (27), the American College of Obstetricians and Gynecologists (ACOG) recommended that ECS be offered to all HIV-infected pregnant women, and the ECS rate among HIV-infected women in the United States and Europe increased substantially (28). Because the EMDC trial was conducted before the advent of viral-load testing and combination antiretroviral therapy, one important question for additional study was whether ECS would be worthwhile among women with low viral loads or who receive combination antiretroviral regimens. Subsequent observational studies have suggested that MTCT risk is very low in such women, and current ACOG and USPHS guidelines recommend ECS only for women with viral load greater than 1000 copies per milliliter, for whom benefits with respect to reduction of transmission risk generally outweigh the increased risk of maternal and infant morbidity and the cost of ECS delivery (28). Another unanswered question was how soon after onset of labor or rupture of membranes the benefit of ECS is lost. These questions are unlikely to be answered by randomized clinical trials because of the large sample sizes required (28).
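As an illustration of the sample-size arithmetic described in Section 3.2, the following minimal sketch applies a standard normal-approximation formula for comparing two independent proportions. The EMDC protocol's exact assumptions (power, one- versus two-sided testing, and any allowance for loss to follow-up) are not stated above, so the function and the parameter values shown are illustrative guesses rather than a reconstruction of the trial's actual calculation.

```python
# Illustrative only: a standard normal-approximation sample-size formula for
# comparing two independent proportions. The EMDC trial's exact design
# assumptions are not reproduced here; alpha, power, and sidedness are guesses.
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.80, two_sided=True):
    """Approximate sample size per arm to detect a difference between p1 and p2."""
    z_a = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    z_b = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p1 - p2) ** 2

# Original planning assumption: 15% MTCT with vaginal delivery, 50% reduction with ECS.
print(round(2 * n_per_group(0.15, 0.075)))   # roughly 550 women in total
# Revised assumption after PACTG 076: 8% MTCT with vaginal delivery.
print(round(2 * n_per_group(0.08, 0.04)))    # roughly 1,100 women in total
```

The resulting totals (roughly 550 and 1,100 women) are of the same order as, but not identical to, the 450 and 1200 cited for the trial, which is expected given that the precise design parameters are not reported here.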
4 THE HIV NETWORK FOR PREVENTION TRIALS 012 TRIAL

The PACTG 076 ZDV regimen, ECS, and even short-course ZDV or ZDV/3TC regimens were too expensive or complex for resource-limited countries, and MTCT via breastfeeding remained a problem. The landmark, NIH-sponsored HIV Network for Prevention Trials (HIVNET) 012 trial demonstrated the efficacy of a very simple and inexpensive regimen: a single dose of nevirapine (SD-NVP) given to the mother during labor and to the infant within 72 hours after birth (29).

4.1 Objectives

It was hypothesized that giving SD-NVP to the woman during labor could protect the infant from infection during delivery and during the first 1–2 weeks of life via breastfeeding, because NVP has potent antiviral activity, is rapidly absorbed when given orally, passes quickly through the placenta, and has a long half-life in pregnant women and babies (29). The primary objectives were to determine the safety and rates of MTCT and infant HIV-free survival after exposure to SD-NVP or SC-ZDV during labor and the first week of life.

4.2 Study Design

The HIVNET 012 trial was conducted in Uganda and enrolled HIV-infected pregnant women who were at greater than 32-weeks gestation and not currently receiving antiretroviral or HIV immunotherapy from November 1997 to April 1999 (29). The trial was originally designed as a 1500-patient, double-blind, placebo-controlled trial to determine the efficacy of SD-NVP and SC-ZDV (during labor and to the infant for 1 week), but enrollment to the placebo arm was discontinued after release of the Thai short-course ZDV trial results (23), after only 49 women had been enrolled in HIVNET 012. Enrollment into the open-label SD-NVP and SC-ZDV arms was continued to provide preliminary screening data on efficacy to select one of the two regimens for inclusion in a redesigned, future Phase III efficacy trial, in which the comparator would be a standard antiretroviral regimen to be chosen based
on the anticipated results of other continuing perinatal trials. The sample size of 500 assessable mother–infant pairs was chosen to provide 80% probability to choose SD-NVP or SC-ZDV correctly if the true difference in MTCT rates between arms were 0% or 8%, respectively. The primary efficacy endpoints were MTCT and HIV-free survival at 6–8 weeks, 14–16 weeks, and 18 months of age. The study was monitored by an independent DSMB and interim efficacy analyses (with an O'Brien–Fleming stopping boundary) were to be performed approximately annually during the projected 3-year study duration. Postpartum follow-up was originally planned to be 18 months for infants and 6 weeks for mothers, but it was subsequently extended to 5 years (30).

4.3 Results

In the final efficacy analysis based on 617 assessable mother–infant pairs, the estimated risks of MTCT in the SD-NVP and SC-ZDV groups were 11.8% and 20.0% by age 6–8 weeks, respectively (P = 0.0063) (30). The cumulative MTCT rates in both groups increased by ages 14–16 weeks and 18 months because of continued breastfeeding, but the absolute and relative reductions in MTCT risk with SD-NVP (8.2% and 41%, respectively) were sustained through age 18 months. Results for HIV-free survival were similar. Both regimens were well tolerated with few serious side effects (30). Mutations in HIV that conferred resistance to NVP were detected at 6–8 weeks postpartum in 25% of mothers (31) and 46% of assessable HIV-infected infants exposed to NVP, but these mutations were no longer detected after 12–24 months postpartum (32).

4.4 Conclusions and Questions for Additional Study

The HIVNET 012 results suggested that SD-NVP was efficacious and safe. In most resource-limited countries, especially in sub-Saharan Africa, national PMTCT programs subsequently were built around the HIVNET 012 regimen (33). Extensive controversy about whether the results of HIVNET 012 were valid developed in 2002, after Boehringer Ingelheim (BI),
which is the manufacturer of NVP, decided to pursue a U.S. Food and Drug Administration (FDA) labeling change to include PMTCT using HIVNET 012 as a registrational trial (34). As a result, the safety data were reviewed far more intensively than would ordinarily occur for an NIH-sponsored trial not intended to support an FDA submission; these reviews included a pre-FDA inspection audit by an NIH contractor, who found some deficiencies in study conduct. The findings led to a comprehensive and lengthy remonitoring effort by NIH, withdrawal of BI's FDA application because of inability to meet time constraints, and ultimately a U.S. Institute of Medicine review of HIVNET 012, which concluded that no reason could be found to retract the publications or alter the conclusions of the trial (34). Two important areas for additional study suggested by the HIVNET 012 results included the efficacy of combining SD-NVP with other regimens and the implications and prevention of NVP resistance after SD-NVP. The efficacy of combining SD-NVP with other regimens depends on their duration and potency. When the mother receives antiretrovirals during pregnancy, adding SD-NVP can increase the efficacy of short-course regimens: The PHPT-2 trial in Thailand showed that adding SD-NVP to SC-ZDV in a nonbreastfeeding population could reduce the MTCT risk to 2% (35); when this combination regimen was used in a breastfeeding setting, a somewhat greater MTCT risk of 6.5% was observed (36). However, the PACTG 316 trial suggested that SD-NVP does not seem to provide any additional efficacy when added to the standard antiretroviral regimens used in developed countries (at a minimum, the full PACTG 076 ZDV regimen, often combined with at least two other antiretrovirals). The trial was stopped for futility because of low MTCT rates of 1.6% with placebo and 1.4% with SD-NVP (6). The efficacy of SD-NVP plus SC-ZDV in infants of mothers who did not receive antiretrovirals during pregnancy was assessed in two clinical trials in Malawi; SD-NVP plus SC-ZDV provided greater efficacy than SD-NVP alone when the mother did not receive any antiretrovirals during labor (37) but not when the mother received SD-NVP during labor (38).
The high prevalence of NVP resistance mutations after SD-NVP is of concern because it could compromise the effectiveness of (1) SD-NVP for prevention of MTCT in subsequent pregnancies and (2) future antiretroviral treatment for HIV disease, which, in resource-limited settings, often includes drugs in the same class [non-nucleoside reverse transcriptase inhibitor (NNRTI)] as NVP (39). Initial data from secondary analyses or follow-up studies of clinical trials suggest that SD-NVP remains effective in subsequent pregnancies and NNRTI-based treatment may still be effective after SD-NVP exposure, particularly if sufficient time has elapsed; randomized trials of the latter question in women and children are in progress (39). One approach to reducing NVP resistance after SD-NVP would be to avoid the maternal NVP dose; in the perinatal component of the 2 × 2 factorial MASHI trial in Botswana, equivalent MTCT rates with and without the maternal NVP dose were observed when mothers received SC-ZDV and infants received SD-NVP and SC-ZDV (40). Another approach that has been studied is adding additional antiretrovirals during and after delivery, under the hypothesis that NVP resistance emerges because NVP, with its long half-life, remains present in subtherapeutic concentrations for several days or weeks (41). A clinical trial in South Africa showed that adding ZDV/3TC for 3 or 7 days after delivery to cover the NVP ''tail'' can reduce the prevalence of NVP resistance mutations after SD-NVP but did not seem to have an effect on MTCT. Randomized trials are ongoing to assess whether ''tail therapy'' for longer durations or with more potent antiretrovirals could reduce the prevalence of NVP resistance even more (39).
5 THE MASHI TRIAL
Breastfeeding accounts for up to half of all MTCT in resource-limited settings, but it also provides important benefits, such as protection against other causes of infant mortality and morbidity when replacement feeding such as infant formula or animal milk is not safely available (e.g., because of a lack of clean water), culturally acceptable,
or affordable (7). As of October 2007, several clinical trials of antiretroviral, immunologic, or behavioral interventions to reduce breastfeeding MTCT are ongoing but only a few have been completed (7). The postpartum component of the NIH-funded MASHI trial (42) is described to illustrate some key issues in this active area of research.

5.1 Objectives

A previous randomized clinical trial of breastfeeding versus formula feeding in Kenya found that formula feeding could prevent an estimated 44% of MTCT without leading to excess infant mortality, and it was therefore associated with improved HIV-free survival (43). However, study participants were required to have access to clean water, which is often limited outside urban areas, and they did not receive any antiretroviral treatment or prophylaxis, which could provide protection from MTCT during breastfeeding (42). The primary objective of the postpartum component of the MASHI trial was to compare the efficacy and safety of formula feeding versus breastfeeding plus extended infant ZDV for the prevention of postpartum MTCT.

5.2 Study Design

The MASHI trial was a 2 × 2 factorial randomized trial that enrolled HIV-infected women at 33–35-weeks gestation in Botswana from March 2001 to October 2003 (42). All women received SC-ZDV during pregnancy and labor, and all infants received 1 month of ZDV. The two randomization factors were (1) SD-NVP versus placebo and (2) breastfeeding plus infant ZDV until 6 months of age versus formula feeding. Initially, mothers and infants were both randomized to SD-NVP or placebo; however, almost midway through enrollment, after release of the PHPT-2 trial results (35), the MASHI design was revised to give all infants SD-NVP and randomize only the mothers to SD-NVP or placebo (42). This modification coincided with the availability of antiretroviral therapy in Botswana for qualifying HIV patients, so the original and revised designs can be viewed as two 2 × 2 factorial trials. The primary efficacy outcome measures were MTCT by
age 7 months and HIV-free survival by age 18 months, and the primary safety outcome measure was the rate of adverse events by 7 months of age. The planned sample size was 1,200 mother–infant pairs, to provide 80% power to detect a 7% difference in MTCT by age 7 months and 90% power to detect a 10% difference in HIV-free survival by 18 months of age, based on reference rates of 17% and 79%, respectively, a two-sided type I error of 0.05, and an annual loss-to-follow-up rate of 10%. The trial was monitored by an independent DSMB and two interim efficacy analyses were planned, with an O'Brien–Fleming stopping boundary (42).

5.3 Results

The results of the SD-NVP component of the trial were described previously. Breastfeeding with infant ZDV was associated with higher MTCT (9% versus 5.6%, P = 0.04) but lower mortality (4.9% versus 9.3%, P = 0.003) by age 7 months compared with formula feeding, which resulted in comparable HIV-free survival at 18 months (84.9% versus 86.1%, P = 0.60). The most common causes of infant deaths were diarrhea and pneumonia. A statistically significant interaction occurred between the feeding strategies and the original maternal/infant SD-NVP/placebo intervention, with a greater increase in MTCT with breastfeeding plus ZDV and SD-NVP (P = 0.04), and a greater decrease in mortality with breastfeeding plus ZDV and placebo (P = 0.03); no significant interaction was observed in the revised study (42).

5.4 Conclusions and Questions for Additional Study

The results of the MASHI trial suggested that formula feeding was associated with a lower risk of MTCT but a higher risk of early mortality compared with breastfeeding plus infant ZDV; that is, formula-fed infants escaped HIV infection but then died of other causes, which led to similar rates of HIV-free survival at 18 months in the two groups (42). The MASHI study demonstrated the risks of formula feeding but did not definitively support infant ZDV to prevent breastfeeding MTCT, which highlighted the need for additional study (42).
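Both HIVNET 012 (Section 4.2) and MASHI (Section 5.2) planned group-sequential interim efficacy analyses with an O'Brien–Fleming stopping boundary. The sketch below is a generic, hedged illustration of how such a boundary can be calibrated by Monte Carlo simulation for K equally spaced looks; it is not the monitoring plan actually used in either trial, and the number of looks and the alpha level are assumptions chosen only for illustration.

```python
# Illustrative only: calibrate a classical O'Brien-Fleming boundary for K equally
# spaced looks by simulating the joint null distribution of the interim test
# statistics (a discretely observed Brownian motion). Not the actual HIVNET 012
# or MASHI monitoring plan.
import numpy as np

def obrien_fleming_bounds(K=2, alpha=0.05, n_sim=200_000, seed=1):
    """Critical z values at K equally spaced looks for a two-sided level-alpha design."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, K + 1) / K                          # information fractions t_k = k/K
    # Partial sums S_k behave like Brownian motion observed at t_k, with S_K ~ N(0, 1).
    increments = rng.normal(scale=np.sqrt(1.0 / K), size=(n_sim, K))
    S = np.cumsum(increments, axis=1)
    # O'Brien-Fleming rejects at look k when |Z_k| >= C / sqrt(t_k); because
    # Z_k = S_k / sqrt(t_k), that is the same as |S_k| >= C, so C is the
    # (1 - alpha) quantile of max_k |S_k| under the null.
    C = np.quantile(np.max(np.abs(S), axis=1), 1 - alpha)
    return C / np.sqrt(t)

print(obrien_fleming_bounds(K=2))   # approximately [2.80, 1.98]
```

The pattern, a very strict threshold at the early look and a final threshold close to the conventional 1.96, reflects why designs of this kind reserve early stopping for strong evidence while preserving most of the type I error for the final analysis.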
Prevention of breastfeeding MTCT in a safe manner remains one of the major challenges and most active areas in MTCT research. Several randomized trials are ongoing or planned to assess the efficacy and safety of other interventions to prevent postpartum MTCT without adversely affecting infant or maternal health, including antiretroviral and immune interventions to reduce MTCT risk during breastfeeding and interventions to make replacement feeding safer (7). The issues are complex, and tension often exists between what is best for the mother and what is best for the baby. For example, many pregnant and postpartum HIV-infected women are asymptomatic and have relatively high CD4+ cell counts, and therefore would not require antiretroviral therapy for their HIV disease if they were not pregnant or breastfeeding; the safety of giving such mothers combination antiretrovirals for the duration of breastfeeding is unknown (33). A preventive infant vaccine would be a very attractive approach because it could reduce the risk of MTCT while preserving the nutrition and the protection against other infectious causes of morbidity and mortality that breastfeeding provides (44), but scientists believe that a vaccine is most likely at least 10 years away (45).

REFERENCES

1. Kourtis AP, Lee FK, Abrams EJ, Jamieson DJ, Bulterys M. Mother-to-child transmission of HIV-1: timing and implications for prevention. Lancet Infect Dis. 2006; 6: 726–732.
2. Minkoff H. For whom the bell tolls. Am. J. Obstet. Gynecol. 2007; 197:S1–S2.
3. De Cock KM, Fowler MG, Mercier E, de Vincenzi I, Saba J, Hoff E, Alnwick DJ, Rogers M, Shaffer N. Prevention of mother-to-child HIV transmission in resource-poor countries: translating research into policy and practice. JAMA 2000; 283: 1175–1182.
4. Breastfeeding and HIV International Transmission Study Group, Coutsoudis A, Dabis F, Fawzi W, Gaillard P, Haverkamp G, Harris DR, Jackson JB, Leroy V, Meda N, Msellati P, Newell ML, Nsuati R, Read JS, Wiktor S. Late postnatal transmission of HIV-1 in breast-fed children: an individual patient data meta-analysis. J. Infect. Dis. 2004; 189: 2154–2166.
5. Cooper ER, Charurat M, Mofenson L, Hanson IC, Pitt J, Diaz C, Hayani K, Handelsman E, Smeriglio V, Hoff R, Blattner W; Women and Infants' Transmission Study Group. Combination antiretroviral strategies for the treatment of pregnant HIV-1-infected women and prevention of perinatal HIV-1 transmission. J. Acquir. Immune Defic. Syndr. 2002; 29: 484–494.
6. Dorenbaum A, Cunningham CK, Gelber RD, Culnane M, Mofenson L, Britto P, Rekacewicz C, Newell ML, Delfraissy JF, Cunningham-Schrader B, Mirochnick M, Sullivan JL; International PACTG 316 Team. Two-dose intrapartum/newborn nevirapine and standard antiretroviral therapy to reduce perinatal HIV transmission: a randomized trial. JAMA 2002; 288: 189–198.
7. Kourtis AP, Jamieson DJ, de Vincenzi I, Taylor A, Thigpen MC, Dao H, Farley T, Fowler MG. Prevention of human immunodeficiency virus-1 transmission to the infant through breastfeeding: new developments. Am. J. Obstet. Gynecol. 2007; 197(Suppl 3):S113–S122.
8. Public Health Service Task Force. Recommendations for use of antiretroviral drugs in pregnant HIV-1 infected women for maternal health and interventions to reduce perinatal HIV-1 transmission in the United States. October 12, 2006 update. http://AIDSinfo.nih.gov.
9. Connor EM, Sperling RS, Gelber R, Kiselev P, Scott G, O'Sullivan MJ, VanDyke R, Bey M, Shearer W, Jacobson RL, Jimenez E, O'Neill E, Bazin B, Delfraissy J-F, Culnane M, Coombs R, Elkins M, Moye J, Stratton P, Balsley J; Pediatric AIDS Clinical Trials Group Protocol 076 Study Group. Reduction of maternal-infant transmission of human immunodeficiency virus type 1 with zidovudine treatment. N. Engl. J. Med. 1994; 331: 1173–1180.
10. Sperling RS, Shapiro DE, Coombs RW, Todd JA, Herman SA, McSherry GD, O'Sullivan MJ, Van Dyke RB, Jimenez E, Rouzioux C, Flynn PM, Sullivan JL. Maternal viral load, zidovudine treatment, and the risk of transmission of human immunodeficiency virus type 1 from mother to infant. Pediatric AIDS Clinical Trials Group Protocol 076 Study Group. N. Engl. J. Med. 1996; 335: 1621–1629.
11. Shapiro DE, Sperling RS, Coombs RW. Effect of zidovudine on perinatal HIV-1 transmission and maternal viral load. Pediatric AIDS Clinical Trials Group 076 Study Group. Lancet. 1999; 354: 156; author reply 157–158.
12. Sperling RS, Shapiro DE, McSherry GD, Britto P, Cunningham BE, Culnane M, Coombs RW, Scott G, Van Dyke RB, Shearer WT, Jimenez E, Diaz C, Harrison DD, Delfraissy JF. Safety of the maternal-infant zidovudine regimen utilized in the Pediatric AIDS Clinical Trial Group 076 Study. AIDS. 1998; 12: 1805–1813.
13. McSherry GD, Shapiro DE, Coombs RW, McGrath N, Frenkel LM, Britto P, Culnane M, Sperling RS. The effects of zidovudine in the subset of infants infected with human immunodeficiency virus type-1 (Pediatric AIDS Clinical Trials Group Protocol 076). J. Pediatr. 1999; 134: 717–724.
14. Eastman PS, Shapiro DE, Coombs RW, Frenkel LM, McSherry GD, Britto P, Herman SA, Sperling RS. Maternal viral genotypic zidovudine resistance and infrequent failure of zidovudine therapy to prevent perinatal transmission of human immunodeficiency virus type 1 in Pediatric AIDS Clinical Trials Group Protocol 076. J. Infect. Dis. 1998; 177: 557–564.
15. Jamieson DJ, Clark J, Kourtis AP, Taylor AW, Lampe MA, Fowler MG, Mofenson LM. Recommendations for human immunodeficiency virus screening, prophylaxis, and treatment for pregnant women in the United States. Am. J. Obstet. Gynecol. 2007; 197(3 Suppl):S26–S32.
16. Centers for Disease Control and Prevention. Recommendations of the US Public Health Service Task Force on the use of zidovudine to reduce perinatal transmission of human immunodeficiency virus. MMWR Recomm. Rep. 1994; 43: 1–20.
17. Bardeguez AD, Shapiro DE, Mofenson LM, Coombs R, Frenkel LM, Fowler MG, Huang S, Sperling RS, Cunningham B, Gandia J, Maupin R, Zorrilla CD, Jones T, O'Sullivan MJ; Pediatric AIDS Clinical Trials Group 288 Protocol Team. Effect of cessation of zidovudine prophylaxis to reduce vertical transmission on maternal HIV disease progression and survival. J. Acquir. Immune Defic. Syndr. 2003; 32: 170–181.
18. Culnane M, Fowler M, Lee SS, McSherry G, Brady M, O'Donnell K, Mofenson L, Gortmaker SL, Shapiro DE, Scott G, Jimenez E, Moore EC, Diaz C, Flynn PM, Cunningham B, Oleske J. Lack of long-term effects of in utero exposure to zidovudine among uninfected children born to HIV-infected women. Pediatric AIDS Clinical Trials Group Protocol 219/076 Teams. JAMA. 1999; 281: 151–157.
19. Hanson IC, Antonelli TA, Sperling RS, Oleske JM, Cooper E, Culnane M, Fowler MG, Kalish LA, Lee SS, McSherry G, Mofenson L, Shapiro DE. Lack of tumors in infants with perinatal HIV-1 exposure and fetal/neonatal exposure to zidovudine. J. Acquir. Immune Defic. Syndr. Hum. Retrovirol. 1999; 20: 463–467.
20. Stiehm ER, Lambert JS, Mofenson LM, Bethel J, Whitehouse J, Nugent R, Moye J Jr, Glenn Fowler M, Mathieson BJ, Reichelderfer P, Nemo GJ, Korelitz J, Meyer WA 3rd, Sapan CV, Jimenez E, Gandia J, Scott G, O'Sullivan MJ, Kovacs A, Stek A, Shearer WT, Hammill H. Efficacy of zidovudine and human immunodeficiency virus (HIV) hyperimmune immunoglobulin for reducing perinatal HIV transmission from HIV-infected women with advanced disease: results of Pediatric AIDS Clinical Trials Group protocol 185. J. Infect. Dis. 1999; 179: 567–575.
21. Mandelbrot L, Landreau-Mascaro A, Rekacewicz C, Berrebi A, Benifla JL, Burgard M, Lachassine E, Barret B, Chaix ML, Bongain A, Ciraru-Vigneron N, Crenn-Hebert C, Delfraissy JF, Rouzioux C, Mayaux MJ, Blanche S; Agence Nationale de Recherches sur le SIDA (ANRS) 075 Study Group. Lamivudine-zidovudine combination for prevention of maternal-infant transmission of HIV-1. JAMA. 2001; 285: 2083–2093.
22. Gelber RD, Lindsey JC, MaWhinney S. Clinical Trials to Reduce the Risk of Maternal-Infant Transmission of HIV Infection. In: AIDS Clinical Trials. Finkelstein DM, Schoenfeld DA, eds. 1995. John Wiley & Sons, New York. pp. 287–302.
23. Shaffer N, Chuachoowong R, Mock PA, Bhadrakom C, Siriwasin W, Young NL, Chotpitayasunondh T, Chearskul S, Roongpisuthipong A, Chinayon P, Karon J, Mastro TD, Simonds RJ. Short-course zidovudine for perinatal HIV-1 transmission in Bangkok, Thailand: a randomised controlled trial. Bangkok Collaborative Perinatal HIV Transmission Study Group. Lancet 1999; 353: 773–780.
24. Lallemant M, Jourdain G, Le Coeur S, Kim S, Koetsawang S, Comeau AM, Phoolcharoen W, Essex M, McIntosh K, Vithayasai V. A trial of shortened zidovudine regimens to prevent mother-to-child transmission of human immunodeficiency virus type 1. Perinatal HIV Prevention Trial (Thailand) Investigators. N. Engl. J. Med. 2000; 343: 982–991.
25. Petra Study Team. Efficacy of three short-course regimens of zidovudine and lamivudine in preventing early and late transmission of HIV-1 from mother to child in Tanzania, South Africa, and Uganda (Petra study): a randomised, double-blind, placebo-controlled trial. Lancet 2002; 359: 1178–1186.
26. No authors listed. Elective caesarean-section versus vaginal delivery in prevention of vertical HIV-1 transmission: a randomised clinical trial. The European Mode of Delivery Collaboration. Lancet 1999; 353: 1035–1039.
27. No authors listed. The mode of delivery and the risk of vertical transmission of human immunodeficiency virus type 1 – a meta-analysis of 15 prospective cohort studies. The International Perinatal HIV Group. N. Engl. J. Med. 1999; 340: 977–987.
28. Jamieson DJ, Read JS, Kourtis AP, Durant TM, Lampe MA, Dominguez KL. Cesarean delivery for HIV-infected women: recommendations and controversies. Am. J. Obstet. Gynecol. 2007; 197(Suppl 3):S96–S100.
29. Guay LA, Musoke P, Fleming T, Bagenda D, Allen M, Nakabiito C, Sherman J, Bakaki P, Ducar C, Deseyve M, Emel L, Mirochnick M, Fowler MG, Mofenson L, Miotti P, Dransfield K, Bray D, Mmiro F, Jackson JB. Intrapartum and neonatal single-dose nevirapine compared with zidovudine for prevention of mother-to-child transmission of HIV-1 in Kampala, Uganda: HIVNET 012 randomised trial. Lancet. 1999; 354: 795–802.
30. Jackson JB, Musoke P, Fleming T, Guay LA, Bagenda D, Allen M, Nakabiito C, Sherman J, Bakaki P, Owor M, Ducar C, Deseyve M, Mwatha A, Emel L, Duefield C, Mirochnick M, Fowler MG, Mofenson L, Miotti P, Gigliotti M, Bray D, Mmiro F. Intrapartum and neonatal single-dose nevirapine compared with zidovudine for prevention of mother-to-child transmission of HIV-1 in Kampala, Uganda: 18-month follow-up of the HIVNET 012 randomised trial. Lancet 2003; 362: 859–868.
31. Eshleman SH, Guay LA, Mwatha A, Brown ER, Cunningham SP, Musoke P, Mmiro F, Jackson JB. Characterization of nevirapine resistance mutations in women with subtype A vs. D HIV-1 6–8 weeks after single-dose nevirapine (HIVNET 012). J. Acquir. Immune Defic. Syndr. 2004; 35: 126–130.
32. Eshleman SH, Mracna M, Guay LA, Deseyve M, Cunningham S, Mirochnick M, Musoke P, Fleming T, Glenn Fowler M, Mofenson LM, Mmiro F, Jackson JB. Selection and fading of resistance mutations in women and infants receiving nevirapine to prevent HIV-1 vertical transmission (HIVNET 012). AIDS 2001; 15: 1951–1957.
33. Dao H, Mofenson LM, Ekpini R, Gilks CF, Barnhart M, Bolu O, Shaffer N. International recommendations on antiretroviral drugs for treatment of HIV-infected women and prevention of mother-to-child HIV transmission in resource-limited settings: 2006 update. Am. J. Obstet. Gynecol. 2007; 197(Suppl 3):S42–S55.
34. Institute of Medicine of the National Academy of Sciences. Review of the HIVNET 012 Perinatal HIV Prevention Study. 2005. National Academies Press, Washington, DC. pp. 1–10. http://www.nap.edu/catalog.php?record_id=11264#toc.
35. Lallemant M, Jourdain G, Le Coeur S, Mary JY, Ngo-Giang-Huong N, Koetsawang S, Kanshana S, McIntosh K, Thaineua V; Perinatal HIV Prevention Trial (Thailand) Investigators. Single-dose perinatal nevirapine plus standard zidovudine to prevent mother-to-child transmission of HIV-1 in Thailand. N. Engl. J. Med. 2004; 351: 217–228.
36. Dabis F, Bequet L, Ekouevi DK, Viho I, Rouet F, Horo A, Sakarovitch C, Becquet R, Fassinou P, Dequae-Merchadou L, Welffens-Ekra C, Rouzioux C, Leroy V; ANRS 1201/1202 DITRAME PLUS Study Group. Field efficacy of zidovudine, lamivudine and single-dose nevirapine to prevent peripartum HIV transmission. AIDS 2005; 19: 309–318.
37. Taha TE, Kumwenda NI, Gibbons A, Broadhead RL, Fiscus S, Lema V, Liomba G, Nkhoma C, Miotti PG, Hoover DR. Short postexposure prophylaxis in newborn babies to reduce mother-to-child transmission of HIV-1: NVAZ randomised clinical trial. Lancet 2003; 362: 1171–1177.
38. Taha TE, Kumwenda NI, Hoover DR, Fiscus SA, Kafulafula G, Nkhoma C, Nour S, Chen S, Liomba G, Miotti PG, Broadhead RL. Nevirapine and zidovudine at birth to reduce perinatal transmission of HIV in an African setting: a randomized controlled trial. JAMA 2004; 292: 202–209.
39. McConnell MS, Stringer JS, Kourtis AP, Weidle PJ, Eshleman SH. Use of single-dose nevirapine for the prevention of mother-to-child transmission of HIV-1: does development of resistance matter? Am. J. Obstet. Gynecol. 2007; 197(Suppl 3):S56–S63.
40. Shapiro RL, Thior I, Gilbert PB, Lockman S, Wester C, Smeaton LM, Stevens L, Heymann SJ, Ndung'u T, Gaseitsiwe S, Novitsky V, Makhema J, Lagakos S, Essex M. Maternal single-dose nevirapine versus placebo as part of an antiretroviral strategy to prevent mother-to-child HIV transmission in Botswana. AIDS. 2006; 20: 1281–1288.
41. Cressey TR, Jourdain G, Lallemant MJ, Kunkeaw S, Jackson JB, Musoke P, Capparelli E, Mirochnick M. Persistence of nevirapine exposure during the postpartum period after intrapartum single-dose nevirapine in addition to zidovudine prophylaxis for the prevention of mother-to-child transmission of HIV-1. J. Acquir. Immune Defic. Syndr. 2005; 38: 283–288.
42. Thior I, Lockman S, Smeaton LM, Shapiro RL, Wester C, Heymann SJ, Gilbert PB, Stevens L, Peter T, Kim S, van Widenfelt E, Moffat C, Ndase P, Arimi P, Kebaabetswe P, Mazonde P, Makhema J, McIntosh K, Novitsky V, Lee TH, Marlink R, Lagakos S, Essex M; Mashi Study Team. Breastfeeding plus infant zidovudine prophylaxis for 6 months vs formula feeding plus infant zidovudine for 1 month to reduce mother-to-child HIV transmission in Botswana: a randomized trial: the Mashi Study. JAMA. 2006; 296: 794–805.
43. Nduati R, John G, Mbori-Ngacha D, Richardson B, Overbaugh J, Mwatha A, Ndinya-Achola J, Bwayo J, Onyango FE, Hughes J, Kreiss J. Effect of breastfeeding and formula feeding on transmission of HIV-1: a randomized clinical trial. JAMA 2000; 283: 1167–1174.
44. Fowler MG, Lampe MA, Jamieson DJ, Kourtis AP, Rogers MF. Reducing the risk of mother-to-child human immunodeficiency virus transmission: past successes, current progress and challenges, and future directions. Am. J. Obstet. Gynecol. 2007; 197(Suppl 3):S3–S9.
45. Stark K. Ending of HIV vaccine trial jolts industry. Philadelphia Inquirer. Oct 7, 2007. http://www.philly.com/philly/business/10296647.html.
FURTHER READING

NIH AIDSinfo website. http://aidsinfo.nih.gov.
UNAIDS website. http://www.unaids.org.
CDC HIV/AIDS website. http://www.cdc.gov/hiv.
IMPAACT website. http://impaact.s-3.com.
I-Base website. http://www.i-base.info.
CROSS-REFERENCES

AIDS Clinical Trials Group (ACTG)
Human Immunodeficiency Virus (HIV) Trials
Data and Safety Monitoring Board
Interim Analysis
Factorial Designs
MULTICENTER TRIAL

CURTIS L. MEINERT
The Johns Hopkins University Center for Clinical Trials, Bloomberg School of Public Health, Baltimore, Maryland

1 DEFINITION

The term ''multicenter trial'' is variously defined:

multicenter trial (1): 1. A trial involving two or more clinical centers, a common study protocol, and a data center, data coordinating center, or coordinating center to receive, process, and analyze study data. 2. A trial involving at least one clinical center or data collection site and one or more resource centers. 3. A trial involving two or more clinics or data collection sites. syn: collaborative trial (not recommended), cooperative trial (not recommended) ant: single-center trial Usage note: Preferred to collaborative trial or cooperative trial for reasons indicated in usage notes for those two terms. See single-center trial for comments on line of demarcation between single and multicenter trials.

multicentre study (2): A study carried out at more than one study centre.

multicentre trial (3): A clinical trial conducted according to a single protocol but at more than one site, and therefore carried out by more than one investigator.

The National Library of Medicine (medical subject headings—annotated alphabetic list 2003) defines multicenter trials as controlled studies that are planned and carried out by several cooperating institutions to assess certain variables and outcomes in specified patient populations, for example, a multicenter study of congenital anomalies in children.

A multicenter trial involves two or more administratively distinct study centers and a leadership structure involving investigators from the different centers performing functions in relation to a particular study protocol. The centers are clinics, coordinating centers, and other resource centers as needed, such as central laboratories and reading centers, for conduct of the trial. Broadly, the complement of multicenter trial is single center trial, defined as (1):

1. A trial performed at or from a single site: (a) Such a trial, even if performed in association with a coalition of clinics in which each clinic performs its own trial, but in which all trials focus on the same disease or condition (e.g., such a coalition formed to provide preliminary information on a series of different approaches to the treatment of hypertension by stress control or reduction); (b) A trial not having any clinical centers and a single resource center (e.g., the Physicians' Health Study). 2. A trial involving a single clinic; with or without satellite clinics or resource centers. 3. A trial involving a single clinic and a center to receive and process data. 4. A trial involving a single clinic and one or more resource centers. ant: multicenter trial Usage note: Note that the usual line of demarcation between single and multicenter is determined by whether or not there is more than one treatment or data collection site. Hence, a trial having multiple centers may still be classified as a single-center trial if it has only one treatment or data collection site.

Generally, in the language of trials, trials are assumed to be single center unless otherwise indicated. Typically, a multicenter trial involves two or more administratively distinct clinics as well as a coordinating center or data center and, perhaps, other resource centers as well. However, trials exist, such as the Physicians' Health Study (PHS; see Section 4 for sketch), that can be classified as single center or multicenter, depending on the characteristics used for classification. People in the PHS were recruited, enrolled, and followed by mail and telephone and, hence, no clinical centers existed. However, a coordinating center and a differentiated organizational structure did exist and, hence, from an organizational perspective, the trial can be classified as multicenter. (The trial is not indexed as multicenter by the National Library of Medicine (NLM) indexers and, therefore, is regarded as single center under their system of classification.)

2 RELATED TERMS
Various other terms, as taken from Meinert (1), of relevance to multicenter structures are: multi-protocol: (adj) Of or relating to a research activity involving two or more protocols (e.g. as in a multi-study or program project). multi-site: (adj) 1. Relating to or having more than one site. 2. multicenter multi-study: (adj) Of, relating to, or consisting of multiple studies. ant: single-study rt: multi-trial Usage note: Use multi-trial if all studies are trials. See note for study n. multi-study: (n) A study having two or more separate and distinct studies performed under the same umbrella organizational structure; multi-trial when studies are trials. Usage note: Not to be confused with a study having a series of substudies. multi-study design: (n) A design involving two or more studies; performed on the same or different study populations. ant: single-study design rt: multi-trial design, multi-study structure Usage note: See multi-study. multi-study network: (n) A network of centers organized to conduct a series of studies. multi-study structure: (n) An umbrella organizational structure created and maintained to initiate and carry out a series of related studies involving the same or different study populations. ant: single-study structure Usage note: Use multi-trial if all studies are trials. See multi-trial structure and study n for added comments. multi-trial: (adj) Of, relating to, or consisting of multiple trials. ant: single-trial rt: multi-study multi-trial: (n) A study having two or more separate and distinct trials performed under the same umbrella organizational structure.
multi-trial design: (n) A design involving two or more trials performed on the same or different people and involving different treatment protocols. ant: single-trial design Usage note: See multi-trial structure. multi-trial structure: (n) An organizational structure created or maintained to initiate and carry out a series of related trials involving the same or different study populations. ant: singletrial structure Usage note: Most useful when characterizing past or existing structures as to observed function or planned intent. Note that a structure created as a single-trial structure may ultimately be classified as a multi-trial structure if it serves the purpose of directing two or more trials, even if not originally created for that purpose. program project: (n) A collection of interrelated research activities having a common organizational structure and bearing on the same general question or issue (e.g., a collection of basic and applied research projects aimed at providing a better understanding of atherosclerosis, especially such a collection funded from a single source). sponsor: (n) 1. A person or agency that is responsible for funding a designated function or activity; sponsoring agency. 2. A person or agency that plans and carries out a specified project or activity. 3. The person or agency named in an Investigational New Drug Application or New Drug Application; usually a drug company or person at such a company, but not always (as with an INDA submitted by a representative of a research group proposing to carry out a phase III or phase IV drug trial not sponsored by a drug company). 4. A firm or business establishment marketing a product or service. study: (adj) Of or relating to one or something being evaluated or studied. Usage note: Used primarily as a general descriptor, as in study candidate, when other more precise terms are not appropriate or have undesirable connotations
(as in trial candidate). See usage note for study n for additional notes. study: (n) 1. An experimental or nonexperimental investigation or analysis of a question, process, or phenomenon. 2. Any one of a variety of activities involving the collection, analysis, or interpretation of data. 3. Clinical trial, especially in a setting where there is a desire or need to de-emphasize the experimental nature of the investigation. 4. An investigation involving both a trial and nonexperimental investigation (as in the Coronary Artery Surgery Study, comprised of a clinical trial and a followup study) (4, 5). Usage note: Widely and loosely used. Avoid in favor of more informative, less generic, terms whenever possible (e.g. use trial rather than study when appropriate). As a label, limit use to defined sets of activities involving data collection not ordinarily characterized with a more informative, design-specific term (such as trial or followup study), or where a general term is needed to characterize a collection of activities involving a number of different designs (e.g. referring to a mix of trials and observational studies). Avoid as a synonym for a specified kind of investigation having a recognized term (as in referring to an investigation as a trial in the title or abstract of a paper and study in the body of the paper). umbrella organizational structure: (n) [multicenter multistudies] An organizational structure created to support and direct a series of separate and distinct studies; multi-study structure.
3 HISTORY
The accomplishment of scientific goals by collaboration involving disparate groups and disciplines is hardly unique to trials. Increasingly, as goals expand and projects become more complex, the only viable route to accomplishment is through the ‘‘pooling’’ of effort. Evidence of this trend extends to virtually all sciences, marked most recently by human genome projects.
Multicenter trials started to come into prominence in the 1940s. An early example is a trial of patulin, a metabolic product of Penicillium patulum Bainier, as a possible preventive for the common cold. It was carried out in the fall of 1943 and spring of 1944 and involved a study population of more than 100,000 persons assembled from multiple factory and postal worksites in the United Kingdom (6). One of the first multicenter trials involving a chronic disease was of streptomycin treatment of pulmonary tuberculosis (see Section 4 for a sketch of this trial and others). One of the largest multicenter trials ever undertaken was carried out largely in 1954 and was done to test the effectiveness of the Salk vaccine in preventing polio. All told, the study involved some 1.8 million children recruited from schools across the United States, with 650,000 of them being randomized to receive injections of the vaccine or matching placebo injections and the others serving as a nonrandomized comparison group. To a major degree, emergence of the multicenter trial coincides with the transition of a society burdened with infectious diseases to one afflicted with chronic diseases and from trials involving small sample sizes and short-term treatment and follow up to trials involving large sample sizes and long-term treatment and follow up. The realization that even small benefits of treatment in reducing mortality or morbidity translate into huge reductions when projected to the general population served to underscore the need for sample sizes far beyond the reach of any one study site. By the 1960s, it was clear that the industrialized world was facing an ''epidemic'' of heart disease and of increasing rates of cancer of most types. It was clear, as well, that if treatments were to be tested for their potential in reducing or reversing the rise, it would have to be via large-scale, long-term multicenter trials. This reality gave rise to working parties in the United Kingdom and collaborative groups in the United States in the late 1950s and early 1960s to undertake multicenter treatment and prevention trials for cancer and heart disease. The ''multicenter era'' came into prominence in the late 1970s and 1980s in
the United States due, in no small measure, to the efforts of Donald Fredrickson (1924–2002; Director of NIH 1975–1981) and Tom Chalmers (1917–1995). Both men were major proponents of trials and especially the long-term multicenter trial. The mindset underlying trials has progressed, albeit slowly, from one in which trials were seen as a measure of last resort to one in which they are seen increasingly as the option of choice. No doubt exists that single center trials are administratively less complicated than multicenter trials; but it is also true that, on average, multicenter trials are more robust and more rigorously designed and conducted than their single center counterparts. The move to multicenter trials has been accompanied by an ever more active role of sponsors in initiating multicenter trials. The NIH started to assume the role of initiator in the 1970s with the advent of requests for applications (RFAs; grant funding) and requests for proposals (RFPs; contract funding) for multicenter trials. To a large degree, the traditional investigator-initiated multicenter trial, at least in the United States via NIH funding, has become ''endangered'' with the advent in the mid-1990s of the requirement that investigators need permission to submit proposals in excess of $500,000 per year (direct cost). Permission is unlikely if the proposed trial is not high on the agenda of an institute. The transition to a ''global economy'' led to the creation of the International Conference on Harmonisation (ICH) in 1990 (www.ich.org). This economy has caused drug firms to take progressively more directive approaches to initiating and conducting trials they sponsor and has caused them to want study sites located in various parts of the world. For a comprehensive compilation of landmark trials published over time, visit the James Lind Library website at: jameslindlibrary.org/trial records/published.html.
4 EXAMPLES

Streptomycin Treatment of Pulmonary Tuberculosis Trial

Period of conduct: 1947–1948
Study population: Patients with acute bilateral pulmonary tuberculosis and unsuitable for collapse therapy
Trial type: Treatment trial
Locale of conduct: United Kingdom
Sponsor/funding: UK Medical Research Council
Sample size: 109 (not including 2 people who died in the preliminary observation period)
Test treatment: Bed rest (minimum of 6 months) + streptomycin (intramuscular; 2 g/day in 4 injections at 6-hour intervals)
Control treatment: Bed rest (minimum of 6 months)
Outcome measure: Death; radiographic changes as determined by masked readers
Enrolling sites: 6
Method of assignment: Randomized blocks; stratified by clinic and gender; numbered envelopes
Assignment ratio (observed): 55(S):52(C)
Finding/conclusion: Streptomycin treatment beneficial in reducing the risk of death and in effecting improved chest radiographs
Primary publication: Streptomycin in Tuberculosis Trials Committee, Streptomycin treatment of pulmonary tuberculosis. Br. Med. J. 1948; 2: 769–782.
University Group Diabetes Program (UGDP)

Period of conduct: 1960–1975
Study population: Adults (males and females) with non-insulin-dependent, adult-onset diabetes
Trial type: Treatment/prevention trial
Locale of conduct: United States
Sponsor/funding: National Institutes of Health
Sample size: 1,027
Test treatments (4): Antidiabetic diet + 1.5 gm tolbutamide/day; diet + 100 mg phenformin/day; diet + 10, 12, 14, or 16 units of insulin lente/day depending on body surface; diet + insulin lente daily dosed to control blood sugar
Control treatment: Antidiabetic diet + placebo matched to tolbutamide or phenformin (no placebo injections)
Outcome measure: Renal, CV, eye, and peripheral vascular morbidity and all-cause mortality
Clinics: 12
Coordinating center: University of Minnesota, School of Public Health, through 1963; University of Maryland, School of Medicine, thereafter
Method of assignment: Randomized blocks; stratified by clinic and gender; centrally administered
Assignment ratio: 1:1:1:1:1
Finding/conclusion: Tolbutamide and phenformin treatments stopped due to safety concerns; insulin treatments no more effective than placebo in prolonging life or reducing morbidity
Primary publications: University Group Diabetes Program Research Group, A Study of the effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes: I. Design, methods and baseline characteristics. Diabetes 1970; 19(Suppl 2): 747–783. University Group Diabetes Program Research Group, A Study of the effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes: II. Mortality results. Diabetes 1970; 19(Suppl 2): 785–830. University Group Diabetes Program Research Group, A Study of the effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes: V. Evaluation of Phenformin therapy. Diabetes 1975; 24(Suppl 1): 65–184.

Physicians' Health Study (PHS)

Period of conduct: 1982–1995
Study population: Male physicians, aged 40 to 84, registered in the American Medical Association, free of overt evidence of heart disease
Trial type: Prevention trial
Locale of conduct: United States
Sponsor/funding: National Institutes of Health
Sample size: 22,071
Test treatments: Aspirin (325 mg/day) and beta carotene (50 mg/day)
Test treatment groups: Aspirin and beta carotene (5,517), aspirin and beta carotene placebo (5,520), aspirin placebo and beta carotene (5,519)
Control treatment: Aspirin placebo and beta carotene placebo (5,515)
Period of treatment: 46–77 months depending on when enrolled
Method of treatment administration and followup: Mail and telephone
Primary outcome measure: Fatal and nonfatal MI
Study control center: Harvard Medical School
Method of assignment: Blocked randomization
Finding/conclusion: In regard to aspirin: Regular doses of aspirin were found to significantly reduce the risk of both fatal and nonfatal MI
Primary publication: Steering Committee of the Physicians' Health Study Research Group, Final report of the aspirin component of the ongoing physicians' health study. NEJM 1989; 321: 129–135.
Pediatric AIDS Clinical Trials Group: Protocol 076 (ACTG 076)

Period of conduct: 1991–1994
Study population: Pregnant, HIV-infected women; 14 to 34 weeks gestation on entry
Trial type: Prevention trial
Locale of conduct: Primarily United States
Sponsor/funding: National Institutes of Health
Sample size: 477
Test treatment: Zidovudine (100 mg orally 5 times daily), plus intrapartum zidovudine (2 mg/kg of body weight given intravenously for 1 hour followed by 1 mg/kg/hr until delivery), plus zidovudine for the newborn (2 mg/kg orally every 6 hours for 6 weeks, beginning 8 to 12 hours after birth)
Control treatment: Matching placebo
Primary outcome measure: HIV infection of newborn
Study clinics: 59
Method of assignment: Randomization, stratified by gestation time (2 groups)
Finding/conclusion: ''In pregnant women with mildly symptomatic HIV disease and no prior treatment with antiretroviral drugs during the pregnancy, a regimen consisting of zidovudine given ante partum and intra partum to the mother and to the newborn for six weeks reduced the risk of maternal-infant transmission by approximately two thirds''
Primary publication: E. M. Connor, R. S. Sperling, R. Gelber, P. Kiselev, G. Scott, et al. for the Pediatric AIDS Clinical Trials Group Protocol 076 Study Group, Reduction of maternal-infant transmission of human immunodeficiency virus type 1 with zidovudine treatment. NEJM 1994; 331: 1173–1180.
Asymptomatic Carotid Surgery Trial (ACST)

Period of conduct: 1993–2003
Study population: Asymptomatic patients with substantial carotid narrowing
Trial type: Prevention trial
Locale of conduct: 30 different countries
Sponsor/funding: Medical Research Council of the United Kingdom
Sample size: 3,120
Test treatment: Immediate carotid endarterectomy (CEA) (1,560)
Control treatment: Deferral of CEA (1,560)
Period of followup: Variable depending on time of randomization; continuous from randomization to common closing date
Primary outcome measure: Stroke (fatal or nonfatal)
Study clinics: 126 (located in 30 different countries)
Coordinating center/data center: Clinical Trial Service Unit, Oxford, United Kingdom
Method of assignment: Minimized randomization using age, sex, history of hypertension, and several other variables
Finding/conclusion: ''In asymptomatic patients younger than 75 years of age with carotid diameter reduction about 70% or more on ultrasound, immediate CEA halved the net 5-year stroke risk from about 12% to about 6%. Half this benefit involved disabling or fatal strokes.''
Primary publication: MRC Asymptomatic Carotid Surgery Trial (ACST) Collaborative Group, Prevention of disabling and fatal strokes by successful carotid endarterectomy in patients without recent neurological symptoms: randomised controlled trial. Lancet 2004; 363: 1491–1502.
5 ORGANIZATIONAL AND OPERATIONAL FEATURES

The key feature of multicenter trials is a differentiated organizational structure involving at least two administratively distinct centers and a defined infrastructure serving to bind centers and associated personnel into a cohesive whole. The key elements in the organizational and operational structure of multicenter trials are as follows: Center Director: The scientific head of a study center. Executive Committee (EC): A committee in multicenter studies responsible for
direction of the day-to-day affairs of the study. One of the key committees in the organizational structure of a multicenter trial. Usually consists of the officers of the study and perhaps others selected from the steering committee and typically headed by the chair or vice-chair of the steering committee and reporting to that committee. Principal Investigator (PI): Broadly, the scientific head of a study; generally best avoided in multicenter trials because of potential for confusion, in favor of more operationally explicit terms such as Center Director and Study Chair. Traditionally, in single center trials, the PI is the person in charge of the trial and is usually also the one responsible for enrollment and study of patients in the trial. As a result, the term is often reserved in multicenter trials for persons heading study clinics, leaving those heading the coordinating center or other resource center with some lesser designation implying ''nonprincipal'' investigatorship. However, even if uniformly applied to all center directors, the term is still best avoided, except in settings where the term is used to refer to a single individual (e.g., the one in investigator-initiated trials who is the recognized head by virtue of initiative in the funding effort). The term should be avoided in egalitarian settings in which, in effect, multiple ''principal'' investigators exist such as those represented in sponsor or government-initiated trials. Research Group: The entire set of personnel involved in the conduct of a research project; in multicenter trials includes center directors and support staff, representatives from the sponsoring agency, and study committee members. syn: investigative team, investigative group, study group (not a recommended syn), study staff (not a recommended syn) Steering Committee (SC): A committee of an organization responsible for directing or guiding the activities of that
organization. In multicenter trials, the committee responsible for conduct of the trial and to which other study committees report. Usually headed by the study chair and consisting of persons designated or elected to represent study centers, disciplines, or activities. One of the key committees in multicenter structures. Study Center: An operational unit in the structure of a study, especially a multicenter structure, separate and distinct from other such units in the structure, responsible for performing specified functions in one or more stages of the study (e.g., a clinical center or resource center, such as coordinating center). Study Chair: Chair of the steering committee. Study Officers: The set of persons holding elected or designated offices in a study; in multicenter trials, generally the study chair and vice-chair and the heads or directors of key resource centers, such as the coordinating center, and project office. Treatment Effects Monitoring Committee: A standing committee in the structure of most trials responsible for the periodic review of accumulated data for evidence of adverse or beneficial treatment effects during the trial and for making recommendations for modification of a study treatment, including termination, when appropriate. One of the key committees in the organizational structure of a multicenter trial. Usually constituted such that voting privileges are restricted to members not directly involved in the execution of the trial and not associated with participating centers or sponsors of the trial. syn: Data Monitoring Committee (DMC), Data and Safety Monitoring Committee (DSMC) and Safety Monitoring Committee (SMC); sometimes also Ethical Committee or Ethics Committee but not recommended usage because of likely confusion with institutional review boards (IRBs) and the like.
6 COUNTS
The table below gives counts of all trials, randomized clinical trials, all multicenter clinical trials, and randomized multicenter clinical trials from 1980 forward, as indexed by the National Library of Medicine and counted via PubMed (August 2003). The indexing for multicenter trials was largely nonexistent prior to 1980 and probably spotty in the early 1980s. For example, the UGDP publications, while indexed as ''clinical trial,'' are not indexed as ''multicenter trial''. Randomized multicenter trials published since 1980 account for 9% of all randomized trials and 52% of all multicenter trials. About 40% of all randomized multicenter trials are in cancer or cardiovascular disease.

             All trials                    Multicenter trials          Rz MC by disease
Period       Total      Rz       % Rz     Total     Rz      % Rz      Cancer     CV
1980–85      27,806     5,388    19.4       501      194    38.7         25       81
1986–90      39,973    11,625    29.1     3,039    1,091    35.9        219      344
1991–95      65,534    16,777    25.6     2,901    1,473    50.8        213      428
1996–00      96,667    25,006    25.9     4,229    2,547    60.2        334      583
2001         19,254     5,790    30.1       961      602    62.6         77      130
2002         18,469     5,761    31.2       825      525    63.6         72      103
Total       267,703    70,347    26.3    12,456    6,432    51.6        940    1,669

The counts of single and multicenter trials (based on indexing by NLM) published in BMJ, Lancet, JAMA, and NEJM in 2002 and the 25th, 50th (median), and 75th percentile points of the sample size distributions for the two classes of trials are given in the table below. Not surprisingly, multicenter trials are larger than single center trials.

             No. of trials
             Single center    Multicenter
BMJ                 29             35
Lancet              62             47
JAMA                27             38
NEJM                31             54
Total              149            174

Sample size percentiles (25%, Mdn, 75%): only partially recovered in this extraction (132, 41, 84, 41, 82).
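As a small cross-check, the percentages quoted in the preceding paragraph can be recomputed from the totals in the first table; the counts below are taken directly from that table, and only the arithmetic is added here.

```python
# Recompute the summary percentages quoted in the text from the table totals.
rz_all_trials = 70_347          # randomized trials indexed 1980-2002
rz_multicenter = 6_432          # randomized multicenter trials
all_multicenter = 12_456        # all multicenter trials
rz_mc_cancer, rz_mc_cv = 940, 1_669

print(f"{rz_multicenter / rz_all_trials:.0%}")              # ~9% of randomized trials
print(f"{rz_multicenter / all_multicenter:.0%}")            # ~52% of multicenter trials
print(f"{(rz_mc_cancer + rz_mc_cv) / rz_multicenter:.0%}")  # ~41%, i.e. about 40%, in cancer or CV disease
```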
7 ACKNOWLEDGMENTS
Thanks to Susan Tonascia, Ann Ervin, and Betty Collison for help in producing this piece.

REFERENCES

1. C. L. Meinert, Clinical Trials Dictionary: Terminology and Usage Recommendations. Baltimore, MD: The Johns Hopkins Center for Clinical Trials, 1996.
2. S. Day, Dictionary for Clinical Trials. Chichester: Wiley, 1999.
3. International Conference on Harmonisation, E6 Good Clinical Practice. Washington, DC: U.S. Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research (CDER), and the Center for Biologics Evaluation and Research (CBER), April 1996.
4. CASS Principal Investigators and their associates, Coronary Artery Surgery Study (CASS): a randomized trial of coronary artery bypass surgery: comparability of entry characteristics
and survival in randomized patients and nonrandomized patients meeting randomization criteria. J. Am. Coll. Cardiol. 1984; 3: 114–128. 5. CASS Principal Investigators and their associates, National Heart, Lung, and Blood Institute Coronary Artery Surgery Study (CASS): a multicenter comparison of the effects of randomized medical and surgical treatment of mildly symptomatic patients with coronary artery disease, and a registry of consecutive patients undergoing coronary angiography. Circulation 1981; 63(monograph 79)(Part II): I-1–I-81. 6. Patulin Clinical Trials Committee, Clinical trial of patulin in the common cold. Lancet 1944; 2: 373–375.
FURTHER READING
B. G. Greenberg (chair), A Report from the Heart Special Project Committee to the National Advisory Heart Council: Organization, Reviews and Administration of Cooperative Studies, 1967. Controlled Clin. Trials 1988; 9: 137–148. C. R. Klimt, Principles of multi-center clinical studies. In: J. P. Boissel and C. R. Klimt (eds.), Multi-Center Controlled Trials. Principles and Problems. Paris: INSERM, 1979. Cancer Research Campaign Working Party, Trials and tribulations: thoughts on the organization of multicentre clinical studies. Br. Med. J. 1980; 280: 918–920. J. Y. Lee, J. E. Marks, and J. R. Simpson, Recruitment of patients to cooperative group clinical trials. Cancer Clin. Trials 1980; 3: 381–384. J. M. Lachin, J. W. Marks, J. L. Schoenfield, and the NCGS Protocol Committee and the National Cooperative Gallstone Study Group, Design and methodological considerations in the National Cooperative Gallstone Study: a multicenter clinical trial. Controlled Clin. Trials 1981; 2: 177–229. C. L. Meinert, Organization of multicenter clinical trials. Controlled Clin. Trials 1981; 1: 305–312. Coronary Drug Project Research Group, The Coronary Drug Project: methods and lessons of a multicenter clinical trial. Controlled Clin. Trials 1983; 4: 273–541. S. J. Pocock, Clinical Trials: A Practical Approach. New York: Wiley, 1983. D. G. Weiss, W. O. Williford, J. F. Collins, and S. F. Bingham, Planning multicenter clinical trials: a biostatistician’s perspective. Controlled Clin. Trials 1983; 4: 53–64.
C. L. Meinert and S. Tonascia, Clinical Trials: Design, Conduct, and Analysis. New York: Oxford University Press, 1986. C. L. Meinert, In defense of the corporate authorship for multicenter trials. Controlled Clin. Trials 1993; 14: 255–260. L. M. Friedman, C. D. Furberg, and D. L. Demets, Fundamentals of Clinical Trials, 3rd ed. Multicenter Trials. New York: Springer, 1998, pp. 345–356. B. O’Brien, L. Nichaman, J. E. H. Brouhe, D. L. Levin, P. C. Prorok, and J. K. Gohagan for the PLCO Project Team, Coordination and management of a large multicenter screening trial: the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. Controlled Clin. Trials 2000; 21: 310S–328S. Asthma Clinical Research Network, The Asthma Clinical Research Network. Controlled Clin. Trials 2001; 22: 117S–251S.

Table. Counts of trials indexed by the National Library of Medicine, 1980–2002 (PubMed, August 2003)

                 All trials                  Multicenter trials         Rz MC by disease
Period        Total       Rz    % Rz       Total      Rz    % Rz       Cancer      CV
1980–85      27,806    5,388    19.4         501     194    38.7           25      81
1986–90      39,973   11,625    29.1       3,039   1,091    35.9          219     344
1991–95      65,534   16,777    25.6       2,901   1,473    50.8          213     428
1996–00      96,667   25,006    25.9       4,229   2,547    60.2          334     583
2001         19,254    5,790    30.1         961     602    62.6           77     130
2002         18,469    5,761    31.2         825     525    63.6           72     103
Total       267,703   70,347    26.3      12,456   6,432    51.6          940   1,669

Table. Single center and multicenter trials published in BMJ, Lancet, JAMA, and NEJM in 2002 (number of trials and sample size percentiles)

              No. of trials       Sample size, single center      Sample size, multicenter
Journal     Single    Multi         25%      Mdn      75%           25%      Mdn      75%
BMJ             29       35         132      272    1,090           211      540    1,859
Lancet          62       47          41      204    1,004           201      920    2,309
JAMA            27       38          84      230    1,382           481    1,182    2,475
NEJM            31       54          41      123      287           239      446    1,232
All            149      174          82      219      775           281      690    2,037
MULTINATIONAL (GLOBAL) TRIAL

STEVEN M. SNAPINN
Amgen, Inc.
Thousand Oaks, California

Multicenter trials have long been common in clinical research (see also the article on multicenter trials). The main reasons for conducting a trial at multiple centers are the ability to recruit a large number of patients in a relatively short period of time, and the ability to generalize the results more broadly. Multinational trials are those in which the centers are in multiple countries; Suwelack and Weihrauch (1) define them as ‘‘prospective clinical trials with investigational centers from more than one country with a common protocol foreseeing the combination of clinical data from all centers for a joint evaluation.’’ The reasons for conducting a multinational trial are similar to those for a multicenter trial; however, there are also some special considerations.

1 PRACTICAL ISSUES

The need to recruit a large number of patients is often the main driving force for multinational trials, but there may be particular reasons to recruit patients from specific parts of the world. For example, the disease might be particularly prevalent in specific countries, or in some cases it might be necessary to recruit diseased patients who are treatment-naïve, necessitating selection of countries where access to the treatment is currently limited. It can also be advantageous to demonstrate the safety and efficacy of a therapy under a variety of conditions to gain confidence that the conclusions of the trial are generalizable. For this reason, it might be desirable to conduct the trial in countries with different lifestyles, diets, or cultures. However, it should be recognized that these differences, as well as differences in patient characteristics, may lead to increased variability, which may decrease the power of the trial or complicate the ability to reach a clear conclusion. For this reason, when designing a multinational trial it is important to carefully consider such factors as the number of countries and ratios of patients per center and centers per country. Several investigators have discussed some of the practical difficulties associated with multinational trials (1–3). One important difficulty is that medical practice can vary among countries. There might be different diagnostic criteria or differences in medical terminology that lead to a different understanding of what constitutes the disease in question or how a study endpoint is defined. For example, if the study endpoint is hospitalization for the disease in question, there may be substantial differences among countries in the criteria for a hospital admission. In addition, the standard of care for that hospital admission may vary considerably among countries, and the variations in the standard of care may interact with the study treatment, leading to the potential for country-by-treatment interaction. Another aspect of medical practice relates to the choice of the control treatment. When the standard of care for the disease varies from country to country, a decision must be made between choosing a common control treatment across the protocol and choosing the control treatment deemed most relevant within each country. The logistical aspects of providing study supplies for a multinational trial might also be somewhat difficult. Import/export policies for experimental treatments vary from country to country, and they may be time consuming to navigate. Even when the treatments are locally available, the formulations might differ, leading to questions regarding equivalent bioavailability. Another logistical difficulty has to do with the analysis of laboratory samples. Differences in laboratory assays can make a pooled analysis of local laboratory results difficult to interpret. The use of a central laboratory solves this problem, but it may be difficult to ship biologic samples across national borders. The International Conference on Harmonization (ICH) was formed to develop common regulatory standards for clinical research across Europe, North America, and Japan. Despite considerable progress, there
are still several legal and regulatory aspects to consider. For example, different countries may have different requirements for informed consent or for reporting serious adverse events to health authorities. When submitting the study protocol for institutional review board review, different countries may assess potential risks differently and so may require conflicting changes. Cultural differences may also cause some difficulties. When the Case Report Form (CRF) needs to be translated into the local languages, there is always the potential for misunderstandings. This is particularly true for the collection of information on patient-reported outcomes, including information on quality of life, where nuances in the wording of questions can have a major effect. Although much information on the CRF can be collected through checkboxes, there is often the need to collect some information as free text, which naturally leads to additional potential for misunderstanding.

2 COUNTRY-BY-TREATMENT INTERACTION

Multinational trials may provide broadly generalizable results, but this requires that the results be consistent across patient subsets. However, substantial country-by-treatment interaction has been noted in several clinical trials, leading to difficulties in interpretation. One such study was the Metoprolol CR/XL Randomized Intervention Trial in Congestive Heart Failure (MERIT-HF), described by Wedel et al. (4), which compared the β-blocker metoprolol with placebo with respect to total mortality in 3991 patients with heart failure. The trial was stopped early by the steering committee due to a highly significant reduction in total mortality (hazard ratio = 0.66; P = 0.00009). After completion of the trial, the investigators carried out a number of subgroup analyses. Although the overall result was consistent among most subgroups, there was one notable exception: Among the 1071 patients in the United States, the hazard ratio was 1.05. After carefully examining the potential causes of the interaction, the investigators cautioned against overinterpretation of subgroup results and concluded that the best estimate of the treatment
effect for any subgroup, including country subgroups, should be the overall observed effect. O’Shea and DeMets (5) and O’Shea and Califf (6) reviewed other cardiovascular trials that have found country-by-treatment interactions. For example, the Beta-blocker Heart Attack Trial (BHAT) compared propranolol with placebo in 3837 patients who had survived a myocardial infarction; the overall result was highly positive (the mortality rates were 9.8% with placebo and 7.2% with propranolol), but there was wide variation in the size and direction of the effect among study centers. The Flolan International Randomized Survival Trial (FIRST) compared epoprostenol with placebo in patients with heart failure; although mortality was greatly reduced by epoprostenol among patients from Europe, there was no apparent benefit in patients from North America. The Platelet Glycoprotein IIb/IIIa in Unstable Angina: Receptor Suppression Using Integrilin Therapy (PURSUIT) trial compared the platelet inhibitor eptifibatide with placebo in nearly 11,000 patients with an acute coronary syndrome; there was an absolute reduction of 3.3% in the primary endpoint among 3827 patients in North America, but the reductions in other regions ranged only from 1.0% to −1.4%. The Global Utilization of Streptokinase and t-PA for Occluded Coronary Arteries I (GUSTO-I) trial studied the effects of four thrombolytic strategies in 41 thousand patients and found a statistically significant country-by-treatment interaction when comparing the U.S. and non-U.S. patients. Finally, the Global Use of Strategies To Open Occluded Coronary Arteries IV (GUSTO-IV) trial studied the use of abciximab in 7800 patients with an acute coronary syndrome; though there was no overall effect of abciximab, the drug appeared to be beneficial among patients in North America. Although it is not clear if these interactions are due to chance, O’Shea and Califf (7) discuss some of the differences among countries that might lead to interactions like these. Examining the results of over a dozen cardiovascular trials, they found significant differences in patient characteristics. One notable difference was that patients enrolled in the United States were heavier and taller
than other patients, and were more likely to have diabetes. They also found important differences in the use of concurrent therapies. For example, compared with patients from other countries, patients in the United States were consistently more likely to be taking a β-blocker. In PURSUIT, women from the United States were considerably more likely to be taking hormone replacement therapy. In addition, patients from the United States in the acute coronary syndrome trials presented to the hospital earlier than patients in other countries.
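As a hedged, illustrative sketch (not taken from any of the trials discussed above), the following applies a standard inverse-variance test of homogeneity (Cochran's Q) to region-specific treatment effect estimates, which is one simple way to screen for country-by-treatment or region-by-treatment interaction. The region labels, log odds ratios, and standard errors are hypothetical; in practice the estimates might instead be log hazard ratios from region-specific models.

import numpy as np
from scipy.stats import chi2

# Hypothetical region-specific treatment effects (log odds ratios) and standard errors
regions = ["Europe", "North America", "Latin America", "Asia"]
log_or = np.array([-0.35, 0.05, -0.30, -0.40])
se = np.array([0.15, 0.18, 0.20, 0.25])

w = 1.0 / se**2                               # inverse-variance weights
pooled = np.sum(w * log_or) / np.sum(w)       # fixed-effect pooled log odds ratio
Q = np.sum(w * (log_or - pooled) ** 2)        # Cochran's Q statistic for heterogeneity
df = len(log_or) - 1
p_het = chi2.sf(Q, df)                        # small p suggests region-by-treatment interaction

print(f"pooled log OR = {pooled:.3f}, Q = {Q:.2f} on {df} df, p = {p_het:.3f}")

A nonsignificant Q does not establish consistency, of course; with few regions the test has limited power, which is one reason the overall estimate is often reported for all subgroups unless the interaction is clearly significant.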
3 COST EFFECTIVENESS EVALUATION
Some clinical trials include a health economic or cost-effectiveness evaluation (see also the article on cost-effectiveness analysis). Although the situation is conceptually similar to that of the evaluation of clinical effects, the generalizability of cost-effectiveness results is more controversial (8–10). Differences in medical practice and patient characteristics have the potential to cause country-by-treatment interaction with regard to clinical effects, but the biologic effect of the treatment is expected to be consistent among countries. With respect to economic data, on the other hand, it is often assumed that systematic differences among countries will prevent pooling the results. This is due to perceived differences in practice patterns, payment systems, and relative prices of resources. Despite the common assumption that economic results will vary among countries, it is reasonable to include an assessment of the degree of similarity. Cook et al. (9) discussed the issues involved in combining economic data from multinational trials, and pointed out that one can use the same tools to assess interaction for clinical and economic endpoints. They concluded that, in the absence of interaction, the pooled estimate of cost effectiveness should be considered representative of the participating countries. Reed et al. (10) summarized a workshop held on this topic and discussed the terminology for multinational economic evaluations; this terminology depends on whether pooled
or country-specific estimates of clinical efficacy and resource utilization are used. Also concerned by between-country heterogeneity in costs, Pinto et al. (11) proposed univariate and multivariate shrinkage estimators for costs and effects from multinational trials.
REFERENCES
1. D. Suwelack and T. R. Weihrauch, Practical issues in design and management of multinational trials. Drug Inf J. 1992; 26: 371–378. 2. H. Maier-Lenz, Implementing multicenter, multinational clinical trials. Drug Inf J. 1993; 27: 1077–1081. 3. H. T. Ho and S. C. Chow, Design and analysis of multinational clinical trials. Drug Inf J. 1998; 32: 1309S–1316S. 4. H. Wedel, D. DeMets, P. Deedwania, B. Fagerberg, S. Goldstein, et al., for the MERIT-HF Study Group. Challenges of subgroup analyses in multinational clinical trials: experiences from the MERIT-HF trial. Am Heart J. 2001; 142: 502–511. 5. J. C. O’Shea and D. L. DeMets, Statistical issues relating to international differences in clinical trials. Am Heart J. 2001; 142: 21–28. 6. J. C. O’Shea and R. M. Califf, International differences in treatment effects in cardiovascular clinical trials. Am Heart J. 2001; 141: 875–880. 7. J. C. O’Shea and R. M. Califf, International differences in cardiovascular clinical trials. Am Heart J. 2001; 141: 866–874. 8. S. D. Sullivan, B. Liljas, M. Buxton, C. J. Lamm, P. O’Byrne, et al., Design and analytic considerations in determining the cost-effectiveness of early intervention in asthma from a multinational clinical trial. Control Clin Trials. 2001; 22: 420–437. 9. J. R. Cook, M. Drummond, H. Glick, and J. F. Heyse, Assessing the appropriateness of combining economic data from multinational clinical trials. Stat Med. 2003; 22: 1955–1976. 10. S. D. Reed, K. J. Anstrom, A. Bakhai, A. H. Briggs, R. M. Califf, et al., Conducting economic evaluations alongside multinational clinical trials: toward a research consensus. Am Heart J. 2005; 149: 434–443. 11. E. M. Pinto, A. R. Willan, and B. J. O’Brien, Cost-effectiveness analysis for multinational clinical trials. Stat Med. 2005; 24: 1965–1982.
CROSS-REFERENCES
Case report form
Cost-effectiveness analysis
Foreign clinical data
International Conference on Harmonization (ICH)
Multicenter trial
MULTIPLE COMPARISONS

MICHAEL A. PROSCHAN
National Institute of Allergy and Infectious Diseases
Bethesda, Maryland

Multiple comparisons arise in several ways in clinical trials; common examples include multiple arms, multiple endpoints, subgroup analyses, and monitoring over multiple time points. At first, it seems that trying to answer several questions in the same trial is very efficient, especially if another large trial to answer the same questions is unlikely. The problem is that if the comparisonwise error rate (also known as the per-comparison error rate)—the expected proportion of false positives—is 0.05, then the chance of at least one false positive, which is known as the familywise error rate (FWE) or the experimentwise error rate, can be substantially greater than 0.05. In fact, with enough comparisons, the FWE can be close to 1. To control the FWE at level 0.05, one must require stronger evidence for each individual comparison; this is called a multiple comparison adjustment.

1 STRONG AND WEAK CONTROL OF THE FWE

Two different levels of control of the FWE exist. The first level, which is called weak control, means that under the global null hypothesis that each separate null hypothesis is true, the probability of at least one type 1 error is no greater than α. But the global null hypothesis is a very strong assumption; it is more likely that some null hypotheses are true and others are not. We would like the probability of rejecting at least one true null hypothesis to be α or less, where that probability is computed without necessarily assuming that the other null hypotheses are true. If this holds, then the FWE is said to be controlled strongly. To understand why strong control is indeed stronger than weak control, consider a trial with k arms in which we are interested in all pairwise comparisons of means µ1, . . ., µk. Fisher’s least significant difference (LSD) procedure (1) declares the pair µi and µj different if the t-statistic that compares µi and µj is significant at level 0.05 and the F-statistic that compares all means is also significant at level 0.05. Fisher’s LSD protects the FWE under the global null hypothesis µ1 = . . . = µk because, to declare at least one pair different, we must reach a P-value of 0.05 or less for the F-statistic, and that has probability 0.05 under the global null hypothesis. On the other hand, suppose that µ1 = . . . = µk−1 = µ, but µk is so much larger than µ that the F-statistic is virtually guaranteed to reach statistical significance. In that case, Fisher’s LSD is tantamount to pairwise comparisons of means at level 0.05. But the FWE for 0.05-level pairwise comparisons of the equal means µ1, . . ., µk−1 is clearly larger than 0.05. Therefore, Fisher’s LSD provides weak control but not strong control of the FWE.

2 CRITERIA FOR DECIDING WHETHER ADJUSTMENT IS NECESSARY

Whether and how much multiplicity adjustment is needed is controversial. Some have argued that such adjustments are never needed (2), whereas others have used several case studies to argue that without adjusting for all comparisons made in a clinical trial, one cannot be confident of the results (3). The view of many in clinical trials is somewhere between these two positions, namely that adjustment is needed in certain settings but not in others. It is important to note that multiplicity adjustment is needed only when a trial is considered successful if at least one comparison is statistically significant, not when all comparisons must be statistically significant to declare a success. For example, regulatory agencies such as the Food and Drug Administration (FDA) require proof that a combination drug is superior to each of its constituents. If two constituents exist, A and B, then the null hypotheses are H0A: the combination is no better than constituent A, and H0B: the combination is no better than constituent B. To declare success, we must reject
both H0A and H0B ; it is not sufficient to reject at least one of H0A and H0B . If each comparison is made at level 0.05, then the probability of a successful trial—the probability of the intersection of the separate rejection events—will be no greater than 0.05. Therefore, no multiple comparison adjustment is needed in this situation (4). A similar setting is when a drug must show benefit on several different characteristics to be declared beneficial. A helpful way to distinguish whether adjustment is needed is to ask whether a successful trial requires rejection of hypothesis 1 AND hypothesis 2 . . . .AND hypothesis k. If so, then no adjustment is needed. If AND is replaced by OR, then adjustment may be needed. In the sequel, we restrict attention to the latter setting. Proschan and Waclawiw (5), who observed that some settings exist in which nearly everyone agrees that a multiple comparison adjustment is needed, tried to determine what those settings had in common. One factor is the sheer number of comparisons. Situations in which the number of comparisons is often larger than two include interim monitoring and subgroup analyses. Another factor for deciding whether a multiple comparison adjustment is needed is whether one stands to benefit from the multiplicity. For instance, even if investigators have a good reason for changing the primary endpoint in the middle of a trial, their decision will be met with skepticism, especially if no treatment effect existed for the original endpoint but an effect existed for the new primary endpoint. Situations in which one stands to benefit from unadjusted multiple comparisons include monitoring, subgroup analyses, multiple primary endpoints, and comparison of multiple doses of a drug to a placebo. By contrast, no single entity stands to gain from multiple unadjusted comparisons of drugs made by different manufacturers. The most important and difficult factor is the relatedness of the questions being asked and their answers (6, p. 31–35). Clinical trialists tend to feel more compelled to adjust for a family of related (e.g., heart disease and stroke) than unrelated (e.g., cancer and heart disease) hypotheses. Situations in which the questions are related include interim monitoring (in which case the same question is
asked at different times), subgroup analyses, and comparisons of different doses of the same drug to a control. One could debate the relatedness of comparisons of drugs made by different manufacturers to the same control group. One might consider that whether drug A is better than control is a completely separate question than whether drug B is better than control. In fact, the two questions could have been answered in separate trials, in which case no multiple comparison adjustment would have been made (2, 7). But the test statistics in separate trials are statistically independent, so learning that a type 1 error was made in one trial does not erode our confidence in conclusions of other trials. Learning that a type 1 error was made in a trial that compares multiple treatments to the same control does erode our confidence in the results of comparisons of the other treatments with control because it calls into question whether the common control arm was aberrant. Thus, having the same control de facto makes the answers to the questions related (8). In other cases, the questions and answers are more clearly unrelated. Sometimes, two completely different interventions that could have been tested in separate trials are, for efficiency reasons, tested in the same trial using a factorial design. The Women’s Angiographic Vitamin and Estrogen trial (9) in post-menopausal women compared hormone replacement with placebo and vitamins with placebo with respect to diameters of coronary arteries measured using an angiogram. It was considered unlikely that an interactive effect of the two interventions would occur, so a factorial design was used. Unlike comparisons of several arms to the same control, comparisons of main effects in a factorial design with no interaction are asymptotically statistically independent. Therefore, multiple comparison adjustments are often not used for main effect comparisons in a factorial trial (8). Table 1 shows the multiplicity criteria (columns) applied to common clinical trial settings (rows). Darker rows indicate settings that suggest a greater need for adjustment. Monitoring and subgroups are the darkest, and this is consistent with the general feeling that these situations call for some sort of
Table 1. Guidelines applied to four common sources of multiplicity in clinical trials. Columns: Large number of comparisons? Single entity benefits? Related family? Rows: Multiple arms (doses vs. same control; different drugs, same control; different drugs, factorial); Primary endpoints (related; unrelated); Subgroups; Monitoring. In the original table, a black cell means the criterion usually applies, a white cell means it usually does not apply, and a gray cell means it may or may not apply, depending on the circumstances; the reader should fill in the gray squares with white or black ones according to the circumstances of the specific application.
adjustment (specific adjustment techniques are discussed later). Multiple primary endpoints and multiple doses of a drug compared with the same control are also dark rows, suggesting that separate comparisons at level 0.05 are probably not appropriate.

3 IMPLICIT MULTIPLICITY: TWO-TAILED TESTING

Before moving to specific adjustment methods in different clinical trial settings, we apply the relatedness criterion to an issue we do not usually associate, although we should, with multiple comparisons—two-tailed testing. Consider, for example, a trial that compares a drug with placebo with respect to mortality. Although the alternative hypothesis for a two-tailed test is that the mortality rate differs between treatment and placebo, we always want to know the direction of the effect. Therefore, we are really trying to answer two separate questions with respect to mortality: (1) Is treatment better than placebo? (2) Is treatment worse than placebo? The two conclusions and their implications could not be more dissimilar, so why do researchers routinely lump them by controlling the probability of any difference? It makes more sense to specify error rates for the two types of errors separately; we may or may not want to use α = 0.025 for each. Consider the Cardiac Arrhythmia Suppression Trial (CAST), which tested the
hypothesis that suppression of cardiac arrhythmias in arrhythmic patients with a previous heart attack would reduce the rate of sudden arrhythmic deaths and cardiac arrests. When it was discovered that two drugs, encainide and flecainide, actually increased the rate of sudden death and cardiac arrest compared with placebo (10), investigators eliminated these two drugs and continued with the third drug, moricizine. The continuation trial was called CAST II (11). Investigators used level 0.025 for declaring superiority and level 0.05 for declaring harm in CAST II, even though the combined type 1 error rate, 0.025 + 0.05 = 0.075, exceeds the conventional level of significance in clinical trials. Because of the asymmetry of harm and benefit coupled with the experience of CAST I, it made sense to exceed the conventional two-tailed error rate of 0.05.

4 SPECIFIC MULTIPLE COMPARISON PROCEDURES

We briefly review some common multiplicity adjustment techniques in clinical trials. Readers should consult Miller (6) or Hochberg and Tamhane (12) for a full treatment of multiple comparison methods.

4.1 Multiple Arms

Certain multiarmed trials exist in which a multiple comparison adjustment is typically not made. One example is when an arm
is added for the purpose of ‘‘validating’’ a trial. For example, the purpose of the Glucosamine/Chondroitin Arthritis Intervention Trial (13) was to determine whether glucosamine and/or chondroitin were superior to placebo for the relief of pain of osteoarthritis of the knee. It was already known that celecoxib relieves pain, so a celecoxib arm was included to validate the trial; if the celecoxib and placebo arms did not differ, then some would question the validity of the trial. Because celecoxib is being included for a very different purpose, no reason exists to lump the comparison of celecoxib to placebo with the comparisons of glucosamine and/or chondroitin to placebo. In the section, ‘‘Criteria for deciding whether adjustment is necessary,’’ we cited another case in which adjustment for multiple comparisons is not typically used—a factorial trial in which an interaction is considered very unlikely, such as when the mechanisms of action and/or outcomes of interest are different. Such a trial is often thought of as two separate trials combined for efficiency reasons only. In many trials, more interest is focused on certain comparisons than on others, as pointed out by Cook and Farewell (14) and Hughes (15). For instance, interest often centers on comparisons with a common control arm. The mere fact that comparisons are made with the same control group was cited in the section on ‘‘Criteria’’ as a reason to consider a multiplicity adjustment, especially if more than two comparisons are made. Dunnett (16) developed a classic method for doing so. Let Ti be the t-statistic that compares arm i to the control, but replace the pooled variance of the two arms by the pooled variance across all k arms and use (k + 1)(n − 1) instead of 2(n − 1) degrees of freedom. Let Pi be the corresponding P-value, i = 1,2, . . . ,k. Treatment i is declared different from the control if Pi ≤ ck , where ck is selected so that P(P1 ≤ ck or P2 ≤ ck or . . . or Pk ≤ ck ) = α. It is easy to see that Dunnett’s method protects the FWE not only under the global null hypothesis, but also when only some means differ from control. Thus, Dunnett’s method provides strong control of the type 1 error rate for comparisons with control.
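As a rough numerical sketch of the critical value ck described above (not Dunnett's exact multivariate t computation), the following simulation uses a normal approximation in which each of k active arms is compared with a shared control using equally sized groups, so that the k test statistics are equicorrelated. The number of arms, the level α, and the simulation size are arbitrary choices for illustration.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
k, alpha, n_sim = 3, 0.05, 200_000

# Under the global null with equal group sizes, Z_i = (Xbar_i - Xbar_0)/sqrt(2)
# for i = 1, ..., k are standard normal with pairwise correlation 1/2.
arms = rng.standard_normal((n_sim, k + 1))          # column 0 plays the role of the control arm
z = (arms[:, 1:] - arms[:, [0]]) / np.sqrt(2.0)

p = 2 * norm.sf(np.abs(z))                          # two-sided p-values, one per comparison
min_p = p.min(axis=1)

# Choose c_k so that P(P_1 <= c_k or ... or P_k <= c_k) = alpha under the global null
c_k = np.quantile(min_p, alpha)
print(f"simulated threshold c_{k} = {c_k:.4f} (Bonferroni would use {alpha / k:.4f})")

Because the comparisons share a control arm, the simulated threshold is somewhat larger than the Bonferroni value α/k, reflecting the positive correlation that Dunnett's method exploits.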
A more powerful modification of Dunnett’s method that still strongly controls the type 1 error rate is the following step down method (see Example 4.2 of Reference 12). Order the P-values so that P(1) < P(2) < . . . < P(k) . Step 1 tests whether any of the k treatments differ from control by comparing P(1) with ck . If P(1) > ck , then stop and declare no treatment superior to control, whereas if P(1) ≤ ck , then declare the associated treatment different from control and proceed to step 2. If we proceed to step 2, then we either already made a type 1 error or we did not. If we did not, then at most k − 1 type 1 errors are possible, so we ‘‘step down’’ and compare the remaining k − 1 treatments to control. That is, we compare P(2) with ck−1 , where ck−1 is the critical value for P-values that compare k − 1 active arms to control. If we find, after stepping down, that P(2) > ck−1 , then we stop and declare no remaining treatments different from control. On the other hand, if P(2) ≤ ck−1 , then we declare that treatment different from control, and then ‘‘step down’’ again by comparing P(3) to ck−2 , and so on. The step down version of Dunnett’s method clearly has more power than the original, but it has a drawback. Suppose that the k different treatments are made by different companies. If one company’s drug is extremely good, then another company’s drug will not have to meet the same burden of proof to be declared superior to the control. This failure to use a ‘‘level playing field’’ for all comparisons may be objectionable. The section entitled, ‘‘Crtieria for deciding whether adjustment is necessary’’ mentioned another setting that involves comparisons with the same control—that of different doses of the same drug. Table 1 suggests that it is inappropriate to compare each dose to control at level α as the primary analysis. But, often the first question is whether the drug worked, and the dose-response relationship is examined only if the answer is yes. One useful method is a hierarchical procedure whereby one first establishes that the comparison of the active arms combined to the control is significant at level α, then compares each dose to control at level α.
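The following is a minimal sketch of the step-down scheme just described, assuming the per-stage thresholds c1, . . ., ck (for example, taken from published tables or from a simulation such as the one above) are supplied by the user; the function name and the numerical values in the example call are illustrative only.

def step_down_dunnett(p_values, critical_values):
    """Step-down testing of k active arms against a shared control.

    p_values: unadjusted P-values, one per active arm.
    critical_values: [c_1, ..., c_k], where c_m is the threshold used
        when m active arms remain to be compared with control.
    Returns the set of arm indices declared different from control.
    """
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])   # smallest P-value first
    rejected = set()
    for step, idx in enumerate(order):
        c = critical_values[k - 1 - step]   # compare to c_k first, then c_{k-1}, ...
        if p_values[idx] <= c:
            rejected.add(idx)               # declare this arm different and step down
        else:
            break                           # stop at the first non-rejection
    return rejected

# Illustrative use with hypothetical thresholds c_1, c_2, c_3
print(step_down_dunnett([0.004, 0.030, 0.200], [0.050, 0.027, 0.019]))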
4.2 Multiple Endpoints

The multiple endpoint problem is complex because the endpoints might all be primary or a combination of primary and secondary. First consider multiple primary endpoints. As mentioned in the section on Criteria and as suggested by Table 1, if the endpoints are thought to be unrelated, then one could defend not using any adjustment, although most statisticians would feel uneasy if the study had more than two primary endpoints. An attractive method to adjust within a group of similar endpoints is to use a per-patient summary that combines the different endpoints. An example is O’Brien’s rank test (17), which first ranks patients on each of the k outcomes, and then averages ranks across outcomes for a given patient. These per-patient summary measures are then analyzed using a t-test or permutation test. Cook and Farewell (14) note that this method works well when many outcome variables reasonably measure the same phenomenon that treatment is expected to benefit. Follmann (18) showed that if relatively few endpoints exist and treatment has a strong effect on one outcome but not others, then the Bonferroni method, which uses significance level α/k for each endpoint, is attractive. It controls the FWE strongly and is very simple to use, but it is conservative. Bonferroni’s method works well when the number of endpoints is not too large and the test statistics are not highly correlated. For example, Table 2 shows the FWE for two, five, or ten z-statistics with correlation ρ when no adjustment is used (upper panel) and when the Bonferroni adjustment is used (lower panel). The type 1 error rate is substantially inflated with no multiple comparison adjustment, even if ρ is high. The FWE using the Bonferroni adjustment is only slightly less than 0.05 for two uncorrelated z-statistics, which indicates very slight conservatism. As the correlation becomes close to 1, the FWE is about half of 0.05 because the two test statistics are essentially one as the correlation approaches 1. Nonetheless, even for a fairly high correlation of 0.7, the degree of conservatism is not excessive. The conservatism is much more substantial with a greater number of comparisons. For example, with 10
z-statistics whose pairwise correlations are all 0.7, the actual type 1 error rate is 0.029 instead of 0.05. More powerful variants of the Bonferroni procedure maintain strong control of the FWE. One variant (19) is a step-down procedure similar to the step-down version of Dunnett’s method described above. If the smallest of the k P-values is less than or equal to α/k, then we declare that endpoint significant and proceed to the second step, whereas if the smallest P-value exceeds α/k, then we stop. If we proceed to the second step, then we compare the second smallest P-value to α/(k − 1). If the second smallest P-value exceeds α/(k − 1), then we stop; if it is less than or equal to α/(k − 1), then we declare that endpoint significant and compare the third smallest P-value to α/(k − 2), and so on. Any time we fail to reach statistical significance, we stop testing, so any larger P-values cannot be declared significant. This procedure, which still strongly controls the type 1 error rate, is more powerful than the classical Bonferroni method, which requires each P-value to be α/k or less to be declared significant. Another Bonferroni variant is Hochberg’s (20) modification of Simes’ (21) method, though as Dmitrienko and Hsu (22) point out, Hochberg’s procedure does not control the FWE in all settings. The Hochberg modification first compares the largest P-value to α; if the largest P-value is α or less, then all endpoints are declared significant. If the largest P-value is larger than α, then the second largest P-value is compared with α/2. If the second largest P-value is α/2 or less, then it and smaller P-values are declared significant. If the second largest P-value exceeds α/2, then the third largest P-value is compared with α/3, and so on. Finally, if all other P-values are larger than their thresholds, then the smallest P-value is declared significant if its P-value is α/k or less. For example, with two endpoints, if both attain P-values of 0.05 or less, then they are both declared significant; if one P-value exceeds 0.05, then the other is still declared significant if its P-value is 0.025 or less. The discussion thus far has not dealt with secondary endpoints. It is difficult to formulate a single strategy that covers all cases
Table 2. FWE of unadjusted (upper panel) and Bonferroni-adjusted (lower panel) two-tailed tests for two, five, or ten comparisons using Z-statistics with the same pairwise correlation, ρ

FWE with no adjustment
                                         ρ
k        0     .10     .20     .30     .40     .50     .60     .70     .80     .90       1
2     .098    .097    .096    .095    .093    .091    .088    .083    .078    .070    .050
5     .226    .224    .218    .209    .197    .183    .167    .148    .127    .102    .050
10    .401    .394    .377    .351    .321    .287    .251    .213    .173    .128    .050

FWE with Bonferroni adjustment
                                         ρ
k        0     .10     .20     .30     .40     .50     .60     .70     .80     .90       1
2     .049    .049    .049    .048    .048    .047    .045    .043    .040    .036    .025
5     .049    .049    .048    .047    .045    .042    .039    .035    .030    .023    .010
10    .049    .049    .047    .045    .042    .039    .034    .029    .023    .017    .005
because, as D’Agostino (23) points out, there are many different purposes for secondary endpoints. One purpose might be, after showing that the treatment works, to understand its mechanism of action, in which case a multiplicity adjustment is probably not needed. In other situations, it is less clear how much adjustment is needed. Some people would be content to use level α for each secondary endpoint, arguing that the designation ‘‘secondary outcome’’ instills in the reader the proper amount of caution when interpreting results. Others suggest adjusting for all secondary outcomes or all secondary outcomes plus the primary outcome (24) using the Bonferroni method. The problem is that trials are usually powered for the primary outcome, so even with no multiple comparison adjustment, power for secondary outcomes may be low. To undermine power even more by adjusting for all secondary outcomes makes it very unlikely to reach statistical significance. A reasonable middle ground is to consider any secondary outcome finding suggestive if it reaches level α when no adjustment is made and more definitive if it remains significant after adjustment for multiplicity.

4.3 Subgroup Analyses

The subgroup row of Table 1 is completely dark, which suggests that some sort of multiplicity adjustment or at least cautionary language is needed for subgroup conclusions,
especially if the overall effect is not significant. That is what happened in the rgp120 HIV Vaccine Study (25), which was the first Phase 3 trial of an HIV vaccine in humans. No significant difference in HIV acquisition was found between vaccine and placebo arms overall, although an effect seemed to exist in nonwhite participants. One is tempted to conclude that the trial would have reached statistical significance if it had enrolled only nonwhites. The problem is that by chance alone we can often find one or more subgroups in which a benefit of treatment seems to exist, and other subgroups in which no benefit seems to exist. Peto (26) showed that with two equally-sized subgroups constituted completely at random, if the overall effect reaches a P-value of about 0.05, then the probability is about 1/3 that one subgroup will have a statistically significant treatment effect that is more than three times as large as the other subgroup’s treatment effect, which fails to reach statistical significance. Thus, one can be misled by subgroup results even in the simplest possible setting of only two equally sized subgroups. In practice, many subgroups of varying sizes often exist, which compounds the problem. ISIS-2 Investigators (27) illustrated the difficulty in interpreting subgroup effects in a factorial trial of aspirin and streptokinase on mortality. The trial showed a strong benefit of aspirin. In an effort to show how misleading subgroup results can be, investigators
facetiously noted a slightly adverse effect of aspirin in patients born under the astrological signs of Gemini or Libra, but a strong positive effect of aspirin for patients of other astrological signs (P < 0.00001). The ISIS-2 example highlights that the magnitude of multiplicity can be larger than it seems. Astrological sign sounds like it should comprise 12 components, but the example above combined two noncontiguous signs, Gemini and Libra. It is hard to know the actual magnitude of multiplicity because if the authors could not demonstrate their point by combining two astrological signs, then they may have tried combining three or more signs. The same sort of phenomenon has arisen from post-hoc combinations of different classes of heart failure, for example. Similarly, if subgroups are formed by somewhat arbitrary cutpoints on a numeric scale, then one may be inclined to experiment with different cutpoints until one is found that highlights subgroup differences. The true extent of multiplicity may be unknowable, in which case it is impossible to separate a real effect from the play of chance. The discussion so far has stressed the need for extreme caution in interpreting subgroup effects, but too much caution will cause us never to find real subgroup effects. After all, even with no adjustment for multiplicity, power in subgroups is often low. For this reason, a widely accepted compromise is first to test whether a subgroup by treatment interaction exists. If the interaction is not statistically significant, then one estimates the treatment effect in the subgroup by the overall treatment effect. If the interaction test is significant, then one reports the treatment effects observed in the different subgroups. It is important to keep in mind that although the treatment effect may differ numerically across subgroups (a quantitative interaction), it is unusual for the treatment to benefit one subgroup and harm another (a qualitative interaction) (26).

4.4 Interim Monitoring

The monitoring row of Table 1 is completely dark, which suggests that adjustment is needed for monitoring. A popular and flexible method is to use a spending function
α*(t), which dictates the cumulative amount of type 1 error to be spent by time t of the trial, where t is measured in terms of relative information; t = 0 and 1 at the beginning and end of the trial, respectively. Making α*(1) = α ensures that the FWE will be controlled. The boundary at a given interim analysis depends on its time and the times of previous interim analyses, and it is computed using numerical integration. Neither the number nor the timing of interim analyses need be specified in advance. The properties of spending function boundaries depend on the particular spending function selected. Desirable spending functions spend relatively little alpha early, and then make up for that by steeply increasing for t close to 1. This causes early boundaries to be large, and the boundary at the end of the trial to be close to what it would have been without monitoring. A more thorough treatment of spending functions and other monitoring boundaries may be found in References 28 and 29. This article focused mostly on phase III clinical trials. Generally speaking, less emphasis is placed on multiplicity adjustment in earlier phase trials because such trials are used to decide whether more definitive phase III testing is justified rather than to declare the treatment beneficial.
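To make the spending-function idea concrete, the sketch below evaluates two spending functions commonly associated with the Lan-DeMets approach, an O'Brien-Fleming-type and a Pocock-type, at a hypothetical set of information fractions, and reports the cumulative and incremental type 1 error available at each look. This only illustrates how α*(t) allocates error over time; computing the corresponding boundaries still requires the numerical integration mentioned above.

import numpy as np
from scipy.stats import norm

alpha = 0.05
t = np.array([0.25, 0.50, 0.75, 1.00])    # hypothetical information fractions at the looks

# O'Brien-Fleming-type spending function (two-sided alpha): 2 - 2*Phi(z_{alpha/2}/sqrt(t))
obf = 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(t)))

# Pocock-type spending function: alpha * ln(1 + (e - 1) * t)
pocock = alpha * np.log(1 + (np.e - 1) * t)

for name, spend in [("O'Brien-Fleming-type", obf), ("Pocock-type", pocock)]:
    increments = np.diff(np.concatenate(([0.0], spend)))
    print(name)
    for ti, cum, inc in zip(t, spend, increments):
        print(f"  t = {ti:.2f}: cumulative alpha = {cum:.4f}, newly spent = {inc:.4f}")

Both functions spend the full α by t = 1; the O'Brien-Fleming-type spends very little early, which keeps early boundaries stringent and leaves the final boundary close to the unmonitored value.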
REFERENCES
1. G. W. Snedecor and W. G. Cochran, Statistical Methods, 7th ed. Ames, IA: The Iowa State University Press, 1980. 2. K. Rothman, No adjustments are needed for multiple comparisons. Epidemiology 1990; 1: 43–46. 3. L. A. Moyé, Multiple Analyses in Clinical Trials. New York: Springer-Verlag, 2003. 4. E. M. Laska and M. J. Meisner, Testing whether an identified treatment is best. Biometrics 1989; 45: 1139–1151. 5. M. A. Proschan and M. A. Waclawiw, Practical guidelines for multiplicity adjustment in clinical trials. Control. Clin. Trials 2000; 21: 527–539. 6. R. G. Miller, Simultaneous Statistical Inference. New York: Springer-Verlag, 1981. 7. D. B. Duncan, Multiple range and Multiple F-tests. Biometrics 1955; 11: 1–42.
8. M. Proschan and D. Follmann, Multiple comparisons with control in a single experiment versus separate experiments: why do we feel differently? Am. Stat. 1995; 49: 144–149. 9. D. D. Waters, E. L. Alderman, J. Hsia, et al., Effects of hormone replacement therapy and antioxidant vitamin supplements on coronary atherosclerosis in postmenopausal women: a randomized controlled trial. J. Am. Med. Assoc. 2002; 288: 2432–2440.
10. Cardiac Arrhythmia Suppression Trial Investigators. Preliminary report: effect of encainide and flecainide on mortality in a randomized trial of arrhythmia suppression after myocardial infarction. N. Engl. J. Med. 1989; 321: 406–412. 11. Cardiac Arrhythmia Suppression Trial II Investigators. Effect of the antiarrhythmic agent moricizine on survival after myocardial infarction. N. Engl. J. Med. 1992; 327: 227–233. 12. Y. Hochberg and A. C. Tamhane, Multiple Comparison Procedures. New York: Wiley, 1987. 13. D. O. Clegg, D. J. Reda, C. L. Harris, et al., Glucosamine, chondroitin sulfate, and the two in combination for painful knee osteoarthritis. N. Engl. J. Med. 2006; 354: 795–808. 14. R. J. Cook, and V. T. Farewell, Multiplicity considerations in the design and analysis of clinical trials. J. R. Stat. Soc. A 1996; 159: 93–110. 15. M. D. Hughes, Multiplicity in clinical trials. In: P. Armitage, and T. Colton, eds. Encyclopedia of Biostatistics. New York: Wiley, 1998. 16. C. Dunnett, A multiple comparisons procedure for comparing several treatments with a control. J. Am. Stat. Assoc. 1957; 50: 1096–1121. 17. P. C. O’Brien, Procedures for comparing samples with multiple endpoints. Biometrics 1984; 40: 1079–1087. 18. D. A. Follmann, Multivariate tests for multiple endpoints in clinical trials. Stats. Med. 1995; 14: 1163–1176. 19. S. Holm, A simple sequentially rejective multiple test procedure. Scand. J. Stats. 1979; 6: 65–70. 20. Y. Hochberg, A sharper Bonferroni procedure for multiple tests of significance. Biometrika 1988; 75: 800–802. 21. R. J. Simes, An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986; 73: 751–754. 22. A. Dmitrienko and J. C. Hsu, Multiple testing in clinical trials. In: S. Kotz, N. Balakrishnan,
C. B. Read, B. Vidakovic, and N. Johnson, eds. Encyclopedia of Statistical Sciences. New York: Wiley, 2006. 23. R. B. D’Agostino, Controlling alpha in a clinical trial: the case for secondary endpoints. Stats. Med. 2000; 19: 763–766. 24. C. E. Davis, Secondary endpoints can be validly analyzed, even if the primary endpoint does not provide clear statistical significance. Control. Clin. Trials 1997; 18: 557–560. 25. The rgp120 HIV Vaccine Study Group. Placebo-controlled, phase III trial of a recombinant glycoprotein 120 vaccine to prevent HIV-1 infection. J. Infect. Dis. 2005; 191: 654–665. 26. R. Peto, Clinical trials. In P. Price and K. Sikora, eds. Treatment of Cancer. London: Chapman and Hall, 1995. 27. ISIS-2 Collaborative Group. Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17,187 cases of suspected acute myocardial infarction: ISIS-2. Lancet 1988; 2: 349–360. 28. C. Jennison and B. W. Turnbull, Group Sequential Methods with Applications to Clinical Trials. Boca Raton, FL: Chapman and Hall/CRC Press, 2000. 29. M. A. Proschan, K. K. Lan, and J. T. Wittes, Statistical Monitoring of Clinical Trials: A Unified Approach. New York: Springer, 2006.
MULTIPLE ENDPOINTS

FRANK BRETZ and MICHAEL BRANSON
Novartis Pharma AG
Basel, Switzerland

1 INTRODUCTION

A common problem in pharmaceutical research is the comparison of two treatments for more than one outcome measure, which in the clinical context are often referred to as endpoints. A single observation or measurement is often not sufficient to describe a clinically relevant treatment benefit. In respiratory studies, for example, several endpoints (such as FEV1, respiratory symptoms, and health-related quality of life) are frequently considered to determine a treatment-related benefit. In cardiovascular trials, possible endpoints include time to myocardial infarction, congestive heart failure, stroke, and so on. In such instances, the experimenter is interested in assessing a potential treatment effect while accounting for all multiple outcome measures. The aspects of a highly multidimensional and complex syndrome are usually assessed by means of various symptoms or ordinal items in scales. In order to map these observations on one (or a few) ordinal scale(s) that best represents the severity (or improvement) of the disease, one can try to reduce the dimensionality by combining the univariate projections of this syndrome to a single (or a few) measure(s) of efficacy. This approach is also in agreement with the ICH E9 guideline (1), in which it is recommended to use a single (primary) endpoint, where possible, thus reflecting the need to efficiently reduce the dimensionality. If, nevertheless, certain aspects cannot be combined into one scale or index because they describe inherently different dimensions, or if various measures need to be taken in order to capture the entire benefit range, multiple endpoints are unavoidable. Statistical test procedures for the analysis of multiple endpoints can roughly be classified into (1) multiple testing approaches and (2) multivariate global tests. Multiple testing approaches, such as the well-known Bonferroni procedure, make individual assessments for each endpoint under investigation. They are particularly advantageous if the endpoints can be classified before the study according to some hierarchy, as this natural ordering can then be explicitly used in the multiple testing approach. In contrast, global tests aim at assessments across all endpoints based on measuring the distance between multivariate populations. These methods include, for example, the classical multivariate analysis of variance (MANOVA) F-tests, which are particularly powerful if treatment differences are expressed through the combined effect of the endpoints. In the following, the main attributes underpinning multiple testing and global testing methodologies are briefly described.
2 MULTIPLE TESTING METHODS
If not stated otherwise, the comparison of two treatments j = 1, 2 with k > 1 endpoints being measured on nj patients is considered. No assumptions are made about the distribution of the measurements in this section. The interest lies in testing the k null hypotheses Hi: θ1i = θ2i (the two treatments do not differ in effect) against the one-sided alternative hypotheses Ki: θ1i > θ2i (treatment 1 is better than treatment 2), where θji denotes the mean effect of treatment j for endpoint i = 1, . . ., k. Note that the following results are directly extendable to the two-sided testing situation. A main concern when testing multiple hypotheses is the increased likelihood of rejecting incorrectly at least one true null hypothesis [familywise error rate, FWER; see Hochberg and Tamhane (2) for a detailed overview]. In particular, within the regulated environment in which drug development takes place, it is often required to control the FWER at a prespecified level α, frequently, but not restricted to, 2.5% one-sided. A standard approach in this context is the Bonferroni procedure. Let pi denote the P-value for the ith hypothesis, Hi, as
obtained from applying an appropriate two-sample test. If pi < α/k, Hi can be rejected at FWER ≤ α, which follows from Bonferroni's inequality

FWER = P( ∪i∈T {pi < α/k} ) ≤ Σi∈T P(pi < α/k) ≤ α,
where T denotes the set of all true null hypotheses. Several improvements of the Bonferroni procedure exist. The closure principle (CP) of Marcus et al. (3) is a general and powerful multiple testing procedure, which includes many of these improvements as special cases. For simplicity, its use is illustrated for k = 3 endpoints (see Fig. 1). Starting from the set of hypotheses of interest H1, H2, H3, all possible intersection hypotheses H12, H23, H13, H123 are created, where Hij = Hi ∩ Hj, 1 ≤ i, j ≤ 3, are the intersection hypotheses of two hypotheses at a time. For example, H23 states that endpoints 2 and 3 do not differ for both treatments. The overall intersection hypothesis H123 is the global null hypothesis of no treatment effect across all three endpoints. According to the CP, a hypothesis Hi is rejected at FWER α if Hi itself and all hypotheses formed by intersection with Hi are each rejected at (local) level α. For example, in order to reject H1 in Fig. 1, one has to reject H1 itself as well as H12, H13, H123, where the choice of the individual level-α tests is free. Application of the Bonferroni approach to each hypothesis, for example, leads to the stepwise procedure of Holm (4): Let p(1) ≤ . . . ≤ p(k) denote the ordered unadjusted P-values with the associated hypotheses H(1), . . ., H(k). Then, H(i) is rejected if p(j) < α/(k − j + 1), j = 1, . . ., i (i.e., if all hypotheses H(j) preceding H(i) are also rejected). As a result of its stepwise nature, Holm's procedure is more powerful than the Bonferroni procedure. A second important application of the CP is the allocation of the FWER α to preordered sets of hypotheses (5). Chi (6), for example, advocated the definition of relevant decision paths before conducting the clinical trial so that the inherent multiplicity problem caused by multiple endpoints is reduced
by sequentially testing different sets of endpoints in a clinically relevant and meaningful way. It is, therefore, common practice in clinical trials to classify the endpoints into primary, co-primary, and secondary endpoints (6) and to test them in a prespecified order of importance, thus reflecting the need to efficiently reduce the multiplicity. Different strategies of allocating the FWER α between the sets of hypotheses are possible (6, 7), most of which can ultimately be derived by applying the CP. Fixed sequence methods test the individual endpoints in a prespecified order, each at level α, where the nonrejection of a hypothesis at any step stops the test procedure (5). The fixed sequence approach is only applicable when a complete hierarchy of the endpoints is available and a hypothesis is only of interest and thus tested if the higher prioritized hypotheses have all been rejected beforehand (8). Gatekeeping methods first test a preliminary hypothesis and, in case of a rejection, continue testing additional hypotheses (9). This approach may be required if a single primary endpoint is to be tested for significance before further secondary endpoints are to be analyzed. More general gatekeeping procedures are introduced in (10). An important point to note is that of requiring statistical significance for all primary (multiple) endpoints before being able to claim ‘‘success.’’ In this context, no formal adjustment of the FWER is necessary (i.e., each endpoint is tested separately at level α) (8). What should be carefully evaluated and understood is the impact on the power for such a hypothesis testing strategy. Other multiplicity adjustments exist that improve on Bonferroni's inequality, in particular the Simes test (11) and its stepwise extension (12). All of the methods considered so far, however, do not account for the inherent correlation between different endpoints. Exploiting the association between the endpoints can considerably improve the methods above. Under the standard ANOVA assumptions and if the correlations are known (or at least well approximated), the endpoint-specific t-tests are jointly multivariate t distributed (13) and efficient integration routines can be used for the design and analysis
[Figure 1. CP hypothesis tree for three endpoints: the global intersection hypothesis H123 at the top; the pairwise intersections H12, H13, H23 in the middle; and the elementary hypotheses H1, H2, H3 at the bottom.]
of related clinical trials (14). If the correlations are unknown (which is the typical case in practice), a standard alternative is to resample the data to approximate the joint distribution of the test statistics (15, 16). Different resampling schemes are available, each of which has its own advantages. The key idea is to permute, a large number of times, the entire patient-specific observation vectors such that the complete information for a single patient, across the endpoints, is maintained. Such a procedure retains the (stochastic) dependencies between the endpoints under the null hypothesis of no real differences among the treatments. For each resampling step, the test statistics are computed based on the resampled data. Then, the originally observed test statistics are compared with the resampled test statistics via the resampling distribution, where extreme values indicate a significant treatment effect. Resampling methods are available for a variety of multiple comparison procedures, in particular for the CP based on t-statistics (15). In addition, extensions to the comparison of multiple treatments are possible, and software is readily available, such as PROC MULTTEST in SAS.
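To make the stepwise and resampling ideas above concrete, here is a minimal sketch in Python (not from the original article; the two-group layout, function names, and toy data are illustrative assumptions). It computes Bonferroni and Holm adjusted P-values for k endpoints and approximates a single-step max-T adjustment by permuting whole patient observation vectors, which preserves the correlation among endpoints as described above. PROC MULTTEST, mentioned in the text, provides a production implementation of such resampling adjustments; the sketch only shows the mechanics.

```python
import numpy as np
from scipy import stats

def holm_adjust(pvals):
    """Holm step-down adjusted P-values (controls the FWER)."""
    p = np.asarray(pvals, dtype=float)
    k = len(p)
    order = np.argsort(p)
    adj = np.empty(k)
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (k - rank) * p[idx])
        adj[idx] = min(1.0, running_max)
    return adj

def max_t_adjust(y1, y2, n_perm=5000, seed=1):
    """Single-step max-T adjusted P-values via permutation of whole patient
    vectors (rows), which retains the correlation among the endpoints.
    y1, y2: (n1 x k) and (n2 x k) arrays of endpoint measurements."""
    rng = np.random.default_rng(seed)
    obs_t = stats.ttest_ind(y1, y2, axis=0).statistic
    pooled = np.vstack([y1, y2])
    n1 = y1.shape[0]
    max_null = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(pooled)              # permute patient vectors
        t_b = stats.ttest_ind(perm[:n1], perm[n1:], axis=0).statistic
        max_null[b] = np.max(np.abs(t_b))
    # adjusted P-value: how often the permutation maximum exceeds |t_i|
    return np.array([(np.sum(max_null >= abs(t)) + 1) / (n_perm + 1)
                     for t in obs_t])

# toy example with k = 3 correlated endpoints
rng = np.random.default_rng(0)
k, n = 3, 40
cov = 0.6 * np.ones((k, k)) + 0.4 * np.eye(k)
y1 = rng.multivariate_normal([0.5, 0.3, 0.0], cov, size=n)
y2 = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=n)
raw_p = stats.ttest_ind(y1, y2, axis=0).pvalue
print("raw:", raw_p)
print("Bonferroni:", np.minimum(1, k * raw_p))
print("Holm:", holm_adjust(raw_p))
print("max-T (permutation):", max_t_adjust(y1, y2))
```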
3 MULTIVARIATE GLOBAL TESTS
The classic approach for comparing two multivariate normal populations with an unknown common covariance matrix is Hotelling's T² test (17),
\[
T^2 = \frac{n_1 n_2}{n_1 + n_2}\,(\bar{Y}_1 - \bar{Y}_2)'\, S^{-1}\, (\bar{Y}_1 - \bar{Y}_2),
\]
where \(\bar{Y}_j\) denotes the vector of sample means, \(S = \sum_{j=1}^{2}\sum_{l=1}^{n_j} (Y_{jl} - \bar{Y}_j)(Y_{jl} - \bar{Y}_j)'/\nu\) denotes the pooled sample covariance matrix with ν = n_1 + n_2 − 2, and Y_jl denotes the observation vector for patient l under treatment j = 1, 2. Under the null hypothesis of no treatment effect across all endpoints, (n_1 + n_2 − k − 1)T²/[k(n_1 + n_2 − 2)] follows an F distribution with ν_1 = k and ν_2 = n_1 + n_2 − k − 1 degrees of freedom. Note that T² can be regarded as a generalization of the two-sample t test to the multivariate setting. The test statistic T² is the squared maximum of the univariate t test statistics over all linear combinations of the endpoints. In addition, the T² test has several optimality properties. Among others, it is uniformly most powerful among all tests that are invariant to scale transformations (18). Several extensions to situations with more than two treatments exist, the most notable being Wilks' Λ, Pillai's trace, the Hotelling–Lawley trace, and Roy's maximum root, all of which are approximately F distributed (18). All of the afore-mentioned global tests were developed for the two-sided alternative and thus lack power in practical applications involving one-sided directional decisions. A variety of one-sided tests in the multivariate setting exists, although many problems remain unsolved [see Tamhane and Logan (19) for a recent review of these methods]. Kudô (20) derived the exact likelihood ratio (LR) test for the test problem H: θ = 0 vs. K: θ ≥ 0 with at least one θ_i > 0, when the covariance structure is known. Perlman (21) extended the LR test to situations with an unknown covariance structure. However, the exact distribution of the LR test is not free
of the unknown nuisance parameters. Moreover, the LR test is biased, so that research to circumvent these difficulties is still ongoing (22, 23). O'Brien (24) proposed a different solution by restricting the alternative space K, where the standardized treatment differences (θ_1i − θ_2i)/σ_i, i = 1, ..., k, are assumed to be all positive and of the same magnitude. Using ordinary least squares (OLS) and generalized least squares (GLS) methods, O'Brien (24) showed that the resulting LS statistics are standardized weighted sums of the individual t statistics
\[
t_i = \frac{\bar{Y}_{1i} - \bar{Y}_{2i}}{\sqrt{s_i^2\,(n_1^{-1} + n_2^{-1})}},
\]
where \(\bar{Y}_{ji}\), j = 1, 2, and s_i² denote, respectively, the mean response and pooled sample variance of endpoint i = 1, ..., k. More specifically, the OLS statistic is
\[
t_{OLS} = \frac{\sum_i t_i}{\sqrt{\mathbf{1}'\hat{R}\,\mathbf{1}}},
\]
where \(\hat{R}\) denotes the pooled sample correlation matrix. Alternatively, the GLS method
\[
t_{GLS} = \frac{\mathbf{1}'\hat{R}^{-1} t}{\sqrt{\mathbf{1}'\hat{R}^{-1}\mathbf{1}}}
\]
weights the t statistics according to \(\hat{R}\), where t is the vector of t statistics from above. Note that by construction, t_GLS may include negative weights for certain correlation structures, so that it is possible to reject H in favor of positive treatment effects when, in fact, the opposite is true (25). As the exact distribution of t_OLS and t_GLS is unknown, approximations have to be used. O'Brien (24) proposed the use of a t distribution with ν (= n_1 + n_2 − 2) degrees of freedom. For small sample sizes, the LS tests can be either conservative or liberal, depending on the parameter constellation. Improved small sample approximations can be found in Logan and Tamhane (26). For large sample sizes, the limiting standard normal distribution can be used. In contrast, Läuter (27) and Läuter et al. (28) derived exact one-sided level-α tests without restricting the multivariate alternative region. They gave conditions under which choosing data-dependent weight vectors w leads to the test statistic
\[
t_w = \sqrt{\frac{n_1 n_2}{n_1 + n_2}}\;\frac{\sum_i w_i t_i}{\sqrt{w'\hat{R}\,w}}
\]
being t distributed with ν degrees of freedom under H. Let \(\bar{Y}_i\) denote the total variable-wise mean over both samples. A common choice is then to set w_i = ν_ii^{-1}, where
\(\nu_{ii} = \sum_{j=1}^{2}\sum_{l=1}^{n_j} (Y_{ijl} - \bar{Y}_i)^2\) is the ith diagonal element of the total product matrix, leading to the standardized score (SS) test. Logan and Tamhane (26) compared the power of the OLS and SS tests. They showed analytically that (1) if only a single endpoint has a positive effect, the OLS test is more powerful than the SS test; and (2) if all endpoints have a positive effect of the same magnitude (which is the underlying assumption of the LS tests), both tests are equally powerful. Moreover, the authors conducted a simulation study for a variety of scenarios, showing that the OLS test and the SS test behave similarly, in terms of power, throughout most of the alternative region. As an alternative to the SS test, the first principal component calculated from the total product matrix can be used, leading to the PC test (27, 28). The PC test has higher power than the OLS test over a wide range of the alternative region, for example, when some variables have an effect and others do not.
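As a numerical illustration of the global tests in this section, the following hedged sketch (Python; function and variable names are assumptions, not from the source) computes Hotelling's T² with its F reference distribution and O'Brien's OLS statistic from two samples of k endpoints. Estimating the pooled correlation matrix from the within-group-centered data is one simple choice among several.

```python
import numpy as np
from scipy import stats

def hotelling_t2(y1, y2):
    """Hotelling's T^2 for two independent samples (rows = patients)."""
    n1, k = y1.shape
    n2 = y2.shape[0]
    d = y1.mean(axis=0) - y2.mean(axis=0)
    s = ((n1 - 1) * np.cov(y1, rowvar=False)
         + (n2 - 1) * np.cov(y2, rowvar=False)) / (n1 + n2 - 2)
    t2 = n1 * n2 / (n1 + n2) * d @ np.linalg.solve(s, d)
    f = (n1 + n2 - k - 1) / (k * (n1 + n2 - 2)) * t2
    return t2, stats.f.sf(f, k, n1 + n2 - k - 1)

def obrien_ols(y1, y2):
    """O'Brien's OLS test: standardized sum of the endpoint-specific t
    statistics, referred to a t distribution with n1 + n2 - 2 df."""
    n1, k = y1.shape
    n2 = y2.shape[0]
    t = stats.ttest_ind(y1, y2, axis=0).statistic
    # pooled correlation matrix from within-group-centered observations
    r = np.corrcoef(np.vstack([y1 - y1.mean(0), y2 - y2.mean(0)]),
                    rowvar=False)
    t_ols = t.sum() / np.sqrt(np.ones(k) @ r @ np.ones(k))
    return t_ols, stats.t.sf(t_ols, n1 + n2 - 2)   # one-sided alternative

# toy example
rng = np.random.default_rng(2)
y1 = rng.normal(0.3, 1.0, size=(30, 4))
y2 = rng.normal(0.0, 1.0, size=(30, 4))
print(hotelling_t2(y1, y2), obrien_ols(y1, y2))
```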
4 CONCLUSIONS
In this article, some of the existing methods for analyzing multiple endpoints have been briefly reviewed. The main application of these methods is clinical trials focusing on relatively few efficacy endpoints, where strong control of the type I error rate is mandatory. It is important to reiterate that the multiple testing methods discussed above are P-value based approaches, providing a high degree of flexibility in tailoring the testing strategy to suit the particular application. Such procedures are not restricted to particular data types and are applicable to the analysis of, for example, normal, binary, count, ordinal, and time-to-event endpoints. An interesting and evolving method in this context is the use of the CP in adaptive designs, which allows the user to select and confirm endpoints within a single trial (29, 30). A further approach is to apply multivariate tests to the intersection hypotheses of the CP (31). Such a hybrid approach thus combines the advantages of multiple testing methods (assessment of the individual hypotheses) and multivariate global tests
(taking the multivariate nature of the data into account). Other applications may require different statistical methods than those reviewed in this article. Longitudinal data, for example, may call for (nonlinear) mixed-effects models to describe the stochastic dependencies between the time points (32). The analysis of multivariate time-to-event outcomes has been the topic of much research, and standard approaches based on a counting-process formulation (33, 34) can be easily implemented in standard statistical software (35). Safety (adverse event) data may require analysis using novel multiple testing techniques based on controlling the false discovery rate (36, 37) or, for example, hierarchical Bayes models as proposed by Berry and Berry (38). Finally, high-dimensional screening studies with thousands of endpoints, such as gene expression profiling, often use completely different statistical tools (39, 40).
REFERENCES
1. ICH Tripartite Guideline E9. (1998). International conference on harmonization; guidance on statistical principles for clinical trials. (online). Available: http://www.fda.gov/80/cder/guidance/.
2. Y. Hochberg and A. C. Tamhane, Multiple Comparison Procedures. New York: Wiley, 1987.
3. R. Marcus, E. Peritz, and K. R. Gabriel, On closed testing procedures with special reference to ordered analysis of variance. Biometrika 1976; 63: 655–660.
4. S. Holm, A simple sequentially rejective multiple test procedure. Scand. J. Statist. 1979; 6: 65–70.
5. W. Maurer, L. A. Hothorn, and W. Lehmacher, Multiple comparisons in drug clinical trials and preclinical assays: a-priori ordered hypotheses. In: J. Vollman (ed.), Biometrie in der Chemische-Pharmazeutichen Industrie, vol. 6. Stuttgart, Germany: Fischer Verlag, 1995.
6. G. Chi, Multiple testings: Multiple comparisons and multiple endpoints. Drug Inform. J. 1998; 32: 1347S–1362S.
7. P. H. Westfall and A. Krishen, Optimally weighted, fixed sequence and gatekeeper multiple testing procedures. J. Stat. Plan. Infer. 2001; 99: 25–40.
8. CPMP Points to Consider. (2002). PtC on multiplicity issues in clinical trials. CPMP/EWP/908/99. (online). Available: http://www.emea.eu.int/pdfs/human/ewp/090899en.pdf.
9. P. Bauer, J. Röhmel, W. Maurer, and L. A. Hothorn, Testing strategies in multi-dose experiments including active control. Stat. Med. 1998; 17: 2133–2146.
10. A. Dmitrienko, W. W. Offen, and P. H. Westfall, Gatekeeping strategies for clinical trials that do not require all primary effects to be significant. Stat. Med. 2003; 22: 2387–2400.
11. R. J. Simes, An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986; 73: 751–754.
12. Y. Hochberg, A sharper Bonferroni procedure for multiple tests of significance. Biometrika 1988; 75: 800–802.
13. F. Bretz, A. Genz, and L. A. Hothorn, On the numerical availability of multiple comparison procedures. Biometric. J. 2001; 43: 645–656.
14. A. Genz and F. Bretz, Comparison of methods for the computation of multivariate probabilities. J. Comput. Graph. Stat. 2002; 11: 950–971.
15. P. H. Westfall and S. S. Young, Resampling-Based Multiple Testing. New York: Wiley, 1993.
16. J. F. Troendle, A permutational step-up method of testing multiple outcomes. Biometrics 1996; 52: 846–859.
17. H. Hotelling, The generalization of Student's ratio. Ann. Math. Stat. 1931; 2: 360–378.
18. T. W. Anderson, An Introduction to Multivariate Statistical Analysis, 3rd ed. New York: Wiley, 2003.
19. A. C. Tamhane and B. R. Logan, Multiple endpoints: an overview and new developments. Technical Report, 43. Milwaukee, WI: Division of Biostatistics, Medical College of Wisconsin, 2003.
20. A. Kudô, A multivariate analogue of the one-sided test. Biometrika 1963; 15: 403–418.
21. M. D. Perlman, One-sided testing problems in multivariate analysis. Ann. Stat. 1969; 40: 549–567.
22. M. D. Perlman and L. Wu, The Emperor's New Tests (with discussion). Statistical Science 1999; 14: 355–381.
23. M. S. Srivastava, Methods of multivariate statistics. New York: Wiley, 2002.
24. P. C. O'Brien, Procedures for comparing samples with multiple endpoints. Biometrics 1984; 40: 1079–1089.
25. S. J. Pocock, N. L. Geller, and A. A. Tsiatis, The analysis of multiple endpoints in clinical trials. Biometrics 1987; 43: 487–498.
26. B. R. Logan and A. C. Tamhane, On O'Brien's OLS and GLS tests for multiple endpoints. In: Y. Benjamini, F. Bretz, and S. Sarkar (eds.), New Developments in Multiple Comparison Procedures. IMS Lecture Notes – Monograph Series 47. 2004, pp. 76–88.
27. J. Läuter, Exact t and F tests for analyzing studies with multiple endpoints. Biometrics 1996; 52: 964–970.
28. J. Läuter, E. Glimm, and S. Kropf, Multivariate tests based on left-spherically distributed linear scores. Ann. Stat. 1998; 26: 1972–1988.
29. M. Kieser, P. Bauer, and W. Lehmacher, Inference on multiple endpoints in clinical trials with adaptive interim analyses. Biometric. J. 1999; 41: 261–277.
30. G. Hommel, Adaptive modifications of hypotheses after an interim analysis. Biometric. J. 2001; 43: 581–589.
31. W. Lehmacher, G. Wassmer, and P. Reitmeir, Procedures for two-sample comparisons with multiple endpoints controlling the experimentwise error rate. Biometrics 1991; 47: 511–521.
32. J. Pinheiro and D. Bates, Mixed-Effects Models in S and S-PLUS. New York: Springer, 2000.
33. R. L. Prentice, B. J. Williams, and A. V. Peterson, On the regression analysis of multivariate failure time data. Biometrika 1981; 68: 373–379.
34. L. J. Wei, D. Y. Lin, and L. Weissfeld, Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. J. Amer. Stat. Assoc. 1993; 84: 1065–1073.
35. T. M. Therneau and P. M. Grambsch, Modeling Survival Data: Extending the Cox Model. New York: Springer, 2000.
36. Y. Benjamini and Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Royal Stat. Soc. B 1995; 57: 289–300.
37. D. V. Mehrotra and J. F. Heyse, Use of the false discovery rate for evaluating clinical safety data. Stat. Meth. Med. Res. 2004; 13: 227–238.
38. S. M. Berry and D. A. Berry, Accounting for multiplicities in assessing drug safety: a three-level hierarchical mixture model. Biometrics 2004; 60: 418–426.
39. T. Speed, Statistical Analysis of Gene Expression Microarray Data. Boca Raton, FL: CRC Press, 2003.
40. G. Parmigiani, E. S. Garett, R. A. Irizarry, and S. L. Zeger, The Analysis of Gene Expression Data. New York: Springer, 2003.
MULTIPLE EVALUATORS
JASON J. Z. LIAO and ROBERT C. CAPEN Merck Research Laboratories, West Point, Pennsylvania
1 INTRODUCTION
In medical and other related sciences, clinical or experimental measurements usually serve as a basis for diagnostic, prognostic, therapeutic, and performance evaluations. The measurement can be from multiple systems, processes, instruments, methods, raters, and so forth, but for the sake of simplicity, we refer to them as ‘‘evaluators’’ throughout this article. As technology continues to advance, new methods/instruments for diagnostic, prognostic, therapeutic, and performance evaluations become available. Before a new method or a new instrument is adopted for use in measuring a variable of interest, its agreement relative to other similar evaluators needs to be assessed. Measurements of agreement are needed to assess the reliability of multiple raters (or the same rater over time) in a randomized clinical trial (RCT). These measurements of agreement can be used for assessing the reliability of the inclusion criteria for entry into an RCT, validating surrogate endpoints in a study, determining that the important outcome measurements are interchangeable among the evaluators, and so on. An agreement study involving multiple evaluators can happen in all phases of drug development or other medical-related experimental settings. A typical design for an agreement study involves sampling n subjects from the population, sampling d evaluators from a population of evaluators, and obtaining r replicate determinations/evaluator. Each evaluator is ‘‘blinded’’ to all others to ensure independence of measurement error; the measurements represent the range of ratings proposed to be used interchangeably in the RCT. Various questions regarding agreement can be posed, such as the listings by Bartko (1). Can the measurements from evaluators be used interchangeably? How does one define and measure agreement? What is the overall level of agreement? How much bias and variance is there among evaluators? In short, the goal of an agreement study is two-fold. The first goal is to determine if the measurements from multiple evaluators agree with each other. If not, then the second goal is to identify where the difference(s) occur and correct/calibrate them if possible. The agreement problem covers a broad range of data, and examples can be found in different disciplines. The scale of a measurement can be continuous, binary, nominal, or ordinal. The rest of this article is organized as follows. In the next section, the approaches for assessing agreement of continuous data are reviewed. We classify these approaches into three categories. The first category is the hypothesis testing-type approach; the second category is an index approach; and the third category is the interval-type approach. In the section entitled ‘‘Agreement for categorical data,’’ the approaches for assessing agreement of categorical data are reviewed. A summary is provided in the last section.
2 AGREEMENT FOR CONTINUOUS DATA
Let Y_ij be the continuous measurement made by the j-th evaluator on the i-th subject (i = 1, ..., n, and j = 1, ..., d). The goal is to assess the agreement among the measurements made by the d evaluators. In the case of d = 2, paired measurements in perfect agreement fall exactly on the identity line Y_2 = Y_1, that is, the 45° line through the origin. The existing approaches can be classified into three categories. An appropriate transformation of the data, such as the logarithm, may be recommended and used to better meet the assumptions under a specified approach. Therefore, the Y_ij are the final chosen reportable values used in the agreement evaluations.
2.1 Hypothesis Testing Approach
The first approach of this type is the paired t-test to compare the mean values of two measurements for d = 2, and the F-test to compare the mean values of d (d > 2) measurements (1). The second approach in this category is to test hypotheses about the parameters of, for example, different types of regression models (2–9). All these hypothesis-type approaches depend heavily on the residual variance: they can reject a reasonably good agreement when the residual errors are small (more precision) but fail to reject a poor agreement when the residual errors are large (less precision) (10). Other critiques of these approaches can be found in Reference 11. Thus, any approach based on a hypothesis testing strategy is not recommended for assessing agreement.
2.2 An Index Approach
Two commonly used approaches comprise this category: the intraclass correlation coefficient (ICC) and the concordance correlation coefficient (CCC). Fleiss (12) gave an overview of the ICC as a measure of agreement among multiple evaluators. In defining the ICC, a useful two-way random effects model is given by
\[
Y_{ij} = \mu + s_i + r_j + e_{ij}, \quad i = 1, \ldots, n, \; j = 1, \ldots, d, \qquad (1)
\]
where µ is a fixed parameter and where s_i, r_j, and e_ij are independent random effects with mean 0 and variances σ_s², σ_r², and σ_e², respectively. Usually d = 2. The term s_i is the subject effect, r_j is the systematic bias (i.e., the systematic error), and e_ij is the measurement error. To assess the agreement among multiple evaluators by using the concept of intraclass correlation coefficient, the index is defined as
\[
\rho_u = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_r^2 + \sigma_e^2} \qquad (2)
\]
Under model (1), the denominator of ρ_u is the variance of Y_ij (unconditionally on j), and the numerator of ρ_u is the covariance between Y_{ij1} and Y_{ij2} (unconditionally on j1 ≠ j2). Thus, the ICC defined in Equation (2) is the correlation between any two arbitrary measurements made on the same subject.
Another measuring index is the ICC defined as follows:
\[
\rho_c = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_e^2} \qquad (3)
\]
ρ_c in Equation (3) does not take systematic error into account in assessing agreement under model (1). Contrasting this with the ICC defined in Equation (2), the denominator of ρ_c in Equation (3) is the variance of Y_ij conditionally on j. Thus, ρ_c is the correlation between Y_{ij1} and Y_{ij2} conditionally on j1 ≠ j2, which is the usual product–moment correlation between two specific measurements made on the same subject. Note that ρ_u = ρ_c if we assume a model without systematic error, that is, σ_r² = 0 in Equation (1), which is also referred to as the one-way random effects model (13). The one-way random model is as follows:
\[
Y_{ij} = \mu + s_i + e_{ij}, \quad i = 1, \ldots, n, \; j = 1, \ldots, d \qquad (4)
\]
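Before continuing, a brief computational aside: the ANOVA-based estimation recommended below can be sketched as follows for balanced data (Python; the moment estimators and variable names are standard textbook choices, not code from the authors).

```python
import numpy as np

def icc_two_way(y):
    """Estimate rho_u (agreement, Eq. 2) and rho_c (consistency, Eq. 3)
    from a balanced n x d matrix y (rows = subjects, columns = evaluators),
    using the usual two-way ANOVA mean squares."""
    n, d = y.shape
    grand = y.mean()
    ms_subj = d * np.sum((y.mean(axis=1) - grand) ** 2) / (n - 1)
    ms_eval = n * np.sum((y.mean(axis=0) - grand) ** 2) / (d - 1)
    resid = y - y.mean(axis=1, keepdims=True) - y.mean(axis=0) + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (d - 1))
    var_s = max((ms_subj - ms_err) / d, 0.0)   # subject variance component
    var_r = max((ms_eval - ms_err) / n, 0.0)   # evaluator (bias) component
    rho_u = var_s / (var_s + var_r + ms_err)
    rho_c = var_s / (var_s + ms_err)
    return rho_u, rho_c

# toy example: three evaluators, one with a systematic bias
rng = np.random.default_rng(3)
subj = rng.normal(0, 2, size=(20, 1))          # subject effects
bias = np.array([0.0, 0.5, -0.3])              # systematic evaluator bias
y = subj + bias + rng.normal(0, 0.5, size=(20, 3))
print(icc_two_way(y))   # rho_u is attenuated by the bias; rho_c is not
```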
Again, ICC from Equation (4) is the simple correlation between Y ij1 and Y ij2 . Therefore, we can estimate the ICCs using the Pearson correlation coefficient. However, the estimation of ICCs based on applying analysis of variance (ANOVA) is the most commonly used and recommended approach (13). Among the various methods proposed for calculating the confidence intervals of the ICCs, the one proposed by Cappelleri and Ting (14) almost exactly maintains the nominal coverage probability for typical agreement studies (15). As pointed out by Rousson et al. (16), the main difference between these two ICCs is that the value of ρ u is attenuated when systematic error exists, whereas ρ c is not. Another difference is that ρ u does not depend on the ordering of the d measurements, whereas ρ c does. For assessing interevaluator agreement, Rousson et al. (16) recommended using ρ u because a systematic bias clearly indicates a failure in agreement and should be taken into account. In contrast, they recommended using ρ c for test–retest agreement because the first and the second
trial in this situation are clearly not interchangeable. In assessing agreement for two evaluators, Lin (10) claimed that we should divide the assessment into two steps. First, a precision step assesses how close the observations are to the regression line. Second, an accuracy step assesses how close the regression line and the target line (the line of identity) are to each other. Using squared differences, and under the assumption that the observations are from a bivariate normal distribution with means µ_1 and µ_2 and covariance matrix
\[
\begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},
\]
Lin (10) proposed a concordance correlation coefficient (CCC) for d = 2 (denoted as CCC = ρC_a) as follows:
\[
\mathrm{CCC} = 1 - \frac{E(Y_2 - Y_1)^2}{E[(Y_2 - Y_1)^2 \mid \rho = 0]} = \rho C_a = \frac{2\rho\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2} \qquad (5)
\]
This index combines the measure of precision (ρ) and the measure of accuracy C_a = 2σ_1σ_2/[σ_1² + σ_2² + (µ_1 − µ_2)²] together. Lin (10)
estimated his CCC by replacing the parameters with the corresponding sample values and recommended using the Fisher Z-transformation, Z(r) = tanh⁻¹(r) = ½ ln[(1 + r)/(1 − r)], for inference. However, some concerns about using the CCC exist (17–20). Note that the CCC is a special case of the formula for the ICC (19, 21). Therefore, two improved CCCs have been proposed in the literature (22, 23). Many critiques argue against using the index approaches. In assessing agreement, all indices assume observations from a distribution (usually normal) with a fixed mean and constant covariance. However, the fixed mean assumption is often not satisfied in practice (24–26). In addition, a single index is usually not sufficient to measure the degree of agreement (27). When an index indicates a poor agreement, there is no indication of what is wrong. When poor agreement occurs, it is important to determine the degree of bias (fixed and/or proportional biases). No current agreement index can provide this information. As pointed out by Bland and Altman (11, 28), a correlation method is very sensitive
to the range of the measurements available in the sample. The greater this range, the higher the correlation. In addition, it is not related to the actual scale of measurement or to the size of error that might be clinically allowable. Related to this point, Liao and Lewis (20) gave examples where nested experiments lead to conflicting conclusions. Because any index approach is very sensitive to sample heterogeneity, Atkinson and Nevill (18) suggested that no index approach should be used to assess agreement.
2.3 An Interval Approach
The commonly used approach in this category was proposed for d = 2 evaluators by Bland and Altman (28). Let D_i = Y_{i2} − Y_{i1}, i = 1, ..., n. Assuming the D_i are approximately normally distributed, Bland and Altman (11, 28) proposed using the 95% interval of the observed differences,
\[
[\bar{D} - 2S_D,\; \bar{D} + 2S_D] \qquad (6)
\]
which they called ‘‘limits of agreement,’’ to measure agreement, where \(\bar{D} = \frac{1}{n}\sum_{i=1}^{n} D_i\) and \(S_D^2 = \frac{1}{n-1}\sum_{i=1}^{n} (D_i - \bar{D})^2\). These limits are
then compared with scientifically acceptable boundaries. Under the assumption of normality, approximately 95% of all differences will fall within the interval in Equation (6). This method does not depend on the range of the sample, and the limits of agreement give some indication of whether the discrepancy is acceptable in practice by comparing the limits with a scientifically justifiable boundary. As a supplement, Bland and Altman also proposed a mean–difference graphic that plots the difference D_i against the mean of the two measurements, M_i = (Y_{i1} + Y_{i2})/2, along with the 95% limits of the difference, that is, the limits of agreement. They claimed that this graphical tool could be used to investigate the assumptions and possible trends in systematic error (bias) and/or in measurement error, thus leading to a possible transformation of the responses being compared. These trends can be tested using an appropriate technique, such as the Spearman rank correlation between |D_i| and M_i.
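A minimal sketch of the limits-of-agreement computation in Equation (6) and of the mean–difference pairs used for the plot follows (Python; the toy data are illustrative, and the ±2·S_D convention follows the text).

```python
import numpy as np

def limits_of_agreement(y1, y2):
    """Bland-Altman 95% limits of agreement for paired measurements y1, y2
    from two evaluators (Eq. 6), plus the mean-difference pairs that would
    be plotted against those limits."""
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    d = y2 - y1                         # D_i
    m = (y1 + y2) / 2.0                 # M_i
    d_bar, s_d = d.mean(), d.std(ddof=1)
    lower, upper = d_bar - 2 * s_d, d_bar + 2 * s_d
    return (lower, upper), d_bar, list(zip(m, d))

# illustrative use: compare the limits with a clinically acceptable boundary
y1 = np.array([10.1, 12.3, 9.8, 11.5, 10.9, 13.2])
y2 = np.array([10.4, 12.0, 10.1, 11.9, 11.2, 13.0])
(lo, hi), bias, pairs = limits_of_agreement(y1, y2)
print(f"limits of agreement: [{lo:.2f}, {hi:.2f}], mean difference {bias:.2f}")
```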
The Bland and Altman approach is a favorite of medical researchers, with over 11,000 citations. However, some concerns about this method exist. First, it creates a problem of interpretation when a mixture of fixed bias, proportional bias, and/or proportional error occurs (29). Second, it is only good for the test–retest situation (16). Third, the level and type of bias between the two sets of measurements cannot be fully assessed, and covariate information cannot be incorporated into this approach (25, 26). Fourth, the mean–difference plot gives artifactual bias information in measurements differing only in random error (30). A new approach that overcomes these concerns was proposed in References 25 and 26.
3 AGREEMENT FOR CATEGORICAL DATA
Since the development of the kappa coefficient (31), a considerable amount of research has been performed in the area of interevaluator agreement for categorical data (see, for example, Reference 32). Cohen originally developed kappa as a measure of chance-corrected agreement between two evaluators for nominal ratings (two or more ratings that cannot be logically ordered, such as positive/negative, schizophrenic/bi-polar/neurotic/depressed, etc.). However, kappa itself is affected by the prevalence of the illness (33), an issue that will be discussed in more detail in the section entitled ‘‘Issues with Kappa.’’ Understanding the underlying assumptions behind the methods as well as their limitations is critical when analyzing categorical agreement data. For example, in the development of kappa, Cohen assumed independence between the evaluators. When quantifying the level of agreement, this assumption might be questionable when the evaluations from one clinician are known in advance by a second clinician, or if the evaluations are done over time, without blinding, by the same clinician. A well-publicized shortcoming of kappa is its inability to differentiate between two components of disagreement (34). Although it is possible to correct for such marginal disagreement (35), it is preferable that it be investigated as a potential source of interevaluator difference (36). Existing approaches for categorical data generally focus on assessing agreement through the calculation of a kappa-like index or by modeling the pattern of agreement in the data.
3.1 Measuring Agreement between Two Evaluators
3.1.1 Kappa. Suppose two evaluators are to classify n subjects into m mutually exclusive and exhaustive categories. Following Banerjee et al. (32), let p_kk be the proportion of subjects placed into category k (k = 1, 2, ..., m) by both evaluators and define \(p_0 = \sum_{k=1}^{m} p_{kk}\) as the observed proportion of agreement. A portion of this overall agreement will be due to chance alone. Call this portion p_c. Kappa is defined as the achieved agreement beyond chance relative to the potential agreement beyond chance (33):
\[
\kappa = \frac{p_0 - p_c}{1 - p_c} \qquad (7)
\]
Cohen (31) defined chance agreement in a natural way as
\[
p_c = \sum_{k=1}^{m} p_{k\cdot}\, p_{\cdot k} \qquad (8)
\]
where p_{k·} is the proportion of subjects placed into the kth category by the first evaluator and p_{·k} is the proportion of subjects placed into the kth category by the second evaluator. When disagreements between evaluators are not equally weighted, the use of kappa is problematic.
3.1.2 Weighted Kappa. Cohen (37) generalized the definition of kappa to include situations when disagreements are not equally weighted. Let n_jk be the number of subjects in the (j, k)th cell of an m × m table. The total number of subjects is \(n = \sum_{j,k=1}^{m} n_{jk}\). Define
w_jk to be the weight corresponding to the (j, k)th cell. Then, weighted kappa, which measures the proportion of weighted agreement corrected for chance (32), is defined as
\[
\kappa_w = \frac{p_{0w} - p_{cw}}{1 - p_{cw}} \qquad (9)
\]
where
\[
p_{0w} = \frac{1}{n}\sum_{j,k=1}^{m} w_{jk}\, n_{jk} = \sum_{j,k=1}^{m} w_{jk}\, p_{jk}, \qquad
p_{cw} = \frac{1}{n^2}\sum_{j,k=1}^{m} w_{jk}\, n_{j\cdot}\, n_{\cdot k} = \sum_{j,k=1}^{m} w_{jk}\, p_{j\cdot}\, p_{\cdot k} \qquad (10)
\]
Three common choices for the weights are provided in Equation (11). Weights can also be selected based on disagreement (37).
\[
w_{jk} = \begin{cases} 1, & j = k \\ 0, & j \neq k \end{cases} \;\; \text{(unweighted)}, \qquad
w_{jk} = 1 - \frac{|j - k|}{m - 1} \;\; \text{(linear weights)}, \qquad
w_{jk} = 1 - \frac{(j - k)^2}{(m - 1)^2} \;\; \text{(quadratic weights)} \qquad (11)
\]
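The following hedged sketch (Python; function names and the toy table are illustrative) computes kappa and weighted kappa from an m × m table of counts using the weights in Equation (11), together with the prevalence and bias indices that are discussed in the section ‘‘Issues with Kappa’’ below.

```python
import numpy as np

def kappa_statistic(table, weights="unweighted"):
    """Cohen's kappa / weighted kappa (Eqs. 7-11) from an m x m table of
    counts n_jk (rows = evaluator 1, columns = evaluator 2)."""
    n_jk = np.asarray(table, dtype=float)
    m = n_jk.shape[0]
    p = n_jk / n_jk.sum()                    # cell proportions p_jk
    j, k = np.indices((m, m))
    if weights == "unweighted":
        w = (j == k).astype(float)
    elif weights == "linear":
        w = 1.0 - np.abs(j - k) / (m - 1)
    elif weights == "quadratic":
        w = 1.0 - (j - k) ** 2 / (m - 1) ** 2
    else:
        raise ValueError("unknown weighting scheme")
    p0w = np.sum(w * p)                                      # observed
    pcw = np.sum(w * np.outer(p.sum(axis=1), p.sum(axis=0))) # chance
    return (p0w - pcw) / (1.0 - pcw)

# 2 x 2 example, plus the prevalence and bias indices described in the text
table = np.array([[45, 5],
                  [8, 42]])
p = table / table.sum()
print("kappa:", kappa_statistic(table))
print("prevalence index |p11 - p22|:", abs(p[0, 0] - p[1, 1]))
print("bias index |p12 - p21|:", abs(p[0, 1] - p[1, 0]))
```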
Hypothesis testing and confidence interval construction follow similarly to the unweighted case (38–40), although a large number of subjects is generally required to construct confidence intervals for weighted kappa, even for a modest number of categories (32). Note that for 2 × 2 tables, use of either linear or quadratic weights is equivalent to calculating unweighted kappa. For more than two categories, each weighting scheme assigns less weight to agreement as the categories get farther apart. The greatest disparity in ratings is assigned an agreement weight of 0. In general, for more than two categories the use of quadratic weights is common (33).
3.1.3 Agreement Measures for Ordinal Data. In many settings, the potential ratings that can be assigned to a subject form a natural ordering. Such ordinal scales often arise because a truly continuous variable is discretized. For example, the degree of
motion of a particular joint, which could be measured, might instead be categorized as ‘‘unrestricted,’’ ‘‘slightly restricted,’’ ‘‘moderately restricted,’’ and ‘‘highly restricted’’ (33). Although the kappa statistics mentioned above can, in theory, be extended to ordinal data, doing so will generally result in a loss of statistical power (41, 42) and could lead to misleading results (43). Various alternative methods to kappa exist for assessing agreement for ordered categories (44). The approach developed by Rothery (45) is a nonparametric analog to the ICC and has a straightforward interpretation (44). 3.2 Extensions to Kappa and Other Approaches for Modeling Patterns of Agreement 3.2.1 Extensions to Kappa. Since the development of kappa more than 45 years ago, most research has focused on the case of two evaluators using a binary scale to classify a random sample of n subjects. Both Banerjee et al. (32) and Kraemer et al. (43) provide nice discussions on various extensions to kappa. We refer the reader to their work for more details. Extension to the case of multiple evaluators was treated in Reference 46. Fleiss assumed that each of n subjects was rated by an equal number (d > 2) of evaluators into one of m mutually exclusive and exhaustive nominal categories, where, for each subject, the set of evaluators was randomly selected from a ‘‘population’’ of evaluators. Davies and Fleiss (47) developed a chance-corrected statistic for measuring agreement assuming, among other things, that a common set of evaluators rated all subjects. Kraemer (48) allowed for a different number of ratings per subject as well as for the same evaluator to place a subject into more than one category. By considering the placement of each observation in the rank ordering of the m categories, Kraemer could derive a kappa-like statistic as a function of the Spearman rank correlation coefficient. Kraemer et al. (43) warned that the use of kappa when more than two nominal categories exist is suspect regardless of the number of evaluators. 3.2.2 Modeling Patterns of Agreement. Up to this point, the focus has been centered on
assessing agreement through the calculation of a single index. Alternatively, one can model the pattern of agreement in the data. Various sophisticated strategies have been developed for doing this modeling, and a discussion of them is beyond the scope of this article. The most common approaches include the use of log-linear models (49–51), latent-class models (50, 52–56), and generalized estimating equations (57–60).
3.3 Issues with Kappa
Although the use of kappa statistics is widespread, appearing in the psychology, education, physical therapy, medical, and psychiatry literature (to name a few), its usefulness as a metric to gauge interevaluator agreement is not universally accepted (61). A thorough discussion of what kappa is designed to do is provided in Kraemer et al. (43). In this section, we describe some of the more common issues surrounding the use of kappa. It is not meant to be an exhaustive list.
1. Trait prevalence. When the prevalence of a disease or trait is rare, the kappa statistic can be misleading. Feinstein and Cicchetti (62) discussed this paradox and ways to resolve it. Sim and Wright (33) defined a prevalence index in the context of a 2 × 2 table as |p11 − p22|. When the prevalence index is high, chance agreement is very likely and kappa will be attenuated as a result.
2. Evaluator bias. Bias measures the extent to which two evaluators disagree on the proportion of ‘‘positive’’ (or ‘‘negative’’) cases (33). This bias is different from bias relative to a ‘‘gold standard.’’ However, a small level of bias, as determined by the bias index of Sim and Wright (33), does not imply a large value for kappa. In fact, any time the proportion of subjects in the (1,2) cell is the same as the proportion of subjects in the (2,1) cell, their bias index will be 0. A calculation of both the bias and prevalence index should accompany any calculation of kappa.
3. Kappa does not distinguish among types and sources of disagreement. By placing less weight on agreement for categories that are more separated, this issue can be, to some degree, overcome. Still, different weighting schemes will lead to different values for kappa, so using any type of ad hoc rule for determining the strength of agreement (see, for example, Reference 63) is inappropriate.
4. Nonindependent ratings. An important assumption in the development of kappa is that evaluators generate ratings independently of one another. The independence assumption is often satisfied in practice by appropriate care in the design of the agreement study. Dependent ratings will generally lead to inflated values of kappa. Two typical settings where the independence assumption is questionable are (1) when one evaluator is aware of the ratings of the other evaluator and (2) when the same evaluator rates the same set of subjects, without blinding, on two different occasions. In the latter setting, Streiner and Norman (64) suggest that a time interval of 2–14 days is reasonable but depends on the attribute being evaluated.
5. Ordinal data. Calculating kappa on data that are continuous but have been categorized is a practice that should be avoided. It is much better to preserve the original scale and apply measures appropriate for continuous data to assess agreement (see the section entitled ‘‘Agreement for continuous data’’). When it is impossible to obtain the continuous measure or when the data are ordinal in their original form, the weighted form of kappa is often used to assess agreement. This approach is questionable because it is generally not as powerful as other measures of agreement specifically developed for ordinal data (e.g., see Reference 45).
4 SUMMARY AND DISCUSSION
In this article, we have reviewed approaches for assessing measurement agreement among multiple evaluators. For continuous data, the index approach and the interval approach are
commonly used for assessing the agreement. Because of its simplicity and its intuitive and practical interpretability, we particularly recommend using the interval approach instead. For the common situation of two evaluators and two nominal categories, we primarily discussed kappa and weighted kappa statistics and mentioned several issues surrounding their usefulness to assess agreement. Trying to describe the degree of agreement among two or more evaluators by a single index is overly optimistic. Even for the simplest case of two evaluators and binary categories, besides reporting the value for a kappa statistic, one should report the proportion of overall agreement and the proportions of agreement specific to each category. For multiple evaluators, the strategy for two evaluators can be adopted for pairs of evaluators. If the underlying attribute is continuous, one can use log-linear models, latent class models, or generalized estimating equations to investigate, among other things, interevaluator agreement. A similar strategy can be used for the case of nominal or ordinal scale data, although for ordinal data, one should avoid the use of kappa. The design of an agreement study is very important. At the very least, design aspects that must be considered include the procedures for selecting the subjects and the evaluators. The population from which the subjects are selected must be well–defined, and the evaluators must be well trained/validated. We advocate collecting r replicate evaluations for each subject by each evaluator. If the r replicate evaluations are to be completed over time, then the time interval must be chosen to minimize the possibility of correlated ratings while still ensuring that the attribute being observed has not meaningfully changed in the population or sample. A consequence of violating the latter requirement is that the ratings would become time-dependent (i.e., not interchangeable). Other important issues regarding the general design of agreement studies are discussed in References 11, 13, 44, 65, and 66. For the particular issue of sample size, see References 15, 67, and 68, for continuous data, and see References 33, 69, and 70 for categorical data.
5 ACKNOWLEDGMENTS
The authors thank Dr. Christy Chuang-Stein and two referees for their valuable comments and suggestions that improved this article. REFERENCES 1. J. J. Bartko, Measurements of agreement: a single procedure. Stat. Med. 1994; 13: 737–745. 2. W. E. Deming, Statistical Adjustment of Data. New York: Wiley, 1943. 3. U. Feldmann, B. Schneider, H. Klinkers, and R. Haeckel, A multivariate approach for the biometric comparison of analytical methods in clinical chemistry. J. Clin. Chem. Clin. Biochem. 1981; 19: 121–137. 4. U. Feldmann and B. Schneider, Bivariate structural regression analysis: a tool for the comparison of analytic methods. Methods Informat. Med. 1987; 6: 205–214. 5. H. Passing and W. A. Bablok, A new biometrical procedure for testing the equality of measurements from two different analytical methods. J. Clin. Chem. Clin. Biochem. 1983; 21: 709–720. 6. E. L. Bradley and L. G. Blackwood, Comparing paired data: a simultaneous test of means and variances. Am. Stat. 1989; 43: 234–235. 7. K. Linnet, Estimation of the linear relationship between the measurements of two methods with proportional errors. Stat. Med. 1990; 9: 1463–1473. 8. B. Zhong and J. Shao, Testing the agreement of two quantitative assays in individual means. Commun. Stat. Theory Meth. 2002; 31: 1283–1299. 9. B. Zhong and J. Shao, Evaluating the agreement of two quantitative assays with repeated measurements. J. Biopharmaceut. Stat. 2003; 13: 75–86. 10. L. I-K. Lin, A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989; 45: 255–268. 11. J. M. Bland and D. G. Altman, Statistical methods for assessing agreement between two methods of clinical measurement. Stat. Methods Med. Res. 1999; 8: 135–160. 12. J. L. Fleiss, The Design and Analysis of Clinical Experiments. New York: John Wiley & Sons, 1986. 13. N. J.-M. Blackman, Reproducibility of clinical data I: continuous outcomes. Pharmaceut. Stat. 2004a; 3: 99–108.
14. J. C. Cappelleri and N. Ting, A modified large sample approach to approximate interval estimation for a particular intraclass correlation coefficient. Stat. Med. 2003; 22: 1861–1877. 15. Y. Saito, T. Sozu, C. Hamada, and I. Yoshimura, Effective number of subjects and number of raters for inter-rater reliability studies. Stat. Med. 2006; 25: 1547–1560. 16. V. Rousson, T. Gasser, and B. Seifert, Assessing intrarater, interrater and test-retest reliability of continuous measurements. Stat. Med. 2002; 21: 3431–3446. 17. R. Muller and P. Buttner, A critical discussion of intraclass correlation coefficients. Stat. Med. 1994; 13: 2465–2476. 18. A. Atkinson and A. Nevill, Comment on the use of concordance correlation to assess the agreement between two variables. Biometrics 1997; 53: 775–777. 19. C. A. E. Nickerson, A note on ‘a concordance correlation coefficient to evaluate reproducibility.’ Biometrics 1997; 53: 1503–1507. 20. J. J. Z. Liao and J. Lewis, A note on concordance correlation coefficient. PDA J. Pharmaceut. Sci. Technol. 2000; 54: 23–26. 21. J. L. Carrasco and L. Jover, Estimating the generalized concordance correlation coefficient through variance components. Biometrics 2003; 59: 849–858. 22. J. J. Z. Liao, An improved concordance correlation coefficient. Pharmaceut. Stat. 2003; 2: 253–261. 23. M. P. Fay, Random marginal agreement coefficients: rethinking the adjustment for chance when measuring agreement. Biostatistics 2005; 6: 171–180. 24. J. J. Z. Liao, Agreement for curved data. J. Biopharmaceut. Stat. 2005; 15: 195–203. 25. J. J. Z. Liao, R. C. Capen, and T. L. Schofield, Assessing the reproducibility of an analytical method. J. Chromat. Sci. 2006a; 44: 119–122. 26. J. J. Z. Liao, R. C. Capen, and T. L. Schofield, Assessing the concordance of two measurement methods. ASA Proceedings on Section of Biometrics CD, 2006b. 660–667. 27. R. A. Deyo, P. Diehr, and D. L. Patrick, Reproducibility and responsiveness of health status measures: statistics and strategies for evaluation. Control. Clin. Trial 1991; 12: 142S–158S. 28. J. M. Bland and D. G. Altman, Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986; 2: 307–310.
29. J. Ludbrook, Comparing methods of measurement. Clin. Exp. Pharmacol. Physiol. 1997; 24: 193–203. 30. W. G. Hopkins, Bias in Bland-Altman but not regression validity analyses. Sportscience 2004; 8: 42–46. 31. J. Cohen, A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960; 20: 37–46. 32. M. Banerjee, M. Capozzoli, L. McSweeney, D. Sinha, Beyond kappa: a review of interrater agreement measures. Canadian J. Stat. 1999; 27: 3–23. 33. J. Sim and C. C. Wright, The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Phys. Ther. 2005; 85: 257–268. 34. T. P. Hutchinson, Kappa muddles together two sources of disagreement: tetrachoric correlation is preferable. Res. Nurs. Health 1993; 16: 313–315. 35. R. Zwick, Another look at interrater agreement. Psychol. Bull. 1988; 103: 374–378. 36. J. Cohen, Weighted kappa: nominal scale agreement with provisions for scaled disagreement or partial credit. Psychol. Bull. 1968; 70: 213–220. 37. J. L. Fleiss, J. Cohen, and B. S. Everitt, Large sample standard errors of kappa and weighted kappa. Psychol. Bull. 1969; 72: 323–327. 38. D. A. Bloch and H. C. Kraemer, 2 × 2 Kappa coefficients: measures of agreement or association. Biometrics 1989; 45: 269–287. 39. D. V. Cicchetti and J. L. Fleiss, Comparison of the null distributions of weighted kappa and the C ordinal statistic. Appl. Psychol. Meas. 1977; 1: 195–201. 40. J. L. Fleiss and D. V. Cicchetti, Inference about weighted kappa in the non-null case. Appl. Psychol. Meas. 1978; 2: 113–117. 41. A. Donner and M. Eliasziw, Statistical implications of the choice between a dichotomous or continuous trait in studies of interobserver agreement. Biometrics 1994; 50: 550–555. 42. E. Bartfay and A. Donner, The effect of collapsing multinomial data when assessing agreement. Internat. J. Epidemiol. 2000; 29: 1070–1075. 43. H. C. Kraemer, V. S. Periyakoil, and A. Noda, Kappa coefficient in medical research. Stat. Med. 2002; 21: 2109–2129. 44. N. J.-M. Blackman, Reproducibility of clinical data II: categorical outcomes. Pharmaceut. Stat. 2004b; 3: 109–122.
45. P. Rothery, A nonparametric measure of intraclass correlation. Biometrika 1979; 66: 629–639. 46. J. L. Fleiss, Measuring nominal scale agreement among many raters. Psychol. Bull. 1971; 76: 378–382. 47. M. Davies and J. L. Fleiss, Measuring agreement for multinomial data. Biometrics 1982; 38: 1047–1051. 48. H. C. Kraemer, Extension of the kappa coefficient. Biometrics 1980; 36: 207–216. 49. M. A. Tanner and M. A. Young, Modeling agreement among raters. J. Am. Stat. Assoc. 1985; 80: 175–180. 50. A. Agresti, Modelling patterns of agreement and disagreement. Stat. Methods Med. Res. 1992; 1: 201–218. 51. P. Graham, Modelling covariate effects in observer agreement studies: the case of nominal scale agreement. Stat. Med. 1995; 14: 299–310. 52. W. R. Dillon and N. Mulani, A probabilistic latent class model for assessing inter-judge reliability. Multivar. Behav. Res. 1984; 19: 438–458. 53. M. Aickin, Maximum likelihood estimation of agreement in the constant predictive model, and its relation to Cohen's kappa. Biometrics 1990; 46: 293–302. 54. J. S. Uebersax and W. M. Grove, Latent class analysis of diagnostic agreement. Stat. Med. 1990; 9: 559–572. 55. Y. Qu, M. Tan, and M. H. Kutner, Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics 1996; 52: 797–810. 56. S. L. Hui and X. H. Zhou, Evaluation of diagnostic tests without gold standards. Stat. Methods Med. Res. 1998; 7: 354–370. 57. S. R. Lipsitz and G. M. Fitzmaurice, Estimating equations for measures of association between repeated binary responses. Biometrics 1996; 52: 903–912. 58. J. M. Williamson, A. K. Manatunga, and S. R. Lipsitz, Modeling kappa for measuring dependent categorical agreement data. Biostatistics 2000; 1: 191–202. 59. N. Klar, S. R. Lipsitz, and J. G. Ibrahim, An estimating equations approach for modelling kappa. Biomet. J. 2000; 42: 45–58. 60. E. Gonin, S. R. Lipsitz, G. M. Fitzmaurice, and G. Molenberghs, Regression modelling of weighted κ by using generalized estimating equations. Appl. Stat. 2000; 49: 1–18.
61. J. S. Uebersax, Diversity of decision-making models and the measurement of interrater agreement. Psychol. Bull. 1987; 101: 140–146. 62. A. R. Feinstein and D. V. Cicchetti, High agreement but low kappa I: the problems of two paradoxes. J. Clin. Epidemiol. 1990; 43: 543–548. 63. J. R. Landis and G. G. Koch, The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159–174. 64. D. L. Streiner and G. R. Norman, Health Measurement Scales: A Practical Guide to their Development and Use, 3rd ed. Oxford, UK: Oxford University Press, 2003. 65. H. C. Kraemer, Evaluating Medical Tests. Objective and Quantitative Guidelines. Newbury Park, CA: Sage, 1992. 66. P. Graham and R. Jackson, The analysis of ordinal agreement data: beyond weighted kappa. J. Clin. Epidemiol. 1993; 9: 1055–1062. 67. D. G. Bonett, Sample size requirements for estimating intraclass correlations with desired precision. Stat. Med. 2002; 21: 1331–1335. 68. J. J. Z. Liao, Sample size calculation and concordance assessment for an agreement study. ENAR presentation, 2004. 69. A. Donner, Sample size requirements for the comparison of two or more coefficients of interobserver agreement. Stat. Med. 1997; 15: 1157–1168. 70. S. D. Walter, M. Eliasziw, and A. Donner, Sample size and optimal designs for reliability studies. Stat. Med. 1998; 17: 101–110.
CROSS-REFERENCES
Interrater reliability
Reproducibility
Kappa statistic
Weighted kappa
MULTIPLE RISK FACTOR INTERVENTION TRIAL (MRFIT)
LYNN E. EBERLY School of Public Health, University of Minnesota, Minneapolis, Minnesota
JEREMIAH STAMLER Feinberg School of Medicine, Northwestern University, Chicago, Illinois
LEWIS H. KULLER School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania
JAMES D. NEATON School of Public Health, University of Minnesota, Minneapolis, Minnesota
The Multiple Risk Factor Intervention Trial (MRFIT) was a nationwide randomized trial on primary prevention of coronary heart disease (CHD) death, sponsored by the National Heart, Lung, and Blood Institute within the United States National Institutes of Health, and it was conducted during 1972–1982. The study cohort consisted of men (baseline ages 35–57 years) at higher risk for CHD death, but with no clinical evidence of cardiovascular disease (CVD); women were not included because of substantially lower CHD rates. The intervention tested was multifactor: intensive counseling for smoking cessation, dietary advice particularly to lower serum cholesterol, and stepped-care pharmacologic treatment for hypertension (primarily with diuretics). The primary outcome was CHD mortality, with an observation period of 6 years minimum and 7 years on average (1,2). Since the end of active follow-up in 1982, the 361,662 men who were screened during MRFIT recruitment, which includes the 12,866 who were randomized, have been followed for mortality date and cause through Social Security Administration and National Death Index searches (3,4).
1 TRIAL DESIGN
1.1 Overview
The objective of MRFIT was to test whether a multifactor intervention would result in lower CHD mortality among men, baseline ages 35–57 years, at higher risk for CHD as measured by three established major risk factors: blood pressure (BP), serum cholesterol, and cigarette smoking. Eligible high-risk men were randomized to either the ‘‘Special Intervention’’ (SI) group, which targeted cessation of smoking, lowering serum cholesterol, and lowering BP, or to the ‘‘Usual Care’’ (UC) group; the latter were referred to their personal physicians or a community care clinic and did not receive any study intervention.
1.2 Design of the Multifactor Intervention for the SI Group
1.2.1 Cigarette Smoking. Each cigarette smoker was counseled individually by a MRFIT physician immediately after randomization. Dosage reduction (low tar, low nicotine cigarettes) was not recommended (5,6). 1.2.2 Nutrition. The nutrition intervention did not stipulate a structured diet, but instead it aimed to encourage lifelong shopping, cooking, and eating patterns concordant with saturated fat intake < 10% of calories, dietary cholesterol < 300 mg/day, polyunsaturated fat ∼ 10% of calories, with a fare moderate in total fat and calories to facilitate prevention and control of obesity and elevated BP (7). 1.2.3 Hypertension. Antihypertensive drugs were prescribed using a stepped-care protocol beginning with the use of an oral diuretic (either hydrochlorothiazide or chlorthalidone); other drugs were sequentially added if the BP goal (either a 10 mm Hg reduction in diastolic BP, or diastolic BP no more than 89 mm Hg) was not reached. Before drug prescription, weight loss was attempted for overweight men (8).
1.2.4 Simultaneous Efforts. Shortly after randomization, each man was invited (with spouse or friend) to a 10-week series of weekly group discussions covering all three risk factors; common educational materials were used at all sites. After this initial phase, individual counseling by an intervention team (behavioral scientist, nutritionist, nurse, physician, and general health counselor) and measurement of risk factors was provided every 4 months; specific risk factor goals were set for each person. Men could be examined at more frequent intervals based on their changes in risk factors and intervention team recommendations (1,9). 1.3 Sample Size and Power Calculations The MRFIT design stipulated recruitment of at least 12,000 men ages 35 to 57 years; the men had to be at elevated risk of CHD mortality but without clinical signs of CVD at baseline. The randomized sample size for MRFIT of 12,866 was estimated to provide 88% power to detect a 26.6% reduction in risk of CHD death over 6 years (29.0/1000 men in UC compared with 21.3/1000 men in SI) using a one-sided test of proportions and a type I error of 5%. Behind this calculation were several steps with key assumptions about level of CHD risk in the UC group and anticipated intervention effects. First, a multiple logistic regression model for 6-year CHD death—as a function of serum cholesterol, diastolic BP, and cigarettes smoked/day—was developed using Framingham Heart Study data. Second, variables that represented the screening levels of diastolic BP, serum cholesterol, and cigarettes/day for the MRFIT randomized men were entered into the logistic risk model to project the 6-year CHD death rate of 29.0/1000 men in UC. Third, a reduction of the CHD death rate to 21.3/1000 men in SI was projected based on the following assumptions about reductions of screening risk factor levels for SI men: (1) a 10% reduction in serum cholesterol for those at a level of 220 mg/dL or more (no reduction for others); (2) a 10% reduction in diastolic BP for those at a level of 95 mm Hg or more (no reduction for others); (3) a 25% reduction for smokers of 40 or more cigarettes/day, a 40% reduction for
smokers of 20–39/day, and a 55% reduction for smokers of <20/day. It was also assumed that reductions would be observed for UC smokers of 5%, 10%, and 15%, respectively, but with no reductions in cholesterol or BP for UC men. Additional assumptions were made about levels of non-adherence in the SI group, and duration of sustained intervention required to achieve the projected reductions in risk factor levels (1,10). 2 TRIAL SCREENING AND EXECUTION Continual attention to quality assurance was heavily emphasized in all phases of MRFIT (11). 2.1 Determination of Risk and Trial Eligibility Trial eligibility, determined over three screening visits (S1, S2, S3), was based primarily on diastolic BP, cigarettes smoked/day, and serum cholesterol, among other factors. At S1, BP was measured, after the screenee had been seated for 5 minutes, three times at 2-minute intervals using a standard mercury sphygmomanometer; the average of the second and third readings was used for screening. Casual serum samples were divided into two aliquots, one of which was analyzed at a local laboratory that had been standardized by the Centers for Disease Control; the other was retained for analysis by the Central Laboratory at the Institute for Medical Sciences in San Francisco. In all, 336,117 of the 361,662 men screened were excluded at S1 for age (<35 or >57 years old), expected geographic mobility, for extremely high risk (history of either heart attack or diabetes that required medication, serum cholesterol 350+ mg/dL, or diastolic BP 115+ mm Hg), or for ‘‘lower risk status.’’ Lower risk status was determined from a risk score, based on the Framingham logistic regression model for 6-year CHD death, using each man’s S1 cholesterol, diastolic BP, and cigarettes/day. Lower risk status was defined as below the 85th percentile of risk score, later changed to below the 90th percentile. Thus, men had to have a risk score in the upper 10–15% to be considered for inclusion (4,9,10).
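Returning briefly to the design assumptions in Section 1.3: the quoted 88% power can be reproduced approximately with a standard one-sided two-proportion calculation. The sketch below (Python) uses the 6-year CHD death rates given above (29.0 vs. 21.3 per 1000) and roughly 6,433 men per group; it is an illustrative approximation, not the original design computation, which also accounted for nonadherence and the duration of intervention needed to achieve the projected risk factor reductions.

```python
from math import sqrt
from scipy.stats import norm

def power_two_proportions(p_uc, p_si, n_per_group, alpha=0.05):
    """Approximate power of a one-sided z test comparing two proportions."""
    p_bar = (p_uc + p_si) / 2.0
    se0 = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)      # SE under H0
    se1 = sqrt(p_uc * (1 - p_uc) / n_per_group
               + p_si * (1 - p_si) / n_per_group)           # SE under H1
    z_alpha = norm.ppf(1 - alpha)
    return norm.sf((z_alpha * se0 - (p_uc - p_si)) / se1)

# 6-year CHD death rates assumed in the MRFIT design (per participant)
print(power_two_proportions(0.0290, 0.0213, n_per_group=6433))
# prints roughly 0.87-0.88, consistent with the 88% power quoted in Section 1.3
```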
Of the remaining 25,545 men, 9,754 were excluded at S2 for body weight ≥150% of desirable weight (0.9 times average weight for men of the same height in the 1960–1962 National Health Survey), heavy drinking, angina pectoris (from Rose questionnaire), history or resting ECG evidence of myocardial infarction (MI), use of lipid-lowering drugs, untreated diabetes based on a 75 g glucose tolerance test, diabetes treatment with insulin or oral antihyperglycemics, illness/ disability incompatible with participation, prescribed diets incompatible with the intervention, diastolic BP 120+ mm Hg, or for not attending S2. An additional 1,680 men were eligible for but did not attend S3, and they were excluded. The remaining 12,866 men consented to participation and were randomized, 6,428 to SI and 6,438 to UC (4,9,10). 2.2 Recruitment and Randomization Recruitment was carried out in 1973–1975 by 22 clinics in 18 cities across the United States. S1 screenings were held in civic, religious, and public health facilities, places of employment, shopping centers, and by doorto-door recruitment. S2 and S3 screenings were held in study clinics. After eligibility and consent to participate had been established, clinic coordinators called the coordinating center for the randomization assignment. Allocation to treatment arm was stratified by clinic and balanced in blocks of size four or six (1,12).
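As an illustration of the allocation scheme just described (stratified by clinic, permuted blocks of size four or six with equal allocation to SI and UC), a hedged sketch follows (Python). The exact mechanics used by the coordinating center are not described in this article, so the details below are assumptions.

```python
import random

def blocked_assignments(n_participants, block_sizes=(4, 6), seed=None):
    """Generate SI/UC assignments for one clinic (stratum) using randomly
    chosen permuted blocks with equal allocation within each block."""
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_participants:
        size = rng.choice(block_sizes)
        block = ["SI"] * (size // 2) + ["UC"] * (size // 2)
        rng.shuffle(block)
        assignments.extend(block)
    return assignments[:n_participants]

# one independent randomization list per clinic (stratification by clinic)
schedules = {clinic: blocked_assignments(600, seed=clinic)
             for clinic in range(1, 23)}
print(schedules[1][:12])
```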
2.3 During-Trial Follow-Up
At or about each anniversary of randomization, SI and UC participants returned to their clinical center for assessment of risk factor levels and morbidity status. A 24-hour dietary recall was repeated at four (UC) or five (SI) of the first six follow-up visits by nutritionists trained in standardized procedures. Data collection forms were reviewed and edited at each clinic by a clinic coordinator, then mailed to the coordinating center for data entry, editing, and analysis. Each form received a unique log number, and it was double-keyed and checked against a data dictionary for out-of-range values and logic errors; samples of each form type were later keyed a third time and checked against the database. Blood samples were processed by the Central Laboratory; periodic quality control was implemented by the Centers for Disease Control Lipid Standardization Laboratory. During-trial follow-up was above 90% through 6 years for each of the SI and UC groups. The last day of active follow-up was 28 February 1982 (1,9,13,14).
2.4 Mortality and Morbidity Ascertainment
Four key endpoints, each considered singly, were of interest: CHD mortality (the primary endpoint), all-cause mortality, nonfatal MI or CHD mortality, and CVD mortality (the secondary endpoints). Clinics learned of deaths through contact with family or friends of the participants, routine clinic follow-up of missed visits, responses to change-of-address postcards sent semiannually to all participants, and searches of publicly accessible files of deceased persons. Cause of death was determined by a Mortality Review Committee (a three-member panel of cardiologists independent of any MRFIT clinic and blinded to treatment arm and interim results) by reviewing clinic and hospital records, next-of-kin interviews, death certificates, and autopsy reports. Nonfatal MI ascertainment was based on follow-up ECG findings and hospitalization records (1,15).
3 FINDINGS AT THE END OF INTERVENTION
3.1 Baseline Differences and During-Trial Changes in Risk Factors
The SI and UC groups were not statistically different at baseline (S1, S2, or S3) in levels of numerous risk factors and related measures. Significant differences were observed between the two groups at each annual follow-up visit, with larger reductions in risk factor levels observed in the SI group. Average diastolic BP was reduced from 91.0 at S3 to 80.5 mm Hg at visit 6 in SI and from 90.9 to 83.6 mm Hg in UC. The percent of smokers was reduced from 59.3% at S3 to 32.3% at visit 6 in SI and from 59.0% to 45.6% in UC. Average plasma cholesterol was reduced from 240.3 at S2 to 228.2 mg/dL at visit 6 in SI and from 240.6 to 233.1 mg/dL in UC (P < 0.01 for each of these comparisons) (1).
3.2 Primary and Secondary Outcomes With an average follow-up of 7 years through 28 February 1982, there were 265 deaths among SI men (138 CVD, 115 CHD) and 260 deaths among UC men (145 CVD, 124 CHD). All-cause death rates were 41.2 per 1000 men (21.5/1000 for CVD, 17.9/1000 for CHD) in SI, and 40.4 per 1000 (22.5/1000 for CVD, 19.3/1000 for CHD) in UC. Thus, for the primary outcome, there was a 7.1% lower CHD death rate among SI men [relative difference in proportion dead (pUC − pSI)/pUC = 7%, 90% CI (−15%, 25%)]. The observed CHD death rate of 17.9/1000 in SI men was lower than the rate of 21.3/1000 for which the study had been powered, but the observed CHD death rate of 19.3/1000 in UC men was substantially lower than the Framingham-projected rate of 29.0/1000 (1). The total death rate was 2.1% higher for SI than UC, whereas the CVD death rate was 4.7% lower for SI than UC. A slight CHD and total mortality advantage for SI, which developed around year 2, waned by year 5. The combined outcome of CHD mortality or acute MI was lower in the SI men (395 events) compared to the UC men (431 events) [relative difference 8%, 95% CI (−5%, 20%)] (1,15). Additional analyses (done for the first time in the 1990s) considered four combined morbidity/mortality endpoints in accordance with approaches in general use by then in randomized trials. The combined endpoints, based on during-trial data through 28 February 1982, included a clinical cardiac event endpoint: earliest of significant serial change in ECG, clinical non-fatal myocardial infarction (MI), congestive heart failure (CHF), ECG-diagnosed left ventricular hypertrophy (LVH), or CHD death. The SI/UC hazard ratio for this endpoint was 0.87 [95% CI (0.76, 0.99), adjusted for age at screening, baseline screening factors, and baseline presence of resting and/or exercise ECG abnormalities], a 13% reduction in risk for the SI group. A clinical cardiovascular endpoint included: earliest of nonfatal stroke, significant serial change in ECG, clinical nonfatal MI, CHF, ECG LVH, coronary bypass surgery, peripheral arterial occlusion, impaired renal function, accelerated hypertension, or CVD death, and
resulted in a hazard ratio of 0.86 [95% CI (0.78, 0.95), adjusted as above] (16). 3.3 Lessons Learned from this ‘‘Landmark Trial’’ Potential reasons for the lack of significant findings in the original primary outcome analyses in MRFIT have been detailed (1,17). The trial design assumed large SI/UC differences in risk factor levels early in the trial, followed by increasing noncompliance; what was observed were small differences, followed by long-term maintenance of those differences. The extent of the UC risk factor changes was unanticipated, and the numbers of UC deaths after 6 years (219 total, 104 CHD) were substantially smaller than expected (442 total, 187 CHD) from the Framingham-based predictions. Thus, overestimated event rates for UC, and an overestimated group difference, led to a study with power of only 60% rather than 88% to detect a significant difference for the primary endpoint of CHD death. The lower-than-expected event rate for UC was caused in part by less stringent exclusion criteria applied to the Framingham database than those applied to the MRFIT screenees. Also, during the trial, a secular trend of decreasing CHD death rates was observed in the U.S., perhaps reflecting improving treatments for heart disease and growing efforts to control major CVD risk factors (1,17). 4 LONG-TERM FOLLOW-UP Death dates and causes through 1999 have been determined through searches of Social Security Administration and National Death Index files for all 361,662 men screened, with median follow-up of 24.7 years (3,4). This long-term mortality follow-up in such a large cohort, with demographic information and major CVD risk factor data, is a tremendous resource for medicine and public health, and it has been used extensively. 4.1 Primary Outcome Results at 10.5 Years and at 16 Years Mortality follow-up through 1985 (average follow-up 10.5 years from MRFIT baseline)
showed a CHD death rate lower by 10.6% for SI compared with UC [relative difference in proportion dead (pUC − pSI)/pUC = 10.6%, 90% CI (−4.9%, 23.7%)], with the largest reduction for death from acute MI (relative difference 24.3%, P-value = 0.02) (18). Mortality follow-up through 1990 (average follow-up 15.8 years) showed CHD death lower by 11.4% for SI compared with UC [90% CI (−1.9%, 23.0%)], with the largest reduction again for death from acute MI [relative difference 20.4%, 95% CI (3.4%, 34.4%), P-value = 0.02] (19).
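The relative differences and confidence intervals quoted in Sections 3.2 and 4.1 can be approximated directly from the reported death counts. The following minimal sketch uses a large-sample normal approximation on the log relative risk; it reproduces the end-of-trial CHD comparison of Section 3.2 to within rounding, although the published interval may rest on a slightly different method.

```python
# Approximate reconstruction of the primary-outcome summary in Section 3.2:
# the relative difference in proportion dead of CHD, (pUC - pSI)/pUC, with
# a 90% confidence interval from a normal approximation on log(RR).
from math import exp, log, sqrt
from scipy.stats import norm

d_si, n_si = 115, 6428   # CHD deaths / men randomized, special intervention
d_uc, n_uc = 124, 6438   # CHD deaths / men randomized, usual care

p_si, p_uc = d_si / n_si, d_uc / n_uc
rr = p_si / p_uc
se_log_rr = sqrt((1 - p_si) / d_si + (1 - p_uc) / d_uc)
z = norm.ppf(0.95)                                   # 90% two-sided interval
lo, hi = exp(log(rr) - z * se_log_rr), exp(log(rr) + z * se_log_rr)

print(f"Relative difference: {1 - rr:.1%}")          # about 7%
print(f"90% CI: ({1 - hi:.0%}, {1 - lo:.0%})")       # about (-15%, 25%)
```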
4.2 Longer-Term Impacts of the MRFIT The results of MRFIT had a substantial impact on the treatment and management of hypertension (20–22), including the use of lower dose diuretics (23), and ultimately led to ALLHAT (the Antihypertensive and Lipid Lowering Treatment to Prevent Heart Attack Trial) (24,25). Epidemiologic results from mortality follow-up of the randomized men showed the importance of white blood cell count (26,27), HDL cholesterol (28), pulmonary function (29,30), and C-reactive protein (31) for cause-specific mortality.
5 EPIDEMIOLOGIC FINDINGS FROM LONG-TERM FOLLOW-UP OF 361,662 MRFIT SCREENEES
Analyses from long-term mortality follow-up of the 361,662 screened men have been used to address important public health and medical care issues.
Results from the screened men have had a substantial impact in particular on guidelines for treatment of high blood cholesterol and BP. Risk of CHD mortality, for example, was shown not to be restricted to those with ‘‘hypertension’’ or those with ‘‘hypercholesterolemia.’’ The relationship of BP with CHD death is strong, continuous, graded, and independent of other major risk factors, as is the relationship of serum cholesterol with CHD death; cigarette smoking amplifies these risks. This is all true for CVD mortality, all-cause mortality, and end-stage renal disease (ESRD) as well as CHD mortality (32–36). Pulse pressure is not as strong a predictor of mortality as either systolic BP or diastolic BP, systolic BP is a stronger single predictor than diastolic BP of CHD or CVD risk, and both systolic and diastolic BP together predict better than each alone (33,34,37). The extraordinary sample size of this cohort, and its long-term follow-up, made possible for the first time the measurement of CHD and CVD risk for men assessed to be at lower risk at baseline (38). Nonsmokers with favorable levels of serum cholesterol (<200 mg/dL) and blood pressure (≤120/≤80 mmHg), no prior heart attack, and no diabetes (a rare combination) are at substantially lower risk of mortality compared to all others; for these low-risk men, CHD and CVD rates are endemic, not epidemic (39). For low-risk men, both non-Hispanic White and African-American, death rates are lower, irrespective of socioeconomic status (SES), compared with higher-risk men (40). Most (87–100%) of those who develop CVD have antecedent cardiovascular risk factors (41,42). History of diabetes was shown to be a powerful predictor of CHD mortality (35), but slightly less powerful than history of heart attack (43). Socioeconomic status (SES) differences between Black and White screenees were strongly predictive of differences in CVD mortality and ESRD (44–47).
6 CONCLUSIONS
Although significant findings were not observed for any of the four key pre-specified endpoints, significantly lower event rates for SI compared with UC were found for acute MI mortality and for combined cardiovascular morbidity/mortality endpoints. The MRFIT was a large and complex trial that was operationally successful: Recruitment exceeded design goals and was completed in only 28 months; randomization resulted in two balanced groups. Loss to follow-up was
less than expected throughout the trial; 91% of those alive attended their sixth visit; only 30 of 12,866 men had unknown vital status at the end of the study. Thorough ascertainment of dates and classification of causes of death (blinded to group assignment) were achieved. The intervention was largely a success for the SI men: Smoking cessation was greater than anticipated. Diastolic BP reduction was larger than expected. Cholesterol was lowered considerably, although not to the target level. Average risk factor levels continued to decline each year, instead of regressing back toward baseline. However, contrary to design expectations, UC risk factor reductions were also considerable, although smaller than in SI, and UC death rates were less than expected, which resulted in a study underpowered for its primary endpoint of CHD death (1). Continued analyses of the MRFIT data, based on extended mortality follow-up, have provided evidence for numerous substantial public health and medical advances. The MRFIT bibliography includes 250+ published manuscripts and five monographs. The overarching thrust of much of this body of work is the importance of being at low risk and of primary prevention of elevated risk factor levels through improvement of lifestyles. We have described here only a small fraction of the research contributions based on MRFIT data.
REFERENCES 1. Multiple Risk Factor Intervention Trial Research Group, Multiple Risk Factor Intervention Trial: Risk factor changes and mortality results. JAMA 1982; 248: 1465–1477. 2. J. Stamler, for the Multiple Risk Factor Intervention Trial Research Group, Multiple Risk Factor Intervention Trial. In: H. Hofmann (ed.), Primary and Secondary Prevention of Coronary Heart Disease: Results of New Trials. New York: Springer-Verlag, 1985, pp. 8–33. 3. D. Wentworth, J. D. Neaton, and W. Rasmussen, An evaluation of the Social Security Administration MBR File and the National Death Index in the ascertainment of vital status. Am. J. Publ. Health 1983; 73: 1270–1274.
4. L. E. Eberly, J. D. Neaton, A. J. Thomas, and Y. Dai, for the Multiple Risk Factor Intervention Trial Research Group, Multiple-stage screening and mortality in the Multiple Risk Factor Intervention Trial. Clin. Trials 2004; 1: 148–161. 5. J. Ockene, Multiple Risk Factor Intervention Trial (MRFIT): Smoking cessation procedures and cessation and recidivism patterns for a large cohort of MRFIT participants. In: J. Schwart (ed.), Progress in Smoking Cessation. New York: American Cancer Society, 1979, pp. 183–198. 6. G. H. Hughes, N. Hymowitz, J. K. Ockene, N. Simon, and T. M. Vogt, The Multiple Risk Factor Intervention Trial (MRFIT): V. Intervention on smoking. Prev. Med. 1981; 10: 476–500. 7. A. W. Caggiula, G. Christakis, M. Farrand, S. B. Hulley, R. Johnson, N. L. Lasser, J. Stamler, and G. Widdowson, The Multiple Risk Factor Intervention Trial (MRFIT): IV. Intervention on blood lipids. Prev. Med. 1981; 10: 443–475. 8. J. D. Cohen, R. H. Grimm, and W. McFate Smith, The Multiple Risk Factor Intervention Trial (MRFIT): VI. Intervention on blood pressure. Prev. Med. 1981; 10: 501–518. 9. R. Sherwin, C. T. Kaelber, P. Kezdi, M. O. Kjelsberg, and T. H. Emerson (for the MRFIT), The Multiple Risk Factor Intervention Trial (MRFIT): II. The development of the protocol. Prev. Med. 1981; 10: 402–425. 10. Multiple Risk Factor Intervention Trial Group, Statistical design considerations in the NHLI Multiple Risk Factor Intervention Trial (MRFIT). J. Chron. Dis. 1977; 30: 261–275. 11. Multiple Risk Factor Intervention Trial Research Group. The Multiple Risk Factor Intervention Trial: quality control of technical procedures and data acquisition. Control. Clin. Trials 1986; 7: 1S–202S. 12. J. D. Neaton, R. H. Grimm, and J. A. Cutler, Recruitment of participants for the Multiple Risk Factor Intervention Trial (MRFIT). Control. Clin. Trials 1988; 8: 41S–53S. 13. N. L. Lasser, S. Lamb, P. Dischinger, and K. Kirkpatrick, Introduction: Background and organization of quality control in the Multiple Risk Factor Intervention Trial. Control. Clin. Trials 1986; 7: 1S–16S. 14. A. G. DuChene, D. H. Hultgren, J. D. Neaton, P. V. Grambsch, S. K. Broste, B. M. Aus, and W. L. Rasmussen, Forms control and error detection procedures used at the Coordinating
Center of the Multiple Risk Factor Intervention Trial (MRFIT). Control. Clin. Trials 1986; 7: 34S–45S. 15. Multiple Risk Factor Intervention Trial Research Group, Coronary heart disease death, nonfatal acute myocardial infarction and other clinical outcomes in the Multiple Risk Factor Intervention Trial. Am. J. Cardiol. 1986; 58: 1–13. 16. J. Stamler, J. Shaten, J. Cohen, J. A. Cutler, M. Kjelsberg, L. Kuller, J. D. Neaton, and J. Ockene, for the MRFIT Research Group, Combined nonfatal plus fatal cardiac and cardiovascular events in special intervention (SI) compared to usual care (UC) men during the Multiple Risk Factor Intervention Trial (MRFIT). Canadian J. Cardiol. 1997; 13: 366B, Abstract #1379. 17. A. M. Gotto, The Multiple Risk Factor Intervention Trial (MRFIT). A Return to a Landmark Trial. JAMA 1997; 277: 595–597. 18. Multiple Risk Factor Intervention Trial Group, Mortality rates after 10.5 years for participants in the Multiple Risk Factor Intervention Trial: findings related to a priori hypotheses of the trial. JAMA 1990; 263: 1795–1801.
19. Multiple Risk Factor Intervention Trial Group, Mortality after 16 years for participants randomized to the Multiple Risk Factor Intervention Trial. Circulation 1996; 94: 946–951.
20. Multiple Risk Factor Intervention Trial Group, Baseline rest electrocardiographic abnormalities, antihypertensive treatment, and mortality in the Multiple Risk Factor Intervention Trial. Am. J. Cardiol. 1985; 55: 1–15.
21. L. H. Kuller, S. B. Hulley, J. D. Cohen, and J. D. Neaton, Unexpected effects of treating hypertension in men with ECG abnormalities: a review of the evidence. Circulation 1986; 73: 114–123.
22. J. D. Cohen, J. D. Neaton, R. J. Prineas, and K. A. Daniels, Diuretics, serum potassium and ventricular arrhythmias in the Multiple Risk Factor Intervention Trial. Am. J. Cardiol. 1987; 60: 548–554.
23. J. D. Neaton, R. H. Grimm, R. J. Prineas, G. Grandits, P. J. Elmer, J. A. Cutler, J. M. Flack, J. A. Schoenberger, R. MacDonald, et al., for the Treatment of Mild Hypertension Study Research Group. Treatment of Mild Hypertension Study (TOMHS): final results. JAMA 1993; 270: 713–724.
24. The ALLHAT Officers and Coordinators for the ALLHAT Collaborative Research Group, Major cardiovascular events in hypertensive patients randomized to doxazosin vs chlorthalidone: the Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial (ALLHAT). JAMA 2000; 283: 1967–1975.
25. The ALLHAT Officers and Coordinators for the ALLHAT Collaborative Research Group, Major outcomes in high-risk hypertensive patients randomized to angiotensin-converting enzyme inhibitor or calcium channel blocker vs diuretic. The Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial (ALLHAT). JAMA 2002; 288: 2981–2997.
26. R. H. Grimm, J. D. Neaton, and B. Ludwig, The prognostic importance of total white blood cell count for all cause, cardiovascular and cancer mortality. JAMA 1985; 254: 1932–1937.
27. A. N. Phillips, J. D. Neaton, D. G. Cook, R. H. Grimm, and A. G. Shaper, The white blood cell count and risk of lung cancer. Cancer 1992; 69: 681–684.
28. D. J. Gordon, J. L. Probstfield, R. J. Garrison, J. D. Neaton, W. P. Castelli, J. D. Knoke, D. R. Jacobs Jr., Bangdiwala, and H. A. Tyroler, High-density lipoprotein cholesterol and cardiovascular disease. Four prospective American studies. Circulation 1989; 79: 8–15.
29. L. H. Kuller, J. Ockene, E. Meilahn, and K. H. Svendsen, Relation of forced expiratory volume in one second (FEV1) to lung cancer mortality in the Multiple Risk Factor Intervention Trial (MRFIT). Am. J. Epidemiol. 1990; 132: 265–274.
30. L. E. Eberly, J. Ockene, R. Sherwin, L. Yang, and L. Kuller, for the Multiple Risk Factor Intervention Trial Group, Pulmonary function as a predictor of lung cancer mortality in continuing cigarette smokers and in quitters. Int. J. Epidemiol. 2003; 32: 592–599.
31. L. H. Kuller, R. P. Tracy, J. Shaten, and E. N. Meilahn, for the MRFIT Research Group. Relation of C-reactive protein and coronary heart disease in the MRFIT Nested Case-Control Study. Am. J. Epidemiol. 1996; 144: 537–547.
32. J. Stamler, D. Wentworth, and J. D. Neaton, Is the relationship between serum cholesterol and risk of premature death from Coronary Heart Disease continuous and graded? JAMA 1986; 256: 2823–2828.
33. J. Stamler, J. D. Neaton, and D. N. Wentworth, Blood pressure (systolic and diastolic) and risk of fatal coronary heart disease. Hypertension 1989; 13: 2–12.
34. J. D. Neaton and D. Wentworth, Influence of serum cholesterol, blood pressure, and cigarette smoking on death from coronary heart disease in 316,099 white men age 35–57 years: overall findings and differences by age. Arch. Intern. Med. 1992; 152: 56–64.
35. J. Stamler, O. Vaccaro, J. D. Neaton, and D. Wentworth, Diabetes, other risk factors, and 12-year cardiovascular mortality for men screened in the Multiple Risk Factor Intervention Trial. Diabetes Care 1993; 16: 434–444.
36. M. J. Klag, P. K. Whelton, B. L. Randall, J. D. Neaton, F. L. Brancati, C. E. Ford, N. B. Shulman, and J. Stamler, Blood Pressure and End-Stage Renal Disease in Men. N. Engl. J. Med. 1996; 334: 13–18.
37. M. Domanski, M. Pfeffer, J. Neaton, J. Norman, K. Svendsen, R. Grimm, J. Cohen, and J. Stamler, for the MRFIT Research Group, Pulse pressure and cardiovascular mortality: 16-year findings on 342,815 men screened for the Multiple Risk Factor Intervention Trial (MRFIT). JAMA 2002; 287: 2677–2683.
38. J. Stamler, J. D. Neaton, D. Wentworth, J. Shih et al., Lifestyles and life-style related risk factors: their combined impact in producing epidemic cardiovascular disease and the potential for prevention. In: A. M. Gotto, C. Lenfant, R. Paoletti, M. Some (eds.), Multiple Risk Factors in Cardiovascular Disease. Norwell, MA: Kluwer, 1992, pp. 19–25.
39. J. Stamler, R. Stamler, J. D. Neaton, D. N. Wentworth, M. Daviglus, D. Garside, A. Dyer, P. Greenland, and K. Liu, Low risk-factor profile and long-term cardiovascular and noncardiovascular mortality and life expectancy: findings for 5 large cohorts of young adult and middle-aged men and women. JAMA 1999; 282: 2012–2018.
40. J. Stamler, J. D. Neaton, D. Garside, and M. L. Daviglus, Current status: six established major risk factors - and low risk. In: M. Marmot and P. Elliott (eds.), Coronary Heart Disease Epidemiology: From Aetiology to Public Health. 2nd ed. London: Oxford University Press, 2005, pp. 32–70.
41. P. Greenland, M. D. Knoll, J. Stamler, J. D. Neaton, A. R. Dyer, D. B. Garside, and P. W. Wilson, Major risk factors are nearly universal antecedents of clinical coronary heart disease in three large U.S. cohorts. JAMA 2003; 290: 891–897.
42. J. Stamler, Low risk – and the ‘‘no more than 50%’’ myth/dogma. Arch. Intern. Med. 2007; 167: 537–539.
43. O. Vaccaro, L. E. Eberly, J. D. Neaton, L. Yang, G. Riccardi, and J. Stamler, Impact of diabetes and previous myocardial infarction on long-term survival: 18 years mortality follow-up of primary screenees in the Multiple Risk Factor Intervention Trial. Arch. Intern. Med. 2004; 164: 1438–1443.
44. G. Davey Smith, J. D. Neaton, D. N. Wentworth, R. Stamler, and J. Stamler, Socioeconomic differentials in mortality risk among men screened for the Multiple Risk Factor Intervention Trial: Part I - White men. Am. J. Pub. Health 1996; 86: 486–496.
45. G. Davey Smith, D. N. Wentworth, J. D. Neaton, R. Stamler, and J. Stamler, Socioeconomic differentials in mortality risk among men screened for the Multiple Risk Factor Intervention Trial: Part II - Black men. Am. J. Pub. Health 1996; 86: 497–504.
46. G. Davey Smith, J. D. Neaton, D. N. Wentworth, R. Stamler, and J. Stamler, Mortality differences between black and white men in the USA: the contribution of income and other factors among men screened for the MRFIT. Lancet 1998; 351: 934–939.
47. M. J. Klag, P. K. Whelton, B. L. Randall, J. D. Neaton, F. L. Brancati, and J. Stamler, End-stage renal disease in African-American and white men. JAMA 1997; 277: 1293–1298.
FURTHER READING J. Stamler, Established major coronary risk factors: historical overview. In: M. Marmot and P. Elliott (eds.), Coronary Heart Disease Epidemiology: From Aetiology to Public Health. 2nd ed. London: Oxford University Press, 2005, pp. 1–31. J. Stamler, J. D. Neaton, D. B. Garside, and M. L. Daviglus, The major adult cardiovascular diseases: a global historical perspective. In: R. M. Lauer, T. L. Burns, and S. R. Daniels (eds.), Pediatric Prevention of Atherosclerotic Cardiovascular Disease. London: Oxford University Press, 2006, pp. 27–48. J. Stamler, M. Daviglus, D. B. Garside, P. Greenland, L. E. Eberly, L. Yang, and J. D. Neaton, Low risk cardiovascular status: Impact on cardiovascular mortality and longevity. In: R. M. Lauer, T. L. Burns, and S. R. Daniels (eds.), Pediatric Prevention of Atherosclerotic Cardiovascular Disease. London: Oxford University Press, 2006, pp. 49–60.
CROSS-REFERENCES Controlled Clinical Trials Data Quality Assurance Disease Trials for Cardiovascular Diseases Diuretics Intervention Design/Nutritional Intervention/Smoking Cessation Intervention Primary Prevention Trials Regression Dilution Bias Selection Bias Survival Study
MULTIPLE TESTING IN CLINICAL TRIALS
ALEXEI DMITRIENKO
Eli Lilly and Company, Lilly Research Laboratories, Indianapolis, Indiana

JASON C. HSU
Ohio State University, Department of Statistics, Columbus, Ohio

1 INTRODUCTION

Multiplicity problems caused by multiple analyses performed on the same dataset occur frequently in a clinical trial setting. The following are examples of multiple analyses encountered in clinical trials.

• Multiple comparisons. Multiple testing is often performed in clinical trials involving several treatment groups. For example, most Phase II trials are designed to assess the efficacy and safety of several doses of an experimental drug compared to a control.

• Multiple primary endpoints. Multiplicity can be caused by multiple criteria for assessing the efficacy profile of an experimental drug. The multiple criteria are required to accurately characterize various aspects of the expected therapeutic benefits. In some cases, the experimental drug is declared efficacious if it meets at least one of the criteria. In other cases, drugs need to produce significant improvement with respect to all of the endpoints (e.g., new therapies for the treatment of Alzheimer’s disease are required to demonstrate their effects on both cognition and global clinical improvement).

It is commonly recognized that failure to account for multiplicity issues can inflate the probability of an incorrect decision and could lead to regulatory approval of inefficacious drugs and increased patient risks. For this reason, regulatory agencies mandate a strict control of the false-positive (Type I error) rate in clinical trials and require that drug developers perform multiple analyses with a proper adjustment for multiplicity. To stress the importance of multiplicity adjustments, the draft guidance document entitled ‘‘Points to consider on multiplicity issues in clinical trials’’ released by the European Committee for Proprietary Medicinal Products on July 26, 2001 states that

‘‘a clinical study that requires no adjustment of the Type I error is one that consists of two treatment groups, that uses a single primary variable, and has a confirmatory statistical strategy that pre-specifies just one single null hypothesis relating to the primary variable. All other situations require attention to the potential effects of multiplicity’’ (1).

As a result of these regulatory concerns, multiplicity adjustment strategies have received much attention in the clinical trial literature. This article provides a brief review of popular approaches to performing multiple analyses of clinical trial data. It outlines main principles underlying multiple testing procedures and introduces single-step and stepwise multiple tests widely used in clinical applications. See Hochberg and Tamhane (2), Westfall and Young (3), and Hsu (4) for a comprehensive review of multiple decision theory with clinical trial applications. Throughout the article, H01, . . . , H0k will denote the k null hypotheses and HA1, . . . , HAk denote the alternative hypotheses tested in a clinical study. The associated test statistics and P-values will be denoted by T1, . . . , Tk and p1, . . . , pk, respectively.

In order to choose an appropriate multiple testing method, it is critical to select the definition of correct and incorrect decisions that reflects the objective of the study.
2 CONCEPTS OF ERROR RATES
2.1 Comparisonwise Error Rate In the simple case when each hypothesis is tested independently, the comparisonwise error rate is controlled at a significance level α (e.g., 0.05 level) if each H0i is tested so that the probability of erroneously rejecting H0i is no more than α. Using the law of large numbers, it can be shown that, in the long run, the proportion of erroneously rejected null hypotheses does not exceed α. However, if the k null hypotheses are true, the probability of rejecting at least one true null hypothesis will be considerably greater than the significance level chosen for each individual hypothesis. Thus, if a correct decision depends on correct inference from all k tests, the probability of an incorrect decision will exceed α. 2.2 Experimentwise Error Rate An early attempt to alleviate this problem and achieve a better control of the probability of an incorrect decision was to consider each experiment as a unit and define the experimentwise error rate. The experimentwise error rate is said to be controlled at α if the probability of rejecting at least one true null hypothesis does not exceed α when the k null hypotheses are simultaneously true. Control of the experimentwise error rate is sometimes referred to as the weak control of the familywise error rate. Note, however, that, in terms of the probability of making an incorrect decision, H01 , . . . , H0k all being true is not always the worst case scenario. Suppose, for example, that H01 , . . . , H0(k−1) are true but H0k is false. Then any multiple testing method for which the probability of incorrectly rejecting at least one null hypothesis when all the null hypotheses are true is no more than α, but the probability of rejecting at least one of H01 , . . . , H0(k−1) , given that they are true and H0k is false is greater than α, protects the experimentwise error rate. It is obvious from this example that preserving the experimentwise error rate does not necessarily guarantee that the probability of an incorrect decision is no greater than α. 2.3 Familywise Error Rate As a result of the described limitation of the experimentwise error rate, clinical
researchers rely on a more stringent method for controlling the probability of an incorrect decision known as the strong control of the familywise error rate (FWER). The FWER is defined as the probability of erroneously rejecting any true null hypothesis in a family regardless of which and how many other null hypotheses are true. This definition is essentially based on the maximum experimentwise error rate for any subset of the k null hypotheses and, for this reason, FWER-controlling tests are sometimes said to preserve the maximum Type I error rate. 2.4 False Discovery Rate Another popular approach to assessing the performance of multiple tests, known as the false discovery rate (FDR), is based on the ratio of the number of erroneously rejected null hypotheses to the total number of rejected null hypotheses (5). To be more precise, the FDR is said to be controlled at α if the expected proportion of incorrectly rejected (true) null hypotheses is no more than α, for example
E[ (Number of true H0i rejected) / (Total number of H0i rejected) ] ≤ α
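To illustrate these error-rate concepts numerically, the following minimal simulation sketch (illustrative values only, not from the source) tests k independent true null hypotheses at α = 0.05. The comparisonwise rate stays near α, the unadjusted familywise rate is much larger, and a Bonferroni adjustment brings it back to α; because all nulls are true in this setting, the FDR coincides with the FWER.

```python
# Monte Carlo illustration of comparisonwise vs familywise error rates
# for k independent true null hypotheses, each tested at level alpha.
# Under this global null, any rejection is a false rejection, so the
# false discovery rate equals the familywise error rate.
import numpy as np

rng = np.random.default_rng(0)
k, alpha, n_sim = 10, 0.05, 100_000

p = rng.uniform(size=(n_sim, k))                       # p-values under true nulls
comparisonwise = (p < alpha).mean()                    # close to alpha
fwer_unadjusted = (p < alpha).any(axis=1).mean()       # close to 1 - (1 - alpha)**k
fwer_bonferroni = (p < alpha / k).any(axis=1).mean()   # close to alpha

print(comparisonwise, fwer_unadjusted, fwer_bonferroni)
# roughly 0.05, 0.40, 0.05
```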
FDR-controlling tests are useful in multiplicity problems involving a large number of null hypotheses (e.g., multiplicity problems occurring in genetics) and are becoming increasingly popular in preclinical research. It is important to point out that the FDR is uniformly larger than the FWER and, thus, controlling the FDR may not control the probability of an incorrect decision. In fact, in confirmatory studies, it is often possible to manipulate the design of the clinical trial so that any conclusion desired can be almost surely inferred without inflating the FDR (6). 3 UNION-INTERSECTION TESTING Most commonly, multiple testing problems are formulated as union-intersection (UI) problems (7), meaning that one is interested in testing the global hypothesis, denoted by H0I , which is the intersection of k null hypotheses versus the union of the corresponding alternative hypotheses, denoted by
HAU. As an illustration, consider a dose-finding study designed to compare a low and a high dose of an experimental drug (labeled L and H) to placebo (P). The primary endpoint is a continuous variable with larger values indicating improvement. Let µP, µL, and µH denote the mean improvement in the placebo, low dose group, and high dose group, respectively. The individual null hypotheses tested in the trial are HL : µL ≤ µP and HH : µH ≤ µP. In this setting, a UI approach would test H0I : µL ≤ µP and µH ≤ µP versus HAU : µL > µP or µH > µP. According to the UI testing principle, the global hypothesis H0I is tested by examining each of its components individually, rejecting H0I if at least one of the components is rejected. Tests of homogeneity that one learns in elementary statistics courses, such as the F-test, tend to be UI tests. The following is a brief overview of popular methods for constructing UI tests. 3.1 Single-Step Tests Based on Univariate P-Values These tests (e.g., the Bonferroni and Šidák tests) are intuitive, easy to explain to non-statisticians and, for this reason, are frequently used in clinical applications. The Bonferroni adjustment for testing H0i amounts to computing an adjusted P-value given by kpi. Similarly, the Šidák-adjusted P-value for H0i is equal to 1 − (1 − pi)^k. The adjusted P-values are then compared with α and the global hypothesis H0I is rejected if at least one adjusted P-value is no greater than α. Another example of a test based on univariate P-values is the Simes test (8). The adjusted Simes P-value for the global hypothesis H0I is k min(p[1], p[2]/2, . . . , p[k]/k), where p[1], . . . , p[k] are ordered P-values (i.e., p[1] ≤ . . . ≤ p[k]). It is easy to see from this definition that the Simes test is uniformly more powerful than the Bonferroni test in the sense that the former rejects H0I every time the latter does. Although the Simes test has a power advantage over the Bonferroni test, one needs to remember that the Simes test does not always preserve the overall Type I error rate. It is known that the size of this test does not exceed α when p1, . . . , pk are independent or positively dependent (9). It is important to keep in mind that tests based
on univariate P-values ignore the underlying correlation structure and become very conservative when the test statistics are highly correlated or the number of null hypotheses is large (e.g., in clinical trials with multiple outcome variables). 3.2 Parametric Single-Step Tests The power of simple tests based on univariate P-values can be improved considerably when one can model the joint distribution of the test statistics T1, . . . , Tk. Consider, for example, the problem of comparing k doses of an experimental drug with a control in a one-sided manner. Assuming that T1, . . . , Tk follow a multivariate normal distribution and larger treatment differences are better, Dunnett (10) derived a multiple test that rejects H0i if Ti ≥ d, where d is the 100(1 − α)% percentile of max(T1, . . . , Tk). Dunnett’s method also yields a set of simultaneous one-sided confidence intervals for the true mean treatment differences δ1, . . . , δk:

δi > δ̂i − d s √(2/n),   i = 1, . . . , k
where s is the pooled sample standard deviation and n is the common sample size per treatment group. 3.3 Resampling-Based Single-Step Tests A general method for improving the performance of tests based on univariate P-values was proposed by Westfall and Young (3). Note first that the adjusted P-value for H0i is given by P{min(P1 , . . . , Pk ) ≤ pi }. In this equation, P1 , . . . , Pk denote random variables that follow the same distribution as p1 , . . . , pk under the assumption that the global hypothesis H0I is true. The joint distribution of the P-values is unknown and can be estimated using permutation or bootstrap resampling. The advantage of using resampling-based testing procedures is that they account for the empirical correlation structure of the individual P-values and, thus, are more powerful than the Bonferroni and similar tests. Furthermore, unlike the Dunnett test, the resampling-based approach does not rely on distributional assumptions. When carrying out resampling-based tests, it is important
to ensure that the subset pivotality condition is met. This condition guarantees that the resampling-based approach preserves the FWER at the nominal level. The subset pivotality condition is met in most multiple testing problems for which pivotal quantities exist; however, it may not be satisfied in the case of binary variables, for example; see Reference 3 for more details.
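To make the single-step adjustments of Section 3.1 concrete, here is a minimal sketch with hypothetical raw P-values; the Bonferroni and Šidák adjusted values are compared with α hypothesis by hypothesis, and the Simes-adjusted value applies to the global hypothesis only.

```python
# Single-step adjustments from Section 3.1 applied to hypothetical p-values.
import numpy as np

p = np.array([0.012, 0.041, 0.280])
k = len(p)
alpha = 0.05

bonferroni = np.minimum(k * p, 1.0)                  # adjusted p-values k*p_i
sidak = 1.0 - (1.0 - p) ** k                         # 1 - (1 - p_i)^k
p_sorted = np.sort(p)
simes_global = np.min(k * p_sorted / np.arange(1, k + 1))   # for H0I only

print("Bonferroni:", bonferroni)   # reject H0I if any value <= alpha
print("Sidak:     ", sidak)
print("Simes H0I: ", simes_global)
```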
4 CLOSED TESTING
A cornerstone of multiple hypotheses testing has been the closed testing principle of Marcus et al. (11). The principle has provided a foundation for a variety of multiple testing methods and has found a large number of applications in multiple testing problems occurring in clinical trials. Examples of such applications include procedures for multiple treatment comparisons and multiple outcome variables (12, 13), testing a dose-response relationship in dose ranging trials (14), and gatekeeping strategies for addressing multiplicity issues developing in clinical trials with multiple primary and secondary endpoints (15, 16). The closed testing principle is based on a hierarchical representation of the multiplicity problem in question. To illustrate, consider the null hypotheses HL and HH from the dose-finding trial example. In order to derive a closed test for this multiple testing problem, construct the closed family of null hypotheses by forming all possible intersections of the null hypotheses. The closed family contains HL , HH , and HL ∩ HH . The next step is to establish implication relationships in the closed family. A hypothesis that contains another hypothesis is said to imply it; for example, HL ∩ HH implies both H L and HH . The closed testing principle states that an FWER-controlling testing procedure can be constructed by testing each hypothesis in the closed family using a suitable level α test. A hypothesis in the closed family is rejected if its associated test and all tests associated with hypotheses implying it are significant. For example, applying the closed testing principle to the dose-finding trial example, statistical inference proceeds as follows:
• If HL ∩ HH is accepted, the closed test has to accept HL and HH because HL ∩ HH implies HL and HH.
• If HL ∩ HH is rejected, but not HL or HH, the inference is that at least one of the two alternative hypotheses is true, but which one cannot be specified.
• If HL ∩ HH and HH are rejected but HL is accepted, one concludes that HH is false (i.e., µH > µP). Similarly, if HL ∩ HH and HL are rejected but HH is accepted, the null hypothesis HL is declared to be false (i.e., µL > µP).
• Lastly, if HL ∩ HH, HL and HH are rejected, the inference is that µL > µP and µH > µP.
Now, in order to construct a multiple testing procedure, one needs to choose a level α significance test for the individual hypotheses in the closed family. Suppose, for example, that the individual hypotheses are tested using the Bonferroni test. The resulting closed testing procedure is equivalent to the stepwise testing procedure proposed by Holm (17). The Holm procedure relies on a sequentially rejective algorithm for testing the ordered null hypotheses H[01], . . . , H[0k] corresponding to the ordered P-values p[1] ≤ . . . ≤ p[k]. The procedure first examines the null hypothesis associated with the most significant P-value (i.e., H[01]). This hypothesis is rejected if p[1] ≤ α/k. Further, H[0i] is rejected if p[j] ≤ α/(k − j + 1) for all j = 1, . . . , i. Otherwise, the remaining null hypotheses H[0i], . . . , H[0k] are accepted and testing ceases. Note that H[01] is tested at the α/k level and the other null hypotheses are tested at successively higher significance levels. As a result, the Holm procedure rejects at least as many (and possibly more) null hypotheses as the Bonferroni test from which it was derived. This example shows that, by applying the closed testing principle to a single-step test, one can construct a more powerful stepwise test that maintains the FWER at the same level. The same approach can be adopted to construct stepwise testing procedures based on other single-step tests. For example, the popular Hochberg and Hommel testing procedures can be thought of as closed testing
versions of the Simes test (18, 19). It is worth noting that the Hommel procedure is uniformly more powerful than the Hochberg procedure, and both procedures preserve the FWER at the nominal level only when the Simes test does (i.e., under the assumption of independence or positive dependence). In the parametric case, an application of the closed testing principle to the Dunnett test results in the stepwise Dunnett test defined as follows. Consider again the comparison of k doses of an experimental drug with a control in a one-sided setting. Let T[1], . . . , T[k] denote the ordered test statistics (T[1] ≤ . . . ≤ T[k]) and di be the 100(1 − α)% percentile of max(T1, . . . , Ti), i = 1, . . . , k. The stepwise Dunnett test begins with the most significant statistic and compares it with dk. If T[k] ≥ dk, the null hypothesis corresponding to T[k] is rejected and the second most significant statistic is examined. Otherwise, the stepwise algorithm terminates and the remaining null hypotheses are accepted. It is easy to show that the derived stepwise test is uniformly more powerful than the single-step Dunnett test. An important limitation of the closed testing principle is that it does not generally provide the statistician with a tool for constructing simultaneous confidence intervals for parameters of interest. For instance, it is not clear how to set up simultaneous confidence bounds for the mean differences between the k dose groups and control group within the closed testing framework. The closed testing principle can also be used in the context of resampling-based multiple tests to set up stepwise testing procedures that account for the underlying correlation structure.
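As a concrete illustration of a closed-testing-derived stepwise procedure, the following minimal sketch implements the Holm step-down algorithm described above; the P-values are hypothetical.

```python
# Holm step-down procedure (closed testing applied to Bonferroni tests).
import numpy as np

def holm(p_values, alpha=0.05):
    """Return a boolean array that is True where the null hypothesis is rejected."""
    p = np.asarray(p_values, dtype=float)
    k = len(p)
    order = np.argsort(p)                 # most significant hypothesis first
    reject = np.zeros(k, dtype=bool)
    for step, idx in enumerate(order):
        # The (step+1)-th ordered hypothesis is tested at level alpha / (k - step);
        # stop at the first non-significant result and accept the rest.
        if p[idx] <= alpha / (k - step):
            reject[idx] = True
        else:
            break
    return reject

print(holm([0.012, 0.041, 0.280]))   # [ True False False] at alpha = 0.05
```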
5 PARTITION TESTING
The partitioning principle introduced in References 20 and 21 can be viewed as a natural extension of the principle of closed testing. The advantage of using the partitioning principle is two-fold: Partitioning procedures are sometimes more powerful than procedures derived within the closed testing framework and, unlike closed testing procedures, they are easy to invert in order to set up simultaneous confidence sets for parameters of
interest. To introduce the partitioning principle, consider k null hypotheses tested in a clinical trial and assume that H0i states that θ ∈ Θi, where θ is a multidimensional parameter and Θi is a subset of the parameter space. Partition the union of Θ1, . . . , Θk into disjoint subsets Θ∗J, J ⊂ {1, . . . , k}, which can be interpreted as the part of the parameter space in which exactly H0i, i ∈ J are true and the remaining null hypotheses are false. Now define null hypotheses corresponding to the constructed subsets (i.e., HJ∗ : θ ∈ Θ∗J) and test them at level α. As these null hypotheses are mutually exclusive, at most one of them is true. Therefore, although no multiplicity adjustment is made, the resulting multiple test controls the FWER at the α level. To illustrate the process of carrying out partitioning tests, consider the null hypotheses HL : µL ≤ µP and HH : µH ≤ µP from the dose-finding trial example. The union of HL and HH is partitioned into three hypotheses:

H1∗ : µL ≤ µP and µH ≤ µP
H2∗ : µL ≤ µP and µH > µP
H3∗ : µL > µP and µH ≤ µP
Testing each of the three hypotheses with a level α significance test results in the following decision rule:
• If H1∗ is accepted, neither HL nor HH can be rejected; otherwise infer that µL > µP or µH > µP.
• If H1∗ and H2∗ are rejected, one concludes that µL > µP. Likewise, rejecting H1∗ and H3∗ implies that µH > µP.
• Finally, if H1∗, H2∗ and H3∗ are rejected, the inference is that µL > µP and µH > µP.
Although these decision rules appear to be similar to the closed testing rules, it is important to point out that the partitioning principle does not deal with the hypotheses in the closed family (i.e., HL, HH and HL ∩ HH) but rather with hypotheses H1∗, H2∗, and H3∗ defined above. As a result of the choice of mutually exclusive null hypotheses, partitioning tests can be inverted to derive a confidence region for
the unknown parameter θ. Recall that the most general method for constructing a confidence set from a significance test is defined as follows. For each parameter point θ0, test H0 : θ = θ0 using a level-α test and then consider the set of all parameter points θ0 for which H0 : θ = θ0 is accepted. The obtained set is a 100(1 − α)% confidence set for the true value of θ. This procedure corresponds to partitioning the parameter space into subsets consisting of a single parameter point and can be used for constructing simultaneous confidence limits associated with various stepwise tests. Consider, for example, confidence limits for the mean treatment differences between k dose groups and a control group (20). If the largest mean difference is not significant (T[k] < dk), the one-sided limits for the true mean differences δ1, . . . , δk are given by

δi > δ̂i − dk s √(2/n),   i = 1, . . . , k

and testing stops. Otherwise, one infers that δ[k] > 0 and examines the second largest difference. At the jth step of the stepwise test, the one-sided limits for δ[1], . . . , δ[k−j+1] are

δ[i] > δ̂[i] − dk−j+1 s √(2/n),   i = 1, . . . , k − j + 1
if the corresponding test statistic is not significant (T[k−j+1] < dk−j+1) and δ[k−j+1] > 0 otherwise. Comparing the resulting testing procedure to the stepwise Dunnett test derived in Section 4 using the closed testing principle, it is easy to see that the partitioning principle extends the closed testing framework by enabling clinical researchers to set up confidence limits for treatment-control differences. The partitioning principle can also be used for constructing confidence sets in a much more general context [e.g., confidence intervals for fixed-sequence testing methods occurring in dose-finding studies and other clinical applications (22)]. REFERENCES 1. European Committee for Proprietary Medicinal Products, Points to consider on multiplicity issues in clinical trials. September 19, 2002. 2. Y. Hochberg and A. C. Tamhane, Multiple Comparison Procedures. New York: Wiley, 1987.
3. P. H. Westfall and S. S. Young, Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. New York: Wiley, 1993. 4. J. C. Hsu, Multiple Comparisons: Theory and Methods. London: Chapman & Hall, 1996. 5. Y. Benjamini and Y. Hochberg, Controlling the false discovery rate—a practical and powerful approach to multiple testing. J. Royal Stat. Soc. Series B 1995; 57: 289–300. 6. H. Finner and M. Roters, On the false discovery rate and expected Type I errors. Biometr. J. 2001; 43: 985–1005. 7. S. N. Roy, On a heuristic method of test construction and its use in multivariate analysis. Ann. Math. Stat. 1953; 24: 220–238. 8. R. J. Simes, An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986; 73: 751–754. 9. S. Sarkar and C. K. Chang, Simes’ method for multiple hypothesis testing with positively dependent test statistics. J. Amer. Stat. Assoc. 1997; 92: 1601–1608. 10. C. W. Dunnett, A multiple comparison procedure for comparing several treatments with a control. J. Amer. Stat. Assoc. 1955; 50: 1096–1121. 11. R. Marcus, E. Peritz and K. R. Gabriel, On closed testing procedures with special reference to ordered analysis of variance. Biometrika 1976; 63: 655–660. 12. P. Bauer, Multiple testing in clinical trials. Stat. Med. 1991; 10: 871–890. 13. W. Lehmacher, G. Wassmer and P. Reitmeir, Procedures for two-sample comparisons with multiple endpoints controlling the experimentwise error rate. Biometrics 1991; 47: 511–521. 14. D. M. Rom, R. J. Costello and L. T. Connell, On closed test procedures for dose-response analysis. Stat. Med. 1994; 13: 1583–1596. 15. A. Dmitrienko, W. Offen and P. H. Westfall, Gatekeeping strategies for clinical trials that do not require all primary effects to be significant. Stat. Med. 2003; 22: 2387–2400. 16. P. H. Westfall and A. Krishen, Optimally weighted, fixed sequence, and gate-keeping multiple testing procedures. J. Stat. Plan. Inference 2001; 99: 25–40. 17. S. Holm, A simple sequentially rejective multiple test procedure. Scand. J. Stat. 1979; 6: 65–70. 18. Y. Hochberg, A sharper Bonferroni procedure for multiple significance testing. Biometrika 1988; 75: 800–802.
19. G. Hommel, A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika 1988; 75: 383–386. 20. G. Stefansson, W-C. Kim and J. C. Hsu, On confidence sets in multiple comparisons. In: S. S. Gupta and J. O. Berger, eds. Statistical Decision Theory and Related Topics IV. New York: Academic Press, 1988, pp. 89–104. 21. H. Finner and K. Strassburger, The partitioning principle: a powerful tool in multiple decision theory. Ann. Stat. 2002; 30: 1194–1213. 22. J. C. Hsu and R. L. Berger, Stepwise confidence intervals without multiplicity adjustment for dose-response and toxicity studies. J. Amer. Stat. Assoc. 1999; 94: 468–482.
FURTHER READING A. Dmitrienko, G. Molenberghs, C. Chuang-Stein, and W. Offen, Analysis of Clinical Trials Using SAS: A Practical Guide. Cary, NC: SAS Institute, 2005 (Chapter 2, ‘‘Multiple comparisons and multiple endpoints’’).
MULTISTAGE GENETIC ASSOCIATION STUDIES
DUNCAN THOMAS
DAVID CONTI
Department of Preventive Medicine, University of Southern California, Los Angeles, CA

We consider the design of genetic association studies within the context of clinical research, where the ultimate question might be to identify and characterize genes that modify the response to pharmacologic agents, either preventive or therapeutic, or other interventions. This includes both existing clinical trials primarily designed to determine treatment effects, expanding them with secondary goals aimed at characterizing the influence of genes, as well as nested challenge or treatment experiments within existing population-based studies, exploiting extensive information (possibly genetic) that may already have been collected. In either case, the questions are similar. Why is it that some people respond favorably to a particular treatment and others do not? Why do some experience a particular adverse effect and others do not? Why is one treatment better for some people, and a different treatment better for someone else? Could something in a person’s genetic makeup explain such differences and help inform a future of personalized medicine? There is, of course, a vast literature on the design of genetic association studies and pharmacogenetic trials, which is summarized in other articles in this volume (see the articles on genetic association analysis and pharmacogenomics) as well as previously in the literature (1–3). Here, we focus on the design of multistage association studies, where statistical sampling can be used in various ways to improve cost-efficiency. General works are available on multistage designs (4–7) and the context of genetic studies (8–10). In particular, we focus on two-stage designs in the pharmacogenetic context. This focus has several implications. Our primary focus is on interactions rather than main effects, which has been the main thrust of most of the literature on multistage designs. In a pharmaceutical trial, there is generally a wealth of prior physiological and biochemical knowledge about the pathway(s) targeted by the agent under study, so the aim of a genetic study will be to characterize the role of particular functional polymorphisms across multiple genes within critical steps in the pathway. The exposure variable (here, treatment) may be studied with a randomized controlled trial, rather than by observational epidemiologic studies with their attendant potential for confounding and other biases. The study design is likely to involve unrelated individuals rather than families, particularly in therapeutic trials where there would be few families with multiple cases eligible to be treated concurrently and even fewer opportunities to assign members of the same family to different treatments.
It is also worth considering the various genetic contexts in which such research might be undertaken. One might have in mind a particular candidate gene containing one or more known functional polymorphisms that are known to play a role in the metabolism or other action of the agent under study. Or one might know the function of the gene, but not have identified any (or all) functional polymorphisms, and would like to fully characterize the spectrum of variation within the gene and its potential impact on the outcome under study. This may entail single nucleotide polymorphism (SNP) discovery through resequencing, or extensive genotyping of all known polymorphisms in a sample, before proceeding with typing only a subset of variants in the main study. More generally, one may have some understanding
of an entire pathway involving multiple genes and possibly environmental factors affecting their expression or substrates, or even a complex network of interconnected pathways; in either case, one wishes to build a model for the entire system of genes. Whether one is studying a single gene or an entire pathway, it may be helpful to measure certain intermediate biomarkers (gene expression, methylation, metabolomic or proteomic measures) or additional clinical phenotypes. As these may be expensive or difficult to obtain, one might consider using only a subsample for this purpose. If no particular candidate genes are known, one might consider a genomewide search for possible associations, or at least a search for variants within a candidate region (say from a previous linkage scan); in either case, one would be looking for variants that might either be directly causal or in linkage disequilibrium with other causal variants. Each of these contexts provides different design challenges that might be addressed using some form of multistage design. 1 REASONS TO CONSIDER MULTISTAGE DESIGNS The main reason to consider multistage sampling designs is cost. No subsampling can possibly yield a more powerful study than obtaining complete data on all variables on all subjects, unless the cost of doing so would require a smaller overall sample size (1). But some measurements may be more informative or more costly than others, so some trade-off could lead to a more cost-efficient design. Here, two situations must be distinguished, one where the total sample size is fixed (say, by a previously completed clinical trial or population-based sample), and one where it is not. In the former case, one may simply be interested in getting a reasonably precise answer at a reasonable cost, recognizing that additional genetic measurements
or treatment assignments may not improve precision enough to justify the cost of making them on everybody. In this case, one might have an opportunity to conduct a larger main study by saving costs on some measurements through subsampling. A second reason is that there may be opportunities to improve the statistical efficiency of the design by using stratified sampling, based on one or more sources of information already collected. These could include treatment arm, outcomes, covariates, genotypes (or surrogates for them), or various combinations of such factors. Depending on the sampling design used, adjustments may be necessary to allow for the sampling scheme in the statistical analysis to obtain unbiased estimates. For example, if one were to overrepresent both cases (adverse outcomes) and active treatment arm subjects, then a naïve analysis would induce a spurious association between treatment and outcome and likely distort the assessment of the modifying effect of genotypes or biomarkers measured on the subsample. Fortunately, allowing for such biased sampling schemes is relatively easy via weighting subjects by the inverse of their sampling probabilities in the analysis (11) or by including the logarithms of these weights as ‘‘offsets’’ (covariates with regression coefficients fixed at unity) in a logistic regression analysis (12). As an extreme example, when investigating a dichotomous outcome within a previously performed randomized controlled trial, one might only select the ‘‘cases’’ as the subset for further genotyping and then perform the appropriate case-only analysis, yielding an increase in power for detection of gene × treatment interactions (13–17). In many cases, the substudy data alone may be sufficient to answer the immediate questions, but some types of more integrative questions might require combining the information from the main study and substudy in some manner. For example, one might use substudy data to build a model for the dependence of biomarkers on genes, treatment, and possibly other factors, and then incorporate these predicted biomarkers into an analysis of the main study in which data on treatments, genotypes, and outcomes (but not biomarkers) were available on everybody (18).
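To illustrate the inverse-probability-weighting adjustment mentioned above (reference 11), here is a minimal sketch with hypothetical variable names, sampling fractions, and simulated data; the offset formulation (reference 12) is an alternative not shown. The weights correct the point estimates, but the default standard errors do not reflect the sampling design, so a sandwich or bootstrap variance would be used in practice.

```python
# Sketch of an inverse-probability-weighted logistic regression for a
# gene x treatment interaction estimated from a biased (e.g., case- and
# treatment-arm-enriched) genotyping subsample. All names and values are
# hypothetical; only point estimation is illustrated.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
sub = pd.DataFrame({
    "y": rng.integers(0, 2, n),               # adverse outcome (0/1)
    "treat": rng.integers(0, 2, n),           # randomized arm (0/1)
    "geno": rng.integers(0, 2, n),            # variant carrier (0/1)
    "samp_prob": rng.choice([0.25, 1.0], n),  # known selection probabilities
})

X = sm.add_constant(
    sub[["treat", "geno"]].assign(treat_x_geno=sub["treat"] * sub["geno"]))
weights = 1.0 / sub["samp_prob"]              # inverse-probability weights

fit = sm.GLM(sub["y"], X, family=sm.families.Binomial(),
             freq_weights=weights).fit()
print(fit.params)                             # interest centers on treat_x_geno
```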
Much of our previous discussion and literature has assumed that the genetic association study is to be conducted as an add-on to a previous clinical trial, with the aim of studying the genetic determinants of response to treatment. An alternative approach might be to add a clinical trial to some preexisting observational study in which the genetic variation has already been characterized (for instance, an etiologic cohort or case-control study), and use the genetics as the basis for a sampling scheme to assess the outcome of some treatment in targeted subgroups. The potential benefit of this lies in being able to balance the genotypic distribution to obtain greater statistical efficiency when testing for gene–treatment interactions. For example, if one were interested in investigating how treatment response varies in relation to a previously measured rare genetic variant, one could overrepresent mutation carriers relative to their original cohort frequencies. However, one may quickly reach the limits for an efficient block design if the number of independent variants under investigation is large (see the article on stratification). Finally, whatever general sampling and analysis scheme is used, it may be possible to predict what the optimal sampling fractions would be for a range of plausible model parameters and costs (19–23). Under some circumstances, such an analysis may reveal that the most efficient design, weighing both statistical efficiency and cost, does not involve any subsampling with differential assessment of variables across samples but rather a complete assessment of all variables on a smaller main study.

2 EXAMPLE 1: CANDIDATE GENE ASSOCIATION STUDY USING TAG SNPs

Suppose one has identified a candidate gene that is hypothesized to have an effect on the outcome under study or, in a clinical trial setting, to modify the effect of treatment. If one or more functional polymorphisms have already been identified in this gene and if one were persuaded that the hypothesized effects of the gene could be tested using only these known variants, there would probably be no need for a multistage sampling
design. Assuming that the cost for genotyping these known variants is not prohibitive, one would simply genotype these variants in a single sample from the relevant population; this might be, for example, cases and controls for a disease trait, a simple or stratified random sample of the population for a quantitative trait, or possibly the entire sample from a preexisting study such as a clinical trial. The design and analysis of such studies is discussed elsewhere in this volume (see the articles on genetic association analysis and pharmacogenomics). A negative result from such a study could, however, arise if the targeted polymorphism(s) was not the relevant one and would not exclude the possibility that some other variants in the gene might have shown an effect. Complete characterization of a gene—including all rare variations— conventionally entails full sequencing of the entire gene (not just coding regions, but potential regulatory regions, splice-site junctions, and highly conserved intronic regions). Because previous genomic databases could not be relied on to include all rare variants, a large sample would be required to detect them, although there do exist procedures for detecting rare variants in DNA pools (24). If such large samples are required, such a monumental task would be difficult to justify unless one had a strong hypothesis that rare variants were likely to be important (25–28). If, on the other hand, one wished instead to exhaustively characterize only the common variation in the gene, then it would not be necessary to fully sequence every subject, and a staged approach might be considered. The fact that most common variants are in strong linkage disequilibrium (LD) with other nearby variants means that most of them can be effectively predicted by combinations of a modest number of ‘‘tagging SNPs’’ (29–34). In this case, a relatively small sample of individuals might be sufficient to characterize the LD structure of the gene and select this subset of tag SNPs, and only this subset is then tested in the main study. However, in determining the subsample needed, one must also take into account the uncertainty of LD levels and patterns across the genome and the effect this uncertainty may have on characterization of
the underlying genetic diversity. Often the SNP selection sample is not part of the main study and is selected to represent the source population (mainly controls, if the trait is rare), so the information on the untagged variants is not used in the analysis of the main association study. Because the International HapMap Project is designed to provide a freely available catalog of the genetic similarities and differences across human populations, this resource serves as a primary example of SNP selection samples and often remains independent of the main study in the final analysis. However, the SNP selection sample could be used to develop a prediction model for each of the untagged variants, which could then be tested for association in the main study (23). Such analyses could be done at the genotype or haplotype level. Suppose, for example, that one wished to test the association of a disease trait Y with a particular untyped SNP polymorphism G using a set T of tagging SNPs. The substudy (in which both G and T are measured) yields an estimate of pα(G|T), where α denotes a vector of LD information to be estimated from the substudy. The main study would then be analyzed using likelihood contributions of the following form for each subject:
\[
p_{\beta,\alpha}(Y \mid T) = \sum_{g} p_{\beta}(Y \mid G = g)\, p_{\alpha}(G = g \mid T),
\]
where β is the relative risk parameter of interest. The single-imputation strategy (35) approximates this likelihood by using as the risk factor for each individual in the main study his or her expected allele (or haplotype) "dosage," E(G|T). Although more complex, the likelihood-based analysis, explicitly incorporating contributions from the substudy subjects, has the advantage of properly allowing for the uncertainty in the estimates of the parameters α in the genotype prediction model, and supports a broader range of analyses, such as the estimation of the location of an unobserved causal variant. A haplotype-based analysis is similar, except that the postulated causal genotype G is replaced by a vector of diplotypes (pairs of haplotypes) H, and the likelihood requires an additional summation over all possible diplotypes that are compatible with the observed unphased tag-SNP genotypes T.
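As a hedged illustration of the single-imputation ("expected dosage") strategy just described, the sketch below simulates one untyped SNP G and one tag SNP T, estimates p(G|T) in a small substudy, and carries the expected dosage into the main-study regression. The LD model, sample sizes, and effect size are invented for illustration, and the full likelihood analysis that propagates the uncertainty in α is deliberately omitted.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Hypothetical data: an untyped SNP G (0/1/2) and a single tag SNP T (0/1/2)
# in strong LD with it; only the substudy measures both.
n_main, n_sub = 5000, 400

def draw(n):
    t = rng.binomial(2, 0.4, n)
    # crude LD model: G tends to track T (an assumption of this sketch only)
    g = np.clip(t + rng.choice([-1, 0, 0, 0, 1], n), 0, 2)
    y = rng.binomial(1, 1 / (1 + np.exp(-(-1.5 + 0.4 * g))))
    return t, g, y

t_sub, g_sub, _ = draw(n_sub)          # substudy: T and G both observed
t_main, g_main, y_main = draw(n_main)  # main study: G treated as unobserved

# Stage 1: estimate p_alpha(G | T) from the substudy by cross-tabulation.
p_g_given_t = np.zeros((3, 3))
for t in range(3):
    counts = np.bincount(g_sub[t_sub == t], minlength=3)
    p_g_given_t[t] = counts / counts.sum()

# Stage 2: single-imputation analysis of the main study using the
# expected allele "dosage" E(G | T) as the risk factor.
dosage = p_g_given_t[t_main] @ np.arange(3)
fit = sm.GLM(y_main, sm.add_constant(dosage),
             family=sm.families.Binomial()).fit()
print("log-odds ratio per expected allele:", round(fit.params[1], 2))
```

The likelihood-based alternative would instead sum the outcome probability over g with weights p̂(g|T) for each main-study subject, which allows the sampling error in those estimated weights to be reflected in the final standard errors.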
The haplotype likelihood can be fitted using either the expectation-maximization (E-M) algorithm (36) or Markov-chain Monte Carlo (MCMC) methods (37). This basic approach could be extended to a multistage design, depending on the extent of ancillary knowledge from genomic databases. For example, one might fully sequence a small sample to search for previously undetected variants (particularly if one were studying a population that was underrepresented in the HapMap or previous resequencing databases); cases with extreme phenotypes are sometimes used for this purpose to identify previously unrecognized variants that are more likely to be causal (38). A somewhat larger sample of controls might then be genotyped at all common variants (from the investigator's own first sample as well as other databases) to investigate the LD structure and choose tag SNPs, which would then be genotyped in the main study. In a study of gene-treatment interactions, one could optimize a two-stage design by selecting the subsample and stratifying on outcome, treatment, or a correlate of the causal variant (23). For example, one might overrepresent subjects with a positive family history or known carriers of a particular functional polymorphism.

3 EXAMPLE 2: PATHWAY-BASED STUDY INVOLVING BIOMARKERS

Suppose one wished to investigate an entire pathway or network of genes, such as those known to be involved in the metabolism of a particular drug or involved in repair of DNA damage induced by ionizing radiation. One could simply genotype all these loci in the entire study population and build an empirical model for the main effects and interactions (gene–gene or gene–treatment, for example) (39) or a physiologically based pharmacokinetic model incorporating these genes as modifiers of the respective metabolic rates (40). Such models might be usefully informed by having additional biomarker data on some of the intermediate steps in the process, such as metabolite concentrations in blood or urine (41). These measurements may be expensive or difficult to collect, however, possibly
requiring multiple samples over time because of the variability in the measurements or biases due to inducing factors, measurement factors (e.g., time of day), or the underlying disease process (making measurements on cases suspect due to the possibility of "reverse causation"). One might therefore consider making the biomarker measurements on only a sample, perhaps only for controls. Suppose one is primarily interested in studying the effect of some intermediate metabolite X on outcome Y, where X is postulated to be modified by treatment T and genes G, and the biomarker Z is a flawed surrogate for X. Then one might have measurements of (T, G, Z) on the subsample S and of (T, G, Y) on the main study M. The likelihood is then formed by marginalizing over the unobserved X:
\[
\prod_{i \in S} \int p(Z_i \mid X_i = x)\, p(X_i = x \mid G_i, T_i)\, dx \;\times\; \prod_{j \in M} \int p(Y_j \mid X_j = x)\, p(X_j = x \mid G_j, T_j)\, dx.
\]
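A minimal sketch of how this marginal likelihood might be evaluated numerically is given below. It assumes, purely for illustration, a normal model for X given (G, T), a normal measurement model for Z given X, and a logistic model for Y given X; the grid quadrature, parameter vector, and data layout are inventions of the sketch rather than part of the article.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit

def neg_log_likelihood(theta, sub, main, n_grid=41):
    """Negative log of the marginal likelihood above, integrating out X on a
    grid: substudy subjects contribute p(Z|X) p(X|G,T); main-study subjects
    contribute p(Y|X) p(X|G,T).  All distributional forms are assumptions
    made for this sketch."""
    g0, g1, g2, sx, sz, b0, b1 = theta
    x = np.linspace(-6.0, 6.0, n_grid)               # quadrature grid for X
    dx = x[1] - x[0]

    def mean_x(d):                                   # E(X | G, T)
        return g0 + g1 * d["G"] + g2 * d["T"]

    # Substudy S: integrate p(Z|X) p(X|G,T) dX for each subject (vectorized).
    px_s = norm.pdf(x, mean_x(sub)[:, None], sx)     # shape (n_sub, n_grid)
    pz = norm.pdf(sub["Z"][:, None], x, sz)
    lik_s = (pz * px_s).sum(axis=1) * dx

    # Main study M: integrate p(Y|X) p(X|G,T) dX for each subject.
    px_m = norm.pdf(x, mean_x(main)[:, None], sx)
    py1 = expit(b0 + b1 * x)
    py = np.where(main["Y"][:, None] == 1, py1, 1 - py1)
    lik_m = (py * px_m).sum(axis=1) * dx

    return -(np.log(lik_s).sum() + np.log(lik_m).sum())
```

Here `sub` and `main` would be dictionaries of NumPy arrays keyed by "G", "T", "Z" and "G", "T", "Y", respectively, and the returned value could be passed to a general-purpose optimizer such as scipy.optimize.minimize to obtain maximum likelihood estimates.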
As in Example 1, one might seek to optimize the study design by stratified sampling of S based on information about (T, G, Y) already obtained from the main study. Or, rather than using such a model-dependent likelihood analysis, the technique of Mendelian randomization (42–47) provides an indirect "nonparametric" test of the effect of X on Y by comparing instead the effects of G on both Y and Z separately. Davey Smith and Ebrahim (42) have argued that, under appropriate circumstances, inferences from Mendelian randomization about the effect of an environmentally modifiable intermediate variable can have the same strength of causality as from a randomized controlled trial.

4 EXAMPLE 3: GENOME-WIDE ASSOCIATION STUDY

Now, suppose there were no specific genes or pathways with strong prior hypotheses, but one wished instead to scan the entire genome for variants that influenced the trait or response to treatment. This possibility was first seriously suggested a decade ago (48),
but it is only very recently that advances in high-volume genotyping technologies have made it possible at a reasonable cost (49–52). Most recent genome-wide association (GWA) studies have adopted some variant of a multistage genotyping strategy first proposed by Sobell et al. (53) and formalized in a series of papers by Satagopan et al. (54–56) and others (57–60). The basic idea of all these approaches is to use relatively inexpensive (per genotype) commercial genotyping technology for dense SNP genotyping (about 300 thousand to 1 million SNPs and increasing rapidly) of a sample of the study population and then, on the basis of the observed associations in this sample, to select a small subset of the most promising SNPs for genotyping in a second sample, using a more expensive (per genotype) customized assay. The two samples are then combined in the final analysis, with critical values at stages 1 and 2 chosen to yield a predetermined level of genome-wide significance (e.g., 0.05 expected false positives across the entire genome) (61). Variants of this general idea include targeted sampling of subjects for the first stage, use of DNA pooling in the first stage followed by individual genotyping (62, 63), adding additional SNPs at the second stage to localize the LD signal (60), or using prior information such as linkage scans or genomic annotation data to prioritize the selection of SNPs for the second stage (64, 65). Needless to say, with hundreds of thousands of markers under consideration, the multiple comparisons problem is serious (see the article on multiple comparisons in this work). A variety of ways of analyzing the data have been proposed, generally using some form of Bonferroni correction or false discovery rate approaches (66–68). Even more daunting are analyses of all possible haplotype associations (69) or all possible gene–gene interactions (70). Given the infancy of this approach, there are many unresolved design and analysis issues, but there are many opportunities to exploit multistage sampling designs in new and creative ways. As in the previous example, of course, one could stratify the sampling of subjects by various combinations of disease, treatment (or other interacting factors), or race. In addition, there are various ways to refine the selection of markers to
advance to stage 2, particularly if there are multiple types of tests under consideration (main effects of SNPs, haplotype associations, gene–gene or gene–treatment interactions, etc.) or prior knowledge to be exploited.
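The following simulation sketch illustrates the flavor of such a two-stage scan: stage-1 statistics on a fraction of the subjects screen SNPs forward, and a weighted combination of the two stages is then compared with a genome-wide threshold. The SNP counts, thresholds, and effect sizes are illustrative assumptions, and the simple Bonferroni-style cutoff used here does not reproduce the stage-specific critical values derived in the cited papers (e.g., 61); it is a sketch of the general idea only.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical two-stage scan; every number below is illustrative only.
m_snps = 300_000          # SNPs on the stage-1 genotyping platform
pi1 = 0.30                # fraction of subjects genotyped in stage 1
alpha_genome = 0.05       # target genome-wide significance level
alpha1 = 5e-4             # stage-1 screening threshold (two-sided)
n_causal = 10
effect_z = 6.0            # expected full-sample z for the causal SNPs

# Stage-specific z statistics from independent subject sets; each stage's
# non-centrality scales with the square root of its sample fraction.
true_mean = np.zeros(m_snps)
true_mean[:n_causal] = effect_z
z1 = rng.normal(np.sqrt(pi1) * true_mean, 1.0)
z2 = rng.normal(np.sqrt(1 - pi1) * true_mean, 1.0)

# Stage 1: carry forward only the most promising SNPs.
selected = 2 * norm.sf(np.abs(z1)) < alpha1

# Stage 2 (joint analysis): combine the stages with weights proportional to
# the square roots of the stage sample fractions, then apply a
# Bonferroni-style genome-wide cutoff.
z_joint = np.sqrt(pi1) * z1 + np.sqrt(1 - pi1) * z2
threshold = norm.isf(alpha_genome / m_snps / 2)
hits = selected & (np.abs(z_joint) > threshold)
print(f"{selected.sum()} SNPs carried to stage 2, "
      f"{hits.sum()} genome-wide significant, "
      f"{hits[:n_causal].sum()} of {n_causal} causal SNPs detected")
```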
5 PERSPECTIVES
Although multistage sampling designs have been well established in statistics for decades and are increasingly used in epidemiology and genetics, this remains an underdeveloped research area, particularly in the pharmacogenetics context. Its utility for exploring pathways appears to be particularly promising, especially in the potential incorporation of pharmacokinetic/pharmacodynamic principles and approaches (71–75). This field is in a rapid state of development, driven in part by the potential for drug development. The interested reader should consult several recent reviews (17, 76–80) for further discussion of these issues.

REFERENCES

1. R. C. Elston, R. M. Idury, L. R. Cardon, and J. B. Lichter, The study of candidate genes in drug trials: sample size considerations. Stat Med. 1999; 18: 741–751. 2. L. R. Cardon, R. M. Idury, T. J. Harris, J. S. Witte, and R. C. Elston, Testing drug response in the presence of genetic information: sampling issues for clinical trials. Pharmacogenetics. 2000; 10: 503–510. 3. P. J. Kelly, N. Stallard, and J. C. Whittaker, Statistical design and analysis of pharmacogenetic trials. Stat Med. 2005; 24: 1495–1508. 4. N. Breslow and K. Cain, Logistic regression for two-stage case-control data. Biometrika. 1988; 75: 11–20. 5. N. E. Breslow and N. Chatterjee, Design and analysis of two-phase studies with binary outcome applied to Wilms tumor prognosis. Appl Stat. 1999; 48: 457–468. 6. K. Cain and N. Breslow, Logistic regression analysis and efficient design for two-stage studies. Am J Epidemiol. 1988; 128: 1198–1206. 7. J. E. White, A two stage design for the study of the relationship between a rare exposure and a rare disease. Am J Epidemiol. 1982; 115: 119–128.
8. R. C. Elston, D. Y. Lin, and G. Zheng. Multistage sampling for genetic studies. Annu Rev Genomics Hum Genet. 2007; 8: 327–342. 9. K. D. Siegmund, A. S. Whittemore, and D. C. Thomas, Multistage sampling for disease family registries. J Natl Cancer Inst Monogr. 1999; 26: 43–48. 10. A. S. Whittemore and J. Halpern, Multi-stage sampling in genetic epidemiology. Stat Med. 1997; 16: 153–167. 11. D. Horvitz and D. Thompson, A generalization of sampling without replacement from a finite population. J Am Stat Assoc. 1952; 47: 663–685. 12. B. Langholz and L. Goldstein. Conditional logistic analysis of case-control studies with complex sampling. Biostatistics. 2001; 2: 63–84. 13. W. Piegorsch, C. Weinberg, and J. Taylor, Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Stat Med. 1994; 13: 153–162. 14. M. Khoury and W. Flanders, Nontraditional epidemiologic approaches in the analysis of gene-environment interaction: case-control studies with no controls! Am J Epidemiol. 1996; 144: 207–213. 15. S. Greenland, A unified approach to the analysis of case-distribution (case-only) studies. Stat Med. 1999; 18: 1–15. 16. W. J. Gauderman, Sample size requirements for matched case-control studies of gene-environment interaction. Stat Med. 2002; 21: 35–50. 17. J. Little, L. Sharp, M. J. Khoury, L. Bradley, and M. Gwinn, The epidemiologic approach to pharmacogenomics. Am J Pharmacogenomics. 2005; 5: 1–20. 18. D. C. Thomas, Multistage sampling for latent variable models. Lifetime Data Anal. In press. 19. D. Spiegelman and R. Gray, Cost-efficient study designs for binary response data with Gaussian covariate measurement error. Biometrics. 1991; 47: 851–869. 20. D. Spiegelman, R. J. Carroll, and V. Kipnis, Efficient regression calibration for logistic regression in main study/internal validation study designs with an imperfect reference instrument. Stat Med. 2001; 20: 139–160. 21. S. Greenland, Statistical uncertainty due to misclassification: implications for validation substudies. J Clin Epidemiol. 1988; 41: 1167–1174.
22. W. D. Flanders and S. Greenland, Analytic methods for two-stage case-control studies and other stratified designs. Stat Med. 1991; 10: 739–747. 23. D. Thomas, R. Xie, and M. Gebregziabher, Two-stage sampling designs for gene association studies. Genet Epidemiol. 2004; 27: 401–414. 24. L. T. Amundadottir, P. Sulem, J. Gudmundsson, A. Helgason, A. Baker, et al., A common variant associated with prostate cancer in European and African populations. Nat Genet. 2006; 38: 652–658. 25. J. K. Pritchard, Are rare variants responsible for susceptibility to complex diseases? Am J Hum Genet. 2001; 69: 124–137. 26. J. K. Pritchard and N. J. Cox, The allelic architecture of human disease genes: common disease-common variant ... or not? Hum Mol Genet. 2002; 11: 2417–2423. 27. W. Y. Wang, H. J. Cordell, and J. A. Todd, Association mapping of complex diseases in linked regions: estimation of genetic effects and feasibility of testing rare variants. Genet Epidemiol. 2003; 24: 36–43. 28. N. S. Fearnhead, J. L. Wilding, B. Winney, S. Tonks, S. Bartlett, et al. Multiple rare variants in different genes account for multifactorial inherited susceptibility to colorectal adenomas. Proc Natl Acad Sci USA. 2004; 101: 15992–15997. 29. D. O. Stram, Tag SNP selection for association studies. Genet Epidemiol. 2004; 27: 365–374. 30. D. O. Stram, Software for tag single nucleotide polymorphism selection. Hum Genomics. 2005; 2: 144–151. 31. D. Thompson, D. Stram, D. Goldgar, and J. S. Witte, Haplotype tagging single nucleotide polymorphisms and association studies. Hum Hered. 2003; 56: 48–55. 32. G. C. Johnson, L. Esposito, B. J. Barratt, A. N. Smith, J. Heward, et al. Haplotype tagging for the identification of common disease genes. Nat Genet. 2001; 29: 233–237. 33. Z. Lin and R. B. Altman, Finding haplotype tagging SNPs by use of principal components analysis. Am J Hum Genet. 2004; 75: 850–861. 34. Z. Meng, D. V. Zaykin, C. F. Xu, M. Wagner, and M. G. Ehm, Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes. Am J Hum Genet. 2003; 73: 115–130. 35. D. O. Stram, C. L. Pearce, P. Bretsky, M. Freedman, J. N. Hirschhorn, et al., Modeling and E-M estimation of haplotype-specific
relative risks from genotype data for a case-control study of unrelated individuals. Hum Hered. 2003; 55: 179–190. 36. L. Excoffier, G. Laval, and D. Balding, Gametic phase estimation over large genomic regions using an adaptive window approach. Hum Genomics. 2003; 1: 7–19. 37. T. Niu, Z. S. Qin, X. Xu, and J. S. Liu, Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet. 2002; 70: 157–169. 38. C. A. Haiman, D. O. Stram, M. C. Pike, L. N. Kolonel, N. P. Burtt, et al., A comprehensive haplotype analysis of CYP19 and breast cancer risk: the Multiethnic Cohort. Hum Mol Genet. 2003; 12: 2679–2692. 39. R. J. Hung, P. Brennan, C. Malaveille, S. Porru, F. Donato, et al., Using hierarchical modeling in genetic association studies with multiple markers: application to a case-control study of bladder cancer. Cancer Epidemiol Biomark Prev. 2004; 13: 1013–1021. 40. D. V. Conti, V. Cortessis, J. Molitor, and D. C. Thomas, Bayesian modeling of complex metabolic pathways. Hum Hered. 2003; 56: 83–93. 41. P. Tonolio, P. Boffetta, D. E. K. Shuker, N. Rothman, B. Hulka, and N. Pearce, Application of Biomarkers in Cancer Epidemiology. Lyon, France: IARC Scientific, 1997. 42. G. Davey Smith and S. Ebrahim, "Mendelian randomization": can genetic epidemiology contribute to understanding environmental determinants of disease? Int J Epidemiol. 2003; 32: 1–22. 43. P. Brennan, Commentary: Mendelian randomization and gene-environment interaction. Int J Epidemiol. 2004; 33: 17–21. 44. G. Davey Smith and S. Ebrahim, What can mendelian randomisation tell us about modifiable behavioural and environmental exposures? BMJ. 2005; 330: 1076–1079. 45. G. Davey Smith and S. Ebrahim, Mendelian randomization: prospects, potentials, and limitations. Int J Epidemiol. 2004; 33: 30–42. 46. D. C. Thomas and D. V. Conti, Commentary: the concept of "mendelian randomization." Int J Epidemiol. 2004; 33: 21–25. 47. J. Little and M. J. Khoury, Mendelian randomization: a new spin or real progress? Lancet. 2003; 362: 390–391. 48. N. Risch and K. Merikangas, The future of genetic studies of complex human diseases. Science. 1996; 273: 1616–1617.
49. D. C. Thomas, R. W. Haile, and D. Duggan, Recent developments in genomewide association scans: a workshop summary and review. Am J Hum Genet. 2005; 77: 337–345. 50. D. C. Thomas, Are we ready for genomewide association studies? Cancer Epidemiol Biomarkers Prev. 2006; 15: 595–598. 51. W. Y. S. Wang, B. J. Barratt, D. G. Clayton, and J. A. Todd, Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet. 2005; 6: 109–118. 52. J. N. Hirschhorn and M. J. Daly, Genomewide association studies for common disease and complex traits. Nat Rev Genet. 2005; 6: 95–108. 53. J. L. Sobell, L. L. Heston, and S. S. Sommer, Novel association approach for determining the genetic predisposition to schizophrenia: case-control resource and testing of a candidate gene. Am J Med Genet. 1993; 48: 28–35. 54. J. M. Satagopan, D. A. Verbel, E. S. Venkatraman, K. E. Offit, and C. B. Begg, Two-stage designs for gene-disease association studies. Biometrics. 2002; 58: 163–170. 55. J. M. Satagopan and R. C. Elston, Optimal two-stage genotyping in population-based association studies. Genet Epidemiol. 2003; 25: 149–157. 56. J. M. Satagopan, E. S. Venkatraman, and C. B. Begg, Two-stage designs for gene-disease association studies with sample size constraints. Biometrics. 2004; 60: 589–597. 57. A. Saito and N. Kamatani, Strategies for genome-wide association studies: optimization of study designs by the stepwise focusing method. J Hum Genet. 2002; 47: 360–365. 58. S. K. Service, L. A. Sandkuijl, and N. B. Freimer, Cost-effective designs for linkage disequilibrium mapping of complex traits. Am J Hum Genet. 2003; 72: 1213–1220. 59. H. Wang and D. O. Stram, Optimal two-stage genome-wide association designs based on false discovery rate. Comput Stat Data Anal. 2006; 5: 457–465. 60. H. Wang, D. C. Thomas, I. Pe'er, and D. O. Stram, Optimal two-stage genotyping designs for genome-wide association scans. Genet Epidemiol. 2006; 30: 356–368. 61. A. D. Skol, L. J. Scott, G. R. Abecasis, and M. Boehnke, Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet. 2006; 38: 209–213. 62. A. Bansal, D. van den Boom, S. Kammerer, C. Honisch, G. Adam, et al., Association testing by DNA pooling: an effective initial
screen. Proc Natl Acad Sci USA. 2002; 99: 16871–16874. 63. S. Wang, K. K. Kidd, and H. Zhao, On the use of DNA pooling to estimate haplotype frequencies. Genet Epidemiol. 2003; 24: 74–82. 64. K. Roeder, S. A. Bacanu, L. Wasserman, and B. Devlin, Using linkage genome scans to improve power of association in genome scans. Am J Hum Genet. 2006; 78: 243–252. 65. A. S. Whittemore, A Bayesian false discovery rate for multiple testing. Appl Stat. 2007; 34: 1–9. 66. D. Y. Lin, Evaluating statistical significance in two-stage genomewide association studies. Am J Hum Genet. 2006; 78: 505–509. 67. G. Thomson, Significance levels in genome scans. Adv Genet. 2001; 42: 475–486. 68. J. Zhao, E. Boerwinkle, and M. Xiong, An entropy-based statistic for genomewide association studies. Am J Hum Genet. 2005; 77: 27–40. 69. S. Lin, A. Chakravarti, and D. J. Cutler, Exhaustive allelic transmission disequilibrium tests as a new approach to genomewide association studies. Nat Genet. 2004; 36: 1181–1188. 70. J. Marchini, P. Donnelly, and L. R. Cardon, Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet. 2005; 37: 413–417. 71. M. Eichelbaum, M. Ingelman-Sundberg, and W. E. Evans, Pharmacogenomics and individualized drug therapy. Annu Rev Med. 2006; 57: 119–137. 72. T. W. Gant and S. D. Zhang, In pursuit of effective toxicogenomics. Mutat Res. 2005; 575: 4–16. 73. I. Cascorbi, Genetic basis of toxic reactions to drugs and chemicals. Toxicol Lett. 2006; 162: 16–28. 74. R. H. Howland, Personalized drug therapy with pharmacogenetics—Part 1: pharmacokinetics. J Psychosoc Nurs Ment Health Serv. 2006; 44: 13–16. 75. R. H. Howland, Personalized drug therapy with pharmacogenetics—Part 2: pharmacodynamics. J Psychosoc Nurs Ment Health Serv. 2006; 44: 13–16. 76. C. M. Ulrich, K. Robien, and R. Sparks, Pharmacogenetics and folate metabolism—a promising direction. Pharmacogenomics. 2002; 3: 299–313. 77. C. M. Ulrich, K. Robien, and H. L. McLeod, Cancer pharmacogenetics: polymorphisms,
pathways and beyond. Nat Rev Cancer. 2003; 3: 912–920. 78. Z. Feng, R. Prentice, and S. Srivastava, Research issues and strategies for genomic and proteomic biomarker discovery and validation: a statistical perspective. Pharmacogenomics. 2004; 5: 709–719. 79. U. Mahlknecht and S. Voelter-Mahlknecht, Pharmacogenomics: questions and concerns. Curr Med Res Opin. 2005; 21: 1041–1048.
80. A. C. Need, A. G. Motulsky, and D. B. Goldstein, Priorities and standards in pharmacogenetic research. Nat Genet. 2005; 37: 671–681.
NATIONAL CANCER INSTITUTE (NCI)
The National Cancer Institute is the world's largest organization solely dedicated to cancer research. The NCI supports researchers at universities and hospitals across the United States and at NCI Designated Cancer Centers, a network of facilities that not only study cancer in laboratories but also conduct research on the best ways to rapidly bring the fruits of scientific discovery to cancer patients. In the NCI's own laboratories, almost 5000 principal investigators, from basic scientists to clinical researchers, conduct earliest phase cancer clinical investigations of new agents and drugs. Recent advances in bioinformatics and the related explosion of technology for genomics and proteomics research are dramatically accelerating the rate for processing large amounts of information for cancer screening and diagnosis. The largest collaborative research activity is the Clinical Trials Program for testing interventions for preventing cancer, diagnostic tools, and cancer treatments, allowing access as early as possible to all who can benefit. The NCI supports over 1300 clinical trials a year, assisting more than 200,000 patients. The NCI's scientists also work collaboratively with extramural researchers to accelerate the development of state-of-the-art techniques and technologies. In addition to direct research funding, the NCI offers U.S. cancer scientists a variety of useful research tools and services, including tissue samples, statistics on cancer incidence and mortality, bioinformatic tools for analyzing data, databases of genetic information, and resources through NCI-supported Cancer Centers, Centers of Research Excellence, and the Mouse Models of Human Cancer Consortium. NCI researchers are also seeking the causes of disparities among underserved groups and gaps in quality cancer care, helping to translate research results into better health for groups at high risk for cancer, including cancer survivors and the aging population. As the leader of the National Cancer Program, the NCI provides vision and leadership to the global cancer community, conducting and supporting international research, training, health information dissemination, and other programs. Timely communication of NCI scientific findings helps people make better health choices and advises physicians about treatment options that are more targeted and less toxic.

This article was modified from the website of the National Institutes of Health (http://www.nih.gov/about/almanac/organization/NCI.htm) by Ralph D'Agostino and Sarah Karl.
NATIONAL CENTER FOR TOXICOLOGICAL RESEARCH (NCTR)
The National Center for Toxicological Research (NCTR) conducts peer-reviewed scientific research and provides expert technical advice and training that enables the U.S. Food and Drug Administration (FDA) to make sound, science-based regulatory decisions. The research is focused toward understanding critical biological events in the expression of toxicity and toward developing methods and incorporating new technologies to improve the assessment of human exposure, susceptibility, and risk through the characterization of present models and the development of new models. The aim of the program is to:
• Conduct peer-reviewed scientific research that provides the basis for FDA to make sound, science-based regulatory decisions, and to promote U.S. health through the FDA's core activities of premarket review and postmarket surveillance.
• Conduct fundamental and applied research aimed at understanding critical biological events and to determine how people are affected adversely by exposure to products regulated by FDA.
• Develop methods to measure human exposure to products that have been adulterated or to assess the effectiveness and/or the safety of a product.
• Provide the scientific findings used by the FDA product centers for premarket application review and produce safety assurance to the scientific community for the betterment of public health.
The NCTR's research program involves basic and applied research specifically designed to define biological mechanisms of action underlying the toxicity of products regulated by the FDA. This research is aimed at understanding critical biological events in the expression of toxicity and at developing methods to improve assessment of human exposure, susceptibility, and risk. The NCTR conducts research through eight divisions. The NCTR research divisions study biochemical and molecular markers of cancer, nutritional modulation of risk and toxicity, developmental toxicity, neurotoxicity, quantitative risk assessment, transgenics, applied and environmental microbiology, and solid-state toxicity. Each division works with the others to support the FDA's mission to bring safe and efficacious products to the market rapidly and to reduce the risk of adverse health effects from products on the market.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/nctr/overview/mission.htm and http://www.fda.gov/oc/oms/ofm/accounting/cfo/2002/NCTR.htm) by Ralph D'Agostino and Sarah Karl.
NATIONAL COOPERATIVE GALLSTONE STUDY
John M. Lachin
The George Washington University
Washington, DC

The National Cooperative Gallstone Study (NCGS) was conducted from 1973 to 1984 with funding from the National Institute of Arthritis, Metabolism and Digestive Diseases (NIAMDD), now the National Institute of Diabetes, Digestive and Kidney Diseases (NIDDK). The ultimate objective was to conduct a Phase III clinical trial of the effectiveness and safety of exogenous bile acid supplementation for treatment of cholesterol gallstones. The design and methods are presented in Reference 1 and the principal results in Reference 2.
The function of the gallbladder is to serve as a reservoir for the storage of bile that is secreted by the liver into the gallbladder. In response to a meal, especially a fatty meal, the gallbladder contracts and deposits the bile into the intestine to assist with digestion. Bile consists of bile acids and lipids, among other elements. Hepatic excretion into bile is also one of the principal means by which lipids are excreted. One of the earliest lipid-lowering drugs, cholestyramine, is a bile acid sequestrant that promotes clearance of lipids from circulating blood through secretion into bile. Bile is an aqueous medium, and bile acids and phospholipids (lecithin) maintain cholesterol and other lipids in solution. However, if bile becomes super-saturated with lipids, which means that the concentration of bile acids is inadequate to keep the available cholesterol in solution, then the cholesterol begins to precipitate, forming cholesterol gallstones (cholelithiasis). A minority of gallstones may also contain other precipitants, such as calcium, bilirubinate, or other pigments.
Today, gallstones are diagnosed by an ultrasound examination. At the time that the NCGS was conducted, gallstones were diagnosed using an oral cholecystogram (OCG) consisting of a standard X-ray of the abdominal region after ingestion of iodine that is absorbed and secreted into bile to allow visualization of the gallbladder and its contents on the X-ray. Gallstones formed from pure cholesterol tend to be buoyant and seem to be "floating" on the X-ray. Gallstones containing calcium are termed "calcified." Gallstones can be benign for a period of years, but they may also cause periodic episodes of pain (biliary colic) caused by bile duct obstruction that occurs when a gallstone has either escaped from the gallbladder or is formed in a bile duct. Obstruction in the cystic duct that connects the gallbladder to the larger common duct leads to debilitating episodes of cholecystitis or gallbladder inflammation that often requires emergency surgery. Gallstones can also promote bacterial infection of the bile in the gallbladder or bile ducts (cholangitis) requiring emergency therapy and when untreated may lead to sepsis.
In 1970, the only available treatment for gallstone disease was radical cholecystectomy or surgical removal. Unlike the laparoscopic cholecystectomy practiced today, the radical procedure required an extensive hospital stay and a period of rehabilitation. Several different bile acids are secreted into bile in humans, with the primary bile acid being cholic acid and its derivatives. A small preliminary study showed that feeding chenodeoxycholic acid (chenodiol or just cheno, for short) daily via a pill could restore cholesterol solubility in bile. It was hypothesized that prolonged treatment of cholesterol gallstones with chenodiol would desaturate cholesterol in bile and would literally cause the gallstones to dissolve. However, chenodiol, then as now, could not be synthesized from raw chemicals and must be extracted from the bile of animals at slaughter; then it must be purified into crystals, formulated, and then encapsulated. There was no industrial sponsor for the compound. Today chenodiol would be characterized as an "orphan" drug. Thus, the National Cooperative Gallstone Study was organized by the NIAMDD to conduct various studies leading up to a full-scale study of the effectiveness and safety of bile acid supplementation with chenodiol for treatment of gallstones.

1 THE NCGS ORGANIZATION
The NCGS was initiated by the NIAMDD through issuance of a Request For Proposals (RFP). Through an open competition, the Cedars-Sinai Medical Center of Los Angeles, CA, was named the Coordinating Center with Leslie Schoenfeld as Director and Jay Marks as Deputy Director (among others), both gastroenterologists, and Gloria Croke as Project Coordinator. Cedars-Sinai contracted with The Biostatistics Center of George Washington University (GWU) to serve as the Biostatistical Unit (BSU) of the Coordinating Center under the codirection of Jerome Cornfield and Larry Shaw. Cedars-Sinai and GWU jointly submitted the winning application. After Cornfield's death, John Lachin became codirector. Supplemental RFPs issued by Cedars-Sinai led to the appointment of other central units, including a Central Radiology Unit, Central Serum Laboratory, Central Bile Laboratory, and a Central Morphology and Electron Microscopy Laboratory.

2 STUDIES

2.1 Animal Toxicology

Since the objective was to conduct a large-scale Phase III study using chenodiol, the U.S. Food and Drug Administration (FDA) required that various animal toxicology studies be conducted. The program of toxicology studies, including studies in monkeys, was organized by the Pharmacology Toxicology Committee appointed by the Coordinating Center. These studies showed that chenodiol was associated with liver enzyme increases, and resulted in hepatic injury in rhesus monkeys (3). Subsequently, however, it was shown that the injury in the monkey was attributed to biotransformation of chenodiol to lithocholic acid and that humans were protected from such injury through sulfation of the lithocholic acid that neutralizes its toxicity (4).

2.2 The Hepatic Morphology Study

Because of the potential for hepatotoxicity, the FDA mandated that the first 100 subjects treated with chenodiol should undergo liver biopsy before and during treatment for up to two years. Since the NCGS planned to employ a low and a high dose of chenodiol (375 and 750 mg/day, respectively), an initial study of 128 subjects randomly allocated to these doses was conducted in which all subjects had a liver biopsy at baseline and again at 9 and 24 months of treatment. All biopsies were independently assessed using light microscopy by two pathologists at different institutions using a common protocol for the assessment of morphologic abnormalities (5). The study included masked duplicate readings to assess interpathologist and intrapathologist reliability of the detailed morphologic assessments. In addition, biopsies were also read by electron microscopy by one morphologist (6). The study did not include an untreated control group. However, both the light and electron microscopic evaluations showed an increase in the prevalence of lesions over time that were consistent with intrahepatic cholestasis and were unlikely to be a reflection of the natural history of gallstone disease.

2.3 The Full-Scale Study

When the morphology (biopsy) study was fully enrolled, the full-scale study was launched (1). A total of 916 subjects were enrolled in 10 clinical centers in the United States and randomly assigned to receive either placebo (n = 305), the low dose of 375 mg/day (n = 306), or the high dose of 750 mg/day (n = 305), double-masked. Subjects had predominantly cholesterol gallstones present on a baseline OCG and were generally in good health. Subjects were to be treated for two years with frequent monitoring of symptoms, frequent laboratory testing including lipids and liver enzymes, periodic follow-up OCGs, and periodic collection of gallbladder bile through duodenal intubation.
The results (2) showed a dose-dependent effect on complete confirmed gallstone dissolution, the primary outcome: the cumulative incidence over two years was 13.5% with the high dose, 5.2% with the low dose, and 0.8% with placebo. The cumulative incidence of partial (or complete) dissolution of at least 50% was 40.8%, 23.6%, and 11% in the three groups, respectively. Although statistically significant, these modest effects were of marginal clinical significance. Clinically significant hepatotoxicity occurred in 3% of those treated with the high dose and 0.4% of those treated with the low dose or placebo. Gallstone dissolution with treatment occurred more frequently among women, those with lower body weight, those with small or floating gallstones, and those with higher cholesterol levels at baseline. There were no differences among the treatment groups in the incidence of biliary symptoms (pain, cholangitis, cholecystitis) or the need for surgery (cholecystectomy). The investigators concluded that high-dose chenodiol treatment for up to two years was an "appropriate therapy for dissolution of gallstones in selected patients who are informed of the risks and benefits" (2). The investigators' evaluation of the study is presented in Reference 7.

2.4 The Recurrence Study

Patients who achieved complete dissolution in the main study were then offered the opportunity to receive additional double-masked therapy with either the low dose or placebo. The cumulative incidence of gallstone recurrence was 27% after up to 3.5 years of treatment with no difference between treatment groups (8).

2.5 Open-Label Treatment

The protocol and informed consent stated that subjects randomly assigned to placebo would be offered the opportunity to receive treatment with chenodiol at the conclusion of the study. The final phase consisted of open-label treatment for up to two years of those subjects originally treated with placebo who opted to receive active therapy.
3 MAJOR ISSUES

3.1 Clinical Centers

The initial complement of 10 clinical centers was selected after an open competition (9). Early in the study it was clear that two of the clinical centers could not meet the recruitment targets and were promptly replaced. Recruitment, however, proved difficult for all clinical centers, and aggressive steps were taken to achieve the recruitment goals (10).

3.2 The Biopsy Study

Prior to initiation of the full-scale NCGS, the FDA required that a preliminary Phase II study be conducted with liver biopsies before and after treatment with chenodiol. The NCGS investigators strongly objected to the implementation of this recommendation as part of the main study, in part because a needle liver biopsy was considered to have substantial risk and would be unethical to administer to patients treated with placebo, and in part because it would not be scientifically possible to differentiate the effects of therapy from the natural history of gallstone disease progression in the absence of an untreated (and biopsied) control group. Substantial negotiations among the NIAMDD, the NCGS investigators, and the FDA led to the compromise that the major study could be launched after the enrollment of the uncontrolled biopsy study was completed (9).

3.3 Costs

Over the 8-year period from 1973 to 1981 (not including the open-label follow-up), the total cost was $11.2 million. Of this cost, 43% was allocated to the clinical centers, 13% to the Coordinating Center, 19% to the Biostatistical Unit, 10% to the central laboratories and units, 5% to ancillary studies, 4% to animal toxicology studies, 3% to the purchase of drug, and 3% to committee expenses (7).

4 BIOSTATISTICAL CONSIDERATIONS

4.1 Randomization

The NCGS randomization sequence was stratified by clinical center and was implemented using a technique (1) developed by Larry Shaw, the BSU co-PI with Cornfield,
that later became known as "big-stick" randomization (11). This technique is a variation on the biased-coin design (12) but less restrictive in that complete randomization (a coin toss) is employed except where the imbalance between the numbers assigned to each group reaches an unacceptable level. In the NCGS, this level was defined as a deviation from perfect balance assigned to any one group that was greater than one standard deviation under the corresponding binomial (1). For example, for equal allocations to two groups with probability 1/2, after n allocations the standard deviation is (1/2)√n. On the next allocation, one tosses a coin if the number assigned to both groups falls within (1/2)(n ± √n); otherwise, the next allocation is assigned to whichever group has the smaller number.

4.2 Sample Size Evaluation

The sample size for the biopsy study was dictated by the FDA, which requested sequential biopsy on 100 subjects. To ensure that close to this number received 24 months of treatment and were biopsied, the number enrolled was increased to 128. Since there was no untreated control group, the power of the study was then described in terms of the change from baseline to follow-up in the proportion of subjects with an abnormality (1). The sample size for the three-group main study was based on the test for the difference in proportions achieving gallstone dissolution using pairwise tests for each of the two dose groups versus control (1). The power of the study was also assessed in terms of the ability to detect a treatment group by covariate interaction or heterogeneity of the treatment group effect between two strata. For the comparison of means of K active treatment groups versus a control, the Dunnett (13) procedure is widely used, for which the most powerful comparisons are provided using the "square root rule" in which √K subjects are assigned to the control group for every one subject assigned to a treatment group. For the case of two active treatments, as in the NCGS, this would yield allocation ratios of √2 : 1 : 1, or 1.41 : 1 : 1, for control and the two active treatment groups, respectively. However, it was not known whether
these allocations were also optimal for comparisons of proportions among K active treatment groups versus control. Lachin (14) explored the power of such tests and determined that there was no uniformly optimal allocation ratio for tests for proportions. Rather, the optimal ratio depended on the nature of the underlying null and alternative hypotheses, and on average, equal allocations seemed to be robust.

4.3 Outcome Evaluation

The primary outcome of the study was the presence or absence of gallstones on an OCG and, if present, the number, total volume, and total surface area. The Central Radiology Unit developed a novel technique for providing these measurements. The gallbladder and gallstones were visualized on a standard flat plate X-ray of the abdomen taken with the patient recumbent. Thus, the actual distance of the gallbladder and stones from the table surface (and film plane) varied as a function of patient girth. To allow for the resulting magnification differential, each clinical center was provided with a half-inch ball bearing that was taped to the patient's side at the level of the gallbladder. The resulting image of the ball bearing allowed the area of the gallbladder and gallstones on the X-ray to be calibrated to that of the image of the ball bearing (15). A computer program was then written to allow the reader to identify the "edge" of the image of the gallbladder and individual gallstones using a computer monitor, and to then compute areas. Biliary bile acids were measured in a Central Bile Laboratory and other biochemistries measured in a Central Serum Laboratory.

4.4 External Quality Assurance

The reproducibility and reliability of all outcome assessments were monitored to assure high-quality data. For the Central Radiology Unit, individual X-rays were randomly selected for masked duplicate re-reading within the unit (2). For the central laboratories, split duplicate specimens were collected in the clinical centers and then shipped to the laboratories for a masked duplicate assessment. The detailed procedures and descriptions of the statistical calculations of reliability are presented in Reference 16.
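To make the "big-stick" allocation rule of Section 4.1 concrete, here is a minimal two-group sketch in Python. It follows the worked example in the text (a coin toss unless the imbalance exceeds (1/2)√n, in which case the next subject goes to the smaller group) and is not the production randomization code used in the NCGS, which applied the rule within each clinical center and with three treatment groups.

```python
import random

def big_stick_assign(n_a, n_b, rng):
    """One assignment under the 'big-stick' rule: toss a fair coin unless
    the deviation from perfect balance exceeds one binomial standard
    deviation, (1/2)*sqrt(n); then assign to the smaller group."""
    n = n_a + n_b
    if abs(n_a - n / 2) > 0.5 * n ** 0.5:      # unacceptable imbalance
        return "A" if n_a < n_b else "B"
    return "A" if rng.random() < 0.5 else "B"  # complete randomization

# Example: a sequence of 20 assignments starting from an empty study.
rng = random.Random(2023)
counts = {"A": 0, "B": 0}
for _ in range(20):
    arm = big_stick_assign(counts["A"], counts["B"], rng)
    counts[arm] += 1
print(counts)   # imbalance is capped at roughly sqrt(n)/2 by construction
```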
4.5 External Monitoring of the BSU

Finally, an external monitoring committee was appointed to site visit the Biostatistical Unit (BSU) periodically, and over a period of years, it reviewed and critiqued virtually all functions of the unit. Details are presented in Reference 17.

REFERENCES

1. J. M. Lachin, J. Marks, and L. J. Schoenfield, the Protocol Committee and the NCGS Group. Design and methodological considerations in the National Cooperative Gallstone Study: A multi-center clinical trial. Controlled Clinical Trials, 1981; 2: 177–230. 2. L. J. Schoenfield, J. M. Lachin, the Steering Committee and the NCGS Group. National Cooperative Gallstone Study: A controlled trial of the efficacy and safety of chenodeoxycholic acid for dissolution of gallstones. Ann. Internal Med. 1981; 95: 257–282. 3. R. Okun, L. I. Goldstein, G. A. Van Gelder, E. I. Goldenthal, F. X. Wazeter, R. G. Giel, National Cooperative Gallstone Study: nonprimate toxicology of chenodeoxycholic acid. J. Toxicol. Environ. Health, 1982; 9: 727–741. 4. J. W. Marks, S. O. Sue, B. J. Pearlman, G. G. Bonorris, P. Varady, J. M. Lachin, and L. J. Schoenfield. Sulfation of lithocholate as a possible modifier of chenodeoxycholic acid-induced elevations of serum transaminase in patients with gallstones. J. Clin. Invest. 1981; 68: 1190–1196. 5. R. L. Fisher, D. W. Anderson, J. L. Boyer, K. Ishak, G. Klatskin, J. M. Lachin, M. J. Phillips, and the Steering Committee for the NCGS Group. A prospective morphologic evaluation of hepatic toxicity of chenodeoxycholic acid in patients with cholelithiasis: The National Cooperative Gallstone Study. Hepatology 1982; 2: 187–201. 6. M. J. Phillips, R. L. Fisher, D. W. Anderson, S. Lan, J. M. Lachin, J. L. Boyer, and the Steering Committee for the NCGS Group. Ultrastructural evidence of intrahepatic cholestasis before and after chenodeoxycholic acid (CDCA) therapy in patients with cholelithiasis: The National Cooperative Gallstone Study (NCGS). Hepatology 1983; 3: 209–220. 7. L. J. Schoenfield, S. M. Grundy, A. F. Hofmann, J. M. Lachin, J. L. Thistle, and M. P. Tyor, for the NCGS. The National Cooperative Gallstone Study viewed by its investigators. Gastroenterology 1983; 84: 644–648.
8. J. W. Marks, S. P. Lan, The Steering Committee, and The National Cooperative Gallstone Study Group. Low dose chenodiol to prevent gallstone recurrence after dissolution therapy. Ann. Intern. Med. 1984; 100: 376–381. 9. J. Marks, G. Croke, N. Gochman, A. F. Hofmann, J. M. Lachin, L. J. Schoenfield, M. P. Tyor, and the NCGS Group. Major issues in the organization and implementation of the National Cooperative Gallstone Study (NCGS). Controlled Clinical Trials 1984; 5: 1–12. 10. G. Croke. Recruitment for the National Cooperative Gallstone Study. Clin. Pharm. Ther. 1979; 25: 691–674. 11. J. F. Soares and C. F. Wu. Some restricted randomization rules in sequential designs. Communications in Statistics Theory and Methods 1982; 12: 2017–2034. 12. B. Efron. Forcing a sequential experiment to be balanced. Biometrika 1971; 58: 403–417. 13. C. W. Dunnett. A multiple comparison procedure for comparing several treatments with a control. J. Am Stat Assoc. 1955; 50: 1096–1121. 14. J. M. Lachin. Sample size determinations for r x c comparative trials. Biometrics, 1977; 33: 315–324. 15. E. C. Lasser, J. R. Amberg, N. A. Baily, P. Varady, J. M. Lachin, R. Okun, and L. J. Schoenfield. Validation of a computer-assisted method for estimating the number and volume of gallstones visualized by cholecystography. Invest. Radiol 1981; 16: 342–347. 16. R. L. Habig, P. Thomas, K. Lippel, D. Anderson, and J. M. Lachin. Central laboratory quality control in the National Cooperative Gallstone Study. Controlled Clinical Trials 1983; 4: 101–123. 17. P. L. Canner, L. C. Gatewood, C. White, J. M. Lachin, and L. J. Schoenfield. External monitoring of a data coordinating center: Experience of the National Cooperative Gallstone Study. Controlled Clinical Trials 1987; 8: 1–11.
FURTHER READING

J. W. Marks, G. G. Bonorris, A. Chung, M. J. Coyne, R. Okun, J. M. Lachin, and L. J. Schoenfield. Feasibility of low-dose and intermittent chenodeoxycholic acid therapy of gallstones. Am. J. Digest. Dis. 1977; 22: 856–860. J. J. Albers, S. M. Grundy, P. A. Cleary, D. M. Small, J. M. Lachin, L. J. Schoenfield, and the
National Cooperative Gallstone Study Group. National Cooperative Gallstone Study. The effect of chenodeoxycholic acid on lipoproteins and apolipoproteins. Gastroenterology, 1982; 82: 638–646. A. F. Hofmann, S. M. Grundy, J. M. Lachin, S. P. Lan, et al. Pretreatment biliary lipid composition in white patients with radiolucent gallstones in the National Cooperative Gallstone Study. Gastroenterology 1982; 83: 738–752. A. F. Hofmann and J. M. Lachin. Biliary bile acid composition and cholesterol saturation. Gastroenterology 1983; 84: 1075–1077. J. M. Lachin, L. J. Schoenfield, and the National Cooperative Gallstone Study Group. Effects of dose relative to body weight in the National Cooperative Gallstone Study: A fixed-dose trial. Controlled Clinical Trials 1983; 4: 125–131. S. M. Grundy, S. Lan, J. Lachin, the Steering Committee and the National Cooperative Gallstone Study Group. The effects of chenodiol on biliary lipids and their association with gallstone dissolution in the National Cooperative Gallstone Study (NCGS). J. Clin. Invest. 1984; 73: 1156–1166. J. L. Thistle, P. A. Cleary, J. M. Lachin, M. P. Tyor, T. Hersh, The Steering Committee, and The National Cooperative Gallstone Study Group. The natural history of cholelithiasis: The National Cooperative Gallstone Study. Ann. Intern. Med. 1984; 101: 171–175.
F. Stellard, P. D. Klein, A. F. Hofmann, and J. M. Lachin. Mass spectrometry identification of biliary bile acids in bile from patients with gallstones before and during treatment with chenodeoxycholic acid. J. Lab. Clin. Med. 1985; 105: 504–513.
CROSS-REFERENCES

Biased-coin Randomization
Clinical Trial/Study Conduct
Multiple Comparisons
Orphan Drugs
Phase III Trials
Placebo-Controlled Trial
Quality Assurance
Quality Control
Reproducibility
Sample Size for Comparing Proportions (Superiority and Non-Inferiority)
NATIONAL EYE INSTITUTE (NEI)

The U.S. National Eye Institute (NEI) conducts and supports research, training, health information dissemination, and other programs with respect to blinding eye diseases, visual disorders, mechanisms of visual function, preservation of sight, and the special health problems of individuals who are visually impaired or blind. Vision research is supported by the NEI through research grants and training awards made to scientists at more than 250 medical centers, hospitals, universities, and other institutions across the United States and around the world. The NEI also conducts laboratory and patient-oriented research at its own facilities located on the U.S. National Institutes of Health (NIH) campus in Bethesda, Maryland. Another part of the NEI mission is to conduct public and professional education programs that help prevent blindness and reduce visual impairment. To meet these objectives, the NEI has established the National Eye Health Education Program, a partnership of more than 60 professional, civic, and voluntary organizations and government agencies concerned with eye health. The program represents an extension of the NEI's support of vision research, where results are disseminated to health professionals, patients, and the public.
This article was modified from the website of the National Institutes of Health (http://www.nih.gov/about/almanac/organization/NEI.htm) by Ralph D'Agostino and Sarah Karl.
NATIONAL HEART, LUNG, AND BLOOD INSTITUTE (NHLBI)

The National Heart, Lung, and Blood Institute (NHLBI):

• Provides leadership for a national program in diseases of the heart, blood vessels, lungs, and blood; sleep disorders; and blood resources.
• Plans, conducts, fosters, and supports an integrated and coordinated program of basic research, clinical investigations and trials, observational studies, and demonstration and education projects related to the causes, prevention, diagnosis, and treatment of heart, blood vessel, lung, and blood diseases, and sleep disorders conducted in its own laboratories and by scientific institutions and individuals supported by research grants and contracts.
• Plans and directs research in development, trials, and evaluation of interventions and devices related to the prevention of diseases and disorders in the covered areas and the treatment and rehabilitation of patients suffering from such conditions.
• Conducts research on clinical use of blood and all aspects of the management of blood resources.
• Supports research training and career development of new and established researchers in fundamental sciences and clinical disciplines to enable them to conduct basic and clinical research related to heart, blood vessel, lung, and blood diseases; sleep disorders; and blood resources through individual and institutional research training awards and career development awards.
• Coordinates relevant activities with other research institutes and all federal health programs in the covered areas, including the causes of stroke.
• Conducts educational activities, including development and dissemination of materials for health professionals and the public in the covered areas, with emphasis on prevention.
• Maintains continuing relationships with institutions and professional associations, and with international, national, state, and local officials as well as voluntary agencies and organizations working in the covered areas.
• Oversees management of the Women's Health Initiative study.

1 DIVISION OF CARDIOVASCULAR DISEASES (DCVD)

The Division of Cardiovascular Diseases (DCVD) promotes opportunities to translate promising scientific and technological advances from discovery through preclinical studies to networks and multisite clinical trials. It designs, conducts, supports, and oversees research on the causes and prevention and treatment of diseases and disorders such as atherothrombosis, coronary artery disease, myocardial infarction and ischemia, heart failure, arrhythmia, sudden cardiac death, adult and pediatric congenital heart disease, cardiovascular complications of diabetes and obesity, and hypertension.

2 DIVISION OF PREVENTION AND POPULATION SCIENCES (DPPS)

The Division of Prevention and Population Sciences (DPPS) supports and provides leadership for population-based and clinic-based research on the causes, prevention, and clinical care of cardiovascular, lung, and blood diseases and sleep disorders. Research includes a broad array of epidemiologic studies to describe disease and risk factor patterns in populations and to identify risk factors for disease; clinical trials of interventions to prevent disease; studies of genetic, behavioral, sociocultural, and environmental influences on disease risk and outcomes; and studies of the application of prevention and treatment strategies to determine how to improve clinical care and public health.

3 DIVISION OF LUNG DISEASES (DLD)
The Division of Lung Diseases (DLD) supports basic research, clinical trials, national pulmonary centers, technological development, and application of research findings. Activities focus on understanding the structure and function of the respiratory system, increasing fundamental knowledge of mechanisms associated with pulmonary disorders, and applying new findings to evolving treatment strategies for patients.

This article was modified from the website of the National Institutes of Health (http://www.nih.gov/about/almanac/organization/NHLBI.htm) by Ralph D'Agostino and Sarah Karl.
NATIONAL HUMAN GENOME RESEARCH INSTITUTE (NHGRI)

The National Human Genome Research Institute (NHGRI), which was originally established in 1989 as the National Center for Human Genome Research, became an institute of the U.S. National Institutes of Health (NIH) in 1997. The NHGRI led the NIH's contribution to the International Human Genome Project, which successfully completed the sequencing of the 3 billion base pairs that make up the human genome in April 2003. The NHGRI mission has evolved over the years to encompass a broad range of studies aimed at understanding the structure and function of the human genome and its role in health and disease. To that end, the NHGRI supports the development of resources and technology that will accelerate genome research and its application to human health as well as the study of the ethical, legal, and social implications of genome research. The NHGRI also supports the training of investigators and the dissemination of genome information to the public and to health professionals.

The NHGRI is organized into three main divisions: the Office of the Director, which provides guidance to scientific programs and oversees general operations; the Division of Extramural Research, which supports and administers genomic research; and the Division of Intramural Research, which comprises the in-house genetics research laboratories. Research guidance and guidance related to NHGRI grants comes from the National Advisory Council for Human Genome Research, which meets three times a year. Members include representatives from health and science disciplines, public health, social sciences, and the general public. Portions of the council meetings are open to the public.

This article was modified from the website of the National Institutes of Health (http://www.nih.gov/about/almanac/organization/NHGRI.htm) by Ralph D'Agostino and Sarah Karl.
NATIONAL INSTITUTE OF ALLERGY AND INFECTIOUS DISEASES (NIAID)
The National Institute of Allergy and Infectious Diseases (NIAID) conducts and supports research to study the causes of allergic, immunologic, and infectious diseases, and to develop better means of preventing, diagnosing, and treating these illnesses. The NIAID runs a pediatric allergy clinic at the U.S. National Institutes of Health (NIH) Clinical Center that serves as a focal point for translational research conducted in collaboration with NIAID intramural laboratories and clinical trials of novel therapies.

1 GENETICS AND TRANSPLANTATION

The NIAID's basic immunology and genetics research seeks to define the effects of gene expression on immune function and to determine the manner in which the products of gene expression control the immune response to foreign substances such as transplanted organs and cells. Research programs in genetics and transplantation include human leukocyte antigen (HLA) region genetics in immune-mediated diseases, the genomics of transplantation, and clinical trials in organ transplantation.

2 DRUG RESEARCH AND DEVELOPMENT

In collaboration with industry, academia, and other government agencies, the NIAID has established research programs to facilitate drug development, including capacity databases to screen compounds for their potential use as therapeutic agents, facilities to conduct preclinical testing of promising drugs, and clinical trials networks to evaluate the safety and efficacy of drugs and therapeutic strategies in humans.

3 IMMUNE-MEDIATED DISEASES

The NIAID conducts and supports basic, preclinical, and clinical research on immune-mediated diseases, including asthma and allergic diseases, autoimmune disorders, primary immunodeficiency diseases, and the rejection of transplanted organs, tissues, and cells. Efforts are underway to evaluate the safety and efficacy of tolerance induction strategies for treating immune-mediated diseases as well as to assess the efficacy through clinical trials of hematopoietic stem cell transplantation for treating severe autoimmune disorders.

4 ANTIMICROBIAL RESISTANCE

The NIAID-supported clinical trials networks with capacity to assess new antimicrobials and vaccines relevant to drug-resistant infections include the Adult AIDS Clinical Trials Groups, the Bacteriology and Mycology Study Group, the Collaborative Antiviral Study Group, and Vaccine and Treatment Evaluation Units.

5 DIVISION OF MICROBIOLOGY AND INFECTIOUS DISEASES

The Division of Microbiology and Infectious Diseases (DMID) supports extramural research to control and prevent diseases caused by virtually all human infectious agents except human immunodeficiency virus (HIV). The DMID supports a wide variety of projects spanning the spectrum from basic biomedical research through applied research to clinical trials to test the safety and efficacy of new disease prevention strategies.

This article was modified from the website of the National Institutes of Health (http://www.nih.gov/about/almanac/organization/NIAID.htm) by Ralph D'Agostino and Sarah Karl.
NATIONAL INSTITUTE OF ARTHRITIS AND MUSCULOSKELETAL AND SKIN DISEASES (NIAMS)

The U.S. National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS) was established in 1986. The mission of NIAMS is to support research into the causes, treatment, and prevention of arthritis and musculoskeletal and skin diseases; the training of basic and clinical scientists to carry out this research; and the dissemination of information on research progress in these diseases. The Institute also conducts and supports basic research on the normal structure and function of joints, muscles, bones, and skin. Basic research involves a wide variety of scientific disciplines, including immunology, genetics, molecular biology, structural biology, biochemistry, physiology, virology, and pharmacology. Clinical research includes rheumatology, orthopaedics, dermatology, metabolic bone diseases, heritable disorders of bone and cartilage, inherited and inflammatory muscle diseases, and sports and rehabilitation medicine.

The Institute's Genetics and Clinical Studies Program supports genetic studies in rheumatic diseases, both in animal models and in humans; clinical trials and complex clinical studies, including epidemiology, outcomes, and prevention of rheumatic and related diseases; and research on Lyme disease and infection-related arthritis.
This article was modified from the website of the National Institutes of Health (http://www.nih. gov/about/almanac/organization/NIAMS.htm) by Ralph D’Agostino and Sarah Karl.
NATIONAL INSTITUTE OF BIOMEDICAL IMAGING AND BIOENGINEERING (NIBIB)

The mission of the U.S. National Institute of Biomedical Imaging and Bioengineering (NIBIB) is to improve health by leading the development of and accelerating the application of biomedical technologies. The Institute integrates the physical and engineering sciences with the life sciences to advance basic research and medical care:

• Supporting research and development of new biomedical imaging and bioengineering techniques and devices to fundamentally improve the detection, treatment, and prevention of disease.
• Enhancing existing imaging and bioengineering modalities.
• Supporting related research in the physical and mathematical sciences.
• Encouraging research and development in multidisciplinary areas.
• Supporting studies to assess the effectiveness and outcomes of new biologics, materials, processes, devices, and procedures.
• Developing technologies for early disease detection and assessment of health status.
• Developing advanced imaging and engineering techniques for conducting biomedical research at multiple scales.
This article was modified from the website of the National Institutes of Health (http://www. nih.gov/about/almanac/organization/NIBIB.htm) by Ralph D’Agostino and Sarah Karl.
NATIONAL INSTITUTE OF CHILD HEALTH AND HUMAN DEVELOPMENT (NICHD)

The mission of the National Institute of Child Health and Human Development (NICHD) is to ensure that every person is born healthy and wanted; that women suffer no harmful effects from the reproductive process; and that all children have the chance to fulfill their potential to live healthy and productive lives, free from disease or disability; and to ensure the health, productivity, independence, and well-being of all people through optimal rehabilitation. In pursuit of this mission, the NICHD conducts and supports laboratory research, clinical trials, and epidemiologic studies that explore health processes; examines the impact of disabilities, diseases, and defects on the lives of individuals; and sponsors training programs for scientists, doctors, and researchers to ensure that NICHD research can continue. NICHD research programs incorporate the following concepts:

• Events that happen before and throughout pregnancy as well as during childhood have a great impact on the health and the well-being of adults. The Institute supports and conducts research to advance knowledge of pregnancy, fetal development, and birth for developing strategies that prevent maternal, infant, and childhood mortality and morbidity; to identify and promote the prerequisites of optimal physical, mental, and behavioral growth and development through infancy, childhood, and adolescence; and to contribute to the prevention and amelioration of mental retardation and developmental disabilities.
• Human growth and development is a lifelong process that has many phases and functions. Much of the research in this area focuses on cellular, molecular, and developmental biology to build understanding of the mechanisms and interactions that guide a single fertilized egg through its development into a multicellular, highly organized adult organism.
• Learning about the reproductive health of women and men and educating people about reproductive practices is important to both individuals and societies. Institute-supported basic, clinical, and epidemiologic research in the reproductive sciences seeks to develop knowledge that enables women and men to overcome problems of infertility and to regulate their fertility in ways that are safe, effective, and acceptable for various population groups. Institute-sponsored behavioral and social science research in the population field strives to understand the causes and consequences of reproductive behavior and population change.
• Developing medical rehabilitation interventions can improve the health and the well-being of people with disabilities. Research in medical rehabilitation seeks to develop improved techniques and technologies for the rehabilitation of individuals with physical disabilities resulting from diseases, disorders, injuries, or birth defects.
The Institute also supports research training across all its programs, with the intent of adding to the cadre of trained professionals who are available to conduct research in areas of critical public health concern. In addition, an overarching responsibility of the NICHD is to disseminate information that emanates from Institute research programs to researchers, practitioners, other healthcare professionals, and the public.
This article was modified from the website of the National Institutes of Health (http://www. nih.gov/about/almanac/organization/NICHD.htm) by Ralph D’Agostino and Sarah Karl.
NATIONAL INSTITUTE OF DENTAL AND CRANIOFACIAL RESEARCH (NIDCR)
The mission of the National Institute of Dental and Craniofacial Research (NIDCR) is to improve oral, dental, and craniofacial health through research, research training, and the dissemination of health information. We accomplish our mission by:

• Performing and supporting basic and clinical research.
• Conducting and funding research training and career development programs to ensure an adequate number of talented, well-prepared, and diverse investigators.
• Coordinating and assisting relevant research and research-related activities among all sectors of the research community.
• Promoting the timely transfer of knowledge gained from research and its implications for health to the public, health professionals, researchers, and policymakers.

1 CENTER FOR CLINICAL RESEARCH

The Center for Clinical Research (CCR) supports and conducts patient-oriented and population-based research, including clinical trials, practice-based networks, epidemiology, and health-disparity research in all areas of program interest to the NIDCR. Providing statistical support for Institute centers and divisions, the CCR develops and supports programs to foster diversity in the scientific workforce as well as clinical research activities aimed at the health of vulnerable and special needs populations. The center has six components: Clinical Trials Program, Dental Practice-Based Research Networks, Epidemiology Research Program, Health Disparities Research Program, Health Promotion and Community-Based Research Program, and the Basic and Applied Behavioral/Social Science Research Program.

2 DIVISION OF EXTRAMURAL ACTIVITIES, SCIENTIFIC REVIEW BRANCH

The Scientific Review Branch of the Division of Extramural Activities coordinates the initial scientific peer review of applications for the following mechanisms of support: center research grants, program project grants, small research grants, research conference grants, institutional training grants, short-term training and fellowship grants, Physician Scientist Awards for Dentists, Dentist Scientist Awards, requests for applications issued by the NIDCR, certain investigator-initiated clinical trials, cooperative agreements, and all proposals for research and development contracts.

This article was modified from the website of the National Institutes of Health (http://www.nih.gov/about/almanac/organization/NIDCR.htm) by Ralph D'Agostino and Sarah Karl.
NATIONAL INSTITUTE OF DIABETES AND DIGESTIVE AND KIDNEY DISEASES (NIDDK)
The National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) conducts and supports research on many of the most serious diseases affecting public health. The institute supports much of the clinical research on the diseases of internal medicine and related subspecialty fields as well as many basic science disciplines. The Institute's Division of Intramural Research encompasses the broad spectrum of metabolic diseases such as diabetes, obesity, inborn errors of metabolism, endocrine disorders, mineral metabolism, digestive and liver diseases, nutrition, urology and renal disease, and hematology. Basic research studies include: biochemistry, biophysics, nutrition, pathology, histochemistry, bioorganic chemistry, physical chemistry, chemical and molecular biology, and pharmacology.

The NIDDK extramural research is organized into four divisions: Diabetes, Endocrinology and Metabolic Diseases; Digestive Diseases and Nutrition; Kidney, Urologic and Hematologic Diseases; and Extramural Activities. The Institute supports basic and clinical research through investigator-initiated grants, program project and center grants, and career development and training awards. The Institute also supports research and development projects and large-scale clinical trials through contracts.

The Clinical Research in Type 2 Diabetes Program focuses on patient-oriented research (i.e., clinical studies and small clinical trials) related to pharmacologic interventions and/or lifestyle interventions to prevent or treat type 2 diabetes, including studies relevant to new drug development, development of surrogate markers for use in clinical trials for the prevention or treatment of type 2 diabetes, cellular therapies for the treatment of type 2 diabetes, and improving the care of patients with type 2 diabetes.

The Type 1 Diabetes Clinical Trials Program supports large, multicenter clinical trials conducted under cooperative agreements or contracts. For example, the Diabetes Prevention Trial Type 1 (DPT-1) was aimed at determining whether it was possible to prevent or delay the onset of type 1 diabetes in individuals who are at immunologic, genetic, and/or metabolic risk. The program also supports future clinical trials that are part of the Type 1 Diabetes TrialNet, which are intervention studies to prevent or slow the progress of type 1 diabetes, and natural history and genetics studies in populations screened for or enrolled in these studies. The program also supports the Epidemiology of Diabetes Interventions and Complications (EDIC) study, an epidemiologic follow-up study of the patients previously enrolled in the Diabetes Control and Complications Trial (DCCT).

The Type 2 Diabetes Clinical Trials Program supports large, multicenter clinical trials conducted under cooperative agreements or contracts. For example, the Diabetes Prevention Program (DPP) focuses on testing lifestyle and pharmacologic intervention strategies in individuals at genetic and metabolic risk for developing type 2 diabetes to prevent or delay the onset of this disease.

The Gene Therapy and Cystic Fibrosis Centers Program supports three types of centers: Gene Therapy Centers (P30), Cystic Fibrosis Research Centers (P30), and Specialized Centers for Cystic Fibrosis Research (P50). Gene Therapy Centers provide shared resources to a group of investigators to facilitate development of gene therapy techniques and to foster multidisciplinary collaboration in the development of clinical trials for the treatment of cystic fibrosis and other genetic metabolic diseases. Cystic Fibrosis Research Centers (P30) and Specialized Centers for Cystic Fibrosis Research (P50) provide resources and support research on many
aspects of the pathogenesis and treatment of cystic fibrosis.

The Clinical Trials in Digestive Diseases Program supports patient-oriented clinical research focusing on digestive diseases. Small clinical studies (pilot), planning grants, or phase III clinical trials may be appropriate to this program. The small clinical studies focus on research that is innovative and/or potentially of high impact that will lead to full-scale clinical trials. Phase III clinical trials usually are multicenter and involve several hundred participants who are randomized to two or more treatments, one of which is usually a placebo. The aim of the trial is to provide evidence for support of, or a change in, health policy or standard of care. The interventions/treatments may include pharmacologic, nonpharmacologic, and behavioral interventions given for disease prevention, prophylaxis, diagnosis, or therapy.

The Clinical Trials in Liver Disease Program supports patient-oriented clinical research in liver diseases to evaluate one or more experimental intervention(s) in comparison with a standard treatment and/or placebo control among comparable groups of patients. Experimental interventions may include pharmacologic, nonpharmacologic, and behavioral interventions given for disease prevention, prophylaxis, diagnosis, or therapy. Either pilot studies or phase III trials may be appropriate.

The Obesity Prevention and Treatment Program supports research that focuses on the prevention and treatment of obesity and the overweight condition in humans. Prevention includes primary and secondary approaches to prevent the initial development of obesity through control of inappropriate weight gain and increases in body fat, weight maintenance among those at risk of becoming overweight, and prevention of weight regain once weight loss has been achieved. Treatment includes clinical trials evaluating approaches to lose weight or maintain weight loss, including, but not limited to, behavioral, pharmacologic, and surgical approaches.

Look AHEAD: Action for Health in Diabetes is a clinical trial recruiting 5000 obese individuals with type 2 diabetes into an 11.5 year study that investigates the long-term health consequences of interventions designed to achieve and sustain weight loss. The primary outcome of the trial is cardiovascular events: heart attack, stroke, and cardiovascular death.

The Clinical Trials in Nutrition Program supports clinical research on nutrition and eating disorders, focusing on metabolic and/or physiologic mechanisms. Small clinical studies (pilot), planning grants, or phase III clinical trials may be appropriate to this program. The small clinical studies focus on research that is innovative and/or potentially of high impact that will lead to full-scale clinical trials. Phase III clinical trials usually are multicenter and involve several hundred participants who are randomized to two or more treatments, one of which is a placebo.

This article was modified from the website of the National Institutes of Health (http://www.nih.gov/about/almanac/organization/NIDDK.htm) by Ralph D'Agostino and Sarah Karl.
NATIONAL INSTITUTE OF ENVIRONMENTAL HEALTH SCIENCES (NIEHS)

The mission of the U.S. National Institute of Environmental Health Sciences (NIEHS) is to reduce the burden of human illness and disability by understanding how the environment influences the development and progression of human disease. To have the greatest impact on preventing disease and improving human health, the NIEHS focuses on basic science, disease-oriented research, global environmental health, and multidisciplinary training for researchers:

• Funding extramural research and training via grants and contracts to scientists, environmental health professionals, and other groups worldwide.
• Conducting intramural research at the NIEHS facility and in partnership with scientists at universities and hospitals.
• Providing toxicological testing and test validation through the National Toxicology Program.
• Maintaining outreach and communications programs that provide reliable health information to the public and scientific resources to researchers.
This article was modified from the website of the National Institutes of Health (http://www.nih. gov/about/almanac/organization/NIEHS.htm) by Ralph D’Agostino and Sarah Karl.
NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES (NIGMS)
The National Institute of General Medical Sciences (NIGMS) was established in 1962, and in fiscal year 2006, its budget was $1.9 billion. The vast majority of this money funds grants to scientists at universities, medical schools, hospitals, and research institutions throughout the country. At any given time, the NIGMS supports over 4400 research grants—about 10% of the grants that are funded by the U.S. National Institutes of Health as a whole—and the NIGMS also supports approximately 25% of the trainees who receive assistance from the NIH. Primarily, the NIGMS supports basic biomedical research that lays the foundation for advances in disease diagnosis, treatment, and prevention.

The NIGMS is organized into divisions and a center that support research and research training in basic biomedical science fields. One division has the specific mission of increasing the number of underrepresented minority biomedical and behavioral scientists. The Institute places great emphasis on the support of individual, investigator-initiated research grants. It funds a limited number of research center grants in selected fields, including structural genomics, trauma and burn research, and the pharmacologic sciences. In addition, NIGMS funds several important resources for basic scientists.

The Institute's training programs help provide the most critical element of good research: well-prepared scientists. The NIGMS research training programs recognize the interdisciplinary nature of biomedical research today and stress approaches to biological problems that cut across disciplinary and departmental lines. Such experience prepares trainees to pursue creative research careers in a wide variety of areas.

Certain NIGMS training programs address areas in which there is a particularly serious need for well-prepared scientists. One of these, the Medical Scientist Training Program, provides investigators who can bridge the gap between basic and clinical research by supporting research training leading to the combined M.D.–Ph.D. degree. Other programs train scientists to conduct research in the rapidly growing field of biotechnology and at the interface between the fields of chemistry and biology. The NIGMS also has a Pharmacology Research Associate Program, in which postdoctoral scientists receive training in the NIH or Food and Drug Administration (FDA) laboratories and clinics.

In recent years, the NIGMS has launched initiatives in such cutting-edge areas as structural genomics (the Protein Structure Initiative), pharmacogenetics, collaborative research initiatives (which includes ''glue grants''), and computational modeling of infectious disease outbreaks. The NIGMS is also participating in the NIH Roadmap for Medical Research, a series of far-reaching initiatives that seek to transform the nation's biomedical research capabilities and speed the movement of research discoveries from the bench to the bedside.

Each year, NIGMS-supported scientists make major advances in understanding fundamental life processes. In the course of answering basic research questions, these investigators also increase our knowledge about the mechanisms and pathways involved in certain diseases. Other grantees develop important new tools and techniques, many of which have medical applications. In recognition of the significance of their work, a number of NIGMS grantees have received the Nobel Prize and other high scientific honors.

This article was modified from the website of the National Institutes of Health (http://www.nih.gov/about/almanac/organization/NIGMS.htm) by Ralph D'Agostino and Sarah Karl.
NATIONAL INSTITUTE OF MENTAL HEALTH (NIMH)
Mental disorders occur across the lifespan, from very young childhood into old age, and in the United States, mental disorders collectively account for more than 15% of the overall ‘‘burden of disease’’— a term that encompasses both premature death and disability associated with mental illness. The mission of the National Institute of Mental Health (NIMH) is to reduce the burden of mental illness and behavioral disorders through research on mind, brain, and behavior. Investments made over the past 50 years in basic brain and behavioral science have positioned NIMH to exploit recent advances in neuroscience, molecular genetics, behavioral science and brain imaging; to translate new knowledge about fundamental processes into researchable clinical questions; and to initiate innovative clinical trials of new pharmacologic and psychosocial interventions, with emphasis on testing their effectiveness in the diagnostically complex, diverse group of patients typically encountered in front-line service delivery systems. Investigators funded by NIMH also seek new ways to translate results from basic behavioral science into research relevant to public health, including the epidemiology of mental disorders, prevention and early intervention research, and mental health service research. Diverse scientific disciplines contribute to the design and evaluation of treatments and treatment delivery strategies that are relevant and responsive to the needs of persons with and at risk for mental illness. In this era of opportunity, NIMH is strongly committed to scientific programs to educate and train future mental health researchers, including scientists trained in molecular science, cognitive and affective neuroscience,
mental health clinical sciences and other disciplines urgently needed in studies of mental illness and the brain. Another important part of this research is to eliminate the effects of disparities in the availability of and access to high quality mental health services. These disparities, which impinge on the mental health status of all Americans, are felt in particular by many members of ethnic/cultural minority groups, and by women, children, and elderly people.

1 MECHANISMS OF SUPPORT
NIMH provides leadership at a national level for research on brain, behavior, and mental illness. Under a rigorous and highly competitive process, the institute funds research projects and research center grant awards and contracts to individual investigators in fields related to its areas of interest and to public and private institutions. NIMH also maintains and conducts a diversified program of intramural and collaborative research in its own laboratories and clinical research units at the National Institutes of Health (NIH). NIMH's informational and educational activities include the dissemination of information and education materials on mental illness to health professionals and the public; professional associations; international, national, state, and local officials; and voluntary organizations working in the areas of mental health and mental illness.

2 AUTISM STAART CENTERS
NIMH supports interdisciplinary research centers through an NIH cooperative agreement in the Studies to Advance Autism Research and Treatment (STAART) Program, in cooperation with the National Institute of Child Health and Human Development (NICHD), the National Institute of Neurological Disorders and Stroke (NINDS), the National Institute on Deafness and Other Communication Disorders (NIDCD), and the National Institute of Environmental Health Sciences (NIEHS). By evaluating and treating patients as well as enrolling them in clinical trials, each center helps expand the research base on the causes, diagnosis, early detection, prevention, and treatment of autism.

3 DIVISION ON SERVICES AND INTERVENTION RESEARCH (DSIR)

The Division on Services and Intervention Research (DSIR) supports two critical areas of research: intervention research to evaluate the effectiveness of pharmacologic, psychosocial (psychotherapeutic and behavioral), somatic, rehabilitative, and combination interventions on mental and behavior disorders; and mental health services research on organization, delivery (process and receipt of care), related health economics, delivery settings, clinical epidemiology, and the dissemination and implementation of evidence-based interventions into service settings. The division also provides biostatistical analysis and clinical trials operations expertise for research studies, analyzes and evaluates national mental health needs and community research partnership opportunities, and supports research on health disparities.

3.1 DSIR Clinical Trials Operations and Biostatistics Unit

The Clinical Trials Operations and Biostatistics Unit serves as the operations focal point for collaborative clinical trials on mental disorders in adults and children. The unit is responsible for overseeing both contract-supported and cooperative agreement–supported multisite clinical trial protocols as well as special projects undertaken by NIMH. In addition, the unit manages overarching matters related to clinical trials operations, such as the coordination of the ancillary protocols across the large trials and the implementation of NIMH policy for dissemination of public access datasets. The unit also consults with Institute staff and grantees/contractors on biostatistical matters related to appropriateness of study design, determination of power and sample size, and approaches to statistical analysis of data from NIMH-supported clinical trials.

This article was modified from the website of the National Institutes of Health (http://www.nih.gov/about/almanac/organization/NIMH.htm) by Ralph D'Agostino and Sarah Karl.
NATIONAL INSTITUTE OF NEUROLOGICAL DISORDERS AND STROKE (NINDS)
The National Institute of Neurological Disorders and Stroke (NINDS), created by the U.S. Congress in 1950, has occupied a central position in the world of neuroscience for more than 50 years. The mission of NINDS is to reduce the burden of neurologic disease in every age group and segment of society. To accomplish this goal, the Institute supports and conducts research on the healthy and diseased nervous system; fosters the training of investigators in the basic and clinical neurosciences; and seeks better understanding, diagnosis, treatment, and prevention of neurologic disorders. Scientists in the Institute’s laboratories and clinics in Bethesda, Maryland, conduct research in the major areas of neuroscience and on many of the most important and challenging neurologic disorders, and collaborate with scientists in several other NIH institutes. The NINDS vision is:
• To lead the neuroscience community in shaping the future of research and its relationship to brain diseases.
• To build an intramural program that is the model for modern collaborative neuroscience research.
• To develop the next generation of basic and clinical neuroscientists through inspiration and resource support.
• To seize opportunities to focus our resources to rapidly translate scientific discoveries into prevention, treatment, and cures.

The Institute's extramural program supports thousands of research project grants and research contracts. Institutional training grants and individual fellowships support hundreds of scientists in training, and provide career awards that offer a range of research experience and support for faculty members at various levels. The purposes and goals of the Extramural Division, Clinical Trials are:

• To promote the development of clinical interventions for neurologic disorders and stroke.
• To stimulate the translation of findings in the laboratory to clinical research and clinical interventions.
• To ensure measures for protection of human subjects and safety monitoring.
• To encourage innovation in clinical research methodology.
• To support the development of neurology clinical researchers with training in biostatistics, epidemiology, and clinical trial methodology.

This article was modified from the website of the National Institutes of Health (http://www.nih.gov/about/almanac/organization/NINDS.htm) by Ralph D'Agostino and Sarah Karl.
NATIONAL INSTITUTE OF NURSING RESEARCH (NINR)
The U.S. National Institute of Nursing Research (NINR) seeks to promote and to improve the health of individuals, families, communities, and populations by supporting and conducting clinical and basic research and research training on health and illness across the lifespan. The NINR research focus encompasses health promotion and disease prevention, quality of life, health disparities, and end-of-life issues. In keeping with the importance of nursing practice in various settings, the NINR seeks to extend nursing science by integrating the biological and behavioral sciences, employing new technologies to research questions, improving research methods, and developing the scientists of the future.

The NINR supports basic research on preventing, delaying the onset, and slowing the progression of disease and disability. This includes finding effective approaches to achieving and sustaining a healthy lifestyle, easing the symptoms of illness, improving quality of life for patients and caregivers, eliminating health disparities, and addressing issues at the end of life. The NINR also fosters collaborations with many other disciplines in areas of mutual interest such as long-term care for older people, the special needs of women across the lifespan, genetic testing and counseling, biobehavioral aspects of the prevention and treatment of infectious diseases, and the impact of environmental influences on risk factors for chronic illnesses.

In support of finding sound scientific bases for changes in clinical practice, the NINR's major emphasis is on clinical research, and NINR programs are conducted primarily through grants to investigators across the country. On the National Institutes of Health (NIH) campus, the NINR's Division of Intramural Research (DIR) focuses on health promotion and symptom management and also provides research training opportunities.

This article was modified from the website of the National Institutes of Health (http://www.nih.gov/about/almanac/organization/NINR.htm) by Ralph D'Agostino and Sarah Karl.
NATIONAL INSTITUTE ON AGING (NIA)
The U.S. National Institute on Aging (NIA) was established by the U.S. Congress in 1974 to ''conduct and support of biomedical, social, and behavioral research, training, health information dissemination, and other programs with respect to the aging process and diseases and other special problems and needs of the aged.'' The NIA maintains several branches with varying emphases. The Clinical Trials Branch plans and administers clinical trials on age-related issues that require extensive specialized clinical trials expertise. Examples of current and possible future types of interventions for trials are:

• Interventions to prevent or to treat geriatric syndromes, disability, and complications of comorbidity or polypharmacy.
• Trials to detect age-related or comorbidity-related differences in responses to interventions against conditions found in middle age and old age.
• Interventions for problems associated with menopause and other midlife and late-life changes.
• Interventions that may affect rates of progression of age-related declines in function in early life and midlife.
• Interventions with protective effects against multiple age-related conditions.
The Integrative Neurobiology Section of the Neurobiology of Aging Branch supports research on the neural mechanisms underlying age-related changes in endocrine functions; neurodegenerative diseases of aging associated with conventional and unconventional infectious agents (e.g., prions); interactions of the central nervous system, neuroendocrine system, and immune system in aging; and the development of clinical trials and novel interventions to treat these pathologies.

The Dementias of Aging Branch supports studies of etiology, pathophysiology, epidemiology, clinical course/natural history, diagnosis and functional assessment, drug discovery, drug development and clinical trials, prevention, behavioral management, and intervention in the cognitive changes associated with the dementias of later life (e.g., mild cognitive impairment, vascular dementia, frontotemporal dementia, Lewy body dementia), especially Alzheimer's disease.

This article was modified from the website of the National Institutes of Health (http://www.nih.gov/about/almanac/organization/NIA.htm) by Ralph D'Agostino and Sarah Karl.
NATIONAL INSTITUTE ON ALCOHOL ABUSE AND ALCOHOLISM (NIAAA)

The mission of the National Institute on Alcohol Abuse and Alcoholism (NIAAA) is to provide leadership in the national effort to reduce alcohol-related problems by:

• Conducting and supporting research in a wide range of scientific areas including genetics, neuroscience, and epidemiology to examine the health risks and benefits of alcohol consumption, prevention, and treatment.
• Coordinating and collaborating with other research institutes and federal programs on alcohol-related issues.
• Collaborating with international, national, state, and local institutions, organizations, agencies, and programs engaged in alcohol-related work.
• Translating and disseminating research findings to health-care providers, researchers, policymakers, and the public.

The Institute's efforts to fulfill its mission are guided by the NIAAA vision to support and promote, through research and education, the best science on alcohol and health for the benefit of all by:

• Increasing the understanding of normal and abnormal biological functions and behavior relating to alcohol use.
• Improving the diagnosis, prevention, and treatment of alcohol use disorders.
• Enhancing quality health care.
This article was modified from the website of the National Institutes of Health (http://www.nih. gov/about/almanac/organization/NIAAA.htm) by Ralph D’Agostino and Sarah Karl.
NATIONAL INSTITUTE ON DEAFNESS AND OTHER COMMUNICATION DISORDERS (NIDCD)

The U.S. National Institute on Deafness and Other Communication Disorders (NIDCD) conducts and supports research and research training on disorders of hearing and other communication processes, including diseases affecting hearing, balance, smell, taste, voice, speech, and language. The NIDCD sponsors:

• Research performed in its own laboratories and clinics.
• A program of research grants, individual and institutional research training awards, career development awards, center grants, conference grants, and contracts to public and private research institutions and organizations.
• Cooperation and collaboration with professional, academic, commercial, voluntary, and philanthropic organizations concerned with research and training that is related to deafness and other communication disorders, disease prevention and health promotion, and the special biomedical and behavioral problems associated with people having communication impairments or disorders.
• The support of efforts to create devices that substitute for lost and impaired sensory and communication functions.
• Ongoing collection and dissemination of information to health professionals, patients, industry, and the public on research findings in these areas.
This article was modified from the website of the National Institutes of Health (http://www.nih. gov/about/almanac/organization/NIDCD.htm) by Ralph D’Agostino and Sarah Karl.
NATIONAL INSTITUTE ON DRUG ABUSE (NIDA)

The National Institute on Drug Abuse (NIDA) provides national leadership for research on drug abuse and addiction by supporting a comprehensive research portfolio that focuses on the biological, social, behavioral, and neuroscientific bases of drug abuse on the body and brain as well as its causes, prevention, and treatment. The NIDA also supports research training, career development, public education, and research dissemination efforts. Through its Intramural Research Program as well as grants and contracts to investigators at research institutions around the country and overseas, NIDA supports research and training on:

• The neurobiological, behavioral, and social mechanisms underlying drug abuse and addiction.
• The causes and consequences of drug abuse, including its impact on society and the morbidity and mortality in selected populations (e.g., ethnic minorities, youth, women).
• The relationship of drug use to problem behaviors and psychosocial outcomes such as mental illness, unemployment, low socioeconomic status, and violence.
• Effective prevention and treatment approaches, including a broad research program designed to develop new treatment medications and behavioral therapies for drug addiction.
• The relationship of drug abuse to the acquisition, transmission, and clinical course of human immunodeficiency virus/acquired immunodeficiency syndrome (HIV/AIDS), tuberculosis, and other diseases and the development of effective prevention/intervention strategies.
• The mechanisms of pain and the search for nonaddictive analgesics.
• The relationship of drug abuse to cultural and ethical issues such as health disparities.

The Center for Clinical Trials Network (CCTN) supports and leads a network of 17 Regional Research Training Centers (RRTCs) and 240 Community Treatment Programs (CTPs) in a bi-directional effort to bridge the gap between the science of drug treatment and its practice through the study of scientifically based treatments in real world settings. The Clinical Trials Network (CTN) serves as a resource and forum for:

• Multisite efficacy and effectiveness trials of promising medications and behavioral interventions.
• Researchers who use the CTN as a platform for studies supported outside of the CCTN.
• NIDA-supported training using predoctoral and postdoctoral and career awards mechanisms.
• Secondary analyses of its rich database.
• Rapid response to emerging public health needs.
• Systematic transfer of research findings, both positive and negative, to treatment programs, clinicians, and patients.

This article was modified from the website of the National Institutes of Health (http://www.nih.gov/about/almanac/organization/NIDA.htm) by Ralph D'Agostino and Sarah Karl.
NATIONAL INSTITUTES OF HEALTH (NIH)
The National Institutes of Health (NIH), a part of the U.S. Department of Health and Human Services, is the primary federal agency for conducting and supporting medical research. Headquartered in Bethesda, Maryland, the NIH is composed of 27 Institutes and Centers and has more than 18,000 employees on the main campus and at satellite sites across the country. The NIH provides leadership and financial support to researchers in every state and throughout the world. Helping to lead the way toward important medical discoveries that improve people's health and save lives, NIH scientists investigate ways to prevent disease as well as the causes, treatments, and even cures for common and rare diseases.

From the time of its founding in 1887 as the Laboratory of Hygiene at the Marine Hospital in Staten Island, New York, the National Institutes of Health has played an important role in improving health in the United States. Many important health and medical discoveries of the last century resulted from research supported by the NIH. In part because of NIH research, our citizens are living longer and better: life expectancy at birth was only 47 years in 1900; by 2000, it was almost 77 years. The NIH translates research results into interventions and communicates research findings to patients and their families, health-care providers, and the general public.

The National Institutes of Health supports and conducts medical research to understand how the human body works and to gain insight into countless diseases and disorders, from rare and unusual diseases to more familiar ones like the common cold. It supports a wide spectrum of research, from learning how the brain becomes addicted to alcohol to combating heart disease. In every state across the country, the NIH supports research at hospitals, universities, and medical schools. The NIH is training the current and next generation of researchers to ensure that the capability to advance medical science remains strong. Many of these scientists-in-training will go on to become leading medical researchers and educators at universities; medical, dental, nursing, and pharmacy schools; schools of public health; nonprofit health research foundations; and private medical research laboratories around the country.

As a federal agency, the NIH considers many different perspectives in establishing research priorities. A very competitive peer-review system identifies and funds the most promising and highest quality research to address these priorities. This research includes studies that ultimately touch the lives of all people. Currently, with the support of the American people, the NIH annually invests over $28 billion in medical research. More than 83% of the NIH's funding is awarded through almost 50,000 competitive grants to more than 325,000 researchers at over 3000 universities, medical schools, and other research institutions in every state and around the world. About 10% of the NIH's budget supports projects conducted by nearly 6000 scientists in its own laboratories, most of which are on the NIH campus in Bethesda, Maryland. The NIH's own scientists and scientists working with support from the NIH grants and contracts have been responsible for countless medical advances, and more than 100 of these scientists have received Nobel Prizes in recognition of their work.

This article was modified from the website of the National Institutes of Health (http://www.nih.gov/about/NIHoverview.html) by Ralph D'Agostino and Sarah Karl.
NATIONAL INSTITUTES OF HEALTH STROKE SCALE (NIHSS)
GUSTAVO A. ORTIZ, MD, and RALPH L. SACCO, MD, MS, FAAN, FAHA
Miller School of Medicine, University of Miami, Miami, Florida

Clinical measurement of the severity of cerebral infarction became an important subject of investigation in the late 1980s with the introduction of new therapies for acute stroke. Traditional measures such as mortality or long-term functional status were not as well suited for the assessment of acute stroke therapies, in which the immediate effects of stroke had to be quantified. The evaluation tools used to quantify a clinical condition are usually referred to as clinimetric instruments (1). ''Stroke scales'' are clinimetric instruments used to quantify neurological deficits, functional outcome, global outcome, or health-related quality of life in patients after a stroke. The National Institutes of Health Stroke Scale (NIHSS) is a systematic assessment tool designed to measure the neurological deficits most often seen with acute stroke. It was developed by Thomas Brott, MD, and colleagues at the University of Cincinnati (Ohio), and its description was first reported in 1989 (2). The examination format was developed to assess neurologic signs in the distribution of each of the major arteries of the brain. Exam items for the NIHSS were adapted from the Toronto Stroke Scale, the Oxbury Initial Severity Scale (3,4), and the Cincinnati Stroke Scale, which graded speech, drift of the affected arm, drift of the affected leg, and grip strength. Two items from the Edinburgh-2 Coma Scale were also used to assess mental status (5). Other categories for sensory function (pupillary response and plantar response) were initially included but were later removed.

The correlation of NIHSS scores with MRI and clinical findings has been used for assessment of the ischemic penumbra, in an effort to identify patients with tissue at risk of infarction for thrombolytic or neuroprotective drugs. Currently, the NIHSS is the most widely used scale for trials related to thrombolytic therapy.

1 CHARACTERISTICS OF THE SCALE

The NIHSS assesses level of consciousness, gaze, visual fields, facial weakness, motor performance of the extremities, sensory deficit, coordination (ataxia), language (aphasia), speech (dysarthria), and hemi-inattention (neglect) (Tables 1, 2, and 3; Figs. 1 and 2). For all parameters, a value of 0 is normal, so the higher the score, the worse the neurological deficit. The differences between the levels are not subtle. For example, motor performance for the upper extremity was graded as 0 = normal, 1 = drift (with arms outstretched), 2 = inability to resist gravity for 10 seconds (with arms outstretched), 3 = no effort against gravity, and 4 = no movement. The lowest possible overall NIHSS score is 0, or ''normal,'' and the highest possible score is 42. The scale was designed to be done quickly and easily at the bedside to provide a rapid and standardized assessment of neurological function in the early periods after a stroke. When first evaluated in 65 patients, it was administered in a mean of 6.6 ± 1.3 minutes and was completed in all patients, regardless of stroke severity (2).
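For readers who need to compute totals programmatically, the arithmetic is simply the sum of the item grades defined in Table 1. The following minimal Python sketch illustrates this; the item keys, the representation of untestable (''UN'') items as None, and the example data are assumptions made for illustration and are not part of the official NIHSS instructions.

    # Illustrative sketch only: summing NIHSS item grades into a total (0-42).
    # Maximum grade per item follows Table 1 (items 1a-11, motor items split by side).
    NIHSS_ITEM_MAX = {
        "1a_loc": 3, "1b_loc_questions": 2, "1c_loc_commands": 2,
        "2_best_gaze": 2, "3_visual_fields": 3, "4_facial_palsy": 3,
        "5a_motor_left_arm": 4, "5b_motor_right_arm": 4,
        "6a_motor_left_leg": 4, "6b_motor_right_leg": 4,
        "7_limb_ataxia": 2, "8_sensory": 2, "9_best_language": 3,
        "10_dysarthria": 2, "11_extinction_inattention": 2,
    }

    def nihss_total(item_scores: dict) -> int:
        """Sum item grades; items recorded as None (untestable, 'UN') add nothing."""
        total = 0
        for item, max_score in NIHSS_ITEM_MAX.items():
            score = item_scores.get(item)
            if score is None:  # UN (e.g., amputation, intubation) is assumed to contribute 0 here
                continue
            if not 0 <= score <= max_score:
                raise ValueError(f"{item}: score {score} outside 0-{max_score}")
            total += score
        return total

    # Example: a patient with mild right arm drift and mild dysarthria scores 2 of 42.
    example = {k: 0 for k in NIHSS_ITEM_MAX}
    example["5b_motor_right_arm"] = 1
    example["10_dysarthria"] = 1
    print(nihss_total(example))  # -> 2

The sum of the maximum grades above is 42, matching the highest possible overall score noted in the text.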
1.1 Reliability

Reliability, an index of the reproducibility and precision of the NIHSS, was first evaluated among twenty-four patients with acute cerebral infarction who were examined twice within a 24-hour interval. Each examination was performed by a neurologist, while the other three examination team members (a neurology house officer, a neurology nurse-clinician, and an emergency department nurse-clinician) observed; each team member then independently scored the examination. Inter-rater agreement among these four examiners was high (mean κ = 0.69), and intra-rater agreement was also excellent, especially when the rater was a neurologist (κ = 0.77) (2).
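The agreement statistics quoted here are unweighted kappa coefficients. As a rough illustration of how such a coefficient is obtained for a single scale item, the sketch below computes Cohen's kappa for two raters; the rater data are invented for the example and are not taken from the cited studies.

    # Illustrative sketch only: unweighted Cohen's kappa between two raters'
    # grades on a single NIHSS item. Example data are invented.
    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        """(observed agreement - chance agreement) / (1 - chance agreement)."""
        assert len(rater_a) == len(rater_b) and rater_a
        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        freq_a, freq_b = Counter(rater_a), Counter(rater_b)
        categories = set(freq_a) | set(freq_b)
        chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
        return (observed - chance) / (1 - chance)

    # Example: facial palsy grades (0-3) assigned by two raters to ten patients.
    a = [0, 1, 2, 0, 3, 1, 0, 2, 1, 0]
    b = [0, 1, 2, 1, 3, 1, 0, 2, 0, 0]
    print(round(cohens_kappa(a, b), 2))  # -> 0.71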
Table 1. The National Institutes of Health Stroke Scale∗ Test 1a. Level of Consciousness (LOC):
1b. LOC Questions:
1c. LOC Commands:
2. Best Gaze:
3. Visual: Visual fields (upper and lower quadrants) 4. Facial Palsy: Ask– or use pantomime to encourage
5. Motor Arm: 5a. Left Arm 5b. Right Arm
6. Motor Leg: 6a. Left Leg 6b. Right Leg
7. Limb Ataxia:
8. Sensory:
Scale Definition 0 = Alert; keenly responsive. 1 = Not alert; however, arousable by minor stimulation to obey, answer, or respond. 2 = Not alert; requires repeated stimulation to attend, or is obtunded and requires strong or painful stimulation to make movements (not stereotyped). 3 = Responds only with reflex motor or autonomic effects or totally unresponsive, flaccid, and are flexic. 0 = Answers both questions correctly. 1 = Answers one question correctly. 2 = Answers neither question correctly. 0 = Performs both tasks correctly. 1 = Performs one task correctly. 2 = Performs neither task correctly. 0 = Normal. 1 = Partial gaze palsy; gaze is abnormal in one or both eyes, but forced deviation or total gaze paresis is not present. 2 = Forced deviation, or total gaze paresis not overcome by the oculocephalic maneuver. 0 = No visual loss. 1 = Partial hemianopia. 2 = Complete hemianopia. 3 = Bilateral hemianopia (blind including cortical blindness). 0 = Normal symmetrical movements. 1 = Minor paralysis (flattened nasolabial fold; asymmetry on smiling). 2 = Partial paralysis (total or near-total paralysis of lower face). 3 = Complete paralysis of one or both sides (absence of facial movement in the upper and lower face). 0 = No drift; limb holds 90 (or 45) degrees for full 10 seconds. 1 = Drift; limb holds 90 (or 45) degrees, but drifts down before full 10 seconds; does not hit bed or other support. 2 = Some effort against gravity; limb cannot get to or maintain (if cued) 90 (or 45) degrees; drifts down to bed, but has some effort against gravity. 3 = No effort against gravity; limb falls. 4 = No movement. UN = Amputation or joint fusion, explain: 0 = No drift; leg holds 30-degree position for full 5 seconds. 1 = Drift; leg falls by the end of the 5-second period but does not hit bed. 2 = Some effort against gravity; leg falls to bed by 5 seconds but has some effort against gravity. 3 = No effort against gravity; leg falls to bed immediately. 4 = No movement. UN = Amputation or joint fusion, explain: 0 = Absent. 1 = Present in one limb. 2 = Present in two limbs. UN = Amputation or joint fusion, explain: 0 = Normal; no sensory loss. 1 = Mild-to-moderate sensory loss; patient feels pinprick is less sharp or is dull on the affected side, or there is a loss of superficial pain with pinprick, but patient is aware of being touched. 2 = Severe to total sensory loss; patient is not aware of being touched in the face, arm, and leg.
Table 1. (continued)

9. Best Language: 0 = No aphasia; normal. 1 = Mild-to-moderate aphasia; some obvious loss of fluency or facility of comprehension, without significant limitation on ideas expressed or form of expression. Reduction of speech and/or comprehension, however, makes conversation about provided materials difficult or impossible. For example, in conversation about provided materials, examiner can identify picture or naming card content from patient response. 2 = Severe aphasia; all communication is through fragmentary expression; great need for inference, questioning, and guessing by the listener. Range of information that can be exchanged is limited; listener carries burden of communication. Examiner cannot identify materials provided from patient response. 3 = Mute, global aphasia; no usable speech or auditory comprehension.
10. Dysarthria: 0 = Normal. 1 = Mild-to-moderate dysarthria; patient slurs at least some words and, at worst, can be understood with some difficulty. 2 = Severe dysarthria; patient speech is so slurred as to be unintelligible in the absence of or out of proportion to any dysphasia, or is mute/anarthric. UN = Intubated or other physical barrier, explain:
11. Extinction and Inattention (Formerly Neglect): 0 = No abnormality. 1 = Visual, tactile, auditory, spatial, or personal inattention or extinction to bilateral simultaneous stimulation in one of the sensory modalities. 2 = Profound hemi-inattention or extinction to more than one modality; does not recognize own hand or orients to only one side of space.
∗ For all parameters, a value of 0 is normal; so, the higher the score, the worse the neurological deficit. The differences between the levels are not subtle. (Adapted from ‘‘NIH Stroke Scale’’ at www.ninds.nih.gov.)
Table 2. Evaluation of Language: Reading∗
You know how.
Down to earth.
I got home from work.
Near the table in the dining room.
They heard him speak on the radio last night.
∗ The patient is asked to read these sentences. (Reproduced from www.ninds.nih.gov.)
Table 3. Evaluation of Dysarthria: the patient is asked to read and say these words (Reproduced from www.ninds.nih.gov)
MAMA
TIP-TOP
FIFTY-FIFTY
THANKS
HUCKLEBERRY
BASEBALL PLAYER
Another study of 20 patients confirmed that the NIH Stroke Scale has substantial or moderate inter-rater reliability for nine of its items (6). The NIHSS was first used in a pilot study of t-PA administered between 0 and 180 minutes from stroke onset (7), and a modified, 13-item form of the scale was later used in the definitive NINDS trial of t-PA (8). In preparation for the latter trial, 162 investigators were trained and certified in the use of the scale by using a two-camera videotape method, an approach that enhanced intra-rater and inter-rater reliability (9). Moderate to excellent agreement was found on most items (unweighted κ > 0.60), but facial paresis and ataxia showed poor agreement (unweighted κ < 0.40). The scale was also demonstrated to be reliable when performed by non-neurologists and nonphysicians, and training via the videotape method came to be considered a prerequisite for its use (9–12).
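Agreement statistics such as the unweighted kappa values quoted above are computed from paired ratings of the same patients. The minimal sketch below shows the calculation for two hypothetical raters scoring a single NIHSS item; the scores are invented for illustration and are not data from the cited studies.

```python
# Illustrative only: unweighted Cohen's kappa for two hypothetical raters
# scoring the same 20 patients on a single NIHSS item (0-4). The data are
# invented and are not taken from the studies cited in the text.
from collections import Counter

rater_a = [0, 1, 2, 0, 3, 1, 1, 0, 2, 4, 0, 1, 2, 2, 0, 1, 3, 0, 1, 2]
rater_b = [0, 1, 2, 1, 3, 1, 0, 0, 2, 4, 0, 1, 2, 3, 0, 1, 3, 0, 1, 2]

n = len(rater_a)
# Observed agreement: proportion of patients given the same score by both raters.
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Expected chance agreement, from each rater's marginal score frequencies.
freq_a, freq_b = Counter(rater_a), Counter(rater_b)
p_expected = sum((freq_a[s] / n) * (freq_b[s] / n) for s in set(rater_a) | set(rater_b))

kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"observed = {p_observed:.2f}, expected = {p_expected:.2f}, kappa = {kappa:.2f}")
```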
1.2 Validity The validity, or extent to which the NIHSS measured what it purported to measure, was initially assessed by determining reliability and accuracy (13). Accuracy was determined by correlating the NIHSS scores with two other measures of stroke severity: computed tomography (CT) measurement of the volume of infarction and clinical outcome at 3 months.
1. Correlation with size of infarction: The Spearman correlation between the total NIHSS score and the CT lesion at 7 days was 0.74. The initial neurological deficit of the patient as measured by the scale also correlated with the 7–10 day CT lesion volume (correlation coefficient = 0.78) (14).
2. Correlation with clinical outcome: Stroke scale scores at 7 days corresponded to eventual clinical outcome, including placement in the community, with a Spearman correlation of 0.71 (P = 0.0001). The admission stroke scale scores also correlated with the eventual patient functional outcome at 3 months, with a Spearman correlation of 0.53 (P = 0.0001) (2).
In 1999, Dr. Lyden et al. with the NINDS tPA Trial investigators validated the scale as an outcome measure, using data from the NINDS tPA Stroke Trial (15). In this study, it was shown that the internal scale structure remained consistent in treated and placebo groups and when the scale was administered serially over time. Also, the correlations between the scale and other clinical outcome scales (Barthel Index, Rankin Scale, and Glasgow Outcome Scale) at 3 months were significant (P < 0.001). The study supported the validity of the scale for use in future treatment trials as an outcome measure. Other studies using the NIHSS have also shown good construct validity. In a post hoc analysis by stroke subtype of 1268 patients enrolled in an acute stroke trial, baseline NIHSS scores strongly predicted outcome at 7 days and at 3 months. An excellent outcome was achieved by almost two thirds of patients with a score of 3 or less at day 7; however, very few patients with baseline scores of more than 15 had excellent outcomes after 3 months (16). Alternative measures of neurological outcome, such as activities of daily living (ADL) scales, have also been correlated with the NIHSS (17).
2 CORRELATION OF NIHSS SCORES WITH MRI STUDIES
MRI scans permitted a more accurate measurement of stroke lesion volume and allowed for a better assessment of the correlation between the NIHSS and stroke size. Lovblad et al. in 1997 found a significant correlation between Diffusion Weighted Imaging (DWI) lesion size and both the initial (r = 0.53, P = 0.0003) and the chronic NIHSS score (r = 0.68, P < 0.0001) in 50 patients with acute middle cerebral artery ischemic stroke (< 24-hour duration) (18). The correlation was also demonstrated in earlier strokes (within 6.5 hours of symptom onset), by comparing DWI or Perfusion Weighted Imaging (PWI) to assess lesion size (19). In eight of nine DWI-positive patients in this series, a strong linear correlation existed between 24-hour NIHSS and DWI volume (r = 0.96, P < 0.001), and PWI correlated better with 24-hour NIHSS than with DWI.
Figure 1. Evaluation of language: Comprehension. The patient is asked to describe what is happening in this picture. (Reproduced from www.ninds.nih.gov).
Figure 2. Evaluation of language: Naming. The patient is asked to name the objects shown in this ‘‘naming sheet.’’ (Reproduced from www.ninds.nih.gov).
The difference was attributed primarily to one patient, who had a substantial perfusion delay in PWI, whereas DWI showed no abnormality. Clinically, the deficit was substantial (NIHSS = 24). On a subsequent 24-hour MRI scan, a DWI abnormality developed that closely matched her initial perfusion deficit (19). This study led to the
development of the concept of PWI/DWI mismatch. Following the initial report by Dr. Tong, MRI has been used extensively to characterize the ischemic penumbra by identifying regions of brain with reduced blood flow
(defined by PWI) and regions with irreversible neuronal injury (approximated by DWI). A mismatch between the volume of abnormality on diffusion and perfusion imaging seems to indicate viable penumbral tissue (PWI/DWI mismatch). PWI/DWI mismatch has been shown to correlate with favorable outcomes following late treatment with intravenous rtPA up to 6 hours from symptom onset (20–23). However, PWI is a complex, time-consuming, and not well-standardized technique, with limited availability. Because abnormal PWI volume correlates more strongly with stroke severity (evaluated clinically with the NIHSS) than does DWI volume (19,24), Davalos et al. in 2004 proposed the concept of clinical-DWI mismatch (CDM) (25). CDM, defined as an NIHSS score ≥ 8 with an ischemic volume on DWI ≤ 25 mL, was associated with a high probability of infarct growth and early neurological deterioration. Subsequently, it was suggested that CDM may identify patients with tissue at risk of infarction who are candidates for thrombolytic or neuroprotective drugs. CDM was later shown to predict the presence of PWI/DWI mismatch with a specificity of 93% (95% confidence interval (CI), 62%–99%) and a positive predictive value of 95% (95% CI, 77%–100%) but a sensitivity of only 53% (95% CI, 34%–68%). Efforts are being made to evaluate a computed tomography–NIHSS mismatch by using the recently applied Alberta Stroke Program Early CT Score (ASPECTS), but results are still controversial (26,27).
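To make diagnostic-accuracy figures like those quoted above concrete, the short sketch below shows how sensitivity, specificity, and positive predictive value, with Wilson 95% confidence intervals, are derived from a 2 × 2 table of CDM status against PWI/DWI mismatch. The cell counts are hypothetical and are not taken from the cited study.

```python
# Illustrative only: sensitivity, specificity, and PPV (with Wilson 95% CIs)
# from a hypothetical 2x2 table of clinical-DWI mismatch (CDM) versus
# PWI/DWI mismatch. The counts below are invented for demonstration.
from math import sqrt

tp, fp = 19, 1    # CDM+ with and without PWI/DWI mismatch (hypothetical)
fn, tn = 17, 13   # CDM- with and without PWI/DWI mismatch (hypothetical)

def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / total
    center = (p + z**2 / (2 * total)) / (1 + z**2 / total)
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / (1 + z**2 / total)
    return p, center - half, center + half

for name, num, den in [("sensitivity", tp, tp + fn),
                       ("specificity", tn, tn + fp),
                       ("PPV", tp, tp + fp)]:
    p, lo, hi = wilson_ci(num, den)
    print(f"{name}: {p:.0%} (95% CI {lo:.0%}-{hi:.0%})")
```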
3 USE OF THE NIHSS
Originally, the NIHSS was developed for use in multicenter acute stroke trials. It proved reliable and valid for the measurement of stroke-related deficits at baseline, as well as for measuring outcome. Currently, the NIHSS is the most widely used scale for trials related to thrombolytic therapy. With the approval of intravenous thrombolysis for the treatment of acute stroke, the use of the NIHSS in the clinical care of patients increased. The ease of administration of the NIHSS, which can be effectively performed by physicians and nonphysicians
after relatively simple training, led to more widespread acceptance. The NIHSS is used to provide an accurate assessment of the deficits of a patient with stroke during the initial presentation. This initial clinical assessment can be easily transmitted to other health-care providers involved in the care of the patient, thus expediting the decision-making process in the emergency setting. The scale is also a sensitive measurement tool for serial monitoring of patients with stroke, quantifying changes in the neurological examination. An increase of 2 points or more on the NIHSS has been suggested to be associated with relevant worsening of the stroke, although this specific cutoff has not been independently validated (28). In the clinical setting, the NIHSS predicts post-acute care disposition among stroke patients and may facilitate the process of hospital discharge (29,30). In particular, stroke patients with NIHSS scores ≤ 5 can be expected to go home, those with scores > 13 most often go to a nursing facility, and those with intermediate scores most often go to acute rehabilitation (29).
4 STANDARDIZED TRAINING FOR THE USE OF THE NIHSS
The American Stroke Association, in conjunction with the American Academy of Neurology (AAN) and the National Institute of Neurological Disorders and Stroke (NINDS), developed a free, CME/CE-certified online training program for healthcare professionals (targeted to emergency physicians, neurologists, nurses, clinical research raters, and medical students) to learn or review how to administer the NIHSS for acute stroke assessment. The program is available at the American Stroke Association webpage (www.strokeassociation.org) through the Professional Education Center link and requires signing up and creating a profile with a username and password before taking the course. Video training is also available on DVD. The course consists of six test groups, A through F. After successfully completing the program, participants can print their CME/CE credit for test groups A and B directly from the website and can also print a certificate of
completion for test groups A, B, C, D, E, or F (31).
5 LIMITATIONS OF THE SCALE
Several limitations of the NIHSS are as follows:
1. The inter-rater reliability of the scale is not homogeneous across all items. It has been shown that the assessments of limb ataxia and facial weakness have lower agreement between examiners compared with the other items, with calculated κ values that were not significantly different from chance. However, the calculated values of κ were significantly greater than expected by chance for 11 of 13 items, indicating substantial agreement for 5 items, moderate agreement for 4 items, and fair agreement for 2 items. This rating system compared favorably with other scales (6).
2. The NIHSS has a differential characterization of right versus left hemisphere strokes. Of the 42 possible points on the NIHSS, 7 are directly related to measurement of language function (orientation questions, 2; commands, 2; and aphasia, 3), and only 2 points are for neglect. It has been shown that for a given NIHSS score, the total lesion volume for patients with right (usually nondominant for language) hemisphere strokes is statistically larger than the lesion volume for patients with left (usually dominant for language) hemisphere strokes. The difference reflects the differential weighting of the NIHSS with regard to language function, as compared with hemineglect (32).
3. Brainstem infarcts may not be adequately characterized by the NIHSS. Cranial nerves are not fully assessed in the scale; therefore, it is possible that life-threatening brainstem or cerebellar infarctions may result in lower scores that underestimate the clinical severity of the stroke.
4. The total NIHSS score has limited value. Although all items in the scale are graded numerically, with higher values representing more severe deficits, these are ordinal-level, not interval-level, data (33). The total score is obtained by adding all these individual rankings and may be misleading (34). It has been suggested that clinically it is more useful to think of the NIHSS as a way to quantify serial neurologic findings and measure change within an individual patient over time.
6 FUTURE DIRECTIONS
A modified NIHSS (mNIHSS) with 11 items, derived from previous clinimetric studies of the NIHSS, has been proposed. Poorly reproducible or redundant items (level of consciousness, face weakness, ataxia, and dysarthria) were deleted, and the sensory items were collapsed into two responses. Ten of the 11 items show excellent reliability, and 1 item shows good reliability (35,36). Other shortened scales are being proposed for prehospital clinical assessment of stroke, but additional studies are needed to assess their value in the screening of stroke by paramedic services (37). Video telemedicine is being proposed as an option to facilitate cerebrovascular specialty consults for underserved areas. It seems feasible to perform the NIHSS remotely using computer-based technology, in an effort to increase the rate of rt-PA administration (38–40).
REFERENCES
1. K. Asplund, Clinimetrics in stroke research. Stroke 1987; 18: 528–530. 2. T. Brott, et al., Measurements of acute cerebral infarction: a clinical examination scale. Stroke 1989; 20: 864–870. 3. J. M. Oxbury, R. C. Greenhall, and K. M. Grainger, Predicting the outcome of stroke: acute stage after cerebral infarction. Br. Med. J. 1975; 3: 125–127. 4. R. Cote, et al., The Canadian Neurological Scale: a preliminary study in acute stroke. Stroke 1986; 17: 731–737. 5. K. Sugiura, et al., The Edinburgh-2 coma scale: a new scale for assessing impaired consciousness. Neurosurgery 1983; 12: 411–415.
6. L. B. Goldstein, C. Bertels, and J. N. Davis, Interrater reliability of the NIH stroke scale. Arch. Neurol. 1989; 46: 660–662. 7. E. C. Haley Jr., et al., Urgent therapy for stroke. Part II. Pilot study of tissue plasminogen activator administered 91–180 minutes from onset. Stroke 1992; 23: 641–645. 8. NINDS, Tissue plasminogen activator for acute ischemic stroke. The National Institute of Neurological Disorders and Stroke rt-PA Stroke Study Group. N. Engl. J. Med. 1995; 333: 1581–1587. 9. P. Lyden, et al., Improved reliability of the NIH Stroke Scale using video training. NINDS TPA Stroke Study Group. Stroke 1994; 25: 2220–2226. 10. M. A. Albanese, et al., Ensuring reliability of outcome measures in multicenter clinical trials of treatments for acute ischemic stroke. The program developed for the Trial of Org 10172 in Acute Stroke Treatment (TOAST). Stroke 1994; 25: 1746–1751. 11. L. B. Goldstein and G. P. Samsa, Reliability of the National Institutes of Health Stroke Scale. Extension to non-neurologists in the context of a clinical trial. Stroke 1997; 28: 307–310. 12. S. Schmulling, et al., Training as a prerequisite for reliable use of NIH Stroke Scale. Stroke 1998; 29: 1258–1259. 13. R. Cote, et al., Stroke assessment scales: guidelines for development, validation, and reliability assessment. Can. J. Neurol. Sci. 1988; 15:261–265 14. T. Brott, et al., Measurements of acute cerebral infarction: lesion size by computed tomography. Stroke 1989; 20: 871–875. 15. P. Lyden, et al., Underlying structure of the National Institutes of Health Stroke Scale: results of a factor analysis. NINDS tPA Stroke Trial Investigators. Stroke 1999; 30: 2347–2354. 16. H. P. Adams Jr., et al., Baseline NIH Stroke Scale score strongly predicts outcome after stroke: a report of the Trial of Org 10172 in Acute Stroke Treatment (TOAST). Neurology 1999; 53: 126–131. 17. P. W. Duncan, et al., Measurement of motor recovery after stroke. Outcome assessment and sample size requirements. Stroke 1992; 23: 1084–1089. 18. K. O. Lovblad, et al., Ischemic lesion volumes in acute stroke by diffusion-weighted magnetic resonance imaging correlate with clinical outcome. Ann Neurol, 1997; 42: 164–170. 19. D. C. Tong, et al., Correlation of perfusion- and diffusion-weighted MRI with NIHSS score in
acute (< 6.5 hour) ischemic stroke. Neurology 1998; 50: 864–870. 20. G. W. Albers, et al., Magnetic resonance imaging profiles predict clinical response to early reperfusion: the diffusion and perfusion imaging evaluation for understanding stroke evolution (DEFUSE) study. Ann. Neurol. 2006; 60: 508–517. 21. J. Rother, et al., Effect of intravenous thrombolysis on MRI parameters and functional outcome in acute stroke < 6 hours. Stroke 2002; 33: 2438–2445. 22. M. W. Parsons, et al., Diffusion- and perfusionweighted MRI response to thrombolysis in stroke. Ann. Neurol. 2002; 51: 28–37. 23. G. Thomalla, et al., Outcome and symptomatic bleeding complications of intravenous thrombolysis within 6 hours in MRI-selected stroke patients: comparison of a German multicenter study with the pooled data of ATLANTIS, ECASS, and NINDS tPA trials. Stroke 2006; 37: 852–858. 24. T. Neumann-Haefelin, et al., Diffusion- and perfusion-weighted MRI. The DWI/PWI mismatch region in acute stroke. Stroke 1999; 30: 1591–1597. 25. A. Davalos, et al., The clinical-DWI mismatch: a new diagnostic approach to the brain tissue at risk of infarction. Neurology 2004; 62: 2187–2192. 26. H. Tei, S. Uchiyama, and T. Usui, Clinicaldiffusion mismatch defined by NIHSS and ASPECTS in non-lacunar anterior circulation infarction. J Neuro. 2007; 254: 340–346. 27. S. R. Messe, et al., CT-NIHSS mismatch does not correlate with MRI diffusion-perfusion mismatch. Stroke 2007; 38: 2079–2084. 28. B. C. Tilley, et al., Use of a global test for multiple outcomes in stroke trials with application to the National Institute of Neurological Disorders and Stroke t-PA Stroke Trial. Stroke 1996; 27: 2136–2142. 29. D. Schlegel, et al., Utility of the NIH Stroke Scale as a predictor of hospital disposition. Stroke 2003; 34: 134–137. 30. D. J. Schlegel, et al., Prediction of hospital disposition after thrombolysis for acute ischemic stroke using the National Institutes of Health Stroke Scale. Arch. Neurol. 2004; 61: 1061–1064. 31. American Stroke Association, NIH Stroke Scale Training Online, 2007. Available: www.strokeassociation.org. 32. D. Woo, et al., Does the National Institutes of Health Stroke Scale favor left hemisphere
strokes? NINDS t-PA Stroke Study Group. Stroke 1999; 30: 2355–2359. 33. A. R. Feinstein, B. R. Josephy, and C. K. Wells, Scientific and clinical problems in indexes of functional disability. Ann. Intern. Med. 1986; 105: 413–420. 34. T. J. Steiner and F. Clifford Rose, Towards a model stroke trial. The single-centre naftidrofuryl study. Neuroepidemiology 1986; 5: 121–147. 35. B. C. Meyer, et al., Modified National Institutes of Health Stroke Scale for use in stroke clinical trials: prospective reliability and validity. Stroke 2002; 33: 1261–1266. 36. P. D. Lyden, et al., A modified National Institutes of Health Stroke Scale for use in stroke clinical trials: preliminary reliability and validity. Stroke 2001; 32: 1310–1317. 37. D. L. Tirschwell, et al., Shortening the NIH Stroke scale for use in the prehospital setting. Stroke 2002; 33: 2801–2806. 38. B. C. Meyer, et al., Prospective reliability of the STRokE DOC wireless/site independent telemedicine system. Neurology 2005; 64: 1058–1060. 39. S. R. Levine and M. Gorman, ‘‘Telestroke’’: the application of telemedicine for stroke. Stroke 1999; 30: 464–469. 40. S. R. Levine, and K. M. McConnochie, Telemedicine for acute stroke: when virtual is as good as reality. Neurology 2007; 69: 819–820.
NATIONAL LIBRARY OF MEDICINE (NLM) The U.S. National Library of Medicine (NLM) is the world’s largest research library of the health sciences and serves scientists, health professionals, and the public. The Library has a statutory mandate from Congress to apply its resources broadly to the advancement of medical and health-related sciences. To that end, it collects, organizes, and makes available biomedical information to investigators, educators, practitioners, and the public and carries out programs designed to strengthen existing and develop new medical library services in the United States. It conducts research in health communications, supports medical informatics, and provides information services and sophisticated tools in the areas of molecular biology and toxicology/environmental health. The NLM also creates Web-based services for the general public containing information from the National Institutes of Health (NIH) and other reliable sources.
This article was modified from the website of the National Institutes of Health (http://www.nih.gov/ about/almanac/organization/NLM.htm) by Ralph D’Agostino and Sarah Karl.
NEW DRUG APPLICATION (NDA) PROCESS
For decades, the regulation and control of new drugs in the United States have been based on the New Drug Application (NDA). Since 1938, every new drug has been the subject of an approved NDA before U.S. commercialization. The NDA is the vehicle through which drug sponsors formally propose that the U.S. Food and Drug Administration (FDA) approve a new pharmaceutical for sale and marketing in the United States. The data gathered during the animal studies and human clinical trials of an Investigational New Drug (IND) become part of the NDA. The goals of the NDA are to provide enough information to permit an FDA reviewer to reach the following key decisions:
• Whether the drug is safe and effective in its proposed use(s), and whether the benefits of the drug outweigh the risks.
• Whether the drug’s proposed labeling (the package insert) is appropriate, and what it should contain.
• Whether the methods used in manufacturing the drug and the controls used to maintain the drug’s quality are adequate to preserve the drug’s identity, strength, quality, and purity.
The documentation required in an NDA is supposed to tell the drug’s whole story, including what happened during the clinical tests, what the ingredients of the drug are, what the results of the animal studies were, how the drug behaves in the body, and how it is manufactured, processed, and packaged.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/regulatory/applications/ nda.htm) by Ralph D’Agostino and Sarah Karl.
N-OF-1 RANDOMIZED TRIALS
ANDREW L. AVINS
Kaiser-Permanente, Northern California Division of Research and Departments of Medicine and Epidemiology & Biostatistics, University of California, San Francisco
JOHN NEUHAUS
Department of Epidemiology & Biostatistics, University of California, San Francisco
Comparative clinical trials in individual patients, commonly known as "N-of-1" trials, have been used for many years in the field of psychology and have found increasing applications for biomedical problems. Such trials can be structured to provide information generalizable to large groups of patients, but the primary application of N-of-1 trials is to help individual patients make more evidence-based decisions about specific aspects of their medical care (1). In clinical practice, patients frequently undergo "therapeutic trials" in which a health-care provider recommends a course of action that the patient tries and decides whether to continue, modify, or terminate. An N-of-1 study has the same goal as a therapeutic trial but brings a more structured and less biased approach to evaluating a therapeutic response (1–3). Such trials are also known as "single-patient trials," "individualized medication effectiveness trials," and "individual patient trials" (4–6). An N-of-1 trial is generally structured as a randomized, double-blind, controlled, multiple-period crossover trial in a single patient, often of a pharmacologic agent (1, 3, 6, 7). In these studies, a patient is assigned to different periods on the active and the comparator treatments, with outcome measurements obtained during each of these periods. At the end of the trial, the patient's responses during each of the treatment conditions are compared, and the information is used to make a decision about the likelihood that one of the therapeutic options is superior. Such trials have been successfully used in such disparate conditions as osteoarthritis, hypertension, nocturnal leg cramps, attention deficit/hyperactivity disorder, fibromyalgia, Parkinson's disease, asthma, and chronic obstructive pulmonary disease, among many others.
1 GOAL OF N-OF-1 STUDIES
An N-of-1 study may be initiated for several reasons. Most commonly, the primary motivation is to better define the effectiveness of therapy for a symptomatic condition, though the effect of therapy on an asymptomatic endpoint (e.g., blood pressure) may also be investigated with appropriate measurements. N-of-1 studies may be valuable when patients are hesitant to use a therapy and desire more information about their response and when the medical provider and patient disagree about the response to a therapy. Another common use is to understand whether a specific therapy is the cause of an undesirable symptom in a patient. Determining the optimal dose of a medication is a matter that also lends itself well to study by N-of-1 methodology. Because it is often impossible or impractical to conduct typical randomized trials for patients with rare conditions, N-of-1 trials may provide an opportunity for defining optimal therapy for these patients. As will be noted, however, conditions and therapies amenable to N-of-1 trials have specific attributes that render many conditions or treatments unsuitable for study by this design. One of the most valuable situations in which to perform N-of-1 studies is when the clinical response to an intervention is known to be highly variable, there is uncertainty about the patient's response, and the intervention is expensive or associated with serious adverse effects. In this situation, greater insight into a patient's true response is particularly helpful in formulating rational treatment decisions. For example, Zucker et al. (8) describe the use of an N-of-1 trial to help fibromyalgia patients make decisions about using the combination of amitriptyline and fluoxetine versus amitriptyline alone; a prior trial had found that only 63% of patients
responded to the combination therapy (and both drugs are associated with a wide range of side effects).
2 REQUIREMENTS
Similar to other crossover studies, N-of-1 trials are possible or practical in only a limited number of settings, as both the condition being treated and the therapy being tested must meet certain conditions in order for the trial to be successful (1, 3, 6, 7, 9, 10). 2.1 Patient’s Motivation Both the adherence to the study protocol and the usefulness of the data obtained are closely related to the motivation of the patient. Hence, there should be a clearly perceived need for conducting the N-of-1 study on the part of both the patient and the health-care provider. There should be uncertainty or disagreement about which choice of therapy (if any) is best and a mutual understanding of the value of conducting an N-of-1 trial to answer the question (2). Given the difficulty and expense of conducting an N-of-1 trial, there should also be a clear understanding of how (and whether) the patient will use the information from the trial to make a decision about therapy. A lack of motivation may seriously undermine the likelihood of success of the trial (see section 5.1). 2.2 Conditions Suitable for Study Because N-of-1 trials are generally structured as multiple-period crossover trials, only certain medical conditions are amenable to study with this methodology. Generally, appropriate illnesses must be chronic or recurring. Conditions whose status changes rapidly are difficult to investigate, though increasing the number of crossover periods may mitigate this problem. Similarly, recurrent illnesses with long asymptomatic intercurrent periods are difficult to study; this problem can sometimes be addressed by increasing the length of the treatment periods, though this approach may reduce the patient’s motivation and adherence to the protocol. Studies of short, self-limited illnesses are unlikely to be informative.
2.3 Therapies Suitable for Study For a successful trial, the study therapy must act quickly or a longer treatment period will be required to fairly evaluate treatment effectiveness. Similarly, it is desirable that once treatment is stopped the patient’s condition quickly returns to its baseline state, or prolonged washout periods will be required. Obviously, a potentially curative treatment is inappropriate for any crossover study. 2.4 Combining Results of Many N-of-1 Trials When many patients with the same diagnosis participate in N-of-1 trials with the same intervention, the results of these studies can be combined to estimate a population treatment effect, much like a formal multiparticipant crossover study. If such a series of N-of-1 trials is contemplated, care should be taken to ensure that they are structured in a way that permits a valid synthesis of the data. These considerations include appropriate definitions of the disease state under study and applicable eligibility criteria, a treatment protocol that provides enough similarity to make a synthesis meaningful, and outcome assessments that are measured with sufficient uniformity to permit combining the data. Using Bayesian methods, the data from an accumulating series of N-of-1 trials can be used to refine the estimates of the likelihood of a response for an individual patient as well as providing a generalizable estimate of effect for a population. Zucker et al. (8, 11) provide a detailed methodology for this technique, including the structure of a hierarchical Bayesian model and an example of this method in a series of N-of-1 studies for testing pharmacologic therapy for fibromyalgia. 3 DESIGN CHOICES AND DETAILS FOR N-OF-1 STUDIES Numerous options exist for the design and implementation of N-of-1 studies. Most variants have been tried successfully, and it is clear that there is no single design that is optimal for all situations; the creativity and insight of the investigator will dictate the best approach among the available
alternatives. Issues that must be addressed include randomization, number and frequency of crossover periods, run-in and washout periods, blinding, outcome assessment, and choice of outcome measure.
3.1 Randomization It is not absolutely required that the various treatment periods in an N-of-1 study be randomized, but most authorities agree that randomization is greatly preferable when it can be done. When randomization is employed, there are two major options: an unconstrained randomized design or a blocked randomized design (2). In the former, the treatment periods are completely randomized without regard to ordering (i.e., without blocking). This method has the advantage of satisfying the underlying assumption of some statistical tests commonly used in single-patient experiments and is easy to implement. However, it suffers from the risk of generating treatment patterns that make causal inference difficult, being more susceptible to period effects. For example, in an N-of-1 study of two therapies with three treatment periods each, an unconstrained design may result in three periods on the first treatment followed by three periods on the second treatment (i.e., AAABBB). Should the second treatment appear superior, one may be left wondering whether it is the better therapy or whether the patient's symptoms were beginning to remit spontaneously. In typical crossover studies, this problem is addressed by randomizing participants to different interventions for the initial treatment period, but this option is not available in the N-of-1 design with a single participant. An alternative to the unconstrained randomized design is the blocked randomized design. In this design (using the paradigm of the two-treatment comparison), the treatment periods are randomized in pairs (i.e., blocks of two) so that each treatment is compared with the other after every two treatment periods, though the specific ordering within each pair is random. For example, a blocked randomization design might result in a treatment ordering of ABBABA, BABABA, or BAABBA. A related alternative is to use strictly alternating treatments, with the first treatment chosen at random, though such a design may be more susceptible to unblinding of the treatments.
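The pairwise blocking just described is straightforward to implement. The sketch below generates a blocked treatment sequence for a two-treatment N-of-1 trial; the function name and parameters are illustrative and are not drawn from any published protocol.

```python
# A minimal sketch of the pairwise (blocked) randomization described above:
# each block of two periods contains one period on A and one on B, with the
# order within each block chosen at random. Names and values are illustrative.
import random

def blocked_n_of_1_sequence(n_pairs, treatments=("A", "B"), seed=None):
    rng = random.Random(seed)
    sequence = []
    for _ in range(n_pairs):
        block = list(treatments)
        rng.shuffle(block)          # randomize the order within this pair
        sequence.extend(block)
    return sequence

# Three treatment pairs (six periods), e.g. ['A', 'B', 'B', 'A', 'B', 'A']
print(blocked_n_of_1_sequence(3, seed=7))
```

By contrast, an unconstrained design would shuffle all six periods at once and could, by chance, produce a pattern such as AAABBB.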
3.2 Number and Length of Treatment Periods
The choice of the number and length of the treatment periods will be a function of the particular characteristics of the illness and therapies under study, the tolerance of the patient for longer trial lengths (and the burden of data collection), the need to obtain the clinical information more quickly, the availability of resources for conducting the trial, and the specific outcomes studied. No definitive advice is applicable to all situations. Of course, in general, the longer the trial and the greater the number of crossover periods, the more information will be obtained, and the more stable the resultant estimates of effect will be.
3.3 Run-In and Washout Periods One of the difficulties in all crossover designs is the potential for carryover effects: the possibility that the effect of one treatment has not entirely dissipated when the next treatment is instituted. For example, a crossover comparison of two treatments, A and B, will actually be a comparison of A followed by a comparison of B combined with some effect of A if the effects of treatment A persist into the B-treatment period. Several approaches are available for addressing the carryover effect. A "run-in" period (an initial period off all active treatment, which may be a placebo-assigned period) is often advocated to wash out the effect of any remaining active therapies the patient may have used. A condition-dependent crossover rule may be specified; that is, after treatment A is stopped, the patient's clinical condition is allowed to return to baseline before starting treatment B. A time-dependent crossover rule is an alternative: a specific amount of time is required to pass before the patient is crossed over to the next therapy. One implementation method for the time-dependent crossover design is to discard data from the first part of each treatment period and analyze only
the data from the latter part of each treatment period, timing the data-discard period to allow an adequate washout of the prior treatment. For example, Nikles et al. (4) performed a set of N-of-1 studies of medical therapy for osteoarthritis; the treatment periods were 2 weeks long, but only the data from the second week of treatment in each period were used for analysis. Additional advantages of this latter technique are that participants are never off the study drug (which may improve adherence) and that using the active treatment for a period of time before true data collection starts allows for some stability of drug levels and drug effects before the clinical response is measured. Another variant is an open-label run-in period to assess the tolerability of the study treatment, ensure that some response is evident, and/or perform an initial dose-finding analysis; this information may help shorten the overall duration of the trial (7, 9).
3.4 Blinding One of the great advantages of a formal N-of-1 study over a typical therapeutic trial is the potential to blind (or "mask") the therapies under study. Successful blinding can help distinguish the placebo effect from the specific treatment effect of a trial agent and potentially provide more meaningful data, depending on the needs of the patient and clinician. Blinding a pharmaceutical N-of-1 trial, however, requires the formulation of a credible placebo (for placebo-controlled trials) or of similar-appearing medications for active-controlled trials. Producing these products is often difficult and can be an insurmountable problem for clinicians who are not supported by an infrastructure to help conduct these studies. Some institutions have created services to support clinicians interested in conducting N-of-1 trials (see section 5.3).
3.5 Choice of Outcome Measures The N-of-1 trial is unique in that the choice of outcome measures is often far less a function of credibility to the research community or the general population of patients than of personal importance to the individual patient
and clinician. For those trials that are conducted to better inform a single patient about the effectiveness or side effects of a particular intervention, it is critical that the outcomes chosen for the study be the ones that will have the greatest impact on the decision to continue, terminate, or change therapy. Therefore, although a large-scale randomized trial of a therapy for osteoarthritis, for example, might choose well-validated measures of pain, stiffness, and physical function, the outcome chosen for an N-of-1 trial of osteoarthritis might be simple ordinal scales of very specific issues such as the degree of resting pain and/or the ability to dance, ride a bicycle, or walk to the grocery store. Alternatively, simple dichotomous choices or ordinal preference scores may also be used (e.g., "Which did you prefer overall, treatment with A or B?" and "How much better was your preferred treatment than the other one?"). For example, Pope et al. (12) conducted a set of N-of-1 trials of a nonsteroidal anti-inflammatory drug for patients with osteoarthritis. Each trial consisted of 2-week treatment pairs; after each pair, the patient was asked to select the preferred study medication. If he or she was unable to do so, up to two additional blinded treatment pairs were attempted. For multiple N-of-1 trials, a more uniform choice of endpoints may be required. However, the number of patients treated with these designs is generally small, and the emphasis may still be on the relevance of the response to the individual patients. In these situations, very specific outcomes or simple choice outcomes may still be chosen, but it is desirable that the scale on which the outcomes are measured be uniform to permit easier aggregation of the data (10).
3.6 Outcome Assessment Like any clinical trial, an unbiased assessment of outcomes is critical. For fully blinded trials, achieving this objective is relatively straightforward. For incompletely blinded trials, an independent outcome assessment method should be employed whenever possible to guard against the biases that may be held by the clinician-investigator. In many trials, multiple assessments of the outcomes over time are often employed
to increase the power of the analysis (e.g., by use of a symptom diary). This additional burden on the patient should be discussed in detail before conducting the trial to enhance adherence. Additional work by highly motivated patients, however, may not be entirely undesirable. For example, in a series of N-of-1 studies of children with attention deficit/hyperactivity disorder, follow-up interviews with patients and their parents found very positive impressions of the experience of participating in the N-of-1 studies; these included the data collection aspect, which was viewed favorably by many patients as a means of feeling more invested in the study and providing more information (13).
4 STATISTICAL ISSUES
One of the more controversial and underdeveloped aspects of N-of-1 trials is the issue of the statistical design and analysis. Indeed, some investigators have argued that statistical analysis is often superfluous, being irrelevant to the purpose of the trial. Should statistical analysis be desired, data features such as correlation between repeated measures and unequal variances by treatment period must be addressed. 4.1 Should Statistical Analysis of an N-of-1 Study be Conducted at All? It has been suggested that a formal statistical analysis of data from an N-of-1 study need not be undertaken when the goal is to help a patient make a simple therapeutic decision (14). The argument is that the decision is unavoidable. In this context, issues of statistical significance are immaterial: the patient should simply choose the therapy that appears to be most beneficial. Because the decision will be based on evidence gathered in a structured (and, possibly, blinded) fashion, it will be better informed than any alternative, regardless of P-values and confidence intervals. Critical to this argument is the need to present the data in a manner understandable to the patient; graphical methods are often preferred, though variability in interpretation of graphical presentations can prove problematic (15).
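As a concrete illustration of the kind of graphical presentation discussed above, the sketch below plots a hypothetical daily symptom diary across six treatment periods, color-coded by treatment, as might be shown to a patient in place of, or alongside, formal testing. All values and the period structure are invented for demonstration.

```python
# Illustrative only: a simple graphical summary of a hypothetical N-of-1
# symptom diary (daily 0-10 pain score) across six alternating treatment
# periods. Data and sequence are invented.
import matplotlib.pyplot as plt

periods = ["A", "B", "B", "A", "B", "A"]          # hypothetical blocked sequence
diary = [                                          # 7 daily scores per period (invented)
    [6, 6, 5, 6, 5, 6, 5], [4, 4, 3, 4, 3, 3, 4],
    [3, 4, 3, 3, 4, 3, 3], [6, 5, 6, 6, 5, 6, 6],
    [4, 3, 4, 3, 3, 4, 3], [5, 6, 6, 5, 6, 5, 6],
]

fig, ax = plt.subplots(figsize=(8, 3))
day = 0
for treatment, scores in zip(periods, diary):
    days = range(day, day + len(scores))
    color = "tab:blue" if treatment == "A" else "tab:orange"
    ax.plot(days, scores, marker="o", color=color, label=f"Treatment {treatment}")
    day += len(scores)

ax.set_xlabel("Study day")
ax.set_ylabel("Daily pain score (0-10)")
# Keep one legend entry per treatment.
handles, labels = ax.get_legend_handles_labels()
unique = dict(zip(labels, handles))
ax.legend(unique.values(), unique.keys())
plt.tight_layout()
plt.show()
```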
This line of reasoning is not endorsed by all investigators, and several methods of analysis have been proposed, as will be discussed. Even if one accepts that formal statistical testing is not relevant for some clinical applications, there may still be times when it assists decision making. For example, consider a patient at high risk for coronary heart disease who may strongly desire the protective effects of a cholesterol-lowering medication but is concerned about recent gastrointestinal upset that she believes may be caused by the drug. For this patient, there may be a need for a high degree of certainty that any attribution of the symptoms to the drug not be the result of random error, so statistical design and analysis would be helpful for this patient's decision making. Clearly, if the goal of conducting a series of N-of-1 trials is to generalize beyond the patients in the studies, a formal statistical analysis is required.
4.2 Statistical Analysis Options for the Multiple N-of-1 Study The statistical analysis of N-of-1 studies focuses on estimation of the magnitude and significance of changes in the expected value of the response between periods of treatment change and must accommodate correlation between the repeated measures of the study subject. The repeated responses gathered in an N-of-1 study form a time series, or more accurately an interrupted time series (16), and data analysts could analyze such data using classic time series methods (17). However, the lengths of typical N-of-1 series tend to be shorter than a typical time series, and popular approaches commonly involve linear model methods instead of classic time series approaches. Rochon (18) described the following useful linear model approach to analyze data from an N-of-1 study. Let y_it denote the response observed at the t-th time point (t = 1, ..., T) during the i-th treatment period (i = 1, ..., I). For notational convenience, we assume equal treatment period lengths, but the methods extend easily to settings with unequal period lengths. Let y_i = (y_i1, ..., y_iT) denote the vector of T responses gathered in the i-th treatment period, and let X_i denote a T × p
matrix of p explanatory variables. Rochon proposes the linear model y_i = X_i β + e_i, where e_i = (e_i1, ..., e_iT) is a vector of error terms representing unexplained variation in the repeated response. This linear model is quite general and can describe features such as treatment effects, time, and carryover effects. Because N-of-1 studies gather repeated measures on the same study subject, it is typically unreasonable to assume that the error terms are uncorrelated. Rather, one assumes that the repeated measures are correlated, often with an autoregressive structure. For example, as in Rochon (18), one might assume that the errors follow a first-order autoregressive process, e_it = ρ_i e_i,t−1 + u_it, where ρ_i is the autoregressive correlation parameter with |ρ_i| < 1, and the u_it are mutually independent random variables with mean zero and variance σ_i²(1 − ρ_i²). Rochon and others found that both the variability and the longitudinal correlation of responses from N-of-1 studies seemed to vary with treatment period (18). To address this, useful models allow separate variance and correlation parameters for each of the treatment periods. One typically estimates the model parameters and variability using maximum likelihood methods and can express the estimated magnitude of treatment effects using standard confidence interval methods. Finally, one can test whether observed treatment effects are greater than expected by chance using standard hypothesis testing methods such as likelihood ratio tests and associated significance probabilities. Spiegelhalter (14) noted that patients can use such significance probabilities to assess the confidence associated with statements about the superiority of one treatment over another.
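As a simplified, concrete illustration of this modeling approach, the sketch below simulates a single N-of-1 series under a linear model with a treatment effect and first-order autoregressive errors, and then estimates the treatment effect by feasible generalized least squares (a two-step alternative to the maximum likelihood fitting described above). A single ρ and variance are assumed for all periods, and all numbers and the period structure are hypothetical.

```python
# A minimal numpy-only sketch: simulate one N-of-1 series with AR(1) errors,
# then estimate the treatment effect by feasible GLS (OLS residuals -> estimate
# rho -> Prais-Winsten whitening -> re-fit). All values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
periods = ["A", "B", "B", "A", "B", "A"]   # hypothetical blocked sequence
T = 7                                       # measurements per period
beta0, beta_trt, rho, sigma = 6.0, -2.0, 0.5, 1.0

# Design matrix: intercept + indicator for treatment B.
treat = np.repeat([1.0 if p == "B" else 0.0 for p in periods], T)
X = np.column_stack([np.ones_like(treat), treat])

# Simulate AR(1) errors across the whole series (stationary start).
n = len(treat)
e = np.zeros(n)
u = rng.normal(0.0, sigma * np.sqrt(1 - rho**2), size=n)
e[0] = rng.normal(0.0, sigma)
for t in range(1, n):
    e[t] = rho * e[t - 1] + u[t]
y = X @ np.array([beta0, beta_trt]) + e

# Step 1: OLS, then estimate rho from the lag-1 autocorrelation of residuals.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta_ols
rho_hat = np.sum(r[1:] * r[:-1]) / np.sum(r[:-1] ** 2)

# Step 2: Prais-Winsten whitening, then OLS on the transformed data (GLS).
y_star = np.concatenate([[np.sqrt(1 - rho_hat**2) * y[0]], y[1:] - rho_hat * y[:-1]])
X_star = np.vstack([np.sqrt(1 - rho_hat**2) * X[0], X[1:] - rho_hat * X[:-1]])
beta_gls, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

print(f"estimated rho = {rho_hat:.2f}")
print(f"OLS treatment effect = {beta_ols[1]:.2f}, GLS treatment effect = {beta_gls[1]:.2f}")
```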
5 OTHER ISSUES
5.1 Do N-of-1 Trials Achieve Their Goals? One standard for assessing the success of N-of-1 studies is the extent to which patients accept the results of the study and adhere to the recommended therapy (which may be
no therapy at all). Several authors have measured the proportion of patients who accepted the results of their N-of-1 studies and adopted the therapy recommended by their trials in follow-up. Clearly, large numbers of patients and physicians who participate in N-of-1 studies use the information in their clinical decision making, as evidenced by substantial proportions of patients who ultimately change their choice of therapy (19–22). However, on follow-up, several investigators have found that many patients ultimately elect a treatment strategy that is inconsistent with the results of their N-of-1 trial (10, 23, 24). It is unclear why rates of post-trial adherence are not higher. In a study of N-of-1 trials of quinine for the treatment of nocturnal leg cramps, the investigators found that all 10 patients completing the trials elected to use quinine after their studies, despite the fact that the drug was clearly effective in only three. The authors attributed this, in part, to the fact that they did not sufficiently discuss the implications of inconclusive or negative findings with patients prior to initiating the trials (24). Another potential explanation for low rates of adopting the apparently superior therapy after an N-of-1 trial may be that many of these trials were instigated by the clinician or investigator, not the patient. It is difficult to glean this aspect of trial initiation from the available publications but, as noted above, it is important that the patient perceive a value to the N-of-1 study or their adherence to its results may be diminished. This problem may also account, in part, for the relatively high rates of withdrawal from many studies (10, 23, 24), though many of these "withdrawals" may simply be early recognition of effectiveness or toxicity, resulting in early termination of the trial. Finally, even when no statistical testing is performed, many patients may intuit that observed differences in the trial were of small degree or the results unstable and act accordingly. Greater understanding of how N-of-1 trial results are understood and used by patients is clearly needed. Despite the inconsistent rates of adopting the superior therapy observed in some studies of N-of-1 trials, many studies attest to high levels of patient satisfaction with
the procedure. Even when the patient’s final decision is not entirely consistent with the information gleaned from the study, most patients report that the trial was useful and provided helpful information as they made their therapeutic decisions (8, 13, 20, 21, 23). Two groups have formally compared patients concurrently randomized to an N-of-1 study versus usual care for therapeutic decision making. Mahon et al. (25) randomized 31 patients with chronic obstructive pulmonary disease with an unclear response to theophylline to take part in an N-of-1 study or usual care. At 6 months, there was less theophylline use among the patients allocated to the N-of-1 trials with no apparent decline in their pulmonary symptoms. In a later study, the same investigators randomized 68 comparable patients using a similar design. Over a 1-year follow-up, the authors found no significant differences between groups in symptoms or theophylline use (26). Finally, Pope et al. (12) randomized 51 patients with osteoarthritis to an N-of-1 study or usual care for deciding whether a non-steroidal anti-inflammatory drug was helpful for their arthritis symptoms. Over the 6-month followup period, the investigators found no differences in use of the drug and apparent response. All studies found that the N-of-1 trials increased total medical-care costs for the duration of the trial period. 5.2 Ethics of N-of-1 Trials All N-of-1 trials, regardless of technique or intent, are intervention studies. As such, the ethics of conducting an N-of-1 trial must be considered carefully. Certainly, a set of multiple N-of-1 trials designed primarily to provide generalizable data is little different from other crossover studies and the full complement of ethical safeguards is likely to be required in this context. Such safeguards include full written informed consent and approval by an institutional review board (IRB). It is less clear that all ethical protections required in typical research studies are required for N-of-1 studies that are designed to help an individual patient make a more informed personal therapeutic decision. Some authors contend that written, informed
consent and IRB approval are not necessary, since these studies are simply more structured methods of conducting therapeutic trials, a procedure performed regularly in routine clinical practice (27). 5.3 N-of-1 Trial Support Services Single-patient studies are not commonly performed. Numerous barriers exist to wider application of these methods: unfamiliarity on the part of patients and clinicians, hesitancy to conduct ‘‘studies’’ in the clinical setting, and additional burdens of time and trouble to carry out the study to make ‘‘simple’’ clinical decisions. One of the greatest barriers to wider implementation is the lack of infrastructure for performing these studies: placebos (or indistinguishable active drugs) must be formulated, a randomization scheme must be prepared (to which the clinician and patient are blind), and data must be collected and presented and/or analyzed. Few clinicians have access to these services or expertise (28). Recognizing this difficulty, some academic institutions have established ‘‘N-of-1 Services’’ that support clinicians who desire to use these methods. Investigators at McMaster University and the University of Washington have published their experiences with such services in the 1980s and 1990s (3, 20, 29, 30). Both groups of investigators found the service to be feasible and the results useful to patients and clinicians. A more ambitious effort was organized by Nikles et al. (22) in Australia. These investigators established an N-of-1 trial support service that was available to clinicians throughout the country who desired to carry out a single-patient trial in their practice. The success of this venture was documented in the completion of a large number of trials in conditions such as attention deficit/hyperactivity disorder and osteoarthritis (4, 13, 22). 5.4 Novel Uses of N-of-1 Studies The utility of N-of-1 studies enables a wide variety of applications. In addition to the common use for informing clinical decisions for individual patients, N-of-1 studies have found value in other settings.
In one recent randomized, double-blind, parallel-comparison clinical trial, a participant considered withdrawing from the study because he was concerned that the study medication raised his blood pressure. Rather than conduct a simple therapeutic trial of withdrawing the study medication, the investigators offered the participant the option of taking part in an N-of-1 study (using the study medication and known placebo), which he selected. The N-of-1 study showed that, relative to placebo, his assigned study medication had minimal effects on his blood pressure, and he elected to continue in the trial of the study medication (31). It is notable that such N-of-1 studies are relatively simple to perform because there are no problems with the availability of placebo, a research pharmacist, data collection infrastructure, and personnel for presenting and analyzing the data. Guyatt et al. (32) describe another potential use of the multiple N-of-1 method: as a means of gathering critical information about a new drug in development. Such trials can be designed to provide a wide range of information useful for planning larger, fully powered phase III studies. The kinds of information gleaned from such studies include defining the population of patients most likely to benefit from the drug, determining the optimal dose, estimating the rapidity of the drug's onset of action and loss of treatment effect, identifying the types of outcomes most responsive to the drug, and estimating the magnitude of a potential treatment effect. The investigators provided an example of the use of a tricyclic antidepressant medication for the treatment of fibromyalgia. The same technique has also been used in the initial study of a device for treatment of Tourette's syndrome (33).
6 CONCLUSIONS
Although N-of-1 trials have been employed in the mental health field for many decades, interest has grown in wider applications of the method in recent years. Because the focus of this type of study is on the individual patient, there are several unique aspects to the design, conduct, and interpretation of the
results of these studies. Serious infrastructure barriers to wider implementation often exist, as do issues of adherence and optimizing the value of the N-of-1 trial for its intended purpose. Analytic issues continue to evolve, and, as clinicians and patients become more familiar with the technique, N-of-1 trials may play a more prominent role in the application of evidence-based, individualized therapeutics. Patient satisfaction with N-of-1 trials is high, but the incremental benefit of this methodology over standard practice remains uncertain.
REFERENCES
1. G. Guyatt, D. Sackett, D. W. Taylor, J. Chong, R. Roberts, and S. Pugsley, Determining optimal therapy—randomized trials in individual patients. N Engl J Med. 1986; 314: 889–892. 2. R. Jaeschke, D. Cook, and D. L. Sackett, The potential role of single-patient randomized controlled trials (N-of-1 RCTs) in clinical practice. J Am Board Fam Pract. 1992; 5: 227–229. 3. E. B. Larson, N-of-1 clinical trials. A technique for improving medical therapeutics. West J Med. 1990; 152: 52–56. 4. C. J. Nikles, M. Yelland, P. P. Glasziou, and C. Del Mar, Do individualized medication effectiveness tests (N-of-1 trials) change clinical decisions about which drugs to use for osteoarthritis and chronic pain? Am J Ther. 2005; 12: 92–97. 5. P. M. Peloso, Are individual patient trials (N-of-1 trials) in rheumatology worth the extra effort? J Rheumatol. 2004; 31: 8–11. 6. B. Spilker, Single-patient clinical trials. In: Guide to Clinical Trials. Philadelphia: Lippincott-Raven, 1996, pp. 277–282. 7. D. J. Cook, Randomized trials in single subjects: the N of 1 study. Psychopharmacol Bull. 1996; 32: 363–367. 8. D. R. Zucker, R. Ruthazer, C. H. Schmid, J. M. Feuer, P. A. Fischer, et al., Lessons learned combining N-of-1 trials to assess fibromyalgia therapies. J Rheumatol. 2006; 33: 2069–2077. 9. G. Guyatt, D. Sackett, J. Adachi, R. Roberts, J. Chong, et al., A clinician's guide for conducting randomized trials in individual patients. CMAJ. 1988; 139: 497–503. 10. A. C. Wegman, D. A. van der Windt, W. A. Stalman, and T. P. de Vries, Conducting research in individual patients: lessons learnt from two
N-OF-1 RANDOMIZED TRIALS series of N-of-1 trials. BMC Fam Pract. 2006; 7: 54. 11. D. R. Zucker, C. H. Schmid, M. W. McIntosh, R. B. D’Agostino, H. P. Selker, and J. Lau, Combining single patient (N-of-1) trials to estimate population treatment effects and to evaluate individual patient responses to treatment. J Clin Epidemiol. 1997; 50: 401–410. 12. J. E. Pope, M. Prashker, and J. Anderson, The efficacy and cost effectiveness of N of 1 studies with diclofenac compared to standard treatment with nonsteroidal antiinflammatory drugs in osteoarthritis. J Rheumatol. 2004; 31: 140–149. 13. C. J. Nikles, A. M. Clavarino, and C. B. Del Mar, Using N-of-1 trials as a clinical tool to improve prescribing. Br J Gen Pract. 2005; 55: 175–180. 14. D. J. Spiegelhalter, Statistical issues in studies of individual response. Scand J Gastroenterol Suppl. 1988; 147: 40–45. 15. K. J. Ottenbacher and S. R. Hinderer, Evidence-based practice: methods to evaluate individual patient improvement. Am J Phys Med Rehabil. 2001; 80: 786–796. 16. T. D. Cook TD and D. T. Campbell, Quasiexperimentation: Design And Analysis Issues. Boston: MA: Houghton-Mifflin, 1979. 17. P. J. Diggle, Time Series: A Biostatistical Introduction. Oxford: Clarendon Press, 1990. 18. J. Rochon, A statistical model for the ’’N-of-1’’ study. J Clin Epidemiol. 1990; 43: 499–508. 19. G. H. Guyatt and R. Jaeschke, N-of-1 randomized trials—where do we stand? West J Med. 1990; 152: 67–68. 20. E. B. Larson, A. J. Ellsworth, and J. Oas, Randomized clinical trials in single patients during a 2-year period. JAMA. 1993; 270: 2708–2712. 21. L. March, L. Irwig, J. Schwarz, J. Simpson, C. Chock, and P. Brooks, N of 1 trials comparing a non-steroidal anti-inflammatory drug with paracetamol in osteoarthritis. BMJ. 1994; 309: 1041–1046. 22. C. J. Nikles, G. K. Mitchell, C. B. Del Mar, A. Clavarino, and N. McNairn, An N-of-1 trial service in clinical practice: testing the effectiveness of stimulants for attention-deficit/hyperactivity disorder. Pediatrics. 2006; 117: 2040–2046. 23. M. A. Kent, C. S. Camfield, and P. R. Camfield, Double-blind methylphenidate trials: practical, useful, and highly endorsed by families. Arch Pediatr Adolesc Med. 1999; 153: 1292–1296.
9
24. R. Woodfield, F. Goodyear-Smith, and B. Arroll, N-of-1 trials of quinine efficacy in skeletal muscle cramps of the leg. Br J Gen Pract. 2005; 55: 181–185. 25. J. Mahon, A. Laupacis, A. Donner, and T. Wood, Randomised study of N of 1 trials versus standard practice. BMJ. 1996; 312: 1069–1074. 26. J. L. Mahon, A. Laupacis, R. V. Hodder, D. A. McKim, N. A. Paterson, et al., Theophylline for irreversible chronic airflow limitation: a randomized study comparing N of 1 trials to standard practice. Chest. 1999; 115: 38–48. 27. L. Irwig, P. Glasziou, and L. March, Ethics of N-of-1 trials. Lancet. 1995; 345: 469. 28. K. B. Saunders, N of 1 trials. BMJ. 1994; 309: 1584. 29. G. H. Guyatt, J. L. Keller, R. Jaeschke, D. Rosenbloom, J. D. Adachi, and M. T. Newhouse, The N-of-1 randomized controlled trial: clinical usefulness. Our three-year experience. Ann Intern Med. 1990; 112: 293–299. 30. J. L. Keller, G. H. Guyatt, R. S. Roberts, J. D. Adachi, and D. Rosenbloom, An N of 1 service: applying the scientific method in clinical practice. Scand J Gastroenterol Suppl. 1988; 147: 22–29. 31. A. L. Avins, S. Bent, and J. M. Neuhaus, Use of an embedded N-of-1 trial to improve adherence and increase information from a clinical study. Contemp Clin Trials. 2005; 26: 397–401. 32. G. H. Guyatt, A. Heyting, R. Jaeschke, J. Keller, J. D. Adachi, and R. S. Roberts, N of 1 randomized trials for investigating new drugs. Control Clin Trials. 1990; 11: 88–100. 33. J. L. Houeto, C. Karachi, L. Mallet, B. Pillon, J. Yelnik, et al., Tourette’s syndrome and deep brain stimulation. J Neurol Neurosurg Psychiatry. 2005; 76: 992–995.
CROSS-REFERENCES
Crossover design
Randomization
Placebo-controlled trial
Washout period
Blinding
AN INTRODUCTORY GUIDE TO NON-COMPARTMENTAL ANALYSIS
Non-compartmental analysis (NCA) is a standard technique to determine the pharmacokinetics (PK) of a drug. After drug intake, the concentration time profiles (e.g., in plasma or serum) are recorded and used to characterize the absorption, distribution, and elimination of the drug. Less frequently, concentrations in blood, saliva, other body fluids, or amounts excreted unchanged in urine are used instead of or in addition to plasma or serum concentrations. NCA is the most commonly used method of PK data analysis for certain types of clinical studies like bioequivalence, dose linearity, and food effect trials. The common feature of non-compartmental techniques is that no specific compartmental model structure is assumed. The most frequently applied method of NCA is slope height area moment analysis (SHAM analysis) (1,2). For SHAM analysis, the area under the concentration time curve (AUC) is most commonly determined by numerical integration or by curve fitting. Numerical integration is the non-compartmental method of choice for analysis of concentration time data after extravascular input because absorption kinetics are often complex. In comparison to compartmental modeling, numerical integration has the advantage that it does not assume any specific drug input kinetics. For intravenous bolus data, fitting the concentration time curves by a sum of exponentials is the non-compartmental method of choice. This introductory article presents some standard applications of NCA of plasma (or serum) concentration data, as those applications are most commonly used. References to NCA of urinary excretion data and more advanced topics of NCA are provided. This article focuses on exogenous compounds and does not cover endogenous molecules. Our focus is (1) to describe studies and objectives for which NCA is well suited, (2) to provide and discuss assumptions of NCA, and (3) to present a practical guide for performing an NCA by numerical integration and basic approaches to choosing appropriate blood sampling times.
JÜRGEN B. BULITTA
Department of Pharmaceutical Sciences, School of Pharmacy and Pharmaceutical Sciences, State University of New York at Buffalo, Buffalo, New York
NICHOLAS H. G. HOLFORD Department of Pharmacology and Clinical Pharmacology, University of Auckland, Auckland, New Zealand
Objectives of NCA are often assessing dose proportionality, showing bioequivalence, characterizing drug disposition, and obtaining initial estimates for pharmacokinetic models. Specific results and applications of NCA are as follows: (1) The area under the concentration time curve (e.g., in plasma or serum) describes the extent of systemic drug exposure; the peak concentration and its timing indicate the rate of drug input (absorption), and (2) NCA provides estimates for clearance, volume of distribution, terminal half-life, mean residence time, and other quantities. Application 1 serves purely descriptive purposes and requires almost no assumptions. Importantly, application 2 does rely on several assumptions that are similar to the assumptions for compartmental modeling. Standard NCA requires frequent (blood) samples to estimate pharmacokinetic parameters reliably. Numerical integration is most commonly used for NCA after extravascular input. Fitting of disposition curves by a sum of exponential functions, for example, is the method of choice for intravenous bolus input. NCA of an adequately designed clinical trial can provide robust estimates for the extent of drug exposure, the rate of absorption, and other quantities. NCA estimates for clearance and volume of distribution rely on several assumptions that have to be critically considered for appropriate interpretation of NCA results.
1 TERMINOLOGY 1.1 Compartment A compartment is a hypothetical volume that is used to describe the apparent homogeneous and well-mixed distribution of a chemical species. ‘‘Kinetically homogeneous’’ assumes that an instantaneous equilibration of a chemical compound (drug or metabolite) is found between all components of the compartment. 1.2 Parameter A parameter is a primary constant of a quantitative model that is estimated from the observed data. For example, clearance (CL) and volume of distribution at steady state (Vss) are the two most important PK parameters. 1.3 Fixed Constant Some models include fixed constants that are known (or assumed) a priori and are not estimated. Examples are stoichiometric coefficients or π . 1.4 Statistic A statistic is a derived quantity that is computed from observed data or estimated model parameters. Examples: The average plasma concentration is a statistic because it is computed from observed concentrations. Another statistic is the AUC. Under certain assumptions, the AUC is F · Dose / CL with F being the fraction of drug that reaches the systemic circulation. The average clearance and its standard deviation are two statistics that can be calculated from individual clearance estimates of several subjects. 1.5 Comment A one-compartment PK model can be defined by any two parameters, for example, by CL and Vss, by CL and t1/2 , or by Vss and t1/2 . NCA estimates CL and t1/2 and derives Vss from other NCA statistics. However, physiologically, the parameterization by CL and Vss is more informative than the other two parameterizations because CL and Vss characterize the physiology of the body and the
physicochemical properties of the drug. Half-life is determined by the primary PK parameters clearance and volume of distribution and should be called a statistic (or secondary parameter). For more complex PK models, Vss is calculated by residence time theory (see below and Reference 3).
2 OBJECTIVES AND FEATURES OF NON-COMPARTMENTAL ANALYSIS
Non-compartmental techniques can be applied to PK analyses of studies with a variety of objectives. Some of those objectives are: 1. Characterization of drug disposition. 2. PK analysis of various types of studies, including: • Bioavailability and bioequivalence studies, • Food-effect studies, • PK interaction studies, and • Dose-proportionality studies. 3. Supporting development of pharmaceutical formulations by characterizing drug absorption profiles. 4. Obtaining initial estimates for compartmental modeling. NCA has appealing features, and the majority of studies in the literature only report results from NCA. Some specific results from NCA and its features are: 1. NCA can provide robust estimates for important PK parameters like clearance and volume of distribution, if the assumptions of NCA are adequately met. 2. Descriptive measures for the rate of absorption like peak concentration (Cmax) and its timing (Tmax) can be directly derived from the observed data. 3. No need exists to specify the full structure of the compartmental model. 4. NCA and plots of the observed concentration time data may provide valuable insights into the PK characteristics of a drug and can be very helpful for testing model assumptions and for building compartmental models.
5. Standard NCA can be learned quickly and is often straightforward to apply. 6. NCA requires minimum decision making by the user, can be highly standardized, and most often yields consistent results between different users. However, it is important to be aware that NCA may be suboptimal or more difficult to apply in situations that need: 1. to analyze sparse data (insufficient to characterize the shape of the profile before the log-linear terminal phase), 2. to derive PK parameters from concentration time profiles of complex dosage regimens, 3. to simulate other-than-the-studied dosage regimens [powerful methods do exist for this task, but they are underused (4)], or 4. to study drugs with mixed-order (‘‘saturable’’) elimination or time-dependent PK. 2.1 Advanced Non-Compartmental Techniques Various advanced non-compartmental techniques have been developed. A detailed presentation of these concepts is beyond the scope of this article. Some situations for which those advanced methods are superior to standard NCA are listed below. Non-compartmental methods to analyze data with saturable elimination (Michaelis– Menten kinetics) were developed by several authors (5–13). Methods for the analysis of metabolite data (14–18), reversible metabolism (19–27), maternal–fetal disposition (28), enterohepatic recirculation (29), nonlinear protein binding (30) or capacity limited tissue-distribution (31), organ clearance models (19,32), target-mediated drug disposition (33,34), and for modeling a link compartment (35) are available. Noncompartmental methods for sparse sampling (36–43) are often applied in preclinical PK and toxicokinetics. Veng-Pedersen presented (4,44) an overview of linear system analysis that is a powerful tool to determine the absorption, distribution, and elimination
characteristics and to predict concentrations for other dosage regimens. Although these advanced methods have been developed, they are not as widely used as standard NCA, and some of these methods are not available in standard software packages. 3 COMPARISON OF NON-COMPARTMENTAL AND COMPARTMENTAL MODELS NCA is often called ‘‘model independent,’’ although this phrase is misleading and often misinterpreted (45–47). The common feature of non-compartmental techniques is that they do not assume a specific compartmental model structure. NCA only can be described as (fully) model independent when it is exclusively used for descriptive purposes, for example, for recording the observed peak concentration (Cmax) and its associated time point (Tmax). Importantly, such a descriptive model-independent approach cannot be applied for extrapolation purposes. The AUC from time zero to infinity (AUC0−∞ ) can be interpreted as a descriptive measure of systemic drug exposure (e.g., in plasma) for the studied dosage regimen. This interpretation requires the assumption of a monoexponential decline in concentrations during the terminal phase (or another extrapolation rule). However, the AUC is commonly used to calculate F, for example, in bioavailability and bioequivalence studies. As the AUC is determined by F · dose / clearance, the use of AUC to characterize F assumes that clearance is constant within the observed range of concentrations and over the time course of the data used to calculate the AUC. Therefore, all assumptions required for determination of clearance (see below) are implicitly made, if AUC is used to characterize F. Figure 1 compares the standard noncompartmental model and a compartmental model with two compartments. An important assumption of NCA is that drug is only eliminated from the sampling pool. The two models shown in Fig. 1 differ in the specification of the part of the model from which no observations were drawn (nonaccessible part of the system). In this example,
Figure 1. Comparison of the standard non-compartmental model and a compartmental model with two compartments.
the models differ only in the drug distribution part. For compartmental modeling, the user has to specify the complete model structure including all compartments, drug inputs, and elimination pathways. In Fig. 1 (panel b), drug distribution is described by one peripheral compartment that is in equilibrium with the central compartment. The ‘‘non-compartmental’’ model does not assume any specific structure of the nonaccessible part of the system. Any number of loops can describe the distribution, and each loop can contain any number of compartments. A more detailed list of the assumptions of standard NCA is shown below.
4 ASSUMPTIONS OF NCA AND ITS REPORTED DESCRIPTIVE STATISTICS NCA parameters like clearance and volume of distribution only can be interpreted as physiological or pharmacological properties of the body and the drug of interest, if the assumptions of NCA are adequately met. NCA relies on a series of assumptions. Violating these assumptions will cause bias in some or all NCA results. This bias has been quantified and summarized by DiStefano and Landaw (45,46) for several examples. Table 1 compares several key assumptions (36–50) between standard NCA and compartmental modeling. One key assumption of standard NCA is linear drug disposition [see Veng-Pedersen (44,50), Gillespie (47), and DiStefano and Landaw (46) for details]. Although advanced methods to account for
nonlinear drug disposition have been developed [see, e.g., Cheng and Jusko (8)], these techniques are seldom applied. As shown in Table 1, the assumptions for a compartmental model with first-order disposition (‘‘linear PK’’) are similar to the assumptions of standard NCA. Compartmental models offer the flexibility of specifying various types of nonlinear disposition. This process is straightforward, if a compartmental model is specified as a set of differential equations. Below is a discussion of the assumptions of NCA and compartmental modeling. 4.1 Assumptions 1 (see Table 1): Routes and Kinetics of Drug Absorption Compartmental modeling uses a parametric function (e.g., first-order or zero-order kinetics) for drug absorption, whereas NCA does not assume a specific time course of drug input. 4.2 Assumptions 2 to 4 (see Table 1): Drug Distribution Standard NCA does not assume a specific model structure for the nonaccessible part of the system (‘‘distribution compartments’’). It only assumes that drug distribution is linear, for example, neither the rate nor the extent of distribution into the peripheral compartment(s) is saturable. The user has to specify the complete structure and kinetics of drug transfer between all distribution compartments for compartmental modeling. Nonlinear drug disposition can be incorporated into compartmental models. Most compartment models assume linear
Table 1. Comparison of Assumptions Between Standard NCA and Compartmental Modeling
distribution. In this case, the assumptions for drug distribution are similar for non-compartmental and compartmental methods.
4.3 Assumption 5 (see Table 1): Routes of Drug Elimination
Standard NCA assumes that all elimination occurs only from the sampling pool. Most compartmental models also assume that drug is eliminated only from the sampling (central) compartment (no elimination from any peripheral compartment). This assumption seems reasonable because the liver and the kidneys, as the two main elimination organs, are highly perfused and therefore in rapid equilibrium with plasma (or serum), which is most commonly used for PK analysis. DiStefano and Landaw (45,46) discussed this assumption and the implications of its violation in detail. Other routes of elimination can be specified for compartmental modeling. Nakashima and Benet (51) derived formulas for linear mammillary compartmental models with drug input into any compartment and drug elimination from any compartment.
4.4 Assumption 6 (see Table 1): Kinetics of Drug Elimination
Standard NCA assumes that clearance is not saturable. As metabolism and renal tubular secretion are potentially saturable, the results of standard NCA for drugs with saturable elimination need to be interpreted cautiously. This cautious interpretation is especially important when plasma concentrations exceed the Michaelis–Menten constant of the respective saturable process. Several NCA methods that can account for saturable elimination are quoted above. However, those methods are not implemented in software packages like WinNonlinTM Pro or Thermo Kinetica, which makes them unavailable to most users. Compartmental modeling offers great flexibility for specifying the elimination pathways and the kinetics of the elimination process (e.g., first-order, zero-order, mixed-order, target-mediated, etc.).
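To illustrate what specifying the kinetics of elimination as a differential equation can look like in practice, the following minimal Python sketch (not taken from this article; all parameter values are arbitrary assumptions) simulates a one-compartment iv bolus model with either first-order or Michaelis-Menten (mixed-order) elimination.

import numpy as np
from scipy.integrate import odeint

V = 20.0                 # volume of distribution (L), assumed
CL = 5.0                 # first-order clearance (L/h), assumed
Vmax, Km = 60.0, 2.0     # Michaelis-Menten parameters (mg/h, mg/L), assumed

def first_order(amount, t):
    return -CL * (amount / V)                 # dA/dt = -CL * C

def michaelis_menten(amount, t):
    conc = amount / V
    return -Vmax * conc / (Km + conc)         # dA/dt = -Vmax * C / (Km + C)

dose = 100.0                                  # iv bolus dose (mg), assumed
times = np.linspace(0, 24, 97)                # h
conc_linear = odeint(first_order, dose, times).ravel() / V
conc_saturable = odeint(michaelis_menten, dose, times).ravel() / V

At concentrations well below Km the two simulated profiles nearly coincide, whereas at higher concentrations the saturable model declines more slowly; this is the situation in which standard NCA results must be interpreted cautiously.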
4.5 Assumptions 7 and 8 (see Table 1): Sampling Times and Monoexponential Decline
For adequately chosen sampling times, the assumption that the last three (or more) observed concentrations characterize a monoexponential decline of concentrations during the terminal phase is often reasonable [see Weiss for details (52)]. Standard NCA requires that the samples be collected during the whole concentration time profile. This time period is usually one dosing interval at steady state or at least three terminal half-lives after a single dose (see below for NCA of sparse data). For studies with frequent sampling (12 or more samples per profile) at appropriately chosen time points, the AUC usually characterizes the total drug exposure well. For single-dose studies, it should be ensured that the AUC from time zero until the last observed concentration (AUC0−last) comprises at least 80% of AUC0−∞. Blood sampling times should be chosen so that two or more observations are obtained before the peak concentration in all subjects. If the peak concentration is at the time of the first observation, then regulatory authorities like the Food and Drug Administration (FDA) (53,54) may require that a study be repeated because the safety of the respective formulation could not be established. In this situation, NCA cannot provide a valid measure for peak concentrations. Blood sampling time schedules can be optimized to select the most informative sampling times for estimation of compartmental model parameters. The latter approach is more efficient as it requires (much) fewer blood samples per patient compared with NCA and is especially powerful for population PK analyses.
4.6 Assumption 9 (see Table 1): Time Invariance of Disposition Parameters
This assumption is made very often for both NCA and compartmental modeling. The disposition parameters are assumed to be time-invariant (constant) within the shortest time interval required to estimate all PK parameters of interest. PK parameters may differ between two dosing intervals because
two different sets of PK parameters can be estimated for each dosing interval both by NCA and compartmental methods. Caution is indicated for data analysis of drugs with inducible metabolism or when comedication may affect hepatic metabolism of the drug. In summary, the assumptions of compartmental models with linear PK and of the standard NCA model are similar, if NCA is used to derive PK parameters like CL, Vss, and F (as is most often done). The complete model structure and kinetics of all drug transfer processes must be defined for compartmental models, whereas NCA requires fewer assumptions on the model structure. Both standard compartmental models and the standard NCA model assume linear PK [see Veng-Pedersen (44,50), Gillespie (47), and DiStefano and Landaw (46) for more details]. It is easier to specify nonlinear drug disposition for compartmental models than for NCA. 4.7 Assumptions of Subsequent Descriptive Statistics NCA does not directly make any assumption about the distribution of NCA statistics between different subjects. However, most authors report the average and standard deviation for the distribution of NCA statistics and thereby assume that the average is an appropriate measure for the central tendency and that the standard deviation characterizes the variability adequately. The distribution of Tmax is especially problematic because it is determined primarily by the discrete distribution of the nominal sampling times. It is often not possible to decide whether the between-subject variability follows a normal distribution, log-normal distribution, or another distribution. Therefore, it is often helpful to report the median and representative percentiles (e.g., 5% to 95% percentile and 25% to 75% percentile) to describe the central tendency and variability of data by nonparametric statistics in addition to the average and standard deviation. If only data on a few subjects is available, then the 10% and 90% percentiles or the interquartile range are usually more appropriate than the 5% and 95% percentiles. Reporting only the
median and range of the data does not provide an accurate measure for dispersion, especially for studies with a small sample size.
5 CALCULATION FORMULAS FOR NCA
This section presents numerical integration methods to determine the AUC that is most commonly used to analyze plasma (or serum) concentrations after extravascular dosing. Additionally, fitting of concentration time curves by a sum of exponential functions is described. The latter is the non-compartmental method of choice for iv bolus input. The concentration time curves after iv bolus input are called disposition curves because drug disposition comprises drug distribution and elimination but not the absorption process. Both methods are noncompartmental techniques because they do not assume a specific model structure. It is important to note that several compartmental models will result in bi-exponential concentration time profiles. It is appealing that the numerical integration of a standard NCA can be performed with a hand calculator or any spreadsheet program. Nonlinear regression to describe disposition curves can be done by standard software packages like ADAPT, WinNonlinTM Pro, Thermo Kinetica , SAAM II, EXCEL and many others. Performing an NCA with a hand calculator or self-written spreadsheets is good for learning purposes. However, the use of validated PK software for NCA of clinical trials is vital for projects to be submitted to regulatory authorities. Audit trail functionality is an important advantage of validated PK software packages. 5.1 NCA of Plasma or Serum Concentrations by Numerical Integration Numerical integration is the most commonly used method to determine the AUC and other moments of the plasma (or serum) concentration time curve after extravascular input. This method is also often applied for infusion input, if the duration of infusion is nonnegligible. Numerical integration requires a function neither to describe the whole concentration time curve nor to describe the rate of
drug absorption (e.g., first-order or zero-order input). The Cmax and Tmax are directly recorded from the observed concentration time data. NCA assumes that the concentration during the terminal phase declines monoexponentially. Subsequently, linear regression on semilogarithmic scale is used to determine the slope (−λz) of the concentration time curve during the terminal phase. Terminal half-life is calculated as:
t1/2 = ln(2) / λz
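As a hedged illustration only (the concentration data and the choice of five terminal points are assumptions for this example), the following Python sketch estimates λz by unweighted linear regression of the natural logarithm of concentration versus time and converts it to a terminal half-life:

import numpy as np

times = np.array([4.0, 6.0, 8.0, 12.0, 16.0, 24.0])    # h, assumed terminal-phase samples
conc = np.array([3.2, 2.1, 1.5, 0.74, 0.36, 0.09])     # mg/L, assumed

k = 5                                    # number of terminal points used (see Section 6.1)
slope, intercept = np.polyfit(times[-k:], np.log(conc[-k:]), 1)
lambda_z = -slope                        # 1/h
t_half = np.log(2) / lambda_z            # h
print(f"lambda_z = {lambda_z:.3f} 1/h, t1/2 = {t_half:.2f} h")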
A guide for the most appropriate selection of observations for estimation of terminal half-life is described below. Figure 2 illustrates the calculation of terminal half-life for one subject. In this example, the last five observations were used to estimate the terminal slope. The area under the plasma concentration time curve is most often determined by the trapezoidal method. The plasma concentration time profile is usually linearly (or logarithmically) interpolated between two observations. The formulas below give the partial area AUCi,i+1 between two adjacent observations for linear and logarithmic interpolation, where Ci denotes the ith observed concentration at time ti and Δt = ti+1 − ti:
Linear interpolation: AUCi,i+1 = Δt · (Ci + Ci+1) / 2
Logarithmic interpolation: AUCi,i+1 = Δt · (Ci+1 − Ci) / ln(Ci+1 / Ci)
Figure 3 shows the calculation of trapezoids for data after a single oral dose by the linear interpolation rule. The sum of the individual trapezoids yields the AUC from time zero to the last quantifiable concentration (for n observations):
AUC0−last = Σ (i = 1 to n−1) AUCi,i+1
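A minimal Python sketch of this summation is shown below (the sampling times and concentrations are assumed example values); it combines the linear and logarithmic trapezoidal rules according to the "linear-up/log-down" choice described in the following paragraph:

import numpy as np

def auc_0_last(t, c):
    """Sum of the partial areas AUCi,i+1 over n observations."""
    auc = 0.0
    for i in range(len(t) - 1):
        dt = t[i + 1] - t[i]
        if c[i + 1] >= c[i] or c[i] <= 0 or c[i + 1] <= 0:
            auc += dt * (c[i] + c[i + 1]) / 2.0                       # linear trapezoid
        else:
            auc += dt * (c[i + 1] - c[i]) / np.log(c[i + 1] / c[i])   # logarithmic trapezoid
    return auc

t = np.array([0, 0.5, 1, 2, 4, 8, 12, 24.0])            # h, assumed sampling times
c = np.array([0, 4.1, 6.8, 5.9, 3.6, 1.6, 0.8, 0.2])    # mg/L, assumed concentrations
print(f"AUC(0-last) = {auc_0_last(t, c):.1f} mg*h/L")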
Linear interpolation is usually applied, if concentrations are increasing or constant
(Ci+1 ≥ Ci), and logarithmic interpolation is often used, if concentrations are decreasing (Ci+1 < Ci). Note that the log-trapezoidal rule is invalid when Ci+1 = Ci and when Ci is zero. Plasma concentration time profiles are curved to the right during an intravenous infusion at a constant rate (see Fig. 4) and also tend to be curved to the right for extravascular administration. Therefore, linear interpolation to calculate the AUC is a reasonable approximation if Ci+1 ≥ Ci. After the end of infusion, drug concentrations often decline mono-, bi-, or triexponentially (see Fig. 4). The number of exponential phases can be assessed visually, when drug concentrations are plotted on log scale versus time. Therefore, assuming an exponential decline (equivalent to a straight line on log-scale) is often a good approximation for the AUC calculation if Ci+1 < Ci. Figure 4 shows that linear interpolation approximates the true plasma concentration time curve slightly better than the logarithmic interpolation during the infusion. After the end of infusion, logarithmic interpolation approximates the true curve better than linear interpolation, as expected. The difference between linear and logarithmic interpolation tends to be small, if blood samples are frequently drawn. Differences are expected to be larger for less-frequent blood sampling. Linear and logarithmic interpolation have been compared with interpolation by higher-order polynomials and splines (55,56). For PK data, the linear trapezoidal rule for increasing or constant concentrations and the log-trapezoidal rule for decreasing concentrations is usually a good choice (19) because those methods are stable and reasonably accurate even for sparse data. Although higher-order polynomials and spline functions performed better for some datasets, these methods are potentially more sensitive to error in the data and are not available in most standard software packages. The AUC from the last concentration (Clast) to time infinity (AUClast−∞) is extrapolated assuming a log-linear decline of concentrations during the terminal phase:
AUClast−∞ = Clast / λz
Standard software packages like WinNonlinTM Pro, Thermo Kinetica,
Figure 2. Determination of terminal half-life after a single oral dose.
and others provide two options for specifying Clast: one can use the last observed concentration as Clast, or, alternatively, the concentration predicted at the time of the last observation by the log-linear regression used to estimate λz. The current FDA guideline for bioavailability and bioequivalence studies (53) recommends that the last "measured" (observed) concentration should be used. The FDA guideline does not provide a reason why this method is proposed. Theoretically, the last predicted concentration will be more precise than an observed value if the assumptions for estimation of the terminal half-life are met. Use of the last observed concentration may yield a large bias, if this concentration was an "outlier" with a high value, because it would cause a longer half-life and therefore would affect both terms in the equation for AUClast−∞. The AUC from time zero to infinity (AUC0−∞) is calculated as: AUC0−∞ = AUC0−last + AUClast−∞. The residual area is calculated as 1 − AUC0−last / AUC0−∞. Residual areas above
Figure 3. Calculation of the AUC by the linear trapezoidal rule.
20% are considered too large for reliable estimation of AUC0−∞ and therefore also for CL. The uncertainty in the calculated area under the first moment concentration time curve from time zero to infinity (AUMC0−∞) is even larger compared with the uncertainty in AUC0−∞ for larger residual areas. The (apparent) total clearance (CL/F) is calculated from the administered dose and AUC0−∞ as:
CL / F = Dose / AUC0−∞
For intravenous administration, the extent of drug reaching the systemic circulation (F) is 100% by definition, and the ratio of dose and AUC0−∞ yields total clearance (CL). As F is mathematically not identifiable after extravascular administration without an intravenous reference dose, only CL/F can be derived after extravascular administration. The volume of distribution during the terminal phase (Vz) can be calculated from CL and λz. As shown by several authors (57,58), Vz is not an independent parameter that characterizes drug disposition because it depends on, for example, the estimate for
Figure 4. Interpolation between plasma concentration time points by the linear and logarithmic trapezoidal method for a 2-h infusion (sampling times: 0, 0.5, 2, 2.5, 4, and 6 h).
clearance. Volume of distribution at steady state (Vss) is a better measure for volume of distribution than Vz. The estimate for clearance does not affect the estimate of Vss, if drug is only eliminated from the sampling pool. Vss can always be calculated from data that allow one to calculate Vz (see below). Gobburu and Holford (58) pointed out that the finding of an altered Vz, for example, in a special patient population may lead to the inappropriate conclusion of changing the loading dose in this patient population. Because of the potential misuse of Vz, the use of Vz should be discouraged (58). Statistical moment theory (59) is used to calculate the mean residence time (MRT). The MRT is calculated from AUMC0−∞ and AUC0−∞ . The AUMC is calculated as:
Linear interpolation: AUMCi,i+1 = Δt · (ti · Ci + ti+1 · Ci+1) / 2
Logarithmic interpolation: AUMCi,i+1 = Δt · (ti+1 · Ci+1 − ti · Ci) / ln(Ci+1 / Ci) − Δt² · (Ci+1 − Ci) / [ln(Ci+1 / Ci)]²
The AUMC from time zero to the last observed concentration (AUMC0−last ) and from time zero to infinity (AUMC0−∞ ) are calculated
as:
AUMC0−last = Σ (i = 1 to n−1) AUMCi,i+1 (for n observations)
AUMC0−∞ = AUMC0−last + Clast · tlast / λz + Clast / λz²
For steady-state data (dosing interval: τ), the AUMC0−∞ can be calculated as (60): AUMC0−∞ = AUMCSS,0−τ + τ · AUCSS,τ−∞. The mean residence time (MRT) is calculated as:
MRT = AUMC0−∞ / AUC0−∞
[AUMC0−last and AUC0−last should not be used for calculation of MRT, because such a calculation yields systematically biased (low) estimates for the MRT.] For an intravenous bolus administration, the MRT is equal to the mean disposition residence time (MDRT). The MDRT is the average time a drug molecule stays in the systemic circulation (44). The MDRT is often called MRTiv (iv indicates iv bolus input). We prefer to write MDRT to refer specifically to drug disposition. If peripheral venous blood is sampled, then the mean transit time of drug molecules from arterial blood to the sampling site needs to be considered for calculation of MDRT, at least for drugs with a short MDRT [see Weiss (3,61) and Chiou (62) for details].
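The following Python sketch (assumed single-dose data and an assumed λz estimate; not a validated implementation) applies the trapezoidal formulas above to obtain AUC0−last and AUMC0−last, extrapolates both to infinity, and computes the MRT:

import numpy as np

def auc_aumc_0_last(t, c):
    """Linear-up/log-down trapezoids for the zeroth and first moment curves."""
    auc = aumc = 0.0
    for i in range(len(t) - 1):
        dt = t[i + 1] - t[i]
        if c[i + 1] >= c[i] or c[i] <= 0 or c[i + 1] <= 0:   # linear trapezoid
            auc += dt * (c[i] + c[i + 1]) / 2.0
            aumc += dt * (t[i] * c[i] + t[i + 1] * c[i + 1]) / 2.0
        else:                                                # logarithmic trapezoid
            lr = np.log(c[i + 1] / c[i])
            auc += dt * (c[i + 1] - c[i]) / lr
            aumc += dt * (t[i + 1] * c[i + 1] - t[i] * c[i]) / lr - dt**2 * (c[i + 1] - c[i]) / lr**2
    return auc, aumc

t = np.array([0, 0.5, 1, 2, 4, 8, 12, 24.0])            # h, assumed
c = np.array([0, 4.1, 6.8, 5.9, 3.6, 1.6, 0.8, 0.2])    # mg/L, assumed
lambda_z = 0.12                                         # 1/h, assumed terminal slope estimate
c_last, t_last = c[-1], t[-1]

auc_last, aumc_last = auc_aumc_0_last(t, c)
auc_inf = auc_last + c_last / lambda_z
aumc_inf = aumc_last + c_last * t_last / lambda_z + c_last / lambda_z**2
mrt = aumc_inf / auc_inf
print(f"AUC(0-inf) = {auc_inf:.1f}, AUMC(0-inf) = {aumc_inf:.1f}, MRT = {mrt:.1f} h")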
Vss is calculated as: Vss = MDRT · CL. For high-clearance drugs, calculation of Vss is more complex (3,19,32,63). The MDRT determines the accumulation ratio (RA), which is defined as the average amount of drug in the body at steady state (after multiple administration) divided by the bioavailable maintenance dose (F · MD). The RA is MDRT divided by the dosing interval (3). Furthermore, MDRT determines the accumulation time and the washout time as described by Weiss (64,65). For noninstantaneous drug input (e.g., extravascular dose or constant rate infusion), the ratio of AUMC0−∞ and AUC0−∞ yields the mean total residence time (MTRT), which is the sum of the mean input time (MIT) and MDRT. For constant rate (intravenous) infusion with a given duration of infusion (Tinf), the MIT equals Tinf / 2. Therefore, the MDRT can be calculated as:
MDRT = MTRT − MIT = AUMC0−∞ / AUC0−∞ − Tinf / 2
The MIT after extravascular administration is more difficult to calculate without an intravenous reference. The MIT can be determined as the difference of MTRT and MDRT, when both an extravascular and intravenous dose are given to the same subject on different occasions. To calculate Vss after extravascular administration, the MIT needs to be subtracted from the MTRT (66,67). The MIT can be approximated by Tmax/2 if the input process is close to zero-order [common for drugs in class I of the biopharmaceutical classification system (BCS) because of a zero-order release of drug from the stomach; BCS class I drugs have high solubility and high permeability]. If the process seems to be first-order, then MIT can be approximated by Tmax/(3·ln(2)) because Tmax commonly occurs at around three times the first-order absorption half-life. The extent of bioavailability (F) is typically determined by giving a single oral dose and a single intravenous dose on two different occasions (with an appropriate washout
period) to the same subject. From these data, CL, Vss, MDRT, MIT, and F can be calculated by the formulas shown above. The ratio of the AUC after oral and intravenous dosing characterizes F:
F = AUC0−∞,oral / AUC0−∞,iv
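A minimal numerical sketch of these relationships is given below; the dose, AUC, and AUMC values are invented summary results for one subject, and an iv bolus reference is assumed so that AUMCiv/AUCiv equals the MDRT:

dose_iv, dose_oral = 100.0, 100.0      # mg, assumed equal doses
auc_iv, aumc_iv = 50.0, 300.0          # mg*h/L and mg*h^2/L, assumed iv bolus results
auc_oral, aumc_oral = 40.0, 320.0      # assumed oral results

cl = dose_iv / auc_iv                  # total clearance (L/h)
f = (auc_oral / dose_oral) / (auc_iv / dose_iv)   # dose-normalized AUC ratio
mdrt = aumc_iv / auc_iv                # mean disposition residence time (h), iv bolus
mtrt = aumc_oral / auc_oral            # mean total residence time (h), oral
mit = mtrt - mdrt                      # mean input time (h)
vss = mdrt * cl                        # volume of distribution at steady state (L)
print(f"F = {f:.2f}, CL = {cl:.1f} L/h, MIT = {mit:.1f} h, Vss = {vss:.0f} L")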
The MIT values for each subject can be calculated as the difference of MTRT and MDRT. Assuming a first-order absorption (without lag time), MIT is the inverse of the first-order absorption rate constant (1/ka). The MIT can be correlated with the mean dissolution time determined in vitro to establish an in vitro/in vivo correlation, for example, to compare the release characteristics of various modified release formulations (68,69).
5.2 NCA of Plasma or Serum Concentrations by Curve Fitting
For NCA of concentration time curves after iv bolus administration, the results from numerical integration are very sensitive to the timing of samples and to errors in the data (e.g., measurement error or undocumented deviations in the sampling time). For such data, use of analytical functions to fit the disposition curves is the method of choice. For monotonically decreasing concentrations, a sum of exponential functions is often a good choice to describe disposition curves. No a priori reason exists for choosing a sum of exponential functions. Some authors used other functions like gamma curves (52,70–72). Most commonly, disposition curves (concentration CD(t) at time t) are described by a sum of n exponential functions:
CD(t) = Σ (i = 1 to n) Ci · e^(−λi · t)
The Ci is the intercept of each exponential phase, and −λi is the associated slope on semilogarithmic scale. The kth moment (MOCk) of the disposition curve can be calculated as (3):
MOCk = k! · Σ (i = 1 to n) Ci / λi^(k+1), for k = 0, 1, 2, 3, ...
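The sketch below (simulated data; the biexponential parameters, error model, and starting values are assumptions made for this example) fits a sum of two exponentials by weighted nonlinear least squares and then evaluates AUC, AUMC, and MDRT from the fitted intercepts and slopes using this moment formula:

import numpy as np
from scipy.optimize import curve_fit

def biexp(t, c1, lam1, c2, lam2):
    return c1 * np.exp(-lam1 * t) + c2 * np.exp(-lam2 * t)

t_obs = np.array([0.1, 0.25, 0.5, 1, 2, 4, 8, 12, 24.0])     # h, assumed
np.random.seed(1)
c_obs = biexp(t_obs, 8.0, 1.2, 2.0, 0.10) * np.random.lognormal(0.0, 0.05, t_obs.size)

# sigma = c_obs approximates a constant coefficient-of-variation error model
popt, _ = curve_fit(biexp, t_obs, c_obs, p0=[10.0, 1.0, 1.0, 0.05], sigma=c_obs)
c_i, lam_i = popt[0::2], popt[1::2]

auc = np.sum(c_i / lam_i)          # zeroth moment (k = 0)
aumc = np.sum(c_i / lam_i**2)      # first moment (k = 1)
mdrt = aumc / auc
print(f"AUC = {auc:.1f}, AUMC = {aumc:.1f}, MDRT = {mdrt:.2f} h")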
As AUC is the area under the zeroth-moment concentration time curve and AUMC is the area under the first-moment concentration time curve, this formula yields:
AUC0−∞ = Σ (i = 1 to n) Ci / λi
AUMC0−∞ = Σ (i = 1 to n) Ci / λi²
MDRT = AUMC0−∞ / AUC0−∞ = [Σ (i = 1 to n) Ci / λi²] / [Σ (i = 1 to n) Ci / λi]
Figure 5 shows an example of a triexponential disposition curve. The following parameter values were used for simulation: C1 = 50 mg/L, C2 = 40 mg/L, C3 = 10 mg/L, λ1 = 2.77 h−1, λ2 = 0.347 h−1, and λ3 = 0.0578 h−1. The disposition curve shows three different slopes on a semilogarithmic plot versus time. The contribution of the individual exponential functions is indicated.
Figure 5. Disposition curve after iv bolus administration with three exponential functions.
Fitting the concentration time profiles by a sum of exponential functions may be helpful to interpolate between observed concentrations. This sum of exponentials could be used, for example, as a forcing function for a pharmacodynamic model. The parameters of this sum of exponential functions can be estimated by software packages like ADAPT, WinNonlinTM Pro, Thermo Kinetica, SAAM II, and many others. Use of an appropriate error variance model (or of an adequate weighting scheme) is highly recommended. A thorough discussion of nonlinear regression is beyond the scope of this article. See Gabrielsson and Weiner (73) and the manuals of the respective software packages for details.
5.3 NCA with Plasma or Serum Concentrations and Amounts in Urine
Urinary excretion rates can be used instead of plasma concentrations to characterize the PK profile of drugs excreted (primarily) by the kidneys. The majority of studies use plasma (or serum) concentrations to determine the PK. If the excretion of drug into urine is studied in addition to plasma concentrations, then renal clearance (CLR ) and nonrenal clearance/F can be calculated in addition to total clearance. The total amount excreted unchanged in urine from time zero up to the last collected urine sample (Turine ) and the AUC from time zero to Turine are used to calculate CLR :
CLR = (amount excreted unchanged in urine until time Turine) / (AUC0−Turine in plasma)
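A small arithmetic sketch of this calculation, and of the nonrenal clearance after an intravenous dose, is shown below; all values are assumed for illustration:

amount_urine = 18.0     # mg excreted unchanged in urine from 0 to T_urine, assumed
auc_0_turine = 45.0     # mg*h/L, plasma AUC over the same interval, assumed
cl_total = 2.0          # L/h, total clearance from Dose / AUC(0-inf) after iv dosing, assumed

cl_renal = amount_urine / auc_0_turine        # renal clearance (L/h)
cl_nonrenal = cl_total - cl_renal             # nonrenal clearance (L/h), iv case
print(f"CLR = {cl_renal:.2f} L/h, CLNR = {cl_nonrenal:.2f} L/h")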
Importantly, this formula yields renal clearance (and not renal clearance/F) both for intravenous and for extravascular administration because the amount recovered in urine is known. Nonrenal clearance (CLNR ) is calculated as the difference between total (CL) and renal clearance (CLR ) for intravenous administration. Apparent nonrenal clearance is calculated as the difference between CL/F and CLR for extravascular administration. If F is less than 100%, then this difference will not be equal to CLNR after intravenous administration divided by F because the estimate for CLR does not include F. In addition to clearance, MDRT and therefore volume of distribution at steady state can be calculated from urinary excretion data as described by Weiss (3). The urinary excretion rate is calculated as amount of drug excreted unchanged in urine per unit of time. A urinary excretion rate versus time plot is one method to estimate terminal elimination half-life based on urine data [see Rowland and Tozer (74) for details].
The calculated half-life from urinary excretion data should be similar to the half-life calculated from plasma data.
5.4 Superposition Methods and Deconvolution
Superposition methods assume that the concentration time profiles of each dose can be added if two doses are given at the same or at different times because the concentrations achieved by each dose do not influence each other (47). A linear pharmacokinetic system has this property. This situation allows one to simulate plasma concentration time profiles after multiple doses by adding up concentration time profiles after various single doses at the respective dosing times. Such a nonparametric superposition module is implemented, for example, in WinNonlinTM Pro, and a similar module (on convolution) is available in Thermo Kinetica. In pharmacokinetics, deconvolution is often used to determine the in vivo release profile, for example, of a modified release formulation. The disposition curve after iv bolus administration (the impulse response of the system) can be used in combination with the plasma concentration time curve of an extravascular formulation to determine the in vivo release profile. The Wagner–Nelson method (75–78), which is based on a one-compartment model, and the Loo–Riegelman method, which can account for multiple disposition compartments (79,80), have been applied extensively. These methods subsequently were greatly extended (44,50,73,81–83). A detailed description of those algorithms is beyond the scope of this article. Convolution/deconvolution methods are available in WinNonlinTM Pro and in Thermo Kinetica.
6 GUIDELINES TO PERFORM AN NCA BASED ON NUMERICAL INTEGRATION
Before running any PK analysis, it is very helpful to prepare various plots of the observed concentration time data. Typically, the observed individual concentration time data are plotted on linear and semilogarithmic scale (i.e., concentrations on log-scale vs. time on linear scale). A log2 scale is often a good choice to visualize concentration versus
time data because half-life can be visualized more easily by such a plot. These plots usually are prepared for all subjects at the same time (‘‘spaghetti plots’’) and for each subject individually. An intraindividual comparison of the observed data for different treatments is often helpful. If different doses were given to the same subject, then plotting dose-normalized concentrations is helpful to assess dose linearity visually according to the superposition principle. Calculation of descriptive statistics in addition to the individual observed data provides useful summary statistics. The average ± standard deviation and the median (25%–75% percentiles) or other representative percentiles are often plotted versus time for each treatment. These observed data plots might already reveal that some assumptions like first-order elimination might not be appropriate. Actual, instead of nominal, protocol sampling times should always be recorded and used for PK analysis. For the very common case that numerical integration by the trapezoidal rule is applied, the two most important user decisions are (1) determination of the terminal phase and (2) choice of the integration rule for the AUC. The choice of the most appropriate AUC calculation method is discussed above. Some guidelines for determining the terminal phase are shown below. 6.1 How to Select Concentration Data Points for Estimation of Terminal Half-Life The following rules provide some practical guidelines for determination of the terminal phase of the concentration time profile: 1. Define a maximum number of data points used to specify the terminal phase. During the terminal phase, drug distribution is in equilibrium. As drug distribution is usually not in equilibrium at Tmax, the terminal phase should not contain Cmax and the data points directly after Cmax. Use of at least three points and not more than five to seven points to define the terminal phase is often appropriate.
2. Select the ‘‘optimal’’ number of data points according to a statistical criterion for the goodness of fit. Programs like WinNonlinTM Pro, Thermo Kinetica , and others provide an interactive graphical interface that shows the r2 and the r2 -adjusted value (the latter is called G-criterion in Thermo Kinetica ) for the chosen data points. The highest r2 -adjusted (or r2 ) value can be used as a statistical criterion for the best fit (ideal value: 1.0). The r2 adjusted criterion might be preferred to the r2 -criterion because the r2 -adjusted criterion considers the number of data points used to derive λz , whereas the r2 -criterion does not. If the last three, four, and five data points all yield an r2 -value of 0.98, for example, the r2 adjusted criterion is highest for the last five data points. With more data points being used to estimate λz , the probability of this correlation occurring by chance decreases, and, thus, several data points may be preferable to estimate λz . Other criteria for the number of data points may also be appropriate. Proost (84) compared various criteria to estimate λz by simulating profiles for a one-compartment model and concluded that various criteria (including the r2 and r2 -adjusted criterion) had comparable bias and precision for the estimate of λz . Irrespective of the criterion chosen, it seems reasonable to ensure an appropriate selection of data points to estimate the terminal half-life by visual inspection of the observations and the predicted regression line in each subject. 3. If half-life becomes systematically shorter with more data points being used for estimation of λz , then the use of only the last three, four, (or five) data points to define the terminal phase seems reasonable. This situation often occurs for drugs that need to be described by a sum of two, three, or more exponential functions, when the distribution is non-instantaneous.
4. If half-life becomes systematically longer with more data points being used for estimation of λz , then the use of only the last three, four, (or five) data points to define the terminal phase seems reasonable. One possible reason for such an observation would be a mixed-order (Michaelis-Menten) elimination that violates the assumptions of standard NCA (see Table 1). 5. It is reasonable to use at least three data points for estimation of λz . Exception: If the third last point is Cmax, only the last two points should be used to estimate λz or λz should not be estimated in this subject. 6. Exclusion of data points for estimation of λz seems warranted only if a problem with the respective sample has been documented, for example, by the clinical site or by the analytical laboratory. Probably no set of guidelines is applicable to all datasets. This set of rules may need to be revised, if the precision of the bioanalytical assay is low at low concentrations. Terminal half-life sometimes cannot be well determined by NCA for extended-release formulations or for data after multiple dosing, especially if the dosing interval is shorter than about twice the terminal half-life. It may be very helpful to report the median instead of the arithmetic mean to describe the central tendency of terminal half-life, if NCA indicates extremely long half-lives for some subjects. Compartmental modeling may be more powerful in these situations (85). The concentrations during the terminal phase used in linear regression on semilogarithmic scale often span a wide range (sometimes more than a factor of 100 between the highest and lowest concentration). This consideration is important for choosing an adequate weighting scheme. Importantly, unweighted linear regression (uniform weighting) on semilogarithmic scale is approximately equal to assuming a constant coefficient of variation error model. Therefore, unweighted linear regression on semilogarithmic scale is usually an adequate weighting scheme for NCA.
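As a hedged sketch of rule 2 above (the data are assumed, and interactive visual inspection as offered by commercial software is not replaced by this snippet), the following Python code evaluates the last k points for k = 3 up to 7, never includes Cmax, and keeps the fit with the highest adjusted r²:

import numpy as np

t = np.array([1, 2, 3, 4, 6, 8, 12, 16, 24.0])                      # h, assumed
c = np.array([6.8, 5.9, 4.9, 4.0, 2.6, 1.8, 0.84, 0.39, 0.085])     # mg/L, assumed

i_cmax = int(np.argmax(c))
max_pts = min(7, len(t) - i_cmax - 1)          # never include Cmax in the terminal fit
best = None
for k in range(3, max_pts + 1):                # use the last k observations
    x, y = t[-k:], np.log(c[-k:])
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    r2 = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)
    r2_adj = 1.0 - (1.0 - r2) * (k - 1) / (k - 2)   # adjusts for the two regression coefficients
    if best is None or r2_adj > best[0]:
        best = (r2_adj, k, -slope)

r2_adj, k, lambda_z = best
print(f"{k} points, adjusted r2 = {r2_adj:.4f}, t1/2 = {np.log(2) / lambda_z:.1f} h")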
6.2 How to Handle Samples Below the Quantification Limit
The best way to handle observations reported as being below the limit of quantitation is to ask the chemical analyst to supply the measured values. It may be necessary to convince the analytical team that reporting the measured concentration value for samples "below the quantification limit" (BQL) contributes valuable information. These low concentrations can be adequately weighted when compartmental modeling techniques are applied. No good PK or regulatory reason exists not to use these observations, and sound statistical theory indicates that discarding these values will bias the results of subsequent PK analysis (86–94). Concentrations at the first time points for extravascular administration and during the terminal phase are often reported to be BQL by the analytical laboratory. A thorough discussion of handling BQL samples in a PK data analysis is beyond the scope of this article. Population PK modeling offers more powerful methods for dealing with BQL samples. As a practical guide for NCA, the following procedure can be applied, if the only available information is that the respective samples were reported to be BQL:
1. All BQL samples before the first quantifiable concentration (including the predose sample) are set to zero (see the predose, 0.5-h, and 1-h samples in Fig. 6). This setting will slightly underestimate the AUC of the trapezoids before the first quantifiable concentration, but the bias is usually small. An unreasonable option is to ignore the BQL samples before the first quantifiable concentration and to calculate the first trapezoid from time zero to the first quantifiable concentration, because this calculation would yield a much larger overestimation of the AUC compared with the small underestimation described above.
2. BQL samples that were drawn after the first quantifiable concentration and before the last quantifiable concentration are usually ignored. This approach assumes that these samples were in
fact lost or never drawn (see the 3.5-h sample in Fig. 6).
3. BQL samples after the last quantifiable concentration are usually all ignored. A less common and potentially suboptimal method for NCA is to set the first BQL sample after the last quantifiable concentration (see the 6-h sample in Fig. 6) to a fixed value. Typically, half of the reported quantification limit is used. All subsequent BQL samples are ignored for NCA (see the 7-h and 8-h samples in Fig. 6).
If the terminal phase is adequately described by the last quantifiable concentrations and if the residual area is below 5%, then ignoring all BQL samples after the last quantifiable concentration seems to be most reasonable. Imputing the first BQL sample at 6 h (see Fig. 6) to be half of the quantification limit (or any other prespecified value) is likely to yield (slightly) biased estimates for terminal half-life in an NCA. Importantly, such an imputed sample should not be used for calculation of residual area.
6.3 NCA for Sparse Concentration Time Data
NCA methods for sparse concentration time data have been developed (36–43). These methods are often applied for preclinical PK, toxicokinetic, and animal studies for which usually only sparse data are available. Bailer (36) originally proposed a method that uses concentration time data from destructive sampling (one observation per animal) to estimate the average AUC up to the
Figure 6. Example of a plasma concentration time profile with several samples reported to be below the quantification limit (BQL) of 1 mg/L.
last observation time and to derive the standard error of the average AUC. Assume, for example, that 20 animals are sacrificed at 5 different time points (4 animals at each time point). The average and variance of the concentration at each time point are calculated, and the linear trapezoidal rule is used to derive the average AUC. Bailer provided a formula for the standard error of the average AUC. If various treatments are compared, then the average AUC can be compared statistically between different treatments by use of this standard error. Extensions to the Bailer method were developed (41–43) and were recently implemented into version 5 of WinNonlinTM Pro. Sparse sampling methods for NCA will also become available in the next version of Thermo Kinetica . Subsequently, bootstrap resampling techniques for sparse data (one observation per animal) were developed and evaluated (37–39). The so-called pseudoprofile-based bootstrap (38) creates pseudoprofiles by randomly sampling one concentration from, for example, the four animals at each time point. One concentration is drawn at each of the five time points. These five concentrations represent one pseudoprofile. Many pseudoprofiles are randomly generated, and an NCA is performed for each pseudoprofile. Summary statistics are computed to calculate the between subject variability and standard error of the NCA statistics of interest [see Mager and Goller (38) for details]. 6.4 Reporting the Results of an NCA The PK parameters and statistics to be reported for an NCA depend on the objectives of the analysis and on the types of data. Regulatory guidelines list several PK parameters and statistics to be reported from an NCA (53,54). A detailed report on the design, analysis, and reporting of clinical trials by NCA has been published by the Association for Applied Human Pharmacology (AGAH) (54). Valuable guidelines can also be found on pages 21 to 23 of the FDA guidelines for bioavailability and bioequivalence studies (53). It is often helpful for subsequent analyses to report the average ± SD (or geometric mean and %CV) and median (percentiles) of
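The following Python sketch illustrates a Bailer-type calculation with an invented destructive-sampling dataset (four animals per time point); it is a simplified illustration only, and validated software implementations may differ in detail:

import numpy as np

t = np.array([0.5, 1, 2, 4, 8.0])                 # h, assumed sampling times
# rows = time points, columns = the four animals sacrificed at each time (assumed data)
c = np.array([[3.9, 4.4, 4.1, 3.6],
              [5.8, 6.3, 5.5, 6.1],
              [4.9, 5.2, 4.4, 4.7],
              [2.8, 3.3, 2.5, 3.0],
              [0.9, 1.2, 0.7, 1.0]])

mean_c = c.mean(axis=1)
var_c = c.var(axis=1, ddof=1)
n_per_time = c.shape[1]

# linear trapezoidal weights: w1 = (t2 - t1)/2, wi = (t(i+1) - t(i-1))/2, wm = (tm - t(m-1))/2
w = np.empty_like(t)
w[0] = (t[1] - t[0]) / 2
w[1:-1] = (t[2:] - t[:-2]) / 2
w[-1] = (t[-1] - t[-2]) / 2

auc_mean = np.sum(w * mean_c)                        # AUC of the mean concentrations
se_auc = np.sqrt(np.sum(w**2 * var_c / n_per_time))  # Bailer-type standard error
print(f"mean AUC = {auc_mean:.1f} mg*h/L, SE = {se_auc:.2f} mg*h/L")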
the following five NCA statistics: AUC, Cmax, Tmax, t1/2 , and AUMC. Reporting the results on CL and Vss provides insight into the drug disposition. If the average and SD of those five NCA statistics are reported, then the so-called back analysis method (95) can be applied to convert NCA statistics into compartmental model parameters. Alternatively, the individual NCA statistics in each subject can be used instead of average and SD data. One has to decide based on literature data, for example, before applying the back analysis method, if a one- or two-compartment model is likely to be more appropriate for the drug of interest. The back analysis method provides estimates for the mean and variance of the model parameters for a one- or two-compartment model. The resulting PK parameter estimates can be used, for example, to build a population PK model and to run clinical trial simulations. The back analysis method allows one to run such simulations based on NCA results without the individual concentration time data. However, if individual concentration time data are available, population PK analysis is the preferred method of building a population PK model. 6.5 How to Design a Clinical Trial that is to be Analyzed by NCA It is possible to optimize the design of clinical trials (e.g., of bioequivalence studies) that are to be analyzed by NCA. However, those methods rely on compartment models in a population PK analysis and clinical trial simulation and are not covered in this article. Some practical guidelines for selection of appropriate sampling time points are provided below. This article focuses on NCA analysis by numerical integration for the most common case of extravascular dosing. The methods presented here are not as powerful as optimization of the study design by clinical trial simulation but may provide a basic guide. Before planning the sampling time schedule, one needs at least some information on the blood volume required for drug analysis and on the sensitivity of the bioanalytical assay. Some knowledge on the expected average concentration time profile of the drug
6.5 How to Design a Clinical Trial that is to be Analyzed by NCA It is possible to optimize the design of clinical trials (e.g., of bioequivalence studies) that are to be analyzed by NCA. However, those methods rely on compartment models in a population PK analysis and clinical trial simulation and are not covered in this article. Some practical guidelines for the selection of appropriate sampling time points are provided below. This article focuses on NCA by numerical integration for the most common case of extravascular dosing. The methods presented here are not as powerful as optimization of the study design by clinical trial simulation but may provide a basic guide. Before planning the sampling time schedule, one needs at least some information on the blood volume required for drug analysis and on the sensitivity of the bioanalytical assay. Some knowledge of the expected average concentration-time profile of the drug is also assumed to be available. The FDA (53) recommends that 12 to 18 blood samples (including the predose sample) be taken per subject and per dose. Cawello (96) and the AGAH working group recommend that a total of 15 blood samples be drawn. Five samples (including the predose sample) should be drawn before Tmax. Another five samples are to be drawn between Tmax and one half-life after Tmax, and another five samples should be drawn up to five half-lives after Tmax. If one assumes that the median Tmax is at 1.5 h and that the average terminal half-life is 5 h, then this recommendation would yield, for example, the following sampling times: 0 h (predose), 0.33, 0.67, 1.00, 1.50, 2.00, 2.50, 3.00, 4.00, 6.00, 8.00, 12.0, 16.0, 24.0, and 32.0 h post dose. Cawello (96) recommends that each of the three intervals described above contain four samples if only twelve samples can be drawn in total. This recommendation would yield, for example, the following sampling times: 0 h (predose), 0.50, 1.00, 1.50, 2.00, 3.00, 4.00, 6.00, 10.0, 16.0, 24.0, and 32.0 h post dose. Such a sampling schedule may need to be modified, for example, for drugs with a large variability in terminal half-life or in Tmax. If the variability in Tmax is large, then more frequent sampling may be required during the first hour post dose in this example. Overall, this sampling schedule should provide low residual areas and a robust estimate for the terminal half-life.
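As a rough planning aid, the sketch below spreads 15 samples over the three windows described above (predose up to Tmax, Tmax to one half-life after Tmax, and out to five half-lives after Tmax). The even and logarithmic spacing within the windows, and the absence of rounding to practical clock times, are illustrative choices and not part of the AGAH recommendation, which is why the resulting times differ slightly from the hand-picked schedule quoted above.

```python
import numpy as np

def three_window_schedule(tmax, t_half):
    """Candidate sampling times (h) following the three-window rule:
    5 samples (incl. predose) up to Tmax, 5 up to Tmax + t1/2, 5 out to Tmax + 5*t1/2."""
    first = np.linspace(0.0, tmax, 5)                                # includes the 0 h predose sample
    second = np.linspace(tmax, tmax + t_half, 6)[1:]                 # around and after the peak
    third = np.geomspace(tmax + t_half, tmax + 5.0 * t_half, 6)[1:]  # log-spaced terminal-phase window
    return np.round(np.concatenate([first, second, third]), 2)

# Using the worked example from the text: median Tmax 1.5 h, terminal half-life 5 h
print(three_window_schedule(tmax=1.5, t_half=5.0))
```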
7 CONCLUSIONS AND PERSPECTIVES
NCA is an important component of the toolbox for PK analyses. It is most applicable to studies with frequent sampling. Numerical integration, for example by the trapezoidal rule, is most commonly used to analyze data after extravascular dosing. Fitting of disposition curves by a sum of exponential functions is the non-compartmental method of choice for analysis of concentrations after iv bolus dosing. Non-compartmental methods for handling sparse data have been developed and are available in standard software packages. Standard NCA is straightforward and can be applied conveniently even by non-specialist users.
It is important to recognize the assumptions and limitations of standard NCA. Almost all applications of NCA require a series of assumptions that are similar to the assumptions required for compartmental modeling. Violation of the assumptions for NCA will result in biased parameter estimates for some or all PK parameters. From a regulatory perspective, NCA is appealing as it involves minimal decision making by the drug sponsor or drug regulator. NCA can provide robust estimates of many PK parameters for studies with frequent sampling. For these studies, NCA and compartmental modeling often complement each other, and NCA results may be very helpful for compartmental model building. Therefore, NCA may be very valuable for PK data analysis of studies with frequent sampling, irrespective of whether the study objectives require subsequent analysis by compartmental modeling. 8
ACKNOWLEDGMENT
We thank one of the reviewers and Dr. William J. Jusko for comments on this manuscript. Jürgen Bulitta was supported by a postdoctoral fellowship from Johnson & Johnson. REFERENCES 1. O. Caprani, E. Sveinsdottir, and N. Lassen, SHAM, a method for biexponential curve resolution using initial slope, height, area and moment of the experimental decay type curve. J. Theor. Biol. 1975; 52: 299–315. 2. N. Lassen and W. Perl, Tracer Kinetic Methods in Medical Physiology. New York: Raven Press, 1979. 3. M. Weiss, The relevance of residence time theory to pharmacokinetics. Eur. J. Clin. Pharmacol. 1992; 43: 571–579. 4. P. Veng-Pedersen, Noncompartmentally-based pharmacokinetic modeling. Adv. Drug Deliv. Rev. 2001; 48: 265–300. 5. P. Veng-Pedersen, H. Y. Cheng, and W. J. Jusko, Regarding dose-independent pharmacokinetic parameters in nonlinear pharmacokinetics. J. Pharm. Sci. 1991; 80: 608–612.
6. A. T. Chow and W. J. Jusko, Application of moment analysis to nonlinear drug disposition described by the Michaelis-Menten equation. Pharm. Res. 1987; 4: 59–61. 7. H. Y. Cheng and W. J. Jusko, Mean residence time of drugs in pharmacokinetic systems with linear distribution, linear or nonlinear elimination, and noninstantaneous input. J. Pharm. Sci. 1991; 80: 1005–1006. 8. H. Y. Cheng and W. J. Jusko, Mean residence time concepts for pharmacokinetic systems with nonlinear drug elimination described by the Michaelis-Menten equation. Pharm. Res. 1988; 5: 156–164. 9. H. Cheng, W. R. Gillespie, and W. J. Jusko, Mean residence time concepts for non-linear pharmacokinetic systems. Biopharm. Drug Dispos. 1994; 15: 627–641. 10. H. Cheng, Y. Gong, and W. J. Jusko, A computer program for calculating distribution parameters for drugs behaving nonlinearly that is based on disposition decomposition analysis. J. Pharm. Sci. 1994; 83: 110–112. 11. P. Veng-Pedersen, J. A. Widness, L. M. Pereira, C. Peters, R. L. Schmidt, and L. S. Lowe, Kinetic evaluation of nonlinear drug elimination by a disposition decomposition analysis. Application to the analysis of the nonlinear elimination kinetics of erythropoietin in adult humans. J. Pharm. Sci. 1995; 84: 760–767. 12. M. Weiss, Mean residence time in non-linear systems? Biopharm. Drug Dispos. 1988; 9: 411–412. 13. D. J. Cutler, A comment regarding mean residence times in non-linear systems. Biopharm. Drug Dispos. 1989; 10: 529–530. 14. H. Y. Cheng and W. J. Jusko, An area function method for calculating the apparent elimination rate constant of a metabolite. J. Pharmacokinet. Biopharm. 1989; 17: 125–130. 15. A. T. Chow and W. J Jusko, Michaelis-Menten metabolite formation kinetics: equations relating area under the curve and metabolite recovery to the administered dose. J. Pharm. Sci. 1990; 79: 902–906. 16. M. Weiss, Use of metabolite AUC data in bioavailability studies to discriminate between absorption and first-pass extraction. Clin. Pharmacokinet. 1990; 18: 419–422. 17. M. Weiss, A general model of metabolite kinetics following intravenous and oral administration of the parent drug. Biopharm. Drug Dispos. 1988; 9: 159–176. 18. J. B. Houston, Drug metabolite kinetics. Pharmacol. Ther. 1981; 15: 521–552.
19. W. J. Jusko, Guidelines for collection and analysis of pharmacokinetic data. In: M. E. Burton, L. M. Shaw, J. J. Schentag, and W. E. Evans (eds.), Applied Pharmacokinetics & Pharmacodynamics, 4th ed. Philadelphia, PA: Lippincott Williams and Wilkins, 2005. 20. H. Y. Cheng and W. J. Jusko, Mean residence times and distribution volumes for drugs undergoing linear reversible metabolism and tissue distribution and linear or nonlinear elimination from the central compartments. Pharm. Res. 1991; 8: 508–511. 21. H. Y. Cheng and W. J. Jusko, Mean residence time of drugs showing simultaneous first-order and Michaelis-Menten elimination kinetics. Pharm. Res. 1989; 6: 258–261. 22. H. Y. Cheng and W. J. Jusko, Mean interconversion times and distribution rate parameters for drugs undergoing reversible metabolism. Pharm. Res. 1990; 7: 1003–1010. 23. H. Y. Cheng and W. J. Jusko, Mean residence times of multicompartmental drugs undergoing reversible metabolism. Pharm. Res. 1990; 7: 103–107. 24. H. Y. Cheng and W. J. Jusko, Mean residence time of oral drugs undergoing first-pass and linear reversible metabolism. Pharm. Res. 1993; 10: 8–13. 25. H. Y. Cheng and W. J. Jusko, Pharmacokinetics of reversible metabolic systems. Biopharm. Drug Dispos. 1993; 14: 721–766. 26. S. Hwang, K. C. Kwan, and K. S. Albert, A liner mode of reversible metabolism and its application to bioavailability assessment. J. Pharmacokinet. Biopharm. 1981; 9: 693–709. 27. W. F. Ebling, S. J. Szefler, and W. J. Jusko, Methylprednisolone disposition in rabbits. Analysis, prodrug conversion, reversible metabolism, and comparison with man. Drug Metab. Dispos. 1985; 13: 296–304. 28. M. N. Samtani, M. Schwab, P. W. Nathanielsz, and W. J. Jusko, Area/moment and compartmental modeling of pharmacokinetics during pregnancy: applications to maternal/fetal exposures to corticosteroids in sheep and rats. Pharm. Res. 2004; 21: 2279–2292. 29. M. S. Roberts, B. M. Magnusson, F. J. Burczynski, and M. Weiss, Enterohepatic circulation: physiological, pharmacokinetic and clinical implications. Clin. Pharmacokinet. 2002; 41: 751–790. 30. H. Cheng and W. R. Gillespie, Volumes of distribution and mean residence time of drugs with linear tissue distribution and binding and nonlinear protein binding. J. Pharmacokinet. Biopharm. 1996; 24: 389–402.
31. F. M. Gengo, J. J. Schentag, and W. J. Jusko, Pharmacokinetics of capacity-limited tissue distribution of methicillin in rabbits. J. Pharm. Sci. 1984; 73: 867–873. 32. R. Nagashima, G. Levy, and R. A. O'Reilly, Comparative pharmacokinetics of coumarin anticoagulants. IV. Application of a three-compartmental model to the analysis of the dose-dependent kinetics of bishydroxycoumarin elimination. J. Pharm. Sci. 1968; 57: 1888–1895. 33. D. E. Mager, Target-mediated drug disposition and dynamics. Biochem. Pharmacol. 2006; 72: 1–10. 34. D. E. Mager and W. J. Jusko, General pharmacokinetic model for drugs exhibiting target-mediated drug disposition. J. Pharmacokinet. Pharmacodyn. 2001; 28: 507–532. 35. H. Cheng and W. J. Jusko, Disposition decomposition analysis for pharmacodynamic modeling of the link compartment. Biopharm. Drug Dispos. 1996; 17: 117–124. 36. A. J. Bailer, Testing for the equality of area under the curves when using destructive measurement techniques. J. Pharmacokinet. Biopharm. 1988; 16: 303–309. 37. H. Mager and G. Goller, Analysis of pseudoprofiles in organ pharmacokinetics and toxicokinetics. Stat. Med. 1995; 14: 1009–1024. 38. H. Mager and G. Goller, Resampling methods in sparse sampling situations in preclinical pharmacokinetic studies. J. Pharm. Sci. 1998; 87: 372–378. 39. P. L. Bonate, Coverage and precision of confidence intervals for area under the curve using parametric and non-parametric methods in a toxicokinetic experimental design. Pharm. Res. 1998; 15: 405–410. 40. J. R. Nedelman and E. Gibiansky, The variance of a better AUC estimator for sparse, destructive sampling in toxicokinetics. J. Pharm. Sci. 1996; 85: 884–886. 41. J. R. Nedelman, E. Gibiansky, and D. T. Lau, Applying Bailer's method for AUC confidence intervals to sparse sampling. Pharm. Res. 1995; 12: 124–128. 42. J. R. Nedelman and X. Jia, An extension of Satterthwaite's approximation applied to pharmacokinetics. J. Biopharm. Stat. 1998; 8: 317–328. 43. D. J. Holder, Comments on Nedelman and Jia's extension of Satterthwaite's approximation applied to pharmacokinetics. J. Biopharm. Stat. 2001; 11: 75–79. 44. P. Veng-Pedersen, Stochastic interpretation of linear pharmacokinetics: a linear system
analysis approach. J. Pharm. Sci. 1991; 80: 621–631. 45. J. J. DiStefano, 3rd. Noncompartmental vs. compartmental analysis: some bases for choice. Am. J. Physiol. 1982; 243: R1–6. 46. J. J. DiStefano 3rd, E. M. Landaw, Multiexponential, multicompartmental, and noncompartmental modeling. I. Methodological limitations and physiological interpretations. Am. J. Physiol. 1984; 246: R651–664. 47. W. R. Gillespie, Noncompartmental versus compartmental modelling in clinical pharmacokinetics. Clin. Pharmacokinet. 1991; 20: 253–262. 48. D. J. Cutler, Linear systems analysis in pharmacokinetics. J. Pharmacokinet. Biopharm. 1978; 6: 265–282. 49. J. L. Stephenson, Theory of transport in linear biological systems: I. Fundamental integral equation. Bull. Mathemat. Biophys. 1960; 22: 1–7. 50. P. Veng-Pedersen, Linear and nonlinear system approaches in pharmacokinetics: how much do they have to offer? I. General considerations. J. Pharmacokinet. Biopharm. 1988; 16: 413–472. 51. E. Nakashima and L. Z. Benet, An integrated approach to pharmacokinetic analysis for linear mammillary systems in which input and exit may occur in/from any compartment. J. Pharmacokinet. Biopharm. 1989; 17: 673–686. 52. Weiss M. Generalizations in linear pharmacokinetics using properties of certain classes of residence time distributions. I. Log-convex drug disposition curves. J. Pharmacokinet. Biopharm. 1986; 14: 635–657. 53. Food and Drug Administration (CDER). Guidance for Industry: Bioavailability and Bioequivalence Studies for Orally Administered Drug Products - General Considerations, 2003. 54. Committee for Proprietary and Medicinal Products (CPMP). Note for Guidance on the Investigation of Bioavailability and Bioequivalence. CPMP/EWP/QWP/1401/98, 2001. 55. K. C. Yeh and K. C. Kwan, A comparison of numerical integrating algorithms by trapezoidal, Lagrange, and spline approximation. J. Pharmacokinet. Biopharm. 1978; 6: 79–98. 56. Z. Yu and F. L. Tse, An evaluation of numerical integration algorithms for the estimation of the area under the curve (AUC) in pharmacokinetic studies. Biopharm. Drug Dispos. 1995; 16: 37–58.
57. W. J. Jusko and M. Gibaldi, Effects of change in elimination on various parameters of the two-compartment open model. J. Pharm. Sci. 1972; 61: 1270–1273. 58. J. V. Gobburu and N. H. Holford, Vz, the terminal phase volume: time for its terminal phase? J. Biopharm. Stat. 2001; 11: 373–375. 59. K. Yamaoka, T. Nakagawa, and T. Uno, Statistical moments in pharmacokinetics. J. Pharmacokinet. Biopharm. 1978; 6: 547–558. 60. I. L. Smith and J. J. Schentag, Noncompartmental determination of the steady-state volume of distribution during multiple dosing. J. Pharm. Sci. 1984; 73: 281–282. 61. M. Weiss, Definition of pharmacokinetic parameters: influence of the sampling site. J. Pharmacokinet. Biopharm. 1984; 12: 167–175. 62. W. L. Chiou, The phenomenon and rationale of marked dependence of drug concentration on blood sampling site. Implications in pharmacokinetics, pharmacodynamics, toxicology and therapeutics (Part I). Clin. Pharmacokinet. 1989; 17: 175–199. 63. M. Weiss, Nonidentity of the steady-state volumes of distribution of the eliminating and noneliminating system. J. Pharm. Sci. 1991; 80: 908–910. 64. M. Weiss, Model-independent assessment of accumulation kinetics based on moments of drug disposition curves. Eur. J. Clin. Pharmacol. 1984; 27: 355–359. 65. M. Weiss, Washout time versus mean residence time. Pharmazie 1988; 43: 126–127. 66. D. Perrier and M. Mayersohn, Noncompartmental determination of the steady-state volume of distribution for any mode of administration. J. Pharm. Sci. 1982; 71: 372–373. 67. H. Cheng and W. J. Jusko, Noncompartmental determination of the mean residence time and steady-state volume of distribution during multiple dosing. J. Pharm. Sci. 1991; 80: 202–204. 68. D. Brockmeier, H, J. Dengler, and D. Voegele, In vitro-in vivo correlation of dissolution, a time scaling problem? Transformation of in vitro results to the in vivo situation, using theophylline as a practical example. Eur. J. Clin. Pharmacol. 1985; 28: 291–300. 69. D. Brockmeier, In vitro-in vivo correlation, a time scaling problem? Evaluation of mean times. Arzneimittelforschung 1984; 34: 1604–1607. 70. M. E. Wise, Negative power functions of time in pharmacokinetics and their implications. J. Pharmacokinet. Biopharm. 1985; 13:
309–346. 71. K. H. Norwich and S. Siu, Power functions in physiology and pharmacology. J. Theor. Biol. 1982; 95: 387–398. 72. G. T. Tucker, P. R. Jackson, G. C. Storey, and D. W. Holt, Amiodarone disposition: polyexponential, power and gamma functions. Eur. J. Clin. Pharmacol. 1984; 26: 655–656. 73. J. Gabrielsson and D. Weiner, Pharmacokinetic and Pharmacodynamic Data Analysis, Concepts and Applications. 4th ed. Stockholm, Sweden: Swedish Pharmaceutical Press, 2007. 74. M. Rowland and T. N. Tozer, Clinical Pharmacokinetics: Concepts and Applications. Philadelphia, PA: Lippincott Williams & Wilkins, 1995. 75. J. G. Wagner and E. Nelson, Per cent absorbed time plots derived from blood level and/or urinary excretion data. J. Pharm. Sci. 1963; 52: 610–611. 76. J. G. Wagner and E. Nelson, Kinetic analysis of blood levels and urinary excretion in the absorptive phase after single doses of drug. J. Pharm. Sci. 1964; 53: 1392–1403. 77. J. G. Wagner, Modified Wagner-Nelson absorption equations for multiple-dose regimens. J. Pharm. Sci. 1983; 72: 578–579. 78. J. G. Wagner, The Wagner-Nelson method applied to a multicompartment model with zero order input. Biopharm. Drug Dispos. 1983; 4: 359–373. 79. J. C. Loo and S. Riegelman, New method for calculating the intrinsic absorption rate of drugs. J. Pharm. Sci. 1968; 57: 918–928. 80. J. G. Wagner, Pharmacokinetic absorption plots from oral data alone or oral/intravenous data and an exact Loo-Riegelman equation. J. Pharm. Sci. 1983; 72: 838–842. 81. F. N. Madden, K. R. Godfrey, M. J. Chappell, R. Hovorka, and R. A. Bates, A comparison of six deconvolution techniques. J. Pharmacokinet. Biopharm. 1996; 24: 283–299. 82. D. Verotta, Comments on two recent deconvolution methods. J. Pharmacokinet. Biopharm. 1990; 18: 483–489; discussion 489–499. 83. D. P. Vaughan and M. Dennis, Mathematical basis and generalization of the LooRiegelman method for the determination of in vivo drug absorption. J. Pharmacokinet. Biopharm. 1980; 8: 83–98. 84. J. H. Proost, Calculation of half-life - PharmPK Discussion, 2005. Available: http://www.boomer.org/pkin/PK05/ PK2005095.html.
85. A. Sharma, P. H. Slugg, J. L. Hammett, and W. J. Jusko, Estimation of oral bioavailability of a long half-life drug in healthy subjects. Pharm. Res. 1998; 15: 1782–1786. 86. S. L. Beal, Ways to fit a PK model with some data below the quantification limit. J. Pharmacokinet. Pharmacodyn. 2001; 28: 481–504. 87. V. Duval and M. O. Karlsson, Impact of omission or replacement of data below the limit of quantification on parameter estimates in a two-compartment model. Pharm. Res. 2002; 19: 1835–1840. 88. H. Jacqmin-Gadda, R. Thiebaut, G. Chene, and D. Commenges, Analysis of left-censored longitudinal data with application to viral load in HIV infection. Biostatistics 2000; 1: 355–368. 89. H. S. Lynn, Maximum likelihood inference for left-censored HIV RNA data. Stat. Med. 2001; 20: 33–45. 90. J. P. Hing, S. G. Woolfrey, D. Greenslade, and P. M. Wright, Analysis of toxicokinetic data using NONMEM: impact of quantification limit and replacement strategies for censored data. J. Pharmacokinet. Pharmacodyn. 2001; 28: 465–479. 91. A. Samson, M. Lavielle, and F. Mentre, Extension of the SAEM algorithm to left-censored data in nonlinear mixed-effects model: application to HIV dynamics model. Computat. Stat. Data Anal. 2006; 51: 1562–1574. 92. R. Thiebaut, J. Guedj, H. Jacqmin-Gadda, et al., Estimation of dynamical model parameters taking into account undetectable marker values. BMC Med. Res. Methodol. 2006; 6: 38. 93. J. Asselineau, R. Thiebaut, P. Perez, G. Pinganaud, and G. Chene, Analysis of left-censored quantitative outcome: example of procalcitonin level. Rev. Epidemiol. Sante Publique 2007; 55: 213–220. 94. S. Hennig, T. H. Waterhouse, S. C. Bell, et al., A d-optimal designed population pharmacokinetic study of oral itraconazole in adult cystic fibrosis patients. Br. J. Clin. Pharmacol. 2007; 63: 438–450. 95. C. Dansirikul, M. Choi, and S. B. Duffull, Estimation of pharmacokinetic parameters from non-compartmental variables using Microsoft Excel. Comput. Biol. Med. 2005; 35: 389–403. 96. W. Cawello, Parameters for Compartment-free Pharmacokinetics - Standardisation of Study Design, Data Analysis and Reporting. Aachen, Germany: Shaker Verlag, 1999.
FURTHER READING D. Z. D’Argenio, Advanced Methods of Pharmacokinetic and Pharmacodynamic Systems Analysis. New York: Plenum Press, 1991. D. Foster, NIH course ‘‘Principles of Clinical Pharmacology’’ lecture (including video) on compartmental versus noncompartmental analysis. Available: http://www.cc.nih.gov/training/training. html. M. Weiss, The relevance of residence time theory to pharmacokinetics. Eur. J. Clin. Pharmacol. 1992; 43: 571–579. W. J. Jusko, Guidelines for collection and analysis of pharmacokinetic data. In: M. E. Burton, L. M. Shaw, J. J. Schentag, and W. E. Evans (eds.), Applied Pharmacokinetics & Pharmacodynamics. 4th ed. Philadelphia, PA: Lippincott Williams and Wilkins, 2005.
CROSS-REFERENCES Pharmacokinetic Study Bioavailability Bioequivalence Population Pharmacokinetic Methods Analysis of Variance (ANOVA)
NONCOMPLIANCE

Noncompliance with the protocol, Standard Operating Procedures (SOPs), Good Clinical Practice (GCP), and/or applicable regulatory requirement(s) by an investigator/institution, or by member(s) of the sponsor's staff, should lead to prompt action by the sponsor to secure compliance. If the monitoring and/or auditing personnel identify(ies) serious and/or persistent noncompliance on the part of an investigator/institution, the sponsor should terminate the investigator's/institution's participation in the trial. When an investigator's/institution's participation is terminated because of noncompliance, the sponsor should notify promptly the regulatory authority(ies).
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
NONINFERIORITY TRIAL
H. M. JAMES HUNG
Division of Biometrics I, Office of Biostatistics, Office of Translational Sciences, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland

SUE-JANE WANG
ROBERT O'NEILL
Office of Translational Sciences, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland

Traditionally, the effect of an experimental treatment in treating a disease is mostly evaluated in a placebo-controlled clinical trial with the objective of demonstrating that the experimental treatment is more effective than the placebo, with or without the standard-of-care medical treatment in the background. This clinical objective drives the formulation of a statistical hypothesis for testing. The trial is designed to provide statistical evidence to reject the null hypothesis that there is no treatment difference in the therapeutic effect and consequently to accept the intended alternative hypothesis that the experimental treatment is more effective than the placebo. This is often referred to as superiority testing. When a placebo can no longer be used ethically in the trial (e.g., with a life-saving drug on the market), a treatment regimen that has been proven to be effective and safe can be selected as a control, often called an active or positive control, against which the experimental treatment is compared (1). In an active-controlled trial, demonstration of superiority over the active control certainly establishes the effectiveness of the experimental treatment. Alternatively, proving the efficacy of the experimental treatment can be based on so-called noninferiority testing, by which the experimental treatment is shown to be not much inferior to the active control. Because the effectiveness of the active control has been established, a small amount of loss in effectiveness relative to the control may still lead one to conclude that the experimental treatment is effective (i.e., relative to no use of the experimental treatment). However, the term "noninferior" is in some sense misleading: the experimental treatment may in fact be inferior, and what the term actually means is "not unacceptably inferior." The unacceptable extent of inferiority (e.g., of loss of the effect of the active control) needs to be defined. This extent is the so-called noninferiority margin.

1 ESSENTIAL ELEMENTS OF NONINFERIORITY TRIAL DESIGN

Literature on noninferiority trials, or active-controlled trials in general, is abundant (2–16). Basically, there are two types of noninferiority trial design: with or without a placebo. When a placebo arm is present in the noninferiority trial, the efficacy of the experimental treatment can be evaluated via a direct comparison with the placebo. The comparison of the experimental treatment with the active control elucidates the extent of effectiveness in a relative sense. However, when the placebo is absent, the direct comparison between the experimental treatment and the active control is the only comparison available. It not only elucidates the extent of inferiority or superiority of the experimental treatment over the active control, but it is also expected to serve as a bridge for an indirect assessment of the experimental treatment's efficacy (relative to a placebo). That is, the indirect inference pertains to the important question of how the experimental treatment would have fared against a placebo had the placebo been in the trial. In the absence of a concurrent placebo, the indirect inference for the experimental treatment's efficacy entails use of the active control's effect (relative to a placebo) from the historical placebo-controlled trials. The reason is simple: if the true effect of the experimental treatment relative to the active control and the true effect of the active control versus the placebo are known, one can obtain the effect of the experimental treatment versus the placebo by transitivity.
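This transitivity argument can be written down in a few lines. The sketch below is a minimal illustration on the log risk ratio scale, in the spirit of the indirect comparison of Bucher et al. (16); it assumes that the trial estimate of T versus C and the historical estimate of C versus P come from independent sources, and the numerical values are hypothetical.

```python
import math

def indirect_effect_vs_placebo(log_rr_T_vs_C, se_T_vs_C, log_rr_C_vs_P, se_C_vs_P):
    """Combine the trial's T-vs-C estimate with a historical C-vs-P estimate
    to obtain an indirect T-vs-P estimate on the log risk ratio scale."""
    est = log_rr_T_vs_C + log_rr_C_vs_P                 # log(T/P) = log(T/C) + log(C/P)
    se = math.sqrt(se_T_vs_C**2 + se_C_vs_P**2)         # assumes the two sources are independent
    lo, hi = est - 1.96 * se, est + 1.96 * se           # approximate 95% confidence interval
    return est, (lo, hi)

# Hypothetical numbers: trial T-vs-C risk ratio 1.05, historical C-vs-P risk ratio 0.75
est, ci = indirect_effect_vs_placebo(math.log(1.05), 0.08, math.log(0.75), 0.10)
print(math.exp(est), tuple(math.exp(v) for v in ci))    # indirect T-vs-P risk ratio and CI
```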
Usage of the historical data to obtain the effect of the active control presents several critical issues that directly influence the interpretability of a noninferiority trial. First, the effect of the active control versus a placebo in the historical trials is, at best, only an estimate. Derivation of the historical estimate requires meta-analysis, which is in itself controversial in practice. For instance, can the existing historical trials produce an approximately unbiased estimate of the active control’s effect, given that the negative trials are often not reported (this is the problem of publication bias)? And how does one handle large intertrial variability? Thus, the meta-analysis is often suspected of having a tendency of overestimating a treatment effect. The historical estimates in the individual historical trials are often unstable. Selection of the historical trials relevant to the noninferiority trial setting is a subjective process. Statistical methods for handling between-trial heterogeneity may depend on strong untestable assumptions. For instance, the validity of the resulting interval estimator by some meta-analysis method may be questionable (16–18). Secondly, there is a question of whether the historical estimate deemed valid in the historical trial patient population is applicable to the noninferiority trial setting; this is the question about the so-called ‘‘constancy’’ assumption (19–26). In practice, it is always doubtful that the constancy assumption is satisfied because often there are significant differences between the noninferiority trial and the historical trial settings, such as the differences in patient population, study endpoints, background or concomitant medications, or disease status. Violation of the constancy assumption will induce a bias that is the difference in the effect of the selected active control between the historical trial setting and the noninferiority setting. When the active control is much less effective in the noninferiority trial setting than in the historical trial setting, the bias may substantially increase the risk of falsely concluding noninferiority and the risk of falsely asserting that the experimental treatment is effective. Third, the quality of the noninferiority trial may be problematic because of inevitable
issues of medication nonadherence and noncompliance. The quality affects the ability of the noninferiority trial to distinguish an effective treatment from a less effective or ineffective treatment; this ability is referred to as assay sensitivity in the International Conference on Harmonization (ICH) E10 guidelines (19, 20). It is not difficult to understand that if most of the patients in both treatment groups do not take the study drugs they are assigned, then the two treatment groups will be more alike than they should be. That is, a noninferiority trial that lacks assay sensitivity may find an inferior treatment to be noninferior to the active control, which will lead to erroneous conclusions about noninferiority and the experimental treatment's efficacy. The two important determinations laid out in the ICH E10 document from which the presence of assay sensitivity is deduced are (1) historical evidence of sensitivity to drug effects (i.e., that similarly designed trials in the past regularly distinguished effective treatments from less effective or ineffective treatments) and (2) appropriate trial conduct (i.e., that the conduct of the trial did not undermine its ability to distinguish effective treatments from less effective or ineffective treatments). The absence of a placebo arm from the noninferiority trial makes it impossible to verify or even check the assay sensitivity and constancy assumptions with the trial data. Hence, the extent of the potential bias attributed to violation of these assumptions cannot be quantified in practice.

2 OBJECTIVES OF NONINFERIORITY TRIALS

In the context of therapeutic effectiveness, three possible objectives have often been entertained in noninferiority studies. First, noninferiority in effectiveness traditionally means "not unacceptably inferior" in the sense that the effectiveness of the experimental treatment is "clinically indifferent" from that of the active control. Second, as a minimum requirement, the noninferiority trial must be able to assert that the experimental treatment is efficacious; for instance, the experimental treatment would have been
more effective than the placebo had a placebo been in the noninferiority trial. Third, in many cases, it is often stipulated to establish that the experimental treatment preserves at least a certain portion of the effect of the selected active control in the noninferiority trial. On one hand, consideration of effect retention could arise from a part of value assessment of the relative efficacy, such as how much of loss in clinical benefits is acceptable with use of the experimental treatment in lieu of the active control to exchange for a better safety profile. On the other hand, retention of the active control’s effect may be necessary to create a cushion for the noninferiority inference to be valid beyond the many sources of unquantifiable bias. The biases can occur due to the uncertainty in the estimates of the active control’s effect in historical trials, application of such estimates to the noninferiority trial, or a lack of assay sensitivity. The three possible objectives are interrelated, though the term ‘‘noninferiority’’ may arguably be inappropriate when the primary objective is to demonstrate superiority over a putative placebo or to demonstrate the retention of a certain portion of the activecontrol effect. From the design perspective, they have tremendously different influences on the design specifications. For one, the clinical hypothesis and the corresponding statistical hypothesis for these objectives may be quite different. Consequently, statistical analysis, the decision tree, and inferential interpretation can be quite different. In practice, it is a mandatory first step to clearly define the primary objective of the noninferiority trial. The study objective is also a key factor determining the choice of the noninferiority margin. 3
MEASURE OF TREATMENT EFFECT
In addition to the study objective, selection of a measure for quantifying treatment effect is another key determination for defining the noninferiority margin. Let T, C, and P label, respectively, the experimental treatment, the selected active control, and the placebo that are studied or not studied in the noninferiority trial. To simplify the notation, let T, C,
and P also denote the targeted parameters of the experimental treatment, active control, and placebo, respectively. For instance, for the continuous response variable, if the effect of interest is expressed in terms of a mean change of a response variable from baseline, then T, C, and P are the mean parameters associated with the experimental treatment, active control, and placebo, respectively. For a binary outcome variable of event or not, T, C, and P may be the probabilities of an event associated with the three respective treatments. For a time-to-event variable, T, C, and P may be the respective hazard functions. The effect of a treatment can be measured on a variety of scales. Difference and ratio are two widely used scales. For a continuous response variable, the effect of the experimental treatment relative to the placebo is often defined on the difference scale, T − P. As an example, if the test drug decreases a mean of 6 mm Hg in sitting diastolic blood pressure after some point of treatment administration and the placebo decreases a mean of 2 mm Hg, then the effect of the experimental treatment is often measured as T − P = 4 mm Hg. On the ratio scale, the effect of the experimental treatment relative to the placebo is often defined as (1 − T/P). For instance, if the probability of death at some point during the study in the patients who receive the experimental treatment is 0.30 (or 30%) and the probability of death associated with the placebo is 0.40 (40%), then the effect of the experimental treatment on the risk scale is often measured by 1 − T/P = 1 − 0.30/0.40 = 25%; that is, the experimental treatment yields a 25% reduction in the mortality risk. On the difference scale, the experimental treatment reduces the probability of death by 10% from 40% to 30%. For a time-to-event parameter, the treatment effect is often quantified using a hazard ratio, if this ratio is approximately constant over time. For notational convenience, let T/P denote a risk ratio that can be a relative risk or an odds ratio for a binary outcome variable and a hazard ratio for a time-to-event variable. For a binary outcome variable or time-toevent variable, statistical inference is often based on the natural logarithm of a risk ratio statistic, rather than the risk ratio itself,
because the former is better approximated by a Gaussian (or normal) distribution. On the log risk ratio scale, the treatment effect is defined as a difference of the log parameters, such as the log probability of death or the log hazard.
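As a small numerical illustration of these scales, the sketch below recomputes the mortality example used above (probability of death 0.30 with the experimental treatment and 0.40 with placebo); the function name and the use of Python are illustrative only.

```python
import math

def effect_measures(p_treatment, p_placebo):
    """Express a treatment effect on the three scales discussed above."""
    return {
        "risk_difference": p_treatment - p_placebo,                 # T - P
        "relative_risk_reduction": 1.0 - p_treatment / p_placebo,   # 1 - T/P
        "log_risk_ratio": math.log(p_treatment / p_placebo),        # log(T) - log(P)
    }

# Mortality example from the text: T = 0.30, P = 0.40
print(effect_measures(0.30, 0.40))
# risk difference -0.10, relative risk reduction 0.25, log risk ratio about -0.288
```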
4 NONINFERIORITY MARGIN
Determination of a noninferiority margin is probably the most difficult task in designing a noninferiority trial (26–32). For demonstration that the experimental treatment is superior to the selected control, the objective of the trial is to rule out a zero difference between the two treatments and thus reject the null hypothesis of no treatment difference. A superiority margin is selected mostly to size the trial, not for being ruled out; therefore, the superiority margin does not need to be prespecified to define the hypothesis. In contrast, for demonstration of noninferiority, a noninferiority margin usually needs to be specified to define the statistical hypothesis before commencement of the noninferiority trial. The statistical null hypothesis to reject is that the treatment difference is at least as large as the specified margin against the experimental treatment. The noninferiority margin is an acceptable threshold in the value of the targeted parameter of the response variable between the experimental treatment and the selected positive control under comparison. At the Drug Information Association Annual Meeting in 2000, Dr. Robert Temple introduced the idea that this margin should be a statistical margin or a clinical margin, whichever is smaller. The clinical margin is determined on the basis of clinical judgments that are mostly subjective. Selection of the statistical margin depends upon the objective of the noninferiority trial. Suppose that the true probability of death is 30% associated with the selected positive control and 40% with the placebo. Thus, the control reduces 10% in the probability of death. If the noninferiority inference is set out to demonstrate that the experimental treatment is efficacious or effective (i.e., superior to a placebo), then the probability of death with the experimental treatment must be shown to be smaller than 40%; that is, the noninferiority margin for the experimental treatment versus the control can be set
to −10%. If the objective is to show that the experimental treatment preserves at least a certain portion, say 50%, of the active control's effect, then the probability of death with the experimental treatment must be shown to be smaller than 35%. Thus, the noninferiority margin for achieving a 50% retention on the difference scale is −5%. For a 75% retention, the noninferiority margin on the difference scale is −2.5% [in magnitude, (1 − 0.75) × 10%]. These calculations can easily be articulated as follows. To obtain a noninferiority margin for the difference T − C, all that is needed is to examine the (P − C) axis between the control effect of 10% and the zero control effect. The margin magnitude needed for showing a 50% retention of (P − C) is the midpoint, 5%, between the 10% effect and zero effect. The margin defining a 75% retention of (P − C) is the one-quarter point above zero. Similar arguments can be made for the risk ratio. To generate a noninferiority margin for the risk ratio T/C, where the control C is the denominator, work on P/C, the risk ratio of placebo versus the control C, so that the two ratios to be compared have the same denominator. In the above example, P/C = 4/3 and the null effect of P versus C is P/C = 1. The noninferiority margin for the risk ratio T/C can then be generated by working on P/C between 4/3 and 1. For 50% retention of the control's effect on the relative risk, the noninferiority margin is 1.17, which is the midpoint between 4/3 and 1. For 75% retention, the noninferiority margin for T/C is 1.08, which is the one-quarter point above one. In general, the mathematical argument is as follows. To retain X% of the control's effect, we would need the effect of the experimental treatment relative to the placebo, had it been in the trial, to be greater than X times the control's effect. That is, an X% retention of the control's effect on the risk ratio amounts to showing (1 − T/P) > X(1 − C/P), or equivalently
T/C < X + (1 − X)(P/C),   (1)
which indeed is the weighted average of one and the ratio P/C.
For retention on the log risk ratio scale, the same arguments as those for the difference (P − C) can be made to construct the noninferiority margin for log(T/C) = log(T) − log(C). Thus, for 50% retention on the log risk ratio, the noninferiority margin for log(T/C) is the midpoint between log(4/3) and zero. By converting the log risk ratio back to the risk ratio, the noninferiority margin for T/C that retains 50% of the control's effect on the log risk ratio scale is the geometric mean of P/C = 4/3 and 1, about 1.15. For 75% retention on the log risk ratio scale, the noninferiority margin for T/C is (4/3)^(1/4), that is, 4/3 raised to the one-quarter power. For rendering a noninferiority margin on the risk ratio scale, the percent retention on the log risk ratio is convertible to that on the risk ratio and vice versa, given that the effect of the active control can be estimated (16). For instance, if the relative risk C/P of the control to the placebo is 0.75, then preservation of 50% of the control effect on the risk ratio scale is equivalent to preservation of 46% of the control effect on the log risk ratio. It is worth noting that, at any level of retention on the log risk ratio, the resulting noninferiority margin is always smaller than the margin for the same level of retention on the risk ratio scale. Thus, preservation of 50% of the control's effect on the log risk ratio results in a smaller margin than preservation of 50% on the risk ratio. The statistical margin derived using the concept of retention of the control effect cannot always properly characterize clinical indifference. For example, if the selected positive control is highly effective, with a relative risk C/P = 0.12, say, for a clinical adverse event (i.e., the control yields an 88% reduction of the risk of having the clinical event), then the statistical margin can be as large as 4.67 for a 50% preservation of the control's effect and 2.47 for an 80% retention, by setting X to 50% and 80%, respectively, on the right-hand side of inequality (1). With these margins, an experimental treatment that yields at least a one-and-a-half-fold increase in risk relative to the control could still be concluded to be not inferior to the control. Such a large statistical margin cannot possibly be adequate for asserting the clinical indifference that the experimental treatment is as effective as, or not much worse than, the positive control. For showing clinical indifference, a margin is required that actually defines clinical indifference.
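The worked margins above can be reproduced with a few lines of code. The sketch below uses the same numbers as the text (control event probability 0.30, placebo 0.40); the difference-scale margin follows the sign convention used above (a margin for C − T), the risk ratio margin follows inequality (1), and the log risk ratio margin is the corresponding power of P/C. The final lines anticipate the fixed-margin confidence interval check described in the next section, with a purely hypothetical confidence limit.

```python
import math

def margin_difference(p_control, p_placebo, retain):
    """Margin for C - T on the difference scale: -(1 - X) * (P - C).
    Noninferiority requires C - T to exceed this (negative) value."""
    return -(1.0 - retain) * (p_placebo - p_control)

def margin_risk_ratio(p_control, p_placebo, retain):
    """Margin for T/C from inequality (1): X + (1 - X) * (P/C)."""
    return retain + (1.0 - retain) * (p_placebo / p_control)

def margin_log_risk_ratio(p_control, p_placebo, retain):
    """Margin for T/C when retention is defined on the log risk ratio scale: (P/C)^(1 - X)."""
    return math.exp((1.0 - retain) * math.log(p_placebo / p_control))

for retain in (0.0, 0.50, 0.75):
    print(retain,
          round(margin_difference(0.30, 0.40, retain), 3),      # -0.10, -0.05, -0.025
          round(margin_risk_ratio(0.30, 0.40, retain), 3),      # ~1.333, 1.167, 1.083
          round(margin_log_risk_ratio(0.30, 0.40, retain), 3))  # ~1.333, 1.155, 1.075

# Fixed-margin check (see the next section): noninferiority on the risk ratio scale is
# concluded if the upper 95% confidence limit for T/C lies below the margin.
upper_ci_T_vs_C = 1.12   # hypothetical upper confidence limit from the noninferiority trial
print(upper_ci_T_vs_C < margin_risk_ratio(0.30, 0.40, 0.50))    # True in this illustration
```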
The margin determination discussed above is predicated on knowledge of the true value of the effect parameter at stake. In practice, the true value is rarely known, and thus the effect of the active control must be estimated from trial data to determine the noninferiority margin. If a placebo is present in a noninferiority trial, the effect of the active control may be better estimated from this trial. In the absence of a concurrent placebo, the estimate will have to come from the external historical trials. In either case, the bias and the variance of the estimate must be properly incorporated in the margin determination.

5 STATISTICAL TESTING FOR NONINFERIORITY

As mentioned earlier, the noninferiority margin must be selected and fixed in advance so that the noninferiority hypothesis to test is well defined before designing the trial. For example, if noninferiority is defined as retaining 50% of the active control's effect, one conventional approach employs the worst limit of a 95% confidence interval of the historical estimate of the control's effect (i.e., relative to placebo) as a conservative estimate of the control's effect in the noninferiority trial and then generates the statistical margin as described in section 4. Taking the smaller of the statistical margin and the clinical margin then determines the noninferiority margin. In most cases, use of a conservative estimate of the active-control effect derived from some kind of meta-analysis of historical trials to define the noninferiority margin is necessary because of the statistical uncertainty around the estimate and the unverifiable and yet often doubtful assumptions that must be made in making inferences from the noninferiority study. Once the noninferiority margin is determined, the widely used statistical method for noninferiority testing employs a 95% or higher confidence interval for the experimental treatment versus the selected active control from the noninferiority trial. If this
interval rules out the predefined margin, then noninferiority defined by the margin can be concluded. This is in contrast with the superiority testing that depends on the same confidence interval to rule out the null value of no treatment difference. The probability of type I error of falsely concluding noninferiority associated with this confidence interval method is no more than 2.5%, conditional on the estimated noninferiority margin. When a placebo is present in the noninferiority trial, some type of the noninferiority hypothesis, such as percent retention hypothesis, can be tested directly with a more statistically efficient test method (33). The test can be constructed by dividing a relevant sum of the estimate of relative effect of the experimental treatment to the control and the estimate of the control’s effect by the standard error of the sum. Both estimates are derived from the noninferiority trial. For example, the 25% retention test on the log risk ratio scale is constructed by dividing the sum of the estimate of log (T/C) of the noninferiority trial and 75% times the estimate of log(C/P) by the standard error of this sum. A P-value can then be generated from the test. A sufficiently small P-value can indicate that the experimental treatment retains more than 25% of the control’s effect. If the placebo is absent from the noninferiority trial, by the same kind of combination, this test arguably can still be constructed in the same way but the estimate of the active control’s effect has to come from the historical trials. However, this test method is controversial (20–24, 34–39) because it assumes no bias with the historical estimate of the control’s effect when applied to the noninferiority trial—that is, when the constancy assumption holds, which is almost always very much doubtful. There are no data that can verify this assumption. Thus, this test method constructed by incorporating the historical estimate of the active control’s effect is rarely useful in practice. 6 MEDICATION NONADHERENCE AND MISCLASSIFICATION/MEASUREMENT ERROR Interpretation of the results of randomized clinical trials is primarily according to the
intent-to-treat principle, based on which all randomized patients are analyzed as per the treatment to which they are randomized. This analysis is intended to avoid selection bias, which may confound with a valid clinical interpretation. In superiority trials, medication nonadherence generally biases the intent-to-treat analysis toward the null hypothesis of no treatment difference; thus, statistical tests for superiority in intent-totreat analyses tend to be conservative. In noninferiority trials, nonadherence may bias intent-to-treat analyses in either a conservative or nonconservative direction (40–42), and thus it may undermine the clinical interpretability. Misclassification or measurement error also may generate bias. On-treatment or per-protocol analyses include only patients who are adherent to the assigned study treatment and protocol. This analysis is intended to address the question of what the true causal effect of the experimental treatment would have been had all patients adhered to the assigned treatment. In some cases, the on-treatment analyses may apparently be able to account for nonadherence when it is correctly measured, but these analyses require the unverifiable assumption that there is no unmeasured confounding caused by the factors such as selection of the patients for analyses. Hence, nonadherence when related to study outcome can also bias on-treatment analyses in either a conservative or nonconservative direction. Medication nonadherence and misclassification or measurement error may generate bias in both intent-to-treat analyses and on-treatment analyses conservatively or nonconservatively in noninferiority trials. The amount of bias is generally not estimable. Therefore, with serious nonadherence or such errors, most often no valid clinical interpretation of noninferiority trials can be made. Ontreatment or per-protocol analysis is unlikely to be able to rescue the study. 7 TESTING SUPERIORITY AND NONINFERIORITY As already discussed, in an active-control trial, the study goal can be superiority or noninferiority. In order for noninferiority to be
entertained, the noninferiority margin must be fixed and prespecified in advance. The same 95% or higher confidence interval can be used to test both superiority and noninferiority simultaneously or in any sequential order (24–26), with the overall type I error rate associated with testing both objectives of no larger than a two-sided 5% level. The type I error rate for superiority and for noninferiority are each no larger than a two-sided 5% level. However, if the noninferiority margin is not prespecified, this confidence interval approach may still be problematic, particularly when the margin is influenced by the internal noninferiority trial data (43–48). Furthermore, medication noncompliance and misclassification or measurement error may still make the type I error rate for the prespecified noninferiority invalid. Therefore, to achieve noninferiority, the trial design requires the highest quality. To entertain testing for superiority and noninferiority, it is imperative to plan the study for noninferiority testing and the sample size to ensure sufficient power for both superiority and noninferiority, so defined (48). 8
CONCLUSION
It is quite clear that the effect of an experimental treatment should be evaluated, if at all possible, by conducting a "showing superiority" trial. Showing noninferiority to a selected active control can make it too difficult to provide statistical evidence for assessing the effect of the experimental treatment, particularly when a placebo cannot be used in the trial. Many factors determine the interpretability of a noninferiority trial that does not have a placebo arm. First, a noninferiority margin must be selected and fixed in advance when designing a noninferiority trial. The margin determination depends on the trial objective. Second, the critical assumptions of assay sensitivity and constancy and the judgment of "clinical indifference" also play key roles in the margin determination. Third, statistical uncertainty in the historical estimate of the active control's effect also needs to be properly incorporated in the margin determination. The historical trials must have assay sensitivity. In contrast with showing superiority, the noninferiority trial must have very
high quality in terms of medication adherence in order for the noninferiority trial to have assay sensitivity. Testing for superiority and testing for noninferiority with a prespecified margin can be simultaneously performed; however, from the design perspective, the focus should be on planning for ‘‘showing noninferiority.’’ 8.1.1 Disclaimer. The views presented in this article are not necessarily those of the U.S. Food and Drug Administration.
REFERENCES 1. World Medical Association Declaration of Helsinki. Recommendations guiding physicians in biomedical research involving human subjects. JAMA. 1997; 277: 925–926. 2. W. C. Blackwelder, Proving the null hypothesis in clinical trials. Control Clin Trials. 1982; 3: 345–353. 3. T. R. Fleming, Treatment evaluation in active control studies. Cancer Treat Rep. 1987; 71: 1061–1064. 4. R. Temple, Difficulties in evaluating positive control trials. In: Proceedings of the Biopharmaceutical Section of American Statistical Association. Alexandria, VA: American Statistical Association, 1987, pp. 1–7. 5. G. Pledger and D. B. Hall, control equivalence studies: do they address the efficacy issue? In: K. E. Peace (ed.), Statistical Issues in Drug Research and Development. New York: Marcel Dekker, New York, 1990, pp. 226–238. 6. R. Temple, Problems in interpreting active control equivalence trials. Account Res. 1996; 4: 267–275. 7. J. Rohmel, Therapeutic equivalence investigations: statistical considerations. Stat Med. 1998; 17: 1703–1714. 8. B, Jones, P, Jarvis, J. A. Lewis, and A. F. Ebbutt, Trials to assess equivalence: the importance of rigorous methods. BMJ. 1996; 313: 36–39. 9. A. F. Ebbutt and L. Frith, Practical issues in equivalence trials. Stat Med. 1998; 17: 1691–1701. 10. R, Temple and S. S. Ellenberg, Placebocontrolled trials and active-control trials in the evaluation of new treatments. Part 1: Ethical and scientific issues. Ann Intern Med. 2000; 133: 455–463.
11. S. S. Ellenberg and R. Temple, Placebocontrolled trials and active-control trials in the evaluation of new treatments. Part 2: Practical issues and specific cases. Ann Intern Med. 2000; 133: 464–470. 12. T. R. Fleming, Design and interpretation of equivalence trials. Am Heart J. 2000; 139: S171–S176. 13. A. L. Gould, Another view of active-controlled trials. Control Clin Trials. 1991; 12: 474–485. 14. W. C. Blackwelder, Showing a treatment is good because it is not bad: when does ‘‘noninferiority’’ imply effectiveness? Control Clin Trials. 2002; 23: 52–54. 15. R. B. D’Agostino, J. M. Massaro, and L. Sullivan, Non-inferiority trials: design concepts and issues—the encounters of academic consultants in statistics. Stat Med. 2003; 22: 169–186. 16. H. C. Bucher, G. H. Guyatt, L. E. Griffith, and S. D. Walter, The results of direct and indirect treatment comparisons in meta-analysis of randomized controlled trials. J Clin Epidemiol. 1997: 50: 683–691. 17. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline: E9 Statistical Principles for Clinical Trials. Current Step 4 version, February 5, 1998. Available at: http://www.ich.org/LOB/media/ MEDIA485.pdf 18. D. A. Follmann and M. A. Proschan, Valid inference in random effects meta-analysis. Biometrics. 1999; 55: 732–737. 19. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline: E10 Choice of Control Group and Related Issues in Clinical Trials. Current Step 4 version, July 20, 2000. Available at: http://www.ich.org/LOB/media/ MEDIA486.pdf 20. S. J. Wang, H. M. Hung, and Y. Tsong, Noninferiority analysis in active controlled clinical trials. In: S. E. Chow (ed.), Encyclopedia of Biopharmaceutical Statistics, 2nd ed. New York: Marcel Dekker, 2003, pp. 674–677. 21. Department of Health and Human Services, Food and Drug Administration [Docket No. 99D-3082]. International Conference on Harmonisation: Choice of control group in clinical trials (E10). Fed Regist. 1999; 64: 51767–51780. 22. V. Hasselblad and D. F. Kong, Statistical
methods for comparison to placebo in activecontrol trials. Drug Inf J. 2001; 35: 435–449. 23. S. J. Wang, H. M. Hung, Y. Tsong, Utility and pitfall of some statistical methods in active controlled clinical trials. Control Clin Trials. 2002; 23: 15–28. 24. H. M. Hung, S. J. Wang, Y, Tsong, J, Lawrence, and R. T. O’Neill, Some fundamental issues with non-inferiority testing in active controlled clinical trials. Stat Med. 2003; 22: 213–225. 25. H. M. Hung, S. J. Wang, and R. O’Neill, A regulatory perspective on choice of margin and statistical inference issue in non-inferiority trials. Biom J. 2005; 47: 28–36. 26. Committee for Medicinal Products for Human Use (CHMP). Guideline on the choice of the non-inferiority margin. Stat Med. 2006; 25: 1628–1638. 27. D. Hauschke, Choice of delta: a special case. Drug Inf J. 2001; 35: 875–879. 28. T. H. Ng, Choice of delta in equivalence testing. Drug Inf J. 2001; 35: 1517–1527. 29. B. Wiens, Choosing an equivalence limit for non-inferiority or equivalence studies. Control Clin Trials. 2002; 23: 2–14. 30. L. L. Laster and M. F. Johnson, Noninferiority trials: ‘‘the at least as good as’’ criterion. Stat Med. 2003; 22: 187–200. 31. L. L. Laster, M. F. Johnson, and M. L. Kotler, Non-inferiority trials: the ‘‘at least as good as’’ criterion with dichotomous data. Stat Med. 2006; 25: 1115–1130. 32. S. C. Chow and J. Shao, On non-inferiority margin and statistical tests in active control trial. Stat Med. 2006; 25: 1101–1113. 33. D. Hauschke and I. Pigeot, Establishing efficacy of a new experimental treatment in the ‘gold standard’ design (with discussions). Biom J. 2005; 47: 782–798. 34. E. B. Holmgren, Establishing equivalence by showing that a prespecified percentage of the effect of the active control over placebo is maintained. J Biopharm Stat. 1999; 9: 651–659. 35. R. Simon, Bayesian design and analysis of active control clinical trials. Biometrics. 1999; 55: 484–487. 36. S. J. Wang and H. M. Hung, Assessment of treatment efficacy in non-inferiority trials. Control Clin Trials. 2003; 24: 147–155. 37. M, Rothmann, N, Li, G, Chen, G. Y. Chi, R. T. Temple, and H. H. Tsou, Non-inferiority methods for mortality trials. Stat Med. 2003; 22: 239–264.
38. S. M. Snapinn, Alternatives for discounting in the analysis of noninferiority trials. J Biopharm Stat. 2004; 14: 263–273. 39. Y. Tsong, S. J. Wang, H. M. Hung, and L. Cui, Statistical issues on objective, design and analysis of non-inferiority active controlled clinical trial. J Biopharm Stat. 2003; 13: 29–42. 40. M. M. Sanchez and X. Chen, Choosing the analysis population in non-inferiority studies: per protocol or intent-to-treat. Stat Med. 2006; 25: 1169–1181. 41. D. Sheng and M. Y. Kim, The effects of non-compliance on intent-to-treat analysis of equivalence trials. Stat Med. 2006; 25: 1183–1199. 42. E. Brittain and D. Lin, A comparison of intent-to-treat and per-protocol results in antibiotic non-inferiority trials. Stat Med. 2005; 24: 1–10. 43. T. Morikawa and M. Yoshida, A useful testing strategy in phase III trials: combined test of superiority and test of equivalence. J Biopharm Stat. 1995; 5: 297–306. 44. C. W. Dunnett and A. C. Tamhane, Multiple testing to establish superiority/equivalence of a new treatment compared with kappa standard treatments. Stat Med. 1997; 16: 2489–2506.
45. S. J. Wang, H. M. Hung, Y. Tsong, L. Cui, and W. Nuri, Changing the study objective in clinical trials. In: Proceedings of the Biopharmaceutical Section of American Statistical Association. Alexandria, VA: American Statistical Association, 1997, pp. 64–69. 46. P. Bauer and M. Kieser, A unifying approach for confidence intervals and testing of equivalence and difference. Biometrika. 1996; 83: 934–937. 47. H. M. Hung and S. J. Wang, Multiple testing of non-inferiority hypotheses in active controlled trials. J Biopharm Stat. 2004; 14: 327–335. 48. European Agency for the Evaluation of Medicinal Products, Human Medicines Evaluation Unit, Committee for Proprietary Medicinal Products (CPMP). Points to Consider on Switching between Superiority and Non-inferiority. CPMP/EWP/482/99. July 27, 2000. Available at: http://www.emea.europa.eu/pdfs/human/ewp/048299en.pdf
CROSS-REFERENCES
Active-Controlled Trial
Non-inferiority Margin
Non-inferiority Analysis
NONPARAMETRIC METHODS
DOUGLAS A. WOLFE
Ohio State University, Columbus, OH, USA

Many of the earliest statistical procedures proposed and studied rely on the underlying assumption of distributional normality. How well these procedures operate outside the confines of this normality constraint varies from setting to setting. Although there were a few isolated attempts to create statistical procedures that were valid under less restrictive sets of assumptions that did not include normality, such as the early introduction of the essence of the sign test procedure by Arbuthnott (2) in 1710, and the rank correlation procedure considered by Spearman (51) in 1904, it is generally agreed that the systematic development of the field of nonparametric statistical inference traces its roots to the fundamental papers of Friedman (18), Kendall (31), Kendall & Babington Smith (33), Mann & Whitney (38), and Wilcoxon (58). The earliest work in nonparametric statistics concentrated heavily on the development of hypothesis testing that would be valid over large classes of probability distributions—usually the entire class of continuous distributions, but sometimes with the additional assumption of distributional symmetry. Most of this early work was intuitive by nature and based on the principle of ranking to de-emphasize the effect of any possible outliers on the conclusions. Point and interval estimation expanded out of this hypothesis testing framework as a direct result of centering of the test statistics and test inversion, respectively. Most distribution-free test procedures (and associated confidence intervals) are based on one or more of the following three fundamental properties.

Result 1. Let Z1, . . . , Zn be a random sample from some probability distribution and let A be a subset of the common domain for the Zs. If I(t) represents the indicator function for this subset A, then the random variable V = Σ_{i=1}^n I(Zi) has a binomial distribution with parameters n and p = Pr(Zi ∈ A).

Result 2. Let Z1, . . . , Zn be a random sample from a continuous distribution with cumulative distribution function (cdf) F(·), and let Ri denote the rank (from least to greatest) of Zi among the n Zs, for i = 1, . . . , n. Then the vector of ranks R = (R1, . . . , Rn) has a joint distribution that is uniform over the set of all permutations of the integers (1, . . . , n).

Result 3. Let Z be a random variable with a probability distribution that is symmetric about the point θ. Define the indicator function ψ(·) by ψ(t) = 1 if t > 0, and ψ(t) = 0 if t ≤ 0. Then the random variables |Z − θ| and ψ(Z − θ) are independent.

Statistics based solely on Result 1 are referred to as counting statistics, those based solely on Result 2 are commonly known as ranking statistics, and those based on an appropriate combination of all three results are called signed-rank statistics. Over the years of development in the field, distribution-free procedures have certainly become more sophisticated, both in the problems they address and in their complexity. However, the underlying premise behind almost all such hypothesis tests continues to rest with these three basic results or with modifications thereof. Much of the early work in distribution-free hypothesis tests followed the general approach of mimicking a standard normal theory procedure for a statistical problem by replacing the sample values with some combination of rank or counting statistics. The first nonparametric test statistics looked quite similar in form to their classical normal theory counterparts. However, more recent advances in nonparametric statistics have been less tied to previously
developed normal theory structure and, in fact, there have been a number of settings where nonparametric procedures were the first to be developed, and classical procedures followed a few years later. It is the intent of this article to provide some brief overview of nonparametric statistics. However, the field has grown over the years to such a size that one must rely on standard textbooks in the area for a truly complete picture. The very first such textbooks in nonparametric statistics were the pioneering works of Siegel (49) and Fraser (17), both arriving on the scene in the infancy of the field. Walsh (55–57) published a three-volume handbook covering those nonparametric procedures available at the time. Other texts and reference books have added to the literature of nonparametric statistics over the years, including the applications-oriented books by Bradley (3), Conover (5), Daniel (10), Gibbons (21), Hollander & Wolfe (27), and Marascuilo & McSweeney (39). The text by Lehmann (36) occupies an intermediate place in the literature. It has a general application orientation, but a considerable amount of the basic underlying theory of some of the procedures is also presented in a substantial appendix. Textbooks dealing primarily with the theory of rank tests and associated point estimators and confidence intervals have been published by Gibbons (20), Hájek (22), Hájek & Šidák (23), Hettmansperger (25), Noether (44), Pratt & Gibbons (46), and Randles & Wolfe (47). The monograph by Kendall (32) covers the specialized topic of rank correlation methods. These resources vary on the extensiveness of their bibliographies, but it is safe to say that the vast majority of published literature in the field of nonparametric statistics is cited in at least one of these volumes. One of the necessities in the application of distribution-free test procedures and confidence intervals is the availability of the exact null distributions of the associated test statistics. Extensive tables of many of these null distributions are available in some of the applications-oriented texts mentioned previously. In addition, recent software developments have made it a good deal easier both to compute the appropriate test statistics and to obtain the associated P values
for many of these test procedures. Of particular note in this regard are the Minitab and StatXact software packages, for both their rather complete coverage of the basic nonparametric procedures and their ability to circumvent the need for exact null distribution tables by providing the associated exact or approximate P values for many of the test procedures. StatXact also has the option of actually generating the required exact null distributions for some of the better known test statistics, including the appropriate modifications necessary in the case of tied observations. We first turn our attention to brief descriptions of the most commonly used nonparametric procedures in standard statistical settings involving one, two, or more samples, including one- and two-way analysis of variance and correlation. In each case, the emphasis will be on the description of the problem and a particular standard approach to its solution, rather than on attempting to cover the myriad of different nonparametric procedures that are commonly available for the problem. Finally, we will discuss briefly a few nonstandard topics where the development of nonparametric methods has been particularly motivated by the need to analyze medical and health sciences data. Included in these topics will be censored data and survival analysis, as well as proportional hazards models, counting processes, and bootstrap methods.
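As a purely illustrative aside (added here, not drawn from the sources cited above), an exact null distribution of this kind can be generated by direct enumeration when n is small. The minimal Python sketch below does so for the Wilcoxon signed-rank statistic T+ that is defined in the one-sample section that follows; the sample size and observed value are hypothetical.

```python
from itertools import product
from collections import Counter

def exact_signed_rank_null(n):
    """Exact null distribution of the Wilcoxon signed-rank statistic T+
    for sample size n (no ties): under H0, each rank 1..n is attached to
    a positive difference with probability 1/2, independently."""
    counts = Counter()
    for signs in product((0, 1), repeat=n):
        t_plus = sum(rank * s for rank, s in zip(range(1, n + 1), signs))
        counts[t_plus] += 1
    total = 2 ** n
    return {t: c / total for t, c in sorted(counts.items())}

# Hypothetical use: exact upper-tail P value for an observed T+ = 19 when n = 7
null_dist = exact_signed_rank_null(7)
p_upper = sum(prob for t, prob in null_dist.items() if t >= 19)
print(p_upper)
```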
1 ONE-SAMPLE LOCATION PROBLEM 1.1 Continuity Assumption Only Let Z1 , . . . , Zn be a random sample arising from an underlying probability distribution that is continuous with cdf F(·) and median θ . Here the primary interest is in inference about θ . 1.1.1 Test Procedure. For this setting, we are interested in testing the null hypothesis that θ = θ 0 , where θ 0 is some preset value appropriate for the problem. If no additional assumptions are reasonable about the form of the underlying F, the most commonly
used inference procedures are those associated with the sign statistic B = [number of sample Zs that exceed θ0]. The properties of B follow from the basic counting Result 1 with the set A = (θ0, ∞). In particular, B has a binomial distribution with number of trials n and success probability p = Pr(Z1 > θ0). When the null hypothesis is true, we have p = 1/2 (since θ0 is then the median of the underlying distribution) and the null distribution of B does not depend on the form of F. The associated level α sign procedure for testing H0 vs. the alternative H1: θ > θ0 is to reject H0 if the observed value of B exceeds bα, the upper αth percentile for the null distribution of B, namely, the binomial distribution with parameters n and p = 1/2. The appropriate tests for the other directional alternatives θ < θ0 and θ ≠ θ0 rely on the fact that the binomial distribution with n trials and p = 1/2 is symmetric about its mean n/2. 1.1.2 Point Estimation and Confidence Intervals/Bounds. Natural nonparametric confidence intervals and confidence bounds for θ are associated with these sign test procedures through the common process of inverting the appropriate hypothesis tests. These intervals and bounds are based on the ordered sample observations Z(1) ≤ Z(2) ≤ · · · ≤ Z(n). The 100(1 − α)% confidence interval for θ associated in this manner with the level α two-sided sign test is given by (Z(n+1−bα/2), Z(bα/2)), where bα/2 is again the upper (α/2)th percentile for the binomial distribution with parameters n and p = 1/2. The corresponding 100(1 − α)% lower and upper confidence bounds for θ (obtained by inverting the appropriate one-sided sign tests) are given by Z(n+1−bα) and Z(bα), respectively. The Hodges–Lehmann (26) point estimator of θ associated with the sign test is θ̃ = median {Z1, . . . , Zn}. 1.2 Continuity and Symmetry Assumption Let Z1, . . . , Zn be a random sample from an underlying probability distribution that is continuous and symmetric about its median θ. Once again the primary interest is in inference about θ.
1.2.1 Test Procedure. We remain interested in testing the null hypothesis that θ = θ0. However, the additional symmetry assumption now enables us to provide generally more powerful test procedures. For this setting, the most commonly used inference procedures are those associated with the Wilcoxon signed-rank test statistic (58),

T+ = Σ_{i=1}^n Ri ψi,

where ψi = 1, 0 as Zi >, < θ0, and Ri is the rank of |Zi − θ0| among |Z1 − θ0|, . . . , |Zn − θ0|. Thus, the Wilcoxon signed-rank statistic corresponds to the sum of the |Z − θ0| ranks for those Zs that exceed the hypothesized median value θ0. [Since we have a continuous underlying distribution, the probability is zero that there are ties among the absolute values of the (Zi − θ0)s. Likewise, the probability is zero that any of the Zi s actually equals θ0. However, these events may occur in actual data sets. In such an event, it is standard practice to discard the Zi s that equal θ0 and reduce n accordingly. Ties among the absolute values of the (Zi − θ0)s are generally broken by assigning average ranks to each of the absolute differences within a tied group.] Properties of T+ under H0: θ = θ0 derive directly from Result 3, which yields the independence of the ranks of the |Zi − θ0|s and the ψi s, and Result 2, which implies that the ranks of the |Zi − θ0|s are uniformly distributed over the set of permutations of the integers (1, . . . , n) under H0. The associated null distribution of T+ does not depend on the form of the underlying F(·) and has been extensively tabled (see, for example, (27) and (59)). The associated level α signed-rank procedure for testing H0 vs. the alternative H1: θ > θ0 is to reject H0 if the observed value of T+ exceeds tα, the upper αth percentile for the null distribution of T+. The appropriate tests for the other directional alternatives θ < θ0 and θ ≠ θ0 rely on the fact that the null distribution of T+ is symmetric about its mean n(n + 1)/4. 1.2.2 Point Estimation and Confidence Intervals/Bounds. Once again, natural confi-
dence intervals and confidence bounds for θ are associated with these signed-rank procedures through inversion of the appropriate hypothesis tests. These intervals and bounds are based on the ordered values of the M = n(n + 1)/2 Walsh averages of the form W ij = (Zi + Zj )/2, for 1 ≤ i ≤ j ≤ n. Letting W (1) ≤ · · · ≤ W (M) denote these ordered Walsh averages, the 100(1 − α)% confidence interval for θ associated with the level α two-sided signed-rank test is given by(W(M+1−tα/2 ) , W(tα/2 ) ), where once again tα/2 is the upper (α/2)th percentile for the null distribution of T + . The corresponding 100(1 − α)% lower and upper confidence bounds for θ (obtained by inverting the appropriate one-sided signed-rank tests) are given by W(M+1−tα ) and W(tα ) , respectively. The Hodges–Lehmann (26) point estimator of θ associated with the signed-rank test is θˆ = median {Wij , 1 ≤ i ≤ j ≤ n}. We note that both the sign and signedrank inference procedures can be applied to paired replicates data (X i , Y i ), where X i represents a pretreatment measurement on a subject and Y i represents a posttreatment measurement on the same subject, and we collect such paired data from i = 1, . . . , n independent subjects. The appropriate sign or signed-rank procedures are then applied to the post-minus-pre differences Zi = Y i − X i , i = 1, . . . , n.
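As a concrete illustration of the one-sample procedures just described (added here for exposition and not part of the original entry), the following Python sketch applies the sign test, the Wilcoxon signed-rank test, and the Hodges–Lehmann estimator based on Walsh averages to hypothetical paired differences; it assumes NumPy and SciPy are available.

```python
import numpy as np
from itertools import combinations_with_replacement
from scipy import stats

z = np.array([1.8, -0.4, 2.6, 0.9, 1.1, 3.0, 0.3, 1.7])  # hypothetical post-minus-pre differences
theta0 = 0.0

# Sign test: B = number of Zs exceeding theta0, referred to the Binomial(n, 1/2) null distribution
b = int(np.sum(z > theta0))
p_sign = stats.binomtest(b, n=len(z), p=0.5, alternative="greater").pvalue

# Wilcoxon signed-rank test of H0: theta = theta0 against H1: theta > theta0
t_stat, p_signed_rank = stats.wilcoxon(z - theta0, alternative="greater")

# Hodges-Lehmann estimator: median of the n(n + 1)/2 Walsh averages (Zi + Zj)/2, i <= j
walsh = [(z[i] + z[j]) / 2 for i, j in combinations_with_replacement(range(len(z)), 2)]
theta_hat = float(np.median(walsh))

print(p_sign, p_signed_rank, theta_hat)
```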
2 TWO-SAMPLE LOCATION PROBLEM
Let X1, . . . , Xm and Y1, . . . , Yn be independent random samples from the continuous probability distributions with cdfs F(·) and G(·), respectively. We consider here the case where G(y) = F(y − Δ), with −∞ < Δ < ∞; that is, the X and Y distributions differ only by a possible location shift Δ, and we are interested in inference about Δ. 2.0.3 Test Procedure. For this setting, the appropriate null hypothesis is that Δ = Δ0, where Δ0 is some preset value (often zero) of interest for the shift. The most commonly used nonparametric inference procedures for this setting are those associated with the rank sum version of the Wilcoxon–Mann–
Whitney statistic (38, 58),

W = Σ_{j=1}^n Rj,

where Rj is the rank of Yj among the combined sample of N = (m + n) observations X1, . . . , Xm, Y1, . . . , Yn. (Once again, ties among the Xs and/or Ys are broken by assigning average ranks to each of the observations within a tied group.) Properties of W under H0: Δ = Δ0 (corresponding to no differences between the X and Y probability distributions) follow directly from the basic ranking Result 2, which implies that the joint ranks of X1, . . . , Xm, Y1, . . . , Yn are uniformly distributed over the set of permutations of the integers (1, . . . , N) under H0. The associated null distribution of W does not depend on the form of the common (under H0) underlying distribution F(·) and has been extensively tabled (see, for example, (27) and (59)). The associated level α rank sum procedure for testing H0 vs. the alternative H1: Δ > Δ0 is to reject H0 if the observed value of W exceeds wα, the upper αth percentile for the null distribution of W. The appropriate tests for the other directional alternatives Δ < Δ0 and Δ ≠ Δ0 rely on the fact that the null distribution of W is symmetric about its mean n(m + n + 1)/2. 2.0.4 Point Estimation and Confidence Intervals/Bounds. As in the one-sample setting, natural confidence intervals and bounds for Δ are associated with these rank sum procedures through inversion of the appropriate hypothesis tests. These intervals and bounds are based on the ordered values of the mn differences Uij = Yj − Xi, i = 1, . . . , m, j = 1, . . . , n. Letting U(1) ≤ · · · ≤ U(mn) denote these ordered differences, the 100(1 − α)% confidence interval for Δ associated with the level α two-sided rank sum test is given by (U({[n(2m + n + 1) + 2]/2}−wα/2), U(wα/2 − [n(n + 1)/2])), where once again wα/2 is the upper (α/2)th percentile for the null distribution of W. The corresponding 100(1 − α)% lower and upper confidence bounds for Δ (obtained by inverting the appropriate one-sided rank sum tests) are given by U({[n(2m+n+1)+2]/2}−wα) and
U(wα − [n(n + 1)/2]), respectively. The Hodges–Lehmann (26) point estimator of Δ associated with the rank sum test is Δ̂ = median {Uij, i = 1, . . . , m, j = 1, . . . , n}.
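The two-sample procedures can be sketched in the same illustrative way (again with hypothetical data and assuming SciPy is available); note that SciPy reports the Mann–Whitney U form of the statistic, which differs from the rank sum W only by the constant n(n + 1)/2.

```python
import numpy as np
from scipy import stats

x = np.array([5.2, 4.8, 6.1, 5.5, 4.9, 5.7])   # hypothetical control sample (m = 6)
y = np.array([6.3, 5.9, 7.0, 6.4, 6.8])        # hypothetical treatment sample (n = 5)

# Wilcoxon-Mann-Whitney test of H0: Delta = 0 against a two-sided alternative
u_stat, p_value = stats.mannwhitneyu(y, x, alternative="two-sided")

# Hodges-Lehmann estimate of the shift: median of all mn differences Yj - Xi
differences = np.subtract.outer(y, x).ravel()
delta_hat = float(np.median(differences))

print(u_stat, p_value, delta_hat)
```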
3 OTHER TWO-SAMPLE PROBLEMS
The possibility of differences in location between the X and Y distributions is certainly the most common problem of interest in the two-sample setting. However, there are circumstances where differences in scale are of primary concern, as well as situations where it is important to detect differences of any kind between the X and Y distributions. For discussion on nonparametric two-sample procedures designed for scale differences, see Wilcoxon-type scale tests. The development of nonparametric procedures designed to be effective against any differences between the X and Y distributions was initiated by the pioneering work of Kolmogorov (34) and Smirnov (50). These papers have inspired a substantial body of research on such omnibus two-sample procedures. 4 ONE-WAY ANALYSIS OF VARIANCE: k ≥ 3 POPULATIONS This is a direct extension of the two-sample location problem. The data now represent k mutually independent random samples of observations from continuous probability distributions with cdfs F 1 (x) = F(x − τ 1 ), F 2 (x) = F(x − τ 2 ), . . . , F k (x) = F(x − τ k ), where F(·) is the cdf for a continuous population with median θ and τ 1 , . . . , τ k represent the additive effects corresponding to belonging to population 1, . . . , k, respectively. Here, our interest is in possible differences in the population effects τ 1 , . . . , τ k . 4.0.5 Test Procedures. For the one-way analysis of variance setting, we are interested in testing the null hypothesis H0 : [τ 1 = · · · = τ k ], corresponding to no differences in the medians of the k populations. For this setting, the most commonly used test procedures correspond to appropriate extensions of the Mann–Whitney–Wilcoxon joint ranking scheme as specifically directed toward the particular alternative of interest. For testing
the null H0 vs. the standard class of general alternatives H1 : (not all τ i s equal), the Kruskal–Wallis (35) test is the most popular procedure. For one-sided ordered alternatives of the form H2 : (τ 1 ≤ τ 2 ≤ · · · ≤ τ k , with at least one strict inequality), the appropriate extension is that proposed independently by Jonckheere (28) and Terpstra (54). Finally, for umbrella alternatives H3 : (τ 1 ≤ τ 2 ≤ · · · ≤ τ q−1 ≤ τ q ≥ τ q+1 ≥ · · · ≥ τ k , with at least one strict inequality), with either the peak of the umbrella, q, known a priori or estimated from the data, the standard test procedures are those proposed by Mack & Wolfe (37).
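For illustration only (hypothetical data, SciPy assumed available), the sketch below computes the Kruskal–Wallis test and the Jonckheere–Terpstra statistic J for an ordered alternative; the reference distribution of J, which would come from tables or a large-sample normal approximation, is not reproduced here.

```python
import numpy as np
from scipy import stats

# Hypothetical responses from k = 3 populations, ordered by suspected increasing effect
g1 = [27, 31, 29, 25, 30]
g2 = [32, 36, 33, 35, 34]
g3 = [38, 37, 40, 41, 39]

# Kruskal-Wallis test of H0: tau_1 = tau_2 = tau_3 against general alternatives
h_stat, p_value = stats.kruskal(g1, g2, g3)

# Jonckheere-Terpstra statistic: for each ordered pair of groups (i < j), count the
# observation pairs with the group-i value smaller than the group-j value
groups = [g1, g2, g3]
J = sum(int(np.sum(np.subtract.outer(groups[j], groups[i]) > 0))
        for i in range(len(groups)) for j in range(i + 1, len(groups)))

print(h_stat, p_value, J)
```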
4.0.6 Multiple Comparisons and Contrast Estimation. After rejection of H0: (τ1 = · · · = τk) with an appropriate test procedure, one is most often interested in deciding which of the populations are different and then in estimating the magnitudes of these differences. This leads to the use of multiple comparison procedures, based either on pairwise or joint rankings of the observations. With pairwise rankings, where two-sample ranks are used to compare separately the sample data for each of the k(k − 1)/2 pairs of populations, the most commonly used multiple comparison procedures are those considered by Dwass (12), Steel (53), and Critchlow & Fligner (7) for two-sided all-treatment differences, and by Hayter & Stone (24) for one-sided all-treatment differences. The corresponding two-sided all-treatment multiple comparison procedure based on joint rankings, where the sample data from all k populations are ranked jointly, has been studied by Nemenyi (43) and Damico & Wolfe (8), while the joint rankings multiple comparison procedure for one-sided treatments vs. control decisions can be found in (43) and (9). Point estimation of any contrasts in the τ s (that is, any linear combination β = Σ_{i=1}^k ai τi, with Σ_{i=1}^k ai = 0) is discussed in Spjøtvoll (52). Simultaneous two-sided confidence intervals for all simple contrasts of the form τj − τi have been developed by Critchlow & Fligner (7), while the corresponding simultaneous one-sided confidence bounds were studied by Hayter & Stone (24).
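The specialized pairwise-ranking procedures cited above are not reproduced here; as a simpler, hedged stand-in (not the Dwass, Steel, or Critchlow & Fligner methods themselves), the sketch below carries out all pairwise two-sided rank sum comparisons with a Bonferroni adjustment on the same hypothetical data.

```python
from itertools import combinations
from scipy import stats

samples = {"1": [27, 31, 29, 25, 30], "2": [32, 36, 33, 35, 34], "3": [38, 37, 40, 41, 39]}

pairs = list(combinations(samples, 2))
for a, b in pairs:
    _, p = stats.mannwhitneyu(samples[a], samples[b], alternative="two-sided")
    # Bonferroni adjustment across the k(k - 1)/2 pairwise comparisons
    print(a, "vs.", b, "adjusted P =", min(1.0, p * len(pairs)))
```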
5 TWO-WAY ANALYSIS OF VARIANCE
We consider here the standard two-way layout setting, where the data consist of one observation on each combination of k treatments and n blocks. The observation in the ith block and jth treatment combination, denoted by Xij, arises from a continuous probability distribution with cdf F(x − βi − τj), where F(·) is the cdf for a continuous distribution with median θ, for i = 1, . . . , n; j = 1, . . . , k. Moreover, the nk Xs are assumed to be mutually independent random variables. (This is known as the additive two-way layout model.) Here, our interest is in possible differences among the treatment effects τ1, . . . , τk. 5.0.7 Test Procedures. For the two-way layout with one observation per cell, we are interested in testing the null hypothesis H0: (τ1 = · · · = τk), corresponding to no differences in the k treatment effects. For this setting, the most commonly used procedures correspond to appropriate extensions of the sign test procedure for paired replicates data as specifically directed toward a particular alternative of interest. For testing the null H0 vs. the standard class of general alternatives H1: (not all τi s equal), the Friedman (18) test procedure is based on within-blocks ranks of the observations across treatment levels. For ordered alternatives of the form H2: (τ1 ≤ τ2 ≤ · · · ≤ τk, with at least one strict inequality), the appropriate test based on within-blocks ranks is that given by Page (45). 5.0.8 Multiple Comparisons and Contrast Estimation. After rejection of H0: (τ1 = · · · = τk) with an appropriate test procedure, one can use either the multiple comparison procedure studied by Nemenyi (43) and McDonald & Thompson (40) to reach the k(k − 1)/2 all-treatments two-sided decisions of the form τi = τj vs. τi ≠ τj, or the corresponding treatments vs. control multiple comparison procedure due to Nemenyi (43), Wilcoxon & Wilcox (60), and Miller (41) to reach the k − 1 treatments vs. control one-sided decisions of the form τj > τcontrol. A method for point estimation of a contrast in the τ s can be found in Doksum (11).
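As an added illustration of the two-way layout procedures (hypothetical data, SciPy assumed available), the sketch below computes the Friedman statistic and Page's ordered-alternative statistic L from within-block ranks; the null distribution of L is again left to tables or a normal approximation.

```python
import numpy as np
from scipy import stats

# Hypothetical two-way layout: rows are n = 4 blocks, columns are k = 3 treatments
x = np.array([[8.2, 9.1, 10.4],
              [7.6, 8.8,  9.9],
              [6.9, 7.4,  8.8],
              [8.0, 8.5, 10.1]])

# Friedman test of H0: tau_1 = tau_2 = tau_3 based on within-block ranks
chi2_stat, p_value = stats.friedmanchisquare(x[:, 0], x[:, 1], x[:, 2])

# Page statistic L = sum_j j * R_j, where R_j is the sum of the within-block ranks of treatment j
ranks = np.apply_along_axis(stats.rankdata, 1, x)      # rank each block separately
L = sum((j + 1) * ranks[:, j].sum() for j in range(x.shape[1]))

print(chi2_stat, p_value, L)
```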
6 INDEPENDENCE Let (X 1 , Y 1 ), . . . , (X n , Y n ) be a random sample from a continuous bivariate probability distribution. The most common distributionfree tests for the independence of the X and Y variables are those considered by Kendall (31) and Spearman (51). The null distribution properties of both of these test procedures are based on the basic Result 2 and the fact that the ranks of the Xs and the separate ranks of the Ys are themselves independent under the independence of X and Y. Approximate 100(1 − α)% confidence intervals and bounds for the Kendall correlation coefficient γ = {2Pr[(Y 2 − Y 1 )(X 2 − X 1 ) > 0] − 1} have been provided by Noether (44), Fligner & Rust (16), and Samara & Randles (48). 7 CENSORED DATA One of the areas where nonparametric methods have played a major role in the analysis of medical and health sciences data in particular has been that of survival analysis of censored lifetime data. We discuss the basic concepts involved in dealing with censored data in the one-sample setting and then provide brief descriptions of the most important nonparametric methods available for other selected settings. There are times in the collection of data that we are prevented from actually observing the values of all of the observations. Such censoring leading to only partial information about the random variables of interest can be a direct result of the statistical design governing our data collection or it can be purely a consequence of additional random mechanisms affecting our data collection process. Considerable attention in the literature has been devoted to three particular types of censoring, which we now describe. The first of these, known as type I censoring, corresponds to a fixed (preset) censoring time, tc , at which the study is to come to an end. In this setting, instead of observing the random variables Z1 , . . . , Zn of interest, we are only able to observe the truncated variables W i = min(Zi , tc ), i = 1, . . . , n. Type I censoring corresponds to medical and health sciences studies
conducted for a fixed period of time after initiation and no entry to the study once begun. A second type of censoring, known as type II censoring, corresponds to collecting survival (lifetime) data until a fixed number, say r < n, of the subjects have failed. Once this has occurred, the study is terminated. In this setting, we only observe the r smallest lifetimes (i.e. the first r order statistics) among Z1, . . . , Zn. All we know about the remaining n − r unobserved lifetimes is that they are at least as long as the final observed failure. A third type of censoring, called random censoring, is probably the most common and the most complicated type of censoring associated with medical and health sciences data. In this setting, not only are the lifetimes random but the censoring times are also random. In clinical trials, for example, such random censoring could correspond to a study where not all subjects enter the study at the same time, but the study ends at one time, or to subjects leaving a study because they moved from the area or because of serious side-effects leading to discontinuation of the treatment. Probably the earliest nonparametric approach to dealing directly with censored lifetime data was provided by Kaplan & Meier (30) in their development of the product limit estimator for the survival function S(t) = 1 − G(t), −∞ < t < ∞. The first two-sample rank procedure designed specifically to test hypotheses with censored data was provided by Gehan (19). He proposed a direct extension of the Mann–Whitney form of the Mann–Whitney–Wilcoxon test statistic that provided a natural way to handle censored values occurring in either the X and/or Y sample data. A generalization of the Gehan two-sample test to the k-sample (k ≥ 3) setting has been provided by Breslow (4). For additional discussion of such rank-based procedures for censored data, the reader is referred to (42).
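To make the product-limit idea concrete, the following Python sketch (an illustration added here, with hypothetical censored data) computes the Kaplan–Meier estimate directly from its definition.

```python
import numpy as np

# Hypothetical right-censored survival times (months); event = 1 if the failure was
# observed, event = 0 if the observation was censored
time = np.array([3, 5, 5, 8, 12, 16, 16, 20, 24, 30])
event = np.array([1, 1, 0, 1, 1, 0, 1, 0, 1, 0])

def kaplan_meier(time, event):
    """Product-limit estimate of S(t) evaluated at each distinct observed event time."""
    estimates, surv = [], 1.0
    for t in np.unique(time[event == 1]):
        at_risk = np.sum(time >= t)                    # subjects still under observation at t
        deaths = np.sum((time == t) & (event == 1))    # observed failures at t
        surv *= 1.0 - deaths / at_risk                 # product-limit update
        estimates.append((float(t), float(surv)))
    return estimates

for t, s in kaplan_meier(time, event):
    print(f"S({t:g}) = {s:.3f}")
```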
8 OTHER IMPORTANT NONPARAMETRIC APPROACHES

Brief mention must also be made here of three other major initiatives in the development of nonparametric approaches to the
analysis of medical and health sciences data. Paramount among such developments is that of the proportional hazards model initially proposed by Cox (6). Seldom has any single paper had such an impact on further research in the field. Kalbfleisch & Prentice (29) provide a nice discussion of the analysis of survival data by the use of the Cox proportional hazards model and extensions thereof. A second important thrust of more recent vintage has been the application of counting process methods in survival analysis. For a good discourse on this important methodology, the reader is referred to (1). Finally, we need to mention the advent of the bootstrap as an important tool in the analysis of medical data. The survey articles (14) and (15) serve very well as introductions to this important topic, and its application to the analysis of censored data is discussed in (13). REFERENCES 1. Andersen, P. K., Borgan, Ø., Gill, R. D. & Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer-Verlag, New York. 2. Arbuthnott, J. (1710). An argument for divine providence, taken from the constant regularity observed in the births of both sexes, Philosophical Transaction of the Royal Society of London 27, 186–190. 3. Bradley, J. V. (1968). Distribution-Free Statistical Tests. Prentice-Hall, Englewood Cliffs. 4. Breslow, N. (1970). A generalized KruskalWallis test for comparing K samples subject to unequal patterns of censorship, Biometrika 57, 579–594. 5. Conover, W. J. (1980). Practical Nonparametric Statistics, 2nd Ed. Wiley, New York. 6. Cox, D. R. (1972). Regression models and life tables (with discussion), Journal of the Royal Statistical Society, Series B 34, 187–220. 7. Critchlow, D. E. & Fligner, M. A. (1991). On distribution-free multiple comparisons in the one-way analysis of variance, Communications in Statistics—Theory and Methods 20, 127–139. 8. Damico, J. A. & Wolfe, D. A. (1987). Extended tables of the exact distribution of a rank statistic for all treatments: multiple comparisons in one-way layout designs, Communications in Statistics—Theory and Methods 16, 2343–2360.
9. Damico, J. A. & Wolfe, D. A. (1989). Extended tables of the exact distribution of a rank statistic for treatments versus control multiple comparisons in one-way layout designs, Communications in Statistics—Theory and Methods 18, 3327–3353. 10. Daniel, W. W. (1978). Applied Nonparametric Statistics. Houghton-Mifflin, Boston. 11. Doksum, K. (1967). Robust procedures for some linear models with one observation per cell, Annals of Mathematical Statistics 38, 878–883. 12. Dwass, M. (1960). Some k-sample rank-order tests, in Contributions to Probability and Statistics, I. Olkin, S. G. Ghurye, H. Hoeffding, W. G. Madow & H. B. Mann, eds. Stanford University Press, Stanford, pp. 198–202. 13. Efron, B. (1981). Censored data and the bootstrap, Journal of the American Statistical Association 76, 312–319. 14. Efron, B. (1982). The Jackknife, the Bootstrap, and Other Resampling Plans. Society for Industrial and Applied Mathematics, CBMS-National Science Foundation Monograph, Vol. 38. 15. Efron, B. & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy, Statistical Science 1, 54–77. 16. Fligner, M. A. & Rust, S. W. (1983). On the independence problem and Kendall’s tau, Communications in Statistics—Theory and Methods 12, 1597–1607. 17. Fraser, D. A. S. (1957). Nonparametric Methods in Statistics. Wiley, New York. 18. Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance, Journal of the American Statistical Association 32, 675–701. 19. Gehan, E. A. (1965). A generalized Wilcoxon test for comparing arbitrarily singly-censored samples, Biometrika 52, 203–223. 20. Gibbons, J. D. (1971). Nonparametric Statistical Inference. McGraw-Hill, New York. 21. Gibbons, J. D. (1976). Nonparametric Methods for Quantitative Analysis. Holt, Rinehart, and Winston, New York. 22. Hájek, J. (1969). Nonparametric Statistics. Holden Day, San Francisco. 23. Hájek, J. & Šidák, Z. (1967). Theory of Rank Tests. Academic Press, New York. 24. Hayter, A. J. & Stone, G. (1991). Distribution free multiple comparisons for monotonically ordered treatment effects, Australian Journal of Statistics 33, 335–346.
25. Hettmansperger, T. P. (1984). Statistical Inferenc Based on Ranks. Wiley, New York. 26. Hodges, J. L., Jr & Lehmann, E. L. (1963). Estimates of location based on rank tests, Annals of Mathematical Statistics 34, 598–611. 27. Hollander, M. & Wolfe, D. A. (1999). Nonparametric Statistical Methods. 2nd Ed. Wiley, New York. 28. Jonckheere, A. R. (1954). A distribution-free k-sample test against ordered alternatives, Biometrika 41, 133–145. 29. Kalbfleisch, J. D. & Prentice, R. L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York. 30. Kaplan, E. L. & Meier, P. (1958). Nonparametric estimation from incomplete observations, Journal of the American Statistical Association 53, 457–481. 31. Kendall, M. G. (1938). A new measure of rank correlation, Biometrika 30, 81–93. 32. Kendall, M. G. (1962). Rank Correlation Methods, 3rd Ed. Griffin, London. 33. Kendall, M. G. & Babington Smith, B. (1939). The problem of m rankings, Annals of Mathematical Statistics 10, 275–287. 34. Kolmogorov, A. N. (1933). Sulla determinazione empirica di una legge di distribuzione, Giornale dell’Istituto Italiano degli Attuari 4, 83–91. 35. Kruskal, W. H. & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis, Journal of the American Statistical Association 47, 583–621. 36. Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San Francisco. 37. Mack, G. A. & Wolfe, D. A. (1981). K-sample rank tests for umbrella alternatives, Journal of the American Statistical Association 76, 175–181. 38. Mann, H. B. & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other, Annals of Mathematical Statistics 18, 50–60. 39. Marascuilo, L. A. & McSweeney, M. (1977). Nonparametric and Distribution-free Methods for the Social Sciences. Wadsworth, Belmont. 40. McDonald, B. J. & Thompson, W. A., Jr (1967). Rank sum multiple comparisons in one- and two-way classifications, Biometrika 54, 487–497. 41. Miller, R. G., Jr (1966). Simultaneous Statistical Inference. McGraw-Hill, New York.
42. Miller, R. G., Jr, Gong, G. & Muñoz, A. (1981). Survival Analysis. Wiley, New York. 43. Nemenyi, P. (1963). Distribution-free multiple comparisons, PhD Thesis. Princeton University. 44. Noether, G. E. (1967). Elements of Nonparametric Statistics. Wiley, New York. 45. Page, E. B. (1963). Ordered hypotheses for multiple treatments: a significance test for linear ranks, Journal of the American Statistical Association 58, 216–230. 46. Pratt, J. W. & Gibbons, J. D. (1981). Concepts of Nonparametric Theory. Springer-Verlag, New York. 47. Randles, R. H. & Wolfe, D. A. (1979). Introduction to the Theory of Nonparametric Statistics. Wiley, New York. 48. Samara, B. & Randles, R. H. (1988). A test for correlation based on Kendall’s tau, Communications in Statistics—Theory and Methods 17, 3191–3205. 49. Siegel, S. (1956). Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York. 50. Smirnov, N. V. (1939). On the estimation of the discrepancy between empirical curves
of distribution for two independent samples, Bulletin of Moscow University 2, 3–16 (in Russian). 51. Spearman, C. (1904). The proof and measurement of association between two things, American Journal of Psychology 15, 72–101. 52. Spjøtvoll, E. (1968). A note on robust estimation in analysis of variance, Annals of Mathematical Statistics 39, 1486–1492. 53. Steel, R. G. D. (1960). A rank sum test for comparing all pairs of treatments, Technometrics 2, 197–207. 54. Terpstra, T. J. (1952). The asymptotic normality and consistency of Kendall’s test against trend, when ties are present in one ranking, Indagationes Mathematicae 14, 327–333. 55. Walsh, J. E. (1962). Handbook of Nonparametric Statistics. Van Nostrand, Princeton. 56. Walsh, J. E. (1965). Handbook of Nonparametric Statistics, Vol. II. Van Nostrand, Princeton. 57. Walsh, J. E. (1968). Handbook of Nonparametric Statistics, Vol. III. Van Nostrand, Princeton. 58. Wilcoxon, F. (1945). Individual comparisons by ranking methods, Biometrics 1, 80–83. 59. Wilcoxon, F., Katti, S. K. & Wilcox, R. A. (1973). Critical values and probability levels for the Wilcoxon rank sum test and the Wilcoxon signed rank test, in Selected Tables in Mathematical Statistics, Vol. 1, H. L. Harter & D. B. Owen, eds. American Mathematical Society, pp. 171–259. 60. Wilcoxon, F. & Wilcox, R. A. (1964). Some Rapid Approximate Statistical Procedures, 2nd Ed. American Cyanamid Co., Lederle Laboratories, Pearl River.
NONRANDOMIZED TRIALS
ZHENGQING LI
Global Biometric Science, Bristol-Myers Squibb Company, Wallingford, Connecticut

A clinical trial, as defined by Friedman, Furberg, and DeMets, is ‘‘a prospective study comparing the effect and value of intervention(s) against a control in human subjects’’ (1). Following this definition, a clinical trial must be prospective. Retrospective studies such as case-control studies in which subjects are selected on the basis of presence or absence of an event of interest do not meet this definition and will not be discussed here. Following this definition, a clinical trial must employ one or more intervention techniques and must contain a control group. Without an active intervention, a study is observational because no experiment is being performed. Without a control, there is no comparison group one can use to assess the effect of the intervention. We will be focusing on studies on human beings. Animal (or plant) studies will not be covered in the discussion although they may be studied using similar techniques. This article is a tutorial description of nonrandomized trials under the framework of this clinical trial definition. Basic concepts of studies, design features, statistical methods, and applicability of nonrandomized trials will be described and their limitations discussed. In addition, references are provided for readers who are interested in exploring relevant topics further. Readers are also encouraged to read articles under the topics of randomization, stratification, historical control, observational trials, and propensity score covered elsewhere in this work. 1 RANDOMIZED VS. NONRANDOMIZED CLINICAL TRIALS Following the definition of clinical trials already outlined, the fundamental scientific principle underlying the comparison of intervention(s) versus control groups is that these groups must be alike in all important aspects, known or unknown, and differ only in the treatments that they receive. In this way, any differences observed between the groups can be attributed to the treatments, not to other factors such as baseline characteristics. To achieve comparable subject groups, the preferred method is to allocate treatments to subjects using a chance mechanism. Depending on the study design, a subject can be assigned to the treatment groups with the same chance (e.g., 1:1 randomization ratio) or different chance (e.g., 1:2 randomization ratio). Neither the investigator nor the subject knows in advance the treatment to be given before entering a trial. In practice, a randomization schedule is generated by a computer or from a table of random numbers (2). Further details are provided under the topic of randomization in this work. Randomized clinical trials are regarded as the scientific standard for comparing treatments. There are three main reasons why randomized clinical trials are the ideal scientific tool for comparing treatments. First, randomization tends to produce comparable groups. That is, the known or unknown prognostic factors and other characteristics of subjects at the time of randomization will be, on the average, evenly balanced between the treatment groups. Second, randomization eliminates the bias in the allocation of subjects that may potentially arise from either investigators or subjects. The direction of bias may go either way (in favor of or against the intervention) and can easily make the results of comparison uninterpretable. The third advantage of randomization is that it guarantees the validity of statistical tests of significance (3). Despite the popularity and wide acceptance of the scientific merits of randomization, some physicians are reluctant to participate in randomized clinical trials. The most frequent objection is the ethical concern with randomization (4–6). Many physicians feel that they must not deprive a patient of a new treatment that they believe to be beneficial, regardless of the validity of the evidence for that claim. Randomization would deprive about one-half of the patients (assuming a
1:1 randomization ratio) from receiving the new and presumed better treatment. Another reason is that some physicians may feel that the patient–physician relationship is compromised if the physician must explain to the patient that the treatment for their disease would be chosen by a coin toss or computer. In 1976, the National Surgical Adjuvant Breast and Bowel Project (NSABP) initiated a clinical trial to compare segmental mastectomy and postoperative radiation, or segmental mastectomy alone, with total mastectomy. Due to the low rates of accrual, a questionnaire was mailed to 94 NSABP principal investigators asking why they were not entering eligible patients in the trial (7). A response rate of 97% was achieved. Physicians who did not enter all eligible patients offered the following explanations: [1] concern that the doctor–patient relationship would be affected by a randomized trial (73%), [2] difficulty with informed consent (38%), [3] dislike of open discussions involving uncertainty (22%), [4] perceived conflict between the roles of scientist and clinician (18%), [5] practical difficulties in following procedures (9%), and [6] feelings of personal responsibility if the treatments were found to be unequal (8%). In addition, not all clinical studies can use randomized controls. For example, in some therapeutic areas, the disease is so rare that a large number of patients cannot be readily found. In such an instance, every qualified patient is precious for study recruitment. If a dramatic treatment effect is expected for the new intervention, the treatment effect may be easily explained based on clinical experience and data available from a historical database. In this case, either no control is needed or a historical control is sufficient to serve the purpose of the study. A nonrandomized trial is a clinical trial in which qualified subjects are assigned to different treatment groups without the use of a chance mechanism. Subjects may choose which group they want to be in, or may be assigned to the groups by their physician. Depending on how the control group is formed, several types of nonrandomized trials appear commonly in the literature. In the next section, we will describe some general features of nonrandomized trials.
2 CONTROL GROUPS IN NONRANDOMIZED TRIALS We discuss nonrandomized trials in the context of the clinical trial definition, which requires a control group. As mentioned earlier, if the value of the treatment is overwhelmingly beneficial, no comparison may be necessary. However, in this case, one can equally argue that no trial is necessary if one knows the treatment benefit for sure. In practice, the benefit of an active treatment is likely to be of moderate magnitude, requiring care in its evaluation. For these reasons, studies without a control will not be discussed here. In the following, we describe nonrandomized trials according to how the control group is formed. For a more general discussion regarding the selection of controls and control groups, readers are encouraged to read the topics of Control and Control Groups covered elsewhere in this work. 2.1 Nonrandomized Concurrent Control In a nonrandomized concurrent control trial, the control subjects are treated without the new intervention at approximately the same time as the intervention group is being treated. The patients are allocated to the intervention or control group based on either their physicians’ choice or the patients’ determination. Patients in the control group could be from the same institution or from a different institution. Typically, the control group needs to be selected to match the key characteristics of the intervention group. The main advantage of a nonrandomized concurrent control trial is that it is more easily accepted by physicians and patients, especially by those who have objections to randomization. In addition, the data will be collected from subjects who entered the study at approximately the same time. Investigators may feel that data from the same period of time are more comparable in contrast with data collected from studies that were conducted years ago. The major weakness of the nonrandomized concurrent control trial is the potential that the intervention and control groups are not strictly comparable. Although the investigators may match a few known important
prognostic factors, there is no way to check whether the unknown or unmeasured factors are comparable between the treatment groups. The difficulty increases in a therapeutic area where the prognostic factors affecting the disease are not well characterized. 2.2 Historical Control Historical control studies use comparable subjects from past studies. These include selection of controls from published literature or from previous studies that are documented in medical charts or computer files. The argument for using historical controls is that all patients can receive the new intervention. From the point of view of investigators, a clinician conducting a historical control study has no ethical dilemma that arises potentially from randomization, especially, if he or she is already of the opinion that the new intervention is beneficial. In addition, patients may be more willing to participate in the study if they can be sure of receiving a particular treatment. Other major benefits include these studies’ contribution to medical knowledge and the potential cost savings in sample size and length of study (8, 9). The major criticism of historical control studies is that they are particularly vulnerable to bias. First, patients with more favorable prognoses may be more likely to be selected to receive the new intervention. Because of this, the patients recruited in the study may be substantially different from the target population specified in the protocol, thus making comparability between the new intervention and the historical control groups questionable. As a consequence, the more favorable results with the new intervention may be attributed simply to the fact that more favorable patients receive it. Second, unlike the nonrandomized concurrent control studies in which patients are recruited at approximately the same time, patients included in the historical control may be from studies conducted several years ago. An improvement in outcome of a disease from the new intervention may stem from changes in the patient population and patient management as well as technology change such as technology improvement in diagnosis criteria. For example, because educational and
screening programs now encourage people to have their glucose levels checked frequently, many who are asymptomatic are identified as having diabetes and are receiving treatment. In the past, only those with symptoms would have chosen to see a physician; as a result, patients classified as diabetics would have comprised a different risk group compared with those currently considered to be diabetics. Third, without randomization, it is impossible to know whether the new intervention and the historical control groups are really comparable. For a therapeutic area where the diagnosis of the disease is clearly established and the prognosis is well known, this may be of less concern if the important prognostic factors are identified and matched through techniques such as stratification and regression. However, for a disease that is not well understood, an imbalance in unknown or unmeasured prognostic factors can easily make the interpretation of results difficult. In addition, historical studies are generally conducted in a nonuniform manner. The inaccuracy and incompleteness of historical studies can add more difficulties to using a historical control. The requirements for a valid historical control, as specified by Pocock (10), include the following: the control group has received a precisely defined treatment in a recent previous study; the criteria for eligibility, work-up, and evaluation of treatment must be the same; the important prognostic features should be known and be the same for both treatment groups; and no unexplained indications lead one to expect different results. A further proviso may be added to these requirements (9). If there are differences between treatment groups with respect to these features, then it should be established that these are not sufficient to explain any observed differences in outcome between groups. If this further requirement can be met by a study in which important differences in treatment effect had been demonstrated, then it would appear that such results would merit a confirmatory study.
2.3 Baseline Control

In this type of study, the patients’ status over time is compared with their baseline state. Although sometimes these studies are thought to use the patient as his or her own control, they do not in fact have any separate control per se. Instead, changes from baseline are compared with an estimate of what would have happened to the patient in the absence of the treatment with the new intervention. Such estimates are generally made on the basis of general knowledge, without reference to any specific control. Baseline control studies are generally conducted when the effect is dramatic and occurs rapidly after treatment and when the estimate that the investigator intends to compare with is clearly defined. When the case is not so obvious, a specific historical experience should be sought. In addition, this type of design is more appropriate to studies in which the outcome of primary interest is easily established at baseline and can be followed after baseline by laboratory parameters (e.g., blood pressure or glucose level). For a study with patient survival as the primary endpoint, it is impossible to apply this type of study design.

3 STATISTICAL METHODS IN DESIGN AND ANALYSES

Because randomization is not employed in a nonrandomized trial, the balance between the treatment groups in important prognostic factors, known or unknown, is not protected. The primary challenge in the design and analysis of nonrandomized trials is to address the bias that potentially arises from the incomparability of the treatment groups. In this regard, the nonrandomized concurrent control studies and historical control studies face the same fundamental issue. Therefore, we will not distinguish the statistical methods used for these two types of studies. For the baseline control study, the statistical inference underlying the comparison is a one-sample problem. The statistical methods in design and analysis are relatively straightforward. We will focus our attention on the first two types of studies and refer readers to tutorial statistical texts that have good coverage for this topic (1).

3.1 Study Design

To conduct a nonrandomized trial, a rigorous protocol is needed to stipulate the study’s objectives, inclusion/exclusion criteria, processes for data collection, and statistical methods to be used for data analysis. It is particularly important to specify in advance how the control group will be formed in a nonrandomized trial. For a nonrandomized, concurrent control study, the methods for patient selection and the procedure for treatment assignment need to be specified and communicated in advance between patients and their physicians. For a historical control study, the criteria used for forming the control, such as selection of studies and patients, should be specified in the protocol before conducting the study. Where no single optimal control exists, it may be advisable to study multiple controls, provided that the analytical plan specifies conservatively how each will be used in making inference. In some cases, it may be useful to have a group of independent reviewers reassess endpoints in the control group and in the intervention group in a blinded manner according to common criteria. A priori identification of a reasonable hypothesis and advance planning for analysis can strengthen the conclusions to be drawn (11). The credibility of results based on post hoc analyses is diminished, and the conclusions are less convincing to researchers and readers. In the planning stage of a clinical trial, one needs to determine the sample size based on the primary hypothesis stated in the study. For the comparison of two treatment groups in a randomized trial, extensive literature exists on sample size determination. The sample size is usually calculated from a simple two-sample Z-test, and most introductory statistical books contain the calculations. (Readers are also encouraged to read the topics pertaining to sample size calculations in this work.) In a nonrandomized trial, some special features in the study design will impact the sample size calculation. For example, in a historical control study, once the control group has been selected, the summary statistics from the control group are known and do not change in hypothetical repetitions of the
clinical study to be performed. The sample size calculation for a historical control study needs to incorporate this feature. For a binary endpoint, Makuch and Simon (12) provide a sample size formula as well as tables for calculating the sample size required in a historical control study. In general, the sample size required in a historical control study is substantially smaller than that required for a similar randomized trial. When the variability is ignored from the historical control, only 25% of the total number of patients is necessary to undertake a historical control study, compared with a similar randomized trial (8). However, this would imply the unlikely situation that the entire population of historical controls is known exactly. For a nonrandomized trial in which the data from the control group are not available at the design stage, considerations in sample size calculation are different. For example, in a nonrandomized, concurrent control trial, one concern is the impact of the imbalance of important covariates on the statistical comparison between the two groups. This imbalance needs to be incorporated into the sample size calculation. Lipsitz and Parzen (13) have provided a sample size formula for normal responses based on a regression model. For a nonrandomized study with a 1:1 allocation ratio between the two treatment groups, the number of patients required per treatment group can be calculated as

n* = n (1 − ρ²[Y−E(Y|X), E(Y|X,W)]) / (1 − r²x|w),

where n is the sample size calculated based on the formula for a randomized trial, ρ²[Y−E(Y|X), E(Y|X,W)] is the proportion of variation in the response Y jointly explained by treatment indicator X and important covariates W, which is not explained by X alone, and r²x|w is the coefficient of determination obtained by regressing X on the other covariates W, which measures the imbalance between the two treatment groups in W. In practice, the two coefficients are typically determined from a pilot study or previous data. If an investigator has no knowledge about these two coefficients, he or she can specify a range of possible values and see how sensitive the estimated sample size is to various parameters.
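As a small numerical illustration of this adjustment (the function name and the coefficients used below are hypothetical, not taken from the cited source), the displayed formula can be coded directly:

```python
def adjusted_sample_size(n_randomized, rho_sq, r_sq_x_given_w):
    """Per-group sample size for a nonrandomized comparison, adjusting the
    randomized-trial sample size by the factor (1 - rho^2) / (1 - r^2_{x|w})."""
    return n_randomized * (1.0 - rho_sq) / (1.0 - r_sq_x_given_w)

# Assumed pilot-study values: n = 64 per group from the randomized-trial formula,
# rho^2 = 0.20 and r^2_{x|w} = 0.30
print(adjusted_sample_size(64, rho_sq=0.20, r_sq_x_given_w=0.30))  # about 73.1 per group
```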
When the two coefficients are equal, the sample size will be the same as the sample size required for a randomized trial. Examples for calculating the sample size have been provided in Lipsitz and Parzen (13). Sample size calculations for failure-time random variables in nonrandomized studies are also available (14). Dixon and Simon (15) have discussed sample size considerations for studies comparing survival curves using historical controls. 3.2 Statistical Analysis As discussed earlier, the control and intervention groups may have large differences on their observed covariates in a nonrandomized trial. These differences can lead to biased estimates of the treatment effects. Statistical methods are available to minimize the bias stemming from the imbalance of these observed covariates. These methods include matching, stratification, and regression adjustment. Matching is a common technique used to select controls who are similar to the treated patients on important covariates that have been identified by the investigator as needing to be controlled. Although the idea of finding matches seems straightforward, it is often difficult in practice to find subjects who match on these covariates, even when there are only a few covariates of interest. A common matching technique is Mahalanobis metric matching (16, 17). Following this method, patients in the intervention group are randomly ordered first. Then, the Mahalanobis distances between the first treated patient and all controls are calculated. The control with the minimum distance is chosen as the match for the treated patient, and both individuals are removed from the pool. This process is repeated until matches are found for all treated patients. One of the drawbacks of the method is that it is difficult to find close matches when there are many covariates included in the model. Stratification is also commonly used to control for systematic differences between the control and intervention groups. Following this technique, patients will be grouped into strata based on the observed covariates that are deemed important by the investigator. Once the strata are defined, the treated
patients and control subjects who are in the same stratum are compared directly. The rationale behind this approach is that subjects in the same stratum are similar in the covariates used to define the strata, and thus are comparable. However, when the number of covariates increases, the number of strata grows exponentially. When the number of strata is large, some strata might contain subjects only from one group, which would make it impossible to estimate the treatment effect in that stratum.

Regression adjustment is a technique based on statistical regression models in which the treatment effect is estimated after adjusting for the covariates identified by the investigator. The theory behind this approach is that, if there is any bias due to the treatment imbalance on these observed covariates, these covariates should have effects on the outcome variable. By modeling the effects of the covariates and treatment indicator on the outcome variable in the same model, the treatment effect would be estimated on the basis that the subjects in the intervention and control groups hold the same value for these adjusted covariates. Consequently, the bias due to the treatment imbalance on these covariates would be minimized or removed. The selection of regression models will depend on the outcome variables. The most commonly used models include linear regression for continuous outcomes, logistic regression for dichotomous responses, and the Cox regression for time-to-event data. In contrast with the matching and stratification techniques, the regression adjustment can incorporate multiple covariates. However, Rubin (18) has shown that covariance adjustment may in fact increase the expected squared bias if the covariance matrices in the intervention and control groups are unequal.

One common difficulty with both the matching and stratification techniques is the case of multiple covariates. Although the regression technique can accommodate multiple covariates in the model, some concern remains about overparameterization and loss of flexibility in including interactions and higher order terms when many parameters are included. One major breakthrough in dimension reduction is the use of propensity scores (19).
The propensity score for a subject is the conditional probability of receiving the intervention rather than the control, given a vector of his or her observed covariates. The propensity score is a balancing score in the sense that the conditional distribution of the covariates, given the propensity score, is the same for the intervention and control subjects. In other words, a group of subjects with the same propensity score are equally likely to have been assigned to the intervention. Within a group of subjects with the same propensity score, some actually received the intervention and some received the control, just as if they had been randomly allocated to whichever treatment they actually received. Therefore, two groups with the same propensity score are expected to be comparable with respect to the observed covariates. Estimation of the propensity score is relatively straightforward. It is estimated by predicting treatment group membership based on the observed covariates—for example, by a multiple logistic regression or discriminant analysis. In the statistical model, the outcome is the event that a subject is in the intervention or control group, and the predictors are these covariates. The clinical outcome of the study is not involved in the modeling. (See D’Agostino [20] for a detailed description and tutorial summary.) Once the propensity score is estimated for each subject, the methods of matching, stratification, or regression adjustment can be used based on one summary score, the propensity score. Examples for implementing the propensity score approach have been described in the literature (21). Although the propensity score is a good approach for reducing the covariate dimension, limitations still exist (21). First, propensity score methods can only be applied to observed covariates. For unobserved and unmeasured covariates, propensity scores cannot be calculated. Second, to be able to apply the methods to propensity scores, an overlap in propensity scores between the two groups is necessary. If there is insufficient overlap between the two groups, none of the methods (matching, stratification, or regression adjustment) will be very helpful. Third, propensity score methods may not eliminate all bias because of the limitations of propensity score modeling (22), which is typically a linear combination of the covariates. As
recommended by Braitman and Rosenbaum (23), propensity score methods work better under three conditions: rare events, a large number of subjects in each group, and many covariates influencing the subject selection. A more detailed description for propensity score and its general usage can be found elsewhere in this work.
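For illustration only, the estimation and stratification steps might be sketched as follows in Python; the simulated data, covariate structure, scikit-learn modeling choice, and the use of five (quintile) strata are assumptions made for this example, not prescriptions from the cited references.

```python
# Sketch: estimate propensity scores by logistic regression and form quintile
# strata, as described above. Data and covariates are hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
W = rng.normal(size=(n, 3))                              # observed covariates
treated = rng.binomial(1, 1 / (1 + np.exp(-W[:, 0])))    # nonrandom treatment assignment

# Model treatment group membership on the covariates; the clinical outcome is not used.
ps_model = LogisticRegression().fit(W, treated)
propensity = ps_model.predict_proba(W)[:, 1]

# Stratify subjects into quintiles of the estimated propensity score.
quintile_edges = np.quantile(propensity, [0.2, 0.4, 0.6, 0.8])
stratum = np.digitize(propensity, quintile_edges)

# Inspect each stratum: number of subjects and proportion treated.
for s in range(5):
    in_s = stratum == s
    print(s, int(in_s.sum()), round(float(treated[in_s].mean()), 2))
```

Within each stratum, treated and control subjects can then be compared directly, or the estimated score can instead be used for matching or as an adjustment covariate in a regression model.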
4 CONCLUSION AND DISCUSSION
Lack of randomization and blinding make a nonrandomized trial vulnerable to bias. Although statistical methods are available to minimize the bias that may arise from the imbalance of observed covariates, it is always difficult, and in many cases impossible, to quantify the bias completely. Nonetheless, careful planning in study design, conduct, and analysis can make a nonrandomized trial more persuasive and potentially less biased. A control group should be chosen for which there is detailed information, and the control subjects should be as similar as possible to the population expected to receive the intervention in the study; the controls should be treated in a similar setting, in a similar manner. To reduce the selection bias, the methods for selecting the controls should be specified in advance. This may not always be feasible in the case of a historical control study because outcomes from the historical control may be in published form. However, efforts should be made to justify the selection of controls on a scientific basis rather than on outcomes. Any statistical techniques used to account for the population differences should be specified before selecting controls and performing the study. As noted earlier, one of the major reasons for conducting a nonrandomized trial is ethics concerns. However, it also is not ethical to carry out studies that have no realistic chance of credibility in showing the efficacy of the new treatment. When should we consider a nonrandomized trial? A few general considerations have been discussed in the International Conference on Harmonisation (ICH) of Technical Requirements for Registration of Pharmaceuticals for Human Use guidelines (24):
• There is a strong prior belief in the superiority of the new intervention to all available alternatives, and alternative designs appear unacceptable.
• The disease or condition to be treated has a well-documented, highly predictable course.
• The study endpoints are objective.
• The covariates influencing the outcome of the disease are well characterized.
• The control group closely resembles the intervention group in all known relevant baseline, treatment (other than study drug), and observational variables.

Even in these cases, appropriate attention to design, conduct, and analysis is necessary to help reduce the bias. Nonrandomized trials provide a useful supplement to randomized trials, and nonrandomized methods are useful for exploratory and pilot studies. However, unless the issues of potential biases are fully explored, one needs to be cautious when drawing confirmatory conclusions based on a nonrandomized trial.

REFERENCES

1. L. M. Friedman, C. Furberg, and D. L. DeMets, Fundamentals of Clinical Trials. 3rd ed. New York: Springer-Verlag, 1998. 2. M. Zelen, The randomization and stratification of patients to clinical trials. J Chronic Dis. 1974; 27: 365–376. 3. D. P. Byar, R. M. Simon, W. T. Friedewald, et al., Randomized clinical trials: perspectives on some recent ideas. N Engl J Med. 1976; 295: 74–80. 4. F. J. Ingelfinger, The randomized clinical trial [editorial]. N Engl J Med. 1972; 287: 100–101. 5. T. C. Chalmers, J. B. Black, and S. Lee, Controlled studies in clinical cancer research. N Engl J Med. 1972; 287: 75–78. 6. L. W. Shaw and T. C. Chalmers, Ethics in cooperative clinical trials. Ann NY Acad Sci. 1970; 169: 487–495. 7. K. M. Taylor, R. G. Margolese, and C. L. Saskolne, Physicians’ reasons for not entering eligible patients in a randomized clinical trial of surgery for breast cancer. N Engl J Med. 1984; 310: 1363–1367.
8. E. A. Gehan and E. J. Freireich, Nonrandomized controls in cancer clinical trials. N Engl J Med. 1974; 290: 198–203. 9. E. A. Gehan, The evaluation of therapies: historical control studies. Stat Med. 1984; 3: 315–324.
10. S. J. Pocock, The combination of randomized and historical controls in clinical trials. J Chronic Dis. 1976; 29: 175–188. 11. J. C. Bailar 3rd, T. A. Louis, P. W. Lavori, and M. Polansky, Studies without internal controls. N Engl J Med. 1984; 311: 156–162. 12. R. W. Makuch and R. W. Simon, Sample size considerations for non-randomized comparative studies. J Chronic Dis. 1980; 33: 175–181. 13. S. R. Lipsitz and M. Parzen, Sample size calculations for non-randomized studies. Statistician. 1995; 44: 81–90. 14. M. V. P. Bernardo, S. R. Lipsitz, D. P. Harrington, and P. J. Catalano, Sample size calculations for failure time random variables in non-randomized studies. Statistician. 2000; 49: 31–40. 15. D. O. Dixon and R. Simon, Sample size considerations for studies comparing survival curves using historical controls. J Clin Epidemiol. 1988; 14: 1209–1213. 16. D. B. Rubin, Bias reduction using Mahalanobis metric matching. Biometrics. 1980; 36: 293–298. 17. R. G. Carpenter, Matching when covariables are normally distributed. Biometrika. 1977; 64: 299–307. 18. D. B. Rubin, Using multivariate matched sampling and regression adjustment to control bias in observational studies. J Am Stat Assoc. 1979; 74: 318–324. 19. P. R. Rosenbaum and D. B. Rubin, The central role of the propensity score in observational studies for causal effects. Biometrika. 1983; 70: 41–55.
20. R. B. D’Agostino, Jr., Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Statistics Med. 1998; 17: 2265–2281. 21. L. Yue, Practical issues with the application of propensity score analysis to nonrandomized medical device clinical studies. In: 2004 ASA Proceedings. Alexandria, VA: American Statistical Association, 2004, pp. 970–975. 22. D. B. Rubin, Estimating causal effects from large data sets using propensity scores. Ann Intern Med. 1997; 127: 757–763. 23. L. Braitman and P. R. Rosenbaum, Rare outcomes, common treatments: analytical strategies using propensity scores. Ann Intern Med. 2002; 137: 693–696. 24. Center for Biologics Evaluation and Research (CBER), Center for Drug Evaluation and Research (CDER), Food and Drug Administration, U.S. Department of Health and Human Services. Guidance for Industry: E10. Choice of Control Group and Related Issues in Clinical Trials. Rockville, MD: U.S. DHHS, May 2001. Available online at: http://www.fda.gov/cder/guidance/4155fnl. htm. Accessed June 2007.
CROSS-REFERENCES
Randomization
Historical control
Stratification
Observational trials
Propensity score
OBJECTIVES

YILI L. PRITCHETT
Abbott Laboratories
Abbott Park, Illinois

Objectives describe what clinical researchers intend to achieve in a clinical trial. Objectives vary from study to study; in particular, they are distinctive between studies designed to learn and studies designed to confirm. In the learning phase of clinical drug development, the study objective could be to establish the maximum tolerable dose, to select the most promising treatment agent among a set of candidates, to prove the concept of efficacy, or to estimate the dose-response relationship. On the other hand, in the confirmatory phase, the study objective could be to test a hypothesis that the study drug has superior efficacy to the control, or to demonstrate the acceptable benefit/risk profile of a new molecular entity. Identifying the objectives is the first step in designing a clinical trial. Clearly and properly defined objectives build the foundation for a well-planned clinical trial, since objectives impact the decision for each of the following key elements related to trial design: type of study (e.g., adaptive design, crossover design, or parallel design); sample size; outcome measures; study duration; entry criteria; study monitoring rules; frequency of data collection; and statistical data analysis plan. Objectives should be written clearly in protocols.

Objectives can be classified as primary or secondary. Primary objectives are the focus of a study, and data should be collected to first support these objectives. In general, a study is considered successful if the primary objectives are met. A single, well-defined primary objective allows for clear interpretation of the clinical trial results. Secondary objectives can be either confirmatory or complementary to the primary ones. For instance, in a study of an investigational drug for the antidepressant indication, the primary objective can be ‘‘To assess the efficacy of study drug (at the dose of x mg) compared with placebo in reducing the total score of the 17-item Hamilton Depression Rating Scale in subjects who meet criteria for major depressive disorder as defined by the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV) (APA 1994).’’ In this protocol, a secondary objective that is confirmatory to the primary could be to evaluate the treatment effect on another efficacy measure. Another secondary objective, complementary to the primary objective, is to assess the safety and tolerability of the study drug versus the placebo. Some protocols use ‘‘exploratory objectives’’ to classify those goals that are stretched beyond what the design can fully evaluate but are somewhat related. Exploratory objectives do not require the same degree of statistical rigor as is required for primary objectives. They are usually used to generate hypotheses for future research.
FURTHER READING

European Medicines Agency, ICH Topic E8, General Considerations for Clinical Trials, March 1998.

S. Piantadosi, Clinical Trials: A Methodological Perspective. New York: Wiley, 1997.

L. B. Sheiner, Learning versus confirming in clinical drug development. Clin. Pharmacol. Therap. 1997; 61(3).
OFFICE OF ORPHAN PRODUCTS DEVELOPMENT (OOPD)

The U.S. Food and Drug Administration’s Office of Orphan Products Development (OOPD) is dedicated to promoting the development of products that demonstrate promise for the diagnosis and/or treatment of rare diseases or conditions. The OOPD interacts with the medical and research communities, professional organizations, academia, and the pharmaceutical industry as well as rare disease groups. The OOPD administers the major provisions of the 1983 Orphan Drug Act (ODA), which provides incentives for sponsors to develop products for rare diseases. The success of the ODA can be seen in the more than 200 drugs and biological products for rare diseases that have been brought to market since 1983, in contrast to the decade before 1983 which saw fewer than 10 such products come to market. In addition, the OOPD administers the Orphan Products Grants Program, which provides funding for clinical research in rare diseases.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/orphan/) by Ralph D’Agostino and Sarah Karl.
OFFICE OF PHARMACOEPIDEMIOLOGY AND STATISTICAL SCIENCE (OPASS)

The Office of Pharmacoepidemiology and Statistical Science (OPaSS), which includes the Office of Biostatistics and the Office of Drug Safety, was created as part of a 2002 Center for Drug Evaluation and Research (CDER) reorganization and has about 180 of CDER’s 1700 employees. Staff persons who work in the Office of Biostatistics and the Office of Drug Safety have backgrounds in a variety of disciplines that include medicine, epidemiology, pharmacology, pharmacy, statistics, regulatory science, health science, information technology, as well as administration and support services. OPaSS plays a significant role in the Center’s mission of assuring the availability of safe and effective drugs for the American people by:

• Providing leadership, direction, planning, and policy formulation for CDER’s risk assessment, risk management, and risk communication programs;
• Working closely with the staff of CDER’s other ‘‘super’’ offices, the Office of New Drugs and the Office of Pharmaceutical Science, to provide the statistical and computational aspects of drug review evaluation and research.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/Offices/OPaSS/default. htm) by Ralph D’Agostino and Sarah Karl.
OFFICE OF REGULATORY AFFAIRS (ORA)
The U.S. Food and Drug Administration’s Office of Regulatory Affairs (ORA) is responsible for the following:

• Managing and operating the FDA field offices.
• Coordinating and managing all FDA field operations.
• Providing advice and assistance on regulations and compliance policy matters that impact policy development, implementation, and long-range goals.
• Working with additional federal agencies on issues of compliance and evaluating proposed legal actions.
• Directing and conducting criminal investigative activities in coordination with FDA headquarters units and other federal, state, and local law enforcement agencies.

1 COMPLIANCE

The principal job of ORA is to survey and inspect regulated firms to assess their compliance with public health laws. Compliance strategies include providing information to industry; highlighting areas of significant violations and their impact on public health; prioritizing and targeting high-risk areas; cooperating with state and local public health authorities and regulators; and focusing on covering products imported into the United States through border coverage and foreign inspections.

2 HEADQUARTERS OFFICES

The headquarters of ORA is composed of four offices, each with its own responsibilities, that work together to achieve ORA’s mission.

1. The Office of Resource Management (ORM) encompasses four divisions—Management Operations, Information Systems, Human Resource Development, and Planning, Evaluation, and Management—that are responsible for:
• Managing bilateral agreements and Memoranda of Understanding (MOUs) with other governments.
• Developing field manpower allocations and operational program plans.
• Analyzing and evaluating field performance data and overall accomplishments.
• Advising the Office of the Associate Commissioner for Regulatory Affairs (ACRA) and the Regional Food and Drug Directors (RFDDs) on all areas of management.
• Developing and implementing nationwide information storage and retrieval systems for data originating in the field offices.

2. The Office of Regional Operations (ORO) consists of four divisions—Field Science, Federal-State Relations, Import Operations, and Field Investigations—that are responsible for:
• Serving as the central point through which the FDA obtains field support services.
• Developing, issuing, approving, or clearing proposals and instructions affecting field activities.
• Developing and/or recommending to the ACRA policy, programs, and plans for activities with state and local agencies.
• Coordinating field consumer affairs and information programs.
• Developing and maintaining international regulatory policy and activities to ensure the safety, efficacy, and wholesomeness of various imported products.
• Providing laboratory support in various highly specialized areas.

3. The Office of Enforcement (OE), which coordinates legal cases and policies within ORA and the Centers, has several Compliance divisions and is responsible for:
• Advising the ACRA and other key officials on regulations and compliance matters that have an impact on policy development, implementation, and long-range program goals.
• Coordinating, interpreting, and evaluating overall compliance efforts.
• Stimulating an awareness of the need for prompt and positive action to ensure compliance by regulated industries.
• Evaluating and coordinating proposed legal actions to establish compliance with regulatory policy and enforcement objectives.
• Coordinating development of FDA-wide bioresearch monitoring activities.
• Serving as the focal point of activities relating to the Federal Medical Products Quality Assurance Program.

4. The Office of Criminal Investigations (OCI) focuses on the investigation of criminal activities in the field and is responsible for:
• Directing, planning, and developing criminal investigation activities in coordination with other FDA components and with other federal, state, and local law enforcement agencies.
• Initiating and conducting criminal investigations under all statutes administered by the FDA.
• Providing recommendations to the Office of Chief Counsel on referrals of criminal cases to the Department of Justice, participating in grand jury investigations, and serving as agents of the grand jury.

3 FIELD COMPONENTS

The field staff of ORA is organized into five regions, each of which is headed by a Regional Food and Drug Director (RFDD):

• The Pacific Region: Alaska, Arizona, California, Hawaii, Idaho, Montana, Nevada, Oregon, and Washington. The region includes three district offices and two regional labs.
• The Southwest Region: Arkansas, Colorado, Iowa, Kansas, Missouri, Nebraska, New Mexico, Oklahoma, Texas, Utah, and Wyoming. The region includes three domestic district offices, the Southwest Import District (SWID), and the Arkansas Regional Lab.
• The Central Region: Delaware, Illinois, Indiana, Kentucky, Maryland, Michigan, Minnesota, Ohio, New Jersey, North Dakota, Pennsylvania, South Dakota, Virginia, West Virginia, and Wisconsin. The region includes seven district offices and the Forensic Chemistry Center.
• The Southeast Region: Alabama, Florida, Georgia, Louisiana, Mississippi, North Carolina, South Carolina, Tennessee, and the San Juan district (Puerto Rico and the U.S. Virgin Islands). The region includes four district offices and a regional laboratory.
• The Northeast Region: Connecticut, Maine, Massachusetts, New Hampshire, New York, Rhode Island, and Vermont. The region includes two district offices, a regional lab, and the Winchester Engineering and Analytical Center (WEAC).

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/ora/hier/ora overview.html) by Ralph D’Agostino and Sarah Karl.
ONE-SIDED VERSUS TWO-SIDED TESTS
The choice between a one-sided test or a two-sided test for a univariate hypothesis depends on the objective of statistical analysis prior to its implementation. The underlying issue is whether the alternative∗ against which the (null) hypothesis is to be assessed is one-sided or two-sided. The alternative is often one-sided in a clinical trial∗ to determine whether active treatment is better than placebo; a two-sided alternative is usually of interest in a clinical trial to determine which of two active treatments is better. The principal advantage of a one-sided test is greater power∗ for the contradiction of the null hypothesis when the corresponding one-sided alternative applies. Conversely, for alternatives on the opposite side, its lack of sensitivity represents a disadvantage. Thus, if alternatives on both sides of a null hypothesis are considered to be of inferential interest, a two-sided test is necessary. However, where the identification of one direction of alternatives is actually the objective of an investigation, the cost of the broader scope of a two-sided test is the larger sample size it requires to have the same power for this direction as its one-sided counterpart. The benefit provided by the increase in sample size is power for alternatives in the opposite direction. If this purpose for increased sample size is not justifiable on economic, ethical, or other grounds, then a one-sided test for a correspondingly smaller sample size becomes preferable. Thus, both one-sided tests and two-sided tests are useful methods, and the choice between them requires careful judgment.

The statistical issues can be clarified further by considering the example of the hypothesis of equality of two population means µ1 and µ2. The null hypothesis has the specification

H0: µ1 − µ2 = δ = 0.    (1)

Suppose y1 and y2 are sample means based on large sample sizes n1 and n2 (e.g., ni ≥ 40) from the two populations; also suppose the population variances σ1² and σ2² are essentially known through their consistent estimation by the sample variances s1² and s2². Then the statistic

z = d/σd,    (2)

where d = (y1 − y2) and σd = {(s1²/n1) + (s2²/n2)}^1/2, approximately has the standard normal distribution with expected value 0 and variance 1. A two-sided test for the hypothesis H0 in (1) has the two-sided rejection region

R2(αL, αU) = {any observed z such that z ≤ zαL or z ≥ z1−αU},    (3)

where zαL and z1−αU are the 100αL and 100(1 − αU) percentiles of the standard normal distribution and (αL + αU) = α is the specified significance level∗ (or Type I error). For most applications, (3) is symmetric with αL = αU = (α/2), and zαL = zα/2 = −z1−(α/2) = −z1−αU; this structure is assumed henceforth for two-sided tests of H0 in (1).

The one-sided test for assessing H0 in (1) relative to the alternative

Hδ: µ1 − µ2 = δ > 0    (4)

of a larger mean for population 1 than population 2 has the one-sided rejection region

RU(α) = R2(0, α) = {any observed z such that z ≥ z1−α};    (5)

similarly, if the alternative (4) specified δ < 0, the one-sided rejection region would be RL(α) = R2(α, 0). Thus, the symmetric two-sided test based on R2(α/2, α/2) is equivalent to simultaneous usage of the two one-sided tests based on RL(α/2) and RU(α/2). The power of the one-sided test (5) with respect to Hδ in (4) is

ψU(δ|α) = Pr{RU(α)|Hδ} = 1 − Φ{z1−α − (δ/σd)},    (6)

where Φ( ) is the cumulative distribution function of the standard normal distribution.
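As an illustration of equation (6), the power of the one-sided test can be evaluated numerically; this is a minimal Python sketch, and the inputs (n1 = n2 = 50, s1 = s2 = 1, δ = 0.5, α = 0.05) are arbitrary assumed values rather than figures from the text.

```python
# Sketch: power of the one-sided large-sample z-test from equation (6).
# Sample sizes, variances, and the effect size delta are illustrative assumptions.

from math import sqrt
from scipy.stats import norm

def one_sided_power(delta, s1, s2, n1, n2, alpha=0.05):
    sigma_d = sqrt(s1**2 / n1 + s2**2 / n2)
    return 1 - norm.cdf(norm.ppf(1 - alpha) - delta / sigma_d)

print(round(one_sided_power(delta=0.5, s1=1.0, s2=1.0, n1=50, n2=50, alpha=0.05), 3))  # about 0.80
```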
Table 1. Multiplier of One-Sided Test Sample Sizes for Two-Sided Test to Have the Same Power

Power    α = 0.01    α = 0.02    α = 0.05
0.50     1.23        1.28        1.42
0.60     1.20        1.25        1.36
0.70     1.18        1.22        1.31
0.80     1.16        1.20        1.27
0.90     1.14        1.17        1.23
The power of the two-sided test (3) for this situation is

ψ2(δ|α) = Pr{R2(α/2, α/2)|Hδ} = [1 − Φ{z1−(α/2) − (δ/σd)} + Φ{zα/2 − (δ/σd)}].    (7)
When δ > 0, ψU(δ|α) > ψ2(δ|α), and the one-sided test is more powerful. However, when δ < 0, ψ2(δ|α) > α/2 > ψU(δ|α), and so the one-sided test’s power is not only much poorer, but is also essentially negligible. Also, in the very rare situations where rejection is indicated, it is for the wrong reason [i.e., H0 is contradicted by large z in RU(α) when actually δ < 0]. When one direction of alternatives such as (4) is of primary interest, the two-sided test, which achieves the same power ψ for specific α and δ as its one-sided counterpart, requires sample sizes that are λ(α, ψ) times larger [where λ(α, ψ) ≥ 1]. For usual significance levels 0.01 ≤ α ≤ 0.05 and power ψ ≥ 0.50, the two-sided test multiplier λ(α, ψ) of the one-sided test sample sizes n1 and n2 is given by

λ(α, ψ) = {(z1−(α/2) + zψ)/(z1−α + zψ)}².    (8)
In Table 1, values of λ(α, ψ) are reported for α = 0.01, 0.02, 0.05 and ψ = 0.50, 0.60, 0.70, 0.80, 0.90. For the typical application of power ψ = 0.80 and significance level α = 0.05, the sample size required for a two-sided test is 27% greater than for its one-sided counterpart. Also, the multipliers λ(α, ψ) can be seen to decrease as either α decreases or ψ increases.
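The entries in Table 1 follow directly from equation (8); for instance, a short Python sketch (using SciPy’s normal quantile function) reproduces the ψ = 0.80, α = 0.05 multiplier.

```python
# Sketch: sample-size multiplier lambda(alpha, psi) from equation (8) for a
# two-sided test to match the power of its one-sided counterpart.

from scipy.stats import norm

def two_sided_multiplier(alpha, power):
    z_two = norm.ppf(1 - alpha / 2)
    z_one = norm.ppf(1 - alpha)
    z_psi = norm.ppf(power)
    return ((z_two + z_psi) / (z_one + z_psi)) ** 2

print(round(two_sided_multiplier(0.05, 0.80), 2))  # 1.27, matching Table 1
```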
Some further insight about one-sided and two-sided tests can be gained from their relationship to confidence intervals∗. The one-sided test based on RU(α) in (5) corresponds to the one-sided lower bound confidence interval

δ ≥ d − z1−α σd = dL,α.    (9)

If dL,α > 0, then H0 is contradicted relative to the alternative Hδ in (4); if dL,α ≤ 0, then there is not sufficient evidence to support Hδ. In this latter context, δ may be near 0 or less than 0; but the distinction between these interpretations is not an inferential objective of a one-sided confidence interval or hypothesis test. For the purpose of the more refined assessment of whether δ is greater than 0, near 0, or less than 0, a two-sided test is needed; its corresponding confidence interval is

dL,α/2 ≤ δ ≤ dU,α/2,    (10)

where dL,α/2 = {d − z1−(α/2) σd} and dU,α/2 = {d + z1−(α/2) σd}. If dL,α/2 > 0, then H0 is contradicted with respect to δ > 0; if dU,α/2 < 0, then H0 is contradicted with respect to δ < 0; and if dL,α/2 ≤ 0 ≤ dU,α/2, then H0 is not contradicted and δ is interpreted as being near 0 in the sense of the confidence limits (dL,α/2, dU,α/2). When support for δ > 0 is the objective of an investigation, the cost for the two-sided confidence interval’s or test’s additional capability for distinguishing between δ < 0 or δ near 0 is either reduced power for the same sample size or increased sample size for the same power.

A third way to specify one-sided and two-sided tests is through one-sided and two-sided p-values∗; the one-sided p-value for assessing the one-sided alternative Hδ in (4) through z in (2) is

pU(z) = 1 − Φ(z);    (11)

if pU(z) ≤ α, then z is interpreted as contradicting H0 on the basis of the small probability α for repeated sampling under H0 to yield values ≥ z. For symmetric two-sided tests of H0 in (1), the two-sided p-value∗ is

p2(z) = 2{1 − Φ(|z|)};    (12)
if p2(z) ≤ α, then H0 is contradicted. The definition of two-sided p-values for asymmetric situations is more complicated; it involves considerations of extreme outcomes for a test statistic in both directions from H0. For summary purposes, the rejection region, confidence interval, and p-value specifications of a one-sided test are equivalent in the sense of yielding the same conclusion for H0; this statement also applies to symmetric two-sided tests. A concern for any one-sided test is the interpretation of values of the test statistic which would have contradicted the hypothesis if a two-sided test were used. From the inferential structure which underlies one-sided tests, such outcomes are judged to be random events compatible with the hypothesis, no matter how extreme they are. However, their nature can be a posteriori described as ‘‘exploratory information supplemental to the defined (one-sided) objective’’ of an investigation. This perspective enables suggestive statements to be made about opposite direction findings; their strengthening to inferential conclusions would require confirmation by one or more additional investigations. Another issue sometimes raised is that one-sided tests seem to make it easier to contradict a hypothesis and thereby to have a weaker interpretation than would have applied to two-sided tests. However, when the null hypothesis H0 is true, the probability of its contradiction is the significance level α regardless of whether a one-sided test or a two-sided test is used. It is easier for the one-sided test to contradict H0 when its one-sided alternative applies, but this occurs because the one-sided test is more powerful for such alternatives.

Some additional practical comments worthy of attention are as follows: (i) Among the commonly used statistical tests for comparing two population means, z and t-tests lead to one-sided or two-sided tests in a natural manner such as (3) and (5) due to the symmetry about zero of their standardized distributions. Chi-square and F-tests∗ for such comparisons involve squared quantities and so lead to two-sided tests. One-sided counterparts for chi-square and F-test p-values are usually computed indirectly using
p1 = (p2/2) if the difference is in the same direction as the alternative hypothesis, and p1 = 1 − (p2/2) if the difference is in the opposite direction, where p1 and p2 are one-sided and two-sided p-values, respectively. (ii) Fisher’s exact test∗ for independence in a 2 × 2 contingency table∗ leads naturally to either a one-sided or two-sided test since the discrete event probabilities for it pertain to one or the other side of the underlying permutation distribution. However, this test is often asymmetric and then a one-sided p-value (less than 0.5) cannot be doubled to give the corresponding two-sided p-value. (iii) Fisher’s method of combining c independent tests (see Fisher [2] and Folks [3]) is analogous to a one-sided test when its power is directed at a one-sided alternative. For this test, the one-sided p-value is the probability of larger values of

QF = −2 Σ_{k=1}^{c} log pk
with respect to the χ 2 distribution with 2c degrees of freedom where the {pk } are one-sided p-values in the direction of the one-sided alternative of interest for the c respective tests. The p-value corresponding to the opposite side is obtained by the same type of computation with {pk } replaced by their complements {1 − pk }. (iv) A practical advantage of one-sided p-values is their descriptive usefulness for summarizing the results of hypothesis tests; such p-values contain more information than their two-sided counterparts because the onesided version identifies the direction of any group difference as well as providing the criterion for evaluating whether the hypothesis of no difference is contradicted. This additional descriptive feature eliminates the need for identifying the direction of difference as would be necessary in summary tables of two-sided p-values. Additional discussion of one-sided and two-sided tests is given in many textbooks dealing with statistical methodology, e.g., see Armitage [1], Hogg and Craig [4], Mendenhall et al. [5]. Also, see HYPOTHESIS TESTING. Acknowledgment
This research was supported in part by the U. S. Bureau of the Census through Joint Statistical Agreement JSA-84-5. The authors would like to express appreciation to Ann Thomas for editorial assistance.
REFERENCES 1. Armitage, P. (1971). Statistical Methods in Medical Research. Wiley, New York. 2. Fisher, R. A. (1932). Statistical Methods for Research Workers, 4th ed. Oliver and Boyd, Edinburgh, Scotland. 3. Folks, J. L. (1984). Combination of independent tests. In Handbook of Statistics: Nonparametric Methods, Vol. 4, P. R. Krishnaiah and P. K. Sen, eds. North-Holland, Amsterdam, Netherlands, pp. 113–121. 4. Hogg, R. V. and Craig, A. T. (1978). Introduction to Mathematical Statistics, 4th ed. Macmillan, New York. 5. Mendenhall, W., Scheaffer, R. L., and Wackerly, D. D. (1981), Mathematical Statistics with Applications, 2nd ed. Duxbury Press, Boston, Mass. See also CONFIDENCE INTERVALS AND REGIONS; EXPLORATORY DATA ANALYSIS; FISHER’S EXACT TEST; HYPOTHESIS TESTING; INFERENCE, STATISTICAL —I; INFERENCE, STATISTICAL —II; POWER; P-VALUES; and SIGNIFICANCE TESTS, HISTORY AND LOGIC OF.
GARY G. KOCH
DENNIS B. GILLINGS
OPEN-LABELED TRIALS
SIMON DAY
Roche Products Ltd.
Welwyn Garden City, UK

Open-labeled trials are a stark contrast to blinded (or masked) trials. Blinded (or masked) trials are those in which typically the patient and the treating physician (often, even if not always, synonymous with the person who observes the outcome) are unaware of the assigned treatment. Such design aspects of trials are important principally to avoid bias in selection of patients and the measurement and assessment of outcomes. Various methods exist to help keep studies double blind. The types of methods—and their complexity—vary considerably depending on the type of trial and the type of intervention (a broader term than just ‘‘treatment’’) being assessed. Other articles in this volume address such issues. The purpose of the current article is not to defend open-labeled studies but, instead, to explain why they are sometimes necessary and even sometimes preferred to blinded studies. However, even if conceptually an open-labeled study might be necessary (or preferable) to a fully blinded study, major deficiencies need to be recognized and addressed wherever possible in the study design and management.

1 THE IMPORTANCE OF BLINDING

In some sense, this article could be recognized as trying to defend the use of open-labeled studies, whereas other articles in this encyclopedia implicitly criticize them and strongly argue for fully blinded trials. Indeed, the importance of blinding should not be underestimated. Open-labeled trials are generally considered to be of lower scientific merit and particularly susceptible to bias when compared with blinded trials.

1.1 Selection Bias in Trials: Blinding and Concealment

The terms ‘‘blinding’’ and ‘‘concealment’’ are often confused with each other, and sometimes it is assumed that achieving one necessarily achieves the other. This is not the case. ‘‘Concealment’’ refers to whether the identity of the treatment for the next patient to be enrolled is known (1). In a fully blinded study, concealment might be taken for granted because neither the investigator nor the patient (nor all other carers and trial personnel) are supposed to know the treatment assignment (2,3). However, even in an open-labeled study (or partially blinded study), it should still be possible—and it is certainly highly desirable—that those involved in recruiting patients are not aware of which treatment (new or ‘‘experimental,’’ active control, perhaps placebo, etc.) the next patient to be recruited will receive. If they are so aware, then it is possible (even if not always easy to verify) that they may not recruit a particular patient if they are unhappy with the proposed treatment allocation; in addition, they may delay recruiting that patient until the ‘‘next’’ treatment allocation is deemed more preferable. So, for example, in a placebo-controlled study, an investigator might subconsciously not offer participation to a more severely ill patient if they know that patient will receive placebo; but, conversely, the investigator might offer participation if they know that this patient will receive an active (or presumed active) compound. Other forms of selection bias may also occur—even including preferential participation of patients who will receive placebo, perhaps if for a given patient there is a high expectation of adverse events so that the overall benefit–risk for that patient might not be considered positive. Others have written extensively on this topic of selection bias (4), which includes proposals to measure and correct for it. The potential for selection bias exists even in studies that are planned to be fully blinded, but its obvious potential and, therefore, the scientific concern is heightened in open-labeled studies.

1.2 Assessment Bias in Trials

Outcomes or endpoints in trials should be measured carefully, accurately, and without any differential bias in favor of one or other of the investigational treatments. Some outcomes are easy to measure objectively and
without any differential bias: The most obvious example being death. However, even this endpoint is not always immune from measurement bias. A primary endpoint of ‘‘death within 2 hours’’ (an endpoint that might be applicable in a study of acute cardiac failure) might be compromised by uncertainties in confirming exact time of death and, hence, whether it has been before or after 2 hours from randomization. This procedure might not introduce any bias in the comparison of treatments in a fully blinded trial, but if the treatment allocation is known, then some bias might ensue. If it were known that a patient had been allocated to receive placebo, then the researcher might have a lower expectation of survival; thus, agreements about time of death being 1 hour 59 minutes or 2 hours 1 minute might be handled differently. Some trials might be designed as open-labeled trials because the intervention group needs regular and intensive monitoring (perhaps for dose adjustment or for managing anticipated side effects), and the intensive nature of the monitoring seems a compelling reason not to subject the control patients to the same procedures (or at least at the same frequency). The ethical and practical considerations seem sensible and reasonably easy to justify. Perhaps, in a specific example, patients allocated to a control arm are to be observed for routine follow-up once every 3 months, whereas those patients allocated to the new ‘‘experimental’’ arm might be observed every 6 weeks. Now, in the experimental arm, with patients being observed more often, there is a higher chance that the trial endpoint might be observed sooner or adverse events reported at a higher rate than for patients randomized to the control arm.

1.3 Patient Reporting Bias

The most obvious scenario in which bias might be introduced by the patient is with self-reported assessments, which are often referred to as patient-reported outcomes (PROs). Many such examples can be listed, such as pain scales and quality of life. Why might patients report differentially on such scales simply because they know (or believe
they know) the identity of the treatment they have received? We can speculate (and will do so shortly), but it is sufficient to note that empirical investigations have demonstrated differential responses repeatedly to different treatment identities, even when the ‘‘treatments’’ have, in fact, been placebos. de Craen et al. (5) conducted a systematic review of the literature on perceived differential effects based on the color of drugs. They found that red, orange, and yellow tablets were best for stimulant drugs, whereas blue and green were best for sedatives. Kaptchuk et al. (6) compared a sham device (a sham acupuncture needle) with an inert pill in patients with persistent arm pain. Some endpoints showed no evidence of any differential effect (that is not to say they demonstrated there was no differential effect, they simply failed to show any effect); other endpoints did seem to show some differential effect. Ernst and Resch (7) note the important distinction between a true placebo and a ‘‘no-treatment’’ option—the mere distinction highlighting that an observable effect of placebo is not uncommon. Recently, the term ‘‘nocebo’’ has been introduced to describe a placebo that is expected to have harmful effects. So why might patients respond differentially to different colors, shapes, types of placebo? This response is speculation but Sacket (8), for example, presents a whole host of possible sources of biases that may crop up in all aspects of analytical research. Some of those most plausibly likely to influence either how patients report PROs or, indeed, how physiological measurements might be affected include: Apprehension bias. Certain measures (pulse, blood pressure) may alter systematically from their usual levels when the subject is apprehensive, (e.g., blood pressure during medical interviews). Obsequiousness bias. Subjects may systematically alter questionnaire responses in the direction they perceive desired by the investigator. Expectation bias. Observers (or patients) may systematically err in measuring and recording observations so that they concur with prior expectations (8).
1.4 Efficacy and Safety

Of course, influences of bias are not restricted only to efficacy measurements or only to safety measurements. Either, or both, could be influenced favorably or unfavorably. Ultimately, the results of any trial—and the decision whether to prescribe a therapy—should be an evaluation of the benefit–risk ratio. A treatment that is perceived as beneficial may have a positively biased assessment of its efficacy and a negatively biased assessment of its safety (that is, ‘‘good all around’’). Conversely, a treatment that is perceived as less beneficial may have both worse efficacy and worse safety reported than its (perceived) better comparison treatment. More confusingly, a treatment that is considered ‘‘strong’’ (perhaps a high dose of a treatment, or even multiple tablets suggestive of a high dose) may have a positive bias in relation to its efficacy but simultaneously a negative bias relating to its safety. And, of course, different patients and different observers all with different expectations may all introduce different degrees (or even directions) of biases.

2 REASONS WHY TRIALS MIGHT HAVE TO BE OPEN-LABEL

Achieving the appropriate degree of blinding for a variety of the necessary study staff has been discussed above. Methods include simple ‘‘placebo’’ pills or capsules, ‘‘placebo’’ injections (which may simply be injections of saline solution, for example), sham surgery, and so on. Some situations are more difficult to manage than others are.

2.1 Different Formulations

Comparing products that have different pharmaceutical forms can be difficult, but solutions do exist in some cases. Capsules and tablets can sometimes be compared by placing the tablets inside inert capsules (so that patients just believe that they are swallowing a capsule). However, issues of bioavailability of the hidden tablet may need to be addressed, and a bioequivalence study that compares the tablets (swallowed as tablets) and the tablets hidden inside capsules may be necessary. Of course, this procedure raises
the issue of how this study could be blinded and, if it can be, then might it be possible (and more efficient) to avoid such a bioequivalence study and simply to carry out the ‘‘real’’ study in a blinded manner. Comparing treatments that are not just different pharmaceutical formulations but that have different routes of administration becomes much more difficult (but see ‘‘double dummy’’ below).

2.2 Sham Surgery

Comparing surgical with medical interventions (or even comparing with ‘‘no treatment’’) in a blinded way is very challenging—both practically and ethically. The notion of ‘‘sham’’ surgery has often been used but nearly always causes controversy. One example is of investigators who drilled holes into the skulls of patients with Parkinson’s disease to transplant embryonic dopamine neurons (9). Those patients randomized not to receive the intervention still had the holes drilled in their skulls—with all the potential for serious adverse consequences (anesthetics, the drilling procedure, subsequent infection, and so on). The primary outcome was a subjective rating scale of symptoms of Parkinson’s disease so that the maximal level of blinding was considered very important.

2.3 Double Dummy

The most common solution to blinding when the treatments are obviously different (either in appearance, route of administration, time of administration, etc.) is to use a technique called ‘‘double dummy.’’ More properly, it might be called ‘‘double placebo’’ because, effectively, two (or possibly more) different placebos are used in the same study. Consider as a simple example a trial to compare the effects of tablet ‘‘A’’ with transdermal patch ‘‘B.’’ Patients randomized to ‘‘A’’ are also given a placebo patch and patients randomized to ‘‘B’’ also take placebo tablets. So every patient receives a tablet and a patch but no patient is unblinded. With more than two treatments, a similar technique can be used, but the burden increases on the patient to take more and more medication (most of it placebo!).
2.4 Partially Blinded Studies In trials with more than two arms, when blinding cannot be achieved fully, it may be possible to blind some comparisons between some treatment arms. This technique would seem to be ‘‘better than nothing,’’ although the extent to which the credibility and reliability of the trial can then be assured is difficult to judge. An example of blinding of some treatment comparisons is in the ‘‘TARGET’’ trial (or trials) (10,11). Lumiracoxib was compared with both naproxen and ibuprofen, using a double-dummy approach in two substudies. In substudy one, patients were randomized to lumiracoxib or naproxen; in substudy two, patients were randomized to lumiracoxib or ibuprofen. Each substudy used double-dummy randomization so that, for example, in substudy one, patients did not know whether they were receiving lumiracoxib or naproxen—although they did know they were not receiving ibuprofen. Conversely, in substudy two, patients did not know whether they were receiving lumiracoxib or ibuprofen—but they did know they were not receiving naproxen. So here is an example in which what might have had to be an open-labeled study or a ‘‘triple-dummy’’ study could be made at least partially blinded and practically and logistically manageable.
3 WHEN OPEN-LABEL TRIALS MIGHT BE DESIRABLE In general, open-label studies will be less preferred than double-blind studies. However, it is sometimes argued that in highly pragmatic trials [see, for example, Schwartz et al. (12)] open-labeled treatment is more appropriate. Scientifically, we generally want to know the relative benefit of different pharmaceutical preparations. The relative advantages and disadvantages of the science are of more interest than efficacy (or harm) caused by their physical presentation. Yet it is selfevident that different forms of presentation may be more or less acceptable to patients (which may subjectively influence efficacy); different forms of presentation are also likely to affect patient compliance strongly, which, in turn, will affect both efficacy and safety
(see also the entry on Patient preference trials). So, a balance must be found. Excessive inclusion of placebos (double- or higher-order dummy designs) will affect compliance and adherence to treatment regimens. Seemingly minor issues of taste (or perhaps size of tablet to swallow) may impact on patients’ willingness to take medication so hence on the clinical benefit they might gain. So we need to ask carefully what question we are trying to answer: Is it about the science of the drug (or perhaps surgical procedure) or is it about the drug (or other intervention) ‘‘as it is’’? If we conduct a trial to answer the latter question and show one intervention seems better than another does, then we may not know whether it is the treatment per se that is better, or whether it is the way in which the treatment is presented or given, or whether it is a combination of the two.
4 CONCLUDING COMMENTS This article illustrates the breadth of studies that may fall under the umbrella term of ‘‘open-label’’ and how we might go about minimizing the potential for bias in such studies. Partial blinding can sometimes be a (partial) solution; but it can cause trials (or treatment regimens) to be unlike what would be used in clinical practice. It is very difficult (if not impossible) to evaluate the extent of any bias such procedures might introduce. Finally, we should note that whereas fully blinded studies are typically considered the gold standard, in highly pragmatic trials, it may be the open-label nature of the treatments that is exactly the intervention we wish to study. Blinding in these situations would not be a good thing— the open-label study would be much preferred.
REFERENCES 1. D. G. Altman and K. F. Schulz, Concealing treatment allocation in randomized trials. Br. Med. J. 2001; 323: 446–447. 2. S. Day, Blinding or masking. In: P. Armitage and T. Colton (eds.), Encyclopedia of Biostatistics, 2nd ed., vol. 1. Chichester, UK: John Wiley and Sons, pp. 518–525.
3. S. J. Day and D. G. Altman, Blinding in clinical trials and other studies. Br. Med. J. 2000; 321: 504. 4. V. Berger, Selection Bias and Covariate Imbalances in Randomized Clinical Trials. Chichester, UK: John Wiley and Sons, 2005. 5. A. J. M. de Craen, P. J. Roos, A. L. de Vries, and J. Kleijnen, Effect of colour of drugs: systematic review of perceived effect of drugs and their effectiveness. Br. Med. J. 1996; 313: 1624–1625. 6. T. J. Kaptchuk, W. B. Stason, R. B. Davis, T. R. Legedza, R. N. Schnyer, C. E. Kerr, D. A. Stone, B. H. Nam, I. Kirsch, and R. H. Goldman, Sham device v inert pill: randomized controlled trial of two placebo treatments. Br. Med. J. 2006; 332: 391–397. 7. E. Ernst and K. L. Resch, Concept of true and perceived placebo effects. Br. Med. J. 1995; 311: 551–553. 8. D. L. Sackett, Bias in analytic research. J. Chronic Dis. 1979; 32: 51–63. 9. C. R. Freed, P. E. Greene, R. E. Breeze, W. Tsai, W. DuMouchel, R. Kao, S. Dillon, H. Winfield, S. Culver, J. Q. Trojanowski, D. Eidelberg, and S. Fahn, Transplantation of embryonic dopamine neurons for severe Parkinson’s disease. N. Engl. J. Med. 2001; 344: 710–719. 10. T. J. Schnitzer, G. R. Burmester, E. Mysler, M. C. Hochberg, M. Doherty, E. Ehrsam, X. Gitton, G. Krammer, B. Mellein, P. Matchaba, A. Gimona, and C. J. Hawkey, Comparison of lumiracoxib with naproxen and ibuprofen in the Therapeutic Arthritis Research and Gastrointestinal Event Trial (TARGET), reduction in ulcer complications: randomized controlled trial. Lancet 2004; 364: 665–674.
11. M. E. Farkouh, H. Kirshner, R. A. Harrington, S. Ruland, F. W. A. Verheugt, T. J. Schnitzer, G. R. Burmester, E. Mysler, M. C. Hochberg, M. Doherty, E. Ehrsam, X. Gitton, G. Krammer, B. Mellein, A. Gimona, P. Matchaba, C. J. Hawkey, and J. H. Chesebro, Comparison of lumiracoxib with naproxen and ibuprofen in the Therapeutic Arthritis Research and Gastrointestinal Event Trial (TARGET), cardiovascular outcomes: randomized controlled trial. Lancet 2004; 364: 675–684. 12. D. Schwartz, R. Flamant, and J. Lellouch, Clinical Trials (Trans. M. J. R. Healey). London: Academic Press Inc., 1980.
FURTHER READING S. Senn, Statistical Issues in Drug Development. Chichester, UK: John Wiley and Sons, 2007.
CROSS-REFERENCES
Active-controlled trial
clinical development plan
combination trials
non-inferiority trial
phase III trials
postmarketing surveillance
preference trials
quality of life
OPTIMAL BIOLOGICAL DOSE FOR MOLECULARLY TARGETED THERAPIES
CHUL AHN
Department of Clinical Sciences and Simmons Comprehensive Cancer Center
University of Texas Southwestern Medical Center
Dallas, Texas

SEUNG-HO KANG
Department of Statistics
Ewha Woman’s University
Seoul, South Korea

YANG XIE
Department of Clinical Sciences and Simmons Comprehensive Cancer Center
University of Texas Southwestern Medical Center
Dallas, Texas

The main purpose of a phase I clinical trial of a cytotoxic chemotherapeutic agent is ordinarily to find the highest dose with an acceptable rate of toxicity, often referred to as the maximum tolerated dose (MTD), of the new agent, which will be used as a recommended dose for experimentation in phase II efficacy studies. This recommended phase II dose is determined under the assumption that the higher the dose, the greater the antitumor activity. Thus, it is assumed that the intensity of the dose-toxicity curve is predictive of the therapeutic effect. Over the past decade, a considerable number of studies have been conducted to investigate the statistical properties of phase I clinical trials of cytotoxic anticancer drugs (1–15). The emergence of a growing number of molecularly targeted therapies as anticancer agents challenges the traditional phase I clinical trial paradigm in a variety of ways. The clinical development of cytotoxic agents is based on the assumption that the agents will shrink tumors and the shrinkage of tumors will prolong the progression-free survival and overall survival of cancer patients. However, cytotoxic agents that shrink tumors may kill normal cells in addition to cancer cells, so cytotoxic agents may lead to other organ damage and may eventually lead to shorter overall survival of cancer patients. In contrast, molecularly targeted agents demonstrate tumor growth inhibition but not tumor shrinkage. These agents may offer clinical benefits such as longer overall survival, progression-free survival, and better quality of life. Most molecularly targeted agents are less toxic than conventional cytotoxic agents. Thus, the maximum therapeutic effect may occur at doses well below the MTD. The intensity of the dose-toxicity curve may not be predictive of the therapeutic effect. Because dose-escalation is usually guided by toxicity in traditional phase I clinical trials, such designs may be inappropriate for optimizing the use of molecularly targeted drugs. We briefly review the phase I clinical trial designs for cytotoxic agents, and then investigate the designs for molecularly targeted agents.
1 PHASE I DOSE-FINDING DESIGNS FOR CYTOTOXIC AGENTS Phase I cancer clinical trials intend to rapidly identify the MTD of a new agent for further studies. The standard phase I clinical trial identifies the MTD through an algorithm-based dose-finding approach in which dose escalation and de-escalation depend on the number of patients experiencing dose-limiting toxicity (DLT). The standard 3 + 3 algorithm-based dose-finding approach has poor operating characteristics compared with the model-based dose-finding approaches such as the continual reassessment method (CRM) (1, 4). The major criticism of the standard phase I design is that the MTD has no interpretation as an estimate of the dose level that yields a specified toxicity rate. Kang and Ahn (7–9) show that the standard algorithm-based 3 + 3 design cannot provide the accurate estimates of the MTD when the specified target toxicity rate is high. In contrast to the common belief that the standard 3 + 3 design produces 33% toxicity rate at the MTD, Kang and Ahn (7–9)
and He et al. (5) have shown that the expected toxicity rate at the MTD is between 19% and 24%, regardless of target toxicity level. He et al. (5) proposed a model-based approach for the estimation of the MTD that follows a standard 3 + 3 design. They showed that the model-based approach yields a less biased estimate than the standard algorithm-based 3 + 3 design. O'Quigley et al. (12) proposed the CRM, which overcomes the problems of the standard 3 + 3 design by reducing the number of patients treated with possibly ineffective dose levels, and by yielding a dose level with a specified toxicity rate. The CRM design outperforms the standard 3 + 3 design, but it is difficult to implement in practice because it takes too long to complete the trial. The CRM treats one patient at a time, and clinicians are not comfortable with using a target dose close to the MTD as the starting dose for phase I clinical trials (10). Goodman et al. (4) accommodated these concerns by proposing a modified CRM. Goodman et al. (4) and Ahn (1) showed that the modified CRM reduces the duration of the trial by 50% to 67%, and reduces the toxicity incidence by 20% to 35%, relative to the original CRM. These designs require each patient to be followed for a certain period to observe toxicity. The next cohort of patients is assigned to the following dose level only when the full observation period of each cohort has been completed. Muler et al. (16) introduced the time-to-event continual reassessment method to eliminate the need for full observation of each patient before estimating the MTD. This method accounts for the time of the observation period as a proportion of the maximum length of observation. Patients without toxicity are weighted by that proportion, and patients with toxicity receive the full weight. These weights are applied to the likelihood used in the CRM to determine the MTD. There are other phase I designs proposed for cytotoxic agents such as escalation with overdose control (2), random walk rules (3), two-stage design (14), and decision-theoretic approaches (15). In spite of the criticisms of the standard 3 + 3 algorithm-based design, the standard 3 + 3 phase I design is still
widely used in most practical cases. The reason might be that the standard designs do not require elaborate statistical considerations and they have been in use by many investigators over the years. 2 PHASE I DOSE-FINDING DESIGNS FOR MOLECULARLY TARGETED AGENTS The recent emergence of molecularly targeted therapies has created major challenges in drug development. These newer agents are commonly referred to as having cytostatic effects because many of these agents show antimetastatic or growth-inhibitory effects instead of inducing rapid tumor regression (16–30). Because many of these agents slow or stop the growth of tumors and the development of metastases, the phase I clinical trial designs proposed for cytotoxic agents may not be effective in identifying the dose level that is clinically suitable for molecularly targeted agents. Because these agents act on highly specific targets that are differentially expressed or activated in cancer cells, they may have a very wide therapeutic ratio. The toxicity common with many cytotoxic drugs is not usually seen with molecularly targeted drugs. For molecularly targeted drugs, efficacy (such as target inhibition, pharmacodynamic effect, and immunologic response) is used as an alternative phase I endpoint to toxicity. The optimal biological dose (OBD) is usually defined as the dose level that gives the highest efficacy. The OBD is also defined as the dose level recommended for phase II clinical trials of molecularly targeted drugs (30). This recommended phase II dose is also referred to as the biological modulatory dose (19) or the biologically adequate dose (22). The OBD, the dose recommended for a phase II trial, may occur at doses that are well below the MTD. For example, bevacizumab (Avastin), a monoclonal antibody to vascular endothelial growth factor, was approved for the treatment of metastatic colorectal cancer by the U.S. Food and Drug Administration in February 2004. The MTD of bevacizumab monotherapy is 20 mg/kg due to the toxicity of severe migraine headache in some patients (17). In a randomized phase II trial of bevacizumab with chemotherapy, a 5 mg/kg dose
yielded a higher response rate, longer median progression-free survival, and longer overall survival in patients with metastatic colorectal carcinoma (18). 2.1 Dynamic De-escalating Designs Most molecularly targeted agents are less toxic than conventional cytotoxic agents, and as a result, the maximum therapeutic effect may occur at doses that are well below the MTD. Dowlati et al. (19) and Emmenegger and Kerbel (20) apply a dynamic de-escalating dosing strategy to determine the OBD for the agent SU5416, which is an oral small-molecule vascular endothelial growth factor receptor-2 inhibitor. The rationale for this novel dose de-escalation design is based on the fact that the MTD of SU5416 had been previously determined. The unique feature of this approach is to de-escalate to the OBD (referred to as a biological modulatory dose) based on pharmacodynamic information instead of toxicity. The approach first tries to show a pharmacodynamic effect at the MTD. Dose de-escalation is then carried out to investigate whether the lower dose exhibits the same amount of pharmacodynamic effect as the higher dose. If the lower dose exhibits the same effect, then the lower dose will be chosen as the preferred dose. Dowlati et al. (19) and Emmenegger and Kerbel (20) chose the following pharmacodynamic effects as significant for trial design: (1) a 35% reduction in microvessel density in sequential tumor biopsies, and (2) a 35% reduction in blood flow within the tumor as assessed by dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI). The approach of Dowlati et al. to finding the OBD was as follows: Ten patients are enrolled at the MTD. If at least 5 of the 10 patients show the intended pharmacodynamic effect, dose de-escalation will continue until a reduction in pharmacodynamic effect is observed. The dose above the level at which the dose de-escalation stops would be considered the OBD of the agent. The rationale of this design is that, if the lower dose displays biological response rates similar to those at the MTD, it qualifies as a candidate for the biologically adequate dose. The number of patients at each dose level is greater than that in the standard
3 + 3 design. The advantage of this approach over the standard escalating design is that very few patients will receive a drug dose less than the OBD. To be qualified as the biologically adequate dose in this approach, the number of biological responses at the lower dose must not be less than the number of MTD responses minus 2. However, the method is rather ad hoc. The statistical properties of the biologically adequate dose need to be investigated. 2.2 Dose Determination through Simultaneous Investigation of Efficacy and Toxicity Dynamic de-escalating design determines the OBD based on the determination of the MTD (19). That is, the dose level for the MTD should be known in advance to determine the dose level for the OBD. The dose de-escalation is only determined by the response rate in the dynamic de-escalating design. Instead of determining the MTD by toxicity and then de-escalating the dose to identify the OBD by efficacies (such as immunologic response and pharmacodynamic effect), we can determine the OBD by simultaneously investigating efficacy and toxicity from molecularly targeted agents. Suppose that the dose-efficacy curves of the agents are not monotonically increasing and the efficacy rates are at least 30%. The following design is a modified standard 3 + 3 design to accommodate the response rate for the determination of the OBD. This design, just like the standard 3 + 3 design, is based on intuitively reasonable rules instead of formally justifiable statistical rules. Each dosing level enrolls three patients. The design consists of two dose-escalation steps: Step 1 uses the standard dose escalation. If at least one response occurs in the cohort of patients in step 1, dose escalation approach is switched to step 2. Step 2 uses six patients per dose level. The following dose escalation approach is used for the determination of the OBD. Step 1 1. If, in any cohort of three patients, no patient experiences a DLT, the next cohort will be enrolled as follows.
A. If none of them has response, then the next cohort will be enrolled at the next higher dose level. B. If at least one of them has response, then switch to step 2. 2. If one patient experiences a DLT, then three additional patients will be enrolled at the same dose. A. If none of three additional patients experiences a DLT (i.e., a total of 5/6 do not experience a DLT), and none of the extended cohort of six patients experiences a response, the next cohort of three patients will be enrolled at the next higher dose level. B. If none of three additional patients experiences a DLT (i.e., a total of 5/6 do not experience a DLT), and at least one of the extended cohort of six patients experiences a response, then switch to step 2. C. If a DLT occurs in more than one patient among the additional three patients (for a total of ≥2/6), then the MTD is exceeded. 3. If a DLT occurs in two or more patients in a dosing cohort, then the MTD level is exceeded. Step 2 Patients are accrued in cohorts of six including the patients at the current dose level from step 1. That is, if only three patients are recruited at the dose level from step 1, three more patients are accrued at that dose level. 1. If zero or one out of six patients experiences a DLT, A. If no one has a response, then the OBD is exceeded, and the dose escalation is terminated. Three additional patients will be enrolled at the next lower dose level if only three patients are treated at that dose level. (Note that at least one response is observed at the dose level from step 1. However, at the other dose levels in step 2, no response may be observed out of six patients.)
B. If at least one patient has a response, then the dose is escalated in subsequent patients. 2. If at least two patients experience a DLT, then the MTD level is exceeded. Three additional patients will be enrolled at the next lower dose level if only three patients are treated at that dose level. When the MTD level is exceeded, the dose escalation is terminated. Then, the MTD is defined as the next lower dose level. Three more patients will be treated at the next lower dose level if only three patients are treated at that dose. The OBD of a molecular targeted drug is estimated as the dose level at or below the MTD with the highest response rate. Here, the standard 3 + 3 design is modified to accommodate response rates to identify a phase II dose. This modified 3 + 3 design can be generalized to any A + B design. Operating characteristics of this design need to be investigated. 2.3 Individualized Maximum Repeatable Dose (iMRD) Takahashi et al. (21) describe a dose-finding approach to identify an optimal dose, referred to as individualized maximum repeatable dose (iMRD). This design potentially incorporates both escalation and de-escalation steps. A starting dose is half the MTD, and then dose is de-escalated or escalated depending on the toxicity of the agent. The dose is escalated for grade 0 toxicity, maintained at the same dose level for grade 1 toxicity, and de-escalated for toxicity grade ≥2. The modifications are still toxicity guided but allow one to approach the iMRD, which is defined as the dose associated with minimal (grade ≤1) toxicity during chronic administration of the drug. Takahashi et al. (21) suggest that the iMRD is a simple method to identify a patient’s tailored chemotherapy dose and could be the optimal dose for patients with noncurable cancers such as metastatic pancreatic cancer. This approach is appealing because of its easy implementation and the antitumor effects seen. However, this approach does not address the need to find the recommended dose for the phase II study.
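The iMRD adjustment rule can be sketched in a few lines of code. The following Python fragment is an illustration only (the function name and the dose-level indexing are assumptions, not part of Takahashi et al.'s report): it escalates after a cycle with grade 0 toxicity, holds the dose after grade 1, and de-escalates after grade 2 or higher, while keeping the dose within the available levels.

def next_imrd_dose(current_level, worst_grade, n_levels):
    """Dose level for the next cycle under the iMRD rule described above:
    escalate after grade 0 toxicity, stay after grade 1, de-escalate after
    grade >= 2.  Levels are indexed 0 (lowest) to n_levels - 1 (highest)."""
    if worst_grade == 0:
        proposed = current_level + 1      # escalate
    elif worst_grade == 1:
        proposed = current_level          # maintain
    else:
        proposed = current_level - 1      # de-escalate
    # keep the proposed level inside the available dose range
    return max(0, min(proposed, n_levels - 1))

# Example: a patient starts at half the MTD (say level 2 of 5) and is followed
# over successive cycles with worst observed grades 0, 1, 2, 0.
level = 2
for grade in [0, 1, 2, 0]:
    level = next_imrd_dose(level, grade, n_levels=5)
    print(grade, "->", level)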
2.4 Proportion Designs and Slope Designs Korn et al. (23) noted that statistical trial designs for identifying the OBD may require more patients than those studied for phase I cytotoxic agents. To address this concern, Hunsberger et al. (22) proposed designs in which the goal is to find a biologically adequate dose under the assumption that the target response is a binary value determined in each patient. They defined an adequate dose as either a dose that yields a specific high response rate or a dose in the plateau, and they developed two types of designs, the proportion design and the slope design, respectively, to incorporate the two definitions of an adequate dose. In the proportion design, proportion [4/6], an adequate dose means a dose that produces a specific high response rate. Three patients are assigned to the first dose. One proceeds to the next higher dose level with a cohort of three patients when ≤1/3 responses are observed. Three more patients are treated at this dose level if ≥2/3 responses are observed. One continues escalation if ≤3/6 responses are observed. The dose level that yields ≥4/6 responses or the maximum dose level tested is considered to be the adequate dose and is recommended for future clinical trials. They also propose the slope design, which is intended to stop escalation if the target response rate seems to remain constant. The escalation design depends on the estimated slope of the regression line, using the response rate at each dose level as the dependent variable and the dose level as the independent variable. The dose with the highest response rate is the recommended dose to be used in subsequent clinical trials. To address the concern that more patients may be required to identify the OBD than those for a phase I cytotoxic agent, Hunsberger et al. (22) investigated the performance of the proportion design and slope designs through simulations. They defined an adequate dose as only a dose in the plateau, and assumed that there is little or no toxicity associated with the molecularly targeted drug being studied. They investigated the performance of the two designs with respect to how often the designs reach a plateau and treat fewer patients at inactive doses. Through limited simulations, the
designs were shown to perform adequately with only a few patients treated at each dose level. They suggested immediately switching to a dose-escalation approach based on cytotoxic agents if any DLT is observed. They also suggested the use of aggressive doseescalation steps if the agent is not expected to cause toxicities. The utility of these designs needs to be investigated by prospective evaluation in future phase I clinical trials of molecularly targeted agents. 2.5 Generalized Proportion Designs Kang et al. (24) investigated the statistical properties of the proportion designs that can be used to determine a biologically adequate dose of molecularly targeted agents. They proposed generalized proportion designs that have four parameters; they derived the exact formula for the probability of each dose level that is recommended for phase II clinical trials. They also derived the exact formula for the number of patients needed to complete the trial. Using the exact formulas, they computed the expected number of patients who will complete the trial and computed the expected response rate at the recommended dose for comparison with specific high response rates. In the proportion [4/6] design, Hunsberger et al. (22) considered de-escalation when the starting dose level had achieved ≥4/6 responses. However, such probability is negligible. Furthermore, de-escalation produces very complicated but unimportant terms in exact formulas, so Kang et al. (24) did not consider dose deescalation. Kang et al. (24) generalized the proportion designs as follows: 1. Escalate in cohorts of size A while ≤C/A responses are observed. 2. Treat B more patients at the same dose level if ≥(C + 1)/A responses are observed. 3. Continue escalation as in steps 1 and 2 if ≤D/(A + B) responses are observed. 4. Use the dose level that yields ≥(D + 1)/(A + B) responses as the recommended dose for phase II clinical trials. The proportion [4/6] and [5/6] designs in Hunsberger et al. (22) correspond to the cases
of (A, B, C, D) = (3, 3, 1, 3), (3, 3, 1, 4). To speed up dose escalation, Kang et al. (24) modified the generalized proportion designs by incorporating an accelerated design that uses single-patient cohorts until a response is observed. Accordingly, the modified generalized proportion designs are conducted as follows. One patient is assigned to the first dose level. The design proceeds to the next higher dose level with a single-patient cohort when a response is not observed. If the first response is observed at a dose level k, the accelerated design is converted into the standard proportion designs by assigning (A − 1) more patients so that A patients are assigned to the dose level k. The remaining steps are the same as those in the generalized proportion designs. Kang et al. (24) investigated the statistical properties of the modified generalized proportion design. Specifically, they computed the expected response rate at the recommended dose and the expected number of patients needed to complete the trial for each design, to determine which designs produced specific high response rates such as 60%, 70%, or 80%.
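As a concrete reading of the generalized proportion design, the sketch below simulates a single hypothetical trial under assumed true response probabilities; it ignores toxicity monitoring and the accelerated single-patient start, and all names and numbers are illustrative rather than taken from Kang et al. The proportion [4/6] design corresponds to (A, B, C, D) = (3, 3, 1, 3).

import random

def run_proportion_design(true_resp, A=3, B=3, C=1, D=3, seed=None):
    """Simulate one trial of the generalized proportion design: escalate in
    cohorts of A while <= C/A responses are seen, expand by B patients when
    >= C + 1 responses are seen, and recommend the dose with >= (D + 1)/(A + B)
    responses (or the highest dose tested).  `true_resp` holds assumed
    response probabilities for each dose level."""
    rng = random.Random(seed)
    n_treated = 0
    for level, p in enumerate(true_resp):
        responses = sum(rng.random() < p for _ in range(A))
        n_treated += A
        if responses >= C + 1:
            responses += sum(rng.random() < p for _ in range(B))
            n_treated += B
            if responses >= D + 1:
                return level, n_treated          # recommended phase II dose found
        # otherwise (<= C/A or <= D/(A + B) responses) continue escalating
    return len(true_resp) - 1, n_treated         # highest dose tested

# Hypothetical dose-response curve that plateaus around 60%.
level, n = run_proportion_design([0.1, 0.3, 0.5, 0.6, 0.6], seed=1)
print("recommended dose level:", level, "patients treated:", n)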
3 DISCUSSION
Drug development currently takes too long and costs too much because it is so unproductive. Most therapeutic drugs were developed with a lack of information on their molecular targets, which can be used to test the therapeutic efficacy (31). DiMasi et al. (32) estimated that the average cost of bringing a new drug from the time of investment to marketing in 2000 was U.S. $802 million. The genetic profile of a patient can improve the diagnosis of the underlying cause of the disease and allow the selection of a specific drug treatment, which will maximize drug efficacy with fewer serious adverse drug reactions (33). Biomarkers are very valuable in the early phases of clinical development for guidance in dosing and for selection of the lead compounds (34). Because the biomarker expression profile will rule out the use of molecular targeted drugs in some patients, this will increase the probability of success of target molecules and reduce the drug development cost. Biomarkers that can be used
to identify eligible patients for clinical trials, measure adverse drug reactions, and quantify drug efficacy are urgently needed to accelerate drug development. For a cytotoxic drug, toxicity is a major endpoint, and the MTD is usually easy to obtain. However, a cytostatic drug usually causes less acute toxicity because most of these agents are target specific. That is, the MTD based on acute toxicity will not be the optimal dose chosen for phase II evaluation of a cytostatic agent. For a cytostatic drug, we need the highest dose that allows chronic administration, which is likely to be different from the traditional acute MTD. Because a cytostatic drug is expected to be used for a prolonged period of time, the determination of the MTD and the OBD based on the first one or two cycles of chemotherapy is likely to be more problematic for the OBD than the MTD. In the early stage of clinical trials of a cytostatic drug, emphasis should be given to describing chronic toxicity (35). There is an increasing need for novel statistical designs for phase I clinical trials of molecularly targeted drugs as there is a growing need to determine a dose that yields optimal biological activity based on target inhibition or response rather than toxicity. The phase I clinical trial designs proposed for cytotoxic agents may not be effective in identifying the dose level that is clinically suitable for molecularly targeted agents. The MTD of molecularly targeted drugs may be higher than the dose level required to achieve the maximum desired biological activity. Determination of the OBD will provide more useful information for further drug development of molecularly targeted drugs. It will be of considerable interest to investigate the performances of dose-finding approaches for molecularly targeted agents. The utility of these designs warrants prospective evaluation in future clinical trials of molecularly targeted drugs. The first-generation, target-based anticancer drugs such as imatinib, trastuzumab, gefitinib are now regarded as established drugs. Combination therapies using a molecularly targeted drug with a conventional cytotoxic agent are frequently being tested (25). Dose-finding approaches for the combinatorial agents need to be developed, and the
performance of these agents must be thoroughly evaluated. REFERENCES 1. C. Ahn, An evaluation of phase I cancer clinical trial designs. Stat Med. 1998; 17: 1537–1549. 2. J. Babb, A. Rogatko, and S. Zacks, Cancer phase I clinical trials: efficient dose escalation with overdose control. Stat Med. 1998; 17: 1103–1120. 3. S. Durham, N. Flournoy, and W. Rosenberger, A random walk rule for phase I clinical trials. Biometrics. 1997; 53: 745–760. 4. S. Goodman, M. Zahurak, and S. Piantadosi, Some practical improvements in the continual reassessment method for phase I studies. Stat Med. 1995; 14: 1149–1161. 5. W. He, J. Liu, B. Binkowitz, and H. Quan, A model-based approach in the estimation of the maximum tolerated dose in phase I cancer clinical trials. Stat Med. 2006; 25: 2027–2042. 6. A. Ivanova, Escalation, group and A + B designs for dose-finding trials. Stat Med. 2006; 25: 3668–3678. 7. S. Kang and C. Ahn, The expected toxicity rate at the maximum tolerated dose in the standard phase I cancer clinical trial design. Drug Inf J. 2001; 35: 1189–1200. 8. S. Kang and C. Ahn, An investigation of the traditional algorithm-based designs for phase I cancer clinical trials. Drug Inf J. 2002; 36: 865–873. 9. S. Kang and C. Ahn, Phase I cancer clinical trials. In: S. Chow (ed.), Encyclopedia of Biopharmaceutical Statistics. Dekker; 2003; 1–6. DOI:10.1081/E-EBS120022143. 10. E. Korn, D. Midthune, T. Chen, L. Rubinstein, M. Christian, and R. Simon, A comparison of two phase I trial designs. Stat Med. 1994; 13: 1799–1806. 11. Y. Lin and W. Shih, Statistical properties of the traditional algorithm-based designs for phase I cancer clinical trials. Biostatistics. 2001; 2: 203–215. 12. J. O'Quigley, M. Pepe, and L. Fisher, Continual reassessment method: a practical design for phase I clinical trials in cancer. Biometrics. 1990; 46: 33–48. 13. T. Smith, J. Lee, H. Kantarjian, S. Legha, and M. Raber, Design and results of phase I cancer clinical trial: three-year experience at M.D. Anderson cancer center. J Clin Oncol. 1996; 4: 287–295.
14. B. Storer, Design and analysis of phase I clinical trials. Biometrics. 1989; 45: 925–937. 15. J. Whitehead, Bayesian decision procedures with application to dose-finding studies. Stat Med. 1997; 11: 201–208. 16. H. Muler, C. J. McGinn, D. Normolle, T. Lawrence, D. Brown, et al., Phase I trial using a time-to-event continual reassessment strategy for dose escalation of cisplatin combined with gemcitabine and radiation therapy in pancreatic cancer. J Clin Oncol. 2004; 22: 238–243. 17. M. A. Cobleigh, V. K. Langmuir, G. W. Sledge, K. D. Miller, L. Haney, et al., A phase I/II dose-escalation trial of bevacizumab in previously treated metastatic breast cancer. Semin Oncol. 2003; 30: 117–124. 18. F. Kabbinavar, H. I. Hurwitz, L. Fehrenbacher, N. J. Meropol, W. F. Novotny, et al., Phase II randomized trial comparing bevacizumab plus fluorouracil (FU)/leucovorin (LV) with FU/LV alone in patients with metastatic colorectal cancer. J Clin Oncol. 2003; 21: 60–65. 19. A. Dowlati, K. Robertson, T. Radivoyevitch, J. Waas, N. Ziats, et al., Novel phase I dose de-escalation design to determine the biological modulatory dose of the antiangiogenic agent SU5416. Clin Cancer Res. 2005; 11: 7938–7944. 20. U. Emmenegger and R. Kerbel, A dynamic dose de-escalating dosing strategy to determine the optimal biological dose for antiangiogenic drugs: commentary on Dowlati et al. Clin Cancer Res. 2005; 11: 7589–7592. 21. Y. Takahashi, M. Mai, N. Sawabu, and K. Nishioka, A pilot study of individualized maximum repeatable dose (iMRD), a new dose finding system, of weekly gemcitabine for patients with metastatic pancreas cancer. Pancreas. 2005; 30: 206–210. 22. S. Hunsberger, L. V. Rubinstein, J. Dancey, and E. Korn, Dose escalation trial designs based on a molecularly targeted endpoint. Stat Med. 2005; 14: 2171–2181. 23. E. Korn, S. G. Arbuck, J. M. Pluda, R. Simon, R. S. Kaplan, and M. C. Christian, Clinical trial designs for cytostatic agents: are new approaches needed? J Clin Oncol. 2001; 19: 265–272. 24. S. Kang, S. Lee, and C. Ahn, An investigation of the proportion designs based on a molecularly targeted endpoint. Drug Inf J. In press. 25. T. Yamanaka, T. Okamoto, Y. Ichinose, S. Oda, and Y. Maehara, Methodological aspects of current problems in target-based anticancer
drug development. Int J Clin Oncol. 2006; 11: 167–175.
26. E. Korn, Nontoxicity endpoints in phase I trial designs for targeted, non-cytotoxic agents. J Natl Cancer Inst. 2004; 96: 977–978. 27. W. R. Parulekar and E. A. Eisenhauer, Phase I design for solid tumor studies of targeted, non-cytotoxic agents: theory and practice. J Natl Cancer Inst. 2004; 96: 990–997. 28. Y. Shaked, U. Emmenegger, S. Man, D. Cervi, F. Bertolini, et al., Optimal biologic dose of metronomic chemotherapy regimens is associated with maximum antiangiogenic activity. Blood. 2005; 106: 3058–3061. 29. H. S. Friedman, D. M. Kokkinakis, J. Pluda, A. H. Friedman, I. Cokgor, et al., Phase I trial of O6-benzylguanine for patients undergoing surgery for malignant glioma. J Clin Oncol. 1998; 16: 3570–3575. 30. E. Deutsch, J. C. Soria, and J. P. Armand, New concepts for phase I trials: evaluating new drugs combined with radiation therapy. Nat Clin Pract Oncol. 2005; 2: 456–465. 31. U. Manne, R. Srivastava, and S. Srivastava, Recent advances in biomarkers for cancer diagnosis and treatment. Drug Discov Today. 2005; 10: 965–976. 32. J. A. DiMasi, R. W. Hansen, and H. G. Grabowski, The price of innovation: new estimates of drug development costs. J Health Econ. 2003; 22: 151–185. 33. C. Ahn, Pharmacogenomics in drug discovery and development. Genomics Inform. 2007; 5: 41–45. 34. R. Frank and R. Hargreaves, Clinical biomarkers in drug discovery and development. Nat Rev Drug Discov. 2003; 2: 566–580. 35. R. Hoekstra, J. Verweij, and F. Eskens, Clinical trial design for target specific anticancer agents. Invest New Drugs. 2003; 21: 243–250.
FURTHER READING E. Fox, G. A. Curt, and F. M. Balis, Clinical trial design for target-based therapy. Oncologist. 2002; 7: 401–409. S. Kummar, M. Gutierrez, J. H. Doroshow, and A. J. Murgo, Drug development in oncology: classical cytotoxics and molecularly targeted agents. Br J Clin Pharmacol. 2006; 62: 15–26. M. Ranson and G. Jayson, Targeted antitumour therapy future perspectives. Br J Cancer. 2005; 92(Suppl 1): S28–S31.
A. Stone, C. Wheeler, and A. Barge, Improving the design of phase II trials of cytostatic anticancer agents. Contemp Clin Trials. 2007; 28: 138–145.
CROSS-REFERENCES
Dose-escalation design
Maximum tolerated dose
Optimal biological dose
Phase I trials
Cytotoxic drug
Cytostatic drug
OPTIMIZING SCHEDULE OF ADMINISTRATION IN PHASE I CLINICAL TRIALS
THOMAS M. BRAUN Department of Biostatistics, School of Public Health University of Michigan Ann Arbor, Michigan
PETER F. THALL Department of Biostatistics and Applied Mathematics, University of Texas M. D. Anderson Cancer Center Houston, Texas
In conventional phase I studies, each patient is assigned to a dose of an experimental agent under study, receiving either a single administration or a course consisting of several administrations of the agent. Patients are followed for a relatively short period of time during which a dose-limiting toxicity (DLT) may occur. At the end of the follow-up period, a binary outcome indicating the presence or absence of DLT is recorded for each patient. Generally, the maximum tolerable dose (MTD) is considered the largest dose that is ‘‘safe,’’ that is, that does not present a practical limitation to therapy. Many designs using this approach exist (1–5). These designs have seen widespread use largely because they facilitate adaptive dose-finding methods wherein doses and outcomes of previous patients are used to select doses for new patients. However, these designs have some limitations due to their simplified representation of actual clinical practice. First, in many clinical settings, physicians administer an agent more than once to a patient and monitor the long-term toxicity related to the cumulative effects of the agent. In such settings, the nominal ‘‘dose’’ is actually the dose given per administration, or the total dose given over a fixed number of administrations. For example, multiple cycles of chemotherapy may be administered with the aim to effectively destroy all tumor cells. Similarly, a prophylactic agent may be given repeatedly during the period in which a preventable disease may occur. In both examples, an MTD based on a single course of treatment may prove to be overly toxic when given over multiple courses. Consider a setting in which conventional dose-finding is done based on one course with a fixed schedule when in fact a safe dose d exists with three courses. If d has substantive antidisease effect with three courses whereas d with only one course does not, then the conventional MTD of one course may lead to the erroneous conclusion in later studies that the agent is ineffective. Similarly, if conventional dose-finding is done with four courses of each dose and it turns out that the lowest dose is excessively toxic, then it may be concluded erroneously that the agent is unsafe at any dose simply because shorter schedules were not examined. Furthermore, it may be the case that two doses will prove to be equally safe if different administration schedules are applied to each. Second, most existing designs require that toxicity be evaluated quickly enough so that each enrolled patient is fully evaluated for DLT before a new patient enters the study. One exception is the TITE-CRM (6), which evaluates long-term toxicity and allows new patients to enroll before all previously enrolled patients have completed observation. However, like other phase I trial designs, the TITE-CRM does not accommodate settings where multiple schedules are studied. Specifically, the TITE-CRM allows the dose to vary across patients while keeping the schedule fixed; our method allows the schedule to vary across patients while keeping the dose fixed. One could consider assessing multiple schedules with the TITE-CRM by treating each schedule as a ‘‘dose’’ and determining the maximum tolerable schedule (MTS) with study-specific modifications as described by Braun, Levine, and Ferrara (7). However, by considering each schedule to be a dose, patients who receive an incomplete schedule essentially have received a partial ‘‘dose.’’ To force this situation into the framework of the TITE-CRM, a patient can only be evaluated up to the point of his or her last fully completed schedule. Furthermore, if there is an additional follow-up
period once each schedule has been completed, the ‘‘doses’’ will overlap, leading to ambiguity as to which dose contributed to a late-onset toxicity. Third, existing phase I trial designs assume that the course of therapy, once assigned to a patient, will not be altered during the study. However, in actual practice, most phase I trials allow modifications to or delays of administration times if a patient has experienced a low-grade (non–dose limiting) toxicity previously during the treatment process. Thus, patients who experience a DLT after one or more treatment modifications will not fit the simplified framework of phase I trial designs unless the DLT is associated with the originally planned treatment schedule. This relies on the implicit assumption that altering a patient’s treatment schedule has no effect on the probability of a DLT. Such an assumption is dubious, given the wellknown schedule dependence of many agents. Nonetheless, our method does assume that if a patient’s treatment is terminated early for reasons unrelated to DLT, that patient is considered to be fully followed without DLT up to the point of their last administration. Thus, we implicitly assume that patient withdrawal is noninformative with regard to the likelihood for DLT. Such an assumption is reasonable, as patients in most phase I trials are at high risk for disease-related mortality, and complete withdrawal of treatment for reasons unrelated to toxicity is extremely rare. Due to the limitations of conventional phase I methods described above, a paradigm shift is needed so that phase I trial designs may better reflect actual clinical practice and thus lead to better decision-making in phase I. Hopefully, this will in turn decrease the false-negative rate in phase II trials so that true treatment advances are more likely to be detected. Because this new paradigm accounts for the actual times when a patient receives the agent, the patient’s time to toxicity will be used as the outcome instead of the usual binary indicator that toxicity has occurred. The hazard of toxicity is modeled as the sum of a sequence of hazards, each associated with one administration of the agent. We determine the MTS that the patient may
receive based on the risk of toxicity occurring within a specified follow-up period that includes the maximum schedule being considered. Patient accrual, data monitoring, and outcome-adaptive decision-making are done continuously throughout the trial under a Bayesian formulation. Each time a new patient is accrued, the most recent data are used to evaluate criteria that define the MTS, which is assigned to the new patient. 1 MOTIVATING EXAMPLE The schedule-finding method was originally motivated by a study conducted at the University of Michigan involving allogeneic bone marrow transplant (BMT) patients, who are at risk of acute graft-versus-host disease (aGVHD), an immune disease leading to damage to the patient's skin, liver, and/or gastrointestinal (GI) tract and possible early mortality. Preclinical studies (8) have demonstrated that recombinant human keratinocyte growth factor (KGF) markedly reduces chemotherapy-induced or radiation-induced injury to the mucosal lining of the lower GI tract. Investigators therefore theorized that patients are less susceptible to developing aGVHD in the GI tract if given KGF (9, 10). However, it is unknown how frequently KGF can be given safely without causing an unacceptable rate of grade 4 toxicities, such as rash, edema, and increases in amylase and lipase, both of which are indicators of pancreas dysfunction. An earlier unpublished phase I study in BMT patients determined an MTD of 60 mg/kg of KGF when administered using a 10-day schedule of 3-days-on/4-days-off/3-days-on, which we denote by (3+, 4−, 3+). Although one course of KGF using the (3+, 4−, 3+) schedule is considered safe, investigators believed that this schedule would not provide sufficient prophylaxis for aGVHD, which may take up to roughly 100 days after BMT to develop. Consequently, the investigators wished to study the safety of multiple courses of KGF, with 4 days of rest between consecutive courses, and focus on toxicity associated with the entire period of therapy. Thus, two courses would consist of the 24-day schedule (3+, 4−, 3+, 4−, 3+, 4−, 3+),
and so on. This study was planned with a maximum enrollment of 30 patients. There are many published schedule-finding studies (11–13), most of which use a crude design enrolling small cohorts of patients on predefined schedules, estimating the DLT probability for each schedule independently of the other schedules, and selecting the schedule with the optimal toxicity profile. The weaknesses of such an approach arise from the lack of an assumed regression model incorporating different schedules, preventing the borrowing of strength among schedules. As a result, a very small number of patients receive each schedule, the estimator of the DLT probability at each schedule is unreliable, and one may not extrapolate the results of the study to schedules that were not examined. Additionally, the absence of a sequentially outcome-adaptive rule for assigning patients to schedules limits the safety of any trial using such a design. Just as model-based methods such as the CRM and its counterparts improve upon conventional ‘‘3 + 3’’ dose-finding algorithms, the MTS model and decision-making paradigm greatly improve upon existing methods available for schedule-finding studies.
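Returning to the KGF example, the candidate administration schedules can be written down explicitly. The short sketch below (hypothetical helper name; days counted from the start of treatment, as in the trial description) generates k courses of the (3+, 4−, 3+) schedule with 4 rest days between courses, so that one course occupies days 1–10 and two courses occupy days 1–24.

def kgf_schedule(n_courses):
    """Administration days for n_courses of the (3+, 4-, 3+) schedule,
    with 4 rest days between consecutive courses.  One course gives days
    1, 2, 3, 8, 9, 10; each later course is shifted by 14 days."""
    one_course = [1, 2, 3, 8, 9, 10]
    days = []
    for k in range(n_courses):
        days.extend(d + 14 * k for d in one_course)
    return days

print(kgf_schedule(1))  # [1, 2, 3, 8, 9, 10]
print(kgf_schedule(2))  # [1, 2, 3, 8, 9, 10, 15, 16, 17, 22, 23, 24]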
2 DESIGN ISSUES
As mentioned earlier, the MTS method is adaptive in its selection of the schedule assigned to each patient, using Bayesian methods to continually update the MTS. As a result, we need two pieces of information: [1] a parametric model to describe how the risk of toxicity is related to the duration and times of administration under each schedule, and [2] prior distributions that describe plausible values for the parameters in the schedule/toxicity model. For illustrative purposes, assume the setting of the KGF trial. Let t* denote any given time from the start of the trial when one must evaluate the data and make a decision, including what schedule to give a newly accrued patient or whether to terminate the trial early. Let n* denote the number of patients enrolled up to t*. For i = 1, 2, . . . , n*, the most recent data collected on patient i are Di = (si , Yi , δ i ), where si is a vector of administration times, Yi is the amount
of follow-up, and δi is an indicator of whether that patient has suffered toxicity. Thus, si = si,1, . . . , si,mi lists the administration times since study entry for patient i. Under this general notation, the agent can be administered whenever and as frequently as desired to each patient, and an arbitrary number of different treatment sequences can be studied. This allows some of a patient's actual administration times to deviate from his or her planned times. However, we assume that the investigators are interested in selecting among a specific set of treatment sequences, s(1), . . . , s(k). For example, in the KGF trial, one course of the (3+, 4−, 3+) schedule corresponds to s(1) = (1, 2, 3, 8, 9, 10), two courses correspond to s(2) = (1, 2, 3, 8, 9, 10, 15, 16, 17, 22, 23, 24) = (s(1), s(1) + 14), and so on. The data collected on all patients up to time t* are entered into a probability model that defines the likelihood of toxicity during the follow-up period of each patient. We now describe the first component of our design: a parametric model describing the risk of toxicity for each schedule. 2.1 Single-Administration Hazard We assume that the hazards for multiple administrations are additive and identical. Therefore, our problem is reduced to stating the hazard of toxicity for a single administration of the agent. Although this hazard can be quite general, we make the simplifying assumption that the hazard of toxicity from a single administration has a finite duration after which it completely vanishes to zero. In fact, this correctly reflects most clinical applications. Under this assumption, one cannot model the hazard with a typical parametric lifetime distribution, such as the gamma or Weibull, unless the distribution is truncated appropriately. As a simple, practical alternative, we assume that the hazard increases linearly to a maximum and decreases linearly thereafter. Specifically, we define the hazard function as

h(u \mid \theta) =
\begin{cases}
\dfrac{2\theta_1}{\theta_2 + \theta_3} \cdot \dfrac{u}{\theta_2}, & 0 \le u \le \theta_2, \\
\dfrac{2\theta_1}{\theta_2 + \theta_3} \cdot \dfrac{\theta_2 + \theta_3 - u}{\theta_3}, & \theta_2 < u \le \theta_2 + \theta_3, \\
0, & u > \theta_2 + \theta_3 \text{ or } u < 0.
\end{cases}
\qquad (1)
Thus, our model uses three parameters (θ 1 , θ 2 , θ 3 ) to fully describe a triangle with base of length θ 2 + θ 3 and area θ 1 , with the height of the triangle occurring at time θ 2 from administration. Figure 1 illustrates this function. Other possible forms for the hazard are described in Braun, Yuan, and Thall (14), which also contains an alternate parameterization for Equation (1). Because θ 1 is the area of a triangle representing the risk of toxicity, it quantifies the cumulative hazard of a single administration. Thus, the cumulative hazard of toxicity for each patient is simply a multiple of θ 1 , where the multiplier is the number of administrations received. Any administration given within θ 2 + θ 3 days of t* would constitute an area less than θ 1 , but the actual area can be derived easily using geometric properties of a triangle. As stated earlier, we assume additive hazards across all administrations,
as illustrated in Figure 2, where the dashed lines represent the hazard of toxicity for each of six administrations given at enrollment and at 1, 2, 7, 8, and 9 days after enrollment. The solid curve is the cumulative hazard of toxicity from all administrations, computed from the sum of the heights of each triangle underneath it. The shaded area displays the cumulative hazard of toxicity 10 days after enrollment. Because the hazard of toxicity for each single administration in Figure 2 lasts a total of 18 days, each administration contributes a fraction of θ 1 to the cumulative hazard 10 days after enrollment. Only once a patient is observed beyond 18 days from enrollment will administrations start to contribute fully to the cumulative hazard. For this model, we now need to identify prior distributions that describe plausible values for the three parameters θ 1 , θ 2 , and θ 3 .
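Written out explicitly, the quantity that Figure 2 illustrates is the sum of the single-administration cumulative hazards. The display below is a sketch consistent with this prose description; the article itself does not show these formulas, and the final line assumes the usual relation between a cumulative hazard and an event-time distribution.

H(t \mid \theta, s_i) \;=\; \sum_{j=1}^{m_i} \int_{0}^{\max(0,\, t - s_{ij})} h(u \mid \theta)\, du,
\qquad \text{with each term of the sum at most } \theta_1,

\Pr(\text{toxicity by time } t \mid \theta, s_i) \;=\; 1 - \exp\{-H(t \mid \theta, s_i)\}.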
Figure 1. Parametric hazard function for a single administration of an agent. (The figure plots the hazard of toxicity against time in days; the hazard peaks at 2θ1/(θ2 + θ3) at time θ2 and returns to zero at time θ2 + θ3.)

Figure 2. Cumulative hazard of toxicity 10 days after enrollment for a patient who received administrations at enrollment and 1, 2, 7, 8, and 9 days after enrollment. (The figure plots the hazard of toxicity against time in days.)
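A numerical sketch of these calculations may also help. The Python fragment below is purely illustrative (function names and parameter values are assumptions, not the authors' software): it evaluates the triangular hazard of Equation (1) and accumulates each administration's contribution by a crude Riemann sum, using the administration days of Figure 2 and a hazard that vanishes 18 days after dosing.

def single_admin_hazard(u, theta1, theta2, theta3):
    """Triangular hazard of Equation (1) at time u since one administration."""
    peak = 2.0 * theta1 / (theta2 + theta3)      # height of the triangle
    if 0.0 <= u <= theta2:
        return peak * u / theta2
    if theta2 < u <= theta2 + theta3:
        return peak * (theta2 + theta3 - u) / theta3
    return 0.0                                   # no risk before dosing or after theta2 + theta3

def cumulative_hazard(t, admin_days, theta1, theta2, theta3, step=0.01):
    """Sum of single-administration cumulative hazards at follow-up time t,
    approximated by a simple Riemann sum (each administration contributes
    at most theta1)."""
    total = 0.0
    for s in admin_days:
        u = 0.0
        while u < t - s:
            total += single_admin_hazard(u, theta1, theta2, theta3) * step
            u += step
    return total

# Administrations at enrollment and 1, 2, 7, 8, and 9 days later, as in Figure 2;
# the hazard parameters here are purely illustrative (theta2 + theta3 = 18 days).
H10 = cumulative_hazard(10.0, [0, 1, 2, 7, 8, 9], theta1=0.02, theta2=2.0, theta3=16.0)
print("cumulative hazard at day 10:", round(H10, 4))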
2.2 Developing Prior Distributions The prior distributions of the parameters must be sufficiently uninformative so that they are dominated by the data and the MTS algorithm is able to provide a safe and reliable design. The actual mathematical forms for the prior distributions, although useful, are less important than the numerical hyperparameter values selected for each distribution. For example, because both θ 2 and θ 3 measure spans of time and will take non-zero values, one could assume that these parameters have lognormal prior distributions. Alternatively, one could place an upper bound on the distribution of θ 3 and instead use a generalized beta distribution for the prior of θ 3 , as described by Braun, Yuan, and Thall (14). The influence of the prior on the MTS algorithm may be quantified by the mean and variance hyperparameter values selected for the priors. Appropriate values for the prior mean and variance hyperparameters may be elicited from the investigators in many ways (15, 16), although in general, it is easiest to elicit values on domains with which the investigator is familiar (17). For example, the investigator can supply the expected duration of time after administration when the hazard will reach its maximum; this value can serve as the mean for the prior of θ 2 . Similarly, the expected duration of time required for the hazard to completely vanish or become negligible would serve as the prior mean for θ 3 . The investigator could also supply a range of plausible values for each parameter, and those ranges could be used to develop each parameter’s prior variance (14). The ability of the data to dominate the prior distributions is heavily influenced by the variances of those distributions, and it is essential to carefully evaluate the how changing the values of the variances impacts the MTS algorithm’s performance. This sensitivity analysis may be carried out by simulating the toxicity times of a small number of patients, comparing the prior means with their respective posterior values and evaluating how sensitive the MTS algorithm is to a small amount of data. For example, if the prior reflects the belief that toxicity is unlikely beyond 25 days after administration,
simulating a few patients to have toxicities that occur far beyond 25 days allows one to determine whether the prior allows the posterior mean of θ 3 to shift beyond its prior mean and reflect the data appropriately. If not, the prior variances may be calibrated and the exercise repeated until the desired effect is achieved. This exercise demonstrates the important point that the prior variances cannot be made arbitrarily large, as is usually done with Bayesian analyses of large datasets. In any small-scale clinical trial using adaptive methods, very little data are available, especially early in the trial. If there is substantial prior probability mass over too broad a range, this often cannot be overcome by a small amount of data, depending on the particular model, data structure, and decision-making algorithm. In the present setting, unduly large prior variances would severely hinder the algorithm’s ability to assign optimal schedules during the trial, especially early in the trial, and also may degrade the method’s ability to select an optimal MTS at the end. At each evaluation, the likelihood of the data collected on all enrolled patients is combined with the prior distribution to derive a posterior distribution for the parameters in the hazard. Because the posterior cannot be obtained analytically under our assumed model, we compute posterior quantities via Markov chain Monte Carlo (MCMC) methods (18); more details can be found in Braun, Yuan, and Thall (14). 2.3 Maximum Sample Size The maximum sample size proposed for any schedule-finding study should be ascertained during the design stage with an exhaustive simulation study, examining a variety of settings and study design parameters until adequate performance of the MTS algorithm is observed. Specifically, using a range of sample sizes, one should assess: [a] how frequently the algorithm correctly identifies the MTS when it exists, and [b] how frequently the algorithm terminates early when all schedules are overly toxic. Although the largest sample size will always provide the most reliable results, patient resources are typically limited in phase I trials, and it may
be that, for example, a decrease of 10 patients leads to a negligible decrease in the values seen for [a] and [b], showing an acceptable trade-off between performance of the MTS algorithm and the duration and cost of the actual study.
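Before turning to trial conduct, the elicitation step of section 2.2 can also be made concrete. The sketch below is a hypothetical helper (not from Braun, Yuan, and Thall) that turns an elicited prior mean and a plausible range for a positive parameter such as θ2 or θ3 into lognormal hyperparameters; the resulting prior is exactly the kind of starting point that the sensitivity analysis and simulations described above would then stress-test and, if needed, recalibrate.

import math

def lognormal_hyperparams(prior_mean, plausible_low, plausible_high, coverage=0.95):
    """Rough lognormal (mu, sigma) for a positive parameter, chosen so the
    distribution has the elicited mean and the stated range is treated as a
    central `coverage` interval on the log scale (a crude moment-style match)."""
    z = 1.96 if coverage == 0.95 else 1.645       # normal quantile for the interval
    sigma = (math.log(plausible_high) - math.log(plausible_low)) / (2.0 * z)
    mu = math.log(prior_mean) - 0.5 * sigma ** 2  # so that E[X] = exp(mu + sigma^2 / 2) = prior_mean
    return mu, sigma

# Example: peak hazard expected about 2 days after administration, plausibly
# anywhere between 1 and 6 days (the numbers are illustrative only).
mu, sigma = lognormal_hyperparams(prior_mean=2.0, plausible_low=1.0, plausible_high=6.0)
print(round(mu, 3), round(sigma, 3))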
3 TRIAL CONDUCT
A maximum of N patients are enrolled in the trial, with each patient assigned a treatment administration sequence upon arrival. The first patient is assigned the shortest sequence. Each patient is followed for a maximum of T days, with treatment terminated if toxicity is observed. From the posterior distribution of the parameters, the cumulative probability of toxicity by time T for each schedule is computed. Given a desired target probability p, which typically is in the range 0.10 to 0.30, depending upon the nature of the DLT, we consider two alternative criteria for choosing each patient’s sequence. The first criterion defines the MTS as that schedule whose average posterior probability of toxicity is closest to the target p. This criterion is analogous to the CRM criterion (2) based upon the posterior mean probability of the more usual binary indicator of toxicity. The second criterion first computes the percentage of the posterior distribution of each schedule that lies above the target p. The MTS is defined as the schedule whose posterior percentage is closest to, but no larger than, a threshold q. The second criterion is similar to the acceptability criteria used by Thall and Russell (19), and Thall and Cook (20) for dose-finding based on both efficacy and toxicity, and it also is similar to the criterion for overdose control proposed by Babb, Rogatko, and Zacks (4). Under either criterion, the current MTS is assigned to the next patient enrolled. If none of the schedules meet the criterion used, we conclude that none of the schedules are safe and the study is terminated. A variety of other stopping rules similar to those used with the CRM (21, 22) can easily be adapted for use with the MTS algorithm. Assuming the trial does not terminate prematurely, the MTS at the end of the trial is defined as the best sequence based on the complete data from all N patients.
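Both selection criteria reduce to simple summaries of the posterior draws of each schedule's toxicity probability by time T. The sketch below is an illustration under assumed inputs (variable names, the toy draws, and the default threshold q are not taken from the article).

def choose_mts(post_draws, target_p, q=0.80):
    """post_draws[k] is a list of posterior draws of Pr(toxicity by T) for
    schedule k.  Returns (MTS under criterion 1, MTS under criterion 2);
    the second entry is None when no schedule is acceptable."""
    # Criterion 1: posterior mean toxicity probability closest to the target.
    means = [sum(d) / len(d) for d in post_draws]
    mts1 = min(range(len(post_draws)), key=lambda k: abs(means[k] - target_p))

    # Criterion 2: posterior probability of exceeding the target closest to,
    # but not larger than, the threshold q.
    exceed = [sum(x > target_p for x in d) / len(d) for d in post_draws]
    acceptable = [k for k in range(len(post_draws)) if exceed[k] <= q]
    mts2 = max(acceptable, key=lambda k: exceed[k]) if acceptable else None
    return mts1, mts2

# Toy posterior draws for three schedules (illustrative numbers only).
draws = [[0.05, 0.10, 0.08], [0.18, 0.22, 0.25], [0.35, 0.40, 0.30]]
print(choose_mts(draws, target_p=0.20))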
To protect patient safety, we impose the additional constraint that escalation can be at most one course longer than the schedules already assigned to previous patients. This restriction on escalation only applies to untried schedules, and we place no restriction on de-escalation of schedule. We could also slow escalation (and prolong the length of the study) by requiring each schedule to be assigned to a cohort of M patients before escalation will be considered. For example, suppose we enroll the first patient on the lowest schedule and 20 days later this patient has not yet experienced a DLT and investigators wish to enroll a second patient. Using our algorithm and cohorts of size M = 1, investigators could enroll the newest patient on either the first or second schedules, depending upon which schedule best satisfies our safety criterion. Once a third patient is eligible, the information from the first two patients would be used to determine if the study should remain at the second schedule, escalate to the third schedule, or de-escalate back to the first schedule. If we instead set M = 2, then a minimum of four enrolled patients would be required before a patient could be assigned to the third schedule. Regardless of the cohort size, the algorithm will eventually identify a neighborhood of the correct MTS and continue to enroll patients in that neighborhood until the study reaches its maximum sample size. Simulation studies of the MTS paradigm demonstrate its excellent operating characteristics (14), provided that the prior has been properly calibrated. The MTS algorithm has a high probability of correctly identifying optimal schedules while maintaining an observed toxicity probability close to the targeted value, although the specific performance is a function of the number of schedules examined and the range of their cumulative toxicity probabilities. A critical issue illustrated by the simulations is that the MTS method is superior to the CRM and any other method that searches for an optimal dose but does not allow schedule to vary. Simply put, if investigators fail to examine the optimal schedule in the study, the study is doomed to fail at its very inception. The MTS method provides a safe, flexible way to examine several schedules to increase the
likelihood of finding the optimal one. In turn, this should increase the number of possibly efficacious agents that reach phase II study.
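The escalation restriction and cohort-size rule described above can likewise be sketched. The fragment below is hypothetical (names and indexing are assumptions): it lists the schedules that may be assigned to the next patient, allowing any previously tried schedule and opening the next untried schedule only after the highest schedule tried so far has been given to a full cohort of M patients; de-escalation is unrestricted.

def assignable_schedules(highest_tried, n_schedules, patients_on_highest, M=1):
    """Schedules (0-indexed) that may be assigned to the next patient: any
    schedule already tried, plus the next untried schedule once the current
    highest schedule has been assigned to a full cohort of M patients."""
    allowed = list(range(highest_tried + 1))              # all previously tried schedules
    if highest_tried + 1 < n_schedules and patients_on_highest >= M:
        allowed.append(highest_tried + 1)                 # escalate by at most one schedule
    return allowed

print(assignable_schedules(highest_tried=1, n_schedules=5, patients_on_highest=1, M=2))
# -> [0, 1]   (a second patient is needed on schedule 1 before schedule 2 opens)
print(assignable_schedules(highest_tried=1, n_schedules=5, patients_on_highest=2, M=2))
# -> [0, 1, 2]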
REFERENCES 1. B. E. Storer, Design and analysis of phase I clinical trials. Biometrics. 1989; 45: 925–937. 2. J. O'Quigley, M. Pepe, and L. Fisher, Continual reassessment method: a practical design for phase I clinical trials in cancer. Biometrics. 1990; 46: 33–48. 3. S. Goodman, M. Zahurak, and S. Piantadosi, Some practical improvements in the continual reassessment method for phase I studies. Stat Med. 1995; 14: 1149–1161. 4. J. Babb, A. Rogatko, and S. Zacks, Cancer phase I clinical trials: efficient dose escalation with overdose control. Stat Med. 1998; 17: 1103–1120. 5. S. D. Durham, N. Flournoy, and W. F. Rosenberger, A random walk rule for phase I clinical trials. Biometrics. 1997; 53: 745–760. 6. Y. Cheung and R. Chappell, Sequential designs for phase I clinical trials with late-onset toxicities. Biometrics. 2000; 56: 1177–1182. 7. T. M. Braun, J. E. Levine, and J. L. M. Ferrara, Determining a maximum tolerated cumulative dose: Dose reassignment within the TITE-CRM. Control Clin Trials. 2003; 24: 669–681. 8. C. L. Farrell, J. V. Bready, K. L. Rex, J. N. Chen, C. R. DiPalma, et al., Keratinocyte growth factor protects mice from chemotherapy and radiation-induced gastrointestinal injury and mortality. Cancer Res. 1998; 58: 933–939. 9. A. Panoskaltsis-Mortari, D. Lacey, D. Vallera, and B. Blazar, Keratinocyte growth factor administered before conditioning ameliorates graft-versus-host disease after allogeneic bone marrow transplantation in mice. Blood. 1998; 92: 3960–3967. 10. O. I. Krijanovski, G. R. Hill, K. R. Cooke, T. Teshima, J. M. Crawford, et al., Keratinocyte growth factor separates graft-versus-leukemia effects from graft-versus-host disease. Blood. 1999; 94: 825–831. 11. C. Benz, T. Tillis, E. Tattelman, and E. Cadman, Optimal schedule of methotrexate and 5-fluorouracil in human breast cancer. Cancer Res. 1982; 42: 2081–2086. 12. H. Soda, M. Oka, M. Fukada, A. Kinoshita, A. Sakamoto, et al., Optimal schedule for administering granulocyte colony-stimulating factor in chemotherapy-induced neutropenia in non-small-cell lung cancer. Cancer Chemother Pharmacol. 1996; 38: 9–12. 13. J. C. Byrd, T. Murphy, R. S. Howard, M. S. Lucas, A. Goodrich, et al., Rituximab using a thrice weekly dosing schedule in B-cell chronic lymphocytic leukemia and small lymphocytic lymphoma demonstrates clinical activity and acceptable toxicity. J Clin Oncol. 2001; 19: 2153–2164. 14. T. M. Braun, Z. Yuan, and P. F. Thall, Determining a maximum-tolerated schedule of a cytotoxic agent. Biometrics. 2005; 61: 335–343. 15. C. P. Robert, The Bayesian Choice, 2nd ed. New York: Springer, 2001.
16. A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis, 2nd ed. Boca Raton, FL: Chapman and Hall/CRC Press, 2004. 17. R. K. Tsutakawa and H. Y. Lin, Bayesian estimation of item response curves. Psychometrika. 1986; 51: 251–267. 18. C. P. Robert and G. Casella, Monte Carlo Statistical Methods. New York: Springer, 1999. 19. P. F. Thall and K. T. Russell, A strategy for dose-finding and safety monitoring based on efficacy and adverse outcomes in phase I/II clinical trials. Biometrics. 1998; 54: 251–264. 20. P. F. Thall and J. D. Cook, Dose-finding based on efficacy-toxicity trade-offs. Biometrics. 2004; 60: 684–693. 21. J. O'Quigley and E. Reiner, A stopping rule for the continual reassessment method. Biometrika. 1998; 85: 741–748. 22. S. Zohar and S. Chevret, The continual reassessment method: comparison of Bayesian stopping rules for dose-ranging studies. Stat Med. 2001; 20: 2827–2843.
CROSS-REFERENCES
Phase I trials
Bayesian approach
Adaptive design
Dose-escalation trials
ORPHAN DRUG ACT (ODA)

The Orphan Drug Act (ODA), signed into U.S. federal law on January 4, 1983, has helped to bring over 100 orphan drugs and biological products to market. The intent of the Orphan Drug Act was to stimulate the research, development, and approval of products that treat rare diseases. This mission is accomplished through several mechanisms:

• Sponsors are granted 7 years of marketing exclusivity after approval of an orphan drug product.
• Sponsors are granted tax incentives for clinical research.
• The U.S. Food and Drug Administration’s Office of Orphan Products Development (OOPD) will assist the drug’s sponsors with study designs.
• The OOPD permits sponsors to employ open protocols that allow patients to be added to ongoing studies.
• Sponsors may receive grant funding to defray costs of qualified clinical testing expenses incurred in connection with the development of orphan products.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/handbook/orphan.htm) by Ralph D’Agostino and Sarah Karl.
ORPHAN DRUGS

The term ‘‘orphan drug’’ refers to a product, which can be either a drug or biologic, intended for use in a rare disease or condition that affects fewer than 200,000 Americans. Orphan drugs may be approved or experimental. A drug or biologic becomes an orphan drug when it receives the orphan drug designation from the Office of Orphan Products Development at the U.S. Food and Drug Administration (FDA). The orphan drug designation qualifies the sponsor to receive certain benefits from the federal government in exchange for developing the drug for a rare disease or condition. The drug must then go through the FDA marketing approval process for safety and efficacy like any other drug or biologic. To date, over 1400 drugs and biologics have been designated as orphan drugs, and over 250 have been approved for marketing.

The cost of orphan products is determined by the sponsor of the drug and is not controlled by the FDA. The costs of orphan products vary greatly. Generally, health insurance will pay the cost of orphan products that have been approved for marketing.

If an orphan product has been approved by the FDA for marketing, it will be available through the normal pharmaceutical supply channels. If the product has not been approved by the FDA, the sponsor (the party studying the drug) may make the product available on a compassionate use basis. For contact information on sponsors of orphan products, contact the Office of Orphan Products Development. Side-effect information for orphan products that have been approved for marketing can be found on the labeling for the product.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/handbook/orphan.htm and http://www.fda.gov/orphan/faq/index.htm) by Ralph D’Agostino and Sarah Karl.
ORPHAN PRODUCTS GRANT PROGRAM

The U.S. Food and Drug Administration’s Office of Orphan Products Development Grant Program encourages the clinical development of products for use in rare diseases or conditions, which are usually defined as affecting fewer than 200,000 people in the United States. The products studied can be drugs, biologics, medical devices, or medical foods. Grant applications are solicited through a Request for Applications (RFA). Presently, only clinical studies qualify for consideration; an applicant may propose one discrete clinical study to facilitate Food and Drug Administration (FDA) approval of the product for a rare disease or condition, and that study may address an unapproved new product or an unapproved new use for a product already on the market. All studies must be conducted under an Investigational New Drug (IND) application or an Investigational Device Exemption (IDE). Medical foods are the only exception to this requirement.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/orphan/grants/info.htm) by Ralph D’Agostino and Sarah Karl.
OUTLIERS
DOUGLAS M. HAWKINS
School of Statistics, University of Minnesota

Outliers are an important issue in data analysis. Since a single undetected outlier can destroy an entire analysis, analysts should worry about the origins, relevance, detection, and correct handling of outliers. Intuitively, an outlier is an observation so discordant from the majority of the data as to raise suspicion that it could not plausibly have come from the same statistical mechanism as the rest of the data. Apparent discordancy is what distinguishes an outlier from a contaminant – an observation that did not come from the same mechanism as the rest of the data but was generated in some other way. A contaminant may appear ordinary (and not outlying), while an outlier could be, but is not necessarily, a contaminant. Outliers can arise in several ways.

• They may be contaminants generated by some other statistical mechanism. For example, if the seeds used in a plant growth experiment contain some foreign seed, then the plants produced from the foreign seed will be contaminants and may be outliers.
• They may result from procedural errors in data gathering. For example, misreading an instrument will produce a contaminant that may be outlying. It is generally accepted that several percent of even high-quality data are wrong, and some of these errors may be outlying.
• The analyst may have a misconception of the true model. For example, if an instrument is thought to produce normally distributed readings, but actually produces Cauchy-distributed readings, then some valid correct readings will be severe outliers relative to the wrongly assumed normal model.
• In structured data such as multiple regression or analysis of variance data, they may be a symptom of some other model failure. Heteroscedasticity and nonlinearity, as well as nonnormality, may cause what appear to be isolated outliers.

Different possible origins of outliers call for different resolutions. Where the outliers are caused by imperfect modeling, the model should be refined so that they are accommodated. Where the outliers result from errors of execution, the primary concern is to minimize the damage they do to the analysis. Where the outliers result from mixtures of distributions or other types of contamination, there may be interest in identifying the outliers and (in the mixture case) estimating the characteristics of the mixture components. It is not always clear in applications what caused the outliers, and so ‘‘one size fits all’’ recipes for dealing with outliers are inappropriate.

Here, we concentrate on methods of identifying outliers – that is, of deciding whether observations really are implausibly discordant. This has two parts – deciding which of the observations are most discordant from the majority; and measuring their discordancy. The first task – locating the most discordant observations – seems trivial at first glance, but is so only in the simplest case of univariate random samples. Here, it is only the largest and the smallest of the data that could be potentially discordant. Locating the most discordant observations, however, can be very difficult in ‘‘structured’’ data sets such as time series, analysis of variance, multiple regression, and multivariate data. Here, simple diagnostics like regression residuals cannot be relied upon to locate discordant observations.

Estimating the parameters of a model without risk of serious damage from outliers is addressed by the techniques of robust estimation, the most familiar example of which may be the box and whisker plot with its outlier identification rules. The objectives of robust estimation and outlier identification are logically connected – if the potential outliers are located, then removing them from the sample and fitting the model to the remaining data will neutralize them and so provide robust estimates. Similarly, sturdy estimates of unknown parameters will provide a good picture of the model fitting the majority of the data, thereby helping locate potential outliers. The details of both approaches are less simple than they seem though, and the connection is not enough for either technology to supersede the other.

1 LIKELIHOOD MODELS FOR OUTLIERS
Outliers may be flagged and investigated in varying degrees of formality. For example, making a normal probability plot of residuals from a multiple regression and labeling as outliers any cases that seem visually too far from the line is a method of flagging questionable cases. Being informal though, it suffers from dependence on perceptions of how large a deviation is too large and may be criticized for having no obvious theoretical basis. More formal statistical models for outliers remove the subjective element and are valuable even if used only as benchmarks to assess other less formal models.

A general framework that permits the modeling of data sets that might contain outliers is the ‘‘parameter shift’’ model. Each ‘‘good’’ (scalar or vector) observation Xi in the sample is modeled as following a specified statistical model with unknown parameter vector θ,

Xi ∼ gi(·, θ).

There may also be one or more contaminating observations Xi with distribution

Xi ∼ gi(·, θ, αi),

where the parameter αi drives the contaminant’s displacement, and is most conveniently parameterized so that a contaminant with αi = 0 has the same distribution as a ‘‘good’’ value. Commonly, different contaminants are modeled as having different αi, but in some settings, a common α for all is sensible. Contaminants with sufficiently extreme αi values will be outlying and so potentially identifiable; those with less extreme αi values will be indistinguishable from the good observations.

This model is less restrictive than it might seem. An outlier that resulted from a recording error of transposing digits would not plausibly be explained by this parameter shift model, but since a good choice of αi could be fit to the data, this conceptual failing is arguably unimportant. For example, a transposition error of writing 84 for Xi instead of 48 is exactly the same as adding αi = 84 − 48 to the value actually generated by the model.

This distributional approach to describing outliers is attractive for those who like to work with formal models, as it allows much of the processing of outliers to be formalized and codified. Using say maximum likelihood to estimate the parameters θ and αi automatically gives a robust estimate of θ, and the likelihood gives a formal outlier test through a test of the null hypothesis αi = 0.

The likelihood model describes contaminants, not outliers. Contaminants that are not outlying are undetectable; however, since they are undetectable precisely because they behave like ‘‘good’’ observations, their invisibility means that failing to detect them usually does not have serious bad consequences.

2 COMPUTATIONAL ISSUES

Fitting the model is less trivial than it might seem as it requires looking at all possible partitionings of the data into the ‘‘good’’ and the ‘‘potentially suspect’’ subsets. In the simplest cases, this can be done by inspection. For a single sample from a normal distribution with mean-shift contamination, it is easy to show that the highest likelihood results when the extreme order statistics of the sample are labeled as ‘‘potentially suspect’’. This means that no other types of partitioning need be studied.

In multiple regression, it would be equally intuitive that the cases most likely to be outliers would be those with the largest residuals, or the largest studentized residuals. Here though intuition is misleading: a pair of outliers can so distort the regression line that both have small residuals. This is called ‘‘masking’’. They may also make the residual of a ‘‘good’’ case large – this is termed
‘‘swamping’’. Thus, the most discordant observations need not necessarily have large (or even nonzero) residuals. Reliable identification of potentially suspect cases in multiple regression requires the substantial computational exercise of looking at all possible partitioning of the cases into ‘‘good’’ and ‘‘potentially suspect’’ subsets, selecting the partition with the largest likelihood. ‘‘High-breakdown’’ methodologies do indeed guarantee locating even large numbers of outliers, however badly they may be placed, provided enough computation is done. These methodologies are inherently computer-intensive as the guarantee of locating the outliers requires potentially investigating all partitions of the data into an ‘‘inner half’’ and an ‘‘outer half’’. After this exhaustive search, all outliers along with some inliers can be expected to be in the outer half where using some cutoff rule should distinguish the outliers from the more extreme inliers. A much lighter computation is required for ‘‘sequential identification’’ methods. In these, the single observation from the whole sample whose deletion would lead to the greatest increase in likelihood is flagged as potentially suspect, and temporarily removed from the sample. The single observation in the remaining sample whose deletion would lead to the greatest improvement in model fit is then identified as another potentially suspect observation, and is also temporarily removed from the sample. This stripping of observations continues until some stopping rule is reached. Sequential identification methods are of two types. In ‘‘forward selection’’, a discordancy measure is calculated as each new potentially suspect observation is identified and the process stops when it first fails to find an observation sufficiently discordant to be called an outlier. In ‘‘backward elimination’’ a preset number of potentially suspect observations is identified in one pass, and then in a second pass, each of them is tested sequentially to see whether it really is sufficiently discordant to be called an outlier. A case that is not discordant is then reincluded with the ‘‘good’’ observations and the testing of the remaining potentially suspect observations repeated.
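The stripping loop used by sequential identification can be sketched in a few lines. The following is a minimal illustration, not the article's own algorithm: it refits a least-squares regression after each removal and sets aside the case with the largest absolute externally studentized residual. The data, variable names, and the number of cases stripped are invented for the example.

```python
# Minimal sketch of sequential outlier identification in linear regression:
# repeatedly flag and remove the case with the largest externally studentized
# residual, refitting on the remaining cases each time.
import numpy as np

def externally_studentized(X, y):
    """Externally studentized residuals for least squares of y on X."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)                     # hat matrix
    h = np.diag(H)
    e = y - H @ y                                             # residuals
    s2 = e @ e / (n - p)                                      # full-sample residual variance
    s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)    # leave-one-out variance
    return e / np.sqrt(s2_del * (1 - h))

def strip_suspects(X, y, n_strip=4):
    """Return indices of the most discordant cases, in order of removal."""
    keep = np.arange(len(y))
    flagged = []
    for _ in range(n_strip):
        t = externally_studentized(X[keep], y[keep])
        worst = np.argmax(np.abs(t))
        flagged.append(int(keep[worst]))
        keep = np.delete(keep, worst)
    return flagged

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2 + 0.5 * x + rng.normal(scale=0.3, size=30)
y[:2] += 4                                                    # plant two gross outliers
Xmat = np.column_stack([np.ones_like(x), x])
print(strip_suspects(Xmat, y, n_strip=4))                     # planted cases appear first
```

Whether each stripped case is finally labeled an outlier is decided afterwards by a testing rule, as discussed next.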
Both theoretical reasoning and practical experience show that backward elimination is much better than forward selection, which can miss arbitrarily severe outliers. This is because the masking effect may cause the outlier diagnostics of all the outliers to appear modest so that forward selection stops before all outliers have been located. Intuitive though sequential identification is, an even more intuitive approach is onepass identification of all cases whose outlier diagnostics are large. Because of masking and swamping though, this approach is much worse than even forward selection and should not be used at all. This leaves backward elimination as the most reliable approach with low computational requirements. Another ‘‘backward elimination’’ method starts with some small subset of the data that is outlier-free, and then sequentially adds observations that appear not to be discordant, reestimating parameters as each new apparently concordant observation is added. Success with this approach depends on finding a starting subset of cases that is not only outlier-free but also informative enough to correctly distinguish the inliers from the outliers in the not-yet-classified group. Starting with a high-breakdown estimate gives reliable results, but at high computational cost. Methods using full-sample estimates – for example, the cases with the smallest absolute residuals from a full-sample least squares regression fit – may succeed, but come with no guarantees. Even high breakdown methods are not completely bulletproof: see the example in Hawkins and Olive (5) in which even high breakdown regression methods saw nothing surprising about men less than an inch tall, but whose head circumference was some five feet. The second phase, of deciding whether to label a suspect case outlying, may be based on a likelihood ratio or a case diagnostic such as a studentized residual. Two different philosophies on outlier labeling are to use fixed cutoff values; and to use multiple comparison tests. An example of the first approach is to flag as outlying any cases whose regression residual exceeds 2.5 standard deviations regardless of sample size. This rule will delete a fixed percentage of
the cases in data sets consisting of only good data. A sound multiple comparison method is the Bonferroni approach of testing the externally studentized residual (‘‘outlier t’’, or RSTUDENT) of a regression against the α/n quantile of a Student’s t distribution, where n is the sample size and α a significance level. In this approach, a fixed proportion of good data sets will have one or more observations wrongly labeled outliers.
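As a quick illustration of the Bonferroni rule just described, the sketch below computes the two-sided α/n critical point of Student's t for an assumed sample size and residual degrees of freedom; the numbers are placeholders rather than values from any particular data set.

```python
# Sketch of the Bonferroni outlier-t cutoff: compare the largest absolute
# externally studentized residual with the two-sided alpha/n point of t.
from scipy import stats

def bonferroni_cutoff(n, resid_df, alpha=0.05):
    # two-sided alpha/n critical value of Student's t with the residual df
    return stats.t.ppf(1 - alpha / (2 * n), resid_df)

n, p = 28, 2                                  # cases and fitted parameters (illustrative)
cutoff = bonferroni_cutoff(n, resid_df=n - p - 1)
print(round(cutoff, 2))                       # roughly 3.5, as in the example later in this article
```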
3 PARTICULAR CASES
The easiest situation is mean-shift outliers from a univariate normal distribution, with inliers distributed as N(µ, σ 2 ), both parameters unknown, contaminated by one or more N(µ + α i , σ 2 ) outliers. The scaled deviations from the mean, (X i − X)/s, where X is the sample mean and s the standard deviation, are effective for both identifying and testing a single outlier. Multiple outliers can be found using sequential identification either by starting with the full sample and stripping one suspect observation at a time, or by starting with the easy-to-find ‘‘inner half’’ of cases and adding concordants. Outliers in the univariate normal can also be modeled by scale contamination, where the outliers are N(µ, σ 2 (1 + α)). This (with a common α) can be thought of as mixing the mean-contamination model’s displacements over a N(0, ασ 2 ) distribution, and is effective for outliers occurring symmetrically to the left and the right of the inliers. Flagging and testing for outliers from a chi-square distribution use the scaled quantities X i /X. Cochran’s test for the largest such ratio is classical, but outlier tests can also be performed on the smallest such ratios. In homoscedastic normal linear modeling (both linear regression and analysis of variance), a general-purpose approach to outlier identification and testing is by variance analysis. Let S0 with ν degrees of freedom denote the residual sum of squares from a fitted model using some set of cases. Write S1 for the residual sum of squares obtained after
deleting a suspect observation. Then, the pseudo-F ratio

$$F = \frac{(\nu - 1)(S_0 - S_1)}{S_1} \qquad (1)$$
may be compared with the fractiles (Bonferroni-corrected or fixed) of an F distribution with 1 and ν − 1 degrees of freedom to decide whether the suspect case is or is not within plausible limits. This F ratio is the square of the ‘‘outlier t’’ statistic that software often produces as a case diagnostic.

In analysis of variance with replicates, fitting the ANOVA model on the full sample and after removal of individual readings leads to tests for individual outliers. A different model is the ‘‘slippage’’ model in which all the replicate readings in some cell are displaced by the same amount. Reducing the original data to cell means and taking these means as the basic data to which to apply outlier screening methods addresses the slippage model.

Multivariate outliers are commonly modeled by the multivariate normal distribution N(µ, Σ) with unknown mean vector µ and covariance matrix Σ. The mean contamination model holds that the outliers are N(µ + αi, Σ), and mixing the outlier displacements over a zero-mean normal distribution leads to the scale-contamination model in which the outliers have an inflated covariance matrix. With this baseline model, if the sample mean vector and covariance matrix are written as X̄ and S respectively, then the discrepancy of a single case Xi can be measured by its squared Mahalanobis distance from the mean, Di = (Xi − X̄)′ S⁻¹ (Xi − X̄). The traditional multivariate outlier statistic, Wilk’s lambda, is a monotonic function of Di. Sequential deletion successively trims the case with the largest Di from the current sample mean vector using the current sample covariance matrix.

The likelihood outlier model can handle generalized linear models. If the deviance of any fitted model is S0 and the deviance obtained by deleting some suspect case and refitting is S1, then the deviance explained by deleting the case is S0 − S1, and this can be referred to its asymptotic chi-squared distribution to get an outlier test. This framework covers Poisson and logistic regression and loglinear modeling, among others. The generalized linear model framework is also useful for outliers in contingency tables – for example individual cells whose frequencies violate independence, which holds in the rest of the table. Deleting a particular cell, refitting the model, and computing the change in deviance between the two fits gives an outlier test statistic for that cell.

Time series can have two types of outlier – an ‘‘additive outlier’’ displaces a single reading from where it should have been, but does not affect the subsequent observations. ‘‘Innovative outliers’’ displace the whole time series from their point of occurrence onward, and, apart from their occurrence at an unexpected time, can be modeled by the intervention analysis likelihood.
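For the multivariate case discussed above, a minimal sketch of sequential trimming by squared Mahalanobis distance might look as follows. The simulated data, the number of cases trimmed, and the chi-squared flagging cutoff are illustrative assumptions, not recommendations from the article.

```python
# Sketch: sequential deletion of the case with the largest squared Mahalanobis
# distance, recomputing the mean vector and covariance matrix at each step.
import numpy as np
from scipy import stats

def mahalanobis_sq(X):
    centered = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum("ij,jk,ik->i", centered, S_inv, centered)

def trim_most_distant(X, n_trim=3):
    keep = np.arange(len(X))
    removed = []
    for _ in range(n_trim):
        d2 = mahalanobis_sq(X[keep])
        worst = np.argmax(d2)
        removed.append((int(keep[worst]), float(d2[worst])))
        keep = np.delete(keep, worst)
    return removed

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=100)
X[0] = [4.0, -4.0]                                  # plant one clear outlier
for idx, d2 in trim_most_distant(X):
    flag = d2 > stats.chi2.ppf(0.999, df=X.shape[1])  # illustrative cutoff only
    print(idx, round(d2, 1), "flagged:", flag)
```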
4 EXAMPLE
Rousseeuw and Leroy (7) discuss a data set relating the body weight and brain weight of 25 animals alive today, along with three dinosaurs. Least-squares regression of Y = loge(brain weight) on X = loge(body weight) gives (standard errors in parentheses)

Y = 2.555 + 0.496 X    (2)
    (0.413)  (0.078)

With 28 data points, a maximum of four deletions (15% of the data) seems reasonable. Starting with the full data set, and sequentially deleting the case with the largest absolute externally studentized residual for four iterations gives the following set of externally studentized residuals of each of the four suspect cases and residual sums of squares of the fitted regression.

                                            Outlier t after deletion number
Species          Body Weight   Brain Weight        0         1         2         3         4
Diplodocus            11 700           50.0    −2.507    −2.507    −3.646    −6.649
Brachiosaurus         87 000          154.5    −2.505    −3.644    −3.644    −6.804
Triceratops             9400           70.0    −2.094    −2.835    −6.045    −6.045
Human                     62         1320.0     1.789     1.861     2.098     3.232
Residual SS                                    60.988    48.733    31.372    12.117     8.217

The pseudo-F ratios for the successive deletions are 6.29, 13.28, 36.55, and 10.44 with 1 numerator degree of freedom and denominator degrees of freedom 25, 24, 23, and 22. Start with the last of these, 10.44, whose right tail area is 0.004. Multiplying this figure by the remaining sample size, 25, gives a Bonferroni P value of 0.1, which argues against making the fourth deletion. Going to the next F ratio of 36.55 gives a P value of 3.6 × 10−6, indicating that the deletion of the third dinosaur, and by implication, its two even more extreme companions, is clearly called for.

With an initial sample size of 28, the Bonferroni 5% and 1% points for the outlier t statistic would be the two-sided 5/28 = 0.179 and 1/28 = 0.036% points of a t distribution with 25 degrees of freedom, which are 3.50 and 4.13, respectively. None of these four cases is close to significance, illustrating the masking effect. Successively deleting the three dinosaurs makes humans the most outlying species. At this point, humans’ t of 3.232 corresponds to a right tail area of 0.0038, which after Bonferroni adjustment, makes us unremarkable. The regression found after deleting the dinosaurs is

Y = 2.150 + 0.752 X    (3)
    (0.201)  (0.046)

Evidence of the damaging effect of the dinosaurs is that the slope fitted after their removal is 3.3 standard errors away from the full-sample estimate. The high breakdown regression fit using the least trimmed squares criterion gives
the model Y = 1.88 + 0.776 X, close to what we get with ordinary regression after deleting the dinosaurs.

4.1 Further Reading

The text by Barnett and Lewis (2) provides an exhaustive coverage of the standard outlier situations and models. Additional useful theoretical material on the robust estimation aspects of outliers may be found in Hampel et al. (3). Rousseeuw and Leroy (7) is a valuable exposition of the high breakdown approach, though readers should be aware that many of the specifics (notably computational procedures and outlier flagging rules) are now obsolete. Better algorithms for high breakdown estimation in the difficult multivariate location/scatter case are given in (6,8), and for the regression case, in (4). Atkinson and Riani (1) provide discussion centered on sequential identification starting from an outlier-free subset of cases.

Hawkins and Olive (5) provide theoretical and empirical support for routine use of a variety of multiple regression methods, including sequential outlier identification, in analysis of real data. They also argue that in games against nature, in which a malicious opponent tries to place outliers where you cannot find them in a tolerable amount of time, the opponent will always win, given large enough data sets. In other words, it is impossible to guarantee finding even huge outliers in a large data set if an opponent is allowed to hide them. This argument generalizes to other outlier settings – for example, multivariate location/scatter.

REFERENCES

1. Atkinson, A. & Riani, M. (2000) Robust Diagnostic Regression Analysis. Springer, New York.
2. Barnett, V. & Lewis, T. (1994) Outliers in Statistical Data. Wiley, New York.
3. Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. & Stahel, W. A. (1986) Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
4. Hawkins, D. M. & Olive, D. J. (1999) Improved feasible solution algorithms for high breakdown estimation, Computational Statistics and Data Analysis 30, 1–11.
5. Hawkins, D. M. & Olive, D. J. (2002) Inconsistency of resampling algorithms for high breakdown regression estimators and a new algorithm (with discussion), Journal of the American Statistical Association 97, 136–148; rejoinder 156–159.
6. Rocke, D. M. & Woodruff, D. L. (2001) Robust cluster analysis and outlier identification, in 2000 Proceedings, American Statistical Association, Arlington, VA, Biometrics Section.
7. Rousseeuw, P. J. & Leroy, A. M. (1987) Robust Regression and Outlier Detection. Wiley, New York.
8. Rousseeuw, P. J. & Van Driessen, K. (1999) A fast algorithm for the minimum covariance determinant estimator, Technometrics 41, 212–223.
OVERDISPERSION

CHARMAINE B. DEAN
Simon Fraser University, Burnaby, British Columbia, Canada

The phenomenon which has come to be termed overdispersion arises when the empirical variance in the data exceeds the nominal variance under some presumed model. Overdispersion is often observed in the analysis of discrete data, for example count data analyzed under a Poisson assumption, or data in the form of proportions analyzed under a binomial assumption. Support for overdispersion is most likely first obtained when the ‘‘full’’ model is fitted, and the Pearson or deviance residuals are predominantly too large (4); the corresponding Pearson and deviance goodness-of-fit statistics indicate a poor fit.

The Poisson and binomial distributions are derived from simple, but fairly strict assumptions, and it is not surprising that these do not apply generally in practice. Fitting either of these distributions assumes a special mean–variance relationship, since both distributions are fully characterized by a single parameter. This can be contrasted with the analysis of continuous data using a normal assumption. Consider a simple example where a set of univariate observations is modeled by a common N(µ, σ²) distribution. In estimating the fitted distribution, the sample average of the data points defines the location of the normal distribution on the number line (µ), while the sample variance determines the spread of the fitted bell-curve (σ²). The normal distribution is characterized by two parameters, while the Poisson and binomial distributions are completely specified when only the mean, or the probability of success, respectively, is determined; the variance is fixed by the mean.

The existence of overdispersion is not a recent observation. Student (27) comments upon this problem, and Fisher (13) discusses a goodness-of-fit statistic for testing the adequacy of the Poisson distribution in the single sample problem, i.e. when the counts Yi, i = 1, . . . , n, are assumed to be independent variates from a Poisson distribution with common mean. The statistic is $\sum_{i=1}^{n} (Y_i - \bar{Y})^2/\bar{Y}$, where $\bar{Y} = \sum_{i=1}^{n} Y_i/n$, and is called the sample index of dispersion.

An important question to ask when a model is suspect is: Will the lack-of-fit affect inference and lead to incorrect conclusions? If the effect is negligible, and if the efforts involved in fitting a more ‘‘exact’’ model are substantial, then the approximate inference obtained under the simpler model may well suffice. In what follows the effect of overdispersion is shown to be nonignorable, and can be quite drastic, so inference under a Poisson or binomial model when overdispersion is present may be very misleading. We focus on the analysis of count and categorical responses, because these are the two areas where overdispersion most commonly arises, with binary responses being an important special case of the latter.
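As a small illustration of the sample index of dispersion just mentioned, the sketch below computes Fisher's statistic for a single sample of counts and refers it to a chi-squared distribution with n − 1 degrees of freedom; the simulated data are purely illustrative.

```python
# Sketch: sample index of dispersion for a single Poisson sample.
import numpy as np
from scipy import stats

def index_of_dispersion(y):
    y = np.asarray(y, dtype=float)
    stat = np.sum((y - y.mean()) ** 2) / y.mean()
    pval = stats.chi2.sf(stat, df=len(y) - 1)       # chi-square approximation under the Poisson model
    return stat, pval

rng = np.random.default_rng(2)
lam = rng.gamma(shape=2.0, scale=2.5, size=200)      # unobserved mixing creates overdispersion
y_over = rng.poisson(lam)                            # overdispersed counts
y_pois = rng.poisson(5.0, size=200)                  # genuinely Poisson counts
print(index_of_dispersion(y_pois))                   # statistic near its df, large p-value
print(index_of_dispersion(y_over))                   # inflated statistic, tiny p-value
```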
1 EFFECT OF OVERDISPERSION ON STANDARD POISSON OR BINOMIAL ANALYSES

The effect of overdispersion is determined primarily by how it arises and its degree of incidence. One common way that overdispersion arises in the analysis of proportions is through a failure of the binomial independence assumption. In animal litter studies, for example, responses of animals in a litter are often positively correlated; so, too, in dental studies for responses on individual teeth for a single individual. Let Y denote a binomial response, $Y = \sum_{j=1}^{m} Y_j$, where the Yj are independent binary variates taking values 0 or 1 with probabilities (1 − p) and p, respectively. Then, E(Y) = mp, and var(Y) = mp(1 − p). If the independence assumption does not hold, and the correlation corr(Yj, Yk) = τ > 0, then E(Y) = mp, and

$$\mathrm{var}(Y) = \mathrm{var}\left(\sum_{j=1}^{m} Y_j\right) = mp(1-p)[1 + \tau(m-1)],$$

leading to overdispersion with respect to the binomial model. Note that τ < 0 leads to underdispersion, which is rare in practice, but might correspond to competition among the binary variates for a positive response.
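The variance inflation just derived is easy to verify by simulation. In the sketch below, positive within-cluster correlation is induced by drawing a cluster-specific success probability from a beta distribution (an assumption made only for illustration), and the empirical variance of the cluster totals is compared with the nominal binomial variance and with mp(1 − p)[1 + τ(m − 1)].

```python
# Simulation sketch: correlated binary responses within a cluster inflate var(Y)
# from the nominal binomial value mp(1-p) to mp(1-p)[1 + tau(m-1)].
import numpy as np

rng = np.random.default_rng(3)
m, a, b = 10, 2.0, 6.0                        # cluster size and beta(a, b) mixing parameters
p = a / (a + b)                               # marginal success probability
tau = 1.0 / (a + b + 1.0)                     # intra-cluster correlation implied by the mixing

p_cluster = rng.beta(a, b, size=200_000)      # one success probability per cluster
y = rng.binomial(m, p_cluster)                # cluster totals Y = sum of m binary responses

nominal = m * p * (1 - p)                     # binomial variance
inflated = nominal * (1 + tau * (m - 1))      # overdispersed variance from the text
print(round(y.var(), 3), round(nominal, 3), round(inflated, 3))
# the empirical variance tracks the inflated value, not the nominal binomial one
```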
Another way overdispersion may arise is through a failure of the binomial assumption of constant probability of success from trial to trial. This might occur if the population can be subdivided into naturally occurring subunits, for example colonies, where the probability of a positive response varies over these subunits. Similarly, for count data, failure of the assumptions underlying the use of the Poisson distribution generally leads to overdispersion. In particular, the probability of an event may vary over individuals or over time.

For a simple example, suppose the response is the number of days absent due to illness over a period of time, in a situation where the number of episodes of illness, Y, is Poisson(µ) distributed but each episode will likely lead to consecutive days of absence. The distribution of the number of days of absence due to illness will then be overdispersed with respect to the Poisson model. If A represents the number of days of absence during a single episode, A and Y being independent, then the total number of days of absence is $\sum_{i=1}^{Y} A_i$, which has mean and variance

$$E\left(\sum_{i=1}^{Y} A_i\right) = \mu E(A), \qquad \mathrm{var}\left(\sum_{i=1}^{Y} A_i\right) = \mu E(A)\{E(A) + [\mathrm{var}(A)/E(A)]\} > \mu E(A), \quad \text{if } E(A) > 1.$$

In some contexts the Poisson or binomial variation is only a minute part of the overall variability. This is typical, for example, in cancer mapping studies, where the distribution of rates over a region is to be determined. In many cases it is the spatial variation of the cancer mortality rates which is the main component of the dispersion.

Overdispersion cannot be ignored. In fact, many statistical packages routinely incorporate overdispersion. The magnitudes and signs of the estimated covariate effects in a loglinear or logistic analysis can be quite similar whether or not overdispersion is properly accounted for, so a researcher may gain no hint of an inappropriate analysis by there being strikingly strange estimated effects. However, the standard errors of the estimated regression parameters will be underestimated; these will reflect only the Poisson or binomial variation. The precision of the resulting estimates will be too high and P values for testing the significance of the included covariates will be correspondingly too low. This will very likely lead to incorrect
inference, unless the overdispersion is almost negligible.

2 TESTING FOR OVERDISPERSION

In many situations the presence of overdispersion is clearly indicated by the presence of overly large values of the Pearson or deviance goodness-of-fit statistics, even when the full model is fitted. Formal tests for overdispersion have also been discussed in the literature. Score tests for overdispersion (3,9,10) compare the sample variance with what is expected under the model. For the testing of extra-Poisson variation, the adjusted score test statistic, for testing the null hypothesis H1: τ = 0 in the model with overdispersed variance function µi + τµi², is

$$T_{P1} = \frac{\sum_{i=1}^{n} \{(y_i - \hat{\mu}_i)^2 - (1 - \hat{h}_i)\hat{\mu}_i\}}{\left\{2 \sum_{i=1}^{n} \hat{\mu}_i^2\right\}^{1/2}} . \qquad (1)$$
In (1), hi is the ith diagonal element of the hat matrix H for Poisson regression, $H = W^{1/2} X (X'WX)^{-1} X' W^{1/2}$, where W = diag(µ1, . . . , µn), and X is n × p with ijth entry $\mu_i^{-1}(\partial \mu_i/\partial \beta_j)$, µi = µi(xi; β). For loglinear regression, log µi = xi′β, and X is the usual matrix of covariates. Estimates µ̂i and ĥi are obtained by replacing β by β̂, its maximum likelihood estimate under the Poisson assumption. The statistic TP1 converges in distribution, as n → ∞, to a standard normal under H1. For testing the null hypothesis H2: τ = 0, in the model with variance function (1 + τ)µi, the score test statistic is

$$T_{P2} = \frac{1}{\sqrt{2n}} \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2 - (1 - \hat{h}_i)\hat{\mu}_i}{\hat{\mu}_i}, \qquad (2)$$

which is also asymptotically (n → ∞) distributed as standard normal, under H2. It is interesting to note that the test statistic for testing H2 when considering µi → ∞ asymptotics, for fixed n, is equivalently

$$T'_{P2} = \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}, \qquad (3)$$
which has a limiting distribution of χ²(n − p). This is just the Pearson statistic, which is traditionally used to assess correct specification of the mean. The derivation of the tests for extra-Poisson variation assumes that the regression specification is correct. Hence, the Pearson statistic arises as a test of either of two alternative hypotheses: namely, µi = µi(xi, β) is incorrectly specified, or the distribution of the counts has variance form φµi. We would not, however, interpret a significantly large Pearson statistic as indicating overdispersion in a generalized linear model, unless the mean had been reasonably well modeled. Otherwise, the apparent overdispersion could reflect missing covariates, e.g. interaction terms, implying systematic lack of fit, or, the functional form of the mean may be inappropriate. Pregibon (22) develops a test for checking the form of the mean in generalized linear models; for multinomial models, O’Hara Hines et al. (21) develop diagnostic tools for this purpose. Smith & Heitjan (25) develop a test for overdispersion which results when the vector of coefficients in the mean is considered to be random. The score tests TP1 and TP2 arise from a random intercept model, and can therefore be considered a special case of Smith & Heitjan’s test. Dean (8) discusses testing for overdispersion with longitudinal count data.

A score test for the adequacy of the binomial model against an alternative with variance form mi pi(1 − pi)[1 + τ(mi − 1)] is

$$T_B = \frac{\sum_{i=1}^{n} (\hat{p}_i(1 - \hat{p}_i))^{-1} \left[(y_i - m_i \hat{p}_i)^2 + \hat{p}_i(y_i - m_i \hat{p}_i) - y_i(1 - \hat{p}_i)\right]}{\left\{2 \sum_{i=1}^{n} m_i(m_i - 1)\right\}^{1/2}}, \qquad (4)$$
which has a standard normal distribution as n → ∞. This variance form is obtained from the correlated binomial model, discussed in the previous section. Prentice (23) derives this statistic as a score test statistic against beta-binomial model alternatives; Tarone (28) derives it by considering correlated binomial alternatives. As an example in the application of the tests, consider the data, given in (29), from
a clinical trial of 59 patients with epilepsy, 31 of whom were randomized to receive the anti-epilepsy drug Progabide and 28 of whom received a placebo. The total seizure count over four follow-up periods is taken here as the response variable. Table 1 shows the results of a Poisson regression analysis of the effect of Progabide on seizure rate which includes two covariates and their interactions, plus terms for a main and interaction treatment effect. The data and this fitted model have been discussed at length in Breslow (4) and his results are given here. The Pearson and deviance goodness-of-fit statistics clearly indicate lack-of-fit of the Poisson model. The score test statistics T P1 and T P2 have observed values 36.51 and 37.56, respectively. The overdispersion here is very substantial.
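A rough sketch of how TP1 and TP2 might be computed in practice is given below, using simulated overdispersed counts and a Poisson fit from statsmodels; the design, sample size, and mixing distribution are assumptions for illustration and do not reproduce the epilepsy analysis.

```python
# Sketch (not the article's computation): adjusted score tests T_P1 and T_P2 for
# extra-Poisson variation, following Eqs. (1) and (2) above.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 150
x = rng.normal(size=n)
X = sm.add_constant(x)
lam = np.exp(0.5 + 0.4 * x) * rng.gamma(shape=2.0, scale=0.5, size=n)   # gamma mixing
y = rng.poisson(lam)                                                    # overdispersed counts

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu = fit.fittedvalues
W_half_X = np.sqrt(mu)[:, None] * X                                     # W^{1/2} X
h = np.einsum("ij,ij->i",
              W_half_X @ np.linalg.inv(X.T @ (mu[:, None] * X)),        # (X'WX)^{-1}
              W_half_X)                                                 # hat-matrix diagonal

adj = (y - mu) ** 2 - (1 - h) * mu
T_P1 = adj.sum() / np.sqrt(2 * np.sum(mu ** 2))
T_P2 = np.sum(adj / mu) / np.sqrt(2 * n)
print(round(T_P1, 2), round(T_P2, 2))   # both expected to be well above 1.645 here
```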
3 ACCOUNTING FOR OVERDISPERSION
With overdispersion present, the use of the Poisson or binomial maximum likelihood equations for estimating the regression parameters in the mean is still valid. These are the usual generalized linear model (GLM) or quasi-likelihood (QL) estimating equations, and they are unbiased estimating equations regardless of any misspecification of the variance structure. However, the estimated variances of the parameter estimates will be in error, and possibly severely so. If there are alternative ‘‘overdispersed’’ models which are postulated, then certainly one could proceed by maximum likelihood estimation. This will be discussed further below. There are, however, some robust, simple methods for adjusting standard errors to account for overdispersion and these will be considered first.

McCullagh & Nelder (20) suggest a simple adjustment, which is to multiply var(β̂), obtained from the Poisson or binomial model, by an estimate of the GLM scale factor, φ. This estimate is usually the Pearson or deviance statistic divided by its degrees of freedom. This is appropriate if the overdispersion gives rise to a variance model which is a constant times the nominal variance, e.g. φµ for counts and φmp(1 − p) for proportions.
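A minimal sketch of this scale-factor adjustment, assuming the observed counts, fitted means, and model-based standard errors are already available from a Poisson fit (the numbers below are toys, not data from any study):

```python
# Sketch: estimate phi by the Pearson statistic divided by its degrees of
# freedom and inflate the model-based standard errors by sqrt(phi).
import numpy as np

def pearson_scale(y, mu, n_params):
    """phi-hat = Pearson chi-square / residual degrees of freedom."""
    return np.sum((y - mu) ** 2 / mu) / (len(y) - n_params)

# toy inputs only: observed counts, fitted Poisson means, and model-based SEs
y = np.array([0, 3, 1, 7, 2, 9, 4, 12.0])
mu = np.array([1.2, 2.0, 1.5, 4.0, 2.5, 5.0, 3.0, 6.0])
se_model = np.array([0.20, 0.05])            # e.g., intercept and slope SEs from the fit
phi = pearson_scale(y, mu, n_params=2)
print(round(phi, 2), np.round(se_model * np.sqrt(phi), 3))   # inflated standard errors
```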
Table 1. Loglinear Poisson Regression Fit to the Epilepsy Data; Case No. 207, with High Leverage, Omitted. Reproduced from Statistica Applicata, Vol. 8, pp. 23–41, by permission of Rocco Curto Editore

Coefficient                      Value    Std. error    t Statistic
Intercept                        3.079    0.451          6.833
ln(base count/4)                −0.074    0.201         −0.366
Age/10                          −0.511    0.153         −3.332
ln(base count/4): age/10         0.351    0.068          5.164
Progabide                       −0.610    0.191         −3.197
Progabide: ln(base count/4)      0.204    0.088          2.325

Deviance = 408.4; Pearson χ² = 456.52; df = 52.
This variance form may also well approximate other, possibly more complicated, variance structures in certain situations; for example, for count data, when var(yi) = µi + τµi² and τµi do not vary greatly with i.

If the sample is large, then an empirical variance estimate can be computed. This is called the ‘‘sandwich’’ variance estimate, cf. Liang & Zeger (17). For loglinear models, for example, the sandwich estimator is

$$\widehat{\mathrm{var}}(\hat{\beta}) = \left(\sum_{i=1}^{n} \hat{\mu}_i x_i x_i'\right)^{-1} \left\{\sum_{i=1}^{n} \left(\frac{y_i - \hat{\mu}_i}{\hat{\mu}_i}\right)^{2} \hat{\mu}_i^{2}\, x_i x_i'\right\} \left(\sum_{i=1}^{n} \hat{\mu}_i x_i x_i'\right)^{-1}. \qquad (5)$$
i=1
Unless very large samples are available, the sandwich estimator tends to underestimate the true variance. Resampling techniques, although typically computer-intensive, have become popular for providing estimates of the variance of regression parameters. Bootstrap and jackknife estimates are discussed in (12), and there are methods of approximating these which require less computing effort, e.g. the one-step jackknife estimate. The bootstrap estimates are considered to be quite accurate. Table 2, from (4), compares these estimators in the analysis of the data from (29), mentioned earlier. Notice how much larger these estimates of the standard errors are compared with those obtained from the Poisson model. The treatment effect is no longer significant when overdispersion is taken into account. Model-based methods for incorporating overdispersion lead to mixture models or
$$\int_0^{\infty} \frac{(\mu\nu)^y \exp(-\mu\nu)}{y!}\, p(\nu)\, d\nu \qquad (6)$$
and the score function for estimating the regression parameters has the intuitively appealing form

$$\sum_{i=1}^{n} \left[y_i - \mu_i E(\nu_i \mid y_i)\right] \frac{1}{\mu_i} \frac{\partial \mu_i}{\partial \beta_r} = 0. \qquad (7)$$
This equation, with E(νi | yi) omitted, is the maximum likelihood equation for Poisson regression [Eq. (8), with σi² = µi]. If the distribution of ν is specified, then full maximum likelihood estimation can be performed; if ν is assumed to be gamma, then this leads to a negative binomial distribution for the counts. However, it is more common to adopt the more robust approach of specifying only the first two moments for Y, i.e. µi and σi², respectively; the parameters in the mean are then estimated using the quasi-likelihood
Table 2. Overdispersion Adjusted Standard Errors for Table 1 Coefficients. Reproduced from Statistica Applicata, Vol. 8, pp. 23–41, by permission of Rocco Curto Editore

Coefficient                    Scale Method   Sandwich   One-step Jackknife   True Jackknife   Bootstrap (nb = 5000)
Intercept                      1.263          0.711      0.792                0.792            0.870
ln(base count/4)               0.564          0.326      0.368                0.369            0.424
Age/10                         0.430          0.237      0.264                0.263            0.291
ln(base count/4): age/10       0.190          0.104      0.117                0.117            0.137
Progabide                      0.535          0.403      0.440                0.448            0.466
Progabide: ln(base count/4)    0.246          0.188      0.210                0.214            0.226
estimating equation,

$$\sum_{i=1}^{n} \frac{(y_i - \mu_i)}{\sigma_i^2} \frac{\partial \mu_i}{\partial \beta_r} = 0, \qquad (8)$$
for count data, together with another estimating equation for the additional parameter τ in σi2 . There are many important reasons for the widespread use of the quasi-likelihood approach. For generalized linear models with a full likelihood, these are the maximum likelihood equations. From the viewpoint of estimating equations, Godambe & Thompson (14) (see also Nelder’s discussion of that paper) derive important optimality properties of the estimators. When σ 2 = µτ , estimation of the regression coefficients is not affected by the value of τ . Estimation is easy with standard software. An important point is that the ˜ the estimate of β, asymptotic variance of β, is independent of the choice of the estimating function for τ , and depends only on the first two moments of the distribution. This is also, asymptotically, a very efficient estimator for a wide range of models. Simulation studies have been conducted to investigate the performance of β˜ in small samples; they support the unbiasedness and efficiency of this estimator. A popular method for estimating τ is pseudo-likelihood. Davidian & Carroll (7) derived the pseudo-likelihood estimating equation as the maximum likelihood equation when residuals are normally distributed. An alternative simple choice is equating the Pearson statistic to its degrees of freedom.
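One way to operationalize the second estimating equation for τ mentioned above is the moment recipe of equating the Pearson statistic, computed under the variance function µi + τµi², to its degrees of freedom. The sketch below does this with a one-dimensional root search; the fitted means and data are stand-ins, not output from the article's analysis.

```python
# Sketch: keep the Poisson (quasi-likelihood) estimate of beta, then solve for tau
# so that the Pearson statistic under var = mu + tau*mu^2 equals its df.
import numpy as np
from scipy import optimize

def estimate_tau(y, mu, n_params):
    df = len(y) - n_params
    def pearson_minus_df(tau):
        return np.sum((y - mu) ** 2 / (mu + tau * mu ** 2)) - df
    # bracket chosen for overdispersed data: positive at tau=0, negative for large tau
    return optimize.brentq(pearson_minus_df, 0.0, 100.0)

rng = np.random.default_rng(6)
mu = np.exp(rng.normal(1.0, 0.5, size=300))                        # stand-in fitted means
y = rng.poisson(mu * rng.gamma(shape=4.0, scale=0.25, size=300))   # true tau = 0.25
print(round(estimate_tau(y, mu, n_params=2), 3))                   # estimate near 0.25
```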
Nonparametric methods of modeling random effects for handling overdispersion have been shown to be useful. Lindsay (18) is a comprehensive source on the topic. He discusses the geometry and theory of mixture models and describes a plethora of applications where mixture models are used, including overdispersion, measurement errors, and latent variable models for cluster analysis. Practical issues for estimation of nonparametric mixing distributions, such as algorithms and computational problems, are discussed at length in (2) and (16). The preceding overdispersion models incorporated a single random effect which was independent over subjects. More general models might include multiplicative random effects (25); large numbers of random effects are common in animal breeding experiments, where it is the prediction of the random effects, representing sire effects, which is the main focus of the study. In studies of the geographic distribution of cancer mortality rates, or disease incidence, the random effects may represent area-specific effects and there may be good reason to suspect that such area effects are not independent, and in some circumstances may be quite similar within small neighborhoods. A general body of theory which synthesizes the incorporation of several random effects, which are not necessarily independent, falls under the heading of generalized linear mixed models. It permits a simple incorporation of overdispersion, as discussed previously, and can also model dependences in outcome variables or random effects, as are typical in repeated measures design. A generalized linear mixed logistic model specifies
that

$$\log\left(\frac{p_i}{1 - p_i}\right) = x_i'\beta + z_i'\gamma, \qquad (9)$$
where pi and xi are the probability of a positive response and the vector of covariates, corresponding to the ith proportion, respectively, zi is a vector of covariates, and γ is distributed with a mean of zero, and finite variance matrix. Conditional on γ , the responses are supposed binomially distributed. As an aside, note that the representation above elucidates that apparent overdispersion can be induced by missing covariates, or by outliers. Residual diagnostics are important for identifying the latter. In generalized linear mixed models the random effects are usually assumed to be Gaussian, and maximum likelihood estimation involves q-dimensional integration; here q is the dimension of γ . Alternative simpler approaches for inference have been proposed, using generalizations of moment methods or penalized quasi-likelihood (5). The penalized quasi-likelihood is a Laplace approximation to the integrated likelihood, with some seemingly harmless other approximations added. Breslow & Clayton (5) provide simple algorithms for estimation using an iterative fitting procedure which updates both the parameter values and a modified response variable at each step. They also evaluate the performance of their estimators. It seems that bias corrections are required for small samples, and these are given in (6). Lee & Nelder (15) discuss hierarchical generalized linear models, where the distribution of the random components is not restricted to be normal, and where, like penalized quasi-likelihood, estimation avoids numerical integration. Maximizing what they call the h-likelihood, a posterior density, gives fixed effects estimators which are asymptotically equivalent to those obtained using the corresponding marginal likelihood. Here, ‘‘asymptotically’’ refers to cluster sizes tending to infinity, and the random effects are cluster-specific. This is important to note because many applications with random effects involve several smallsized clusters. In general, their asymptotic arguments require that the total number of
random effects remains fixed, as the overall sample size becomes large. However, they also derive properties of their estimators on a model-by-model basis, and some models require less strict assumptions. When there are many random effects to be estimated, none of the procedures described here will be simple, as can be expected when dealing with complicated mechanisms for incorporating overdispersion.

4 TECHNICAL REFERENCE TEXTS AND SOFTWARE NOTES

The general topic of overdispersion is discussed in McCullagh & Nelder [20, Sections 4.5, 5.5, 6.2]. Chapters 9 and 10 of that text are also relevant and discuss quasi-likelihood and joint modeling of mean and dispersion. Diggle et al. (11) discuss random effects models for longitudinal data, and Chapter 5 of Lindsey (19) is devoted to the topic of overdispersion in models for categorical data. Software for incorporating overdispersion includes S-PLUS (26), function glm; SAS ((24); procedures LOGISTIC, MIXED, GENMOD, and CATMOD); and GLIM (1).

REFERENCES

1. Baker, R. J. & Nelder, J. A. (1987). GLIM, 3.77. Numerical Algorithms Group, Oxford, England.
2. Böhning, D. (1995). A review of reliable maximum likelihood algorithms for semiparametric mixture models, Journal of Statistical Planning and Inference 47, 5–28.
3. Breslow, N. E. (1989). Score tests in overdispersed GLMs, in Workshop on Statistical Modeling, A. Decarli, B. J. Francis, R. Gilchrist & G. U. H. Seeber, eds. Springer-Verlag, New York, pp. 64–74.
4. Breslow, N. E. (1996). Generalized linear models: checking assumptions and strengthening conclusions, Statistica Applicata 8, 23–41.
5. Breslow, N. E. & Clayton, D. G. (1993). Approximate inference in generalized linear mixed models, Journal of the American Statistical Association 88, 9–25.
6. Breslow, N. E. & Lin, X. (1995). Bias correction in generalized linear mixed models with a single component of dispersion, Biometrika 82, 81–91.
7. Davidian, M. & Carroll, R. J. (1987). Variance function estimation, Journal of the American Statistical Association 82, 1079–1081.
8. Dean, C. B. (1991). Estimating equations for mixed Poisson models, in Estimating Functions, V. P. Godambe, ed. Oxford University Press, Oxford, pp. 35–46.
9. Dean, C. B. (1992). Testing for overdispersion in Poisson and binomial regression models, Journal of the American Statistical Association 87, 451–457.
10. Dean, C. & Lawless, J. F. (1989). Tests for detecting overdispersion in Poisson regression models, Journal of the American Statistical Association 84, 467–472.
11. Diggle, P., Liang, K. Y. & Zeger, S. L. (1994). Analysis of Longitudinal Data. Oxford University Press, New York.
12. Efron, B. & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall, London.
13. Fisher, R. A. (1950). The significance of deviations from expectation in a Poisson series, Biometrics 6, 17–24.
14. Godambe, V. P. & Thompson, M. E. (1989). An extension of quasi-likelihood estimation (with discussion), Journal of Statistical Planning and Inference 22, 137–152.
15. Lee, Y. & Nelder, J. A. (1996). Hierarchical generalized linear models (with discussion), Journal of the Royal Statistical Society, Series B 58, 619–678.
16. Lesperance, M. L. & Kalbfleisch, J. D. (1992). An algorithm for computing the nonparametric MLE of a mixing distribution, Journal of the American Statistical Association 87, 120–126.
17. Liang, K. Y. & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models, Biometrika 73, 13–22.
18. Lindsay, B. (1995). Mixture Models: Theory, Geometry and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, Vol. 5. Institute of Mathematical Statistics, Hayward.
19. Lindsey, J. K. (1993). Models for Repeated Measurements. Oxford University Press, New York.
20. McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models, 2nd Ed. Chapman & Hall, London.
21. O’Hara Hines, R. J., Lawless, J. F. & Carter, E. M. (1992). Diagnostics for a cumulative multinomial generalized linear model, with applications to grouped toxicological mortality data, Journal of the American Statistical Association 87, 1059–1069.
22. Pregibon, D. (1980). Goodness-of-link tests for generalized linear models, Applied Statistics 29, 15–24.
23. Prentice, R. L. (1986). Binary regression using an extended beta-binomial distribution, with discussion of correlation induced by covariate measurement errors, Journal of the American Statistical Association 81, 321–327.
24. SAS Institute Inc. (1990). SAS Release 6.03 edition. SAS Institute Inc., Cary, NC.
25. Smith, P. J. & Heitjan, D. F. (1993). Testing and adjusting for departures from nominal dispersion in generalized linear models, Applied Statistics 42, 31–41.
26. Statistical Sciences (1995). S-PLUS, Version 3.3. StatSci, a division of MathSoft, Inc., Seattle.
27. ‘‘Student’’ (1919). An explanation of deviations from Poisson’s law in practice, Biometrika 12, 211–215.
28. Tarone, R. E. (1979). Testing the goodness-of-fit of the binomial distribution, Biometrika 66, 585–590.
29. Thall, P. F. & Vail, S. C. (1990). Some covariance models for longitudinal count data with overdispersion, Biometrics 46, 657–671.
OVER-THE-COUNTER (OTC) DRUG PRODUCT REVIEW PROCESS
Data regarding over-the-counter (OTC) drug monographs can be submitted to the U.S. Food and Drug Administration’s Center for Drug Evaluation and Research (CDER) OTC program by anyone, including drug companies, health professionals, consumers, or citizen’s groups. If the submission is a request to amend an existing OTC drug monograph or is an opinion regarding an OTC drug monograph, it needs to be submitted in the form of a citizen petition or as correspondence to an established monograph docket. However, if no OTC drug monograph exists, data must be submitted in the format as outlined in 21 Code of Federal Regulations (CFR) Section 10.30. Data are submitted to the Dockets Management Branch, which logs the data and makes copies for the public files. The data are then forwarded to the Division of Over-the-Counter Drug Products for review and action.

When the submission is received in the Division of OTC Drug Products, a project manager conducts an initial review to determine the type of drug being referenced and then forwards the submission to the appropriate team for a more detailed review. The team leader determines if the submission will need to be reviewed by other discipline areas in the review divisions, such as chemists, statisticians, or other consultants such as those from other centers or agency offices. When the consultants complete their reviews, their comments are returned to the Division of OTC Drug Products. The submission is then forwarded to a team member for review.

If the submission is a comment or opinion on a specific rule or OTC drug monograph, there is no deadline established for the CDER to respond. However, if the submission is a petition or request to amend a monograph, or a request to have a drug approved based on an existing monograph, the OTC division has 180 days to review the data and respond to the submitter. When the submission is reviewed, the drug is categorized through the monograph rulemaking process as follows:
Category I: generally recognized as safe and effective and not misbranded. Category II: not generally recognized as safe and effective or is misbranded. Category III: insufficient data available to permit classification. This category allows a manufacturer an opportunity to show that the ingredients in a product are safe and effective, or, if they are not, to reformulate or appropriately relabel the product. When the initial review is complete and other consult requests have been received, a feedback letter outlining CDER’s recommendations may be prepared for the submitter. The recommendations will vary depending on the type of data submitted. For example, a response based on a request to amend a monograph may contain explanations approving or disapproving the amendment. If the submitter is not satisfied with the recommendations made by the division, the submitter may request a meeting to discuss any concerns. Advisory Committee meetings are usually held to discuss specific safety or efficacy concerns, or the appropriateness of a switch from prescription to OTC marketing status for a product. Usually the OTC advisory committee meets jointly with the advisory committee having specific expertise in the use of the product. After the Division of OTC Drug Products has completed its review, a feedback letter explaining CDER’s actions or recommendations may be prepared. This letter is forwarded to the submitter, and it remains on file at the Dockets Management Branch. In some cases, the agency responds to a submission by addressing the submitted issues in a rulemaking (i.e., monograph or amendment), in which case a feedback letter is not sent to
the submitter. If the submitter is not satisfied with the recommendations described in the feedback letter, a meeting can be requested with the division to discuss them (e.g., to provide more information or respond to any concerns of the Center). If the submitted data or information supports amending an existing monograph or creating a new monograph, a notice is published in the Federal Register (FR). After the proposal is published in the Federal Register, the public usually has 30 to 90 days to respond to it. The length of the comment period depends on how controversial the notice is, and any concerned party can request an extension. All comments are sent to the Dockets Management Branch and then are forwarded to the Division of Over-the-Counter Drug Products. The comments are reviewed and evaluated by the appropriate team and, if needed, are sent to other discipline areas for further review.

After the public comments have been reviewed, the final monograph is prepared. The final monograph is like a recipe book, which sets final standards that specify ingredients, dosage, indications for use, and certain labeling. The final monograph is sent out for clearance through the appropriate channels, such as the Office of Drug Evaluation (ODE), CDER, Office of General Counsel, Deputy Commissioner for Policy, and Regulations Editorial Staff. If any office does not concur, the submission is returned to the Division of Over-the-Counter Drug Products for revision and is then rerouted to the appropriate sources. After all revisions are made and the final monograph/amendment receives all the appropriate final concurrences, it is published in the Federal Register. All final monographs and amendments that have been published in the Federal Register are then published in the Code of Federal Regulations.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/handbook/otc.htm) by Ralph D'Agostino and Sarah Karl.
OVER THE COUNTER (OTC) DRUGS

ERIC P. BRASS, Harbor-UCLA Center for Clinical Pharmacology, Torrance, California

Availability of drugs over-the-counter (OTC) allows consumers improved access to safe and effective drugs without the burden of obtaining a physician's prescription (1-3). Many drugs have moved from prescription-only to OTC status, increasing consumers' options for managing health problems. However, safe and effective use of OTC drugs places substantial responsibilities on the consumer. Consumers must be able to self-diagnose their condition as matching the indication for a specific OTC product; assess whether their personal health history represents a contraindication to using the OTC drug or requires the input of a healthcare professional; understand the directions for using the drug (including dose, frequency, and duration of therapy); and self-manage the course of treatment based on their response to the drug. In the absence of a healthcare professional, the OTC drug label must effectively provide the information required by the consumer to accomplish these tasks. When considering a switch of a prescription drug to OTC status, the safety and efficacy of the drug has typically been established in the original clinical development program. Thus, the OTC development program is focused specifically on the ability of the consumer to use the drug as intended and as guided by the OTC label. Novel clinical research designs have been developed, and continue to evolve, to support OTC development.

1 LABEL COMPREHENSION STUDIES

As noted, the OTC label plays a central role in the safe and effective use of OTC drugs. The label must effectively communicate to the typical consumer the key messages required to use the drug. In the United States, the Food and Drug Administration (FDA) has standardized the format for OTC labels (4). Thus, the first step in OTC development is to design the OTC label and optimize its ability to meet its communication objectives (5).

1.1 Identification of Key Messages

Although all OTC drugs will share common communication themes such as directions for use, most will also have unique requirements for the safe and effective use of the drug. For example, labels for OTC drugs used to treat heartburn must help consumers with more serious conditions such as gastrointestinal hemorrhage to seek medical attention rather than self-treating with the OTC drug. OTC drug labels may include drug interaction warnings specific for the drug. All the key messages relevant to the safe and effective use of a potential OTC drug must be explicitly identified and incorporated into a prototype label. Explicit recognition and prioritization of these messages facilitates the structuring of study primary and secondary objectives as well as the statistical analysis plans for the studies performed.

1.2 Assessing Label Comprehension

OTC labels are used by consumers with diverse backgrounds. Thus, the ability of consumers to understand the proposed label syntax must be evaluated formally. This evaluation is conducted using potential consumers recruited through varied means. Demographic information is collected to allow for the influence of personal characteristics or medical history on label comprehension to be defined. Consumers are provided a prototype label and then are asked specific questions designed to assess their understanding of the key communication messages. Items can be in the form of multiple-choice questions or short-answer formats. Questions are designed either to probe simple recognition of the correct message or to probe the consumer's ability to make a decision in a described clinical scenario based on correct use of the key message being studied. As emphasis is placed on the ability of the label to communicate the key messages effectively, consumers may refer to the label while answering the questions, and no time limit is usually given. Use of multiple-choice questions necessitates use of distractors (incorrect responses) that are credible to avoid cueing of the correct response. Use of options that might be viewed as correct in any scenario (for example, "ask your doctor") is problematic in assessing the true effectiveness of the label and should be avoided. Scoring of short-answer responses should be done using objective, prespecified criteria based on inclusion of key components in the response. It may be useful to assess an individual communication objective using more than one question format to gain confidence in the reliability of the responses. The label development process should involve iterative studies with improvements in the label based on the results of each round of testing. Thus, each step need not be powered to yield definitive results but must only be adequate to inform internal decision making. During these early steps, head-to-head comparison of different label strategies may be particularly useful. The composition of the study population in label development studies is important. For example, poor literacy skills remain prevalent in the United States (6,7). Comprehension of medication labels may pose a serious challenge for consumers with low literacy (3). Thus, assessment of the proposed label's comprehension by a low-literacy cohort is an important component of the development program. Analysis of label comprehension testing should include specific analysis of the low-literacy cohort (for example, lower than eighth-grade reading level) as a prespecified secondary analysis. To ensure adequate statistical power for this analysis, recruitment requirements may specify a total number of subjects and the number of subjects that must meet low-literacy criteria. For specific drugs, other subpopulations may be of particular interest. For example, if a drug has a risk of an adverse drug-drug interaction, then the OTC label should contraindicate the use of the interacting drug. This message must be communicated to the
patients using the interacting drug of interest and not necessarily to the general consumer population. Thus, study of consumers currently using the drug of interest may allow focused development of the interaction contraindication labeling. Understanding whether patients recognize the brand name versus generic name of the drug, some lay term for the drug class (for example, "blood thinner" for warfarin), or identification of the population using the drug of interest (for example, patients with HIV infection) can be ascertained by such focused studies. Once iterative testing has preliminarily identified optimal label strategies for all messages, a pivotal label comprehension study should be performed. This study should use the full proposed final OTC label in a general consumer population with sufficient representation from low-literacy consumers to permit generalized conclusions. Although the entire label and all messages should be evaluated, particular attention should be given to the key messages unique to the new proposed OTC drug and those with the greatest potential impact on the safe and effective use of the drug. The assessment of these messages can be prespecified as primary endpoints, and the statistical analysis plan for the study can prespecify criteria for adequate comprehension. The potential importance of this prespecification of adequate comprehension has been emphasized in discussions by FDA's Nonprescription Drug Advisory Committee (8). Although criteria for meeting a comprehension standard can be based on the study's point estimate for correct responses, this action risks a false positive conclusion based on an underpowered study. Thus, consideration can be given to defining a positive result on the basis of the lower bound of the 95% confidence interval exceeding a predefined threshold, or similar criteria, to ensure that the conclusion is robust (Fig. 1). The threshold or performance standard for an adequate level of comprehension can be specified based on the importance of the message to the safe and effective use of the drug. For most messages, ensuring that 75% of the population understood the message may be adequate. However, for critical safety messages in which lack of comprehension may result in personal harm, a higher (for example, 90%) standard may be required.
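To make this decision rule concrete, the following minimal sketch (not taken from the article; the response counts, threshold, and function name are illustrative assumptions) computes the lower bound of a Wilson 95% confidence interval for the proportion of correct responders and compares it with a prespecified performance standard.

```python
# Illustrative sketch: test whether the lower bound of a 95% confidence interval
# for the proportion of correct responders exceeds a prespecified comprehension
# threshold (e.g., 75% for a typical label message). All numbers are hypothetical.
from statistics import NormalDist


def wilson_lower_bound(correct: int, n: int, confidence: float = 0.95) -> float:
    """Lower limit of the Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = correct / n
    denom = 1 + z**2 / n
    centre = p_hat + z**2 / (2 * n)
    margin = z * ((p_hat * (1 - p_hat) + z**2 / (4 * n)) / n) ** 0.5
    return (centre - margin) / denom


if __name__ == "__main__":
    # Hypothetical pivotal study: 410 of 500 consumers answered correctly.
    correct, n, threshold = 410, 500, 0.75
    lower = wilson_lower_bound(correct, n)
    print(f"point estimate = {correct / n:.1%}, 95% CI lower bound = {lower:.1%}")
    # A positive result requires the lower bound, not just the point estimate,
    # to clear the prespecified standard (situation C in Figure 1).
    print("meets standard" if lower > threshold else "does not meet standard")
```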
Figure 1. Prespecification of adequate response rates. The percentages of correct respondents to a hypothetical label comprehension question are plotted. In situation A, the number of correct respondents is simply plotted, but interpretation is limited as no threshold for a positive result was prespecified and the variability in the point estimate is unknown. In situation B, the point estimate for the percentage of correct respondents exceeds a prespecified threshold (dotted line), but the lower bound of the 95% confidence interval crosses this threshold. Thus, one might interpret the point estimate with caution. In situation C, the point estimate for the percentage of correct respondents exceeds a prespecified threshold (dotted line), and the lower bound of the 95% confidence interval does not cross this threshold. This result can be interpreted as supporting the hypothesis that the label message was understood by the study population.

2 BEHAVIORAL STUDIES IN OTC DEVELOPMENT

The label development program will yield a proposed label that effectively communicates all key messages. However, this communication does not mean that consumers will follow the guidance provided by the label. Consumers may discount the importance of certain messages, augment the information provided by the label with contradictory information obtained from other sources, or apply personal value judgments to override an understood label message. Thus, more clinical research is required to define consumers' actual behaviors if they are provided access to the proposed OTC drug.
2.1 Self-Selection Studies

The first decision point in an OTC setting is the consumers' assessment that the drug is appropriate for their personal use, followed by their purchase of the drug. This action reflects the consumer's selection decision. Self-selection studies focus specifically on this decision. The studies are designed to allow consumers to make their selection decision in as naturalistic a setting as possible to yield data relevant to the actual OTC marketplace. Cueing of important information by study personnel must be minimized and poses challenges to all aspects of trial conduct, including the obtaining of informed consent. Consumers are recruited to self-selection studies on the basis of their representation of the OTC consumers likely to evaluate the product if it were available. Thus, recruitment is usually based on the presence of the
drug's indication. For example, recruitment materials might solicit participants suffering from recurring heartburn or interested in lowering their cholesterol. Interested consumers then visit a study site where they can examine the proposed OTC drug in its proposed packaging, including the previously tested and optimized OTC label. Consumers then are asked if they feel that the drug is "right for them" or a similar question, which requires an individual decision based on the participant's actual medical history and the information on the label. After responding to the self-selection question, participants may be offered the opportunity to purchase the drug at a price similar to what would be anticipated in the OTC market if the drug were approved for OTC sales. This purchase decision requires a second level of analysis by the consumer. In a self-selection study, only after making the purchase decision is the consumer informed that he or she cannot actually purchase the product, and the formal study is completed. Comprehensive demographic information is collected after the participants' selection and purchase decisions have been made to avoid cueing important elements of the consumer's medical history. The accuracy of the self-selection decision is usually the primary endpoint in self-selection studies, with the purchase decision as a secondary endpoint. However, evaluation of the correctness of the self-selection decision may be complex. For example, if an OTC label has five elements that apply to an individual study participant (for example, age, concomitant medications, comorbid disease, etc.), then a fully correct decision would require that the participant meet each of the five criteria. If each decision by the consumer were independent and the consumer could achieve a 90% correct rate for each, then all five would be correct only 59% of the time. If all five messages are equivalently important clinically, then requiring correctness for all five elements in this example would be reasonable, but this is rarely the case. Thus, consideration of the two self-selection decisions around each key label communication message allows identification of important coprimary or secondary endpoints in self-selection studies.
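The 59% figure follows directly from multiplying the per-element accuracies; the minimal sketch below uses the hypothetical values from the text (five independent label elements, 90% accuracy each).

```python
# Probability that a consumer gets every label element right when each of the
# five self-selection decisions is independent and individually 90% accurate.
per_element_accuracy = 0.90
n_elements = 5

p_all_correct = per_element_accuracy ** n_elements
print(f"P(all {n_elements} elements correct) = {p_all_correct:.2f}")  # ~0.59
```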
It is the clinical implications of "correctness" of self-selection decisions that pose substantial challenges to the interpretation of these studies. For example, when study participants decide to ignore a label age restriction by 1 year because they know they have the drug's indication and feel that a 1-year age difference cannot be important, it is unlikely to pose a health risk. In contrast, participants who inappropriately select to use the drug despite a drug interaction risk may expose themselves to substantial harm. In the context of the trial, both are "wrong," but their import is clearly different. In this context, it is often useful after the self-selection and purchase decisions to ask consumers who made incorrect self-selection decisions why they made those decisions. Responses to these questions might yield information that mitigates some incorrect decisions or might suggest unanticipated misunderstandings that require strengthening of label elements. The analysis of a self-selection study is guided by the concepts introduced above for the pivotal label comprehension study. Analysis is focused on the cohort that selected to use the drug, as an inappropriate decision to not use the drug when the consumer may have benefited does not represent an individual or public health risk. The primary endpoint is the percentage of self-selectors who are label compliant, with a prespecified definition of a positive trial. Correct self-selection based on individual critical label components and the correctness of the purchase decision represent important secondary endpoints. The ability of low-literacy consumers to make self-selection decisions is of particular importance. The performance of a low-literacy subpopulation is an appropriate secondary endpoint for self-selection studies. Other populations may be of special interest. For example, a self-selection study might be done exclusively in a cohort currently using a contraindicated medication to ensure that the label drug interaction warning effectively prevents self-selection by these consumers. Reliance on a general population sample would be unlikely to yield sufficient numbers of such patients to allow a meaningful conclusion relative to what might be an important health consideration.
2.2 Actual Use Studies

Actual use studies extend the assessments made in self-selection studies to fully define consumer behaviors if consumers were provided access to the drug in the absence of a learned intermediary. The scope of the study extends beyond the selection decision and allows consumers to purchase and use the product, purchase additional supplies as desired, and self-manage the full course of treatment. The latter component is particularly important as consumers must self-triage based on the evolution of their own medical condition and response to the drug and must discontinue treatment if adverse events develop (a "deselection" decision). Thus, the actual use study can uniquely probe hypotheses relevant to safe use of OTC drugs (Table 1). Initial recruitment to an actual use study is similar to that for a self-selection study as described above. The study center may emphasize naturalistic elements such as shelf displays to facilitate both study participant comfort and opportunities for repurchase of drug, if incorporated into the protocol. As study participants actually will have the opportunity to use the study medication, the need to assess consumer behaviors must be balanced by the need to protect participant safety. Thus, exclusion criteria are kept to a minimum. However, consumer comorbid conditions, concomitant medications, or other conditions such as pregnancy that represent a health risk if the consumer used the drug must exclude the patient from actually taking the drug after their purchase decision. To allow assessment of unbiased selection and purchase decisions, screening for exclusion criteria should be done after the purchase decision but before the opportunity for actual product use.
Similarly, data collection during the treatment period should be minimal to avoid cueing of desired behaviors. Use of study diaries, follow-up phone calls, or formal study visits may reinforce the importance of correct behaviors and mask the consumers' true inclinations if the drug were used in an actual unsupervised setting. Thus, a natural tension occurs between collecting comprehensive data and the use of the data collected in the actual use studies. One compromise is to collect as much data as possible at the conclusion of the study, when consumer behavior can no longer be affected. However, this method means that participant recall may be a major factor. Objective data, such as repurchase activity, may provide more useful data on duration and amount of drug used. However, collecting some data during the trial is often necessary, and it is then important that the data collection methods be as unobtrusive as possible and minimize cueing of behaviors of interest. Primary endpoints for actual use studies should be prespecified based on the behaviors of most importance to the safe and effective use of the proposed OTC drug. As discussed for label comprehension studies, prespecification of thresholds for acceptable behavior rates enhances the interpretation of the trial results.

Table 1. Examples of Hypotheses that can be Assessed in Actual Use Studies
Consumers with contraindications to use of the drug will not purchase and use the drug.
Consumers will discontinue use of the drug after the duration of treatment specified on the label.
Consumers will not exceed the dose specified on the OTC label.
Consumers will discontinue use of the drug if they develop symptoms consistent with an adverse drug effect as directed by the label.
Consumers will self-triage to a healthcare professional if their medical condition worsens as defined by the OTC label.

3 SUMMARY AND CONCLUSIONS

Clinical trials used to support prescription-to-OTC switch development programs face unique challenges in defining how consumers would use the drug if guided by the drug's OTC label and in the absence of a healthcare professional. Trial endpoints are often less clear than those in a traditional efficacy study, as they focus on levels of consumer
understanding and their behaviors based on proposed labeling. Nonetheless, rigorous trial design and conduct allow these trials to inform and facilitate clinical and regulatory decision making. What combination of label comprehension, self-selection, and actual use studies is required for a specific OTC development program will vary widely. However, the goal of the development program remains to allow a data-driven, affirmative assessment that the proposed drug can be used safely and effectively without healthcare professional supervision in the OTC marketplace.
4 DISCLOSURES

The author is a consultant to GSK Consumer Healthcare, Johnson & Johnson–Merck, McNeil Consumer Pharmaceuticals, and Novartis Consumer Health.

REFERENCES

1. E. P. Brass, Changing the status of drugs from prescription to over-the-counter availability, N. Engl. J. Med. 2001; 345(11): 810–816.
2. R. P. Juhl, Prescription to over-the-counter switch: a regulatory perspective, Clin. Ther. 1998; 20(Suppl. C): C111–C117.
3. C. L. Roumie and M. R. Griffin, Over-the-counter analgesics in older adults: a call for improved labelling and consumer education, Drugs Aging. 2004; 21(8): 485–498.
4. Food and Drug Administration. (2000). Labeling OTC Human Drug Products Using a Column Format. (online). Available: http://www.fda.gov/cder/guidance/3594FNL.PDF.
5. E. P. Brass and M. Weintraub, Label development and the label comprehension study for over-the-counter drugs, Clin. Pharmacol. Ther. 2003; 74(5): 406–412.
6. M. Kutner et al., Literacy in Everyday Life: Results from the 2003 National Assessment of Adult Literacy. Washington, DC: National Center for Education Statistics, April 2007.
7. J. A. Gazmararian et al., Health literacy among Medicare enrollees in a managed care organization, JAMA. 1999; 281(6): 545–551.
8. Food and Drug Administration. (2006). Nonprescription Drugs Advisory Committee Meeting, September 26, 2006. [cited 2007 April 11] (online). Available: http://www.fda.gov/ohrms/dockets/ac/06/transcripts/20064230t.pdf.

CROSS-REFERENCES

Over-the-Counter (OTC) Drugs – Case Studies
Confidence Interval
Statistical Analysis Plan
Informed Consent
Power
Overview of Anti-Infective Drug Development (UNIT 13A.1)

The identification and development of antibacterials, which are synthetic compounds, and antibiotics, which are semisynthetic derivatives of natural products, is a nonstandard, somewhat empirical process. The effort is based on practices developed in the industrial setting, and it generally requires 7 to 10 years to identify, optimize, develop, and launch a drug (Milne, 2003). The success rate is relatively low, with costs in excess of $800 million per marketed anti-infective (DiMasi et al., 2003). In general, the development of antibacterials is more rapid than for other drug classes, due to the characteristics of the preclinical and clinical tests used in their discovery. The discovery and development of anti-infectives can be divided into at least eight steps: exploratory phase, early phase, lead optimization, candidate selection, preclinical profiling, clinical development, regulatory filing, and approval and launch (Table 13A.1.1). While these eight steps apply to drug discovery and development in any therapeutic area, an antibacterial discovery program has some unusual characteristics compared to corresponding programs for other drug classes. Antibacterial drug discovery is often conducted by simultaneously screening for inhibitors of multiple, independent bacterial targets. This differs from other therapeutic area targets in which only one or two sites are targeted by a single compound. Indeed, there may be over 200 targets for antibiotics, with inhibition of any one having the potential to yield important therapeutic benefits. For this reason, antibacterial research and development may involve overlapping processes of optimizing multiple lead inhibitor series against multiple, individual targets. For example, one project may involve screening against a cell wall target, such as MurA, while at the same time the project team is optimizing a lead identified months earlier in another screen. In fact, genomics has yielded so many targets that a project team may be working on a dozen targets simultaneously, with only one or two new compounds ultimately being tested in vivo. In contrast, a eukaryotic target for a disease state may have just a few mechanistic approaches for pharmacological intervention.
EXPLORATORY PHASE For antibacterials, the exploratory phase of discovery lasts approximately 3 months. It encompasses the selection and validation of the target(s), and the design of a critical path for implementation to ensure the technical feasibility of the approach. All therapeutic areas are similar in their exploratory timeline and process, with the possible exception of target validation. Validation in microorganisms is usually accomplished in a matter of months by using any of several genetic tools, such as knock-out animals or conditional lethal genetic strategies. One characteristic of the validation process for antibiotics concerns the uniqueness of the genomes of different pathogens, which is a key to bacterial target validation. That is, for an antibiotic to have broad-spectrum effects, each of the various pathogens must produce the same target with sufficient structural similarity so as to respond to the same chemical agent. The discovery of an effective antibacterial hinges on the successful identification of a compound that inhibits an essential gene product of one or more bacterial pathogens (Dougherty et al., 2002). At least two approaches may be taken to accomplish this task. The first is to identify compounds with wholecell antimicrobial activity, after which the site of action is determined. The second approach is to use in vitro screens to identify inhibitors for a selected bacterial target and then demonstrate its antimicrobial activity (see UNIT 13A.2). In terms of drug discovery, an antimicrobial target is a bacterial gene product essential for survival of the organism (Haney et al., 2002). When a target is found in just a single pathogen, the relevant antimicrobial agent is referred to as being narrow-spectrum. If a target is essential for a group of pathogens, such as all Gram-positive or Gram-negative organisms, an effective antibiotic is referred to as a limited- or directed-spectrum agent. Drugs that inhibit a gene product that is essential for all pathogens are termed broadspectrum agents. Targets may be identified and/or selected by analyses of DNA sequences from bacterial pathogens by genomics/ bioinformatics. The essentiality of a target is assessed using established protocols involving
genomics and bioinformatics (Arigoni et al., 1998; Judson and Mekalanos, 2000; Chalker and Lunsford, 2002; Thanassi et al., 2002; Pucci et al., 2003; Lindsay, 2004). The process for the selection of the target(s) for antibacterial drug development is described in UNIT 13A.2. Alternatively, targets can be identified by reverse chemical genomics. With this approach, a target-specific inhibitor is used to identify the actual target in the intact bacterium, as opposed to an in vitro screen against an isolated target, after which the antibacterial activity is assessed. A compound exhibiting the desired activity is then tracked back to its target by any of a number of procedures (DeVito et al., 2002).

Table 13A.1.1 Phases of the R&D Process of Antibacterial Drug Discovery (phase, approximate time, and description/unique characteristics)

Exploratory phase (3 months): Project team selects multiple targets from over 250 essential genes, in contrast to a single or few eukaryotic targets for chronic diseases. Genomics may be used to select the target by multiple genome analyses to assure conservation of the target and broad-spectrum inhibition. The bacterial genome is remarkably small and simple (∼3-9 Mb) in contrast to the human genome with its multiple complex regulatory networks.

Early phase (3-6 months): High-throughput screening (HTS) encompasses both activity against the bacterial target(s) and the ability to access the target (permeate the bacterium), versus eukaryotic targets, which are generally accessible.

Lead optimization (12-18 months): The antimicrobial agent must be optimized with respect to a multitude of variables against multiple pathogens. This adds a level of optimization not encountered with eukaryotic targets.

Candidate selection (1-3 months): Prokaryotic and eukaryotic systems have the same decision points here in that they are safety- and efficacy-driven. A distinguishing feature of antimicrobial research is that preclinical animal models of efficacy reliably predict clinical outcomes.

Preclinical profiling (3-6 months): Same as target candidate drug profiling; safety-driven.

Clinical development (4-6 years): Antimicrobial development distinguishes itself from eukaryotic targets in that there are multiple pathogens to be covered for each indication; therefore clinical trials must include at least 10 cases of each individual pathogen to attain a labeled indication.

Regulatory submission (6-12 months): Similar between prokaryotes and eukaryotes, but antimicrobials dealing with acute, unmet medical needs due to resistance may more easily receive Priority Review.

Approval and launch (0-3 months): Similar between prokaryotes and eukaryotes.
High-throughput screening (HTS; UNIT 9.4) is used to identify chemical hits which, once confirmed and characterized, progress to lead status and, ultimately, following optimization, to candidate status. The identification of hits/leads is often a difficult and tedious process requiring the testing of between 100,000 and 1 million compounds. The goal is to identify a bona fide, target-based hit/lead suitable for medicinal chemistry-based optimization, or druggable chemotype (Beyer et al., 2004; Look et al., 2004; Pratt et al., 2004). A druggable chemotype is a compound that possesses a structure that can likely be modified to become a drug candidate (Lipinski et al., 1997). For
antimicrobials in particular, natural product screening is often used to identify a chemotype that is then chemically modified before attaining drug candidate status (Silver and Bostian, 1990; Bleicher et al., 2003). Automated systems are used to analyze large numbers of compounds in miniaturized biological assays. Among the equipment utilized in these systems are dilution and mixing devices, delivery robotics, and detection systems.

EARLY PHASE

This phase of antibacterial discovery research, which generally consumes 3 to 6 months, includes the establishment of high-throughput screens and biological assays to support a medicinal chemistry program from lead optimization through candidate selection. The early phase also includes the HTS identification of hits (compounds that are active in the initial screen), confirmation of hits in the screening assay, and determination of the concentration of the compound that inhibits the targeted activity by 50% (the IC50 value). Successful completion of this phase leads to declaration of a lead. A lead is a compound initially identified as a hit in the primary screening effort, which is confirmed using secondary assays. To be considered a lead, the compound must permeate the target bacteria and demonstrate a minimum inhibitory concentration (MIC) of 4 to 16 µg/ml. It must also display other physical, chemical, and biological properties that lend themselves to chemical development into a clinical candidate (Lipinski et al., 1997). Some of the secondary assays used to promote a hit to a lead include susceptibility of additional strains of pathogenic bacteria, susceptibility of resistant clinical isolates and of strains with defined resistances, determination of minimum bactericidal concentrations (MBC), kill curves (UNIT 13A.3), pattern of labeling of macromolecular synthesis (DNA, RNA, protein, cell walls, and lipids), specificity or selectivity against other targets, and determination of cytotoxicity in mammalian cells. An advantage for antibacterial drug discovery is that all in vitro target interventions can be verified using in vivo preclinical animal models with certainty that the results will predict efficacy in humans. This early-phase effort may include synthesis of a limited number of analogs to assess the effects of chemically modifying the hit. Chemical modifications are usually necessary to optimize solubility, in vitro microbiological activity, the spectrum of inhibitory activity, and pharmacokinetics, including the Cmax, T1/2, clearance rate, and area under the curve. Sometimes additional tests must be performed in order for a hit to advance to lead status, such as profiling against a panel of unrelated eukaryotic targets to assure that there are no, or minimal, untoward off-target activities in the human host, or even a more detailed preliminary evaluation of taste for an oral compound.

LEAD OPTIMIZATION

The lead optimization process generally requires ∼12 to 18 months to complete, and employs a fully integrated team of scientists conducting a battery of biological assays to support a structure-activity relationship (SAR) program. For lead optimization, the entire complement of biological assays is usually employed, including in vitro microbiological assays and in vivo assays for predicting efficacy. The latter entail the use of one or more animal species, along with assays to measure solubility, pharmacokinetic properties, and ancillary pharmacology, e.g., metabolism, and a study of possible side effects. Goals must be established beforehand to determine whether the results obtained with a particular compound indicate the lead has been optimized (Table 13A.1.2). It is important to establish specific, quantitative objectives for a discovery program, with both minimum and optimum criteria. These criteria are established on the basis of information contained in the published literature and obtained from personal experience (Table 13A.1.2). Quantitative objectives should be established for data that are to be obtained from microbiological, biochemical, pharmacological, and toxicological tests. Under the headings below are some general criteria that must be displayed for a potential antimicrobial to be considered an optimized lead.
Target Activity

Criterion: Target inhibition with an IC50 value ≤10 µM, and an optimum IC50 <1 µM. In general, prokaryotic targets are less sensitive to inhibition than eukaryotic targets, with IC50 values for the former ranging from high nanomolar to low micromolar, whereas IC50 values are picomolar to nanomolar for most eukaryotic target inhibitors.

Criterion: The target(s) should be confirmed as essential, such that inhibition leads to death of the microbe.
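As an illustration of how an IC50 might be estimated from screening data, the sketch below fits a four-parameter logistic (Hill) model to a hypothetical concentration-inhibition series; the data, starting values, and units are invented for demonstration and are not taken from this unit.

```python
# Illustrative sketch: estimate an IC50 by fitting a four-parameter logistic
# (Hill) model to percent-inhibition data from an in vitro target assay.
import numpy as np
from scipy.optimize import curve_fit


def hill(conc, bottom, top, ic50, slope):
    """Four-parameter logistic: inhibition rises from bottom to top with concentration."""
    return bottom + (top - bottom) / (1.0 + (ic50 / conc) ** slope)

# Hypothetical twofold dilution series (µM) and measured % inhibition of the target.
conc = np.array([0.05, 0.1, 0.2, 0.39, 0.78, 1.56, 3.13, 6.25, 12.5, 25.0])
inhibition = np.array([4, 8, 15, 27, 42, 58, 73, 85, 92, 96], dtype=float)

params, _ = curve_fit(hill, conc, inhibition, p0=[0.0, 100.0, 1.0, 1.0],
                      bounds=(0, np.inf))
bottom, top, ic50, slope = params
print(f"estimated IC50 ≈ {ic50:.2f} µM (lead criterion: ≤10 µM; optimum: <1 µM)")
```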
Table 13A.1.2 Criteria for a Lead Versus a Candidate (criterion: lead minimum value / candidate optimum value)

Target inhibition (IC50): ≤10 µM / ≤1 µM
In vitro susceptibility (MIC90): 4 to 16 µg/ml / ≤1 µg/ml
In vivo efficacy (ED50 or PD50): ≤20 mg/kg / ≤5-10 mg/kg
Solubility: ≥100 µg/ml / ≥2-4 mg/ml
Resistance emergence (a): ≤1 × 10^-8 / ≤1 × 10^-10
Pharmacokinetic properties enabling dosing: tid / once or twice a day
Off-target activity (b): >10-fold / >100-fold
Therapeutic index (selectivity over dose-limiting toxicity): ≥10 / ≥100

(a) Frequency of selection of a drug-resistant mutant from a bacterial population of between 10^9 and 10^12 cells.
(b) Off-target activity is a measurement of selectivity of inhibition of a microbial target compared to homologous or nonhomologous host protein/targets.
Whole-Cell Bacteria Activity

Criterion: Minimum inhibitory concentration (MIC) of 4 to 16 µg/ml, with an optimal value being ≤1 µg/ml (UNIT 13A.3). The MIC, which is determined using standard procedures developed by the National Committee for Clinical Laboratory Standards (NCCLS, 1997), is a broth- or agar-based growth-inhibitory assay used to quantify the intrinsic activity of an antibacterial agent on intact bacteria. The MIC takes into account the intrinsic target activity, the ability of the compound to penetrate into the organism, the tendency for the compound to be transported out of the cell (efflux), accessibility of the compound to the target, the stability of the compound with regard to microbial resistance factors, chemical stability, and the stoichiometry between target and test agent.
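The sketch below illustrates, under simplifying assumptions, how an MIC might be read from a twofold broth-microdilution series; the concentrations, optical-density readings, and no-growth cutoff are invented and are no substitute for the standardized NCCLS procedure cited above.

```python
# Illustrative sketch: read a MIC from a twofold broth-microdilution series.
# The MIC is taken as the lowest drug concentration with no detectable growth,
# approximated here by an optical-density (OD600) cutoff. All values are invented.

def mic_from_dilution_series(readings, no_growth_cutoff=0.05):
    """readings: dict of concentration (µg/ml) -> OD600; returns MIC or None."""
    for conc in sorted(readings):
        if readings[conc] <= no_growth_cutoff:
            return conc  # lowest concentration showing no detectable growth
    return None  # no inhibition within the tested range


if __name__ == "__main__":
    od_by_conc = {0.5: 0.62, 1: 0.58, 2: 0.41, 4: 0.22, 8: 0.03, 16: 0.02, 32: 0.02}
    mic = mic_from_dilution_series(od_by_conc)
    print(f"MIC = {mic} µg/ml (lead criterion: 4-16 µg/ml; optimum: ≤1 µg/ml)")
```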
In Vivo Efficacy in Preclinical Animal Species
Criterion: ED50 or PD50 <20 mg/kg, with the optimum value being ≤5-10 mg/kg. Assessment of in vivo activity is the standard procedure for measuring the potential of a compound to prevent (prophylaxis) or treat human infections. In vivo activity is measured in laboratory animals by examining the ability of a compound to protect against death due to infection (Bergeron, 1978; Acred, 1986; Fernandez et al., 1997; Xuan et al., 2001). Such preclinical tests must utilize a quantitative, reproducible model of the target infection. The effectiveness of the test agent may be examined in any of a number of species,
including mouse, rat, gerbil, and dog (UNITS 13A.4 & 13A.5). The overall activity of a test agent in these assays is sometimes referred to as its potency, taking into account all of the factors related to its in vitro activity and its pharmacokinetics. The in vivo activity of a test agent depends on its specificity and selectivity, and its ability to be delivered in vivo in a pharmacologically active form.
Pharmacokinetics

Criterion: It must be demonstrated that the test agent has systemic properties that allow for once to three times per day dosing for hospital-based products, to once or twice per day dosing for oral administration on an outpatient basis. Pharmacokinetic measurements taken to assess these properties include oral bioavailability, time to maximum blood concentration (Tmax), peak plasma concentration of drug (Cmax), time for the concentration of drug to decrease from the maximum in blood/serum by 50% (T1/2), area under the curve (which measures the drug concentration in blood over the dosing interval or over a 24 hr period), clearance rate (CL, either over the dosing interval or a 24 hr period), and volume of distribution (Vd), which represents the relative distribution of the agent between the blood and the tissues or organs.
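As a worked illustration of these quantities, the sketch below derives non-compartmental estimates of Cmax, Tmax, AUC, terminal half-life, and apparent clearance and volume of distribution from a hypothetical single-dose plasma concentration-time profile; all values, including the dose, are invented.

```python
# Illustrative sketch: non-compartmental PK estimates (Cmax, Tmax, AUC over the
# sampled interval, terminal half-life, CL/F, Vd/F) from an invented profile.
import math

import numpy as np

time_h = np.array([0.25, 0.5, 1, 2, 4, 6, 8, 12, 24])              # hours
conc = np.array([1.8, 3.2, 4.1, 3.6, 2.6, 1.85, 1.3, 0.65, 0.08])  # µg/ml
dose_mg = 500  # single oral dose; bioavailability folded into CL/F and Vd/F

cmax = conc.max()
tmax = time_h[conc.argmax()]
# Linear trapezoidal AUC from first to last sample (µg·h/ml).
auc = float(np.sum((conc[1:] + conc[:-1]) / 2 * np.diff(time_h)))

# Terminal half-life from log-linear regression over the last four samples.
slope, _ = np.polyfit(time_h[-4:], np.log(conc[-4:]), 1)
t_half = math.log(2) / -slope

cl_apparent = dose_mg / auc                        # L/h, since µg/ml = mg/L
vd_apparent = cl_apparent * t_half / math.log(2)   # L

print(f"Cmax={cmax} µg/ml at Tmax={tmax} h; AUC={auc:.1f} µg·h/ml")
print(f"t1/2≈{t_half:.1f} h; CL/F≈{cl_apparent:.1f} L/h; Vd/F≈{vd_apparent:.0f} L")
```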
Therapeutic Index (TI)
Criterion: A minimum TI >10, with an optimum ≥100. The therapeutic index is the ratio of the minimum dose producing adverse effects to the dose required for efficacy.
Opinions vary as to the minimum TI acceptable for antibacterials/antibiotics. Given existing agents, a TI of ∼100 is generally thought to represent a safety margin acceptable for human testing (Hann and Oprea, 2004).
CANDIDATE SELECTION This last step in the process of selecting a clinical candidate usually requires 1 to 3 months of advanced medicinal chemistry support to identify and fully characterize a compound (Table 13A.1.2) that displays the required characteristics of an antibacterial agent based on in vitro and in vivo assays, and a safety profile appropriate for testing in humans. Included in the candidate-selection phase are more extensive ancillary pharmacology tests, acute toxicity testing, and gross toxicological testing. Although each pharmaceutical firm may use different criteria for declaring a clinical candidate, there is general agreement about the minimum standards required with regard to efficacy of a new anti-infective. In general, efficacy must be demonstrated in one or more laboratory animal species, usually using different models reflective of different infections anticipated in humans (UNIT 13A.4 & 13A.5). The effective dose, which represents the dose needed to protect 50% of the animals or to reduce an infection by two log units of bacterial growth (ED99 ), should be <20 mg/kg. Single-digit ED50 or ED99 mg/kg values indicate excellent in vivo potency for an antibiotic.
PRECLINICAL PROFILING This phase generally requires 3 to 6 months, including the time needed for scaling up chemical synthesis to produce sufficient compound for extensive preclinical toxicity testing. In this case, one batch of compound for good manufacturing procedures/good laboratory practices (GMP/GLP) testing is used to ensure consistent, reproducible data. During this time, a project team makes the final preparations for Phase I clinical studies by scaling compound synthesis to the level necessary for completion of both Phase I and the first part of Phase II studies. A formal safety assessment study on the same batch of material is performed using two different animal species to minimize the chance of injury to a study participant. These preclinical safety studies are usually conducted for a period of time in excess of that anticipated for the clinical studies and at doses higher than those to be used in humans.
CLINICAL DEVELOPMENT This frequently takes 4 to 6 years or more for antibacterials, depending on the number of clinical indications sought, the patient-safety database required, the medical need prioritization by the Food and Drug Administration (FDA), the difficulty in recruiting patients, and the complexity of the clinical trials. The three stages of antibacterial clinical development are described under the following headers.
Phase I This first human-experience trial measures safety and pharmacokinetics of single and multiple doses of the drug candidate in a small number of healthy human volunteers. This testing is often conducted outside the United States because it is easier to obtain permission for human studies abroad. A Phase I study of a potential antibiotic usually takes 4 to 6 months to complete.
Phase II The clinical efficacy of the antibacterial candidate is assessed in a Phase II study, which normally requires 12 to 18 months to complete. Phase II may include “pivotal studies,” which are experiments that must succeed to demonstrate proof-of-concept that this compound and drug class are effective in humans. Failure of pivotal studies usually leads to termination of development. For example, a pivotal study for an antibacterial candidate intended for community use in the treatment of respiratory tract infections (RTIs) may include a clinical protocol that assesses efficacy in eradicating the predominant RTI bacterial pathogens. To demonstrate the efficacy required for FDA licensing for communityacquired pneumonia (CAP), the pivotal study is designed to demonstrate both microbiological and clinical (symptomatic) cure. This includes demonstrating that the agent eradicates the pathogen as measured microbiologically and relieves symptoms observed in those with the most frequently occurring RTI pathogens (Streptococcus pneumoniae, Haemophilus influenzae, Moraxella catarrhalis), or any of the atypical pathogens (Mycoplasma pneumoniae, Chlamydia pneumoniae, Legionella pneumophila). Depending on the design of the clinical trial, approximately 100 to 300 patients may be enrolled to ensure that statistically significant differences between the treatment and placebo groups are reliably detected.
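For a rough sense of where enrollment figures of this order come from, the sketch below carries out a standard two-proportion sample-size calculation; the assumed cure rates, significance level, and power are illustrative assumptions, not values from this unit.

```python
# Illustrative sketch: per-group sample size for detecting a difference between
# two independent proportions (e.g., clinical cure rates) with a two-sided test.
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Hypothetical cure rates: 85% on the candidate antibacterial vs. 65% control.
    n = n_per_group(0.85, 0.65)
    print(f"≈{n} patients per group, ≈{2 * n} total")  # roughly 150 total here
```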
Phase III

This phase of clinical testing usually involves 2000 to 3000 patient volunteers and requires 3 years or more to complete, depending on the indications sought and the complexity of the trials. In Phase III studies, large numbers of patients are treated with the candidate drug, which is directly tested against one or more comparator drugs to assess its efficacy and safety in an unbiased, often blinded, infected population selected on the basis of FDA-approved criteria. The trial subjects are randomly assigned to one of three groups: placebo controls, comparator agent, or candidate drug. For example, expanding on a Phase II pivotal study for community-acquired pneumonia, a sponsoring company may examine efficacy and adverse events in a 400- to 700-patient Phase III study for CAP. The sizes of the treatment groups are carefully calculated to ensure statistical validity. Phase III studies also involve extensive monitoring of adverse events (AEs) from which the FDA draws the specific wording to describe limitations in the labeling if the agent should be approved for use. For antibacterials, AEs typically encountered range from nausea, diarrhea, headache, and taste perversion to myelosuppression and cardiac and central nervous system side effects.
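As an illustration of balanced assignment to the three arms described above, the sketch below generates a permuted-block randomization schedule; the block size, seed, and arm labels are arbitrary choices for this example rather than a prescribed trial procedure.

```python
# Illustrative sketch: permuted-block randomization of subjects to three arms
# (placebo, comparator, candidate drug) so that group sizes stay balanced.
import random

ARMS = ["placebo", "comparator", "candidate"]


def permuted_block_schedule(n_subjects: int, block_size: int = 6, seed: int = 42):
    assert block_size % len(ARMS) == 0, "block size must be a multiple of the number of arms"
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_subjects:
        block = ARMS * (block_size // len(ARMS))
        rng.shuffle(block)        # random order within each balanced block
        schedule.extend(block)
    return schedule[:n_subjects]


if __name__ == "__main__":
    allocation = permuted_block_schedule(12)
    print(allocation)
    print({arm: allocation.count(arm) for arm in ARMS})  # balanced counts
```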
REGULATORY FILING

Preparation of the documents necessary for seeking marketing approval by the FDA generally occurs during the final 6 to 12 months of clinical testing, usually overlapping with Phase III. The final report represents the New Drug Application (NDA). The FDA review of this application must conclude within 12 months of the filing date, or 6 months if it is granted a priority review because the agent addresses a critical unmet need, such as treatment of methicillin-resistant Staphylococcus aureus. The NDA contains all preclinical and clinical data collected during the research and development process, including detailed case study reports of all patients and extensive statistical analyses of outcomes and safety. The NDA also contains a requested label for approval, listing the proposed indications, method(s) of delivery, dosage(s), length of treatment, and the results relating to efficacy, pharmacology, and toxicology.
APPROVAL AND LAUNCH

The research and development of a new antibacterial ends with FDA approval, or disapproval, of the request to sell the drug. In some cases, the NDA is awarded an Approvable Letter providing conditional permission to license the drug for human use provided certain criteria are met. This process always requires additional review by the FDA. The last stage of the approval process, which generally lasts for approximately 3 months, entails review and negotiation with the FDA about the content of the drug label if the application is approved. Unique to antimicrobial drug labels is specification of both the indication for treatment and the specific pathogens within the individual indication. Thus, an approved label may indicate that the pathogen is covered in one disease indication but not in another.
STRATEGIC PLANNING While, to a large extent, the drug-discovery process is customized for each project, certain principles apply to all (McDevitt and Rosenberg, 2001; Mills, 2003; Nwaka and Ridley, 2003; Rich, 2004). Some of these principles are discussed below.
Know the Goal by Defining the Endpoint

All scientists engaged in discovery and development of antibacterials seek to identify a novel, patentable agent that is effective against a relevant spectrum of pathogens (Barrett and Barrett, 2003; Haney et al., 2002). For a new chemical entity to be considered a drug candidate, it must display the desired antimicrobial activity in vitro and in the clinic, and it must address an unmet medical need. For example, for an agent to be positioned as a community, oral RTI drug, it should have antimicrobial activity, both in vitro and in the clinic, against S. pneumoniae, H. influenzae, Moraxella catarrhalis, and against several so-called atypical pathogens (i.e., M. pneumoniae, C. pneumoniae, L. pneumophila). Given the current agents on the market, medical need would require activity against penicillin-resistant S. pneumoniae or macrolide-resistant H. influenzae. Likewise, a drug intended for use in intensive care should display a broad spectrum of antimicrobial activity in vitro, and clinical activity against Gram-positive pathogens, MRSA, vancomycin-resistant enterococci, and/or Gram-negative pathogens such as Pseudomonas aeruginosa, Klebsiella pneumoniae, and Acinetobacter baumannii. In addition, incremental advances that may improve treatment outcomes, such as dosing interval, length of treatment, or better safety
or tolerability, may also be viewed as meeting a medical need. Moreover, the commercial viability of a product is directly related to its ability to address an unmet medical need.
Quality of the Target

The target chosen must meet well-defined criteria to maximize the opportunity for identifying and developing a drug. The criteria and process for target selection are described in detail in UNIT 13A.2. The term "quality" is a subjective evaluation that encompasses the microbiological, genetic, biochemical, molecular, pharmacological, and toxicological characteristics of the target.
Patent Position

An unencumbered patent position is particularly important for antibacterial drug discovery because a significant number of DNA sequence–based patents have been issued. These patents may restrict the use of a DNA sequence in discovery efforts and in technology-based platforms, greatly hindering the discovery process.
Ability to Assess Inhibitor/Target Activity Success in identifying new antimicrobial agents requires the ability to assay the target and identify inhibitors. The tools necessary to attain these objectives are described elsewhere in this chapter. Given the long-term financial impact of the decision, the selection of a lead should be based on the totality of the data. Thus, for example, while the off-target activity of a compound is a variable in prioritizing compounds for SAR studies (Milne, 2003), absence of the target gene in a vital bacterial pathogen is reason alone to terminate further work on these compounds. Another fatal flaw is an anomalous SAR that does not permit the engineering of desired improvements into the molecule without adversely affecting efficacy or toxicity.
SUMMARY Most industrial scientists would agree that a challenging aspect of drug-discovery research is the number of variables that must be considered when moving chemical leads towards clinical candidate status. To this end, it is necessary to balance intrinsic inhibitory activity against the target with the following: (1) a molecular structure that allows the drug
to access the target, (2) appropriate pharmacokinetic qualities across the desired dosing interval, (3) selectivity over the eukaryotic homolog, and (4) safety. The difficulties associated with coordinating these properties often account for the low success rate associated with drug discovery. The merger of data from all biological, chemical, and physiochemical laboratory measurements with medicinal chemistry allows for the development of a chemical series SAR which, in turn, yields improvements in the characteristics of the lead compound (Milne, 2003). For example, the analyses of IC50 data for a series of related analogs are compared to the maintenance, improvement, or loss of MICs, to characterize the effect of structural modifications of the molecule. Likewise, tracking the ability of a compound to penetrate into the target organism—which is usually measured in genetically defined strains by MIC determinations, while maintaining the same IC50 values against the target—makes it possible to compare the MIC values to measurements of toxicity, such as inhibition of cell growth or passive lytic potential. In the end, the measurement of success of an antibacterial drug discovery program is the identification of compounds that are eventually marketed. Although it may be possible to achieve scientific prominence in one aspect of the drug discovery and development process, success of a program is ultimately judged by how well an organization is able to execute all aspects of the program, as measured by the number of new drugs developed.
LITERATURE CITED Acred, P. 1986. The Selbie or thigh lesion test. In Experimental Models in Antimicrobial Chemotherapy (O. Zak and M.A. Sande, eds) pp. 109-121. Academic Press, London. Arigoni, F., Talabot, F., Peitsch, M., Edgerton, M.D., Meldrum, E., Allet, E., Fish R., Jamotte, T., Curchod, M.L., and Loferer, H. 1998. A genome-based approach for the identification of essential bacterial genes. Nat. Biotechnol. 3:483-489. Barrett, C.T. and Barrett, J.F. 2003. Antibacterials: Are the new entries enough to deal with the emerging resistance problems? Curr. Opin. Biotechnol. 14:1-6. Bergeron, M.G. 1978. A review of models for the therapy of experimental infections. Scand. J. Infect. Dis. 14:189-206. Beyer, D., Kroll, H.P., Endermann, R., Schiffer, G., Siegel, S., Bauser, M., Pohlmann, J., Brands, M., Ziegelbauer, K., Haebich, D., Eymann, C., and Brotz-Oesterhelt, H. 2004. New class of bacterial phenylalanyl-tRNA synthetase inhibitors
with high potency and broad-spectrum activity. Antimicrob. Agents Chemother. 48:525-532. Bleicher, K.H., Bohm, H.J., Muller, K., and Alanine, A.I. 2003. Hit and lead generation: Beyond high-throughput screening. Nat. Rev. Drug Discov. 2:369-378. Chalker, A.F. and Lunsford, R.D. 2002. Rational identification of new antibacterial drug targets that are essential for viability using a genomicsbased approach. Pharmacol. Ther. 95:1-20. DeVito, J.A., Mills, J.A., Liu, V.G., Agarwal, A., Sizemore, C.F., Yao, Z., Stoughton, D.M., Cappiello, M.G., Barbosa, M.D.F.S., Foster, L.A., and Pompliano, D.L. 2002. An array of targetspecific screening strains for antibacterial discovery. Nat. Biotechnol. 20:478-483. DiMasi, J.A., Hansen, R.W., and Grabowski, H.G. 2003. The price of innovation: New estimates of drug development costs. J. Health Econ. 22:151185. Dougherty, T.J., Barrett, J.F., and Pucci, M.J. 2002. Microbial genomics and novel antibiotic discovery: New technology to search for new drugs. Curr. Pharm. Des. 8:1119-1135. Fernandez, J., Barrett, J.F., Licata, L., Amaratunga, D., and Frosco, M. 1999. Comparison of efficacies of oral levofloxacin and oral ciprofloxacin in a rabbit model of a staphylococcal abscess. Antimicrob. Agents Chemother. 43:667-671.
Mills, S.D. 2003. The role of genomics in antimicrobial discovery. J. Antimicrob. Chemother. 51:749-752. Milne, G.M. 2003. Pharmaceutical productivity: The imperative for new paradigms. Annu. Rep. Med. Chem. 38:383-397. NCCLS (National Committee for Clinical Laboratory Standards). 1997. Methods for dilution antimicrobial susceptibility tests for bacteria that grow aerobically. Approved standard M7A4. National Committee for Clinical Laboratory Standards, Villanova, Pa. Nwaka, S. and Ridley, R.J. 2003. Virtual discovery and development for neglected diseases through public-private partnerships. Nat. Rev. Drug Discov. 2:919-928. Pucci, M.J., Barrett, J.F., and Dougherty, T.J. 2003. Bacterial “Genes-To-Screens” in the postgenomic era. In Pathogen Genomics: Impact on Human Health (K. J. Shaw, ed.). Humana Press, Totowa, N.J. Pratt, S.D., David, C.A., Black-Schaefer, C., Dandliker, P.J., Xuei, X.L., Warrior, X., Burns, D.J., Zhong, P., Cao, Z.S., Saiki, A.Y.C., Lerner, C.G., Chovan, L.E., Soni, N.B., Nilius, A.M., Wagenaar, F.L., Merta, P.J., Traphagen, L.M., and Beutel, B.A. 2004. A strategy for discovery of novel broad-spectrum antibacterials using a high-throughput Streptococcus pneumoniae transcription/translation screen. J. Biomol. Screen. 9:3-11.
Haney, S.A., Alkane, L.E., Dunman, P.M., Murphy, E., and Projan, S.J. 2002. Genomics in antiinfective drug discovery – getting to endgame. Curr. Pharm. Des. 8:1099-1118.
Rich, A. 2004. The excitement of discovery. Annu. Rev. Biochem. 73:1-37.
Judson, N. and Mekalanos, J.J. 2000. Transposonbased approaches to identify essential bacterial genes. Nat. Biotechnol. 18:740-745.
Silver, L. and Bostian, K. 1990. Screening of natural products for antimicrobial agents. Eur. J. Clin. Microbiol. Infect. Dis. 9:455-461.
Lindsay, M.A. 2004. Target discovery. Nat. Rev. Drug Discov. 2:831-837.
Thanassi, J.A., Hartman-Neumann, S.L., Dougherty, T.J., Dougherty, B.A. and Pucci, M.J. 2002. Identification of 113 conserved essential genes using a high-throughput gene disruption system in Streptococcus pneumoniae. Nucl. Acids Res. 30:3152-3162.
Lipinski, C.A., Lombardo, F., Dominy, B.W. and Feeney, P.J. 1997. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 23:3-25. Look, G.C., Vacin, C., Dias, T.M., Ho, S., Tran, T.H., Wiesner, C., Fang, F., Marra, A., Westamacott, D., Hromockyj, A.E., Murpy, M.M., and Schullel, J.R. 2004. The discovery of biaryl acids and amides exhibiting antibacterial activity against Gram-positive bacteria. Bioorg. Med. Chem. Lett. 14:1423-1426. McDevitt, D. and Rosenberg, M. 2001. Exploiting genomics to discover new antibiotics. Trends Microbiol. 9:611-617.
Xuan, D., Zhong, M., Mattoes, H., Bui, K.-Q., McNabb, J., Nicolau, D., Quintilliani, R., and Nightingale, C.H. 2001. Streptococcus pneumoniae response to repeated moxifloxacin or levofloxacin exposure in a rabbit tissue cage model. Antimicrob. Agents Chemother. 45:794-799.
Contributed by John F. Barrett
Merck Research Laboratories
Rahway, New Jersey
Overview of Safety Pharmacology

Safety pharmacology, a relatively recent concept, is situated between classical toxicology and pharmacology (Fossa, 1994). Toxicology examines chemical-induced changes as a result of drug action, usually after repeated treatment at supratherapeutic doses, whereas safety pharmacology entails the search for changes that occur after acute administration of a chemical agent at doses approximating those to be employed clinically. A first attempt at defining safety pharmacology was contained in the Japanese Guidelines (Japanese Ministry of Health and Welfare, 1995), which provide considerable detail on the types of pharmacology studies considered essential before an agent is administered to humans. More recently, the European Agency for the Evaluation of Medicinal Products proposed a new set of guidelines contained within the International Conference on Harmonization (ICH) safety guidelines (http://www.ich.org/cache/compo/276-254-1.html). ICH S7A deals with core battery studies, whereas ICH S7B is concerned specifically with assessment of QT interval prolongation, an effect that can lead to cardiac arrhythmias (torsades de pointes) and death. The ICH S7A guidelines came into effect in Europe in June 2001 and have since been adopted in both the United States (US Food and Drug Administration, 2001) and Japan. The ICH S7B guidelines have also been adopted in the United States (US Food and Drug Administration, 2005). Described in this unit are a number of important issues relating to safety pharmacology as it is currently defined by regulatory agencies.
TERMINOLOGY
A source of confusion about safety pharmacology arises from the various terms used to categorize the kinds of studies covered by this discipline. Besides safety pharmacology, other terms used to describe such studies include general pharmacology, ancillary pharmacology, secondary pharmacology, high-dose pharmacology, and regulatory pharmacology. Moreover, experiments performed to define the mechanisms of adverse effects observed during toxicological studies are often considered part of safety pharmacology. In its strictest sense, the word safety implies the absence of untoward effects that might endanger the health of a patient. Thus, the
term safety pharmacology could be applied to all pharmacological studies undertaken to ensure the absence of adverse effects when the drug candidate is administered in a manner, and over a dose range, that is clinically relevant. Only studies predictive of risk should be part of a safety pharmacology assessment. The primary aim is to demonstrate that, at doses thought to be appropriate for obtaining the therapeutic benefit, there are no other effects that could be considered risk factors. A further aim would be to determine the maximum dose that could be administered before encountering adverse events. Such studies should also be useful for establishing a bridge between therapeutic doses and those to be used in toxicological studies, and for determining the maximum doses that can be administered and still ensure safety during Phase I human studies. While the term high-dose pharmacology would appear to be relevant in this regard, it is too restrictive in not including the notion that the drug candidate might have adverse effects on other systems even at therapeutic doses. In contrast, the terms general or ancillary pharmacology encompass all studies undertaken to characterize the probable therapeutic responses to the new substance. The aim of general pharmacology studies is to determine the selectivity of the drug candidate for the intended indication. For example, a novel anticancer compound would not usually be expected to display psychotropic activity, although there might be a therapeutic advantage if the drug had antidepressant or anxiolytic effects. Studies aimed at exploring such issues would come under general pharmacology since proof of safety is not their main purpose.
SAFETY PHARMACOLOGY VERSUS TOXICOLOGY
Traditionally, safety pharmacology studies were conducted in a toxicological context. Toxicology differs from pharmacology in that toxicologists investigate mainly structure whereas pharmacologists investigate mainly function (Sullivan and Kinter, 1995). Indeed, there is an entire area of study linking morphological with functional changes and defining their relationship to adverse events. For example, vomiting, diarrhea, or a change in body weight are easily detected during the clinical evaluations performed as part of
toxicological studies. Other potentially adverse responses, such as changes in cardiac rhythmicity, might not be noted unless specific pharmacological studies are undertaken. Furthermore, changes in physiological function may occur in the absence of changes in organ structure, and frequently occur at doses lower than those necessary to induce a structural change; conversely, not all structural modifications are clearly associated with a detectable change in function. Although structural and functional changes are sometimes clearly related, it is not always possible to link them in terms of cause and effect. Thus, safety pharmacology and classical toxicology are complementary, with both providing information important for determining the safety of a new substance. While regulatory authorities may place more emphasis on formal toxicology studies, clinical pharmacologists may find safety pharmacology data more reassuring when designing clinical trials (Sullivan and Kinter, 1995).
COMPARISON OF JAPANESE AND ICH GUIDELINES
The Japanese Guidelines have their origins in a notification issued by the Japanese Ministry of Health and Welfare in September 1967 concerning "basic policies for approval to manufacture drugs." The 1995 version of the Japanese Guidelines (Japanese Ministry of Health and Welfare, 1995) does not specifically mention safety pharmacology, describing instead general pharmacology as having the following aims:
1. to assess the overall profile of "general pharmacological effects" as compared to "principal pharmacological effects,"
2. to "obtain useful information on potential adverse effects," and
3. to evaluate "effects of drugs on physiological functions not necessarily detectable in toxicological studies."
ICH S7A specifically mentions safety pharmacology and defines it as "those studies that investigate the potential undesirable pharmacodynamic effects of a substance on physiological functions in relation to exposure in the therapeutic dose-range and above." The ICH definition has the considerable advantage of restricting safety pharmacology to the assessment of risk, thereby eliminating a wide range of studies that could be included if the definition were more general. Reducing the number of studies reduces drug development costs, which is an advantage to both the drug developer and the consumer.
The Japanese guidelines differ from ICH S7A in making more specific recommendations concerning the tests to be employed. Safety pharmacology studies are divided into Category A and Category B. Category A includes essential evaluations, whereas Category B studies are those to be carried out "when necessary." For central nervous system (CNS) evaluations, Category A includes general behavioral observations, measures of spontaneous motor activity, general anesthetic effects (including assessment of potential synergism/antagonism with general anesthetics), proconvulsant effects, analgesia, and effects on body temperature. Category B includes effects on the electroencephalogram (EEG), the spinal reflex, conditioned avoidance response, and locomotor coordination. As for the other body systems, the Japanese Category A includes studying effects on the cardiovascular and respiratory systems (e.g., respiration, blood pressure, blood flow, heart rate, and electrocardiogram or ECG), the digestive system (including intestinal transit and gastric emptying), water and electrolyte metabolism (e.g., urinary volume, urinary concentrations of sodium, potassium, and chloride ions), and "other important pharmacological effects." The three vital systems the European guidelines (i.e., ICH guidelines) include in core battery studies are the CNS, the cardiovascular (CV) system, and the respiratory system. Core battery CNS studies include motor activity, behavioral changes, coordination, sensory/motor reflex responses, and body temperature, with the remark that "the CNS should be assessed appropriately." A similar remark is made about appropriate CV assessment, with specific mention of blood pressure, heart rate, and ECG, together with a suggestion that "in vivo, in vitro and/or ex vivo evaluations, including methods for repolarization and conductance abnormalities, should also be considered." Comparison of the Japanese Guidelines and ICH S7A suggests a clear intent expressed in the ICH guidelines to free safety pharmacology from the constraints of a cookbook approach. On the other hand, their vagueness does not provide a clear idea of what evaluations could or should be performed. This is most apparent in the recommendations for follow-up studies. For the CNS, the ICH indicates that follow-up studies should include "behavioral pharmacology, learning and memory, ligand-specific binding, neurochemistry, visual, auditory and/or electrophysiology examinations, etc." The subject of drug dependence/abuse, although a major concern for
many substances with CNS effects, receives only a one-word mention in the section on “Other Organ Systems.” The ICH guidelines are much more explicit for the CV and respiratory systems. Indeed, all of ICH S7B is devoted to an analysis and recommendations for arrhythmogenic risk. ICH S7A specifically mentions renal/urinary, autonomic and gastrointestinal systems, although, surprisingly, no mention is made of nausea, despite the fact that it is a common adverse event.
CNS SAFETY PHARMACOLOGY
ICH S7A guidelines recommend that core battery CNS studies include measures of drug-induced signs of CNS dysfunction as well as measures of spontaneous locomotion and motor coordination. Three other measurements, originally recommended by the Japanese Guidelines Category A, but dropped from the ICH S7A core battery, are the convulsive threshold, interaction with hypnotics, and effects on pain threshold (Porsolt et al., 2005). In spite of their exclusion from ICH S7A, such measures are useful in a core battery of CNS safety pharmacology procedures. Decreases in the convulsive threshold are an important component in the assessment of CNS safety. Several substances, including antipsychotics such as clozapine, do not induce frank convulsions at any dose but clearly decrease the convulsive threshold. Even anticonvulsive activity, which in itself is not a risk factor, could be a predictor of cognition-impairing effects. Several anticonvulsants, such as benzodiazepines and glutamic acid NMDA receptor antagonists, are known to impair cognition. Thus, anticonvulsant activity could represent a useful first screen for potential cognition-impairing effects. Likewise, sleep-inducing or sleep-attenuating activity could be unmasked by a barbiturate interaction procedure. While benzodiazepines, for example, do not by themselves induce sleep, their sleep-enhancing activity can be readily detected by studying their interaction with barbiturates. The same is true for psychostimulants, which may or may not induce signs of excitation in a primary observation procedure, but clearly block barbiturate-induced sleep. Finally, a drug-induced increase in pain sensitivity can be readily assessed using a simple nociception procedure (e.g., the hotplate; UNIT 5.7). Whereas analgesic activity is not in itself a risk factor, the presence of analgesic
activity could be a useful predictor of abuse liability.
CARDIOVASCULAR SAFETY PHARMACOLOGY
ICH S7B guidelines, which deal exclusively with the evaluation of proarrhythmic risk, recommend that core battery cardiovascular studies include measures of drug-induced effects on arterial blood pressure, heart rate, and ECG, together with the suggestion that "in vivo, in vitro and/or ex vivo evaluations, including methods for repolarization and conductance abnormalities, should also be considered." Recommended methodologies include measurements of ionic currents in isolated animal or human cardiac myocytes, and in cultured cell lines or heterologous expression systems with cloned human channels. Other procedures include measurement of action potential (AP) parameters in isolated cardiac preparations or analysis of specific electrophysiologic parameters indicative of AP duration in anesthetized animals, and measurement of ECG in conscious or anesthetized animals. While it is clear that proarrhythmic effects represent a major cardiovascular danger for new substances, the excessive focus on this risk by ICH S7B, together with the avalanche of reports dealing with methodological issues, has taken attention away from the many other types of drug-induced cardiovascular risk. A comprehensive systemic, cardiac, pulmonary, and renal hemodynamic evaluation in a large animal species like the dog is essential for an adequate evaluation of cardiovascular risk. Other risk factors, such as drug-induced depression of myocardial contractility or pulmonary hypertension, are critically important, even in the absence of other cardiac electrical disorders. In contrast to telemetry in the conscious dog, an acute hemodynamic study in the anesthetized animal allows a much larger number of cardiovascular parameters to be studied, yielding a wealth of information regarding the mechanisms responsible for effects on the cardiovascular system (Lacroix and Provost, 2000). Parameters that can be studied in the anesthetized animal include aortic blood pressure, heart rate, cardiac output, stroke volume, pulmonary artery blood pressure, peripheral vascular resistance, renal blood flow, and ECG. The hERG channel assay is now considered the model of choice for cardiac proarrhythmic risk assessment. Whereas hERG channel assays, using binding techniques and automated
technology, would seem appropriate as a first screen for QT prolongation in the early stages of safety evaluation, the use of the hERG channel patch clamp technique (UNIT 10.8) is recommended for the core battery of cardiovascular studies. Although more time-consuming, the patch clamp technique is an indicator of function, as opposed to receptor affinity, and lends itself more readily to compliance with good laboratory practice (GLP; see http://www.oecd.org), which is mandatory for ICH S7A-recommended core battery studies. On the other hand, the hERG channel assay cannot constitute a stand-alone in vitro test for evaluating proarrhythmic risk. Calcium agonists, for example, are known to lengthen the AP duration and favor the occurrence of early afterdepolarizations and/or delayed afterdepolarizations, either of which can lead to torsades de pointes. Cardiac risk related to this calcium-dependent mechanism cannot be detected by the hERG channel assay. Furthermore, the hERG channel assay can frequently overestimate the cardiac risk for a new substance, since partial inhibition of the potassium ion channel conductance (IKr) may not result in AP prolongation in a Purkinje fiber preparation, owing to counteracting effects on other cardiac ion channels. For this reason the Purkinje fiber assay (UNIT 11.3) constitutes a necessary adjunct for investigating the effects of a test substance on the different parameters of the AP. Whichever in vitro assay is used, it cannot fully mimic the in vivo situation. All in vitro data must be considered in the context of plasma protein binding (UNIT 7.5), pharmacokinetic parameters (UNIT 7.1), and anticipated plasma concentrations of the test substance; thus, in vivo analyses in conscious animals monitored by telemetry remain an essential component in the assessment of proarrhythmic risk. Nonetheless, telemetry in conscious animals does not constitute a stand-alone technique either, since it provides little information regarding the mechanism responsible for any observed effect. Orthostatic hypotension is another common cardiovascular risk not covered by the ICH S7A core battery. As this constitutes a major adverse effect associated with many different classes of drugs, it is important to determine whether it is a property of a new chemical entity intended for human use. A simple animal model for orthostatic hypotension is the tilting test in the anesthetized rat (Hashimoto et al., 1999). With this assay, orthostatic hypotension is exacerbated by prazosin and
β-adrenoceptor antagonists. Inclusion of such a test in a cardiovascular core battery would not constitute a major expense but would provide a more complete assessment of cardiovascular risk.
RESPIRATORY SAFETY PHARMACOLOGY
Drugs can cause changes in pulmonary function (UNIT 10.9) by direct actions on the respiratory system or as a consequence of central, metabolic (alterations in acid-base balance), or vascular (pulmonary hypertension) effects. ICH S7A guidelines include respiration as a vital function that must be assessed during safety evaluations. Whole-body plethysmography is a method of choice for examining whether a test substance affects airway function. Whole-body plethysmography is performed in the unrestrained rat or guinea pig following oral or intravenous administration of the test agent. Modern systems allow a number of ventilatory parameters to be measured, including inspiratory and expiratory times, peak inspiratory and expiratory flows, tidal volume, respiratory rate, relaxation time, and pause and enhanced pause. This allows for differentiation between effects on respiratory control and effects on the mechanical properties of the lung (Murphy, 2002). As a general screen, the whole-body test is preferable to the head-out method because the animals, having freedom of movement, can be studied over longer periods of time. A weakness of whole-body plethysmography is that it is not sensitive to the respiratory-depressant effects of some drugs, such as barbiturates and opioids. Attempts to increase the sensitivity of this assay system by placing the animals in a CO2-enriched environment (Van den Hoogen and Colpaert, 1986; Gengo et al., 2003) appear promising. Although the anesthetized-dog preparation does not lend itself to the evaluation of spontaneous lung function, it is well suited for evaluating the risk of pulmonary hypertension (as mentioned above). For this reason the anesthetized dog constitutes an important component in a comprehensive respiratory safety evaluation.
TIMING OF SAFETY PHARMACOLOGY STUDIES AND GOOD LABORATORY PRACTICE
One of the issues not clearly addressed in either the Japanese or ICH guidelines is the timing of safety pharmacology studies. Whereas the Japanese guidelines imply that safety data are required for marketing
approval, ICH S7A clearly states they are needed prior to initiating Phase 1 human trials. Both the Japanese and ICH guidelines require that Category A or core battery studies be performed according to Good Laboratory Practice (GLP). Exceeding these regulatory requirements, many pharmaceutical firms engage in safety pharmacology studies early in the drug discovery process, even at the very beginning of in vivo experiments (Sullivan and Kinter, 1995). A primary observation procedure, such as the Irwin test (UNIT 10.10), is frequently the first assessment in living animals for determining acute toxicity, the active dose-range, and the principal effects on behavior and physiological function. Substances are also frequently screened very early in the discovery process for potential proarrhythmic risk using hERG procedures (UNIT 10.8). Such early safety screening is rarely conducted according to GLP, and therefore falls outside the requirements of the regulatory agencies. Nonetheless, information gained from such tests is vital in directing the discovery program and the selection of clinical candidates. Safety studies are performed differently early in the discovery process than at pre-Phase 1. An early-stage Irwin test tends to use fewer animals, the mouse rather than the rat, and with doses selected sequentially on the basis of effects observed with previous doses, to determine the lethal dose as rapidly as possible. With later-stage Irwin tests, the dose range is fixed in advance, beginning with the lowest dose approximating the therapeutic dose, followed by multiples of 10 or 100, up to but not including a lethal dose. Similarly, early stage hERG tests employing a binding assay or a patch clamp procedure might examine just a single high concentration of several substances within the same chemical series rather than a range of concentrations, as would be the case in a later hERG evaluation. Indeed the experimental aims of the early and late tests are different. In early-stage work the objective is to detect the presence of risk as a guide in the selection of substances for development. Later-stage analysis is performed to confirm the absence of risk in the relevant dose-range for the selected compound.
ETHICAL AND ANIMAL WELFARE ISSUES
As with all procedures involving living animals, there are important ethical considerations in the choice of methods. The guiding principles are to use as few animals as possible and to minimize their suffering and discomfort. Since the goal of safety pharmacology is to assess the risk of side effects, the possibility of causing suffering in the experimental animals is higher than in other areas of pharmacology. The investigator must remain sensitive to these issues, not only in planning and designing the protocols, but also while performing the experiments. For example, procedures for terminating the experiment upon well-defined events, such as pain or death, must be established. Ethical issues must be considered within the context of the aims of the experiments, which, ultimately, are to minimize human risk. While reducing risk to humans is of paramount importance, it is still possible to devise scientifically valid experiments using a reduced number of laboratory animals. For example, it is now accepted that the traditional LD50 acute toxicity test, which requires the use of a large number of animals, yields only limited information. Considerably more information, requiring the use of fewer animals, is obtained with the Irwin procedure (UNIT 10.10).
STATISTICAL EVALUATION
Since identification of risk is the chief aim of a safety pharmacology test, it is essential that positive results not be overlooked. In other words, the generation of false negatives should be kept to a minimum (Porsolt et al., 2005). False positives, or the erroneous detection of possible risk, although bothersome, are less serious and can usually be corrected with supplementary testing. Thus, the risk of false negatives should be decreased as much as possible, even if there is an increase in the risk of false positives. A test substance found not to have significant safety risks based on preclinical studies, even after the use of oversensitive statistics, is more likely to be truly devoid of risk. As a consequence, safety pharmacology, in contrast to efficacy pharmacology, should employ statistical procedures possessing maximal sensitivity for detecting possible effects on a dose-by-dose basis, at the acknowledged risk of increasing the number of false positives.
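The trade-off described above can be made concrete with a small calculation. The following sketch is not part of any guideline; it uses an arbitrary effect size and group size purely for illustration, and assumes a normal approximation to a two-sided two-sample test, to show how relaxing the significance level lowers the false-negative rate at the cost of more false positives.

```python
# A minimal numerical illustration (hypothetical settings) of the sensitivity
# trade-off: a larger alpha means more false positives but fewer false negatives.
from scipy.stats import norm

def power_two_sample(delta_over_sigma, n_per_group, alpha):
    """Approximate power of a two-sided two-sample z test."""
    ncp = delta_over_sigma * (n_per_group / 2) ** 0.5   # noncentrality
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

for alpha in (0.01, 0.05, 0.10):
    beta = 1 - power_two_sample(delta_over_sigma=0.8, n_per_group=10, alpha=alpha)
    print(f"alpha = {alpha:.2f}: false-negative rate (beta) ~ {beta:.2f}")
```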
CONCLUSIONS
Thanks in part to ICH S7A, safety pharmacology can now be considered an independent discipline situated between traditional toxicology and efficacy/discovery pharmacology. Safety pharmacology is, however, a pharmacological rather than a toxicological
discipline since it concerns the study of drug actions on physiological function rather than on physical/anatomical structure. Although using identical methods, safety pharmacology differs from efficacy pharmacology in that the former evaluates the potentially adverse effects of test substances on normal function, whereas the latter is aimed at establishing therapeutic potential. Both provide information critical for drug discovery and development.
LITERATURE CITED
Fossa, A.A. 1994. Inaugural conference on general pharmacology/safety pharmacology. Drug Dev. Res. 32:205-272.
Gengo, P.J., Pettit, H.O., O'Neill, S.J., Su, Y.F., McNutt, R., and Chang, K.J. 2003. DPI-3290 [(+)-3-((α-R)-α-((2S,5R)-4-Allyl-2,5-dimethyl-1-piperazinyl)-3-hydroxybenzyl)-N-(3-fluorophenyl)-N-methylbenzamide]. II. A mixed opioid agonist with potent antinociceptive activity and limited effects on respiratory function. J. Pharmacol. Exp. Ther. 307:1227-1233.
Hashimoto, Y., Ohashi, R., Minami, K., and Narita, H. 1999. Comparative study of TA-606, a novel angiotensin II receptor antagonist, with losartan in terms of species difference and orthostatic hypotension. Jpn. J. Pharmacol. 81:63-72.
Japanese Ministry of Health and Welfare. 1995. Japanese guidelines for nonclinical studies of drugs manual. Pharmaceutical Affairs Bureau, Japanese Ministry of Health and Welfare, Yakugi Nippo, Japan.
Lacroix, P. and Provost, D. 2000. Basic safety pharmacology: The cardiovascular system. Therapie 55:63-69.
Murphy, D.J. 2002. Assessment of respiratory function in safety pharmacology. Fundam. Clin. Pharmacol. 16:183-196.
Porsolt, R.D., Picard, S., and Lacroix, P. 2005. International safety pharmacology guidelines (ICH S7A and S7B): Where do we go from here? Drug Dev. Res. 64:83-89.
Sullivan, A.T. and Kinter, L.B. 1995. Status of safety pharmacology in the pharmaceutical industry. Drug Dev. Res. 35:166-172.
US Food and Drug Administration. 2001. ICH guidance for industry: S7A safety pharmacology studies for human pharmaceuticals. US Food and Drug Administration, Rockville, Md.
US Food and Drug Administration. 2005. ICH guidance for industry: S7B nonclinical evaluation of the potential for delayed ventricular repolarization (QT interval prolongation) by human pharmaceuticals. US Food and Drug Administration, Rockville, Md.
Van den Hoogen, R.H. and Colpaert, F.C. 1986. Respiratory effects of morphine in awake unrestrained rats. J. Pharmacol. Exp. Ther. 237:252-259.

Internet Resources
http://www.fda.gov/cber/gdlns/ichs7a071201.htm
The FDA's S7A safety guidelines for pharmacology studies of human pharmaceuticals.
http://www.fda.gov/cber/gdlns/ichqt.htm
The FDA's S7B safety guidelines for pharmacology studies of human pharmaceuticals.
http://www.ich.org/cache/compo/276-254-1.html
Contains links to both the ICH S7A (Safety Pharmacology Studies for Human Pharmaceuticals) and S7B (The Nonclinical Evaluation of the Potential for Delayed Ventricular Repolarization (QT Interval Prolongation) By Human Pharmaceuticals) safety guidelines.
http://www.oecd.org
The OECD Website contains comprehensive information about GLPs.

Contributed by Roger D. Porsolt
Porsolt & Partners Pharmacology
Boulogne-Billancourt, France
PAIRED T TEST
HENRY HSU and PETER A. LACHENBRUCH
Food and Drug Administration, Rockville, MD, USA

A basic concept in study design is that extraneous sources of variation should be controlled, so that the comparison is made among groups which are alike except for the intervention. Techniques which may be used include stratification and analysis of covariance. Data which are naturally paired form useful strata and often facilitate a more precise comparison than one in a set of unrelated subjects. This leads to the paired t test. The cost of pairing lies in the effort needed to determine the values of the matching variables (which is low when there are natural pairs such as litter mates or pre- and post-intervention measures), and the reduction of degrees of freedom (df) from 2(n − 1) to n − 1. The loss of degrees of freedom is almost never a problem when n is greater than 15. The statistic is based on the difference of the members of the pair, so the variance of the difference is 2σ²(1 − ρ), where ρ is the correlation coefficient and σ² is the variance of a single observation. If ρ = 0, then the variance is the same as that of the difference between unpaired observations (2σ²). In this case, the loss is that of half of the degrees of freedom. When the pairing is effective, ρ > 0, and the difference will have a smaller variance than if the observations were unpaired. The tradeoff is almost always in favor of pairing.
The paired t test is used to compare mean differences when the observations have been obtained in pairs, and are thus dependent. Examples include observations made on weight before and after an intervention in a subject, serum cholesterol of two members of a family, and blood concentration of a toxin in litter mates. Subjects are often matched on characteristics such as age, gender, race, and study site (in multicenter studies). In each of these examples, the two observations are made on the same response and will be correlated. Accounting for this is necessary to analyze the study properly. For paired data, it is natural to form the differences and perform the analysis on the differences. An interesting recent account is by Senn & Richardson (5), who discuss the study which W.S. Gosset (Student) used in his paper. Moses (3) gives a summary of the theoretical aspects of this test.
We assume that there are n pairs of observations, and that each pair is independent of the other pairs. Denote the paired observations as x_i and y_i, and their difference as d_i for i = 1, . . . , n. The analysis assumes that the observations are normally distributed with means µ_x and µ_y. The random variable D = X − Y is then normally distributed with mean µ_d = µ_x − µ_y and variance σ_d². The null hypothesis is H0: µ_d = 0. A Student t statistic can be computed as
t = d̄ / (s_d / √n),

where d̄ is the mean of the differences d_i and s_d is their standard deviation.
This statistic is then compared to Student's t distribution with n − 1 df. This analysis is equivalent to a randomized blocks analysis of variance in which the pairs correspond to blocks. If the variances in X and Y are equal, the variance of D is 2σ²(1 − ρ), as noted above. If ρ < 0, the variance is increased. This would not usually be the case when the matching process is effective. If the variances are unequal, the variance of D is σ_1² + σ_2² − 2ρσ_1σ_2. The relative efficiency of a paired t-test is 1/(1 − ρ) as compared with a two-independent-sample t test in a parallel comparison. For example, if ρ = 1/3, the relative efficiency is 1.5. This means that 100 paired observations would have the same power to detect differences as an unpaired study of 150 observations per group.

Example
We give here, in Table 1, data used by Student (Gosset) as cited in Senn & Richardson (5). The data were on the amount of sleep gained under two soporific drugs. The means given here agree with those of Senn & Richardson, but the standard deviations are slightly larger. The values they
Table 1.
Patient               Dextro    Laevo
1                       0.7      1.9
2                      −1.6      0.8
3                      −0.2      1.1
4                      −1.2      0.1
5                      −0.1     −0.1
6                       3.4      4.4
7                       3.7      5.5
8                       0.8      1.6
9                       0.0      4.6
10                      2.0      3.4
Mean                    0.75     2.33
Standard deviation      1.79     2.00
report were those given by Student, who used the divisor n in computing the variance. Using the unpaired t-test we obtain t = 1.86 with 18 df and a P value of 0.0792. By pairing the observations, as we should, the result is t = 1.58/(s_d/√10) = 1.58/0.39 = 4.06 with 9 df and a P value of 0.0028. It is better to report the mean of the differences and its standard deviation rather than showing only the t statistic or the P value (or worse, NS for "not significant", or some number of asterisks!).
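For readers who wish to reproduce the Table 1 analysis in software, the following minimal sketch (not part of the original article) uses SciPy; ttest_rel carries out the paired analysis and ttest_ind the inappropriate unpaired one. The printed values should agree with those quoted above (t = 1.86 unpaired and t = 4.06 paired).

```python
# Reproducing the Table 1 analysis with SciPy: Student's sleep-gain data.
from scipy import stats

dextro = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
laevo  = [1.9,  0.8,  1.1,  0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]

# Incorrect analysis: treat the two columns as independent samples (18 df).
t_unpaired, p_unpaired = stats.ttest_ind(laevo, dextro)
# Correct analysis: paired t test on the within-patient differences (9 df).
t_paired, p_paired = stats.ttest_rel(laevo, dextro)

diffs = [l - d for l, d in zip(laevo, dextro)]
mean_diff = sum(diffs) / len(diffs)

print(f"unpaired: t = {t_unpaired:.2f}, P = {p_unpaired:.4f}")
print(f"paired:   t = {t_paired:.2f}, P = {p_paired:.4f}, mean difference = {mean_diff:.2f}")
```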
ROBUSTNESS
The robustness properties correspond to those of the one-sample t test. The effect of nonnormality is fairly small if n is at least 30, since the distribution of d will be close to normal in that case. If one difference (or a few) appear to be quite large (i.e. outliers) the results can be affected. Outliers can be considered a form of nonnormality. They affect the variance of the observations, and can also affect the skewness of the distributions. The P values reported from an analysis are often given as P < 0.05 or P < 0.01. Since the P value depends on the behavior of the distribution in its tails, nonnormality generally means that statements such as P < 0.001 are rarely accurate (the quoted P value for the example thus should be regarded as P < 0.01). Lack of independence among the pairs (which might arise if multiple members of a litter or a family were included in a study,
i.e. clustering) can seriously affect the level of the test. If the correlation between any pair of differences is γ , the variance of the differences is σ 2 (1 + 2γ ). Thus, the estimated variance is biased. If γ > 0, the denominator of the t statistic is too small, and the significance levels are incorrect. If the correlation holds only among certain pairs (e.g. independent clusters would lead to a block diagonal covariance matrix), the analysis is more complex, but the estimated variance is still biased. Lack of common variance in X and Y does not formally affect the analysis. However, unless the primary interest is in the difference between the observations, the lack of common variance indicates that X and Y do not have the same distribution, although they might have the same mean. If the variance differs over the pairs, heteroscedasticity concerns arise. Rosner (4) has suggested a random effects model which accounts for this. Missing values can create problems. Usually, only one member of the pair is missing. If the missing value is related to the mean value within the pair, the missingness is not random, and the t test is affected. For further discussion of these points, see Miller (2) or Madansky (1). Several alternatives exist to the paired t test. These are useful if the distribution is not normal and there is concern that this may affect the performance of the test. The sign test uses the number of positive (or negative) signs as a binomial variable with probability parameter 1/2 under H0 , and, for large samples, computes the standard normal deviate, z, to test H0 . The asymptotic relative efficiency (ARE) of this test is 0.637 when the differences are normal. The signed-rank test ranks the absolute values of the differences and sums the ranks corresponding to the positive (or negative) signs. The ARE of this test is 0.955. The normal scores test replaces the ranks of the differences by their expected values under normality and computes a t test on these. Its ARE is 1.0. For observations from nonnormal distributions, the efficiencies of the nonparametric procedures may be higher than indicated here and the t test can be very inefficient.
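As an informal complement to the discussion above, the sketch below (not from the article) applies the sign test and the Wilcoxon signed-rank test to the differences from Table 1; the zero difference for patient 5 is dropped, as is conventional for both tests.

```python
# Nonparametric alternatives to the paired t test, applied to the Table 1 differences.
from scipy.stats import binomtest, wilcoxon

diffs = [1.2, 2.4, 1.3, 1.3, 0.0, 1.0, 1.8, 0.8, 4.6, 1.4]
nonzero = [d for d in diffs if d != 0]

# Sign test: the number of positive differences is Binomial(n, 1/2) under H0.
n_pos = sum(d > 0 for d in nonzero)
print("sign test P =", binomtest(n_pos, n=len(nonzero), p=0.5).pvalue)

# Wilcoxon signed-rank test on the differences.
print("signed-rank P =", wilcoxon(nonzero).pvalue)
```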
REFERENCES
1. Madansky, A. (1988). Prescriptions for Working Statisticians. Springer-Verlag, New York.
2. Miller, R. G. (1985). Beyond Anova. Wiley, New York.
3. Moses, L. (1985). Matched pairs t-tests, in Encyclopedia of Statistical Sciences, Vol. 5, S. Kotz & N. L. Johnson, eds. Wiley, New York, pp. 289–203.
4. Rosner, B. (1982). A generalization of the paired t-test, Applied Statistics 31, 9–13.
5. Senn, S. & Richardson, W. (1994). The first t-test, Statistics in Medicine 13, 785–803.
PARALLEL TRACK
A mechanism to permit wider availability of experimental agents is the "parallel track" policy (Federal Register of May 21, 1990), which was developed by the U.S. Public Health Service in response to acquired immunodeficiency syndrome (AIDS). Under this policy, patients with AIDS whose condition prevents them from participating in controlled clinical trials can receive investigational drugs shown in preliminary studies to be promising.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/handbook/parallel.htm) by Ralph D’Agostino and Sarah Karl.
PARTIALLY BALANCED DESIGNS
This article should be read in conjunction with BLOCKS, BALANCED INCOMPLETE; BLOCKS, RANDOMIZED COMPLETE; GENERAL BALANCE; GROUP-DIVISIBLE DESIGNS; and INCOMPLETE BLOCK DESIGNS. Balanced incomplete block designs (BIBDs) have many desirable qualities. They are easy to analyze; many families of them are easily constructed; the loss of information on treatment estimates (in the sense of increase in variance) due to blocking is as small as it can be for the given block size; and it is easy to compare them with other incomplete block designs, because usually the BIBDs are superior in every respect. However, for many combinations of block size, number of treatments and number of replications, there is no balanced incomplete block design. Bose and Nair [5] introduced partially balanced incomplete block designs in 1939 for use in such situations, hoping to retain many of the desirable properties of BIBDs. They succeeded in their aim to a certain extent: some of the new class of designs are very good, while others are undoubtedly useless for practical experimentation. Unfortunately, any discussion of partial balance must necessarily be more technical than that of BIBDs, because there is an extra layer of complication. Moreover, many of the ideas have been developed and clarified by pure mathematicians, without ever being reexpressed in statistical language. Thus some sections of this article are unavoidably technical. It is hoped that the division into sections with titles will enable the nonmathematical reader to learn something from this article. During the development of the subject there have been minor modifications to the definition of partial balance: the most generally accepted definition, which we use here, is from a review paper in 1963 by Bose [3]. The key idea is the combinatorial concept of association scheme, to which the first part of this article is devoted. The designs themselves are introduced in the second part of the article; the third part discusses generalizations of partial balance and related ideas.

ASSOCIATION SCHEMES
Suppose that there are t treatments. There are three equivalent ways of defining an association scheme on the treatments: as a partition of the pairs of treatments, as a set of matrices, and as a colored graph. According to the most common definition, an association scheme with s associate classes is a partition of the unordered pairs of treatments into s associate classes. If u and υ are treatments and {u, υ} is in the ith associate class, then u and υ are called ith associates. The partition must have the property that there are numbers p^k_ij, for i, j, k = 1, 2, . . . , s, such that if u and υ are any kth associates, the number of treatments that are ith associates of u and jth associates of υ is p^k_ij. (These numbers are not powers, so some authors write them as p_ij,k.) It follows that each treatment has ni ith associates, where
ni = p^k_i1 + p^k_i2 + · · · + p^k_is

for all k ≠ i. It is convenient to say that each treatment is its own zeroth associate. The preceding property still holds, with

p^0_ij = p^j_i0 = p^j_0i = 0 if i ≠ j;   p^0_ii = ni;   p^i_i0 = p^i_0i = 1.
An association scheme may be represented by a set of t × t matrices. For i = 0, . . . , s let Ai be the t × t matrix whose (u, υ) entry is 1 if u and υ are ith associates and 0 otherwise. The matrices A0, A1, . . . , As are called association matrices, and satisfy the following conditions:
(a) Each element of Ai is 0 or 1.
(b) Ai is symmetric.
(c) A0 = I, the identity matrix.
(d) Σ(i=0 to s) Ai = J, the all-ones matrix.
(e) Ai Aj = Σk p^k_ij Ak.
Conversely, any set of square matrices satisfying conditions (a)–(e) defines an association scheme. An association scheme may also be represented by a colored graph. There is one
vertex for each treatment, and there are s colors. The edge uυ is colored with the ith color if u and υ are ith associates. Now condition (e) may be interpreted as, e.g.: if uυ is a green edge, then the number of green-red-blue triangles containing uυ is independent of u and υ. If s = 2, no information is lost by erasing the edges of the second color to obtain an ordinary graph. A graph obtained in this way is called a strongly regular graph [4]. Much literature on association schemes with two associate classes is to be found under this heading, with no explicit mention of association schemes. See also GRAPH THEORY. The simplest association scheme has one associate class, and all treatments are first associates. Denote this scheme by B(t), where t is the number of treatments. While not very interesting in its own right, B(t) can be used to generate other association schemes.

Examples of Association Schemes with Two Associate Classes
Group-Divisible Association Scheme. Suppose that t = mn. Partition the treatments into m sets of size n. The sets are traditionally called groups in this context, even though there is no connection with the technical meaning of group used later in this article. Two treatments are first associates if they are in the same group; second associates, otherwise. Call this scheme GD (m, n).
Triangular Association Scheme. Suppose that t = n(n − 1)/2. Put the treatments into the lower left and upper right triangles of an n × n square array in such a way that the array is symmetric about its main diagonal, which is left empty. Two treatments are first associates if they are in the same row or column; second associates, otherwise. Call this scheme T(n). Then T(4) is GD (3, 2) with the classes interchanged. The square
Table 1.
*  A  B  C  D
A  *  E  F  G
B  E  *  H  I
C  F  H  *  J
D  G  I  J  *

Table 2.
     0  1  2  3  4
0    0  1  2  2  1
1    1  0  1  2  2
2    2  1  0  1  2
3    2  2  1  0  1
4    1  2  2  1  0
array for T(5) is shown in Table 1. The complement of the corresponding strongly regular graph is shown in Fig. 1; it is known as the Petersen graph. In an alternative description of T(n), the t treatments are identified with the unordered pairs from a set of size n. Then treatments are first associates if, as pairs, they have an element in common. The association scheme T(n) arises naturally in diallel cross experiments, where the treatments are all individuals (plants or animals) arising from crossing n genotypes. If there are no self-crosses, and if the gender of the parental lines is immaterial, then there are (n choose 2) treatments and the first associates of any given treatment are those having one parental genotype in common with it.
Latin Square Association Schemes. Suppose that t = n², and arrange the treatments in an n × n square array. Two treatments are first associates in the association scheme L2(n) if they are in the same row or column. Let Λ be any n × n Latin square*. Then Λ defines an association scheme of type L3(n): two treatments are first associates if they are in the same row or column or letter of Λ. Similarly, if Λ1, . . . , Λr−2 are mutually orthogonal Latin squares (3 ≤ r ≤ n + 1), we may define an association scheme of type Lr(n) by declaring any two treatments in the same row, column, or letter of any of Λ1, . . . , Λr−2, to be first associates; any other pair, to be second associates. When r = n + 1, this scheme degenerates into B(t).
Cyclic Association Schemes. Identify the treatments with the integers modulo t. Suppose that D is a set of nonzero integers modulo t with the properties that if d ∈ D, then t − d ∈ D; and that the differences d − e modulo t
Figure 1. The Petersen graph (the complement of the strongly regular graph corresponding to T(5)).
with d, e in D include each element of D exactly N times and each other nonzero integer modulo t exactly M times, for some fixed numbers N and M. In the cyclic association scheme C (t, D) two treatments u and υ are first associates if u − υ (modulo t) is in D; second associates, otherwise. For example, if t = 5 we may take D = {1, 4}. Then N = 0 and M = 1. The association scheme C (5, D) is shown in Table 2. Here the (u, υ) entry of the table is i if u and υ are ith associates.
Combining Association Schemes
For n = 1, 2, let Cn be an association scheme with sn associate classes for a set Tn of tn treatments. There are two simple ways of combining these two schemes to obtain an association scheme for t1 t2 treatments. In both cases, we identify the new treatments with the set T1 × T2 of ordered pairs (u1, u2) with u1 ∈ T1 and u2 ∈ T2 (see also NESTING AND CROSSING IN DESIGN). We may nest C2 in C1 to obtain the association scheme C1/C2 with s1 + s2 associate classes. Treatments (u1, u2) and (υ1, υ2) are ith associates if u1 and υ1 are ith associates in C1 (for 1 ≤ i ≤ s1) or if u1 = υ1 and u2 and υ2 are (i − s1)th associates in C2 (for s1 + 1 ≤ i ≤ s1 + s2). For example, B(m)/B(n) is GD (m, n).
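As a concrete illustration of the matrix representation and of nesting, the following sketch (not part of the article) builds the association matrices of GD(m, n), i.e., of the nested scheme B(m)/B(n), and recovers the numbers p^k_ij from condition (e); the example size m = 3, n = 2 is an arbitrary choice.

```python
# Build the association matrices of GD(m, n) and verify conditions (b), (d), (e).
import numpy as np

def gd_association_matrices(m, n):
    t = m * n
    group = np.repeat(np.arange(m), n)                  # group label of each treatment
    same_group = group[:, None] == group[None, :]
    A0 = np.eye(t, dtype=int)                           # zeroth associates: equality
    A1 = same_group.astype(int) - A0                    # first associates: same group
    A2 = 1 - same_group.astype(int)                     # second associates: different group
    return [A0, A1, A2]

A = gd_association_matrices(3, 2)
t = A[0].shape[0]
assert all(np.array_equal(M, M.T) for M in A)           # condition (b): symmetry
assert np.array_equal(sum(A), np.ones((t, t), int))     # condition (d): they sum to J

# Condition (e): each product Ai @ Aj is a combination sum_k p^k_ij Ak.
basis = np.stack([M.ravel() for M in A], axis=1)
for i in range(3):
    for j in range(3):
        p, *_ = np.linalg.lstsq(basis, (A[i] @ A[j]).ravel(), rcond=None)
        assert np.allclose(basis @ p, (A[i] @ A[j]).ravel())
        print(f"(i, j) = ({i}, {j}): p^k_ij for k = 0, 1, 2 ->", np.round(p).astype(int))
```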
We may cross C1 and C2 to obtain the association scheme C1 × C2 with s1 s2 + s1 + s2 associate classes. If the zeroth associate class is included in C1, C2 and C1 × C2, then the associate classes of C1 × C2 may be labeled by ij, for 0 ≤ i ≤ s1 and 0 ≤ j ≤ s2. Treatments (u1, u2) and (υ1, υ2) are ijth associates in C1 × C2 if u1 and υ1 are ith associates in C1 and if u2 and υ2 are jth associates in C2. The crossed scheme B(m) × B(n) is called a rectangular association scheme R (m, n), because its treatments may be arranged in an m × n array: two treatments are 01-th or 10-th associates if they are in the same row or the same column, respectively, and 11-th associates otherwise.
Factorial Association Schemes
There is a natural way of defining an association scheme on treatments with a factorial* structure. For example, suppose that treatments are all combinations of the t1 levels of factor F1 with the t2 levels of factor F2. If F2 is nested in F1, then the association scheme B(t1)/B(t2) [i.e., GD (t1, t2)] corresponds naturally to the breakdown of treatment effects into the two factorial effects: F1 and F2-within-F1. If there is no nesting, then the association scheme B(t1) × B(t2) [i.e., R(t1, t2)] corresponds to the three factorial effects: main effects* F1 and F2, and interaction* F1 F2.
Table 3.
                              Operators
                    Experienced              Novices
Sow   Site     1     2     3     4     5     6     7     8
 1     1       0     1     1     1     2     2     2     2
 1     2       3     4     4     4     5     5     5     5
 2     1       6     7     7     7     8     8     8     8
 2     2       9    10    10    10    11    11    11    11
 3     1       6     7     7     7     8     8     8     8
 3     2       9    10    10    10    11    11    11    11
 4     1       6     7     7     7     8     8     8     8
 4     2       9    10    10    10    11    11    11    11
More generally, if there are n treatment factors F1 , F2 , . . . , Fn with t1 , t2 , . . . , tn levels, respectively, a factorial association scheme may be built up from B(t1 ), B(t2 ), . . . , B(tn ) by repeated use of crossing and / or nesting. Roy [49] described repeated nesting, and Hinkelmann [19], repeated crossing, but the two operations may be combined. One of the simplest crossed-and-nested schemes is (B(t1 ) × B(t2 ))/B(t3 ), which is also called a generalized right angular scheme. A somewhat more general construction, depending only on the nesting relations among the factors, is given by Speed and Bailey [54]. Example. An experiment was conducted to find out how the ultrasonic fat probe measurement of a sow was affected by the particular sow, by the site on the sow where the measurement was taken, and by the operator who did the measuring. Each of eight operators measured each of four sows at each of two sites on the body (technically known as P2 and H). The operators consisted of two batches of four: those in the first batch were experienced users of the small portable measuring instruments being used, while those in the second were novices. The 64 treatments (all combinations of sow, site, and operator), therefore, had a 4 × 2 × (2/4) factorial structure. The corresponding association scheme is shown in Table 3. Here the 64 treatments are shown in an 8 × 8 array: the entry for treatment u is i if u is an ith associate of the treatment in the top left-hand corner. An unstructured set of treatments may have a dummy factorial structure imposed on it to help with the construction of a design or
the subsequent analysis. Any corresponding factorial association scheme is called pseudofactorial.
Other Association Schemes
We have space to mention only a few of the many other families of association schemes. The d-dimensional triangular association scheme T(d, n) has t = (n choose d) and s = min(d, n − d). The treatments are identified with subsets of size d from a set of size n. Two treatments are ith associates if, as subsets, they have d − i elements in common. Thus T(2, n) = T(n). These schemes are also called Johnson schemes. Conceptually similar is the d-dimensional lattice association scheme, also called the Hamming association scheme H (d, n), which has t = n^d and s = d. The treatments are identified with ordered d-tuples of the numbers 1, 2, . . . , n, and two treatments are ith associates if their coordinates are the same in d − i positions. The Latin square* association scheme Lr(n) also has generalizations. One refinement of Lr(n) has r + 1 associate classes: treatments are first, second, third, . . . , rth associates if they are in the same row, column, letter of Λ1, . . . , letter of Λr−2, respectively, and (r + 1)th associates otherwise. Any way of amalgamating associate classes in this refined scheme produces another association scheme (this is not true for association schemes in general): in particular, amalgamation of the first r classes produces Lr(n). The cyclic scheme C (t, D) is formed by amalgamating classes of the general cyclic association scheme C(t), which has (t − 1)/2 associate classes if t is odd, t/2 if t is even. Treatments u and υ are ith associates if
u − υ = ±i modulo t. Both C (t, D) and C(t) may be further generalized by replacing the integers modulo t by any Abelian group (in the algebraic sense of "group"). There are also many other individual association schemes that do not fit into any general families.
Algebra
Conditions (a)–(e) for an association scheme have algebraic consequences that are important for partially balanced designs. Let A be the set of linear combinations of the matrices A0, A1, . . . , As, which is called the Bose–Mesner algebra of the association scheme. Every matrix M in A is "patterned" according to the association scheme; that is, the entry Muυ depends only on the associate class containing {u, υ}. If M and N are in A, then so are M + N, MN and αM, for every real number α; moreover, MN = NM. It follows that the matrices in A are simultaneously diagonalizable; i.e., the t-dimensional vector space corresponding to the treatments is a direct sum of spaces W0, W1, . . . , Ws, each of which is an eigenspace of every matrix in A. Hence if M is in A and M is invertible, then M⁻¹ is in A. Usually the eigenspaces W0, W1, . . . , Ws are easy to calculate. We may always take W0 to be the one-dimensional space spanned by the vector (1, 1, . . . , 1). At worst, the problem of finding W0, W1, . . . , Ws can be reduced to a similar problem for (s + 1) × (s + 1) matrices. See Bose and Mesner [6], Cameron and van Lint [9], Delsarte [16], and Dembowski [17] for further details. The trivial association scheme B(t) has two eigenspaces, W0 and W1. The space W1 consists of all contrasts, i.e., all vectors whose entries sum to zero. For the group-divisible association scheme GD (m, n) the spaces W1 and W2 consist of all between-groups contrasts* and all within-groups contrasts, respectively. That is, a vector is in W1 if its entries are constant on each group and sum to zero overall; a vector is in W2 if its entries sum to zero on each group. The rectangular association scheme has four eigenspaces: W1 and W2 consist of between-rows contrasts and between-columns contrasts respectively, while W3 contains all vectors that are orthogonal to W0, W1 and
W2. In general, the eigenspaces of every factorial association scheme are the spaces corresponding to main effects and interactions in the factorial sense. If t is odd, every eigenspace (except W0) of the cyclic association scheme C(t) has dimension 2. If the treatments are written in the order 0, 1, 2, . . . , t − 1, a basis for Wi consists of

(1, cos(2πi/t), cos(4πi/t), . . . , cos(2π(t − 1)i/t));
(0, sin(2πi/t), sin(4πi/t), . . . , sin(2π(t − 1)i/t)),

for each i with 1 ≤ i ≤ (t − 1)/2. If t is even, there is an additional eigenspace Wt/2 spanned by (1, −1, 1, . . . , −1). For every association scheme, there are real numbers eij (0 ≤ i, j ≤ s) such that every vector in Wi is an eigenvector of Aj with eigenvalue eij. The array [eij] is sometimes called the character table of the association scheme. Table 4 shows the character tables of some association schemes. Denote by Si the matrix representing orthogonal projection onto Wi. The eigenprojections are symmetric, idempotent, and mutually orthogonal; i.e., Si′ = Si, Si² = Si, and Si Sj = 0 if i ≠ j. Moreover, Σi Si = I. For every j,

Aj = Σi eij Si.
The (s + 1) × (s + 1) matrix [eij] is invertible, with inverse [fij], say. Then

Si = Σj fji Aj

for each i. (See refs. 6, 9, and 16 for more details.)
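The character table can also be obtained numerically. The short sketch below (not part of the article) does this for the cyclic scheme C(5) with D = {1, 4}, using the cosine basis vectors given above for W1 and W2; the printed eigenvalues should match the 0.618 and −1.618 entries of Table 4.

```python
# Numerical character table of the cyclic scheme C(5) with D = {1, 4}.
import numpy as np

t = 5
u = np.arange(t)
diff = (u[:, None] - u[None, :]) % t
A1 = np.isin(diff, [1, 4]).astype(float)       # first associates: difference +-1
A2 = np.isin(diff, [2, 3]).astype(float)       # second associates: difference +-2

for i in (1, 2):                               # eigenspaces W1 and W2
    v = np.cos(2 * np.pi * i * u / t)          # cosine basis vector of Wi from the text
    for name, A in (("A1", A1), ("A2", A2)):
        e = (A @ v)[0] / v[0]                  # eigenvalue e_ij on Wi
        assert np.allclose(A @ v, e * v)
        print(f"e(W{i}, {name}) = {e:+.3f}")
```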
PARTIALLY BALANCED INCOMPLETE BLOCK DESIGNS
Let Δ be an incomplete block design* for t treatments, each replicated r times, in b blocks each of size k. Suppose that no treatment occurs more than once in any block: such a design is said to be binary. For
Table 4.^a

Association scheme: B(t)
Eigenspace Wi          Dimension of Wi     A0    A1
W0                     1                   1     t − 1
W1 = contrasts         t − 1               1     −1

Association scheme: GD(m, n)
Eigenspace Wi          Dimension of Wi     A0    A1        A2
W0                     1                   1     n − 1     n(m − 1)
W1 = between groups    m − 1               1     n − 1     −n
W2 = within groups     m(n − 1)            1     −1        0

Association scheme: B(m) × B(n)
Eigenspace Wi          Dimension of Wi     A00   A01       A10      A11
W0                     1                   1     n − 1     m − 1    (m − 1)(n − 1)
W1 = between rows      m − 1               1     n − 1     −1       −(n − 1)
W2 = between columns   n − 1               1     −1        m − 1    −(m − 1)
W3                     (m − 1)(n − 1)      1     −1        −1       1

Association scheme: C(5)
Eigenspace Wi          Dimension of Wi     A0    A1        A2
W0                     1                   1     2         2
W1                     2                   1     0.618     −1.618
W2                     2                   1     −1.618    0.618

^a The entry in row Wi and column Aj is the character eij.
treatments u and υ, denote by λuυ the number of blocks in which u and υ both occur, that is, the concurrence of u and υ. The design Δ is said to be partially balanced with respect to an association scheme C if λuυ depends only on the associate class of C that contains {u, υ}. Usually we abbreviate this and simply say that Δ is a partially balanced (incomplete block) design, or PBIBD. A design may be partially balanced with respect to more than one association scheme; usually, the simplest possible association scheme is assumed. The balanced incomplete block design* (BIBD) is just a special case of a PBIBD, for a BIBD with t treatments is partially balanced with respect to the association scheme B(t). A less trivial example is shown in Table 5. Here t = 12, r = 3, b = 4, and k = 9; treatments are denoted by capital letters and blocks are columns. This design is partially balanced with respect to GD (4, 3): the "groups" of treatments are {A, B, C}, {D, E, F}, {G, H, I}, and {J, K, L}. Partially balanced designs often inherit the name of the appropriate association scheme. Thus a design that is partially balanced with respect to a group-divisible association scheme is called a group-divisible design*. Partial balance with respect to a factorial association scheme is known as factorial balance.
Table 5.

Block:  1  2  3  4
        A  A  A  D
        B  B  B  E
        C  C  C  F
        D  D  G  G
        E  E  H  H
        F  F  I  I
        G  J  J  J
        H  K  K  K
        I  L  L  L
introduced only in 1939, the importance of factorially balanced designs had been recognized previously by Yates [56].

Construction and Catalogs

There are two elementary methods of constructing a PBIBD for a given association scheme. Each method has a block B(u, i) for each treatment u and a fixed i with 1 ≤ i ≤ s. In the first construction B(u, i) consists of all ith associates of u; in the second, u is also in B(u, i). If Δ1 is partially balanced with respect to an association scheme C for t1 treatments, we may obtain a new design Δ for t1 t2 treatments by replacing each treatment of Δ1 by t2 new treatments. Then Δ is partially balanced with respect to C/B(t2). In particular, if Δ1 is a BIBD, then Δ is group divisible. The design in Table 5 was constructed by this method with t1 = 4 and t2 = 3. For further constructions of
Table 6.

Block:  123  124  125  134  135  145  234  235  245  345
        A    A    A    B    B    C    E    E    F    H
        B    C    D    C    D    D    F    G    G    I
        E    F    G    H    I    J    H    I    J    J
group divisible designs, see GROUP-DIVISIBLE DESIGNS. The lattice designs∗ for n² treatments are partially balanced with respect to Latin square∗ association schemes Lr(n) for various r: for a simple lattice, r = 2; for a triple lattice, r = 3. The scheme L2(n) is also the two-dimensional Hamming scheme H(2, n). Designs partially balanced with respect to H(d, n) for d = 3, 4, 5 are sometimes called cubic, quartic, or quintic, respectively. A simple construction of such a PBIBD gives a block B(i, j) for 1 ≤ i ≤ d and 1 ≤ j ≤ n: the treatments in B(i, j) are all those d-tuples whose ith coordinate is equal to j. The method of triads gives a PBIBD for the triangular association scheme T(n). There are (n choose 3) blocks of size 3, one for each subset of size 3 of the original set of size n. The block corresponding to the set {α, β, γ} contains the treatments corresponding to the pairs {α, β}, {α, γ}, and {β, γ}. The triad design for T(5) is shown in Table 6. A PBIBD with (n choose m) blocks of size (m choose 2) may be constructed in a similar manner. A simpler PBIBD for T(n) has n blocks of size n − 1: the blocks are the columns of the square array that defines the association scheme. Most factorial designs in common use have factorial balance. Some constructions may be found in CONFOUNDING and FACTORIAL EXPERIMENTS.

Cyclic Designs∗

These designs are partially balanced with respect to cyclic association schemes. An initial block B0 is chosen. A second block B1 is generated by adding 1 (modulo t) to every element of B0. Further blocks B2, B3, . . . are generated from B1, B2, . . . in the same way. The process stops either with Bt−1 or with Bu−1, where Bu is the
first block identical to B0. The cyclic design for five treatments in blocks of size 3 with initial block {1, 2, 4} is shown in Table 7. It is also possible to have more than one initial block. Essentially the same method gives PBIBDs with association scheme based on any Abelian group.

Table 7.

Block 1: 1 2 4
Block 2: 2 3 0
Block 3: 3 4 1
Block 4: 4 0 2
Block 5: 0 1 3

Although there are many fascinating methods of constructing PBIBDs, the practitioner who needs a single PBIBD should consult a catalog. Clatworthy [14] gives an extensive catalog of PBIBDs with two associate classes. John et al. [26] list cyclic designs with high efficiency factors (see below).

Randomization

The randomization∗ of a PBIBD is in two parts. First, the blocks of the design should be allocated randomly to the actual blocks. Second, in each block independently the treatments prescribed for that block should be randomly allocated to the plots. (If the design is resolvable∗, and the replicates are intended to match features of the experimental material or its management, the blocks should be randomized only within replicates.) There is no need for any randomization of treatments to the letters or numbers used in the design. Any such randomization would be positively harmful if the association scheme has been chosen with reference to the specific treatments, as, e.g., in a factorial design. If a series of similar experiments is to be conducted at different sites, any randomization of treatments at individual sites will hamper, or even prevent, the subsequent analysis across all sites. The first two stages of randomization should, of course, be done, independently, at all sites.
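The two-stage randomization just described is easy to carry out by computer. The following Python sketch is illustrative only and is not part of the original article; the random seed and the choice of the cyclic design of Table 7 are assumptions made for the example. It generates the design from its initial block, randomizes the blocks, and then randomizes treatments to plots within each block, without ever relabelling the treatments.

import random

# Blocks of the cyclic design of Table 7: initial block {1, 2, 4} developed modulo 5.
t = 5
initial_block = [1, 2, 4]
design = [sorted((x + shift) % t for x in initial_block) for shift in range(t)]

random.seed(20)  # fixed seed only so that the illustration is reproducible

# Stage 1: allocate the blocks of the design at random to the actual blocks.
block_order = random.sample(design, k=len(design))

# Stage 2: within each block, independently allocate the prescribed
# treatments at random to the plots of that block.
field_plan = []
for block in block_order:
    plots = list(block)
    random.shuffle(plots)
    field_plan.append(plots)

# Note: the treatment labels themselves are never randomized, as the text advises.
for b, plots in enumerate(field_plan, start=1):
    print("actual block", b, ":", plots)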
Efficiency Factors

The standard analysis of an incomplete block design∗ uses a generalized inverse∗ of the t × t matrix A = I − r⁻¹k⁻¹Λ, where Λ is the concurrence matrix, whose diagonal elements are equal to r and whose (u, υ) element is equal to λuυ if u ≠ υ. Thus A = Σi θi Ai, where θ0 = 1 − k⁻¹ and θi = −λi/(rk) otherwise, where λi is the common value of the concurrence of pairs of treatments which are ith associates. Hence A = Σj Ej Sj, where Ej = Σi eji θi and the eji are the characters defined in the subsection "Algebra" of the section "Association Schemes". A generalized inverse of A is B = Σj Ej⁻¹ Sj, the sum being restricted to those j for which Ej ≠ 0. Moreover, B has the same "pattern" as A, for B = Σi φi Ai, where φi = Σj fij Ej⁻¹.

Denote by τu the effect of treatment u. If x is any contrast vector, then x · τ = Σu xu τu is a linear combination of the treatment effects, and the variance of the intrablock∗ estimate of x · τ is xBx′σ²/r, where σ² is the intrablock variance (see INCOMPLETE BLOCK DESIGNS). Thus, if x is a vector in Wi, then the variance of x · τ is (x · x)σ²/(rEi) and hence the efficiency factor (see BLOCKS, BALANCED INCOMPLETE) for x · τ is Ei. The Ei are called the canonical efficiency factors of the design. They lie between 0 and 1, and E0 is zero. If no other canonical efficiency factor is zero the design is said to be connected: in this case, there is an intrablock estimate of x · τ for every contrast x. The efficiency factor for such an x · τ may be calculated in terms of the Ei, and it lies between the extreme values of the Ei with i ≠ 0. In particular, the efficiency factor for τu − τυ is 1/(φ0 − φi) if u and υ are ith associates.

For example, consider the design given at the top of Table 8. It is a group-divisible design with t = b = 4, r = k = 2, λ1 = 0, λ2 = 1. The groups are {A, C} and {B, D}. The contrast eigenspaces are the between-groups eigenspace W1, with basis x(1) = (1, −1, 1, −1), and the within-groups eigenspace W2, with basis consisting of x(2) = (1, 0, −1, 0) and x(3) = (0, 1, 0, −1). The canonical efficiency factors are given by Ei = ei0(1 − 1/2) − ei2(1/4). The character table [eij] is the second part of Table 4. Thus

[eij] =
  1   1   2
  1   1  −2
  1  −1   0

and we have

E0 = 1 × 1/2 − 2 × 1/4 = 0,
E1 = 1 × 1/2 + 2 × 1/4 = 1,
E2 = 1 × 1/2 + 0 × 1/4 = 1/2.

Moreover,

[fij] = [eij]⁻¹ = (1/4) ×
  1   1   2
  1   1  −2
  1  −1   0

and so

φ0 = (1/4)E1⁻¹ + (1/2)E2⁻¹ = 5/4,
φ1 = (1/4)E1⁻¹ − (1/2)E2⁻¹ = −3/4,
φ2 = −(1/4)E1⁻¹ = −1/4.

Thus we have

A = (1/4) ×
   2  −1   0  −1
  −1   2  −1   0
   0  −1   2  −1
  −1   0  −1   2

B = (1/4) ×
   5  −1  −3  −1
  −1   5  −1  −3
  −3  −1   5  −1
  −1  −3  −1   5

and direct calculation shows that AB = BA = I − (1/4)J, so that B is a generalized inverse for A. Thus the variance–covariance matrix for the intrablock estimates of treatment effects is Bσ²/r. Furthermore, Bx(1) = x(1), and so the variance of the intrablock estimate of x(1) · τ is (x(1) · x(1))σ²/r, just as it would be in a complete block design with the same variance, so the efficiency factor for this contrast is 1. On the other hand, Bx(2) = 2x(2) and Bx(3) = 2x(3), so the variance of the intrablock estimates of the within-groups contrasts x(2) · τ and x(3) · τ is 2(x(2) · x(2))σ²/r, which is twice the variance that would be achieved from a complete block design (assuming that the larger blocks could be found with the same intrablock variance), and so the efficiency factor for within-groups contrasts is 1/2. In fact, x(2) is the contrast vector for τA − τC, and the efficiency factor 1/2 is equal to {5/4 − (−3/4)}⁻¹ = (φ0 − φ1)⁻¹, in agreement with theory, since A and C are first associates. In the same way, if z = (1, −1, 0, 0), then the efficiency factor for τA − τB is z · z / zBz′ = 2/3 = 1/{5/4 − (−1/4)} = 1/(φ0 − φ2), and A and B are second associates.
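The small example above is easy to check numerically. The following Python sketch is illustrative only (numpy is assumed, and is not part of the original article); it builds the concurrence matrix Λ of the design at the top of Table 8 and recovers the canonical efficiency factors from the matrix A = I − r⁻¹k⁻¹Λ.

import numpy as np

r = k = 2
# Concurrence matrix Lambda for treatments A, B, C, D with blocks
# {A,B}, {B,C}, {C,D}, {D,A}; the groups are {A,C} and {B,D}.
Lam = np.array([[2, 1, 0, 1],
                [1, 2, 1, 0],
                [0, 1, 2, 1],
                [1, 0, 1, 2]])
A = np.eye(4) - Lam / (r * k)

# Basis contrasts of the eigenspaces W1 (between groups) and W2 (within groups).
x1 = np.array([1, -1, 1, -1])
x2 = np.array([1, 0, -1, 0])
x3 = np.array([0, 1, 0, -1])

for name, x in [("x(1)", x1), ("x(2)", x2), ("x(3)", x3)]:
    # For an eigenvector x of A, x'Ax / x'x is the canonical efficiency factor.
    print(name, "efficiency factor =", (x @ A @ x) / (x @ x))
# Prints 1.0 for x(1) and 0.5 for x(2) and x(3), i.e. E1 = 1 and E2 = 1/2.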
Table 8.

Block:       1     1     2     2     3     3     4     4
Treatment:   A     B     B     C     C     D     D     A
Yield y:     4     2     6   −11    −7     5    10     1

x(1)D:       1    −1    −1     1     1    −1    −1     1
x(1)B:       0     0     0     0     0     0     0     0
x(1)P:       1    −1    −1     1     1    −1    −1     1

x(2)D:       1     0     0    −1    −1     0     0     1
x(2)B:      1/2   1/2  −1/2  −1/2  −1/2  −1/2   1/2   1/2
x(2)P:      1/2  −1/2   1/2  −1/2  −1/2   1/2  −1/2   1/2

x(3)D:       0     1     1     0     0    −1    −1     0
x(3)B:      1/2   1/2   1/2   1/2  −1/2  −1/2  −1/2  −1/2
x(3)P:     −1/2   1/2   1/2  −1/2   1/2  −1/2  −1/2   1/2
Analysis

The algebraic structure of a PBIBD gives an alternative method of performing the calculations required to estimate means and variances that does not explicitly use matrix inverses. Choose an orthogonal basis for each eigenspace Wi. If x is a basis vector for Wi, calculate the following rt-vectors: xD, which has entry xu on each plot that receives treatment u; xB, in which each entry of xD is replaced by the mean of the entries in the same block; and xP = xD − xB. Let y be the rt-vector of yields. The intrablock estimate of x · τ is (xP · y)/(rEi). Such estimates are found for each of the chosen basis vectors x. Intrablock estimates of any other contrasts are obtained as linear combinations of these. The contribution of x to the intrablock sum of squares is

(xP · y)²/(xP · xP) = (xP · y)²/(rEi(x · x)).
The residual intrablock sum of squares is obtained by subtraction. Interblock∗ estimates and sums of squares are obtained similarly, using xB in place of xP and 1 − Ei in place of Ei. Sometimes it is desirable to combine the inter- and intrablock information. (See Brown and Cohen [7], Nelder [35], Sprott [55], and Yates [57,58].) We demonstrate the calculations on some fictitious data on eight plots using the design given at the top of Table 8. Although this design is so small that it is unlikely to be of much practical use and the calculations are not difficult to perform by other means, it serves to demonstrate the method, which is no more difficult for larger designs. We may use the basis vectors x(1), x(2), x(3) given in the preceding section, where we calculated the canonical efficiency factors to be E0 = 0, E1 = 1, E2 = 1/2. The calculations of effects and sums of squares are shown in Table 9, and the analysis of variance∗ in Table 10.
Table 9.

                              x(1)     x(2)     x(3)
xP · y                        −36       11       −3
rEi                             2        1        1
Intrablock estimate           −18       11       −3
Intrablock sum of squares     162     60.5      4.5
xB · y                          0       12       −4
r(1 − Ei)                       0        1        1
Interblock estimate             —       12       −4
Interblock sum of squares       0       72        8

Table 10.

Stratum    Source      d.f.    SS
Blocks     W2           2      80
           Residual     1      0.5
           Total        3      80.5
Plots      W1           1      162
           W2           2      65
           Residual     1      32
           Total        4      259
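The quantities in Table 9 can be reproduced directly from the definitions of xD, xB, and xP. The sketch below is illustrative only (Python with numpy is assumed, and the plot order follows the columns of Table 8); it prints the intrablock and interblock estimates and sums of squares for the three basis contrasts.

import numpy as np

y     = np.array([4, 2, 6, -11, -7, 5, 10, 1], dtype=float)   # yields, in Table 8 order
block = np.array([1, 1, 2, 2, 3, 3, 4, 4])
treat = np.array(["A", "B", "B", "C", "C", "D", "D", "A"])
r = 2

contrasts = {"x(1)": ({"A": 1, "B": -1, "C": 1, "D": -1}, 1.0),   # (coefficients, Ei)
             "x(2)": ({"A": 1, "B": 0, "C": -1, "D": 0}, 0.5),
             "x(3)": ({"A": 0, "B": 1, "C": 0, "D": -1}, 0.5)}

for name, (coef, E) in contrasts.items():
    xD = np.array([coef[u] for u in treat], dtype=float)
    xB = np.array([xD[block == b].mean() for b in block])   # block means of xD
    xP = xD - xB
    xx = sum(c * c for c in coef.values())
    print(name, "intrablock estimate:", (xP @ y) / (r * E),
          " intrablock SS:", (xP @ y) ** 2 / (r * E * xx))
    if E < 1:   # interblock information exists only when Ei < 1
        print("      interblock estimate:", (xB @ y) / (r * (1 - E)),
              " interblock SS:", (xB @ y) ** 2 / (r * (1 - E) * xx))
# The printed values agree with Table 9: -18 and 162 for x(1); 11, 60.5, 12, 72 for x(2);
# -3, 4.5, -4, 8 for x(3).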
Pros and Cons of Partially Balanced Designs

The advantage of partially balanced designs is the great simplification that may be achieved in the calculations needed to compare potential designs for an experiment and to analyze the results of the experiment. Incomplete block designs are usually assessed by the values of their canonical efficiency factors, all of which should be as large as possible. Some contrasts are more important to the experimenter than others, so it is relatively more important for the efficiency factors for these contrasts to be large. Many catalogs of designs give a single summary value for the canonical efficiency factors, such as their harmonic mean∗, geometric mean∗, minimum, or maximum. Like the canonical efficiency factors themselves, these summaries can in general be calculated only by diagonalizing a t × t matrix for each design. However, for any given association scheme, the eigenspaces Wi, eigenprojections Si, and characters eij may be calculated by diagonalizing a few s × s matrices, and the canonical efficiency factors for every PBIBD for this association scheme may be calculated directly from the characters and the concurrences, with no further matrix calculations. Thus it is extremely easy to compare the canonical efficiency factors of two of these PBIBDs, and the comparison is particularly relevant because the efficiency factors apply to the same eigenspaces.
The story is the same for the analysis of PBIBD experiments. In general, each incomplete block design requires the inversion of a t × t matrix for the analysis of experiments with that design. We have shown that PBIBD experiments may be analyzed without using this inverse. Even if the standard method of analysis is preferred, the inverse can be calculated directly from the canonical efficiency factors and the matrix [fij], which is obtained by inverting the (s + 1) × (s + 1) matrix [eij]. Moreover, this single matrix inversion is all that is needed for all PBIBDs with a given association scheme. In the days before electronic computers, these advantages of PBIBDs were overwhelming. Even today, routine use of PBIBDs permits considerable saving of time and effort and more meaningful comparison of different designs than is generally possible. When the association scheme makes sense in terms of the treatment structure, as with factorial designs, the case for using PBIBDs is still very strong. If the treatment structure does not, or cannot, bear any sensible relationship to an association scheme, the efficiency factors of most interest to the experimenter will not be the canonical efficiency factors of any PBIBD,
so the ease of the preceding calculations does not seem very helpful (see Pearce [42]). Moreover, for an arbitrary treatment structure, the best designs (according to some efficiency criterion) may not be PBIBDs. Thus other designs should certainly be examined if the necessary computing facilities are available. However, PBIBDs still have a role, and if one is found with a very small range of nonzero canonical efficiency factors, then the efficiency factors that interest the experimenter will also be in that range, and the design is probably close to optimal according to many optimality criteria.

GENERALIZATIONS OF PBIBDs AND RELATED IDEAS

Generalizations of Association Schemes

Various authors have tried to extend the definition of PBIBD by weakening some of the conditions (a)–(e) for an association scheme without losing too much of the algebra. Shah [52] proposed weakening condition (e) to
(e′) For all i and j, there are numbers qkij such that Ai Aj + Aj Ai = Σk qkij Ak.
This condition is sufficient to ensure that every invertible matrix in A still has its inverse in A, and that every diagonalizable matrix in A has a generalized inverse in A, and so any one inverse can be calculated from s linear equations. Thus Shah’s designs retain the property that the variance of the estimate of the elementary contrast τu − τυ depends only on what type of associates u and υ are. However, the property of simultaneous eigenspaces Wi for every PBIBD with a given association scheme is lost, so the calculation of inverses must be done afresh for each design. Moreover, there is no easy method of calculating efficiency factors. As an example, Shah gave the ‘‘generalized association scheme’’ in Table 11a. (This is obtained in the same way as a cyclic association scheme, but using the non-Abelian group of order 6.) The same convention is used as in Table 2. Parts b and c of Table 11 show incomplete block designs that are ‘‘partially balanced’’ with
respect to this generalized association scheme. (Design b was given by Shah [52] and also discussed by Preece [43]. Design c is formed from its initial block by a construction akin to the cyclic construction.) The concurrence matrices of both designs have W0 and W1 as eigenspaces, where W1 is spanned by (1, 1, 1, −1, −1, −1). However, the remaining eigenspaces for design b are W2, spanned by (2, −1, −1, −√3, 0, √3) and (0, −√3, √3, −1, 2, 1), and W3, spanned by (2, −1, −1, √3, 0, −√3) and (0, √3, −√3, −1, 2, −1). The remaining eigenspaces for design c are V2, spanned by (2, −1, −1, 1, −2, 1) and (0, 1, −1, 1, 0, −1), and V3, spanned by (2, −1, −1, −1, 2, −1) and (0, 1, −1, −1, 0, 1). Nair [32] suggested retaining condition (e) while weakening condition (b) by allowing nonsymmetric association relations, with the proviso

(b′) If Ai is an association matrix, then so is its transpose Ai′.

Then it remains true that A is an algebra in the sense of containing the sum and product of any two of its elements, and the scalar multiples of any one of its elements; moreover, A still contains generalized inverses of its symmetric elements. Thus the Nair schemes have the good properties of the Shah schemes. However, for a Nair scheme, the algebra A may or may not be commutative in the sense that MN = NM for all M and N in A. All the examples given in Nair's paper did have commutative algebras, and Nair wrongly concluded that this is always the case. A counterexample is provided by Table 11d, which is a Nair association scheme with respect to which the designs in Tables 11b and 11c are partially balanced. If A is commutative, then the Nair scheme has all the good properties of genuine association schemes. However, since concurrence and covariance are symmetric functions of their two arguments, one may as well replace Ai by Ai + Ai′ if Ai is nonsymmetric, and obtain a genuine association scheme, because the commutativity of A ensures that condition (e) will still hold. If A is not commutative then the Nair scheme has the same disadvantages as the general Shah schemes. In fact, most Shah schemes arise by taking a
Table 11. (a) Association Scheme (Shah Type)

      A  B  C  D  E  F
A     0  1  1  2  3  4
B     1  0  1  4  2  3
C     1  1  0  3  4  2
D     2  4  3  0  1  1
E     3  2  4  1  0  1
F     4  3  2  1  1  0
C F
A E
B F
D F A
E D B
F E C
(b) Shah’s Design A D
A D
B E
B E
C F
C D
(c) Other Design A B D
B C E
C A F
(d) Association Scheme (Nair Type)

      A  B  C  D  E  F
A     0  5  1  2  3  4
B     1  0  5  4  2  3
C     5  1  0  3  4  2
D     2  4  3  0  1  5
E     3  2  4  5  0  1
F     4  3  2  1  5  0
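The claims made about Table 11(d) can be verified by direct computation. The sketch below is illustrative only and not part of the original article (Python with numpy is assumed); it builds the six association matrices from the table, checks that every product Ai Aj is again one of the matrices (so condition (e) holds for this scheme), and exhibits a pair that does not commute.

import numpy as np

# Class table of Table 11(d); rows and columns are the treatments A, ..., F.
classes = np.array([[0, 5, 1, 2, 3, 4],
                    [1, 0, 5, 4, 2, 3],
                    [5, 1, 0, 3, 4, 2],
                    [2, 4, 3, 0, 1, 5],
                    [3, 2, 4, 5, 0, 1],
                    [4, 3, 2, 1, 5, 0]])
A = [(classes == i).astype(int) for i in range(6)]   # association matrices A0, ..., A5

def which(M):
    """Return k if M equals Ak exactly, otherwise None."""
    for k, Ak in enumerate(A):
        if np.array_equal(M, Ak):
            return k
    return None

# Closure: in this particular scheme every product Ai Aj is itself one of the Ak.
closed = all(which(A[i] @ A[j]) is not None for i in range(6) for j in range(6))
print("closed under multiplication:", closed)        # True

# Noncommutativity: at least one pair of association matrices does not commute.
noncommuting = [(i, j) for i in range(6) for j in range(6)
                if not np.array_equal(A[i] @ A[j], A[j] @ A[i])]
print("example of a noncommuting pair:", noncommuting[0])   # (1, 2): A1 A2 != A2 A1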
Nair scheme and fusing nonsymmetric associate classes, as in the preceding example. Another possibility is to allow self-associates to form more than one class: thus condition (c) becomes

(c′) The nonzero elements of Ai are either all on, or all off, the diagonal.

This was first suggested implicitly in the statistical literature when Pearce [38] introduced supplemented balance as the judicious addition of a control treatment to a BIBD. There are two classes of self-associates: the control, and all other treatments. There are also two other associate classes: control with new treatment, and new treatment with another new treatment. More general designs satisfying (c′) rather than (c) were given by Rao [48] and Nigam [36]. For these designs A cannot be commutative, because the association matrices for self-associates do not commute with J, and so the useful property of common eigenspaces is lost. If the association relations must be symmetric, then condition (e′) is satisfied while (e) is not, whereas if the pairs (control, new) and (new, control) are allocated to different classes then (b′) and (e) are satisfied. In either case, A contains a generalized inverse of each of its symmetric elements, and so the pattern of variances of elementary contrasts matches the pattern of concurrences. Bearing in mind the importance of condition (e) for properties of A, Higman [18] defined a generalization of association schemes called coherent configurations. These satisfy conditions (a), (b′), (c′), (d), and (e). If they also satisfy (c), they are called homogeneous. Homogeneous coherent configurations arise naturally in the context of transitive permutation groups, and so they are intimately related to recent ideas of
Sinha [53], who defined an incomplete block design to be simple if there is a group G of permutations of the treatments with the properties that (i) if the rows and columns of the concurrence matrix are both permuted by the same element of G, then the matrix is unchanged as a whole; (ii) given any two treatments, there is a permutation in G that carries one into the other. Such designs are indeed simple in the sense that once their structure has been analyzed with respect to a single treatment, everything is known about their structure. This fact is often used implicitly by other authors, and the idea of simplicity seems worth pursuing. As a whole, coherent configurations are of considerable mathematical interest and have many potential applications to the design of statistical experiments, but they cannot be discussed in any further detail here.

Approximations to Balance∗ in Terms of Concurrence

In a review paper on types of "balance" in experimental design, Pearce [39] defined an incomplete block design to be partially balanced if there are real numbers θ0, θ1, . . . , θs, n1, . . . , ns such that every diagonal element of the concurrence matrix is equal to θ0 and every row contains exactly ni elements equal to θi. Thus he effectively removed condition (e) (and even (e′)) from the definition of partial balance. Although Pearce [42] has argued strongly that condition (e) is entirely abstract and artificial and has no place in practical experimentation, without condition (e) all the algebraic theory collapses, and general designs that are partially balanced in Pearce's sense have none of the good properties we have described. It is unfortunate that Pearce retained the term partial balance, because his paper was influential and caused other authors to use the term in his way, which may account for a certain amount of confusion about the meaning of the term. Sinha [53] called this property the extended Latin Square property (ELSP) and attempted to relate ELSP, partial balance, and simplicity. Unfortunately,
Figure 2. The regular graph for the design in Table 12.
some of the specific results in his paper are incorrect. Jarrett [24] used the term s-concurrence design for Pearce's PBIBD and clarified the situation somewhat. However, he insisted that θi ≠ θj unless i = j. Although s-concurrence designs have an obvious attraction, in general the inverse of the concurrence matrix does not have the same pattern as the matrix itself, and little can be predicted about the efficiency factors. For a 2-concurrence design, the matrices A1 and A2 define a regular graph and its complement. If balance is impossible, the nearest approximation seems, intuitively, to be a 2-concurrence design whose concurrences differ by 1. Such designs were called regular graph designs∗ (RGDs) by John and Mitchell [25]. An example is given in Table 12, with one of its corresponding regular graphs in Fig. 2. (Thus a regular graph design is partially balanced if and only if its graph is strongly regular.) Intuition is not quite correct, however, and some RGDs are poor in terms of efficiency factors. Mitchell and John [31] gave a catalog of RGDs that are E-optimal in the sense of having their smallest efficiency factor as large as possible, and they conjectured [25] that, for each fixed value of t, b, and k, if there exists any RGD then there exist RGDs that are A-optimal, D-optimal, and E-optimal respectively. Cheng [10–12] and Jacroux [21] have shown that this is true in several special cases; the optimal designs are often simultaneously RGDs and PBIBDs.
Table 12.

Block:  1  2  3  4  5  6  7  8  9
        E  A  B  C  D  A  B  C  D
        F  B  C  D  A  B  C  D  A
        G  E  F  G  H  C  D  A  B
        H  F  G  H  E  G  H  E  F
        I  H  E  F  G  I  I  I  I
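A quick computation confirms the regular-graph property of the design in Table 12. The sketch below is illustrative only (Python is assumed; the blocks are read column by column from the table above): it checks that every treatment is replicated five times and that every pair of treatments concurs in either two or three blocks.

from collections import Counter
from itertools import combinations

blocks = [list("EFGHI"), list("ABEFH"), list("BCFGE"), list("CDGHF"), list("DAHEG"),
          list("ABCGI"), list("BCDHI"), list("CDAEI"), list("DABFI")]

replication = Counter(u for b in blocks for u in b)
print("replications:", sorted(replication.values()))                  # nine 5s

concurrence = Counter()
for b in blocks:
    for u, v in combinations(sorted(b), 2):
        concurrence[(u, v)] += 1
print("distinct concurrences:", sorted(set(concurrence.values())))    # [2, 3]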
Cheng and Wu [13] allowed for the case where the number of plots is not divisible by t and defined a nearly balanced incomplete block design (NBIBD) to have treatment replications differing by at most 1, and, for all treatments u, v, and w, concurrences λuv and λuw differing by at most 1. This concept is so far (algebraically) from partial balance that there seems little hope of any good general theory about NBIBDs. Nevertheless, Cheng and Wu were able to show that the best NBIBDs do have high efficiency factors (see also NEARLY BALANCED DESIGNS).

Approximations to Balance in Terms of Variance or Efficiency

In general, if one allows the incomplete block design to have unequal replication, unequal block sizes, or more than one occurrence of a treatment per block, then the information matrix rA must be replaced by the matrix L = R² − NK⁻¹N′, where R is the t × t diagonal matrix whose (u, u) entry is √ru if treatment u is replicated ru times; K is the b × b diagonal matrix whose (c, c) entry is the size of block c; and N is the t × b matrix whose (u, c) entry is the number of times that treatment u occurs in block c. Various authors have extended the definition of partial balance to cover designs that have one or more of the following conditions relaxed: equireplicate, equal block sizes, binary. Unequal block sizes make very little difference to the theory discussed so far; while for nonbinary designs equal replication is no longer equivalent to L having equal diagonal elements. Mehta et al. [30] defined Δ to be partially balanced if it has equal replication, equal block sizes, and L is in the algebra A of an association scheme; while Kageyama [29] relaxed the complementary conditions and defined Δ to be partially balanced if it is binary and L is in the algebra A of an association scheme. Note that in both cases the diagonal elements of L are all the same.
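For a small design the matrix L is easy to form directly from the incidence matrix. The sketch below is illustrative only (Python with numpy is assumed, and the design of Table 8 is reused); it computes L and recovers the canonical efficiency factors as the eigenvalues of R⁻¹LR⁻¹, the route taken by James and Wilkinson [22] and discussed below.

import numpy as np

# Incidence matrix N (treatments A..D by blocks 1..4) of the design in Table 8.
N = np.array([[1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
R2 = np.diag(N.sum(axis=1))                 # R^2: diagonal matrix of replications
K  = np.diag(N.sum(axis=0))                 # diagonal matrix of block sizes
L  = R2 - N @ np.linalg.inv(K) @ N.T        # L = R^2 - N K^{-1} N'

Rinv = np.diag(1.0 / np.sqrt(np.diag(R2)))
print(np.round(np.linalg.eigvalsh(Rinv @ L @ Rinv), 3))
# Eigenvalues 0, 0.5, 0.5, 1: the canonical efficiency factors found earlier.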
The supplemented balance designs described earlier are similar in spirit, except that L belongs to the algebra A of an inhomogeneous coherent configuration, whose classes of self-associates consist of all treatments with the same replication. For all these extended definitions, L has a generalized inverse M in A. Since the variance of the estimate of x · τ is proportional to xMx′, the variance of the elementary contrast τu − τv depends only on the associate class containing the pair (u, v). Thus it would be correct to call these designs partially variance-balanced. Nigam's [36] designs have only two variances: he called his designs nearly balanced if the two variances are within 10% of each other, but this is not related to NBIBDs. However, if treatments are not equally replicated then treatment contrasts would not have equal variances even in an unblocked design, and it may be unreasonable to expect them to have equal variances in the incomplete block design. The efficiency factor takes account of the unequal replication by comparing the variance in the blocked and unblocked designs, and so it is a more useful measure of information loss than variance. James and Wilkinson [22] showed that the efficiency factors may be obtained from the matrix R⁻¹LR⁻¹. This matrix is symmetric, so it has a complete eigenvector basis, and its eigenvalues are the canonical efficiency factors: if there are m nonzero eigenvalues, then James and Wilkinson define Δ to have mth order balance. As Houtman and Speed [20] pointed out, this is just a special case of the general balance∗ introduced by Nelder [34]. Following a similar, but independent, line of thought, Caliński [8] showed that the canonical efficiency factors are the eigenvalues of R⁻²L. Since R⁻²L and R⁻¹LR⁻¹ are similar (R⁻²L = R⁻¹(R⁻¹LR⁻¹)R) they have the same eigenvalues, and the eigenspaces of one can be obtained easily from those of the other. In particular, R⁻²L also has a complete
eigenvector basis. However, Caliński did not realize this, and Puri and Nigam [44] defined Δ to be partially efficiency balanced if R⁻²L has a complete eigenvector basis. Since all incomplete block designs are partially efficiency balanced, this definition seems unnecessary. A stream of papers followed the Caliński/Puri/Nigam school (e.g., Puri and Nigam [46]) and another stream followed the Nelder/James/Wilkinson school (e.g., Jarrett [23]); there has been little cross-reference between the two. However, Pal [37] has recently pointed out that all designs are partially efficiency balanced, and it is to be hoped that these separate strands will be unified and simplified.

More Complicated Block Structures

The definition of partial balance can be extended to more complicated block structures obtained from several block factors by crossing and nesting∗ (e.g., the simple orthogonal block structures of Nelder [33]). The treatments must form a PBIBD with each system of blocks separately, ignoring all the other block systems, and the association scheme must be the same throughout. The results described in this article extend to these structures with no difficulty. See Houtman and Speed [20] for details. It is also possible to use partially balanced designs when time is a block factor, or there is some other definite order on the experimental units. Blaisdell and Raghavarao [2] have used partial balance in changeover∗ designs.

Literature

There is an enormous statistical and mathematical literature on association schemes and partial balance, and it is impossible to cite here all who have made important contributions. The statistical reader with some mathematical background will find comprehensive accounts of partial balance in the textbooks by John [27,28] and Raghavarao [47]; the references in Bose [3], Clatworthy [14], Pearce [42], and Preece [43] are also useful sources of further reading. A more detailed account of the practical role of the eigenspaces in some particular cases is in Corsten [15]. A full account of the
role of the eigenspaces is given by Houtman and Speed [20], James [22], and Nelder [34], while papers such as refs. 40 and 41 concentrate on just the eigenvalues (i.e., the canonical efficiency factors). References 44 and 46 explain the eigenspace analysis more specifically in the context presented in this article. For a mathematical treatment of association schemes and strongly regular graphs, see refs. 1, 9, 16, 17, 50, and 51. REFERENCES 1. Biggs, N. L. and White, A. T. (1979). ‘‘Permutation Groups and Combinatorial Structures.’’ Lond. Math. Soc. Lect. Notes Ser., 33, Cambridge University Press, Cambridge, England. (Strictly for mathematicians; includes a section on strongly regular graphs.) 2. Blaisdell, E. A. and Raghavarao, D. (1980). J. R. Statist. Soc. B, 42, 334–338. (PBIBDs as changeover designs.) 3. Bose, R. C. (1963). Sankhya, ¯ 25, 109–136. (A review article on partial balance.) 4. Bose, R. C. (1963). Pacific J. Math., 13, 389–419. (Introduction of strongly regular graphs.) 5. Bose, R. C. and Nair, K. R. (1939). Sankhya, ¯ 4, 337–372. (Introduction of partial balance.) 6. Bose, R. C. and Mesner, D. M. (1959). Ann. Math. Statist., 30, 21–38. (The major step in the algebra of partial balance.) 7. Brown, L. D. and Cohen, A. (1974). Ann. Statist., 2, 963–976. ´ 8. Calinski, T. (1971). Biometrics, 27, 275–292. 9. Cameron, P. J. and van Lint, J. H. (1980). ‘‘Graphs, Codes and Designs.’’ Lond. Math. Soc. Lect. Notes Ser., 43, Cambridge University Press, Cambridge. (Strictly for mathematicians; includes very clear sections on strongly regular graphs and on association schemes.) 10. Cheng, C. -S. (1978). Commun. Statist., A7, 1327–1339. 11. Cheng, C. -S. (1980). J. R. Statist. Soc. B, 42, 199–204. 12. Cheng, C. -S. (1981). Ann. Inst. Statist. Math. Tokyo, 33, 155–164. 13. Cheng, C. -S. and Wu, C. -F. (1981). Biometrika, 68, 493–500. 14. Clatworthy, W. H. (1973). ‘‘Tables of TwoAssociate Class Partially Balanced Designs.’’ Natl. Bur. Stand. (U.S.) Appl. Math. Ser. 63
(Washington, DC). (A large catalog; very useful.)
30. Mehta, S. K., Agarwal, S. K., and Nigam, A. K. (1975). Sankhya, ¯ 37, 211–219.
15. Corsten, L. C. A. (1976). In Essays in Probability and Statistics, S. Ikeda, ed. Shinko Tsusho, Tokyo, 125–154.
31. Mitchell, T. and John, J. A. (1976). ‘‘Optimal Incomplete Block Designs.’’ Oak Ridge Nat. Lab. Rep. No. ORNL/CSD-8. Oak Ridge, TN. (A catalog of regular graph designs.)
16. Delsarte, P. (1973). ‘‘An Algebratic Approach to the Association Schemes of Coding Theory.’’ Thesis, Universit´e Catholique de Louvain (appeared as Philips Research Reports Supplement, 1973, No. 10). (For mathematicians, a splendid account of association schemes and their relation to other mathematical objects.) 17. Dembowski, P. (1968). Finite Geometries. Springer-Verlag, Berlin. (Strictly for mathematicians; includes an appendix on association schemes and partial designs.) 18. Higman, D. G. (1975). Geom. Dedicata, 4, 1–32. 19. Hinkelmann, K. (1964). Ann. Math. Statist., 35, 681–695. 20. Houtman, A. M. and Speed, T. P. (1983). Ann. Statist., 11, 1069–1085. (Contains the natural extension of PBIBDs to more complicated block structures; also a useful bibliography on the use of efficiency factors and eigenspaces in analysis.) 21. Jacroux, M. (1980). J. R. Statist. Soc. B, 42, 205–209. 22. James, A. T. and Wilkinson, G. N. (1971). Biometrika, 58, 279–294. (A lucid account of the use of efficiency factors and eigenspaces in analysis—for those who are familiar with higher-dimensional geometry and algebra.) 23. Jarrett, R. G. (1977). Biometrika, 64, 67–72. 24. Jarrett, R. G. (1983). J. R. Statist. Soc. B, 45, 1–10. (A clear separation of designs with equal concurrence patterns from PBIBDs.) 25. John, J. A. and Mitchell, T. (1977). J. R. Statist. Soc. B, 39, 39–43. (Introduction of regular graph designs.) 26. John, J. A., Wolock, F. W., and David, H. A. (1972). ‘‘Cyclic Designs.’’ Natl. Bur. Stand. (U.S.) Appl. Math. Ser. 62 (Washington DC). (A catalog of designs with high efficiency factors.) 27. John, P. W. M. (1971). Statistical Design and Analysis of Experiments. Macmillan, New York. (A textbook for statisticians, with a comprehensive section on PBIBDs.)
32. Nair, C. R. (1964). J. Amer. Statist. Ass., 59, 817–833. 33. Nelder, J. A. (1965). Proc. R. Soc. Lond. A, 283, 147–162. 34. Nelder, J. A. (1965). Proc. R. Soc. Lond. A, 283, 163–178. 35. Nelder, J. A. (1968). J. R. Statist. Soc. B, 30, 303–311. 36. Nigam, A. K. (1976). Sankhya, ¯ 38, 195–198. 37. Pal, S. (1980). Calcutta Statist. Ass. Bull., 29, 185–190. 38. Pearce, S. C. (1960). Biometrika, 47, 263–271. (Introduction of supplemented balance.) 39. Pearce, S. C. (1963). J. R. Statist. Soc. A, 126, 353–377. 40. Pearce, S. C. (1968). Biometrika, 55, 251–253. 41. Pearce, S. C. (1970). Biometrika, 57, 339–346. 42. Pearce, S. C. (1983). The Agricultural Field Experiment. Wiley, Chichester, UK. (This text, full of good advice on practical experimentation, discusses the place of various kinds of theoretical balance in the practical context.) 43. Preece, D. A. (1982). Utilitas Math., 21C, 85–186. (Survey of different meanings of ‘‘balance’’ in experimental design; extensive bibliography.) 44. Puri, P. D. and Nigam, A. K. (1977). Commun. Statist., A6, 753–771. 45. Puri, P. D. and Nigam, A. K. (1977). Commun. Statist., A6, 1171–1179. (Relates efficiency balance, pairwise balance, and variance balance.) 46. Puri, P. D. and Nigam, A. K. (1982). Commun. Statist. Theory Meth., 11, 2817–2830. 47. Raghavarao, D. (1971). Constructions and Combinatorial Problems in Design of Experiments. Wiley, New York. (This remains the best reference work on construction of incomplete block designs; it requires some mathematical background.) 48. Rao, M. B. (1966). J. Indian Statist. Ass., 4, 1–9.
49. Roy, P. M. (1953). Science and Culture, 19, 210–211.
29. Kageyama, S. (1974). Hiroshima Math. J., 4, 527–618. (Includes a useful review of different types of association schemes.)
50. Seidel, J. J. (1979). In ‘‘Surveys in Combinatorics,’’ B. Bollobas, ed. Lond. Math Soc. Lect. Notes Ser., 38. Cambridge University Press,
28. John, P. W. M. (1980). Incomplete Block Designs. Marcel Dekker, New York.
PARTIALLY BALANCED DESIGNS Cambridge, England, pp. 157–180. (Survey of strongly regular graphs.) 51. Seidel, J. J. (1983). In Proceedings of the Symposium on Graph Theory, Prague, 1982 (to appear). 52. Shah, B. V. (1959). Ann. Math. Statist., 30, 1041–1050. 53. Sinha, B. K. (1982). J. Statist. Plan. Infer., 6, 165–172. 54. Speed, T. P. and Bailey, R. A. (1982). In Algebraic Structures and Applications, P. Schultz, C. E. Praeger, and R. P. Sullivan, eds. Marcel Dekker, New York, pp. 55–74. (Gives the eigenprojection matrices for a class of association schemes.) 55. Sprott, D. A. (1956). Ann. Math. Statist., 27, 633–641. 56. Yates, F. (1935). Suppl. J. R. Statist. Soc., 2, 181–247. 57. Yates, F. (1939). Ann. Eugen. (Lond.) 9, 136–156. 58. Yates, F. (1940). Ann. Eugen. (Lond.) 10, 317–325. See also ANALYSIS OF VARIANCE; BLOCKS, BALANCED INCOMPLETE; BLOCKS, RANDOMIZED COMPLETE; COMBINATORICS; CONFOUNDING; CYCLIC DESIGNS; DESIGN OF EXPERIMENTS; FACTORIAL EXPERIMENTS; GENERAL BALANCE; GRAPH THEORY; GRAECO-LATIN SQUARES; GROUP-DIVISIBLE DESIGNS; INCOMPLETE BLOCK DESIGNS; INTERBLOCK INFORMATION; INTERACTION; INTRABLOCK INFORMATION; LATIN SQUARES, LATIN CUBES, LATIN RECTANGLES; LATTICE DESIGNS; MAIN EFFECTS; NEARLY BALANCED DESIGNS; NESTING AND CROSSING IN DESIGN; OPTIMAL DESIGN OF EXPERIMENTS; RECOVERY OF INTERBLOCK INFORMATION; and REGULAR GRAPH DESIGNS.
ROSEMARY A. BAILEY
‘‘PATIENTS’’ VS. ‘‘SUBJECTS’’
LEONARD H. GLANTZ
Boston University School of Public Health, Boston, Massachusetts

1 CONFUSION BETWEEN "PATIENTS" AND "SUBJECTS"

In discussions and reports of clinical trials, the terms "patients" and "subjects" are often used interchangeably. This conceptual error affects both physicians/researchers and patients/subjects. Patients go to physicians for the purpose of diagnosis and treatment. Both patients and physicians have identical goals—the cure or alleviation of the condition for which the patient sought the physician's help. It is generally recognized that physicians have a "fiduciary obligation" to patients that requires them to act in their patients' best interests. This is not the case with researchers. Subjects do not seek out researchers; researchers seek out subjects. The primary goal of researchers is not to benefit subjects but to increase knowledge. Subjects are a means to this end rather than the end in themselves, as is the case in the physician–patient relationship. Whether the subject actually benefits from the clinical trial is not of primary concern, since the goal of increasing knowledge will be achieved regardless of the outcome for a single subject. Because of these considerations, it cannot be said that a "fiduciary relationship" exists between the subject and the researcher. These differences between the two relationships provide the basis for why research with subjects is much more highly regulated than the treatment of patients by physicians. Since the interests of researcher and subject do not coincide, and the researcher is in a position of power, it is important that a system be in place to protect the rights and welfare of the human subjects. Thus, specific ethical codes, laws, and regulations have been implemented that provide special standards for conducting research on subjects that go beyond the standards for treating patients.

The delineation of the differences between physicians and patients and researchers and subjects is in no way meant to be critical of researchers or their goals. Rather, it is important for all parties to the research to understand properly their roles, obligations, and expectations. For example, it is important for subjects to understand that being a research subject is not the same as being a patient, and that being in a clinical study is not the same as receiving standard therapy. It is just as important for investigators to recognize these differences. In the absence of this recognition, both researchers and subjects can suffer from what is called "the therapeutic misconception," in which both researchers and subjects mistakenly conflate research with therapy.

2 USE OF RANDOMIZATION AND PLACEBOS

A concrete example of how subjects and patients differ is provided by examining the process of randomization and the use of placebos. Randomization involves placing subjects in different arms of a research project through the use of chance allocations. No professional judgment is made as to which arm the subject will be placed in. It is immediately obvious that this approach is quite different from the way medicine is practiced, in which patients are not randomly assigned to treatments but are provided the treatment that the physician believes will best alleviate or cure the patient's condition. Similarly, placebos are not regularly used in the practice of medicine. When patients come for treatment, they are not randomly assigned to a treatment or no-treatment arm and then "fooled" into believing that they may have received effective treatment when they have not. Although the use of placebos is standard in randomized clinical trials and is a useful tool in establishing the safety and efficacy of drugs, its use is designed to increase knowledge, a researcher's goal, not to provide treatment, a physician's goal. Nevertheless, the use of placebos is not uncontroversial in the context of research. Although placebo controlled trials are not
particularly controversial when it comes to relatively minor well-controlled medical conditions, ethical considerations prohibit the use of placebos in certain circumstances. For example, if a person has a bacterial infection that requires the use of antibiotics, and a new antibiotic was being tested, it would be unethical to withhold the effective treatment solely for the purpose of testing the new unknown treatment. In other words, although researchers are not the subjects’ physicians, they still have an obligation not to injure others, an obligation that extends to all individuals from which researchers are not exempt. 3 WHEN THE PHYSICIAN IS ALSO THE RESEARCHER One of the more difficult conceptual issues that develops in distinguishing between ‘‘patients’’ and ‘‘subjects,’’ and therefore between ‘‘physicians’’ and ‘‘researchers,’’ is when the researcher is also the subject’s physician. In such an instance, the problem of role confusion can become quite stark. It is important for both the physician and the patient to recognize when their roles shift to that of researcher and subject. Indeed, best practice may require that whenever possible physicians not become researchers who use their patients as subjects, but that another medical professional should play the role of researcher. This practice also assures that the research subjects always have available physicians who have their best interests in mind while they are going through the research process. In circumstances where this situation is not practical, it is essential for the physician and for the patient to be explicitly aware of the change in their relationship when the physician becomes the researcher. REFERENCES 1. R. C. Fox and J. P. Swazey, The Courage to Fail: A Social View or Organ Transplants and Dialysis. Chicago, IL: University of Chicago Press, 1974. 2. H. Jonas, Philosophical reflections on experimenting with human subjects. In: P. A.
Freund (ed.), Experimentation with Human Subjects. New York: George Braziller, 1969. 3. J. Katz, Human experimentation and human rights. Saint Louis University Law J. 1993; 38(1):7–55. 4. National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, The Belmont Report. 1978. DHEW Publication No. (OS) 78-0012. 5. T. Parsons, Research with human subjects and the ‘‘professional complex.’’ In: P. A. Freund (ed.), Experimentation with Human Subjects. New York: George Braziller, 1969, pp. 116–151. 6. P. Ramsey, The patient as person. New Haven, CT: Yale University Press, 1970.
PERMUTATION TESTS IN CLINICAL TRIALS
DAVID M. ZUCKER
Hebrew University of Jerusalem, Department of Statistics

1 RANDOMIZATION INFERENCE—INTRODUCTION

Randomization—allocation of study treatments to subjects in a random fashion—is a fundamental pillar of the modern controlled clinical trial. Randomization bolsters the internal validity of a trial in three major respects:

1. It prevents possible investigator bias (which may otherwise exist even unintentionally) in the allocation of subjects to treatments.
2. It generates study groups that, on average, are balanced with respect to both known and unknown factors.
3. It provides a probability structure whereby the study results may be evaluated statistically without exogenous statistical modeling assumptions.

The purpose of this article is to elaborate on the last of these three points. To begin, we note that many, probably most, statistical analyses in empirical research, especially in nonexperimental studies, involve some statistical modeling assumptions. For example, to take a simple case, the classical two-sample t-test for comparing two groups makes the following assumptions:

1. Each observation in each group is equal to the sum of a fixed group-specific population mean value plus a mean zero random error term.
2. All random error terms, both within and between groups, are statistically independent.
3. The random error terms have a normal distribution.
4. Within each group, the error terms all have the same variance. (On occasion it is also assumed that the variance is the same in the two groups, although this assumption can be avoided.)

Obviously, the more complex the statistical analysis, the more assumptions are involved. In a thorough statistical analysis, efforts are made to check the validity of the assumptions and to examine the sensitivity of the results to departures from the assumptions. In routine analyses, however, these checks are often skipped. Moreover, and more critically, the checks are not foolproof, and even after the checks, some unquantifiable element of uncertainty remains. In an experiment in which treatments have been assigned to subjects by randomization, however, it is possible to apply a statistical method for testing treatment effect that does not require any statistical assumptions about the data beyond those inherently satisfied due to the randomization itself. This method is called a permutation test (or randomization test). The ability thus afforded to compute a P-value for testing treatment effect without relying on uncertain statistical assumptions is a major strength of the randomized trial.

2 PERMUTATION TESTS—HOW THEY WORK

The classic exposition of the randomized experiment in general, and the permutation test in particular, was given by R. A. Fisher (1). Rosenberger and Lachin (2) give an up-to-date comprehensive exposition. This article will describe how a permutation test works in the simple context of comparing two groups, a treatment group and a control group, but the idea applies more generally to multi-arm trials. The classical formulation of the permutation test is as a test of the null hypothesis that the treatment has no effect whatsoever on the subject's response; that is, each subject would exhibit exactly the same response whether given the study treatment or the control regimen. This null hypothesis is referred to by some authors as the "strong"
null hypothesis. Some remarks will be made later in this article on an alternative form of null hypothesis and the performance of permutation tests in that context. For now, however, we will stick with the null hypothesis that each subject would exhibit exactly the same response irrespective of the regimen given. We will illustrate how the test works in the context of an example presented in Reference 3. We first must review the concept of a P-value in general. Given a statistic aimed at comparing the two groups—the difference between the means, for example—the P-value is defined as the probability that the statistic would be equal to or more extreme than the value actually observed if the null hypothesis were in fact true. A small P-value means that the observed value of the statistic is so extreme that it is unlikely to have arisen if the null hypothesis were true, and is thus an (indirect) indicator that the null hypothesis is false. It is conventional to say that there is ‘‘statistically significant’’ evidence of a treatment difference if the two-sided P-value (i.e., considering extremes in both directions) is less than 0.05. With this background, we may now turn to the example. Suppose eight subjects were randomly assigned to either treatment or control on an equal basis (four per group). By ‘‘randomly assigned,’’ we mean that the researchers chose the actual allocation at random from among the 70 possible ways of dividing eight subjects into two groups of four subjects each, with each possible allocation having an equal 1/70 chance of being employed. Suppose further that the final results with respect to some response variable were follows:
Table 1. Illustrative Experimental Data

Subject    Group        Response
A          Control      18
B          Control      13
C          Control       3
D          Control      17
E          Treatment     9
F          Treatment    16
G          Treatment    17
H          Treatment    17
The groups are to be compared by examining the difference in sample means between the treatment and control groups. The observed sample means are 14.75 for treatment and 12.75 for control, so that the observed mean difference is 2.00. We wish to evaluate the statistical significance of this result. We argue as follows. As noted, there are 70 possible ways of dividing the eight subjects into two groups of four subjects each. Under the null hypothesis that the treatment has no effect whatsoever on the response, the responses of all eight subjects would be the same no matter what the allocation was. The only thing that differs from allocation to allocation is who received which regimen. By way of analogy, imagine a deck of eight cards, one corresponding to each subject, with the subject’s eventual end of study response written on the face of the card. At the beginning, all cards are face down, corresponding to the fact that at the beginning we do not know the response values. We shuffle the cards well, and then split the deck into two packs of four cards each, one pack corresponding to treatment and the other to control. We wait some period of time, during which the numbers on all cards remain the same, corresponding to the fact that the treatment has no effect. Finally, we are allowed to turn the cards face up and see the response values. We can then can compute the mean difference between the groups. We wish to determine the probability of obtaining a mean difference as extreme or more so than that observed through mere ‘‘luck of the draw’’ in splitting the deck into the two packs. Given the final observed response values—the numbers on each of the eight cards—we may enumerate easily all the possible realizations that could have eventuated in this experiment with the given subjects under the null hypothesis. The list of realizations consists simply of the list of the 70 possible ways of dividing the eight subjects into the two groups, with the response value associated with each subject being in every case the value observed in fact in the actual experiment—because, again, under the null hypothesis, the treatment does not affect the response at all. Each of these possible realizations has an equal chance of 1/70 of having arisen, because the allocation was chosen at
Table 2. List of Possible Realizations of the Experiment Under the Null Hypothesis

Case   Subjects on   Observations on    Observations on    Mean Response           Mean
       Treatment     Treatment          Control            Treatment   Control     Diff
1      EFGH          9, 16, 17, 17      18, 13, 3, 17      14.75       12.75       2.00
2      DEFG          17, 9, 16, 17      18, 13, 3, 17      14.75       12.75       2.00
3      DFGH          17, 16, 17, 17     18, 13, 3, 9       16.75       10.75       6.00
...    ...           ...                ...                ...         ...         ...
70     ABCD          18, 13, 3, 17      9, 16, 17, 17      12.75       14.75       −2.00
Table 3. Null Hypothesis Distribution of the Mean Difference

Mean Diff     −7     −6.5   −6     −5     −4.5   −4     −3     −2.5   −2     −1     −0.5   0
Probability   1/70   3/70   1/70   3/70   4/70   3/70   3/70   4/70   3/70   3/70   4/70   6/70

Mean Diff     0.5    1      2      2.5    3      4      4.5    5      6      6.5    7
Probability   4/70   3/70   3/70   4/70   3/70   3/70   4/70   3/70   1/70   3/70   1/70
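The distribution in Table 3, and the P-value discussed in the text that follows, can be reproduced by brute-force enumeration. The sketch below is illustrative only and is not part of the original article (Python is assumed, and the subject labels follow Table 1): it lists all 70 equally likely allocations of four subjects to treatment, computes the mean difference for each, and reports the two-sided P-value.

from itertools import combinations
from collections import Counter

response = {"A": 18, "B": 13, "C": 3, "D": 17, "E": 9, "F": 16, "G": 17, "H": 17}
subjects = sorted(response)
observed = 14.75 - 12.75                      # observed mean difference, 2.00

diffs = []
for treated in combinations(subjects, 4):     # all 70 possible allocations
    control = [s for s in subjects if s not in treated]
    diffs.append(sum(response[s] for s in treated) / 4
                 - sum(response[s] for s in control) / 4)

distribution = Counter(diffs)                 # the permutation distribution of Table 3
p_two_sided = sum(1 for d in diffs if abs(d) >= abs(observed)) / len(diffs)
print("two-sided P-value:", p_two_sided)      # 50/70, about 0.71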
random. For each possible realization (case) listed, we may compute the mean difference between treatment and control that arises under that realization. We obtain a list of the form shown in Table 2. Keeping in mind that each of these cases has a 1/70 chance of having occurred, we may obtain from this list the null hypothesis probability distribution— called the ‘‘permutation distribution’’— for the mean difference. Table 3 displays this null distribution: the table provides a list of the various possible values of the mean difference, along with the probability of obtaining each value. The mean difference observed in the actual experiment was 2.00. We can compute the probability of observing a difference of this magnitude or greater under the null hypothesis, by pure chance, from the probability distribution in Table 3. For a two-sided test we work with the absolute value of the difference. The probability is found to be 50/70, which equals 0.71. Thus, the permutation test two-sided P-value for testing the null hypothesis in this experiment is P = 0.71. Note that in the calculation above, no assumptions whatsoever were made about the behavior of the response value. The calculation depended only on the following basic facts.
1. The treatment allocation followed was chosen at random, on an equally likely basis, from the 70 possible allocations of eight subjects to two groups of four subjects each. 2. The null hypothesis says that the treatment does not affect the response values in any way. 3. In light of Fact 2, we can—after the fact, given the observed data— construct a list of all possible realizations of the experiment with the given subjects under the null hypothesis, with each realization having the same 1/70 chance of having eventuated. 4. Given the list of realizations, we can, for each possible value of the observed mean difference, calculate the probability of observing that value with the given subjects under the null hypothesis. In the case of 2 × 2 table analysis with a dichotomous response variable, the permutation test is known as the Fisher exact test and is available in many standard statistical routines such as SAS PROC FREQ (SAS Institute Inc., Cary, NC). It is possible to adapt the permutation test method to more complex statistical models.
See References 2 and 4 for a discussion on how this adaptation is carried out. In an experiment with a extremely small number of subjects, such as in the above example, the range of possible P-values that can arise is limited, and so the capacity to test the null hypothesis is limited. But with a moderate-to-large sample size, the range of possible P-values is typically large. Similarly, in an experiment with a very small number of subjects, it is possible to list all possible outcomes and compute the P-value in a direct fashion. But as the sample size increases, the calculation quickly becomes formidable. For example, in an experiment with 5 subjects per group, there are 252 possible allocations; with 6 subjects per group, there are 924 possible allocations; with 10 subjects per group, there are 184,756 possible allocations; and with 20 subjects per group, there are about 138 billion possible allocations. With modern computing power and special algorithms, however, it is possible to perform permutation test calculations easily for experiments with 20–30 subjects or more per group. A well-known software package for such calculations is the StatXact package (Cytel Inc., Cambridge, MA). The calculations can also be performed in SAS PROC NPAR1WAY (SAS Institute Inc., Cary, NC), using the SCORES=DATA and EXACT options. At a certain point, however, the calculations become unmanageable, and a large sample approximation is employed. Section (3) discusses this approximation. 3 NORMAL APPROXIMATION TO PERMUTATION TESTS Generally speaking, for a sample size sufficiently large, the permutation distribution of a statistic for testing the null hypothesis can be well approximated by a normal distribution. This result is a generalized version of the familiar classical central limit theorem of probability, which says that the distribution of the average of a large number of independent observations tends to a normal bell-curve distribution. The setup here is somewhat different from that of the classical theorem, but the nature of the final result is essentially the same. The normal limit result
for permutation tests was proven mathematically around 40–50 years ago (5–8). This limit theorem provides justification for the application of classical statistical tests, such as the two-sample t-test described in Section 1, to randomized experiments without relying on the classical assumptions underlying the tests. The classical test simply serves as an approximation—with theoretical backing—to the exact permutation test. Essentially, when the sample size is large, the two tests are approximately equivalent (9). Often, as in the two-sample t-test, the t-distribution is used as an approximating distribution rather than the normal distribution, which improves the approximation to some extent by matching the first two moments (10–12). In practice, for a continuous response variable such as blood pressure, with 30–40 subjects per group the normal-theory test is quite adequate unless the distribution of the response variable is very peculiar. For 2 × 2 table analysis for a dichotomous response variable, the standard recommendation is to use the normal-theory chi-square test (the continuity correction usually improves the approximation) if the expected cell size, given the table margins, is five or more in all four cells, and otherwise to fall back to the Fisher exact test. Thus, in general, it is common practice to use the normal-theory tests unless the study involves a small sample size or small number of events. Again, the normal-theory tests are justified as approximations to a permutation test. When the sample size or number of events is small, the proper course is to use an exact permutation test. 4 ANALYZE AS YOU RANDOMIZE For a permutation-based analysis (either an exact permutation test or a corresponding normal approximation) to be valid without added assumptions, as described above, the form of analysis must match the randomization scheme. This fact is known as the ‘‘analyze as you randomize’’ principle. Thus, if a matched-pairs design has been used, with randomization carried out within each pair, then the pairing must be accounted for in the
4 ANALYZE AS YOU RANDOMIZE

For a permutation-based analysis (either an exact permutation test or a corresponding normal approximation) to be valid without added assumptions, as described above, the form of analysis must match the randomization scheme. This fact is known as the "analyze as you randomize" principle. Thus, if a matched-pairs design has been used, with randomization carried out within each pair, then the pairing must be accounted for in the analysis. Similarly, if stratified randomization has been carried out, then the analysis must account for the stratification. Failure to include matching or stratification factors in the analysis can lead to inaccurate P-values. The level of error tends to be small in trials with a continuous endpoint and a large sample size, but it can be more substantial in very small trials or dichotomous-endpoint trials with a low event rate. Another case calling for emphasis on the need to "analyze as you randomize" is the cluster randomization design (or group randomization design). Here the unit of randomization is some aggregate of individuals. This design is common in community-based and school-based trials, which often involve aggregate-level interventions. In cluster randomization trials, the unit of analysis must be the cluster rather than the individual. This is necessary to preserve the validity of the permutation-based analysis, as indicated above, and also to take proper account of between-cluster variation and thereby avoid serious type I error inflation (13, 14). To provide statistically rigorous results, a cluster randomization trial must include an adequate number of clusters. Many cluster randomization studies involve a very small number of clusters per arm, such as two or four. In such a study, it is almost impossible for a permutation-based analysis to yield a statistically significant result. A normal-theory procedure such as the t-test has no justification in this case. With a trial of this small size, the normality assumption cannot be effectively checked, and the central limit theorem argument presented in the preceding section to justify the use of a normal-theory analysis as an approximation to a permutation-based analysis does not apply. A trial of such a small size may be useful as a pilot study, but it cannot yield statistically definitive conclusions. By contrast, studies such as CATCH (15, 16) (96 schools) and COMMIT (17, 18) (11 matched pairs of communities) included enough units for meaningful statistical analysis. In COMMIT, because of the relatively small number of units, an exact permutation test was used rather than a normal-theory test. In References 16 and 19, methods are described that allow individual-level explanatory variables
to be accounted for while maintaining the cluster as the primary unit of analysis. An article by Donner and Klar in this encyclopedia provides further discussion on statistical analysis of group-randomization trials.

5 INTERPRETATION OF PERMUTATION ANALYSIS RESULTS

When a permutation-based analysis yields a statistically significant result, this is evidence against the null hypothesis that the treatment has no effect whatsoever on any subject's response. If the result is in the positive direction, the inference to be made is that there are at least some individuals for whom the treatment regimen is better than the control regimen. It cannot necessarily be inferred that treatment is better than control for all individuals or even that treatment is better than control on an "average" basis; the only truly definitive statement that can be made is that some individuals do better with treatment than with control. This conclusion is admittedly one of limited scope, but it is nonetheless meaningful and is achieved with a high degree of certainty, in that it does not rely on any outside assumptions. Note, however, that along with the conclusion that the treatment is better than control for some people comes the proviso that treatment might be the same or worse than control for other people, so that "on average" there might be no difference between the two regimens. Usually clinical trial investigators like to go further and try to infer that treatment is better than control on some "average" basis. It must first be understood how this objective can be formulated from a formal statistical standpoint. In a typical clinical trial, patients are recruited on a volunteer, "catch-as-catch-can" basis, and the set of patients entering the trial does not represent a formal random statistical sample from any particular defined population. Accordingly, we must suppose the trial patients behave like a random sample from some hypothetical superpopulation. This supposition has some plausibility in some circumstances, but it must be realized that it is essentially a statistical modeling assumption.
Given that we are willing to make the above-described supposition, in trials with large sample sizes the classical normal-theory procedures generally will provide valid tests of the "population-level" null hypothesis that the superpopulation mean response is the same for treatment as for control, against the alternative that it is better on treatment (or worse, or different). There remains the question of what happens in trials with small sample sizes, and in particular how permutation tests perform in relation to the "population-level" null hypothesis. This question has been investigated in References 20–22. Overall, these investigations have shown that permutation tests generally can have inflated type I error relative to the "population-level" null hypothesis, but if the number of subjects (or units, in a cluster randomized trial) is the same in the two experimental arms, then the type I error level is typically close to the desired level. If the "population-level" null hypothesis has been rejected with results in the positive direction, the inference to be made is that treatment is superior to control on an "average" basis for individuals similar to those who participated in the trial. Generalization of the results to other types of individuals requires careful judgment.

6 SUMMARY
The permutation test is a method for analyzing randomized trials through which the null hypothesis that the treatment has no effect whatsoever on the response may be assessed statistically without statistical distribution assumptions beyond those arising from the randomization process itself. With a sufficiently large sample size, the permutation test can be approximated satisfactorily by a classical normal-theory test. This result provides clear justification for application of normal-theory tests in trials with moderate-to-large sample sizes. In very small trials, it is preferable to perform an exact permutation test.
REFERENCES

1. R. A. Fisher, The Design of Experiments. Edinburgh: Oliver and Boyd, 1935. (8th Ed. New York: Hafner, 1966).
2. W. F. Rosenberger and J. M. Lachin, Randomization in Clinical Trials: Theory and Practice. New York: Wiley, 2002.
3. O. Kempthorne, The Design and Analysis of Experiments. New York: Wiley, 1952.
4. M. H. Gail, W. Y. Tan, and S. Piantadosi, Tests for no treatment effect in randomized clinical trials. Biometrika 1988; 75: 57–64.
5. J. Hajek, Some extensions of the Wald-Wolfowitz-Noether theorem. Ann. Math. Stat. 1961; 32: 506–523.
6. W. Hoeffding, A combinatorial central limit theorem. Ann. Math. Stat. 1951; 22: 558–566.
7. G. E. Noether, On a theorem of Wald and Wolfowitz. Ann. Math. Stat. 1949; 20: 455–458.
8. A. Wald and J. Wolfowitz, Statistical tests based on permutations. Ann. Math. Stat. 1944; 15: 357–372.
9. W. Hoeffding, The large-sample power of tests based on permutations of observations. Ann. Math. Stat. 1952; 23: 169–192.
10. E. J. G. Pitman, Significance tests which may be applied to samples from any populations. J. R. Stat. Soc. 1937; 4 (suppl.): 119–130.
11. E. J. G. Pitman, Significance tests which may be applied to samples from any populations. II. The correlation coefficient test. J. R. Stat. Soc. 1937; 4 (suppl.): 225–232.
12. E. J. G. Pitman, Significance tests which may be applied to samples from any populations. III. The analysis of variance test. Biometrika 1937; 29: 322–335.
13. G. V. Glass and J. C. Stanley, Statistical Methods in Education and Psychology. Englewood Cliffs, NJ: Prentice-Hall, 1970.
14. D. M. Zucker, An analysis of variance pitfall: the fixed effects analysis in a nested design. Educ. Psychol. Measure. 1990; 50: 731–738.
15. R. V. Luepker, C. L. Perry, S. M. McKinlay, P. R. Nader, G. S. Parcel, E. J. Stone, L. S. Webber, J. P. Elder, H. A. Feldman, C. C. Johnson, S. H. Kelder, and M. Wu, Outcomes of a field trial to improve children's dietary patterns and physical activity: The Child and Adolescent Trial for Cardiovascular Health (CATCH). JAMA 1996; 275: 768–776.
16. D. M. Zucker, E. Lakatos, L. S. Webber, D. M. Murray, S. M. McKinlay, H. A. Feldman, S. H. Kelder, and P. R. Nader, Statistical design of the Child and Adolescent Trial for Cardiovascular Health (CATCH). Controlled Clinical Trials 1995; 16: 96–118.
17. COMMIT Research Group, Community Intervention Trial for Smoking Cessation (COMMIT): I. Cohort results from a four-year community intervention. Amer. J. Public Health 1995; 85: 183–192.
18. M. H. Gail, D. P. Byar, T. F. Pechacek, and D. K. Corle, Aspects of the statistical design for the Community Health Trial for Smoking Cessation (COMMIT). Controlled Clinical Trials 1992; 13: 6–21.
19. D. M. Zucker, Cluster randomization. In: N. Geller (ed.), Contemporary Biostatistical Methods in Clinical Trials. New York: Marcel Dekker, 2003.
20. T. Braun and Z. Feng, Optimal permutation tests for the analysis of group randomized trials. J. Amer. Stat. Assoc. 2001; 96: 1424–1432.
21. M. H. Gail, S. D. Mark, R. J. Carroll, and S. B. Green, On design considerations and randomization-based inference for community intervention trials. Stat. Med. 1996; 15: 1069–1092.
22. J. P. Romano, On the behavior of randomization tests without a group invariance assumption. J. Amer. Stat. Assoc. 1990; 85: 686–692.
Pharmacoepidemiology, Overview

Pharmacoepidemiology, the study of patterns of medication use in the population and their effects on disease, is a new field. The need for this area of research became evident in 1961, during the thalidomide catastrophe, when it was realized that drugs prescribed for therapeutic purposes could produce unexpected risks. The entry of thalidomide, a new hypnotic drug, on the market was accompanied by a sudden sharp increase in the frequency of rare birth defects, characterized by the partial or complete absence of limbs [29, 31]. Consequently, several countries either instituted agencies to regulate drugs or expanded the mandate of existing agencies [51]. These agencies were previously interested only in the demonstration of a drug's efficacy, but now required proof of a drug's safety before it was tested in humans, let alone before it was marketed for use by the general population (see Drug Approval and Regulation). These proofs of safety, based on toxicologic and pharmacologic studies, were necessary before randomized controlled trials (RCTs) (see Clinical Trials, Overview) could be conducted on human subjects, primarily to demonstrate the efficacy of a drug. The use of the epidemiologic approach to characterize population patterns of medication use and to assess their effects developed as a complement to RCTs for several reasons. First, RCTs were designed to assess the efficacy and effectiveness of a drug, providing as well some data on its safety with respect to commonly arising side-effects. However, rare side-effects typically cannot be identified in clinical trials because of their small size. For example, to detect a relative risk of 2 for a side-effect having an incidence of 1 per 100, we would require a two-arm trial with over 3000 subjects per arm (α = 0.05, β = 0.1). If the incidence of the side-effect is 1 per 10 000, the sample size per arm would need to be over 300 000. Clearly, these sample sizes are rarely if ever used in RCTs, yet the number of people who will be using these drugs will be in the millions.
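The sample sizes quoted above can be reproduced, approximately, with the usual normal-approximation formula for comparing two proportions; the sketch below uses one common form of that formula (other versions give slightly different numbers).

```python
# Sketch of the sample-size arithmetic behind the figures quoted above
# (two-sided alpha = 0.05, beta = 0.10, i.e. 90% power).
from math import sqrt

def n_per_arm(p_control, relative_risk, z_alpha=1.96, z_beta=1.2816):
    p1, p2 = p_control, p_control * relative_risk
    p_bar = (p1 + p2) / 2
    num = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return num / (p1 - p2) ** 2

print(round(n_per_arm(1 / 100, 2)))     # roughly 3,100 subjects per arm
print(round(n_per_arm(1 / 10_000, 2)))  # roughly 315,000 subjects per arm
```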
Secondly, RCTs usually restrict the study subjects to people without coexisting disease, who are therefore not taking other medications that could interact with the study medication. They are also restricted with respect to age, rarely including children and elderly subjects. Yet, the elderly will be major consumers of most of these medications, along with other medications they are using for coexisting diseases (see Co-morbidity; Drug Interactions). Thirdly, RCTs will usually be based on a short follow-up that typically assesses medication use for a period of 3–12 months. Yet, again, subjects may be using these medications for years, so that the effect of the prolonged use of these medications remains essentially unknown from the RCT data. Finally, there are situations where the RCT is either unethical or inapplicable. For example, it would be ethically unacceptable today in North America or Europe to assess the long-term effects of a new anti-hypertensive agent against placebo in an RCT, although this has been done in China [12] (see Ethics of Randomized Trials). Yet, a large number of hypertensive patients from the general population are either untreated or do not comply with their treatment (see Compliance Assessment in Clinical Trials). They could be used as the reference group for a nonrandomized study based on a cohort study design. Although pharmacoepidemiology can be simply regarded as an application of epidemiologic principles and methods to the field of medications, it is now developing as a discipline of its own because of the special nature of drugs. Indeed, the ways by which drugs are prescribed, employed, marketed and regulated impose certain constraints on epidemiologic research into their use and effects. This field poses challenges that often require special solutions not found in other domains of application, such as cancer, cardiovascular, occupational or infectious disease epidemiology: medications are marketed rapidly, practice patterns of prescribing by physicians are variable, and profiles of drug use by patients are complicated by varying compliance patterns. This complex and dynamic context in which pharmacoepidemiology is situated, as well as the available sources of data, has given rise to unique statistical challenges. The fact that the lifetime of a drug on the market is relatively short and can suddenly be shortened still further by a regulatory or corporate withdrawal often imposes major constraints on studies of its effects. These studies must be conducted rapidly and use existing data in an efficient way without compromising validity (see Validity and Generalizability in Epidemiologic Studies).
In this article we describe several areas where biostatistical input has served to advance pharmacoepidemiology. Although the last two decades have witnessed an explosion of methodologic advances put forward by biostatistics in the design and analysis of epidemiologic studies, most of these have been fundamental to the field of epidemiology in general. We do not discuss these areas here since they are dealt with extensively elsewhere (see Analytic Epidemiology; Descriptive Epidemiology). Instead, we focus on the biostatistical aspects that produced unique methodologic advances, specific to problems posed by pharmacoepidemiologic research.
The Case–Crossover Design

When conducting a case–control study, the selection of controls is usually the most challenging task. The fundamental principle is that selected controls should be representative of the source population which gave rise to the cases [33], a principle often difficult to implement in practice, especially when dealing with acute adverse events and transient exposures. For example, we may wish to study the risk of ventricular tachycardia in association with the use of inhaled beta agonists in asthma. This possible effect has been hypothesized on the basis of clinical study observations of hypokalemia and prolonged Q-T intervals, as measured on the electrocardiogram, in patients after beta agonist exposure [1]. These unusual cardiac deviations were observed only in the 4-hour period following drug absorption. Thus, a case–control study of this issue would first select cases with this adverse event and investigate whether the drug was taken during the 4-hour span preceding the event. For controls, on the other hand, the investigator must define a time point of reference for which to ask the question about use of this drug in the "past 4 hours". However, if, for example, the drug is more likely to be required during the day, but controls can only be reached at home in the evening, the relative risk estimate will be biased by the differential timing of responses for cases and controls. Consequently, when dealing with the study of transient drug effects on the risk of acute adverse events, Maclure [30] proposed the case–crossover design, which uses the cases as their own controls. The case–crossover design is simply a crossover study in
the cases only. The subjects alternate at varying frequencies between exposure and nonexposure to the drug of interest until the adverse event occurs, which it does for all subjects in the study since all are cases by definition. Each case is investigated to determine whether exposure occurred within the predetermined effect period, namely within the 4 hours previous to the adverse event in our example. This occurrence is then classified as having arisen either under drug exposure or nonexposure on the basis of the effect period. Thus, each case is either exposed or unexposed. For the reference information, data on the average drug use pattern are necessary to determine the probability of exposure in the time window of effect. This is done by obtaining data for a sufficiently long period of time to derive a stable estimate. In our example, we might determine the average number of times a day each case has been using beta agonists (two inhalations of 100 µg each) in the past year. This will allow us to estimate the proportion of time that each asthmatic is usually spending as "exposed" in the 4-hour effect period. This proportion is then used to obtain the number of cases expected on the basis of time spent in these "at-risk" periods, for comparison with the number of cases observed during such periods. This is done by forming a two-by-two table for each case, with the corresponding control data as defined above, and combining the tables using the Mantel–Haenszel technique as described in detail by Maclure [30]. The resulting odds ratio is then given by OR = Σi ai N0i / Σi (1 − ai) N1i, where ai is 1 if case i is exposed and 0 if not, N0i is the expected number of unexposed periods, and N1i is the expected number of exposed periods during the reference time span. Table 1 displays hypothetical data from a case–crossover study of 10 asthmatics who experienced ventricular tachycardia. These were all queried regarding their use of two puffs of inhaled beta agonist in the last 4 hours and on average over the past year. The fact of drug use within the effect period is defined by ai, with three cases having used beta agonists in the 4-hour period prior to the adverse event. The usual frequency of drug use per year is converted to a ratio of the number of exposed periods to the number of unexposed periods, the total number of 4-hour periods being 2190 in one year. Using the Mantel–Haenszel formula to combine the 10 two-by-two tables, the estimate of the odds ratio is 3.0, and the 95% confidence interval is (1.2, 7.6).
Table 1 Hypothetical data for a case–crossover study of beta agonist exposure in last 4 hours and the risk of ventricular tachycardia in asthma

Case no.   Beta agonist use(a)     Usual beta agonist   Periods of       Periods of
           in last 4 hours (ai)    use in last year     exposure (N1i)   nonexposure (N0i)
1          0                       1/day                365              1825
2          1                       6/year               3                2184
3          0                       2/day                730              1460
4          1                       1/month              12               2178
5          0                       4/week               208              1982
6          0                       1/week               52               2138
7          0                       1/month              12               2178
8          1                       2/month              24               2166
9          0                       2/day                730              1460
10         0                       2/week               104              2086

(a) Inhalations of 200 µg: 1 = yes, 0 = no.
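The Mantel–Haenszel computation for Table 1 reduces to two sums. The following sketch (Python) reproduces the odds ratio of 3.0 quoted above; the confidence interval calculation is omitted.

```python
# Case-crossover Mantel-Haenszel odds ratio from the hypothetical data of
# Table 1 (a = exposure in the effect period, N1/N0 = usual exposed and
# unexposed 4-hour periods in the reference year).
a  = [0, 1, 0, 1, 0, 0, 0, 1, 0, 0]
N1 = [365, 3, 730, 12, 208, 52, 12, 24, 730, 104]
N0 = [1825, 2184, 1460, 2178, 1982, 2138, 2178, 2166, 1460, 2086]

numerator   = sum(ai * n0 for ai, n0 in zip(a, N0))        # exposed cases
denominator = sum((1 - ai) * n1 for ai, n1 in zip(a, N1))  # unexposed cases

print(f"case-crossover odds ratio = {numerator / denominator:.1f}")  # 3.0
```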
The case–crossover design depends on several assumptions to produce unbiased estimates of the odds ratio. Greenland [14] presented examples where the odds ratio estimates from this approach can be biased. For example, the probability of exposure cannot vary over time. Similarly, confounding factors must be constant over time. Finally, there cannot be interaction between unmeasured subject characteristics and the exposure. Nevertheless, this approach is being used successfully in several studies [38, 43]. It has also been adapted for application to the risk assessment of vaccines [11].
Confounding Bias

Because of the lack of randomization, the most important limitation of observational studies in pharmacoepidemiology is whether an important confounding factor is biasing the reported relative risk estimate. A factor is considered a confounder if it is associated with the adverse event at each level of drug exposure, and is also associated with exposure to the drug itself. Two approaches exist to address the problem of confounding variables, which we describe in the context of the case–control design, although they apply equally well to a cohort design. The first is to select controls that are matched to the cases with respect to all confounding factors and to use the appropriate corresponding techniques of analysis for matched data, usually conditional logistic regression (see Matched Analysis; Matching).
This approach, although often appropriate, has been shown to be susceptible to bias from residual confounding due to coarse matching [5]. The second solution is to select controls unmatched with respect to these confounding factors, but to measure these confounders for all study subjects and use statistical techniques based on either stratification or multiple regression, permitting removal of their effect on the risk from the effect of the drug under study. This approach can also lead to residual confounding if the confounder data are not analyzed properly. For example, the risk of venous thromboembolism has been found to be higher among users of newer oral contraceptive drugs than users of older formulations [22, 41, 52], after controlling for the effect of age. A recent study, however, showed that when confounding by the woman’s age is analyzed using finer age bands, the relative risk is substantially reduced [10]. Beyond such difficulties generic to epidemiology, the context of pharmacoepidemiology has produced several situations where confounding requires particular statistical treatment. Some are described in this section.
Missing Confounder Data

It is at times impossible to obtain data on certain important confounding variables. A frequent situation encountered in pharmacoepidemiology is that of complete data for the cases and incomplete data for the controls of a case–control study. This is often encountered in "computerized database" studies based on administrative databases, where cases have likely been hospitalized and thus have an extensive medical dossier. For these cases, the investigator will thus have access to ample information on potential confounding variables. However, if the controls are population-based (see Case–Control Study, Population-based), it is unlikely that they were hospitalized, and they will not provide comparable data on confounders in the absence of medical charts. Consequently, confounder data will only be available for the cases, and not for the controls. We can assess whether a factor is a confounder on the basis of data available solely for the cases, so that if the factor is deemed not to be a confounder, then the final analysis of the risk of the drug under study will not need to be adjusted. The approach is described by Ray & Griffin [37] and was
used in the context of a study of nonsteroidal antiinflammatory drug (NSAID) use and the risk of fatal peptic ulcer disease [15]. The strategy is based on the definition [5] of a confounder C (C+ and C− denote presence and absence) in the assessment of the association between a drug exposure E (E+ and E− denote exposure or not to the drug) and an adverse condition D (D+ and D− denote cases and controls, respectively). Confounding is present if both of the following conditions are satisfied:
1. C and E are associated in the control group (in D−).
2. C and D are associated in E+ and in E−.
Assuming, in the absence of effect modification, a common odds ratio between E and D (OR_ED) in C+ and in C−, condition 1 becomes equivalent to: C and E are associated in the case group (D+). Thus, if in the cases we find no association between the potential confounder and drug exposure, confounding by this factor can be excluded outright, without having to verify condition 2. In this instance, the analysis involving drug exposure in cases and controls can be performed directly without any concern for the confounding variable. This strategy for assessing confounding is extremely valuable for several case–control studies in pharmacoepidemiology, since if confounding is excluded by this technique, crude methods of analysis can be used to obtain a valid estimate of the odds ratio. However, this is not often the case. As an example, we use data from a case–control study conducted using the Saskatchewan computerized databases to assess whether theophylline, a drug used to treat asthma, increases the risk of acute cardiac death [46]. In this study, the 30 cases provided data on theophylline use, as well as on smoking, possibly an important confounder. However, the 4080 controls only had data available on theophylline use and not on smoking. Table 2 displays the data from this study. The crude odds ratio between theophylline use and cardiac death is 4.3 ((17/13)/(956/3124)). Because of the missing data on smoking, it is only possible to partition the cases, but not the controls, according to smoking. The odds ratio between theophylline use and smoking among the cases is estimable and is found to be 7.5 ((14/5)/(3/8)), thus indicating that smoking is indeed a strong confounder.
Table 2 Data from a case–control study of theophylline use and cardiac death in asthma, with the smoking confounder data missing for controls

                               Cases               Controls
                               E+       E−         E+       E−
Notation:
  Combined                     a        c          b        d
  Stratified by smoking:
    Smokers                    a0       c0         (a)      (a)
    Nonsmokers                 a1       c1         (a)      (a)
Data:
  Combined                     17       13         956      3124
  Stratified by smoking:
    Smokers                    14       5          (a)      (a)
    Nonsmokers                 3        8          (a)      (a)

(a) These frequencies are missing for controls.
An approach was recently developed to permit the estimation of the adjusted odds ratio of the theophylline–cardiac death association, in the absence of confounder data among the controls [45]. The adjusted odds ratio is given by

OR_adj = P0 (w − y) / [(1 − P0) y],   (1)

where y = {v − [v^2 − 4(r − 1) r w x]^(1/2)} / [2(r − 1)] when r ≠ 1 (and y = wx when r = 1), v = 1 + (r − 1)(w + x), r is the odds ratio between exposure and confounder among the cases, x is the probability of exposure among the controls, P0 is estimated by a0/(a0 + c0), and w is the prevalence of the confounder among the controls [45]; w is the only unknown and must be estimated from external sources. An estimate of the variance of OR_adj in (1) exists in closed form. For the illustrative data, an external estimate of smoking prevalence among asthmatics, obtained from a Canadian general population health survey, is 24%. Using this estimate, the adjusted odds ratio is 2.4, much lower than the crude estimate of 4.3, with 95% confidence interval (1.0, 5.8). This statistical approach was developed specifically to address the frequent problem of missing confounder data in pharmacoepidemiology. When using computerized databases, these data are more often missing only in the control series of a case–control study. This technique, based on statistical reasoning, allows us to derive adjusted estimates of relative risk with few assumptions. Extensions of this type of approach to a regression context will expand its usefulness.
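A numeric sketch of equation (1) applied to the Table 2 data follows (Python; w = 0.24 from the external survey, as in the text). It reproduces the crude odds ratio of about 4.3 and the adjusted estimate of about 2.4.

```python
# Adjusted odds ratio of equation (1) for the theophylline example of Table 2.
from math import sqrt

a0, c0 = 14, 5        # exposed / unexposed smokers among the cases
a1, c1 = 3, 8         # exposed / unexposed nonsmokers among the cases
b, d = 956, 3124      # exposed / unexposed controls (smoking unknown)

crude_or = ((a0 + a1) / (c0 + c1)) / (b / d)   # about 4.3
r = (a0 / c0) / (a1 / c1)                      # exposure-confounder OR in cases, about 7.5
x = b / (b + d)                                # exposure probability among controls
P0 = a0 / (a0 + c0)
w = 0.24                                       # external smoking prevalence

if r == 1:
    y = w * x
else:
    v = 1 + (r - 1) * (w + x)
    y = (v - sqrt(v**2 - 4 * (r - 1) * r * w * x)) / (2 * (r - 1))

or_adj = P0 * (w - y) / ((1 - P0) * y)
print(f"crude OR = {crude_or:.1f}, adjusted OR = {or_adj:.1f}")  # 4.3 and 2.4
```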
The Case–Time–Control Design

Case–control studies in pharmacoepidemiology that assess the intended effects of drugs are often limited by their inability to obtain a precise measure of the indication of drug exposure. Adjustment for this crucial confounding factor becomes impossible and an unbiased estimate of the drug effect is unattainable [49]. This bias, arising from confounding by indication, is a major source of limitation in pharmacoepidemiology [32]. Here again, a within-subject approach, similar to the case–crossover design, has been developed. By using cases and controls of a conventional case–control study as their own referents, the case–time–control design eliminates the biasing effect of unmeasured confounding factors such as drug indication [44]. This approach is applicable only in situations where exposure varies over time, which is often the case for medications. The correct application of the case–time–control design is based on a specific model for the data, a model that entails inherent assumptions and imposes certain conditions for the approach to be valid. The model, based on a case–control sampling design, is presented for a dichotomous exposure that varies over time and that is measured only for two consecutive time periods, the current period and the reference period. The logit of exposure, Lijkl = logit{Pr(Eijkl = 1)}, is given by

Lijkl = µ + Sil + πj + Θk,   (2)

where Eijkl represents the binary exposure for group i, period j, outcome k and subject l within group i, µ represents the overall exposure logit, Sil is the effect of study subject l in group i, πj is the effect of period j, and Θk is the effect of event occurrence k. More specifically, i = 0, 1 denotes the case–control group (1 = case subjects, 0 = control subjects), j = 0, 1 denotes the period (1 = current period, 0 = reference period), k = 0, 1 denotes the event occurrence (1 = event, 0 = no event), and l = 1, ..., ni designates the study subject within group i, with n1 case subjects and n0 control subjects. The confounding effect of unmeasured severity or indication is inherently accounted for by Sil.
The period effect, measured by the log of the odds ratio, is given by δπ = π1 − π0 and is estimated from the control subjects. The net effect of exposure on event occurrence is given by δΘ = Θ1 − Θ0. The case subjects permit one to estimate the sum δΘ + δπ, so that the effect of exposure on event occurrence, δΘ, is estimable by subtraction. The estimation of the odds ratio is based on any appropriate technique for matched data, such as conditional logistic regression. Three basic assumptions are inherently made by this logit model. The first is the absence of effect modification of the exposure–outcome association by the unmeasured confounder, i.e. the exclusion of the SilΘk interaction term in model (2). The second is the absence of effect modification of the exposure–outcome association by period, i.e. a null value for the πjΘk interaction term. The third is the lack of effect modification of the exposure–period association by the confounder, represented by the absence of an Silπj interaction term in model (2). Greenland [14] presented examples of the bias that can occur with this approach when the model contains the latter interaction. The approach is illustrated with data from the Saskatchewan Asthma Epidemiologic Project [40], a study conducted to investigate the risks associated with the use of inhaled beta agonists in the treatment of asthma. Using databases from Saskatchewan, Canada, a cohort of 12 301 asthmatics was followed during 1980–87. All 129 cases of fatal or near-fatal asthma and 655 controls were selected. The amount of beta agonist used in the year prior to the index date was found to be associated with the adverse event, comparing low (12 or fewer canisters per year) with high (more than 12) use of beta agonists. The crude odds ratio for high beta-agonist use is 4.4, with 95% confidence interval (2.9, 6.7). Adjustment for all available markers of severity, such as oral corticosteroids and prior asthma hospitalizations as confounding factors, lowers the odds ratio to 3.1, with 95% confidence interval (1.8, 5.4), the "best" estimate one can derive from these case–control data using conventional tools. The use of inhaled beta agonists, however, is known to increase with asthma severity, which also increases the risk of fatal or near-fatal asthma. It is therefore not possible to separate the risk effect of the drug from that of disease severity. To apply
the case–time–control design, exposure to beta agonists was obtained for the 1-year current period and the 1-year reference period. Among the 129 cases, 29 were currently high users of beta agonists and were low users in the reference period, while 9 cases were currently low users of beta agonists and were high users previously. Among the 655 controls, 65 were currently high users of beta agonists and were low users in the reference period, while 25 were currently low users of beta agonists and were high users previously. The case–time–control odds ratio, using these discordant pairs frequencies for a matched pairs analysis, is given by (29/9)/(65/25) = 1.2, with 95% confidence interval (0.5, 3.0). This estimate, which excludes the effect of unmeasured confounding by disease severity, indicates a minimal risk for these drugs. The case–time–control approach provides an unbiased estimate of the odds ratio in the presence of confounding by indication, a common obstacle in pharmacoepidemiology. This is possible despite the fact that the indication for drug use (in our example, disease severity) is not measured, because of the within-subject analysis. Nevertheless, as mentioned above, its validity is subject to several strict assumptions. This approach must therefore be used with caution.
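The computation just described can be sketched as follows (Python). The point estimate follows directly from the discordant-pair frequencies; the confidence interval shown uses the usual large-sample variance for the log of a ratio of discordant counts, which is an assumption of this sketch rather than a formula given in the text.

```python
# Case-time-control odds ratio from discordant-pair frequencies.
from math import exp, log, sqrt

case_hi_lo, case_lo_hi = 29, 9     # cases: high now / low before, and the reverse
ctrl_hi_lo, ctrl_lo_hi = 65, 25    # controls: same discordant patterns

or_ctc = (case_hi_lo / case_lo_hi) / (ctrl_hi_lo / ctrl_lo_hi)
se_log = sqrt(1 / case_hi_lo + 1 / case_lo_hi + 1 / ctrl_hi_lo + 1 / ctrl_lo_hi)
lo = exp(log(or_ctc) - 1.96 * se_log)
hi = exp(log(or_ctc) + 1.96 * se_log)

print(f"case-time-control OR = {or_ctc:.1f}, 95% CI ({lo:.1f}, {hi:.1f})")  # 1.2 (0.5, 3.0)
```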
Figure 1 Different risk profiles by duration of drug use: (a) acute effect; (b) increasing risk; (c) constant risk (each panel plots the hazard ratio against the duration of drug use).

Risk Functions Over Time

Most epidemiologic studies assessing a risk over time routinely assume that the hazards are constant or proportional. Rate ratios are then estimated by Poisson regression models or Cox's proportional hazards model [6]. Often, deviations from these simplifying assumptions are addressed at the design stage, by restricting the study to a specific follow-up period where the assumptions are satisfied. For instance, to study the risk of cancer associated with an agent considered to be an initiator of the disease, the first few years of follow-up after the initiation of exposure will not be accounted for in the analysis, to allow for a reasonable latency period. On the other hand, if the agent is suspected to be a cancer promoter, these same first years will be used in the analysis. Since such considerations are mostly dealt with at the design stage, little attention has been paid to the analytic considerations of this issue. In pharmacoepidemiology, the risk of an adverse event often varies strongly with the duration of use
of a drug. Figure 1 shows three different risk profiles, typical of drug exposure. Figure 1(a) displays the usual profile of risk associated with an acute effect. The drug will affect susceptible subjects early, reflected by the early sharp rise in the curve, and once these subjects are eliminated from the cohort, the remaining subjects will return to some lower constant baseline risk. The peak can occur almost immediately, as with allergic reactions to antibiotics, or may take a certain time to affect the organ, as with gastrointestinal hemorrhage subsequent to NSAID use. This profile of risk was used to explain variations in the risk of agranulocytosis associated with the
use of the analgesic dipyrone [16]. It has recently been used to assess the risk profile of oral contraceptives [47]. Figure 1(b) shows a gradually increasing model of risk, associated with diseases of longer latency such as cancer. Figure 1(c) displays the constant hazard model, after a rapid rise in the risk level. Unfortunately, such graphs are not part of the analysis plan of most pharmacoepidemiologic studies at this point, despite the existence of appropriate techniques [9, 16]. The next few years should see an increasing use of spline functions and other similar tools to model the risk of drugs by their duration of use. The wider access to newer statistical software such as S-PLUS [42] among researchers in pharmacoepidemiology, as well as the publication of papers that simplify the understanding of these sophisticated approaches [13, 16], will encourage their wider use in a field where they are clearly pertinent.
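As a minimal illustration of letting the risk depend on duration of use (hypothetical counts; not the spline-based methods cited above), crude rate ratios can be computed within duration bands, which makes a Figure 1(a)-type acute-effect profile visible as a high early rate ratio that settles back toward a constant level.

```python
# Crude rate ratios by duration-of-use band (all counts hypothetical).
duration_bands   = ["0-1 mo", "1-3 mo", "3-6 mo", "6-12 mo"]
events_exposed   = [30, 18, 13, 21]
pt_exposed       = [1000, 1800, 2500, 4000]    # person-months among current users
events_reference = [10, 18, 25, 40]
pt_reference     = [2000, 3600, 5000, 8000]    # person-months among non-users

for band, e1, t1, e0, t0 in zip(duration_bands, events_exposed, pt_exposed,
                                events_reference, pt_reference):
    rate_ratio = (e1 / t1) / (e0 / t0)
    print(f"{band:>8}: rate ratio = {rate_ratio:.1f}")
```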
Probabilistic Approach for Causality Assessment

The traditional epidemiologic approach to assessing whether a drug causes an adverse reaction is based on the Hill criteria [18], which require the association to be biologically plausible, strong, specific, consistent and temporally valid (see Hill's Criteria for Causality). These criteria are applied to the results of pharmacoepidemiologic studies and, depending on the number of criteria satisfied, provide a level of confidence regarding causality of the drug (see Causation). If a drug is judged to cause an adverse reaction, the conclusion of this exercise is that all exposed cases, or at least some etiologic proportion of these cases [39], are due to the drug. This approach is valuable for inferences to the population, but it does not allow cases to be assessed individually, does not incorporate the specifics of the case, and does not entirely address the unique features of drugs as an exposure entity in epidemiology. The study of individual cases of adverse reactions has been the mainstay of national pharmacovigilance centers throughout the world for several decades [3, 51] (see Postmarketing Surveillance of New Drugs and Assessment of Risk). When a case report of a suspected drug-associated adverse
event is received, the natural question is whether the drug actually caused the event. Several qualitative approaches have been proposed to answer this question [20, 25]. Recently, however, a formal quantitative approach using biostatistical foundations has been put forward [26–28]. It is based on Bayes' theorem, which can be used in the following way:

Pr(D → E | B, C) = Pr(D → E | B) Pr(C | D → E, B) / Pr(C),
where D → E denotes that the drug D causes the adverse event E, B represents the background characteristics of the case that are known to affect the risk, while C represents the case information. This Bayes’ theorem approach allows us to estimate the posterior probability that an adverse event was caused by a drug by separating the problem into two components. The first component is the prior probability of the event given the baseline characteristics of the patients. This is estimated from existing data obtained from clinical trials or epidemiologic studies. The second component is the probability of case information given that the drug caused the event. The primary limitation of this approach is the scarcity of data available to estimate the two components. In many instances, it is difficult to find the clinical and epidemiologic data necessary to estimate the prior probability, especially for rare clinical conditions that have not been the object of extensive population-based research. The same limitation applies to the second probability component because of the problems of finding cases relevant to proven drug causation. Case series that apply directly to the case being assessed are often difficult to find. These limitations are real but not limited to the Bayesian approach – they are a general problem in the assessment of individual cases. The authors of these methods suggest that the primary purpose of the Bayesian approach is to provide a framework in which subjective judgments relevant to assessment of an individual case are coherently combined. To facilitate the use of the Bayesian method, the equation is usually expressed in terms of odds rather than probabilities. This formulation simplifies somewhat the need for data, since epidemiologic studies more frequently report odds ratios than absolute probabilities. The relative likelihood may also be easier
to estimate subjectively. This equation, formulated in terms of odds, is given as

posterior odds = Pr(D → E | B, C) / Pr(D ↛ E | B, C)
= [Pr(D → E | B) / Pr(D ↛ E | B)] (prior odds)
× [Pr(C | D → E, B) / Pr(C | D ↛ E, B)] (likelihood ratio),

where D ↛ E denotes that the drug did not cause the event. This approach has been applied on several occasions [19, 23, 24, 34] and was recently made user-friendly through computerization [21]. The increasing amount of new epidemiologic data on disease distribution and risk factors, combined with new clinical and pharmacologic insights on drug effects, will make this probabilistic approach more effective in future uses.
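In the odds form, the arithmetic is a one-liner: the posterior odds are the prior odds multiplied by the likelihood ratio. The following sketch uses purely hypothetical inputs.

```python
# Posterior probability of drug causation from prior odds and a likelihood ratio.
def posterior_probability(prior_prob, likelihood_ratio):
    prior_odds = prior_prob / (1 - prior_prob)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Hypothetical: a 20% prior probability of drug causation, and case details
# judged 5 times more likely under drug causation than under an alternative cause.
print(f"{posterior_probability(0.20, 5.0):.2f}")   # 0.56
```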
Methods Based on Prescription Data

One of the distinguishing features of pharmacoepidemiology is the use of computerized administrative health databases to answer research questions reliably and with sufficient rapidity. The usual urgency of concerns related to drug safety makes these databases essential to perform such risk assessment studies. In particular, databases containing only information on prescriptions dispensed to patients, and no outcome information on disease diagnoses, hospitalizations or vital status, have been the object of interesting statistical developments. These standalone prescription drug databases, which do not need to be linked to outcome databases, are more numerous and usually more easily accessible than the fully linked databases. They provide a source of data that allows the investigation of patterns of drug use that can yield some insight into the validity of risk assessment studies as well as generate and test hypotheses about these risks. In this section we briefly review some of the resourceful uses of these drug prescription databases in pharmacoepidemiology. A technique that was developed specifically for the context of drug databases is prescription sequence analysis [36]. Prescription sequence analysis is based on the situation when a certain drug A is suspected of causing an adverse event that itself is treated by a drug B. To apply this technique, the computerized drug database is searched for all patients using drug A. For these subjects, all patients prescribed drug B in the course of using drug A are
identified and counted. Under the null hypothesis that drug A does not cause the adverse event treated by drug B, this number of subjects should be proportional to the duration of use of drug A relative to the total period of observation. The random error of this extremely rapid method of assessing the association between drug A and drug B is evaluated with a Monte Carlo simulation analysis. This technique was applied to assess whether the antivertigo and antimigraine drug flunarizine (drug A) causes mental depression, as measured by the use of antidepressant drugs (drug B). The authors found that the number of patients starting on antidepressant drugs during flunarizine use was in fact lower than expected [36]. They thus concluded, using this rapid approach based solely on drug prescription data, that this drug probably does not cause mental depression. An extension of prescription sequence analysis, called prescription sequence symmetry analysis, was recently proposed [17]. Using a population of new users of either drug A or B, this approach compares the number of subjects who used drug A before drug B with the number who used B before A. Under the null hypothesis, this distribution should be symmetrical and the two numbers should be equal.
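A minimal sketch of such a symmetry comparison follows (hypothetical counts). A simple exact binomial test of the 50:50 null is used here, which is one reasonable choice rather than necessarily the procedure of Reference 17.

```python
# Prescription sequence symmetry analysis: among new users of both drugs,
# compare how many started A before B with how many started B before A.
from scipy.stats import binomtest

n_A_before_B = 120   # hypothetical
n_B_before_A = 80    # hypothetical

sequence_ratio = n_A_before_B / n_B_before_A
test = binomtest(n_A_before_B, n_A_before_B + n_B_before_A, p=0.5)
print(f"sequence ratio = {sequence_ratio:.2f}, P = {test.pvalue:.3f}")
```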
Another function of these databases is to use the prescriptions as covariate information to explain possible confounding patterns. The concept of channeling of drugs was put forward as an explanation of unusual risk findings [48]. For example, a case–control study conducted in New Zealand found that fenoterol, a beta agonist bronchodilator used to treat asthma attacks, was associated with an increased risk of death from asthma [8]. Using a prescription drugs database, it was found that severe asthmatics, as deemed from their use of other asthma medications prescribed for severe forms of the disease, were in fact channeled to fenoterol, probably because fenoterol was felt by prescribers to be a more potent bronchodilator than other beta agonists [35]. This phenomenon of channeling can be assessed rapidly in such databases, provided medications can be used as proxies for disease severity. This approach can be subject to bias, however, as it has been used with cross-sectional designs that cannot differentiate the directionality of the association. An application of channeling using a longitudinal design was recently presented [4]. It indicated that channeling can vary according to the timing of exposure, i.e. that disease severity was not associated with first-time use of a drug, but subsequently severe patients were more likely to be switched to that drug. This type of research into patterns of drug prescribing and drug use can be very useful in understanding the results of case–control studies with limited data on drug exposures and subject to confounding by indication (see Pharmacoepidemiology, Adverse and Beneficial Effects). These prescription drug databases have also been used to study patterns of interchange in the dispensing of NSAIDs. Such research is important because switching patterns permit the identification of brands that may not be well tolerated and result in the prescription of another agent. By using a stochastic approach, Walker et al. [50] estimated the transition probabilities from one NSAID to another. For a set of k different brands of NSAID, they derived the expected marginal distributions of the transition matrix that corresponds to a global equilibrium state by solving a system of k + 1 equations with k + 1 unknowns. By comparing these expected values with the observed marginals of the transition matrix, it was possible to assess whether this population had reached this stable state and for which drugs. Such models can be used rapidly to assess patterns of interchange and to identify potentially harmful agents. Finally, these prescription drug databases may in certain situations provide all the necessary data for a conventional cohort or case–control study. For instance, the use of beta blockers to treat hypertension and other cardiac diseases has been hypothesized to cause depression. A prescription for an antidepressant drug can be used as a proxy for the outcome of depression. In this way, a standalone prescription drug database can provide data on exposure to beta blockers, on the outcome of depression, as well as on covariate information from other medications [2, 7].
Acknowledgments

Samy Suissa is a senior research scholar supported by the Fonds de la Recherche en Santé du Québec (FRSQ). The McGill Pharmacoepidemiology Research Unit is funded by the FRSQ and by grants from the National Health Research and Development Programme (NHRDP) and the Medical Research Council (MRC) of Canada.

References

[1] Aelony, Y., Laks, M.M. & Beall, G. (1975). An electrocardiographic pattern of acute myocardial infarction associated with excessive use of aerosolized isoproterenol, Chest 68, 107–110.
[2] Avorn, J., Everitt, D.E. & Weiss, S. (1986). Increased antidepressant use in patients prescribed beta-blockers, Journal of the American Medical Association 255, 357–360.
[3] Baum, C., Kweder, S.L. & Anello, C. (1994). The spontaneous reporting system in the United States, in Pharmacoepidemiology, 2nd Ed., B.L. Strom, ed. Wiley, New York, pp. 125–138.
[4] Blais, L., Ernst, P. & Suissa, S. (1996). Confounding by indication and channeling over time: the risks of beta-agonists, American Journal of Epidemiology 144, 1161–1169.
[5] Breslow, N. & Day, N.E. (1980). Statistical Methods in Cancer Research, Vol. I: The Analysis of Case-Control Studies. International Agency for Research on Cancer, Lyon.
[6] Breslow, N. & Day, N.E. (1987). Statistical Methods in Cancer Research, Vol. II: The Design and Analysis of Cohort Studies, 2nd Ed. International Agency for Research on Cancer, Lyon.
[7] Bright, R.A. & Everitt, D.E. (1992). Beta-blockers and depression: evidence against an association, Journal of the American Medical Association 267, 1783–1787.
[8] Crane, J., Pearce, N., Flatt, A., Burgess, C., Jackson, R., Kwong, T., Ball, M. & Beasley, R. (1989). Prescribed fenoterol and death from asthma in New Zealand 1981–1983: case-control study, Lancet 1, 917–922.
[9] Efron, B. (1988). Logistic regression, survival analysis, and the Kaplan-Meier curve, Journal of the American Statistical Association 83, 414–425.
[10] Farmer, R.D.T., Lawrenson, R.A., Thompson, C.R., Kennedy, J.G. & Hambleton, I.R. (1997). Population-based study of risk of venous thromboembolism associated with various oral contraceptives, Lancet 349, 83–88.
[11] Farrington, C.P., Nash, J. & Miller, E. (1996). Case series analysis of adverse reactions to vaccines: a comparative evaluation, American Journal of Epidemiology 143, 1165–1173.
[12] Gong, L., Zhang, W., Zhu, Y., Zhu, J., Kong, D., Page, V., Ghadirian, P., Lehorier, J. & Hamet, P. (1996). Shanghai trial of Nifedipine in the elderly (STONE), Journal of Hypertension 19, 1–9.
[13] Greenland, S. (1995). Dose-response and trend analysis in epidemiology: alternatives to categorical analysis, Epidemiology 6, 356–365.
[14] Greenland, S. (1996). Confounding and exposure trends in case-crossover and case-time-control design, Epidemiology 7, 231–239.
[15] Griffin, M.R., Ray, W.A. & Schaffner, W. (1988). Nonsteroidal anti-inflammatory drug use and death from peptic ulcer in elderly persons, Annals of Internal Medicine 109, 359–363.
[16] Guess, H.A. (1989). Behavior of the exposure odds ratio in a case-control study when the hazard function is not constant over time, Journal of Clinical Epidemiology 42, 1179–1184.
[17] Hallas, J. (1996). Evidence of depression provoked by cardiovascular medication – a prescription sequence symmetry analysis, Epidemiology 7, 478–484.
[18] Hill, A.B. (1965). The environment and disease: association or causation?, Proceedings of the Royal Society of Medicine 58, 295–300.
[19] Hutchinson, T.A. (1986). A Bayesian approach to assessment of adverse drug reactions: evaluation of a case of acute renal failure, Drug Information Journal 20, 475–482.
[20] Hutchinson, T.A., Leventhal, J.M., Kramer, M.S., Karch, F.E., Lipman, A.G. & Feinstein, A.R. (1979). An algorithm for the operational assessment of adverse drug reactions. II. Demonstration of reproducibility and validity, Journal of the American Medical Association 242, 633–638.
[21] Hutchinson, T.A., Dawid, A.P., Spiegelhalter, D.J., Cowell, R.G. & Roden, S. (1991). Computer aids for probabilistic assessment of drug safety I: A spread-sheet program, Drug Information Journal 25, 29–39.
[22] Jick, H., Jick, S.S., Gurewich, V., Myers, M.W. & Vasilakis, C. (1995). Risk of idiopathic cardiovascular death and nonfatal venous thromboembolism in women using oral contraceptives with differing progestogen components, Lancet 346, 1589–1593.
[23] Jones, J.K. (1986). Evaluation of a case of Stevens-Johnson syndrome, Drug Information Journal 20, 487–502.
[24] Kramer, M.S. (1986). A Bayesian approach to assessment of adverse drug reactions: evaluation of a case of fatal anaphylaxis, Drug Information Journal 20, 505–518.
[25] Kramer, M.S., Leventhal, J.M., Hutchinson, T.A. & Feinstein, A.R. (1979). An algorithm for the operational assessment of adverse drug reactions. I. Background, description, and instructions for use, Journal of the American Medical Association 242, 623–632.
[26] Lane, D.A. (1984). A probabilist's view of causality assessment, Drug Information Journal 18, 323–330.
[27] Lane, D.A. (1986). The Bayesian approach to causality assessment, Drug Information Journal 20, 455–461.
[28] Lane, D.A., Kramer, M.S., Hutchinson, T.A., Jones, J.K. & Naranjo, C.A. (1987). The causality assessment of adverse drug reactions using a Bayesian approach, Journal of Pharmaceutical Medicine 2, 265–283.
[29] Lenz, W. (1966). Malformations caused by drugs in pregnancy, American Journal of Diseases of Children 112, 99–106.
[30] Maclure, M. (1991). The case-crossover design: a method for studying transient effects on the risk of acute events, American Journal of Epidemiology 133, 144–153.
[31] McBride, W.G. (1961). Thalidomide and congenital abnormalities, Lancet ii, 1358.
[32] Miettinen, O.S. (1983). The need for randomization in the study of intended effects, Statistics in Medicine 2, 267–271.
[33] Miettinen, O.S. (1985). Theoretical Epidemiology: Principles of Occurrence Research in Medicine. Wiley, New York.
[34] Naranjo, C.A., Lanctot, K.L. & Lane, D.A. (1990). The Bayesian differential diagnosis of neutropenia associated with antiarrhythmic agents, Journal of Clinical Pharmacology 30, 1120–1127.
[35] Petri, H. & Urquhart, J. (1991). Channeling bias in the interpretation of drug effects, Statistics in Medicine 10, 577–581.
[36] Petri, H., De Vet, H.C.W., Naus, J. & Urquhart, J. (1988). Prescription sequence analysis: a new and fast method for assessing certain adverse reactions of prescription drugs in large populations, Statistics in Medicine 7, 1171–1175.
[37] Ray, W.A. & Griffin, M.R. (1989). Use of Medicaid data for pharmacoepidemiology, American Journal of Epidemiology 129, 837–849.
[38] Ray, W.A., Fought, R.L. & Decker, M.D. (1992). Psychoactive drugs and the risk of injurious motor vehicle crashes in elderly drivers, American Journal of Epidemiology 136, 873–883.
[39] Rothman, K.J. (1986). Modern Epidemiology. Little, Brown, & Company, Boston.
[40] Spitzer, W.O., Suissa, S., Ernst, P., Horwitz, R.I., Habbick, B., Cockcroft, D., Boivin, J.F., McNutt, M., Buist, A.S. & Rebuck, A.S. (1992). The use of beta-agonists and the risk of death and near death from asthma, New England Journal of Medicine 326, 501–506.
[41] Spitzer, W.O., Lewis, M.A., Heinemann, L.A., Thorogood, M. & MacRae, K.D. (1996). Third generation oral contraceptives and risk of venous thromboembolic disorders: an international case-control study. Transnational Research Group on Oral Contraceptives and the Health of Young Women, British Medical Journal 312, 83–88.
[42] StatSci (1995). S-PLUS Version 3.3. StatSci, a division of MathSoft, Inc., Seattle.
[43] Sturkenboom, M.C., Middelbeek, A., de Jong, L.T., van den Berg, P.B., Stricker, B.H. & Wesseling, H. (1995). Vulvo-vaginal candidiasis associated with acitretin, Journal of Clinical Epidemiology 48, 991–997.
[44] Suissa, S. (1995). The case-time-control design, Epidemiology 6, 248–253.
[45] Suissa, S. & Edwardes, M. (1997). Adjusted odds ratios for case-control studies with missing confounder data in controls, Epidemiology 8, 275–280.
[46] Suissa, S., Hemmelgarn, B., Blais, L. & Ernst, P. (1996). Bronchodilators and acute cardiac death, Journal of Respiratory and Critical Care Medicine 154, 1598–1602.
[47] Suissa, S., Blais, L., Spitzer, W.O., Cusson, J., Lewis, M. & Heinemann, L. (1997). First-time use of newer oral contraceptives and the risk of venous thromboembolism, Contraception 56, 141–146.
[48] Urquhart, J. (1989). ADR crisis management, Scrip 1388, 19–21.
[49] Walker, A.M. (1996). Confounding by indication, Epidemiology 7, 335–336.
[50] Walker, A.M., Chan, K.W.A. & Yood, R.A. (1992). Patterns of interchange in the dispensing of non-steroidal anti-inflammatory drugs, American Journal of Epidemiology 45, 187–195.
[51] Wiholm, B.E., Olsson, S., Moore, N. & Wood, S. (1994). Spontaneous reporting systems outside the United States, in Pharmacoepidemiology, 2nd Ed., B.L. Strom, ed. Wiley, New York, pp. 139–156.
[52] World Health Organization Collaborative Study of Cardiovascular Disease and Steroid Hormone Contraception (1995). Venous thromboembolic disease and combined oral contraceptives: results of international multicentre case-control study. World Health Organization Collaborative Study of Cardiovascular Disease and Steroid Hormone Contraception, Lancet 346, 1575–1582.
(See also Drug Approval and Regulation; Pharmacoepidemiology, Study Designs)

S. SUISSA
PHARMACOVIGILANCE
PATRICK WALLER
Southampton, United Kingdom

STEPHEN EVANS
London School of Hygiene and Tropical Medicine, United Kingdom

1 DEFINITION AND SCOPE

The term "pharmacovigilance" originated in France and became more widely used during the 1990s. Various definitions have been proposed (1, 2), but there is no generally accepted version, which probably reflects differing perspectives on the scope of this activity. The common ground is that pharmacovigilance has a safety purpose in relation to the use of medications and involves patients, health professionals, pharmaceutical companies, and regulatory authorities. At its most narrow, pharmacovigilance may be equated with spontaneous reporting of suspected adverse reactions for marketed medicines. However, broader approaches are gaining acceptance in which pharmacovigilance is viewed as the whole process of clinical risk management in relation to medications. The point at which medications are authorized for use in the general population may be considered as the starting point. However, this boundary is increasingly becoming blurred by recognition that the clinical development process for safety needs to be seamless, and that there is a need to plan post-authorization pharmacovigilance activities based on a clear delineation of what is already known, and what is not known, about safety. For the purpose of this article, we will consider pharmacovigilance to embrace all safety-related activity beyond the point at which humans are first exposed to a medicinal drug. For many trialists, pharmacovigilance is seen as the process of reporting SUSARs (serious, unexpected, suspected adverse reactions), although this is neither the only nor the most useful aspect of pharmacovigilance.

2 PURPOSES OF PHARMACOVIGILANCE

The ultimate purpose of pharmacovigilance is to minimize, in practice, the potential for harm that is associated with all active medicines. Although data about all types of adverse drug reactions (ADRs) are collected, the main focus is on preventing those that are defined to be serious, i.e., fatal or life-threatening or causing hospitalization or long-term disability. Pharmacovigilance may be seen as a public health function in which reductions in harm are achieved by measures that promote the safest possible use of medications and/or provide specific safeguards against known hazards (e.g., monitoring white blood cell counts to detect agranulocytosis in users of clozapine). To minimize harm, there is first a need to identify and assess the impact of unexpected potential hazards. For most medications, serious ADRs are rare; otherwise their detection could result in the drug not reaching or being withdrawn from the market. For products that do reach the market, almost by definition, serious hazards are seldom identified during clinical trials because sample sizes are almost invariably too small to detect hazards occurring at a frequency of 1 in 1000 patients or less. In addition, the prevailing conditions of clinical trials (selected patients, short durations of treatment, close monitoring, and specialist supervision) almost invariably mean that they will underestimate the frequency of ADRs relative to that occurring in ordinary practice. Thus, the need for post-marketing safety monitoring is well recognized, even though the observational methods used provide a lower standard of evidence for causality than that derived from clinical trials.

During pre-marketing clinical development, the aims of pharmacovigilance are different from the broad public health function described above. In clinical trials, there is a need to protect the patients being exposed, often in a context of initial uncertainty as to whether an individual has been randomized to the treatment in question. There is also a need to gather information on harms that occur in order to make a provisional assessment of safety and to plan for post-marketing safety development.

3 HISTORICAL BACKGROUND
The seminal event that led to the development of modern pharmacovigilance was the marketing of thalidomide in the late 1950s. A teratogen that affects limb development and classically produces phocomelia (3), thalidomide was used widely by pregnant women for the treatment of morning sickness in many countries (but not in the United States) without having first been subjected to reproductive toxicologic screening, without regulatory review of its safety, and without any mechanism for identifying unexpected hazards being in place. About 10,000 fetuses were harmed, and thalidomide remained available for several years before the association was recognized. Aside from the development of statutory drug regulation across the developed world, the thalidomide disaster led to development of spontaneous ADR reporting schemes (4) (see below), based on the notion that collating the suspicions of clinicians about ADRs would lead to ‘‘early warnings’’ of unexpected toxicity. By 1968, several countries had established such schemes and their value was becoming apparent, leading the World Health Organization to establish an international drug monitoring program (5). In the 1970s, another drug safety disaster occurred, the multisystem disorder known as the oculomucocutaneous syndrome caused by practolol (a beta-blocker used to treat angina and hypertension) (6). This disaster led to the recognition that spontaneous ADR reporting alone is insufficient as a means of studying post-marketing safety. In the late 1970s, various schemes designed to closely monitor the introduction of new drugs were suggested but not widely implemented (7). The 1980s saw the development of prescription-event monitoring (8), an observational cohort method in which patients are identified from prescriptions of a particular drug, with data on subsequent adverse events being obtained from prescribing doctors. Also, pharmaceutical companies started to conduct postmarketing surveillance studies, but the value
of such studies was often limited (9). The scientific discipline of pharmacoepidemiology developed strongly during the 1990s with the increasing use of computerized databases containing records of prescriptions and clinical outcomes for rapid and efficient study of potential hazards, which have often been identified through spontaneous ADR reporting. In some instances, prescriptions are in a separate database to clinical events and linkage between the two databases needs to be achieved through some common identifier in the two sets of data in order to study adverse events at an individual level. Confidentiality legislation has made this type of record linkage more difficult (10). Since the late 1980s, collaboration between the pharmaceutical industry and regulators under the auspices of the Council for the International Organizations of Medical Sciences (CIOMS) has, to date, resulted in six published reports (11–16) proposing international standards for, inter alia, spontaneous ADR reporting, periodic safety update reports (PSURs), core clinical safety information, risk–benefit evaluation, and the management of safety information from clinical trials. A seventh CIOMS working group is in progress that will make proposals for developmental safety update reports. Recent years have also seen an increased regulatory focus on pharmacovigilance within many countries. Relevant standards for electronic ADR reporting and PSURs have been accepted through the International Conference on Harmonisation (ICH) (which involves the United States, the European Union, and Japan) (17), and further ICH guidelines are in preparation. During the 1990s, fuelled by electronic communication, international collaboration in pharmacovigilance increased markedly. However, for reasons that are not fully understood, divergent decisions around the world on the safety of particular medications have continued to be taken (18). In part this lack of consensus is probably because subjective judgments have to be made in the presence of uncertainty about the balance of benefits and risks.
4 METHODOLOGIES
4.1 Spontaneous ADR Reporting

Spontaneous ADR reporting is a means of centrally collating clinical suspicions that, in individual cases, a drug may have been responsible for an adverse event. Its primary purpose is to detect "signals" of unrecognized drug toxicity. In general, such signals then require confirmation or refutation using other sources of data. Spontaneous reporting systems are in operation in many countries around the world with some variations. Until recently, reporting has, in most countries, been restricted to health professionals, but reporting by patients is increasingly being accepted, although its value remains uncertain. To be considered valid, a report must originate from an identifiable reporter, relate to an individual patient identifiable to the reporter, and contain, as minimum information, a drug and a suspected reaction. Information about temporal relationships between drug and reaction, concomitant medication, other diseases, and outcome is usually also requested. Patient-specific information on age, weight, and gender is also useful for interpretation, but concerns over privacy can lead to this information being omitted. Important or incomplete reports may need to be pursued with the reporter to obtain missing or more detailed clinical and follow-up information.

Using standard dictionaries for coding, in particular MedDRA, which is now the international standard medical terminology (19), information from spontaneous ADR reports is entered onto a database that is then subjected to regular screening for new signals. Recent years have seen major advances in the use of statistical approaches to provide indications as to whether the number of reports of a particular drug/reaction combination is in excess of expectations (20). These approaches do not require drug exposure data (e.g., numbers of prescriptions) and are analogous to using a proportional mortality ratio in population epidemiology when accurate denominators are unavailable. The basic idea is to compare the proportion of all reports that relate to a specific medical term between a particular drug and all other drugs in a database. There are several such methods of testing for "disproportionality," and they yield broadly similar results (21). The database used may be one held by a national regulatory authority, the database of a pharmaceutical company, or the worldwide database held by the World Health Organization (5).

There are extensive regulatory requirements around the world relating to spontaneous ADR reporting (22). Reporting is mostly voluntary (and unpaid) for health professionals (mandatory reporting being unenforceable and therefore unlikely to lead to a substantial increase in reporting rates) but a statutory requirement for pharmaceutical companies. A general rule is that serious (in the regulatory definition) suspected ADRs must be reported by companies to regulatory authorities within 15 calendar days of receipt.

It is generally accepted that spontaneous ADR reporting is an effective means of detecting signals and the cornerstone of pharmacovigilance. Nevertheless, both false positives and false negatives occur fairly frequently, and there are many well-recognized limitations. In particular, substantial underreporting is the norm; causality cannot usually be inferred, and there is great potential for inappropriate interpretation of the data.
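As a concrete illustration of the disproportionality idea described above, the sketch below computes a proportional reporting ratio (PRR) from a 2x2 table of report counts. The drug name, reaction, and counts are hypothetical, and the PRR with its approximate confidence interval is only one of several broadly similar disproportionality measures; this is a minimal sketch, not the screening method of any particular agency or database.

import math

def prr(a, b, c, d):
    """Proportional reporting ratio for a drug/reaction pair.

    a: reports of the drug of interest with the reaction of interest
    b: reports of the drug of interest with any other reaction
    c: reports of all other drugs with the reaction of interest
    d: reports of all other drugs with any other reaction
    Returns the PRR and an approximate 95% confidence interval
    (log-normal approximation).
    """
    p_drug = a / (a + b)      # proportion of the drug's reports naming the reaction
    p_other = c / (c + d)     # same proportion among all other drugs in the database
    ratio = p_drug / p_other
    se_log = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(ratio) - 1.96 * se_log)
    upper = math.exp(math.log(ratio) + 1.96 * se_log)
    return ratio, (lower, upper)

# Hypothetical counts: 40 reports of "drug X" with hepatitis and 1960 with other
# reactions; 800 hepatitis reports and 398,000 other reports for all other drugs.
print(prr(40, 1960, 800, 398000))   # a PRR well above 1 suggests a signal worth investigating

A raised PRR is only a prompt for clinical review of the underlying case reports; as noted above, causality cannot be inferred from the counts alone.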
4.2 Randomized Controlled Trials

Generally, the reporting of adverse effects in clinical trials is deficient in quality, and various suggestions have been made to improve it (23). Prior to submission of a formal report, there is a need to be vigilant about harms during the conduct of a trial. If the product is new, with little known of its safety, and if other trials are ongoing, it can be very important to communicate information about any new risks. In extreme cases, there may be a need to stop all trials if a new and serious hazard is found.

Within the context of randomized, blinded clinical trials, there are particular difficulties in conducting pharmacovigilance relating to uncertainty about drug exposure. There may be a need to unblind individual cases with serious adverse events in order to manage patients, make an evaluation of the safety of the drug, and meet regulatory requirements. However, tension exists with the need to ensure that the integrity of the trial is not compromised by unblinding. Most major trials use a data monitoring committee to oversee all aspects of safety, and when this is so, it is best placed to examine unblinded data. The general principle must be that the safety of patients in the trial is paramount. Unblinding should be kept to a minimum, and when it is necessary, there should be clear separation between personnel with knowledge of exposure and those involved with the ongoing conduct of the trial. An EU guidance issued in April 2004 relating to the European Union Directive on Clinical Trials (2001/20/EC) indicates that sponsors should break treatment codes when an unexpected serious suspected adverse reaction occurs. This action seems unfortunate, and it would have been better to recommend that the sponsor delegate this activity to an independent committee.

Current legislation seems to ignore two further aspects. First, a randomized controlled trial (RCT) is the best source for deciding on causality, and individual cases are not usually clearly attributable to a drug under trial. There are exceptions to this, and the TGN1412 tragedy (24) underlines the possibility that individual cases can be unequivocally regarded as caused by a drug. Generally, it is the final summary of data from the RCT, with appropriate statistical analysis, that allows stronger inference on causality for adverse effects that can occur in the absence of a drug. The second aspect is a neglect of the proportionality of the reporting requirements to the types of risk encountered in the trial. Again, the TGN1412 trial illustrates this point. Here was a new, innovative biotech product about which nothing was known in terms of its effect in humans; risks were high. By contrast, the megatrials of aspirin in prevention of further heart attacks were studying a product for which there were years of knowledge and experience about its safety and hazards; risks were low. The legislation does not acknowledge such differences, and processes are the same in both types of trial. Although the term "unexpected" may be used to minimize the reporting of individual events in such trials, there is a danger that the regulatory processes do not take into account the different proportionality of risk to alter the amount of reporting. The interpretation of "unexpected" is not always clear. The CIOMS VI report (16) attempts to reduce the requirements for expedited reporting to a wide range of people and suggests that DSMBs/DMCs are used in lieu of expedited reporting where appropriate. Unfortunately, regulatory authorities and companies are currently inconsistent in their interpretation of even the EU Directive (see the CIOMS VI report), and where other regulations are in force, it is even more difficult.

4.3 Epidemiological Studies

Observational epidemiologic research methods may be used to detect drug toxicity, but increasingly they have become established as a means of testing hypotheses and providing evidence to support causation, measure incidence, and identify risk factors. Both cohort and case-control study designs may be used. The nested case-control study, in which cases and controls are drawn from a defined cohort, usually within a database, combines positive aspects of both approaches and is particularly valuable. The "self-controlled case series" method is increasing in importance in assessing the safety of medications, particularly vaccines (25). This method depends on obtaining all cases of a disease, independently of the putative exposure, and uses these cases as their own controls. Nowadays it is relatively unusual for pharmacoepidemiologic research to be conducted on datasets gathered specifically for this purpose. Many databases around the world contain records of prescriptions and medical outcomes or enable linkage of these parameters in individual patients (26). It is generally accepted that information from such a database alone is frequently insufficient and that access to detailed medical records is also needed to support the validity of such research, although this process may lead to bias (27).
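To make the self-controlled case series idea more concrete, the sketch below estimates a relative incidence (the rate of events inside a defined risk window relative to the rest of each case's observation time) from hypothetical data, using the simplest possible form of the method: one risk window per person and no age adjustment. The case data, variable names, and grid-search fit are illustrative assumptions only; real analyses, such as those described by Whitaker et al. (25), allow for age effects and multiple risk periods.

import numpy as np

# Hypothetical cases: for each person, event counts and person-time (days)
# inside a post-exposure risk window and outside it.
events_risk = np.array([2, 1, 0, 1, 3, 0, 1])   # events in the risk window
events_base = np.array([1, 2, 1, 0, 1, 2, 1])   # events in the rest of follow-up
time_risk   = np.array([42.0] * 7)              # days in the risk window
time_base   = np.array([323.0] * 7)             # days outside the risk window

def neg_log_lik(log_rho):
    """Conditional log-likelihood of the relative incidence rho: given a
    case's total number of events, the number falling in the risk window is
    binomial with probability rho*t_risk / (rho*t_risk + t_base)."""
    rho = np.exp(log_rho)
    p = rho * time_risk / (rho * time_risk + time_base)
    return -np.sum(events_risk * np.log(p) + events_base * np.log(1.0 - p))

# A simple grid search over log(rho) is adequate for this one-parameter sketch.
grid = np.linspace(np.log(0.1), np.log(50.0), 2000)
rho_hat = np.exp(grid[np.argmin([neg_log_lik(g) for g in grid])])
print(f"estimated relative incidence: {rho_hat:.2f}")

Because each case acts as its own control, fixed confounders such as sex or chronic disease severity cancel out of the conditional likelihood, which is the main appeal of the design.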
4.4 Other Methods

The model whereby signals are detected from spontaneous ADR reports and then investigated using epidemiologic methods is, in practice, too simplistic, and in any phase of clinical development, other types of data may be needed to investigate and successfully manage a particular pharmacovigilance issue. These data include preclinical studies, mechanistic studies, clinical trials, and meta-analysis. The latter has tended to be underused in the context of safety and pharmacovigilance, and its use in the context of observational data is controversial (28). Good systematic reviews, leading to a meta-analysis, can be useful because of the rarity of adverse events, although the reporting of them needs to be improved (23).

5 SYSTEMS AND PROCESSES
As indicated, the overall process of pharmacovigilance is one of monitoring and managing risks. This cycle involves the identification and investigation of potential hazards, re-evaluation of the risk-benefit balance, tailored actions designed to minimise risk, and communication of the necessary safeguards to users (Fig. 1) (29). All aspects of this cycle are undertaken by both pharmaceutical companies and regulatory authorities within an increasingly complex framework of legislation and guidelines. Since the mid-1990s, periodic safety update reporting (12) by companies has become an accepted part of the process although its value is not yet clear. Evaluation of safety data in the postmarketing setting may also be hampered by uncertainty about drug exposure, but mostly the limitations relate to details such as dose, duration of treatment, and concomitant medication, which are often missing. Frequently there is also uncertainty about the validity of the clinical diagnosis of individual
adverse events. Observational data, whether derived from case reports or epidemiologic studies, require particularly careful interpretation with regard to key factors such as causality and frequency. Chance, bias, and confounding almost invariably need to be considered as alternative explanations to the causal hypothesis. There are many methods for assessing cause and effect from single cases (30), but in the context of multiple data sources, the criteria proposed by Bradford-Hill more than 35 years ago (31), in the context of smoking and lung cancer, are still generally accepted as valid. These criteria are strength, consistency, specificity, temporality, biological gradient, biological plausibility, coherence, experimental evidence, and analogy. With regard to biological plausibility, when present, this may be strongly supportive of a causal association, but the converse is not true of its absence because, almost inevitably, many newly identified ADRs cannot be anticipated from prior knowledge.

Once safety has been evaluated, there is a need to reassess the balance of benefit and risk, a complex judgmental process for which there is an established framework (14). In terms of actions to prevent ADRs, there is increasingly a focus on developing specific programs designed to minimize risks (examples include hematological monitoring with clozapine and pregnancy prevention with isotretinoin) (32, 33). When new hazards are identified, regulators and manufacturers act to amend product information by adding warnings and appropriate safeguards. Dosage changes after marketing are often introduced. If the balance of risk and benefit for a product can no longer be considered positive, there is a need to remove it from the market. Overall, about 3% of marketed drugs are prematurely withdrawn from the market for safety reasons (34).

Any significant action relating to safety will require communication of the hazard and the proposed measures to users (prescribers and/or patients). It has been increasingly recognized that this is an important aspect of pharmacovigilance that could be improved (35). Standard vehicles for communication include letters to health professionals and drug safety bulletins, but recently use has been made of electronic communication and the Internet. The timing of such communications often represents a considerable challenge in terms of ensuring that health professionals are in a position to advise patients as soon as information becomes available in the lay media. We have no mechanism for communicating information to health professionals that does not, within a matter of hours if not minutes, also reach the lay media if the issue is of wide interest. The other major challenge is achieving a balance between promoting the necessary behavioral change to safeguard patients and not provoking unwarranted scare stories in the media that may result in some patients inappropriately stopping treatment.

Figure 1. Pharmacovigilance: the process of risk management (cycle labels: identify, investigate, evaluate, act, inform, manage).

6 LIMITATIONS OF PHARMACOVIGILANCE
Since the thalidomide disaster, more than 40 years ago, considerable strides have been
made in all aspects of the risk management process for medications. The drugs in use now are quite different, many new classes of medication having been developed and proven to be acceptably safe in practice. Nevertheless, there is little doubt that adverse drug reactions remain an important cause of morbidity and mortality in the developed world (36, 37). Much of this harm remains potentially preventable, and our current inability to prevent it is likely to directly reflect the limitations of existing systems. However, methods for assessing the effects of interventions and safeguards are currently primitive, and until this is rectified, it is likely to hamper process improvement.

Most pharmacovigilance issues are based on data drawn from a low level in the evidence hierarchy and are therefore amenable to considerable debate and controversy. Although there will always be a need for observational data in pharmacovigilance, it should be possible to gain better evidence through greater use of large simple trials and meta-analysis. Once a decision has been made to take action to warn of a hazard and promote safe use, the tools presently available to minimize risk seem imperfect. Certainly they do not achieve the goal of maximal prevention of serious ADRs. Some of this inadequacy is likely to relate to the unavailability of the necessary information when it is needed and could be counteracted by development of intelligent decision support systems for prescribing that focus on safety.

Figure 2. Future model for pharmacovigilance (elements: best evidence; robust scientific decision-making; tools for protecting public health; outcome measures and audit; culture of scientific development; measurable performance in terms of public health benefit).

Table 1. Key Points Underpinning Future Model for Pharmacovigilance
• Pharmacovigilance should be less focused on finding harm and more on extending knowledge of safety.
• There should be a clear starting point or "specification" of what is already known at the time of licensing a medicine and what is required to extend safety knowledge post-authorization.
• Complex risk–benefit decisions are amenable to, and likely to be improved by, the use of formal decision analysis.
• A new approach to provision of safety information is needed that allows greater flexibility in presenting key messages, based on multiple levels of information with access determined by user requirements.
• Flexible decision support is the most likely means of changing the behavior of health professionals in order to promote safer use of medications.
• There is a need to put in place outcome measures that indicate the success or failure of the process. These should include hard end-points indicating the impact on mortality and morbidity. Surrogates, such as the impact on prescribing of medications, are more readily available and are potentially valuable.
• Systematic audit of pharmacovigilance processes and outcomes should be developed and implemented based on agreed standards ("good pharmacovigilance practice").
• Pharmacovigilance should operate in a culture of scientific development. This requires the right balance of inputs from various disciplines, a stronger academic base, greater availability of basic training, and resources dedicated to scientific strategy.

7 FUTURE DEVELOPMENTS
Developments in molecular biology and genetics are expected to have a considerable impact on pharmacovigilance within this decade. The nature and usage of new active substances is changing substantially (with mostly niche biological products becoming available), providing new challenges for those involved in monitoring safety. Genetic markers may predict the safety of many drugs (38), with potential implications for their practical use. Adverse drug reactions could be preventable through recording of personal pharmacogenetic profiles and development of guidance recommending use or avoidance of drugs, or tailored dosage regimens, for patients with specific genotypes. To address the limitations described above, a model for the future conduct of pharmacovigilance was proposed in 2003 (39) (Fig. 2). The key points underpinning this model are summarized in Table 1. A proposal for a document summarizing safety at the time of authorization allied to a
plan for extending safety knowledge post-authorization has already been developed as ICH guideline E2E and implemented into European legislation (40). It is important, however, that regulatory requirements are not simply increased but rather that they are refocused toward activities that have measurable public health gain. In the future, further harmonization of regulatory activities, including shared international databases and processes for common decisions, may be possible. To know that such developments represent true progress toward safer medications, it is vital that outcome measures are developed that measure the success or failure of these processes.
REFERENCES

1. R. D. Mann and E. B. Andrews (eds.), Pharmacovigilance: Introduction. Chichester: Wiley, 2002, pp. 3–9.
2. B. Bégaud, Dictionary of Pharmacoepidemiology. Chichester: Wiley, 2000.
3. T. Stephens and R. Brunner, Dark Remedy: The Impact of Thalidomide and Its Revival as a Vital Medicine. New York: Perseus Books, 2002.
4. M. D. Rawlins, Spontaneous reporting of adverse drug reactions. Br. J. Clin. Pharmacol. 1988; 26: 1–11.
5. S. Olsson, The role of the WHO programme on international drug monitoring in coordinating worldwide drug safety efforts. Drug Safety 1998; 19: 1–10.
6. P. Wright, Untoward effects associated with practolol administration: oculomucocutaneous syndrome. BMJ 1975; i: 595–598.
7. A. B. Wilson, New surveillance schemes in the United Kingdom. In: W. H. W. Inman (ed.), Monitoring for Drug Safety. Lancaster, United Kingdom: MTP Press, 1986.
8. S. A. W. Shakir, PEM in the UK. In: R. D. Mann and E. B. Andrews (eds.), Pharmacovigilance. Chichester: Wiley, 2002, pp. 333–344.
9. P. C. Waller, S. M. Wood, M. J. S. Langman, A. M. Breckenridge, and M. D. Rawlins, Review of company postmarketing surveillance studies. BMJ 1992; 304: 1470–1472.
10. T. Walley, Using personal health information in medical research. BMJ 2006; 332: 130–131.
11. CIOMS Working Group I, International Reporting of Adverse Drug Reactions. Geneva: CIOMS, 1990.
12. CIOMS Working Group II, International Reporting of Periodic Drug-Safety Update Summaries. Geneva: CIOMS, 1992.
13. CIOMS Working Group III, Guidelines for Preparing Core Clinical Safety Information on Drugs. Geneva: CIOMS, 1995.
14. CIOMS Working Group IV, Benefit-Risk Balance for Marketed Drugs: Evaluating Safety Signals. Geneva: CIOMS, 1998.
15. CIOMS Working Group V, Current Challenges in Pharmacovigilance: Pragmatic Approaches. Geneva: CIOMS, 2001.
16. CIOMS Working Group VI, Management of Safety Information from Clinical Trials. Geneva: CIOMS, 2005.
17. P. F. D'Arcy and D. W. G. Harron (eds.), Proceedings of the Third International Conference on Harmonisation, Yokohama 1995. Belfast: Queen's University, 1996.
18. M. Fung, A. Thornton, K. Mybeck, J. H. Wu, K. Hornbuckle, and E. Muniz, Evaluation of the characteristics of safety withdrawal of prescription drugs from worldwide pharmaceutical markets, 1960 to 1999. Drug Inform. J. 2001; 35: 293–317.
19. E. G. Brown, Dictionaries and coding in pharmacovigilance. In: J. C. C. Talbot and P. C. Waller (eds.), Stephens' Detection of New Adverse Drug Reactions, 5th ed. Chichester: Wiley, 2003.
20. S. J. W. Evans, Statistics: analysis and presentation of safety data. In: J. C. C. Talbot and P. C. Waller (eds.), Stephens' Detection of New Adverse Drug Reactions, 5th ed. Chichester: Wiley, 2003.
21. E. P. van Puijenbroek, A. Bate, H. G. M. Leufkens, M. Lindquist, R. Orre, and A. G. Egberts, A comparison of measures of disproportionality for signal detection in spontaneous reporting systems for adverse drug reactions. Pharmacoepidemiol. Drug Safety 2002; 11: 3–10.
22. B. Arnold, Global ADR Reporting Requirements. Richmond, VA: Scrip Reports, PJB Publications Ltd., 1997.
23. J. P. Ioannidis, S. J. Evans, P. C. Gotzsche, R. T. O'Neill, D. G. Altman, K. Schulz, and D. Moher, with the CONSORT Group, Better reporting of harms in randomized trials: an extension of the CONSORT statement. Ann. Intern. Med. 2004; 141: 781–788.
24. M. Day, Agency criticises drug trial. BMJ 2006; 332: 1290.
25. H. J. Whitaker, C. P. Farrington, and P. Musonda, Tutorial in biostatistics: the self-controlled case series method. Stat. Med. 2006; 25: 1768–1797.
26. B. L. Strom, How should one perform pharmacoepidemiology studies? Choosing among the available alternatives. In: B. L. Strom (ed.), Pharmacoepidemiology, 3rd ed. Chichester: Wiley, 2000, pp. 401–413.
27. J. M. M. Evans and T. M. MacDonald, Misclassification and selection bias in case-control studies using an automated database. Pharmacoepidemiol. Drug Saf. 1997; 6: 313–318.
28. R. Temple, Meta-analysis and epidemiologic studies in drug development and postmarketing surveillance. JAMA 1999; 281: 841–844.
29. P. C. Waller and H. Tilson, Managing drug safety issues with marketed products. In: J. C. C. Talbot and P. C. Waller (eds.), Stephens' Detection of New Adverse Drug Reactions, 5th ed. Chichester: Wiley, 2003.
30. S. A. W. Shakir, Causality and correlation in pharmacovigilance. In: J. C. C. Talbot and P. C. Waller (eds.), Stephens' Detection of New Adverse Drug Reactions, 5th ed. Chichester: Wiley, 2003.
31. A. Bradford-Hill, The environment and disease: association or causation? Proc. Roy. Soc. Med. 1965; 58: 295–300.
32. J. Munro, D. O'Sullivan, C. Andrews, A. Arana, A. Mortimer, and R. Kerwin, Active monitoring of 12,760 clozapine recipients in the UK and Ireland: beyond pharmacovigilance. Br. J. Psychiatry 1999; 175: 576–580.
33. S. E. Perlman, E. E. Leach, L. Dominguez, A. M. Ruszkowski, and S. J. Rudy, Be smart, be safe, be sure: the revised pregnancy prevention program for women on isotretinoin. J. Reprod. Med. 2001; 46(2 Suppl): 179–185.
34. K. E. Lasser, P. D. Allen, S. J. Woolhandler, D. U. Himmelstein, S. M. Wolfe, and D. H. Bor, Timing of new black box warnings and withdrawals for prescription medications. JAMA 2002; 287: 2215–2220.
35. The Uppsala Monitoring Centre, Effective Communications in Pharmacovigilance: The Erice Report. Uppsala, Sweden: Uppsala Monitoring Centre, 1998.
36. J. Lazarou, B. H. Pomeranz, and P. N. Corey, Incidence of adverse drug reactions in hospitalized patients. JAMA 1998; 279: 1200–1205.
37. M. Pirmohamed, S. James, S. Meakin, C. Green, A. K. Scott, T. J. Walley, K. Farrar, B. K. Park, and A. M. Breckenridge, Adverse drug reactions as cause of admission to hospital: prospective analysis of 18 820 patients. BMJ 2004; 329: 15–19.
38. V. Ozdemir, N. H. Shear, and W. Kalow, What will be the role of pharmacogenetics in evaluating drug safety and minimising adverse effects? Drug Safety 2001; 24: 75–85.
39. P. C. Waller and S. J. W. Evans, A model for the future conduct of pharmacovigilance. Pharmacoepidemiol. Drug Safety 2003; 12: 17–29.
40. ICH guideline E2E, Pharmacovigilance Planning, 2004. Available at: http://www.ich.org/LOB/media/MEDIA1195.pdf (accessed July 6, 2006).
PHASE 2/3 TRIALS
QING LIU
Johnson and Johnson Pharmaceutical Research and Development, L.L.C., Raritan, New Jersey

It is well known that in the past decade, the cost of conducting clinical trials has increased exponentially, whereas the success rate for phase 3 development has dropped to an all-time low of about 50%. Part of the reason for such a low success rate is that drug companies are facing increasing time pressures to meet patient demands or to satisfy performance goals set by shareholders. Because phase 2 (dose-response) trials are often perceived to delay the timeline, many clinical development programs either omit phase 2 trials or conduct them in parallel with phase 3 trials. Traditionally, phase 2 trials have served the purpose of informing the early halt of a clinical development program or providing critical information for moving the clinical development forward to large-scale phase 3 trials. Without data from phase 2 trials, larger phase 3 clinical programs are at an increased risk for attrition related to problems of lack of efficacy or unacceptable safety concerns; in addition, phase 3 designs and statistical analyses (e.g., choice of dose or hypothesis testing strategy for handling multiplicity) are often suboptimal.

A potential solution to these problems, while still limiting the overall development time and cost, is to incorporate the phase 2 component directly into a phase 3 trial, so that needed learning and redesigning take place while all relevant data contribute to integrated efficacy and safety analyses. Designs that formalize this approach to maintain scientific and statistical rigor are known as phase 2/3 combination designs, or phase 2/3 designs herein. Typically, the adjective "seamless" is added to the description to emphasize the operational feature that transitioning from phase 2 to phase 3 is seamless with respect to patient enrollment; this aspect is described in more detail below.

Typically, a phase 2/3 trial is divided into two stages, which correspond to the phase 2 and phase 3 trials of a traditional clinical development program. The first-stage data, which are available through a planned interim analysis, are used to stop a trial for lack of efficacy or an unacceptable safety profile, or to redesign stage 2 of the trial. Potential redesign decisions include the selection of treatment arms to carry forward and adjustment of the sample size necessary to ensure adequate data for overall benefit–risk evaluations. At the end of the trial, data from both stages are combined for an integrated analysis of efficacy and safety. Operationally, patient enrollment continues seamlessly (i.e., without any delay) after patients in the first stage are enrolled, without having to wait for the interim analysis. This process eliminates the gap in time between the phase 2 and phase 3 trials of the traditional clinical development paradigm. Thus, in comparison, clinical development programs that incorporate phase 2/3 designs may have the potential for bringing optimal treatments to patients earlier and more efficiently. When a lack of efficacy is observed or if unacceptable safety issues develop, the total number of patients enrolled with a phase 2/3 trial can be substantially smaller than the number of a phase 3 trial that is designed without knowledge of phase 2 results.

However, many different methods can be used to set up a phase 2/3 trial, and the choice depends largely on the underlying clinical application. A particular phase 2/3 design that is tailor-made for a limited or similar clinical application is not likely to be applicable to other clinical situations. For this reason, it is not possible to provide general designs for a wide variety of clinical applications in a single article. Instead, we focus on the legal foundations and common issues relevant for all phase 2/3 designs. These issues are discussed in depth in different sections to provide the reader with a clearer idea of how to develop a phase 2/3 design for a particular clinical application.
1 DESCRIPTION AND LEGAL BASIS
To avoid causing more confusion about phase 2/3 designs, it is useful to clarify the
notion of the traditional clinical development paradigm and, in particular, the concepts of phase 2 and phase 3 trials. This description is given in the legal framework of the Code of Federal Regulations (1) of the U.S. Federal Government.

1.1 Descriptions of the Traditional Paradigm

According to Temple (2), a traditional clinical development program typically consists of phase 1, phase 2, and phase 3 trials. Reference 1, part 312.21, provides general descriptions of the three phases:

Phase 2 includes the controlled clinical studies conducted to evaluate the effectiveness of the drug for a particular indication or indications in patients with the disease or condition under study and to determine the common short-term side effects and risks associated with the drug. Phase 2 studies are typically well controlled, closely monitored, and conducted in a relatively small number of patients, usually involving no more than several hundred subjects.

Phase 3 studies are expanded controlled and uncontrolled trials. They are performed after preliminary evidence suggesting effectiveness of the drug has been obtained, and are intended to gather the additional information about effectiveness and safety that is needed to evaluate the overall benefit-risk relationship of the drug and to provide an adequate basis for physician labeling. Phase 3 studies usually include from several hundred to several thousand subjects.
Temple (2) offered further clarification. Phase 2 trials "often include dose response studies" and "are well controlled studies and can be part of substantial evidence." "People have redefined 'phase 2' into '2A and 2B'." Phase 3 trials are "longer-term studies" with "more and long-term safety," "more D/R (dose response)," and a "broader (real life) population." The definition of the term "substantial evidence" is given by the Federal Food, Drug, and Cosmetic Act (3), and it is discussed at greater length at the end of this section.

1.2 Inadequacy of the Traditional Paradigm

The high phase 3 attrition rate reflects two fundamental issues with the traditional paradigm: 1) the designs of phase 3 trials are either deficient to meet the stated objectives
or unnecessarily expose patients to ineffective or potentially unsafe drugs or treatments, and 2) clinical holds under 21CFR 312.42 (1) may not be enforced often enough when "The plan or protocol for the investigation is clearly deficient in design to meet its stated objectives."

Typically, the sample size for a phase 3 design is based solely on power and type I error rate considerations related to primary endpoints. These considerations alone can be deficient for evaluating the overall benefit–risk relationship. Such deficiencies cannot be addressed unless the sample size is reliably calculated on the basis of the totality of adequate efficacy (primary and secondary) and tolerability/safety data, which include laboratory values. Because phase 3 trials are expanded clinical investigations beyond the scope of short-term phase 2 trials, unexpected tolerability/safety issues or lack of efficacy may occur later in the treatment or follow-up periods. Because most phase 3 trials have no provisions for prospective interim analyses, it is difficult to design phase 3 trials adequately a priori to ensure that: a) patients do not unnecessarily receive ineffective, intolerable, or unsafe treatments, and b) the designs meet the ultimate objective of obtaining adequate data for benefit–risk evaluations.

Another deficiency is that many phase 3 trials do not study enough doses. There is no regulatory mandate for phase 3 dose-response information, although the ICH E8 guideline (4) advocates that "Studies in Phase III may also further explore the dose-response relationship." Because of cost and time concerns, one or two doses are often selected for phase 3 as the best-predicted doses from limited and short-term phase 2 trials. This practice has led to numerous failures in later safety studies or high-profile withdrawals of drugs from the market because of emerging safety risks to patients. Not studying enough doses also limits the populations that might experience a drug's therapeutic benefits. In clinical practice, a patient's treatment usually starts and continues with a low dose. Use of lower doses will generally tend to reduce the frequency
of adverse effects. However, a higher dose, with greater potential for side effects, may be justified when the intended efficacy cannot be reached with the low dose. Deciding on the correct dose depends on having adequate information about both benefits and side effects across a range of potential doses.

1.3 Descriptions of Phase 2/3 Designs

Clinical development programs are usually divided into early and late stages. Under the traditional clinical development paradigm, an early clinical development program typically starts with phase 1 safety and tolerability trials, for which healthy volunteers or patients with terminal disease are enrolled. Then a proof-of-concept, or phase 2a, trial is conducted, either in patients with the target disease or in patients in a different disease area. For many clinical applications, the late-stage clinical development program consists of a phase 2b dose-response trial, one or two phase 3 trials, and open-label extension trials. Phase 3b or phase 4 trials are usually conducted after a drug is approved for marketing. A phase 2b trial often has more treatment arms than a phase 3 trial and uses short-term (intermediate) endpoints, as compared with phase 3 clinical endpoints that are defined over a longer treatment and evaluation period.

If a phase 2b trial is elevated to employ the same entry criteria for patient enrollment as a phase 3 trial, then it is possible to combine both phases of late-stage development into a single confirmatory trial. Other conditions are also necessary. To integrate statistical inference, patients in phase 2b must receive the same treatment, follow-up, and clinical and laboratory evaluations as patients in phase 3, even though short-term efficacy and safety evaluations of phase 2b patients are used to redesign the phase 3 part of the trial. The second aspect of phase 2/3 designs concerns the statistical inference needed to integrate phase 2 and phase 3 data for evidentiary evaluation, which is discussed later. The crucial and perhaps feasibility-limiting aspect is logistical planning to ensure seamless patient enrollment, timely data collection, prompt interim analysis and decision making, and
proper implementation of interim decisions. A design that is concerned with the clinical, statistical, and logistical integration of phase 2b and phase 3 trials is called a phase 2/3 design.

A phase 2/3 design provides more long-term benefit–risk data than a clinical development program under the traditional paradigm, because patients in the phase 2 stage are treated and followed for their long-term benefit–risk data, while patients in a traditional phase 2b trial are typically not. Thus, a clinical development program that employs a phase 2/3 design can provide more benefit–risk data with fewer patients than a clinical development program under the traditional paradigm (5). To apply a phase 2/3 design effectively, that is, to limit the number of treatment arms and to reduce unnecessary enrollment of patients in case the trial is stopped for lack of efficacy or safety concerns, it is important that the early clinical development program provides enough safety and efficacy data, and preferably at least limited dose-response data, before engaging in late-stage clinical development. Operationally, the second phase 3 trial, when required, can start after the interim analysis results of the phase 2/3 trial are known.

Phase 2/3 designs are fundamentally different from those adaptive designs that employ only immediate efficacy and safety endpoints. In a phase 2/3 trial, long-term efficacy and safety data from both phases are used for inference, while short-term efficacy and safety data from phase 2b are used for adaptation. Following the descriptions of phase 2 and phase 3 trials in 21CFR 312.21 (1), the "seamless phase 2/3 designs" that employ immediate endpoints should properly be renamed "adaptive phase 2b designs." Clarifying the distinctions and providing proper naming conventions for the two different types of designs are necessary to avoid additional and unnecessary confusion, as the underlying design issues and statistical methodologies are very different. Naturally, clarification of these design distinctions also focuses attention on the majority of phase 3 trials, those requiring long-term treatment and follow-up, for which substantial efforts are needed to develop applicable methodologies.
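The integrated inference mentioned above can be carried out in several ways. One widely used device, shown below as a minimal sketch rather than the specific method of any design discussed in this article, is the weighted inverse-normal combination of stage-wise p-values: with weights fixed in advance (and squaring to one), the one-sided type I error rate is preserved even when the second stage is redesigned at the interim analysis. The stage-wise p-values used in the example are hypothetical.

from math import sqrt
from scipy.stats import norm

def inverse_normal_combination(p1, p2, w1=sqrt(0.5), w2=sqrt(0.5), alpha=0.025):
    """Combine independent stage-wise one-sided p-values with prespecified
    weights (w1**2 + w2**2 must equal 1). Rejecting when the combined z-score
    exceeds the 1 - alpha normal quantile keeps the one-sided type I error at
    alpha, provided the weights were fixed before the interim analysis."""
    z = w1 * norm.isf(p1) + w2 * norm.isf(p2)   # isf(p) = Phi^{-1}(1 - p)
    return z, z > norm.isf(alpha)

# Hypothetical stage-wise results: p1 from the phase 2b stage (selected dose
# versus control), p2 from the phase 3 stage of the same seamless trial.
z, reject = inverse_normal_combination(p1=0.04, p2=0.015)
print(f"combined z = {z:.2f}, reject null hypothesis: {reject}")

When several dose hypotheses are carried into the trial, a combination statistic of this kind is usually embedded in a closed testing procedure so that the familywise error rate is also controlled; that additional layer is omitted here for brevity.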
1.4 Legal Basis for Phase 2/3 Designs

The current laws provide the legal basis for phase 2/3 combination designs. The regulations in 21CFR 312.23 state (1):

In phases 2 and 3, detailed protocols describing all aspects of the study should be submitted. A protocol for a phase 2 or 3 investigation should be designed in such a way that, if the sponsor anticipates that some deviation from the study design may become necessary as the investigation progresses, alternatives or contingencies to provide such deviation are built into the protocols at the outset. For example, a protocol for a controlled short-term study might include a plan for an early crossover of nonresponders to an alternative therapy.

The unpredictable nature of phase 3 trials often leads to deviations. To ensure, as noted above, that patients do not unnecessarily receive ineffective, intolerable, or unsafe treatments, and that the designs meet the ultimate objective of obtaining adequate data for benefit–risk evaluations, 21CFR 312.23 (1) permits contingencies for deviations as long as they are built into the protocols at the outset. In current statistical theory for adaptive designs, these contingencies are known as adaptations for an ongoing trial to deal with unexpected outcomes (or deviations) from prospectively planned interim analyses.

A key design aspect is the sample size, which is often determined on the basis of primary endpoints, whereas other efficacy or safety endpoints are totally ignored. To ensure the sample size is adequate, data-dependent evaluation and modification are necessary according to Section 355 (b)(5)(C) of the Federal Food, Drug, and Cosmetic Act (3):

Any agreement regarding the parameters of the design and size of clinical trials of a new drug under this paragraph that is reached between the Secretary and a sponsor or applicant shall be reduced to writing and made part of the administrative record by the Secretary. Such agreement shall not be changed after the testing begins, except (i) with the written agreement of the sponsor or applicant; or (ii) pursuant to a decision, made in accordance with subparagraph (D) by the director of the reviewing division, that a substantial scientific issue essential to determining the safety or effectiveness of the drug has been identified after the testing has begun.

Per Section 355 (b)(5)(C) (see Reference 3), the trial sample size can be changed for either safety or efficacy reasons alone, or for evaluation of benefit and risk relationships. Of particular note is that, according to Section 355 (b)(5)(B) (Reference 3), the clinical trials in consideration are those "intended to form the primary basis of an effectiveness claim." An essential aspect of any phase 2/3 design is that the phase 2 stage is, in every way except the sample size, adequate and well controlled according to the rigor of the phase 3 stage. Thus, the results of the phase 2 stage are part of substantial evidence and are combinable with phase 3 results. In fact, a major legal motivation for phase 2/3 combination designs is Section 355 (d) of the Federal Food, Drug, and Cosmetic Act (3), which states that

the term "substantial evidence" means evidence consisting of adequate and well-controlled investigations, including clinical investigations, by experts qualified by scientific training and experience to evaluate the effectiveness of the drug involved, on the basis of which it could fairly and responsibly be concluded by such experts that the drug will have the effect it purports or is represented to have under the conditions of use prescribed, recommended, or suggested in the labeling or proposed labeling thereof.
The definition of "substantial evidence" is consistent with a general scientific principle that all relevant data should be included to evaluate evidence in support of a hypothesis.

2 BETTER DOSE-RESPONSE STUDIES WITH PHASE 2/3 DESIGNS

Hung et al. (6) raise the fundamental question: "Why can the accumulating information at an interim time of a trial suddenly gain sufficient insights for predicting benefit/risk ratio?" This question underlies a fundamental difficulty of the traditional phase 3 trials, in which interim monitoring of safety
and efficacy is not employed. From experience with group sequential trials (see Reference 7, pp. 29–39), conducting interim analyses allows safety and efficacy to be assessed as data accumulate, and therefore, when new safety signals develop, a Data Safety and Monitoring Board (DSMB) can make appropriate decisions, which include extending the trial for better evaluation of the benefit–risk ratio. However, this experience does not guarantee that a phase 2/3 design is superior to a traditional, well-planned multiple-dose phase 3 design. Rather, most proposed phase 2/3 designs, which are listed in the Further Reading section, are in fact inferior to a typical multiarm phase 3 trial because their practical use would lead to a higher probability of selecting a single dose that has unacceptable safety risks. A common feature of these phase 2/3 designs is that only one treatment arm, most often selected on the basis of the maximum test statistic for efficacy, is carried forward, so as to produce the illusion of saving cost and time. Clinical issues and regulatory ramifications of selecting one dose for phase 3 trials under the traditional paradigm are described clearly and eloquently by Hemmings (8, pp. 30, 46, 47).

From a regulatory point of view, selecting only one dose in phase 2/3 trials would certainly decrease long-term dose-response information and therefore reduce regulatory agencies' ability to make better benefit–risk evaluations. These single-dose designs exacerbate the regulatory difficulties because the statistical procedure that selects the dose with the maximum test statistic for efficacy tends, in general, also to select the dose with the poorest tolerability and/or the highest risk of adverse events. Hence, practical applications of the single-dose designs can paradoxically lead to longer and more costly overall development, a higher late-stage development attrition rate, or a greater probability of harming patients once the drug is on the market. This outcome is a realistic scenario that is even supported by simulation studies of some proposed designs. For example, for designs with three treatment arms and a single control arm, it has been demonstrated that the probability of selecting a particular dose is only one third of the
overall power when in fact all three treatments are equally efficacious. If the overall power for demonstrating statistical significance for any selected treatment is 90% and if only one of the three treatments has an acceptable long-term benefit–risk profile, then the probability of that treatment being selected is only 30%. This level of probability of success is worse than the unacceptable lower probability bound of 1/3 = 33.33% of the subset selection criterion of Bechhofer et al. (9, p. 9). In contrast, a phase 3 trial with a subset of three properly selected doses can substantially increase the probability that one of the three doses is close to being optimal. From a broader clinical development perspective, it is not difficult to perform simulation studies to show that with these single-dose phase 2/3 designs the late-stage attrition rate would be far greater than the current attrition rate of 50%. With some proposed designs, it is possible to select more than one dose to carry forward, without inflation of type I error rates caused by multiplicity. In such cases, the conclusion of time and cost savings is misleading.

To fulfill the promise to bring safe and effective treatments to patients, phase 2/3 trials should achieve two basic objectives: 1) that patients do not receive ineffective, intolerable, or unsafe treatments unnecessarily; and 2) that adequate data for benefit–risk evaluations can be obtained without compromising existing regulatory and scientific standards. These objectives cannot be achieved with traditional phase 3 trials by design (see the section entitled "Inadequacy of the Traditional Paradigm"). It is only in pursuing these objectives that the additional benefits of time and cost savings for the overall development program can be meaningfully assessed.

For placebo-controlled parallel-group dose-response trials, Liu and Pledger (5) proposed a "drop-the-loser" approach in which low doses that meet a specified futility guideline for short-term efficacy endpoints and high doses with safety and tolerability concerns are dropped. The remaining dose range, which has a high probability (e.g., 90% as compared with 30% for single-dose designs) of covering the optimal treatment, is then carried forward for long-term evaluations. In addition, patients already randomized in the phase 2 stage continue to receive treatment and follow-up. The design produces maximum long-term dose-response information that is not available from typical phase 3 trials and saves overall development cost and time. It is important to note that this approach already implements the idea recommended by Hemmings (8).
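The arithmetic behind the 30% figure quoted above is easy to check by simulation. The sketch below, with hypothetical effect sizes and sample sizes, selects the arm with the largest z-statistic versus control when all three doses are equally efficacious; by symmetry each dose is picked about one third of the time, so if only one dose has an acceptable long-term benefit–risk profile and the overall power is 90%, the chance that this particular dose is both selected and demonstrated significant is roughly 0.9/3 = 0.3.

import numpy as np

rng = np.random.default_rng(2024)

def selection_probability(n_per_arm=100, effect=0.35, n_sim=20_000):
    """Probability that a prespecified dose (arm 0) shows the largest
    z-statistic versus control when all three doses share the same true
    standardized effect against placebo."""
    picks = 0
    for _ in range(n_sim):
        control = rng.normal(0.0, 1.0, n_per_arm)
        doses = rng.normal(effect, 1.0, (3, n_per_arm))
        z = (doses.mean(axis=1) - control.mean()) / np.sqrt(2.0 / n_per_arm)
        picks += int(np.argmax(z) == 0)
    return picks / n_sim

p_select = selection_probability()
print(f"P(select the one acceptable dose) ~ {p_select:.3f}")          # close to 1/3
print(f"With 90% overall power, P(select it and succeed) ~ {0.9 * p_select:.2f}")

The final line is a rough product approximation of the kind used in the text; a full evaluation of a specific design would simulate the second stage and the final combined test as well.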
3 PRINCIPLES OF PHASE 2/3 DESIGNS
Prospective planning is necessary to ensure that statistical inference from the trial results is valid and interpretable. Details are discussed below.

3.1 Endpoints and Feasibility

The first step is to assess the operational feasibility of a phase 2/3 trial. Two major factors determine feasibility: how quickly patients can be enrolled to meet the sample size requirement, and how long a treatment and follow-up period is required to evaluate the clinical endpoints. For most phase 3 trials, the time required to enroll patients can be short relative to the time required for treatment and follow-up. If the interim analysis had to be based on the clinical endpoint, then the maximum sample size could be reached long before the interim analysis is even performed, and any decisions made at the interim analysis would not lead to any savings of time and resources. To reflect the traditional approach for late-stage development, a phase 2/3 design can therefore use short-term endpoints in the phase 2b portion for decisions regarding treatment arm selection and sample size adjustment. To ensure reliable decision making, it is important to select the type and timing of the short-term endpoints carefully so that the trial is not prematurely terminated and the likelihood of carrying the optimal treatment arm forward to the second stage is maximized.

An exception is when clinical efficacy is measured by a time-to-event endpoint and it is acceptable to assume a proportional hazards model for statistical inference. In this case, traditional group sequential designs with type I error rate spending and nonbinding futility boundaries can be modified to incorporate other adaptive features such as dropping treatment arms and sample-size adjustment. However, for many such applications, the required number of patients can be enrolled relatively fast, and it usually takes a longer time to accumulate enough events for interim analyses. Thus, a group-sequential type of phase 2/3 design can only reduce the time to complete the trial.
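For the time-to-event case just mentioned, the error-spending machinery is standard. The sketch below evaluates a commonly used O'Brien-Fleming-type spending function (in the Lan-DeMets form) at a set of hypothetical information fractions. It shows only how the one-sided type I error budget is allocated across interim looks; converting the spent error into actual stopping boundaries additionally requires the joint distribution of the sequential test statistics, which is omitted here.

import numpy as np
from scipy.stats import norm

def obf_spending(info_fractions, alpha=0.025):
    """Cumulative one-sided type I error spent at each information fraction t,
    using the Lan-DeMets O'Brien-Fleming-type spending function
    alpha(t) = 2 - 2 * Phi(z_{alpha/2} / sqrt(t))."""
    t = np.asarray(info_fractions, dtype=float)
    z_half = norm.ppf(1.0 - alpha / 2.0)
    return 2.0 - 2.0 * norm.cdf(z_half / np.sqrt(t))

# Hypothetical interim analyses at 25%, 50%, 75%, and 100% of the planned events.
t = [0.25, 0.50, 0.75, 1.00]
cumulative = obf_spending(t)
incremental = np.diff(np.concatenate(([0.0], cumulative)))
for frac, cum, inc in zip(t, cumulative, incremental):
    print(f"t = {frac:.2f}: cumulative alpha = {cum:.5f}, spent at this look = {inc:.5f}")

The characteristic feature, visible in the output, is that almost no error is spent at early looks, so early stopping for efficacy requires very strong evidence while the final analysis retains nearly the full significance level.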
3.2 Hypotheses and Interpretability

To qualify a phase 2/3 trial as a pivotal trial, it is critical that the trial results are interpretable with respect to the hypotheses specified in advance in the development program. By the general theory of adaptation developed by Liu et al. (10), it is possible to adapt a trial "on the fly" to an unspecified hypothesis without inflating the type I error rate. However, a significant P-value that combines data from both stages in this case provides only a qualitative evaluation that a new treatment is somehow different from the control. More specific evaluations are necessary to describe more precisely how patients will benefit from the new treatment, so that the new treatment can be labeled properly in the package insert. 21CFR 312.6 (1) states that

The label or labeling of an investigational new drug shall not bear any statement that is false or misleading in any particular and shall not represent that the investigational new drug is safe or effective for the purposes for which it is being investigated.

Evidently, such more specific evaluations can then be based solely on data from the second stage, which reflects exactly the traditional clinical development paradigm, in which hypotheses of relevance for phase 3 trials are formulated only after the results of phase 2 trials are known. The relevance of hypotheses for a treatment indication depends on consensus of the medical community and has to be agreed to by regulatory agencies. Under the traditional clinical development paradigm, agreements on designs of phase 3 trials are typically reached through end-of-phase-2 meetings. To ensure relevance of hypotheses and seamless execution of a phase 2/3 trial, it is essential to specify the hypotheses up front in the study protocol.
3.3 Types of Adaptations

The set of prespecified hypotheses limits the types of adaptations admissible under a phase 2/3 design. After the interim analysis, the trial can focus on testing more specific hypotheses when other hypotheses become unlikely to be true. If a type of adaptation can potentially lead to a hypothesis outside what was agreed on, then the trial results may be difficult to interpret. In addition, the multiple type I error rates may not be properly controlled because the set of hypotheses has been enlarged. Depending on the clinical application, the relevant hypotheses may differ, and therefore the admissible types of adaptations may also differ. However, it is useful to classify the types of adaptations according to how specific they are to the underlying clinical application: type A reflects the goals of primary adaptation, which often leads to a change in the hypotheses; type B pertains to sample size adjustment, which does not change the hypotheses or their interpretation; and type C covers adaptive hypothesis testing methods for handling global null hypotheses. For a dose-response trial, Liu and Pledger (5) proposed a phase 2/3 design that covers all three types of adaptations. Specifically, type A adaptations cover dropping low doses for lack of efficacy and high doses that are potentially unsafe, which is a form of primary adaptation; type B adaptations provide methods for sample size adjustment according to the dose selections made; and type C adaptations provide an adaptive trend testing procedure derived by modeling the dose-response curve using data from the first stage. For trials with multiple endpoints or multiple patient subpopulations, type A adaptations are restricted to endpoint selection or subpopulation selection, respectively, because these are the goals of primary adaptation in those settings.

3.4 Adaptation Rules versus Guidelines

Although it is critical to specify the types of adaptations to ensure interpretability and validity of results, it is also important (see
21CFR 312.23, Reference 1) to provide guidelines for adaptations in the study plan. Generally, adaptation guidelines are suggested plans for possible adaptations that are based on observations of short-term efficacy and perhaps safety data. The adaptation guidelines, which may not be as explicit as algorithms that lead to specific adaptations, embody basic principles for adaptations and thereby facilitate intelligent interim decision making by a DSMB. In contrast, adaptation rules, which are algorithms specified a priori that apply to only a few endpoints and do not cover every aspect of the ongoing trial, are meant to be followed only to maintain certain statistical properties of inadequate adaptive designs. By 21CFR 312.21 (1), phase 3 trials are expanded studies for collecting additional information on effectiveness and safety for evaluating overall benefit-risk that is not already available from phase 2 trials. Therefore, additional uncertainty and risk to patients occur in phase 3 trials that are not predictable from phase 2 trials. In well-planned phase 2/3 trials, adaptation guidelines facilitate intelligent and responsible interim decision making by a DSMB, whose goals are to protect patients and to ensure that the ultimate trial objective of collecting adequate data to evaluate benefits and risks is achieved. Canner (11) eloquently stated that:

Decision-making in clinical trials is complicated and often protracted . . . no single statistical decision rule or procedure can take the place of well-reasoned consideration of all aspects of the data by a group of concerned, competent and experienced persons with a wide range of scientific backgrounds and points of view.
DeMets (12), in discussing group sequential trials, pointed out that for practical applications the stopping boundaries are guidelines, whereas other relevant secondary efficacy and safety data as well as external information must also be taken into account in decisions to stop the trial early. In fact, it is now standard practice to provide to a DSMB a comprehensive list of trial related summaries (see Reference 7, Tables 5.1 and 5.2 within). The totality of accumulating data may lead to
unplanned but necessary changes. An excellent example, which is described in Ellenberg et al. (7, pp. 36, 37), is the fluconazole trial for preventing serious fungal infection in patients with AIDS. Even though the primary endpoint of serious fungal infections clearly crossed the O'Brien-Fleming boundary in favor of fluconazole, the DSMB recommended continuation of the trial after observing a large excess of deaths on the fluconazole regimen. At the completion of the trial, patients on fluconazole continued to show a lower rate of serious fungal infections but a greater death rate, and more fluconazole patients died without experiencing a serious fungal infection. This outcome suggested the need to explore a possible unintended mechanism of fluconazole that could adversely affect survival. It must be pointed out that under the traditional development paradigm, designs of phase 3 trials occur after phase 2 results are available. The underlying principles for designing phase 3 trials after examining phase 2 results are essentially adaptation guidelines as described. Thus, employing adaptation guidelines is an inherent requirement for phase 2/3 designs.

3.5 Ethical Considerations

For life-threatening diseases, it is imperative to make the best treatment available as early as possible. This goal is often achieved in the context of a single pivotal trial in which the effects of new treatments are tested at multiple interim analysis times. Once a new treatment is shown to be superior to a control, it is no longer ethically acceptable to subject patients to inferior treatments. In a trial with multiple treatment regimens, patients who are randomized to treatment arms that are eliminated either because of lack of efficacy or because of safety concerns are typically treated with the best treatment available, which can be either a new treatment regimen in the trial or a standard treatment outside the trial. When treatment eliminations are based on short-term data, long-term clinical efficacy and safety data on the underlying treatment do not even exist for patients enrolled in the first stage of a phase 2/3 trial. This issue causes bias for estimating treatment effects
of remaining treatment arms. Because both efficacy and safety data are used for dose selection, it is difficult to determine the direction of the bias. Gallo and Maurer (13) point out that there is no means to correct the bias, even for clinical development under the traditional paradigm. For non-life-threatening diseases, multiple placebo-controlled phase 3 trials and large, simple open-label extension trials are often necessary to provide substantial evidence of efficacy and safety. With a phase 2/3 design, there are fewer ethical concerns about continuing to treat and follow patients already on treatment arms that are dropped for lack of efficacy, as long as the placebo control is still present (and no harm results). It is often appropriate to continue treating and following patients on their randomized treatment arms that are dropped for tolerability reasons, while permitting decisions of early withdrawal on a case-by-case basis. For this setting, statistical procedures for testing, estimation, and confidence intervals already exist (5, 14).

4 INFERENTIAL DIFFICULTIES

21CFR 312.6 (1) states that

The label or labeling of an investigational new drug shall not bear any statement that is false or misleading.
And Section 355 (d)(5) of the Federal Food, Drug, and Cosmetic Act (3) provides a clause for the Secretary of DHHS (the Department of Health and Human Services) to issue an order refusing to approve an application if evaluated on the basis of the information submitted to him as part of the application and any other information before him with respect to such drug, there is a lack of substantial evidence that the drug will have the effect it purports or is represented to have under the conditions of use prescribed, recommended, or suggested in the proposed labeling thereof
Thus, in a general sense, a key inferential objective for conducting clinical trials and performing statistical analysis is to provide evidence in support of labeling statements
of a drug's effectiveness and safety. It is up to the Secretary of DHHS to decide whether the evidence presented is substantial according to established regulatory guidance [see Section 355 (b)(5)(A), Reference 3]. Under the traditional clinical development paradigm, in which phase 3 trials of fixed sample size are used, statements of the drug's effectiveness are evaluated mainly through significance tests, where P-values are used as measures of the strength of evidence. The P-values are then used to determine whether the evidence presented in support of a statement is substantial according to the conventional choice of type I error rates. For settings in which parametric models can be correctly specified, significance tests are typically derived according to the sufficiency principle, such that no evidence of relevance for the parameter in question is left unused. Among an infinite number of ways of combining sufficient statistics, minimum sufficient statistics are used in significance tests, mainly according to the Neyman-Pearson theory, under which the power of significance tests at any significance level is maximized. Refinements to statements of the drug's effectiveness based on effect sizes are often provided through point estimates and confidence intervals in journal publications. However, estimates and confidence intervals are rarely presented in package inserts of labeling; instead, summary tabulations are used. For adaptive designs, several testing procedures allow various adaptations and multiple comparisons. These adaptive testing methods are based on traditional ad hoc tests proposed for combining results of different experiments (15–17). The combination tests belong to a class of more general conditional tests proposed by Proschan and Hunsberger (18), which are rigorously justified by Liu et al. (10) on probability measure-theoretic grounds. Although these combination tests appear adequate for controlling the type I error rate at a specified level, there is an ongoing and often confusing debate over whether they can lead to scientifically sound evidentiary evaluation of a drug's effectiveness. Central to these debates is the claim that most adaptive designs do not follow the sufficiency principle (19–21), by which inference
regarding an unknown parameter should be based on the "minimal sufficient statistic." However, most of the literature retains the original definition of the sufficiency principle by Fisher (22, 23), that any inference should be based only on sufficient statistics. The essence of Fisher's definition is that no data of relevance should be left out of the inference, which is exactly the emphasis of adaptive designs. It is true that adaptive designs that maintain flexibility (i.e., that rely on adaptation guidelines as described earlier) do not use minimum sufficient statistics to combine data. This occurs simply because an α-level adaptive test that uses minimum sufficient statistics while maintaining flexibility does not exist (24). It must be pointed out that using minimum sufficient statistics is a consequence of applying the Neyman-Pearson theory in the setting of fixed sample size designs. It is not at all clear at this time whether, or in what sense, minimum sufficient statistics are optimal for adaptive designs. Despite this uncertainty, apparent inferential difficulties are noted with common combination tests when the derived P-values or other aspects of the tests are used for evidentiary evaluations of the data. Liu and Pledger (25, p. 1981) give an example to illustrate a nonstochastic curtailing problem with the P-values of Brannath et al. (26) for Fisher's combination tests. They also point out that the P-values from their recursive combination tests inherit a problem of stage-wise ordering of the sample space for group sequential designs. After a boundary is crossed, additional data that are obtained either because of natural overrunning or because of intentional trial extension to obtain more data for other endpoints are not incorporated in the final analysis. Therefore, the intent-to-treat principle for data analysis is violated. A recent paper by Burman and Sonesson (19) highlights logical difficulties with weighted test statistics, as well as with their resulting point estimates and confidence intervals. With no explicit reference to minimum sufficiency, Chen et al. (27) suggest that an adaptive test statistic should give an equal weight to all patients according to the idea of "one-patient-one-vote." It must be emphasized that these difficulties with adaptive tests neither nullify
the legal basis for phase 2/3 combination designs nor negate their promise to bring optimal treatment to patients. Rather, these difficulties reinvigorate classic criticisms of significance tests seeking to maintain the size of type I error rates regardless of the sample size while ignoring the fundamental issue of evidentiary evaluation of data (28, pp. 61–82). According to Proschan (24), an α-level adaptive test that uses minimum sufficient statistics while maintaining flexibility does not exist. Thus, any effort to contrive adaptive tests of fixed significance level is bound to cause additional inferential or logical difficulties in evidentiary evaluation of data. Frisen (29) has concluded that To gain control over the significance level is not good enough. There are very misleading procedures that have a controlled significance level. The concentration on a solution to get a fixed significance level takes the focus away from the important issue of evaluating information.
Not surprisingly, this conclusion coincides with Section 355 (d)(5) of the Federal Food, Drug, and Cosmetic Act (3), by which the FDA’s approval process consists of application review, including assessing the strength of evidence, and then approval decisions, for which a lack of substantial evidence can be the basis for the Secretary of DHHS to issue an order refusing to approve an application. Hopefully, with this enlightenment, statistical research can focus on evidentiary evaluation of data.
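To make the weighted combination tests at the center of this debate concrete, the following Python sketch illustrates a two-stage inverse-normal combination rule with prespecified weights, in the spirit of references 16 and 17. The data, the effect size, and the interim rule for changing the stage-2 sample size are hypothetical; the point is only that the combined statistic uses the planned weights rather than the realized sample sizes, which is precisely the feature criticized as a departure from sufficiency and from equal weighting of patients.

# Sketch of a two-stage weighted inverse-normal combination test, in the spirit
# of references 16 and 17. All sample sizes, the effect size, and the interim
# rule for changing the stage-2 sample size are hypothetical.
import numpy as np
from scipy.stats import norm

alpha = 0.025                       # one-sided significance level
n1_planned, n2_planned = 100, 100   # planned per-arm sample sizes for the two stages
# The weights are fixed in advance from the planned split and are never changed.
w1 = np.sqrt(n1_planned / (n1_planned + n2_planned))
w2 = np.sqrt(n2_planned / (n1_planned + n2_planned))

def stage_z(x_trt, x_ctl):
    # One-sided z-statistic comparing two arms of (approximately) normal data.
    se = np.sqrt(x_trt.var(ddof=1) / len(x_trt) + x_ctl.var(ddof=1) / len(x_ctl))
    return (x_trt.mean() - x_ctl.mean()) / se

rng = np.random.default_rng(1)
effect = 0.3                        # assumed true standardized treatment effect

# Stage 1 with the planned sample size.
z1 = stage_z(rng.normal(effect, 1, n1_planned), rng.normal(0, 1, n1_planned))

# Interim adaptation: the stage-2 sample size is re-chosen from stage-1 data.
# Any rule may be used here, because only z2, not the realized n2, enters the test.
n2_actual = 150 if z1 < 1.0 else n2_planned

# Stage 2 on new, independent patients.
z2 = stage_z(rng.normal(effect, 1, n2_actual), rng.normal(0, 1, n2_actual))

z_comb = w1 * z1 + w2 * z2          # prespecified weights, not realized sample sizes
p_comb = 1.0 - norm.cdf(z_comb)
print(f"z1={z1:.2f}, z2={z2:.2f} (n2={n2_actual}), combined z={z_comb:.2f}, "
      f"p={p_comb:.4f}, reject H0: {z_comb > norm.ppf(1 - alpha)}")
# The feature criticized in the text is visible here: because w1 and w2 come from
# the planned split, stage-2 patients are down-weighted whenever n2_actual > n2_planned.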
5 SUMMARY

The focus of the article is on descriptions of, and the U.S. legal foundations for, phase 2/3 designs. Under the U.S. Constitution, the FDA interprets and executes laws but has no authority to make laws. Thus, this article serves the purpose of informing regulators from the FDA of their responsibilities to implement the laws despite their personal preference, bias, or opinion. The author is unfamiliar with the legal bases for regulatory agencies of other countries and therefore would like to encourage readers to research and report the findings. It is argued that phase 2/3 designs should be used to improve on what current phase 3 designs are lacking. The difficulties with existing adaptive methods are highlighted. It is concluded that to resolve these difficulties, statistical research should focus on evidentiary evaluation of data, not fixed-size significance tests. Of particular note is that standard criteria for properly controlling type I error rates must still be followed. Extended logistical and implementation issues are not discussed at all. Nevertheless, readers should find that various discussions and recommendations are available in the reading list provided.
6 ACKNOWLEDGMENTS

The author would like to thank two referees and an associate editor for their comments.
REFERENCES

1. 21CFR312. Available at: http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm?CFRPart=312.
2. R. Temple, Phases of drug development and reviewer role in CDER/FDA, New Reviewers' Workshop. 1997.
3. TITLE 21–Food and Drugs. Federal Food, Drug, and Cosmetic Act. Available at: http://www.access.gpo.gov/uscode/title21/chapter9_subchapterv_.html.
4. ICH E8. Available at: http://www.ich.org/cache/compo/475-272-1.html.
5. Q. Liu and G. W. Pledger, Phase 2 and 3 combination designs to accelerate drug development. J. Am. Stat. Assoc. 2005; 100: 493–502.
6. H. M. J. Hung, R. T. O'Neill, S. J. Wang, and J. Lawrence, A regulatory view on adaptive/flexible clinical trial design. Biometrical J. 2006; 48: 565–573.
7. S. S. Ellenberg, T. R. Fleming, and D. L. DeMets, Data Monitoring Committees in Clinical Trials: A Practical Perspective. West Sussex, UK: Wiley, 2002.
8. R. Hemmings, Philosophy and methodology of dose-finding – a regulatory perspective. In: S. Chevret (ed.), Statistical Methods for Dose-Finding Experiments. New York: John Wiley & Sons, 2006.
9. R. E. Bechhofer, T. J. Santner, and D. M. Goldsman, Design and Analysis of Experiments for Statistical Selection, Screening, and Multiple Comparisons. New York: John Wiley & Sons, 1995.
10. Q. Liu, M. A. Proschan, and G. W. Pledger, A unified theory of two-stage adaptive designs. J. Am. Stat. Assoc. 2002; 97: 1034–1041.
11. P. L. Canner, Monitoring of the data for evidence of adverse or beneficial treatment effects. Control. Clin. Trials 1983; 4: 467–483.
12. D. L. DeMets, Stopping guidelines vs stopping rules: a practitioner's point of view. Commun. Statist. 1984; 13: 2395–2418.
13. P. Gallo and W. Maurer, Challenges in implementing adaptive designs: comments on the viewpoints expressed by regulatory statisticians. Biometrical J. 2006; 48: 591–597.
14. M. A. Proschan, Q. Liu, and S. A. Hunsberger, Practical midcourse sample size modification in clinical trials. Control. Clin. Trials 2003; 24: 4–15.
15. P. Bauer and K. Köhne, Evaluation of experiments with adaptive interim analyses. Biometrics 1994; 50: 1029–1041.
16. L. Cui, H. M. J. Hung, and S. J. Wang, Modification of sample size in group sequential clinical trials. Biometrics 1999; 55: 853–857.
17. W. Lehmacher and G. Wassmer, Adaptive sample size calculations in group sequential trials. Biometrics 1999; 55: 1286–1290.
18. M. A. Proschan and S. A. Hunsberger, Designed extension of studies based on conditional power. Biometrics 1995; 51: 1315–1324.
19. C. F. Burman and C. Sonesson, Are flexible designs sound? Biometrics 2006; 62: 664–669.
20. D. R. Cox and D. V. Hinkley, Theoretical Statistics. London: Chapman & Hall, 1974.
21. O. Barndorff-Nielsen, Information and Exponential Families in Statistical Theory. New York: Wiley, 1978.
22. R. A. Fisher, On the probable error of a coefficient of correlation deduced from a small sample. Metron 1921; 1: 3–32.
23. R. A. Fisher, Theory of statistical estimation. Proc. Cambridge Philos. Soc. 1925; 22: 700–725.
24. M. A. Proschan, Discussions on "Are flexible designs sound?" Biometrics 2006; 62: 674–676.
25. Q. Liu and G. W. Pledger, On design and inference for two-stage adaptive clinical trials with dependent data. J. Statist. Plan. Inference 2006; 136: 1962–1984.
26. W. Brannath, M. Posch, and P. Bauer, Recursive combination tests. J. Amer. Statist. Assoc. 2002; 97: 236–244.
27. Y. H. J. Chen, D. L. DeMets, and K. K. G. Lan, Increasing the sample size when the unblinded interim result is promising. Stats. Med. 2004; 23: 1023–1038.
28. R. Royall, Statistical Evidence: A Likelihood Paradigm. London: Chapman & Hall, 1997.
29. M. Frisen, Discussions on "Are flexible designs sound?" Biometrics 2006; 62: 678–680.
FURTHER READING

P. Bauer and M. Kieser, Combining different phases in the development of medical treatments within a single trial. Stats. Med. 1999; 18: 1833–1848.
W. Bischoff and F. Miller, Adaptive two-stage test procedures to find the best treatment in clinical trials. Biometrika 2005; 92: 197–212.
F. Bretz, H. Schmidli, F. Konig, A. Racine, and W. Maurer, Confirmatory seamless phase II/III clinical trials with hypotheses selection at interim: general concepts. Biometrical J. 2006; 48: 623–634.
D. A. Follmann, M. A. Proschan, and N. L. Geller, Monitoring pairwise comparisons in multi-armed clinical trials. Biometrics 1994; 50: 325–336.
L. Y. T. Inoue, P. F. Thall, and D. A. Berry, Seamlessly expanding a randomized phase II trial to phase III. Biometrics 2002; 58: 823–831.
P. J. Kelly, N. Stallard, and S. Todd, An adaptive group sequential design for phase II/III clinical trials that select a single treatment from several. J. Biopharm. Statist. 2005; 15: 641–658.
F. Konig, P. Bauer, and W. Brannath, An adaptive hierarchical test procedure for selecting safe and efficient treatments. Biometrical J. 2006; 48: 663–678.
J. Maca, S. Bhattacharya, V. Dragalin, P. Gallo, and M. Krams, Adaptive seamless phase II/III designs – background, operational aspects, and examples. Drug Inf. J. 2006; 40: 463–473.
M. Posch, F. Koenig, W. Brannath, C. Dunger-Baldauf, and P. Bauer, Testing and estimation in flexible group sequential designs with adaptive treatment selection. Stats. Med. 2005; 24: 3697–3714.
M. A. Proschan and S. A. Hunsberger, Combining treatment selection and definitive testing. Biometrical J. 2006; 48: 690–692.
A. R. Sampson and M. W. Sill, Drop-the-losers design: normal case. Biometrical J. 2005; 3: 257–268.
D. J. Schaid, S. Wieand, and T. M. Therneau, Optimal two stage screening designs for survival comparisons. Biometrika 1990; 77: 659–663.
H. Schmidli, F. Bretz, A. Racine, and W. Maurer, Confirmatory seamless phase II/III clinical trials with hypotheses selection at interim: applications and practical considerations. Biometrical J. 2006; 48: 635–643.
N. Stallard and S. Todd, Point estimates and confidence regions for sequential trials involving selection. J. Statist. Plan. Inference 2005; 135: 402–419.
P. F. Thall, R. Simon, and S. S. Ellenberg, A two-stage design for choosing among several experimental treatments and a control in clinical trials. Biometrics 1989; 45: 537–547.
S. Todd and N. Stallard, A new clinical trial design combining phase II and III: sequential designs with treatment selection and a change of endpoint. Drug Inf. J. 2005; 39: 109–118.
J. Wang, An adaptive two-stage design with treatment selection using the conditional error function approach. Biometrical J. 2006; 48: 679–689.
PHASE I/II CLINICAL TRIALS
SARAH ZOHAR Inserm U717, Biostatistics Department Inserm CIC9504, Clinical Research Center Saint-Louis Hospital Paris, France
Traditional phase I clinical trials aim to identify the maximum tolerated dose (MTD), and phase II clinical trials aim to evaluate the potential clinical activity of a new drug based on the MTD obtained from phase I. Due to the limited sample size in phase I, the MTD might not be obtained in a reliable way, thus affecting the subsequent phases II and III. The final objective is to select an optimal dose, a dose associated with the highest efficacy, given a tolerable toxicity (1). In addition, in early phase dose-finding clinical trials, investigators are increasingly interested in selecting not only a dose level but also a treatment schedule. If two different treatment schedules show the same efficacy, the schedule with less toxicity is preferred (2). Historically, for cancer chemotherapy or cytotoxic agents, the recommendation of the dose level for further studies is based on toxicity considerations. However, this practice is only valid under the assumption that the dose-toxicity and the dose-efficacy relationships are monotone and increasing; that is, the treatment activity increases with the dose as does potential toxicity (3). Therefore, the MTD estimated in phase I dose-finding trials is assumed to be the dose level associated with the most promising rate of efficacy. Phase I trials are not limited to toxicity observation alone, but frequently the therapeutic response is also recorded even if it is not considered as the primary outcome (4). In a recent review of phase I clinical trials, Horstmann et al. (5) concluded that response rates in these trials were almost always recorded and were higher than previously believed. Although reported response rates in the past have been around 4% to 6% with toxicity-related death rates of 0.5% or lower, recently published phase I trials have shown response rates exceeding 10% with stable disease. However, Horstmann et al. (5) also pointed out the considerable heterogeneity among phase I trials and the fact that the great majority of patients entering into a phase I trial hold out hope for some benefit, despite its a priori small probability. Based on several phase I studies, Kurzrock and Benjamin (6), in line with other investigators engaged in phase I trials, concluded that reporting of the observed responses was needed. Muggia (7) concluded that "prior estimates of the risks and benefits of phase I oncology trials need updating and insistence on not conveying therapeutic intent in the informed consent process in all instances is misplaced." To get some insight into the practice of phase I/II trials, a total of 96 English-language papers were selected from Medline using "phase I/II" in their title, published from January 1, 2005, to July 1, 2006. They included (1) 32 phase I dose-finding clinical trials using the toxicity as primary endpoint and the efficacy as secondary endpoint; (2) 28 studies first conducted according to a phase I dose-finding, then followed by a phase IIA clinical trial in which an additional subset of patients were treated at the selected MTD or the next lower dose level to get an estimation of the efficacy at that dose level; (3) 27 phase IIA clinical trials using the efficacy as primary endpoint and the toxicity as secondary endpoint; and (4) 9 phase IIB randomized clinical trials using the efficacy as primary endpoint and the toxicity as secondary endpoint. In all cases, toxicity and efficacy were modeled separately (8). Despite discussion about the need to evaluate both toxicity and efficacy in the same trial, in practice, phase I clinical trials are conducted as dose-finding, and phase II trials are designed to examine potential efficacy or any responsive activity based on one dose level recommended from the phase I (9). Recently, there has been increasing research into the development of dose-finding methodologies based on joint modeling of toxicity and efficacy outcomes. Gooley et al. (10) seem to be the first to consider two dose-outcome curves. Over the last 10 years, there has been
a growing body of literature proposing new methodologies using either binary outcomes for toxicity and efficacy (1, 11–17), among others, or a continuous variable for treatment activity while toxicity is assumed to be binary (16, 18, 19). In this article, sequential dose-finding designs for phase I/II clinical trials are presented, and prospective or retrospective examples are given as illustration for these methods.
1 TRADITIONAL APPROACH
Traditional phase I/II clinical trials combine a phase I and a phase II trial of the same treatment into a single protocol. They are conducted in two stages, in which the first stage consists of a phase I dose-finding design and the second stage of a single-arm phase II approach. The first stage aims at determining the MTD and recommends a dose level for the second stage. In the first stage, a "standard" or 3 + 3 dose-finding algorithm is usually used; the most common and widespread algorithm is as follows. Patients are treated in cohorts of three patients, with a maximum of six patients treated at any dose level and starting at the lowest dose level. For each cohort at a given dose level, (1) if no toxicity is observed over three patients then the next cohort is treated at the next higher dose level; (2) if one toxicity is observed over three patients then the next cohort of three patients is treated at the same dose level; (3) if two toxicities over six are observed then the current dose level is selected as the MTD; or (4) if two or more toxicities are observed over three patients, then the MTD is exceeded and the recommended dose level is the next lowest dose. At the end of the first stage, the dose-finding algorithm is stopped and the recommended dose level from the first stage is then used in the second stage. The second stage is a phase II single-arm design in which a predetermined sample size is included. The aim of the second stage is to measure the treatment effect at the recommended dose level from the first stage (20).
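As a concrete illustration, the following Python sketch simulates the 3 + 3 algorithm exactly as enumerated above. The true toxicity probabilities are hypothetical, and the behavior in situations the enumeration leaves open (for example, at most one toxicity among six patients, or more than two toxicities among six) is an assumption noted in the comments.

# Simulation sketch of the "standard" 3 + 3 algorithm as enumerated above.
# The true toxicity probabilities are hypothetical. Where the enumeration is
# silent (at most 1 toxicity among 6, or more than 2 toxicities among 6), the
# behavior coded below is an assumption.
import random

def three_plus_three(p_tox, seed=42):
    # Returns (recommended dose index or None, number of patients treated).
    rng = random.Random(seed)
    level, n_treated = 0, 0
    while True:
        tox3 = sum(rng.random() < p_tox[level] for _ in range(3))   # cohort of 3
        n_treated += 3
        if tox3 == 1:
            # Rule (2): treat three more patients at the same dose level.
            tox6 = tox3 + sum(rng.random() < p_tox[level] for _ in range(3))
            n_treated += 3
            if tox6 == 2:
                return level, n_treated        # rule (3): current dose is the MTD
            if tox6 > 2:
                # Assumed: MTD exceeded, recommend the next lower dose
                # (the text covers only the 3-patient case explicitly).
                return (level - 1 if level > 0 else None), n_treated
            # 1/6 toxicities: escalation assumed.
        elif tox3 >= 2:
            # Rule (4): MTD exceeded, recommend the next lower dose.
            return (level - 1 if level > 0 else None), n_treated
        if level == len(p_tox) - 1:
            return level, n_treated            # top dose reached without stopping
        level += 1                             # rule (1) or the assumed 1/6 case

true_tox = [0.05, 0.10, 0.20, 0.35, 0.55]      # hypothetical dose-toxicity scenario
mtd, n = three_plus_three(true_tox)
print(f"recommended dose level index: {mtd}, patients treated: {n}")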
2 RECENT DEVELOPMENTS

2.1 Continual Reassessment Method Approaches

The continual reassessment method (CRM) was proposed for phase I dose-finding clinical trials by O'Quigley et al. (21). It provides an estimation of the MTD from a fixed set of dose levels, based on sequentially administering the dose level closest to the currently estimated MTD to each newly included patient, up to a predetermined number of patients. This method has been shown to be efficient and unbiased (22). Recently, based on this method, a phase I/II design was proposed by O'Quigley et al. in the human immunodeficiency virus (HIV) setting (13). The aim of this method was to (1) model jointly the toxicity and the efficacy of a new treatment; (2) minimize the number of patients treated at unacceptably toxic dose levels; (3) minimize the number of patients treated at inefficient dose levels; (4) rapidly either escalate dose levels in the absence of any indication of drug activity or de-escalate dose levels in the presence of unacceptably high levels of toxicity; (5) stop the trial early for efficiency; and (6) minimize the number of patients needed to complete the study. Further, it is assumed that there exists a dose level at which the treatment would be considered a success in terms of toxicity and efficacy; this dose level is named the most successful dose (MSD). The patient outcome is modeled with two binary random variables: the first corresponds to a toxicity variable, whatever the treatment response is, and the second corresponds to the treatment response conditionally on the absence of toxicity. Before the trial onset, the investigators have to specify an efficacy target (i.e., a minimum response probability) and a toxicity target (i.e., a maximal toxicity probability). Two mathematical modeling procedures have been proposed. In the first, the dose-toxicity relationship and the dose-efficacy relationship are separately modeled using two one-parameter working models. In the second procedure, a CRM-derived joint
model is introduced using only one parameter. In both procedures, a sequential ratio test is used as a stopping rule. For the sake of simplicity, only the algorithm of the latter compromise model is presented:

1. The first cohort is administered the dose level selected by the investigator. An initial toxicity target is selected.
2. A traditional 3 + 3 escalating scheme is initially used until heterogeneity in terms of toxicity is reached; that is, at least one toxicity and one absence of toxicity have been observed.
3. Once heterogeneity has been observed, the subsequent cohorts receive CRM-driven selected doses.
4. In the absence of observed efficacy, the current dose and all the doses lower than the current dose are dropped from the dose range, and the toxicity target is increased within the limits of a prespecified maximum toxicity target level.
5. The recommended dose for subsequent trials corresponds to the dose with both the highest efficacy probability and a toxicity below the accepted maximum threshold.

This method was extended by Zohar and O'Quigley (17) in the cancer setting by adding stopping criteria for toxicity. In the Bayesian approach proposed by Ivanova (12), a dose de-escalation is proposed when a toxicity has been observed, whereas a dose escalation is proposed in the absence of both toxicity and efficacy. Finally, Fan and Wang (23) proposed a general decision-theoretic framework that generalizes the CRM and provides two compromise strategies for practical implementation, the One-Step Look Ahead (OSLA) and the Two-Step Look Ahead (TSLA).

2.2 Thall and Russell (TR) and Efficacy-Toxicity Trade-Off Approaches

The Thall and Russell (TR) (15) and the trade-off (14) approaches are based on Bayesian criteria that aim at finding a dose level that satisfies both toxicity and efficacy criteria. In the TR method, the patient's outcomes are
jointly modeled by means of a trinary ordinal variable whose categories are toxicity (whatever the treatment response), treatment response in the absence of toxicity, and absence of both treatment response and toxicity. Before the trial onset, the investigators have to specify a minimum efficacy probability target and a maximum toxicity probability target. An underlying mathematical model expresses the probabilities of treatment response and toxicity as interdependent functions of dose (15, 20). The trial consists of sequentially treating cohorts of patients up to a maximum sample size, or stopping earlier if stopping criteria are fulfilled. Formally, the trial algorithm is as follows:

1. The first cohort is administered the first dose level, and subsequent cohorts receive dose levels according to Bayesian posterior estimates of the efficacy and toxicity outcome probabilities.
2. No dose level skipping occurs unless previous cohorts have been treated at all intermediate dose levels.
3. If the current dose is too toxic and (1) is not the lowest dose level, then the next cohort receives the next lower dose level; or (2) is the lowest dose level, then the trial is stopped for unacceptable toxicity over the entire dose range.
4. If the current dose has acceptable toxicity but appears inefficient and (1) the next higher dose level has acceptable toxicity, then the next cohort receives the next higher dose level; or (2) the next higher dose level has unacceptable toxicity, then the trial is stopped for inefficiency under toxicity restrictions; or (3) the current dose is the highest dose level, then the trial is stopped for inefficiency over the entire dose range.
5. If the current dose is associated with acceptable toxicity and efficacy probabilities, the next cohort receives the next higher dose level that has acceptable toxicity but a higher efficacy probability than the current dose.

This method allows ending the trial before the inclusion of all of the predetermined number of patients, when either (for all dose
levels) the efficacy remains lower than the desirable target, or (for all dose levels) the toxicity remains higher than the maximum acceptable toxicity, or an optimal dose is found. This method was later extended and modified by Zhang et al. (24) to estimate the optimal dose in terms of toxicity and efficacy in the setting of specific biologic targets. Because the TR method was limited to a trinary outcome, Thall and Cook proposed a Bayesian method in which the outcome can be either bivariate binary (where patients may experience both events, toxicity and efficacy) or trinary (14). In this method, the efficacy probabilities are not necessarily monotone or increasing with the dose. Doses are selected for each cohort of patients based on a set of efficacy-toxicity trade-off contours that partition the two-dimensional outcome probability domain (14). As with the TR approach, this method is based on three basic components: (1) a Bayesian model for the joint probability of toxicity and efficacy as a function of dose; (2) decision criteria based on acceptable/unacceptable efficacy or toxicity; and (3) several elicited (toxicity, efficacy) probability pairs of targets. The targets are used to construct a family of toxicity-efficacy trade-off contours that provide a basis for quantifying the desirability of each dose; as a result, each newly included cohort receives the most desirable dose computed from posterior estimates (25, 26). Formally, the trial algorithm is as follows:

1. The first cohort is administered the dose level chosen by the physician. The subsequent cohorts receive a dose level according to Bayesian posterior estimates of a trade-off between efficacy and toxicity, given all the previously observed outcomes.
2. No untried intermediate dose may be skipped when escalating.
3. In the absence of an acceptable dose level, the trial is stopped for either unacceptably low efficacy or unacceptably high toxicity.
4. In the presence of at least one acceptable dose level, the next cohort receives the dose level associated with the maximum desirability among the acceptable doses.
5. If the trial was not stopped earlier, the dose level recommended at the end of the trial is the dose level associated with the maximum desirability among the acceptable doses.

The implementation of the method can be easily achieved using dedicated, freely available software developed by the authors (27) at the MD Anderson Cancer Center, Houston, Texas.

2.3 Methods Using Continuous Response Outcome

In many situations, the treatment activity corresponds to a continuous variable rather than a binary variable, whereas the toxicity is assumed to be binary. A Bayesian dose-finding procedure has been proposed (19, 28) in which the binary toxicity variable is modeled using a logistic regression function and the continuous activity variable is modeled using a linear regression model. Formally, the trial algorithm is as follows:

1. The total number of subjects is divided into cohorts of equal size c. The first cohort receives a dose level derived from the maximization of a Bayesian prior estimate of a gain function, whereas the subsequent cohorts receive dose levels based on successive posterior estimates of the same gain function, given the past outcomes.
2. When the current prior or posterior probability of a dose-dependent event exceeds some predefined value, the dose is not considered acceptable and is dropped.
3. The trial is stopped either when the maximum allowed number of subjects is reached, or when none of the selected doses is deemed safe to administer, or when the procedure is able to select an optimal acceptable dose with sufficient accuracy, whichever event occurs first.

At the same time, Bekele and Shen (18) have proposed a dose-finding design for biomarkers, in which both the biomarker
expression and the toxicity are jointly modeled to determine the optimal dose level. The toxicity is modeled as a binary variable, whereas the treatment effect is modeled as a latent continuous, normally distributed variable.

2.4 Other New Approaches

Several other approaches have recently been published addressing the issue of optimal dose-finding in phase I/II clinical trials. Loke et al. (29) proposed a weighted dose-finding design based on a parametric utility function associating a real value with each dose level. After the inclusion of each cohort, given the observed outcomes, the expected utility under the posterior distribution is maximized to find the optimal dose for the next cohort. By contrast, Yin et al. (1) have chosen nonparametric functions to model both the dose-toxicity and dose-efficacy relationships. They proposed an adaptive decision procedure based on odds ratio equivalence contours between toxicity and efficacy.
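Although the designs surveyed in this section differ in their models, most share the same basic interim step: update the toxicity and efficacy models, discard doses that appear unacceptable, and assign the next cohort to the most desirable acceptable dose. The Python sketch below illustrates only that common skeleton, using deliberately simple independent Beta-Binomial models at each dose; it is not a faithful implementation of any of the published designs cited above, and all targets, certainty cutoffs, and interim counts are hypothetical.

# Generic skeleton of a Bayesian phase I/II interim decision. This is not a
# faithful implementation of any published design cited above: it uses
# independent Beta(1, 1) priors at each dose, and all targets, certainty
# cutoffs, and interim counts are hypothetical.
from scipy.stats import beta

doses = [1, 2, 3, 4]                        # dose level labels
tox_target, eff_target = 0.30, 0.20         # max acceptable toxicity, min acceptable efficacy
tox_certainty, eff_certainty = 0.80, 0.80   # posterior certainty needed to reject a dose

# Hypothetical interim data per dose: (patients treated, toxicities, responses).
data = {1: (6, 0, 1), 2: (9, 1, 3), 3: (6, 2, 2), 4: (3, 2, 1)}

def prob_above(events, n, threshold):
    # P(true rate > threshold | data) under a Beta(1, 1) prior.
    return 1.0 - beta.cdf(threshold, 1 + events, 1 + n - events)

acceptable = []
for d in doses:
    n, n_tox, n_eff = data[d]
    p_too_toxic = prob_above(n_tox, n, tox_target)
    p_efficacious = prob_above(n_eff, n, eff_target)
    if p_too_toxic < tox_certainty and (1.0 - p_efficacious) < eff_certainty:
        acceptable.append((d, p_efficacious))
    print(f"dose {d}: P(tox rate > {tox_target}) = {p_too_toxic:.2f}, "
          f"P(eff rate > {eff_target}) = {p_efficacious:.2f}")

if acceptable:
    # Crude desirability: among acceptable doses, prefer the one most likely efficacious.
    next_dose = max(acceptable, key=lambda pair: pair[1])[0]
    print("next cohort assigned to dose", next_dose)
else:
    print("no acceptable dose: stop the trial")

A real design would add a dose-response model linking the dose levels, constraints against skipping untried doses, and formal stopping rules, as described above.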
3 ILLUSTRATIONS
3.1 The Cisplatin Combined with S-1 Trial—Illustration of the Traditional Approach

A phase I/II clinical trial in advanced gastric cancer was conducted to assess the MTD, the recommended dose, and the objective response rate at the recommended dose of cisplatin combined with S-1, an oral dihydropyrimidine dehydrogenase inhibitory fluoropyrimidine (30). This trial used the standard method as the dose-allocation procedure. Three dose levels of cisplatin were planned: 60, 70, or 80 mg/m2, all combined with S-1 given orally at 40 mg/m2. The MTD was defined as the dose at which 33% or more of patients experienced dose-limiting toxicities (DLT) during the first course. The first 12 patients were entered into the phase I stage, and the next 13 patients were entered into the phase II stage to confirm the toxicities and efficacy at the recommended dose. All patients were eligible for toxicity evaluation in any course and for objective response evaluations. In the phase I stage at level 1, one patient had a DLT, but the other two patients in the same cohort showed no DLT. An additional three patients were enrolled for the safety evaluation, but overall only one of the total of six patients developed a DLT at 60 mg/m2 of cisplatin. At dose level 2, two of six patients exhibited DLTs. Based on these results, dose level 2 was declared the MTD, and level 1 was declared the recommended dose for the phase II stage. An additional 13 patients were treated with the same cisplatin dose in the phase II stage. The response rate at the recommended dose at the end of the trial was 73.7% (14 out of 19; 95% confidence interval [CI], 48.8%–90.9%).
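The reported response rate and its confidence interval can be checked directly from the counts; the short Python sketch below assumes that a Clopper-Pearson (exact) binomial interval was used, which reproduces the published values closely.

# Reproducing the reported response rate and 95% CI from 14 responders out of
# 19 patients, assuming a Clopper-Pearson (exact) binomial interval was used.
from scipy.stats import beta

x, n, conf = 14, 19, 0.95
rate = x / n
lower = beta.ppf((1 - conf) / 2, x, n - x + 1) if x > 0 else 0.0
upper = beta.ppf(1 - (1 - conf) / 2, x + 1, n - x) if x < n else 1.0
print(f"response rate = {rate:.1%}, {conf:.0%} CI = ({lower:.1%}, {upper:.1%})")
# Should print 73.7% with an interval close to the (48.8%, 90.9%) reported above.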
3.2 The Interferon Trial—Illustration of the TR Approach

A phase I/II clinical trial in metastatic melanoma was conducted to assess the optimal dose, in terms of toxicity and efficacy, of interferon alpha 2a. This trial used a modified TR method as the dose-allocation procedure (15). Four daily dose levels of interferon alpha 2a were planned: 3, 6, 9, and 12 mIU/day. A maximum number of 21 patients was initially allowed. The patient's tumor response and DLT were jointly modeled using a trinary ordinal variable: DLT, tumor response without toxicity, and absence of both toxicity and tumor response. The target for the minimal efficacy probability was set at 0.20, and the maximal DLT probability was fixed at 0.30. Five cohorts of patients were enrolled in the trial, for a total of 16 patients. Three patients were included in each cohort except for the 9 mIU/day cohort, which enrolled four patients. Table 1 reports the dose-allocation procedure over the course of the trial. Given the outcomes of these 16 included patients, no acceptable dose level could be allocated, as the doses were either too toxic or insufficiently efficient without toxicity (Figure 1). Therefore, the trial could be stopped before the inclusion of the maximum predetermined number of patients. At the end of the trial, the estimated posterior limiting toxicity probabilities were 16%, 36.4%, 62.3%, and 81.1% for the dose levels 3, 6, 9, and 12 mIU/day, respectively; the corresponding efficacy probabilities without limiting toxicities were 8.2%, 12.3%, 10.7%, and 6.5%, respectively.
[Figure 1. Illustration of the TR approach in the interferon trial. (A) The dose-efficacy and dose-toxicity relationships computed at the end of the trial, together with the efficacy and toxicity targets (probability versus dose, 3–12 mIU/day). (B) The stopping criteria in relation to the dose levels: the probability of too-low efficacy and the probability of unacceptable toxicity, with the stopping threshold.]

Table 1. Illustration of the TR Approach: Dose-Allocation Procedure of the Interferon Alpha 2a Trial

Cohort   Number of patients   Dose level (mIU/day)   Outcome by patient
1        3                    3                      0/1/0
2        3                    6                      2/0/2
3        3                    6                      0/2/0
4        3                    6                      0/0/0
5        4                    9                      2/2/0/0

0 = no dose-limiting toxicity and no tumor response; 1 = tumor response and no dose-limiting toxicity; 2 = dose-limiting toxicity.
The dose level of 6 mIU/day was well tolerated but not sufficiently efficient, whereas the dose level of 9 mIU/day was too toxic.

REFERENCES

1. G. Yin, Y. Li, and Y. Ji, Bayesian dose-finding in phase I/II clinical trials using toxicity
and efficacy odds ratios. Biometrics. 2006; 62: 777–787. 2. Y. Li, B. Bekele, Y. Ji, and J. D. Cook, Dose-Schedule Finding in Phase I/II Clinical Trials Using Bayesian Isotonic Transformation. UT MD Anderson Cancer Center Department of Biostatistics Working Paper Series, Working Paper 26. Berkeley, CA: Berkeley Electronic Press, May 2006. Available at:
http://www.bepress.com/mdandersonbiostat/paper26 3. W. M. Hryniuk, More is better. J Clin Oncol. 1988; 6: 1365–1367. 4. J. O'Quigley and S. Zohar, Experimental design for phase I and phase I/II dose-finding studies. Br J Cancer. 2006; 94: 609–613. 5. E. Horstmann, M. S. McCabe, L. Grochow, S. Yamamoto, L. Rubinstein, et al., Risks and benefits of phase 1 oncology trials, 1991 through 2002. N Engl J Med. 2005; 352: 895–904. 6. R. Kurzrock and R. S. Benjamin, Risks and benefits of phase 1 oncology trials, revisited. N Engl J Med. 2005; 352: 930–932. 7. F. M. Muggia, Phase 1 clinical trials in oncology. N Engl J Med. 2005; 352: 2451–2453. 8. S. Zohar and S. Chevret, Recent developments in adaptive designs for phase I/II dose-finding studies. J Biopharm Stat. 2007; 17(6): 1071–83. 9. B. E. Storer, Phase I. In: P. Armitage and T. Colton (eds.), Encyclopedia of Biostatistics. Chichester, UK: Wiley, 1999, pp. 3365–3375. 10. T. A. Gooley, P. J. Martin, L. D. Fisher, and M. Pettinger, Simulation as a design tool for phase I/II clinical trials: an example from bone marrow transplantation. Control Clin Trials. 1994; 15: 450–462. 11. T. M. Braun, The bivariate continual reassessment method: extending the CRM to phase I trials of two competing outcomes. Control Clin Trials. 2002; 23: 240–256. 12. A. Ivanova, A new dose-finding design for bivariate outcomes. Biometrics. 2003; 59: 1001–1007. 13. J. O'Quigley, M. D. Hughes, and T. Fenton, Dose-finding designs for HIV studies. Biometrics. 2001; 57: 1018–1029. 14. P. F. Thall and J. D. Cook, Dose-finding based on efficacy-toxicity trade-offs. Biometrics. 2004; 60: 684–693. 15. P. F. Thall and K. E. Russell, A strategy for dose-finding and safety monitoring based on efficacy and adverse outcomes in phase I/II clinical trials. Biometrics. 1998; 54: 251–264. 16. J. Whitehead, Y. Zhou, J. Stevens, and G. Blakey, An evaluation of a Bayesian method of dose escalation based on bivariate binary responses. J Biopharm Stat. 2004; 14: 969–983. 17. S. Zohar and J. O'Quigley, Identifying the most successful dose (MSD) in dose-finding studies in cancer. Pharm Stat. 2006; 5: 187–199.
18. B. N. Bekele and Y. Shen, A Bayesian approach to jointly modeling toxicity and biomarker expression in a phase I/II dosefinding trial. Biometrics. 2005; 61: 343–354. 19. J. Whitehead, Y. Zhou, J. Stevens, G. Blakey, J. Price, and J. Leadbetter, Bayesian decision procedures for dose-escalation based on evidence of undesirable events and therapeutic benefit. Stat Med. 2006; 25: 37–53. 20. P. F. Thall, E. H. Estey, and H. G. Sung, A new statistical method for dose-finding based on efficacy and toxicity in early phase clinical trials. Invest New Drugs. 1999; 17: 155–167. 21. J. O’Quigley, M. Pepe, and L. Fisher, Continual reassessment method: a practical design for phase 1 clinical trials in cancer. Biometrics. 1990; 46: 33–48. 22. L. Z. Shen and J. O’Quigley, Consistency of continual reassessment method under model misspecification. Biometrika. 1996; 83: 395–405. 23. S. K. Fan and Y. G. Wang, Decision-theoretic designs for dose-finding clinical trials with multiple outcomes. Stat Med. 2006; 25: 1699–1714. 24. W. Zhang, D. J. Sargent, and S. Mandrekar, An adaptive dose-finding design incorporating both toxicity and efficacy. Stat Med. 2006; 25: 2365–2383. 25. P. F. Thall, J. D. Cook, and E. H. Estey, Adaptive dose selection using efficacy-toxicity trade-offs: illustrations and practical considerations. J Biopharm Stat. 2006; 16: 623–638. 26. P. F. Thall and J. D. Cook, Using both efficacy and toxicity for dose-finding. In: S. Chevret (ed.), Statistical Methods for Dose-Finding Experiments. Chichester, UK: Wiley, 2006, pp. 275–285. 27. Department of Biostatistics and Applied Mathematics, MD Anderson Cancer Center, University of Texas. Efftox [software]. Updated August 15, 2006. Available at: http://biostatistics.mdanderson.org/SoftwareDownload/ 28. Y. Zhou, J. Whitehead, E. Bonvin, and J. W. Stevens, Bayesian decision procedures for binary and continuous bivariate doseescalation studies. Pharm Stat. 2006; 5: 125–133. 29. Y. C. Loke, S. B. Tan, Y. Cai, and D. Machin, A Bayesian dose finding design for dual endpoint phase I trials. Stat Med. 2006; 25: 3–22. 30. W. Koizumi, S. Tanabe, K. Saigenji, A. Ohtsu, N. Boku, et al., Phase I/II study of S-1 combined with cisplatin in patients with
advanced gastric cancer. Br J Cancer. 2003; 89: 2207–2212.
CROSS-REFERENCES

Phase I trials
Phase II trials
Maximum tolerable dose (MTD)
Continual reassessment method (CRM)
PHASE III TRIALS
Regulatory agencies are charged with evaluating the sponsor’s evidence, and ultimately have the authority to approve a drug for marketing. Because all compounds carry some risk when administered, evaluations must weigh the risk versus the benefit of administering the drug to patients in the indicated population. We will describe considerations involved in optimizing studies to facilitate these evaluations. The design of phase III clinical trials requires careful consideration of multiple factors, which include basic scientific principles as well as the specific medical context within which a drug is developed, including disease characteristics and outcomes as well as other available treatments. We will review some of these factors in our discussion of the appropriate experimental control.
CRAIG MALLINCKRODT VIRGIL WHITMYER Eli Lilly and Company Indianapolis, Indiana
The key to rational drug development is to ask important questions and answer them with appropriate studies in a sequence such that the results of early studies can influence decisions regarding later studies. The sequence of studies in drug development is typically described as having four temporal phases (phases I to IV). The objective of phase III clinical trials is to confirm the preliminary but inconclusive evidence obtained from phase II studies that a drug is safe and effective for the intended indication in the relevant patient population. (The terms ‘‘study’’ and ‘‘trial’’ can be used interchangeably, and we will use them both throughout this article.) Phase III studies provide the primary basis upon which drugs are approved for use by the various regulatory agencies. This article focuses on the key aspects of the design, conduct, and analysis of phase III trials necessary to yield such confirmatory evidence. Many of the concepts discussed in this article are also covered in guideline documents produced by the International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) (1). The ICH guidelines describe general considerations used by the various regulatory agencies around the world when evaluating potential new drugs. These guidance documents therefore also provide the primary criteria used by drug developers when designing their research programs. Drugs may be indicated for use in the treatment of a disease (e.g., antibiotics or statins for hypercholesterolemia), the management of symptoms of a disease (e.g., many pain medications that do not treat the underlying cause of pain), or even the management of side effects of another drug (e.g., prevention of nausea caused by chemotherapy).
1 CHOICE OF CONTROL GROUP
Studies are designed to show that a drug is beneficial, where benefit is defined relative to some alternative. If the outcome and time course of a disease is very predictable, the comparison may be to what has happened in the past, in either treated or untreated patients. Such a study is said to have historical control. Itraconazole (Sporanox) was approved for the treatment of blastomycosis and histoplasmosis based on a comparison with the natural history of untreated cases of these conditions (2, 3). Historical control may also be used if no other treatment for the disease in question is available and the consequences of a disease are severe (4). In another example, weekly risedronate was compared with matched historical controls for the prevention of fractures in patients with osteoporosis (5). Matching is a variation on the historical control in which each patient is compared with a patient in a database who had similar relevant illness, prognostic, and/or demographic characteristics. For a study with historical control to provide good evidence of efficacy, the disease outcome should not be affected by the expectations of the patients who receive the drug,
and the historical outcomes must be good predictors of current outcomes. If evidence existed that the spores causing blastomycosis had developed into a tougher strain than that inhaled by the historical control group, then an historical control group would not have provided valid comparisons in the itraconazole trials. Because the course of many diseases is not uniform or predictable, the benefits of a drug usually cannot be evaluated fairly using an historical control. Therefore, phase III studies typically include a concurrent control. The most common type is placebo control, which has been called the gold standard of experimental design (6, 7). In a placebo-controlled trial, subjects are assigned either to the test treatment or to an indiscernible treatment that does not contain the test drug. The drug and placebo groups are designed to be as similar to one another as possible in every respect, with the exception that the control group receives an inactive agent in place of the drug received by the test group. Use of a placebo control group allows researchers to differentiate the causal drug effect from those effects resulting from the expectation of improvement associated with taking a drug (placebo effect) and from other supportive care provided in the study. Placebo-controlled phase III studies are quite common across nearly every indication, and examples are found in any recent medical journal. The use of placebo control does not imply that the control group is untreated. The test drug and placebo can each be added to a common standard therapy (add-on therapy), which might be pharmacologic or nonpharmacologic (e.g., counseling). In many disease states, placebo-treated patients show appreciable improvement even when placebo is used alone (not as an add-on therapy). An interesting variation on the use of placebo is the use of "active placebo." An active placebo is a drug that has some important side effect usually associated with the test drug, but is known to have no effect on the disease for which the test drug is being tested. Though not a phase III study, low-dose lorazepam was used as an active placebo in a study examining the potential synergy between morphine and gabapentin
for the treatment of pain (8). The investigators argued that the sedating effect of a low dose of lorazepam made it a better choice than inert placebo because sedation was so predictable in the test drugs that its absence would allow investigators and study participants to identify those patients assigned to placebo. An alternative that falls somewhere between historical control and placebo control is to use no-treatment concurrent control, in which subjects are assigned either to test treatment or to no treatment at all. Such a design leaves open the possibility that improvement in the test group is due to features of treatment other than the drug itself, such as nondrug supportive care a patient receives in the study. One reason to accept this reduction in inferential certainty is difficulty in creating a placebo version of the test drug. Dini et al. (9) describe a phase III trial using no-treatment concurrent control for the evaluation of a type of physical therapy called pneumatic compression for patients affected by postmastectomy lymphedema. It was not possible to construct an otherwise identical procedure that did not include the intervention under investigation. An important part of the sponsor's responsibility is to ensure the ethical conduct of trials. For some disease states, exposing patients to placebo in the absence of other treatment or supportive care is clearly not ethical. There are multiple reasons why this may be the case. Studies in oncology, for example, often do not use placebo-only control because it is unethical to withhold treatment from a patient with cancer due to the poor expected outcome without treatment. Similarly, when an adequate standard of care is available, it may become less acceptable to use a placebo control. In the case of the risedronate trial with historical control described previously (5), the investigators argued that placebo control was unethical because adequate treatment alternatives existed. A similar, but more controversial, case has been made in the treatment of major depressive disorder (MDD). Because patients with MDD are at risk for suicide, the cost of leaving MDD untreated can be high. Because the standard of care for MDD includes multiple agents that are effective in the treatment
of MDD, some argue that placebo-controlled trials are no longer justified, and that active controlled trials ought to be used in the future for phase III studies (10, 11). Others have concluded that placebo-treated patients are not at increased risk for suicide in MDD clinical trials (12) and that alternatives to placebo control do not, for reasons we will explain, provide adequate evidence of efficacy in MDD. Although the use of placebo or alternatives is controversial in MDD, situations exist where clearly alternatives to placebo are useful. In such cases, a sponsor might choose to conduct a study using an active or positive control, in which a test drug is compared with a known active drug (13, 14). However, difficulties in interpreting results can arise in active comparator studies. For example, assume a test drug was compared with a known effective drug in a two-arm study with the intent of showing the test drug to be superior to the control. Assume, however, that the results suggested that the drugs were not different. The results tell us very little about the efficacy of the test drug. The equivocal result could arise either because the drugs truly had a similar effect or because the drugs were in fact different but the study was, for some reason, unable to detect that difference. A study may lack assay sensitivity—the ability to detect a true difference—for a variety of reasons. Imprecise estimates of treatment effects (due primarily to an unduly small sample size), flawed methods, or high placebo response are common factors that limit assay sensitivity. Lack of assay sensitivity is common, especially for some psychiatric indications, no matter how carefully planned. For example, in phase III antidepressant clinical trials submitted to regulatory agencies to support drug approvals (and therefore designed in accordance with the ICH guidelines) and for compounds that were ultimately found to be efficacious, only about half the comparisons showed a difference between the known effective antidepressants and placebo (15). Note that it is possible, as in Goldstein et al. (13), to include both a placebo control and an active control within the same study. The inclusion of a placebo arm in a study with an active control allows investigators to
assess whether a failure to distinguish test drug from the active comparator implies ineffectiveness of the test drug or lack of assay sensitivity. If the two active drugs appear to have similar efficacy but at least one of them is significantly different from placebo, this increases confidence that the two drugs are not substantially different because we have evidence that the study was capable of finding differences. On the other hand, if none of the treatments is significantly different from any of the others, this constitutes evidence that the study lacked assay sensitivity, because the active comparator is known to be effective and yet could not be distinguished from placebo. A different type of active control is dose-response concurrent control, in which subjects are assigned to one of several dosages of the same drug. A demonstration of a direct correspondence between higher dose and greater effect is a strong argument for a drug’s efficacy. Dose-response studies can simultaneously demonstrate that a treatment effect exists and establish the relationship between dose and effect. Such a study would be appropriate to demonstrate efficacy in a situation in which placebo control is deemed unethical and no other active comparator is available. A variation of this design involves comparing dosing regimens rather than dosages, such as once versus twice daily dosing. As with other active control studies, this type of control can be combined with others, such as a placebo control.
CONTROLLING BIAS
The discussion so far has centered on the types of comparison that can show a drug provides some benefit relative to other options. An adequate control group is necessary but not sufficient to ensure an adequate trial. One must also ensure that the test group and the control group are similar at the start of the trial so that differences observed during the trial are due to the effect of the test drug, not differences in the populations studied. If patients receiving test drug have a milder form of the disease than patients in the control group, then the test group may look better at endpoint due to the preexisting difference rather than a difference in
the test drug’s efficacy. To avoid such selection bias, patients are typically assigned to groups using randomized allocation. In fact, because many factors, known and unknown, might influence treatment outcome, assigning patients randomly helps ensure that the groups are evenly balanced with respect to the factors that influence outcome. Blinding is another important tool used to prevent differences between test and control groups that are not due to the efficacy of the test drug. To blind a subject is to prevent the subject from knowing which treatment group she is in. The idea behind blinding is to reduce the effect of expectation bias—the phenomenon whereby a subject’s expectations influence her outcome. If a patient knows that she is being treated with placebo, for example, she may be less likely to report that she is feeling better. Blinding is particularly important when the outcome of a study is based on subjective interpretations, such as rating pain on a scale of 0 to 10. In a single-blind study, the treatment assignment is not known by the study participant but is known to those conducting the study. In a double-blind study, the treatment assignment is not known by the study participant, the investigator, or sponsor’s staff involved in the study. Double blinding can help avoid tendencies to provide preferential treatment to one group. For example, if investigators knew which patients were on the test drug, a tendency might exist to do things to help the test drug look better or to provide more supportive care for patients on placebo because it was known these patients were not receiving an active medication. Either behavior would skew the results of a study, though in different directions. 3
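To make the allocation idea concrete, here is a minimal sketch of permuted-block randomization of the kind described above; the block size, arm labels, and kit-numbering scheme are illustrative assumptions rather than details taken from this article. Blocking keeps the arms balanced throughout enrollment, while issuing only sequential kit numbers to sites supports blinding.

```python
# A minimal sketch of permuted-block randomization with blinded kit codes.
# Block size, arm labels, and the kit-numbering scheme are illustrative
# assumptions, not taken from the article.
import random

def block_randomization(n_subjects, block_size=4, arms=("test", "placebo"), seed=2024):
    """Return a treatment list in which every consecutive block is balanced."""
    assert block_size % len(arms) == 0, "block size must be a multiple of the number of arms"
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_subjects:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)          # permute assignments within each block
        schedule.extend(block)
    return schedule[:n_subjects]

# Blinded presentation: sites see only sequential kit numbers, not the arm.
assignments = block_randomization(12)
kit_list = {f"KIT-{i + 1:03d}": arm for i, arm in enumerate(assignments)}
print(assignments)
print(list(kit_list.items())[:4])
```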
EVALUATION OF TREATMENT EFFECT
When a sponsor decides that drug Y is likely to be safe and effective for the treatment of
disease Z, the burden of proof lies with the sponsor. A regulatory agency assumes that drug Y is not effective until it is proven to be effective beyond a reasonable doubt. The fundamental premise behind this system is that it is better to miss out on a safe and effective drug than it is to approve a drug that is not safe and effective. We have described a number of features of adequate study design. Recall that the objective of a phase III study is to show that a test drug provides some benefit relative to some alternative. Let us begin with the gold standard case, the double-blind, randomized, placebo-controlled study. The primary objective of such a study is to confirm the efficacy of drug Y by showing patients treated with drug Y have superior outcomes compared to patients treated with placebo. The null hypothesis, or the starting assumption that the sponsor must disprove, is that patients treated with drug Y do not have superior outcomes to patients treated with placebo. Of course, the sponsor’s actual belief, based on the tentative results of smaller phase II clinical trials, is in the alternative hypothesis that drug Y is effective. The sponsor’s burden is to demonstrate beyond a reasonable doubt that the null hypothesis is false and therefore that the alternative hypothesis is true. To clarify what constitutes reasonable doubt, consider the possible correct or incorrect decisions that can be made regarding the null hypothesis. Based on the possible outcomes we can construct a grid of decisions versus facts, with two possible decisions and two possible facts, as illustrated below:

                                          Fact
Decision made           Drug effective                     Drug not effective
Drug effective          True positive (no error)           False positive (type I error)
Drug not effective      False negative (type II error)     True negative (no error)

Obviously we want to make the correct decision. There are costs for both erroneous decisions, either because patients are not given the option to be treated with an efficacious drug (a false negative) or because they take on the risk and cost of treatment with a drug that is not efficacious (a false positive). However, the need for certainty must be balanced against the
time, expense, and ethics associated with the clinical trial. Requiring a very high degree of certainty, such that the probabilities of making a false-positive or false-negative decision are very low, means we will need to do an extremely large study that will take a long time, be very expensive, and require more patients to be exposed to unproven therapies. Over time, some generally accepted norms have emerged with regard to how certain we need to be in our decisions regarding the outcomes of phase III clinical trials. Most individual phase III clinical trials are set up so that the maximum risk of a false-positive result favoring drug over control that we are willing to tolerate is 2.5% (1 in 40). This is the requirement of statistical significance, meaning that the P-value obtained from the appropriate statistical test must be 0.05 or less in order to conclude the drug is different from the control. A P-value of 0.05 equates to a false-positive rate of 5%, but half the false-positive results would favor the control and half would favor the test drug (0.05/2 = 0.025, or 1 in 40). Given the public health concern associated with false-positive results, regulatory agencies typically require that a drug be proven effective in two independent studies to be approved for marketing. This independent replication lowers the overall risk of a false-positive result to 1/40 × 1/40 = 1/1600. The science of statistics provides the means for estimating the magnitude of the treatment effect, for assessing the uncertainty in the estimate, and for determining the probability of a false-positive result. If the statistical analyses indicate that the probability of a false-positive result is 5% or less (2.5% favoring drug over control), then that phase III clinical trial is declared to be positive and counts as one of the two trials needed for approval. If the risk of a false-positive result is greater than 5%, the study is declared to be not positive. If the sponsor believes this result is a false-negative result, they may pursue additional studies to prove the drug is effective. On the other hand, the risk of a false-negative result in phase III trials is typically capped at 10% to 20%. Statistics also provides the methods for determining how many subjects need to be studied to yield results
with sufficient certainty to provide the proper control of the rate of false-positive and false-negative results. This is referred to as ensuring that the study has adequate power to detect a difference between test drug and placebo. The power of a study is increased by increasing the number of patients or by decreasing the variability in patient outcome. Because the former is more easily controlled, more power is typically achieved by adding more patients. Powering of studies is often dictated by practical and ethical considerations. Increasing the size of the study by a given number of patients has an ever-decreasing effect on the power as power increases. Thus, at some point, appreciably increasing power requires large increases in the size of the study, thereby markedly increasing the cost and duration of the study, and requiring many more patients to be exposed to an unproven treatment (and perhaps placebo).
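The arithmetic behind these conventions can be sketched in a few lines. The sample-size formula below is the standard normal approximation for comparing two means, and the effect sizes and 90% power are illustrative choices, not values taken from the article.

```python
# A hedged sketch of the arithmetic described above: the one-sided 2.5%
# false-positive risk, the 1/1600 risk after two independent positive trials,
# and a normal-approximation sample-size formula for a two-arm comparison of
# means. Effect sizes and 90% power are illustrative values.
from scipy.stats import norm

alpha_two_sided = 0.05
print("false positives favoring drug:", alpha_two_sided / 2)           # 0.025 = 1/40
print("two independent positive trials:", (alpha_two_sided / 2) ** 2)  # 0.000625 = 1/1600

def n_per_arm(effect_size, alpha=0.05, power=0.90):
    """Subjects per arm to detect a standardized mean difference (two-sided test)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

for delta in (0.2, 0.3, 0.5):  # diminishing returns as the target effect shrinks
    print(f"effect size {delta}: ~{n_per_arm(delta):.0f} subjects per arm")
```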
CHOICE OF COMPARISON
Thus far, we have considered proving a drug to be effective beyond a reasonable doubt based on demonstrating its superiority to placebo or an active drug, or by demonstrating a dose-response relationship. However, a new drug may be an important advance if it has efficacy similar to a current drug but has improved safety, tolerability, or dosing. A new drug might also be an important advance if it is, on average, similar in effect to the current drug but is more beneficial in certain patient groups. Such a case is especially possible if the new drug works by a different mechanism of action than the current drug. Accordingly, an alternative to establishing the test drug as superior to placebo or an active treatment is to establish the test drug as not inferior to an active treatment. Examples where establishing noninferiority might constitute an important advance include a new treatment for cancer that targets a specific action in cancer cells and has a response rate similar to but with much milder side-effects than a cytotoxic chemotherapy that attacks all fast growing cells. Another example might be treatments for type II diabetes that provide glucose control similar to standard treatments but do not increase
weight, a side effect that can have multiple consequences, including making it harder in the long run to control the diabetes. Noninferiority trials seek to establish that the test drug is not worse than the active control. Because two treatments are not likely to have exactly the same effects, this is considered to be the case if the outcomes are within some clinically relevant margin of one another (16). An important difference exists between superiority and noninferiority trials with respect to the issue of proof beyond a reasonable doubt. In general, lack of assay sensitivity tends to mute differences between treatments. In superiority trials, lack of assay sensitivity increases the rate of false-negative results. In contrast, lack of assay sensitivity in noninferiority trials tends to bias results toward a conclusion that the test drug is not different from the active control, which in this case would inflate the rate of false-positive results. Recall that in the decision-making framework it is better to have a false-negative result than a false-positive result. Hence, to guard against the possibility of falsely concluding a test drug to be noninferior to an active comparator because the trial lacked assay sensitivity, it is often necessary for a noninferiority trial to explicitly demonstrate assay sensitivity. A common way to do this is to include a placebo control in addition to the active control. If the active control is found to be different from placebo, this proves that the study had enough sensitivity to find a difference as great as that between active control and placebo. This supports the further conclusion that if the difference between test drug and active control were equally great, it too would have been detected. The use of both active and placebo controls also allows pursuit of multiple goals in one trial. For example, a three-arm study with test drug, active, and placebo control could establish superiority of the test drug to placebo and simultaneously evaluate the degree of similarity of efficacy and safety to the active comparator. 5
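As a hedged illustration of how a noninferiority comparison is commonly evaluated, the sketch below checks whether the lower bound of a 95% confidence interval for the difference in response rates stays above a prespecified margin; the response rates, sample sizes, and 10-percentage-point margin are hypothetical numbers, not values from the article.

```python
# A minimal sketch of the confidence-interval approach to noninferiority:
# the test drug is declared noninferior if the lower bound of the two-sided
# 95% CI for (test minus active control) lies above -margin. The response
# rates, sample sizes, and margin below are invented for illustration.
from math import sqrt
from scipy.stats import norm

def noninferiority_ci(p_test, p_control, n_test, n_control, margin, alpha=0.05):
    diff = p_test - p_control
    se = sqrt(p_test * (1 - p_test) / n_test + p_control * (1 - p_control) / n_control)
    z = norm.ppf(1 - alpha / 2)
    lower, upper = diff - z * se, diff + z * se
    return lower, upper, lower > -margin

lower, upper, noninferior = noninferiority_ci(0.63, 0.65, 300, 300, margin=0.10)
print(f"95% CI for difference: ({lower:.3f}, {upper:.3f}); noninferior: {noninferior}")
```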
TYPE OF DESIGN
In addition to controlling bias and determining the type of control and comparison to use,
another important aspect of phase III trials is the type of design. ‘‘Design’’ refers to the pattern of treatments and the order in which subjects receive them over time. 5.1 Parallel Group Design The most common clinical trial design for phase III is the parallel group design, in which subjects are randomized to one of two or more arms and remain on that treatment throughout the trial. Such a design simply assumes that the characteristics of the populations in the various arms are such that, were the treatments equivalent, patient outcomes would be equivalent. Accordingly, any difference at endpoint must be due to some difference between test drug and control. 5.2 Crossover Design In the most simple crossover design, which compares two treatments, each subject receives both treatments in random order, and hence acts as his own control for treatment comparisons. Intuitively, having each patient take both drug A and B and comparing the effects (crossover design) is more reliable than comparing drug A in one group and drug B in a completely separate group (parallel design), because it is more plausible that for the same patient, her outcome would be the same if the two test drugs were equivalent. Indeed, the crossover designs can reduce the number of subjects required to achieve the same level of precision compared with a parallel group design. The crossover design can be used to compare three or more treatments, in which case each patient gets multiple test treatments, though not necessarily all of them. Treatment periods in a crossover design are usually separated by a washout period where patients take no test medications so that they can return to the baseline condition. However, in many disease states and for many medications, carryover effects exist that make it difficult for subjects to return to their baseline condition, which can invalidate results from the later phases of the study. Therefore, in order for a cross-over design to be valid the disease under study must be chronic and stable, the effects of
the drug must fully develop within the treatment period, and the washout periods must be sufficiently long for complete reversal of the drug effect. Asthma clinical trials can satisfy these assumptions, and crossover designs have been used in this domain (17). 5.3 Factorial Arrangement of Treatments In a factorial design, two or more treatments are evaluated within a single study by varying combinations of the treatments. The simplest example is the 2 × 2 factorial design in which subjects are randomly allocated to one of the four possible combinations of two treatments, A and B. These combinations are A alone, B alone, both A and B, and neither A nor B. In many cases, this design is used for the specific purpose of examining the way medications A and B interact with each other (synergistic or antagonistic effects). A factorial arrangement of treatments can be used with either a parallel design or crossover design. The Bypass Angioplasty Revascularization Investigation 2 Diabetes study (18) is an example of a parallel group 2 × 2 factorial design. This study compared revascularization combined with aggressive medical treatment versus aggressive medical treatment alone (factor A), while simultaneously comparing two glycemic control strategies, insulin sensitization versus insulin provision (factor B) in patients with type 2 diabetes mellitus and angiographically documented stable coronary artery disease. 6
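The following toy simulation, with invented effect sizes, noise level, and sample size, shows how a 2 × 2 factorial arrangement lets a single study estimate both main effects and their interaction from the four cell means.

```python
# A toy sketch of how a 2 x 2 factorial arrangement lets one study estimate
# two main effects and their interaction. All numerical values are invented
# for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 200  # subjects per cell
effects = {"A": 2.0, "B": 1.0, "AB": 0.5}  # assumed true main effects and synergy

cells = {}
for a in (0, 1):
    for b in (0, 1):
        mean = effects["A"] * a + effects["B"] * b + effects["AB"] * a * b
        cells[(a, b)] = rng.normal(loc=mean, scale=3.0, size=n)

m = {k: v.mean() for k, v in cells.items()}
main_a = ((m[1, 0] + m[1, 1]) - (m[0, 0] + m[0, 1])) / 2
main_b = ((m[0, 1] + m[1, 1]) - (m[0, 0] + m[1, 0])) / 2
interaction = (m[1, 1] - m[1, 0]) - (m[0, 1] - m[0, 0])
print(f"main effect A ~ {main_a:.2f}, main effect B ~ {main_b:.2f}, interaction ~ {interaction:.2f}")
```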
EVALUATING RISK
The features of study design that have been discussed thus far have generally been related to the demonstration of benefit, by way of demonstrating efficacy for some indication. Although demonstrating efficacy is the primary objective of many phase III studies, a drug’s benefit must be weighed against the risk associated with the compound as well. Many drugs are associated with common adverse events (as or more frequently than 1/10 patients taking the drug), which might include nausea, somnolence, headache, excessive sweating, or any number of other conditions. When the consequences of these events
are not severe, the adverse events are noted but not regarded by regulatory agencies as grounds for nonapproval because the benefits of the drug outweigh the risks. Other adverse events may be more severe, and their consequences can outweigh the benefit from treating the disease. In these cases, regulators are likely not to approve the drug for marketing. In such cases, the frequency of an adverse event might be quite low (on the order of 1/10,000), but the cost when it does occur may be so great that the agency nevertheless will not approve the drug. For example, a 10% risk of transient pruritus might be acceptable, but a much smaller risk of seizure or fatal cardiac arrhythmia may be deemed unacceptable. A 1% risk of seizures might be acceptable in a treatment for a highly fatal cancer, but it would likely be unacceptable for less dangerous indications (hair loss, erectile dysfunction) or in diseases for which approved treatments without such risks are available. Although a detailed discussion of assessing the safety of drugs is beyond the scope of this article, it is important to bear in mind a few important facts. Regulatory agencies make decisions about the approval or nonapproval of drugs from a public health standpoint. Is the overall health of the public best served by having this drug available as a treatment option, or is it best served by not having the drug as an option? It would certainly be comforting if regulatory agencies could guarantee the safety of medicines. But quantifying safety risks is extremely difficult, especially for rare events. The ‘‘rule of threes’’ is a useful guide for determining the likelihood of detecting rare events. For example, based on statistical considerations, if the true rate of some serious adverse event is 1 in 1000, you would need to observe 3000 patients to be fairly certain to have observed that one event. Alternatively stated, you need to observe 3000 patients without seeing a single case of a particular adverse event in order to reasonably conclude the incidence of that adverse event is less than 1/1000. This means that to quantify all the risks, especially for rare events, within the context of controlled clinical trials, requires more patient exposure than is practically possible
before approving a drug. There are always risks associated with medicines, some known and others not. We can reduce the unknown risks by withholding access to drugs until they have been used in very large numbers of patients. For example, we could wait to approve a drug until after it had been approved and used in another country or region for several years. But in the mean time, patients with the disease would not have access to the benefits of the drug. The choice of being an early adopter or a late adopter of drugs is a policy debate, not a scientific debate. To have useful discussions on this topic, it is important to realize that phase III clinical trials simply cannot ‘‘guarantee’’ that a medicine is safe. 7
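The ‘‘rule of threes’’ mentioned above can be checked directly; the sketch below compares the 3/n approximation with the exact binomial bound and computes the chance of seeing at least one event in 3000 patients when the true rate is 1 in 1000.

```python
# A short check of the "rule of threes": if no events are seen in n patients,
# an approximate upper 95% confidence bound on the true event rate is 3/n;
# equivalently, about 3/p patients with zero events are needed to be reasonably
# confident the rate is below p. The exact binomial version is shown alongside.

def exact_upper_bound(n, confidence=0.95):
    """Upper confidence bound on the event rate when 0 events are observed in n patients."""
    return 1 - (1 - confidence) ** (1 / n)   # solves (1 - p)^n = 1 - confidence

for n in (300, 3000, 30000):
    print(f"n = {n:>5}: rule of three -> {3 / n:.5f}, exact -> {exact_upper_bound(n):.5f}")

# Probability of seeing at least one event when the true rate is 1/1000:
p, n = 1 / 1000, 3000
print("P(at least one event in 3000 patients):", 1 - (1 - p) ** n)  # ~0.95
```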
DISCUSSION
Our discussion of research methods in phase III clinical trials has raised a number of considerations. We’d prefer to not expose patients to placebo, but placebo control enhances our ability to interpret treatment effects whereas active control may lead to more difficulty in interpretation. A crossover design can yield reliable information using fewer patients than a parallel design, but carryover effects limit the applicability of crossover designs. We’d like to be sure of the safety of new drugs, but gathering the enormous volume of data to achieve reliable estimates of risk for rare events could mean withholding a useful treatment from patients who would benefit from it. Clearly, no universally best approach to phase III research exists. Optimizing phase III research relies on management of the inevitable trade-offs in study design, to the extent this is possible. This can involve choice of design features as well as careful consideration of features such as the study population. For example, the risk of suicide is a concern in MDD clinical trials, so we don’t want to expose patients to ineffective therapies. However, placebo enhances our ability to understand treatment effects. This tradeoff can be managed by including placebo but rigorously screening patients at study entry and admitting only those deemed to not be at risk of suicide. Such a study has a rigorous
control group, but may not be representative of a portion of the population with MDD. For reasons such as this, even the best designed trials have limitations. Any one trial may address its primary hypotheses with little or no ambiguity while other important questions remain unresolved. Although we may wish that a single trial could answer all the important questions, this is simply not possible. Rather, we use information from the current trial coupled with knowledge from earlier trials to continually reevaluate what is known. With this knowledge, we then reconsider what important questions remain unresolved, and continue to plan and conduct appropriate studies to address the unresolved questions.
REFERENCES 1. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. Efficacy Guidelines. Updated March 9, 2007. Available at: http://www.ich.org/cache/compo/475272-1.html 2. Sporanox (itraconazole) capsules. Prescribing information. Titusville, NJ: Janssen Pharmaceutica Products, June 2006. 3. A. Glasmacher, C. Hahn, E. Molitor, T. Sauerbruch, I. G. Schmidt-Wolf, and G. Marklein. Fungal surveillance cultures during antifungal prophylaxis with itraconazole in neutropenic patients with acute leukaemia. Mycoses. 1999; 42: 395–402. 4. E. A. Gehan. Design of controlled clinical trials: use of historical controls. Cancer Treat Rep. 1982; 66: 1089–1093. 5. N. B. Watts, R. Lindsay, Z. Li, C. Kasibhatla, and J. Brown, Use of matched historical controls to evaluate the anti-fracture efficacy of once-a-week risedronate. Osteoporos Int. 2003; 14: 437–441. 6. D. Hauschke and I. Pigeot, Establishing efficacy of a new experimental treatment in the ‘‘gold standard’’ design. Biom. J. 2005; 47: 782–786; discussion 787–798. 7. J. C. Rains and D. B. Penzien, Behavioral research and the double-blind placebo-controlled methodology: challenges in applying the biomedical standard to behavioral headache research. Headache. 2005; 45: 479–486.
8. I. Gilron, J. Bailey, T. Dongsheng, R. Holden, D. Weaver, and R. Houlden, Morphine, gabapentin, or their combination for neuropathic pain. N. Engl. J. Med. 2005; 352: 1324–1334. 9. D. Dini, L. Del Mastro, A. Gozza, R. Lionetto, O. Garrone, et al., The role of pneumatic compression in the treatment of postmastectomy lymphedema. A randomized phase III study. Ann. Oncol. 1998; 9: 187–190. 10. M. Enserink, Psychiatry: are placebo-controlled drug trials ethical? Science. 2000; 288: 416. 11. K. J. Rothman and K. B. Michels, The continuing unethical use of placebo controls. N. Engl. J. Med. 1994; 331: 394–398. 12. A. Khan, H. Warner, and W. Brown, Symptom reduction and suicide risk in patients treated with placebo in antidepressant clinical trials. Arch. Gen. Psychiatry. 2000; 57: 311–317. 13. D. J. Goldstein, C. H. Mallinckrodt, Y. Lu, and M. A. Demitrack, Duloxetine in the treatment of major depression: a double-blind clinical trial. J. Clin. Psychiatry. 2002; 63: 225–231. 14. R. Kane, A. Farrell, R. Sridhara, and R. Pazdur, United States Food and Drug Administration approval summary: bortezomib for the treatment of progressive multiple myeloma after one prior therapy. Clin. Cancer Res. 2006; 12: 2955–2960.
15. A. Khan, M. Detke, S. Khan, and C. Mallinckrodt, Placebo response and antidepressant clinical trial outcome. J. Nerv. Ment. Dis. 2003; 191: 211–218. 16. L. L. Laster and M. F. Johnson, Noninferiority trials: the ‘‘at least as good as’’ criterion. Stat. Med. 2003; 22: 187–200. 17. J. D. Brannan, S. D. Anderson, C. P. Perry, R. Freed-Martens, A. R. Lassig, B. Charlton, and the Aridol Study Group. The safety and efficacy of inhaled dry powder mannitol as a bronchial provocation test for airway hyperresponsiveness: a phase 3 comparison study with hypertonic (4.5%) saline. Respir. Res. 2005; 6: 144. 18. B. M. Mori, R. L. Frye, S. Genuth, K. M. Detre, R. Nesto, et al., for the BARI 2D Trial. Hypotheses, design, and methods for the Bypass Angioplasty Revascularization Investigation 2 Diabetes (BARI 2D) trial. Am. J. Cardiol. 2006; 97(Suppl): 9G–19G.
CROSS-REFERENCES Phase I Trials Phase II Trials Phase IV Trials Non-inferiority Trials
PHASE II TRIALS
OLIVER N. KEENE
PROOF-OF-CONCEPT (PHASE IIA) TRIALS
The initial proof that a new chemical entity (NCE) provides benefit to patients with the disease of interest is often referred to as ‘‘proof of concept.’’ This is an important step for pharmaceutical companies as the level of investment to take an NCE to the next stages (phase IIb dose-ranging trials, phase III trials) is considerable. If development of a drug is halted at an early stage, then considerable savings are made compared with ceasing development later in the process (1, 2). An important issue in proof-of-concept studies is regression to the mean. This occurs because companies will tend to progress the drugs with larger observed effect sizes compared with those that have smaller ones; for example, in a particular indication, if a drug with a true effect size of x must show an observed effect size of at least y to be progressed, then the average observed effect size among progressed drugs will be greater than x in their proof-of-concept studies. This effect is hard to quantify as the extent of the issue will vary; nevertheless, the tendency will be for proof-of-concept studies to overstate the effect size subsequently observed in later trials.
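A toy simulation of this selection effect is shown below; the true effect, the sampling error of a small study, and the go threshold are invented values, intended only to illustrate why observed proof-of-concept effects overstate the true effect among progressed compounds.

```python
# A toy simulation of the selection effect described above: when only compounds
# whose observed proof-of-concept effect clears a "go" threshold are progressed,
# the average observed effect among progressed compounds exceeds the true
# effect. All numerical values are invented.
import numpy as np

rng = np.random.default_rng(42)
true_effect = 0.25          # same true standardized effect for every compound
obs_se = 0.20               # sampling error of a small proof-of-concept study
go_threshold = 0.30         # observed effect needed to progress

observed = rng.normal(true_effect, obs_se, size=100_000)
progressed = observed[observed >= go_threshold]

print("share of compounds progressed:", round(progressed.size / observed.size, 2))
print("mean observed effect among progressed:", round(progressed.mean(), 2))  # > 0.25
```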
GlaxoSmithKline Research and Development Greenford, Middlesex, United Kingdom
Design of phase II studies represents a challenge. For proof of concept, sufficient data are needed to allow further progression without imposing impractical targets at this stage of development. For dose-ranging studies, a broad exploration of the dose-response curve is required. Opportunities exist within phase II trials to make use of sequential and adaptive designs where important changes to the trial such as dropping treatment arms or reevaluating sample size are made as a result of interim analyses. Traditional approaches to the analysis of phase II trials that rely on significance testing to compare doses and/or establish efficacy are increasingly being replaced by methods that model the dose-response curve and by analysis based on Bayesian methods. Phase II trials typically represent the first time a new molecule or intervention is investigated in patients with the disease of interest. Before this, the compound will usually have been studied in healthy volunteers (phase I trials), and there may be some proof of pharmacological effect, but it is not until phase II that efficacy in the disease can be explored. The role of phase II trials is to provide initial proof of efficacy, reassurance on safety, and guidance on appropriate doses and dosing regimens before more comprehensive testing of the intervention in phase III trials. Trials in this phase are considered exploratory or for learning, in contrast to phase III trials which are confirmatory. Studies that specifically address the issue of initial evidence of efficacy and safety in the disease in question are often referred to as proof of concept or phase IIa trials; those that investigate different doses or dosing regimens are correspondingly called dose-ranging or phase IIb trials (although these definitions are not universal). Both aims can sometimes be addressed within a single study. These issues are, however, separated for simplicity in this discussion.
1.1 Design Phase IIa trials are typically smaller than phase III trials (20 to 200 patients). They attempt to give the NCE its best chance of showing efficacy compared with a control or placebo treatment. The duration of a phase IIa trial is limited even for a chronic disease and would typically be 1 month or less. Such studies may be completed in a single center or a small number of centers. The use of a single dose reduces study size and can be appropriate where the objective is solely to assess whether the drug has potential and where a later dose-ranging study is planned. This needs to be assessed against the extra information provided from including more than one dose. Doses for phase II studies are chosen based on pharmacokinetic (PK) exposure, results of animal models, and effects on biomarkers or surrogate endpoints in healthy volunteers.
1.2 Sample Size
1.4 Interpretation of Efficacy Results
Determining the sample size for a proof-of-concept study is not straightforward (3). The size of difference to be detected may be unclear, or it may be that power calculations imply a recruitment target that is impractical at this stage of the drug’s development. Increasing the sample size in principle increases the amount of information obtained from the trial, but ‘‘it would not make sense to increase the sample size to the point comparable to that for a confirmatory trial’’ (3). Proof-of-concept studies need to be viewed in the wider context of the clinical development plan for the drug (4). It is helpful to predefine the criteria in terms of efficacy and safety outcomes that would support further progression of the compound (‘‘go/no-go’’ criteria). Such criteria can also be informative in terms of sample size, in that the probability of a ‘‘go’’ decision in a particular trial can be calculated based on predictions of the true efficacy of the drug. Before a phase IIa study, safety data are only available from volunteers. Therefore, the populations initially studied will typically exclude patients who are more at risk of adverse events, such as the elderly or those with other serious diseases. Assessment of safety in these more selected phase II populations can enable the inclusion of more vulnerable patients in phase III.
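As a sketch of how the probability of a ‘‘go’’ decision can be computed from an assumed true effect, the code below uses a simple normal model for the observed standardized difference; the go rule, sample size, and candidate true effects are hypothetical choices, not values from the article.

```python
# A hedged sketch of computing the probability of meeting a "go" criterion in a
# small proof-of-concept trial, given an assumed true effect. The go rule
# (observed standardized effect >= 0.3), sample size, and true effects are
# hypothetical.
from math import sqrt
from scipy.stats import norm

def prob_go(true_effect, n_per_arm, go_threshold=0.3):
    """P(observed standardized difference >= go_threshold) under a normal model."""
    se = sqrt(2.0 / n_per_arm)               # approximate SE of a standardized mean difference
    return 1 - norm.cdf((go_threshold - true_effect) / se)

for true_effect in (0.0, 0.2, 0.4, 0.6):
    print(f"true effect {true_effect:.1f}: P(go) = {prob_go(true_effect, n_per_arm=30):.2f}")
```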
Interpretation of the outcome of an individual proof-of-concept study involves considerably more than a simple significance test on the primary endpoint. Proof-of-concept studies should emphasize both type I and type II errors. An appropriate balance should be struck between the risk of progressing a nonefficacious drug and the risk of discarding a drug of potential benefit. P-value comparison with significance levels greater than the conventional two-sided 5% level (one-sided 2.5%) may be required. For example, use of a onesided 10% level may be justified (7). Increasing the type I error rate can also allow trials to progress with the reduced sample size required in early phase studies. Estimates of effect size and confidence intervals are critical. For positive studies, these describe the range of treatment effects consistent with the observed data and can guide the team toward a true understanding of the value of the opportunity. Similarly for less positive or negative studies, confidence intervals play a critical role in assessing the risk of stopping development by providing an estimate of the drug’s potential. They are important when design assumptions were overly optimistic, and they can expose the limitations of small studies. If Bayesian rather than frequentist inference is employed, then posterior probabilities play a similar role. A full exploration of the data from an individual phase II study is generally required in addition to analysis of the primary endpoint. Interpretation of significance on secondary endpoints or on subgroups needs to be guided by the extent to which the analyses are preplanned. Where a rigorous analysis plan has been put in place, which appropriately accounts for multiplicity, then such analyses are reliable. Where extensive post hoc analysis has been performed or where there is no explicit plan to account for multiplicity, it is important for all findings from such analyses to be placed in the context of the number of investigations. Because an appropriate assessment of risk is difficult in these situations, careful consideration should always be given to running a second proof-of-concept study to confirm these exploratory analyses.
1.3 Safety and Pharmacokinetic Data Collection of safety data at this stage is key. Often these trials will include intensive safety monitoring, particularly if there are any unresolved safety issues from the preclinical or phase I data. It is also advantageous to include collection of pharmacokinetic samples to enable development of a model for the relationship between the pharmacokinetics (PK) of the drug and the pharmacodynamic (PD) outcome. Trials will often include interim analyses. These analyses allow examination of accumulating safety data to ensure that the safety of the participants is protected. In addition, there may be preplanned futility analyses so that a trial can be stopped if the efficacy is particularly unpromising (5, 6).
1.5 Role of Bayesian Methods
2.1 Design
Bayesian analysis has potential to aid interpretation of phase II trials (8). Explicit quantification of the probability that the drug achieves a given effect size is especially informative. The credibility of these analyses is enhanced if the methodology is applied with the same rigor as is seen for frequentist analyses (i.e., prespecification of priors and models). Choice of prior distribution might be seen as problematic when little is known about the effect of the drug. An appropriate strategy is to use so-called archetypal priors (9), one representing skepticism and one representing enthusiasm. The extent to which the results of analyses using the two priors converge indicates how much confidence can be placed in the outcome of the trial.
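A minimal conjugate-normal sketch of the archetypal-priors idea is given below; the observed effect, its standard error, and the two prior distributions are illustrative assumptions.

```python
# A minimal conjugate-normal sketch of "archetypal priors": the same trial
# result is combined with a skeptical prior (centered on no effect) and an
# enthusiastic prior (centered on a worthwhile effect). Values are invented.
from scipy.stats import norm

def posterior(prior_mean, prior_sd, obs_effect, obs_se):
    w_prior, w_data = 1 / prior_sd**2, 1 / obs_se**2
    post_var = 1 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * obs_effect)
    return post_mean, post_var**0.5

obs_effect, obs_se = 4.0, 2.0            # observed treatment effect and its standard error
for label, mean, sd in [("skeptical", 0.0, 2.0), ("enthusiastic", 5.0, 2.0)]:
    m, s = posterior(mean, sd, obs_effect, obs_se)
    p_benefit = 1 - norm.cdf(0.0, loc=m, scale=s)
    print(f"{label:>12}: posterior mean {m:.1f}, P(effect > 0) = {p_benefit:.2f}")
```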
Dose-response information has been typically obtained from parallel group trials where each patient is given a different dose or placebo. Other designs used include crossover studies where each patient receives more than one dose in a random order (11) or titration designs where patients receive doses of increasing magnitude (12). A key decision in dose-response studies is the number of doses to include. To fully investigate the top and bottom of the doseresponse curve, it is necessary to study a range of doses. A key shortcoming of clinical drug development is often insufficient exploration of the dose response (13). A study with two doses can provide some information, but in practice a minimum of three doses is needed. Preferably however, a dose-ranging study should study at least four doses as well as placebo to fully explore the dose-response curve and to better assess the PK/PD relationship. To justify a particular dose, it is generally necessary to show that a lower dose would not provide similar efficacy, which means in general the chosen dose cannot be the lowest dose used in the study. One common problem is that doses are chosen to be too close together to distinguish responses (10). As well as the treatment arms of an experimental drug, a placebo or comparator is usually included to measure the absolute size of the drug effect (10).
1.6 Interpretation of Safety Results Before further progression, a comprehensive review of safety data is essential. Safety signals can appear from adverse event data, laboratory data, vital signs, and other specific measurements. Where major problems are identified, this can lead to termination of the drug. When definitive evidence of causation of the signal by the intervention is not available, expert evaluation of this emergent safety data is required to ensure that appropriate measures are taken in future studies to minimise the risk to patients and to further evaluate the signal.
2.2 Interpretation of Efficacy Results 2
DOSE-RANGING (PHASE IIB) TRIALS
The purpose of dose-ranging trials is to investigate how the dose of the drug affects both efficacy and safety outcomes. The outcome of a successful dose-ranging study is to determine the dose or doses to be taken forward to phase III trials. A particular difficulty, however, is that for an efficacious drug ‘‘any given dose provides a mixture of desirable and undesirable effects, with no single dose necessarily optimal for all patients’’ (10). As well as providing information on dose, these trials can also address the appropriate dosing regimen, such as whether a drug is better given once a day or twice a day.
Analysis of dose-response trials is controversial. Simple analysis compares doses on a pairwise basis. This approach does not make full use of the information in the trial as it takes no account of the ordering of the doses. However, regulatory requirements can be interpreted as requiring this; the relevant guidance states, ‘‘It should be demonstrated . . . that the lowest dose(s) tested, if it is to be recommended, has a statistically significant . . . effect’’ (10). In principle, if the study includes four or more doses, more information can come from applying a model to the dose-response relationship (14). If such a model establishes that increasing dose leads to increasing efficacy,
then this should be sufficient (provided that the differences in efficacy observed are of clinical interest). Use of a modeling approach allows use of a greater number of doses as opposed to a pairwise comparison approach which needs sufficient sample size at each individual dose to have adequate power. One strategy is to employ both analyses, with one designated as the primary analysis and the other as a supporting analysis. 2.3 Dropping/Adding Treatment Arms A limitation of a traditional dose-ranging trial is the need to fix a specific limited number of doses in advance. A more flexible approach to dose ranging can be achieved by using what has become known as an ‘‘adaptive design.’’ In this approach, a trial will have one or more interim analyses before the final analysis. At the interim stage, doses can be added or dropped (13, 15). Most examples published to date have been of trials that started with a range of doses and then chose one or two more to progress to the final stage (16). In this situation, as Wang et al. (17) note, ‘‘distributing the remaining sample size of the discontinued arm to other treatment arms can substantially improve the statistical power of identifying a superior treatment arm.’’ The most innovative and well-quoted example of using an adaptive design for dose ranging is the Acute Stroke Therapy by Inhibition of Neutrophils (ASTIN) trial (18, 19) in stroke, which not only adaptively selected doses at multiple interim analyses but also employed Bayesian rather than traditional (frequentist) statistical analysis. The trial terminated early for futility, saving considerable development costs. 2.4 Interpretation of Safety Results As for proof-of-concept trials, a detailed assessment is required, particularly of any safety signals identified. Where possible, the relationship of these signals to the dose administered needs to be reviewed in conjunction with the associated efficacy relationship to determine the optimum doses to take forward into phase III.
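As one hedged example of the model-based analysis of dose-response data discussed above, the sketch below fits a three-parameter Emax curve to simulated group means; the Emax form, the dose levels, and the data are assumptions chosen for illustration and are not prescribed by the article.

```python
# A sketch of a model-based dose-response analysis: fitting a three-parameter
# Emax curve to group mean responses. The model form, doses, and simulated
# data are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def emax(dose, e0, emax_, ed50):
    return e0 + emax_ * dose / (ed50 + dose)

doses = np.array([0.0, 5.0, 25.0, 50.0, 100.0])          # placebo plus four dose levels
rng = np.random.default_rng(1)
observed = emax(doses, e0=1.0, emax_=8.0, ed50=20.0) + rng.normal(0, 0.4, doses.size)

params, _ = curve_fit(emax, doses, observed, p0=[0.0, 5.0, 10.0])
e0_hat, emax_hat, ed50_hat = params
print(f"E0 ~ {e0_hat:.1f}, Emax ~ {emax_hat:.1f}, ED50 ~ {ed50_hat:.1f}")
print("predicted response at 50 mg:", round(emax(50.0, *params), 2))
```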
3 EFFICACY ENDPOINTS Efficacy endpoints in phase II are often different from standard phase III outcomes. To be useful, they need to be capable of predicting the response in phase III; such endpoints are typically described as surrogate endpoints (20–22). The value of a surrogate is to be less variable and/or to provide a more speedy outcome than a corresponding longer term clinical endpoint. One particular issue with a surrogate endpoint is defining the size of a clinically relevant effect. Another common approach is to use the same endpoints as phase III but measured at an earlier time point after a shorter treatment duration. In some disease areas, where surrogate endpoints are not available, phase II studies can be large (n > 100) and rely on designs that are similar to confirmatory phase III studies—that is, same primary endpoint, similar inclusion criteria, and similar length of treatment and length of assessment. In these circumstances, where clear evidence of useful effect is shown, such a study has the potential to become pivotal in a regulatory submission. Such an interpretation is aided when the potential for the study to become pivotal is identified in advance to regulatory agencies. A trial that simultaneously combines dose selection (possibly at an interim analysis) and provides evidence of efficacy in a clinically important outcome is often described as a phase II/III trial (23). 4 ONCOLOGY PHASE II TRIALS Phase II trials in oncology have a different background from trials in other areas. Because of the special issues involved, they have historically developed in different ways (24, 25). In particular, oncology trials make much greater use of single–arm, open-labeled studies to determine the initial proof of efficacy and the appropriate doses for phase III. Bayesian methods have proved particularly valuable in this area (26). REFERENCES 1. S. Senn, Some statistical issues in project prioritization in the pharmaceutical industry. Stat Med. 1996; 15: 2689–2702.
PHASE II TRIALS 2. S. Senn, Further statistical issues in project prioritization in the pharmaceutical industry. Drug Inf J. 1998; 32: 253–259. 3. C. Chuang-Stein, Sample size and the probability of a successful trial. Pharm Stat. 2006; 5: 305–309. 4. S. A. Julious and D. Swank, Moving statistics beyond the individual clinical trial: applying decision science to optimise a clinical development plan. Pharm Stat. 2005; 4: 37–46. 5. J. M. Lachin, A review of methods for futility stopping based on conditional power. Stat Med. 2005; 24: 2747–2764. 6. S. Snappin, M. G. Chen, Q. Jiang, and T. Koutsoukos, Assessment of futility in clinical trials. Pharm Stat. 2006; 5: 273–281. 7. R. M. Simon, S. M. Steinberg, M. Hamilton, A. Hildesheim, S. Khleif, et al., Clinical trial designs for the early clinical development of therapeutic cancer vaccines. J Clin Oncol. 2001; 19: 1848–1854. 8. N. Stallard, J. Whitehead, and S. Cleall, Decision-making in a phase II clinical trial: a new approach combining Bayesian and frequentist concepts. Pharm Stat. 2005; 4: 119–128. 9. D. Spiegelhalter, L. S. Freedman, and M. K. B. Parmar, Bayesian approaches to randomised controlled trials. J R Stat Soc Ser A Stat Soc. 1994; 157: 357–416. 10. International Conference on Harmonization (ICH). E4: Dose response information to support drug registration. Fed Regist. 1994; 59: 55972–55976. 11. S. J. Ruberg, Dose response studies. I. Some design considerations. J Biopharm Stat. 1995; 5: 1–14. 12. L. Z. Shen, M. S. Fineman, and A. Baron, Design and analysis of dose escalation studies to mitigate dose-limiting adverse effects. Drug Inf J. 2006; 40: 69–78. 13. B. Gaydos, K. Krams, I. T. Perevozskaya, F. Bretz, Q. Liu, et al., Adaptive dose-response studies. Drug Inf J. 2006; 40: 451–461. 14. S. J. Ruberg, Dose response studies. II. Analysis and interpretation. J Biopharm Stat. 1995; 5: 15–42. 15. A. J. Phillips and O. N. Keene, Adaptive designs for pivotal trials: discussion points from the PSI adaptive design expert group. Pharm Stat. 2006; 5: 61–66. 16. N. Stallard and S. Todd, Sequential designs for phase III clinical trials incorporating treatment selection. Stat Med. 2003; 22: 689–703.
17. S. J. Wang, H. M. J. Hung, and R. T. O’Neill, Adapting the sample size planning of a phase III trial based on phase II data. Pharm Stat. 2006; 5: 85–97. 18. M. Krams, K. Lees, W. Hacke, A. Grieve, J. M. Orgogozo, and G. Ford, Acute Stroke Therapy by Inhibition of Neutrophils (ASTIN): an adaptive dose-response study of UK-279,276 in acute ischemic stroke. Stroke. 2003; 34: 2543–2548. 19. D. A. Berry, P. Mueller, A. P. Grieve, M. K. Smith, T. Parke, and M. Krams, Bayesian designs for dose-ranging drug trials. In: C. Gatsonis, B. Carlin, and A. Carriquiry (eds.), Case Studies in Bayesian Statistics V. New York: Springer-Verlag, 2001, pp. 99–181. 20. T. R. Fleming, Surrogate endpoints in clinical trials. Drug Inf J. 1996; 30: 545–551. 21. L. J. Lesko and A. J. Atkinson, Jr., Use of biomarkers and surrogate endpoints in drug development and regulatory decision making: criteria, validation, strategies. Annu Rev Pharmacol Toxicol. 2001; 41: 347–66. 22. V. G. De Gruttola, P. Clax, D. L. DeMets, G. J. Downing, S. S. Ellenberg, et al., Considerations in the evaluation of surrogate endpoints in clinical trials: summary of a National Institutes of Health workshop. Control Clin Trials. 2001; 22: 485–502. 23. B. W. Sandage, Balancing phase II and III efficacy trials. Drug Inf J. 1998; 32: 977–908. 24. T. R. Fleming, One sample multiple testing procedures for phase II clinical trials. Biometrics. 1982; 43: 143–151. 25. R. Simon, Optimal two-stage designs for Phase II clinical trials. Control Clin Trials. 1989; 10: 1–10. 26. P. F. Thall and R. Simon, Practical Bayesian guidelines for phase IIB clinical trials. Biometrics. 1994; 50: 337–349.
FURTHER READING S. Senn, Statistical Issues in Drug Development. Chichester, UK: Wiley, 1997, Chapter 20. S. J. Pocock, Clinical Trials: A Practical Approach. Chichester, UK: Wiley, 1983.
CROSS-REFERENCES Dose-response trial Phase II/III trials Adaptive designs End of phase II meeting (FDA)
PHASE I TRIALS
SYLVIE CHEVRET
U717 Inserm, France
Phase I includes the initial introduction of an investigational new drug into humans, aiming at developing a safe and efficient drug administration. These first-in-human studies are closely monitored and are usually conducted in an inpatient clinic, where a small number of healthy volunteer subjects (usually, 20–80) can be observed by full-time medical staff. The subject is usually observed until several half-lives of the drug have passed. These studies are designed to determine the metabolic and pharmacologic actions of the drug in humans, the side effects associated with increasing doses, and, if possible, to gain early evidence on effectiveness. Thus, sufficient information about the drug’s pharmacokinetics and pharmacological effects should be obtained during these trials to permit the design of well-controlled, scientifically valid, Phase II studies. However, some circumstances can occur when toxicity is anticipated and patients are used, such as with oncology (cancer) and HIV drug trials (1,2). In these settings, risk is regarded as a necessary price to pay for the chance of benefit (3). Thus, drug therapy is considered tolerable if the toxicity is acceptable. Therefore, the specific goal of such Phase I clinical trials is to find the maximum tolerated dose (MTD) of the new drug for a prespecified mode of administration. Because cancer Phase I trials are considered among the most risky in all of medicine, the design of these trials is surrounded by ethics. Actually, cancer Phase I trials are usually offered to patients who have had other types of therapy and who have few, if any, other treatment choices. They normally include dose-finding studies so that doses for clinical use can be refined, although some other Phase I trials test new combinations or new dose schedules of drugs that are already shown to be effective. The tested range of doses is usually a small fraction of the dose that causes harm in animal testing. Scientists start by giving a low dose to a small group of people, a slightly larger dose to the next group, and so on, while closely monitoring the patients for side effects. Escalation continues until a predefined percentage of patients experience unacceptable toxic effects at a given dose level. Nevertheless, such rule-guided designs have been criticized because of potential excessive toxicity. By contrast, model-based designs have been proposed to estimate the MTD directly as the highest dose at which a specified probability of toxic response is not exceeded, which is some prespecified percentile of the monotonic dose-response curve, to be estimated from data. The most popular model-based design is the Continual Reassessment Method (CRM) (4), with a wide literature devoted to extensions and improvements of its original design. Finally, whatever their design, cancer Phase I trials are mainly intended to determine the MTD and the toxicity, not to show whether the treatment is effective. Nevertheless, because patients are used, the most recent dose-finding trials in cancer have attempted to use combined Phase I/II designs (5). We present the Phase I clinical trials conducted in healthy volunteers and oncology patients, respectively, highlighting their differences. Some of the most often used designs are described.
PHASE I IN HEALTHY VOLUNTEERS
For non-life-threatening diseases, Phase I trials are usually conducted on healthy volunteers. For many compounds, it is anticipated that adverse events will be few, and they will be noted if they occur, but the principal observations of interest will be pharmacokinetic (PK) and pharmacodynamic (PD) measures. These measures allow investigators to identify evidence of the intended physiological effect. Little consensus is observed in the design of these first-in-human studies. Buoen et al. (6) reviewed a total of 3323 healthy volunteer trials published in the five major clinical pharmacology journals since 1995. The average trial was placebo controlled, double blind, including 32 subjects at five dose levels, with large heterogeneity in sample size and
design. Recently, Ting (7) published a book devoted to dose finding in drug development, with considerable emphasis on Phase I designs in healthy volunteers. 1.1 Single-Dose Studies The parallel single-dose design is still the most common design in healthy volunteer studies, although the use of crossover designs has been reported to increase over time (6). In the latter case, before proceeding to the next treatment period, a washout period is mandatory to avoid carryover effects from one period to another; drugs with long half-lives therefore require correspondingly long washout intervals. Both designs are usually placebo controlled and blinded, with evaluation of data on safety, tolerability, pharmacokinetics, and pharmacodynamics.
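A small sketch of the kind of pharmacokinetic summary derived from such studies is shown below: AUC by the trapezoidal rule, Cmax, and a geometric mean across subjects (PK measures are commonly analyzed on the log scale). The sampling times and concentrations are made-up numbers.

```python
# A small sketch of basic PK summaries: area under the concentration-time
# curve (AUC) by the trapezoidal rule, Cmax, and a geometric mean across
# subjects. The sampling times and concentrations are invented.
import numpy as np

times = np.array([0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 12.0, 24.0])        # hours
conc = np.array([
    [0.0, 1.8, 3.1, 2.6, 1.9, 1.1, 0.6, 0.2],                        # subject 1 (ng/mL)
    [0.0, 1.2, 2.4, 2.9, 2.2, 1.3, 0.7, 0.3],                        # subject 2
])

auc = np.trapz(conc, times, axis=1)        # AUC(0-24h) per subject
cmax = conc.max(axis=1)
print("AUC per subject:", np.round(auc, 1))
print("Cmax per subject:", cmax)
print("geometric mean AUC:", round(float(np.exp(np.log(auc).mean())), 1))
```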
the multiple responses of each subject are correlated, and mixed-effects models with random subject effects are used to fit an overall dose-response relationship. By expressing prior information as pseudo-data, the same methodology can be used to perform a Bayesian analysis and to determine posterior modal estimates for the model parameters (8). Such Bayesian decision procedures have been previously proposed for dose-escalation studies in healthy volunteers based on a fixed number of patients, to be treated one at a time (9). The careful selection of these PD along with PK data obtained in Phase I studies allows understanding the dose response data to increase the likelihood of a successful Phase II trial.
1.2 Multiple Ascending Dose Studies
2 PHASE I IN CANCER PATIENTS
The healthy volunteers may be given more than one dose in separate periods with washouts. Designs for dose allocation are either fixed, in which doses are randomly allocated to volunteers, or adaptive, through sequential allocation. Phase I studies that require multiple dose escalation steps have also led to the development of pharmacokinetically guided dose escalation (PGDE) strategies for conducting early clinical trials. They use PK parameters such as concentration-time curve (AUC) or the maximum concentration (Cmax) from the preceding dose group to rationalize the dose increments for escalation. Such PGDE may provide a more rapid and safe completion of the trial. In these studies, the trial starts with a cohort of size 4, three patients treated at the lowest dose level and one on placebo (8). The same cohort participates in three more treatment periods, each individual receives placebo once and otherwise receives the first three dose levels in ascending order. The dose is subsequently escalated up to a predetermined level of plasma concentration. Samples (of blood and other fluids) are collected at various time points and analyzed to understand how the drug is processed within the body. Usually, the PK responses, such as the AUC or Cmax, are, after logarithmic transformation, assumed normally distributed. As each subject is observed more than once,
When dealing with cytotoxic drugs in cancer patients, toxicity is usually thought of as a preliminary desire for increasing efficacy, and it defines the main endpoint of Phase I trials. Patient outcome is usually characterized by a single binary response that indicates whether toxicity occurs within a relatively short time period from the start of therapy. It is assumed that the probability of toxicity increases monotonically with dose of the drug. Several toxicity criteria have been published such as the National Cancer Institute Common Toxicity Criteria Grades or similar scales such as the United States cooperative group or World Health Organization (WHO) scales that can be used in assessing toxicity. Most dose-allocation procedures dichotomize toxicity grades based on being dose limiting. According to symptom/organ categories, investigators might choose any toxicity of WHO grade 3 or above, for instance, for the purpose of dose escalation. This factor is usually referred in the protocol as dose-limiting toxicity (DLT). However, wide variations in the methods used to identify dose-limiting toxicities have been reported (10,11). Conventional Phase I designs are most naturally sequential or adaptive, based on dose allocation rules. Actually, in this oncology setting, the safe dose for additional studies is often interpreted as being the highest
dose associated with no adverse outcome, or with "acceptable" toxicity, and it is referred to as the MTD (12). Conventionally, the MTD is the highest dose with a sufficiently small risk of DLT. The value of such a dose depends on the design and the sample size used in the trial, so that the MTD is a sample statistic identified from the data. However, from a statistical point of view, the MTD should be thought of as a population characteristic rather than a quantity calculated from the sample (13). Thus, it is commonly defined as a percentile of the dose-toxicity relationship, for instance the "toxic dose 20," which is the dose for which the probability of toxicity is 0.20. The choice of the target mostly depends on the drug and pathology under study. See the MTD section for more details. All oncology Phase I designs tend to have formal rules concerning dose escalation following adverse events. Many dose-allocation rules have been proposed, and extensive reviews have already been published (14–16). Briefly, two main approaches are commonly reported and contrasted: so-called "algorithm-based," "rule-guided," or "data-driven" methods on the one hand and "model-based" or "model-guided" methods on the other hand. Although they are obviously different in terms of the underlying philosophy with regard to dose finding and interpretation of the MTD, they share certain design characteristics.
2.1 Common Characteristics of Phase I Designs in Cancer
The ethical basis of Phase I cancer trials has been widely questioned, in part because these trials involve potentially vulnerable patients (17–19). However, most included patients have a good baseline performance status (20). Nevertheless, sample heterogeneity in Phase I trials is commonly observed. A recent review of Phase I oncology trials of cytotoxic agents published from 2002 through 2004 (21) reported that patients included in such trials predominantly had solid tumors (90% of studies), with 9% of studies including only hematologic or lymphatic malignancies and 1% enrolling both. Among the cancers, five tumor types are most commonly encountered, namely cancers of the colon or rectum, lung, kidney, breast, and prostate (20).
Whatever the rule for dose escalation, a series of ascending dose levels are prespecified. The starting dose, which is the lowest, is usually determined from preclinical animal toxicology data (usually mouse or dog) or on previous clinical experience when available, as the safest. Subsequent dose levels can be either equally spaced or not, increasing the first dose by 100%, then 66%, 50%, 40%, and 33%, according to a ‘‘modified Fibonacci’’ sequence that is used in one quarter of reported Phase I oncology trials (21). In cancer, single ascending dose studies are usually conducted. The objective is to escalate doses until a dose level is identified as suitable for use in later Phase II trials, based on ‘‘acceptable’’ safety. Such single ascending dose studies are those in which cohorts of three or six patients are given a small dose of the drug concurrently, and are observed for a specific period of time. Data from each cohort are collected and analyzed prior to choosing the next dose for administration. If they do not exhibit any adverse side effects, then a new group of patients is given a higher dose. This method continues until intolerable side effects start to develop, at which point the drug is said to have reached the MTD. The mean number of patients included in such trials usually is about 30, with most trials escalating drug doses in four to eight cohorts prior to establishing the MTD or stopping the trial (20). 2.2 Standard ‘‘3 + 3’’ Design The most popular design for dose escalation in cancer Phase I is known as the ‘‘3 + 3’’ design, in which the MTD is identified from the data (22). It is based on the inclusion of three patient cohorts, that is, treating three patients at the same time. Escalation proceeds starting at the lowest available dose until DLT is observed within the cohort. If DLT is observed in two or three patients in the cohort, then the trial is terminated. If only one patient experiences a DLT, then another cohort of three patients is included at the same dose level; if more than two DLTs are observed among the six patients, then the trial is stopped. Otherwise, the escalation continues in the same way. The final
recommended MTD is the dose one level lower than the final dose used, or the stopping dose itself if no more than two DLTs were observed at that dose. Of note, at the end of the dose escalation, investigators occasionally enroll additional patients at the MTD to explore the tolerability of the regimen, with no explicit statistical justification for doing so. Although the "3 + 3" design has been used for many years and is still widely used, it is only recently that its statistical properties have been explored using exact computation (23). Surprisingly, the identified MTD has unpredictable statistical properties: depending on the underlying dose-toxicity curve, the probability of toxicity at the selected dose tends to lie anywhere between 10% and 29%. Indeed, the MTD is not a model-based estimate of the true dose that would yield the targeted dose-limiting toxicity rate. Because the MTD is a sample statistic, it depends on the design and sample size used, and it has poor statistical properties. In the early 1990s, it appeared that designs with model-guided determination of the MTD would identify the MTD with fewer patients, more precision, and fewer patients exposed to suboptimal doses.
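The following Monte Carlo sketch illustrates these operating characteristics by simulating repeated "3 + 3" escalations under one common set of provisions (escalate on 0/3 or at most 1/6 DLTs; otherwise stop and declare the next lower level the MTD). The true DLT probabilities, the number of dose levels, and the exact provisions are assumptions for illustration and differ between protocols.

```python
import numpy as np

def three_plus_three(true_tox, rng):
    """One simulated escalation under a common '3 + 3' variant.
    Returns the index of the declared MTD, or -1 if even the lowest
    dose level is judged too toxic."""
    level = 0
    while level < len(true_tox):
        dlt = rng.binomial(3, true_tox[level])       # first cohort of three
        if dlt == 1:                                 # expand to six patients
            dlt += rng.binomial(3, true_tox[level])
        if dlt <= 1:                                 # 0/3 or at most 1/6: escalate
            level += 1
        else:                                        # two or more DLTs: stop
            return level - 1
    return len(true_tox) - 1                         # escalated past the top dose

rng = np.random.default_rng(0)
true_tox = [0.05, 0.10, 0.20, 0.35, 0.50]            # hypothetical true DLT rates
picks = [three_plus_three(true_tox, rng) for _ in range(10_000)]
chosen = [true_tox[m] for m in picks if m >= 0]
print("mean P(DLT) at the dose declared MTD:", round(float(np.mean(chosen)), 3))
```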
2.3 Up-and-Down Designs
In contrast, up-and-down designs assign patients to the next lower dose if toxicities are observed in the group. Such "A + B" designs (24) can be considered part of the wider family of random walk rules for the sequential allocation of dose levels to patients in a Phase I clinical trial. Patients are sequentially assigned the next higher, the same, or the next lower dose level according to some probability distribution, which may be determined by ethical considerations as well as the patient's response. These designs are simple and do not require any parametric assumptions about the dose-toxicity relationship other than that the probability of toxicity is nondecreasing with dose. Maximum likelihood estimators can be derived under a two-parameter logistic model (25), so that these designs can appear intermediate between fully rule-based and model-based approaches.
2.4 Continual Reassessment Method
Among the model-based designs, the Continual Reassessment Method (CRM) was developed originally to estimate the MTD of a new drug, defined as any targeted quantile of the dose-toxicity relationship (4). Before the trial begins, the investigators express their opinions of the probability of toxicity at each dose level, and these anticipated values (p_i) are referred to as the "working model." Although these initial guesses may not be accurate, they provide guidance for dose escalation. The CRM dose-allocation rule is as follows: the first cohort of patients, originally one patient at a time, is administered either the initial guess of the MTD (4) or the first dose level (26–29). Then, the dose level assigned to the next patient cohort is the dose level whose estimated toxicity probability is closest to the target. These probabilities are re-estimated after each inclusion from a one-parameter model for the relationship between dose d_i and the probability of toxicity, such as the power model, P(toxicity) = p_i^a, or the logistic model, logit(P(toxicity)) = a_0 + a d_i, where the intercept a_0 is fixed. This process is repeated until the prespecified fixed sample size (usually around 20–30 patients) is reached. Although it has been recommended for routine use in cancer Phase I trials, the CRM has not been as widely used as it could be, possibly because of the statistical machinery that surrounds it (30). The initial CRM updated the toxicity probabilities through Bayes' theorem, which combines the investigators' prior knowledge with the observed data to derive so-called posterior probabilities (4,26–29). In this Bayesian framework, a prior distribution is required for the model parameter, usually a vague prior from the Gamma family, with the exponential as a special case used for simplicity. The use of animal toxicity data in determining the prior distribution for the CRM has also been proposed (31). Thereafter, O'Quigley and Shen (32) derived a likelihood-based estimator of the model parameter for the so-called "likelihood CRM," in which the CRM is no longer a Bayesian method. Of note, maximum likelihood estimation requires a set of heterogeneous responses, so that the first stage is a rule-based design using three-patient cohorts
that ends when the first toxicity is observed. To better handle ethical concerns about potential toxic reactions at previously untested doses, the CRM has been modified to move more cautiously upward toward the MTD, for example by never skipping a dose during escalation (26–28). See O'Quigley et al. (33) and Zhou (15) for a review. Stopping rules have been proposed for the CRM, for instance rules that allow the trial to stop when the width of the posterior 95% probability interval for the MTD becomes sufficiently narrow, or when no change in the selected MTD is expected (14,34–36). The CRM converges on a single dose very quickly, and with high probability this is the dose level closest to the MTD. By contrast to the "3 + 3" design, it has been shown to allocate fewer than 10% of patients to a dose equal to or exceeding the MTD (4). Of note, because most patients are administered doses around the MTD, the toxicity probabilities predicted by the CRM for doses that are not close to the MTD are generally inaccurate.
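A minimal numerical sketch of this updating scheme is shown below. The skeleton values, the exponential prior on the model parameter a, and the grid-based integration are illustrative choices rather than the specification of any published trial.

```python
import numpy as np

def crm_next_dose(levels_given, dlts, skeleton, target=0.20):
    """Bayesian CRM update under the power model P(DLT at level i) = p_i**a,
    with an exponential(1) prior on a evaluated on a grid. Returns the level
    whose posterior-mean DLT probability is closest to the target, plus the
    estimated probabilities for all levels."""
    a = np.linspace(1e-3, 10.0, 2000)                  # grid over the parameter
    prior = np.exp(-a)                                 # exponential(1) prior density
    p = skeleton[np.asarray(levels_given)][:, None] ** a
    y = np.asarray(dlts)[:, None]
    lik = np.prod(np.where(y == 1, p, 1.0 - p), axis=0)
    post = prior * lik
    post /= post.sum()                                 # discrete posterior on the grid
    est = (skeleton[:, None] ** a * post).sum(axis=1)  # posterior-mean DLT probabilities
    return int(np.argmin(np.abs(est - target))), est

skeleton = np.array([0.05, 0.10, 0.20, 0.30, 0.45])    # hypothetical working model
# Example: six patients treated at levels 0 and 1, with one DLT at level 1.
next_level, est = crm_next_dose([0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 1, 0], skeleton)
print(next_level, np.round(est, 3))
```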
2.5 Bayesian Decision Procedures
Although the CRM gives good estimates of the MTD, it does not capture the whole dose-toxicity curve. Thus, Bayesian decision-theoretic approaches have been proposed to estimate the whole dose-response curve while focusing attention on the target dose (9,37). To this end, a two-parameter model is assumed to describe how the probability of toxicity increases with dose, usually the logistic model in log-dose (9,38). Prior information with regard to the model parameters is often elicited from investigators, expressed as if n_i "pseudo-observations" at dose d_i, resulting in t_i toxicities (i = −1, 0), had already been observed. The dose-allocation rule is based on the computation of gain functions (38). Among those, note that the "patient gain," in which each patient within the cohort receives the dose at which the posterior probability of toxicity is as close to the target as possible, is similar to the dose-selection criterion used in the CRM. Because both methods use Bayesian probability models as a basis for learning from accruing data during the trial, choosing doses for successive patient cohorts, and selecting a maximum tolerable dose, they seem to be closely related.
2.6 Other Extensions
Many proposals have been published that allow the CRM to take into account additional information, accounting for about 100 references in a PubMed search. They incorporate PK data (39), multiple-dose administrations (40), including the allocation of three different percentiles (denoted low, medium, and high) in the "LMH-CRM" (41), the length of time the drug has been administered ["TITE-CRM" (42)], grade information about the severity of toxicity ["Quasi-CRM" (43)], delayed patient outcomes (44), as well as two groups of patients (45). Last but not least, much research has been devoted to the joint modeling of toxicity and efficacy. As model-based approaches, these can be related either to the CRM ["bivariate CRM" (46)] or to Bayesian decision procedures (47). Such combined Phase I/II trials are described elsewhere (5).
3 PERSPECTIVES IN THE FUTURE OF CANCER PHASE I TRIALS
Some researchers have considered the percentile definition of the MTD inadequate for addressing the ethical question of protecting patients from severe toxicity. This led to the suggestion by Babb et al. (48) that doses should be chosen so that, with high probability, the toxicity does not exceed the tolerable level, which imposes a safety constraint through overdose control. Over the last decade, cancer drugs under development have become more targeted, and the clinical research environment has become more closely scrutinized. A few authors have suggested intrasubject dose escalation until the desired biological effect is obtained, even for cancer Phase I trials. Accelerated titration, that is, designs with rapid intra-patient dose escalation, seems to be of interest in this context (49). This approach allows subjects in the trial to benefit more while providing information on inter-subject variability. For targeted agents with well-defined pharmacodynamic markers that have little or no
toxicity in the therapeutic dose range, the definition of a maximum tolerated dose can be avoided. This has raised questions over the best study designs for early-phase dose-finding clinical trials with monotone biologic endpoints, such as biological measurements, laboratory values of serum levels, and gene expression. A specific objective of this type of trial could be to identify the minimum dose that exhibits adequate drug activity, that is, that shifts the mean of the endpoint relative to the zero dose; this is the so-called minimum effective dose. Such designs would benefit from research on Phase II designs. This is a promising research area for future Phase I trials.
4 DISCUSSION
Phase I oncology trials have been the subject of considerable ethical (18,50) and statistical (19,51) debate for the past 20 years. Whatever the design used, ethical requirements appear to be met: the overall toxic death rate for cancer Phase I studies published in peer-reviewed journals has been reported to be as low as 0.49% (52) or 0.54% (20), and it decreased over the study period, from 1.1% in 1991–1994 to 0.06% in 1999–2002 (20). Surprisingly, only a few trials are based on novel designs, and such formal methods have received little implementation (11,13). Based on a recent review, only 1.6% of reported Phase I cancer trials (20 of 1235 trials) followed a model-based design proposed in one of the statistical studies (53). Although simulation studies have shown that up-and-down designs treated only 35% of patients at optimal dose levels versus 55% for Bayesian adaptive designs, the former are still widely used in practice (10). The practical simplicity of the "3 + 3" approach is likely its greatest advantage. By contrast, model-guided escalation appears difficult, and practical advice for choosing design parameters and implementing the CRM has long been needed. Recently, tutorials that should be helpful have been published (13,30,54). Moreover, the availability of specific software for dose finding should facilitate their use in practice [see the review of
Websites and software to increment Phase I approaches by Zohar (55)]. Finally, most limitations will be overcome best by a close collaboration between clinical pharmacologists, clinicians, and statisticians. REFERENCES 1. B. Storer, Phase I. In: P. Armitage and T. Colton (eds.), Encyclopedia of Biostatistics. Chichester, UK: John Wiley & Sons, 1999, pp. 3365–3375. 2. B. Storer, Phase I trials. In: T. Redmond (ed.), Biostatistics in Clinical Trials. Chichester, UK: John Wiley & Sons, 2001, pp. 337–342. 3. W. M. Hryniuk, More is better. J. Clin. Oncol. 1988;6:1365–1367. 4. J. O’Quigley, M. Pepe, and L. Fisher, Continual reassessment method: a practical design for phase 1 clinical trials in cancer. Biometrics 1990;46:33–48. 5. S. Zohar and S. Chevret, Recent developments in adaptive designs for phase I/II dose-finding studies. J. Biopharm. Stat. 2007;17:1071–1083. 6. C. Buoen, O. Bjerrum, and M. Thomsen, How first-time-in-human studies are being performed: a survey of phase I doseescalation trials in healthy volunteers published between 1995 and 2004. J. Clin. Pharmacol. 2005;45:1123–1136. 7. N. Ting, Dose finding in drug development. In: M. Gail, et al. (ed.), Statistics for Biology and Health. New York: Springer. 2006. 8. S. Patterson, et al., A novel Bayesian decision procedure for early-phase dose-finding studies. J. Biopharm. Stat. 1999;9:583–597. 9. J. Whitehead and H. Brunier, Bayesian decision procedures for dose determining experiments. Stat. Med. 1995;14:885–893. 10. K. Margolin, et al., Methodologic guidelines for the design of high-dose chemotherapy regimens. Biol. Blood Marrow Transpl. 2001;7:414–432. 11. S. Dent and E. Eisenhauer, Phase I trial design: are new methodologies being put into practice? Ann. Oncol. 1996;7:561–566. 12. B. Storer, Design and analysis of phase I clinical trials. Biometrics 1989;45:925–937. 13. J. Whitehead, et al., Learning from previous responses in phase I dose-escalation studies. Br. J. Clin. Pharmacol. 2001;52:1–7. 14. D. Potter, Adaptive dose finding for phase I clinical trials of drugs used for chemotherapy of cancer. Stat. Med. 2002;21:1805–1823.
15. Y. Zhou, Choice of designs and doses for early phase trials. Fundam. Clin. Pharmacol. 2004;18:373–378.
30. E. Garrett-Mayer, The continual reassessment method for dose-finding studies: a tutorial. Clin. Trials 2006;3:57–71.
16. S. Chevret, Statistical methods for dose-finding experiments. In: S. Senn (ed.), Statistics in Practice. Chichester, UK: John Wiley & Sons, 2006.
31. N. Ishizuka and Y. Ohashi, The continual reassessment method and its applications: a Bayesian methodology for phase I cancer clinical trials. Stat. Med. 2001;20:2661– 2681.
17. C. Daugherty, Ethical issues in the development of new agents. Invest. New Drugs 1999;17:145–153. 18. E. Emanuel, A phase I trial on the ethics of phase I trials. J. Clin. Oncol. 1995;13:1049–1051. 19. M. Ratain, et al., Statistical and ethical issues in the design and conduct of phase I and II clinical trials of new anticancer agents. J. Natl. Cancer Inst. 1993;85:1637–1643.
32. J. O’Quigley and L. Shen, Continual reassessment method: a likelihood approach. Biometrics 1996;52:673–684. 33. J. O’Quigley, M. Hughes, and T. Fenton, Dosefinding designs for HIV studies. Biometrics 2001;57:1018–1029. 34. S. Zohar and S. Chevret, The continual reassessment method: comparison of Bayesian stopping rules for dose-ranging studies. Stat. Med. 2001;20:2827–2843.
20. T. J. Roberts, et al., Trends in the risks and benefits to patients with cancer participating in phase 1 clinical trials. JAMA 2004;292:2130–2140.
35. J. O’Quigley, Continual reassessment designs with early termination. Biostatistics 2002;3:87–99.
21. S. Koyfman, et al., Risks and benefits associated with novel phase 1 oncology trial designs. Cancer 2007;110:1115–1124.
36. J. Heyd and B. Carlin, Adaptive design improvements in the continual reassessment method for phase I studies. Stat. Med. 1999;18:1307–1321.
22. N. Geller, Design of phase I and II clinical trials in cancer: a statistician’s view. Cancer Invest. 1984;2:483–491. 23. Y. Lin and W. Shih, Statistical properties of the traditional algorithm-based designs for phase I cancer clinical trials. Biostatistics 2001;2:203–215. 24. A. Ivanova, Escalation, group and A + B designs for dose-finding trials. Stat. Med. 2006;25:3668–3678. 25. S. Durham, N. Flournoy, and W. Rosenberger, A random walk rule for phase I clinical trials. Biometrics 1997;53:745–760. 26. D. Faries, Practical modifications of the continual reassessment method for phase I cancer clinical trials. J. Biopharm. Stat. 1994;4:147–164. 27. S. Goodman, M. Zahurak, and S. Piantadosi, Some practical improvements in the continual reassessment method for phase I studies. Stat. Med. 1995;1:1149–1161. 28. E. Korn, et al., A comparison of two phase I trial designs. Stat. Med. 1994;13:1799–1806. 29. Møller, S., An extension of the continual reassessment methods using a preliminary up-and-down design in a dose finding study in cancer patients, in order to investigate a greater range of doses. Stat. Med. 1995;14:911–922.
37. C. Gatsonis and J. Greenhouse, Bayesian methods for phase I clinical trials. Stat. Med. 1992;11:1377–1389. 38. J. Whitehead and D. Williamson, Bayesian decision procedures based on logistic regression models for dose-finding studies. J. Biopharm. Stat. 1998;8:445–467. 39. S. Piantadosi and G. Liu, Improved designs for dose escalation studies using pharmacokinetic measurements. Stat. Med. 1996;15:1605–1618. 40. A. Legedza and J. Ibrahim, Longitudinal design for phase I clinical trials using the continual reassessment method. Control. Clin. Trials 2000;21:574–588. 41. B. Huang and R. Chappell, Three-dose-cohort designs in cancer phase I trials. Stat. Med. 2007. 42. Y. K. Cheung and R. Chappell, Sequential designs for phase I clinical trials with late-onset toxicities. Biometrics 2000;56:1177–1182. 43. Z. Yuan, R. Chappell, and H. Bailey, The continual reassessment method for multiple toxicity grades: a Bayesian quasi-likelihood approach. Biometrics 2007;63:173–179. 44. P. Thall, et al., Accrual strategies for phase I trials with delayed patient outcome. Stat. Med. 1999;18:1155–1169.
45. J. O’Quigley and X. Paoletti, Continual reassessment method for ordered groups. Biometrics 2003;59:430–440. 46. T. Braun, The bivariate continual reassessment method. extending the CRM to phase I trials of two competing outcomes. Control. Clin. Trials 2002;23:240–256. 47. P. F. Thall, E. H. Estey, and H.-G. Sung, A new statistical method for dose-finding based on efficacy and toxicity in early phase clinical trials. Investigat. New Drugs 1999;17:155–167. 48. Babb, J., A. Rogatko, and S. Zacks, Cancer phase I clinical trials: efficient dose escalation with overdose control. Stat. Med. 1998;17:1103–1120.
49. R. Simon, et al., Accelerated titration designs for phase I clinical trials in oncology. J. Natl. Cancer Inst. 1997;89:1138–1147.
50. M. Agrawal and E. Emanuel, Ethics of phase 1 oncology studies: reexamining the arguments and data. JAMA 2003;290:1075–1082.
51. E. Korn, et al., Clinical trial designs for cytostatic agents: are new approaches needed? J. Clin. Oncol. 2001;19:265–272.
52. E. Horstmann, et al., Risks and benefits of phase 1 oncology trials, 1991 through 2002. N. Engl. J. Med. 2005;352:895–904.
53. A. Rogatko, et al., Translation of innovative designs into phase I trials. J. Clin. Oncol. 2007;25:4982–4986.
54. X. Paoletti, et al., Using the continual reassessment method: lessons learned from an EORTC phase I dose finding study. Eur. J. Cancer 2006;42:1362–1368.
55. S. Zohar, Websites and software. In: S. Chevret (ed.), Statistical Methods for Dose-Finding Experiments. Chichester, UK: John Wiley & Sons, 2006, pp. 289–306.
CROSS-REFERENCES
Continual Reassessment Method (CRM)
Maximal tolerable dose
Dose escalation design
PHASE I TRIALS IN ONCOLOGY
ANASTASIA IVANOVA
University of North Carolina at Chapel Hill, Chapel Hill, North Carolina

According to ClinicalTrials.gov, an official clinical trial resource of the National Institutes of Health, phase I clinical trials are initial studies with volunteers or patients to determine the metabolism and pharmacologic actions of drugs in humans, to assess the side effects associated with increasing doses, and to gain early evidence of effectiveness or efficacy. Conducting phase I oncology trials with healthy volunteers would potentially expose them to severe toxicity events. Therefore, phase I trials in oncology are usually conducted with patients, often patients for whom other options do not exist. The goals of phase I trials in oncology are (1) to estimate the maximum tolerable dose of the investigational agent, (2) to determine dose-limiting toxicities, and (3) in many cases, to also investigate the pharmacokinetics and pharmacodynamics of the investigational drug. Most of the existing oncology therapies are cytotoxic. The word cytotoxic means toxic to cells, or cell-killing. Chemotherapy and radiotherapy are examples of cytotoxic therapies widely used in oncology. A new class of cancer drugs acting as cytostatic agents is being developed. These agents usually target one specific process involved in the malignant transformation of cells, and they result in growth inhibition rather than tumor regression. Based on their specific mechanism of action, these target-specific agents are expected to have a more favorable toxicity profile (8). The methods presented in the remainder of the article address mostly the development of cytotoxic agents.
1 DOSE-LIMITING TOXICITY
Most often, new oncologic agents are administered intravenously; however, some agents are administered orally. Frequently, therapy is given in cycles with multiple administrations of the investigational drug per cycle, sometimes alone, other times in combination with other drugs that are already part of a standard regimen. Patients typically receive multiple cycles, especially if some benefit of the therapy is noted and the patient is tolerating the regimen. From a safety standpoint, toxicity is regularly assessed throughout the course of the patient's participation in the trial—while the patient is exposed to the study drug as well as typically for at least 30 days after the last day of study drug exposure. However, to shorten the trial duration, toxicities in the first cycle or two are used to determine if the patient or patients in a dose cohort are tolerating the study drug at a particular dose level, to allow administration of the next higher dose of study drug to the next cohort of patients. Toxicity in oncology trials is typically graded using the National Cancer Institute Common Terminology Criteria for Adverse Events version 3.0 (available online from the Cancer Therapy Evaluation Program website at http://ctep.cancer.gov). Toxicity is measured on a scale from 0 to 5. The dose-limiting toxicity (DLT) is usually defined as treatment-related nonhematologic toxicity of grade 3 or higher, or treatment-related hematologic toxicity of grade 4 or higher. The toxicity outcome is typically binary (DLT/no DLT) for the purpose of dose-escalation considerations. We will use the terms toxicity and DLT interchangeably in this article. The underlying assumption is that the probability of toxicity is a nondecreasing function of dose. From a statistical viewpoint, the maximum tolerable dose (MTD) is defined as the dose at which the probability of DLT is equal to the maximum tolerable level, Γ. The target level Γ is usually set to 0.2, though values from 0.1 to 0.35 are sometimes used. Often, no statistical definition of the MTD is given (18), and the MTD can be defined, for example, as the dose level just below the lowest dose where one, or two or more, patients out of a predetermined number of patients (such as six) experienced DLT.
2 STARTING DOSE
The murine LD10 is the dose with approximately 10% mortality, as established in preclinical studies in animals. One-tenth or two-tenths of the equivalent of LD10, expressed in milligrams per square meter, is often used as a starting dose in phase I trials with agents that have not been tested in humans before. Other choices for the starting dose are possible; for example, a fraction of the toxic dose low—the dose that, if doubled, causes no lethality in preclinical studies in animals—can be used as a starting dose. Many phase I trials are conducted with agents that have already been studied in humans. In such trials, the starting dose is usually based on experience with the agent in previously conducted human studies.
3 DOSE LEVEL SELECTION
The dose levels are selected by the investigational team, which is typically composed of physicians, pharmacologists, and other clinical development scientists. They use the preclinical and, when available, the clinical data of the investigational product (or data from drugs of the same class), taking into account the particular characteristics of the population of patients who will receive the study agent. The dose levels are chosen before the onset of the trial. The set of doses can be chosen according to the modified Fibonacci sequence, in which higher escalation steps have decreasing relative increments (100%, 67%, 50%, 40%, and 30% thereafter). Other incremental sequences are used as well. Statistical methods have been proposed that will choose dose levels for assignment as long as the first dose level is specified. For example, in the continual reassessment method (16), subsequent dose levels can be computed (at least in theory) from the working dose-toxicity model and priors on model parameters based on the dose assignments and outcomes observed in the trial. Dose levels can also be selected using pharmacology, for example, to achieve a certain area under the curve (5).
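As a small illustration of the modified Fibonacci spacing mentioned above, the sketch below generates a set of dose levels from a hypothetical starting dose; the starting dose, the number of levels, and the rounding are arbitrary choices.

```python
# Dose levels from a starting dose using modified Fibonacci increments
# (100%, 67%, 50%, 40%, and 30% thereafter); all numbers are illustrative.
start_dose = 10.0                        # hypothetical starting dose, mg/m^2
increments = [1.00, 0.67, 0.50, 0.40]    # followed by 0.30 for every later step
doses = [start_dose]
for step in range(7):
    inc = increments[step] if step < len(increments) else 0.30
    doses.append(round(doses[-1] * (1 + inc), 1))
print(doses)   # [10.0, 20.0, 33.4, 50.1, 70.1, 91.1, 118.4, 153.9]
```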
4 STUDY DESIGN AND GENERAL CONSIDERATIONS
Because most cancer drugs are highly toxic and phase I trials in oncology enroll patients rather than healthy volunteers, the study design in phase I oncology trials differs from phase I dose-finding trials in other therapeutic areas. Usually, the therapeutic effect of a drug is believed to increase with dose; however, the dose of a study drug that a patient can receive is often limited by toxicity. It is important to study as many patients as possible at or near the MTD while minimizing the number of patients who are administered the investigational drug at toxic doses. Therefore, in oncology phase I trials, the patients are assigned sequentially, starting with the lowest dose. Usually, in the initial stage, the patients are assigned in relatively small cohorts at increasing dose levels until the first toxicity is seen. Designs for dose finding are often classified as parametric or nonparametric. Designs such as the continual reassessment method (16) and the escalation with overdose control (1) are often referred to as parametric designs. Biased coin designs (6, 7), group up-and-down designs (21), and the traditional or 3 + 3 design (14, 19) and its extension, the A + B designs (15), are frequently referred to as nonparametric designs. Nonparametric designs are attractive to physicians because they are easy to understand and implement: the decision rule is intuitive and does not involve complicated calculations. This might explain the wide use of the traditional or 3 + 3 design.
5 TRADITIONAL, STANDARD, OR 3 + 3 DESIGN
The most frequently used design in phase I trials in oncology is the traditional, or standard, or 3 + 3 design (14, 19). This design is often associated with the set of dose levels obtained using the modified Fibonacci sequence (19). Therefore, this design is sometimes also referred to as the Fibonacci escalation scheme, but this is a misnomer because the Fibonacci sequence defines the spacing between doses and not the rule for when to
repeat, increase, or decrease the dose level for the next cohort of patients. The 3 + 3 design can be used with any set of dose levels, regardless of the spacing. In the 3 + 3 design, subjects are assigned in groups of three starting with the lowest dose, with the following provisions. If only three subjects have been assigned to the current dose so far, then:
1. If no toxicities are observed in a cohort of three, the next three subjects are assigned to the next higher dose level.
2. If one toxicity is observed in a cohort of three, the next three subjects are assigned to the same dose level.
3. If two or more toxicities are observed at a dose, the MTD is considered to have been exceeded.
If six subjects have been assigned to the current dose, then:
1. If at most one toxicity is observed in six subjects at the dose, the next three subjects are assigned to the next higher dose level.
2. If two or more toxicities are observed in six subjects at the dose, the MTD is considered to have been exceeded.
The estimated MTD is the highest dose level with an observed toxicity rate less than 0.33. The 3 + 3 design does not require a sample size specification: the escalation is continued until a dose with an unacceptable number of toxicities is observed. The frequency of stopping escalation at a certain dose level in the 3 + 3 design depends on the toxicity rate at this dose as well as the rates at all lower dose levels. The operating characteristics of the 3 + 3 design have been extensively studied for a number of dose-toxicity scenarios by exact computations and simulations (12, 13, 15, 17). On average, the dose chosen by the 3 + 3 design as the MTD has a probability of toxicity of 0.2. Because three to six patients are treated at each dose in the 3 + 3 design, the estimate of the MTD is not precise. Also, the 3 + 3 design does not provide rules regarding the next assignment or how to estimate the MTD if there are dropouts or if additional patients are assigned. Designs similar to the 3 + 3 can be constructed for other quantiles (see the article on escalation and up-and-down designs). Alternatively, designs similar to the 3 + 3 but with different cohort sizes can be constructed to target Γ = 0.2.
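The numbered provisions above translate directly into a per-cohort decision rule. The helper below is a minimal encoding of them; the function and argument names are illustrative rather than taken from the cited literature.

```python
def three_plus_three_decision(n_treated, n_dlt):
    """Next action at the current dose level under the provisions listed above.
    n_treated: subjects already evaluated at this level (3 or 6);
    n_dlt: DLTs observed among them.
    Returns 'escalate', 'expand' (treat three more at the same level),
    or 'stop' (the MTD has been exceeded)."""
    if n_treated == 3:
        if n_dlt == 0:
            return "escalate"
        if n_dlt == 1:
            return "expand"
        return "stop"
    if n_treated == 6:
        return "escalate" if n_dlt <= 1 else "stop"
    raise ValueError("the 3 + 3 rule evaluates cohorts of 3 or 6 subjects")

# Example: one DLT in the first cohort of three -> treat three more subjects.
print(three_plus_three_decision(3, 1))   # 'expand'
```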
6 CONTINUAL REASSESSMENT METHOD AND OTHER DESIGNS THAT TARGET THE MTD
The continual reassessment method (CRM) was developed with the goal of bringing experimentation to the MTD as soon as possible and assigning as many patients as possible to the MTD. The CRM is a Bayesian design proposed in 1990 by O'Quigley et al. (16). (See the article on the continual reassessment method for more details.) Another example of a model-based design for dose-finding studies is the escalation with overdose control (1). This design is from a class of Bayesian feasible designs. It uses a loss function to minimize the predicted amount by which any given patient is overdosed. The cumulative cohort design (9) is a nonparametric design that uses only one assumption: that toxicity is nondecreasing with dose. This design can be viewed as a generalization of group designs (21). The decision to decrease, increase, or repeat a dose for the next patient depends on how far the estimated toxicity rate at the current dose is from the target. Stopping rules are not generally used with any of these designs, and hence the specification of the total sample size is required.
7 START-UP RULE
At the beginning of the trial, the 3 + 3 design assigns patients in groups of three at escalating doses until the first toxicity is observed. This is an example of a start-up rule. The start-up rule brings the trial near doses with a toxicity rate close to the target level. It can be advantageous to use a start-up rule before switching to the CRM or another design selected for the study. One frequently used start-up rule assigns patients in groups of
size s at escalating doses until the first toxicity is seen. It is important to avoid escalation that is too rapid or too slow (3). With that goal in mind, the group size s in the start-up should be chosen according to the target toxicity level Γ (11). Ivanova et al. (11) suggested choosing the group size according to the formula s = log(0.5)/log(1 − Γ), rounded to the nearest integer. For example, if Γ = 0.5, the start-up with s = 1 is used; if Γ = 0.3, the start-up with s = 2 is used; and if Γ = 0.2, the start-up with s = 3 is used.
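The formula is easy to check numerically; in the sketch below, rounding to the nearest integer is our assumption about how the non-integer values are handled.

```python
import math

def startup_group_size(gamma):
    """Start-up group size s = log(0.5) / log(1 - gamma) for a target toxicity
    rate gamma, rounded to the nearest integer (rounding convention assumed)."""
    return round(math.log(0.5) / math.log(1.0 - gamma))

for gamma in (0.5, 0.3, 0.2):
    print(gamma, startup_group_size(gamma))   # -> 1, 2, 3, as quoted above
```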
8 PHASE I TRIALS WITH LONG FOLLOW-UP
Because many oncology trials involve multiple cycles of treatment, there is often a need to follow the patients for DLT over a long period of time. Radiation therapy trials are another example of trials in which long-term toxicities are likely. For trials with long follow-up periods, Cheung and Chappell (4) proposed a time-to-event modification of the CRM that they called the TITE-CRM. The TITE-CRM allows use of the data from all patients in the clinical trial who have received exposure to the investigational agent, not simply the data from patients who have completed the safety follow-up phase of the trial.
9 PHASE I TRIALS WITH MULTIPLE AGENTS
It is common in oncology to treat patients with drug combinations. Often, the dose of one agent is fixed, and the goal is to find the MTD of the other agent administered in combination. Designs for settings where the doses of both agents vary can be found in the work of Thall et al. (20) and Wang and Ivanova (22).
10 PHASE I TRIALS WITH THE MTD DEFINED USING TOXICITY GRADES
Toxicity in oncology is measured on a scale from 0 to 5. Usually the goal is to find a dose at which a certain proportion of DLTs occurs, where DLT is typically defined as study drug–related nonhematologic toxicity of grade 3 or higher or study drug–related hematologic toxicity of grade 4 or higher. In some cases, investigators consider an ordinal toxicity variable, and the MTD is defined as the dose with a certain weighted sum of probabilities of the outcome categories. For example, in the trial described by Ivanova (10), the MTD was defined as the dose where twice the rate of DLTs plus the rate of lower-grade toxicities that cause dose reductions is equal to 0.5. Bekele and Thall (2) used a more sophisticated weighting system and ordinal toxicity variables with up to four categories. They proposed a Bayesian design for such trials.
REFERENCES 1. J. Babb, A. Rogatko, and S. Zacks, Cancer phase I clinical trials: efficient dose escalation with overdose control. Stat Med. 1998; 17: 1103–1120. 2. B. N. Bekele and F. T. Thall, Dose-finding based on multiple toxicities in a soft tissue sarcoma trial. J Am Stat Assoc. 2004; 99: 26–35. 3. K. Cheung, Coherence principles in dose finding studies. Biometrika. 2005; 92: 863–873. 4. Y. K. Cheung and R. Chappell, Sequential designs for phase I clinical trials with late-onset toxicities. Biometrics. 2000; 56: 1177–1182. 5. J. M. Collins, C. K. Grieshaber, and B. A. Chabner, Pharmacologically-guided phase I trials based upon preclinical development. J Natl Cancer Inst. 1990; 82: 1321–1326. 6. S. D. Durham and N. Flournoy, Random walks for quantile estimation. In: S. S. Gupta and J. O. Berger (eds.), Statistical Decision Theory and Related Topics V. New York: SpringerVerlag, 1994, pp. 467–476. 7. S. D. Durham and N. Flournoy, Up-and-down designs I. Stationary treatment distributions. In: N. Flournoy and W. F. Rosenberger (eds.), Adaptive Designs. Hayward, CA: Institute of Mathematical Statistics, 1995, pp 139–157. 8. R. Hoekstra, J. Verweij, and F. A. Eskens, Clinical trial design for target specific anticancer agents. Invest New Drugs. 2003; 21: 243–250. 9. A. Ivanova, N. Flournoy, and Y. Chung, Cumulative cohort design for dose-finding. J Stat Plan Inference. 2007; 137: 2316–2327. 10. A. Ivanova, Escalation, group and A + B designs for dose-finding trials. Stat Med. 2006; 25: 3668–3678.
11. A. Ivanova, A. Montazer-Haghighi, S. G. Mohanty, and S. D. Durham, Improved up-and-down designs for phase I trials. Stat Med. 2003; 22: 69–82.
12. S. H. Kang and C. Ahn, The expected toxicity rate at the maximum tolerated dose in the standard phase I cancer clinical trial design. Drug Inf J. 2001; 35: 1189–1200.
13. S. H. Kang and C. Ahn, An investigation of the traditional algorithm-based designs for phase I cancer clinical trials. Drug Inf J. 2002; 36: 865–873.
14. E. L. Korn, D. Midthune, T. T. Chen, L. V. Rubinstein, M. C. Christian, and R. M. Simon, A comparison of two phase I trial designs. Stat Med. 1994; 13: 1799–1806.
15. Y. Lin and W. J. Shih, Statistical properties of the traditional algorithm-based designs for phase I cancer clinical trials. Biostatistics. 2001; 2: 203–215.
16. J. O'Quigley, M. Pepe, and L. Fisher, Continual reassessment method: a practical design for phase I clinical trials in cancer. Biometrics. 1990; 46: 33–48.
17. E. Reiner, X. Paoletti, and J. O'Quigley, Operating characteristics of the standard phase I clinical trial design. Comput Stat Data Anal. 1999; 30: 303–315.
18. W. F. Rosenberger and L. M. Haines, Competing designs for phase I clinical trials: a review. Stat Med. 2002; 21: 2757–2770.
19. B. E. Storer, Design and analysis of phase I clinical trials. Biometrics. 1989; 45: 925–937. 20. P. Thall, R. Millikan, P. Mueller, and S. Lee, Dose finding with two agents in phase I oncology trials. Biometrics. 2003; 59: 487–496. 21. G. B. Wetherill, Sequential estimation of quantal response curves. J R Stat Soc Ser B Methodol. 1963; 25: 1–48. 22. K. Wang and A. Ivanova, Two-dimensional dose finding in discrete dose space. Biometrics. 2005; 61: 217–222.
FURTHER READING S. Chevret, ed., Statistical Methods for Dose Finding. New York: Wiley, 2006. J. Crowley, ed., Handbook of Statistics in Clinical Oncology. New York/Basel: Marcel Dekker, 2006. J. F. Holland and E. Frei, eds., Cancer Medicine, 7th ed. Hamilton, Ontario/London: BC Decker, 2006. N. Ting, ed., Dose Finding in Drug Development. New York: Springer-Verlag, 2006.
CROSS-REFERENCES
Escalation and up-and-down designs
Continual reassessment method
PHASE IV TRIALS
SIMON DAY
Roche Products Ltd., Welwyn Garden City, United Kingdom

To understand the place of phase IV trials, it is necessary to contrast them with phase III trials and the purpose of phase III. Strictly speaking, phase IV is the phase coming after phase III, but many phase IV studies have the same look and feel (and in many cases, very similar purposes) as phase III studies. The definitions of the different phases make most sense in the context of the development of new medicines (typically, although not exclusively, by the pharmaceutical industry), where the end of phase III is generally the point at which an application to health authorities is made for a marketing authorization. However, many clinical trials (typically those using licensed drugs) are sponsored by governments, academia, or charities. Although these inevitably come after the point of initial authorization/marketing (i.e., postmarketing), they often have objectives that are similar to those of a phase III study, and their design and execution are very similar to classic phase III studies.
1 DEFINITIONS AND CONTEXT
Before defining what a phase IV trial is, we have to be clear about what a trial is and what it is not. Clinical trials are typically thought of as randomized interventional experiments, where very specific types of patients are enrolled, treatments (or controls or placebos) are blinded, follow-up measures are carefully standardized, a primary endpoint is clearly defined, and so on. However, many of these features are not absolute requirements of a trial, even though they may be desirable attributes. Clinical trials may not be blinded, may not necessarily have a control arm (or arms), and may not have a clearly defined primary endpoint. The better trials typically do have these features, but they are not necessary requirements. Phase IV trials, because of their greater variability in purpose, tend to exhibit greater variability in methodology. Defining phase IV only makes sense if we think about the regulatory, licensing, and marketing processes for new medicines. Strictly, phase IV trials are those carried out to investigate some aspect of a medicine that is already licensed. For example, the International Conference on Harmonisation (ICH) E8 guideline (1) describes phase IV studies as being all studies performed after drug approval but "related to the approved indication." In Pocock's seminal book on clinical trials (2), however, he tends to consider phase IV trials almost synonymously with postmarketing surveillance, and he implies that they are typically uncontrolled studies and, therefore, not particularly scientifically useful. In another text on the subject, Piantadosi (3) sees such trials as possibly uncontrolled (e.g., to estimate, in absolute terms, the rate of adverse events) or as trials initiated after marketing of a new drug. In contrast to Pocock's rather negative view, Wang and Bakhai (4) acknowledge the crucial need for phase IV studies to gain additional safety data by examining long-term administration and possible drug interactions. The truly defining feature of a phase IV trial is that it occurs after a medicine has been licensed—Wang and Bakhai, for example, tend to take this point of view, as does Grieve (5). Everitt (6) similarly takes a very broad (and perhaps rather nonspecific) view of phase IV studies, defining them simply as "Studies conducted after a drug is marketed to provide additional details about its safety, efficacy and usage profile." Perhaps the only point of debate in this definition is whether these studies are carried out after marketing or after initial licensing. However, even setting aside this refinement, this definition itself causes difficulties (as also noted by Grieve, though missed by ICH E8) because it does not differentiate whether a medicine is licensed or marketed in a particular country or region, or whether it is licensed for a particular indication or via a particular route of administration, with certain dosages and dose frequencies. Many clinical trials carried out with marketed products are done with a view to
expanding their indications or using them in different dosages or in combination with other products. All of these activities necessarily come after initial licensing or in the postmarketing period, but many such trials would look like mainstream phase III trials and might even be used for licensing new indications or dosages.
2 DIFFERENT PURPOSES FOR PHASE IV TRIALS
A wide variety of studies might be carried out under the phase IV umbrella term, and they may have very different purposes. This section describes some of the most common scenarios where phase IV studies have been carried out. With the exception of the first example (seeding studies), all may address perfectly legitimate scientific questions for which a good evidence base may be highly desirable. This is not to say that all such questions are always addressed in a high-quality scientific way, and we shall see an example of this later.
2.1 Seeding (and Some Other Marketing) Studies
Seeding studies, which are rarely seen today, were run by the pharmaceutical industry under the guise of a scientific study as a means of getting physicians to use the product so that they would then later prescribe it to their patients. These studies were called "seeding" because they were akin to planting a seed and then watching it grow in the form of sales and profits. A small amount of data might have been collected to try to give the appearance that it was a useful scientific study, but often not even basic efficacy and safety data were collected. Such studies have been criticized as being marketing activities masquerading as science. Although they are much less common today, it would be naïve to think that those old-fashioned types of marketing study carried out by commercial companies have completely disappeared; there are still examples to be found, which typically are not published because of their lack of any scientific merit. How much value they add varies and is hotly debated. A survey with a clearly defined
sampling frame to precisely (and in an unbiased way) estimate the incidence of particular events could be very valuable, but such studies are not common. Less formal survey-type studies in less clearly defined populations are more common, and their value is less clear.
2.2 Big Numbers, Rare Events
A problem with any development plan and with the licensing of any new medicine is the real possibility that rare—but typically serious—adverse reactions may be experienced by only a small fraction of those taking the medication. Of the many examples, one well-known one was the association found between pediatric aspirin and Reye's syndrome (7). As noted by Armitage et al. (8), the statistical "rule of 3" stemming from the Poisson distribution states that 300/x patients must be observed to be 95% sure of seeing at least one event if the true incidence is x%. So in a typical drug development plan including 2000 to 5000 patients, it is quite plausible that reactions occurring in fewer than 1 in 1000 patients may not be observed. If they are observed, there may be only one case; it would be impossible to exclude, or even imply, causality of the drug under investigation. The rule of 3 makes it obvious that the chances of observing very rare events (e.g., those occurring in fewer than 1 in 10,000 patients) are extremely small. It is not until a new medicine is widely available on the market that such exposure figures are possible. A postmarketing surveillance phase IV study plays an important role in looking for such rare but medically very serious events.
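The arithmetic behind these statements is easy to check. The sketch below computes the probability of observing at least one case for a given number of patients and true incidence; the sample sizes used are illustrative.

```python
def p_at_least_one(n_patients, incidence):
    """Probability of seeing at least one case of a reaction with the given
    true incidence among n_patients independent patients."""
    return 1.0 - (1.0 - incidence) ** n_patients

# Rule of 3: roughly 3/incidence patients are needed for ~95% probability.
print(p_at_least_one(3000, 1 / 1000))    # ~0.95 for a 1-in-1,000 reaction
print(p_at_least_one(5000, 1 / 10000))   # ~0.39: a 2000-5000 patient program
                                         # can easily miss a 1-in-10,000 reaction
```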
2.3 Chronic Usage
As with rare events, there is a limit to what can reasonably be investigated before licensing. Although treatments for long-term use need to be tested for much longer than those for short-term use (e.g., see ICH E1 [9]), it will never be possible to study novel treatments over decades, despite the fact that many patients may use those treatments for such long periods. Furthermore, prophylactic medicines may produce side effects many years after they have been taken. The only way that such possible long-term side effects can reasonably be detected is after licensing; again, postmarketing surveillance has a useful place.
2.4 Dosage Regimes
Although dose finding should be part of early-phase development (typically in phase II), there is often scope for further investigation of the use of different doses of a medicine once it is on the market. Sometimes this may be to investigate different patient-severity groups or for prophylactic use. If the study is carried out by the manufacturer, it may be to get other doses approved; if it is carried out by independent researchers, then it may simply be to better inform, guide, and influence prescribing practice. Both aims are perfectly legitimate and might result in very similar studies being designed and run.
2.5 New Indications
As with dose finding, further indications for existing marketed products are a legitimate goal of investigation, either by commercial organizations with a view to being able to make further marketing claims or by noncommercial researchers. Perhaps one of the most widely investigated medicines, in terms of the number of patients entering good-quality clinical trials, has been aspirin, particularly for its cardioprotective effects (10–12). Because the patent for aspirin has long since expired, there is little commercial interest in further expanding the range of indications for which it has demonstrated efficacy and safety. However, from a public health perspective, there has been a great deal of interest and much resultant benefit. These studies have all been conducted in the postmarketing era of aspirin, and so they are considered phase IV studies.
2.6 Co-administration
A common feature of studies carried out in phase IV is to investigate the benefits of drugs used in combination with others. Interaction studies look at possible adverse reactions between drugs, such as interactions with other likely or possible long-term medications, and they are typically required by licensing authorities/health authorities
before marketing. However, possible positive synergistic effects between existing medications are often not investigated before first marketing. Often, the possibility of a synergistic effect may only be noticed by serendipity, and a few case studies or anecdotal observations may, quite appropriately, provide the impetus to carry out a new study. Examples of such interactions include aspirin and streptokinase (11) and aspirin and heparin (12). Much of the development in oncology has come about through co-administration of licensed products. Similarly, much of the development of treatment for human immunodeficiency virus/acquired immunodeficiency syndrome (HIV/AIDS) has been through polypharmacy (13).
2.7 Comparisons with Other Medications
As has been noted, in many ways the requirements of phase III trials may not answer all of the questions of every consumer of the research—including patients, prescribers, and purchasers. How a new treatment compares with a variety of existing treatments is an important question, but its relevance to drug licensing is not so clear cut. New treatments that are less efficacious than existing treatments may still have a place in the pharmacy; they may cause fewer side effects, or be more convenient to use, or less expensive. Usually, each of these criteria can only be considered on an on-average basis; that is, a new treatment, perhaps for a mild but chronic condition, may be less efficacious than an existing one on average, but for some patients the new treatment may actually be better. Hence, absolute measures of efficacy and safety (and so of risk–benefit) can be very helpful. Such absolute measures are best evaluated by comparison with placebo. Placebo-controlled trials are necessary for many licensing decisions, but they may not always be sufficient. A new treatment in a critical-care setting that is worse than an existing treatment (even though it may be better than placebo) may represent a health hazard because withholding the better treatment may lead to a fatal or irreversible outcome for the patient. So the reliance on placebo and the use of active comparators in phase III is a complex balance.
Notwithstanding these arguments, it is unlikely that, at the time of initial licensing, a new treatment will have been compared with all existing possible treatments for a particular indication. The need for such comparisons—often in very specific patient groups—is real, and phase IV is an ideal time to make such comparisons. 2.8 Quality of Life and Health Economics Quality of life and health economics are very different things but are put together here because they share a common difficulty. Both need to be assessed in as near ‘‘real life’’ conditions as possible. It is well recognized that most clinical trials (certainly most traditional phase III trials) are not carried out in realistic, lifelike, settings but rather in quite artificial settings. Trials are usually carried out by experienced investigators at specialist centers, and the results are sometimes criticized as not being applicable to most physicians and centers. Trials are typically carried out in relatively ‘‘clean’’ patient populations (often excluding the elderly, those on multiple co-medications, etc.), and it is claimed they may not be applicable to a more general patient population. Trials typically monitor patients more closely than would be the case in routine practice, and this may mean that patients’ satisfaction is influenced (either positively or negatively) and the cost of treating patients, even excluding specific tests and investigations required for the trial, may not be representative of routine clinical practice. All of these concerns fall under the heading of generalizability (14). The extent to which results from trials can be generalized is a point of debate, and for many purposes the lack of a formally defined sampling frame that would enable formal inference to an identifiable population may not be too important (15). Although clinical trials may use random allocation, that is not the same as and serves different purposes to random sampling. Phase IV studies aimed at investigating patient acceptability or health economics often have much simpler designs than those used to investigate safety and efficacy, and this simplicity may offer greater generalizability.
3 ESSENTIAL AND DESIRABLE FEATURES OF PHASE IV TRIALS Any piece of research should be fit for its purpose. Moreover, one might argue that particularly ethically sensitive experimental research should be at least adequate to reliably meet its objectives. Without belittling excellence in research, minimizing the unnecessary exposure of human or animal subjects as well as costs must also be important considerations. Hence, what is adequate for one trial may be inadequate or excessive for another. This applies to phase IV trials as well as trials in other phases, and other forms of research. Any study should be of an appropriate size and design to be able to reliably answer its intended question, or at least to contribute usefully to the current state of knowledge. In this respect, phase IV studies should not be seen as of second class importance when compared with phase III studies. Phase III trials often have regulatory hurdles to meet that may force them to be of a certain minimum standard; phase IV studies will not always have such hurdles, but they should still meet the scrutiny of ethics committees and of the general scientific community, including (although not restricted to) journal editors and referees. Hence, the features that we generally associate with a good study should be important for a phase IV study as well. However, that is not to say that randomized controlled trials must necessarily be considered the gold standard for phase IV studies, although they often are. 4 EXAMPLES OF PHASE IV STUDIES This section gives three published examples of studies that have been carried out after licensing of a drug, and so by implication they fall within the scope of phase IV. The first is an example of a ‘‘poor’’ study. Many such poor studies are published, and even more are conducted but not published. This example has not been picked out for any reason other than as representative of the type of study that is not very useful. The second and third examples are much better
although, as with any study at any stage of a drug's development life-cycle, neither is perfect.
4.1 Trimipramine and Cardiac Safety
Cohn et al. (16) studied the antidepressant trimipramine. These investigators were interested in its potential cardiotoxic effects, which seems a perfectly reasonable type of question to ask of a marketed drug and thus the type of question to be considered in phase IV. Sadly, their study has little value in answering such a question. They studied only 22 patients, had no control group, and seemingly had no specific prior hypothesis, only a very broad general objective. Their conclusions about the general cardiac safety of trimipramine were based on 60 of 61 paired t-tests yielding ''not statistically significant'' results (P > 0.05). Trials such as this do nothing to advance the credibility, potential value, or importance of phase IV studies, in contrast to the next two studies.
4.2 Cimetidine for Reducing Gastric Acid
Cimetidine was the first in its class as a treatment for reducing gastric acid, which meant it received a license for treating symptoms of gastric and duodenal ulcers, reflux, and other related indications. It was first licensed in the United States in August 1977. Because it represented a new class of treatment, there was an expectation that sales would take off quickly. Undoubtedly, the marketing group at the manufacturer, Smith Kline and French, was hopeful for a fast take-up. However, very quick uptake of a new medicine is always a cause for concern because it means there is not much time to identify possible safety signals before a large number of patients has been exposed. The resulting postmarketing surveillance study aimed to recruit 10,000 patients so that it had a 90% chance of observing at least one event if the true incidence was 1 in 4348, and an 80% chance of observing at least one event if the true incidence was 1 in 6250. (The rule of 3 tells us that the study had a 95% chance of observing at least one event if the true incidence was 1 in 3333.)
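These detection probabilities follow from the elementary "at least one event" calculation: with n independent patients and a true per-patient incidence p, the probability of observing no events is (1 - p)^n. The short sketch below is illustrative only, using the incidences quoted above; it reproduces the stated figures.

```python
# Chance of observing at least one adverse event among n exposed patients
# when the true per-patient incidence is p.
def prob_at_least_one(n, p):
    return 1.0 - (1.0 - p) ** n

n = 10_000
for label, p in [("1 in 4348", 1 / 4348),
                 ("1 in 6250", 1 / 6250),
                 ("1 in 3333 (rule of 3)", 1 / 3333)]:
    print(f"true incidence {label}: P(at least one event) = {prob_at_least_one(n, p):.2f}")
# Prints approximately 0.90, 0.80, and 0.95, matching the figures quoted above.
```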
The patients were recruited by physicians who were encouraged to take part in the study by sales representatives. The physicians recorded all adverse events, hospitalizations, consultations, changes in concomitant diagnoses and therapies, and any new conditions diagnosed after cimetidine had been prescribed. Data from 7248 patients who took part in the study were available for all of the analyses. A further 663 patients' data were available for the safety analyses. A full review of the results from the study is beyond the scope of this article—it is merely given for illustrative purposes. The results and more details of the design are given in Humphries et al. (17).
4.3 Salmeterol and Salbutamol in Asthma
To illustrate that phase IV studies are not restricted to surveillance (or ''survey'') designs, a randomized, double-blind, parallel-group trial (very much like one would expect to see as part of a phase III development) examined two asthma treatments developed by Glaxo. The study was not initiated because of any specific concern about these particular treatments but rather because of a general concern about the use of bronchodilators as maintenance treatment. The study was highly pragmatic; after participant randomization, the physicians could coprescribe whatever treatments they considered appropriate for their patients. The primary endpoint measures were any serious adverse events and any reasons (positive or negative) for withdrawal from the study. Over 25,000 patients throughout the United Kingdom were recruited and randomized in a 2:1 ratio to either salmeterol (50 µg) twice daily or salbutamol (200 µg) four times a day. Again, we will not consider the study's results or their interpretation, but it is worth noting that, in addition to comparative statements about the two treatments, the investigators also commented that ''Because of the large numbers we could compare our results with events related to asthma throughout the United Kingdom'' (18). This, however, was not strictly correct. The representative nature of a sample cannot be ensured simply by collecting a large sample. As previously noted, random allocation is not the same as—and serves a different purpose from—random sampling.
5 CONCLUSION
A variety of studies fall under the umbrella term of phase IV. That breadth is partly due to the lack of a satisfactory, unique definition for phase IV studies. Given the scope of phase IV as suggested here—that is, studies that take place after initial regulatory approval has been given—it should be clear that the breadth of studies is so wide because many questions remain unanswered at the time of first licensing. Some of these questions could be critically important but impossible to realistically assess before licensing. Other questions may be of potential importance, such as those examined in trials concerning potential new or extended indications. Still other phase IV studies have no value, and they fail to adequately address questions that themselves may be of dubious merit. Thus, phase IV studies cover a very wide spectrum. REFERENCES 1. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline: E8 General Considerations for Clinical Trials. Current Step 4 version, July 17, 1997. Available at: http://www.ich.org/LOB/media/MEDIA484 .pdf 2. S. J. Pocock, Clinical Trials: A Practical Approach. Chichester, UK: Wiley, 1983. 3. S. Piantadosi, Clinical Trials: A Methodologic Perspective, 2nd ed. New York: Wiley, 2005. 4. D. Wang, A. Bakhai, Clinical Trials. A Practical Guide to Design, Analysis and Reporting. London: Remedica, 2006. 5. A. Grieve, Phase IV trials. In: B. S. Everitt and C. R. Palmer (eds.), Encyclopaedic Companion to Medical Statistics. London: Hodder Arnold, 2005, pp. 259–260. 6. B. S. Everitt, The Cambridge Dictionary of Statistics. 2nd ed. Cambridge: Cambridge University Press, 2002. 7. A. S. Monto, The disappearance of Reye’s syndrome—a public health triumph [editorial]. New Engl J Med. 1999; 340: 1423– 1424. 8. P. Armitage, G. Berry, and J. N. Matthews, Statistical Methods in Medical Research. 4th ed. Oxford: Blackwell Science, 2002.
9. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline: E1 The Extent of Population Exposure to Assess Clinical Safety for Drugs Intended for Long-Term Treatment of Non-Life-Threatening Conditions. Current Step 4 version, October 27, 1994. Available at: http://www.ich.org/LOB/ media/MEDIA435.pdf 10. Gruppo Italiano per lo Studio della Streptochinasi nell’Infarto Miocardico (GISSI). Effectiveness of intravenous thrombolytic treatment in acute myocardial infarction. Lancet. 1986; 1: 397–402. 11. ISIS-2 Collaborative Group. Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17,187 cases of suspected acute myocardial infarction. Lancet. 1988; 2: 349–350. 12. ISIS-3 Collaborative Group. ISIS-3: a randomised comparison of streptokinase vs tissue plasminogen activator vs anistreplase and of aspirin plus heparin vs aspirin alone among 41,299 cases of suspected acute myocardial infarction. Lancet. 1992; 339: 753–770. 13. E. Vittinghoff, S. Scheer, P. O’Malley, G. Colfax, S. D. Holmberg, and S. P. Buchbinder, Combination antiretroviral therapy and recent declines in AIDS incidence and mortality. J Infect Dis. 1999; 179: 717–720. 14. A. L. Dans, L. F. Dans, G. H. Guyatt, and S. Richardson, Users’ guides to the medical literature. XIV. How to decide on the applicability of clinical trial results to your patient. JAMA. 1998; 279: 545–549. 15. M. Bland, An Introduction to Medical Statistics. Oxford: Oxford University Press, 1987. 16. J. B. Cohn, C. S. Wilcox, and L. I. Goodman, Antidepressant efficacy and cardiac safety of trimaprimine in patients with mild heart disease. Clin Ther. 1993; 15: 114–126. 17. T. J. Humphries, R. M. Myerson, L. M. Giford, M. E. Acugle, M. E. Josie, et al., A unique postmarket outpatient surveillance program of cimetidine: report on phase II and final summary. Am J Gastroenterol. 1984; 79: 593–596. 18. W. Castle, R. Fuller, J. Hall, and J. Palmer, Serevent nationwide surveillance study: comparison of salmeterol with salbutamol in asthmatic patients who require regular bronchodilator treatment. BMJ. 1993; 306: 1034–1037.
FURTHER READING
S. C. Chow and J. P. Liu, Design and Analysis of Clinical Trials. Concepts and Methodologies. New York: Wiley, 1998.
P. M. Fayers and D. Machin, Quality of Life. Assessment, Analysis and Interpretation. Chichester, UK: Wiley, 2007.
S. Senn, Statistical Issues in Drug Development. Chichester, UK: Wiley, 2007.
S. Senn, Pharmaceutical industry statistics. In: C. Redmond and T. Colton (eds.), Biostatistics in Clinical Trials. Chichester, UK: Wiley, 2001, pp. 332–337.
B. L. Strom, ed. Pharmacoepidemiology, 4th ed. Chichester, UK: Wiley, 2006.
CROSS-REFERENCES
Active-controlled trial, Clinical development plan, Combination trials, Noninferiority trial, Phase III trials, Postmarketing surveillance, Preference trials, Quality of life
PHYSICIANS’ HEALTH STUDY (PHS)
J. MICHAEL GAZIANO
Harvard Medical School, Brigham and Women's Hospital, Massachusetts Veterans Epidemiology and Research Information Center (MAVERIC), Boston VA Healthcare System, Boston, Massachusetts
CHARLES H. HENNEKENS
Florida Atlantic University, University of Miami, NOVA Southeastern University, Boca Raton, Florida
Launched in 1982, the Physicians' Health Study (PHS I) was one of the first large-scale randomized trials conducted entirely by mail. This primary prevention trial evaluated two hypotheses: whether low-dose aspirin could reduce the risk of cardiovascular disease events and whether beta carotene could lower cancer risk. It incorporated several design strategies to increase efficiency and decrease cost (1). These strategies included the choice of physicians as the study population, the use of a factorial design, the implementation of a prerandomization run-in phase, and the collection of prerandomization blood specimens. The design of PHS I enabled the maintenance of high compliance and long-term follow-up at a fraction of the usual cost of large-scale trials of primary prevention.
1 OBJECTIVES
This randomized double-blind, placebo-controlled trial was designed to test the benefits and risks of taking aspirin prophylactically (325 mg on alternate days) for the primary prevention of cardiovascular disease events and the prophylactic use of beta carotene (50 mg on alternate days) for the primary prevention of cancer. By using a 2 × 2 factorial design, it was possible to test two hypotheses simultaneously.
1.1 Aspirin Component
During the 1970s, Nobel Prize-winning basic research demonstrated that, in platelets, small amounts of aspirin (50–80 mg/day) irreversibly acetylate the active site of the isoenzyme cyclooxygenase-1 (COX-1) (2), which is required for the production of thromboxane A2 (3), a powerful promoter of aggregation (4). This effect persists for the entire life of the platelet, approximately 10–12 days. The evidence of benefit from antiplatelet therapy in secondary prevention was mounting as randomized trials were completed (5–7). In addition, results from a case-control study (8) and cohort studies (9,10) were compatible with the possibility that aspirin might reduce risk among individuals without cardiovascular disease. However, all observational studies are subject to uncontrolled and uncontrollable confounding that has a magnitude potentially as large as the effects being sought. PHS I was the first randomized trial to test the postulated benefit of aspirin reliably in many asymptomatic, apparently healthy individuals who could maintain compliance with a regimen for at least 5 years.
1.2 Beta Carotene Component
By the 1970s, basic research had indicated that the antioxidant properties of beta carotene provided a plausible mechanism for inhibiting carcinogenesis, and observational studies suggested that people who ate more beta carotene-rich fruits and vegetables had a somewhat lower risk of cancer and cardiovascular disease. However, separating the effects of beta carotene from the potential effects of other nutrients, including other dietary antioxidants, is difficult or even impossible in such observational studies. Furthermore, other health factors associated with fruit and vegetable intake might lower risk. A large randomized trial of significant duration was necessary to test this hypothesis directly, and PHS I was the first to do so.
2 STUDY DESIGN
The choice of physicians was integral to both the low cost and the high compliance and follow-up rates associated with this trial. PHS I investigators had conducted several pilot studies between 1979 and 1981 on random samples of U.S. physicians, which showed that doctors would exhibit excellent compliance with their assigned regimen and that they would complete the required questionnaires. The cost per randomized participant for previous primary prevention trials generally ranged from $3,000 to more than $15,000 (11). This compares with a cost of approximately $40 per participant per year for PHS I.
2.1 Physicians as Subjects
Physicians make excellent study participants for several reasons. They have a clear understanding of informed consent and can quickly recognize any potential side effects from a study regimen. Also, because of their ability to collect and report medical information accurately, it is both possible and practical to conduct a trial by mail. Lastly, physicians are less mobile and therefore easier to follow than the general population, which increases the likelihood that complete morbidity and mortality data can be collected for all participants for the entire duration of a trial. The trial was limited to male physicians because, at that
time, only a limited number of female physicians were middle-aged or older. In addition, the rates of cardiovascular disease events are higher among healthy middle-aged males than among females of the same age.
2.2 Use of a Factorial Design
Because there was no reason to believe that aspirin and beta carotene would interfere or interact with one another, the researchers concluded that by incorporating a 2 × 2 factorial design, they could study both agents easily and inexpensively (12). A factorial trial enabled them to evaluate these two interventions, separately and in combination, against a control. PHS participants were assigned to one of four possible combinations: active aspirin and active beta carotene, active aspirin and beta carotene placebo, aspirin placebo and active beta carotene, or aspirin placebo and beta carotene placebo.
2.3 Run-in and Randomization
In 1982, 261,248 male physicians between the ages of 40 and 84 years who were identified from an American Medical Association database were sent letters inviting them to participate in the trial along with a baseline questionnaire (Fig. 1). A total of 112,528 responded, with 59,285 indicating they were willing to participate. Respondents were deemed ineligible if they had a history of cancer, myocardial infarction, transient ischemic attack, or stroke; had current
liver or renal disease; had peptic ulcer or gout; had contraindications to aspirin consumption; were currently using aspirin or other platelet-active drugs or nonsteroidal anti-inflammatory agents; or were currently taking a vitamin A or a beta carotene supplement. A total of 33,223 eligible physicians were enrolled and entered an 18-week ‘‘run-in’’ phase, a pretrial period intended to test compliance with the study protocol to cull the individuals unlikely to take the required pills regularly for a long period of time. Pilot studies in U.S. physicians conducted by the PHS researchers and the initial follow-up from the British Doctors Study (13) both showed that the largest drop in compliance took place during the first several months. PHS investigators felt that the run-in phase would be very beneficial in yielding a group of committed high compliers for the long duration necessary to accrue sufficient cardiovascular
disease endpoints. During this phase, all of the physicians enrolled in the run-in received active aspirin and placebo beta carotene. On completion of the run-in, 11,152 men were excluded, and the remaining 22,071 physicians were randomly assigned to one of four groups: active aspirin and active beta carotene (n = 5,517), active aspirin and beta carotene placebo (n = 5,520), aspirin placebo and active beta carotene (n = 5,519), or aspirin placebo and beta carotene placebo (n = 5,515) (Fig. 2). Overall, men in the group that was randomized into the trial were older, leaner, less likely to smoke (11% were current smokers, and 39% were former smokers), more likely to exercise, and less likely to use aspirin. Few other differences existed between the physicians who successfully completed the run-in phase and those who didn't. Despite that, surprisingly, an analysis showed that, after
adjusting for age, men who were randomized were significantly less likely to die from any cause and significantly less likely to die from either a cardiovascular event or cancer than their peers who also participated in the run-in phase but who were not randomized (14).
Figure 1. More than 112,000 of the 261,248 male physicians who received a letter inviting them to participate in the trial responded. Of these physicians, 59,285 indicated they were willing to participate, and this group was culled down to 33,223 who were both willing and eligible. After an 18-week run-in phase designed to test compliance, a total of 22,071 physicians were randomized.
Figure 2. Because the study incorporated a 2 × 2 factorial design, the 22,071 participants were assigned to one of four groups: active aspirin plus beta carotene, active aspirin plus beta carotene placebo, aspirin placebo plus beta carotene, or both placebos.
2.3.1 Prerandomization Blood Sample Collection. During the 4-month run-in period, participants were sent kits that contained prepared Vacutainer tubes, cold packs for mailing, and prepaid shipping packs. They were asked to have their blood drawn into these tubes, to fractionate the blood by centrifugation, and to return the samples by overnight courier. On receipt in the PHS blood laboratory, samples were immediately frozen and stored in liquid nitrogen freezers. Almost 15,000 of the participants (68%) provided blood samples. After PHS I ended, participants were asked to provide a second blood sample. More than 11,000 participants did. These samples were frozen (after being processed and aliquoted) and used to explore several biochemical and genetic hypotheses.
3 FOLLOW-UP AND COMPLIANCE
The first follow-up questionnaires were sent at 6 and 12 months of treatment, then yearly. They included questions about compliance with the study treatments, use of nonstudy medications, occurrence of major illnesses or adverse effects, and other risk factor information. By December 31, 1995, the scheduled end of the beta carotene arm of the study, fewer than 1% of participants had been lost to follow-up, and compliance was 78% in the group that received beta carotene.
4 RESULTS
4.1 Aspirin
The Data and Safety Monitoring Board stopped the aspirin component of PHS I after 5 years primarily because aspirin had a significant effect on the risk of a first myocardial infarction—reducing it by 44% (P < 0.00001) (Fig. 3). Results for stroke and total mortality were uninformative because there were too few events. A significant 18% reduction (P = 0.01) was found in the risk of important vascular events [a combined endpoint consisting of nonfatal myocardial infarction (MI), nonfatal stroke, and cardiovascular disease (CVD) death].
Figure 3. Myocardial infarctions in PHS I: 239 first MIs occurred among the 11,034 men assigned to placebo, compared with 139 among the 11,037 assigned to aspirin. This statistically extreme 44% reduction in the risk of a first myocardial infarction led the PHS I Data and Safety Monitoring Board to stop the aspirin arm of the trial early.
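As a rough check, a crude relative risk can be computed directly from the event counts shown in Figure 3. The sketch below is illustrative only: the published 44% figure reflects the trial's full statistical analysis, so this simple two-by-two calculation approximates rather than reproduces it.

```python
import math

# Crude relative risk of a first MI, aspirin versus placebo, from the Figure 3 counts.
events_aspirin, n_aspirin = 139, 11_037
events_placebo, n_placebo = 239, 11_034

rr = (events_aspirin / n_aspirin) / (events_placebo / n_placebo)

# Large-sample 95% confidence interval for the relative risk on the log scale.
se_log_rr = math.sqrt(1 / events_aspirin - 1 / n_aspirin + 1 / events_placebo - 1 / n_placebo)
lower = math.exp(math.log(rr) - 1.96 * se_log_rr)
upper = math.exp(math.log(rr) + 1.96 * se_log_rr)

print(f"crude RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f}), "
      f"about a {100 * (1 - rr):.0f}% reduction in risk")
```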
4.2 Beta Carotene
Results from the beta carotene arm were equally important and demonstrated that 12 years of supplementation with beta carotene produced neither benefit nor harm. No differences were found in cardiovascular mortality (RR, 1.09; 95% CI, 0.93–1.27), MI (RR, 0.96; 95% CI, 0.84–1.09), stroke (RR, 0.96; 95% CI, 0.83–1.11), or a composite of the three endpoints (RR, 1.00; 95% CI, 0.91–1.09) associated with beta carotene assignment. In addition, no relationship was found between beta carotene and cancer mortality, malignant neoplasms, or lung cancer. In analyses limited to current or former smokers, no significant early or late effects of beta carotene on any endpoint of interest were detected.
4.3 Other Findings from the PHS
More than 250 papers based on data from the Physicians' Health Study have been published. Other results that have emerged from PHS I include:
• Regular use (more than 60 days/year) of ibuprofen and other nonsteroidal anti-inflammatory drugs may interfere with the cardioprotective effects of aspirin.
• Aspirin does not cause kidney damage, even among individuals who, over time, consume large quantities of aspirin.
• Aspirin may reduce the risk of asthma.
• Exercise helps prevent type 2 diabetes, especially in men who work out at least five times a week.
• Individuals who smoke one or more packs of cigarettes a day have twice the risk of macular degeneration, an eye disease that can cause loss of vision.
• Men with high blood levels of homocysteine have an increased risk of cardiovascular disease.
• Participants who eat whole grain breakfast cereals have reduced total and cardiovascular mortality.
• Elevated levels of insulin-like growth factor (IGF-1) are a powerful risk factor for prostate cancer; IGF-1 is also strongly linked to colon cancer risk.
5 CONCLUSIONS
PHS I was the first randomized trial to demonstrate that daily aspirin produces a statistically significant and clinically important reduction in risk of a first myocardial infarction. The trial was also the first to show that beta carotene has no significant benefit or harm on cancer or cardiovascular disease. This finding is particularly informative for cancer as the 12-year duration of the trial is far longer than any other randomized trial of beta carotene. At the time that the results from the aspirin component of the trial were published in 1989, only one other primary prevention trial of aspirin had been completed. In contrast to the Physicians’ Health Study I, the British Doctors’ Trial (13) showed no significant differences for fatal or nonfatal myocardial infarction, but the 95% confidence intervals were very wide. In the intervening years, four more large-scale trials have assessed the benefits of low-dose aspirin in the prevention of CVD. Taken together, these six randomized trials indicate a benefit of prophylactic aspirin in primary prevention of myocardial infarction and possibly ischemic stroke (15). The 10-year risk of a first coronary event in the subjects in these trials is less than 10% in five of the six. Because the benefits of aspirin are likely to outweigh the risks in individuals with a 10-year risk of a first coronary event of 10% or greater, several trials are now beginning to address the balance of benefits and risks of aspirin in populations with a 10-year risk of a first coronary event between 10% and 19%. The results from PHS I demonstrated that beta carotene alone was not responsible for the health benefits seen among people who ate plenty of fruits and vegetables. It was only one of several large randomized,
placebo-controlled trials that have been conducted since the 1980s to assess the role of beta carotene in cancer prevention. PHS I also found that supplementation with beta carotene had no effect on cardiovascular disease. Taken as a whole, results from largescale randomized trials of beta carotene in the primary prevention of CVD and of cancer have been disappointing. However, it is interesting to note that a recent study based on long-term consumption of beta carotene—12 or more years—among PHS participants has found that individuals with high intakes of beta carotene had improved cognitive skills (16). Over time, it has become clear that the physicians who have participated in the PHS I are healthier than the general U.S. population and also healthier than their physician peers who were invited to participate in the study but, for a variety of reasons, either didn’t or couldn’t. Certainly, most physicians have the advantage of excellent access to health care, which gives them a significant leg up over nonphysicians, so it is not surprising that doctors would be healthier than other Americans. What was surprising, however, was the difference between the 22,071 willing and eligible physicians who successfully completed the pretrial runin phase and their willing and eligible compatriots—11,152 men—who did not and were excluded, therefore, from the trial. PHS I ended on December 31, 1995, 13 years after it began in earnest when a computer assigned a Florida physician to a combination of beta carotene and placebo. However, 7,641 of the original participants continued on as part of the ongoing Physicians’ Health Study II (PHS II), which began in 1997 with nearly 15,000 participants. As the Physicians’ Health Study marks its 25th anniversary in 2008, most of the PHS participants continue to provide information on their health status. Initially, PHS II was three pronged, evaluating vitamin E, vitamin C, and a multivitamin for the primary prevention of cardiovascular disease, cancer (in general and prostate cancer, in particular), and cataracts as well as the age-related eye disease macular degeneration. The vitamin E and vitamin
C arms ended in 2007, but the multivitamin arm has been funded to continue until 2011. Results from the vitamin E and vitamin C arms are expected to be submitted for publication in 2008. REFERENCES 1. C. H. Hennekens and J. E. Buring, Methodologic considerations in the design and conduct of randomized trials: the U.S. Physicians’ Health Study. Control. Clin. Trials. 1989; 10: 142S–150S. 2. C. D. Funk, L. B. Funk, M. E. Kennedy, A. S. Pong, and G. A. Fitzgerald, Human platelet/erythroleukemia cell prostaglandin G/H synthase: cDNA cloning, expression, and gene chromosomal assignment. Faseb. J. 1991; 5: 2304–2312. 3. G. A. FitzGerald, Mechanisms of platelet activation: thromboxane A2 as an amplifying signal for other agonists. Am. J. Cardiol. 1991; 68: 11B–15B. 4. J. R. Vane, Inhibition of prostaglandin synthesis as a mechanism of action for aspirin-like drugs. Nat. New. Biol. 1971; 231: 232–235. 5. The Persantine-Aspirin Reinfarction Study Research Group, Persantine and aspirin in coronary heart disease. Circulation. 1980; 62: 449–461. 6. The Persantine-Aspirin Reinfarction Study Research Group, Persantine-aspirin reinfarction study. Design, methods and baseline results. Circulation. 1980; 62: II1–42. 7. Aspirin Myocardial Infarction Study Research Group, A randomized, controlled trial of aspirin in persons recovered from myocardial infarction. JAMA. 1980; 243: 661–669. 8. H. Jick and O. S. Miettinen, Regular aspirin use and myocardial infarction. Br. Med. J. 1976; 1: 1057. 9. C. H. Hennekens, L. K. Karlson, B. Rosner, A case-control study of regular aspirin use and coronary deaths. Circulation. 1978; 58: 35–38. 10. E. C. Hammond and L. Garfinkel, Aspirin and coronary heart disease: findings of a prospective study. Br. Med. J. 1975; 2: 269–271. 11. J. E. Buring and C. H. Hennekens, Cost and efficiency in clinical trials: the U.S. Physicians’ Health Study. Stat. Med. 1990; 9: 29–33.
12. M. J. Stampfer, J. E. Buring, W. Willett, B. Rosner, K. Eberlein, and C. H. Hennekens, The 2 × 2 factorial design: its application to a randomized trial of aspirin and carotene in U.S. physicians. Stat. Med. 1985; 4: 111–116. 13. R. Peto, R. Gray, R. Collins, et al., Randomised trial of prophylactic daily aspirin in British male doctors. Br. Med. J. (Clin. Res. Ed.) 1988; 296: 313–316. 14. H. D. Sesso, J. M. Gaziano, M. Van Denburgh, C. H. Hennekens, R. J. Glynn, and J. E. Buring, Comparison of baseline characteristics and mortality experience of participants and nonparticipants in a randomized clinical trial: the Physicians’ Health Study. Control. Clin. Trials. 2002; 23: 686–702. 15. J. S. Berger, M. C. Roncaglioni, F. Avanzini, I. Pangrazzi, G. Tognoni, and D. L. Brown, Aspirin for the primary prevention of cardiovascular events in women and men: a sex-specific meta-analysis of randomized controlled trials. JAMA. 2006; 295: 306–313. 16. F. Grodstein, J. H. Kang, R. J. Glynn, N. R. Cook, and J. M. Gaziano, A randomized trial of beta carotene supplementation and cognitive function in men: the Physicians’ Health Study II. Arch. Intern. Med. 2007; 167: 2184–2190.
FURTHER READING
L. Hansson, A. Zanchetti, S. G. Carruthers, et al., HOT Study Group, Effects of intensive blood-pressure lowering and low-dose aspirin in patients with hypertension: principal results of the Hypertension Optimal Treatment (HOT) randomised trial. Lancet. 1998; 351: 1755–1762.
Medical Research Council's General Practice Research Framework, Thrombosis prevention trial: randomised trial of low-intensity oral anticoagulation with warfarin and low-dose aspirin in the primary prevention of ischaemic heart disease in men at increased risk. Lancet. 1998; 351: 233–241.
G. de Gaetano, Collaborative Group of the Primary Prevention Project, Low-dose aspirin and vitamin E in people at cardiovascular risk: a randomised trial in general practice. Lancet. 2001; 357: 89–95.
P. M. Ridker, N. R. Cook, I. M. Lee, et al., A randomized trial of low-dose aspirin in the primary prevention of cardiovascular disease in women. N. Engl. J. Med. 2005; 352: 1293–1304.
CROSS-REFERENCES
Cardiovascular (CV), Factorial Designs, Over the Counter (OTC) Drugs, Primary Prevention Trials, Run-in Period
Placebos
History of Placebo
The placebo, which originates from Latin meaning "I shall please", was used for centuries as therapy for patients whom physicians were unsure how to treat or for whom no useful treatment was available [26]. Attitudes towards therapies in Western cultures have shifted dramatically over the last four decades. Principles of safety, efficacy, and informed consent of participants are now well entrenched (see Ethics of Randomized Trials); the randomized controlled trial (see Clinical Trials, Overview) dominates as the method for evaluating therapy, and patients are increasingly involved in decisions about their care [38]. The role of the placebo has evolved in response to all of these factors. In research, the notion of placebo control arose to account for those beneficial or harmful effects not directly attributable to the therapy of interest. Despite some dissent, placebos and placebo-controlled trials remain the benchmark by which all new drugs are evaluated and regulated [50]. In clinical care, any part of an encounter can have some therapeutic value and influence the patient's response, including taking a history, stating a diagnosis, making repeated measurements (e.g. blood pressure), or giving assurances about prognosis. Knowledgeable clinicians are interested in determining which parts of the placebo effect they should implement to optimize their patients' health.
Definitions
Many similar definitions for "placebo", "placebo effect", and "placebo response" exist. Here, we use the word placebo to mean an inert substance or sham procedure designed to appear identical to the active substance or procedure but without known therapeutic effect. Placebo effect or placebo response is the psychophysiologic effect associated with placebos, thought to be primarily operative through the expectations or symbolic meaning of the administered therapy [5, 36, 39]. This implies both positive and negative effects, although the latter are often referred to as the nocebo effect [19]. More recently, the placebo effect has been further characterized as a true placebo effect versus a perceived placebo effect.
The perceived placebo effect, which is the effect commonly quoted and discussed in the literature, includes the true placebo effect plus other nonspecific effects, including the natural course of disease, regression to the mean, unidentified co-intervention effects, and other time-dependent effects [15, 39]. The natural course of any disease or symptom will change in severity over time and may resolve on its own (see Postmarketing Surveillance of New Drugs and Assessment of Risk). Regression to the mean, where a follow-up measurement of a patient’s disease or symptoms gives a more normal reading, is a wellrecognized phenomenon. Patients enrolled in a trial may influence clinical outcomes by implementing unidentified parallel interventions (co-intervention), for example, by starting a weight-loss program while in a trial of a diabetes drug. Time-dependent effects might include the increasing skill of the investigator in measuring study endpoints, or the decreasing “white coat hypertension” effect as patients become used to having their blood pressure measured. Since these nonspecific effects are inherent in any therapy (i.e. active intervention, placebo, or no-treatment groups), it has been argued that we should be focusing on the true placebo effect as a measurement target [15, 23].
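One way to see this distinction is with a small simulation. The sketch below uses invented numbers and a deliberately simplified model of "nonspecific" improvement; it is only meant to illustrate why change from baseline in a placebo arm measures the perceived placebo effect, whereas a placebo versus no-treatment contrast at follow-up estimates the true placebo effect.

```python
import random

# Toy simulation (invented numbers, illustration only): nonspecific improvement
# occurs in both arms, so a between-arm comparison at follow-up isolates the
# true placebo effect, while change from baseline in the placebo arm does not.
random.seed(1)

TRUE_PLACEBO_EFFECT = 0.3   # assumed improvement caused by the placebo itself
NONSPECIFIC_CHANGE = 1.0    # assumed natural course / other nonspecific improvement

def follow_up(baseline, on_placebo):
    improvement = NONSPECIFIC_CHANGE + (TRUE_PLACEBO_EFFECT if on_placebo else 0.0)
    return baseline - improvement + random.gauss(0, 0.5)   # lower score = milder symptoms

baselines = [random.gauss(6, 1) for _ in range(100_000)]    # 0-10 symptom scale
placebo_arm = [follow_up(b, True) for b in baselines]
no_treatment_arm = [follow_up(b, False) for b in baselines]

mean = lambda xs: sum(xs) / len(xs)
perceived = mean(baselines) - mean(placebo_arm)         # ~1.3: true + nonspecific effects
true_only = mean(no_treatment_arm) - mean(placebo_arm)  # ~0.3: true placebo effect alone
print(f"perceived effect (change from baseline): {perceived:.2f}")
print(f"true effect (placebo vs no treatment):   {true_only:.2f}")
```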
Magnitude of the Placebo Effect
For decades, the placebo response has been assumed to be similar across disease categories at approximately 35% improvement from baseline [3]. This figure has been applied equally to the proportion of patients improving and to the degree of improvement per patient per outcome. More recently, doubt has been cast on the constant placebo response as evidence accumulates that both the true and perceived placebo effects are variable [15, 36, 55]. A review of 75 trials of antidepressant medications revealed that the placebo response rate varied from 12.5 to 51.8% and had increased over time [55]. A recent systematic review of randomized trials with placebo and no-treatment arms (see Meta-analysis of Clinical Trials) concluded that there was evidence of a mild true placebo effect for some subjective outcomes (e.g. pain and anxiety) but not for more objective outcomes (blood pressure, weight loss, asthma outcomes) [23]. For example, placebo was associated with a reduction in pain by a mean of 0.65 cm
on a 10 cm visual analog scale, as compared with no treatment. The authors measured the difference in outcomes between the two arms at the end of the treatment period rather than the change in outcomes from baseline, thus attempting to control for the nonspecific effects and determine the true placebo effect [22]. Their concluding caution against the use of the placebo and its effects for therapeutic purposes outside of a controlled clinical trial raised a storm of criticism. Although the validity of the findings was questioned, in terms of widely varying populations and diseases, heterogeneity and low statistical power [1, 46], wrong choice of placebo [27], and contamination of the "no-treatment" arms [14], in our opinion it is really the generalizability of the findings that is in question (see Validity and Generalizability in Epidemiologic Studies). Within randomized controlled trials, where patients acknowledge by their consent that they know they may not receive an active treatment, the perceived (total) placebo effect may be more muted than in clinical practice. Furthermore, even within randomized trials, unblinding can occur and bias clinician and patient – a situation bound to happen with a no-treatment arm. Finally, and most important, a finding that placebo therapies and no-treatment arms do not differ does not negate the possibility that a true placebo effect occurred (and occurred equally) in both groups. It is important to keep in mind that placebos are not risk free. Placebos may be harmful if they delay access to effective therapy for the disease under investigation, if their nonspecific symptomatic effects mask a condition that has effective treatment, or via a direct nocebo effect [1]. Where placebos are used without patient consent, any revelation of the deception may seriously undermine the patient–physician relationship, which is itself a powerful source of "placebo effect".
Influences on the Placebo Effect
The attitude and behavior of the clinician toward the treatment and the patient, the attitude of the patient toward her own health and the treatment, as well as external, cultural, family, and media influences, all influence the placebo effect [5, 13, 22, 28, 35, 36]. Treatment variables including appearance, invasiveness, impressiveness, perceived plausibility, past experience, and cost all appear to play a role.
Provider factors can produce major placebo effects and, although not well studied, are considered integral to the "art of medicine". The white coat of the clinician, the stethoscope, the interest, empathy, authority, and compassion displayed in the interview with the patient, and the motivation and skillfulness with which a diagnosis or therapeutic path is pursued each may influence the patient [5, 36, 39]. In a study of 200 British patients presenting to a physician with abnormal symptoms for which no firm diagnosis could be made, the patients were randomly assigned to a negative or positive consultation, and to placebo treatment or no treatment [7]. The positive consultation consisted of a firm diagnosis and reassurance that the patient would be better in a few days, while in the negative consultation the physician confessed uncertainty. Two weeks later, 64% of those who received a positive consultation reported improvement compared with 39% of those who received a negative consultation. The percentage of the placebo-treated group who improved was not significantly different from the percentage in the untreated group. Patient expectations, prior experiences with similar therapies, severity of current complaints, and suggestibility likely play a role in the placebo effect but have not been rigorously studied [2, 17, 29, 34, 36, 41, 42]. In a 10-week study of exercise, 48 healthy young adults were randomized to a control aerobics training program or an exercise program with constant reminders of the aim to improve both aerobic capacity and psychological well-being [13]. After 10 weeks, significant increases in fitness levels were seen in both groups; however, self-esteem and psychological well-being were significantly improved only in the experimental group, not the control group. Cultural differences in placebo responsiveness have also been explored. A review of 117 studies of ulcer treatment and 37 studies of treatment for anxiety showed that Germany had the highest placebo healing rates for ulcers but only moderate placebo rates for treatments of anxiety. In comparison, Italy had the lowest placebo effects for anxiety, while Brazil had the lowest rates for ulcers [35]. Similarly, a retrospective analysis of deaths attributed to lymphatic cancer found that Chinese-Americans who were born in "Earth years", and therefore deemed by Chinese medical history to be especially susceptible to diseases involving lumps, nodules, or tumors, had a lower mean age of death than those born in other years. No
such relationship could be seen in Caucasian controls who died of similar causes [35, 36]. Patients who enter into trials, compared with those who do not, and patients who adhere to their treatment, whether placebo or not, compared with those who do not (see Compliance Assessment in Clinical Trials), may have different outcomes. In a large study of beta-blockers to prevent a recurrence of a myocardial infarction, it was found that those who took more than 80% of their medications had a lower mortality rate compared to poor adherers, whether they received beta-blockers or placebo [35, 39]. A Cochrane methods review pilot looked at patients enrolled in phase III randomized controlled trials (see Clinical Trials, Overview) versus patients who were similar but not enrolled. In 23 out of 25 reports, better outcomes were documented for patients within the trials compared to those who were not. Overall, there were lower mortality and lower rates for complications of therapy [21]. An analysis of the CAMIAT (Canadian Amiodarone Myocardial Infarction Arrhythmia Trial) study [24] showed that adherence to placebo or amiodarone both predicted mortality. Patients who received placebo and were considered noncompliant had over a two-fold increased risk of sudden cardiac death, cardiac mortality, and all-cause mortality compared to those placebo patients who were compliant. Drug and treatment characteristics, apart from the pharmacology of the active ingredient(s), can have powerful placebo effects. Studies suggest that subjects react to the "meanings" or suggestion of the color and quantity of drugs. Red suggests "hot" or "danger", blue suggests "down" or "quiet", and a quantity of two means "more than one" [6, 12, 35]. A group of medical students were told that they were participating in a single-blinded study (see Blinding or Masking) on the psychological and physiologic effects of two drugs, a sedative and a stimulant, and received one or two placebo capsules of either blue or pink without attribution of any effect [6]. The consent form they were asked to sign gave a brief description of the stimulant or sedative side effects they might expect to experience. Study results showed that the students tended to experience stimulant reactions to the pink capsules, while the blue capsules produced depressant effects. Two capsules tended to produce a greater effect than one for both psychological changes (e.g. drowsiness) and physiologic changes (e.g. pulse) [37]. In placebo analgesia studies, the analgesia associated with an injection of saline solution was reversed with
an opiate antagonist such as naloxone and enhanced with a cholecystokinin antagonist such as proglumide, which suggests that these patients were experiencing a physiologic response such as the release of endogenous opioids [4, 35]. Furthermore, the placebo response can be shown to produce a typical pharmacokinetic profile of activity [56]. Surgery and other mechanical or invasive procedures have been thought to produce exaggerated placebo responses, ever since the original studies of sham internal mammary artery ligation for the treatment of angina, where 80% of patients in the sham treatment arm reported substantial improvement [9]. More recently, the value of an extremely common procedure, arthroscopic lavage and debridement in patients with osteoarthritis of the knee, was questioned when it was found no better than sham surgery for outcomes of pain and physical function [37]. Reviews of trials examining placebo or comparing different placebo routes indicate that injections are more powerful than pills, and devices or procedures, such as sham ultrasound or sham acupuncture, seem to be associated with stronger placebo effects than oral placebos [15, 28]. However, the extent to which the response is influenced by the procedure administrator versus the procedure itself remains unclear.
Ethics of Employing Placebo in Research
Many questions have been raised about the ethics of using placebos in human research. Strong critics of the placebo argue that it is not ethical to assign subjects to any intervention that has even the potential of being less efficacious than current therapy (see Ethics of Randomized Trials). The opposing view asserts that very few "standard" treatments have been proven effective by today's research standards, that placebo-controlled trials are a necessary first step with the smallest sample size to establish whether further research with a drug or treatment is warranted, and that patients tend to improve within these trials regardless of allocation because of the close attention and follow-up. Certain situations, such as life-threatening conditions where a proven effective treatment exists, are agreed to be inappropriate for placebo-only allocation [50]. One of the areas of medicine where the use of placebo has been debated widely is depression. Critics worried that randomization to placebo might lead
to serious harm including suicide. When this was reviewed, not only were completed and attempted suicide rates similar for placebo, standard antidepressants (imipramine, amitriptyline, trazodone), and the investigational antidepressants (fluoxetine, sertraline, paroxetine), but also the clinical responses were not very different among the three groups either [30]. Another large review noted that the response to placebo in depressed patients was highly variable (10–50%) between trials, and has increased over the years with increasing trial length [55]. All of these factors suggest that the use of active treatment controls instead of placebo would make accurate evaluation of interventions difficult. Although it has been well argued that research subjects, by way of informed consent, indicate their willingness to freely participate in a placebocontrolled trial, there is some evidence that patients may not be fully informed about risks and benefits of the treatment options [25, 43]. Studies of consent documents have shown that they sometimes overstate the benefits and understate the risks of research protocols and are frequently written beyond the reading level of patients [11, 33, 49]. Even when consent forms are accurate and appropriate, patients may confuse treatment in a clinical trial with that of individualized medical care, overestimate the benefits of participating in a trial, and underestimate the risks they may be involved in [20]. A recent survey of patients participating in cancer trials showed that while they were satisfied with the consent form and considered themselves well informed, a large percentage of these patients did not recognize that the treatment being described was not a standard therapy, that there was potential risk to themselves, and that the study drug was unproven [25]. This research suggests that some of the placebo effect is already evident at this early stage before any treatment is actually given – the trust in trial clinicians, the hope for therapeutic benefit, and so on may color the patient’s memory of consent.
Guidelines for the Use of Placebos in Research
Several prominent guidelines provide some direction for the use of placebos in research. All agree on basic principles such as noncoercion and fully informed consent for participants. In other matters, the guidelines differ somewhat. The Declaration of Helsinki
has the strictest guidelines and is often cited by critics of the placebo. First ratified by the World Medical Association (WMA) in 1964, it cites “The benefits, risks, burdens and effectiveness of a new method should be tested against those of the best current prophylactic, diagnostic and therapeutic methods. This does not exclude the use of placebo, or no treatment, in studies where no proven prophylactic, diagnostic, or therapeutic methods exists” [57]. However, its stand has been criticized as too strict as “best . . . methods” may not be the same as “best available” or “most cost-effective” comparators [40] and many standard therapies have not been rigorously compared to determine which is the best. The International Conference on Harmonization (ICH), a committee with representatives from the drug industry and regulatory authorities internationally, has published ICH E10 – Guideline for Choice of Control Group and Related Issues in Clinical Trials [53]. These guidelines, citing the unique scientific usefulness and general safety of placebos, allow the use of placebo controls even when effective treatment exists unless subjects would be exposed to an unacceptable risk of death or permanent injury or if the toxicity of standard therapy is so severe that “many patients have refused to receive it”. ICH E10 advocates the use of modified study designs whenever possible, such as add-on studies, factorial designs, or “early escape” from ineffective therapy [53]. Various national regulatory and research bodies have formulated their own guidelines. The TriCouncil Policy Statement (TCPS) from Canada’s three main research granting agencies, requires that grant recipients abide by the Declaration of Helsinki, prohibiting the use of placebo-controlled trials if there is an existing effective therapy available. However, it allows for broad exemptions, for example, for exceptional circumstances where effective treatment is not available to patients due to cost constraints or short supply, use in refractory patients, for testing add-on treatment to standard therapy, where patients have provided an informed refusal of standard therapy for a minor condition for which patients commonly refuse treatment, or when withholding such therapy will not lead to undue suffering or the possibility of irreversible harm [51]. The US Food and Drug Administration’s Code of Federal Regulations (CFR), revised in April 2002, accepts placebocontrolled trials and does not stipulate any restrictions but strongly advocates the need for informed
consent [10, 18]. The primacy of informed patient consent in placebo-controlled trials has spurred a number of critiques [25, 32, 48], challenges [52], and recommendations to improve processes of obtaining consent [11, 31, 44, 47, 54]. Patients themselves have become involved. The National Depressive and Manic-depressive Association, a large patient-directed illness-specific organization in the United States, recently developed a consensus development panel to discuss the controversy regarding the use of placebo [8]. In their guidelines, they acknowledge that mood disorders are episodic, chronic conditions that are associated with considerable morbidity and have no curative or fully preventative treatments. Despite the Declaration of Helsinki, it was agreed that mood disorder research was not at the point where noninferiority trials (trials designed only to rule out that the new therapy is worse than the control) (see Equivalence Trials) involving active controls could be considered scientifically valid designs. Therefore, the guidelines cite "placebo is justified when testing a new antidepressant with a novel mechanism of action that has a substantial probability of efficacy with an acceptable adverse effect risk. However, placebo is also ethical in studies of new drugs in a class because the newer members may offer important advantages over the original drug." Despite the guidelines that currently aid researchers regarding the use of placebos in research, there are ongoing issues that need to be addressed. One could debate the "proof" of efficacy compared to placebo for many drugs in current use, such that the counterargument has been made that it might be unethical to insist on the use of these drugs in control groups. The Declaration of Helsinki's view on placebo was established in an attempt to deter pharmaceutical companies and research organizations from exploiting people in poorer populations, who may not have access to proven treatments. This concern is not relevant in many countries. Advocating the use of placebos only when no standard treatment exists leaves the efficacy of drugs in certain patient groups largely unknown. Many clinicians need to know whether a therapy is "better than nothing" and could therefore be an alternative for patients who do not respond to the conventional treatment or cannot tolerate it [45]. In chronic diseases such as hypertension, diabetes, and vascular disease, combinations of therapies are increasingly prevalent. It is likely a more
efficient assessment of a new drug to test it against placebo initially before embarking on multiple, more complicated, larger sample size, drug add-on trials.
Innovations to Improve Research Involving Placebo
Apart from improving the process of informed consent as described above, several themes of innovative trial design are developing. All attempt to minimize patient exposure to placebo only or to increase the efficiency of the design to lower sample size. The first design, now commonly used, is the add-on trial, where one group gets both the standard therapy and the new therapy while the other group gets the standard therapy only. The difference between the mean response of the combination standard/new therapy group and the mean of the standard therapy group alone is a reasonable estimate of the effect of the new therapy, provided the two therapies do not interact with each other. The second design was developed for antidepressant drug trials to address the high placebo response rates [55]. This two-phase randomized crossover method initially randomizes more patients to placebo than active therapy and, in the second phase, crosses over only placebo nonresponders [16]. It is intended to minimize the overall placebo response rate and thus the sample size required to show a clinically important difference (see Sample Size Determination for Clinical Trials). The third design avoids the bias of patient selection according to placebo response. A 2 × 2 factorial design involving standard and new therapies has four groups: double placebo, new therapy, standard therapy, and combination standard therapy/new therapy. With a 1:1:1:1 allocation ratio, there are two estimates of the efficacy of the new therapy: the difference between the new therapy group mean and the double placebo group mean, and the difference between the combination standard therapy/new therapy group mean and the standard therapy group mean. These two estimates are pooled together to measure the overall efficacy of the new therapy by computing the mean of the two estimates. These two estimates should be similar to each other provided the new therapy does not interact with the standard therapy. However, one half of the patients are given either the double placebo or the new therapy of yet unproven efficacy. This may present an ethical problem if the
standard therapy has excellent efficacy and clinicians are reluctant to deny patients access to it. If the investigator were to drop these two groups from the third design, the result is the first (add-on) design, and the measure of interaction between the two therapies has to rely on theory, not data. A fourth design is possible where there are two known-to-be-effective therapies available and where some clinicians might use the first standard, some might use the second standard, and some might use both standards together. Here, a 2 × 2 × 2 factorial design with eight groups and an equal allocation ratio should be considered: triple placebo, new therapy, first standard therapy, second standard therapy, double combination of the new therapy and the first standard therapy, double combination of the new therapy with the second standard therapy, double combination of the first standard and second standard therapies, and triple combination of the new therapy, the first standard therapy, and the second standard therapy. The design would provide four estimates of the efficacy of the new therapy, namely: (1) the difference between the mean of the new therapy group and the mean of the triple placebo group, (2) the difference between the mean of the double combination of new therapy and first standard group and the mean of the first standard group, (3) the difference between the mean of the double combination of the new therapy and second standard therapy and the mean of the second standard group, and (4) the difference between the mean of the triple combination of new therapy, first standard therapy, and second standard therapy and the mean of the double combination of the first and second standard therapies. These four estimates should be similar to each other provided the new therapy does not interact with either of the first or second standard therapies, an assumption that can be checked. Finally, if the two ethically challenged groups, placebo and new therapy, are dropped from the fourth design, then the remaining fractional factorial design has six groups that permit three estimates of efficacy of the new therapy, namely, (2), (3), and (4). Again, these three estimates should be similar to each other provided the two standard therapies do not interact with the new therapy, and these three estimates should be similar to the one estimate of the efficacy of the new therapy that it is not possible to obtain from the six groups, namely (1). This six-group fractional factorial design still permits the efficacy of the new therapy to be estimated by pooling together the three
estimates, that is, by computing the mean of (2), (3), and (4). Provided the interaction assumption is reasonable, this three-term mean should provide adequate evidence of the efficacy of the new therapy. This fifth design should be ethically acceptable since no patient is denied the benefit of a known therapy.
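The pooling just described is simple arithmetic on group means. The sketch below, written in Python with invented numbers rather than data from any actual trial, shows how the two efficacy contrasts of the 2 × 2 factorial design, their pooled mean, and the interaction check might be computed; all names and values are illustrative assumptions.

```python
# Illustrative sketch (hypothetical numbers): pooling the two efficacy
# contrasts from the 2 x 2 factorial design described above.
import numpy as np

# Hypothetical group means and standard errors of the mean for the four arms:
# double placebo (P), new therapy alone (N), standard therapy alone (S),
# and the combination (NS).  Larger values mean a better response.
mean = {"P": 4.1, "N": 6.0, "S": 7.2, "NS": 9.0}
sem  = {"P": 0.6, "N": 0.6, "S": 0.6, "NS": 0.6}

# Two estimates of the effect of the new therapy.
est1 = mean["N"] - mean["P"]     # new therapy vs double placebo
est2 = mean["NS"] - mean["S"]    # combination vs standard therapy alone
se1 = np.hypot(sem["N"], sem["P"])
se2 = np.hypot(sem["NS"], sem["S"])

# Pooled estimate: the simple mean of the two contrasts (equal allocation).
pooled = 0.5 * (est1 + est2)
pooled_se = 0.5 * np.hypot(se1, se2)   # the contrasts use disjoint groups

# The two contrasts should agree if the therapies do not interact;
# their difference is the usual estimate of the interaction.
interaction = est2 - est1
interaction_se = np.hypot(se1, se2)

print(f"estimate 1: {est1:.2f} (SE {se1:.2f})")
print(f"estimate 2: {est2:.2f} (SE {se2:.2f})")
print(f"pooled effect of new therapy: {pooled:.2f} (SE {pooled_se:.2f})")
print(f"interaction check: {interaction:.2f} (SE {interaction_se:.2f})")
```

Because the two contrasts use disjoint groups, their standard errors combine in the usual way; a large interaction estimate relative to its standard error would warn that pooling is not appropriate.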
Summary The placebo’s role in medicine has been and continues to be in transition. For decades, the dispute regarding the placebo revolved around its ability (or not) to induce psychological or physiological effects. More recently, the debate has focused on the utilization of the placebo-controlled trial, its usefulness, and whether, despite informed consent, it impinges on patient autonomy and the practice of beneficence. There are currently several international guidelines that direct the researchers on the use of placebos; however, there are prominent differences between them. The future is likely to bring developments in trial designs where a new therapy’s effect can be compared to placebo yet no individual subject is exposed only to placebo, designs that lower the placebo response rate so that treatment effect may be ascertained more efficiently, and designs that allow further exploration and exploitation of factors influencing the placebo effect.
Acknowledgments Dr. Holbrook is the recipient of a Canadian Institutes for Health Research Career Investigator award.
References
[1] Bailar, J.C. III (2001). The powerful placebo and the Wizard of Oz, The New England Journal of Medicine 344, 1630–1632.
[2] Barsky, A.J., Saintfort, R., Rogers, M.P. & Borus, J.F. (2002). Nonspecific medication side effects and the nocebo phenomenon, JAMA 287, 622–627.
[3] Beecher, H.K. (1955). The powerful placebo, JAMA 159, 1602–1606.
[4] Benedetti, F. (1996). The opposite effects of the opiate antagonist naloxone and the cholecystokinin antagonist proglumide on placebo analgesia, Pain 64, 535–543.
[5] Benson, H. & Friedman, R. (1996). Harnessing the power of the placebo effect and renaming it "remembered wellness", Annual Review of Medicine 47, 193–199.
[6] Blackwell, B., Bloomfield, S.S. & Buncher, C.R. (1972). Demonstration to medical students of placebo responses and non-drug factors, Lancet 1, 1279–1282.
[7] Brody, H. & Waters, D.B. (1980). Diagnosis is treatment, The Journal of Family Practice 10, 445–449.
[8] Charney, D.S., Nemeroff, C.B., Lewis, L., Laden, S.K., Gorman, J.M., Laska, E.M., Borenstein, M., Bowden, C.L., Caplan, A., Emslie, G.J., Evans, D.L., Geller, B., Grabowski, L.E., Herson, J., Kalin, N.H., Keck, P.E. Jr., Kirsch, I., Krishnan, K.R., Kupfer, D.J., Makuch, R.W., Miller, F.G., Pardes, H., Post, R., Reynolds, M.M., Roberts, L., Rosenbaum, J.F., Rosenstein, D.L., Rubinow, D.R., Rush, A.J., Ryan, N.D., Sachs, G.S., Schatzberg, A.F. & Solomon, S. (2002). National Depressive and Manic-Depressive Association consensus statement on the use of placebo in clinical trials of mood disorders, Archives of General Psychiatry 59, 262–270.
[9] Cobb, L., Thomas, G.I., Dillard, D.H., Merendino, K.A. & Bruce, R.A. (1959). An evaluation of internal mammary artery ligation by a double blind technic, The New England Journal of Medicine 260, 1118.
[10] Code of Federal Regulations, Title 21, Volume 5. Revised April 1, 2002. Accessed on November 18, 2002 at http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm?FR=314.126.
[11] Coyne, C.A., Xu, R., Raich, P., Plomer, K., Dignan, M., Wenzel, L.B., et al. (2003). Randomized, controlled trial of an easy-to-read informed consent statement for clinical trial participation: a study of the Eastern Cooperative Oncology Group, Journal of Clinical Oncology 21, 836–842.
[12] de Craen, A.J., Roos, P.J., Leonard de Vries, A. & Kleijnen, J. (1996). Effect of colour of drugs: systematic review of perceived effect of drugs and of their effectiveness, BMJ 313, 1624–1626.
[13] Desharnais, R., Jobin, J., Cote, C., Levesque, L. & Godin, G. (1993). Aerobic exercise and the placebo effect: a controlled study, Psychosomatic Medicine 55, 149–154.
[14] Einarson, T.E., Hemels, M. & Stolk, P. (2001). Is the placebo powerless? The New England Journal of Medicine 345, 1277–1279.
[15] Ernst, E. & Resch, K.L. (1995). Concept of true and perceived placebo effects, BMJ 311, 551–553.
[16] Fava, M., Evins, A.E., Dorer, D.J. & Schoenfeld, D.A. (2003). The problem of the placebo response in clinical trials for psychiatric disorders: culprits, possible remedies, and a novel study design approach, Psychotherapy and Psychosomatics 72, 115–127.
[17] Greene, P.J., Wayne, P.M., Kerr, C.E., Weiger, W.A., Jacobson, E., Goldman, P. & Kaptchuk, T. (2001). The powerful placebo: doubting the doubters, Advances in Mind-Body Medicine 17, 298–307.
[18] Guidance for institutional review boards and clinical investigators, 21 CFR Part 50, 1998 Update. Accessed on November 18, 2002 at http://www.fda.gov/oc/ohrt/irbs/appendixb.html.
[19] Hahn, R.A. (1997). The nocebo phenomenon: concept, evidence, and implications for public health, Preventive Medicine 26, 607–611.
[20] Horng, S. & Miller, F.G. (2002). Is placebo surgery unethical? The New England Journal of Medicine 347, 137–139.
[21] How do the outcomes of patients treated within randomized control trials compare with those of similar patients treated outside these trials? (2001). http://hiru.mcmaster.ca/ebm/trout/, accessed November 17, 2002.
[22] Hrobjartsson, A. (2002). What are the main methodological problems in the estimation of placebo effects? Journal of Clinical Epidemiology 55, 430–435.
[23] Hrobjartsson, A. & Gotzsche, P.C. (2001). Is the placebo powerless? An analysis of clinical trials comparing placebo with no treatment, The New England Journal of Medicine 344, 1594–1602.
[24] Irvine, J., Baker, B., Smith, J., Jandciu, S., Paquette, M., Cairns, J., et al. (1999). Poor adherence to placebo or amiodarone therapy predicts mortality: results from the CAMIAT study (Canadian Amiodarone Myocardial Infarction Arrhythmia Trial), Psychosomatic Medicine 61, 566–575.
[25] Joffe, S., Cook, E.F., Cleary, P.D., Clark, J.W. & Weeks, J.C. (2001). Quality of informed consent in cancer clinical trials: a cross-sectional survey, Lancet 358, 1772–1777.
[26] Kaptchuk, T.J. (1998). Powerful placebo: the dark side of the randomised controlled trial, Lancet 351, 1722–1725.
[27] Kaptchuk, T.J. (2001). Is the placebo powerless? The New England Journal of Medicine 345, 1277–1279.
[28] Kaptchuk, T.J., Goldman, P., Stone, D.A. & Stason, W.B. (2000). Do medical devices have enhanced placebo effects? Journal of Clinical Epidemiology 53, 786–792.
[29] Khan, A., Leventhal, R.M., Khan, S.R. & Brown, W.A. (2002). Severity of depression and response to antidepressants and placebo: an analysis of the Food and Drug Administration database, Journal of Clinical Psychopharmacology 22, 40–45.
[30] Khan, A., Warner, H.A. & Brown, W.A. (2000). Symptom reduction and suicide risk in patients treated with placebo in antidepressant clinical trials: an analysis of the Food and Drug Administration database, Archives of General Psychiatry 57, 311–317.
[31] Lavori, P.W., Sugarman, J., Hays, M.T. & Feussner, J.R. (1999). Improving informed consent in clinical trials: a duty to experiment, Controlled Clinical Trials 20, 187–193.
[32] Lilford, R.J. (2003). Ethics of clinical trials from a Bayesian and decision analytic perspective: whose equipoise is it anyway? BMJ 326, 980–981.
[33] Macklin, R. (1999). The ethical problems with sham surgery in clinical research, The New England Journal of Medicine 341, 992–996.
[34] Mataix-Cols, D., Rauch, S.L., Manzo, P.A., Jenike, M.A. & Baer, L. (1999). Use of factor-analyzed symptom dimensions to predict outcome with serotonin reuptake inhibitors and placebo in the treatment of obsessive-compulsive disorder, American Journal of Psychiatry 156, 1409–1416.
[35] Moerman, D.E. (2000). Cultural variations in the placebo effect: ulcers, anxiety, and blood pressure, Medical Anthropology Quarterly 14, 51–72.
[36] Moerman, D.E. & Jonas, W.B. (2002). Deconstructing the placebo effect and finding the meaning response, Annals of Internal Medicine 136, 471–476.
[37] Moseley, J.B., O'Malley, K., Petersen, N.J., Menke, T.J., Brody, B.A., Kuykendall, D.H., Hollingsworth, J.C., Ashton, C.M. & Wray, N.P. (2002). A controlled trial of arthroscopic surgery for osteoarthritis of the knee, The New England Journal of Medicine 347, 81–88.
[38] O'Connor, A.M., Rostom, A., Fiset, V., Tetroe, J., Entwistle, V., Llewellyn-Thomas, H., et al. (1999). Decision aids for patients facing health treatment or screening decisions: systematic review, BMJ 319, 731–734.
[39] Papakostas, Y.G. & Daras, M.D. (2001). Placebos, placebo effect, and the response to the healing situation: the evolution of a concept, Epilepsia 42, 1614–1625.
[40] Riis, P. (2000). Perspectives on the fifth revision of the Declaration of Helsinki, JAMA 284, 3045–3046.
[41] Rochon, P.A., Binns, M.A., Litner, J.A., Litner, G.M., Fischbach, M.S., Eisenberg, D., Kaptchuk, T.J., Stason, W.B. & Chalmers, T.C. (1999). Are randomized control trial outcomes influenced by the inclusion of a placebo group?: a systematic review of nonsteroidal anti-inflammatory drug trials for arthritis treatment, Journal of Clinical Epidemiology 52, 113–122.
[42] Rosenberg, N.K., Mellergard, M., Rosenberg, R., Beck, P. & Ottosson, J.O. (1991). Characteristics of panic disorder patients responding to placebo, Acta Psychiatrica Scandinavica, Supplementum 365, 33–38.
[43] Rothman, K.J. & Michels, K.B. (1994). The continuing unethical use of placebo controls, The New England Journal of Medicine 331, 394–398.
[44] Slater, E.E. (2002). IRB reform, The New England Journal of Medicine 346, 1402–1404.
[45] Solomon, D.A. (1995). The use of placebo controls, The New England Journal of Medicine 332, 62.
[46] Spiegel, D., Kraemer, H. & Carlson, R.W. (2001). Is the placebo powerless? The New England Journal of Medicine 345, 1276–1279.
[47] St. Joseph's Healthcare/McMaster University/Hamilton Health Sciences. Informed consent checklist. http://www.fhs.mcmaster.ca/csd/forms/ic-chklist.pdf, accessed July 30, 2003.
[48] Steinbrook, R. (2002). Protecting research subjects – the crisis at Johns Hopkins, The New England Journal of Medicine 346, 716–720.
[49] Tattersall, M.H. (2001). Examining informed consent to cancer clinical trials, Lancet 358, 1742–1743.
[50] Temple, R. & Ellenberg, S.S. (2000). Placebo-controlled trials and active-control trials in the evaluation of new treatments. Part 1: ethical and scientific issues, Annals of Internal Medicine 133, 455–463.
[51] Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans. Updated November 21, 2000. Accessed on November 20, 2002 at http://www.nserc.ca/programs/ethics/english/policy.htm.
[52] Truog, R.D., Robinson, W., Randolph, A. & Morris, A. (1999). Is informed consent always necessary for randomized, controlled trials? The New England Journal of Medicine 340, 804–807.
[53] U.S. Department of Health and Human Services, Food and Drug Administration (2001). Guidance for Industry: E10 Choice of Control Group and Related Issues in Clinical Trials. Available at http://www.fda.gov/cder/guidance/4155fnl.pdf, accessed July 24, 2003.
[54] Verheggen, F.W., Jonkers, R. & Kok, G. (1996). Patients' perceptions on informed consent and the quality of information disclosure in clinical trials, Patient Education and Counseling 29, 137–153.
[55] Walsh, B.T., Seidman, S.N., Sysko, R. & Gould, M. (2002). Placebo response in studies of major depression: variable, substantial, and growing, JAMA 287, 1840–1847.
[56] Weiner, M. & Weiner, G.J. (1996). The kinetics and dynamics of responses to placebo, Clinical Pharmacology and Therapeutics 60, 247–254.
[57] World Medical Association (2000). Declaration of Helsinki: ethical principles for medical research involving human subjects, JAMA 284, 3043–3045.
ANNE HOLBROOK, CHARLES H. GOLDSMITH & MOVA LEUNG
PLANNING A GROUP-RANDOMIZED TRIAL
DAVID M. MURRAY, Division of Epidemiology, School of Public Health, The Ohio State University, Columbus, Ohio
1 INTRODUCTION
Planning a group-randomized trial (GRT) is a complex process. Readers interested in a more detailed discussion might wish to consult Murray's text on the design and analysis of GRTs (1), from which much of this article was abstracted. Donner and Klar's text is another good source of information (2).
2 THE RESEARCH QUESTION
The driving force behind any GRT must be the research question. The question will be based on the problem of interest and will identify the target population, the setting, the endpoints, and the intervention. In turn, those factors will shape the design and analytic plan. Given the importance of the research question, the investigators must take care to articulate it clearly. Unfortunately, that does not always happen. Investigators may have ideas about the theoretical or conceptual basis for the intervention, and often even clearer ideas about the conceptual basis for the endpoints. They may even have ideas about intermediate processes. However, without very clear thinking about each of these issues, the investigators may find themselves at the end of the trial unable to answer the question of interest. To put themselves in a position to articulate their research question clearly, the investigators should first document thoroughly the nature and extent of the underlying problem and the strategies and results of previous efforts to remedy that problem. A literature review and correspondence with others working in the field are ingredients essential to that process, as the investigators should know as much as possible about the problem before they plan their trial. Having become experts in the field, the investigators should choose the single question that will drive their GRT. The primary criteria for choosing that question should be: (1) Is it important enough to do?, and (2) Is this the right time to do it? Reviewers will ask both questions, and the investigators must be able to provide well-documented answers. Most GRTs seek to prevent a health problem, so that the importance of the question is linked to the seriousness of that problem. The investigators should document the extent of the problem and the potential benefit from a reduction in that problem. The question of timing is also important. The investigators should document that the question has not been answered already and that the intervention has a good chance to improve the primary endpoint in the target population, which is most easily done when the investigators are thoroughly familiar with previous research in the area; when the etiology of the problem is well known; when a theoretical basis exists for the proposed intervention; when preliminary evidence exists on the feasibility and efficacy of the intervention; when the measures for the dependent and mediating variables are well-developed; when the sources of variation and correlation as well as the trends in the endpoints are well understood; and when the investigators have created the research team to carry out the study. If that is not the state of affairs, then the investigators must either invest the time and energy to reach that state or choose another question. Once the question is selected, it is very important to put it down on paper. The research question is easily lost in the day-to-day details of the planning and execution of the study, and because much time can be wasted in pursuit of issues that are not really central to the research question, the investigators should take care to keep that question in mind.
3 THE RESEARCH TEAM
Having defined the question, the investiga-
tors should determine whether they have expertise sufficient to deal with all the challenges that are likely to occur as they plan and execute the trial. They should identify the skills that they do not have and expand the research team to ensure that those skills are available. All GRTs will need expertise in research design, data collection, data processing and analysis, intervention development, intervention implementation, and project administration. As the team usually will need to convince a funding agency that they are appropriate for the trial, it is important to include experienced and senior investigators in key roles. No substitute exists for experience with similar interventions, in similar populations and settings, using similar measures, and similar methods of data collection and analysis. As those skills are rarely found in a single investigator, most trials will require a team, with responsibilities shared among its members. Most teams will remember the familiar academic issues (e.g., statistics, data management, intervention theory), but some may forget the very important practical side of trials involving identifiable groups. However, to forget the practical side is a sure way to get into trouble. For example, a school-based trial that does not include on its team someone who is very familiar with school operations is almost certain to get into trouble with the schools. A hospital-based trial that does not include on its team someone who is very familiar with hospital operations is almost certain to get into trouble with the hospitals. The same can be said for every other type of identifiable group, population, or setting that might be used. 4
THE RESEARCH DESIGN
The fundamentals of research design apply to GRTs as well as to other comparative designs. As they are discussed in many familiar textbooks (3–7), they will be reviewed only briefly here. Additional information may be found in two recent textbooks on the design and analysis of GRTs (1, 2). The goal in the design of any comparative trial is to provide the basis for valid inference that the intervention as implemented caused
the result(s) as observed. To meet that goal, three elements are required: (1) there must be control observations, (2) there must be a minimum of bias in the estimate of the intervention effect, and (3) there must be sufficient precision for that estimate. The nature of the control observations and the way in which the groups are allocated to treatment conditions will determine, in large measure, the level of bias in the estimate of the intervention effect. Bias exists whenever the estimate of the intervention effect is different from its true value. If that bias is substantial, the investigators will be misled about the effect of their intervention, as will the other scientists and policy makers who use their work. Even if adequate control observations are available so that the estimate of the intervention effect is unbiased, the investigator should know whether the effect is greater than would be expected by chance, given the level of variation in the data. Statistical tests can provide such evidence, but their power to do so will depend heavily on the precision of the intervention effect estimate. As the precision improves, it will be easier to distinguish true effects from the underlying variation in the data.
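Precision in a GRT is driven largely by the intraclass correlation (ICC) and the number of groups per condition; the Power section below quotes the variance inflation factor 1 + (m − 1)ICC for the simplest analysis. The following sketch is a rough illustration only, with an assumed ICC and group sizes chosen for the example rather than recommended values.

```python
# Illustrative sketch: how the intraclass correlation (ICC) inflates the
# variance of a condition mean in a GRT, via the factor 1 + (m - 1) * ICC.
# The ICC and group sizes below are assumed values for illustration only.

def design_effect(m: int, icc: float) -> float:
    """Variance inflation for a mean based on groups of m members each."""
    return 1.0 + (m - 1) * icc

def effective_n(g: int, m: int, icc: float) -> float:
    """Number of independent observations that g*m correlated members are worth."""
    return g * m / design_effect(m, icc)

icc = 0.02  # a small ICC, assumed here purely for the example
for g, m in [(4, 250), (10, 100), (20, 50)]:   # same total of 1000 members
    deff = design_effect(m, icc)
    print(f"{g:2d} groups of {m:3d}: design effect {deff:4.2f}, "
          f"effective n {effective_n(g, m, icc):6.1f}")
```

With the same total number of members, spreading them over more and smaller groups yields a smaller design effect and a larger effective sample size, which anticipates the advice given in the Power section.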
5 POTENTIAL DESIGN PROBLEMS AND METHODS TO AVOID THEM For GRTs, the four sources of bias that are particularly problematic and should be considered during the planning phase are selection, differential history, differential maturation, and contamination. Selection bias refers to baseline differences among the study conditions that may explain the results of the trial. Bias caused by differential history refers to some external influence that operates differentially among the conditions. Bias caused by differential maturation reflects uneven secular trends among the groups in the trial favoring one condition or another. These first three sources of bias can mask or mimic an intervention effect, and all three are more likely given either non-random assignment of groups or random assignment of a limited number of groups to each condition.
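The paragraph that follows recommends randomization of a sufficient number of groups, strengthened by matching or stratification. A minimal sketch of such an allocation is given below; the school names, strata, and seed are hypothetical and the snippet is not drawn from any published protocol.

```python
# Minimal sketch: random assignment of identifiable groups (here, hypothetical
# schools) to two study conditions within strata, as recommended below.
import random

schools = {
    "small": ["A", "B", "C", "D"],
    "large": ["E", "F", "G", "H"],
}

rng = random.Random(20240101)  # fixed seed so the allocation is reproducible
allocation = {}
for stratum, members in schools.items():
    shuffled = members[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    for school in shuffled[:half]:
        allocation[school] = "intervention"
    for school in shuffled[half:]:
        allocation[school] = "comparison"

for school in sorted(allocation):
    print(school, allocation[school])
```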
The first three sources of bias are best avoided by randomization of a sufficient number of groups to each study condition, which will increase the likelihood that potential sources of bias are distributed evenly among the conditions. Careful matching or stratification can increase the effectiveness of randomization, especially when the number of groups is small. As a result, all GRTs planned with fewer than 20 groups per condition would be well served to include careful matching or stratification before randomization. The fourth source of bias is caused by contamination, which occurs when interventionlike activities find their way into the comparison groups; it can bias the estimate of the intervention effect toward the null hypothesis. Randomization will not protect against contamination; although investigators can control access to their intervention materials, they can often do little to prevent the outside world from introducing similar activities into their control groups. As a result, monitoring exposure to activities that could affect the trial’s endpoints in both the intervention and comparison groups is especially important in GRTs, which will allow the investigators to detect and respond to contamination if it occurs. Objective measures and evaluation personnel who have no connection to the intervention are also important strategies to limit bias. Finally, analytic strategies, such as regression adjustment for confounders, can be very helpful in dealing with any observed bias. 6 POTENTIAL ANALYTIC PROBLEMS AND METHODS TO AVOID THEM The two major threats to the validity of the analysis of a GRT that should be considered during the planning phase are misspecification of the analytic model and low power. Misspecification of the analytic model will occur if the investigator ignores or misrepresents a measurable source of random variation, or misrepresents the pattern of any over-time correlation in the data. To avoid model misspecification, the investigator should plan the analysis concurrent with
the design, plan the analysis around the primary endpoints, anticipate all sources of random variation, anticipate the error distribution for the primary endpoint, anticipate patterns of over-time correlation, consider alternative structures for the covariance matrix, consider alternative models for time, and assess potential confounding and effect modification. Low power will occur if the investigator employs a weak intervention, has insufficient replication, has high variance or intraclass correlation in the endpoints, or has poor reliability of intervention implementation. To avoid low power, investigators should plan a large enough study to ensure sufficient replication, choose endpoints with low variance and intraclass correlation, employ matching or stratification before randomization, employ more and smaller groups instead of a few large groups, employ more and smaller surveys or continuous surveillance instead of a few large surveys, employ repeat observations on the same groups or on the same groups and members, employ strong interventions with good reach, and maintain the reliability of intervention implementation. In the analysis, investigators should employ regression adjustment for covariates, model time if possible, and consider post hoc stratification. 7 VARIABLES OF INTEREST AND THEIR MEASURES The research question will identify the primary and secondary endpoints of the trial. The question may also identify potential effect modifiers. It will then be up to the investigators to anticipate potential confounders and nuisance variables. All these variables must be measured if they are to be used in the analysis of the trial. In a clinical trial, the primary endpoint is a clinical event, chosen because it is easy to measure with limited error and is clinically relevant (5). In a GRT, the primary endpoint need not be a clinical event, but it should be easy to measure with limited error and be relevant to public health. In both clinical and GRTs, the primary endpoint, together with its method of measurement,
must be defined in writing before the start of the trial. The endpoint and its method of measurement cannot be changed after the start of the trial without risking the validity of the trial and the credibility of the research team. Secondary endpoints should have similar characteristics and also should be identified before the start of the trial. In a GRT, an effect modifier is a variable whose level influences the effect of the intervention. For example, if the effect of a schoolbased drug-use prevention program depends on the baseline risk level of the student, then baseline risk is an effect modifier. Effect modification can be seen intuitively by looking at separate intervention effect estimates for the levels of the effect modifier. If they differ to a meaningful degree, then the investigator has evidence of possible effect modification. A more formal assessment is provided by a statistical test for effect modification, which is accomplished by including an interaction term between the effect modifier and condition in the analysis and testing the statistical significance of that term. If the interaction is significant, then the investigator should present the results separately for the levels of the effect modifier. If not, the interaction term is deleted and the investigator can continue with the analysis. Proper identification of potential effect modifiers comes through a careful review of the literature and from an examination of the theory of the intervention. Potential effect modifiers must be measured as part of the data-collection process so that their role can later be assessed. A confounder is related to the endpoint, not on the causal pathway, and unevenly distributed among the conditions; it serves to bias the estimate of the intervention effect. No statistical test exists for confounding; instead, it is assessed by comparing the unadjusted estimate of the intervention effect to the adjusted estimate of that effect. If, in the investigator’s opinion, a meaningful difference exists between the adjusted and unadjusted estimates, then the investigator has an obligation to report the adjusted value. It may also be appropriate to report the unadjusted value to allow the reader to assess the degree of confounding. The adjusted analysis will not be possible unless the potential confounders are measured. Proper identification
of potential confounders also comes through a careful review of the literature and from an understanding of the endpoints and the study population. The investigators must take care in the selection of potential confounders to select only confounders and not mediating variables. A confounder is related to the endpoint and unevenly distributed in the conditions, but is not on the causal pathway between the intervention and the outcome. A mediating variable has all the characteristics of a confounder but is on the causal pathway. Adjustment for a mediating variable, in the false belief that it is a confounder, will bias the estimate of the intervention effect toward the null hypothesis. Similarly, the investigator must take care to avoid selecting as potential confounders variables that may be affected by the intervention even if they are not on the causal pathway linking the intervention and the outcome. Such variables will be proxies for the intervention itself, and adjustment for them will also bias the estimate of the intervention effect toward the null hypothesis. An effective strategy to avoid these problems is to restrict confounders to variables measured at baseline. Such factors cannot be on the causal pathway, nor can their values be influenced by an intervention that has not been delivered. Investigators may also want to include variables measured after the intervention has begun, but will need to take care to avoid the problems described above. Nuisance variables are related to the endpoint, not on the causal pathway, but evenly distributed among the conditions. They cannot bias the estimate of the intervention effect, but they can be used to improve precision in the analysis. A common method is to make regression adjustment for these factors during the analysis so as to reduce the standard error of the estimate of the intervention effect, thereby improving the precision of the analysis. Such adjustment will not be possible unless the nuisance variables are measured. Proper identification of potential nuisance variables also comes from a careful review of the literature and from an understanding of the endpoint. The cautions described above for the selection of potential
confounding variables apply equally well to the selection of potential nuisance variables.
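A sketch of the adjusted-versus-unadjusted comparison described above follows, assuming the statsmodels package and simulated data; the variable names, the random intercept for each group, and all numbers are illustrative assumptions rather than the article's own analysis.

```python
# Illustrative sketch (simulated data, assuming statsmodels is installed):
# compare the unadjusted and covariate-adjusted estimates of the intervention
# effect in a GRT, with a random intercept for each identifiable group.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
g, m = 20, 50                            # 10 groups per condition, 50 members per group
group = np.repeat(np.arange(g), m)
condition = group % 2                    # alternate groups between the two conditions
baseline = rng.normal(size=g * m)        # a member-level baseline covariate
group_effect = rng.normal(scale=0.3, size=g)[group]
y = 0.5 * condition + 1.0 * baseline + group_effect + rng.normal(size=g * m)

df = pd.DataFrame({"y": y, "condition": condition, "baseline": baseline, "group": group})

# Unadjusted and adjusted models, each with a random intercept for group.
unadj = smf.mixedlm("y ~ condition", df, groups=df["group"]).fit()
adj = smf.mixedlm("y ~ condition + baseline", df, groups=df["group"]).fit()

print("unadjusted intervention effect:", round(unadj.params["condition"], 3))
print("adjusted intervention effect:  ", round(adj.params["condition"], 3))
```

If the adjusted and unadjusted estimates differ meaningfully, the covariate behaves as a confounder and the adjusted value should be reported; if they are similar, the covariate acts as a nuisance variable and adjustment mainly sharpens the standard error.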
8 THE INTERVENTION
No matter how well designed and evaluated a GRT may be, strengths in design and analysis cannot overcome a weak intervention. Although the designs and analyses employed in GRTs were fair targets for criticism during the 1970s and 1980s, the designs and analyses employed more recently have improved, with many examples of very well-designed and carefully analyzed trials. Where intervention effects are modest or short-lived, even in the presence of good design and analytic strategies, investigators must take a hard look at the intervention and question whether it was strong enough. One of the first suggestions for developing the research question was that the investigators become experts on the problem that they seek to remedy. If the primary endpoint is cigarette smoking among ninth-graders, then the team should seek to learn as much as possible about the etiology of smoking among young adolescents. If the primary endpoint is obesity among Native American children, then the team should seek to learn as much as possible about the etiology of obesity among those young children. If the primary endpoint is delay time in reporting heart attack symptoms, then the team should seek to learn as much as possible about the factors that influence delay time. And the same can be said for any other endpoint. One of the goals of developing expertise in the etiology of the problem is to identify points in that etiology that are amenable to intervention. Critical developmental stages, or critical events or influences that trigger the next step in the progression, may exist or it may be possible to identify critical players in the form of parents, friends, coworkers, or others who can influence the development of that problem. Without careful study of the etiology, the team will largely be guessing and hoping that their intervention is designed properly. Unfortunately, guessing and hoping rarely lead to effective interventions. Powerful interventions are guided by good theory on the process for change, combined
with a good understanding of the etiology of the problem of interest. Poor theory will produce poor interventions and poor results, which was one of the primary messages from the community-based heart diseaseprevention studies, where the intervention effects were modest, generally of limited duration, and often within chance levels. Fortmann et al. (8) noted that one of the major lessons learned was how much was not known about how to intervene in whole communities. The theory that describes the process of change in individuals may not apply to the process of change in identifiable groups. If it does, it may not apply in exactly the same way. Good intervention for a GRT will likely need to combine theory about individual change with theory about group processes and group change. A good theoretical exposition will also help identify channels for the intervention program. For example, strong evidence exists that recent immigrants often look to longterm immigrants of the same cultural group for information on health issues, which has led investigators to try to use those longterm immigrants as change agents for the more recent immigrants. A good theoretical exposition will often indicate that the phenomenon is the product of multiple influences and so suggest that the intervention operate at several different levels. For example, obesity among school children appears to be influenced most proximally by their physical activity levels and by their dietary intake. In turn, their dietary intake is influenced by what is served at home and at school and their physical activity is influenced by the nature of their physical activity and recess programs at school and at home. The models provided by teachers and parents are important both for diet and for physical activity. This multilevel etiology suggests that interventions be directed at the school food-service, physical education, and recess programs; at parents; and, possibly, at the larger community. GRTs would benefit by following the example of clinical trials, where some evidence of feasibility and efficacy of the intervention is usually required before launching the trial. When a study takes several years to
complete and costs hundreds of thousands of dollars or more, that expectation seems very fair. Even shorter and less expensive GRTs would do well to follow that advice. What defines preliminary evidence of feasibility? It is not reasonable to ask that the investigators prove that all intervention and evaluation protocols can be implemented in the population and setting of interest in advance of their trial. However, it is reasonable to ask that they demonstrate that the major components of the proposed intervention can be implemented in the target population, which can be done in a pilot study. It is also reasonable to ask that the major components of the proposed evaluation are feasible and acceptable in the setting and population proposed for the trial, which also can be done in a pilot study. What defines preliminary evidence of efficacy? It is not fair to ask that the investigators prove that their intervention will work in the population and setting of interest in advance of their trial. However, it is fair to ask that they provide evidence that the theory supporting the intervention has been supported in other situations. It is also fair to ask that the investigators demonstrate that similar interventions applied to other problems have been effective. Finally, it is reasonable to ask that the investigators demonstrate that the proposed intervention generates short-term effects for intermediate outcomes related to the primary and secondary endpoints and postulated by the theoretical model guiding the intervention. Such evidence provides reassurance that the intervention will be effective if it is properly implemented. 9
POWER
A detailed exposition on power for GRTs is beyond the scope of this article. Excellent treatments exist, and the interested reader is referred to those sources for additional information. Chapter 9 in the Murray text (1) provides perhaps the most comprehensive treatment of detectable difference, sample size, and power for GRTs. Even so, a few points bear repeating here. First, the increase in between-group variance
caused by the ICC in the simplest analysis is calculated as 1 + (m − 1)ICC, where m is the number of members per group; as such, ignoring even a small ICC can underestimate standard errors if m is large. Second, although the magnitude of the ICC is inversely related to the level of aggregation, it is independent of the number of group members who provide data. For both of these reasons, more power is available given more groups per condition with fewer members measured per group than given just a few groups per condition with many members measured per group, no matter the size of the ICC. Third, the two factors that largely determine power in any GRT are the ICC and the number of groups per condition. For these reasons, no substitute exists for a good estimate of the ICC for the primary endpoint, the target population, and the primary analysis planned for the trial, and it is unusual for a GRT to have adequate power with fewer than 8–10 groups per condition. Finally, the formula for the standard error for the intervention effect depends on the primary analysis planned for the trial, and investigators should take care to calculate that standard error, and power, based on that analysis. 10 SUMMARY GRTs are often complex studies, with greater challenges in design, analysis, and intervention than what is seen in other studies. As a result, much care and effort is required for good planning. Future trials will be stronger and more likely to report satisfactory results if they (1) address an important research question, (2) employ an intervention that has a strong theoretical base and preliminary evidence of feasibility and efficacy, (3) randomize a sufficient number of assignment units to each study condition so as to have good power, (4) are designed in recognition of the major threats to the validity of the design and analysis of group-randomized trials, (5) employ good quality-control measures to monitor fidelity of implementation of intervention and measurement protocols, (6) are well executed, (7) employ good process-evaluation measures to assess effects on intermediate endpoints, (8)
employ reliable and valid endpoint measures, (9) are analyzed using methods appropriate to the design of the study and the nature of the primary endpoints, and (10) are interpreted in light of the strengths and weaknesses of the study. REFERENCES 1. D. M. Murray, Design and Analysis of GroupRandomized Trials. New York: Oxford University Press, 1998. 2. A. Donner and N. Klar, Design and Analysis of Cluster Randomization Trials in Health Research. London: Arnold, 2000. 3. R. E. Kirk, Experimental Design: Procedures for the Behavioral Sciences, 2nd ed. Belmont, CA: Brooks/Cole Publishing Company, 1982. 4. L. Kish, Statistical Design for Research. New York: John Wiley & Sons, 1987. 5. C. L. Meinert, Clinical Trials. New York: Oxford University Press, 1986. 6. W. R. Shadish, T. D. Cook, and D. T. Campbell, Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston, MA: Houghton Mifflin Company, 2002. 7. B. J. Winer, D. R. Brown, and K. Michels, Statistical Principles in Experimental Design. New York: McGraw-Hill, 1991. 8. S. P. Fortmann, J. A. Flora, M. A. Winkleby, C. Schooler, C. B. Taylor, and J. W. Farquhar, Community intervention trials: reflections on the Stanford Five-City Project experience. Amer. J. Epidemiol. 1995; 142(6): 576–586.
POISSON REGRESSION
EKKEHARD GLIMM and MICHAEL BRANSON, Novartis Pharma AG, Basel, Switzerland
One of the major limitations of conventional Poisson regression is the fact that model assumptions force the expected value and the variance of the counts to be equal. In practice, this assumption is often violated, the empirical variance of the counts being larger than their mean. Several methods are available to relax this restrictive condition. A popular approach is to introduce an overdispersion parameter. Also, the full distributional assumptions can be abandoned in favor of a model for first- and second-order moments only. Generalized estimating equations are used to fit such overdispersed "Poisson" models. Other methods to deal with overdispersion are zero-inflated Poisson models, the negative binomial model, and mixed-effects and repeated-measures Poisson regression models. The latter constitute a field of much recent research that aims at a unification of linear model and GLM theory. This unification is also increasingly reflected in statistical software.
1 INTRODUCTION
Poisson regression is a method to relate the number of event counts or event rates to a set of covariates, for example, age, treatments under investigation, and so on. The Poisson regression model assumes that a number (Y) of counts follows a Poisson distribution with rate λ, denoted Po(λ). The Poisson distribution, sometimes called the "distribution of rare events", has the probability density function

f(Y = y) = exp(−λ) · λ^y / y!

where y is the discrete number of event counts, y = 0, 1, . . . . Consequently, E(Y) = λ and Variance(Y) = λ. The Poisson distribution develops naturally from a discrete stochastic process with constant intensity rate λ. In the Poisson regression model, it is assumed that the mean (expected value) parameter λ is a function of the covariates x = (x1, . . . , xp)'. The connection between λ and x = (x1, . . . , xp)' is established as a linear predictor η via a link function g, where

η = β0 + Σ_{i=1}^{p} βi xi and η = g(λ).

For the Poisson regression model, the log-link function g(λ) = log(λ) is used. Since λ must be positive and the support of η is the entire real line,

λ = exp(β0 + Σ_{i=1}^{p} βi xi).

As the Poisson distribution is a member of the exponential family of distributions, the Poisson regression model is a special case of a generalized linear model (GLM) (1). In fact, it can be viewed as the prototype of a generalized linear model. The interest is in inference on the model parameters β = (β0, . . . , βp)', that is, estimation of these parameters and their variances from observed data, confidence intervals, testing for statistical significance, and so on. In many clinical trial applications, we are primarily interested in the event rate υ = λ/t, where t is the time a patient is exposed or "at risk" of having an event, for example, an epileptic fit in an epilepsy trial. Because log(λ/t) = log(λ) − log(t), we can easily incorporate this situation into the GLM formulation by inclusion of log(t) into the set of predictors, forcing its parameter β(t), say, to be 1. In the next section, we introduce more specific details of the Poisson regression model. This section is followed by the outline of a case study, and aspects of Poisson regression modeling are described via the analysis of these data. In the application of Poisson regression, it is probably the norm rather than the exception to observe that the variance of counts is much larger than its mean. This phenomenon is called overdispersion and is discussed after the introduction of the case study. Finally, the discussion comments on statistical software for fitting Poisson regression models, current research trends, and related methods.
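To make the rate formulation concrete, the sketch below simulates exposure times and event counts for two hypothetical treatment groups and fits the Poisson model with log(t) as an offset. It assumes the numpy and statsmodels packages; the rates, sample size, and coefficient values are invented for illustration and are not from any trial discussed here.

```python
# Illustrative sketch (simulated data): Poisson regression for event rates
# with log(exposure time) as an offset, assuming numpy and statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
treat = rng.integers(0, 2, size=n)          # 0 = control, 1 = experimental
t = rng.uniform(0.5, 2.0, size=n)           # exposure time in years (hypothetical)
beta0, beta1 = np.log(0.9), -0.15           # assumed annual rate and log rate ratio
lam = t * np.exp(beta0 + beta1 * treat)     # expected count over each patient's exposure
y = rng.poisson(lam)

X = sm.add_constant(treat.astype(float))    # intercept plus treatment indicator
model = sm.GLM(y, X, family=sm.families.Poisson(), offset=np.log(t))
res = model.fit()

print(res.params)                           # estimates of beta0 and beta1
print("estimated rate ratio:", np.exp(res.params[1]))
```

The exponentiated treatment coefficient is the estimated rate ratio, mirroring the interpretation used in the case study later in this article.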
2 PARAMETER ESTIMATION, CONFIDENCE INTERVALS AND GOODNESS-OF-FIT
2.1 Parameter Estimation
Nelder and Wedderburn (1,2) have developed the theory of maximum likelihood (ML) estimation for the exponential family of distributions. The general parameterization of the exponential family density is

f(y | θ, φ) = exp{ [yθ − b(θ)] / a(φ) + c(y, φ) }

where θ is called the canonical or natural parameter, and φ is called the dispersion parameter. If, in a GLM, the link function g(.) is such that the linear predictor and the canonical parameter are identical (i.e., η = θ), then the link is called canonical. Canonical links simplify likelihood inference. For the Poisson distribution the parameters are: φ = a(φ) = 1, θ = log(λ), b(θ) = exp(θ) and c(y, φ) = −log(y!). Obviously, g(λ) = log(λ) is the canonical link of the Poisson distribution. Standard GLM procedures, in particular Fisher scoring, also called iterative reweighted least-squares (IWLS), are used to calculate the ML-estimates β̂i from the given data (yj, xj), j = 1, . . . , n, where yj denotes the number of event counts observed on sample unit j and xj = (x1j, . . . , xpj)' is the vector of this sample unit's covariates. If, for example, we consider patients in a clinical trial, then the sample units would be patients and the covariates may include characteristics like age, sex, allocated treatment group (or dose of therapy), and possibly interactions thereof. For the Poisson model, the likelihood equations are

Σ_{j=1}^{n} (yj − λj) xij = 0, i = 0, . . . , p.   (1)

The solutions of these equations are the ML-estimates β̂. Birch (3) showed that the solutions are unique and maximize the (log-)likelihood. The estimated covariance matrix of β̂ is given by the inverse Inf^{−1} of the observed Fisher information,

Inf = (x1 · · · xn) Ŵ (x1, . . . , xn)',

where Ŵ is the diagonal matrix with elements given by

ŵj = (∂λj/∂ηj)² / Var(Yj),

evaluated at the ML-estimates β̂. For the Poisson regression model, we have ŵj = λ̂j. Usually, the solution of the likelihood equations [Equation (1)] requires an iterative algorithm. The preferred method is Fisher scoring, which is the same as Newton–Raphson for GLMs with a canonical link (Reference 4, section 4.6). As a by-product, Fisher scoring yields the asymptotic covariance matrix Inf^{−1} of the ML-estimates. Fisher scoring is implemented in all major statistics software packages. The fitting method is very reliable and fast irrespective of starting values. If the estimation procedure fails to produce convergence to finite ML-estimates, then the usual reason is an over-parameterized model (i.e., too many covariates) with an infinite likelihood. This is, of course, likely to happen with either many (especially highly discrete) covariates and/or with sparse data, that is, very few observed events in general.
2.2 Confidence Intervals
Univariate confidence intervals for the parameters of the model can be constructed as Wald (1 − α)-confidence intervals β̂i ± u_{1−α/2} · ŝ(β̂i), where u_{1−α/2} is the (1 − α/2)-quantile of the standard normal distribution and ŝ(β̂i) is the estimated standard deviation from Fisher scoring. However, confidence intervals that are based on the profile likelihood (i.e., the likelihood viewed as a function of βi with the other parameters
βj, j ≠ i, kept fixed at their ML-estimates β̂j) have better small-sample properties (see Reference 4, section 1.3; Reference 5, section 1.4). Historically, Wald confidence intervals were preferred because they do not require iterative calculations, as profile likelihood ratio-based confidence intervals do. With the arrival of more powerful software packages, the computational convenience of the Wald confidence interval has lost importance, so that profile likelihood intervals are now generally preferred. All of these confidence intervals are based on large-sample asymptotic theory for ML-estimates (see Reference 6).
2.3 Goodness-of-Fit
The goodness-of-fit of the model can be assessed using overall goodness-of-fit tests and review of the residuals. The general ideas are the same for all GLMs. Hence, they are summarized only briefly in the following. Overall goodness-of-fit tests compare nested models, that is, a model with a set of parameters β = (β0, . . . , βp)' to a sub-model with 0 < q < p of the p parameters in β. For GLMs, three asymptotically equivalent tests are generally available: score, Wald, and likelihood ratio (LR) tests (7). Under fairly general regularity conditions, the test statistics have an asymptotic χ² distribution with (p − q) degrees of freedom if the sub-model is true. Thus, these tests can be used to select and decide about the validity of the sub-model. As a special case of the LR test, a postulated (or candidate) model can be compared with the saturated model that has as many parameters as observations. The saturated model forces a perfect fit. The resulting statistic is called the deviance (Reference 4, chapter 4.1) and is a general measure of goodness-of-fit. An alternative and frequently used goodness-of-fit measure is the Pearson χ² statistic,

Σ_{j=1}^{n} (yj − ŷj)² / ŷj
where ŷj are the predictions from the model. Asymptotically, the Pearson χ² statistic has the same distribution as the other three test statistics, that is, a χ² distribution with (n − q) degrees of freedom, where q is the
number of parameters in the model under investigation, provided this model is true. However, care is sometimes required in interpreting the deviance and Pearson statistics as following the χ² distribution with low counts (see Reference 8, section 14.4). Non-nested models are usually compared in an informal way by information criteria. A popular choice is Akaike's information criterion (AIC), given by −2 × log likelihood + 2p, where p is the number of parameters in the model. The smaller the AIC, the better the model. Other information criteria also exist, and each is based on the same principle of combining the log likelihood of a model with a penalty for the number of parameters. A more detailed investigation of the goodness-of-fit of a model requires examining residuals, that is, deviations of observations from the model predictions. For this purpose, different types of residuals have been suggested. Let yj be the observed Poisson count and ŷj be its prediction from the model. Then
• deviance residuals are given as √dj · sign(yj − ŷj), where dj is the contribution of the jth observation to the deviance.
• Pearson residuals are given as ej = (yj − ŷj) / √ŷj. In analogy, these are the individual contributions to the Pearson test statistic.
• standardized Pearson residuals are ej / √(1 − ĥj), where ĥj is the jth diagonal element of the so-called "Hat"-matrix, the asymptotic covariance matrix of the model predictions (ŷj), j = 1, . . . , n.
Standardized Pearson residuals are usually preferred because they have an approximate variance of 1. Just like in ordinary regression, these residuals are checked for values that deviate considerably from 0 to detect outliers in the data and plotted versus various characteristics like covariates, predictions of expected event counts, time, and so on, to determine whether they display nonrandom patterns that would indicate a lack of fit (see Reference 9, section 6.2, for residual analysis in a Poisson regression context). Note, however, that in cases where the distribution of the counts yj is far away from approximately normal, such plots may be of limited use.
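A minimal sketch of these goodness-of-fit quantities follows, assuming the observed counts y and fitted means mu are already available as arrays; the numbers are invented, and in practice they would come from a fitted Poisson regression model.

```python
# Illustrative sketch: Pearson and deviance goodness-of-fit statistics and
# residuals for a Poisson model, from observed counts y and fitted means mu.
import numpy as np

y  = np.array([0, 2, 1, 0, 4, 3, 0, 1, 2, 5], dtype=float)
mu = np.array([0.8, 1.5, 1.2, 0.6, 2.9, 2.4, 0.9, 1.1, 1.8, 3.6])

# Pearson residuals and Pearson statistic.
pearson_resid = (y - mu) / np.sqrt(mu)
pearson_chi2 = np.sum(pearson_resid ** 2)

# Deviance contributions d_j and deviance residuals; y*log(y/mu) is 0 when y = 0.
with np.errstate(divide="ignore", invalid="ignore"):
    ylogy = np.where(y > 0, y * np.log(y / mu), 0.0)
d = 2.0 * (ylogy - (y - mu))
deviance = np.sum(d)
deviance_resid = np.sign(y - mu) * np.sqrt(d)

# A rough dispersion estimate chi2 / (n - p), with p = 1 parameter assumed
# purely for illustration (cf. the quasi-likelihood discussion below).
n, p = len(y), 1
print("Pearson chi2:", round(pearson_chi2, 2), " deviance:", round(deviance, 2))
print("dispersion estimate:", round(pearson_chi2 / (n - p), 2))
```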
3 CASE STUDY: STATISTICAL ANALYSIS FOR THE INCIDENCE RATES OF ADVERSE EVENTS (NASOPHARYNGITIS)
To fix ideas, we introduce a case study in which the incidence rate of an adverse event (nasopharyngitis) was of interest. A total of 710 patients [473 patients given the experimental treatment (E) and 237 patients given the control treatment (C), a 2:1 randomization ratio] participated in the trial. The total exposure times were 384 years for E and 149 years for C. In 533 patients (349 patients in E and 184 patients in C) no nasopharyngitis event occurred during the study period; 177 patients (124 in E, 53 in C) had at least one event, and 89 patients (62 in E, 27 in C) had more than one event, with up to 21 events for a single patient. More specific details of the clinical trial cannot be disclosed for confidentiality reasons. The counts of nasopharyngitis events were modeled using a Poisson regression model. Thus, for patient i in treatment group j = E, C: yij ∼ Po(λij) with log(λij) = log(tij) + β0 + βj, where tij is the duration of exposure of the ith patient in group j. Note that, just like in ordinary regression, we need a restriction on the parameters to make them identifiable. Here, β2 = 0 such that, in the general formulation, log(λij) = log(tij) + β0 + β1 · xj with xj = 1 if j = E and xj = 0 if j = C. Therefore, the event rate for treatment group C is exp(β0) and is exp(β0 + β1) for treatment E. The treatment difference as measured by the ratio of event rates (E:C) is therefore given by exp(β1). This model can be easily fitted with many general statistical software packages. We used SAS PROC GENMOD (10). The estimate β̂1 = −0.124 of the treatment effect indicates that the (exposure-adjusted) event rate is lower in E relative to C. In conjunction with the baseline estimate of β̂0 = −6.04, this finding implies that the expected number of nasopharyngitis events in a year is 365 × exp(−6.04) = 0.87 in C, but it is
only 365 × exp(−6.04−0.124) = 0.77 in E. The 95% profile LR-based confidence interval and the Wald confidence interval for βˆ1 are (−0.33;0.09) and (−0.33;0.08) respectively, and include the null value 0. The Pearson χ 2 test statistic is 2587 with 708 degrees of freedom. This value is much larger than would be expected if the model were correct. 4 OVERDISPERSION The simple Poisson regression model has only one parameter, which causes the expected value and the variance to be same. Hence, it is inflexible regarding event rate intensity. For example, the Poisson assumption forces the event rate of the underlying stochastic process to be constant in time. In practice, the observed variance of the counts per patient in a clinical trial is often much larger than the observed mean count. This phenomenon, which is called overdispersion, is very common. Like in the case study introduced in the previous section, its root cause is often patient heterogeneity that is not captured in the model covariates. For example, it can be shown that overdispersion arises if counts which are assumed to be independent and from a common Poisson distribution are actually from a mixture of Poisson distributions with different event intensity rates λi . Of particular practical relevance is the fact that overdispersion develops naturally if patients are followed for time periods of different lengths. Overdispersion can be addressed in several ways, which are discussed in the following sections. 4.1 Zero-Inflated Poisson Model Sometimes, overdispersion is encountered because there are too many zeros relative to the nonzero counts in the data. This phenomenon sometimes affects clinical trials because events might have a higher probability to reoccur once they have occurred for the first time. This leads to the zero-inflated Poisson (ZIP) model. Early references for the ZIP model are References 11 and 12. Here, it is assumed that the nonzero counts follow a Poisson model, but that the zeros are a mixture of zeros from this model and an
additional stochastic process producing extra zeros. Hence, we have the model

f(y) = p0 + (1 − p0) · exp(−λ) · λ^y / y!  for y = 0,
f(y) = (1 − p0) · exp(−λ) · λ^y / y!  for y > 0,

where p0 is the probability of obtaining an extra zero. The parameters p0 and λ are estimated using ML methods. In the presence of covariates, p0(x) is often modeled as a logistic regression model. See, for example, Reference 13. To illustrate the approach, a ZIP model is assumed for the case study previously introduced. The zero-inflation parameter is modeled as logit(p0ij) = γ0 + γj, γ2 = 0, for all patients i in treatment group j, and the Poisson parameter log(λij) = log(tij) + β0 + βj as before. It fits better than the ordinary Poisson model with an offset (AIC = 1440 vs AIC = 1826). Hence, there is some indication of zero-inflation. The ML-estimates of the logit part are γ̂0 = −0.75 with a Wald 95% confidence interval of (0.40;1.10) and γ̂1 = −0.02 with a 95% confidence interval of (−0.43;0.40). It seems that the zero-inflation is the same in the two treatment groups. The estimated treatment effect is β̂1 = −0.129 with a 95% Wald confidence interval of (−0.37, 0.11). This finding is very similar to the estimate from the ordinary Poisson regression model. However, −2 log likelihood = 1432 for the zero-inflated Poisson model and 449 for the saturated Poisson model, which indicates that the zero-inflation does not remove all of the overdispersion.
4.2 Quasi-Likelihood
Overdispersion can also be tackled by introducing an additional dispersion parameter into the model and estimating it by quasi-likelihood methods. This method implies that the Poisson assumption and, in fact, any explicit distributional assumption about the counts is abandoned and only the expected value and variance of the counts are assumed to follow a parametric model. Quasi-likelihood (QL) and its generalization, frequently called generalized estimating equations (GEE), solve a set of equations in the model parameters that "look like" the likelihood
equations from a GLM. To be more specific, the quasi-likelihood assumption in the Poisson regression model is that still E(Y) = λ, but that var(Y) = φλ. The extra parameter φ accounts for the overdispersion. In Poisson regression, the estimating equations for the model parameters β in the QL approach and the likelihood equations for β under a full Poisson-distributional assumption (φ = 1) are the same, so that the QL point estimates and the full likelihood point estimates of β coincide, and the same methods can be used to fit them (Reference 5, chapter 3). However, a Poisson assumption underestimates the variance of the point estimates. If we assume that the same overdispersion affects all observations independent of their covariates, then φ can be estimated by φ̂ = χ² / (k − p), where k is the number of observations, p the number of parameters in the model, and χ² is the value of the deviance or the Pearson goodness-of-fit test statistic. The approximate covariance matrix of the QL estimates β̂ is given by

V = [ Σ_{j=1}^{n} (∂λj/∂β)(∂λj/∂β)' / (φλj) ]^{−1}

where λj is the jth patient's Poisson parameter (a function of that patient's covariates xj). We can estimate V by plugging β̂ and φ̂ into this formula, but that approach is sensitive to misspecification of the variance function var(Y). Therefore, the so-called robust or sandwich estimator is sometimes preferred. The sandwich estimator is obtained by replacing β and φ with their estimates β̂ and φ̂ and var(Yj) with (yj − λ̂j)² in

Ṽ = V [ Σ_{j=1}^{n} (∂λj/∂β)(φλj)^{−1} var(Yj) (φλj)^{−1} (∂λj/∂β)' ] V.

Ṽ is the approximate covariance of β̂, even if var(Y) is misspecified. If overdispersion depends on covariates, then a parametric model may be used to model φ as a function of the covariates. In general, although this approach is very powerful, it has some theoretical intricacies
associated with the fact that, without additional restrictions on φ, the overdispersed Poisson model will have a distribution outside the exponential family of distributions (see Reference 5, sections 3.4–3.6).
Reanalyzing the Case Study Data. The QL approach results in the estimate φ̂ = 2587/708 = 3.65 based on the Pearson test statistic. All standard errors of the parameter estimates have to be multiplied by √φ̂ ≈ 1.91 to account for the overdispersion. Consequently, confidence intervals are wider. For example, the 95% profile likelihood confidence interval for the difference β̂1 between treatments becomes (−0.51;0.28). The Wald 95% confidence interval is (−0.52;0.27). Both are Poisson likelihood-based confidence intervals and are not computed using the sandwich estimator for the covariance matrix. Although this approach is adequate for estimation of a parameter of special interest, like the treatment effect in this example, other model inadequacies, like the excessive number of zero counts, remain unresolved. For prediction of the number of counts in a newly recruited patient, say, a more careful modeling effort, possibly including more covariates, should be considered.
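One way to reproduce this kind of adjustment is sketched below with statsmodels: the coefficients equal the ordinary Poisson ML estimates, while the dispersion is estimated from the Pearson statistic and used to scale the standard errors. The simulated counts are again hypothetical stand-ins, not the trial data.

```python
# Quasi-likelihood (overdispersed Poisson) sketch: same point estimates as the
# Poisson fit, standard errors inflated by sqrt(phi) with phi = X^2 / (k - p).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 360
treat = rng.integers(0, 2, size=n)                       # 0 = C, 1 = E
days = rng.integers(180, 366, size=n)                    # follow-up time
frailty = rng.gamma(shape=2.0, scale=0.5, size=n)        # patient heterogeneity
events = rng.poisson(frailty * days * np.exp(-6.0 - 0.12 * treat))

X = sm.add_constant(treat)
poisson_fit = sm.GLM(events, X, family=sm.families.Poisson(),
                     offset=np.log(days)).fit()
phi = poisson_fit.pearson_chi2 / poisson_fit.df_resid    # phi-hat = X^2 / (k - p)

quasi_fit = sm.GLM(events, X, family=sm.families.Poisson(),
                   offset=np.log(days)).fit(scale="X2")  # same betas, wider CIs
print("phi:", round(phi, 2))
print("Poisson SE:", poisson_fit.bse[1], "quasi-Poisson SE:", quasi_fit.bse[1])
print(quasi_fit.conf_int())                              # Wald intervals using phi
```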
4.3 Negative Binomial Model
Another method is to assume that the distribution of counts is a gamma mixture of different Poisson distributions. If it is assumed that patients follow Poisson distributions with different parameters λj and these are in turn sampled from a gamma distribution, then it follows that the number of events has a negative binomial distribution (e.g., Reference 14, p. 122ff). This distribution has a variance that is larger than its expected value. In contrast to the QL approach, this approach is still based on a full parametric model for the counts. The negative binomial distribution is in the exponential family only if its dispersion parameter ξ is assumed fixed. The usual method to fit this model consists of solving the likelihood equations by IWLS for a fixed ξ(q), providing the parameter estimate β̂(q) in the qth step, then using Newton–Raphson with fixed β̂(q) to update ξ(q), and alternating between the two until both ξ(q) and β̂(q) have converged (Reference 4, section 13.4). The negative binomial model with zero inflation (ZINB) as a generalization of the ZIP model has been discussed in Reference 15. Reanalyzing the case study data using a negative binomial model gives a Pearson statistic of 725, which is much better than for the simple Poisson model and, in fact, reveals no signs of an inadequate fit. The treatment effect is estimated as −0.122 with a likelihood-based 95% confidence interval of (−0.53;0.28), which is almost identical to the result obtained from applying the QL method. The ML estimate of the dispersion parameter ξ is ξ̂ = 4.1 with a 95% profile likelihood confidence interval of (3.2;5.3), which again indicates substantial overdispersion.
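A negative binomial fit can be sketched in the same way; note that the dispersion parameterization used by statsmodels (alpha in the NB2 form, with variance mu + alpha·mu²) is not the same quantity as the ξ discussed above, so the numerical values are not directly comparable. The simulated counts are hypothetical.

```python
# Negative binomial regression sketch: counts arise as a gamma mixture of
# Poisson distributions, so the variance exceeds the mean; the dispersion
# parameter (alpha) is estimated jointly with the coefficients by ML.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 360
treat = rng.integers(0, 2, size=n)
days = rng.integers(180, 366, size=n)
frailty = rng.gamma(shape=2.0, scale=0.5, size=n)   # gamma mixing -> NB counts
events = rng.poisson(frailty * days * np.exp(-6.0 - 0.12 * treat))

X = sm.add_constant(treat)
nb_fit = sm.NegativeBinomial(events, X, offset=np.log(days)).fit(
    maxiter=200, disp=False)
print(nb_fit.summary())        # treatment effect and dispersion (alpha) estimate
print(nb_fit.conf_int())       # Wald confidence intervals
```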
5 DISCUSSION
In the analysis of categorical data, the counts in the cells of a contingency table are often assumed to follow a Poisson distribution, giving rise to Poisson loglinear models. These models can be interpreted as Poisson regression models with categorical covariates only. In that case, we do not have repeated counts from a single individual but rather counts per constellation of covariates. Although this situation at least formally constitutes a ‘‘Poisson regression model,’’ it is not usually treated under this heading. The analysis of categorical data arranged in contingency tables has many aspects that are outside the scope of Poisson regression, and vice versa. To give just one example, in contingency table analysis, it is usually irrelevant whether Poisson, multinomial, or independent multinomial sampling is assumed (16). For categorical data analysis, see References 17 or 4. In recent years, considerable advances have been achieved in aligning GLM theory with linear mixed model theory, resulting in generalized linear mixed models (GLMM). This is reflected in the fact that many software packages now have very similar syntax for mixed linear models and GLMMs (e.g., SAS PROC MIXED and GLIMMIX). Basically, regarding the linear model component, the same ideas and principles apply for both linear models and GLMs, especially with
respect to notions such as random effects, marginal versus conditional models, and correlation structures. However, fitting these models is more complicated in the GLMM context, and some assumptions, especially on the covariance structure in marginal models, are difficult to assess. Also, in contrast to the ordinary linear mixed model, certain important equivalences of the parameters in a conditional and the corresponding marginal model no longer hold. The interested reader is referred to chapters 11 and 12 in Reference 4, and Reference 18. An introduction to the random effects Poisson model is given by Reference 4, section 13.5. More detailed investigations have been performed in References 18 and 19. Fitting marginal and random effects GLMs is complex because the likelihoods induced by the model assumptions are generally not tractable analytically. Numerical algorithms to fit these models are an active area of current research, and a wide variety of approaches have been suggested. Many of these approaches concentrate on maximizing the likelihood directly by using numerical integration methods, for example, Gauss-Hermite quadrature, the EM algorithm, Markov Chain Monte Carlo methods, or Laplace approximations. See, for example, References 18 and 20 for recent overviews. Other approaches approximate the likelihood by a simpler function. The motivation for such approximations may be numerical tractability, relaxation of model assumptions (e.g., the GEE approach), or certain deficiencies of ML-estimates (e.g., penalized likelihood methods). Additionally, likelihood-based methods have several variants, such as profile likelihood, marginal and conditional likelihood, and h-likelihood. See Reference 5 for an in-depth discussion of these concepts. Bayesian methods, which frequently rely on MCMC to sample from the joint posterior distribution of the parameters, can also be applied to Poisson regression modeling. The interested reader is referred to References 21–24. Most theory for Poisson regression is likelihood-based and relies heavily on the large sample properties of ML-estimates. Sometimes, this may be questionable because of small sample sizes or few counts. There is an
extensive literature on small-sample methods in contingency table analysis (including the case of Poisson-distributed counts); see, for example, Reference 4 or 25. However, few investigations regarding sparse data have been made on Poisson regression with continuous covariates or the more complex models for handling overdispersion investigated in the previous section. A general approach to this is the application of resampling methods like the bootstrap (26). A discussion of nonparametric regression and generalized linear models is also given in Reference 27. Poisson regression models are also closely related to multiple time-to-event methods; see References 28 and 29.
REFERENCES
1. P. McCullagh and J. A. Nelder, Generalized Linear Models. London: Chapman & Hall, 1989.
2. J. A. Nelder and R. W. M. Wedderburn, Generalized linear models. J. Royal Stat. Soc. 1972; A135: 370–384.
3. M. W. Birch, Maximum-likelihood in three-way contingency tables. J. Royal Stat. Soc. 1963; B25: 220–233.
4. A. Agresti, Categorical Data Analysis, 2nd ed. New York: Wiley, 2002.
5. Y. Lee, J. A. Nelder, and Y. Pawitan, Generalized Linear Models with Random Effects. London: Chapman & Hall/CRC, 2006.
6. R. J. Serfling, Approximation Theorems of Mathematical Statistics. New York: Wiley, 1980.
7. D. R. Cox and D. V. Hinkley, Theoretical Statistics. London: Chapman & Hall, 1974.
8. P. Armitage, G. Berry, and J. N. S. Matthews, Statistical Methods in Medical Research. Oxford, UK: Blackwell, 2002.
9. A. Gelman and J. Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, UK: Cambridge University Press, 2007.
10. SAS Institute. SAS OnlineDoc 9.1.3. 2007. Cary, NC. Available: http://support.sas.com/onlinedoc/913/docMainpage.jsp.
11. J. Mullahy, Specification and testing of some modified count data models. J. Economet. 1986; 33: 341–365.
12. D. Lambert, Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992; 34: 1–14.
13. M. S. Ridout, J. P. Hinde, and C. G. B. Demetrio, A score test for testing a zero-inflated Poisson regression model against zero-inflated negative binomial alternatives. Biometrics 2001; 57: 219–223.
14. A. M. Mood, F. A. Graybill, and D. C. Boes, Introduction to the Theory of Statistics, 3rd ed. Singapore: McGraw-Hill, 1974.
15. D. B. Hall and K. S. Berenhaut, Score tests for heterogeneity and overdispersion in zero-inflated Poisson and binomial regression models. Canad. J. Stat. 2002; 30: 415–430.
16. J. Palmgren, The Fisher information matrix for log-linear models arguing conditionally in the observed explanatory variables. Biometrika 1981; 68: 563–566.
17. Y. M. M. Bishop, S. E. Fienberg, and P. W. Holland, Discrete Multivariate Analysis. Cambridge, MA: MIT Press, 1975.
18. G. Molenberghs and G. Verbeke, Models for Discrete Longitudinal Data. New York: Springer, 2005.
19. P. J. Diggle, P. J. Heagerty, K-Y. Liang, and S. L. Zeger, Analysis of Longitudinal Data, 2nd ed. Oxford, UK: Oxford Science Publications, 2002.
20. J. C. Pinheiro and D. M. Bates, Mixed-Effects Models in S and S-PLUS. New York: Springer, 2000.
21. D. K. Dey, S. K. Ghosh, and B. K. Mallick, Generalized Linear Models: A Bayesian Perspective. New York: Marcel Dekker, 2000.
22. P. Congdon, Applied Bayesian Modelling. New York: Wiley, 2003.
23. D. J. Spiegelhalter, K. R. Abrams, and J. P. Myles, Bayesian Approaches to Clinical Trials and Health-Care Evaluation. New York: Wiley, 2004.
24. A. Gelman, J. B. Carlin, H. B. Stern, and D. B. Rubin, Bayesian Data Analysis. London: Chapman & Hall/CRC, 1995.
25. K. F. Hirji, Exact Analysis of Discrete Data. London: Chapman & Hall/CRC, 2006.
26. B. Efron and R. Tibshirani, An Introduction to the Bootstrap. London: Chapman & Hall/CRC, 1994.
27. P. J. Green and B. W. Silverman, Nonparametric Regression and Generalized Linear Models. London: Chapman & Hall, 1994.
28. T. M. Therneau and P. M. Grambsch, Modeling Survival Data: Extending the Cox Model. New York: Springer, 2000.
29. P. K. Andersen, Ø. Borgan, R. D. Gill, and N. Keiding, Statistical Models Based on Counting Processes. New York: Springer, 1993.
CROSS-REFERENCES Generalized Linear Models Generalized Estimating Equations Mixed Effects Models Regression Adverse Event Outcome Incidence Rate Time to Event
POPULATION PHARMACOKINETIC AND PHARMACODYNAMIC METHODS
JÜRGEN B BULITTA
Department of Pharmaceutical Sciences, School of Pharmacy and Pharmaceutical Sciences, State University of New York at Buffalo, Buffalo, New York
NICHOLAS H G HOLFORD
Department of Pharmacology and Clinical Pharmacology, University of Auckland, Auckland, New Zealand
Population pharmacokinetics (PK) and pharmacodynamics (PD) originated in the 1970s and has been adopted extensively in drug development and clinical pharmacology (1). Population PKPD can incorporate all available and relevant data from a clinical study in a single data analytical step (2, 3). Therefore, population PKPD methods provide a sound basis for data analysis and for interpretation of results. Population PKPD models allow one to study the effect of multiple covariates like renal function, body size, body composition, and age on PK and PD simultaneously. Population PKPD can be distinguished from hypothesis testing procedures such as analysis of variance (ANOVA) and analysis of covariance (ANCOVA) in that population PKPD can account for the full time course of PK, PD, and disease progression. This difference becomes especially clear when attempting to make predictions. ANOVA-based methods cannot be extended to predict unstudied circumstances. As the time course of concentrations and the time course of effects is usually nonlinear, almost all PKPD models are nonlinear. ANOVA and ANCOVA can only account for nonlinearity by transformation of the independent or dependent variable(s). Because a substantial amount of between subject variability (BSV) often occurs, it is very important to account for BSV when describing PK and PD parameters. If the goal of therapy is to cure more than 95% of patients, then considering the average drug effect is not sufficient. Because BSV is estimated in a population PKPD analysis, population PKPD models can predict the range of drug concentrations, drug effects, and therapeutic responses for the whole patient population of interest. Consequently, population PKPD models allow one to predict the results of clinical trials and to optimize future clinical trial designs. This article presents the concept, methods, and some applications at an introductory level. More in-depth reports on population PKPD can be found in several reviews (4–8).
1 TERMINOLOGY
Precise use of terminology is important in population PKPD modeling. Related terms such as variability and uncertainty, as well as covariates and covariance, are frequently confused.
1.1 Variable
A variable is a quantity that can change its value. Variables can be dependent or independent. Independent variables are used as input for a model. It is usually assumed that independent variables are known or measured with negligible error (e.g., sampling time, dose, body weight, age, and creatinine concentration). A dependent variable is a function of the independent variable(s) and the model parameter(s). The value of a dependent variable is predicted by the model. Usually, dependent variables are associated with measurement error (e.g., drug concentration in plasma, amounts in urine, or pain score). A variable can be of several different types: (1) continuous (e.g., a concentration represented as a floating point number), (2) count data (e.g., an integer for the number of counts from a radioactive decay or the number of headaches experienced by a migraine sufferer), (3) ordered categorical (e.g., a pain score from zero to five), or (4) unordered categorical (e.g., Asian, Caucasian, Polynesian).
1.2 Parameter
Parameters are components of the model equation(s) that characterize the underlying system. They are constants in the model but may change from person to person and over time within a person. Usually, parameters are assumed not to change during one dosing interval (during one occasion). The observations are used to estimate the model parameters.
1.3 Variability
Variability describes the (true) diversity or heterogeneity of subjects. For example, if 100 subjects are studied, then each subject will have his/her own value for clearance and volume of distribution. The BSV describes the variability of clearance and volume of distribution between those 100 subjects. The BSV can be described by a distribution (e.g., by a normal distribution with a mean and standard deviation or by a log-normal distribution with a geometric mean and coefficient of variation, or by a nonparametric distribution). In a parametric population PK model, the BSV of clearance is an estimated model parameter.
1.4 Uncertainty
Measured quantities and model parameters are only known with some degree of certainty. Uncertainty is described by the standard error of the estimate. However, because uncertainty is typically not distributed normally in nonlinear estimation problems, it is better to describe uncertainty with some other metric such as a confidence interval. Uncertainty must be distinguished from variability. For example, in a hypothetical study with a million patients, the average clearance of all patients will be known with negligible uncertainty, and the confidence interval for average clearance will be extremely narrow. Note that the BSV (variability) may be large (e.g., 30% coefficient of variation), but with very large numbers of patients its uncertainty will be very small (e.g., 90% confidence interval of 29.9% to 30.2%). The estimate of variability will not change systematically with the size of the study, but its uncertainty will become smaller as study size gets larger.
1.5 Covariate (also called covariable)
A covariate is an independent variable that is used to modify (predict) the value of a parameter in the model equation(s). Renal function is a covariate that is used to modify the predicted value for renal clearance. (Other examples include effect of body size, body composition, and disease state on clearance and volume of distribution).
1.6 Covariance
Covariance describes the common (joint) variability of two random effects. Apparent clearance and apparent volume of distribution may vary predictably (fixed effect) because of variability in extent of absorption, but commonly the interest lies in the apparently random association of BSV between model parameters (random effects). The joint distribution of BSV is described by the variance of each parameter and the covariance between the parameters. The variance and covariance can be combined to describe the correlation between the parameters.
1.7 Mixed Effects Model
The term mixed effects model is used to explain sources of variability. The word effect does not refer to a drug effect. It refers to a factor that has an effect on variability. These effects are divided into those that have the same effect every time (i.e., they are predictable) and those that cannot be predicted. The first kind of effect is called a fixed effect and the second kind is called a random effect. When variability is described by a combination of fixed effects and random effects, the resulting model is called a mixed effects model.
1.8 Subject Variability
Often, a marked variability occurs in pharmacokinetic concentration time profiles between various subjects, and often even more variability occurs in drug effects. The BSV in structural model parameters is used to describe a major part of variability. Besides variability between subjects, within subject variability (WSV) of model parameters also occurs. The sum of BSV and WSV is the population parameter variability (PPV) (9).
The WSV can be split into between occasion variability (BOV) and within occasion variability (WOV). An occasion is defined as the shortest time interval that allows one to estimate all model parameters reliably. Commonly, this means one dosing interval. Figure 1 illustrates the components of PPV. If estimates for renal function are available on different occasions, BOV can also be split into a predictable and a random component.
2 FIXED EFFECTS MODELS
A fixed effects model describes the system of interest in a predictable (deterministic) manner. It contains no random effects. Fixed effects models comprise ‘‘input-output models’’ (which describe the overall response) and ‘‘group models’’ (which describe predictable effects on BSV). Covariates can be used to split BSV into a predictable component (BSVpredictable ) and a random component (BSVrandom ) as described below. Covariates like weight or renal function are used to distinguish BSVpredictable from BSVrandom . 2.1 Input-Output Models Input-output models describe the relationship between independent and dependent variables. In a PK model, this would be the relationship between time and concentration for the chosen dosage regimen. For a PD model, this might be the relationship between drug concentration and drug effect. Input-output models are also called structural models as they define the structure of the system. Most commonly, compartmental models are used as structural models in population PKPD analyses. However, physiologically based population PK models may be used more frequently to gain better understanding of the biological factors that affect drug disposition such as blood flow and organ specific elimination. 2.2 Group Models The group model incorporates the effect of one (or more) covariates on the parameters of the structural model. All subjects from the population who share the same set of covariates used to predict differences in a
parameter belong to the same group and have the same group estimate for clearance and volume of distribution. An allometric size model is often used to describe the effect of body size on clearance and volume of distribution. Allometry provides a strong theoretical basis for understanding how body function and structure change with body size (10). Total body weight (WT) is used in our example to describe body size. FWTCL,i and FWTV,i represent the fractional changes for the ith subject (with weight WTi) relative to a subject with a standard weight (WTSTD) of 70 kg. CLPOP is the population estimate for clearance, and VPOP is the population volume of distribution. CLGRP,WTi is the clearance, and VGRP,WTi is the volume of distribution for subjects with WTi (these group estimates are sometimes called typical values):

FWTCL,i = (WTi / WTSTD)^0.75;   CLGRP,WTi = CLPOP · FWTCL,i
FWTV,i = WTi / WTSTD;           VGRP,WTi = VPOP · FWTV,i

The allometric exponent of 0.75 for clearance is important, especially if PK data in pediatric patients are compared with PK data in adult patients (11). Some authors estimate or assume exponents for clearance different from 0.75 without any clear justification for departing from well-based theory. The creatinine clearance [e.g., as estimated by the Cockcroft & Gault formula (12)] can be used to predict renal function. Renal function is often a useful predictor of renal clearance. Equation (12) predicts the creatinine clearance of the ith subject (CLcri) for a nominal individual with weight 70 kg:

CLcri [mL/min] = (140 − agei) / SCri

SCri (mg/dL) is the serum creatinine concentration of the ith subject, and agei is the age of this subject in years. The intact nephron hypothesis (13) assumes that renal function (including glomerular filtration, secretion, and reabsorption) is proportional to
Figure 1. Different components of population parameter variability.
the glomerular filtration rate. Glomerular filtration rate is related closely to the renal clearance of creatinine. Therefore, CLcri can be used to predict renal function for the ith subject. A standard glomerular filtration rate (CLcrSTD) of 100 mL/min for a 70 kg subject is used to scale GFRi:

RFi = CLcri / CLcrSTD

The group renal clearance (CLGRP,R) in subjects with WTi and RFi then becomes:

CLGRP,R(WTi, RFi) = CLPOP,R · RFi · FWTCL,i

CLPOP,R is the population estimate of renal clearance in a subject of standard weight with RF of 1. Importantly, CLcri is calculated for a nominal subject with weight 70 kg, as FWTCL,i includes the effect of body size on clearance and as the effect of weight on renal clearance should only be included once.
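A minimal sketch of this group (covariate) model is given below. The population values used for CLPOP, VPOP, and CLPOP,R are hypothetical numbers chosen only for illustration; the structure follows the allometric and renal-function relationships above.

```python
# Group (typical value) model sketch: allometric weight scaling (exponent 0.75
# for clearance, 1 for volume) plus renal function from the 70-kg standardized
# Cockcroft-Gault estimate. Population values are hypothetical.
WT_STD = 70.0        # kg, standard weight
CLCR_STD = 100.0     # mL/min, standard creatinine clearance

def group_parameters(wt, age, scr,
                     cl_pop=10.0, v_pop=50.0, cl_pop_renal=6.0):
    """Return group (typical) total clearance, renal clearance, and volume."""
    fwt_cl = (wt / WT_STD) ** 0.75          # allometric scaling for clearance
    fwt_v = wt / WT_STD                     # linear scaling for volume
    clcr = (140.0 - age) / scr              # mL/min for a nominal 70-kg subject
    rf = clcr / CLCR_STD                    # renal function relative to standard
    cl_total = cl_pop * fwt_cl              # CL_GRP = CL_POP * FWT_CL
    cl_renal = cl_pop_renal * rf * fwt_cl   # CL_GRP,R = CL_POP,R * RF * FWT_CL
    volume = v_pop * fwt_v                  # V_GRP = V_POP * FWT_V
    return cl_total, cl_renal, volume

print(group_parameters(wt=50.0, age=80.0, scr=1.4))
```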
3 RANDOM EFFECTS MODELS
The random effects model describes the unpredictable variability of the structural model parameters. Two levels of random effects are usually described for PKPD models. The first level describes random effects associated with structural model parameters (i.e., BSVRandom and WSVRandom ). Although these kinds of variability are considered to be random, this random variability ‘‘simply’’ represents a lack of knowledge of the underlying sources of variability.
The second level describes random effects associated with each prediction of an observation. The differences between the observations and the model predictions (residuals) are described by the residual unidentified variability (RUV) or residual error. The distribution of PK parameters is often skewed to the right. Studies with small sample size neither allow one to determine the shape of the distribution reliably nor to estimate its variance precisely. Because distributions of PK parameters are usually skewed to the right, it is often assumed that PK parameters are log-normally distributed, which means that the logarithm of the individual PK parameters is normally distributed. This distribution has reasonable properties, because some subjects show very large values for clearance, and only positive values for clearance are physiologically possible. The individual predictions of clearance and volume of distribution can be obtained from the group estimates by introducing a random effect: The random effects for clearance (ηCL,i) and volume of distribution (ηV,i) describe the difference (on log-scale) of the individual clearance and volume of distribution for the ith subject from their group estimates CLGRP and VGRP. It is assumed that ηCL and ηV are normally distributed with mean zero and variances BSV²CL and BSV²V:

CLi = CLGRP · exp(ηCL,i)
Vi = VGRP · exp(ηV,i)

The individual clearances (CLi) and individual volumes of distribution (Vi) in the
equations shown above follow a log-normal distribution. The BSVCL and BSVV are standard deviations of a normal distribution on log-scale and are approximations to the apparent coefficient of variation (CV) on linear scale (see the comment from Beal (14) for a more in-depth discussion). For example, a BSV²CL estimate of 0.04 yields an apparent CV of 20% for clearance. A useful covariate should result in smaller estimates for the BSV of the respective PK parameter. Figure 2 shows an example of the variability of clearance and volume of central and peripheral compartment with or without body weight as covariate. As shown in Fig. 2, inclusion of WT as a covariate reduced the random (unexplained) variability in clearance and volume of distribution. The reduction of unexplained variability is expressed as a fraction of total variance. The relative variance shown in Fig. 2 is the variance that includes WT as covariate divided by the variance without WT. Inclusion of WT as a covariate reduced the unexplained variability by 27.9% for CL, by 29.1% for V1, and by 38.5% for V2.
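The log-normal parameter model can be illustrated with a short simulation; the group estimates and BSV magnitudes below are hypothetical values, not estimates from any particular study.

```python
# Sketch: draw log-normally distributed individual clearances and volumes
# around hypothetical group estimates, as in CL_i = CL_GRP * exp(eta_CL,i).
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 1000
cl_group, v_group = 10.0, 50.0          # group (typical) estimates, hypothetical
bsv_cl, bsv_v = 0.3, 0.25               # SD of eta on log scale (~ apparent CV)

eta_cl = rng.normal(0.0, bsv_cl, n_subjects)
eta_v = rng.normal(0.0, bsv_v, n_subjects)
cl_i = cl_group * np.exp(eta_cl)        # CL_i = CL_GRP * exp(eta_CL,i)
v_i = v_group * np.exp(eta_v)           # V_i  = V_GRP  * exp(eta_V,i)

# The apparent CV on the linear scale is close to the SD of eta on the log scale
print(np.std(cl_i) / np.mean(cl_i))     # roughly 0.3
print(np.all(cl_i > 0))                 # log-normal parameters stay positive
```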
3.1 Residual Error Another level of variability cannot be described by variability in PK parameters. This variability is called residual error or residual unidentified variability (RUV). The RUV describes the random deviation of an observation from its predicted value in the respective subject. Bioanalytical error, process noise, model misspecification, and WOV of fixed effect parameters contribute to the estimated RUV. Process noise may be caused by the fact that the study is performed in a different way than documented (e.g., a sample was recorded to be taken at 1 hour post dose although it was actually taken at 1.25 hours post dose). Commonly, the RUV is described by a combined additive and proportional error model. C denotes the individual predicted concentration in the absence of residual error, Y is the individual predicted concentration including an additive (SDC ), and a proportional (CVC )
residual error component. The random deviates εSDC and εCVC are assumed to be normally distributed with mean zero and standard deviations SDC and CVC:

Y = C · (1 + εCVC) + εSDC

The residual variability is a different level of random effects from the random BSV. If one subject has 10 observed concentrations, this subject will have 10 random deviates for εSDC and 10 random deviates for εCVC (one for each predicted observation), but it will only have one random deviate for ηCL and ηV.
4 MODEL BUILDING AND PARAMETER ESTIMATION
Model building is a nontrivial task and depends very much on the planned application of the final model. It is traditional to invoke the principle of parsimony (i.e., to use the simplest model that fulfills the requirements of the planned applications and that is physiologically reasonable as the final model). Mechanistic models (if available) are expected to be superior to empiric models when making predictions outside of the observed range. Model building typically proceeds in steps based on hypothesis tests that compare one model with another. This method may be used in conjunction with informal decisions based on the mechanistic plausibility of the model. Parameter estimation is part of the model building process. Each model that is evaluated will involve parameter estimation. The plausibility of the parameter estimates and their identifiability (related to uncertainty of the value) are often used in an informal way to guide model building.
4.1 Hypothesis Testing
Hypothesis testing can be used to test the null hypothesis that a more complex model is indistinguishable from a simpler model. Hypothesis testing can be done by the likelihood ratio test (see objective function below) if both models are nested. Nested means that the more complex model becomes equivalent to the simpler model if a parameter in the complex model is fixed to some
Figure 2. Box-whisker plot for the BSV of CL and the volume of distribution for the central (V1) and peripheral (V2) compartment without and with total body weight as covariate (indicated by the index wt). Boxes show median and interquartile range, whiskers are the 10th to 90th percentiles, and the markers represent 5th to 95th percentiles.
value—usually zero. Examples of nested models are:
1. One, two, or three compartment models with or without lag-time. Multicompartment models become indistinguishable from a one-compartment model if the inter-compartmental clearance is fixed to zero. As lag-time approaches zero, models with lag-time converge to models without lag-time.
2. Models with a) first-order, b) mixed-order (Michaelis-Menten), or c) parallel first-order and mixed-order elimination are nested.
• Model c) converges to model a) if the maximum rate of the mixed-order process (Vmax) is estimated to be zero.
• Model c) converges to model b) if the first-order clearance of model c) is estimated to be zero.
• Model b) approaches model a) if the Michaelis-Menten constant (Km) is much higher (more than 10×) than the peak concentration (the ratio of Vmax and Km then equals the first-order clearance). In a statistically strict sense, a first-order elimination model is not nested within a mixed-order (Michaelis-Menten) elimination model. However, if Km approaches infinity, the objective function of the first-order and the mixed-order model converge to the same value.
3. A model with zero-order elimination is nested within a mixed-order elimination model if Km is zero. Examples of models that are not nested: 1. A model with first-order input (absorption) is not nested within a model with zero-order input. 2. Models that use different covariates are not nested (e.g., a model with total body weight as size descriptor is not nested within a model with lean body mass as size descriptor).
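For nested models, the likelihood ratio test mentioned above compares the drop in −2 log-likelihood against a chi-square distribution (as discussed in the next subsection). The sketch below uses hypothetical −2LL values; the chi-square quantiles themselves are standard.

```python
# Minimal sketch of the likelihood ratio test for two nested models.
# The -2 log-likelihood values below are hypothetical; in practice they are
# reported by the population software for the simple and the complex model.
from scipy import stats

neg2ll_simple = 1452.7     # e.g., one-compartment model (hypothetical value)
neg2ll_complex = 1444.1    # e.g., two-compartment model (hypothetical value)
extra_parameters = 2       # additional parameters in the complex model

delta = neg2ll_simple - neg2ll_complex             # drop in -2LL
p_value = stats.chi2.sf(delta, df=extra_parameters)
print(f"delta -2LL = {delta:.1f}, p = {p_value:.4f}")
# For one extra parameter, a drop of at least 3.84 corresponds to p < 0.05.
```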
4.2 Objective Function
Population software packages calculate (or approximate) the log-likelihood. This likelihood is the probability of observing the data given the proposed model and parameters. In maximum likelihood estimation, one searches for the set of parameters (fixed and random effects) that maximizes this probability. Three general likelihood calculation procedures exist: (1) exact solutions for exact likelihoods (only for nonparametric methods), (2) exact solutions to approximate likelihoods (e.g., NONMEM; GloboMax LLC, Hanover, MD), and (3) approximate solutions to exact likelihoods (e.g., MC-PEM). The exact or approximated likelihood is used as the objective function during the estimation process that searches for the highest likelihood of the data for a specific model and set of model parameters. Usually, population programs
either report the log-likelihood or report −2 × the log-likelihood (-2LL). The difference in -2LL between two models can be used to test the null hypothesis that the more complex model is equivalent to the simpler model. If one parameter is added to a model, then a drop of 3.84 points in -2LL is associated with a P-value of 0.05, under the assumption that the difference in -2LL is approximately chi-square distributed. This procedure is called a likelihood ratio test. Wahlby et al. (15–17) studied the type I error rates of the likelihood ratio test in NONMEM in detail. Population models use the observations from all subjects simultaneously, which is a considerable advantage over the standard two-stage approach. For example, when some subjects do not have data during the absorption phase, it is possible to deduce their absorption parameters based on other subjects with absorption phase data. This ‘‘borrowing’’ of information is a key feature of population models.
4.3 Bootstrap
Bootstrapping is a powerful concept to derive the uncertainty of all parameters of a population model. Importantly, confidence intervals can be constructed for fixed effects, random effects, and derived statistics. Confidence intervals derived via bootstrapping are believed to represent the uncertainty of model parameters better than confidence intervals based on asymptotic standard error estimates and assuming the uncertainty is normally distributed. Standard error estimates are often difficult to compute for some models, and it may not be possible to determine standard errors. A variety of bootstrap techniques can be used. In a parametric bootstrap, new datasets are generated by simulating from the final population model. These simulated datasets are then re-estimated with the final population PK model. This method is called parametric bootstrapping, because the new datasets are generated from parametric distributions for the random effects in the population model. A parametric bootstrap provides robust measures for the precision and bias of the population parameter estimates under the assumption that the chosen population model is correct.
A nonparametric bootstrap involves generation of new datasets by random samples from the original dataset with replacement (‘‘resampling with replacement’’). The same number of subjects as in the original dataset is randomly drawn for each new dataset. Each subject might be drawn multiple times. The population model parameters are reestimated for each of the bootstrap datasets, and the median and nonparametric 95% confidence interval (2.5th to 97.5th percentile) are calculated from the model parameter estimates of all bootstrap replicates. An example of the results of a nonparametric bootstrap is shown in Table 1 for a two-compartment model with zero-order input. The last two columns of Table 1 indicate the slight asymmetry of the confidence intervals (e.g., for random effects parameters). The BSV terms in Table 1 are the square roots of the estimated variance on log-scale. These square roots are commonly interpreted as an approximation to the coefficient of variation (CV) of a log-normal distribution. As pointed out by Beal (14), these quantities should be called apparent CVs; that is, the apparent CV of CL was 30.7% (20.1% to 41.4%) [median (95% confidence interval)] in Table 1. The number of bootstrap samples depends on which statistics are to be derived from the bootstrap replicates. About 50 replicates may be sufficient to get a reasonable estimate of the median and standard error of population model parameters. However, 1,000 or 2,000 replicates are probably required to obtain robust estimates of a 90% or 95% confidence interval.
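The resampling step itself is straightforward to script; the sketch below assumes a long-format data table with an "id" column (an assumption for illustration), and the estimation call `fit_population_model` is a placeholder for the actual run of the population software rather than a working implementation.

```python
# Sketch of a nonparametric bootstrap at the subject level: resample subjects
# with replacement, refit the model on each bootstrap dataset, and summarize
# the estimates by the median and the 2.5th/97.5th percentiles.
import numpy as np
import pandas as pd

def bootstrap_ci(data: pd.DataFrame, fit_population_model, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    subject_ids = data["id"].unique()
    estimates = []
    for _ in range(n_boot):
        # draw the same number of subjects as in the original dataset, with repeats
        sampled = rng.choice(subject_ids, size=len(subject_ids), replace=True)
        # give repeated subjects new ids so they are treated as distinct subjects
        boot = pd.concat(
            [data[data["id"] == s].assign(id=i) for i, s in enumerate(sampled)],
            ignore_index=True,
        )
        # placeholder: returns a dict of parameter estimates for one refit
        estimates.append(fit_population_model(boot))
    est = pd.DataFrame(estimates)
    return est.quantile([0.5, 0.025, 0.975])   # median and 95% CI limits
```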
Table 1. Population means and between subject variability (apparent % CV) of a two-compartment PK model

Parameter        Explanation                          Median    P2.5      P97.5     P2.5/median   P97.5/median
Fixed effects
CL (L h−1)       Total clearance                      9.52      8.37      11.0      88%           116%
V1 (L)           Volume of central compartment        9.45      8.59      10.5      91%           111%
V2 (L)           Volume of peripheral compartment     28.4      23.7      33.4      83%           118%
CLic (L h−1)     Intercompartmental clearance         4.73      4.24      5.35      90%           113%
Random effects (between subject variability of the respective PK parameter)
BSV(CL)                                               0.307     0.201     0.414     66%*          135%*
BSV(V1)                                               0.243     0.151     0.321     62%*          132%*
BSV(V2)                                               0.380     0.159     0.499     42%*          131%*
BSV(CLic)                                             0.251     0.188     0.315     75%*          125%*
Correlations between random effects (coefficient of correlation between pairs of random effects)
r(CL, V1)                                             0.423     −0.062    0.809
r(V1, V2)                                             0.091     −0.330    0.454
r(V2, CLic)                                           −0.002    −0.648    0.740
Error model
CVc              Proportional error                   0.134     0.124     0.145     93%*          109%*
SDc (mg/L)       Additive error                       0.0932    0.0570    0.135     61%*          145%*

Data are medians and 95% confidence intervals (2.5th and 97.5th percentiles) from a non-parametric bootstrap with 1,000 replicates.
* These numbers are the 2.5th percentile and the 97.5th percentile divided by the median (columns 3 to 5). These confidence intervals were also asymmetric if ratios were calculated on the variance scale.
4.4 Bayesian Estimation
The word ‘‘Bayesian’’ is used in various circumstances in PKPD. A key feature of Bayesian techniques is that prior knowledge is incorporated into the estimation of model parameters. Maximum a posteriori (MAP) Bayesian estimation as implemented in ADAPT II (or in the NONMEM POSTHOC step) assumes that the parameters of the population model (= priors) are known exactly (without uncertainty). If the individual PK parameters for a new subject are to be estimated, MAP-Bayesian estimation provides the most likely individual parameter estimates given the prior knowledge (= population PK model and its parameters) and the observations of the new subject. Software packages like BUGS (which includes PKBUGS and WinBUGS; MRC Biostatistics Unit, Cambridge, UK) offer the possibility to account for uncertainty in the priors (i.e., to specify the uncertainty of the parameters of the population PK model). Consider the following scenario: A population PK model was derived in healthy volunteers based on a dataset with 24 healthy volunteers with frequent sampling. New data in patients with sparse sampling (1–5 samples per patient) become available. It is unknown whether the patient population has similar average population PK parameters and a similar variability in these parameters. Bayesian analysis (e.g., in BUGS) allows one to use the population
PK model in healthy volunteers for analysis of the sparse dataset in patients. It is often observed that the patient population has a larger volume of distribution and a lower clearance relative to the healthy volunteers. If enough information is in the patient dataset to estimate the differences in population PK parameters, a full Bayesian approach as implemented in BUGS® allows one to estimate population PK model parameters specific for the patient population. The population mean parameters and BSV can be different in the patient population compared with the healthy volunteers. The population PK model in healthy volunteers is used to support this estimation, and the population PK model parameters for the patients will depend on the estimates of the population PK parameters for the healthy volunteers ( = prior information with uncertainty).
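The MAP objective described above (data likelihood plus a penalty from the population priors) can be illustrated with a minimal sketch. This is not the ADAPT II or NONMEM POSTHOC implementation; it assumes a one-compartment IV bolus model with proportional residual error, and all numbers (priors, dose, sampling times, observations) are hypothetical.

```python
# Sketch of MAP Bayesian estimation of individual PK parameters for one new
# subject, given fixed population priors (no uncertainty on the priors).
import numpy as np
from scipy.optimize import minimize

dose = 100.0                               # mg, IV bolus (hypothetical)
times = np.array([0.5, 2.0, 8.0])          # h, sparse sampling times
obs = np.array([1.7, 1.4, 0.6])            # mg/L, observed concentrations

cl_pop, v_pop = 5.0, 50.0                  # population (prior) typical values
bsv_cl, bsv_v = 0.3, 0.25                  # prior SDs of eta on the log scale
prop_error = 0.15                          # proportional residual error (CV)

def neg_log_posterior(eta):
    eta_cl, eta_v = eta
    cl = cl_pop * np.exp(eta_cl)
    v = v_pop * np.exp(eta_v)
    pred = dose / v * np.exp(-cl / v * times)          # one-compartment model
    # residual (data) term, proportional error, up to an additive constant
    resid = np.sum(((obs - pred) / (prop_error * pred)) ** 2)
    # prior (penalty) term from the population model
    prior = (eta_cl / bsv_cl) ** 2 + (eta_v / bsv_v) ** 2
    return 0.5 * (resid + prior)

fit = minimize(neg_log_posterior, x0=np.zeros(2), method="Nelder-Mead")
eta_cl_hat, eta_v_hat = fit.x
print("MAP individual CL:", cl_pop * np.exp(eta_cl_hat))
print("MAP individual V:", v_pop * np.exp(eta_v_hat))
```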
4.5 Mixture Models
If one estimates the PK parameters in each subject individually, one may observe that one group of subjects has high clearances, whereas the other group of subjects has lower clearances. This kind of bimodal distribution for clearance might be predictable using a fixed effects model based on a covariate such as genotype. However, if an explanatory covariate is not available (e.g., if the genotype cannot be tested), then a random effects model can be used. Mixture models offer the possibility to use a probabilistic (random effect) model to assign each subject to a particular category. This category does not rely on the existence of an observed covariate but uses the best fit of a subject's data to alternative models to predict the most likely category. The category can be used to assign different parameters or even different models to describe the data of a particular subject. Mixture models are a powerful feature of population software packages. They are often useful when BSV seems to be very large and cannot be reduced by available covariates. They can also be applied when a multimodal distribution is known (e.g., a genetic polymorphism for clearance) but no genotypic or phenotypic information (covariate) is available.
5 SOFTWARE
Various population software packages are available. Perhaps the most important differences among them are the estimation method, flexibility, robustness, and estimation time. Important issues with regard to flexibility are:
• Number of dependent and independent variables a program can handle.
• Ability to handle cumulative outputs (like excretion into urine).
• Ability to invoke a program using a scripting mechanism (important for advanced meta-analysis applications such as bootstrapping).
• Possibility to parallelize the computations in order to reduce overall runtime.
Some programs (and algorithms) for population analyses (in alphabetical order) are as follows:
• Adapt V (EM algorithm, will be available soon)
• BUGS (WinBUGS and PKBUGS)
• Kinetica (includes P-Pharm)
• Monolix (SAEM algorithm)
• MW\Pharm & MultiFit
• NLME in S-Plus
• NLMIX
• NONMEM (FO, FOCE, and LAPLACIAN method)
• NPML
• PDx-MC-PEM (MC-PEM algorithm)
• PEM
• S-ADAPT (MC-PEM algorithm)
• SAS (proc nlmixed; SAS Institute Inc., Cary, NC)
• USC*PACK (NPEM, NPAG, and NPOD algorithms)
• WinNonmix
These estimation algorithms use either a parametric or a nonparametric parameter variability model. The programs calculate either the exact likelihood, an approximation to the true likelihood, or an approximate likelihood. The differences between those programs were reviewed by Aarons (18) and Pillai et al. (1). Girard and Mentré (19) presented a blinded comparison of various population software packages. Bauer et al. (20) systematically compared various estimation algorithms for population PKPD models. An in-depth discussion of the appropriateness of one or the other algorithm is beyond the scope of this article. We would like to mention only the following two points:
1. Use of the so-called FO method (first-order method) for population analyses seems only warranted if extremely long run times prevent the use of the FOCE method or algorithms that calculate the exact likelihood. Both the FOCE method (with or without interaction option) and the FO method use an
approximate likelihood. However, the approximation of the FOCE method is much better than the approximation of the FO method, especially if the BSV is large.
2. The programs and algorithms shown above have different capabilities for parallelization. Parallel computing can offer an increase in computation speed, but the only program currently available that offers this feature is S-ADAPT.
6 NONMEM
NONMEM was the first population software package and is currently the most widely used population program. Most experience with any population program is currently available for NONMEM. Its current version (NONMEM VI) offers the FO, FOCE, and LAPLACIAN method for estimation. NONMEM uses parametric parameter variability models and offers the possibility of specifying mixture models. NONMEM VI also offers the possibility to include a so-called ‘‘frequentists’ prior’’ to incorporate knowledge from a previous population PK analysis (e.g., from literature reports). Archives from a very active NONMEM user (NMusers) e-mail discussion group can be found under the following link: http://www.cognigencorp.com/nonmem/nm/. 6.1 Other Programs A list of other software packages and algorithms is shown above. Two key features of other programs are as follows: 1. Ability to specify the parameter variability model by a nonparametric distribution as implemented in the USC*PACK that uses the NPEM, NPAG, or NPOD algorithm. This method makes no assumptions on the distribution of parameters. 2. Incorporation of prior information in a full Bayesian analysis as implemented in BUGS® (including WinBUGS and PKBUGS). This method is flexible, and it allows incorporation of prior information (with uncertainty) from another
population PK analysis or to perform a meta-analysis.
7 MODEL EVALUATION
Model evaluation depends on the planned application of the model. Sometimes, the terms validation or qualification are used. It is arguable whether a population model can be fully validated. The authors believe that evaluation of a population model for an intended application is probably the best one can show. Various methods of model evaluation are used. Some of the more popular methods are listed below.
7.1 Residuals
Residuals are the differences between the observations and model predictions. It is often helpful to use both the individual predictions and the population predictions in a residual analysis. The latter are the predictions for an individual using the group parameter values, which use the covariates of the current subject but with all random effects, η and ε, set to zero. A residual analysis usually involves the following plots:
1. Observations versus individual predictions plot on linear scale and on log-log scale.
2. Observations versus population predictions plot on linear scale and on log-log scale (use of a log2 scale may be advantageous).
3. Residuals versus time plot.
4. Residuals versus observations.
Visual inspection of plots 1 and 2 on log-log scale is important, as plots on linear scale do not reveal misfits for low concentrations. Ideally, the data points in plots 1 and 2 should fall around the line of identity without bias throughout the whole concentration range. If the observation versus individual prediction plot does not show any systematic bias, it does not always mean that the model has good predictive performance [see Bulitta and Holford (21) for details]. If this plot shows systematic bias, it usually means that the
Figure 3. Visual predictive check using an empirical effect compartment model to describe the delay in effect of warfarin on PCA. The lines are the median and the 90% prediction interval obtained by simulation from the PKPD model. The symbols are the observed PCA values. (The CL is the clearance of warfarin, and ke0 is a first-order transfer rate constant between the central compartment and the hypothetical effect compartment. The word ‘‘hypothetical’’ indicates that the amount in this compartment is assumed to be negligible. The E0 is the baseline, Emax is the maximum effect, and EC50 is the concentration associated with half-maximal effect.)
structural model is not flexible enough to describe the observed profiles. A systematic bias in the observation versus population prediction plot points to potential problems with the predictive performance of the model.
7.2 Predictive Checks
Visual predictive checks are a tool used to compare the predictive performance of competing models visually. A visual predictive check involves a stochastic simulation of the individual predicted concentrations for a large number of subjects (usually between 1,000 and 10,000). It is often helpful to calculate these predicted concentrations at nominal time points (theoretical times). The percentiles of those 1,000 to 10,000 profiles are calculated and plotted on top of the observations of all subjects. More sophisticated applications of the predictive checks were presented by Yano et al. (22).
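The simulation step behind such a check can be sketched as below. A one-compartment model with log-normal BSV and proportional residual error is assumed purely for illustration; all parameter values are hypothetical.

```python
# Visual predictive check sketch: simulate many subjects from the population
# model and summarize the simulated profiles by the median and the 5th/95th
# percentiles at nominal time points.
import numpy as np

rng = np.random.default_rng(0)
n_sim = 5000
dose = 100.0
times = np.linspace(0.25, 24.0, 20)        # nominal sampling times (h)

cl_pop, v_pop = 5.0, 50.0                  # population estimates (hypothetical)
bsv_cl, bsv_v = 0.3, 0.25                  # between subject variability (log SD)
prop_err = 0.15                            # proportional residual error

cl = cl_pop * np.exp(rng.normal(0, bsv_cl, (n_sim, 1)))
v = v_pop * np.exp(rng.normal(0, bsv_v, (n_sim, 1)))
conc = dose / v * np.exp(-cl / v * times)                  # (n_sim, n_times)
conc *= 1 + rng.normal(0, prop_err, conc.shape)            # residual error

p5, p50, p95 = np.percentile(conc, [5, 50, 95], axis=0)
# p5, p50, and p95 would be plotted on top of the observed concentrations
print(p50[:5])
```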
Figure 3 shows an example of a visual predictive check. A PKPD model was used to describe the time course of prothrombin complex activity (PCA) after a large oral dose of warfarin. An empirical effect compartment model (23) was used to predict the delay in relation to warfarin plasma concentration. The model captures the central tendency of the raw data adequately, but the BSV is too large. Ideally, 10% of the observations should fall outside the 90% prediction interval (5th to 95th percentile) at each time point. Figure 4 shows the same observations but with a physiologically based turnover model (24) to describe the changes in clotting factors that determine PCA. It is expected to be a superior model on mechanistic grounds. A comparison with Fig. 3 demonstrates that the model misspecification from the effect compartment model causes an overprediction of variability.
7.3 Prediction Discrepancy
A method related to the visual predictive check is the prediction discrepancy (25). The prediction discrepancy compares the distribution of observed and simulated values in a more formal way that does not rely on subjective visual assessment.
7.4 Cross-Validation and External Validation
Cross-validation is a form of internal validation of a population model. The observations are randomly split into two parts: an index dataset (for example, that includes 67% of subjects) and a validation dataset (that
includes the remainder of subjects). The population model is estimated from the index dataset without using the validation dataset. This estimated population model is then used to predict the concentrations of the validation dataset. If there is no systematic bias in the model predictions for the validation dataset, the population PK model derived from the index dataset is concluded to be appropriate for prediction of profiles for subjects of the validation dataset. The random splitting between the index and validation dataset can be repeated multiple times. External validation is supposed to be the gold standard of model validation. An external dataset from a similar subject population as used in the clinical trial from which the population model was estimated must be available. The concentrations are predicted for the external dataset and then compared with the observations of the validation dataset as for an internal validation. Criteria for validation should be specified prior to the analysis, but we are not aware of examples where this has been done. Almost always, the ‘‘validation’’ is claimed to be successful based on the authors' assessment. Unfortunately, comparable clinical trials are not available in most cases. Differences in bioanalytical assays and between clinical sites may bias the results of an external validation.
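The random subject-level split described above can be scripted as in the sketch below; the long-format data table with an "id" column is an assumption for illustration.

```python
# Sketch of the index/validation split used for cross-validation: roughly two
# thirds of the subjects form the index dataset used for estimation, and the
# remainder form the validation dataset used only for prediction.
import numpy as np
import pandas as pd

def split_index_validation(data: pd.DataFrame, index_fraction=0.67, seed=0):
    rng = np.random.default_rng(seed)
    ids = data["id"].unique()
    n_index = int(round(index_fraction * len(ids)))
    index_ids = rng.choice(ids, size=n_index, replace=False)
    index_set = data[data["id"].isin(index_ids)]
    validation_set = data[~data["id"].isin(index_ids)]
    return index_set, validation_set

# The split can be repeated with different seeds, re-estimating the model on
# each index dataset and checking for systematic bias on each validation set.
```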
8 STOCHASTIC SIMULATION
Deterministic simulations predict the concentration time profile of one specific patient based on one set of PK parameters. Stochastic simulations account for the true between patient variability and therefore allow one to predict the range of plasma concentrations for a whole patient population. Instead of the phrase ‘‘stochastic simulations’’ the phrase
‘‘Monte Carlo simulation’’ is frequently used. In this context, the term Monte Carlo is used to describe a simulation technique and not an estimation technique like in the Monte Carlo Parametric Expectation Maximization algorithm.
8.1 Experimental Design
One of the greatest merits of population PKPD modeling is its ability to evaluate and to optimize clinical trial designs. Software packages like WinPOPT (26), PFIM (27), and others can provide optimal sampling times, the optimal number of treatment groups, and the optimal number of subjects per group. The associated standard errors for the population model parameters can be predicted. Clinical trial simulation also offers the ability to predict the power of clinical trials, but it is computationally more difficult because of the time-consuming nature of the repeated simulation and analysis steps and the lack of an automated procedure for converging on an optimal design. Clinical trial simulation can be performed in any of the above-mentioned software packages (e.g., NONMEM, S-ADAPT, SAS, etc.). The Pharsight® Trial Simulator™ is available as a specialized tool for comparison of user-specified clinical trial designs. Optimization of clinical trial design is very valuable when large clinical trials are planned. A large amount of resources can be saved by optimizing the experimental design.
9 APPLICATIONS
Population PKPD and disease progression modeling offers a variety of applications for clinical drug development, regulatory decision-making, and understanding characteristics of the disease in the patient population of interest.
9.1 Drug Development
Clinical drug development takes about 5–10 years and total preapproval costs amount to about 802 million US dollars [adjusted to the year 2000, (28)] per approved drug. About 89% (29) of the drugs that enter the clinical phase do not reach the market. Population
Figure 4. Visual predictive check using a physiologically based turnover model to describe the delay in effect of warfarin on PCA. (The CL is the clearance of warfarin that is assumed to inhibit the input of the prothrombin complex activity (PCA). The Imax is the maximum inhibition, and IC50 is the warfarin concentration in the central compartment (C1) that results in half-maximal inhibition. The kin describes the zero-order input into the PCA compartment, and kout is the first-order rate constant of loss from the PCA compartment.)
PKPD can accelerate drug development, for example, by design and analysis of clinical trials (30). Population PKPD modeling can be powerful if a large number of potentially important independent variables are used (patient covariates, co-medications, etc.) as is usually the case during phase II and III of clinical drug development. More rational dosage regimens can be derived by population PKPD techniques that improve the ability to individualize drug dosing and to include such recommendations in the drug label. Population PKPD modeling can also provide valuable information for selecting dosage regimens for phase II or III clinical trials and can support ‘‘go / no-go decisions.’’ The meaning of potentially important biomarkers and their relationship to the disease progression and drug response can be evaluated. After a predictable and meaningful relationship between a biomarker and clinical outcome is established, this biomarker can be used as surrogate endpoint to make regulatory decisions. A biomarker is a directly observable quantity such as blood pressure or an EEG response. A surrogate marker is a biomarker used for regulatory decisions. The clinical outcome can be thought of as something that
the patient feels or cares about (e.g., the probability of stroke, frequency of admission to hospital, or relief from pain). 9.2 Regulatory Science Population PKPD models support a more rational decision making, especially for large clinical trials. Population models consider all observations in a single data analytical step and account for the time course of concentrations and drug effect(s). These models provide a sound basis for regulatory decision making. Techniques like nonparametric bootstrapping allow one to estimate the confidence intervals of model parameters that can be used for statistical comparisons. A drug can be statistically shown to be effective if the confidence interval for efficacy (Emax: maximal drug effect) does not include zero. Population PKPD models are also very helpful for analysis of studies in patient populations with considerably different sets of covariates. This feature is used in bridging studies in which results in Caucasians may be translated to Japanese patients. 9.3 Human and Disease Biology Population PKPD modeling offers a possibility to improve our understanding of the mechanism and time course of disease progression. Such analyses are performed more
rarely than a population PK analysis. However, disease progression models that incorporate the system biology will probably be important in the future and are likely to support clinical science and academic research. A more in-depth understanding of the human and disease biology offers the potential to guide development of new drugs. 10
10 ACKNOWLEDGMENT
Jürgen Bulitta was supported by a postdoctoral fellowship from Johnson & Johnson.

REFERENCES

1. G. C. Pillai, F. Mentré, and J. L. Steimer, Nonlinear mixed effects modeling - from methodology and software development to driving implementation in drug development science. J. Pharmacokinet. Pharmacodyn. 2005; 32: 161–183.
2. L. Zhang, S. L. Beal, and L. B. Sheiner, Simultaneous vs. sequential analysis for population PK/PD data I: best-case performance. J. Pharmacokinet. Pharmacodyn. 2003; 30: 387–404.
3. L. Zhang, S. L. Beal, and L. B. Sheiner, Simultaneous vs. sequential analysis for population PK/PD data II: robustness of methods. J. Pharmacokinet. Pharmacodyn. 2003; 30: 405–416.
4. L. B. Sheiner and J. L. Steimer, Pharmacokinetic/pharmacodynamic modeling in drug development. Annu. Rev. Pharmacol. Toxicol. 2000; 40: 67–95.
5. L. B. Sheiner and T. M. Ludden, Population pharmacokinetics/dynamics. Annu. Rev. Pharmacol. Toxicol. 1992; 32: 185–209.
6. R. Jelliffe, Goal-oriented, model-based drug regimens: setting individualized goals for each patient. Ther. Drug Monit. 2000; 22: 325–329.
7. H. Sun, E. O. Fadiran, C. D. Jones, et al., Population pharmacokinetics. A regulatory perspective. Clin. Pharmacokinet. 1999; 37: 41–58.
8. L. Sheiner and J. Wakefield, Population modelling in drug development. Stat. Methods Med. Res. 1999; 8: 183–193.
9. N. H. G. Holford, Target concentration intervention: beyond Y2K. Br. J. Clin. Pharmacol. 1999; 48: 9–13.
10. G. B. West, J. H. Brown, and B. J. Enquist, The fourth dimension of life: fractal geometry and allometric scaling of organisms. Science 1999; 284: 1677–1679.
11. B. J. Anderson, K. Allegaert, J. N. Van den Anker, V. Cossey, and N. H. Holford, Vancomycin pharmacokinetics in preterm neonates and the prediction of adult clearance. Br. J. Clin. Pharmacol. 2007; 63: 75–84.
12. D. W. Cockcroft and M. H. Gault, Prediction of creatinine clearance from serum creatinine. Nephron 1976; 16: 31–41.
13. J. M. Hayman, N. P. Shumway, P. Dumke, and M. Miller, Experimental Hyposthenuria. J. Clin. Invest. 1939; 18: 195–212.
14. S. Beal, Apparent coefficients of variation. Available: http://gaps.cpb.ouhsc.edu/nm/91sep2697.html, 1997.
15. U. Wahlby, M. R. Bouw, E. N. Jonsson, and M. O. Karlsson, Assessment of type I error rates for the statistical sub-model in NONMEM. J. Pharmacokinet. Pharmacodyn. 2002; 29: 251–269.
16. U. Wahlby, E. N. Jonsson, and M. O. Karlsson, Assessment of actual significance levels for covariate effects in NONMEM. J. Pharmacokinet. Pharmacodyn. 2001; 28: 231–252.
17. U. Wahlby, K. Matolcsi, M. O. Karlsson, and E. N. Jonsson, Evaluation of type I error rates when modeling ordered categorical data in NONMEM. J. Pharmacokinet. Pharmacodyn. 2004; 31: 61–74.
18. L. Aarons, Software for population pharmacokinetics and pharmacodynamics. Clin. Pharmacokinet. 1999; 36: 255–264.
19. P. Girard and F. Mentré, A comparison of estimation methods in nonlinear mixed effects models using a blind analysis. Abstract 834. Abstracts of the Annual Meeting of the Population Approach Group in Europe, 2005, p. 14.
20. R. J. Bauer, S. Guzy, and C. Ng, A survey of population analysis methods and software for complex pharmacokinetic and pharmacodynamic models with examples. AAPS J. 2007; 9: E60–83.
21. J. B. Bulitta and N. H. G. Holford, Assessment of predictive performance of pharmacokinetic models based on plasma and urine data. Brisbane, Australia: PAGANZ: Population Approach Group in Australia & New Zealand, 2005.
22. Y. Yano, S. L. Beal, and L. B. Sheiner, Evaluating pharmacokinetic/pharmacodynamic models using the posterior predictive check. J. Pharmacokinet. Pharmacodyn. 2001; 28: 171–192.
23. L. B. Sheiner, D. R. Stanski, S. Vozeh, R. D. Miller, and J. Ham, Simultaneous modeling of pharmacokinetics and pharmacodynamics:
application to d-tubocurarine. Clin. Pharmacol. Ther. 1979; 25: 358–371.
24. N. L. Dayneka, V. Garg, and W. J. Jusko, Comparison of four basic models of indirect pharmacodynamic responses. J. Pharmacokinet. Biopharm. 1993; 21: 457–478.
25. F. Mentré and S. Escolano, Prediction discrepancies for the evaluation of nonlinear mixed-effects models. J. Pharmacokinet. Pharmacodyn. 2006; 33: 345–367.
26. S. B. Duffull, N. Denman, J. A. Eccleston, and H. C. Kimko, WinPOPT - Optimization for Population PKPD Study Design. Available: www.winpopt.com, 2006.
27. S. Retout and F. Mentré, PFIM - Population designs evaluation and optimisation. Available: http://www.bichat.inserm.fr/equipes/Emi0357/download.html.
28. J. A. DiMasi, R. W. Hansen, and H. G. Grabowski, The price of innovation: new estimates of drug development costs. J. Health Econ. 2003; 22: 151–185.
29. I. Kola and J. Landis, Can the pharmaceutical industry reduce attrition rates? Nat. Rev. Drug Discov. 2004; 3: 711–715.
30. P. Lockwood, W. Ewy, D. Hermann, and N. Holford, Application of clinical trial simulation to compare proof-of-concept study designs for drugs with a slow onset of effect; an example in Alzheimer's disease. Pharm. Res. 2006; 23: 2050–2059.
FURTHER READING

An introduction to pharmacometrics with many links to other sites. Available: http://www.health.auckland.ac.nz/pharmacology/staff/nholford/pkpd/pkpd.htm.
A. J. Atkinson, C. E. Daniels, R. Dedrick, and C. V. Grudzinskas, Principles of Clinical Pharmacology. San Diego, CA: Academic Press, 2001.
P. L. Bonate, Pharmacokinetic-Pharmacodynamic Modeling and Simulation. New York: Springer, 2005.
H. C. Kimko and S. B. Duffull, Simulation for Designing Clinical Trials: A Pharmacokinetic-Pharmacodynamic Modeling Perspective. New York: Marcel Dekker, Inc., 2002.
Lecture (including video and PowerPoint) by Dr. Raymond Miller in the Principles of Clinical Pharmacology course by Dr. Arthur J. Atkinson at the NIH. Available: http://www.cc.nih.gov/researchers/training/principles.shtml.
P. Macheras and A. Iliadis, Modeling in Biopharmaceutics, Pharmacokinetics and Pharmacodynamics: Homogeneous and Heterogeneous Approaches. New York: Springer, 2005.
I. Mahmood, Interspecies Pharmacokinetic Scaling: Principles and Application of Allometric Scaling. Rockville, MD: Pine House Publishers, 2005.
Pharmacometrics homepage of the American College of Clinical Pharmacology (AACP). Available: http://www.accp1.org/cgi-bin/start.cgi/pharmacometrics/disclaimer.htm.
PharmPK online archives including many links to population software packages and learning material. Available: http://www.boomer.org/pkin/.
J. C. Pinheiro and D. M. Bates, Mixed Effects Models in S and S-Plus. New York: Springer, 2002.
D. J. Spiegelhalter, K. R. Abrams, and J. P. Myles, Bayesian Approaches to Clinical Trials and Health-Care Evaluation (Statistics in Practice). Chichester, UK: John Wiley & Sons, 2004.
CROSS-REFERENCES

Phase II/III Trials
Mixed Effects Models
Bayesian Approach
Meta-analysis
Clinical Trial Simulation
POSTMENOPAUSAL ESTROGEN/PROGESTIN INTERVENTIONS TRIAL (PEPI)
ROBERT D. LANGER
DRSciences, LLC, Jackson, Wyoming, and Geisinger Health System, Danville, Pennsylvania

During the 1980s, several observational studies of varying designs conducted in different populations found an association between estrogen replacement in postmenopausal women and lower rates of coronary heart disease (CHD) and possibly stroke (1–4). Because CHD was and remains the leading cause of death in women, there was substantial interest in testing this potential benefit. Nevertheless, there was cause for skepticism because studies that compared the characteristics of HRT users to nonusers suggested a potential healthy user bias (5). By the mid-1980s, some investigators and clinicians were calling for a clinical trial to test whether hormone therapy prevented CHD, and a conference that explored this possibility was held under the auspices of the U.S. National Heart, Lung, and Blood Institute (6). For a variety of reasons, including the recognition that CHD rates are relatively low in women near the age of menopause and that the estrogen effect on CHD was believed to be mediated principally by lipid benefits—a mechanism that takes years to become evident in generally healthy people—a trial with CHD as an endpoint was projected to require upward of 30,000 women followed for about 10 years. Given the magnitude of resources required, this idea was tabled pending more evidence. The concept of the PEPI study took root in this environment.

1 DESIGN AND OBJECTIVES

The primary objective of PEPI was to compare a variety of treatment regimens expected to be in common clinical use when the results were published for their effects on cardiovascular risk factors. Secondary stipulated outcomes included endometrial safety, changes in bone density, and effects on quality of life (7).

1.1 Sponsorship

PEPI was sponsored by the National Heart, Lung, and Blood Institute. The initial design was conceived by institute staff. A request for proposals was prepared to solicit interest from qualified institutions. Potential investigators prepared applications that addressed the scientific issues and proposed methods for the study; these documents were peer-reviewed and scored for scientific merit. PEPI was conducted as a collaborative research agreement through the Coordinating Center at the Bowman Gray School of Medicine at Wake Forest University.

1.2 Clinical Centers, Participants, and Investigators

PEPI was conducted at seven clinical centers across the United States, which involved a cross-section of U.S. women. Clinical centers were located at Stanford University, the University of California—Los Angeles, the University of California—San Diego, the University of Texas at San Antonio, the University of Iowa, Johns Hopkins University, and George Washington University. The study enrolled 875 women ages 45 to 64 years old and between 1 and 10 years postmenopausal at study baseline (8). Investigators were drawn from diverse disciplines, which included gynecology, general and internal medicine, cardiovascular epidemiology, laboratory medicine, biostatistics, and public health.

1.3 Considerations in the Selection of Treatments

The PEPI trial was planned at a time when hormone therapy was in flux (9). A major goal was to provide clinically relevant information on treatments expected to be mainstream when the study was completed. Until approximately a decade before, nearly all postmenopausal hormone therapy was estrogen alone. Typical doses were 1.25 to 2.5 mg of conjugated equine estrogens. In the mid-1970s,
several studies that used different methods and that were conducted in different populations linked estrogen to endometrial carcinoma in women with a uterus (10–13). Because approximately 60% of postmenopausal women are in this category, the recognition of this association resulted in a steep decline in estrogen use for several years, and a parallel search for ways to protect against the endometrial hyperplasia associated with this risk. By the early 1980s, several clinical trials had demonstrated that the addition of a progestin (a synthetic progesterone) for one third to one half of a monthly cycle of estrogen protected against endometrial hyperplasia (14–16). A variety of dosing schedules and drug combinations were developed for women with a uterus; the typical dose of estrogen was reduced to about half the prior amount, with 0.625 mg daily becoming standard, and progestogens such as medroxyprogesterone acetate, megestrol, levonorgestrel, and norethindrone acetate were added for 8 to 15 days per monthly cycle. Some clinicians favored 21 to 25 days of estrogen per month, whereas others favored continuing estrogen daily (17). As PEPI was conceived at the NHLBI, evidence was developing to suggest that adding a progestogen changed the effects of estrogens on lipids, particularly the effect on high-density lipoprotein (HDL). Questions arose regarding potential differences between progestogens in the magnitude of this effect (17). Clinicians recognized that, at least for some women, cyclical progestogen seemed to promote fluid shifts, breast tenderness, bloating, and irritability. This generated interest in altered dosing patterns that could still prevent endometrial hyperplasia. The concept of using a lower dose of progestogen continuously along with continuous estrogen began to emerge, on the theory that this could induce endometrial atrophy (18,19). From a lipid perspective, the relative effect of a lower cumulative progestogen dose with continuous exposure compared with a higher episodic dose was unknown but worthy of study. Other emerging treatments, with limited or no availability in the United States at the time, included nonoral estrogens and progesterone.
2 STUDY DESIGN

2.1 Duration

PEPI was planned as a study of intermediate markers that were likely to represent mechanisms linking hormone therapy and cardiovascular disease. Important among these were lipids, and based on data from lipid studies conducted in the same era, a treatment interval of 3 years was felt to be appropriate (7). This interval would provide adequate time for effects to reach steady state, and it would also provide adequate time to evaluate adverse effects, tolerability, and quality of life, as well as effects in organ systems that react more slowly, which include bone.

2.2 Trial Design

PEPI was a randomized, double-blind, placebo-controlled clinical trial.

2.3 Treatments

Treatments were chosen primarily based on their contribution to the data on which the hypothesis of coronary benefit was based, and on the principle that PEPI should focus on treatments that were likely to represent clinical use in the United States when the study was completed (7). Conjugated equine estrogens (CEE) were selected for the estrogen component of all active treatments because most observational data reflected the use of this compound, and because it was the dominant estrogen in the United States at the time. The CEE dose was set at 0.625 mg, which was the more conservative dose that had become common in the years just prior to the study. Two progestogens were selected: medroxyprogesterone acetate (MPA), which was the dominant drug in clinical use at the time, and micronized progesterone, which was not then available in the United States. MPA was tested in two patterns, the then common sequential regimen of 10 mg for 12 days per cycle, and the continuous pattern of 2.5 mg daily that was developing at that time. Micronized progesterone was selected because of concern that available synthetic
progestins, which include MPA, had androgenic effects that could attenuate the effects of estrogen on cardiovascular risk factors. Micronized progesterone is a C-21 compound identical to endogenous progesterone but prepared in a manner that creates small particles in an oil base to facilitate absorption through the gut; it was given as 200 mg for 12 days per cycle. To address the possibility that progestogens might add benefit for some outcomes in addition to providing endometrial protection (e.g., bone, quality of life), women with and without a uterus were randomized equally to all treatments. Five treatment conditions were defined, as shown in Table 1.

Table 1. PEPI Treatment Groups

Estrogen               Progestogen
CEE 0.625 mg daily     Placebo
CEE 0.625 mg daily     MPA 10 mg for 12 days per cycle
CEE 0.625 mg daily     MPA 2.5 mg daily
CEE 0.625 mg daily     Micronized progesterone for 12 days per cycle
Placebo                Placebo

2.4 Eligibility

Primary eligibility criteria were age between 45 and 64 years and at least 1, but not more than 10, years postmenopausal. Menopausal status was assessed by time since last menstrual period or time since oophorectomy, as well as hormone levels (follicle-stimulating hormone and estradiol). Exclusions included endometrial hyperplasia on baseline study biopsy; hypertension that required more than two drugs; coronary heart disease that required antiarrhythmic medication; congestive heart failure; insulin-dependent diabetes; any major cancer within 5 years of baseline; malignant melanoma, endometrial, or breast cancer ever; nontraumatic hip fracture; or a thromboembolic event associated with estrogen use. Women with myocardial infarction more than 6 months prior to baseline were eligible so long as they did not require management as noted above. Exclusions for potential adherence issues included severe vasomotor symptoms or failure to maintain adequate compliance on a (single-blind) 1-month cycle of placebo medications (7).
3 OUTCOMES
3.1 Cardiovascular

The primary motivation for studying hormone therapy regimens was the potential benefit in preventing CHD. Accordingly, cardiovascular risk factors were to be primary. It was recognized that effects could occur through multiple mechanisms and that these effects could even oppose one another (7,20). The hypothesis that estrogen could reduce CHD through increased HDL was prominent, and HDL was recognized to be the strongest lipid predictor in women. Accordingly, HDL was set as the primary lipid outcome, with other lipids, including low-density lipoprotein (LDL) and triglycerides, also to be assessed. It was decided to draw fasting blood for lipids at baseline and every 6 months, because this schedule would provide an adequate interval to observe changes and would yield 6 data points for compliant participants to assess longitudinal trends. However, it was clearly understood that lipids were unlikely to be the only important pathway that moderated hormone effects on CHD risk. Researchers were concerned that hormones might affect insulin resistance, which is a particularly strong cardiovascular disease risk factor in women. The most practical assessment for this factor on an outpatient basis was a glucose tolerance test. Striking a balance between the science and participant comfort, fasting and 2-hour post-glucose-load samples for insulin and glucose were obtained at baseline and yearly. Insulin levels were stipulated as the primary factor. Researchers were also concerned that hormones could elevate systolic blood pressure, potentially worsening that risk factor. Accordingly, systolic and diastolic blood pressures were measured following a 5-minute rest using random-zero sphygmomanometers at each
visit (every 6 months). Finally, hemostatic markers were emerging as a new class of cardiovascular risk factors, and the effects of hormone regimens on these markers were unknown. Fibrinogen was the most widely available assay and was measured at baseline, 1 year, and 3 years on all participants. Specimens for other markers of hemostasis, and for renin and aldosterone, were obtained on a subsample of women that represented three of the seven participating centers.

3.2 Endometrial Safety

The primary safety outcome in PEPI was endometrial disease. A normal baseline endometrial biopsy was required for eligibility. As noted earlier, hormone therapy regimens were in flux (17). A variety of dosing patterns, compounds, and cumulative doses was in use, and a fairly radical new pattern—continuous progestin—was in development (18,19). No comparative data existed on the endometrial safety of these alternatives. Clearly, these data were necessary to help determine the relative merits of these treatments in clinical practice. Because not all disease is symptomatic, it was decided to obtain endometrial biopsies on all women with a uterus annually (even those on placebo, to maintain treatment masking).

3.3 Bone Mineral Density

Evidence from other clinical trials and observational studies documented maintenance of bone density with hormone therapy, whereas most women lost bone density in the absence of treatment (21,22). It was controversial whether hormone therapy could increase bone density and whether differences existed between treatment regimens in this effect. Based on the expectation that clinically meaningful changes in bone density require 1 to 2 years of treatment, PEPI obtained bone density measurements at baseline, 12 months, and 3 years.

3.4 Quality of Life

Clinical experience suggested that hormone regimens could have diverse effects on quality of life. Potential areas affected included mood, sexual function, sense of well-being,
sleep, breast and pelvic symptoms, vaginal bleeding patterns, and physical activity. A comprehensive quality-of-life assessment was assembled to address these areas. Where possible, scales were drawn from existing standardized instruments. Otherwise, new scales were constructed with the assistance of content experts (23).

3.5 Other

Measurements of waist, hip, and thigh circumference were obtained annually, as was recent dietary history using a food frequency questionnaire. Electrocardiograms were performed at baseline and at the end of the study.

3.6 Sample Size and Power

Sample size was estimated based on the desire to detect differences in HDL cholesterol between the 5 study arms. The null hypothesis was that there would be no detectable differences in HDL between any of the 5 treatment groups. All pair-wise comparisons were considered, which yielded 10 potential comparisons. The treatment-related difference was defined as the average of four postrandomization measures minus the average of two prerandomization measures. The required sample size was calculated using standard formulas for z-tests that stipulated 80% power to detect a minimum 5 mg/dl difference in HDL for each pair-wise comparison after 3 years of follow-up. Type I error was set at α = 0.05, and the Bonferroni approach was used so that α was divided by 40 (10 pair-wise comparisons times 4 endpoints). Potential attrition was considered, including differential loss of women with or without a uterus, and the number of subjects needed was increased to account for this loss, with dropouts assigned values consistent with the null hypothesis. The cross-sectional standard deviation used for HDL was 17.9 mg/dl, with a serial correlation of 0.72. Based on these assumptions, 840 women were required (120 at each of the 7 participating sites). That sample was projected to yield 92% power to detect differences of 5 mg/dl or greater in any pair-wise comparison after 3 years (a rough numerical sketch of this calculation is given after Section 3.7). Following determination of the sample size required for the HDL outcome, the minimum
detectable treatment effects at 80% and 90% power were estimated for the other 3 primary study outcomes. Estimated detectable differences for these at 80% power were fibrinogen 20.1 mg/dl, insulin 26.4 µU/ml, and systolic blood pressure 5 mmHg.

3.7 Follow-up Visits and Safety Assessments

Participants were contacted at 3 months to assess any potential safety issues and to reinforce compliance if appropriate. Participants were examined in the clinic every 6 months. At those visits, a clinician took an interval history focused on potential adverse events; blood pressure, height, and weight were measured; and blood was drawn to be stored for possible future assays. All participants had an annual physical exam including a pelvic exam, Pap smear, clinical breast exam, and mammography. Women with a uterus had an annual endometrial biopsy (7).
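The sample-size reasoning of Section 3.6 can be sketched numerically. The fragment below is a rough reconstruction under stated assumptions, not the trial's actual computation: it assumes a compound-symmetry correlation of 0.72 between any two HDL measures on the same woman and an illustrative 30% dropout fraction, so the resulting counts only approximate the published figure of 168 women per arm.

```python
import math
from scipy.stats import norm

sd_hdl = 17.9        # cross-sectional SD of HDL (mg/dl)
rho = 0.72           # serial correlation between repeated HDL measures
delta = 5.0          # pair-wise difference to detect (mg/dl)
alpha = 0.05 / 40    # Bonferroni: 10 pair-wise comparisons x 4 endpoints
power = 0.80

# Outcome: mean of 4 post-randomization measures minus mean of 2
# pre-randomization measures (compound symmetry assumed).
var = sd_hdl ** 2
var_diff = var * (1 + 3 * rho) / 4 + var * (1 + rho) / 2 - 2 * rho * var

z_a = norm.ppf(1 - alpha / 2)
z_b = norm.ppf(power)
n_arm = 2 * (z_a + z_b) ** 2 * var_diff / delta ** 2
print(f"per-arm n before attrition adjustment: {math.ceil(n_arm)}")

# PEPI inflated the sample size for attrition, assigning dropouts values
# consistent with the null; one crude approximation is to shrink the
# detectable effect to (1 - d) * delta for an assumed dropout fraction d.
d = 0.30  # assumed for illustration only
print(f"per-arm n with {d:.0%} assumed dropout: {math.ceil(n_arm / (1 - d) ** 2)}")
```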
3.8 Blinding

All participants, laboratory personnel, and clinical site investigators were blind to study treatment. Placebo and active drugs were prepared by the manufacturers in visually identical form. These elements were sent to a central facility that repackaged them in packages organized by cycle, labeled them with trial randomization codes, and shipped them to the clinical sites. Because of concern about possible endometrial hyperplasia, a system was designed to permit partial unblinding of a consulting gynecologist. This physician could be consulted regarding vaginal bleeding and could request treatment assignment within 3 groups—placebo, estrogen alone, or combined hormone therapy. This information allowed the consultant to determine the need for clinical evaluation or action while retaining the blind for all other parties in most cases (7).

4 RESULTS

PEPI was planned to enroll 840 participants: 120 per center and 168 per treatment arm. That recruitment goal was exceeded with 875 women enrolled as some centers were permitted to continue recruitment beyond their goal to ensure that adequate numbers were obtained. The primary results were published in January 1995 (20). Overall participant retention was excellent, with 97% of women completing the 36-month assessment.

4.1 Cardiovascular Risk Factors
HDL cholesterol, which was the primary stipulated outcome, was improved in all active treatment groups compared with placebo (20). Differences were observed between treatments in this effect; CEE alone and CEE plus micronized progesterone increased HDL significantly more than either of the CEE plus MPA regimens (Fig. 1). The two MPA-containing regimens were also associated with an increase in 2-hour post-load glucose compared with placebo, whereas CEE alone and CEE plus micronized progesterone
were not. Estrogen alone was associated with a reduction in fasting glucose (Fig. 2). All active treatments had equivalent effects on LDL and fibrinogen, in the direction associated with lower cardiovascular risk. All active treatments increased triglycerides in a similar manner. There were no significant effects on blood pressure (Fig. 3) (20).

[Figure 1. PEPI: Mean percent change in high-density lipoprotein cholesterol over 36 months, by treatment group (placebo, CEE, CEE + MPA sequential, CEE + MPA continuous, CEE + MP sequential).]

[Figure 2. PEPI: Mean percent change in 2-hour glucose over 36 months, by treatment group.]
[Figure 3. PEPI: Mean percent change in systolic blood pressure over 36 months, by treatment group.]
4.2 Endometrial Safety

The annual rate of endometrial hyperplasia among women with a uterus who were randomized to CEE alone was over 20% (20,24). Women with significant hyperplasia as assessed by the local gynecologic consultant or their personal physician were withdrawn from study treatment. More than half of women with a uterus in this arm were lost to treatment for this reason over the 3-year study period (20). No endometrial carcinomas occurred in the ‘‘estrogen alone’’ arm, but there was an excess rate of hysterectomy. The three active treatments with a progestogen were equally effective in preventing endometrial disease. In all, two cases of endometrial carcinoma were observed: one in a participant on CEE plus micronized progesterone and one in the placebo group.
4.3 Bone Mineral Density

All active treatments were associated with a modest increase in bone mineral density, and no significant differences were observed between treatments (Fig. 4) (20,25). This finding was of major importance because other studies had demonstrated independent effects on bone preservation for both estrogens and progestogens. PEPI demonstrated
that, in typical clinical regimens, there was no meaningful benefit in bone mineral density associated with adding a progestogen to estrogen. So, with regard to bone, there was no reason for women without a uterus to use a progestogen.

[Figure 4. PEPI: Mean percent change in bone mineral density (lumbar spine and hip) at 36 months, by active treatment group (CEE, CEE + MPA sequential, CEE + MPA continuous, CEE + MP sequential).]

4.4 Quality of Life

All active treatments reduced vasomotor symptoms compared with placebo. Treatments with progestogens increased breast tenderness. No differences were observed in cognitive or affective measures (23,26).

4.5 Monitored Safety Outcomes

Other than endometrial hyperplasia, no significant differences were observed in the incidence of any monitored outcome, including coronary events, stroke, gall bladder disease, and breast cancer. However, few such events were expected given the sample size and duration of the study.
5 CONCLUSIONS
PEPI was an important intermediate step in the evaluation of the hypothesis that hormone therapy, particularly the estrogen component, could reduce coronary disease in women after menopause. Because coronary disease is the most common cause of death in women, with rates accelerating with age after menopause, such an effect could have enormous public health impact.
5.1 PEPI and the WHI

PEPI demonstrated biological plausibility for intermediate markers of cardiovascular disease that supported the hypothesis that hormone therapy, particularly estrogen, could reduce the risk of coronary disease. It contributed to the body of knowledge that resulted in a large-scale test of hormone therapy effects on coronary events in one of the clinical trials of the Women's Health Initiative (WHI), which was planned by members of the same NIH institute (NHLBI) as the PEPI trial progressed (27). However, there were important differences between the PEPI and WHI trials in participant characteristics. Women in PEPI were between 1 and 10 years postmenopausal and between the ages of 45 and 64 years at baseline. In contrast, women in the WHI were between 50 and 79 years old, and there were no criteria for years since menopause (28). The average time since menopause in the WHI was more than 12 years (29). The failure of the WHI to demonstrate a reduction in coronary heart disease for one combination regimen tested in PEPI (CEE plus continuous MPA 2.5 mg) caused a major reconsideration of this hypothesis (29). However, the PEPI finding of significant differences in HDL and glucose measures between CEE alone (and CEE plus micronized progesterone) and the two MPA-containing regimens was also partially replicated by the WHI finding that CEE alone was not associated with an increase in coronary events overall and was associated with a significant reduction in coronary events in women aged 50 to 59 years (30).

5.2 Definitive Answers to Important Secondary Questions

PEPI provided unequivocal answers to two important secondary questions. First, it showed that there was no rationale for the use of a progestogen in women who did not need endometrial protection. There was no advantage for bone mineral density, and there was a potential disadvantage in HDL cholesterol, particularly with MPA. Second, it demonstrated that the rate of endometrial hyperplasia associated with even a mid-level
dose of unopposed estrogen was unacceptably high for routine clinical use. Yet it also found that hyperplasia rarely progressed to endometrial carcinoma with appropriate evaluation for unexpected bleeding and annual monitoring.

5.3 Contributions to Subsequent Women's Health Research

Because it was the first major NIH trial focused on older women, PEPI pioneered some history and exposure forms, as well as some quality of life scales, that have been adopted by a variety of studies since. Prior to PEPI, nearly all major long-term clinical trials had been conducted in men, and some investigators were concerned that women might not volunteer and participate in trials of this type. PEPI's success in both recruitment and adherence laid these concerns to rest.

REFERENCES

1. M. J. Stampfer, W. C. Willett, G. A. Colditz, et al., A prospective study of postmenopausal estrogen therapy and coronary heart disease. N. Engl. J. Med. 1985; 313: 1044–1049.
2. D. B. Petitti, J. A. Perlman, and S. Sidney, Noncontraceptive estrogens and mortality: long-term follow-up of women in the Walnut Creek Study. Obstet. Gynecol. 1987; 70: 289–293.
3. T. L. Bush, E. Barrett-Connor, L. D. Cowan, et al., Cardiovascular mortality and noncontraceptive use of estrogen in women: results from the Lipid Research Clinics Program Follow-up Study. Circulation 1987; 75: 1102–1109.
4. A. Paganini-Hill, R. K. Ross, B. E. Henderson, Postmenopausal oestrogen treatment and stroke: a prospective study. BMJ 1988; 297: 519–522.
5. M. H. Criqui, L. Suarez, E. Barrett-Connor, J. McPhillips, D. L. Wingard, and C. Garland, Postmenopausal estrogen use and mortality. Results from a prospective study in a defined, homogeneous community. Am. J. Epidemiol. 1988; 128: 606–614.
6. E. Eaker, B. Packard, E. K. Wenger, Coronary heart disease in women: proceedings of an NIH workshop. New York: Haymarket Doyma, 1987.
7. M. A. Espeland, T. L. Bush, I. Mebane-Sims, M. L. Stefanick, S. Johnson, R. Sherwin, M. Waclawiw, Rationale, design, and conduct of the PEPI Trial. Postmenopausal Estrogen/Progestin Interventions. Control. Clin. Trials 1995; 16: 3S–19S.
8. V. T. Miller, R. L. Byington, M. A. Espeland, R. D. Langer, R. Marcus, S. Shumaker, M. P. Stern, Baseline characteristics of the PEPI participants. Control. Clin. Trials 1995; 16: 54S–65S.
9. P. C. MacDonald, Estrogen plus progestin in postmenopausal women – Act II. N. Engl. J. Med. 1986; 315: 959–961.
10. H. K. Ziel and W. D. Finkle, Increased risk of endometrial carcinoma among users of conjugated estrogens. N. Engl. J. Med. 1975; 293: 1167–1170.
11. D. C. Smith, R. Prentice, D. J. Thompson, and W. L. Herrmann, Association of exogenous estrogen and endometrial carcinoma. N. Engl. J. Med. 1975; 293: 1164–1167.
12. T. M. Mack, M. C. Pike, B. E. Henderson, R. I. Pfeffer, V. R. Gerkins, M. Arthur, and S. E. Brown, Estrogens and endometrial cancer in a retirement community. N. Engl. J. Med. 1976; 294: 1262–1267.
13. N. S. Weiss, D. R. Szekely, and D. F. Austin, Increasing incidence of endometrial cancer in the United States. N. Engl. J. Med. 1976; 294: 1259–1262.
14. M. E. Paterson, T. Wade-Evans, D. W. Sturdee, M. H. Thom, and J. W. Studd, Endometrial disease after treatment with oestrogens and progestogens in the climacteric. Br. Med. J. 1980; 280: 822–824.
15. M. H. Thom and J. W. Studd, Oestrogens and endometrial hyperplasia. Br. J. Hosp. Med. 1980; 23: 506, 508–509, 511–513.
16. M. H. Thom, P. J. White, R. M. Williams, D. W. Sturdee, M. E. Paterson, T. Wade-Evans, and J. W. Studd, Prevention and treatment of endometrial disease in climacteric women receiving oestrogen therapy. Lancet 1979; 2: 455–457.
17. M. I. Whitehead and D. Fraser, The effects of estrogens and progestogens on the endometrium. Modern approach to treatment. Obstet. Gynecol. Clin. North Am. 1987; 14: 299–320.
18. L. Weinstein, Efficacy of a continuous estrogen-progestin regimen in the menopausal patient. Obstet. Gynecol. 1987; 69: 929–932.
19. L. Weinstein, C. Bewtra, and J. C. Gallagher, Evaluation of a continuous combined low-dose regimen of estrogen-progestin for treatment
of the menopausal patient. Am. J. Obstet. Gynecol. 1990; 162: 1534–1589.
20. The Writing Group for the PEPI trial, Effects of estrogen or estrogen/progestin regimens on heart disease risk factors in postmenopausal women: the Postmenopausal Estrogen/Progestin Interventions (PEPI) Trial. JAMA 1995; 273: 199–208.
21. R. Lindsay, D. M. Hart, C. Forrest, and C. Baird, Prevention of spinal osteoporosis in oophorectomized women. Lancet 1980; 2: 1151–1154.
22. B. Ettinger, H. K. Genant, and C. E. Cann, Long-term estrogen replacement therapy prevents bone loss and fractures. Ann. Intern. Med. 1985; 102: 319–324.
23. G. A. Greendale, B. A. Reboussin, P. Hogan, V. M. Barnabei, S. Shumaker, S. Johnson, et al., Symptom relief and side effects of postmenopausal hormones: results from the Postmenopausal Estrogen/Progestin Interventions Trial. Obstet. Gynecol. 1998; 92: 982–988.
24. The Writing Group for the PEPI trial, Effects of hormone replacement therapy on endometrial histology in postmenopausal women. The Postmenopausal Estrogen/Progestin Interventions (PEPI) Trial. JAMA 1996; 275: 370–375.
25. The Writing Group for the PEPI trial, Effects of hormone therapy on bone mineral density: results from the postmenopausal estrogen/progestin interventions (PEPI) trial. JAMA 1996; 276: 1389–1396.
26. B. A. Reboussin, G. A. Greendale, and M. A. Espeland, Effect of hormone replacement therapy on self-reported cognitive symptoms: results from the Postmenopausal Estrogen/Progestin Interventions (PEPI) trial. Climacteric 1998; 1: 172–179.
27. J. E. Rossouw, L. P. Finnegan, W. R. Harlan, V. W. Pinn, C. Clifford, and J. A. McGowan, The evolution of the Women's Health Initiative: perspectives from the NIH. J. Am. Med. Womens Assoc. 1995; 50: 50–5.
28. The Women's Health Initiative Investigators, Design of the Women's Health Initiative clinical trial and observational study. Control. Clin. Trials 1998; 19: 61–109.
29. Writing Group for the Women's Health Initiative, Risks and benefits of estrogen plus progestin in healthy postmenopausal women: principal results from the Women's Health Initiative randomized controlled trial. JAMA 2002; 288: 321–333.
30. J. Hsia, R. D. Langer, J. E. Manson, L. Kuller, K. C. Johnson, S. L. Hendrix, M. Pettinger, S. R. Heckbert, N. Greep, S. Crawford, et al., Conjugated equine estrogens and coronary disease. Arch. Int. Med. 2006; 166: 357–365.
FURTHER READING PEPI published manuscripts—Baseline E. Barrett-Connor, H. G. Schrott, G. Greendale, D. Kritz-Silverstein, M. A. Espeland, M. P. Stern, T. Bush, J. A. Perlman, Factors associated with glucose and insulin levels in postmenopausal women. Diabetes Care. 1996. T. L. Bush, M. A. Espeland, and I. Mebane-Sims, The Postmenopausal Estrogen/Progestin Interventions (PEPI) Trial: overview. Control. Clin. Trials 1995; 16: 1S–2S. G. A. Greendale, P. Hogan, and S. Shumaker, Sexual functioning in postmenopausal women: the Postmenopausal Estrogen/Progestins Intervention Trial (PEPI). J. Women’s Health 1996; 5: 445–458. G. A. Greendale, P. Hogan, D. Kritz-Silverstein, R. Langer, S. R. Johnson, and T. Bush, Age at menopause in women participating in the Postmenopausal Estrogen/Progestins Interventions (PEPI) Trial: an example of bias introduced by selection criteria. Menopause 1995; 2: 27–34. G. A. Greendale, M. K. James, M. A. Espeland, E. Barrett-Connor, Can we measure prior postmenopausal estrogen/progestin use? The Postmenopausal Estrogen/Progestin Interventions Trial. Am. J. Epidemiol. 1997; 9: 763–770. G. A. Greendale, L. Bodin-Dunn, S. Ingles, R. Haile, and E. Barrett-Connor Leisure, home and occupational physical activity and cardiovascular risk factors in postmenopausal women: the Postmenopausal Estrogens/Progestins Intervention (PEPI) study. Arch. Intern. Med. 1996; 156: 418–424. G. A. Greendale, H. B. Wells, E. Barrett-Connor, R. Marcus, and T. L. Bush, Lifestyle factors and bone mineral density: the Postmenopausal Estrogen/Progestin Interventions study. J. Women’s Health 1995; 4: 231–246. S. Johnson, I. Mebane-Sims, P. E. Hogan, and D. B. Stoy, The Postmenopausal Estrogen/Progestin Interventions (PEPI) Trial: recruitment of postmenopausal women in the PEPI trial. Control. Clin. Trials 1995; 16: 20S–35S. S. L. Hall and G. A. Greendale, The relation of dietary vitamin C intake to bone mineral density: results from the PEPI study. Calcif. Tissue Int. 1998; 63: 183–189.
R. D. Marcus, G. Greendale, B. A. Blunt, T. L. Bush, S. Sherman, R. Sherwin, H. Wahner, and H. B. Wells, Correlates of bone mineral density in the Post-menopausal Estrogen/Progestin Interventions Trial (PEPI). J. Bone Miner. Res. 1994; 3: 1467–1476. V. T. Miller, R. L. Byington, M. A. Espeland, R. Langer, R. Marcus, S. Shumaker, M. P. Stern, The Postmenopausal Estrogen/Progestin Interventions (PEPI) Trial: Baseline characteristics of the PEPI participants. Control. Clin. Trials 1995; 16: 54S–65S. H. G. Schrott, C. Legault, M. L. Stefanick, V. T. Miller, J. LaRosa, P. D. Wood, and K. Lippel, Lipids in postmenopausal women: baseline findings of the Postmenopausal Estrogen/Progestin Interventions Trial. J. Women’s Health 1994; 3: 155–164. M. L. Stefanick, C. Legault, R. P. Tracy, G. Howard, G. Kessler, D. L. Lucas, and T. L. Bush, The distribution and correlates of plasma fibrinogen in middle-aged women: initial findings of the Postmenopausal Estrogen/Progestins Interventions (PEPI) study. Arterioscler. Thromb. Vasc. Biol. 1995; 15: 2085–2093. P. D. Wood, G. Kessler, Lippel K., M. L. Stefanick, C. H. Wasilauskas, and H. B. Wells, The Postmenopausal Estrogen/Progestin Interventions (PEPI) Trial: Physical and laboratory measurements in the PEPI trial. Controlled Clin Trials, 1995, 16: 36S–53S. PEPI published manuscripts—Outcomes E. Barrett-Connor, S. Slone, G. Greendale, D. Kritz-Silverstein, M. Espeland, S. R. Johnson, M. Waclawiw, and E. Fineberg, The Postmenopausal Estrogen/Progestin Interventions Study: primary outcomes in adherent women. Maturitas 1997; 27: 261–274. M. A. Espeland, M. L. Stefanick, D. KritzSilverstein, S. E. Fineberg, M. A. Waclawiw, M. K. James, and G. A. Greendale, Effect of postmenopausal hormone therapy on body weight and waist and hip girths. J. Clin. Endocrinol. Metabol. 1997; 82: 1549–1556. M. A. Espeland, S. M. Marcovina, V. Miller, P. D. Wood, C. Wasilauskas, R. Sherwin, H. Schrott, T. L. Bush, Effect of postmenopausal hormone therapy on lipoprotein(a) concentrations. Circulation 1998; 97: 979–986.
M. A. Espeland, P. E. Hogan, S. E. Fineberg, G. Howard, H. Schrott, M. A. Waclawiw, and T. L. Bush, Effect of postmenopausal hormone therapy on glucose and insulin concentrations. Diabetes Care 1998; 21: 1589–1595. G. A. Greendale, B. Melton, P. Hogan, V. Barnabei, S. Shumaker, S. Johnson, and E. Barrett-Connor, Effects of estrogen or estrogen/ progestin regimens on physical, cognitive, and affective symptoms: results from The Postmenopausal Estrogen/Progestin Interventions Trial. Obstet. Gynecol. 1998; 92: 982–988. G. A. Greendale, B. A. Melton, A. Sie, H. R. Singh, L. K. Olson, O. Gatewood, L. W. Bassett, C. Wasilauskas, T. Bush, and E. Barrett-Connor, The effects of estrogen and estrogen/progestin treatments on mammographic parenchymal density: Postmenopausal Estrogen/Progestin Interventions (PEPI) Investigators. Annals Ann. Intern. Med. 1999; 130: 262–269. R. D. Langer, J. J. Pierce, K. A. O’Hanlan, S. R. Johnson, M. A. Espeland, J. F. Trabal, V. M. Barnabei, M. J. Merino, R. E. Scully, Transvaginal ultrasonography compared with endometrial biopsy for the detection of endometrial disease. N. Engl. J. Med. 1997; 337: 1792–1798. C. Legault, M. A. Espeland, C. H. Wasilauskas, T. L. Bush, J. Trabal, H. L. Judd, S. R. Johnson, G. A. Greendale, Agreement in assessing endometrial pathology: The Postmenopausal Estrogen/Progestin Interventions Trial. J. Women’s Health 1998; 7: 435–441. H. G. Schrott, C. Legault, M. L. Stefanick, V. T. Miller, Effect of hormone replacement therapy on the validity of the Friedewald equation in postmenopausal women: The Postmenopausal Estrogen/Progestin Interventions Trial (PEPI). J. Clin. Epidemiol. 1999; 52: 1187–1195.
CROSS-REFERENCES

Women's Health Initiative
Power Transformations

The assumptions in regression analysis include normally distributed errors of constant variance (see Scedasticity). Often these assumptions are more nearly satisfied not by the original response variable y, but by some transformation of y, z(y). For nonnegative responses, one frequently used transformation is log y. The original and transformed analyses can then be compared in a number of ways. Residuals can be plotted against fitted values, or assessed for normality by probability plots for various transformations (see Normal Scores). Another comparison is through analysis of the linear model using t or F tests – a correct transformation often yields a simple linear regression model, with no, or just a few, interaction or quadratic terms. A formal way of comparing transformations is to embed them in a parametric family and then to make inferences about the transformation parameter λ (see Model, Choice of). Transformations of three parts of the model are of differing complexity and importance. The most important is described in the following section.

Transformation of the Response: Box and Cox [5]. The logarithmic transformation is one special case of the normalized power transformation [5]

$$z(\lambda) = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda \dot{y}^{\lambda - 1}}, & \lambda \neq 0, \\[4pt] \dot{y} \log y, & \lambda = 0, \end{cases} \qquad (1)$$

where the geometric mean of the observations is written as $\dot{y} = \exp(\Sigma \log y_i / n)$. The regression model to be fitted is then

$$z(\lambda) = X\beta + \varepsilon. \qquad (2)$$
For fixed λ, the value of β is estimated by least squares, giving a residual sum of squares R(λ) for the z(λ). The maximum likelihood estimate $\hat{\lambda}$ minimizing R(λ) is found by numerical search (see Optimization and Nonlinear Equations), often over a grid of λ values. Exact minimization of R(λ) is not required, since simple rational values of λ are customary in the analysis of data: λ = 1 (no transformation), λ = 1/2 (the square root), λ = 0 (the logarithmic), and λ = −1 (the reciprocal) are widely used. An approximate minimum of R(λ) is, however, required to establish confidence intervals for, and to
test hypotheses about, λ. It is important that these comparisons of R(λ) use the full form in (1), including the geometric mean. Omission of this term leads to meaningless comparisons – for most data, log y is very much smaller than y, and so are the corresponding sums of squares, regardless of how well the regression models fit.

Transformation of Explanatory Variables: Box and Tidwell [6]. For a regression model with p terms, it sometimes makes sense to consider models in which one (or perhaps more) of the explanatory variables is transformed, when the model is

$$y = \sum_{j \neq k}^{p} \beta_j x_j + \beta_k x_k^{\lambda} + \varepsilon. \qquad (3)$$
Again the maximizing value of λ has to be found numerically, but the calculations are more straightforward than those for the Box–Cox transformation, since the scale of the observations is unaffected by the transformation. The residual sums of squares of y can be compared directly as λ varies.

Transformation of Both Sides of the Model: Carroll and Ruppert [7]. The Box and Cox transformation often yields both approximately normal errors and a simple linear model. But sometimes the two transformations do not happen together. An example is the data on mandible length from Royston and Altman [10] plotted in Goodness of Fit. The analyses in Diagnostics, Forward Search, and Residuals showed that the log transformation yielded normal errors, but increased the evidence for the inclusion of a quadratic term in the linear model (see Polynomial Regression). If there is a simple model for y, the simplicity of the linear model can be maintained by subjecting both sides of the model to the same transformation. The purpose of the transformation is then to obtain normal errors of constant variance. Let $E(Y) = \eta = x^{T}\beta$ be the simple linear model. The transformation model is then

$$\frac{y^{\lambda} - 1}{\lambda \dot{y}^{\lambda - 1}} = \frac{\eta^{\lambda} - 1}{\lambda \dot{y}^{\lambda - 1}} + \varepsilon, \quad \lambda \neq 0; \qquad \dot{y} \log y = \dot{y} \log \eta + \varepsilon, \quad \lambda = 0. \qquad (4)$$
The optimizing value of λ again minimizes the residual sum of squares of the transformed response. For given λ, estimating the parameters in general requires nonlinear estimation, although this is unaffected by
division by $\lambda \dot{y}^{\lambda - 1}$. An example for the data on the volume of trees analyzed in Residuals is given by Atkinson [2]. The procedures using constructed variables described in Residuals provide tests for the value of λ, which avoid the need for calculation of $\hat{\lambda}$ whilst also giving information on the effect of individual observations on tests and parameter estimates. Numerous examples for the Box–Cox and Box–Tidwell transformations are given by Atkinson [1]. Cook and Weisberg [8, Chapter 13], describe interactive graphical methods for selecting a transformation. Diagnostic material on the transformation of both sides of the model is given by Hinkley [9], Shih [11], and by Atkinson [2]. Atkinson and Shephard [4] extend the Box–Cox transformation to time series analysis and describe the related diagnostic procedures. Diagnostic procedures for the effect of groups of observations via the Forward Search use the Fan Plot, described in greater detail in [3, Chapter 4].
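As a minimal illustration of the grid search over λ described above, the following Python sketch computes the profile residual sum of squares R(λ) for the normalized transformation (1); the function name, the simulated data, and the λ grid are assumptions made for the example, not part of the original article.

```python
import numpy as np

def boxcox_profile_rss(y, X, lambdas):
    """Profile residual sum of squares R(lambda) for the normalized
    Box-Cox transformation z(lambda) of equation (1)."""
    gm = np.exp(np.mean(np.log(y)))            # geometric mean y-dot
    rss = []
    for lam in lambdas:
        if abs(lam) < 1e-12:
            z = gm * np.log(y)                 # lambda = 0 case
        else:
            z = (y ** lam - 1.0) / (lam * gm ** (lam - 1.0))
        beta, *_ = np.linalg.lstsq(X, z, rcond=None)
        resid = z - X @ beta
        rss.append(resid @ resid)
    return np.asarray(rss)

# Simulated example: the true model is linear on the log scale, so R(lambda)
# should be smallest near lambda = 0.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=100)
y = np.exp(0.5 + 0.3 * x + rng.normal(scale=0.2, size=100))
X = np.column_stack([np.ones_like(x), x])
lambdas = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
print(dict(zip(lambdas, boxcox_profile_rss(y, X, lambdas).round(1))))
```

Because the normalization divides by $\lambda \dot{y}^{\lambda - 1}$, the R(λ) values are directly comparable across λ, which is exactly why the geometric-mean term must not be omitted.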
References

[1] Atkinson, A.C. (1985). Plots, Transformations, and Regression. Oxford University Press, Oxford.
[2] Atkinson, A.C. (1994). Transforming both sides of a tree, American Statistician 48, 307–313.
[3] Atkinson, A.C. & Riani, M. (2000). Robust Diagnostic Regression Analysis. Springer-Verlag, New York.
[4] Atkinson, A.C. & Shephard, N. (1996). Deletion diagnostics for transformation of time series, Journal of Forecasting 5, 1–17.
[5] Box, G.E.P. & Cox, D.R. (1964). An analysis of transformations (with discussion), Journal of the Royal Statistical Society, Series B 26, 211–246.
[6] Box, G.E.P. & Tidwell, P.W. (1962). Transformations of the independent variables, Technometrics 4, 531–550.
[7] Carroll, R.J. & Ruppert, D. (1988). Transformation and Weighting in Regression. Chapman & Hall, London.
[8] Cook, R.D. & Weisberg, S. (1999). Applied Regression Including Computing and Graphics. Wiley, New York.
[9] Hinkley, D.V. (1985). Transformation diagnostics for linear models, Biometrika 72, 487–496.
[10] Royston, P.J. & Altman, D.G. (1994). Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling (with discussion), Applied Statistics 43, 429–467.
[11] Shih, J.-Q. (1993). Regression transformation diagnostics in transform-both-sides model, Statistics and Probability Letters 16, 411–420.
A.C. ATKINSON
PREDICTING RANDOM EFFECTS IN COMMUNITY INTERVENTION
EDWARD J. STANEK III
University of Massachusetts, Amherst, Massachusetts

1 INTRODUCTION

The main objective in a group randomized trial is a comparison of the expected response between treatments, where the expected response for a given treatment is defined as the average expected response over all groups in the population. Often, interest may exist in the expected response for a particular group included in the trial. As a random sample of groups is included in the trial, it is usually represented as a random effect. In this context, predicting the expected response for an individual group requires predicting a random effect. For example, in a study of the impact of teaching paradigms on substance use of high school students in New Haven, Connecticut, high schools were randomly assigned to an intervention or control condition. The main evaluation of the intervention was a comparison of student response between intervention and control averaged over all high schools. Interest also existed in student response at particular high schools. As high schools were randomly assigned to conditions, the difference in response for a particular high school from the population average response is a random effect.

2 BACKGROUND

Since the early work on analysis of group randomized trials (1–3), models for response have included both fixed effects and random effects. Such models are called mixed models. Fixed effects appear directly as parameters in the model, whereas group effects are included as random variables. The random effects have mean zero and a nonzero variance. An advantage of the mixed model is the simultaneous inclusion of population parameters for treatments while accounting for the random assignment of groups in the design. Historically, because the main focus of a trial is comparison of treatments over all groups, and groups are assigned at random to a treatment, the effect of an assigned group was not considered to be of interest. This perspective has resulted in many authors (4–8) limiting discussion to estimation of fixed effects. Support for this position stems from the fact that a group cannot be guaranteed to be included in the study. This fact, plus the result that the average of the group effects is zero, has been sufficient for many to limit discussion of group effects to estimating the group variance, not particular group effects. Nevertheless, when conducting a group randomized trial, it is natural to want to estimate response for a particular group. As a result of the limitations of random effects in a mixed model, some authors (9, 10) suggest that such estimates can be made, but they should be based on a different model. The model is conditional on the group assignment, and represents groups as fixed effects. With such a representation, response for individual groups can be directly estimated. However, the fixed effect model would not be suitable for estimating treatment effects, because the evaluation would not be based on the random assignment of groups to treatments. The apparent necessity of using different models to answer two questions in a group randomized trial has prompted much study (11). A basic question is whether a mixed model can be used to predict the mean of a group that has been randomly assigned to a treatment. If such prediction is possible, the mixed model would provide a unified framework for addressing a variety of questions in a group randomized trial, including prediction of combinations of fixed and random effects. A number of workers (12–15) argue that such prediction is possible. The predictor is called the best linear unbiased predictor (BLUP) (16). Moreover, as the BLUP minimizes the expected mean squared error (EMSE), it is optimal.
In light of these results, it may appear that methods for prediction of random effects are on firm ground. Some problems exist, however. The BLUP of a realized group is closer to the overall sample treatment mean than is the sample group mean, a feature called shrinkage (a small numerical illustration of this shrinkage follows). This shrinkage results in a smaller average mean squared error. The reduction in EMSE is often attributed to ‘‘borrowing strength’’ from the other sample observations. However, the best linear unbiased predictor of a realized group is biased, whereas the sample group mean is unbiased. The apparent contradiction in terms is a result of two different definitions of bias. The BLUP is unbiased in the sense that zero average bias exists over all possible samples (i.e., unconditionally). The sample group mean is unbiased conditional on the realized group. A similar paradox may occur when considering the EMSE, where the mean squared error (MSE) for the best linear unbiased predictor may be larger than the MSE of the group mean for a realized group. This result can occur because, for the BLUP, the EMSE is best in the sense that the average MSE is minimum over all possible samples (i.e., unconditionally), whereas for the sample group mean, the MSE is evaluated for a realized group (17). These differences lead to settings, such as when the distribution of group means is bimodal, where the BLUP seems inappropriate for certain groups (18). For such reasons, predictors of random effects are somewhat controversial. Such controversies are not resolved here. Instead, an attempt is made to clearly define the problem, and then to describe some competing models and solutions. Statistical research in this area is active, and new insights and controversies may still emerge.
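The shrinkage form of the BLUP can be illustrated with a small sketch. The code below assumes a balanced one-way setting with known variance components (in practice they would be estimated, e.g., by REML) and ignores the treatment-arm structure, shrinking each realized group mean toward the overall mean; all numbers are invented for illustration.

```python
import numpy as np

def blup_group_means(y, groups, sigma2_b, sigma2_e):
    """Shrinkage (BLUP-type) predictions of realized group means for the
    one-way mixed model y_ij = mu + b_i + e_ij with known variance
    components sigma2_b (between groups) and sigma2_e (within groups)."""
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    grand_mean = y.mean()
    preds = {}
    for g in np.unique(groups):
        y_g = y[groups == g]
        m = y_g.size
        k = sigma2_b / (sigma2_b + sigma2_e / m)    # shrinkage factor
        preds[int(g)] = grand_mean + k * (y_g.mean() - grand_mean)
    return preds

# Illustrative data: 4 groups of 5 observations each.
rng = np.random.default_rng(2)
group_effects = rng.normal(scale=2.0, size=4)
groups = np.repeat(np.arange(4), 5)
y = 10.0 + group_effects[groups] + rng.normal(scale=3.0, size=20)
print(blup_group_means(y, groups, sigma2_b=4.0, sigma2_e=9.0))
```

With m observations per group the shrinkage factor is σ²_b / (σ²_b + σ²_e/m), so the prediction moves toward the sample group mean as m grows and toward the overall mean as the within-group variance dominates, which is the behavior described above.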
3 METHODS
A simple framework is provided that may be helpful for understanding these issues in the context of a group randomized trial. A study population is first defined, along with population/group parameters, which provides a finite population context for the group randomized trial. Next, random variables that
develop from random assignment of groups, and sampling subjects within a group, are explicitly represented. This context provides a framework for discussion of various mixed models and assumptions. Then, an outline of predictors of random effects is developed. This development is limited to the simplest setting. Finally, a broader discussion of other issues concludes the article.

3.1 Modeling Response for a Subject

Suppose a group randomized trial is to be conducted to evaluate the impact of a substance abuse prevention program in high schools. It is assumed that, for each student, a measure of the perception of peer substance abuse can be obtained as a continuous response from a set of items of a questionnaire administered to the student. The kth measure of response (possibly repeated) of student t in high school s is denoted by

$$Y_{stk} = y_{st} + W_{stk}, \qquad (1)$$
indexing students by t = 1, . . ., M in each of N high schools, indexed by s = 1, . . ., N. Measurement error (corresponding to test, retest variability) is represented by the random variable W stk , and distinguishes yst (a fixed constant representing the expected response student t) from Y stk . The subscript k indicates an administration of the questionnaire, where potentially k > 1. The average expected response over students in school s M 1 yst , whereas the averis defined as µs = M t=1
age over all schools is defined as µ =
1 N
N
µs ;
s=1
µs will be referred to as the latent value of school s. Discussion is limited to the simplest setting where each school has the same number of students, an equal number of schools are assigned to each intervention, an equal number of students are selected from each assigned school, and a single measure of response is made on each selected student. These definitions provide the context for defining the impact of a substance abuse prevention program. To do so, imagine that each student could be measured with and without the intervention. If no substance abuse
program is implemented, let the expected response of a student be $y_{st}$; if an intervention program is in place, let the expected response be $y^*_{st}$. The difference, $y^*_{st} - y_{st} = \delta_{st}$, represents the effect of the intervention on student $t$ in high school $s$. The average of these effects over students in school $s$ is defined as $\delta_s = \frac{1}{M}\sum_{t=1}^{M}\delta_{st}$, whereas the average over all schools is defined as $\delta = \frac{1}{N}\sum_{s=1}^{N}\delta_s$. The parameter $\delta$ is the main parameter of interest in a group randomized trial. To emphasize the effect of the school and of the student in the school, $\beta_s = (\mu_s - \mu)$ is defined as the deviation of the latent value of school $s$ from the population mean and $\varepsilon_{st} = (y_{st} - \mu_s)$ as the deviation of the expected response of student $t$ (in school $s$) from the school's latent value. Using these definitions, the expected response of student $t$ in school $s$ is represented as

$$y_{st} = \mu + \beta_s + \varepsilon_{st} \quad (2)$$

This model is called a derived model (19). The effect of the intervention on a student can be expressed in a similar manner as $\delta_{st} = \delta + \delta^*_s + \delta^*_{st}$, where $\delta^*_s = \delta_s - \delta$ and $\delta^*_{st} = \delta_{st} - \delta_s$. Combining these terms, the expected response of student $t$ in school $s$ receiving the intervention is $y^*_{st} = \mu + \delta + \beta^*_s + \varepsilon^*_{st}$, where $\beta^*_s = \beta_s + \delta^*_s$ and $\varepsilon^*_{st} = \varepsilon_{st} + \delta^*_{st}$. When the effect of the intervention is equal for all students (i.e., $\delta_{st} = \delta$), the expected response for student $t$ (in school $s$) is represented by $y_{st} = \mu + x_s\delta + \beta_s + \varepsilon_{st}$, where $x_s$ is an indicator of the intervention, taking a value of 1 if the student receives the intervention, and 0 otherwise. It is assumed the effect of the intervention is equal for all students to simplify subsequent discussions. The latent value for school $s$ under a given condition is given by $\mu + x_s\delta + \beta_s$.
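Because Equation (2) is an identity built from finite-population averages, it can be verified directly on a toy population. The following sketch uses invented numbers (it is not data from the New Haven study) to compute $\mu_s$, $\mu$, $\beta_s$, and $\varepsilon_{st}$ and to confirm that $y_{st} = \mu + \beta_s + \varepsilon_{st}$ reproduces every expected response.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 4, 5                                 # invented: 4 schools, 5 students per school
y = rng.normal(50.0, 10.0, size=(N, M))     # y[s, t]: expected response of student t in school s

mu_s = y.mean(axis=1)                       # latent value of each school
mu = mu_s.mean()                            # population mean (equal-sized schools)
beta = mu_s - mu                            # school deviations from the population mean
eps = y - mu_s[:, None]                     # student deviations from the school latent value

# Equation (2): y_st = mu + beta_s + eps_st holds exactly, by construction
assert np.allclose(mu + beta[:, None] + eps, y)
print("school latent values:", np.round(mu_s, 2), " population mean:", round(mu, 2))
```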
3.2 Random Assignment of Treatment and Sampling

The first step in a group randomized trial is random assignment of groups to treatments. It is assumed that two treatments exist (intervention and control), with $n$ groups (i.e., schools) assigned to each treatment. A simple way to conduct the assignment is to randomly permute the list of schools, assigning the first $n$ schools to the control and the next $n$ schools to intervention. The school's position in the permutation is indexed by $i = 1, \ldots, N$, referring to control schools by $i = 1, \ldots, n$ and to the intervention schools by $i = n + 1, \ldots, 2n$. A sample of students in a school can be represented in a similar manner by randomly permuting the students in a school, and then including the first $j = 1, \ldots, m$ students in the sample. The expected response (over measurement error) of a student in a school is represented as $Y_{ij}$, using the indices for the positions in the permutations of students and schools. Note that $Y_{ij}$ is a random variable because the actual school and student corresponding to the positions will differ for different permutations. Once a permutation (say of schools) is selected, the school that occupies each position in the permutation is known. Given a selected permutation, the school is a "fixed" effect; models that condition on the schools in the sample portion of a permutation of schools will represent schools as fixed effects. Schools are represented as random effects in a model that does not condition on the permutation of schools. The resulting model is an unconditional model. The expected response for a school assigned to a position is represented explicitly as a random variable, and it accounts for the uncertainty in the assignment by a set of indicator random variables, $U_{is}$, $s = 1, \ldots, N$, where $U_{is}$ takes the value of one when school $s$ is assigned to position $i$, and the value 0 otherwise. Using these random variables, $\sum_{s=1}^{N} U_{is}\mu_s$ represents a random variable corresponding to the latent value of the school assigned to position $i$. Using $\beta_s = (\mu_s - \mu)$ and noting that for any permutation $\sum_{s=1}^{N} U_{is} = 1$, $\sum_{s=1}^{N} U_{is}\mu_s = \mu + B_i$, where $B_i = \sum_{s=1}^{N} U_{is}\beta_s$ represents the random effect of the school assigned to position $i$ in the permutation of schools. The random variables $U_{is}$ are used to represent permutations of schools. A similar set of indicator random variables $U_{jt(s)}$ that take
on a value of 1 when the $j$th position in a permutation of students in school $s$ is student $t$, and 0 otherwise, relate $y_{st}$ to $Y_{ij}$. For ease of exposition, refer to the school that will occupy position $i$ in the permutation of schools as primary sampling unit (PSU) $i$, and to the student that will occupy position $j$ in the permutation of students in a school as secondary sampling unit (SSU) $j$. PSUs and SSUs are indexed by positions ($i$ and $j$), whereas schools and students are indexed by labels ($s$ and $t$) in the finite population. As a consequence, the random variable corresponding to PSU $i$ and SSU $j$ is given by

$$Y_{ij} = \sum_{s=1}^{N}\sum_{t=1}^{M} U_{is} U_{jt(s)} y_{st}$$

Using the representation of expected response for a SSU in a PSU, $Y_{ij} = \mu + x_i\delta + B_i + E_{ij}$, noting that $\sum_{t=1}^{M} U_{jt(s)} = 1$, $\delta\sum_{s=1}^{N} U_{is} x_s = x_i\delta$ because the treatment assigned to a position depends only on the position, and $E_{ij} = \sum_{s=1}^{N}\sum_{t=1}^{M} U_{is} U_{jt(s)}\varepsilon_{st}$. Adding measurement error, the model is given by

$$Y_{ijk} = \mu + x_i\delta + B_i + E_{ij} + W^*_{ijk}$$

where $W^*_{ijk} = \sum_{s=1}^{N}\sum_{t=1}^{M} U_{is} U_{jt(s)} W_{stk}$. This model includes fixed effects (i.e., $\mu$ and $\delta$) and random effects (i.e., $B_i$ and $E_{ij}$) in addition to measurement error, and hence is called a mixed model.

3.3 Intuitive Predictors of the Latent Value of a Realized Group

Suppose that interest exists in predicting the latent value of a realized group (i.e., school), say the first selected group, $\mu + B_i = \sum_{s=1}^{N} U_{is}\mu_s$, where $i = 1$. Before randomly assigning schools, as which school will be first is not known, the expected value of the first PSU is a random variable represented by the sum of a fixed (i.e., $\mu$) and a random effect (i.e., $B_i$, where $i = 1$). Once the school corresponding to the first PSU has been randomly assigned, the random variables $U_{is}$ for $i = 1$ and $s = 1, \ldots, N$ will be realized. If school $s^*$ is assigned to the first position, then the realized values (i.e., $U_{1s} = u_{1s}$, $s = 1, \ldots, N$) are $u_{1s} = 1$ when $s = s^*$ and $u_{1s} = 0$ when $s \neq s^*$; the parameter for the realized random effect will correspond to $\beta_{s^*}$, the deviation of the latent value of school $s^*$ from the population mean. Methods are discussed for predicting the latent value of a realized PSU assigned to the control condition (i.e., $i \leq n$) in the simplest setting when no measurement error exists. The model for SSU $j$ in PSU $i$ is the simple random effects model,

$$Y_{ij} = \mu + B_i + E_{ij} \quad (3)$$

The latent value of PSU $i$ is represented by the random variable $\mu + B_i$. The parameter for the latent value of school $s$ is $\mu_s = \frac{1}{M}\sum_{t=1}^{M} y_{st}$. The goal is to predict the latent value for the school corresponding to PSU $i$, which is referred to as the latent value of the realized PSU. It is valuable to develop some intuitive ideas about the properties of predictors. For school $s$, the latent value can be represented as the sum of two random variables (i.e., $\mu_s = \frac{1}{M}\left(\sum_{j=1}^{m} Y_{sj} + \sum_{j=m+1}^{M} Y_{sj}\right)$), where $Y_{sj} = \sum_{t=1}^{M} U_{jt(s)} y_{st}$ represents response for SSU $j$ in school $s$. Let $\bar{Y}_{sI} = \frac{1}{m}\sum_{j=1}^{m} Y_{sj}$ and $\bar{Y}_{sII} = \frac{1}{M-m}\sum_{j=m+1}^{M} Y_{sj}$ represent random variables corresponding to the average responses of SSUs in the sample and remainder, respectively. Then, $\mu_s = f\bar{Y}_{sI} + (1 - f)\bar{Y}_{sII}$, where the fraction of students selected in a school is given by $f = m/M$. If school $s$ is a control school (i.e., one of the first $n$ PSUs), then the average response for students selected from the school, $\bar{Y}_{sI}$, will be realized after sampling and the only unknown quantity in the expression for $\mu_s$ will be $\bar{Y}_{sII}$, the average response of students not included in the sample. Framed
in this manner, the essential problem in predicting the latent value of a realized PSU is predicting the average response of the SSUs not included in the sample. The predictor of the latent value of a realized PSU will be close to the sample average for the PSU when the second-stage sampling fraction is large. For example, representing a predictor of $\bar{Y}_{sII}$ by $\hat{\bar{Y}}_{sII}$, and assuming that $f = 0.95$, the predictor of the latent value of school $s$ is $\hat{\mu}_s = 0.95\,\bar{Y}_{sI} + 0.05\,\hat{\bar{Y}}_{sII}$. Even poor predictors of $\bar{Y}_{sII}$ will only modestly affect the predictor of the latent value. This observation provides some guidance for assessing different predictors of the latent value of a realized random effect. As the second-stage sampling fraction is allowed to increase, a predictor should be closer and closer to the average response of the sample SSUs for the realized PSU.
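A few lines of code make the weighting argument concrete. In this sketch (hypothetical numbers; the deliberately poor value used for $\hat{\bar{Y}}_{sII}$ stands in for any predictor of the unsampled students' mean), the combined predictor $f\bar{Y}_{sI} + (1 - f)\hat{\bar{Y}}_{sII}$ is pulled toward the observed sample mean, and hence toward the latent value, as the sampling fraction $f = m/M$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)

M = 100                                   # students in one school (invented)
y_school = rng.normal(60.0, 8.0, size=M)  # expected responses for that school
latent = y_school.mean()                  # the school's latent value (the target)

for m in (10, 50, 95):                    # increasing second-stage sample sizes
    f = m / M
    sample_mean = y_school[:m].mean()     # average of the sampled students (first m of a permutation)
    poor_guess = 50.0                     # a deliberately poor predictor of the unsampled students' mean
    predictor = f * sample_mean + (1.0 - f) * poor_guess
    print(f"f = {f:.2f}: predictor = {predictor:6.2f}, sample mean = {sample_mean:6.2f}, latent = {latent:6.2f}")
```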
4 MODELS AND APPROACHES
A brief discussion is provided of four approaches that have been used to predict the latent value of a realized group, limited to the simplest setting given by Equation (3). The four approaches correspond to Henderson’s approach (12, 20), a Bayesian approach (6), a super-population model approach (21, 22), and a random permutation model approach (22). An influential paper by Robinson (13) and its discussion brought prediction of realized random effects into the statistical limelight. This paper identified different applications (such as selecting sires for dairy cow breeding, Kalman filtering in time series analysis, Kriging in geographic statistics) in which the same basic problem of prediction of realized random effects occurred, and directed attention to the best linear unbiased predictor (BLUP) as a common solution. Other authors have discussed issues in predicting random effects under a Bayesian setting (23); random effects in logistic models (24–26); and in applications to blood pressure (27), cholesterol, diet, and physical activity (17). Textbook presentations of predictors are given by References 6 and (28–30). A general review in the context of longitudinal data is given by Singer and Andrade (31).
4.1 Henderson's Mixed Model Equations

Henderson developed a set of mixed model equations to predict fixed and random effects (12). The sample response for the model in Equation (3) is organized by SSUs in PSUs, resulting in the model

$$Y = X\alpha + ZB + E \quad (4)$$

where $Y$ is an $nm \times 1$ response vector, $X = 1_{nm}$ is the design matrix for fixed effects, and $Z = I_n \otimes 1_m$ is the design matrix for random effects, where $1_{nm}$ denotes an $nm \times 1$ column vector of ones, $I_n$ is an $n \times n$ identity matrix, and $A_1 \otimes A_2$ represents the Kronecker product formed by multiplying each element in the matrix $A_1$ by $A_2$ (32). The parameter $\alpha = E(Y_{ij})$ is a fixed effect corresponding to the expected response, whereas the random effects are contained in $B = (B_1\ B_2\ \cdots\ B_n)'$. It is assumed that $E(B_i) = 0$, $\operatorname{var}(B) = G$, $E(E_{ij}) = 0$, and $\operatorname{var}(E) = R$, so that $\Sigma = \operatorname{var}(ZB + E) = ZGZ' + R$. Henderson proposed estimating $\alpha$ (or a linear function of the fixed effects) by a linear combination of the data, $a'Y$. Requiring the estimator to be unbiased and have minimum variance leads to the generalized least squares estimator $\hat{\alpha} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}Y$. Difficulties in inverting $\Sigma$ (in more complex settings) led Henderson to express the estimating equations as the solution to two simultaneous equations known as Henderson's mixed model equations (20):

$$X'R^{-1}X\hat{\alpha} + X'R^{-1}Z\hat{B} = X'R^{-1}Y$$
$$Z'R^{-1}X\hat{\alpha} + (Z'R^{-1}Z + G^{-1})\hat{B} = Z'R^{-1}Y$$

These equations are easier to solve than the generalized least squares equations because the inverses of $R$ and $G$ are often easier to compute than the inverse of $\Sigma$. The mixed model equations were motivated by computational needs and developed from a matrix identity (33). The vector $\hat{B}$ was a byproduct of this identity and had no clear interpretation. If $\hat{B}$ could be interpreted as the predictor of a realized random effect, then the mixed model equations would be more than a computational device. Henderson (34) provided the motivation for such an interpretation by
developing a linear unbiased predictor of $\alpha + B_i$ that had minimum prediction squared error in the context of the joint distribution of $Y$ and $B$. Using

$$E\begin{pmatrix} Y \\ B \end{pmatrix} = \begin{pmatrix} X\alpha \\ 0_n \end{pmatrix} \quad\text{and}\quad \operatorname{var}\begin{pmatrix} Y \\ B \end{pmatrix} = \begin{pmatrix} \Sigma & ZG \\ GZ' & G \end{pmatrix} \quad (5)$$

Henderson showed that the best linear unbiased predictor of $\alpha + B_i$ is $\hat{\alpha} + \hat{B}_i$, where $\hat{B}_i$ is the $i$th element of $\hat{B} = GZ'\Sigma^{-1}(Y - X\hat{\alpha})$. Using a variation of Henderson's matrix identity, one may show that it is identical to the predictor obtained by solving the second of Henderson's mixed model equations for $\hat{B}$. With the additional assumptions that $G = \sigma^2 I_n$ and $R = \sigma_e^2 I_{nm}$, where $\operatorname{var}(B_i) = \sigma^2$ and $\operatorname{var}(E_{ij}) = \sigma_e^2$, the expression for $\hat{B}_i$ simplifies to $\hat{B}_i = k(\bar{Y}_i - \bar{Y})$, with $k = \frac{\sigma^2}{\sigma^2 + \sigma_e^2/m}$, $\bar{Y}_i = \frac{1}{m}\sum_{j=1}^{m} Y_{ij}$, and $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n}\bar{Y}_i$. The coefficient $k$ is always less than one, and "shrinks" the size of the deviation of the sample mean for the $i$th PSU from the sample PSU average. The predictor of $\alpha + B_i$ is given by

$$\hat{\alpha} + \hat{B}_i = \bar{Y} + k(\bar{Y}_i - \bar{Y})$$

Henderson's mixed model equations develop from specifying the joint distribution of $Y$ and $B$. Only first and second moment assumptions are needed to develop the predictors. As discussed by Searle et al. (6), Henderson's starting point was not the sample likelihood. However, if normality assumptions are added to Equation (5), then the conditional mean $E(\alpha + B_i \mid \bar{Y}_i = \bar{y}_i) = \alpha + k(\bar{y}_i - \alpha)$. Replacing $\alpha$ by the sample average, $\bar{y}$, the predictor of $\alpha + B_i$ conditional on $Y$ is $\bar{y} + k(\bar{y}_i - \bar{y})$.

4.2 Bayesian Estimation

The same predictor can be obtained using a hierarchical model with Bayesian estimation (6). Beginning with Equation (4), the terms in the model are classified as observable (including $Y$, $X$, and $Z$) and unobservable (including $\alpha$, $B$, and $E$). The unobservable terms are considered as random variables, and hence use different notation to distinguish $\alpha$ from the
corresponding random variable $A$. The model is given by $Y = XA + ZB + E$, and a hierarchical interpretation develops from considering the model in stages (28, 29). At the first stage, assume that the clusters corresponding to the selected PSUs are known, so that $A = \alpha_0$ and $B = \beta_0$. At this stage, the random variables $E$ develop from selection of the SSUs. At the second stage, assume $A$ and $B$ are random variables, and have some joint distribution. The simplest case (which is considered here) assumes that the unobservable terms are independent and that $A \sim N(\alpha, \tau^2)$, $B \sim N(0_n, G)$, and $E \sim N(0_{nm}, R)$, where $G = \sigma^2 I_n$ and $R = \sigma_e^2 I_{nm}$. Finally, the prior distributions are specified for $\alpha$, $\tau^2$, $\sigma^2$, and $\sigma_e^2$. $Y$ is conditioned upon in the joint distribution of the random variables to obtain the posterior distribution. The expected value of the posterior distribution is commonly used to estimate parameters. The problem is simplified considerably by assuming that $\sigma^2$ and $\sigma_e^2$ are constant. The lack of prior knowledge of the distribution of $A$ is represented by setting $\tau^2 = \infty$. With these assumptions, the Bayesian estimate of $B$ is the expected value of the posterior distribution (i.e., $\hat{B} = GZ'\Sigma^{-1}(Y - X\hat{\alpha})$). This predictor is identical to the predictor defined in Henderson's mixed model equation.

4.3 Super-Population Model Predictors

Predictors of the latent value of a realized group have also been developed in a finite population survey sampling context (21). To begin, suppose a finite population of $M$ students in each of $N$ schools is considered as the realization of a potentially very large group of students, called a super-population. Students or schools are not identified explicitly in the super-population, and are instead referred to as PSUs and SSUs. Nevertheless, a correspondence exists between a random variable, $Y_{ij}$, and the corresponding realized value for PSU $i$ and SSU $j$, $y_{ij}$, which would identify a student in a school. Predictors are developed in the context of a probability model specified for the super-population. This general strategy is referred to in survey sampling as model-based inference.
The random variable corresponding to the latent value for PSU $i$ (where $i \leq n$), given by $\bar{Y}_i = \frac{1}{M}\sum_{j=1}^{M} Y_{ij}$, can be divided into two parts, $\bar{Y}_{i,I}$ and $\bar{Y}_{i,II}$, such that $\bar{Y}_i = f\bar{Y}_{i,I} + (1 - f)\bar{Y}_{i,II}$, where $\bar{Y}_{i,I} = \frac{1}{m}\sum_{j=1}^{m} Y_{ij}$ corresponds to the average response of SSUs that will potentially be observed in the sample, $\bar{Y}_{i,II} = \frac{1}{M-m}\sum_{j=m+1}^{M} Y_{ij}$ is the average response of the remaining random variables, and $f = m/M$. Predicting the latent value for a realized school simplifies to predicting an average of random variables not realized in the sample. For schools that are selected in the sample, only response for the students not sampled need be predicted. For schools not included in the sample, a predictor is needed for all students' responses. Scott and Smith assume a nested probability model for the super-population, representing the variance between PSUs as $\sigma^2$, and the variance between SSUs as $\sigma_e^2$. They derive a predictor for the average response of a PSU that is a linear function of the sample, unbiased, and has minimum mean squared error. For selected PSUs, the predictor simplifies to a weighted sum of the sample average response and the average predicted response for SSUs not included in the sample, $\frac{m}{M}\bar{y}_i + \frac{M-m}{M}\hat{\bar{Y}}_{i,II}$, where $\hat{\bar{Y}}_{i,II} = \bar{y} + k(\bar{y}_i - \bar{y})$ and $k = \frac{\sigma^2}{\sigma^2 + \sigma_e^2/m}$. This same result was derived by Scott and Smith under a Bayesian framework. For a sample school, the predictor of the average response for students not included in the sample is identical to Henderson's predictor and the predictor resulting from Bayesian estimation. For PSUs not included in the sample, Scott and Smith's predictor reduces to the simple sample mean $\bar{y}$. A substantial conceptual difference exists between Scott and Smith's predictors and the mixed model predictors. The difference is a result of the direct weighting of the predictors by the proportion of SSUs that need to be predicted for a realized PSU. If this proportion is small, the resulting predictor will be close to the PSU sample mean. On the other hand, if only a small portion of the SSUs are observed in a PSU, Scott and Smith's predictor will be very close to the mixed model predictors.
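The relationship between the two predictors can be sketched numerically. The simulation below (invented data, with variance components treated as known; it is not code from the cited papers) computes the mixed-model predictor $\bar{Y} + k(\bar{Y}_i - \bar{Y})$ and the Scott and Smith predictor for each sampled PSU under a small and a large second-stage sampling fraction.

```python
import numpy as np

rng = np.random.default_rng(7)

sigma2, sigma2_e = 1.0, 25.0                 # assumed known variance components
n, M = 8, 40                                 # sampled PSUs; SSUs per PSU in the finite population

# Invented finite population of responses for the n sampled PSUs
mu_i = rng.normal(0.0, np.sqrt(sigma2), size=n)
pop = mu_i[:, None] + rng.normal(0.0, np.sqrt(sigma2_e), size=(n, M))

for m in (4, 36):                            # small and large second-stage samples
    sample = pop[:, :m]                      # the first m SSUs of each PSU (one permutation draw)
    ybar_i = sample.mean(axis=1)
    ybar = ybar_i.mean()
    k = sigma2 / (sigma2 + sigma2_e / m)
    f = m / M

    mixed = ybar + k * (ybar_i - ybar)                            # Henderson/Bayes predictor
    scott_smith = f * ybar_i + (1.0 - f) * (ybar + k * (ybar_i - ybar))

    print(f"m = {m:2d} (f = {f:.2f}): "
          f"mean |Scott-Smith - mixed| = {np.abs(scott_smith - mixed).mean():.3f}, "
          f"mean |Scott-Smith - sample mean| = {np.abs(scott_smith - ybar_i).mean():.3f}")
```

With a small sampling fraction the two predictors nearly coincide; with a large fraction the Scott and Smith predictor sits close to the realized PSU sample mean, as described above.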
5 RANDOM PERMUTATION MODEL PREDICTORS

An approach closely related to Scott and Smith's super-population model approach can be developed from a probability model that develops from sampling a finite population. Such an approach is based on the two-stage sampling design and is design-based (35). As selection of a two-stage sample can be represented by randomly selecting a two-stage permutation of a population, models under such an approach are referred to as random permutation models. This approach has the advantage of defining random variables directly from sampling a finite population. Predictors are developed that have minimum expected mean square error under repeated sampling in a similar manner as those developed by Scott and Smith. In a situation comparable with that previously described for Scott and Smith, predictors of the realized latent value are nearly identical to those derived by Scott and Smith. The only difference is the use of a slightly different shrinkage constant. The predictor is given by $\frac{m}{M}\bar{y}_i + \frac{M-m}{M}\hat{\bar{Y}}^*_{i,II}$, where $\hat{\bar{Y}}^*_{i,II} = \bar{y} + k^*(\bar{y}_i - \bar{y})$, $k^* = \frac{\sigma^{*2}}{\sigma^{*2} + \sigma_e^2/m}$, and $\sigma^{*2} = \sigma^2 - \frac{\sigma_e^2}{M}$.
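The finite-population correction in $k^*$ is usually minor unless the groups are small. A short check with invented variance components (treated as known) shows how close $k^*$ is to the super-population constant $k$:

```python
sigma2, sigma2_e = 4.0, 9.0                     # invented variance components
M, m = 200, 20                                  # group size and second-stage sample size

k = sigma2 / (sigma2 + sigma2_e / m)
sigma2_star = sigma2 - sigma2_e / M             # finite-population adjustment
k_star = sigma2_star / (sigma2_star + sigma2_e / m)

print(round(k, 4), round(k_star, 4))            # the difference shrinks as M grows
```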
6 PRACTICE AND EXTENSIONS
All of the predictors of the latent value of a realized group include shrinkage constants in the expressions for the predictor. An immediate practical problem in evaluating the predictors is estimating this constant. In a balanced setting, simple method of moment estimates of variance parameters can be substituted for variance parameters in the shrinkage constant. Maximum likelihood or restricted maximum likelihood estimates for the variance are also commonly substituted for variance parameters in the prediction equation. In the context of a Bayesian approach, the resulting estimates are called empirical Bayes estimates. Replacing the variance parameters by estimates of the variance will inflate the variance of the predictor using any of the approaches. Several methods have been developed that account for the larger variance (36–40).
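In the balanced case, a common way to estimate the shrinkage constant is to plug in one-way ANOVA (method of moments) estimates of the variance components, giving an empirical version of the predictor. The sketch below uses simulated balanced data and is only an illustration of this step, not code from the cited references.

```python
import numpy as np

rng = np.random.default_rng(11)

n, m = 12, 8                                               # sampled groups, subjects per group
b = rng.normal(0.0, 2.0, size=n)                           # simulated group effects (sd 2)
y = b[:, None] + rng.normal(0.0, 3.0, size=(n, m))         # simulated responses (within-group sd 3)

ybar_i = y.mean(axis=1)
ybar = ybar_i.mean()

# One-way ANOVA mean squares
ms_within = ((y - ybar_i[:, None]) ** 2).sum() / (n * (m - 1))
ms_between = m * ((ybar_i - ybar) ** 2).sum() / (n - 1)

sigma2_e_hat = ms_within
sigma2_hat = max((ms_between - ms_within) / m, 0.0)        # truncated at zero if negative

k_hat = sigma2_hat / (sigma2_hat + sigma2_e_hat / m)
predictors = ybar + k_hat * (ybar_i - ybar)                # empirical predictors of the latent values
print("estimated shrinkage constant:", round(k_hat, 3))
print("first three predictors:", np.round(predictors[:3], 2))
```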
In practice, groups in the population are rarely of the same size, or have identical intragroup variances. The first three approaches can be readily adapted to account for such imbalance when predicting random effects. The principal difference in the predictor is replacement of $\sigma_e^2$ by $\sigma_{ie}^2$, the SSU variance within PSU $i$, when evaluating the shrinkage constant. The simplicity with which the methods can account for such complications is an appeal of the approaches. Predictors of realized random effects can also be developed using a random permutation model that represents the two-stage sampling. When SSU sampling fractions are equal, the predictors have a similar form, with the shrinkage constant constructed from variance component estimates similar to components used in two-stage cluster sampling variances (41). Different strategies are required when second-stage sampling fractions are unequal (42). In many settings, a simple response error model will apply for a student. Such response error can be included in a mixed model such as Equation (4). When multiple measures are made on some students in the sample, the response error can be separated from the SSU variance component. Bayesian methods can be generalized to account for such additional variability by adding another level to the model hierarchy. Super-population models (43) and random permutation models (35) can also be extended to account for response error. Practical applications often involve much more complicated populations and research questions. Additional hierarchy may be present in the population (i.e., school districts). Variables may be available for control corresponding to districts, schools, and students. Measures may be made over time on the same sample of students or on different samples. Response variables of primary interest may be continuous, categorical, or ordinal. Some general prediction strategies have been proposed and implemented (29, 44, 45) using mixed models and generalized mixed models (30), often following a Bayesian paradigm. Naturally, the hypotheses and approaches in such settings are more complex. Active research is occurring in these
areas, which should lead to clearer guidance in the future.

7 DISCUSSION AND CONCLUSIONS

The latent value of a group is a natural parameter of interest in a group randomized trial. Although such a parameter may be readily understood, development of an inferential framework for predicting such a parameter is not easy. Many workers struggled with ideas underlying interpretation of random effects in the mid-twentieth century. Predictors have emerged largely based on computing strategies and Bayesian models in the past 20 years. Such strategies have the appeal of providing answers to questions that have long puzzled statisticians. Computing software (based on mixed model and Bayesian approaches) is widely available and flexible, allowing multilevel models to be fit with covariates at different levels. Although flexible software is not yet available for super-population model or random permutation model approaches, some evidence exists that when sampling fractions are small (<0.5), predictors and their MSE are very similar (35). Whether the different approaches predict the latent value of a realized group is a basic question that can still be asked. All of the predictors have the property of shrinking the realized group mean toward the overall sample mean. Although the predictors are unbiased and have minimum expected MSE, these properties hold over all possible samples, not conditionally on a realized sample. This use of the term "unbiased" differs from the popular understanding. For example, for a realized group, an unbiased estimate of the group mean is the sample group mean, whereas the BLUP is a biased estimate of the realized group mean. The rationale for preferring the biased estimate is that, in an average sense (over all possible random effects), the mean squared error is smaller. As this property refers to an average over all possible random effects, it does not imply smaller MSE for the realized group (18). In an effort to mitigate this effect, Raudenbush and Bryk (28) suggest including covariates that model realized group parameters to reduce the potential
biasing effect. Alternative strategies, such as conditional modeling frameworks, have been proposed (46–48), but increase in complexity with the complexity of the problem. Although increasing popularity of models that result in BLUP for realized latent groups exists, the basic questions that plagued researchers about interpretation of the predictors in the late twentieth century remain for the future.

REFERENCES

1. H. Scheffé, A 'mixed model' for the analysis of variance. Ann. Math. Stat. 1956; 27: 23–36.
2. O. Kempthorne, Fixed and mixed models in the analysis of variance. Biometrics 1975; 31: 437–486.
3. M. B. Wilk and O. Kempthorne, Fixed, mixed, and random models. Amer. Stat. Assoc. J. 1955: 1144–1167.
4. H. Scheffé, Analysis of Variance. New York: Wiley, 1959.
5. D. M. Murray, Design and Analysis of Group-randomized Trials. New York: Oxford University Press, 1998.
6. S. R. Searle, G. Casella, and C. E. McCulloch, Prediction of random variables. In: Variance Components. New York: Wiley, 1992.
7. R. E. Kirk, Experimental Design: Procedures for the Behavioral Sciences. 3rd ed. New York: Brooks/Cole Publishing Company, 1995.
8. A. Donner and N. Klar, Design and Analysis of Cluster Randomized Trials in Health Research. London: Arnold, 2000.
9. C. Eisenhart, The assumptions underlying the analysis of variance. Biometrics 1947; 3: 1–21.
10. S. Searle, Linear Models. New York: Wiley, 1971.
11. M. L. Samuels, G. Casella, and G. P. McCabe, Interpreting blocks and random factors. J. Amer. Stat. Assoc. 1991; 86: 798–821.
12. C. R. Henderson, Applications of Linear Models in Animal Breeding. Guelph, Canada: University of Guelph, 1984.
13. G. K. Robinson, That BLUP is a good thing: the estimation of random effects. Stat. Sci. 1991; 6: 15–51.
14. D. A. Harville, Extension of the Gauss-Markov theorem to include the estimation of random effects. Ann. Stat. 1976; 4: 384–395.
15. R. C. Littell, G. A. Milliken, W. W. Stroup, and R. D. Wolfinger, SAS System for Mixed Models. Cary, NC: SAS Institute, 1996.
16. A. S. Goldberger, Best linear unbiased prediction in the generalized linear regression model. Amer. Stat. Assoc. J. 1962; 57: 369–375.
17. E. J. Stanek III, A. Well, and I. Ockene, Why not routinely use best linear unbiased predictors (BLUPs) as estimates of cholesterol, per cent fat from Kcal and physical activity? Stat. Med. 1999; 18: 2943–2959.
18. G. Verbeke and G. Molenberghs, Linear Mixed Models for Longitudinal Data. New York: Springer, 2000.
19. K. Hinkelmann and O. Kempthorne, Design and Analysis of Experiments. Volume 1: Introduction to Experimental Design. New York: Wiley, 1994.
20. C. R. Henderson, O. Kempthorne, S. R. Searle, and C. M. von Krosigk, The estimation of environmental and genetic trends from records subject to culling. Biometrics 1959: 192–218.
21. A. Scott and T. M. F. Smith, Estimation in multi-stage surveys. J. Amer. Stat. Assoc. 1969; 64: 830–840.
22. E. J. Stanek III and J. M. Singer, Estimating cluster means in finite population two stage clustered designs. International Biometric Society Eastern North American Region, 2003.
23. D. R. Cox, The five faces of Bayesian statistics. Calcutta Stat. Assoc. Bull. 2000; 50: 199–200.
24. T. R. Ten Have and A. R. Localio, Empirical Bayes estimation of random effects parameters in mixed effects logistic regression models. Biometrics 1999; 55: 1022–1029.
25. T. R. Ten Have, J. R. Landis, and S. L. Weaver, Association models for periodontal disease progression: a comparison of methods for clustered binary data. Stat. Med. 1995; 14: 413–429.
26. P. J. Heagerty and S. L. Zeger, Marginalized multilevel models and likelihood inference. Stat. Sci. 2000; 15: 1–26.
27. D. Rabinowitz and S. Shea, Random effects analysis of children's blood pressure data. Stat. Sci. 1997; 12: 185–194.
28. S. R. Raudenbush and A. S. Bryk, Hierarchical Linear Models: Applications and Data Analysis Methods. 2nd ed. London: Sage Publications, 2002.
29. H. Goldstein, Multilevel Statistical Modeling. 3rd ed. Kendall's Library of Statistics 3. London: Arnold, 2003.
30. C. E. McCulloch and S. R. Searle, Generalized, Linear, and Mixed Models. New York: Wiley, 2001.
31. J. M. Singer and D. F. Andrade, Analysis of longitudinal data. In: P. K. Sen and C. R. Rao (eds.), Handbook of Statistics. New York: Elsevier Science, 2000.
32. F. A. Graybill, Matrices with Applications in Statistics. Belmont, CA: Wadsworth International, 1983.
33. H. V. Henderson and S. R. Searle, On deriving the inverse of a sum of matrices. SIAM Rev. 1981; 23: 53–60.
34. C. R. Henderson, Selection index and expected genetic advance. In: Statistical Genetics and Plant Breeding. Washington, DC: National Academy of Sciences - National Research Council, 1963.
35. E. J. Stanek III and J. M. Singer, Predicting random effects from finite population clustered samples with response error. J. Amer. Stat. Assoc. 2004; 99: 1119–1130.
36. R. N. Kackar and D. A. Harville, Approximations for standard errors of estimators of fixed and random effects in mixed linear models. J. Amer. Stat. Assoc. 1984; 79: 853–862.
37. R. E. Kass and D. Steffey, Approximate Bayesian inference in conditionally independent hierarchical models (parametric empirical Bayes models). J. Amer. Stat. Assoc. 1989; 84: 717–726.
38. D. A. Harville and D. R. Jeske, Mean squared error of estimation or prediction under general linear model. J. Amer. Stat. Assoc. 1992; 87: 724–731.
39. K. Das, J. Jiang, and J. N. K. Rao, Mean squared error of empirical predictor (Technical Report). Ottawa, Canada: Carleton University School of Mathematics and Statistics, 2001.
40. J. Wang and W. A. Fuller, The mean squared error of small area predictors constructed with estimated area variances. J. Amer. Stat. Assoc. 2003; 98: 716–723.
41. W. Cochran, Survey Sampling. New York: Wiley, 1977.
42. E. J. Stanek III and J. M. Singer, Predicting realized cluster parameters from two stage samples of an unequal size clustered population. Amherst, MA: Department of Biostat/Epid., University of Massachusetts, 2003.
43. H. Bolfarine and S. Zacks, Prediction Theory for Finite Populations. New York: Springer-Verlag, 1992.
44. X-H. Zhou, A. J. Perkins, and S. L. Hui, Comparisons of software packages for generalized linear multilevel models. Amer. Stat. 1999; 53: 282–290.
45. J. de Leeuw and I. Kreft, Software for multilevel analysis. In: A. Leyland and H. Goldstein (eds.), Multilevel Modelling of Health Statistics. Chichester: Wiley, 2001.
46. A. P. Dawid, Conditional independence in statistical theory. J. Royal Stat. Soc. 1979; 41: 1–31.
47. E. J. Stanek III and J. R. O'Hearn, Estimating realized random effects. Commun. Stat. Theory Meth. 1998; 27: 1021–1048.
48. G. Verbeke, B. Spiessens, and E. Lesaffre, Conditional linear mixed models. Amer. Stat. 2001; 55: 25–34.
PREFERENCE TRIALS
MARION K. CAMPBELL, University of Aberdeen, Aberdeen, United Kingdom
DAVID J. TORGERSON, University of York, York, United Kingdom

When evaluating the effectiveness of interventions in health-care, it is now widely accepted that the randomized controlled trial is the gold standard design to adopt (1, 2). When randomization is adopted, this results in a number of benefits; for example, the groups generated should only differ by chance in baseline prognostic variables (i.e., bias is minimized), the potential for attribution is maximized, and the results can be analyzed using standard statistical testing. Acceptance of the standard randomized trial, however, requires that patients recruited to the trial be prepared to accept any of the interventions under study. Some participants may have explicit preferences for one of the treatments under evaluation, which may affect their willingness to participate in a standard randomized controlled trial, and they may refuse to be randomized (3). If the proportion of eligible participants who refuse randomization is significant, this jeopardizes the generalizability of the trial results (4). However, if the preferred treatment is only available within the context of a randomized trial (e.g., a novel treatment), many people may consent to take part hoping to get access to the new treatment. This consent hoping to be allocated to only one treatment can have significant consequences for the trial. If, for example, 50% of those being recruited to a trial of A versus B want intervention A, and this is only available within the trial, then they are likely to consent to participate in the trial; however, half of this 50% will be disappointed as they are going to be allocated to treatment B. A proposed solution to these problems has been the "patient preference trial," which allocates patients with strong preferences to their treatment of choice while randomizing those not expressing strong preferences in the usual manner (5). The effects of patient preferences and the advantages and disadvantages of the patient preference trial design will be discussed.

1 POTENTIAL EFFECTS OF PREFERENCE

In a standard randomized controlled trial, treatments are allocated to patients by means of a random process. By definition, patients cannot choose the treatment to which they get allocated. As such, if a patient has a strong preference to receive one particular treatment option, they have one of two choices:

• Decline randomization to the trial.
• Accept randomization to the trial in the hope that they will be allocated to their preferred treatment.

1.1 Effects of Declining Randomization

If a significant proportion of patients decline entry to the trial, this has a number of consequences for the trial conduct. First, it will lessen the generalizability of the results of the trial. For a trial to be truly generalizable, it must demonstrate that the individuals who consent to take part in the trial are representative of the wider population to whom the trial results are expected to apply. If, however, a significant proportion of patients decline on the basis of an explicit preference, this would indicate that those who consent to participate are markedly different from those who decline, compromising this assumption. A further consequence for the conduct of the trial if a significant number of patients decline to participate is that a longer recruitment window is likely to be required to achieve the target sample size. This has implications for the timeliness of the results because the publication of valid results requires that the target sample size is achieved. It also has cost consequences for the conduct of the trial as the trial team will have to be funded for a longer time period, leading to a more expensive trial for a potentially less generalizable result. The increased trial cost
also has knock-on effects for future research: the more money funders are required to channel to this trial, the less will be available for future studies.

1.2 Effects of Accepting Randomization When Patients Have a Preference

When patients accept randomization in the hope of being allocated to their preferred treatment, they can either be assigned to their treatment of choice or be assigned to the treatment they did not wish to receive. For patients who are not randomized to their treatment of choice, this can result in so-called "resentful demoralization" (6). Resentful demoralization describes a pattern of behavior that can result if people are not allocated to their treatment of choice. This can involve difficulties with patient compliance with treatment, patients having a lower threshold for reporting adverse events of treatment, and an overly negative rating of their treatment effect. Patient compliance with treatment has been noted to be particularly problematic with treatments that require motivation, such as compliance with an exercise routine or compliance with dietary interventions (5). Conversely, patients allocated to their treatment of choice may rate their treatment overly positively. It is worthy of note, however, that although resentful demoralization is a known concept, it remains more a theoretical concern because it is very difficult to accurately measure its effects and also to accurately attribute changes in patient compliance and outcome directly to initial preferences. The potential for patient preference to introduce systematic bias into the rating of treatment is particularly pertinent for subjective outcomes such as the patients' assessments of the acceptability of the treatment or their self-assessed rating of their quality of life. For example, Clement et al. (7) highlighted the effect of initial preference on patients' evaluations of different schedules of antenatal visits. In their trial, they compared traditional antenatal care (13 visits) with a new schedule of care (six to seven visits). The results showed that for women with no initial preference 7.4% of women allocated to traditional care were dissatisfied
with the frequency of visits compared with 31% allocated to new care. For those women who expressed an initial preference for traditional care, this difference in dissatisfaction rates was much more marked, with 3.5% randomized to traditional care being dissatisfied compared with 71.1% randomized to new style care. However, for women with an initial preference for new style care, the effect ran in the opposite direction: 48.9% randomized to traditional care were dissatisfied compared with 12.2% allocated to their preferred treatment of new care. These results highlight the significant impact preference can have on subjective trial outcomes.

2 THE PATIENT PREFERENCE DESIGN

In an attempt to ameliorate the effects of preference in the standard randomized controlled trial, Brewin and Bradley (5) proposed the use of an alternative design: the partially randomized patient preference trial. Consider a trial of treatment A versus treatment B. In a patient preference design, patients who have an explicit preference for treatment A are allocated directly to treatment A. Similarly, patients who have an explicit preference for treatment B are allocated directly to treatment B. Only those patients who do not have a strong preference are randomized in the standard manner. This results in four parallel groups: those directly allocated to treatment A (group AllocA), those randomized to treatment A (RandA), those randomized to treatment B (RandB), and those directly allocated to treatment B (AllocB). A schematic outlining the resulting trial groups is presented in Figure 1. All four groups are then followed up, and their outcomes are compared. The effect of the intervention is still derived from the randomized comparison (RandA versus RandB); however, the effect of motivational factors can then also be estimated (from AllocA versus RandA and from AllocB versus RandB).
Figure 1. The patient preference design: eligible patients expressing a strong preference are allocated directly to treatment A or B, while those with no strong preference are randomized to treatment A or B.
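The allocation logic of the design is simple enough to state in a few lines of code. The sketch below is a hypothetical illustration (the function and labels are invented, not part of any trial software): patients stating a strong preference go straight to that arm, and only the remainder are randomized.

```python
import random

random.seed(42)

def allocate(preference):
    """Patient preference design: honor a stated strong preference, randomize the rest."""
    if preference in ("A", "B"):
        return "Alloc" + preference          # directly allocated to the preferred treatment
    return "Rand" + random.choice("AB")      # no strong preference: standard randomization

# Hypothetical stream of stated preferences (None = no strong preference)
patients = ["A", None, "B", None, None, "A", None, "B", None, None]
print([allocate(p) for p in patients])       # yields the four groups AllocA, RandA, RandB, AllocB
```

The treatment effect is then estimated from the RandA versus RandB comparison, with AllocA versus RandA and AllocB versus RandB available as descriptive comparisons of motivational effects.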
An example of the use of the patient preference design is a trial by Bedi et al. (8), who compared the effectiveness of counseling with that of antidepressants for depression in primary care. These are very different treatments requiring different levels of active patient participation. The researchers were concerned that if a standard randomized controlled trial design was adopted, the patients who did not get their preferred treatment (if they had a strong preference) might underestimate its effectiveness. To address these concerns, a patient preference design was adopted, where patients were first offered randomization but were allocated to their treatment of choice if they refused randomization on the basis of preference. In this trial, 103 patients agreed to randomization, and a further 220 were allocated directly to receive their treatment of choice (140 opted for counseling compared with 80 opting for antidepressants). The study found that there was no evidence of a difference in outcome between the randomized groups at 8 weeks. The investigators also noted that expressing a preference for either treatment did not appear to confer any additional benefit on outcome.

3 ADVANTAGES AND DISADVANTAGES OF THE PATIENT PREFERENCE DESIGN

The patient preference design has a number of advantages and disadvantages. A key advantage is the comprehensive nature of the data collected and the additional inferences that can be drawn from explicit comparison of the results from the preference and randomized groups. Potential disadvantages have, however, been highlighted, especially the potential for selection bias across the trial groups, uncertainties about how the analysis of the data from the preference arms should
be integrated into the overall trial analysis, and the added cost of following up the nonrandomized patients. We will discuss these issues in turn.

3.1 Potential for Additional Insights Provided by the Data from the Preference Arms

One of the purported advantages of the patient preference design is that the data gained from those in the preference arms provide valuable information on the impact of motivational factors, which can then be used to inform the most appropriate implementation of the trial intervention. This was apparent in a trial conducted by Henshaw et al. (9) who compared the effectiveness of surgical (vacuum aspiration) versus medical termination (mifepristone) of early pregnancies. In this trial, 363 women agreed to be recruited to the trial; 95 (26%) expressed an explicit preference for vacuum aspiration, 73 (20%) expressed an explicit preference for medical termination, and 195 (54%) agreed to be allocated to either method randomly. Reasons for preferences varied. Primary reasons for preferring medical treatment included fear of anesthesia and perceptions that medical treatment was less invasive and more "natural." Conversely, primary reasons for preferring surgery included the relative speed of the technique, and fears about the physical and potential psychological sequelae of medical treatment. The results of the randomized comparison showed no evidence of a difference in acceptability between the two treatment approaches. There was also no evidence of a difference in acceptability of
the two techniques in those allocated to their treatment of choice. Further examination of the trial data showed, however, that women who had stated a preference for the surgical option tended to live much farther away from the hospital than those preferring the medical termination option. This primarily reflected a desire by women living farther from the hospital to take advantage of the fewer visits to hospital required in the surgical arm. This extra insight led the investigators to conclude that although there was no apparent superiority of effect with the surgical intervention, both treatment options should remain available. This additional insight would not have been possible to detect without the preference cohorts.

3.2 Potential for Selection Bias across the Trial Groups

As patients with strong preferences can choose their treatment in a patient preference design, it is possible that this could lead to selection bias across the different trial groups. It may be that preference could be determined by some underlying confounding factor. For example, in a trial of medical versus surgical management of a disease, those who have more severe disease may prefer surgery in the expectation that surgery is a more radical treatment and thus more suited to their needs. If the preference is a proxy for some other underlying measure of patient status, differential representation across preference and randomized groups may be problematic for future analysis. Such self-selection to trial groups was observed in a recent trial in the United Kingdom of surgical management versus continued optimized medical management for the treatment of gastroesophageal reflux disease (10). In this patient preference trial, patients expressing a strong desire for surgery were directly allocated to surgical management (laparoscopic fundoplication) whereas those with a strong desire to remain on medical management were directly allocated to medical management (with proton pump inhibitors). Those who expressed no strong preference were randomized in the standard manner. On completion of recruitment,
however, it was apparent that there was a differential distribution of disease severity across the trial groups. Across key measures of self-reported quality of life at baseline, those choosing surgery reported a lower quality of life score compared with those recruited to the randomized comparisons. In contrast, those choosing medical management reported better quality of life than those recruited to the randomized groups. As one would expect, there was no evidence of a difference in self-reported health across the randomized groups. This study showed that the introduction of preference arms can lead to a narrower spectrum of patients being recruited to the randomized element of a patient preference trial.

3.3 Analysis of the Data from the Preference Arms

Although the comparison of data from those directly allocated to their chosen treatment with the randomized groups within a patient preference design is intuitively appealing, the true implications of such an analysis are unclear. The primary reason for this is that data from the non-randomized groups are not afforded the theoretical protection of randomization and, as such, are subject to the range of biases inherent in all observational data—especially the influence of confounding variables (whether known or unknown) (11). It has, therefore, been recommended that the primary analysis remains the comparison of the randomized groups. It is also unclear what the interpretation of a trial should be if those allocated to their preferred treatment appear to do worse than those randomized to the treatment.

3.4 Required Increase in Sample Size

With the introduction of parallel preference arms, four groups of participants are routinely followed up in the patient preference design compared with two groups in a standard design. One of the key advantages of this additional data collection exercise has already been discussed: the extra data gained from those in the preference arms provide valuable information on the impact of motivational factors that could then be used to
inform the most appropriate implementation of the trial intervention. However, in the standard randomized design, these patients with a strong preference would have refused randomization and hence would not have required time and resource to follow them up. This raises an interesting dilemma: are the extra data collected worth the extra cost? If resources for the conduct of research are unlimited, this is not an issue as all extra information is of intrinsic value; however, resources are limited, and when funders direct extra resources toward the follow-up of preference patients, this means that there is less money available for the conduct of future studies. As such, this is an extra dimension that must be considered when adopting the preference design. An interesting illustration of this dilemma was presented in a randomized trial conducted by Cooper et al. (12). In this study of different treatments for the management of menorrhagia, women were randomized to be approached to take part in either (1) a standard randomized comparison or (2) a patient preference design. The results of this study showed that although the preference design included more women overall, there was no apparent impact on recruitment to the randomized arms under the different designs. In this instance, the investigators concluded that the money that had been spent on following up the preference groups might have been better spent promoting overall recruitment to the standard randomized design.

4 ALTERNATIVE DESIGNS
The patient preference design is the most widely used design to accommodate preferences, but a number of other designs have also been suggested. The two main alternatives are:

• The two-step randomized design
• The fully randomized preference design
4.1 Two-Step Randomized Design

The two-step randomized design aims to measure the effect of preference by explicitly including allocation to receive preference into the study design. Wennberg (13) proposed such a design in the early 1990s whereby potential trial participants are initially randomized either to receive their preferred treatment directly (Wennberg group A) or to be allocated to treatment using standard randomization (Wennberg group B). See Figure 2 for the design. Rücker (14) proposed a similar design to Wennberg, but included the possibility of a secondary randomization within the group initially randomized to receive their preference. In the Rücker design, participants are initially randomized into the same two groups as the Wennberg design (i.e., to receive their preferred treatment directly or to be randomized to a treatment group). However, in the Rücker design, those initially allocated to receive their preference directly (similar to Wennberg group A) but who do not have a strong preference are subsequently randomized to treatment (Figure 2). The main advantage of this type of two-stage design is that the effect of preference can be robustly measured (because it is planned for in the design); however, patients are still required to accept randomization, which may be problematic. Few studies have used this design in practice.

Figure 2. Two-step randomized designs: (a) Wennberg design; (b) Rücker design.

4.2 Fully Randomized Preference Design

Recently, an alternative trial design has been proposed to measure the effect of patient preferences: the fully randomized preference design (15). This is appropriate for situations in which a new treatment is only available within the trial and where participants who have a preference would normally consent to participate in the hope of being allocated to the new treatment. In this design, preferences are recorded before randomization, allowing estimation of the effect of the intervention both among participants without a preference and among those with a preference. For example, in a randomized trial of a physiotherapy intervention versus standard care for the management of back pain that used this approach (16), the following groups of participants resulted:

1. Indifferent participants allocated to physiotherapy
2. Indifferent participants allocated to standard care
3. Participants preferring physiotherapy allocated to receive physiotherapy
4. Participants preferring physiotherapy allocated to standard care

By using this approach, the investigators were able to demonstrate that the treatment was effective across the different preference groups, and the results appeared not to have been affected by patient preference. Interestingly, the study showed the treatment did not work any better among those who really wanted the treatment and were allocated to receive it.
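Recording preferences before randomization means the randomized contrast can be estimated separately within each preference stratum. A minimal sketch of that analysis step (simulated outcomes with invented effect sizes, labeled after the four groups listed above) is:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated outcomes for the four groups of the back pain example (invented effect sizes)
outcomes = {
    ("indifferent", "physio"):     rng.normal(2.0, 1.0, 60),
    ("indifferent", "standard"):   rng.normal(1.0, 1.0, 60),
    ("prefer_physio", "physio"):   rng.normal(2.0, 1.0, 40),
    ("prefer_physio", "standard"): rng.normal(1.0, 1.0, 40),
}

for stratum in ("indifferent", "prefer_physio"):
    effect = outcomes[(stratum, "physio")].mean() - outcomes[(stratum, "standard")].mean()
    print(f"{stratum}: estimated physiotherapy effect = {effect:.2f}")
```

Comparing the stratum-specific estimates (formally, a preference-by-treatment interaction) is what allows a trial of this kind to report whether the treatment worked any better among those who wanted it.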
Another demonstration of the effect of eliciting participant preferences has occurred in a randomized trial of two treatments for neck pain (17). Participants were presented with two treatment options: the standard treatment, which consisted of between 5 and 10 physiotherapy treatments, or a brief intervention, which was conducted usually as a one-off treatment session with the aim of teaching the patient to "self-treat" using cognitive behavioral principles. As in the back pain trial, the patients were asked for their preferences before randomization. This resulted in six different groups of participants:

1. Indifferent participants randomized to standard care
2. Indifferent participants randomized to brief intervention
3. Participants preferring standard care and randomized to standard care
4. Participants preferring standard care and randomized to brief intervention
5. Participants preferring brief intervention and randomized to brief intervention
6. Participants preferring brief intervention and randomized to standard care

Figure 3 shows the main trial results. The figure shows that among those participants who were indifferent (groups 1 and 2) the standard care resulted in better outcomes relative to the brief intervention. (In this group, patient preferences have been eliminated so this is the "true" treatment effect unconfounded by preference.) The results among the subgroup of patients who had a preference for standard care (groups 3 and 4) were similar; however, those who were randomized to the brief intervention actually got slightly worse. The final group, those who preferred the brief intervention (groups 5 and 6), showed the most interesting result. For these patients, the direction of effect was actually reversed. Patients who wanted the brief intervention and were randomized to the brief intervention (group 5) did better than those who wanted the brief intervention but were randomized to standard care (group 6). This was despite our observing that standard care is the superior treatment in the absence of patient preference. This trial, therefore,
showed that the presence of preferences can influence treatment effects. In this instance, the proportion of participants who wanted the brief intervention was relatively small, so their results would not have strongly biased the trial’s overall findings. However, in clinical or health policy terms, the extra insights enhance the trial’s overall findings. Had participants’ preferences not been ascertained, we would have concluded that standard care was best and that all patients should be offered standard care. However, it is apparent that for those who express a strong desire for the brief intervention, this may indeed be the best treatment for them. Although the advantages of this design have been clearly demonstrated, the fully randomized preference design is not without its limitations. As with the standard randomized controlled trial design, it is highly likely that a proportion of patients who have strong preferences will refuse to be randomized under this design. As such, the results will only be able to inform the influence of weaker levels of preference. The interpretability of the results of this design would, however, be superior to those from a standard randomized controlled trial design as the influence of preference could at least be explored, even though all such analyses would be exploratory unless the trial had been specifically powered to address them. This design also relies on accurate measurement of preferences at baseline. To date, however, there has been little research available to guide how best to measure preference
in a valid, reliable manner. Further research in this field is required.

Figure 3. Preference results of the neck pain trial: change in NPQ at 12 months by preference group (indifferent, prefer the brief intervention, prefer usual care) and by randomized treatment (brief intervention vs. usual care).
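For a fully randomized preference design of this kind, the subgroup summary underlying Figure 3 and an exploratory test of a preference-by-treatment interaction can be produced with a simple model. The sketch below is illustrative only: it uses simulated data with hypothetical effect sizes and variable names, not the neck pain trial's actual data or analysis.

```python
# Illustrative sketch only: simulated data for a fully randomized preference design.
# Group labels, effect sizes, and the pain-score outcome are hypothetical.
# Negative values indicate improvement (a reduction in the pain score).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 600
preference = rng.choice(["indifferent", "prefers_standard", "prefers_brief"], size=n)
treatment = rng.choice(["standard", "brief"], size=n)   # randomized regardless of preference

# Hypothetical data-generating model: standard care is better on average, but patients
# who both prefer and receive the brief intervention improve unusually well.
mean_change = np.where(treatment == "standard", -2.0, -1.0)
mean_change = mean_change + np.where(
    (preference == "prefers_brief") & (treatment == "brief"), -1.5, 0.0
)
outcome = mean_change + rng.normal(0.0, 3.0, size=n)

df = pd.DataFrame({"preference": preference, "treatment": treatment, "outcome": outcome})

# Mean change by preference group and randomized arm (the layout summarized in Figure 3)
print(df.groupby(["preference", "treatment"])["outcome"].mean().round(2))

# Preference-by-treatment interaction; exploratory unless pre-specified and powered
fit = smf.ols("outcome ~ C(treatment) * C(preference)", data=df).fit()
print(fit.summary())
```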
5 DISCUSSION

Participant preference can lead to bias in trials. Ignoring the role of preference on outcome could lead us to conclude that something is effective when it is not, or that it is ineffective when it is. It is, therefore, important when designing a trial to consider whether preferences for particular treatments might be problematic for the trial under consideration. If patient preference could be an issue, it is important that it be accounted for; however, patients must be presented with full information on all treatments available within the trial to ensure that they make an informed choice about whether to accept randomization or to choose their treatment. The patient-preference design advocated by Brewin and Bradley has been used successfully in a number of fields and has been shown to yield additional insights that can aid the interpretation of trial results. However, there have been a number of criticisms of the patient-preference design, and investigators have been considering other ways of accommodating preferences appropriately. A number of alternative designs exist, although they have not been used as widely in the literature. One such alternative is the fully randomized preference design, which has been used in a number of recent trials (16–20). As the results of these and other trials become available, together with future research on how best to measure preferences, recommendations for the most appropriate design to use in specific situations will become clearer.

REFERENCES

1. S. J. Pocock, Clinical Trials: A Practical Approach. Chichester, UK: Wiley, 1983.
2. B. Sibbald and M. Roland, Understanding controlled trials: why are randomised controlled trials important? BMJ. 1998; 316: 201.
3. K. Fairhurst and C. Dowrick, Problems with recruitment in a randomised controlled trial of counselling in general practice: causes and implications. J Health Serv Res Policy. 1996; 1: 77–80.
4. C. Bradley, Designing medical and educational intervention studies. Diabetes Care. 1993; 16: 509–518.
5. C. R. Brewin and C. Bradley, Patient preferences and randomised clinical trials. BMJ. 1989; 299: 313–315.
6. T. D. Cook and D. T. Campbell, Quasi-experimentation: Design and Analysis Issues for Field Settings. Chicago: Rand McNally, 1979.
7. S. Clement, J. Sikorski, J. Wilson, and B. Candy, Merits of alternative strategies for incorporating patient preferences into clinical trials must be considered carefully [letter]. BMJ. 1998; 317: 78.
8. N. Bedi, C. Chilvers, R. Churchill, M. Dewey, C. Duggan, et al., Assessing effectiveness of treatment of depression in primary care: partially randomised preference trial. Br J Psychiatry. 2000; 177: 312–318.
9. R. C. Henshaw, S. A. Naji, I. T. Russell, and A. A. Templeton, Comparison of medical abortion with surgical vacuum aspiration: women's preferences and acceptability of treatment. BMJ. 1993; 307: 714–717.
10. A. Grant, S. Wileman, C. Ramsay, L. Bojke, D. Epstein, et al., The place of minimal access surgery amongst people with gastro-oesophageal reflux disease—a UK collaborative study. The Reflux Trial (ISRCTN15517081). Health Technol Assess (in press).
11. W. A. Silverman and D. G. Altman, Patient preferences and randomised trials. Lancet. 1996; 347: 171–174.
12. K. G. Cooper, A. M. Grant, and A. M. Garratt, The impact of using a partially randomised patient preference design when evaluating managements for heavy menstrual bleeding. BJOG. 1997; 104: 1367–1373.
13. J. E. Wennberg, M. J. Barry, F. J. Fowler, and A. Mulley, Outcomes research, PORTS, and healthcare reform. Ann NY Acad Sci. 1993; 703: 52–62.
14. G. Rücker, A two-stage trial design for testing treatment, self selection and treatment preference effects. Stat Med. 1989; 8: 477–485.
15. D. J. Torgerson, J. A. Klaber Moffett, and I. T. Russell, Patient preferences in randomised trials: threat or opportunity. J Health Serv Res Policy. 1996; 1: 194–197.
16. J. Klaber Moffett, D. Torgerson, S. Bell-Syer, D. Jackson, H. Llewlyn-Phillips, et al., Randomised controlled trial of exercise for low back pain: clinical outcomes, costs, and preferences. BMJ. 1999; 319: 279–283.
17. J. A. Klaber-Moffett, S. Richmond, D. Jackson, S. Coulton, S. Hahn, et al., Randomised trial of a brief physiotherapy intervention compared with usual physiotherapy for neck pain patients: outcomes, costs and patient preference. BMJ. 2005; 330: 75.
18. J. L. Carr, J. A. Klaber Moffett, E. Howarth, S. J. Richmond, D. J. Torgerson, et al., A randomised trial comparing a group exercise programme for back pain patients with individual physiotherapy in a severely deprived area. Disabil Rehabil. 2005; 27: 929–937.
19. K. J. Sherman, D. C. Cherkin, J. Erro, D. L. Miglioretti, and R. A. Deyo, Comparing yoga, exercise, and a self-care book for chronic low back pain: a randomised, controlled trial. Ann Intern Med. 2005; 143: 849–856.
20. E. Thomas, P. R. Croft, S. M. Paterson, K. Dziedzic, and E. M. Hay, What influences participants' treatment preference and can it influence outcome? Results from a primary care-based randomised trial for shoulder pain. Br J Gen Pract. 2004; 54: 93–96.

FURTHER READING

M. King, I. Nazareth, F. Lampe, P. Bower, M. Chandler, et al., Conceptual framework and systematic review of the effects of participants' and professionals' preferences in randomised controlled trials. Health Technol Assess. 2005; 9(35): 1–186.
M. King, I. Nazareth, F. Lampe, P. Bower, M. Chandler, et al., Impact of participant and physician intervention preferences on randomised trials: a systematic review. JAMA. 2005; 293: 1089–1099.

CROSS-REFERENCES

Large simple trials
External validity
Phase III trials
PREMARKET APPROVAL (PMA)
Premarket approval (PMA) is the Food and Drug Administration (FDA) process of scientific and regulatory review to evaluate the safety and the effectiveness of Class III medical devices. Class III devices are those that support or sustain human life, are of substantial importance in preventing impairment of human health, or that present a potential, unreasonable risk of illness or injury. Because of the level of risk associated with Class III devices, the FDA has determined that general and special controls alone are insufficient to assure the safety and the effectiveness of class III devices. Therefore, these devices require a PMA application under section 515 of the Food, Drug and Cosmetic (FD&C) Act to obtain marketing clearance. Please note that some Class III preamendment devices may require a Class III 510(k). PMA is the most stringent type of device marketing application required by the FDA. The applicant must receive FDA approval of its PMA application prior to marketing the device. PMA approval is based on a determination by the FDA that the PMA contains sufficient valid scientific evidence to assure that the device is safe and effective for its intended use(s). An approved PMA is, in effect, a private license that grants the applicant (or owner) permission to market the device. The PMA owner, however, can authorize use of its data by another. Usually, the PMA applicant is the person who owns the rights, or otherwise has authorized access, to the data and other information to be submitted in support of FDA approval. This applicant may be an individual, partnership, corporation, association, scientific or academic establishment, government agency or organizational unit, or other legal entity. The applicant is often the inventor/developer and ultimately the manufacturer. This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cdrh/devadvice/pma/) by Ralph D’Agostino and Sarah Karl.
FDA regulations provide 180 days to review the PMA and to make a determination. In reality, the review time is normally longer. Before approving or denying a PMA, the appropriate FDA advisory committee may review the PMA at a public meeting and provide the FDA with the committee's recommendation on whether the FDA should approve the submission. After the FDA notifies the applicant that the PMA has been approved or denied, a notice is published on the Internet (1) that announces the data on which the decision is based and (2) that provides interested persons an opportunity to petition FDA within 30 days for reconsideration of the decision. The regulation governing premarket approval is located in Title 21 Code of Federal Regulations (CFR) Part 814, Premarket Approval. A Class III device that fails to meet PMA requirements is considered to be adulterated under section 501(f) of the FD&C Act and cannot be marketed. PMA requirements apply to Class III devices, which is the most stringent regulatory category for medical devices. Device product classifications can be found by searching the Product Classification Database. The database search provides the name of the device, classification, and a link to the Code of Federal Regulations, if any. The CFR provides the device type name, identification of the device, and classification information. A regulation number for Class III devices marketed prior to the 1976 Medical Device Amendments is provided in the CFR. The CFR for these Class III devices that require a PMA states that the device is Class III and will provide an effective date of the requirement for PMA. If the regulation in the CFR states that "No effective date has been established of the requirement for premarket approval," a Class III 510(k) should be submitted. Please note that PMA devices often involve new concepts and many are not of a type marketed prior to the Medical Device Amendments. Therefore, they do not have a classification regulation in the CFR. In this case, the
product classification database will only cite the device type name and the product code. If it is unclear whether the unclassified device requires a PMA, then use the three-letter product code to search the PMA database and the Premarket Notification 510(k) database. These databases can be found by clicking on the hypertext links at the top of the product classification database web page. Enter only the three-letter product code in the product code box. If there are 510(k)’s cleared by FDA and the new device is substantially equivalent to any of these cleared devices, then the applicant should submit a 510(k). Furthermore, a new type of device may not be found in the product classification database. If the device is a high-risk device (it supports or sustains human life, is of substantial importance in preventing impairment of human health, or presents a potential, unreasonable risk of illness or injury) and has been found to be not substantially equivalent (NSE) to a Class I, II, or III [Class III requiring 510(k)] device, then the device must have an approved PMA before marketing in the United States. Some devices that are found to be not substantially equivalent to a cleared Class I, II, or III (not requiring PMA) device, may be eligible for the de novo process as a Class I or Class II device. For additional information on the de novo process, see ‘‘New section 513(f)(2) - Evaluation of Automatic Class III Designation: Guidance for Industry and [Center for Device and Radiological Health] CDRH Staff’’. PMA requirements apply to Class III preamendment devices, transitional devices, and postamendment devices. 1
PREAMENDMENT DEVICES
A preamendment device is one that was in commercial distribution before May 28, 1976, which is the date the Medical Device Amendments were signed into law. After the Medical Device Amendments became law, the classification of devices was determined by FDA classification panels. Eventually, all Class III devices will require a PMA. However, preamendment Class III devices require a PMA only after the FDA publishes a regulation
calling for PMA submissions. The preamendment devices must have a PMA filed for the device by the effective date published in the regulation to continue marketing the device. The CFR will state the date that a PMA is required. Prior to the PMA effective date, the devices must have a cleared Premarket Notification 510(k) prior to marketing. Class III preamendment devices that require a 510(k) are identified in the CFR as Class III and include the statement ‘‘Date premarket approval application (PMA) or notice of completion of product development protocol (PDP) is required. No effective date has been established of the requirement for premarket approval.’’ Examples include intra-aortic balloon and control system (21 CFR 870. 3535), ventricular bypass (assist) device (21 CFR 870.3545), cardiovascular permanent pacemaker electrode (21 CFR 870.3680), and topical oxygen chamber for extremities (21 CFR 878.5650). 2 POSTAMENDMENT DEVICES A postamendment device is one that was first distributed commercially on or after May 28, 1976. Postamendment devices equivalent to preamendment Class III devices are subject to the same requirements as the preamendment devices. 3 TRANSITIONAL DEVICES Transitional devices are devices that were regulated by FDA as new drugs before May 28, 1976. Any Class III device that was approved by a New Drug Application (NDA) is now governed by the PMA regulations. The approval numbers for these devices begin with the letter N. These devices are identified in the CFR as Class III devices and state that an approval under section 515 of the Act (PMA) is required as of May 28, 1976 before this device may be commercially distributed. An example of such device is intraocular lenses (21 CFR 886.3600). Please note that some transitional devices have been subsequently downclassified to Class II.
PREMARKET NOTIFICATION 510(K)
Each person who wants to market a Class I, II, and III device intended for human use in the United States, for which a Premarket Approval (PMA) is not required, must submit a 510(k) to the Food and Drug Administration (FDA) unless the device is exempt from 510(k) requirements of the Federal Food, Drug, and Cosmetic Act (the Act) and does not exceed the limitations of exemptions in .9 of the device classification regulation chapters. No 510(k) form exists; however, 21 CFR (Code of Federal Regulations) 807 Subpart E describes requirements for a 510(k) submission. Before marketing a device, each submitter must receive an order, in the form of a letter, from the FDA that finds the device to be substantially equivalent (SE) and states that the device can be marketed in the United States. This order ‘‘clears’’ the device for commercial distribution. A 510(k) is a premarket submission made to the FDA to demonstrate that the device to be marketed is at least as safe and effective, that is, substantially equivalent, to a legally marketed device (21 CFR 807.92(a)(3)) that is not subject to PMA. Submitters must compare their device to one or more similar legally marketed devices and must make and support their substantial equivalency claims. A legally marketed device, as described in 21 CFR 807.92(a)(3), is a device that was legally marketed prior to May 28, 1976 (preamendments device), for which a PMA is not required, or a device that has been reclassified from Class III to Class II or I, or a device that has been found SE through the 510(k) process. The legally marketed device(s) to which equivalence is drawn is commonly known as the ‘‘predicate.’’ Although devices recently cleared under 510(k) are often selected as the predicate to which equivalence is claimed, any legally marketed device
may be used as a predicate. The term "legally marketed" also means that the predicate cannot be one that is in violation of the Act. Until the submitter receives an order that declares a device SE, the submitter may not proceed to market the device. Once the device is determined to be SE, it can then be marketed in the United States. The SE determination is usually made within 90 days and is made based on the information submitted by the submitter. Please note that FDA does not perform 510(k) pre-clearance facility inspections. The submitter may market the device immediately after 510(k) clearance is granted. The manufacturer should be prepared for an FDA quality system (21 CFR 820) inspection at any time after 510(k) clearance.

1 SUBSTANTIAL EQUIVALENCE
A 510(k) requires demonstration of substantial equivalence to another legally U.S. marketed device. Substantial equivalence means that the new device is at least as safe and effective as the predicate. A device is substantially equivalent, in comparison to a predicate, if it:
• has the same intended use as the predicate; and
• has the same technological characteristics as the predicate; or
• has the same intended use as the predicate; and
• has different technological characteristics and the information submitted to FDA
• does not raise new questions of safety and effectiveness; and
• demonstrates that the device is at least as safe and effective as the legally marketed device.
A claim of substantial equivalence does not mean the new and predicate devices must be identical.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cdrh/devadvice/314.html) by Ralph D’Agostino and Sarah Karl.
Substantial equivalence is established with respect to the intended
use, design, energy used or delivered, materials, chemical composition, manufacturing process, performance, safety, effectiveness, labeling, biocompatibility, standards, and other characteristics, as applicable. A device may not be marketed in the United States until the submitter receives a letter that declares the device substantially equivalent. If the FDA determines that a device is not substantially equivalent, the applicant may:
• resubmit another 510(k) with new data,
• request a Class I or II designation through the de novo process,
• file a reclassification petition, or
• submit a premarket approval application (PMA).
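The substantial-equivalence criteria quoted above amount to a simple decision rule. The sketch below restates that logic schematically; it is illustrative only (the field names are hypothetical, and an actual 510(k) review weighs submitted evidence rather than boolean flags).

```python
# Schematic restatement of the substantial-equivalence criteria quoted above.
# Illustrative only: field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Submission510k:
    same_intended_use: bool
    same_technological_characteristics: bool
    raises_new_safety_effectiveness_questions: bool
    shown_as_safe_and_effective_as_predicate: bool

def substantially_equivalent(s: Submission510k) -> bool:
    if not s.same_intended_use:
        return False
    if s.same_technological_characteristics:
        return True
    # Different technological characteristics: the information submitted to FDA must
    # raise no new questions of safety or effectiveness and must demonstrate that the
    # device is at least as safe and effective as the predicate.
    return (not s.raises_new_safety_effectiveness_questions
            and s.shown_as_safe_and_effective_as_predicate)

# Example: same intended use, but new technological characteristics that raise
# new safety questions -> not substantially equivalent.
print(substantially_equivalent(Submission510k(True, False, True, True)))  # False
```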
2 WHO IS REQUIRED TO SUBMIT A 510(K)
The Act and the 510(k) regulation (21 CFR 807) do not specify who must apply for a 510(k). Instead, they specify which actions, such as introducing a device to the U.S. market, require a 510(k) submission. The following four categories of parties must submit a 510(k) to the FDA: 1. Domestic manufacturers introducing a device to the U.S. market; Finished device manufacturers must submit a 510(k) if they manufacture a device according to their own specifications and market it in the United States. Accessories to finished devices that are sold to the end user are also considered finished devices. However, manufacturers of device components are not required to submit a 510(k) unless such components are promoted for sale to an end user as replacement parts. Contract manufacturers, which are those firms that manufacture devices under contract according to someone else’s specifications, are not required to submit a 510(k). 2. Specification developers introducing a device to the U.S. market; A specification developer develops the specifications for a finished device, but it
has the device manufactured under contract by another firm or entity. The specification developer submits the 510(k), not the contract manufacturer. 3. Repackers or relabelers who make labeling changes or whose operations affect the device significantly. Repackagers or relabelers may be required to submit a 510(k) if they change the labeling significantly or otherwise affect any condition of the device. Significant labeling changes may include modification of manuals, such as adding a new intended use, deleting or adding warnings, contraindications, and so on. Operations, such as sterilization, could alter the condition of the device. However, most repackagers or relabelers are not required to submit a 510(k). 4. Foreign manufacturers/exporters or U.S. representatives of foreign manufacturers/exporters introducing a device to the U.S. market. Please note that all manufacturers (including specification developers) of Class II and III devices and select Class I devices are required to follow design controls (21 CFR 820.30) during the development of their device. The holder of a 510(k) must have design control documentation available for FDA review during a site inspection. In addition, any changes to the device specifications or manufacturing processes must be made in accordance with the Quality System regulation (21 CFR 820) and may be subject to a new 510(k). 3 WHEN A 510(K) IS REQUIRED A 510(k) is required when: 1. Introducing a device into commercial distribution (marketing) for the first time. After May 28, 1976 (effective date of the Medical Device Amendments to the Act), anyone who wants to sell a device in the United States is required to make a 510(k) submission at least 90 days prior to offering the device for sale, even though it may have been
under development or clinical investigation before that date. If your device was not marketed by your firm before May 28, 1976, a 510(k) is required. 2. You propose a different intended use for a device which you already have in commercial distribution. The 510(k) regulation (21 CFR 807) specifically requires a 510(k) submission for a major change or modification in intended use. Intended use is indicated by claims made for a device in labeling or advertising. Most changes in intended use will require a 510(k). Please note that prescription use to over the counter use is a major change in intended use and requires the submission of a new 510(k). 3. A change or modification of a legally marketed device occurs and that change could significantly affect its safety or effectiveness. The burden is on the 510(k) holder to decide whether a modification could significantly affect safety or effectiveness of the device. Any modifications must be made in accordance with the Quality System regulation, 21 CFR 820, and recorded in the device master record and change control records. It is recommended that the justification for submitting or not submitting a new 510(k) be recorded in the change control records. A new 510(k) submission is required for changes or modifications to an existing device, in which the modifications could significantly affect the safety or effectiveness of the device or the device is to be marketed for a new or different indication for use. 4
PREAMENDMENT DEVICES
The term "preamendments device" refers to devices legally marketed in the United States by a firm before May 28, 1976 and that have not been:
• significantly changed or modified since then; and
• for which a regulation requiring a PMA application has not been published by the FDA.
Devices that meet the above criteria are referred to as ‘‘grandfathered’’ devices and do not require a 510(k). The device must have the same intended use as that marketed before May 28, 1976. If the device is labeled for a new intended use, then the device is considered a new device and a 510(k) must be submitted to FDA for marketing clearance. Please note that you must be the owner of the device on the market before May 28, 1976, for the device to be grandfathered. If your device is similar to a grandfathered device and marketed after May 28, 1976, then your device does NOT meet the requirements of being grandfathered and you must submit a 510(k). For a firm to claim that it has a preamendments device, it must demonstrate that its device was labeled, promoted, and distributed in interstate commerce for a specific intended use and that intended use has not changed.
PREMATURE TERMINATION OR SUSPENSION If a trial is terminated prematurely or is suspended, the sponsor should inform promptly the investigators/institutions and the regulatory authority(ies) of the termination or suspension and the reason(s) for the termination or suspension. The IRB (Institutional Review Board)/IEC (Independent Ethics Committee) should also be informed promptly and provided the reason(s) for the termination or suspension by the sponsor or by the investigator/institution, as specified by the applicable regulatory requirement(s).
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
PRE-NDA MEETING

The Pre-NDA (New Drug Application) meeting with the U.S. Food and Drug Administration's Center for Drug Evaluation and Research (CDER) discusses the presentation of data (both paper and electronic) in support of the application. The information provided at the meeting by the sponsor includes:
• A summary of clinical studies to be submitted in the NDA.
• The proposed format for organizing the submission, including methods for presenting the data.
• Other information that needs to be discussed.
This meeting uncovers any major unresolved problems or issues; ensures that the studies the sponsor is relying on are adequate, are well-controlled, and establish the effectiveness of the drug; acquaints the reviewers with the general information to be submitted; and discusses the presentation of the data in the NDA to facilitate its review. Once the NDA has been filed, another meeting may occur 90 days after the initial submission of the application to discuss issues that were uncovered in the initial review.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/handbook/prndamtg. htm) by Ralph D’Agostino and Sarah Karl.
PRESCRIPTION DRUG USER FEE ACT (PDUFA) In 1992, the U.S. Congress passed the Prescription Drug User Fee Act (PDUFA), which was reauthorized by the Food and Drug Modernization Act of 1997 and again by the Public Health Security and Bioterrorism Preparedness and Response Act of 2002. The PDUFA authorized the U.S. Food and Drug Administration (FDA) to collect fees from companies that produce certain human drug and biologic products. Any time a company wants the FDA to approve a new drug or biologic before marketing, it must submit an application along with a fee to support the review process. In addition, companies pay annual fees for each manufacturing establishment and for each prescription drug product that is marketed. Before the PDUFA, taxpayers alone paid for product reviews through budgets provided by Congress; under the new program, the industry provides the funding in exchange for FDA agreement to meet drug-review performance goals, which emphasize timeliness.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/oc/pdufa/overview.html) by Ralph D’Agostino and Sarah Karl.
PRESCRIPTION DRUG USER FEE ACT (PDUFA) II

The Food and Drug Administration (FDA) Modernization Act (FDAMA) of 1997 amended the Prescription Drug User Fee Act (PDUFA) I and extended it through September 30, 2002 (PDUFA II). PDUFA II also committed the FDA to faster review time frames for some applications, to new goals for meetings and dispute resolution, and to the electronic receipt and review of applications by 2002. In July 1998, the FDA completed the original PDUFA II Five-Year Plan, which was the FDA's blueprint for investing the resources expected under PDUFA II. It was based on the planning efforts of the three FDA components directly responsible for meeting these goals: (1) the Center for Drug Evaluation and Research (CDER), (2) the Center for Biologics Evaluation and Research (CBER), and (3) the Office of Regulatory Affairs (ORA).
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/oc/pdufa2/2000update/pdufa2000.pdf) by Ralph D’Agostino and Sarah Karl.
PRESCRIPTION DRUG USER FEE ACT (PDUFA) IV

On September 27, 2007, President George W. Bush signed into law H.R. 3580, the Food and Drug Administration (FDA) Amendments Act of 2007, joined in the Oval Office by Health and Human Services (HHS) Secretary Michael Leavitt, FDA Commissioner Andrew von Eschenbach, and Rep. Joe Barton of Texas. This new law represents a very significant addition to FDA authority. Among the many components of the law, the Prescription Drug User Fee Act (PDUFA) and the Medical Device User Fee and Modernization Act (MDUFMA) have been reauthorized and expanded. These programs will ensure that FDA staff members have the additional resources needed to conduct the complex and comprehensive reviews necessary for new drugs and devices. Overall, this new law will provide significant benefits for those who develop medical products and for those who use them.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/oc/initiatives/advance/fdaaa. html) by Ralph D’Agostino and Sarah Karl.
PRESCRIPTION DRUG USER FEE ACT III The Prescription Drug User Fee Act (PDUFA) of 1992 provided the U.S. Food and Drug Administration (FDA) with the authority to collect increasing levels of fees for the review of human drug applications. In July 1998, the FDA made available a PDUFA II Five-Year Plan that presented the major assumptions the FDA was making and the investments it intended to make over the 5-year period of 1998 to 2002 to achieve the new goals associated with the PDUFA 1992, as amended and extended through the year 2002 by the Food and Drug Modernization Act of 1997. That plan was updated several times. The PDUFA was reauthorized and extended through fiscal year 2007 by the Prescription Drug User Fee Amendments of 2002 (PDUFA III). The first PDUFA III Five-Year Plan was published in 2003 and serves as the blueprint for FDA’s planned investment of the fee revenues and appropriations expected through fiscal year 2007. The agency may update this plan if any significant events merit substantial changes.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/oc/PDUFA/5yrplan.html) by Ralph D’Agostino and Sarah Karl.
PREVENTION TRIALS
ROSS L. PRENTICE
Fred Hutchinson Cancer Research Center, Seattle, WA, USA
There is a considerable history of the use of randomized clinical trials to assess strategies for the primary prevention of disease. For example, in the US, major coronary heart disease prevention trials date from the 1960s (7,9,10) to the present, while a number of trials have been initiated over the past one to two decades that focus on the prevention of major cancers and other important diseases. Primary prevention trials generally focus on reducing the occurrence rate of one or more diseases, in contrast to screening trials, which aim to reduce mortality rates through early detection and effective treatment of disease. However, the history of primary prevention trials is quite modest compared with that of therapeutic trials that assess strategies for the treatment of established disease. In fact, the role and place of primary prevention trials in relation to other research strategies remains controversial, and is an important topic for further methodologic research. It is useful to review some basic features of prevention trials to explain the reasons for controversy, and to highlight research needs. First, consider the nature of the interventions or the treatments to be assessed or compared. In therapeutic research, these arise typically from basic biological research in conjunction with drug screening studies. While such sources may also generate primary disease prevention hypotheses, particularly for chemoprevention trials, observational epidemiologic sources also provide a common and important source of prevention trial hypotheses. For example, preventive interventions may involve such ‘‘lifestyle’’ maneuvers as physical activity increases, nutrient consumption reductions or supplementation, or modifications of sexual behavior. A therapeutic trial among patients diagnosed with a serious disease will aim typically to identify effective treatments for the
reduction of a frequent outcome, such as disease recurrence or death, and may need to be of only one or two years' duration. In contrast, a primary prevention trial of a common vascular disease or cancer will typically focus on the reduction of incidence of a disease typically occurring at a rate of 1% per year or less, and will assess interventions that may require several years to realize the most important of their hypothesized effects. Hence, therapeutic trials may require relatively small numbers of patients, perhaps only a few hundred, while primary prevention trials may require tens of thousands of subjects, with resulting logistical challenges and substantial costs. To cite a particular example, the Multiple Risk Factor Intervention Trial (MRFIT) (10) of combined hypertension treatment, blood cholesterol lowering and smoking cessation vs. control for the prevention of coronary heart disease, involved the randomization of over 12 000 middle-aged and older men thought to be at high risk of coronary heart disease, with an average follow-up duration of over seven years. Only about 2% of MRFIT men experienced the designated primary outcome, coronary heart disease mortality, by the planned date of study completion. The cost of the trial was reported to be in the vicinity of US$100 million. The size and cost of prevention trials may be increased further by the need to be conservative in establishing intervention goals to ensure the safety of ostensibly healthy study subjects for whom the ability to monitor adverse events carefully may be somewhat limited. In contrast, a therapeutic trial of frequently monitored patients may be able to risk toxicity to achieve efficacy. Furthermore, the long duration of prevention trial follow-up may imply a noteworthy reduction in adherence to intervention goals over time, resulting in important increases in the necessary sample size. The interventions studied in a prevention trial, like the treatments assessed in a therapeutic trial, may have the potential to affect various disease processes, beneficially or adversely, in addition to those specifically
targeted for reduction. In view of the typical dominance of the patient’s disease, these other effects are often relatively unimportant in a therapeutic trial. In a prevention trial, however, overall benefit vs. risk assessments can be quite different from the assessment for the designated primary outcome alone. Even a fairly rare adverse effect can eliminate the public health utility of a preventive maneuver. The need to assess the interventions in terms of suitable measures of benefit vs. risk has implications for trial design and particularly for trial monitoring and reporting. 1 ROLE AMONG POSSIBLE RESEARCH STRATEGIES In view of the above litany of obstacles and challenges, it seems logical to take the viewpoint that a full-scale disease prevention trial is justified only if the interventions to be assessed have sufficient public health potential, and if alternate less costly research strategies appear unable to yield a sufficiently reliable assessment of intervention effects. If the intervention of interest falls outside the range of common human experience, as is often the case with chemopreventive interventions, there is little debate that randomized controlled trials constitute the research strategy of choice, and the discussion can focus on public health potential and research costs. However, if the intervention is already practiced in varying degrees by large numbers of persons, purely observational approaches may sometimes provide reliable disease prevention information at lesser cost, and perhaps in a shorter time, than can a randomized, controlled intervention trial. In fact, a single observational study; for example, a cohort study, may be able to assess a broader range of interventions than is practical to include in the design of a randomized prevention trial. However, key determinants of observational study reliability include the ability to control confounding and the ability to measure accurately the level of intervention adopted. Measurement error in the ‘‘exposure’’ histories of interest, or in confounding factors histories, can invalidate observational study hypothesis tests and estimates of intervention effects. Furthermore, randomized
trials have the major advantages that the randomized treatment assignment (i.e. intervention vs. control) is statistically independent of all prerandomization risk factors, whether or not these are even recognized, and that outcome comparisons among randomization groups (i.e. intention to treat analysis) typically will provide valid hypothesis tests, even if adherence to intervention varies among study subjects and is poorly measured. Consider the specific context of the Women’s Health Initiative (WHI) clinical trial (17,21), which is enrolled 68 132 postmenopausal American women in the age range 50–79. This trial is designed to allow randomized controlled evaluation of three distinct interventions: a low-fat eating pattern, hypothesized to prevent breast cancer and colorectal cancer, and, secondarily, coronary heart disease; hormone replacement therapy, hypothesized to reduce the risk of coronary heart disease and other cardiovascular diseases, and, secondarily, to reduce the risk of hip and other fractures; and calcium and vitamin D supplementation, hypothesized to prevent hip fractures and, secondarily, other fractures and colorectal cancer. Each of these three interventions is already being practiced in some fashion by large numbers of postmenopausal American women. Important disease reductions can be hypothesized for each intervention, based on substantial observational studies, animal experiments and randomized trials with intermediate outcomes (e.g. the Postmenopausal Estrogen/Progestin Intervention (PEPI) Trial (11)). In the case of hormone replacement therapy, a randomized trial is motivated by potential confounding in cohort and case–control studies as hormone users tend to be of higher socioeconomic status with fewer vascular disease risk factors, by the magnitude of the hypothesized benefits, and, importantly, by the need for reliable summary data on benefits vs. risks, particularly since breast cancer risk may be adversely affected by hormone replacement therapy. The dietary modification trial component is motivated by associations between international cancer incidence rates and per capita fat consumptions, by migrant study data, and by rodent feeding experiments. A large
number of case–control and cohort studies of dietary fat and various cancers have yielded mixed results. These latter studies rely exclusively on dietary self-reports, which are known from repeatability studies to involve substantial measurement error, though the absence of a gold-standard dietary measurement procedure precludes an assessment of measurement error characteristics as a function of actual dietary habits and of study subject characteristics, such as body mass. For example, Prentice (15) describes a plausible measurement model for dietary fat intake under which even the strong associations suggested by international disease rate comparisons would be essentially eliminated by random and systematic aspects of dietary assessment measurement error. Hence, the current observational studies of dietary fat in relation to cancer or other diseases may be uninterpretable, motivating the need for a randomized intervention trial to assess whether a change to a low-fat eating pattern during the middle decades of life can reduce the risk of selected cancers and cardiovascular diseases. This controversy over the interpretation of the observational data on dietary fat points to the pressing need for objective measures of fat consumptions (i.e. biomarker measures) and for the development of flexible measurement models to allow self-report and objective exposure data to be combined in exposure–disease rate analyses. The issues of exposure measurement error, along with limited exposure variation within populations, and highly correlated exposure variables, also point to a possible greater role for aggregate (ecologic) study designs (e.g. (16)) among observational research strategies. The calcium and vitamin D component of the WHI is viewed as a comparatively inexpensive addition to the clinical trial. It is motivated by the public health potential of the intervention, as well as by observational data, and data from smaller clinical trials. 2 PREVENTION TRIAL PLANNING AND DESIGN Suppose that an intervention having potential to prevent one or more diseases is to be subjected to a randomized controlled trial.
The trial design should be responsive to the target population to which the intervention, if effective, might be applied. For example, the three interventions to be studied in the trial component of the WHI are all potentially applicable to the general population of postmenopausal women, and the trial will be open to women who are not otherwise practicing the interventions to any noteworthy degree. After identifying the potential target population for the intervention, there may still be reason to focus the trial on a subset of this population. There may be an identifiable subset at elevated risk for the primary outcome that could be chosen for trial participation, in order to reduce trial sample size. For example, it may be proposed that a colon cancer prevention strategy be assessed in subjects known to have had colonic polyps, or a breast cancer prevention strategy among women with a history of breast cancer among one or more first-degree relatives, even though it is hoped that the results will be applicable to a broader target population. There are several aspects to deciding on such an approach. First, although sample size may be reduced, trial logistics may be complicated and trial costs increased. For example, the costs of screening to identify eligible subjects will increase typically, and a larger number of participating clinical centers may be required. Depending on the intervention mechanism, high-risk study subjects may benefit less, by virtue of their stage in the targeted disease process, compared with other potential study subjects. Also, a focus on high-risk subjects may lead to a distorted view of the overall risks and benefits relative to the entire targeted population. Within the target population, criteria may be needed to exclude study subjects with medical contraindications to either intervention or control regimens, study subjects who are already practicing the intervention to an unacceptable degree, or who may not adhere to intervention group requirements or to other protocol requirements. Even if study subjects are selected on the basis of elevated risk for the diseases that are targeted for prevention, primary outcome events may constitute a small minority of the disease events experienced by study subjects
during the course of the trial, and perhaps even a small minority of disease events that may in some way be affected by intervention activities. Hence, there is an obligation to define sets of outcomes, to be carefully ascertained, including those that may plausibly be affected by intervention activities, in order to provide an opportunity to assess the overall risks and benefits in the target population. The cost and logistics of a full-scale disease prevention trial may motivate a trial with some intermediate outcomes in place of the disease to be prevented. For example, a major trial was conducted in the US to prevent colonic polyps, rather than colon cancer, by means of a low-fat, high-fiber dietary pattern. This study makes the assumption that the formation of polyps is on the pathway between dietary habits and colon cancer occurrence, and that reduction in polyps occurrence will convey a corresponding reduction in colon cancer incidence. While the conditions for an intermediate outcome of this type to serve as a ‘‘surrogate’’ for the disease of interest are rather strict (see (4) and (13)), the benefits in terms of trial sample size, cost and duration may sometimes justify an intermediate outcome trial. In other circumstances, a trial with one or more intermediate outcomes may be conducted first to inform the decision concerning a trial with ‘‘harder’’ outcomes. In some circumstances, the relationship between an intervention or behavior and a reduction in disease will be regarded as sufficiently well established that the research effort can shift logically to strategies to encourage the desired behavior change. Cigarette smoking cessation or prevention in relation to lung cancer and heart disease, or breast screening by means of mammography and other techniques provide important examples. Randomized trials with such behaviors as outcomes can be classified as disease control research, rather than primary prevention research. In such studies, the intervention may sometimes be able to be delivered with particular economy to persons in natural groups, such as social groups, schools, or communities. In fact, the use of community organizations and media may even define the intervention strategy, as in
the Community Intervention Trial for Smoking Cessation, which took place in 11 pairs of matched cities in the US. Such studies naturally involve group randomization, and there is a range of interesting design and analysis issues (5,6). Returning to individually randomized prevention trials with disease outcomes, other design choices include the possible use of factorial designs, and intervention and control randomization fractions. Factorial designs have an obvious appeal in that they provide the potential to make two or more intervention comparisons in the same study population at a cost that will typically be considerably less than that for separate studies. For example, in the WHI trial, study subjects must be eligible and willing to participate in one or more of the hormone replacement or dietary intervention components, and, subsequently, are offered the opportunity to participate in the calcium and vitamin D component—a so-called partial factorial design. There is only a modest overlap between the hormone replacement and dietary intervention components due to component-specific exclusionary criteria, but a large overlap of either of these with the calcium and vitamin D component. As a result, the projected trial sample size is 68 132 rather than the 120 000 or more that would be required to assess the three interventions separately. The potential disadvantages of a factorial design are the possibility that the benefit associated with an intervention may be reduced by the presence of one or more other interventions, and the possibility that adherence to a given intervention may be reduced by the study demands or adverse effects that may arise from participation in the other interventions. The necessary sample size of a two-arm trial is approximately proportional to [γ(1 − γ)]⁻¹, where γ is the fraction of the trial cohort assigned to the intervention group. Hence, if the average study costs associated with an intervention group subject are C times those for the corresponding control group subject, trial costs will be approximately proportional to [Cγ + (1 − γ)][γ(1 − γ)]⁻¹, which is minimized by setting γ = (1 + C^{1/2})⁻¹. For example, if study costs per intervention group subject are 2.25 times
that per control group subject, then γ = 0.4, a randomization fraction that is used for the dietary intervention component of the WHI. Upon selecting the interventions to be evaluated, the target population, and major trial outcomes, one needs to make a series of design assumptions that will determine the size of the trial cohort. Perhaps the most fundamental assumption concerns the anticipated primary endpoint intervention benefit, often expressed as a relative risk (hazard ratio) for fully adherent intervention vs. fully adherent control subjects as a function of time from randomization. Assumptions concerning primary endpoint disease rates in the absence of intervention, on intervention and control group adherence rates and accrual patterns, trial duration and competing risks can then be combined with the basic relative risk assumption to produce the sample size that will yield a significant result (e.g. at the 0.05 significance level) under design assumptions, with a specified probability or power (e.g. power of 90%). Various authors, including Self & Mauritsen (18), provide flexible sample size and power procedures that allow one to incorporate assumptions of this type. The WHI Study Group (21) details such assumptions for the WHI clinical trial. Corresponding primary endpoint power calculations played a major role in the specification of sample sizes of 48 000, 27 500, and 45 000 for the dietary modification, hormone replacement therapy, and calcium and vitamin D trial components, respectively. Pilot and feasibility studies play a critical role in prevention trial planning. Such studies provide the opportunity to assess study subject recruitment rates, to evaluate the potential of a run-in period to identify and exclude study subjects who may not comply with study requirements, to observe biomarker changes that may help to establish the basic relative risk assumption, and to assess costs associated with all aspects of at least the early phases of trial operation. Information on these topics can be critically important to the development of an efficient trial design. See Urban et al. (19) for an example of the use of cost projections to inform the design choices for a low-fat diet
intervention trial, including eligibility criteria, average follow-up duration, randomization fraction and number of clinical centers. Careful consideration of subsampling rates for the collection and processing of baseline and follow-up data and specimens can also play an important role in controlling trial costs. 3 CONDUCT, MONITORING, AND ANALYSIS A disease prevention trial requires a clear, concise protocol that describes trial objectives, design choices, performance goals and monitoring and analysis procedures. A detailed manual of procedures that describes how the goals will be achieved is necessary to ensure that the protocol is applied in a standardized fashion. Carefully developed data collection and management tools and procedures, with as much automation as practical, can also enhance trial quality. Centralized training of key personnel may be required to ensure that the protocol is understood, and to enhance study subject recruitment, intervention adherence, and comparability of outcome ascertainment, possibly through blinding of the randomization assignment between intervention and control groups. A committee knowledgeable in the various aspects of the trial, and often external to the investigative group, typically will be required for the timely review of safety and clinical outcome data. As mentioned previously, prevention trial monitoring for early stoppage will usually not only involve the designated primary outcome(s), but also some suitable measure of overall benefit vs. risks, as well as of important adverse effects. Some aspects of the proposed monitoring of the WHI clinical trial are described in Freedman et al. (3). For example, early stoppage for benefit may be merited if the primary outcome incidence reduction is significant at customary levels (p < 0.05) and the summary benefit vs. risk measure is supportive (e.g. p < 0.20) without important adverse effects. Early stoppage based on harm may be indicated if an important adverse event is significant (p < 0.05) without suggestive evidence (p > 0.20) of
benefit vs. risk. More sophisticated stopping criteria could also be considered, and critical values that acknowledge the multiplicity of outcomes and of testing times need to be constructed. See Cook (1) for an example of such a construction for a bivariate response. The basic test statistic to compare two randomization groups with respect to a failure time (disease) endpoint might often reasonably be defined to be of weighted logrank form:

Σ_{i=1}^{n} δ_i g(t_i) [z_i − n_1(t_i) n(t_i)^{−1}],

where n is the total number of intervention and control study subjects, n_1(t_i) and n(t_i) are, respectively, the number of intervention subjects and the total number of subjects at risk for failure at the failure time (δ_i = 1) or censoring time (δ_i = 0) for the ith subject, and z_i indicates whether the ith subject is assigned to intervention (z_i = 1) or control (z_i = 0). The test will have high efficiency if the "weight" g(t) at time t from randomization is chosen to approximate the logarithm of the anticipated intervention vs. control group hazard ratio for the endpoint under test, taking account of anticipated adherence rates. For example, if this hazard ratio is expected to be approximately constant, then one might set g(t) ≡ 1, in which case one has the classical logrank test, while if the hazard ratio is expected to decline approximately exponentially over the follow-up period, then one might set g(t) = t. Adaptive versions in which the form of g(t) is responsive to the evolving trial data may also be considered. The above test can be generated as a partial likelihood (2) score test for β = 0 under a hazard ratio model exp[x(t)β], where x(t) = z g(t). This modeled regression vector can be extended to include other variables that may be intermediate between intervention activities and outcome events, in an attempt to explain intervention effects on disease outcomes. The trial monitoring process will have some effect on these tests and estimators, with typically larger effects if early stoppage occurs. The estimation of intervention effects may be biased (e.g. (20)) even for outcomes that do not contribute to early stoppage decisions. Analyses that attempt to explain intervention effects in terms of intermediate measures can often be based efficiently on case–control (8) or case–cohort (12) subsampling procedures, and should acknowledge measurement error in the intermediate variable assessment. See (22), which provides basic results from the WHI clinical trial component on combined hormones, following early stoppage based on risks exceeding benefits, as an example of important and unexpected trial results and of the complexity of prevention trial reporting.
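As an illustration only, the weighted statistic displayed above can be computed directly. The sketch below uses simulated event and censoring times with a hypothetical constant hazard ratio, and it returns only the score (numerator), not a variance estimate or significance test.

```python
# Illustrative computation of the weighted logrank score displayed above:
# sum over observed failures of g(t_i) * [z_i - n_1(t_i) / n(t_i)].
# Simulated data; no variance estimate or significance test is included.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
z = rng.integers(0, 2, size=n)                          # 1 = intervention, 0 = control
event = rng.exponential(np.where(z == 1, 12.0, 10.0))   # latent event times (hazard ratio < 1 for intervention)
censor = rng.uniform(0.0, 15.0, size=n)                 # administrative censoring times
time = np.minimum(event, censor)
delta = (event <= censor).astype(int)                   # 1 if the failure was observed

def weighted_logrank_score(time, delta, z, g=lambda t: 1.0):
    """Sum over observed failures of g(t_i) * [z_i - n_1(t_i) / n(t_i)]."""
    score = 0.0
    for t_i, d_i, z_i in zip(time, delta, z):
        if d_i == 0:
            continue                                    # censored subjects contribute nothing
        at_risk = time >= t_i                           # risk set just before t_i
        score += g(t_i) * (z_i - (at_risk & (z == 1)).sum() / at_risk.sum())
    return score

print(weighted_logrank_score(time, delta, z))                 # g(t) = 1: classical logrank score
print(weighted_logrank_score(time, delta, z, g=lambda t: t))  # g(t) = t: weight for a declining hazard ratio
```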
4 ACKNOWLEDGMENTS

This work was supported by grant CA53996 from the National Cancer Institute. This entry builds upon Prentice (14), which includes additional discussion of a number of the previously mentioned technical issues.

REFERENCES

1. Cook, R. J. (1996). Coupled error spending functions for parallel bivariate tests, Biometrics 52, 422–450.
2. Cox, D. R. (1975). Partial likelihood, Biometrika 62, 269–276.
3. Freedman, L. S., Anderson, G. A., Kipnis, V., Prentice, R. L., Wang, C. Y., Rossouw, J., Wittes, J. & DeMets, D. L. (1996). Approaches to monitoring the result of long-term disease incidence prevention trials: examples of the Women's Health Initiative, Controlled Clinical Trials 17, 509–525.
4. Freedman, L. S., Graubard, B. I. & Schatzkin, A. (1992). Statistical validation of intermediate endpoints for chronic diseases, Statistics in Medicine 11, 167–178.
5. Freedman, L. S., Green, S. B. & Byar, D. P. (1990). Assessing the gain in efficiency due to matching in a Community Intervention Study, Statistics in Medicine 9, 943–952.
6. Gail, M. H., Byar, D. P., Pechacek, T. F. & Corle, D. K. (1992). Aspects of the statistical design for the Community Intervention Trial for Smoking Cessation (COMMIT), Controlled Clinical Trials 13, 6–21.
7. Hypertension Detection and Follow-up Program Cooperative Group (1979). Five year findings of the Hypertension Detection and Follow-up Program. I. Reductions in mortality of persons with high blood pressure, including mild hypertension, Journal of the American Medical Association 242, 2562–2571.
8. Liddell, F. D. K., McDonald, J. C. & Thomas, D. C. (1977). Methods for cohort analysis: appraisal by application to asbestos mining data (with discussion), Journal of the Royal Statistical Society, Series A 140, 469–490.
9. Lipid Research Clinic Program (1984). The Lipid Research Clinic Coronary Primary Prevention Trial Results. I. Reduction in incidence of coronary heart disease, Journal of the American Medical Association 251, 351–364.
10. Multiple Risk Factor Intervention Trial (MRFIT) Research Group (1982). Multiple Risk Factor Intervention Trial: risk factor changes and mortality results, Journal of the American Medical Association 248, 1465–1477.
11. PEPI Trial Writing Group (1995). Effects of estrogen or estrogen/progestin regimens on heart disease risk factors in postmenopausal women: the Post-menopausal Estrogen/Progestin Intervention (PEPI) Trial, Journal of the American Medical Association 273, 199–208.
12. Prentice, R. L. (1986). A case–cohort design for epidemiologic cohort studies and disease prevention trials, Biometrika 73, 1–11.
13. Prentice, R. L. (1989). Surrogate endpoints in clinical trials: discussion, definition and operational criteria, Statistics in Medicine 8, 431–440.
14. Prentice, R. L. (1995). Experimental methods in cancer prevention research, in Cancer Prevention and Control, P. Greenwald, B. S. Kramer & D. L. Weed, eds. Marcel Dekker, New York, pp. 213–224.
15. Prentice, R. L. (1996). Measurement error and results from analytic epidemiology: dietary fat and breast cancer, Journal of the National Cancer Institute 88, 1738–1747.
16. Prentice, R. L. & Sheppard, L. (1995). Aggregate data studies of disease risk factors, Biometrika 82, 113–125.
17. Rossouw, J. E., Finnegan, L. P., Harlan, W. R., Pinn, V. W., Clifford, C. & McGowan, J. A. (1995). The evolution of the Women's Health Initiative: perspectives from the NIH, Journal of the American Medical Women's Association 50, 50–55.
18. Self, S. G. & Mauritsen, R. (1988). Power/sample size calculations for generalized linear models, Biometrics 44, 79–86.
19. Urban, N., Self, S., Kessler, L., Prentice, R., Handerson, M., Iverson, D., Thompson, D. L., Byar, D., Insull, W., Gorach, S. G., Clifford, C. & Goldman, S. (1990). Analysis of the costs of a large prevention trial, Controlled Clinical Trials 11, 129–146.
20. Whitehead, J. (1986). Supplementary analysis at the conclusion of a sequential clinical trial, Biometrics 42, 461–471.
21. Women's Health Initiative Study Group (1998). Design of the Women's Health Initiative Clinical Trial and Observational Study, Controlled Clinical Trials 19, 61–109.
22. Writing Group for the Women's Health Initiative (2002). Risks and benefits of estrogen plus progestin in healthy postmenopausal women, Journal of the American Medical Association 288, 321–330.
PRIMARY EFFICACY ENDPOINT
DEAN A. FOLLMANN
Biostatistics Research Branch
National Institute of Allergy and Infectious Diseases
Bethesda, Maryland
1 DEFINING THE PRIMARY ENDPOINT
The primary endpoint is the outcome by which the effectiveness of treatments in a clinical trial is evaluated. As such, it plays a key role in the overall design of the study and needs to be carefully specified to ensure that the results of the clinical trial will be accurate and accepted by the medical community. Studies with poorly chosen endpoints can lead to the wrong conclusions or produce controversial results. A carefully chosen endpoint helps ensure an unbiased comparison, but for a study result to be acceptable to the medical community, the endpoint also needs to be meaningful—of either demonstrated or accepted relevance for the population and interventions of the trial. A primary efficacy endpoint needs to be specified before the start of the clinical trial. This is important for several reasons. A primary endpoint is required to determine a sample size that ensures a clinically meaningful difference in the primary endpoint between the two randomized groups will be detected with high probability, generally 0.80 to 0.90. Another important reason for a priori specification is that, if a primary endpoint is specified after unblinded data have been observed, there is no way to ensure that the choice of endpoint has not been influenced by knowledge of its effect on the comparison of arms. One might consciously or subconsciously be attracted to a primary endpoint that corresponds to a small P-value for testing for the treatment effect. Finally, prespecifying the primary endpoint forces trial designers to know exactly what the trial will be capable of concluding at its end. Key characteristics of a primary endpoint have been discussed by different investigators. Meinert (1) lists the following desired characteristics of a primary endpoint:
1. Easy to diagnose or observe.
2. Free of measurement or ascertainment errors.
3. Capable of being observed independent of treatment assignment.
4. Clinically relevant.
5. Chosen before the start of data collection.
The following qualities are mentioned by Neaton et al. (2) for a clinical endpoint:
1. Relevant and easy to interpret.
2. Clinically apparent and easy to diagnose.
3. Sensitive to treatment differences.
Ideally, all criteria should be satisfied for a primary efficacy endpoint, though this may be difficult at times. The primary endpoint, along with a method of analysis, needs to be specified precisely in the protocol to avoid ambiguity. For example, a primary endpoint of mortality could be interpreted in different ways. Is this the occurrence of death by a specific time after randomization, or the time until death? Even if time until death is the endpoint, there are different statistical tests that could be used, and a specific test needs to be identified before the trial starts.
2 FAIRNESS OF ENDPOINTS
The primary endpoint should be defined identically for both groups, explicitly and implicitly. However, bias or lack of fairness can be introduced in many ways, some of which are subtle. Probably the most widely recognized way of avoiding bias in endpoints is to have the study be ‘‘blinded’’ so that the randomized volunteer, the treating physician, and the person assessing the primary outcome do not know what treatment the patient is receiving. In an unblinded study, bias can be introduced in multiple ways. With a ‘‘soft’’ or somewhat subjective endpoint such as quality-of-life, pain intensity, or number of days with suicidal ideation, patients who know they are receiving active treatment may feel happier
and may report feeling better even if nothing has changed. Even with an objective quantitative endpoint such as blood pressure, patients who know they have received active treatment may feel calmer and obtain better measurements than those who are upset with the disappointment of obtaining a placebo. Having the treated physician blinded is also important as knowledge of randomization assignment might cause the physician to more quickly prescribe additional therapies for failing placebo patients. More subtly, a physician might provide cues as to his expectation about a patient’s success. Relatedly, a person ascertaining the endpoint, such as the ejection fraction as determined by echocardiography, might be more likely to read higher if he knows the patient has received the promising active arm. In time-to-event trials, lack of fairness can be introduced by the length of follow-up. Ideally, the period of observation will be of sufficient length so that harms and benefits of all therapies can be assessed. For example, one might compare an aggressive therapy such as bone marrow transplantation or surgery compared with a less aggressive therapy such as a drug therapy. If the outcome were 30-day mortality, the drug arm might look better because the aggressive therapy needs time to manifest its benefit. A better outcome for such a trial might be mortality at 3 years. Another practice to be avoided is to use different periods of evaluation for the different arms. For example, one might tally deaths that occur in the year after initiation of therapy. If therapy A begins immediately while therapy B takes a month to start, bias can be introduced. Therapy A is being evaluated over the first year after randomization, but therapy B is being evaluated over months 1 to 13. If both treatments are useless, therapy B will look better if the overall death rate is pronounced early and then diminishes whereas therapy A will look better if the death rate increases over time. A related problem is when different arms receive differential scrutiny. Consider a trial of drugs versus implantable cardiac defibrillators (ICDs) where the endpoint is time to first arrhythmic event since randomization. Arrhythmias can be monitored continuously
with the ICD, but such monitoring is impossible in the drug arm. Ideally, one should only count arrhythmias that are determined the same way in both arms, such as arrhythmias requiring hospitalization and confirmed by an external electrocardiogram (ECG). Admittedly this can be difficult if ICD-detected arrhythmias are successfully treated, obviating the need for hospitalization. Relatedly, an intensive therapy may entail more frequent visits to the clinic than the comparator therapy. In such a case, the primary endpoint should only be evaluated on visits that are common to both groups. At times, this principle may be hard to implement. Suppose a trial of antimalaria prophylaxis has one group coming in weekly to receive a shot and the other coming in monthly to receive a 30-day supply of pills. The endpoint is monthly parasitemia averaged over the rainy season. Further, suppose that if patients come in with malaria they must receive curative drugs that have a lingering prophylaxis effect. Even if both weekly and monthly prophylaxis arms are useless, the weekly arm will look better because of the lingering effect of curative treatment. In this setting, the group receiving pills should also come in weekly. More subtly, individuals should not effectively choose their own endpoint but rather should have the endpoint chosen for them. Suppose there were two possible efficacy endpoints, E1 or E2 , with E1 having smaller, more desirable values than E2 . It would be silly to design a trial where patients could choose endpoint E1 or E2 after randomization. If treatment had no effect on either endpoint but caused patients to choose E1 , the trial would demonstrate treatment was efficacious when it only had a meaningless effect on endpoint choice. Although no one would explicitly design a trial with such an endpoint, patient choice of endpoint can effectively happen in subtle ways. Suppose patients come in for 24-hour ECG monitoring when they feel at risk for arrhythmia and the primary endpoint, for each patient, is the proportion of visits with two or more documented arrhythmias in a 24-hour period. If treatment has no effect on arrhythmias but causes patients to become hypervigilant and come
in for monitoring when the heart is functioning normally, the treatment will look efficacious on this endpoint. Effectively, treatment patients are choosing endpoint ‘‘E1 ’’ (monitoring during normal and abnormal periods), and control patients are choosing endpoint ‘‘E2 ’’ (monitoring during abnormal periods). Patient selection of endpoint also effectively happens when success is defined at the end of therapy but duration of therapy depends on a patient’s response. For example, comparing two different cycles of chemotherapy when response is measured after the end of therapy but therapy is continued until patients either have a complete response or fail can introduce a biased comparison. Here, one therapy might induce quick but transient complete responses, while the other induces delayed but lasting complete responses. Using a fixed time since randomization would properly identify the second treatment as better, and measuring response at the end of therapy would incorrectly show the two treatments were similar. Another potential problem with fairness arises when patients have multiple events or observations and each can be scored. For example, in seizure disorders, both the number and severity of each seizure can be recorded. In coronary artery disease, both the number of occluded arteries and the percentage of occlusion can be recorded. Suppose that for each occluded artery the treatment reduces the percentage of occluded area, and treatment also prevents new occlusions from occurring. A comparator that had no effect on percentage of occluded area but rather encouraged mild occlusions to form might look better than treatment if the endpoint were average occlusion among occluded arteries identified at the end of the study. A similar problem would arise if the proportion of ‘‘harmful’’ seizures were used as the endpoint. In both cases, the numerator and denominator of the average are observed after randomization, and the ratio is summarizing the numerator and denominator in an undesirable way. A better approach is to define as an endpoint the average of the final occlusive area just over the arteries that were identified as occluded at baseline, or to use time to first serious seizure.
3 SPECIFICITY OF THE PRIMARY ENDPOINT
At times, there will be a tension between selecting an endpoint that is precise and specific versus having an endpoint that can be measured in everyone and is fair to the study arms. For example, suppose that two methods of platelet transfusion are being compared where each patient will receive transfusions as needed (3). The aim of the therapies is to reduce the number of patients who allo-immunize by developing antibodies that destroy the transfused platelets. Although it is relatively simple to determine whether each transfusion has succeeded, it is more complicated to know if a failure is due to allo-immunization because this would require additional transfusions that should be HLA-matched. Such transfusions might be impossible to procure, or a patient might die before receiving such a transfusion. Thus, one might consider as endpoints such time to transfusion failure or time to transfusion failure due to allo-immunization. The more specific endpoint may be missing in more patients, while the simpler endpoint should be ascertainable in nearly everyone. In such settings, it is better to define an endpoint that can be measured reliably in nearly everyone, especially if there is a potential for differential missingness by treatment with the more specific endpoint. Any endpoint that is likely to be missing for a number of participants is seriously flawed and should be avoided. On the other hand, use of nonspecific endpoints can substantially increase the sample size. Suppose we are interested in testing a vaccine for clinical influenza. We might have a nonspecific endpoint of ‘‘flu-like’’ symptoms, which is easy to measure, or we might have a more expensive definitive test that identifies ‘‘true flu.’’ Suppose it is known that 30% of the control group will have flu-like symptoms and that one third of these will have true flu. If the effect of the vaccine is solely to reduce true flu by 50%, then a simple calculation shows that using flu-like symptoms as an endpoint requires about three times the sample size that is required when true flu is used. One way to address this dilemma is through the use of validation sets where a subset of patients are measured for true flu (4).
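The ‘‘about three times’’ figure can be checked with the usual normal-approximation sample-size formula for comparing two proportions. The sketch below is an added illustration, not part of the original entry; the 30% attack rate, the one-third fraction of true flu, and the 50% vaccine effect come from the text, while the two-sided 5% level and 90% power are assumptions.

```python
from scipy.stats import norm

def n_per_arm(p1, p2, alpha=0.05, power=0.90):
    """Approximate per-arm sample size for comparing two proportions."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z**2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2

# Control arm: 30% flu-like illness, one third of which (10%) is true flu.
# The vaccine halves true flu only: true flu 10% -> 5%, flu-like 30% -> 25%.
n_nonspecific = n_per_arm(0.30, 0.25)   # endpoint = flu-like symptoms
n_specific = n_per_arm(0.10, 0.05)      # endpoint = laboratory-confirmed flu
print(round(n_nonspecific), round(n_specific), round(n_nonspecific / n_specific, 1))
# The ratio is roughly 3, in line with the text.
```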
4 COMPOSITE PRIMARY ENDPOINTS
In some settings, treatment may be expected to have effects on a variety of specific endpoints. For example, tighter glycemic control in type II diabetics might be expected to reduce strokes and heart attacks. A bone marrow transplant for scleroderma might be expected to improve lung function, kidney function, and mortality. In human immunodeficiency virus (HIV) disease, therapies might delay the onset of a wide variety of opportunistic infections. For such trials, it is common to consider use of a composite primary endpoint such as the time to the first event, where the event is a list of specific constituent events such as stroke or heart attack, or all the opportunistic infections that define acquired immunodeficiency syndrome (AIDS). The constituents of a composite endpoint typically reflect a common biological pathway. However, having a composite outcome composed of constituents that reflect quite different disease severity may make interpretation of the results difficult. Or the composite may be dominated by more frequent but less serious constituent endpoints. For example, in the Azithromycin and Coronary Events Study (ACES) trial, which randomized about 4000 patients with stable coronary artery disease to azithromycin or placebo, the primary endpoint was a composite including death due to coronary heart disease, myocardial infarction, hospitalization, and revascularization (5). Revascularization accounted for about half of the events, and death accounted for about 10% of the events. Composite endpoints are typically analyzed as the time to the first event on a list. Although simple, this approach can be unappealing if the events on the list differ in severity. A different approach to composite endpoints is to evaluate the entire history of events and then either rank all patients on the basis of the entire trajectory of events (6) or rank pairs of patients over their common follow-up time (7). Also see Bjorling and Hodges (8) for discussion. A simple example of this strategy would be to look at the times of strokes, heart attacks, and deaths where patients are first ordered by time to death, then survivors are ranked by time to first
stroke or heart attack. In contrast, a typical composite would rank patients by time to first event so that a patient who survives the study but has a stroke at 1 year after randomization is tallied as worse than a patient who dies after 2 years. Sometimes treatments may be expected to have varied effects. If treatment is expected to be beneficial on some endpoints but harmful on others, deciding on a primary efficacy endpoint can be quite difficult. However, if all endpoints are similar in importance the problem may be lessened. For example, in the International Study of Infarct Survival (ISIS) trial, patients with suspected heart attacks were randomized to streptokinase (a blood-thinning drug) or placebo (9). The drug was hoped to prevent heart attacks but might cause strokes where blood leaks into the brain. Composite endpoints such as allcause mortality or stroke and heart attacks would make sense for such a trial, as it would be unfair to only count heart attacks. If the importance of the endpoints differs, the problem is more complicated. For example, in the Women’s Health Initiative, which included a randomized trial of hormone replacement therapy (HRT) versus placebo in postmenopausal women, HRT was anticipated to prevent coronary heart disease (CHD), but there was concern it might increase the rate of breast cancer. Although time to CHD was specified as the primary efficacy endpoint, time to breast cancer and time to a composite that included CHD and breast cancer were all monitored during the course of the trial (10). A common misconception in selecting constituent endpoints is that more constituents are good because this results in more events and more events increases power. If the treatment effect remains the same for different constituent lists, then the longer the list, the more events and the greater power for the study. However, some constituent endpoints might be less responsive to treatment, and this can dilute the overall treatment effect. A simple calculation illustrates the point. Suppose that two constituent endpoints each occur at a rate of 10% in the control group, and treatment reduces the event rate by 50% for the first constituent. If treatment reduces the event rate for the second constituent by
20%, then we have about equal power using either the composite or the first constituent. The point of indifference changes with the event rates and with assumed treatment effects. Selection of the exact constituents of a composite endpoint requires careful consideration of the impact on power and interpretability.
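The following added sketch reproduces this comparison with the standard two-proportion sample-size formula, under the simplifying assumption that the two constituents are non-overlapping so that the composite event rate is the sum of the constituent rates; the two-sided 5% level and 90% power are also assumptions.

```python
from scipy.stats import norm

def n_per_arm(p_control, p_treated, alpha=0.05, power=0.90):
    """Approximate per-arm sample size for comparing two proportions."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    var = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return z**2 * var / (p_control - p_treated) ** 2

# Constituent 1: 10% -> 5% (50% reduction); constituent 2: 10% -> 8% (20% reduction).
# Treating the constituents as non-overlapping, the composite goes from 20% to 13%.
print(round(n_per_arm(0.10, 0.05)))   # first constituent alone
print(round(n_per_arm(0.20, 0.13)))   # composite of both constituents
# The two required sample sizes are nearly identical, i.e., about equal power.
```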
5 MISSING PRIMARY ENDPOINT DATA
Ideally, the primary endpoint should be measured on everyone, and the study should be constructed to minimize the amount of missing data. Because the patients with complete data are a subset of those randomized, there is no assurance that patients with nonmissing data are comparable between the treatment and control groups. Perhaps the sicker patients randomized to treatment are more likely to have missing outcomes than comparable patients in the placebo group. Even small amounts of missing data can cause significant controversy in the interpretation of a study with equivocal results. As such, investigators should pay serious attention to the possibility of missing data in the design stage and choose the hypothesis and primary endpoints in a way that is likely to minimize missing data. If dropout is a concern, this might entail use of relatively short-term endpoints, such as blood pressure change over 3 months instead of 3 years. Another solution might be to use a simpler measuring device that gives a result on everyone as opposed to a sophisticated device that occasionally fails for uncertain reasons. If missing data cannot be avoided, it makes sense to augment the sample size so that power is maintained even with some missing observations. Such augmentation should ensure adequate power for the analysis specified in the protocol. Because the protocol should specify how missing data are handled in the analysis, determining the proper augmented sample size may require careful thought. Despite the best of intentions, trials often have some missing primary endpoint data. If some participants’ primary endpoint is missing, it is advisable to run sensitivity analyses using different approaches to handle the missing data. Ideally, these analyses
will be prespecified in the protocol so there is no suspicion that the emphasized approach gives an unfairly small P-value. There are a wide variety of approaches for dealing with missing data. The protocol should specify the missing data approach used in the primary analysis.
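One rough way to augment the sample size, shown below as an added illustration rather than a recommendation, is to inflate the number of evaluable patients by 1/(1 − d), where d is the anticipated fraction of participants with a missing primary endpoint. This simple inflation preserves power only if the prespecified analysis effectively discards incomplete observations and missingness is unrelated to treatment and outcome; otherwise the augmentation should be matched to the planned missing-data method.

```python
import math

def inflate_for_missing(n_evaluable, missing_fraction):
    """Rough inflation of the sample size for an anticipated fraction of
    missing primary endpoints, assuming a complete-case style analysis and
    noninformative missingness."""
    return math.ceil(n_evaluable / (1.0 - missing_fraction))

print(inflate_for_missing(400, 0.10))  # 400 evaluable patients needed, 10% missing -> enroll 445
```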
6 CENSORED PRIMARY ENDPOINTS
Studies where some patients are expected to die before the primary endpoint can be ascertained present problems in analysis and interpretation. If a treatment causes deaths and the groups are compared in terms of a measurement on the surviving patients (e.g., viremia), one might conclude that the treatment is beneficial when in fact it is deadly. Often, careful definition of the primary endpoint can avoid this situation. If a substantial death rate is anticipated, perhaps mortality or a composite involving mortality should be used. One might use time to first event of a composite list of events or a ranking scheme as mentioned earlier. Another possibility is to define the primary endpoint soon enough after randomization so that the number of deaths will be small. Ignoring the deaths, unless there are just a few, is not a good strategy. At times, none of these approaches will be satisfactory. For example, in trials of certain chronic diseases, such as trials comparing different approaches to kidney dialysis, an appealing endpoint is the rate of change of a measure of kidney function, such as the glomerular filtration rate (GFR), compared between the groups (11). A simple random effects model applied to such data could be misleading as patients who die have fewer measurements and thus contribute less to the estimate of the mean slope. If treatment does nothing but quickly kill off those with rapid declines in GFR, the survivors will have better slopes, and a deadly treatment may end up looking better than the control. A simple though inefficient way around this problem is to calculate a slope for each individual and then treat these slopes as the primary outcomes and compare slopes between the two groups, with everyone weighted equally. More sophisticated methods are also possible, but all rely
on essentially unexaminable assumptions, and trials with such endpoints need to be interpreted very cautiously (12). Primary endpoints may also be censored by causes other than death. For example, in blood pressure trials, hypertensives are randomized to different treatments, and change in blood pressure (e.g., over 6 months) is often used as the primary endpoint. However, if a patient’s blood pressure exceeds a safety threshold during the follow-up period, the patient needs to receive open-label treatment; so the final reading reflects the rescue medication rather than the assigned medicine. In such trials, it is common to use the last pre-rescue blood pressure in lieu of the 6-month value, or the last observation carried forward (LOCF). In general, the LOCF is popular but can introduce problems (13). A related, appealing approach is the last rank carried forward (14). Problems of censoring also arise in trials of HIV vaccines where postinfection viremia or CD4 count is a primary endpoint, but the viral loads after antiretroviral therapy (ART) are often undetectable. For such trials, an early endpoint, such as the first viral load measurement after detection of infection, may minimize ART-related censoring. Other possibilities are to use a rank-based procedure where those who initiate ART receive the worst rank (15), or to use a duration endpoint of time to ART or viremia >50,000 copies/mL (16).
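A minimal sketch of the rank-based idea is given below; it is an added illustration with invented data and an invented scoring rule (patients who initiate ART are simply assigned a value worse than any observed viral load), not the specific procedure of reference (15).

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Invented post-infection log10 viral loads; np.nan marks patients who began
# ART before the measurement and therefore contribute no usable value.
vaccine_vl = np.array([3.1, 4.0, 2.5, np.nan, 3.6])
placebo_vl = np.array([4.8, 5.2, np.nan, np.nan, 4.1])

# Give ART initiators a score worse than any observed load, pooled over arms,
# so that they occupy the worst ranks in the combined ranking.
worst = np.nanmax(np.concatenate([vaccine_vl, placebo_vl])) + 1.0
vaccine_scores = np.where(np.isnan(vaccine_vl), worst, vaccine_vl)
placebo_scores = np.where(np.isnan(placebo_vl), worst, placebo_vl)

# One-sided rank test: does the vaccine arm tend toward lower (better) scores?
print(mannwhitneyu(vaccine_scores, placebo_scores, alternative="less"))
```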
7 SURROGATE PRIMARY ENDPOINTS
Surrogate endpoints are meant to stand in place of a true endpoint that is more difficult to obtain—e.g. more expensive, rarer, or takes longer to appear (17–19). A decisive result on a validated surrogate implies a result on the true endpoint of interest. Examples of surrogates used in this fashion are short-term change in viral load for antiretroviral drugs in HIV infection and blood pressure change for some antihypertensive interventions such as diet or exercise. However, validated surrogates are not common. Unvalidated surrogates have not been proven to stand in place of a true endpoint, and such surrogates need to be interpreted on their own. Use of an unvalidated surrogate will likely result in further study,
either to validate the surrogate or more likely to design a definitive trial using a clinical endpoint. The proper use of unvalidated surrogate endpoints is to provide early clues as to whether a therapy is worth pursuing. For example, the Postmenopausal Estrogen/Progestin Interventions (PEPI) trial showed that HRT achieved a favorable impact on surrogate cardiovascular risk factors (e.g., blood pressure and cholesterol) (20). The results of this trial encouraged the design of the Women’s Health Initiative, which ultimately showed a harmful effect of HRT on cardiovascular events. Another trial found that interleukin 2 (IL-2) therapy clearly increases the number of CD4 cells in HIV-infected patients, and low CD4 counts are clearly associated with increased risk of death and AIDS (21). Because of the favorable effect on this unvalidated surrogate, different trials are addressing whether IL-2 therapy will have a benefit on clinical events; see, for example, Emery et al. (22).
8 MULTIPLE PRIMARY ENDPOINTS
Some trials use multiple primary or co-primary endpoints. This can occur because there are several similar types of endpoints and it is difficult to choose a single one, or because the treatment has anticipated effects on rather different outcomes. For example, the PEPI trial was designed to assess the effects of HRT on four primary endpoints, each a surrogate endpoint for heart disease: systolic blood pressure, insulin, high-density lipoprotein cholesterol (HDL-C), and fibrinogen. In vaccine trials, co-primary endpoints of infection and viremia are sometimes used. If success of the study is defined as any of the primary endpoints achieving statistical significance, a multiplicity adjustment is usually employed to control the false-positive rate. For example, with the Bonferroni method, each of k endpoints is tested with a type I error rate equal to alpha/k. This method strongly controls the type I error rate. Another possibility is to use a multivariate test to combine different endpoints. O’Brien (23) discusses several types of tests. The rank-sum method is based on separately ranking each endpoint and calculating the
average rank for each person. These average ranks are then compared between the two groups by a t-test or a Wilcoxon test. A similar approach involves standardizing each person’s outcome by subtracting off the mean and dividing by the standard deviation of the outcome. The average standardized outcome for each person can then be treated as the outcome in a t-test. Other approaches involve assuming a common treatment effect across different outcomes and deriving a multivariate test (24). A nice discussion of the topic is given by O’Brien and Geller (25); see also Follmann (26). Sometimes multiple or co-primary endpoints arise when treatments need to show benefit on several endpoints to be declared efficacious. For example, an antimigraine drug might need to show benefit in terms of pain, sensitivity to light, sensitivity to sound, and sensitivity to smells. Here, each endpoint can be tested at level alpha because a treatment ‘‘wins’’ only if it wins on each of the four endpoints. This is quite different from the setting previously described where a treatment wins if it wins on any of the endpoints, and thus a multiplicity adjustment is typically required.
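The rank-sum approach can be sketched as follows (an added illustration with simulated data; the variable names are invented). Each endpoint is ranked across all patients in both groups, the ranks are averaged within patient, and the average ranks are compared between groups; a Bonferroni-style per-endpoint analysis is shown alongside for contrast.

```python
import numpy as np
from scipy.stats import rankdata, ttest_ind

rng = np.random.default_rng(1)
n = 50
group = np.repeat([0, 1], n)                 # 0 = control, 1 = treatment
# Two simulated endpoints with a modest treatment benefit on each.
e1 = rng.normal(loc=0.3 * group, size=2 * n)
e2 = rng.normal(loc=0.2 * group, size=2 * n)

# O'Brien rank-sum: rank each endpoint over all patients, then average per patient.
avg_rank = (rankdata(e1) + rankdata(e2)) / 2
print("rank-sum test:", ttest_ind(avg_rank[group == 1], avg_rank[group == 0]))

# Bonferroni alternative: test each endpoint at level alpha/k (here 0.05/2 = 0.025).
for name, e in [("endpoint 1", e1), ("endpoint 2", e2)]:
    stat, p = ttest_ind(e[group == 1], e[group == 0])
    print(name, "p =", round(p, 4), "significant at 0.025:", p < 0.025)
```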
9 SECONDARY ENDPOINTS
Primary endpoints are different from secondary endpoints. Studies stand or fall on the basis of the primary endpoint and are powered for the primary endpoint. Secondary endpoints have no such burden and can be used to explore other important aspects of an intervention. Secondary endpoints can be supportive of the primary endpoint, such as a primary endpoint of CVD mortality with secondary all-cause mortality. Secondary endpoints can also be quite different, such as diastolic blood pressure 6 months after randomization. Although the total sample size is not determined by the secondary endpoints, sometimes power calculations are performed to confirm that the sample size will be adequate or to reveal that measuring the secondary endpoint on a subset of the participants will provide adequate power. Often there is no multiplicity penalty for incorporating secondary endpoints. In the regulatory
setting, however, there can be a penalty for multiplicity if specific claims are to be made on the basis of secondary endpoints.
REFERENCES 1. C. L. Meinert, Clinical Trials: Design, Conduct, and Analysis. Oxford: Oxford University Press, 1986, pp. 66–67. 2. J. D. Neaton, D. N. Wentworth, F. Rhame, C. Hogan, D. I. Abrams, and L. Deyton, Considerations in choice of a clinical endpoint for AIDS clinical trials. Stat Med. 1994; 13: 2107–2125. 3. Trial to Reduce Alloimmunization to Platelets Study Group. Leukocyte reduction and ultraviolet B irradiation of platelets to prevent alloimmunization and refractoriness to platelet transfusions. N Engl J Med. 1997; 337: 1861–1870. 4. M. E. Halloran and I. M. Longini, Using validation sets for outcomes and exposure to infection in vaccine field studies. Am J Epidemiol. 2001; 154: 391–398. 5. J. T. Grayston, R. A. Kronmal, L. A. Jackson, A. F. Parisi, J. B. Muhlestein, et al., Azithromycin for the secondary prevention of coronary events. N Engl J Med. 2005; 352: 1637–1645. 6. D. A. Follmann, J. Wittes, and J. Cutler, A clinical trial endpoint based on subjective rankings. Stat Med. 1992; 11: 427–438. 7. D. A. Follmann, Regression analysis based on pairwise ordering of patients’ clinical histories. Stat Med. 2002; 21: 3353–3367. 8. L. E. Bjorling and J. S. Hodges, Rule-based ranking schemes for antiretroviral trials. Stat Med. 1997; 16: 1175–1191. 9. R. Collins, R. Peto, and P. Sleight, ISIS-2: a large randomized trial of intravenous streptokinase and oral aspirin in acute myocardial infarction. Br Heart J. 1989; 61: 71–71. 10. G. L. Anderson, C. Kooperberg, N. Geller, J. E. Rossouw, M. Pettinger, and R. L. Prentice, Monitoring and reporting of the Women’s Health Initiative randomized hormone therapy trials. Clinical Trials. 2007; 4: 207–217. 11. S. Klahr, A. Levey, G. J. Beck, A. W. Caggiula, L. Hunsicker, et al., for the Modification of Diet in Renal Disease Study Group. The effects of dietary protein restriction and blood-pressure control on the progression of chronic renal disease. N Engl J Med. 1994; 330: 877–884.
12. E. F. Vonesh, T. Greene, and M. D. Schluchter, Shared parameter models for the joint analysis of longitudinal data and event times. Stat Med. 2006; 25: 143–163. 13. N. Ting, Carry forward analysis. In: S. C. Chow (ed.), The Encyclopedia of Biopharmaceutical Statistics. New York: Marcel Dekker, 2003, pp. 175–180. 14. P. C. O’Brien, D. Zhang, and K. R. Bailey, Semi-parametric and non-parametric methods for clinical trials with incomplete data. Stat Med. 2005; 24: 341–358. 15. D. Follmann, A. Duerr, S. Tabet, P. Gilbert, Z. Moodie, et al., Endpoints and regulatory issues in HIV vaccine clinical trials: lessons from a workshop. J Acquir Immune Defic Syndr. 2007; 44: 49–60. 16. P. B. Gilbert, V. G. DeGruttola, M. G. Hudgens, S. G. Self, S. M. Hammer, and L. Corey, What constitutes efficacy for a human immunodeficiency virus vaccine that ameliorates viremia: issues involving surrogate end points in phase 3 trials. J Infect Dis. 2003; 188: 179–193. 17. J. Wittes, E. Lakatos, and J. Probstfield, Surrogate endpoints in clinical-trials: cardiovascular diseases. Stat Med. 1989; 8: 415–425. 18. R. L. Prentice, Surrogate endpoints in clinical trials: definition and operational criteria. Stat Med. 1989; 8: 431–440. 19. M. D. Hughes, Evaluating surrogate endpoints. Control Clin Trials. 2002; 23: 703–707. 20. PEPI Study Group. Effects of estrogen or estrogen/progestin regimens on heart-disease risk-factors in postmenopausal women—the Postmenopausal Estrogen/Progestin Interventions (PEPI) trial. JAMA. 1995; 273: 199–208. 21. J. A. Kovacs, S. Vogel, J. M. Albert, J. Falloon, R. T. Davey, et al., Controlled trial of interleukin-2 infusions in patients infected with the human immunodeficiency virus. N Engl J Med. 1996; 335: 1350–1356. 22. S. Emery, D. I. Abrams, D. A. Cooper, J. Darbyshire, H.C. Lane, et al., and the ESPRIT Study Group. The evaluation of subcutaneous Proleukin (interleukin-2) in a randomized international trial: rationale, design, and methods of ESPRIT. Control Clin Trials. 2002; 23: 198–220. 23. P. C. O’Brien, Procedures for comparing samples with multiple endpoints. Biometrics. 1984; 40: 1079–1087. 24. S. Pocock, N. Geller, and A. Tsiatis, The analysis of multiple endpoints in clinical trials. Biometrics. 1987; 43: 487–498.
25. P. C. O’Brien and N. L. Geller, Interpreting tests for efficacy in clinical trials with multiple endpoints. Control Clin Trials. 1997; 18: 222–227. 26. D. Follmann, Multivariate tests for multiple endpoints in clinical trials. Stat Med. 1995; 14: 1163–1176.
FURTHER READING D. M. Finkelstein and D. A. Schoenfeld (eds.), AIDS Clinical Trials. New York: Wiley, 1995. J. F. Heyse, Outcome measures in clinical trials. In: P. Armitage and T. Colton (eds.), The Encyclopedia of Biostatistics. New York: Wiley, 2005, pp. 3898–3904. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline: E9 Statistical Principles for Clinical Trials. Current Step 4 version, 5 February 1998. Available at: http://www.ich.org/LOB/media/MEDIA485.pdf S. Piantadosi, Clinical Trials: A Methodologic Perspective, New York: Wiley, 2005, pp. 187–209. S. J. Pocock, Clinical Trials: A Practical Approach, New York: Wiley, 1983, pp. 41–49, 188–191.
CROSS-REFERENCES
Multiple Endpoints
Randomization
Blinding
Intention-to-Treat Analysis
Bonferroni Adjustment
LOCF
Missing Values
Multiple Comparisons
Bias
Time to Disease Progression
Surrogate Endpoints
PRIORITY REVIEW Priority Review is a designation for an application after it has been submitted to the U.S. Food and Drug Administration (FDA) for review for approval of a marketing claim. Under the Food and Drug Administration Modernization Act (FDAMA), reviews for New Drug Applications are designated as either Standard or Priority. A Standard designation sets the target date for completing all aspects of a review and the FDA taking an action on the application (approve or not approve) at 10 months after the date it was filed. A Priority designation sets the target date for the FDA action at 6 months. A Priority designation is intended for those products that address unmet medical needs. The internal FDA procedures for the designation and responsibilities for Priority Review are detailed in the Center for Drug Evaluation and Research (CDER) Manual of Policies and Procedures.
This article was modified from the website of the United States Food and Drug Administration (http://www.accessdata.fda.gov/scripts/cder/ onctools/Accel.cfm#Priority) by Ralph D’Agostino and Sarah Karl.
PROGNOSTIC VARIABLES IN CLINICAL TRIALS
VANCE W. BERGER
A. J. SANKOH
Biometry Research Group
National Cancer Institute
Bethesda, Maryland
A prognostic factor or variable is a patient characteristic that can predict that patient’s eventual response to an intervention. Prognostic variables include any and all known factors that could potentially impact the patient’s response outcome, including baseline demographic, concomitant illness, and medication factors that are likely to modify any treatment benefit, and those that may predict adverse reactions. These factors are also sometimes referred to as confounders because the presence of an imbalance at baseline between the treatment groups due to such factors may result in an apparent treatment effect when in fact none exists, or mask an effect that does in fact exist. Prognostic factors also include those potential confounders that have been used as stratification factors for randomization. Stratified randomization is often used when a baseline characteristic, such as tumor stage in cancer trials or baseline Expanded Disability Status Scale in multiple sclerosis trials, is known to affect the outcome or response. In such cases, the baseline characteristic is used as a stratification factor in the randomization scheme to minimize imbalance between treatment groups. Similarly, when the study population contains subgroups of particular interest, the characteristics defining these subgroups should be considered as potential prognostic factors. For example, in a long-term trial of a new therapy for preventing heart attack, diabetes mellitus would be a potential confounder because patients with diabetes are known to have a much higher risk of heart attack than those without diabetes. In such a trial, those patients with diabetes would also be an interesting subgroup in whom the effects of the intervention might be unique. Likewise for concomitant medication; for instance, aspirin (which would substantially reduce the risk of heart attack) could confound the trial results if there was an imbalance between trial groups in the number of patients taking aspirin in a clinical trial for the prevention of heart attacks. Similarly for the level of concomitant interferon a patient is taking while participating in a clinical trial for the treatment of multiple sclerosis. Both aspirin and interferon therapy might also influence the likelihood of adverse reactions to the study therapy.
As pointed out in Kinukawa et al. (1), baseline imbalance in prognostic factors between treatment groups incidental to randomization may produce misleading results, particularly when dealing with only moderately effective treatments or very heterogeneous patients. It is generally believed that sufficiently large trials, in terms of the number of patients enrolled, can render baseline imbalances virtually impossible. But in fact there are different causes of baseline imbalances, and only some of these are influenced by the sample size. Thus, the identification of and statistical planning for handling influential prognostic factors ‘‘is an integral part of the planned analysis and hence should be set out in the protocol. Pre-trial deliberations should identify those covariates and factors . . . and how to account for these. . . . Special attention should be paid to the role of baseline measurements of the primary variable’’ (2). For example, clinical trials generally record the gender and age of all patients, and then check for baseline balance (across treatment groups) with regard to these prognostic factors, among others. A lack of baseline balance is problematic because it may result in confounding. In a well-designed and conducted clinical trial in which patients are randomly assigned to two or more treatment groups, except for differences in the experimental procedure applied to each treatment group, the groups should ideally be treated exactly alike. Under these circumstances, any differences between the groups that are statistically significant are attributed to differences in the treatment conditions. This, of course, assumes
that except for the various treatment conditions the groups were, in fact, treated exactly alike. Unfortunately, it is always possible that despite an experimenter’s best effort there are some unsuspected systematic differences in the way the groups are treated (in addition to the intended treatment conditions). Factors that lead to systematic differences of this sort are generally referred to as confounding factors. Thus, if age is prognostic for outcome, with younger patients responding better in general than older patients, and if one treatment group has both younger patients and better responses than the other treatment group, then it is unclear what is causing the better responses. Is it the treatment itself? Or is it the fact that patients receiving this treatment were healthier and more likely to respond to begin with by virtue of being younger? Though many statistical methods have been developed to distinguish the one from the other, there really is no way to tell for sure, without having to assume a model that enables one to act as though the counterfactual responses to the other treatment (the treatment not received) are known. So, if gender is strongly prognostic for treatment outcome, then one would like to ensure a fair comparison even if one group is advantaged with more of the better responders. For example, suppose that women tend to respond better than men, and that more women received the active treatment and more men received the control treatment. In such a case, the inference would be unclear if only the unadjusted analysis were conducted and showed a higher response rate for the active treatment group. The comparison is confounded because the group with better responders to start with (having nothing to do with the treatment itself) ended up with the better response rate. Essentially, a race was won, and the runner with the head start won it. So who is faster? Clearly, one would like to know if women treated with the active treatment fared better than women treated with the control treatment; likewise, one wants to know if men treated with the active treatment fared better than men treated with the control treatment.
1 A GENERAL THEORY OF PROGNOSTIC VARIABLES We have mentioned the counterfactual— that is, the notion that a patient who received one treatment might have received the other treatment had the randomization turned out differently. This leads naturally to the classification of patients by their principal strata (3, 4). In the simplest formulation, the response is binary, and there are two treatments, so each patient does or does not respond to each treatment, and there are four types of patient: (0,0) patients do not respond to either treatment. (0,1) patients respond to the active treatment but not to the control. (1,0) patients respond to the control but not to the active treatment. (1,1) patients respond to both treatments. Though it may be tempting to posit the nonexistence of the third type, in fact there are many reasons why a patient might respond to the control but not to the active treatment, even if the control is a placebo. Some of these reasons were articulated by Berger et al. (5). Although generally unrecognized, the entire exercise of identifying and testing covariates (or predictors, prognostic factors, confounders, or covariables, which will be treated interchangeably in this article) in clinical trials is, for the most part, nothing more than an attempt to distinguish patients by their potential outcomes, or principal strata. The principal strata are never observed directly, even in a crossover trial, because one cannot simultaneously treat a given patient with only one treatment and with only the other treatment. Using both treatments at the same time leads to clear confounding, and using both treatments at different times leads to the possibility of carryover effects, or residual effects of the first treatment interfering with the effects of the second treatment. So suffice it to say that we will never know which patients are of each of the four types, but we can surmise that sufficiently homogeneous sets of patients will all be of the same type. This is why we collect information on prognostic variables.
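As a small added numerical illustration (the stratum proportions below are invented), the response rate observed in each randomized arm is a mixture over these unobserved strata, so the two arms can show different response rates even though no individual patient’s stratum is ever observed.

```python
# Hypothetical proportions of the four principal strata, written as
# (response to control, response to active); these numbers are invented.
strata = {
    (0, 0): 0.40,  # respond to neither treatment
    (0, 1): 0.25,  # respond only to the active treatment
    (1, 0): 0.05,  # respond only to the control
    (1, 1): 0.30,  # respond to both treatments
}

# Observed arm-level response rates are mixtures over the unobserved strata.
p_control = sum(p for (rc, ra), p in strata.items() if rc == 1)
p_active = sum(p for (rc, ra), p in strata.items() if ra == 1)
print(f"control response rate: {p_control:.2f}")  # 0.35
print(f"active response rate:  {p_active:.2f}")   # 0.55
```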
Another way to view prognostic factors is by appeal to recognizable subsets. Using gender as a covariate, for example, creates two recognizable subsets: females and males. The same is true for age bracket if age is used, and so on. In general, if a large group of patients is treated with the same treatment, then some will respond and some will not. If we classify this large group by gender, then we may still find that some women respond and some do not, and also that some men respond and some do not. Still, progress would have been made if the gender-specific response rates differ appreciably because this would mean that gender partially explains the variability in responses. That is, if 80% of the women and only 20% of the men respond to the same treatment, then we are better off tabulating response by gender, and the recognizable subsets ‘‘female’’ and ‘‘male’’ become useful. Contrast this with the notion of classifying patients after the fact by their response. Doing so results, tautologically, in a 100% response rate among the responders and a 0% response rate among the nonresponders. But the set of responders and the set of nonresponders are not recognizable subsets until after the study, so they are useless as predictors of response. Still, we can hope to improve the specificity of the recognizable subsets by using more predictors. So, for example, we may find greater homogeneity among men over 40 years of age than we do among all males. If so, then we would like the men over 40 years to be a recognizable subset, meaning that we would need both gender and age (or at least an indicator of age above or below 40 years) in the model. Of course, age need not be dichotomized at all, and could result in a large number of recognizable subsets, especially when crossclassified with gender and other predictors. In the limit, with a large number of covariates relative to the number of patients, each patient will constitute his or her own stratum. In this case, it is tautological that each stratum will be homogeneous with respect to the response pattern. However, this is not progress because a consumer of the research would be unable to determine how to apply the results. Each stratum, or recognizable subset, would be too narrow, and nobody in the trial would belong to any of these strata.
The goal is to be parsimonious and select a relatively small number of predictors to obtain a relatively small number of strata, with a relatively large number of patients per stratum. Of course, taken together, these strata should be exhaustive, or partition the target population. We also want these strata to be as homogeneous as possible for response type. So the most useful prognostic variable would be one that would perfectly predict response to both treatments. For example, suppose that all women under 40 years of age would respond to both treatments (1,1), all women over 40 years would respond to just the active treatment (0,1), all men under 40 years would respond to just the control (1,0), and no men over 40 years would respond to either treatment (0,0). Then gender and age would be jointly sufficient for predicting (perfectly) the response pattern of each patient, and there would be no need to consider any other covariates. Of course, the situation described is highly artificial; in practice, it is often the case that many covariates are required to create any separation in distributions of response types among the various strata created by the covariates. 2 VALID COVARIATES AND RECOGNIZABLE SUBSETS In practice, it is unfortunate that unrecognizable subsets are often created and taken to be recognizable subsets, by at least two distinct mechanisms. First, though uncommon, sometimes variables measured subsequent to randomization are used as covariates. Theoretically, it might be valid to use as a covariate a variable that is measured subsequent to randomization if it cannot be influenced by the treatments administered. For example, in an asthma study, air quality might be predictive of the outcome of a patient exposed to the air in question; but, with some caveats, it should not be influenced by the treatment administered. Air quality at the time a patient is enrolled (or at the time a patient is treated), or average air quality for the time a patient is in the study, would then be an ‘‘external’’ covariate (6) that could be used to distinguish some patients from others. One caveat would be that ineffective treatment
might cause a patient to spend more time in a less urban area for improved air quality, so the treatment could influence the covariate. But the covariate could instead be air quality at the clinic and not where the patient happens to travel, in which case the treatment administered to a given patient should not influence the covariate, which would still discriminate patients better than the cruder covariate ‘‘center’’ or ‘‘clinic’’ because even at a given center the air quality will change over time and patient enrollment is generally conducted over time. So, although air quality may not be a patient characteristic, it nevertheless becomes a de facto patient characteristic that indicates something about the conditions prevailing at the time this patient was in the study. As an aside, there is a precedent for covariates being de facto patient characteristics rather than true patient characteristics. Berger (7) drew a distinction between covariates that are true patient characteristics and covariates that more like credentials. As a sports analogy, one can predict the success of an athlete by his or her running speed and endurance, but one can also use credentials such as being named to an all-star team. But these are fundamentally different types of predictors; the former are true athlete characteristics, and the latter are credentials that indicate only that someone else has deemed this athlete worthy. The so-called reverse propensity score (7) is a covariate that is more like a credential than a patient characteristic. Huusko et al. (8) and Bookman et al. (9) recently used as covariates variables measured subsequent to randomization that were not of the external type (such as air quality), and most certainly could have been influenced by the treatments administered (i.e., they were ‘‘internal’’ covariates). The problem with this approach has long been recognized. Prorok et al. (10) pointed out the threat to validity when covariates can be influenced by treatments (and labeled these covariates as ‘‘pseudo-covariates’’). Essentially, the problem is that the treatment would then be responsible for what is mistakenly taken to be preexisting differences. These effects, though in fact attributable to the treatments,
would instead be attributed to the covariate. It is clear how this misattribution can lead to a type II error (missing a true treatment effect), but in fact it can also lead to a type I error (the appearance of a treatment effect when in fact none exists). See Berger (11) for details. Yusuf et al. (12) labeled the subgroups formed by postrandomization covariates ‘‘improper subgroups’’ to distinguish them from the proper (recognizable) subgroups one obtains with valid covariates. The second mechanism for mistaking unrecognizable subsets for recognizable subsets is through the use of the prerandomization run-in period and the associated patient selection. Specifically, the idea is to pretreat all patients with the active treatment before randomization, and then to randomize only those patients who respond well to the active treatment. Depending on the trial, ‘‘responding well’’ may consist of not experiencing adverse events, adhering to the treatment regimen, or even showing signs of efficacy. In essence, then, the response during the run-in is being used to form subgroups, and only one subgroup goes on to get randomized and to enter the primary part of the trial. The problem is that this is not a recognizable subset until after all the patients are treated. In other words, there is no way to know at the outset which patients will and will not respond to the treatment during the run-in phase, so the practice is similar to a stockbroker offering as credentials the return on only those stocks he picked that went on to great returns. See Berger et al. (5) for more details regarding the drawbacks of this method. 3 STRATIFIED RANDOMIZATION AND ANALYSIS The major objective of a comparative trial is to provide a precise and valid treatment comparison. The trial design and conduct can contribute to this objective by preventing bias and ensuring an efficient comparison. An efficient randomization scheme is key to achieving this objective. Randomization eliminates some forms of selection bias and tends to balance treatment groups with respect to subjects and prognostic factors (both known and unknown). In particular, stratified randomization can help to balance treatment
assignments within prognostic factors, especially those that are known to impact treatment outcome. As a result, stratification may prevent type I errors and improve power for small trials, but only when the stratification factors have a large effect on prognosis. Stratification also has some benefits in large trials when interim analyses are planned with small numbers of patients and in trials designed to show the equivalence of two therapies. Theoretical benefits include facilitation of subgroup analysis and interim analysis. The maximum desirable number of strata is unknown, but experts argue for keeping it small. The general rule is that stratified randomization factors should always be considered in the analysis model. However, there are situations in which it is not realistic to include all stratification factors in the model. For example, for major cardiovascular event trials with large numbers of centers, randomization is often stratified by center, among other factors. But it is often impractical to include (individual) study centers in the model. Therefore, for some covariates, the purpose of using them as stratification factors in the randomization scheme is to balance them across treatments and not necessarily to include them in the analysis model. For early stage trials, the objective may be to understand the relationship between the covariates and the primary endpoints, and exploratory or hypothesis-generating analyses can be performed to determine which covariates are most useful. Of course, it is always possible that covariates may be extremely predictive in one study and not predictive at all in a future study, so it may be a good idea to retain the right to select the covariates based on the data at hand. Though this could lead to abuse, there are valid ways to do this (13). For example, one could prespecify not the covariates themselves but rather the objective mechanism to be used to select the covariates. Doing so in a transparent way would make it clear to all parties that the data were not dredged so as to find the set of covariates that made the most impressive case for the active treatment. Stratification may be conducted during the design stage, during the analysis stage,
or both. Stratifying the randomization (during the design stage) consists of randomizing separately for each stratum formed by the appropriate covariates. It may be particularly useful to stratify the randomization in small studies. For later stage trials, one usually has a good idea regarding the associations between covariates and the primary endpoint. Based on this knowledge, it is possible that stratified randomization based on these known factors could be performed and the covariates for the primary analysis specified prospectively. Again, the maximum desirable number of strata is unknown, but experts argue for keeping it small. If too many covariates have a potential impact on the primary endpoint, then stratified randomization may not be ideal. This is because it would create too many subsets, or, rather, too few patients per subset (stratum). In such a case, one could use a randomization technique that attempts to balance all prognostic factors rather than just those that are specified. One good choice is the maximal procedure, the benefits of which are described by Berger (14). Briefly, the maximal procedure was designed to balance the need for allocation concealment (the inability to predict future allocations with or without complete or partial knowledge of past allocations) with the need for balance (ensuring comparable numbers of patients in each group at all times). Other commonly used adaptive or dynamic randomization schemes that allow flexibility in patient assignment to treatment groups to achieve balance in patient numbers and prognostic factors include Efron’s biased coin, minimization, and play the winner randomization designs. 4 STATISTICAL IMPORTANCE OF PROGNOSTIC FACTORS The identification of and adjustment for prognostic factors is an important component of clinical trial design, conduct, and analysis. Valid comparisons of treatment difference may require the appropriate adjustment for relevant prognostic factors. The greatest opportunity for obtaining misleading results by excluding covariates from the analysis occurs when the covariate is strongly prognostic
(related to the outcome) and/or grossly unbalanced across treatment groups. Kinukawa et al. (1) and Berger (13) considered both of these factors in determining an objective method for ranking and selecting covariates to include in the model. This covariate adjustment helps correct for the groups’ predisposition to behave differently from the outset because of the imbalance in the covariate. Adjusted analyses also tend to be more precise (less variable) than unadjusted analyses (15, 16). For late stage trials, the prognostic factors that lead to such reduction in variability are usually known beforehand, and thus the adjustment is prospectively specified in the protocol and in the statistical analysis plan before the data are collected and analyzed. The design of the study, including sample size estimation and the analysis method, will thus take these variance reductions into account. The simplest situation of a covariateadjusted analysis is probably the case of binary data and a single binary covariate, such as gender. This is the situation Berger (17) considered in some detail. The methods used for this case, and for many other cases, can be classified as parametric or nonparametric, with the parametric methods relying on more assumptions, and hence being generally less robust. Caution is required because some methods labeled as ‘‘nonparametric’’ actually still make use of questionable assumptions. The (true) nonparametric approach offers not only superior robustness but also greater clarity of understanding. So for concreteness, consider again the gender discussion. That is, consider a clinical trial with two treatments, a binary outcome (each patient responds or fails to respond), and gender as the only covariate. If gender is prognostic for outcome, then one would like to ensure a fair comparison even if one group is advantaged with more of the better responders. If it is known that women tend to respond better than men, and that more women received the active treatment and more men received the control treatment, then one would like to know if women treated with the active treatment fared better than women treated with the control treatment, and, likewise, if men treated with the active
treatment fared better than men treated with the control treatment. In evaluating the treatment effect, there is no need to compare the outcomes of men with the outcomes of women, as an unadjusted analysis implicitly does. One can also obtain separate gender-specific P-values for the treatment comparison, one for women and one for men. In general, however, a single combined P-value is preferred. How does one combine the two comparisons to obtain a single P-value? One of the oldest and most commonly used methods for combining P-values from multiple tests is Fisher's combined probability test. This test is based on comparing −2 times the sum of the natural logarithms of the individual P-values to a chi-square distribution with 2n degrees of freedom, where n is the number of (independent) P-values being combined. Other methods include the weighted Z-test and nonparametric resampling with permutation tests. In the case of the resampling permutation tests, one would hypothetically randomize the patients again and again, and tabulate the data for each hypothetical pseudo-dataset created by this hypothetical re-randomization (18). Berger (17) distinguished three exact nonparametric covariate-adjusted analyses for this situation. First, one could use an adjusted test statistic, such as a weighted average of gender-specific estimates of treatment effects. Second, one could use an unadjusted test statistic, such as the overall difference in response rates across treatment groups (ignoring gender), but permuting so as to preserve the observed level of covariate imbalance. In other words, one would assess the extremity (as defined by the alternative hypothesis that the active treatment is superior to the control in bringing about responses) of the observed dataset relative not to all other possible datasets but rather relative to only those that are equally unbalanced. So the hypothetical re-randomization would be restricted to keep the observed number of men and women in each treatment group. Third, one could combine the covariate-adjusted test statistic with the adjusted strategy for restricting the hypothetical re-randomization.
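As a concrete illustration of the two combination strategies just described, the following base-R sketch computes Fisher's combined P-value from gender-specific tests and then carries out the restricted (within-gender) re-randomization test with an unadjusted statistic. The data are simulated and all variable names are hypothetical.

```r
## Hypothetical trial: binary response, binary treatment, gender as the covariate
set.seed(1)
n_f <- 60; n_m <- 40
gender <- c(rep("F", n_f), rep("M", n_m))
trt    <- c(sample(rep(0:1, n_f / 2)), sample(rep(0:1, n_m / 2)))  # randomized within gender
resp   <- rbinom(n_f + n_m, 1, ifelse(gender == "F", 0.60, 0.40) + 0.15 * trt)

## Fisher's combined probability test: -2 * sum(log p) compared to chi-square with 2n df
fisher_combine <- function(p) pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)
p_by_gender <- sapply(split(seq_along(resp), gender), function(idx)
  fisher.test(table(trt[idx], resp[idx]))$p.value)
fisher_combine(p_by_gender)

## Unadjusted statistic, but re-randomizing only within gender, so every permuted
## data set keeps the observed gender-by-treatment imbalance
diff_rate <- function(tr) mean(resp[tr == 1]) - mean(resp[tr == 0])
obs  <- diff_rate(trt)
perm <- replicate(5000, {
  tr_star <- trt
  for (g in unique(gender)) {
    idx <- which(gender == g)
    tr_star[idx] <- sample(trt[idx])
  }
  diff_rate(tr_star)
})
mean(perm >= obs)   # one-sided permutation P-value
```

The second block corresponds to the second of the three approaches listed here; swapping a covariate-adjusted statistic in place of diff_rate() gives the third.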
It turns out that these three approaches actually are different. That is, although any two, or even all three, might agree for some datasets, they will also differ from each other for some datasets. See Berger (17) for more details. More work is required to determine the ideal covariate adjustment technique for any given situation. One of the most commonly used statistical models in medical research is the logistic regression model for binary response data, with continuous and/or binary predictors. Logistic regression models the effects of prognostic factors x_j in terms of a linear predictor of the form Σ_j x_j β_j, where the β_j are the parameters. In the generalized additive model, f_j(x_j) replaces x_j β_j, where f_j is an unspecified (nonparametric) function that can be estimated using a scatterplot smoother. The estimated function f_j(x_j) can reveal possible nonlinearities in the effect of x_j. It is also not uncommon to use analysis of covariance (ANCOVA) when there are continuous predictors (in addition to treatment, which is categorical) and the response is continuous. In addition to the objection to assuming normality, criticisms of this approach include the model supposition that one can meaningfully compare the responses of patients with certain values of covariates with patients with any other values of covariates. This is certainly problematic. What, for example, is to be made of a finding that a patient with a lower initial blood pressure and then treated with one treatment has the same change in blood pressure as a different patient with a higher initial blood pressure and treated with the other treatment? Is this equality? Many models would say yes, yet the reality is that the change effected may be more impressive (for the treatment) when it is applied to the patient with the higher initial blood pressure. Or, conversely, it may be a better indication of efficacy if the change is for the patient with the lower initial blood pressure. Because ANCOVA is used so often with data that not only are not normally distributed but also are ordinal and not even continuous, let us consider another example, with ordinal data. Suppose that pain is measured before and after treatment for each patient and is recorded on a three-point scale as "none," "moderate," or "severe." How does a shift from severe to moderate compare with
a shift from moderate to none? Each represents an improvement of one category, so it is tempting to call them equal. In fact, this is often done. But are they really equal? Who is in a position to say? Is it not possible that some patients would value being pain-free the most, and so to these patients the shift from moderate to none is much larger than the shift from severe to moderate? Is it not also possible that other patients want only to avoid severe pain, and so they would rank the shifts differently? Is it not even possible that a given patient would, at different times, rank the two shifts differently, at one time preferring one and at a later time preferring the other? The logical conclusion is that some shifts simply cannot be compared. If we accept this premise, then we need special analyses to take into consideration the fact that there is only a partial ordering, and not a complete ordering, on the set of outcomes. The partial ordering is sufficiently rich to allow for a precise analysis, but one must be careful in determining which outcomes can be compared with which others. There are three kinds of improvements: severe to moderate (SM), severe to none (SN), and moderate to none (MN). Clearly, SN dominates the other two, which cannot be directly compared with each other. Berger et al. (19) developed a specialized analysis technique to exploit this partial ordering but avoid using any ‘‘pseudo-information’’ derived from artificially comparing outcomes that are inherently noncomparable. The fundamental idea is to derive a U-statistic based on all possible paired comparisons (each patient in one treatment group compared with each patient in the other treatment group), with each comparison categorized as [1] favoring the active treatment, [2] favoring the control treatment, [3] not favoring either treatment, or [4] being indeterminate. Re-randomization is then conducted with this test statistic used to derive the permutation (nonparametric) P-value. More work is required before this method can be applied to clinical trials with more than two treatment groups. REFERENCES 1. K. Kinukawa, T. Nakamura, K. Akazawa, and Y. Nose, The impact of covariate imbalance
on the size of the logrank test in randomized clinical trials. Stat Med. 2000; 19: 1955–1967. 2. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline: E9 Statistical Principles for Clinical Trials. Step 4 version, February 5, 1998. Available at: http://www.ich.org/LOB/media/MEDIA485.pdf 3. C. Frangakis and D. B. Rubin, Principal stratification in causal inference. Biometrics. 2002; 58: 21–29. 4. D. B. Rubin, Direct and indirect causal effects via potential outcomes. Scand J Stat. 2004; 31: 161–170. 5. V. W. Berger, A. Rezvani, and V. Makarewicz, Direct effect on validity of response run-in selection in clinical trials. Control Clin Trials. 2003; 24: 156–166. 6. J. D. Kalbfleisch and R. L. Prentice, The Statistical Analysis of Failure Time Data. New York: Wiley, 1980. 7. V. W. Berger, The reverse propensity score to manage baseline imbalances in randomized trials. Stat Med. 2005; 24: 2777–2787. 8. T. M. Huusko, P. Karppi, V. Avikainen, H. Kautiainen, and R. Sulkava, Randomized, clinically controlled trial of intensive geriatric rehabilitation in patients with hip fracture: subgroup analysis of patients with dementia. BMJ. 2000; 321: 1107–1111. 9. A. M. Bookman, K. S. Williams, and J. Z. Shainhouse, Effect of a topical diclofenac solution for relieving symptoms of primary osteoarthritis of the knee: a randomized controlled trial. CMAJ. 2004; 171: 333–338.
10. P. C. Prorok, B. F. Hankey, and B. N. Bundy, Concepts and problems in the evaluation of screening programs. J Chronic Dis. 1981; 34: 159–171. 11. V. W. Berger, Valid adjustment for binary covariates of randomized binary comparisons. Biom J. 2004; 46: 589–594. 12. S. Yusuf, J. Wittes, J. Probstfield, and H. A. Tyroler, Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. JAMA. 1991; 266: 93–98. 13. V. W. Berger, A novel criterion for selecting covariates. Drug Inf J. 2005; 39: 233–241. 14. V. W. Berger, Selection Bias and Covariate Imbalances in Randomized Clinical Trials. Chichester, UK: Wiley, 2005. 15. J. M. Neuhaus, Estimation efficiency with omitted covariates in generalized linear models. J Am Stat Assoc. 1998; 93: 1124–1129. 16. L. D. Robinson and N. P. Jewell, Some surprising results about covariate adjustment in logistic regression models. Int Stat Rev. 1991; 59: 227–240. 17. V. W. Berger, Nonparametric adjustment techniques for binary covariates. Biom J. 2005; 47: 199–205. 18. V. W. Berger, Pros and cons of permutation tests in clinical trials. Stat Med. 2000; 19: 1319–1328. 19. V. W. Berger, Y. Y. Zhou, A. Ivanova, and L. Tremmel, Adjusting for ordinal covariates by inducing a partial ordering. Biom J. 2004; 46: 48–55.
PROPENSITY SCORE

PAUL R. ROSENBAUM
University of Pennsylvania, Department of Statistics, Philadelphia, Pennsylvania

In the simplest randomized trial, a coin is flipped for each patient, determining assignment to treatment or control, so each patient has the same probability, 0.5, of receiving the treatment. Randomization ensures that no systematic relationship exists between a patient's initial prognosis and the treatment received, so treated and control groups tend to be comparable before treatment, and differing outcomes in treated and control groups reflect effects of the treatment. More precisely, randomization ensures that the only differences between treated and control groups before treatment are because of chance (the flip of a coin that assigned one patient to treatment and the next to control), so, if a statistical test concludes that the difference is too large to be caused by chance, then a treatment effect is demonstrated (2). In contrast, in observational studies in which randomization is not used to assign subjects to treatment or control, different types of patients may have different chances of assignment to treatment or control (3). For instance, in an observational study comparing coronary artery bypass graft surgery (CABG) to drug treatments, patients with good left ventricular function and substantial occlusion of several coronary arteries were more likely to receive surgery, whereas patients with either poor left ventricular function or less extensive blockage of coronary arteries were more likely to be treated with drugs (4). As a result, the CABG and drug groups were not comparable before treatment; for instance, patients with good left ventricular function were overrepresented in the CABG group and underrepresented in the drug group. Indeed, the two groups differed significantly before treatment with respect to 74 observed prognostic variables and may also have differed before treatment in ways that were not observed.

If the treated and control groups differ before treatment with respect to observed prognostic variables or covariates, then an overt bias exists, whereas if they differ with respect to covariates not observed, then a hidden bias exists. Randomization balances covariates using the laws of chance that govern coin flips, making no use of the covariates themselves, so randomization balances covariates whether observed or not, preventing both overt biases and hidden biases. In contrast, because propensity scores balance covariates using the covariates themselves, propensity scores can remove overt biases, those visible in the data at hand; but propensity scores do little or nothing about hidden biases, which must be addressed by other means (5). Propensity scores try to fix the problem introduced by unequal probabilities of treatment by comparing individuals who had the same probability of treatment given observed covariates. If two people, one treated and one control, both had the same probability of treatment, say 0.75, based on their observed covariates, then those covariates will not help to predict which of these two patients actually received treatment, and these observed covariates will tend to balance. In practice, the probability of receiving the treatment is estimated from the observed pattern of treatment assignments and the observed covariates, perhaps using a model such as a logit model. Treated and control patients with the same or similar estimated probabilities of treatment, the same estimated propensity scores, are then compared. For instance, in the study of CABG and drug treatment (4), the estimated probabilities were grouped into five strata each containing 20% of the patients, the top stratum having the highest estimated probabilities of CABG, the bottom stratum having the lowest estimated probabilities of CABG, based on the observed covariates. Within these five strata, all 74 observed covariates were approximately balanced, in fact, slightly better balanced than they would have been had patients been randomly assigned to CABG or drugs within each stratum. Of course, randomization would have balanced
both these observed covariates and unobserved covariates, whereas stratification on the propensity score only balances the observed covariates. Instead of forming a few strata, each treated patient may be matched to one control (6) or to several controls (7). Compared with stratification, matching has the advantage of permitting tighter control for individual covariates (6). Software in SAS aids in matching using either SAS Macros (8) or SAS Procs (9). The typical application of propensity scores constructs either a few strata or many matched sets that balance observed covariates. It is generally good practice in the use of propensity scores to: (1) examine the distribution of propensity scores in treated and control groups to verify that the groups overlap sufficiently to permit them to be compared, (2) to check that the strata or matched sets have indeed balanced the covariates, and (3) to improve the initial propensity score model if some covariates appear to be unbalanced. These steps are easy to implement, and are illustrated in two case studies (4, 6). Propensity scores are sometimes used in other ways. The model for the propensity score may be used directly for inference about treatment effects, without constructing strata or matched sets, possibly in conjunction with a model of the outcome (see References 5 Section 3.5 and References 10 and 11). For instance, the hypothesis of no treatment effect may be tested by logit regression of the treatment indicator on covariates and the response; moreover, this test may be inverted to obtain a confidence interval for an additive effect. Alternatively, the reciprocal of the propensity score can be used as a weight to correct for selection effects (12–14). The use of propensity scores as weights has advantages when parallel adjustments are needed in many outcomes, for example, when estimating a distribution function (14, 15). Several properties of propensity scores are known from statistical theory. First, matching or stratification on propensity scores tends to balance observed covariates (1). Second, if the only covariates one needs to worry about are the observed covariates— if the only biases are overt biases—then appropriate estimates of treatment effects may be obtained by adjusting for the propensity
score (1). Third, just as randomization is the basis for certain randomization or permutation tests, the propensity score is the basis for certain adjusted permutation tests (see Reference 5 Section 3.4 and Reference 10), which again are valid when it suffices to adjust for observed covariates (see also References 11, 16, and 17). This approach is explicit about the consequences of estimating the propensity score. If the outcome is rare but the treatment is common, as is true of the rare side effects of commonly used treatments, propensity scores have certain special advantages over competing methods of adjustment (18). In addition to theory, the performance of propensity scores has been evaluated in a simulation (19) and in a case study in which a randomized control group was set aside and a nonrandomized control group was used instead (20). Propensity scores have been generalized in several ways. If doses of treatment exist, rather than treatment or control, a one-dimensional propensity score is sometimes possible (21, 22), or several propensity scores may be used (13). For certain types of treatments, such as elective surgery, it may be unnatural to think of treatment or control and more natural to think of treating now or waiting and possibly treating at a later date. In this case, a time-dependent propensity score is possible, as is a form of risk-set matching related to Cox's proportional hazards model (23). In particular, the basic result linking propensity scores to conditional permutation tests (10) remains true with time-delayed treatments (see Reference 23 Section 4.4). Several recent reviews of propensity scores exist (see Reference 5 Section 3 and 10 and References 21 and 24), as do many applications (for instance, see References 15 and 25–28).
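To make the stratification and weighting ideas above concrete, here is a minimal base-R sketch on simulated data: the propensity score is estimated by a logit model, five equal-sized strata are formed, covariate balance is checked within strata, and a stratified and an inverse-probability-weighted treatment comparison are computed. The data-generating model and all names are hypothetical.

```r
## Hypothetical observational study: treatment z depends on covariates x1, x2
set.seed(2)
n  <- 1000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.4)
z  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 + 0.6 * x2))
y  <- rbinom(n, 1, plogis(-1.0 + 0.5 * z + 0.7 * x1))

## Estimate the propensity score with a logit model
ps <- fitted(glm(z ~ x1 + x2, family = binomial))

## Five strata of equal size on the estimated score
stratum <- cut(ps, quantile(ps, 0:5 / 5), include.lowest = TRUE, labels = 1:5)

## Check balance: standardized mean differences of each covariate within strata
std_diff <- function(x, g) (mean(x[g == 1]) - mean(x[g == 0])) / sd(x)
sapply(split(data.frame(x1, x2, z), stratum),
       function(d) c(x1 = std_diff(d$x1, d$z), x2 = std_diff(d$x2, d$z)))

## Stratified (subclassification) estimate of the difference in response rates
mean(sapply(split(data.frame(y, z), stratum),
            function(d) mean(d$y[d$z == 1]) - mean(d$y[d$z == 0])))

## Inverse-probability weighting with the reciprocal of the propensity score
w <- ifelse(z == 1, 1 / ps, 1 / (1 - ps))
weighted.mean(y[z == 1], w[z == 1]) - weighted.mean(y[z == 0], w[z == 0])
```

If balance looks poor in any stratum, the usual remedy, as described above, is to enrich the propensity model and repeat the check.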
REFERENCES 1. P. R. Rosenbaum and D. B. Rubin, The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70: 41–55. 2. R. A. Fisher, The Design of Experiments. Edinburgh: Oliver & Boyd, 1935.
3. W. G. Cochran, The planning of observational studies of human populations (with Discussion). J. Royal Stat. Soc. A 1965; 128: 134–155. 4. P. R. Rosenbaum and D. B. Rubin, Reducing bias in observational studies using subclassification on the propensity score. J. Amer. Stat. Assoc. 1984; 79: 516–524. 5. P. R. Rosenbaum, Observational Studies, 2nd ed. New York: Springer-Verlag, 2002. 6. P. R. Rosenbaum and D. B. Rubin, Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Amer. Stat. 1985; 39: 33–38. 7. H. Smith, Matching with multiple controls to estimate treatment effects in observational studies. Sociolog. Methodol. 1997; 27: 325–353. 8. E. J. Bergstralh, J. L. Kosanke, and S. L. Jacobsen, Software for optimal matching in observational studies. Epidemiology 1996; 7: 331–332. Available at: http://www.mayo.edu/hsr/sasmac.html. 9. K. Ming and P. R. Rosenbaum, A note on optimal matching with variable controls using the assignment algorithm. J. Comput. Graph. Stat. 2001; 10: 455–463. 10. P. R. Rosenbaum, Conditional permutation tests and the propensity score in observational studies. J. Amer. Stat. Assoc. 1984; 79: 565–574. 11. P. R. Rosenbaum, Covariance adjustment in randomized experiments and observational studies. Stat. Sci. 2002; 17: 286–327. 12. K. J. Anstrom and A. A. Tsiatis, Utilizing propensity scores to estimate causal effects with censored time-lagged data. Biometrics 2001; 57: 1207–1218. 13. G. W. Imbens, The role of the propensity score in estimating dose-response functions. Biometrika 2000; 87: 706–710. 14. P. R. Rosenbaum, Model-based direct adjustment. J. Amer. Stat. Assoc. 1987; 82: 387–394. 15. R. Barsky, J. Bound, K. K. Charles, and J. P. Lupton, Accounting for the black-white wealth gap: a nonparametric approach. J. Amer. Stat. Assoc. 2002; 97: 663–673. 16. J. M. Robins, S. D. Mark, and W. K. Newey, Estimating exposure effects by modeling the expectation of exposure conditional on confounders. Biometrics 1992; 48: 479–495. 17. J. M. Robins and Y. Ritov, Toward a curse of dimensionality appropriate asymptotic theory
for semi-parametric models. Stat. Med. 1997; 16: 285–319. 18. L. E. Braitman and P. R. Rosenbaum, Rare outcomes, common treatments: analytic strategies using propensity scores. Ann. Intern. Med. 2002; 137: 693–695. 19. X. S. Gu and P. R. Rosenbaum, Comparison of multivariate matching methods: structures, distances and algorithms. J. Comput. Graph. Stat. 1993; 2: 405–420. 20. R. H. Dehejia and S. Wahba, Causal effects in nonexperimental studies: reevaluating the evaluation of training programs. J. Amer. Stat. Assoc. 1999; 94: 1053–1062. 21. M. M. Joffe and P. R. Rosenbaum, Propensity scores. Amen. J. Epidemiol. 1999; 150: 327–333. 22. B. Lu, E. Zanutto, R. Hornik, and P. R. Rosenbaum, Matching with doses in an observational study of a media campaign against drug abuse. J. Amer. Stat. Assoc. 2001; 96: 1245–1253. 23. Y. P. Li, K. J. Propert, and P. R. Rosenbaum, Balanced risk set matching. J. Amer. Stat. Assoc. 2001; 96: 870–882. 24. P. R. Rosenbaum, Propensity score. In: T. Colton and P. Armitage (eds.), Encyclopedia of Biostatistics, vol. 5. New York: Wiley, 1998, pp. 3551–3555. 25. P. A. Gum, M. Thamilarasan, J. Watanabe, E. H. Blackstone, and M. S. Lauer, Aspirin use and all-cause mortality among patients being evaluated for known or suspected coronary artery disease: a propensity analysis. JAMA 2001; 286: 1187–1194. 26. S. T. Normand, M. B. Landrum, E. Guadagnoli et al., Validating recommendations for coronary angiography following acute myocardial infarction in the elderly: a matched analysis using propensity scores. J. Clin. Epidemiol. 2001; 54: 387–398. 27. L. A. Petersen, S. Normand, J. Daley, and B. McNeil, Outcome of myocardial infarction in Veterans Health Administration patients as compared with medicare patients. N. Engl. J. Med. 2000; 343: 1934–1941. 28. C. D. Phillips, K. M. Spry, P. D. Sloane, and C. Hawes, Use of physical restraints and psychotropic medications in Alzheimer special care units in nursing homes. Amer. J. Public Health 2000; 90: 92–96.
PROPORTIONAL ODDS MODELS

IVY LIU
Victoria University of Wellington, School of Mathematics, Statistics and Computer Science, Wellington, New Zealand

BHRAMAR MUKHERJEE
University of Michigan, Department of Biostatistics, Ann Arbor, Michigan

In clinical trials, scientists and physicians are searching for optimal treatment plans or better diagnoses and care for patients with a multitude of diseases. Ideally, the scale for the measured clinical response should be continuous to provide an accurate and efficient assessment of the disease or patient condition. In practice, it is often impossible to measure the disease state, improvement caused by treatment, health outcome, or quality of life using a continuous scale. It is common to assess patients using ordered categories (e.g., worse, unchanged, or better), or using some other predefined categories. With advances in modern medicine and clinical sciences, it is often possible to characterize an adverse event into different ordered subtypes based on histological and morphological terms. Chow and Liu (1) contain a description of methods to analyze ordered categorical data that are generated in clinical trials. Good (2) also mentioned analytical strategies for handling ordinal responses in clinical trials. Among the model-based methods, the proportional odds model is one of the most popular models for analyzing ordinal responses. The model considers logits of the cumulative probabilities (which are often called cumulative logits) as a linear function of covariates. Early work proposing this model includes Simon (3) and Williams and Grizzle (4). The model gained wider acceptance and popularity after the seminal article by McCullagh (5) on regression modeling for ordinal responses. The first section describes the motivation behind the model and strategies to assess model fit. Stratification by grouping patients according to certain characteristics before randomization is a commonly used technique in clinical trials. The second section presents several ways to fit the proportional odds model when stratification makes data so sparse that traditional maximum likelihood methods may fail. The last section discusses some recently proposed strategies for fitting the proportional odds model when repeated measurements occur for each subject, such as in longitudinal clinical trial studies.

1 MODEL DESCRIPTION

Methods for analyzing categorical data such as logistic regression and loglinear models were primarily developed in the 1960s and 1970s. Although models for ordinal data received some attention then (6,7), a stronger focus on the ordinal case was inspired by articles by McCullagh (5) on logit modeling of cumulative probabilities, which is called the proportional odds model. This section overviews the model.

1.1 Model Structure

For a c-category ordinal response variable Y (e.g., scaled from worse to better) and a set of predictors x (e.g., treatments or patients' characteristics) with corresponding effect parameters β, the model has the form

logit(P(Y ≤ j | x)) = α_j − β′x,    j = 1, . . . , c − 1.

The minus sign in the predictor term makes the sign of each component of β have the usual interpretation in terms of whether the effect is positive or negative, but it is not necessary to use this parameterization. The parameters {α_j}, called cutpoints, are usually nuisance parameters of little interest that satisfy the monotonicity constraint α_1 ≤ α_2 ≤ ··· ≤ α_{c−1}. For a fixed j, the proportional odds model is simply a logistic regression by collapsing the ordinal response into the binary outcome (≤ j, > j). The model implies that each of the c − 1 logistic regression models holds for the same coefficients β. Specifically, the model implies that odds ratios for describing effects of explanatory variables on
the response variable are the same for each possible way of collapsing a c-category response to a binary variable. This particular type of cumulative logit model, with effect β the same for all j = 1, . . . , c − 1, is referred to as a proportional odds model (5). Although many alternative models can be used to analyze ordinal response data (see Reference (8), Ch. 7), an advantage of using the proportional odds model is that to fit this model, it is unnecessary to assign scores to the response categories. So, when the model fits well, different studies using different scales for the response variable should give similar conclusions. This property is useful for clinical investigations because different studies evaluate patients' improvement on different ordinal scales, and the model provides a translation between the different scales. One can motivate the model by assuming that the ordinal response Y has an underlying continuous response Y* (9). Such an unobserved variable is called a latent variable. Let Y* have a mean that is linearly related to x and have a logistic conditional distribution with constant variance. Then for the categorical variable Y obtained by chopping Y* into categories (with cutpoints {α_j}), the proportional odds model holds for predictor x, with effects proportional to those in the continuous regression model. If this latent variable model holds, then the effects are invariant to the choices of the number of categories and their cutpoints. Thus, when the model fits well, different studies that use different scales for the response variable should give similar conclusions.

1.2 Model Fit

Walker and Duncan (10) and McCullagh (5) used Fisher scoring algorithms to find the maximum likelihood (ML) estimator of model parameters β. One can base significance tests and confidence intervals for the model parameters on likelihood ratio, score, or Wald statistics. For instance, in SAS (SAS Institute, Inc., Cary, NC), one can implement the model fitting using PROC LOGISTIC or PROC GENMOD (with distribution = multinomial and link = clogit). In R, the polr function in the MASS package [e.g., polr(Y ∼ covariates)] or the vglm function in the VGAM package (with a cumulative logit link) can be used for proportional odds model fitting. When
explanatory variables are categorical and data are not too sparse, one can also form chi-squared statistics to test the model fit by comparing observed frequencies to estimates of expected frequencies that satisfy the model. For sparse data or continuous predictors, such chi-squared fit statistics are inappropriate. Lipsitz et al. (11) generalized the Hosmer-Lemeshow statistic designed to test the fit of a logistic regression model for binary data to the situation with ordinal responses. It gave an alternative way to construct a goodness-of-fit test for sparse data or continuous predictors by comparing observed data to fitted counts for a partition of the possible response (e.g., cumulative logit) values. Toledano and Gatsonis (12) gave a generalization of a receiver operating characteristic curve that plots sensitivity against (1−specificity) for all possible collapsing of c categories. Kim (13) proposed a graphical method for assessing the proportional odds assumption. When proportional odds structure may not be adequate, possible strategies to improve the fit include (1) adding additional terms, such as interactions, to the linear predictor; (2) generalizing the model by adding dispersion parameters (5,14); (3) permitting separate effects for each logit for some but not all predictors, that is, replacing β by β j for some β’s (called partial proportional odds; see Reference (15)); (4) using the multinomial logit model for a nominal response, which forms baseline-category logits by pairing each category with a baseline; and (5) replacing logit link by a different link function. (see Reference (16) for reviews on different link functions). Liu and Agresti (16) also pointed out that one should not use a different model based solely on an inadequate fit of the cumulative logit model according to a goodness-of-fit test, because when the sample size is large, statistical significance need not imply an inadequate fit in a practical sense. A sensible strategy is to fit models for the separate cumulative logits and check whether the effects are different in a substantive sense.
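The following short R sketch uses the polr() function from the MASS package mentioned above to fit a proportional odds model and then refits separate binary logits for each collapsing of the response, which is the informal check of the proportional odds assumption suggested here. The data are simulated and the variable names are hypothetical.

```r
library(MASS)

## Simulated three-category ordinal outcome with a treatment and a covariate
set.seed(3)
n     <- 300
treat <- rbinom(n, 1, 0.5)
age   <- rnorm(n, 60, 10)
y     <- cut(0.8 * treat + 0.03 * (age - 60) + rlogis(n),
             c(-Inf, -0.5, 1, Inf),
             labels = c("worse", "unchanged", "better"), ordered_result = TRUE)

## Proportional odds model: one treatment effect common to both cutpoints
fit <- polr(y ~ treat + age, Hess = TRUE)
summary(fit)

## Informal check: separate binary logistic fits for each collapsing (<= j, > j)
y_gt1 <- as.numeric(as.integer(y) >= 2)   # "unchanged or better" vs "worse"
y_gt2 <- as.numeric(as.integer(y) >= 3)   # "better" vs "unchanged or worse"
coef(glm(y_gt1 ~ treat + age, family = binomial))
coef(glm(y_gt2 ~ treat + age, family = binomial))
```

If the treatment coefficients from the two binary fits differ in a substantively important way, one of the remedies listed above, such as a partial proportional odds model or a different link, may be preferable.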
2 STRATIFIED PROPORTIONAL ODDS MODEL Stratification is a well-known technique used in clinical trials. It controls the effects of confounding factors on patients’ responses to evaluate the effect caused by the factor/treatment of primary interest more precisely. The most common stratification factor is clinical center. Because of the time it takes each clinic to recruit many patients, a study often uses many centers. Such data are often sparse in the sense that only few patients are at each center. Suppose we treat the centers as predictors in the proportional odds model. It raises the issue that the ML estimator of the effect of interest is not consistent (17) for sparse data situations. For the case of binary response (c = 2), when the model holds but the number of centers is large and the data are sparse, the ML estimator tends to overestimate the true log odds ratio. In practice, this result happens whenever the number of strata grows at the same rate as the sample size, so the number of parameters also grows with sample size (18). This section includes conditioning and other approaches to improve on the ML estimator when the data are sparse or finely stratified. Of course, if data are not sparse, then the proportional odds model fitting mentioned in the previous section provides a satisfactory solution by including stratification effects in the prediction model. 2.1 Conditioning Approach Classic methods for fitting a logistic regression model to stratified data are based on the conditional likelihood principle to eliminate the stratum-specific nuisance parameters. For a stratified proportional odds model, classic conditioning techniques do not apply to eliminate nuisance parameters because there is no reduction due to sufficiency; the nuisance parameters remain even in the conditional likelihood. However, ignoring the stratification effects when they are present and using a constant effect model is also known to underestimate the cumulative odds ratios severely. Also, omitting the stratification variables may result in the response and the treatments that seem to be associated in
one direction, or not at all, when in fact the opposite is true; this phenomenon is known as Simpson's paradox. McCullagh (19,20) and Agresti and Lang (21) proposed a novel strategy of simultaneously fitting conditional models to all possible binary collapsings of the response into the binary outcome (≤ j, > j). McCullagh (20) considered this approach for a binary covariate, whereas Agresti and Lang (21) considered a categorical covariate. Mukherjee et al. (22) extended this idea to a general ordinal regression set-up with any set of covariates, categorical or continuous, and provided an alternative, simple fitting procedure for obtaining the parameter estimates and their standard errors. We refer to this method as the amalgamated conditional logistic regression (ACLR) method. The ACLR method first naively treats the different binary collapsings of the ordinal scale as independent and forms a pseudolikelihood by taking the product of the conditional likelihoods over all strata and all possible collapsings. The stratum-specific nuisance parameters are no longer present in the resultant likelihood, and maximum likelihood estimates of β can be obtained by direct optimization. Any standard software that implements conditional logistic regression for binary data can be used to produce this estimate. An appropriate formula for the sandwich estimator of the variance of the estimate is derived as in the theory of generalized estimating equations. Wald tests and confidence intervals can be constructed based on the resultant standardized test statistic. The proposed estimate is extremely easy to compute; it is unbiased but not fully efficient. However, simulation studies indicate that the estimator is robust across a wide range of stochastic patterns in the stratification effects, and it provides reasonable power for testing treatment/covariate effects across different scenarios. The two natural alternatives to the ACLR approach, the simple proportional odds model that ignores variation caused by stratification and the random effects proportional odds model, both suffer in terms of bias under model misspecification of the stratification effects. In simple situations with matched-pair data or binary covariates, the equivalence of the ACLR approach to
other conditioning approaches is also established in the paper. Overall, ACLR seems to be a readily implementable simple solution for handling nuisance parameters in stratified proportional odds models. 2.2 Other Approaches When the main focus is analyzing the association between the ordinal response and a categorical covariate while controlling for a third variable (e.g., center), it is common to display data using a three-way contingency table. The common question of primary interest is whether the response and the covariate are conditionally dependent; if so, how strong is the conditional association? The most popular non-model-based approach for testing conditional independence for stratified data is the generalized Cochran-Mantel-Haenszel (CMH) test. Originally, the CMH test is proposed for stratified 2 × 2 tables (23). Landis et al. (24) proposed generalized CMH statistics for stratified r × c tables that combine information from the strata. The test works well when the response and covariate associations are similar in each stratum. Landis et al. (25) and Stokes et al. (26) reviewed CMH methods. In addition to the test statistics, related estimators to evaluate the associations are available for various ordinal odds ratios. Liu and Agresti (17) gave the CMH-type estimator of an assumed common cumulative odds ratio in a proportional odds model for an ordinal response with several 2 × c tables. Liu (27) generalized it to the case when the covariate has more than two levels. However, when the data are very sparse, such as when the number of strata grows with the sample size, these estimators (like the classic Mantel-Haenszel estimator for stratified 2 × 2 tables) are superior to ML estimators. Another approach treats the stratumspecific parameters in the proportional odds model as a random sample from a distribution, such as normal distribution. Such a model is a special case of multivariate generalized linear mixed models (GLMMs) that include random effects. Hartzel et al. (28) showed two possible random effects proportional odds models for multicenter clinical trials. The first one is a random intercept
version of the model:

logit(P(Y_k ≤ j | x)) = α_j − c_k − βx,    (1)

where Y_k is the ordinal response at the kth center, c_k is the effect caused by the kth center, and x is the covariate. The parameters {c_k} are independent observations from a normal distribution N(0, σ²). The straightforward Gauss-Hermite quadrature is usually adequate (29,30) to find the ML estimate of the fixed effect β. Hartzel et al. (28) pointed out that the ML estimate of β for the model and its standard error are very similar to those for the model that treats {c_k} as fixed effects. Another possible random effects proportional odds model is one that permits heterogeneity in the conditional association between the response and covariate. It treats the true center-specific ordinal log odds ratios as a sample with some unknown mean and standard deviation. It is natural to use this model in many multicenter clinical trials, and it is a useful extension of model (1):

logit(P(Y_k ≤ j | x)) = α_j − c_k − b_k x,    (2)

where {b_k} are independent observations from a N(β, σ_b²) distribution, and {c_k} are independent observations from a N(0, σ_c²) distribution. Both models (1) and (2) are special cases of multivariate GLMMs (31). Hartzel et al. (28) presented the estimation methods and an expectation maximization (EM) algorithm for fitting the nonparametric random effects version of the models. The Bayesian approach of specifying priors on the effects gives an alternative choice for the mixed models. However, for highly sparse multicenter clinical trials, the random effects model or a Bayesian approach may encounter problems with convergence (32,33).
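As a sketch of how the random-intercept model (1) can be fit in practice, the code below assumes the clmm() function from the R package ordinal (a cumulative link mixed model with a logit link); it illustrates neither the nonparametric EM nor the Bayesian variants discussed above, and the multicenter data are simulated with hypothetical names.

```r
## Hypothetical multicenter trial: K centers, ordinal outcome, binary treatment
set.seed(4)
K <- 30; n_k <- 10
center <- factor(rep(1:K, each = n_k))
treat  <- rbinom(K * n_k, 1, 0.5)
c_k    <- rep(rnorm(K, sd = 1), each = n_k)              # center effects
y      <- cut(0.7 * treat + c_k + rlogis(K * n_k),
              c(-Inf, 0, 1.5, Inf),
              labels = c("worse", "unchanged", "better"), ordered_result = TRUE)
d <- data.frame(y, treat, center)

library(ordinal)
## Model (1): random center intercepts, common treatment effect across cutpoints
fit <- clmm(y ~ treat + (1 | center), data = d)
summary(fit)    # fixed treatment effect plus the estimated center variance

## For comparison, centers as fixed effects (often a very similar treatment effect)
# summary(clm(y ~ treat + center, data = d))
```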
3 MODELS FOR ORDINAL RESPONSE IN LONGITUDINAL CLINICAL TRIALS

Sometimes, the ordinal responses occur at various time points for each patient over a longitudinal clinical study. The repeated measurements provide more detailed information on the patient profiles during the entire study. Two classes of models typically are used in longitudinal studies: (1) marginal models that analyze effects averaged over all patients at particular levels of predictors and (2) subject-specific models that are conditional models with random effects to describe covariate effects at the patient level. The next two subsections describe these two approaches.

3.1 Marginal Models

Marginal models describe so-called population-averaged effects that refer to an averaging over patients at particular levels of predictors. Denote the repeated ordinal response over T time points by (Y_1, . . . , Y_T). Although the notation here assumes the same number of time points T for each patient, the models and fitting algorithms apply to the more general setting in which the number of time points can differ across subjects. The marginal proportional odds model has the form

logit(P(Y_t ≤ j | x_t)) = α_j − β′x_t,    j = 1, . . . , c − 1,  t = 1, . . . , T,

where x_t contains the values of the predictors at time t. This model focuses on the dependence of the T first-order marginal distributions of the repeated responses on the predictors. For ML fitting of this marginal model, it is awkward to maximize the log-likelihood function. The likelihood function is a product of multinomial distributions from the various predictor levels, where each multinomial configuration is defined for the c^T cells in the cross-classification of the T responses. A possible approach is to treat the marginal model as a set of constrained equations and use methods of maximizing multinomial likelihoods subject to constraints (34–36). It is computationally intensive when T is large or when there are several predictors, especially if any of them are continuous. An easier way applies a generalized estimating equations (GEE) method based on a multivariate generalization of quasi-likelihood. It specifies only the marginal models and a working guess for the correlation structure among the T responses. One uses the empirical dependence among the T responses to adjust the standard error of the estimate of β. The GEE methodology (37), originally specified for marginal models with univariate distributions such as the binomial and Poisson, extends to proportional odds models (38) for repeated ordinal responses. Let y_it(j) = 1 if the response outcome at time t for patient i falls in category j (1 ≤ i ≤ N, 1 ≤ j ≤ c − 1, 1 ≤ t ≤ T); otherwise, y_it(j) = 0. For patient i, denote the responses at time t by y_it = (y_it(1), y_it(2), . . . , y_it(c − 1))′. The covariance matrix for y_it is the one for a multinomial distribution. For each pair of categories (j_1, j_2), one selects a working correlation matrix for the time pairs (t_1, t_2). Let β* = (β′, α_1, . . . , α_{c−1})′. The generalized estimating equations for estimating the model parameters take the form

u(β̂*) = Σ_{i=1}^{N} D̂_i′ V̂_i^{−1} (y_i − π̂_i) = 0,

where y_i = (y_i1′, . . . , y_iT′)′ is the vector of observed responses for patient i, π_i is the vector of probabilities associated with y_i, D_i = ∂π_i/∂β*, V_i is the covariance matrix of y_i, and the hats denote the substitution of the unknown parameters with their current estimates. Lipsitz et al. (38) suggested a Fisher scoring algorithm for solving the above equation. See also Reference (39) for model fitting algorithms. Stokes et al. (26) contain SAS commands on fitting the proportional odds model with GEE. For marginal models, the association structure is usually not the primary focus. It seems reasonable to use a simple structure rather than to expend much effort modeling it. However, when the association structure is itself of interest, a GEE2 approach is possible (40).
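The GEE just described uses the full multinomial covariance of y_it. A simpler and widely used shortcut, sketched below, stacks the c − 1 cumulative indicators 1{Y_it ≤ j} and fits a binary GEE with a working correlation, relying on the robust (sandwich) standard errors to absorb the within-patient dependence. The sketch assumes the geeglm() function from the R package geepack; it is an approximation rather than the Lipsitz et al. procedure, and the data and names are hypothetical.

```r
## Hypothetical longitudinal trial: N patients, T visits, ordinal response with c = 3
set.seed(5)
N <- 100; T_pts <- 4
d <- expand.grid(id = 1:N, time = 1:T_pts)
d$treat <- rbinom(N, 1, 0.5)[d$id]
d$u     <- rnorm(N)[d$id]                                # shared patient effect -> correlation
d$y     <- cut(0.6 * d$treat + 0.2 * d$time + d$u + rlogis(nrow(d)),
               c(-Inf, 0, 2, Inf), labels = 1:3, ordered_result = TRUE)

## Expand to cumulative indicators z = 1{Y_it <= j}, j = 1, 2
long <- do.call(rbind, lapply(1:2, function(j)
  transform(d, cutpt = j, z = as.numeric(as.integer(y) <= j))))
long$cutpt <- factor(long$cutpt)
long <- long[order(long$id, long$time, long$cutpt), ]    # records grouped by patient

library(geepack)
fit <- geeglm(z ~ cutpt + treat + time, id = id, data = long,
              family = binomial, corstr = "exchangeable")
summary(fit)    # robust standard errors account for the repeated measurements
```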
3.2 Subject-Specific Models

Instead of modeling the marginal distributions by treating the dependence structure among the T repeated outcomes as a nuisance parameter, one can model the joint distribution of the repeated responses by using patient-specific random effects. Such a model has conditional interpretations with subject-specific effects. It is a special case of the multivariate GLMMs. The repeated responses are typically assumed to be independent, given the random effect, but variability in the random effects induces a marginal nonnegative association between pairs of responses after averaging over the random effects. The proportional odds model with random effects has the form

logit(P(Y_it ≤ j | x_it, z_it)) = α_j − β′x_it − u_i′z_it,    j = 1, . . . , c − 1,  t = 1, . . . , T,  i = 1, . . . , N,

where z_it refers to a vector of predictors that correspond to the random effects and the u_i are independent and identically distributed from a multivariate N(0, Σ) distribution (31). The simplest case takes z_it to be a vector of 1's and u_i to be a single random effect from a N(0, σ²) distribution. This simplest model is the random intercept model, which is identical to model (1) with the multicenter index k replaced by the patient index i. When more than a couple of terms are in the vector of random effects, ML model fitting can be challenging. Because the random effects are unobserved, to obtain the likelihood function we construct the usual product of multinomials that would apply if they were known, and then we integrate out the random effects. This integral does not have closed form, and it is necessary to use some approximation for the likelihood function, for instance, the Gauss-Hermite quadrature (29,30). We can then maximize the approximated likelihood using a variety of standard methods. For models with higher-dimensional integrals, more feasible methods use Monte Carlo methods, which use averages over the randomly sampled nodes to approximate the integrals. Booth and Hobert (41) proposed an automated Monte Carlo EM algorithm for generalized linear mixed models that assesses the Monte Carlo error in the current parameter estimates and increases the number of nodes if the error exceeds the change in the estimates from the previous iteration. Alternatively, pseudo-likelihood methods avoid the intractable integral completely, which makes computation simpler (42). However, these methods are biased for highly non-normal cases, such as the proportional odds mixed model. Assuming multivariate
normality for the random effects also has the possibility of a misspecified random effects distribution. Hartzel et al. (43) used a semiparametric approach for ordinal models. The method applies an EM algorithm to obtain nonparametric ML estimates by treating the random effects as having a discrete support distribution with a set of mass points with unspecified locations and probabilities. The choice between the marginal (population-averaged) and subject-specific model depends on whether one prefers interpretations to apply at the population or the subject level. Subject-specific models are especially useful for describing within-subject effects or comparisons. Marginal models may be adequate if one is interested in summarizing responses for different groups (e.g., gender, race) without expending much effort on modeling the dependence structure. In clinical trials, the marginal models tend to be more useful than the subject-specific model, if the predictors of interest, like treatments assigned remain unchanged over the entire period of study. However, if the treatments assigned to each patient vary over time, then the subject-specific model could be more plausible, but one might still take into account the explicit carry-over effect of the preceding treatment as in a crossover clinical trial. In this article, we provide a comprehensive review of the proportional odds model for modeling clinical outcomes that are recorded on an ordinal scale. Several other choices of link functions and modeling strategies can be used for ordinal outcomes that we have not covered, such as the continuation-ratio logit model or the adjacent category logit model. For a more detailed survey of methods for ordered categorical data, see References (8) and (16).
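A small numerical illustration of the population-averaged versus subject-specific distinction: for a random-intercept cumulative logit model, integrating the conditional probabilities over the random effect gives a marginal cumulative log odds ratio that is attenuated toward zero relative to the subject-specific effect. The calculation below uses only base R; the parameter values are arbitrary.

```r
## Conditional model: logit P(Y <= j | u) = alpha_j - beta * x - u,  u ~ N(0, sigma^2)
beta <- 1; sigma <- 2; alpha <- c(-1, 1)

marg_cum <- function(j, x)          # marginal (population-averaged) P(Y <= j | x)
  integrate(function(u) plogis(alpha[j] - beta * x - u) * dnorm(u, sd = sigma),
            -Inf, Inf)$value

## Subject-specific cumulative log odds ratio for x = 1 vs x = 0 is -beta = -1;
## the population-averaged version is noticeably closer to zero:
qlogis(marg_cum(1, 1)) - qlogis(marg_cum(1, 0))
```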
REFERENCES 1. S-C. Chow and J-P. Liu, Design and Analysis of Clinical Trials. Concepts and Methodologies. New York: John Wiley & Sons, 1998. 2. P. I. Good, The Design and Conduct of Clinical Trials. New York: John Wiley & Sons, 2006. 3. G. Simon, Alternative analyses for the singlyordered contingency table. J. Am. Stat. Assoc. 1974; 69:971–976.
PROPORTIONAL ODDS MODELS 4. O. D. Williams and J. E. Grizzle, Analysis of contingency tables having ordered response categories. J. Am. Stat. Assoc. 1972; 67:55–63. 5. P. McCullagh, Regression models for ordinal data (with discussion). J. Royal Stat. Soc. B 1980; 42:109–142. 6. E. J. Snell, A scaling procedure for ordered categorical data. Biometrics 1964; 20:592–607. 7. R. D. Bock and L. V. Jones, The Measurement and Prediction of Judgement and Choice. San Francisco, CA: Holden-Day, 1968. 8. A. Agresti, Categorical Data Analysis, 2nd ed. New York: John Wiley & Sons, 2002. 9. J. A. Anderson and P. R. Philips, Regression, discrimination and measurement models for ordered categorical variables. Appl. Statist. 1981; 30:22–31. 10. S. H. Walker and D. B. Duncan, Estimation of the probability of an event as a function of several independent variables. Biometrika 1967; 54:167–179. 11. S. R. Lipsitz, G. M. Fitzmaurice, and G, Molenberghs, Goodness-of-fit tests for ordinal response regression models. Appl. Statist. 1996; 45:175–190. 12. A. Y. Toledano and C. Gatsonis, Ordinal regression methodology for ROC curves derived from correlated data. Stat. Med. 1996; 15:1807–1826. 13. J-H. Kim, Assessing practical significance of the proportional odds assumption. Statist. Probab. Lett. 2003; 65:233–239. 14. C. Cox, Location scale cumulative odds models for ordinal data: A generalized nonlinear model approach. Stat. Med. 1995; 14:1191–1203. 15. B. Peterson and F. E. Harrell, Partial proportional odds models for ordinal response variables. Appl. Statist. 1990; 39:205–217. 16. I. Liu and A. Agresti, The analysis of ordered categorical data: an overview and a survey of recent developments. Test 2005; 14:1–73. 17. I. Liu and A. Agresti, Mantel--Haenszel--type inference for cumulative odds ratios. Biometrics 1996; 52:1222–1234.
7
21. A. Agresti and J. B. Lang, A proportional odds model with subject-specific effects for repeated ordered categorical responses. Biometrika 1993; 80:527–534. 22. B. Mukherjee, J. Ahn, I. Liu, and B. Sanchez, Fitting stratified proportional odds models by amalgamating conditional likelihoods. Stat. Med. 2008, to appear. 23. N. Mantel and W. Haenszel, Statistical aspects of the analysis of data from retrospective studies of disease. J. Natl. Cancer Inst. 1959; 22:719–748. 24. J. R. Landis, E. R. Heyman, and G. G. Koch, Average partial association in three-way contingency tables: a review and discussion of alternative tests. Internat. Statist. Rev. 1978; 46:237–254. 25. J. R. Landis, T. J. Sharp, S. J. Kuritz, and G. G. Koch, Mantel-Haenszel methods. In: Encyclopedia of Biostatistics. Chichester, UK: John Wiley & Sons, 1998. pp. 2378–2691. 26. M. E. Stokes, C. S. Davis, G. G. Koch, Categorical Data Analysis Using the SAS system, 2nd ed. Cary, NC: SAS Institute, 2000. 27. I. Liu, Describing ordinal odds ratios for stratified r × c tables. Biometr. J. 2003; 45:730–750. 28. J. Hartzel, I. Liu, A. Agresti, Describing heterogeneous effects in stratified ordinal contingency tables, with application to multi-center clinical trials. Comput. Statist. Data Anal. 2001; 35:429:449. 29. D. Hedeker and RD. Gibbons, A randomeffects ordinal regression model for multilevel analysis. Biometrics 1994; 50:933–944. 30. D. Hedeker and R. D. Gibbons, MIXOR: a computer program for mixed-effects ordinal regression analysis. Comput. Methods Prog. Biomed. 1996; 49:157–176. 31. G. Tutz and W. Hennevogl, Random effects in ordinal regression models. Comput. Statist. Data Anal. 1996; 22:537–557. 32. B. Mukherjee, I. Liu, and S. Sinha, Analysis of matched case-control data with multiple ordered disease states: possible choices and comparisons. Stat. Med. 2007; 26:3240–3257.
18. J. Neyman and E. L. Scott, Consistent estimates based on partially consistent observations. Econometrica 1948; 16:1–22.
33. I. Liu and D. Wang, Diagnostics for stratified clinical trials in proportional odds models. Commun. Stat. Theory Meth. 2007; 36:211–220.
19. P. McCullagh, A logistic model for paired comparisons with ordered categorical data. Biometrika 1977; 64:449–453.
34. J. B. Lang, Maximum likelihood methods for a general class of log-linear models. Ann. Statist. 1996; 24:726–752.
20. P. McCullagh, On the elimination of nuisance parameters in the proportional odds model. J. Royal Stat. Soc. B 1984; 46:250–256.
35. J. B. Lang, Multinomial-Poisson homogeneous models for contingency tables. Ann. Statist. 2004; 32:340–383.
8
PROPORTIONAL ODDS MODELS
36. J. B. Lang and A. Agresti, Simultaneously modeling joint and marginal distributions of multivariate categorical responses. J. Am. Stat. Assoc. 1994; 89:625–632. 37. K-Y. Liang and S. L. Zeger, Longitudinal data analysis using generalized linear models. Biometrika 1986; 73:13–22. 38. S. R. Lipsitz, K. Kim, L. Zhao, Analysis of repeated categorical data using generalized estimating equations. Stat. Med. 1994; 13: 1149–1163. 39. M. E. Miller, C. S. Davis, and J. R. Landis, The analysis of longitudinal polytomous data: generalized estimating equations and connections with weighted least squares. Biometrics 1993; 49:1033–1044. 40. P. Heagerty and S. L. Zeger, Marginal regression models for clustered ordinal measurements. J. Am. Stat. Assoc. 1996; 91:1024– 1036. 41. J. G. Booth and J. P. Hobert, Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. J. Royal Stat. Soc. B 1999; 61:265–285.
42. N. Breslow and D. G. Clayton, Approximate inference in generalized linear mixed models. J. Amer. Statist. Assoc. 1993; 88:9–25. 43. J. Hartzel, A. Agresti, and B. Caffo, Multinomial logit random effects models. Stat. Model. 2001; 1:81–102.
CROSS-REFERENCES: Categorical Variables; Cox Proportional Hazard Model; Generalized Estimating Equations; Generalized Linear Models; Logistic Regression Analysis
PROPORTIONS, INFERENCES, AND COMPARISONS
JASON T. CONNOR
Carnegie Mellon University, Pittsburgh, PA, USA

PETER B. IMREY
Cleveland Clinic, Cleveland, OH, USA

Binomially based inferences about one proportion, or about two proportions using data from independent samples, are among the most common tasks in statistical analysis, taught in every elementary course. However, despite the ease with which these tasks can be described and the frequency with which they are encountered, they remain controversial and inconsistently handled in statistical practice. Numerous papers in theoretical and applied publications have covered binomial point estimation, interval estimation, and hypothesis testing using exact, approximate, and Bayesian methods. Yet, even with the advanced computational power now widely available, no single approach to this set of tasks has emerged as clearly preferable. The methodological choices regarding testing equality of two proportions, or estimating any disparity between them, are equally perplexing. This article surveys, nonexhaustively, a range of methods for handling each of the problems above, based on underlying binomial or two-factor product-binomial distributions. A related problem, which is a comparison of two proportions using data from a matched-pairs design, can be placed within the framework of a single proportion inference by considering the binomial distribution of the number of discordant pairs of the (1,0) type after conditioning on the total number of both types ((1,0) and (0,1)) of discordant pairs. Unconditional approaches to such matched dichotomous data place the problem in the context of marginal symmetry of a 2 × 2 multinomial contingency table, requiring consideration of trinomial distributions, and are beyond our scope here.

1 ONE-SAMPLE CASE

We observe X ∼ Bin(N, p) where N is fixed, and wish to estimate or test hypotheses about the unknown parameter p.

2 POINT ESTIMATION

The observed proportion p̂ = X/N is simultaneously the method of moments, maximum likelihood, and minimum variance unbiased estimator of p (22). The Bayesian posterior mean, under a conjugate beta prior p ∼ Beta(α, β), is p̂_B = (X + α)/(N + α + β). In the special case of the uniform prior, Beta(1,1), the posterior mean thus reduces to p̂_B = (X + 1)/(N + 2), which is biased toward 0.5 compared with the maximum likelihood estimator (MLE) (32). Another popular Bayesian choice is the Jeffreys prior, Beta(1/2, 1/2), which yields p̂_B = (X + 1/2)/(N + 1) and produces, as will be discussed, well-behaved frequentist confidence intervals.
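As a concrete illustration of these point estimators, the short Python sketch below computes the MLE and the uniform-prior and Jeffreys-prior posterior means; the function name and example counts are illustrative assumptions, not part of the original article.

```python
def point_estimates(x, n, a=0.5, b=0.5):
    """Return the MLE and the Beta(a, b) posterior mean for a binomial proportion."""
    mle = x / n                             # method of moments / ML / UMVU estimator
    posterior_mean = (x + a) / (n + a + b)  # conjugate Beta posterior mean
    return mle, posterior_mean

# Example (illustrative counts): 38 successes in 100 trials
x, n = 38, 100
print(point_estimates(x, n, 1, 1))      # uniform prior: (X + 1)/(N + 2)
print(point_estimates(x, n, 0.5, 0.5))  # Jeffreys prior: (X + 1/2)/(N + 1)
```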
3 INTERVAL ESTIMATION

The discreteness of the binomial distribution – there are only N + 1 possible outcomes when X ∼ Bin(N, p) – sometimes leads to erratic and unpredictable behavior by confidence intervals for p. This is particularly apparent for intervals based on the asymptotic normal approximation N(p, p(1 − p)/N) to the distribution of p̂. On the basis of this approximation, nearly all entry level and many advanced courses teach the Wald 100(1 − α)% confidence interval for p,

p̂ ± z_{1−α/2} √(p̂(1 − p̂)/N),

with z_γ the 100γth percentile of the standard normal, for example, z_γ = 1.96 when γ = 0.975. This interval offers intuitive and easily understandable properties for introductory level students. For fixed p̂, the interval narrows as N increases while, for fixed N, the interval is widest when p̂ = 0.5 and narrows as p̂ approaches 0 or 1.

The drawback, however, is that extremely large samples are necessary for the interval to achieve nominal 100(1 − α)% coverage, that is, for 100(1 − α)% of all intervals constructed to contain p. While this is particularly true for p near its extremes of 0 or 1, the coverage probability is low over the entire range of p. Moreover, due to the binomial's discreteness, coverage does not approach 100(1 − α)% monotonically as N increases (20), so that a larger N can yield poorer performance. Many classic texts, in recognition of the asymptotic origin of the Wald interval, recommend it only when min(Np, N(1 − p)) > 5 or > 10 or, more stringently, when Np(1 − p) > 5 or > 10. Yet even when N = 40 and p = 0.5 and this latter condition is thus met, the exact coverage of the Wald interval is only 91.9%. Even when N is 100, the portion of the range of p for which a 95% confidence interval achieves 95% coverage is negligible (Figure 1a).

Figure 1. Coverage probabilities of six 95% confidence intervals for p, with N = 100

A number of methods exist to ensure at least 100(1 − α)% coverage for any fixed p (1,3,34). These methods vary in computational complexity and associated software requirements. While some instructors and practitioners desire closed-form formulae, others believe that "simplicity and ease of computation have no roles to play in statistical practice" (27). Some believe the confidence coefficient is meaningful only as a guaranteed minimum coverage probability at each use of an interval, while others find a method that guarantees only average coverage over a range of conditions to be quite satisfactory. Such varied opinions leave much room for disagreement on practical recommendations. Further, to guarantee nominal coverage, one must generally form an interval by inverting the acceptance region of an exact hypothesis test. This requires computing binomial probabilities of the observed outcome and some unobserved outcomes over a range of possible values of p. Such computationally intensive calculations are now simple to perform using many statistical software packages, but unsuitable for the elementary courses in which confidence intervals for proportions must be taught.

Below, we promote alternatives to the standard Wald interval that meet four criteria (18). The confidence region must:

1. be one contiguous interval;
2. be invariant to X → N − X transformation, that is, the lower and upper endpoints of the interval for p based on X should respectively be the upper and lower endpoints of the interval for (1 − p) based on (N − X);
3. yield monotone endpoints in X, that is, for fixed N, LB(X, N) < LB(X + 1, N) and UB(X, N) < UB(X + 1, N); and
4. yield monotone endpoints in N, that is, for fixed X, LB(X, N) > LB(X, N + 1), and UB(X, N) > UB(X, N + 1).

We review three easily computable alternatives to the Wald interval and a more computationally demanding "exact" interval, comparing coverage properties with the Wald interval and with each other.
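To make the coverage deficiency of the Wald interval concrete, the following Python sketch computes the interval and its exact coverage probability for a fixed (N, p) by summing binomial probabilities over the outcomes whose intervals contain p; the function names are illustrative, and the printed values should approximate the figures quoted above.

```python
import numpy as np
from scipy.stats import binom, norm

def wald_interval(x, n, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    p_hat = x / n
    half = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def exact_coverage(interval_fn, n, p, alpha=0.05):
    """Sum Bin(n, p) probabilities of all outcomes x whose interval contains p."""
    cover = 0.0
    for x in range(n + 1):
        lo, hi = interval_fn(x, n, alpha)
        if lo <= p <= hi:
            cover += binom.pmf(x, n, p)
    return cover

print(exact_coverage(wald_interval, 40, 0.5))   # roughly 0.92, well below 0.95
print(exact_coverage(wald_interval, 100, 0.3))  # typically still below nominal
```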
4 JEFFREYS INTERVAL

Assuming a Jeffreys prior p ∼ Beta(1/2, 1/2), the resulting posterior distribution is p | X, N ∼ Beta(X + 1/2, N − X + 1/2), and the Bayesian equal-tailed 100(1 − α)% credible set is formed by the α/2 and (1 − α/2) quantiles. Interpreting this Bayesian credible set as a frequentist confidence interval offers desirable frequentist properties, as will be demonstrated. While there are no closed-form expressions for the endpoints, all common statistical software packages (Excel, SAS, S-PLUS, etc.) include simple function calls for such Beta quantiles. Note that although error is allocated equally to each tail, the interval itself will be symmetric only when the posterior distribution itself is, requiring X + 1/2 = N − X + 1/2 and hence X = N/2.

5 WILSON INTERVAL

The Wald interval is formed by inverting the Wald test of H0: p = p0. That test is based on a Central Limit Theorem (CLT) approximation to the distribution of p̂ using the maximum likelihood binomial variance estimator Np̂q̂, where q̂ = 1 − p̂. Wilson, in a 1927 JASA paper (55), introduced an interval with a similar relationship to what is now known as the score test. Writing q0 = 1 − p0, the score test CLT approximation to the distribution of p̂ uses the variance Np0q0 implied by the hypothesized p0. The 100(1 − α)% Wilson interval, also referred to as the score interval (2), is

\frac{X + z_{1-\alpha/2}^2/2}{N + z_{1-\alpha/2}^2} \pm \frac{z_{1-\alpha/2}\sqrt{N}}{N + z_{1-\alpha/2}^2}\sqrt{\hat{p}\hat{q} + \frac{z_{1-\alpha/2}^2}{4N}}.   (1)

As noted by Agresti and Caffo (4), this interval is centered about the pseudo-estimator p̃ = (X + z²_{α/2}/2)/(N + z²_{α/2}), which may be viewed as a weighted average of the sample proportion and one-half or, equivalently, as obtained by adding (z²_{α/2}/2) successes and (z²_{α/2}/2) failures to the data. Similarly, the interval's width is a multiple of a pseudo standard error obtained from a weighted average of the maximum likelihood variance estimator used for the Wald interval and the true variance when p = 1/2.
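A minimal Python sketch of the Jeffreys and Wilson intervals follows, using the Beta quantile form of the Jeffreys interval and equation (1) for the Wilson (score) interval; the function names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import beta, norm

def jeffreys_interval(x, n, alpha=0.05):
    """Equal-tailed credible interval from the Beta(x + 1/2, n - x + 1/2) posterior."""
    lo = beta.ppf(alpha / 2, x + 0.5, n - x + 0.5)
    hi = beta.ppf(1 - alpha / 2, x + 0.5, n - x + 0.5)
    return lo, hi

def wilson_interval(x, n, alpha=0.05):
    """Score interval of equation (1)."""
    z = norm.ppf(1 - alpha / 2)
    p_hat, q_hat = x / n, 1 - x / n
    center = (x + z**2 / 2) / (n + z**2)
    half = z * np.sqrt(n) / (n + z**2) * np.sqrt(p_hat * q_hat + z**2 / (4 * n))
    return center - half, center + half

print(jeffreys_interval(38, 100))
print(wilson_interval(38, 100))
```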
6 AGRESTI–COULL INTERVAL

The Agresti–Coull interval (5) takes the functional form of the standard asymptotic Wald interval but with minor adjustments in X, N, and p̂, to X̃ = X + z²_{α/2}/2, Ñ = N + z²_{α/2}, and p̃ = X̃/Ñ, respectively. Thus, the interval has the standard Wald-like form

\tilde{p} \pm z_{\alpha/2}\sqrt{\frac{\tilde{p}\tilde{q}}{\tilde{N}}},   (2)

where q̃ = 1 − p̃. The difference is that the whole experiment is treated as if there are (z²_{α/2}/2) more successes and (z²_{α/2}/2) more failures than were actually observed. For a 95% confidence interval, this has the effect of approximately adding two successes and two failures, or in a Bayesian sense, starting with a Beta(2, 2) prior for p. This prior has mean 1/2 and is concave with single mode 1/2, while the Jeffreys prior has mean 1/2 and is convex and bimodal at 0 and 1. Thus, in the Bayesian sense, the Agresti–Coull prior distribution on p is more informative. The Agresti–Coull interval is never narrower than the Wilson interval, making it a more conservative choice. It offers a clear improvement over the Wald interval when X = 0 or X = N, for which the width of the Wald interval is zero, and corrects particularly well for the sometimes too narrow intervals and poor coverage of the Wilson method when p is close to 0 or 1. The Agresti–Coull and Wilson (score) intervals are very similar when p is near the center of its range. See (4) for a fine, highly accessible review of this interval.
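The Agresti–Coull adjustment of equation (2) is easy to express in code; the sketch below is a minimal Python version (illustrative function name), which for 95% confidence behaves almost exactly like adding two successes and two failures before applying the Wald formula.

```python
import numpy as np
from scipy.stats import norm

def agresti_coull_interval(x, n, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    x_t = x + z**2 / 2          # adjusted successes
    n_t = n + z**2              # adjusted sample size
    p_t = x_t / n_t
    half = z * np.sqrt(p_t * (1 - p_t) / n_t)
    return p_t - half, p_t + half

print(agresti_coull_interval(0, 20))    # sensible even when X = 0
print(agresti_coull_interval(38, 100))
```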
7 CLOPPER–PEARSON INTERVAL

As always, it is easier to reach a performance goal on average than to guarantee such performance is always achieved across a range of conditions. The intervals above are superior to the Wald interval in providing approximately 100(1 − α)% coverage averaged over the range of possible values of p, but coverage by each is below nominal for some combinations of p and N. In contrast, the Clopper–Pearson interval (25), often called the Exact Interval, achieves at least nominal coverage for all combinations of p and N (1,3,25,34). For X ≠ 0 or N, the 100(1 − α)% Clopper–Pearson interval is

\left(1 + \frac{N - X + 1}{X\,F_{2X,\,2(N-X+1)}(\alpha/2)}\right)^{-1},\qquad \left(1 + \frac{N - X}{(X+1)\,F_{2(X+1),\,2(N-X)}(1 - \alpha/2)}\right)^{-1},   (3)

where F_{df1,df2}(c) is the c quantile from the F distribution with df1 and df2 degrees of freedom. When X = 0 or N, the undefined bounds in (3) are respectively replaced by 0 and 1. "Exact" in reference to the Clopper–Pearson interval refers to use of the exact binomial sampling distribution rather than using an asymptotic approximation to produce the interval. (The relevant exact binomial sums implicitly determine (3) through their relationship, and that of the F distributions, to the incomplete beta function.) This method, however, does not produce an exactly 100(1 − α)% interval, but rather one of at least 100(1 − α)% and sometimes much higher coverage. Thus, the price of guaranteeing at least 100(1 − α)% coverage for each combination of N and p is loss of precision, in the sense that intervals are on average wider than necessary to achieve that coverage for most N, p combinations. Nevertheless, when preservation of nominal coverage is preferred despite this conservatism, the Clopper–Pearson interval accomplishes this objective and is widely used. Other exact methods, for example, Blyth–Still (18), the Blaker (17) interval nested within it, and Blyth–Still–Casella intervals (21), are also available in specialized software. The continuity correction to the Wilson interval results in a wider, more conservative interval that better approximates the Clopper–Pearson interval. This frequently increases minimum coverage, as in Figure 1, to the nominal 100(1 − α)% (31,38).

8 COVERAGE COMPARISON

For a fixed sample size of N = 100, and 101 possible values of the true p (0, 0.01, 0.02, . . . , 1.00), Figure 1 shows the true coverages of 95% Wald, Jeffreys, Wilson, continuity-corrected Wilson, Agresti–Coull, and Clopper–Pearson (aka Exact) intervals. Ideally, the coverage of each would be 95% for every value of p. While the discreteness of the binomial distribution prohibits this, one still desires coverage near 95%. It is clear from Figure 1 that the Wald interval, even when N = 100, offers poor coverage. The Jeffreys interval offers better coverage properties than the interval obtained using a Uniform prior (not shown). Also, Ghosh (37) shows that the Wald interval is not only centered at the wrong place, but is also frequently wider than the Wilson interval.

Figure 2 shows interval widths for N = 100 and X from 1 to 50 (the plot is symmetric around 50). The Clopper–Pearson intervals are clearly wider for most values of X, and therefore for most values of p. The Exact approach produces the widest intervals except when p is near 0 or 1, when the Agresti–Coull intervals are wider. The Wilson and Agresti–Coull methods produce similar widths for p not near 0 or 1. Figure 2 also demonstrates that the Jeffreys interval is desirably narrow compared to other intervals (Figure 2) while producing coverages nearly 100(1 − α)% throughout the range of p (Figure 1).

Figure 2. Widths of five 95% confidence intervals for p, N = 100

There is philosophical debate over the relative merits of requiring at least 95% coverage for any value of p, as offered by exact methods or continuity corrections, versus requiring average 95% coverage for all situations in which one might compute an interval. The practicing statistician must weigh the benefits and costs of each of the two competing approaches, and choose the most appropriate method for the given situation. If guaranteed coverage is required, then the exact Clopper–Pearson interval is preferable. If average nominal coverage is satisfactory, then the Jeffreys, Wilson, and Agresti–Coull intervals offer sound frequentist properties. In the example shown, the Jeffreys interval offers tightest oscillation around 95%; this may differ, however, for choices of N other than 100. The Agresti–Coull interval may be the best compromise choice. It improves upon the Jeffreys and Wilson intervals by ensuring that coverage is not far below 100(1 − α)% for values of p near 0 or 1. But it is not overly conservative, as are the continuity-corrected Wilson and the exact intervals, throughout the rest of the admissible range of p. The Agresti–Coull interval also offers the advantage of a form that is easy to remember and teach: for 95% confidence, just construct the simple Wald interval after adding two successes and two failures to the data. Various other references, for example, (1,6,20,42,53), offer similar graphical comparisons of the available choices of confidence intervals for a binomial proportion.
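Comparisons of this kind can be reproduced numerically. The sketch below implements the Clopper–Pearson interval through Beta quantiles (equivalent to the F-distribution form of equation (3)) and checks that its minimum coverage over a grid of p values stays at or above nominal; function names and the grid are illustrative, paralleling the earlier coverage sketch.

```python
from scipy.stats import beta, binom

def clopper_pearson_interval(x, n, alpha=0.05):
    """Exact interval via the incomplete-beta (equivalently F) representation of (3)."""
    lo = 0.0 if x == 0 else beta.ppf(alpha / 2, x, n - x + 1)
    hi = 1.0 if x == n else beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lo, hi

def coverage(interval_fn, n, p, alpha=0.05):
    return sum(binom.pmf(x, n, p) for x in range(n + 1)
               if interval_fn(x, n, alpha)[0] <= p <= interval_fn(x, n, alpha)[1])

n = 100
print(clopper_pearson_interval(25, n))
# Guaranteed at least 95% coverage for every p, at the price of extra width
print(min(coverage(clopper_pearson_interval, n, p / 100) for p in range(1, 100)))
```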
9 HYPOTHESIS TESTING

The score and Wald tests, respectively inverting the Wilson and Wald intervals, are commonly used, as is the likelihood ratio test (2). The score statistic is equivalently Pearson's goodness-of-fit chi-square

X^2_{S1} = \sum_{i=1}^{2} (O_i - E_i)^2/E_i = (\hat{p} - p_0)^2/(p_0(1 - p_0)/N),

and the Wald statistic is Neyman's (44) modified chi-square statistic

X^2_{W1} = \sum_{i=1}^{2} (O_i - E_i)^2/O_i = (\hat{p} - p_0)^2/(\hat{p}(1 - \hat{p})/N),

where the O_i and E_i, i = 1, 2, are respectively the observed counts of successes and failures and their expectations under H0: p = p0. The corresponding form of the likelihood ratio statistic is

X^2_{L1} = 2\sum_{i=1}^{2} O_i \log(O_i/E_i) = 2[X\log(\hat{p}/p_0) + (N - X)\log((1 - \hat{p})/(1 - p_0))].

Under H0, all three statistics are asymptotically chi-square with one degree of freedom. As suggested by their denominator variances and the properties of their associated confidence intervals, convergence to this distribution is generally more rapid for X²_{S1} than for X²_{W1}, with departures for X²_{W1} tending towards higher than nominal type I error rates. The behavior of X²_{L1} is intermediate (48).

The exact test, dual to the Clopper–Pearson interval, is easy to calculate. For given p0 and N, the binomial probability of each possible outcome x = 0, . . . , N under the null hypothesis is simply Pr(X = x) = \binom{N}{x} p_0^x (1 - p_0)^{N-x}. Calculating the P value, the sum of probabilities of the observed and equally or less probable nonobserved outcomes, is straightforward. The mid-P value, a 1961 innovation of Lancaster (40), has recently received renewed attention (2). The mid-P value is the exact P value, as described above, less half the probability of the observed count. The mid-P value removes unattractive discrepancies between the properties of P values from discrete and continuous sampling distributions. Specifically, unlike conventional exact P values, the sum of mid-P values from two opposing one-sided exact tests equals 1.0, the mean of mid-P values is 0.5 under the null hypothesis, and the null distribution of mid-P values is closer to uniform than that of P values. However, while generally conservative, tests based on the mid-P value do not guarantee a type I error rate of less than α, nor do the corresponding confidence limits assure at least 100(1 − α)% coverage.

Hypothesis tests corresponding to the Jeffreys, Agresti–Coull, or another confidence interval for a binomial proportion can be performed by rejecting exactly when the interval excludes the hypothesized value. Such tests may well have type I error closer to nominal than one or more of the Wald, score, and likelihood ratio tests. To find the P value for a test formed by inverting a confidence interval, determine the confidence coefficient CC of the interval with p0 on the boundary. Then, the P value is 1 − CC. For example, assuming X = 38 successes are observed in N = 100 trials with null hypothesis p = 0.5, the Jeffreys hypothesis test can be performed by determining which confidence coefficient provides an upper bound exactly equal to 0.5. A region of the Beta(38.5, 62.5) distribution bounded above at 0.5, and excluding equal probabilities on each tail, contains 98.40% of the distribution; hence P = 0.0160. Likewise, using the equation for the upper bound of the Agresti–Coull confidence interval and solving for α yields P = 0.0165.
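The exact and mid-P one-sample tests described above can be written in a few lines; the Python sketch below (illustrative function name) computes a two-sided exact P value by summing the probabilities of all outcomes no more probable than the observed count, and the corresponding mid-P value.

```python
from scipy.stats import binom

def exact_binomial_test(x, n, p0):
    """Two-sided exact P value (outcomes no more probable than x) and mid-P value."""
    probs = binom.pmf(range(n + 1), n, p0)
    p_obs = probs[x]
    p_value = probs[probs <= p_obs + 1e-12].sum()   # small tolerance for ties
    mid_p = p_value - 0.5 * p_obs
    return p_value, mid_p

print(exact_binomial_test(38, 100, 0.5))
```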
10 POWER AND SAMPLE SIZE DETERMINATION

Whether the endpoints of an interval or the rejection region of a test are based on exact binomial calculations or a large-sample approximation to the relevant binomial distribution(s), coverage of an interval or power of a test may be calculated either exactly using binomial distributions under the alternative, or approximately using a limiting normal distribution. Only direct calculation from the binomial distribution under an alternative gives true coverage or power, although Gaussian approximations to such calculations are generally used and often provide sufficient accuracy for practical purposes.
For example, the power of a two-tailed test of H0: p = p0 under the fixed alternative HA: p = p1 can be approximated as follows. First, find the acceptance region of the stipulated test, (T_L, T_U); for example, for X²_{S1}, X ∈ (T_L, T_U) with (T_L, T_U) = Np0 ∓ z_{1−α/2}√(Np0(1 − p0)). Then, using normal theory under H1, calculate the large-sample normal approximation to the probability that X falls outside that region, for example,

\text{Power} = \Pr\left(Z \le \frac{T_L - Np_1}{\sqrt{Np_1(1-p_1)}}\right) + \Pr\left(Z \ge \frac{T_U - Np_1}{\sqrt{Np_1(1-p_1)}}\right).   (4)

This approximates the true power, Pr(X ∉ (T_L, T_U) | p = p1) under X ∼ Bin(N, p1), which may be calculated instead by summing the Bin(N, p1) probabilities for all values of X outside (T_L, T_U). This approach applies, with obvious modifications, to any other test procedure, specifically including exact tests and tests using the mid-P. Power, and the method for determining it, are based on the test's rejection region, not on how that rejection region is derived. Note that standard moment-based expressions in textbooks, and default power calculations in statistical software packages, are almost always asymptotic approximations to the true power of a test. The ease of inverting the moment-based expressions to yield approximate sample size requirements has much to do with this. However, discrepancies in default sample size recommendations of software packages are common, owing to variations in defaults on the tests used and the specific power calculations inverted to obtain them. While the asymptotically based counterparts to an exact test are generally more powerful, that is, type II error probabilities β for the asymptotic tests are lower than for the exact test, this gain comes at the expense of higher type I error rates, which are not guaranteed by the asymptotic tests. As shown in Figure 1 (with α = 1 − coverage probability), these may far exceed the nominal value used to (asymptotically) determine the rejection region. For fixed α, power as a function of sample size is also saw-toothed: counterintuitively, a small increase in sample size may slightly reduce power.
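The distinction between approximate and exact power is easy to demonstrate; the sketch below computes the acceptance region of the score test for H0: p = p0 and then both the normal approximation of equation (4) and the exact Bin(N, p1) tail sum, under illustrative values of N, p0, and p1.

```python
import numpy as np
from scipy.stats import binom, norm

def score_test_power(n, p0, p1, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    t_l = n * p0 - z * np.sqrt(n * p0 * (1 - p0))   # acceptance region (T_L, T_U)
    t_u = n * p0 + z * np.sqrt(n * p0 * (1 - p0))
    sd1 = np.sqrt(n * p1 * (1 - p1))
    approx = norm.cdf((t_l - n * p1) / sd1) + norm.sf((t_u - n * p1) / sd1)
    x = np.arange(n + 1)
    exact = binom.pmf(x[(x < t_l) | (x > t_u)], n, p1).sum()  # true power
    return approx, exact

print(score_test_power(100, 0.5, 0.65))
```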
11 ONE-SAMPLE SUMMARY
Numerous superior alternatives to the classic Wald interval exist. The Clopper–Pearson and, at least for 95% confidence, the continuity-corrected Wilson intervals, assure that coverage cannot fall below the nominal coefficient for any combination of N and p. Similarly, the tests based on inverting these intervals assure that type I error cannot exceed the nominal α. The cost is wider intervals and reduced power relative to procedures whose coverage roughly centers around, rather than below, the nominal confidence coefficient. Among such procedures are the relatively simple Jeffreys, Wilson, and Agresti–Coull intervals, the latter two with closed-form expressions, and their corresponding tests. The Agresti–Coull method produces intervals with generally sound frequentist properties, in a form easily taught and remembered.

For research design, software for sample size determination should be used cautiously. For hypothesis testing, for instance, such software typically requires specification of p0 as well as a nominal α and a target or guess at p1. Then, for fixed desired power, a single or a selection of sample size recommendations is provided from among the available choices. The researcher must recognize that power and sample size results for alternative tests can differ not only because one test is more efficient, but also because two tests of nominal size α may have different actual type I error rates, and/or because of approximations used in the calculations. This is particularly so since the power functions of the several tests, and even their relative performance, may be nonmonotonic with parameters and/or sample size in neighborhoods of their hypothesized or recommended values.

12 TWO INDEPENDENT SAMPLES
Inference for two independent samples focuses on how much the relative frequency of an observed characteristic differs between two sampled populations. In general, the lessons of the one-sample case regarding (a) the liberality of the standard Wald procedure, (b) conservatism of the standard exact procedure, (c) availability of simple intermediate approaches that achieve closer to nominal type I error by sacrificing control of maximum type I error, and (d) the inherent trade-offs of coverage and power with different levels of type I error control, all continue to apply. However, the two-sample situation is more complex because (i) the null hypothesis of interest is typically the composite hypothesis of no difference between populations, with the common underlying proportion a nuisance parameter, and (ii) the disparity between populations may be parameterized in several ways, most commonly as the difference ("risk difference") (2–4,8,11,43,45,49), ratio ("risk ratio") (2,4,12,45,49), and odds ratio (2,3,54), that are functionally dependent only for a fixed value of the nuisance parameter. A consequence of (ii) is that there is no longer a one-to-one relationship of hypothesis tests to confidence intervals for any single parameter determined to be of primary interest, and the details of interval estimation vary depending upon the association parameter chosen. To simplify exposition, and in conformity with the historical development, the ensuing discussion will thus proceed primarily from a testing perspective, with the reader referred to the excellent reviews (1,3) for additional detail on confidence intervals.

Comparing two binomial proportions has long occupied the field of statistics. In 1900, Karl Pearson introduced what became the "standard" chi-square test as a goodness-of-fit test to determine whether observed data were compatible with a proposed probability model (47). Its proper application to contingency tables was clarified in 1922 by Fisher (33). Hundreds of papers have since offered extensions, improvements, and adjustments to the test, which Science 84, a popular magazine of the American Association for the Advancement of Science (AAAS), called one of the 22 most important scientific breakthroughs of the twentieth century (7).

For the remainder of this section, we consider a two-by-two (2 × 2) table of observed counts under the probability model n_{i1} | n_{i·} ∼ Bin(n_{i·}, p_i), i = 1, 2:

                 Response
                 Present   Absent   Total
Population 1     n11       n12      n1·
Population 2     n21       n22      n2·
Total            n·1       n·2      n·· = N
Such data, with this model, may arise from several experimental or observational research designs. The row totals n1· and n2· may be fixed by the conditions of a designed observational study or experiment or, in an observational study, only N may have been determined by the researcher, or n11, n12, n21, n22 may be independent Poisson counts. In these latter cases, the within-row binomial distributions arise by conditioning inference on the observed row totals n1·, n2· of a multinomial or product-Poisson distribution, respectively. For such situations, we will discuss a general class of asymptotic procedures, exact inference, and Bayesian inference.

13 ASYMPTOTIC METHODS

Read and Cressie (29,48) defined the class of power-divergence asymptotic test statistics T^λ(N, m̂_ij) which, for H0: p1 = p2 as above, take the form

T^{\lambda}(N, \hat{m}_{ij}) = \frac{2}{\lambda(\lambda+1)}\sum_{i=1}^{2}\sum_{j=1}^{2} n_{ij}\left[\left(\frac{n_{ij}}{\hat{m}_{ij}}\right)^{\lambda} - 1\right]   (5)

for λ ≠ −1, 0, and the limiting forms

T^{-1}(N, \hat{m}_{ij}) = 2\sum_{i=1}^{2}\sum_{j=1}^{2} \hat{m}_{ij}\log(\hat{m}_{ij}/n_{ij}), \qquad T^{0}(N, \hat{m}_{ij}) = 2\sum_{i=1}^{2}\sum_{j=1}^{2} n_{ij}\log(n_{ij}/\hat{m}_{ij}).

The m̂_ij are estimated expected values of the n_ij obtained by minimizing T^ζ(N, m_ij), for some ζ (not necessarily equal to λ), under the constraint p1 = p2. In our notation, we suppress ζ, which does not affect the asymptotic distribution of the test statistics. This family includes the likelihood ratio test (λ = ζ = 0), Pearson's chi-square (λ = 1, ζ = 0), Neyman's minimum modified chi-square
(λ = ζ = −2), and others that may be conveniently studied within this unifying framework. Under all of the experimental or observational designs given above, when p1 = p2 each member of the power-divergence family converges in distribution to chi-square with one degree of freedom as n1·, n2· → ∞, and hence provides an asymptotically valid test of H0 or, equivalently when n1· and n2· are random, of row by column independence. Pearson's chi-square, which is the score test as in the one-sample case, is commonly written in each of the several forms

X^2_{S2} = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(n_{ij}-\hat{m}_{ij})^2}{\hat{m}_{ij}} = \frac{(\hat{p}_1-\hat{p}_2)^2}{(n_{1\cdot}^{-1}+n_{2\cdot}^{-1})\,\hat{p}(1-\hat{p})} = \frac{N(n_{11}n_{22}-n_{12}n_{21})^2}{n_{1\cdot}\,n_{2\cdot}\,n_{\cdot 1}\,n_{\cdot 2}},   (6)
with m̂_ij = n_i· n_·j / N, p̂1 = n11/n1·, p̂2 = n21/n2·, and p̂ = n·1/N. In the third form it is easily extended to the score test for H0: p1 − p2 = Δ,

X^2_{S2} = \frac{((\hat{p}_1 - \hat{p}_2) - \Delta)^2}{(n_{1\cdot}^{-1} + n_{2\cdot}^{-1})\,\hat{p}(1-\hat{p})}.   (7)

The set of all Δ not rejected by this test forms an asymptotic confidence interval for p1 − p2 analogous to the Wilson interval in the one-sample case. Neyman's minimum modified chi-square (44), which as earlier is the Wald statistic, replaces the denominator E_ij = m̂_ij above with n_ij, yielding

X^2_{W2} = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(O_{ij}-E_{ij})^2}{O_{ij}} = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(n_{ij}-\hat{m}_{ij})^2}{n_{ij}} = \frac{(\hat{p}_1-\hat{p}_2)^2}{\hat{p}_1(1-\hat{p}_1)/n_{1\cdot} + \hat{p}_2(1-\hat{p}_2)/n_{2\cdot}}.   (8)
The set of all Δ not rejected by the corresponding Wald test of H0, using

X^2_{W2} = \frac{((\hat{p}_1 - \hat{p}_2) - \Delta)^2}{\hat{p}_1(1-\hat{p}_1)/n_{1\cdot} + \hat{p}_2(1-\hat{p}_2)/n_{2\cdot}},   (9)

may be written directly as (p̂1 − p̂2) ± 1.96 √(p̂1(1 − p̂1)/n1· + p̂2(1 − p̂2)/n2·). This is the confidence interval traditionally presented in elementary statistics courses and texts, and most commonly used in practice. Unfortunately, this shares the propensity of the one-sample Wald interval to be too narrow to achieve nominal coverage. The performance of the same interval, however, substituting p̃_i = (n_{i1} + 1)/(n_{i·} + 2) for p̂_i, is much improved (3). As also noted by Agresti (3), the Wald interval for the log odds ratio using the empirical asymptotic variance \sum_{i=1}^{2}\sum_{j=1}^{2} n_{ij}^{-1} derived by the delta method,

\log\left(\frac{n_{11}n_{22}}{n_{12}n_{21}}\right) \pm 1.96\sqrt{\sum_{i=1}^{2}\sum_{j=1}^{2} n_{ij}^{-1}},   (10)

generally performs well, and its performance is improved if the n_ij are respectively replaced by n_ij + 2n_i· n_·j / N² and if, in addition, the intervals are extended to +∞ or −∞ respectively whenever min(n12, n21) = 0 or min(n11, n22) = 0. Further, inverting an exact or asymptotic chi-square test of H0: p1/p2 = θ based on the score statistic

X_{S\theta} = \frac{n_{1\cdot}(\hat{p}_1 - \tilde{p}_1)^2}{\tilde{p}_1(1 - \tilde{p}_1)} + \frac{n_{2\cdot}(\hat{p}_2 - \tilde{p}_2)^2}{\tilde{p}_2(1 - \tilde{p}_2)},   (11)

where p̃_i is the MLE of p_i under H0, provides a generally well-behaved interval estimate for the risk ratio p1/p2.

The likelihood ratio statistic X^2_{L2} = 2\sum_{i=1}^{2}\sum_{j=1}^{2} n_{ij}\log[n_{ij}/(n_{i\cdot}n_{\cdot j}/N)] is also a commonly used power-divergence statistic for comparing two proportions. Although the various power-divergence statistics share the same limiting null chi-square distribution as both n1· and n2· increase, and the tests have the same Pitman efficiency under local alternatives in the nonnull case, they differ in performance under nonlocal alternatives (e.g. in Bahadur efficiency), in more general problems under nonstandard "sparse" asymptotics (in which the number of cells increases with N), and in small samples. Cox and Groeneveld (28) provide a thorough comparison of X²_{S2} and X²_{L2}, predicting when each will provide a higher, that is, more powerful,
test statistic, under various null hypotheses. Read and Cressie (48) recommend λ = 2/3 as the best compromise between high power under a wide range of true alternative hypotheses and the ability of the chi-square distribution to approximate the distribution of the test statistic under the null hypothesis for small samples. Of the power-divergence methods in general practical use, the Pearson chi-square, with λ = 1, is closest to this.
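For reference, a brief Python sketch of the main two-sample quantities discussed above follows: the Pearson (score) chi-square of equation (6), the Wald statistic of equation (8), the likelihood ratio statistic, and the Wald intervals for the risk difference and log odds ratio. Variable names and the example counts are illustrative, and no continuity correction is applied.

```python
import numpy as np
from scipy.stats import norm

def two_sample_summaries(n11, n12, n21, n22, alpha=0.05):
    n1, n2 = n11 + n12, n21 + n22
    N = n1 + n2
    p1, p2 = n11 / n1, n21 / n2
    m = np.outer([n1, n2], [(n11 + n21) / N, (n12 + n22) / N])   # expected counts m_ij
    n = np.array([[n11, n12], [n21, n22]], dtype=float)
    x2_pearson = N * (n11 * n22 - n12 * n21) ** 2 / (n1 * n2 * (n11 + n21) * (n12 + n22))
    x2_wald = (p1 - p2) ** 2 / (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    x2_lr = 2 * np.sum(n * np.log(n / m))                        # assumes all n_ij > 0
    z = norm.ppf(1 - alpha / 2)
    se_diff = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff_ci = (p1 - p2 - z * se_diff, p1 - p2 + z * se_diff)
    log_or = np.log(n11 * n22 / (n12 * n21))
    se_or = np.sqrt((1 / n).sum())
    or_ci = (log_or - z * se_or, log_or + z * se_or)
    return x2_pearson, x2_wald, x2_lr, diff_ci, or_ci

print(two_sample_summaries(18, 12, 10, 20))
```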
14 CONTINUITY CORRECTIONS

Owing to the discreteness of counts, the sampling distributions of power-divergence statistics from contingency tables are discrete, that is, cdfs are step functions, and cannot generally be well-approximated by the continuous χ²_1 distribution when n1· or n2· is small. In such circumstances, the actual type I error rate of a test may be well above or below the desired α level. Continuity corrections are modifications to the test statistic or the approximating distribution for the purpose of reducing or minimizing the effects of such approximation errors. Such a continuity correction can be used, for instance, to control excessive type I error of an asymptotic chi-square test by shrinking the test statistic, thus making the test more conservative. The classic continuity correction does this by replacing (O − E)² by (|O − E| − 1/2)² in the numerator of X²_{S1}. Continuity corrections are thoroughly covered in this volume and elsewhere (1,2,38,43). They are generally constructed to better approximate the behavior of exact methods for which statistical software is increasingly available. Thus, their current utility is primarily for situations when maintaining the nominal α is crucial and statistical software for exact testing is not handy.

15 EXACT METHODS

Statistical inferences using exact methods rely on computations from one or more completely specified discrete probability laws. A disadvantage is that discreteness makes it impossible to perform tests with a precisely predetermined type I error rate without using a supplemental randomization to decide some test results, a process most consider unsatisfactory for scientific discourse. While the exact test and interval are straightforward in the one-sample case, with two samples the null hypothesis is composite: rather than a specified p0, under H0: p1 = p2 = p, p can assume any value in [0,1]. A simplification strategy is required to select from or combine over this universe of distributions compatible with H0.
16 FISHER'S EXACT TEST

R.A. Fisher proposed conditioning on fixed row and column marginal totals. This restriction on the sample space allows direct numerical calculation of the distribution of any test statistic from the 2 × 2 table. For instance, the distribution of any single cell count is hypergeometric, as in

\Pr(n_{11} = t) = \frac{\binom{n_{1\cdot}}{t}\binom{n_{2\cdot}}{n_{\cdot 1} - t}}{\binom{N}{n_{\cdot 1}}},   (12)
where t ranges from max(0, n1· + n·1 − N) to min(n1· , n·1 ). One can calculate the probability of each possible t, and then a P value by summing the probabilities of n11 and all t more compatible than n11 with the alternative hypothesis. The highly constrained hypergeometric setting may lead to very few possible values of n11 , and hence very few possible tables and P values. In the tea tasting example through which Fisher introduced the test, N = 8 with n1· = n·1 = n2· = n·2 = 4. Since 0 ≤ n11 ≤ 4, there are five possible P values, respectively, 0.014, 0.24, 0.76, 0.99, and 1.0 for n11 = 0, . . . , 4 for his one-sided test, and three possible P values (0.029, 0.48, 1.0) for the two-sided test. Thus, for instance, the common true type I error rate of all one-sided tests with desired (nominal) type I error rates between 2 and 23% is actually 1.4%. This inherent conservatism, at times extreme, can be removed by supplemental randomization to precisely achieve any nominal α; Tocher (52) has shown such randomized tests to be uniformly most powerful among unbiased tests. The mathematical optimality of randomized tests has not, however, overcome the taint of arbitrariness that restricts their use.
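The hypergeometric calculation in equation (12) underlies Fisher's exact test; the Python sketch below reproduces the set of achievable one-sided P values for the tea-tasting layout described above (N = 8 with all margins equal to 4), using scipy's hypergeometric distribution.

```python
from scipy.stats import hypergeom

# Margins of the tea-tasting table: n1. = n.1 = 4, N = 8
N, n1_dot, n_dot1 = 8, 4, 4
dist = hypergeom(N, n1_dot, n_dot1)   # distribution of n11 given fixed margins

# One-sided tail probability Pr(n11 >= t) for each possible n11
for t in range(0, 5):
    print(t, dist.sf(t - 1))          # sf(t - 1) = Pr(n11 >= t)
```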
Consequently, a variety of alternative nonrandomized exact methods have been developed that reduce the discreteness and hence conservatism of Fisher's approach. For instance, Agresti (3) provides a thorough and readable review of confidence interval options for the difference, ratio, and odds ratio of two binomial proportions that guarantee nominal coverage. See (6,26,50) for details of some intervals with good properties. Using mid-P values with Fisher's exact test has gained considerable recent support. As in the one-sample case, this method does not guarantee preservation of the nominal level α, but performance of a test using mid-P is generally closer to nominal than that of the corresponding exact test, and power is inherently higher (1).

17 UNCONDITIONAL EXACT TESTS
It is more common in experiments, and always the case in observational research, for at least one tabular margin to be random. Conditioning on only one set of margins, say {n1·, n2·}, produces a much larger sample space than conditioning on two, allowing many more possible tables and P values under H0, and thus tests frequently yielding closer to the nominal type I error rates. However, the indeterminate nuisance parameter p1 = p2 = p must still be removed. This problem may be overcome, in the context of an arbitrary test statistic T such as T = X²_{S2}, by maximizing the exact P value over possible values of the nuisance parameter:

P\text{ value} = \sup_{0 \le p \le 1} \Pr_p(T \ge t_0 \mid n_{1\cdot}, n_{2\cdot}),   (13)

where t0 is the observed test statistic (9,10,19). This "unconditional exact" method has been criticized precisely because it maximizes over the full range of p; Fisher (35) and others (51) have argued that possible samples with far different total successes than were observed are irrelevant. In response, Berger and Boos (15) proposed maximizing the P value over a 100(1 − β)% confidence set C_β for p, and adjusting the result for the restriction to the confidence set. Adding β to the maximum over the confidence set yields a valid P value, namely,

P\text{ value} = \sup_{p \in C_\beta} \Pr_p(T \ge t_0 \mid n_{1\cdot}, n_{2\cdot}) + \beta.   (14)

Since the approach becomes more conservative (P values increase) as the confidence interval narrows, these authors favor a high confidence coefficient, for example, 100(1 − β)% = 99.9% or β = 0.001, and hence a wide interval. This modification of the approach of Boschloo (19) also maintains the guaranteed level α and is "often uniformly more powerful than other tests" including Fisher's exact test (14). At the time of publication, this test and its associated confidence intervals were not widely commercially available. However, both have been implemented in StatXact Version 6 (30), and modified Boschloo P values are obtainable using software available from Berger at http://www4.stat.ncsu.edu/~berger/software.html. Several alternative unconditional exact intervals that also guarantee coverage have been proposed by Chan & Zhang (24).
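As a rough illustration of equation (13), the sketch below computes an unconditional exact P value for the pooled chi-square statistic by enumerating all tables with the observed row totals and taking the supremum over a grid of nuisance-parameter values. The grid resolution and function names are illustrative simplifications; a production implementation, such as the Berger–Boos refinement of equation (14), would instead restrict p to a confidence set and add β.

```python
import numpy as np
from scipy.stats import binom

def chi_sq_stat(x1, n1, x2, n2):
    p = (x1 + x2) / (n1 + n2)
    if p in (0.0, 1.0):
        return 0.0
    return (x1 / n1 - x2 / n2) ** 2 / (p * (1 - p) * (1 / n1 + 1 / n2))

def unconditional_exact_p(x1, n1, x2, n2, grid=np.linspace(0.001, 0.999, 999)):
    t0 = chi_sq_stat(x1, n1, x2, n2)
    a = np.arange(n1 + 1)[:, None]               # all possible first-row successes
    b = np.arange(n2 + 1)[None, :]               # all possible second-row successes
    stats = np.vectorize(chi_sq_stat)(a, n1, b, n2)
    reject = stats >= t0 - 1e-12                 # tables at least as extreme as observed
    best = 0.0
    for p in grid:                               # sup over the nuisance parameter
        probs = binom.pmf(a, n1, p) * binom.pmf(b, n2, p)
        best = max(best, probs[reject].sum())
    return best

print(unconditional_exact_p(7, 12, 1, 10))
```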
18 BAYESIAN METHODS
The reader is referred to (39) for a thorough review of Bayesian inference for 2 × 2 tables based on a variety of noninformative, subjective, and correlated priors. At the price of introducing a continuous subjective prior distribution for the two unknown parameters p1 and p2 , the choice of which is open to debate, the Bayesian statistician bypasses many frequentist difficulties due to discreteness. In a hypothesis-testing framework, the frequentist’s final result is a P value – the probability of the observed data and data more compatible with the alternative, conditional on a fixed null hypothesis. Owing to discreteness of the sample space, the discrete distribution of a test statistic may yield few choices of achievable type I error rates for a nonrandomized exact test, and may be poorly approximated by the continuous chi-square or other asymptotic distribution. The Bayesian, however, conditions upon the observed data – all of it rather than
selected margins – and calculates the posterior probability of a particular hypothesis of interest: Pr(p1 > p2 | {nij}), Pr(p1 < p2 | {nij}), Pr(|p1 − p2| < ε | {nij}), and so on (39). Such probabilities are determined by integrating over the appropriate space in the joint posterior distribution of (p1, p2), which will be continuous whenever a continuous prior is selected. Circumstances are uncommon in which a discrete bivariate prior would be reasonable for the two unknown proportions.

A more general advantage of Bayesian methods is that they satisfy the likelihood principle, which asserts that inference about a parameter should depend only on the relative values of the likelihood function at the parameter's admissible values and not otherwise on the data collection method. As Berger and Berry clearly illustrate (13), frequentist hypothesis testing incorporates the probability of outcomes that never occur and therefore, two different research designs that yield the exact same data may provide different P values and hence, different inferences. Bayesian methods do not exhibit this somewhat counterintuitive behavior because, no matter what the research design, they condition on all of the data rather than on a particular choice of marginal totals (41).

In a Bayesian analysis, discreteness is still manifest in the distribution of the posterior probability of a specific hypothesis, considered as a random variable. The posterior can realize only as many values as there are possible tables under the study design, but this does not produce the interpretive complications presented by the frequentist context. As an example, when independent Jeffreys priors are placed on p1 and p2, Pr(p1 > p2) is closely approximated by Φ(z), where

z = (n_{11}n_{22} - n_{12}n_{21})\sqrt{\frac{n_{11}+n_{12}+n_{21}+n_{22}}{n_{1\cdot}\,n_{2\cdot}\,n_{\cdot 1}\,n_{\cdot 2}}}.   (15)

This corresponds to the one-sided P value from X²_{S2}. Numerical methods such as Gibbs sampling and Markov Chain Monte Carlo (MCMC) can be used to improve this approximation (23,36). Howard also considers the case of correlated priors for p1 and p2. Such a choice, with positive correlation, is one way to represent the subjective belief that p1 and p2 are unequal but not likely to differ substantially.
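A small Monte Carlo sketch makes the Bayesian calculation concrete: with independent Jeffreys priors the posteriors are Beta distributions, so Pr(p1 > p2 | data) can be estimated by simulation and compared with the normal approximation Φ(z) of equation (15). The cell counts below are illustrative.

```python
import numpy as np
from scipy.stats import beta, norm

n11, n12, n21, n22 = 18, 12, 10, 20          # illustrative 2 x 2 counts
rng = np.random.default_rng(0)

# Independent Jeffreys posteriors: Beta(n11 + 1/2, n12 + 1/2) and Beta(n21 + 1/2, n22 + 1/2)
p1 = beta.rvs(n11 + 0.5, n12 + 0.5, size=200_000, random_state=rng)
p2 = beta.rvs(n21 + 0.5, n22 + 0.5, size=200_000, random_state=rng)
print("Monte Carlo Pr(p1 > p2 | data):", (p1 > p2).mean())

# Normal approximation of equation (15)
N = n11 + n12 + n21 + n22
n1d, n2d, nd1, nd2 = n11 + n12, n21 + n22, n11 + n21, n12 + n22
z = (n11 * n22 - n12 * n21) * np.sqrt(N / (n1d * n2d * nd1 * nd2))
print("Normal approximation Phi(z):", norm.cdf(z))
```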
19 STUDY DESIGN AND POWER

For Student's t-test and other tests based on continuous distributions, it is possible to choose a rejection region that guarantees precisely an arbitrary prespecified type I error rate, whatever the common value of the mean. This is not possible when comparing two proportions, regardless of the test used, without very large sample sizes or use of a supplementary randomization resisted by the scientific community. Consequently, the power of a test at a nominal α depends on the sample sizes in each group, the values of each of the two probabilities, and the true type I error rate achieved by the test in that combination of circumstances. The power is relatively increased when true type I error overshoots the nominal α, as for instance, with the Wald test based on the observed difference p̂1 − p̂2 in small to moderate samples, and relatively decreased when true type I error falls below nominal α, as with Fisher's exact test in most of its uses. When designing a study, such trade offs should be kept in mind. The various widely published asymptotic formulae for approximate sample sizes or power for the chi-square, likelihood ratio, or Fisher's exact tests generally fail to capture the saw-tooth nature of this sample size/power trade off. As in the single proportion case, small increases in sample size may reduce power. As a general rule, efficient use of data is promoted by the use of methods for which actual type I error rates are close to nominal. Methods based on power-divergence statistics and associated confidence intervals, slightly modified if needed to improve coverage, are generally useful for moderate to large samples, and for smaller samples when strict control of type I error is not essential.

The power of any test against any fixed distribution in the space of alternatives is defined as the probability of the rejection region under the alternative distribution. This may generally be computed more accurately by exact calculations or Monte Carlo methods than by asymptotic formulae. Sample size may be chosen indirectly by this method which, although cumbersome, is reliable and avoids ambiguities associated with the use of asymptotic approximations in general purpose statistical software. Most available sample size/statistical power software packages calculate the power for Pearson, likelihood ratio, and Fisher exact tests (46), but may default to approximations. When the true type I error rate must not be allowed to exceed the nominal α to any degree, as frequently occurs in the context of regulatory decision making, Fisher's exact test has been the conventional choice. However, unconditional exact methods generally offer more power, and under randomization, the unconditional framework for inference is no less valid than the conditional. The modified Boschloo unconditional test is usually more powerful than the original Boschloo test, which is uniformly more powerful than Fisher's exact test (14). Similarly, inversion of the modified Boschloo test produces a narrower unconditional exact confidence interval for p1 − p2, while guaranteeing at least nominal coverage (15,30). We reiterate that commercially available sample-size software uses a variety of formulae and approximations that are not always well documented, and may or may not incorporate continuity corrections by default.
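The advice to compute power from the rejection region translates directly into a short simulation. The sketch below estimates the power of the (uncorrected) Pearson chi-square test and of Fisher's exact test by Monte Carlo under an illustrative alternative (p1, p2), using standard scipy routines; sample sizes, probabilities, and the replicate count are assumptions for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

def mc_power(n1, n2, p1, p2, alpha=0.05, reps=2000, seed=1):
    rng = np.random.default_rng(seed)
    rej_chi2 = rej_fisher = 0
    for _ in range(reps):
        x1, x2 = rng.binomial(n1, p1), rng.binomial(n2, p2)
        table = [[x1, n1 - x1], [x2, n2 - x2]]
        try:
            _, p_chi2, _, _ = chi2_contingency(table, correction=False)
        except ValueError:          # degenerate table (a zero margin)
            p_chi2 = 1.0
        _, p_f = fisher_exact(table)
        rej_chi2 += p_chi2 < alpha
        rej_fisher += p_f < alpha
    return rej_chi2 / reps, rej_fisher / reps

print(mc_power(40, 40, 0.25, 0.55))   # chi-square power vs Fisher power
```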
20 TWO-SAMPLE SUMMARY
The Pearson chi-square test is powerful and offers nearly α type I error rates for large samples. For smaller studies, however, test statistics are desirable that either do not require asymptotic assumptions or are adjusted to improve performance. The modified Boschloo unconditional exact test is a more powerful alternative to Fisher's exact test that strictly preserves nominal type I error when study design leaves at least one tabular margin random. If both margins are inherently fixed and preservation of nominal type I error is essential, then Fisher's exact test, with its inherent conservatism, is warranted. Bayesian tests that escape some discreteness problems and produce straightforward interpretations may also provide insight.

Confidence intervals for differences of proportions and ratios of proportions or odds are generally available by inverting hypothesis tests. These methods are discussed in a variety of texts and manuscripts (2–4,8,11,12,43,45,49,54). Bedrick provides thorough coverage of confidence intervals for ratios of two proportions within the power-divergence family of statistics. He concludes that 0.67 ≤ λ ≤ 1.25 give intervals with the best coverage. This range includes the intervals based on the "Cressie–Read statistic" with λ = 2/3, and on the Pearson chi-square test. For small samples, when strict preservation of coverage is not essential, confidence intervals can be constructed by inverting hypothesis tests based on mid-P values (16) to substitute for the overly broad exact intervals. Closed forms do not exist, but software such as Cytel's StatXact (30) computes these intervals.

For simple problems of inference about one and two proportions, research and expanded computing power have clarified deficiencies in methods that have formed the basis of statistical pedagogy and most scientific practice. Although improved methods have been identified, and their properties are generally well understood, they have not yet seen widespread application. A key requirement for the acceptance of any modern statistical methodology is convenient availability in software. While many of the newer techniques discussed here are implemented in special purpose commercial software and/or can be readily programmed using SAS/IML, S-PLUS, or the freeware R, as of March 2004, we know of none employed as defaults, and support by the general purpose statistical packages used by most data analysts is inconsistent. More frequent use of the improved methods, and hence better inferences for these scientifically ubiquitous situations, await more widespread and convenient implementation by the mass market software packages.
REFERENCES

1. Agresti, A. (2001). Exact inference for categorical data: recent advances and continuing controversies, Statistics in Medicine 20, 2709–2722.
2. Agresti, A. (2002). Categorical Data Analysis, 2nd ed. Wiley, New York.
3. Agresti, A. (2003). Dealing with discreteness: making 'exact' confidence intervals for proportions, differences of proportions, and odds ratios more exact, Statistical Methods in Medical Research 12, 3–21.
4. Agresti, A. & Caffo, B. (2000). Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures, The American Statistician 54, 280–288.
5. Agresti, A. & Coull, B. A. (1998). Approximate is better than "exact" for interval estimation of binomial proportions, The American Statistician 52, 119–126.
6. Agresti, A. & Min, Y. (2001). On small-sample confidence intervals for parameters in discrete distributions, Biometrics 57, 963–971.
7. American Association for the Advancement of Science. (1984). Science 84, Washington, November.
8. Anbar, D. (1983). On estimating the difference between two probabilities, with special reference to clinical trials, Biometrics 39, 257–262.
9. Barnard, G.A. (1945). A new test for 2 × 2 tables, Nature 156, 177.
10. Barnard, G.A. (1947). Significance tests for 2 × 2 tables, Biometrika 34, 123–138.
11. Beal, S.L. (1987). Asymptotic confidence intervals for the difference between two binomial parameters for use with small samples, Biometrics 43, 941–950.
12. Bedrick, E.J. (1987). A family of confidence intervals for the ratio of two binomial proportions, Biometrics 43, 993–998.
13. Berger, J.O. & Berry, D.A. (1988). Statistical analysis and the illusion of objectivity, American Scientist 76, 159–165.
14. Berger, R.L. (1994). Power Comparison of Exact Unconditional Tests for Comparing Two Binomial Proportions, Institute of Statistics Mimeo Series No. 2266.
15. Berger, R.L. & Boos, D.D. (1994). P values maximized over a confidence set for the nuisance parameter, Journal of the American Statistical Association 89, 1012–1016.
16. Berry, G. & Armitage, P. (1995). Mid-P confidence intervals: a brief review, The Statistician 44, 417.
17. Blaker, H. (2000). Confidence curves and improved exact confidence intervals for discrete distributions, Canadian Journal of Statistics 28, 793–798.
18. Blyth, C.R. & Still, H.A. (1983). Binomial confidence intervals, Journal of the American Statistical Association 78, 108–116.
19. Boschloo, R.D. (1970). Raised conditional level of significance for the 2 × 2 table when testing the equality of two probabilities, Statistica Neerlandica 24, 1–35.
20. Brown, L.D., Cai, T. & DasGupta, A. (2001). Interval estimation for a binomial proportion, Statistical Science 16, 101–133.
21. Casella, G. (1986). Refining binomial confidence intervals, Canadian Journal of Statistics 14, 113–129.
22. Casella, G. & Berger, R.L. (1990). Statistical Inference. Duxbury Press, Belmont.
23. Carlin, B.P. & Louis, T.A. (2000). Bayes and Empirical Bayes Methods for Data Analysis. Chapman & Hall/CRC Press, Boca Raton.
24. Chan, I.S.F. & Zhang, Z. (1999). Test-based exact confidence intervals for the difference of two binomial proportions, Biometrics 55, 1202–1209.
25. Clopper, C.J. & Pearson, E.S. (1934). The use of confidence or fiducial limits illustrated in the case of the binomial, Biometrika 26, 404–413.
26. Coe, P.R. & Tamhane, A.C. (1993). Small sample confidence intervals for the difference, ratio, and odds ratio of two success probabilities, Communications in Statistics, Part B – Simulation and Computation 22, 925–938.
27. Corcoran, C. & Mehta, C. (2001). Comment on "Interval estimation for a binomial proportion", Statistical Science 16, 122–123.
28. Cox, C.P. & Groeneveld, R.A. (1986). Analytic results on the difference between G² and χ² test statistics in one degree of freedom cases, The Statistician 35, 417–420.
29. Cressie, N. & Read, T.R.C. (1984). Multinomial goodness-of-fit tests, Journal of the Royal Statistical Society, Series B 46, 440–464.
30. Cytel Software Corporation. (2003). StatXact Version 6 with Cytel Studio, Vol. 2, Cytel Software Corporation, Cambridge, pp. 527–530.
31. D'Agostino, R.B. (1990). Comment on "Yates's correction for continuity and the analysis of 2 × 2 contingency tables", Statistics in Medicine 9, 367.
32. DeGroot, M.H. & Schervish, M.J. (2001). Probability and Statistics, 3rd ed. Wiley, New York.
33. Fisher, R.A. (1922). On the interpretation of chi-square from contingency tables, and the calculation of P, Journal of the Royal Statistical Society 85, 87–94.
34. Fisher, R.A. (1935). Design of Experiments. Oliver and Boyd, Edinburgh, Chapter 2.
35. Fisher, R.A. (1945). A new test for 2 × 2 tables (Letter to Editor), Nature 156, 388.
36. Gelman, A., Carlin, J.B., Stern, H.S. & Rubin, D.B. (1995). Bayesian Data Analysis. Chapman & Hall, London.
37. Ghosh, B.K. (1979). A comparison of some approximate confidence intervals for the binomial parameter, Journal of the American Statistical Association 74, 894–900.
38. Haviland, M.G. (1990). Yates's correction for continuity and the analysis of 2 × 2 contingency tables, Statistics in Medicine 9, 363–365.
39. Howard, J.V. (1998). The 2 × 2 table: a discussion from a Bayesian viewpoint, Statistical Science 13, 351–367.
40. Lancaster, H.O. (1961). Significance tests in discrete distributions, Journal of the American Statistical Association 56, 223–234.
41. Little, R.J.A. (1989). Testing the equality of two independent binomial proportions, The American Statistician 43, 283–288.
42. Newcombe, R.G. (1998a). Two-sided confidence intervals for the single proportion: comparison of seven methods, Statistics in Medicine 17, 857–872.
43. Newcombe, R.G. (1998b). Interval estimation for the difference between two independent proportions: comparison of eleven methods, Statistics in Medicine 17, 873–890.
44. Neyman, J. (1949). Contribution to the theory of the χ² test, Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley.
45. Nurminen, M. (1986). Confidence intervals for the ratio and difference of two binomial proportions, Biometrics 42, 675–676.
46. O'Brien, R. (1998). A tour of UnifyPow: a SAS module/macro for sample-size analysis, Proceedings of the 23rd Annual SAS Users Group International Conference, SAS Institute Inc., Cary, pp. 1346–1355.
47. Pearson, K. (1900). On a criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling, Philosophical Magazine 5(50), 157–175.
48. Read, T.R.C. & Cressie, N.A.C. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer, New York.
49. Santner, S.J. & Snell, M.A. (1980). Small-sample confidence intervals for p1 − p2 and p1/p2 in 2 × 2 contingency tables, Journal of the American Statistical Association 75, 386–394.
50. Santner, T.J. & Yamagami, S. (1993). Invariant small sample confidence intervals for the difference of two success probabilities, Communications in Statistics, Part B – Simulation and Computation 22, 33–59.
51. Sprott, D.A. (2000). Statistical Inference in Science. Springer-Verlag, New York.
52. Tocher, K.D. (1950). Extension of Neyman-Pearson theory of tests to discontinuous variates, Biometrika 37, 130–144.
53. Vollset, S.E. (1993). Confidence intervals for a binomial proportion, Statistics in Medicine 12, 809–824.
54. Walter, S.D. & Cook, R.J. (1991). A comparison of several point estimators of the odds ratio in a single 2 × 2 contingency table, Biometrics 47, 795–811.
55. Wilson, E.B. (1927). Probable inference, the law of succession, and statistical inference, Journal of the American Statistical Association 22, 209–212.
PROSTATE CANCER PREVENTION TRIAL
1
OBJECTIVES
Selection of a primary endpoint for the PCPT was complicated by the presence of several potential biases (8). Table 1 summarizes the potential biases and how these were considered in the planning and execution of the study. Additional discussion of these biases is contained within the ‘‘Study Design’’ section of this case report.
ALON Z. WEIZER MAHA H. HUSSAIN University of Michigan, Ann Arbor, Michigan
Current prostate cancer treatment in the United States is focused on the early detection and active treatment of patients; yet, there is limited evidence to support this strategy (1). To date, only a single randomized controlled study has demonstrated the superiority of surgery over surveillance in the management of prostate cancer (2). In addition, concerns continue to exist regarding the apparent overtreatment of indolent disease as well as the impact on quality of life in men treated for localized prostate cancer [3,4]. The recognition that prostate cancer has a prolonged natural history and that many men with disease will die of other causes has formed the basis for several trials focused on chemoprevention of prostate cancer (5). In 1993, the Prostate Cancer Prevention Trial (PCPT) randomized over 18,000 men to placebo versus finasteride to detect a difference in seven-year prevalence between the two groups (6). Finasteride, a 5-alpha reductase inhibitor that blocks the conversion of testosterone to dihydrotestosterone within the prostate, targets 5-alpha reductase type 2 receptors that are upregulated in prostate cancer (7). Although the results of this large, randomized-controlled study demonstrated a 25% reduction in prevalence in the active treatment arm, prostate cancers identified in the treatment group were more likely to be high grade than the placebo group. This has prevented widespread use of finasteride as a risk reduction strategy. Secondary analyses have focused on explaining the cause of these high-risk cancers as well as on providing valuable information regarding the utility of prostate-specific antigen in screening for prostate cancer.
1.1 Primary Objective

The primary objective of the PCPT was to compare the seven-year disease prevalence in patients treated with finasteride versus a placebo. Prevalence (existing and new cases) rather than incidence (new cases) was chosen as the primary endpoint because prostate cancer could be detected during the trial based on physician recommendation to obtain a prostate biopsy (or to perform a transurethral resection of the prostate) or at the end of the study when all study subjects not diagnosed with prostate cancer were offered a biopsy. This design makes it difficult to ascertain incident cases. Although incidence is more useful for clinicians to counsel patients regarding individual risk, prevalence provides more meaningful information regarding public health planning and as such was more consistent with the aims of the study.

1.2 Secondary Objective

Adverse events and side effects were reported by subjects during directed interviews over the course of treatment and graded according to the toxicity criteria of the Southwest Oncology Group.

2 STUDY DESIGN
To determine differences in seven-year prevalence between men receiving placebo versus finasteride, several potential biases as described in Table 1 were considered in designing the PCPT.
Table 1. Biases Considered in Development of the Prostate Cancer Prevention Trial

Bias: PSA
Reason: Finasteride lowers serum PSA, decreasing likelihood of biopsy recommendation
Management: DSMC* adjusted PSA during study to maintain equivalent numbers of biopsy recommendations in both arms

Bias: DRE
Reason: Finasteride reduces prostate volume, possibly improving the sensitivity of DRE in the treatment arm—could bias toward more or fewer biopsy recommendations
Management: Recognized but not able to manage

Bias: Prostate biopsy technique
Reason: Finasteride reduces prostate volume, increasing the percentage of the gland sampled by the biopsy technique, possibly increasing cancer detection in the treatment arm
Management: Mid-study recommendation to change biopsy technique to direct biopsies to peripheral portion of the gland, the area where most clinically meaningful tumors are identified

Bias: Nonadherence
Reason: PSA would be adjusted when not required, increasing the likelihood of biopsy recommendation
Management: Adjusted for in sample size calculation

Bias: Placebo "drop-in"
Reason: Would decrease PSA, which would not be adjusted, decreasing the likelihood of biopsy recommendation
Management: Adjusted for in sample size calculation

Bias: TURP
Reason: Finasteride decreases chance of TURP for urinary symptoms, decreasing possibility of prostate cancer detection
Management: Detection of prostate cancer during TURP low and likely not to impact number of cancers detected significantly

Bias: Refuse end-of-study biopsy
Reason: Would decrease the detection of the primary endpoint
Management: Adjusted for in sample size calculation

PSA, prostate-specific antigen; *DSMC, data and safety monitoring committee; DRE, digital rectal exam; TURP, transurethral resection of the prostate.
2.1 Eligibility Criteria

Subjects eligible for the trial were men 55 years of age or older with a normal digital rectal examination, no significant coexisting medical conditions, and an American Urological Association (AUA) symptom score of less than 20. The AUA symptom score is a validated instrument used in routine urologic practice to quantify the degree of lower urinary tract symptoms in men and the impact of these symptoms on a patient's quality of life. Scores of ≥21 are associated with severe symptoms (9). Men meeting these criteria were given a three-month supply of placebo. At the end of the three-month period, men were randomized to finasteride versus placebo in a double-blinded fashion if the serum PSA level was 3.0 ng/mL or lower, adherence was within 20% of the expected rate of placebo use, and there were no significant toxic effects.
2.2 Study Interventions

2.2.1 Clinical Follow-up. All subjects were contacted by telephone every three months to determine interim medical events, had biannual medical visits for reissuing of medication and recording of clinically significant medical events, and underwent annual digital rectal examination by a clinician with serum PSA determination. Serum PSA measurements were performed by a central laboratory. Because finasteride decreases serum PSA levels, an independent data and safety monitoring committee ensured that men in the finasteride group would have an equal rate of prostate biopsy recommendation by initially doubling the PSA in the treatment group. During a patient's fourth year in the study, this adjustment factor was changed to
2.3 times the measured value in the finasteride-treated patients.

2.2.2 Prostate Biopsy and Transurethral Resection of the Prostate. Recommendation for a prostate biopsy was made by the treating physician based on serum PSA determination and digital rectal examination findings. Prostate biopsies were performed using transrectal ultrasound guidance with a minimum of six cores obtained. Transurethral resection of the prostate was based on a decision of the treating physician and patient. All pathology specimens were reviewed by a central pathology laboratory. Patients found to have prostate cancer were removed from the study. Patients found to have high-grade prostatic intra-epithelial neoplasia were offered a second biopsy. Concordance was achieved in all cases, by a referee pathologist if necessary.

2.3 Statistical Analysis

To determine the sample size of this double-blinded, randomized-controlled study, the investigators assumed a seven-year prevalence of prostate cancer in the placebo arm of 6% and that a 25% reduction in the prevalence in the treatment arm would be clinically meaningful. Using a two-sided statistical test with an alpha of 0.05, a power of 0.92, and a three-year accrual period, the investigators determined that a sample size of 18,000 men was required. Other assumptions used in the determination of sample size included that 60% of the study participants would be diagnosed with cancer or undergo an end-of-study biopsy, 20% would die during the study, 5% would decline an end-of-study biopsy, and 15% would be lost to follow-up. In addition, the investigators assumed that 14% would not adhere to the treatment arm and 5% of men in the placebo group would take finasteride (drop-ins). Differences between groups were based on an intent-to-treat analysis and included all men who were diagnosed with prostate cancer or underwent an end-of-study biopsy for the primary objective. The secondary objective included all men enrolled in the study.
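As a rough illustration of the design parameters quoted above (a 6% placebo prevalence, a 25% relative reduction, a two-sided alpha of 0.05, and power of 0.92), the Python sketch below applies the standard normal-approximation sample size formula for comparing two proportions. It is not the investigators' published calculation, and the crude adherence adjustment at the end is only an assumed illustration of why nonadherence, drop-in, deaths, biopsy refusal, and losses to follow-up push the required total well beyond the unadjusted figure.

from scipy.stats import norm

def n_per_arm(p1, p2, alpha=0.05, power=0.92):
    # Normal-approximation sample size per arm for a two-sided test of two proportions.
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p1 - p2) ** 2

p_placebo = 0.06                          # assumed seven-year prevalence, placebo arm
p_finasteride = 0.75 * p_placebo          # 25% relative reduction
print(round(n_per_arm(p_placebo, p_finasteride)))   # about 5,000 men per arm unadjusted

# Crudely diluting the contrast for the assumed 14% nonadherence and 5% drop-in
# (an illustration only, not the trial's actual adjustment) already inflates the
# requirement substantially; the further allowances listed above for deaths,
# biopsy refusal, and losses to follow-up account for the 18,000-man target.
delta = p_placebo - p_finasteride
p1_adj = p_placebo - 0.05 * delta         # drop-ins pull the placebo event rate down
p2_adj = p_finasteride + 0.14 * delta     # nonadherers pull the treated event rate up
print(round(n_per_arm(p1_adj, p2_adj)))   # roughly 7,700 men per arm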
2.4 Data and Safety Monitoring

An independent data and safety monitoring committee reviewed data on safety, adherence, prostate cancer diagnosis, and data related to study assumptions every six months. This committee suggested protocol revisions to maintain equal numbers of prostate biopsy recommendations between treatment groups. No formal interim stopping rules were specified.

3 RESULTS
A total of 18,882 patients underwent randomization over three years. The data and safety monitoring committee ended the trial 15 months prior to the anticipated date because the primary objective of the study had been achieved.

3.1 Evaluation of Study Assumptions

Table 2 outlines the assumptions used in the design of the study and the actual findings. The high percentage of men declining an end-of-study biopsy was balanced by the lower than expected death rate and lower percentage of patients lost to follow-up. In addition, the study attempted to keep the percentage of patients recommended prostate biopsy equal between groups. Although there was a difference between the number of patients recommended biopsy by group (22.5% finasteride, 24.8% placebo, P < 0.001), there was no significant difference in the number of recommended biopsies performed between groups.
Table 2. Actual Results Compared With Study Assumptions of the Prostate Cancer Prevention Trial

Assumption                        Anticipated    Actual (Placebo)    Actual (Finasteride)
Cancer or end-of-study biopsy     60%            63%                 59.6%
Death                             20%            7.0%                6.7%
Declining end-of-study biopsy     5%             25.4%               22.8%
Lost to follow-up                 15%            8.0%                7.4%
Nonadherence                      14%            14.7%               10.8%
"Drop-in" percentage              5%             6.5%                N/A
Recommendation for biopsy         Equivalent     24.8%               22.5%
3.2 Primary Objective: Prostate Cancer Prevalence

A total of 9060 men were included in the evaluation of the primary endpoint. Prostate cancer was detected in 18.4% of men treated with finasteride versus 24.4% of men treated with placebo, for a relative risk reduction of 24.8% (95% confidence interval 18.6 to 30.6%, P < 0.001). The number of men with prostate cancer was higher in the placebo group for both for-cause and end-of-study biopsies. Finasteride conferred a reduction of prostate cancer risk regardless of age, ethnicity, family history of prostate cancer, and PSA level at study entry.

Of the 757 graded prostate cancers detected in the finasteride arm, 280 (37%) were assigned a Gleason score of 7, 8, 9, or 10, versus 237 (22.2%) of 1068 graded tumors in the placebo group. This corresponds to a relative risk of 1.67 for a high-grade tumor in the treatment versus placebo arm among men found to have cancer (95% confidence interval 1.44–1.93, P = 0.005).
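The relative risk and confidence interval just quoted can be recovered directly from these counts. The short Python sketch below uses a generic large-sample, log-scale interval rather than the trial's own analysis, but it reproduces the published figures closely.

from math import exp, log, sqrt

def relative_risk(events1, n1, events2, n2, z=1.96):
    # Risk ratio with a large-sample confidence interval computed on the log scale.
    rr = (events1 / n1) / (events2 / n2)
    se_log = sqrt(1 / events1 - 1 / n1 + 1 / events2 - 1 / n2)
    return rr, exp(log(rr) - z * se_log), exp(log(rr) + z * se_log)

# 280 of 757 graded cancers were Gleason 7-10 in the finasteride arm,
# versus 237 of 1068 graded cancers in the placebo arm.
rr, lo, hi = relative_risk(280, 757, 237, 1068)
print(f"RR = {rr:.2f} (95% CI {lo:.2f} to {hi:.2f})")   # RR = 1.67 (95% CI 1.44 to 1.93)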
3.3 Secondary Objective: Adverse Events

Men in the finasteride group reported more reduced volume of ejaculate (60.4% versus 47.3%), erectile dysfunction (67.4% versus 61.5%), loss of libido (65.4% versus 59.6%), and gynecomastia (4.5% versus 2.8%) than men in the placebo group (all P-values < 0.001, not adjusted for multiple comparisons). Conversely, men in the placebo group reported more benign prostatic hyperplasia (8.7% versus 5.2%), increased urinary symptoms (15.6% versus 12.9%), incontinence (2.2% versus 1.9%), urinary retention (6.3% versus 4.2%), surgical intervention for prostate enlargement/symptoms (1.9% versus 1.0%), prostatitis (6.1% versus 4.4%), and urinary tract infections (1.3% versus 1.0%) than men in the finasteride group (all P-values < 0.001, not adjusted for multiple comparisons). There was no significant difference between the groups in the total number of deaths or in the number of prostate cancer deaths (five in each group).
4 CONCLUSIONS
The investigators published the initial findings of the PCPT in 2003. Multiple follow-up publications based on secondary analyses of both clinical and biological data from the study participants have been reported. Although it is beyond the scope of this review to detail the results of secondary analyses, three topics deserve mention. First, although the findings of the PCPT showed a 24.8% reduction in the seven-year prevalence of prostate cancer in the finasteride arm, high-grade disease (Gleason score 7, 8, 9, and 10) was found in 6.4% of the finasteride group versus 5.1% of the placebo group (6). Thus, multiple subsequent analyses have focused on the meaning of this finding. Second, this study has provided a significant opportunity to evaluate the risk of prostate cancer detection based on PSA level. When the trial was initiated in 1993, PSA had only been used in routine clinical practice for approximately a decade. Although a value of 4.0 ng/mL was chosen based on optimizing sensitivity and specificity, the end-of-study biopsies in both the placebo and treatment groups have clearly indicated that there is a spectrum of risk even with PSA values well below 4.0 ng/mL. This has resulted in clinical refinements in the use of PSA, a search for more sensitive and specific markers, and a risk calculator developed from the clinical data of the PCPT. Finally, the costs and benefits of prolonged use of finasteride for prevention of prostate cancer continue to be debated.
4.1 Meaning of High-Grade Cancers in the Treatment Arm

Detractors of the PCPT point to the increased percentage of men with high-risk prostate cancer in the treatment arm as a reason not to use finasteride as a risk reduction strategy. The biologic argument is that finasteride represents a form of androgen deprivation and that the high-grade cancers seen, arising from selection of more aggressive portions of the tumors, place these patients at increased risk (10). The investigators, in a series of secondary analyses, have demonstrated that if this were the case, most high-grade tumors would have been detected at the end of the study. However, the data demonstrate that cancer detection, and specifically high-grade cancer detection, did not change during the study (11). In addition, the investigators have identified several possible explanations for the higher prevalence of Gleason 7–10 tumors in the treatment arm (11). Perhaps the most compelling of these arguments is that finasteride decreases prostate volume. This reduction in prostate volume has two effects. First, it allows a more accurate assessment of the prostate by digital rectal examination. Indeed, a recent report demonstrates that finasteride improves the sensitivity of digital rectal examination in the detection of prostate cancer (12). Second, the smaller volume gland of finasteride-treated men results in a greater proportion of the gland being sampled compared with the placebo group. As a result, it is more likely that an identified tumor is accurately sampled by a biopsy than is the case with a bigger gland (13). Other arguments include the suggestion that Gleason scoring is not accurate in finasteride-treated glands. Although androgen deprivation does create difficulties with Gleason grading, a comparable impact of finasteride seems less likely, and continued investigation into this question should lead to more definitive information (11). Finally, the investigators suggest that finasteride may be more effective at preventing lower grade tumors than higher grade tumors. This may be the case, as secondary analyses have demonstrated that finasteride reduces the percentage of patients with high-grade prostatic intraepithelial neoplasia, a
presumed precursor to prostate cancer (14). In addition, results from men on the trial who subsequently underwent radical prostatectomy demonstrate a selective inhibition of low-grade cancer, which allowed high-risk cancer to be detected earlier (15).

4.2 Operating Characteristics of PSA

Because men in the placebo arm were offered an end-of-study biopsy, the PCPT represented a unique opportunity to understand the risk of prostate cancer in men with PSA values below 4.0 ng/mL. Although lowering the value used to recommend biopsy improves the sensitivity of PSA screening, specificity is poor, subjecting a large portion of men to unnecessary biopsies. The investigators of the Prostate Cancer Prevention Trial identified 2950 men in the placebo arm who had a PSA < 4.0 ng/mL and who underwent an end-of-study biopsy. Fifteen percent of these men had biopsy-detected prostate cancer. In addition, 15% of the cancers detected in this group were high grade. They found no "cutoff" value that would eliminate the risk of prostate cancer (16). An additional finding, in a follow-up study, was that finasteride improved the operating characteristics of PSA (17). In a separate study, the investigators identified PSA, age, family history, ethnicity, digital rectal examination findings, and history of a prostate biopsy as independent predictors of detection of prostate cancer on a biopsy. These results were based on 5519 men in the placebo arm of the PCPT in whom these data were available and who had at least two previous annual PSA determinations. These variables were used to create a risk calculator that can be used by clinicians to inform patients regarding the chances of detecting prostate cancer (and high-grade cancer) on a six-core biopsy. These results apply to men older than 55 years because of the inclusion criteria of the initial study (18). This risk calculator has been externally validated (19). The link to this calculator is included in the Further Reading section.

4.3 PCPT in Clinical Practice

Although a significant amount of information has been learned from the PCPT, the fact is that finasteride has not been widely
instituted as a risk reduction strategy in men at risk of prostate cancer. Compelling arguments have been made to explain the higher percentage of patients with high-risk cancer in the finasteride arm; despite this, clinicians continue to have reservations about its use. Additional explanations for the limited implementation of the study findings include the sexual side effects experienced by otherwise healthy men. From a public health standpoint, the cost burden associated with finasteride use is substantial, at an estimated $130,000 per quality-adjusted life year gained (assuming the high-risk cancers are an artifact and accounting for the favorable effect of finasteride on lower urinary tract symptoms) (20). The greatest impact is likely on how PSA can be used in screening for prostate cancer. Abandonment of a single normal value in favor of risk calculators and change in PSA over time offers the potential to detect cancers earlier without subjecting men unnecessarily to an initial or subsequent biopsy. In addition, for those men who are placed on finasteride for lower urinary tract symptoms, the improved sensitivity of PSA and digital rectal examination provide the clinician with better tools to determine the need for prostate biopsy. Future analysis of biologic and clinical data from the PCPT and other screening studies will hopefully clarify the continued controversy around the identification of high-risk cancers in the treatment arm, the impact of long-term finasteride use, and whether refined strategies of screening for prostate cancer can result in improved survival while reducing the potential morbidity of treatment in men with indolent disease.

REFERENCES

1. D. Ilic, D. O'Connor, S. Green, and T. Wilt, Screening for prostate cancer. Cochrane Database Syst. Rev., 2006 Jul; 19(3): CD004720. 2. A. Bill-Axelson, L. Holmberg, M. Ruutu, M. Haggman, S. O. Anderson, S. Bratell, A. Spangberg, C. Busch, S. Nordling, H. Garmo, J. Palmgren, H. O. Adami, B. J. Norlen, J. E. Johansson, and the Scandinavian Prostate Cancer Group Study No. 4, Radical prostatectomy versus watchful waiting for early prostate cancer. N. Engl. J. Med., 2005; 352(19): 1977–1984.
3. M. S. Litwin, J. L. Gore, L. Kwan, J. M. Brandeis, S. P. Lee, H. R. Withers, and R. E. Reiter, Quality of life after surgery, external beam irradiation, or brachytherapy for early stage prostate cancer. Cancer, 2007; 109(11): 2239–2247. 4. R. K. Nam, A. Toi, L. H. Klotz, J. Trachtenberg, M. A. Jewett, S. Appu, D. A. Loblaw, L. Sugar, S. A. Narod, and M. W. Kattan, Assessing individual risk for prostate cancer. J. Clin. Oncol., 2007; 25(24): 3582–3588. 5. I. M. Thompson, Chemoprevention of prostate cancer: agents and study designs. J. Urol., 2007; 178(3 Pt 2):S9–S13. 6. I. M. Thompson, P. Goodman, C. Tangen, M. S. Lucia, G. J. Miller, L. G. Ford, M. M. Lieber, R. D. Cespedes, J. N. Atkins, S. M. Lippman, S. M. Carlin, A. Ryan, C. M. Szczepanek, J. J. Crowley, and C. Coltman, The influence of finasteride on the development of prostate cancer. N. Engl. J. Med., 2003 Jul; 349(3): 215–224. 7. G. Andriole, D. Bostwick, F. Civantos, J. I. Epstein, M. S. Lucia, J. McConnell, and C. G. Roehrborn, The effects of 5 alpha-reductase inhibitors on the natural history, detection and grading of prostate cancer: current state of knowledge. J. Urol., 2005 Dec; 174(6): 2098–2104. 8. P. J. Goodman, I. M. Thompson, C. M. Tangen, J. J. Crowley, L. G. Ford, and C. A. Coltman, The prostate cancer prevention trial: design, biases and interpretation of study results. J. Urol., 2006 Jun; 175(6): 2234–2242. 9. M. J. Barry, F. J. Fowler, M. P. O’Leary, R. C. Bruskewitz, H. L. Holtgrewe, W. K. Mebust, and A. T. Cockett, The American Urological Association symptom index for benign prostatic hyperplasia. The measurement committee of the American Urological Association. J. Urol., 1992 Nov; 148(5): 1549–1557. 10. P. C. Walsh, Re: Long-term effects of finasteride on prostate specific antigen levels: results from the Prostate Cancer Prevention Trial. J. Urol., 2006 Jul; 176(1): 409–410. 11. E. Canby-Hagino, J. Hernandez, T. C. Brand, and I. M. Thompson, Looking back at PCPT: looking forward to new paradigms in prostate cancer screening and prevention. Eur. Urol., 2007 Jan; 51(1): 27–33. 12. I. M. Thompson, C. M. Tangen, P. J. Goodman, M. S. Lucia, H. L. Parnes, S. M. Lippman, and C. A. Coltman, Finasteride improves the sensitivity of digital rectal examination for prostate cancer detection. J. Urol., 2007 May; 177(5): 1749–1752.
13. Y. C. Cohen, K. S. Liu, N. L. Heyden, A. D. Carides, K. M. Anderson, A. G. Daifotis, and P. H. Gann, Detection bias due to the effect of finasteride on prostate volume: a modeling approach for analysis of the Prostate Cancer Prevention Trial. J. Natl. Cancer Inst., 2007 Sep; 99(18): 1366–1374. 14. I. M. Thompson, M. S. Lucia, M. W. Redman, A. Darke, F. G. La Rosa, H. L. Parnes, S. M. Lippman, and C. A. Coltman, Finasteride decreases the risk of prostatic intraepithelial neoplasia. J. Urol., 2007 Jul; 178(1): 107–109; discussion 110. 15. M. S. Lucia, J. I. Epstein, P. J. Goodman, A. K. Darke, V. E. Reuter, F. Civantos, C. M. Tangen, H. L. Parnes, S. M. Lippman, F. G. La Rosa, M. W. Kattan, E. D. Crawford, L. G. Ford, C. A. Coltman, and I. M. Thompson, Finasteride and high-grade prostate cancer in the Prostate Cancer Prevention Trial. J. Natl. Cancer Inst., 2007 Sep; 99(18): 1375–1383. 16. I. M. Thompson, D. K. Pauler, P. J. Goodman, C. M. Tangen, M. S. Lucia, H. L. Parnes, L. M. Minasian, L. G. Ford, S. M. Lippman, E. D. Crawford, J. J. Crowley, and C. A. Coltman, Prevalence of prostate cancer among men with a prostate-specific antigen level < or = 4.0 ng per milliliter. N. Engl. J. Med., 2004 May; 350(22): 2239–2246. 17. I. M. Thompson, C. Chi, D. P. Ankerst, P. J. Goodman, C. M. Tangen, S. M. Lippman, M. S. Lucia, H. L. Parnes, and C. A. Coltman, Effect of finasteride on the sensitivity of PSA for detecting prostate cancer. J. Natl. Cancer Inst., 2006 Aug; 98(16): 1128–1133. 18. I. M. Thompson, D. P. Ankerst, C. Chi, P. J. Goodman, C. M. Tangen, M. S. Lucia, Z. Feng, H. L. Parnes, and C. A. Coltman, Assessing prostate cancer risk: results from the Prostate Cancer Prevention Trial. J. Natl. Cancer Inst., 2006 Apr; 98(8): 529–534. 19. D. J. Parekh, D. P. Ankerst, B. A. Higgins, J. Hernandez, E. Canby-Hagino, T. Brand, D. A. Troyer, R. J. Leach, and I. M. Thompson, External validation of the Prostate Cancer Prevention Trial risk calculator in a screened population. Urology, 2006 Dec; 68(6): 1152–1155. 20. S. B. Zeliadt, R. D. Etzioni, D. F. Penson, I. M. Thompson, and S. D. Ramsey, Lifetime implications and cost-effectiveness of using finasteride to prevent prostate cancer. Am. J. Med., 2005 Aug; 118(8): 850–857.
FURTHER READING

A link to the Prostate Cancer Prevention Trial risk calculator for detecting cancer at the time of prostate biopsy can be found at: https://labssec.uhs-sa.com/clinical int/dols/proscapredictor.htm

Review of prostate cancer prevention, screening, and active surveillance of low risk disease:

E. A. Klein, E. A. Platz, and I. M. Thompson, Epidemiology, etiology, and prevention of prostate cancer. In: A. J. Wein (ed.), Campbell-Walsh Urology. Philadelphia, PA: Saunders Elsevier, 2007.

J. A. Eastham and P. T. Scardino, Expectant management of prostate cancer. In: A. J. Wein (ed.), Campbell-Walsh Urology. Philadelphia, PA: Saunders Elsevier, 2007.

A review of prostate specific antigen in the early detection of prostate cancer:

I. M. Thompson and D. P. Ankerst, Prostate specific antigen in the early detection of prostate cancer. CMAJ, 2007; 176(13): 1853–1858.
CROSS-REFERENCES

Study Design: Phase III, prevention trials
Clinical Fields: Disease Trials of Urological Diseases
PROTOCOL

The Protocol is a study plan on which all clinical trials are based. The plan is carefully designed to safeguard the health of the participants as well as to answer specific research questions. A protocol describes what types of people may participate in the trial; the schedule of tests, procedures, medications, and dosages; and the length of the study. While in a clinical trial, participants who follow a protocol are examined regularly by the research staff to monitor their health and to determine the safety and the effectiveness of their treatment. The investigator/institution should conduct the trial in compliance with the protocol agreed to by the sponsor and, if required, by the regulatory authority(ies), which was given approval/favorable opinion by the IRB (Institutional Review Board)/IEC (Independent Ethics Committee). The investigator/institution and the sponsor should sign the protocol, or an alternative contract, to confirm their agreement. The investigator should not implement any deviation from, or changes of, the protocol without agreement by the sponsor and prior review and documented approval/favorable opinion from the IRB/IEC of an amendment, except where necessary to eliminate an immediate hazard(s) to trial subjects, or when the change(s) involves only logistical or administrative aspects of the trial [e.g., change of monitor(s) or change of telephone number(s)]. The investigator, or person designated by the investigator, should document and explain any deviation from the approved protocol. The investigator may implement a deviation from, or a change in, the protocol to eliminate an immediate hazard(s) to trial subjects without prior IRB/IEC approval/favorable opinion.

This article was modified from the websites of the United States Food and Drug Administration and the National Institutes of Health (http://clinicaltrials.gov/ct/info/glossary and http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D'Agostino and Sarah Karl.
PROTOCOL AMENDMENT

A Protocol Amendment is a written description of a change(s) to or formal clarification of a protocol. The investigator may implement a deviation from, or a change in, the protocol to eliminate an immediate hazard(s) to trial subjects without prior IRB (Institutional Review Board)/IEC (Independent Ethics Committee) approval/favorable opinion. As soon as possible, the implemented deviation or change, the reasons for it, and, if appropriate, the proposed protocol amendment(s) should be submitted:

• To the IRB/IEC for review and approval/favorable opinion;
• To the sponsor for agreement; and, if required,
• To the regulatory authority(ies).
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
PROTOCOL DEVIATORS

The investigator should not implement any deviation from, or changes of, the protocol without agreement by the sponsor or without prior review and documented approval/favorable opinion from the IRB (Institutional Review Board)/IEC (Independent Ethics Committee) of an amendment, except where necessary to eliminate an immediate hazard(s) to trial subjects, or when the change(s) involve(s) only logistical or administrative aspects of the trial [e.g., change of monitor(s) or change of telephone number(s)]. The investigator may implement a deviation from, or a change in, the protocol to eliminate an immediate hazard(s) to trial subjects without prior IRB/IEC approval/favorable opinion. As soon as possible, the implemented deviation or change, the reasons for it, and, if appropriate, the proposed protocol amendment(s) should be submitted:

• To the IRB/IEC for review and approval/favorable opinion;
• To the sponsor for agreement; and, if required,
• To the regulatory authority(ies).
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
PROTOCOL VIOLATORS
ALEX D. McMAHON
University of Glasgow
Robertson Centre for Biostatistics
Glasgow, United Kingdom

In any clinical trial, a high probability exists that at least some of the patients will deviate from the intended conduct of the study. Instances of deviation from the rules of a study, as laid out in the trial protocol, are commonly called "protocol deviations" or "protocol violations." The patients who are involved in these instances are known as "protocol deviators" or "protocol violators." The two main types of protocol violation are those concerned with entry criteria and those that occur after entry to the study (1), which is usually defined as the point of randomization. A sliding scale of concern exists with protocol violators, measured by the amount of bias that a particular type of violation could introduce into the results of the study; this should be kept in mind when each particular type of violation is being considered during a trial. The avoidance of bias is the most important measure of the quality of a clinical trial (2), especially when a trial is examining only moderate treatment benefits (3). The three main levels of "seriousness" of a violation are: (1) major deviations that have been caused (or could have been prevented) by the investigator; (2) major deviations that have not been caused by the investigator (for example, caused by the patient); and (3) minor deviations that are unlikely to affect the results of the study (4). Deviations from a protocol can be more serious when caused by the investigator than when caused by the patient, because the most serious types of violation involve deliberate departures from the randomized treatment or decipherment of the treatment allocation (4, 5).

1 VIOLATIONS OF THE ELIGIBILITY CRITERIA

It has been argued that the primary purpose of eligibility criteria in a trial is to allow the ethical use of randomization (6). The eligibility criteria of a study are created in order to allow entry to the study to those who may benefit from the treatment, while attempting to disallow those patients who cannot benefit or may be harmed (1). For inclusion criteria, one wants to establish that a patient has been properly diagnosed with the disease in question and has the potential for benefit by the treatment. Newly diagnosed patients may want to be included in the trial (but this depends on the context of the trial). Some trials only include patients at moderate to high risk of a clinical outcome in order to keep the sample size down to a practical level, as, generally, patients at a low risk will generate fewer clinical outcomes, such as coronary heart disease, that may be the primary outcome of the study. However, the trialists are often making the assumption that the results of the study will also apply to patients at a lower risk. A specific age group, gender, or other population may exist that is being targeted by the trialists for practical reasons. A type of inclusion criterion also exists that is based on a run-in period between a screening visit and a randomization visit. The patients may be required to pass some sort of "test," perhaps a minimum standard for compliance with the treatment. However, some kinds of "placebo run-in" periods are considered by some to be controversial for ethical reasons (7). The main reasons for excluding patients from recruitment to a trial are based on ethical considerations. One may not wish to test an experimental drug, for example, on children or pregnant women. Patients who have contraindications for the treatment being studied must be excluded as well (8). Similarly, if a treatment is strongly indicated for a patient (this does not apply to pre-marketing trials where the treatment is not available on prescription yet), then the patient has to receive the treatment and it would be unethical to withhold it. Patients who are unlikely to complete the study should be excluded (or, to be specific, prevented from being randomized) (9), for example, students who may move away from the area or people who may not follow the rules of the
study properly, for example, patients whose cognitive function has been impaired. Certain types of patient in the field of mental health may also find it particularly difficult to comply with the plan of a study (10). It is expected that some patients in a study will violate the eligibility criteria. Some researchers have argued that having eligibility criteria that were too strict actually encourages this type of violation to happen (3, 8, 11, 12). Sometimes the inclusion criteria are poorly reported (especially in publications) and suspicions are raised that at least some of them were not very sensible or productive (13, 14). If a particular research organization has planned a series of trials, it is possible that the eligibility criteria get carried forward to the later trials, without an attempt to re-assess their suitability (11). It often happens that the eligibility criteria are duplicated in the protocol, and the common problem of presenting the same criterion in both the inclusion list and the exclusion list exists. One way to prevent the inclusion of unnecessary eligibility criteria is to give the rationale for every rule in the protocol, which should help to weed out the extraneous ones (11, 13). Some areas of study require very large sample sizes, and impediments to recruitment such as unnecessary eligibility criteria should be avoided. A general argument exists that one should always endeavor to include a wide variety of patients in clinical trials on principle (3, 11, 12, 15–18). An unusual type of eligibility criterion violation exists that is worth noting as a special case. Sometimes a patient's diagnosis is unknown until after randomization has taken place. This event can happen when the diagnosis depends on a laboratory test that may take a week or two to arrive, but in the meantime, the patient must be randomized. Examples of this include trials of antibiotics for patients with infections and multi-center cancer trials that rely on a central pathologist to make the final diagnosis (19, 20).

2 VIOLATIONS OF THE STUDY CONDUCTED AFTER RANDOMIZATION

Many things can go wrong in a trial once the follow-up after randomization has commenced. Various anomalies in the desired
treatment regimen can occur; for example, the patient may not receive any trial treatment at all, the patient could stop taking the treatment for a while, the patient could withdraw from treatment entirely, the wrong treatment could be given, the trial treatment could be supplemented by other disallowed treatments, the patient could switch treatment arms, and compliance with the treatment could be substantially less than the amount that was desired. Variation in compliance could be a source of bias in a study because it has been noted that clinical outcomes can be much better for patients who comply well with their treatment than for patients who do not (21), which can even occur when comparing good placebo compliers with poor placebo compliers. A particularly serious form of violation, which may be unavoidable, is when a patient has to be withdrawn from a treatment because the patient has suffered some kind of adverse effect. These adverse events may be related to the treatment, and indeed may be considered to be "treatment failures" with respect to the efficacy of the treatment. In trials of pharmaceutical interventions, the patient may suffer an "adverse drug reaction," which is an unwanted side effect of the drugs. The patient may also withdraw, or be withdrawn, from the study entirely (and not just the treatment itself). After a period of time, patients sometimes refuse to participate further in a study, perhaps because they have moved their home, emigrated, or had a change in their domestic circumstances such as the death of a loved one. When this refusal occurs, a patient is said to be "lost to follow-up" (22). Violations of other aspects of the study can occur. Clinical measurements may be poorly taken or may not have been recorded at all. In addition to the measurements, the data themselves could be of poor quality or missing regardless of how the measurement was taken. A common reason for irregularities in the recording of clinical data is that the patient may not have attended a scheduled study visit. A related problem is that the patient fails to attend within a specified window of calendar time; for example, the patient is required to report to a clinic three weeks after randomization (within a window of plus or minus three days), but the patient
actually arrives four weeks after randomization. Some of these violations are of a minor nature, and having some missing outcome data may not bias the study even if it is an expensive flaw because of wasted effort. However, if an investigator has managed to decipher the concealed treatment allocation for some reason, then it would be considered a major protocol violation as it lays the study open to accusations of bias. It goes without saying that general diligence and attention to detail will help to prevent many of the minor forms of protocol violation. Trials should try to avoid post-randomization violations by making the trial as simple as practically possible. Measurements should be as convenient as possible for the patient. Sometimes a treatment can be tailored to make it easier for patient compliance. An important principle of clinical trials that can help with post-randomization violators is to continue to measure outcome data after a patient has withdrawn from treatment (or temporarily discontinued treatment) (10, 19, 23). Protocols should have explicit plans to handle events in this manner whenever possible. It may be necessary to specify a temporary rescue medication in the protocol. Sometimes the follow-up of a trial can continue independently of actual personal contact with the patients (provided, of course, that the patients have not withdrawn consent to follow-up). Follow-up is possible when outcomes can be gleaned from hospital records or government health data systems (24).

3 DEALING WITH VIOLATIONS ONCE THEY HAVE HAPPENED

Once the follow-up period of a study has been completed, it is obviously too late to prevent any problems with protocol violation. The violations that have happened need to be dealt with when the results of the trial are being calculated. Many types of minor deviation can be easily ignored. An example is the issue of visit timing. The best thing to do is to include all of the visits as they actually happened in the trial. A very small penalty is paid for these minor deviations from the plan in terms of "noise" (or random error) being
added to the results. This noise is fair, as poor studies have to pay a larger penalty than well-organized and competent studies. Most types of protocol violation are, in fact, dealt with automatically by the intention-to-treat philosophy for the reporting of trial results (25–29). In brief, the intention-to-treat strategy means that patients are analyzed by their randomized groups, and patients are not excluded from the analysis because they are protocol violators. Generally, an intention-to-treat analysis will protect a study from bias, although the cumulative effect of many protocol violators may be to slightly attenuate the estimated treatment effect. Major violations that have been caused by the investigator can be problematic, however, as discussed earlier. Also, if the primary outcome data have not been collected, then the number of useful patients in the study is effectively reduced and topping up the sample size with extra patients to compensate should be considered. Sometimes an intention-to-treat analysis can be challenging, especially if missing outcome data exist. In the event that some of the intended efficacy measurements have been made, the analysis can be salvaged by using the last available reading in a "last observation carried forward" analysis. This type of analysis is acceptable if a dropout patient has left the study because of suspicions of poor efficacy or toxicity, as the bias does not favor a study treatment. However, a "last observation carried forward" analysis may be suspect if placebo patients are dropping out because of poor efficacy or if a condition has fluctuating severity (23). A useful alternative is to take the "worst" measurement of efficacy from the available readings. If no outcome data are available at all, then this flaw in the study is sometimes penalized by imputing a treatment "failure" for the relevant patients. Alternatively, researchers often exclude the patients with no data because the omission will not bias the results, which is sometimes called a "modified intention-to-treat" analysis. Studies where the intention-to-treat strategy must be dispensed with occasionally exist, for example, if many patients were prematurely randomized into a trial before they ever received a treatment (20).
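The following minimal Python sketch, using a small hypothetical data set, shows what a "last observation carried forward" imputation amounts to in practice; it is an illustration of the idea described above, not a recommended analysis plan.

def locf(visits):
    # Carry the last observed value forward over missing (None) visits.
    filled, last = [], None
    for value in visits:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Hypothetical outcome scores at four scheduled visits for three randomized patients;
# None marks a missed assessment. Under intention-to-treat, every patient stays in
# the analysis of their randomized group, and LOCF fills the later gaps.
patients = {
    "active_01":  [52, 50, None, None],   # withdrew from treatment after visit 2
    "active_02":  [48, 47, 45, None],     # missed the final visit
    "control_01": [55, None, 53, 51],     # missed one interim visit
}
for patient_id, visits in patients.items():
    print(patient_id, locf(visits))
# Patients with no post-baseline data at all need a different strategy, such as
# imputing treatment "failure" or a modified intention-to-treat analysis.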
Similar issues exist for the analysis of patients who have finished the study but have not been receiving enough treatment or whose treatment has been of a shorter duration than was expected. Again, the intention-to-treat analysis is the standard strategy for dealing with this type of violation, although the intention-to-treat approach does have its critics (30). Alternative analyses may be carried out that only include the patients with good compliance (and possibly no other protocol deviations), or may actually try to model the different types of compliance in more complicated analyses (12, 26, 29–32). These types of analyses may provide some useful information (especially in early-phase drug trials), but are generally hard to interpret and should be considered to be of a speculative nature. Some researchers think that the intention-to-treat analysis should be in the abstract, but that the alternative "on treatment" or "per protocol" analyses should be discussed in the main text (25). In any case, it really is not flippant to say that the best way of dealing with protocol violations, such as having missing outcomes, is to try to prevent them from happening in the first place (25–29).

4 PLANNING AHEAD TO AVOID VIOLATORS

Many clinical trials are designed to test interventions that are complex or difficult to control in a stable way (12). This type of trial serves as a reminder that protocol deviations are a fact of life and should be embraced as a normal feature of a trial. A clinical trial is technically a scientific experiment, but it is a very different type of experiment from that carried out in laboratories by life scientists, engineers, and so on. The unit of study is a human being, and often the study will continue as patients continue with their daily lives (although many studies are carried out in a hospital setting). When a study is being conducted in parallel with a patient's normal activities, it seems sensible to try to accommodate the study as much as possible within the patient's normal routine. Researchers should try to plan ahead to keep a trial as smooth and as trouble free as possible while, at the same time, planning a strategy for the inevitable departures from the ideal.
The number of protocol violations is normally expected to decrease with time as the recruitment and follow-up accrue, because all of the people involved with the study will become progressively familiar with both the protocol and the practicalities of running the study (4). Protocol violations are probably harder to avoid in multi-center trials (4, 19) because some of the centers will have more problems than others. It has been noted that those centers that contribute a small number of patients may also be the same ones who have a greater problem with protocol violations (19, 33). In all trials, careful data collection and monitoring of protocol violations will provide useful information about any problems that may exist with a study, and may even provoke an early change to the design of a study. Some protocol amendments may simply change the mechanics of the study by perhaps improving the collection of data, but changes to the eligibility criteria can potentially change the patient population of the study away from the original "target" population. In multi-center trials, careful data collection and monitoring may suggest that a particular center may have to be excluded from further recruitment to the study (4). As a summary, here is a list of suggestions that should help to reduce the amount of protocol violations in a clinical trial. The list is not intended to be exhaustive, and many of the suggestions are probably just common sense.

– Keep eligibility criteria as simple and nonrestrictive as possible
– Explicitly justify eligibility criteria and do not duplicate them
– Ensure that eligibility checklists are used at the screening and randomization visits
– Keep follow-up time as short as possible and the number of study visits as few as possible
– Only make clinical measurements on the patient that are definitely required
– Collect as few data as possible; reexamine the need for nonessential data
– Avoid as much inconvenience to the patient as possible
– Make the study treatment as convenient as is practical for the patient
– Specify what will happen in the event of a temporary suspension of the study treatment
– Make plans to collect outcome data after a patient has suspended treatment or withdrawn entirely from treatment
– Monitor the patients carefully, and examine the data carefully as it accrues
– Monitoring and data checking are especially important at the beginning of a trial
– Monitoring and data checking are especially important in multi-center trials
– Only allow centers into a multi-center trial if they are likely to recruit a reasonable number of patients
– If necessary, amend the protocol when a problem has been identified
REFERENCES

1. J. A. Lewis, Statistical standards for protocols and protocol deviations. Recent Results Cancer Res. 1988; 111: 27–33. 2. S. C. Lewis and C. P. Warlow, How to spot bias and other potential problems in randomised controlled trials. J. Neurol. Neurosurg. Psychiatry 2004; 75: 181–187. 3. R. Peto, R. Collins, and R. Gray, Large-scale randomized evidence: large simple trials and overviews of trials. J. Clin. Epidemiol. 1995; 48(1): 23–40. 4. G. T. Wolf and R. W. Makuch, A classification system for protocol deviations in clinical trials. Cancer Clin. Trials 1980; 3: 101–103. 5. K. F. Schulz and R. W. Makuch, Allocation concealment in randomised trials: defending against deciphering. Lancet 2002; 359: 614–618. 6. S. J. Senn, Falsification and clinical trials. Stat. Med. 1991; 10: 1679–1692. 7. S. Senn, Are placebo run ins justified? BMJ 1997; 314(7088): 1191–1193. 8. S. Yusuf, P. Held, and K. K. Teo, Selection of patients for randomized controlled trials: implications of wide or narrow eligibility criteria. Stat. Med. 1990; 9: 73–86. 9. J. H. Ellenberg, Cohort Studies. Selection bias in observational and experimental studies. Stat. Med. 1994; 13: 557–567.
10. L. Siqueland, A. Frank, D. R. Gastfriend, L. Muenz, P. Crits-Christoph, J. Chittams et al., The protocol deviation patient: characterization and implications for clinical trials research. Psychother. Res. 1998; 8(3): 287–306. 11. A. Fuks, C. Weijer, B. Freedman, S. Shapiro, M. Skrutkowska, and A. Riaz, A study in contrasts: eligibility criteria in a twenty-year sample of NSABP and POG clinical trials. J. Clin. Epidemiol. 1998; 51(2): 69–79. 12. A. D. McMahon, Study control, violators, inclusion criteria, and defining explanatory and pragmatic trials. Stat. Med. 2002; 21(10): 1365–1376. 13. S. H. Shapiro, C. Weijer, and B. Freedman, Reporting the study populations of clinical trials: clear transmission or static on the line? J. Clin. Epidemiol. 2000; 53: 973–979. 14. A. Britton, M. McKee, N. Black, K. McPherson, C. Sanderson, and C. Bain, Choosing between randomised and non-randomised studies: a systematic review. Health Technol. Assess. 1998; 2(13): 19–29. 15. D. G. Altman and J. M. Bland, Why we need observational studies to evaluate the effectiveness of health care. BMJ 1998; 317: 409–410. 16. K. R. Bailey, Generalizing the results of randomized clinical trials. Controlled Clin. Trials 1994; 15: 15–23. 17. N. Black, Why we need observational studies to evaluate the effectiveness of health care. BMJ 1996; 312: 1215–1218. 18. M. McKee, A. Britton, K. McPherson, C. Sanderson, and C. Bain, Interpreting the evidence: choosing between randomised and non-randomised studies. BMJ 1999; 319: 312–315. 19. S. J. Pocock, Protocol deviations. In: Clinical Trials. A Practical Approach. 1st ed. Chichester: John Wiley & Sons, 1983, pp. 176–186. 20. D. Fergusson, S. D. Aaron, G. Guyatt, and P. Hebert, Post-randomisation exclusions: the intention to treat principle and excluding patients from analysis. BMJ 2002; 325: 652–654. 21. R. Collins and S. MacMahon, Reliable assessment of the effects of treatment on mortality and major morbidity, I: clinical trials. Lancet 2001; 357: 373–380. 22. D. Moher, K. F. Schulz, and D. G. Altman, The CONSORT statement: revised recommendations for improving the quality of reports
of parallel-group randomised trials. Lancet 2001; 357: 1191–1194. 23. P. W. Lavori, Clinical trials in psychiatry: should protocol deviation censor patient data? Neuropsychopharmacology 1992; 6(1): 39–48. 24. The West of Scotland Coronary Prevention Group, Computerised record linkage: compared with traditional patient follow-up methods in clinical trials and illustrated in a prospective epidemiological study. J. Clin. Epidemiol. 1995; 48(12): 1441–1452. 25. C. Begg, Ruminations on the intent-to-treat principle. Controlled Clin. Trials 2000; 21: 241–243. 26. G. Chene, P. Morlat, C. Leport, R. Hafner, L. Dequae, I. Charreau et al., Intention-to-treat vs on treatment analyses of clinical trial data: experience from a study of pyrimethamine in the primary prophylaxis of toxoplasmosis in HIV-infected patients. Controlled Clin. Trials 1998; 19: 233–248. 27. G. A. Diamond and G. A. Diamond, Alternative perspectives on the biased foundations of medical technology assessment. Ann. Intern. Med. 1993; 18: 455–464. 28. S. Hollis and F. Campbell, What is meant by intention to treat analysis? Survey of published randomised controlled trials. BMJ 1999; 319: 670–674. 29. J. M. Lachin, Statistical considerations in the intent-to-treat principle. Controlled Clin. Trials 2000; 21: 167–189. 30. L. B. Sheiner and D. B. Rubin, Intention-to-treat analysis and the goals of clinical trials. Clin. Pharmacol. Therapeut. 1995; 57(1): 6–15. 31. I. R. White and S. J. Pocock, Statistical reporting of clinical trials with individual changes from allocated treatment. Stat. Med. 1996; 15: 249–262. 32. C. C. Wright and B. Sim, Intention-to-treat approach to data from randomized controlled trials: a sensitivity analysis. J. Clin. Epidemiol. 2003; 56: 833–842. 33. R. J. Sylvester, H. M. Pinedo, M. DePauw, M. J. Staquet, M. E. Buyse, J. Renard et al., Quality of institutional participation in multicenter clinical trials. N. Engl. J. Med. 1981; 305(15): 852–855.
PUBLICATION BIAS
HANNAH R. ROTHSTEIN
Baruch College, City University of New York
New York City, New York

1 PUBLICATION BIAS AND THE VALIDITY OF RESEARCH REVIEWS

Publication bias is the general term used to describe the problem that results whenever the published research literature on a topic is systematically unrepresentative of the entire body of completed studies on that topic. The major consequence of this problem is that, when the research that is readily available differs in its results from the results of all the research that has been done on the topic, both readers and reviewers of that research are in danger of drawing the wrong conclusion about what that body of research shows. In some cases, the result can be danger to the public, as, for example, when an unsafe or ineffective treatment is falsely viewed as safe and effective. Two events that received much media attention in 2004 and 2005 serve as cautionary examples. Controversy surrounded Merck's recall of Vioxx (rofecoxib), a popular arthritis drug; Merck maintained that it recalled Vioxx as soon as the data indicated the high prevalence of cardiovascular events among those who took Vioxx for more than 18 months, but media reports claimed that Merck had hidden adverse event data for years. In another controversy, the New York State attorney general filed a lawsuit against GlaxoSmithKline charging that the company had concealed data about the lack of efficacy and increased suicide risk associated with the use of Paxil (paroxetine), a selective serotonin reuptake inhibitor, in childhood and adolescent depression. In the majority of cases, the cause of publication bias will not be deliberate suppression, nor will the effects of publication bias be as serious; nevertheless, these examples highlight why the topic is critically important. Publication bias is a potential threat in all areas of research, including qualitative research, primary quantitative studies, and narrative reviews as well as in systematic reviews and meta-analyses. Although publication bias has been around for as long as research has been conducted and its results disseminated, it has come to prominence in recent years largely with the introduction and widespread adoption of the use of systematic review and meta-analytic methods to synthesize bodies of research evidence. As methods of reviewing have become more systematic and quantitative, it has been possible to demonstrate empirically that the problem of publication bias exists as well as to estimate its impact. Publication bias is a particularly troubling issue for meta-analysis, as it has been claimed that this method of review produces a more accurate appraisal of a research literature than is provided by traditional narrative reviews (1). However, if the sample of studies retrieved for review is biased, the validity of the results of the meta-analytic review, no matter how systematic and thorough in other respects, is threatened. This is not a hypothetical issue: evidence that publication bias has had an impact on meta-analyses has been firmly established by several lines of research. Because systematic reviews are promoted as providing a more objective appraisal of the evidence than traditional narrative reviews and systematic review and meta-analysis are now generally accepted in many disciplines as the preferred methodology for summarizing a research literature, threats to their validity require attention. Publication bias deserves particular consideration as it presents perhaps the greatest threat to the validity of this method. On the other hand, the vulnerability of systematic review and meta-analysis to publication bias is not an argument against their use because such biases exist in the literature irrespective of whether systematic review or other methodology is used to summarize research findings. In fact, as Rothstein et al. (2) suggest, the attention given to objectivity, transparency, and reproducibility of findings in systematic reviews and meta-analyses has been the driving force behind the attempt to assess
the problems presented by publication biases and to deal with them. At present, there are several methods for assessing the potential magnitude of bias caused by selective publication or selective availability of research. In situations where the potential for severe bias exists, it can be identified, and appropriate cautionary statements about the results of the meta-analysis can be made. When potential bias can effectively be ruled out or shown not to threaten the results and conclusions of a meta-analysis, the validity and robustness of the results and conclusions are strengthened.

2 RESEARCH ON PUBLICATION BIAS
Dickersin (3) provides a concise and compelling summary of the research that has demonstrated the existence of publication bias in both the biomedical and the social sciences for clinical trials (experiments) as well as in observational studies. Dickersin's review suggests that researchers themselves appear to be a key source of publication bias through their failure to submit research for publication when the major results do not reach statistical significance, but there are indications that bias at the editorial level also plays a role.

2.1 Direct Evidence of Publication Bias

The body of research under review provides direct evidence of publication bias, including explicit editorial policies, results of surveys of investigators and reviewers, experimental studies, and follow-up evaluations of studies registered at inception. Several surveys in both medicine and psychology conducted in the 1970s and 1980s (4, 5) have shown that studies with results that rejected the null hypothesis were more likely to be submitted by their authors and recommended for publication by reviewers. Two studies examining the fate of manuscripts submitted for publication used articles that were similar except for whether they had statistically significant or nonsignificant findings; in both cases, the articles with statistically significant results were more likely to be accepted for publication (6, 7). In another study, six fully reported cohort studies of biomedical
research conducted in the 1990s were identified and followed; although the research projects had been approved by institutional research ethics review boards from inception to dissemination, "The main factor associated with failure to publish in all of these cohort studies was negative or null findings" (3). However, in most of these cases, it was the studies' investigators who had decided not to submit statistically nonsignificant results for publication; in only a small proportion of cases was the failure to publish related to rejection by a journal. Most of the studies on this topic have shown that, when research with statistically nonsignificant results is published, it is published more slowly than other research (8–10). A few studies, however, have found no relationship between significance of results and time to publication (11, 12).

2.2 Indirect Evidence of Publication Bias

Dickersin's review of research on publication bias (3) also includes convincing indirect evidence. For example, numerous studies conducted between 1959 and 2001 in various fields including medicine, psychology, and education have shown that the vast majority of published studies have statistically significant results. Although it is theoretically possible that most research is so carefully conceptualized and designed that only studies supporting the hypotheses they test are actually carried out, it seems rather unlikely. The indirect evidence includes negative correlations between sample size and effect size in collections of published studies, which is what one would expect if studies were selected for publication based on their statistical significance. The discussion of funnel plots will explain this point in more detail.

2.3 Clinical Significance of the Evidence

Taken as a whole, the evidence that statistically nonsignificant studies are less likely to be published and have longer times to publication than those that have attained statistical significance appears compelling. The result is that published literature generally produces higher estimates of treatment effectiveness than does unpublished literature. Importantly, in the 122 meta-analyses
reviewed by Sterne et al. (13), published and unpublished studies demonstrated no differences in quality—in fact, when trial quality was taken into account, the difference between treatment effectiveness estimates from published studies compared with the unpublished studies was larger. This is not merely an academic issue. Practitioners who rely on data only from published studies in choosing treatments for their patients may recommend treatments that are less beneficial than they think them to be. This is true whether the practitioner reads the published studies or reviews based solely on published studies. Several systematic reviews have demonstrated publication bias related to overestimation of treatment benefits for patients with ovarian cancer, cardiovascular disease, and thyroid disease.
3 DATA SUPPRESSION MECHANISMS RELATED TO PUBLICATION BIAS

Attention to publication bias originally focused on the consequences of publication or nonpublication of studies based on the studies' direction and the statistical significance of the results, rather than on the quality of the research. However, numerous potential information suppression mechanisms go well beyond this simple conceptualization of publication bias:

Language bias. For example, selective inclusion of studies published in English, either because of database coverage or inclusion criteria of a review, can exclude a large proportion of research results from non–English-speaking countries.

Citation bias. More frequent citation of studies with statistically significant results means those articles are more likely to have an impact on practice. In addition, the cited articles are more likely to be located for inclusion in a later systematic review.

Availability bias. Selective inclusion of the studies that are most easily accessible to the reviewer often reflects the print material and databases available
to the institution with which the author is affiliated.

Cost bias. Studies selected may be confined to those that are available free or at low cost. Subscriptions to some electronic databases are expensive, as are some journal subscriptions. Retrieval of the studies, once they are identified, also can be costly.

Familiarity bias. Selective inclusion may comprise studies only from one's own discipline. Searches related to simple, straightforward questions such as drug efficacy may be confined to a relatively circumscribed set of sources. Topics for review that are broader or more complex—for example, an assessment of the impact of free lunch programs on children's nutrition—are likely to be represented in a variety of research literatures.

Outcome bias. In addition to whole studies that are missing from the published literature, outcomes within studies are often selectively reported. Reporting of some outcomes but not others usually depends on the direction and statistical significance of the results rather than the theoretical or clinical importance of the outcome.

Duplication bias. Some findings are likely to be published more than once without being clearly identified as such; as a result, they may inadvertently be included more than once in a meta-analysis.

Some of these are biases introduced by the authors of primary studies; others are caused by the editorial policies of journals or the decisions made by database publishers; others come from the reviewers producing syntheses. In addition, data may "go missing" for other reasons—including financial, political, ideological, and professional competing interests of investigators, research sponsors, journal editors, and other parties. All of these biases produce the same result, namely, that the literature located by a reviewer will not represent the full population of completed studies, so they all pose the same threat to a review's validity. Whenever readers see the
phrase "publication bias," they should bear in mind that the broader but more cumbersome "publication bias and associated dissemination and availability biases" is implied.
4 PREVENTION OF PUBLICATION BIAS
Experts in the area have proposed two strategies to prevent publication bias. If widely adopted, these strategies would lead to a great reduction in publication bias, at least in research areas that are based on trials. In addition, thorough literature search techniques can reduce the impact of publication bias on meta-analytic and other systematic reviews.

4.1 Trials Registries

A number of Internet-accessible registries of ongoing and completed unpublished as well as published clinical trials currently exist, but they have been found to be incomplete and lacking in a systematic means of identifying trials across registries. Often, access to the studies they contain is restricted. Calls for universal, standardized registration of trials at their inception as a means of reducing publication bias have been sounded since 1977 (14). As efforts to establish such registers have met with mixed success, governmental action may be required to create them. Furthermore, international prospective registration of clinical trials would need to be coupled with open access to the results of these trials before any "unbiased sampling frame for subsequent meta-analyses" could result (15).

4.2 Prospective Meta-analysis

In a prospective meta-analysis, multiple groups of investigators conducting ongoing trials agree, before knowing the results of their studies, to combine their findings when the trials are complete. In a variant of this strategy, the meta-analysis is designed prospectively to standardize the instruments used to measure specific outcomes of interest across studies. An illustrative example of a prospective meta-analysis is found in Simes (16).
4.3 Thorough Literature Search

Minimization of publication bias through registries and prospective meta-analyses is a forward-looking strategy; thus, these approaches are of little use to present systematic reviewers who wish to minimize the effects of publication bias in their meta-analyses. The best current means of accomplishing this goal is by conducting a thorough literature search that attempts to locate, retrieve, and include hard-to-find study results (also known as fugitive, grey, or unpublished literature). Experts in the field agree that systematic reviews based only on research available through electronic databases are likely to be deficient and biased. For most systematic reviews of clinical trials, a thorough literature search would include not only electronic bibliographic databases but also cover-to-cover searches of key journals; examination of conference proceedings and reference lists of prior reviews and articles on the primary topic; contacts with other researchers and institutions; and searches of trials registers, online library catalogues, and the Internet. The primary problems associated with the retrieval and inclusion of additional literature are time and expense. The methodological quality of such literature also can be hard to assess, although it is not evident that it is, on the whole, of lower quality than the published literature. Despite the increased resources that must be expended in the location and retrieval of difficult-to-find literature, when one is interested in reducing the threat of publication bias there can be little justification for conducting a meta-analysis that does not attempt to systematically include evidence from such sources.

5 ASSESSMENT OF PUBLICATION BIAS

Several graphical and statistical methods have been developed to identify, quantify, and assess the impact of publication bias on meta-analyses. The earliest technique, file drawer analysis—also called the failsafe N method—was developed by Rosenthal in 1979 (17) to assess the sensitivity of conclusions of an analysis to the possible presence of publication bias. A second set of techniques, including a graphical diagnostic
called the funnel plot and two explicit statistical tests, was developed in the 1990s by Begg and Mazumdar (18) and Egger et al. (19). The third set of techniques, designed to adjust estimates for the possible effects of publication bias under some explicit model of publication selection, includes the trim and fill method of Duval and Tweedie (20–22) as well as the selection model approaches by Hedges and Vevea (23, 24) and by Copas and Shi (25, 26).

5.1 File Drawer Analysis (Failsafe N Approach)

As early as 1979, Rosenthal (17) became concerned that meta-analytic data could be wrong because the statistically nonsignificant studies were remaining in researchers' file drawers. He developed a formula to examine whether the inclusion of these studies would nullify the observed effect: would their inclusion reduce the mean effect to a level not statistically significantly different from zero? This formula allows meta-analysts to calculate the "failsafe N," the number of zero-effect studies that would be required to nullify the effect. If this number is relatively small, then there is cause for concern. If this number is large, one might be more confident that the effect, while possibly inflated by the exclusion of some studies, is nevertheless not zero. Two rules of thumb have been developed to define "large": a number of missing studies greater than the number of observed studies, or a number of missing studies equal to or greater than five times the number of observed studies plus 10. Many meta-analyses, from the 1980s through today, use this approach to assess the likelihood that publication bias is affecting their results. Despite its continuing popularity, this approach is limited in a number of important ways. First, it focuses on the question of statistical significance rather than practical or theoretical significance. That is, it asks, "How many missing studies are needed to reduce the effect to statistical nonsignificance?", but it does not tell us how many missing studies need to exist to reduce the effect to the point that it is not important. Second, it assumes that the mean effect size in the hidden studies is zero, although
it could be negative or positive but small. Third, the failsafe N is based on significance tests that combine P-values across studies, as was Rosenthal's initial approach to meta-analysis. Today, the common practice is to cumulate effect sizes rather than P-values, and the P-value is computed for the mean overall effect. Finally, although this method may allow one to conclude that the mean effect is not entirely an artifact of publication bias, it does not provide an estimate of what the mean effect might be once the missing studies are included. Orwin (27) proposed a variant on the Rosenthal formula that addresses two of these issues. Orwin's method shifts the focus to practical rather than statistical significance, and it does not necessarily assume that the mean effect size in the missing studies is zero. It does not, however, address the other criticisms of the failsafe N method.

5.2 Funnel Plots

The funnel plot, in its most common form, is a display of an index of study precision (usually presented on the vertical axis) as a function of effect size (usually presented on the horizontal axis). Large studies appear toward the top of the graph and generally cluster around the mean effect size. Smaller studies appear toward the bottom of the graph and tend to be spread across a broad range of values because smaller studies have more sampling error variation in effect sizes. In the absence of publication bias, the studies will be distributed symmetrically about the mean effect size with a pattern resembling an inverted funnel, hence the plot's name (28). In the presence of bias, the bottom of the plot will show a larger concentration of studies on one side of the mean than the other. This reflects the fact that studies are more likely to be published if they have statistically significant results. For a small study to reach statistically significant findings, the treatment effect must be larger than that required to reach statistical significance in a larger study. Thus, an association between effect size and sample size may be suggestive of selective publication based on statistical significance. Figure 1A shows a symmetrical funnel plot of log odds ratios as a function of precision
(1/standard error), and Figure 1B shows an asymmetrical funnel plot. A comprehensive summary of the issues involved in the use of funnel plots to detect publication bias is offered by Sterne, Becker, and Egger (29). Although the funnel plot is intuitively appealing because it offers a clear visual sense of the relationship between effect size and precision, it is limited by the fact that its interpretation is largely subjective.

Figure 1. Funnel plots of precision (1/standard error) by log odds ratio: (A) example of a symmetrical funnel plot; (B) example of an asymmetrical funnel plot.

5.3 Statistical Tests

Begg and Mazumdar (18) developed a statistical test to detect publication bias based on the rank correlation (Kendall's tau) between the standardized effect size and the variances (or standard errors) of these effects. Tau is interpreted in much the same way as any correlation: a value of zero indicates no relationship between effect size and precision, and deviation from zero indicates the presence of a relationship. If asymmetry is caused by publication bias, the expectation is that high standard errors (small studies) will be associated with larger effect sizes. A statistically significant correlation suggests that bias exists but does not directly address the consequences of the bias. In particular, it does not suggest what the mean effect would be in the absence of bias. A further limitation of
this test is that, contradictory as it may first seem, it has both low power and an inflated type I error rate under many conditions (30). Egger et al. (19) developed a linear regression method that, like the rank correlation test, is intended to quantify the bias pictured in a funnel plot. It differs from Begg and Mazumdar's test in that it uses the actual values of the effect sizes and their precision, rather than ranks. The Egger test uses precision (the inverse of the standard error) to predict the "standardized effect." In this equation, the size of the standardized effect is captured by the slope of the regression line (B1) while bias is captured by the intercept (B0). The intercept in this regression corresponds to the slope in a weighted regression of the effect size on the standard error. When there is no bias, the intercept is zero. If the intercept is significantly different from zero, there is evidence of asymmetry, suggesting bias. This approach can be extended to include more than one predictor variable, which means that one can assess the impact of sample size, unconfounded by other factors that might be related to effect size. The power for this test is generally higher than power for the Begg and Mazumdar method, but it is still low under many conditions. The type I error rate also will be too high unless there are clear variations in the sizes of the studies included in the meta-analysis, with one or more trials of medium or large size. Furthermore, it has been found that the results of the Egger test may be misleading when applied to odds ratios, but a modification has been proposed to address this problem (31). Finally, as was true of the Begg and Mazumdar test, a statistically significant result using the Egger test may suggest that bias exists but does not suggest what the mean effect would be in the absence of bias.

5.4 Selection Models

Techniques designed to adjust estimates for the possible effects of publication bias under some explicit model of publication selection include Duval and Tweedie's trim and fill method (20–22), Hedges and Vevea's general selection model approach (23, 24), and Copas and Shi's selection model approach (25, 26).
Unlike the Hedges–Vevea and Copas–Shi selection model methods, trim and fill is relatively simple to implement and involves relatively little computation, so we will concentrate on the trim and fill approach. Interested readers may wish to see Hedges and Vevea (24) for an overview and illustration of the other selection models. The trim and fill procedure assesses whether publication bias may be affecting the results of a meta-analysis and estimates how the effect would change if the bias were removed. This procedure is based on the assumption that, in the absence of bias, a funnel plot will be symmetric around the mean effect. If more small studies cluster on one side than the other at the bottom of the funnel plot, existing studies may be missing from the analysis. Trim and fill extends this idea by imputing the missing studies—it adds them to the analysis and recomputes the effect size. The trim and fill method assumes that, in addition to the studies observed in a meta-analysis, other relevant studies exist that are not included because of publication bias. The number of these studies and the effect sizes associated with them are unknown but can be estimated. To adjust for the effect of missing studies, trim and fill uses an iterative procedure to remove the most extreme small studies from the other side of the funnel plot (those without counterparts on the first side), and it recomputes the effect size at each iteration until the funnel plot is symmetric around the (new) effect size. This "trimming" yields an effect size adjusted for missing studies, but it also reduces the variance of the effects, yielding a confidence interval that is too narrow. Therefore, the algorithm then adds the removed studies back into the analysis and imputes a mirror image for each of them. The final estimate of the mean overall effect and its variance is based on the "filled" funnel plot. Figure 2 shows an example of a filled funnel plot. The clear circles are the original studies, and the dark circles are the imputed studies. The primary advantage of the trim and fill approach is that it yields an effect size estimate that is adjusted for bias. One should not regard the adjusted mean effect as the best estimate of the treatment effect because this mean is
estimated on imputed data points. Rather, the method should be seen as a useful sensitivity analysis that assesses the potential impact of missing studies on a meta-analysis. Its primary advantage is that it allows assessment of the degree of divergence between the original mean effect and the trim and fill adjusted mean effect. In many cases, the adjusted effect will be similar to the original effect, increasing our confidence in the meta-analytic result. In other cases, the magnitude of the effect size may shift, but the overall result (that the treatment is or is not effective) does not change. In some situations, the possible presence of publication bias threatens the main finding; this can be described as severe bias. Whether or not bias is actually found to affect the results, the information provided by trim and fill is extremely important for assessing the robustness of meta-analytic results.

Figure 2. A "filled" funnel plot of precision (1/standard error) by log odds ratio.

5.5 Comparing the Results of the Different Methods

The results obtained from various methods may not be in agreement because they address different concerns. The traditional failsafe N method defines publication bias as the number of studies obtaining no effect that would be necessary to completely nullify the observed mean effect size; in other words, it addresses the question "Is the entire effect due to bias?" Orwin's variant defines publication bias as the number of studies obtaining a specified low effect that would be necessary to drop the observed
mean effect size below a specified threshold; it too answers the question "Is the entire effect due to bias?" Whenever the distribution contains effect sizes far from zero and the number of studies is relatively large, both versions of the failsafe N analysis will yield the conclusion that there is no publication bias. Trim and fill interprets asymmetry in the effect size distribution as evidence of publication bias, and determines the degree of bias based on the difference between the original effect size and the effect size recomputed after "missing" studies have been added to make the distribution symmetrical. In other words, when there is asymmetry, trim and fill assumes that it reflects bias, and it addresses the question "How much does the effect size shift after adjusting for bias?"—that is, after making the distribution symmetrical. A caveat must be noted for all of these procedures. The funnel plot, rank correlation, and regression methods look for a relationship between sample size and effect size as evidence of publication bias. Although the procedures may detect a relationship between sample size and effect size, they cannot assign a causal mechanism to it. That is, the effect size may be larger in small studies because a biased sample of the smaller studies was included in the meta-analysis, but it is also possible that the effect size actually is larger in smaller studies—perhaps because the smaller studies used different populations or different procedures than the larger ones. Similarly, when effects in a meta-analysis are truly heterogeneous, this can produce asymmetry in
the funnel plot that trim and fill may view as bias, with the result that it will impute studies that are actually not missing. If the apparent bias actually reflects legitimate heterogeneity in the effect sizes, this needs to be attended to in much the same way as heterogeneity produced by other moderators of treatment effects.
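To make the preceding methods more concrete, the sketch below applies two of them, Rosenthal's failsafe N and the Egger regression intercept, to a small set of invented effect sizes and standard errors. The study values and the helper calculations are illustrative assumptions only, a minimal sketch rather than a substitute for dedicated meta-analysis software.

```python
# Illustrative sketch (hypothetical data, not taken from the article):
# Rosenthal's failsafe N and the Egger regression intercept for a small
# meta-analysis of log odds ratios.
import numpy as np
from scipy.stats import norm

# Hypothetical study results: log odds ratios and their standard errors.
effect = np.array([0.60, 0.45, 0.80, 0.30, 0.95, 0.20])
se = np.array([0.25, 0.20, 0.40, 0.15, 0.50, 0.12])
k = len(effect)

# Rosenthal's failsafe N: combine one-sided Z statistics (Stouffer's method)
# and ask how many zero-effect studies would pull the combined Z below
# 1.645 (one-sided alpha = 0.05).
z = effect / se
z_alpha = norm.ppf(0.95)
failsafe_n = (z.sum() / z_alpha) ** 2 - k
print(f"failsafe N: about {failsafe_n:.0f} null studies needed")

# Egger regression: regress the standardized effect (effect / SE) on
# precision (1 / SE); an intercept far from zero suggests small-study
# (funnel plot) asymmetry.
precision = 1.0 / se
std_effect = effect / se
X = np.column_stack([np.ones(k), precision])
coef, *_ = np.linalg.lstsq(X, std_effect, rcond=None)
intercept, slope = coef
resid = std_effect - X @ coef
se_intercept = np.sqrt(resid @ resid / (k - 2) * np.linalg.inv(X.T @ X)[0, 0])
print(f"Egger intercept = {intercept:.2f} (SE = {se_intercept:.2f})")
```

As the text emphasizes, the two quantities answer different questions: the failsafe N asks how many unpublished null studies it would take to overturn statistical significance, while the Egger intercept asks whether small studies report systematically larger effects.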
6 IMPACT OF PUBLICATION BIAS
Sutton's survey of meta-analyses suggested that publication bias exists in most published meta-analyses, but that the conclusions were nevertheless valid in most cases (32). In approximately 50% of the meta-analyses surveyed, the impact of bias was minimal; it was modest in about 45% and severe in only 5%. The amount of bias also appears to vary substantially among fields of research, and the prevalence of bias likely varies with the experience and resources of the researchers who are conducting the meta-analysis. Sutton's survey of meta-analyses (32) was based primarily on the Cochrane Collaboration's database of systematic reviews of health-care trials. Because the researchers who produce the Cochrane database reviews are trained to do extensive searches, they typically include about 30% more studies than do the authors who produce the meta-analyses appearing in journals. Therefore, the bias noted by Sutton is probably an underestimate of the bias found in other reviews of clinical trials of health-care interventions.

REFERENCES

1. M. Egger, G. Davey Smith, and D. G. Altman, Systematic Reviews in Health Care: Meta-analysis in Context. London: BMJ Books, 2000.
2. H. R. Rothstein, A. J. Sutton, and M. Borenstein, eds., Publication Bias in Meta-analysis: Prevention, Assessment and Adjustments. Chichester, UK: Wiley, 2005.
3. K. Dickersin, Publication bias: recognizing the problem, understanding its origins and scope, and preventing harm. In: H. R. Rothstein, A. J. Sutton, and M. Borenstein (eds.), Publication Bias in Meta-analysis: Prevention, Assessment and Adjustments. Chichester, UK: Wiley, 2005, pp. 11–34.
4. A. Coursol and E. E. Wagner, Effect of positive findings on submission and acceptance rates: a note on meta-analysis bias. Prof Psychol Res Pr. 1986; 17: 136–137.
5. J. Hetherington, K. Dickersin, I. Chalmers, and C. L. Meinert, Retrospective and prospective identification of unpublished controlled trials: lessons from a survey of obstetricians and pediatricians. Pediatrics. 1989; 84: 374–380.
6. M. J. Mahoney, Publication prejudices: an experimental study of confirmatory bias in the peer review system. Cognit Ther Res. 1977; 1: 161–175.
7. W. M. Epstein, Confirmational response bias among social work journals. Sci Technol Human Values. 1990; 15: 9–37.
8. A. L. Misakian and L. A. Bero, Publication bias and research on passive smoking. Comparison of published and unpublished studies. JAMA. 1998; 280: 250–253.
9. J. M. Stern and R. J. Simes, Publication bias: evidence of delayed publication in a cohort study of clinical research projects. BMJ. 1997; 315: 640–645.
10. J. P. Ioannidis, Effect of the statistical significance of results on the time to completion and publication of randomized efficacy trials. JAMA. 1998; 279: 281–286.
11. K. Dickersin, How important is publication bias? A synthesis of available data. AIDS Educ Prev. 1997; 9(Suppl): 15–21.
12. J. F. Tierney, M. S. Clarke, and A. Lesley, Is there bias in the publication of individual patient data meta-analyses? Int J Technol Assess Health Care. 2000; 16: 657–667.
13. J. A. Sterne, P. Jüni, K. F. Schulz, D. G. Altman, C. Bartlett, and M. Egger, Statistical methods for assessing the influence of study characteristics on treatment effects in "meta-epidemiological" research. Stat Med. 2002; 21: 1513–1524.
14. T. C. Chalmers, Randomize the first patient. N Engl J Med. 1977; 296: 107.
15. J. A. Berlin and D. Ghersi, Preventing publication bias: registries and prospective meta-analysis. In: H. R. Rothstein, A. J. Sutton, and M. Borenstein (eds.), Publication Bias in Meta-analysis: Prevention, Assessment and Adjustments. Chichester, UK: Wiley, 2005, pp. 35–48.
16. R. J. Simes, Prospective meta-analysis of cholesterol-lowering studies: the Prospective Pravastatin Pooling (PPP) Project and the Cholesterol Treatment Trialists (CTT) Collaboration. Am J Cardiol. 1995; 76: 122C–126C.
17. R. Rosenthal, The "file drawer problem" and tolerance for null results. Psychol Bull. 1979; 86: 638–641.
18. C. Begg and M. Mazumdar, Operating characteristics of a rank correlation test for publication bias. Biometrics. 1994; 50: 1088–1101.
19. M. Egger, G. Davey Smith, M. Schneider, and C. Minder, Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997; 315: 629–634.
20. S. Duval, The "trim and fill" method. In: H. R. Rothstein, A. J. Sutton, and M. Borenstein (eds.), Publication Bias in Meta-analysis: Prevention, Assessment and Adjustments. Chichester, UK: Wiley, 2005, pp. 127–144.
21. S. J. Duval and R. L. Tweedie, A nonparametric "trim and fill" method of accounting for publication bias in meta-analysis. J Am Stat Assoc. 2000; 95: 89–98.
22. S. J. Duval and R. L. Tweedie, Trim and fill: a simple funnel plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics. 2000; 56: 276–284.
23. L. Hedges and J. Vevea, Estimating effect size under publication bias: small sample properties and robustness of a random effects selection model. J Educ Behav Stat. 1996; 21: 299–332.
24. L. V. Hedges and J. Vevea, The selection model approach. In: H. R. Rothstein, A. J. Sutton, and M. Borenstein (eds.), Publication Bias in Meta-analysis: Prevention, Assessment and Adjustments. Chichester, UK: Wiley, 2005, pp. 145–174.
25. J. B. Copas, What works? Selectivity models and meta-analysis. J R Stat Soc Ser A Stat Soc. 1999; 162: 95–109.
26. J. B. Copas and J. Q. Shi, A sensitivity analysis for publication bias in systematic reviews. Stat Methods Med Res. 2001; 10: 251–265.
27. R. F. Orwin, A fail-safe N for effect size in meta-analysis. J Educ Stat. 1983; 8: 157–159.
28. R. J. Light and D. B. Pillemer, Summing Up. Cambridge, MA: Harvard University Press, 1984.
29. J. A. C. Sterne, B. J. Becker, and M. Egger, The funnel plot. In: H. R. Rothstein, A. J. Sutton, and M. Borenstein (eds.), Publication Bias in Meta-analysis: Prevention, Assessment and Adjustments. Chichester, UK: Wiley, 2005, pp. 75–98.
30. J. A. C. Sterne and M. Egger, Regression methods to detect publication and other bias in meta-analysis. In: H. R. Rothstein, A. J. Sutton, and M. Borenstein (eds.), Publication Bias in Meta-analysis: Prevention, Assessment and Adjustments. Chichester, UK: Wiley, 2005, pp. 99–110.
31. R. M. Harbord, M. Egger, and J. A. C. Sterne, A modified test for small-study effects in meta-analyses of controlled trials with binary endpoints. Stat Med. 2006; 25: 3443–3457.
32. A. J. Sutton, Evidence concerning the consequences of publication and related biases. In: H. R. Rothstein, A. J. Sutton, and M. Borenstein (eds.), Publication Bias in Meta-analysis: Prevention, Assessment and Adjustments. Chichester, UK: Wiley, 2005, pp. 175–192.
FURTHER READING

J. A. Berlin and G. A. Colditz, The role of meta-analysis in the regulatory process for foods, drugs, and devices. JAMA. 1999; 281: 830–834.
P. J. Easterbrook, J. A. Berlin, R. Gopalan, and D. R. Matthews, Publication bias in clinical research. Lancet. 1991; 337: 867–872.
S. Hahn, P. R. Williamson, J. L. Hutton, P. Garner, and E. V. Flynn, Assessing the potential for bias in meta-analysis due to selective reporting of subgroup analyses within studies. Stat Med. 2000; 19: 3325–3336.
S. Iyengar and J. B. Greenhouse, Selection models and the file drawer problem. Stat Sci. 1988; 3: 109–135.
C. Lefebvre and M. Clarke, Identifying randomised trials. In: M. Egger, G. D. Smith, and D. G. Altman (eds.), Systematic Reviews in Healthcare: Meta-analysis in Context. London: BMJ Books, 2001, pp. 69–86.
P. Macaskill, S. D. Walter, and L. Irwig, A comparison of methods to detect publication bias in meta-analysis. Stat Med. 2001; 20: 641–654.
L. McAuley, B. Pham, P. Tugwell, and D. Moher, Does the inclusion of grey literature influence estimates of intervention effectiveness reported in meta-analyses? Lancet. 2000; 356: 1228–1231.
B. Pham, R. Platt, L. McAuley, T. P. Klassen, and D. Moher, Is there a "best" way to detect and minimize publication bias? An empirical evaluation. Eval Health Prof. 2001; 24: 109–125.
M. Smith, Publication bias and meta-analysis. Eval Educ. 1980; 4: 22–24.
L. A. Stewart and M. K. B. Parmar, Meta-analysis of the literature or of individual patient data: is there a difference? Lancet. 1993; 341: 418–422.
J. Vevea and L. Hedges, A general linear model for estimating effect size in the presence of publication bias. Psychometrika. 1995; 60: 419–435.
CROSS-REFERENCES

Cochrane collaboration
Meta-analysis
Power
Registry trials
Statistical significance
P Value

The P value is probably the most ubiquitous statistical index found in the applied sciences literature and is particularly widely used in biomedical research. It is also fair to state that the misunderstanding and misuse of this index is equally widespread. A complete discussion of all issues relevant to the P value could touch on many of the subtlest and most difficult areas in statistical inference. In this article we will focus on those aspects most relevant to the interpretation of results arising from biomedical research.

The P value, as it is used and defined today, was first proposed as part of a quasi-formal method of inference by R.A. Fisher, and popularized in his highly influential 1925 book, Statistical Methods for Research Workers [13]. It is defined as the probability, under a given simple hypothesis H0 (the null hypothesis), of a statistic of the observed data, plus the probability of more extreme values of the statistic (see Hypothesis Testing). Other names that have been used for the P value include "tail-area probability," "associated probability," and "significance probability" [16]. It can be written as

P value = Pr(t(X) ≥ t(x) | H0),

where X is a random vector, x is a vector of observed results, and t(X) is a function of the data, known as a test statistic (e.g. the sample average). This seemingly simple mathematical definition belies the complexity of its interpretation, and even the occasional difficulty of its calculation in real-world settings. The seeds of this difficulty are embodied in its definition, which is partly conditional (depending on observed data for calculation), and partly unconditional (calculated over a set of unobserved outcomes defined by the study design) [26]. It is therefore an index that seems to measure simultaneously an "error rate" (pre-experiment, unconditional perspective) and the strength of inferential evidence (post-experiment, conditional perspective). Fisher's original purpose for the P value was in the latter category, as an index that denoted the statistical compatibility between observed results and a hypothetical distribution. He meant it to be used informally as a measure of statistical evidence. The larger the statistical distance (and the smaller the P value), the greater was the evidence against the null hypothesis.
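As a concrete illustration of this tail-area definition, the hedged sketch below simulates data under an assumed null hypothesis and counts how often the test statistic equals or exceeds the observed value. The data values, the normal null, and the choice of the sample mean as t(X) are all assumptions made for the example, not part of the article.

```python
# Minimal Monte Carlo sketch of P = Pr(t(X) >= t(x) | H0): simulate data
# under the null and count how often the statistic is at least as extreme
# as the observed one. All numbers below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

x = np.array([1.2, 0.4, 1.9, 0.7, 1.1, 1.6, 0.3, 1.4])  # observed data (invented)
t_obs = x.mean()                                         # observed statistic t(x)

# Assumed null hypothesis H0: observations are N(mean = 0.5, sd = 1).
n_sim = 200_000
t_null = rng.normal(0.5, 1.0, size=(n_sim, x.size)).mean(axis=1)  # t(X) under H0

p_one_sided = np.mean(t_null >= t_obs)   # tail area at or beyond t(x)
print(f"t(x) = {t_obs:.3f}, one-sided P value = {p_one_sided:.4f}")
```

For common test statistics this tail area is obtained analytically rather than by simulation, but the simulation makes the mixed character of the definition explicit: t(x) is fixed by the observed data (the conditional part), while the tail is computed over hypothetical replications defined by the study design (the unconditional part).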
While the basic idea had undeniable appeal, this definition posed several problems, some logical, some practical, which Fisher never fully resolved. They included the following:

1. How a measure of distance from a single hypothesis could be interpreted as a measure of evidence without consideration of other hypotheses [6].
2. How the probability of unobserved "more extreme" results was relevant to the evidential meaning of the observed result [19, 21].
3. How to calculate a P value when the experimental design (e.g. sequential) or execution (unanticipated events) rendered the sample space uncertain [8, 11].
4. How to choose an appropriate test statistic [9, 10, 20].
5. How the numerical magnitude of the P value should be interpreted operationally [2, 3, 12].

These questions were difficult to address because the P value was not part of any formal system of inference. This contrasted with the likelihood ratio, which was a part of Bayes' theorem. But while Fisher also developed the idea of mathematical likelihood, he eschewed Bayes' theorem as a useful tool in scientific inference. In its stead, he developed a panoply of methods which were meant to be tools in what he regarded as the fluid and nonquantifiable process of scientific reasoning. He offered various "rules of thumb" for the use of these tools. Among such rules was the suggestion that if the P value were less than 0.05, one might start doubting the null hypothesis. He was clear that the response to such a finding would generally be to conduct another experiment, or gather more data. Had this been the full extent of the P value's theoretical foundation, it is doubtful that it would occupy as central a role as it does today. It is ironic that it became popularized, and indeed reified, because of the development of another method that did not include it, and which explicitly denied that conditional inference could be part of a "scientific" method [18].
The P Value as an "Observed Error Rate"

In 1933, J. Neyman and E.S. Pearson (N–P) proposed the "hypothesis test", which involved
"rejecting" or "accepting" null or alternative hypotheses with predetermined frequencies when they were true [24]. The probabilities of making the wrong decision were called error rates, and designated into two classes: α (or type I) error (the probability of rejecting the null hypothesis when it was true) and β (type II) error (the probability of accepting the null hypothesis when the alternative was true) (see Level of a Test). A critical value for a test statistic is determined (via a likelihood ratio), and the null hypothesis is rejected if the statistic exceeds that critical value, and is accepted if not (see Critical Region). This methodology borrowed several of Fisher's ideas, most notably that of mathematical likelihood, and, with its introduction of the concept of an alternative hypothesis and associated power, appeared to address some of the logical problems posed by Fisher's less formal system. But N–P explicitly rejected P values, because for the "error rates" to have meaning (and for a method to be "scientific"), an hypothesis had to be rejected or accepted; a result could not reflect back on underlying hypotheses in a graded way, which use of the P value implied. The N–P method provided the formal framework of statistical inference (or at least decision-making) that the P value lacked, but it was a framework that encouraged the misinterpretation of P values. The juxtaposition of the two approaches has been the source of considerable confusion ever since. Since both P values and type I error rates were tail-area probabilities under the null hypothesis, the P value came to be interpreted as an "observed type I error", and defined in most applied textbooks that way. While it is mathematically true that the P value is the smallest alpha level at which one could still justify the rejection of the null hypothesis, this fact does not make the P value an error rate. (It should be stressed that the confusion we are discussing here is not the more common lay misperception that the P value represents the probability of the null hypothesis.) The outcome cannot fall within the region over which the P value is calculated; by definition, the observed outcome is always on the border of that region, and is usually the most probable. In an applied setting, this confusion is manifested in the inability to quantitatively distinguish between the very different inferential implications of a result reported as "P ≤ 0.05" vs. "P = 0.05". It is fascinating how few users of P values recognize the profound difference in inferential weight introduced by that subtle change in
inequality sign, whereas many agonize over the far smaller difference between P = 0.04 and P = 0.07. The identification of the P value as a form of post hoc type I error rate, or as an "improved" estimate of that error, created a powerful illusion that is undoubtedly the source of its appeal: that a deductively derived error rate and an inductive measure of inferential strength were identical, and that the "objective" qualities of the first could be bestowed upon the latter. Fisher and Neyman fought vigorously in print over which approach was preferable, and Fisher in particular expressed profound dismay at seeing his "significance probability" absorbed into hypothesis testing ("acceptance procedures", in his words) [14, 23]. But textbook and software writers, either not wanting to confuse readers, or perhaps themselves not being clear about the issues, obscured the distinction, and technology triumphed over philosophy [17].
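The gap between the pre-experiment error rate and an observed P value can be made concrete with a small simulation. Under a true null hypothesis, the two-sided P value of a Z test is uniformly distributed, so the unconditional statement Pr(P ≤ 0.05 | H0) = 0.05 holds; a particular observed value such as P = 0.05 is a different object, whose evidential weight is taken up in the next section. The sketch below is illustrative only and is not drawn from the article.

```python
# Sketch: under H0 the two-sided P value of a Z test is uniform on (0, 1),
# so "P <= 0.05" occurs in about 5% of replications. This is the
# unconditional, pre-experiment error-rate property of the procedure,
# not a statement about any single observed P value.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
z = rng.normal(size=100_000)          # Z statistics when H0 is true
p = 2 * norm.sf(np.abs(z))            # two-sided P values

print("Pr(P <= 0.05 | H0) =", np.mean(p <= 0.05))   # close to 0.05
print("Pr(P <= 0.01 | H0) =", np.mean(p <= 0.01))   # close to 0.01
```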
The P Value as a Measure of Evidence

We have seen above how the conditionality of the P value makes it inappropriate to view it as a post-test α level. We will see that its unconditional characteristics render it problematic as an evidential measure as well. Fisher's proposal that the P value be used as a measure of evidence appeared based on the intuitively appealing idea that the more unlikely an event was under the null hypothesis, the more "evidence" that event provided against the hypothesis. The tail area appeared to be a convenient way in which to index the statistical distance between the null hypothesis and the data. Fisher himself did not seem wedded to that measure, simultaneously endorsing the use of the mathematical likelihood as an alternative. If the P value were used informally as Fisher originally intended, it could be viewed as equivalent to any other functional transformation of the distance between the observed statistic and its null value, like a Z-score or standardized likelihood. But its use in any formal way poses several difficulties. Because the P value is calculated only with respect to one hypothesis, and has no information, by itself, about the magnitude of the observed effect (or equivalently, about power), it implicitly excludes the magnitude of effect from the definition of "evidence". Small deviations from the null hypothesis in large experiments can have the same P value as large deviations in small experiments. The likelihood functions for these two
results are quite different, as are any data-independent summaries of the likelihoods. This difference is also reflected in the perspective of most scientists, who would typically draw different conclusions from such a pair of results. The corrective in the biomedical literature has been to urge the reporting of P values together with estimates of effect size, and of the precision of the measured effect, usually reported as a confidence interval [1, 15, 25]. This does not completely solve the problem of representing the evidence with the same number when the data appear quite different, but it at least gives scientists additional information upon which to base conclusions. The dependence on only one hypothesis means that data with different inferential meaning can have the same P value [4]. A converse problem occurs when the same data can be represented by two different P values. This problem is created by the inclusion of "unobserved" outcomes in the tail area calculation of the P value. Experiments of different design can have different "unobserved" outcomes even if the observed data are identical. The classic example involves the contrast between a fixed sample size experiment and one in which a data-dependent stopping rule is used. Suppose that two treatments, A and B, are applied to each subject in a clinical trial, and the preferred treatment for that subject recorded. The sequence of preferred treatments is five As followed by one B. If this was planned as an experiment with n = 6, then the one-sided P value based on the binomial distribution is

$\binom{6}{5}\left(\tfrac{1}{2}\right)^{5}\left(\tfrac{1}{2}\right)^{1} + \left(\tfrac{1}{2}\right)^{6} = 0.11.$
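A quick numerical check of this tail area, and of the stopping-rule (negative binomial) version discussed in the next paragraph, is sketched below; it uses only the numbers given in the text.

```python
# Check of the worked example: the same data (five A-preferences, then one B)
# yield different one-sided P values under different designs.
from math import comb

p = 0.5

# Design 1: n = 6 fixed. "As or more extreme" = 5 or 6 A-preferences.
p_fixed_n = sum(comb(6, k) * p**k * (1 - p)**(6 - k) for k in (5, 6))
print(f"fixed n = 6:         P = {p_fixed_n:.3f}")      # 0.109, about 0.11

# Design 2: stop at the first B. "More extreme" = the first B arriving
# even later, so P = Pr(first five preferences are all A) = (1/2)^5.
p_stop_at_first_B = p**5
print(f"stop at the first B: P = {p_stop_at_first_B:.3f}")   # 0.031
```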
If the experiment had been designed to stop when the first B success was observed, then the more extreme results would consist of longer sequences of As, and the P value based on the negative binomial would be

$\left(\tfrac{1}{2}\right)^{5}\left(\tfrac{1}{2}\right) + \left(\tfrac{1}{2}\right)^{6}\left(\tfrac{1}{2}\right) + \cdots = 0.031.$

While this is an idealized situation, this issue is manifest in real-world applications in the discussions of how to appropriately measure the evidence provided by biomedical experiments that have stopped unexpectedly or because of large observed differences. In one trial of medical therapy known by its acronym of "ECMO", a variety of P values could be derived from the data presented [22].

Perhaps the most illuminating examinations of the inferential meaning of P values have used Bayesian or likelihood approaches. Bayesian analyses show that the P value is approximately the Bayesian posterior probability, under a diffuse prior, of an effect being in the direction opposite to the one observed, relative to the null hypothesis. More generally, the one-sided P value is the lower bound on that probability for all unimodal symmetric priors centered on the null [7]. Both likelihood and Bayesian arguments show that the P value substantially overstates the evidence against a simple, "sharp" null hypothesis, particularly for P values above 0.01. In the normal (Gaussian) case, the minimum likelihood ratio for the null, which is the minimum Bayes factor as well, is substantially higher than the associated P value, as shown in Table 1 [5, 12].

Table 1. The relationship between the observed Z-score (standard normal deviate), the fixed-sample-size two-sided P value, the Gaussian standardized likelihood exp(−Z²/2), and the smallest possible Bayesian posterior probability when the prior probability on the simple null hypothesis is 0.5. The latter is calculated using Bayes' theorem and equals stand.lik./(1 + stand.lik.).

Z-score   P value (two-sided)   Gaussian standardized likelihood   Min. Pr(H0|Z) when Pr(H0) = 0.5
1.64      0.10                  0.26                               0.21
1.96      0.05                  0.15                               0.13
2.17      0.03                  0.09                               0.08
2.58      0.01                  0.036                              0.035
3.29      0.001                 0.0045                             0.0045

Since the P value is usually calculated relative to a sharp null hypothesis, most users interpret its magnitude as reflecting on the null hypothesis. The standardized likelihood [exp(−Z²/2)] is the smallest Bayes factor that can multiply the prior odds of the null hypothesis to calculate the posterior odds. We see that the odds are not changed nearly as much as the P value's magnitude would suggest; nor is the probability. In the range of very small P values (P < 0.001) the quantitative differences are not likely to be important in practice. But in the range which includes many P values reported in biomedical research – that is, 0.01 < P < 0.10 – the differences between most users' impression of the P value's meaning and its
maximum inferential weight is striking. When one-sided P values are used, the contrast is even more marked.
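The entries of Table 1 follow directly from the quantities named in its caption, as the short sketch below shows; it is a verification aid added here, not part of the original article.

```python
# Reproduce Table 1: two-sided P value, minimum Bayes factor exp(-Z^2/2)
# (the Gaussian standardized likelihood), and the smallest posterior
# probability of H0 when the prior probability is 0.5 (prior odds = 1).
import math
from scipy.stats import norm

for z in (1.64, 1.96, 2.17, 2.58, 3.29):
    p_two_sided = 2 * norm.sf(z)
    min_bf = math.exp(-z * z / 2)
    min_post = min_bf / (1 + min_bf)
    print(f"Z = {z:4.2f}  P = {p_two_sided:.3f}  "
          f"min BF = {min_bf:.4f}  min Pr(H0|Z) = {min_post:.4f}")
```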
One-sided vs. Two-sided P Values

Much has been written about how one could choose whether to cite a one- or two-sided P value. This is somewhat academic, since the de facto standard in the biomedical literature is for two-sided tests. Some writers have stressed that this is little more than a semantic distinction – that is, about what label should be attached to a Z = 1.96 – and that if the result is completely reported, a reader could make the appropriate adjustment if they judge it appropriate. It is interesting to note that this distinction becomes more than academic if one dichotomizes results into "significant" and "not significant". Then, whether the test was one- or two-sided is an important factor in assessing the meaning of the verdict. However, this assumes that not even the direction of the result is reported, which is unlikely. This issue is similar to that confronting the experimenter who stops a trial before its planned end. In both situations, an experimenter's intentions or thoughts are taken as relevant to the strength of evidence, such that two persons with the same data could quote different P values. There is no clear resolution to this, since those who focus on the unconditional aspects of the P value will see such considerations as important, and those who regard it primarily as a conditional evidential measure will insist that such considerations should be irrelevant. In general, the custom that most or all P values be reported as two-sided seems a good one, with the condition that if one-sided P values are used, this be indicated clearly enough so their value could be doubled by a reader. If the P values are in a range in which doubling makes a substantive difference, the evidence is equivocal enough so there will be controversy regardless of how the P value is reported.
Conclusions

The P value represented an attempt by R.A. Fisher to provide a measure of evidence against the null hypothesis. Its peculiar combination of conditionality and unconditionality, combined with its absorption into the hypothesis testing procedure of Neyman and
Pearson, have led to its misinterpretation, widespread use, and seeming imperviousness to numerous criticisms. Since P values are not likely to soon disappear from the pages of medical journals or from the toolbox of statisticians, the challenge remains how to use them and still properly convey the strength of evidence provided by research data, and how to decide whether issues of design or analysis should be incorporated into their calculation.
References

[1] Altman, D.G. & Gardner, M. (1992). Confidence intervals for research findings, British Journal for Obstetrics and Gynaecology 99, 90–91.
[2] Barnard, G. (1966). The use of the likelihood function in statistical practice, in Proceedings of the Fifth Berkeley Symposium, Vol. 1. University of California Press, Berkeley, pp. 27–40.
[3] Berger, J. (1986). Are P-values reasonable measures of accuracy?, in Pacific Statistics Congress, I. Francis, B. Manly & F. Lam, eds. North-Holland, Amsterdam.
[4] Berger, J.O. & Berry, D.A. (1988). Statistical analysis and the illusion of objectivity, American Scientist 76, 159–165.
[5] Berger, J. & Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of P-values and evidence, Journal of the American Statistical Association 82, 112–139.
[6] Berkson, J. (1942). Tests of significance considered as evidence, Journal of the American Statistical Association 37, 325–335.
[7] Casella, G. & Berger, R. (1987). Reconciling Bayesian and frequentist evidence in the one-sided testing problem, Journal of the American Statistical Association 82, 106–111.
[8] Cornfield, J. (1966). Sequential trials, sequential analysis and the likelihood principle, American Statistician 20, 18–23.
[9] Cox, D. (1977). The role of significance tests, Scandinavian Journal of Statistics 4, 49–70.
[10] Cox, D. & Hinckley, D. (1974). Theoretical Statistics. Chapman & Hall, London.
[11] Dupont, W. (1983). Sequential stopping rules and sequentially adjusted P values: does one require the other (with discussion)?, Controlled Clinical Trials 4, 3–10.
[12] Edwards, W., Lindman, H. & Savage, L. (1963). Bayesian statistical inference for psychological research, Psychological Review 70, 193–242.
[13] Fisher, R. (1973). Statistical Methods for Research Workers, 14th Ed. Hafner, New York. Reprinted by Oxford University Press, Oxford, 1990.
[14] Fisher, R. (1973). Statistical Methods and Scientific Inference, 3rd Ed. Macmillan, New York.
[15] Gardner, M. & Altman, D. (1986). Confidence intervals rather than P values: estimation rather than hypothesis testing, Statistics in Medicine 292, 746–750.
[16] Gibbons, J. & Pratt, J. (1975). P-values: interpretation and methodology, American Statistician 29, 20–25.
[17] Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J. & Kruger, L. (1989). The Empire of Chance. Cambridge University Press, Cambridge.
[18] Goodman, S.N. (1993). P-values, hypothesis tests and likelihood: implications for epidemiology of a neglected historical debate (with commentary and response), American Journal of Epidemiology 137, 485–496.
[19] Hacking, I. (1965). The Logic of Statistical Inference. Cambridge University Press, Cambridge.
[20] Howson, C. & Urbach, P. (1989). Scientific Reasoning: the Bayesian Approach. Open Court, La Salle.
[21] Jeffreys, H. (1961). Theory of Probability, 3rd Ed. Oxford University Press, Oxford.
[22] Lin, D.Y. & Wei, L.J. (1989). Comments on "Investigating therapies of potentially great benefit: ECMO", Statistical Science 4, 324–325.
[23] Neyman, J. (1952). Lectures and Conferences on Mathematical Statistics and Probability, 2nd Ed. The Graduate School, US Department of Agriculture, Washington.
[24] Neyman, J. & Pearson, E. (1933). On the problem of the most efficient tests of statistical hypothesis, Philosophical Transactions of the Royal Society Series A 231, 289–337.
[25] Rothman, K. (1978). A show of confidence, New England Journal of Medicine 299, 12–13.
[26] Seidenfeld, T. (1979). Philosophical Problems of Statistical Inference. Reidel, Dordrecht.
STEVEN N. GOODMAN
QUALITY ASSESSMENT OF CLINICAL TRIALS
DAVID MOHER
University of Ottawa; Chalmers Research Group, Children's Hospital of Eastern Ontario Research Institute; and Departments of Pediatrics and Epidemiology and Community Medicine, Faculty of Medicine, Canada

Commercial airline pilots have enormous responsibilities. They are accountable for expensive equipment, undergo extensive training and life-long learning, and must abide by international standards regardless of where they fly. Part of this standardization includes going through a checklist of tasks before any flight departure. For the flight to become airborne, the checklist must be completed. A completed checklist provides the flight crew with important information regarding the status of the aircraft and the safety and well being of the passengers.

Clinical trialists share several similarities with airline pilots. Many of them are investigators (and principal investigators) of large multi-center randomized controlled trials (RCTs). It is not uncommon for these studies to cost millions of dollars to complete, and they are typically the end stage of a long development process that can easily cost in excess of $100 million. Clinical trialists have often completed advanced training, such as clinical epidemiology, and very definitely have the safety and welfare of trial participants uppermost in their minds. What seems to be missing from the trialist's arsenal is standardization of the process of conducting such studies. More recently, some consensus has developed on how RCTs should be reported (1), although only a very small percentage (2.5%) of the approximately 20,000 health care journals have bought into such reporting standards. Without an agreed-upon checklist of standard operating procedures, clinical trialists are left in the precarious position of operating in a partial and loose vacuum. Although eager to do the best possible job, it becomes increasingly difficult to know whether the investment of monies results in a quality product, namely, a report of an RCT whereby the estimates of the intervention's effectiveness are free of bias (i.e., systematic error).

Quality is a ubiquitous term, even in the context of RCTs. Importantly, it is a construct, and not something that one can actually put their hands on. It is a gauge for how well the RCT was conducted or reported. In the context of this article the discussions around quality are related to internal validity, and it is defined as "the confidence that the trial design, conduct, analysis, and presentation has minimized or avoided biases in its intervention comparisons" (2). It is possible to conduct an RCT with excellent standards and report it badly, thereby giving the reader an impression of a "low" quality study. Alternatively, an RCT may be conducted rather poorly and very well reported, providing readers with the impression of "high" quality. Typically, the only way to ascertain the quality of an RCT is to examine its report. It is the only tangible evidence as to how such studies are conducted. In order to reasonably rely on the quality of a trial's report, it must be a close facsimile of how the trial was conducted. Early investigations pointed in this direction (3, 4). Liberati et al. assessed the quality of reports of 63 published RCTs using a 100-item quality assessment instrument developed by Thomas C. Chalmers (5). To evaluate whether the quality of reports closely resembled their conduct, the investigators interviewed 62 (of 63) corresponding authors. The average quality ratings went from 50% (reports) to 57% (conduct), thus suggesting a similar relationship between the quality of reports and their conduct. However, more recent studies suggest that this relationship may not be as simple as previously thought (6, 7). Soares et al. assessed the quality of 58 published reports of terminated phase III RCTs completed by the Radiation Therapy Oncology Group, comparing them with the quality of the protocols for which they had complete access. These investigators reported that in six of the seven quality items examined, a substantially higher
reporting in the protocol existed compared with what was published. For example, only 9% of published reports provided information on the sample size calculation, yet this information was readily available in 44% of the Group’s protocols. This study is important for several reasons, not the least of which is that the authors had access to all the protocols from which the published report emanated. If one is to gain a more thorough understanding of the relationship between trial reporting and conduct, it is important that journals and their editors support the publication of RCT protocols (8, 9). Such information provides readers with a full and transparent account of what was planned and what actually happened. Although these results are impressive and might suggest more of a chasm between reporting and conduct, caution is advised. This publication is recent and time is required to permit replication and extension by other groups. Historians may well view the first 50 years of reporting of RCTs with some amazement. They will encounter what might be described as a cognitive dissonance: a disconnection between the increased sophistication of the design of these studies and the apparent lack of care – disastrous in some cases – with which they were reported. Unfortunately, despite much evidence that trials are not all performed to high methodological standards, many surveys have found that trial reporting is often so inadequate that readers cannot know exactly how the trials were done (10–14). Three approaches exist to assess the quality of RCT reports: components, checklists, and scales. Component assessment focuses on how well a specific item is reported. Such an approach has the advantage of being able to complete the assessment quickly and is not encumbered with issues surrounding other methods of quality assessment. Allocation concealment is perhaps the best-known example of component quality assessment (15). Checklists combine a series of components together, hopefully items that are conceptually linked. No attempt is made to combine the items together and come up with an overall summary score. The CONSORT group developed a 22-item checklist to help investigators improve the quality of
reporting their trials (1). Scales are similar to checklists with one major difference: scales are designed to sum their individual items into an overall summary score. Many scales have been developed (2, 16). Jadad et al. developed a three-item scale to assess the quality of reports of RCTs in pain (17). Assessing the quality of trial reports using scales is currently considered contentious by some (18, 19). Using 25 different scales, Jüni et al. assessed the quality of 17 trial reports included in a systematic review examining thromboprophylaxis with low-molecular-weight heparin (LMWH) compared with standard heparin. They reported little consistency across the scales and found that different scales yielded different assessments of trial quality. When these quality scores were incorporated into the meta-analysis, they produced different estimates of the intervention's effectiveness (i.e., LMWH was 'apparently' effective using some scales and ineffective using others). Psychometric theory would have predicted this result. This observation is important: if different scales are applied to the same trial, inconsistent quality assessments and estimates of an intervention's effectiveness, often in opposite directions, can result. Given the unfortunate process by which quality scales have been developed, this finding is not altogether unexpected. Indeed, previous research pointed in this direction (2). In the four years since this publication, no published replication or extension of this work has occurred; until such an extension happens, the results should be interpreted cautiously. Scales can provide holistic information that may not be forthcoming with the use of a component approach. It is ironic that the use of scales in the context of systematic reviews is considered so problematic: the science of systematic reviews is predicated, in part, on the axiom that the sum is better than the individual parts, yet summary measures of quality assessment are considered inappropriate by some. Pocock et al. reported that a statement about sample size was mentioned in only five (11.1%) of 45 reports and that only six (13.3%) made use of confidence intervals (20). These investigators also noted
that the statistical analysis tended to exaggerate intervention efficacy because authors reported a higher proportion of statistically significant results in their abstracts compared with the body of the papers. A review of 122 published RCTs that evaluated the effectiveness of selective serotonin reuptake inhibitors (SSRIs) as a first-line management strategy for depression found that only one (0.8%) paper described randomization adequately (21). A review of 2000 reports of trials in schizophrenia indicated that only 1% achieved a maximum possible quality score with little improvement over time (22). Such results are the rule rather than the exception. And, until quite recently, such results would have had little impact on clinical trialists and others. A landmark investigation from 1995 found empirical evidence that results may be biased when trials use inferior methods or are reported without adequately describing the methods; notably, the failure to conceal the allocation process is associated with an exaggeration of the effectiveness of an intervention by 30% or more (23). Schulz et al. assessed the quality of randomization reporting for 250 controlled trials extracted from 33 meta-analyses of topics in pregnancy and childbirth, and then they analyzed the associations between those assessments and estimated intervention effects. Trials in which the allocation sequence had been inadequately or unclearly concealed yielded larger estimates of treatment effects (odds ratios exaggerated, on average, by 30% to 40%) than those of trials in which authors reported adequate allocation concealment. These results provide strong empirical evidence that inadequate allocation concealment contributes to bias in estimating treatment effects. Several studies subsequently examined the relationship between quality and estimated intervention effects (24–27). In an important demonstration of the contributions methodologists can make to detecting and understanding the extent by which bias can influence the results of clinical trials, Egger et al. have set a standard, completing a series of systematic reviews around methodological questions, such as quality (28). These authors report that after pooling 838 RCT
reports, the effect of low quality on the estimates of an intervention’s effectiveness may be large, on the order of 30%, although some differences may exist in some methodological evaluations. The cause for concern is obvious: Treatments may be introduced that are perhaps less effective than was thought, or even ineffective. In the mid-1990s, two independent initiatives to improve the quality of reports of RCTs led to the publication of the ‘‘original’’ CONSORT statement that was developed by an international group of clinical trialists, statisticians, epidemiologists, and biomedical editors (29). The CONSORT statement consists of a checklist and flow diagram. The statement was revised in 2001, fine-tuning its focus on the ‘‘simple’’ two-group parallel design (1) along with an accompanying explanation and elaboration document (30). The intent of the latter document, known as the E & E, is to help clarify the meaning and rationale of each checklist item. Two aspects separate CONSORT from previous efforts. First, authors are asked to report particulars about the conduct of their studies, because failure to do so clearly is associated with producing biased treatment results. This approach is in keeping with the emergence of evidence-based health care. Second, CONSORT was inclusionary, whereby a wide variety of experts, including clinical trialists, methodologists, epidemiologists, statisticians, and editors, participated in the whole process. Continual review and updating of CONSORT is essential. To maintain the activities of the CONSORT Group requires considerable effort, and a mechanism has been developed to monitor the evolving literature and help keep the CONSORT Statement evidence-based. Some items on the CONSORT checklist are already justified by solid evidence that they affect the validity of the trial being reported. Methodological research validating other items is reported in a diverse set of journals, books, and proceedings. In order to bring this body of evidence together, several CONSORT members have formed an ‘‘ESCORT’’ working party. They are starting to track down, appraise, and annotate reports that provide Evidence Supporting (or refuting) the CONSORT Standards On Reporting
Trials. The ESCORT group would appreciate receiving citations of reports that readers consider relevant to any items on the checklist (via the CONSORT website). CONSORT has been supported by a growing number of medical and health care journals (e.g., Canadian Medical Association Journal, Journal of the American Medical Association, and British Medical Journal) and editorial groups, including the International Committee of Medical Journal Editors (ICMJE, the Vancouver Group), the Council of Science Editors (CSE), and the World Association of Medical Editors (WAME). CONSORT is also published in multiple languages. It can be accessed, together with other information about the CONSORT group, at www.consort-statement.org. There have been some initial indications that the use of CONSORT does improve the quality of reporting of RCTs. Moher et al. examined 71 RCTs published in three journals in 1994; allocation concealment was not clearly reported in 61% (n = 43) of them (31). Four years later, after these three journals required authors reporting an RCT to use CONSORT, the percentage of papers in which allocation concealment was not clearly reported had dropped to 39% (30 of 77; difference = −22%; 95% confidence interval of the difference: −38% to −6%). Devereaux et al. reported similarly encouraging results in an evaluation of 105 RCT reports from 29 journals (32): CONSORT ''promoter'' journals reported a statistically significantly higher number of factors (6.0 of 11) than nonpromoter journals (5.1 of 11). Egger et al. examined the usefulness of the flow diagram by reviewing 187 RCT reports published during 1998 in four CONSORT ''adopting'' journals and comparing them with 83 reports from a nonadopting journal (33). They observed that the use of flow diagrams was associated with better reporting in general. Although the simple two-group design is perhaps the most common design reported (34), a quick examination of any journal issue indicates that other designs are used and reported. Although most elements of the CONSORT statement apply equally to these other designs, certain elements need to be adapted and, in some cases, additional elements need to be added, to adequately
report these other designs. The CONSORT group is now developing CONSORT ‘‘extension papers’’ to fill in the gaps. A CONSORT extension for reporting randomized cluster (group) designs was published recently (35). Other trial extension papers in development consider equivalence, non-inferiority, multi-armed parallel, factorial (a special case of multi-arm), and concurrent within individual trials. The six articles will have a standard structure, mirroring features of previous publications. The CONSORT Group will introduce and explain the key methodological features of that design, consider empirical evidence about how commonly these trials have been used (and misused), and review any published evidence relating to the quality of reporting of such trials. After these brief literature reviews, the Group will provide a design-specific CONSORT checklist (and flow diagram, if applicable) and provide examples of good reporting. The goal is to publish these extensions in different journals, in the hope of increasing their dissemination throughout the disciplines of clinical medicine. The poor quality of reporting of harm (safety, side effects) in RCTs has recently received considerable attention. Among 60 RCTs on antiretroviral treatment with at least 100 patients, only a minority of reports provided reasons and numbers per arm of withdrawals resulting from toxicity, and of participants with severe or life-threatening clinical adverse events (36). These observations have been validated in a substantially larger study of 192 trials covering antiretroviral therapy and 6 other medical fields (37). Overall, the space allocated to safety in the Results section of RCTs was slightly less than that allocated to the names of the authors and their affiliations. The median space was only 0.3 pages across the 7 medical fields. To help address these problems, the Group developed a paper similar in format to the other extension papers. Ten recommendations that clarify harms-related issues are each accompanied by an explanation and examples to highlight specific aspects of proper reporting. For example, fever in vaccine trials may be defined with different cut-offs, measured at various body sites, and at different times after immunization (38).
The results of such assessments are obviously problematic. The fourth recommendation asks authors to report whether the measurement instruments used to assess adverse events were standardized and validated. This document is currently in peer review (39). There is an increasing need to standardize many aspects of RCT conduct and reporting. Until this standardization happens, these studies run an ever-increasing risk of misadventure and inappropriate interpretation. For example, Devereaux et al. recently reported on ''attending'' internal medicine physicians' interpretations of various aspects of blinding in the context of trials (40). The 91 respondents (92% response rate) provided 10, 17, and 15 unique interpretations of single, double, and triple blinding, respectively. More than 41,000 RCTs are now actively recruiting participants (41). As their numbers suggest, such studies play an important and central role in the development and maintenance of the delivery of evidence-based health care. If RCTs are to be conducted and reported to the highest possible standards, considerable energy must be spent on improving their conduct and reporting. Only through a well-funded and continuing program of research, evaluation, and dissemination will standard-making groups be able to provide up-to-date knowledge and guidance as to the importance of specific conduct and reporting recommendations.
REFERENCES 1. D. Moher, K. F. Schulz, and D. G. Altman, for the CONSORT Group, The CONSORT statement: revised recommendations for improving the quality of reports of parallel group randomized trials. Ann. Intern. Med. 2001; 134: 657–662. 2. D. Moher, A. R. Jadad, and P. Tugwell, Assessing the quality of randomized controlled trials: current issues and future directions. Int. J. Technol. Assess. Health Care 1996; 12: 195–208. 3. A. Liberati, H. N. Himel, and T. C. Chalmers, A quality assessment of randomized control trials of primary treatment of breast cancer. J. Clin. Oncol. 1986; 4: 942–951.
4. V. Hadhazy, J. Ezzo, and B. Berman, How valuable is effort to contact authors to obtain missing data in systematic reviews. Presented at the VII Cochrane Colloquium, Rome, Italy, October 5–9, 1999. 5. T. C. Chalmers, H. Smith, B. Blackburn, B. Silverman, B. Schroeder, D. Reitman, and A. Ambroz, A method for assessing the quality of a randomized control trial. Control Clin. Trials 1981; 2: 31–49. 6. C. L. Hill, M. P. LaValley, and D. T. Felson, Discrepancy between published report and actual conduct of randomized clinical trials. J. Clin. Epidemiol. 2002; 55: 783–786. 7. H. P. Soares, S. Daniels, A. Kumar, M. Clarke, C. Scott, S. Swann, and B. Djulbegovic, Bad reporting does not mean bad methods for randomised trials: observational study of randomised controlled trials performed by the Radiation Therapy Oncology Group. BMJ 2004; 328: 22–24. 8. F. Godlee, Publishing study protocols: making them visible will improve registration, reporting and recruitment. BMC 2001; 2(4): 1. 9. D. G. Altman, personal communication. 10. K. F. Schulz and D. A. Grimes, Sample size slippages in randomised trials: exclusions and the lost and wayward. Lancet 2002; 359: 781–785. 11. K. F. Schulz and D. A. Grimes, Unequal group sizes in randomised trials: guarding against guessing. Lancet 2002; 359: 966–970. 12. K. F. Schulz and D. A. Grimes, Blinding in randomised trials: hiding who got what. Lancet 2002; 359: 696–700. 13. K. F. Schulz and D. A. Grimes, Allocation concealment in randomised trials: defending against deciphering. Lancet 2002; 359: 614–618. 14. K. F. Schulz and D. A. Grimes, Generation of allocation sequences in randomised trials: chance, not choice. Lancet 2002; 359: 515–519. 15. K. F. Schulz, I. Chalmers, D. A. Grimes, and D. G. Altman, Assessing the quality of randomization from reports of controlled trials published in obstetrics and gynecology journals. JAMA 1994; 272: 125–128. 16. P. Jüni, D. G. Altman, and M. Egger, Systematic reviews in health care: assessing the quality of controlled clinical trials. BMJ 2001; 323: 42–46. 17. A. R. Jadad, R. A. Moore, D. Carroll, C. Jenkinson, D. J. Reynolds, D. J. Gavaghan, and H. J. McQuay, Assessing the quality of
reports of randomized clinical trials: is blinding necessary? Control Clin. Trials 1996; 17: 1–12. 18. P. Jüni, A. Witschi, R. Bloch, and M. Egger, The hazards of scoring the quality of clinical trials for meta-analysis. JAMA 1999; 282: 1054–1060. 19. J. A. Berlin and D. Rennie, Measuring the quality of trials: the quality of quality scales. JAMA 1999; 282: 1083–1085. 20. S. J. Pocock, M. D. Hughes, and R. J. Lee, Statistical problems in the reporting of clinical trials. N. Engl. J. Med. 1987; 317: 426–432. 21. M. Hotopf, G. Lewis, and C. Normand, Putting trials on trial - the costs and consequences of small trials in depression: a systematic review of methodology. J. Epidemiol. Community Health 1997; 51: 354–358. 22. B. Thornley and C. E. Adams, Content and quality of 2000 controlled trials in schizophrenia over 50 years. BMJ 1998; 317: 1181–1184. 23. K. F. Schulz, I. Chalmers, R. J. Hayes, and D. G. Altman, Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995; 273: 408–412. 24. E. M. Balk, P. A. Bonis, H. Moskowitz, C. H. Schmid, J. P. Ioannidis, C. Wang, and J. Lau, Correlation of quality measures with estimates of treatment effect in meta-analyses of randomized controlled trials. JAMA 2002; 287: 2973–2982. 25. L. L. Kjaergard, J. Villumsen, and C. Gluud, Reported methodologic quality and discrepancies between large and small randomized trials in meta-analyses. Ann. Intern. Med. 2001; 135: 982–989. 26. P. Jüni, D. Tallon, and M. Egger, 'Garbage in-garbage out'? Assessment of the quality of controlled trials in meta-analyses published in leading journals. In: Proceedings of the 3rd Symposium on Systematic Reviews: Beyond The Basics, St Catherine's College, Oxford. Centre for Statistics in Medicine, Oxford, United Kingdom, 2000: 19. 27. D. Moher, B. Pham, A. Jones, D. J. Cook, A. R. Jadad, M. Moher, P. Tugwell, and T. P. Klassen, Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet 1998; 352: 609–613. 28. M. Egger, P. Jüni, C. Bartlett, F. Holenstein, and J. Sterne, How important are comprehensive literature searches and the assessment of trial quality in systematic reviews? Empirical study. Health Technol. Assess. 2003; 7(1): 1–76. 29. C. Begg, M. Cho, S. Eastwood, R. Horton, D. Moher, I. Olkin, R. Pitkin, D. Rennie, K. Schulz, D. Simel, and D. Stroup, Improving the quality of reporting of randomized controlled trials: the CONSORT statement. JAMA 1996; 276: 637–639. 30. D. G. Altman, K. F. Schulz, D. Moher, M. Egger, F. Davidoff, D. Elbourne, P. C. Gøtzsche, and T. Lang, for the CONSORT Group, The revised CONSORT statement for reporting randomized trials: explanation and elaboration. Ann. Intern. Med. 2001; 134: 663–694. 31. D. Moher, A. Jones, and L. Lepage, for the CONSORT Group, Use of the CONSORT statement and quality of reports of randomized trials: a comparative before-and-after evaluation. JAMA 2001; 285: 1992–1995. 32. P. J. Devereaux, B. J. Manns, W. A. Ghali, H. Quan, and G. H. Guyatt, The reporting of methodological factors in randomized controlled trials and the association with a journal policy to promote adherence to the Consolidated Standards of Reporting Trials (CONSORT) checklist. Control Clin. Trials 2002; 23: 380–388. 33. M. Egger, P. Jüni, and C. Bartlett, for the CONSORT Group, Value of flow diagrams in reports of randomized controlled trials. JAMA 2001; 285: 1996–1999. 34. B. Thornley and C. E. Adams, Content and quality of 2000 controlled trials in schizophrenia over 50 years. BMJ 1998; 317: 1181–1184. 35. M. K. Campbell, D. R. Elbourne, and D. G. Altman, for the CONSORT Group, The CONSORT statement: extension to cluster randomised trials. BMJ, in press. 36. J. P. A. Ioannidis and D. G. Contopoulos-Ioannidis, Reporting of safety data from randomized trials. Lancet 1998; 352: 1752–1753. 37. J. P. A. Ioannidis and J. Lau, Completeness of safety reporting in randomized trials: an evaluation of 7 medical areas. JAMA 2001; 285: 437–443. 38. J. Bonhoeffer, K. Kohl, R. Chen et al., The Brighton Collaboration: addressing the need for standardized case definitions of adverse events following immunization (AEFI). Vaccine 2002; 21: 298–302. 39. J. Ioannidis, personal communication. 40. P. J. Devereaux, B. J. Manns, W. A. Ghali, H. Quan, C. Lacchetti, V. M. Montori, M. Bhandari, and G. H. Guyatt, Physician interpretations and textbook definitions of blinding terminology in randomized controlled trials. JAMA 2001; 285: 2000–2003. 41. CenterWatch. (2004). (online). Available: http://www.centerwatch.com/.
QUALITY ASSURANCE

Quality assurance comprises all planned and systematic actions that are established to ensure that the trial is performed, and that the data are generated, documented (recorded), and reported, in compliance with Good Clinical Practice (GCP) and the applicable regulatory requirement(s).
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
Wiley Encyclopedia of Clinical Trials, Copyright © 2008 John Wiley & Sons, Inc.
QUALITY CONTROL

Quality control comprises the operational techniques and activities undertaken within the quality assurance system to verify that the requirements for quality of the trial-related activities have been fulfilled. Quality control should be applied to each stage of data handling to ensure that all data are reliable and have been processed correctly.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
Wiley Encyclopedia of Clinical Trials, Copyright © 2008 John Wiley & Sons, Inc.
QUALITY OF LIFE
patient’s perception of an abnormal physical, emotional, or cognitive state’’. Functional status includes four dimensions: physical, physiological, social, and role activity. General health perceptions include the patients’ evaluation of past and current health, their future outlook, and concerns about health. All these factors subsequently influence the overall evaluation of quality of life (see Figure 1). Although various definitions of HRQoL have been proposed during the past decade, there is general agreement that HRQoL is a multidimensional concept that focuses on the impact of disease and its treatment on the well-being of an individual. Cella and Bonomi (2) state
DIANE L. FAIRCLOUGH University of Colorado Health Sciences Center, Denver, CO, USA
PATRICIA A. GANZ UCLA Jonsson Comprehensive Cancer Center, Los Angeles, CA, USA
1
BACKGROUND
The term quality of life (QoL) has been used in a wide variety of ways. In the broadest definition, the quality of our lives is influenced by our physical and social environment as well as our emotional and existential reactions to that environment. From a societal or global perspective, measures of QoL may include social and environmental indicators, such as whether there is affordable housing and how many days of air pollution there are each year in a particular location. These are general issues that concern everyone in a society. Kaplan and Bush (22) proposed the use of the term health-related quality of life (HRQoL) to focus on the specific role of health effects on the individual’s perceptions of well-being, distinguishing these from job satisfaction and environmental factors. In the medical literature, the terms QoL and Health-related quality of life (HRQoL) have become interchangeable.
Health-related quality of life refers to the extent to which one’s usual or expected physical, emotional and social well-being are affected by a medical condition or its treatment.
We may also include other aspects like economic and existential well-being. Patrick and Erickson (29) propose a more inclusive definition that combines quality and quantity. the value assigned to duration of life as modified by the impairments, functional states, perceptions and social opportunities that are influenced by disease, injury, treatment or policy.
All of these definitions emphasize the subjective nature of the evaluation of HRQoL, with a focus on its assessment by the individual. It is important to note that an individual’s well-being or health status cannot be directly measured. We are only able to make inferences from measurable indicators of symptoms and reported perceptions. Often the term quality of life is used when any patient-reported outcome is measured. This has led to both confusion and controversy. Side effects and symptoms are not equivalent to HRQoL, although clearly they influence an individual’s evaluation of quality of life. While symptoms are often part of the assessment of HRQoL, solely assessing symptoms is a simple, convenient way of
1.1 Health-Related Quality of Life The World Health Organization (WHO) defined health in 1948 (38,39) as a ‘‘state of complete physical, mental, and social wellbeing and not merely the absence of infirmity and disease’’. This definition reflects the focus on a broader picture of health. Wilson and Cleary (37) propose a conceptual model of the relationships among health outcomes. There are five levels of outcomes that progress from biomedical measures to quality of life reflecting the WHO definition of health. The biological and physiological outcomes include the results of laboratory tests, radiological scans, and physical examination as well as diagnoses. Symptom status is defined as ‘‘a
avoiding the more complex task of assessing HRQoL. 2 MEASURING HEALTH-RELATED QUALITY OF LIFE Guyatt et al. (19) define an instrument to include the questionnaire, the method of administration, instructions for administration, the method of scoring and analysis, and interpretation for a health status measure. All these aspects are important when evaluating a measure of HRQoL. 2.1 Health Status versus Patient Preferences There are two general types of HRQoL measures, health status assessment, and patient preference assessment (33,40). The development of these two forms is a result of the differences between the perspectives of two different disciplines: psychometrics and econometrics. In the health status assessment measures, multiple aspects of the patient’s perceived well-being are selfassessed and a score is derived from the responses on a series of questions. This score reflects the patient’s relative HRQoL compared with other patients and to the HRQoL of the same patient at other times. These measures are primarily designed to compare groups of patients receiving different treatments or to identify change over time within groups of patients. As a result, these measures have been used in clinical trials to facilitate the comparisons of therapeutic regimens. The assessments range from a single global question asking patients to rate their current quality of life to a series of questions about specific aspects of their daily life during a recent period of time. Among these health status measures, there is considerable range in the context of the questions with some measures focusing more on the perceived impact of the disease and therapy (How much are you bothered by hair loss?),
other measures focusing on the frequency and severity of symptoms (How often do you experience pain?), and still others assessing general status (How would you rate your quality of life?). Measures in the second group, patient preferences, are influenced strongly by the concept of utility borrowed from econometrics, which reflects individual decision making under uncertainty. These preference assessment measures are primarily used to evaluate the trade-off between the quality and quantity of life. Values of utilities are always between 0 and 1 with 0 generally associated with death and 1 with perfect health. Examples include time trade offs (24), standard gamble (32), and multiattribute assessment measures (5,10). Time trade-off utilities are measured by asking respondents how much of the time they expect to spend in their current state would they give up for a reduced period of time in perfect health. If, for example, a patient responded that he would trade five years in his current state for four years in perfect health (trading one year), the resulting utility is 0.8. Standard gamble utilities are measured by asking respondents to identify the point at which they become indifferent to the choices between two hypothetical situations. Suppose a patient is presented with two treatment alternatives, one option is a radical surgical procedure with no chance of relapse but significant impact on HRQoL and the other option is watchful waiting, with a chance of progressive disease and death. The chance of progressive disease and death is raised or lowered until the respondent considers the two options to be equivalent. Assessment of time trade-off and standard gamble utilities requires the presence of a trained interviewer or specialized computer program. Because of these resource needs, these approaches are generally too time- and resource-intensive to use in a large clinical trial. Multiattribute
assessment measures combine the advantages of self-assessment with the conceptual advantages of utility scores. Their use is limited by the need to develop and validate the methods by which the multiattribute assessment scores are converted to utility scores for each of the possible health states defined by the multiattribute assessments. For example, the EuroQoL scale, also known as the EQ-5D, is a standardized non-disease-specific instrument for describing and evaluating HRQoL (3). The EQ-5D covers five dimensions of health: mobility, self-care, role (or main) activity, family and leisure activities, pain and mood. Within each dimension, the respondent chooses one of three items that best describes his or her status. Weights are used in scoring the responses, reducing the 243 (3^5) possible health states to a single utility score. Utilities have traditionally been used in the calculation of quality-adjusted life years (QALYs) for economic evaluation (cost-effectiveness) and policy research as well as in analytic approaches such as Q-TwiST (14,15). It is important to note that the utility one gives to a hypothetical situation has been seen to vary from what the individual gives when the situation is real; results of any analysis should be interpreted carefully from that perspective.
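To make the arithmetic behind these preference-based measures concrete, the following minimal sketch works through the time trade-off example given above (five years in the current state judged equivalent to four years in perfect health) and then uses the resulting utility to compute quality-adjusted life years. The function names and the two-state trajectory are hypothetical illustrations, not part of any instrument; real EQ-5D or standard-gamble scoring relies on instrument-specific tariffs and elicitation procedures that are not shown here.

```python
def time_tradeoff_utility(years_in_state: float, years_traded_for: float) -> float:
    """Utility implied by a time trade-off response.

    A respondent who would accept `years_traded_for` years in perfect health
    in place of `years_in_state` years in the current health state implies a
    utility of years_traded_for / years_in_state for that state.
    """
    if years_in_state <= 0:
        raise ValueError("years_in_state must be positive")
    return years_traded_for / years_in_state


def qalys(trajectory) -> float:
    """Quality-adjusted life years for a list of (utility, duration_in_years) pairs."""
    return sum(u * t for u, t in trajectory)


if __name__ == "__main__":
    # The worked example from the text: 5 years in the current state
    # judged equivalent to 4 years in perfect health -> utility 0.8.
    u = time_tradeoff_utility(5, 4)
    print(f"Time trade-off utility: {u:.2f}")           # 0.80

    # Hypothetical trajectory: 2 years at utility 0.8, then 3 years at 0.6.
    print(f"QALYs: {qalys([(u, 2), (0.6, 3)]):.2f}")     # 3.40
```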
2.2 Objective Versus Subjective

Health status measures differ among themselves in the extent to which they assess observable phenomena or require the respondent to make inferences. These measures may assess symptoms or functional benchmarks wherein individuals are asked about the frequency and severity of symptoms or whether they can perform certain tasks such as walking a mile. The measures may also, or instead, assess the impact of symptoms or conditions by asking individuals how much the symptoms bother them or interfere with their usual activities. Many instruments provide a combination. The value of each will depend on the research objectives: Is the focus to identify the intervention with the least severity of symptoms or to assess the impact of the disease and its treatment? There has been considerable discussion of whether subjective assessments are less valid and reliable than objective measures. This misconception is generally based on the observation that patient ratings do not always agree with ratings of trained professionals. If we take the ratings of these professionals as constituting the gold standard, we are ignoring the valuable information of how the patient views his or her health and quality of life, especially the aspects of emotional and social functioning. There is measurement error in both subjective and objective assessments; neither is necessarily more accurate or precise in all circumstances. Most widely used measures of HRQoL are the product of careful development resulting in a measure that is highly reliable, sensitive to change, with good predictive validity and minimal measurement error. In contrast, some of the biomedical endpoints that we consider objective can include a demonstrably high degree of measurement error (e.g. blood pressure), misclassification among experts, or poor predictive and prognostic validity (e.g. pulmonary function tests) (36).

2.3 Generic Versus Disease-Specific Instruments

There are two basic types of health status measures: generic and disease-specific. The generic instrument is designed to assess HRQoL in individuals with and without active disease, and across disease types (e.g. heart disease, diabetes, depression, cancer). The Medical Outcomes Study Short Form (MOS SF-36) is an example of a generic instrument (35). The broad item content of a generic instrument is an advantage when comparing vastly different groups of subjects or following subjects for extended periods after treatment has ended. Disease-specific instruments narrow the scope of assessment and address, in a more detailed manner, the impact of a particular disease or treatment (e.g. joint pain and stiffness in patients with arthritis or treatment toxicities in patients with cancer) (12). As a result, they may be more sensitive to smaller, but clinically significant, changes induced by treatment (28).
2.4 Global Index versus Profile of Domain-Specific Measures HRQoL measures come in a variety of forms reflecting their intended use. The major distinction is between an index and a profile. Profiles consist of multiple scales that reflect the multiple dimensions of QoL such as the physical, emotional, functional, and social well-being of patients. In most instruments, each scale is constructed from the responses to multiple questions (often referred to as items). Two methods of construction are used for the creation of indices. In the first, a single question is used to assess the subject’s assessment of quality of life. In the latter, developers provide methods to combine responses to multiple questions to provide a single index of QoL. The advantage of the single index is that it provides a straightforward approach to decision making, which may be required in settings such as clinical trials where QoL is the primary outcome. Indices that are in the form of utilities are used in costeffectiveness analyses performed in pharmacoeconomic research. On the other hand, a profile of the various domains reflects the multidimensional character of quality of life (1). There are limitations that should be considered when using either approach. First, there is always a set of ‘‘values’’ being imposed when an index is constructed. These values may come from each individual’s concept of what is meant by quality of life when a single question is asked, or the values that a developer assigns to the construction of the index. The values may be as arbitrary as the number of questions that are used to assess each domain, or statistically derived to maximize discrimination among different groups of patients. It is impossible to construct an index that aggregates the multiple dimensions of HRQoL that will be suitable in all contexts. The important point is that one should be aware of the weights (values) that are placed on the different domains in the interpretation of the results. A single index measuring HRQoL cannot capture changes in individual domains. For example, a particular intervention may produce benefits in one dimension and deficits in another that cancel each other and are thus not observed in the aggregated score.
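A small numerical sketch may help illustrate the cancellation problem just described. The domain names and scores below are invented; the point is only that an equal-weight aggregate (one of many possible value judgments) can hide offsetting changes that a profile of domain scores would reveal.

```python
# Hypothetical 0-100 domain scores before and after an intervention.
baseline  = {"physical": 70, "emotional": 60, "social": 80, "functional": 65}
follow_up = {"physical": 80, "emotional": 50, "social": 80, "functional": 65}

# Profile view: report the change in each domain separately.
profile_change = {d: follow_up[d] - baseline[d] for d in baseline}
print(profile_change)   # {'physical': 10, 'emotional': -10, 'social': 0, 'functional': 0}

# Index view: an equal-weight average across domains.
index_before = sum(baseline.values()) / len(baseline)
index_after = sum(follow_up.values()) / len(follow_up)
print(index_after - index_before)   # 0.0 -- the benefit and the deficit cancel
```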
2.5 Response Format Questionnaires may also differ in their response format. The most widely used format is the Likert scale, which contains a limited number of ordered responses that have a descriptive label associated with each level. Variations include scales in which only the extremes are anchored with a descriptive label. Individuals can discriminate at most seven to ten ordered categories (25,31) and reliability and sensitivity to change drops off at five or fewer levels. Dichotomous response formats and visual analog scales (VAS) are also used. The VAS consists of a line, generally 10 cm in length, with descriptive anchors at each end of the line. The respondent is instructed to place a mark on the line. The original motivation of the VAS was that the continuous measure could potentially discriminate more effectively than a Likert scale; this has not generally been true in most validation studies where both formats have been used. The VAS format has several limitations. It requires a level of eye–hand coordination that may be unrealistic for anyone with a neurological condition, those experiencing numbness and tingling side effects of chemotherapy, and for the elderly. VAS precludes telephone assessment and interview formats. Finally, it requires an additional data management step in which the position of the mark is measured. If forms are copied rather than printed, the full length of the line may vary, requiring two measurements and additional calculations. A compromise format is a numerical analog where patients provide a number between 0 and 100 (see Table 1). 2.6 Period of Recall QoL scales often request individuals to base their evaluation over a specified period, such as the last seven days or the last four weeks. The time frame must be short enough to detect differences between treatments and long enough to minimize short-term fluctuations (noise) that do not represent real change (26). In addition, the reliability with which individuals can rate aspects of their QoL beyond several weeks must be called into question. Scales specific to diseases or treatments where there can be rapid changes
Table 1. Example of a Likert and Visual Analog Scale
will have a shorter recall duration, whereas instruments designed for assessment of general populations will often have a longer recall duration. Longer time frames may also be appropriate when assessments are widely spaced (e.g. annually). 2.7 Scoring The majority of HRQoL scales that are derived from a series of questions with a Likert response format are scored by summing or averaging the responses after reverse coding negatively worded questions. There are more complicated weighting schemes based on factor analytic weights; item response or Rasch models are rare but may become more common in computer assisted testing. To facilitate interpretation, there has been an increasing tendency to rescale this result so that the possible range of responses is 0 to 100 with 100 reflecting the best possible outcome. Most instruments also have an explicit strategy for scoring in the presence of missing responses to a small proportion of questions. The most common method is to impute the missing response using the average of the other responses in the specific subscale when at least half of the questions have been answered. In contrast, utilities are always on a 0 to 1 scale. Scoring depends on the method used to elicit the preferences. Scores derived from multiattribute assessment are instrument specific.
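The scoring conventions just described (reverse coding negatively worded items, averaging, rescaling to a 0 to 100 range, and scoring a subscale only when at least half of its items were answered) can be sketched as follows. This is a generic illustration under assumed conventions, not the scoring algorithm of any particular instrument; the 1 to 5 item range and the reverse-coded item are assumptions made for the example.

```python
def score_scale(responses, reverse_items, min_resp=1, max_resp=5):
    """Score one Likert subscale on a 0-100 scale (100 = best possible outcome).

    `responses` is a list of item responses (None where missing); items whose
    0-based positions appear in `reverse_items` are negatively worded and are
    reverse coded before averaging. Half-rule: if fewer than half of the items
    were answered, the scale score is missing (None); otherwise averaging the
    answered items implicitly imputes each missing item at the subscale mean.
    """
    recoded = []
    for i, r in enumerate(responses):
        if r is None:
            recoded.append(None)
        elif i in reverse_items:
            recoded.append(min_resp + max_resp - r)   # reverse code
        else:
            recoded.append(r)

    answered = [r for r in recoded if r is not None]
    if len(answered) < len(responses) / 2:
        return None                                    # too many missing items

    mean_item = sum(answered) / len(answered)
    return 100 * (mean_item - min_resp) / (max_resp - min_resp)


# Example: a five-item subscale with item 4 negatively worded and item 2 missing.
print(score_scale([4, None, 5, 2, 3], reverse_items={3}))   # 75.0
```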
3 DEVELOPMENT AND VALIDATION OF HRQOL MEASURES 3.1 Development The effort and technical expertise required to develop a new instrument is generally underestimated, with most efforts taking three to five years (or more) rather than the couple of weeks initially expected. Researchers contemplating this step should research the existing instruments as well as the various methodologies for instrument development and validation including traditional psychometric theory and item response theory. To fully develop an instrument from the beginning requires multiple studies, hundreds of observations, years of testing and refinement. 3.2 Validation There are numerous procedures for establishing the psychometric properties of an instrument. For a formal presentation, the reader is referred to one of the many available books. A partial list specific to HRQoL includes Streiner and Norman (31), McDowell and Newell (23), Juniper et al. (21) and Naughton et al. (27), Frank-Stromborg and Olsen (11), Staquet, Hays, and Fayers (30) and Fayers and Machin (9). The validity of a measure in a particular setting is the most important and the most difficult aspect to establish. This is primarily because HRQoL is an unobservable latent variable and there are no gold standards against which the empirical measures of validity can be compared. Nonetheless, we can learn a good deal about an instrument by examining the instrument itself and the empirical information that has been collected. For example, we can demonstrate that
the measure behaves in a manner that is consistent with what we would expect and correlates with observable things that are believed to be related to HRQoL. Face validity refers to the content of an instrument: Does the instrument measure what it proposes to measure? and Are the questions comprehensible and without ambiguity? The analogy is whether an archer has chosen the intended target. The wording of the questions should be examined to establish whether the content of the questions is relevant to the population of interest. Although experts (physicians, nurses) may make this evaluation, it is advisable to verify the face validity with patients as they may have a different perspective. Criterion validity is the strength of a relationship between the scale and a gold standard measure of the same construct. As there is no gold standard for the dimensions of quality of life, we rely on the demonstration of construct validity. This is the evidence that the instrument behaves as expected and shows similar relationships (convergent validity) and the lack of relationships (divergent validity) with other reliable measures for related and unrelated characteristics. Confirmatory factor analysis structural equation modeling is one of the statistical methods used to support the construct validity or proposed structure (subscales) of an instrument. Application may be used to confirm that a scale is unidimensional. Results from exploratory factor analysis in selected populations should be interpreted cautiously especially when the sample is homogeneous with respect to stage of disease or treatment (8). The next question is: Would a subject give the same response at another time, if they were experiencing the same HRQoL? This is referred to as test–retest reliability. If there is a lot of variation (noise) in responses for subjects experiencing the same level of HRQoL, then it is difficult to discriminate between subjects who are experiencing different levels of HRQoL or change in HRQoL over time. This is generally measured using Pearson or intraclass correlations when the data consists of two assessments. Finally, we ask: Does the instrument discriminate among subjects who are experiencing different levels of HRQoL? and Is the instrument sensitive to
changes that are considered important to the patient? These characteristics are referred to as discriminant validity and responsiveness. Reliability can be characterized using the analogy of the archer’s ability to hit the same target repeatedly with consistency. Internal consistency refers to the extent to which items in the same scale (or subscale) are interrelated; specifically the extent to which responses on a specific item increase as the responses to other items on the scale increase. Cronbach’s coefficient α is typically reported. For the assessment of group differences, values above 0.7 are generally regarded as acceptable though values above 0.8 (good) are often recommended. For assessment of individual patients in clinical practice, it is recommended that the value should be above 0.9. Responsiveness is the ability of a measure to detect changes that occur as the result of an intervention (16,17). Here the analogy is whether the archer can respond to change and hit various areas on the target consistently. One factor that can affect responsiveness is a floor or ceiling effect. If responses are clustered at either end of the scale, it may not be possible to detect change due to the intervention. 3.3 Translation/Cross-Cultural Validation When HRQoL is measured in diverse populations, attention needs to be paid to the methods of translation and cross-cultural validation. Backward and forward translations must be performed using the appropriate native language at each step. There are numerous examples where investigators have found problems with certain questions as questionnaires are validated in different languages and cultures. Techniques such as cognitive testing, with subjects describing verbally what they are thinking as they form their responses, have been very valuable when adapting a questionnaire to a new language or culture. This should be followed by formal validation studies designed to generate both standard reliability and validity statistics. Item response theory (IRT) methods and Rasch models have facilitated the examination of differential item functioning across cultures or languages.
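As a concrete companion to the reliability statistics discussed above (Cronbach's coefficient alpha for internal consistency, and Pearson or intraclass correlations for test-retest reliability), the sketch below computes alpha for a small item-response matrix and a Pearson correlation for two assessments. The data are invented for illustration only, and a real analysis would normally rely on a validated statistical package (and typically an intraclass correlation for test-retest reliability) rather than these hand-rolled formulas.

```python
from statistics import mean, pvariance


def cronbach_alpha(items):
    """Cronbach's alpha for `items`: a list of k item-response lists (same subjects).

    alpha = k/(k-1) * (1 - sum of item variances / variance of the total score).
    """
    k = len(items)
    totals = [sum(subject) for subject in zip(*items)]
    item_var_sum = sum(pvariance(item) for item in items)
    return k / (k - 1) * (1 - item_var_sum / pvariance(totals))


def pearson_r(x, y):
    """Pearson correlation, used here as a simple test-retest reliability estimate."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Invented responses: 4 items answered by 6 subjects (rows = items, columns = subjects).
items = [[3, 4, 2, 5, 4, 3],
         [4, 4, 3, 5, 5, 3],
         [2, 3, 2, 4, 4, 2],
         [3, 5, 2, 4, 4, 3]]
print(f"Cronbach's alpha: {cronbach_alpha(items):.2f}")

# Invented test-retest scale scores for the same 6 subjects.
time1 = [62, 75, 50, 88, 81, 56]
time2 = [65, 72, 55, 90, 78, 60]
print(f"Test-retest r:    {pearson_r(time1, time2):.2f}")
```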
3.4 Item Banking and Computer-Adaptive Testing

There has been considerable effort over the last decade to develop item banks, large databases of responses to individual questions from multiple questionnaires. One objective of this effort is to establish a method of translating results obtained on one scale to another scale. The second application, referred to as computer-adaptive testing, is an attempt to reduce the subject burden during testing by selectively presenting questions that will best discriminate in the range of function of that subject. Specifically, if an individual's response to the first question indicates a high level of functioning, that individual will not be presented with questions designed to discriminate among individuals with low levels of functioning. In both cases, IRT methods play a predominant role.
4 USE IN RESEARCH STUDIES
Implicit in the use of measures of HRQoL, in clinical trials and in effectiveness research, is the concept that clinical interventions such as pharmacologic therapies, can affect parameters such as physical function, social function, or mental health. (37)
All principles of good design and analysis are applicable, but there are additional requirements specific to HRQoL. These include selection of an appropriate measure of HRQoL and the conduct of an assessment to minimize any bias. The HRQoL instruments should be selected carefully, ensuring that they are appropriate to the research question and the population under study. New instruments and questions should be considered only if all other options have been eliminated. Among the most common statistical problems are multiple endpoints and missing data. 4.1 Instrument Selection Ware et al. (34) suggest two general principles to guide the selection of instruments to discriminate among subjects or detect change
in the target population. ‘‘When studying general populations, consider using positively defined measures. Only some 15% of general population samples will have chronic physical limitations and some 10 to 20% will have substantial psychiatric impairment. Relying on negative definitions of health tells little or nothing about the health of the remaining 70 to 80% of general populations. By contrast, when studying severely ill populations, the best strategy may be to emphasize measures of the negative end of the health status continuum.’’ In cases where the population is experiencing periods of health and illness, very careful attention must be paid to the selection of the instrument balancing the ability to discriminate among subjects during different phases of their disease and treatment with appropriateness over the length of the study. One cannot assume that a questionnaire that works well in one setting will work well in all settings. For example, questions about the ability to perform the tasks of daily living, which make sense to individuals who are living in their own homes, may be confusing when administered to a patient who has been in the hospital for the past week or is terminally ill and receiving hospice care. Similarly, questions about the amount of time spent in bed provide excellent discrimination among ambulatory subjects, but not among hospitalized patients. There is a temptation to pick an HRQoL instrument, become familiar with it, and use it in all circumstances. Flexibility must be maintained in the choice of instrument to target the specific research or clinical setting, the specific population, the challenges associated with administration, and the problem of respondent burden. 4.2 Multiple Endpoints Because QoL is a multidimensional concept that is generally measured using several scales that assess functional, physical, social, and emotional well-being, there are multiple endpoints associated with most QoL evaluations. Longitudinal data arise in most HRQoL investigations because we are interested in how a disease or an intervention affects an individual’s well-being over time.
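One way to tame this multiplicity, elaborated in the next paragraph, is to collapse each subject's repeated assessments into a single summary measure per domain, such as the area under the score-versus-time curve (AUC) or an average rate of change. The sketch below computes a trapezoidal AUC and a least-squares slope for one hypothetical trajectory; the assessment times and scores are invented, and in a real trial these per-subject summaries would then be compared between treatment arms, with attention to missing assessments.

```python
def trapezoid_auc(times, scores):
    """Area under a score-versus-time curve by the trapezoidal rule.

    Dividing the AUC by the follow-up length expresses it as a time-averaged score.
    """
    return sum((t1 - t0) * (s0 + s1) / 2
               for t0, t1, s0, s1 in zip(times, times[1:], scores, scores[1:]))


def ls_slope(times, scores):
    """Ordinary least-squares slope: the average rate of change per unit time."""
    n = len(times)
    mt, ms = sum(times) / n, sum(scores) / n
    num = sum((t - mt) * (s - ms) for t, s in zip(times, scores))
    den = sum((t - mt) ** 2 for t in times)
    return num / den


# Hypothetical 0-100 HRQoL scores at baseline and 3, 6, and 12 months.
months = [0, 3, 6, 12]
scores = [70, 62, 66, 75]

auc = trapezoid_auc(months, scores)
print(f"AUC over 12 months:     {auc:.1f} score-months")
print(f"Time-averaged score:    {auc / months[-1]:.1f}")
print(f"Average rate of change: {ls_slope(months, scores):.2f} points/month")
```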
Because of the multidimensional nature of HRQoL and repeated assessments over time, research objectives need to be explicitly specified and an analytic strategy developed for handling multiple endpoints. Although adequate for univariate outcomes such as survival, statements such as ‘‘To compare the quality-of-life of subjects on treatments A and B’’ are insufficient; details should include domains, population, and the time frame relevant to the research questions. Strategies addressing the multiplicity of endpoints include limiting confirmatory analyses, construction of summary measures/statistics (6) and multiple comparison procedures. Examples of summary measures that reduce multiplicity over time include area-underthe-curve (AUC) and average rates of change (slope); their interpretation is straightforward. Construction of these measures is complicated by the presence of missing data. QoL indices are used to reduce the multiplicity across domains. 4.3 Missing Data Although analytic strategies exist for missing data, their use is much less satisfactory than initial prevention. Some missing data, such as that due to death, is not preventable; however, missing data should be minimized at both the design and implementation stages of a clinical trial (7,26,41). The protocol and training materials should include specific procedures to minimize missing data. A practical schedule with HRQoL assessments linked to planned treatment or follow-up visits can decrease the number of missing HRQoL assessments. When possible, it is wise to link HRQoL assessments with other clinical assessments. 5 INTERPRETATION/CLINICAL SIGNIFICANCE All new measures take time to become useful to clinicians or patients. This process requires that we define ranges of values that have clinical implications. When measures such as hemoglobin and blood pressure were first used, there was a period during which normal ranges were established; once the ranges
were available, the readings became clinically useful. Nor are the rules that have been developed simple, since the benefits/risks of a change in either measure depends on where the individual started, age, gender, and current condition or lifestyle (pregnancy or about to run a marathon). Interpretation of measures of QoL is similarly complex. Clinical significance has various meanings depending on the setting. When a treatment decision is required, there is an implied ordering of information into categories (often dichotomous) that correspond to various decisions. For example, one might consider a patient’s current hemoglobin as well as gender and clinical history when making a decision to treat, monitor closely, or do nothing. In contrast, when the situation calls for evaluation of effectiveness of an intervention based on the information from a randomized clinical trial, the decision is generally based on continuous or ordered information such as grams/dL of hemoglobin. Thus, meaningful differences/changes in QoL measures will depend on whether a decision is being made for an individual or for a group of individuals. There are two general strategies used to define the clinical significance of QoL scores, distribution-based and anchor-based methods. There is no single approach that is appropriate to all settings and none of the methods is without some limitations. 5.1 Distributional Methods One general approach is based on the distribution of scores expressed as the relationship (ratio) between the magnitude of an effect and a measure of variability (3,18). The magnitude of the effect may be either the difference between two groups or the change within a group. Measures of variability include the standard deviation of a reference group, the standard deviation of change, and the standard error of measurement. A distributional method was used by Cohen (4) in his criteria for meaningful effect or ‘‘effect size’’ in psychosocial research. The major advantage is that values are relatively easy to generate from validation studies or clinical trials. There are a number of limitations. Many clinicians are unfamiliar with ‘‘effect size’’ and skeptical about defining meaningful
differences solely on the basis of distributions. These values are generally applicable to groups. Measures of variability can differ across studies being affected by the selection criteria, which can influence the heterogeneity of the sample. Finally, one still needs to make a decision about the size of the effect that is relevant in any particular setting, requiring a value judgment of risks and benefits. 5.2 Anchor-Based Methods Anchor-based methods are based on the relationship between scores on the QoL measure and an independent measure or anchor. Examples of anchors are the patient’s rating of health, disease status, and treatments with known efficacy. The anchor must be interpretable and there needs to be an appreciable association of the anchor with QoL. Within this group of methods, there are numerous approaches, none of which fits all needs. One concern is that the motivation for QoL measurement is to move beyond traditional clinical endpoints, but we appear to be using these same clinical endpoints to ‘‘justify’’ and interpret QoL measures. One approach is to classify subjects into groups based on the anchor, and estimate differences in the QoL measures. For example, one might form three groups based on function corresponding to no, moderate, or severe limitations and observe average scores of 80, 70, and 50 respectively. The mirror image of that approach is to classify subjects using QoL measures and describe the outcomes in terms of either an external or internal anchor. In the first case, one might observe that a group of patients with a mean score of 80 experience 5% mortality, while another group of patients with a score of 60 experience 20% mortality. In the latter case, one might observe that 32% of those who score 50 on the SF-36 physical function scale can walk a block without difficulty, in contrast to 50% who score 60. Another approach is to elicit a value, a minimum important difference (MID) from clinicians or patients; that is, the smallest difference in the scores that is perceived as important, either beneficial or harmful, and which would lead the clinician to consider
a change in the patient's management (20). Within-patient transitions are yet another approach that has been used. Individuals are asked to judge, over a specified period, whether they have improved, not changed, or gotten worse. The corresponding changes in QoL scores are then summarized within each group. The advantage of this approach is that it is easy to administer and appears simple to interpret. However, there is accumulating evidence that the retrospective assessment reflects the subject's current QoL rather than the change.

6 CONCLUSIONS
In the health sciences, QoL assessment is now an integral component of patient-focused research. Given the increasing complexity of health care, the extent of chronic illness, and the variety of therapies that generally do not improve survival but often only decrease morbidity, measurement of HRQoL outcomes provides an additional evaluation of treatment benefit. These developments have occurred in the latter half of the twentieth century, a period in which individual preferences and autonomy have been increasingly valued in many societies, especially in Europe and North America. In parallel, reliable and valid measurement strategies have evolved from social science research to make it possible to quantify subjective assessments of health status and QoL. Further, advances in statistical methodology have been integrated into research designs, making it possible to interpret these assessments in a variety of research and clinical settings. A major aspect of this work has been to bridge the gap between psychometric/statistical theory and the language and realities of clinical practice. There are many who remain skeptical about the contributions of QoL assessments to treatment decisions and health care policies; however, the more these measurements are integrated into research, the greater the likelihood that the outcomes that matter to patients will ultimately be incorporated into medical care (13). Many of the studies that have failed to detect changes have suffered from nonignorable missing data and the use of inappropriate analytic methods. Therefore,
statisticians have an important role to play in the design and analysis of studies with QoL outcomes, if the studies are to produce interpretable results. In this article, we have provided a perspective on where we are in QoL assessment, as well as an honest evaluation of some of the limitations of this methodology. Much additional work needs to be done, and fortunately there is considerable international interest in addressing the challenges of this young measurement science.

REFERENCES

1. Aaronson, N. K., Ahmedzai, S., Bergman, B., Bullinger, M., Cull, A., Duez, N. J., Filiberti, A., Flechtner, H., Fleishman, S. B. & de Haes, J. C. (1993). The European organization for research and treatment of cancer QLQ-C30: a quality-of-life instrument for use in international clinical trials in oncology, Journal of the National Cancer Institute 85, 365–376.
2. Cella, D. F. & Bonomi, A. E. (1995). Measuring quality of life: 1995 update, Oncology 9, 47–60.
3. Cella, D., Bullinger, M., Scott, C. & Barofsky, I. (2002). Group vs individual approaches to understanding the clinical significance of differences or changes in quality of life, Mayo Clinic Proceedings 77, 384–392.
4. Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd edition. Lawrence Erlbaum Associates, Hillsdale, NJ.
5. EuroQol Group (1990). EuroQol—A new facility for the measurement of health-related quality of life, Health Policy 16, 199–208.
6. Fairclough, D. L. (1997). Summary measures and statistics for comparison of quality of life in a clinical trial of cancer therapy, Statistics in Medicine 16, 1197–1209.
7. Fairclough, D. L. & Cella, D. F. (1996). Eastern cooperative oncology group (ECOG), Journal of the National Cancer Institute Monographs 20, 73–75.
8. Fayers, P. M. & Hand, D. J. (1997). Factor analysis, causal indicators and quality of life, Quality of Life Research 6, 139–150.
9. Fayers, P. M. & Machin, D. (2000). Quality of Life: Assessment, Analysis and Interpretation. John Wiley & Sons, UK, Chapters 3–7.
10. Feeny, D., Furlong, W., Barr, R. D., Torrance, G. W., Rosenbaum, P. & Weitzman, S. (1992). A comprehensive multiattribute system for classifying the health status of survivors of childhood cancer, Journal of Clinical Oncology 10, 923–928.
11. Frank-Stromborg, M. & Olsen, S. (1997). Instruments for Clinical Health-care Research. Jones and Bartlett, Boston.
12. Ganz, P. A. (1990). Methods of assessing the effect of drug therapy on quality of life, Drug Safety 5, 233–242.
13. Ganz, P. A. (2002). What outcomes matter to patients: a physician-researcher point of view, Medical Care 40, III11–III19.
14. Glasziou, P. P., Simes, R. J. & Gelber, R. D. (1990). Quality adjusted survival analysis, Statistics in Medicine 9, 1259–1276.
15. Goldhirsch, A., Gelber, R. D., Simes, R. J., Glasziou, P. & Coates, A. S. (1989). Costs and benefits of adjuvant therapy in breast cancer: a quality-adjusted survival analysis, Journal of Clinical Oncology 7, 36–44.
16. Guyatt, G. H., Deyo, R. A., Charlson, M., Levine, M. N. & Mitchell, A. (1989). Responsiveness and validity in health status measurement: a clarification, Journal of Clinical Epidemiology 42, 403–408.
17. Guyatt, G. H., Kirshner, B. & Jaeschke, R. (1992). Measuring health status: What are the necessary measurement properties? Journal of Clinical Epidemiology 45, 1341–1345.
18. Guyatt, G. H., Osoba, D., Wu, A. W., Wyrwich, K. W. & Norman, G. R. (2002). Methods to explain the clinical significance of health status measures, Mayo Clinic Proceedings 77, 371–383.
19. Guyatt, G. H., Patrick, D. & Feeny, D. (1991). Glossary, Controlled Clinical Trials 12, 274S–280S.
20. Jaeschke, R., Singer, J. & Guyatt, G. H. (1989). Measurement of health status. Ascertaining the minimal clinically important difference, Controlled Clinical Trials 10, 407–415.
21. Juniper, E. F., Guyatt, G. H. & Jaeschke, R. (1996). How to develop and validate a new health-related quality of life instrument, in Quality of Life and Pharmacoeconomics in Clinical Trials, B. Spilker, ed. Lippincott-Raven Publishers, Philadelphia, pp. 49–56.
22. Kaplan, R. & Bush, J. (1982). Health-related quality of life measurement for evaluation research and policy analysis, Health Psychology 1, 61–80.
23. McDowell, I. & Newell, C. (1996). Measuring Health: A Guide to Rating Scales and Questionnaires. Oxford University Press, New York.
24. McNeil, B. J., Weichselbaum, R. & Pauker, S. G. (1981). Speech and survival: tradeoffs between quality and quantity of life in laryngeal cancer, The New England Journal of Medicine 305, 982–987.
25. Miller, G. A. (1956). The magic number seven plus or minus two: some limits on our capacity for information processing, Psychological Bulletin 63, 81–97.
26. Moinpour, C. M., Feigl, P., Metch, B., Hayden, K. A., Meyskens, F. L. Jr. & Crowley, J. (1989). Quality of life end points in cancer clinical trials: review and recommendations, Journal of the National Cancer Institute 81, 485–495.
27. Naughton, M. J., Shumaker, S. A., Anderson, R. T. & Czajkowski, S. M. (1996). Psychological aspects of health related quality of life measurement: test and scales, in Quality of Life and Pharmacoeconomics in Clinical Trials, B. Spilker, ed. Lippincott-Raven Publishers, Philadelphia, pp. 117–132.
28. Patrick, D. L. & Deyo, R. A. (1989). Generic and disease-specific measures in assessing health status and quality of life, Medical Care 27, S217–S232.
29. Patrick, D. L. & Erickson, P. (1993). Health Status and Health Policy: Quality of Life in Health Care Evaluation and Resource Allocation. Oxford University Press, New York.
30. Staquet, M. J., Hays, R. D. & Fayers, P. M. (1998). Quality of Life Assessment in Clinical Trials: Methods and Practice Part II. Oxford University Press, Oxford, New York, Chapters 2–5.
31. Streiner, D. L. & Norman, G. R. (1995). Health Measurement Scales: A Practical Guide to their Development and Use. Oxford University Press, Oxford, New York.
32. Torrance, G. W., Thomas, W. H. & Sackett, D. L. (1971). A utility maximizing model for evaluation of health care programs, Health Services Research 7, 118–133.
33. Tsevat, J., Weeks, J. C., Guadagnoli, E., Tosteson, A. N., Mangione, C. M., Pliskin, J. S., Weinstein, M. C. & Cleary, P. D. (1994). Using health-related quality-of-life information: clinical encounters, clinical trials, and health policy, Journal of General Internal Medicine 9, 576–582.
34. Ware, J. E. Jr., Brook, R. H., Davies, A. R. & Lohr, K. N. (1981). Choosing measures of health status for individuals in general populations, American Journal of Public Health 71, 620–625.
35. Ware, J. E. Jr. & Sherbourne, C. D. (1992). The MOS 36-item short-form health survey (SF-36). I. Conceptual framework and item selection, Medical Care 30, 473–483.
36. Wiklund, I., Dimenas, E. & Wahl, M. (1990). Factors of importance when evaluating quality of life in clinical trials, Controlled Clinical Trials 11, 169–179.
37. Wilson, I. B. & Cleary, P. D. (1995). Linking clinical variables with health-related quality of life. A conceptual model of patient outcomes, JAMA 273, 59–65.
38. World Health Organization (1958). The First Ten Years of the World Health Organization. World Health Organization, Geneva.
39. World Health Organization. Constitution of the World Health Organization. Basic Documents 48. World Health Organization, Geneva.
40. Yabroff, K. R., Linas, B. P. & Schulman, K. (1996). Evaluation of quality of life for diverse patient populations, Breast Cancer Research and Treatment 40, 87–104.
41. Young, T. & Maher, J. (1999). Collecting quality of life data in EORTC clinical trials–what happens in practice? Psychooncology 8, 260–263.
QUERY MANAGEMENT: THE ROUTE TO A QUALITY DATABASE
LARRY A. HAUSER and ELLEN MORROW
Quintiles Transnational Corp., Overland Park, Kansas

In light of a recently overheard comment by a senior regulator that "data cleaning activities still remain a mystery," publication of this article is clearly timely. Topics covered will include the various methods of data capture and validation, along with ways to identify and address data inconsistencies. The need for accurate recording and processing of patient data is fundamental to any clinical trial. If data stored in the database are incorrect, then conclusions of the analyses may also be incorrect. Checking of data quality is an ongoing process during data collection. The primary goal of the query management process is to achieve a high-quality database, which is essential for meeting the objectives of the company's protocol as well as the requirements of regulatory agencies such as the United States Food and Drug Administration (FDA) and the European Medicines Agency (EMEA) (1). High-quality data can then be used as a basis for additional analysis and reporting, whether intended for a regulatory filing or a peer-reviewed publication.

1 THE IMPORTANCE OF HIGH-QUALITY DATA

The importance of high-quality data is underlined by guidelines for Good Clinical Practice (GCP), which is the international ethical and scientific quality standard for designing, recording, and reporting trials that involve human subjects. GCPs are endorsed by bodies such as the International Conference on Harmonization (ICH) (2,3) and EMEA (4). ICH specifies that it is the responsibility of the investigator to ensure the accuracy, completeness, legibility, and timeliness of delivery of the data. Data management staff cannot compensate entirely for shortfalls in practices at the investigator site, which can also substantially increase costs by yielding "dirty" data. ICH notes that "quality control should be applied to each stage of data handling to ensure that all data are reliable and have been processed correctly" (5). Similarly, EMEA states that use of GCP "provides public assurance that the rights, safety and well being of trial subjects are protected . . . and that the clinical trial data are credible" (4).

1.1 Scope

To achieve a high-quality database, information stored in the clinical database must match the source data at the investigative site. This task is the focus of data management personnel, who work closely with clinical monitors to ensure consistency. Where questions develop about data accuracy, a well-defined query management process, sometimes referred to as data cleaning, is essential to ensure that requirements for analysis and reporting are met. This article describes the query management process that is used along the "route to a quality database." This process includes identifying the data that seem to be suspicious in accuracy or completeness and describing what is done at the investigator site and within DM to address such inconsistent data. The article will highlight the importance of the Data Validation Plan and what components should be included. Figure 1 depicts these key elements of the query management process, and Table 1 provides definitions that are commonly used to describe the query management process.

1.2 Paper-Based Versus Electronic Data Capture

Many different methods of data capture can be used in clinical trials. This article focuses on query management for data obtained using two of these methods: paper-based data entry by DM from a CRF and data captured at the investigator site via an EDC system.
Figure 1. Key elements of the query management process. [Flow diagram linking the investigator site and clinical monitor through the validation plan, edit checks, identifying and addressing data inconsistencies, answering queries, updating the database, and changing roles.]
Table 1. Query Management Definitions

Case Report Form (CRF): A questionnaire specifically used in a clinical trial; the primary data collection tool from the investigator site.
eCRF: The electronic version of a case report form in an EDC system; the data entry screens for an EDC study.
Data capture methods: Validated methods used in a clinical trial to input data into the database; may be data entered by DM staff, by staff at the investigator site, or electronically transferred from another source. Data capture methods include: data entry (single or double) by DM from the CRF; loading data from an external source (e.g., central lab, medical device, etc.); fax-based data capture; optical character recognition; bar codes; data entry at the investigator site via an EDC system; and electronic diaries.
Data Clarification Form (DCF): The form used in a paper-based study to present the investigator with a query, which is designed to obtain a clear and definitive resolution.
Data inconsistency: Data contained in the clinical database that are "suspicious" in terms of their accuracy; may indicate that the data were erroneously captured, or are missing, illegible, logically inconsistent, out of range, or even fraudulent. Inconsistencies are often referred to as discrepancies for paper-based studies.
Data integrity: Data that are consistent with all source data documentation, are maintained during transfer, storage, and retrieval without change or corruption, and are consistent, valid, and accessible.
Data Management (DM): The function within the clinical trial sponsor or Contract Research Organization responsible for delivering a final database.
Data quality: The extent to which the database reflects source documents and meets high standards for accuracy, currency, completeness, and consistency.
Data validation: The process used to ensure that the clinical data are accurate and valid.
eSource: Data captured without a paper source.
Query: A question that is addressed to the investigator to clarify a data inconsistency.
Query resolution: The process of submitting queries to the investigator and receiving an "answer," either in the form of a returned DCF (paper studies) or system query closure and database update if required (EDC).
The query management approach can be similar for each. However, EDC offers several key advantages:
• Data inconsistencies are identified earlier, often at the time of entry.
• Usually far fewer data inconsistencies exist.
• The inconsistencies that are identified can be addressed by a member of the investigator site staff who is familiar with the data.

Throughout most of the history of clinical trials, paper-based CRFs have been the primary device used to communicate the results of clinical trials (often transcribed from source documents) from the investigator site to a centralized sponsor company or Contract Research Organization (CRO), where the data are then entered and processed. This happens either on an ongoing basis or batched at the end of the trial. The data are entered and prepared for analysis and reporting. Up to this point, the transition from paper-based recording of clinical trials to EDC has been gradual. However, as Deborah Borfitz writes on Bio-ITWorld.com (6), "explosive growth in electronic data capture (EDC) adoption this year (2007) helps qualify the technology as the 'most disruptive . . . since the introduction of the personal computer itself,'" according to an EDC Spending Forecast and Analysis by Christopher Connor of Health Industry Insights (7). Borfitz's article, "Connor: 2007 Will Be Tipping Point for EDC," says that in 2007 the EDC adoption growth rate was forecast to accelerate from 6.5% to 13.3%. "By year end, nearly half of all new Phase I–III studies will use EDC, the report predicts." A significant driver for adoption of EDC is the need to improve efficiencies and reduce time to database lock, with the potential to reduce the time needed to bring a new medicine to market. As an interactive, web-based approach, EDC can decrease clinical trial costs significantly. The earlier a data inconsistency is identified within the clinical trial process, the more efficient and cost-effective the trial will be, which makes EDC
studies a highly efficient method of data capture (3).

2 DATA VALIDATION PLAN
The query management process starts with defining a data validation plan (DVP), which documents the approach used to ensure high-quality data. The DVP should include descriptions of: (1) edit specifications, (2) definitions of self-evident corrections (for paper-based studies only), and (3) project-specific processing guidelines. Another plan, the clinical monitoring plan, needs to work in parallel with the DVP.

2.1 Edit Specifications

An edit specification document describes all checks that are planned to be run against the data during the course of the clinical trial. Data considered to be critical for the study analysis, such as primary safety or efficacy data, will be examined most closely, with a complete set of checks (8). Other data considered to be less important (e.g., information on patients' height) may not be checked beyond an initial confirmation of completeness. Edit specification documents include two categories of checks:

1. Programmed/automatic checks. Most checks that are carried out are programmed or automatic, and these account for at least 80% of checks on Phase II–III data. Exceptions may include very small studies, in which it is more efficient to check the data manually. Programmed/automatic checks are included in the EDC system or Clinical DM System (CDMS) and run against the data to identify any inconsistencies. This category includes:
• Range checks or screen checks, which are used to detect data that are outside of a predefined range (e.g., "subject should be between 18 and 65 years of age"). Range checks include missing data and unexpected or impossible data, such as future dates or a blood pressure of 8.
• Logic checks, which are used to compare one data value to another (e.g., if a date of informed consent is after the first visit date, or if a physical exam is described as abnormal but no abnormality is provided).

2. Manual checks. Some checks cannot be programmed easily because of their complexity and the time it would take to program them, or because the check involves the review of free text. Checks of this nature are referred to as manual checks, as they are not programmed directly into the CDMS or EDC system. To perform checks of these data, listings are programmed outside of the CDMS or EDC system and reviewed by DM staff. An example of this technique would be the comparison of central laboratory data to the CRF/eCRF data to ensure internal consistency. Although "unplanned" manual checks of the data can be performed throughout the trial, it is important to include any planned manual checks in the edit specification document.

Most items in an edit specification document are designed to allow the DM team to identify which data fields contain suspicious data. The document includes query text, which defines the specific question for the investigator. For an EDC trial, this text is presented immediately on entry of the data at the investigator site; for paper-based trials, it is presented much later in the process, usually on a DCF. Query text needs to be specific enough to explain to clinical site staff exactly why the data did not meet the specified criteria, but it must not guide them toward a particular answer. For example, query text should state: "Value is outside of range," and not: "Is the value a 7?" Table 2 provides an example of part of an edit specification document.

2.2 Self-Evident Corrections

A self-evident correction is a guideline which:
• Only applies to paper-based studies (these types of corrections are covered by autoqueries in an EDC study)
• Allows DM staff to make self-evident modifications to the database as outlined in the Sponsor/CRO Standard Operating Procedures (SOPs)
• Has been endorsed by the investigator
• Does not require the issuance of a query

Examples of self-evident corrections include:
• If a "Yes/No" box is answered "No" or is left blank, but legitimate data exist in the associated field, the box may be changed to "Yes"
• For concomitant or previous medications, if any information is recorded in the wrong column on the form, the information may be moved to the correct column

Self-evident corrections should not be applied where the intended meaning of the data is not clear.

2.3 Project-Specific Processing Guidelines

For an EDC study, project-specific processing guidelines may include a determination of whether answers to manual queries will be reviewed internally. For paper-based studies, specifications for applying self-evident corrections and details regarding the use of the Data Clarification Form (e.g., one inconsistency per DCF and the method and frequency of DCF distribution) are documented in the project-specific processing guidelines section of the Data Validation Plan.

2.4 Source Document Validation/Verification

Source Document Validation/Verification (SDV), an integral part of the data validation procedures required by GCP, involves a comparison with source documents to ensure that all data recorded at the investigator site are reported correctly and completely. This process confirms accurate transcription from source to CRF or eCRF and is typically performed by the clinical monitor.
Table 2. Example of Part of a Typical Edit Specification Document

DCM/Panel | DCM/Panel Type | Check Type | DCF Text
AE | Adverse Events | Programmed | For AE # \Sequence number\ \AE term\ date of resolution is prior to start date. Please clarify.
AE | Adverse Events | Programmed | For AE # \Sequence number\ \AE term\ action taken is missing. Please provide.
CM | Concomitant Medications | Programmed | Information is recorded for concomitant medication # \Sequence number\ but medication name is missing. Please provide.
CM | Concomitant Medications | Manual | Concomitant Medication is listed with indication, but corresponding AE or Medical History is not listed. Please clarify.
DM | Demography | Programmed | Age is out of the range of 18–65 and inclusion # \Sequence number\ is yes. Please clarify.
DM | Demography | Programmed | Date of birth is missing, invalid or incomplete. Please provide.
DS | End of Study | Programmed | Reason for early termination is provided and subject completed the study. Please clarify.
DS | End of Study | Programmed | Reason for early termination is adverse event/death, but no AEs are recorded with action taken of study drug discontinued. Please clarify.
IV | Investigator Signature | Programmed | Investigator signature date is missing, invalid or incomplete. Please provide.
PE | Physical Exam | Programmed | Body system \PE body area\ is abnormal, yet no abnormality is recorded. Please provide an abnormality or verify that normal or not done should be checked.
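As a rough illustration only, and not drawn from any particular CDMS or EDC product, programmed checks of the kind listed in Table 2 can be thought of as small functions that return the query text when a record fails and nothing when it passes; the field names and record structure below are hypothetical.

from datetime import date

def range_check_age(record):
    # Range check: age must be 18-65 when the age-related inclusion criterion is answered "yes".
    if record.get("inclusion_age_ok") == "yes" and not 18 <= record.get("age", -1) <= 65:
        return "Age is out of the range of 18-65 and inclusion criterion is yes. Please clarify."
    return None

def logic_check_ae_dates(record):
    # Logic check: an adverse event cannot resolve before it starts.
    start, stop = record.get("ae_start_date"), record.get("ae_resolution_date")
    if start and stop and stop < start:
        return "AE date of resolution is prior to start date. Please clarify."
    return None

def run_checks(record, checks=(range_check_age, logic_check_ae_dates)):
    # Run every programmed check and collect query text for any data inconsistencies.
    return [msg for check in checks if (msg := check(record)) is not None]

# This hypothetical record would raise both queries.
queries = run_checks({"inclusion_age_ok": "yes", "age": 70,
                      "ae_start_date": date(2007, 5, 10),
                      "ae_resolution_date": date(2007, 5, 1)})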
3 IDENTIFYING DATA INCONSISTENCIES
The first step in the query management process is to identify data that may have been erroneously captured or are missing. Many data inconsistencies are "caught" by programmed/automatic checks; others are identified through manual review of listings.

• Programmed/automatic checks. In an EDC study, many data inconsistencies are identified at the time the investigator site staff enters the data into the software (some EDC systems refer to these as "autoqueries"). For paper studies, the edit programs are run in batches within the CDMS and display the resulting data inconsistencies, which are then reviewed by DM staff.
• Manual/Review of Listings. Many ways exist to identify suspicious data "manually." Each function represented in the clinical trial team reviews the data from a different perspective:
  • Preplanned manual checks: These checks have been documented in the edit specification document; typically, this review is performed by DM staff.
  • Ad-hoc review of data: Reviews of the data occur continuously throughout the clinical trial by DM staff, investigator site staff, or the clinical monitor, often during SDV.
  • Statistical analysis and report generation: As the biostatistician runs the analysis programs or the medical writers perform reviews of the listings and tables, data inconsistencies can be identified. Finding data errors at this point in the clinical trial process is the most costly and least efficient.
4 ADDRESSING DATA INCONSISTENCIES
Various approaches are used to address data inconsistencies, depending on how the inconsistency was identified.
• Inconsistencies identified using programmed/automatic checks. For EDC-based studies, identification of inconsistencies at the time of data entry allows rapid resolution. These inconsistencies may reflect data entry errors or mistakes when the data are transcribed from a source document; it is a key advantage that these errors can be corrected by knowledgeable clinical site staff who are familiar with the data (termed "heads-up data entry"). For most EDC processes and systems, as long as the corrected value meets the programmed criteria, the inconsistency may be closed permanently and requires no additional action by DM staff. For paper-based studies, each data inconsistency identified via the programmed edit checks is displayed in the CDMS and must be reviewed by DM staff. Figure 2 shows the decision tree that is applied in reviewing the inconsistency to determine what action is required:
  • Is the inconsistency a result of a data entry error? If yes, then the DM staff can correct the error in the database and no additional action is required.
  • Can a self-evident correction be applied to the inconsistency? If yes, then staff can correct the database with the self-evident correction and no additional action is required.
  • If neither of the above applies, then the staff member must create a query to be sent to the site.
• Inconsistencies identified using manual checks. For both paper and EDC studies, addressing inconsistencies identified by manual review is a more labor-intensive task than addressing those identified via programmed/automatic checks. The query must be created in the EDC system (by either DM staff or the clinical monitor) or CDMS, including writing and entering text to explain the purpose of the query to investigator site staff.
Figure 2. Process for addressing data inconsistencies for a paper-based study. [Decision-tree flowchart: for each inconsistency, correct the database if there is a data entry error or an applicable self-evident correction; resolve the inconsistency in the CDMS with no action if other data on the CRF indicate the value is correct; otherwise determine the most appropriate inconsistency for the site to address and create and send a DCF to the investigator site.]
5 QUERY RESOLUTION
For paper-based studies, the investigator site staff answers the query directly on the DCF, either by clarifying the original value, changing the original value to a new value (based on source documentation), or confirming that the original value was correct. For EDC studies, the automated queries are normally closed automatically if the site response meets the criteria specified by the relevant edit check. Manual queries are usually closed by the functional group (e.g., DM or clinical monitoring) that created them. The average inconsistency rate for medium complexity paper-based studies is often around one inconsistency per CRF page, with 60–70% normally resolved internally and the rest submitted as queries to investigator site staff (3). Table 3 gives common terms used to define query statuses in the CDMS or EDC system. DM staff view reports of query statuses and work with clinical monitors to manage the query resolution process for a clinical trial.
5.1 Database Update

For paper-based studies, the DCF is then returned to DM staff, who update the database as necessary, or confirm that the original data were correct, and close out the query in the CDMS. Site or clinical DCFs may also be sent to DM as a result of data review at the investigator site. As Fig. 3 shows, a similar process is used to update the database on receipt of the site or clinical DCF, although a query may first be initiated in the CDMS before the database can be updated. For EDC studies, investigator site staff receive the query via the EDC system and can update the database directly with the corrected value or can close out the query by indicating that the original value is correct. Only investigator site staff can change the data in an EDC application; neither DM nor clinical staff have access to change data. In both paper-based and EDC studies, if the response obtained to a query does not satisfy
Table 3. Common Terms Used to Define Query Status

Open: The query has not yet been addressed by investigator site staff.
Answered: Programmed/automated queries: The query was answered by site staff via an EDC system or DCF; however, the response is still outside the limits specified by the relevant edit check. Manual queries: All become "answered" or similar when the site has responded as above.
Resolved/Closed: Automated queries: Depending on the CDMS in use, these automatically close if the site response meets the criteria specified by the relevant edit check or, in paper-based studies, when DM has accepted the change and updated the database as necessary. Manual queries: Typically closed by the functional group that created them, or by DM in paper-based studies, after review of the site's response.
Table 4. A Comparison Between Paper- and EDC-Based Trials for Query Management

Paper-Based Studies | EDC-Based Studies
Send data clarification form (DCF) to site and receive resolution response | Interactive query resolution
Resolution over days or longer | Rapid resolution of data inconsistencies, often hours
Outstanding queries can delay database lock | Potential to significantly decrease time to database lock
Source document is paper at investigator site | Potential eSource; otherwise, the source document is paper
Figure 3. Process for updating the database in a paper-based study. [Flowchart: DCFs received from the site are logged in the CDMS DCF module; the DCF response is reviewed (retrieving the CRF if necessary); if the response answers the question, the database is updated or investigator confirmation is recorded and the query is closed in the CDMS; otherwise a new DCF with more specific query text is created and sent to the investigator site.]
the criteria or satisfactorily answer the question, then it is resubmitted to investigator site staff. Table 4 summarizes the differences in the query management process between paper-based and EDC studies, highlighting some aspects that make EDC a more efficient process. As Fig. 4 depicts, the route to a quality database is shorter for EDC-based studies than for paper-based studies. Given that data entry is performed at the investigator site, by staff who are familiar with the data, steps to review data inconsistencies and make database corrections internally by DM staff are simply not needed.

6 THE CHANGING ROLES OF DM, CLINICAL, AND INVESTIGATOR SITE STAFF

Traditionally, the role of the investigator site staff in the query management process was to record the data onto the paper CRF and answer any queries that were raised by DM staff; the clinical monitor ensured that source documents matched the data recorded on the CRF and facilitated the resolution of queries between DM and the investigator site; and DM identified inconsistencies in the data, queried the investigator site, and updated the database. With increasing numbers of studies that use EDC rather than traditional paper-based data collection, the roles of DM, clinical, and investigator site staff in the query management process are changing. Because fewer data inconsistencies are brought into DM, fewer resources are required on an EDC study. DM will retain the role of producing queries for checks programmed outside of the EDC system (manual review of listings). However, reviewing CDMS-generated inconsistencies, correcting data entry errors, applying self-evident corrections, receiving answered DCFs, and updating databases will not be a part of the DM role for much longer. Investigator site staff not only enter the data in an EDC trial but also can correct many inconsistencies at the time of entry, which makes their role in the query management process much more efficient. The time spent in entering data is quickly compensated for by the absence of DCFs to be answered. The role of the clinical monitor in the query management process shifts in an EDC study as well. The monitor still performs source data verification/validation unless eSource methods are used, but uses the EDC system to generate queries, which are then resolved by the investigator site staff. As the "tipping point" is reached for eSource, responsibilities for confirming data against the source document will disappear, as no paper document will exist to compare against. With the converging of the clinical monitoring and DM roles, the DVP and the clinical monitoring plan will soon be merged as well.

7 THE FUTURE OF QUERY MANAGEMENT

The increase in EDC-based studies illustrates how data quality has been positively affected. This is especially true in the query management area, in which EDC-based systems can resolve data inconsistencies more accurately and efficiently. In addition, the use of eSource can improve patient compliance in trials using electronic diaries. Collaboration with the Clinical Data Interchange Standards Consortium (CDISC) (9), FDA, and others has yielded positive results. With the encouragement of the FDA, CDISC initiated the eSource Data Interchange Group (eSDI) to discuss current issues related to eSource data in clinical trials. The objective of eSDI was to develop a document that "aligns multiple factors in the current regulatory environment, to encourage the use of eSource collection and industry data standards to facilitate clinical research for investigators, sponsors and other stakeholders." This document identifies the benefits of implementing standards for eSource, along with issues that may inhibit its adoption. Never has the old adage "better, quicker, cheaper" been more appropriate for query management. Increasing competitive and regulatory pressures are creating ever-increasing challenges for the pharmaceutical
Figure 4. Overview of the query management process. [Flowchart contrasting EDC and paper studies: in both, data inconsistencies are identified after data entry; in EDC studies the query goes directly to the site, is answered, and the database is updated; in paper studies inconsistencies found by programmed/automatic checks are reviewed by DM, the database is corrected where possible, and remaining queries are sent to the site, returned to DM, and the database is then updated.]
industry. The need to improve quality control, advance new medicines to the market faster, and contain costs is very evident. One discipline in which these pressures have been deeply felt is managing clinical trial data, and at the heart of this arena is query management. Important trends that will continue to be critical in the future include:
• Improved processes (e.g., quality checks embedded throughout). These processes must be optimized, taking into account EDC functionality and the diversity of protocol objectives (e.g., first-in-man trial vs. Phase III pivotal trial vs. late-phase trial).
• EDC systems designed to help facilitate process improvements (e.g., increased use of technology to interface with medical devices, rather than manual data entry). This sets the stage for moving forward with eSource and minimizing dependence on paper documents.
• Immediate access to data for study stakeholders, helping to facilitate earlier decision making.
• Validation plans that proactively incorporate monitoring, analysis, and reporting aspects to ensure the cost effectiveness of the query management process.
• Changing traditional DM and clinical monitoring roles to fully embrace new systems and processes effectively.
In conclusion, today we are on the threshold of a truly exciting future for those involved with query management, one where we can finally deliver a quality database "better, quicker, cheaper."

REFERENCES

1. R. K. Rondel, S. A. Varley, and C. F. Webb, eds., Clinical Data Management. 2nd ed. London: Wiley & Sons, 2000.
2. International Conference on Harmonization, Good Clinical Practice (ICH GCP): consolidated guideline. Fed. Reg. 1997: 62(90).
3. S. Prokscha, Practical Guide to Clinical Data Management. 2nd ed. Boca Raton, FL: CRC Press, Taylor & Francis, 2007.
4. European Medicines Agency. Available: http://www.emea.europa.eu/Inspections/GCPgeneral.html.
5. International Conference on Harmonization. Available: http://www.ich.org/LOB/media/MEDIA482.pdf; section 5.1.3, p. 20.
6. D. Borfitz, Connor: 2007 will be the tipping point for EDC. Bio-IT World May 21, 2007. Available: http://www.bio-itworld.com/archive/eclinica/index05212007.htm.
7. Health Industry Insights, Electronic data capture (EDC) poised to disrupt life sciences industry, says Health Industry Insights. Health Industry Insights May 7, 2007. Available: http://www.healthindustry-insights.com/HII/getdoc.jsp?containerId=prUS20671907.
8. Society for Clinical Data Management, Inc., Good Clinical Data Management Practices (GCDMP), version 3. September 2003. Milwaukee, WI: SCDM.
9. CDISC, Electronic Source Data Interchange (eSDI) Group, Leveraging the CDISC Standards to Facilitate the use of Electronic Source Data within Clinical Trials. V1.0. November 20, 2006. Available: http://www.cdisc.org/eSDI/eSDI.pdf.

FURTHER READING

Food and Drug Administration, 21 CFR Part 11, Electronic records; electronic signatures; rule. Fed. Reg. 1997; 62: 13429–13466.
Food and Drug Administration, Guidance for Industry: Computerized Systems Used in Clinical Trials. Washington, D.C.: FDA, April 1999.

CROSS-REFERENCES

Queries
Data Clarification Form
Data cleaning
Data validation
Edit checks
Edit specification
QUESTION-BASED REVIEW (QBR)
The Office of Generic Drugs (OGD) developed a Question-based Review (QbR) for its Chemistry, Manufacturing, and Controls (CMC) evaluation of Abbreviated New Drug Applications (ANDAs). QbR is an assessment system that is focused on critical pharmaceutical quality attributes, which will concretely and practically assess a sponsor's implementation of the Food and Drug Administration's (FDA) current Good Manufacturing Practices (cGMPs) for the twenty-first century and Quality by Design Initiatives.

OGD intended the QbR implementation process to have goals that were SMART (Specific, Measurable, Acceptable, Realistic, and Timely). As QbR represents a significant change for ANDA sponsors, much effort was focused on education and communication as well as on measuring the response to this outreach. OGD began to publicize its QbR approach in 2005 and has communicated the changes to stakeholders through various workshops, webcasts, meetings, and individual industry talks. In early 2006, OGD posted two example QbR submissions on their webpage. From October to December 2006, OGD gave three workshops on how to prepare an effective Quality Overall Summary (QOS) and provided additional QbR training at the Generic Pharmaceutical Association (GPhA) ANDA Basics Course in May 2007. On April 11, 2007, OGD and GPhA held a webcast meeting to discuss the progress of QbR and to explain in more detail OGD's expectations for submitting quality by design ANDAs.

As of January 2007, OGD encouraged ANDA sponsors to submit a QOS that answered the QbR questions with every ANDA submission. In July 2007, more than 90% of ANDA submissions met this goal. QbR submissions are an essential part of OGD's efforts to improve the efficiency of the review process and to encourage implementation of quality by design. The rapid and nearly complete transition to QbR submissions indicates the commitment of both OGD and ANDA sponsors to the new pharmaceutical quality assessment system.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/ogd/QbR/QBR submissions upd.htm) by Ralph D'Agostino and Sarah Karl.
RANDOMIZATION-BASED NONPARAMETRIC (ANCOVA)
LISA M. LAVANGE and GARY G. KOCH
University of North Carolina at Chapel Hill

1 INTRODUCTION

Analysis of covariance (ANCOVA) is a commonly used analysis strategy that allows for the effects of covariates to be taken into account in estimating differences among randomization groups in a controlled clinical trial (1–3). In cases where the covariates are correlated with the outcome of interest, analysis of covariance will result in a reduction in the variance of the treatment differences estimated from the clinical trial. Analysis of covariance can also provide adjustment for any random imbalances among treatment groups that may have occurred. The covariates for which adjustment is desired are typically defined from measurements taken during a screening period or at a baseline visit before randomization and therefore are referred to as baseline covariates. Popular candidates for covariate adjustment in clinical trials of therapeutic agents are clinical center (in multicenter trials); demographic characteristics, such as age, race, and sex; and measures of baseline disease severity (4).

Analysis of covariance is usually accomplished through parametric procedures that involve specification of a multiple regression model appropriate for the outcome variable being studied. Linear models are used for continuous or intervally scaled outcome variables, logistic models for dichotomous outcomes, and proportional hazards models for time-to-event outcomes. Once a parametric model is selected and the choice of covariates is made, the method for handling the covariates in the model is needed, for example, through linear or categorical effects for age. In clinical trials, the full model specification, including the choice of covariables and specifications for their handling, is typically made before unmasking to avoid bias caused by patterns observed in the data themselves. It may also be of interest to assess interactions between the intervention and the covariates in the context of the parametric analysis of covariance model, although often the power to detect such interactions is low, particularly when the covariate is a class variable with many levels.

Analysis of covariance of categorical or time-to-event outcome variables relies on the large sample properties of maximum likelihood estimates to make valid inferences about the estimated parameters. These properties may not apply if the number of covariates or classes is large or if important covariates are omitted from the model (5,6). Including interactions between intervention effects and the covariates only serves to magnify this problem, which then can result in computational instability of the maximum likelihood estimates of the model parameters. A solution to the dilemmas posed by analysis of covariance via parametric models described above is provided by nonparametric methods for covariate adjustment. Rank analysis of covariance is one such solution (7). Randomization-based methods provide yet another solution and are the topic considered here (8). Briefly, this approach is based on weighted least squares methods to analyze differences between treatment groups with respect to outcome variables and covariables simultaneously, with the restriction that differences in the covariables among treatment groups are zero (i.e., no imbalance occurred). In a randomized study, the expected value of such differences would in fact be equal to zero. Randomization-based nonparametric analysis of covariance is useful in situations where the Cochran–Mantel–Haenszel (CMH) test for dichotomous outcomes and the Wilcoxon Rank Sum test for ordinal outcomes are the analyses of choice, but adjustment for covariates is desired. Among the advantages of randomization-based nonparametric analysis of covariance is the fact that this approach is relatively assumption free. All that is required is the assumption that a valid randomization scheme was implemented in the clinical trial.

The methodology for randomization-based nonparametric analysis of covariance is described in the following section. Data for an example are provided to illustrate the methods in a clinical trial of treatment for a respiratory disorder.

2 METHODS
Let $k = 1, \ldots, n$ index the patients enrolled in a clinical trial, and let $x_k = (x_{k1}, \ldots, x_{kt})'$ denote the $(t \times 1)$ vector of covariates measured before randomization (i.e., baseline covariates). Let $U_{1k} = 1$ if the $k$th patient is randomized to the test treatment group and 0 if the $k$th patient is randomized to the control group. Define $U_{2k} = (1 - U_{1k})$. Let $y_{ik}$ denote the outcome variable for the $k$th patient if that patient receives the $i$th treatment, where $i = 1$ for the test treatment and $i = 2$ for the control. Note that $y_{ik}$ may be dichotomous, ordinal, or continuous. The null hypothesis of no treatment effect is given by:

$$H_0: y_{1k} = y_{2k} \equiv y_{*k}, \quad k = 1, \ldots, n. \qquad (1)$$

That is, all patients enrolled in the trial would be expected to have the same outcome $y_{*k}$ regardless of the treatment received. When $H_0$ is true, the patients enrolled in the clinical trial can be viewed as a finite population, and the $n_i$ patients randomized to the $i$th treatment group can be viewed as a simple random sample selected from the finite population without replacement (9). In this context, the $\{y_{*k}\}$ are fixed constants, and the $\{U_{ik}\}$ are random variables with the following properties:

$$E(U_{ik}) = n_i/n, \quad \mathrm{var}(U_{ik}) = n_1 n_2/n^2, \quad \text{and} \quad \mathrm{cov}(U_{ik}, U_{ik'}) = -n_1 n_2/\{n^2(n-1)\} \ \text{ for } k \neq k'. \qquad (2)$$

Note also that the $\{U_{1k}\}$ are independent of the $\{y_{*k}\}$ and $\{x_k\}$ because of the fact that randomization does not depend on any observed data values. The sample means for each treatment group are defined as

$$\bar{y}_i = \sum_{k=1}^{n} U_{ik} y_{ik}/n_i \quad \text{and} \quad \bar{x}_i = \sum_{k=1}^{n} U_{ik} x_k/n_i, \quad i = 1, 2. \qquad (3)$$
Let $d_y = \bar{y}_1 - \bar{y}_2$ and $d_x = (\bar{x}_1 - \bar{x}_2)$ denote the differences between randomization groups with respect to the outcome variable and vector of covariates, respectively, and define $d = [d_y, d_x']'$. Let $f_{*k} = [y_{*k}, x_k']'$ be the component vector for the outcome variable and covariables for the $k$th patient, and define $\bar{f} = \sum_{k=1}^{n} f_{*k}/n$. Then $V_0$, the variance–covariance matrix for $d$ relative to the randomization distribution under the null hypothesis $H_0$, is given by

$$V_0 = \frac{n \sum_{k=1}^{n} (f_{*k} - \bar{f})(f_{*k} - \bar{f})'}{n_1 n_2 (n-1)}. \qquad (4)$$

The matrix $V_0$ represents the variance–covariance structure for $d$ under $H_0$ with respect to all possible randomization assignments of $n$ patients to the two treatment groups and is therefore a matrix of known constants and not random variables. The observed data values, $\{y_{*k}\}$ and $\{x_k\}$, are assumed given, and the treatment assignment is the random variable. To provide a test of the null hypothesis of no treatment effect, nonparametric analysis of covariance can be applied by fitting a linear model to $d$ in which the differences between treatment groups with respect to the covariates assessed before randomization are constrained to equal zero, as would be expected under any valid randomization scheme. The linear model is given by

$$E(d) = E[d_y, d_x']' = Zg, \qquad (5)$$

where $Z = [1, 0_t']'$ is the design matrix that constrains the differences in means of the covariates to zero. The regression coefficient, $g$, is estimated through weighted least squares:

$$\hat{g} = (Z' V_0^{-1} Z)^{-1} Z' V_0^{-1} d. \qquad (6)$$

The variance of $\hat{g}$ is estimated by

$$\mathrm{var}(\hat{g}) = (Z' V_0^{-1} Z)^{-1}. \qquad (7)$$

A chi-square test statistic for the null hypothesis of no treatment effect is provided by $Q_g = \hat{g}^2/\mathrm{var}(\hat{g})$. Because $V_0$ is a matrix of known constants (as opposed to random variables), the significance of $Q_g$ could be assessed according to its exact distribution across all possible randomizations under $H_0$. Nevertheless, for sufficient sample sizes (e.g., $\geq 30$ patients per randomization group), $Q_g$ has an approximate chi-square distribution with one degree of freedom via randomization central limit theory. The exact distribution is a desirable property when sample sizes are not large enough for a chi-square approximation to be reasonable.

Note that $V_0$ can be partitioned into matrices that correspond to the sums of squares and cross products of the $\{y_{*k}\}$ and $\{x_k\}$ as follows:

$$V_0 = \begin{bmatrix} v_{yy,0} & V_{yx,0} \\ V_{yx,0}' & V_{xx,0} \end{bmatrix}. \qquad (8)$$

In this partitioning of $V_0$, $v_{yy,0}$ is the variance of $d_y$, $V_{yx,0}$ is the covariance between $d_y$ and $d_x$, and $V_{xx,0}$ is the covariance matrix for $d_x$, assuming the null hypothesis of no treatment effect. Then the regression coefficient from the linear model defined in Equation (5) and estimated in Equation (6) can be computed as

$$\hat{g} = d_y - V_{yx,0} V_{xx,0}^{-1} d_x = (\bar{y}_1 - \bar{y}_2) - V_{yx,0} V_{xx,0}^{-1} (\bar{x}_1 - \bar{x}_2) = (\bar{g}_1 - \bar{g}_2). \qquad (9)$$

The variance of $\hat{g}$ is given by

$$\mathrm{var}(\hat{g}) = v_{yy,0} - V_{yx,0} V_{xx,0}^{-1} V_{yx,0}' = \frac{n \sum_{k=1}^{n} g_{*k}^2}{n_1 n_2 (n-1)}, \qquad (10)$$
where $g_{*k} = y_{*k} - V_{yx,0} V_{xx,0}^{-1} x_k$ is the residual for the prediction of $y_{*k}$ by the multiple linear regression model with $x_k$, as first discussed for rank analysis of covariance (10). Thus, $\hat{g}$ is simply the difference in means of the outcome variable for the two treatment groups adjusted for the covariables included in the model, subject to the constraint that differences in means of the covariables are equal to zero. Note that $\mathrm{var}(\hat{g})$ as defined in Equations (7) and (10) is less than $v_{yy,0}$, the variance of the unadjusted difference $d_y$, and this property can enable $Q_g$ to provide a more powerful test for $H_0$ than its unadjusted counterpart $Q_y = d_y^2/v_{yy,0}$. The amount of variance reduction will depend on the size of the correlation between $\{y_{*k}\}$ and $\{x_k\}$.

If a significant result for the test of the null hypothesis is obtained using the procedures described above, then it may be of interest to compute a confidence interval about the adjusted treatment group differences. Methods are available for confidence interval estimation based on the variance–covariance structure consistent with the alternative hypothesis. The counterpart to $V_0$ that applies under the alternative hypothesis, namely

$$H_A: y_{1k} \neq y_{2k}, \ \text{for all } k, \qquad (11)$$

is given by

$$V_A = \sum_{i=1}^{2} \sum_{k=1}^{n} \frac{U_{ik} (f_{ik} - \bar{f}_i)(f_{ik} - \bar{f}_i)'}{n_i (n_i - 1)}, \qquad (12)$$

where $f_{ik} = (y_{ik}, x_k')'$ and $\bar{f}_i = \sum_{k=1}^{n} U_{ik} f_{ik}/n_i$. A confidence interval about the adjusted treatment group difference under the alternative hypothesis, $H_A$, can be obtained by substituting $V_A$ for $V_0$ in the expression in Equation (6) to estimate $\hat{g}$ and in the expression in Equation (7) to estimate the variance of $\hat{g}$. Note, however, that the confidence interval estimate based on $V_A$ will not be in one-to-one correspondence with the outcome of the test of $H_0$ provided by $Q_g$, as is the case with parametric analysis of covariance. Note also that the variance structure specified in Equation (12) results from viewing the clinical trial population as two parallel random samples selected from two infinite populations of patients who might have received each treatment, respectively. That is, $\{y_{ik}\}$ corresponds to a random sample of all possible outcomes that might have been observed for an infinite population of patients receiving the $i$th treatment.
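Purely as an illustration, and not the SAS macros described in the SOFTWARE section below, a minimal Python sketch of the unstratified computation in Equations (4)–(7) might look as follows; the function name, argument layout, and returned dictionary are assumptions of the sketch.

import numpy as np
from scipy import stats

def randomization_ancova(y, x, u):
    # y: (n,) outcomes; x: (n, t) baseline covariates; u: (n,) indicator, 1 = test arm, 0 = control.
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    u = np.asarray(u)
    n = len(y)
    n1 = int(np.sum(u == 1))
    n2 = n - n1

    f = np.column_stack([y, x])                               # f_{*k} = (y_k, x_k')'
    d = f[u == 1].mean(axis=0) - f[u == 0].mean(axis=0)       # d = (d_y, d_x')'

    fc = f - f.mean(axis=0)
    V0 = n * (fc.T @ fc) / (n1 * n2 * (n - 1))                # equation (4)

    V0inv = np.linalg.inv(V0)
    var_g = 1.0 / V0inv[0, 0]                                 # (Z' V0^{-1} Z)^{-1} with Z = (1, 0')'
    g_hat = var_g * float(V0inv[0, :] @ d)                    # weighted least squares estimate, equation (6)
    Qg = g_hat ** 2 / var_g                                   # chi-square statistic with 1 d.f.
    return {"unadjusted_diff": float(d[0]), "adjusted_diff": g_hat,
            "Qg": Qg, "p_value": float(stats.chi2.sf(Qg, df=1))}

A stratified extension along the lines of Equations (13)–(16) below is sketched after the SOFTWARE section.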
The methods described thus far are easily extended to cases in which one or more stratification factors, such as clinical center, are incorporated in the randomization scheme. Let $h = 1, \ldots, q$ index the $q$ strata, and let $n_{hi}$ denote the number of patients in the $h$th stratum receiving the $i$th treatment. Stratum-specific treatment group differences with respect to the outcome variable and covariables are defined as $d_{hy} = \bar{y}_{h1} - \bar{y}_{h2}$ and $d_{hx} = (\bar{x}_{h1} - \bar{x}_{h2})$, where $\bar{y}_{hi}$ and $\bar{x}_{hi}$ are the within-stratum means. To combine information across the strata, an appropriate weighting scheme must first be specified. The most commonly used stratum weights, referred to as Mantel–Haenszel weights, are given by

$$w_h = \frac{n_{h1} n_{h2}}{n_{h1} + n_{h2}}. \qquad (13)$$

Let $d_h = [d_{hy}, d_{hx}']'$, and define

$$d_w = \frac{\sum_{h=1}^{q} w_h d_h}{\sum_{h=1}^{q} w_h} \quad \text{and} \quad V_{w0} = \frac{\sum_{h=1}^{q} w_h^2 V_{h0}}{\left(\sum_{h=1}^{q} w_h\right)^2}, \qquad (14)$$

where $V_{h0}$ is the within-stratum variance–covariance matrix for $d_h$ under the null hypothesis. Estimates of treatment group differences that control for stratification factors in addition to baseline covariate adjustments then can be computed as

$$\hat{g}_w = (Z' V_{w0}^{-1} Z)^{-1} Z' V_{w0}^{-1} d_w, \qquad (15)$$

and variances are estimated as

$$\mathrm{var}(\hat{g}_w) = (Z' V_{w0}^{-1} Z)^{-1}. \qquad (16)$$
With this approach, treatment group differences for means of the outcome variable and differences in means of covariables are first averaged across strata using appropriate weights to control for the effects of the stratification factor, as shown in the expression in Equation (14). Covariate-adjusted treatment group differences and their variances then are computed in the expressions in Equations (15) and (16). If stratum-specific sample sizes are sufficiently large, then these two steps can be reversed. That is, separate linear models can be fit within each stratum to obtain adjusted treatment group differences and their respective variances, $\hat{g}_h$ and $\mathrm{var}(\hat{g}_h)$. Weighted averages then are computed across strata using the Mantel–Haenszel weights. This procedure is analogous to including interactions between each covariable and the stratification factor in a multiple regression model. The within-stratum sample sizes must be sufficient to support the adjustments when the computations are carried out in this order. As in the unstratified case, confidence interval estimation is available based on the alternative variance–covariance structure. Under $H_A$, $V_{wA}$ can be defined in a method analogous to $V_{w0}$, thereby providing the ability to compute estimated treatment group differences and their associated confidence intervals adjusting for the stratification factor.

Nonparametric analysis of covariance methods described above are suitable for situations in which the outcome variables are continuous, ordinal, or time-to-event variables. The methods also apply when the outcome variable is dichotomous, such as a response variable, and the treatment group differences are differences in the proportions of patients responding to each treatment. Methods for applying nonparametric ANCOVA have also been developed for the case where the odds ratio of the two treatment groups is of interest, rather than the difference in proportions responding (11). The methodology is derived from the same basic principles, but the computations are more involved. Extensions to the methods described above are available for cases in which the response vector is multivariate and to cases in which more than two treatment groups are of interest (8,12).

3 SOFTWARE

User-supplied SAS macros for nonparametric ANCOVA are available from the authors (13,14). The analysis results presented here for the example data that illustrate the randomization-based methods were generated using the macro of Zink and Koch (13).
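Those macros are not reproduced here; purely as an illustration of the stratified combination in Equations (13)–(16), and building on the hypothetical randomization_ancova sketch shown earlier, a weighted version might look like the following.

import numpy as np
from scipy import stats

def stratified_randomization_ancova(y, x, u, stratum):
    # Combine stratum-specific differences with Mantel-Haenszel weights (equations 13-14),
    # then apply the same weighted least squares adjustment (equations 15-16).
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    u = np.asarray(u)
    stratum = np.asarray(stratum)

    dim = 1 + x.shape[1]
    num_d = np.zeros(dim); num_V = np.zeros((dim, dim)); w_sum = 0.0
    for h in np.unique(stratum):
        s = stratum == h
        f = np.column_stack([y[s], x[s]])
        us = u[s]
        nh1, nh2 = int(np.sum(us == 1)), int(np.sum(us == 0))
        nh = nh1 + nh2
        d_h = f[us == 1].mean(axis=0) - f[us == 0].mean(axis=0)
        fc = f - f.mean(axis=0)
        V_h0 = nh * (fc.T @ fc) / (nh1 * nh2 * (nh - 1))      # within-stratum V_h0 under H0
        w_h = nh1 * nh2 / nh                                   # Mantel-Haenszel weight, equation (13)
        num_d += w_h * d_h
        num_V += w_h ** 2 * V_h0
        w_sum += w_h

    d_w = num_d / w_sum
    V_w0 = num_V / w_sum ** 2                                  # equation (14)
    V_inv = np.linalg.inv(V_w0)
    var_gw = 1.0 / V_inv[0, 0]                                 # (Z' V_w0^{-1} Z)^{-1}
    g_w = var_gw * float(V_inv[0, :] @ d_w)                    # equation (15)
    Q = g_w ** 2 / var_gw
    return {"adjusted_diff": g_w, "Q": Q, "p_value": float(stats.chi2.sf(Q, df=1))}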
ratio to receive either test treatment (A) or placebo (P). Patient’s status with respect to respiratory health was assessed at baseline (before randomization) and again following treatment. An ordinal, five-point scale was used to measure respiratory health status at both time points, with 0 corresponding to a status of ‘‘poor’’ and 4 corresponding to a status of ‘‘excellent.’’ Table 1 provides a crosstabulation of the responses at baseline and follow-up, stratified by center and treatment. Nonparametric analysis of covariance methods first are illustrated with a dichotomous outcome variable for which a positive response corresponds to a respiratory health status of ‘‘good’’ or ‘‘excellent,’’ whereas a negative response corresponds to any other status. With this outcome, it is of interest to determine whether a difference exists in the probability of a positive response between the two treatment groups. Baseline status is included as a covariate to provide for a possible reduction in variance in assessing treatment group differences. Ignoring center
for this initial example, the null hypothesis in the expression in Equation (1) applies, with {y∗k } equal to the dichotomous response variable observed for each patient and {xk } equal to the vector of baseline values (range 0–4). In this application, baseline is handled as a linear covariate. Table 2 provides the results of the nonparametric analysis under the null hypothesis assuming the variance covariance structure V 0 specified in the expression in Equation (4). As can be seen from the table, a significant difference in the probability of response between treatment groups is found (P-value = 0.0115). Adjusting for baseline provides a slightly stronger result than its unadjusted counterpart, given in the first row of the table (P-value = 0.0390). Note that results identical to the nonparametric ANCOVA can be obtained from a two-step analysis in which a linear regression model is first fit to the response variable that includes baseline as the single predictor. The residuals from this linear model then are analyzed for association with treatment using an extended
Table 1. Example Data from a Clinical Trial in Respiratory Disease

                                          Status at Follow-up Visit1
Clinic     Treatment         Baseline1     0     1     2     3     4   Total
Clinic 1   Test Treatment    0             1     0     2     0     0     3
                             1             0     1     5     0     0     6
                             2             0     1     3     4     1     9
                             3             0     0     0     5     2     7
                             4             0     0     0     1     1     2
                             Total         1     2    10    10     4    27
           Placebo           0             0     0     0     0     0     0
                             1             3     2     2     0     0     7
                             2             1     2     6     3     1    13
                             3             0     0     1     2     3     6
                             4             0     0     0     0     3     3
                             Total         4     4     9     5     7    29
Clinic 2   Test Treatment    0             0     0     0     0     0     0
                             1             0     0     1     1     1     3
                             2             0     1     2     4     2     9
                             3             0     0     0     3     3     6
                             4             0     0     0     1     8     9
                             Total         0     1     3     9    14    27
           Placebo           0             0     0     0     0     0     0
                             1             0     0     2     0     2     4
                             2             0     2     2     2     1     7
                             3             0     0     5     3     5    13
                             4             0     0     1     2     1     4
                             Total         0     2    10     7     9    28

1 Values of health status assessed at baseline and follow-up range from 0 = "Poor" to 4 = "Excellent."
Table 2. Results of Nonparametric Analysis of Covariance and Unadjusted Counterparts for a Dichotomous Outcome Variable: Response = Good or Excellent

Model                                                      Chi-Square (d.f.)   P-Value   Treatment Difference ± SE   95% Confidence Limits
Unadjusted1                                                4.260 (1)           0.0390    0.194 ± 0.092               [0.015, 0.373]
Adjusted for baseline (under H0)                           6.388 (1)           0.0115    n/a                         n/a
Adjusted for baseline; stratified by center (under H0)     6.337 (1)           0.0118    n/a                         n/a
Adjusted for baseline; stratified by center (under HA)     6.697 (1)           0.0097    0.197 ± 0.076               [0.048, 0.346]
CMH stratified by center; unadjusted (under H0)            4.401 (1)           0.0359    n/a                         n/a

1 Mantel–Haenszel chi-square test (1 d.f.) generated under H0; treatment difference and confidence limits generated under HA.
Mantel–Haenszel chi-square test to compare the mean scores. Incorporating stratification by center in this example serves to illustrate the computations defined in the expressions in Equations (14) through (16), the results of which are also provided in Table 2. A test of the null hypothesis is significant (P-value = 0.0118); however, stratification by center does not yield any greater advantage after adjusting for baseline values in this example. To estimate the treatment group difference and associated 95% confidence interval, the variance covariance structure consistent with the alternative hypothesis V wA is used. As can be seen from Table 2, the proportion of patients with a response of ‘‘good’’ or ‘‘excellent’’ is higher by approximately 0.20 for patients receiving the test treatment compared with those receiving placebo (95% CI: [0.047, 0.346]). This treatment group difference is similar to the unadjusted counterpart in the first row of the table, but the standard error is somewhat smaller, and the confidence interval narrower, as an advantage of the baseline adjustment. Also for comparison purposes, the results of a stratified CMH test are provided in Table 2. Stratification by center is incorporated, but adjustment for both baseline and center with this analysis method could involve excessive stratification with many noninformative strata (i.e., strata with only one outcome or only one treatment). As can be seen from Table 2, the test of association in this example without baseline adjustment has a somewhat weaker result, although it
is still statistically significant (P-value = 0.0359). Nonparametric analysis of covariance methods are equally suitable for dichotomous, ordinal, and continuous outcome measures, as described earlier. To illustrate this point, an analysis of the five-level status variable was carried out using the respiratory data example. Standardized rank scores [i.e., ranks divided by (n+1) with mid-ranks used for ties] were generated for both the outcome and baseline values within each center. The scores at follow-up then were analyzed via nonparametric analysis of covariance, adjusting for baseline scores and stratifying by center. The results are shown in Table 3. A significant difference in favor of the test treatment is found with respect to the scores under the null hypothesis (P-value = 0.0230). To estimate the treatment group difference and corresponding confidence limits, the analysis was repeated using integer scores in lieu of the standardized ranks. Average scores on the integer scale (0–4) are higher by 0.405 among patients receiving the test treatment than among those receiving placebo, and the 95% confidence interval is [0.078, 0.733]. Results for the Wilcoxon rank sum test without stratification and the van Elteren test as its stratified counterpart are included in Table 3 for comparison purposes. The test results are weaker without the advantage of the baseline adjustment (P-value = 0.0809 unstratified, and P-value = 0.0524 stratified). Note that nonparametric analysis of covariance methods are also available for computing confidence limits that correspond to the use of
Table 3. Results of Nonparametric Analysis of Covariance and Unadjusted Counterparts for an Ordinal Outcome Variable: Response Ranges from 0 (Poor) to 4 (Excellent)1

Model                                                      Chi-Square (d.f.)   P-Value   Treatment Difference ± SE   95% Confidence Limits
Unadjusted Wilcoxon Rank Sum test (under H0)               3.046 (1)           0.0809    n/a                         n/a
Unadjusted van Elteren test, stratified by center
  (under H0)                                               3.763 (1)           0.0524    n/a                         n/a
Adjusted for baseline; stratified by center (under H0)     5.168 (1)           0.0230    n/a                         n/a
Adjusted for baseline; stratified by center2               5.593 (1)           0.0180    0.405 ± 0.167               [0.078, 0.733]

1 Standardized rank scores are used, except where noted.
2 Chi-square test and P-value generated under H0; treatment differences and confidence limits generated under HA. Integer scores used in lieu of standardized ranks.
standardized rank scores with adjustment for baseline through the Mann–Whitney probability that a randomly selected patient in one treatment group has a better outcome than a randomly selected patient in the second treatment group (15).

5 DISCUSSION
Nonparametric analysis of covariance provides a useful analysis strategy for situations in which covariate adjustment is deemed important, but assumptions about variable distributions are to be kept to a minimum. These methods can be thought of as extensions of the Cochran–Mantel–Haenszel test and the Wilcoxon Rank Sum test to accommodate baseline covariates. The variance covariance matrix is a matrix of known constants, and the test statistic under the null hypothesis has an exact permutation distribution for which exact P-values can be obtained. The variance covariance structure assumed under the alternative hypothesis allows for confidence interval estimation but does not lend itself to exact testing. The methods are basically assumption free, requiring only a valid randomization scheme to be employed for the clinical trial. That is, the randomization scheme that was used was consistent with the study objectives and was implemented correctly, with no inappropriate manipulation by study personnel. This requirement does not imply that the randomization for the study was optimal in any sense, but rather it implies that it produced
the treatment assignments as planned. The methods are straightforward to use and their application is the same, regardless of whether the outcome variable is dichotomous, ordinal, or continuous. Because they provide the same advantage as parametric analysis of covariance, namely the potential for variance reduction in estimating and testing treatment group differences, and they minimize the assumptions required, these methods are useful in a variety of clinical trial settings.

REFERENCES

1. L. M. LaVange, T. A. Durham, G. G. Koch, Randomization-based nonparametric analysis of multicentre trials. Statistical Methods in Medical Research 2005; 14:1–21. 2. G. W. Snedecor, W. G. Cochran, Statistical Methods, 8th ed. Ames, IA: Iowa State University Press, 1989. 3. G. G. Koch, I. A. Amara, G. W. Davis, D. B. Gillings, A review of some statistical methods for covariance analysis of categorical data. Biometrics 1982; 38:553–595. 4. CPMP Working Party. Points to consider on adjustment for baseline covariates. London: EMEA, 2003. 5. L. D. Robinson, N. P. Jewell, Some surprising results about covariate adjustment in logistic regression models. International Statistical Review 1991; 59:227–240. 6. M. H. Gail, S. Wieand, S. Piantadosi, Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika 1984; 71:431–444.
7. Encyclopedia of Clinical Trials entry for Rank analysis of covariance. 8. G. G. Koch, C. M. Tangen, J. W. Jung, I. A. Amara, Issues for covariance analysis of dichotomous and ordered categorical data from randomized clinical trials and nonparametric strategies for addressing them. Stat. Med. 1998; 17:1863–1892. 9. W. G. Cochran, Sampling Techniques, 3rd ed. Wiley: New York, 1977. 10. D. Quade, Rank analysis of covariance, J. Am. Stat. Assoc. 1967; 62:1187–1200. 11. C. M. Tangen, G. G. Koch, Complementary nonparametric analysis of covariance for logistic regression in a randomized clinical trial setting. J. Biopharm. Stat. 1999; 9:45–66.
12. C. M. Tangen, G. G. Koch, Non-parametric analysis of covariance for confirmatory randomized clinical trials to evaluate dose-response relationships. Stat. Med. 2001; 20:2585–2607. 13. R. C. Zink, G. G. Koch, NparCov Version 2. Chapel Hill, NC: Biometric Consulting Laboratory, Department of Biostatistics, University of North Carolina, 2001. 14. V. Haudiquet, Macro for Non-Parametric Analysis of Covariance: User's Guide. Paris, France: Wyeth Research, 2002. 15. J. Jung, G. G. Koch, Multivariate Non-Parametric Methods for Mann-Whitney Statistics to Analyze Cross-Over Studies with Two Treatment Sequences. Stat. Med. 1999; 18:989–1017.
RANDOMIZATION PROCEDURES
VANCE W. BERGER and WILLIAM C. GRANT
Therese Dupin-Spriet
Rick Chappell

1 BASICS

Randomization is an important aspect of any experiment that compares the effects of some treatments on some outcome or outcomes. Note that any evaluation of a treatment is necessarily comparative, because a treatment cannot be "good" or "bad" in a vacuum. Rather, it is good or bad relative to other possible treatments (including the lack of any treatment), so randomization is a key component of the evaluation of any treatment. Through randomization, an experiment's observer can better distinguish between the treatment effects and the effects of other factors that influence the outcome. In this section, we will explore how randomization accomplishes this.

1.1 Rationale for Randomization

Without randomization, treatments tend to be assigned based on prognostic factors, such as severity of disease. The experiment is confounded if sicker patients receive one treatment (perhaps an elective surgery) and healthier patients receive another treatment, because it can be difficult to separate the effects of the treatments from the effects of underlying differences across the treatment groups. Randomization is meant to prevent this confounding by removing the influence of patient characteristics (including preferences) on treatment assignments.

1.2 Defining Randomization

Randomization is the use of chance (probability) for assigning treatments. A more precise definition requires formulating a sequence of treatment assignments so that each treatment assignment corresponds to an accession number, which indicates the order in the assignment sequence. Berger and Bears (1) and Berger et al. (2) define randomization as a two-step process that involves first enumerating the set of allowable sequences, and second assigning a probability to each. Alternatively, Berger (3) defines randomization as the process of creating one treatment group by taking a random sample of all accession numbers used in a study. To reconcile the two definitions, all that is required is that the set of allowable sequences be nondegenerate. The set is nondegenerate if

1. at least two sequences are contained in the allowable set, and
2. each accession number appears in each treatment group for at least one sequence.

With this caveat, randomization consists of a set of allowable sequences and a probability distribution over those sequences, such that no treatment assignment is known with certainty prior to the start of the trial. Equivalently, the set of accession numbers in any one treatment group is a random sample of the set of all accession numbers in the study.

2 GENERAL CLASSES OF RANDOMIZATION: COMPLETE VERSUS IMBALANCE-RESTRICTED PROCEDURES

Many researchers speak of randomization as if it consisted of but one unique procedure. In fact, many ways can be employed to randomize, and this section will draw a preliminary distinction between complete (unrestricted) randomization and restricted randomization.

2.1 Definitions of Complete and Imbalance-Restricted Randomization

Complete randomization refers to the case in which no allocation influences any other allocation. In other words, if a trial contains only two treatments and uses an equal allocation ratio, then complete randomization is equivalent to flipping a fair coin for each patient, to determine his or her treatment assignment. With six treatments, the analogy would be tossing a die. More generally, consider T treatments indexed by t, t = 1, . . . , T, and
let kt > 0 be the prespecified allocation ratio for treatment t, so that these ratios sum to one. Let Ai be a multinomial indicator variable for patient i's treatment assignment, so that Ai = t if patient i is assigned to treatment t. A trial is completely randomized if Prob(Ai = t) = kt for all i = 1, . . . , n and all t = 1, . . . , T, even conditionally on all previous allocations. That is, the previous allocations do not in any way influence the probabilities of future allocations. Conversely, restricted randomization refers to the variety of procedures that constrain the imbalance between treatments in some way. In comparing the different kinds of restricted randomization, several questions are paramount. First, what kinds of limits are placed on the maximum imbalance that can occur? One possibility is that absolute boundaries are established so that treatments are assigned deterministically whenever the imbalance limit is reached. Alternatively, the assignment of treatments may be dictated by probability rules that make the likelihood of additional imbalance decrease as the extent of existing imbalance increases. A second question is, "What restrictions govern balance restoration?" Sometimes, perfect balance is guaranteed at fixed points during the sequence. Alternatively, the exact points when balance will be restored may be uncertain, but the frequency of perfect balance points may be fixed. Finally, the number of perfect balance points may be variable and determined by some probability distribution.

2.2 Problems with Complete Randomization

Complete randomization can result in imbalances between treatments (4), both at the end of the study (terminal imbalance) and at intermediate points during the trial (intrasequence imbalance). Statistical power and efficiency of treatment effect estimators can be compromised by large terminal imbalances, and even if a trial has terminal balance, intrasequence imbalances can cause additional problems. Chronological bias (5) occurs when the distribution of some factor that affects the outcome changes systematically over time and intrasequence imbalance causes this factor to be confounded with treatments. For example, many more early
patients may be allocated to the active group and many more late patients may be allocated to the control group, whereas many more female patients enter the trial early, and many more male patients enter the trial late. Then gender and treatments will be confounded. This confound is not technically a bias, because there is no propensity for one group to be systematically favored over the other. Still, the confounding can interfere with valid treatment comparisons. 2.3 Problems with Imbalance-Restricted Randomization: Selection Bias The only way to control chronological bias is to impose restrictions on the randomization that disallow excessive imbalances. But in so doing, one introduces the possibility of selection bias. Note that many authors have stated definitively that randomization by itself eliminates all selection bias, and hence it guarantees internal validity. We will demonstrate the falsity of this statement and of the related statement that in a randomized trial all baseline imbalances are of a random nature. In unmasked trials especially, patterns created by imbalance restrictions may allow investigators to predict future treatment assignments based on knowledge of past ones. Blackwell and Hodges (6) provide a seminal description of an ‘‘optimal guessing rule’’ based on predicting that the treatment that is most under-represented at that point in time will be assigned next. Rosenberger and Lachin (4) refer to this method as the convergent strategy. Smith (7,8), Efron (9), and Stigler (10) employ similar models to quantify expected selection bias that results from restricted randomization procedures. The implication of this convergent strategy, and of more accurate predictions that may even allow some future allocations to be known with certainty even without having observed the allocation sequence, is that allocation concealment is much more elusive than is generally thought. One does not need to observe the allocation sequence directly to predict upcoming allocations, which thereby violates the principle of allocation concealment. Without allocation concealment, it is possible for an investigator to enroll healthier
patients when one treatment is to be assigned and to enroll sicker patients when the other treatment is to be assigned, thereby creating a selection bias, or confounding. Marcus (11) clarified the role of patient characteristics by assuming that a fixed fraction of the baseline population is expected to have a good prognosis, and that the extent of selection bias depends on this fraction, as well as the statistical association between prognosis and treatment. Additional work from Berger et al. (2) shows how selection bias varies with the extent of uncertainty that concerns investigator predictions. By enrolling particular types of patients only when predictions exceed threshold uncertainty levels, Berger et al.’s (2) investigator behaves in a plausible and rational manner. Empirical work corroborates the concern with selection bias for restricted randomization (3). 3 PROCEDURES FOR IMBALANCE-RESTRICTED RANDOMIZATION The most common restricted randomization procedure is the permuted block procedure, with a fixed block size. The problems with this approach will be discussed in this section, and so one improvement was to vary the block size so as to make prediction of future allocations more difficult. As we will discuss, varying the block size does not actually accomplish this objective. To address the need for better restricted randomization procedures, newer methods have been developed with better properties. We will discuss some of these methods as well. 3.1 Traditional Procedures for Imbalance-Restricted Randomization A ‘‘block’’ refers to a subsequence of treatments that contain an equal number of the different possible treatments in a trial, as defined by the order in which patients are enrolled. So, for example, the first four patients enrolled might constitute a block, which means that two of these four patients will receive the active treatment, and the other two will receive the control. Randomization is used to determine which patients receive which treatments within each block. Generally, permuted block randomization
procedures employ a sequence of blocks where each particular block is independently determined by a random allocation rule. 3.1.1 Fixed Block Randomization. The most common permuted blocks design employs a fixed block size. For a trial with n total assignments, M blocks are established with equal size m = n/M. When there are two possible treatments, each block contains m/2 patients to be assigned to treatment A and m/2 to be assigned to treatment B. Perfect balance is forced at M points during the entire sequence (which corresponds to the end of each block), and the imbalance never exceeds m/2. In unmasked or imperfectly masked trials, the forced balance allows for prediction of future allocations and hence selection bias, especially with small blocks lengths (12,13). Block lengths should be unknown to the investigators, but may sometimes be deduced, particularly in multicenter trials where each center has to recruit a pre-determined small number of patients. For this reason, it is problematic to use small block sizes, especially blocks of size two. 3.1.2 Variable Block Randomization. Instead of using the same size for every block in the overall sequence, a variable block procedure produces a sequence by stringing together blocks of different sizes, with the size of each block determined according to some probability rule. For example, if blocks of size two and four are to be used, then at each point of perfect balance, the size of the next block is four with probability P and two with probability 1–P. It is common to encounter arguments that varying the block size eliminates the possibility of all prediction, but this statement simply is not true (14). Although the variable block procedure may result in less predictability than a corresponding fixed block procedure, it will also result in more expected points of perfect balance than a fixed block procedure, and so it can be even more susceptible to prediction and selection bias (3, 15, 16). 3.1.3 Biased-Coin Procedures. A biased coin procedure is a collection of probability measures over treatments, with one probability measure for every possible history of
treatments that have already been assigned. In particular, let H be a set of sequences, where each sequence (ak)k=1,...,K ∈ H consists of treatments that have already been assigned. Each sequence (ak)k=1,...,K ∈ H is a history h. If A is the set of possible treatments, then the probability measure β(h) is a probability measure over A for history h. A biased coin procedure is a collection (β(h))h∈H of independent probability measures. Efron (9) pioneered the use of biased coin designs to limit imbalance. For two treatments, his procedure assigns equal probability to the treatments for every perfectly balanced history. If one treatment has been assigned more than the other in some history, then the under-represented treatment is assigned with probability P > 0.5 and the over-represented treatment is assigned with probability 1 − P. Higher values for P mean that greater imbalances are decreasingly likely to occur. Adaptive-biased coin procedures or urn designs use probabilities that vary according to the degree of imbalance (7,8). The imbalance-dependent probabilities are determined by repeated draws of treatment-marked balls from an urn. If a treatment-type-A ball is drawn for the first treatment, then the second ball is drawn after replacing the first-drawn ball and adding some number of type-B balls. Similarly, when B is drawn, then a B-ball replacement and A-ball adding occurs. As a result, the probability of drawing the under-represented treatment increases in the extent of imbalance. Different urn designs allow for different kinds of imbalance-dependent probabilities, each based on different draw-and-replace rules. See Reference 17 for a summary. Soares and Wu (18) modified Efron's procedure with the "big-stick rule" that assigns, deterministically, the under-represented treatment whenever a fixed maximum limit on the imbalance is reached. Otherwise, a fair coin (equal probability on each) is used to assign the next treatment. As a compromise between Efron's procedure and the big-stick rule, Chen (19) uses a maximum limit on imbalance (similar to Reference 18) and a value for P greater than 0.5 (similar to Reference 9). This "biased coin design with imbalance tolerance" uses three different probability measures:
1. equal probability is assigned to the treatments for every perfectly balanced history;
2. the under-represented treatment is assigned with 100% probability when the maximum imbalance limit has been reached; and
3. the under-represented treatment is assigned with probability P ∈ (0.5, 1) whenever there is an imbalance less than the limit.
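A minimal sketch of this rule for two treatments is given below. It is an illustration only; the function name and the particular values of P and the maximum tolerated imbalance are assumptions, not values taken from the cited papers.

```python
# Sketch of a biased coin design with imbalance tolerance (two arms, A and B),
# following the three rules above; p and mti are user-chosen design parameters.
import random

def next_assignment(history, p=0.7, mti=3):
    """history: list of previous assignments ('A'/'B'); returns the next one."""
    imbalance = history.count("A") - history.count("B")
    if imbalance == 0:                           # rule 1: balanced -> fair coin
        prob_A = 0.5
    elif abs(imbalance) >= mti:                  # rule 2: at the limit -> force
        prob_A = 0.0 if imbalance > 0 else 1.0   # the under-represented arm
    else:                                        # rule 3: biased coin toward
        prob_A = 1 - p if imbalance > 0 else p   # the under-represented arm
    return "A" if random.random() < prob_A else "B"

# Generate a 20-patient sequence:
seq = []
for _ in range(20):
    seq.append(next_assignment(seq))
print("".join(seq))
```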
3.2 Procedures for Minimizing Predictability and Selection Bias For the restricted randomization procedures discussed previously, a primary focus is the level of expected imbalance. As a general rule, procedures with more forced balance will be more predictable. When choosing between fixed block procedures, for instance, a smaller block size entails lower expected imbalance and higher predictability. When choosing a value of P for Efron’s biased-coin procedure, a higher P value entails lower expected imbalance and higher predictability. In any urn design, lower imbalance and higher predictability can be expected from adding a greater number of under-represented balls. Regarding the general tradeoff between balance and predictability, Klotz (20) depicts the dynamic nature of randomization as a choice between imbalance and entropy. Guidance for making this choice can be found in several recent randomization designs. The following three procedures share a common question: To satisfy some particular imbalance restrictions, what kind of randomization procedures produces the lowest possible predictability? When considered from this perspective, the task of designing a randomization procedure becomes a type of constrained optimization problem, in which the clinical trial statistician seeks to minimize predictability subject to some imbalance constraints. 3.2.1 The Maximal Procedure. The maximal procedure (2) minimizes predictability subject to a maximum tolerable imbalance (MTI) constraint and possibly also a terminal balance constraint. Given the set of sequences that satisfy terminal balance and
some particular MTI, denoted as the set (MP), the maximal procedure places a uniform distribution on (MP). The maximal procedure resembles both the big-stick procedure (18) and the related procedure in Reference 19. All three procedures allow the extent of imbalance to follow a random walk that is bounded by reflecting barriers at the MTI. To understand the lower predictability that results from the maximal procedure, it is useful to consider what kind of biased-coin procedure would be equivalent to the maximal procedure. By employing a uniform distribution, the maximal procedure guarantees that any one sequence is as likely as any other. At any given level of imbalance I, the probability of creating more imbalance is simply the number of sequences that can follow from I + 1 divided by the total number that can follow from either I + 1 or I − 1. If I = MTI, then no sequences can follow from I + 1 and therefore I − 1 occurs with 100% probability. Also, the smaller MTI − I is, the fewer sequences can follow from I + 1 and the more sequences can follow from I − 1, which thereby ensures a lower probability of creating more imbalance. This result helps explain the advantage of the maximal procedure for reducing predictability. For the maximal procedure, not only are higher imbalance levels less likely than lower levels, but also the probability of an increase in imbalance is decreasing in the level of imbalance. As a result, the maximal procedure reduces predictability because more predictable histories are less likely to occur. In addition, the imbalance that does occur tends to occur at allocations other than the ones that can be predicted when fixed blocks are used, so the hope is that if enough early attempts to predict allocations are thwarted, then all subsequent prediction may stop and in this way selection bias may be eliminated. 3.2.2 Quantification Methods for Unequal Block Randomization. Because the maximal procedure is constrained only by some MTI constraint and possibly by terminal balance, the number of times that perfect balance is restored within a sequence is uncertain. At the extreme, it is possible that perfect balance occurs only at the very beginning and very end of the sequence. Block procedures
are a better design when intrasequence balance must be restored with some guaranteed frequency. To choose among different block possibilities, Dupin-Spriet et al. (15) provide quantification methods for comparing predictability. The quantification formula can be used to calculate the predictability p(L) of any block L of length NL that contain L1 , L2 , . . . , LT allocations to treatments 1, . . . , T. The quantification formula reveals how predictability varies for different combinations of block lengths. For sequences of length 12 in two-arm trials, the minimum predictability results from combining one long with two short blocks, as long as the block lengths are known but block order remains unknown to the investigator. For such a trial with a balanced allocation ratio, the lowest predictability results from a mixture between one A4 B4 block and two A1 B1 blocks (where A4 B4 is a block of eight assignments with four of each treatment). The quantification method findings reflect two opposing effects of employing a smaller block size. Greater predictability occurs when an investigator correctly suspects the presence of a smaller block. On the other hand, a smaller block can masquerade as a subsequence from a larger block, which thereby makes it more difficult for the investigator to ascertain correctly which order has occurred. 3.2.3 Game-Based Randomization Procedures. In another recent perspective on predictability and selection bias, Grant and Anstrom (21) employ game-theoretic methods for randomization. Game theory directly accounts for the interdependence of a randomization procedure and investigator predictions. Although existing theories of selection bias do incorporate physician strategies, they have done so by treating physician strategies as exogenous determinants of bias. In terms of game theory, it is inappropriate to parameterize the mean difference in response between patients that the physician expects to receive treatment A and patients expected to receive B. Mean difference in response is not a parameter that can be used to design a randomization procedure because the randomization procedure helps determine this difference. Accordingly, the game-theoretic perspective identifies the statistician’s best
response, which is consistent with subgame-perfect Nash equilibrium.

3.3 Stratified Randomization

If it is important to balance baseline covariates, then randomization can be performed separately within strata, usually by allocating whole blocks to each of these categories of subjects. Strata are defined either by binary covariates (e.g., gender), or by categorizing continuous ones (e.g., age). Strata must be mutually exclusive and should be both predictive of the expected treatment effect and reproducible in subsequent trials. In multicenter trials, it is recommended to stratify by center. If several important prognostic factors have to be balanced simultaneously, then stratification may be performed independently for each factor. But this method makes the allocation procedure complex, and multiplies the number of strata with the paradoxical effect of increasing imbalance because of the number of small strata that contain fewer subjects than the block length. In practice, "the use of more than two or three stratification factors is rarely necessary" (22), particularly if the anticipated total number of trial subjects is relatively small. If stratified randomization is used, then it is logical to use the corresponding covariates in the statistical analysis for a design-based analysis (see the section entitled "Randomization-Based Analysis and the Validation Transformation").

3.4 Covariate-Adaptive Randomization

Because the efficacy of stratification is limited if several covariates are to be balanced simultaneously, "dynamic" methods are available to adjust allocation for each patient according to baseline covariates of preceding ones. Finally, all of these covariates are included in the final analysis. Even so, controversy surrounds the analysis that reflects the randomization scheme (23).

1. Deterministic allocation (24). With this technique, each new patient is given the treatment for which the sum of imbalances for all covariates is smallest. This procedure is remarkably effective for balancing the groups with respect to the specified covariates, but it has the
drawback of excluding formal randomization, so other covariates might be highly unbalanced. Besides, the knowledge of preceding patient characteristics makes this allocation predictable, particularly with small numbers of patients and a small number of covariates.

2. Covariate-adaptive randomization with unequal ratio (25). With this method, the sum of the differences in imbalances for all the covariates is first calculated for each treatment. This "measure of imbalance" is smallest for one treatment, which is then given precedence over the others in an unbalanced ratio randomization; for example, a 3/4 probability of allocation to the treatment which produces the smallest imbalance (a code sketch of this type of covariate-adaptive allocation follows this list).

3. Hierarchical covariate-adaptive methods. Not all covariates are equally prognostic, so it may be desirable to rank covariates by order of priority (26,27). Allocation can be performed so as to yield the best balance for the most relevant covariate, and the following ones are considered in turn if the preceding ones are perfectly balanced. The key is to define the imbalance function suitably.

4. Reduction of the variance of predicted response (28). In the DA-optimum design, the purpose is to reduce the variance of the predicted response, given the prognostic factors. The probability of an unbalanced allocation to each treatment is calculated from the ratio of this variance for this treatment arm to the sum of such variances over all treatments. Thus, the "loss" of information because of imbalance is minimized (i.e., the power of the trial for a given sample size is maximized).
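The sketch below illustrates a covariate-adaptive (minimization-type) rule of the kind described in items 1 and 2 above: the arm that would yield the smaller total marginal imbalance is favored with a biasing probability of 3/4. The covariate names, the equal weighting of covariates, and the 3/4 probability are illustrative assumptions rather than details taken from the cited methods.

```python
# Sketch of covariate-adaptive allocation (minimization with a biased ratio).
# Covariate names and levels are hypothetical.
import random

ARMS = ("A", "B")
counts = {}   # (covariate, level, arm) -> number of patients already assigned

def imbalance_if_assigned(patient, arm):
    """Total marginal imbalance across covariates if 'patient' joins 'arm'."""
    total = 0
    for cov, level in patient.items():
        n = {a: counts.get((cov, level, a), 0) for a in ARMS}
        n[arm] += 1
        total += abs(n["A"] - n["B"])
    return total

def assign(patient, bias=0.75):
    scores = {arm: imbalance_if_assigned(patient, arm) for arm in ARMS}
    if scores["A"] == scores["B"]:
        arm = random.choice(ARMS)
    else:
        preferred = min(scores, key=scores.get)   # arm giving smaller imbalance
        other = "B" if preferred == "A" else "A"
        arm = preferred if random.random() < bias else other
    for cov, level in patient.items():
        counts[(cov, level, arm)] = counts.get((cov, level, arm), 0) + 1
    return arm

# Example: allocate two patients described by gender and age group.
print(assign({"gender": "F", "age": "<65"}))
print(assign({"gender": "M", "age": ">=65"}))
```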
3.5 Response-Adaptive Randomization

Response-adaptive randomization (RAR) is the controversial procedure of relating randomization probabilities to success rates of previous patients in the same trial. Its goal
is to assign more patients to the more successful treatment while continuing the study. It requires quick responses and the ethical assumption that it is permissible to randomize patients' treatments even if there is enough evidence to violate equipoise and preferentially assign patients to one or more study arms that are tentatively judged superior. Although RAR can be applied to trials with any number of treatment arms in excess of one, we restrict discussion to the two-arm case. The "play the winner rule" (29) is RAR in its simplest form, in which each patient is deterministically assigned to the previous patient's treatment if it was judged a success, or to the other treatment if not. Others (30) have substituted biased coin randomization for determinism, with the idea being that there is an urn with a given number of balls of each color to start with, but then balls of a given color are added to reflect the outcomes so far. Ware (31) describes the ethical and scientific controversies surrounding extracorporeal membrane oxygenation (ECMO) to help premature newborns survive their first days, and the RAR that was used in this trial. The scientific controversy was heated because the first patient was assigned to the control therapy and died, and the subsequent 11 patients were assigned to ECMO and lived. We address two scientific aspects of RAR designs. First, importantly and as noted by many authors, RAR deliberately confounds the randomization probabilities with time, which results in bias to estimated treatment effects in the presence of a time trend. Karrison et al. (32) give an artificial but realistic example that shows the potential severity of such bias in a trial with a binary event whose probability increases over time and on which the treatment has no effect (the event probabilities in the two groups are identical). A random perturbation causes group A's event rate to exceed the other's rate, which in turn, by RAR, results in succeeding patients being randomized to Group A in greater numbers. Because of the trend, these later patients are more likely to have the event, which in turn leads to even more randomizations to Group A. Comparing the two groups at the trial's end could lead to a substantial and significant excess proportion of events in Group A even in the absence of a real treatment effect; this
result occurs solely because relatively more of Group A's patients were accrued later in the study when event rates were higher, and it is similar to chronological bias (32). As noted previously, restricted randomization such as blocking provides a simple solution to the problem of confounding with time. Presume that the first block is randomized in a balanced way with n patients in each group and, in general, that any subsequent block, k = 2, . . . , K, is randomized with n′ and Rk·n′ patients, where Rk depends on previous outcomes (we will discuss the choice of n′ next). Then, stratifying the resultant analysis by block, for example a 2 × K two-way ANOVA for continuous outcomes or the analysis of K 2 × 2 tables for binomial outcomes, will account for any possible time trend. No model assumptions are needed; the only requirement is that randomization proportions remain constant within each block. Of course, as also noted, permuted blocks are susceptible to prediction and selection bias. A second methodological difficulty, which also has a simple solution, is that the analysis of data from even blocked RAR designs can be complicated. Wei et al. (33) show how the adaptive nature of the design changes the distribution of the test statistic. However, Jennison and Turnbull (34) provide a method to ensure that the test statistic has the appropriate distribution under adaptive sampling and to preserve the standard analysis' expected size and power. To do so, we need to make the increments in information for each block identical. That is, presuming as above that the two groups in the first block are randomized to n:n patients, and patients in the better:worse groups of subsequent strata are adaptively randomized to n′:Rk·n′, then the total number of patients n′ + Rk·n′ in later strata should exceed n + n. This method is used because the first block, being balanced, induces the most efficient possible estimation of the treatment effect; imbalance lowers the efficiency that must then be made up with a larger sample size to preserve the information contribution. For data with constant variance (and approximately so for other distributions, such as the binomial with probabilities near 0.5), we have n′ ≈ n(Rk + 1)/(2Rk). Thus, if
Rk = 0.5 for a block, then we must use sample sizes of n′ = 1.5n and Rk·n′ = 0.75n to maintain equal information increments. Note that the block's total size is 2.25n, 12.5% bigger than the first block's, to compensate for the 2:1 randomization. More complete descriptions of the theoretical issues involved are given in References 4 and 34. Jennison and Turnbull (34) show how to optimize the successive randomization proportions {Rk} based on the properties of a two-armed bandit. Karrison et al. (32) give practical applications of simple, bias-free, stratified RAR designs.
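To make the sample-size adjustment concrete, the short sketch below (hypothetical function and block settings) computes n′ = n(Rk + 1)/(2Rk) and the resulting block composition for a few allocation ratios.

```python
# Sketch: per-block sample sizes that keep information increments equal to the
# first (balanced n:n) block, for data with (approximately) constant variance.
def adapted_block_sizes(n, R_k):
    """Return (n_better, n_worse, total) for a block randomized n' : R_k * n'."""
    n_prime = n * (R_k + 1) / (2 * R_k)   # n' = n(R_k + 1) / (2 R_k)
    return n_prime, R_k * n_prime, n_prime * (1 + R_k)

for R in (1.0, 0.5, 0.25):
    better, worse, total = adapted_block_sizes(n=20, R_k=R)
    print(f"R_k={R}: better arm {better:.1f}, worse arm {worse:.1f}, "
          f"block total {total:.1f} (first block total {2 * 20})")
```

With Rk = 0.5 and n = 20, for example, this gives 30 and 15 patients for a block total of 45, versus 40 in the balanced first block.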
4 RANDOMIZATION-BASED ANALYSIS AND THE VALIDATION TRANSFORMATION

Randomization has been discussed in terms of the design of clinical trials, but it is also instrumental in the subsequent analysis. The usual assumptions underlying classic statistical analysis, such as normality of distributions and random samples, are not realistic in most clinical trials to which they are applied. This mismatch between theory and practice creates not only intuitive problems but also more tangible ones. For illustration, consider a simple 2 × 2 table as would develop with an unadjusted treatment comparison in a two-arm trial with a binary primary endpoint. The most common analysis is the chi-square test, which obtains the P-value by referring the observed value of the chi-square test statistic to the chi-square distribution with one degree of freedom. This analysis provides the correct answer to the question "How extreme is this finding relative to the chi-square distribution with one degree of freedom?" The valid use of this measure as a P-value is predicated on the chi-square distribution with one degree of freedom being the correct null reference distribution. Unfortunately, it is not (Reference 12). Using randomization, and not the presumed chi-square distribution, as the basis for inference roots this inference in fact, as long as randomization was actually used and the analysis follows this randomization. Consider that the "peers" of the computed value of the test statistic are not the values of the chi-square distribution but rather the values of the same test statistic computed according to other realizations of the randomization that could have occurred. The strong null hypothesis (12), which specifies that the counterfactual potential outcomes of each patient are independent of the treatment, makes this calculation possible. One can use any test statistic for such a design-based permutation test, and if one uses the one-sided chi-square test statistic, then the result will be Fisher's exact test (one-sided). This method is the simplest example of the "validation transformation," which normalizes an approximate P-value by projecting it onto the space of exact P-values, which thereby makes it exact. Berger (35) proposed this method for use when considering the minimum among several possible P-values. Clearly, any exact test is a fixed point of this validation transformation, V(.), which means that V(p) = p if p is already exact. Fisher's exact test is generally conducted based on the minimally restricted randomization procedure that specifies only the numbers of patients to be allocated to each group. This test is not exact if any other randomization procedure was used, and so to be exact the analysis needs to follow the design (restrictions on the randomization) exactly (12), the so-called design-based analysis or platinum standard (36). It is possible to derive an exact test by imposing the validation transformation (which itself accounts for the precise restrictions used on the randomization) on any P-value (which is used not as a P-value per se but rather as a test statistic).
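A sketch of the design-based idea follows. It is illustrative only: the data, the block size, and the helper functions are hypothetical, and the difference in response proportions is used as the test statistic (a monotone equivalent of the one-sided statistic once the design fixes the group sizes). The point is that the observed statistic is referred to its distribution over re-randomizations that respect the original permuted-block design, not to a chi-square distribution.

```python
# Sketch of a design-based (randomization) test: re-randomize treatment labels
# within the same permuted blocks used in the design and compare the observed
# one-sided statistic with its re-randomization distribution.
import random

def prop_diff(treat, outcome):
    """Difference in response proportions between arms (1 = test, 0 = control)."""
    n1 = sum(1 for t in treat if t == 1)
    n0 = len(treat) - n1
    p1 = sum(o for t, o in zip(treat, outcome) if t == 1) / n1
    p0 = sum(o for t, o in zip(treat, outcome) if t == 0) / n0
    return p1 - p0

def randomization_test(treat, outcome, block_size=4, n_rerand=10000, seed=1):
    random.seed(seed)
    observed = prop_diff(treat, outcome)
    hits = 0
    for _ in range(n_rerand):
        rerand = []
        for start in range(0, len(treat), block_size):
            block = list(treat[start:start + block_size])
            random.shuffle(block)            # permute labels within each block
            rerand.extend(block)
        if prop_diff(rerand, outcome) >= observed:
            hits += 1
    return hits / n_rerand                   # one-sided design-based P-value

# Hypothetical data: 16 patients enrolled in permuted blocks of 4.
treat   = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1]
outcome = [1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
print(randomization_test(treat, outcome))
```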
5 CONCLUSIONS

Randomization is indispensable in ensuring valid inference, yet it is also surrounded by many misconceptions and misinformation. Some claim that randomization is unethical for various reasons, but generally it is some aspect of a randomized trial, and not randomization itself, which is the cause of the objection. Moreover, some compelling reasons are suggested to regard randomized trials as highly ethical (37). Some claim that randomization is no better than quality observational studies, and others claim that randomization by itself guarantees validity. Both are wrong.
What is true is that randomization can eliminate some biases that cannot be reliably eliminated any other way. But randomized trials can still be subverted, especially if there is no allocation concealment. The next misconception is that allocation concealment is ensured simply by using sealed envelopes or central randomization. As we have demonstrated, one need not gain access to the allocation sequence to predict future allocations, so there are two threats to allocation concealment, direct observation and prediction (38), and allocation concealment needs to be quantified in any given trial. Steps are needed to prevent the prediction of future allocations, and one good step is the use of less predictable procedures, such as the maximal procedure (2), instead of permuted blocks, especially with small block sizes. The best procedure of all would probably vary the type of randomization used across trials, or across strata within a trial, to vary also the susceptibilities to chronological bias and selection bias. Unless one uses unrestricted randomization, one cannot assume that selection bias was not a problem, and so one should routinely check for it (13), and one should be prepared to correct for it (3,39) if need be. Response adaptive randomization should be used with caution. Given the absurdity of preferring an approximation over the very quantity it is trying to approximate, exact tests should be used whenever feasible, perhaps by applying the validation transformation to an approximate analysis. Several cases exist in which the approximate analysis is actually the one of interest, and in such cases it should be used, but this should be justified.

REFERENCES

1. V. W. Berger and J. D. Bears, When can a clinical trial be called 'randomized'? Vaccine 2003; 21: 468–472. 2. V. W. Berger, A. Ivanova, and M. D. Knoll, Minimizing predictability while retaining balance through the use of less restrictive randomization procedures. Stat. Med. 2003; 22: 3017–3028. 3. V. W. Berger, Selection Bias and Covariate Imbalances in Randomized Clinical Trials. New York: John Wiley & Sons, 2005.
4. W. Rosenberger and J. M. Lachin, Randomization in Clinical Trials: Theory and Practice. New York: John Wiley & Sons, 2002. 5. J. P. Matts and R. B. McHugh, Conditional Markov chain design for accrual clinical trials. Biomet. J. 1983; 25: 563–577. 6. D. Blackwell and J. Hodges, Design for the control of selection bias. Ann. Mathemat. Stat. 1957; 28: 449–460. 7. R. Smith, Sequential treatment allocation using biased coin designs. J. Royal Stat. Soc. Series B 1984; 46: 519–543. 8. R. Smith, Properties of biased coin designs in sequential clinical trials. Ann. Stat. 1984; 12: 1018–1034. 9. B. Efron, Forcing a sequential experiment to be balanced. Biometrika 1971; 58: 403–417. 10. S. Stigler, The use of random allocation for the control of selection bias. Biometrika 1969; 56: 553–560. 11. S. M. Marcus, A sensitivity analysis for subverting randomization in controlled trials. Stat. Med. 2001; 20: 545–555. 12. V. W. Berger, Pros and cons of permutation tests in clinical trials. Stat. Med. 2000; 19: 1319–1328. 13. V. W. Berger and D. Exner, Detecting selection bias in randomized clinical trials. Control. Clin. Trials 1999; 20: 319–327. 14. V. W. Berger, Do not use blocked randomization. Headache 2006; 46: 343. 15. T. Dupin-Spriet, J. Fermanian, and A. Spriet, Quantification methods were developed for selection bias by predictability of allocations with unequal block randomization. J. Clin. Epidemiol. 2005; 58: 1269–1276. 16. J. M. Lachin, Statistical properties of randomization in clinical trials. Control. Clin. Trials 1988; 9: 289–311. 17. L. J. Wei and L. M. Lachin, Properties of the urn randomization in clinical trials. Control. Clin. Trials 1988; 9: 345–364. 18. J. F. Soares and C. F. J. Wu, Some restricted randomization rules in sequential designs. Communicat. Stat. Theory Methods 1982; 12: 2017–2034. 19. Y. P. Chen, Biased coin design with imbalance tolerance. Communicat. Stat. Stoch. Models 1999; 15: 953–975. 20. J. H. Klotz, Maximum entropy constrained balance randomization for clinical trial. Biometrics 1978; 34: 283–287. 21. W. C. Grant and K. J. Anstrom, Minimizing selection bias in clinical trials: a Nash
equilibrium approach to optimal randomization. Proceedings of the American Statistical Association, Biopharmaceutical Section, 2005, pp. 545–551.
22. ICH 1998. International Conference on Harmonisation of technical requirements for registration of pharmaceuticals for human use. ICH topic E9. Note for guidance on statistical principles for clinical trials. 23. EMEA 2003. European Agency for the Evaluation of Medicinal Products. Committee for Proprietary Medicinal Products. Points to consider on adjustment for baseline covariates. CPMP/EWP/2863/99. 24. D. R. Taves, Minimization: a new method of assigning patients to treatment and control groups. Clin. Pharmacol. Ther. 1974; 15: 443–453. 25. S. J. Pocock and R. Simon, Sequential treatment assignment with balancing for prognostic factors in the controlled clinical trial. Biometrics 1975; 31: 103–115. 26. V. W. Berger, A novel criterion for selecting covariates. Drug Informat. J. 2005; 39: 233–241. 27. O. Nordle and B. O. Brantmark, A self-adjusting randomization plan for allocation of patients into two treatment groups. Clin. Pharm. Ther. 1977; 22: 825–830. 28. A. C. Atkinson, Optimum biased-coin designs for sequential treatment allocation with covariate information. Stat. Med. 1999; 18: 1741–1752. 29. M. Zelen, Play the winner rule and the controlled clinical trial. JASA 1969; 64: 131–146. 30. L. J. Wei and S. Durham, The randomized play-the-winner rule in medical trials. JASA 1978; 73: 840–843. 31. J. H. Ware, Investigating therapies of potentially great benefit: ECMO (with discussion). Stat. Sci. 1989; 4: 298–340. 32. T. G. Karrison, D. Huo, and R. Chappell, Group sequential, response-adaptive designs for randomized clinical trials. Control. Clin. Trials 2003; 24: 506–522. 33. L. J. Wei, R. T. Smythe, D. Y. Lin, and T. S. Park, Statistical inference with data-dependent treatment allocation rules. JASA 1990; 85: 156–162. 34. C. Jennison and B. W. Turnbull, Group Sequential Methods with Applications to Clinical Trials. New York: Chapman & Hall/CRC, 1999. 35. V. W. Berger, Admissibility of exact conditional tests of stochastic order. J. Stat. Plan.
Infer. 1998; 66: 39–50. 36. J. W. Tukey, Tightening the clinical trial. Control. Clin. Trials 1993; 14: 266–285. 37. S. Wieand and K. Murphy, A commentary on 'Treatment at random: the ultimate science or the betrayal of Hippocrates?' J. Clin. Oncol. 2004; 22: 5009–5011. 38. V. W. Berger, Is allocation concealment a binary phenomenon? Med. J. Aust. 2005; 183: 165. 39. V. W. Berger, The reverse propensity score to detect selection bias and correct for baseline imbalances. Stat. Med. 2005; 24: 2777–2787.
RANDOMIZATION SCHEDULE
OLGA M. KUZNETSOVA
Merck & Co., Inc., Rahway, New Jersey

The randomization schedule is used in a clinical trial to assign subjects randomly to one of the treatment regimens studied in the trial. Randomization helps to make the treatment groups comparable in baseline characteristics (both measured and not measured) and thus lessen the potential for bias in the evaluation of the treatments. It also allows causal inference with respect to intervention effects. The schedule is generated prior to the trial start and provides the list of treatment assignments by subject number (allocation number). In a stratified trial, a range of allocation numbers is designated for each stratum, and a separate randomization schedule is provided for each stratum. Subjects randomized into the trial are assigned allocation numbers in order of entry in their respective randomization stratum and receive the corresponding treatment. Typically, the randomization schedule is also used to prepackage the study drug and label the drug kits with the allocation number (and the study visits to be given to the patient). The schedule is concealed until the trial is closed and the database is locked to ensure the proper blinding (masking) of the trial. The choice of the randomization procedure for a trial should be evaluated carefully. The first section outlines relevant considerations and describes the most commonly used allocation procedure—permuted block randomization. Two approaches to generation of the permuted block schedule are discussed. Although it is the most popular choice for masked trials, permuted block randomization leads to a relatively high fraction of predictable assignments in open-label trials. The next section describes alternative allocation procedures used in open-label trials, including variations of the permuted block randomization. Another limitation of the permuted block randomization is the loss of balance in treatment assignments it incurs in heavily stratified studies with a large number of incomplete blocks. The section entitled "Schedules to mitigate loss of balance in treatment assignments due to incomplete blocks" describes the allocation procedures that can be used to overcome this deficiency—constrained randomization, balanced-across-the-centers randomization, and incomplete block balanced-across-the-centers randomization. It is explained how to generate randomization schedules for such procedures; the references to publications that contain more details are provided. When a dynamic allocation is used in a study to balance the randomization on several predictors, a randomization schedule is not generated in advance. Instead, the list of dynamically assigned treatments is documented during the allocation process. The last section discusses two issues related to the use of the randomization schedule: forced allocation that develops in multicenter studies with central randomization, and the distinction between the patient allocation schedule and the drug codes schedule that needs to be appreciated to avoid partial unblinding in randomized trials.

1 PREPARING THE SCHEDULE

The preparation of the randomization schedule starts with the selection of the randomization procedure appropriate for the study. The choice of the procedure mainly depends on what types of bias are the most relevant for the study (those that randomization is supposed to ameliorate). It also depends on the local tools available for randomization and drug distribution. For example, tools to support a permuted block randomization are widespread in the clinical trials world, whereas other types of randomization algorithms might not be in easy reach.

1.1 Choice of the Randomization Procedure

In open-label trials, selection bias and observer bias are serious concerns. If the investigator that randomizes patients knows the treatment to be assigned to the next patient, he might be influenced by this knowledge in his decision to enroll or not to
enroll the next candidate. The likely result is that the treatment groups are not comparable in their characteristics (selection bias) (1). Thus, a randomization sequence that makes it harder to predict the next treatment assignment might be sought for an open-label trial to lessen the opportunity for selection bias. Selection bias is a minor issue in double-masked trials, in which neither patients nor investigators have knowledge of the treatment assignments. In such trials, cases of unblinding are rare—mainly when necessitated by a serious adverse event—and do not necessarily make any of the subsequent treatment assignments predictable. Even if the selection bias in an open-label trial is lessened by making the sequence of treatment assignments less predictable, observer bias—which is a bias in evaluation of the response—can occur as soon as the treatment assignment of the patient becomes known to the investigator. Ultimately, the evaluator has to be dissociated from the allocation process and not be aware of the treatment assignment of the evaluated patient whenever possible. In double-masked trials, in which cases of unblinding are rare, the preference might still be given to a randomization sequence that is more robust with respect to an observer bias. Although randomization in general helps to promote balance in known as well as unknown covariates, the allocation procedures differ in their susceptibility to an accidental bias—which is a bias associated with imbalance in unobserved covariates (2). Comparative properties of different randomization procedures with respect to accidental bias can be found in Reference 3. One common type of accidental bias is the bias associated with the unknown changes in the study population (or response to treatment) with time. An example of this is a study in which subjects enrolled earlier have a more severe disease than subjects enrolled later on. When this bias is a concern, the allocation procedures that provide a tighter balance in treatment assignments not only at the end of randomization, but throughout the enrollment period, will help lessen accidental bias (4). This might be especially important in studies with planned interim analyses.
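To illustrate the point about balance throughout enrollment, the short simulation below compares the running imbalance under complete randomization with that under permuted blocks. The settings (100 subjects, 1:1 allocation, block size 4) are illustrative assumptions only.

```python
# Sketch: running treatment imbalance under complete randomization versus
# permuted blocks (block size 4), for a 1:1 two-arm trial of 100 subjects.
import random

random.seed(2024)
N, BLOCK = 100, 4

complete = [random.choice((+1, -1)) for _ in range(N)]   # coin flip per subject

blocked = []
for _ in range(N // BLOCK):
    block = [+1, +1, -1, -1]
    random.shuffle(block)                                 # permute within block
    blocked.extend(block)

def max_running_imbalance(assignments):
    imbalance, worst = 0, 0
    for a in assignments:
        imbalance += a
        worst = max(worst, abs(imbalance))
    return worst

print("max imbalance, complete randomization:", max_running_imbalance(complete))
print("max imbalance, permuted blocks:       ", max_running_imbalance(blocked))
```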
1.2 Stratified Randomization
Stratified randomization is used when a tighter balance in baseline covariates (mainly those known to be strong predictors of the outcome parameters) than would be achieved without controlling for these covariates is required. If the covariates are categorical, then a separate randomization schedule is prepared for each stratum defined by a combination of the levels of the baseline covariates, to balance the treatment assignments within each stratum. Continuous variables are categorized by breaking the range of values into several segments before stratified randomization can be employed. It is often more efficient to conduct the analysis with the continuous covariate in the model even if the covariate was categorized to achieve balance at randomization. In multicenter trials, the allocation is typically stratified by center, as recommended by the International Conference on Harmonisation (ICH) E9 Guidance (5). Often the allocation is stratified by center even when centers are not expected to differ in treatment response, to facilitate the logistics of drug distribution in the absence of a centralized randomization system. In this case, randomization is performed within each center according to the randomization schedule prepared for the center, and the drug kits labeled with the respective allocation numbers are sent to each center prior to randomization.

After the randomization procedure is selected, it is implemented in the schedule using standard local randomization tools or by writing customized code. The SAS software package (6,7) allows easy generation of a randomization schedule using the SAS procedure PLAN (PROC PLAN) or random number generators (8). The ICH E9 Guidance (5) requires the schedule to be reproducible, which can be achieved by saving the code used to generate it.

1.3 Permuted Block Randomization Schedule
In masked clinical trials, permuted block randomization is the most commonly encountered approach to randomization. It can be used for allocation to any number of treatment groups in specified ratios, and it has a
potential to provide a good balance in treatment assignments throughout the enrollment as well as at the end of it. It also supports randomization in stratified trials.

The permuted block schedule is built of a sequence of blocks of fixed length (block size). Each block contains a set of the treatment codes entered in specified ratios and randomly permuted within the block. The smallest possible block size is equal to the sum of the treatment ratios, provided the ratios have no common divisor other than 1. For example, in a parallel design study where patients are to be allocated randomly in a 2:2:1 ratio to one of three treatments (Treatment 1, Treatment 2, or Treatment 3), the smallest possible block size is 5. All such blocks are a random permutation of (1,1,2,2,3). An example of a permuted block schedule with block size 5 is the following sequence of treatment assignments (with blocks bracketed to illustrate the technique): (2, 1, 1, 3, 2), (1, 1, 2, 3, 2), (3, 2, 1, 2, 1), ... Larger block sizes—any multiple of 5—can also be used. If, for example, a block size of 10 is chosen, then each block will be a random permutation of (1, 1, 1, 1, 2, 2, 2, 2, 3, 3). The block size should not be disclosed in the protocol or otherwise revealed to subjects, investigators, or raters.

The choice of the block size for a permuted block schedule is influenced by the type of study blinding (masked or open-label), the expected number of blocks that might be left incomplete at the end of enrollment (which mainly depends on the number of strata), and whether a tight balance in treatment assignments throughout the study is important because of an anticipated trend in baseline covariates or drug supply issues. The approach often taken in double-masked studies is to have a block size large enough to include at least two subjects on each treatment. Otherwise, if a treatment is assigned to a single patient in a permuted block, and a patient who was unblinded because of a serious adverse event happened to be assigned this treatment, then it is known with certainty that none of the other patients in the block received the same treatment. Such knowledge might lead to observer bias, and, if the unblinding happened early enough so that some subjects in the block are yet to
be enrolled, to selection bias as well. Such a situation is largely eliminated if each treatment occurs at least twice in a permuted block. With this approach, in a study with four treatments assigned in a 1:1:1:1 ratio, a block size of at least eight (twice the size of the smallest block) will be used. However, in a three-arm study with 3:2:2 ratios, the smallest block size (7) can be used. In open-label trials, larger block sizes are preferable, regardless of the allocation ratio, to leave more upcoming treatment assignments unpredictable.

When the permuted block randomization is stratified (for example, by center in a multicenter trial), each stratum is assigned a set of permuted blocks. If the study has many small strata, then using large blocks will lead to a large proportion of incomplete blocks and thus a potentially suboptimal balance in treatment assignments. In such circumstances, requiring at least two instances of each treatment in a permuted block might not be practical, especially if the minimal block size is large. In some cases, an acceptable solution might be to use constrained blocks or a balanced-across-the-centers schedule (see the sections entitled ''Constrained Randomization'' and ''Randomization Balanced Across the Centers'' below), but if the possibility of unblinding is low, then using the smallest block size might be the best option. However, blocks of size 2—the smallest blocks for 1:1 allocation—are almost never used.

Two approaches can be used to generate a permuted block schedule. In the most commonly used approach, the permuted blocks are drawn at random with replacement from the set of all existing permuted blocks with the specified contents. It is easy to implement programmatically, and the resulting schedule might have some blocks appear several times while not necessarily listing all existing blocks of the given structure. An alternative approach is to list all existing permuted blocks with the specified block size and treatment ratio and randomly permute them in order (provided the set is not too large). The procedure can be repeated to generate as many blocks as needed. This approach was described by Zelen (10) for an example of a two-group study with 1:1 allocation that uses permuted blocks of size 4. This strategy can be implemented with SAS (8) and can be used
in multicenter studies to balance the blocks across the centers.

Although the permuted block schedule is the most popular choice in clinical trials, it has two weaknesses. First, it incurs a relatively high fraction of predictable assignments in open-label trials. Second, it incurs a loss of balance in treatment assignments in heavily stratified studies with a large number of incomplete blocks. When either of these issues is a concern for a trial, other randomization procedures can be used.
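As a concrete illustration of the first approach (blocks drawn at random with replacement), the following Python sketch generates a center-stratified permuted block schedule for the 2:2:1 example above. It is an illustrative sketch only, not the SAS PROC PLAN implementation referenced in the text; the number of centers, the allocation numbers per center, and the seed are hypothetical choices.

```python
# Sketch: permuted block schedule with blocks drawn at random with
# replacement, stratified by center. A 2:2:1 ratio gives the base block
# (1,1,2,2,3); block_multiple = 2 would give blocks of size 10 instead.
import random

def permuted_block_schedule(ratios, block_multiple, n_numbers, rng):
    """Return a list of n_numbers treatment codes for one stratum."""
    block = [code for code, k in enumerate(ratios, start=1) for _ in range(k)]
    block = block * block_multiple
    schedule = []
    while len(schedule) < n_numbers:
        rng.shuffle(block)              # draw the next block at random
        schedule.extend(block)
    return schedule[:n_numbers]         # the last block may be left incomplete

rng = random.Random(20080515)           # saved seed keeps the schedule reproducible
schedule = {center: permuted_block_schedule((2, 2, 1), 1, 20, rng)
            for center in range(1, 31)} # hypothetical 30-center trial, 20 numbers per center
print(schedule[1][:10])                 # first 10 allocation numbers at center 1
```

Saving the seed and the generating code, as recommended above, keeps the schedule reproducible.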
2 SCHEDULES FOR OPEN-LABEL TRIALS
For open-label trials, in which selection bias is an issue, the permuted block schedule has a downside: one or more allocations at the end of the block are known with certainty once all previous allocations in the same block are filled. Even though the block size should never be revealed to the investigators, it can possibly be deduced from the observed allocation pattern. One sure way to avoid selection bias in an open-label trial is to use the totally unpredictable complete randomization, in which each subject is allocated with the probability determined by the targeted allocation ratio. Such randomization, however, might result in a considerable imbalance in treatment assignments, especially in smaller trials or trials stopped early, in which such an imbalance might appreciably lower the power of the treatment comparison. An imbalance in treatment assignments throughout the enrollment might also occur with complete randomization, which creates a potential for accidental bias if the baseline characteristics of the randomized patients change with time. Because of these limitations, other randomization procedures that lessen the selection bias while providing a good balance in the treatment assignments at the end of or throughout the enrollment are often used in open-label trials.

2.1 Variations of Permuted Block Randomization
One common randomization procedure used to lessen the selection bias in open-label trials
is a variation of the permuted block schedule in which permuted blocks of variable block size are used (5). With that approach, the block sizes are selected randomly from a prespecified set (for example, from the set of 4, 6, and 8 for 1:1 allocation). The limitations of this strategy are discussed in Reference 3, where it is noted that under the Blackwell and Hodges (1) model for selection bias, variable block allocation results in selection bias approximately equal to that encountered using a schedule with the average block size (across the block sizes used to generate the schedule). Still, the expected number of assignments that can be predicted with certainty is smaller with the variable block size schedule than with a sequence of permuted blocks of the average size.

Another variation of the permuted block schedule is a schedule in which blocks not only vary in size but also are slightly unbalanced in treatment assignments. This variation can be accomplished by shifting an assignment or two from each block in the randomization sequence to a neighboring block. The treatment codes are then permuted within the unbalanced blocks; with this arrangement, the sequence of assignments is almost balanced at the ends of the blocks. The potential for selection bias of such schedules has not been formally evaluated, but the absence of a preplanned balance in treatment assignments in the randomization sequence might limit the ability of the investigator to predict the next treatment assignment.

2.2 Other Randomization Procedures
Other randomization procedures that provide a good balance in the treatment assignments at the end of or throughout the enrollment, such as the random allocation rule (3), biased coin designs (2), and designs based on urn models (see the review in Reference 11), are used as alternatives to the permuted block schedule in open-label trials. The susceptibility of these procedures, as well as of the permuted block randomization, to selection bias is examined by Rosenberger and Lachin (3). Another randomization design, called the maximal procedure (4), was recently proposed as an alternative to a sequence of permuted blocks of small sizes. In a two-arm
trial with a 1:1 allocation ratio, the maximal procedure allows any sequence of treatment assignments that keeps the imbalance no larger than a specified maximum imbalance b throughout the enrollment. Such an allocation procedure provides less potential for selection bias than a sequence of permuted blocks of size 2b. The maximal procedure can also be extended to more than two treatments.

An effective way to eliminate selection bias is to randomize patients in blocks [a technique called ''block simultaneous randomization'' by Rosenberger and Lachin (3, p. 154)]. This technique is feasible, for example, in twin studies with 1:1 allocation, in which each set of twins is assigned a block of two treatments in random order, or in studies with family-based randomization. Such randomization, however, is not always practical in other circumstances, as it relies on a block of subjects who arrive simultaneously for the randomization visit.
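The maximal procedure can be sampled by counting admissible continuations. The Python sketch below is one way to do this for the two-arm 1:1 case; it is an illustrative construction under the stated constraint (terminal balance and running imbalance at most b), not the implementation given in Reference 4, and the values of n, b, and the seed are arbitrary.

```python
# Sketch of the maximal procedure for a 1:1 two-arm trial: every sequence of
# n assignments that ends balanced and never lets |#A - #B| exceed b is
# equally likely. Completions are counted by dynamic programming and the
# sequence is sampled one assignment at a time.
import random
from functools import lru_cache

def maximal_procedure(n, b, rng):
    assert n % 2 == 0, "n must be even so the sequence can end balanced"

    @lru_cache(maxsize=None)
    def completions(i, d):
        # Number of admissible ways to finish, given that i assignments have
        # been made and the current imbalance is d = (#A - #B).
        if abs(d) > b:
            return 0
        if i == n:
            return 1 if d == 0 else 0
        return completions(i + 1, d + 1) + completions(i + 1, d - 1)

    sequence, d = [], 0
    for i in range(n):
        n_a = completions(i + 1, d + 1)   # continuations if the next subject gets A
        n_b = completions(i + 1, d - 1)   # continuations if the next subject gets B
        if rng.random() * (n_a + n_b) < n_a:
            sequence.append("A")
            d += 1
        else:
            sequence.append("B")
            d -= 1
    return sequence

rng = random.Random(2003)                 # arbitrary seed
print("".join(maximal_procedure(12, b=2, rng=rng)))
```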
3 SCHEDULES TO MITIGATE LOSS OF BALANCE IN TREATMENT ASSIGNMENTS BECAUSE OF INCOMPLETE BLOCKS

The balance in treatment assignments normally expected with permuted block randomization can be impaired in stratified trials if many blocks are left incomplete at the end of enrollment (12). Typically, this is a concern in studies with a large number of strata, as every stratum is likely to have its last block not fully allocated. For example, in a 30-center study stratified by center, as many as 30 blocks could be incomplete. Stratification by two factors, center and gender, might bring the number of incomplete blocks to 60, which might include most permuted blocks if the centers are small. This can have a detrimental impact on the balance in treatment assignments.

Often, when a randomization stratified by center is expected to result in many incomplete blocks, a better overall balance in treatment assignments is achieved when patients are randomized centrally, in order of their randomization time, regardless of what center they belong to. Central randomization,
however, might lead to a considerable imbalance in treatment assignments in some centers. Some centers might even have only one treatment (or not the full set of treatments) assigned to their subjects. An imbalance in treatment assignments within centers might lead to problems at the analysis stage, or even to bias in the efficacy or safety assessment if center effects matter because of differences in standards of care or patient populations that may affect the intervention effects. Drug use is also suboptimal when the centers are not balanced in treatment assignments, as in this case the drug kits will have to be replenished more frequently than would be necessary with a balanced allocation. Therefore, even with small centers, there is merit in having the allocation reasonably balanced within the centers.

3.1 Constrained Randomization
In multicenter studies stratified by center, the lack of balance within incomplete blocks is an issue mostly when the block size is large. For example, in a three-arm study with a 5:5:2 allocation to treatments A, B, and C, the smallest block size is 12. In such long blocks, the treatments can be distributed very unevenly, with assignments to one treatment gathered mainly at the beginning of the block and other treatments gathered at the end of the block. An extreme example is the block (1,1,1,1,1,2,2,2,2,2,3,3), which is very unbalanced in time. If only 6 subjects are enrolled at the site and such a block is left incomplete, then the site will have five subjects randomized to Treatment 1, one subject randomized to Treatment 2, and no subjects randomized to Treatment 3.

One way of keeping incomplete blocks reasonably balanced is to use constrained randomization following Youden (13,14). Constrained randomization restricts the set of permuted blocks that can be used in the allocation schedule to the blocks that have the treatment assignments better balanced within a block. The set of ''better balanced'' blocks has to be specified prior to randomization schedule generation. In the example of the three-arm study above, ''better balanced'' blocks of size 12 can
be defined in the following way: first, five permuted blocks of size 2, each containing a random permutation of Treatments 1 and 2, are lined up. Then two Treatment 3 assignments are inserted at random—one among the first six assignments, another among the last six assignments. Two examples of the constrained blocks so defined are (1,3,2,2,1,2,1,1,2,2,3,1) and (2,1,1,2,3,1,2,2,3,1,2,1). A different, less restrictive set of ''better balanced'' blocks can be defined as the set of blocks of 12 that have either a random permutation of (1,1,1,2,2,3) followed by a random permutation of (1,1,2,2,2,3), or a random permutation of (1,1,2,2,2,3) followed by a random permutation of (1,1,1,2,2,3). The maximal procedure (4) described above can also be used to generate constrained permuted blocks within which the maximum imbalance in treatment assignments will not exceed a prespecified limit.

The rules that specify how the schedule is built from the specified set of constrained blocks can also assign different probabilities of inclusion in the schedule to different blocks. Rosenberger and Lachin (3) offer some general principles for defining rules for constrained randomization sequences. In particular, they point out that for a two-arm trial with equal allocation, the set of constrained blocks should be symmetric with respect to the two treatments. Similarly, if a study with more than two arms has an allocation ratio symmetric with respect to a subset of treatments, then the set of rules that defines the acceptable set of randomization sequences should be symmetric with respect to the same set of treatments. It can be added that no bias should exist with respect to the placement of any of the treatments within blocks. For example, in a three-arm study with 5:5:2 allocation to treatments A, B, and C, it would be unacceptable to have all the constrained blocks start with treatment A. The set of the constrained blocks that can be used in the randomization schedule should be defined in such a way that the 5:5:2 ratio holds across all the first allocations in the blocks (as well as across all second allocations, all third allocations, etc.). Rosenberger and Lachin (3) quote Cox (15), who justifies the use of prespecified permutations rather than all existing permutations in
a randomization procedure, but they point out that the impact of such a procedure on inferences under the randomization model needs to be considered carefully. Constrained randomization in heavily stratified trials provides a better balance in treatment assignments not only at the end of enrollment, but throughout the enrollment. That makes it less susceptible to the accidental bias associated with a time trend in covariates and also helps to have the groups balanced in treatment assignments at the time of an interim analysis. Constrained randomization schedules can be generated easily with SAS PROC PLAN (8,16).
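The first constrained-block construction above is easy to express in code. The following Python sketch is an illustrative example, not the SAS PROC PLAN implementation cited in the text; the function name and seed are hypothetical.

```python
# Sketch of the constrained-block construction described above for a 5:5:2
# allocation to Treatments 1, 2, and 3 (block size 12): five permuted pairs
# of Treatments 1 and 2 are lined up, then one Treatment 3 assignment is
# inserted at a random position in the first half of the block and one in
# the second half.
import random

def constrained_block_552(rng):
    pairs = []
    for _ in range(5):
        pair = [1, 2]
        rng.shuffle(pair)
        pairs.extend(pair)                    # ten assignments to Treatments 1 and 2
    first_half = pairs[:5]
    first_half.insert(rng.randint(0, 5), 3)   # Treatment 3 among the first six positions
    second_half = pairs[5:]
    second_half.insert(rng.randint(0, 5), 3)  # Treatment 3 among the last six positions
    return first_half + second_half           # a block of 12 that is never badly unbalanced in time

rng = random.Random(13)
print(constrained_block_552(rng))
```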
3.2 Randomization Balanced Across the Centers
Another way to lessen the imbalance caused by incomplete blocks in a multicenter trial is to have the blocks balanced across the centers. For example, consider a 40-center study stratified by center and gender with a 2:1:2 allocation to Treatments 1, 2, and 3, in which most centers are expected to randomize six to nine patients across the two genders. Even if the permuted blocks of the smallest size (5) are used in the randomization schedule, at the end of enrollment most permuted blocks will be left incomplete, and the balance in treatment assignments within the gender groups will be suboptimal. The balance within the gender groups might be improved by balancing the schedule for each gender across the centers.

The simplest way to do so would be to follow Zelen's (10) approach to schedule generation. Thirty different permutations of (1,1,2,3,3) can be created. They can all be listed, permuted randomly in order, and assigned to the first 30 centers in the schedule for males. The procedure is repeated to generate one more set of 30 permuted blocks, the first 10 of which are assigned to centers 31–40 in the schedule for males. Thus, at least among the first 30 centers, an exact balance in treatment assignments will be prepared for the male patients allocated first, second, third, fourth, and fifth, respectively, in their centers. If the same number of males (for example, three) is allocated in all 30 centers, then a perfect balance in treatment assignments will exist across these centers in the male gender stratum, even though all blocks are incomplete. In reality, the balance will not be perfect because the numbers of allocated male patients will vary across centers. Also, the balance will not be perfect across the remaining 10 centers. Nevertheless, such an arrangement provides a better balance in treatment assignments within the male gender stratum than an allocation schedule built from randomly selected permuted blocks.

The balance can be improved even more if the set of 30 existing permuted blocks is broken into 6 balanced sets of 5 blocks (see the example of two such sets in Table 1). The five permuted blocks within each set have a 2:1:2 allocation ratio across the row of subjects allocated first, second, and so on. To prepare the schedule for the first 30 centers, the six balanced sets are permuted in order. After that, the five blocks within each set are permuted in order. The blocks are then distributed to the centers in the order they are listed on the schedule, so that the first set of five balanced blocks is assigned to centers 1–5, the second set of five balanced blocks is assigned to centers 6–10, and so on (8,11). Table 1 shows an example of the 10 blocks from such a schedule sent to the first 10 centers in the study. Thus, if the five centers within the same set enroll approximately the same number of male patients, then the balance in treatment assignments across these centers in the male gender stratum will be good even
if all blocks are incomplete. The procedure is repeated to generate another schedule for the next set of 30 centers. The first two sets of five balanced blocks from this additional schedule are assigned to centers 31–40, which improves the balance across these 10 centers compared with the earlier, simpler balancing.
Table 1. Two Balanced Sets of Permuted Blocks Sent to the First 10 Centers

Order of Allocation                     Center
Within Center          1    2    3    4    5    6    7    8    9    10
1st subject            2    1    3    1    3    2    3    1    1    3
2nd subject            3    2    1    3    1    3    1    3    1    2
3rd subject            1    3    3    1    2    1    1    2    3    3
4th subject            3    1    1    2    3    1    3    3    2    1
5th subject            1    3    2    3    1    3    2    1    3    1
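One convenient way to construct such balanced sets, consistent with the pattern in Table 1, is to take the five cyclic shifts of a permuted block: the 30 permutations of (1,1,2,3,3) then split into exactly six sets of five blocks, and within each set every row (1st, 2nd, ... subject across the five centers) retains the 2:1:2 ratio. The Python sketch below is an illustrative construction, not necessarily the algorithm used by the authors or by the SAS programs they cite; the seed is arbitrary.

```python
# Sketch: build 6 balanced sets of 5 permuted blocks of (1,1,2,3,3) as
# cyclic-shift orbits, then permute the sets and the blocks within each set
# to prepare a schedule balanced across the first 30 centers.
import random
from itertools import permutations

def rotations(block):
    return [block[k:] + block[:k] for k in range(len(block))]

def balanced_sets(base=(1, 1, 2, 3, 3)):
    blocks = set(permutations(base))        # the 30 distinct permuted blocks
    sets_of_five = []
    while blocks:
        orbit = rotations(next(iter(blocks)))   # 5 cyclic shifts = one balanced set
        sets_of_five.append(orbit)
        blocks -= set(orbit)
    return sets_of_five                     # 6 balanced sets of 5 blocks

rng = random.Random(2006)
sets_of_five = balanced_sets()
rng.shuffle(sets_of_five)                   # permute the 6 sets in order
for s in sets_of_five:
    rng.shuffle(s)                          # permute the 5 blocks within each set
schedule_for_males = [block for s in sets_of_five for block in s]   # centers 1-30
print(schedule_for_males[:5])               # the blocks sent to centers 1-5
```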
3.3 Incomplete Block Balanced-Across-the-Centers Randomization
Balancing across the centers is mostly advantageous when the minimal block size is large, so that small centers would otherwise have a poor balance in treatment assignments. If the centers are too small compared with the minimum block size, then an incomplete block design with incomplete blocks balanced across the centers can be considered. Consider an example of a six-arm study with a 2:3:3:4:4:4 allocation to arms A, B, C, D, E, and F. The minimal block size for such a design is 20, whereas the study centers are expected to enroll only about five to eight patients each. In this case, the consecutive blocks of 20 allocations can be built of permuted blocks of size 5 of the following four types: ABDEF, ACDEF, BCDEF, and BCDEF. The treatments are permuted within each block of five, and the four blocks of five are permuted in order within each set of four consecutive blocks (20 allocations). These incomplete blocks taken together lead to the required 2:3:3:4:4:4 allocation ratio. The allocation is executed by sending the incomplete blocks of five to the study centers.

A balanced-across-the-centers allocation schedule can be considered a single-factor case of the factorial stratification proposed by Sedransk (17). Factorial stratification was designed to provide, in multifactor randomization, a close balance across all first, second, third, and subsequent assignments in all strata. It is most beneficial when subjects are evenly distributed across the strata, which is not often the case in clinical trials. The applications of factorial stratification in clinical trials deserve further exploration.

3.4 Dynamic Allocation
Having many blocks left incomplete at the end of randomization is a common problem for studies stratified by center. If the number of centers is large, then this stratification factor has many levels (often as many as 50–150). Introduction of an additional stratification factor doubles an already large number of strata and might not be feasible. Another situation in which stratified randomization leads to many incomplete blocks is a study with many important predictors. For example, in a cancer trial, as many as 10 strong predictors may exist; an imbalance in any of them would make the study results hard to interpret. Balance in that many predictors, however, cannot be accomplished with stratified randomization in a moderate-size study of 50–200 patients: with more stratification cells than patients, most of the blocks will be left incomplete (18).

A baseline-adaptive dynamic allocation procedure such as minimization (19,20) can be used to balance the treatment groups with respect to many factors. With minimization, a new patient is allocated to the treatment that in some sense minimizes the imbalance across all factor levels to which this patient belongs. In the Pocock–Simon (20) version of minimization, a random element is also involved at every step to allow a patient, with a small probability, to be allocated to a treatment other than the one that minimizes the imbalance. This makes the allocation less deterministic than Taves' (19) version.
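To make the minimization step concrete, here is a simplified Python sketch of a Pocock–Simon-type rule for two arms. It is an illustrative example only: the factor names, the biased-coin probability of 0.8, and the use of the absolute marginal difference as the imbalance measure are assumptions made for the sketch, not prescriptions from the article.

```python
# Simplified sketch of Pocock-Simon-type minimization for two arms: the new
# patient goes to the arm that minimizes the total marginal imbalance over
# his/her factor levels, with a small random chance of taking the other arm.
import random
from collections import defaultdict

ARMS = ("A", "B")

def minimization_assign(patient_levels, counts, rng, p_best=0.8):
    """patient_levels: e.g. {"center": "07", "gender": "F", "age": "<65"}.
    counts[(factor, level)] holds how many patients with that level are
    already on each arm."""
    imbalance = {}
    for arm in ARMS:
        total = 0
        for factor, level in patient_levels.items():
            c = counts[(factor, level)].copy()
            c[arm] += 1                       # pretend the new patient joins this arm
            total += abs(c["A"] - c["B"])     # marginal imbalance for this level
        imbalance[arm] = total
    if imbalance["A"] == imbalance["B"]:
        chosen = rng.choice(ARMS)             # tie: plain 1:1 randomization
    else:
        best = min(ARMS, key=lambda a: imbalance[a])
        other = "B" if best == "A" else "A"
        chosen = best if rng.random() < p_best else other
    for factor, level in patient_levels.items():
        counts[(factor, level)][chosen] += 1  # update the running marginal counts
    return chosen

rng = random.Random(1975)
counts = defaultdict(lambda: {"A": 0, "B": 0})
print(minimization_assign({"center": "07", "gender": "F", "age": "<65"}, counts, rng))
```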
When a dynamic allocation procedure is used in a clinical trial, a randomization schedule cannot be prepared in advance. In baseline-adaptive procedures, the treatment assignment for a patient is determined by his/her baseline covariates and by the allocations assigned to previously allocated patients. In response-adaptive procedures, the allocation also depends on previously observed responses. Instead of an allocation schedule, for dynamic allocation procedures the list of treatment assignments is documented by recording each treatment assignment after it is made. To keep a complete record of how the dynamic allocation was executed, the values of the parameters that determine the allocation should also be included in the list. For example, if a dynamic procedure determines the allocation of a new subject based on his/her gender and age, the treatments already allocated at the subject's study center, the drug supplies available at the study center, and the random element, then the values of all these parameters at the time of randomization should be recorded. This provides evidence that the dynamic allocation procedure was executed as planned.

4 ISSUES RELATED TO THE USE OF THE RANDOMIZATION SCHEDULE

4.1 Forced Allocation
Sometimes, mainly in multicenter studies with centralized randomization, so-called forced randomization (21) is employed to overcome a shortage of drug supplies at certain centers. For example, consider a multicenter study with 8 treatment arms in which subjects are randomized centrally, following one permuted block randomization schedule with a block size of 16. Suppose that at the study start each center was sent 16 drug kits, 2 for each arm. If two subjects are allocated to Treatment A in Center 1, and a third subject happens to be allocated to the same Treatment A before the drug supplies are replenished at the center, the center will have no drug A to give to this patient. In such a case, the forced allocation will find the smallest allocation number in the randomization sequence
that does have the corresponding bottle available at Center 1. Thus, the patient will be ''force-allocated''—allocated not according to the schedule, but in violation of the schedule. The next patient eligible for randomization will be assigned the first unfilled allocation number in the randomization sequence and will receive the corresponding bottle.

This practice of skipping allocation numbers in the allocation sequence and backfilling them with later patients has a potential for partial unblinding of the study. An obvious example is a study with a rare treatment group, for example, a study with 5:5:1 allocation to treatment groups A, B, and C. If one block of 11 drug kits (5 + 5 + 1) is sent to each site at the beginning of the trial, and a center has a patient force-allocated before 5 patients are randomized in this center, then it is certain that the skipped allocation belongs to group C. Thus, the treatment to be assigned to the patient who later backfills this allocation is known in advance, which may lead to selection and observer bias. Another opportunity for unblinding with forced allocation develops when there is a known shortage of drug supplies of a certain kind—for example, when an experimental drug is not available until a new batch is produced. In that case, the allocation numbers skipped on the schedule are known to belong to the experimental drug. When new drug supplies are distributed to the sites and massive backfilling of the skipped allocation numbers begins, the opportunity for selection bias as well as observer bias presents itself. Such a situation is also prone to accidental bias related to the time of randomization, because no patients are randomized to the experimental drug for some time. For example, in a seasonal allergic rhinitis study in which the levels of allergens vary considerably with time, randomization that is not balanced across time may lead to biased results.

The opportunity for unblinding can be reduced in studies with forced allocation if the dynamic nature of such allocation is acknowledged and properly accommodated. One of the recommended elements of such a study [described in a different context by Rosenberger and Lachin (3), p. 159] is to follow a subject throughout the study by his or her
screening number, while keeping the allocation number and the treatment assigned at randomization concealed. After the allocation is determined, the appropriate drug kit coded with a secret drug code is given to the patient. Thus, it remains unknown whether a patient was allocated following the randomization schedule or skipping some allocation numbers on it. This system prevents forced randomization from causing either selection or observer bias. However, it does not help to avoid the accidental bias related to the time of randomization. In general, before resorting to forced allocation, the opportunity for unblinding should be thoroughly assessed, and other allocation options should be considered. A viable alternative is a dynamic allocation that includes the treatment imbalance at the site in the set of parameters of the allocation algorithm. The requirements for the initial shipment of drug to the sites and the resupply rules should be designed carefully to avoid introducing bias or an opportunity for partial unblinding.

4.2 The Distinction Between the Randomization Schedule and the Drug Codes Schedule
Most commonly, the drug kits are labeled directly with the allocation numbers according to the randomization schedule. This technique allows distributing drugs in multicenter studies with site-based randomization by sending to each site the set of drug kits that corresponds to the site's schedule. However, in some studies, centralized, rather than site-based, randomization is called for—mainly when it is important to stratify the randomization by several parameters. In these cases, patients in the same stratum across all centers are randomized following the randomization schedule prepared for this stratum. Because with centralized randomization the allocation numbers to be assigned to patients in a given center are not known in advance, the drug kits cannot be labeled with the allocation numbers. Instead, the drug kits are labeled with secret drug codes. The centralized randomization is executed with the help of an interactive voice response system (IVRS) or
a web-based computerized system. When a patient is ready to be randomized, the site coordinator contacts the centralized system and enters the patient's baseline covariates. The system assigns the allocation number to the patient and tells the site coordinator which of the kits stored at the site to give to the patient at the randomization visit. At all subsequent visits, the system is contacted again to receive the code of the kit to be distributed at the visit according to the treatment regimen to which the patient was allocated.

The schedule that maps each type of drug kit (for example, placebo bottles, experimental drug bottles, and active control bottles) to a sequence of codes is not unlike the randomization schedule. It is generated with the use of a random element to preserve the blinding. Often a permuted block schedule that maps kit numbers in a specified range (say, from 1000 to 2000) to one of the drug types is adequate for simpler study designs. Thus, it is very tempting to use the same tools that generate the allocation schedule to generate the drug codes schedule. However, caution has to be exercised here, and the study design as well as the drug distribution and resupply patterns have to be assessed for unblinding possibilities. The essential difference between the two schedules is that the allocation numbers are assigned in sequential order—1, 2, 3, and so on—whereas the drug codes for the distributed kits move along the sequence of codes for a particular drug type and are not sequential. The progression of the drug codes with time can lead to partial unblinding of the treatment assignments.

For example, consider a study that has a 2-week active run-in period followed by a 2-week treatment period with 1:1 allocation to active treatment or placebo. One bottle of either active drug tablets or matching placebo tablets is distributed at the beginning of each period. If a common permuted block schedule with a 3:1 active-to-placebo ratio is used to generate the drug codes schedule for the whole study, then during the active run-in only the active bottles will be distributed and the drug codes in the active sequence will grow, whereas those in the placebo sequence will not. Therefore, at the randomization visit it will be easy to
tell the patients allocated to placebo from those allocated to the active treatment arm: the patients allocated to placebo will receive the drug kits with low component IDs, whereas those allocated to the active drug will receive the drug kits with higher component IDs. The same considerations apply to any study in which the treatment ratio changes across the treatment periods, including adaptive design studies. The easy fix might be to permute the drug codes randomly within each sequence, so that they are not dispensed in increasing order (9). This distinction between the allocation schedule and the drug codes schedule needs to be appreciated to avoid unblinding of the study.
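As a sketch of this fix, the following Python fragment maps a hypothetical range of kit numbers to drug types with a permuted block schedule and then randomly permutes the dispensing order of the codes within each type, so that the codes handed out for a given type do not simply increase over time. The kit number range, the 3:1 ratio, and the seed are illustrative assumptions.

```python
# Sketch of the "easy fix" described above: build the kit-number-to-type map
# with permuted blocks, then permute the dispensing order within each type.
import random

rng = random.Random(2000)
kit_numbers = list(range(1000, 1400))          # hypothetical kit number range
types = []
while len(types) < len(kit_numbers):
    block = ["active"] * 3 + ["placebo"]       # 3:1 active:placebo per block
    rng.shuffle(block)
    types.extend(block)
kit_type = dict(zip(kit_numbers, types))       # kit number -> drug type

# Dispensing order within each type: a random permutation, not ascending order.
dispense_order = {t: [k for k in kit_numbers if kit_type[k] == t]
                  for t in ("active", "placebo")}
for t in dispense_order:
    rng.shuffle(dispense_order[t])
print(dispense_order["placebo"][:5])           # first five placebo kits to hand out
```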
5 SUMMARY

The permuted block randomization schedule is used in most clinical trials to allocate the subjects randomly to a treatment regimen. It is much studied and well understood, and it nicely supports stratified randomization. Most research institutions have standard tools to generate such schedules, sparing the statistician the need to do any actual coding. All this makes the permuted block schedule the first choice for the allocation procedure. The approach has two deficiencies: a relatively high fraction of predictable assignments in open-label trials and a loss of balance because of many incomplete blocks in heavily stratified trials. In open-label trials, one of the alternative allocation procedures described in the section entitled ''Schedules for Open-Label Trials'' can be used. In heavily stratified trials, an allocation schedule can be generated following constrained randomization, balanced-across-the-centers randomization, or incomplete block balanced-across-the-centers randomization. Implementation of such procedures requires customized programming, which can be done easily in SAS (8). A careful review and validation of a customized randomization program is required to ensure that the schedule is generated correctly. The study statistician can write and test the program using a random seed that is later changed to generate the actual allocation schedule. This process keeps the study statistician blinded with respect to the randomization schedule.

When the customized procedures that employ an allocation schedule generated prior to study start are not sufficient for the study needs—for example, when balance on several predictors is required—a dynamic allocation can be used. A randomization schedule is not prepared in advance for such dynamic allocation procedures; instead, a program is written to assign a treatment dynamically to a new patient based on his/her set of covariates. It is very important to test the program thoroughly before using it to allocate the patients, to avoid mistakes in allocation.

Some practical issues must be considered when choosing an allocation procedure for a study. First, will there be sufficient drug supplies at the study centers to execute the allocation according to a schedule? Will forced allocation be allowed in case the required type of drug kit is absent at the center at the time of randomization, and what implications will it have? What will be done if a patient is allocated to a wrong treatment by mistake? Will partial unblinding be introduced in an IVRS study by the way the schedule for drug codes is generated? Will the resupply pattern provide clues regarding the treatment assignment for some patients? The resolution of all these practical issues, and not simply the fact that a randomization schedule was used to allocate the subjects, ensures that bias is not introduced into the study by the allocation procedure.

REFERENCES
1. D. Blackwell and J. Hodges Jr., Design for the control of selection bias. Ann. Mathemat. Stat. 1957; 28: 449–460.
2. B. Efron, Forcing a sequential experiment to be balanced. Biometrika 1971; 58: 403–417.
3. W. F. Rosenberger and J. M. Lachin, Randomization in Clinical Trials. New York: John Wiley & Sons, 2002.
4. V. W. Berger, A. Ivanova, and M. Knoll, Minimizing predictability while retaining balance through the use of less restrictive randomization procedures. Stat. Med. 2003; 22: 3017–3028.
5. International Conference on Harmonisation, Guidance on Statistical Principles for Clinical Trials. ICH E-9 Document, 1998. pp. 49583–49598.
6. SAS Institute Inc., SAS Procedures, Version 8. Cary, NC: SAS Press, 2000.
7. SAS Institute Inc., SAS/STAT User's Guide, Version 8. Cary, NC: SAS Press, 2000.
8. O. Kuznetsova and A. Ivanova, Allocation in randomized clinical trials. In: A. Dmitrienko, C. Chuang-Stein, and R. D'Agostino, eds. Pharmaceutical Statistics Using SAS. Cary, NC: SAS Press, 2006.
9. O. M. Kuznetsova, Why permutation is even more important in IVRS drug codes schedule generation than in patient randomization schedule generation. Control. Clin. Trials 2000; 22: 69–71.
10. M. Zelen, The randomization and stratification of patients to clinical trials. J. Chron. Dis. 1974; 27: 365–375.
11. L. J. Wei and J. M. Lachin, Properties of the urn randomization in clinical trials. Control. Clin. Trials 1988; 9: 345–364.
12. A. Hallstrom and K. Davis, Imbalance in treatment assignments in stratified blocked randomization. Control. Clin. Trials 1988; 9: 375–382.
13. W. J. Youden, Inadmissible random assignments. Technometrics 1964; 6: 103–104.
14. W. J. Youden, Randomization and experimentation. Technometrics 1972; 14: 13–22.
15. D. R. Cox, Planning of Experiments. New York: Wiley, 1958.
16. C. Song and O. M. Kuznetsova, Implementing constrained or balanced-across-the-centers randomization with SAS v8 Procedure PLAN. PharmaSUG Proceedings, 2003. pp. 473–479.
17. N. Sedransk, Allocation of sequentially available units to treatment groups. Proceedings of the 39th Session of the International Statistical Institute, Book 2, 1973. pp. 393–400.
18. T. M. Therneau, How many stratification factors are 'too many' to use in a randomization plan? Control. Clin. Trials 1993; 14: 98–108.
19. D. R. Taves, Minimization: a new method of assigning patients to treatment and control groups. Clin. Pharmacol. Therapeut. 1974; 15: 443–453.
20. S. J. Pocock and R. Simon, Sequential treatment assignment with balancing for prognostic factors in the controlled clinical trial. Biometrics 1975; 31: 103–115.
21. D. McEntegart, Forced randomization when using interactive voice response systems. Appl. Clin. Trials 2003; 12: 50–58.
FURTHER READING
S. Senn, Statistical Issues in Drug Development. New York: John Wiley & Sons, 1997.
CROSS-REFERENCES
Biased-coin Randomization
Block Randomization
Minimization
Randomization Methods
Randomization Procedures
Stratified Randomization
Simple Randomization
RANK-BASED NONPARAMETRIC ANALYSIS OF COVARIANCE
JEAN-MARIE GROUIN
INSERM U657 and AFSSaPS, France

The analysis of covariance (ANCOVA) is a standard technique for comparing two or more groups while adjusting for categorical or continuous variables that are related to the response variable (i.e., the so-called covariates). There are two main applications of ANCOVA in clinical trials. In randomized clinical trials, ANCOVA is mainly intended to improve the efficiency of the analysis, because the validity of unadjusted comparisons is a priori guaranteed by randomization, by using the relationship between baseline covariates and the outcome to reduce the errors when comparing treatment groups. In some observational studies, such as postmarketing drug trials, it is necessary to adjust for all important potential confounders that are systematically associated with both the outcome and the studied treatment. These different aspects and meanings of ANCOVA have been described comprehensively, particularly in Cox and McCullagh (1).

ANCOVA was originally performed in the context of the normal linear model. Subsequently, nonparametric ANCOVA methods have been developed to address departures from the underlying assumptions of this model, that is, when these cannot be deemed reasonable [for example, see Puri and Sen (2), Hettmansperger and McKean (3,4), and more recently, Koch et al. (5,6) and Leon et al. (7)]. Among them, the first methods were historically based on ranks and specifically address departures from normality. Two kinds of methods exist. The first are ''pure'' rank methods that replace the original data by their ranks; these procedures are essentially test-oriented. The second, called the ''R methods'', are rank-based semiparametric methods that offer, among other things, a robust estimation of the treatment effect, but they rely on some additional assumptions that are not mandatory for the pure rank methods. Although extensions to generalized linear models (8) and some survival models (4) have occurred, these rank-based nonparametric ANCOVA methods will be presented in the linear model framework only.

1 THE ANCOVA MODEL ASSUMPTIONS AND RANK NONPARAMETRIC METHODS
Consider, without loss of generality, the comparison between two groups on a continuous outcome after adjusting for one covariate only, typically a placebo-controlled randomized efficacy clinical trial of a new drug with a continuous response variable y and its baseline x as a covariate. The full ANCOVA model assumes that the response values are related linearly and additively to the treatment effect and the covariate, and that an interaction may occur between the treatment and the covariate. It has the following form:

yij = µ + αi + βi xij + eij,

where yij and xij are, respectively, the outcome and covariate values for the jth patient (j = 1, ..., ni) in the ith treatment group (i = 1, 2), µ is the overall mean, and αi and βi are the treatment parameter and the covariate slope for the ith group, respectively. The errors or residuals eij are assumed to be independent and identically distributed as a normal with mean 0 and the same variance σ2 (eij ∼ N[0, σ2]). After testing that no interaction exists between the treatment and the covariate, so that the treatment effect is fully interpretable, the usual ANCOVA model for estimating the treatment effect assumes a common covariate slope (β = β1 = β2) for both treatments:

yij = µ + αi + βxij + eij.

It is worth noting that the model-based ANCOVA does not require each covariate to have the same distribution in the groups being compared, although this assumption is highly desirable for interpretive reasons. In randomized trials, such an assumption is guaranteed by the randomization process for
the baseline covariates, which are unaffected a priori by the treatment. In nonrandomized observational studies, this assumption does not hold anymore; the covariates considered are generally not comparable across the groups and, moreover, can be affected by the treatment; the model-based ANCOVA is thus needed to adjust for possible biases when comparing groups.

The method for estimating the parameters is based on the Least Squares (LS) method, that is, on the minimization of the sum of squared residuals, which are a function of the parameters (eij = yij − [µ + αi + βxij]). From a robustness viewpoint, the problem is that this method tends to place too much weight on the extreme errors when the data have a distribution with tails heavier than those of the normal. Severe departures from the normality assumption are usually caused by the presence of outlying values, which may become a critical issue in clinical trials as they may unduly influence the estimates of the treatment effect, Δ = α2 − α1. The LS estimate of the treatment effect is shown to be

Δ̂ = (ȳ2 − ȳ1) − β̂(x̄2 − x̄1),

where β̂ is the LS estimate of the covariate slope, and ȳi and x̄i are the outcome and covariate means of the ith group. These means may be influenced unduly by extreme values, and the estimated slope can reflect an undue swing in the direction of some influential outliers, particularly those distant from the cloud of points. Moreover, even if randomization a priori ensures group balance for each covariate on average (i.e., the expectation of x̄2 − x̄1 is 0), perfect balance is never achieved for a given randomization, and therefore the size of β̂(x̄2 − x̄1) may be substantial in practice. When outlying values occur in the response variable, which is the most frequent situation, omitting the covariate is not a solution to the problem because the unadjusted analysis, based on the response variable only, would also be unduly influenced by these outlying values. Among various diagnostic tools, an analysis based on the examination of the studentized residuals (i.e., residuals standardized by the estimated standard deviation and by the location of the point in the predictor
space) can be performed to identify these outlying values, usually indicated by studentized residuals more extreme than ±3 or even ±2.

There are two main sources of outlying values observed in clinical trials. The first comes from patients completing the trial who are true outliers because their response to the treatment differs substantially from that of the others. The other frequent source of outlying values comes from patients who failed to complete the trial or treatment and whose outcome value is thus missing. Since the primary analysis of most clinical trials relies on the ''intent-to-treat'' principle, the missing value is usually estimated by an adequate method and ''replaced'' by a new value that may differ substantially from the values of patients completing the trial. Therefore, the actual distribution of the primary outcome, and thus of the errors, is often a mixture of two distributions. The main distribution concerns the outcome values of patients completing the trial, and the other distribution is based on the values of patients who failed to complete the trial, usually a minority. Together, they result in a skewed distribution, usually with one heavy tail, because of the influence of drop-outs. Outlying values may occur either in the response variable or in the covariates. In randomized clinical trials, the covariates considered are usually those measured at baseline, and potential outlying values in these baseline covariates are often limited by inclusion/exclusion criteria. In contrast, outlying covariate values can be more frequent in observational studies. In randomized clinical trials, ANCOVA is intended essentially to increase the efficiency of the analysis; however, when outlying values occur in the covariates, the value of this method is diminished.

Rank nonparametric methods essentially address failures of the explicit distributional assumption (i.e., the normality of the errors). They are less sensitive to outliers because the influence of extreme values is minimized by the ranking process, and therefore they adequately address the crucial issue of heavy-tailed error distributions. Although they may
contribute to the reduction or even the suppression of the impact caused by violations of the assumptions other than normality, such as the homogeneity of the error variances or the linearity of the relationship between the outcome and the covariates, they can only address failures of these assumptions indirectly.

Although the LS method is optimal for normal errors, it may have poor power properties relative to alternative nonparametric rank methods, especially when errors have heavy-tailed distributions. This can be illustrated by considering the two-group problem and assuming a ''location shift'' model, that is, the distributions of the response values are the same in both groups in all respects except for a location shift Δ (i.e., y1j and y2j − Δ have identical distributions). This shift is just the difference between any pair of location parameters, for example, the means or the medians of the two groups. We can compare the LS and the Rank procedures by considering their estimates of the shift Δ. In the two-group problem, the LS estimate is just the difference in means (Δ̂LS = ȳ2 − ȳ1), and the Rank estimate, also called the Wilcoxon–Hodges–Lehmann estimate (9), is the median of all paired differences (Δ̂W = med{y2j − y1k} over all pairs j, k). The comparison of the two procedures can be based on the concept of asymptotic relative efficiency (ARE) developed by Pitman (1948) in a series of unpublished lecture notes (see also chapter 2 of Ref. 4, for example). The ARE of the Wilcoxon estimate relative to the LS estimate, ARE(Wilcoxon, LS), is defined as the ratio of the asymptotic precisions of the two estimates. When errors are normally distributed, it can be shown that ARE(Wilcoxon, LS) is 3/π ≈ 0.955; therefore, little efficiency is lost by using the Rank nonparametric approach, with the gain of a greater applicability. The ARE can be larger than 1 even with slight departures from the normal, and theoretically it could reach infinity. In conclusion, in addition to their robustness properties and therefore their greater applicability, rank nonparametric methods are as efficient as, if not more efficient than, their LS counterparts in many realistic situations in clinical trials.
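As a small illustration of this comparison, the Python sketch below computes both shift estimates for two simulated groups with occasional gross outliers. The data-generating choices (contamination rate, outlier spread, sample sizes, seed) are arbitrary assumptions made for the example, not values from the article.

```python
# Illustrative sketch: the LS estimate of the shift (difference in means)
# versus the Wilcoxon-Hodges-Lehmann estimate (median of all pairwise
# differences) for two simulated groups with heavy-tailed, contaminated errors.
import random
from statistics import mean, median

rng = random.Random(1948)
shift = 1.0
def contaminated(rng):
    # normal error plus a 10% chance of a gross outlier
    return rng.gauss(0, 1) + (rng.gauss(0, 10) if rng.random() < 0.1 else 0)

group1 = [contaminated(rng) for _ in range(50)]
group2 = [shift + contaminated(rng) for _ in range(50)]

ls_estimate = mean(group2) - mean(group1)
hl_estimate = median(y2 - y1 for y2 in group2 for y1 in group1)
print("LS (difference in means):        ", round(ls_estimate, 3))
print("Hodges-Lehmann (median of diffs):", round(hl_estimate, 3))
```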
2 RANK METHODS FOR ANCOVA
These methods are ''pure'' rank methods because they replace the original outcome values by their ordered ranks. They are essentially test-oriented procedures that rely on very few assumptions and have therefore been used extensively in clinical trials. Since they generally offer no simple estimate or confidence interval for the treatment effect, they are usually not used as the primary analysis of the primary endpoint. However, as they are known to be robust against outliers, they have frequently been used in clinical trials as alternative sensitivity analyses to the primary parametric analysis. Most of the theory, as well as the derivation of most rank statistics, is based on the assumption of continuity of the underlying distribution functions (i.e., ties are excluded). There is not much literature regarding ties, although some theory is available, especially for linear rank statistics on independent observations [Lehmann (10), Behnen (11)], and the usual approach is based on Koch and Sen (12) and Koch (13), who recommend breaking these ties by using the ''mid-rank'' method (see below).

The rank procedures are very general, as the continuous distribution function of the outcome values is of unspecified form. While the classic parametric method requires that the outcome variable have units of measurement that are equal or have the same meaning across its entire range, the rank methods may relax the assumption of an equal interval scale. Therefore, Rank procedures are still valid even if the ''location shift'' model does not hold. Unlike the model-based classic ANCOVA, Rank ANCOVA methods require, to be valid, that the covariates have identical distributions in the groups being compared. Therefore, Rank methods for ANCOVA can be applied to randomized clinical trials but not to observational clinical studies. One highly desirable property of pure rank methods is that their statistics are distribution free (i.e., their distributions under the null hypothesis of no treatment effect, for example, do not depend on the original error distribution); hence their great applicability.
Another attractive feature is that their exact distributions can be derived, at least theoretically, from the rank permutation principle. However, when the sample size is large, as for most phase II or III efficacy trials, computations become cumbersome, and the derivation of their distributions can only rely on asymptotic considerations.

2.1 The Quade Method
The Rank method for ANCOVA was originally presented by Quade (14) in the one-way ANCOVA model with one or more continuous covariates; thus, no factor other than the treatment and no interaction terms between the treatment and the continuous covariates are allowed. In this approach, the outcome values yij are replaced by their ordered ranks (i.e., from 1 to n), denoted by rank[yij], and mid-ranks are assigned to ties when they occur. For example, assuming that the eighth and ninth lowest outcome values are tied, they are both assigned the mid-rank 8.5. Denote by ryij the deviation of the rank from the rank mean,

ryij = rank[yij] − (n + 1)/2.

Then the same ranking process is applied independently to the covariate values xij:

rxij = rank[xij] − (n + 1)/2.
The first step of Quade's method consists of fitting an ordinary LS regression of the ryij on the rxij, leaving the treatment effect out of the model, that is, ryij = β rxij + εij (in this case, the slope β is the Spearman correlation coefficient between the outcome and covariate values). The second step consists of taking the residuals from this regression and performing the usual analysis-of-variance Fisher test to compare the treatment groups. This Rank ANCOVA does not test the same hypothesis as its parametric counterpart, that is, the equality of adjusted means; rather, it tests that the conditional distributions of y|x (the distribution of the outcome given a covariate value), which are of unspecified form, are identical. For example, if the Fisher test is significant, it will be concluded that the distribution of y given x is not the same in the two treatment groups; that is, the y|x values in one group will tend to be greater than those in the other.
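To make the two steps concrete, the following Python sketch applies the Quade procedure to simulated data for two groups and one covariate. It is an illustrative example only (the article points to SAS implementations); the simulated data-generating choices are arbitrary, and scipy's rankdata with its default averaging of tied ranks provides the mid-rank treatment of ties.

```python
# Sketch of the Quade procedure for two groups and one covariate on
# simulated data: rank outcome and covariate (mid-ranks for ties), regress
# the ranked outcome on the ranked covariate without a treatment term, then
# compare the residuals between groups with a one-way ANOVA F test.
import numpy as np
from scipy.stats import rankdata, f_oneway

rng = np.random.default_rng(14)
n = 60
group = np.repeat([1, 2], n // 2)
x = rng.normal(size=n)                                    # baseline covariate
y = 0.8 * x + 0.5 * (group == 2) + rng.standard_t(df=3, size=n)  # heavy-tailed errors

ry = rankdata(y) - (n + 1) / 2                            # centered ranks of the outcome
rx = rankdata(x) - (n + 1) / 2                            # centered ranks of the covariate
coef = np.polyfit(rx, ry, 1)                              # rank regression, no treatment term
residuals = ry - np.polyval(coef, rx)
f_stat, p_value = f_oneway(residuals[group == 1], residuals[group == 2])
print(f"Quade F test: F = {f_stat:.2f}, p = {p_value:.3f}")
```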
In terms of efficiency properties, if the errors are assumed to be normally distributed and only one covariate is considered, the ARE of the Quade test relative to the classic ANCOVA test can be shown to be

(3/π) (1 − ρP²) / (1 − ρS²),

with ρP and ρS denoting the Pearson and Spearman correlation coefficients, respectively. When no linear relationship exists between the covariate and the response variable (ρP = ρS = 0), this is the two-group situation, the ARE is equal to 3/π ≈ 0.955, and there is little loss in efficiency. When there is a perfect linear relationship (ρP = ρS = ±1), it can be shown that the ARE decreases to √3/2 ≈ 0.866, which is still not a great loss in efficiency. However, if the error distribution deviates from normality, then the ARE can be much larger than 1 and the gain in efficiency may be substantial.

Also in the context of the one-way ANCOVA model with one or more continuous covariates, Puri and Sen (2) have developed a unified theory for distribution-free rank order tests that generalizes the Quade test. Although slightly different in its formulation, their test statistic is similarly based on residuals from regression fits of ranks of outcome values on ranks of covariate values. They have derived its exact rank permutation distribution and have also shown that it is asymptotically distributed as a chi-square. The use of general rank scores that are a function ϕ[r] of the original ranks (r = 1, ..., n) is allowed, and their procedure is actually equivalent to the Quade method when these scores are the original ranks. Their procedure can account for outliers in the data much more efficiently by using rank scores adequate to the form of the distribution of errors.
For example, the sign scores ϕ[r] = sign(r − (n + 1)/2) can be an adequate set of rank scores when the distribution of errors is particularly asymmetric.

Quade's method can also be combined with the randomization model framework of extended Cochran–Mantel–Haenszel statistics to perform rank comparisons between treatments after adjusting for the effects of one or more continuous covariates. This methodology has been described by Koch et al. (15,16). It can easily be adapted to the situation in which there are multiple strata (i.e., categorical covariates). For example, in a multicenter randomized controlled efficacy trial of a new drug with a continuous outcome and one continuous covariate (e.g., the baseline), the comparison between treatments should adjust for center. The first step is to rank the outcome and covariate values independently within each center and to fit, for each center separately, a linear regression model without the treatment effect, as in Quade's method. The second step consists of comparing the treatment groups by computing the center-stratified Cochran–Mantel–Haenszel statistic on the regression residuals obtained from the different fits. It should also be emphasized that the Quade method and its extensions are limited in that they cannot specifically test interactions between the treatment and the covariates. The Quade method and its extensions, particularly in the framework of extended Cochran–Mantel–Haenszel statistics, can be easily implemented in any general statistical software [for example, see Stokes et al. (17) for such an implementation in the SAS system].

2.2 The Conover–Iman Rank Transform Method
Conover and Iman (18) have proposed a ''rank transform'' (RT) procedure that has become very popular and has often been used in efficacy clinical trials as an alternative sensitivity analysis to the primary analysis of
the primary endpoint, especially in the presence of suspected influential outlying values. It simply consists of replacing the outcome values by their ordered ranks and then performing the LS test based on these ranks and assuming that the asymptotic distributions of both tests are the same. It can be easily implemented in any general statistical package, including a classic linear regression program, which may partly explain its popularity. The Conover and Iman method is only a testing procedure like the Quade method. However, unlike the Quade method, it allows the use of a general ANCOVA model with possible interactions between the treatment and covariates. However, although Hora and Conover (19) presented some asymptotic theory for treatment effect in a randomized block design with no interaction, there is actually no general supporting theory for the RT and this procedure has been criticized by many authors [see Fligner (20), Akritas (21), McKean and Vidmar (22), Brunner and Puri (23) and chapter 4 of Hettmansperger and McKean (4) for a recent discussion]. Fligner (20) and Akritas (21) highlighted that the RT is a nonlinear transformation on the outcomes values; hence, the original hypotheses of interest that are linear contrasts in model parameters are often no longer being tested by this rank procedure. McKean and Vidmar (22) illustrated it by presenting an ANCOVA example where the covariate slopes for the treatment groups are not homogeneous, (i.e., there is a significant interaction between the groups and the covariate so that proceeding with the test on the treatment effect is not reasonable). However, the RT test on homogeneous slopes, that is applied to these data and is actually no longer testing homogeneous slopes on the original scale, is not significant and is misleading in this case. The second reason for which the RT procedure does not work has been identified by Akritas (21), who has highlighted that the homogeneity of error variances is not transferred to the asymptotic RT statistics in general. Furthermore, many authors [see Brunner and Neumann (24) and Sawilowsky et al. (25)] have shown that RT tests have undesirable asymptotic and small-sample properties in most realistic
designs. Therefore, except under the particular conditions given by Brunner and Puri (26), the RT should not be regarded as a general technique for deriving rank statistics. 2.3 Other Rank Methods for ANCOVA Other rank methods that can account for the effects of covariates and their interactions with the treatment have been developed more recently by Akritas and Arnold (27) and by Brunner and Puri (23,26). Unlike the RT method, they are supported by a full asymptotic theory. However, these methods are not yet known to have been applied to clinical trials.
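To make the mechanics of the Quade approach described in this section concrete, the following minimal sketch (not taken from the cited sources; the simulated data, variable names, and two-group setup are illustrative assumptions) ranks the outcome and the covariate over the whole sample, regresses the outcome ranks on the covariate ranks without a treatment term, and then compares the regression residuals between the two treatment groups with a one-way ANOVA F test.

```python
# Illustrative sketch of Quade's rank analysis of covariance (one covariate,
# two treatment groups); the data and names are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 40
group = np.repeat([0, 1], n // 2)              # treatment indicator
baseline = rng.normal(size=n)                  # continuous covariate
outcome = 0.5 * baseline + 0.4 * group + rng.standard_t(df=3, size=n)

r_y = stats.rankdata(outcome)                  # ranks of the response
r_x = stats.rankdata(baseline)                 # ranks of the covariate

# Regress response ranks on covariate ranks WITHOUT a treatment term.
slope, intercept, *_ = stats.linregress(r_x, r_y)
resid = r_y - (intercept + slope * r_x)

# Compare the residuals between treatment groups (one-way ANOVA F test).
f_stat, p_value = stats.f_oneway(resid[group == 0], resid[group == 1])
print(f"Quade-type F = {f_stat:.2f}, p = {p_value:.3f}")
```

A center-stratified version along the lines of Koch et al. (15,16) would instead rank the outcome and covariate within each center and replace the final F test by a Cochran–Mantel–Haenszel statistic computed on the residuals.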
3 R METHODS FOR ANCOVA
Unlike the pure rank methods, the R methods offer a full analysis with a robust estimation of the treatment effect and its confidence interval, with diagnostic tools for model criticism and with a robust testing procedure. Their capacity to yield an estimate of the treatment effect is particularly important in some clinical trials where the risk–benefit of a drug has to be assessed. Either in superiority or noninferiority efficacy clinical trials, the efficacy of a drug and its clinical relevance cannot be inferred from a test result only. The size of the treatment effect and its confidence interval also have to be estimated. As for the pure Rank methods, the R methods apply to continuous response variables with a distribution function of unspecified form. However, as semiparametric methods, they require that the response variable should have an equal interval scale so that the ‘‘location shift’’ model can be assumed. Lastly, unlike the pure rank methods for ANCOVA, the R methods for ANCOVA do not require the assumption of identical covariate distributions in the groups being compared. The R methods can be used in both contexts (i.e., in randomized clinical trials and observational clinical studies). 3.1 The Jaeckel–Hettmansperger–McKean (JHM) Method This semiparametric estimation method was originally described by Jaeckel (28) and
then extended by McKean and Hettmansperger (22,29,30) and Hettmansperger and McKean (3,4,31) to hypothesis testing and confidence procedures in the general linear model. It can be applied to the ANCOVA model with any kind of covariates, continuous or categorical, and also interaction terms between the treatment and these covariates. It is supported by a full asymptotic theory unlike the RT procedure. The JHM method has been fully and comprehensively described in Hettmansperger and McKean (4). This method can be briefly summarized by considering the general linear model, the ANCOVA model being a particular case. The response values yj (j = 1, . . . ,n) are assumed to follow a general linear model, yj = x j θ + ej , where θ denotes the vector of parameters (i.e., the treatment parameters, the covariate slopes and the possible interaction terms in the ANCOVA model). In this model formulation, there is no intercept and the predictors x (i.e., the covariate values, the treatment and the possible interaction indicators in the ANCOVA model) are assumed to be centered, thus the residuals ej are not centered and therefore, for example, the treatment effect must be interpreted as a shift in the location of the residuals. The residuals are assumed to be identically distributed with a density that is of unspecified form. Hettmansperger and McKean seek estimates of the parameters θ that minimize the residual dispersion D(θ ), which is defined as a weighted sum of the residuals, D(θ ) =
Σ_{j=1}^{n} w_j × e_j = Σ_{j=1}^{n} w_j × (y_j − x_j θ),
where w_j is the weight assigned to the residual e_j. For example, the LS estimate θ̂_LS minimizes Σ_{j=1}^{n} e_j² = Σ_{j=1}^{n} e_j × e_j,
and it is clearly observed that, as the weights are the residuals themselves, too much weight is assigned to extreme residuals. Hettmansperger and McKean propose the use of weights that are a score function of
the ranks of the residuals, wj = ϕ(Rank[ej ]), to give less weight to the largest residuals. They therefore seek the rank-based estimate, the so-called R-estimate θˆR , that minimizes the sum of residuals weighted by their rank scores, D(θ ) =
Σ_{j=1}^{n} ϕ(Rank[e_j]) × e_j. When the rank scores on the residuals are the deviations of the ranks from the mean rank, ϕ(Rank[e_j]) = Rank[e_j] − (n + 1)/2,
these yield the Wilcoxon R analysis that extends the Wilcoxon–Hodges–Lehmann (9) method for the two-group problem to any linear model and therefore to any ANCOVA model. The difference between the RT and the R analysis is obvious. In the RT, the ranks are applied indiscriminately to the nonidentically distributed responses; therefore, they are not free of the predictors. In contrast, in the R method, the ranks are applied to the residuals and are adjusted by the fit so that they are free of the predictors. Therefore, the R method allows an analysis based on the same original scale and respects the original parametrization of the linear hypotheses tested. Once the parameters are estimated, a location of the R residuals (ê_j = y_j − x_j θ̂_R) can be estimated. Hettmansperger and McKean have proposed two options (see chapter 3 of Ref. 4, for example). The estimate can be the median of the residuals, med_j{ê_j}, when the data are heavily skewed, or the median of the Walsh averages, med_{k≤j} {(ê_k + ê_j)/2}, when the error distribution is symmetric. The location of the residuals can then be used in association with the other parameter estimates to estimate the treatment effect in each group separately, which is analogous to the least squares means.
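As a rough illustration of the estimation idea just described (a sketch under stated assumptions, not the RGLM or MINITAB implementations referred to elsewhere in this section), the Wilcoxon R fit can be obtained by directly minimizing Jaeckel's rank-based dispersion over the parameter vector. The simulated data, the centering of the predictors, and the variable names are hypothetical.

```python
# Illustrative sketch: Wilcoxon-score R fit of a simple ANCOVA model
# (treatment effect plus one covariate) by minimizing Jaeckel's dispersion.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import rankdata

rng = np.random.default_rng(1)
n = 60
group = np.repeat([0.0, 1.0], n // 2)            # treatment indicator
baseline = rng.normal(size=n)                    # covariate
y = 1.0 * group + 0.8 * baseline + 0.5 * rng.standard_cauchy(size=n)  # heavy-tailed errors

# Centered predictors, as in the model formulation above (no intercept).
X = np.column_stack([group - group.mean(), baseline - baseline.mean()])

def dispersion(theta):
    """Jaeckel's dispersion: residuals weighted by centered (Wilcoxon) rank scores."""
    e = y - X @ theta
    scores = rankdata(e) - (n + 1) / 2.0         # phi(r) = r - (n + 1)/2
    return np.sum(scores * e)

r_fit = minimize(dispersion, x0=np.zeros(2), method="Nelder-Mead")
ls_fit = np.linalg.lstsq(X, y - y.mean(), rcond=None)[0]
print("R (Wilcoxon) estimates:", np.round(r_fit.x, 3))
print("LS estimates          :", np.round(ls_fit, 3))
```

With heavy-tailed errors such as these, the rank-based estimates typically stay close to the true coefficients while the LS estimates can be pulled around by a few extreme residuals, which is the robustness property discussed above.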
The R estimates of the parameters have approximate asymptotic normal distributions. The derived expressions of these estimates and their confidence intervals are identical to their LS counterparts except that the variance σ² is replaced in the R analysis by a scale parameter τ² that plays the same role as σ² does in the LS analysis. This scale parameter depends on the residual density f and also on the choice of the rank scores. For example, it can be shown that τ² = (√12 ∫ f²)^−2 for Wilcoxon scores. A consistent estimate of τ was derived by Koul et al. (32). The ARE of the R estimate with regard to the LS estimate is the ratio of their asymptotic precisions and is shown always to be σ²/τ², whichever ANCOVA model is considered. When errors are normally distributed, the ARE for Wilcoxon scores is again 3/π ≈ 0.955, as for the simple two-group problem. The loss in efficiency with respect to the classic analysis for the normal distribution is only 5%, whereas the gain in efficiency over the classic analysis for long-tailed residual distributions can be substantial. Optimal rank scores can be chosen according to the form of the density of the residual distribution to get an asymptotically efficient analysis (see McKean and Sievers (33) and chapter 3 of [Ref. 4]). If the residual distribution is not known in advance, which is often the case in clinical trials, and given their properties with regard to their LS counterparts, the Wilcoxon scores should be the default scores. While the Wilcoxon estimates are robust against outlying response values, they may not be appropriate for outlying covariate values. In randomized clinical trials, the application of inclusion and exclusion criteria usually prevents the observation of such outlying values on the baseline covariates. Such values may occur, however, in observational studies, and in this case weighted Wilcoxon scores [see chapter 5 of [Ref. 4] and Terpstra and McKean (34), for example] can be used to handle outlying values either on the response or on the covariate. Hettmansperger and McKean have proposed a testing procedure analogous to the LS procedure. The classic Fisher test is based on the reduction in the sum of residual squares between the full model and the ''reduced''
model under the null hypothesis tested. For example, when testing the treatment effect, the parametrization of the model assumes that there is no treatment effect under the null hypothesis (i.e., α1 = α2 = α), which corresponds to a ''reduced'' version of the vector of parameters. Hettmansperger and McKean (3,4) have derived the corresponding rank-based test F_R based on the reduction in residual rank-based dispersion between the full model and the reduced model under the null hypothesis, and have shown that this test is asymptotically distributed as a chi-square. Of note, this test is not distribution free because it partly depends on the scale parameter of the error density function. Some small-sample studies [see McKean and Sheather (35)] have indicated that the use of an F distribution for the F_R test is even more adequate (the empirical α level of F_R is close to the nominal value). Similarly, the examination of the model fit can be performed with various diagnostics, particularly the plot of the so-called studentized R residuals, which are analogous to the LS studentized residuals and are useful in identifying outliers [see chapters 3 and 5 of [Ref. 4] and Naranjo et al. (36), for example]. As R estimates tend to fit the heart of the data more closely than their LS counterparts, R residual plots can often detect outliers that are missed in LS residual plots. The Wilcoxon R analysis can be performed in MINITAB (37) (using the rregr command) or with the RGLM software package written by Kapenga et al. (38), which allows the fit of a general linear model and the use of various rank scores other than the Wilcoxon scores. A Web interface for RGLM is available at www.stat.wmich.edu/slab/RGLM. Note that R code for the Wilcoxon analysis is also available and is discussed in Terpstra and McKean (34). 3.2 The Aligned Rank Test Method The aligned rank (AR) procedure is another rank-based method for fitting general linear models and therefore general ANCOVA models with possible interaction terms between the treatment and the covariates, as in the Hettmansperger–McKean method. This procedure was introduced by
Hodges and Lehmann (39) in some simple designs and was generalized to the linear model by Puri and Sen (40). This method is a test-oriented procedure only and offers neither estimation nor model criticism, as with the pure rank methods. An analogy with classic inference is that the aligned rank test is a gradient-score-type test, whereas the Hettmansperger and McKean F_R test, based on the reduction in rank-based residual dispersions between full and reduced models, is a likelihood-ratio-type test. Although these two tests are algebraically equivalent in classic LS inference, they are not equivalent for rank-based procedures (4,31). The basic principle of the AR method is, before comparing the groups, to make the response values more comparable by eliminating the effect of the covariate values on them. For this purpose, the AR procedure fits the reduced regression model under the null hypothesis of no treatment effect (i.e., with the covariates only and without the treatment effect), which can be achieved by either ordinary LS or the R method. The AR procedure then considers rank scores on the regression-based residuals, and the AR test statistic is derived as a quadratic form in these rank scores. In this procedure the observations are ''aligned'' by the reduced-model estimates, hence the name aligned rank test. The AR test is asymptotically distribution free, unlike the Hettmansperger–McKean test, and both tests have the same asymptotic efficiency (see [Ref. 31]). Of note, whereas the HM test requires the full-model estimates and an estimate of the scale parameter τ², the AR test only requires the reduced-model estimates. However, Hettmansperger and McKean have indicated that the small-sample properties of the aligned rank test can be poor for certain designs (see [Ref. 31]). AR tests are not implemented routinely in commonly used statistical software and usually have to be programmed specially. 4 CONCLUSION Rank-based nonparametric analysis of covariance methods are robust alternative
methods for classic analysis of covariance when the data deviate from normality and/or contain outlying values. These rank-based methods are robust without loss of efficiency. The ‘‘pure’’ rank methods are very general as they rely on only a few assumptions, but being test-oriented procedures, they are inconvenient for estimating the treatment effect. This issue is crucial in most clinical trials where the risk–benefit of a drug must be assessed. Rank-based semi-parametric methods rely on more demanding assumptions (i.e., on the ‘‘location shift’’ model assumption), so their application is narrower. However, results are interpreted on the same original scale as the raw data and, above all, they allow a full model analysis with a simple estimation of the treatment effect among other things.
REFERENCES 1. D.R. Cox, and P. McCullagh, Some aspects of analysis of covariance. Biometries 1982; 38: 541–561. 2. M. Puri, and P.K. Sen, Analysis of covariance based on general rank scores. Annals of Mathematical Statistics. 1969; 10: 610–618. 3. T.P. Hettmansperger, and J.W. McKean, A robust alternative based on ranks to least squares in analyzing linear models. Technometrics. 1977; 19: 275–284. 4. T.P. Hettmansperger, and J.W. McKean, Robust Nonparametric Statistical Methods. London, UK: Arnold, 1998. 5. G.G. Koch, C.M. Tangen, J.W. Jung, and I.A. Amara, Issues for covariance analysis of dichotomous and ordered categorical data from randomized clinical trials and nonparametric strategies for addressing them. Stat. Med. 1998; 17: 1863–1892. 6. G.G. Koch, and C.M. Tangen, Nonparametric analysis of covariance and its role in noninferiority clinical trials. Drug Information J. 1999; 33(4): 1145–1159 7. S. Leon, A.A. Tsiatis, and M. Davidian, Semiparametric estimation of treatment effect in a pretest-postest study. Biometrics 2003; 59: 1046–1055. 8. L.A. Stefanski, R.J. Carroll, and D. Ruppert, Optimal bounded score functions for generalized linear models with applications to logistic regression. Biometrika 1986; 73: 413–424.
9. J.L. Hodges, and E.L. Lehmann, Estimates of location based on rank tests. Ann. Mathemat. Stat. 1963; 34: 598–611. 10. E.L. Lehmann, Nonparametrics: statistical methods based on ranks. San Francisco, CA: Holden-Day, 1995. 11. K. Behnen, Asymptotic comparison of rank tests for the regression problem when ties are present. Ann. Stat. 1976; 4: 157–174. 12. G.G. Koch, and P.K. Sen, The use of nonparametric methods in the statistical analysis of the ‘mixed model’. Biometrics 1968; 24: 27–48. 13. G.G. Koch, The use of nonparametric methods in the statistical analysis of a complex split plot experiment. Biometrics 1970; 26: 105–128. 14. D. Quade, Rank analysis of covariance. J. Am. Stat. Assoc. 1967; 62: 1187–1200. 15. G.G. Koch, J.A. Amara, G.W. Davis, and D.B. Gillings, A review of some statistical methods for covariance analysis of categorical data. Biometrics 1982; 38: 563–595. 16. G.G. Koch, G.J. Carr, I.A. Amara, M.E. Stokes, and T.J. Uriniak, Categorical data analysis. in Statistical Methodology in the Pharamaceutical Sciences. D.A. Berry (eds) New York: Marcel Dekkerinc. 17. M.E. Stokes, C.S. Davis, and G.G. Koch, Categorical data analysis using the SAS system. Cary, NC: SAS Institute Inc., 1995. 18. W.J. Conover, and R.L. Iman, Rank transform as a bridge between parametric and nonparametric statistics. Amer. Statistic. 1981; 35: 124–133. 19. S.C. Hora, and W.J. Conover, The F-statistic in the two-way layout with rank-score transformed data. Ann. Stat. 1973; 1: 799–821. 20. M.A. Fligner, Comment. Am. Statist. 1981; 35: 131–132. 21. M.G. Akritas, The rank transform method in some two-factor designs. J. Am. Stat. Assoc. 1990; 85: 73–78. 22. J.W. McKean, and T.J. Vidmar, A comparison of two rank-based methods for the analysis of linear models. Am. Stat. 1994; 48: 220–229. 23. E. Brunner, and M.L. Puri, Nonparametric methods in design and analysis of experiments. In Handbook of Statistics, vol 13. Amsterdam: Elsevier Science. 1996, pp. 631–703. 24. E. Brunner, and N. Neumann, Rank tests in 2 × 2 designs. Statistica Neerlandica 1986; 40: 251–272. 25. S.S. Sawilowsky, R.C. Blair, and J.J. Higgins, An investigation of the Type I error and power properties of the rank transform procedure
in factorial ANOVA. J. Edu. Stat. 1989; 14: 255–267.
26. E. Brunner, and M.L. Puri, Nonparametric methods in factorial designs. Stat. Papers 2001; 42: 1–52. 27. M.G. Akritas, and S.F. Arnold, Fully nonparametric hypotheses for factorial designs. I: Multivariate repeated measures designs. J. Am. Stat. Associ. 1994; 89: 336–342. 28. L.A. Jaeckel, Estimating regression coefficients by minimizing the dispersion of the residuals. Ann. Mathema. Stat. 1972; 43: 1449–1458. 29. J.W. McKean, and T.P. Hettmansperger, Tests of hypotheses of the general linear model based on ranks. Communicat. in stat. Theory Meth. 1976; 5: 693–709. 30. J.W. McKean, and T.P. Hettmansperger, A robust analysis of the general linear model based on one step R-estimates. Biometrika 1978; 65: 571–579. 31. T.P. Hettmansperger, and J.W. McKean, A geometric interpretation of inferences based on ranks in the linear model. J. Am. Stat. Assoc. 1983; 78: 885–893. 32. H.L. Koul, G.L. Sievers, and J.W. McKean, An estimator of the scale parameter for the rank analysis of linear models under general score functions. Scandinav. J. Stat. 1987; 14: 131–141. 33. J.W. McKean, and G.L. Sievers, Rank scores suitable for the analysis of linear models under asymmetric error distributions. Technometrics 1989; 31: 207–218. 34. J.T. Terpstra, and J.W. McKean, Rank-based analyses of linear models using R. J. Stat. Soft. 2005; 14:issue 7. 35. J.W. McKean, and S.J. Sheather, Small sample properties of robust analyses of linear models based on R-estimates: a survey. W. Stahel and S. Weisberg, eds Directions in robust statistics and diagnostics, Part II. New York: Springer-Verlag, 36. J.D. Naranjo, J.W. McKean, S.J. Sheather, and T.P. Hettinansperger, The use and interpretation of rank-based residuals. J. Nonparamet. Stat. 1994; 3: 323–341. 37. MINITAB Reference Manual. Minitab, Inc. Valley Forge, PA: Data Tech Industries, Inc, 1991. 38. J.A. Kapenga, J.W. McKean, and T.J. Vidmar, RGLM: Users Manual, Version 2. SCL Technical Report 1995. Kalamazoo, MI: Department of Statistics, Western Michigan University.
39. J.L. Hodges, and E.L. Lehmann, Rank methods for combination of independent experiments in analysis of variance. Anna. Mathemat. Stat. 1962; 33: 482–497. 40. M.L. Puri, and P.K. Sen, Nonparametric Methods in General Linear Models. New York: John Wiley & Sons, 1985.
FURTHER READING A. Abebe, K. Crimin, J.W. McKean, J.V. Haas, and T.J. Vidman, Rank-based procedures for linear models: applications to pharmaceutical science data. Drug Informat. J. 2001; 35: 947–971. M.C. Akritas, S.F. Arnold, and E. Brunner, Nonparametric hypotheses and rank statistics for unbalanced factorial designs. J. Am. Stat. Assoc. 1997; 92: 258–265. M.G. Akritas, S.F. Arnold, and Y. Du, Nonparametric models and methods for nonlinear analysis of covariance. Biometrika 2000; 87: 507–526. A. Bathke, and E. Brunner, A nonparametric alternative to analysis of covariance. In M.G. Akritas and D.N. Politis (eds.), Recent advances and trends in nonparametric statistics. Elsevier Science, New York 2003. D. Draper, Rank-based robust analysis of linear models. I: Exposition and review. Statistical Science. 1988; 3(2): 239–271. P.J. Huber, Robust statistics. New York: John Wiley & Sons, 1981. G.L. Thompson, and L.P. Ammann, Efficacies of rank transform in two-way models with no interaction. J. Am. Stat. Associ. 1989; 85: 519–528. G.L. Thompson, A note on the rank transform for interaction. Biometrika 1991; 78: 697–701. Correction to ‘A note on the rank transform for interaction’. Biometrika 1993; 80: 211.
CROSS-REFERENCES Analysis of Covariance (ANCOVA) Covariates Nonparametrics Outliers Missing Values Rank Analysis of Variance Wilcoxon Rank Sum Test
RECORD ACCESS The sponsor should specify in the protocol or in another written agreement that the investigator(s)/institution(s) provide direct access to source data/documents for trial-related monitoring, audits, IRB (Institutional Review Board)/IEC (Independent Ethics Committee) review, and regulatory inspection. The sponsor should verify that each subject has consented, in writing, to direct access to his/her original medical records for trial-related monitoring, audit, IRB/IEC review, and regulatory inspection.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
REFUSE TO FILE LETTER New Drug Applications (NDA) that are incomplete become the subject of a formal ‘‘refuse to file’’ action. In such cases, the applicant receives a letter detailing the decision and the deficiencies that form its basis. This decision must be forwarded within 60 calendar days after the NDA is initially received by the U.S. Food and Drug Administration’s Center for Drug Evaluation and Research (CDER). If the application is missing one or more essential components, a ‘‘Refuse to File’’ letter is sent to the applicant. The letter documents the missing component(s) and informs the applicant that the application will not be filed until it is complete. No further review of the application occurs until the applicant provides the requested data and the application is found acceptable and complete.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/handbook/refuse.htm and http://www.fda.gov/cder/handbook/refusegn. htm) by Ralph D’Agostino and Sarah Karl.
REGISTRATION OF DRUG ESTABLISHMENT FORM This form is used by manufacturers, repackers, and relabelers to register establishments and is used by private-label distributors to obtain a labeler code. This form is also used to provide updates in registration information annually or at the discretion of the registrant, when any changes occur.
This article was borrowed from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/drls/registration listing. htm) by Ralph D’Agostino and Sarah Karl.
REGRESSION VERN T. FAREWELL MRC Biostatistics Unit, Cambridge, UK
The use of the term regression in statistics originated with Francis Galton, to describe a tendency to mediocrity in the offspring of parent seeds, and was used by Karl Pearson in a study of the heights of fathers and sons. The sons' heights tended on average to be less extreme than the fathers', demonstrating a so-called ''regression towards the mean'' effect (for details and a description of the most widely used form of regression analysis). The term is now used for a wide variety of analysis techniques that examine the relationship between a response variable and a set of explanatory variables. The nature of the response variable usually determines the type of regression that is most natural (for a very general formulation of regression models and examples).
REGRESSION MODELS TO INCORPORATE PATIENT HETEROGENEITY

JOHN O'QUIGLEY, Universite Pierre et Marie Curie, Paris, France

1 BACKGROUND

The purpose of the dose-finding study is to estimate some percentile, p, efficiently from an unknown, supposed monotonic, dose-toxicity or dose-response curve. The endpoints of interest are typically the presence or absence of toxicity and/or the presence or absence of some indication of therapeutic effect. The statistical purpose of the design is to identify a level, from among the k dose levels available, d1, . . . , dk, such that the probability of toxicity at that level is as close as possible to some value θ. The value θ is chosen by the investigator such that he or she considers probabilities of toxicity higher than θ to be unacceptably high, whereas those lower than θ are unacceptably low in that they indicate, indirectly, the likelihood of too weak an antitumor effect. In the case of patient heterogeneity, we may imagine that the level corresponding to the target may differ from one group to another. The dose for the jth entered patient, Xj, can be viewed as random, taking values xj, most often discrete, in which case xj ∈ {d1, . . . , dk}, but possibly continuous, where Xj = x, x ∈ R+. In light of the remarks of the previous two paragraphs we can, if desired, entirely suppress the notion of dose and retain only information pertaining to dose level. This information is all we need, and we may prefer to write xj ∈ {1, . . . , k}. Let Yj be a binary random variable (0, 1) where 1 denotes severe toxic response for the jth entered patient (j = 1, . . . , n). We model R(xj), the true probability of toxic response at Xj = xj, xj ∈ {d1, . . . , dk} or xj ∈ {1, . . . , k}, via

R(xj) = Pr(Yj = 1 | Xj = xj) = E(Yj | xj) = ψ(xj, a)

for some one-parameter working model ψ(xj, a). For given fixed x we require that ψ(x, a) be strictly monotonic in a. For fixed a we require that ψ(x, a) be monotonic increasing in x or, in the usual case of discrete dose levels di, i = 1, . . . , k, that ψ(di, a) > ψ(dm, a) whenever i > m. The true probability of toxicity at x (i.e., whatever treatment combination has been coded by x) is given by R(x), and we require that, for the specific doses under study (d1, . . . , dk), there exist values of a, say a1, . . . , ak, such that ψ(di, ai) = R(di), (i = 1, . . . , k). In other words, our one-parameter working model has to be rich enough to model the true probability of toxicity at any given level. We call this a working model because we do not anticipate a single value of a to work precisely at every level; that is, we do not anticipate a1 = a2 = · · · = ak = a. Many choices are possible. Excellent results have been obtained with the simple choice

ψ(di, a) = αi^a, (i = 1, . . . , k),

where the αi represent the working model, 0 < α1 < . . . < αk < 1, and 0 < a < ∞. Sometimes it can be advantageous to make use of the reparameterized model ψ(di, a) = αi^{exp(a)} so that there are no constraints on the parameter a. The likelihood estimates are of course unchanged. Once a model has been chosen and we have data in the form of the set of outcomes of the first j experiments, {y1, x1, . . . , yj, xj}, we obtain estimates R̂(di), (i = 1, . . . , k), of the true unknown probabilities R(di), (i = 1, . . . , k), at the k dose levels. The target dose level is the level that has probability of toxicity associated with it as close as we can get to θ. The dose or dose level xj assigned to the jth included patient is such that

|R̂(xj) − θ| < |R̂(di) − θ|, (i = 1, . . . , k; di ≠ xj).

Thus, xj is the closest level to the target level in the above precise sense. Other choices of closeness could be made, incorporating cost or other considerations. We could also weight the distance, for example multiplying |R̂(xj) − θ| by some constant greater than 1 when R̂(xj) > θ. This would favor conservatism; such a design tends to experiment
more often below the target than a design without weights. It is straightforward to write down the likelihood, but we reserve this for below, where we write down the likelihood for the heterogeneous case. Under a null value for the group parameter we recover the usual likelihood for the homogeneous case. 1.1 Models for Patient Heterogeneity As in other types of clinical trials, we are essentially looking for an average effect. Patients of course differ in the way they may react to a treatment and, although hampered by small samples, we may sometimes be in a position to address the issue of patient heterogeneity specifically. One example occurs in patients with acute leukemia, where it has been observed that children will better tolerate more aggressive doses (standardized by their weight) than adults. Likewise, heavily pretreated patients are more likely to suffer from toxic side effects than lightly pretreated patients. In such situations, we may wish to carry out separate trials for the different groups to identify the appropriate MTD for each group. Otherwise, we run the risk of recommending an ''average'' compromise dose level that is too toxic for a part of the population and suboptimal for the other. Usually, clinicians carry out two separate trials or split a trial into two arms after encountering the first DLTs when it is believed that there are two distinct prognostic groups. This method has the disadvantage of failing to use information common to both groups. A two-sample CRM has been developed so that only one trial is carried out based on information from both groups. A multisample CRM is a direct generalization, although we must remain realistic in terms of what is achievable in the light of the available sample sizes. Let I, taking value 1 or 2, be the indicator variable for the two groups. Otherwise, we use the same notation as previously defined. For clarity, we suppose that the targeted probability is the same in both groups and is denoted by θ, although this assumption is not essential to our conclusions. The dose-toxicity model is now the following: Pr (Y = 1 | X = x, I = 1) = ψ1 (x, a)
(1)
Pr (Y = 1 | X = x, I = 2) = ψ2 (x, a, b) (2)
Parameter b measures to some extent the difference between the groups. The functions ψ1 and ψ2 are selected in such a way that for each θ ∈ (0, 1) and each dose level x there exists a pair (a0, b0) satisfying ψ1(x, a0) = θ and ψ2(x, a0, b0) = θ. This condition is satisfied by many function pairs. The following model has performed well in simulations:

ψ1(x, a) = exp(a + x) / (1 + exp(a + x)),   (3)

ψ2(x, a, b) = b exp(a + x) / (1 + b exp(a + x)).   (4)
Many other possibilities exist; an obvious one is a generalization of the model of O'Quigley et al. (1), derived from Equation (1), in which Equation (3) applies to group 1 and

a_i = (tanh(d_i − b) + 1)/2, (i = 1, . . . , k),

applies to group 2. A nonzero value of b indicates group heterogeneity. Let z_k = (x_k, y_k, I_k), k = 1, . . . , j, be the outcomes of the first j patients, where I_k indicates to which group the kth subject belongs, x_k is the dose level at which the kth subject is tested, and y_k indicates whether or not the kth subject suffered a toxic response. To estimate the two parameters, one can use a Bayesian estimate or a maximum likelihood estimate, as for a traditional CRM design. On the basis of the observations z_k, (k = 1, . . . , j), on the first j1 patients in group 1 and j2 patients in group 2 (j1 + j2 = j), we can write down the likelihood as

L(a, b) = Π_{i=1}^{j1} ψ1(x_i, a)^{y_i} (1 − ψ1(x_i, a))^{1−y_i} × Π_{i=j1+1}^{j} ψ2(x_i, a, b)^{y_i} (1 − ψ2(x_i, a, b))^{1−y_i}.

If we denote by (â_j, b̂_j) the values of (a, b) that maximize this likelihood after the inclusion of j patients, then the estimated dose-toxicity relations are ψ1(x, â_j) and ψ2(x, â_j, b̂_j), respectively. If the (j + 1)th patient belongs to group 1, then he or she is allocated the dose level that minimizes |ψ1(x_{j+1}, â_j) − θ| with x_{j+1} ∈ {d1, . . . , dk}; if the (j + 1)th patient belongs to group 2, the recommended dose level is the one that minimizes |ψ2(x_{j+1}, â_j, b̂_j) − θ|.
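As a rough numerical sketch of how this two-sample likelihood can be used (an illustration under stated assumptions, not the authors' implementation), the parameters (a, b) of models (3) and (4) are estimated by maximum likelihood from accumulated data, and the next level for each group is the one whose estimated toxicity probability is closest to the target θ. The dose labels, target value, and outcome data below are invented for illustration, and positivity of b is not enforced in this simple sketch.

```python
# Illustrative sketch: maximum-likelihood fitting of the two-group model (3)-(4)
# and selection of the next dose level for each group. All numbers are hypothetical.
import numpy as np
from scipy.optimize import minimize

dose_labels = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0])  # assumed working dose labels x
theta = 0.25                                                # assumed target toxicity probability

def psi(x, a, b, group):
    """Model (3) for group 1; model (4), with multiplier b, for group 2."""
    num = np.exp(a + x) if group == 1 else b * np.exp(a + x)
    return num / (1.0 + num)

def neg_log_lik(params, x, y, g):
    a, b = params
    p = np.array([psi(xi, a, b, gi) for xi, gi in zip(x, g)])
    p = np.clip(p, 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical accumulated data: dose labels, toxicity indicators, group membership.
x_obs = np.array([-3.0, -2.0, -1.0, -1.0, 0.0, -2.0, -1.0])
y_obs = np.array([0, 0, 0, 1, 1, 0, 1])
g_obs = np.array([1, 1, 1, 1, 1, 2, 2])

fit = minimize(neg_log_lik, x0=[0.0, 1.0], args=(x_obs, y_obs, g_obs),
               method="Nelder-Mead")
a_hat, b_hat = fit.x

# Next recommended level per group: estimated toxicity probability closest to theta.
for grp in (1, 2):
    probs = psi(dose_labels, a_hat, b_hat, grp)
    level = int(np.argmin(np.abs(probs - theta))) + 1
    print(f"group {grp}: recommend level {level}, estimated probabilities {np.round(probs, 3)}")
```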
The trial is carried out as usual: after each inclusion, our knowledge of the probabilities of toxicity at each dose level for either group is updated via the parameters. It has been shown that, under some conditions, the recommendations converge to the right dose level for both groups, as do the estimates of the true probabilities of toxicity at these two levels. Note that it is not necessary that the two sample sizes be balanced, nor that entry into the study alternate between groups. Figure 1 gives an illustration of a simulated trial carried out with a two-parameter model. Implementation was based on likelihood estimation, which necessitates nontoxicities and a toxicity in each group before the model can be fully fit. Prior to this occurring, dose level escalation followed an algorithm incorporating grade information. The design called for shared initial escalation; that is, the groups were combined until evidence of heterogeneity began to manifest itself. The first DLT occurred in group 2 for the fifth included patient. At this point, the trial was split into two arms, the group 2 recommendation being based on L(a, 0) and ψ2(x, â, 0), and group 1 continuing without a model. Note that many possible variants on this design exist. The first DLT in group 1 was encountered at dose level 6 and led to recommendation of a lower level for the next patient to be included (Fig. 1). For the remainder of the study, allocation for both groups leaned on the model together with the minimization algorithms described above. 1.2 Pharmacokinetic Studies Statistical modeling of the clinical situation of Phase I dose-finding studies, such as takes place with the CRM, is relatively recent. Much more fully studied in the Phase I context are pharmacokinetics and pharmacodynamics. Roughly speaking, pharmacokinetics deals with the study of concentration and elimination characteristics of given compounds in specified organ systems, most often blood plasma, whereas pharmacodynamics focuses on how the compounds affect the body; this subject is vast and is referred to as PK/PD modeling.
Clearly, such information will have a bearing on whether a given patient is likely to encounter dose limiting toxicity or, in retrospect, why some patients and not others could tolerate some given dose. Many parameters are of interest to the pharmacologist, for example the area under the concentration time curve, the rate of clearance of the drug, and the peak concentration. For our purposes, a particular practical difficulty develops in the Phase I context, in which any such information only becomes available once the dose has been administered. Most often then the information will be of most use in terms of retrospectively explaining the toxicities. However it is possible to have pharmacodynamic information and other patient characteristics relating to the patient’s ability to synthesise the drugs, available before selecting the level at which the patient should be treated. In principle, we can write down any model we care to hypothesize, say one including all the relevant factors believed to influence the probability of encountering toxicity. We can then proceed to estimate the parameters. However, as in the case of patient heterogeneity, we must remain realistic in terms of what can be achieved given the maximum obtainable sample size. Some pioneering work has been carried out here by Piantadosi et al. (2), which indicates the potential for improved precision by the incorporation of pharmacokinetic information. This field is large and awaits additional exploration. The strength of CRM is to locate with relatively few patients the target dose level. The remaining patients are then treated at this same level. A recommendation is made for this level. More studies, following the Phase I clinical study, can now be made. During these next studies is where we examine the main advantage of pharmacokinetics. Most patients will have been treated at the recommended level and a smaller amount at adjacent levels. At any of these levels, we will have responses and a great deal of pharmacokinetic information. The usual models, in particular the logistic model, can be used to observe whether this information helps explain the toxicities. If so, we may be encouraged to carry out additional studies at higher or lower levels for certain patient profiles,
[Figure 1. Simulated trial for two groups: trial history showing dose level (1–6) against patient number (1–16).]
which are indicated by the retrospective analysis to have probabilities of toxicity much lower or much higher than suggested by the average estimate. This step can be viewed as fine-tuning and may itself lead to new, more highly focused phase I studies. At this point, we do not recommend the use of a model in which all the different factors are included as regressors. These additional analyses are necessarily very delicate, requiring great statistical and/or pharmacological skill, and a mechanistic approach based on a catch-all model is probably to be advised against.

2 DEALING WITH SEVERAL GROUPS
For some studies, we may suspect group differences but, given the relatively few available patients, put all subjects into a single study. Even so, it is still possible to allow for the potential existence of group differences while proceeding as though we were dealing with a homogeneous population. Only relatively strong differences, identified by the accumulating observations, can lead to different recommendations for the different groups. In this case, all patients begin the study as though no differences were anticipated. Different schedules of treatment alone may lead to different behaviors and, under a working assumption that
the groups behave similarly, it requires no real extra effort to allow ourselves a mechanism that would indicate to us that the accumulating evidence suggests the groups are behaving differently. Certain studies involve more than a single grouping covariable; an example is the two prognostic groups based on heavily and lightly pretreated patients, together with a variable labeling the schedule. For three different schedules and two prognostic groups, we have a total of six subgroups. Separate analyses are not feasible. For several groups, a regression model, such as that outlined above, is not feasible either. Typically, the likelihood will be flat if multidimensional, and operating behavior is potentially unstable. The solution is to add yet more structure (i.e., constraints) to the model. Rather than allow a large, possibly infinite, range of potential values for the second parameter b, measuring differences between the groups, the differences themselves are taken from a very small finite set. In any event, if the first group finishes with a recommendation for some level, d0 for example, then the other group will be recommended either the same level or a level one, two, or more steps away from it; the idea is to parameterize these steps directly. The indices themselves are modeled, and the model is less cluttered if we work
with log ψ(d_i, a) rather than ψ(d_i, a), writing

log ψ(d_i, a) = exp(a) log(α*_i), with α*_i = α_{i + z h(i)},   (5)

where z is the group indicator, h(i) = m I(1 ≤ i + m ≤ k) + (k − i) I(i + m > k) + (1 − i) I(i + m < 1), and m = 0, 1, 2, . . . ; the second two terms in the expression for h(i) take care of edge effects by truncating the shifted index at k and at 1. It is easy to put a discrete prior on m, possibly giving the most weight to m = 0 and only allowing one or two dose level shifts if the evidence of the accumulating data points strongly in that direction. No extra work is required to generalize to several groups. Under broad conditions analogous to those of Shen and O'Quigley (3), applied to both groups separately, consistency of the model in terms of identifying the correct level can be demonstrated. This result is of interest, but it can be more illuminating to study small-sample properties, often via the use of simulations, because for dose-finding studies samples are invariably rather small.

REFERENCES
1. J. O'Quigley, M. Pepe, and L. Fisher, Continual reassessment method: a practical design for Phase I clinical trials in cancer. Biometrics 1990;46:33–48.
2. S. Piantadosi, J. Fisher, and S. Grossman, Practical implementation of a modified continual reassessment method for dose-finding trials. Cancer Chemother. Pharmacol. 1998;41:429–436.
3. L. Z. Shen and J. O'Quigley, Consistency of continual reassessment method in dose finding studies. Biometrika 1996;83:395–406.
FURTHER READING C. Ahn, An evaluation of phase I cancer clinical trial designs. Stats. Med. 1998;17:1537–1549. T. Braun, The bivariate continual reassessment method: extending the CRM to phase I trials of two competing outcomes. Control. Clin. Trials 2002;23:240–256. S. Chevret, The continual reassessment method in cancer phase I clinical trials: A simulation study. Stats. Med. 1993;12:1093–1108.
D. Faries, Practical modifications of the continual reassessment method for phase I cancer clinical trials. J. Biopharm. Statist. 1994;4:147–164. C. Gatsonis and J. B. Greenhouse, Bayesian methods for phase I clnical trials. Stats. Med. 1992;11:1377–1389. S. Goodman, M. L. Zahurak, and S. Piantadosi, Some practical improvements in the continual reassessment method for phase I studies Stats. Med. 1995;14:1149–1161. N. Ishizuka, and Y. Ohashi, The continual reassessment method and its applications: A Bayesian methodology for phase I cancer clinical trials. Stats. Med. 2001;20:2661–2681. A. Legedeza, and J. Ibrahim, Longitudinal design for phase I trials using the continual reassessment method. Control. Clin. Trials 2002;21:578–588. S. Moller, An extension of the continual reassessment method using a preliminary up and down design in a dose finding study in cancer patients in order to investigate a greater number of dose levels. Stats. Med. 1995;14:911–922. J. O’Quigley, Estimating the probability of toxicity at the recommended dose following a Phase I clinical trial in cancer. Biometrics 1992;48:853–862. J. O’Quigley and S. Chevret, Methods for dose finding studies in cancer clinical trials: a review and results of a Monte Carlo study. Stats. Med. 1991;10:1647–1664. J. O’Quigley and L. Z. Shen, Continual Reassessment Method: A likelihood approach. Biometrics 1996;52:163–174. J. O’Quigley, L. Shen, and A. Gamst, Two sample continual reassessment method. J. Biopharm. Statist. 1999;9:17–44. J. O’Quigley and X. Paoletti, Continual reassessment method for ordered groups. Biometrics 2003;59:429–439. S. Piantadosi and G. Liu, Improved designs for dose escalation studies using pharmacokinetic measurements. Stats. Med. 1996;15:1605–1618. B. E. Storer, Phase I Clinical Trials. Encylopedia of Biostatistics. 1998. Wiley, New York. B. E. Storer, An evaluation of phase I clinical trial designs in the continuous dose-response setting. Stats. Med. 2001;20:2399–2408.
REGRESSION TO THE MEAN

CLARENCE E. DAVIS, University of North Carolina, Chapel Hill, North Carolina

Regression to the mean was first identified by Sir Francis Galton (1). Currently, the phrase is used to identify the phenomenon that a variable that is extreme on its first measurement will tend to be closer to the center of the distribution on a later measurement. For example, in a screening program for hypertension, only persons with high blood pressure are asked to return for a second measure. Because of regression to the mean, on average, the second measure of blood pressure will be less than the first. Of course, if persons are selected because their measure on some variable is small, on average their second measure will be larger.

1 REGRESSION TO THE MEAN IN GENERAL

Regression to the mean will occur for any measurement that has some variability, whether because of biologic variability or measurement technique. The size of the regression to the mean effect is proportional to the extremity of the criteria for selection and to the degree of variability in measurement (2). For example, virtually no short-term biologic variability exists in the heights of adults. Thus, if height in a study is measured with good precision, little or no regression to the mean will occur, even if subjects are selected for extreme height. However, many biologic measures (blood pressure, cholesterol, glucose, etc.) vary from time to time within an individual. Moreover, a lack of precision may exist in how the variable is measured. These two forms of variability combine to lead to sizable amounts of regression to the mean when persons are selected because of extreme values. The amount of regression to the mean can be estimated from mathematical models of the measurements (2, 3). Most mathematical descriptions of regression to the mean assume that the variables being measured follow the normal (Gaussian) distribution. However, the phenomenon is not restricted to normally distributed variables (4).

2 REGRESSION TO THE MEAN IN CLINICAL TRIALS

Regression to the mean is relevant to the design and the conduct of clinical trials in several ways. First, it is an important justification for a concurrent, randomized control group. For example, consider a study designed to evaluate a new treatment for hypercholesterolemia. Participants will be selected if their serum cholesterol is above some threshold. Then they will be treated with the new therapy, and their serum cholesterol will be measured some time later. This example demonstrates the classic ''before–after'' study (5). Because of regression to the mean, the average cholesterol at the after measurement will be less than that of the before measurement, even if the new treatment has no effect on serum cholesterol. Thus, the investigator will be led to conclude that the new treatment lowers serum cholesterol because of the regression to the mean effect. The solution to this problem is to have a concurrent control group selected in the same manner as the treated group. The average change in cholesterol in the control group is an estimate of the effect of regression to the mean. Thus, the difference between the change in the treated group and the change in the control group will be an unbiased estimate of the true treatment effect. It should be noted that the use of analysis of covariance with the baseline measure as the covariate is preferable to using change scores (2, 6).
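The following small simulation (hypothetical numbers, not from the cited references) illustrates the point just made: when subjects are selected for a high screening value, the mean of a second measurement falls even with no treatment at all, whereas a randomized concurrent control group experiences the same spurious drop, so the between-group difference in change is approximately unbiased.

```python
# Illustrative simulation: regression to the mean in a single-arm "before-after"
# design versus a randomized concurrent control. All numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
true_chol = rng.normal(220, 30, size=n)          # underlying mean cholesterol per subject
screen = true_chol + rng.normal(0, 20, size=n)   # screening value = truth + short-term variability

eligible = screen > 240                          # select only "high" screening values
follow_up = true_chol[eligible] + rng.normal(0, 20, size=eligible.sum())  # second measurement

# With NO treatment effect at all, the mean still drops between the two measurements:
print("mean at screening :", round(screen[eligible].mean(), 1))
print("mean at follow-up :", round(follow_up.mean(), 1))

# A randomized control group shows the same spurious drop, so the difference in
# mean change between arms is an (approximately) unbiased estimate of the effect.
arm = rng.integers(0, 2, size=eligible.sum())    # 0 = control, 1 = treated (no true effect here)
change = follow_up - screen[eligible]
print("difference in mean change (treated - control):",
      round(change[arm == 1].mean() - change[arm == 0].mean(), 2))
```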
A second, frequent issue related to regression to the mean in clinical trials is its effect on recruitment of trial participants. Very often, potential participants for a clinical trial are selected because they have extreme values on some measurement (e.g., elevated glucose or blood pressure). These candidates for the trial are then rescreened a short time later to verify eligibility. Because of regression to the mean, a portion of the candidates will have lower values and will be deemed no longer eligible for the trial. Solutions for this problem include the use of a lower threshold for inclusion at the second and subsequent measurements and the use of the average of two or more measurements to determine eligibility (2).

REFERENCES
1. S. M. Stigler, Regression towards the mean, historically considered. Statist. Meth. Med. Res. 1997; 6: 103–114.
2. C. E. Davis, The effect of regression to the mean in epidemiologic and clinical studies. Am. J. Epidemiol. 1976; 104: 493–498.
3. A. G. Barnett, J. C. van der Pols, and A. J. Dobson, Regression to the mean: what it is and how to deal with it. Int. J. Epidemiol. 2005; 34: 215–220.
4. H. G. Müller, I. Abramson, and A. Rahman, Nonparametric regression to the mean. Proc. Nat. Acad. Sci. U.S.A. 2003; 100: 9715–9720.
5. N. Laird, Further comparative analysis of pretest post-test research designs. Am. Statistician 1983; 37: 99–102.
6. V. J. Vickers and D. G. Altman, Statistics notes: analyzing controlled trials with baseline and follow up measures. BMJ 2001; 323: 1123–1124.
FURTHER READING P. Armitage, G. Berry, J. N. S. Matthews, Statistical Methods in Medical Research. Malden, MA: Blackwell Science, 2002. pp. 204–207.
CROSS-REFERENCES
Control Groups
Change From Baseline
REGULATORY AUTHORITIES Regulatory Authorities are bodies that have the power to regulate. In the International Conference on Harmonisation (ICH) Good Clinical Practice (GCP) guideline, the expression ''Regulatory Authorities'' includes the authorities who review submitted clinical data and who conduct inspections. These bodies are sometimes referred to as competent authorities. Before initiating the clinical trial(s), the sponsor [or the sponsor and the investigator, if required by the applicable regulatory requirement(s)] should submit any required application(s) to the appropriate authority(ies) for review, acceptance, and/or permission [as required by the applicable regulatory requirement(s)] to begin the trial(s). Any notification/submission should be dated and contain sufficient information to identify the protocol. To preserve the independence and the value of the audit function, the regulatory authority(ies) should not request the audit reports routinely. Regulatory authority(ies) may seek access to an audit report on a case-by-case basis, when evidence of serious GCP noncompliance exists, or in the course of legal proceedings or investigations.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
REGULATORY DEFINITIONS

KELLY L. TRAVERSO, Oxford, Connecticut

1 WHAT IS AN ADVERSE EVENT?

An adverse event (or experience) (AE) is defined by the Food and Drug Administration (FDA) Code of Federal Regulations (CFR) as any untoward medical occurrence in a patient or clinical investigation subject who has been administered a pharmaceutical product, or any unfavorable and unintended sign (including an abnormal laboratory finding), symptom, or disease temporally associated with the use of a medicinal product, whether or not it is considered treatment related (see also ''Regulatory Issues''). What this means is that any event that a patient experiences in a clinical trial must be reported to the sponsor, whether or not the investigator believes it to be related to the compound being investigated. This reporting includes any new illness or disease, or worsening of an existing illness or disease, after exposure to the drug or initiation of the clinical trial. Examples of an adverse event include a clinically relevant deterioration of a laboratory assessment, a medication error, abuse or misuse of a product, and/or any new diagnosis.

2 Pre-Existing Conditions versus Adverse Events

Preexisting events (conditions that were found before entry into a study) and planned hospitalizations or procedures are not considered adverse events, unless the event worsens while the patient is enrolled in the study, in which case the event is considered aggravated or worsened. Procedures in general are not considered AEs; however, the underlying condition is an AE. For example, if a patient underwent an appendectomy, the appendectomy is not the AE. The AE is appendicitis. In this example, the appendectomy would be considered a treatment for the appendicitis.

2.1 Adverse Drug Reactions (ADR)

There is another term similar to an AE called an adverse drug reaction (ADR). For purposes of premarketing, in which new indications, dosages, and/or new formulations can be provided, an ADR is defined as all noxious and unintended responses to a medicinal product related to any dose. The main difference between an AE and an ADR is that an ADR has implied causality. This implied causality means that a possible relationship could not be ruled out. In summary, ADRs are considered related to therapy (see also ''Safety/Toxicity'').

2.2 Serious Adverse Event/Adverse Drug Reaction

Both an ADR and an AE can be considered serious if the event(s) meets at least one of the following six criteria: results in death, is considered life threatening, causes a hospitalization, prolongs an existing hospitalization, results in a congenital anomaly or birth defect, and/or is considered an important medical event. Important medical events are based on appropriate medical judgment that the event may jeopardize the patient and may require medical or surgical intervention to prevent one of the outcomes listed above. If an AE or an ADR is reported as serious, then it is considered a Serious Adverse Event (SAE) or a Serious Adverse Drug Reaction (SADR), respectively. The same concepts regarding relationship and implied causality apply for an SAE and an SADR as they do for AEs and ADRs. It is imperative in clinical studies not to confuse the terms serious and severe. When reporting an AE, one must provide the severity of the event. Severity is the intensity at which the event was classified. The three classifications of severity are mild, moderate, and severe. An event that is considered severe can still be considered an AE. Serious relates to the event or outcome of the event that may impair the patient's normal lifestyle. The term serious also is used as a guide to determine regulatory reporting, which will be discussed in the next section
of this article. Therefore, severity and serious should not be used interchangeably when referring to adverse events and clinical trial development (see also ‘‘Study Design: Phases I–IV’’). 2.3 Defining Unexpected Adverse Drug Reactions and Serious Adverse Drug Reactions After the AE or SAE has been reported to the sponsor, it is determined whether the event is included in the appropriate reference document. In clinical trials, the Investigator
Brochure (IB) is most often the reference document that is used to determine if the event is expected or unexpected. An event is considered unexpected if it has not been previously observed in preclinical and/or clinical studies and/or the nature and severity is greater or more specific than what is documented in the IB. For example, if the event that was reported was congestive heart failure, which was considered life threatening, and the IB listed just congestive heart failure, then the event would be considered
[Figure 1. An example of a CIOMS I form. Highlighted are some areas that are mandatory when submitting the report to the health authorities: patient initials and/or number; seriousness criteria (more than one box may be checked); event and clinical course information; study drug information (entered in Section 2); the study box, which is checked for clinical trials; and Section 25a, which indicates whether the report is an initial or a follow-up report.]
REGULATORY DEFINITIONS
SAEs that are considered unexpected are referred to as Serious Unexpected Adverse Drug Reactions (SUADRs). In other countries, a SUADR is called a Suspected Unexpected Serious Adverse Reaction (SUSAR); for purposes of regulatory reporting, the two terms are for the most part synonymous.

3 TIMELINES AND FORMAT FOR REPORTING SUADRs

All adverse-event information must be reported to the health authorities within a defined period of time. Adverse events and
serious ADRs that are considered expected, whether related or unrelated, are submitted in a line listing to the health authorities with the annual submission (for the FDA, this submission is called the IND annual report) (see also ''Regulatory Issues''). Unless these events jeopardize the safety of patients, they do not warrant expedited reporting. However, serious adverse drug reactions that are considered related and unexpected are reported to the FDA in an expedited fashion. The four minimum criteria for initially reporting an event are reporter, patient, drug, and event. If the SUADR did not result in death and was not considered life threatening,
Figure 2. An example of a MedWatch form. This report includes device reporting as well. Highlighted are some key areas of the MedWatch form. Notice that the MedWatch form has the classification of the report (i.e., 7-day, 15-day, or annual report).
then it must be submitted within 15 calendar days. However, if the event did result in death or was considered life threatening, then the FDA must be made aware of the event within 7 calendar days. According to ICH E2A, the 7-day notification can be made by facsimile, telephone, or in writing and does not have to be a complete report. The 7-day notification must be followed by a complete report within 8 additional calendar days. A complete report consists of all patient demographic information, study therapy, the causal relationship assessed by the sponsor and investigator, and a narrative that describes the clinical course of events. Any laboratory values and/or diagnostic results,
concomitant therapies, and past medical history should also be provided if known. It is always important to perform due diligence: follow up with the investigator to obtain any missing information that is considered relevant to the event(s), and report any new information. Generally, once an event is considered a SUADR, any new information pertaining to that event must be submitted within the expedited timeline. It is important to note that a 7-day notification occurs only once per event, which means that any submission after the 7-day notification is considered a 15-day report; a follow-up 7-day report should never occur.
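As a rough illustration of the reporting logic described above, the sketch below encodes the 7-day/15-day decision rule in Python. It is a simplified, hypothetical helper (the function name, inputs, and returned labels are illustrative and not part of any regulation or safety system), and it assumes that seriousness, relatedness, and expectedness have already been assessed; real expedited-reporting decisions involve nuances that are not captured here.

```python
from datetime import date, timedelta

def expedited_reporting(serious: bool, related: bool, expected: bool,
                        fatal_or_life_threatening: bool, awareness_date: date):
    """Sketch of the expedited-reporting decision rule described in the text.

    Returns a report category and a due date counted in calendar days from
    the date the sponsor first became aware of the event.
    """
    if not (serious and related and not expected):
        # Expected and/or unrelated serious ADRs go into the periodic line
        # listing (e.g., the IND annual report) rather than an expedited report.
        return "annual line listing", None
    if fatal_or_life_threatening:
        # Initial notification within 7 calendar days, followed by a complete
        # report within 8 additional calendar days.
        return "7-day notification (complete report within 8 more days)", awareness_date + timedelta(days=7)
    return "15-day report", awareness_date + timedelta(days=15)

# Example: a related, unexpected, life-threatening SAE first known on 1 June
print(expedited_reporting(True, True, False, True, date(2008, 6, 1)))
```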
3.1 Format for Submitting Expedited Reports to the Health Authorities

The two most common forms used to submit SUADRs to the health authorities are the CIOMS I and the MedWatch (3500A) forms. The CIOMS I form is the most widely used form internationally. The MedWatch form is an FDA-specific form that is used predominantly in the United States. These forms differ in format but contain the same basic information, such as patient demographics, suspect drug information, seriousness criteria, laboratory values, concomitant medications, manufacturer information, and a narrative summary field. Figures 1 and 2 illustrate both forms.

REFERENCES

1. Clinical Safety Data Management: Definitions and Standards for Expedited Reporting. ICH E2A, March 1995, pp. 330–336.
CROSS-REFERENCES

Study Design: Phase I–Phase IV
Safety/Toxicity
Regulatory Issues
RELATIVE RISK MODELING
DUNCAN C. THOMAS
University of Southern California, Los Angeles, CA, USA

Risk models are used to describe the hazard function λ(t, Z) for time-to-failure data as a function of time t and covariates Z = (Z1, . . . , Zp), which may themselves be time dependent. The term "relative risk models" refers to the covariate part r(·) of a risk model in proportional hazards form

$$\lambda(t, Z) = \lambda_0(t)\, r[Z(t); \beta], \qquad (1)$$

where β represents a vector of parameters to be estimated. In the standard proportional hazards model, the relative risk term takes the loglinear form r(Z; β) = exp(Z'β). This has the convenient property that it is positive for all possible covariate and parameter values, since the hazard rate itself must be nonnegative. However, in particular applications, some alternative form of relative risk model may be more appropriate.

First, an aside on the subject of time is warranted. Time can be measured on a number of different scales, such as age, calendar time, or time since start of observation. One of these must be selected as the time axis t for use of the proportional hazards model. In clinical trials, time since diagnosis or start of treatment is commonly used for this purpose, since one of the major objectives of such studies is to make statements about prognosis. In epidemiologic studies, however, age is the preferred time axis, because it is usually a powerful determinant of disease rates but is not of primary interest; thus, it is essential that its confounding effects be eliminated. However, other temporal factors, such as calendar date or time since exposure began, may still be relevant and can be handled either by treating them as covariates or by stratification.

1 WHY MODEL RELATIVE RISKS?

Before proceeding further, it is worth pausing to inquire why one might wish to adopt the proportional hazards model at all. Certainly, there are examples of situations where some other form of model provides a better description of the underlying biologic process. Two alternative models that have received some attention are the additive hazards model $\lambda(t, Z) = \lambda_0(t) + Z'\beta$ and the accelerated failure-time model $S(t, Z) = S_0[t \exp(Z'\beta)]$, where $S(t) = \exp[-\int_0^t \lambda(u)\,du]$ is the survival function. Although any risk model can be reparameterized in proportional hazards form, it may be that a more parsimonious model can be found using some alternative formulation. For example, the additive risk model could be written as $\lambda(t, Z) = \lambda_0(t)(1 + \tilde{Z}'\beta)$, where $\tilde{Z} = Z/\lambda_0(t)$, if the baseline hazard λ0(t) were some known parametric function, such as a set of external rates for an unexposed population. In this case, whether the proportional hazards or additive hazards model provides a more parsimonious description of the data depends on whether the relative risk or the excess risk is more nearly constant over time (or requires the fewest time-dependent interaction effects).

The advantages of relative risk models are both mathematical and empirical. Mathematically, the proportional hazards model allows "semiparametric" estimation of covariate effects via partial likelihood without requiring parametric assumptions about the form of the baseline hazard. Furthermore, at least with the standard loglinear form of the relative risk model, asymptotic normality seems to be achieved faster in many applications than for most alternative models. Empirically, it appears that many failure-time processes do indeed show rough proportionality of the hazard to time and covariate effects, at least with appropriate specification of the covariates. Evidence of this phenomenon for cancer incidence is reviewed in Breslow & Day [2, Chapter 2]: age-specific incidence rates from a variety of populations have more nearly constant ratios than differences.
2 DATA STRUCTURES AND LIKELIHOODS
Failure-time data arise in many situations in biology and medicine. In clinical trials,
time-to-death or time-to-disease-recurrence are frequently used endpoints. In epidemiology, cohort studies are often concerned with disease incidence or mortality in some exposed population, and case–control studies can be viewed as a form of sampling within a general population cohort. All these designs involve the collection of a set of data for each individual i = 1, . . . , I comprising a failure or censoring time ti, a censoring indicator di = 1 if the failure time is observed (i.e. the subject is affected), zero otherwise, and a vector of covariates zi, possibly time dependent. The appropriate likelihood depends on the sampling design and data structure. For a clinical trial or cohort study with the same period of observation for all subjects, but where only the disease status, not the failure time itself, is observed, a logistic model for the probability of failure of the form Pr(D = 0|Z) = [1 + αr(Z; β)]^{-1} might be used, where α is the odds of failure for a subject with Z ≡ 0. Again, the standard form is obtained using r(Z; β) = exp(Z'β). The likelihood for this design would then be

$$L(\alpha, \beta) = \prod_i \Pr(D = d_i \mid Z = z_i; \alpha, \beta) = \prod_i \frac{[\alpha\, r(z_i; \beta)]^{d_i}}{1 + \alpha\, r(z_i; \beta)}. \qquad (2)$$
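A minimal numerical sketch of likelihood (2) is given below, assuming the standard loglinear relative risk r(z; β) = exp(z'β); with that choice the model reduces to ordinary logistic regression with intercept log α. The toy data, the optimizer call, and the function names are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, z, d):
    """Negative log of likelihood (2) with r(z; beta) = exp(z @ beta)."""
    log_alpha, beta = params[0], params[1:]
    eta = log_alpha + z @ beta              # log[alpha * r(z; beta)]
    return -np.sum(d * eta - np.log1p(np.exp(eta)))

# Toy data: one covariate, binary disease indicator d
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
d = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.8 * z[:, 0]))))

fit = minimize(neg_log_lik, x0=np.zeros(2), args=(z, d))
print("log alpha, beta:", fit.x)
```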
The same model and likelihood function would be used for an unmatched case–control study, except that α now involves the control sampling fractions as well as the baseline disease risk. In a clinical trial or cohort study in which the failure times are observed, the proportional hazards model (1) leads to a full likelihood of the form

$$L[\lambda_0(\cdot), \beta] = \prod_i \lambda_0(t_i)^{d_i}\, r[z_i(t_i); \beta]^{d_i} \exp\left(-\int_{s_i}^{t_i} \lambda_0(t)\, r[z_i(t); \beta]\, dt\right), \qquad (3)$$

where $s_i$ denotes the entry time of subject i. Use of the full likelihood requires specification of the form of the baseline hazard. Cox (6) proposed instead a "partial likelihood" of the form

$$L(\beta) = \prod_{n=1}^{N} \frac{r[z_{i_n}(t_n); \beta]}{\sum_{j \in R_n} r[z_j(t_n); \beta]}, \qquad (4)$$

where n = 1, . . . , N indexes the observed failure times, $i_n$ denotes the individual who fails at time $t_n$, and $R_n$ denotes the set of subjects at risk at time $t_n$. This likelihood does not require any specification of the form of the baseline hazard; the estimation of β is said to be "semiparametric," as the relative risk factor is still specified parametrically (e.g. the loglinear model in the standard form). This partial likelihood can also be used to fit relative risk models for matched case–control studies (including nested case–control studies within a cohort), where n now indexes the cases and $R_n$ indicates the set comprising the nth case and his matched controls. For very large data sets, it may be more convenient to analyze the data in grouped form using Poisson regression. For this purpose, the total person-time of follow-up is grouped into k = 1, . . . , K categories on the basis of time and covariates, and the number of events $N_k$ and person-time $T_k$ in each category are recorded, together with the corresponding values of the (average) time $t_k$ and covariates $z_k$. The proportional hazards model then leads to a Poisson likelihood for the grouped data of the form

$$L(\lambda, \beta) = \prod_{k=1}^{K} \frac{[\lambda_k T_k\, r(z_k; \beta)]^{N_k} \exp[-\lambda_k T_k\, r(z_k; \beta)]}{N_k!}, \qquad (5)$$

where $\lambda_k = \lambda_0(t_k)$ denotes a set of baseline hazard parameters that must be estimated together with β.
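The sketch below shows how the log of the partial likelihood (4) can be computed directly for a loglinear relative risk. The toy arrays and the assumption of no tied failure times are illustrative simplifications; the grouped Poisson likelihood (5) could be maximized in the same way, treating the log λ_k as stratum-specific intercepts.

```python
import numpy as np

def cox_log_partial_likelihood(beta, time, event, z):
    """Log partial likelihood (4) with r(z; beta) = exp(z @ beta).

    Assumes no tied failure times; the risk set R_n contains every subject
    whose follow-up time is >= the nth failure time.
    """
    eta = z @ beta
    loglik = 0.0
    for n in np.flatnonzero(event == 1):
        at_risk = time >= time[n]
        loglik += eta[n] - np.log(np.sum(np.exp(eta[at_risk])))
    return loglik

# Toy data: 5 subjects, one covariate
time = np.array([2.0, 3.5, 4.0, 5.1, 6.3])
event = np.array([1, 0, 1, 1, 0])
z = np.array([[0.5], [-0.2], [1.1], [0.0], [0.7]])
print(cox_log_partial_likelihood(np.array([0.3]), time, event, z))
```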
3 APPROACHES TO MODEL SPECIFICATION

For any of these likelihoods, it suffices to substitute some appropriate function for r(Z; β) and then use the standard methods of maximum likelihood to estimate its parameters and test hypotheses. In the remainder of
this article, we discuss various approaches to specifying this function. The major distinction we make is between empiric and mechanistic approaches. Empiric models are not based on any particular biologic theory for the underlying failure process but simply attempt to provide a parsimonious description of it, particularly to identify and quantify the effects of covariates that affect the relative hazard. Perhaps the best known empiric model is the loglinear model for relative risks, but other forms may be appropriate for testing particular hypotheses or for more parsimonious modeling in particular data sets, as discussed in the following section. With a small number of covariates, it may also be possible to model the relative risk nonparametrically. Mechanistic models, on the other hand, aim to describe the observed data in terms of some unobservable underlying disease process, such as the multistage theory of carcinogenesis. We touch on such models briefly at the end.

Before proceeding further, it should be noted that what follows is predicated on the assumption that the covariates Z are accurately measured (or that the exposure–response relationship that will be estimated refers to the measured values of the covariates, not to their true values). There is a large and growing literature on methods of adjustment of relative risk models for measurement error.

3.1 Empiric Models

The loglinear model, ln r(Z; β) = Z'β, is probably the most widely used empiric model and is the standard form included in all statistical packages for logistic, Cox, and Poisson regression. As noted earlier, it is nonnegative and it produces a nonzero likelihood for all possible parameter values, which doubtless contributes to the observation that, in most applications, parameter estimates are reasonably normally distributed even with relatively sparse data. However, the model involves two key assumptions that merit testing in any particular application:

1. for a continuous covariate Z, the relative risk depends exponentially on the value of Z; and
2. for a pair of covariates, Z1 and Z2, the relative risk depends multiplicatively on the marginal risks from each covariate separately (i.e. r(Z; β) = r(Z1; β1) r(Z2; β2)).

Neither of these assumptions is relevant for a single categorical covariate with K levels, for which one forms a set of K − 1 indicator variables corresponding to all levels other than the "referent" category. In other cases, the two assumptions can be tested by nesting the model in some more general model that includes the fitted model as a special case. This test can be accomplished without leaving the general class of loglinear models. For example, to test the first assumption, it may suffice to add one or more transformations of the covariate (such as its square) to the model and test the significance of its additional contribution. To test the second assumption, one could add a single product term (for two continuous or binary covariates) or a set of (K − 1)(L − 1) products for two categorical variables with K and L levels, respectively.

If these tests reveal significant lack of fit of the original model, one might nevertheless be satisfied with the expanded model as a reasonable description of the data (after appropriately testing the fit of that expanded model). However, one should then also consider the possibility that the data might be more parsimoniously described by some completely different form of model. In choosing such an alternative, one would naturally be guided by what the tests of fit of the earlier models had revealed, as well as by categorical analyses. For example, if a quadratic term produced a negative estimate, that might suggest that a linear rather than loglinear model might fit better; similarly, a negative estimate for an interaction term might suggest an additive rather than multiplicative form of model for joint effects. In this case, one might consider fitting a model of the form r(Z; β) = 1 + Z'β. Alternatively, one might prefer a model that is linear in each component but multiplicative in their joint effects, $r(Z; \beta) = \prod_p (1 + Z_p \beta_p)$, or one that is loglinear in each component but additive jointly, $r(Z; \beta) = 1 + \sum_p [\exp(Z_p \beta_p) - 1]$.
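To make these alternative functional forms concrete, here is a small sketch (illustrative only; function names are mine) coding the loglinear, linear, linear-multiplicative, and loglinear-additive relative risk functions. Any of them can, in principle, be substituted for r(Z; β) in the likelihoods of the previous section.

```python
import numpy as np

def r_loglinear(z, beta):
    # r(Z; beta) = exp(Z'beta)
    return np.exp(z @ beta)

def r_linear(z, beta):
    # r(Z; beta) = 1 + Z'beta (must be kept positive during fitting)
    return 1.0 + z @ beta

def r_linear_multiplicative(z, beta):
    # r(Z; beta) = prod_p (1 + Z_p * beta_p)
    return np.prod(1.0 + z * beta, axis=-1)

def r_loglinear_additive(z, beta):
    # r(Z; beta) = 1 + sum_p [exp(Z_p * beta_p) - 1]
    return 1.0 + np.sum(np.exp(z * beta) - 1.0, axis=-1)

z = np.array([[1.0, 0.5]])
beta = np.array([0.2, -0.1])
for f in (r_loglinear, r_linear, r_linear_multiplicative, r_loglinear_additive):
    print(f.__name__, f(z, beta))
```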
In a rich data set, the number of possible alternative models can quickly get out of hand, so some structured approach to model building is needed. The key is to adopt a general class of models that includes all the alternatives one might be interested in as special cases, allowing specific submodels to be tested against nested alternatives. A general model that has achieved some popularity recently consists of a mixture of linear and loglinear terms of the form

$$r(Z, W; \beta, \gamma) = \exp(W_0'\gamma_0)\left[1 + \sum_{m=1}^{M} Z_m'\beta_m \exp(W_m'\gamma_m)\right], \qquad (6)$$

where $\beta_m$ and $\gamma_m$ denote vectors of regression coefficients corresponding to the subsets of covariates $Z_m$ and $W_m$ included in the mth linear and loglinear terms, respectively. Thus, for example, the standard loglinear model would comprise the single term m = 0, while the linear model would comprise a single term m = 1 with no covariates in the loglinear terms. A special case that has been widely used in radiobiology is of the form

$$r(Z, W; \beta, \gamma) = 1 + (\beta_1 Z + \beta_2 Z^2)\exp(-\beta_3 Z + W'\gamma),$$

where Z represents radiation dose (believed from microdosimetry considerations to have a linear-quadratic effect on mutation rates at low doses, multiplied by a negative exponential survival term to account for cell killing at high doses) and W comprises modifiers of the slope of the dose–response relationship, such as attained age, sex, latency, or age at exposure. For example, including the log of latency and its square in W allows for a lognormal dependence of the excess relative risk on latency (see elsewhere for a discussion of software for fitting such models). Combining linear and loglinear terms, using the same p covariates, would produce a model of the form r(Z; β, γ) = exp(Z'γ)(1 + Z'β), against which the fit of the linear and loglinear models could be tested with p df. Although useful as a test of fit of these two specific models, the interpretation of the parameters is not straightforward, since the
effect of the covariates is essentially split between the two components. It would be of greater interest to form a model with a single set of regression coefficients and an additional mixing parameter for the combination of the submodels. Conceptually, the simplest such model is the exponential mixture (28)

$$r(Z; \beta, \theta) = (1 + Z'\beta)^{1-\theta} \exp(\theta Z'\beta), \qquad (7)$$

which produces the linear model when θ = 0 and the loglinear model when θ = 1. An alternative, based on the Box–Cox transformation, was proposed by Breslow & Storer (3); it also includes the linear and loglinear models as special cases. However, Moolgavkar & Venzon (20) pointed out that both of these mixture models are sensitive to the coding of the covariates: for example, for binary covariates, relabelling the two possible values leads to different models, and hence to different inferences both about the mixing parameter and about the relative importance of the component risk factors. Guerrero & Johnson (12) developed a variant of the Box–Cox model of the form

$$r(Z; \beta, \theta) = \begin{cases} \exp(Z'\beta), & \theta = 0, \\ (1 + \theta Z'\beta)^{1/\theta}, & \theta \neq 0, \end{cases} \qquad (8)$$

which appears to be the only model in the literature to date that does not suffer from this difficulty. These kinds of mixtures could in principle also be used to compare relative risk with additive (excess) risk models, although the interpretation of the β coefficient becomes problematic because it has different dimensions under the different submodels. Although suitable for testing multiplicativity vs. additivity with multidimensional categorical data, these mixtures are less useful for continuous covariates because they combine two quite different comparisons (the form of the dose–response relationship for each covariate and the form of their joint effects) into a single mixing parameter. One way around this difficulty is to compare linear and loglinear models for each covariate separately first to determine the best form of model, and then to fit joint models, testing additivity vs. multiplicativity. Alternatively, one could form mixtures of more than two submodels, with different mixing parameters for the different aspects.
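A hedged sketch of the exponential mixture (7) and the Box–Cox-type variant (8) as relative risk functions is given below; θ is the mixing/transformation parameter, the example values are arbitrary, and inference on θ should use likelihood ratio methods, as cautioned in the next paragraph.

```python
import numpy as np

def r_exponential_mixture(z, beta, theta):
    """Exponential mixture (7): linear model at theta = 0, loglinear at theta = 1."""
    lin = 1.0 + z @ beta
    return lin ** (1.0 - theta) * np.exp(theta * (z @ beta))

def r_box_cox(z, beta, theta):
    """Guerrero-Johnson form (8): loglinear relative risk in the limit theta -> 0."""
    eta = z @ beta
    if np.isclose(theta, 0.0):
        return np.exp(eta)
    return (1.0 + theta * eta) ** (1.0 / theta)

z = np.array([0.5, 1.0])
beta = np.array([0.3, -0.2])
for theta in (0.0, 0.5, 1.0):
    print(theta, r_exponential_mixture(z, beta, theta), r_box_cox(z, beta, theta))
```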
A word of warning is needed concerning inference on the parameters of most nonstandard models. Moolgavkar & Venzon (20) pointed out that for nonstandard models, convergence to asymptotic normality can be very slow indeed. Thus, the log-likelihoods are generally far from quadratic, leading to highly skewed confidence regions. For this reason, Wald tests and confidence limits should generally be avoided. Furthermore, as the parameter moves away from the null, the standard error increases more quickly than the mean, so that the Wald test can appear to become less and less significant the larger the value of the parameter (13,34). These problems are particularly important for the mixing parameters θ, for which inference should be based on the likelihood ratio test and likelihood-based confidence limits. For example, Lubin & Gaffey (15) describe an application of the exponential mixture of linear-additive and linear-multiplicative models (28) to testing the joint effect of radon and smoking on lung cancer risk in uranium miners; the point estimate of θ was 0.4, apparently closer to additivity than multiplicativity, but the likelihood ratio tests rejected the additive model ($\chi^2_1 = 9.8$) and not the multiplicative model ($\chi^2_1 = 1.1$). A linear mixture showed an even more skewed likelihood, with $\hat{\theta} = 0.1$ (apparently nearly additive) but with very similar likelihood ratio tests that rejected the additive but not the multiplicative model.

3.2 Models for Extended Exposure Histories

Chronic disease epidemiology often involves measurement of an entire history of exposure {X(u), u < t}, which we wish to incorporate into a relative risk model through one or more time-dependent covariates Z(t). How this is done depends upon one's assumptions about the underlying disease mechanism. We defer for the moment the possibility of modeling such a disease process directly and instead continue in the vein of empiric modeling, now focusing on eliciting information about the temporal modifiers of the exposure–response relationship.
Most approaches to exposure–response modeling in epidemiology are based on an implicit assumption of dose additivity, i.e. that the excess relative risk at time t is a sum of independent contributions from each increment of exposure at earlier times u, possibly modified in some fashion by temporal factors. This hypothesis can be expressed mathematically as r[t, X(·); β, γ] = R[Z(t); β], where

$$Z(t) = \int_0^t f[X(u); \alpha]\, g(t, u; \gamma)\, du, \qquad (9)$$

and where R(Z; β) is some known relative risk function such as the linear or loglinear models discussed above, f is a known function describing the modifying effect of dose rate, and g is a known function describing the modifying effect of temporal factors. The simplest weighting functions would be f(X) = X and g(t, u) = 1, for which Z(t) becomes cumulative exposure, probably the most widely used exposure index in epidemiology. For many diseases with long latency, such as cancer, it is common to use lagged cumulative exposure, corresponding to a weighting function of the form g(t, u; γ) = 1 if t − u > γ, zero otherwise. Other simple exposure indices might include time-weighted exposure $\int_0^{t} X(u)(t - u)\,du$ or age-weighted exposure $\int_0^{t-\gamma} X(u)\, u\, du$, which could be added as additional covariates to R(Z; β) to test the modifying effects of latency or age at exposure. The function f can be used to test dose-rate effects (the phenomenon that a long, low-intensity exposure has a different risk from a short, high-intensity exposure for the same cumulative dose); for example, one might adopt a model of the form f(X) = X^α or f(X) = X exp(−αX) for this purpose.
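The sketch below evaluates a discrete-time version of the weighted cumulative exposure Z(t) in Equation (9) for an exposure history recorded at yearly intervals. The dose-rate function f(X) = X^α and the lag-based weight g are the simple forms mentioned above; the exposure history itself is hypothetical and the yearly grid is only an approximation to the integral.

```python
import numpy as np

def weighted_cumulative_exposure(x_history, t, alpha=1.0, lag=0.0):
    """Discrete approximation to Z(t) in Equation (9).

    x_history[u] is the exposure rate in year u (u = 0, 1, ...).
    f(X) = X**alpha models dose-rate effects; g(t, u) = 1{t - u > lag}
    gives lagged cumulative exposure.
    """
    u = np.arange(len(x_history))
    in_window = (u < t) & ((t - u) > lag)
    return np.sum(np.asarray(x_history)[in_window] ** alpha)

x = [0, 0, 2.0, 2.0, 1.0, 0.5, 0, 0, 0, 0]   # hypothetical exposure rates by year
print(weighted_cumulative_exposure(x, t=10, lag=5))   # 5-year lagged cumulative exposure
```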
Models that do not involve unknown parameters α and γ are easily fitted using standard software by the device of computing the time-dependent covariate(s) for each subject in advance. Relatively simple functions of γ (such as the choice of lagging interval in the simple latency model) might be fitted by evaluating the likelihood over a grid of values of the parameter. For more complex functions g(t, u; γ), such as a lognormal density in t − u with unknown mean and variance (and perhaps additional dependence of these parameters on age, exposure rate, or other factors), it is preferable to use a package with the capability of computing Z(t; α, γ) at each iteration. This generally requires some programming by the user, whereas most of the likelihood calculations and iterative estimation are handled by the package. For example, using SAS procedure NLIN, one can recompute the covariates at each iteration by the appropriate commands inside the procedure.

Unfortunately, the additivity assumption has seldom been tested. In principle, this could be done by nesting the dose-additive model in some more general model that includes interactive effects between the dose increments received at different times. The obvious alternative model would simply add further covariates of the form

$$Z^{*}(t) = \int_0^t \int_0^u F[X(u)X(v); \alpha]\, G(t, u, v; \gamma, \delta)\, dv\, du, \qquad (10)$$
where F and G are some known weighting functions. However, one should take care to see that the dose-additive model is well fitted first before testing the additivity assumption (e.g. by testing for nonlinearities and temporal modifiers).

3.3 Nonparametric Models

The appeal of Cox's partial likelihood is that no assumptions are needed about the form of the dependence of risk on time, but it remains parametric in modeling covariate effects. Even more appealing would be a nonparametric model for both time and covariate effects. For categorical data, no parametric assumptions are needed, of course, although the effects of multiple covariates are commonly estimated using the loglinear (i.e. multiplicative) model, with additional interaction terms as needed. Similarly, continuous covariates are frequently categorized to provide a visual impression of the exposure–response relationship, but the choice of cutpoints is arbitrary. However, nonparametric smoothing techniques
are now available to allow covariate effects to be estimated without such arbitrary grouping. One approach relies only on an assumption of monotonicity. Thomas (29) adapted the technique of isotonic regression to relative risk modeling and showed that the MLE of the exposure–response relationship under this constraint was a step function with jumps at the observed covariate values of a subset of the cases. The technique has been extended to two dimensions by Ulm (33), but in higher dimensions the resulting function is difficult to visualize and can be quite unstable. Cubic splines and other means of smoothing provide attractive alternatives that produce smooth, but not necessarily monotonic, relationships. The generalized additive model (14) has been widely used for this purpose. For example, Schwartz (26) described the effect of air pollution on daily mortality rates using a generalized additive model, after controlling for weather variables and other factors using similar models. A complex dependence on dew point temperature was found, with multiple maxima and minima, whereas the smoothed plot of the particulate air pollution effect was seen to be almost perfectly linear over the entire range of concentrations. With the advent of Markov chain Monte Carlo methods, Bayesian techniques for model selection and smoothing have become feasible and are currently an active area of research. A full treatment of these methods is beyond the scope of this article; see Gilks et al. (11) for recent reviews of this literature.

4 MECHANISTIC MODELS

In contrast with the empiric models discussed above, there are circumstances where the underlying disease process is well enough understood to allow it to be characterized mathematically. Probably the greatest activity along these lines has been in the field of cancer epidemiology. Two models in particular have dominated this development, the multistage model of Armitage & Doll (1) and the two-event model of Moolgavkar & Knudson (18). For thorough reviews of this literature, see (17), (31), and (36); here, we merely sketch the basic ideas.
The Armitage–Doll multistage model postulates that cancer arises from a single cell that undergoes a sequence of k heritable changes, such as point mutations, chromosomal rearrangements, or deletions, in a particular order. The model further postulates that the rate of one or more of these changes may depend on exposure to carcinogens. The model then predicts that the hazard rate for the incidence of cancer (or, more precisely, the appearance of the first truly malignant cell) following continuous exposure at rate X is of the form

$$\lambda(t, X) = \alpha t^{k-1} \prod_{i=1}^{k} (1 + \beta_i X). \qquad (11)$$

Thus, the hazard has a power-function dependence on age and a polynomial dependence on exposure rate, with order equal to the number of dose-dependent stages. It further implies that two carcinogens would produce an additive effect if they act at the same stage and a multiplicative effect if they act at different stages. If exposure is instantaneous with intensity X(u), its effect is modified by the age at and time since exposure: if it acts at a single stage i, then the excess relative risk at time t is proportional to $Z_{ik}(t) = X(u)\, u^{i-1} (t - u)^{k-i-1} / t^{k-1}$, and for an extended exposure at varying dose rates, the excess relative risk is obtained by integrating this expression over u (8,35). More complex expressions are available for time-dependent exposures to multiple agents acting at multiple stages (30). These models can be fitted relatively easily using standard software by first evaluating the covariates $Z_{ik}(t)$ for each possible combination of i < k and then fitting the linear relative risk model, as described above. Note, however, that the expressions given above are only approximations to the exact solution of the stochastic differential equations (16), which are valid when the mutation rates are all small.

The Moolgavkar–Knudson two-stage model postulates that cancer results from a clone of cells of which one descendant has undergone two mutational events, either or both of which may depend on exposure to carcinogens. The clone of intermediate cells is subject to a birth-and-death process with
rates that may also depend on carcinogenic exposures. The number of normal stem cells at risk varies with age, depending on the development of the particular tissue. Finally, in genetically susceptible individuals, all cells carry the first mutation at birth. The predicted risk under this model (in nonsusceptible individuals) is then approximately

$$\lambda[t, X(\cdot)] = \mu_1 \mu_2 [1 + \beta_2 X(t)] \int_0^t [1 + \beta_1 X(u)]\, \exp[\rho(t - u)]\, du, \qquad (12)$$
where $\mu_1$ and $\mu_2$ are the baseline rates of the first and second mutations, $\beta_1$ and $\beta_2$ are the slopes of the dependence of the mutation rates on exposure, and ρ is the net proliferation rate (birth rate minus death rate) of the intermediate cells. For the more complex exact solution, see (24). There have been a number of interesting applications of these models to various carcinogenic exposures. For example, the multistage model has been fitted to data on lung cancer in relation to asbestos and smoking (30), arsenic (4), coke oven emissions (9), and smoking (5,10), as well as to data on leukemia and benzene (7) and nonleukemic cancers and radiation (32). The two-stage model has been fitted to data on lung cancer in relation to smoking (23), radon (21,25), and cadmium (27), as well as to data on breast (22) and colon cancers (19). For further discussion of some of these applications, see (31).

As in any other form of statistical modeling, the analyst should be cautious in interpretation. A good fit to a particular model does not of course establish the truth of the model. Instead, the value of models, whether descriptive or mechanistic, lies in their ability to organize a range of hypotheses into a systematic framework in which simpler models can be tested against more complex alternatives. The usefulness of the multistage model of carcinogenesis, for example, lies not in our belief that it is an accurate description of the process, but rather in its ability to distinguish whether a carcinogen appears to act early or late in the process, or at more than one stage. Similarly, the importance of the Moolgavkar–Knudson model lies in its ability to test whether a carcinogen acts as an ''initiator'' (i.e. on the mutation rates) or a
‘‘promoter’’ (i.e. on proliferation rates). Such inferences can be valuable, even if the model itself is an incomplete description of the process, as must always be the case.
REFERENCES

1. Armitage, P. & Doll, R. (1961). Stochastic models of carcinogenesis, in Proceedings of the Fourth Berkeley Symposium on Mathematics, Statistics and Probability, J. Neyman, ed. University of California Press, Berkeley, pp. 18–32.
2. Breslow, N. E. & Day, N. E. (1980). Statistical Methods in Cancer Research, Vol. I. The Analysis of Case–Control Studies. IARC Scientific Publications, No. 32, Lyon.
3. Breslow, N. E. & Storer, B. E. (1985). General relative risk functions for case–control studies, American Journal of Epidemiology 122, 149–162.
4. Brown, C. C. & Chu, K. (1983). A new method for the analysis of cohort studies: implications of the multistage theory of carcinogenesis applied to occupational arsenic exposure, Environmental Health Perspectives 50, 293–308.
5. Brown, C. C. & Chu, K. (1987). Use of multistage models to infer stage affected by carcinogenic exposure: example of lung cancer and cigarette smoking, Journal of Chronic Diseases 40, 171–179.
6. Cox, D. R. (1972). Regression models and life tables, Journal of the Royal Statistical Society, Series B 34, 187–220.
7. Crump, K. S., Allen, B. C., Howe, R. B. & Crockett, P. W. (1987). Time factors in quantitative risk assessment, Journal of Chronic Diseases 40, 101–111.
8. Day, N. E. & Brown, C. C. (1980). Multistage models and primary prevention of cancer, Journal of the National Cancer Institute 64, 977–989.
9. Dong, M. H., Redmond, C. K., Mazumdar, S. & Costantino, J. P. (1988). A multistage approach to the cohort analysis of lifetime lung cancer risk among steelworkers exposed to coke oven emissions, American Journal of Epidemiology 128, 860–873.
10. Freedman, D. A. & Navidi, W. C. (1989). Multistage models for carcinogenesis, Environmental Health Perspectives 81, 169–188.
11. Gilks, W. R., Richardson, S. & Spiegelhalter, D. J., eds (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall, London.
12. Guerrero, V. M. & Johnson, R. A. (1982). Use of the Box–Cox transformation with binary response models, Biometrika 69, 309–314.
13. Hauck, W. W. & Donner, A. (1977). Wald's test as applied to hypotheses in logit analysis, Journal of the American Statistical Association 72, 851–853.
14. Hastie, T. J. & Tibshirani, R. J. (1990). Generalized Additive Models. Chapman & Hall, New York.
15. Lubin, J. H. & Gaffey, W. (1988). Relative risk models for assessing the joint effects of multiple factors, American Journal of Industrial Medicine 13, 149–167.
16. Moolgavkar, S. H. (1978). The multistage theory of carcinogenesis and the age distribution of cancer in man, Journal of the National Cancer Institute 61, 49–52.
17. Moolgavkar, S. H. (1986). Carcinogenesis modelling: from molecular biology to epidemiology, Annual Review of Public Health 7, 151–169.
18. Moolgavkar, S. & Knudson, A. (1980). Mutation and cancer: a model for human carcinogenesis, Journal of the National Cancer Institute 66, 1037–1052.
19. Moolgavkar, S. H. & Luebeck, E. G. (1992). Multistage carcinogenesis: population-based model for colon cancer, Journal of the National Cancer Institute 84, 610–618.
20. Moolgavkar, S. & Venzon, D. J. (1987). General relative risk regression models for epidemiologic studies, American Journal of Epidemiology 126, 949–961.
21. Moolgavkar, S. H., Cross, F. T., Luebeck, G. & Dagle, G. D. (1990). A two-mutation model for radon-induced lung tumors in rats, Radiation Research 121, 28–37.
22. Moolgavkar, S. H., Day, N. E. & Stevens, R. G. (1980). Two-stage model for carcinogenesis: epidemiology of breast cancer in females, Journal of the National Cancer Institute 65, 559–569.
23. Moolgavkar, S. H., Dewanji, A. & Luebeck, G. (1989). Cigarette smoking and lung cancer: reanalysis of the British doctors' data, Journal of the National Cancer Institute 81, 415–420.
24. Moolgavkar, S. H., Dewanji, A. & Venzon, D. J. (1988). A stochastic two-stage model for cancer risk assessment. I. The hazard function and the probability of tumor, Risk Analysis 8, 383–392.
25. Moolgavkar, S. H., Luebeck, E. G., Krewski, D. & Zielinski, J. M. (1993). Radon, cigarette smoke, and lung cancer: a re-analysis of the Colorado Plateau uranium miners' data, Epidemiology 4, 204–217.
26. Schwartz, J. (1993). Air pollution and daily mortality in Birmingham, Alabama, American Journal of Epidemiology 137, 1136–1147.
27. Stayner, L., Smith, R., Bailer, A. J., Luebeck, E. G. & Moolgavkar, S. H. (1995). Modeling epidemiologic studies of occupational cohorts for the quantitative assessment of carcinogenic hazards, American Journal of Industrial Medicine 27, 155–170.
28. Thomas, D. C. (1981). General relative risk models for survival time and matched case–control studies, Biometrics 37, 673–686.
29. Thomas, D. C. (1983). Nonparametric estimation and tests of fit for dose–response relations, Biometrics 39, 263–268.
30. Thomas, D. C. (1983). Statistical methods for analyzing effects of temporal patterns of exposure on cancer risks, Scandinavian Journal of Work and Environmental Health 9, 353–366.
31. Thomas, D. C. (1988). Models for exposure–time–response relationships with applications in cancer epidemiology, Annual Review of Public Health 9, 451–482.
32. Thomas, D. C. (1990). A model for dose rate and duration of exposure effects in radiation carcinogenesis, Environmental Health Perspectives 87, 163–171.
33. Ulm, K. (1983). Dose–response models in epidemiology, in Mathematics in Biology and Medicine: An International Conference. Bari, Italy.
34. Vaeth, M. (1985). On the use of Wald's test in exponential families, International Statistical Review 53, 199–214.
35. Whittemore, A. S. (1977). The age distribution of human cancers for carcinogenic exposures of varying intensity, American Journal of Epidemiology 106, 418–432.
36. Whittemore, A. & Keller, J. B. (1978). Quantitative theories of carcinogenesis, SIAM Review 20, 1–30.
RELIABILITY STUDY
DONNA SPIEGELMAN
Harvard School of Public Health, Boston, MA, USA

Reliability studies and validation studies provide information on measurement error in exposures or other covariates used in epidemiologic studies. Such information on the measurement error process is needed to obtain valid estimates and inference using methods such as regression calibration or maximum likelihood. Reliability studies are based on repeating an error-prone measurement, and the validity of this method depends on a model for the errors given by (1) below. Validation studies are applicable to a broader class of error models, including models admitting differential error, but validation studies require that one be able to measure correct (''gold standard'') covariate values on some subjects.

To define reliability sampling plans more precisely, let Y be the response variable, and let X be the true values of the variable which may be misclassified or measured with error. In some cases, X can never be observed and can be thought of as a latent variable. In other cases, X is a ''gold standard'' method of covariate assessment that is infeasible and/or expensive to administer to large numbers of study participants. Instead of observing X, we observe W, which is subject to error. Finally, there may be covariates Z upon which the model for response depends that are measured without error. In main study/reliability study designs, the main study consists of the data (Yi, Wi, Zi), i = 1, . . . , n1. If the reliability study is internal, it consists of the observations (Yi, Wij, Zi), j = 1, . . . , ni, i = n1 + 1, . . . , n1 + n2, and if the reliability study is external, it consists of the observations (Wij), j = 1, . . . , ni, i = n1 + 1, . . . , n1 + n2. Thus, there is only a single measurement for each main study subject, but replicate measurements for each subject in the reliability study. The measurement error model for which a reliability study can be used is

$$W = X + U, \qquad (1)$$

where U is a mean-zero error term with some variance–covariance matrix Σ. The error U is assumed independent of X. Note that model (1) implies that the error is nondifferential, not only with respect to Y but also with respect to Z, because f(W|X, Y, Z) = f(W|X). Under model (1), replicate data from a reliability study can be used for valid estimation and inference. This model has been applied to the analysis of blood pressure, serum hormones, and other serum biomarkers such as vitamin concentrations, viral load measurements, and CD4 cell counts.

We assume that subjects in an internal reliability study are selected completely at random. That is, if V is an indicator variable that equals 1 if a participant is in the reliability study and 0 otherwise, then Pr(V = 1|Y, X, Z, W) = Pr(V = 1) = π. To correct point and interval estimates relating Y to X for bias from measurement error in W, it is necessary to estimate Σ and var(X) = Σ_X using model (1). Estimates of these quantities, Σ and Σ_X, are needed to correct the estimate of the parameter of interest describing the association between Y and X, β, for bias due to measurement error. If an internal reliability sample is used, one can estimate Σ from it. The quantity var(W) = Σ_W can be estimated from the combined main study/internal reliability study data, and Σ_X can then be estimated by $\hat{\Sigma}_X = \hat{\Sigma}_W - \hat{\Sigma}$. The same approach can be used if an external reliability sample is used except that, in this case, Σ_W should be estimated from the main study only. This is because, under model (1), it is reasonable to assume that Σ may be transportable from one population to another, whereas Σ_X, and hence Σ_W, are likely to vary across populations. Because an internal reliability study ensures that Σ is correctly estimated and yields more efficient estimates of Σ_X, it is preferred to an external reliability study.

In some applications, the goal of the research is simply estimation of the reliability coefficient, also known as the intraclass correlation coefficient, ρ, equal to $\Sigma_X[\Sigma_W]^{-1}$. These applications arise, for example, in the evaluation of new medical diagnostic procedures, such as new technology for ascertaining the load of HIV in body tissue, or in assessing the consistency of different clinicians in evaluating the functional status of their patients.
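For the univariate case, the sketch below shows how replicate measurements from an internal reliability sample could be used to estimate the error variance, the corrected exposure variance, and the reliability coefficient described above. The data are simulated, the design constants are arbitrary, and the scalar (rather than matrix) treatment is a deliberate simplification of the general formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated main study (one W per subject) and internal reliability study
# (R = 2 replicates per subject), generated under model (1): W = X + U.
sigma2_x, sigma2_u = 4.0, 1.0
x_main = rng.normal(0, np.sqrt(sigma2_x), 1000)
w_main = x_main + rng.normal(0, np.sqrt(sigma2_u), 1000)
x_rel = rng.normal(0, np.sqrt(sigma2_x), 200)
w_rel = x_rel[:, None] + rng.normal(0, np.sqrt(sigma2_u), (200, 2))

# Error variance from within-subject differences of the replicates
sigma2_u_hat = np.var(w_rel[:, 0] - w_rel[:, 1], ddof=1) / 2.0

# var(W) from the combined data, then var(X) = var(W) - var(U)
w_all = np.concatenate([w_main, w_rel.ravel()])
sigma2_w_hat = np.var(w_all, ddof=1)
sigma2_x_hat = sigma2_w_hat - sigma2_u_hat

rho_hat = sigma2_x_hat / sigma2_w_hat      # reliability (intraclass correlation)
print(sigma2_u_hat, sigma2_x_hat, rho_hat)
```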
Designs of studies whose purpose is to estimate the reliability coefficient have n1 = 0 and no data on Y. In what follows, we first discuss the design of such reliability studies. We then discuss the main study/reliability study design, where n1 > 0 and Y is observed in the main study and possibly in the reliability study.

1 DESIGN OF RELIABILITY STUDIES
A nontechnical introduction to reliability study design considerations appeared in a recent epidemiology textbook by Armstrong et al. (1). A series of papers by Donner and colleagues (3,4,8) investigated the design of reliability studies in considerable detail. The first and last of these provided formulas for the power to test H0: ρ = ρ0 vs. Ha: ρ = ρA, where ρ is the intraclass correlation or reliability coefficient, equal to Σ_X/Σ_W, for a given (n2, R), where R is the number of replicates per subject. In addition, tables were given for power for fixed values of n2 and R. The first paper was based upon exact calculations, and the last paper developed a less computationally intensive approximation to the exact formula, which appears to work quite well. The total number of observations (n2 × R) is minimized with a relatively small value for R as long as the reliability is 40% or higher; in these cases, R = 2 or 3 is sufficient. Eliasziw & Donner (4) minimized reliability study cost with respect to n2 and R, subject to fixed power to test H0 vs. Ha as given above, using the formula for power derived in (3). Cost was taken as a function of the unit cost of replicating data within subjects, the unit cost of accruing subjects, and the unit cost related jointly to the number of replicates and the number of subjects. Tables were given for the optimal values of n2 and R for different unit cost ratios and different values of ρ0. They found that for ρ > 0.2, the cost per subject is more influential than the cost per measurement. In addition, they found that the optimal n2 and R were highly stable despite moderate changes in unit cost ratios. Freedman et al. (5) investigated the design of reliability studies when X and W are
binary. The reliability of W as a surrogate for X was parameterized by the probability of disagreement between two replicate measures of X, W1 and W2, corresponding to the values obtained from two different raters. They gave tables for the n2 that ensures a fixed confidence interval width around the estimated probability of disagreement when R = 2. For probabilities of disagreement between 0.05 and 0.40 and confidence interval widths of 0.1 to 0.2, sample sizes between 50 and 350 are needed. These authors also considered study design when the goal is to estimate the within-rater probability of disagreement as well as the between-rater probability of disagreement, and they provided tables of power for scenarios in which there are two raters and two replicates per rater.

2 DESIGN OF MAIN STUDY/RELIABILITY STUDIES

One can select various main study sizes (n1), reliability study sizes (n2), and numbers of replicate measurements (R) for each subject in the reliability substudy. An ''optimal'' main study/reliability study design will find (n1, n2, R) to achieve some design goal. One may wish to minimize the variance of an important parameter estimate, such as the log relative risk, β, subject to a fixed total cost. Alternatively, one may wish to minimize the overall cost of the study, subject to specified power constraints on the parameter of interest (see elsewhere for further discussion of choices of design optimization criteria).

Liu & Liang (6) considered the optimal choice of R for internal reliability designs with n1 = 0, that is, designs in which all subjects are in the reliability study. They studied generalized linear models for f(Y|X, Z; β) with the identity, log, probit, and logit link functions. They assumed the measurement error model for X described by (1), with X following a multivariate normal distribution MVN(µ_X, Σ_X). The validity of their results required an additional approximation in the case of the logistic link function, which is the link function most commonly used in epidemiology. For scalar X and W, these authors derived a formula for the asymptotic relative efficiency of β*, the measurement-error corrected parameter describing the relationship between Y and X, as a function of Σ/Σ_X and R. They found that the precision of β*, relative to the precision that would be obtained for estimating β if X were never measured with error, is little improved by increasing R above 4.

Rosner et al. (7) investigated the effect of changing n2 and R on the variance of the elements of a nine-dimensional vector β, where β is the log odds ratio relating coronary heart disease incidence to the model covariates in data from the Framingham Heart Study (2). Four of the model covariates were measured with error (Figure 1). In this figure, n1 was 1731, and Σ and Σ_X were assigned the values estimated in the analysis. When n2 was greater than or equal to 100, the standard errors of the four measurement-error corrected estimates reached an asymptote, indicating little or no gain in efficiency from increasing n2 beyond that value. At that point, the gain in efficiency ranges between a 10% and 20% reduction in the variance for the three variables measured with some error (BMI has little error, as evidenced by the high reliability coefficient, rI = 95%). Increasing the number of replicates decreased the standard errors of the estimates substantially when n2 was small, but made little difference for larger reliability studies. For the three variables measured with error (cholesterol, glucose, and systolic blood pressure), the design (n2 = 10, R = 10) was as efficient as the design (n2 = 100, R = 2). Although the former requires fewer measurements, the latter may be more feasible, because it requires only two visits per subject.

Figure 1. The relationship between the sample size (n2) and the number of replicates per subject (R) in a reliability study, and the standard error of the measurement-error corrected logistic regression coefficient, β*. Abbreviations and symbols used are: se for standard error and rI for the reliability coefficient var(x)/var(w). Number of replicates, R: • = 2; + = 3; * = 5; ∼ = 10. (a) Cholesterol; (b) glucose; (c) body mass index; (d) systolic blood pressure.

3 CONCLUSION
Although model (1) is restrictive, there are many instances in biomedical research where
it is considered reasonable. Methods of analysis under this model are well developed, but more research is needed on optimal design, and there is a need for user-friendly software for finding optimal designs.

4 ACKNOWLEDGMENTS

This work was supported by National Cancer Institute grants CA50587 and CA03416.

REFERENCES

1. Armstrong, B. K., White, E. & Saracci, R. (1992). Principles of Exposure Measurement in Epidemiology. Oxford University Press, Oxford, pp. 89–94.
2. Dawber, T. R. (1980). The Framingham Study. Harvard University Press, Cambridge, Mass.
3. Donner, A. P. & Eliasziw, M. (1987). Sample size requirements for reliability studies, Statistics in Medicine 6, 441–448.
4. Eliasziw, M. & Donner, A. P. (1987). A cost-function approach to the design of reliability studies, Statistics in Medicine 6, 647–655.
5. Freedman, L. S., Parmar, M. K. B. & Baker, S. G. (1993). The design of observer agreement studies with binary assessments, Statistics in Medicine 12, 165–179.
6. Liu, X. & Liang, K. Y. (1992). Efficacy of repeated measurements in regression models with measurement error, Biometrics 48, 645–654.
7. Rosner, B., Spiegelman, D. & Willett, W. (1992). Correction of logistic regression relative risk estimates and confidence intervals for random within-person measurement error, American Journal of Epidemiology 136, 1400–1413.
8. Walter, S. D., Eliasziw, M. & Donner, A. P. (1998). Sample size and optimal designs for reliability studies, Statistics in Medicine 17, 101–110.
REPEATABILITY AND REPRODUCIBILITY
JOHN E. CONNETT
Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota

1 REPEATABILITY AND REPRODUCIBILITY, WITH APPLICATIONS TO DESIGN OF CLINICAL TRIALS

Repeatability and reproducibility are related, but not identical, terms that describe how closely repeated measurements of the same entity can be expected to agree. Both repeatability and reproducibility are important in designing a clinical trial, in assessing the quality of laboratory procedures, and in analyzing quantitative outcomes. Repeatability is the standard deviation of measurement of a physical quantity under identical conditions: the same equipment, the same laboratory, the same technician, and the same source of the specimen being measured, with a relatively short time span between the measurements.

1.0.1 Example 1. To estimate the repeatability of the measurement of systolic and diastolic blood pressure, standardized conditions should be imposed: the same subject, comfortably seated for at least 5 minutes (longer if the measurement is preceded by a period of strenuous activity) in a straight-back chair in a quiet location, feet flat on the floor. The same blood pressure cuff and manometer would be applied to the same arm by the same technician following a standardized blood pressure measuring protocol. The pulse rate should be measured simultaneously with the blood pressure measurement; repeated instances to determine the repeatability of the blood pressure measurement should have close agreement of the pulse rate. Because an initial measurement of blood pressure is often elevated because of the so-called pressor effect (apprehension caused by lack of familiarity with the procedure or anxiety concerning the results), estimation of repeatability should be based on repeated measurements of blood pressure that do not include the first measurement of the day.

1.0.2 Example 2. To estimate the repeatability of the laboratory assessment of the level of serum cholesterol, a split-specimen technique may be used. This requires aliquoting a well-homogenized serum specimen into two or more equal-sized portions. The same laboratory protocol, equipment, settings, and standardized reagents should be used by the same laboratory operator. The time between measurement of aliquots should be as small as possible to reduce variability caused by chemical changes or deterioration.

Repeatability provides a lower bound on measurement error. It represents intrinsic or unavoidable variability caused by unknown and perhaps uncontrollable factors. As such, it can provide a useful index of the quality of measurement of a given laboratory. Repeatability is used in computation of the coefficient of variation (q.v.), which is defined as CV = σ/µ, where the standard deviation σ is the repeatability standard deviation and µ is the expected value. The coefficient of variation is often expressed as a percent; coefficients of variation typically range from 1% to 25%. Generally speaking, a coefficient of variation less than 3% to 5% would be considered acceptable.

Repeatability is thus a somewhat idealized quantity. In practice, measurements are not made under identical conditions. There is variation in the subject's condition and in technicians, equipment, laboratory locations, reagents, time of day, and many other factors that can substantially affect the measurement. In general, if the outcome of a clinical trial is a quantitative measure like systolic blood pressure, repeatability will give a too-conservative estimate of the variability of measurement, resulting in an underestimate of the required sample size (q.v.). Thus, estimates of repeatability should be used with caution in the design of a trial, because the ideal conditions under which
repeatability was estimated are often not met. Despite this, there are situations where using repeatability as a basis for an initial sample size estimate can be useful. If, for example, the repeatability standard deviation of systolic blood pressure is used to compute sample size for a trial in which changes in blood pressure over a one-month period are to be compared between two drug regimens, the resulting sample size estimate will very likely represent an absolute minimum necessary for the trial under the most ideal conditions. This can be a useful lower bound on the sample size, if even under completely ideal conditions, the minimum required sample size exceeds what can possibly be attained in the patient population to which the investigators have access. It would have the effect of removing the proposed trial design in that population from additional consideration. Low repeatability can result in unanticipated problems in a clinical trial. Blood pressure measurements are subject to error as noted above, some of which (e.g., the pressor effect) can lead to bias and some of which can lead to uncertainties in classifying participants as hypertensive or not. This can affect eligibility for a trial. Similarly, the protocol for a clinical trial may call for initiation of antihypertensive treatment if systolic blood pressure (SBP) exceeds a specified threshold. If interim measurements of SBP are subject to random or systematic error, some ineligible participants will be admitted into the trial, and some may be treated inappropriately. For that reason, many BP trial protocols will require repeat confirmatory measurements of SBP before ascertainment of eligibility or initiation of treatment: for example, four measurements at least 5 minutes apart by the same technician, with the criterion blood pressure defined as the average of the last three measurements. In this situation, since the conditions of measurement are held constant, the repeatability standard deviation may be used to compute the standard error of the average. To take day-to-day variability into account, measurements bearing on eligibility or treatment initiation may be repeated at 24- or 48-hour intervals. As noted, estimates of variability based on the repeatability standard deviation will usually be lower
than what is obtained in a realistic application. The same considerations apply to a wide spectrum of quantitative measurements (e.g., CD4+ counts in HIV studies, serum creatinine levels in studies of kidney transplant survival, lung function measurements in COPD, etc.). Reproducibility is the standard deviation of repeated measurements of a physical quantity under realistic variations in the conditions of measurement: e.g., variations in technicians, equipment, time of day, laboratory, ambient temperature, condition of the subject, and innumerable other factors that either are not known or cannot easily be controlled. Reproducibility will be larger than repeatability because it includes components of variance that repeatability does not. In designing clinical trials in which the primary outcome is a quantitative measure, a prior estimate of reproducibility is essential. 1.0.3 Example. A clinical trial evaluating a drug designed to reduce diastolic blood pressure is planned. Hypertensive subjects are randomized to receive a standard dose of active drug A for a one-month period, or to receive placebo P for one month. Diastolic blood pressure is measured before randomization and again after one month of treatment. The primary outcome is the change in blood pressure between the two measurements. The repeatability standard deviation for diastolic blood pressure (based on repeated measurements of the same subjects by the same technician using the same equipment and identical test conditions) is, say, σ repeat = 4 mm Hg. However, it cannot be guaranteed that the second test at one month will be conducted under identical conditions to those of the first test; there may be unavoidable differences in the technician, the blood pressure cuff, the time of day, and the condition of the subject. The reproducibility standard deviation is estimated as, e.g., σ reprod = 8 mm Hg. However, this figure is also not sufficient by itself for the purpose of estimation of sample size, because there is another important component of variation: the between-subjects standard deviation σ between of the subjects’ ‘‘true’’ blood pressure, which is, e.g., 5 mm Hg. The standard deviation of measurement that should be used for sample size estimation
with these figures is

σtotal = √(σreprod² + σbetween²) = √(8² + 5²) ≈ 9.4 mm Hg.
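To make the arithmetic concrete, the following minimal Python sketch applies the standard normal-approximation formula for comparing two means, n per group = 2(z1−α/2 + zβ)²σ²/Δ². The 5 mm Hg difference to detect, the 5% two-sided significance level, and the 80% power are illustrative assumptions, not values taken from the example; the point is only how strongly the choice between the repeatability and the total standard deviation drives the required sample size.

from math import ceil
from scipy.stats import norm

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for comparing two means."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

sigma_repeat = 4.0                          # repeatability SD (mm Hg)
sigma_total = (8.0 ** 2 + 5.0 ** 2) ** 0.5  # sqrt(reproducibility^2 + between-subject^2), about 9.4
delta = 5.0                                 # assumed clinically relevant difference (mm Hg)

print(n_per_group(sigma_repeat, delta))     # optimistic lower bound (about 11 per arm)
print(n_per_group(sigma_total, delta))      # more realistic requirement (about 56 per arm)

With these illustrative inputs, relying on the repeatability standard deviation alone understates the required sample size by roughly a factor of (9.4/4)², i.e., about five-fold.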
Estimation of repeatability and reproducibility can be based on the analysis of variance (q.v.) of repeated measurements of the quantity of interest in a group of subjects. The repeatability standard deviation would be estimated as the square root of the within-subjects mean square, where all controllable conditions of measurement are held identical. Similarly, the reproducibility standard deviation would be estimated as the square root of the within-subjects mean square, where realistic variability in the conditions of measurement is incorporated.

1.0.4 Related Measures. Coefficient of variation (noted above); Pearson correlation coefficient; intraclass correlation coefficient; concordance correlation coefficient (1).
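As a minimal illustration of the analysis-of-variance estimates just described, and of the intraclass correlation coefficient listed among the related measures, the following Python sketch computes the within-subjects mean square, the corresponding repeatability (or reproducibility) standard deviation, and the one-way random-effects intraclass correlation from a rectangular array of replicate measurements. The simulated blood pressure readings and the chosen numbers of subjects and replicates are purely illustrative.

import numpy as np

def repeatability_and_icc(x):
    """x: 2-D array, rows = subjects, columns = replicate measurements taken under
    identical (repeatability) or realistically varying (reproducibility) conditions."""
    n_subj, n_rep = x.shape
    subj_means = x.mean(axis=1)
    grand_mean = x.mean()
    ms_between = n_rep * ((subj_means - grand_mean) ** 2).sum() / (n_subj - 1)
    ms_within = ((x - subj_means[:, None]) ** 2).sum() / (n_subj * (n_rep - 1))
    sd_within = np.sqrt(ms_within)                  # repeatability (or reproducibility) SD
    var_between = max((ms_between - ms_within) / n_rep, 0.0)
    icc = var_between / (var_between + ms_within)   # one-way random-effects ICC
    return sd_within, icc

# e.g., 20 subjects, 3 replicate diastolic blood pressure readings each (simulated)
rng = np.random.default_rng(0)
truth = rng.normal(80, 5, size=(20, 1))             # subjects' "true" diastolic BP
readings = truth + rng.normal(0, 4, size=(20, 3))   # measurement error SD of 4 mm Hg
print(repeatability_and_icc(readings))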
FURTHER READING

L. I. Lin, A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989; 45: 255–268.

S. Bauer and J. W. Kennedy, Applied statistics for the clinical laboratory: II. Within-run imprecision. J. Clin. Lab. Automat. 1981; 1: 197–201.

R. A. Deyo, P. Diehr, and D. L. Patrick, Reproducibility and responsiveness of health status measures: statistics and strategies for evaluation. Contr. Clin. Trials 1991; 12(4 Suppl): 142S–158S.

J. Timmer, Scientists on Science: Reproducibility. (2006). Available: http://arstechnica.com/journals/science.ars/2006/10/25/5744.
CROSS-REFERENCES

Standard Deviation
Coefficient of Variation
Analysis of Variance
Components of Variance
Intraclass Correlation Coefficient
Sample Size Estimation
REPEATED MEASUREMENTS
GEERT VERBEKE
EMMANUEL LESAFFRE
Catholic University of Leuven, Leuven, Belgium

1 INTRODUCTION AND CASE STUDY
In medical science, studies are often designed to investigate changes in a specific parameter that are measured repeatedly over time in the participating subjects. Such studies are called longitudinal studies, in contrast to cross-sectional studies where the response of interest is measured only once for each individual. As pointed out by Diggle et al. (1), the main advantage of longitudinal studies is that they can distinguish changes over time within individuals (longitudinal effects) from differences among people in their baseline values (cross-sectional effects).

In randomized clinical trials, where the aim is usually to compare the effect of two (or more) treatments at a specific time-point, the need and the advantage of taking repeated measures is at first sight less obvious. Indeed, a simple comparison of the treatment groups at the end of the follow-up period is often sufficient to establish the treatment effect(s) (if any) by virtue of the randomization. However, in some instances, it is important to know how the patients have reached their endpoint; i.e., it is important to compare the average profiles (over time) between the treatment groups. Furthermore, longitudinal studies can be more powerful than studies evaluating the treatments at one single time-point. Finally, follow-up studies often suffer from dropout; i.e., some patients leave the study prematurely, for known or unknown reasons. In such cases, a full repeated-measures analysis will help in drawing inferences at the end, because such analyses implicitly impute the missing values.

As a typical example, data from a randomized, double-blind, parallel group, multicentre study are considered for the comparison of two oral treatments (in the sequel coded as A and B) for toenail dermatophyte onychomycosis (TDO). Refer to De Backer et al. (2) for more details about this study. TDO is a common toenail infection, difficult to treat, affecting more than 2% of the population. Antifungal compounds classically used for treatment of TDO need to be taken until the whole nail has grown out healthy. However, new compounds have reduced the treatment duration to 3 months. The aim of the current study was to compare the efficacy and safety of two such new compounds, labeled A and B, and administered during 12 weeks. In total, 2 × 189 patients were randomized, distributed over 36 centers. Subjects were followed during 12 weeks (3 months) of treatment and followed further, up to a total of 48 weeks (12 months). Measurements were taken at baseline, every month during treatment, and every 3 months afterwards, resulting in a maximum of 7 measurements per subject. As a first response, the unaffected nail length (one secondary endpoint in the study) was considered, measured from the nail bed to the infected part of the nail, which is always at the free end of the nail, expressed in millimeters. Obviously this response will be related to the toe size. Therefore, only those patients are included here for which the target nail was one of the two big toenails, which reduces the sample under consideration to 146 and 148 subjects, respectively. Individual profiles for 30 randomly selected subjects in each treatment group are shown in Fig. 1. The second outcome will be severity of the infection, coded as 0 (not severe) or 1 (severe). The question of interest was whether the percentage of severe infections decreased over time, and whether that evolution was different for the two treatment groups. A summary of the number of patients in the study at each time-point and the number of patients with severe infections is given in Table 1.

A key issue in the analysis of longitudinal data is that outcome values measured repeatedly within the same subjects tend to be correlated, and this correlation structure needs to be taken into account in the statistical analysis. This finding is easily seen with paired observations obtained from, e.g., a pretest/post-test experiment.
Figure 1. Toenail data: Individual profiles of 30 randomly selected subjects in each treatment arm. (Unaffected nail length in mm versus time in months, panels for Treatment A and Treatment B.)

Table 1. Toenail Data: Number and Percentage of Patients with Severe Toenail Infection, for each Treatment Arm Separately

                      Group A                              Group B
             # Severe   # Patients   Percentage    # Severe   # Patients   Percentage
Baseline        54         146         37.0%          55         148         37.2%
1 month         49         141         34.7%          48         147         32.6%
2 months        44         138         31.9%          40         145         27.6%
3 months        29         132         22.0%          29         140         20.7%
6 months        14         130         10.8%           8         133          6.0%
9 months        10         117          8.5%           8         127          6.3%
12 months       14         133         10.5%           6         131          4.6%
An obvious choice for the analysis is the paired t-test, based on the subject-specific difference between the two measurements. Although an unbiased estimate for the treatment effect can also be obtained from a two-sample t-test, standard errors (and hence P-values and confidence intervals) that do not account for the correlation within pairs will not reflect the correct sampling variability and can lead to wrong inferences. In general, classic statistical procedures assuming independent observations cannot be used in the context of repeated measurements.

In this article, an overview is given of the most important models useful for the analysis of clinical trial data and widely available through commercial statistical software packages. In Section 2, the focus is on linear models for Gaussian data. In Section 3, models for the analysis of discrete outcomes are discussed. Section 4 deals with some design issues, and Section 5 ends the article with some concluding remarks.
2 LINEAR MODELS FOR GAUSSIAN DATA

With repeated Gaussian data, a general, and very flexible, class of parametric models is obtained from a random-effects approach. Suppose that an outcome Y is observed repeatedly over time for a set of persons, and suppose that the individual trajectories are of the type shown in Fig. 2. Obviously, a linear regression model with intercept and linear time effect seems plausible to describe the data of each person separately. However, different persons tend to have different intercepts and different slopes. One can therefore assume that the jth outcome Yij of subject i (i = 1, ..., N; j = 1, ..., ni), measured at time tij, satisfies Yij = b̃i0 + b̃i1 tij + εij. Assuming the vector b̃i = (b̃i0, b̃i1) of person-specific parameters to be bivariate normal with mean (β0, β1) and 2 × 2 covariance matrix D, and assuming εij to be normal as well, this leads to a so-called linear mixed model.
Figure 2. Hypothetical example of continuous longitudinal data that can be well described by a linear mixed model with random intercepts and random slopes. The thin lines represent the observed subject-specific evolutions. The bold line represents the population-averaged evolution. Measurements are taken at six time-points: 0, 1, 2, 3, 4, and 5.
In practice, one will often formulate the model as

Yij = (β0 + bi0) + (β1 + bi1) tij + εij

with b̃i0 = β0 + bi0 and b̃i1 = β1 + bi1, and the new random effects bi = (bi0, bi1) are now assumed to have mean zero. The above model can be viewed as a special case of the general linear mixed model, which assumes that the outcome vector Yi of all ni outcomes for subject i satisfies

Yi = Xi β + Zi bi + εi        (1)
in which β is a vector of population-averaged regression coefficients called fixed effects, and where bi is a vector of subject-specific regression coefficients. The bi are assumed normal with mean vector 0 and covariance D, and they describe how the evolution of the ith subject deviates from the average evolution in the population. The matrices Xi and Zi are (ni × p) and (ni × q) matrices of known covariates. Note that p and q are the numbers of fixed and subject-specific regression parameters in the model, respectively. The residual components εi are assumed to be independent N(0, Σi), where Σi depends on i only through its dimension ni.

Model (1) naturally follows from a so-called two-stage model formulation. First, a linear regression model is specified for every subject separately, modeling the outcome variable as a function of
time. Afterward, in the second stage, multivariate linear models are used to relate the subject-specific regression parameters from the first-stage model to subject characteristics such as age, gender, and treatment. Estimation of the parameters in Equation (1) is usually based on maximum likelihood (ML) or restricted maximum likelihood (REML) estimation for the marginal distribution of Yi, which can easily be seen to be

Yi ∼ N(Xi β, Zi D Zi′ + Σi).        (2)
Note that model (1) implies a model with a very specific mean and covariance structure, which may or may not be valid, and hence needs to be checked for every specific data set at hand. Note also that, when Σi = σ²Ini, with Ini equal to the identity matrix of dimension ni, the observations of subject i are independent conditionally on the random effect bi. The model is therefore called the conditional independence model. Even in this simple case, the assumed random-effects structure still imposes a marginal correlation structure for the outcomes Yij. Indeed, even if all Σi equal σ²Ini, the covariance matrix in Equation (2) is not the identity matrix, illustrating that, marginally, the repeated measurements Yij of subject i are not assumed to
be uncorrelated. Another special case arises when the random effects are omitted from the model. In that case, the covariance matrix of Yi is modeled through the residual covariance matrix Σi. In the case of completely balanced data, i.e., when ni is the same for all subjects, and when the measurements are all taken at fixed time-points, one can assume all Σi to be equal to a general unstructured covariance matrix Σ, which results in the classical multivariate regression model.

Inference in the marginal model can be done using classic techniques including approximate Wald tests, t-tests, F-tests, or likelihood ratio tests. Finally, Bayesian methods can be used to obtain "empirical Bayes estimates" for the subject-specific parameters bi in Equation (1). Refer to Henderson et al. (3), Harville (4–6), Laird and Ware (7), and Verbeke and Molenberghs (8, 9) for more details about estimation and inference in linear mixed models.

As an illustration, the unaffected nail-length response in the toenail example is analyzed. The model proposed by Verbeke et al. (10) assumes a quadratic evolution for each subject, with subject-specific intercepts, and with correlated errors within subjects. More formally, they assume that Yij satisfies

Yij(t) = (βA0 + bi) + βA1 t + βA2 t² + εi(t)   in group A,
Yij(t) = (βB0 + bi) + βB1 t + βB2 t² + εi(t)   in group B,        (3)

where t = 0, 1, 2, 3, 6, 9, 12 is the time in the study, expressed in months. The error components εi(t) are assumed to have common variance σ², with correlation of the form corr(εi(t), εi(t − u)) = exp(−ϕu²) for some unknown parameter ϕ. Hence, the correlation between within-subject errors is a decreasing function of the time span between the corresponding measurements. Fitted average profiles are shown in Fig. 3. An approximate F-test shows that, on average, there is no evidence for a treatment effect (P = 0.2029).

Note that, even when interest would only be in comparing the treatment groups after 12 months, this could still be done based on the above-fitted model. The average difference between group A and group B after 12 months is given by (βA0 − βB0) + 12(βA1 − βB1) + 12²(βA2 − βB2). The estimate for this difference equals 0.80 mm (P = 0.0662). Alternatively, a two-sample t-test could be performed based on those subjects that have completed the study. It yields an estimated treatment effect of 0.77 mm (P = 0.2584), illustrating that modeling the whole longitudinal sequence also provides more efficient inferences at specific time-points.
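Models of this type can be fitted with any mixed-model software (for example, SAS PROC MIXED, mentioned at the end of this article). As a rough, non-authoritative sketch of what such a fit looks like in code, the Python fragment below uses statsmodels to fit a random-intercepts model with group-specific linear and quadratic time trends to simulated data laid out like the toenail study. It does not reproduce the exponential serial correlation of model (3), and the data, the column names (subject, group, time, y), and the parameter values are invented for illustration only.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated long-format data standing in for the toenail study (names illustrative).
rng = np.random.default_rng(1)
times = np.array([0, 1, 2, 3, 6, 9, 12], dtype=float)
rows = []
for subj in range(100):
    group = "A" if subj < 50 else "B"
    b_i = rng.normal(0, 2.0)                    # subject-specific intercept
    slope = 0.55 if group == "A" else 0.60      # arbitrary average growth rates
    for t in times:
        y = 2.5 + b_i + slope * t + rng.normal(0, 1.5)
        rows.append({"subject": subj, "group": group, "time": t, "y": y})
df = pd.DataFrame(rows)

# Random-intercepts linear mixed model with group-specific linear and quadratic time trends.
model = smf.mixedlm("y ~ group * (time + I(time ** 2))", data=df, groups=df["subject"])
fit = model.fit(reml=True)
print(fit.summary())

Treatment contrasts at a fixed time, such as the 12-month difference discussed above, can then be obtained as linear combinations of the estimated fixed effects (for instance via the fitted model's t_test method).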
Figure 3. Toenail data: Fitted average profiles based on model (3). (Unaffected nail length in mm versus time in months, treatments A and B.)

3 MODELS FOR DISCRETE OUTCOMES

Whenever discrete data are to be analyzed, the normality assumption in the models in the previous section is no longer valid, and alternatives need to be considered. The classic route, in analogy to the linear model, is to specify the full joint distribution for the set of measurements Yi1, . . . , Yini per individual. Clearly, it implies the need to specify all moments up to order ni. Examples of marginal models can be found in Bahadur (11), Altham (12), Efron (13), Molenberghs and Lesaffre (14, 15), Lang and Agresti (16), and Fahrmeir and Tutz (17). Especially for longer sequences and/or in cases where observations are not taken at fixed time-points for all subjects, specifying a full likelihood, as well as making inferences about its parameters, traditionally done using maximum likelihood principles, can become very cumbersome. Therefore, inference is often based on a likelihood obtained from a random-effects approach. Associations and all higher order moments are then implicitly modeled through a random-effects structure. This structure will be discussed in Section 3.1. A disadvantage is that the assumptions about all moments are made implicitly and are very hard to check. As a consequence, alternative methods have been in demand, which require the specification of a small number of moments only, leaving the others completely unspecified. In many cases, one is primarily interested in the mean structure, whence only the first moments need to be specified. Sometimes, there is also interest in the association structure, quantified, for example, using odds ratios or correlations.
Estimation is then based on so-called generalized estimating equations (GEEs), and inference no longer directly follows from maximum likelihood theory. This approach will be explained in Section 3.2. In Section 3.3, both approaches will be illustrated in the context of the toenail data. A comparison of both techniques will be presented in Section 3.4.
3.1 Generalized Linear Mixed Models (GLMM)

As discussed in Section 2, random effects can be used to generate an association structure between repeated measurements, which can be exploited to specify a full joint likelihood in the context of discrete outcomes. More specifically, conditionally on a vector bi of subject-specific regression coefficients, it is assumed that all responses Yij for a single subject i are independent, satisfying a generalized linear model with mean µij = h(xij β + zij bi) for a prespecified link function h, and for two vectors xij and zij of known covariates belonging to subject i at the jth time-point. Let fij(yij | bi) denote the corresponding density function of Yij, given bi. As for the linear mixed model, the random effects bi are assumed to be sampled from a normal distribution with mean
vector 0 and covariance D. The marginal distribution of Y i is then given by
f(yi) = ∫ [ ∏j=1..ni fij(yij | bi) ] f(bi) dbi        (4)
in which dependence on the parameters β and D is suppressed from the notation. Assuming independence across subjects, the likelihood can easily be obtained, and maximum likelihood estimation becomes available. In the linear model, the integral in Equation (4) could be worked out analytically, leading to the normal marginal model (2). In general, however, this is no longer possible, and numerical approximations are needed. Broadly, one can distinguish between approximations to the integrand in Equation (4) and methods based on numerical integration. In the first approach, Taylor series expansions of the integrand are used, simplifying the calculation of the integral. Depending on the order of expansion and the point around which one expands, slightly different procedures are obtained. Refer to Breslow and Clayton (18), Wolfinger and O'Connell (19), and Lavergne and Trottier (20) for an overview of estimation methods. In general, such approximations will be accurate whenever the responses yij are "sufficiently continuous" and/or if all ni are sufficiently
large, which explains why the approximation methods perform poorly in cases with binary repeated measurements and a relatively small number of repeated measurements available for all subjects [Wolfinger (21)]. Especially in such examples, numerical integration proves very useful. Of course, a wide toolkit of numerical integration tools, available from the optimization literature, can be applied. A general class of quadrature rules selects a set of abscissas and constructs a weighted sum of function evaluations over those. Refer to Hedeker and Gibbons (22, 23) and to Pinheiro and Bates (24) for more details on numerical integration methods in the context of random-effects models.
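To illustrate what such a quadrature rule does in this setting, the sketch below approximates a single subject's contribution to the marginal likelihood (4) under a random-intercept logistic model using Gauss–Hermite quadrature. It is a bare-bones illustration, not a fitting routine: the response pattern and the parameter values (intercept, slope, and random-intercept standard deviation) are made up for the example.

import numpy as np

def marginal_loglik_subject(y, t, beta0, beta1, sigma_b, n_quad=20):
    """Marginal log-likelihood contribution of one subject under a random-intercept
    logistic model, approximating the integral in Equation (4) by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    b = np.sqrt(2.0) * sigma_b * nodes               # map nodes to N(0, sigma_b^2)
    eta = beta0 + b[:, None] + beta1 * t[None, :]
    p = 1.0 / (1.0 + np.exp(-eta))
    # conditional likelihood of the whole response vector at each quadrature node
    cond_lik = np.prod(p ** y * (1.0 - p) ** (1 - y), axis=1)
    return np.log(np.sum(weights * cond_lik) / np.sqrt(np.pi))

# toy subject: severe infection at baseline and month 1, not thereafter
t = np.array([0, 1, 2, 3, 6, 9, 12], dtype=float)
y = np.array([1, 1, 0, 0, 0, 0, 0])
print(marginal_loglik_subject(y, t, beta0=-1.6, beta1=-0.4, sigma_b=4.0))

Summing such contributions over subjects and maximizing over the parameters (for instance with scipy.optimize.minimize) gives the maximum likelihood fit; adaptive versions of the quadrature recenter and rescale the nodes around each subject's modal random effect to improve accuracy with few nodes.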
3.2 Generalized Estimating Equations (GEEs)

Liang and Zeger (26) proposed so-called generalized estimating equations (GEEs), which require only the correct specification of the univariate marginal distributions, provided one is willing to adopt "working" assumptions about the association structure. More specifically, a generalized linear model [McCullagh and Nelder (25)] is assumed for each response Yij, modeling the mean µij as h(xij β) for a prespecified link function h, and a vector xij of known covariates. In the case of independent repeated measurements, the classic score equations for the estimation of β are well known to be

S(β) = Σi (∂µi/∂β)′ Vi⁻¹ (Yi − µi) = 0        (5)

where µi = E(Yi) and Vi is a diagonal matrix with vij = Var(Yij) on the main diagonal. Note that, in general, the mean-variance relation in generalized linear models implies that the elements vij also depend on the regression coefficients β. Generalized estimating equations are now obtained from allowing nondiagonal "covariance" matrices Vi in Equation (5). In practice, this approach comes down to the specification of a "working correlation matrix" that, together with the variances vij, results in a hypothesized covariance matrix Vi for Yi. Solving S(β) = 0 is done iteratively, constantly updating the working correlation matrix using moment-based estimators.

Note that, in general, no maximum likelihood estimates are obtained, because the equations are not first-order derivatives of some loglikelihood function for the data under some statistical model. Still, very similar properties can be derived. More specifically, Liang and Zeger (26) showed that β̂ is asymptotically normally distributed, with mean β and with a covariance matrix that can easily be estimated in practice. Hence, classic Wald-type inferences become available. This result holds provided that the mean was correctly specified, whatever working assumptions were made about the association structure, which implies that, strictly speaking, one can fit generalized linear models to repeated measurements, ignoring the correlation structure, as long as inferences are based on the standard errors that follow from the general GEE theory. However, efficiency can be gained from using a more appropriate working correlation model [Mancl and Leroux (27)].

The original GEE approach focuses on inferences for the first-order moments, considering the association present in the data as nuisance. Later on, extensions have been proposed that also allow inferences about higher order moments. Refer to Prentice (28), Lipsitz, Laird and Harrington (29), and Liang et al. (30) for more details.
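Generalized estimating equations of this form are available in standard software (the GENMOD procedure in SAS is mentioned at the end of this article). Purely as an illustrative sketch, the Python fragment below fits a marginal logistic regression by GEE with statsmodels, using an exchangeable working correlation rather than the unstructured working correlation used for the toenail analysis in the next section; the simulated data and the column names are invented for the example.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated long-format binary data standing in for the toenail severity outcome.
rng = np.random.default_rng(2)
times = np.array([0, 1, 2, 3, 6, 9, 12], dtype=float)
rows = []
for subj in range(200):
    group = "A" if subj < 100 else "B"
    b_i = rng.normal(0, 2.0)                    # latent subject effect inducing correlation
    slope = -0.3 if group == "A" else -0.4
    for t in times:
        p = 1.0 / (1.0 + np.exp(-(-0.8 + b_i + slope * t)))
        rows.append({"subject": subj, "group": group, "time": t,
                     "severe": rng.binomial(1, p)})
df = pd.DataFrame(rows)

# Marginal logistic regression fitted by GEE with an exchangeable working correlation.
model = smf.gee("severe ~ group * time", groups="subject", data=df,
                family=sm.families.Binomial(),
                cov_struct=sm.cov_struct.Exchangeable())
result = model.fit()
print(result.summary())

The standard errors reported by default are the robust (sandwich) estimates, so they remain valid even if the exchangeable working correlation is misspecified, in line with the theory sketched above.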
3.3 Application to the Toenail Data

As an illustration of GEE and GLMM, the severity of infection binary outcome in the toenail example is analyzed. First, GEE is applied based on the marginal logistic regression model

log[P(Yi(t) = 1) / (1 − P(Yi(t) = 1))] = βA0 + βA1 t   in group A,
log[P(Yi(t) = 1) / (1 − P(Yi(t) = 1))] = βB0 + βB1 t   in group B.        (6)

Furthermore, an unstructured 7 × 7 working correlation matrix is used. The results are reported in Table 2, and the fitted average profiles are shown in the top graph of Fig. 4. Based on a Wald-type test, a significant difference is obtained in the average slope between the two treatment groups (P = 0.0158).
Table 2. Toenail Data: Parameter Estimates (Standard Errors) for a GLMM and a GEE

Parameter                      GLMM Estimate (SE)    GEE Estimate (SE)
Intercept group A (βA0)        −1.63 (0.44)          −0.72 (0.17)
Intercept group B (βB0)        −1.75 (0.45)          −0.65 (0.17)
Slope group A (βA1)            −0.40 (0.05)          −0.14 (0.03)
Slope group B (βB1)            −0.57 (0.06)          −0.25 (0.04)
Random intercepts s.d. (σ)      4.02 (0.38)           –
Alternatively, a generalized linear mixed model is considered, modeling the association through the inclusion of subject-specific (random) intercepts. More specifically, it is now assumed that

log[P(Yi(t) = 1 | bi) / (1 − P(Yi(t) = 1 | bi))] = βA0 + bi + βA1 t   in group A,
log[P(Yi(t) = 1 | bi) / (1 − P(Yi(t) = 1 | bi))] = βB0 + bi + βB1 t   in group B,        (7)

with bi normally distributed with mean 0 and variance σ². The results, obtained using numerical integration methods, are also reported in Table 2. As before, a significant difference between βA1 and βB1 (P = 0.0255) is obtained.
3.4 Marginal versus Hierarchical Parameter Interpretation

Comparing the GEE results and the GLMM results in Table 2, large differences between the parameter estimates are observed, which suggests that the parameters in both models need to be interpreted differently. Indeed, the GEE approach yields parameters with a population-averaged interpretation. Each regression parameter expresses the average effect of a covariate on the probability of having a severe infection. Results from the generalized linear mixed model, however, require an interpretation conditionally on the random effect, i.e., conditionally on the subject. In the context of the toenail example, consider model (7) for treatment group A only. The model assumes that the probability of severe infection satisfies a logistic regression model, with the same slope for all subjects, but with subject-specific intercepts. The population-averaged probability of severe infection is obtained from averaging these subject-specific profiles over all subjects. This model is graphically presented in Fig. 5. Clearly, the slope of the average trend is different from the subject-specific slopes, and this effect will be more severe as the subject-specific profiles differ more, i.e., as the random-intercepts variance σ² is larger. Formally, the average trend for group A is obtained as

P(Yi(t) = 1) = E[P(Yi(t) = 1 | bi)]
             = E[ exp(βA0 + bi + βA1 t) / (1 + exp(βA0 + bi + βA1 t)) ]
             ≠ exp(βA0 + βA1 t) / (1 + exp(βA0 + βA1 t)).        (8)

Hence, the population-averaged evolution is not the evolution for an "average" subject, i.e., a subject with random effect equal to zero. The bottom graph in Fig. 4 shows the fitted profiles for an average subject in each treatment group, and these profiles are indeed very different from the population-averaged profiles shown in the top graph of Fig. 4 and discussed before. In general, the population-averaged evolution implied by the GLMM is no longer of a logistic form, and the parameter estimates obtained from the GLMM are typically larger in absolute value than their marginal counterparts [Neuhaus et al. (31)]. However, one should not refer to this phenomenon as bias, because the two sets of parameters target different scientific questions. Note that this difference in parameter interpretation between marginal and random-effects models immediately follows from the nonlinear nature of the model, and therefore it is absent in the linear mixed model discussed in Section 2. Indeed, the regression parameter vector β in the linear mixed model (1) is the same as the regression parameter vector modeling the expectation in the marginal model (2).
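The gap between the two interpretations can be made numerically explicit. Using the GLMM estimates for group A reported in Table 2, the sketch below evaluates the expectation in Equation (8) by Gauss–Hermite quadrature and compares it with the curve of a subject with bi = 0; the three time-points shown are arbitrary.

import numpy as np

beta0, beta1, sigma_b = -1.63, -0.40, 4.02   # GLMM estimates for group A (Table 2)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

nodes, weights = np.polynomial.hermite.hermgauss(40)
b = np.sqrt(2.0) * sigma_b * nodes           # quadrature nodes mapped to N(0, sigma_b^2)

for t in (0.0, 3.0, 12.0):
    conditional = expit(beta0 + beta1 * t)                                      # subject with b_i = 0
    marginal = np.sum(weights * expit(beta0 + b + beta1 * t)) / np.sqrt(np.pi)  # E over b_i
    print(f"t={t:>4}: P(Y=1 | b_i=0) = {conditional:.3f},  E[P(Y=1 | b_i)] = {marginal:.3f}")

The marginal probabilities are pulled toward 0.5 relative to the bi = 0 curve, which is the attenuation that makes the GEE estimates in Table 2 smaller in absolute value than their GLMM counterparts.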
Figure 4. Toenail data. Treatment-specific evolutions. (a) Marginal evolutions as obtained from the marginal model (6) fitted using GEE. (b) Evolutions for subjects with random effects in model (7) equal to zero. (Pr(Y = 1) versus time in months, treatments A and B.)

Figure 5. Graphical representation of a random-intercepts logistic model. The thin lines represent the subject-specific logistic regression models. The bold line represents the population-averaged evolution.
4 DESIGN CONSIDERATIONS
So far, this discussion has focused on the analysis of longitudinal data. In the context of a clinical trial, however, one is usually
first confronted with design questions. These questions involve the number of patients to be included in the study, the number of repeated measurements to be taken for each patient, as well as the time-points at which measurements will be scheduled. Which design will be ‘‘optimal’’ depends on many characteristics of the problem. In a cross-sectional
analysis, such as the comparison of endpoints between several treatment groups, power typically depends on the alternative to be detected and the variance in the different treatment groups. In a longitudinal context, however, power will depend on the complete multivariate model that will be assumed for the vector of repeated measurements per subject. This approach typically includes a parametric model for the average evolution in the different treatment groups, a parametric model for how the variability changes over time, as well as a parametric model for the association structure. Not only is it difficult in practice to select such models before the data collection, but power calculations also tend to depend strongly on the actual parameter values imputed in these models. Moreover, except in the context of linear mixed models [see Helms (32) and Verbeke and Lesaffre (33)], no analytic power calculations are possible, and simulation-based techniques need to be used instead. Therefore, power analyses are often performed for the cross-sectional comparison of endpoints, whereas the longitudinal analyses are considered additional, secondary analyses.
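A stripped-down example of such a simulation-based power calculation is sketched below. For simplicity, each simulated trial is analyzed with a cross-sectional two-sample t-test on the final measurement; in a real application that analysis step would be replaced by the intended longitudinal analysis (for instance, the linear mixed model of Section 2), at a correspondingly higher computational cost. All numerical inputs (sample size, treatment effect at the final visit, variance components, number of simulations) are arbitrary placeholders.

import numpy as np
from scipy import stats

def simulated_power(n_per_arm, effect_at_end, sd_between=5.0, sd_error=8.0,
                    n_sim=1000, alpha=0.05, seed=0):
    """Crude simulation-based power for a two-arm trial, analyzed here by a
    two-sample t-test on the final measurement only."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        base_a = rng.normal(0, sd_between, n_per_arm)      # subject-level effects, arm A
        base_b = rng.normal(0, sd_between, n_per_arm)      # subject-level effects, arm B
        end_a = base_a + rng.normal(0, sd_error, n_per_arm)
        end_b = base_b + effect_at_end + rng.normal(0, sd_error, n_per_arm)
        _, p = stats.ttest_ind(end_a, end_b)
        hits += p < alpha
    return hits / n_sim

print(simulated_power(n_per_arm=60, effect_at_end=5.0))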
5 CONCLUDING REMARKS
No doubt, repeated measurements occur very frequently in a variety of contexts, which leads to data structures with correlated observations and hence rules out standard statistical models that assume independent observations. Here, a general overview has been given of the main issues in the analysis of repeated measurements, with a focus on a few general classes of approaches that are often used in practice and available in many commercial statistical software packages. A much more complete overview can be found in Diggle et al. (1). Many linear models proposed in the statistical literature for the analysis of continuous data are special cases of the linear mixed models discussed in Section 2. Refer to Verbeke and Molenberghs (8, 9) for more details. Nonlinear models for continuous data were not discussed, but the nonlinearity implies important numerical and interpretational issues similar to those discussed in Section 3 for discrete data models, and these are discussed in full detail in Davidian and Giltinan (34) and Vonesh and Chinchilli (35). An overview of many models for discrete data can be found in Fahrmeir and Tutz (17). One major approach to the analysis of correlated data is based on
random-effects models, for both continuous and discrete outcomes. These models are presented in full detail in Pinheiro and Bates (24).

A variety of models is nowadays available for the analysis of longitudinal data, all of which make very specific assumptions. In many other contexts, procedures for model checking or for testing goodness of fit have been developed. For longitudinal data analysis, relatively few such techniques are available, and it is not always clear to what extent inferences rely on the underlying parametric assumptions. Refer to Verbeke and Molenberghs (9) and to Verbeke and Lesaffre (36) for a selection of available methods for model checking, and for some robustness results, in the context of linear mixed models. As model checking is far from straightforward, attempts have been made to relax some of the distributional assumptions [see, e.g., Verbeke and Lesaffre (37) and Ghidey et al. (38)].

Finally, it should be noted that many applications involving repeated measures will suffer from missing data; i.e., measurements scheduled to be taken are not available, for a variety of known or (often) unknown reasons. Technically speaking, the methods that have been discussed here can handle such unbalanced data structures, but depending on the chosen analysis, biased results can be obtained if the reason for missingness is related to the outcome of interest. Refer to Little and Rubin (39), Verbeke and Molenberghs (8, 9), and the article by Molenberghs and Lesaffre on missing data for a discussion of missing data issues.

Nowadays, GEEs and mixed models can be fitted using a variety of (commercially available) software packages, including MIXOR, MLwiN, and S-Plus. However, in the context of clinical trials, the SAS procedures GENMOD (for GEE analyses), MIXED (for linear mixed models), and NLMIXED (for generalized linear and nonlinear mixed models) are probably the most flexible and best documented procedures, and they are therefore the most widely used ones.
6 ACKNOWLEDGMENTS The authors gratefully acknowledge support from Fonds Wetenschappelijk OnderzoekVlaanderen Research Project G.0002.98 Sensitivity Analysis for Incomplete and Coarse Data and from Belgian IUAP/PAI network Statistical Techniques and Modeling for Complex Substantive Questions with Complex Data. REFERENCES 1. P. J. Diggle, K. Y. Liang, and S. L. Zeger, Analysis of Longitudinal Data. Oxford: Clarendon Press, 1994. 2. M. De Backer, P. De Keyser, C. De Vroey, and E. Lesaffre, A 12-week treatment for dermatophyte toe onychomycosis: Terbinafine 250mg/day vs. itraconazole 200mg/day—a double-blind comparative trial. Brit. J. Dermatol. 1996; 134: 16–17. 3. C. R. Henderson, O. Kempthorne, S. R. Searle, and C. N. VonKrosig, Estimation of environmental and genetic trends from records subject to culling. Biometrics. 1959; 15: 192–218. 4. D. A. Harville, Bayesian inference for variance components using only error contrasts. Biometrika. 1974; 61: 383–385. 5. D. A. Harville, Extension of the Gauss-Markov theorem to include the estimation of random effects. Ann. Stat. 1976; 4: 384–395. 6. D. A. Harville, Maximum likelihood approaches to variance component estimation and to related problems. J. Am. Stat. Assoc. 1977; 72: 320–340. 7. N. M. Laird and J. H. Ware, Randomeffects models for longitudinal data. Biometrics. 1982; 38: 963–974. 8. G. Verbeke and G. Molenberghs, Linear Mixed Models in Practice: A SAS-Oriented Approach. Number 126 in Lecture Notes in Statistics. New-York: Springer-Verlag, 1997. 9. G. Verbeke and G. Molenberghs, Linear Mixed Models for Longitudinal Data. Springer Series in Statistics. New-York: SpringerVerlag, 2000. 10. G. Verbeke, E. Lesaffre, and B. Spiessens, The practical use of different strategies to handle dropout in longitudinal studies. Drug Information J. 2001; 35: 419–434. 11. R. R. Bahadur, A representation of the joint distribution of responses of p dichotomous items. In: H. Solomon (ed.), Studies in Item
REPEATED MEASUREMENTS Analysis and Prediction. Stanford, CA: Stanford University Press, 1961. 12. P. M. E. Althman, Two generalizations of the binomial distribution. Appl. Stat. 1978; 27: 162–167. 13. B. Efron, Double exponential families and their use in generalized linear regression. J. Am. Stat. Assoc. 1986; 81: 709–721. 14. G. Molenberghs and E. Lesaffre, Marginal modelling of correlated ordinal data using a multivariate plackett distribution. J. Am. Stat. Assoc. 1994; 89: 633–644. 15. G. Molenberghs and E. Lesaffre, Marginal modelling of multivariate categorical data. Stat. Med. 1999; 18: 2237–2255. 16. J. B. Lang and A. Agresti, Simultaneously modeling joint and marginal distributions of multivariate categorical responses. J. Am. Stat. Assoc. 1994; 89: 625–632. 17. L. Fahrmeir and G. Tutz, Multivariate Statistical Modelling based on Generalized Linear Models. Springer Series in Statistics. New York: Springer-Verlag, 1994. 18. N. E. Breslow and D. G. Clayton, Approximate inference in generalized linear mixed models. J. Am. Stat. Assoc. 1993; 88: 9–25. 19. R. D. Wolfinger and M. O’Connell, Generalized linear mixed models: A pseudo-likelihood approach. J. Stat. Comput. Simul. 1993; 48: 233–243. 20. C. Lavergne and C. Trottier, Sur l’estimation dans les mod`eles lin´eaires g´en´eralis´es a` effets al´eatoires. Revue de Statistique Appliqu´ee, 2000; 48: 49–67. 21. R. D. Wolfinger, Towards practical application of generalized linear mixed models. In: B. Marx and H. Friedl (eds.), Proceedings of the 13th International Workshop on Statistical Modeling, New Orleans, LA, July 27–31, 1998. 22. D. Hedeker and R. D. Gibbons, A randomeffects ordinal regression model for multilevel analysis. Biometrics. 1994; 50: 933–944. 23. D. Hedeker and R. D. Gibbons, MIXOR: A computer program for mixed-effects ordinal regression analysis. Comput. Methods Programs Biomed. 1996; 49: 157–176. 24. J. C. Pinheiro and D. M. Bates, Mixed Effects Models in S and S-Plus. New-York: SpringerVerlag, 2000. 25. P. McCullagh and J. A. Nelder, Generalized Linear Models. 2nd ed. New York: Chapman & Hall, 1989. 26. K. Y. Liang and S. L. Zeger, Longitudinal data analysis using generalized linear models.
Biometrika. 1986; 73: 13–22. 27. L. A. Mancl and B. G. Leroux, Efficiency of regression estimates for clustered data. Biometrics. 1996; 52: 500–511. 28. R. L. Prentice, Correlated binary regression with covariates specific to each binary observation. Biometrics. 1988; 44: 1033–1048. 29. S. R. Lipsitz, N. M. Laird, and D. P. Harrington, Generalized estimating equations for correlated binary data: using the odds ratio as a measure of association. Biometrika. 1991; 78: 153–160. 30. K. Y. Liang, S. L. Zeger, and B. Qaqish, Multivariate regression analyses for categorical data. J. R. Stat. Soc. Series B 1992; 54: 3–40. 31. J. M. Neuhaus, J. D. Kalbfleisch, and W. W. Hauck, A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data. Int. Stat. Rev. 1991; 59: 25–30. 32. R. W. Helms, Intentionally incomplete longitudinal designs: Methodology and comparison of some full span designs. Stat. Med. 1992; 11: 1889–1913. 33. G. Verbeke and E. Lesaffre, The effect of dropout on the efficiency of longitudinal experiments. Appl. Stat. 1999; 48: 363–375. 34. M. Davidian and D. M. Giltinan, Nonlinear Models for Repeated Measurement Data. New York: Chapman & Hall, 1995. 35. E. F. Vonesh and V. M. Chinchilli, Linear and Nonlinear Models for the Analysis of Repeated Measurements. New York: Marcel Dekker, 1997. 36. G. Verbeke and E. Lesaffre, The effect of misspecifying the random effects distribution in linear mixed models for longitudinal data. Comput. Stat. Data Anal. 1997; 23: 541–556. 37. G. Verbeke and E. Lesaffre, A linear mixedeffects model with heterogeneity in the random-effects population. J. Am. Stat. Assoc. 1996; 91: 217–221. 38. W. Ghidey, E. Lesaffre, and P. Eilers, Smooth random effects distribution in a linear mixed model. Biometrics 2004; 60: 945–953. 39. R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data. New York: Wiley, 1987.
REPOSITORY
CHERYL A. WINKLER
Laboratory of Genomic Diversity, National Cancer Institute, SAIC-Frederick, Maryland

KATHLEEN H. GROOVER
National Cancer Institute Frederick Central Repository Services, SAIC-Frederick, Maryland
Rapid advances in genomics and proteomics are leading to an increased awareness of the value of collecting biosamples with annotated data for medical research (1–3). This article addresses the role and function of repositories and biobanks for the banking of DNA, biosamples, and linked, annotated data collected for current and future biomedical investigation. In addition to scientific issues, ethical and legal issues impact repositories and the use and sharing of human samples and data. The protection of human subject rights to confidentiality, privacy, and wishes is specified in Title 45 (Public Welfare) Part 46 (Protection of Human Subjects) of the Code of Federal Regulations (CFR) (4,5). Compliance with these guidelines applies to federally funded research and U.S. agencies that support research. Although no national standards are set for repositories in the U.S., the International Society for Biological and Environmental Repositories (ISBER) (6) and the National Cancer Institute (7) have developed guidelines and best practice recommendations for the management and operation of human subject repositories. Repositories are responsible for assuring compliance with policies and regulations that govern ethical research and the protection of human subjects, and for compliance with federal, state, and local laws and, as applicable, the laws of other nations. Repository operations that handle and store human specimens fall under the category of human research; therefore, Institutional Review Board (IRB) approvals are required. IRB review is required separately for biosample collection, repository storage, and/or the data management center.

Repositories have three core functions: the acquisition/data management, storage/archival, and distribution of biological samples, along with the associated function of maintaining annotated data for current and future research purposes (8). Repositories share in common the goals of storing research samples and data for use in a manner that does not compromise the biological integrity of the samples; protecting subject autonomy, privacy, and confidentiality; and having mechanisms in place for retrieving and distributing samples to the research community. Repositories provide appropriate storage conditions, inventory management systems, quality assurance processes, and security systems to protect and secure the samples and data as well as maintain the confidentiality of the specimen donors. The success of genetic investigations, such as genome-wide association scans, is critically dependent on having appropriately selected biosamples for phenotype assessment and high-quality DNA for genotyping, resequencing, or mutation discovery. These parameters in turn determine the type and quantity of specific specimens that will be collected for clinical genetic investigation. It is critical that decisions are made during the planning stage regarding repository model, operational specifications and safety, quality assurance, risk management, inventory control, and sample tracking. It is also essential to consider the mechanism by which future ancillary investigations will use stored data and biosamples, as well as future sharing of samples and data, and disposition of biosamples and clinical data at the conclusion of the study. Not addressing these issues early in the clinical study-planning phase will result in higher costs in later years and could negatively impact study success.
1 REPOSITORIES AND BIOBANKS
Repositories may be free standing, a part of an institution, or exist as a virtual network of samples and associated data managed by a single entity (9). Below are listed several repository models:
• Virtual, integrated information repositories that archive individual level phenotype, genotype, exposure, and DNA sequence data and the association among them (2,3).
• Centralized or network repositories for the collection of DNA, biosamples, and databases of clinical information and datasets resulting from clinical trials that are made available to the wider research community at the end of the studies (10).
• Biobanks, which are biorepositories that collect, store, process, and distribute biosamples with associated environmental, lifestyle, and/or medical data for biomedical and genetic research. Many biobanks differ conceptually from other repositories because (1) they are designed to discover genetic and nongenetic factors and their interactions that are associated with human diseases and (2) they may be population-based; they may have both retrospective and prospective components; and they tend to enroll very large numbers of individuals (1,11,12).
• Nonprofit bioresource institutes such as the Coriell Institute for Medical Research or the American Type Culture Collection that acquire, archive, and distribute normal and diseased tissues, cell lines, and DNA to the research community. Each of these repositories also archives special collections for academic and government agencies.
• Contract repositories that receive, process, store, and distribute biosamples from multiple sites or studies.
• Regional academic, government, or institutional repositories that collect, process, and store samples for a specific study.
• Decentralized networks of collection facilities associated with academic medical centers, clinical studies, and community hospitals that collect biosamples and annotated data for archival and distribution to the larger research community (8).
2 CORE ELEMENTS OF A REPOSITORY

2.1 Acquisition/Data Management

2.1.1 Acquisition. The study design, objectives, and aims will determine the type and quantity of samples to collect for phenotyping and genotyping (13,14). Standardized protocols should be consistently applied in preparing and storing biospecimens to ensure their quality and to avoid introducing variables into research studies. Biospecimen containers should be chosen with analytical goals in mind. Each storage container label should uniquely and legibly identify the biospecimen aliquot, be firmly affixed to the container, and be able to withstand storage conditions. All genetic studies require genomic DNA for genotyping and possible resequencing. Investigators should anticipate the number, concentration, and type of biosamples that may be needed at a later time to validate genetic statistical signals or show causality and function. By understanding anticipated future needs, appropriate biosample storage containers can be prepared up front, eliminating unnecessary storage and aliquotting costs and sample-compromising thaw/freeze cycles. Common primary DNA sources include:

• Peripheral blood and its derivatives
• Blood spots on cellulose filter paper/Guthrie cards
• Biopsy tissue
• Normal or diseased tissue
• Buccal swabs
• Buccal rinses
• Saliva
• Paraffin block tissue
Peripheral blood has many advantages, such as the following: (1) it is collected by venipuncture; (2) it is transported at ambient temperatures; (3) white blood cells are an excellent source of DNA; (4) blood and its components (i.e., plasma) are frequently used for phenotyping assays and diagnostic laboratory tests; (5) gradient separation methods yield peripheral blood mononuclear cells (PBMC) for cryopreservation as a backup DNA source, for future functional studies, or for lymphoblastoid cell line development.
Blood tubes containing buffers that stabilize DNA or RNA at the point of collection prior to purification are commercially available. Saliva, buccal rinses, and buccal swabs are commonly used to obtain DNA. These methods are noninvasive, and commercially available kits can be provided to donors for self-collection. Renewable DNA can be obtained from:

• Epstein-Barr virus immortalized B cell lines
• Cell lines from other immortalized cell types
• Whole genome amplified DNA

With the recent advances in whole genome amplification (WGA), a method of amplifying limited amounts of DNA or poor-quality DNA, and genotyping platforms that use nanogram quantities of DNA, filter paper blood spots, saliva, and buccal samples are attractive alternatives to peripheral blood for genomic DNA. In principle, WGA yields an identical, more concentrated DNA sample while retaining the original genomic nucleotide sequence (15,16). WGA DNA is compatible with most single nucleotide polymorphism (SNP) genotyping platforms. Cell lines, which are usually established from fresh or cryopreserved PBMC by Epstein-Barr virus (EBV), are an excellent, renewable source of genomic and mitochondrial DNA, and they are frequently used for gene expression (17) or other functional investigations. EBV immortalization of PBMC should be considered if extensive laboratory or clinical data are collected from study subjects or if it is anticipated that the DNA will be widely distributed, thus making the investment cost-effective.

2.1.2 Data Management. The availability of high-quality clinically annotated biospecimens is essential to the success of scientific research. The management of data annotating biosample collections is a critical function of repositories. A repository's use is increased with a well-designed information and management system. Repository informatics systems should be robust and reliable in support of daily operations and should meet changing scientific needs. Interoperability of systems is paramount to exchanging data and biospecimens, as well as providing accountability of biospecimens and related data uses (7). Features of a bioinformatics/information technology system to support operations include, but are not limited to, the following:

• Research participant enrollment and consent
• Biosample collection, processing, storage, and dissemination
• Quality Assurance (QA)/Quality Control (QC)
• Collection of research participant data
• Data and access security
• Validation documentation
• Management reporting functions
• Clinical annotation
• Multisite access and coordination
• Catalog, annotation, and validation of clinical, genotyping, or phenotyping data
• Web-based browsing
• Scalable
• User-friendly interface
• Interactive real-time data
• Statistical and computational analyses
• Data mining
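As a toy sketch of the kind of record an inventory and chain-of-custody component of such a system might keep, the Python fragment below defines a specimen object with a barcode, a coded donor identifier, the governing consent protocol, and an appendable custody log. All class and field names are invented for illustration; a production system would be built on a database with enforced access control, validation, and audit trails rather than in-memory objects.

from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class CustodyEvent:
    timestamp: datetime
    action: str        # e.g., "received", "aliquoted", "moved", "shipped"
    operator: str
    location: str      # freezer/rack/box/grid position

@dataclass
class Specimen:
    barcode: str                       # unique identifier linking sample, location, and data
    donor_id: str                      # coded donor identifier, never personal identifiers
    specimen_type: str                 # e.g., "plasma", "DNA", "PBMC"
    consent_protocol: str              # protocol/IRB reference governing permitted uses
    custody: List[CustodyEvent] = field(default_factory=list)

    def log(self, action: str, operator: str, location: str) -> None:
        """Append an auditable chain-of-custody record."""
        self.custody.append(CustodyEvent(datetime.utcnow(), action, operator, location))

sample = Specimen("BC-000123", "SUBJ-042", "plasma", "IRB-2008-17")
sample.log("received", "tech01", "Freezer 3 / Rack B / Box 12 / A5")
sample.log("aliquoted", "tech02", "Freezer 3 / Rack B / Box 12 / A6")
print([e.action for e in sample.custody])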
Repositories should have an integrated inventory management system that provides a chain of custody for biosamples and tracks individual samples at each step of collection, processing, storage, and distribution. Samples should be bar-coded with a unique identifier that links each biosample to its location and associated data. Repositories should maintain records for the following (6):

• Donor permissions for use of samples and data granted through informed consent
• Protocol information, internal review approvals, and ethical and/or legal characteristics of the collection
• Description of the biosample, pathology, and collection protocol
• Clinical data associated with individual samples
• Quality assurance data collected during biosample inventory, processing, and laboratory testing
• Clinical data validation
• Viability and recovery of cryopreserved cells and tissues
• Quality of molecular analytes, DNA, and RNA from cells and tissues

Repository management systems should have in place security systems to ensure patient confidentiality and inventory security. Confidential subject documents should be stored in locked facilities with limited access. Computer and paper records should be available for audit by regulatory agencies and for quality assurance purposes.

2.2 Storage/Archival

Quality storage of biospecimens requires that repositories address biosample storage, security, and quality assurance.

2.2.1 Biosample Storage. Biospecimens should be stored in a stabilized state and be tracked within an inventory system that identifies the specific location of a sample within a freezer (e.g., freezer, rack, box, and grid location). The goals of short-term and long-term storage are to maintain the biological and functional integrity of biosamples by storing at the optimal temperature and by avoiding temperature fluctuations. In selecting biospecimen storage temperature, consider the biospecimen type, anticipated length of storage, biomolecules of interest, and whether study goals include preserving viable cells (7). Storage conditions and deviations from protocols, which include information about temperature, thaw/refreeze episodes, and equipment failure, should be recorded (7). Storage temperatures influence the time during which samples can be recovered without damage; by storing samples below the glass transition phase of water (–132°C), biochemical reactions and metabolic activity are effectively stopped. Although it may
be ideal to store all biosamples below this temperature, little specific evidence in the literature supports this statement. Biosample storage containers and temperatures should be chosen with analytical goals in mind. Optimal storage temperatures differ for different types of biosamples. Therefore, most repositories provide a range of storage temperatures from –196°C to controlled ambient temperature. Listed below are typical storage temperature ranges for biosamples (6):

• Blood spots on cellulose paper, Guthrie cards, and paraffin-embedded tissue blocks: ambient temperature (20°C to 23.5°C) with controlled humidity and light
• Purified DNA in solution: –20°C to –25°C
• Plasma, serum, urine, and nonviable cells or tissue: –70°C to –80°C
• Viable, cryopreserved cells or tissues treated with cryoprotectants such as dimethylsulfoxide: vapor or liquid phase of liquid nitrogen (–170°C to –196°C)

In addition to selection of storage containers and temperatures, it is important to consider steps to manage the risk of specimen loss. This process includes, but is not limited to, the preparation of duplicate samples, splitting of samples between shipment packages or dates, splitting of collections between two storage units, and even splitting of collections for a study between storage units located in different buildings or even locales. The decision to take additional steps and costs to split collections will be based on costs associated with collection, the ability to identify and replace a particular sample, and study impact considerations.
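For instance, the temperature ranges above can be encoded as a simple lookup against which logged storage temperatures are checked, as in the minimal sketch below; the dictionary keys and the example reading are illustrative only, and real monitoring systems operate continuously against calibrated sensors.

# Acceptable storage ranges in °C, taken from the list above (lower/upper bounds).
STORAGE_RANGES = {
    "blood_spot_card": (20.0, 23.5),       # ambient, controlled humidity and light
    "purified_dna": (-25.0, -20.0),
    "plasma_serum_urine": (-80.0, -70.0),
    "cryopreserved_cells": (-196.0, -170.0),
}

def temperature_excursion(specimen_type: str, logged_temp_c: float) -> bool:
    """Return True if a logged storage temperature falls outside the acceptable range."""
    low, high = STORAGE_RANGES[specimen_type]
    return not (low <= logged_temp_c <= high)

# e.g., a -65 °C reading for plasma should trigger an alarm and QA review
print(temperature_excursion("plasma_serum_urine", -65.0))   # True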
2.2.2 Security. Repositories have an obligation to secure the biosample collection and associated data from harm or loss, and to protect the rights of privacy and confidentiality of donors. Best practice recommendations are to do the following (6,7,9):

• Control access to facility, storage units, and samples
• Maintain temperature stability with good standard operating procedures and preventive maintenance programs
• Provide round-the-clock monitoring of storage conditions and response to alarm conditions
• Have sufficient cold storage capacity and staffing for contracted studies
• Have sufficient backup cold storage capacity in the event of equipment failure
• Meet power or nitrogen supply requirements
• Provide automatic back-up systems in the event of power outage
• Have emergency procedures in place to respond to equipment failure, weather, or other emergencies
• Control access to databases and provide secure backup procedures
• Provide secure storage for confidential files
2.2.3 Quality Assurance. Repositories should develop formalized QA/QC policies to minimize errors that could adversely affect scientific results (7). QA/QC policies should be customized for the intended and potential uses of biospecimens at the repository. Each biosample resource should either establish a written Quality Management System (QMS) or adhere to one published by the organization with which the repository is affiliated. The QMS should describe the repository's QA/QC programs and approaches for ensuring that program requirements are met. Procedures for conducting audits in the following areas should also be described (7):

• Equipment repair and maintenance
• Training records and staff adherence to training schedules
• Data management
• Recordkeeping
• Adherence to Standard Operating Procedures (SOPs)

Each repository should develop an SOP manual that states policies and describes all procedures in detail (7). The manual should also address contents, implementation, modification, staff access, and review.
Specifically, the SOP manual should include at least the following information:

• Biospecimen handling policies and procedures, including supplies, methods, and equipment used
• Laboratory procedures for tests performed in-house and any division of a biospecimen into aliquots or other processing
• Policies and procedures for shipping and receiving biospecimens, including material transfer agreements or other appropriate agreements to be used
• Policies for managing records
• Administrative, technical, and physical security
• Information systems security (18)
• QA/QC policies and procedures for supplies, equipment, instruments, reagents, labels, and processes employed in sample retrieval and processing
• Safety programs
• Emergency biosafety policies and procedures, including the reporting of accidents, errors, and complaints
• Policies, procedures, and schedules for equipment inspection, maintenance, repair, and calibration
• Procedures for disposal of medical waste and other biohazardous waste
• Policies and procedures regarding the training of technical and QA/QC staff members
• Procedures for removal of biospecimens from the repository
• Policies for the disposition of biospecimens
• Points of contact and designated backup information, including names and emergency contact numbers
2.3 Distribution of Biosamples and Data

Most repositories have formal mechanisms in place for request and approval of biosample distribution. Stored or archived biosamples may comprise samples collected for the express purpose of distribution to the research community or those collected
by investigators that were not originally intended to be shared with others but are subsequently shared with the greater research community. IRB review at the collection site(s) is required for all samples that are identified, potentially identifiable, or coded, and written informed consent must be obtained from the donor subject. The informed consent should contain information about the repository and the conditions under which biosamples and private data will be shared. Repositories that distribute samples also require an IRB, convened under Assurances of the Office for Human Research Protections (OHRP). Samples are exempt from IRB review only if, as stated in 45 CFR 46.101(b)(4), the research involves "the collection or study of existing data, documents, records, pathological specimens, or diagnostic specimens, if the sources are publicly available or if the information is recorded by investigators in such a manner that subjects cannot be identified, directly or through identifiers linked to the subjects." Once approved for shipment, biosamples should be retrieved from storage according to SOPs and in a manner that preserves the integrity of the specimen. Shipping time, distance, climate, season, method of transportation, regulations, and sample type and use need to be considered when determining how to regulate temperature during shipment. Options include shipment at ambient temperature, over blue ice, on dry ice, or in vapor-phase liquid nitrogen. Appropriate documentation, based on the regulations that pertain to the material type, refrigerant, and destination, should be prepared by trained individuals and accompany the shipment. Notifying the recipient of the date, time, mode of transport, and expected delivery is critical to successful and safe transport of biospecimens.
3 SUMMARY
Human subject repositories and biobanks are gaining both increased recognition and scrutiny as public and private funding agencies, scientists, and the general public recognize the benefits of discovering the genetic and environmental basis for many common
diseases, but grapple with the ethical implications of human subject privacy and confidentiality in the genomics age. Genome-wide association scans enrolling thousands of subjects have revealed common genetic variants associated with prostate cancer, colorectal cancer, diabetes, obesity, and other common diseases, thus providing the impetus to continue funding large genetic studies, repositories, and biobanks. These institutions face the challenge of maintaining the chain of trust between donors and investigators (13) while making samples and data available to the wider research community.

4 ACKNOWLEDGMENTS

This project has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health, under contract N01-CO12400. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.

REFERENCES

1. W. Ollier, T. Sprosen, and T. Peakman, UK Biobank: from concept to reality. Pharmacogenomics 2005; 6: 639–646.
2. M. D. Mailman, M. Feolo, Y. Jin, M. Kimura, K. Tryka, R. Bagoutdinov, L. Hao, A. Kiang, J. Paschall, L. Phan, et al., The NCBI dbGaP database of genotypes and phenotypes. Nat. Genet. 2007; 39: 1181–1186.
3. C. S. Goh, T. A. Gianoulis, Y. Liu, J. Li, A. Paccanaro, Y. A. Lussier, and M. Gerstein, Integration of curated databases to identify genotype-phenotype associations. BMC Genomics 2006; 12:7: 257.
4. Health Insurance Portability and Accountability Act of 1996 (HIPAA).
5. Code of Federal Regulations, Title 45 Public Welfare, Department of Health and Human Services, Part 46, Protection of Human Subjects.
6. Best Practices for Repositories I: Collection, Storage, and Retrieval of Human Biological Materials for Research. International Society for Biological and Environmental Repositories (ISBER). Cell Preservation Technology 2005; 3: 5–48.
7. National Cancer Institute, National Cancer Institute Best Practices for Biospecimen Resources. June 2007.
8. A. Friede, R. Grossman, R. Hunt, R. M. Li, S. Stern, eds., National Biospecimen Network Blueprint. Durham, NC: Constella Group, Inc., 2003, pp. 29–48.
9. E. Eiseman, G. Bloom, J. Brower, N. Clancy, and S. Olmsted, Case Studies of Existing Human Tissue Repositories: "Best Practices" for a Biospecimen Resource for the Genomic and Proteomic Era. Santa Monica, CA: RAND Corporation, pp. 1–208.
10. A. J. Cuticchia, P. C. Cooley, R. D. Hall, and Y. Qin, NIDDK data repository: a central collection of clinical trial data. BMC Med. Inform. Decis. Mak. 2006; 4:6: 19.
11. M. Asslaber and K. Zatloukal, Biobanks: transnational, European and global networks. Brief Funct. Genomic Proteomic 2007. In press.
12. M. A. Austin, S. Harding, and C. McElroy, Genebanks: a comparison of eight proposed international genetic databases. Genet. Med. 2003; 5: 451–457.
13. N. T. Holland, L. Pfleger, E. Berger, A. Ho, M. Bastaki, Molecular epidemiology biomarkers-sample collection and processing considerations. Toxicol. Appl. Pharmacol. 2005; 206: 261–268.
14. N. T. Holland, M. T. Smith, B. Eskenazi, and M. Bastaki, Biological sample collection and processing for molecular epidemiological studies. Mutat. Res. 2003; 543: 217–234.
15. S. Hughes, N. Arneson, S. Done, and J. Squire, The use of whole genome amplification in the study of human disease. Prog. Biophys. Mol. Biol. 2005; 88: 173–189.
16. L. Lovmar and A. C. Syvanen, Multiple displacement amplification to create a long-lasting source of DNA for genetic studies. Hum. Mutat. 2006; 27: 603–614.
17. R. S. Spielman, L. A. Bastone, J. T. Burdick, M. Morley, W. J. Ewens, and V. H. Cheung, Common genetic variants account for differences in gene expression among ethnic groups. Nat. Genet. 2007; 39: 226–231.
18. G. Stoneburner, A. Goguen, and A. Feringa, Risk Management Guide for Information Technology Systems (online). Gaithersburg, MD: National Institute of Standards and Technology, 2002. Available: http://csrc.nist.gov/publications/nistpubs/80030/sp800-30.pdf.
RESPONSE ADAPTIVE RANDOMIZATION
D. STEPHEN COAD
Queen Mary, University of London, London, United Kingdom

When a clinical trial is carried out, data accrue sequentially on the different treatments. If some treatments seem to be more promising than others, it is natural to consider assigning more patients to these than to the less promising ones. In this way, the previous patients' responses are used to skew the allocation probabilities toward the apparently better treatments. This is called response adaptive randomization, and its origins may be traced back to the 1930s (1). Although there is a voluminous literature devoted to various properties of the different methods proposed, there are still relatively few examples of the use of response adaptive randomization in practice. One reason for this is that many of the models studied in the theoretical literature are not very realistic. However, there have been some successful applications in practice, such as the Harvard ECMO study (2) and two adaptive clinical trials in depression (3,4).

Note that response adaptive randomization is different from covariate adaptive randomization, where balance is sought across the treatment groups with respect to some prognostic factors. Examples of this approach include the minimization randomization method and biased-coin randomization. Response adaptive randomization and covariate adaptive randomization are both members of the class of adaptive designs for clinical trials concerned with treatment comparisons. Other classes include dose-finding designs, such as up-and-down rules and the continual reassessment method, and flexible adaptive designs, such as sample size reestimation methods and the combination of P-values approach. In this article, we will mainly focus on the case where response adaptive randomization is used to compare two treatments A and B with binary responses, although many of the considerations apply more generally. In what follows, let pA and pB denote the respective success probabilities, and let qi = 1 − pi for i = A, B.

1 EXAMPLES

Numerous response adaptive randomization rules have been proposed in the literature. The purpose of this section is to give examples of two specific classes of such rules and to indicate briefly some of their properties.
1.1 Urn Models

Several adaptive urn designs are now available for comparing two or more treatments (5). The most widely studied of these designs is the randomized play-the-winner (RPW) rule (6). However, although this design is simple and some of its properties can be analyzed mathematically, it is known to be highly variable. A far less variable urn design is the drop-the-loser (DL) rule (7). Since this particular design has been shown to work well in many settings, it will be described in detail.

Suppose that, at the start of the trial, there are a balls for both treatments A and B in the urn, and that there are b so-called immigration balls. The immigration balls are present in order to ensure that the urn does not empty. When a new patient arrives, a ball is removed from the urn. If it is an immigration ball, it is replaced and one additional ball of both type A and type B is added to the urn. This procedure is repeated until a ball representing one of the treatments is drawn. A patient then receives the corresponding treatment. If, at that stage, a success is observed, this ball is returned to the urn; otherwise, if a failure is observed, this ball is not returned. Thus, as the trial proceeds, the urn will contain a higher proportion of balls for the better treatment.

Suppose that, at stage n in the trial, that is, after n responses have been observed, the DL rule has assigned NA(n) patients to treatment A and NB(n) to treatment B. Then it is known that

NA(n)/n → qB/(qA + qB)

in probability as n → ∞ (7), so that this rule has the same limiting allocation proportions as the RPW rule. It is also known that the allocation proportion for the DL rule is less variable than for the RPW rule. In fact, the DL rule is always more powerful than the RPW rule (8). Moreover, it has recently been shown that the DL rule attains a lower bound for the asymptotic variance of the allocation proportion (9).
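To make the mechanics of the DL rule concrete, here is a minimal simulation sketch. It is not from the source; the function name, the success probabilities, the initial ball count a, the immigration ball count b, and the trial size are all illustrative assumptions.

```python
import random

def simulate_dl_rule(p_a, p_b, n_patients, a=1, b=1, seed=0):
    """Simulate the drop-the-loser urn rule for two treatments A and B."""
    rng = random.Random(seed)
    balls = {"A": a, "B": a, "I": b}          # treatment balls plus immigration balls
    success_prob = {"A": p_a, "B": p_b}
    n_assigned = {"A": 0, "B": 0}
    failures = 0
    for _ in range(n_patients):
        # Draw until a treatment ball comes out; an immigration ball is
        # replaced and adds one ball of each treatment type to the urn.
        while True:
            total = balls["A"] + balls["B"] + balls["I"]
            r = rng.uniform(0, total)
            if r < balls["I"]:
                balls["A"] += 1
                balls["B"] += 1
            elif r < balls["I"] + balls["A"]:
                arm = "A"
                break
            else:
                arm = "B"
                break
        n_assigned[arm] += 1
        if rng.random() >= success_prob[arm]:  # failure
            balls[arm] -= 1                    # the drawn ball is not returned
            failures += 1
        # on a success the drawn ball is simply returned, leaving the urn unchanged
    return n_assigned, failures

# With p_A = 0.5 and p_B = 0.8, the limiting proportion on A is
# q_B/(q_A + q_B) = 0.2/0.7, so roughly 29% of patients should be on A.
print(simulate_dl_rule(p_a=0.5, p_b=0.8, n_patients=2000))
```

Running this for a long trial gives an allocation proportion close to the stated limit, which is one way to check an implementation of the rule.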
1.2 Sequential Maximum Likelihood Estimation Rules

One drawback of many urn models is that they are not derived in order to satisfy any optimality criteria, and thus one can develop designs that are more effective at reducing the numbers of patients assigned to inferior treatments. One possible approach is to consider minimizing the expected value of a loss function of the form

Ln = u(pA, pB)NA(n) + v(pA, pB)NB(n)

subject to some power requirement. Thus, if u = qA and v = qB, interest lies in minimizing the total number of treatment failures. Another important special case is u = v = 1, which corresponds to minimizing the total sample size. This type of approach has been studied in several settings (10,11).

As an example, suppose that we wish to find the allocation ratio R = nA/nB that minimizes the total number of treatment failures for a fixed asymptotic variance of the estimated value of the log odds ratio given by θ = log{pA qB/(pB qA)}. Then the optimal allocation is Ropt = √pB qB/(√pA qA), and so the optimal allocation proportion for treatment A is Ropt/(Ropt + 1) = ρ(pA, pB), say. Since this allocation ratio is a function of the unknown success probabilities, these parameters are replaced by their maximum likelihood estimates p̂A and p̂B. A biased coin design is then constructed based on these estimates, so that the next patient is allocated to treatment A with probability ρ(p̂A, p̂B). A response adaptive randomization rule constructed in the above way is called a sequential maximum likelihood estimation (SMLE) rule. Since

NA(n)/n → ρ

almost surely as n → ∞ (12), the limiting allocation is optimal.
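Before turning to the variability of SMLE rules, here is a minimal sketch of the biased-coin allocation just described, targeting ρ(pA, pB) with Ropt = √pB · qB/(√pA · qA). The function names are hypothetical, and the light smoothing constant eps (used to keep the estimates away from 0 and 1 early in the trial) is my own ad hoc assumption, not part of the source.

```python
import math
import random

def smle_allocation_prob(succ_a, n_a, succ_b, n_b, eps=0.5):
    """Estimated optimal probability of assigning the next patient to treatment A.

    Targets rho(p_A, p_B) = R_opt / (R_opt + 1), where
    R_opt = sqrt(p_B) * q_B / (sqrt(p_A) * q_A), with the unknown success
    probabilities replaced by lightly smoothed maximum likelihood estimates.
    """
    p_a = (succ_a + eps) / (n_a + 2 * eps)
    p_b = (succ_b + eps) / (n_b + 2 * eps)
    q_a, q_b = 1.0 - p_a, 1.0 - p_b
    r_opt = (math.sqrt(p_b) * q_b) / (math.sqrt(p_a) * q_a)
    return r_opt / (r_opt + 1.0)

def simulate_smle_trial(p_true_a, p_true_b, n_patients, seed=0):
    """Run one trial under the SMLE biased-coin rule with binary responses."""
    rng = random.Random(seed)
    succ = {"A": 0, "B": 0}
    n = {"A": 0, "B": 0}
    for _ in range(n_patients):
        prob_a = smle_allocation_prob(succ["A"], n["A"], succ["B"], n["B"])
        arm = "A" if rng.random() < prob_a else "B"
        n[arm] += 1
        p_true = p_true_a if arm == "A" else p_true_b
        if rng.random() < p_true:
            succ[arm] += 1
    return n, succ

# Illustrative run: with p_A = 0.5 and p_B = 0.8, rho is about 0.34,
# so roughly a third of patients should end up on the inferior arm A.
print(simulate_smle_trial(0.5, 0.8, 1000))
```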
However, since SMLE rules involve estimates of unknown parameters, the resulting allocation proportions associated with them tend to be rather variable, but they can be made less variable by considering doubly adaptive biased coin designs (8,13). With these designs, a function is introduced to represent how close the current allocation proportions are to the target value ρ. Asymptotic properties of these designs have been studied both in the two-treatment case and when there are multiple treatments (14).

2 INFERENCE AFTER THE TRIAL

If response adaptive randomization is used, then the standard methods for inference at the end of a clinical trial cannot be used, since, for example, the sample sizes on the treatments are now random variables and dependencies are introduced through the use of such a design. Consequently, valid methods are needed in order to carry out hypothesis tests and estimate parameters of interest. In what follows, in order to illustrate some of the issues that develop, we concentrate on likelihood-based inference.

2.1 Hypothesis Testing

It is well known that the use of response adaptive randomization does not affect the form of the likelihood function. It follows that standard tests based on the generalized likelihood ratio test are of the same form as for a nonadaptive design. However, because of the induced dependence in the data, the sampling distributions of the familiar test statistics will usually be affected by the adaptive nature of the design. For example, a version of Wilks's theorem for generalized likelihood ratio tests has been obtained when an adaptive urn design is used (15). Although the null distribution of the statistic is the same as for a nonadaptive design, its distribution under the alternative hypothesis is affected and, hence, so is the power of the test. Since response adaptive randomization rules can be more variable than complete randomization, this can lead to a loss of power when carrying out standard hypothesis tests
at the end of the trial. As an example, consider testing H0: pA = pB against H1: pA ≠ pB. Then it has been shown by simulation that, for the standard asymptotic test of the simple difference pA − pB, use of the SMLE rule that minimizes the total number of treatment failures yielded comparable power to equal allocation, whereas the RPW rule led to a considerable loss in power (11). This behavior is caused by the allocation proportion for the RPW rule being highly variable. For a fixed allocation proportion, the power of the above asymptotic test can be expressed as an increasing function of the noncentrality parameter of the chi-squared distribution. By considering a Taylor series expansion of this parameter, the average power lost for a response adaptive randomization rule is known to be an increasing function of the variance of the allocation proportion (8), which provides a theoretical justification for the above behavior. Note that the rule that maximizes the noncentrality parameter is the Neyman allocation, which has allocation ratio R = √(pA qA)/√(pB qB).

The above ideas have also been extended to the comparison of more than two treatments with binary responses (8). By considering the noncentrality parameter of the standard asymptotic test of homogeneity, it is known that the average power lost is determined by the covariance matrix of the allocation proportions. More recently, it has been shown how to find the response adaptive randomization rule that maximizes the power of the above test of homogeneity (16). This is achieved by using the doubly adaptive biased coin design developed for comparing multiple treatments (14). The effect of response adaptive randomization on the power of a test based on Wilks's lambda statistic in the context of multivariate data has also been studied (17).

2.2 Estimation

Since the use of response adaptive randomization does not affect the form of the likelihood function, the maximum likelihood estimators of the success probabilities are the same as for a nonadaptive design. However, their sampling distributions will usually be affected by the adaptive nature of the design.
For example, the estimators will often be biased and the usual pivots based on these estimators will no longer lead to confidence intervals with the correct coverage probabilities.

First consider point estimation. Then it is known that

E(p̂i) = pi + pi qi E[∂{1/NA(n)}/∂pi]

for i = A, B (18), which gives an exact formula for the bias of the maximum likelihood estimators. Since the expectation is difficult to evaluate in practice, it is approximated by using the limiting behavior of the allocation proportion in order to obtain the first-order bias. A similar, but more complicated, exact formula for the variance of the maximum likelihood estimators is also available. Using the limiting behavior of the allocation proportion and its variance, the second-order variance of the estimators may be derived. Simulation suggests that these approximations work well for moderate trial sizes.

Now consider interval estimation. Then, in general, the maximum likelihood estimators are asymptotically normal to first order (19). It follows that, to a first approximation, a 100(1 − α)% confidence interval for pi is given by

p̂i ± Zα/2 √(p̂i q̂i/ni)

for i = A, B, where Zγ denotes the 100γ% point of the standard normal distribution. However, for moderate trial sizes, this approximation may yield actual coverage probabilities far from the nominal values, since p̂i often has a non-negligible bias and its distribution could also be appreciably skewed. In order to correct for these, one needs to develop a second-order asymptotic theory for the distribution of p̂i. For example, one approach is to construct corrected confidence intervals for pi by considering the higher order moments of p̂i and by applying the Cornish–Fisher inversion of the Edgeworth expansion. Exact confidence intervals for the RPW rule have also been derived (20), and bootstrapped confidence intervals are available (21).
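As a rough numerical illustration of the bias discussed above, the following sketch simulates many small trials run under the randomized play-the-winner rule and compares the average maximum likelihood estimates with the true success probabilities. The RPW(1, 1) parameters, trial size, and number of replications are illustrative assumptions, and this empirical check is not the exact formula of (18).

```python
import random

def rpw_trial(p, n_patients, init=1, add=1, rng=None):
    """One trial under the randomized play-the-winner rule RPW(init, add).

    The urn starts with `init` balls per treatment; each patient is assigned
    by drawing a ball (with replacement), a success adds `add` balls of the
    same type, and a failure adds `add` balls of the other type.
    """
    rng = rng or random.Random()
    balls = {"A": init, "B": init}
    succ = {"A": 0, "B": 0}
    n = {"A": 0, "B": 0}
    for _ in range(n_patients):
        arm = "A" if rng.random() < balls["A"] / (balls["A"] + balls["B"]) else "B"
        other = "B" if arm == "A" else "A"
        n[arm] += 1
        if rng.random() < p[arm]:
            succ[arm] += 1
            balls[arm] += add
        else:
            balls[other] += add
    return succ, n

def estimate_bias(p, n_patients=50, n_trials=20000, seed=1):
    """Average difference between the MLE and the true success probability."""
    rng = random.Random(seed)
    est = {"A": [], "B": []}
    for _ in range(n_trials):
        succ, n = rpw_trial(p, n_patients, rng=rng)
        for arm in ("A", "B"):
            if n[arm] > 0:                 # skip arms that received no patients
                est[arm].append(succ[arm] / n[arm])
    return {arm: sum(v) / len(v) - p[arm] for arm, v in est.items()}

print(estimate_bias({"A": 0.5, "B": 0.8}))
```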
3 INCORPORATION OF STOPPING RULES
So far, we have considered clinical trials that have a fixed number of patients. There are often benefits to incorporating a stopping rule so that the trial can be stopped early if there is convincing evidence of a treatment difference or less promising treatments may be dropped from the trial.

3.1 Early Stopping

The use of sequential tests in clinical trials dates back to the 1950s, and a wide range of such tests are now available (22,23). Since sequential tests can significantly reduce the total number of patients in the trial, it is natural to consider combining a sequential test with response adaptive randomization. Note that this idea was first considered in the early 1970s, but most of the early work considered response adaptive designs that were deterministic and, hence, susceptible to selection bias (24).

As an example, the triangular test has recently been combined with several urn models and SMLE rules, when the parameter of interest θ is the log odds ratio and interest lies in testing H0: θ = 0 against H1: θ > 0 (25). This test is a sequential stopping rule that has been used successfully in practice (23). It is based on two test statistics: the efficient score for the parameter of interest θ, denoted by Z, and Fisher's information about θ contained in Z, denoted by V. Suppose that, after n responses have been observed, the numbers of successes and failures on treatment A are denoted by SA and FA, and define SB and FB similarly. Then the Z and V statistics are given by

Z = (nB SA − nA SB)/n

and

V = nA nB (SA + SB)(FA + FB)/n³.
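A small helper computing Z and V from the current success and failure counts follows; it is a sketch, and the counts in the example call are made-up values.

```python
def triangular_test_stats(s_a, f_a, s_b, f_b):
    """Efficient score Z and Fisher information V for the log odds ratio,
    computed from successes (s) and failures (f) on treatments A and B."""
    n_a, n_b = s_a + f_a, s_b + f_b
    n = n_a + n_b
    z = (n_b * s_a - n_a * s_b) / n
    v = n_a * n_b * (s_a + s_b) * (f_a + f_b) / n**3
    return z, v

# Illustrative counts only: 18/30 successes on A and 12/30 on B.
print(triangular_test_stats(s_a=18, f_a=12, s_b=12, f_b=18))
```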
The values of Z are plotted against those of V, and the test continues until a point falls outside the stopping boundaries. A decision is then made to reject or accept H0 according to which stopping boundary is crossed. One of the findings of the above work was that the use of the DL rule with the triangular
test is beneficial compared with the triangular test with complete randomization, since it yields fewer treatment failures on average while providing comparable power with similar expected sample size. The DL rule works well in this sequential setting because its allocation proportion has a relatively low variance. In contrast, use of the RPW rule always leads to a longer trial on average. The performance of the SMLE rules lies somewhere in between this rule and the DL rule.

3.2 Elimination of Treatments

When more than two treatments are being compared in a clinical trial, interest is often in which treatment is the most promising. Thus, it is natural to incorporate some form of elimination rule that enables less promising treatments to be eliminated as the trial proceeds. Note that this approach may be regarded as an extreme example of response adaptive randomization, since, if a treatment is dropped, its allocation probability becomes zero. Work in this area for continuous responses dates back to the early 1980s (26), but these ideas have also been explored more recently (27).

The comparison of multiple treatments is considerably more complicated than comparing just two. In practice, a global test statistic can be used to compare the treatments or they can be compared pairwise (22). If the former approach is adopted, multiple comparisons then need to be carried out in order to identify the less promising treatments. For example, the problem of comparing three treatments with binary responses using a global statistic followed by the least significant difference method has recently been considered (16), but no elimination rule was incorporated. However, a procedure has been proposed for a similar problem that allows less promising treatments to be eliminated (28). In this work, interest lay in finding the design that leads to the smallest total number of treatment failures subject to attaining a specified probability of correct selection of the best treatment at the end of the trial. One of the findings is that the DL rule is again more effective than complete randomization and all of the other adaptive designs considered. As before, this behavior is caused by the DL rule being less variable.
One issue that has received little attention in the literature is that of estimation after trials that have used response adaptive randomization and elimination. To appreciate some of the difficulties that develop as a result of using an elimination rule, consider the case of three treatments with binary responses and suppose that treatments are eliminated until a single one remains. Then estimation is reasonably straightforward after the first elimination, since only the stopping rule associated with this elimination needs to be considered. However, after the second treatment is eliminated, information from two stopping rules needs to be taken into account. Obviously, with more elimination times, the problem of estimation becomes more difficult.
4 FUTURE RESEARCH DIRECTIONS

For simplicity, we have mainly concentrated in this article on response adaptive randomization in the context of two treatments with binary responses. Of course, there is also a wealth of literature on other types of response. For example, several response adaptive randomization rules for continuous responses have recently been compared (29) and the results show that similar conclusions to the binary case apply. However, there has been limited work on survival data (30).

There is much work to be carried out on the inferential aspects of response adaptive randomization. Although it is probably feasible to work out the asymptotic theory in detail for some of the simpler models, it is likely that simulation-based approaches will be needed in more complex ones. However, the asymptotic distribution of the maximum likelihood estimators has been derived for covariate-adjusted models (31) and for models with delayed responses (32).

There are many other possible areas for research. In particular, it would be very valuable to develop a group sequential procedure with response adaptive randomization and elimination in order to compare several treatments. The main difficulty here lies in obtaining the joint distribution of the sequence of test statistics, so that the appropriate stopping boundaries may be calculated for a given significance level and power. However, the distribution theory is available in the two-treatment case (33), and some of these ideas have been explored further (34).

One practical drawback of many response adaptive randomization methods is that the underlying models are rather unrealistic. For example, covariates may influence the responses, but most of the existing work ignores such information. However, in the case of continuous responses, designs that use covariate information have been developed in both the univariate case (35) and the multivariate case (17), and an analogous design is available for binary responses (36). Distribution theory is also available for normal linear models with covariates (33).

REFERENCES
1. W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 1933; 25: 275–294. 2. J. H. Ware, Investigating therapies of potentially great benefit: ECMO (with discussion). Stat. Sci. 1989; 4: 298–340. 3. R. N. Tamura, D. E. Faries, J. S. Andersen, and J. H. Heiligenstein, A case study of an adaptive clinical trial in the treatment of outpatients with depressive disorder. J. Amer. Statist. Assoc. 1994; 89: 768–776. 4. J. S. Andersen, Clinical trial designs—made to order. J. Biopharm. Stat. 1996; 6: 515–522. 5. W. F. Rosenberger, Randomized urn models and sequential design (with discussion). Sequential Anal. 2002; 21: 1–41. 6. L. J. Wei and S. D. Durham, The randomized play-the-winner rule in medical trials. J. Amer. Statist. Assoc. 1978; 73: 840–843. 7. A. Ivanova, A play-the-winner-type urn design with reduced variability. Metrika 2003; 58: 1–13. 8. F. Hu and W. F. Rosenberger, Optimality, variability, power: Evaluating responseadaptive randomization procedures for treatment comparisons. J. Amer. Statist. Assoc. 2003; 98: 671–678. 9. F. Hu, W. F. Rosenberger, and L-X. Zhang, Asymptotically best response-adaptive randomization procedures. J. Statist. Plann. Inf. 2006; 136: 1911–1922. 10. L. S. Hayre and B. W. Turnbull, A class of approximate sequential tests for adaptive comparison of two treatments. Commun. Statist.–Theor. Meth. 1981; 10: 2339–2360.
11. W. F. Rosenberger, N. Stallard, A. Ivanova, C. N. Harper, and M. L. Ricks, Optimal adaptive designs for binary response trials. Biometrics 2001; 57: 909–913. 12. V. Melfi, C. Page, and M. Geraldes, An adaptive randomized design with application to estimation. Canad. J. Statist. 2001; 29: 107–116. 13. J. R. Eisele, The doubly adaptive biased coin design for sequential clinical trials. J. Statist. Plann. Inf. 1994; 38: 249–261. 14. F. Hu and L-X. Zhang, Asymptotic properties of doubly-adaptive biased coin designs for multi-treatment clinical trials. Ann. Statist. 2004; 21: 268–301. 15. A. Ivanova, W. F. Rosenberger, S. D. Durham, and N. Flournoy, A birth and death urn for randomized clinical trials: Asymptotic methods. Sankhya B 2000; 62: 104–118. 16. Y. Tymofyeyev, W. F. Rosenberger, and F. Hu, Implementing optimal allocation in sequential binary response experiments. J. Amer. Statist. Assoc. 2007; 102: 224–234. 17. A. Biswas and D. S. Coad, A general multitreatment adaptive design for multivariate responses. Sequential Anal. 2005; 24: 139–158. 18. D. S. Coad and A. Ivanova, Bias calculations for adaptive urn designs. Sequential Anal. 2001; 20: 91–116. 19. W. F. Rosenberger, N. Flournoy, and S. D. Durham, Asymptotic normality of maximum likelihood estimators from multiparameter response-driven designs. J. Statist. Plann. Inf. 1997; 10: 69–76. 20. L. J. Wei, R. T. Smythe, D. Y. Lin, and T. S. Park, Statistical inference with datadependent treatment allocation rules. J. Amer. Statist. Assoc. 1990; 85: 156–162. 21. W. F. Rosenberger and F. Hu, Bootstrap methods for adaptive designs. Statist. Med. 1999; 18: 1757–1767. 22. C. Jennison and B. W. Turnbull, Group Sequential Methods with Applications to Clinical Trials. London: Chapman and Hall, 2000. 23. J. Whitehead, The Design and Analysis of Sequential Clinical Trials, 2nd rev. ed. Chichester, U.K: Wiley, 1997. 24. H. Robbins and D. O. Siegmund, Sequential tests involving two populations. J. Amer. Statist. Assoc. 1974; 69: 132–139. 25. D. S. Coad and A. Ivanova, The use of the triangular test with response-adaptive treatment allocation. Statist. Med. 2005; 24: 1483–1493.
26. C. Jennison, I. M. Johnstone, and B. W. Turnbull, Asymptotically optimal procedures for sequential adaptive selection of the best of several normal means. In: S. S. Gupta and J. O. Berger (eds.), Statistical Decision Theory and Related Topics III, vol. 2. New York: Academic Press, 1982. 27. D. S. Coad, Sequential allocation involving several treatments. In: N. Flournoy and W. F. Rosenberger (eds.), Adaptive Designs. Hayward, CA: Institute of Mathematical Statistics, 1995. 28. D. S. Coad and A. Ivanova, Sequential urn designs with elimination for comparing K ≥ 3 treatments. Statist. Med. 2005; 24: 1995–2009. 29. A. Ivanova, A. Biswas, and A. Lurie. Response-adaptive designs for continuous outcomes. J. Statist. Plann. Inf. 2006; 136: 1845–1852. 30. W. F. Rosenberger and P. Seshaiyer, Adaptive survival trials. J. Biopharm. Statist. 1997; 7: 617–624. 31. W. F. Rosenberger and M. Hu, On the use of generalized linear models following a sequential design. Statist. Probab. Lett. 2001; 56: 155–161. 32. Z. D. Bai, F. Hu, and W. F. Rosenberger, Asymptotic properties of adaptive designs for clinical trials with delayed response. Ann. Statist. 2002; 30: 122–139. 33. C. Jennison and B. W. Turnbull, Group sequential tests with outcome-dependent treatment assignment. Sequential Anal. 2001; 20: 209–234. 34. C. C. Morgan and D. S. Coad. A comparison of adaptive allocation rules for group-sequential binary response clinical trials. Statist. Med. 2007; 26: 1937–1954. 35. U. Bandyopadhyay and A. Biswas, Adaptive designs for normal responses with prognostic factors. Biometrika 2001; 88: 409–419. 36. W. F. Rosenberger, A. N. Vidyashankar, and D. K. Agarwal. Covariate-adjusted responseadaptive designs for binary response. J. Biopharm. Statist. 2001; 11: 227–236.
FURTHER READING

F. Hu and A. Ivanova, Adaptive designs. In: S-C. Chow (ed.), Encyclopedia of Biopharmaceutical Statistics, 2nd ed. New York: Marcel Dekker, 2004.
J. N. S. Matthews, An Introduction to Randomized Controlled Clinical Trials. London, U.K.: Arnold, 2000.
W. F. Rosenberger and J. M. Lachin, Randomization in Clinical Trials: Theory and Practice. New York: Wiley, 2002.
CROSS-REFERENCES

Biased-Coin Randomization
Estimation
Power
Stopping Boundaries
Variability
Response Surface Methodology

The main purpose of this methodology is to model the response based on a group of experimental factors presumed to affect the response, and to determine the optimal setting of the experimental factors that maximize or minimize the response. The factors are all quantitative, and the objective is achieved through a series of experiments. Let F1, F2, . . . , Fk be k factors affecting the response y and let E(y) = f(X1, X2, . . . , Xk), where X1, X2, . . . , Xk are the levels of F1, F2, . . . , Fk, and E(y) is the expected response. We assume f to be a polynomial of degree d. A k-dimensional design of order d is said to be constituted of n runs of the k factors (Xi1, Xi2, . . . , Xik), i = 1, 2, . . . , n, if from the responses recorded at the n points all of the coefficients in the dth degree polynomial are estimable.
First-Order Design

Initially a first-order design will be used to fit the model E(y) = β0 + β1X1 + · · · + βkXk. For this, one usually uses Plackett & Burman designs [15]. A Hadamard matrix Hm is an m × m matrix of ±1 such that Hm Hm′ = mIm, where Im is the identity matrix of order m and ′ denotes the transpose. A necessary condition for the existence of Hm is m = 2 or m ≡ 0 (mod 4). If 4t − 5 ≤ k < 4t − 1, in an H4t the first column will be converted to have all ones, and any k columns of the last 4t − 1 columns of H4t will be identified with the coded levels of the k factors in 4t runs. These 4t runs and several central points (0, 0, . . . , 0) in coded levels constitute the design. If the lack of fit is significant, then one plans a second-order design at that center. Otherwise, one moves away from the center by the method of steepest ascent to determine a new center at which to plan a second-order design.
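A minimal sketch of this construction is given below. It uses the Sylvester construction, so the order m is a power of two (m = 8 here) rather than a general multiple of 4; the function names, the choice m = 8, and the three center points are assumptions made for illustration only.

```python
import numpy as np

def sylvester_hadamard(m):
    """Hadamard matrix of order m via the Sylvester construction (m a power of 2)."""
    if m == 1:
        return np.array([[1]])
    h = sylvester_hadamard(m // 2)
    return np.block([[h, h], [h, -h]])

def first_order_design(k, m=8, n_center=3):
    """±1 coded runs for k factors taken from columns of H_m, plus center points."""
    if not 1 <= k <= m - 1:
        raise ValueError("need 1 <= k <= m - 1")
    h = sylvester_hadamard(m)        # first row and column are all +1 in this form
    runs = h[:, 1:k + 1]             # k of the last m - 1 columns give the factor levels
    center = np.zeros((n_center, k), dtype=int)
    return np.vstack([runs, center])

# Example: 8 factorial-type runs plus 3 center runs for k = 4 factors.
print(first_order_design(k=4))
```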
Method of Steepest Ascent

One maximizes the estimated response ŷ = β̂0 + β̂1X1 + · · · + β̂kXk from the initial design on the contours X1² + · · · + Xk² = R². The maximum occurs when Xi ∝ β̂i. One decides a desirable increment for some factor Fi, determines the proportionality constant, and then determines Xj for j = 1, 2, . . . , k, j ≠ i. In this way all coordinates in k dimensions are determined, to obtain the step ∆. Moving the center by incrementing ∆, 2∆, 3∆, . . . , one determines the expected ŷ. If ŷ shows a maximum or minimum in the experimental region, then one moves the center to the setting at which ŷ is optimum and carries out a second-order experiment. Otherwise, at a reasonable distance away from the original center, one performs another first-order experiment to determine a new path along which a center for the second-order experiment will be determined.
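The calculation of the path can be sketched as follows: steps are taken proportional to the β̂i, scaled so that a chosen factor changes by a stated increment per step. The function name, the fitted coefficients, and the increment are illustrative assumptions.

```python
import numpy as np

def steepest_ascent_path(beta_hat, key_factor, increment, n_steps=5):
    """Coded-unit factor settings along the path of steepest ascent.

    beta_hat   : fitted first-order coefficients (beta_1, ..., beta_k)
    key_factor : index of the factor whose per-step change is fixed
    increment  : desired change in that factor per step, in coded units
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    # Move proportionally to the coefficients: X_i proportional to beta_i.
    delta = beta_hat * (increment / beta_hat[key_factor])
    return np.array([step * delta for step in range(1, n_steps + 1)])

# Illustrative fitted coefficients for three factors; the first factor
# advances one coded unit per step and the others move in proportion.
print(steepest_ascent_path(beta_hat=[2.0, -0.5, 1.0], key_factor=0, increment=1.0))
```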
Second-Order Experiment

Let the design consist of F noncentral points and n0 central points, and let n = F + n0. In the coded doses, without loss of generality, one assumes that (where Σi denotes summation over the n design points):

Σi Xiα = 0, Σi Xiα Xiβ = 0, Σi Xiα² Xiβ = 0, Σi Xiα³ = 0, Σi Xiα³ Xiβ = 0, Σi Xiα Xiβ Xiγ = 0, Σi Xiα² Xiβ Xiγ = 0, Σi Xiα Xiβ Xiγ Xiδ = 0, for α ≠ β ≠ γ ≠ δ;  (1)

Σi Xiα² = n;  (2)

Σi Xiα⁴ = na, Σi Xiα² Xiβ² = nλ4, for α ≠ β.  (3)

If X is the design matrix, then X′X cannot be made diagonal in the original parameters. However, in orthogonal polynomials of the factor settings, one may obtain orthogonality if λ4 = 1. A second-order design is called orthogonal when Σi Xiα² Xiβ² = n. A second-order design is said to be rotatable if the variance of the estimated response at (x1, x2, . . . , xk) is a function of ρ = x1² + · · · + xk². For a rotatable design, we have a = 3λ4 and λ4 > k/(k + 2). The last inequality is needed to make X′X nonsingular. By making var(ŷ) the same at the settings at which ρ = 1 and ρ = 0, we obtain uniformly precise rotatable designs and, for them, the values of λ4 for different values of k are as shown in Table 1.
Table 1. Values of λ4 for uniformly precise rotatable designs

k    λ4
2    0.7844
3    0.8385
4    0.8704
5    0.8918

The noncentral (nonzero) points are usually taken as follows:

1. A 3^n experiment or a fractional replication of a 3^n experiment in which main effects and two-factor interactions are not aliased with each other, with factor levels −g, 0, and g (see Fractional Factorial Designs).
2. A Central Composite Design (CCD), in which the factorial points form a 2^n experiment or a resolution 5 fractional replication with levels g and −g, together with 2n axial points (±α, 0, . . . , 0), (0, ±α, . . . , 0), . . . , (0, 0, . . . , ±α).
3. A Box–Behnken design [3], in which the v × b incidence matrix of a balanced incomplete block design with parameters v, b, r, k, and λ, where r = 3λ, is used, in which the ones in each column are replaced by ±g, so that the v factors are experimented on in b(2^k) runs.

The factor levels in the runs are determined so that the design is orthogonal, or rotatable, or of uniform precision. Let us illustrate using a CCD, which is orthogonal and rotatable, in k = 3 factors. The noncentral points are F = 8 + 6 = 14 of the form (±g, ±g, ±g), (±α, 0, 0), (0, ±α, 0), (0, 0, ±α). Let n0 be the number of central points and let n = 14 + n0. If one wants the CCD to be orthogonal and rotatable, one must have

8g⁴ + 2α⁴ = 24g⁴,    8g⁴ = 14 + n0.

Furthermore, condition (2) implies that

8g² + 2α² = 14 + n0.

An approximate solution is α = 2.197, g = 1.306, n0 = 9.

Canonical and Ridge Analysis

Using a second-order design, one conducts an experiment and, using the data, fits a second-degree regression equation,

ŷ = β̂0 + X′β̂ + X′B̂X,

where X = (X1, X2, . . . , Xk)′ is the vector of the k factor settings, β̂ = (β̂1, β̂2, . . . , β̂k)′, and

B̂ = [  β̂11    ½β̂12   · · ·   ½β̂1k
       ½β̂21    β̂22   · · ·   ½β̂2k
        ...
       ½β̂k1   ½β̂k2   · · ·    β̂kk ].

The critical point at which the derivative of ŷ with respect to X is zero is given by x0, where 2B̂x0 = −β̂. Letting z = X − x0 and ŷ0 = β̂0 + x0′β̂ + x0′B̂x0, the regression equation can be rewritten as ŷ = ŷ0 + z′B̂z.

Let λ1 ≥ λ2 ≥ · · · ≥ λk be the eigenvalues of B̂, and let D be the diagonal matrix with elements λ1, λ2, . . . , λk. Let M be an orthogonal matrix such that D = M′B̂M, and let w = M′z. Then ŷ = ŷ0 + λ1w1² + · · · + λkwk², where w = (w1, w2, . . . , wk)′. This implies that at the critical point x0 a local maximum is attained when λ1 ≤ 0, and a local minimum is attained when λk ≥ 0. When the inequalities are strict, x0 is the unique critical point, whereas when the inequalities are not strict, x0 is a point at which a local maximum or minimum is attained.

When some λi are positive and some negative, one may find an absolute maximum (or minimum) of ŷ on concentric spheres of varying radii Ri. The estimated regression function ŷ = β̂0 + X′β̂ + X′B̂X is maximized (or minimized) subject to X′X = R², and the x* satisfying 2(B̂ − µIk)x* = −β̂ maximizes (or minimizes) ŷ if µ > λ1 (or µ < λk). For different choices of µ, depending on the objective, x* and R² are determined. The ŷ values at those x* values are then determined, and ŷ is plotted against R² to find the absolute maximum or minimum in the region of experimentation.
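A compact sketch of the canonical analysis just described is given below: B̂ is assembled from fitted second-order coefficients, the stationary point x0 is found from 2B̂x0 = −β̂, and the eigenvalues of B̂ indicate the nature of the surface. The function name and the numerical coefficients are made-up values for illustration.

```python
import numpy as np

def canonical_analysis(beta0, beta, beta2):
    """Canonical analysis of a fitted second-order response surface.

    beta0 : intercept
    beta  : length-k vector of linear coefficients
    beta2 : k x k array with beta2[i, i] the quadratic coefficient of X_i and
            beta2[i, j] (i < j) the interaction coefficient of X_i X_j
    """
    beta = np.asarray(beta, dtype=float)
    beta2 = np.asarray(beta2, dtype=float)
    k = len(beta)
    b_hat = np.diag(np.diag(beta2)).astype(float)     # quadratic terms on the diagonal
    for i in range(k):
        for j in range(i + 1, k):
            b_hat[i, j] = b_hat[j, i] = 0.5 * beta2[i, j]
    x0 = np.linalg.solve(2.0 * b_hat, -beta)          # stationary point
    y0 = beta0 + x0 @ beta + x0 @ b_hat @ x0          # predicted response at x0
    eigvals = np.linalg.eigvalsh(b_hat)               # in ascending order
    return x0, y0, eigvals

# Illustrative fit with two factors; all eigenvalues are negative here,
# so the stationary point is a maximum of the fitted surface.
x0, y0, lam = canonical_analysis(
    beta0=80.0,
    beta=[2.0, 1.0],
    beta2=[[-3.0, 1.5],
           [0.0, -2.0]],   # only the upper triangle is read for interactions
)
print(x0, y0, lam)
```

When some eigenvalues are positive and some negative, the same B̂ and β̂ can be reused for a ridge analysis by solving 2(B̂ − µI)x* = −β̂ over a grid of µ values larger than the largest eigenvalue.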
Further Reading

1. For some of the original ideas in this methodology, the interested reader is referred to the papers of G.E.P. Box and his co-authors (see Box & Wilson [7], Box [1, 2], Box & Youle [8], Box & Hunter [6], and Box & Draper [4]).
2. For more details of this methodology, the interested reader is referred to the books by Box & Draper [5], Khuri & Cornell [10], and Myers & Montgomery [13]. For review articles on this methodology, see Herzberg & Cox [9], Mead & Pike [11], and Myers et al. [14].
3. For other constructions of rotatable designs, see Raghavarao [16]. For the construction of Hadamard matrices, which are Plackett & Burman designs, see Raghavarao [16].
4. Taguchi and his co-workers developed different ideas to optimize responses using orthogonal arrays [17]. See also Vining & Myers [18] for combining Taguchi and response surface philosophies.
5. For handling dual responses in this methodology, see Myers & Carter [12].
6. For basic ideas of orthogonal blocking of second-order experiments, see Box & Hunter [6].
References

[1] Box, G.E.P. (1952). Multifactor designs of first order, Biometrika 39, 49–57.
[2] Box, G.E.P. (1954). The exploration and exploitation of response surfaces: some general considerations and examples, Biometrics 10, 16–60.
[3] Box, G.E.P. & Behnken, D.W. (1960). Some new three-level designs for the study of quantitative variables, Technometrics 2, 455–475.
[4] Box, G.E.P. & Draper, N.R. (1963). The choice of a second order rotatable design, Biometrika 50, 335–352.
[5] Box, G.E.P. & Draper, N.R. (1987). Empirical Model Building and Response Surfaces. Wiley, New York.
[6] Box, G.E.P. & Hunter, J.S. (1957). Multifactor experimental designs for exploring response surfaces, Annals of Mathematical Statistics 28, 195–241.
[7] Box, G.E.P. & Wilson, K.B. (1951). On the experimental attainment of optimum conditions, Journal of the Royal Statistical Society, Series B 13, 1–45.
[8] Box, G.E.P. & Youle, P.V. (1955). The exploration and exploitation of response surfaces: an example of the link between the fitted surface and the basic mechanism of the system, Biometrics 11, 287–322.
[9] Herzberg, A.M. & Cox, D.R. (1969). Recent work on the design of experiments: a bibliography and a review, Journal of the Royal Statistical Society, Series A 132, 29–67.
[10] Khuri, A.I. & Cornell, J.A. (1987). Response Surfaces: Designs and Analyses. Marcel Dekker, New York.
[11] Mead, R. & Pike, D.J. (1975). A review of response surface methodology from a biometric viewpoint, Biometrics 31, 803–851.
[12] Myers, R.H. & Carter, W.H., Jr (1973). Response surface techniques for dual response systems, Technometrics 15, 301–317.
[13] Myers, R.H. & Montgomery, D.C. (1995). Response Surface Methodology: Process and Product Optimization using Designed Experiments. Wiley, New York.
[14] Myers, R.H., Khuri, A.I. & Carter, W.H., Jr (1989). Response surface methodology: 1966–1988, Technometrics 3, 137–157.
[15] Plackett, R.L. & Burman, J.P. (1946). The design of optimum multifactorial experiments, Biometrika 33, 305–325.
[16] Raghavarao, D. (1971). Constructions and Combinatorial Problems in Design of Experiments. Wiley, New York.
[17] Taguchi, G. (1987). System of Experimental Design: Engineering Methods to Optimize Quality and Minimize Cost. UNIPUB/Kraus International, White Plains, New York.
[18] Vining, G.G. & Myers, R.H. (1990). Combining Taguchi and response surface philosophies: dual response approach, Journal of Quality Technology 22, 38–45.
D. RAGHAVARAO & S. ALTAN
RISK ASSESSMENT
PETER ARMITAGE
University of Oxford, Oxford, UK

The various usages of the term risk all concern the possible occurrence of events or situations, called hazards, the consequences of which are uncertain, but may be harmful. Informal usages of risk may indicate the nature, or merely the existence, of the possible danger ("There is a risk of post-operative infection"; "I never take risks"). In technical discussions the term is used quantitatively, but even there the usage is not standard. There are two principal, and mutually incompatible, interpretations:

1. The probability, or chance, of an adverse event. Clearly, this must be put in context: it should refer to a defined set of circumstances, and, for hazards continuing over time, the rate per time unit, or for a unit of exposure, is normally used.
2. A combination of the chance of an adverse effect and its severity. There are obvious difficulties with this type of definition: How is severity measured, and how are the two components combined?

The extensive literature on risk covers many aspects, which are commonly collectively termed risk assessment or risk analysis. The first of these terms is sometimes used more restrictively, to include the concepts of risk estimation, risk evaluation, and risk perception, as defined below. The study of risk brings together engineers, behavioral and social scientists, statisticians, and others, and to some extent usage of terms varies amongst these groups. For example, engineers and other technologists tend to favor approach 2, statisticians and biologists tend to favor 1, and behavioral and social scientists often tend to use a multifaceted approach. Reference (6) contains chapters by groups of writers from different backgrounds, and has extensive bibliographies. See also (1) for a popular exposition.

Risk theory has a specialized meaning, being concerned with the financial integrity of an insurance company in the light of random fluctuations in claims. It forms an application of the theory of stochastic processes (7). Statisticians will note that usage 2 above is closely related to the concept of a risk function in decision theory. There, uncertain events, the distribution of which depends on an unknown scenario, have consequences measured by a loss function; a particular decision function, defining the action to be taken when the event is observed, has an average loss for any given scenario; and the risk (or integrated risk) is the mean of the average loss when taken over the prior distribution of the scenarios. Application of this approach is hampered by the difficulty of determining losses in financial terms, and of defining the various probability distributions.

Attention has been focused on various interrelated aspects of risk, including the following:

1. Risk estimation: the estimation of the probabilities of the adverse outcomes, and of the nature and magnitude of their consequences.
2. Risk evaluation: determination of the significance of the hazards for individuals and communities. This depends importantly on the next aspect.
3. Risk perception: the extent to which individuals assess risks and the severity of possible outcomes, assessments that may differ from those made by "experts".
4. Risk management: the measures taken by individuals and societies to prevent the adverse effects of hazards and to ameliorate their consequences.
We deal briefly with these topics in turn. The articles in (6), and their bibliographies, provide a much broader picture. Many of these topics are discussed fully elsewhere,
in relation to risk assessment for environmental chemicals.

1 HEALTH HAZARDS
There are several clearly distinct categories of hazards that give rise to health risks. First, there are hazards that arise from the physical and biological environment. Many of the hazards in the physical environment are man-made. It is our own choice, collectively, to pollute the atmosphere with emissions from domestic fires, power stations, or burning oil wells, and to treat water supplies with disrespect. These are examples of damage to the environment, and damage to ourselves from the environment. The biological environment presents a hazard to us mainly in the form of microorganisms causing infectious disease. Other hazards arise from personal, rather than societal, choice. These include habits with adverse consequences, such as the consumption of tobacco, alcohol, and narcotic drugs. The category also includes indulgence in sport and travel, and our often unwise dietary choices. We tend to shrug off the hazards that we ourselves incur, by understating the risks or overstating the benefits, while deprecating the folly of others. Finally, there are hazards that cannot be prevented by personal decisions. They follow inexorably from our innate or ingrained characteristics: our genetic makeup or our experiences in early life. In some instances, medical science can reduce the risk to which susceptible individuals are subject; in others, the burden has to be endured.

2 RISK ESTIMATION
The risks from many prominent health hazards can be estimated reliably from objective statistical information. In other instances, in which numeric information is lacking, risks may be guessed by informed experts (as in the setting of insurance premiums for nonstandard risks). There are, for instance, no reliable data on the frequency of explosions at nuclear power installations, and estimates of risk would have to rely on expert judgments, or on careful estimation of risks of failure
at individual links in the chain of connected events. Even when statistical information is available, an individual may argue that his or her risk is not properly represented by the population estimate. There is a long-standing debate as to whether medical statistical information necessarily applies to an individual in the population concerned; in the nineteenth century, for instance, opposite views were held by P.C.A. Louis and by C. Bernard. Clearly, if the individual has known characteristics that can be shown to affect the risk, they should be taken into account. If no such characteristics can be identified, it seems reasonable to apply the population estimate to the individual. The point is important, in emphasizing that risk estimation is far from being an objective matter. Statistical information on the risk of mortality from different diseases is widely available, for individuals of each sex at different ages, in different occupational and social groups and for different countries. Information on the risks of morbidity is less comprehensive. Such information, based on data for large populations, is of some value for the estimation of risks for random members of the populations, but gives little or no indication of risks for individuals exposed to certain specified hazards; such as environmental pollution, social habits, or the onset of disease. For questions of this type, special investigations are required. The whole range of types of epidemiologic study is available, including case–control studies, cohort studies, and case–cohort studies. The risks of adverse progression of disease may be estimated by a study of prognosis. See (3) for an example of various investigatory methods employed in a study of the apparent excess risks of childhood leukemia due to contamination of water supplies in a town. In many ‘‘high-profile’’ public health problems, it is not possible to mount epidemiologic studies to give unambiguous estimates of risk. The mechanism giving rise to the risk may not be fully understood, or the dangers may arise from a complex chain, the risks for which are difficult to measure. In such instances, the risks may be estimable only within very broad bands. For instance, in the crisis in the British beef industry, due to the
outbreak of bovine spongiform encephalopathy (BSE), leading to an apparent risk of Creutzfeldt–Jakob disease (CJD), it was very difficult to estimate precisely the risk of CJD to a person eating beef. Since the cessation of use of suspect cattle feed, and the culling of relevant herds, it is probably reasonable to say that the risk is "extremely low", and perhaps to put some upper bound on it, but such estimates would rely on somewhat shaky data, and on the personal judgments of experts. Another example is that of prolonged exposure to low levels of possibly carcinogenic chemicals. Carcinogenicity experiments, with the administration of high doses to animals, may give quite precise estimates of a median effective dose. However, risk estimation for low-level exposure to humans involves extrapolation to low doses (using models that are not necessarily correct (5)), and from the animal to the human species. The result of such extrapolation may well be reassuring, but it is unlikely to be quantitatively precise.

3 RISK EVALUATION AND PERCEPTION
The evaluation of risk, either by individuals or by societies, should in principle involve a balancing of the costs and benefits: the potential occurrence of adverse effects, arising from exposure to a hazard, should be balanced against the potential benefits in physical or psychological rewards. Cost–benefit analysis is, however, a somewhat idealized concept. Apart from the difficulties of risk estimation, outlined above, both the potentially adverse effects and the supposed benefits may be difficult to evaluate on commensurate scales. The benefits may in part be assessable as direct economic gains to a community. They may also include amenities, such as palatable food or attractive cosmetics, the value of which may be estimable by enquiry as to the prices that people are willing to pay for them. The costs may be even more elusive. They include direct financial losses; for instance, in productivity. They include also disbenefits of pain and other symptoms. One might enquire how much people would be willing to pay to
avoid such discomforts, but this would be a difficult exercise for people who had never experienced the symptoms in question. Then, there is the crucial question of the value of human life. There are various approaches to this task, such as: (i) calculation of lost earning capacity; (ii) implicit evaluation based on societal practice, such as compensation awards or expenditure on specific safety measures; or (iii) the size of insurance premiums. None of these possible approaches is likely to be simple, but it seems important to encourage further discussion and research, especially for the evaluation of risks for which community decisions, such as the imposition of government regulations, are required. Evaluation by individuals of risks incurred by possible individual choices, again in principle involves the balancing of costs and benefits, but these may be very subjective and even more difficult to quantify than those involved in community action. In a sense, the decisions actually taken by individuals, sometimes without appreciable introspection, carry implications about the values attached by those individuals to the various elements in the equation. From this point of view, the relevant estimates of risk may be the subjective perceptions of the individuals themselves, rather than more ‘‘objective’’ estimates provided by experts. These two forms of estimate may be quite disparate. We tend to be more concerned about infrequent but dramatic events, such as major air crashes, than about frequent but less dramatic series of events such as the regular toll of road accident deaths. In one study (4), people thought that accidents caused as many deaths as disease, whereas in fact disease causes 15 times as many. The incidences of death from spectacular causes such as murder, botulism, tornadoes, and floods were all overestimated, whereas those for cancer, stroke, and heart disease were underestimated. The importance of the ‘‘benefit’’ side of the equation is illustrated by the varying acceptability of activities with comparable risks. People are generally prepared to accept much higher risks of death from activities in which they participate voluntarily, such as sports, than from those encountered involuntarily.
4 RISK MANAGEMENT
This term covers the decisions, taken by individuals and communities, to accept or forego hazardous situations after assessment of risks, or to reduce exposure to the hazards and/or their adverse consequences. As noted above, decisions by individuals are highly personal, and to a detached observer they may often seem irrational. A rational study of teenage smoking may conclude that the hazardous practice should be avoided, but its conclusions may carry little weight with a young person who is ill-informed about risks, and whose "benefits" include the pleasures of conformity with peer practice. Nevertheless, in such situations, improved information about risks and adverse consequences is highly desirable, and the provision of risk information forms one of the major roles of government and other public bodies concerned with risks. Institutions with a role in risk management include international, national, and regional governments, and a variety of public and private organizations. Apart from the provision of information, governments may issue regulations to reduce or control the use of hazardous substances. Their decisions may be guided by advisory committees, perhaps internationally based. For instance, in the assessment of evidence of carcinogenicity of chemicals, authoritative advice is provided by the program of the International Agency for Research on Cancer (IARC) Monographs on the Evaluation of the Carcinogenic Risk of Chemicals to Humans (2). Institutions concerned with mitigation of the adverse effects of hazards include the judiciary (through compensation awarded in the law courts), insurance companies, and a variety of community bodies concerned with social welfare.

5 CONCLUSIONS
The interdisciplinary nature of all the aspects of risk assessment discussed here has encouraged lively discussion and research. Biostatistics forms only one component in the mixture, but it is an essential ingredient. Publications are spread widely in the technical press, but
special note should be taken of the journal Risk Analysis.

REFERENCES

1. British Medical Association (1987, 1990). The BMA Guide to Living with Risk. Penguin, London.
2. International Agency for Research on Cancer (IARC) (1982). IARC Monographs on the Evaluation of the Carcinogenic Risk of Chemicals to Humans, Supplement 4, Chemicals, Industrial Processes and Industries Associated with Cancer in Humans. IARC Monographs, Vols. 1–29. International Agency for Research on Cancer, Lyon.
3. Lagakos, S. W., Wessen, B. J. & Zelen, M. (1986). An analysis of contaminated well water and health effects in Woburn, Massachusetts (with discussion), Journal of the American Statistical Association 81, 583–614.
4. Lichtenstein, S., Slovic, P., Fischhoff, B., Layman, M. & Combs, B. (1978). Judged frequency of lethal events, Journal of Experimental Psychology: Human Learning and Memory 4, 551–578.
5. Lovell, D. P. & Thomas, G. (1996). Quantitative risk assessment and the limitations of the linearized multistage model, Human and Experimental Toxicity 15, 87–104.
6. Royal Society (1992). Risk: Analysis, Perception and Management. Report of a Royal Society Study Group. Royal Society, London.
7. Seal, H. L. (1988). Risk theory, in Encyclopedia of Statistical Sciences, Vol. 8, S. Kotz & N. L. Johnson, eds. Wiley, New York, pp. 152–156.
ROBUST TWO-STAGE MODEL-GUIDED DESIGNS FOR PHASE I CLINICAL STUDIES
DOUGLAS M. POTTER
Biostatistics Department, Graduate School of Public Health, University of Pittsburgh
Biostatistics Facility, University of Pittsburgh Cancer Institute
Pittsburgh, Pennsylvania

Many designs have been proposed for phase I dose-finding studies of chemotherapeutic agents in cancer patients (1). These designs are based either on a set of dose-escalation rules (rule-based designs) or on a dose-toxicity model and a target toxicity rate among the patients treated (model-guided designs). These two types of design have various strengths and weaknesses; however, hybrid designs with a rule-based first stage and a model-guided second stage combine the essential strengths and avoid the serious weaknesses. This article begins by reviewing the basics of phase I studies of chemotherapeutic agents. It then discusses the strengths and weaknesses of a commonly used rule-based design and a commonly used model-guided design. Finally, it compares three hybrid designs with these two designs as well as with each other. These and other phase I designs can be used for any agents that, like chemotherapeutic agents, are believed to be most effective at the highest tolerable dose, where tolerability is defined in terms of toxicity.

1 BASIC ELEMENTS OF PHASE I STUDIES

The assumption underlying phase I dose-finding studies of chemotherapeutic agents is that both efficacy and toxicity increase with dose. The justification for this assumption is that these agents kill not only cancer cells but also other dividing cells, and that the killing is dose-dependent for all dividing cells. Thus, one primary objective of a phase I study is to estimate the maximum dose of an agent that can be administered without unacceptable toxicity. That dose would usually be recommended for use in subsequent phase II studies and is frequently called the maximum tolerated dose (MTD). The second primary objective is to identify the toxicities associated with the agent, and in particular to identify those that limit the dose that can be administered; these are called dose-limiting toxicities (DLTs). Although the study protocol must define generic DLTs, specific DLTs are identified during the study. The MTD is defined in terms of the fraction of patients experiencing DLT; however, the specific definition depends on the study design. Chemotherapeutic agents are administered in "cycles" (or "courses") of generally 3 to 6 weeks duration. Only DLTs that occur during the evaluation period, which is usually the first cycle of therapy, affect dose-escalation decisions. Patients are usually treated until disease progression or unacceptable toxicity, and they usually receive the same dose throughout the study unless a dose reduction is required because of toxicity. However, it is becoming increasingly common to use some form of within-patient dose escalation. Phase I studies can be broadly divided into two categories. The first category is the first-in-human study, which is done with agents for which there is only preclinical experience. The second category is the new study of an agent or agents that have been evaluated previously in humans; this includes studies involving new agent formulations, routes of administration, or dose-administration schedules, and also includes studies of combinations of agents. At the beginning of the study, one or more patients are treated at the "starting dose" of the agent. For first-in-human studies, the starting dose is commonly scaled by body surface area from a minimally toxic dose in the most sensitive animals investigated (2, 3). The ratio of the MTD to starting dose thus obtained varies enormously over agents (4). Most study designs investigate a set of predetermined doses, or "dose levels." A common set of dose-escalation factors (the ratios of the doses at successive levels) used to determine the levels is the modified Fibonacci series in which the factors are usually 2, 1.67, 1.5, 1.4, 1.33, 1.33, . . . ; when normalized to the starting
dose, these factors correspond to dose levels of 1, 2, 3.3, 5, 7, 9.3, 12.4, . . . . For agents that have been evaluated previously in humans, the starting dose will be based on clinical experience, and escalation factors are usually fixed at about 1.3.
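As a quick arithmetic check (a sketch in Python; the starting dose is normalized to 1, and the factor sequence is simply the one quoted above), the normalized dose levels follow from cumulative multiplication of the escalation factors:

```python
# Modified Fibonacci escalation factors quoted above; after the first few
# steps the factor settles at about 1.33.
factors = [2, 1.67, 1.5, 1.4, 1.33, 1.33]

levels = [1.0]  # starting dose, normalized to 1
for f in factors:
    levels.append(levels[-1] * f)

print([round(level, 1) for level in levels])
# [1.0, 2.0, 3.3, 5.0, 7.0, 9.3, 12.4] -- the normalized dose levels quoted above
```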
Most phase I designs attempt both to minimize the number of patients treated at low, probably ineffective doses, and also to minimize the chance that patients will be treated at excessively toxic or lethal doses. These are ethical considerations, and they compromise the ability to minimize the sample size required to determine the MTD with a specified accuracy. The designs of interest here are for studies that estimate a population-averaged MTD based on toxicity that can be assessed in the near term; delayed or cumulative toxicity is not of primary concern and can be difficult to assess (due to disease progression, dose reductions, etc.). The designs can be divided into two categories: rule based and model guided. In general, designs well-suited for first-in-human studies will be inappropriate for agents that have been evaluated previously in humans and vice versa. This limitation is seldom acknowledged, but it is critical: if used in first-in-human studies, many model-guided designs will often fail to find a MTD that is based on observed DLTs. This limitation, which is discussed in more detail later, has motivated the development of the robust two-stage designs. Additional background information on phase I clinical studies can be found in many books (see, e.g., Eisenhauer et al. [5]) and in a recent review (1), which also contains much of the material discussed here.

2 THE STANDARD RULE-BASED DESIGN
The standard rule-based dose-escalation design (4) uses cohorts of three patients treated at the same dose level. This design is by far the most widely used, and is thus an appropriate benchmark; I refer to it as the "standard" because it is the default design in the phase I protocol templates that are available at the website of the National Cancer Institute's Cancer Therapy Evaluation Program (http://ctep.cancer.gov/). If 1/3 patients
experiences DLT, 3 more patients are added at the same level, and if 0/3 or 1/6 experience DLT, the dose for the next cohort is escalated by one level. Escalation stops and de-escalation begins as soon as 2 patients at a dose level experience DLT. If only 3 patients have been treated at a dose level, 3 additional patients will be treated. If >1/6 patients experience DLT, de-escalation continues until ≤1/6 patients at a dose level experience DLT; this level is defined to be the MTD. The standard design is not only simple, but it is also simple to implement: there is no requirement for statistical input. In addition, it is ‘‘robust,’’ where by robust I mean that, given the observed DLTs, escalation will be sensible, and the design will determine a reasonable MTD. A design that is not robust may, for example, declare a MTD to be found in the absence of any DLT. The standard design has many shortcomings, which have led statisticians to consider new designs for phase I studies. (1) Many patients are likely to be treated at low doses. (2) At least two patients will be treated at ≥1 dose level above the MTD, and thus will be subject to more severe toxicity than at the MTD. (3) Dose-escalation factors are not reduced upon the observation of DLT; reduction would improve the margin of safety. (4) The MTD is not a dose with any particular probability of DLT. (5) The MTD estimate has large variability. (6) Because no model is fit to the data, definition of the MTD does not make optimal use of the DLT data. (7) Frequently, additional patients are evaluated at the MTD after it has been determined. This design has no provision for modifying the MTD based on DLTs observed in the additional patients. (8) Dose levels result in the following conundrum: if the levels are closely spaced, many patients will be needed to complete the study, and many of these will be treated at low doses; however, if the levels are coarsely spaced, the accuracy of the MTD will be poor, and patients will be more likely to be treated at excessively toxic doses. Various rule-based and model-guided designs have been developed to address these problems (1).
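The escalation rules described at the start of this section can be written down compactly. The following is a minimal sketch in Python, not an implementation taken from any protocol template; the function name and the reduction of the rules to a per-level decision are my own simplifications.

```python
def three_plus_three_decision(n_treated: int, n_dlt: int) -> str:
    """Decision for the current dose level under the standard 3+3 rules above.

    n_treated -- evaluable patients so far at this level (3 or 6)
    n_dlt     -- how many of them experienced a dose-limiting toxicity
    Returns 'escalate', 'expand' (add three more patients at this level),
    or 'de-escalate'.
    """
    if n_dlt >= 2:
        # Two or more DLTs at a level always stop escalation.
        return "de-escalate"
    if n_treated == 3:
        # 0/3 -> escalate; 1/3 -> treat three more at the same level.
        return "escalate" if n_dlt == 0 else "expand"
    if n_treated == 6:
        # 0/3 or 1/6 with DLT -> escalate; otherwise de-escalate.
        return "escalate" if n_dlt <= 1 else "de-escalate"
    raise ValueError("expected 3 or 6 evaluable patients at a dose level")


# Example: one DLT among the first three patients triggers an expansion cohort.
assert three_plus_three_decision(3, 1) == "expand"
```

During de-escalation, a level with only three patients is expanded to six, and the MTD is declared at the highest level with at most one DLT among six patients, as described above.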
3 DESIGNS BASED ON THE CONTINUAL REASSESSMENT METHOD

O'Quigley et al. (6) introduced the continual reassessment method (CRM), a Bayesian model-guided design. The method uses a model for the population-averaged DLT probability (PDLT) at each dose level, and defines the MTD to be the dose level that has PDLT closest to θ, the target DLT probability. PDLT is continually updated with all DLT information available, and single patients are treated at the best current estimate of the MTD. The original CRM ignited an explosion of methodological work; however, it is impractical in its unmodified form because escalation could skip dose levels. In particular, one would never treat the first patient accrued at a dose level determined by a prior, unless that were the starting dose. Goodman et al. (7) describe several modifications to the CRM, including the use of two-patient and three-patient cohorts rather than only single patients. This version of the CRM is practical and quite widely used, particularly with three-patient cohorts, and is defined as follows (a numerical sketch of the updating step is given at the end of this section):

1. The dose levels are chosen, and initial guesses, P0i, are made for PDLT(ξi), the probability of toxicity at the i-th dose level.
2. The dose-toxicity model is logit(PDLT(ξi)) = 3 + βξi, where logit(x) = log(x/(1 − x)), and β is the parameter to be updated.
3. ξi is defined by setting β = 1 and solving logit(P0i) = 3 + ξi. Thus, ξi is a number associated with the i-th dose level but is not an actual dose.
4. The prior for β is uniform(0,3) or exp(−β).
5. The posterior for β is computed in the normal way from the prior and the likelihood, which has a factor of PDLT(ξi)^y (1 − PDLT(ξi))^(1−y) for each patient, where y is 1 if the patient has DLT, and 0 otherwise. To choose the dose for the next patient, the posterior mean of β is computed using the DLT experience of all patients; this is used to update PDLT(ξi).
6. The study begins at the first dose level; escalation or de-escalation is to the dose with PDLT(ξi) closest to θ, but escalation may be by no more than one dose level.
7. The study stops when at least 18 subjects have been treated, and at least 6 of these patients have been treated at the MTD, which is defined to be the dose level with PDLT(ξi) closest to θ. (Other stopping rules were considered by Goodman et al. but are not discussed here.)

Goodman et al. (7) provide simulations that compare the CRM with a commonly used rule-based method (the standard method but without de-escalation). These show that the CRM will less frequently declare a very low dose to be the MTD and will usually select the target dose with greater probability, and that, using cohorts of three, the CRM will result in modestly less average toxicity; however, the CRM requires more subjects. In general, however, the differences between the two schemes are not striking. There are many other model-guided designs that build, at least in part, on the original continual reassessment method, and they share many of the strengths and weaknesses of the model of Goodman et al. In particular, the performance of these designs depends to varying degrees on the validity of underlying assumptions and the accuracy of the prior information. Although many investigators have demonstrated that the designs perform quite well under model misspecification when the initially specified dose levels bracket the true MTD (see, e.g., [8]), performance of the designs can be unacceptable when that criterion is not met. However, it is often the case in phase I studies that there is rather limited prior information about the true MTD, particularly in first-in-human studies, in which the dose levels initially selected may be far below the true MTD. In examples of real phase I studies described by Simon et al. (4), it was not unusual for a study to require 20 dose levels and for the MTD to be 1000 times the starting dose. To accommodate this situation, additional dose levels are often added as needed in a phase I study.
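As promised above, here is a numerical sketch of the updating step (steps 3, 5, and 6) in the Goodman et al. version of the CRM. It is an illustration only: the initial guesses, the target θ, the toy patient data, and the grid approximation of the posterior are my own choices, not values from the source.

```python
import numpy as np
from scipy.special import expit, logit

p0 = np.array([0.05, 0.10, 0.20, 0.30, 0.50, 0.70])  # initial guesses P0i (illustrative)
xi = logit(p0) - 3.0                                  # step 3: xi_i solves logit(P0i) = 3 + xi_i
theta = 0.25                                          # target DLT probability (illustrative)

# Toy data: dose-level index (0-based) and DLT indicator for each treated patient.
level = np.array([0, 0, 0, 1, 1, 1])
y = np.array([0, 0, 0, 0, 1, 0])

def posterior_mean_beta(level, y, grid=np.linspace(1e-4, 10.0, 4000)):
    """Posterior mean of beta under logit(PDLT) = 3 + beta*xi with prior exp(-beta),
    approximated on a grid (step 5)."""
    p = expit(3.0 + np.outer(grid, xi[level]))        # PDLT for each (beta, patient) pair
    loglik = (y * np.log(p) + (1 - y) * np.log(1 - p)).sum(axis=1)
    weights = np.exp(loglik - grid)                   # likelihood times the exp(-beta) prior
    return np.trapz(grid * weights, grid) / np.trapz(weights, grid)

beta_hat = posterior_mean_beta(level, y)
p_dlt = expit(3.0 + beta_hat * xi)                    # updated PDLT at every dose level
recommended = int(np.argmin(np.abs(p_dlt - theta)))   # level with PDLT closest to theta
print(beta_hat, np.round(p_dlt, 3), recommended)
# Step 6 would additionally cap escalation at one level above the current dose.
```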
Some model-guided designs cannot accommodate added levels, and in others, this procedure can make it impossible to elicit priors that have the requirements necessary for acceptable performance (1). The method of Goodman et al. is used to illustrate the latter problem. Assume, as is typical in descriptions of model-guided designs, that six dose levels had been initially selected, and that P0i = 0.7 for the sixth level. All added levels must have P0i > 0.7. In this model, as P0i → 0.95, the model loses flexibility because, when P0i = 0.95, ξi will equal 0, and thus updating β will have no effect on PDLT(ξi). As a consequence, the MTD can be found in the absence of any DLT; see, for example, Potter (1). Many model-guided designs share this problem, which is due in part to the inflexibility of the dose-toxicity model and in part to the fact that prior information is specified with a set of P0i. In summary, model-guided designs based on six dose levels and a fixed sample size will frequently be inappropriate, and, because these designs lack robustness, ad hoc addition of new dose levels with a concomitant increase in sample size is not guaranteed to result in a design with acceptable properties. One approach to this problem is to add dose levels and generate an entirely new set of P0i at some point during the study. Although this may have some appeal, it raises new issues (e.g., at what point should the new set be generated, and what criteria should be used to pick the P0i), and has a rather ad hoc flavor. This approach seems to have few advocates, and is not discussed further here. It is in first-in-human studies that a design that minimizes the number of patients treated at low doses has the greatest value; hence, it is ironic that many model-guided designs are inappropriate for this situation: they lack the requisite robustness of the rule-based designs. Robust two-stage model-guided designs address this problem in a straightforward manner.
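To make the loss of flexibility concrete, the following small check (Python; the β values are arbitrary) evaluates PDLT(ξi) = expit(3 + βξi) for initial guesses approaching 0.95. Because logit(0.95) is approximately 2.94, the corresponding ξi is essentially zero, so no amount of updating of β can move the fitted toxicity probability at such a level:

```python
from scipy.special import expit, logit

for p0 in (0.70, 0.90, 0.95):
    xi = logit(p0) - 3.0                     # xi_i in the Goodman et al. parameterization
    fitted = [round(expit(3.0 + b * xi), 3) for b in (0.5, 1.0, 2.0)]
    print(f"P0i = {p0}: xi = {xi:+.3f}, PDLT over beta in (0.5, 1, 2) = {fitted}")
# For P0i = 0.95 the fitted PDLT stays near 0.95 whatever beta is, so DLT data
# collected at such a level can barely change the model there.
```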
4 ROBUST TWO-STAGE MODEL-GUIDED DESIGNS

There are at least three robust two-stage designs, in which the first stage is rule-based and the second is model-guided, much in the spirit of the CRM. All use two-parameter logistic models of the form logit(PDLT(xi)) = α + βxi, where PDLT(xi) is the probability of DLT at dose or log(dose) xi, and α and β are parameters. Maximum likelihood estimates (MLEs) are used to determine the parameters. All also define the MTD in terms of a particular θ. The implicit assumption underlying these designs is that the MTD will generally be rather close to the dose at which the first DLT was observed; although model robustness does not depend on this assumption, performance does. In two of the three schemes—Wang and Faries (9), and Potter (10)—the dose at which DLT occurs in the first stage is used to provide information to initialize the second stage; these two schemes are very similar to the method introduced in a different context by Wu (11). Both α and β are estimated in these two schemes, and initialization is required to guarantee that the MLEs exist. In the third scheme—Storer (12)—β is fixed, and thus no initialization is required. The key differences between the robust two-stage designs and other model-guided designs are the following. (1) In the robust two-stage designs, initialization of the second stage, if required, is provided by the dose at which DLT is observed in the first stage. To avoid bias, initialization information is used as minimally as possible. In other model-guided designs, prior information, often in the form of guesses for P0i, must be elicited before beginning the study. (2) The dose-toxicity models in the robust two-stage designs are more flexible than those in other model-guided designs; they do not place implicit bounds on the MTD. Although these features of the designs are necessary for robustness, they are not sufficient: stopping rules must also be carefully crafted. However, in defining the three designs to be discussed here to be robust, I have not required any criteria on stopping rules.

4.1 The Wang and Faries Design

Wang and Faries (9) describe a design in which the first stage uses cohorts of two patients and ends when one or two DLTs are observed in the same cohort. The second stage
is based on the logistic dose-ranging method of Murphy and Hall (13) in which single patients are treated at the dose level closest to the MLE of the MTD. A two-parameter logistic model, logit(PDLT (xi )) = α + βxi , is used in this method, where xi is the true dose. To attempt to ensure that the logistic regression will converge and parameter estimates will exist, Murphy and Hall use ‘‘seed data’’ in the form of ‘‘pseudo-patients,’’ which are imaginary observations that are treated as real observations. In the recommended scheme, one pseudo-patient with DLT is always placed at the highest dose level; if the first stage stops with two DLTs, a second pseudo-patient without DLT is placed one level above that at which the two DLTs were observed. Oddly, if there is only one DLT, the seed data do not satisfy the requirements for the existence of MLEs for α and β (14, 15). The study stops after 20 patients are treated in the second stage. One might consider this to be a large number of patients to treat after the first DLT had been observed; however, if a smaller number were treated, the bias introduced by the seed data would be greater. The investigators noted that, as in the single-patient CRM, the next patient could be treated using available DLT information even if some patients on the study have not completed the evaluation period. Although Wang and Faries provide the results of simulations, comparisons are only with variations of their method. 4.2 The Storer Design Storer (12) discusses a two-stage design in which the first stage escalates dose levels using single patients until the first DLT is observed; the first stage then ends. The second stage treats cohorts of three patients, and uses the logistic model logit(PDLT (xi )) = α + βxi , in which β is fixed at 0.75, and xi is the log of the dose. The first cohort is treated at one dose level below that at which the first stage ended, and all other cohorts at the level closest to the MLE of the MTD, although dose-level skipping is not permitted. Storer also considers a variation of the second stage in which dose levels are defined but not used; patients are treated at the MLE of the MTD if treatment at that dose would not be ruled out by the prohibition of dose-level skipping.
In both designs, the study ends when 24 patients have been treated in the second stage. Storer shows that the performance of these methods is quite insensitive to the value of β over the range of 0.5 to 1.0. It is also quite insensitive to model misspecification, as is the CRM (8); performance was investigated with simulated data having values of β ranging from 0.5 to 3.0, and was generally very good for both methods. One is also free, of course, to adjust the value of β in the model on the basis of prior clinical or preclinical experience with the same or similar agents. Storer uses simulations to compare these designs with his two-stage rule-based method (16), which uses the first stage already described. The second stage, which uses three-patient cohorts, begins one dose level below that at which the first stage ends. The rules for this stage, which are based only on the DLT experience of each cohort, are the following: 0 DLT, escalate one level; 1 DLT, add another cohort at the current level; >1 DLT, de-escalate one level. The study ends when 24 patients have been treated in the second stage. As a refinement to all designs, he uses logistic regression to re-fit the data after the study has ended, and estimates both α and β. He then re-computes PDLT(xi) and defines the MTD to be the dose with PDLT = θ. Data are fit without adding pseudo-patients, and thus there is no guarantee that parameter estimates will exist. Re-fitting the data generally improves performance, and it also reduces differences among the methods. In fact, the differences between the two model-guided methods are negligible, and those between the model-guided methods and the rule-based methods are quite minor: there are some situations in which the bias of the MTD is smaller for the model-guided designs, and others in which the precision is better for the rule-based method. Comparisons are also made with the standard rule-based method; they show principally that Storer's methods result in a more precise determination of the MTD and a smaller fraction of patients treated at low doses.
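A sketch of the model-guided second stage that Storer describes, in Python. The fixed slope β = 0.75 on the log-dose scale comes from the text; the target θ, the toy data, and the function names are illustrative assumptions, and the bookkeeping for dose levels and the no-skipping rule is omitted.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit, logit

beta = 0.75       # fixed slope on the log(dose) scale, as in the text
theta = 1.0 / 3   # target DLT probability (illustrative)

# Toy data so far: administered dose and DLT indicator for each patient.
dose = np.array([10.0, 10.0, 10.0, 15.0, 15.0, 15.0])
y = np.array([0, 0, 0, 1, 0, 0])
x = np.log(dose)

def neg_loglik(alpha):
    """Negative log-likelihood in the intercept alone, with beta held fixed."""
    p = expit(alpha + beta * x)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

alpha_hat = minimize_scalar(neg_loglik, bounds=(-20.0, 20.0), method="bounded").x

# MTD estimate: the dose at which the fitted PDLT equals theta; the next
# three-patient cohort would be placed at the permitted level closest to it.
mtd_hat = np.exp((logit(theta) - alpha_hat) / beta)
print(round(float(mtd_hat), 1))
```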
4.3 The Potter Design

I introduced a two-stage method (10) that is similar in many respects to Storer's two-stage model-guided design and borrows elements of the design of Piantadosi et al. (17). The primary motivation for the design was to improve upon the rule-based schemes when clinical data are already available for the agent. In the first stage, patients are accrued in cohorts of three; cohorts of two would be more appropriate for first-in-human studies. Dose levels are escalated in 50% increments. The first stage ends at the first DLT, and the second stage begins. Cohorts of three patients are treated at the MLE for the MTD using the model logit(PDLT(xi)) = α + βxi, where xi is the log of the dose. The model is the same as Storer's, but both α and β are estimated. Monotonicity is assured by placing 20 pseudo-patients without DLT at a very low dose, and 20 with DLT at a very high dose. Convergence of the logistic regression is ensured by placing 10 pseudo-patients (1 with DLT, 9 without) at half the dose at which the first DLT occurred, and 10 pseudo-patients (9 with DLT, 1 without) at five times that dose. Each of these 20 pseudo-patients is given a weight of 0.1, and thus negligible bias is introduced unless the dose-toxicity curve is extremely flat. Dose levels are not used in the second stage, although straightforward modification would accommodate them. Thus, escalation/de-escalation factors depend on the number of DLTs observed, but in a way that the dose increments are generally smaller than in the first stage; this property allows the sequence of stage 2 doses to converge to the estimated MTD. For example, if no DLTs are observed in the first cohort, and one, two, or three DLTs are observed in the second, then the dose for the third cohort will be that for the second multiplied by a factor of 1.14, 0.82, or 0.73, respectively. (These factors remain essentially unchanged when no DLTs are observed in the first n cohorts, and one, two, or three DLTs are observed in cohort n+1.) The preferred stopping rule is that four DLTs be observed, at least 18 patients be treated, and at least nine patients be treated in the second stage.
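The pseudo-patient device just described is easy to sketch as a weighted maximum-likelihood fit. The following Python sketch shows only the convergence pseudo-patients with weight 0.1; the heavily displaced pseudo-patients used to enforce monotonicity, the 50% first-stage increments, and the stopping rule are omitted, and the target θ, the toy data, and the names are illustrative assumptions rather than the published implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit

theta = 0.25            # target DLT probability (illustrative)
first_dlt_dose = 20.0   # dose at which the first DLT was observed (illustrative)

# Toy real observations (weight 1): dose and DLT indicator.
dose = np.array([10.0, 10.0, 10.0, 15.0, 15.0, 15.0, 20.0, 20.0, 20.0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0])
w = np.ones_like(dose)

# Convergence pseudo-patients from the text: 10 at half the first-DLT dose
# (1 with DLT, 9 without) and 10 at five times that dose (9 with DLT, 1 without),
# each down-weighted to 0.1 so that they introduce negligible bias.
pseudo_dose = np.array([first_dlt_dose / 2] * 10 + [first_dlt_dose * 5] * 10)
pseudo_y = np.array([1] + [0] * 9 + [1] * 9 + [0])
pseudo_w = np.full(20, 0.1)

x = np.log(np.concatenate([dose, pseudo_dose]))
yy = np.concatenate([y, pseudo_y])
ww = np.concatenate([w, pseudo_w])

def neg_loglik(params):
    """Weighted negative log-likelihood for logit(PDLT) = alpha + beta*log(dose)."""
    alpha, beta = params
    p = np.clip(expit(alpha + beta * x), 1e-12, 1.0 - 1e-12)
    return -np.sum(ww * (yy * np.log(p) + (1 - yy) * np.log(1 - p)))

alpha_hat, beta_hat = minimize(neg_loglik, x0=[0.0, 1.0], method="Nelder-Mead").x

# The next three-patient cohort is treated at the MLE of the MTD.
next_dose = np.exp((logit(theta) - alpha_hat) / beta_hat)
print(round(float(next_dose), 1))
```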
Performance was investigated with simulated data having various combinations of the constant and slope parameters (1 < β < 6), and it was found to be generally good. When the slope parameter is small and the MTD is far from the starting dose, the method tends to underestimate the MTD. However, Storer's designs also have this property, as will any method that approaches the MTD from below. The bias introduced by the pseudo-patients was investigated by comparing with a design that placed pseudo-patients such that they introduced no bias; by this method, the bias was shown to be small. Simulations demonstrate that the method is more accurate than the standard rule-based scheme, treats a smaller fraction of patients at low doses, and is just as robust. Zamboni et al. (19) discuss a clinical study using this method.

4.4 Comparisons of the Three Designs

In comparing these three schemes, it is useful to consider three components: first-stage design, second-stage design, and stopping rules. It should be clear that the first stage of any one of these three methods could be combined with the second stage of another. It should also be clear that the stopping rules used in one design could be used in any other design. In the first stage, Wang and Faries use cohorts of two, and Potter uses cohorts of two or three, depending on whether the study is first-in-human or not; Storer uses single patients. In all three designs, the first stage ends when one or more DLTs are observed. Korn et al. (18) and many others feel that a dose-escalation scheme using single patients exposes the patients to excessive risk of severe toxicity, although this concern must surely depend on dose-level spacing. Simon et al. (4) address this issue at some level by requiring the first stage to terminate when toxicities less severe than DLTs are observed. However, use of this first stage in the two-stage model-guided designs would require a different approach to initialization for the two designs that require it (e.g., adding a rule-based stage between the first and second stages that used cohorts of two or three, and terminated at the first DLT). There are many subtle but important differences among the second stage designs of
the three methods. The logit model used by Wang and Faries is linear in dose; that used by Storer and Potter is linear in log(dose), and thus scale-independent. This difference is critical for Storer’s design because β is fixed. The scheme of Wang and Faries uses single patients; those of Storer and Potter use three-patient cohorts. Although single patients can be useful in the early stages of a study, larger cohorts are safer and, for a fixed sample size and a fixed evaluation period, allow a study to be completed earlier (7). Thus, it is unclear that there would be an advantage to a single-patient design in the second stage. The initialization method used by Wang and Faries introduces some bias, and also appears to have a technical problem: the MLEs for α and β may not exist. An interesting difference between the schemes of Storer and Potter is that β is fixed in the former and estimated in the latter. If the fixed value selected by Storer were wildly wrong, then the scheme of Potter would presumably perform better. On the other hand, by fixing β, Storer eliminates the need for pseudo-patients and the resulting potential for bias (which, however, is shown to be small). No head-to-head comparison between these two schemes has been made, but it seems likely that differences in performance would generally be small if the comparison were made using both the same first stage and the same stopping rules. Storer fits the data after the study has ended in order to refine the estimate of the MTD; however, MLEs for α and β are not guaranteed to exist. This problem could be addressed through the use of pseudo-patients; alternatively, isotonic regression could be used (20, 21). The schemes of Wang and Faries and of Potter can be easily implemented with standard logistic regression software packages; that of Storer requires some straightforward coding. Although stopping rules are critically important for accuracy and robustness, they are generally given short shrift. Wang and Faries, and Storer use large fixed sample sizes in the second stage: 20 in the first case and 24 in the second. These sample sizes are substantially larger than the number of patients that would typically be accrued in the standard design after observation of the
first DLT, and, for that reason, there would likely be resistance to accepting these two designs. Potter uses both a minimum total sample size (18 patients) and a minimum stage 2 sample size (9 patients), together with a minimum number of DLTs (4). There are two reasons for including the last requirement. First, the accuracy of the MTD will depend less on the sample size than on the observed number of DLTs if most patients do not have DLT. Second, robustness is not guaranteed by large sample size. A toxicity is defined as an adverse event that is possibly, probably, or definitely related to treatment. Thus, attribution is somewhat subjective, especially in first-in-human clinical studies. Some DLTs will be false positives: adverse events not related to treatment. If a false-positive DLT were observed far below the true MTD, and if the stopping rule were based only on sample size, an MTD could be declared in the absence of any true treatment-related DLTs. The same sort of problem could arise if an outlying treatment-related DLT were observed far below the MTD.
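The stopping rule just discussed for the Potter design reduces to a simple predicate; a minimal sketch in Python (the function name is mine):

```python
def potter_stop(n_total: int, n_stage2: int, n_dlt: int) -> bool:
    """Preferred stopping rule described in the text: stop once at least 4 DLTs
    have been observed, at least 18 patients have been treated in total, and at
    least 9 of them have been treated in the second stage."""
    return n_dlt >= 4 and n_total >= 18 and n_stage2 >= 9
```

Tying the rule to a minimum number of DLTs, rather than to sample size alone, is what guards against declaring an MTD on the strength of one or two false-positive DLTs.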
5 CONCLUSIONS
The standard rule-based dose-escalation design has several limitations. Among these are that it defines a MTD that does not correspond to a predefined PDLT and treats a large fraction of the patients at low, presumably ineffective doses. Model-guided designs, such as the continual reassessment method, attempt to address these and other limitations. Such schemes work well if the initially specified range of doses to be investigated brackets the true MTD; however, if this condition is not met and additional doses must be added as the study progresses, these schemes can find an MTD in the absence of the observation of any DLT. In first-in-human studies, the latter situation is a particular concern. Robust two-stage model-guided designs use a rule-based first stage that ends upon observation of the first DLT, and a second stage that uses a flexible model that is initialized with the available DLT data. These designs have the robustness of rule-based designs and the advantages of the model-guided designs.
This work has compared the designs of three robust two-stage model-guided dose-escalation schemes. Implicit in the comparison is the notion that the first-stage designs, second-stage designs, and stopping rules can be mixed and matched to generate a new design that combines features of the original three designs. For example, Storer's first-stage and/or stopping rules may be unattractive to many, but the second stage may be appealing; if it were combined with a first stage using cohorts of two and a stopping rule similar to Potter's, this new design would be practical and robust. In fact, this combination design and Potter's design using the same first stage can be expected to perform similarly and are probably the most practical of the robust two-stage model-guided designs.

5.0.1 Acknowledgments. It is a pleasure to acknowledge both the careful reading and many valuable suggestions offered by Jim Schlesselman. This work was supported in part by a grant from the National Cancer Institute (U01CA099168). A portion of the material here overlaps with that found in "Phase I Studies of Chemotherapeutic Agents in Cancer Patients: A Review of the Designs," by Douglas M. Potter, copyright 2006. Reproduced by permission of Taylor & Francis Group, L.L.C., http://www.taylorandfrancis.com.

REFERENCES

1. D. M. Potter, Phase I studies of chemotherapeutic agents in cancer patients: a review of the designs. J Biopharm Stat. 2006; 16: 579–604.
2. P. V. Avecedo, D. L. Toppmeyer, and E. H. Rubin, Phase I trial design and methodology for anticancer drugs. In: B. A. Teicher and P. A. Andrews (eds.), Anticancer Drug Development Guide: Preclinical Screening, Clinical Trials, and Approval. Totowa, NJ: Humana Press, 2004, pp. 351–362.
3. H. Boxenbaum and C. DiLea, First-time-in-human dose selection: allometric thoughts and perspectives. J Clin Pharmacol. 1995; 35: 957–966.
4. R. Simon, B. Freidlin, L. Rubinstein, S. G. Arbuck, J. Collins, and M. C. Christian, Accelerated titration designs for phase I clinical trials in oncology. J Natl Cancer Inst. 1997; 89: 1138–1147.
5. E. A. Eisenhauer, C. Twelves, and M. Buyse, Phase I Cancer Clinical Trials: A Practical Guide. Oxford University Press, New York, 2006.
6. J. O'Quigley, M. Pepe, and L. Fisher, Continual reassessment method: a practical design for phase I clinical trials in cancer. Biometrics. 1990; 46: 33–48.
7. S. N. Goodman, M. L. Zahurak, and S. Piantadosi, Some practical improvements in the continual reassessment method for phase I studies. Stat Med. 1995; 14: 1149–1161.
8. L. Z. Shen and J. O'Quigley, Consistency of continual reassessment method under model misspecification. Biometrika. 1996; 83: 395–405.
9. O. Wang and D. E. Faries, A two-stage dose selection strategy in phase I trials with wide dose ranges. J Biopharm Stat. 2000; 10: 319–333.
10. D. M. Potter, Adaptive dose finding for phase I clinical trials of drugs used for chemotherapy of cancer. Stat Med. 2002; 21: 1805–1823.
11. C. F. J. Wu, Efficient sequential designs with binary data. J Am Stat Assoc. 1985; 80: 974–984.
12. B. E. Storer, An evaluation of phase I clinical trial designs in the continuous dose-response setting. Stat Med. 2001; 20: 2399–2408.
13. J. R. Murphy and D. L. Hall, A logistic dose-ranging method for phase I clinical trials. J Biopharm Stat. 1997; 7: 635–647.
14. A. Albert and J. A. Anderson, On the existence of maximum likelihood estimates in logistic regression. Biometrika. 1984; 71: 1–10.
15. T. J. Santner and D. E. Duffy, A note on A. Albert and J. A. Anderson's conditions for the existence of maximum likelihood estimates in logistic regression models. Biometrika. 1986; 73: 755–758.
16. B. E. Storer, Design and analysis of phase I clinical trials. Biometrics. 1989; 45: 925–937.
17. S. Piantadosi, J. D. Fisher, and S. Grossman, Practical implementation of a modified continual reassessment method for dose-finding trials. Cancer Chemother Pharmacol. 1998; 41: 429–436.
18. E. L. Korn, D. Midthune, T. T. Chen, L. V. Rubinstein, M. C. Christian, and R. M. Simon, A comparison of two phase I trial designs. Stat Med. 1994; 13: 1799–1806.
19. W. C. Zamboni, L. L. Jung, M. J. Egorin, D. M. Potter, D. M. Friedland, et al., Phase I and pharmacologic study of intermittently administered 9-nitrocamptothecin in patients with advanced solid tumors. Clin Cancer Res. 2004; 10: 5058–5064.
20. A. Ivanova, A. Montazer-Haghighi, S. G. Mohanty, and S. D. Durham, Improved up-and-down designs for phase I trials. Stat Med. 2003; 22: 69–82.
21. R. K. Paul, W. F. Rosenberger, and N. Flournoy, Quantile estimation following non-parametric phase I clinical trials with ordinal response. Stat Med. 2004; 23: 2483–2495.
CROSS-REFERENCES

Continual reassessment method (CRM)
Dose escalation design
Maximum tolerable dose
Phase I trials
RUN-IN PERIOD
VANCE W. BERGER
Biometry Research Group, National Cancer Institute
Bethesda, Maryland

VALERIE L. DURKALSKI
Department of Biostatistics, Bioinformatics & Epidemiology, Medical University of South Carolina
Charleston, South Carolina

A run-in is a period before randomization during which patients who have met all entry criteria for the clinical trial are assigned the same regimen. In some such trials, all patients receive placebo during the run-in period, whereas in other trials all patients receive the active intervention during the run-in period. Often, though not always, participants become eligible for randomization pending their set of responses during this period. This enrichment design strategy is used to select study participants who are either more likely to respond to the active intervention or less likely to respond to placebo (or both). Consequently, this strategy increases the chance to detect differences if they truly exist, and, as we shall see, even if they do not. Run-in periods fall under a number of aliases in the literature including acute phase, open qualification phase, maintenance trial, washout period, withdrawal trial, discontinuation design, and enrichment design (1, 2). These names are not necessarily interchangeable; however, the overall goal of these designs is to choose a study population that is intended to receive the intervention in a standard clinical setting and is most likely to respond to the experimental intervention. We will review the various types of run-in periods, their objectives, and their impact on clinical trial results.

1 TYPES OF RUN-IN PERIODS

Investigators may want to reduce the rate of nonadherence (not taking the assigned intervention as prescribed) in the study population, establish baseline characteristics, or detect placebo/therapeutic responders and nonresponders. Each of these reasons has the underlying goal of enhancing participant selection. Thus, run-in periods can be considered extensions of the predefined study eligibility criteria. Often, if a specific response is not seen during the run-in period, the participant is not eligible for randomization. The type of run-in used depends on the desired study population. We will introduce the two primary types of run-in periods and their various roles in participant selection.

1.1 Placebo Run-In

A placebo run-in period is characterized by enrolling eligible participants into a prerandomization phase of the study and assessing their response to placebo. This phase is used to identify participants who are unlikely to follow the assigned treatment regimen (nonadherers), to identify those who have some sort of response to the placebo (placebo responders), or to take participants off current medications (washout).

1.2 Active Run-In

An active run-in period exposes participants to the experimental intervention, either in a single-masked or open-label fashion, before randomization to assess feasibility and/or preliminary safety. Participants who show a positive response (possibly defined by the lack of an adverse event or tolerability of the experimental intervention, stability of disease, disease reduction, or compliance with the study protocol) are then eligible for randomization.

2 EXAMPLES OF TRIALS WITH RUN-IN PERIODS

Many randomized trials have used the type of run-in period we consider in this article. In this section we will discuss a few such trials.

2.1 Carvedilol

Pablos-Mendez et al. (3) discussed the run-in period used in two carvedilol trials. Specifically, the Australian–New Zealand Collabo-
rative Group Study (4) entered 442 patients with chronic heart failure on a 2- to 3-week run-in period with carvedilol. The active run-in was used to identify eligible patients based on their response to the trial drug. Those with a positive response and no serious adverse effects were randomized to either placebo or carvedilol. Twenty-seven (6%) of the patients who had entered the run-in period were withdrawn for various reasons, including worsening heart failure (eight), hypotension (eight), and death (three). The final study analyses, based on only the 415 patients who were randomized (207 to carvedilol, 208 to placebo), concluded that 14.5% of the patients randomized to carvedilol experienced serious adverse effects that led to study withdrawal versus 6% in the group randomized to placebo. The treatment group event rate would have increased to 24.4% if the run-in group was included; however, it was known that carvedilol has a negative effect on these patients at the initiation of therapy. The U.S. Carvedilol Heart Failure Study Group (5) similarly excluded 103 of 1197 eligible patients based on a run-in period in which seven patients died while taking carvedilol and another 17 had worsening heart failure. Packer, Cohn, and Colucci (6) stated that even including these 24 patients in the analysis would retain the statistically significant (at the 0.05 level) mortality benefit of carvedilol over placebo. Of course, these were the 24 most extreme cases, and a full 103 were excluded. Moreover, as we shall see in Section 4.2, withdrawal effects may have also artificially lowered this p-value. In addition, Moye (7) pointed out that mortality was not pre-specified as the primary endpoint anyway, so there are other issues involved in taking the p-value at face value. Although the study design and analyses are valid, the caveat in the results for these studies is that, without a placebo control in the run-in period, it is difficult to fully interpret the serious adverse effects caused by carvedilol, and one is left with the question of whether the natural progression of the disease or the active drug caused the events in the run-in period.
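The 24.4% figure for the Australian–New Zealand study can be reconstructed from the counts reported above; a small check in Python (the count of about 30 on-treatment withdrawals is implied by 14.5% of the 207 patients randomized to carvedilol, and is an inference on my part rather than a number stated in the source):

```python
randomized_to_carvedilol = 207
withdrawn_during_run_in = 27                                        # run-in withdrawals reported above
withdrawn_on_carvedilol = round(0.145 * randomized_to_carvedilol)   # about 30 patients

rate = (withdrawn_on_carvedilol + withdrawn_during_run_in) / (
    randomized_to_carvedilol + withdrawn_during_run_in
)
print(round(100 * rate, 1))   # 24.4
```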
2.2 CAST

The Cardiac Arrhythmia Suppression Trial (CAST) (8) incorporated an active run-in period in its evaluation of the effect of antiarrhythmic therapy (encainide, flecainide, or moricizine) in patients with asymptomatic or mildly symptomatic ventricular arrhythmia after myocardial infarction. The run-in period consisted of titration of one of the three antiarrhythmic drugs. Once titrated to the clinically appropriate dose, patients were randomized based on their suppression of ventricular ectopy. If suppression was attained during the active run-in period, the patients were eligible for randomization to either placebo or active drug. Of the 2309 screened patients, 1727 patients who were able to suppress their arrhythmias were randomized. At an interim analysis, randomized patients treated with active drug had a higher rate of death from arrhythmia than the patients assigned to placebo; thus, the study was stopped. The run-in failures, those unable to suppress while taking active medication, were not included in the analysis. However, theoretically, if they had been included in the analysis, the death rate may have been similar to what was found in the randomized patients alone.

2.3 Physician's Health Study

A single-masked run-in period was used to exclude participants unwilling or unable to comply with the study regimen in the U.S. Physician's Health Study (9). The primary study objective was to test the effects of aspirin and beta carotene on the prevention of ischemic heart disease and cancer among male physicians. Compliance with study medication was found to be an issue in pilot studies. Therefore, the study design incorporated an 18-week run-in period in which eligible men were given open-label aspirin and beta-carotene placebo. Of 33,223 run-in participants, only 22,071 were randomized; 11,037 to aspirin and 11,034 to a placebo for aspirin. Among the randomized men, a reduction was detected in the rate of myocardial infarction in the group randomized to aspirin. The reported adherence rate among this study population was 90%. The similar British Physician's Health Study
did not include a run-in period and failed to detect a reduction in myocardial infarction. The adherence rate in that study population was 70%. 2.4 Pindolol Perez et al. (10) reported on a trial of pindolol versus placebo, both in conjunction with fluoxetine antidepressant therapy. The run-in period was meant to exclude placebo responders, and in fact 19 out of 132 patients did respond to placebo. There were two other exclusions, for a total of 21, and the remaining 111 patients were randomized, 56 to placebo and 55 to pindolol. The analysis based on the 111 patients found a 75% response rate in the pindolol group versus 59% in the placebo group (P = 0.04). However, if the total 132 patients had been included in the analysis, the statistical significance would have been reduced to 0.09 (78% versus 65%, respectively). 3
OBJECTIVES OF RUN-IN PERIODS
The exclusion of participants from the randomized phase of a study (and essentially from the analysis) is driven by a variety of motivations, which are generally distinct from the motivations for the run-in period itself. The run-in period, with or without any associated patient selection, allows for careful patient evaluation, washing out of prior treatments, and optimization of dosing. The goal of the patient selection is to create an advantage for the active treatment group. Specifically, participants who do not tolerate or adhere to the active intervention during a study and participants who respond to the placebo can dilute the intervention effect. To decrease this dilution effect and, in turn, to increase statistical power, investigators screen for these types of patients with the appropriate type of run-in period before randomization. Generally, with an active run-in, those patients who do not tolerate or adhere to the active treatment according to a prespecified criterion (i.e., at least 80% adherent to taking study medication) are excluded from subsequent randomization. Likewise, participants who respond to placebo during
a placebo run-in are also excluded from the subsequent randomized phase of the study. Similar to the placebo run-in, the active run-in identifies patients who are more likely to adhere and respond to the experimental intervention. Selecting only such patients increases the power of the study. If participants respond (in some way, possibly by tolerating the treatment) to the experimental treatment during this early phase, then it is assumed that they are more likely to respond (in possibly a different way, perhaps involving efficacy) during the subsequent randomized active study phase. Additionally, if they remain compliant during the run-in period, then it is expected that they will be compliant during the subsequent randomized active phase of the trial. Washout periods are slightly different from the above placebo run-in periods because they do not necessarily require a placebo. Instead, otherwise eligible participants are slowly taken off current medications (before randomization), especially when these medications are believed to interfere with measuring the treatment effect. Participants who can be taken off the specific medications without side effects are eligible for randomization. As in the previously described run-in periods, washout periods attempt to reduce potential noise that may interfere with the assessment of the effect of a new experimental intervention. They attempt to rule out the effects of recent exposures to other medications and establish baseline measurements, particularly with respect to safety data. 4
IMPACT OF RUN-IN PERIODS
There are several (presumably unintended) consequences of run-in periods, especially with regard to patient selection based on run-in periods. These include unmasking, withdrawal effects, ethical concerns, selection biases that can compromise both internal and external validity, and a lack of efficiency. We will discuss each of these concerns in this section. 4.1 Unmasking Leber and Davis (1) discuss how an otherwise masked randomized trial can become
unmasked by virtue of the run-in period. Specifically, when each patient experiences the active treatment, the side effects and overall responses can be observed. It is far easier to discern similarity to or difference from a known set of responses than it is to classify a set of responses as responses to the active treatment or to the control. So unmasking is one way that run-ins, with or without enrichment, can compromise the internal validity of a randomized trial. One precautionary step to mitigate the effects of unmasking is to use masked assessors—that is, the investigator who measures the outcome can differ from the one who treats the patient. 4.2 Withdrawal Effects Exposure to the experimental intervention can induce a dependency on this intervention. If so, then the run-in period would, even without any enrichment, interfere with the comparison of treatment A to treatment B. This is because the pretreatment with treatment A would mean that the sequence AA is being compared with AB. If the patients receiving AA fare better than those receiving AB, then one would want to attribute this to the superiority of A to B, and infer that A is more efficacious. But a competing explanation is that BB is as good as, or even better than, AA, in which case the poor outcomes of the patients treated with AB are due not to B but rather to withdrawal effects from A. Leber and Davis (1) reviewed a study assessing tacrine as a treatment for dementia to illustrate this point. The study had a complex design that included a randomized dose titration phase followed by a 2-week washout period followed by randomization of only the responders to placebo or tacrine. Overall concerns with the study design included both unmasking and withdrawal effects. Carryover and withdrawal effects from the active intervention can be reduced by implementing a washout period before randomization, to allow the participant to establish a baseline of disease status before randomization. The difficulty is in determining the duration of washout, particularly when the disease status of participants in the run-in period is improving while on active intervention. Greenhouse et al. (11) also discussed withdrawal effects.
4.3 Ethical Concerns Senn (12) expressed an ethical concern over tricking patients by not informing them that they will all receive placebo (as would be the case in any trial with a placebo runin period). One obvious way around this is to inform patients of the use of this runin period; however, doing so might defeat the purpose, as the placebo effect seems strongest when patients do not know they are taking the placebo. 4.4 Selection Bias There is a key distinction between classic enrichment based on patient characteristics (13) and enrichment based on a run-in phase. Whereas enrichment based on patient characteristics is desirable because it can be mimicked in clinical practice, enrichment based on a run-in cannot be mimicked in clinical practice. That is, classic enrichment might be based on readily ascertained criteria such as gender, age, or prior illnesses. If a certain identifiable class is excluded from the randomized trial, then this same identifiable class can be excluded from clinical practice as a contraindication. On the contrary, if one needs to risk adverse events by being exposed to the treatment before determining whether one should have even taken the treatment in the first place, then this would have to be the case in clinical practice as well. The enrichment goal of minimizing exposure to a potentially harmful treatment cannot be met if one needs to take this treatment to find out if one should or should not take the treatment. In other words, the excluded segment is not identifiable, and becomes so only with the administration of the treatment itself. For this reason, the most useful set of experiences to consider when deciding to take a treatment is either the set of all experiences of all patients taking this treatment or the experiences of those patients most like you. But without knowing how you would respond to this treatment (and before taking it, how would you know?), it is not possible to define similarity based on response to the treatment. Exclusion from the analysis of otherwise eligible participants who were in fact treated with the experimental
treatment can badly distort the results, and overstate the safety and efficacy. Clearly, this is an issue of external validity, but internal validity is also an issue in that there is no longer a fair comparison between treatments (11, 14). That is, one preselects a population not sufficiently harmed by treatment A (in the case of an active run-in), or with an outcome not significantly improved when treated with placebo (in the case of a placebo run-in). Then the comparison is between placebo and treatment A among the remaining patients, already shown to tolerate treatment A and/or to not respond to placebo. Internal validity may (in the absence of other biases) be present for this select population only, as internal validity is generally defined in terms of the ability to believe the results as they apply to only the sample in question. Yet the spirit of internal validity relates to a fair treatment comparison. If the study is randomized, it is argued, then this ensures a fair treatment comparison that must apply to the sample on which the comparison is based. The run-in enrichment does not by itself interfere with the letter of internal validity (without other biases to cause systematic differences in the treatment groups, the treatment groups may well be comparable). But the spirit of internal validity is violated because of the asymmetric handling of the treatment groups in defining the study samples. Each treatment may be better than the other for a certain segment of the population, and the treatments are equally effective if these two segments are the same size (contain the same number of patients). Comparing treatments, then, involves comparing the relative sizes of these treatment-favored patient segments. Excluding one of these segments from consideration disables this comparison and allows only for the much less informative question of whether the opposite segment exists. The distortion of the results is in a predictable direction. Indeed, if intervention A is given during the run-in period, then it is plausible to believe that the participants who are eligible for randomization (based on their response to A) are more likely to be positive responders during the randomization phase if they are randomly assigned to the same intervention (A). In fact, if all responses are durable so
that response during the run-in necessarily implies another response during the randomized phase, then the best statement that can be made is the tautology ‘‘those patients who respond to A respond to A.’’ It may be plausible that the participants excluded from randomization based on their nonresponse to A may have responded to B even though they did not respond to A. Likewise, participants excluded because they did respond to B during the run-in may not have responded to A. The exclusion of either type of participant leads to selection bias and taints the internal and external validity of the study. Because of this bias, the study may mistakenly identify a difference between two treatments even if it does not exist. That is, the gain in power is affected by creating pseudo-power. 4.5 Lack of Efficiency A meta-analysis performed on 42 published randomized controlled trials on selective serotonin reuptake inhibitor (SSRIs) antidepressants found a lack of a statistically significant difference between effect sizes reported from studies that used the placebo-run to eliminate responders versus studies that did not discontinue responders (15). Previous studies (16, 17) have explored the utility of placebo run-ins and have shown that run-in periods are not always beneficial to trial conduct. The resources required to conduct a run-in phase may outweigh the benefits. Designs that seek to eliminate a specific subgroup of a population before randomization need to consider the anticipated effect of excluding versus including this subgroup on both enrollment rates and the measurement of treatment response to avoid inefficiencies in study conduct. Of course, one could argue that a lack of efficiency of a run-in phase translates into a lack of any bias as well. This would be spurious reasoning because studies using run-in periods presumably differ in other important ways from studies not using run-in periods. One does not, on the basis of observing poorer survival times for cancer patients than for healthy subjects, infer that all cancer therapy is ineffective. Instead, we recognize that the cancer, not the cancer treatment, is likely the cause of this finding. Likewise,
if the estimated magnitude of the treatment effect appears similar whether or not a run-in phase was used, then it is just as likely that those studies using the run-in periods did so to compensate for a perceived lack of a true treatment effect. As long as a good response during the run-in predicts a good response during the randomized phase, the run-in selection will confer an advantage—possibly an unfair one, empirical suggestions to the contrary notwithstanding. But again, this advantage may be more than offset by the reduction in sample size.
5 SUGGESTIONS AND CONCLUSIONS
The threat to internal and external validity is present when patients are selected on the basis of a run-in period. The belief that randomization by itself confers validity is patently false, given all the biases that can occur even in randomized trials. For example, without allocation concealment, investigators can recruit healthier patients to one treatment arm than to another (18), or a lack of masking may cause the patients in one group to be treated better (and/or have more favorable scoring of subjective endpoints) than the patients in the other group. This being the case, more attention needs to be paid to threats to validity in randomized trials. Even a run-in phase alone, without any associated patient selection, can badly bias the results of the subsequent randomized study. We have discussed two ways in which this can occur, specifically through unmasking and withdrawal effects. In fact, there is a third way: for treatments that require doses to be tailored to the specific patient, an active run-in phase may be used to optimize dosing. Such was the case with the CAST Trial (8). The experimental process of finding the right dose would need to be repeated for each patient in clinical practice because optimality of a treatment is here defined for the specific patient, not in some overall sense. As such, the experiences of the trial participants during this early phase of dose finding are relevant to future patients who themselves would need to go through this exercise. Yet these early experiences would
be systematically excluded from the analyses, thereby creating a clear bias in favor of the active treatment. For this reason, and also because of possible unmasking and withdrawal effects, the use of a run-in period needs careful consideration. As already noted, randomization by itself does not confer validity: without allocation concealment, investigators can recruit healthier patients to one treatment arm than to another (18), and a lack of masking may lead to the patients in one group being treated better (and/or receiving more favorable scoring of subjective endpoints) than the patients in the other group. Using a run-in period to select the patients to be randomized can be an effort to confuse the best-case results with the typical experiences of an unselected patient population. A strong justification would be needed to overcome the biases inherent in using a run-in period, and it is hard to imagine a legitimate justification for combining a run-in period with patient selection based on that run-in period in a pivotal trial. It may be that this type of design is useful as an intermediate step in the approval process: if a treatment fails even in this most favorable scenario, then it certainly cannot be expected to work in the real world. But the appearance of proof of safety and efficacy in this context cannot be taken as definitive; rather, it would need to be verified with more robustly designed trials. This statement has implications for both future trials and the interpretation of trials already conducted. Researchers engaged in clinical trials should think carefully about the use of run-in periods and should cease the flawed practice of run-in enrichment. Conclusions drawn from these trials should be based on the experiences of all eligible patients, possibly by imputing treatment failures for the excluded cohort of an active run-in or treatment responses for
the excluded cohort of a placebo run-in. Sensitivity analyses might be useful in this regard. Meta-analysts should find some way to give these trials less weight in the data synthesis.

ACKNOWLEDGMENT

The authors thank the anonymous reviewers for offering helpful comments.

REFERENCES

1. P. D. Leber and C. S. Davis, Threats to the validity of clinical trials employing enrichment strategies for sample selection. Control Clin Trials. 1998; 19: 178–187.
2. F. M. Quitkin and J. B. Rabkin, Methodological problems in studies of depressive disorder—utility of the discontinuation design. J Clin Psychopharmacol. 1981; 1: 283–288.
3. A. Pablos-Mendez, B. Graham, and S. Shea, Run-in periods in randomized trials: implications for the application of results in clinical practice. JAMA. 1998; 279: 222–225.
4. Australia-New Zealand Heart Failure Research Collaborative Group. Congestive heart failure: effects of carvedilol, a vasodilator–beta-blocker, in patients with congestive heart failure due to ischemic heart disease. Circulation. 1995; 92: 212–218.
5. M. Packer, M. R. Bristow, J. N. Cohn, W. S. Colucci, M. B. Fowler, et al., The effect of carvedilol on morbidity and mortality in patients with chronic heart failure. N Engl J Med. 1996; 334: 1349–1355.
6. M. Packer, J. N. Cohn, and W. S. Colucci, Reply. N Engl J Med. 1996; 335: 1319.
7. L. A. Moye, End-point interpretation in clinical trials: the case for discipline. Control Clin Trials. 1999; 20: 40–49.
8. CAST Investigators. Preliminary report: effect of encainide and flecainide on mortality in a randomized trial of arrhythmia suppression after myocardial infarction. N Engl J Med. 1989; 321: 406–412.
9. Steering Committee of the Physicians' Health Study Research Group. Preliminary report: findings from the aspirin component of the ongoing Physicians' Health Study. N Engl J Med. 1988; 318: 262–264.
10. V. Perez, I. Gilaberte, D. Faries, E. Alvarez, and F. Artigas, Randomized, double-blind, placebo-controlled trial of pindolol in combination with fluoxetine antidepressant medication. Lancet. 1997; 349: 1594–1597.
11. J. B. Greenhouse, D. Stangl, D. Kupfer, and R. Prien, Methodologic issues in maintenance therapy clinical trials. Arch Gen Psychiatry. 1991; 48: 313–318.
12. S. Senn, Statistical Issues in Drug Development. Chichester, UK: Wiley, 1997.
13. R. J. Temple, Enrichment designs: efficiency in development of cancer treatments. J Clin Oncol. 2005; 23: 4838–4839.
14. V. W. Berger, A. Rezvani, and V. A. Makarewicz, Direct effect on validity of response run-in selection in clinical trials. Control Clin Trials. 2003; 24: 156–166.
15. S. Lee, J. R. Walker, L. Jakul, and K. Sexton, Does elimination of placebo responders in a placebo run-in increase the treatment effect in randomized clinical trials? Depress Anxiety. 2004; 19: 10–19.
16. E. Brittain and J. Wittes, The run-in period in clinical trials: the effect of misclassification on efficiency. Control Clin Trials. 1990; 11: 327–338.
17. K. B. Schechtman and M. E. Gordon, A comprehensive algorithm for determining whether a run-in strategy will be a cost-effective design modification in a randomized clinical trial. Stat Med. 1993; 12: 111–128.
18. V. W. Berger, Selection Bias and Covariate Imbalances in Randomized Clinical Trials. Chichester, UK: Wiley, 2005.
FURTHER READING

J. B. Greenhouse and M. M. Meyer, A note on randomization and selection bias in maintenance therapy clinical trials. Psychopharmacol Bull. 1991; 27: 225–229.
P. Knipschild, P. Leffers, and A. R. Feinstein, The qualification period. J Clin Epidemiol. 1991; 6: 461–464.
J. M. Lang, The effect of a run-in on the generalizability of the results of the Physicians' Health Study. Am J Epidemiol. 1987; 126: 777.
J. M. Lang, J. E. Buring, B. Rosner, N. Cook, and C. H. Hennekens, Estimating the effect of the run-in on the power of the Physicians' Health Study. Stat Med. 1991; 10: 1585–1593.
S. Piantadosi, Clinical Trials: A Methodologic Perspective, 2nd ed. New York: Wiley, 2005.
F. Reimherr, M. Ward, and W. Byerley, The introductory placebo washout: a retrospective evaluation. Psychiatry Res. 1989; 30: 191–199.
CROSS-REFERENCES

Enrichment designs
External/Internal validity
Selection bias
Target population
Washout period
SAFETY INFORMATION

The sponsor is responsible for the ongoing safety evaluation of the investigational product(s). The sponsor should promptly notify all concerned investigator(s)/institution(s) and the regulatory authority(ies) of findings that could adversely affect the safety of subjects, impact the conduct of the trial, or alter the IRB's (Institutional Review Board)/IEC's (Independent Ethics Committee) approval/favorable opinion to continue the trial.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
SAMPLE SIZE CALCULATION FOR COMPARING MEANS
HANSHENG WANG
Guanghua School of Management, Peking University, Department of Business Statistics & Econometrics, Beijing, P. R. China

SHEIN-CHUNG CHOW
Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, North Carolina, USA

1 INTRODUCTION

Sample size calculation plays an important role in clinical research. In clinical research, a sufficient number of patients is necessary to ensure the validity and the success of an intended trial. From a statistical point of view, if a clinically meaningful difference between a study treatment and a control truly exists, such a difference can always be detected with an arbitrary power as long as the sample size is large enough. However, from the sponsor's point of view, it is not cost-effective to have an arbitrary sample size because of limited resources within a given time frame. As a result, the objective of sample size calculation in clinical research is to obtain the minimum sample size needed for achieving a desired power for detecting a clinically meaningful difference at a given level of significance. For good clinical practice, it is suggested that sample size calculation/justification be included in the study protocol before a clinical trial is conducted (1).

In practice, the objective of a clinical trial can be classified into three categories: testing treatment effect, establishing equivalence/noninferiority, and demonstrating superiority. More specifically, a clinical trial could be conducted to evaluate the treatment effect of a study drug, it could be conducted to establish therapeutic equivalence/noninferiority of the study drug as compared with an active control agent currently available in the marketplace, or it could be conducted to demonstrate the superiority of the study drug over a standard therapy or an active control agent. Treatment effect, therapeutic equivalence/noninferiority, or superiority is usually tested in terms of some primary study endpoint, which could be either a continuous variable (e.g., blood pressure or bone density) or a discrete variable (e.g., binary response).

In a parallel design, patients are randomly assigned to one of several prespecified treatment groups in a double-blind manner. The merit of the parallel design is that it is relatively easy to conduct. In addition, it can be completed in a relatively short period of time as compared with a crossover design. The analysis of variance (ANOVA) model is usually considered for the analysis of the collected clinical data. In a crossover design, each patient is randomly assigned to a treatment sequence. Within each sequence, one treatment (e.g., treatment or control) is applied first. After a sufficient washout period, the patient is crossed over to receive the other treatment. The major advantage of the crossover design is that each patient can serve as his or her own control. For a fixed sample size, a valid crossover design usually provides higher statistical efficiency than a parallel design. However, the crossover design also has a drawback: a potential carryover effect may contaminate the treatment effect. For more details regarding the comparison between a parallel design and a crossover design, readers may find the reference by Chow and Liu (2) useful.

The rest of the entry is organized as follows. In Section 2, testing in one-sample problems is considered. Procedures for sample size calculation in two-sample problems under a parallel design and a crossover design are discussed in Sections 3 and 4, respectively. Sections 5 and 6 present procedures for multiple-sample problems under a parallel design (one-way analysis of variance) and a crossover design (Williams design), respectively. A concluding remark is given in the last section.
2 ONE-SAMPLE DESIGN

Let $x_i$ be the response of interest from the $i$th patient, $i = 1, \ldots, n$. It is assumed that the $x_i$'s are independent and identically distributed (i.i.d.) normal random variables with mean $\mu$ and variance $\sigma^2$. Define
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \quad \text{and} \quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
where $\bar{x}$ and $s^2$ are the sample mean and sample variance, respectively. Let $\Delta = \mu - \mu_0$ be the true mean difference between a treatment and a reference. Without loss of generality, assume that $\Delta > 0$ ($\Delta < 0$) is an indication of improvement (worsening) of the treatment as compared with the reference.

2.1 Test for Equality

To test whether a mean difference between the treatment and the reference value truly exists, the following hypotheses are usually considered:
$$H_0: \Delta = 0 \quad \text{versus} \quad H_a: \Delta \neq 0 \qquad (1)$$
For a given significance level $\alpha$, the null hypothesis $H_0$ is rejected if
$$\left|\frac{\bar{x} - \mu_0}{s/\sqrt{n}}\right| > t_{\alpha/2,\,n-1}$$
where $t_{\alpha/2,n-1}$ is the upper $(\alpha/2)$th quantile of the t-distribution with $n-1$ degrees of freedom. Under the alternative hypothesis (i.e., $\Delta \neq 0$), the power of the above test can be approximated by
$$\Phi\!\left(\frac{\sqrt{n}|\Delta|}{\sigma} - z_{\alpha/2}\right)$$
where $\Phi$ is the cumulative standard normal distribution function. As a result, the sample size needed to achieve the desired power of $1-\beta$ is given by
$$n = \frac{(z_{\alpha/2} + z_\beta)^2 \sigma^2}{\Delta^2}$$

2.2 Test for Noninferiority/Superiority

The following hypotheses are usually considered to test noninferiority or superiority:
$$H_0: \Delta \leq \delta \quad \text{versus} \quad H_a: \Delta > \delta \qquad (2)$$
where $\delta$ is the superiority or noninferiority margin. When $\delta > 0$, rejection of the null hypothesis indicates superiority over the reference value. When $\delta < 0$, rejection of the null hypothesis implies noninferiority against the reference value. For a given significance level $\alpha$, the null hypothesis $H_0$ is rejected if
$$\frac{\bar{x} - \mu_0 - \delta}{s/\sqrt{n}} > t_{\alpha,\,n-1}$$
Similarly, the power of the above test can be approximated by
$$\Phi\!\left(\frac{\sqrt{n}(\Delta - \delta)}{\sigma} - z_{\alpha}\right)$$
Therefore, the sample size needed to achieve the desired power of $1-\beta$ is given by
$$n = \frac{(z_{\alpha} + z_\beta)^2 \sigma^2}{(\Delta - \delta)^2}$$

2.3 Test for Equivalence

Equivalence between the treatment and the reference value can be established by testing the following hypotheses:
$$H_0: |\Delta| \geq \delta \quad \text{versus} \quad H_a: |\Delta| < \delta \qquad (3)$$
where $\delta$ is the equivalence limit. Equivalence between the treatment and the reference can be established by testing the following two one-sided hypotheses:
$$H_{01}: \Delta \geq \delta \ \ \text{versus} \ \ H_{a1}: \Delta < \delta \quad \text{and} \quad H_{02}: \Delta \leq -\delta \ \ \text{versus} \ \ H_{a2}: \Delta > -\delta \qquad (4)$$
In other words, $H_{01}$ and $H_{02}$ are rejected and equivalence at the $\alpha$ level of significance is concluded if
$$\frac{\sqrt{n}(\bar{x} - \mu_0 - \delta)}{s} < -t_{\alpha,\,n-1} \quad \text{and} \quad \frac{\sqrt{n}(\bar{x} - \mu_0 + \delta)}{s} > t_{\alpha,\,n-1}$$
The power of the above test can be approximated by
$$\Phi\!\left(\frac{\sqrt{n}(\delta - \Delta)}{\sigma} - z_{\alpha}\right) + \Phi\!\left(\frac{\sqrt{n}(\delta + \Delta)}{\sigma} - z_{\alpha}\right) - 1$$
Based on a similar argument as given in Chow and Liu (3, 4), the sample size needed to achieve the desired power of $1-\beta$ is given by
$$n = \frac{(z_{\alpha} + z_{\beta/2})^2 \sigma^2}{\delta^2} \ \ \text{if } \Delta = 0, \qquad n = \frac{(z_{\alpha} + z_\beta)^2 \sigma^2}{(\delta - |\Delta|)^2} \ \ \text{if } \Delta \neq 0$$

2.4 An Example

Consider a clinical study for evaluation of a study drug for treatment of patients with hypertension. Each patient's diastolic blood pressure is measured at baseline and at post-treatment. The primary endpoint is the post-treatment change in diastolic blood pressure from baseline. Assuming the normal range of diastolic blood pressure is 80 mm Hg or below, a patient with a diastolic blood pressure higher than 90 mm Hg is considered hypertensive. The objective of the study is to show whether the study drug will decrease diastolic blood pressure from 90 mm Hg to the normal range of 85 mm Hg or below. Therefore, a minimum decrease of 5 mm Hg is considered a clinically meaningful difference. Suppose that, from a pilot study, the standard deviation of the study drug is estimated to be about 10 mm Hg ($\sigma = 10$). Thus, the sample size needed for achieving an 80% power at the 5% level of significance for detection of a clinically meaningful difference of 5 mm Hg is given by
$$n = \frac{(z_{\alpha/2} + z_\beta)^2 \sigma^2}{\Delta^2} = \frac{(1.96 + 0.84)^2 \times 10^2}{5^2} = 31.36 \approx 32$$
On the other hand, suppose a standard therapy exists for treatment of hypertension in the marketplace. To show the superiority of the study drug as compared with the standard therapy, the sample size needed for achieving an 80% power at the 5% level of significance, assuming a superiority margin of 2.5 mm Hg, is given by
$$n = \frac{(z_{\alpha} + z_\beta)^2 \sigma^2}{(\Delta - \delta)^2} = \frac{(1.64 + 0.84)^2 \times 10^2}{(5 - 2.5)^2} = 98.41 \approx 99$$
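The one-sample calculations above are easy to script. The following Python sketch is an illustration only (the function names are hypothetical, and the rounded critical values 1.96, 1.64, and 0.84 quoted in the text are used so that the output matches the worked example):

```python
import math

def n_one_sample_equality(sigma, Delta, z_a2=1.96, z_b=0.84):
    """n = (z_{alpha/2} + z_beta)^2 * sigma^2 / Delta^2, rounded up."""
    return math.ceil((z_a2 + z_b) ** 2 * sigma ** 2 / Delta ** 2)

def n_one_sample_superiority(sigma, Delta, delta, z_a=1.64, z_b=0.84):
    """n = (z_alpha + z_beta)^2 * sigma^2 / (Delta - delta)^2, rounded up."""
    return math.ceil((z_a + z_b) ** 2 * sigma ** 2 / (Delta - delta) ** 2)

print(n_one_sample_equality(sigma=10, Delta=5))                  # 32
print(n_one_sample_superiority(sigma=10, Delta=5, delta=2.5))    # 99
```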
3 TWO-SAMPLE PARALLEL DESIGN

For a two-sample comparison under a parallel design, let $x_{ij}$ be the response of interest obtained from the $j$th patient in the $i$th treatment group, $j = 1, \ldots, n_i$, $i = 1, 2$. It is assumed that the $x_{ij}$, $j = 1, \ldots, n_i$, $i = 1, 2$, are independent normal random variables with mean $\mu_i$ and variance $\sigma^2$. Define
$$\bar{x}_{i\cdot} = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij} \quad \text{and} \quad s^2 = \frac{1}{n_1+n_2-2}\sum_{i=1}^{2}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_{i\cdot})^2$$
where $\bar{x}_{i\cdot}$ and $s^2$ are the sample mean of the $i$th treatment group and the pooled sample variance, respectively. Let $\Delta = \mu_2 - \mu_1$ be the true mean difference between the treatment and the control. In practice, it is not uncommon to have an unequal sample size allocation between treatment groups. Let $n_1/n_2 = \kappa$ for some $\kappa$. When $\kappa = 2$, there is a 2:1 ratio between the treatment group and the control group.

3.1 Test for Equality

For testing equality between treatment groups, consider the hypotheses given in Equation (1). For a given significance level $\alpha$, the null hypothesis $H_0$ of (1) is rejected if
$$\left|\frac{\bar{x}_{1\cdot} - \bar{x}_{2\cdot}}{s\sqrt{1/n_1 + 1/n_2}}\right| > t_{\alpha/2,\,n_1+n_2-2}$$
The power of the above test can be approximated by
$$\Phi\!\left(\frac{|\Delta|}{\sigma\sqrt{1/n_1 + 1/n_2}} - z_{\alpha/2}\right)$$
As a result, the sample size needed to achieve the desired power of $1-\beta$ at the $\alpha$ level of significance is given by
$$n_1 = \kappa n_2 \quad \text{and} \quad n_2 = \frac{(z_{\alpha/2} + z_\beta)^2 \sigma^2 (1 + 1/\kappa)}{\Delta^2}$$

3.2 Test for Noninferiority/Superiority

For testing noninferiority/superiority, consider the hypotheses given in Equation (2). For a given significance level $\alpha$, the null hypothesis $H_0$ is rejected if
$$\frac{\bar{x}_{1\cdot} - \bar{x}_{2\cdot} - \delta}{s\sqrt{1/n_1 + 1/n_2}} > t_{\alpha,\,n_1+n_2-2}$$
Under the alternative hypothesis (i.e., $\Delta > \delta$), the power of the above test can be approximated by
$$\Phi\!\left(\frac{\Delta - \delta}{\sigma\sqrt{1/n_1 + 1/n_2}} - z_{\alpha}\right)$$
Hence, the sample size needed to achieve the desired power of $1-\beta$ at the $\alpha$ level of significance is given by
$$n_1 = \kappa n_2 \quad \text{and} \quad n_2 = \frac{(z_{\alpha} + z_\beta)^2 \sigma^2 (1 + 1/\kappa)}{(\Delta - \delta)^2}$$

3.3 Test for Equivalence

For testing therapeutic equivalence, consider the two one-sided hypotheses given in Equation (4). For a given significance level $\alpha$, the null hypothesis of inequivalence is rejected at the $\alpha$ level of significance if
$$\frac{\bar{x}_{1\cdot} - \bar{x}_{2\cdot} - \delta}{s\sqrt{1/n_1 + 1/n_2}} < -t_{\alpha,\,n_1+n_2-2} \quad \text{and} \quad \frac{\bar{x}_{1\cdot} - \bar{x}_{2\cdot} + \delta}{s\sqrt{1/n_1 + 1/n_2}} > t_{\alpha,\,n_1+n_2-2}$$
Under the alternative hypothesis (i.e., $|\Delta| < \delta$), the power of this test can be approximated by
$$\Phi\!\left(\frac{\delta - \Delta}{\sigma\sqrt{1/n_1 + 1/n_2}} - z_{\alpha}\right) + \Phi\!\left(\frac{\delta + \Delta}{\sigma\sqrt{1/n_1 + 1/n_2}} - z_{\alpha}\right) - 1$$
As a result, the sample size needed to achieve the desired power of $1-\beta$ is given by $n_1 = \kappa n_2$ and
$$n_2 = \frac{(z_{\alpha} + z_{\beta/2})^2 \sigma^2 (1 + 1/\kappa)}{\delta^2} \ \ \text{if } \Delta = 0, \qquad n_2 = \frac{(z_{\alpha} + z_\beta)^2 \sigma^2 (1 + 1/\kappa)}{(\delta - |\Delta|)^2} \ \ \text{if } \Delta \neq 0$$

3.4 An Example

Consider the same example as discussed in Section 2.4. Now assume that a parallel design will be conducted to compare the study drug with an active control agent. The objective is to establish the superiority of the study drug over the active control agent. If this objective cannot be achieved, the sponsor will then attempt to establish equivalence between the study drug and the active control agent. According to historical data, the standard deviations for both the study drug and the active control agent are approximately 5 mm Hg ($\sigma = 5$). Assuming that the true difference between treatments is 3.5 mm Hg ($\Delta = 3.5$) and the superiority margin is 2.5 mm Hg ($\delta = 2.5$), the sample size needed to achieve the desired power of 80% ($\beta = 0.20$) at the 5% ($\alpha = 0.05$) level of significance is given by
$$n = \frac{2\sigma^2 (z_{\alpha} + z_\beta)^2}{(\Delta - \delta)^2} = \frac{2 \times 5^2 \times (1.64 + 0.84)^2}{(3.5 - 2.5)^2} = 307.52 \approx 308$$
On the other hand, the sponsor may suspect that the study drug will not provide such a significant improvement. Alternatively, the sponsor may want to show that the study drug is at least as good as the active control agent ($\Delta = 0$). Therefore, instead of demonstrating superiority, the sponsor may want to establish equivalence between the study drug and the active control agent. Assume that a difference of 2.0 mm Hg ($\delta = 2.0$) is not considered of clinical importance. Therefore, the sample size needed to achieve the desired power of 80% ($\beta = 0.20$) at the 5% ($\alpha = 0.05$) level of significance for establishment of therapeutic equivalence between the study drug and the active control agent is given by
$$n = \frac{2\sigma^2 (z_{\alpha} + z_{\beta/2})^2}{(\delta - |\Delta|)^2} = \frac{2 \times 5^2 \times (1.64 + 1.28)^2}{(2.0 - 0.0)^2} = 106.58 \approx 107$$
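A minimal Python sketch of the two formulas used in the example above (function names are hypothetical; the rounded critical values from the text are assumed so the output matches):

```python
import math

def n_parallel_superiority(sigma, Delta, delta, kappa=1.0, z_a=1.64, z_b=0.84):
    """n2 = (z_a + z_b)^2 * sigma^2 * (1 + 1/kappa) / (Delta - delta)^2; n1 = kappa * n2."""
    n2 = (z_a + z_b) ** 2 * sigma ** 2 * (1 + 1 / kappa) / (Delta - delta) ** 2
    return math.ceil(kappa * n2), math.ceil(n2)

def n_parallel_equivalence(sigma, Delta, delta, kappa=1.0, z_a=1.64, z_b=0.84, z_b2=1.28):
    """Equivalence sample size; z_{beta/2} is used when Delta = 0."""
    z = z_b2 if Delta == 0 else z_b
    n2 = (z_a + z) ** 2 * sigma ** 2 * (1 + 1 / kappa) / (delta - abs(Delta)) ** 2
    return math.ceil(kappa * n2), math.ceil(n2)

print(n_parallel_superiority(sigma=5, Delta=3.5, delta=2.5))   # (308, 308)
print(n_parallel_equivalence(sigma=5, Delta=0.0, delta=2.0))   # (107, 107)
```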
4 TWO-SAMPLE CROSSOVER DESIGN

For a two-sample comparison under a crossover design, without loss of generality, consider a standard 2 × 2m replicated crossover design comparing the mean responses of a treatment and a control. Let $y_{ijkl}$ be the $l$th response of interest ($l = 1, \ldots, m$) observed from the $j$th patient ($j = 1, \ldots, n$) in the $i$th sequence ($i = 1, 2$) under the $k$th treatment ($k = 1, 2$). The following linear mixed effects model is usually considered:
$$y_{ijkl} = \mu_k + \gamma_{ik} + s_{ijk} + e_{ijkl}$$
where $\mu_k$ is the effect due to the $k$th treatment, $\gamma_{ik}$ is the fixed effect of the $i$th sequence under the $k$th treatment, and $s_{ijk}$ is the random effect of the $j$th patient in the $i$th sequence under treatment $k$. It is further assumed that $(s_{ij1}, s_{ij2})$, $i = 1, 2$, $j = 1, \ldots, n$, are i.i.d. bivariate normal random vectors with mean 0 and covariance matrix
$$\begin{pmatrix} \sigma_{BT}^2 & \rho\,\sigma_{BT}\sigma_{BR} \\ \rho\,\sigma_{BT}\sigma_{BR} & \sigma_{BR}^2 \end{pmatrix}$$
where $\sigma_{BT}^2$ and $\sigma_{BR}^2$ are the intersubject variabilities under the test and reference, respectively, and $\rho$ is the intersubject correlation coefficient. The errors $e_{ij1l}$ and $e_{ij2l}$ are assumed to be independent normal random variables with mean 0; depending on the treatment, $e_{ij1l}$ and $e_{ij2l}$ may have variance $\sigma_{WT}^2$ or $\sigma_{WR}^2$. Further define
$$\sigma_D^2 = \sigma_{BT}^2 + \sigma_{BR}^2 - 2\rho\,\sigma_{BT}\sigma_{BR}$$
which is usually referred to as the variability due to the subject-by-treatment interaction. Let $\Delta = \mu_2 - \mu_1$ be the true mean difference between the treatment and the control. Define
$$\bar{y}_{ijk\cdot} = \frac{1}{m}(y_{ijk1} + \cdots + y_{ijkm}) \quad \text{and} \quad d_{ij} = \bar{y}_{ij1\cdot} - \bar{y}_{ij2\cdot}$$
An unbiased estimator for $\Delta$ is given by
$$\hat{\Delta} = \frac{1}{2n}\sum_{i=1}^{2}\sum_{j=1}^{n} d_{ij}$$
Under the model assumptions, $\hat{\Delta}$ follows a normal distribution with mean $\Delta$ and variance $\sigma_m^2/(2n)$, where
$$\sigma_m^2 = \sigma_D^2 + \frac{1}{m}\left(\sigma_{WT}^2 + \sigma_{WR}^2\right)$$
To estimate $\sigma_m^2$, the following estimator is useful:
$$\hat{\sigma}_m^2 = \frac{1}{2(n-1)}\sum_{i=1}^{2}\sum_{j=1}^{n}\left(d_{ij} - \bar{d}_{i\cdot}\right)^2 \quad \text{where} \quad \bar{d}_{i\cdot} = \frac{1}{n}\sum_{j=1}^{n} d_{ij}$$
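As a quick numerical illustration of how $\sigma_m^2$ aggregates the variance components (the component values below are assumptions for illustration only, not taken from the article):

```python
m = 2                                    # replicates per treatment in a 2 x 2m design
sigma_BT, sigma_BR, rho = 0.4, 0.4, 0.75 # assumed intersubject SDs and correlation
sigma_WT2, sigma_WR2 = 0.09, 0.09        # assumed intrasubject variances

sigma_D2 = sigma_BT**2 + sigma_BR**2 - 2 * rho * sigma_BT * sigma_BR
sigma_m2 = sigma_D2 + (sigma_WT2 + sigma_WR2) / m
print(sigma_D2, sigma_m2)   # 0.08, 0.17
```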
4.1 Test for Equality

For testing equality, consider the hypotheses given in Equation (1). For a given significance level $\alpha$, the null hypothesis $H_0$ of (1) is rejected if
$$\left|\frac{\hat{\Delta}}{\hat{\sigma}_m/\sqrt{2n}}\right| > t_{\alpha/2,\,2n-2}$$
Under the alternative hypothesis (i.e., $\Delta \neq 0$), the power of this test can be approximated by
$$\Phi\!\left(\frac{\sqrt{2n}|\Delta|}{\sigma_m} - z_{\alpha/2}\right)$$
Therefore, the sample size needed to achieve the desired power of $1-\beta$ at the $\alpha$ level of significance is given by
$$n = \frac{(z_{\alpha/2} + z_\beta)^2 \sigma_m^2}{2\Delta^2}$$

4.2 Test for Noninferiority/Superiority

For testing noninferiority/superiority, consider the hypotheses given in Equation (2). The null hypothesis $H_0$ of (2) is rejected at the $\alpha$ level of significance if
$$\frac{\hat{\Delta} - \delta}{\hat{\sigma}_m/\sqrt{2n}} > t_{\alpha,\,2n-2}$$
Under the alternative hypothesis (i.e., $\Delta > \delta$), the power of this test can be approximated by
$$\Phi\!\left(\frac{\Delta - \delta}{\sigma_m/\sqrt{2n}} - z_{\alpha}\right)$$
As a result, the sample size needed to achieve the desired power of $1-\beta$ at the $\alpha$ level of significance is given by
$$n = \frac{(z_{\alpha} + z_\beta)^2 \sigma_m^2}{2(\Delta - \delta)^2}$$

4.3 Test for Equivalence

Similarly, therapeutic equivalence can be established by testing the two one-sided hypotheses given in Equation (4). For a given significance level $\alpha$, the null hypothesis $H_0$ of inequivalence is rejected if
$$\frac{\sqrt{2n}(\hat{\Delta} - \delta)}{\hat{\sigma}_m} < -t_{\alpha,\,2n-2} \quad \text{and} \quad \frac{\sqrt{2n}(\hat{\Delta} + \delta)}{\hat{\sigma}_m} > t_{\alpha,\,2n-2}$$
Under the alternative hypothesis (i.e., $|\Delta| < \delta$), the power of this test can be approximated by
$$\Phi\!\left(\frac{\sqrt{2n}(\delta - \Delta)}{\sigma_m} - z_{\alpha}\right) + \Phi\!\left(\frac{\sqrt{2n}(\delta + \Delta)}{\sigma_m} - z_{\alpha}\right) - 1$$
Thus, the sample size needed to achieve the desired power of $1-\beta$ at the $\alpha$ level of significance is given by
$$n = \frac{(z_{\alpha} + z_{\beta/2})^2 \sigma_m^2}{2\delta^2} \ \ \text{if } \Delta = 0, \qquad n = \frac{(z_{\alpha} + z_\beta)^2 \sigma_m^2}{2(\delta - |\Delta|)^2} \ \ \text{if } \Delta \neq 0$$

4.4 An Example

In the previous example, suppose that instead of a parallel design, a standard two-sequence, two-period crossover design is used to compare the study drug and the active control agent. Assume that the standard deviation of the intrasubject comparison is about 2.5 mm Hg ($\sigma_m = 2.5$) and that the mean difference between the study drug and the active control agent is about 1 mm Hg ($\Delta = 1$). Then the sample size required for achieving an 80% power at the 5% ($\alpha = 0.05$) level of significance is given by
$$n = \frac{(z_{\alpha/2} + z_\beta)^2 \sigma_m^2}{2\Delta^2} = \frac{(1.96 + 0.84)^2 \times 2.5^2}{2 \times 1.0^2} = 24.5 \approx 25$$
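A short Python sketch of the crossover equality formula used in the example above (illustrative names; rounded critical values from the text assumed):

```python
import math

def n_crossover_means_equality(sigma_m, Delta, z_a2=1.96, z_b=0.84):
    """n from Section 4.1: (z_{alpha/2} + z_beta)^2 * sigma_m^2 / (2 * Delta^2)."""
    return math.ceil((z_a2 + z_b) ** 2 * sigma_m ** 2 / (2 * Delta ** 2))

print(n_crossover_means_equality(sigma_m=2.5, Delta=1.0))   # 25
```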
5 MULTIPLE-SAMPLE ONE-WAY ANOVA

In this section, multiple samples from a parallel design comparing more than two treatments are considered. More specifically, let $x_{ij}$ be the response of the $j$th patient in the $i$th treatment group, $i = 1, \ldots, k$, $j = 1, \ldots, n$. It is assumed that
$$x_{ij} = A_i + \epsilon_{ij}$$
where $A_i$ is the fixed effect of the $i$th treatment and $\epsilon_{ij}$ is a random error in observing $x_{ij}$. It is assumed that the $\epsilon_{ij}$ are i.i.d. normal random variables with mean 0 and variance $\sigma^2$. Define
$$SSE = \sum_{i=1}^{k}\sum_{j=1}^{n}(x_{ij} - \bar{x}_{i\cdot})^2 \quad \text{and} \quad SSA = \sum_{i=1}^{k}(\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})^2$$
where
$$\bar{x}_{i\cdot} = \frac{1}{n}\sum_{j=1}^{n} x_{ij} \quad \text{and} \quad \bar{x}_{\cdot\cdot} = \frac{1}{k}\sum_{i=1}^{k}\bar{x}_{i\cdot}$$
Then $\sigma^2$ can be estimated by
$$\hat{\sigma}^2 = \frac{SSE}{k(n-1)}$$

5.1 Pairwise Comparison

It is common practice to compare means between the pairs of treatments of interest. This type of problem can be formulated as the following hypotheses:
$$H_0: \mu_i = \mu_j \quad \text{versus} \quad H_a: \mu_i \neq \mu_j \qquad (5)$$
for some pairs $(i, j)$. If all possible pairwise comparisons are considered, a total of $k(k-1)/2$ comparisons exists. It should be noted that multiple comparisons inflate the type I error. As a result, an appropriate adjustment, such as the Bonferroni adjustment, should be made to control the overall type I error rate at the desired significance level. Assume that $\tau$ is the number of pairwise comparisons of interest. The null hypothesis is rejected if
$$\frac{\sqrt{n}\,|\bar{x}_{i\cdot} - \bar{x}_{j\cdot}|}{\sqrt{2}\,\hat{\sigma}} > t_{\alpha/(2\tau),\,k(n-1)}$$
The power of this test can be approximated by
$$\Phi\!\left(\frac{\sqrt{n}\,|\Delta_{ij}|}{\sqrt{2}\,\sigma} - z_{\alpha/(2\tau)}\right)$$
where $\Delta_{ij} = \mu_i - \mu_j$ is the true mean difference between treatments $i$ and $j$. As a result, the sample size needed to achieve the desired power of $1-\beta$ is given by
$$n = \max\{n_{ij},\ \text{over all comparisons of interest}\} \quad \text{where} \quad n_{ij} = \frac{2(z_{\alpha/(2\tau)} + z_\beta)^2 \sigma^2}{\Delta_{ij}^2}$$

5.2 Simultaneous Comparison

Situations also exist where the interest is in detecting any clinically meaningful difference among all possible treatment comparisons. Thus, the following hypotheses are usually considered:
$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k \quad \text{versus} \quad H_a: \mu_i \neq \mu_j \ \text{for some } 1 \leq i < j \leq k \qquad (6)$$
For a given level of significance $\alpha$, the null hypothesis is rejected if
$$F_A = \frac{n\,SSA/(k-1)}{SSE/[k(n-1)]} > F_{\alpha,\,k-1,\,k(n-1)}$$
where $F_{\alpha,k-1,k(n-1)}$ denotes the upper $\alpha$ quantile of the F-distribution with $k-1$ and $k(n-1)$ degrees of freedom. As demonstrated by Chow et al. (5), under the alternative hypothesis, the power of the test can be approximated by
$$P\left(F_A > F_{\alpha,\,k-1,\,k(n-1)}\right) \approx P\left(n\,SSA > \sigma^2 \chi^2_{\alpha,\,k-1}\right)$$
where $\chi^2_{\alpha,k-1}$ represents the upper $\alpha$th quantile of a $\chi^2$ distribution with $k-1$ degrees of freedom. Under the alternative hypothesis, $n\,SSA/\sigma^2$ is distributed as a noncentral $\chi^2$ distribution with $k-1$ degrees of freedom and noncentrality parameter $\lambda = n\Delta$, where
$$\Delta = \frac{1}{\sigma^2}\sum_{i=1}^{k}(\mu_i - \bar{\mu})^2, \qquad \bar{\mu} = \frac{1}{k}\sum_{j=1}^{k}\mu_j$$
As a result, the sample size needed to achieve the desired power of $1-\beta$ can be obtained by solving
$$\chi^2_{k-1}\!\left(\chi^2_{\alpha,\,k-1} \mid \lambda\right) = \beta \qquad (7)$$
where $\chi^2_{k-1}(\cdot \mid \lambda)$ is the cumulative distribution function of the noncentral $\chi^2$ distribution with $k-1$ degrees of freedom and noncentrality parameter $\lambda$.

5.3 An Example

Assume that the sponsor wants to conduct a parallel trial to compare three drugs ($k = 3$) for treatment of patients with hypertension. The three treatments are a study drug, an active control agent, and a placebo. The primary endpoint is the decrease in diastolic blood pressure from baseline. It is assumed that the mean decreases for the three treatments are 5 mm Hg, 2.5 mm Hg, and 1.0 mm Hg, respectively ($\mu_1 = 5$, $\mu_2 = 2.5$, and $\mu_3 = 1.0$). A constant standard deviation of 5 mm Hg ($\sigma = 5$) is assumed for the three treatments. The sample size needed to achieve the desired power of 80% ($\beta = 0.20$) at the 5% ($\alpha = 0.05$) level of significance can be obtained by first finding the value of $\lambda$ according to Equation (7), which gives $\lambda = 9.64$. Therefore, the sample size required per treatment group is given by
$$n = \frac{\lambda}{\Delta} = \frac{9.64}{0.33} = 29.21 \approx 30$$
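Equation (7) has no closed-form solution for $\lambda$, but it is straightforward to solve numerically. The following Python sketch is an illustration under the example's assumptions (it relies on scipy for the central and noncentral chi-square distributions):

```python
import math
from scipy.stats import chi2, ncx2
from scipy.optimize import brentq

k, alpha, beta, sigma = 3, 0.05, 0.20, 5.0
mus = [5.0, 2.5, 1.0]

crit = chi2.ppf(1 - alpha, k - 1)   # chi^2_{alpha, k-1}
# Solve Equation (7): the noncentral chi-square CDF at crit equals beta.
lam = brentq(lambda l: ncx2.cdf(crit, k - 1, l) - beta, 1e-6, 100.0)
mu_bar = sum(mus) / k
Delta = sum((mu - mu_bar) ** 2 for mu in mus) / sigma ** 2
print(round(lam, 2))            # about 9.63, in line with the 9.64 quoted in the text
print(math.ceil(lam / Delta))   # 30 patients per group
```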
6 MULTIPLE-SAMPLE WILLIAMS DESIGN

As discussed, one advantage of adopting a crossover design in clinical research is that each patient can serve as his or her own control. As a result, intersubject variability can be removed from pairwise comparisons under appropriate assumptions. The U.S. Food and Drug Administration (FDA) has identified the crossover design as the design of choice for bioequivalence trials. In practice, the standard two-sequence, two-period crossover design is often used. However, it may not be useful when more than two treatments are compared. When more than two treatments are compared, it is desirable to compare pairwise treatment effects with the same degrees of freedom. In such a situation, a Williams design is recommended [see, e.g., Chow and Liu (3, 4)]. Under a Williams design, the following model is commonly used:
$$y_{ijl} = P_{j'} + \gamma_i + \mu_l + e_{ijl}, \qquad i, l = 1, \ldots, k, \quad j = 1, \ldots, n$$
where $y_{ijl}$ is the response of interest from the $j$th patient in the $i$th sequence under the $l$th treatment, $P_{j'}$ represents the fixed effect of the $j'$th period, $j'$ is the number of the period for the $i$th sequence's $l$th treatment, $\sum_{j'=1}^{a} P_{j'} = 0$, $\gamma_i$ is the fixed sequence effect, $\mu_l$ is the fixed treatment effect, and $e_{ijl}$ is a normal random variable with mean 0 and variance $\sigma_{il}^2$. For fixed $i$ and $l$, the $e_{ijl}$, $j = 1, \ldots, n$, are independent and identically distributed. For fixed $i$ and $j$, the $e_{ijl}$, $l = 1, \ldots, a$, are usually correlated because they all come from the same patient.

Without loss of generality, suppose that the first two treatments are to be compared (i.e., treatments 1 and 2). Define
$$d_{ij} = y_{ij1} - y_{ij2}$$
Then, the true mean difference between treatments 1 and 2 can be estimated by the following unbiased estimator:
$$\hat{\Delta} = \frac{1}{kn}\sum_{i=1}^{k}\sum_{j=1}^{n} d_{ij}$$
It can be shown that $\hat{\Delta}$ is normally distributed with mean $\Delta = \mu_1 - \mu_2$ and variance $\sigma_d^2/(kn)$, where $\sigma_d^2$ is the variance of $d_{ij}$ and can be estimated by
$$\hat{\sigma}_d^2 = \frac{1}{k(n-1)}\sum_{i=1}^{k}\sum_{j=1}^{n}\left(d_{ij} - \frac{1}{n}\sum_{j'=1}^{n} d_{ij'}\right)^2$$

6.1 Test for Equality

For testing equality, consider the hypotheses given in Equation (1). For a given significance level $\alpha$, the null hypothesis $H_0$ of (1) is rejected if
$$\left|\frac{\hat{\Delta}}{\hat{\sigma}_d/\sqrt{kn}}\right| > t_{\alpha/2,\,k(n-1)}$$
Under the alternative hypothesis (i.e., $\Delta \neq 0$), the power of the test can be approximated by
$$\Phi\!\left(\frac{\sqrt{kn}|\Delta|}{\sigma_d} - z_{\alpha/2}\right)$$
The sample size needed to achieve the desired power of $1-\beta$ at the $\alpha$ level of significance is given by
$$n = \frac{(z_{\alpha/2} + z_\beta)^2 \sigma_d^2}{k\Delta^2}$$

6.2 Test for Noninferiority/Superiority

For testing noninferiority/superiority, the hypotheses given in Equation (2) are considered. For a given significance level $\alpha$, the null hypothesis $H_0$ of Equation (2) is rejected if
$$\frac{\hat{\Delta} - \delta}{\hat{\sigma}_d/\sqrt{kn}} > t_{\alpha,\,k(n-1)}$$
Under the alternative hypothesis (i.e., $\Delta > \delta$), the power of this test can be approximated by
$$\Phi\!\left(\frac{\Delta - \delta}{\sigma_d/\sqrt{kn}} - z_{\alpha}\right)$$
As a result, the sample size needed to achieve the desired power of $1-\beta$ at the $\alpha$ level of significance is given by
$$n = \frac{(z_{\alpha} + z_\beta)^2 \sigma_d^2}{k(\Delta - \delta)^2}$$

6.3 Test for Equivalence

Equivalence between the treatment and the control can be established by testing the two one-sided hypotheses given in Equation (4). For a given significance level $\alpha$, the null hypothesis of inequivalence is rejected at the $\alpha$ level of significance if
$$\frac{\sqrt{kn}(\hat{\Delta} - \delta)}{\hat{\sigma}_d} < -t_{\alpha,\,k(n-1)} \quad \text{and} \quad \frac{\sqrt{kn}(\hat{\Delta} + \delta)}{\hat{\sigma}_d} > t_{\alpha,\,k(n-1)}$$
Under the alternative hypothesis (i.e., $|\Delta| < \delta$), the power of the above test can be approximated by
$$\Phi\!\left(\frac{\sqrt{kn}(\delta - \Delta)}{\sigma_d} - z_{\alpha}\right) + \Phi\!\left(\frac{\sqrt{kn}(\delta + \Delta)}{\sigma_d} - z_{\alpha}\right) - 1$$
Hence, the sample size needed for achieving a power of $1-\beta$ at the $\alpha$ level of significance is given by
$$n = \frac{(z_{\alpha} + z_{\beta/2})^2 \sigma_d^2}{k\delta^2} \ \ \text{if } \Delta = 0, \qquad n = \frac{(z_{\alpha} + z_\beta)^2 \sigma_d^2}{k(\delta - |\Delta|)^2} \ \ \text{if } \Delta \neq 0$$

6.4 An Example

Suppose a clinical trial is conducted with a standard 6 × 3 Williams design ($k = 6$) to compare a study drug, an active control agent, and a placebo. Assume that the mean difference between the study drug and the placebo is 3 mm Hg ($\Delta = 3$) with a standard deviation for the intrasubject comparison of 5 mm Hg ($\sigma_d = 5$). Thus, the sample size required per sequence to achieve the desired power of 80% ($\beta = 0.20$) at the 5% ($\alpha = 0.05$) level of significance is obtained as
$$n = \frac{(z_{\alpha/2} + z_\beta)^2 \sigma_d^2}{k\Delta^2} = \frac{(1.96 + 0.84)^2 \times 5^2}{6 \times 3^2} = 3.63 \approx 4$$
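A one-line check of the Williams design calculation above, sketched in Python (illustrative name; rounded critical values from the text assumed):

```python
import math

def n_williams_equality(sigma_d, Delta, k, z_a2=1.96, z_b=0.84):
    """n per sequence: (z_{alpha/2} + z_beta)^2 * sigma_d^2 / (k * Delta^2)."""
    return math.ceil((z_a2 + z_b) ** 2 * sigma_d ** 2 / (k * Delta ** 2))

print(n_williams_equality(sigma_d=5, Delta=3, k=6))   # 4
```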
7 DISCUSSION

In clinical research, sample size calculation/justification plays an important role in ensuring the validity, success, and cost-effectiveness of the intended clinical trial. A clinical trial without a sufficient sample size may not provide the desired reproducibility probability (6–8). In other words, the observed clinical results may not be reproducible at a given level of significance. From a regulatory point of view, a large sample size is always preferred. However, from the sponsor's point of view, an unnecessarily large sample size is a huge waste of the limited resources available for clinical research and development. Therefore, the objective of sample size calculation is to select the minimum sample size that achieves a desired power at a prespecified significance level.

REFERENCES

1. ICH, International Conference on Harmonization Tripartite Guideline E6: Good Clinical Practice: Consolidated Guidance. 1996.
2. S. C. Chow and J. P. Liu, Design and Analysis of Clinical Trials. New York: Wiley, 1998.
3. S. C. Chow and J. P. Liu, Design and Analysis of Bioavailability and Bioequivalence Studies. New York: Marcel Dekker, 1992.
4. S. C. Chow and J. P. Liu, Design and Analysis of Bioavailability and Bioequivalence Studies. New York: Marcel Dekker, 2000.
5. S. C. Chow, J. Shao, and H. Wang, Sample Size Calculation in Clinical Research. New York: Marcel Dekker, 2003.
6. S. C. Chow and J. Shao, Statistics in Drug Research. New York: Marcel Dekker, 2002.
7. S. C. Chow, J. Shao, and O. Y. P. Hu, Assessing sensitivity and similarity in bridging studies. Journal of Biopharmaceutical Statistics 2002; 12: 385–400.
8. S. C. Chow, Reproducibility probability in clinical research. Encyclopedia of Biopharmaceutical Statistics 2003; 838–849.
SAMPLE SIZE CALCULATION FOR COMPARING PROPORTIONS

HANSHENG WANG
Guanghua School of Management, Peking University, Department of Business Statistics & Econometrics, Beijing, P. R. China

SHEIN-CHUNG CHOW
Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, North Carolina, USA

1 INTRODUCTION

In clinical research, in addition to continuous responses, the primary clinical endpoints for assessment of the efficacy and safety of a drug product under investigation could be binary responses. For example, in cancer trials, a patient's clinical reaction to the treatment is often classified as response (e.g., complete response or partial response) or nonresponse. Based on these binary responses, the proportions of responses between treatment groups are then compared to determine whether a statistical/clinical difference exists. An appropriate sample size is usually calculated based on the statistical test used, to ensure that the desired power is available for detecting such a difference when the difference truly exists. The statistical test procedures employed for testing a treatment effect with binary responses are various chi-square or Z-type statistics, which may require a relatively large sample size for the validity of the asymptotic approximations. This article focuses on sample size calculation for binary responses based on asymptotic approximations.

In practice, the objective of sample size calculation is to select the minimum sample size in such a way that the desired power can be obtained for detection of a clinically meaningful difference at the prespecified significance level. Therefore, the selection of the sample size depends on the magnitude of the clinically meaningful difference, the desired power, the prespecified significance level, and the hypotheses of interest under the design of choice (e.g., a parallel design or a crossover design). According to the study objective, the hypotheses could be ones of testing for equality, testing for superiority, or testing for equivalence/non-inferiority. Under each null hypothesis, different formulas or procedures for sample size calculation can be derived under different study designs.

The rest of the article is organized as follows. In Section 2, sample size calculation procedures for a one-sample design are derived. In Section 3, formulas/procedures for sample size calculation under a two-sample parallel design are studied. Considerations for the crossover design are given in Section 4. Procedures for sample size calculation based on relative risk, which is often measured by the odds ratio on the log scale, are studied under a parallel design and a crossover design in Sections 5 and 6, respectively. Finally, the article concludes with a discussion in Section 7.

2 ONE-SAMPLE DESIGN

This section considers sample size formulas for one-sample tests for comparing proportions. More specifically, let $x_i$, $i = 1, \ldots, n$, be independent and identically distributed (i.i.d.) binary observations. It is assumed that $P(x_i = 1) = p$, where $p$ is the true response probability, which can be estimated by
$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
The parameter of interest is given by $\Delta = p - p_0$, where $p_0$ is some reference value (e.g., the response rate of a standard treatment).

2.1 Test for Equality

To test whether a clinically meaningful difference exists between the test drug and the reference value, the hypotheses of interest are given by
$$H_0: \Delta = 0 \quad \text{versus} \quad H_a: \Delta \neq 0$$
Under the null hypothesis, the test statistic
$$T = \frac{\sqrt{n}\,\hat{\Delta}}{\sqrt{\hat{p}(1-\hat{p})}}$$
where $\hat{\Delta} = \hat{p} - p_0$, is asymptotically distributed as a standard normal random variable. Thus, for a given significance level $\alpha$, the null hypothesis is rejected if $|T| > z_{\alpha/2}$, where $z_{\alpha/2}$ is the upper $(\alpha/2)$th quantile of the standard normal distribution. On the other hand, if the alternative hypothesis $\Delta \neq 0$ is true, the power of the above test can be approximated by
$$\Phi\!\left(\frac{\sqrt{n}|\Delta|}{\sqrt{p(1-p)}} - z_{\alpha/2}\right)$$
Hence, the sample size needed for achieving a power of $1-\beta$ can be obtained by solving
$$\frac{\sqrt{n}|\Delta|}{\sqrt{p(1-p)}} - z_{\alpha/2} = z_\beta$$
which leads to
$$n = \frac{(z_{\alpha/2} + z_\beta)^2\, p(1-p)}{\Delta^2} \qquad (1)$$

2.2 Test for Non-Inferiority/Superiority

As indicated in Chow et al. (1), the problems of testing non-inferiority and superiority can be unified by the following statistical hypotheses:
$$H_0: \Delta \leq \delta \quad \text{versus} \quad H_a: \Delta > \delta$$
where $\delta$ is the non-inferiority or superiority margin. Under the null hypothesis, the test statistic
$$\frac{\sqrt{n}(\hat{\Delta} - \delta)}{\sqrt{\hat{p}(1-\hat{p})}}$$
is asymptotically distributed as a standard normal random variable. Thus, for a given significance level $\alpha$, the null hypothesis is rejected if
$$\frac{\sqrt{n}(\hat{\Delta} - \delta)}{\sqrt{\hat{p}(1-\hat{p})}} > z_{\alpha}$$
On the other hand, under the alternative hypothesis $\Delta > \delta$, the power of the above test can be approximated by
$$\Phi\!\left(\frac{\sqrt{n}(\Delta - \delta)}{\sqrt{p(1-p)}} - z_{\alpha}\right)$$
Thus, the sample size needed for achieving a power of $1-\beta$ can be obtained by solving
$$\frac{\sqrt{n}(\Delta - \delta)}{\sqrt{p(1-p)}} - z_{\alpha} = z_\beta$$
which leads to
$$n = \frac{(z_{\alpha} + z_\beta)^2\, p(1-p)}{(\Delta - \delta)^2} \qquad (2)$$

2.3 Test for Equivalence

In order to establish equivalence, the following hypotheses are considered:
$$H_0: |\Delta| \geq \delta \quad \text{versus} \quad H_a: |\Delta| < \delta$$
where $\delta$ is the equivalence limit. The above hypotheses can be decomposed into the following two one-sided hypotheses:
$$H_{01}: \Delta \geq \delta \ \ \text{versus} \ \ H_{a1}: \Delta < \delta \quad \text{and} \quad H_{02}: \Delta \leq -\delta \ \ \text{versus} \ \ H_{a2}: \Delta > -\delta$$
The equivalence between $p$ and $p_0$ can be established if
$$\frac{\sqrt{n}(\hat{\Delta} - \delta)}{\sqrt{\hat{p}(1-\hat{p})}} < -z_{\alpha} \quad \text{and} \quad \frac{\sqrt{n}(\hat{\Delta} + \delta)}{\sqrt{\hat{p}(1-\hat{p})}} > z_{\alpha}$$
When the sample size is sufficiently large, the power of the above testing procedure can be approximated by
$$\Phi\!\left(\frac{\sqrt{n}(\delta - \Delta)}{\sqrt{p(1-p)}} - z_{\alpha}\right) + \Phi\!\left(\frac{\sqrt{n}(\delta + \Delta)}{\sqrt{p(1-p)}} - z_{\alpha}\right) - 1$$
Based on a similar argument as given in Chow and Liu (2, 3), the sample size needed for achieving a power of $1-\beta$ is given by
$$n = \frac{(z_{\alpha} + z_{\beta/2})^2\, p(1-p)}{\delta^2} \ \ \text{if } \Delta = 0, \qquad n = \frac{(z_{\alpha} + z_\beta)^2\, p(1-p)}{(\delta - |\Delta|)^2} \ \ \text{if } \Delta \neq 0$$

2.4 An Example

For illustration purposes, consider an example concerning a cancer study. Suppose that the true response rate of the study treatment is estimated to be about 50% (i.e., $p = 0.50$). On the other hand, suppose that a response rate of 30% ($p_0 = 0.30$) is considered a clinically meaningful response rate for treating cancer of this kind. The initial objective is to select a sample size so that an 80% power (i.e., $\beta = 0.20$) can be assured for establishing superiority of the test treatment as compared with the reference value, with a superiority margin of 5% (i.e., $\delta = 0.05$). The significance level is set to 5% (i.e., $\alpha = 0.05$). According to Equation (2), the sample size needed is given by
$$n = \frac{(z_{\alpha} + z_\beta)^2\, p(1-p)}{(p - p_0 - \delta)^2} = \frac{(1.64 + 0.84)^2 \times 0.5(1 - 0.5)}{(0.5 - 0.3 - 0.05)^2} \approx 69$$
That is, a total of 69 patients is needed for achieving the desired power for demonstration of superiority of the test treatment at the 5% level of significance. However, the investigator may find it difficult to recruit so many patients within the limited budget and relatively short time frame. Therefore, it may be of interest to consider the result if the study objective is only to detect a significant difference. In this case, according to Equation (1), the sample size needed is given by
$$n = \frac{(z_{\alpha/2} + z_\beta)^2\, p(1-p)}{(p - p_0)^2} = \frac{(1.96 + 0.84)^2 \times 0.5(1 - 0.5)}{(0.5 - 0.3)^2} = 49$$
Therefore, the total sample size required is reduced from 69 to 49 if the hypotheses of interest are changed from testing superiority (with a superiority margin of 5%) to detecting a significant difference.

3 TWO-SAMPLE PARALLEL DESIGN

In this section, the problem of sample size calculation for a two-sample parallel design is studied. Let $x_{ij}$ be a binary response from the $j$th subject in the $i$th treatment group, where $j = 1, \ldots, n_i$ and $i = 1, 2$. For fixed $i$, the $x_{ij}$ are assumed to be i.i.d. binary responses with $P(x_{ij} = 1) = p_i$. The parameter of interest is then given by $\Delta = p_1 - p_2$, which can be estimated by $\hat{\Delta} = \hat{p}_1 - \hat{p}_2$, where
$$\hat{p}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}$$

3.1 Test for Equality

In a two-sample parallel design, usually one treatment group serves as a control whereas the other receives an active treatment. In order to test for equality between the control and the treatment group in terms of the response rate, the following hypotheses are considered:
$$H_0: \Delta = 0 \quad \text{versus} \quad H_a: \Delta \neq 0$$
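The two one-sample calculations in the example above can be checked with a short Python sketch (function names are illustrative; the rounded critical values from the text are assumed):

```python
import math

def n_prop_equality(p, p0, z_a2=1.96, z_b=0.84):
    """Equation (1): n = (z_{alpha/2} + z_beta)^2 * p(1-p) / (p - p0)^2."""
    return math.ceil((z_a2 + z_b) ** 2 * p * (1 - p) / (p - p0) ** 2)

def n_prop_superiority(p, p0, delta, z_a=1.64, z_b=0.84):
    """Equation (2): n = (z_alpha + z_beta)^2 * p(1-p) / (p - p0 - delta)^2."""
    return math.ceil((z_a + z_b) ** 2 * p * (1 - p) / (p - p0 - delta) ** 2)

print(n_prop_superiority(p=0.50, p0=0.30, delta=0.05))   # 69
print(n_prop_equality(p=0.50, p0=0.30))                  # 49
```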
For a given significance level $\alpha$, the null hypothesis is rejected if
$$\left|\frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}}\right| > z_{\alpha/2}$$
On the other hand, if the alternative hypothesis $\Delta \neq 0$ is true, the power of the above test can be approximated by
$$\Phi\!\left(\frac{|\Delta|}{\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}} - z_{\alpha/2}\right)$$
Thus, the sample size needed for achieving a power of $1-\beta$ can be obtained by solving
$$\frac{|\Delta|}{\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}} - z_{\alpha/2} = z_\beta$$
which leads to
$$n_1 = \kappa n_2 \quad \text{and} \quad n_2 = \frac{(z_{\alpha/2} + z_\beta)^2\,[p_1(1-p_1)/\kappa + p_2(1-p_2)]}{\Delta^2}$$

3.2 Test for Non-Inferiority/Superiority

Similarly, the problem of testing non-inferiority and superiority can be unified by the following hypotheses:
$$H_0: \Delta \leq \delta \quad \text{versus} \quad H_a: \Delta > \delta$$
where $\delta$ is the superiority or non-inferiority margin. For a given significance level $\alpha$, the null hypothesis is rejected if
$$\frac{\hat{p}_1 - \hat{p}_2 - \delta}{\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}} > z_{\alpha}$$
On the other hand, under the alternative hypothesis that $\Delta > \delta$, the power of the above test can be approximated by
$$\Phi\!\left(\frac{\Delta - \delta}{\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}} - z_{\alpha}\right)$$
Hence, the sample size needed for achieving a power of $1-\beta$ can be obtained by solving
$$\frac{\Delta - \delta}{\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}} - z_{\alpha} = z_\beta$$
which leads to
$$n_1 = \kappa n_2 \quad \text{and} \quad n_2 = \frac{(z_{\alpha} + z_\beta)^2\,[p_1(1-p_1)/\kappa + p_2(1-p_2)]}{(\Delta - \delta)^2} \qquad (3)$$

3.3 Test for Equivalence

Equivalence between the treatment and the control can be established by testing the following hypotheses:
$$H_0: |\Delta| \geq \delta \quad \text{versus} \quad H_a: |\Delta| < \delta$$
where $\delta$ is the equivalence limit. The above hypotheses can be decomposed into the following two one-sided hypotheses:
$$H_{01}: \Delta \geq \delta \ \ \text{versus} \ \ H_{a1}: \Delta < \delta \quad \text{and} \quad H_{02}: \Delta \leq -\delta \ \ \text{versus} \ \ H_{a2}: \Delta > -\delta$$
For a given significance level $\alpha$, the null hypothesis of inequivalence is rejected if
$$\frac{\hat{p}_1 - \hat{p}_2 - \delta}{\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}} < -z_{\alpha} \quad \text{and} \quad \frac{\hat{p}_1 - \hat{p}_2 + \delta}{\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}} > z_{\alpha}$$
On the other hand, under the alternative hypothesis $|\Delta| < \delta$ (equivalence), the power of the above test can be approximated by
$$\Phi\!\left(\frac{\delta - \Delta}{\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}} - z_{\alpha}\right) + \Phi\!\left(\frac{\delta + \Delta}{\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}} - z_{\alpha}\right) - 1$$
Based on a similar argument as given in Chow and Liu (2, 3), the sample size required for achieving the desired power of $1-\beta$ is given by $n_1 = \kappa n_2$ and
$$n_2 = \frac{(z_{\alpha} + z_{\beta/2})^2\,[p_1(1-p_1)/\kappa + p_2(1-p_2)]}{\delta^2} \ \ \text{if } \Delta = 0, \qquad n_2 = \frac{(z_{\alpha} + z_\beta)^2\,[p_1(1-p_1)/\kappa + p_2(1-p_2)]}{(\delta - |\Delta|)^2} \ \ \text{if } \Delta \neq 0 \qquad (4)$$
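A minimal Python sketch of Equations (3) and (4) follows; it reproduces the numbers of the example in the next subsection. The parameter names (p_ref, p_test) are illustrative, the rounded critical values from the text are assumed, and the difference is taken as test minus reference to match that example:

```python
import math

Z_A, Z_B = 1.64, 0.84   # rounded z_alpha and z_beta quoted in the text

def n_prop_equivalence(p_ref, p_test, delta, kappa=1.0):
    """Equation (4), Delta != 0 case."""
    Delta = p_test - p_ref
    var = p_ref * (1 - p_ref) / kappa + p_test * (1 - p_test)
    return math.ceil((Z_A + Z_B) ** 2 * var / (delta - abs(Delta)) ** 2)

def n_prop_noninferiority(p_ref, p_test, margin, kappa=1.0):
    """Equation (3) with a (negative) non-inferiority margin."""
    Delta = p_test - p_ref
    var = p_ref * (1 - p_ref) / kappa + p_test * (1 - p_test)
    return math.ceil((Z_A + Z_B) ** 2 * var / (Delta - margin) ** 2)

print(n_prop_equivalence(p_ref=0.50, p_test=0.55, delta=0.15))        # 306
print(n_prop_noninferiority(p_ref=0.50, p_test=0.55, margin=-0.15))   # 77
```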
3.4 An Example

Consider an example concerning the evaluation of two anti-infective agents in the treatment of patients with skin structure infections. The two treatments to be compared are a standard therapy (active control) and a test treatment. After the treatment, the patient's skin is evaluated as cured or not cured. The parameter of interest is the post-treatment cure rate. Suppose that, based on a pilot study, the cure rate for the active control is estimated to be about 50% ($p_1 = 0.50$), whereas the cure rate for the test treatment is about 55% ($p_2 = 0.55$). Only a 5% ($\Delta = 0.05$) improvement is observed for the test treatment as compared with the control, which may not be considered of any clinical importance. Therefore, the investigator is interested in establishing equivalence between the test treatment and the control. According to Equation (4) and assuming equal sample size allocation ($\kappa = 1$), a 5% ($\alpha = 0.05$) level of significance, and 80% ($\beta = 0.20$) power, the sample size needed for establishment of equivalence with an equivalence margin of 15% ($\delta = 0.15$) is given by
$$n_1 = n_2 = \frac{(z_{\alpha} + z_\beta)^2\,[p_1(1-p_1) + p_2(1-p_2)]}{(\delta - |\Delta|)^2} = \frac{(1.64 + 0.84)^2\,[0.50(1 - 0.50) + 0.55(1 - 0.55)]}{(0.15 - 0.05)^2} \approx 306$$
Thus, a total of 306 patients per treatment group is needed for establishment of equivalence between the test treatment and the active control. On the other hand, suppose the investigator is also interested in showing non-inferiority with a non-inferiority margin of 15% ($\delta = 0.15$); the sample size needed is then given by
$$n_1 = n_2 = \frac{(z_{\alpha} + z_\beta)^2\,[p_1(1-p_1) + p_2(1-p_2)]}{(\Delta - \delta)^2} = \frac{(1.64 + 0.84)^2\,[0.50(1 - 0.50) + 0.55(1 - 0.55)]}{(0.15 + 0.05)^2} \approx 77$$
Hence, testing non-inferiority with a non-inferiority margin of 15% requires only 77 subjects per treatment group.

4 TWO-SAMPLE CROSSOVER DESIGN

In clinical research, a crossover design is sometimes considered to remove intersubject variability from the treatment comparison (4). This section focuses on the problem of sample size determination with binary responses under an $a \times 2m$ replicated crossover design. Let $x_{ijkl}$ be the $l$th replicate of a binary response ($l = 1, \ldots, m$) observed from the $j$th subject ($j = 1, \ldots, n$) in the $i$th sequence ($i = 1, \ldots, a$) under the $k$th treatment ($k = 1, 2$). It is assumed that $(x_{ij11}, \ldots, x_{ij1m}, x_{ij21}, \ldots, x_{ij2m})$, $i = 1, \ldots, a$, $j = 1, \ldots, n$, are i.i.d. random vectors with each component's marginal distribution specified by $P(x_{ijkl} = 1) = p_k$. Note that because the joint distribution is not specified, the observations from the same subject may be correlated with each other in an arbitrary manner. On the other hand, specifying that $P(x_{ijkl} = 1) = p_k$ implies that no sequence, period, or carryover effects exist. Let $\Delta = p_2$ (test) $-\ p_1$ (reference) be the parameter of interest, and define
$$\bar{x}_{ijk\cdot} = \frac{1}{m}(x_{ijk1} + \cdots + x_{ijkm}) \quad \text{and} \quad d_{ij} = \bar{x}_{ij1\cdot} - \bar{x}_{ij2\cdot}$$
An unbiased estimator of $\Delta$ can be obtained as
$$\hat{\Delta} = \frac{1}{an}\sum_{i=1}^{a}\sum_{j=1}^{n} d_{ij}$$
It can be verified that $\sqrt{an}(\hat{\Delta} - \Delta)$ is asymptotically distributed as $N(0, \sigma_d^2)$, where $\sigma_d^2 = \mathrm{var}(d_{ij})$ can be estimated by
$$\hat{\sigma}_d^2 = \frac{1}{a(n-1)}\sum_{i=1}^{a}\sum_{j=1}^{n}(d_{ij} - \bar{d}_{i\cdot})^2 \quad \text{where} \quad \bar{d}_{i\cdot} = \frac{1}{n}\sum_{j=1}^{n} d_{ij}$$
4.1 Test for Equality

For testing equality, the following hypotheses are considered:
$$H_0: \Delta = 0 \quad \text{versus} \quad H_a: \Delta \neq 0$$
The null hypothesis is rejected at the $\alpha$ level of significance if
$$\left|\frac{\hat{\Delta}}{\hat{\sigma}_d/\sqrt{an}}\right| > z_{\alpha/2}$$
On the other hand, under the alternative hypothesis that $\Delta \neq 0$, the power of the above test can be approximated by
$$\Phi\!\left(\frac{\sqrt{an}|\Delta|}{\sigma_d} - z_{\alpha/2}\right)$$
Thus, the sample size needed for achieving a power of $1-\beta$ can be obtained by solving
$$\frac{\sqrt{an}|\Delta|}{\sigma_d} - z_{\alpha/2} = z_\beta$$
which leads to
$$n = \frac{(z_{\alpha/2} + z_\beta)^2 \sigma_d^2}{a\Delta^2} \qquad (5)$$

4.2 Test for Non-Inferiority/Superiority

Similarly, non-inferiority/superiority can be tested based on the following hypotheses:
$$H_0: \Delta \leq \delta \quad \text{versus} \quad H_a: \Delta > \delta$$
where $\delta$ is the non-inferiority or superiority margin. For a given significance level $\alpha$, the null hypothesis is rejected if
$$\frac{\hat{\Delta} - \delta}{\hat{\sigma}_d/\sqrt{an}} > z_{\alpha}$$
On the other hand, under the alternative hypothesis that $\Delta > \delta$, the power of the above test can be approximated by
$$\Phi\!\left(\frac{\Delta - \delta}{\sigma_d/\sqrt{an}} - z_{\alpha}\right)$$
Hence, the sample size needed for achieving a power of $1-\beta$ can be obtained by solving
$$\frac{\Delta - \delta}{\sigma_d/\sqrt{an}} - z_{\alpha} = z_\beta$$
which gives
$$n = \frac{(z_{\alpha} + z_\beta)^2 \sigma_d^2}{a(\Delta - \delta)^2} \qquad (6)$$

4.3 Test for Equivalence

Equivalence between two treatments can be established by testing the following hypotheses:
$$H_0: |\Delta| \geq \delta \quad \text{versus} \quad H_a: |\Delta| < \delta$$
where $\delta$ is the equivalence limit. The above hypotheses can be decomposed into the following two one-sided hypotheses:
$$H_{01}: \Delta \geq \delta \ \ \text{versus} \ \ H_{a1}: \Delta < \delta \quad \text{and} \quad H_{02}: \Delta \leq -\delta \ \ \text{versus} \ \ H_{a2}: \Delta > -\delta$$
Thus, the null hypotheses of inequivalence are rejected at the $\alpha$ level of significance if
$$\frac{\sqrt{an}(\hat{\Delta} - \delta)}{\hat{\sigma}_d} < -z_{\alpha} \quad \text{and} \quad \frac{\sqrt{an}(\hat{\Delta} + \delta)}{\hat{\sigma}_d} > z_{\alpha}$$
On the other hand, under the alternative hypothesis that $|\Delta| < \delta$, the power of the above test can be approximated by
$$\Phi\!\left(\frac{\sqrt{an}(\delta - \Delta)}{\sigma_d} - z_{\alpha}\right) + \Phi\!\left(\frac{\sqrt{an}(\delta + \Delta)}{\sigma_d} - z_{\alpha}\right) - 1$$
Hence, the sample size needed for achieving a power of $1-\beta$ can be obtained by solving
$$\frac{\sqrt{an}(\delta - |\Delta|)}{\sigma_d} - z_{\alpha} \geq z_{\beta/2}$$
which leads to
$$n \geq \frac{(z_{\alpha} + z_{\beta/2})^2 \sigma_d^2}{a(\delta - |\Delta|)^2}$$

4.4 An Example

Suppose that an investigator is interested in conducting a clinical trial with a crossover design for comparing two formulations of a drug product. The design used is a standard 2 × 4 crossover design (i.e., ABAB, BABA) ($a = 2$, $m = 2$). Based on a pilot study, it is estimated that $\sigma_d = 0.50$ (50%) and the clinically meaningful difference is about $\Delta = 0.10$. Thus, the sample size needed for achieving an 80% ($\beta = 0.20$) power at the 5% ($\alpha = 0.05$) level of significance can be computed according to Equation (5). It is given by
$$n = \frac{(z_{\alpha/2} + z_\beta)^2 \sigma_d^2}{a\Delta^2} = \frac{(1.96 + 0.84)^2 \times 0.50^2}{2 \times 0.10^2} = 98$$
Thus, a total of 98 patients per sequence is needed for achieving the desired power. If the investigator is interested in showing non-inferiority with a margin of 5% ($\delta = -0.05$), then according to Equation (6), the sample size required can be obtained as
$$n = \frac{(z_{\alpha} + z_\beta)^2 \sigma_d^2}{a(\Delta - \delta)^2} = \frac{(1.64 + 0.84)^2 \times 0.5^2}{2 \times (0.10 + 0.05)^2} \approx 35$$
Hence, only 35 subjects per sequence are needed to show the non-inferiority of the test formulation as compared with the reference formulation of the drug product.
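The crossover calculations in the example above can be scripted as follows; this is an illustrative sketch only (hypothetical function names, rounded critical values from the text assumed):

```python
import math

def n_crossover_prop_equality(sigma_d, Delta, a, z_a2=1.96, z_b=0.84):
    """Equation (5): n = (z_{alpha/2} + z_beta)^2 * sigma_d^2 / (a * Delta^2)."""
    return math.ceil((z_a2 + z_b) ** 2 * sigma_d ** 2 / (a * Delta ** 2))

def n_crossover_prop_noninferiority(sigma_d, Delta, margin, a, z_a=1.64, z_b=0.84):
    """Equation (6): n = (z_alpha + z_beta)^2 * sigma_d^2 / (a * (Delta - margin)^2)."""
    return math.ceil((z_a + z_b) ** 2 * sigma_d ** 2 / (a * (Delta - margin) ** 2))

print(n_crossover_prop_equality(sigma_d=0.50, Delta=0.10, a=2))                      # 98
print(n_crossover_prop_noninferiority(sigma_d=0.50, Delta=0.10, margin=-0.05, a=2))  # 35
```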
5 RELATIVE RISK—PARALLEL DESIGN

In addition to the response rate, it is often of interest in clinical trials to examine the relative risk between treatment groups (5). The odds ratio is probably one of the most commonly considered parameters for evaluation of relative risk. Given the response rates $p_1$ and $p_2$ for the two treatment groups, the odds ratio is defined as
$$OR = \frac{p_1(1-p_2)}{p_2(1-p_1)}$$
It can be estimated by replacing $p_i$ with $\hat{p}_i$, where $\hat{p}_i$ is as defined in Section 3. In practice, the log-scaled odds ratio, defined as $\Delta = \log(OR)$, is often considered. An estimator of $\Delta$ can be obtained as $\hat{\Delta} = \log(\widehat{OR})$, where
$$\widehat{OR} = \frac{\hat{p}_1(1-\hat{p}_2)}{\hat{p}_2(1-\hat{p}_1)}$$

5.1 Test for Equality

For testing equality, the following hypotheses are considered:
$$H_0: \Delta = 0 \quad \text{versus} \quad H_a: \Delta \neq 0$$
Based on Taylor's expansion, the following test statistic can be obtained:
$$T = \hat{\Delta}\left[\frac{1}{n_1\hat{p}_1(1-\hat{p}_1)} + \frac{1}{n_2\hat{p}_2(1-\hat{p}_2)}\right]^{-1/2}$$
Under the null hypothesis, $T$ is asymptotically distributed as a standard normal random variable. Thus, for a given significance level $\alpha$, the null hypothesis is rejected if $|T| > z_{\alpha/2}$. Under the alternative hypothesis, the power of such a testing procedure can be approximated by
$$\Phi\!\left(|\Delta|\left[\frac{1}{n_1 p_1(1-p_1)} + \frac{1}{n_2 p_2(1-p_2)}\right]^{-1/2} - z_{\alpha/2}\right)$$
Hence, the sample size needed for achieving a power of $1-\beta$ can be obtained by solving
$$|\Delta|\left[\frac{1}{n_1 p_1(1-p_1)} + \frac{1}{n_2 p_2(1-p_2)}\right]^{-1/2} - z_{\alpha/2} = z_\beta$$
Assuming $n_1/n_2 = \kappa$, it follows that
$$n_1 = \kappa n_2 \quad \text{and} \quad n_2 = \frac{(z_{\alpha/2} + z_\beta)^2}{\Delta^2}\left[\frac{1}{\kappa p_1(1-p_1)} + \frac{1}{p_2(1-p_2)}\right]$$

5.2 Test for Non-Inferiority/Superiority

As indicated earlier, the problem of testing non-inferiority and superiority can be unified by the following statistical hypotheses:
$$H_0: \Delta \leq \delta \quad \text{versus} \quad H_a: \Delta > \delta$$
where $\delta$ is the non-inferiority or superiority margin. Define the following test statistic:
$$T = (\hat{\Delta} - \delta)\left[\frac{1}{n_1\hat{p}_1(1-\hat{p}_1)} + \frac{1}{n_2\hat{p}_2(1-\hat{p}_2)}\right]^{-1/2}$$
For a given significance level $\alpha$, the null hypothesis is rejected if $T > z_{\alpha}$. On the other hand, under the alternative hypothesis, the power of the above test can be approximated by
$$\Phi\!\left((\Delta - \delta)\left[\frac{1}{n_1 p_1(1-p_1)} + \frac{1}{n_2 p_2(1-p_2)}\right]^{-1/2} - z_{\alpha}\right)$$
Hence, the sample size needed for achieving a power of $1-\beta$ can be obtained by solving
$$(\Delta - \delta)\left[\frac{1}{n_1 p_1(1-p_1)} + \frac{1}{n_2 p_2(1-p_2)}\right]^{-1/2} - z_{\alpha} = z_\beta$$
Assuming $n_1/n_2 = \kappa$, it follows that
$$n_1 = \kappa n_2 \quad \text{and} \quad n_2 = \frac{(z_{\alpha} + z_\beta)^2}{(\Delta - \delta)^2}\left[\frac{1}{\kappa p_1(1-p_1)} + \frac{1}{p_2(1-p_2)}\right]$$

5.3 Test for Equivalence

In order to establish equivalence, the following statistical hypotheses are considered:
$$H_0: |\Delta| \geq \delta \quad \text{versus} \quad H_a: |\Delta| < \delta$$
For a given significance level $\alpha$, the null hypothesis is rejected if
$$(\hat{\Delta} - \delta)\left[\frac{1}{n_1\hat{p}_1(1-\hat{p}_1)} + \frac{1}{n_2\hat{p}_2(1-\hat{p}_2)}\right]^{-1/2} < -z_{\alpha} \quad \text{and} \quad (\hat{\Delta} + \delta)\left[\frac{1}{n_1\hat{p}_1(1-\hat{p}_1)} + \frac{1}{n_2\hat{p}_2(1-\hat{p}_2)}\right]^{-1/2} > z_{\alpha}$$
On the other hand, when the alternative hypothesis $|\Delta| < \delta$ is true, the power of the above testing procedure can be approximated by
$$\Phi\!\left((\delta - \Delta)\left[\frac{1}{n_1 p_1(1-p_1)} + \frac{1}{n_2 p_2(1-p_2)}\right]^{-1/2} - z_{\alpha}\right) + \Phi\!\left((\delta + \Delta)\left[\frac{1}{n_1 p_1(1-p_1)} + \frac{1}{n_2 p_2(1-p_2)}\right]^{-1/2} - z_{\alpha}\right) - 1$$
Based on a similar argument as given in Chow and Liu (2, 3), the sample size needed to achieve the desired power of $1-\beta$ is given by $n_1 = \kappa n_2$ and
$$n_2 = \frac{(z_{\alpha} + z_{\beta/2})^2}{\delta^2}\left[\frac{1}{\kappa p_1(1-p_1)} + \frac{1}{p_2(1-p_2)}\right] \ \ \text{if } \Delta = 0, \qquad n_2 = \frac{(z_{\alpha} + z_\beta)^2}{(\delta - |\Delta|)^2}\left[\frac{1}{\kappa p_1(1-p_1)} + \frac{1}{p_2(1-p_2)}\right] \ \ \text{if } \Delta \neq 0 \qquad (7)$$

5.4 An Example

Consider a clinical trial for evaluating the safety and efficacy of a test treatment for treating patients with schizophrenia. The objective of the trial is to establish equivalence between the test treatment and an active control in terms of the odds ratio of the relapse rates. Data from a pilot study indicate that the relapse rates for both the test treatment and the active control are about 50% ($p_1 = p_2 = 0.50$). Assuming a 20% equivalence limit ($\delta = 0.20$), equal sample size allocation ($\kappa = 1$), a 5% significance level ($\alpha = 0.05$), and 80% power ($\beta = 0.20$), then according to Equation (7) the sample size needed is given by
$$n = \frac{(z_{\alpha} + z_{\beta/2})^2}{\delta^2}\left[\frac{1}{p_1(1-p_1)} + \frac{1}{p_2(1-p_2)}\right] = \frac{(1.64 + 1.28)^2}{0.2^2}\left[\frac{1}{0.50(1-0.50)} + \frac{1}{0.50(1-0.50)}\right] \approx 1706$$
Therefore, a total of 1706 patients per treatment group is needed for achieving the desired power for establishment of equivalence in terms of the odds ratio.
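A short Python check of the log odds ratio calculation above (illustrative function name; rounded critical values from the text assumed):

```python
import math

def n_logodds_equivalence(p1, p2, delta, kappa=1.0, z_a=1.64, z_b2=1.28):
    """Equation (7), Delta = 0 case, for the log odds ratio Delta = log(OR)."""
    bracket = 1 / (kappa * p1 * (1 - p1)) + 1 / (p2 * (1 - p2))
    return math.ceil((z_a + z_b2) ** 2 / delta ** 2 * bracket)

print(n_logodds_equivalence(p1=0.50, p2=0.50, delta=0.20))   # 1706
```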
6 RELATIVE RISK—CROSSOVER DESIGN

For the purpose of simplicity, consider a 1 × 2 crossover design with no period effects. Without loss of generality, it is assumed that every patient receives the test treatment first and is then crossed over to receive the control. Let $x_{ij}$ be the binary response from the $j$th subject in the $i$th period, $j = 1, \ldots, n$. The parameter of interest is still the log-scale odds ratio between the test and the control, that is,
$$\Delta = \log\frac{p_1(1-p_2)}{p_2(1-p_1)}$$
which can be estimated by replacing $p_i$ with its estimator $\hat{p}_i = \sum_j x_{ij}/n$. According to Taylor's expansion, it can be verified that
$$\sqrt{n}(\hat{\Delta} - \Delta) \rightarrow_d N(0, \sigma_d^2)$$
where
$$\sigma_d^2 = \mathrm{var}\!\left(\frac{x_{1j} - p_1}{p_1(1-p_1)} - \frac{x_{2j} - p_2}{p_2(1-p_2)}\right)$$
which can be estimated by $\hat{\sigma}_d^2$, the sample variance based on
$$d_j = \frac{x_{1j}}{\hat{p}_1(1-\hat{p}_1)} - \frac{x_{2j}}{\hat{p}_2(1-\hat{p}_2)}, \qquad j = 1, \ldots, n$$

6.1 Test for Equality

In order to test for equality, the following statistical hypotheses are considered:
$$H_0: \Delta = 0 \quad \text{versus} \quad H_a: \Delta \neq 0$$
Under the null hypothesis, the test statistic
$$T = \frac{\sqrt{n}\,\hat{\Delta}}{\hat{\sigma}_d}$$
is asymptotically distributed as a standard normal random variable. Thus, the null hypothesis is rejected at the $\alpha$ level of significance if $|T| > z_{\alpha/2}$. On the other hand, under the alternative hypothesis that $\Delta \neq 0$, the power of the above testing procedure can be approximated by
$$\Phi\!\left(\frac{\sqrt{n}|\Delta|}{\sigma_d} - z_{\alpha/2}\right)$$
Thus, the sample size needed for achieving a power of $1-\beta$ can be obtained by solving
$$\frac{\sqrt{n}|\Delta|}{\sigma_d} - z_{\alpha/2} = z_\beta$$
which gives
$$n = \frac{(z_{\alpha/2} + z_\beta)^2 \sigma_d^2}{\Delta^2}$$

6.2 Test for Non-Inferiority/Superiority

Similarly, the problem of testing non-inferiority and superiority can be unified by the following hypotheses:
$$H_0: \Delta \leq \delta \quad \text{versus} \quad H_a: \Delta > \delta$$
where $\delta$ is the non-inferiority or superiority margin on the log scale. Under the null hypothesis, the test statistic
$$T = \frac{\sqrt{n}(\hat{\Delta} - \delta)}{\hat{\sigma}_d}$$
is asymptotically distributed as a standard normal random variable. Thus, the null hypothesis is rejected at the $\alpha$ level of significance if $T > z_{\alpha}$. On the other hand, under the alternative hypothesis that $\Delta > \delta$, the power of the above test procedure can be approximated by
$$\Phi\!\left(\frac{\sqrt{n}(\Delta - \delta)}{\sigma_d} - z_{\alpha}\right)$$
Hence, the sample size needed for achieving a power of $1-\beta$ can be obtained by solving
$$\frac{\sqrt{n}(\Delta - \delta)}{\sigma_d} - z_{\alpha} = z_\beta$$
which leads to
$$n = \frac{(z_{\alpha} + z_\beta)^2 \sigma_d^2}{(\Delta - \delta)^2}$$

6.3 Test for Equivalence

Equivalence between treatment groups can be established by testing the following interval hypotheses:
$$H_0: |\Delta| \geq \delta \quad \text{versus} \quad H_a: |\Delta| < \delta$$
or the following two one-sided hypotheses:
$$H_{01}: \Delta \geq \delta \ \ \text{versus} \ \ H_{a1}: \Delta < \delta \quad \text{and} \quad H_{02}: \Delta \leq -\delta \ \ \text{versus} \ \ H_{a2}: \Delta > -\delta$$
Thus, the null hypothesis of inequivalence is rejected at the $\alpha$ level of significance if
$$\frac{\sqrt{n}(\hat{\Delta} - \delta)}{\hat{\sigma}_d} < -z_{\alpha} \quad \text{and} \quad \frac{\sqrt{n}(\hat{\Delta} + \delta)}{\hat{\sigma}_d} > z_{\alpha}$$
On the other hand, when the alternative hypothesis $|\Delta| < \delta$ is true, the power of the above test can be approximated by
$$\Phi\!\left(\frac{\sqrt{n}(\delta - \Delta)}{\sigma_d} - z_{\alpha}\right) + \Phi\!\left(\frac{\sqrt{n}(\delta + \Delta)}{\sigma_d} - z_{\alpha}\right) - 1$$
Based on similar arguments as given in Chow and Liu (2, 3), the sample size needed for achieving the desired power of $1-\beta$ is given by
$$n = \frac{(z_{\alpha} + z_{\beta/2})^2 \sigma_d^2}{\delta^2} \ \ \text{if } \Delta = 0, \qquad n = \frac{(z_{\alpha} + z_\beta)^2 \sigma_d^2}{(\delta - |\Delta|)^2} \ \ \text{if } \Delta \neq 0 \qquad (8)$$

6.4 An Example

Consider the previous example regarding schizophrenia with the same objectives. Suppose that the trial is now conducted with a crossover design as discussed in this section. Assuming a 20% equivalence limit ($\delta = 0.20$), a 5% significance level ($\alpha = 0.05$), 80% power ($\beta = 0.20$), and $\sigma_d = 2.00$, then according to Equation (8) the sample size needed is given by
$$n = \frac{(z_{\alpha} + z_{\beta/2})^2 \sigma_d^2}{\delta^2} = \frac{(1.64 + 1.28)^2 \times 2.00^2}{0.20^2} \approx 853$$
Hence, a total of 853 subjects is needed in order to achieve the desired power of 80% for establishing equivalence between treatment groups.

7 DISCUSSION

In clinical research, sample size calculation with binary responses is commonly encountered. This article provides formulas for sample size calculation based on asymptotic theory under a parallel design and a crossover design. It should be noted that a relatively large sample size is usually required to ensure that the empirical type I and type II error rates are close to the nominal levels. In practice, however, the sample size is often far too small. In such situations, various exact tests (e.g., the binomial test and Fisher's exact test) should be used for sample size calculation. However, sample size determination based on an exact test usually requires extensive computation. More details can be found in Chow et al. (1).

REFERENCES
1. S. C. Chow, J. Shao, and H. Wang, Sample Size Calculation in Clinical Research. New York: Marcel Dekker, 2003.
2. S. C. Chow and J. P. Liu, Design and Analysis of Bioavailability and Bioequivalence Studies. New York: Marcel Dekker, 1992.
3. S. C. Chow and J. P. Liu, Design and Analysis of Bioavailability and Bioequivalence Studies. New York: Marcel Dekker, 2000.
4. S. C. Chow and H. Wang, On sample size calculation in bioequivalence trials. J. Pharmacokinet. Pharmacodynam. 2001; 28: 155–169.
5. H. Wang, S. C. Chow, and G. Li, On sample size calculation based on odds ratio in clinical trials. J. Biopharmaceut. Stat. 2003; 12: 471–483.
SAMPLE SIZE CALCULATION FOR COMPARING TIME-TO-EVENT DATA
HANSHENG WANG
Guanghua School of Management, Peking University, Department of Business Statistics & Econometrics, Beijing, P. R. China

SHEIN-CHUNG CHOW
Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, North Carolina, USA

1 INTRODUCTION

In clinical research, the occurrence of certain events (e.g., adverse events, disease progression, relapse, or death) is often of particular interest to the investigators, especially in the area of cancer trials. In most situations, these events are undesirable and unpreventable. In practice, it would be beneficial to patients if the test treatment could delay the occurrence of such events. As a result, the time-to-event has become an important study endpoint in clinical research. When the event is death, the time-to-event is defined as the patient's survival time. Consequently, the analysis of time-to-event data is referred to as survival analysis. The statistical methods for the analysis of time-to-event data are very different from those commonly used for other types of data, such as continuous and binary responses, for two major reasons. First, time-to-event data are subject to censoring (e.g., right, left, or interval censoring). In other words, the exact value of the response is unknown; however, it is known that the value is larger or smaller than an observed censoring time or lies within an observed censoring interval. Second, time-to-event data are usually highly skewed, which makes many standard statistical methods designed for normal data not applicable. In this article, the authors focus on procedures for sample size calculation based on time-to-event data that are subject to right censoring. Three commonly employed testing procedures, namely the exponential model, Cox's proportional hazards model, and the log-rank test, are discussed. The remainder of this article is organized as follows. In Section 2, sample size formulas based on the exponential model are derived. In Section 3, Cox's proportional hazards model is discussed. In Section 4, sample size formulas based on the log-rank test are presented. For each model, the sample size formulas for testing equality, superiority, and equivalence/non-inferiority are derived. Finally, this article concludes with a general discussion.

2 EXPONENTIAL MODEL

In this section, the sample size formulas for testing equality, superiority, and equivalence/noninferiority are derived based on exponentially distributed time-to-event data. In clinical research, the exponential model is probably one of the most commonly used simple parametric models for time-to-event data. It involves only one parameter (i.e., the time-independent hazard rate λ). Many other important parameters (e.g., the median survival time) can be computed from λ. Therefore, the effects of different treatments are usually compared directly in terms of the hazard rate λ. Suppose that a two-arm parallel design is conducted to compare a new treatment with a standard therapy. The duration of the trial is expected to be T with T_0 accrual time. Each patient enters the trial independently with entry time a_ij, where i = 0, 1 is the indicator for treatment group and j is the patient's identification number in the ith treatment group. It is assumed that a_ij follows a continuous distribution with density function given by

g(z) = γ e^{−γz} / (1 − e^{−γT_0}),   0 ≤ z ≤ T_0
where γ is the parameter for the patient accrual pattern. More specifically, γ = 0 corresponds to uniform patient entry, γ > 0 indicates fast patient entry, and γ < 0 implies lagging patient entry. Denoting by t_ij the time-to-event of the jth patient in the ith treatment group, it is assumed that t_ij is exponentially distributed with hazard rate λ_i. Note that t_ij is not always observable because of the finite duration of the trial. Therefore, the observable variable is (x_ij, δ_ij) = (min(t_ij, T − a_ij), I{t_ij ≤ T − a_ij}). In other words, only the smaller of t_ij and T − a_ij is observable. It can be shown that the likelihood function of the x_ij is given by

L(λ_i) = [γ^{n_i} e^{−γ Σ_{j=1}^{n_i} a_ij} / (1 − e^{−γT_0})^{n_i}] λ_i^{Σ_{j=1}^{n_i} δ_ij} e^{−λ_i Σ_{j=1}^{n_i} x_ij}

Thus, the maximum likelihood estimator (MLE) for λ_i can be obtained as

λ̂_i = Σ_{j=1}^{n_i} δ_ij / Σ_{j=1}^{n_i} x_ij

According to the Central Limit Theorem and Slutsky's Theorem, it can be established that

√n_i (λ̂_i − λ_i) →_d N(0, σ²(λ_i))

where

σ²(λ_i) = λ_i² / E(δ_ij) = λ_i² [1 + γ e^{−λ_i T}(1 − e^{(λ_i − γ)T_0}) / ((λ_i − γ)(1 − e^{−γT_0}))]^{−1}     (1)

For more technical details, the readers may refer to Lachin and Foulkes (1) and Chow et al. (2).

2.1 Test for Equality

As indicated earlier, the treatment effect is usually defined as the difference in the hazard rates between treatment groups. Thus, the parameter of interest is given by ε = λ_1 − λ_2, where λ_i is the hazard rate of the ith treatment. In order to test for equality, the following hypotheses are usually considered:

H0: ε = 0 versus Ha: ε ≠ 0

Under the null hypothesis, the following test statistic:

T = (λ̂_1 − λ̂_2) [σ²(λ̂_1)/n_1 + σ²(λ̂_2)/n_2]^{−1/2}

is asymptotically distributed as a standard normal random variable. Therefore, for a given significance level α, the null hypothesis would be rejected if

|λ̂_1 − λ̂_2| [σ²(λ̂_1)/n_1 + σ²(λ̂_2)/n_2]^{−1/2} > z_{α/2}

where z_{α/2} is the (α/2)th upper quantile of a standard normal distribution. On the other hand, under the alternative hypothesis that ε ≠ 0, the power of the above test can be approximated by

Φ(|λ_1 − λ_2| [σ²(λ_1)/n_1 + σ²(λ_2)/n_2]^{−1/2} − z_{α/2})

Hence, the sample size needed for achieving the desired power of 1 − β can be obtained by solving the following equation:

|λ_1 − λ_2| [σ²(λ_1)/n_1 + σ²(λ_2)/n_2]^{−1/2} − z_{α/2} = z_β

Assume that n_1/n_2 = κ; it follows that

n_1 = κn_2,   n_2 = (z_{α/2} + z_β)² [σ²(λ_1)/κ + σ²(λ_2)] / (λ_1 − λ_2)²     (2)
2.2 Test for Noninferiority/Superiority

From a statistical point of view, the problem of testing noninferiority and superiority can be unified by the following hypotheses:

H0: ε ≤ δ versus Ha: ε > δ

where δ is the superiority or non-inferiority margin. For a given significance level α, the null hypothesis would be rejected if

(λ̂_1 − λ̂_2 − δ) [σ²(λ̂_1)/n_1 + σ²(λ̂_2)/n_2]^{−1/2} > z_α

On the other hand, under the alternative hypothesis that ε > δ is true, the power of the above test can be approximated by

Φ((ε − δ) [σ²(λ_1)/n_1 + σ²(λ_2)/n_2]^{−1/2} − z_α)

Thus, the sample size required for achieving the desired power of 1 − β can be obtained by solving the following equation:

(ε − δ) [σ²(λ_1)/n_1 + σ²(λ_2)/n_2]^{−1/2} − z_α = z_β

Assume that n_1/n_2 = κ; it follows that

n_1 = κn_2,   n_2 = (z_α + z_β)² [σ²(λ_1)/κ + σ²(λ_2)] / (ε − δ)²

2.3 Test for Equivalence

Therapeutic equivalence can be established by testing the following interval hypotheses:

H0: |ε| ≥ δ versus Ha: |ε| < δ

The purpose is to reject the null hypothesis of inequivalence and conclude the alternative hypothesis of equivalence. The above interval hypotheses can be partitioned into the following two one-sided hypotheses:

H01: ε ≥ δ versus Ha1: ε < δ
H02: ε ≤ −δ versus Ha2: ε > −δ

Note that tests for the above two one-sided hypotheses are operationally equivalent to the test for the interval hypotheses for equivalence. As a result, for a given significance level α, the null hypothesis would be rejected if

(λ̂_1 − λ̂_2 − δ) [σ²(λ̂_1)/n_1 + σ²(λ̂_2)/n_2]^{−1/2} < −z_α

and

(λ̂_1 − λ̂_2 + δ) [σ²(λ̂_1)/n_1 + σ²(λ̂_2)/n_2]^{−1/2} > z_α

On the other hand, under the alternative hypothesis that |ε| < δ is true, the power of the above tests can be approximated by

Φ((δ − ε) [σ²(λ_1)/n_1 + σ²(λ_2)/n_2]^{−1/2} − z_α) + Φ((δ + ε) [σ²(λ_1)/n_1 + σ²(λ_2)/n_2]^{−1/2} − z_α) − 1

Therefore, assuming that n_1/n_2 = κ, the sample size needed for achieving the power of 1 − β is given by n_1 = κn_2 and

n_2 = (z_α + z_{β/2})² [σ²(λ_1)/κ + σ²(λ_2)] / δ²   if ε = 0
n_2 = (z_α + z_β)² [σ²(λ_1)/κ + σ²(λ_2)] / (δ − |ε|)²   if ε ≠ 0     (3)

2.4 An Example

Consider a cancer trial comparing a new treatment with a standard therapy in treating patients with a specific cancer. The trial is planned for a total of 4 years (T = 4) with 2-year accrual (T_0 = 2). Uniform patient entry (γ = 0) is assumed for both treatment groups. According to a pilot study, it is estimated that the hazard rates for the new treatment and the standard therapy are 1.50 (λ_1 = 1.50) and 2.00 (λ_2 = 2.00), respectively. For γ = 0, the variance function in Equation (1) reduces to

σ²(λ_i) = λ_i² [1 + (e^{−λ_i T} − e^{−λ_i(T − T_0)}) / (λ_i T_0)]^{−1}

This result yields σ²(1.50) = 2.29 and σ²(2.00) = 4.02. Assume equal sample size allocation between treatment groups (κ = 1). By Equation (2), the sample size required for achieving an 80% (β = 0.20) power at the 5% level of significance (α = 0.05) is given by

n = (z_{α/2} + z_β)² [σ²(λ_1)/κ + σ²(λ_2)] / (λ_2 − λ_1)² = (1.96 + 0.84)² (2.29 + 4.02) / (2.00 − 1.50)² ≈ 198
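The arithmetic above can be verified with a short script. The following Python sketch (the function names are illustrative assumptions, not part of the original article) evaluates the γ = 0 variance function and Equation (2) at the values used in this example.

```python
# Sketch: sample size for testing equality of exponential hazard rates,
# Equation (2), with uniform accrual (gamma = 0).
import math

def var_fn(lam, T, T0):
    # sigma^2(lambda) for gamma = 0 (uniform patient entry)
    return lam ** 2 / (1 + (math.exp(-lam * T) - math.exp(-lam * (T - T0))) / (lam * T0))

def n_equality_exponential(lam1, lam2, T, T0, kappa=1.0, z_alpha2=1.96, z_beta=0.84):
    v = var_fn(lam1, T, T0) / kappa + var_fn(lam2, T, T0)
    return math.ceil((z_alpha2 + z_beta) ** 2 * v / (lam1 - lam2) ** 2)

print(round(var_fn(1.50, 4, 2), 2), round(var_fn(2.00, 4, 2), 2))  # about 2.29 and 4.02
print(n_equality_exponential(1.50, 2.00, T=4, T0=2))               # 198 per treatment group
```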
Hence, a total of 396 patients (198 patients per treatment group) are needed for achieving an 80% power.

3 COX'S PROPORTIONAL HAZARDS MODEL

In clinical research, Cox's proportional hazards model (3) is probably one of the most commonly used regression models for time-to-event data. More specifically, let t_i be the time-to-event for the ith subject, C_i be the corresponding censoring time, and z_i be the associated d-dimensional covariates (e.g., treatment indication, demographic information, medical history). Also, let h(t|z) be the hazard rate at time t given the covariates z. It is assumed that h(t|z) = h(t|0) e^{b'z}, where b is the vector of regression coefficients. For the purpose of simplicity, assume that the treatment indication is the only covariate available here. For a more general situation, the readers may refer to Schoenfeld (4).

3.1 Test for Equality

In order to test for equality, the following statistical hypotheses are considered:
H0: b = 0 versus Ha: b ≠ 0

Consider the following test statistic:

T = [Σ_{k=1}^{d} (I_k − Y_{1k}/(Y_{1k} + Y_{2k}))] [Σ_{k=1}^{d} Y_{1k}Y_{2k}/(Y_{1k} + Y_{2k})²]^{−1/2}

where the sum is over the d observed events, I_k is the indicator that the kth observed event comes from the first treatment group, and Y_{ik} denotes the number of subjects in the ith group at risk just prior to the kth observed event. T is the test statistic derived from the partial likelihood developed by Cox (3). As can be seen, it is the same as the commonly used log-rank test statistic. As a result, the procedures introduced in this section can also be viewed as sample size calculation procedures for the log-rank test, but under the proportional hazards assumption. Under the null hypothesis, it can be verified that T is asymptotically distributed as a standard normal random variable. Therefore, for a given significance level α, the null hypothesis would be rejected if |T| > z_{α/2}. On the other hand, under the alternative hypothesis that b ≠ 0 is true, T is approximately distributed as a normal random variable with unit variance and mean given by b(np_1p_2d)^{1/2}, where p_i is the proportion of patients in the ith treatment group and d is the proportion of patients with observed events. Hence, the sample size needed for achieving the power of 1 − β is given by

n = (z_{α/2} + z_β)² / (b² p_1 p_2 d)     (4)

For more technical details, the readers may refer to Schoenfeld (4, 5) and Chow et al. (2).

3.2 Test for Noninferiority/Superiority

As discussed earlier, the hypotheses of testing noninferiority and superiority can be unified by the following statistical hypotheses

H0: b ≤ δ versus Ha: b > δ

where δ is the superiority or non-inferiority margin. Similarly, consider the following test statistic

T = [Σ_{k=1}^{d} (I_k − Y_{1k}e^δ/(Y_{1k}e^δ + Y_{2k}))] [Σ_{k=1}^{d} Y_{1k}Y_{2k}e^δ/(Y_{1k}e^δ + Y_{2k})²]^{−1/2}

The null hypothesis would be rejected at the α level of significance if T > z_α. On the other hand, under the alternative hypothesis that b > δ is true, T is approximately distributed as a normal random variable with unit variance and mean given by (b − δ)(np_1p_2d)^{1/2}. Hence, the sample size needed for achieving the power of 1 − β is given by

n = (z_α + z_β)² / ((b − δ)² p_1 p_2 d)     (5)
3.3 Test for Equivalence

To establish therapeutic equivalence between a test treatment and a control, similarly consider the following interval hypotheses:

H0: |b| ≥ δ versus Ha: |b| < δ

Testing the above interval hypotheses is equivalent to testing the following two one-sided hypotheses:

H01: b ≥ δ versus Ha1: b < δ
H02: b ≤ −δ versus Ha2: b > −δ

As a result, the null hypothesis should be rejected at the α level of significance if

[Σ_{k=1}^{d} (I_k − Y_{1k}e^δ/(Y_{1k}e^δ + Y_{2k}))] [Σ_{k=1}^{d} Y_{1k}Y_{2k}e^δ/(Y_{1k}e^δ + Y_{2k})²]^{−1/2} < −z_α

and

[Σ_{k=1}^{d} (I_k − Y_{1k}e^{−δ}/(Y_{1k}e^{−δ} + Y_{2k}))] [Σ_{k=1}^{d} Y_{1k}Y_{2k}e^{−δ}/(Y_{1k}e^{−δ} + Y_{2k})²]^{−1/2} > z_α

On the other hand, under the alternative hypothesis that |b| < δ is true, the sample size needed for achieving the power of 1 − β is given by

n = (z_α + z_{β/2})² / (δ² p_1 p_2 d)   if b = 0
n = (z_α + z_β)² / ((δ − |b|)² p_1 p_2 d)   if b ≠ 0

3.4 An Example

Consider the same example as described in Section 2.4. Suppose that a constant hazard ratio of e^{1.5} (b = 1.5) is assumed between the standard therapy and the test compound. It is also assumed that the sample size will be evenly distributed between the two groups (p_1 = p_2 = 0.50). On the other hand, based on a pilot study, it is observed that about 20% (d = 0.20) of patients will experience death before the end of the study. According to Equation (4), the sample size needed for achieving an 80% power (β = 0.20) for establishment of therapeutic equivalence at the 5% (α = 0.05) level of significance is given by

n = (z_{α/2} + z_β)² / (b² p_1 p_2 d) = (1.96 + 0.84)² / (1.5² × 0.50 × 0.50 × 0.20) ≈ 70

Therefore, a total of 70 patients (35 patients per treatment group) are needed for achieving the desired power for establishment of therapeutic equivalence between the test treatment and the standard therapy. On the other hand, if the investigator wishes to show superiority with a superiority margin of 40% (δ = 0.40), then, according to Equation (5), the sample size needed is given by

n = (z_α + z_β)² / ((b − δ)² p_1 p_2 d) = (1.64 + 0.84)² / ((1.50 − 0.40)² × 0.50 × 0.50 × 0.20) ≈ 102

Hence, a total of 102 patients (51 patients per treatment group) are needed for showing superiority of the test treatment over the standard therapy with an 80% power at the 5% level of significance.
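The two sample sizes in this example follow directly from Equations (4) and (5). A minimal Python sketch is given below (function names and the rounded critical values 1.96, 1.64, and 0.84 are taken as illustrative assumptions matching the text; exact quantiles shift the results slightly).

```python
# Sketch: Schoenfeld-type sample sizes under the proportional hazards model,
# Equations (4) and (5), using the same rounded critical values as the text.
import math

def n_cox_equality(b, p1, p2, d, z_alpha2=1.96, z_beta=0.84):
    return math.ceil((z_alpha2 + z_beta) ** 2 / (b ** 2 * p1 * p2 * d))

def n_cox_superiority(b, delta, p1, p2, d, z_alpha=1.64, z_beta=0.84):
    return math.ceil((z_alpha + z_beta) ** 2 / ((b - delta) ** 2 * p1 * p2 * d))

print(n_cox_equality(1.5, 0.5, 0.5, 0.2))          # 70 total, as in the example
print(n_cox_superiority(1.5, 0.4, 0.5, 0.5, 0.2))  # 102 total, as in the example
```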
4 LOG-RANK TEST
In practice, it is commonly encountered that the time-to-event data are neither exponentially distributed nor satisfy the proportional hazards assumption. In such a situation, a nonparametric test is often considered for evaluation of the treatment effect. The hypotheses of interest are then given by

H0: S_1(t) = S_2(t) versus Ha: S_1(t) ≠ S_2(t)
For testing the above hypotheses, the following log-rank test statistic is usually considered:

T = [Σ_{i=1}^{d} (I_i − Y_{1i}/(Y_{1i} + Y_{2i}))] [Σ_{i=1}^{d} Y_{1i}Y_{2i}/(Y_{1i} + Y_{2i})²]^{−1/2}

where the sum is over all observed events, I_i is the indicator of the first group, and Y_{1i} and Y_{2i} are the numbers of patients at risk in the two groups just before the ith death. The null hypothesis would be rejected at the α level of significance if |T| > z_{α/2}. Under the alternative hypothesis, however, the asymptotic distribution of T is complicated. As a result, the procedure for sample size calculation is also complicated. In this section, the procedure suggested by Lakatos (6, 7) will be introduced. To calculate the desired sample size, Lakatos (6, 7) suggested that the trial period first be divided into N equally spaced intervals. The length of each interval should be sufficiently small so that the hazard rate and the number of patients at risk can be roughly treated as constant within each interval. Define θ_i as the ratio of the hazards of the event between the two treatment groups in the ith interval, φ_i as the ratio of patients at risk in the two treatment groups in the ith interval, and d_i as the number of deaths within the ith interval, and let ρ_i = d_i/d, where d = Σ d_i. It is further defined that

γ_i = φ_iθ_i/(1 + φ_iθ_i) − φ_i/(1 + φ_i)   and   η_i = φ_i/(1 + φ_i)²

Assuming equal sample size allocation between treatment groups, the sample size needed for achieving the power of 1 − β is given by n = 2d/(p_1 + p_2), where p_i is the proportion of observed events in the ith treatment group and

d = (z_{α/2} + z_β)² (Σ_{i=1}^{N} w_iρ_iη_i) / (Σ_{i=1}^{N} w_iρ_iγ_i)²

where w_i is the weight for the ith interval (w_i = 1 for the standard log-rank test).
More details can be found in Lakatos (6, 7) and Chow et al. (2).
4.1 An Example
To illustrate the procedure for sample size calculation based on the log-rank test statistic, consider a two-year cardiovascular trial with the hazard rates of 0.50 and 1.00 for a test treatment and a control, respectively. It is assumed that the time-to-event is exponentially distributed. However, a log-rank test is used for comparing the two treatments. Assuming equal sample size allocation between treatment groups, the time interval [0, 2] can be partitioned into 20 equally spaced intervals. Within each interval, the parameters needed (e.g., θi , ρi ) can be computed according to the formula introduced in the previous section. As a result, sample size required for achieving a 90% power at the 5% level of significance is given by 139. For more detailed procedures regarding this example, the readers may refer to Lakatos (7) and Chow et al. (2).
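For readers who want to experiment with the Lakatos-type formula quoted above, the following Python sketch implements the event-number and sample-size expressions d = (z_{α/2} + z_β)²(Σ w_iρ_iη_i)/(Σ w_iρ_iγ_i)² and n = 2d/(p_1 + p_2). The per-interval quantities θ_i, φ_i, and ρ_i are treated as user-supplied inputs (in practice they come from the Markov-type recursion described by Lakatos), so the function name and the interface are illustrative assumptions rather than part of the original procedure.

```python
# Sketch of the Lakatos-style sample-size formula for the (weighted) log-rank test.
# theta, phi, rho are lists of per-interval hazard ratios, risk-set ratios, and
# event proportions; they must be computed elsewhere (e.g., via Lakatos' Markov model).
import math

def lakatos_sample_size(theta, phi, rho, p1, p2, w=None, z_alpha2=1.96, z_beta=1.28):
    N = len(theta)
    w = [1.0] * N if w is None else w                 # w_i = 1 gives the standard log-rank test
    gamma = [phi[i] * theta[i] / (1 + phi[i] * theta[i]) - phi[i] / (1 + phi[i])
             for i in range(N)]
    eta = [phi[i] / (1 + phi[i]) ** 2 for i in range(N)]
    num = sum(w[i] * rho[i] * eta[i] for i in range(N))
    den = sum(w[i] * rho[i] * gamma[i] for i in range(N)) ** 2
    d = (z_alpha2 + z_beta) ** 2 * num / den          # required number of events
    return math.ceil(2 * d / (p1 + p2))               # required total sample size

# z_beta defaults to about 1.28 (90% power), matching the power level of the example above.
```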
5 DISCUSSION

Sample size calculation for time-to-event data plays an important role in assuring the success of a clinical trial. In this entry, three commonly used sample size calculation procedures for time-to-event endpoints are briefly introduced. The first is mainly due to Lachin and Foulkes (1) and is based on exponentially distributed data. The second is mainly due to Schoenfeld (4, 5) and is based on Cox's proportional hazards model. The last is due to Lakatos (6) and requires no parametric or semiparametric assumption. Comparing the three procedures, they require progressively fewer parametric or model assumptions, which is a merit from a statistical modeling point of view. However, the complexity of the sample size calculation also increases as the parametric assumptions become weaker. From a practical point of view, the choice of procedure should depend not only on its statistical properties but also on its computational simplicity and stability.
REFERENCES

1. J. M. Lachin and M. A. Foulkes, Evaluation of sample size and power for analysis of survival with allowance for nonuniform patient entry, losses to follow-up, noncompliance, and stratification. Biometrics 1986; 42: 507–519.
2. S. C. Chow, J. Shao, and H. Wang, Sample Size Calculation in Clinical Research. New York: Marcel Dekker, 2003.
3. D. R. Cox, Regression models and life tables. J. Royal Stat. Soc. B 1972; 34: 187–220.
4. D. A. Schoenfeld, Sample-size formula for the proportional-hazards regression model. Biometrics 1983; 39: 499–503.
5. D. A. Schoenfeld, The asymptotic properties of nonparametric tests for comparing survival distributions. Biometrika 1981; 68: 316–319.
6. E. Lakatos, Sample size determination in clinical trials with time-dependent rates of losses and noncompliance. Controlled Clin. Trials 1986; 7: 189–199.
7. E. Lakatos, Sample size based on the log-rank statistic in complex clinical trials. Biometrics 1988; 44: 229–241.
SAMPLE SIZE CALCULATION FOR COMPARING VARIABILITIES
HANSHENG WANG
Guanghua School of Management, Peking University, Department of Business Statistics & Econometrics, Beijing, P. R. China

SHEIN-CHUNG CHOW
Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, North Carolina, USA

1 INTRODUCTION

In practice, the variabilities of responses involved in clinical research can be roughly divided into three categories, namely intra-subject, inter-subject, and total variabilities (1). More specifically, the intra-subject variability is the variability observed from repeated measurements on the same subject under exactly the same experimental conditions. This type of variability is very often due to measurement error, biological variability, and so on. In an ideal situation, the intra-subject variability could be eliminated by averaging an infinite number of repeated observations from the same subject under the same experimental conditions. However, even if this averaging could be done, one would still expect differences in the mean responses between different subjects who receive exactly the same treatment. This difference is referred to as inter-subject variability, which is caused by the differences between subjects. The total variability is simply the sum of the intra-subject and inter-subject variabilities, and it is the variability most often observable in a parallel design. Statistical procedures for comparing intra-subject variabilities are well studied by Chinchilli and Esinhart (2). The problem of comparing inter-subject and total variabilities is studied by Lee et al. (3, 4). A comprehensive summarization can be found in Lee et al. (5) and Chow et al. (6).

The rest of this entry is organized as follows. In the next section, sample size formulas for comparing intra-subject variabilities are derived under both a replicated parallel design and a replicated crossover design. In Section 3, the problem of sample size calculation for comparing inter-subject variabilities is studied. In Section 4, the formulas for sample size determination based on comparing total variabilities of responses between treatment groups are presented. Finally, the entry concludes with a discussion in Section 5.

2 COMPARING INTRA-SUBJECT VARIABILITIES

In order to assess intra-subject variability, repeated measurements obtained from the same subject under the same experimental conditions are necessary. Thus, in practice, a simple parallel design without replicates or a standard 2 × 2 crossover design (TR, RT), which provides repeated measurements only under different experimental conditions, is not sufficient. In this section, only sample size calculations for comparing intra-subject variabilities of responses between treatment groups under a replicated parallel design and a replicated crossover design are considered.

2.1 Parallel Design with Replicates

Consider a two-arm parallel trial with m replicates. For the jth subject in the ith treatment group, let x_ijk be the value obtained in the kth replicate. In practice, the following mixed effects model is usually considered:

x_ijk = µ_i + S_ij + e_ijk     (1)

where µ_i is the mean response under the ith treatment, S_ij is the inter-subject variability, and e_ijk is the intra-subject variability. It is also assumed that, for a fixed i, the S_ij are independent and identically distributed (i.i.d.) as N(0, σ²_Bi), and the e_ijk are i.i.d. N(0, σ²_Wi). Under this model, an unbiased estimator for σ²_Wi can be obtained as follows
σ̂²_Wi = [1/(n_i(m − 1))] Σ_{j=1}^{n_i} Σ_{k=1}^{m} (x_ijk − x̄_ij·)²,   where x̄_ij· = (1/m) Σ_{k=1}^{m} x_ijk     (2)

2.1.1 Test for Equality. For testing equality in intra-subject variabilities of responses between treatment groups, the following hypotheses are usually considered:

H0: σ²_WT = σ²_WR versus Ha: σ²_WT ≠ σ²_WR

Based on Equation (2), the null hypothesis would be rejected at the α level of significance if

σ̂²_WT/σ̂²_WR > F_{α/2,n_T(m−1),n_R(m−1)}   or   σ̂²_WT/σ̂²_WR < F_{1−α/2,n_T(m−1),n_R(m−1)}

On the other hand, under the alternative hypothesis that σ²_WT < σ²_WR, the power of the above test can be approximated by

P(F_{n_R(m−1),n_T(m−1)} > (σ²_WT/σ²_WR) F_{α/2,n_R(m−1),n_T(m−1)})

where F_{n_R(m−1),n_T(m−1)} denotes an F-distributed random variable with (n_R(m − 1), n_T(m − 1)) degrees of freedom. Thus, under the assumption that n = n_T = n_R, the sample size needed for achieving the power of 1 − β can be obtained by solving the following equation:

σ²_WT/σ²_WR = F_{1−β,n(m−1),n(m−1)} / F_{α/2,n(m−1),n(m−1)}     (3)

2.1.2 Test for Non-Inferiority/Superiority. From a statistical point of view, the problem of testing non-inferiority and superiority can be unified by the following hypotheses

H0: σ²_WT/σ²_WR ≥ δ² versus Ha: σ²_WT/σ²_WR < δ²

where δ is the non-inferiority or superiority margin. As a result, the null hypothesis would be rejected at the α level of significance if

σ̂²_WT/(δ²σ̂²_WR) < F_{1−α,n_T(m−1),n_R(m−1)}

On the other hand, under the alternative hypothesis that σ²_WT/σ²_WR < δ², the power of the above test procedure is given by

P(F_{n_R(m−1),n_T(m−1)} > (σ²_WT/(δ²σ²_WR)) F_{α,n_R(m−1),n_T(m−1)})

Thus, assuming that n = n_T = n_R, the sample size required for achieving the power of 1 − β can be obtained by solving the following equation:

σ²_WT/(δ²σ²_WR) = F_{1−β,n(m−1),n(m−1)} / F_{α,n(m−1),n(m−1)}

2.1.3 Test for Similarity. Similarity or equivalence between treatment groups can be established by testing the following hypotheses:

H0: σ²_WT/σ²_WR ≥ δ² or σ²_WT/σ²_WR ≤ 1/δ²   versus   Ha: 1/δ² < σ²_WT/σ²_WR < δ²

where δ > 1 is the similarity limit. The above interval hypotheses can be partitioned into the following two one-sided hypotheses:

H01: σ²_WT/σ²_WR ≥ δ² versus Ha1: σ²_WT/σ²_WR < δ²
H02: σ²_WT/σ²_WR ≤ 1/δ² versus Ha2: σ²_WT/σ²_WR > 1/δ²

As a result, the null hypothesis of dissimilarity would be rejected and similarity would be concluded at the α level of significance if

σ̂²_WT/(δ²σ̂²_WR) < F_{1−α,n_T(m−1),n_R(m−1)}   and   δ²σ̂²_WT/σ̂²_WR > F_{α,n_T(m−1),n_R(m−1)}

On the other hand, under the alternative hypothesis of similarity, a conservative approximation for the power can be obtained as follows (see, for example, Chow et al. (6))

1 − 2P(F_{n(m−1),n(m−1)} > (δ²σ²_WT/σ²_WR) F_{1−α,n(m−1),n(m−1)})

Hence, the sample size needed for achieving the power of 1 − β for establishment of similarity or equivalence in intra-subject variabilities between treatment groups at the α level of significance can be obtained by solving the following equation:

δ²σ²_WT/σ²_WR = F_{β/2,n(m−1),n(m−1)} / F_{1−α,n(m−1),n(m−1)}
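Equations of this form have no closed-form solution in n, but they can be solved by a direct search over n using F-distribution quantiles. The following is a minimal Python sketch (the use of SciPy and the function name are illustrative assumptions, not part of the original entry); it applies the equality equation of Section 2.1.1 to the values σ_WT = 0.30, σ_WR = 0.45, and m = 3 used in the example of Section 2.1.4 below.

```python
# Sketch: solve sigma_WT^2 / sigma_WR^2 = F_{1-beta, n(m-1), n(m-1)} / F_{alpha/2, n(m-1), n(m-1)}
# by searching over n. Here F_{p, a, b} denotes the upper p-th quantile,
# i.e., scipy's f.ppf(1 - p, a, b).
from scipy.stats import f

def n_equality_intra_subject(sigma_wt, sigma_wr, m, alpha=0.05, beta=0.20, n_max=10_000):
    """Smallest n per arm with approximate power >= 1 - beta (assumes sigma_wt < sigma_wr)."""
    ratio = sigma_wt ** 2 / sigma_wr ** 2
    for n in range(2, n_max):
        df = n * (m - 1)
        upper_1mb = f.ppf(beta, df, df)            # F_{1-beta, df, df}
        upper_a2 = f.ppf(1 - alpha / 2, df, df)    # F_{alpha/2, df, df}
        if upper_1mb / upper_a2 >= ratio:          # power reaches 1 - beta
            return n
    raise ValueError("no n found below n_max")

# Should land at or very near the n = 25 reported in the example below
# (small differences of +/- 1 can occur depending on rounding conventions).
print(n_equality_intra_subject(0.30, 0.45, m=3))
```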
Note that detailed derivation of the above procedures for sample size calculation for comparing intra-subject variabilities can be found in Lee et al. (5) and Chow et al. (6).

2.1.4 An Example. Consider a two-arm parallel trial with three replicates (m = 3) comparing intra-subject variabilities of bioavailability of a test formulation with a reference formulation of a drug product. Based on a pilot study, it is estimated that the standard deviation for the test formulation is about 30% (σ_WT = 0.30), whereas the standard deviation for the reference formulation is about 45% (σ_WR = 0.45). The investigator is interested in selecting a sample size so that a significant difference in the intra-subject variability between the test and reference formulations can be detected at the 5% (α = 0.05) level of significance with an 80% (β = 0.20) power. Thus, according to Equation (3), the sample size needed can be obtained by solving

0.30²/0.45² = F_{0.80,2n,2n} / F_{0.025,2n,2n}

which leads to n = 25. Hence, a total of 50 subjects (25 per arm) are needed in order to achieve the desired power for detecting such a difference in intra-subject variability between treatment groups.

2.2 Replicated Crossover Design

In this section, consider a 2 × 2m crossover design comparing two treatments (test and reference) with m replicates. Let n_i be the number of subjects assigned to the ith sequence and x_ijkl be the response from the jth subject in the ith sequence under the lth replicate of the kth treatment (k = T, R). The following mixed effects model is usually considered for data from a 2 × 2m crossover trial:

x_ijkl = µ_k + γ_ikl + S_ijk + ε_ijkl     (4)

where µ_k is the mean response of the kth formulation, γ_ikl is the fixed effect of the lth replicate under the kth treatment in the ith sequence with constraint

Σ_{i=1}^{2} Σ_{l=1}^{m} γ_ikl = 0

and S_ijT and S_ijR are the subject random effects of the jth subject in the ith sequence. The (S_ijT, S_ijR)'s are assumed to be i.i.d. bivariate normal random vectors with mean (0, 0)'. As S_ijT and S_ijR are two observations taken from the same subject, they are not independent of each other. The following covariance matrix between S_ijT and S_ijR is usually assumed to describe their relationship:

Σ_B = [ σ²_BT        ρσ_BTσ_BR
        ρσ_BTσ_BR    σ²_BR ]

The ε_ijkl's are assumed to be i.i.d. as N(0, σ²_Wk). It is also assumed that (S_ijT, S_ijR) and ε_ijkl are independent. Note that σ²_BT and σ²_BR are the inter-subject variances and σ²_WT and σ²_WR are the intra-subject variances.
In order to obtain estimators for the intra-subject variances, a new random variable z_ijkl is defined through an orthogonal transformation z_ijk = P'x_ijk, where

x_ijk = (x_ijk1, x_ijk2, . . . , x_ijkm)',   z_ijk = (z_ijk1, z_ijk2, . . . , z_ijkm)'

and P is an m × m orthogonal matrix with first column given by (1, 1, . . . , 1)'/√m. It can be verified that, for a fixed i and any l > 1, the z_ijkl are i.i.d. normal random variables with variance σ²_Wk. Therefore, σ²_Wk can be estimated by

σ̂²_Wk = [1/((n_1 + n_2 − 2)(m − 1))] Σ_{l=2}^{m} Σ_{i=1}^{2} Σ_{j=1}^{n_i} (z_ijkl − z̄_i·kl)²,   where z̄_i·kl = (1/n_i) Σ_{j=1}^{n_i} z_ijkl     (5)

It should be noted that d σ̂²_Wk/σ²_Wk is χ²-distributed with d = (n_1 + n_2 − 2)(m − 1) degrees of freedom, and that σ̂²_WT and σ̂²_WR are mutually independent. More details can be found in Chinchilli and Esinhart (2) and Chow et al. (6).

2.2.1 Test for Equality. Similarly, consider the following hypotheses for testing equality in intra-subject variabilities of responses between treatment groups:

H0: σ²_WT = σ²_WR versus Ha: σ²_WT ≠ σ²_WR

Thus, the null hypothesis would be rejected at the α level of significance if

σ̂²_WT/σ̂²_WR > F_{α/2,d,d}   or   σ̂²_WT/σ̂²_WR < F_{1−α/2,d,d}

On the other hand, under the alternative hypothesis that σ²_WT < σ²_WR, the power of the above test is given by

P(F_{d,d} > (σ²_WT/σ²_WR) F_{α/2,d,d})

Thus, assuming that n = n_1 = n_2, the sample size needed for achieving the power of 1 − β can be obtained by solving the following equation:

σ²_WT/σ²_WR = F_{1−β,(2n−2)(m−1),(2n−2)(m−1)} / F_{α/2,(2n−2)(m−1),(2n−2)(m−1)}     (6)

2.2.2 Test for Non-Inferiority/Superiority. For testing non-inferiority and superiority, similarly, consider the following hypotheses:

H0: σ²_WT/σ²_WR ≥ δ² versus Ha: σ²_WT/σ²_WR < δ²

Then reject the null hypothesis at the α level of significance if

σ̂²_WT/(δ²σ̂²_WR) < F_{1−α,d,d}

On the other hand, under the alternative hypothesis that σ²_WT/σ²_WR < δ², the power of the above test is given by

P(F_{d,d} > (σ²_WT/(δ²σ²_WR)) F_{α,d,d})

Hence, assuming that n = n_1 = n_2, the sample size required for achieving the power of 1 − β can be obtained by solving

σ²_WT/(δ²σ²_WR) = F_{1−β,(2n−2)(m−1),(2n−2)(m−1)} / F_{α,(2n−2)(m−1),(2n−2)(m−1)}

2.2.3 Test for Similarity. For testing similarity, similarly, consider the following interval hypotheses:

H0: σ²_WT/σ²_WR ≥ δ² or σ²_WT/σ²_WR ≤ 1/δ²   versus   Ha: 1/δ² < σ²_WT/σ²_WR < δ²

where δ > 1 is the equivalence limit. Testing the above interval hypotheses is equivalent to testing the following two one-sided hypotheses:

H01: σ²_WT/σ²_WR ≥ δ² versus Ha1: σ²_WT/σ²_WR < δ²
H02: σ²_WT/σ²_WR ≤ 1/δ² versus Ha2: σ²_WT/σ²_WR > 1/δ²

Thus, the null hypothesis of dissimilarity would be rejected and similarity would be concluded at the α level of significance if

σ̂²_WT/(δ²σ̂²_WR) < F_{1−α,d,d}   and   δ²σ̂²_WT/σ̂²_WR > F_{α,d,d}

On the other hand, under the alternative hypothesis of similarity, a conservative approximation to the power as given in Chow et al. (6) is

1 − 2P(F_{d,d} > (δ²σ²_WT/σ²_WR) F_{1−α,d,d})

Hence, assuming that n = n_1 = n_2, the sample size required for achieving the power of 1 − β can be obtained by solving the following equation:

δ²σ²_WT/σ²_WR = F_{β/2,(2n−2)(m−1),(2n−2)(m−1)} / F_{1−α,(2n−2)(m−1),(2n−2)(m−1)}

Note that the detailed derivation of the above procedures for sample size calculation for comparing intra-subject variabilities between treatment groups under a 2 × 2m crossover design can be found in Lee et al. (5) and Chow et al. (6).

2.2.4 An Example. Consider the same example as described in the previous section. However, the study design is now a standard 2 × 6 crossover design (m = 3). According to Equation (6), the sample size needed for achieving an 80% power (β = 0.20) at the 5% (α = 0.05) level of significance can be obtained by solving

0.30²/0.45² = F_{0.80,4(n−1),4(n−1)} / F_{0.025,4(n−1),4(n−1)}

which gives n = 14. As a result, a total of 28 subjects (14 subjects per sequence) are needed in order to achieve the desired power for detecting a 15% difference in intra-subject variability between treatment groups.
3 COMPARING INTER-SUBJECT VARIABILITIES

Unlike the comparison of intra-subject variabilities, an estimator for inter-subject variability can only be obtained under a replicated design, and it usually can be expressed as a linear combination of several variance component estimates; hence its sampling distribution is relatively difficult to derive. Howe (7), Graybill and Wang (8), and Hyslop et al. (9) developed various methods for estimation of inter-subject variabilities. One important assumption for the validity of these methods is that the variance component estimators involved in the estimation must be independent of one another. Lee et al. (3) generalized these methods to the situation where some variance components are dependent on one another. The sample size formulas introduced in this section are mostly based on the methods by Lee et al. (3).

3.1 Parallel Design with Replicates

Under the model in Equation (1), consider the following statistics:

s²_Bi = [1/(n_i − 1)] Σ_{j=1}^{n_i} (x̄_ij· − x̄_i··)²,   where x̄_i·· = (1/n_i) Σ_{j=1}^{n_i} x̄_ij·     (7)

Thus, σ²_Bi can be estimated by

σ̂²_Bi = s²_Bi − (1/m) σ̂²_Wi

where x̄_ij· and σ̂²_Wi are as defined in Equation (2).
3.1.1 Test for Equality. For testing equality in inter-subject variabilities of responses between treatment groups, consider the following hypotheses:

H0: η = 0 versus Ha: η ≠ 0

where η = σ²_BT − σ²_BR and can be estimated by

η̂ = σ̂²_BT − σ̂²_BR = s²_BT − s²_BR − σ̂²_WT/m + σ̂²_WR/m

For a given significance level α, a (1 − α) × 100% confidence interval for η can be obtained as (η̂_L, η̂_U) = (η̂ − √Δ_L, η̂ + √Δ_U), where

Δ_L = s⁴_BT(1 − (n_T − 1)/χ²_{α/2,n_T−1})² + s⁴_BR(1 − (n_R − 1)/χ²_{1−α/2,n_R−1})² + (σ̂⁴_WT/m²)(1 − n_T(m − 1)/χ²_{1−α/2,n_T(m−1)})² + (σ̂⁴_WR/m²)(1 − n_R(m − 1)/χ²_{α/2,n_R(m−1)})²

Δ_U = s⁴_BT(1 − (n_T − 1)/χ²_{1−α/2,n_T−1})² + s⁴_BR(1 − (n_R − 1)/χ²_{α/2,n_R−1})² + (σ̂⁴_WT/m²)(1 − n_T(m − 1)/χ²_{α/2,n_T(m−1)})² + (σ̂⁴_WR/m²)(1 − n_R(m − 1)/χ²_{1−α/2,n_R(m−1)})²

Thus, the null hypothesis would be rejected at the α level of significance if 0 ∉ (η̂_L, η̂_U). On the other hand, under the alternative hypothesis that η ≠ 0 is true, the power of the above test can be approximated by

1 − Φ(z_{α/2} − √n |σ²_BT − σ²_BR| / σ*)

where

σ*² = 2[(σ²_BT + σ²_WT/m)² + (σ²_BR + σ²_WR/m)² + σ⁴_WT/(m²(m − 1)) + σ⁴_WR/(m²(m − 1))]

As a result, the sample size needed for achieving the power of 1 − β for detecting a meaningful difference in the inter-subject variability between treatment groups at the α level of significance is given by

n = σ*²(z_{α/2} + z_β)² / (σ²_BT − σ²_BR)²

3.1.2 Test for Non-Inferiority/Superiority. Similar to testing for non-inferiority/superiority in the intra-subject variability between treatment groups, the problem of testing non-inferiority and superiority in the inter-subject variability between treatment groups can also be unified by the following hypotheses:

H0: η ≥ 0 versus Ha: η < 0

where η = σ²_BT − δ²σ²_BR. For a given significance level α, its (1 − α) × 100%th upper confidence bound is given by η̂_U = η̂ + √Δ_U, where

Δ_U = s⁴_BT(1 − (n_T − 1)/χ²_{1−α,n_T−1})² + δ⁴s⁴_BR(1 − (n_R − 1)/χ²_{α,n_R−1})² + (σ̂⁴_WT/m²)(1 − n_T(m − 1)/χ²_{α,n_T(m−1)})² + (δ⁴σ̂⁴_WR/m²)(1 − n_R(m − 1)/χ²_{1−α,n_R(m−1)})²

Therefore, the null hypothesis would be rejected at the α level of significance if η̂_U < 0. On the other hand, under the alternative hypothesis, the power of the above test can be approximated by

Φ(−z_α − √n (σ²_BT − δ²σ²_BR) / σ*)

where

σ*² = 2[(σ²_BT + σ²_WT/m)² + δ⁴(σ²_BR + σ²_WR/m)² + σ⁴_WT/(m²(m − 1)) + δ⁴σ⁴_WR/(m²(m − 1))]

Hence, the sample size needed for achieving the power of 1 − β is given by

n = σ*²(z_α + z_β)² / (σ²_BT − δ²σ²_BR)²

3.1.3 An Example. Consider a two-arm parallel design with three replicates (m = 3) for each patient. Suppose that the investigator is interested in comparing the inter-subject variability of the pharmacokinetic parameters collected from the patients. Suppose that from a pilot study it is estimated that σ_BT = 0.35, σ_BR = 0.45, σ_WT = 0.25, and σ_WR = 0.20. It follows that

σ*² = 2[(0.35² + 0.25²/3)² + (0.45² + 0.20²/3)² + 0.25⁴/(3²(3 − 1)) + 0.20⁴/(3²(3 − 1))] = 0.135

Thus, the sample size needed for achieving an 80% power (β = 0.20) at the 5% level of significance (α = 0.05) is given by

n = σ*²(z_{α/2} + z_β)² / (σ²_BT − σ²_BR)² = 0.135(1.96 + 0.84)² / (0.35² − 0.45²)² ≈ 166
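As a quick numerical check of this calculation, the following Python sketch (the function name is an illustrative assumption, not part of the original entry) evaluates σ*² and the resulting n from the large-sample formula above.

```python
# Sketch: sample size for comparing inter-subject variabilities under a
# replicated parallel design (normal-approximation formula from this section),
# using the rounded critical values 1.96 and 0.84 as in the text.
import math

def n_inter_subject_parallel(s_bt, s_br, s_wt, s_wr, m, z_alpha2=1.96, z_beta=0.84):
    sigma_star2 = 2 * ((s_bt**2 + s_wt**2 / m) ** 2
                       + (s_br**2 + s_wr**2 / m) ** 2
                       + s_wt**4 / (m**2 * (m - 1))
                       + s_wr**4 / (m**2 * (m - 1)))
    n = sigma_star2 * (z_alpha2 + z_beta) ** 2 / (s_bt**2 - s_br**2) ** 2
    return sigma_star2, math.ceil(n)

s2, n = n_inter_subject_parallel(0.35, 0.45, 0.25, 0.20, m=3)
print(round(s2, 3), n)   # about 0.135 and 166, matching the example
```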
Therefore, a total of 332 patients (166 patients per treatment group) are needed in order to achieve the desired power for detecting a 10% difference in the inter-subject variability between treatment groups at the α level of significance.

3.2 Replicated Crossover Design

Under the model in Equation (4), the inter-subject variabilities can be estimated by

σ̂²_BT = s²_BT − (1/m) σ̂²_WT   and   σ̂²_BR = s²_BR − (1/m) σ̂²_WR

where x̄_i·k· = Σ_{j=1}^{n_i} x̄_ijk·/n_i, n = n_1 + n_2, and

s²_BT = [1/(n − 2)] Σ_{i=1}^{2} Σ_{j=1}^{n_i} (x̄_ijT· − x̄_i·T·)²   and   s²_BR = [1/(n − 2)] Σ_{i=1}^{2} Σ_{j=1}^{n_i} (x̄_ijR· − x̄_i·R·)²     (8)
3.2.1 Test for Equality. For testing equality in inter-subject variabilities of responses between treatment groups, similarly, consider the following hypotheses:

H0: η = 0 versus Ha: η ≠ 0

where η = σ²_BT − σ²_BR and can be estimated by

η̂ = σ̂²_BT − σ̂²_BR = s²_BT − s²_BR − σ̂²_WT/m + σ̂²_WR/m

According to Lee et al. (3), an approximate (1 − α)th confidence interval for η is given by (η̂_L, η̂_U) = (η̂ − √Δ_L, η̂ + √Δ_U), where

Δ_L = λ̂_1²(1 − (n_s − 1)/χ²_{α/2,n_s−1})² + λ̂_2²(1 − (n_s − 1)/χ²_{1−α/2,n_s−1})² + (σ̂⁴_WT/m²)(1 − n_s(m − 1)/χ²_{α/2,n_s(m−1)})² + (σ̂⁴_WR/m²)(1 − n_s(m − 1)/χ²_{1−α/2,n_s(m−1)})²

Δ_U = λ̂_1²(1 − (n_s − 1)/χ²_{1−α/2,n_s−1})² + λ̂_2²(1 − (n_s − 1)/χ²_{α/2,n_s−1})² + (σ̂⁴_WT/m²)(1 − n_s(m − 1)/χ²_{1−α/2,n_s(m−1)})² + (σ̂⁴_WR/m²)(1 − n_s(m − 1)/χ²_{α/2,n_s(m−1)})²

with

λ̂_i = [s²_BT − s²_BR ± √((s²_BT + s²_BR)² − 4s⁴_BTR)] / 2

and n_s = n_1 + n_2 − 2. Thus, the null hypothesis would be rejected at the α level of significance if 0 ∉ (η̂_L, η̂_U). On the other hand, under the alternative hypothesis, the power of the above test can be approximated by

1 − Φ(z_{α/2} − √n_s |σ²_BT − σ²_BR| / σ*)

where

σ*² = 2[(σ²_BT + σ²_WT/m)² + (σ²_BR + σ²_WR/m)² − 2ρ²σ²_BTσ²_BR + σ⁴_WT/(m²(m − 1)) + σ⁴_WR/(m²(m − 1))]

Thus, the total sample size needed for achieving the power of 1 − β is given by

n = σ*²(z_{α/2} + z_β)² / (σ²_BT − σ²_BR)² + 2

3.2.2 Test for Non-Inferiority/Superiority. For testing non-inferiority and superiority, similarly, consider the following hypotheses:

H0: η ≥ 0 versus Ha: η < 0

where η = σ²_BT − δ²σ²_BR. For a given significance level of α, an approximate (1 − α)th upper confidence bound for η can be constructed as η̂_U = η̂ + √Δ_U, where

Δ_U = λ̂_1²(1 − (n_s − 1)/χ²_{1−α/2,n_s−1})² + λ̂_2²(1 − (n_s − 1)/χ²_{α/2,n_s−1})² + (σ̂⁴_WT/m²)(1 − n_s(m − 1)/χ²_{1−α/2,n_s(m−1)})² + (δ⁴σ̂⁴_WR/m²)(1 − n_s(m − 1)/χ²_{α/2,n_s(m−1)})²

and

λ̂_i = [s²_BT − δ²s²_BR ± √((s²_BT + δ²s²_BR)² − 4δ²s⁴_BTR)] / 2

Thus, the null hypothesis is rejected at the α level of significance if η̂_U < 0. On the other hand, under the alternative hypothesis, the power of the above test can be approximated by

Φ(−z_α − √n_s (σ²_BT − δ²σ²_BR) / σ*)

where

σ*² = 2[(σ²_BT + σ²_WT/m)² + δ⁴(σ²_BR + σ²_WR/m)² − 2δ²ρ²σ²_BTσ²_BR + σ⁴_WT/(m²(m − 1)) + δ⁴σ⁴_WR/(m²(m − 1))]

Hence, the sample size needed for achieving the power of 1 − β is given by

n = σ*²(z_α + z_β)² / (σ²_BT − δ²σ²_BR)² + 2

3.2.3 An Example. To illustrate the use of the sample size formulas derived above, consider a standard 2 × 4 (m = 2) crossover design (ABAB, BABA). The objective is to compare the inter-subject variabilities of a test treatment and a control. From a pilot study, it is estimated that ρ = 0.65, σ_BT = 0.35, σ_BR = 0.45, σ_WT = 0.25, and σ_WR = 0.35. Based on this information, it follows that

σ*² = 2[(0.35² + 0.25²/2)² + (0.45² + 0.35²/2)² − 2(0.65 × 0.35 × 0.45)² + 0.25⁴/2² + 0.35⁴/2²] = 0.154

Hence, the sample size needed for achieving an 80% power (β = 0.20) at the 5% (α = 0.05) level of significance is given by

n = 0.154(1.96 + 0.84)² / (0.35² − 0.45²)² + 2 ≈ 191

As a result, a total of 191 subjects are needed for achieving the desired power.
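The calculation above can be reproduced with a short script. The sketch below (function name and layout are illustrative assumptions) evaluates σ*² and n for the replicated crossover case, including the "+ 2" adjustment that appears in the formula of Section 3.2.1.

```python
# Sketch: sample size for comparing inter-subject variabilities under a 2 x 2m
# replicated crossover design, using the rounded critical values from the text.
import math

def n_inter_subject_crossover(s_bt, s_br, s_wt, s_wr, rho, m, z_alpha2=1.96, z_beta=0.84):
    sigma_star2 = 2 * ((s_bt**2 + s_wt**2 / m) ** 2
                       + (s_br**2 + s_wr**2 / m) ** 2
                       - 2 * rho**2 * s_bt**2 * s_br**2
                       + s_wt**4 / (m**2 * (m - 1))
                       + s_wr**4 / (m**2 * (m - 1)))
    n = sigma_star2 * (z_alpha2 + z_beta) ** 2 / (s_bt**2 - s_br**2) ** 2 + 2
    return sigma_star2, math.ceil(n)

s2, n = n_inter_subject_crossover(0.35, 0.45, 0.25, 0.35, rho=0.65, m=2)
print(round(s2, 3), n)   # about 0.154 and 191, matching the example
```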
4 COMPARING TOTAL VARIABILITIES
In practice, in addition to the intra-subject and inter-subject variabilities, the total variability is also of interest to researchers. The total variability is defined as the sum of the intra-subject and inter-subject variabilities. As the total variability is observable even in an experiment without replicates, both replicated and nonreplicated designs (parallel and crossover) are discussed in this section.

4.1 Parallel Designs Without Replicates

Consider a parallel design without replicates. In this case, the model in Equation (1) reduces to

x_ij = µ_i + ε_ij

where ε_ij is assumed to be i.i.d. as N(0, σ²_Ti). In this case, the total variability σ²_Ti can be estimated by

σ̂²_Ti = [1/(n_i − 1)] Σ_{j=1}^{n_i} (x_ij − x̄_i·)²,   where x̄_i· = (1/n_i) Σ_{j=1}^{n_i} x_ij

4.1.1 Test for Equality. For testing equality in total variability between treatment groups, the hypotheses become

H0: σ²_TT = σ²_TR versus Ha: σ²_TT ≠ σ²_TR

Then, the null hypothesis is rejected at the α level of significance if

σ̂²_TT/σ̂²_TR > F_{α/2,n_T−1,n_R−1}   or   σ̂²_TT/σ̂²_TR < F_{1−α/2,n_T−1,n_R−1}

Under the alternative hypothesis that σ²_TT < σ²_TR, the power of the above test is given by

P(F_{n_R−1,n_T−1} > (σ²_TT/σ²_TR) F_{α/2,n_R−1,n_T−1})

Thus, assuming that n = n_R = n_T, the sample size needed for achieving the power of 1 − β can be obtained by solving the following equation:

σ²_TT/σ²_TR = F_{1−β,n−1,n−1} / F_{α/2,n−1,n−1}

4.1.2 Test for Non-Inferiority/Superiority. For testing non-inferiority and superiority, consider the following unified hypotheses:

H0: σ²_TT/σ²_TR ≥ δ² versus Ha: σ²_TT/σ²_TR < δ²

where δ is the non-inferiority or superiority margin. Thus, the null hypothesis would be rejected at the α level of significance if

σ̂²_TT/(δ²σ̂²_TR) < F_{1−α,n_T−1,n_R−1}

On the other hand, under the alternative hypothesis, the power of the above test is given by

P(F_{n_R−1,n_T−1} > (σ²_TT/(δ²σ²_TR)) F_{α,n_R−1,n_T−1})

Hence, the sample size needed for achieving the power of 1 − β can be obtained by solving the following equation:

σ²_TT/(δ²σ²_TR) = F_{1−β,n−1,n−1} / F_{α,n−1,n−1}

4.1.3 Test for Similarity. Similarly, similarity between treatment groups can be established by testing the following interval hypotheses:

H0: σ²_TT/σ²_TR ≥ δ² or σ²_TT/σ²_TR ≤ 1/δ²   versus   Ha: 1/δ² < σ²_TT/σ²_TR < δ²

where δ > 1 is the similarity limit. As indicated earlier, testing the above interval hypotheses is equivalent to testing the following two one-sided hypotheses:

H01: σ²_TT/σ²_TR ≥ δ² versus Ha1: σ²_TT/σ²_TR < δ²
H02: σ²_TT/σ²_TR ≤ 1/δ² versus Ha2: σ²_TT/σ²_TR > 1/δ²

Thus, for a given significance level α, the null hypothesis of dissimilarity is rejected and the alternative hypothesis of similarity is accepted if

σ̂²_TT/(δ²σ̂²_TR) < F_{1−α,n_T−1,n_R−1}   and   δ²σ̂²_TT/σ̂²_TR > F_{α,n_T−1,n_R−1}

On the other hand, under the alternative hypothesis of similarity, a conservative approximation to the power is given by (see, for example, Chow et al. (6))

1 − 2P(F_{n−1,n−1} > (δ²σ²_TT/σ²_TR) F_{1−α,n−1,n−1})

Hence, the sample size needed for achieving the power of 1 − β can be obtained by solving the following equation:

δ²σ²_TT/σ²_TR = F_{β/2,n−1,n−1} / F_{1−α,n−1,n−1}

4.1.4 An Example. Consider a two-arm parallel design comparing total variabilities of a test treatment with a reference treatment. It is estimated from a pilot study that σ_TT = 0.55 and σ_TR = 0.60. Suppose that the investigator wishes to establish non-inferiority of the test treatment as compared with the reference treatment with a non-inferiority margin of 10% (δ = 1.10). The sample size needed for achieving an 80% (β = 0.20) power at the 5% (α = 0.05) level of significance can be obtained by solving the following equation:

0.55²/(1.10² × 0.60²) = F_{0.20,n−1,n−1} / F_{0.05,n−1,n−1}

which gives n = 22. Hence, a total of 44 subjects (22 subjects per treatment group) are needed.

4.2 Parallel Design with Replicates

In this section, focus is placed on the replicated parallel design as described by the model in Equation (1). Under the model in Equation (1), the total variabilities can be estimated by

σ̂²_Ti = s²_Bi + [(m − 1)/m] σ̂²_Wi

where s²_Bi is as defined in Equation (7) in Section 3.

4.2.1 Test for Equality. For testing equality in total variabilities of responses between treatment groups, the following statistical hypotheses are usually considered:

H0: η = 0 versus Ha: η ≠ 0

where η = σ²_TT − σ²_TR. η can be estimated by

η̂ = σ̂²_TT − σ̂²_TR
For a given significance level α, an approximate (1 − α) × 100% confidence interval of η can be constructed as (η̂_L, η̂_U) = (η̂ − √Δ_L, η̂ + √Δ_U), where

Δ_L = s⁴_BT(1 − (n_T − 1)/χ²_{α/2,n_T−1})² + s⁴_BR(1 − (n_R − 1)/χ²_{1−α/2,n_R−1})² + [(m − 1)²σ̂⁴_WT/m²](1 − n_T(m − 1)/χ²_{1−α/2,n_T(m−1)})² + [(m − 1)²σ̂⁴_WR/m²](1 − n_R(m − 1)/χ²_{α/2,n_R(m−1)})²

and

Δ_U = s⁴_BT(1 − (n_T − 1)/χ²_{1−α/2,n_T−1})² + s⁴_BR(1 − (n_R − 1)/χ²_{α/2,n_R−1})² + [(m − 1)²σ̂⁴_WT/m²](1 − n_T(m − 1)/χ²_{α/2,n_T(m−1)})² + [(m − 1)²σ̂⁴_WR/m²](1 − n_R(m − 1)/χ²_{1−α/2,n_R(m−1)})²

Thus, the null hypothesis is rejected at the α level of significance if 0 ∉ (η̂_L, η̂_U). On the other hand, under the alternative hypothesis, assuming that n = n_T = n_R, the power of the above test can be approximated by

1 − Φ(z_{α/2} − √n |σ²_TT − σ²_TR| / σ*)

where

σ*² = 2[(σ²_BT + σ²_WT/m)² + (σ²_BR + σ²_WR/m)² + (m − 1)σ⁴_WT/m² + (m − 1)σ⁴_WR/m²]

Hence, the sample size needed for achieving the desired power of 1 − β is given by

n = σ*²(z_{α/2} + z_β)² / (σ²_TT − σ²_TR)²

4.2.2 Test for Non-Inferiority/Superiority. For testing non-inferiority/superiority, similarly, consider the following unified hypotheses:

H0: η ≥ 0 versus Ha: η < 0

where η = σ²_TT − δ²σ²_TR. For a given significance level of α, an approximate (1 − α)th upper confidence bound of η can be constructed as η̂_U = η̂ + √Δ_U, where η̂ = σ̂²_TT − δ²σ̂²_TR and Δ_U is given by

Δ_U = s⁴_BT(1 − (n_T − 1)/χ²_{1−α,n_T−1})² + δ⁴s⁴_BR(1 − (n_R − 1)/χ²_{α,n_R−1})² + [(m − 1)²σ̂⁴_WT/m²](1 − n_T(m − 1)/χ²_{α,n_T(m−1)})² + [δ⁴(m − 1)²σ̂⁴_WR/m²](1 − n_R(m − 1)/χ²_{1−α,n_R(m−1)})²

Thus, the null hypothesis is rejected at the α level of significance if η̂_U < 0. The power of the above test can be approximated by

Φ(−z_α − √n (σ²_TT − δ²σ²_TR) / σ*)

where

σ*² = 2[(σ²_BT + σ²_WT/m)² + δ⁴(σ²_BR + σ²_WR/m)² + (m − 1)σ⁴_WT/m² + δ⁴(m − 1)σ⁴_WR/m²]

Hence, the sample size needed for achieving the desired power of 1 − β is given by

n = σ*²(z_α + z_β)² / (σ²_TT − δ²σ²_TR)²

4.2.3 An Example. Consider a two-arm parallel design with three replicates (m = 3) comparing total variabilities of responses between a test treatment and a control. It is assumed that σ_BT = 0.35, σ_BR = 0.45, σ_WT = 0.25, and σ_WR = 0.35. Suppose that one of the primary objectives is to claim that a significant difference exists in the total variabilities of responses between the test treatment and the control. It follows that

σ*² = 2[(0.35² + 0.25²/3)² + (0.45² + 0.35²/3)² + (3 − 1)0.25⁴/3² + (3 − 1)0.35⁴/3²] = 0.168

Hence, the sample size needed for achieving an 80% power (β = 0.20) at the 5% (α = 0.05) level of significance can be obtained as

n = 0.168(1.96 + 0.84)² / (0.35² + 0.25² − (0.45² + 0.35²))² ≈ 68
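The following Python sketch (function name and layout are illustrative assumptions) reproduces this calculation from the normal-approximation formula of Section 4.2.1.

```python
# Sketch: sample size for comparing total variabilities under a replicated
# parallel design, using the rounded critical values from the text.
import math

def n_total_parallel_replicated(s_bt, s_br, s_wt, s_wr, m, z_alpha2=1.96, z_beta=0.84):
    sigma_star2 = 2 * ((s_bt**2 + s_wt**2 / m) ** 2
                       + (s_br**2 + s_wr**2 / m) ** 2
                       + (m - 1) * s_wt**4 / m**2
                       + (m - 1) * s_wr**4 / m**2)
    s2_tt = s_bt**2 + s_wt**2      # total variance, test
    s2_tr = s_br**2 + s_wr**2      # total variance, reference
    n = sigma_star2 * (z_alpha2 + z_beta) ** 2 / (s2_tt - s2_tr) ** 2
    return sigma_star2, math.ceil(n)

s2, n = n_total_parallel_replicated(0.35, 0.45, 0.25, 0.35, m=3)
print(round(s2, 3), n)   # about 0.168 and 68, matching the example
```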
As a result, a total of 136 subjects (68 subjects per treatment group) are needed.

4.3 The Standard 2 × 2 Crossover Design

Under the standard 2 × 2 crossover design, the notation defined in the model in Equation (4) can still be used; however, the subscript l is omitted because there are no replicates. Under the model in Equation (4), the total variabilities can be estimated by

σ̂²_TT = [1/(n_1 + n_2 − 2)] Σ_{i=1}^{2} Σ_{j=1}^{n_i} (x_ijT − x̄_i·T)²   and   σ̂²_TR = [1/(n_1 + n_2 − 2)] Σ_{i=1}^{2} Σ_{j=1}^{n_i} (x_ijR − x̄_i·R)²

where x̄_i·T = (1/n_i) Σ_{j=1}^{n_i} x_ijT and x̄_i·R = (1/n_i) Σ_{j=1}^{n_i} x_ijR.
4.3.1 Test for Equality. For testing equality in total variabilities between treatment groups, similarly, consider the following hypotheses:

H0: η = 0 versus Ha: η ≠ 0

where η = σ²_TT − σ²_TR. η can be estimated by η̂ = σ̂²_TT − σ̂²_TR. Define

σ̂²_BTR = [1/(n_1 + n_2 − 2)] Σ_{i=1}^{2} Σ_{j=1}^{n_i} (x_ijT − x̄_i·T)(x_ijR − x̄_i·R)

and

λ̂_i = [σ̂²_TT − σ̂²_TR ± √((σ̂²_TT + σ̂²_TR)² − 4σ̂⁴_BTR)] / 2

Assume that λ̂_1 < 0 < λ̂_2. According to Lee et al. (3), an approximate (1 − α) × 100% confidence interval for η can be constructed by (η̂_L, η̂_U) = (η̂ − √Δ_L, η̂ + √Δ_U), where

Δ_L = λ̂_1²(1 − (n_1 + n_2 − 2)/χ²_{1−α/2,n_1+n_2−2})² + λ̂_2²(1 − (n_1 + n_2 − 2)/χ²_{α/2,n_1+n_2−2})²

Δ_U = λ̂_1²(1 − (n_1 + n_2 − 2)/χ²_{α/2,n_1+n_2−2})² + λ̂_2²(1 − (n_1 + n_2 − 2)/χ²_{1−α/2,n_1+n_2−2})²

Thus, the null hypothesis is rejected at the α level of significance if 0 ∉ (η̂_L, η̂_U). Under the alternative hypothesis, the power of the above test can be approximated by

1 − Φ(z_{α/2} − √n_s |σ²_TT − σ²_TR| / σ*)

where

σ*² = 2(σ⁴_TT + σ⁴_TR − 2ρ²σ²_BTσ²_BR)

The sample size needed for achieving the desired power of 1 − β is then given by

n = σ*²(z_{α/2} + z_β)² / (σ²_TT − σ²_TR)² + 2
4.3.2 Test for Non-Inferiority/Superiority. For testing non-inferiority/superiority, similarly, consider the following unified hypotheses:

H0: η ≥ 0 versus Ha: η < 0

where η = σ²_TT − δ²σ²_TR. For a given significance level of α, an approximate (1 − α)th upper confidence bound of η can be constructed as η̂_U = η̂ + √Δ_U, where

Δ_U = λ̂_1²((n_1 + n_2 − 2)/χ²_{α,n_1+n_2−2} − 1)² + λ̂_2²((n_1 + n_2 − 2)/χ²_{1−α,n_1+n_2−2} − 1)²

and

λ̂_i = [σ̂²_TT − δ⁴σ̂²_TR ± √((σ̂²_TT + δ⁴σ̂²_TR)² − 4δ²σ̂⁴_BTR)] / 2

Thus, the null hypothesis would be rejected at the α level of significance if η̂_U < 0. On the other hand, under the alternative hypothesis, the power of the above test can be approximated by

Φ(−z_α − √n (σ²_TT − δ²σ²_TR) / σ*)

where

σ*² = 2(σ⁴_TT + δ⁴σ⁴_TR − 2δ²ρ²σ²_BTσ²_BR)

As a result, the sample size needed for achieving the power of 1 − β can be obtained as

n = σ*²(z_α + z_β)² / (σ²_TT − δ²σ²_TR)² + 2

4.3.3 An Example. Consider a standard 2 × 2 crossover design (TR, RT) comparing total variabilities of responses between a test treatment and a control. It is assumed that ρ = 0.60, σ_BT = 0.35, σ_BR = 0.45, σ_WT = 0.25, and σ_WR = 0.35. Suppose that one of the primary objectives is to detect a clinically significant difference. It follows that

σ*² = 2[(0.35² + 0.25²)² + (0.45² + 0.35²)² − 2 × 0.60² × 0.35² × 0.45²] = 0.244

Hence, the sample size needed for achieving an 80% (β = 0.20) power at the 5% (α = 0.05) level of significance is given by

n = 0.244(1.96 + 0.84)² / (0.35² + 0.25² − (0.45² + 0.35²))² + 2 ≈ 100

As a result, a total of 100 subjects (e.g., 50 subjects per sequence) are needed for achieving the desired power.

4.4 Replicated 2 × 2m Crossover Design

Under the model in Equation (4), the total variabilities can be estimated by

σ̂²_Tk = s²_Bk + [(m − 1)/m] σ̂²_Wk,   k = T, R

where σ̂²_Wk is as defined in Equation (5) and s²_Bk is as defined in Equation (8).

4.4.1 Test for Equality. For testing equality, consider the following hypotheses:

H0: η = 0 versus Ha: η ≠ 0

where η = σ²_TT − σ²_TR and η̂ = σ̂²_TT − σ̂²_TR. For a given significance level α, an approximate (1 − α) × 100% confidence interval of η can be constructed as (η̂_L, η̂_U) = (η̂ − √Δ_L, η̂ + √Δ_U), where
Δ_L = λ̂_1²(1 − (n_s − 1)/χ²_{1−α/2,n_s−1})² + λ̂_2²(1 − (n_s − 1)/χ²_{α/2,n_s−1})² + [(m − 1)²σ̂⁴_WT/m²](1 − n_s(m − 1)/χ²_{α/2,n_s(m−1)})² + [(m − 1)²σ̂⁴_WR/m²](1 − n_s(m − 1)/χ²_{1−α/2,n_s(m−1)})²

and

Δ_U = λ̂_1²(1 − (n_s − 1)/χ²_{α/2,n_s−1})² + λ̂_2²(1 − (n_s − 1)/χ²_{1−α/2,n_s−1})² + [(m − 1)²σ̂⁴_WT/m²](1 − n_s(m − 1)/χ²_{1−α/2,n_s(m−1)})² + [(m − 1)²σ̂⁴_WR/m²](1 − n_s(m − 1)/χ²_{α/2,n_s(m−1)})²

and the λ̂_i's are the same as those used for the test of equality for inter-subject variabilities. Thus, the null hypothesis would be rejected at the α level of significance if 0 ∉ (η̂_L, η̂_U). Under the alternative hypothesis, the power of the above test can be approximated by

1 − Φ(z_{α/2} − √n |σ²_TT − σ²_TR| / σ*)

where

σ*² = 2[(σ²_BT + σ²_WT/m)² + (σ²_BR + σ²_WR/m)² − 2ρ²σ²_BTσ²_BR + (m − 1)σ⁴_WT/m² + (m − 1)σ⁴_WR/m²]

Hence, the sample size needed for achieving the desired power of 1 − β is given by

n = σ*²(z_{α/2} + z_β)² / (σ²_TT − σ²_TR)² + 2

4.4.2 Test for Non-Inferiority/Superiority. For testing non-inferiority/superiority, similarly, consider the following unified hypotheses:

H0: η ≥ 0 versus Ha: η < 0

where η = σ²_TT − δ²σ²_TR. η can be estimated by η̂ = σ̂²_TT − δ²σ̂²_TR. For a given significance level of α, an approximate (1 − α)th upper confidence bound of η can be constructed as η̂_U = η̂ + √Δ_U, where

Δ_U = λ̂_1²(1 − (n_s − 1)/χ²_{α,n_s−1})² + λ̂_2²(1 − (n_s − 1)/χ²_{1−α,n_s−1})² + [(m − 1)²σ̂⁴_WT/m²](1 − n_s(m − 1)/χ²_{1−α,n_s(m−1)})² + [(m − 1)²σ̂⁴_WR/m²](1 − n_s(m − 1)/χ²_{α,n_s(m−1)})²

and the λ̂_i's are the same as those used for the test of non-inferiority for inter-subject variabilities. Thus, the null hypothesis would be rejected at the α level of significance if η̂_U < 0. On the other hand, under the alternative hypothesis, the power of the above test can be approximated by

Φ(−z_α − √n_s (σ²_TT − δ²σ²_TR) / σ*)

where

σ*² = 2[(σ²_BT + σ²_WT/m)² + δ⁴(σ²_BR + σ²_WR/m)² − 2δ²ρ²σ²_BTσ²_BR + (m − 1)σ⁴_WT/m² + δ⁴(m − 1)σ⁴_WR/m²]

Hence, the sample size needed for achieving the desired power of 1 − β is given by

n = σ*²(z_α + z_β)² / (σ²_TT − δ²σ²_TR)² + 2

4.4.3 An Example. Consider a 2 × 4 crossover design (ABAB, BABA) for comparing total variabilities between two formulations (A and B) of a drug product. It is estimated from a pilot study that ρ = 0.65, σ_BT = 0.35, σ_BR = 0.45, σ_WT = 0.25, and σ_WR = 0.35. Suppose the objective is to detect a significant difference in total variabilities between the two formulations. It follows that

σ*² = 2[(0.35² + 0.25²/2)² + (0.45² + 0.35²/2)² − 2 × (0.65 × 0.35 × 0.45)² + 0.25⁴/2² + 0.35⁴/2²] = 0.154

Hence, the sample size needed for achieving an 80% (β = 0.20) power is given by

n = 0.154(1.96 + 0.84)² / (0.35² + 0.25² − (0.45² + 0.35²))² + 2 ≈ 64
As a result, a total of 64 subjects (e.g., 32 subjects per sequence) are needed for achieving the desired power.
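The arithmetic in these examples is easily scripted. The following Python sketch implements the sample-size calculation for detecting a difference in total variabilities under the replicated 2 × 2m crossover design, including the small-sample adjustment of +2 that appears in the worked examples above; the function name and the way the inputs are packaged are illustrative choices, not part of any standard software.

```python
from math import ceil
from statistics import NormalDist

def total_variability_sample_size(rho, s_bt, s_br, s_wt, s_wr, m,
                                  alpha=0.05, beta=0.20):
    """Approximate total sample size for detecting a difference in total
    variabilities under a replicated 2 x 2m crossover design (Section 4.4)."""
    z = NormalDist().inv_cdf
    # sigma*^2 for the test of equality (Section 4.4.1)
    sigma_star2 = 2 * ((s_bt**2 + s_wt**2 / m)**2
                       + (s_br**2 + s_wr**2 / m)**2
                       - 2 * rho**2 * s_bt**2 * s_br**2
                       + (m - 1) * s_wt**4 / m**2
                       + (m - 1) * s_wr**4 / m**2)
    eta = (s_bt**2 + s_wt**2) - (s_br**2 + s_wr**2)   # sigma^2_TT - sigma^2_TR
    return sigma_star2 * (z(1 - alpha / 2) + z(1 - beta))**2 / eta**2 + 2

# Reproduces the 2 x 4 example above (m = 2 replicates of each treatment)
n = total_variability_sample_size(rho=0.65, s_bt=0.35, s_br=0.45,
                                  s_wt=0.25, s_wr=0.35, m=2)
print(ceil(n))   # about 64 subjects in total, i.e. 32 per sequence
```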
5 DISCUSSION
In clinical research, in addition to comparing mean responses, it is often also of interest to compare the variabilities associated with the responses. As indicated in Chow and Shao (10), a treatment with a larger variability may raise a safety concern. In addition, a treatment with a larger variability may have a smaller probability of reproducing the clinical responses. In this article, statistical procedures for sample size calculation are derived under a parallel design and a crossover design, with and without replicates. It should be noted, however, that when the variance of interest is a linear combination of several variance components, how to establish similarity in variabilities between two treatments remains a challenging problem for researchers. Further research is needed.

REFERENCES

1. S. C. Chow and H. Wang, On sample size calculation in bioequivalence trials. J. Pharmacokinet. Pharmacodyn. 2001; 28: 155–169.
2. V. M. Chinchilli and J. D. Esinhart, Design and analysis of intra-subject variability in cross-over experiments. Stat. Med. 1996; 15: 1619–1634.
3. Y. Lee, J. Shao, and S. C. Chow, Confidence intervals for linear combinations of variance components when estimators of variance components are dependent: an extension of the modified large sample method with applications. J. Amer. Stat. Assoc., in press.
4. Y. Lee, J. Shao, S. C. Chow, and H. Wang, Test for inter-subject and total variabilities under crossover design. J. Biopharmaceut. Stat. 2002; 12: 503–534.
5. Y. Lee, H. Wang, and S. C. Chow, Comparing variabilities in clinical research. Encycl. Biopharmaceut. Stat. 2003: 214–230.
6. S. C. Chow, J. Shao, and H. Wang, Sample Size Calculation in Clinical Research. New York: Marcel Dekker, 2003.
7. W. G. Howe, Approximate confidence limits on the mean of X + Y where X and Y are two tabled independent random variables. J. Amer. Stat. Assoc. 1974; 69: 789–794.
8. F. Graybill and C. M. Wang, Confidence intervals on nonnegative linear combinations of variances. J. Amer. Stat. Assoc. 1980; 75: 869–873.
9. T. Hyslop, F. Hsuan, and D. J. Holder, A small sample confidence interval approach to assess individual bioequivalence. Stat. Med. 2000; 19: 2885–2897.
10. S. C. Chow and J. Shao, Statistics in Drug Research. New York: Marcel Dekker, 2002.
SAMPLE SIZE CONSIDERATIONS FOR MORBIDITY/MORTALITY TRIALS
JOANNA H. SHIH
National Cancer Institute, Bethesda, Maryland

Sample size and power determination plays a crucial role in the design of clinical trials. To ensure that a trial is adequately sized so that the true treatment effect can be detected with the desired power, it is important to evaluate the assumptions used in the sample-size calculation procedure. For morbidity/mortality trials, a time-to-event outcome is usually used in the primary analysis. Because the rate of the outcome event in this type of study is usually not very high, survival times for a large number of study participants are censored at the end of the study; therefore, survival analysis is applied to the time-to-event outcomes. Various closed-form sample-size formulas have been proposed for the analysis of a survival endpoint. Most of them are based on restrictive assumptions and do not take into consideration factors that may affect the estimate of the treatment effect. Factors typically encountered in clinical trials include drop-in, drop-out, staggered entry, loss to follow-up, lags in treatment effects, and informative noncompliance. It is important to consider these issues in planning, because they reflect practical problems encountered in the conduct of the research, and accounting for them ensures that intent-to-treat analyses are properly powered. This article reviews sample-size calculation methods for testing treatment effects in two-arm survival trials. For each method considered, its basic features, underlying assumptions, and limitations are discussed. Presented first are the test statistics and associated methods that yield closed-form sample-size formulas. Attention is then focused on methods that incorporate the aforementioned complexities in sample-size calculation. To present all these methods in a cohesive way, a generic framework for sample-size calculation is used. All the methods considered are variations in the parameters used in the framework.

1 GENERAL FRAMEWORK FOR SAMPLE-SIZE CALCULATION

Let $\theta_1$ and $\theta_2$ denote the measure of treatment effect in the two arms, respectively. We use the test statistic $Z_n$ to test the null hypothesis $H_0: \theta_1 = \theta_2$ against $H_A: \theta_1 \ne \theta_2$, where the subscript $n$ in $Z_n$ refers to the total sample size used in calculating the test statistic. When $n$ is large, under $H_0$, $Z_n$ is approximately standard normal. Under $H_A$, $Z_n$ is also approximately normal with unit variance and mean $E(Z_n)$. The parameter $E(Z_n)$ depends on $\theta_1$, $\theta_2$, $n$, and possibly some nuisance parameters. The sample size is determined such that, when the alternative hypothesis is true, there is a $1-\beta$ chance of detecting the underlying difference in treatment effects at the significance level $\alpha$. With this required condition, it follows that the sample size is the solution to the following simple equation:
\[
E(Z_n) = Z_{\alpha/2} + Z_\beta \qquad (1)
\]
where $Z_p$ is the $(1-p) \times 100$th percentile of the standard normal distribution. The quantities $\alpha$ and $1-\beta$ are referred to as the type I error and power. To avoid false positive and false negative findings, both $\alpha$ and $\beta$ are set at small values; typical values are $\alpha = 0.05$ and $\beta = 0.1$.
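In practice, Equation (1) is solved numerically once $E(Z_n)$ is available as a function of $n$. The short sketch below shows one way to do this in Python; the function `required_n` and the illustrative mean-difference example are not taken from this article, merely a hedged illustration of the framework under the assumption that $E(Z_n)$ is monotone in $n$.

```python
from statistics import NormalDist

def required_n(expected_z, alpha=0.05, beta=0.10, n_max=1_000_000):
    """Smallest n with E(Z_n) >= Z_{alpha/2} + Z_beta, i.e. the solution of Equation (1).

    expected_z: a function of the total sample size n returning E(Z_n) under the
    alternative hypothesis (nuisance parameters are fixed inside the closure)."""
    target = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(1 - beta)
    lo, hi = 2, n_max
    while lo < hi:                      # bisection, assuming E(Z_n) increases with n
        mid = (lo + hi) // 2
        if expected_z(mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Toy example: two-sample difference of means with known sigma and equal allocation,
# for which E(Z_n) = (delta / sigma) * sqrt(n) / 2; here delta/sigma = 0.25.
print(required_n(lambda n: 0.25 * n**0.5 / 2))   # about 673
```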
2 CHOICE OF TEST STATISTICS
For survival analysis of clinical trials, various test statistics have been proposed in comparing the treatment difference. These test statistics generally can be grouped into parametric versus nonparametric. For the parametric tests, it is assumed that the parametric form of the underlying survival distribution in each treatment group is known, whereas it is unspecified for the nonparametric counterpart. Because different tests have different implications for sample-size determination, in this section, we present the explicit forms and highlight the major properties and associated assumptions of some commonly used tests. All these tests, as will be shown below, depend on the number of events that vary with the length of followup. Thus, in essence the number of events
and length of follow-up are key factors that determine the sample size.

2.1 Parametric Tests

2.1.1 Test Based on Difference of Hazards. Lachin (1, and references therein) tests the treatment difference in terms of the difference in hazards. It is assumed that survival time in each group follows an exponential distribution with constant hazard $\lambda_i$, $i = 1, 2$. The test is based on $\hat{\lambda}_1 - \hat{\lambda}_2$, where $\hat{\lambda}_i = d_i/(\text{total time at risk for group } i)$ is the maximum likelihood estimate and $d_i$ is the number of events in group $i$. The variance of $\hat{\lambda}_i$ is equal to $\lambda_i^2/E(d_i)$, where $E(d_i)$ denotes the expected number of events in group $i$. The explicit expression of $E(d_i)$ depends on the distributions of the survival time and censoring time. Lachin (1) assumes that study participants are accrued uniformly during the interval $[0, T_0]$ and that the total trial length is $T_0 + \tau$. Under this assumption, $E(d_i)$ is given by
\[
E(d_i) = n q_i \left[1 - \frac{e^{-\lambda_i \tau} - e^{-\lambda_i (T_0+\tau)}}{\lambda_i T_0}\right] \qquad (2)
\]
where $q_i$, $i = 1, 2$, is the proportion of study participants in group $i$. Under equal sample allocation, $q_1 = q_2 = 1/2$. The test statistic $Z_n$ has the form
\[
Z_n = \frac{\hat{\lambda}_1 - \hat{\lambda}_2}{(\hat{\lambda}_1^2/d_1 + \hat{\lambda}_2^2/d_2)^{1/2}} \qquad (3)
\]
which under the null hypothesis is approximately standard normal, and under the alternative hypothesis is approximately normal with unit variance and mean equal to
\[
E(Z_n) = \frac{\lambda_1 - \lambda_2}{\left(\lambda_1^2/E(d_1) + \lambda_2^2/E(d_2)\right)^{1/2}} \qquad (4)
\]
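As a sketch of how Equations (1) through (4) combine in practice, the following Python fragment computes the required total sample size for the difference-of-hazards test; the hazard, accrual, and follow-up values are hypothetical, and the simple search loop is only one way of solving Equation (1).

```python
from math import exp
from statistics import NormalDist

def expected_events(n, q, lam, T0, tau):
    """Expected number of events in one group, Equation (2): exponential survival
    with hazard lam, uniform accrual over [0, T0], total trial length T0 + tau."""
    return n * q * (1 - (exp(-lam * tau) - exp(-lam * (T0 + tau))) / (lam * T0))

def expected_z(n, lam1, lam2, T0, tau, q1=0.5, q2=0.5):
    """E(Z_n) for the difference-of-hazards test, Equation (4)."""
    d1 = expected_events(n, q1, lam1, T0, tau)
    d2 = expected_events(n, q2, lam2, T0, tau)
    return (lam1 - lam2) / (lam1**2 / d1 + lam2**2 / d2) ** 0.5

def sample_size(lam1, lam2, T0, tau, alpha=0.05, beta=0.10):
    target = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(1 - beta)
    n = 2
    while abs(expected_z(n, lam1, lam2, T0, tau)) < target:
        n += 2          # keep equal allocation to the two arms
    return n

# Hypothetical illustration: hazards 0.10 vs 0.07 per year,
# 3 years of uniform accrual, 2 further years of follow-up
print(sample_size(0.10, 0.07, T0=3.0, tau=2.0))   # roughly 1300 under these inputs
```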
Inserting Equation (2) into Equation (4) and equating Equation (4) to $Z_{\alpha/2} + Z_\beta$ as in Equation (1), the total required sample size $n$ is readily determined.

2.1.2 Test Based on Hazard Ratio. Rubinstein et al. (2) also assume exponential survival and propose a test based on $\log(\hat{\lambda}_1/\hat{\lambda}_2)$, where $\hat{\lambda}_i$ is again the maximum likelihood estimate. The variance of $\log(\hat{\lambda}_1/\hat{\lambda}_2)$ is approximately equal to $1/E(d_1) + 1/E(d_2)$. In deriving $E(d_i)$, they assume that patients are accrued during the interval $[0, T_0]$ according to a Poisson process and that the total trial length is $T_0 + \tau$. They also allow for the possibility of loss to follow-up and assume that the time to loss to follow-up in the $i$th group is exponentially distributed with parameter $\phi_i$, $i = 1, 2$. Under these assumptions, the expected number of events in group $i$ is given by
\[
E(d_i) = n q_i \frac{\lambda_i}{\lambda_i^*}\left[1 - \frac{e^{-\lambda_i^* \tau} - e^{-\lambda_i^* (T_0+\tau)}}{\lambda_i^* T_0}\right] \qquad (5)
\]
for $\lambda_i^* = \lambda_i + \phi_i$. The test statistic is $Z_n = \log(\hat{\lambda}_1/\hat{\lambda}_2)/\sqrt{1/d_1 + 1/d_2}$, which under $H_A$ has mean approximately equal to $E(Z_n) = \log(\lambda_1/\lambda_2)/\sqrt{1/E(d_1) + 1/E(d_2)}$. Again, inserting Equation (5) into $E(Z_n)$ and using Equation (1), the total required sample size is determined.

2.2 Nonparametric Tests

2.2.1 Logrank Test. The logrank test is used frequently for comparing the overall survival experience between two groups. The hypothesis to be tested is $H_0: S_1(\cdot) = S_2(\cdot)$, where $S_i$ denotes the survival function for group $i$, and the dot represents the whole range of survival times under study. Unlike the previous two parametric tests, it is nonparametric and thus makes no assumption about the form of the survival functions $S_1$ and $S_2$. Hence, it is suitable for application to a wide variety of clinical trials with survival endpoints. The logrank test takes the following simple form:
\[
Z_n = \frac{\sum_{k=1}^{d}\left(X_k - \dfrac{n_{1k}}{n_{1k}+n_{2k}}\right)}{\left[\sum_{k=1}^{d}\dfrac{n_{1k}\,n_{2k}}{(n_{1k}+n_{2k})^2}\right]^{1/2}}
\]
where the sum is over all events in the two groups with $d = d_1 + d_2$, $X_k$ is the indicator of the first group (i.e., $X_k = 1$ if the $k$th event comes from the first group, and 0 otherwise), and $n_{1k}$ and $n_{2k}$ are the numbers at risk just before the $k$th event in the two groups, respectively. The logrank test is most sensitive to departures from the null hypothesis in which the hazard ratio between the two samples is roughly constant over time.
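For concreteness, a direct transcription of the statistic above into Python might look as follows; the toy risk-set numbers are invented purely to show the bookkeeping, and the weighted version discussed next is obtained by supplying non-unit weights.

```python
from math import sqrt

def logrank_z(events, weights=None):
    """Compute Z_n as defined above.

    events: list of (x_k, n1k, n2k) triples ordered by event time, where x_k = 1
    if the k-th event occurs in group 1 and 0 otherwise, and n1k, n2k are the
    numbers at risk just before that event.
    weights: optional w_k; w_k = 1 for all k gives the ordinary logrank test."""
    if weights is None:
        weights = [1.0] * len(events)
    num = sum(w * (x - n1 / (n1 + n2))
              for w, (x, n1, n2) in zip(weights, events))
    den = sqrt(sum(w**2 * n1 * n2 / (n1 + n2) ** 2
                   for w, (x, n1, n2) in zip(weights, events)))
    return num / den

# Toy illustration with three events and hypothetical risk sets
print(logrank_z([(1, 10, 10), (0, 9, 8), (1, 7, 8)]))
```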
In some situations, it may be reasonable to expect that the survival difference in terms of the hazard ratio changes over time. In that case, the weighted logrank test may be preferred, with higher weights assigned to times when larger differences are expected. The weighted logrank test (3) is given by
\[
Z_n = \frac{\sum_{k=1}^{d} w_k\left(X_k - \dfrac{n_{1k}}{n_{1k}+n_{2k}}\right)}{\left[\sum_{k=1}^{d}\dfrac{w_k^2\, n_{1k}\, n_{2k}}{(n_{1k}+n_{2k})^2}\right]^{1/2}}
\]
where the $w_k$ are the weights. The regular logrank test is a special case of the weighted logrank test with $w_k = 1$ for all $k$. Under the alternative hypothesis $H_A: S_1(\cdot) \ne S_2(\cdot)$, the expected value of the weighted logrank test involves integration and may not have a closed-form expression. However, by assuming equal censoring distributions in the two groups, $w_k = 1$ for all $k$, and a constant hazard ratio between the two groups, Schoenfeld (4) shows that $E(Z_n)$ simplifies to
\[
E(Z_n) = \log(\theta)\sqrt{q_1 q_2 E(d)} \qquad (6)
\]
where $\theta$ is the constant hazard ratio and $E(d)$ is the expected number of events in the two groups combined. Equating Equation (6) to $Z_{\alpha/2} + Z_\beta$ yields $E(d)$, the total number of events required for the study. The required total sample size is related to $E(d)$ by
\[
n = \frac{E(d)(1+\phi)}{\phi P_1 + P_2} \qquad (7)
\]
where $\phi = q_1/q_2$ is the ratio of sample sizes between the two groups, and $P_i$, $i = 1, 2$, are the proportions of events in the two groups. Usually, the survival experience in one group, say group 2, is available from a previous trial. Then $P_2$ can be estimated by $1 - \hat{S}(t^*)$, where $\hat{S}$ is the Kaplan–Meier survival curve from that previous trial and $t^*$ is the average length of follow-up for the current study. Alternatively, if the accrual rate is constant, then Simpson's rule can be used to approximate the proportion of events for group 2 (5):
\[
P_2 = 1 - \frac{1}{6}\left\{\hat{S}_2(\tau) + 4\hat{S}_2(\tau + 0.5\,T_0) + \hat{S}_2(\tau + T_0)\right\}
\]
where, as denoted before, $T_0$ is the length of accrual and $\tau$ is the minimal period of follow-up. The proportion $P_1$ is related to $P_2$, under the constant hazard ratio assumption, by $P_1 = 1 - (1 - P_2)^\theta$.

Freedman (6) derives a formula for the total number of events under the logrank test. Assuming that the hazard ratio between the two groups is constant and that the ratio of study participants at risk, just before a given event, in the two groups is constant and equal to the ratio of the sample sizes, the total expected number of events is given by
\[
E(d) = \frac{(Z_{\alpha/2} + Z_\beta)^2 (1 + \theta\phi)^2}{\phi(1-\theta)^2} \qquad (8)
\]
Having calculated the total number of events required, the proportions of events and the total sample size can be calculated in the same way as described above.

2.2.2 Binomial Test. The binomial statistic tests the difference in the proportions of events. Like the logrank test, it is robust in that it does not require any distributional assumption for the survival times. However, unlike the logrank test, which compares the survival experience throughout the duration of the study, a specific time point at which the proportions of events are calculated must be specified ahead of time. In practice, the average length of follow-up is usually used for making the comparison. The test statistic $Z_n$ is given by
\[
Z_n = \frac{\hat{P}_1 - \hat{P}_2}{\sqrt{\hat{P}(1-\hat{P})(1/n_1 + 1/n_2)}}
\]
where $\hat{P}_i$ and $n_i$ are the proportion of events and the sample size in group $i$, and $\hat{P} = q_1\hat{P}_1 + q_2\hat{P}_2$ is the pooled proportion. Under the alternative hypothesis $H_A: P_1 \ne P_2$, the mean of $Z_n$ is approximately equal to
\[
E(Z_n) = \frac{P_1 - P_2}{\sqrt{n^{-1} P(1-P)(1/q_1 + 1/q_2)}}
\]
where $P = q_1 P_1 + q_2 P_2$. From Equation (1), the required total sample size is given by
\[
n = \frac{(Z_{\alpha/2} + Z_\beta)^2 P(1-P)(1/q_1 + 1/q_2)}{(P_1 - P_2)^2} \qquad (9)
\]
Under equal sample allocation, $q_1 = q_2 = 1/2$ and $n_1 = n_2 = 2(Z_{\alpha/2} + Z_\beta)^2 P(1-P)/(P_1 - P_2)^2$.
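The event-count formulas (6) through (9) are simple enough to script directly. The sketch below, with hypothetical design values, computes the Schoenfeld and Freedman event counts and converts the former into a total sample size via Equation (7); none of the function names correspond to an existing package.

```python
from math import log
from statistics import NormalDist

z = NormalDist().inv_cdf

def events_schoenfeld(theta, alpha=0.05, beta=0.10, q1=0.5, q2=0.5):
    """Total events from Equation (6), solved for E(d)."""
    return ((z(1 - alpha / 2) + z(1 - beta)) / (log(theta) * (q1 * q2) ** 0.5)) ** 2

def events_freedman(theta, alpha=0.05, beta=0.10, phi=1.0):
    """Total events from Equation (8)."""
    return (z(1 - alpha / 2) + z(1 - beta)) ** 2 * (1 + theta * phi) ** 2 / (phi * (1 - theta) ** 2)

def total_n(expected_d, p2, theta, phi=1.0):
    """Convert events to subjects with Equation (7); P1 = 1 - (1 - P2)^theta."""
    p1 = 1 - (1 - p2) ** theta
    return expected_d * (1 + phi) / (phi * p1 + p2)

# Hypothetical design: hazard ratio 0.75, control event proportion 0.30, equal allocation
d = events_schoenfeld(0.75)
print(round(d), round(events_freedman(0.75)), round(total_n(d, p2=0.30, theta=0.75)))
# about 508 events (Schoenfeld), a slightly larger Freedman count, and about 1900 subjects
```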
3 ADJUSTMENT OF TREATMENT EFFECT
The above sample-size formulas are easy to compute, but most of them require restrictive assumptions. In practice, the survival function is rarely exponential, the difference in treatment effects in terms of the hazard ratio may vary with time, and the pattern of the accrual rate may not be uniform or Poisson. Furthermore, the proportion of events will change if complexities such as drop-out, drop-in, loss to follow-up, and lags in treatment effects develop. Drop-out occurs when participants discontinue taking the therapy in the treatment group, and drop-in occurs when participants switch from the control group to the treatment group. Loss to follow-up, in contrast, occurs because of loss of contact or death from a competing cause during the course of follow-up. Lags refer to the delay period before the full treatment effect is achieved. They occur when the full effect of a treatment regimen is not expected to occur instantaneously. Under the intention-to-treat analysis, where the outcomes of the study participants are analyzed with respect to the original treatment of randomization, regardless of subsequent treatment alteration or other changes, the aforementioned factors will dilute the treatment effect. Hence, it is necessary to inflate the sample size so that the desired power is achieved. Various methods have been proposed to modify sample-size calculation to account for drop-in and drop-out. Lachin and Foulkes (7) inflate the sample size simply by dividing by $(1 - d_{in} - d_{out})^2$, where $d_{in}$ and $d_{out}$ are the drop-in and drop-out rates in the control and treatment groups, respectively. Lakatos (8) uses a discrete nonstationary Markov chain model to specify the trial process. The event rates in both groups are adjusted according to the state space, initial distributions, and specified transition probabilities. The Lakatos method is very general, for it allows any pattern of survival, accrual, drop-in, drop-out, loss to follow-up, and lag in the effectiveness of treatment during the course of a trial. Under that model,
the expected value of the logrank statistic is calculated using the ratios of the hazards and the proportions at risk at each discrete interval. By making the interval between transitions sufficiently short, and assuming that the hazard ratio and the ratio of patients at risk are constant within each interval, the expected number of events can be calculated. Some of its key features and implementation are reviewed below.

3.1 The Discrete Nonstationary Markov Process Model

The discrete-state Markov process model is used to account for transitions from one state to another and for changes of event rates over time. Assume that a study participant is assigned to either a treatment group or a control group. Then at a given time t during the course of a clinical trial, a participant is in one of four states: (1) lost to follow-up, (2) having had an event, (3) taking the therapy in the treatment group, or (4) taking the therapy in the control group. The probabilities of these four possible states sum to 1, and at the beginning of the study the probability of being in the third state is 1 for the treatment group, and the probability of being in the fourth state is 1 for the control group. The total length of the study is divided into I disjoint periods of equal length $[0, t_1), \ldots, [t_{I-1}, t_I)$ such that the hazard in each period is constant. Each period is divided into S small equal-length intervals, and the probability of having an event is assumed the same across all these subintervals. From time $t_{j-1}$ to $t_j$, $j = 1, \ldots, I \times S$, each participant at risk of the event outcome in each assigned treatment group may move from one state to another state according to transition probabilities that are set up based on the assumed rates of events, drop-in, drop-out, and loss to follow-up. After $I \times S$ transitions, the adjusted event rate in each group is the probability of being in the second state (having had an event). Under the assumed piecewise exponential model, the expectation of the weighted logrank test is approximately
\[
E(Z_n) = \frac{\sum_{i=1}^{I}\sum_{k=1}^{d_i}\left(\dfrac{\phi_{ik}\theta_{ik}}{1+\phi_{ik}\theta_{ik}} - \dfrac{\phi_{ik}}{1+\phi_{ik}}\right)}{\left[\sum_{i=1}^{I}\sum_{k=1}^{d_i}\dfrac{\phi_{ik}}{(1+\phi_{ik})^2}\right]^{1/2}}
\]
where $d_i$ is the number of events in the $i$th period. The parameter $\phi_{ik}$ is the ratio of the number of people at risk in the first group to that in the second group just prior to the $k$th death in the $i$th period, and $\theta_{ik}$ is the ratio of the hazards between the two groups just prior to the $k$th death in the $i$th period. If the length of each interval is small, then $\phi_{ik}$ and $\theta_{ik}$ are approximately constant within each period, and $E(Z_n)$ can be simplified to
\[
E(Z_n) = \frac{\sqrt{d}\,\sum_{i=1}^{I}\rho_i\gamma_i}{\left(\sum_{i=1}^{I}\rho_i\eta_i\right)^{1/2}} \qquad (10)
\]
where
\[
\gamma_i = \frac{\phi_i\theta_i}{1+\phi_i\theta_i} - \frac{\phi_i}{1+\phi_i}, \qquad \eta_i = \frac{\phi_i}{(1+\phi_i)^2}, \qquad \rho_i = \frac{d_i}{d}
\]
The possibly time-dependent hazard ratios θ i s are the parameter values assumed by the investigators, and quantities of φ i s are obtained from the Markov chain model. As before, the expected total number of events E(d) is obtained by equating Equation (10) to Zα/2 + Zβ . Let Pi * denote the adjusted event rate in group i at the end of transitions. Then the required total sample size equals E(d) / (q1 P1 * + q2 P2 *). When a participant switches to the other arm during the course of follow-up, it may not be immediate that the original treatment effect vanishes and the full effect of the switched treatment is adopted. This effect is referred as the lagged effects for noncompliance. Wu et al. (9) adjust the event rates for noncompliance and lagged effects. However, their method requires numerical integration and cannot be easily modified to adjust for other factors. The above Markov model can flexibly incorporate lagged effects by creating additional states to reflect the intermediate treatment effects after crossover of the assigned treatment. 3.2 Implementation Shih (10) implements the Markov model with a SAS macro written in IML. The program, named SIZE, is comprehensive for calculating sample size, power, and duration of
5
the study with the aforementioned complexities. When calculating duration, an iterative bisection procedure is used to search for the shortest duration that yields the required power with the given sample size. Additional key features of SIZE include the following.

• It allows unequal allocation to the two treatment groups and any pattern of staggered entry, drop-in, drop-out, loss to follow-up, nonconstant hazards, and lags.
• It has the option of using the standard logrank test, the weighted logrank test, or the binomial test. For the binomial test, the adjusted event rates Pi*, i = 1, 2, obtained from the Markov chain model are used in the binomial sample-size formula (9).
• It allows heterogeneity in the treatment effect as well as in other parameters, so that the impact on power of uncertainty in the parameters can be considered. To obtain a realistic assessment of the predictive probability of obtaining a significant result, Spiegelhalter et al. (11,12) suggest that a plausible range of treatment benefit, expressed by a prior distribution, be explored in the power calculation. In SIZE, the uncertainties are expressed in terms of a discrete prior distribution.
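Outside of SAS, the state-transition bookkeeping of Section 3.1 can be sketched in a few lines of Python. The fragment below is a deliberately simplified, hypothetical four-state version for a single arm (it ignores lagged effects and return from crossover, and its per-interval rates are invented); it is not a reimplementation of SIZE.

```python
# States per arm: 0 = lost to follow-up, 1 = event, 2 = on assigned therapy, 3 = crossed over.
# One short interval at a time; each row of P sums to 1; states 0 and 1 are absorbing.

def adjusted_event_rate(hazard_assigned, hazard_other, p_loss, p_switch, intervals):
    """Probability of being in the event state after `intervals` transitions,
    starting on the assigned therapy, for a single treatment arm."""
    dist = [0.0, 0.0, 1.0, 0.0]                # everyone starts on the assigned therapy
    for _ in range(intervals):
        stay = 1.0 - p_loss - p_switch - hazard_assigned
        back = 1.0 - p_loss - hazard_other      # crossed-over subjects take the other arm's hazard
        P = [
            [1.0, 0.0, 0.0, 0.0],                              # lost to follow-up (absorbing)
            [0.0, 1.0, 0.0, 0.0],                              # event (absorbing)
            [p_loss, hazard_assigned, stay, p_switch],         # on assigned therapy
            [p_loss, hazard_other, 0.0, back],                 # after crossover (no return)
        ]
        dist = [sum(dist[i] * P[i][j] for i in range(4)) for j in range(4)]
    return dist[1]

# Illustrative per-6-month rates over 2 years (4 intervals); all values are hypothetical
p_control = adjusted_event_rate(hazard_assigned=0.08, hazard_other=0.06,
                                p_loss=0.02, p_switch=0.02, intervals=4)
p_treated = adjusted_event_rate(hazard_assigned=0.06, hazard_other=0.08,
                                p_loss=0.02, p_switch=0.02, intervals=4)
print(p_control, p_treated)    # adjusted event rates to feed into a sample-size formula
```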
3.3 Illustration A hypothetical example is used to illustrate the impact of the aforementioned complexities on sample size. The sample-size calculation was done using SIZE. Consider a two-arm treatment versus control clinical trial with 2 years of average follow-up. The event rate after 2 years in the control group is 30%, and the hazard rate is piecewise constant over the four 6-month periods with ratios 1:2:2:2. The hazard ratio is assumed constant and equal to 0.7146, which corresponds to 25% reduction of the event rate after 2 years in the treatment group. Under this setup, a total sample of 1438 is required to achieve 90% power with 0.05 two-sided type I error. The expected number of events is 378. If crossover or loss to follow-up is
expected to occur, then the sample size needs to be inflated to achieve the same power. For example, if 10% of patients crossover in each arm and 10% of loss to follow-up occurs after 2 years, then the adjusted event rate in the control and treatment group is 0.2803 and 0.2167, respectively. The required sample size is increased to 1886 with 469 expected events. Staggered entry also affects the sample size, although its impact generally is small. Suppose the trial participants enter the study uniformly for 1 year and are followed for a minimum of 1.5 years. Then the maximum follow-up is 2.5 years, and the average follow-up is 2 years. With the same setup as above including crossovers and loss to follow-up, the required sample size is changed to 1896 with 472 expected events. In the above settings, it is assumed that hazard ratio holds constant over time. If the hazard ratio changes over time, then the nonproportional hazards occur and need be accounted for in the sample-size calculation. Consider the scenario where the treatment has no effect in the first 3 months, achieved half of the effect between the third and the sixth month, and reaches the full effect afterward. The event rate set up for the control group and its reduction after 2 years in the treatment group are the same as before. Without crossover and loss to followup, the required sample size is 1495 with 393 expected events. When crossover occurs, it is necessary to adjust the lagged effects under nonproportional hazards. Suppose the noncompliers in the treatment group return to the control group in the same fashion as they reach the treatment effect level before dropout. For patients who drop the therapy in the control group and start taking the therapy in the treatment group, suppose the treatment efficacy is set at the initial level and the full treatment effect is achieved in the same way as those in the treatment group. Assume 10% of the patients in each group crossover after 2 years and 10% of loss to follow-up occurs. The adjusted event rate in the control and treatment group is 0.2799 and 0.2175, respectively. A total sample size of 2044 is required with 509 expected events.
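By comparison, the crude Lachin–Foulkes inflation mentioned at the start of this section takes a single line; the input numbers below are hypothetical, and the adjustment ignores when crossover or loss to follow-up occurs.

```python
def inflate_for_noncompliance(n, drop_in, drop_out):
    """Lachin and Foulkes (7): divide the sample size by (1 - d_in - d_out)^2."""
    return n / (1.0 - drop_in - drop_out) ** 2

print(round(inflate_for_noncompliance(1000, drop_in=0.05, drop_out=0.10)))  # 1384
```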
4 INFORMATIVE NONCOMPLIANCE

The treatment effect adjustment methods described above assume that the risk of the study endpoint depends only on the treatment actually received and not on the compliance status. For example, in Shih (10), when a study participant crosses over from the treatment group to the control group, the risk of the endpoint event returns to the level of the control group either immediately or in the opposite direction to the way the treatment effect was reached before drop-out. However, some studies have shown that noncompliers form a different subgroup than compliers and have higher event rates (13,14). Jiang et al. (15) show that informative noncompliance may have a more adverse impact on the power than non-informative noncompliance. Several sample-size calculation approaches have been proposed to incorporate different risks for compliers and noncompliers (e.g., References 15–17). These approaches assume that the risk of the outcome event is related to the compliance status but not to the time of discontinuation of the assigned treatment. However, in the analysis of the Controlled Onset Verapamil Investigation of Cardiovascular Endpoints trial (18), Li and Grambsch (19) find that the risk of the outcome event is different for early versus late noncompliers. Motivated by this observation, they propose a sample-size approach that allows for the possibility of a time-varying association between noncompliance and risk. Their method for the logrank sample-size calculation takes Lakatos' Markov chain approach as a basis, modifying it to add an additional noncompliance state for each treatment group and to incorporate a time-varying rate for each state. They show that the time-varying pattern of the relationship between noncompliance and risk can have a significant impact on the sample-size calculation.

REFERENCES

1. J. M. Lachin, Introduction to sample size determination and power analysis for clinical trials. Control. Clin. Trials 1981; 2: 93–113.
2. L. V. Rubinstein, M. H. Gail, and T. J. Santner, Planning the duration of a comparative clinical trial with loss to follow-up and a period
of continued observation. J. Chron. Dis. 1981; 34: 469–479.
3. R. E. Tarone and J. Ware, On distribution-free tests for equality of survival distributions. Biometrika 1977; 64: 156–160.
4. D. Schoenfeld, The asymptotic properties of nonparametric tests for comparing survival distributions. Biometrika 1981; 68: 316–319.
5. D. Schoenfeld, Sample size formula for the proportional-hazards regression model. Biometrics 1983; 39: 499–503.
6. L. S. Freedman, Tables of the number of patients required in clinical trials using the logrank test. Stat. Med. 1982; 1: 121–129.
7. J. M. Lachin and M. A. Foulkes, Evaluation of sample size and power for analyses of survival with allowance for nonuniform patient entry, losses to follow-up, noncompliance, and stratification. Biometrics 1986; 42: 507–519.
8. E. Lakatos, Sample size based on the log-rank statistic in complex clinical trials. Biometrics 1988; 44: 229–241.
9. M. Wu, M. Fisher, and D. DeMets, Sample size for long-term medical trial with time-dependent noncompliance and event rates. Control. Clin. Trials 1980; 1: 109–121.
10. J. H. Shih, Sample size calculation for complex clinical trials with survival endpoints. Control. Clin. Trials 1995; 16: 395–407.
11. D. J. Spiegelhalter and L. S. Freedman, A predictive approach to selecting the size of a clinical trial. Stat. Med. 1986; 5: 1–13.
12. D. J. Spiegelhalter, L. S. Freedman, and M. K. Parmar, Applying Bayesian ideas in drug development and clinical trials. Stat. Med. 1993; 12: 1501–1511.
13. Coronary Drug Project Research Group, Influence of adherence to treatment and response of cholesterol on mortality in the Coronary Drug Project. N. Engl. J. Med. 1980; 303: 1038–1041.
14. S. Snapinn, Q. Jiang, and B. Iglewicz, Informative noncompliance in endpoint trials. Curr. Control. Trials Cardiovasc. Med. 2004; 5.
15. Q. Jiang, S. Snapinn, and B. Iglewicz, Calculation of sample size in survival trials: the impact of informative noncompliance. Biometrics 2004; 60: 800–806.
16. R. Porcher, V. Levy, and S. Chevret, Sample size correction for treatment crossovers in randomized clinical trials with a survival endpoint. Control. Clin. Trials 2002; 23: 650–651.
17. M. S. Friedike, P. Royston, and A. Babiker, A menu-driven facility for complex sample size
calculation in randomized controlled trials with a survival or a binary outcome: update. Stata J. 2005; 5: 123–129.
18. H. R. Black, W. J. Elliott, J. D. Neaton, et al., Rationale and design for the Controlled Onset Verapamil Investigation of Cardiovascular Endpoints (CONVINCE) trial. Control. Clin. Trials 1998; 19: 370–390.
19. B. Li and P. Grambsch, Sample size calculation in survival trials accounting for time-varying relationship between noncompliance and risk of outcome event. Clin. Trials 2006; 3: 349–359.
SCREENING, MODELS OF
CHRIS STEVENSON
Australian National University, Canberra, ACT, Australia

Screening asymptomatic people to allow the early detection and treatment of chronic diseases is an important part of modern medicine and public health. For screening to be both an efficient and cost-effective medical intervention, it must be carefully targeted and evaluated. Mathematical models of disease screening constitute one of the major tools in the design and evaluation of screening programs. The purpose of this article is to describe models for disease screening and how they have developed in recent years. The discussion will focus on screening for cancer, because most of the methodologic advances in screening design and evaluation have concerned cancer screening. In the first part of the article we will describe the characteristics of these models and illustrate them with a discussion of a simple screening model. In the second part we will describe the development of the two main types of model. In the third part we will discuss model fitting and validation, and in the final part we will briefly describe models for diseases other than cancer and discuss the current state and possible future directions for models of disease screening. This is not intended to be an exhaustive study of all modeling of disease screening. Rather, it is intended to be a description of the main approaches used and their strengths and weaknesses. For more detailed reviews of modeling disease screening, see Eddy & Shwartz (30), Shwartz & Plough (56), Prorok (50,51), Alexander (5), and Baker et al. (9).

1 WHAT IS SCREENING?

Screening for disease control can be defined as the examination of asymptomatic people in order to classify them as likely or unlikely to have the disease that is the object of screening. People identified by a screening test as likely to have the disease are then further investigated to arrive at a final diagnosis (45). The objective of screening is the early detection of a disease where early treatment is either easier or more effective than later treatment. Figure 1 is a schematic representation of the main features of the natural history of a disease which are relevant to screening. The preclinical phase of the disease is the phase in which a person has the disease but does not have any clinical symptoms and is not yet aware of having it. Screening aims to detect the disease during this phase. In principle, the preclinical phase starts with the beginning of the disease, but, in practice, modeling focuses on the phase commencing at the earliest point at which the disease is detectable with a screening test. This is known as the detectable preclinical phase. The preclinical phase finishes with the clinical surfacing of the disease. This is the point at which the person develops clinical symptoms of the disease, seeks medical attention for these symptoms, and the disease is diagnosed. The disease then enters the clinical phase, where the person has a diagnosable case of the disease. The outcome of a screening test is designated either positive, if the person is identified as likely to have the disease, or negative if they are not. All screening tests are open to error, either from the test itself or its interpretation. These errors are designated as false positive, where a person without the disease has a positive screening result, and false negative, where a person with the disease has a negative screening result. The sensitivity of a screening test is the probability that a person with the disease has a positive screening result. The specificity of a screening test is the probability that a person without the disease has a negative screening result. Cases of the disease which clinically surface following a false negative result (i.e. where the screening test missed the disease) are known as interval cases. It is important to note that sensitivity and specificity are not properties of the test alone. For example, mammography is used to screen for breast cancer in women. In this case
Figure 1. The natural history of a disease with and without screening
the sensitivity and specificity will depend on characteristics of the test, such as the nature of the mammography machine and the number of views taken, as well as on factors such as the skill of the person interpreting the mammogram, the size of any tumor in the woman being screened, the density of her breast tissue, and so on. The reliability of a test is its capacity to give the same result, either positive or negative, on repeated application in a person with a given level of the disease. The survival time is the length of time between disease diagnosis, either by clinical surfacing or detection by screening, and death. The lead time is the time between the detection of a disease by screening and the point at which it would have clinically surfaced in the absence of screening. The lead time is an important issue in the examination of screening benefits. The immediate focus of screening is to detect an early form of the disease. Hence the lead time can be used as an index of benefit in its own right. It is also important in examining survival benefits conferred by screening. A simple comparison of survival times between screened and unscreened populations is likely to show spurious screening benefits, since the survival time for a screen detected disease includes the lead time while that for a disease which surfaced clinically does not. There is another, more subtle, reason why such survival comparisons may be spurious, even if adjusted for lead time. Screening will tend to detect people with a longer preclinical phase. This is known as length-biased
sampling. Usually this will equate to a more slowly progressing disease. Since the disease behavior before clinically surfacing is likely to be correlated with that after surfacing, this is likely to result in screen detected diseases having a longer survival time than clinically surfacing diseases. 2 WHY USE MODELING? The evaluation of screening usually focuses on whether or not the screening program has led to a fall in mortality from the disease in question. As with most medical interventions, randomized controlled trials (RCT) provide the most satisfactory empirical basis for evaluating screening programs. However, they do have significant limitations. RCTs for screening are expensive and time-consuming to run—typically requiring very large sample sizes and having long time lags until benefits are apparent. For example, the RCT of mammography screening carried out in the two Swedish counties of Kopparberg and Ostergotland had a total sample size of 134 867. A statistically significant mortality differential between the control and study groups did not appear until after six years of follow-up, with a further four years of follow-up before the results could be considered definitive (58). Twenty years of data would be required to yield results on some aspects of screening program design (21). Any one trial cannot address all the issues involved in designing a screening program. For example, the Minnesota Colon Cancer Control Study used an RCT to demonstrate a
statistically significant fall in mortality due to screening with a Fecal Occult Blood Test (FOBT), followed by colonoscopy in those with a positive screen (43). However, Lang & Ransohoff (39) have subsequently suggested that the sensitivity of FOBT is considerably less than that reported in the Minnesota study. FOBT has a high false positive rate, and they argue that one-third to one-half of the fall in mortality could be due to chance selection for colonoscopy where an early cancer or large adenomatous polyp is present but not bleeding and the FOBT is positive for other reasons. The original RCT provides no basis for deciding on the role of FOBT separately from that of colonoscopy. Models are one way in which the information on the disease and screening tests from a number of different sources—including RCTs and other clinical and epidemiologic research—can be combined with known and hypothesized features of the specific population to be screened. They can be used to investigate the effect of different screening regimes on different subgroups of the population, both on disease mortality and program costs. For example, one use of modeling has been to investigate the inclusion of different age groups in the population to be screened. They can also be used to project the future course of the disease and screening program, to evaluate the changes in costs and benefits over time. The modeling approach does have limitations. The extra information is obtained from models only by imposing assumptions about the screening process. These include assumptions about the natural history of the disease, about the characteristics of the screening test and about the behavior of the population under study. These assumptions can only rarely be verified, although they can be evaluated as part of the modeling process. A further complication in making these assumptions is that the natural history of most diseases is not completely understood, particularly in the asymptomatic preclinical phase, which is the main focus of screening. This means that one may hypothesize a disease model that meets the constraints of current knowledge but which is still ultimately misleading.
3 CHARACTERISTICS OF SCREENING MODELS 3.1 Types of Model Bross et al. (14) proposed a classification of models used to analyze screening strategies into two types: surface models and deep models. Surface models consider only those events that can be directly observed, such as disease incidence, prevalence, and mortality. Deep models, on the other hand, incorporate hypotheses about the disease process that generates the observed events. Their intent is to use the surface events as a basis for understanding the underlying disease dynamics. This implies models that explicitly describe the disease natural history underlying the observed incidence and mortality. Deep modeling permits generalization from the particular set of circumstances that generated the surface events. As a result, whereas surface models provide a basis for interpreting the observable effects of screening, deep models provide an explicit basis for determining the outcomes of screening scenarios that have not been directly studied in clinical trials (56). This article will focus on the application of deep models to population screening. These models can be further grouped into two broad categories—those that describe the system dynamics mathematically and those that entail computer simulation. The first of these, designated analytic models, uses a model of the disease to derive direct estimates of characteristics of the screening procedure and its consequent benefits. The second, designated simulation models, uses the disease model to simulate the course of the disease in a hypothetical population with and without screening and derives measures of the benefit of screening from the simulation outcomes. 3.2 Markov Framework for Modeling Most screening models use an illness–death model for the disease which is developed within the framework of a Markov chain. A sequence of random variables {X k , k = 0, 1, . . . } is called a Markov chain if, for every
collection of integers $k_0 < k_1 < \cdots < k_n < \nu$,
\[
\Pr(X_\nu = i \mid X_{k_0}, \ldots, X_{k_n}) = \Pr(X_\nu = i \mid X_{k_n}), \quad \text{for all } i. \qquad (1)
\]
In other words, given the present state (Xkn ), the outcome in the future (X ν = i) is not dependent on the past (Xk0 , . . . , Xkn−1 ). The Markov chain formulation is applied to an illness–death model in the following way (16). The population under study is classified into n states, the first m of which are illness states and the remaining n − m of which are death states. An illness state can be broadly defined to be the absence of illness (a healthy state), a single specific disease or stage of disease, or any combination of diseases. In modeling screening, these states typically refer to a healthy state and preclinical and clinical phases of the disease. A death state is defined by cause of death, either single or multiple. Emigration or loss to follow-up may also be treated as a death state. In modeling screening, typically there will be one death state due to death from the disease and another due to death from any other competing cause. Entry to a terminal stage of the disease is also sometimes treated as a death state. Transition from one state to another is determined by the transition probabilities, pij , where pij = Pr(Xk+1 = j|Xk = i), i, j = 1, 2, . . . , n;
k = 1, 2, . . . . (2)
Death states are absorbing states, since once one reaches such a state, transition to any other state is impossible (i.e. $p_{ij} = 0$ for $i = m+1, \ldots, n$ and $j \ne i$). The disease model is said to be progressive if, once one enters the first stage of the disease, in the absence of interventions (such as screening) and competing risks, the only valid transitions are through the remaining disease stages. Because the disease is modeled using a Markov chain, the future path of an individual through the illness and death states depends only on his or her current state, and the future distribution of individuals between illness and death states depends only on the present distribution and not on any past distributions.
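As a concrete (and entirely hypothetical) illustration of the transition probabilities in Equation (2) and of absorbing death states, consider a four-state progressive model with a single combined death state; the monthly probabilities below are invented for illustration only.

```python
# Four states: 0 healthy, 1 preclinical, 2 clinical, 3 dead. One-step transition
# matrix for a progressive disease; rows sum to 1 and the death state is absorbing.
P = [
    [0.988, 0.010, 0.000, 0.002],   # healthy: may enter the preclinical phase
    [0.000, 0.968, 0.030, 0.002],   # preclinical: may surface clinically (no return)
    [0.000, 0.000, 0.990, 0.010],   # clinical
    [0.000, 0.000, 0.000, 1.000],   # dead: p_ij = 0 for every j different from i
]

dist = [1.0, 0.0, 0.0, 0.0]         # everyone starts healthy
for _ in range(120):                # ten years of monthly transitions
    dist = [sum(dist[i] * P[i][j] for i in range(4)) for j in range(4)]
print(dist)                         # distribution over the four states after ten years
```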
This basic model can be varied in a number of ways. The Markov chain treats time as increasing in discrete steps corresponding to the index k. Thus a transition between states can only occur at discrete time intervals. Most screening models extend this to allow transitions to occur in continuous time. In this case, the transition probabilities for any two points in time t1 and t2 are pij (t1 , t2 ) = Pr(X(t2 ) = j|X(t1 ) = i), i, j = 1, 2, . . . , n.
(3)
If pij (t1 , t2 ) only depends on the difference t2 − t1 but not on t1 or t2 separately, the model is time homogeneous. The simple Markov chain described above is time homogeneous. This can be varied to allow the transition probabilities to vary with time. The probabilities can also be allowed to vary with age and other relevant characteristics of the individual. Some of the model formulations also allow the probability of transition out of a state to depend on the sojourn time in that state. 4 A SIMPLE DISEASE AND SCREENING MODEL In this section we describe a simple model presented (and discussed in greater detail) by Shwartz & Plough (56), based on a characterization of the disease process proposed by Zelen & Feinleib (65). We assume that a person can be in one of three states—a healthy state, the preclinical phase of the disease, or its clinical phase. This characterization also implicitly assumes a death state following the clinical phase, but since the focus of the analysis is on the preclinical phase, the death state is not explicitly used. The model is progressive in that once a person enters the preclinical state, in the absence of screening or death from another cause, the disease will ultimately surface and enter the clinical phase. If the person is screened while in the preclinical state, then the disease may be detected with a probability depending on the sensitivity of the screening test.
The main assumption underlying this model (and the whole screening process) is that the earlier in the preclinical phase the disease is found, the better will be the prognosis. Hence, the screening benefit is directly related to the lead time. For this model we define the following:
1. L is the lead time;
2. g(y) is the hazard rate for entering the preclinical state at age y;
3. p(t) is the hazard rate for clinical surfacing after the disease has been in the preclinical phase for time t;
4. f(t) is the false negative rate of the screen when the disease has been present for time t; and
5. b(t) is the probability of ultimately dying from the disease if it is detected when it has been present for time t.

For simplicity, we ignore the possibility of death from other causes. If we let $m$ and $\sigma^2$ be the mean and variance of the sojourn time distribution, then Zelen & Feinleib (65) show that, if we assume a constant hazard rate for disease initiation (i.e. $g(t) = g$), we obtain the following expression for the mean lead time:
\[
E(L) = \frac{m^2 + \sigma^2}{2m} = \frac{m}{2}\left[1 + \left(\frac{\sigma}{m}\right)^2\right]. \qquad (4)
\]
Note that $E(L) > m/2$ for $\sigma^2 > 0$. This illustrates the effect of length-biased sampling, since, if the screen-detected cases were selected at random from all of the cases, one would expect the mean lead time to be $m/2$. This expression also illustrates one of the central difficulties with this form of modeling. The lead time, which is the main index of screening benefit, is a function of the distribution of the sojourn time in the preclinical phase. However, the preclinical phase is, by definition, unobservable. The question of how to estimate characteristics of the sojourn time distribution has been at the center of most of the work done in this area.

For a person to be in the preclinical state at age a, they must have entered the preclinical state before age a and not leave it until after age a. Hence the probability of this is a function of the hazard rates g(·) and p(·). Thus
\[
\Pr(\text{preclinical phase at age } a) = \int_0^a g(u)\exp[-G(u)]\exp[-P(a-u)]\,du \qquad (5)
\]
where G and P denote the cumulative hazards corresponding to g and p, respectively. Furthermore, the probability that the disease clinically surfaces in some time interval $\delta a$ following $a$ is
\[
\Pr(\text{clinical surfacing in } (a, a+\delta a)) = \int_0^a g(u)\exp[-G(u)]\exp[-P(a-u)]\,p(a-u)\,\delta a\,du. \qquad (6)
\]
We combine this with the prognosis measure b(·) to calculate a baseline probability of death from the disease in the absence of screening:
\[
\Pr(\text{death in the absence of screening}) = \int_0^\infty\!\!\int_0^a g(u)\exp[-G(u)]\exp[-P(a-u)]\,p(a-u)\,\delta a\,b(a-u)\,du\,da. \qquad (7)
\]
Now we introduce the effect of screening. We will consider the case of one screening test performed at age s. There are four possibilities:

1. the disease is detected by the screening test;
2. the disease clinically surfaces before the test (i.e. at age a < s);
3. the disease is missed by the screening test and clinically surfaces after the screen (i.e. it is an interval case); or
4. the disease both enters the preclinical phase and clinically surfaces after the screening test.

For the disease to be detected by this test, it must be in the preclinical phase and the test must not give rise to a false negative. The probability of this is
\[
\Pr(\text{disease detection at age } s) = \int_0^s g(u)\exp[-G(u)]\exp[-P(s-u)]\,(1-f(s-u))\,du. \qquad (8)
\]
We have already calculated, in (6), the probability that the disease clinically surfaces at age a < s. For the disease to have been missed by the screen, the person must be in the preclinical state at age s, the test must have produced a false negative, and the disease must have clinically surfaced after the screen. The probability of this is
\[
\Pr(\text{disease missed by test}) = \int_s^\infty\!\!\int_0^s g(u)\exp[-G(u)]\exp[-P(a-u)]\,f(s-u)\,p(a-u)\,\delta a\,du\,da. \qquad (9)
\]
The probability that the disease both enters the preclinical phase and clinically surfaces after the screening test is
\[
\Pr(\text{disease both develops and surfaces after the screen}) = \int_s^\infty\!\!\int_s^a g(u)\exp[-G(u)]\exp[-P(a-u)]\,p(a-u)\,\delta a\,du\,da. \qquad (10)
\]
Once again we can combine these probabilities with our prognosis measure to obtain the probability of death from the disease in the presence of screening:
\[
\begin{aligned}
\Pr(\text{death in the presence of screening}) ={} & \int_0^s g(u)\exp[-G(u)]\exp[-P(s-u)]\,(1-f(s-u))\,b(s-u)\,du \\
& + \int_0^s\!\!\int_0^a g(u)\exp[-G(u)]\exp[-P(a-u)]\,p(a-u)\,\delta a\,b(a-u)\,du\,da \\
& + \int_s^\infty\!\!\int_0^s g(u)\exp[-G(u)]\exp[-P(a-u)]\,f(s-u)\,p(a-u)\,\delta a\,b(a-u)\,du\,da \\
& + \int_s^\infty\!\!\int_s^a g(u)\exp[-G(u)]\exp[-P(a-u)]\,p(a-u)\,\delta a\,b(a-u)\,du\,da. \qquad (11)
\end{aligned}
\]
This expression gives us our screening figure to compare with the baseline figure in (7).

Although none of the models used for disease screening is exactly like the simple model presented here, they all incorporate its fundamental ideas. In particular, they all depend on knowing, in one form or another, the transition probabilities into and out of the preclinical state, the distribution of the sojourn time in the preclinical state, the sensitivity of the screening test, and the disease prognosis as a function of the development of the disease.
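To make the bookkeeping of Equations (7) through (11) concrete, the sketch below evaluates them by simple Riemann sums under the special case of constant hazards g and p, a constant false-negative rate f, and a prognosis function b(t) that worsens with time since onset; all numerical values, the 80-year horizon, and the step size are hypothetical choices made purely for illustration.

```python
from math import exp

# Hypothetical constant rates: onset hazard g, surfacing hazard p, false-negative rate f
g, p, f = 0.01, 0.5, 0.2
b = lambda t: 1.0 - exp(-0.3 * t)    # probability of ultimately dying, given time since onset t
H, du = 80.0, 0.1                    # integration horizon (years of age) and step size

def death_without_screening():
    """Equation (7) with constant hazards, approximated by a double Riemann sum."""
    total, a = 0.0, du
    while a < H:
        u = du / 2
        while u < a:
            total += g * exp(-g * u) * exp(-p * (a - u)) * p * b(a - u) * du * du
            u += du
        a += du
    return total

def death_with_one_screen(s):
    """Equation (11): a single screen at age s (four components)."""
    detected = surfaced_before = missed = after = 0.0
    u = du / 2
    while u < s:                     # in the preclinical phase at the screen and detected
        detected += g * exp(-g * u) * exp(-p * (s - u)) * (1 - f) * b(s - u) * du
        u += du
    a = du
    while a < H:
        u = du / 2
        while u < min(a, s):         # surfaced before the screen, or missed by it
            term = g * exp(-g * u) * exp(-p * (a - u)) * p * b(a - u) * du * du
            if a < s:
                surfaced_before += term
            else:
                missed += f * term
            u += du
        if a > s:                    # entered the preclinical phase only after the screen
            u = s + du / 2
            while u < a:
                after += g * exp(-g * u) * exp(-p * (a - u)) * p * b(a - u) * du * du
                u += du
        a += du
    return detected + surfaced_before + missed + after

baseline = death_without_screening()
screened = death_with_one_screen(s=50.0)
print(baseline, screened, baseline - screened)   # absolute reduction attributable to the screen
```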
5 ANALYTIC MODELS FOR CANCER

A mathematical disease model with two states was first proposed by Du Pasquier (24), but it was Fix & Neyman (32) who introduced the stochastic version and resolved many problems associated with the model. Their model has two illness states—the state of ‘‘leading a normal life’’ and the state of being under treatment for cancer—and two death states—deaths from cancer and deaths from other causes or cases lost to observation. Chiang (15) subsequently developed a general illness–death stochastic model which could accommodate any finite number of illness and death states. Some of the major analytic models developed for cancer screening are listed in Table 1. Lincoln & Weiss (41) were the first to propose a model of cancer as a basis for analyzing serial screening, in this case screening for cervical cancer. They did not explicitly use a Markov framework, but their model implicitly uses a classification of the disease into illness states. Zelen & Feinleib (65) proposed the simple three-state, continuous-time, progressive disease characterization described in the previous section and used it to model screening for breast cancer. In a modification to this basic model, the authors further divide the preclinical state into two parts, defined as:

1. a preclinical state in which the disease never progresses to the clinical state (i.e. the sojourn time is allowed to be infinite); and
2. a preclinical state in which the disease is progressive and will eventually progress to the clinical state.
Table 1. Selected Analytic Models of Cancer Screening

Lincoln & Weiss (41)
Model inputs: The probability density for the beginning of the detectable preclinical phase and the probability of a false negative screen at time t after entering the detectable preclinical phase—calculated by assuming specific functional forms rather than by direct estimation.
Key features: Two illness states—a ‘‘healthy’’ state, in which the disease is not detectable, and a state covering the time between when the disease is first detectable and when it is actually detected by a screening examination. Symptoms assumed never to appear, with all disease detected by screening. Model applied to cervical cancer screening.
Model output/measures of screening benefit: Distribution of time to discovery of tumor.

Zelen & Feinleib (65)
Model inputs: Disease prevalence and incidence data.
Key features: Progressive three-state illness model—a healthy state, the preclinical phase, and the clinical phase. Assumes a single screen. Applied to breast cancer screening.
Model output/measures of screening benefit: Mean lead time.

Prorok (48,49)
Model inputs: Preclinical state sojourn time distribution and disease prevalence and incidence data.
Key features: Uses the Zelen & Feinleib illness model and develops theory for application to multiple screens.
Model output/measures of screening benefit: Mean lead time and proportion of preclinical cases detected.

Blumenson (10–12)
Model inputs: Disease incidence data; preclinical state sojourn time distribution; screening parameters including age at first screen, screening sensitivity, and screening interval.
Key features: Similar three-state illness model to Zelen & Feinleib, with a point occurring in either the preclinical or clinical phase where the disease becomes incurable. Applied to breast cancer screening.
Model output/measures of screening benefit: Number of cases of disease becoming incurable before detection.
Shwartz (54,55)
Model inputs: Specific functional forms and associated parameters governing tumor growth rate and lymph node involvement—chosen to be consistent with published results and available data; breast cancer incidence and death rates and death rates from other causes.
Key features: Model developed specifically for breast cancer. Progressive illness model with a healthy state, 21 disease states defined in terms of tumor size and lymph node involvement, and two death states—death from breast cancer and death from any other cause. Transition from preclinical phase to clinical phase possible in any disease state, with probability dependent on tumor size and tumor rate of growth. Model predictions validated against independent data source (third-order validation).
Model output/measures of screening benefit: Changes in life expectancy as a result of screening; the probability that there will be no disease recurrence and the probability of detection before nodal involvement; the probability of disease detection before death from other causes.

Albert et al. (3,4), Louis et al. (42)
Model inputs: Maximum likelihood estimation of model parameters based on screening data and model assumptions; screening parameters.
Key features: Progressive illness model with preclinical phase classified into states corresponding with prognostic tumor staging schemes, and two ‘‘death’’ states—one corresponding to clinical surfacing and one to death from a competing cause. Model of screening strategy with probability of a positive screen depending on the person's age and disease state. Applied to breast and cervical cancer screening.
Model output/measures of screening benefit: Percentage reduction due to screening in observed cases of late disease; percentage decrease in lost salvageables due to screening.
Dubin (25,26)
Model inputs: Age- and stage-specific disease incidence; survival times in the presence and absence of screening—derived from screening data by assuming particular functional forms for survival distributions.
Key features: Aimed at maintaining comparability between the model and observable characteristics of a screened population. Progressive illness model consisting of disease stages corresponding with prognostic tumor staging schemes. Applied to breast cancer.
Model output/measures of screening benefit: Increase in life expectancy; reduction in probability of dying of breast cancer; reduction in life years lost to women dying of breast cancer; lead time.

Day & Walter (22), Walter & Day (63), Walter & Stitt (64)
Model inputs: Disease incidence derived from screening data; probability distribution specified for preclinical state sojourn time and survival time.
Key features: Progressive three-state illness model—a ‘‘healthy’’ state with no detectable disease, the detectable preclinical phase, and the clinical phase. Focus on sojourn time in detectable preclinical phase and survival times after detection. Applied to breast cancer.
Model output/measures of screening benefit: Mean lead time; survival time after detection by screening.

Coppleson & Brown (20)
Model inputs: Age-specific clinical incidence data and prevalence data derived from detection rates at first Pap smear.
Key features: Four-state illness model developed for cervical cancer. Found that observed data could not be explained without allowing for cancer regression in the illness model.
Model output/measures of screening benefit: Not applicable (model focused on examination of disease natural history).

Albert (2)
Model inputs: Transition probability matrix for movement between model states estimated from numbers of cancers detected for each stage in a screening program.
Key features: Four-state illness model developed for cervical cancer—a healthy state, two preclinical states, and a clinical state. Cancer regression allowed in the two preclinical states.
Model output/measures of screening benefit: Not applicable (model focused on examination of disease natural history).
Parameters of sojourn distribution estimated from screening data and data on interval cancer cases
Parameters of sojourn distribution estimated from screening data and data on interval cancer cases
Model parameters derived from published results of disease studies and screening programs
van Oortmarssen & Habbema (59)
Eddy (27,29,30)
Model inputs
Brookmeyer & Day (13)
Literature references for model
Table 1. (continued)
Assumes that once a disease is detectable by a screening modality, then any screen using that modality will detect the disease Applied to breast, cervix, lung, bladder, and colon cancer
Applied to cervical cancer Five-stage combined disease and screening model—one healthy state, three preclinical states defined by detectability by screening, and one clinical state
Screening false negative rate
Preclinical phase divided into two states—one in which the disease may progress or regress and a second in which the disease always progresses Applied to cervical cancer Similar illness model to Brookmeyer & Day
Probability of death following detection Increase of life expectancy due to screening
Probability of disease detection
Not applicable (model focused on examination of disease natural history)
Total preclinical phase sojourn distribution
Extends Day & Walter model
Key features
Model output/measures of screening benefit
10 SCREENING, MODELS OF
Stage shifts estimated from analysis of a randomized controlled trial of screening
Peak time period for mortality comparison selected from results of a randomized controlled trial of screening.
Uses known prognostic factors which are available early in a randomized controlled trial of screening to predict subsequent mortality differentials
Connor et al. (19), Chu & Connor (17)
Baker et al. (9)
Day & Duffy (21)
Applied to breast cancer
Applied to proportional hazards model for survival analysis Users surrogate endpoints for randomized controlled trial of screening to shorten the duration of the trial and increase its power
Focus is on estimation of the shift of the disease at detection to an earlier stage or an earlier point in the same stage as a result of screening Applied to breast cancer screening Focuses analysis on period when screening has maximum effect and hence analysis of screening trial results gives rise to more powerful statistical tests
Multistage progressive disease model
Tumor size at cancer detection used as a basis for predicting subsequent mortality differentials
Ratio of cancer mortality between screened and control groups
Reduction of deaths at a given stage due to screening Death prognosis of screen detected cancers
SCREENING, MODELS OF 11
12
SCREENING, MODELS OF
These are used to allow for the possibility that some individuals with the disease in a preclinical state will never have the disease progressing to a clinical state. This approach has been generalized in a number of ways by subsequent authors, with most focusing on simple disease models and the estimation of specific screening characteristics. Prorok (48,49) extended the lead time estimation to multiple screens. Blumenson (10–12) calculated the probability of terminal disease as a function of disease duration to date, and used this as a prognostic measure to evaluate screening strategies. Shwartz (54,55) modeled disease progression for breast cancer using tumor size and number of axillary lymph nodes involved to define the preclinical and clinical states. He then determined screening benefit measures, from data on five year survival rate and five year disease recurrence rate for patients, as a function of tumor size and lymph node involvement. Albert and his co-workers (3,4,42) developed a comprehensive model for the evolution of the natural history of cancer in a population subject to screening and natural demographic forces. In its general formulation, the model uses Zelen & Feinleib’s classification of the disease into preclinical and clinical phases, but divides the preclinical phase into states corresponding with prognostic tumor staging schemes. It also has two death states which correspond to clinical surfacing of the disease or death from a competing risk. The model is progressive, but allowance is made for staying indefinitely in any given state. This model is then applied to breast and cervical cancer. Breast cancer is modeled with two illness states, state 1 corresponding to disease with no lymphatic involvement and state 2 corresponding to disseminated disease (the contrary case). Cervical cancer is modeled with three illness states, state 1 corresponding to neoplasms in situ, state 2 corresponding to occult invasive lesions, and state 3 corresponding to frankly invasive lesions. The authors then impose on this model a screening strategy with a particular probability of a positive screen, depending on a person’s age and disease state. Using this, they derive equations describing how the natural history of cancer (depicted by
the distribution of numbers in each state and associated sojourn times) evolves over time in the presence of screening. These, in turn, are used to derive equations for measures of benefit from screening in terms of the disease status. These benefit measures include the percentage reduction in the cumulative number of observed cases of late disease due to screening and the percentage decrease in lost ‘‘salvageables’’ due to screening. A salvageable is a person who would have benefited from screening but who, in the absence of screening, progresses to a late stage of the disease before discovery. Dubin (25,26) developed a general multistage disease model similar to that of Chiang (15), and applied this to breast cancer using the same two stage classification as Albert et al. (4). He noted the difficulty in estimating parameter values for detailed disease models from existing data from screening programs. His model aimed to avoid these difficulties by maintaining comparability between the model and the observable characteristics of a screened population. He did this by focusing on age and stage-specific incidence and survival times in the presence and absence of screening. He derived formulas for the proportion of disease incidence which had been diagnosed earlier due to screening than it would have been in the absence of screening, and used these to derive various measures of screening benefit. Dubin’s model is not strictly a deep model as defined above. However, although he makes no explicit hypotheses about the rate of disease progression, such hypotheses are implicit in his model. Day & Walter (22) developed a variation on the simple three-stage model which has been used extensively. The focus of this model is the sojourn time in the detectable preclinical phase, for which a probability distribution is specified. For example, Walter & Day (63), in applying the model to breast cancer, used several alternate distributions, including the exponential, the Weibull, and a nonparametric step function. Under the model assumptions, one may derive expressions for the anticipated incidence rates of clinical disease among groups with particular screening histories and for the anticipated prevalence of preclinical disease found at the various screening times.
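As an illustration of the kind of expressions involved, consider a deliberately simplified version of such a three-state model (a sketch under stated assumptions, not necessarily the exact parameterization used by Day & Walter): suppose entry into the detectable preclinical phase occurs at a constant rate J, the sojourn time in that phase has density f, distribution function F, survivor function Q and mean mu, and a screen detects prevalent preclinical disease with sensitivity S. Then the expected yield of a first screen and the expected incidence I(tau) of interval cancers at time tau after a negative first screen are approximately

\[
\text{detected prevalence} \approx S\,J\int_0^\infty Q(u)\,du = S\,J\,\mu,
\qquad
I(\tau) \approx J\bigl[F(\tau) + (1 - S)\,Q(\tau)\bigr],
\]

where the first term of I(tau) counts cases that entered the preclinical phase after the screen and the second counts cases missed at the screen. Fitting expressions of this general kind to observed detection rates and interval cancer rates is what allows the sojourn time parameters, and hence lead time, to be estimated from screening data.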
One advantage of this model is that it is relatively simple to obtain approximate confidence intervals for parameter values. The model was extended by Walter & Stitt (64) to permit evaluation of survival of cancer cases detected by screening. A useful synthesis of the analytic models described above applied to breast cancer is presented by O’Neill et al. (46). All of the above are progressive models, but there are some forms of cancer for which the assumption of progression is not appropriate and for which some form of regression is required. These are cancers, such as large bowel cancer and particularly cervical cancer, where screening detects preinvasive or even precancerous lesions (13). A number of models have attempted to address this. Coppleson & Brown (20) developed a model for cervical cancer and found that the observed data could not be explained without allowing for regression. Albert (2) developed a variation of his earlier model for cervical cancer which allowed for regression from the carcinoma in situ stage back to the healthy state. Brookmeyer & Day (13) and van Oortmarssen & Habbema (59) both developed similar extensions to the Day & Walter model to divide the preclinical phase into two. The first stage allows regression to a healthy state, but once a cancer reaches the second stage only progression is allowed. The Coppleson & Brown, Albert, and van Oortmarssen & Habbema studies provide an interesting variation on the use of these models, in that the aim of the model was not to study cancer screening directly. Rather, the model was used to study the disease dynamics and, in particular, to examine the epidemiologic evidence for the existence of regression in preinvasive cervical cancer. The models described above follow a common theme of characterizing the disease as a series of states (corresponding to health, the various disease stages, and death), with people moving between the states with certain transition probabilities and/or certain sojourn times. Screening is then evaluated by superimposing on the disease process a screening process with particular screening regimes and screen sensitivity. This is in contrast to the next model, due to Eddy (27), which uses a different strategy.
Eddy’s modeling strategy uses a time varying Markov framework. However, he models the interaction between the screen and the disease in his basic model. This is a five-stage model defined in terms of three time points. The first is a reference time point tp . The way in which this is defined varies with the cancer under discussion but, as an example, for breast cancer it is the point at which the disease can first be detected by physical examination. The occult interval is then defined as the time interval between this and the point tM at which the disease is first detectable by screening (e.g. by mammography). The patient interval is defined as the time between tp and the time t at which the patient would actually seek medical care for the lesion. With Eddy’s model, t , tp , and tM can occur in any order. The important assumption is that once a disease is detectable by a screening modality (i.e. after tM ), then any screen using that modality will always detect the disease. This assumption replaces the assumption commonly made in models of screening that successive screens are independent. The other two states are a ‘‘healthy’’ state (which includes any preclinical disease which is still undetectable by screening) and a clinical disease state. Eddy models the probability distributions of the occult and patient intervals and uses these to derive formulas for the probabilities of discovering a malignant lesion by screening and by other methods. Eddy’s model has been applied to several breast cancer screening data sets as well as to cervical, gastrointestinal, lung, and bladder cancer. It has also been extended to the case in which there is more than one type of screening test (31). Finally, there are three recent analytic models which provide interesting variations on screening modeling. The first of these is the stage shift model (19). This assumes that the effect of screening is to shift the diagnosis of a cancer from a higher to a lower stage or within a given stage to an earlier time of diagnosis. Connor et al. develop the theory for a randomized controlled trial with equal sized intervention and control groups, but the equations can be modified to allow for proportional number of cases if unequal groups are used. The
method of fitting this model requires a completed trial with follow-up that has reached the point at which comparable sets of cancer cases have accumulated in the study and control groups. For most of the discussion, Connor et al. ignore variability associated with the estimation process and the determination of the point at which comparability is reached in order to emphasize the exploratory nature of the analysis. However, they do present simple variance estimates based on the assumption that their data follow a Poisson distribution. The need for a completed trial and long follow-up period limits the model’s applicability, but it has been used to analyze breast cancer screening data (17). The second is the peak analysis model (9). This uses data from a randomized trial to determine the time period during which screening has the maximum effect on mortality. The results of the trial can then be analyzed restricting attention to that time period, providing more powerful statistical tests. For breast cancer screening, for example, this could mean excluding the mortality experience of the first few years after the initiation of screening. A disadvantage of this model is that the selection of the peak time period for the mortality comparison could be regarded as ‘‘data-driven’’ and subject to the usual problems of a post hoc analysis (44). The third is the use of surrogate endpoints for RCTs to shorten the duration of the trial and to increase the power (21). Day & Duffy apply this approach to a study comparing breast cancer screening at three yearly and one yearly intervals. Tumor size is the most important variable in predicting survival from breast cancer in the screening context, so they consider the difference in tumor size distribution between the study groups. They show that using this as an index of benefit and projecting expected mortality allows a result after only five years, compared with the 15–20 years required for a trial based on observed mortality. Furthermore, they demonstrate the rather surprising result that the use of surrogate endpoints leads to an increase in the power of the RCT compared with using the observed mortality. While completed trials remain necessary to establish the primary benefits of screening,
this approach allows faster and more efficient resolution of subsidiary issues.

6 SIMULATION MODELS FOR CANCER

Some of the major simulation models developed for cancer screening are listed in Table 2. Knox (34) developed the earliest and most comprehensive simulation model. As with the analytic models, Knox uses a healthy state, a number of illness states and two death states. However, the model involves considerably more illness states, including classifying the disease as a preclinical, early clinical, or late clinical cancer, and further classifying each of these as treated or not treated and each cancer as high or low grade. Knox defines a transition matrix containing the estimated transfer rates between the various pathological states, modified according to the age of the individual or the duration of the state. He then simulates the evolution of the disease in a hypothetical cohort of study subjects which has similar characteristics to the population that he wishes to study (which, in this case, is the adult female population of England and Wales) using the transition matrix and a standard life table to provide the risks of competing causes of death. Finally, he adds details of the screening procedures to be considered, specifying the clinico-pathological states to which they apply, and their sensitivities and specificities in relation to each, and the transfers between model states which will occur following detection or nondetection. The screening policies are arranged in incremental series, and the results compared with each other and with the results of providing no screening at all. This allows the appraisal of benefits and costs in both absolute and marginal terms. This model has been applied to both cervical cancer (34) and breast cancer (35). It illustrates one major difference between the analytic and simulation approaches—the greater complexity of the disease and screening models in the simulation case. However, this extra complexity requires more detailed information on the disease dynamics in order to specify the model and this information is often not readily available. Knox (36) says of his earlier work that
Table 2. Selected Simulation Models of Cancer Screening
(Columns of the original table: literature references for model; model inputs; key features; model output/measures of screening benefit.)

Knox (34,35)
- Model inputs: model parameters derived from published results of disease studies and screening programs and from the known characteristics of the population under study.
- Key features: cohort simulation model; illness model with 26 defined states; transition matrix defined for movements between these states following detection or nondetection of disease in the presence of specified screening procedures; model applied to a hypothetical cohort of study subjects with similar characteristics to the population under study; applied to both breast and cervical cancer screening.
- Model output/measures of screening benefit: simulated mortality and morbidity in the presence of screening under various screening regimes.

Knox (36), Knox & Woodman (37)
- Model inputs: model parameters derived from published results of disease studies and screening programs and from the known characteristics of the population under study.
- Key features: cohort simulation model; illness model with two disease states, one in which the disease is susceptible to early detection and full or partial cure and a second in which the disease is incurable; model applied to subjects who have died from cancer but may have been saved if screening had been offered; applied to breast and cervical cancer screening.
- Model output/measures of screening benefit: simulated reduction in mortality due to screening.

Parkin (47)
- Model inputs: model parameters derived from published results of disease studies and screening programs and from the known characteristics of the population under study.
- Key features: microsimulation model developed for cervical cancer screening; illness model has nine states (a healthy state, three preclinical states, one clinical state, two death states, and a hysterectomy state in which a woman is no longer at risk of cervical cancer); model applied to a hypothetical population with age structure similar to that of the population under study.
- Model output/measures of screening benefit: simulated mortality and morbidity in the presence of screening under various screening regimes.

Habbema et al. (33), van Oortmarssen et al. (61)
- Model inputs: model parameters derived from published results of disease studies and screening programs and from the known characteristics of the population under study.
- Key features: general framework for microsimulation modeling; follows similar approach to Parkin; applied to breast and cervical cancer screening.
- Model output/measures of screening benefit: simulated mortality and morbidity in the presence of screening under various screening regimes.
The chief problem of applying the predictions stemmed from uncertainties about the clinical course of the early stages of cancer.
In this and in all his subsequent analyses, he simplified his model to one with only two illness states. This two-state model is worth discussing in detail because of its different approach to the population under study. Whereas the usual approach is to consider all people at risk of a cancer and to use the model to project mortality with and without screening, Knox’s approach is to consider only those who have died from cancer, and to use the model to estimate how many would have been saved if screening had been offered. He refers to it as ‘‘tearing down’’ a graph of age-distributed deaths in successive steps through the insertion of screening procedures at selected ages (37). This means that Knox does not need to consider variations in the course of the disease, such as lesions which never clinically surface or which regress to a healthy state, because all members of his population have, by definition, a progressive form of the disease. The two illness states are designated A and B. During state A the disease is susceptible to early detection and full or partial cure. During state B, the disease is incurable. The sojourn time in each state varies around an age-specific mean. The screening procedure has a probability of detecting the lesion which rises linearly during period A, while the probability of curing the disease falls linearly during A. This model has the advantage of simplicity, which means that it is relatively easy to find plausible parameter values for it. However, this simplicity has disadvantages. The model only considers the situation of a fully established screening program, so that it cannot be used to investigate issues surrounding setting up a new program. Also, because it is focused on mortality reduction, it cannot be used to consider issues relating to costs of screening programs. Researchers at the Australian Institute of Health and Welfare have extended this approach by combining Knox’s disease model with a costs model to evaluate the introduction of breast and cervical cancer screening programs in Australia (6,7). They have also
combined the disease model with mortality projections to investigate the timing of mortality reductions due to the introduction of a breast cancer screening program (8). Parkin (47) identifies a number of advantages of the cohort simulation approach of transferring year by year specified proportions of a single cohort in a deterministic fashion between model states. These include the model’s ability to: 1. demonstrate the relationships between variables; 2. explore the effects of different acceptance rates and test characteristics on outcome measures; 3. examine the net cost-effectiveness of different screening policies by imputing costs to the different outcomes of screening tests; and 4. explore the effect of different theoretic natural histories on the outcome of screening. However, he also identifies some of the disadvantages of this approach. First, services have to be planned, not for a single cohort over an entire lifespan, but for a very heterogenous population over relatively short time periods. When a screening program providing for testing at certain fixed ages is introduced into a community, only people younger than the starting age for the screening policy can possibly receive the full schedule of tests. Thus, benefits from screening will at first be small, but will increase progressively as more of the population receives a series of examinations. Furthermore, many people will have already had previous examinations, so the results of the screening policy will depend on the existing screening status of the population. This cannot be simulated by a single cohort model; nor can differences in the risk of disease in different birth cohorts. Secondly, it may be desirable to use characteristics other than age to identify subgroups of the population for selective screening. This is less often of practical use, since such subgroups are usually not readily identifiable, but a planning model should be able to explore the effectiveness of policies involving differential screening of such subpopulations.
In addition, population subgroups often have different rates of attendance at screening programs which may be correlated with different disease risks. Finally, screening programs do not exist in isolation from the rest of the health care system. Much screening activity can take place outside a screening program. Most models usually treat this activity as ‘‘diagnostic’’ and ignore it. However, a planning model should take account of all relevant screening activity. Parkin proposes instead a microsimulation approach. Here, the life histories of individual members of a population are simulated. The population in his model has the demographic make-up of that of England and Wales and its size is governed by two considerations: (i) the computer time involved in microsimulation of very large populations; and (ii) the need for reliable results in a stochastic simulation of relatively rare events. Each individual is characterized by his or her values for a set of variables which will be used in simulating demographic events, the disease natural history, or screening programs. The values of these variables are updated annually using sets of conditional transition probabilities (e.g. the probability of childbirth given age, marital status, and initial parity). The occurrence of a transition is decided by comparing the relevant probability against a randomly generated number. There is considerable flexibility in modeling screening programs and, since the model follows individuals, it is possible to simulate contacts with the health care system and the ‘‘incidental’’ screening which occurs on such occasions. Parkin’s microsimulation model was developed specifically for cervical cancer screening, but a group working at Erasmus University in the Netherlands has developed a general modeling framework for microsimulation modeling of cancer screening called MISCAN (MIcrosimulation SCreening ANalysis) (33,61). Strictly speaking, MISCAN is not itself a model, but rather a model generator—a package that can generate and calculate a variety of these microsimulation models.
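A minimal sketch of the annual updating step described above, using entirely hypothetical states and transition probabilities (none of these numbers or state names come from Parkin's model), might look as follows.

```python
import random

# Hypothetical annual transition probabilities, for illustration only; a real
# microsimulation conditions these on age, screening history and other
# individual attributes, and uses many more states.
TRANSITIONS = {
    "healthy":     [("preclinical", 0.002), ("dead_other_cause", 0.010)],
    "preclinical": [("clinical", 0.250), ("dead_other_cause", 0.010)],
    "clinical":    [("dead_cancer", 0.150), ("dead_other_cause", 0.020)],
}
ABSORBING = {"dead_cancer", "dead_other_cause"}

def update_state(state: str, rng: random.Random) -> str:
    """Advance one individual's state by one year.

    A single uniform random draw is compared against the cumulative annual
    transition probabilities out of the current state; if it exceeds them
    all, the individual stays where they are.
    """
    u = rng.random()
    cumulative = 0.0
    for next_state, prob in TRANSITIONS.get(state, []):
        cumulative += prob
        if u < cumulative:
            return next_state
    return state

def simulate_life_history(years: int = 80, seed: int = 0) -> list:
    """Simulate one individual's annual state sequence from the healthy state."""
    rng = random.Random(seed)
    state, history = "healthy", []
    for _ in range(years):
        state = update_state(state, rng)
        history.append(state)
        if state in ABSORBING:
            break
    return history

if __name__ == "__main__":
    print(simulate_life_history(seed=42)[-3:])
```

Repeating this for a large number of simulated individuals, and layering screening contacts on top of each disease history, yields the kind of individual-level output that a single deterministic cohort model cannot provide.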
The MISCAN approach, like Parkin’s model, is based on the actual structure of a population as it develops in a given country at a particular time. The mass screening program under consideration is taken as starting in a particular year and finishing in a particular year. Standard demographic techniques are used to project the study population to a year well after the nominated end of the program. This allows for both the introduction of the program to be modeled and the effects, after the end of the program, to be followed up. The basic structure of the cancer model is similar to Knox’s earlier model with a detailed classification of clinical and preclinical cancer states, although it uses a smaller number of states. The interaction between the disease model and the screening program is designed to allow projection of screening and treatment costs as well as cancer mortality and morbidity. MISCAN has been widely used to analyze breast and cervical cancer screening programs.
7 MODEL FITTING AND VALIDATION

Eddy (28) proposed four levels of validation for mathematical models:
1. First-order validation: this requires that the structure of the model makes sense to people who have a good knowledge of the problem.
2. Second-order validation: this involves comparing estimates made by the model with the data that were used to fit the model.
3. Third-order validation: this involves comparing the predictions of the model with data that were available when the model was fitted but that were not used in the estimation of model parameters.
4. Fourth-order validation: this involves comparing the outcomes of the model with observed data when applied to data generated and collected after the model was built (for example, data from a previously unobserved screening program).
In this section we discuss model fitting and validation for cancer screening in the framework of these levels. First-order validation is generally not difficult to accomplish. The conceptualization of cancer as a series of preclinical and clinical stages is virtually universally accepted as a reasonable characterization of the disease. Problems may arise when the details of the disease stages are specified, but generally a wide variety of model formulations are plausible within the constraints of the limited knowledge of preclinical cancer. Second-order validation highlights one of the central problems with this sort of deep model. This is the difficulty of directly relating available data to model parameters. The mismatch between the data available, either from screening trials or other sources, and the model data requirements for parameter estimation has been recognized from the beginning of this type of modeling. Lincoln & Weiss [41, p. 188] note, for example, that Here we can do no more than introduce plausible forms for the different functions involved and plausible values for the parameters.
They go on to describe the difficulties in relating available data to the mathematical functions on which their model is based. This is a recurring problem in modeling cancer for screening, and to some extent affects all of the models described in this article. Some of the analytic models have developed methods of estimating model parameters using standard statistical estimation approaches. Dubin (26), for example, structured his model so that it could directly use the data from screening trials, although as a consequence his model relates less to the disease natural history than do the others. Louis et al. (42) derived nonparametric models for the probability distributions specified in their model and proposed the use of maximum likelihood methods to fit them. Day & Walter (22) used both parametric and nonparametric functions for their preclinical sojourn time and suggested either maximum likelihood methods or least squares criteria to fit them. However, many of the analytic models and all of the simulation models proceed in a more ad hoc fashion by varying their disease
natural history and model parameters until their models closely reproduce existing data. Knox (35), pp. 17–18 gives an example of how this ad hoc fitting operates, in fitting his earlier model to breast cancer screening data. He describes fitting the natural history data thus: A statement of the natural history of the disease process must be provided in the form of a ‘‘transition matrix’’ which gives estimated transfer rates between the various pathological states, modified suitably according to the age of the woman or the duration of the state. This set of values is adjusted iteratively until an output is produced which matches available data on incidence, prevalence and mortality. If, as sometimes happens, more than one natural history statement is capable of mimicking these facts, then the natural history will have to be treated as one of the uncertainties. Subsequent runs will then have to be repeated for a range of natural history alternatives, and each prediction of results will be conditional upon the accuracy of the natural history used.
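A minimal sketch of this kind of iterative, ad hoc calibration, using hypothetical parameter names, targets and a toy stand-in for the simulation itself (nothing here is taken from Knox's actual model or data), might look as follows.

```python
import itertools

# Hypothetical calibration targets (rates per 100,000) and a tolerance for
# declaring that a candidate natural history adequately mimics the data.
TARGETS = {"incidence": 120.0, "prevalence": 340.0, "mortality": 45.0}
TOLERANCE = 0.05  # accept parameter sets within 5% of every target

def run_model(mean_sojourn_years: float, screen_sensitivity: float) -> dict:
    """Toy stand-in for a full natural-history simulation.

    A real model would simulate a cohort under these parameters and return
    the resulting incidence, prevalence and mortality; simple algebraic
    relationships are used here purely so the example runs.
    """
    return {
        "incidence": 120.0,
        "prevalence": 120.0 * mean_sojourn_years,  # prevalence ~ incidence x duration
        "mortality": 45.0 * (1.05 - 0.1 * screen_sensitivity),
    }

def acceptable(output: dict) -> bool:
    """True if every simulated quantity is within TOLERANCE of its target."""
    return all(abs(output[k] - target) / target <= TOLERANCE
               for k, target in TARGETS.items())

# Try a grid of candidate natural histories and keep every set that mimics
# the observed data; more than one may do so, in which case each surviving
# natural history has to be carried forward as an alternative scenario.
candidates = itertools.product([2.5, 2.8, 3.0, 3.5], [0.6, 0.7, 0.8, 0.9])
plausible = [(d, s) for d, s in candidates if acceptable(run_model(d, s))]
print(plausible)
```

The point of the sketch is the final two lines: the fit is judged by whether simulated summary rates reproduce the observed ones, and any natural history that survives the check has to be treated as a live alternative in subsequent runs.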
Parkin (47) provides an example of just such an uncertainty about natural history, with the final model including three different natural histories as alternatives. This approach to model fitting has the disadvantage that, particularly for models with a large number of unknown parameters, the fit of the predicted values may be close to the observed data whether or not the model is in any sense valid. However, fitting the model to a number of independent data sets simultaneously and validating it against each of these data sets, as was done by van Oortmarssen et al. (61), provides some protection against this possibility. Third-order validation is usually made difficult by the lack of data. Generally, most available data are used in determining the parameters of the model (56). Breast cancer models are a good example of this. The only real data sources for fitting models for breast cancer screening are the screening studies, and in particular the RCTs. The first major study was the Health Insurance Plan of New York study (HIP) (53). This program started screening in 1963. Subsequent studies were not started for another ten years, with the Utrecht Screening Program (18)
starting screening in 1974 and the Swedish Two-county Randomized Trial starting in 1977 (58). This means that many of the models only had access to the HIP data. Screening technology has changed significantly since the HIP program began (61), so when later studies became available they could not be directly compared with the HIP program and, in any case, it is questionable whether models based only on HIP data are directly relevant to modern screening. Because of the long time before mortality benefits from screening are fully apparent, models fitted using solely data from later studies have only appeared relatively recently (61) and, at least in their published form, have generally not addressed the issue of third-order validation. However, as more screening programs are implemented, more data should become available for third-order validation (6). Eddy (28) recognized that fourth-order validation is only possible in rare cases. However, there are at least two examples of studies which use models in a way that could be called fourth-order validation, coincidentally both using Eddy's own model. Verbeek et al. (62) compare predictions from Eddy's model for breast cancer to data from a mammography screening program in Nijmegen. The authors note that the comparison does not suggest a particularly good fit. However, this is only a preliminary study, and further validation work remains to be done. Eddy (29) compares his model for cervical cancer with a later independent analysis of empirical data. In this case the model appears to predict accurately the effect of different cervical cancer screening policies on outcomes that are important for policy decisions. The best way to see how these models are fitted and used in practice is to examine examples. The following three sections describe an example of fitting a model, an application of the fitted model, and a comparison of its results with those of a second model.

7.1 An Example of Model Fitting

This section describes the analysis by van Oortmarssen et al. (61) of breast cancer screening based on the MISCAN computer simulation package. This model is designed to reproduce the detection rates and incidence of interval cancers as observed in the
screening projects in Utrecht and Nijmegen in the Netherlands. The basic model structure is shown in Figure 2. The first state is the state of no breast cancer. Women stay in this state until a transition occurs to one of the preclinical states that is detectable by screening (either mammography or clinical examination). The preclinical phase is divided into four states. There is one preinvasive state, intraductal carcinoma in situ (dCIS), and three screen detectable invasive states subdivided according to the diameter of the tumor: <10 mm, 10–19 mm, and ≥20 mm. The subdivision applied to the preclinical invasive states is also used for the clinical phase and for screen detected tumors. The state ‘‘false positives’’ refers to women with a positive screening examination in whom no breast cancer is found at further assessment. The two end states of the model are ‘‘death from breast cancer’’ and ‘‘death from other causes’’. Transitions into the ‘‘death from other causes’’ state (not shown in the figure) are possible from every other model state and are governed by the Dutch life table, which is corrected for death from breast cancer. The values of the key parameters of the model are summarized in Table 3. Parameters relating to clinical breast cancer and survival can usually be taken directly from available data. In this case, the preclinical incidence was estimated from the reported Dutch clinical incidence figures shifted to younger ages according to the model’s assumptions about the transitions and durations in the preclinical stages. The distribution of the tumor diameters for clinically diagnosed cancers was obtained directly from data on cancers diagnosed outside the screening program in Utrecht and Nijmegen. Survival is described by a fraction cured and a survival time distribution for women who are at risk of dying from breast cancer. The survival time distribution is based on the lognormal, with mean and variance taken from a published analysis of the Swedish Cancer Registry data (52). The fraction cured was estimated from the Utrecht data on clinically diagnosed cancers and varied with age according to another published analysis of Swedish data on age-specific breast cancer survival (1).
Figure 2. The structure of the disease and screening model for breast cancer developed by van Oortmarssen et al. The state ‘‘death from other causes’’ is not shown. It may be reached from all other states. Source: van Oortmarssen et al. (61)
The combination of model assumptions on clinical incidence, stage distribution, and survival results in a good fit for the mortality rate for breast cancer in the Netherlands at all ages. Parameters relating to the preclinical phase are less easily specified. Parameter estimation was done by comparing simulated results from the model with data from the Utrecht and Nijmegen projects. An initial set of parameter values, partly taken from an earlier analysis of the HIP screening trial (60), resulted in many discrepancies between the simulated and observed data. The model parameters were systematically varied until a set of model specifications was found which gave an adequate overall fit to the Utrecht and Nijmegen data. Finally, the improvement in prognosis due to screen detection was calculated from the results of the Swedish Two-county screening study (58). This model passes both first- and second-order validation, in that it is consistent with
what is known about the natural history of breast cancer and with previous models developed in the literature, and its results are consistent with the Utrecht and Nijmegen data used in its fitting. Third-order validation is more problematic. As noted above, the HIP data are not directly comparable with those considered here and the authors used all the other available data in fitting the model. Similarly, fourth-order validation is not possible in this case, since published results from other breast cancer RCTs were not available at the time this analysis was carried out.

7.2 An Application of the Model to Breast Cancer Screening

The breast cancer disease model described above was applied to Australian data by Stevenson et al. (57) to simulate the introduction of a breast cancer screening program. Australian breast cancer data and life table data were used to estimate cancer incidence and population life expectancies.
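One common way to tally such outcomes from simulation output is sketched below; the article does not spell out the exact bookkeeping used by Stevenson et al., so these expressions are illustrative rather than a statement of their method:

\[
\mathrm{LYL} = \sum_{i\,\in\,\text{simulated breast cancer deaths}} e(a_i),
\qquad
\text{life years saved} = \mathrm{LYL}_{\text{no screening}} - \mathrm{LYL}_{\text{screening}},
\]

where e(a_i) is the residual life expectancy, taken from the life table, at the age a_i at which individual i dies of breast cancer in the simulation.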
Table 3. Key Assumptions of the van Oortmarssen et al. Breast Cancer Screening Model
- Preclinical incidence: based on Dutch clinical incidence, 1977–82.
- Clinical stage distribution (independent of age): <10 mm, 10%; 10–19 mm, 22%; ≥20 mm, 68%.
- 20 year survival of clinically diagnosed breast cancer (diagnosis at age 55; age-dependent): <10 mm, 83%; 10–19 mm, 68%; ≥20 mm, 51%.
- Duration of preclinical invasive stages (average duration in years): age 40, 1.6; age 50, 2.1; age 60, 3.0; age 70, 4.7.
- Sensitivity of mammography (independent of age): dCIS, 70%; <10 mm, 70%; ≥10 mm, 95%.
- Impact of early detection: mortality reduction for screen detected cancers, 52%.
Source: van Oortmarssen et al. (61).
Table 4. Breast Cancer Screening Options
- Option 1: ages 50–69 screened every 2 years.
- Option 2: ages 50–69 screened every 3 years.
- Option 3: ages 40–49 screened every year; ages 50–69 every 2 years.
- Option 4: ages 40–49 screened every 2 years; ages 50–69 every 3 years.
- Option 5: ages 40–69 screened every 2 years.
Pilot testing of screening programs suggested that a screening participation rate of 70% was a reasonable target (6). All other model parameters were taken from the van Oortmarssen et al. model. The model was applied to five different screening options defined in terms of the age group offered screening and the interval between successive screens. These are listed in Table 4. Taking 1990 as the nominal starting year, the analysis simulated the introduction of a screening program phased
in over five years and running for a further 25 years. The simulated total life years lost in the absence of a screening program and the life years saved by screening for each of the screening options are listed in Table 5. These results show a clear benefit in including women aged 40–49 in the screening program and of a two year interval over a three year interval. However, they also show that decreasing the interval to one year for women aged 40–49 makes only a marginal improvement. An analysis of screening should include consideration of costs as well as benefits. A complete discussion of estimating costs is beyond the scope of this article, but generally they will be based on both current screening experience (with, for example, screening pilot projects in the location under study) and model based projections. These estimates are usually reported as the present value of the costs. This involves applying an annual discount rate to costs projected for future years. Hence, where costs are compared with benefits, the benefits are usually also presented in present value terms by applying the same annual discount rate.
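A minimal sketch of this discounting step, using the 5% annual rate quoted in Table 6 and entirely hypothetical yearly streams of costs and life years saved, is given below.

```python
def present_value(stream, annual_rate=0.05):
    """Discount a yearly stream (element t accrues in year t) back to year 0."""
    return sum(amount / (1.0 + annual_rate) ** year
               for year, amount in enumerate(stream))

# Hypothetical 30-year streams, for illustration only: costs in $ million per
# year and life years saved in thousands per year, rising as the program
# phases in over its first five years.
costs = [40.0] * 5 + [55.0] * 25
life_years_saved = [2.0, 4.0, 6.0, 8.0, 10.0] + [12.0] * 25

pv_costs = present_value(costs)            # $ million, discounted
pv_lys = present_value(life_years_saved)   # '000 life years, same discount rate
cost_per_life_year = pv_costs * 1e6 / (pv_lys * 1e3)

print(f"PV of costs: {pv_costs:.1f}m; PV of life years saved: {pv_lys:.1f}k")
print(f"Average cost per life year saved: ${cost_per_life_year:.0f}")
```

Applying the same rate to both streams, as in the last line, is what makes a ratio such as the average cost per life year saved in Table 6 internally consistent.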
Table 5. Number and Proportion of Life Years Saved among Australian Women by Mammography Screening over a 30 Year Screening Period, as Estimated from the van Oortmarssen et al. Simulation Model
(Columns: total life years lost in the absence of a screening program, '000s; number of life years saved as a result of the screening program, '000s; life years saved as a percentage of total life years lost.)
- Option 1: 3766.6; 250.5; 6.7%.
- Option 2: 3767.0; 202.4; 5.4%.
- Option 3: 3741.2; 324.3; 8.7%.
- Option 4: 3743.1; 258.1; 6.9%.
- Option 5: 3755.6; 309.4; 8.2%.
Note: These results are based on the simulation of individual life histories, with the outcomes for each individual being determined randomly by applying the probabilities of developing the disease and of surviving the disease. This means that the outcome for each individual may vary between simulations. This accounts for the small variation in the simulated total life years lost figures.
Source: Stevenson et al. (57).
Table 6. Relative Cost-Effectiveness of Screening at Different Screening Intervals for Women Aged 40–69
(Columns: net present value of costs to service providers and women, $ million; net present value of life years saved, '000s; average cost per life year saved, $.)
- Option 3: 1917.8; 622.2; 3082.3.
- Option 4: 1097.5; 628.6; 1745.9.
- Option 5: 1374.6; 620.6; 2215.0.
Note: Net present value calculated by applying an annual discount rate of 5%.
Source: Costs data taken from Australian Health Ministers' Advisory Council report on breast cancer screening (6). Projected life years saved data taken from Stevenson et al. (57).
The estimated total costs and costs per life year saved for the three screening options which include women aged 40–49 are presented in Table 6. This shows that the small increase in life years saved gained by moving to a one year screening interval for women aged 40–49 is offset by a substantial increase in the cost per life year saved.

7.3 A Comparison of Two Models for Breast Cancer

Stevenson et al. (57) also simulated the introduction of an Australian breast cancer screening program using Knox's two-state disease model described above. Table 7 presents a comparison of the percentage of life years saved for each screening option as derived from this model and from the van Oortmarssen et al. model. There are clear differences between the two models, with the Knox model estimates consistently higher for all screening options. Furthermore, the
evidence from the Knox model for including women aged 40–49 is more equivocal. It is tempting to ask which model is right but, while there is some reason for preferring the van Oortmarssen model (because of its more extensive validation), a more relevant question is which model more correctly addresses the issue under study. The Knox model applies to a steady state situation, in which the screening program has been operating for long enough so that no one in the target population is too old to have participated in the full program. The van Oortmarssen et al. model makes allowance for the fact that the start of the program excludes some women from fully participating. The effect of this is that the Knox model will overstate the gains in life years saved at the start of the program. The difference in the results for including women aged 40–49 years arises from more realistic assumptions in the van Oortmarssen et al. model about the effect
Table 7. Percentage of Total Life Years Saved among Australian Women by Mammography Screening as Estimated by Two Simulation Models
(Columns: life years saved as a percentage of total life years lost, van Oortmarssen et al. model; life years saved as a percentage of total life years lost, Knox two-stage model.)
- Option 1: 6.7; 12.6.
- Option 2: 5.4; 11.1.
- Option 3: 8.7; 12.9.
- Option 4: 6.9; 11.1.
- Option 5: 8.2; 12.8.
Source: Stevenson et al. (57).
of screening at those ages on subsequent mortality.
8 MODELS FOR OTHER DISEASES
Models for screening can be applied to diseases other than cancer. For example, screening tests exist for diabetes and there is a clear value in its early detection. Undiagnosed diabetes could be considered as a preclinical phase of the disease and modeling techniques applied to investigating its characteristics. Similarly, a disease such as hypertension could be modeled either for its own sake or as a preclinical form of cardiovascular disease. Some work has been done on simulation modeling for coronary heart disease (38). This model used logistic regression to estimate transition probabilities between risk factor states and heart disease. It focused on the effects of risk factor reduction, but did not address details of screening programs. Hence, it avoided having to model details of the preclinical phase. There have to date been no significant published attempts at modeling the preclinical phase to investigate specific screening programs for chronic diseases other than cancer. On the other hand, modeling of infectious diseases has a long history in biostatistics. Most recently considerable work has been done on disease models of AIDS and HIV, although most of this effort has focused on projecting the spread of the disease rather than modeling screening programs (see, for example, Day et al. (23)). However, there has been some work on modeling screening for infectious diseases.
Lee & Pierskalla’s model (40) is a good illustration of the similarities and differences in modeling infectious diseases for mass screening. In this model, the preclinical phase equates to the period during which a disease is infectious but without symptoms and the clinical phase to the period during which symptoms develop, the person seeks treatment, and is isolated or removed from the population. The main quantities used in the modeling are: 1. the number of infected people at a given time; 2. the natural incidence rate of the disease; 3. the rate of transmission of the disease from a contagious unit to a susceptible; 4. the rate of infected units ending the infectious period (i.e. clinical surfacing); and 5. the probability that an infected unit will not be detected by a screening test (i.e. the probability of a false negative). The crucial difference here is that disease is initiated by spread from one unit to another, as well as by its natural incidence rate. Hence, in addition to the lead time, the main index of benefit is the removal of infected units from the population. Indeed, Lee & Pierskalla show that defining the measure of screening benefit as the average lead time across the population under study is equivalent to defining it as the average number of infected units per time period in the population. Taking treatment as the endpoint of the model, rather than ultimate mortality, has
the advantage of avoiding the necessity of modeling survival in the presence of screening. However, these models still have the difficulty of specifying parameters for an unobserved preclinical phase. For example, Lee & Pierskalla note that their model is an oversimplification, because it assumes that the sensitivity of the screening test is constant and independent of how long the person has been infected with the disease. They also note that varying this assumption is of little practical use, since data on transmission rates at the various disease states are almost nonexistent.

9 CURRENT STATE AND FUTURE DIRECTIONS

The problem of model validation and its effect on the credibility of model based results is still a barrier to their wider use. Nevertheless, there are a number of areas in which modeling can make a uniquely important contribution to our current understanding of screening. In the absence of specific RCTs, modeling remains the only effective way of evaluating different screening regimes. For example, the inclusion of women aged between 40 and 50 in a mammography screening program is still a contentious issue, with no international consensus on the effectiveness of screening at these ages (6). While it could be argued that decisions on screening these women should not be made in the absence of reliable evidence on the presence or absence of the benefits, in practice, governments are already developing screening programs and modeling plays an important role in guiding policy-makers. Modeling also has a crucial role to play in assessing the cost-effectiveness of screening programs. Even for cheap and easily available screening technologies, organized mass screening programs are the best way to ensure that the benefits of screening are fully realized (7). Modeling is not only necessary in order to plan these programs, but funding bodies are unlikely to fund such programs without at least initial cost-effectiveness studies, and modeling is the only practical way to derive the necessary estimates of future benefits and costs.
Miller et al. (44), p. 768 best summarize the current situation when, in discussing some recent models, they say It is clear that these, and other models already developed or under consideration, may enhance our understanding of the natural history of screen-detected lesions and the process of screening. However, they require validation with the best available data, which is preferably derived from randomized trials, before they could be extrapolated in ways that might guide policy decisions. As such data become available, assumption-based models need to be modified to incorporate this extra information, in order to improve the extrapolations needed to make policy.
While analytic models have a role in investigating specific facets of the disease and screening process (see, for example, (59)), the more comprehensive simulation models, and particularly the microsimulation models, seem best suited to the overall assessment of costs and effectiveness in screening programs and the investigation of different screening regimes. However, the challenge in using the simulation approach is to derive disease and screening models which are sufficiently complex to model all relevant aspects of screening but sufficiently simple to enable interpretable second-order validation.

REFERENCES

1. Adami, H. A., Malker, B., Holmberg, L., Persson, I. & Stone, B. (1986). The relationship between survival and age at diagnosis in breast cancer, New England Journal of Medicine 315, 559–563.
2. Albert, A. (1981). Estimated cervical cancer disease state incidence and transition rates, Journal of the National Cancer Institute 67, 571–576.
3. Albert, A., Gertman, P. M. & Louis, T. A. (1978). Screening for the early detection of cancer—I. The temporal natural history of a progressive disease state, Mathematical Biosciences 40, 1–59.
4. Albert, A., Gertman, P. M., Louis, T. A. & Liu, S.-I. (1978). Screening for the early detection of cancer—II. The impact of the screening on the natural history of the disease, Mathematical Biosciences 40, 61–109.
5. Alexander, F. E. (1989). Statistical analysis of population screening, Medical Laboratory Science 46, 255–267.
6. Australian Health Ministers' Advisory Council (1990). Breast Cancer Screening in Australia: Future Directions. Australian Institute of Health: Prevention Program Evaluation Series No 1. AGPS, Canberra.
7. Australian Health Ministers' Advisory Council (1991). Cervical Cancer Screening in Australia: Options for Change. Australian Institute of Health: Prevention Program Evaluation Series No 2. AGPS, Canberra.
8. Australian Institute of Health and Welfare (1992). Australia's Health 1992: the Third Biennial Report of the Australian Institute of Health and Welfare. AGPS, Canberra.
9. Baker, S. G., Connor, R. J. & Prorok, P. C. (1991). Recent developments in cancer screening modeling, in Cancer Screening, A. B. Miller, J. Chamberlain, N. E. Day, M. Hakama & P. C. Prorok, eds. Cambridge University Press, Cambridge, pp. 404–418.
10. Blumenson, L. E. (1976). When is screening effective in reducing the death rate? Mathematical Biosciences 30, 273–303.
11. Blumenson, L. E. (1977). Compromise screening strategies for chronic disease, Mathematical Biosciences 34, 79–94.
12. Blumenson, L. E. (1977). Detection of disease with periodic screening: Transient analysis and application to mammography examination, Mathematical Biosciences 33, 73–106.
13. Brookmeyer, R. & Day, N. E. (1987). Two-stage models for the analysis of cancer screening data, Biometrics 43, 657–669.
14. Bross, I. D. J., Blumenson, L. E., Slack, N. H. & Priore, R. L. (1968). A two disease model for breast cancer, in Prognostic Factors in Breast Cancer, A. P. M. Forrest & P. B. Bunkler, eds. Williams & Wilkins, Baltimore, pp. 288–300.
15. Chiang, C. L. (1964). A stochastic model of competing risks of illness and competing risks of death, in Stochastic Models in Medicine and Biology, J. Gurland, ed. University of Wisconsin Press, Madison, pp. 323–354.
16. Chiang, C. L. (1980). An Introduction to Stochastic Processes and their Applications. Krieger, Huntington, New York.
17. Chu, K. C. & Connor, R. J. (1991). Analysis of the temporal patterns of benefits in the Health Insurance Plan of Greater New York Trial by stage and age, American Journal of Epidemiology 133, 1039–1049.
18. Collette, H. J. A., Day, N. E., Rombach, J. J. & de Waard, F. (1984). Evaluation of screening for breast cancer in a non-randomized study (the DOM project) by means of a case control study, Lancet i, 1224–1226.
19. Connor, R. J., Chu, K. C. & Smart, C. R. (1989). Stage-shift cancer screening model, Journal of Clinical Epidemiology 42, 1083–1095.
20. Coppleson, L. W. & Brown, B. (1975). Observations on a model of the biology of carcinoma of the cervix: a poor fit between observations and theory, American Journal of Obstetrics and Gynecology 122, 127–136.
21. Day, N. E. & Duffy, S. W. (1996). Trial design based on surrogate end points—application to comparison of different breast screening frequencies, Journal of the Royal Statistical Society, Series A 159, 49–60.
22. Day, N. E. & Walter, S. D. (1984). Simplified models of screening for chronic disease: estimation procedures from mass screening programmes, Biometrics 40, 1–14.
23. Day, N. E., Gore, S. M. & De Angelis, D. (1995). Acquired immune deficiency syndrome predictions for England and Wales (1992–97): sensitivity analysis, information, decision, Journal of the Royal Statistical Society, Series A 158, 505–524.
24. Du Pasquier (1913). Mathematische Theorie der Invaliditätsversicherung, Mitt. Verein. Schweiz. Versich.-Math. 8, 1–153.
25. Dubin, N. (1979). Benefits of screening for breast cancer: application of a probabilistic model to a breast cancer detection project, Journal of Chronic Diseases 32, 145–151.
26. Dubin, N. (1981). Predicting the benefit of screening for disease, Journal of Applied Probability 18, 348–360.
27. Eddy, D. M. (1980). Screening for Cancer: Theory, Analysis and Design. Prentice-Hall, Englewood Cliffs.
28. Eddy, D. M. (1985). Technology assessment: the role of mathematical modeling, in Assessing Medical Technologies, Institute of Medicine, ed. National Academy Press, Washington, pp. 144–153.
29. Eddy, D. M. (1987). The frequency of cervical cancer screening: comparison of a mathematical model with empirical data, Cancer 60, 1117–1122.
30. Eddy, D. M. & Shwartz, M. (1982). Mathematical models in screening, in Cancer Epidemiology and Prevention, D. Schottenfeld & J. F. Fraumeni, eds. Saunders, Philadelphia, pp. 1075–1090.
31. Eddy, D. M., Nugent, F. W., Eddy, J. F., Coller, J., Gilbertsen, V., Gottleib, L. S., Rice, R., Sherlock, P. & Winawer, S. (1987). Screening for colorectal cancer in a high-risk population, Gastroenterology 92, 682–692.
32. Fix, E. & Neyman, J. (1951). A simple stochastic model of recovery, relapse, death and loss of patients, Human Biology 23, 205–241.
33. Habbema, J. D. F., Lubbe, J. Th. N., van der Maas, P. J. & van Oortmarssen, G. J. (1983). A computer simulation approach to the evaluation of mass screening, in MEDINFO 83. Proceedings of the 4th World Conference on Medical Informatics, van Bemmel et al., eds. North-Holland, Amsterdam.
34. Knox, E. G. (1973). A simulation system for screening procedures, in Future and Present Indicatives, Problems and Progress in Medical Care, Ninth Series, G. McLachlan, ed. Nuffield Provincial Hospitals Trust, Oxford, pp. 17–55.
35. Knox, E. G. (1975). Simulation studies of breast cancer screening programmes, in Probes for Health, G. McLachlan, ed. Oxford University Press, London, pp. 13–44.
36. Knox, E. G. (1988). Evaluation of a proposed breast cancer screening regimen, British Medical Journal 297, 650–654.
37. Knox, E. G. & Woodman, C. B. J. (1988). Effectiveness of a cancer control programme, Cancer Surveys 7, 379–401.
38. Kottke, T. E., Gatewood, L. C., Wu, S. C. & Park, H. A. (1988). Preventing heart disease: is treating the high risk sufficient? Journal of Clinical Epidemiology 41, 1083–1093.
39. Lang, C. A. & Ransohoff, D. F. (1994). Fecal occult blood screening for colorectal cancer—is mortality reduced by chance selection for screening colonoscopy? Journal of the American Medical Association 271, 1011–1013.
40. Lee, H. L. & Pierskalla, W. P. (1988). Mass screening models for contagious diseases with no latent period, Operations Research 36, 917–928.
41. Lincoln, T. & Weiss, G. H. (1964). A statistical evaluation of recurrent medical examination, Operations Research 12, 187–205.
42. Louis, T. A., Albert, A. & Heghinian, S. (1978). Screening for the early detection of cancer—III. Estimation of disease natural history, Mathematical Biosciences 40, 111–144.
43. Mandel, J. S., Bond, J. H., Church, T. R., Snover, D. C., Bradley, G. M., Schuman, L. M. & Ederer, F. (1993). Reducing mortality from colorectal cancer by screening for fecal occult blood, New England Journal of Medicine 328, 1365–1371.
44. Miller, A. B., Chamberlain, J., Day, N. E., Hakama, M. & Prorok, P. C. (1990). Report on a workshop of the UICC Project on evaluation of screening for cancer, International Journal of Cancer 46, 761–769.
45. Morrison, A. S. (1985). Screening in Chronic Disease. Oxford University Press, New York.
46. O'Neill, T. J., Tallis, G. M. & Leppard, P. (1995). A review of the technical features of breast cancer screening illustrated by a specific model using South Australian cancer registry data, Statistical Methods in Medical Research 4, 55–72.
47. Parkin, D. M. (1985). A computer simulation model for the practical planning of cervical cancer screening programmes, British Journal of Cancer 51, 551–568.
48. Prorok, P. C. (1976). The theory of periodic screening I: lead time and proportion detected, Advances in Applied Probability 8, 127–143.
49. Prorok, P. C. (1976). The theory of periodic screening II: doubly bounded recurrence times and mean lead time and detection probability estimation, Advances in Applied Probability 8, 460–476.
50. Prorok, P. C. (1986). Mathematical models and natural history in cervical cancer screening, in Screening for Cancer of the Uterine Cervix, M. Hakama, A. B. Miller & N. E. Day, eds. IARC Scientific Publication, Vol. 76, pp. 185–198.
51. Prorok, P. C. (1988). Mathematical models of breast cancer screening, in Screening for Breast Cancer, N. E. Day & A. B. Miller, eds. Hans Huber, Toronto, pp. 95–109.
52. Rutqvist, L. E. (1985). On the utility of the lognormal model for analysis of breast cancer survival in Sweden 1961–1973, British Journal of Cancer 52, 875–883.
53. Shapiro, S., Venet, W., Strax, P., Venet, L. & Roeser, R. (1982). Ten to fourteen year effect of screening on breast cancer mortality, Journal of the National Cancer Institute 69, 349–355.
54. Shwartz, M. (1978). A mathematical model used to analyse breast cancer screening strategies, Operations Research 26, 937–955.
55. Shwartz, M. (1978). An analysis of the benefits of serial screening for breast cancer based upon a mathematical model of the disease, Cancer 41, 1550–1564.
56. Shwartz, M. & Plough, A. (1984). Models to aid in planning cancer screening programs, in Statistical Methods for Cancer Studies, R. G. Cornell, ed. Marcel Dekker, New York, pp. 239–416.
28
SCREENING, MODELS OF
57. Stevenson, C. E., Glasziou, P., Carter, R., Fett, M. J. & van Oortmarssen, G. J. (1990). Using Computer Modelling to Estimate Person Years of Life Saved by Mammography Screening in Australia, Paper presented at the 1990 Annual Conference of the Public Health Association of Australia. 58. Tabar, L., Fagerberg, G., Duffy, S. & Day, N. E. (1989). The Swedish two county trial of mammography screening for breast cancer: recent results and calculation of benefit, Journal of Community Health 43, 107–114. 59. van Oortmarssen, G. J. & Habbema, J. D. (1991). Epidemiological evidence for agedependent regression of pre-invasive cervical cancer, British Journal of Cancer 64, 559–565. 60. van Oortmarssen, G. J., Habbema, J. D., Lubbe, K. T. & van der Maas, P. J. (1990). A model-based analysis of the HIP project for breast cancer screening, International Journal of Cancer 46, 207–213. 61. van Oortmarssen, G. J., Habbema, J. D., van der Maas, P. J., de Koning, H. J., Collette, H. J., Verbeek, A. L., Geerts, A. T. & Lubbe, K. T. (1990). A model for breast cancer screening, Cancer 66, 1601–1612. 62. Verbeek, A. L. M., Straatman, H. & Hendriks, J. H. C. L. (1988). Sensitivity of mammography in Nijmegen women under age 50: some trials with the Eddy model, in Screening for Breast Cancer, N. E. Day & A. B. Miller, eds. Hans Huber, Toronto, pp. 29–38. 63. Walter, S. D. & Day, N. E. (1983). Estimation of the duration of a preclinical state using screening data, American Journal of Epidemiology 118, 865–886. 64. Walter, S. D. & Stitt, L. W. (1987). Evaluating the survival of cancer cases detected by screening, Statistics in Medicine 6, 885–900. 65. Zelen, M. & Feinleib, M. (1969). On the theory of screening for chronic diseases, Biometrika 56, 601–614.
SCREENING TRIALS

PHILIP C. PROROK, National Institutes of Health, Bethesda, MD, USA
The early detection of cancer and other chronic diseases has long been a goal of medical scientists. Many believe that by moving the point of diagnosis backward in time so that the disease is diagnosed earlier than usual, treatment will be more effective than treatment given at the usual time. However, this presumption may not be correct and the effect of any screening program must be evaluated. Cohort studies and case-control studies have been used for evaluating screening for several types of cancer, and the design and interpretation of these studies have recently been the topic of increasing discussion. However, an observational study rarely yields definitive answers or permits solid conclusions with regard to the public health consequences of cancer screening. The most rigorous approach is the randomized clinical trial. There are special design and analysis issues for such screening trials.

1 DESIGN ISSUES
The randomized controlled trial involves the prospective testing and long-term follow-up of defined populations according to a protocol. There are several major design and implementation aspects that should be considered. First, the target disease(s), the screening test(s), and the diagnostic and therapeutic regimens must be determined. Then, the appropriate outcome variable must be chosen and the sampling unit (individual or group) selected. Next, the admission and exclusion criteria need to be established and a randomization procedure chosen to allocate eligibles to the study and control groups. The study and control groups should be followed up with equal intensity and in the same time frame, with the outcome variable measured in a blind fashion, if possible. Every effort should be made to maximize adherence to the study protocol for both groups. It is also important in the analysis that all individuals in the control group be compared to all individuals in the study group, including both individuals accepting the offer of screening and those rejecting the offer.

A decision on the number of screening examinations and the interval between examinations (screens) must be made. The number of screens depends on the tradeoff between a sufficient number to realize an effect, if there is one, and the cost of additional screens. Trials may incorporate screening for essentially the entire follow-up period (35), or employ an abbreviated screening period typically involving four or five screening rounds, with a subsequent follow-up period devoid of screening (22,32). Several modeling efforts have addressed these issues (6,16,18,19).

Another design problem involves the relationship between study duration, sample size, and the expected timing of any effect. Sample size and study duration are inversely related. If these two parameters were the only ones to consider, the relationship between follow-up cost, on the one hand, and recruitment and screening cost, on the other, would determine the design. However, the time at which a reduction in mortality may occur must also be considered. For those cancer screening trials that have demonstrated a reduction in mortality, a separation between the mortality rates in the screened and control groups did not occur until four to five years or more after randomization (32,35). Furthermore, the difference may continue to increase with time, even after screening stops (32). Thus, even with a very large sample size, follow-up may have to continue for many years to observe the full effect of the screening. A follow-up period of at least 10 years is appropriate, but a longer period may be required if the screening effect is manifested primarily among a subset of patients with slowly growing cancer.

Determination of the appropriate endpoint in a cancer screening study is intimately related to the disease natural history. For a screening trial, the relevant natural history is from the time the cancer is screen-detected to death. This natural history is usually not well
understood, and potential early indicators of outcome such as a shift in disease stage or a lengthening of survival among cases, which depend on knowledge of this natural history for their validity, cannot provide a definitive assessment of screening in the absence of this knowledge. There is only one outcome variable known to be valid in a cancer screening trial, namely the population cancer mortality rate. This is the number of cancer deaths per unit time per unit population at risk (23,27). The mortality rate provides a combined assessment of early detection plus therapy. No improvement in mortality will be seen if either the screening does not lead to earlier detection or therapy at the time of earlier detection confers no extra benefit. Intermediate or surrogate outcome measures have also been considered. However, these have critical shortcomings that can be traced to the well-known biases that occur in screening programs: lead time bias, length bias, and over-diagnosis bias (23,27). Among the most frequently proposed alternative endpoints are the case-finding rate or yield, stage of disease, and case-survival rate. Another measure that has been proposed as an endpoint in screening studies is the population incidence rate of advanced stage disease (1,5), since, if screening reduces the rate of advanced disease, disease that has metastasized, or is likely to lead to death, then it is reasonable to expect that the death rate from the disease will also be reduced.
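As a rough numerical illustration of this endpoint (all counts and person-years below are hypothetical and carry no substantive meaning), the disease-specific mortality rate in each randomized arm is simply the number of cancer deaths divided by the person-years at risk, and the two arms are compared on that scale:

```python
# Hypothetical counts used only to illustrate the mortality-rate endpoint:
# disease-specific deaths per unit of person-time in each randomized arm.
arms = {
    "screened": {"deaths": 95, "person_years": 240_000},
    "control": {"deaths": 130, "person_years": 238_000},
}

# Deaths per 10,000 person-years in each arm, and their ratio.
rates = {name: 10_000 * a["deaths"] / a["person_years"] for name, a in arms.items()}
rate_ratio = rates["screened"] / rates["control"]

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} deaths per 10,000 person-years")
print(f"rate ratio (screened/control): {rate_ratio:.2f}")
```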
2 SAMPLE SIZE
In the hypothesis-testing framework, the sample size can be calculated from the appropriate formula if one knows the event rate, effect size, and statistical procedure, all of which depend on the choice of endpoint for the study. In cancer screening trials, this is the cancer mortality. Since the analysis involves a comparison of the numbers or rates of deaths, methods for Poissondistributed data can be used (34,36). Other factors that must be taken into consideration are noncompliance in the screened and control groups, randomized groups of different sizes, and lower than expected event rates among the individuals who participate in the
study. Several approaches have been formulated to address these problems (23,25). One approach to sample size estimation is based on the method suggested by Taylor & Fontana (36), modified to allow for an arbitrary magnitude of screening impact, an arbitrary sample size ratio between the screened and control groups, and arbitrary levels of compliance in the screened and control groups. Let N_c be the number of individuals randomized to the control group, and N_s the number randomized to the screened group, with N_s = fN_c. Assume the study is designed to detect a (1 − r) × 100% reduction (0 ≤ r ≤ 1) in the cumulative disease-specific death rate over the duration of the trial. Also, let P_c be the proportion of individuals in the control group who comply with the control group intervention and P_s be the proportion of individuals in the screened group who comply with the screened group intervention. Using a model in which the death rate in the presence of noncompliance in the screened group is a linear combination of the screened and control group death rates, weighted by the compliance levels, one finds that the total number of disease-specific deaths, D, needed for a one-sided α-level significance test with power 1 − β is given by

D = [(Δ_1 + fΔ_2) Z_{1−α} − (Δ_1 Δ_2)^{1/2} (1 + f) Z_β]^2 / [f (Δ_1 − Δ_2)^2],

where Δ_1 = r + (1 − r)P_c, Δ_2 = 1 − (1 − r)P_s, and Z_{1−α} and Z_β are the 1 − α and β quantiles of the standard normal distribution, respectively. The number of participants required in the control group is N_c = D/[(Δ_1 + fΔ_2) R_c Y], where Y is the duration of the trial from entry to end of follow-up in years, and R_c is the average annual disease-specific death rate in the control group expressed in deaths per person per year. Calculation of N_c requires an estimate of R_c. Individuals recruited for a screening trial are expected to be healthier than the general population due to selection factors and eligibility criteria. Hence, the usual cancer mortality rate obtained from national or registry data is likely to overestimate the mortality rate of the participants, at least for the early part of a trial. An ad hoc approach
to this problem is to use the relationship between the observed and expected death rates in previous screening trials. Alternatively, one can calculate an expected event rate using the age-specific incidence rates of a cancer-free population combined with the survival rates of these incident cases to arrive at the expected mortality (23,25,26).
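The sample-size calculation above is straightforward to program. The sketch below simply transcribes the formula for D and N_c as written in this section; the function name is illustrative, and the compliance levels, death rate, and trial duration in the example call are hypothetical placeholders rather than recommended design values.

```python
from statistics import NormalDist

def screening_trial_size(r, Pc, Ps, f, Rc, Y, alpha=0.05, power=0.90):
    """Total expected deaths D and control-arm size Nc for a screening trial,
    transcribing the compliance-adjusted formula given in the text.
    r      : 1 - r is the targeted reduction in cumulative mortality
    Pc, Ps : compliance proportions in the control and screened groups
    f      : allocation ratio, Ns = f * Nc
    Rc     : average annual disease-specific death rate in the control group
    Y      : trial duration in years (entry to end of follow-up)
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha)            # 1 - alpha quantile (one-sided test)
    z_beta = z(1 - power)             # beta quantile (negative when power > 0.5)
    d1 = r + (1 - r) * Pc             # Delta_1: dilution in the control arm
    d2 = 1 - (1 - r) * Ps             # Delta_2: dilution in the screened arm
    num = ((d1 + f * d2) * z_alpha - (d1 * d2) ** 0.5 * (1 + f) * z_beta) ** 2
    D = num / (f * (d1 - d2) ** 2)            # total disease-specific deaths
    Nc = D / ((d1 + f * d2) * Rc * Y)         # control-group sample size
    return D, Nc

# Hypothetical inputs: 30% mortality reduction, 90%/95% compliance,
# equal allocation, 20 deaths per 10,000 person-years, 10-year trial.
D, Nc = screening_trial_size(r=0.7, Pc=0.90, Ps=0.95, f=1.0, Rc=0.002, Y=10)
print(round(D), round(Nc))
```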
3 STUDY DESIGNS
3.1 Classic Two-Arm Trial that Addresses a Single Question In this design, the study population is randomized to a group offered screening according to a protocol and a control group not offered screening. At the end of follow-up, the mortality rates in the two groups are compared (28). The prototype trial for this design is the Health Insurance Plan (HIP) trial of breast cancer screening (32). 3.2 Designs for Investigating more than One Question in the Same Study Extensions of the classic two-arm design have been used or suggested for cancer screening trials to answer more than one question in the same study. This topic has also been discussed for cancer prevention trials (11). For example, the National Study of Breast Cancer Screening in Canada involves two different study populations, but under the same administrative and scientific umbrella (i) to determine in women aged 40–49 at entry whether annual screening by mammography and physical examination, when used as an adjunct to the highest standard of care in the Canadian health care system, can reduce mortality from breast cancer, and (ii) to evaluate in women aged 50–59 at entry the additional contribution of routine annual mammographic screening to screening by physical examination alone in reducing breast cancer mortality. This involves separate randomizations of women in the two age groups (22). In some circumstances, several related questions can be addressed by including additional randomized groups in the trial. An example is the colon cancer screening trial at the University of Minnesota (14). Two basic issues are being addressed;
namely, whether screening can reduce mortality, and whether there is a different effect at different screening frequencies. Three randomized groups were formed: a control group, a group offered annual screening with a test to detect occult blood in the stool, and a group offered the occult blood test every two years. Another extension of the basic design is a two-group trial in which the intervention group includes multiple interventions, known as the all-versus-none design (11). One version of this design involves several interventions, with each intervention aimed at early detection of a different type of cancer. Use of this design requires two assumptions: first, that the test for any given cancer does not affect the case detection or mortality of any other cancer site, and, secondly, that disease-specific mortality is independent among the cancers under study. An example is the Prostate, Lung, Colorectal and Ovarian Cancer (PLCO) Screening Trial sponsored by the National Cancer Institute (15), the objectives of which are to determine whether: (i) in females and males, screening with flexible sigmoidoscopy can reduce mortality from colorectal cancer, and screening with chest X-ray can reduce mortality from lung cancer; (ii) in males, screening with digital rectal examination plus serum prostate-specific antigen (PSA) can reduce mortality from prostate cancer; and (iii) in females, screening with pelvic examination plus CA 125 and transvaginal ultrasound can reduce mortality from ovarian cancer. Another design option to answer more than one question at a time is the reciprocal control design (11). In this design, the participants in each arm of a trial receive an intervention, but also serve as controls for an intervention in another arm of the trial. This requires the assumption that the intervention aimed at a given cancer does not affect any of the other cancers under study. Within each of the above design types there are options for the relationship between screening and follow-up (9). A natural design is to randomize individuals either to an intervention or a control group, with the intervention consisting of periodic screening throughout the trial. Those in the control arm are not offered the periodic screening; they follow their usual medical care practices. This is
called the continuous-screen design. The NCI Cooperative Lung Cancer Screening RCT, conducted in the mid 1970s to the mid 1980s, essentially followed this design (10). One drawback of the continuous-screen design is that the cost involved in screening all intervention group participants for the duration of the trial may be prohibitive. An alternative is the stop-screen design in which screening is offered for a limited time in the intervention group and both groups are followed for disease incidence and mortality until the end of the trial. This design is used when it is anticipated that a long follow-up will be required before a reduction in mortality can be expected to emerge, and when it would be expensive or difficult to continue the periodic screening for the entire trial period. The Health Insurance Plan (HIP) of Greater New York Breast Cancer Screening Study followed this design (32). The stop-screen design can result in a considerable saving in cost. However, the analysis can be more complex than that of the continuous-screen design, because the difference in disease-specific mortality between the two groups may be diluted by deaths from cancers that develop in the intervention group after screening stops. The split-screen design is a variant of the stop-screen design in which a screen is also offered to all those in the control group at the time the last screen is offered to the intervention group. The Stockholm Breast Cancer screening trial, conducted in the 1980s, is an example of this design (13). An advantage of the split-screen design is that there is greater potential to identify comparable sets of cancer cases for the analysis (discussed below). The delayed-screen design is a variant of the continuous-screen design in which periodic screening is offered to the control group starting at some time after the start of the study and continuing until the end of the study. This design allows the estimation of the marginal effect of introducing screening at some standard time or age, relative to starting the screening at a later time or age. The current UK Breast Cancer Screening Trial of women under 50 is basically following this design to evaluate the effect of beginning screening before the age of 50 years (24).
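A small back-of-the-envelope calculation can make the dilution issue in a stop-screen design concrete. The figures below are invented solely for illustration: they assume equal person-years in the two arms and a fixed mortality reduction for cancers arising while screening is offered, with no benefit for cancers arising afterwards.

```python
# Illustrative (made-up) numbers showing how cancers arising after screening
# stops can dilute the observed effect in a stop-screen design.
true_reduction = 0.30            # assumed reduction for cancers arising during the screening period
control_deaths_screen_era = 100  # expected control-arm deaths from screen-era cancers
control_deaths_post_era = 60     # expected deaths from cancers arising after screening stopped
                                 # (assumed the same in both arms, i.e., no screening benefit)

screened_arm_deaths = control_deaths_screen_era * (1 - true_reduction) + control_deaths_post_era
control_arm_deaths = control_deaths_screen_era + control_deaths_post_era

observed_reduction = 1 - screened_arm_deaths / control_arm_deaths
print(f"reduction among screen-era cancers: {true_reduction:.0%}")
print(f"observed reduction, post-screening cases included: {observed_reduction:.0%}")
# The limited mortality analysis described in the Analysis section below is one
# way of restricting attention to comparable case sets to reduce this dilution.
```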
4 ANALYSIS

Screening trials involve special issues in their analysis, both for the continuous-screen and stop-screen designs, and for primary and secondary analyses. Primary analyses are concerned with evaluating whether there is a statistically significant difference in disease-specific mortality between the control and intervention groups. Secondary analyses are concerned with ascertaining the magnitude of the mortality difference, and with gaining a deeper understanding of the underlying mechanisms (9).

4.1 Primary Analysis

Proposed statistical methods for primary analysis include a Poisson test statistic comparing the observed death rates (33), a Fisher exact test comparing the observed proportions of cancer deaths (2), and a logrank test comparing disease-specific death rates over time in the two groups (2,3). For example, the Poisson process test statistic for comparing cumulative mortality rates is

Z_r = (PY_S D_C − PY_C D_S) / [PY_C PY_S (D_C + D_S)]^{1/2},

where D_C is the number of deaths from the cancer of interest in the control group through the time of analysis, D_S is the corresponding number of deaths in the screened group, PY_C is the number of person-years at risk of death from the cancer of interest in the control group through the time of analysis, and PY_S is the corresponding number of person-years in the screened group (34). Logrank test analysis may be based on the disease-specific mortality experience of all randomized participants, termed the overall mortality analysis, or it may be based on the mortality experience of comparable groups of cancer cases in the two arms of the trial, in which case it is termed the limited mortality analysis (3).

4.2 Overall Mortality Analysis

Overall mortality analyses possess the advantage of comparability of comparison groups formed by randomization. However,
logrank tests comparing disease-specific mortality can be relatively inefficient. This is because the logrank test is optimal under proportionality of the disease-specific mortality hazards in the two groups, whereas, in cancer screening trials, there is generally a delay from the beginning of the intervention program to the time that effects on cancer mortality can be observed, and the magnitude of any effect may vary over time. In addition, in stop-screen designs, cases continue to accrue in both groups after screening stops. The cancer deaths in the intervention group that are due to cancers developing after screening dilute the screening effect. Thus, the ratio of hazards decreases with time after some point in the trial. If the specific form of departure from proportional death rates is known, then efficiency can be gained by use of a weighted logrank statistic instead of the usual (unweighted) logrank statistic (8). Zucker & Lakatos (42) propose a method to accommodate a possible lag until full screening effect within a continuous-screen design. They specify a range of plausible lag times to full screening effect and then identify the weighted logrank statistic that minimizes the worst possible efficiency loss over this range. Self (29) and Self & Etzioni (30) propose adaptive testing methods for stopscreen designs. Their suggestion is to use the observed departures from constancy of the relative hazards to improve the efficiency of the test procedure. These methods are also weighted logrank tests, but the weights are identified in a data-dependent fashion. Sequential versions of these procedures are not yet available. 4.3 Limited Mortality Analysis In a limited mortality analysis, one restricts analysis to comparable sets of cancers, one set consisting of cancers from the intervention group diagnosed through some designated time interval after the start of the study, and the other consisting of their counterparts in the control group diagnosed during the same interval. Limited mortality analyses are typically only applied to split-screen or stop-screen designs. The split-screen design leads naturally to two presumably comparable case sets; namely, those diagnosed up to
and including the final screen offered. In the stop-screen design, however, determination of comparable case sets is less straightforward. The main question is how to choose the time interval for ascertainment of cases for analysis. The end of the case ascertainment period should not be too long after screening stops, because the continued accrual of clinically detected cases in both groups may lead to dilution of the observed screening effect as described previously. If, however, we exclude all cases diagnosed after screening has stopped, another form of dilution can arise. Among the cases diagnosed in the control group after screening has stopped, some may correspond to cases in the intervention group that were screen-detected and therefore diagnosed earlier than they would have been without screening. If this set of cancers benefits from the earlier diagnosis due to the screening, then excluding the control group counterparts to these cancers also dilutes the screening effect (8). 4.4 Comparability Whatever the method used to select comparable case sets, the true comparability of the sets selected must be fully investigated. Both the cases in the selected sets and the cases that are diagnosed after the time used to define the selected sets must be evaluated. Methods have been proposed for assessing the comparability of case sets in a stop-screen study (3). They consider the numerical as well as the biological comparability of the sets. Numerical comparability concerns the numbers of cancers in the case sets. Biologic or qualitative comparability concerns the composition of the cancer case sets with regard to their natural history, and especially their survival characteristics in the absence of screening. Qualitative comparability of candidate sets is assessed by covariates defined at randomization associated with the cancer cases. The identification of comparable case sets is not straightforward. In a stop-screen design, it may be impossible to identify comparable sets if screening is available outside of the trial, and the use of outside screening differs between the two arms after trial screening stops. In a continuousscreen design, it is unlikely that equalization
will ever occur. In such cases, appropriately weighted overall mortality analysis may be the only valid option. In summary, the overall analysis is the most unbiased as it compares all randomized individuals. However, this approach may assess a diluted relative effect of screening in a stop-screen design and it requires follow-up of all randomized trial participants. Alternatively, the limited analysis requires follow-up of only selected case sets after a certain point in time and so is less costly. However, the approach is subject to bias if the case sets are not truly comparable. 4.5 Secondary Analyses Secondary analyses of cancer screening trials typically involve information related to the outcome of cancer cases captured in survival data, and indications of earlier diagnosis, captured by estimates of the screening program’s lead time, sensitivity, and the degree of shifting to an earlier clinical stage at diagnosis in the screened group. Estimates of survival differential are based on the postdiagnosis survival curves in the two case sets. The postdiagnosis survival of screen-detected cases includes lead time, which must be explicitly removed to avoid lead-time bias. Initial approaches were developed for the HIP trial assuming a fixed lead time of one year and considering the k-year actuarial survival from diagnosis of control group cases and interval cases (cases diagnosed in the intervals between screens because of signs or symptoms) as equivalent to the k + 1-year survival from diagnosis of screen-detected cases (33). Walter & Stitt (39) allowed lead time to be a random variable with a known distribution. This approach was extended to nonparametric estimation procedures by Xu & Prorok (40). Explicit adjustment for lead time requires knowledge of its probability distribution. Direct estimates of mean lead time have been presented by Shapiro et al. (31), Morrison (23), and Kafadar & Prorok (17). The approach of Shapiro et al. and Morrison yields crude estimates of average lead time based on comparing disease incidence in the control and intervention groups. Kafadar & Prorok used differences in survival from
entry and from diagnosis between screened and control group cases to estimate benefit time and lead time assuming two comparable case sets like those identified for a limited mortality analysis. Other methods for estimating lead time have been developed by Zelen & Feinleib (41) and Walter & Day (38). These approaches may be thought of as statistical modeling efforts. Simulation modeling is also being increasingly employed to estimate screening program properties and disease natural history, and to project the costs and benefits of alternative screening strategies (7,37). The information on shifts in the distribution of clinical stage at diagnosis should be interpreted with caution, since shifts may be due to overdiagnosis or length bias and therefore need not imply disease-specific mortality benefit. However, a stage-shift model has been developed that allows the estimation of the amount of shift between and within stages due to screening, as well as the associated mortality benefits. The model requires comparable case sets (4). 5 TRIAL MONITORING Various categories of data and information become available at successive stages of a screening trial. These relate to the population under study, acceptance of the screening test by the population, outcomes and characteristics of the screening test, and intermediate and final effect measures or endpoints used for determining the value of screening. These variables can be examined on a regular basis for evidence to alter the protocol or stop the trial, and are also valuable in assessing the consistency of findings or conclusions. More specifically, the data that can be used for monitoring include: population descriptors such as demographic, socioeconomic, and risk characteristics of the population; the proportion of the study population offered screening who undergo the initial screening, the level of compliance with scheduled repeat screens, and the level of screening contamination in the control group; the yield of cancers as a result of screening, the interval cancer rate, and screening test characteristics including sensitivity, specificity and predictive value; diagnostic and therapeutic
follow-up among individuals designated suspicious or positive by the screening test and the costs involved in these procedures; cancer case characteristics such as stage, histologic type, grade, and nodal involvement; survival of cancer cases; incidence and prevalence rates of the cancer of interest; the incidence rate of advanced stage cancer; and mortality rates from the cancer of interest and other causes (28). Another aspect of the monitoring process of a trial is the use of formal statistical stopping rules. These include various methods aimed at accounting for repeated looks at the data such as the Lan–DeMets technique and stochastic curtailment procedures, as well as Bayesian approaches (12,20,21). REFERENCES 1. Chamberlain, J. (1984). Planning of screening programs for evaluation and non-randomized approaches to evaluation, in Screening For Cancer. I-General Principles on Evaluation of Screening for Cancer and Screening for Lung, Bladder and Oral Cancer, P.C. Prorok & A.B. Miller, eds. UICC Technical Report Series, Vol. 78, International Union Against Cancer, Geneva, pp. 5–17. 2. Chu, K.C., Smart, C.R. & Tarone, R.E. (1988). Analysis of breast cancer mortality and stage distribution by age for the Health Insurance Plan clinical trial, Journal of the National Cancer Institute 80, 1125–1132. 3. Connor, R.J. & Prorok, P.C. (1994). Issues in the mortality analysis of randomized controlled trials of cancer screening, Controlled Clinical Trials 15, 81–99. 4. Connor, R.J., Chu, K.C. & Smart, C.R. (1989). Stage-shift cancer screening model, Journal of Clinical Epidemiology 42, 1083–1095. 5. Day, N.E., Williams, D.R.R. & Khaw, K.T. (1989). Breast cancer screening programs: the development of a monitoring and evaluation system, British Journal of Cancer 59, 954–958. 6. Eddy, D.M. (1980). Screening for Cancer— Theory, Analysis and Design. Prentice-Hall, Englewood Cliffs. 7. Eddy, D.M., Hasselblad, V., McGivney, W. & Hendee, W. (1988). The value of mammography screening in women under age 50 years, Journal of the American Medical Association 259, 1512–1519.
8. Etzioni, R. & Self, S.G. (1995). On the catchup time method for analyzing cancer screening trials, Biometrics 51, 31–43. 9. Etzioni, R.D., Connor, R.J., Prorok, P.C. & Self, S.G. (1995). Design and analysis of cancer screening trials, Statistical Methods in Medical Research 4, 3–17. 10. Fontana, R.S. (1986). Screening for lung cancer: recent experience in the United States, in Lung Cancer: Basic and Clinical Aspects, H.H. Hansen, ed. Martinus Nijhoff, Boston, pp. 91–111. 11. Freedman, L.S. & Green, S.B. (1990). Statistical designs for investigating several interventions in the same study: methods for cancer prevention trials, Journal of the National Cancer Institute 82, 910–914. 12. Freedman, L.S. & Spiegelhalter, D.J. (1989). Comparison of Bayesian with group sequential methods for monitoring clinical trials, Controlled Clinical Trials 10, 357–367. 13. Frisell, J., Eklund, G., Hellstrom, L., Glas, U. & Somell, A. (1989). The Stockholm breast cancer screening trial—5-year results and stage at discovery, Breast Cancer Research Treatment 13, 79–87. 14. Gilbertsen, V.A., Church, T.R., Grewe, F.A., Mandel, J.S., McHugh, R.M., Schuman, L.M. & Williams, S.E. (1980). The design of a study to assess occult-blood screening for colon cancer, Journal of Chronic Diseases 33, 107–114. 15. Gohagan, J.K., Prorok, P.C., Kramer, B.S., Cornett, J.E. (1994). Prostate cancer screening in the prostate, lung, colorectal and ovarian cancer screening trial of the National Cancer Institute. Journal of Urology 152, 1905–1909. 16. Habbema, J.D.F., Lubbe, J.T.N., Van der Maas, P.J. & Van Oortmarssen, G.J. (1983). A computer simulation approach to the evaluation of mass screening, in Medinfo-83, Van Bemmel, Ball & Wigertz, eds. IPIF-IMIA, North-Holland, Amsterdam, pp. 1222–1225. 17. Kafadar, K. & Prorok, P.C. (1994). A dataanalytic approach for estimating lead time and screening benefit based on survival curves in randomized cancer screening trials, Statistics in Medicine 13, 569–586. 18. Kirch, R.L.A. & Klein, M. (1974). Surveillance schedules for medical examinations, Management Science 20, 1403–1409. 19. Knox, E.G. (1973). A simulation system for screening procedures, in The Future and Present Indicatives, Problems and Progress in Medical Care, G. McLachlan, ed. Ninth Series, Nuffield Provincial Hospitals Trust. Oxford University Press, London, pp. 17–55.
20. Lan, K.K.G. & De Mets, D.L. (1983). Discrete sequential boundaries for clinical trials, Biometrika 70, 659–663. 21. Lan, K.K.G., Simon, R. & Halperin, M. (1982). Stochastically curtailed tests in longterm clinical trials, Communications in Statistics—Sequential Analysis 1, 207–219. 22. Miller, A.B., Howe, G.R. & Wall, C. (1981). The national study of breast cancer screening, Clinical and Investigative Medicine 4, 227–258. 23. Morrison, A.S. (1985). Screening in Chronic Disease. Oxford University Press, New York. 24. Moss, S. (1994). Personal communication. 25. Moss, S., Draper, G.J., Hardcastle, J.D. & Chamberlain, J. (1987). Calculation of sample size in trials of screening for early diagnosis of disease, International Journal of Epidemiology 16, 104–110. 26. Petronella, P.G.M., Verbeek, A.L.M. & Straatman, H. (1995). Sample size determination for a trial of breast cancer screening under age 50: population versus case mortality approach, Journal of Medical Screening 2, 90–93. 27. Prorok, P.C. (1984). Evaluation of screening programs for the early detection of cancer, in Statistical Methods for Cancer Studies, R.G. Cornell, ed. Marcel Dekker, New York, pp. 267–328. 28. Prorok, P.C. (1995). Screening studies, in P. Greenwald, B.S. Kramer & D.L. Weed, eds. Cancer Prevention and Control, Marcel Dekker, New York, pp. 225–242. 29. Self, S.G. (1991). An adaptive weighted logrank test with application to cancer prevention and screening trials, Biometrics 47, 975–986. 30. Self, S.G. & Etzioni, R. (1995). A likelihood ratio test for cancer screening trials, Biometrics 51, 44–50. 31. Shapiro, S., Goldberg, J.D. & Hutchison, G.B. (1974). Lead time in breast cancer detection and implications for periodicity of screening, American Journal of Epidemiology 100, 357–366. 32. Shapiro, S., Venet, W., Strax, P. & Venet, L. (1988). Periodic Screening for Breast Cancer. The Health Insurance Plan Project and Its Sequelae, 1963–1986. The Johns Hopkins University Press, Baltimore. 33. Shapiro, S., Venet, W., Strax, P., Venet, L. & Roeser, R. (1982). Ten- to fourteen-year effect of screening on breast cancer mortality, Journal of the National Cancer Institute 69, 349–355.
34. Shiue, W.K. & Bain, L.J. (1982). Experiment size and power comparisons for two-sample Poisson tests. Applied Statistics 31, 130–134. 35. Tabar, L., Fagerberg, G., Duffy, S.W., Day, N.E., Gad, A. & Grontoft, O. (1992). Update of the Swedish two-county program of mammographic screening for breast cancer, Radiologic Clinics of North America 30, 187–210. 36. Taylor, W.F. & Fontana, R.S. (1972). Biometric design of the Mayo lung project for early detection and localization of bronchogenic carcinoma, Cancer 30, 1344–1347. 37. van Oortmarssen, G.J., Habbema, J.D.F., van der Maas, P.J., de Koning, H.J., Collette, H.J.A., Verbeek, A.L.M., Geerts, A.T. & Lubbe, K.T.N. (1990). A model for breast cancer screening, Cancer 66, 1601–1612. 38. Walter, S.D. & Day, N.E. (1983). Estimation of the duration of a preclinical disease state using screening data, American Journal of Epidemiology 118, 865–886. 39. Walter, S.D. & Stitt, L.W. (1987). Evaluating the survival of cancer cases detected by screening, Statistics in Medicine 6, 885–900. 40. Xu, J.L. & Prorok, P.C. (1995). Nonparametric estimation of the postlead time survival distribution of screen detected cancer cases, Statistics in Medicine 14, 2715–2725. 41. Zelen, M. & Feinleib, M. (1969). On the theory of screening for chronic diseases, Biometrika 56, 601–613. 42. Zucker, D.M. & Lakatos, E. (1990). Weighted logrank-type statistics for comparing survival curves when there is a time lag in the effectiveness of treatment, Biometrika 77, 853–864.
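For readers who want to experiment with the Poisson-type statistic Z_r given in Section 4.1 of this entry, the following minimal sketch evaluates it for hypothetical death counts and person-years; the function name and all numbers are illustrative only.

```python
from math import sqrt

def poisson_mortality_z(d_c, d_s, py_c, py_s):
    """Poisson-type statistic comparing cumulative disease-specific mortality
    between the control (c) and screened (s) arms, as written in Section 4.1."""
    return (py_s * d_c - py_c * d_s) / sqrt(py_c * py_s * (d_c + d_s))

# Hypothetical inputs: deaths and person-years at risk in each arm.
z = poisson_mortality_z(d_c=130, d_s=95, py_c=238_000, py_s=240_000)
print(f"Z_r = {z:.2f}")  # compare with 1.645 for a one-sided test at the 0.05 level
```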
SECONDARY EFFICACY ENDPOINTS
FRANK ROCKHOLD, GlaxoSmithKline R&D, King of Prussia, Pennsylvania
TONY SEGRETI, Research Triangle Institute, Research Triangle Park, North Carolina

As it will become clear later on in this article, there has not been a great deal of discussion concerning the handling and analysis of secondary efficacy endpoints in clinical trials and, as a result, this article is timely in its content. Many of the statistical and design issues for secondary endpoints are similar to those seen for handling primary endpoints, but some important differences exist in interpretation and emphasis that are relevant to this discussion. In clinical research settings (especially those involving interactions with regulatory agencies) that evaluate new drug and vaccine candidates, it is usually recommended that clinical trials have a single primary objective measured by a single primary variable, which typically includes hypothesis testing about comparative efficacy of the products under study. Secondary endpoints or other supportive measures are related to secondary objectives. Although many trials have more than one primary endpoint, a push to minimize the number of primary endpoints usually occurs. Clinical trials are large, complex, long, expensive, and, most importantly, rely on the commitment of the patients to participate in a study protocol. A need exists to balance the desire to maximize the amount of information collected in any given study with the amount of measurement that is reasonable to expect a patient to endure. The amount of information desired almost always exceeds that contained within the primary endpoints of the trials. These additional endpoints then become "secondary" endpoints. Many other related topics exist that could be discussed in conjunction with this topic, including secondary safety endpoints, methods of analyzing secondary variables, composite endpoints that collapse a number of variables into a single measure, and so on; however, for this discussion the authors will stick to the issue of how to assess and manage interpretation of secondary endpoints.

Dozens of endpoint and outcome types (see Table 1 for examples) are collected in a trial. When determining the number of primary and particularly secondary endpoints, one must consider the following questions:

• Does each outcome measure a dimension of the disease that is distinct?
• Would any single outcome measure be sufficient to capture treatment effect?
• Does one outcome measure move temporally with another?
• Would one outcome measure reinforce another?
• Is the endpoint evaluated at two or more time points and does it need to satisfy a specific criterion (or a set of specific criteria)?

All of these answers relate to providing a hierarchy for the endpoints and determining which one is most important clinically. Once one decides what the primary endpoint or endpoints should be, do all of the "leftovers" automatically become secondary endpoints? In order to provide some context, the authors propose the following definitions:

• Primary endpoint: Primary endpoints define the disease in the sense that an experimental treatment that is not shown superior to placebo for all of these endpoints is not a viable treatment for the disease under study.
• Secondary endpoint: Secondary endpoints, although not considered primary, are considered important to prescribing physicians in helping identify the ideal treatment for each of their patients.

Table 1. Endpoint/Outcome Types
• Observer-rated outcomes
• Physician-rated outcomes
• Patient-reported outcomes
• Adjudicated outcomes (e.g., MI, cause-specific mortality)
• Objective outcomes
• Subjective outcomes
• Derived outcomes: slope, change from baseline, time between recurrences
• Scoring/rating data
• Laboratory measurements
• Pharmacokinetic endpoints

Table 2. Types of Secondary Endpoints (2)
1. Secondary variables can supply background and understanding of the primary variables.
2. For composite primary endpoints, useful secondary endpoints are the separate components of the primary variables and their related variables.
3. Major variables for which treatments under study are important but for which the study is underpowered constitute a major role for secondary variables.
4. Secondary variables can have the role of aiding and understanding the mechanism by which the treatment works or in supplying details of the processes of the conditions under investigation.
5. Secondary variables can be variables that relate to sub-hypotheses that are important to understand but are not the major objective of the treatment.
6. Secondary variables can be variables designed for exploratory analyses.

Bob O'Neill, of the U.S. Food and Drug Administration, stated ". . . a secondary endpoint is a clinical endpoint that is not sufficient to characterize fully the treatment in a manner that would support a regulatory 'claim'" (1). This fact may be due, in part,
to a lack of power (e.g., when mortality is a secondary endpoint even though it may be clinically the most important). This fact will become clearer in the methodology literature review to follow, as it is a direct consequence of the Coreg (Manufactured by GlaxoSmithKline, Research Triangle Park, NC) debate in the 1996 advisory committee meeting. That event generated most of the recent literature on secondary endpoints. Sometimes secondary endpoints help describe how the treatment works and they may have a role in the decision about the efficacy or safety of a treatment. In addition, secondary endpoints may have a role in describing the effect of a treatment in addition to how it works. Ralph D’Agostino (2) came up with an excellent characterization of different types of secondary endpoints given in Table 2. This thoughtful review by D’Agostino puts in context the many different types and reasons for secondary endpoints. One could argue that reason 6 could drive one to create something called a ‘‘tertiary’’ endpoint, as no particular
inference is implied and therefore the multiplicity discussion to follow may not be important for those types of endpoints. Once the number and type of secondary endpoints is settled, one can then move on to consider some of the higher-level issues. How does one deal with an important secondary endpoint like mortality that is secondary because the experiment does not have enough power to deal with it as a primary endpoint? And, as a follow-up to that question, how does one perform interim monitoring on secondary endpoints? An important paper by Prentice (3) discusses this specific issue. One then has to also worry about how long the list of endpoints is and what the correlation between endpoints might be. As will be outlined later, Moye (4) and others have talked about the issues around interpreting secondary endpoints when the primary endpoint has not been established as statistically significant. In addition, in the literature review of statistical methodology to follow, issues around type I error comparison-wise and experiment-wise error rates will be discussed. However, many of
the adjustments for these error rates lead to a loss of statistical power or an increase in the sample size needed to maintain power. In the end, one must be concerned that an implied "illusion of certainty" does not arise; that is, the sense that simply by collecting multiple endpoints one actually gets closer to the truth when, in fact, the issue may be clouded. A number of regulatory guidelines and documents exist that shed light on these issues, among them the ICH E9 Statistical Principles for Clinical Trials (5), the CPMP "Points to Consider on Multiplicity Issues in Clinical Trials" (6), and other disease-specific guidances that describe clinical endpoint requirements. ICH E9 states that there should generally be only one primary variable and that secondary variables be either supportive measurements related to that primary outcome or measurements of effects related to the secondary objectives. ICH E9 also goes on to say that the number of secondary variables should be limited and should be related to the limited number of questions to be answered in the trial. Although this approach is desirable, multiple primary endpoints are often required and the number of secondary endpoints is also not limited but may extend to a lengthy list. In the sections that follow, these issues and the corresponding statistical methodologies are reviewed.
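The multiple-testing concern raised above can be quantified with a short calculation. Under the idealized assumption of independent endpoints and no true treatment effects (an assumption made only for this illustration), the chance of at least one nominally significant result grows quickly with the number of endpoints tested at an unadjusted 0.05 level:

```python
# Familywise type I error when k endpoints are each tested at a nominal 0.05
# level, assuming (unrealistically) independent endpoints and no true effects.
alpha = 0.05
for k in (1, 2, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** k    # probability of at least one false positive
    bonferroni = alpha / k         # per-test level that restores overall control
    print(f"k={k:2d}  unadjusted FWER={fwer:.2f}  Bonferroni per-test alpha={bonferroni:.4f}")
```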
1 LITERATURE REVIEW
Historically, little discussion has occurred in the statistical literature on the analysis of secondary endpoints. The recent increase in interest in this topic reflects a heightened awareness that all clinical trials will analyze secondary endpoints and that, without a priori decision rules, the interpretation of these results will be data-driven, or seen as data-driven, and consequently viewed as exploratory or hypothesis-generating. Just as important, clinical trials have become more complex and expensive resulting in a natural expectation that today’s trial will provide more evidence on a broader range of topics than its predecessors. In addition, as more and better treatments are developed for a given disease, the need to demonstrate a therapeutic advance will often demand a
clinical trial that evaluates both a primary endpoint and one or more secondary endpoints in a rigorous manner. This desire to define a framework for the analysis of secondary endpoints represents a middle ground between the statistical purist who advocates controlling α at .05 and spending it all on the primary endpoints with other P-values regarded as exploratory, and the less confining viewpoint of the trialist who would like the freedom to calculate P-values on important secondary endpoints and interpret the resulting pattern in conjunction with the results of the primary endpoint and the scientific context of the trial—and claim validity as deemed appropriate. The precipitating factor that raised the interest of statisticians in secondary endpoints was the evaluation of Carvedilol (Coreg) by the Cardiovascular and Renal Drugs Advisory Committee of the U.S. Food and Drug Administration. Fisher (7) provides a comprehensive overview of the complexities of the issue. Carvedilol had demonstrated a compelling and statistically significant effect on mortality, a secondary endpoint, in congestive heart failure across a series of four studies, but had demonstrated a statistically significant effect on the primary endpoint, clinical progression, in only one of these studies. In the other three studies, the primary endpoint of exercise was not statistically significant. The issue of contention in the May, 1996 meeting of this committee was the failure to identify two studies with a statistically significant primary endpoint despite an overall mortality endpoint across all four studies and a statistically significant mortality endpoint in two of the individual studies. Although this issue galvanized the statistical community, it was not an ideal starting point for the discussion because it involved an atypical situation. The secondary outcome was mortality, unique because of its importance and objectivity, and this result was first noted by a data and safety monitoring board (DSMB) looking across a series of trials. The paper by Davis (8) argues that secondary variables should be evaluated with a significance level of α/(k + 1), where α is the significance level for analysis of the primary endpoint and k is the number of secondary endpoints, and that this evaluation should
not be conditional on achieving statistical significance on the primary endpoint. The thrust of this article is that secondary endpoints represent a major portion of the value to be extracted from a clinical trial and that it is beneficial to extract as much information as possible, even if the primary endpoint is negative. Further, the rule proposed does control overall significance at the (2k + 1)α/(k + 1), significance for primary endpoints at the customary α, and significance for secondary endpoints at kα/(k + 1). O’Neill (1) states that secondary endpoints cannot be validly analyzed in the absence of a significant primary endpoint. He focuses on the decision-making setting, especially as it relates to the drug-approval process, and defines a primary endpoint as one that would support a regulatory approval. A secondary endpoint, by contrast, provides evidence useful in characterizing the treatment effect but not sufficient to support a regulatory claim of treatment effect. A valid analysis is one that controls overall type 1 error as the customary α level and provides adjustment for the multiplicity of endpoints. He points out that interpreting nominally significant secondary endpoints as a ‘‘win,’’ in the sense of supporting a regulatory claim, in a clinical trial causes the significance level to exceed α and damages the validity of the inference. O’Neill also highlights the difficulty in interpreting secondary endpoints, given that trials often lack sufficient power for secondary endpoints, as well as the need to replicate findings in a second study to confirm exploratory results. The implicit position advocated is that primary endpoints should act as a gatekeeper and that a win on the primary endpoint(s) allows entry to valid inference on secondary outcomes; however, no view exists on what multiplicity adjustment, if any, is appropriate. The corollary is that studies that do not win on the primary endpoint(s) do not permit valid inference, and nominal P-values should be seen as exploratory. Prentice (3), in discussing these papers, emphasizes the extent to which, in largeoutcome studies of relatively healthy individuals, the distinction between primary and secondary endpoints is somewhat arbitrary and notes that trial monitoring may not even focus on the primary endpoint. Hence,
although some recent work has occurred here, a need exists to further develop statistical methodology that would allow for consideration of both primary and secondary endpoints in the trial monitoring process both for testing and for estimation. Prentice also notes that if clinical trials are monitored based on the primary endpoint, estimation of treatment effects on secondary outcomes will be affected by this monitoring; he discusses the methodology for dealing with the resulting bias. He offers a qualified endorsement of Davis’ strategy on secondary endpoints as a reasonable compromise to allow for valid inference among a limited, prespecified group of secondary endpoints without paying a steep price in lost power by modifying the α spend for primary endpoints. In an article in which the statistical controversy around carvedilol is used as a starting point, Moy´e (9) makes the case that control of type I and type II error rates with appropriate multiplicity control are the generally accepted standard by which the scientific community should judge the success of confirmatory studies. Further, if a trial fails to meet this criterion, no further inference based on secondary endpoints is valid, whether based on consistent results across a series of studies, even if the secondary outcome is mortality. No matter how striking a result, it will not pass the hurdle of a confirmatory study if the analysis is not pre-specified in the protocol. Moy´e offers an alternative methodology, labeled the prospective alpha allocation scheme (PAAS), that retains strict pre-specification of α spend and a priori designation of primary and secondary endpoints but potentially relaxes α spend by allowing an α P to be spent on primary endpoints and an α S to be used for secondary endpoint resulting in an overall α E (0.1 is offered as a standard) for the trial. When α P is set at the customary 0.05 level, PAAS allows trials to be powered at traditional levels without an increase in sample size, uses the customary levels of evidence for assessing primary endpoints, and demands that secondary endpoints be specified and have their level of significance controlled. It also provides the flexibility to set α E at the 0.05 level and offer this same framework for controlling valid inference among primary and
secondary endpoints without an increase in the experiment-wise error rate, albeit at an increase in sample size. Although a loosening of the criteria for a positive trial exists, a corresponding tightening of the standard for assessing secondary endpoints occurs. This additional flexibility does result in more complexity and difficulty in the interpretation of results, especially for those secondary outcomes that are not adequately powered. O’Neill (10), in commenting on the second Moy´e paper, points out some of the risks associated with the PAAS scheme. Individuals planning a study may be tempted to artificially classify primary outcome variables as secondary outcomes simply because it increases the likelihood of a positive trial without an increase in sample size. The dichotomy of primary and secondary endpoints can be arbitrary, and pre-specification does not ameliorate the problem. The PAAS method allows the experimenter to explore different strategies by designating some variables as primary and others as secondary. The resulting conclusions can be counterintuitive, changing solely because of a reclassification. O’Neill argues that medical judgment should be the only factor in deciding which variables should be primary and which should be secondary. Hence, significance of a secondary variable should not qualify a study as positive and significance of a secondary endpoint only occurs if the primary endpoint is positive and the role of the secondary is to confirm the primary. Although O’Neill does not support Moye’s proposal, he does suggest that reviewing the expected percentiles of the P-values, based on protocol-planned treatment effects for primary and secondary variables, can be a useful exercise both at the protocol planning stage and at the analysis stage. In discussing the PAAS scheme, Koch (11) notes that confirmatory clinical trials do provide valuable information on primary and secondary endpoints and that restricting valid inference to a primary endpoint does not make full use of the available information. Use of the PAAS scheme, in conjunction with α E = 0.1, provides a rigorous and a priori structure for evaluating secondary endpoints. Koch points out that this method may be
most useful for those diseases in which relatively little is known about the evaluation of therapy and in which no proven treatments exist. Having the greater flexibility of PAAS does lessen the risk of choosing the wrong primary endpoint and losing any inferential ability regarding secondary endpoints. The resampling P-values, suggested by Westfall and Young (12), are presented as a useful method for evaluating the pattern of nominal P-values against the PAAS structure, and Koch elaborates on how the resampling P-values can be used to construct a decision rule in conjunction with PAAS. He also discusses options for PAAS when evaluating a multiplicity of secondary endpoints. 2 REVIEW OF METHODOLOGY FOR MULTIPLICITY ADJUSTMENT AND GATEKEEPING STRATEGIES FOR SECONDARY ENDPOINTS For the most part, the strategy employed in analyzing secondary endpoints has been empirical and informal. In practice, if the study produced a ‘‘win’’ for the primary endpoint(s), with appropriate multiplicity control, secondary endpoints could be analyzed and nominal P-values calculated and interpreted without any multiplicity control as long as these secondary endpoints were manageable in number and viewed as supportive to and not a replacement for the primary endpoints. Here, appropriate multiplicity control is defined as strong control of the familywise error rate (13) In this context, studies that failed to win on the primary endpoint provided no confirmatory evidence on secondary endpoints regardless of the strength of evidence. Of course, outside of the regulatory environment, the constraints could be relaxed. This greater latitude can be seen most clearly in the scientific publication process where the analysis and interpretation of secondary variables is at least somewhat dependent on the author, editor, journal, and clinical context. To place complete multiplicity control on primary endpoints and essentially none on secondary endpoints, given a win on primaries, has been characterized by some statisticians as unbalanced and wasteful of
information. The papers by Davis (8), Moyé (9), and Koch (11) try to rectify this situation by developing strategies that approach multiplicity control for primary and secondary endpoints in a unified, although not identical, manner. An alternative approach to this problem is to employ strict multiplicity control for primary and (key) secondary endpoints. In the "fixed sequence" approach, this control can be accomplished by defining a strict hierarchy of secondary endpoints and advancing to analysis of a secondary endpoint only if the primary endpoints and each secondary endpoint above it in the hierarchy are significant. The secondary endpoints are generally ordered by importance but incorporate an a priori assessment of power. In the "gatekeeper" approach, the secondary endpoints are tested only after a significant primary endpoint, and the secondary endpoints have strict multiplicity control using the Bonferroni method or some variant. Westfall and Krishen (14) consider a general class of closed multiple test procedures and demonstrate that gatekeeping procedures and fixed sequence procedures fall within this class of closed procedures (as limiting cases) and thus provide strong control of the family-wise error rate. In so doing, they provide a sound theoretical framework for the widely used gatekeeping and fixed sequence procedures. Dmitrienko et al. (15) elaborate on this theme by considering the general situation in which a family of null hypotheses H1, . . . , Hm is grouped into a primary family F1 = {H1, . . . , Hk} and a secondary family F2 = {Hk+1, . . . , Hm}. The F1 family serves as a gatekeeper in the sense that if at least one hypothesis in the F1 family is rejected, under strong control of the family-wise error rate, one can go on to test hypotheses in the F2 family. This type of procedure is labeled a parallel gatekeeping procedure. Conversely, if every hypothesis in the F1 family must be rejected to allow the testing of hypotheses in the F2 family, this type is a serial gatekeeping procedure. The authors present algorithms for selecting weights associated with each hypothesis that define parallel and serial Bonferroni gatekeeping procedures. They also show how to use these gatekeeping procedures to implement the Simes procedure
and resampling tests. These latter methods provide improved power compared with Bonferroni but are computationally and conceptually more complex. The authors compare the Bonferroni and Simes parallel gatekeeping procedures with the PAAS scheme using equal and unequal weights for various combinations of effect sizes and correlation among endpoints. They found that the gatekeeping procedures have greater efficiency when the primary endpoints have greater power, whereas the PAAS scheme is more efficient when the primary endpoints are underpowered and the secondary endpoints have high power. Of course, the latter situation may not provide marked benefit if the regulatory mandate requiring a win on primary endpoints is followed.

3 SUMMARY

Clearly, the discussion surrounding the use and interpretation of secondary endpoints has ramped up considerably in the last 6 to 8 years. Although this discussion has largely been an extension of the multiplicity issues developed for the general multiple endpoints problem, some very innovative extensions of the methodology have resulted. What is also clear from the review of the literature and suggested solutions above is that the handling of secondary outcomes should revolve around the primary clinical question to be asked in the trial. No amount of statistical machinery can overcome a lack of clarity around the clinical question, and, although most of the solutions have been put in the statistician's lap to develop, in the end the primary need is for a focused and hopefully concise statement of the clinical problem leading to a clear hypothesis, objective, and endpoints. Although a great deal of methodological progress has been made, there is a decided lack of research on how to analyze and interpret clinical endpoints with and without consideration of the clinical questions at hand. Given that trials are time-consuming and complex, a drive to collect as much information as possible in a given trial will always exist, and it behooves statisticians
to work with their clinical colleagues to come up with a unified solution to maximize the interpretability of the information collected in trials. REFERENCES 1. R. T. O’Neill, Secondary endpoints cannot be validly analyzed if the primary endpoint does not demonstrate clear statistical significance. Controlled Clin. Trials 1997; 18: 550–556. 2. R. B. D’Agostino, Sr., Controlling alpha in clinical trial: the case for secondary endpoints, Editorial. Stat. Med. 2000; 19: 763–766. 3. R. L. Prentice, Discussion: on the role and analysis of secondary outcomes in clinical trials. Controlled Clin. Trials 1997; 18: 561–567. 4. L. A. Moye, End-point interpretation in clinical trials: the case for discipline. Controlled Clin. Trials 1999; 20: 40–49. 5. ICH, ICH Triparlite Guideline E-9 Document, statistical principles for clinical trials. International Conference on Harmonization (ICH) of Technical Requirements for Regulations of Pharmaceuticals for Human Use, 1998: 2. 6. CPMP Working Party on Efficacy of Medicinal Products, Biostatistical methodology in clinical trials in applications for marketing authorizations for medicinal products. Stat. Med. 1995; 14: 1659–1682. 7. L. D. Fisher, Carvedilol and the Food and Drug Administration (FDA) approval process: the FDA paradigm and reflections on the hypothesis testing. Controlled Clin. Trials 1999; 20: 16–39. 8. C. E. Davis, Secondary endpoints can be validly analyzed, even if the primary endpoint does not provide clear statistical significance. Controlled Clin. Trials 1997; 18: 557–560. 9. L.A. Moye, Alpha calculus in clinical trials: considerations and commentary for the new millennium. Stat. Med. 2000; 19: 767–779. 10. R. T. O’Neill, Commentary on: alpha calculus in clinical trials: considerations and commentary for the new millennium. Stat. Med. 2000; 19: 785–793. 11. G. G. Koch, Discussion for: alpha calculus in clinical trials: considerations and commentary for the new millennium. Stat. Med. 2000; 19: 781–784. 12. P. H. Westfall and S. S. Young, ResamplingBased Multiple Testing: Examples and Methods for p-Value Adjustment. New York: Wiley, 1993.
13. A. C. Tamhane, Multiple Comparison Procedures. New York: Wiley, 1987. 14. P. H. Westfall and A. Krishen, Optimally weighted, fixed sequence, and gatekeeping multiple testing procedures. J. Stat. Plan. Inference 2001; 99: 25–40. 15. A. Dmitrienko, W. Offen, and P. H. Westfall, Gatekeeping strategies for clinical trials that do not require all primary effects to be significant. Stat. Med. 2003; 22: 2387–2400.
FURTHER READING S. M. Berry and D. A. Berry, Accounting for multiplicities in assessing drug safety: a three-level hierarchical mixture model. Biometrics 2004; 60: 418–426. C. E. Bonferroni II, Calcolo Delle Assicurazioni Su Gruppi di Teste, Studi in Onore del Professore Salvatore Ortu Carboni. Rome: 1935: 13–60. T. C. Chamberlain, The method of multiple working hypotheses. Science 1965; 148: 754–759. A. Dmitrienko, G. Molenberghs, C. Chuang-Stein, and W. Offen, Analysis of Clinical Trials Using SAS: A Practical Guide. Cary, NC: SAS Publishing, 2004, chapter 2. L. D. Fisher, Carvedilol and the Food and Drug Administration-approval process: a brief response to Professor Moye’s article. Controlled Clin. Trials 1999; 20: 50–51. Y. Hochberg, A sharper Bonferroni procedure for multiple significance testing. Biometrika 1998; 75: 800–802. S. Holm, A simple sequentially rejective multiple test procedure. Scand. J. Stat. 1979; 6: 65–70. G. Hommel, A stagewise rejective multiple procedure based on a modified Bonferroni test. Biometrika 1998; 75: 383–386. G. Hommel, Test of the overall hypothesis for arbitrary dependence structures. Biometrical J. 1983; 25: 423–430. D. M. Mehrotra and J. F. Heyse, Use of the false discovery rate for evaluating clinical safety data. Stat. Meth. Med. Res. 2004; 13: 227–238. L. A. Moye, Response to commentaries on: alpha calculus in clinical trials: considerations for the new millennium. Stat. Med. 2000; 19: 795–799. Z. Sidak, Rectangular confidence regions for the means of multivariate normal distributions. J. Amer. Stat. Assoc. 1967; 62: 626–633. R. J. Simes, An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986; 63: 655–660.
The European Agency for the Evaluation of Medicinal Products, Evaluation of Medicines for Human Use: Points to Consider on Multiplicity Issues in Clinical Trials. London, UK: EMEA, 2002.
SENSITIVITY, SPECIFICITY, AND RECEIVER OPERATOR CHARACTERISTIC (ROC) METHODS.
HELENA CHMURA KRAEMER
Stanford University, Stanford, California

1 EVALUATING A SINGLE BINARY TEST AGAINST A BINARY CRITERION

The basic problem is to evaluate a single binary test against a binary criterion. Consider a population in which the probability of a positive "gold standard" or "reference" diagnosis (D+) is P, and a binary test in which the probability of a positive test (T+) is Q. The diagnosis represents the best criterion currently available to identify the disorder in question, but it is not usually available for routine use in clinical decision making (e.g., a result obtained on autopsy or from long-term follow-up). P may be a prevalence at the time of testing or an incidence during a fixed follow-up in the population of interest. The probability situation in the population is described in Table 1. Sensitivity is defined as Se = a/P; specificity as Sp = d/P′. Thus, a perfectly accurate test will have Se = 1 and Sp = 1. Realistically, this will never happen because virtually no concurrent diagnostic criterion has perfect reliability, and no prognostic criterion is absolutely assured at the time of diagnosis. The presence of a certain amount of error in the criterion means that no test (even a second independent opinion on the criterion itself) will predict the criterion perfectly. If test and criterion have but random association, Se = Q and Sp = Q′, and any test worth using in clinical practice should be able to do better than that. It should be noted that if negative association exists between any test and criterion, the labels T+ and T− might be reversed to make the association positive. Thus, any test worth clinical consideration will have Se > Q and Sp > Q′, or Se + Sp > 1. It is almost as rare to locate a medical test seriously considered for diagnosis that is randomly associated with the criterion as it is to find a medical test with perfect reliability. Thus, testing null hypotheses of randomness (e.g., using a 2 × 2 chi-square test) is not sufficient to evaluate tests. The more important issue is to assess the accuracy of a test quantitatively in a meaningful way. The simplest way is to calibrate the sensitivity and specificity of a test against their random and ideal values:

k(1) = (Se − Q)/(1 − Q)

k(0) = (Sp − Q′)/(1 − Q′)

These are weighted kappa coefficients (q.v.) that indicate how far between random and perfect the actual sensitivity and specificity are; zero indicates random association, and one indicates perfect association. If the concern in a particular situation were solely about avoiding false negatives, then one might well assess the accuracy of the test using k(1), as would be true in evaluating a mass screening test in which the consequences of a false negative might be orders of magnitude more troubling than the consequences of a false positive. If one instead demanded the highest sensitivity rather than recalibrated sensitivity, then the answer would always be to give everyone a positive test result (Q = 1), in which case Se = 1. However, if the concern were solely about avoiding false positives, then one might well use k(0), as would be true in seeking a definitive diagnosis that might stimulate invasive or risky medical procedures. Similarly, if instead one demanded the highest specificity, then the answer always would be to give everyone a negative test result (Q = 0), in which case Sp = 1. Generally, however, some risk or cost is incurred by both a false positive and a false negative, and neither of these extreme cases applies. This situation has led to the formulation of a weighted kappa that appropriately weights the clinical repercussions of the two types of errors using w, where w reflects the relative importance of false negatives to false positives (1–3). When w = 1, the total concern is on false negatives; w = 0 on false
positives; and w = 1/2 equally on both. Then the weighted kappa becomes:

k(w) = [PQ′ w k(1) + P′Q w′ k(0)] / [PQ′ w + P′Q w′], where w′ = 1 − w.

Table 1. Mathematical definitions of, and relationships between, the odds ratio and other common measures of 2 × 2 association, where a, b, c, and d are the probabilities of the four possible responses in relating diagnosis (D+ and D−) to medical test (T+ and T−).

            T+      T−
D+          a       b             P
D−          c       d             P′ = 1 − P
            Q       Q′ = 1 − Q    1

Odds ratio (OR) (also called the cross-product ratio) and rescaled versions:
OR = ad/(bc) = (Se Sp)/(Se′ Sp′) = (PVP PVN)/(PVP′ PVN′) = RR1 RR2 = RR3 RR4
Gamma = (OR − 1)/(OR + 1); Yule's index = (√OR − 1)/(√OR + 1)

Risk ratios (also called likelihood ratios or relative risks):
RR1 = Se/(1 − Sp), with (RR1 − 1)/(RR1 + P′/P) = k(0)
RR2 = Sp/(1 − Se), with (RR2 − 1)/(RR2 + P/P′) = k(1)
RR3 = PVP/(1 − PVN), with (RR3 − 1)/(RR3 + Q′/Q) = k(1)
RR4 = PVN/(1 − PVP), with (RR4 − 1)/(RR4 + Q/Q′) = k(0)

Percentage change in risks: RR − 1 or (RR − 1)/RR.

Risk differences (also called Youden's index):
RD1 = Se + Sp − 1 = k(P′)
RD2 = PVP + PVN − 1 = k(Q)

Weighted kappas:
k(1) = (ad − bc)/(PQ′) = (Se − Q)/Q′ = (PVN − P′)/P
k(0) = (ad − bc)/(P′Q) = (Sp − Q′)/Q = (PVP − P)/P′ (also called attributable risk)
1/k(w) = w/k(1) + w′/k(0), for any w between 0 and 1, with w′ = 1 − w.

Phi coefficient (product moment correlation coefficient between test and diagnosis):
φ = (ad − bc)/(PP′QQ′)^(1/2) = [k(1) k(0)]^(1/2), and φ/φmax = max(k(1), k(0))
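Because all of these indices are simple functions of the four cell probabilities, a short computational sketch may help fix the notation. This is only an illustration, not part of the original entry: the function name and the cell counts are hypothetical, and w defaults to 1/2.

```python
def two_by_two_measures(a, b, c, d, w=0.5):
    """Accuracy measures for a 2x2 table of counts.

    a = D+T+, b = D+T-, c = D-T+, d = D-T- (counts, not probabilities).
    w is the relative importance of false negatives versus false positives.
    """
    n = a + b + c + d
    a, b, c, d = a / n, b / n, c / n, d / n      # convert counts to probabilities
    P, Q = a + b, a + c                          # prevalence P and test level Q
    Pp, Qp = 1 - P, 1 - Q                        # P' and Q'
    se, sp = a / P, d / Pp
    pvp, pvn = a / Q, d / Qp
    k1 = (se - Q) / (1 - Q)                      # k(1): calibrated sensitivity
    k0 = (sp - Qp) / (1 - Qp)                    # k(0): calibrated specificity
    wp = 1 - w
    kw = (P * Qp * w * k1 + Pp * Q * wp * k0) / (P * Qp * w + Pp * Q * wp)
    return dict(Se=se, Sp=sp, PVP=pvp, PVN=pvn, k1=k1, k0=k0, kw=kw)

# Hypothetical counts: 90 true positives, 10 false negatives,
# 50 false positives, 850 true negatives.
print(two_by_two_measures(90, 10, 50, 850, w=0.5))
```

For w = 1/2, the value returned as kw coincides with Cohen's kappa, as noted in the text that follows.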
When w = 1/2, k(1/2) is the well-known ‘‘Cohen’s kappa’’ or ‘‘unweighted kappa.’’ Indeed, w = 1/2 is the value most often used because often little is available to choose preferentially in terms of clinical repercussion between a false positive and a false negative result. No well-accepted standards are found for how large the appropriate k(w) must be for a ‘‘good’’ medical test. One possibility is reflected in the standards suggested for observer agreement using kappa: > 0.8, ‘‘Almost Perfect’’; 0.6–0.8, ‘‘Substantial’’; 0.4–0.6, ‘‘Moderate’’; and finally, 0.2–0.4 and ≤0.2, ‘‘Slight’’ to ‘‘Poor.’’ These standards set
what seem reasonable standards for medical tests. Another possibility avoids setting such standards and focuses instead on comparing all the available tests for a particular diagnosis in a population against each other and choosing the one that is best. For that, Receiver Operator Characteristic (ROC) methods (below) seem particularly appropriate. However, before discussing the ROC approach, it is important to note that sensitivity/specificity are not the only measures of test accuracy in common use (see Table 1 for many of the most common ones). For example, many analysts prefer the predictive values:

Predictive Value of a Positive Test (PVP) = a/Q

Predictive Value of a Negative Test (PVN) = d/Q′
Once again, a perfectly accurate test will have both PVP and PVN equal to one, but this situation is rarely possible because of the unreliability of the criterion. In the worst possible situation, when test and criterion have random association, PVP = P and PVN = P′. In clinical practice, one must be able to do better than that, and thus it is necessary that PVP > P and PVN > P′, and thus PVP + PVN > 1. Calibrating PVP and PVN, we find: k(0) = (PVP − P)/(1 − P) and k(1) = (PVN − P′)/(1 − P′). Thus, once calibration is done, PVN and PVP give the same information as do calibrated sensitivity/specificity. Moreover, as indicated in Table 1, numerous other measures of 2 × 2 association are seen in the medical literature, but, as was true with the predictive values, all the common ones relate to one or another of the k(w), for some value of w (4). For example, one risk difference is Se + Sp − 1 = k(P′), and another, PVP + PVN − 1 = k(Q). The phi coefficient (the product moment correlation between T and D) equals [k(1)k(0)]^(1/2) (4). The only major exception seems to be the odds ratio (4–6). For this reason, and because of the availability of ROC methods, we continue to focus on sensitivity and specificity. One important caveat: If the population changes, then the sensitivity and specificity of a test evaluated against a diagnostic criterion may well change and usually does (as will every other measure of 2 × 2 association, including the odds ratio). Although it is possible that the sensitivity and specificity of a test evaluated against a criterion may be the same in two different clinical populations, generally, as the prevalence or the incidence, P, increases, the sensitivity tends to increase to some degree and the specificity to decrease. This caveat is important because many statistical presentations have been exhibited in which it is assumed that the sensitivity and specificity of a test (and thus the odds ratio) evaluated against the same criterion are the same regardless of which clinical population is sampled, despite the number of times this claim has been disproved theoretically and empirically (7,8).
2 EVALUATION OF A SINGLE BINARY TEST: RECEIVER OPERATING CHARACTERISTIC (ROC) METHODS

Receiver operating characteristic (ROC) methods have their origin in engineering applications, in which a signal (corresponding here to the diagnostic criterion) is either transmitted or not, and a detector (here the test) either detects it or does not (hence the term signal detection methods). In the engineering applications, often toggles and dials were on the detector with settings that could be varied, each setting resulting in a different sensitivity and specificity, with sensitivity there called the "detection rate" and 1 − Sp the "false alarm rate." The major difference between the engineering application and that related to the evaluation of medical tests is that in engineering applications the signal is in the control of the researcher: it was known with certainty on each trial whether the signal was transmitted. In medical applications, the "signal," for example, the diagnosis itself, has some degree of error and is not in the control of the investigator. However, the concepts derived from the engineering application still have value in medical test evaluation. The ROC plane has Se ("hits") on the y-axis and 1 − Sp ("false alarms") on the x-axis. Any test evaluated against a particular criterion in a particular population is located as one point in the ROC plane: the test point (Fig. 1). The location of that test point tells a great deal about the relationship of the test to outcome, but understanding what it tells requires understanding the geography of the ROC plane. Any test that is randomly associated with the criterion has Se + Sp = 1 and thus is located on the diagonal line labeled "Random ROC" in Fig. 1. Any test below that line is negatively associated with the criterion; any test above that line is positively associated. The ideal test, of course, is at the point (0,1), labeled "Ideal Point" in Fig. 1. The closer a test point is to the Ideal Point and the further it is from the Random ROC, the more accurate the test for that diagnosis. One additional geographical reference is needed: the "diagnosis line" is a line joining the Ideal Point with the point (P,P) on
the Random ROC, where P, once again, is the prevalence or incidence of the criterion diagnosis. Because the criterion and the population are fixed, and the tests we will be considering are all evaluated against that criterion in that population, this line is a fixed line. For any test with level Q, its test point lies in the ROC plane on the line parallel to the diagnosis line, intersecting the Random ROC at the point (Q,Q) (see Fig. 1). This expression is a geometric expression of Bayes's Theorem because (from Table 1): Q = P Se + P′ Sp′. This reference line becomes very important in comparing multiple tests against the same criterion in the same population. Positively biased tests (Q > P) lie above the diagnosis line; negatively biased tests (Q < P) lie below it. Unbiased tests (Q = P) lie on the diagnosis line. In short, if one is shown a ROC plane with a test point, all the information in the 2 × 2 table can be read directly from the graph. The value of P is read from the point where the diagnosis line intersects the Random ROC. The value of Q is read from the point where a line parallel to the diagnosis line through the test point intersects the Random ROC. The values of sensitivity and specificity are read from the x- and y-axes that locate the test point. Any other statistic based on the entries in the 2 × 2 table then can be computed using these values. Thus, all the statistical information available on which to base the evaluation of
a medical test is embodied in its location in the ROC plane.

Figure 1. The ROC plane, a graph of Se versus 1 − Sp, which indicates the location of the Random ROC, the Ideal Point, the Diagnosis Line, and, for a single dichotomous test response, the location of a test point and its ROC curve. AUC is the area under this ROC curve.

3 EVALUATION OF A TEST RESPONSE MEASURED ON AN ORDINAL SCALE: RECEIVER OPERATING CHARACTERISTIC (ROC) METHODS

Now let us expand the situation from a single test point to one in which the response to a test is measured on an ordinal scale, and each test is some dichotomization of that scale. For example, body temperature might be measured in degrees Celsius, and the diagnosis "fever" is defined as having a temperature greater than some value on that scale. Then, for every possible cut point, one can locate a test point on the ROC plane: thus, if the test response is T, then a test is positive (T+) if T > x and negative (T−) if T < x; x can be set to any possible value on the X-scale. When T = x, one flips a fair coin, and half will be designated T+ and half T−. For every possible value of x, we then can compute Se(x) and Sp(x) and locate that test in the ROC plane. As x decreases, Se(x) increases, as does 1 − Sp(x). Thus, the points in the ROC plane that correspond to the different values of x move up and to the right as x decreases. The convex hull of this set of points, that is, the "curve" connecting those points, including the two reference corner points (0,0) and (1,1), defines the "ROC curve" for that test response. If ties are found in T, then the ROC curve may not be smooth. If T is measured on a
binary scale, then only one point exists in the ROC plane, and the ROC curve is a triangle perched on the Random ROC. If T is measured on a discrete scale with p possible points, then there are p − 1 points in the ROC plane, and the ROC curve is a series of straight lines joining those points in order. For a continuous T, theoretically the ROC curve is a smooth curve that joins the two corners of the ROC plane, but even then, in a sample, the ROC curve may not be smooth because not all possible values will be observed. For example, the ROC curve in Fig. 1 is that for a dichotomous T. The better the test response discriminates those that are D+ from those that are D−, the higher the ROC curve arches above the Random ROC. One measure of how well the ordinal test response discriminates between those with D+ and D− is the "Area Under the ROC Curve" (AUC). A test response with no association to D will have AUC = 0.5; ideally, AUC = 1. Because it is possible that some values of the cut point x may lead to positive association with D and others to negative association, it is possible that AUC = 0.5 even when some association is found between diagnosis and test. AUC is an effect size with a useful clinical meaning; it equals the probability that a randomly chosen subject with D+ has X larger than a randomly chosen subject with D− (ties randomly broken). For ordinal T, this is the effect size tested with the Mann–Whitney U-test. Indeed, the easiest way to compute the sample AUC for an ordinal test response is often to compute the Mann–Whitney U-statistic comparing the X values for those with D+ and D−, in which case:

AUC = U/(N²PP′)
where N is the total sample size, and P and P′ are the fixed proportions in the D+ and D− groups. Of course, the Mann–Whitney test tests the null hypothesis of random association between the ordinal test response and the diagnostic criterion. If T is dichotomous (e.g., male and female), then computation of the AUC is even easier because AUC = 0.5(Se + Sp). One can estimate and test the AUC in this case, using a
binomial test of the equality of two binomial proportions (null hypothesis: Se = 1− Sp). A large value of AUC indicates that the D+ and D− groups are well discriminated by X. However, such a result does not indicate how one might discriminate D+ from D− using T in clinical application. For that, what is needed is an optimal point of discrimination, x*, that cut point for T that best discriminates D+ from D− . The problem is the criterion one would want to use to identify ‘‘optimal.’’ In some situations, a false negative is far worse in terms of clinical consequences than a false positive; in others, the situation is reversed. A proposed geometric approach (9) is as follows: Draw a line with slope P(1− w)/(P’w) through the Ideal Point, where w, as above, reflects the relative importance of false negatives to false positives. Then, push that line down toward the ROC curve. The first point on that curve that the line touches is the optimal cut point for the test in that population (P) with the specified balance between the clinical consequences of false negative to false positives (w) (see Fig. 1). Thus, if the concern were totally about false negatives, w = 1, then the line is horizontal, and the optimal test is defined by the cut point that leads to the best sensitivity or highest predictive value of a negative test. If the concern is totally about false positives, w = 0, then the line is vertical, and the optimal test is defined by the cut point that leads to the best specificity or the highest predictive value of a positive test. If w = 1/2, then the line is perpendicular to the diagnosis line, and the optimal point is likely to be a cut point on the ROC curve very near the diagnosis line, for example, an unbiased test with Q = P. Generally, if one is most concerned about false negatives, Q > P; when one is most concerned about false positives, Q < P; and when one is equally concerned about both, Q ≈ P, which makes good common sense. An equivalent analytic approach (10), is simply to compute k(w) for each possible cut point x, and x* is that cut point that maximizes k(w). Several points are worth noting:
1. The results of the ROC method are invariant under all monotonic transformations. Thus, applying this method to ordinal T or to any monotonic function of T, f(T), produces exactly the same ROC curve and the same AUC, and if the optimal cut point for T is x*, then the optimal cut point for f(T) is f(x*).

2. If T is grouped, for example, if age is measured in years rather than exactly, then every point on the ROC curve for the grouped data is also on the ROC curve for the ungrouped data. However, the curve for the grouped data will be less smooth than that for the ungrouped data, for grouping increases the number of ties in T. Thus, some points on the ROC curve for the ungrouped data will be missing from that for the grouped data. Yet, as long as the grouping is not overly coarse, the optimal cut points will be approximately, but not exactly, the same x*.

3. In application of these methods, T is assumed to be ordinal, but the method is, as noted, equally applicable to a two-point (as in Fig. 1) or three-point scale as to a continuum. However T is measured, no distributional assumptions are required; the method is completely distribution free. However, many papers in the literature derive results for the ROC curve, the AUC, or the optimal cut point assuming that T has a normal distribution among those with D+ and also a normal distribution among those with D−, with different means but possibly the same variance. Such results are not necessarily robust to deviations from these distributional assumptions. Because D+ and D− are not typically totally reliable, the assumption of two normal distributions among those with D+ and those with D− is rarely true in real clinical situations. Thus, insights into methods can certainly be gained from examination of such parametric ROC results, but the results of applying such parametric methods are questionable when applied to the evaluation of clinical tests.
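Before turning to families of tests, a brief distribution-free sketch may make the AUC computation and the analytic search for the optimal cut point x* concrete. This is an illustration only: the data are hypothetical, the helper functions are not from the original entry, and ties at the cut point are simply called negative rather than randomized by a coin flip.

```python
def auc(t_pos, t_neg):
    """AUC = Pr(T for a D+ subject > T for a D- subject), ties counted 1/2.
    Equals the Mann-Whitney U statistic divided by n1*n2."""
    wins = sum((x > y) + 0.5 * (x == y) for x in t_pos for y in t_neg)
    return wins / (len(t_pos) * len(t_neg))

def kappa_w(se, sp, P, Q, w):
    """Weighted kappa k(w) written in terms of Se, Sp, P, and Q."""
    k1 = (se - Q) / (1 - Q)
    k0 = (sp - (1 - Q)) / Q
    num = P * (1 - Q) * w * k1 + (1 - P) * Q * (1 - w) * k0
    den = P * (1 - Q) * w + (1 - P) * Q * (1 - w)
    return num / den

def optimal_cut(t_pos, t_neg, w=0.5):
    """Return (x*, k(w)) where x* maximizes k(w) over the observed cut points;
    a subject is called T+ when its response exceeds the cut point."""
    n_pos, n_neg = len(t_pos), len(t_neg)
    P = n_pos / (n_pos + n_neg)
    best = None
    for x in sorted(set(t_pos + t_neg)):
        se = sum(v > x for v in t_pos) / n_pos
        sp = sum(v <= x for v in t_neg) / n_neg
        Q = P * se + (1 - P) * (1 - sp)      # Bayes: Q = P*Se + P'*Sp'
        if 0 < Q < 1:
            k = kappa_w(se, sp, P, Q, w)
            if best is None or k > best[1]:
                best = (x, k)
    return best

# Hypothetical ordinal test scores for D+ and D- subjects.
d_pos = [3, 5, 6, 6, 7, 8, 9]
d_neg = [1, 2, 2, 3, 4, 4, 5, 6]
print("AUC =", auc(d_pos, d_neg))
print("optimal cut point and k(w):", optimal_cut(d_pos, d_neg, w=0.5))
```

The AUC computed this way equals U divided by the product of the two group sizes, which matches the expression U/(N²PP′) given above.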
4 EVALUATION OF MULTIPLE DIFFERENT TESTS

4.1 A Family of Tests

When the test response is measured on an ordinal scale and the test to be used in clinical application is some dichotomization of that scale, we are dealing with a "family" of possible tests. In this case, the various tests in the family correspond to the different possible cut points, and the totality of resulting test points in the ROC plane form a curve that leads from (0,0) to (1,1). However, a "family" of possible tests can be much broader, comprising X1, X2, X3, . . . , Xn, a multivariate test response, each component of which may be either binary or measured on possibly different ordinal scales, not even necessarily on the same scale. Now we can expand the definition of the "ROC curve" to a family of tests as the convex hull of all the test points in that family. Thus, for a single test response T, the ROC curve would be exactly as described above, but for a multivariate T, we now have defined a new ROC curve, where not all the test points on the curve necessarily come from the same Ti. Indeed, it often happens that certain tests are "more sensitive" and others "more specific," meaning that the test that would prove to be optimal using the methods above for different values of w may not be based on exactly the same test response. Given any family of tests, a family that may be multivariate, we can compute the ROC curve for the family and identify the best test among all the test responses and their cut points. However, it is not unusual that after the best test is so identified, one can, as they say, "increase the yield" by adding a second test or a third using a Boolean combination of tests. For example, Fig. 2 depicts a situation (11) in which baseline information (age, education, self-efficacy scores, and many others) was used to try to identify which subjects in a community sample would, over the 10 years of follow-up, improve their cardiac risk factors in response to a community intervention. Examining all cut points on all the baseline information considered, age was the best predictor, with an optimal cut point of 55 years (w = 1/2); fewer of the younger subjects improved their risk factors.
Figure 2. An example of a ROC tree, in which multiple test responses are sequentially evaluated against a criterion to develop the best possible multivariate discrimination between those with D+ (here, positive change in cardiac risk factors over 10 years of follow-up) and D−. [Tree summary: total sample, 69% changed. Age > 55 years: 83% changed. Age ≤ 55 years, split on Education: < 11 years, 42% changed; ≥ 11 years, split on Self-Efficacy: score < 7, 70% changed; score ≥ 7, 55% changed.]
5 THE OPTIMAL SEQUENCE OF TESTS
Suppose now that we have selected the best test from among the n original test responses, for simplicity, T1 with cut point x1 *, for example, Age in Fig. 2. What the best first test does is to divide the population into two subpopulations, those positive and negative on the best first test. These populations are obviously two different subpopulations, if only because they have very different values of P. However, it is quite possible that using T1 (with a different cut point), T2 , . . . , Tn may add unique and valuable new information to discriminate D+ from D− more in one or the other subpopulations, thus ‘‘increasing the yield.’’ Because the populations with T1+ and T1− are two different subpopulations, each different from the parent population, the performance of these tests in the two subpopulations is not known from their performance in the parent populations. Nor is it true that the performance of these tests in one subpopulation will be even similar to that in the other. Thus, the whole process must be repeated. What was done for the total population can now be repeated for each of the subpopulations, thus beginning to grow a testing ‘‘tree’’ such as that in Fig. 2. The multiple subpopulations at the end of the ‘‘tree’’ may either be used to define a single new test that describes which subgroups of patients defined by their test results would optimally be considered T+ or T− , or the multiple subgroups may be left as is, defining subgroups at different risks of D+ . Either approach could be very useful in
clinical decision making. The process can be repeated over and over again. For example, for the population in Fig. 2 when the search was repeated for individuals over 55 years of age (83% of whom improved their risk factors), no additional predictors of clinical value were found. Of those 55 years old or younger, however, education increased the yield. Individuals who were both younger and had low education (with the optimal cut point corresponding to less than a high school education) were least likely to change (42%). Individuals both younger and better educated were more likely to change, but in this subgroup, self-efficacy scores increased the yield. In this subgroup, in what may seem a paradoxical result, individuals with higher self-efficacy scores (7 or above) were less likely to change (55%); individuals with lower self-efficacy scores (below 7) were more likely to change (70%). Search for optimal predictors (tests) and for optimal cut points often confirm clinically intuitive choices, but, as here, often not, and the contradictions are often quite informative. In this case, individuals who were younger and better educated, who already were nonsmokers, were maintaining a healthy weight, were following a healthy exercise program, and so forth, were the ones who were likely to express high self-efficacy. But because their cardiac risk factors were already low, there was less room for improvement. The challenge is to know when to stop. One possible stopping rule is to stop (1) when no more tests are available, (2) when the sample size becomes too small for more credible
evaluation, or (3) when the optimal test, if it had been proposed a priori, would not have been found to have a statistically significant association with the criterion at a fairly stringent significance level (say, p < 0.001). This approach is, of course, not a legitimate test of any hypothesis, given the exploratory nature of the application. When such methods are used to develop sequential or composite medical tests, it is vital that the clinical value of such tests be confirmed in an independent sample with an a priori hypothesized testing procedure before the procedure enters clinical use.

6 SAMPLING AND MEASUREMENT ISSUES
Thus far, little attention has been paid to issues of sampling, which are, of course, vital to medical test evaluation.

6.1 Naturalistic Sampling

Conceptually, the easiest sampling procedure is a naturalistic sample from the population, in which a representative sample from the population is drawn, and both test and diagnosis are done on each subject, each "blinded" to the other. (Otherwise, association may be due to the biases of the evaluators rather than to the true association between test and criterion.) Then the proportion observed in each cell of Table 1 is an unbiased estimate of the corresponding population parameter, and Se and Sp, as well as any other measure of association, can be computed by substituting the estimates of a, b, c, and d in the formulas of Table 1. The observed number of D+ subjects has a binomial distribution with parameters N and P. Sensitivity and specificity, conditional on the observed numbers of D+ and D− subjects, are also binomial proportions, but because these are conditional distributions, how large N must be to estimate Se and Sp precisely depends strongly on the magnitude of P. Studies of populations in which the disorder is very rare or very common, that is, with P near 0 or 1, will require very large sample sizes for accurate results. Studies of relatively rare disorders are, of course, quite common, which leads to the need for consideration of other sampling strategies.
6.2 Prospective or Two-Stage Sampling

An alternative possibility is "prospective sampling" or "two-stage sampling," in which testing is done on a large number, N, of subjects in a representative sample from the relevant population at Stage 1. If the test to be evaluated is binary (perhaps the optimal test based on an ordinal test response), then one can estimate Q from this Stage 1 sample. Then, a certain proportion of the subjects with T+ and a certain proportion of those with T− are randomly sampled to enter the Stage 2 sample, the proportions decided by the researchers, and the diagnostic criterion is evaluated only for these Stage 2 subjects. For example, if P is small, one might choose to obtain the diagnostic criterion for 100% of the subjects with T+ but perhaps only 20% of those with T−, thus generating a sample for the second stage much "richer" in D+ than is the total population. In situations when the binary test is fairly easily done, but the diagnostic criterion requires, for example, a 10-year follow-up or an extensive, costly, risky, or invasive evaluation, this type of sampling becomes an attractive option. From the Stage 2 sample, one can estimate the PPV and PPN for the test on which sampling was based and any measure of 2 × 2 association that depends only on PPV and PPN. To estimate Se and Sp, however, an estimate of Q is obtained from the Stage 1 sample, and Bayes's theorem is used to estimate P, Se, and Sp (writing PPV′ = 1 − PPV and PPN′ = 1 − PPN):

P = Q PPV + Q′ PPN′

Se = Q PPV/(Q PPV + Q′ PPN′)

Sp = Q′ PPN/(Q′ PPN + Q PPV′)

With these estimates, one can then estimate any measure of 2 × 2 association in Table 1 (a small numerical sketch of this back-calculation is given after Section 6.3 below). The problem, of course, is that the test that underlies the Stage 2 sample must be binary. It is possible, using sampling weights, to evaluate parameters for other tests as well, but the power of the approach is determined by the test that underlies the Stage 2 sample. It is possible to stratify the population using an ordinal or multivariate test response, estimating the distribution of the strata in the
population using Stage 1 data, selecting a random subsample from each stratum for diagnosis in Stage 2, and then using sampling weights to estimate Se and Sp. Thus, such a sampling strategy can lead to very complicated analytic procedures, but it is worth considering in circumstances when naturalistic sampling is unfeasible.

6.3 Retrospective (Case-Control) Sampling

Another alternative often considered is a retrospective sample, for example, a case-control sample. Now a certain number of subjects with D+ and a certain number of subjects with D− are sampled (numbers selected by the researchers) and tested. This alternative is easily done if the test is diagnostic rather than prognostic, that is, if the evaluation of the diagnostic criterion and the test are done at the same time. When the test is prognostic, that is, the test is meant to be done at one time and then subjects are followed up over a subsequent span of time to see whether a disorder has later onset, then, with rare exceptions, the value of the pertinent test result cannot be determined. Important exceptions are fixed markers, characteristics of a subject that do not change over the lifetime of the subject, for example, year of birth, gender, race or ethnicity, and genotype. Because these fixed markers do not vary over the lifetime of any subject, it does not matter when they are measured. Then, for a diagnostic test, or for some rare exceptions for a prognostic test, one can estimate Se and Sp directly from the two subsamples. However, because there may be immigration or emigration from the population between the time testing would have been done in a naturalistic or prospective sample and the time retrospective sampling is done, it is difficult to obtain an unbiased estimate of P or Q for the population that would have been sampled in a naturalistic or a prospective sample. In the absence of an estimate of P, at best AUC can be estimated, not k(w) for any value of w. For all these reasons, retrospective or case-control approaches often provide misleading information about medical tests and should be used judiciously.
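A small numerical sketch of the two-stage back-calculation described in Section 6.2 (the function and the input values are hypothetical, not from the original entry):

```python
def two_stage_estimates(Q, PPV, PPN):
    """Recover P, Se, and Sp from the test level Q (Stage 1) and the
    predictive values PPV and PPN estimated in the Stage 2 subsamples."""
    Qp = 1 - Q
    P = Q * PPV + Qp * (1 - PPN)                   # P = Q*PPV + Q'*PPN'
    Se = Q * PPV / (Q * PPV + Qp * (1 - PPN))
    Sp = Qp * PPN / (Qp * PPN + Q * (1 - PPV))
    return P, Se, Sp

# Hypothetical numbers: 30% of Stage 1 subjects test positive; among the
# Stage 2 subsamples, 40% of T+ and 95% of T- have the expected diagnosis.
print(two_stage_estimates(Q=0.30, PPV=0.40, PPN=0.95))
```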
7 SUMMARY
Individuals evaluating medical tests should consult the STARD guidelines (12,13), which give only general directions for the statistical evaluation of medical tests but detail what issues must be addressed in designing an adequate study of medical tests. Key points in this discussion relate to the definitions of sensitivity and specificity and to the fact that in absence of knowledge about both and P as well, it is difficult to evaluate the clinical value of any test. However, with this knowledge, one can evaluate the quality of any binary test, taking the relative clinical importance of false negatives and false positives into appropriate consideration, choose the optimal cut point for any ordinal test response, and generate a tree using multiple test responses to best discriminate the individuals likely to have the disorder from those not. REFERENCES 1. J. Cohen, Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin. 1968; 70: 213–229. 2. D. A. Bloch and H. C. Kraemer, 2X2 kappa coefficients: measures of agreement or association. Biometrics. 1989; 45: 269–287. 3. H. C. Kraemer, V. S. Periyakoil, and A. Noda, Tutorial in biostatistics: kappa coefficients in medical research. Statistics in Medicine. 2002; 21: 2109–2129. 4. H. C. Kraemer, Reconsidering the odds ratio as a measure of 2X2 association in a population. Statistics in Medicine. 2004; 23(2): 257–270. 5. R. G. Newcombe, A deficiency of the odds ratio as a measure of effect size. Statistics in Medicine. 2006; 25: 4235–4240. 6. D. L. Sackett, Down with odds ratios EvidenceBased Medicine. 1996; 1: 164–166. 7. M. A. Hlatky, D. B. Pryor, F. E. J. Harrell, R. M. Califf, D. B. Mark, and R. A. Rosati, Factors affecting sensitivity and specificity of exercise electrocardiography. Am. J. Med. 1984; 77: 64–71. 8. M. A. Hlatky, D. B. Mark, F. E. J. Harrell, K. L. Lee, R. M. Califf, and D. B. Pryor, Rethinking sensitivity and specificity. Am. J. Cardiology. 1987; 59: 1195–1198.
9. B. J. McNeil, E. Keeler, and S. J. Adelstein, Primer on certain elements of medical decision making. N. Engl. J. Med. 1975; 293: 211–215. 10. H. C. Kraemer, Evaluating Medical Tests: Objective and Quantitative Guidelines. Newbury Park, CA: Sage Publications, 1992. 11. M. A. Winkleby, J. A. Flora, and H. C. Kraemer, A community-based heart disease intervention? Predictors of change. Am. J. Public Health 1994; 84: 767–772. 12. P. M. Bossuyt, J. B. Reitsma, D. E. Bruns, C. A. Gatsonis, P. P. Glasziou, L. M. Irwig, et al., The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Ann. Internal Med. 2003; 138:W1–W12. 13. G. J. Meyer, Guidelines for reporting information in studies of diagnostic test accuracy: the STARD initiative. J. Personality Assessment. 2003; 81(3): 191–193.
CROSS-REFERENCES
AUC
Weighted Kappa
SEQUENTIAL ANALYSIS
T. L. LAI
Stanford University, Stanford, CA, USA

The subject of sequential analysis was initiated by Abraham Wald (39,40) in response to demands for more efficient sampling inspection procedures during World War II.

1 SEQUENTIAL TESTS OF HYPOTHESES

Wald introduced the sequential probability ratio test (SPRT) of a simple null hypothesis H0: f = f0 vs. a simple alternative hypothesis H1: f = f1 based on independent observations X1, X2, . . . having a common density function f. The test stops sampling at stage

N = first n ≥ 1 such that rn ≤ A or rn ≥ B,

where 0 < A < 1 < B, rn = ∏_{i=1}^{n} [f1(Xi)/f0(Xi)], and N is defined to be ∞ if A < rn < B for all n. The SPRT rejects H0 if rN ≥ B and rejects H1 if rN ≤ A. Wald & Wolfowitz (41) showed that it has the following optimality property: among all tests whose expected sample sizes under H0 and H1 are finite and whose type I and type II error probabilities are less than or equal to those of the SPRT, denoted by α and β, respectively, the SPRT minimizes the expected sample sizes under H0 and H1. Moreover, Wald (39) showed that

α ≤ (1 − β)/B,   β ≤ A(1 − α),

and that these inequalities become equalities if rN does not "overshoot" the boundary B or A. Ignoring overshoots, Wald treated these inequalities as equalities and arrived at the approximations A ≈ β/(1 − α), B ≈ (1 − β)/α to determine the boundaries of the SPRT from prescribed error probabilities α and β. Within a few years after Wald's introduction of the subject, it was recognized that sequential hypothesis testing might provide a useful tool in biomedical studies. In particular, making use of Wald's theory of the SPRT, Morton (26) developed a standard for proving genetic linkage. A number of papers appeared during the 1950s on modifications of the SPRT for the design of sequential clinical trials, and an overview of these developments was given in the first edition of Armitage's book (3) in 1960. Subsequently, Armitage et al. (4) proposed a new alternative to the SPRT and its variants. This is the "repeated significance test" (RST), a detailed treatment of which appeared in the second edition of Armitage's book in 1975. The underlying motivation is that, since the strength of evidence in favor of a treatment from a clinical trial is conveniently indicated by the results of a conventional significance test, it is appealing to apply such a test, with nominal significance level α, repeatedly during the trial. However, the overall significance level α*, which is the probability that the nominal significance level is attained at some stage, may be substantially larger than α. For example, suppose that X1, X2, . . . are independent normal with unknown mean µ and known variance σ². Let Sn = X1 + · · · + Xn. The conventional significance test of H0: µ = 0 based on X1, . . . , Xn rejects H0 if |Sn| ≥ aσ√n, where 1 − Φ(a) = α/2. The RST, with a maximum sample size M, stops sampling and rejects H0 at stage

T = first n ≥ 1 with n ≤ M such that |Sn| ≥ aσ√n.

If |Sn| < aσ√n for all 1 ≤ n ≤ M, then the RST does not reject H0. The overall significance level of the test is

α* = Pr_{µ=0}(|Sn| ≥ aσ√n for some 1 ≤ n ≤ M) = 2[1 − Φ(a)] + Σ_{n=2}^{M} pn(a),
where pn(a) = Pr_{µ=0}(|Sn| ≥ aσ√n and |Sj| < aσ√j for 1 ≤ j < n). Armitage et al. (4) developed a recursive numerical integration algorithm to evaluate pn(a). The choice of a is such that the overall significance level α* (instead of the nominal significance level) is equal to some prescribed number. For example, for α* = 0.05 and M = 71, Table 5.5 of (3) gives a =
2.84, which corresponds to a nominal significance level of α = 0.005 = α*/10. Note that a = 2.84 is considerably larger than the value 1.96 associated with a 5% level significance test with fixed sample size M. The price of the smaller expected sample size of the RST is, therefore, a loss of power compared to a fixed sample size test with the same significance level and the same M. Haybittle (17), Peto et al. (28), and Siegmund (33) proposed the following modification of the RST to increase its power. The stopping rule has the same form as the preceding RST, but the rejection region is modified to T ≤ M − 1 or |SM| ≥ cσ√M, where a ≥ c are so chosen that the overall significance level is equal to some prescribed number. In particular, a = ∞ gives the fixed sample size test and a = c gives the RST. Pocock (29) introduced another modification of the RST. Noting that in practice it is difficult to arrange for continuous examination of the data as they accumulate to perform the RST, he considered a "group sequential" version in which Xn above represents an approximately normally distributed statistic of the data in the nth group (instead of the nth observation) and M represents the maximum number of groups. Instead of the square-root boundary aσ√n for |Sn| in the group sequential RST, O'Brien & Fleming (27) proposed to use a constant stopping boundary b that does not change with n. Siegmund (34) gives an extensive treatment of the theory of truncated sequential tests, and in particular of the RST and its modifications. The problem of group sequential testing for the mean of a normal distribution with known variance discussed above serves as a prototype for more complex situations. Note that a group sequential test for a normal mean, assuming M equally sized groups, involves a stopping rule for (S1, . . . , SM), which has a multivariate normal distribution with var(Sn) = nσ² = cov(Si, Sn) for i ≥ n. For more complicated statistics Un in more general situations, one has an asymptotically normal distribution for (U1, . . . , UM) whose covariance matrix is not known in advance and has to be estimated from the
data. Flexible methods to construct stopping boundaries of group sequential tests in these situations have been proposed by Slud & Wei (35), Lan & DeMets (25), Fleming et al. (13), and Jennison & Turnbull (18). For example, consider a clinical trial whose primary objective is to compare survival times (times to failure) between two treatment groups. Patients enter the trial serially and are randomized to either treatment and then followed until they fail or withdraw from the study, or until the trial is terminated. The trial is scheduled to end by a certain time tM, and there are also M − 1 periodic reviews at calendar times t1, . . . , tM−1 prior to tM. Let Ui be the logrank statistic calculated at calendar time ti. Then, under the null hypothesis that the two treatment groups have the same survival distribution, (U1, . . . , UM) is asymptotically normal, as shown by Tsiatis (38), who considered more general rank statistics including the logrank statistic as a special case. It is also shown in (38) how the asymptotic covariances of the Ui can be estimated to perform group sequential testing. For the logrank and many other rank statistics, Ui and Uj − Ui are asymptotically independent for j > i, so one needs only estimate var(Ui) in this case. Whitehead (42) gives a comprehensive overview of these and other methods for sequential hypothesis testing in clinical trials. He considers the case where (U1, . . . , UM) is asymptotically normal under the null hypothesis, with Uj − Ui asymptotically independent of Ui for j > i. Letting Vi denote a consistent estimate of the null variance of Ui for i = 1, . . . , M, he advocates the use of certain triangular stopping boundaries in the (Vi, Ui) plane and has developed a computer package, PEST, for their implementation. These triangular boundaries are associated with the problem of minimizing the maximum expected sample size of sequential tests for the mean µ of a normal distribution subject to constraints on the type I error at µ = 0 and the type II error at some µ ≠ 0, as shown by Lai (20).

2 SEQUENTIAL ESTIMATION

Analysis of the data at the conclusion of a clinical study typically not only permits testing
of the null hypothesis but also provides estimates of parameters associated with the primary and secondary end points. The use of a stopping rule whose distribution depends on these parameters introduces substantial difficulties in constructing valid confidence intervals for the parameters at the conclusion of the study. For example, consider the simple example of independent normal Xi with unknown mean µ and known variance σ². For a sample of fixed size n, the sample mean X̄n is an unbiased estimate of µ and has a normal distribution with variance σ²/n, yielding the classical confidence interval X̄n ± z1−α σ/√n with coverage probability 1 − 2α for µ, where zα is the α-quantile of the standard normal distribution. If n is replaced by a stopping rule T whose distribution depends on µ, then X̄T is typically biased and √T(X̄T − µ)/σ is no longer standard normal but has a distribution that depends on µ. Rosner & Tsiatis (31) proposed the following method to construct a 1 − 2α confidence interval for µ. For every value of µ, find the quantiles uα(µ) and u1−α(µ) of √T(X̄T − µ), i.e.,

Prµ[√T(X̄T − µ) < uα(µ)] = α = Prµ[√T(X̄T − µ) > u1−α(µ)].

These probabilities can be computed by the recursive numerical integration algorithm of Armitage et al. (4) when T is bounded by M. Hence the confidence region {µ : uα(µ) ≤ √T(X̄T − µ) ≤ u1−α(µ)} has coverage probability 1 − 2α. Note that this confidence region reduces to an interval whose end points are found by intersecting the line √T(X̄T − µ) with the curves uα(µ) and u1−α(µ) if there is only one intersection with each curve, which is the case commonly encountered in practice. Siegmund (33) proposed another approach based on ordering the sample space in a certain way, following an earlier proposal of Armitage (2). Chapter 5 of Whitehead's monograph (42) gives a comprehensive treatment of this ordering approach. It also discusses the construction and properties of bias-adjusted estimates following sequential tests. Emerson & Fleming (12) proposed an alternative ordering and used it to construct bias-adjusted estimates and confidence intervals.
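Both phenomena discussed above, the inflation of the overall significance level when a fixed-sample critical value is applied repeatedly and the bias of the sample mean at a data-dependent stopping time, can also be checked quickly by simulation rather than by recursive numerical integration. The sketch below is illustrative only: the boundary constant, the number of looks, the means, and the known σ are arbitrary assumptions, not values from the article.

```python
import random, math

def simulate_rst(a, M, mu=0.0, sigma=1.0, reps=20000, seed=1):
    """Two-sided RST with boundary |S_n| >= a*sigma*sqrt(n) at looks n = 1..M.
    Returns (rejection probability, average of the randomly stopped mean)."""
    rng = random.Random(seed)
    rejections, mean_of_stopped_mean = 0, 0.0
    for _ in range(reps):
        s, t, rejected = 0.0, M, False          # t is the stopping stage T
        for n in range(1, M + 1):
            s += rng.gauss(mu, sigma)
            if abs(s) >= a * sigma * math.sqrt(n):
                rejected, t = True, n
                break
        rejections += rejected
        mean_of_stopped_mean += (s / t) / reps
    return rejections / reps, mean_of_stopped_mean

# With a = 1.96 (the fixed-sample two-sided 5% value) and M = 5 looks,
# the estimated overall significance level is well above 0.05.
print(simulate_rst(a=1.96, M=5, mu=0.0))
# Under mu = 0.3, the randomly stopped mean tends to overestimate mu.
print(simulate_rst(a=1.96, M=5, mu=0.3))
```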
A considerably simpler class of sequential estimation problems deals with estimation of a parameter θ with prescribed accuracy using a randomly stopped statistic θ̂N, whose stopping rule N is targeted towards achieving the prescribed accuracy. In these problems, Anscombe's (1) central limit theorem for randomly stopped sums typically yields adequate normal approximations for the distribution of √N(θ̂N − θ), which can be used to construct fixed-width confidence intervals for θ. For example, let X1, X2, . . . be independent random variables from a population with mean µ and variance σ². The variance of the estimate X̄n of µ is σ²/n, and an approximate 1 − 2α confidence interval for µ is X̄n ± z1−α σ/√n, which can be made to have width 2d by choosing n to be the smallest integer ≥ (z1−α σ/d)², assuming σ to be known. When σ is unknown, Chow & Robbins (7) proposed to replace it by the sample variance at successive stages, leading to the stopping rule

N = first n ≥ m such that nd²/z²_{1−α} ≥ (n − 1)^{−1} Σ_{i=1}^{n} (Xi − X̄n)² + n^{−1}.
The confidence interval is taken to be X̄N ± d. This has approximate coverage probability 1 − 2α when d is small, since √N(X̄N − µ)/σ has a limiting standard normal distribution as d → 0 by Anscombe's theorem. Schmidt et al. (32) used this procedure to construct fixed-width confidence intervals for the concentrations of enzymes in the normal human pancreas. Two-stage and three-stage analogs of this fully sequential procedure were developed by Stein (37) and Hall (16).
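A minimal sketch of the Chow–Robbins rule follows (illustrative only; the data-generating distribution, the pilot sample size m, and the half-width d are arbitrary assumptions):

```python
import random, statistics

def chow_robbins(draw, d, z, m=5):
    """Fixed-width confidence interval X-bar_N +/- d for a mean.

    draw: function returning one observation; d: half-width; z: z_{1-alpha}.
    Stops at the first n >= m with n*d^2/z^2 >= sample variance + 1/n.
    """
    xs = [draw() for _ in range(m)]
    while True:
        n = len(xs)
        s2 = statistics.variance(xs)     # (n-1)^{-1} * sum of (X_i - X-bar_n)^2
        if n * d * d / (z * z) >= s2 + 1.0 / n:
            xbar = statistics.fmean(xs)
            return n, (xbar - d, xbar + d)
        xs.append(draw())

rng = random.Random(7)
# Hypothetical population: normal with mean 10 and standard deviation 2.
N, interval = chow_robbins(lambda: rng.gauss(10, 2), d=0.5, z=1.96)
print(N, interval)
```

With these assumed settings, the terminal sample size should come out near (z1−α σ/d)², roughly 60 observations, as the fixed-width theory suggests.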
3 ADAPTIVE ALLOCATION, SEQUENTIAL DESIGN, AND DECISION THEORY

Other topics in the field of sequential analysis of interest to biomedical studies are adaptive treatment allocation and sequential design of experiments. The "up-and-down" (staircase) method in bioassay and dosage determination is an example of sequential experimentation. A traditional nonsequential
method for performing a bioassay experiment is to test a prescribed number of animals at each of several fixed dose levels. In the up-and-down method, one chooses a series of test levels with equal spacing (on an appropriate scale, usually log dose) between doses, and carries out a series of trials using the following rule: use the next higher dose following a negative response and use the next lower dose following a positive response. Details of implementation of the design and applications to estimation of the LD50 ("lethal dose 50," the dose producing a response in 50% of the subjects) are given by Dixon & Mood (10) and Dixon (9). Stochastic approximation, introduced by Robbins & Monro (30), is another example of sequential experimentation. In the context of quantal bioassay, Cochran & Davis (8) considered the following version of the Robbins–Monro scheme. To start the experiment, an initial guess x_1 of the LD50 is made and m animals are given dose x_1. Let c > 0. For n ≥ 1, let p̂_n be the observed proportion of deaths in the group of m animals assigned dose x_n, and define x_{n+1} = x_n − c n^{−1}(p̂_n − 1/2) to be the dose level at which another group of m animals is tested. The basic idea behind stochastic approximation is to use the recursive scheme x_{n+1} = x_n − a_n(Y_n − h) to find the solution θ of the equation M(θ) = h, in which M(x) is not observable and all that can be observed for each x is a random variable Y(x) with E[Y(x)] = M(x), and in which Y_n = Y(x_n). Thus, in the quantal bioassay application above, h = 1/2 and Y_n = p̂_n. Under certain regularity conditions, the best choice of a_n is (βn)^{−1}, where β is the derivative of M at θ. Adaptive stochastic approximation schemes that replace the unknown β in the optimal choice a_n = (βn)^{−1} by simple recursive estimates b_n have been proposed and analyzed by Lai & Robbins (24) and Frees & Ruppert (14).
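The Cochran & Davis version of the Robbins–Monro scheme is straightforward to simulate. The sketch below assumes a hypothetical logistic dose-response curve; the slope, starting dose, group size m, step constant c, and number of stages are all illustrative choices, not values taken from the references.

```python
import numpy as np

rng = np.random.default_rng(2)

def true_response_prob(dose, ld50=0.0, slope=1.5):
    """Hypothetical dose-response curve (logistic on the log-dose scale)."""
    return 1.0 / (1.0 + np.exp(-slope * (dose - ld50)))

x = 1.0          # initial guess x_1 of the LD50 (log dose)
c, m = 2.0, 5    # step constant and number of animals per group
n_stages = 40

for n in range(1, n_stages + 1):
    deaths = rng.binomial(m, true_response_prob(x))
    p_hat = deaths / m
    x = x - (c / n) * (p_hat - 0.5)   # x_{n+1} = x_n - c n^{-1} (p_hat_n - 1/2)

print("Robbins-Monro estimate of the LD50 (log dose):", round(x, 3))
```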
In classical fixed-sample decision theory, one has a parameter space containing all possible values of the unknown parameter θ, an action space consisting of all possible actions a, and a loss function L(θ, a) representing the loss when the true parameter is θ and action a is taken. In sequential decision theory, one has a sequence of actions a_1, a_2, . . . and loss L_n(θ, a_n) at stage n. For sequential experimental design problems of the type described above, a_n is the choice of the design level x_n at stage n. For sequential hypothesis testing (or estimation) problems, a_n represents whether stopping occurs at stage n and also acceptance of the null or alternative hypothesis (or the estimate of the unknown parameter) when stopping indeed occurs at stage n. In this case, it is more convenient to represent the action sequence (a_1, a_2, . . . ) by a stopping rule denoting when stopping occurs and a terminal decision rule denoting the action taken upon stopping. Given successive observations Z_1, Z_2, . . . , whose joint distribution depends on θ, a finite-horizon sequential decision problem, with horizon M, is to choose action d_n = d_n(Z_1, . . . , Z_n) at stage n on the basis of the current and past observations, for 1 ≤ n ≤ M. The overall risk of (d_1, . . . , d_M) is R(θ) = E_θ[Σ_{n=1}^{M} L_n(θ, d_n)]. In particular, putting a prior distribution G on the parameter space, one can consider the Bayes rule that minimizes ∫ R(θ) dG(θ). The solution can be found by the backward induction algorithm of dynamic programming. Applications of the algorithm to determine optimal stopping boundaries of group sequential tests have been given by Berry & Ho (5) and Eales & Jennison (11). Chernoff's monograph (6) gives a comprehensive treatment of optimal stopping problems in sequential analysis. Spiegelhalter et al. (36) discuss Bayesian approaches to monitoring clinical trials. The handbook edited by Ghosh & Sen (15) gives extensive references and survey articles on a wide variety of topics in sequential analysis, including those covered in the present brief review, which is oriented towards biomedical applications. The monograph by Jennison & Turnbull (19) on group sequential tests and the review articles by Lai (21–23) describe important developments in stochastic approximation, interim and terminal analyses of clinical trials with failure-time endpoints, and other areas of sequential analysis following the publication of the First Edition and provide updated lists of references.
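The backward induction computation mentioned above can be illustrated on a small, self-contained toy problem: a Bayes sequential test of two simple hypotheses about a Bernoulli parameter with 0–1 terminal loss and a per-observation sampling cost. The hypotheses, prior, horizon, and cost below are arbitrary illustrative choices, not the normal-endpoint boundary calculations of Berry & Ho or Eales & Jennison.

```python
# Toy Bayes sequential test of H0: p = 0.4 vs H1: p = 0.6 for Bernoulli data,
# solved by backward induction over the reachable states (n, s).
p0, p1 = 0.4, 0.6
prior1 = 0.5          # prior probability of H1
M = 20                # horizon (maximum number of observations)
cost = 0.01           # sampling cost per observation
loss_wrong = 1.0      # 0-1 loss for accepting the wrong hypothesis

def posterior1(n, s):
    """Posterior probability of H1 after n observations with s successes."""
    l0 = (1 - prior1) * p0**s * (1 - p0)**(n - s)
    l1 = prior1 * p1**s * (1 - p1)**(n - s)
    return l1 / (l0 + l1)

# value[n][s] = minimal expected additional loss at state (n, s)
value = [[0.0] * (n + 1) for n in range(M + 1)]
action = [[None] * (n + 1) for n in range(M + 1)]

for n in range(M, -1, -1):
    for s in range(n + 1):
        pi1 = posterior1(n, s)
        stop_loss = loss_wrong * min(pi1, 1 - pi1)   # accept the more probable hypothesis
        if n == M:
            value[n][s], action[n][s] = stop_loss, "stop"
            continue
        p_next = pi1 * p1 + (1 - pi1) * p0           # predictive probability of a success
        cont_loss = cost + p_next * value[n + 1][s + 1] + (1 - p_next) * value[n + 1][s]
        if stop_loss <= cont_loss:
            value[n][s], action[n][s] = stop_loss, "stop"
        else:
            value[n][s], action[n][s] = cont_loss, "continue"

print("Bayes risk at the start of the trial:", round(value[0][0], 4))
print("optimal action after 5 observations, by number of successes:", action[5])
```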
REFERENCES

1. Anscombe, F. J. (1952). Large sample theory of sequential estimation, Proceedings of the Cambridge Philosophical Society 48, 600–607. 2. Armitage, P. (1958). Numerical studies in the sequential estimation of a binomial parameter, Biometrika 45, 1–15. 3. Armitage, P. (1975). Sequential Medical Trials, 2nd Ed. Blackwell, Oxford. 4. Armitage, P., McPherson, C. K. & Rowe, B. C. (1969). Repeated significance tests on accumulating data, Journal of the Royal Statistical Society, Series A 132, 235–244. 5. Berry, D. A. & Ho, C. H. (1988). One-sided sequential stopping boundaries for clinical trials: a decision-theoretic approach, Biometrics 44, 219–227. 6. Chernoff, H. (1972). Sequential Analysis and Optimal Design. SIAM, Philadelphia. 7. Chow, Y. S. & Robbins, H. (1965). On the asymptotic theory of fixed width sequential confidence intervals for the mean, Annals of Mathematical Statistics 36, 457–462.
16. Hall, P. (1981). Asymptotic theory of triple sampling for sequential estimation of a mean, Annals of Statistics 9, 1229–1238. 17. Haybittle, J. L. (1971). Repeated assessments of results in clinical trials of cancer treatment, British Journal of Radiology 44, 793–797. 18. Jennison, C. & Turnbull, B. W. (1989). Interim analysis: the repeated confidence interval approach (with discussion), Journal of the Royal Statistical Society, Series B 51, 306–361. 19. Jennison, C. & Turnbull, B. W. (2001). Group Sequential Methods with Applications to Clinical Trials. Chapman and Hall & CRC, Boca Raton & London. 20. Lai, T. L. (1973). Optimal stopping and sequential tests which minimize the maximum expected sample size. Annals of Statistics 1, 659–673. 21. Lai, T. L. (2001). Sequential analysis: Some classical problems and new challenges. Statistica Sinica 11, 303–408. 22. Lai, T. L. (2003). Stochastic approximation. Annals of Statistics 31, 391–406.
8. Cochran, W. G. & Davis, M. (1965). The Robbins–Monro method for estimating the median lethal dose, Journal of the Royal Statistical Society, Series B 27, 28–44.
23. Lai, T. L. (2003). Interim and terminal analyses of clinical trials with failure-time endpoints and related group sequential designs. In Applications of Sequential Methodologies, N. Mukhopadhyay, S. Datta & S. Chattopadhyay, eds. Marcel Dekker, New York, in press.
9. Dixon, W. J. (1965). The up-and-down method for small samples, Journal of the American Statistical Association 60, 967–978.
24. Lai, T. L. & Robbins, H. (1979). Adaptive design and stochastic approximation, Annals of Statistics 7, 1196–1221.
10. Dixon, W. J. & Mood, A. M. (1948). A method for obtaining and analyzing sensitivity data, Journal of the American Statistical Association 43, 109–126.
25. Lan, K. K. G. & DeMets, D. L. (1983). Discrete sequential boundaries for clinical trials, Biometrika 70, 659–663.
11. Eales, J. D. & Jennison, C. (1992). An improved method for deriving optimal one-sided group sequential tests, Biometrika 79, 13–24. 12. Emerson, S. S. & Fleming, T. R. (1990). Parameter estimation following group sequential hypothesis testing, Biometrika 77, 875–892. 13. Fleming, T. R., Harrington, D. P. & O'Brien, P. C. (1984). Designs for group sequential tests, Controlled Clinical Trials 5, 348–361.
26. Morton, N. E. (1955). Sequential tests for the detection of linkage, American Journal of Human Genetics 7, 277–318. 27. O'Brien, P. C. & Fleming, T. R. (1979). A multiple testing procedure for clinical trials, Biometrics 35, 549–556. 28. Peto, R., Pike, M. C., Armitage, P., Breslow, N. E., Cox, D. R., Howard, S. V., Mantel, N., McPherson, K., Peto, J. & Smith, P. G. (1976). Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design, British Journal of Cancer 34, 585–612.
14. Frees, E. W. & Ruppert, D. (1990). Estimation following a Robbins–Monro designed experiment, Journal of the American Statistical Association 85, 1123–1129.
29. Pocock, S. J. (1977). Group sequential methods in the design and analysis of clinical trials, Biometrika 64, 191–199.
15. Ghosh, B. K. & Sen, P. K. (1991). Handbook of Sequential Analysis. Marcel Dekker, New York.
30. Robbins, H. & Monro, S. (1951). A stochastic approximation method, Annals of Mathematical Statistics 22, 400–407.
31. Rosner, G. L. & Tsiatis, A. A. (1988). Exact confidence intervals following sequential tests, Biometrika 75, 723–729. 32. Schmidt, B., Cornée, J. & Delachaume-Salem, E. (1970). Application de procédures statistiques séquentielles à l'étude des concentrations enzymatiques du suc pancréatique humain normal, Comptes Rendus des Séances de la Société de Biologie (Paris) 164, 1813–1818. 33. Siegmund, D. (1978). Estimation following sequential tests, Biometrika 65, 341–349. 34. Siegmund, D. (1985). Sequential Analysis: Tests and Confidence Intervals. Springer-Verlag, New York. 35. Slud, E. V. & Wei, L. J. (1982). Two-sample repeated significance tests based on the modified Wilcoxon statistic, Journal of the American Statistical Association 77, 862–868. 36. Spiegelhalter, D. J., Freedman, L. S. & Parmar, M. K. B. (1994). Bayesian approaches to randomized trials (with discussion), Journal of the Royal Statistical Society, Series A 157, 357–416. 37. Stein, C. (1945). A two-sample test for a linear hypothesis whose power is independent of the
variance, Annals of Mathematical Statistics 16, 243–258. 38. Tsiatis, A. A. (1982). Repeated significance testing for a general class of statistics used in censored survival analysis, Journal of the American Statistical Association 77, 855–861. 39. Wald, A. (1945). Sequential tests of statistical hypotheses, Annals of Mathematical Statistics 16, 117–186. 40. Wald, A. (1947). Sequential Analysis. Wiley, New York. 41. Wald, A. & Wolfowitz, J. (1948). Optimum character of the sequential probability ratio test, Annals of Mathematical Statistics 19, 326–339. 42. Whitehead, J. (1992). The Design and Analysis of Sequential Clinical Trials, 2nd Ed. Ellis Horwood, Chichester.
CROSS-REFERENCES Wald’s Identity
SERIOUS ADVERSE EVENT (SAE) A Serious Adverse Event (SAE) is classified as any adverse drug experience occurring at any dose that results in any of the following outcomes: death, a life-threatening adverse drug experience, in-patient hospitalization or prolongation of existing hospitalization, a persistent or significant disability/incapacity, or a congenital anomaly/birth defect. All serious adverse events should be reported immediately to the sponsor except for those SAEs that the protocol or other document (e.g., Investigator’s Brochure) identifies as not needing immediate reporting. The immediate reports should be followed promptly by detailed, written reports. The immediate and follow-up reports should identify subjects by unique code numbers assigned to the trial subjects rather than by the subjects’ names, personal identification numbers, and/or addresses. The investigator should also comply with the applicable regulatory requirement(s) related to the reporting of unexpected serious adverse drug reactions to the regulatory authority(ies) and to the IRB (Institutional Review Board)/IEC (Independent Ethics Committee).
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/present/raps10-2002/judyracoosin/tsld013.htm) by Ralph D'Agostino and Sarah Karl.
Wiley Encyclopedia of Clinical Trials, Copyright © 2008 John Wiley & Sons, Inc.
SIMPLE RANDOMIZATION
allocated to one treatment arm or another is determined by the desired allocation ratio (e.g., 1:1, 1:2, or 3:2:1), and this allocation process depends on chance rather than judgment. For simplicity, we will assume a two-arm trial with an equal allocation ratio (1:1) in the following discussion. However, it should be noted that the ideas discussed can be extended to the more general case of multi-arm trials or unequal allocation ratios.
INNA T. PEREVOZSKAYA Merck & Co., Inc. Rahway, New Jersey
The objective of any clinical trial is to provide an unbiased estimate of treatment effect, usually measured by comparison of the two treatment groups (e.g., experimental treatment versus conventional treatment) with respect to the endpoint of interest. Although there are several potential sources of bias for such an estimate, this article focuses on one in particular, namely, lack of comparability of the treatment groups with respect to known or unknown prognostic factors (covariates). Randomization mitigates the risk of such an imbalance with respect to important covariates and thus addresses the source of bias arising from patient selection. However, randomization alone is not sufficient to completely eliminate bias. To claim that a study's results are unbiased, additional measures may be required to alleviate concerns about other potential sources of bias such as methods of outcome assessment and missing data. (See the article on bias for more details on these additional sources.) 1
2
WHY IS RANDOMIZATION NEEDED?
As previously mentioned, randomization is a crucial part of the study design, intended to address the issue of possible bias in clinical trials. Because it is so closely related to the concept of bias, it is impossible to introduce the methods of randomization without at least briefly mentioning the overall problem of bias and its sources. Clinical trials differ substantially from laboratory science in that the carefully controlled and monitored environment of the latter is not always achievable. Instead of being precisely controlled, clinical trials are complex, collaborative efforts involving people with different job skills and roles (physicians, patients, study coordinators, data managers, and statisticians). To claim the validity of findings at the end of a trial, one must ensure that key elements of the study design reflect adequate preparation and thought to address potential factors that may lead to underestimating or overestimating the true effect of the innovative treatment (i.e., biasing the study results). There are several potential sources of bias:
CONCEPT OF RANDOMIZATION
Randomization is a method based on chance by which clinical trial participants are assigned to study treatment groups. Randomization minimizes the differences among groups by balancing the distribution of patients with particular characteristics (prognostic factors) among all the trial arms. In the case of a two-arm trial with equal allocation to both treatment groups, each study participant has a fair and equal chance of receiving either the new medical intervention being studied (by being placed in the active therapy group) or the existing or "control" intervention (by being placed in the control therapy group). This principle can be easily extended to the case of multi-arm trials or to trials where unequal allocation to different arms is desired. In other words, randomization implies that for any given participant the chance of being
1. Lack of precision in outcome assessment caused by absence of appropriate standards applicable to all patients (e.g., key measurements performed by different laboratories) or subjectivity of evaluation (e.g., radiology examination data). 2. Missing outcome data, which may result in loss of valuable information when loss to follow-up is related to the treatment effect.
Wiley Encyclopedia of Clinical Trials, Copyright 2007 John Wiley & Sons, Inc.
3. Accidental bias in the form of an imbalance among the treatment groups with respect to relevant covariates (such as gender, age, or race) or prognostic factors. 4. Selection bias in the form of study physicians using their own judgment in treatment assignment based on a patient's specific health conditions. All four sources present very broad problems, each of them deserving a separate article. Issues 1 and 2 are discussed in the articles on blinding, standard operating procedures, and missing values. Randomization in general is designed to address issues 3 and 4. Simple randomization in particular primarily targets selection bias. Although accidental bias is also mitigated to some extent by simple randomization, its success varies greatly depending on the type and size of the trial. If one is specifically concerned about balancing for covariates, there are other randomization methods that are more tailored to address known covariate imbalances (see covariate-adaptive randomization, stratified randomization). 3
METHODS: SIMPLE RANDOMIZATION
Up until now, we have discussed the importance of randomization in general in the process of designing a clinical trial. However, that still leaves open the question of how it is actually accomplished in practice. There are a number of methods for randomization, and the list starts with simple randomization. Simple randomization is, as its name suggests, the simplest of all randomization methods. Sometimes it is also referred to as "complete randomization" because it can be viewed as a series of coin tosses: it generates a single sequence of independent assignments to either the experimental treatment (if the coin lands on "heads") or the control treatment (if the coin lands on "tails"). No further attempts to modify this sequence or to control its behavior are made. Mathematically, simple randomization can be characterized as follows: if N patients need to be allocated in the trial, equally distributed between treatment groups A
and B, then simple randomization is a sequence of independent, identically distributed Bernoulli random variables (coin tosses) with Pr(treatment assigned is "A") = 0.5 for each subject. In the long run, half of the subjects are randomly allocated to treatment A (the experimental treatment) while the other half are assigned to treatment B (the control). The "randomness" of this process is what ensures that particular prognostic factors (such as medical history) play no role in these assignments, so that the treatment groups are likely to be comparable with respect to important covariates, including those that are unknown at the time of trial planning. It is important to note that, under this scheme, the random assignments to treatments A or B are statistically independent. In other, more complicated randomization methods, this is not necessarily the case, and the nature of the dependence between treatment assignments (the variance–covariance structure) gets progressively more complicated as one moves from simple randomization to more sophisticated methods: restricted randomization, covariate-adaptive randomization, and response-adaptive randomization. A comprehensive description of these four classes of randomization methods in the context of practical application to clinical trials, along with their statistical properties, can be found in Rosenberger and Lachin (1). The goal of simple randomization is to balance the treatment assignments between the two groups and to produce general balance between these two groups with respect to known and unknown prognostic factors. It is important to understand that even though simple randomization promotes such comparability, it does not guarantee it. In other words, there is always a possibility that some imbalance (due to chance) will exist in the trial patient sample with respect to select prognostic factors. This occasionally happens even in relatively large phase III trials (N > 100 per group), especially if one tries to examine an extensive list of such covariates.
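As a concrete illustration of the coin-tossing scheme described above, the following sketch (Python; the seed and sample size are arbitrary) generates a simple randomization list for a 1:1 two-arm trial:

```python
import numpy as np

rng = np.random.default_rng(2024)   # fixed seed only for a reproducible illustration

def simple_randomization(n_subjects, p_treatment_a=0.5):
    """Independent 'coin tosses': each subject gets A with probability p, else B."""
    coins = rng.random(n_subjects) < p_treatment_a
    return ["A" if c else "B" for c in coins]

schedule = simple_randomization(20)
print(schedule)
print("allocated to A:", schedule.count("A"), "of", len(schedule))
```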
Therefore, in practice, if some known covariate is considered of crucial importance for the analysis and there is a reason to believe that randomization alone may not take care of balancing treatment assignments with respect to this covariate (e.g., because of insufficient sample size or a large number of centers), a more proactive approach than simple randomization is usually taken. All randomization methods specifically designed to balance not only treatment assignments but also known covariates have some sort of controlling mechanism for such balance built into their algorithm. These methods therefore impose more constraints on the generated random sequence to meet the balance requirements. The most widely used method that proactively balances for covariates is stratified randomization, which can be roughly described as breaking the patient population into multiple levels of the covariate of interest (strata) and generating a sequence of random treatment assignments within each stratum. Stratified randomization is discussed in detail later in this article. Another well-known family of methods for addressing the same problem (in cases when stratified randomization does not work well) is covariate-adaptive randomization; see the articles on adaptive randomization and minimization for details.

3.1 Balancing Properties of Simple Randomization

Simple randomization is designed to balance treatment assignments equally between two treatment groups. This property is based on the asymptotic behavior of a series of coin tosses and therefore requires large sample sizes to hold exactly. In practice, however, sample sizes vary significantly from trial to trial and can be insufficient for this property to hold. Rosenberger and Lachin (1) examined the balancing properties of simple randomization in comparison with other (restricted) randomization methods designed to enforce 1:1 allocation. They compared simple randomization to three other restricted randomization methods via simulations (10,000 replications with sample size N = 50). The results showed that, on average, the simple randomization procedure yields allocations very close to 1:1, with the variance of the allocation ratio slightly larger for simple randomization than for the restricted randomization methods. They also pointed out that for small sample sizes this procedure has a nonnegligible probability of some imbalance and
a small probability of severe imbalances. For example, if N = 50, there is a 5% chance of an imbalance of 36.1% versus 63.9%. The potential for imbalance is typically viewed as a disadvantage of the simple randomization procedure compared with other (restricted) randomization methods with better balancing properties. However, this is debatable, and imbalances of such magnitude are not a major concern to many clinical trial investigators. The reason is that, even though such imbalances potentially diminish the precision of estimates and the power of statistical tests, this loss of precision is usually small (for moderate imbalances). Using the example from Rosenberger and Lachin (1) for N = 50, the probability of observing an imbalance more extreme than 36.1% versus 63.9% is less than 0.05 (and would decrease further as the sample size increases). According to another calculation in Rosenberger and Lachin (1) examining loss of power across different ranges of imbalance, such loss becomes noticeable only if the imbalance is more extreme than 30% versus 70%, which is unlikely to happen according to the probability numbers provided above. Of course, one has to keep in mind that this particular calculation relies on a specific sample size as well as assumed treatment effect and variability estimates; therefore, it will not hold true in all possible scenarios. However, it provides good insight into what is considered a "reasonable" imbalance in treatment allocations and its effect on the power of statistical tests. For practical applications, though, if there is a concern about treatment imbalances, it is always wise to examine the relationship between loss of power and the magnitude of such imbalances based on the assumptions applicable to the particular trial. Thus, simple randomization does carry a small risk of treatment imbalance, which may potentially reduce the power of statistical comparisons, but this is not the primary reason for its limited use in practice.
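The order of magnitude of these imbalance probabilities is easy to check directly from the binomial distribution. The short calculation below is a rough check of the figures quoted from Rosenberger and Lachin, not a reproduction of their exact calculation: it gives the two-sided probability of an allocation at least as lopsided as a given split when N = 50 subjects are randomized 1:1.

```python
from scipy.stats import binom

n = 50   # subjects per trial under simple 1:1 randomization
for frac in (0.40, 0.36, 0.30):
    k = int(frac * n)        # size of the smaller arm
    # two-sided probability that either arm ends up with k or fewer subjects
    p = binom.cdf(k, n, 0.5) + binom.sf(n - k - 1, n, 0.5)
    print(f"P(smaller arm <= {k}/{n}, i.e. {frac:.0%} vs {1 - frac:.0%}): {p:.3f}")
```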
What is more likely to make simple randomization less popular is the logistics of drug distribution and supply: blocked randomization (one of the restricted randomization methods, designed to enforce an exact 1:1 allocation ratio) has become a standard in modern clinical trials because of its easy adaptability to multicenter trials and the common procedures used for drug shipments. (See the block randomization and blocking articles for a detailed description of the procedure and its properties.)

4 ADVANTAGES AND DISADVANTAGES OF RANDOMIZATION

The goal of randomization is to produce comparable groups in terms of general participant characteristics such as age or gender and other key factors that affect the probable course the disease would take (e.g., smoking history or cardiovascular risk status). In this way, it helps avoid selection bias: all treatment groups are as similar as possible at the start of the study. One can say that patients in each treatment group make up samples drawn at random from the same patient population. If, at the end of the study, one group has a better outcome than the other, the investigators will be able to conclude with some confidence that one treatment (or intervention) is better than the other. This property is the most important advantage of randomized clinical trials over observational (nonrandomized) or case-control (retrospective) studies, where such comparability is attempted only for a few known covariates and, even for those, can rarely be achieved. Another important advantage of randomization is that it provides a probabilistic basis for statistical inference from the observed results in reference to all possible results. It guarantees the validity of statistical tests and, at the end of the study, enables one to answer the question of whether a treatment difference of the observed magnitude is likely to have arisen just by chance or is more likely to reflect the true state of nature. Finally, in combination with blinding, randomization ensures objective and independent evaluation of the patients' condition and the study outcomes, thus contributing to the overall unbiasedness of the study findings. All of the above has made the randomized, controlled, prospectively designed trial the most reliable and impartial method of determining which medical interventions work best in the modern world of clinical trials.
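The re-randomization (permutation) test is the most direct expression of this probabilistic basis for inference: the observed treatment difference is compared with its distribution over the allocations the randomization scheme could have produced. The sketch below uses entirely hypothetical outcome values for a small 1:1 trial.

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical outcomes from a small two-arm trial (illustrative numbers only)
outcomes = np.array([4.1, 5.0, 3.8, 6.2, 5.5, 4.9, 6.8, 5.7, 4.4, 6.1])
arm = np.array(["A"] * 5 + ["B"] * 5)          # the allocation actually used

observed = outcomes[arm == "B"].mean() - outcomes[arm == "A"].mean()

# Re-randomization distribution of the difference under "no treatment effect"
n_perm, count = 10000, 0
for _ in range(n_perm):
    perm = rng.permutation(arm)
    diff = outcomes[perm == "B"].mean() - outcomes[perm == "A"].mean()
    count += abs(diff) >= abs(observed)

print("observed difference:", round(observed, 3))
print("two-sided randomization p-value:", count / n_perm)
```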
However, there are a number of disadvantages that preclude using randomization in certain medical investigations. It may create recruitment problems, especially in placebo-controlled trials, because patients may be reluctant to enroll in a trial if they know there is a chance of not receiving any treatment. For the same reason, the patient sample drawn as a result of enrollment into a randomized clinical trial may not accurately represent the "true" patient population, as some believe that a patient's health condition or lifestyle is related to his or her willingness to participate in a randomized clinical trial. Another disadvantage is that the concept of randomization poses major ethical dilemmas in certain situations. Many people have argued that probability should not be used as a decision tool in the field of medicine and that physicians should be the only ones deciding what treatment should be given to a patient. Such an approach, however, would inevitably lead to selection bias during the trial, making the treatment groups less comparable and therefore precluding a fair comparison of the experimental therapy with the control treatment. What makes clinical trials different from regular medical practice is the state of equipoise: until there is solid scientific evidence that the therapy is safe and effective (or effective but unsafe, or simply ineffective), the physician's uncertainty as to whether the experimental therapy is actually better for the patient is justified. Most people agree that in a state of "true equipoise" the use of randomization in clinical trials is justifiable. It is the definition of "true" equipoise that sometimes presents ethical challenges.

5 OTHER RANDOMIZATION METHODS

Simple randomization only starts the list of the many randomization methods currently available for use in clinical trials. Restricted randomization is a modification of the simple randomization procedure designed to enforce the target allocation of subjects to treatments and to provide more balanced treatment assignments. The most commonly used examples of restricted randomization are
block randomization (also referred to as the permuted block design) and stratified randomization. We will discuss stratified randomization in detail in Section 6. Covariate-adaptive (also referred to as baseline-adaptive) randomization is employed when it is desired to balance the study on a large number of important covariates. In such cases, stratified randomization typically does not perform well. Covariate-adaptive randomization takes further steps in balancing on baseline covariates dynamically, based on the real-time baseline composition of the treatment groups at the time of allocation. Finally, response-adaptive randomization is used in situations when true equipoise cannot be guaranteed, that is, when some scientific information about the relative performance of the experimental versus the control treatment is available. In such situations, it is not desirable, typically for ethical reasons, to allocate an equal number of patients to each treatment. The final allocation ratio is not fixed in advance but rather is determined dynamically, based on the accumulating treatment effect information. Similar to covariate-adaptive randomization, the procedure dynamically updates the allocation probabilities to favor the better performing treatment. This is the most complicated and the most controversial type of all randomization procedures. 6
STRATIFIED RANDOMIZATION
Stratified randomization holds a place of its own in the list of randomization methods because it adds another level of complexity to the balancing properties, compared with simple randomization and most other restricted randomization methods. All randomization procedures in clinical trials are designed to mitigate the source of bias associated with imbalance of the treatment groups with respect to important prognostic factors (covariates), both known and unknown. Simple randomization promotes balance with respect to such covariates across treatment arms as well as balance with respect to the treatment assignments themselves. However, simple randomization
alone is not sufficient to ensure such balance. In fact, with any randomization process, if a sufficient number of baseline covariates are examined, some imbalance will be observed with respect to at least one prognostic factor. In certain situations (for example, if one of the prognostic factors is known to be a critically important predictor of the outcome of interest), one would want to balance proactively for that factor within treatments rather than rely on chance to accomplish that. In such situations, stratified randomization provides substantial benefits.

6.1 What Is Stratification?

Stratified randomization (also known as stratified allocation) is a grouping of study subjects by level of a prognostic factor (stratum) and subsequent randomization of subjects within each of the strata. Examples of stratification factors commonly occurring in clinical trials include study center, basic demographic characteristics (e.g., age group, race, or gender), or important medical baseline measurements (e.g., blood pressure or cholesterol levels in cardiovascular trials). Stratified randomization is an important member of the broader class of randomization methods that constrain the allocation sequence, a class that also includes other restricted randomization methods, covariate-adaptive randomization, and response-adaptive randomization. The common feature of these methods is that they all impose some sort of constraint on the sequence of randomized treatment assignments to force it to yield a desirable balancing property. Most randomization methods are concerned with proactively balancing only one variable: the treatment assignments. Stratified randomization takes this further and attempts to balance for different levels of select known prognostic factors within each treatment arm while maintaining the balance across the treatment arms as well. For example, in many trials, gender is considered an important prognostic factor. If the patient population consists of 80% females and 20% males, one would want to have approximately the same percentages of males and females in each of the treatment groups. If N = 100 subjects are available
to be randomized in such a trial, a stratified randomization would (ideally) yield the allocations shown in Table 1.

Table 1. Hypothetical Trial Example of Stratified Randomization Allocation

              Males   Females   Total
Treatment A      10        40      50
Treatment B      10        40      50
Total            20        80     100

It is important to differentiate between stratified randomization (also known as pre-stratification) and stratified-adjusted analysis (also known as post-stratification) because the term "stratification" is often used interchangeably for both. Post-stratification refers to the method of analysis that groups patients into strata corresponding to certain covariates after the randomization has taken place (i.e., at the analysis stage); the analysis then uses the computed stratum variable in modeling and/or subgroup analyses. As a result, a post-stratified analysis yields estimates adjusted for important covariates even though these covariates were originally ignored during the randomization. When planning a trial where a specific covariate is known to be an important predictor of the response, one always has a choice between proactively implementing a design that stratifies by this variable (pre-stratification) followed by a stratified analysis, and performing only a post-stratified analysis. There is no clear answer to this dilemma, and the choice usually depends on the overall efficiency of stratified randomization, which in turn is driven by trial size, number of covariates, and overall logistics. The relative efficiencies of stratified inference combined with stratified randomization versus stratified inference alone are discussed in detail in Grizzle (2) and Matts and McHugh (3). The general rule is that stratified randomization should be employed only for covariates considered absolutely necessary; trials with smaller sample sizes, where imbalances with respect to such covariates are more likely to happen by chance, may benefit more from stratified randomization than larger trials. It is also strongly recommended
to stratify by study center in multicenter trials because study centers may be a major source of patient heterogeneity.

6.2 How Stratification Is Accomplished

The most intuitive way to implement stratified randomization is to prepare a separate randomization schedule for each level of the factor (or combination of factors). In the previous example, two separate randomization schedules would be generated, one for male patients and another for female patients. If an additional factor, say cardiovascular risk category (low versus high), is added to the stratification, then four randomization schedules would be required (male and low risk, male and high risk, female and low risk, and female and high risk). In practice, however, enforcing balance within each combination of strata is rarely required if two or more stratification factors are involved, because the levels of the factors, not their combinations, are thought to affect the treatment effect. Providing balance for each level of the individual strata is usually sufficient. Stratified randomization is typically implemented in two ways: (1) via the Interactive Voice Response System (IVRS), which creates separate randomization schedules for each stratum, and (2) by the permuted block schedule, where a set of blocks is sent to each study center (or allocated to each stratum) and randomization is performed within each center (stratum) using only blocks assigned to that stratum. In multicenter trials, IVRS may provide better balance across study sites with respect to treatments and covariates because it leaves fewer empty blocks (by relaxing the restriction that each block can be filled only by patients from the same site). However, IVRS is expensive, and its use may not be justified in all cases.
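A minimal sketch of the second implementation route, stratified permuted-block randomization, is given below. The block size of 4, the random seed, and the stratum sizes taken from the gender example above are all illustrative choices.

```python
import random

rng = random.Random(7)   # fixed seed only for a reproducible illustrative schedule

def permuted_block_schedule(n_subjects, block=("A", "A", "B", "B")):
    """1:1 allocation in permuted blocks of size 4 (a common, illustrative choice)."""
    schedule = []
    while len(schedule) < n_subjects:
        b = list(block)
        rng.shuffle(b)
        schedule.extend(b)
    return schedule[:n_subjects]

# One independent schedule per stratum, e.g., gender as in Table 1
strata_sizes = {"male": 20, "female": 80}
schedules = {s: permuted_block_schedule(n) for s, n in strata_sizes.items()}

for s, sched in schedules.items():
    print(s, "-> A:", sched.count("A"), "B:", sched.count("B"))
```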
Consequently, blocked randomization remains the first-choice method for most clinical studies today. The popularity of blocked randomization is also promoted to a great extent by drug distribution logistics. Most multicenter trials use study center as a stratification factor when blocked randomization is used; this facilitates stratification by center because each set of blocks sent to a site can be viewed as a separate randomization schedule for that site, and the drug kits with allocation numbers corresponding to that schedule are shipped to the site before randomization. A common practice for multicenter studies using an additional two-level stratification variable is to assign lower- and upper-range allocation numbers to patients who belong to the first and second levels of the stratum, respectively. This procedure can be quite problematic, though, if the number of patients at each center is small.

6.3 When Stratified Randomization Does Not Work Well

Even though stratified randomization is designed to promote greater balance in treatment assignments across different levels of strata, in practice this is not always the case. A common misconception is that proactively implementing stratified randomization is all that is needed to ensure balance across treatment groups with respect to the covariate of interest. In reality, the "quality" of stratified randomization is determined by how it is implemented. Most complete randomization procedures (without stratification) guarantee balance in treatment assignment only asymptotically, for very large sample sizes, and do not enforce such balance for small to moderate sample sizes. Consequently, in smaller trials there is a non-negligible probability of imbalances. When stratification is implemented on top of such a randomization procedure, it is equivalent to creating multiple small randomization schedules (one for each stratum), each potentially with some degree of imbalance. When these imbalances are added up across strata, the resulting treatment imbalance may be substantial. Another common example of a potential problem is stratified blocked randomization,
which uses blocked randomization in conjunction with stratification. Even though blocked randomization by itself controls balance with respect to treatment assignments, it can do so only as long as all blocks remain filled completely or almost completely. It is not uncommon nowadays to see multi-arm placebo-controlled trials with allocation ratios heavily skewed in favor of the active treatment arms versus placebo (for ethical reasons, as the standard of therapy has already been established). An example of such an allocation would be 5:5:2 (active treatment 1, active treatment 2, and placebo, respectively), resulting in a minimum block size of 12. If stratified randomization with a two-level covariate is implemented, each site would need to use a minimum of 2 blocks of 12 patients each (i.e., a minimum of 24 patients to fill all levels of strata and treatments). If a second two-level stratification variable is desired, a minimum of 48 patients would be required within each center to fill all levels of strata and treatments. In other words, the number of different combinations of strata levels grows exponentially with the number of stratification variables. Many trials use a large number of small centers to fulfill their overall enrollment requirements, so expecting even 48 patients enrolled per center may turn out to be unrealistic. This situation may be partially remedied by central randomization implemented via IVRS (where stratification by center is essentially abandoned). However, not every study routinely uses IVRS; even if all centers use it, this approach may be questionable if there is a reason to believe that the centers differ substantially in standard of care or patient population (e.g., regional differences). To summarize, when using stratified randomization, only a very limited number of covariates (the most critical ones) should be used in pre-stratification. The size of the trial, the block size, the number of centers, and the projected enrollment within each center should all be considered when deciding on the number of "primary" stratification variables. To adjust for the remaining ("secondary importance") covariates of interest, a stratified-adjusted analysis (post-stratification) should be used.
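The arithmetic behind the 5:5:2 example above generalizes: the minimum per-center enrollment needed to fill one block in every stratum cell is the block size times the product of the numbers of levels of the stratification factors. A small check of that calculation (the helper function below is purely illustrative):

```python
from math import gcd
from functools import reduce

def minimum_block_size(ratio):
    """Smallest block consistent with the allocation ratio (e.g., 5:5:2 -> 12)."""
    g = reduce(gcd, ratio)
    return sum(r // g for r in ratio)

ratio = (5, 5, 2)                 # active 1 : active 2 : placebo
strata_levels = [2, 2]            # two two-level stratification factors

block = minimum_block_size(ratio)
cells = 1
for levels in strata_levels:
    cells *= levels

print("minimum block size:", block)                                   # 12
print("patients per center to fill one block per stratum cell:", block * cells)  # 48
```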
If the distinction between "primary" and "secondary" covariates is impossible to make, resulting in an overwhelming number of "important" covariates, covariate-adaptive randomization (see the article on adaptive randomization) or Atkinson's optimal design (4) may be explored. It should be kept in mind, though, that those methods have yet to gain wide acceptance due to a lack of well-established, documented properties.
FURTHER READING
REFERENCES
CROSS-REFERENCES
1. W. F. Rosenberger and J. M. Lachin, Randomization in Clinical Trials. New York: Wiley, 2002. 2. J. E. Grizzle, A note on stratified versus complete random assignment in clinical trials. Controlled Clinical Trials. 1982; 3: 365–368. 3. J. P. Matts and R. B. McHugh, Analysis of accrual randomized clinical trials with balanced groups in strata. J Chronic Dis. 1978; 31: 725–740.
Blinding
4. A. C. Atkinson, Optimum biased coin designs for sequential clinical trials with prognostic factors. Biometrika. 1982; 69: 61–67.
W. F. Rosenberger and J. M. Lachin, Randomization in Clinical Trials. New York: Wiley, 2002. This book provides a comprehensive overview of the randomization methods mentioned above, with an in-depth discussion of their statistical properties and applications. The initial chapters are suitable for introductory reading; the more advanced chapters require some background in mathematical statistics.
Block randomization Stratified randomization Adaptive randomization Response-adaptive randomization
SOFTWARE FOR GENETICS/GENOMICS
security and quality control procedures in place. Certainly, standard database software [e.g., Microsoft ACCESS (Microsoft Corporation, Redmond, WA)] and spreadsheet software (e.g., Microsoft EXCEL) can be and often are used, but software has also been developed specifically for the management of genotype, phenotype, and clinical (including expression) data. This discussion will therefore focus on software designed specifically to fit the special needs of genetic data, such as handling the Mendelian inconsistencies and relationship misspecifications that are unique to this type of data.
QING LU, YEUNJOO SONG, and COURTNEY GRAY-MCGUIRE Case Western Reserve University Cleveland, Ohio
As the fields of molecular genetics, genetic epidemiology, and pharmacogenetics continue to expand, so does the availability of bioinformatics tools, particularly software, for the analysis of data from many areas of genetics and genomics research. These software tools can typically be classified into three categories: (1) software for data storage and management, which includes quality control mechanisms; (2) software for investigating patterns of inheritance (genetics), which includes familial aggregation, linkage, allelic association, and linkage disequilibrium; and (3) software for systematic examination of a given genome (genomics), which includes gene expression, microarray data, and sequence analysis. The term genetic data is often used to refer to both classic genetics (the science of inheritance), which emphasizes the study of family data and heritability, and molecular genetics, which focuses on the structure and function of genes. Genomic data, on the other hand, usually implies the study of large-scale genetic patterns across the genome. The breadth of these two areas alone gives an indication of the complexity of the software involved. The text of this article will therefore serve only as an introduction to and review of the areas of analysis currently possible, and it will give program descriptions and references for many commonly used software packages. Furthermore, although software is often multifunctional, programs will be grouped by their most frequently used or most distinctive feature. More detailed lists of software and their capabilities can be found at the websites included in Table 1. 1
1.1 Data Storage and Retrieval

The secure housing of genetic data is an ever-growing concern for medical practitioners and researchers alike. However, secure data must also be usable (i.e., easily queried and exported for the analysis of choice). Programs like PEDSYS and PROGENY are full-scale database systems specially designed for the management of genetic, pedigree, and demographic data. PROGENY is fully customizable and can facilitate multiple users; it also has a built-in pedigree-drawing component. Genial Passport is new software that has been designed specifically for the management of high-throughput SNP data. It can import data, check for Mendelian inheritance within families, and compile and export data. Unlike most software that will be discussed in this article, all packages mentioned here require a licensing fee, which can be expensive. A more complete list of software that contains at least a component for data management can be found in Table 2 (1–111) and Table 3 (111–128).

1.2 Data Visualization

Both freeware and commercial software is available for data visualization or, more specifically, pedigree drawing. As mentioned, PROGENY has such capabilities. Cyrillic, which is commercial software, also has this ability. However, other packages for this purpose are available at no charge. PELICAN (Pedigree Editor for Linkage Computer Analysis), for example, is a utility for pedigree visualization and
DATA MANAGEMENT
A research project in the area of genetics or genomics is likely to include large quantities of data that must be stored with proper
Wiley Encyclopedia of Clinical Trials, Copyright 2008 John Wiley & Sons, Inc.
Table 1. List of web resources for genetics and genomics software

http://linkage.rockefeller.edu/soft/
  Lists more than 350 programs (as of August 2005) on genetic linkage analysis for human pedigree data, QTL analysis for animal/plant breeding data, genetic marker ordering, genetic association analysis, haplotype construction, pedigree drawing, and population genetics. Created by Dr. Wentian Li at Columbia University (1995–1996), later moved to Rockefeller University (1996–2002), and now hosted at the North Shore LIJ Research Institute (2002–present), http://www.nslij-genetics.org/soft/.

http://evolution.gs.washington.edu/phylip/software.html
  Lists about 265 programs and 31 free servers on the subject of phylogeny. Created by Dr. Joseph Felsenstein at the University of Washington; organized so that programs can be searched by method, by the computer systems on which they run, and by data type, and provides links to other lists of phylogeny software.

http://www.bioinformatics.uthscsa.edu/www/genomics.html
  Lists 22 genomics software packages with extensive status reports, guides, and links. Created as part of the Bioinformatics Core Facility service of the University of Texas.

http://www.stat.wisc.edu/~yandell/statgen/software/
  Lists open-source software for the statistical analysis of genomic data. Created by Dr. Brian Yandell at the University of Wisconsin-Madison; includes programs and links on QTL analysis, gene expression, and bioinformatics.

http://www.nslij-genetics.org/microarray/soft.html
  Lists public domain programs for microarray data analysis. Created by the Robert S. Boas Center for Genomics and Human Genetics in the North Shore LIJ Health System.

http://ihome.cuhk.edu.hk/~b400559/arraysoft.html
  Contains an extensive comparison, by category, of programs for microarray data analysis, together with a definition of each category and suggested readings. Created by Dr. Y. F. Leung at Harvard University.

http://www.mged.org/Workgroups/MIAME/miame software.html
  Lists programs that are possibly MIAME (Minimum Information About a Microarray Experiment) compliant. MIAME outlines the minimum information needed to interpret the results of a microarray experiment unambiguously and potentially to reproduce the experiment.

http://www.microarraystation.com/microarray/ (Microarray Bioinformatic Tools and Databases Online)
  Lists about 50 software programs for microarray data analysis, with links.

http://www.statsci.org/micrarra/analysis.html
  Lists institutions and companies that produce data analysis software for cDNA microarray data, with links.

http://www.biostat.ucsf.edu/sampsize.html
  Provides links to software for the calculation of power and sample size for a variety of statistical tests.

http://stat.ubc.ca/~rollin/stats/ssize/
  Provides links to software for the calculation of power and sample size for a variety of study designs.

http://www.russell.embl-heidelberg.de/gtsp/secstrucpred.html
  Provides links to RNA and protein structure prediction software.

http://www.rnaiweb.com/RNAi/MicroRNA/MicroRNA Tools Software/MicroRNA Target Scan/index.html
  Provides links to microRNA detection and RNA target identification software.

http://www.soph.uab.edu/ssg/default.aspx?id=49
  Provides links to a variety of analysis software, including software for genetic epidemiology and population genetics.
Table 2. List of genetics software including function and data types required

(The original table is a checkmark matrix indicating, for each program, its functions (data management, error detection, genetic map construction, genealogy, power, segregation, linkage, association, linkage disequilibrium mapping, TDT, haplotype analysis, and meta-analysis) and the types of data it accepts (related or unrelated individuals, microarray, sequence, phylogeny, SNP, and microsatellite data); the matrix itself is not recoverable from the extracted text. The programs covered by the table are listed below.)

ACT (1), ADMIXMAP (2), ALLEGRO (3), ALOHOMORA (4), BAMA (5), BARS (6), BLADE (7), CATS (8), CARTHAGENE (9), CCREL (10), CHAPLIN (11), CHECKHET (12), CLUSTAG (13), CMAP (14), CoPE (15), COVIBD (16), CRIMAP (17), CYRILLIC (18), DHSMAP (19), DMLE (20), DPPH (21), EHP.R (22), EM-DECODER (23), ET-TDT (24), FASTLINK (25), FASTSLINK, FBAT (26), FLOSS (27), GC/GCF (28), GCHAP (29), GENEFINDER (30), GENEHUNTER (31), GENETIC POWER CALCULATOR (32), GENIAL PASSPORT, GIST (33), GMCHECK (34), GOLD (35), GRR (36), GSMA (37), HAPAR (38), HAPASSOC (39), HAPBLOCK (40), HAPLOT (41), HAPLOBLOCK (42), HAPLOPAINTER (43), HAPLORE (44), HAPLOREC (45), HAPLO.STAT (46), HAPLOTYPE ESTIMATION (47), HAPLOTYPER (23), HAPLOVIEW (48), HAPSCOPE (49), HEGESMA (50), HELIXTREE (51), HPLUS (52), HS-TDT (53), HTR (51), ILR (54), INTEGRATEDMAP (55), LAMP (56), LEA (57), LINKAGE (58), LOCUSMAP, LOT (59), LRTAE (60), MADELINE, MALDSOFT (61), MEGA2 (62), MENDEL (63), MERLIN (64), MILD, MINSAGE (65), MULTIMAP (66), OSA (67), PANGAEA, PAP, PARENTE (68), PAWE-3D (69), PBAT (70), PDT (71), PED (72), PEDCHECK (73), PEDGENIE (74), PEDIGRAPH, PEDIGREEQUERY (75), PEDSTATS (76), PEDSYS, PELICAN (77), PHASE (78), PL-EM (79), POWER (80), PREST (81), PROGENY, PSEUDOMARKER (82), QTDT (83), QUANTO (84), RC-TDT (85), RELCHECK (86), R/GAP, SAGE, SEGPATH (87), SIBMED (88), SIB-PAIR (89), SIBSIM (90), SIMLA (91), SIMPED (92), SIMPLE (93), SIMWALK (94), SNPALYZE (95), SNPHAP (96), SNPP (97), SOLAR (98), SPLAT (99), STRAT (100), SUPERLINK (101), TASSEL (102), TDT-AE (103), TDT-PC (104), TDTPOWER (105), THESIAS (106), TOMCAT (107), TREELD (108), TRIMHAP (109), UNPHASED (110), BLAST, Phred, Phrap, Consed (111), PHYLIP, Clustal W, Seaview, ReadSeq, PAUP, Treeview.
Table 3. List of genomics software including function and data types required

(As with Table 2, the checkmark matrix of functions and data types is not recoverable from the extracted text; the programs covered by the table and the abbreviation key used in Tables 2 and 3 are reproduced below.)

Acuity, ArrayPack, Aroma (111), BAGEL (112), BASE (113), BioConductor (114), BioSieve, CAGED (115,116), ChipST2C, Cluster (116), dChip (117), EBarrays (118), Expression Profiler, ExpressYourself (119), GAAS (120), GeneClust (121), GenePattern, GeneSifter, GeneSpring, GeneTraffic, GeneX-Lite, J-Express Pro (122), MAExplorer, MAGIC (123), PAM (124), SAM (125), SMA, ScanAlyze, SilicoCyte, SNOMAD (126), SNP CHART (127), Spotfire, TIGR TM4 (128), Treeview.

Abbreviation key. Miscellaneous: DM = data management; ED = error detection; MP = map (genetic); GL = genealogy; PW = power; NM = normalization/quality control check; VS = visualization/image analysis; SA = statistical analysis; DT = data mining. Type of analysis: SG = segregation; LK = linkage; AS = association; LD = mapping (linkage disequilibrium); TD = TDT; HP = haplotype; MT = meta-analysis. Data type: MC = microarray; SQ = sequence; PL = phylogeny; UR = unrelated; RL = related; SP = SNPs; MS = microsatellites.
editing pedigree structure; it allows graphical changes to the structure of the pedigree to be saved and the data to be exported accordingly. COPE (Collaborative Pedigree drawing Environment) and PED (Pedigree Drawing Software) are pedigree-drawing programs designed particularly for epidemiologists that allow customized, automated drawing of large numbers of pedigrees. PEDIGRAPH and PEDIGREEQUERY are similar programs but are specifically designed for drawing large, complex pedigrees, with the latter additionally allowing for consanguinity loops and multiple mates. Haplopainter is a pedigree-drawing program specifically designed to import haplotype data from several genetic analysis programs like GENEHUNTER, ALLEGRO, MERLIN, and SIMWALK (see the sections on linkage and linkage disequilibrium analysis below). Madeline is a program for preparing, visualizing, and exploring pedigree data. It allows users to query their datasets, draw pedigrees, and then convert pedigree and marker data into the proper format for other genetic analysis software (like those described in the linkage section below). Again, a more complete listing of software that contains a visualization component can be examined in Tables 2 and 3.

1.3 Data Processors

Most genetic and genomics programs have very specific data formatting requirements, and the option to export in a given format from a standard database package or a genetics database may or may not be available. PEDSYS can export data formatted for several genetic analysis programs, like FISHER, MENDEL, SOLAR, and LINKAGE (see Table 2). PROGENY can export in a fixed set of formats easily input into programs like GENEHUNTER, SOLAR, and LINKAGE (see Table 2). Other programs are available that will transform standard input into a variety of different file types. MEGA2 (Manipulation Environment for Genetic Analyses), which is software specifically designed for this purpose, can convert LINKAGE files into an input format suitable for SimWalk, MENDEL, ASPEX, APM, SLINK, S.A.G.E., and Genehunter, to
name a few. SIB-PAIR, which is primarily designed for linkage analysis (see below), also formats data for several of the programs listed above and in Table 2.

1.4 Error Detection for Family and Genotype Data

Most genetic analyses rely on having data from multiple members of the same family (or pedigree). These programs, which will be discussed in the following sections, also assume that the given pedigree relationships are correct, which may or may not be true. For this reason, algorithms and software have been developed that will infer, usually based on genotype data, whether a specified relationship is correct. RELTEST, for example, which is a part of S.A.G.E., will, given genotypes from markers that span approximately half the genome, assign a probability that each given relative pair is unrelated, parent-offspring, sibling, half-sibling, or monozygotic twins. Other programs that do similar things include RELPAIR, RELCHECK, PREST, and GRR (Graphical Representation of Relationships), and they are listed in Table 2. One note of interest is that RELTEST only compares within a given family, so if the goal is to identify potential sample switches, then another program such as RELPAIR is probably most appropriate. Each of these programs allows for a certain degree of error in the genotype data, but they are also sensitive to these errors; interpretation of results must be tempered accordingly. RELTEST, RELPAIR, and ALTERTEST (the companion program to PREST), in addition to identifying errors, suggest the most likely relationship between pairs of individuals whose putative relationships are suspect. Note that these software programs all assume that a putative relationship is given. A few programs, however, do try to infer how two individuals are related without a relationship being specified a priori; FSTAT, FAMOZ, and PARENTE are all such programs. These options are not discussed in detail here, as they are computationally intensive and more likely to be used in the fields of population and archaeological genetics than in medical genetics.
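The intuition behind these relationship checks can be illustrated with a short sketch in R. The genotype data below are simulated and purely hypothetical, and the simple identity-in-state (IBS) summary shown here is only the underlying idea; programs such as RELTEST and RELPAIR use formal likelihood-based methods on real marker maps.

ibs <- function(g1, g2) {
  # genotypes coded as counts (0, 1, 2) of an arbitrary reference allele;
  # IBS sharing at a marker is 2 minus the absolute difference in counts
  2 - abs(g1 - g2)
}
set.seed(1)
n.markers <- 500
geno.a <- rbinom(n.markers, 2, 0.5)   # individual A (simulated)
geno.b <- rbinom(n.markers, 2, 0.5)   # individual B (simulated as unrelated)
mean(ibs(geno.a, geno.b))
# Mean IBS is highest for monozygotic twins, intermediate for first-degree
# relatives, and lowest for unrelated pairs; a pair whose observed sharing is
# far from what its putative relationship predicts is flagged for review.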
Similar to the programs that detect relationship misspecifications assuming known genotypes, there are also programs that detect Mendelian inconsistencies in genotype data assuming a known family structure. For example, MARKERINFO, which is a program in S.A.G.E., gives as output (1) a list of all markers and the number of inconsistencies for each family, (2) a list of all families and the number of inconsistencies for each marker, and (3) the genotypes of all individuals involved in an inconsistency at a given marker. Other programs that detect genotyping error in a similar way are listed in Table 2 and include PEDCHECK, GMCheck, and MENDEL, which identify inconsistencies in pedigree data; SIBMED (Sibling Mutation and Error Detection), which identifies errors and mutations in sib-pair data; and CHECKHET, which detects abnormal genotypes in case-control data. A few of these programs also assess whether a particular sample, for a given locus, is in Hardy-Weinberg proportion. A closing point regarding programs for the detection of relationship misspecifications and Mendelian inconsistencies is their sensitivity to the type of data being used (i.e., microsatellite versus single nucleotide polymorphism (SNP) data). Because most algorithms implemented in these programs rely on identity-in-state information rather than identity-by-descent (IBD) information, the degree of heterozygosity is critical to performance; microsatellites are therefore more informative for these types of analyses.

1.5 Error Detection for Genetic Maps

As will be discussed, many genetic analysis methods used today rely on knowing the order of, as well as the spacing between, markers. This task is certainly not trivial, particularly given the various methods of creating both genetic and physical maps. Software is available that both constructs genetic maps and compares genetic and physical maps. CRIMAP and MULTIMAP are companion programs designed to allow rapid, largely automated construction of multilocus linkage maps by evaluating several alternate locus orders. These programs can be used either on the user's data or on the CEPH data. In fact, the CEPH data can be downloaded in CRIMAP format from their website
(http://www.cephb.fr/). INTEGRATEDMAP is a web-based application that allows users to store and then interactively display genetic map data. LOCUSMAP functions similarly, but it is primarily designed for linkage analysis and map construction given a variety of specified modes of inheritance. Other linkage programs, like LODLINK in S.A.G.E., can also be used for marker-to-marker linkage, which allows estimation of the recombination fraction, and therefore the order of and distance between two markers. This feature of linkage programs does require, however, that sufficient distance exists between markers and is therefore best suited to microsatellite genome scans or more sparsely spaced SNP scans. The reconciliation of physical and genetic maps is daunting, at best, and although several web-based resources for both types of maps are available (see Table 2), their agreement is certainly not guaranteed. Programs like CMAP, however, allow users to view comparisons of these maps and to collate these data. This is an area of ever-increasing resources, and the bioinformatics tools discussed in the genomics section below are likely to be useful here as well.

2 GENETIC ANALYSIS

For the purposes of this discussion, we refer to genetics as the study of patterns of inheritance, specifically the inheritance of disease, which can be done with or without data obtained from molecular genetics. Related to this area, but not included, at least not in any detail, in this review are the fields of population and ecological genetics, which focus on the influence of evolutionary and environmental forces, respectively.

2.1 Summary Statistics

The structure of a given dataset, particularly family data, has broad implications that include the best way to house the data and the type of analysis that is most appropriate. For that reason, programs like PEDINFO, which is a subroutine of S.A.G.E., can be useful. PEDINFO provides descriptive statistics on pedigree structure that include the means, variances, and histograms of family,
sibship, and overall sample size. It can provide statistics that are specific to individuals and can subset results based on any number of criteria. PEDINFO also gives potentially diagnostic information about individuals in a file who are unrelated to anyone else, have multiple mates, or appear in consanguineous mating pairs. PEDSTATS is a similar program that additionally checks for basic formatting errors, departures from Hardy-Weinberg proportions, and Mendelian inconsistencies; it will also identify individuals who are uninformative for segregation/cosegregation (linkage) and create new data files with those individuals removed.
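For a sense of what such summaries involve, the short R sketch below computes family and sibship sizes from a hypothetical pedigree table; the column names and toy data are assumptions made purely for illustration, and dedicated programs such as PEDINFO and PEDSTATS report far more (variances, histograms, uninformative individuals, and so on).

peddf <- data.frame(
  famid  = c(1, 1, 1, 1, 2, 2, 2),      # pedigree identifier
  id     = c(1, 2, 3, 4, 1, 2, 3),      # individual identifier within pedigree
  father = c(0, 0, 1, 1, 0, 0, 1),      # 0 indicates a founder
  mother = c(0, 0, 2, 2, 0, 0, 2)
)
fam.size <- table(peddf$famid)          # individuals per pedigree
summary(as.numeric(fam.size))
offspring <- subset(peddf, father != 0 & mother != 0)
sib.size <- table(interaction(offspring$famid, offspring$father,
                              offspring$mother, drop = TRUE))
summary(as.numeric(sib.size))           # sibship-size distribution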
2.2 Familial Aggregation, Commingling, and Segregation Analysis

Genetic and genomic studies can be both time consuming and expensive. For this reason, most are not undertaken without evidence of familiality. High monozygotic twin concordance rates and an elevated relative risk to family members of an affected individual compared with the general population are the most often cited evidence of this. Assuming that one has reason to believe a particular trait is heritable, one might next seek to establish a significant correlation between pairs of relatives. This can be done, to some extent, with standard software like SAS (SAS Institute, Cary, NC) or S-Plus. However, to account accurately for all relationships within a pedigree, a software program like FCOR, which is a part of the program package S.A.G.E., needs to be used. Again, assuming the results of this analysis support a genetic etiology, one can perform analyses to determine (1) that the underlying trait distribution differs by inferred trait genotype (commingling analysis) and (2) that the putative disease locus is being inherited in the family, that is, that a major gene or polygene effect can be observed (segregation analysis). Segregation models in particular can be very complex and therefore computationally intensive. Three prominent programs are used for segregation analysis today: SEGREG (included in S.A.G.E.), MENDEL, and SEGPATH (Segregation analysis and PATH analysis). SEGREG can be used to model single major gene or polygenic models, and it is not limited in the size or the type of pedigrees it can analyze as long as they do not include loops. SEGREG can also include composite traits or environmental covariates. MENDEL can perform segregation analysis assuming few underlying trait loci, but it can also perform linkage calculations and allele frequency estimation. SEGPATH can, as the name implies, perform both segregation and path analysis either separately or jointly. Segregation models, in most programs, can also include multivariate phenotypes, environmental indices, and fixed covariates.
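As a minimal illustration of the familial-correlation step mentioned above, the following R lines test the correlation between trait values of sibling pairs; the numbers are hypothetical, and FCOR generalizes this idea to every relative-pair type in whole pedigrees.

sib1 <- c(1.2, 0.4, 2.1, 1.8, 0.9, 1.5, 2.4, 0.7, 1.1, 1.9)   # trait, first sib
sib2 <- c(1.0, 0.8, 1.7, 2.0, 0.5, 1.2, 2.2, 1.0, 0.9, 2.1)   # trait, second sib
cor.test(sib1, sib2)   # Pearson correlation; a significant positive value
                       # is consistent with familial aggregation of the trait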
2.3 Linkage Analysis

The biological phenomenon of linkage is that two loci are in such close physical proximity that they travel together more than one would expect by chance alone (i.e., with no linkage, a recombination between them occurs with 50% probability). Linkage analysis can be classified into two categories: model-free analysis, which does not require the a priori specification of a mode of inheritance controlling the disease of interest, and model-based analysis, which does. These two categories are often referred to as nonparametric and parametric, although both require the estimation of parameters. To be completely accurate, they should be termed "genetic-model-free" and "genetic-model-based." Model-free linkage methods (again, a misnomer, as all linkage analysis requires a statistical model, whether or not it requires specification of a genetic model) are typically computationally simple and rapid and can be used as a first screen of multiple markers. They are typically designed for nuclear families or independent relative pairs (e.g., sibling pairs) (Table 2). Model-based linkage methods assume a prespecified mode of inheritance and a given penetrance function for the underlying disease genotype, and they can analyze large pedigrees as a whole, often with few size restrictions. Several software packages implement both model-based and model-free linkage analysis. Model-based linkage analysis, which is sometimes referred to as traditional LOD score analysis, requires the a priori specification of the mode of inheritance (additive, recessive, or dominant), disease allele frequency,
and penetrance, and it involves the estimation of the recombination fraction (a measure of genetic distance between two loci). LINKAGE is a program commonly used for model-based analysis; it performs maximum likelihood estimation of recombination rates for an entire sample as well as for individual families. This package includes two groups of programs, one for general pedigrees and marker data (i.e., microsatellite markers or SNPs) and the second for three-generation families and codominant markers. FASTLINK is a modified and improved version of LINKAGE that is more computationally efficient and better documented. SUPERLINK performs exact linkage analysis, as do LINKAGE and FASTLINK, but it can handle larger input files. LODLINK (a part of S.A.G.E.) performs model-based linkage analysis. In addition, it performs several tests of heterogeneity of the recombination fraction and can take as input individual-specific penetrances as calculated from segregation analysis, which is an alternative to analyzing the data under a variety of inferred modes of inheritance. All of these programs are designed specifically for binary traits and two-point linkage analysis (often called single-point, as only a single marker locus is considered at a time). Other model-based linkage programs that analyze data from adjacent markers jointly (multipoint linkage analysis) include MLOD (included in S.A.G.E.) and GENEHUNTER. Model-free linkage analysis, which is also called nonparametric linkage analysis (again, a misnomer, as explained above), does not require prior specification of the genetic model, as it relies on the estimation of IBD allele sharing between pairs of relatives. Some model-free analysis software is designed for the analysis of full pedigrees, but most model-free methods are designed for the analysis of sibling or other relative pairs. Many programs for model-free linkage analysis are available (Table 2). One of the most commonly used algorithms, the Haseman-Elston algorithm, can be used for both continuous and dichotomous traits as well as for affected, discordant, and unaffected relative pairs. It is available in the program SIBPAL, which is in the S.A.G.E. software suite. Other programs for affected relative pair analysis include MERLIN, LODPAL (also in S.A.G.E.), and GENEHUNTER, which performs model-free linkage analysis for pedigrees and provides what the authors term a "nonparametric lod (NPL) score." SOLAR is another widely used program that performs variance-component-based model-free linkage analysis. All of these programs can perform both two-point and multipoint linkage analysis.
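The idea behind the Haseman-Elston approach can be shown in a few lines of R. The sib-pair data below are simulated under assumed variance components chosen only for illustration; SIBPAL and related programs implement this approach (and its newer variants) far more completely.

set.seed(2)
n.pairs <- 200
pihat <- sample(c(0, 0.5, 1), n.pairs, replace = TRUE,
                prob = c(0.25, 0.5, 0.25))       # estimated IBD sharing at a marker
sigma.g <- 1.5; sigma.e <- 0.5                   # assumed QTL and residual SDs
g.shared <- rnorm(n.pairs, sd = sqrt(pihat) * sigma.g)
trait1 <- g.shared + rnorm(n.pairs, sd = sqrt(1 - pihat) * sigma.g) +
          rnorm(n.pairs, sd = sigma.e)
trait2 <- g.shared + rnorm(n.pairs, sd = sqrt(1 - pihat) * sigma.g) +
          rnorm(n.pairs, sd = sigma.e)
sqdiff <- (trait1 - trait2)^2                    # squared sib-pair trait difference
summary(lm(sqdiff ~ pihat))$coefficients["pihat", ]
# Under linkage the regression of the squared difference on IBD sharing has a
# negative slope; no linkage corresponds to a slope of zero.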
2.4 Association Analysis

Allelic association is the population association between alleles at two different loci, most often between a disease allele and a marker allele. Association can be caused by many factors, such as selection, admixture, and migration. When association is caused by linkage, that is, when both linkage and allelic association are present, it is called linkage disequilibrium. The usual picture is that a disease mutation occurs on an ancestral haplotype. Over generations, the association between loci that are far apart decays quickly because of recombination events; because recombination events between the disease locus and nearby loci are rare, the linkage disequilibrium between them tends to be preserved, which is the basis for genetic association studies. To conduct a univariate analysis for single nucleotide polymorphisms (SNPs) in unrelated data, one can use a standard statistical package (e.g., R) by coding the genotypes of a SNP under a specific model of inheritance (e.g., for an additive model, TT = 0, AT = 1, AA = 2, where A and T are the two alleles of the SNP). One can then test association between a binary trait and the vector of genotype codes through logistic regression or a contingency-table chi-square statistic. For an additive model, this test is equivalent to the Cochran-Armitage trend test. If the trait is continuous, then one can use either linear regression or generalized linear models with a different choice of link function, such as the identity or log link (as compared with the logit link used in logistic regression). If the marker is multi-allelic, then the software CLUMP can be used. Although the programs discussed above allow for the appropriate analysis, they are not designed for the analysis of large volumes of data, specifically the increasingly available genome-wide SNP association scan data. Some programs have been developed to handle these types of data more efficiently. PLINK, for example, is designed for the analysis of high-density genome-wide association case-control data. PLINK also has a unique implementation of an approach to test the common disease/rare variant hypothesis. ASSOC in S.A.G.E. has also been extended to analyze large numbers of markers for family data. This program not only accounts for the correlation among members of a family but also allows for the simultaneous transformation of the data to obtain normally distributed residuals. It is unlike other family-based association methods (see the section on linkage disequilibrium and transmission disequilibrium tests) in that it does not condition on parental genotype. Both PLINK and ASSOC have graphical user interfaces (GUIs). A concern for population-based genetic association studies is the possibility of false-positive results caused by population stratification (populations with distinct allele frequencies and disease prevalences represented differentially in cases and controls). The two most popular methods to control for or test for population stratification are genomic control and the STRUCTURE method, respectively. The frequentist genomic control approach provides an easy way to detect and adjust for population stratification using a panel of unlinked markers: it estimates a variance inflation factor λ as the median chi-square statistic from the panel divided by a given constant; a value greater than 1 indicates the presence of population stratification and the need to adjust the obtained chi-square statistics by dividing by λ. An alternative, Bayesian approach proposed by Pritchard et al. (100) and incorporated in the software STRUCTURE clusters the study population by genetic ancestry so that one can test for association in each of the respective homogeneous subpopulations.
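A minimal R sketch of these two steps, using hypothetical data: additive coding of a SNP followed by a logistic-regression test of association, and a genomic-control adjustment computed from a panel of (here simulated) chi-square statistics.

geno <- c("TT", "AT", "AA", "AT", "TT", "AA", "AT", "TT", "AA", "AT",
          "AA", "AT", "TT", "AA", "AA", "AT", "TT", "AT", "AA", "TT")
case <- c(0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0)
additive <- c(TT = 0, AT = 1, AA = 2)[geno]      # additive genotype coding
fit <- glm(case ~ additive, family = binomial)
summary(fit)$coefficients["additive", ]          # estimate, SE, z value, P-value

# Genomic control: lambda is the observed median chi-square statistic divided
# by the null median for 1 df, qchisq(0.5, 1); statistics are deflated by lambda.
chisq.stats <- rchisq(1000, df = 1) * 1.1        # simulated, mildly inflated panel
lambda <- median(chisq.stats) / qchisq(0.5, df = 1)
adjusted <- chisq.stats / lambda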
2.5 Linkage Disequilibrium and Transmission Disequilibrium Tests

Although analyses are often specified as association analyses, the real phenomenon of interest is linkage disequilibrium, that is, allelic association that is caused by close physical proximity (i.e., linkage). Several programs are available to quantify the linkage disequilibrium between markers, like the genetics package in R. When several markers are used, a graphics tool can be very useful for visualizing the patterns of linkage disequilibrium in a given region. Both GOLD and HAPLOVIEW are useful programs for this purpose. HAPLOVIEW can also perform other functions, from defining LD blocks to conducting association tests. A major difference between the commonly used case-control and family-based association studies lies in the control samples they use. In a case-control study, the control samples are recruited directly from the population, whereas in a family-based association study, such as is used in the transmission disequilibrium test, the controls are formed by the alleles not transmitted from the parents to affected offspring. We call such controls "pseudo-controls," as we do not observe them in the real data. Because the pseudo-controls originate from the same population as the affected offspring, they provide robustness against unobserved genetic population structure. The original Transmission/Disequilibrium Test (TDT) is essentially McNemar's test with one degree of freedom, in which the frequencies of the disease and alternative alleles transmitted from heterozygous parents to their offspring are compared. Much software, such as TDTEX in S.A.G.E., TDT/S-TDT, and FBAT, can conduct the original TDT. The original TDT has been extended to overcome many limitations, and most of these extensions have been incorporated into a variety of programs. The program ETDT (Extended Transmission/Disequilibrium Test) includes the ability to handle multiallelic markers, whereas the program PDT (Pedigree Disequilibrium Test) can be used for pedigrees of any size with several affected offspring. PDT can conduct both allele-specific and genotype-specific association analysis of individual markers. The program QTDT (Quantitative Trait Transmission/Disequilibrium Test) performs linkage disequilibrium (TDT) and association analysis for quantitative traits. It provides
a convenient interface for family-based tests of linkage disequilibrium and can calculate exact P-values by permutation even when multiple linked polymorphisms are tested. Several other programs (e.g., ET-TDT, TDTHAP, and HS-TDT) are based on haplotypes instead of a single marker genotype. The program ET-TDT combines the benefits of haplotype analysis with the TDT to find which haplotypes or groups of haplotypes, as defined in an evolutionary tree, are responsible for a change in relative risk for a genetic condition. Both the log-linear approach and TDTEX, which is a part of S.A.G.E., can test for maternal effects and parental imprinting effects. Two unified approaches can accommodate any type of pedigree, dichotomous or quantitative phenotypes, and covariates; the corresponding computer programs for these two approaches are the TAI feature of ASSOC, which is a part of the program package S.A.G.E., and FBAT, respectively.
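For intuition, the original TDT can be computed by hand from the transmission counts; the counts below are hypothetical. Programs such as TDTEX and FBAT add exact tests, multi-allelic and haplotype extensions, and many other refinements.

n.trans   <- 62   # heterozygous parents transmitting the candidate allele
n.untrans <- 38   # heterozygous parents transmitting the other allele
tdt <- (n.trans - n.untrans)^2 / (n.trans + n.untrans)   # McNemar-type statistic
pchisq(tdt, df = 1, lower.tail = FALSE)                   # P-value on 1 df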
2.6 Admixture Mapping

Admixture mapping is a way of mapping disease susceptibility loci in admixed populations. The ideal population for such a study is a recently admixed population derived from ancestral populations with both distinct marker allele frequencies and distinct disease prevalences (e.g., European and African populations). The underlying principle of admixture mapping is that a large proportion of the genetic segment at or near a disease locus will be inherited from the high-risk ancestral population. Thus, we can estimate the proportion of ancestry for each marker and identify disease susceptibility loci by comparing the estimated proportion of ancestry with the expected proportion of ancestry. Compared with linkage analysis, admixture mapping has more power to detect disease loci with modest effects and has better mapping resolution. However, as mentioned, it requires that the allele frequencies of these loci be distributed differently among the ancestral populations; power will therefore decrease if the ancestral populations have highly similar allele frequencies. Compared with association analysis, admixture mapping needs fewer markers for a whole-genome scan and is more robust to allelic heterogeneity, but it has lower power and lower mapping resolution. ADMIXMAP, MALDsoft, and ANCESTRYMAP are three popular Markov chain Monte Carlo (MCMC)-based programs for admixture mapping. All three programs can be used to infer individual admixture, population admixture, and locus ancestry, as well as to test for association by using cases only or by using both cases and controls. Testing using both cases and controls provides robustness against misspecification of the allele frequencies. The underlying statistical methods for these three approaches are similar, and all three provide ways for the user to specify the priors for the allele frequencies. We should be aware of several limitations of the programs: MALDsoft and ANCESTRYMAP were designed only for two ancestral populations, and ANCESTRYMAP can only use biallelic markers. ADMIXMAP can be used in samples with more than two founder populations. Unlike ADMIXMAP and ANCESTRYMAP, MALDsoft cannot be used independently; it relies on the output of the STRUCTURE program (discussed in the Association Analysis section above) to conduct tests of association. FRAPPE and PSMIX are two frequentist alternatives to the Bayesian and MCMC approaches mentioned above; they use maximum likelihood methods to estimate individual admixture and population stratification.

2.7 Haplotype Analysis

Haplotype analysis is another popular analysis strategy for genetic association studies. Unlike single-locus analysis, haplotype analysis considers multiple loci simultaneously. Because haplotype analysis can potentially use more information, it is anticipated to have more power than the single-locus approach when several nearby loci are in LD with the causal locus. Often, however, this information relies on the assumption of phase-known haplotypes. With that said, if known, haplotypes can be biologically meaningful, as the sequence of an individual's protein is determined by his or her maternal or paternal haplotype. Because, as mentioned, what is observed is the unphased genotype, the key steps in haplotype analysis are to determine haplotype
phase and estimate haplotype frequencies. Many algorithms have been developed for these purposes. Among them, the Expectation-Maximization (EM) algorithm is probably the most popular frequentist approach and is the basis of several programs, including DECIPHER in S.A.G.E., HAPLO.STAT (previously known as HAPLO.SCORE), HAPLO, and SNPHAP. In addition to haplotype inference, DECIPHER and HAPLO.STAT provide a convenient way of testing for differences in haplotype frequencies between groups. Furthermore, because HAPLO.STAT is built on a generalized linear models framework, it can accommodate any type of trait and easily incorporates covariates. It provides a global test for an overall haplotype effect as well as an individual test for each haplotype. Other programs for haplotype estimation rely on Bayesian approaches, such as those in PHASE and HAPLOTYPER. A Bayesian approach focuses on the estimation of the posterior distribution of the haplotype frequency parameters, determined by both the likelihood of the data and the prior information. When a noninformative prior is used, the point estimates of the haplotype frequencies should be close to those from the EM algorithm. However, the PHASE program has a computational advantage over the EM algorithm, as it can easily allow for construction of haplotypes when the number of heterozygous loci is very high; it therefore has the ability to handle many haplotypes. The new version of the program also allows the user to estimate recombination rates and to identify recombination hotspots. The HAPLOTYPER program is similar to the PHASE program in that both use the Gibbs sampler. However, HAPLOTYPER is computationally faster than the PHASE program because it also implements the partition-ligation algorithm. In practice, when haplotype analysis is conducted on a large number of loci, many of the resulting haplotypes will have small frequencies, and the power of haplotype analysis is hence significantly reduced because of the large number of degrees of freedom. Current work focuses on reducing the large number of haplotypes based on cladistic or clustering analysis. Hap-clustering and HapMiner are two programs that use cluster algorithms to
group haplotypes and therefore increase test power. Because the Hap-clustering program is based on the score test used in the HAPLO.STAT program, it inherits all the properties of the HAPLO.STAT program. The programs mentioned above all function under the assumption of independence of observations (i.e., the sample comprises unrelated individuals). DECIPHER, in S.A.G.E., is the one partial exception, as it can use family relationships to derive the most likely haplotype for a given individual, but it cannot jointly estimate the haplotypes for all members of a family. A few programs, however, do calculate haplotypes for entire pedigrees. HAPLORE, for example, reconstructs all haplotype sets that are compatible with the observed genotypes in a pedigree. Similarly, SimWalk2 uses MCMC and simulated annealing algorithms to estimate haplotypes on pedigrees of any size.
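The core of the EM approach can be sketched in R for the simplest case, two biallelic SNPs; the genotype data below are hypothetical, and real programs handle many loci, missing data, and much larger samples.

# Genotypes are counts (0, 1, 2) of allele A at SNP 1 and allele B at SNP 2;
# only double heterozygotes (1, 1) have ambiguous phase (AB/ab versus Ab/aB).
g1 <- c(0, 1, 2, 1, 1, 2, 0, 1)
g2 <- c(0, 1, 2, 1, 0, 1, 1, 1)
n <- length(g1)
p <- rep(0.25, 4)                       # frequencies of haplotypes AB, Ab, aB, ab
for (iter in 1:100) {
  cnt <- c(AB = 0, Ab = 0, aB = 0, ab = 0)
  for (i in 1:n) {
    a <- g1[i]; b <- g2[i]
    if (a == 1 && b == 1) {             # E-step for the ambiguous genotype
      w <- p[1] * p[4] / (p[1] * p[4] + p[2] * p[3])
      cnt <- cnt + w * c(1, 0, 0, 1) + (1 - w) * c(0, 1, 1, 0)
    } else {                            # phase is determined by the genotypes
      nAB <- max(0, a + b - 2)
      cnt <- cnt + c(nAB, a - nAB, b - nAB, 2 - a - b + nAB)
    }
  }
  p.new <- cnt / (2 * n)                # M-step: update haplotype frequencies
  if (max(abs(p.new - p)) < 1e-8) { p <- p.new; break }
  p <- p.new
}
round(p, 3)                             # estimated frequencies of AB, Ab, aB, ab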
2.8 Software Suites

Several software packages have been designed to perform multiple analyses discussed thus far, with the convenience of several modules in one. They provide an option for the user who wishes to conduct multiple genetic analyses on a given dataset. Software suites that are free to users include MERLIN, which performs error detection, model-based and model-free linkage analysis, regression-based linkage analysis, association analysis for quantitative traits, IBD and kinship estimation, haplotyping, and simulation. MENDEL also offers several options within a given environment. It can order and map markers, perform model-based and model-free linkage analysis, conduct error checking, calculate genetic risks and genotype penetrances, and perform linkage disequilibrium and association analyses. MORGAN also performs multiple analyses of pedigree data, specifically error checking, kinship computation, simulation, IBD estimation, map estimation, segregation analysis, and linkage analysis. These programs are primarily designed for UNIX platforms and are run through a command-line interface. S.A.G.E. is a similar suite of programs for genetic analysis of family, pedigree, and individual data and can be used to conduct almost all of the genetic analyses described above except admixture mapping. It can be installed and used on multiple platforms, including Windows, Linux, Solaris, and Mac, and it has a GUI for all platforms.
3 GENOMIC ANALYSIS
Genomics is the study of large-scale genetic patterns across the genome of a given species. This field has only recently become feasible, as it depends on the availability of whole-genome sequences and efficient computational tools for the analysis of large datasets. The major tools and methods related to genomics are bioinformatics, measurement of gene expression, and determination of gene function. Genomics is of particular interest because of its potential for leading to new diagnostic and therapeutic measures. The large category of genomics comprises phylogenetics, gene recognition, and protein and RNA structure prediction. Each of these would require an entire article to describe the various approaches to its analysis; however, some widely used and publicly available (mostly web-based) programs are worth mentioning. Phylogenetics is the comparison of DNA sequence between populations (or organisms) to reconstruct population history or identify the relationships between genes or organisms. The most comprehensive listing of, and links to, programs for phylogenetic analysis can be found in Table 1. Over 300 software packages are available for this kind of analysis; they can be used with a variety of genetic data (from microsatellite marker data to DNA sequence) to calculate genetic distances between populations (or species) and cluster genes, to align sequences of two different populations, to align proteins of two different populations (although this might also be subsumed under the term proteomics), and to create evolutionary trees and calculate the distance between those trees. Gene recognition addresses the task of localizing genes based on sequence information alone. This task involves identifying consensus patterns in promoter regions, searching for start and stop codons as well as splice sites, recognizing codon sequences that are common across known genes, and
comparing sequence of unknown function to sequence in known genes. Again, a host of programs is available for performing many of these analyses, and an extensive list of almost 100 programs can be found at the website listed in Table 1. Related to gene recognition is the examination of the expression levels of thousands of genes simultaneously, which is called microarray analysis. Although the high dimensionality of these data has posed both computational and statistical challenges, the software available for the analysis of microarray panels is becoming more and more abundant. BAMarray, for example, implements a Bayesian approach to the analysis of the up- and down-regulation of genes between cases and controls. Similarly, SAM uses supervised computer learning to recognize patterns of gene expression within groups of patients and/or controls. TM4 offers a database for storing large volumes of microarray data and allows for the clustering and quantification of gene array results. All of these programs are available free of charge. MicroRNA profiling is related to the recognition of patterns within microarray data but is specific to a recently popularized class of noncoding RNAs that are evolutionarily conserved and can now be assayed. These RNAs have emerged as important for both the regulation of gene expression and the regulation of the processes of translation and transcription. The involvement of these regions in disease can be studied by examining tissue-specific expression profiles of mRNA. The analysis of these data presents a challenge, but it is the focus of much research. Table 1 lists a website that provides a series of programs for both microRNA search and prediction given sequence data as well as microRNA genomic target identification. Prediction of the structure of proteins and RNA is considered an unsolved problem. However, a few algorithms have been developed that can predict the native structure of proteins with some degree of accuracy (although none are 100% consistent in their prediction of this three-dimensional structure). PHD, RNAforester, Jpred, and Predator are all currently used algorithms that have accuracies of approximately 70%. A
website with links to these and other software for structure prediction is given in Table 1. Copy number variants (CNVs), which are related to both genomic structure and function, have recently become a focus of medical genetics. CNVs comprise large segments of inserted or deleted sequence as well as segmental duplications. The genomic distribution of these CNVs can have significant, yet not well-characterized, effects on gene expression, regulation, and interactions. The process of defining these variants is still challenging, and software for managing or analyzing these data is still developing. It is important to note, however, that like other genomic data, CNVs can be analyzed using traditional statistics given a reasonable number of samples and independent observations.

4 OTHER

4.1 Power Calculations/Simulation

Before beginning a study, one typically wants to know the power associated with a given sample, study design, and statistical procedure. Many web-based tools can be used for calculating power and sample size when conducting a simple t- or z-test, ANOVA, logistic regression, or chi-square test based on a contingency table (Table 1 shows links to several). Other downloadable programs include PS (http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize) and UnifyPow (http://www.bio.ri.ccf.org/power.html). Several resources for calculating power for genetic case-control association analyses are available, including Genetic Power Calculator (http://pngu.mgh.harvard.edu/~purcell/gpc/), PAWE-3D (http://linkage.rockefeller.edu/pawe3d), and LRTae (http://linkage.rockefeller.edu/derek/UserGuideLRTae.htm). Far fewer programs are available for calculating power for family-based studies. DESPAIR is one such program that is available via the web.
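As a quick illustration of the kind of calculation these tools automate, base R can compute power for a simple two-group comparison of proportions; the allele frequencies and sample size below are hypothetical.

# Power to detect a case-control difference in allele frequency of 0.30 versus
# 0.38 with 500 alleles per group at a two-sided alpha of 0.05.
power.prop.test(n = 500, p1 = 0.30, p2 = 0.38, sig.level = 0.05)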
4.2 Meta Analysis

Power to detect a given effect can be increased by increasing the sample size. One increasingly popular way of combining the results of several studies that address similar research questions is meta-analysis. Several software programs exist to perform this task. Some programs require raw data, whereas others simply combine P-values or odds ratios. Meta Analysis is a software package that has a graphical user interface and several program options, but it does require a licensing fee. Other meta-analysis programs, like EpiMeta (http://ftp.cdc.gov/pub/Software/epimeta/), EasyMA (http://www.spc.univ-lyon1.fr/~mcu/easyma/), and Statistics Software for Meta-Analysis (http://userpage.fu-berlin.de/~health/meta_e.htm), are downloadable from the authors' websites. These programs are not necessarily specific to genetic data, but given that the meta-analysis is focused on combining P-values or given effect sizes, as long as the statistics used to obtain the results are known, these tools will be perfectly suitable.
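A minimal sketch of the P-value-combination route in R, using hypothetical P-values from four studies: Fisher's method sums -2 log(p), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis of no effect in any study.

p.values <- c(0.04, 0.20, 0.01, 0.11)                      # hypothetical studies
fisher.stat <- -2 * sum(log(p.values))
pchisq(fisher.stat, df = 2 * length(p.values), lower.tail = FALSE)   # combined P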
REFERENCES

1. Amos CI, Zhu DK, Boerwinkle E. Assessing genetic linkage and association with robust components of variance approaches. Ann. Hum. Genet. 1996; 60: 143–160.
2. McKeigue PM. Prospects for admixture mapping of complex traits. Am. J. Hum. Genet. 2005; 76: 1–7.
3. Gudbjartsson DF, Thorvaldsson T, Kong A, et al. Allegro version 2. Nat. Genet. 2005; 37: 1015–1016.
4. Ruschendorf F, Nurnberg P. ALOHOMORA: a tool for linkage analysis using 10K SNP array data. Bioinformatics 2005; 21: 2123–2125.
5. Kilpikari R, Sillanpaa MJ. Bayesian analysis of multilocus association in quantitative and qualitative traits. Genet. Epidemiol. 2003; 25: 122–135.
6. Zhang X, Roeder K, Wallstrom G, Devlin B. Integration of association statistics over genomic regions using Bayesian adaptive regression splines. Hum. Genomics 2003; 1: 20–29.
7. Liu JS, Sabatti C, Teng J, et al. Bayesian analysis of haplotypes for linkage disequilibrium mapping. Genome Res. 2001; 11: 1716–1724.
8. Skol AD, Scott LJ, Abecasis GR, Boehnke M. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat. Genet. 2006; 38: 209–213.
9. de Givry S, Bouchez M, Chabrier P, et al. CARHTA GENE: multipopulation integrated genetic and radiation hybrid mapping. Bioinformatics 2005; 21: 1703–1704.
23. Niu T, Qin ZS, Xu X, Liu JS. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am. J. Hum. Genet. 2002; 70: 157–169.
10. Browning SR, Briley JD, Briley LP, et al. Case-control single-marker and haplotypic association analysis of pedigree data. Genet. Epidemiol. 2005; 28: 110–122.
24. Seltman, Roeder K, Devlin B. Transmission/disequilibrium test meets measured haplotype analysis: family-based association analysis guided by evolution of haplotypes. Am. J. Hum. Genet. 2001; 68: 1250–1263.
11. Epstein MP, Satten GA. Inference on haplotype effects in case-control studies using unphased genotype data. Am. J. Hum. Genet. 2003; 73: 1316–1329.
25. Cottingham RW Jr, Idury RM, Schaffer AA. Faster sequential genetic linkage computations. Am. J. Hum. Genet. 1993; 53: 252–263.
12. Curtis D, North BV, Gurling HM, et al. A quick and simple method for detecting subjects with abnormal genetic background in case-control samples. Ann. Hum. Genet. 2002; 66: 235–244.
26. Horvath S, Xu X, Lake SL, et al. Family-based tests for associating haplotypes with general phenotype data: application to asthma genetics. Genet. Epidemiol. 2004; 26: 61–69.
13. Ao SI, Yip K, Ng M, et al. CLUSTAG: hierarchical clustering and graph methods for selecting tag SNPs. Bioinformatics 2005; 21: 1735–1736.
27. Browning BL. FLOSS: flexible ordered subset analysis for linkage mapping of complex traits. Bioinformatics 2006; 22: 512–513.
14. Ware DH, Jaiswal P, Ni J, et al. Gramene, a tool for grass genomics. Plant Physiol. 2002; 130: 1606–1613. 15. Brun-Samarcq L, Gallina S, Philippi A, et al. CoPE: a collaborative pedigree drawing environment. Bioinformatics 1999; 15: 345–346. 16. Devlin B, Jones BL, Bacanu SA, Roeder K. Mixture models for linkage analysis of affected sibling pairs and covariates. Genet. Epidemiol. 2002; 22: 52–65. 17. Lander ES, Green P. Construction of multilocus genetic linkage maps in humans. Proc. Natl. Acad. Sci. U.S.A. 1987; 84: 2363–2367. 18. Chapman CJ. A visual interface to computer programs for linkage analysis. Am. J. Med. Genet. 1990; 36: 155–160. 19. McPeek MS, Strahs A. Assessment of linkage disequilibrium by the decay of haplotype sharing, with application to fine-scale genetic mapping. Am. J. Hum. Genet. 1999; 65: 858–875. 20. Reeve JP, Rannala B. DMLE+: Bayesian linkage disequilibrium gene mapping. Bioinformatics 2002; 18:894–5.
28. Devlin B, Roeder K. Genomic control for association studies. Biometrics 1999; 55: 997–1004. 29. Thomas A. GCHap: fast MLEs for haplotype frequencies by gene counting. Bioinformatics 2003; 19: 2002–2003. 30. Glidden DV, Liang KY, Chiu YF, Pulver AE. Multipoint affected sibpair linkage methods for localizing susceptibility genes of complex diseases. Genet. Epidemiol. 2003; 24: 107–117. 31. Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES. Parametric and nonparametric linkage analysis: a unified multipoint approach. Am. J. Hum. Genet. 1996; 58: 1347–1363. 32. Purcell S, Cherny SS, Sham PC. Genetic Power Calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics 2003; 19: 149–150. 33. Li C, Scott LJ, Boehnke M. Assessing whether an allele can account in part for a linkage signal: the Genotype-IBD Sharing Test (GIST). Am. J. Hum. Genet. 2004; 74: 418–431. 34. Thomas A. GMCheck: Bayesian error checking for pedigree genotypes and phenotypes. Bioinformatics 2005; 21: 3187–3188.
21. Bafna V, Gusfield D, Lancia G, Yooseph S. Haplotyping as perfect phylogeny: a direct approach. J. Comput. Biol. 2003; 10: 323–340.
35. Abecasis GR, Cookson WO. GOLD-graphical overview of linkage disequilibrium. Bioinformatics 2000; 16: 182–183.
22. Yang Y, et al. Efficiency of single-nucleotide polymorphism haplotype estimation from pooled DNA. Proc. Natl. Acad. Sci. U.S.A. 2003; 100: 7225–7230.
36. Abecasis GR, Cherny SS, Cookson WO, Cardon LR. GRR: graphical representation of relationship errors. Bioinformatics 2001; 17: 742–743.
37. Wise LH, Lanchbury JS, Lewis CM. Metaanalysis of genome searches. Ann. Hum. Genet. 1999; 63: 263–272. 38. Wang L, Xu Y. Haplotype inference by maximum parsimony. Bioinformatics 2003; 19: 1773–1780. 39. Burkett K, McNeney B, Graham J. A note on inference of trait associations with SNP haplotypes and other attributes in generalized linear models. Hum. Hered. 2004; 57: 200–206. 40. Zhang K, Qin Z, Chen T, et al. HapBlock: haplotype block partitioning and tag SNP selection software using a set of dynamic programming algorithms. Bioinformatics 2005; 21: 131–134. 41. Gu S, Pakstis AJ, Kidd KK. HAPLOT: a graphical comparison of haplotype blocks, tagSNP sets and SNP variation for multiple populations. Bioinformatics 2005; 21: 3938–3939. 42. Greenspan G, Geiger D. High density linkage disequilibrium mapping using models of haplotype block variation. Bioinformatics 2004; 20:I137–I144. 43. Thiele H, Nurnberg P. HaploPainter: a tool for drawing pedigrees with complex haplotypes. Bioinformatics 2005; 21: 1730–1732. 44. Zhang, Sun, Zhao. HAPLORE: a program for haplotype reconstruction in general pedigrees without recombination. Bioinformatics 2005; 21: 90–103. 45. Eronen L, Geerts F, Toivonen H. A Markov chain approach to reconstruction of long haplotypes. Pac. Symp. Biocomput. 2004; 104–115. 46. Schaid DJ, Rowland CM, Tines DE, et al. Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am. J. Hum. Genet. 2002; 70: 425–434. 47. Rohde K, Furst R. Haplotyping and estimation of haplotype frequencies for closely linked biallelic multilocus genetic phenotypes including nuclear family information. Hum. Mut. 2001; 17: 289–295. 48. Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 2005; 21: 263–265. 49. Zhang J, Rowe WL, Struewing JP, Buetow KH. HapScope: a software system for automated and visual analysis of functionally annotated haplotypes. Nucleic Acids Res. 2002; 30: 5213–5221.
50. Zintzaras E, Ioannidis JP. HEGESMA: genome search meta-analysis and heterogeneity testing. Bioinformatics 2005; 21: 3672–3673. 51. Zaykin DV, Westfall PH, Young SS, et al. Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Hum. Hered. 2002; 53: 79–91. 52. Li SS, Khalid N, Carlson C, Zhao LP. Estimating haplotype frequencies and standard errors for multiple single nucleotide polymorphisms. Biostatistics 2003; 4: 513–522. 53. Zhang S, Sha Q, Chen HS, et al. Transmission/disequilibrium test based on haplotype sharing for tightly linked markers. Am. J. Hum. Genet. 2003; 73: 566–579. 54. Knapp M, Strauch K. Affected-sib-pair test for linkage based on constraints for identicalby-descent distributions corresponding to disease models with imprinting. Genet Epidemiol. 2004; 26: 273–285. 55. Yang, Wang, Gingle. IntegratedMap: a web interface for integrating genetic map data. Genet. Epidemiol. 2005; 26: 273–285. 56. Li MD, Boehnke M, Abecasis GR. Efficient study designs for test of genetic association using sibship data and unrelated cases and controls. Am. J. Hum. Genet. 2006; 78: 778–792. 57. Langella, Chikhi, Beaumont. LEA (Likelihood-based estimation of admixture): a program to simultaneously estimate admixture and the time since admixture. Molecular Ecology Notes 2006; 1: 357–358. 58. Lathrop GM, Lalouel JM, Julier C, Ott J. Strategies for multilocus linkage analysis in humans. Proc. Natl. Acad. Sci. U.S.A. 1984; 81: 3443–3446. 59. Feng R, Leckman JF, Zhang H. Linkage analysis of ordinal traits for pedigree data. Proc. Natl. Acad. Sci. U.S.A. 2004; 101: 16739–16744. 60. Gordon, Yang, Haynes, et al. Increasing power for tests of genetic association in the presence of phenotype and/or genotype error by use of double-sampling. Stat. Appl. Genet. Mol. Biol. 2004; 3. 61. Montana G, Pritchard J. Statistical tests for admixture mapping with case-control and cases-only data. Am. J. Hum. Genet. 2004; 77: 771–789. 62. Mukhopadhyay N, Almasy L, Schroeder M, et al. Mega2: data-handling for facilitating genetic linkage and association analyses. Bioinformatics 2005; 21: 2556–2557.
63. Lange K, Weeks D, Boehnke M. Programs for Pedigree Analysis: MENDEL, FISHER, and dGENE. Genet. Epidemiol. 1988; 5: 471–472. 64. Abecasis GR, Cherny SS, Cookson WO, Cardon LR. Merlin--rapid analysis of dense genetic maps using sparse gene flow trees. Nat. Genet. 2002; 30: 97–101.
65. Gregorius HR. The probability of losing an allele when diploid genotypes are sampled. Biometrics 1980; 36: 643–652.
66. Matise TC, Perlin M, Chakravarti A. Automated construction of genetic linkage maps using an expert system (MultiMap): a human genome linkage map. Nat Genet 1994; 6: 384–390.
67. Hauser ER, Watanabe RM, Duren WL, et al. Ordered subset analysis in genetic linkage mapping of complex traits. Genet. Epidemiol. 2004; 27: 53–63.
68. Cercueil A, Bellemain E, Manel S. PARENTE: computer program for parentage analysis. J. Hered. 2002; 93: 458–459.
69. Gordon D, Haynes C, Blumenfeld J, Finch SJ. PAWE-3D: visualizing power for association with error in case-control genetic studies of complex traits. Bioinformatics 2005; 21: 3935–3937. 70. Van SK, Lange C. PBAT: a comprehensive software package for genome-wide association analysis of complex family-based studies. Hum. Genomics 2005; 2: 67–69.
71. Martin ER, Monks SA, Warren LL, Kaplan NL. A test for linkage and association in general pedigrees: the pedigree disequilibrium test. Am. J. Hum. Genet. 2000; 67: 146–154.
72. Plendl H. Medizinische Genetik 1998; 1: 50–51.
73. O’Connell JR, Weeks DE. PedCheck: a program for identification of genotype incompatibilities in linkage analysis. Am. J. Hum. Genet. 1998; 63: 259–266. 74. Allen-Brady, Wong, Camp. PedGenie: an analysis approach for genetic association testing in extended pedigrees and genealogies of arbitrary size. Bioinformatics. In press.
75. Kirichenko. An algorithm of step-by-step pedigree drawing. Genetika 2003; 40: 1425–1428. 76. Wigginton JE, Abecasis GR. PEDSTATS: descriptive statistics, graphics and quality assessment for gene mapping data. Bioinformatics 2005; 21: 3445–3447.
77. Dudbridge F, Carver T, Williams GW. Pelican: pedigree editor for linkage computer analysis. Bioinformatics 2004; 20: 2327–2328.
78. Stephens M, Smith NJ, Donnelly P. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 2001; 68: 978–989.
79. Qin ZS, Niu T, Liu JS. Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. Am. J. Hum. Genet. 2002; 71: 1242–1247.
80. Garcia-Closas M, Lubin JH. Power and sample size calculations in case-control studies of gene-environment interactions: comments on different approaches. Am. J. Epidemiol. 1999; 149: 689–692.
81. McPeek MS, Sun L. Statistical tests for detection of misspecified relationships by use of genome-screen data. Am. J. Hum. Genet. 2000; 66: 1076–1094.
82. Goring HH, Terwilliger JD. Linkage analysis in the presence of errors IV: joint pseudomarker analysis of linkage and/or linkage disequilibrium on a mixture of pedigrees and singletons when the mode of inheritance cannot be accurately specified. Am. J. Hum. Genet. 2000; 66: 1310–1327.
83. Abecasis GR, Cardon LR, Cookson WO. A general test of association for quantitative traits in nuclear families. Am. J. Hum. Genet. 2000; 66: 279–292.
84. Gauderman WJ. Candidate gene association analysis for a quantitative trait, using parent-offspring trios. Genet. Epidemiol. 2003; 25: 327–338.
85. Knapp M. The transmission/disequilibrium test and parental-genotype reconstruction: the reconstruction-combined transmission/disequilibrium test. Am. J. Hum. Genet. 1999; 64: 861–870.
86. Broman KW, Weber JL. Estimation of pairwise relationships in the presence of genotyping errors. Am. J. Hum. Genet. 1998; 63: 1563–1564.
87. Province MA, Rao DC. General purpose model and a computer program for combined segregation and path analysis (SEGPATH): automatically creating computer programs from symbolic language model specifications. Genet. Epidemiol. 1995; 12: 203–219.
88. Douglas JA, Boehnke M, Lange K. A multipoint method for detecting genotyping errors and mutations in sibling-pair linkage data. Am. J. Hum. Genet. 2000; 66: 1287–1297.
89. Duffy. Am. J. Hum. Genet. 1997; 61: A197.
90. Franke D, Kleensang A, Ziegler A. SIBSIM, quantitative phenotype simulation in extended pedigrees. GMS Med. Inform. Biom. Epidemiol. 2006; 2: Doc04. 91. Bass MP, Martin ER, Hauser ER. Detection and integration of genotyping errors in statistical genetics. Am. J. Hum. Genet. 2002;suppl 71: 569. 92. Leal SM, Yan K, Muller-Myhsok B. SimPed: a simulation program to generate haplotype and genotype data for pedigree structures. Hum. Hered. 2005; 60: 119–122. 93. Irwin M, Cox N, Kong A. Sequential imputation for multilocus linkage analysis. Proc. Natl. Acad. Sci. U.S.A. 1994; 91: 11684–11688. 94. Sobel E, Lange K. Descent graphs in pedigree analysis: applications to haplotyping, location scores, and marker-sharing statistics. Am. J. Hum. Genet. 1996; 58: 1323–1337. 95. Shimo-onoda K, Tanaka T, Furushima K, et al. Akaike's information criterion for a measure of linkage disequilibrium. J. Hum. Genet. 2002; 47: 649–655. 96. Chiano MN, Clayton DG. Fine genetic mapping using haplotype analysis and the missing data problem. Ann. Hum. Genet. 1998; 62(Pt 1): 55–60. 97. Zhao LJ, Li MX, Guo YF, et al. SNPP: automating large-scale SNP genotype data management. Bioinformatics 2005; 21: 266–8. 98. Almasy L, Blangero. Multipoint quantitative-trait linkage analysis in general pedigrees. Am. J. Hum. Genet. 1998; 62: 1198–1211. 99. Poznik GD, Adamska K, Xu X, et al. A novel framework for sib pair linkage analysis. Am. J. Hum. Genet. 2006; 78: 222–230. 100. Pritchard JK, Stephens M, Rosenberg NA, Donnelly P. Association mapping in structured populations. Am. J. Hum. Genet. 2000; 67: 170–181. 101. Fishelson M, Geiger D. Exact genetic linkage computations for general pedigrees. Bioinformatics 2002; 18: S189–S198. 102. Zhang Z, Bradbury PJ, Kroon DE, et al. TASSEL 2.0: a software package for association and diversity analyses in plants and animals. Plant & Animal Genomes XIV Conference 2006. 103. Gordon D, Heath SC, Liu X, Ott J. A transmission/disequilibrium test that allows for genotyping errors in the analysis of single-nucleotide polymorphism data. Am. J. Hum. Genet. 2001; 69: 371–380.
104. Chen WM, Deng HW. A general and accurate approach for computing the statistical power of the transmission disequilibrium test for complex disease genes. Genet. Epidemiol. 2001; 21: 53–67. 105. Knapp M. A note on power approximations for the transmission/disequilibrium test. Am. J. Hum. Genet. 1999; 64: 1177–1185. 106. Tregouet DA, Escolano S, Tiret L, et al. A new algorithm for haplotype-based association analysis: the Stochastic-EM algorithm. Ann Hum Genet 2004; 68: 165–177. 107. Beckmann L, Thomas DC, Fischer C, ChangClaude J. Haplotype sharing analysis using mantel statistics. Hum. Hered. 2005; 59: 67–78. 108. Zollner S, Wen X, Pritchard JK. Association mapping and fine mapping with TreeLD. Bioinformatics 2005; 21: 3168–3170. 109. MacLean CJ, Martin RB, Sham PC, et al. The trimmed-haplotype test for linkage disequilibrium. Am. J. Hum. Genet. 2000; 66: 1062–1075. 110. Dudbridge F. Pedigree disequilibrium tests for multilocus haplotypes. Genet. Epidemiol. 2003; 25: 115–121. 111. Gordon D, Abajian C, Green P. Consed: a graphical tool for sequence finishing. Genome Res. 1998; 8: 195–202. 112. Bengtsson H, Calder B, Mian IS, et al. Identifying differentially expressed genes in cDNA microarray experiments authors. Sci. Aging Knowledge Environ. 2001; 2001: vp8. 113. Townsend JP, Hartl DL. Bayesian analysis of gene expression levels: statistical quantification of relative mRNA level across multiple strains or treatments. Genome Biol. 2002; 3:RESEARCH0071. 114. Saal LH, Troein C, Vallon-Christersson J, et al. BioArray Software Environment (BASE): a platform for comprehensive management and analysis of microarray data. Genome Biol. 2002; 3:SOFTWARE0003. 115. Dudoit S, Fridlyand J, Speed TP. Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Available: http://www.statberkeley.edu/∼sandrine/ tecrep/576pdf2000. 116. Ramoni MF, Sebastiani P, Kohane IS. Cluster analysis of gene expression dynamics. Proc. Natl. Acad. Sci. U.S.A. 2002; 99: 9121–9126. 117. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genomewide expression patterns. Proc. Natl. Acad.
Sci. U.S.A. 1998; 95: 14863–14868.
118. Li C, Wong WH. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl. Acad. Sci. U.S.A. 2001; 98: 31–36. 119. Kendziorski CM, Newton MA, Lan H, Gould MN. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat. Med. 2003; 22: 3899–3914. 120. Luscombe NM, Royce TE, Bertone P, et al. ExpressYourself: A modular platform for processing and visualizing microarray data. Nucleic Acids Res. 2003; 31: 3477–3482. 121. Masseroli M, Cerveri P, Pelicci PG, Alcalay M. GAAS: gene array analyzer software for management, analysis and visualization of gene expression data. Bioinformatics 2003; 19: 774–775. 122. Hastie T, Tibshirani R, Eisen MB, et al. ‘Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol. 2000; 1:RESEARCH0003. 123. Dysvik B, Jonassen I. J-Express: exploring gene expression data using Java. Bioinformatics 2001; 17: 369–370. 124. Simpson CL, Hansen VK, Sham PC, et al. MaGIC: a program to generate targeted marker sets for genome-wide association studies. Biotechniques 2004; 37: 996–999. 125. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. U.S.A. 2002; 99: 6567–6572. 126. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U.S.A. 2001; 98: 5116–5121. 127. Colantuoni C, Henry G, Zeger S, Pevsner J. SNOMAD (Standardization and NOrmalization of MicroArray Data): web-accessible gene expression data analysis. Bioinformatics 2002; 18: 1540–1541. 128. Tebbutt SJ, Opushnyev IV, Tripp BW, et al. SNP Chart: an integrated platform for visualization and interpretation of microarray genotyping data. Bioinformatics 2005; 21: 124–127. 129. Saeed AI, Sharov V, White J, et al. TM4: a free, open-source system for microarray data management and analysis. Biotechniques 2003; 34: 374–378.
FURTHER READING Hartl DL, Clark AG. Principles of Population Genetics. 1997. Sinauer Associates, Sunderland, MA. Lynch M, Walsh B. Genetics and Analysis of Quantitative Traits. 1998. Sinauer Associates, Sunderland, MA. Ott J. Analysis of Human Genetic Linkage, 3rd edition. 1999. The Johns Hopkins University Press, Baltimore, MD. Strachan T, Read AP. Human Molecular Genetics 3. 2004. Garland Science, New York. Thomas DC. Statistical Methods in Genetic Epidemiology. New York: Oxford University Press, 2004.
SPONSOR
A Sponsor is an individual, company, institution, or organization that takes responsibility for the initiation, management, and/or financing of a clinical trial.

1 QUALITY ASSURANCE AND QUALITY CONTROL

The sponsor is responsible for implementing and maintaining quality assurance and quality control systems with written Standard Operating Procedures (SOPs) to ensure that trials are conducted and data are generated, documented (recorded), and reported in compliance with the protocol, Good Clinical Practice (GCP), and the applicable regulatory requirement(s). The sponsor is responsible for securing agreement from all involved parties to ensure direct access to all trial-related sites, source data/documents, and reports for the purpose of monitoring and auditing by the sponsor and of inspection by domestic and foreign regulatory authorities. Agreements made by the sponsor with the investigator/institution and/or with any other parties involved with the clinical trial should be in writing, as part of the protocol or in a separate agreement.

2 TRIAL DESIGN

The sponsor should use qualified individuals (e.g., biostatisticians, clinical pharmacologists, and physicians) as appropriate throughout all stages of the trial process, from designing the protocol and Case Report Forms (CRFs) and planning the analyses to analyzing and preparing interim and final clinical trial/study reports.

3 MEDICAL EXPERTISE

The sponsor should designate appropriately qualified medical personnel who will be readily available to advise on trial-related medical questions or problems. If necessary, outside consultant(s) may be appointed for this purpose.

4 TRIAL MANAGEMENT, DATA HANDLING, RECORDKEEPING, AND INDEPENDENT DATA MONITORING COMMITTEE

The sponsor should use appropriately qualified individuals to supervise the overall conduct of the trial, to handle the data, to verify the data, to conduct the statistical analyses, and to prepare the trial reports. The sponsor may consider establishing an independent data monitoring committee (IDMC) to assess the progress of a clinical trial, including the safety data and the critical efficacy endpoints at intervals, and to recommend to the sponsor whether to continue, modify, or stop a trial. The IDMC should have written operating procedures and maintain written records of all its meetings.

When using electronic trial data handling and/or remote electronic trial data systems, the sponsor should:

• Ensure and document that the electronic data processing system(s) conforms to the requirements established by the sponsor for completeness, accuracy, reliability, and consistent intended performance (i.e., validation).
• Maintain SOPs for using these systems.
• Ensure that the systems are designed to permit data changes in such a way that the data changes are documented and that no entered data are deleted (i.e., maintain an audit trail, data trail, and edit trail).
• Maintain a security system that prevents unauthorized access to the data.
• Maintain a list of the individuals who are authorized to make data changes.
• Maintain adequate backup of the data.
• Safeguard the blinding, if any (e.g., maintain the blinding during data entry and processing).

If data are transformed during processing, it should always be possible to compare the original data and observations with the processed data. The sponsor should use an unambiguous subject identification code that allows identification of all the data reported for each subject.

The sponsor, or other owners of the data, should retain all of the sponsor-specific essential documents that pertain to the trial, in conformance with the applicable regulatory requirement(s) of the country(ies) where the product is approved and/or where the sponsor intends to apply for approval(s). If the sponsor discontinues the clinical development of an investigational product (i.e., for any or all indications, routes of administration, or dosage forms), the sponsor should maintain all sponsor-specific essential documents for at least two years after formal discontinuation or in conformance with the applicable regulatory requirement(s), and should notify all the trial investigators/institutions and all the appropriate regulatory authorities. Any transfer of ownership of the data should be reported to the appropriate authority(ies) as required by the applicable regulatory requirement(s). The sponsor-specific essential documents should be retained until at least two years after the last approval of a marketing application in an ICH (International Conference on Harmonisation) region and until no pending or contemplated marketing applications exist in an ICH region, or until at least two years have elapsed since the formal discontinuation of the clinical development of the investigational product. These documents should be retained for a longer period, however, if required by the applicable regulatory requirement(s) or if needed by the sponsor. The sponsor should inform the investigator(s)/institution(s) in writing of the need for record retention and should notify the investigator(s)/institution(s) in writing when the trial-related records are no longer needed.

5 ALLOCATION OF DUTIES AND FUNCTIONS

Before initiating a trial, the sponsor should define, establish, and allocate all trial-related duties and functions.

6 COMPENSATION TO SUBJECTS AND INVESTIGATORS

If required by the applicable regulatory requirement(s), the sponsor should provide insurance or should indemnify (provide legal and financial coverage for) the investigator/institution against claims arising from the trial, except for claims that arise from malpractice and/or negligence. The policies and procedures of the sponsor should address the costs of treatment of trial subjects in the event of trial-related injuries in accordance with the applicable regulatory requirement(s). When trial subjects receive compensation, the method and manner of compensation should comply with applicable regulatory requirement(s).

7 FINANCING

The financial aspects of the trial should be documented in an agreement between the sponsor and the investigator/institution.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D'Agostino and Sarah Karl.
SPONSOR-INVESTIGATOR

A Sponsor-Investigator is an individual who both initiates and conducts, alone or with others, a clinical trial, and under whose immediate direction the investigational product is administered to, dispensed to, or used by a subject. The term does not include any person other than an individual (e.g., it does not include a corporation or an agency). The obligations of a sponsor-investigator include both those of a sponsor and those of an investigator.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
SPONTANEOUS REPORTING SYSTEM (SRS)
The Division of Pharmacovigilance and Epidemiology (DPE) at the Center for Drug Evaluation and Research (CDER) maintains a Spontaneous Reporting System (SRS) that contains the adverse drug reaction reports from hospitals, health-care providers, and lay persons that are sent either directly to the Agency (via MedWatch) or first to the drug manufacturer and then, by regulation, to the Agency by the manufacturer. SRS was replaced by an expanded system called the Adverse Event Reporting System (AERS). AERS is the result of efforts to implement many agreements from the International Conference on Harmonisation (ICH) as well as new regulations and pharmacovigilance processes of the Food and Drug Administration (FDA) to increase the efficiency with which CDER receives, files, and analyzes these reports. These reports are triaged through the MedWatch program and then forwarded to the appropriate center (Drugs, Biologics, Foods, or Veterinary). Adverse drug reaction reports are also sent directly from the sponsors of New Drug Applications (NDAs) to the respective division. When either of these types of reports is received, it is entered into the computerized SRS. The SRS is maintained and used by the data processing, epidemiology, and statistics staff of the DPE. Their efforts are aimed at actively analyzing the data through recognition of Adverse Drug Reaction (ADR) patterns that might indicate a public health problem (a "signal"). Improving access to the data facilitates timely evaluation of aggregates of Adverse Drug Event (ADE) reports, which often are the first signals of a potential problem. The individual reports of serious adverse events are then reviewed critically and individually by staff trained in the analysis of these data and signal generation.

DPE receives approximately 250,000 adverse experience reports possibly associated with drug use annually. Approximately 25% of the reports received by CDER are reports of serious and unlabeled (or 15-day) and/or Direct Reports. The primary focus of the DPE reviews is to detect serious unlabeled reactions. Adverse experience reports are reviewed and analyzed to generate signals of serious, yet unrecognized, drug-associated events. These signals are communicated within DPE to staff epidemiologists and to the relevant review division via written summaries and safety conferences. When the DPE suspects that manufacturers have not been reporting ADRs as required, the DPE prepares summaries of adverse drug experience reporting deficiencies and forwards this information to the CDER Office of Compliance, Division of Prescription Drug Compliance and Surveillance (DPDCS). Based on such information, the DPDCS issues inspectional assignments to FDA field offices to follow up these deficiencies at the pertinent firm. The DPDCS evaluates the information provided by the DPE along with the inspectional findings and determines whether additional regulatory action is indicated. In addition, the DPE represents the Office of Epidemiology and Biostatistics on the Therapeutic Inequivalency Action Coordinating Committee (TIACC). The DPE representative assists in the investigation and resolution of claims of alleged drug bioinequivalency. In this way, CDER works to prevent injury from drugs that are superpotent or subpotent because of manufacturing errors.

This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/handbook/adverse.htm) by Ralph D'Agostino and Sarah Karl.
STABILITY ANALYSIS
YI TSONG and WEN JEN CHEN
Office of Biostatistics/Office of Pharmacoepidemiology and Biostatistical Sciences, Center for Drug Evaluation and Research, U.S. Food and Drug Administration

1 INTRODUCTION

Shelf life of a drug product is defined as the length of time, under the specified conditions of storage, that the product will remain within acceptance criteria established to ensure its identity, strength, quality, and purity. Regulatory agencies request that an adequate stability study be performed by an applicant to collect chemical data on the batch at prespecified time points in order to provide supporting evidence of product stability and establish a proposed shelf life (1–7). In the simplest case, a sufficient number of container units from a batch are put in long-term storage in a controlled environment (e.g., 25°C/60% RH), and samples are taken from the chamber at predetermined time points, namely at 0, 3, 6, 9, 12, 18, 24, 36, 48, and so on, months and tested for appropriate physical, chemical, biological, and microbiological attributes. For example, for solid oral dosage forms, such as tablets and capsules, the International Conference on Harmonisation (ICH) guidance indicates the following characteristics should be studied:

– Tablets: appearance, friability, hardness, color, odor, moisture, strength, and dissolution.
– Capsules: strength, moisture, color, appearance, shape, bitterness, and dissolution.

The concentration of time points during the first year provides an early warning system for unexpected loss in stability. Some of these characteristics are measured on a discrete rating scale, and the responses based on the discrete rating are often classified into acceptable/unacceptable binary outcomes. However, in most stability studies, continuous responses such as potency are of primary concern, and they are represented by the batch mean. When the value of the batch mean changes linearly, for a single strength/container size configuration of a drug product, the shelf life of a batch is the first intersection of the true regression line with the specified acceptance criteria. In practice, the true regression line is estimated from the collected data, which are subject to measurement variation; the shelf life is thus estimated by the intersection of the 95% confidence bound of the regression line with the acceptance criteria. As a result of the limited number of observations (i.e., time points), the conventional approach is to determine whether the proposed shelf life is supported by comparing the estimated date of the intersection with the proposed date of shelf life. The proposed shelf life of a drug product is sufficiently supported if the estimated shelf life (i.e., the shortest point of intersection of the confidence bounds) is longer than the proposed shelf life or the expected response value at the proposed shelf life is within the specification limit.

In order to determine a single shelf life for all the future batches of a drug product, batch-to-batch variation needs to be taken into consideration. The 1987 FDA guidance (1) and the draft ICH guidance (2–4) request that at least three batches of the same product be studied before approval for marketing in order to determine the common shelf life. If all projected shelf lives of individual batches are longer than the sponsor-proposed shelf life, then a common shelf life based on the proposed shelf life is supported. Otherwise, the shelf life is determined by the shortest of the three batches. However, often with a limited number of observations, it is difficult to decide, without statistical analysis, which batch has the shortest shelf life or whether all batches share the same slope or intercept. In order to make such a decision, preliminary statistical tests for slope or intercept pooling were recommended by the 1987 FDA Guideline (1) and the ICH Guidances (2–4). In an effort to save cost and resources, several complicated designs have been proposed (8–12), including multiple-factor designs, fractional factorial designs, and other strategies. However, their applicability to a given case needs to be cautiously evaluated.

Stability data are conventionally analyzed with an ANCOVA (analysis of covariance) model (1, 3, 5–7). In the data analysis, it is determined whether data from the batches can be pooled, which usually leads to a longer shelf life estimate or more powerful support of the target shelf life. For a more complex design with multiple factors, data from different levels of the factors may also be pooled subject to the statistical tests, unless the pooling can be predetermined based on historical data or chemical knowledge. Conventionally, a large significance level of 0.25 is used in the pooling tests of the simple multiple-batch stability study to accommodate the lack of power of pooling tests with the limited sample size (1, 3, 5–7, 13–16). For simplicity of presentation and without loss of generality, the authors will focus on shelf life determination based on drug potency using two-sided acceptance limits in the following sections.
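To make the single-batch estimation procedure concrete, the following sketch (illustrative only, not part of the original article; the data, variable names, and lower specification limit are hypothetical) fits a straight-line degradation model and reads off the first time at which the one-sided 95% lower confidence bound for the mean potency reaches the acceptance criterion.

```python
# Minimal sketch: shelf life of a single batch estimated as the first time at
# which the 95% lower confidence bound for the mean potency crosses SL.
import numpy as np
from scipy import stats

months = np.array([0, 3, 6, 9, 12, 18, 24])                       # assay time points
potency = np.array([100.2, 99.6, 99.1, 98.4, 98.0, 96.9, 95.8])   # % of label claim (hypothetical)
lower_spec = 90.0                                                  # acceptance criterion SL (assumed)

n = len(months)
b, a = np.polyfit(months, potency, 1)                              # slope, intercept
resid = potency - (a + b * months)
s2 = resid @ resid / (n - 2)                                       # residual variance
sxx = ((months - months.mean()) ** 2).sum()
tcrit = stats.t.ppf(0.95, df=n - 2)                                # one-sided 95%

def lower_bound(t):
    """95% lower confidence bound for the mean response at time t."""
    se = np.sqrt(s2 * (1.0 / n + (t - months.mean()) ** 2 / sxx))
    return a + b * t - tcrit * se

grid = np.arange(0, 60.01, 0.01)                                   # months
below = grid[lower_bound(grid) <= lower_spec]
if below.size:
    print(f"estimated shelf life: {below[0]:.1f} months")
else:
    print("lower bound stays above the specification over the grid")
```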
2 ANCOVA MODELING AND POOLING TESTS

Let Y(t) denote the observed value of a batch at month t. The individual linear regression of each batch can be represented by the following linear regression line:

Y(t) = α + βt + εt    (1)

where α and β are the intercept and slope of the batch, respectively, and εt is the model error term that follows a N(0, σ) distribution. The true shelf life t of the batch is the date when the expected regression line intersects with either the lower acceptance criterion SL or the upper acceptance criterion SU. With the data collected, the shelf life is estimated by the first date at which either of the two 95% confidence limits of the expected regression line intersects with the acceptance criteria, SL or SU. To model the linear regressions of multiple batches of the same drug product, one uses the ANCOVA model

Yi(t) = α0 + αi + (β0 + βi)t + ε    (2)

where α0 and β0 are the common intercept and slope of all batches, respectively, and αi and βi are the deviations of the individual intercept and slope of the ith batch from α0 and β0, respectively. When evaluating the stability data in support of a single proposed shelf life T0 for all batches of the same drug product, the shortest estimated shelf life T* of all the batches (at least three) in Equation (2) will be compared with T0. T0 is granted if T* > T0. In practice, T0 is specified with an increment of six months if it is longer than one year. In general, T0 is no more than 6 to 12 months beyond the last observation date. Often, the batches may share the same regression line or have the same changing rate (slope). The pooling tests of slope and intercept were proposed to be performed in a hierarchical order such that the poolability of slope is tested before the intercept. It is to test

H0β: βi = 0 versus Haβ: βi ≠ 0    (3)

Once H0β is tested and not rejected, it means that the three batches share a common slope, so one proceeds to test

H0α: αi = 0 versus Haα: αi ≠ 0    (4)
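As an illustration of how these hierarchical pooling tests could be carried out in practice (a sketch on simulated data, not the authors' code), the two F-tests can be obtained from nested-model comparisons: the slope test compares the common-slope model against batch-specific slopes, and, if the slopes are poolable, the intercept test compares a single line against batch-specific intercepts, each at the 0.25 significance level mentioned in the Introduction.

```python
# Minimal sketch of the hierarchical slope/intercept pooling tests (simulated data).
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
data = pd.DataFrame({
    "batch": np.repeat(["A", "B", "C"], 7),
    "time": np.tile([0, 3, 6, 9, 12, 18, 24], 3),
})
data["potency"] = 100 - 0.2 * data["time"] + rng.normal(0, 0.3, len(data))

m_full = smf.ols("potency ~ C(batch) + time + C(batch):time", data).fit()   # individual slopes
m_common_slope = smf.ols("potency ~ C(batch) + time", data).fit()           # common slope
m_single_line = smf.ols("potency ~ time", data).fit()                       # fully pooled

slope_test = sm.stats.anova_lm(m_common_slope, m_full)      # F-test of H0beta (equal slopes)
print(slope_test)
if slope_test["Pr(>F)"].iloc[-1] > 0.25:                     # slopes poolable at the 0.25 level
    intercept_test = sm.stats.anova_lm(m_single_line, m_common_slope)   # F-test of H0alpha
    print(intercept_test)
```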
Depending on the results of testing H0β and H0α, one can determine whether the batches should be pooled in estimating the shelf life using a common intercept or slope. Conventionally, shelf life determination is carried out with a procedure that consists of two stages: the model determination stage and the shelf life estimation or testing stage. The hypothesis of common slope is tested first. If the hypothesis of common slope is not rejected, it is followed by the test of the hypothesis of common intercept. The two tests are performed in the first stage with a single ANCOVA analysis of Equation (2), using F-tests for the hypotheses in Equation (3) and Equation (4) based on type I sums of squares (1, 5, 7, 17). With the type I sum of squares, the hypotheses in Equation (3) are tested using R(βi | α0, β0, αi), the sum of squares of βi adjusted for α0, β0, and αi. The hypotheses in Equation (4) are tested with R(αi | α0, β0), the sum of squares of αi adjusted for α0 and β0. Once a model is determined, shelf life testing and estimation can be carried out using the final model.

The objective of the NDA stability study is to demonstrate that the applicant can produce a product with consistent behavior, which means the same changes over time from batch to batch. It is expected that the applicant can control the initial value of all the batches to be released using a properly chosen criterion. Hence, the pooling tests are structured in a hierarchical order such that the slope test is carried out before the intercept test. The ANCOVA model in Equation (2) may be used to represent a common slope-common intercept (i.e., βi = 0 and αi = 0) model, a common slope-individual intercept (i.e., βi = 0 and αi ≠ 0) model, an individual slope-common intercept (i.e., βi ≠ 0 and αi = 0) model, or an individual slope-individual intercept (i.e., βi ≠ 0 and αi ≠ 0) model. In practice, the individual slope-common intercept model is, in general, not of interest.

Conventionally, the modeling of a simple study design with multiple batches is carried out with hierarchical pooling tests of slope and intercept based on the type I sums of squares when using SAS PROC ANCOVA or GLM (17). When the same ANCOVA-type I sum of squares approach is extended to a complex multiple-factor design, a prespecified hierarchical ordering of the pooling tests is required (18). For example, when the single two-stage ANCOVA model is applied to a study designed with two levels of strength and three batches in a cross-classification setting, the poolability of the factors cannot be tested from the results of the type I sums of squares using SAS PROC GLM or ANCOVA without such a prespecified ordering. The model is

Ysbt = α + αs + αb + αsb + βt + βb t + βs t + βsb t + εsbt    (5)
where the subscript s = 1 to S represents the level of the strength, b = 1 to B represents the level of batch; αs represent the intercept of the regression lines, α i represents the effect of factor i on intercept, α ij represents the effect of two-way interaction of factor i and j
on intercept; Similarly βs represent the slope coefficients of regression lines. Finally, εsbt is iid data random error term follows standard normal distribution with mean zero and variance σ 2 . When the following hierarchical order of the factors, time, batch, time∗ batch, strength, time∗ strength, batch∗ strength, time∗ batch∗ strength is used in the analysis, the order of pooling tests are in the reverse order of the factors. Using type I SS, an F-test of each factor or interaction is based on sum of squares adjusted for the factors and interactions precede it in the ordering. The process of the pooling test stops whenever a significant F-value is encountered. For example, if H0 : αsb = 0 is rejected, the ANCOVA model to be used in the second stage is determined as Ysbt = α + αs + αb + αsb + βt + βb t + βs t + εsbt even if the slope is common across all batches or across all levels of the strength. When no batch-by-strength interaction for slope and intercept exists, a different hierarchical ordering for the partitioning of the sum of squares, such as time, strength, time∗ strength, batch, time∗ batch
may lead to different shelf life of the product from using Equation (5). The issue becomes more complicated with an even larger number of factors. For a stability design with two factors (e.g., strength and container size) and multiple batches, the ANCOVA model with all factors and interactions can be represented by Yspbt = α + αs + αp + αb + αsb + αsp + αpb + αspb + βt + βs t + βb t + βp t + βsb t + βsp t + βpb t + βspb t + εspbt
(6)
where the subscript p = 1 to P represents the level of container size, and αijk represents the effect of the three-way interaction of factors i, j, and k on the intercept. Similarly, the β terms represent the corresponding slope coefficients of the regression lines. Finally, εspbt is the iid random error term that follows a normal distribution with mean zero and variance σ2. A hierarchical ordering of the pooling tests may be given as follows:

1. H0: βspb = 0, strength-by-container size-by-batch interaction for slope;
2. H0: αspb = 0, strength-by-container size-by-batch interaction for intercept;
3. H0: βsb = 0, strength-by-batch interaction for slope;
4. H0: αsb = 0, strength-by-batch interaction for intercept;
5. H0: βpb = 0, container size-by-batch interaction for slope;
6. H0: αpb = 0, container size-by-batch interaction for intercept;
7. H0: βsp = 0, strength-by-container size interaction for slope;
8. H0: αsp = 0, strength-by-container size interaction for intercept;
9. H0: βb = 0, batch slope;
10. H0: αb = 0, batch intercept;
11. H0: βs = 0, strength slope;
12. H0: αs = 0, strength intercept;
13. H0: βp = 0, container size slope; and
14. H0: αp = 0, container size intercept.

When no scientific restriction exists, the hierarchical ordering of the pooling tests involving batch, strength, and container size is not unique. Six possible ordering arrangements exist for testing βsb, βsp, and βpb, and six arrangements exist for βb, βp, and βs. Without a prespecified testing order, there could be as many as 36 different test orders in the analysis. As a result of the "sudden death" nature of one-step model selection, any of the alternative arrangements may lead to a different shelf life of the product for each strength-container size configuration. Alternatively, Tsong et al. (16, 18) proposed to use a more flexible stepwise modeling with type III sums of squares (17) under some proper restrictions on the order of the
pooling tests. This approach will accommodate the possibility of eliminating any of the terms within the limitation of exchangeable ordering. To illustrate, consider the following model:
Yspbt = α + αs + αp + αb + αsb + αsp + αpb + βt + βs t + βb t + βp t + βsb t + βsp t + βpb t + εspbt The F-values of testing βsb = 0, βsp = 0, and βpb = 0 are calculated using type III SS, R(βsb |α, αs , αp , αb , αsb , αsp , αpb , β, βs , βp , βb , βsp , βpb ), R(βsp |α, αs , αp , αb , αsb , αsp , αpb , β, βs , βp , βb , βsb , βpb ), and R(βpb |α, αs , αp , αb , αsb , αsp , αpb , β, βs , βp , βb , βsp , βsb ), respectively. The term with the largest nonsignificant Pvalue (say β sb ) is to be eliminated from the model. Then, in the reduced model, Yspbt = α + αs + αp + αb + αsb + αsp + αpb + βt + βs t + βb t + βp t + βsp t + βpb t + εspbt α sb , the corresponding intercept term, is to be tested for pooliability. It is eliminated from the model if the F-value using R(αsb |α, αs , αp , αb , αsp , αpb , β, βs , βp , βb , βsp , βpb ) for H0: αsb = 0 is not significant. In this case, the model is further reduced to Yspbt = α + αs + αp + αb + αsp + αpb + βt + βs t + βp t + βb t + βsp t + βpb t + εspbt Otherwise, either β sp or β pb with the larger nonsignificant P-value calculated using R(βsp | α, αs , αp , αb , αsp , αpb , β, βs , βp , βb , βpb ) and R(βsp |α, αs , αp , αb , αsb , αsp , αpb , β, βs , βp , βb , βsb ), respectively, is to be eliminated. Depending on the testing results of β sp and β pb , the model may be further reduced to one of the following two: Yspbt = α + αs + αp + αb + αsb + αsp + αpb + βt + βs t + βp t + βb t + βpb t + εspbt
or Yspbt = α + αs + αp + αb + αsb + αsp + αpb + βt + βs t + βp t + βb t + βsp t + εspbt The stepwise modeling process stops when no additional term can be eliminated. The first restriction is to test higher-level interaction terms before lower level interactions or main factors. However, the lowerlevel interactions or main design factors can be tested even with significant higher-level interaction terms as long as no common design factors are involved. The second restriction is to test the equality of slope before testing the equality of the corresponding intercept, for any factor or interaction. In addition, no equality of intercept is tested if the equality of the corresponding slope is rejected. With limited sample size (i.e., number of observation time), the major concern of the pooling test is its power to reject the null hypothesis (1, 3, 5–7). Hence, in order to reduce the rate of false pooling, a significance level of 0.25 was recommended for testing for pooling batches in simple stability designs (14–16). Fairweather et al. (19) proposed to use a large significance level such as α = 0.25 for all the testing involving batch term and α = 0.05 for any other pooling test in order to control the final and overall type I error rate of the objective hypothesis testing regarding the final estimation or testing. These significance levels are currently recommended in ICH Guidance (2–4) and adopted in the following examples. Note that in this aspect, Fairweather et al. (19) used the error rate for the F-tests based on the type I sum of squares instead of the F-tests based on type III sum of squares as proposed in this article. Chen and Tsong (20), and Tsong et al. (6, 18) pointed out that, at NDA stage, the regulatory requirement of a shelf life T 0 is based on the evidence that the chemical characteristic of the batch is within the specification limit(s) at T 0 . Hence, the hypotheses of interest are H0 : Yj (T0 ) ≤ SL or Yj (T0 ) ≥ SU for some j versus Ha : SL < Yj (T0 ) < SU for all j = 1 to J (7)
The proposed shelf life T0 is established if H0 is rejected. Chen and Tsong (20) conducted a simulation study of the significance levels used in the pooling tests in the situation of three batches in a simple stability study. Their results indicated that a significance level of 0.25 is needed in the pooling tests in order to control the type I error rate under 10% in testing the hypotheses in Equation (7) in studies with small to moderate variations.

3 POOLING BATCHES BASED ON EQUIVALENCE ASSESSMENT

As a result of the difficulty in setting a proper significance level for the pooling tests, Ruberg et al. (21, 22) proposed to test for equality of slopes or intercepts as an alternative for batch pooling. Ruberg et al.'s approach is to test the following hypotheses:

H0: |βj − βj′| ≥ δ for some j ≠ j′ versus Ha: |βj − βj′| < δ for all j, j′

and

H0: |αj − αj′| ≥ δ for some j ≠ j′ versus Ha: |αj − αj′| < δ for all j, j′

However, no equality limit was proposed for this approach. Lin and Tsong (23) showed through a simulation study that, for the same equality limit of slope, the impact on shelf life equality differs depending on the slope value. Yoshioka et al. (24, 25) revisited the equivalence approach and proposed to pool the batches if the difference in shelf life between any two batches is within an equivalence limit that is a prespecified percentage of the longest sample shelf life. Yoshioka et al.'s range-based test addresses the following hypotheses:

H0: |Tj − Tj′| ≥ γ maxj Tj for some j ≠ j′ versus Ha: |Tj − Tj′| < γ maxj Tj for all j, j′    (8)

where Tj and Tj′ are the true shelf lives of batches j and j′, respectively, and 0 ≤ γ ≤ 1 is a constant. Yoshioka et al. (25) compared the range-based equivalence test using γ = 0.15 with the ANCOVA approach through Monte Carlo simulation and found that the proposed method is more powerful in rejecting pooling when the batches have different shelf lives. However, no proper statistical procedure has been proposed for implementation. Tsong et al. (26) investigated the equivalence testing of shelf lives and found that the distribution of the shelf life estimate is skewed and, in general, undefined. Tsong et al. (26) proposed instead to pool the batches based on the equivalence of the expected chemical characteristic values at the target shelf life. Let Yj(T0) be the expected chemical characteristic of batch j at T0 and let YP(T0) = (1/J) Σj=1 to J Yj(T0) be the mean value of all batches. Tsong et al. proposed to pool all batches if Yj(T0) is equivalent to YP(T0) at T0 for all j. Assume a prespecified equivalence limit δT0 exists; then all batches are equivalent at T0 if

H0: |Yj(T0) − YP(T0)| ≥ δT0 for some j = 1 to J

is rejected. A proper equivalence limit is yet to be determined.

4 DISCUSSION AND CONCLUSIONS

Shelf life determination is a legally defined attribute of a drug product and is part of the requirement at the NDA stage. The ANCOVA approach has been the conventional and standard tool for analyzing stability study data. With appropriate significance levels set for the intercept and slope pooling tests, a decision on pooling data of different batches or across factors for the determination of shelf life can be made with the results of F-tests. Alternative approaches, such as pooling by equivalence assessment in order to replace pooling tests with low power, are topics of great interest to nonclinical statisticians. Random-effects models for shelf life prediction have also been proposed in the literature (27–31). The supporting argument for such an approach is that the shelf life is determined for future batches rather than for the few batches studied. However, in practice, the batches at the NDA stage are not a random sample, and the number of batches is often too small for a precise estimate of the variance. For non-normal random-effects modeling of stability data, Chen et al. (32) proposed a rank regression approach. For stability modeling of discrete chemical characteristics, a nonparametric random-effects model was proposed by Chow and Shao (33).

5 ACKNOWLEDGMENT

This manuscript was prepared with the support of Regulatory Science Research Grant RSR02-015 of the Center for Drug Evaluation and Research, U.S. FDA. The authors would like to thank Dr. Chi-Wan Chen, Ms. Roswitha E. Kelly, Drs. Karl K. F. Lin and Daphne T. Lin, and other members of the FDA CDER Office of Biostatistics Stability Working Group for their support and discussion during the development of this manuscript.
REFERENCES 1. FDA, Guidelines for Submitting Documentation for the Stability of Human Drugs and Biologics. Rockville MD: Food and Drug Administration, Center for Drugs and Biologics, 1987. 2. FDA, Guidance for Industry: ICH QIA(R) Stability Testing of New Drug Substances and Products. Rockville, MD: Food and Drug Administration, Center for Drug Evaluation and Research and Center for Biologics Evaluation and Research, August, 2001. Available: http://www.fda.gov/cder/guidance/4282fnl.pdf. 3. FDA, Draft ICH Consensus Guideline QIE Stability Data Evaluation. Rockville MD: Food and Drug Administration, Center for Drug Evaluation and Research and Center for Biologics Evaluation and Research, August, 2001. Available: http://www.fda.gov/cder/guidance/4983dft.pdf. 4. FDA, Guidance for Industry: ICH Q1D Bracketing and Matrixing Designs for Stability Testing of New Drug Substances and Products. Rockville, MD: Food and Drug Administration, Center for Drug Evaluation and Research and Center for Biologics Evaluation and Research, January, 2003. Available: http://www.fda.gov/cder/guidance/4985fnl.pdf.
STABILITY ANALYSIS 5. K. K. Lin, T. D. Lin, and R. E. Kelly, Stability of drugs In: C. R. Buncher and J. Y. Tsay (eds.), Statistics in the Pharmaceutical Industry, 2nd ed. New York: Marcel Dekker, 1993, pp. 419–444. 6. Y. Tsong, C-W. Chen, W. J. Chen, R. Kelly, K. K. Lin, and T. D. Lin, Stability of pharmaceutical products. In. C. R. Buncher and J. Y. Tsay (eds.), Statistics in the Pharmaceutical Industry, 3rd ed. New York: Marcel Dekker, 2003. 7. S. C. Chow and J. P. Liu, Statistical Design and Analysis in Pharmaceutical Sciences. New York: Marcel Dekker, 1995. 8. T-Y. D. Lin and C-W. Chen, Overview of stability designs. J. Biopharm. Stat. 2003; 13(3): 337–354. 9. P. Helboe, New designs for stability testing programs: matrix or factorial designs. Authorities viewpoint on the predictive value of such studies. Drug Info. J. 1992; 26: 629–634. 10. E. Nordbrock, Statistical comparison of stability study designs. J. Biopharm. Stat. 1992; 2(1): 91–113. 11. T. D. Lin, Applicability of matrix and bracket approach to stability study design. Proc. Association Biopharmaceutical Section of American Statistical Association 1994: 142– 147. 12. C-W. Chen, US FDA’s perspective of matrixing and bracketing. Proc. EFPIA Symposium: Advanced Topics in Pharmaceutical Stability Testing Building on the ICH Stability Guideline, EFPIA, Brussels, Belgium, 1996. 13. T. A. Bancroft, On biases in estimation due to the use of preliminary tests of significance, Ann. Math. Sta. 1944; 15: 190–204 14. H. J. Larson and T. A. Bancroft, Sequential model building for prediction in regression analysis, I. Ann. Math. Stat. 1963; 34: 231–242 15. T. A. Bancroft, Analysis and inference for incompletely specified models involving the use of preliminary tests of significance. Biometrics 1964; 20(3): 427–442. 16. J. P. Johnson, T. A. Bancroft, and C. P. Han, A pooling methodology for regressions in prediction. Biometrics 1977; 33: 57–67. 17. SAS/STAT User’s Guide, Version 8, vol. 2. Cary, NC: SAS Institute, Inc., 1999. 18. Y. Tsong, W-J. Chen, and C-W. Chen, ANCOVA approach for shelf life analysis of stability study of multiple factor designs. J. Biopharm. Stat. 2003; 13(3): 375–394.
19. W. R. Fairweather, T. D. Lin, and R. Kelly, Regulatory, design, and analysis aspects of complex stability studies. J. Pharm. Sci. 1995; 84(11): 1322–1326. 20. W-J. Chen and Y. Tsong, Significance level for stability polling test: a simulation study. J. Biopharm. Stat. 2003; 13(3): 355–374. 21. S. J. Ruberg and J. W. Stegeman, Pooling data for stability studies: testing the equality of batch degredation slopes. Biometrics 1991; 47: 1059–1069. 22. S. J. Ruberg and J. C. Hsu, Multiple comparison procedures for pooling batches in stability studies. Proc. Biopharm. Section of Joint Statist. Meeting, American Statistical Association, 1990: 204–209. 23. T-Y. D. Lin and Y. Tsong, Determination of significance level for pooling data in stability studies. Proc. Biopharm. Section, American Statistical Association, 1991: 195–201. 24. S. Yoshioka, Y. Aso, and S. Kojima, and A. L. W. Po, Power of analysis of variance for assessing batch-variation of stability data of pharmaceuticals. Chem. Pharm. Bull. 1996; 44(10): 1948–1950. 25. S. Yoshioka, Y. Aso, and S. Kojima, Assessment of shelf-life equivalence of pharmaceutical products. Chem. Pharm. Bull. 1997; 45(9): 1482–1484. 26. Y. Tsong, W-J. Chen, and C-W. Chen, Shelf life determination based on equivalence assessment. J. Biopharm. Stat. 2003; 13(3): 431–450. 27. J. J. Chen, J-S. Hwang, and Y. Tsong, Estimation of the shelf life of drugs with mixed effects models. J. Biopharm. Stat. 1995; 5(1): 131–140. 28. J. Chen, H. Ahn, and Y. Tsong, Shelf life estimation for multi-factor stability studies. Drug Info. J. 1997; 31(2): 573–587. 29. S. C. Chow and J. Shao, Estimating drug shelflife with random effect bastches. Biometrics 1991; 47: 1071–1079. 30. J. Shao and L. Chen, Prediction bounds for random shelf-lives. Stat. Med. 1997; 16: 1167–1173. 31. J. Shao and S-C. Chow, Statistical inference in stability analysis. Biometrics 1994; 50: 753–763. 32. Y. Q. Chen, A. Pong, and B. Xing, Rank regression in stability analysis. J. Biopharm. Stat. 2003; 13(3): 463–480. 33. S. C. Chow and J. Shao, Stability analysis with discrete response. J. Biopharm. Stat. 2003; 13(3): 451–462.
STABILITY STUDY DESIGNS
TSAE-YUN DAPHNE LIN
Center for Drug Evaluation and Research, U.S. FDA, Office of Biostatistics, Office of Pharmacoepidemiology and Statistical Sciences, Rockville, Maryland

CHI WAN CHEN
Center for Drug Evaluation and Research, U.S. FDA, Office of New Drug Chemistry, Office of Pharmaceutical Science, Rockville, Maryland

1 INTRODUCTION

The stability of a drug substance or drug product is the capacity of the drug substance or drug product to remain within the established specifications to ensure its identity, strength, quality, and purity during a specified period of time. The U.S. Food and Drug Administration (FDA) requires that the shelf life (also referred to as the expiration dating period) be indicated on the immediate container label for every human drug and biologic in the market. A good stability study design is the key to a successful stability program. From a statistical perspective, several elements need to be considered in planning a stability study design. First, the stability study should be well designed so that the shelf life of the drug product can be estimated with a high degree of accuracy and precision. Second, the stability design should be chosen so that it can reduce bias and identify and control any expected or unexpected source of variation. Third, the statistical method used for analyzing the data collected should reflect the nature of the design and provide a valid statistical inference for the established shelf life.

In many cases, the drug product may have several different strengths packaged in different container sizes because of medical needs. To test every batch under all factor combinations on a real-time, long-term testing schedule can be expensive and time consuming. As an alternative, a reduced design (e.g., bracketing, matrixing) may be considered as an efficient method for reducing the amount of testing needed while still obtaining the necessary stability information. A number of experimental designs are available that can capture the essential information to allow the determination of a shelf life while requiring less sampling and testing. Some of these designs require relatively few assumptions about the underlying physical and chemical characteristics, and others require a great many.

Stability requirements for the worldwide registration of pharmaceutical products have changed dramatically in the past few years. A series of guidelines on the design, conduct, and data analysis of stability studies of pharmaceuticals have been published by the ICH (International Conference on Harmonization). ICH Q1A(R2) (1) defines the core stability data package that is sufficient to support the registration of a new drug application in the tripartite regions of the European Union, Japan, and the United States. ICH Q1D (2) provides guidance on reduced designs for stability studies and outlines the circumstances under which a bracketing or matrixing design can be used. ICH Q1E (3) describes the principles of stability data evaluation and various approaches to statistical analysis of stability data in establishing a retest period for the drug substance or a shelf life for the drug product that would satisfy regulatory requirements.

In this article, the discussion will be focused on the statistical aspects of stability study designs in relation to the recent ICH guidelines. The statistical analysis of stability data will not be covered here. In Section 2, different types of experimental designs, such as the complete factorial design and fractional factorial designs, will be exemplified. In Section 3, several commonly used criteria for design comparison will be presented. In Section 4, the role of the stability study protocol will be emphasized. In Section 5, the statistical and regulatory considerations in the selection of a stability study design, in particular a full design vs. a bracketing or matrixing design, will be discussed. Finally, a conclusion will be made in Section 6.
2 STABILITY STUDY DESIGNS
The design of a stability study is intended to establish, based on testing a limited number of batches of a drug substance or product, a retest period or shelf life applicable to all future batches of the drug substance or product manufactured under similar circumstances. Tested batches should, therefore, be representative in all respects, such as formulation, container and closure, manufacturing process, manufacturing site, and source of drug substance, for the population of all production batches and conform to the quality specification of the drug product. The stability study should be well designed so the shelf life of the product can be estimated with a high degree of accuracy and precision. A typical stability study consists of several design factors (e.g., batch, strength, container size) and several response variables (e.g., assay, dissolution rate) measured at different time points under a variety of environmental conditions. One can categorize the stability designs as full, bracketing, or matrixing designs by the number of time points or levels of design factors as shown in Table 1. The following sections describe the characteristics, advantages, and disadvantages of various designs. 2.1 Full Design Several different types of experimental design can be applied to the stability study. A full design (also referred to as a complete factorial design) is one in which samples for every combination of all design factors are tested at all time points as recommended in ICH Q1A(R2), for example, a minimum of 0, 3, 6, 9, 12, 18, 24 months, and yearly thereafter for long-term testing and 0, 3, and 6 months for accelerated testing. As mentioned, ICH has recently issued a series of guidelines on the design and conduct of stability studies. ICH Q1A(R2) recommends that the drug substance and product should be stored at the long-term condition (e.g., 25 ◦ C ± 2 ◦ C/60% RH ± 5% RH or 30 ◦ C ± 2 ◦ C/65% RH ± 5% RH) that is reflective of the storage condition intended for the container label. Unless the drug substance or product is destined for freezer storage, it
should also be stored at the accelerated condition for 6 months and tested minimally at 0, 3, and 6 months. The guideline also states that if long-term studies are conducted at 25 ◦ C ± 2 ◦ C/60% RH ± 5% RH and ‘‘significant change’’ occurs at any time during 6 months’ testing at the accelerated storage condition, additional testing at the intermediate storage condition should be conducted and evaluated. A product may be available in several strengths, and each strength may be packaged in more than one container closure system and several container sizes. In this case, the resources needed for stability testing are considerable. Table 2 illustrates an example of a stability protocol using a full design for a drug product manufactured in three strengths and packaged in three container sizes and four container closure systems. This example shows that it will require 1188 test samples for long-term and accelerated stability testing. In addition, if long-term studies are conducted at 25 ◦ C ± 2 ◦ C/60% RH ± 5% RH and ‘‘significant change’’ occurs at any time during 6 months’ testing at the accelerated storage condition, the intermediate testing is needed and the number of test samples will be increased to 1620. Table 3 shows an example of a simple full design. This example describes a protocol for long-term stability testing (2 5◦ C/60%RH) of a drug product manufactured in three strengths (25, 50, and 100 mg) and packaged in three container sizes (10, 50, and 100 ml). Samples for every combination of all design factors (i.e., strength, batch, and container size) are tested at all time points. Hence, this example is a complete factorial design, and the total number of samples tested is N = 3 × 3×3×8 = 216 . As shown in the above examples, a full design involves complete testing of all factor combinations at all time points, which can be costly. The pharmaceutical industry would like to apply a reduced design so that not every factor combination will be tested at every time point. ICH Q1D listed some principles for situations in which a reduced design can be applied. However, before a reduced design is considered, certain assumptions
Table 1. Types of Stability Designs

                        All levels of design factors      Partial levels of design factors
All time points         Full design                       Bracketing design or matrixing design on factors
Partial time points     Matrixing design on time points   Matrixing design on time points and factors

Table 2. An Example of a Stability Protocol Using a Full Design According to the ICH Q1A(R2) Guideline

Number of batches: 3
Number of strengths: 3
Number of container sizes: 3
Number of container closure systems: 4
Storage conditions: Long term, 25°C ± 2°C/60% RH ± 5% RH; Intermediate, 30°C ± 2°C/65% RH ± 5% RH; Accelerated, 40°C ± 2°C/75% RH ± 5% RH
Time points (months): Long term, 0, 3, 6, 9, 12, 18, 24, 36; Intermediate, 0, 6, 9, 12; Accelerated, 0, 3, 6
Total number of samples tested (long term and accelerated): 1188
Total number of samples tested (long term, intermediate, and accelerated): 1620

The long-term condition could be 25°C ± 2°C/60% RH ± 5% RH or 30°C ± 2°C/65% RH ± 5% RH. If long-term studies are conducted at 25°C ± 2°C/60% RH ± 5% RH and "significant change" occurs at any time during 6 months' testing at the accelerated storage condition, then the intermediate testing is needed.
should be assessed and justified. The potential risk should be considered for establishing a shorter retest period or shelf life than could be derived from a full design because of the reduced amount of data collected. 2.2 Reduced Design As defined in ICH Q1D Guideline, a reduced design (also referred to as a fractional factorial design) is one in which samples for every factor combination are not all tested at all time points. Any subset of a full design is considered a reduced design. Bracketing and matrixing designs are two most commonly used reduced designs. In 1989, Wright (4) proposed to use the factorial designs in stability studies. Nakagaki (5) used the terminologies matrix and bracket in his 1991 presentation. In 1992, Nordbrock (6), Helboe (7), and Carstensen et al. (8) published articles discussing several methods for reducing the number of samples tested from the chemistry and economic aspects of stability studies. Nordbrock investigated various types of fractional factorial designs and compared these designs based on
the power of detecting a significant difference between slopes. He concluded that the design that gives acceptable performance and has the smallest sample size can be chosen based on power. Lin (9) investigated the applicability of matrixing and bracketing approaches to stability studies. She concluded that the complete factorial design is the best design when the precision of shelf life estimation is the major concern, but indicated that matrixing designs could be useful for drug products with less variability among different strengths and container sizes. With regard to the statistical analysis of data from complex stability studies, Fairweather et al. (10), Ahn et al. (11), Chen et al. (12), and Yoshioka et al. (13) have proposed different procedures for testing and classifying stability data with multiple design factors. 2.2.1 Bracketing Design versus Matrixing Design. ICH Q1D states that a reduced design can be a suitable alternative to a full design when multiple design factors are involved in the product being evaluated. However, the application of a reduced design
Table 3. Example of a Full Stability Study Design

                                       Container size
Granulation Batch   Strength (mg)   10 ml   50 ml   100 ml
A                   25              T       T       T
A                   50              T       T       T
A                   100             T       T       T
B                   25              T       T       T
B                   50              T       T       T
B                   100             T       T       T
C                   25              T       T       T
C                   50              T       T       T
C                   100             T       T       T

T = Sample tested at 0, 3, 6, 9, 12, 18, 24, 36 months
has to be carefully assessed, taking into consideration any risk to the ability of estimating an accurate and precise shelf life or the consequence of accepting a shorter-than-desired shelf life. Bracketing and matrixing designs are the two most commonly used reduced designs. The reduced stability testing in a bracketing or matrixing design should be capable of achieving an acceptable degree of precision in shelf life estimation without losing much information. The terms of bracketing and matrixing are defined in the ICH Q1D as follows: Bracketing. Bracketing is the design of a stability schedule such that only samples on the extremes of certain design factors (e.g., strength, container size) are tested at all time points as in a full design. The design assumes that the stability of any intermediate levels is represented by the stability of the extremes tested. Matrixing. Matrixing is the design of a stability schedule such that a selected subset of the total number of possible samples for all factor combinations is tested at a specified time point. At a subsequent time point, another subset of samples for all factor combinations is tested. The design assumes that the stability of each subset of samples tested represents the stability of all samples at a given time point. As defined above, bracketing and matrixing are two different approaches to designing a stability study. Either design has its
own assumptions, advantages, and disadvantages. The applicability of either design to stability study generally depends on the manufacturing process, stage of development, assessment of supportive stability data, and other factors as described in ICH Q1D. The following additional points can be considered regarding the use of a bracketing or matrixing design in a stability study. In a bracketing design, samples of a given batch for a selected extreme of a factor are analyzed at all time points. Therefore, it is easier to assess the stability pattern in a bracketing study than in a matrixing study in which samples of a given batch are often tested at fewer time points. If all selected strengths or container sizes tested show the same trend, it can be concluded with a high degree of certainty that the stability of the remaining strengths or container sizes is represented, or bracketed, by the selected extremes. In a matrixing design, the samples to be tested are selected across all factor combinations. This procedure may be less sensitive to assessing the stability pattern than bracketing because of the reduced time points. Therefore, a matrixing design is more appropriate for confirming a prediction or available stability information, and is more effective in the following situations: (1) in the later stages of drug development when sufficient supporting data are available, (2) for stability testing of production batches, and (3) for annual stability batches. One of the advantages of matrixing over bracketing is that all strengths and container sizes are included in the stability testing.
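The savings offered by the reduced designs described above can be seen directly in the sample counts quoted for the examples in this section (Tables 3 to 5); the following lines are illustrative arithmetic only, not part of the original article.

```python
# Reproducing the quoted sample counts for the same 3-strength, 3-container-size product.
full_table3 = 3 * 3 * 3 * 8          # Table 3: 3 batches x 3 strengths x 3 sizes x 8 time points = 216
bracketing_table4 = 3 * 2 * 2 * 8    # Table 4: extremes of strength and container size only     = 96
matrixing_table5 = 3 * 3 * 3 * 4     # Table 5: complete one-third design, 4 time points per cell = 108
print(full_table3, bracketing_table4, matrixing_table5)   # 216 96 108
```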
The general applicability of bracketing and matrixing has been discussed in the literature (e.g., References 9, 14–21). From statistical and regulatory perspectives, the applicability of bracketing and matrixing depends on the type of drug product, type of submission, type of factor, data variability, and product stability. The factors that may be bracketed or matrixed in a stability study are outlined in ICH Q1D. This ICH guideline also briefly discusses several conditions that need to be considered when applying these types of design. A bracketing or matrixing design may be preferred to a full design for the purpose of reducing the number of samples tested, and consequently the cost. However, the ability to adequately predict the product shelf life by these types of reduced designs should be carefully considered. In general, a matrixing design is applicable if the supporting data indicate predictable product stability. Matrixing is appropriate when the supporting data exhibit only small variability. However, where the supporting data exhibit moderate variability, a matrixing design should be statistically justified. If the supportive data show large variability, a matrixing design should not be applied. A statistical justification could be based on an evaluation of the proposed matrixing design with respect to its power to detect differences among factors in the degradation rates or its precision in shelf life estimation. If a matrixing design is considered applicable, the degree of reduction that can be made from a full design depends on the number of factor combinations being evaluated. The more factors associated with a product and the more levels in each factor, the larger the degree of reduction that can be considered. However, any reduced design should have the ability to adequately predict the product shelf life. An example of a bracketing design is given in Table 4. Similar to the full design in Table 3, this example is based on a product available in three strengths and three container sizes. In this example, it should be demonstrated that the 10-ml and 100-ml container sizes truly represent the extremes of the container sizes. In addition, the 25-mg and 100-mg strength should also represent the extremes of the strengths. The batches for
each selected combination should be tested at each time point as in a full design. The total number of samples tested will be N = 2 × 2 × 3 × 8 = 96.

An example of a matrixing-on-time-points design is given in Table 5. The description of the drug product is similar to the previous examples. Three time codes are used in two different designs. As an example of a complete one-third design, the time points for batch B in strength 100 mg and container size 10 ml are 0, 9, 24, and 36 months. The total number of samples tested for this complete one-third design will be N = 3 × 3 × 3 × 4 = 108.

Table 6 illustrates an example of matrixing on both time points and factors. Similar to the previous example, three time codes are used. As an example, batch B in strength 100 mg and container size 10 ml is tested at 0, 6, 9, 18, 24, and 36 months. The total number of samples tested for this example will be N = 3 × 3 × 2 × 6 = 108.

2.3 Other Fractional Factorial Designs

Fractional factorial designs other than those discussed above may be applied. As bracketing and matrixing designs are based on different principles, the use of bracketing and matrixing in one design should be carefully considered. If this type of design is applied, scientific justifications should be provided.

3 CRITERIA FOR DESIGN COMPARISON
Several statistical criteria have been proposed for comparing designs and choosing the appropriate design for a particular stability study. Most of the criteria are similar in principles but different in procedures. This section will discuss several commonly used criteria for design comparison. As the statistical method used for analyzing the data collected should reflect the nature of the design, the availability of a statistical method that provides a valid statistical inference for the established shelf life should be considered when planning a study design. The optimality criteria have been applied by several authors [Nordbrock (6), Ju and Chow (22)] to the selection of stability design.
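To make the information-matrix comparison underlying these optimality criteria concrete, the short sketch below (illustrative only; the candidate schedules are assumptions, not taken from the cited references) evaluates det(X'X) for a straight-line degradation model under three time-point schedules. It shows why a purely D-optimal allocation pushes observations toward the ends of the study, even though intermediate points are retained in practice as a check of linearity, as discussed later in this section.

```python
# Minimal sketch: comparing candidate time-point schedules for Y = X*beta + error
# through the information matrix X'X.  det(X'X) is the D-optimality measure
# (larger is better); the slope variance is proportional to element [1, 1] of (X'X)^-1.
import numpy as np

def design_summary(time_points):
    t = np.asarray(time_points, float)
    X = np.column_stack([np.ones_like(t), t])      # intercept and slope columns
    M = X.T @ X
    return np.linalg.det(M), np.linalg.inv(M)[1, 1]

schedules = {
    "all protocol points": [0, 3, 6, 9, 12, 18, 24, 36],
    "endpoints only (same n)": [0, 0, 0, 0, 36, 36, 36, 36],
    "one-third subset": [0, 3, 12, 36],
}
for name, pts in schedules.items():
    d, slope_var_factor = design_summary(pts)
    print(f"{name:25s} det(X'X) = {d:9.0f}   slope-variance factor = {slope_var_factor:.5f}")
```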
Table 4. Example of a Bracketing Design

                            Container size
Batch   Strength (mg)   10 ml   50 ml   100 ml
A       25              T               T
A       50
A       100             T               T
B       25              T               T
B       50
B       100             T               T
C       25              T               T
C       50
C       100             T               T

T = Sample tested at 0, 3, 6, 9, 12, 18, 24, 36 months; blank cells are not tested
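A bracketing schedule such as the one in Table 4 can be written down mechanically once the extreme levels of the bracketed factors are identified; the sketch below (an illustration, not a validated tool) enumerates the tested combinations and confirms the sample count of 96.

```python
# Minimal sketch: enumerate the bracketing schedule (extremes of strength and size only).
from itertools import product

batches = ["A", "B", "C"]
strengths = [25, 50, 100]          # mg
sizes = [10, 50, 100]              # ml
time_points = [0, 3, 6, 9, 12, 18, 24, 36]

def bracketed(strength, size):
    # keep only the extreme levels of both bracketed factors
    return (strength in (min(strengths), max(strengths))
            and size in (min(sizes), max(sizes)))

schedule = [(b, s, c) for b, s, c in product(batches, strengths, sizes)
            if bracketed(s, c)]
print(len(schedule), "tested combinations")               # 12
print(len(schedule) * len(time_points), "test samples")   # 96
```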
Table 5. Example of a Matrixing-on-Time-Points Design

                            Container size
Batch   Strength (mg)   10 ml   50 ml   100 ml
A       25              T1      T2      T3
A       50              T3      T1      T2
A       100             T2      T3      T1
B       25              T2      T3      T1
B       50              T1      T2      T3
B       100             T3      T1      T2
C       25              T3      T1      T2
C       50              T2      T3      T1
C       100             T1      T2      T3

Complete one-third design, time code: time points (months)
T1: 0, 3, 12, 36
T2: 0, 6, 18, 36
T3: 0, 9, 24, 36

Complete two-thirds design, time code: time points (months)
T1: 0, 3, 9, 12, 24, 36
T2: 0, 3, 6, 12, 18, 36
T3: 0, 6, 9, 18, 24, 36
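Assuming the time-code sets shown above for Table 5, the quick check below (illustrative only) confirms the balance property that makes the designs "complete": every intermediate time point is carried by exactly one code in the one-third design and by exactly two codes in the two-thirds design, while 0 and 36 months appear in every code.

```python
# Minimal sketch: verify the coverage of the Table 5 time codes.
from collections import Counter

one_third = {"T1": {0, 3, 12, 36}, "T2": {0, 6, 18, 36}, "T3": {0, 9, 24, 36}}
two_thirds = {"T1": {0, 3, 9, 12, 24, 36}, "T2": {0, 3, 6, 12, 18, 36},
              "T3": {0, 6, 9, 18, 24, 36}}

for name, key, expected in [("one-third", one_third, 1), ("two-thirds", two_thirds, 2)]:
    counts = Counter(t for points in key.values() for t in points)
    intermediates_ok = all(counts[t] == expected for t in (3, 6, 9, 12, 18, 24))
    anchors_ok = counts[0] == counts[36] == 3
    print(name, "coverage as expected:", intermediates_ok and anchors_ok)
```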
These optimality criteria are statistical efficiency criteria and have been developed and widely used [Kiefer (23) and Fedorov (24)] in other scientific areas for many years. Hundreds of journal articles on these optimality criteria have been published. The basic principle of this approach can be described below. Let Y denote the result of a test attribute, say assay. The following model can be used to describe Y:
Y = Xβ + ε
where X is the design matrix, β is the coefficient vector, and ε is the residual vector. The least squares solution of this matrix equation with respect to the model coefficient vector is β̂ = (X′X)⁻¹X′Y, and X′X is called the information matrix because its determinant is a measure of the information content in the design. Several different optimality criteria are used for design comparison and, among them, D-optimality and A-optimality are the most commonly used. The basic principle behind these criteria is to choose a design
Table 6. Example of a Matrixing-on-Time-Points-and-Factors Design (container sizes in columns; — = combination not tested)

Batch | Strength (mg) | 10 ml | 50 ml | 100 ml
A | 25  | T1 | —  | T1
A | 50  | T2 | —  | T2
A | 100 | T3 | —  | T3
B | 25  | T2 | T2 | —
B | 50  | T3 | T3 | —
B | 100 | T1 | T1 | —
C | 25  | —  | T3 | T3
C | 50  | —  | T1 | T1
C | 100 | —  | T2 | T2

Time Code | Time Points (months)
T1 | 0, 6, 9, 18, 24, 36
T2 | 0, 3, 9, 12, 24, 36
T3 | 0, 3, 6, 12, 18, 36
that is optimal with respect to the precision of the parameter estimators. For example, say X1 and X2 are the design matrices for two different fractional factorial designs. The D-optimality criterion is that if Det(X1′X1) > Det(X2′X2), then design X1 is said to be preferred to design X2. The A-optimality criterion is that if Trace[(X1′X1)⁻¹] < Trace[(X2′X2)⁻¹], then design X1 is said to be preferred to design X2. The D-optimal design is used most often in experimental designs. If the D-optimality concept is applied directly to stability studies, then the selection of observations (i.e., time points) that gives the minimum variance for the slope is to place one-half at the beginning of the study and one-half at the end. However, in reality, the test-attribute-versus-time relationship may not be linear for many chemical and physical characteristics of the drug product. Hence, statistically designed stability studies should not only apply this basic optimality principle but also include several intermediate time points (e.g., 3 and 6 months) as a check for linearity. However, depending on the drug product, manufacturing process, marketing strategy, and other factors, different designs may be chosen, and no single design is optimal in all cases. As analyses are typically done at several different times (e.g.,
at the time a registration application for the new drug product is filed, or yearly for a marketed product), the choice of design should also take into account that the analysis will be done after additional data are collected.
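As a concrete illustration of the information-matrix criteria just described, the short sketch below (not taken from any of the cited references) compares two candidate time-point schedules for a straight-line degradation model using det(X′X) and trace[(X′X)⁻¹]. It shows why a purely D-optimal schedule concentrates the assays at the start and end of the study.

```python
# Minimal sketch: compare candidate stability time-point layouts for the
# straight-line model Y = b0 + b1*t + error using information-matrix criteria.
# Illustrative only; this is not the procedure of any cited reference.
import numpy as np

def info_matrix(times):
    """Information matrix X'X for an intercept-plus-slope model."""
    X = np.column_stack([np.ones(len(times)), np.asarray(times, dtype=float)])
    return X.T @ X

def d_value(times):
    return np.linalg.det(info_matrix(times))            # larger is better (D)

def a_value(times):
    return np.trace(np.linalg.inv(info_matrix(times)))  # smaller is better (A)

# Two candidate schedules with the same number of assays:
endpoints_only = [0, 0, 0, 0, 36, 36, 36, 36]   # D-optimal for a straight line
ich_schedule   = [0, 3, 6, 9, 12, 18, 24, 36]   # spread out; allows a linearity check

for name, t in [("endpoints only", endpoints_only), ("ICH-type schedule", ich_schedule)]:
    print(f"{name:18s} det(X'X) = {d_value(t):9.1f}   trace[(X'X)^-1] = {a_value(t):.4f}")
```

Under this model the endpoint-loaded schedule has the larger determinant, which is exactly why the intermediate time points must be added from design considerations (e.g., checking linearity) rather than supplied by the optimality criterion itself.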
Nordbrock (6) proposed to choose a design based on the power for detecting a significant difference among slopes. The method he proposed consists of three steps. The first step is to list slopes that must be compared [i.e., to list factors and factor interactions (for slopes) that may affect stability]. The second step is to list some alternative designs with reduced sample sizes. Some general experimental design considerations are used to derive these alternative designs. The third step is to compare statistical properties of the designs based on the power of contrasts of interest. He then suggested that ''among designs with the same sample size, the design with the highest power is the best design. One way to select a design for a particular study is to choose the desired power, and then, from designs having at least the desired power, the design with the smallest sample size is the best design'' (6). The primary objective of a stability study is to establish a shelf life for the drug product and not to examine the effect of factors used in the experiment. Thus, the design chosen based on the power of contrasts of interest as proposed by Nordbrock may not be the best choice. As an alternative, Ju and Chow (22) proposed a design criterion based on the precision of the shelf life estimation. Mathematically, this criterion for choosing a design is based on the following comparison: design XA is considered to be better than design XB if

x′(t)(XA′XA)⁻¹x(t) < x′(t)(XB′XB)⁻¹x(t)   for t ≥ tε

where x(t) is the design matrix X at time t, and tε is chosen based on the true shelf life. Their article applied the shelf life estimation procedure developed by Shao and Chow (25), in which a random effects model was used. This criterion also takes into account the batch variability; hence, it can be applied to balanced cases and can be extended to unbalanced cases. Based on the above criterion and comparison results, Ju and Chow then proposed that ''For a fixed sample size, the design with the best precision for shelf life estimation is the best design. For a fixed desired precision of shelf life estimation, the design with the smallest sample size is the best design'' (22).

Murphy (26) introduced the uniform matrix design for drug stability studies and compared it with standard matrix designs. The strategy of the uniform matrix design is to delete certain times (e.g., the 3-, 6-, 9-, and 18-month time points); therefore, testing is done only at 12, 24, and 36 months. This design has the advantage of simplifying the data entry of the study design and eliminating time points that add little to reducing the variability of the slope of the regression line. The disadvantage is that, if major problems exist with the stability, no early warning occurs because early testing is not done. Further, it may not be possible to determine if the linear model is appropriate. Murphy used five efficacy measures (design moment, D-efficiency, uncertainty, G-efficiency, and statistical power) to compare uniform matrix designs with other matrix designs. Three of the criteria (moment, D-efficiency, and power) are well known and widely used. Uncertainty is a measure of the reliability of the shelf life estimates, which is dependent on the reliability of the estimated slopes. Another measure of uncertainty is the width of a confidence interval on the fitted slope, which, for a fixed amount of residual variation, is dependent only on the number and spacing of the data points. G-efficiency is a relatively new concept that is defined as

G-efficiency = 100 √(p/n)/σmax

where n = number of points in the design, p = number of parameters, and σmax = maximum standard error of prediction over the design. Based on the comparisons of the five different criteria, Murphy stated that ''the uniform matrix designs provide superior statistical properties with the same or fewer design points than standard matrix design'' (26).

Pong and Raghavarao (27) compared the power for detecting a significant difference between slopes and the mean square error to evaluate the precision of the estimated shelf life of bracketing and matrixing designs. They found the power of both designs to be similar. Based on the conditions under study, they concluded that bracketing appears to be a better design than matrixing in terms of the precision of the estimated shelf life. Evidently, the above-mentioned criteria are based on different procedures. Thus, the final choice of design might differ depending on the criterion used. In general, in addition to chemical and physical characteristics of the drug product, regulatory and statistical aspects need to be considered. Ideally, one would like to choose a design that is optimal with respect to the precision of the shelf life estimation and the power of detecting meaningful effects.

4 STABILITY PROTOCOL

A good stability study design is the key to a successful stability program. The program should start with a stability protocol
that specifies clearly the study objective, the study design, batch and packaging information, specifications, time points, storage conditions, sampling plan, statistical analysis method, and other relevant information. The protocol should be well designed and followed rigorously, and data collection should be complete and in accordance with the protocol. The planned statistical analysis should be described in the protocol to avoid the appearance of choosing an approach to produce the most desirable outcome at the time of data analysis. Any departure from the design makes it difficult to interpret the resulting data. Any changes made to the design or analysis plan without modification to the protocol or after examination of the data collected should be clearly identified.
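To make the pre-specified statistical analysis concrete, the following minimal sketch illustrates one common form of shelf-life estimate that a protocol might describe: regress the assay results on time and take the shelf life as the latest time at which the one-sided 95% lower confidence bound for the mean response remains above the acceptance criterion (the general approach of ICH Q1E). The data values, the 95% lower specification limit, and the 60-month search grid are hypothetical.

```python
# Minimal sketch of a protocol-specified shelf-life estimate for one batch/
# factor combination: fit assay (% label claim) vs. time and find the latest
# time at which the one-sided 95% lower confidence bound for the mean stays
# above the acceptance criterion.  All data values are hypothetical.
import numpy as np
from scipy import stats

months = np.array([0, 3, 6, 9, 12, 18, 24, 36], dtype=float)
assay  = np.array([100.2, 99.8, 99.5, 99.1, 98.9, 98.2, 97.6, 96.3])  # hypothetical
lower_spec = 95.0   # acceptance criterion (% label claim), hypothetical

X = np.column_stack([np.ones_like(months), months])
beta, *_ = np.linalg.lstsq(X, assay, rcond=None)
resid = assay - X @ beta
dof = len(months) - 2
s2 = resid @ resid / dof
xtx_inv = np.linalg.inv(X.T @ X)
t95 = stats.t.ppf(0.95, dof)

def lower_bound(t):
    x = np.array([1.0, t])
    se_mean = np.sqrt(s2 * x @ xtx_inv @ x)
    return x @ beta - t95 * se_mean     # one-sided 95% lower bound for the mean

grid = np.arange(0, 60.1, 0.1)
ok = [t for t in grid if lower_bound(t) >= lower_spec]
print(f"Estimated shelf life ~ {max(ok):.1f} months" if ok else "No shelf life supported")
```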
5 BASIC DESIGN CONSIDERATIONS
5.1 Impact of Design Factors on Shelf Life Estimation

A drug product may be available in different strengths and different container sizes. In such a case, stability designs for long-term stability studies will involve the following design factors: strength, container size, and batch. As mentioned by Nordbrock (6) and Chow and Liu (28), it is of interest to examine several hypotheses, such as the following:

1. Degradation rates among different container sizes are consistent across strengths.
2. Degradation rates are the same for all container sizes.
3. Degradation rates are the same for all strengths.
4. Degradation rates are the same for all batches.

Hence, a need exists to investigate the impact of main effects (e.g., strength, container size, batch) and interaction effects on the stability of the drug product under long-term testing. In constructing a stability design, one needs to consider to what extent it is acceptable to pool data from different design factors. For example, if a statistically significant interaction effect occurs between
container size and strength, then a separate shelf life should be estimated for each combination of container size and strength. On the other hand, if no statistically significant interaction effect occurs (i.e., the degradation rates for all container sizes, strengths, and batches are the same), it will be acceptable to pool all results and calculate one single shelf life for all container sizes, strengths, and batches produced under the same circumstances. A design should be adequately planned such that it is capable of detecting the possible significant interaction effects and main effects. A full design can provide not only valid statistical tests for the main effects of design factors under study but also better precision in the estimates for interactions. Hence, the precision of the estimated drug shelf life for a full design is better than that of a reduced design. A reduced design is preferred to a full design for the purpose of reducing the test samples and, consequently, the cost. However, it has the following disadvantages: (1) One may not be able to evaluate some interaction effects for certain designs. For example, for a 2⁴⁻¹ fractional factorial design, two-factor interaction effects are confounded with each other; hence, one may not be able to determine whether the data should be pooled. (2) If interactions between two factors exist, such as strength by container size, the data cannot be pooled to establish a single shelf life. It is recommended that a separate shelf life for each combination of strength and container size should be established. However, no shelf life can be estimated for the missing factor combinations. (3) If many missing factor combinations exist, there may not be sufficient precision for the estimated shelf life. It is generally impossible to test the assumption that the higher-order terms are negligible. Hence, if the design does not permit the estimation of interactions or main effects, it should be used only when it is reasonable to assume that these interactions are very small. This assumption must be made on the basis of theoretical considerations of the formulation, manufacturing process, chemical and physical characteristics, and supporting data from other studies.
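The pooling decision described above is usually supported by a preliminary test of whether degradation slopes differ across the levels of a factor. The sketch below uses hypothetical data and column names; it compares a common-slope model with a separate-slopes model for three batches. Guidance such as ICH Q1E applies a relaxed significance level (0.25) to such poolability tests.

```python
# Sketch of a poolability check before estimating a single shelf life:
# test whether degradation slopes differ across batches by comparing a
# common-slope model with a separate-slopes (time x batch) model.
# The data frame layout ("months", "assay", "batch") is hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

stab = pd.DataFrame({
    "months": [0, 3, 6, 9, 12, 18, 24, 36] * 3,
    "batch":  ["A"] * 8 + ["B"] * 8 + ["C"] * 8,
    "assay":  [100.1, 99.7, 99.4, 99.0, 98.8, 98.1, 97.5, 96.2,   # hypothetical
               100.3, 99.9, 99.6, 99.3, 98.9, 98.4, 97.8, 96.8,
               100.0, 99.6, 99.2, 98.8, 98.5, 97.7, 97.0, 95.8],
})

common   = smf.ols("assay ~ months + C(batch)", data=stab).fit()
separate = smf.ols("assay ~ months * C(batch)", data=stab).fit()

# F-test for the time-by-batch interaction (equality of slopes); a relaxed
# significance level (0.25) is commonly used for this kind of poolability test.
print(anova_lm(common, separate))
```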
Thus, to achieve a better precision of the estimated shelf life, a design should be so chosen as to avoid possible confounding or interaction effects. Once the design is chosen, statistical analysis should reflect the nature of the design selected. 5.2 Sample Size and Sampling Considerations The total number of samples tested in a stability study should be sufficient to establish the stability characteristics of the drug product and to estimate the shelf life of the drug product with an acceptable degree of precision. The total number of test samples needed in a stability study generally depends on the objective and design of the study. For example, for a drug product available in a single strength and a single container size, the choice of design is limited (i.e., the design should have three batches of the product and samples should be tested every three months over the first year, every six months over the second year, and then annually). For drug products involving several design factors, several different types of design, such as, full, bracketing, matrixing designs, can be chosen. In general, the number of design factors planned in the study, the number of batches, data variability within or across design factors, and the expected shelf life for the product all need to be considered when choosing a design. In addition, available stability information, such as the variability in the manufacturing process and the analytical procedures, also need to be evaluated. The estimation of shelf life of a drug product is based on testing a limited number of batches of a drug product. Therefore, tested batches should be representative in all respects as discussed previously. 5.3 Other Issues The purpose of selecting an appropriate stability design is to improve the accuracy and precision of the established shelf life of a drug product. Background information, such as regulatory requirements, manufacturing process, proposed specification, and developmental study results are helpful in the design of a stability study.
The choice of stability study design should reflect the formulation and manufacturing process. For example, the study design and associated statistical analysis for a product available in three strengths made from a common granulation will be different from those for a product whose strengths are made from different granulations with different formulations. The stability study design should be capable of avoiding bias and achieving minimum variability. Therefore, the design should take into consideration variations from different sources. The sources of variation may include individual dosage units, containers within a batch, batches, analytical procedures, analysts, laboratories, and manufacturing sites. In addition, missing values should be avoided, and the reason for any missing values should be documented. When choosing a matrixing design, one cannot rely on the assumption that the shelf life of the drug product is the same for all design factors. If the statistical results show that a significant difference exists among batches and container sizes, one cannot rely on certain combinations of batch and container size or on the statistical model to provide reliable information on the missing combinations. The shelf life must be calculated for the observed combinations of batch and container size. The shortest observed shelf life is then assigned to all container sizes.

6 CONCLUSIONS

A good stability study design is the key to a successful stability program. The number of design factors planned in the study, the number of batches, data variability within or across design factors, and the expected shelf life for the product all need to be considered when choosing a design. The recently published ICH guidelines, such as Q1A(R2), Q1D, and Q1E, should be consulted and the recommendations therein should be followed. A reduced design, such as bracketing and matrixing, can be a suitable alternative to a full design when multiple design factors are involved in the product being evaluated. However, the application of a reduced design has to be carefully assessed, taking into consideration any risk to the ability of estimating
an accurate and precise shelf life or the consequence of accepting a shorter-than-desired shelf life. When the appropriate study design is chosen, the total number of samples tested in a stability study should be sufficient to establish the stability characteristics of the drug product and to estimate the shelf life of the drug product with an acceptable degree of precision. To achieve a better precision of the estimated shelf life, a design should be chosen to avoid possible confounding or interaction effects. Once the design is chosen, the statistical method used for analysis of the data collected should reflect the nature of the design and provide a valid statistical inference for the established shelf life.
7 ACKNOWLEDGEMENT
The authors would like to thank the FDA CDER Office of Biostatistics Stability Working Group for the support and discussion on the development of this manuscript. REFERENCES 1. Guidance for Industry: ICH Q1A(R2) Stability Testing of New Drug Substances and Products. Food and Drug Administration, Center for Drug Evaluation and Research and Center for Biologics Evaluation and Research, January, 2003. 2. Guidance for Industry: ICH Q1D Bracketing and Matrixing Designs for Stability Testing of New Drug Substances and Products. Food and Drug Administration, Center for Drug Evaluation and Research and Center for Biologics Evaluation and Research, January, 2003. 3. Draft ICH Consensus Guideline Q1E Stability Data Evaluation. Food and Drug Administration, Center for Drug Evaluation and Research and Center for Biologics Evaluation and Research, January 2003. 4. J. Wright, Use of factorial designs in stability testing. In Proceedings of Stability Guidelines for Testing Pharmaceutical Products: Issues and Alternatives. AAPS Meeting, December 1989. 5. P. Nakagaki, AAPS Annual Meeting, 1990. 6. E. Nordbrock, Statistical comparison of stability study designs. J. Biopharm. Stat. 1992; 2: 91–113.
7. P. Helboe, New designs for stability testing programs: matrix or factorial designs. Authorities viewpoint on the predictive value of such studies. Drug Info J. 1992; 26: 629–634. 8. J. T. Carstensen, M. Franchini, and K. Ertel, Statistical approaches to stability protocol design. J. Pharm. Sci. 1992; 81(3): 303–308. 9. T. D. Lin, Applicability of matrix and bracket approach to stability study design. Proceedings of the American Statistical Association Biopharmaceutical Section, 1994: 142–147. 10. W. Fairweather, T. D. Lin, and R. Kelly, Regulatory, design, and analysis aspects of complex stability studies. J. Pharm. Sci. 1995; 84(11): 1322–1326. 11. H. Ahn, J. Chen, and T. D. Lin, A Two-way analysis of covariance model for classification of stability data. Biometr. J. 1997; 39(5): 559–576. 12. J. Chen, H. Ahn, and Y. Tsong, Shelf life estimation for multi-factor stability studies. Drug Info J. 1997; 31(2): 573–587. 13. S. Yoshioka, Y. Aso, and S. Kojima, Assessment of shelf-life equivalence of pharmaceutical products. Chem. Pharm. Bull. 1997; 45: 1482–1484. 14. T. D. Lin and W. R. Fairweather, Statistical design (bracketing and matrixing) and analysis of stability data for the US market. Proceedings from IBC Stability Testing Conference, London, 1997. 15. T. D. Lin, Statistical considerations in bracketing and matrixing. Proceedings from IBC Bracketing and Matrixing Conference, London, 1999. 16. T. D. Lin, Study design, matrixing and bracketing. AAPS Workshop on Stability Practices in the Pharmaceutical Industry — Current Issues, Arlington, Virgina, 1999. 17. W. P. Fairweather and T. D. Lin, Statistical and regulatory aspects of drug stability studies: an FDA perspective. In: D. Mazzo, ed. International Stability Testing. Interpharm Press, Inc., 1999, pp. 107–132. 18. C. Chen, US FDA’s perspective of matrixing and bracketing. Proceedings from EFPIA Symposium: Advanced Topics in Pharmaceutical Stability Testing Building on the ICH Stability Guideline, EFPIA, Brussels, Belgium, 1996. 19. D. Chambers, Matrixing/Bracketing. US industry views. Proceedings from EFPIA Symposium: Advanced Topics in Pharmaceutical Stability Testing Building on the ICH Stability Guideline, EFPIA, Brussels, Belgium, 1996. 20. P. Helboe, Matrixing and bracketing designs for stability studies: an overview from the
European perspective. In: D. Mazzo, ed. International Stability Testing. Interpharm Press, Inc., 1999, pp. 135–160. 21. S. Yoshioka, Current application in Japan of the ICH stability guidelines: does Japanese registration require more than others do? In: D. Mazzo, ed. International Stability Testing. Interpharm Press, Inc., 1999, pp. 255–264. 22. H. L. Ju and S. C. Chow, On stability designs in drug shelf life estimation. J. Biopharm. Stat. 1995; 5(2): 201–214. 23. J. Kiefer, Optimal experimental designs. J. Royal Stat. Soc. Series B 1959; 21: 272–319. 24. V. V. Fedorov, Theory of Optimal Experiments. W. J. Studden and E. M. Klimko, eds. New York: Academic Press, 1972. 25. J. Shao and S. C. Chow, Statistical inference in stability analysis. Biometrics 1994; 50(3): 753–763. 26. J. R. Murphy, Uniform matrix stability study designs. J. Biopharm. Stat. 1996; 6(4): 477–494.
27. A. Pong and D. Raghavarao, Comparison of bracketing and matrixing designs for a two-year stability study. J. Pharm. Stat. 2000; 10(2): 217–228. 28. S. C. Chow and J. P. Liu, eds. Statistical Designs and Analysis in Pharmaceutical Science. New York: Marcel Dekker, 1995.
STANDARD OPERATING PROCEDURES (SOPS)

Standard Operating Procedures (SOPs) are detailed, written instructions to achieve uniformity of the performance of a specific function.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
STATINS
PETER W. F. WILSON, MD
Emory University School of Medicine, Atlanta, Georgia

1 THE PRE-STATIN ERA

1.1 Observational Studies Background

Elevation of blood cholesterol, especially low-density lipoprotein cholesterol (LDL-C), is associated with an increased risk of atherosclerotic disease. Mean cholesterol levels have declined in U.S. national data since the 1980s, but the nation continues to experience high risk for coronary heart disease (1). Lowered consumption of fat has been felt to be largely responsible for this national decline in cholesterol levels in adults, but a sizable fraction of American adults continue to have elevated cholesterol levels and elevated risk for cardiovascular sequelae. A variety of lifestyle factors, including diet, weight control, and exercise, has been shown to lower cholesterol levels. Medications have also been available over the years to treat elevated cholesterol levels, and the mainstays have included bile acid resins, niacin, and fibric acid derivatives. Up to the late 1980s, it was uncommon to achieve cholesterol lowering greater than 20% from baseline with the host of medications that were available (2).

1.2 Advent of Statin Medications

The discovery of the low-density lipoprotein receptor by Goldstein and Brown increased our understanding of cholesterol homeostasis and helped to provide the impetus to intervene on newly identified metabolic pathways closely related to cholesterol biosynthesis and degradation. Inhibitors of a crucial step in the pathways of cholesterol biosynthesis were discovered in the early 1970s, and it was recognized that the HMG-CoA reductase inhibitors, which were later called statins, could lead to major reductions in blood cholesterol levels in humans (3). In addition to having direct effects on cholesterol biosynthesis in the liver and other organs, it was recognized that statin medications upregulated the action of LDL receptors and increased LDL particle clearance from the plasma. Early in the development of statins, there was concern that the risk of lens opacities could be increased and that liver transaminase enzymes would need to be monitored closely to avoid adverse side effects. A large safety study, which was undertaken shortly after the release of lovastatin into the U.S. market, generally showed a good safety profile with respect to the eye and the liver (4,5). At typical starting dosages, the statins lowered plasma LDL-C by 30% or more; this reduction had previously been difficult to achieve with other lipid-lowering medications. In addition, clinical trial data showed that persons on cholesterol-lowering diets and statins achieved an even greater decrease in blood cholesterol than those on statin medications alone (6).

The statin medications have several common structural features that are observed across the class. The first is that a portion of the molecule includes an open chain form or a closed lactone ring. The second is that a complex hydrophobic ring allows tight bonding to the reductase enzyme. Third, side groups on the rings define the solubility of the medications and many of their pharmacokinetic properties. For example, pravastatin is the most water soluble. Rosuvastatin also has a hydrophilic group, and it is relatively water soluble, but the other statins are generally hydrophobic. The medications lovastatin, simvastatin, and pravastatin are natural statins and are relatively similar in terms of their structural identity. However, rosuvastatin, fluvastatin, and atorvastatin are entirely synthetic and have different chemical structures.
2 STATIN PHARMACOLOGY

2.1 Pharmacology and Drug Interactions

Absorption of the statins ranges from less than 30% for pravastatin to more than 90% for fluvastatin. First-pass extraction of the medications and plasma concentrations do not serve as a good guide to some metabolic effects observed in humans. Statin medications are most highly concentrated in the liver, and
they have a relatively short half-life in the plasma. Nevertheless, statins have a chronic stimulatory effect on the liver. Atorvastatin and rosuvastatin have longer half-lives, as do their metabolites. It has been suggested that these longer half-lives lead to greater efficacy in blood cholesterol lowering. Most statin medications are metabolized by the cytochrome P450 enzyme family, which is composed of several isoenzymes. The 3A4 isoform in the liver is the key metabolic enzyme for most statins, with the exception that fluvastatin and rosuvastatin are metabolized by 2C9. Pravastatin does not undergo a cytochrome P450 reaction, as shown in Table 1 (7). Drug interactions are partly traceable to other medications that compete for similar liver metabolic enzymes, such as azole antifungals, fibrates, and erythromycin; these and related medications are examples in which the dosage of statin may need to be reduced or the medication temporarily withheld to avoid toxicity. Antiatherosclerotic effects of statins have been largely attributed to lowering of LDL cholesterol and remnant particle concentrations in plasma, through reduction in cholesterol synthesis and upregulation of LDL receptors. Other antiatherosclerotic effects of statins are now appreciated, which include reduction in the inflammatory marker C-reactive protein, reduction in phagocytic respiratory burst, blunting of angiotensin II effects, and increased synthesis of vascular nitric oxide (8).

2.2 Indications for Use

Statin medications are approved in the United States for treatment of elevated lipid levels (primary hypercholesterolemia, mixed dyslipidemia, and hypertriglyceridemia) and homozygous familial hyperlipidemia. Some statins also have indications for primary prevention of coronary events and for secondary prevention of coronary events.

2.3 Effects of Statins on Traditional Lipid Concentrations

Several statins have been developed and marketed to treat hyperlipidemia, and this situation has led to trials that have compared
the relative efficacy of the compounds at different prescribed doses. An example of the comparative potency of various statins to alter lipid levels was shown in the STELLAR trial, which was a study of 2431 adults with LDL cholesterol levels 160–250 mg/dL and triglycerides <400 mg/dL at the outset of the study (9). The participants were randomized to treatment with a variety of statins at different dosages. As shown in the top of Table 2, the decrease of LDL-C varied according to the brand of statin and the daily dose. Doubling the dose for a specific drug generally provided a decrease of approximately 6% to 7% in LDL-C levels, and several statin medications were equipotent in terms of LDL-C reduction, although the doses varied. Changes in HDL-C with statin therapy were also favorable, and they showed increases over baseline levels. However, the increments were typically less than 10% compared with baseline, and they were less dependent on the dose of medication. Finally, triglyceride levels generally declined on statin therapy, especially with atorvastatin and rosuvastatin use.

2.4 Drug Safety

Common side effects of statins include headache, gastrointestinal intolerance, myalgias, and muscle fatigue. An increase in liver transaminase enzymes occurs in approximately 0.5% of users and is dose-dependent (10). Checking levels of transaminases approximately 6–8 weeks after starting a statin medication, increasing the dosage, or adding another lipid medication is advised, and specific recommendations on when to obtain liver transaminase levels are provided in the package insert for the individual medications. Myopathy has been described in approximately 0.2–0.4% of statin users, and a history of reduced kidney function and combination therapy with fibrates both increase the risk of developing myopathy or rhabdomyolysis (11). Muscular symptoms should be regularly assessed in statin users, and creatine kinase levels should be obtained if myopathy symptoms are reported. If a creatine kinase level exceeds 10 times the upper limit of normal, or if rhabdomyolysis occurs, then it is recommended that statin therapy be stopped. An
Table 1. Comparative Pharmacokinetics of Statins

Pharmacokinetic Parameter | Atorvastatin | Fluvastatin | Lovastatin | Pravastatin | Rosuvastatin | Simvastatin
Major metabolic isoenzyme | 3A4 | 2C9 | 3A4 | None | 2C9 (small amount) | 3A4
Elimination half-life (hours) | 14 | 1.2 | 3 | 1.8 | 19 | 2

The major metabolic isoenzymes and the elimination half-lives of statins are related to potential for side effects, drug toxicity, and potency of the medications (7).
Table 2. Mean Change in LDL-C Compared with Baseline is Related to the Brand and Dosage of Different Statins (Adapted from Reference 9)

Daily Dosage | Rosuvastatin | Atorvastatin | Simvastatin | Pravastatin
LDL-C effects
10 mg | −46% | −37% | −28% | −20%
20 mg | −52% | −43% | −35% | −24%
40 mg | −55% | −48% | −39% | −30%
80 mg | Dose not used | −51% | −46% | Dose not used
HDL-C effects
10 mg | +7.7% | +5.7% | +5.3% | +3.2%
20 mg | +9.5% | +4.8% | +6.0% | +4.4%
40 mg | +9.6% | +4.4% | +5.2% | +5.6%
80 mg | Dose not used | +2.1% | +6.8% | Dose not used
Triglyceride effects
10 mg | −20% | −20% | −12% | −8%
20 mg | −24% | −23% | −18% | −8%
40 mg | −26% | −27% | −15% | −13%
80 mg | Dose not used | −28% | −18% | Dose not used
increase in the relative frequency of elevated creatine kinase levels and of rhabdomyolysis was observed for cerivastatin compared with other statins, and led to a recall of that brand of statin (12). The current opinion is that increased creatine kinase levels occur rarely on statin therapy, that the complications can be serious, and that the occurrence of creatine kinase elevation is greater when statins are taken in combination with other drugs.
3 MAJOR RESULTS FROM CLINICAL TRIALS
3.1 Large Primary and Secondary Prevention Trials

Several randomized clinical trials have been conducted to test the usefulness of statin medications in preventing clinical cardiovascular disease. The early studies tended to
evaluate a statin versus placebo to prevent coronary heart disease events and recruited persons at high risk for a subsequent heart disease event. Slightly different scenarios were involved in each trial according to the geographic location of the trial, age range of the participants, and the specific clinical outcomes investigated. The overall results of several of the trials are summarized in Table 3, where information is shown for 14 key trials that were reviewed by the Cholesterol Treatment Trialists (CTT) (13). Additional information for the Treatment to New Targets (TNT), the Incremental Decrease in End Points Through Aggressive Lipid Lowering (IDEAL), and the Management of Elevated Cholesterol in the Primary Prevention Group of Adult Japanese (MEGA) trials has been reported in the past 2
Table 3. Summary of the Major Statin Trials with Clinical Event Outcomes

No. | Type | Study (Baseline→Event) | Participants | Mean Follow-up (yr) | Age (yr) | Women (%) | Diabetes (%) | Rx Groups (Active vs. Control) | Event Rates (Active vs. Control) | LDL-C Lowering at Yr 1 | Relative Risk Reduction
1 | S | 4S (CHD→Death) | 4444 | 5.2 | 35−70 | 19% | 5% | S20-40 vs. Placebo | 8% vs. 12% | −36% | −30%
2 | P | WOSCOPS (xxx) | 6595 | 4.8 | 45−64 | 0% | 1% | P40 vs. Placebo | 5.5% vs. 7.9% | −22% | −31%
3 | S | CARE (MI→Hard CHD) | 4159 | 4.8 | 21−75 | 14% | 14% | P40 vs. Placebo | 10.2% vs. 13.2% | −29% | −24%
4 | P | Post-CABG (CABG→Angio progression) | 1351 | 4.2 | 21−74 | 8% | 9% | L40-80 vs. Pbo, Warfarin | — | −27% | —
5 | P | AFCAPS/TexCAPS (→Major CHD) | 6605 | 5.3 | 45−73 M, 55−73 W | 15% | 2% | L20-40 vs. Placebo | 6.8% vs. 10.9% | −24% | −37%
6 | S | LIPID (MI, Angina→CHD Death) | 9014 | 5.6 | 31−75 | 17% | 9% | P40 vs. Placebo | 6.4% vs. 8.3% | −27% | −24%
7 | P | GISSI Prevention (MI→Major CVD) | 4271 | 1.9 | 19−90 | 14% | 14% | P20 vs. No Rx | 5.6% vs. 6.4% | −9% | −10%
8 | P | LIPS (PCI→Major CVD) | 1677 | 3.1 | 18−80 | 16% | 12% | F80 vs. Placebo | 21.4% vs. 26.7% | −27% | −22%
9 | S | HPS (CHD/PAD/DM→CVD) | 20,536 | 5.0 | 40−80 | 25% | 29% | S40 vs. Placebo | 12.9% vs. 14.7% | −38% | −13%
10 | P | PROSPER (→CVD) | 5804 | 3.2 | 70−82 | 52% | 11% | P40 vs. Placebo | 14.1% vs. 16.2% | −27% | −15%
11 | P | ALLHAT-LLT (Hypertensive→Death) | 10,355 | 4.8 | ≥55 | 49% | 35% | P40 vs. Usual Rx | 14.9% vs. 15.3% | −14% | −1%
12 | P | ASCOT-LLA (Hypertensive→Death) | 10,305 | 3.2 | 40−79 | 19% | 25% | A10 vs. Placebo | 6.0% vs. 9.4% | −31% | −36%
13 | P | ALERT (Renal disease→Major CVD) | 2102 | 5.1 | 30−75 | 34% | 9% | F40 vs. Placebo | 10.7% vs. 12.7% | −20% | −17%
14 | P | CARDS (Diabetes→Major CVD) | 2838 | 3.9 | 40−75 | 32% | 100% | A10 vs. Placebo | 1.5% vs. 2.5% | −38% | −37%
15 | S | IDEAL (MI→Major CHD) | 8888 | 4.8 | <80 | 19% | 12% | A80 vs. S20 | 9.3% vs. 10.4% | −23% | −11%
16 | S | TNT (CHD→Major CVD) | 10,001 | 4.9 | 35−75 | 19% | 15% | A80 vs. A10 | 8.7% vs. 10.9% | −23% | −22%
17 | P | MEGA (Hypercholesterolemia→CHD) | 7832 | 5.3 | 40−70 | 6% | 21% | P10-20 vs. Placebo | 3.3% vs. 5.0% | −17% | −33%

Includes trials No. 1–14 reported by the Cholesterol Treatment Trialists and three more recent studies. Study type is denoted as P for primary and S for secondary. Medication abbreviations: A, Atorvastatin; F, Fluvastatin; L, Lovastatin; P, Pravastatin; S, Simvastatin. See Glossary for full study names.
years and is included at the bottom of Table 3 (14–16). As observed in Table 3, the number of persons who actively participated in these randomized clinical trials was very large, and it included approximately 115,000 persons. Most large clinical trials were conducted over an interval of 3–5 years, and the outcomes that have been studied in the greatest detail have been myocardial infarction, coronary heart disease death, and overall mortality. Often composite outcomes have been used to report the results, and the studies have typically included several secondary outcomes and subgroup analyses. The primary prevention trials have shown that LDL-C lowering typically exceeded 20% and relative risk reductions in the 25–35% range for clinical events were obtained. An exception to this rule was the ALLHAT-LLT study, in which it was concluded that there was considerable drop-in use of lipid-altering medication during the course of the trial for persons who had been assigned to placebo. Medication crossover diminished the lipid differences between the statin group and the control group, which resulted in only a 14% difference in LDL-C lowering between the active therapy group and controls. This was felt to underlie the null result in ALLHAT-LLT. Almost all other primary prevention trials reported positive results and reduced heart disease risk for the primary outcome, but null results were also obtained in the PROSPER and GISSI studies. Each of these trials used less potent statin formulations than in most of the other investigations. In the large secondary prevention trials the outcome of interest has typically been occurrence of the next coronary heart disease event or coronary heart disease death. The landmark Scandinavian Simvastatin Survival Study (4S) was the first investigation to show that a statin medication (simvastatin 20 mg or 40 mg daily) could reduce risk of coronary heart disease death and overall mortality (17). More recently, secondary prevention trials such as TNT and IDEAL have included active controls and used an entry level of a statin medication such as simvastatin 20 mg/day or atorvastatin 10 mg/day as a control agent compared with a more aggressive regimen of atorvastatin 80 mg/day.
Each of these studies was positive, which supports the view that very aggressive LDL-C lowering was efficacious, safe, and life saving in persons with known cardiovascular disease. Lowering in LDL-C in secondary prevention has been especially effective in reduction of hard CHD (myocardial infarction or coronary heart disease death) events, as shown in Table 3 and Fig. 1. The LDL-C achieved on therapy has been related to the risk of Hard CHD events during the course of the clinical trial, as shown in Fig. 1 for 4 S, LIPID, CARE, HPS, IDEAL, and TNT. Persons at lower LDL-C levels while on statin therapy benefited and the lower bound of LDL-C target for therapy is not evident at this time. 3.2 Special Groups and Subgroup Analyses The relative effect of statins on different causes of death was comprehensively evaluated in a meta-analysis conducted by the CTT investigators and some of those results are summarized in Table 4 (18). The relative risk effects for treatment (statin) versus control (usually inactive placebo) in the table are presented per mmol/L unit of LDL-C (1 mmol = 38.66 mg/dl). More favorable effects in declining order were observed for coronary heart disease, any death, and stroke. On the other hand, nonsignificant effects were observed for cancer and non-vascular death across these studies that included more than 90,000 persons enrolled in statin clinical trials. Similar subgroup analyses were undertaken by the CTT group concerning the proportional effect of lipid lowering on different types of vascular disease clinical events, as shown in Table 5. In declining order, more favorable effects were observed for nonfatal myocardial infarction, coronary artery bypass graft, percutaneous coronary angioplasty, and presumed ischemic stroke. Nonsignificant effects were observed for use of statins and risk of hemorrhagic stroke. Favorable effects of statins were demonstrated for the composites any stroke or any vascular event, as shown at the bottom of Table 5. Additionally, a trial of statins in persons with diabetes mellitus on hemodialysis did not show a benefit of statins to prevent cardiovascular disease in this special subgroup (19).
[Figure 1: event rate (%, 0–30%) plotted against achieved LDL cholesterol (50–250 mg/dl) for the placebo/control and active statin arms of 4S, LIPID, CARE, HPS, IDEAL, TNT (10-mg and 80-mg arms), and MEGA.]
Figure 1. Achieved LDL cholesterol after 1 year of therapy on active treatment with statins or in the placebo control group was related to the development of Hard CHD events (myocardial infarction or coronary heart disease death) in several secondary prevention trials. In general, the lower the LDL-C achieved the lower the risk of a subsequent CHD event. See Glossary for full study names.
Table 4. Cause Specific Mortality and Proportional Effect of 1 mmol/L Reduction in LDL-C in 14 Randomized Trials of Statins (Adapted from Reference 18)

Cause of Death | Treatment (n = 45,054) | Control (n = 45,002) | Relative Risk (95% confidence interval)
Coronary heart disease | 3.4% | 4.4% | 0.81 (0.76−0.85)
Stroke | 0.6% | 0.6% | 0.91 (0.74−1.11)
Any vascular | 4.7% | 5.7% | 0.83 (0.79−0.87)
Cancer | 2.4% | 2.4% | 1.01 (0.91−1.12)
Any nonvascular | 3.8% | 4.0% | 0.95 (0.90−1.01)
Any death | 8.5% | 9.7% | 0.88 (0.84−0.91)
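The relative risks in Table 4 are expressed per 1 mmol/L of LDL-C lowering. Under the log-linear, proportional-per-mmol/L relationship assumed in such meta-analyses, a per-mmol/L relative risk can be rescaled to a different LDL-C reduction, as in the small sketch below; the 38.66 mg/dl-per-mmol/L conversion is the one quoted earlier in this article, and the example numbers are purely illustrative.

```python
# Rescale a per-1-mmol/L relative risk (as in Table 4) to a different LDL-C
# reduction, assuming the log-linear (proportional-per-mmol/L) relationship
# used in the CTT meta-analysis.  Illustrative only.
MG_DL_PER_MMOL = 38.66  # conversion quoted in the text

def scaled_rr(rr_per_mmol: float, ldl_drop_mg_dl: float) -> float:
    """Expected relative risk for a given LDL-C reduction in mg/dl."""
    drop_mmol = ldl_drop_mg_dl / MG_DL_PER_MMOL
    return rr_per_mmol ** drop_mmol

# Example: coronary heart disease death, RR 0.81 per mmol/L (Table 4),
# rescaled to a hypothetical 50 mg/dl LDL-C reduction.
print(round(scaled_rr(0.81, 50.0), 2))   # ~0.76
```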
Subgroup analyses have been of considerable interest in the major statin trials. Special groups have included the elderly, women, and persons with diabetes mellitus. A summary included in the CTT review showed no relative risk differences for statins to prevent cardiovascular events in persons <65 years versus >65 years, men versus women, treated hypertension versus no hypertension, history of diabetes mellitus versus no diabetes mellitus, and LDL-C at baseline <135 mg/dl versus >135 mg/dl. Additionally, the CTT authors showed no difference in the overall risk of cancer or in several major subtypes of cancer in statin users compared with nonusers.
3.3 Statin Effects Other Than LDL-C Lowering and Atherosclerotic Protection

There is considerable interest in the investigation of statin effects on atherosclerotic mechanisms that are not mediated by LDL-C. Persons with higher levels of inflammatory markers, and the ability of statins to lower inflammatory markers, have received the greatest attention. The biomarker that has been studied the most in this arena is C-reactive protein (CRP). Studies have shown that persons with higher CRP at baseline were especially likely to benefit from statin use in the PRINCE trial (20), and low levels of LDL-C and CRP were important determinants of lower risk for vascular disease in the PROVE-IT TIMI-22 trial (21). A review has
Table 5. Vascular Disease Events and Proportional Effect of 1 mmol/L Reduction in LDL-C with Results from 14 Randomized Clinical Trials of Statins (Adapted from Reference 18)

Vascular Event | Treatment (n = 45,054) | Control (n = 45,002) | Relative Risk (95% confidence interval)
Nonfatal myocardial infarction | 4.4% | 6.2% | 0.74 (0.70−0.79)
CABG | 1.6% | 2.2% | 0.75 (0.69−0.82)
PTCA | 1.1% | 1.5% | 0.79 (0.69−0.90)
Hemorrhagic stroke | 0.2% | 0.2% | 1.05 (0.78−1.41)
Presumed ischemic stroke | 2.8% | 3.4% | 0.81 (0.74−0.89)
Any stroke | 3.0% | 3.7% | 0.83 (0.78−0.88)
Any major vascular event | 14.1% | 17.8% | 0.79 (0.77−0.81)
shown that most CRP lowering that has been observed in lipid-lowering trials is highly correlated with the LDL-C lowering (22), and more research is needed to distinguish lipid and nonlipid effects of statins.
4 CONCLUSIONS
Statins have been on the market for approximately 20 years, and considerable success in the prevention of initial and recurrent cardiovascular events has been demonstrated for this class of medications. Safety of statins was a major concern when the medications were initially released, but experience with the medications has shown good tolerance in most instances. Achievement of lower LDL-C levels in persons at high risk for cardiovascular disease events will be of continued interest, as will improving our understanding of the role of statins in inflammatory processes related to atherosclerosis.
5 GLOSSARY
Several acronyms are used for investigations, and they are summarized here:
AFCAPS/TexCAPS = Air Force/Texas Coronary Atherosclerosis Prevention Study
ALERT = Assessment of Lescol in Renal Transplantation
ALLHAT-LLT = Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial
ASCOT-LLA = Anglo-Scandinavian Cardiac Outcomes Trial-Lipid Lowering Arm
CARDS = Collaborative Atorvastatin Diabetes Study
CARE = Cholesterol And Recurrent Events
GISSI Prevention = Gruppo Italiano per lo Studio della Sopravvivenza nell'Infarto Miocardico
HPS = Heart Protection Study
IDEAL = Incremental Decrease in End Points Through Aggressive Lipid Lowering
LIPID = Long-term Intervention with Pravastatin in Ischaemic Disease
LIPS = Lescol Intervention Prevention Study
MEGA = Management of Elevated Cholesterol in the Primary Prevention Group of Adult Japanese
Post-CABG = Post-Coronary Artery Bypass Graft
PROSPER = PROspective Study of Pravastatin in the Elderly at Risk
4S = Scandinavian Simvastatin Survival Study
TNT = Treatment to New Targets
WOSCOPS = West of Scotland Coronary Prevention Study
REFERENCES 1. W. Rosamond, K. Flegal, G. Friday, K. Furie, A. Go, K. Greenlund, N. Haase, M. Ho, V. Howard, B. Kissela, S. Kittner, D. LloydJones, M. McDermott, J. Meigs, C. Moy, G. Nichol, C. J. O’Donnell, V. Roger, J. Rumsfeld, P. Sorlie, J. Steinberger, T. Thom, S. Wasserthiel-Smoller, Y. Hong, Heart disease and stroke statistics–2007 update: a report from the American Heart Association Statistics Committee and Stroke Statistics Subcommittee. Circulation 2007; 115: e69–171. 2. E. J. Schaefer and R. I. Levy, Pathogenesis and management of lipoprotein disorders. N. Engl. J. Med. 1985; 312: 1300–1309.
3. A. Endo, The discovery and development of HMG-CoA reductase inhibitors. J. Lipid Res. 1992; 33: 1569–1582.
4. R. H. Bradford, C. L. Shear, A. N. Chremos, C. Dujovne, M. Downton, F. A. Franklin, A. L. Gould, M. Hesney, J. Higgins, D. P. Hurley, A. Langendorfer, D. T. Nash, J. L. Pool, and H. Schnaper, Expanded clinical evaluation of lovastatin (EXCEL) study results: I. efficacy in modifying plasma lipoproteins and adverse event profile in 8245 patients with moderate hypercholesterolemia. Arch. Intern. Med. 1991; 151: 43–49. 5. A. M. Laties, C. L. Shear, E. A. Lippa, A. L. Gould, H. R. Taylor, D. P. Hurley, W. P. Stephenson, E. U. Keates, M. A. Tupy-Visich, and A. N. Chremos, Expanded clinical evaluation of lovastatin (EXCEL) study results II. assessment of the human lens after 48 weeks of treatment with lovastatin. Am. J. Cardiol. 1991; 67: 447–453.
6. D. B. Hunninghake, E. A. Stein, C. A. Dujovne, W. S. Harris, E. B. Feldman, V. T. Miller, J. A. Tobert, P. M. Laskarzewski, E. Quiter, and J. Held, The efficacy of intensive dietary therapy alone or combined with lovastatin in outpatients with hypercholesterolemia. N. Engl. J. Med. 1993; 328: 1213–1219. 7. A. Gaw, C. J. Packard, Comparative chemistry pharmacology and mechanism of action of the statins. In: A. Gaw, C. J. Packard, J. Shepherd, editors. Statins: The HMG-CoA Reductase Inhibitors in Perspective. London: Martin Dunitz, 2004; 45–57. 8. L. L. Stoll, M. L. McCormick, G. M. Denning, and N. L. Weintraub, Antioxidant effects of statins. Drugs Today 2004; 40: 975–990. 9. P. H. Jones, M. H. Davidson, E. A. Stein, H. E. Bays, J. M. McKenney, E. Miller, V. A. Cain, and J. W. Blasetto, Comparison of the efficacy and safety of rosuvastatin versus atorvastatin, simvastatin, and pravastatin across doses (STELLAR* Trial). Am. J. Cardiol. 2003; 92: 152–160. 10. K. G. Tolman, The liver and lovastatin. Am. J. Cardiol. 2002; 89: 1374–1380. 11. P. D. Thompson, P. Clarkson, and R. H. Karas, Statin-associated myopathy. JAMA 2003; 289: 1681–1690. 12. J. A. Staffa, J. Chang, and L. Green, Cerivastatin and reports of fatal rhabdomyolysis. N. Engl. J. Med. 2002; 346: 539–540. 13. C. Baigent, A. Keech, P. M. Kearney, L. Blackwell, G. Buck, C. Pollicino, A. Kirby, T. Sourjina, R. Peto, R. Collins, and R. Simes, Efficacy and safety of cholesterol-lowering treatment: prospective meta-analysis of data from 90,056 participants in 14 randomised trials of statins. Lancet 2005; 366: 1267–1278. 14. J. C. LaRosa, S. M. Grundy, D. D. Waters, C. Shear, P. Barter, J. C. Fruchart, A. M. Gotto, H. Greten, J. J. Kastelein, J. Shepherd, N. K. Wenger, Intensive lipid lowering with atorvastatin in patients with stable coronary disease. N. Engl. J. Med. 2005; 352: 1425–1435.
15. T. R. Pedersen, O. Faergeman, J. J. Kastelein, A. G. Olsson, M. J. Tikkanen, I. Holme, M. L. Larsen, F. S. Bendiksen, C. Lindahl, M. Szarek, J. Tsai, High-dose atorvastatin vs usual-dose simvastatin for secondary prevention after myocardial infarction: the IDEAL study: a randomized controlled trial. JAMA 2005; 294: 2437–2445. 16. H. Nakamura, K. Arakawa, H. Itakura, A. Kitabatake, Y. Goto, T. Toyota, N. Nakaya, S. Nishimoto, M. Muranaka, A. Yamamoto, K. Mizuno, Y. Ohashi, Primary prevention of cardiovascular disease with pravastatin in Japan (MEGA Study): a prospective randomised controlled trial. Lancet 2006; 368: 1155–1163. 17. The 4S Group, Randomised trial of cholesterol lowering in 4444 patients with coronary heart disease: the Scandinavian Simvastatin Survival Study (4S). Lancet 1994; 344: 1383–1389. 18. C. Baigent, A. Keech, P. M. Kearney, L. Blackwell, G. Buck, C. Pollicino, A. Kirby, T. Sourjina, R. Peto, R. Collins, R. Simes. Efficacy and safety of cholesterol-lowering treatment: prospective meta-analysis of data from 90,056 participants in 14 randomised trials of statins. Lancet 2005; 366: 1267–1278. 19. C. Wanner, V. Krane, W. Marz, M. Olschewski, J. F. Mann, G. Ruf, E. Ritz, Atorvastatin in patients with type 2 diabetes mellitus undergoing hemodialysis. N. Engl. J. Med. 2005; 353: 238–248. 20. M. A. Albert, E. Danielson, N. Rifai, P. M. Ridker, Effect of statin therapy on C-reactive protein levels: the Pravastatin Inflammation/CRP Evaluation (PRINCE): a randomized trial and cohort study. JAMA 2001; 286: 64–70. 21. P. M. Ridker, D. A. Morrow, L. M. Rose, N. Rifai, C. P. Cannon, and E. Braunwald, Relative efficacy of atorvastatin 80mg and pravastatin 40mg in achieving the dual goals of low-density lipoprotein cholesterol <70mg/dl and C-reactive protein <2mg/l: an analysis of the PROVE-IT TIMI-22 trial. J. Am. Coll. Cardiol. 2005; 45: 1644–1648.
22. S. Kinlay, Low-density lipoprotein-dependent and -independent effects of cholesterol-lowering therapies on C-reactive protein: a meta-analysis. J. Am. Coll. Cardiol. 2007; 49: 2003–2009.
FURTHER READING Prevention of cardiovascular events and death with pravastatin in patients with coronary heart disease and a broad range of initial cholesterol levels. The Long-Term Intervention with Pravastatin in Ischaemic Disease (LIPID) Study Group. N. Engl. J. Med. 1998; 339: 1349–1357. H. M. Colhoun, D. J. Betteridge, P. N. Durrington, G. A. Hitman, H. A. Neil, S. J. Livingstone, M. J. Thomason, M. I. Mackness, V. Charlton-Menys, J. H. Fuller, Primary prevention of cardiovascular disease with atorvastatin in type 2 diabetes in the Collaborative Atorvastatin Diabetes Study (CARDS): multicentre randomised placebo-controlled trial. Lancet 2004; 364: 685–696. J. R. Downs, M. Clearfield, S. Weis, E. Whitney, D. R. Shapiro, P. A. Beere, A. Langendorfer, E. A. Stein, W. Kruyer, A. M. Gotto Jr., Primary prevention of acute coronary events with lovastatin in men and women with average cholesterol levels: results of AFCAPS/TexCAPS. JAMA 1998; 279: 1615–1622. GISSI Prevenzione Investigators, Results of the low-dose (20mg) pravastatin GISSI Prevenzione trial in 4271 patients with recent myocardial infarction: do stopped trials contribute to overall knowledge? GISSI Prevenzione Investigators (Gruppo Italiano per lo Studio della Sopravvivenza nell’Infarto Miocardico). Ital. Heart J. 2000; 1: 810–820. Heart Protection Study Collaborative Group, MRC/BHF Heart Protection Study of cholesterol lowering with simvastatin in 20,536 high-risk individuals: a randomised placebocontrolled trial. Lancet 2002; 360: 7–22. H. Holdaas, B. Fellstrom, A. G. Jardine, I. Holme, G. Nyberg, P. Fauchald, C. Gronhagen-Riska, S. Madsen, H. H. Neumayer, E. Cole, B. Maes, P. Ambuhl, A.G. Olsson, A. Hartmann, D. O. Solbu, T.R. Pedersen, Effect of fluvastatin on cardiac outcomes in renal transplant recipients: a multicentre, randomised, placebo-controlled trial. Lancet 2003; 361: 2024–2031. Post Coronary Artery Bypass Graft Trial Investigtors, The effect of aggressive lowering of lowdensity lipoprotein cholesterol levels and lowdose anticoagulation on obstructive changes in
saphenous-vein coronary-artery bypass grafts. N. Engl. J. Med. 1997; 336: 153–162. F. M. Sacks, M. A. Pfeffer, L. A. Moye, J. L. Rouleau, J. D. Rutherford, T. G. Cole, L. Brown, J. W. Warnica, J. M. Arnold, C. C. Wun, B.R. Davis, and E. Braunwald, The effect of pravastatin on coronary events after myocardial infarction in patients with average cholesterol levels. Cholesterol and Recurrent Events Trial investigators. N. Engl. J. Med. 1996; 335: 1001–1009. P. W. Serruys, P. de Feyter, C. Macaya, N. Kokott, J. Puel, M. Vrolix, A. Branzi, M. C. Bertolami, G. Jackson, B. Strauss, B. Meier, Fluvastatin for prevention of cardiac events following successful first percutaneous coronary intervention: a randomized controlled trial. JAMA 2002; 287: 3215–3222. P. S. Sever, B. Dahlof, N. R. Poulter, H. Wedel, G. Beevers, M. Caulfield, R. Collins, S. E. Kjeldsen, A. Kristinsson, G. T. McInnes, J. Mehlsen, M. Nieminen, E. O’Brien, and J. Ostergren, Prevention of coronary and stroke events with atorvastatin in hypertensive patients who have average or lower-than-average cholesterol concentrations, in the Anglo-Scandinavian Cardiac Outcomes Trial–Lipid Lowering Arm (ASCOTLLA): a multicentre randomised controlled trial. Lancet 2003; 361: 1149–1158. J. Shepherd, G. J. Blauw, M. B. Murphy, E. L. Bollen, B. M. Buckley, S. M. Cobbe, I. Ford, A. Gaw, M. Hyland, J. W. Jukema, A. M. Kamper, P. W. MacFarlane, et al., Pravastatin in elderly individuals at risk of vascular disease (PROSPER): a randomised controlled trial. Lancet 2002; 360: 1623–1630. J. Shepherd, S. M. Cobbe, I. Ford, C. G. Isles, A. R. Lorimer, P. W. MacFarlane, J. H. McKillop, and C. J. Packard, Prevention of coronary heart disease with pravastatin in men with hypercholesterolemia. West of Scotland Coronary Prevention Study Group. N. Engl. J. Med. 1995; 333: 1301–1307. The 4S Group, Randomised trial of cholesterol lowering in 4444 patients with coronary heart disease: the Scandinavian Simvastatin Survival Study (4S). Lancet 1994; 344: 1383–1389. The ALLHAT Officers and Coordinators for the ALLHAT Collaborative Research Group, Major outcomes in moderately hypercholesterolemic, hypertensive patients randomized to pravastatin vs usual care: The Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial (ALLHAT-LLT). JAMA 2002; 288: 2998–3007.
CROSS-REFERENCES
ASCOT Trials
Coronary Drug Project
Disease Trials for Cardiovascular Diseases
TIMI Trial
TNT Trial
Trials on Metabolic and Endocrine Diseases
Update in Hyperlipidemia Clinical Trials
STEPPED WEDGE DESIGN
JAMES P. HUGHES
University of Washington, Seattle, Washington

In stepped wedge trials, the outcome of interest is evaluated under both the control and intervention condition using cohorts or repeated cross-sectional surveys. In this article, several examples of stepped wedge trials are described, and the advantages and disadvantages of different approaches to outcome measurement are discussed. Various methods have been used to analyze stepped wedge trials, and the strengths and weaknesses of the different approaches are illustrated with several examples. Power calculations have been similarly varied and have relied on both simulations and model-based results. The frequency of stepped wedge trials seems to be increasing, and additional methodological research on this design is needed.

Stepped wedge or phased introduction trials are randomized trials in which the time at which an intervention is introduced to an individual or into a community is randomized. In contrast to a parallel design, in which half the individuals or communities are denied the intervention, or a crossover design, in which the intervention may be withdrawn after introduction, a stepped wedge design ensures that all individuals or communities eventually receive the intervention on an ongoing basis. Outcome measurements are collected, both before and after introduction of the intervention, either continuously or at each time period (each step). Figure 1 contrasts the stepped wedge design with the more common parallel and crossover designs. The phased introduction of the intervention into the study units creates the distinctive stair-step pattern that gives this design its name. Importantly, the availability of contemporaneous intervention and control communities allows one to distinguish an intervention effect from an underlying temporal trend (which is not possible with a simple before–after design).

Stepped wedge designs are particularly appropriate for evaluating an intervention that has been shown to be efficacious in a more limited research setting and is now being scaled up (e.g., Phase IV trials). In such a case, ethical considerations preclude denying the intervention to some individuals or communities, but logistical difficulties may prevent initiation of delivery of the intervention everywhere or to everyone simultaneously. In such cases, randomizing the order in which the intervention is provided not only creates an impression of fairness but also allows a careful evaluation of the real-world or community effectiveness of the intervention. A second important feature of the stepped wedge design is that it allows one to evaluate temporal trends in the effect of the intervention. Several considerations potentially limit the use of the stepped wedge design. Typically, stepped wedge trials take longer than other trials to conduct because of the multiple time steps. As with any crossover trial, carryover effects must be considered, especially if the control period is actually a standard-of-care intervention (see Ref. 1 for an example). Blinding may be impossible in a stepped wedge trial, so issues of mobility between communities and the potential for contamination must be considered. Additional efforts to ensure unbiased outcome measurements through, for example, ensuring that the individuals assessing the outcome are blind to the current intervention status of the individual or cluster being assessed, may be needed (2).
1 EXAMPLES OF STEPPED WEDGE TRIALS
Relatively few stepped wedge trials have been conducted, although the design seems to be increasing in popularity. In a recent review of the literature, Brown and Lilford (2) identified 12 protocols or papers that used or proposed using a stepped wedge design. The first application of the stepped wedge was apparently in The Gambia (3) where a hepatitis B vaccine (HbV) was introduced in a phased fashion to the country’s 17 health
districts. The immediate outcome of interest was the level of antibody titers found in a survey of infants in the health districts. The longer-term objective of the intervention was to prevent liver cancer and chronic liver disease. Grant and colleagues (4) provide an example of an individually randomized stepped wedge trial. They used a phased introduction of preventive isoniazid therapy among HIV-infected men in a South African mining town to prevent incident tuberculosis (Tb) infection. Because of capacity limitations of the mining company clinic, only a few men could initiate therapy at any one time, so the order in which men began therapy was randomly assigned. More recently, Golden and colleagues (5) describe a trial that will evaluate the community-level effects of expedited partner treatment for gonorrhea and Chlamydia infection in Washington State. In a prior individually randomized trial, Golden and colleagues (6) showed that delivery of medication to partners by patients diagnosed with gonorrhea and/or Chlamydia was effective in reducing reinfection in the original index patient. This intervention now is being introduced in phases throughout Washington, and the stepped wedge design (with county as the unit of randomization) provides an ideal mechanism to evaluate the community-level effect of the intervention. Moulton and colleagues (7) describe a stepped wedge randomized trial to evaluate a program to train health personnel in routine Tb testing and isoniazid prophylaxis for HIV-infected individuals in Brazil. In this case, health clinics were chosen as the unit of randomization, and the authors note that it would not be feasible to train all the health care workers at all the clinics simultaneously. Again, pragmatic considerations led to the stepped wedge design.
[Figure 1. Alternative trial designs. T and C represent the two randomization arms.
Parallel (time 1): cluster 1: T; cluster 2: T; cluster 3: C; cluster 4: C.
Crossover (times 1-2): cluster 1: C T; cluster 2: T C; cluster 3: C T; cluster 4: T C.
Stepped wedge (times 1-4): cluster 1: C T T T; cluster 2: C C T T; cluster 3: C C C T.]
2 DESIGN ISSUES

Brown and Lilford (2) cite Cook and Campbell (8) as the first authors to describe a stepped wedge design (Cook and Campbell use the term ''experimentally staged introduction''). Smith and Morrow (9) also provide information on the implementation of this design. Recent papers by Hussey and Hughes (10), Moulton and colleagues (7), and the aforementioned review by Brown and Lilford (2) discuss various aspects of the design and analysis of stepped wedge trials.

2.1 Outcome Evaluation

Stepped wedge trials have used different approaches for outcome evaluation. For example, in their trial of expedited partner treatment for sexually transmitted infections (STI), Golden and colleagues (5) use repeated cross-sectional surveys at sentinel sites in each county at each time step to estimate the prevalence of Chlamydia. It is expected that different individuals will be sampled in each survey. Similarly, Moulton and colleagues (7) count person-years and the number of incident tuberculosis cases during each time step at each clinic. In both of these cases, the intervention effect is expected to be realized soon after introduction. In contrast, in the Gambia Hepatitis Study, infants who received vaccination (either the control condition—a standard set of vaccines—or the intervention—the standard set plus HbV) were not expected to experience the primary trial outcome (liver disease and liver cancer) for many years. Therefore, infants presenting for vaccination within each health district at each time step are viewed as cohorts and are being followed over time to evaluate the intervention effect. In this manner, cases of disease can be uniquely assigned to a control or intervention time period. A third approach to evaluation is illustrated in the study by
Grant and colleagues (4). In this study, a cohort of men was identified at the start of the study and followed over time as the intervention was introduced. Thus, Tb incidence could be ascertained both before and after the introduction of the intervention. With this strategy, however, consideration of the nature of the outcome is particularly important because of the possibility of ''healthy survivor'' bias (4,11). Specifically, if the outcome is mortality or infection with an incurable disease, and if participants vary with respect to their risk of the outcome, then more events will be observed early in the trial (when most participants are receiving the control condition) compared with later in the trial (when most participants are receiving the treatment condition). This situation has implications for the analysis of the trial data and is discussed further below.

2.2 Analysis of Stepped Wedge Trials

In a stepped wedge trial, information about the intervention effect is available from both within-unit and between-unit comparisons. The within-unit comparisons may provide a more precise estimate of the treatment effect but are also problematic because of the potential for confounding of the estimated treatment effect with an underlying temporal trend. Nonetheless, a common analytic approach (12–14) has been simply to compare average outcomes or event rates during the control period with average outcomes or event rates during the intervention period. Such an analysis must be interpreted with caution because any underlying temporal trend in the rate of disease that is occurring independent of the intervention would seem to be attributable to the intervention. For example, an epidemic increase in the prevalence or incidence of the disease during the trial could mask the benefit of an effective intervention. Alternatively, if a cohort of participants is assembled at the beginning of the study and followed for the occurrence of death or an incurable disease, then the frailest individuals will likely succumb early in the study (when most participants are being exposed to the control condition) and only the healthiest individuals remain during the intervention period. Such a ''healthy survivor'' effect would
bias the results in favor of the intervention. Even if the outcome is recurrent (e.g., Tb), an underlying increase or decrease in an individual participant’s risk over time can confound the intervention effect [e.g., Grant and colleagues (4) note that the HIV-infected men in their trial were expected to experience an increase in Tb incidence over time even in the absence of an intervention because of HIV disease progression]. For these reasons, some control for time is advisable when analyzing data from stepped wedge trials. One approach to controlling for temporal confounding is to restrict the analyses to contemporaneous comparisons between units receiving and not receiving the treatment. Essentially, this approach views the stepped wedge design as a series of repeated parallel design trials (2). For example, in the Gambia Hepatitis study, cohorts of children were identified in each community at each time step and followed so that the future incidence of liver disease could be compared between the intervention and control cohorts. Thus, all comparisons would be between cohorts enrolled at the same point in time, and the information on the treatment effect could be pooled over the time periods. This analysis avoids both the ‘‘healthy survivor’’ bias mentioned above and any confounding of the treatment effect with underlying temporal trends. Moulton and colleagues (7) propose a similar strategy to evaluate the health services intervention described previously. They describe a partial likelihood approach for comparing Tb incidence between control and intervention clinics while conditioning on time. A proportional hazards assumption is used to pool the estimated intervention effect across time. They also propose using bootstrapping or a robust variance to account for within-clinic correlation. A model-based analysis provides a way of combining both within and between unit information on the treatment effect while controlling for underlying time trends. Grant and colleagues (4) analyzed episodes of tuberculosis before and after initiation of isoniazid therapy with a Poisson random effects model. They included calendar time as a covariate in their model because a natural increase in TB
incidence was expected in the HIV-infected participants over time. Similarly, in the context of a cluster randomized trial, Hussey and Hughes (10) propose the following model for analyzing data from a stepped wedge trial with repeated cross-sectional surveys in which different individuals are surveyed at each time point:

Y_{ijk} = \mu + \alpha_i + \beta_j + X_{ij}\theta + e_{ijk}    (1)

Here, Y_{ijk} is the kth observation at the jth time step in the ith cluster; \alpha_i is a random cluster effect with \alpha_i \sim N(0, \tau^2); \beta_j is a fixed time effect; \theta is the intervention effect; X_{ij} is a treatment indicator; and e_{ijk} \sim N(0, \sigma_e^2).

Hussey and Hughes (10) used simulations to compare the size and power of random-effects and marginal models for analyzing data from such a study. They found that when cluster sizes were equal, an analysis of the cluster-level outcomes (i.e., \bar{Y}_{ij.}) using a linear mixed model (15) gave power as good as or better than an individual-level analysis using generalized estimating equations (GEE) (16) or a generalized linear mixed model (GLMM) (17) to control for intracluster correlation. This result has been noted previously in the context of cluster-randomized trials (e.g., Ref. 18, section 6.2.1.2). However, when cluster sizes were unequal, an individual-level analysis using GEE or GLMM was much more efficient than a cluster-level analysis. In all cases, controlling for time in the model provided an unbiased estimate of the treatment effect.
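As a concrete illustration of model (1), the short Python sketch below (not from the article) simulates cluster-period data with a random cluster effect, fixed time effects, and an intervention effect, and fits a random-intercept linear mixed model. The statsmodels call is one reasonable way to fit such a model; the parameter values, column names, and rollout pattern are illustrative assumptions.

```python
# Minimal simulation sketch of model (1), assuming 12 clusters crossing over in 4 waves.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
N, T, n = 12, 5, 20                      # clusters, time periods, subjects per cluster-period
theta, tau, sigma_e = 0.3, 0.2, 1.0      # illustrative effect and variance components
steps = np.repeat([1, 2, 3, 4], 3)       # period at which each cluster starts the intervention

rows = []
for i in range(N):
    a_i = rng.normal(0.0, tau)           # random cluster effect alpha_i
    for j in range(T):
        x = int(j >= steps[i])           # treatment indicator X_ij
        beta_j = 0.1 * j                 # an assumed underlying time trend
        y = a_i + beta_j + theta * x + rng.normal(0.0, sigma_e, size=n)
        rows.append(pd.DataFrame({"cluster": i, "time": j, "treat": x, "y": y}))
df = pd.concat(rows, ignore_index=True)

# Random cluster intercept with fixed time effects, as in Equation (1)
fit = smf.mixedlm("y ~ C(time) + treat", df, groups=df["cluster"]).fit()
print(fit.params["treat"])               # estimate of the intervention effect theta
```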
Some analyses of cluster randomized stepped wedge trials have ignored correlation due to clustering during the analysis phase (e.g., Ref. 12). This problem is not unique to stepped wedge trials. For instance, in their review of primary prevention trials, Simpson and colleagues (19) found that 9 of 21 cluster randomized trials failed to account for clustering in the statistical analysis. They note that, in general, the analysis of a cluster randomized trial as an individually randomized trial will not give the correct p-value. In particular, for a parallel-designed cluster randomized trial, the p-value from an individual-level analysis (e.g., a t-test or chi-square test that ignores the clustering) is almost always too small (type I error rate too large). The situation is less clear in stepped wedge trials because of the crossover aspect of the design, but, as a rule, analyses that account for clustering are recommended.

The stepped wedge design can also be used to investigate variation in the effect of treatment as a function of time. As noted above, a stepped wedge trial may be viewed as a series of repeated parallel trials, each initiated at a different point in time. This approach allows one to ask whether the intervention effect depends on the time at which it is introduced. In terms of the model in Equation (1), this corresponds to a time-by-treatment interaction. Such interactions may reflect seasonal variation in the intervention effect or other external factors. More subtly, when the outcome is measured on a cohort or cohorts enrolled at the start of the study, variations in the apparent effectiveness of the intervention over time may result from a variant of the healthy survivor bias noted previously. Specifically, if there is individual variation in risk of the outcome and the effectiveness of the intervention is related to risk (for instance, the intervention is less effective in low-risk individuals), then these factors will combine to produce an intervention effectiveness that apparently wanes over time. Thus far, little discussion of such issues has occurred in the literature, and the development of analytic tools to estimate such effects is an important area of research. It is also possible that intervention effectiveness may vary as a function of the time since the intervention was introduced (e.g., a health services intervention may become more effective over time as those providing the intervention become more skilled). This issue is discussed further in the section on delayed intervention effect, below.

2.3 Power/Sample Size Calculations

Power calculations in a stepped wedge trial are, of course, dependent on the design details (i.e., cohorts, repeated cross-sectional surveys, etc.) and the analytic approach to be taken at the end of the study. For instance, in the Gambia Hepatitis study, the authors propose an analysis based on contemporaneous comparison of intervention and control community cohorts and report that the stepped
wedge design gives about 70% of the power of a comparable parallel design, although no details are given. Moulton and colleagues (7) give a detailed explanation of the power calculations for their trial, which is designed to include 29 clinics and continue for 42 months. They used a pilot study to estimate the interclinic coefficient of variation (CV) and then used simulations to estimate the design effect of the stepped wedge trial relative to a comparable parallel-designed trial. This approach allowed them to use a slight modification of a standard sample size formula for parallel-designed cluster randomized trials (20) to calculate power for their trial. A key finding was that the stepped wedge design was about 69% as efficient as a comparable parallel-designed trial using the proposed analysis—similar to that found in the Gambia Hepatitis Study.

In contrast, Hughes and colleagues (1) propose an analysis of repeated cross-sectional surveys based on the model in Equation (1), which incorporates both within- and between-cluster information. They find that the power of the stepped wedge design compares favorably with a comparable parallel design and may have even greater power than the parallel design for high values of the interclinic CV (e.g., fig. 3 in Ref. 1). The key difference between this analysis and the previous approaches is the use of within-clinic information (together with the between-clinic information) to estimate the treatment effect. In a parallel-designed cluster randomized trial, power is highly dependent on the magnitude of the interclinic CV. In a crossover trial, this dependence is reduced because each cluster serves as its own control. In a stepped wedge trial, if the proposed analysis uses information from both within- and between-cluster comparisons, then the power of the trial will be less dependent on intercluster variability compared with a parallel design. Hussey and Hughes (10) provide more discussion of power and design issues relevant to stepped wedge trials with repeated cross-sectional surveys. Under the model in Equation (1), they show that the power to test the hypothesis H0: θ = 0 using a Wald test is
(for a two-tailed α-level test)

power = \Phi\left( \sqrt{\theta_a^2 / \mathrm{Var}(\hat{\theta})} - Z_{1-\alpha/2} \right),    (2)

where \Phi is the cumulative standard normal distribution function, Z_p is the pth quantile of the standard normal distribution, and \theta_a is the intervention effect under the alternative hypothesis. In general, \mathrm{Var}(\hat{\theta}) is the appropriate diagonal element of (Z'V^{-1}Z)^{-1}, where Z is an NT × (T + 1) design matrix and V is an NT × NT block diagonal covariance matrix (N is the number of clusters; T is the number of time periods). In the special case where the sample size is constant (say, n) for all clusters and time periods, a closed-form expression for the variance of the estimated treatment effect, \hat{\theta}, is

\mathrm{Var}(\hat{\theta}) = \frac{N\sigma^2(\sigma^2 + T\tau^2)}{(NU - W)\sigma^2 + (U^2 + NTU - TW - NV)\tau^2},    (3)

where U = \sum_{ij} X_{ij}, W = \sum_j \big(\sum_i X_{ij}\big)^2, V = \sum_i \big(\sum_j X_{ij}\big)^2, and \sigma^2 = \sigma_e^2/n. Although the model in Equation (1) is written in terms of a normally distributed response, the power formula in Equation (2) may be used quite generally if n is large enough so that the cluster-level responses are approximately normally distributed. When the number of clusters is small, a noncentral t distribution may be used in place of the normal distribution function in Equation (2).
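The sketch below (not from the article) implements the closed-form variance in Equation (3) and the power approximation in Equation (2) for a 0/1 treatment-indicator matrix X; the function name and the example design and variance components are illustrative assumptions.

```python
# Closed-form power for a stepped wedge design under Equations (2) and (3),
# assuming equal cluster-period sample sizes n and X[i, j] = 1 when cluster i
# receives the intervention at time j.
import numpy as np
from scipy.stats import norm

def stepped_wedge_power(X, theta, sigma2_e, tau2, n, alpha=0.05):
    X = np.asarray(X, dtype=float)
    N, T = X.shape                      # N clusters, T time periods
    sigma2 = sigma2_e / n               # variance of a cluster-period mean
    U = X.sum()
    W = (X.sum(axis=0) ** 2).sum()      # sum over time of (column sums)^2
    V = (X.sum(axis=1) ** 2).sum()      # sum over clusters of (row sums)^2
    var_theta = (N * sigma2 * (sigma2 + T * tau2)) / (
        (N * U - W) * sigma2 + (U**2 + N * T * U - T * W - N * V) * tau2
    )
    return norm.cdf(abs(theta) / np.sqrt(var_theta) - norm.ppf(1 - alpha / 2))

# Example: 12 clusters crossing over in 4 groups of 3 over 5 time periods.
steps = np.repeat([1, 2, 3, 4], 3)      # first intervention period for each cluster
X = np.array([[1 if j >= s else 0 for j in range(5)] for s in steps])
print(stepped_wedge_power(X, theta=0.3, sigma2_e=1.0, tau2=0.05, n=20))
```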
2.4 Delayed Intervention Effect

In general, stepped wedge trials assume that the intervention is fully effective in the step in which it is introduced. If the intervention does not become effective until later time steps and the lag is not explicitly incorporated into the design or analysis, then substantial power may be lost (10). This issue may be addressed in either the design or the analysis of the stepped wedge trial. For instance, in the Gambia Hepatitis study, as noted previously, one would not expect the effect of vaccination on the primary outcome—liver disease—to be realized until years after the intervention was introduced. Thus, a simple survey of liver cancer rates during each time step would show no intervention effect. The authors dealt with this issue through the trial design, for example, by treating children receiving vaccination in each health district at each time step as cohorts that could be followed over time to provide information on the effectiveness of the intervention.

Hussey and Hughes (10) and Moulton and colleagues (7) discuss approaches to dealing with lags in the intervention effect via adjustments to the analysis. For instance, if one has information about the expected lag in the intervention effect (e.g., 50% of full effectiveness in one time step, 80% of full effectiveness in two time steps, and 100% effectiveness after three time steps), then Hussey and Hughes (10) suggest including this information in the analysis by using fractional values for the treatment indicator (additional measurements of the outcome following introduction of the intervention into all the clusters are also recommended). If no information about the lag is available a priori, then it may still be possible to estimate the lag effect during the analysis, although details of such an analysis have not been described in the literature. Alternatively, Moulton and colleagues (7) suggest using an individual-level as-treated analysis instead of an intent-to-treat analysis to compensate for lags in the treatment effect. Either approach, however, will be effective only for dealing with relatively minor lags. If a large and variable lag occurs in the intervention effect, then the stepped wedge trial effectively becomes a before–after trial, and it likely will be impossible to distinguish unambiguously the intervention effect from an underlying time trend. If large lags in the intervention effect are anticipated, then these must be explicitly accounted for in the design (as in the Gambia Hepatitis study).
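The fractional-treatment-indicator idea described above can be sketched as follows (not from the article); the lag fractions and rollout pattern are illustrative assumptions, and the resulting matrix can be supplied to the closed-form variance and power calculation of Section 2.3.

```python
# Build a stepped wedge design matrix with a phased-in (lagged) intervention effect:
# 50%, 80%, and 100% of full effect in the first, second, and third-plus periods
# after introduction (illustrative fractions).
import numpy as np

def lagged_design(steps, T, lag_fractions=(0.5, 0.8, 1.0)):
    X = np.zeros((len(steps), T))
    for i, s in enumerate(steps):
        for j in range(s, T):
            k = j - s                               # steps since introduction
            X[i, j] = lag_fractions[min(k, len(lag_fractions) - 1)]
    return X

steps = np.repeat([1, 2, 3, 4], 3)
print(lagged_design(steps, T=5))
# Extra measurement periods after full rollout help recover power lost to the lag.
```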
3 DISCUSSION

In summary, stepped wedge designs are useful for evaluating the effectiveness of interventions—particularly interventions that have been shown to be efficacious in smaller scale trials—that are being introduced into a community or communities. If feasibility or logistical constraints preclude the possibility of introducing the intervention everywhere or to everyone simultaneously, then randomizing the order of introduction provides an impression of fairness and allows for a systematic evaluation of the community-level effectiveness of the intervention. A community-level stepped wedge trial is particularly well-suited for evaluating interventions designed to prevent infectious diseases, as it allows one to evaluate the total prevention effect (i.e., direct disease prevention plus indirect effects such as herd immunity). It is also possible that more complex designs that involve randomization at both the cluster and individual levels to estimate direct and indirect effects separately (21) could be incorporated into the stepped wedge framework. The details of such a design are a potential area of research. Finally, the stepped wedge design allows one to evaluate temporal changes in the intervention's effectiveness.

Variations on the stepped wedge theme are possible. Hughes and colleagues (1) describe a combined parallel/stepped wedge design for evaluating different strategies of providing nevirapine to HIV-infected women in Zambia. In that protocol, some clinics provided only one of the two interventions during the entire trial (as in a parallel design), whereas others crossed over from the standard strategy to the proposed new strategy (as in a stepped wedge design). Patel and colleagues (12) use a variation of the stepped wedge design in which each community is evaluated once before and once after the intervention, but the time of the start of the study is staggered (see Fig. 2).

[Figure 2. Variation on the stepped wedge design with a single evaluation before and after the intervention. T and C represent the two randomization arms. Cluster 1: C at time 1, T at time 2; cluster 2: C at time 2, T at time 3; cluster 3: C at time 3, T at time 4.]

Additional methodological research on the stepped wedge design is needed. Several analytic approaches have been used or proposed for analyzing stepped wedge trials, which suggests that there is little consensus on the best approach for evaluating the intervention effect. Improved estimators, both model-based and semiparametric or nonparametric, are potential areas of research. More investigation into the power of the stepped wedge under alternative correlation structures (e.g., autoregressive instead of the exchangeable assumption used in Equation 1) would be useful to determine the sensitivity of the design to such assumptions. In addition, as noted above, the development of methods for evaluating temporal trends in the effectiveness of the intervention is an open area of research. Finally, because stepped wedge trials are, by their nature, relatively long in duration, it may be useful to develop interim monitoring and stopping rules for this design.

4 ACKNOWLEDGEMENTS
This work was supported by NIH grant AI29168.

REFERENCES

1. J. P. Hughes, R. L. Goldenberg, C. M. Wilfert, et al., Design of the HIV prevention trials network (HPTN) protocol 054: a cluster randomized crossover trial to evaluate combined access to nevirapine in developing countries. UW Biostatistics Working Paper Series. 2003: 195.
2. C. A. Brown and R. J. Lilford, The stepped wedge design: a systematic review. BMC Medical Research Methodology. 2006; 6: 54.
3. Gambia Hepatitis Study Group, The Gambia Hepatitis Intervention Study. Cancer Res. 1987; 47: 5782–5787.
4. A. D. Grant, S. Charalambous, K. L. Fielding, J. H. Day, E. L. Corbett, R. E. Chaisson, K. M. De Cock, R. J. Hayes, and G. J. Churchyard, Effect of routine isoniazid preventive therapy on tuberculosis incidence among HIV-infected men in South Africa: a novel randomized incremental recruitment study. JAMA. 2005; 293: 2719–2725.
5. M. R. Golden, J. P. Hughes, D. D. Brewer, K. K. Holmes, W. L. H. Whittington, M. Hogben, C. Malinski, A. Golding, and H. H. Handsfield, Evaluation of a population-based program of expedited partner therapy for gonorrhea and chlamydial infection. Sex. Transm. Dis. 2007; 34: 598–603.
6. M. R. Golden, W. L. H. Whittington, H. H. Handsfield, J. P. Hughes, W. E. Stamm, M. Hogben, A. Clark, C. Malinski, J. Larson, K. K. Thomas, and K. K. Holmes, Impact of expedited sex partner treatment on recurrent or persistent gonorrhea or chlamydial infection: a randomized controlled trial. N. Engl. J. Med. 2005; 352: 676–685.
7. L. H. Moulton, J. E. Golub, B. Durrovni, S. C. Cavalcante, A. G. Pacheco, V. Saraceni, B. King, and R. E. Chaisson, Statistical design of THRio: a phased implementation clinic-randomized study of a tuberculosis preventive therapy intervention. Clinical Trials. 2007; 4: 190–199.
8. T. D. Cook and D. T. Campbell, Quasi-Experimentation: Design and Analysis Issues for Field Settings. Boston, MA: Houghton Mifflin Company, 1979.
9. P. G. Smith and R. H. Morrow, Field Trials of Health Interventions in Developing Countries: A Toolbox. London: Macmillan Education Ltd, 1996.
10. M. A. Hussey and J. P. Hughes, Design and analysis of stepped wedge cluster randomized trials. Contemporary Clinical Trials. 2007; 28: 182–191.
11. M. Halperin, J. Cornfield, and S. C. Mitchell, Effect of diet on coronary-heart-disease mortality. Lancet. 1973; 2: 438–439.
12. M. Patel, H. L. Sandige, M. J. Ndekha, A. Briend, P. Ashorn, and M. J. Manary, Supplemental feeding with ready-to-use therapeutic food in Malawian children at risk of malnutrition. J. Health, Population and Nutrition. 2005; 23: 351–357.
13. R. W. Levy, C. R. Rayner, C. K. Fairley, D. C. M. Kong, A. Mijch, K. Costello, C. McArthur, and the Melbourne Adherence Group, Multidisciplinary HIV adherence intervention: a randomized study. AIDS Patient Care and STDs. 2004; 18: 728–735.
14. T. B. M. Wilmink, C. R. G. Quick, C. S. Hubbard, and N. E. Day, The influence of screening on the incidence of ruptured abdominal aortic aneurysms. J. Vascular Surgery. 1999; 30: 203–208.
15. N. Laird and J. Ware, Random-effects models for longitudinal data. Biometrics. 1982; 38: 963–974.
16. K. Liang and S. Zeger, Longitudinal data analysis using generalized linear models. Biometrika. 1986; 73: 13–22.
17. N. E. Breslow and D. G. Clayton, Approximate inference in generalized linear mixed models. J. Am. Stat. Assoc. 1993; 88: 9–25.
18. A. Donner and N. Klar, Design and Analysis of Cluster Randomized Trials in Health Research. New York: Oxford University Press, 2000.
19. J. M. Simpson, N. Klar, and A. Donner, Accounting for cluster randomization: a review of primary prevention trials, 1990 through 1993. American J. Public Health. 1995; 85: 1378–1383.
20. R. Hayes and S. Bennett, Simple sample size calculation for cluster-randomized trials. Int. J. Epidemiol. 1999; 28(2): 319–326.
21. M. E. Halloran, M. Haber, I. M. Longini, and C. J. Struchiner, Direct and indirect effects in vaccine efficacy and effectiveness. Amer. J. Epidemiol. 1991; 133(4): 323–331.
CROSS-REFERENCES Cluster Randomization Community Intervention Trial Phase IV trial Effectiveness Cohort versus Repeated Cross-sectional Survey
STOPPING BOUNDARIES
DENIS HENG-YAN LEUNG
Singapore Management University, School of Economics and Social Sciences, Singapore

The Heart Outcomes Prevention Evaluation (HOPE) Study (1) was a large prospective randomized trial carried out in the 1990s to study the effectiveness of ramipril, an angiotensin-converting enzyme (ACE) inhibitor, on cardiovascular events in high-risk patients with vascular or coronary disease, or diabetes with at least one of the risk factors of hypertension, hyperlipidemia, active smoking, or known microalbuminuria. Between December 1993 and June 1995, the trial investigators recruited a total of over 9000 patients, with 4656 randomized to receive daily doses of ramipril and 4652 randomized to receive placebo. The trial was planned to have a follow-up of 5 years, with four interim analyses to be carried out during follow-up. The study design indicated that ramipril would be considered effective if a difference in the primary outcome exceeded 4 standard deviations (SDs) between treatment groups during the first half of the study or 3 SDs between groups during the second half. Following the third interim analysis, in March 1999, the data monitoring committee (DMC) recommended early termination of the study in favor of ramipril because the difference in outcomes between ramipril and placebo exceeded 3 SDs.

Trial designs like that in HOPE, where interim analyses of the trial results are scheduled, have become a standard in the conduct of modern clinical trials. Among the many attributes that interim analyses can offer, they allow the trial investigators the opportunity to stop a trial as soon as it is clear that it is no longer necessary or appropriate to continue the trial. However, the chance to analyze trial results at multiple times during a trial also leads to the problem of inflation of false-positive errors. The main statistical challenge is how to control the level of false-positive errors at a desired level.

Consider a study that aims to compare two treatments. If a conventional clinical trial with fixed sample size (i.e., no interim analyses) is used, the hypotheses can be tested by comparing the treatment difference with a critical value so that the chance of falsely claiming a significant difference is controlled at some level, α, under the null hypothesis that the treatments are not different. If the same treatments are to be compared in a trial with interim analyses, and the critical value used in the fixed sample size trial is applied at each interim analysis, then the multiple tests induced by the interim analyses will lead to an overall false-positive error exceeding α. As an example, if the critical value corresponding to α = 0.05 in a fixed sample test is used in trials with two, five, and ten interim analyses, the overall false-positive errors will be about 0.08, 0.14, and 0.2, respectively.

In the HOPE trial, inflation of the overall false-positive error was controlled by setting stringent criteria for stopping the trial in favor of ramipril: a difference in outcomes between groups of 4 SD in the first two analyses and of 3 SD in the third and final analyses. These criteria correspond to nominal P-values of 0.00006 and 0.0027, respectively, based on a two-sided test, and are much more stringent than the usual 0.05 in a conventional two-sided test. The criteria for stopping, of 4 SD and 3 SD, are known as ''stopping boundaries'' because they act like boundaries that separate the decision to continue or to stop a trial. The values of the stopping boundaries are established to control the overall false-positive error at a desired level. By setting appropriate stopping boundaries in a trial, the investigators can prevent the trial from stopping too early, when early results based on few observations can be unduly influenced by a few extreme observations. Furthermore, they can allow the trial to stop early if little doubt exists that either one of the treatments is superior (inferior) or that they are not different.
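A small Monte Carlo sketch (not from the article) that reproduces the inflation figures quoted above, assuming equally spaced looks at accumulating normal data under the null hypothesis:

```python
# Repeated testing of cumulative data at the fixed-sample critical value 1.96
# inflates the overall two-sided type I error well beyond 0.05.
import numpy as np

rng = np.random.default_rng(0)

def overall_type1_error(K, n_sims=200_000):
    increments = rng.normal(size=(n_sims, K))          # independent increments under H0
    b = np.cumsum(increments, axis=1)                  # B-value process at looks 1..K
    z = b / np.sqrt(np.arange(1, K + 1))               # Z-statistic at each look
    return np.mean((np.abs(z) > 1.96).any(axis=1))

for K in (1, 2, 5, 10):
    print(K, round(overall_type1_error(K), 3))         # roughly 0.05, 0.08, 0.14, 0.19
```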
For administrative reasons, investigators in a trial usually restrict interim analyses to a few fixed intervals during the trial. The intervals can be measured based on either calendar time, as in the HOPE trial, or the number of observations (or events) that have been accrued. Stopping boundaries for these types of periodic monitoring were first established by Pocock (2), O'Brien and Fleming (3), Haybittle (4), and Peto et al. (5). The original stopping boundaries in these works are designed for comparisons in which the interests in both treatments are equal. The boundaries are a set of critical values (−C, C) at each of the interim analyses. The trial stops in favor of one of the treatments if the difference in outcomes between treatments at an interim analysis or at the final analysis falls outside of (−C, C). Otherwise, the trial concludes with a decision that the treatments are not statistically different. As a result of the symmetry in interests in the treatments, these boundaries are called symmetric two-sided stopping boundaries. For a trial with four analyses (three interim analyses and a final analysis) and an overall false-positive error of 0.05, the Pocock, Haybittle-Peto, and O'Brien-Fleming two-sided symmetric stopping boundaries are given in Table 1.

Table 1. Pocock, Haybittle-Peto, and O'Brien-Fleming two-sided symmetric stopping boundaries

Analysis    Pocock C (nominal P)    Haybittle-Peto C (nominal P)    O'Brien-Fleming C (nominal P)
1           2.36 (.018)             3.29 (.001)                     4.08 (.00005)
2           2.36 (.018)             3.29 (.001)                     3.22 (.0013)
3           2.36 (.018)             3.29 (.001)                     2.28 (.0228)
4           2.36 (.018)             1.96 (.05)                      2.04 (.0417)
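As a quick check of Table 1 (not from the article), the boundary values can be converted to two-sided nominal P-values, and back, with the standard normal distribution:

```python
# Two-sided nominal P-value for a boundary C is 2 * P(Z > C); the inverse recovers C.
from scipy.stats import norm

for c in (2.36, 3.29, 1.96, 4.08, 3.22, 2.28, 2.04):
    print(c, round(2 * norm.sf(c), 4))     # e.g., 2.36 -> ~0.018, 3.29 -> ~0.001
print(norm.isf(0.018 / 2))                 # back from a nominal P-value to the boundary
```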
boundaries and the O’Brien-Fleming boundaries are very conservative at the interim analyses; therefore, using these boundaries, a low chance of very early stopping exists because of a chance dramatic difference in outcomes. The conservatism of the HaybittlePeto boundaries at the interim analyses came from applying a uniformly stringent criterion, with a nominal P-value of 0.001, at all interim analyses. On the other hand, the O’Brien-Fleming boundaries are very conservative at the early analyses but become less conservative at later analyses to reflect the increased amount of information about treatment outcomes that have been accrued during the trial. Symmetric two-sided boundaries are suitable for comparisons between two active treatments. But in trials such as the HOPE trial, where one of the treatments under study is a placebo, the investigators may have less interest in finding out that the active treatment is significantly worse than the placebo. Therefore, in trials with a placebo control, the hypotheses of interest are whether the active treatment is better than or is no better than the placebo. The asymmetry in the interests between the active treatment and the placebo created the need for asymmetric stopping boundaries. In the HOPE trial, the stopping boundaries in order to conclude that ramipril was no better or worse than placebo were 3 SD during the first two analyses and 2 SD in the third and final analyses, which were different from the stopping boundaries to show that ramipril was superior to the placebo. The degree of asymmetry in the stopping boundaries also depends on the nature of the trial. In a prevention trial like the HOPE trial, a higher need to accumulate information on different outcomes and measures exists than in a treatment trial (6). Therefore, investigators in a prevention trial may be more willing to continue a trial unless the active treatment is grossly harmful, which was presumably the reason why the HOPE investigators chose a set of not very aggressive stopping boundaries for no treatment benefits. In treatment trials, more aggressive asymmetric stopping boundaries (7, 8) may be used. Sometimes, reasons internal or external to a trial may prompt the DMC to hold an
unscheduled interim analysis. In response to such possibilities, Lan and DeMets (9) developed stopping boundaries that allow interim analyses to be carried out at any time and as often as the DMC wishes during a trial. The thesis of the Lan-DeMets method is that a trial with an overall false-positive error of α should allow its investigators to ''spend'' the error at different rates at different times of the trial, with the provision that the total false-positive error spent be no more than α. For example, by using the Pocock boundaries in a trial with interim analyses spaced evenly over time, the investigators essentially assume that the false-positive error is to be spent in equal amounts at each of the interim analyses. Lan and DeMets generalized this concept of spending error to allow any amount to be spent at any time in a trial, as long as the total error spent in the trial is not more than α. Their method is sometimes called the alpha spending function method. The Lan-DeMets method revolutionized the way trials with interim analyses could be designed in that the administrative burden of limiting analyses to only a few scheduled times could be removed. As a result, many trials are now designed with Lan-DeMets boundaries (10, 11).
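A minimal sketch (not from the article) of two commonly used Lan-DeMets spending functions, an O'Brien-Fleming-type and a Pocock-type. They give only the cumulative error ''spent'' by information fraction t; converting spent error into boundary values requires the joint distribution of the sequential statistics (e.g., via the software cited at the end of this article).

```python
# Cumulative alpha spent at information fraction t for two standard spending functions.
import numpy as np
from scipy.stats import norm

def obf_spending(t, alpha=0.05):
    # O'Brien-Fleming-type: spends very little error early in the trial
    return 2.0 * (1.0 - norm.cdf(norm.ppf(1.0 - alpha / 2.0) / np.sqrt(t)))

def pocock_spending(t, alpha=0.05):
    # Pocock-type: spends error more evenly over the trial
    return alpha * np.log(1.0 + (np.e - 1.0) * t)

t = np.array([0.25, 0.5, 0.75, 1.0])
print(np.round(obf_spending(t), 4))
print(np.round(pocock_spending(t), 4))
```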
In the HOPE trial, a single primary outcome (a composite measure of cardiovascular death, myocardial infarction, and stroke) was compared between two treatments. In other studies, multiple measures of efficacy may be used (6). In those cases, a Bonferroni adjustment is the simplest method to control the overall false-positive error, as long as the number of measures to be studied is not too high. In some studies, more than two treatments may be under study. Follmann et al. (12) found that a Bonferroni adjustment on
the stopping boundaries used for comparing two treatments is easy to use and leads to little loss in power. In multi-treatment trials with a control treatment, Hughes (13) suggested that the control treatment should be retained throughout the trial whereas other treatments can be dropped from the trial if they are found to be inferior to others under study.

Sometimes, disappointing results in an interim analysis, possibly compounded by slow accrual or discouraging results reported in a similar trial, may lead a trial's DMC to consider terminating the trial if it knows that little hope exists that the trial will end in a positive result. In those cases, methods called stochastic curtailment have been developed to aid the decision process (14). Stochastic curtailment is based on calculating the conditional probability that the trial will end in favor of the alternative hypothesis of a difference in treatment outcomes, based on the accumulated data and a projection of the future trend of the data, if the trial is to continue. Two types of projection are popular. One predicts that future data will follow the same trend as the accumulated data. The other assumes that future data will follow the alternative hypothesis. In both cases, a low conditional probability indicates that little hope exists that the trial will end in favor of the alternative hypothesis, in which case a decision to terminate the trial will probably follow.
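A small sketch (not from the article) of the conditional-power calculation that underlies stochastic curtailment, using the standard B-value formulation; the interim values and design drift are illustrative assumptions, and both projections mentioned above (current trend and design alternative) are shown.

```python
# Conditional power given the interim Z-statistic z_t at information fraction t,
# assuming the remaining data accrue with drift theta (B(t) = Z(t) * sqrt(t)).
from math import sqrt
from scipy.stats import norm

def conditional_power(z_t, t, theta, alpha=0.05):
    b_t = z_t * sqrt(t)
    z_crit = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf((z_crit - b_t - theta * (1 - t)) / sqrt(1 - t))

z_t, t = 0.8, 0.5                                        # illustrative interim result
print(conditional_power(z_t, t, theta=z_t / sqrt(t)))    # projection: current trend
print(conditional_power(z_t, t, theta=2.8))              # projection: design alternative drift
```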
However, repeated application of conditional probability calculations to monitor a trial will lead to an inflation of the false-negative error. Therefore, stochastic curtailment should be incorporated as one of the considerations in the design of a trial, to ensure that false-positive and false-negative errors are controlled at the desired levels (15, 16).

Although stopping boundaries are defined in the stopping rules of a trial to indicate when a trial should be stopped early, in practice, stopping boundaries are often used as stopping guidelines rather than rules. The decision to continue or to stop a trial cannot be based solely on comparing trial results with stopping boundaries; other relevant information presented to the DMC at the time of the interim analysis must also be considered. As an example, in the HOPE trial, the stopping boundaries were crossed at the second interim analysis, but the trial continued to the third interim analysis, when it was finally stopped after the stopping boundary was crossed a second time. It is conceivable that the reason for not stopping at the second interim analysis was the investigators' wish to see a definitive result, as a trial of that scale was unlikely to be carried out again. In another trial (17), the stopping boundary was crossed at the first interim analysis. However, its DMC decided that the trial should continue on the grounds that the interim analysis results were too extreme to be believable. That trial subsequently continued to the planned maximum duration and concluded in favor of no treatment difference.

In any trial, the investigators must constantly monitor and balance the issues of patient safety and trial integrity. The establishment of a proper set of stopping boundaries is important to protect the integrity of a trial. Without the proper definition of a set of stopping boundaries, it is easy for an over-zealous investigator or sponsor of a trial to ''look'' at the results repeatedly in search of a significant finding; the end result could be a misleading trial conclusion that leads to substantial harm to patients down the road. On the other hand, reasons sometimes exist, such as those in the examples given above, for the stopping boundaries not to be used as the sole determinant of whether a trial should be stopped. In those situations, an effective DMC will weigh the evidence from the comparison of trial results with the stopping boundaries together with other information relevant to the study, to make a decision that best balances the issues of patient well-being and trial integrity.
Many of the well-known stopping boundaries and their derivatives can now be calculated using commercially available packages or resources on the World Wide Web. Two commercial packages are East (Cytel Software Corporation, Cambridge, MA) and PEST (MPS Research Unit, University of Reading, Reading, UK). For calculating boundaries using the Lan-DeMets method, Reboussin et al. (18) have developed a program that is available freely in their paper.

REFERENCES

1. The Heart Outcomes Prevention Evaluation Study Investigators, Effects of an angiotensin-converting-enzyme inhibitor, ramipril, on cardiovascular events in high-risk patients. N. Engl. J. Med. 2000; 342: 145–153.
2. S. Pocock, Group sequential methods in the design and analysis of clinical trials. Biometrika 1977; 64: 191–200.
3. P. O'Brien and T. Fleming, A multiple testing procedure for clinical trials. Biometrics 1979; 35: 549–556.
4. J. L. Haybittle, Repeated assessment of results in clinical trials of cancer treatments. Br. J. Radiol. 1971; 44: 793–797.
5. R. Peto, M. C. Pike, P. Armitage, N. E. Breslow, D. R. Cox, S. V. Howard, N. Mantel, K. McPherson, J. Peto, and P. G. Smith, Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br. J. Cancer 1976; 34: 585–612.
6. L. Freedman, G. Anderson, V. Kipnis, R. Prentice, C. Y. Wang, J. Rossouw, J. Wittes, and D. DeMets, Approaches to monitoring the results of long-term disease prevention trials: examples from the Women's Health Initiative. Controlled Clin. Trials 1996; 17: 509–525.
7. D. DeMets and J. Ware, Group sequential methods for clinical trials with a one-sided hypothesis. Biometrika 1980; 67: 651–660.
8. D. DeMets and J. Ware, Asymmetric group sequential boundaries for monitoring clinical trials. Biometrika 1982; 69: 661–663.
9. K. K. G. Lan and D. DeMets, Discrete sequential boundaries for clinical trials. Biometrika 1983; 70: 659–663.
10. The Beta-Blocker Evaluation of Survival Trial Investigators, A trial of the beta-blocker bucindolol in patients with advanced chronic heart failure. N. Engl. J. Med. 2001; 344: 1659–1667.
11. The Atrial Fibrillation Follow-up Investigation of Rhythm Management (AFFIRM) Investigators, A comparison of rate control and rhythm control in patients with atrial fibrillation. N. Engl. J. Med. 2002; 347: 1825–1833.
12. D. Follmann, M. Proschan, and N. Geller, Monitoring pairwise comparisons in multi-armed clinical trials. Biometrics 1994; 50: 325–336.
13. M. D. Hughes, Stopping guidelines for clinical trials with multiple treatments. Stat. Med. 1993; 12: 901–915.
14. K. Lan, R. Simon, and M. Halperin, Stochastically curtailed tests in long-term clinical trials. Commun. Stat. B 1991; 1: 207–219.
15. M. S. Pepe and G. L. Anderson, Two-stage experimental designs: early stopping with a negative result. J. Royal Stat. Soc. C 2002; 41: 181–190.
16. Y. G. Wang and D. Leung, Conditional probability of significance for early stopping in favor of H0. Sequential Anal. 2002; 21: 145–160.
17. K. Wheatley and D. Clayton, Be skeptical about unexpected large apparent treatment effects: the case of an MRC AML 12 randomization. Controlled Clin. Trials 2002; 23: 355–366.
18. D. M. Reboussin, D. DeMets, K. M. Kim, and K. K. G. Lan, Computations for group sequential boundaries using the Lan-DeMets spending function method. Controlled Clin. Trials 2000; 21: 190–207.
STRATIFICATION

MITCHELL H. GAIL
National Cancer Institute, Bethesda, MD, USA
In epidemiology, stratification refers to a design that improves the efficiency of analytical procedures for controlling confounding by ensuring that controls have the same distribution across strata (defined by levels of potential confounders) as the cases in a case–control study, or as the exposed cohort in a cohort study. Stratification (or stratified analysis) also refers to the analytical strategy that controls for confounding by estimating the association between exposure and disease status within strata defined by categorized levels of potential confounders and then combining stratum-specific results to obtain an overall estimate of exposure effect. In the context of survey sampling, stratification is an efficient design that usually allocates larger samples to strata of the population within which the estimate has a larger variance.
STRATIFIED DESIGNS
Stratification is a technique widely employed in finite population sampling∗. It is used when the population units can easily be divided into groups such that the members of each group have some property or set of properties in common, relevant to the investigation at hand, which is not shared by the members of the other groups. If the sample selection within each group is carried out independently of the selection in every other group, the groups are called strata; any sample design that involves dividing the population into strata is a stratified design. The usual purpose of stratification is to increase the efficiency of the sample design—either by decreasing the mean square errors∗ of the sample estimates while keeping the sample size∗ or the cost of the survey constant, or alternatively by decreasing the sample size or the survey cost for the same mean square errors. Sometimes, however, the strata themselves are of primary interest. The following are typical examples of stratified populations:

1. A population of school students stratified by school class and by sex. A typical stratum would be ''class 5H, boys.''
2. A population of individual taxpayers stratified by state, sex, and range of reported income. The income ranges are arbitrary. One possibility is (a) up to $4,999; (b) $5,000–$9,999; (c) $10,000–$19,999; (d) $20,000–$49,999; (e) $50,000 and over. In that case, a typical stratum would be ''South Australia, females, $10,000–$19,999.''
3. A population of households in a country, treating each region as a separate stratum. (Note that it would not be necessary to have a list of every population unit in order to divide the populations into such strata or, indeed, to carry out selection. See AREA SAMPLING∗.) The ''regions'' may have boundaries defined by some outside body or may be chosen as a matter of convenience.
4. A population of retail establishments stratified by state, by description (grocer, butcher, etc.), and by range of annual sales. Here again there is some flexibility in the definitions of the ''descriptions,'' while the ranges of annual sales chosen are entirely arbitrary.

Some stratification criteria are sharply defined (school class, sex, state), some admit of a degree of subjective judgment (region, description of retail establishment), while a third group (age, range of reported income, range of annual sales), being quantitative in character, leads to arbitrary stratum boundaries. Some choices of boundaries are better than others from the point of view of achieving an efficient sample design. The boundaries corresponding to the most efficient sample design for a particular item of interest provide optimum stratification∗.

THEORETICAL ADVANTAGES OF STRATIFICATION

In this section, based on Evans [3], the effects of using a stratified in place of an unstratified design are presented, first in general and then for a simple example. Because the effects of stratification can be derived much more simply for sampling with replacement than for sampling without replacement, we consider first the case where the population total is to be estimated from a simple random sample selected with replacement. Capital letters will be used to denote population values and lower-case letters to denote sample values. Let N be the number of units in the population, n be the number of units in a sample, and Y_i be the value for the ith population unit of an item whose total is being estimated, the population total itself being denoted by Y = \sum_{i=1}^{N} Y_i and the population mean by \bar{Y} = N^{-1} Y. Then the simplest estimator of Y is the expansion or number-raising estimator, defined by
y_u = n^{-1} N \sum_{i=1}^{n} y_i = N\bar{y}.

This estimator is unbiased and its variance is

\sigma^2_{y_u} = n^{-1} N^2 \sigma^2,

where \sigma^2 = N^{-1} \sum_{i=1}^{N} (Y_i - \bar{Y})^2 is the population variance. Now suppose that the population is divided into strata and samples of size n_h (h = 1, 2, ..., L) are selected independently from the units in each stratum with equal probabilities and with replacement. The stratified form of the number-raising estimator of Y is

y_s = \sum_{h=1}^{L} y_h = \sum_{h=1}^{L} n_h^{-1} N_h \sum_{i=1}^{n_h} y_{hi}.

This also is unbiased and its variance is

\sigma^2_{y_s} = \sum_{h=1}^{L} \sigma^2_{y_h} = \sum_{h=1}^{L} n_h^{-1} N_h^2 \sigma_h^2,

where \sigma_h^2 = N_h^{-1} \sum_{i=1}^{N_h} (Y_{hi} - \bar{Y}_h)^2.

Three special cases are of particular interest: the first is proportional sampling∗, where n_h \propto N_h; the second is Neyman allocation∗, where the n_h are chosen to minimize the variance of y_s given a fixed sample size n; the third is optimum allocation, a variant of Neyman allocation used where some strata are more expensive to sample from than others and the n_h are chosen to minimize the variance of y_s given a fixed survey cost.

Consider first the case of proportional sampling. The difference between the variances of y_u and y_s can be shown to be

\sigma^2_{y_u} - \sigma^2_{y_s} = n^{-1} N^2 \sum_{h=1}^{L} P_h (\bar{Y}_h - \bar{Y})^2,

where P_h = N_h/N is the proportion of population units in the hth stratum. This difference is nonnegative, being proportional to the weighted variance of the \bar{Y}_h with weights P_h. Consequently, given this type of sampling and estimation, it is impossible to lose efficiency as a result of stratification, and the greatest efficiency is achieved when the stratum means are as different from each other as possible.

In the case of Neyman allocation, the method of undetermined multipliers can be used to show that the n_h must be proportional to N_h \sigma_h, and the further reduction in variance as compared with proportional sampling is

n^{-1} N^2 \left[ \sum_{h=1}^{L} P_h \sigma_h^2 - \left( \sum_{h=1}^{L} P_h \sigma_h \right)^2 \right].

The expression in square brackets is the weighted variance of the \sigma_h with weights P_h. Thus this form of optimum allocation has two reductions in variance compared with simple random sampling, one term proportional to the weighted variance of the stratum means and the other proportional to the weighted variance of the stratum standard deviations. The weights in each case are the proportions of population units in the strata.

In the more general case of optimum allocation, where the cost of sampling a unit in the hth stratum is proportional to C_h, the method of undetermined multipliers can be used to show that the n_h should be chosen to be proportional to N_h \sigma_h C_h^{-1/2}. The further reduction in variance as compared with proportional sampling is then

n^{-1} N^2 \left[ \sum_{h=1}^{L} P_h \sigma_h^2 - \left( \sum_{h=1}^{L} P_h \sigma_h C_h^{1/2} \right) \left( \sum_{h=1}^{L} P_h \sigma_h C_h^{-1/2} \right) \right].

The expression in parentheses is now the weighted covariance between the \sigma_h C_h^{1/2} and the \sigma_h C_h^{-1/2}, again with weights P_h. Thus this more general form of optimum allocation also has two reductions in variance compared with simple random sampling∗, one term being proportional to the same weighted variance of the stratum means, but the other now being proportional to a similarly weighted covariance, between the stratum standard deviation multiplied by the square root of the cost of sampling and the same stratum standard deviation divided by the square root of the cost of sampling.
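A minimal numerical sketch (not from the text) of the allocation rules just described, applied to the 64-city example discussed in the following paragraphs; the function names are illustrative, and the variance routine uses the with-replacement formula above.

```python
# Proportional, Neyman, and cost-optimum allocation, plus the stratified variance
# sum_h N_h^2 * sigma_h^2 / n_h, using the two strata of the 64-city example.
import numpy as np

def allocate(n, N_h, sigma_h, C_h=None, rule="neyman"):
    N_h, sigma_h = np.asarray(N_h, float), np.asarray(sigma_h, float)
    if rule == "proportional":
        w = N_h
    elif rule == "neyman":
        w = N_h * sigma_h
    else:                                   # cost-optimum: N_h * sigma_h / sqrt(C_h)
        w = N_h * sigma_h / np.sqrt(np.asarray(C_h, float))
    return n * w / w.sum()                  # round as needed in practice

def stratified_variance(n_h, N_h, sigma2_h):
    return sum(N**2 * s2 / m for N, s2, m in zip(N_h, sigma2_h, n_h))

N_h, sigma2_h = [16, 48], [50477.734, 5464.651]
sigma_h = np.sqrt(sigma2_h)
print(allocate(24, N_h, sigma_h, rule="proportional"))           # [6, 18]
print(allocate(24, N_h, sigma_h, rule="neyman"))                 # ~[12, 12]
print(allocate(24, N_h, sigma_h, C_h=[4, 1], rule="optimum"))    # ~[8, 16]
print(stratified_variance([6, 18], N_h, sigma2_h))               # ~2,853,000 (proportional)
print(stratified_variance([12, 12], N_h, sigma2_h))              # ~2,126,000 (Neyman)
```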
Since this comparison retains the same value of n throughout, the reduction in variance will be greatest when the costs of sampling are equal from stratum to stratum and optimum allocation coincides with Neyman allocation.

These results will now be applied to the population of 64 cities given by Cochran [2, pp. 93–94]. The 16 largest cities are in stratum 1 and the remaining 48 in stratum 2. The observed variable Y_i is the 1930 population of the ith city, in thousands. The population total is Y = 19,568 and the population variance is \sigma^2 = 51,629. If a simple random sample of 24 is selected with replacement (in Cochran's book the sample is selected without replacement), the variance of the number-raising estimator is \sigma^2_{y_u} = 8,811,344. For the individual strata, the population variances are

\sigma_1^2 = 50,477.734,  \sigma_2^2 = 5464.651.

In proportional allocation we have n_1 = 6 and n_2 = 18. The variance of the stratified number-raising estimator is

\sigma^2_{y_s} = 2,153,717 + 699,475 = 2,853,192.

The difference between \sigma^2_{y_u} and \sigma^2_{y_s} is 5,958,152. This figure may also be obtained as follows. The population means in stratum 1, stratum 2, and the population as a whole are \bar{Y}_1 = 629.375, \bar{Y}_2 = 197.875, and \bar{Y} = 305.75. Then

24^{-1} \times 64^2 \times [64^{-1} \times 16 \times (629.375 - 305.75)^2 + 64^{-1} \times 48 \times (197.875 - 305.75)^2]
= 170.66667 \times (0.25 \times 104,733.1406 + 0.75 \times 11,637.0156)
= 5,958,152.

For Neyman allocation the sample is allocated proportionally to N_h \sigma_h. Now

N_1 \sigma_1 = 16 \times 50,477.734^{0.5} = 16 \times 224.672505 = 3594.76

and

N_2 \sigma_2 = 48 \times 5464.651^{0.5} = 48 \times 73.923279 = 3548.32.

These are nearly equal, so the Neyman allocation is n_1 = n_2 = 12. The variance of the stratified number-raising estimator is then

\sigma^2_{y_s} = 1,076,858 + 1,049,213 = 2,126,071.

The reduction in the value of \sigma^2_{y_s} in moving from proportional to Neyman allocation is 2,853,192 - 2,126,071 = 727,121. Because the values of n_1 and n_2 were rounded to the nearest unit, this is not exactly the same as is given by the preceding expression, i.e.,

24^{-1} \times 64^2 [0.25 \times 50,477.734 + 0.75 \times 5464.651 - (0.25 \times 224.672505 + 0.75 \times 73.923279)^2]
= 170.66667 \times (12,619.434 + 4098.488 - 111.610585^2)
= 170.66667 \times (16,717.922 - 12,456.923)
= 170.66667 \times 4260.999 = 727,211,

which is, however, close to 727,121, the actual reduction. This actual reduction is also the maximum reduction in variance obtainable as long as the sample size remains constant at 24.

If the cost of sampling varies from stratum to stratum, optimum sampling will differ from Neyman allocation sampling, but if the sample is kept at 24 units, the departure from Neyman allocation must increase the variance at the same time that it reduces the cost. Suppose for instance that it costs four times as much to sample one unit from the large-city stratum 1 as from stratum 2 (C_1 = 4C_2). Then the
optimum values of n_h will be proportional to N_h \sigma_h C_h^{-1/2}. Now

N_1 \sigma_1 C_1^{-1/2} = 3594.76 \times 0.5 C_2^{-1/2} = 1797.38 C_2^{-1/2},

while

N_2 \sigma_2 C_2^{-1/2} = 3548.32 C_2^{-1/2}.
The optimum allocation is then n_1 = 8 and n_2 = 16, which at 4 × 8 + 16 = 48 stratum-2 city equivalents is cheaper than n_1 = n_2 = 12, which costs 4 × 12 + 12 = 60 stratum-2 city equivalents. But the variance for n_1 = 8 and n_2 = 16 exceeds the variance for n_1 = n_2 = 12.

Somewhat similar but more complex and less readily interpretable results can be obtained for sampling without replacement. In extreme cases where each stratum has the same—or nearly the same—mean, stratified random sampling without replacement can be less efficient than simple random sampling without replacement. If the number of strata to be formed is held constant but the positions of the stratum boundaries are allowed to vary, optimum positions may be calculated for them. This topic is treated in OPTIMUM STRATIFICATION. Optimum allocation and optimum stratification can be carried out only for one variable at a time, and different variables will, in general, give different sample numbers and stratum boundaries. Typically, a sample survey is used to estimate several means or totals. However, there is usually one important variable which will be less accurately measured than the others, almost regardless of the sample design, and this is the obvious choice for optimization. The multiparametric case is considered in OPTIMUM STRATIFICATION.

STRATIFICATION AND RATIO ESTIMATION

Stratification may be used either as an alternative to ratio estimators∗ or in combination with them. Both are ways of using relevant supplementary information—stratification in the selection of the sample and ratios in the estimation process. Where this supplementary information is
purely qualitative or descriptive, stratification is the only possibility. Where it is quantitative, either technique may be used, or both together. In the limit, as the number of strata is allowed to increase indefinitely, stratification approximates to unequal probability sampling (see Brewer [1]).

Where stratified sampling is used in conjunction with ratio estimators, there is a choice between separate ratio estimation (also known as stratum-by-stratum ratio estimation) and combined ratio estimation (also known as across-stratum ratio estimation). The separate ratio estimator is the sum of the ratio estimators defined for each stratum separately. If the population values of the supplementary or benchmark variable are denoted by Xhi, the sample values by xhi, and the hth stratum total by Xh, then the separate ratio estimator may be written

ŷS = Σ (ŷh / x̂h) Xh   (sum over strata h = 1, …, L),

where ŷh is the unbiased number-raising estimator of Yh and x̂h is the same estimator of Xh. The combined ratio estimator is the ratio of the sum of the number-raising estimators for each stratum to the corresponding sum of the same estimators for the benchmark variable, multiplied by the population total of the benchmark variable; that is,

ŷC = (Σ ŷh / Σ x̂h) X   (sums over h = 1, …, L).
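As a concrete illustration, the following sketch computes both estimators for a small invented data set; the stratum sizes, sample values, and benchmark totals are hypothetical and serve only to show the mechanics.

```python
# Sketch: separate (stratum-by-stratum) vs combined (across-stratum) ratio
# estimators of a population total Y, using known benchmark totals X_h.
# All data below are hypothetical.

strata = [
    # N_h, sample y values, sample x values, known benchmark total X_h
    dict(N=40, y=[12.0, 15.0, 9.0, 14.0], x=[30.0, 36.0, 22.0, 33.0], X=1200.0),
    dict(N=60, y=[5.0, 7.0, 6.0, 4.0],    x=[11.0, 16.0, 13.0, 9.0],  X=760.0),
]

def expansion(values, N):
    """Unbiased number-raising (expansion) estimator of a stratum total."""
    return N * sum(values) / len(values)

y_hat = [expansion(s["y"], s["N"]) for s in strata]   # estimated Y_h
x_hat = [expansion(s["x"], s["N"]) for s in strata]   # estimated X_h
X = sum(s["X"] for s in strata)                       # known population total of x

# Separate ratio estimator: sum over strata of (y_hat_h / x_hat_h) * X_h
y_separate = sum(yh / xh * s["X"] for yh, xh, s in zip(y_hat, x_hat, strata))

# Combined ratio estimator: (sum of y_hat_h / sum of x_hat_h) * X
y_combined = sum(y_hat) / sum(x_hat) * X

print(round(y_separate, 1), round(y_combined, 1))
```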
The choice between the separate and combined ratio estimators depends on the relative importance of the variance and of the squared bias in the mean squared error of the separate ratio estimator. The variance of the separate ratio estimator is the sum of the individual stratum variances and the squared bias is the square of the sum of the individual stratum biases. The variance of the combined ratio estimator is generally a little greater than that of the separate ratio estimator, but its bias is smaller (since it depends on a larger sample than that found in each individual stratum). Thus one would tend to prefer the separate ratio estimator if the ratio estimator biases in the individual
strata are negligible, but the combined ratio estimator may be preferred if they are appreciable. A rule of thumb sometimes used is that the sample size in each individual stratum must be at least 6, or that some degree of combined ratio estimation should otherwise be adopted (not necessarily over all strata at once).

Cochran [2] compares separate and combined ratio estimation for two small artificial populations. The example population on page 177 has three four-unit strata, each with a very different ratio of Yh to Xh. Two units are selected from each stratum. Every one of the 216 possible samples is enumerated. The combined ratio estimator has high variance 262.8 and low squared bias 6.5, while the separate ratio estimator has low variance 35.9 and high squared bias 24.1. (Several other unbiased and low-bias ratio estimators are compared in the same example.) The population on page 187 has two four-unit strata, but these have very similar ratios of Yh to Xh. Again, two units are selected from each stratum and each of the 36 possible samples is enumerated. This time the combined ratio estimator has both the smaller variance (40.6 as opposed to 46.4) and the smaller squared bias (0.004 as opposed to 0.179). This second example shows that when the ratios of Yh to Xh are nearly constant the separate ratio estimator cannot always be relied upon to have the smaller variance.

The choices of sample allocation (proportional or optimum) and of stratum boundaries (arbitrary or optimum) are subject to the same considerations with ratio estimation as with number-raising estimation, except that the relevant population variances are naturally those appropriate to ratio estimation.

In situations where stratification on a quantitative variable and ratio estimation are both appropriate, it may be preferable to use unequal probability sampling in place of stratification. The abolition of stratum boundaries based on a quantitative variable allows further refinement in qualitative stratification, e.g., by industry or by type of establishment. The simultaneous optimization of the estimator and of the selection probabilities in this situation is considered by Brewer [1].
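A comparison of this kind can be carried out in miniature by enumerating every possible stratified sample from a small artificial population, in the spirit of Cochran's examples; the population below is invented, not Cochran's.

```python
# Sketch: enumerate all stratified samples of size 2 per stratum from a small
# hypothetical population and compare variance and squared bias of the
# separate and combined ratio estimators of the total of y.
from itertools import combinations, product
from statistics import mean

# Two strata of four units each: (y, x) pairs.
strata = [
    [(18, 10), (25, 14), (33, 20), (41, 26)],   # stratum 1 (hypothetical)
    [(9, 12), (12, 18), (20, 25), (27, 36)],    # stratum 2 (hypothetical)
]
X_h = [sum(x for _, x in s) for s in strata]     # known stratum totals of x
X = sum(X_h)
Y_true = sum(y for s in strata for y, _ in s)    # target of estimation

separate_vals, combined_vals = [], []
# product() pairs every 2-unit sample from stratum 1 with every one from stratum 2
for sample in product(*(combinations(s, 2) for s in strata)):
    y_hat = [4 * mean(y for y, _ in sub) for sub in sample]  # expansion estimates
    x_hat = [4 * mean(x for _, x in sub) for sub in sample]
    separate_vals.append(sum(yh / xh * Xh for yh, xh, Xh in zip(y_hat, x_hat, X_h)))
    combined_vals.append(sum(y_hat) / sum(x_hat) * X)

def variance_and_sq_bias(estimates):
    m = mean(estimates)
    return mean((e - m) ** 2 for e in estimates), (m - Y_true) ** 2

print(variance_and_sq_bias(separate_vals))   # separate: variance, squared bias
print(variance_and_sq_bias(combined_vals))   # combined: variance, squared bias
```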
THE USE OF STRATIFIED DESIGNS IN PRACTICE

Sudman [4] distinguishes four situations in which stratified sampling may be used:

1. The strata themselves are of primary interest.
2. Variances differ between the strata.
3. Costs differ by strata.
4. Prior information differs by strata.

If the strata themselves are of primary interest—if for instance the separate unemployment rates for persons aged 15–19 and for persons aged 20 and over are both important—the statistician must consider whether the users need equal accuracies in the two estimates (as measured by their coefficients of variation∗) or whether they are prepared to accept a higher coefficient of variation for the smaller group, in this case the teenagers. If the sample fractions are small and the population coefficients of variation roughly equal, then equal accuracies imply roughly equal sample sizes. Proportional sampling leads to the estimator for the larger population (here, those aged 20 and over) being the more accurate. Sudman suggests a number of possible compromises. If only the individual strata are relevant, if the loss function is quadratic and the loss for each stratum is proportional to its size, then the optimal sample numbers are proportional to the square roots of the population numbers.

Where variances and costs differ between strata, the optimum allocation formulae given previously are relevant. This happens chiefly with economic populations, such as farms or businesses. In human populations the population variances and costs seldom differ greatly from stratum to stratum, and it is usually better in these circumstances to retain the simplicity of proportional allocation.

The last case, in which prior information differs from stratum to stratum, arises only in Bayesian∗ analysis. Sudman gives an example of optimum sampling for nonresponse∗ in a human population where, given very high sampling costs and some prior information, it may be decided not to attempt to sample at all from a ‘‘very difficult’’
stratum. He points out that, either explicitly or implicitly, most researchers employ some prior beliefs about omitted potential respondents. Once this ‘‘very difficult’’ stratum is omitted, proportional sampling is nearly as efficient as optimum sampling, and simpler to use.

CONCLUSION

There are three basic elements in stratified designs. Criteria for stratification must be chosen, strata formed, and sample numbers allocated between them. If ratio estimation is employed, there is a further choice between separate and combined ratios. Except where the strata themselves are of primary importance, the aim of stratification is to decrease the mean square errors of the estimates. This is done by forming strata within which the units are as similar to each other as possible, while each stratum differs from every other stratum as much as possible. Again, unless the strata themselves are of primary importance, human populations should be proportionately sampled, but economic populations, by virtue of their great differences in unit size, should be as nearly as possible optimally sampled.

REFERENCES

1. Brewer, K. R. W. (1979). J. Amer. Statist. Ass., 74, 911–915. (Demonstrates relationship between Neyman allocation and optimum unequal probability sampling.)
2. Cochran, W. G. (1977). Sampling Techniques, 3rd ed. Wiley, New York. (Updated version of a classic work first published in 1953.)
3. Evans, E. D. (1951). J. Amer. Statist. Ass., 46, 95–104. (Demonstrates theoretical advantages of stratification.)
4. Sudman, S. (1976). Applied Sampling. Academic Press, New York, Chap. 6.
BIBLIOGRAPHY

Stratification is a ubiquitous technique. Accounts can be found in all major sampling textbooks, some of which appear in the following list.

Armitage, P. (1947). Biometrika, 34, 273–280. (Compares stratified and unstratified random sampling.)
Barnett, V. (1974). Elements of Sampling Theory, Hodder and Stoughton, London, England. (A concise modern treatment at an intermediate mathematical level.) Bryant, E. C., Hartley, H. O. and Jessen, R. J. (1960). J. Amer. Statist. Ass., 55, 105–124. (Two-way stratification.) Chatterjee, S. (1967). Skand. Aktuarietidskr., 50, 40–44. (Optimum stratification.) Chatterjee, S. (1968). J. Amer. Statist. Ass., 63, 530–534. (Multivariate stratification.) Chatterjee, S. (1972). Skand. Aktuarietidskr., 55, 73–80. (Multivariate stratification.) Cochran, W. G., (1946). Ann. Math. Statist., 17, 164–177. (Compares systematic and stratified random sampling.) Cochran, W. G. (1961). Bull, Int. Statist. Inst., 38(2), 345–358. (Compares methods for determining stratum boundaries.) Cornell, F. G. (1949). J. Amer. Statist. Ass., 42, 523–532. (Small example.) Dalenius, T. (1957). Sampling in Sweden. Almqvist and Wicksell, Stockholm. (Includes comprehensive discussion on optimum stratification.) Dalenius, T. and Gurney, M. (1951). Skand. Aktuarietidskr, 34, 133–148. (Optimum stratification.) Dalenius, T. and Hodges, J. L., Jr. (1959). J. Amer. Statist. Ass., 54, 88–101. (Optimum stratification.) Ericson, W. A. (1965). J. Amer. Statist. Ass., 60, 750–771. (Bayesian analysis of stratification in single stage sampling.) Ericson, W. A. (1968). J. Amer. Statist. Ass., 63, 964–983. (Bayesian analysis of stratification in multistage sampling.) Fuller, W. A. (1970). J. R. Statist. Soc. B, 32, 209–226. (Sampling with random stratum boundaries.) Hagood, M. J. and Bernert, E. H. (1945). J. Amer. Statist. Ass., 40, 330–341. (Component indexes as a basis for stratification.) Hansen, M. H., Hurwitz, W. N. and Madow, W. G. (1953). Sample Survey Methods and Theory, 2 vols., Wiley, New York. (Encyclopaedic.) Hartley, H. O., Rao, J. N. K. and Kiefer, G. (1969). J. Amer. Statist. Ass., 64, 841–851. (Variance estimation with one unit per stratum.) Hess, I., Sethi, V. K. and Balakrishnan, T. R. (1966). J. Amer. Statist. Ass., 61, 74–90. (Practical investigation.) Huddleston, H. F., Claypool, P. L. and Hocking, R. R. (1970). Appl. Statist., 19, 273–278. (Optimum allocation using convex programming.)
STRATIFIED DESIGNS Keyfitz, N. (1957). J. Amer. Statist. Ass., 52, 503–510. (Variance estimation with two units per stratum.) Kish, L. (1965). Survey Sampling. Wiley, New York, (Practically oriented.) Kokan, A. R. (1963). J. R. Statist. Soc. A, 126, 557–565. (Optimum allocation in multivariate surveys.) Mahalanobis, P. C. (1946). Philos. Trans. R. Soc. London B, 231, 329–451. (Stratified designs in large-scale sample surveys.) Mahalanobis, P. C. (1946). J. R. Statist. Soc., 109, 326–370. (Samples used by the Indian Statistical Institute.) Moser, C. A. and Kalton, G. (1971). Survey Methods in Social Investigation, 2nd ed. Heinemann Educational Books, London, England. (Includes 15 pages on stratification from the viewpoint of the social scientist.) Murthy, M. N. (1967). Sampling Theory and Methods. Statistical Publication Society, Calcutta, India. (Includes a particularly useful bibliography for sampling articles published up to that date.) Neyman, J. (1934). J. R. Statist. Soc., 97, 558–606. (The classical paper which established randomization theory as the only acceptable basis for sampling inference for over three decades. Includes the derivation of Neyman allocation.) Nordbotten, S. (1956). Skand. Aktuarietidskr., 39, 1–6. (Allocation to strata using linear programming.) Raj, D. (1968). Sampling Theory. McGraw-Hill, New York. Rao, J. N. K. (1973). Biometrika, 60, 125–133. (Double sampling, analytical surveys.) Sethi, V. K. (1963). Aust. J. Statist., 5, 20–33. (Uses normal and chi-square distributions to investigate optimum stratum boundaries.) Slonim, M. J. (1960). Sampling in a Nutshell. Simon and Schuster, New York. (A slightly humorous nontechnical approach.) Stephan, F. F. (1941). J. Marketing, 6, 38–46. (Expository.) Sukhatme, P. V. and Sukhatme, B. V. (1970). Sampling Theory of Surveys with Applications. Food and Agricultural Organization, Rome, Italy. Tschuprow, A. A. (1923). Metron, 2, 461–493, 646–683. (Includes an anticipation of the Neyman allocation formulae.) U. S. Bureau of the Census (1963). The Current Population Survey—A Report on Methodology. Tech. Paper No. 7, U.S. Government Printing Office, Washington, D. C.
See also AREA SAMPLING; FINITE POPULATIONS, SAMPLING FROM; MULTISTRATIFIED SAMPLING; NEYMAN ALLOCATION; OPTIMUM STRATIFICATION; PROBABILITY PROPORTIONAL TO SIZE (PPS) SAMPLING; PROPORTIONAL ALLOCATION; RATIO ESTIMATORS; and STRATIFIERS, SELECTION OF.
K. R. W. BREWER
STRATIFIED RANDOMIZATION

INNA PEREVOZSKAYA
Merck & Co., Rahway, New Jersey
Randomization procedures in clinical trials are designed to mitigate the source of bias associated with imbalance of the treatment groups with respect to important prognostic factors (covariates), both known and unknown. Simple randomization promotes balance with respect to such covariates across treatment arms as well as balance with respect to treatment assignments themselves. However, simple randomization alone is not sufficient to assure such balance. In fact, with any randomization process, if a sufficient number of baseline covariates are examined, some imbalances will be observed with respect to at least one prognostic factor. In certain situations, for example, if one of the prognostic factors is known to be of critical importance as a predictor of the outcome of interest, one would want to balance proactively for that factor within treatments and not to rely on chance to accomplish that. In such situations, stratified randomization may be beneficial. See the ‘‘Randomization’’ article for a detailed discussion of stratified randomization and its applications.

REFERENCE

1. W. F. Rosenberger and J. M. Lachin, Randomization in Clinical Trials. New York: Wiley, 2002.
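One common implementation of stratified randomization uses permuted blocks within each stratum. The sketch below illustrates the idea; the block size, treatment codes, and stratum labels are arbitrary choices for the example.

```python
# Sketch: stratified randomization via permuted blocks within each stratum.
# Each stratum keeps its own block, so treatment groups stay balanced within
# every combination of the chosen prognostic factors.
import random

TREATMENTS = ["A", "B"]   # two-arm trial, 1:1 allocation (illustrative)
BLOCK_SIZE = 4            # must be a multiple of len(TREATMENTS)

class StratifiedBlockRandomizer:
    def __init__(self, seed=2024):
        self.rng = random.Random(seed)
        self.blocks = {}                      # stratum -> remaining assignments

    def assign(self, stratum):
        """Return the next treatment assignment for a subject in `stratum`."""
        block = self.blocks.get(stratum)
        if not block:                         # start a fresh permuted block
            block = TREATMENTS * (BLOCK_SIZE // len(TREATMENTS))
            self.rng.shuffle(block)
            self.blocks[stratum] = block
        return block.pop()

randomizer = StratifiedBlockRandomizer()
# Stratum defined by, say, site and baseline disease severity (illustrative).
for subject, stratum in enumerate([("site1", "severe"), ("site1", "mild"),
                                   ("site2", "severe"), ("site1", "severe")]):
    print(subject, stratum, randomizer.assign(stratum))
```

Permuted blocks keep the treatment groups nearly balanced at every point of accrual within each stratum, which is the property stratified randomization is meant to protect.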
FURTHER READING

E. M. Zelen, The randomization and stratification of patients in clinical trials. J. Chron. Dis. 1974;28:365–375.
CROSS-REFERENCES

Blinding
Block Randomization
Adaptive Randomization
SUBGROUPS

JANET WITTES
Statistics Collaborative, Washington, District of Columbia

1 INTRODUCTION

Randomized clinical trials are pivotal in determining whether regulatory authorities should approve a new drug for commercial distribution. They may serve to probe into the biology of a disease. They may provide estimates, however rough, of the overall reduction in morbidity or mortality that might be expected if a successful test treatment were to be implemented on a wide scale. But if queried about the most critical role of clinical trials, then most physicians would probably answer that they help guide treatment decisions for the individual patient. And here one runs into a major conceptual obstacle. How can clinical trials actually inform a specific choice of therapy? Physicians have long recognized that simply assigning a diagnosis does not help much in estimating a patient’s risk of running into future trouble. Most illnesses exhibit a wide range of behaviors, so it seems natural to look to clinical trials for information about specific subgroups of patients in the hope that subgroup-specific estimates of treatment effects, rather than the overall effect, will predict more accurately the effect on specific individuals. In theory, a particular intervention might work better in less sick patients than in very sick ones. On the other hand, if a new effective therapy has certain troublesome or life-threatening toxicities, then the balance of risk and benefit might be more favorable if the patient has a late, rather than early, stage of disease, for then the urgent need for benefit from therapy may override even a considerable risk. For example, the risk–benefit relationship in coronary-artery bypass surgery differs for people with single-vessel disease from those with triple-vessel involvement: The more vessels diseased, the higher the risk of heart attack and the greater the benefit of surgery relative to medical therapy (1).

Although all these considerations sound fairly straightforward, the actual identification of subgroup-specific effects turns out to be treacherous. To some extent, the reasons are medical and biological: We often lack comprehensive medical knowledge concerning the diseases in question, so that the definition of clinically relevant subgroups may not even be obvious. Any particular individual presents with many demographic and disease-specific features; sorting out the important ones from the red herrings may not be simple. To an even greater extent, however, the problems with subgroup analysis are statistical, relating to the need for large samples to insure reliable inference because the treatment differences that emerge from most clinical trials are only small or moderate in magnitude (2–6).

2 THE GENERAL PROBLEM

In clinical trials, subgroups loom as important considerations at least three times: in designing a study, in analyzing its data, and when physicians seek to apply its results to their own patients. Designing a study involves selecting entry criteria. When the study is complete, the investigators must decide whether to consider the dataset as a whole or to separate the observations into strata or use covariates to predict the effect of treatment. Finally, when a patient comes for therapy, the physician must decide whether any apparent subgroup-specific effects that relate to demographic or patient-specific variables preclude, suggest, or mandate a particular treatment. This article addresses these and other issues related to subgroups.

3 DEFINITIONS

In the context of a randomized trial, a subgroup is a subset of patients who participate in the trial. This article reserves the word ‘‘subgroup’’ to refer to patient subsets defined by parameters whose value is determined before randomization. In other words,
a subgroup is defined here by baseline variables, that is, variables whose values are fixed before the start of treatment. Age, gender, race, or disease-related characteristics (e.g., blood pressure, CD4 count, tumor histology, and antibody titer) determined before study entry all define subgroups of potential interest for various disease states. Parameters measured after start of therapy and therefore potentially influenced by treatment, such as measures of response (reduction in blood pressure, increase in CD4 count, shrinkage of tumor, and increase in antibody titer) or compliance (amount of the treatment actually delivered) also define patient subsets, but in these improper ‘‘subgroups,’’ the dependence of treatment outcome on parameters defined only after treatment has begun can never be disentangled satisfactorily from the possible relation of these parameters to favorable or unfavorable patient characteristics.

Three types of subgroups are of special interest in clinical trials.

1. Demographic subgroups are defined by a demographic variable like age, race, or gender.
2. Physiological subgroups are defined by some medical or biological parameter, for example, cholesterol level in a trial of prevention of heart attack, or number of involved nodes in an axillary node sampling from a woman with newly diagnosed breast cancer, or time since cardiac arrest in a trial of resuscitation strategies.
3. Target subgroups are defined by a characteristic that is likely to determine whether the treatment has a chance of being effective. This classification often applies to certain studies in infectious disease, in which treatment is given to people suspected of being infected with an organism against which the test therapy has been designed but whose presence can only be ascertained after therapy has actually started. It is also relevant to patients defined by genotypes that are most likely to respond to a specific therapy.
blurry. The physiological changes that normally accompany advancing age are known to affect the pharmacological behavior of many drugs. Nevertheless, in many diseases age seems to have little practical effect on the tolerance of therapy or on the response to it, and thus an automatic age cut-off for eligibility, which was common in clinical trials until recently, may be without biological rationale. On the other hand, many elderly patients cannot tolerate very toxic or stressful therapies. In trials of methods for prevention, age may be an indicator of risk, so that a lower cut-off for age assures a study group of high risk for the disease under investigation. Similarly, for certain diseases of women such as breast cancer and osteoporosis, age acts as a correlate of menopausal status and thus an indirect indicator of hormonal status. Children comprise another subset defined by age, but they do not form a typical demographic subgroup. Pace medieval artists, children are not simply little adults; the growth and development that children undergo from infancy through adolescence make childhood a unique physiological state. Thus from the viewpoint of therapy, children are often best thought of as a physiological rather than a demographic subgroup. A priori versus a posteriori subgroups form another important distinction between subgroups. This dichotomy refers to the time in the investigation that the subgroup is specified. Formally, investigators can declare their interest in particular subgroups either during the planning phase of the study (a priori) or after the study has generated some data and at least a portion of the observations has been examined (a posteriori). Often subgroups defined a priori are those suggested by a strong medical or biological rationale. Sometimes, evidence from previous investigation points to their importance, and perhaps some of them have even been singled out as stratification variables for purposes of randomization. Prior specification of the hypotheses allows the investigator to count the number of statistical tests that will be performed and hence permits valid corrections for multiplicity. When important subgroup hypotheses are unspecified in the protocol but reported later in the scientific literature, the reviewers can never be sure
whether more subgroups were tested than reported and hence whether the corrections for multiplicity are sufficiently stringent.

Analyzing treatment effects on subgroups established a posteriori, after an initial look at the data, is risky. Because such analysis is data-driven, ‘‘statistically significant’’ differences cannot be taken too seriously; in fact, to attach formal significance levels (P-values) to any differences that emerge from this type of analysis may be very misleading. Biological or medical rationalizations of the observed differences may provide only a weak validation of the result, because one can usually construct a plausible explanation retrospectively for just about any finding. Although very large differences may be highly suggestive of a true effect, it is usually safest to regard treatment differences observed in a posteriori subgroups as results of random variability or, at best, generators of hypotheses rather than as proof of differential efficacy. An effect that is intriguing either scientifically or clinically should, if possible, be tested definitively in another trial.

4 SUBGROUP EFFECTS AND INTERACTIONS

In analyzing and interpreting results from studies designed to show an effect of treatment for the overall population, a subgroup effect is a true effect of treatment for a particular subgroup. For example, we know that beta-blockers reduce the risk of subsequent heart attack in people who have already had one heart attack. One might reasonably then ask whether beta-blockers are effective in particular subsets of heart-attack patients: men, women, or the elderly, for example. The term subgroup interaction refers to a treatment effect that differs by subgroup. Some authors distinguish between quantitative and qualitative subgroup interactions (7), or crossover and noncrossover interactions (8). In a quantitative, or noncrossover interaction, the treatment effect differs in magnitude but not in direction from one subgroup to another; for example, a treatment, although beneficial for both men and women, may be less effective in men than
in women. On the other hand, a qualitative, or crossover interaction, is said to occur when the effect differs in direction between two subgroups, so that a treatment benefits one group but harms the other. We wish to distinguish interactions that would not change our threshold for treatment from those that would; a large ‘‘quantitative’’ interaction might be as likely to do this as a ‘‘qualitative’’ one. In the ordinary practice of medicine, differential effects of treatment may well occur among various patient subsets. Clinical trials, however, differ in many important ways from the ordinary practice of medicine. The eligibility criteria in a well-designed trial formally exclude those patients whom investigators think may have a substantial chance of harm from any of the treatments in the trial. The existence of a putative crossover interaction in a clinical trial, therefore, always comes as an unpleasant surprise. Perhaps because of the patient selection inherent in clinical trials, identifying bona fide examples of crossover interactions that have been validated as reproducible findings in independent studies is very difficult indeed. In 1988, Yusuf et al. (9,10) examined this question in cardiovascular disease and uncovered no true crossover interaction. The rarity of bona fide crossover interactions seems to be true more generally in medicine as well. In cancer therapy, an interesting example occurred in a Children’s Cancer Study Group trial during the late 1970s. Patients with non-Hodgkin’s lymphoma were allocated randomly to treatment with either a 4-drug or a 10-drug regimen. Analysis of the results by histological subgroup showed that the 4-drug regimen was better for children with B-cell lymphomas, whereas the 10-drug regimen was superior for patients who had lymphoblastic lymphoma (11). Replicating this trial was out of the question for a variety of reasons, and these results helped confirm a difference in the approach to the treatment of B-cell and lymphoblastic lymphomas that has persisted to the present in many centers. Calling this observation a crossover interaction might seem strange now to many pediatric oncologists, who view these two classes of lymphomas as different
diseases and would probably regard the differential treatment effect as resulting from a naïve lumping of different disorders in an earlier era for a protocol. At the time the protocol was formulated, however, many clinicians had substantial doubt about the most suitable approach to treating children with lymphoma, and a multicenter randomized trial seemed to be the best way of settling the issue.

5 TESTS OF INTERACTIONS AND THE PROBLEM OF POWER

If a treatment confers an effect φi on a subgroup i, then a ‘‘treatment-subgroup’’ interaction means that for some subgroups i and j, φi ≠ φj. Consider first an outcome variable measured on a continuous scale. The interaction between the ith and jth subgroup is the difference in the treatment effect between the two, or γij = (φi − φj). Note that γij = −γji. For example, suppose that on average a low-salt diet reduces diastolic blood pressure by 6 mmHg in men and 4 mmHg in women. The treatment effects in men and women are φm = 6 and φw = 4, respectively; the interaction between men and women is γmw = (φm − φw) = 2. If the outcome variable of interest is a proportion, for example the probability of surviving 28 days after acute onset of sepsis, and the subgroups of interest are those with gram-negative bacteremia and those without, then the interaction is typically defined in one of three ways. If the treatment effect is assessed as the difference in survival probabilities between the treated and control groups, then the interaction is the difference of those differences. If the treatment effect is measured as the relative odds of death, then the interaction is measured as the ratio of those odds. If the treatment effect is assessed as the relative risk of death, then the interaction is measured as the ratio of the relative risks. Importantly, the presence of interaction on one scale may correspond to no interaction on another. Consider, for example, Table 1, which shows the 28-day mortality for two subgroups of interest in an imaginary trial of sepsis.
Table 1. Example: Treatment Effect and Interaction in Sepsis

Patients with Gram-Negative Bacteria
              Dead    Alive   Total   Mortality Rate
  Treated      20      80      100        20%
  Control      40      60      100        40%
  Total        60     140      200        30%
  Treatment Effect: difference in proportions −20%; relative odds 0.38; relative risk 0.50

Patients with No Evidence of Gram-Negative Bacteria
              Dead    Alive   Total   Mortality Rate
  Treated      40      60      100        40%
  Control      60      40      100        60%
  Total       100     100      200        50%
  Treatment Effect: difference in proportions −20%; relative odds 0.44; relative risk 0.67

All Patients
              Dead    Alive   Total   Mortality Rate
  Treated      60     140      200        30%
  Control     100     100      200        50%
  Total       160     240      400        40%
  Overall Treatment Effect: difference in proportions −20%; relative odds 0.43; relative risk 0.60
  Interaction: difference in proportions 0%; relative odds 0.84; relative risk 0.75

In this example, when the effect of treatment is measured on a scale of differences in proportions, treatment reduces the mortality rate by 20% in each subgroup, so no treatment by subgroup interaction exists. On the other hand, when relative odds are compared, the odds ratio is 0.38 in the gram-negative group and 0.44 in the other group; their ratio (the relative odds) is 0.84, which suggests a larger treatment effect in the subgroup with gram-negative bacteria. Finally, if relative risk is compared, then the data show relative risks of 0.50 and 0.67 in the gram-negative and other groups, respectively, so that the treatment by subgroup interaction becomes 0.50/0.67, or 0.75, again suggesting that the therapy is considerably more effective in the gram-negative than in the other group. Similar dependence of the presence of interaction on the scale of measurement occurs in survival analysis, in which some natural measures of benefit may be change in 5-year mortality rate, average relative risk, or median survival. Thus, the first problem in discussing interaction is that its very presence depends on the scale of measurement.

Having chosen a scale, the next difficulty in detecting interaction is twofold. First, the estimate of interaction has larger variability than the estimate of an overall effect; second, the magnitude of an interaction is likely to be much smaller than the magnitude of a main effect. Consider the sample size required to show a reduction in diastolic blood pressure (DBP). Assume a 15 mm standard deviation and a study with an α-level of 0.05 and power of 90%. If a salt-restricted diet reduces DBP by an average of 6 mmHg in men and 4 mmHg in women, then a study with 50% men and 50% women would require 380 people in all to detect the expected 5-mm average reduction in DBP. A sample size of approximately 850 people would be needed to detect the effects of 6 mm in men and 4 mm in women with the same α-level and power. Note that the necessary sample size to detect separate effects in men and women is larger than twice the sample size necessary to detect an effect overall because the sample size is inversely proportional to the square of the difference to be detected. Thus, the decrease in sample size that develops under the hypothesis that men
have a difference of 6 mm is more than offset by the hypothesis that women have a difference of only 4 mm. Finally, in a study with 50% men and 50% women, to detect the hypothesized difference of 2 mm in treatment effect, the required sample size becomes 4,700 people!

One might argue that small interactions are not important because they do not affect decisions concerning treatment, although they may affect cost–benefit calculations: treating thousands of women to effect a 4-mm reduction in blood pressure may not be nearly as cost-effective as treating thousands of men to achieve a 6-mm reduction. What one cares about is an interaction large enough to lead one to treat some subgroups of patients differently from others. Gail and Simon have developed a statistical test to detect the presence of crossover interactions in a set of mutually exclusive subgroups defined a priori (8).
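Both the scale dependence of interaction and the sample sizes quoted above can be checked with a few lines of arithmetic; the sketch below uses the Table 1 counts and the stated design values (SD 15 mmHg, two-sided α = 0.05, power 90%).

```python
# Sketch: (i) the three interaction measures computed from the Table 1 counts,
# and (ii) the approximate sample sizes quoted in the text for the blood
# pressure example (SD 15 mmHg, two-sided alpha = 0.05, power 90%).

def effect_measures(deaths_treated, n_treated, deaths_control, n_control):
    pt, pc = deaths_treated / n_treated, deaths_control / n_control
    diff = pt - pc
    odds_ratio = (pt / (1 - pt)) / (pc / (1 - pc))
    relative_risk = pt / pc
    return diff, odds_ratio, relative_risk

gram_neg = effect_measures(20, 100, 40, 100)   # Table 1, gram-negative subgroup
other    = effect_measures(40, 100, 60, 100)   # Table 1, other subgroup

interaction = (gram_neg[0] - other[0],          # difference of differences: 0.0
               gram_neg[1] / other[1],          # ratio of odds ratios: ~0.84
               gram_neg[2] / other[2])          # ratio of relative risks: 0.75
print(interaction)

# Total sample size to detect a mean difference `delta` between two equal arms.
Z = 1.96 + 1.2816                               # two-sided 0.05, power 0.90
SD = 15.0
def total_n(delta):
    return 4 * Z ** 2 * SD ** 2 / delta ** 2

print(round(total_n(5)))                        # overall 5-mm effect: about 380
print(round(total_n(6)) + round(total_n(4)))    # 6 mm in men plus 4 mm in women: about 850
```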
6 SUBGROUPS AND THE PROBLEM OF MULTIPLE COMPARISONS
Table 2. Probability of Specified Relative Risk in at Least One Subgroup

  Relative Risk    Probability
      >1.5            95%
      >2              80%
      >3              45%
      >5              20%
      >10              8%

  Underlying mortality rate: 8%; 500 treated patients, 500 controls; 10 equal-size nonoverlapping subgroups.

A large part of the reason an incautious approach to subgroups leads to trouble is an extension of the simple proposition that, if chance is involved, then the longer you look for something, the more likely you are to find it. As applied here, the more subgroups examined, the more likely chance will deal a spurious interaction, one that may even suggest different treatments for different subgroups. An amusing and well-known example of spurious interaction comes from the ISIS-2 trial (12), which showed a medically important and statistically significant benefit of aspirin in the prevention of fatal heart attacks. To demonstrate the pitfalls of subgroup analysis, the investigators classified all study patients by their astrological signs; the subsequent subgroup analysis showed that whereas aspirin significantly benefited the whole group, it seemed actually to harm people born under Gemini or Libra. Nonastrologers would probably believe the finding was caused by chance. The more subgroups you examine, the more likely you are to find a ‘‘statistically significant’’ difference, or at least a suggestive difference, even if the two treatments are, in fact, equivalent. The following numerical example gives some flavor for the probabilities. Imagine a trial that compares an ineffective treatment with a control. Partition the subjects into 10 mutually exclusive subgroups. This design produces a 22% chance that treatment will seem significantly (P < 0.05) better than control in at least one subgroup and a 40% chance that treatment will seem significantly different from control (either better or worse) in at least one subgroup. These high error rates are bad enough, but note that, by the same token, if three independent trials examine the same question, then the probability that at least one will show a significant benefit of the ineffective treatment in at least one subgroup is over 50%. In the actual practice of reporting trials, the subgroups may be overlapping, and they may be defined after observing the characteristics of the patients who experience the outcome event of the study. These very
substantial risks of finding spurious benefit point to the reason why one often cannot find reproducible subgroup differences when comparing the results of independent clinical trials. Another example applied to survival analysis is even more sobering. Sometimes one observes a subgroup effect of treatment that is not statistically significant. In published reports of the trial, the language may read, ‘‘Although not statistically significant, there was a trend toward benefit in the subgroup of patients defined by (some set of variables). The relative risk in the treated group compared to the control was 3; thus, if the study had had adequate power, then this observation would likely have been statistically significant.’’ Suppose 1000 patients with a mortality rate of 8% are randomized to two treatments of equivalent effectiveness. Again, partition the patient cohort randomly into 10 subgroups of equal size, and now ask about the relative effect of treatment on survival by subgroup. Thus, we are comparing 50 treated cases and 50 controls in each of 10 subgroups. As observed in Table 2, the probability of observing at least one relative risk that looks suggestively high is considerable. This table teaches at least two interesting lessons. The first is that even substantial relative risks can develop by chance alone in at least one subgroup. The second is that it would be very difficult to prove that a treatment affects all subgroups uniformly, because the play of chance can wreak havoc on estimates of relative risk in individual subgroups. Several statistical methods are available to correct the observed P-value for multiple
subgroup comparisons. One simple, very conservative approach is the ‘‘Bonferroni’’ correction: divide the overall significance level by the number of comparisons actually performed. Thus, if you are using a significance level of P = 0.05 as your criterion for declaring two treatments significantly different overall, and you perform 10 subgroup comparisons defined a priori, then you would need a P-value of 0.005 before declaring any particular subgroup comparison significant. The Bonferroni inequality, which forms the basis for this correction, tells us that if we have n events, A1, A2, . . . , An, then the probability P that at least one of the As occurs, Prob(A1 or A2 or . . . or An), is less than or equal to the sum R = Prob(A1) + Prob(A2) + . . . + Prob(An). In our case, Prob(Ai) = α, so that R = nα. To keep the overall error rate below α, we must therefore compare each observed P-value with α/n (equivalently, multiply it by n). This type of correction is valid even when the subgroups are overlapping or highly correlated; however, whereas the Bonferroni adjustment provides a bound for the probability of Type I error, its extreme conservatism, especially when many subgroups are examined, leads to very low power. Other less conservative methods are available to correct for multiplicity.

Corrections for multiplicity are useful to help guard against inappropriate enthusiasm for an observed subgroup difference, but some additional sophistication is necessary in interpreting results even when such a correction is made. To examine why, consider an example from Study 016 of the AIDS Clinical Trials Group, which randomized HIV-infected patients with one or two symptoms of AIDS-related complex to receive either AZT or placebo (13); endpoints were progression to AIDS or new symptoms plus a CD4 count ≤200. Table 3 documents the fraction of patients who developed progression, divided into subgroups defined by CD4 count at study entry. Although some people may interpret these data as supportive of differential subgroup-specific effects, a more parsimonious explanation—and one that is more consistent with expected variability—is that AZT decreases the rate of progression in HIV-positive patients whatever the CD4 count.
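The multiplicity arithmetic behind these figures is elementary; the sketch below reproduces the 22% and 40% probabilities, the over-50% figure for three trials, and the Bonferroni threshold of 0.005 for 10 comparisons.

```python
# Sketch: chance of at least one "significant" subgroup when the treatment
# is truly ineffective, and the corresponding Bonferroni threshold.
alpha = 0.05
k = 10                                            # mutually exclusive subgroups examined

p_one_sided_better = 1 - (1 - alpha / 2) ** k     # significantly better only: ~0.22
p_either_direction = 1 - (1 - alpha) ** k         # better or worse: ~0.40
p_three_trials = 1 - (1 - p_one_sided_better) ** 3   # at least one of 3 trials: >0.5

bonferroni_threshold = alpha / k                  # 0.005, as quoted in the text

print(round(p_one_sided_better, 2), round(p_either_direction, 2),
      round(p_three_trials, 2), bonferroni_threshold)
```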
The overall P-value of 0.002 is strong evidence for such an effect. The P-value of 0.06 for the group with CD4 count of 400–500 does not mean that AZT is ineffective in that group, for the relative risk in that group is similar to the overall relative risk. Slightly more problematic is the group with CD4 count >500. In the absence of strong evidence for a crossover interaction, the fact that one subgroup shows a relative risk in favor of placebo likely reflects variability. The test of heterogeneity of effect is far from statistically significant (3 df χ 2 = 0.7); it indicates no evidence that the relative risks differ significantly from each other, so that surely no evidence suggests that AZT does worse than placebo in the high-CD4 group. The problem of an observed, but not statistically significant, crossover effect is exacerbated by a small sample size but, unfortunately, is not limited to small trials. The Physician’s Health Study (14) was a randomized trial of 22,000 physicians that studied the effect of aspirin on the incidence of myocardial infarction. Overall, the use of aspirin led to a decrease in the number of myocardial infarctions (relative risk = 0.58; P < 0.001). As observed in Table 4, the relative risk in each age subgroup except the youngest is somewhere between roughly 0.5 and 0.6 and the P-values are very small. In the youngest age group, however, the relative risk is slightly over 1 and the P-value is 0.7, showing no evidence of an effect of aspirin. These data raise the question of the effect of aspirin in young (40 to 49 year old) men. A literal interpretation of the relative risk and its P-value would lead to the conclusion that aspirin is ineffective in the youngest group; however, a Gail-Simon test for crossover interaction (8) is far from statistically significant. A parsimonious interpretation of the data is that aspirin provides benefit in all age groups. Both the ACTG study 016 and the Physician’s Health Study share a common finding: The subgroup with the lowest event rate shows an apparent crossover interaction. In neither case, however, does the test of crossover interaction find statistically significant evidence that any subgroups truly have opposite direction of effects. In both examples, the fact that the subgroups with
the apparent effect opposite to the overall effect have very low event rates may well mean that treatment, though not harmful, may not be cost-effective. See Furberg and Byington (15) for a discussion of unusually large or discordant effects observed in small subgroups. For a discussion for the lay reader in the context of behavior of stocks, see Taleb (16).

Table 3. Number of Cases of Progression, Sample Size, and (Progression Rate) by CD4 Count

  CD4 pretreatment   Overall         Placebo         AZT            Relative Risk   Logrank P
  All                51/711 (7.2)    36/351 (10.3)   15/360 (4.2)       0.41          0.002
  200–300            19/140 (13.6)   12/64 (18.8)    7/76 (9.2)         0.49          0.04
  300–400            15/209 (7.2)    12/101 (11.9)   3/108 (2.8)        0.24          0.01
  400–500            12/168 (7.1)    10/90 (11.1)    2/78 (2.6)         0.23          0.06
  >500               5/194 (2.6)     2/96 (2.1)      3/98 (3.1)         1.48          0.63

Table 4. Number of Myocardial Infarctions (Events) by Age in the Physician’s Health Study (14)

             Aspirin                  Placebo
  Age        Events   Sample Size     Events   Sample Size     Relative Risk   P-value
  40–49        27       4,527           24       4,524             1.12          0.7
  50–59        51       3,725           87       3,725             0.58          0.002
  60–69        39       2,045           84       2,045             0.46          0.00004
  70–84        22         740           44         740             0.50          0.006
  Overall     139      11,037          239      11,034             0.58          0.000004

7 DEMOGRAPHIC SUBGROUPS

Designing a trial to include members of various demographic subgroups leads to thinking about subgroup eligibility in several ways. The inclusive view would argue that in the absence of compelling reasons to exclude it, a trial should not exclude a particular subgroup, so that inclusion is as broadly conceived as possible. Then extension of results of the trial to the total population seems natural. The agnostic view goes one step further and argues not only for limiting arbitrary exclusion criteria but also for including enough people from each demographic subgroup of interest so that one can eventually draw specific conclusions about the effect of treatment on that particular subgroup. This stance is agnostic because it assumes that one cannot know the effect of treatment on a specific subgroup without explicit study of it. Finally, the comparative view would argue for
having enough people in various subgroups to permit testing the hypothesis that responses may differ significantly between subgroups. The rationale for this extreme position rests on the assumption that important subgroup differences are likely; its practical implication for clinical trials is profound. If major clinical trials must be designed to pinpoint possible differences in the response of all demographic subgroups of possible interest to treatment, then the implications for cost and time are enormous. The result will necessarily be a significant decrease in the number of trials that can be supported and a severe curtailing of the generation of useful new information. Doubtlessly, diseases have differential effects on demographic subgroups. In fact, an epidemiological perspective shows consistent differences in incidence and mortality rates between men and women, blacks and whites, urban and rural dwellers, and high and low socioeconomic status across a wide variety of diseases. Epidemiological studies would be badly flawed if they failed to report measures of morbidity and mortality by underlying demographic subgroup. The frequent finding of differential disease incidence or mortality rates by demographic subgroups in vital statistics, however, does not imply that response to therapy will also differ by subgroup.
Of course, several reasons could explain why treatment results might differ by demographic subgroup. Some interindividual variation in drug activation and metabolism has a genetic basis and may have therapeutic consequences. If there is no such prior information suggesting that the study regimen is likely to have differential efficacy or toxicity in a specific demographic subgroup, then one needs to decide how much exploration of subgroup analyses offers. The answer probably depends in large part on whether the designers of the trial believe that the biological similarities among people are stronger than our differences. Believers in biological similarities will then be content, if sometimes a little uneasy, to ascribe most of the observed variability in treatment effect between demographic subgroups to the play of chance. Unless you believe that treatment differences between demographic subgroups are likely unless they are specifically excluded, the next question is whether to even bother looking for them in those clinical trials that are not really large enough to be analyzed for such differences reliably. Admittedly, the temptation to do this may seem almost irresistible; indeed, some would regard failure to analyze available data for all it can possibly reveal as arrant anti-intellectualism. Nevertheless, ample reason exists to resist the temptation. Literal readings of the results of many large-scale clinical trials would show results that are at best difficult to understand and at worst nonsensical. In the Systolic Hypertension in the Elderly Program (SHEP) (17), which is a study of 4736 people over the age of 60, black men did not seem to respond to treatment of isolated systolic hypertension, but white men, white women, and black women did respond. In the Veterans Administration (VA) randomized trial of zidovudine therapy versus placebo referred to above (13), ‘‘minorities’’ did not respond to treatment, but nonminorities did. In the two latter examples, the investigators wisely downplayed the importance of the subgroup findings; the SHEP investigators did not even report outcome by race in the primary report of their trial, and the VA investigators urged additional study before making subgroup-specific recommendations.
So whether subgroup analysis deals with astrological signs or ethnic diversity, the play of chance may make results uninterpretable or grossly misleading unless the sample size is large enough to assure stable estimates of treatment effect. A few more comments about demographic subgroups may convince the skeptical reader that even these apparently unambiguous parameters are not simple measures of anything. The inclusion and exclusion criteria of a trial and the geographical areas from which the participating clinics draw patients determine the composition of the study cohort. Thus, in a clinical trial the study population does not in any formal sense represent the target population, at least not in the way that the sample of a well-designed sample survey represents its parent population. Furthermore, the selective forces that lead members of a specific demographic subgroup to enter a trial may differ from those that drive another. Two hospitals within the same city may have very different socioeconomic catchment areas. One hospital may draw middle-class whites and blacks below the poverty line; the other hospital may draw middle-class blacks and poor whites. The composition of a study that achieves a 15% proportion of blacks by selecting from one of these hospitals may differ importantly from the composition of a study that selects from the other. Similarly, recruiting clinics from rural Alabama will produce a subgroup of blacks that may be very different socioeconomically from a clinic in Washington, D.C. An even more subtle source of demographic recruitment bias is the oft-cited, but perhaps incorrect, observation that rates of signing informed consent vary considerably by demographic subgroup (18). When such rates differ, we must presume that the forces leading to participation in a study may also differ. For example, women who enter a trial may have, on the average, more severe disease than participating men. The basic point is that a demographic variable may really function as a surrogate for other, more powerful predictors of outcome; for example, race may instead reflect urbanrural status, socioeconomic status, education, or access to medical care. Interpreting putative differences in response to treatment
is hazardous. A rational approach basically ignores possible treatment interactions with demographic variables, unless the sample size has been planned for such exploration in advance and will support reliable inference. Usually, we should look to demographic variables only as markers of good randomization (21).

8 PHYSIOLOGICAL SUBGROUPS
Physiologic subgroups present a somewhat different series of issues. Physiological subgroups are generally defined by those biological characteristics of the patient and the disease that investigators think are likely to play a role in prognosis and response to treatment. Often they are the major known prognostic factors for the disease in question. The strongest among them may provide the basis for stratifying randomization to insure that patients in the specified subgroups are approximately equally distributed. If the most powerful prognostic factors are equally distributed into the various test arms, then the overall comparison of major endpoints will be valid without adjustment for maldistribution of prognostic factors. It bears noting that the process of randomization guarantees only that the allocation of patients among the treatment arms is unbiased. It does not insure that in any specific clinical trial prognostic factors will be distributed equally between the treatment groups, although a sufficiently large sample size, coupled with the judicious use of stratification, provides a high probability of achieving good balance. Even if the desire to quantify treatment effects by physiological subgroup is clearly biologically reasonable, it is often not feasible. Most trials are far too small to put much faith in analysis of subgroup effects. Hope springs eternal, of course, and so most investigators will continue to examine their data, perhaps in the hope that any subgroup effect found in the analysis will be large enough to be persuasive. This outcome rarely occurs, however, and, even when it does, it is probably at least as likely that such an analysis will be misleading. One should be very skeptical of any alleged subgroup-specific findings and be extremely
reluctant to draw firm conclusions from them. These findings may be used to put forth a new hypothesis to test an observed effect in another trial, but as shown, extreme differences often arise by chance alone. Even an appropriately cautious and skeptical attitude will therefore not provide complete insurance against being misled unless the effect observed is very large, varies substantially from the main result, and has been observed in other studies. If at the beginning of the trial the investigators strongly suspect that the effect of treatment may differ by subgroup and want adequate statistical power to say something about this at the end of the study, then they will have to bite the bullet, convince a sympathetic funding source, and design a large enough study to support subgroup-specific comparisons at the end.

9 TARGET SUBGROUPS

The story is different where the patient population of real interest (i.e., the actual target of the intervention) cannot be identified beforehand but is buried in the study population as a whole. This situation is common in trials of therapies for patients with acute infectious diseases, in which the start of therapy must precede definitive identification of the responsible pathogen. Consider the treatment of cancer patients with fever and severe neutropenia (low white count) from chemotherapy with no identifiable focus of infection. Many clinical trials have addressed different antibiotic strategies for dealing with this potentially life-threatening problem. Here, one studies a patient population with a clinical syndrome (fever and neutropenia) that makes bacterial infection likely, whether we can identify it or not, but the real target of the intervention is the subgroup actually infected with bacteria; those patients having non-infectious causes of fever cannot be expected to respond to antibiotics, and their presence simply serves to dilute the real target population. Such studies include them simply because there is no way to exclude them beforehand. Nor, by the way, can the culture-negative group be excluded post hoc, aside from the unacceptable biases that such an action would introduce. An even more fundamental reason is that the lack of a positive
culture does not rigorously exclude infection, and many culture-negative patients respond to antibiotics with defervescence. In fact, many organisms have very fastidious growth requirements and require special conditions to be cultured successfully. The history in the late 1980s and early 1990s of monoclonal antibodies against the lipid components of certain (gram-negative) bacterial cell walls adds another twist to this already complex story. Several antibodies had been studied in Phase 2 or Phase 3 clinical trials in the setting of clinical sepsis. For biological reasons, these antibodies were not expected to be effective unless the patient was infected with gram-negative bacteria, against which the antibody has specificity. Gram-negative bacteria are themselves only a subset of all bacteria that can cause the sepsis syndrome. These trials randomized patients satisfying consensus criteria for sepsis to one of two approaches: antibiotics with or without the monoclonal antibody. As already implied, the design of these trials is constrained narrowly by the critical condition of the patients and the emergency nature of the intervention, so that any therapy must be given as soon as possible and cannot wait for identification of organisms. Accordingly, two principal analytical approaches exist. The first approach is to analyze according to the treatment assigned. Obviously, this method is lacking in power. Patients having nontargeted organisms will not respond to antibody; thus, the only way to establish treatment efficacy is to have a large enough sample size to provide the power for detecting a medically significant difference despite the fact that the therapy will be effective in, at most, a subset of the total. The second approach is to wait for the result of cultures and then to analyze only the subset of patients from whom potentially susceptible organisms have been cultured. The size required when the analysis includes everyone is much greater than the strategies that target the gram-negative organism because the difference in the event rate for the treated and control groups is diluted by the lack of effect in the non–gram-negative cases.
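The dilution can be quantified with a standard two-proportion sample-size approximation. In the sketch below, the control mortality, the effect of the antibody in gram-negative cases, and the prevalence of gram-negative infection are all hypothetical values chosen only to show the order of magnitude.

```python
# Sketch: how much larger a sepsis trial must be when only a target subgroup
# (gram-negative infections) can respond, using a standard normal-approximation
# sample size formula for two proportions. All rates are hypothetical.

Z = 1.96 + 0.8416                 # two-sided alpha = 0.05, power = 80%

def n_per_arm(p_control, p_treated):
    """Approximate subjects per arm to detect p_control vs p_treated."""
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return Z ** 2 * variance / (p_control - p_treated) ** 2

p_control = 0.40                  # mortality without antibody (hypothetical)
p_target = 0.25                   # mortality with antibody, gram-negative cases
prevalence = 0.40                 # fraction of enrolled patients who are gram-negative

# Analysis restricted to the culture-positive (target) subgroup.
# (Reaching this many culture-positive patients still means enrolling roughly
#  n_target / prevalence patients per arm.)
n_target = n_per_arm(p_control, p_target)

# Analysis of everyone as randomized: the effect is diluted because only the
# gram-negative fraction can benefit.
p_all_treated = prevalence * p_target + (1 - prevalence) * p_control
n_all = n_per_arm(p_control, p_all_treated)

print(round(n_target), round(n_all), round(n_all / n_target, 1))
```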
Note that these two analytical approaches have different consequences for the interpretation of the study. The first approach concentrates on the effect of the monoclonal antibody on mortality of the total study population. If the monoclonal antibody is wonderfully effective in the subset with gram-negative sepsis, then its impact on the whole population will depend on the proportion of the total with gram-negative infections; if the proportion is great, then we will easily observe an effect in the total population. But if it is not, then we will not observe an effect. A less than wonderful, but still beneficial, effect will clearly require a very large total sample size to show a statistically significant benefit. Whatever its difficulties, however, this method has the advantage that it will allow a powerful comparison of what happens to the total population of patients with presumed sepsis who are treated in this way compared to control. This question is important because if the antibody should eventually become available for widespread use, then physicians will have to decide at the time the patient presents with sepsis whether to use the antibody, not later when culture results are available. Thus, the relative effects of the antibody on the entire patient population are perhaps the most relevant question clinically. By contrast, the second approach, which looks only at the target population of interest, establishes the statistical significance of a beneficial effect on the target population with far fewer patients total; however, precisely for this reason the effect on the total population may not be ascertainable with decent power. This approach allows one to answer the biologically interesting question quite economically (‘‘Does the antibody provide effective therapy for patients infected with organisms against which the antibody was designed?’’), but it leaves the treating physician in the lurch concerning the benefit–risk relationship for the total treated population. In summary, it is both appropriate and mandatory to analyze the results of a trial such as this by target subgroup, because it is certainly a question of great relevance biologically. Nevertheless, from a medical point of view, it seems clear that the effects on
the total treated population can never be ignored, and it is never correct to presume the absence of a potentially deleterious effect in the nontarget subgroup, even if the investigators are not smart enough to figure out in advance how a deleterious effect might be rationalized biologically. On the whole, the first approach is more relevant medically and more conservative of patient safety, but the second approach asks the mechanistic question more directly. Another example of targeted therapies comes from cancer, where certain interventions are targeted against specific molecular targets or tumor types. Again, the treatment may be given to all patients who present with certain clinical features, but the treatment is expected to work only for those whose tumor has specific characteristics.

10 IMPROPER SUBGROUPS
Patient subsets defined on the basis of parameters established after initiation of therapy cannot generally be analyzed for treatment effect in an unbiased manner. Typical examples of such ‘‘improper’’ subgroups are responders to, or compliers with therapy, as opposed to nonresponders and noncompliers, respectively. Clearly, the definition of such subsets requires data generated after the beginning of therapy. Historically, interest in these sorts of comparisons probably developed in an attempt to maximize the information derived from uncontrolled studies of therapy and from the desire of investigators to establish a cause-and-effect relationship between therapy and a benefit. In uncontrolled studies of a particular treatment, it seemed logical to some investigators to infer a causal relationship between treatment and benefit if responders survive longer than nonresponders, or if those who take all their medications do better than those who do not. Viewed in this way, the nonresponders or noncompliers serve as a kind of surrogate control group—not exactly an untreated control, perhaps, but something approaching this and with which those who receive therapy in a fully satisfactory fashion, or who respond to it, can be compared. This line of reasoning has a certain intuitive appeal. Suppose cancer patients whose
tumors shrink in response to chemotherapy live longer than nonresponders, or myocardial infarction patients whose coronary arteries recanalize satisfactorily in response to thrombolytics live longer than those who do not. Such findings may seem to provide acceptable evidence of a direct treatment effect. Unhappily, this type of analysis is beset by too many fatal biases to ever establish a cause-and-effect relationship between a therapy and an effect. Consider the following points. To be classified as a responder, a patient must live long enough for a response to develop. In cancer, for example, most definitions of tumor response require that tumor shrinkage be maintained for a specified period of time (usually at least a month but sometimes longer). Some treatment protocols do not even look for evidence of response until several cycles of therapy have been delivered. By contrast, progression of disease in the face of therapy can often be diagnosed as soon as it occurs. Thus, a built-in length-time bias may occur for responders that is absent for nonresponders. The responder–nonresponder comparison assumes at some level that the nonresponder group is at worst equivalent to an untreated group in its survival. If the nonresponder group actually had worse survival than an untreated group, then no one would be very interested in a result showing that responders lived longer than nonresponders, because it might mean only that response was not itself deleterious. In fact, however, it is never correct to assume that the survival of nonresponders is not adversely affected. Finally, and probably most fundamentally, this sort of analysis is confounded by possible correlation of ''response'' (or ''compliance'') with other, patient- or disease-related factors that are the actual determinants of response (or compliance). The biological heterogeneity of all disease and its human hosts guarantees a broad spectrum of clinical behavior; response to a treatment may be simply a marker of a disease that is inherently more indolent than one that does not respond. One might think that a sophisticated enough analysis should resolve this issue: Simply line up all known prognostic factors and
observe whether you can adjust away the difference in outcome between responders and nonresponders; if you cannot, the treatment must be responsible. This approach, however, is unsatisfactory because it assumes that all important prognostic factors are known, which is never the case. Furthermore, it optimistically assumes we know how to formulate precise mathematical models of the relationship between prognostic factors and outcome. Even if many important prognostic factors are known for a particular disease, a responder–nonresponder difference in survival associated with a particular treatment might well reflect an important biological property that has not yet been identified. The confounding here is hopeless; no analytic treatment can reliably extricate us from this quagmire. Some people use a different strategy for analysis, which is potentially applicable to unblinded studies. Select only those patients without adverse effects of therapy and compare their experience to that of controls. But because the control group, especially a placebo or untreated group, cannot have an adverse event of the new therapy, this analysis excludes only treated patients. The resulting two groups, all of the controls versus only a selected subset of the treated patients, are no longer directly comparable. Similarly, in trials of vaccines, it is tempting to identify only those with an antibody response and compare their frequency of contracting the disease being studied to that of the controls. Again, this mechanism allows a selective exclusion of members of the treated group but no such exclusion of controls. Sometimes the search for factors that help distinguish success from failure centers around a parameter relating to the therapy, such as the amount or intensity of therapy planned or received. Certain therapeutic areas may have a very wide window of equieffective doses. For example, the treatment of pneumococcal pneumonia in otherwise healthy adults or streptococcal pharyngitis in healthy children requires relatively low doses of penicillin, and additional increases in dose above this level do not add therapeutic benefit. In other areas, such as cancer treatment, therapeutic success may depend critically on dose right up to the limit of host
tolerance. Because laboratory studies have long suggested that dose is a critical factor in the ability of cancer chemotherapy to cure tumor-bearing animals, cancer drug development in the clinic generally emphasizes delivery of the maximum tolerated dose. For a variety of reasons, however, patients do not always receive the maximum planned doses of drugs; attenuation of dose often occurs either because the patient simply cannot tolerate the full dose or because the physician does not think it in the patient’s best interests to receive it. What actually happens in a trial, therefore, is that a range of doses is delivered, and consequently one may ask whether the delivery of ‘‘full dose’’ is associated with a better outcome than delivery of less-than-full doses. The foregoing discussion implies that any retrospective analysis of treatment effect according to received drug doses must be biased, at least when this analysis is performed on patients treated within a single trial. Of the many reasons why some patients cannot receive a full dose, some may relate directly or indirectly to the patient’s underlying prognosis. In the cancer example just outlined, perhaps patients who have a high probability of relapse have certain physiological factors (e.g., occult bone marrow metastases) that both increase the probability of relapse and reduce tolerance to chemotherapy. Dose analyses like this have been performed in other ways, however, in which the sources of bias are less obvious.
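The guarantee-time problem described above for responder versus nonresponder comparisons can be illustrated with a small simulation. In the hedged sketch below, every patient has the same exponential survival distribution and ''response'' is assigned purely at random among patients still alive at the response-assessment time, yet the ''responders'' appear to survive longer; the numbers (a 12-month median survival, a 4-month assessment time, a 30% response rate) are arbitrary choices for illustration only.

```python
# A minimal simulation of the built-in length-time (guarantee-time) bias:
# "response" here carries no biological meaning at all, yet responders look better.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
median_survival = 12.0                               # months, identical for everyone
survival = rng.exponential(median_survival / np.log(2), size=n)

assessment_time = 4.0                                # response can only be observed at 4 months
alive_at_assessment = survival > assessment_time
responder = alive_at_assessment & (rng.random(n) < 0.3)   # response assigned at random

print("mean survival, responders:    %.1f months" % survival[responder].mean())
print("mean survival, nonresponders: %.1f months" % survival[~responder].mean())
```

Because a patient must survive to the assessment time to be eligible to ''respond,'' the responder group is enriched with longer survivors even though nothing about the treatment differs between the groups.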
11 SUMMARY
The foregoing discussion points to the prickly problems of subgroups in clinical trials; subgroup analysis is not at all amenable to doctrine and simple solutions. Limitations of resources may prevent the formally correct solution that would maximize the chances of reliable inference in subgroups: very large trials with sample sizes adequate for all subgroup comparisons of interest. This solution also neglects the messy fact that in the real world, new information that develops during the trial may mandate certain subgroup comparisons that were not even considered
during the planning phase. Sometimes meta-analyses of many trials can help solve the subgroup dilemma. Aside from adequacy of sample size, few relatively certain fixes exist. Under most circumstances, the overall effect of a treatment for the entire population in a trial is likely to be a more reliable indicator of the effect in any subgroup than a subgroup-specific analysis (5). Subgroup analyses that are performed should be planned in advance and be motivated by prior clinical or biological knowledge that makes the hypothesis of a treatment interaction plausible. Finally and above all, rummaging around in subgroups is treacherous; be very skeptical of finding anything useful.

REFERENCES

1. CABG Surgery Pooling Group, Meta-analysis of coronary bypass surgery trials. 1994.
2. A. Oxman and G. Guyatt, A consumer's guide to subgroup analysis. Ann. Intern. Med. 1992; 116: 78–84.
3. A. Donner, A Bayesian approach to the interpretation of subgroup results in clinical trials. J. Chronic Dis. 1982; 35: 429–435.
4. R. Simon, Patient subsets and variation in therapeutic efficacy. Br. J. Clin. Pharmacol. 1982; 14: 473–482.
5. S. Yusuf, J. Wittes, J. Probstfield, et al., Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. JAMA 1991; 266: 93–98.
6. C. E. Davis and D. Leffingwell, Empirical Bayes estimates of subgroup effects in clinical trials. Control Clin Trials 1990; 11: 347–353.
7. R. Peto, Clinical trial methodology. Biomedicine 1978; 28: 24–36.
8. M. Gail and R. Simon, Testing for qualitative interactions between treatment effects and patient subsets. Biometrics 1985; 41: 361–372.
9. S. Yusuf, J. Wittes, and L. Friedman, Overview of results of randomized clinical trials in heart disease. I. Treatments following myocardial infarction. JAMA 1988; 260: 2088–2093.
10. S. Yusuf, J. Wittes, and L. Friedman, Overview of results of randomized clinical trials in heart disease. II. Unstable angina, heart failure, primary prevention with aspirin, and risk factor modification. JAMA 1988; 260: 2259–2263.
11. J. Anderson, J. Wilson, D. Jenkin, et al., Childhood non-Hodgkin's lymphoma. The results of a randomized therapeutic trial comparing a 4-drug regimen (COMP) with a 10-drug regimen (LSA2-L2). N. Engl. J. Med. 1983; 308: 559–565.
12. ISIS-2 (Second International Study of Infarct Survival) Collaborative Group, Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17 187 cases of suspected acute myocardial infarction: ISIS-2. Lancet 1988; 2(8607): 349–360.
13. J. Hamilton, P. Hartigan, M. Simberkoff, et al., A controlled trial of early versus late treatment with zidovudine in symptomatic human immunodeficiency virus infection. Results of the Veterans Affairs Cooperative Study. N. Engl. J. Med. 1992; 326: 437–443.
14. Steering Committee of the Physicians' Health Study Research Group, Final report on the aspirin component of the ongoing Physicians' Health Study. N. Engl. J. Med. 1989; 321: 129–135.
15. C. Furberg and R. Byington, What do subgroup analyses reveal about differential response to beta-blocker therapy? The Beta-Blocker Heart Attack Trial experience. Circulation 1983; 67: 98–101.
16. N. N. Taleb, Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets. New York: Thomson/Texere, 2004.
17. The Systolic Hypertension in the Elderly Program (SHEP) Cooperative Research Group, Rationale and design of a randomized clinical trial on prevention of stroke in isolated systolic hypertension. J. Clin. Epidemiol. 1988; 41: 1197–1208.
18. D. Wendler, R. Kington, J. Madans, et al., Are racial and ethnic minorities less willing to participate in health research? PLoS Medicine 2006; 3: 0201–0209.
19. S. Assmann, S. Pocock, L. Enos, et al., Subgroup analysis and other (mis)uses of baseline data in clinical trials. Lancet 2000; 355: 1064–1069.
20. R. L. Greenman, R. M. H. Schein, and M. A. Martin, A controlled clinical trial of E5 murine monoclonal IgM antibody to endotoxin in the treatment of Gram-negative sepsis. JAMA 1991; 266: 1097–1102.
21. E. Ziegler, C. Fisher, C. Sprung, et al., Treatment of gram-negative bacteremia and septic shock with HA-1A human monoclonal antibody against endotoxin. N. Engl. J. Med. 1991; 324: 429–436.
SUBGROUP ANALYSIS

RICHARD M. SIMON
National Cancer Institute
Bethesda, Maryland

Subgroup analysis refers to the practice of attempting to determine whether and how treatment effects vary among subgroups of subjects who are studied in intervention studies. This article reviews the variety of statistical methods that have been used and developed for subgroup analyses and the criticisms that have been made of subgroup analyses. The article provides guidance for the conduct of subgroup analyses in a manner that limits the opportunity for misleading claims.

1 THE DILEMMA OF SUBGROUP ANALYSIS

Most clinical trials have as primary objective the evaluation of an intervention (treatment) relative to a control for a representative sample of patients who satisfy prospectively defined eligibility criteria. The evaluation is made with regard to a prospectively defined endpoint, that is, a measure of patient outcome. This experimental paradigm is frequently used for laboratory and observational studies. The focus on a single treatment effect is consistent with the Neyman-Pearson theory of hypothesis testing that is the general basis for analysis of such studies. Subgroup analysis, which is also called subset analysis, refers to examination of treatment effects in subgroups of study patients. For clinical trials, the subgroups should be defined based on baseline covariates measured before randomization. Subgroup analysis is frequently observed in even the most prominent journals of medicine and science. Assmann et al. (1) reviewed 50 clinical trials published in major medical journals in 1997 and noted that 70% reported a median of four subgroup analyses. Humans with a given disease are heterogeneous with regard to numerous characteristics, some of which may influence treatment effect. Hence, investigators are interested in examining treatment effect within subgroups of patients (2). Temple and Ellenberg (3) indicated that approved treatments for many conditions fail to produce significantly better results than placebo in a large proportion of clinical trials and attributed this to heterogeneity of effectiveness among subgroups of patients. Subgroup analysis has, however, long been criticized by statisticians and clinical trialists (4–7). The criticisms have generally focused on investigators attempting to transform ''negative'' studies into ''positive'' studies; that is, taking studies in which the overall treatment effect is not statistically significant and finding a subgroup of patients for which the treatment effect seems significant. Despite substantial criticism, however, subgroup analysis is widely practiced because of a desire of practicing physicians to use results of clinical trials of heterogeneous diseases for treatment of individual patients.
2 PLANNED VERSUS UNPLANNED SUBGROUP ANALYSIS

It is important to distinguish planned from unplanned subgroup analyses (8, 9). The subgroup analyses that are commonly criticized and commonly practiced are unplanned. They often involve ''data dredging'' in which treatment effects in many subgroups defined with regard to a long list of baseline covariates are examined. In some cases, the list of covariates is limited to the factors used to stratify the randomization of the clinical trial, but generally no strong biological rationale holds for expecting treatment effect to be different for the subgroups examined and generally no prespecified analysis plan exists for performing the subgroup analysis or for controlling the experiment-wise type I error rate. Unplanned subgroup analysis has been described as analogous to betting on a horse after watching the race (10). Rather than testing predefined hypotheses, unplanned subgroup analysis amounts to finding the covariate and cut-point that partitions the data in a manner that maximizes differences in treatment effect estimates. Because the
subgroups are not defined prospectively, multiplicity cannot be controlled. Authors often do not report all of the subgroups examined (11). Even when they do, it is not clear how many more subgroups might have been examined had they not found the differences that they identify in their report. From a Bayesian viewpoint, prior distributions cannot be properly defined post-hoc in a data driven manner (12). If no prespecified subgroup analysis plan exists, one must suspect that important real variation in treatment effect among subgroups was not expected. This implicitly low prior probability reduces the plausibility of ''significant'' findings. Unplanned subgroup analysis should be viewed as exploratory hypothesis generation for future studies. The use of statistical significance tests, confidence intervals, multiplicity corrections, or Bayesian modeling can confuse the fact that no valid inference can really be claimed. The most essential requirement for conducting a subgroup analysis that is intended to be more than completely exploratory is that the analysis should be prospectively planned and should not be data driven. The subgroups to be examined and the manner in which the analysis will be conducted should be prospectively defined. A strong basis must exist for expecting the treatment effect to differ in those subgroups, either based on previous studies or the biology of the treatment and the disease. It is best that this planning occur during the process of study planning and be documented in the study protocol. In some cases, however, it may still be possible to ''prospectively'' plan the subgroup analysis after the patients have been treated but before the data has been examined. With prospective planning of subgroup analyses, it is important that the subgroups be limited to a very few for which strong reasons exist to believe that treatment effect may differ from the overall effect. In pharmacogenomic studies, evaluating treatment effect in a predefined subgroup may have equal importance to the overall comparison. It is becoming increasingly clear that many diseases traditionally defined based on morphology and symptomatology are in fact molecularly heterogeneous. This finding makes it less rational to expect that
all subgroups of patients will benefit from the same treatment (13, 14). Traditional clinical trials with broad eligibility criteria and subgroup analysis viewed as exploratory are of less relevance for this setting, which is particularly the case with treatments developed for particular molecular targets that may be driving disease progression in only a subset of the patients. For example, Simon and Wang (15) describe a clinical trial setting in which it is of interest to test treatment effect overall and for one pre-specified subgroup. They allocate the conventional 0.05 level between the two tests to preserve the experiment-wise type I error rate at 0.05. The sample size could be planned to ensure adequate power for both analyses. This approach can be used more generally for an overall test and G subgroup tests in a prospective subset analysis design. It is prospective in that the subsets and analysis plan are prespecified, the experiment-wise error rate is preserved, and the power for subset analysis is considered in the planning of the study. Preplanned subgroups also occur in 2 × 2 factorial designs. Those trials are generally planned assuming that interactions between the treatment factors do not exist. Consequently, it is important to plan the trial to test this assumption (16, 17). In subsequent sections, we will review methods that have been used for subgroup analysis.

3 FREQUENTIST METHODS

3.1 Significance Testing Within Subgroups

The most commonly used method of subgroup analysis involves testing the null hypothesis of no treatment effect separately for several subgroups S1, . . . , SG. The subgroups may or may not be disjoint. The results are often considered ''statistically significant'' if the significance level is less than the conventional 0.05 threshold. At least three problems exist with this approach. First, no control is implemented for the fact that multiple null hypotheses are being tested. Hence, the probability of obtaining at least one false positive conclusion will be greater than the nominal 0.05 level of each test and may approach 1 rapidly as the number of subgroups G
increases. Table 1 shows the probability of observing a nominally significant (P < 0.05) treatment effect in at least one subgroup if treatment is evaluated in G independent subgroups.

Table 1. Probability of Seeing a Significant (P < 0.05) Treatment Effect in at Least One of G Independent Subgroups

Number of Subgroups G    Probability
 2                       0.097
 3                       0.143
 4                       0.185
10                       0.40
20                       0.64

With only four comparisons, the chance of at least one false positive conclusion is 18.5%. Four subsets can be defined by two binary covariates. Although the comparisons in subsets determined by binary covariates are not independent, the dependence does not have a major effect on ameliorating the problem. For example, Fleming and Watelet (18) performed a computer simulation to determine the chance of obtaining a statistically significant treatment difference when two equivalent treatments are compared in six subgroups determined by three binary variables. The chance of obtaining a P < 0.05 difference in at least one subgroup was 20% at the final analysis and 39% in the final or one of the three interim analyses. A second problem with the approach of significance testing within subgroups is that the power for detecting a real treatment effect in a subgroup may be small. The sample size of most studies is planned to be sufficient only for the primary overall analysis of all eligible subjects. Consequently, only in a meta-analysis of multiple studies is the sample size within subgroup likely to be large enough for adequate within-subgroup statistical power. The third problem with this approach is that usually no strong reason exists to expect that the treatment would be effective in specific subgroups and not effective in other specific subgroups. If strong reasons existed to suspect that, then it would not have been appropriate to plan and size the trial for the overall analysis of the eligible subjects as a whole.
This problem is often stated as ''qualitative interactions are a-priori unlikely,'' as will be clarified later in this section. All three of these considerations work in the direction of making a P < 0.05 result for a subgroup more likely to represent a false positive than a true positive (19). The approach of using statistical significance tests within prespecified subgroups can be modified by reducing the threshold for declaring statistical significance for subgroup effects. The simplest approach to controlling multiple comparisons is to use the Bonferroni adjustment based on the number G of subgroups examined. If the G subgroups are disjoint and each is tested at level α*, then the probability of obtaining at least one false-positive subgroup effect is 1 − (1 − α*)^G. This probability can be limited to be no greater than α by setting α* = 1 − (1 − α)^(1/G). The experiment-wise type I error rate, which is conventionally set at 0.05, could be allocated between the overall analysis and the subgroup analyses. For example, the overall analysis could be performed at a significance level of 0.04 and, if it is nonsignificant, the subgroup analyses could utilize the remaining α = 0.01, allocated among the G disjoint subgroups using the Bonferroni adjustment. For nondisjoint subgroups, the within-subgroup analyses are not independent and the Bonferroni adjustment is conservative. Many somewhat more efficient approaches are available (e.g., References 8, 20, and 21). Although the multiplicity adjustments can protect the experiment-wise type I error rate, they render the problem of lack of statistical power within subgroups even more severe, because the significance tests are performed at a more stringent level.
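The probabilities in Table 1 and the adjusted per-subgroup threshold follow directly from the formulas just given; a minimal sketch (assuming independent, disjoint subgroups, as in the text) is:

```python
# Reproduce the multiplicity calculations described above: the chance of at
# least one nominally significant subgroup among G independent tests, and the
# per-subgroup threshold alpha* = 1 - (1 - alpha)**(1/G) that holds the
# experiment-wise rate at alpha for G disjoint subgroups.
alpha = 0.05
for G in (2, 3, 4, 10, 20):
    p_any_false_positive = 1 - (1 - alpha) ** G      # the values shown in Table 1
    alpha_star = 1 - (1 - alpha) ** (1 / G)          # adjusted per-subgroup level
    print(f"G={G:2d}  P(>=1 false positive)={p_any_false_positive:.3f}  "
          f"alpha*={alpha_star:.4f}")
```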
4 TESTING TREATMENT BY SUBGROUP INTERACTIONS

Statisticians generally recommend testing a global hypothesis of no treatment by subgroup interaction as a prelude to any testing of treatment effect within subgroups (22). If the true treatment effects in the G subgroups are equal, then there is said to be no treatment by subgroup interaction. The null hypothesis of no interaction does not stipulate whether the common treatment effect is null or nonnull. The interaction test used depends on the structure of the subgroups and the distribution of the data. For model-based analysis, one general approach is first to fit the model that contains a single treatment effect and the main effects for the categorical covariates that define the subgroups. One then fits a model that also contains all of the treatment by covariate interactions. Twice the difference in log likelihoods provides an approximate test of the null hypothesis of no interactions. Gail and Simon (23) developed a test for no qualitative interactions for a categorical covariate with G levels. A qualitative interaction is said to exist when the true treatment effects in the different subgroups are not all of the same sign. Having the treatment effect positive in some subgroups and null in other subgroups also represents a qualitative interaction. Qualitative interactions are the interactions of real interest because ordinary interactions are scale dependent. That is, different subgroups may have the same treatment effect when measured on the logit scale, but not on the difference in response probability scale. Russek-Cohen and Simon (24) extended the Gail-Simon test to a single continuous covariate or to data cross-classified by several categorical covariates, and other tests for qualitative interactions have been developed (25, 26). Many authors use interaction tests one at a time for each categorical covariate used to define subgroups. This approach, however, does not really protect the experiment-wise type I error rate. A better approach is to structure an analysis in which significance tests for treatment effect within subgroups are performed only if a global test
of no interaction or no qualitative interactions is rejected in the context of a model that includes all of the categorical covariates simultaneously. Consider, for example, a proportional hazards model

λ(t; z, x) = λ0(t) exp(αz + β′x + zγ′x)    (1)
where z denotes a binary treatment indicator, x is a vector of binary covariates, α is the main effect of treatment, β is the vector of main covariate effects, and γ is the vector of treatment by covariate interaction effects. A global test of no interactions tests the hypothesis that all of the components of γ are simultaneously zero. If the global interaction test is performed at the 0.05 level, and the within subgroup tests are also performed at the 0.05 level, then the experiment-wise type I error rate is preserved under the global null hypothesis that all subset effects are null. Interaction tests tend to have low power because they involve a comparison of multiple treatment effect estimates (27, 28). Brookes et al. (27) reported that if a trial has 80% power to detect the overall effect of treatment, then reliable detection of an interaction between two subgroups of the same magnitude as the overall effect would need a four-fold greater sample size. Brookes et al. (27) and Peterson and George (16) have studied sample size requirements for detecting interactions. Qualitative interaction tests are considered to be even more demanding with regard to sample size requirements. Some authors view the limited power of interaction tests as an advantage, however, as it will be more difficult to justify evaluation of treatment effects within subgroups. Consequently, the best setting for subgroup analysis is a meta-analysis of clinical trials. Certainly for unplanned subgroup analyses, replicability of findings is more important than the uninterpretable ''statistical significance'' frequently reported, and this can best be addressed in meta-analyses (10).
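As an illustration of the global likelihood-ratio interaction test described above, the hedged sketch below uses a logistic model for a simulated binary outcome rather than the proportional hazards model of Equation (1); the data, covariate names, and effect sizes are invented purely for the example.

```python
# Global likelihood-ratio test of treatment-by-covariate interaction:
# fit the main-effects model, then the model with all interactions, and compare
# twice the difference in log likelihoods to a chi-square distribution.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "z": rng.integers(0, 2, n),     # randomized treatment indicator
    "x1": rng.integers(0, 2, n),    # binary baseline covariates defining subgroups
    "x2": rng.integers(0, 2, n),
})
# Simulated truth: treatment works only when x1 == 1 (a qualitative interaction)
logit_p = -0.5 + 0.2 * df.x2 + 0.8 * df.z * df.x1
df["y"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

reduced = smf.logit("y ~ z + x1 + x2", data=df).fit(disp=0)             # main effects only
full = smf.logit("y ~ z + x1 + x2 + z:x1 + z:x2", data=df).fit(disp=0)  # + interactions

lr = 2 * (full.llf - reduced.llf)          # twice the difference in log likelihoods
df_lr = full.df_model - reduced.df_model   # number of interaction terms (here 2)
print(f"LR statistic = {lr:.2f}, df = {df_lr:.0f}, P = {chi2.sf(lr, df_lr):.4f}")
```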
5 SUBGROUP ANALYSES IN POSITIVE CLINICAL TRIALS

It is suggested frequently that subgroup analysis should be disregarded unless the treatment effect is significant for the subjects as a
whole. This viewpoint is motivated by a wish to prevent authors from reporting ''negative'' studies as positive based on false-positive subgroup findings. Subgroup analyses, however, are generally not more reliable for positive trials than for trials that are negative overall. If the true treatment effect is of size δ for all subgroups, then it is likely that in some subgroups, the treatment effect looks more like it is of size 0 or 2δ than the true value δ. Grouin et al. (29) point out that it is a standard regulatory requirement to examine treatment effects within subgroups in trials for which the overall treatment effect is significant ''to confirm that efficacy benefits that have been identified in the complete trial are consistently observed across all subgroups defined by major factors of potential clinical importance.'' Unfortunately, however, this requirement itself may result in misleading findings. Peto (30) showed that if the patients in a trial that is just significant at the P < 0.05 level overall are randomly divided into two equal size subgroups, then a one in three chance exists that the treatment effect will be large and significant in one of the subgroups (P < 0.05) and negligible in the other (P > 0.50). Hence, the regulatory requirement of demonstrating that efficacy benefits are observed consistently across all subgroups is likely to lead to false conclusions that the efficacy is concentrated in only some subgroups. This is particularly true of factors such as center, which have no a priori clinical relevance, should be viewed as random effects, and should not be subjected to subgroup analysis. It is ironic that statistical practice is to be suspicious of subgroup analyses for studies that are negative overall and to require such analyses for studies that are positive overall; the problems of multiplicity and inadequate sample size limit validity of inference in both settings. Subgroup analyses that are not of sufficient a priori importance to be planned and allocated some experiment-wise type I error rate should generally not be considered definitive enough to influence post-study decision making.
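Peto's random-split observation can be explored with a short simulation. The sketch below fixes the overall z statistic at the two-sided 0.05 boundary and repeatedly splits the trial into two equal halves; the thresholds used for ''significant'' (two-sided P < 0.05) and ''negligible'' (two-sided P > 0.50) are one reasonable reading of the statement above, and with them the simulated proportion comes out roughly at the one-in-three figure quoted.

```python
# Random splits of a "just significant" trial into two equal halves.
# For equal halves with normally distributed effect estimates, the half-sample
# z statistics satisfy z1 + z2 = sqrt(2) * z_overall, with z1 - z2 ~ N(0, 2).
import numpy as np

rng = np.random.default_rng(2)
n_splits = 500_000
z_overall = 1.96                                  # overall result just significant
diff = rng.normal(0.0, np.sqrt(2.0), n_splits)    # z1 - z2 across random splits
z1 = (np.sqrt(2.0) * z_overall + diff) / 2.0
z2 = np.sqrt(2.0) * z_overall - z1

significant = 1.96      # two-sided P < 0.05 in one half
negligible = 0.674      # two-sided P > 0.50 in the other half
event = ((z1 > significant) & (np.abs(z2) < negligible)) | \
        ((z2 > significant) & (np.abs(z1) < negligible))
print("proportion of splits with one significant half and one negligible half: "
      f"{event.mean():.3f}")
```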
6 CONFIDENCE INTERVALS FOR TREATMENT EFFECTS WITHIN SUBGROUPS

In some studies, it may be more feasible to identify prospectively the subsets and to plan the analysis so that the experiment-wise type I error rate is preserved than to increase the sample size sufficiently for adequately powered subgroup analyses. The limitations of inadequate sample size can be made explicit by using confidence intervals for treatment effects within subgroups instead of or in addition to significance tests (31). Reporting a broad confidence limit for a treatment effect within a subgroup is much less likely to be misinterpreted as a lack of efficacy of the treatment within that subgroup than is a ''non-significant'' P-value. The confidence coefficients used should be based on the ''multiplicity corrected'' significance levels that preserve the experiment-wise type I error rate. If the subgroups are disjoint, then the multiplicity adjustment may be of the simple Bonferroni type described above. Otherwise, model-based confidence intervals and multiplicity adjustments can be used to account for the dependence structure. Model-based frequentist confidence intervals for treatment effects in subgroups were illustrated by Dixon and Simon (32). Consider, for example, a generalized linear model

L(Y) = f(x, z) = αz + β′x + zγ′x    (2)
where L is the link function, x is a vector of categorical covariates that define the subgroups, z is a binary treatment indicator with z = 1 for the experimental treatment and z = 0 for control, α is the main effect of treatment, β is a vector of main effects of covariates, and γ is a vector of treatment by subgroup interactions. The treatment effect for a subgroup specified by x can be defined as f(x, z = 1) − f(x, z = 0) = α + γ′x. This subgroup effect can be estimated by α̂ + γ̂′x. Asymptotically, we generally have that γ̂ is multivariate normal and α̂ is univariate normal. Consequently, the estimate of treatment effect for a specific subgroup is approximately univariate normal, and the estimates for several subgroups are approximately multivariate normal with a covariance matrix that can be estimated from the data. An approximate confidence interval for treatment effect
within each subgroup can be constructed based on the multivariate normal distribution (32). It is common to report confidence intervals for the average outcome on each treatment arm in each subgroup presented. These intervals, however, are much less useful than confidence intervals for treatment effects within subgroups.
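A hedged sketch of such a model-based interval, using a simulated logistic-regression instance of Equation (2): the subgroup effect α + γ′x is estimated as a linear combination a′θ̂ of the fitted coefficients, with variance a′Ca taken from the estimated covariance matrix. The variable names and data below are invented, and a wider critical value would be substituted when a multiplicity-corrected confidence coefficient is wanted.

```python
# Subgroup treatment effects (log odds ratios) and confidence intervals from a
# fitted model with a treatment-by-covariate interaction.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 4000
df = pd.DataFrame({"z": rng.integers(0, 2, n), "x1": rng.integers(0, 2, n)})
eta = -0.3 + 0.4 * df.z + 0.2 * df.x1 + 0.5 * df.z * df.x1
df["y"] = (rng.random(n) < 1 / (1 + np.exp(-eta))).astype(int)

res = smf.logit("y ~ z + x1 + z:x1", data=df).fit(disp=0)
theta, C = res.params, res.cov_params()

def subgroup_effect(x1_value, z_crit=1.96):
    """Treatment effect estimate and 95% CI for the subgroup with x1 = x1_value."""
    a = pd.Series(0.0, index=theta.index)   # contrast vector a
    a["z"] = 1.0                            # main treatment effect (alpha)
    a["z:x1"] = x1_value                    # interaction contribution (gamma * x)
    est = float(a @ theta)
    se = float(np.sqrt(a @ C @ a))
    return est, est - z_crit * se, est + z_crit * se

for x1_value in (0, 1):
    est, lo, hi = subgroup_effect(x1_value)
    print(f"x1={x1_value}: effect {est:.2f}  95% CI ({lo:.2f}, {hi:.2f})")
```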
7 BAYESIAN METHODS
Several authors (33–35) have studied empirical Bayesian methods for estimating treatment effects δi in G disjoint subgroups of patients. Often the estimators are taken as independent, δ̂i | δi ∼ N(δi, σi²), with estimates of the σi² available, and the subgroup-specific treatment effects δi are taken as exchangeable draws from the same distribution N(µ, τ²). The mean µ and variance τ² are estimated from the data, often by maximizing the marginal likelihood. This technique is similar to frequentist James-Stein estimation (36) and to mixed-model analysis of variance such as is used in random effects meta-analysis (37). Several authors have discussed the use of fully Bayesian methods for evaluating treatment effects within subgroups of patients (e.g., References 12, 32, 38–40). Dixon and Simon (40) presented perhaps the most general and extensively developed method. For a generalized linear model such as Equation (2) with subgroups determined by one or more binary covariates, they assumed that γ ∼ MVN(0, ξ²I), with I denoting the identity matrix, and they used a modified Jeffreys prior for the variance component ξ². They used flat priors for the main effects and derived an expression for the posterior density of any linear combination of the parameters θ = (α, β, γ). Simon et al. (41) simplified the Dixon-Simon method by using a multivariate normal prior θ ∼ MVN(0, D). They show that the posterior distribution of the parameters can be approximated by MVN(Bb, B), where B⁻¹ = C⁻¹ + D⁻¹, b = C⁻¹θ̂, θ̂ denotes the usual maximum likelihood (or maximum partial likelihood) estimate of the parameters obtained by frequentist analysis, and C is the estimated covariance matrix of θ̂. With independent priors for the interaction effects and flat priors for the main effects, D⁻¹ is diagonal with p + 1 zeros along the main diagonal corresponding to the main effects, followed by diagonal elements 1/d1, . . . , 1/dp corresponding to the reciprocals of the variances of the priors for the p treatment by covariate interactions. They specify these variances to represent the a-priori belief that qualitative interactions are unlikely. For any linear combination of parameters η = a′θ, the posterior distribution is univariate normal N(a′Bb, a′Ba). They define linear combinations of parameters to represent subgroups based on simultaneous specification of all covariates, and for subgroups determined by a single covariate, averaged over the other covariates.
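A small numerical sketch of this posterior approximation, with invented estimates, covariance matrix, and prior variance (none of them from any real trial):

```python
# Posterior approximation MVN(Bb, B) with B^{-1} = C^{-1} + D^{-1}, b = C^{-1} theta_hat.
import numpy as np

# Hypothetical frequentist fit for (alpha, beta1, gamma1): treatment main effect,
# covariate main effect, and one treatment-by-covariate interaction.
theta_hat = np.array([0.40, 0.20, 0.50])
C = np.diag([0.08, 0.06, 0.15])            # estimated covariance of theta_hat

# Flat priors (zero precision) for the main effects; a proper prior with variance d
# for the interaction, expressing that large interactions are unlikely.
d = 0.10
D_inv = np.diag([0.0, 0.0, 1.0 / d])

C_inv = np.linalg.inv(C)
B = np.linalg.inv(C_inv + D_inv)
posterior_mean = B @ C_inv @ theta_hat     # Bb: interaction is shrunk toward zero
posterior_sd = np.sqrt(np.diag(B))

# Subgroup effect alpha + gamma for the covariate-positive subgroup: a = (1, 0, 1)
a = np.array([1.0, 0.0, 1.0])
print("posterior mean:", posterior_mean.round(3))
print("posterior sd:  ", posterior_sd.round(3))
print("subgroup effect (x = 1): %.3f +/- %.3f"
      % (a @ posterior_mean, np.sqrt(a @ B @ a)))
```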
Simon (42) applied the Bayesian approach of Simon et al. (41) to the requirement that randomized clinical trials sponsored by the U.S. National Institutes of Health include an analysis of male and female subgroups. He showed that the mean of the posterior distribution of treatment effect in each gender subgroup is a weighted average of both gender-specific estimates, with the weights determined by the variance of the prior distribution of the interaction effect. As the variance goes to infinity, the posterior distribution of treatment effect for women depends only upon the data for the women. This analysis is equivalent to a frequentist analysis that includes an interaction term. Most clinical trials, however, are not conducted in a context where large interactions are as a-priori likely as no interactions. If the variance of the prior is specified as zero, then the posterior distribution of the treatment effect for women weights the treatment effects for both subgroups in inverse proportion to their variances. It is essentially equivalent to the usual frequentist analysis without interactions. This limit seems equally extreme, however, as it corresponds to an assumption that treatment by subgroup interactions are impossible. The Bayesian analysis enables an analysis to be performed using the assumption that large treatment by subgroup interactions are possible but unlikely.

REFERENCES

1. S. F. Assmann, S. J. Pocock, L. E. Enos, and L. E. Kasten, Subgroup analysis and other
(mis)uses of baseline data in clinical trials. Lancet 2000; 355: 1064–1069.
2. D. L. Sackett, Applying overviews and meta-analyses at the bedside. J. Clin. Epidemiol. 1995; 48: 61–70.
3. R. Temple and S. S. Ellenberg, Placebo-controlled trials and active-control trials in the evaluation of new treatments. Part 1: ethical and scientific issues. Ann. Intern. Med. 2000; 133: 455–463.
4. S. J. Pocock, S. E. Assmann, L. E. Enos, and L. E. Kasten, Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Stats. Med. 2002; 21: 2917–2930.
5. I. F. Tannock, False-positive results in clinical trials: multiple significance tests and the problem of unreported comparisons. J. Natl. Cancer Inst. 1996; 88: 206.
6. T. R. Fleming, Interpretation of subgroup analyses in clinical trials. Drug Informat. J. 1995; 29: 1681S–1687S.
7. D. I. Cook, V. J. Gebski, and A. C. Keech, Subgroup analysis in clinical trials. Med. J. Australia 2004; 180: 289–291.
8. D. R. Bristol, P-value adjustments for subgroup analyses. J. Biopharm. Stats. 1997; 7: 313–321.
9. G. G. Koch, Discussion of p-value adjustments for subgroup analyses. J. Biopharm. Stats. 1997; 7: 323–331.
10. P. M. Rothwell, Subgroup analysis in randomised controlled trials: importance, indications, and interpretation. Lancet 2005; 365: 176–186.
11. S. Hahn, P. R. Williamson, L. Hutton, P. Garner, and E. V. Flynn, Assessing the potential for bias in meta-analysis due to selective reporting of subgroup analysis within studies. Stats. Med. 2000; 19: 3325–3336.
12. D. A. Berry, Multiple comparisons, multiple tests and data dredging: a Bayesian perspective. In: J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith (eds.), Bayesian Statistics. Oxford: Oxford University Press, 1988.
13. R. Simon, New challenges for 21st century clinical trials. Clin. Trials 2007; 4: 167–169.
14. R. Simon, A roadmap for developing and validating therapeutically relevant genomic classifiers. J. Clin. Oncol. 2005; 23: 7332–7341.
15. R. Simon and S. J. Wang, Use of genomic signatures in therapeutics development. Pharmacogenom. J. 2006; 6: 1667–1673.
16. B. Peterson and S. L. George, Sample size requirements and length of study for testing interaction in a 1 × k factorial design when time-to-failure is the outcome. Control. Clin. Trials 1993; 14: 511–522.
17. R. Simon and L. S. Freedman, Bayesian design and analysis of 2 by 2 factorial clinical trials. Biometrics 1997; 53: 456–464.
18. T. R. Fleming and L. Watelet, Approaches to monitoring clinical trials. J. Natl. Cancer Inst. 1989; 81: 188.
19. R. Simon, Commentary on Clinical trials and sample size considerations: Another perspective. Stats. Science 2000; 15: 95–110.
20. R. J. Simes, An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986; 73: 751–754.
21. Y. Hochberg and Y. Benjamini, More powerful procedures for multiple significance testing. Stats. Med. 1990; 9: 811–818.
22. R. Simon, Patient subsets and variation in therapeutic efficacy. Br. J. Clin. Pharmacol. 1982; 14: 473–482.
23. M. Gail and R. Simon, Testing for qualitative interactions between treatment effects and patient subsets. Biometrics 1985; 41: 361–372.
24. E. Russek-Cohen and R. M. Simon, Qualitative interactions in multifactor studies. Biometrics 1993; 49: 467–477.
25. D. Zelterman, On tests for qualitative interactions. Statist. Probab. Lett. 1990; 10: 59–63.
26. J. L. Ciminera, J. F. Heyse, H. H. Nguyen, and J. W. Tukey, Tests for qualitative treatment-by-centre interaction using a 'pushback' procedure. Stats. Med. 1993; 12: 1033–1045.
27. S. T. Brookes, E. Whitley, T. J. Peters, P. A. Mulheran, M. Egger, and G. Davey-Smith, Subgroup analyses in randomised controlled trials: quantifying the risks of false-positives and false-negatives. Health Technol. Assess. 2001; 5: 1–56.
28. R. F. Potthoff, B. L. Peterson, and S. L. George, Detecting treatment-by-centre interaction in multi-centre clinical trials. Stats. Med. 2001; 20: 193–213.
29. J. M. Grouin, M. Coste, and J. Lewis, Subgroup analyses in randomized clinical trials: statistical and regulatory issues. J. Biopharm. Stats. 2004; 15: 869–882.
30. R. Peto, Statistical aspects of cancer trials. In: K. E. Halnan (ed.), Treatment of Cancer. London: Chapman & Hall, 1982: 867–871.
31. R. Simon, Confidence intervals for reporting results from clinical trials. Ann. Intern. Med. 1986; 105: 429.
32. D. O. Dixon and R. Simon, Bayesian subset analysis in a colorectal cancer clinical trial. Stats. Med. 1992; 11: 13–22.
33. T. A. Louis, Estimating a population of parameter values using Bayes and empirical Bayes methods. J. Am. Stats. Assoc. 1984; 79: 393–398.
34. R. Simon, Statistical tools for subset analysis in clinical trials. In: M. Baum, R. Kay, and H. Scheurlen (eds.), Recent Results in Cancer Research, vol. 3. New York: Springer-Verlag, 1988.
35. C. E. Davis and D. P. Leffingwell, Empirical Bayes estimation of subgroup effects in clinical trials. Control. Clin. Trials 1990; 11: 37–42.
36. B. Efron and C. Morris, Stein's estimation rule and its competitors: an empirical Bayes approach. J. Am. Stats. Assoc. 1973; 68: 117–130.
37. R. DerSimonian and N. Laird, Meta-analysis in clinical trials. Control. Clin. Trials 1986; 7: 177–188.
38. J. Cornfield, Sequential trials, sequential analysis and the likelihood principle. Am. Statistician 1966; 20: 18–23.
39. A. Donner, A Bayesian approach to the interpretation of subgroup results in clinical trials. J. Chron. Dis. 1982; 34: 429–435.
40. D. O. Dixon and R. Simon, Bayesian subset analysis. Biometrics 1991; 47: 871.
41. R. Simon, D. O. Dixon, and B. A. Freidlin, A Bayesian model for evaluating specificity of treatment effects in clinical trials. In: P. F. Thall (ed.), Recent Advances in Clinical Trial Design. Norwell, MA: Kluwer Academic Publications, 1995.
42. R. Simon, Bayesian subset analysis: application to studying treatment-by-gender interactions. Stats. Med. 2002; 21: 2909–2916.
SUBINVESTIGATOR

A Subinvestigator is any individual member of the clinical trial team designated and supervised by the investigator at a trial site to perform critical trial-related procedures and/or to make important trial-related decisions (e.g., associates, residents, or research fellows).
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
SUPERIORITY TRIALS

DAVID L. SACKETT
Trout Research & Education Centre, Ontario, Canada

1 INTRODUCTION

This entry is intended to explain (or at least demystify) what trialists mean by ''superiority'' trials. To understand them, they must be considered in light of the other major trial design, the ''noninferiority'' trial. This article will start from a clinical perspective and then negotiate the accompanying statistical concepts. To make the latter presentation easier to follow, it will be assumed at the outset that the measure of patient outcomes under discussion combines both benefit and harm, as in a global measure of function or quality of life. In addition, because these ideas can be difficult to grasp the first time they are encountered, this article will describe them in both words and pictures. Moreover, although this discussion will employ P-values for simplicity's sake, the author prefers the 95% confidence interval as a more informative measure of the size of treatment effects. Finally, although the ''control'' treatment in the following examples is often referred to as a ''placebo,'' these principles apply equally well to nondrug trials and to trials in which the control treatment is an alternative active treatment.

2 CLINICIANS ASK ONE-SIDED QUESTIONS, AND WANT IMMEDIATE ANSWERS

When busy clinicians encounter a new treatment, they ask themselves two questions (1). First, is it better than (''superior to'') what they are using now? Second, if it is not superior, is it as good as what they are using now (''non-inferior'') and preferable for some other reason (e.g., fewer side effects or more affordable)? Moreover, they want answers to these questions right away.

3 BUT TRADITIONAL STATISTICS IS TWO-SIDED

Progress toward the clinician's ''more informative'' goal arguably has been slowed because of adherence to traditional statistical concepts that call for two-sided tests of statistical significance and require rejection of the null hypothesis. This situation has been made worse by the misinterpretation of ''statistically non-significant'' results of these two-tailed tests. Rather than recognizing such results as ''indeterminate'' (uncertain), authors often conclude that they are ''negative'' (certain, providing proof of no difference between treatments). At the center of this consideration is the ''null hypothesis,'' which decrees that the difference between a new and standard treatment ought to be zero. Two-sided P values tell us the probability that the results are compatible with that null hypothesis. When the probability is small (say, less than 5%), we traditionally ''reject'' the null hypothesis and ''accept'' the ''alternative hypothesis'' that the difference observed is not zero. In doing so, however, we make no distinction as to whether the new treatment is better or worse than the standard treatment. Of course, we must recognize that small P-values (say, less than 0.05), although rare, must occur by chance when there is no real treatment effect at all, especially in post hoc subgroup analyses, as shown in the elegant ''DICE'' experiments of Michael Clarke and his colleagues (2,3).

4 THE CONSEQUENCES OF TWO-SIDED ANSWERS TO ONE-SIDED QUESTIONS

Three consequences of this reasoning should be mentioned. First, by performing ''two-sided'' tests of statistical significance, investigators ignore the ''one-sided'' clinical questions of superiority and noninferiority. Second, investigators often fail to recognize that the results of these two-sided tests, especially in small trials, can be ''statistically nonsignificant'' even when their confidence intervals include important benefit or harm. Third, investigators (abetted by editors) frequently
misinterpret this failure to reject the null hypothesis (based on two-sided P values > 5%, or 95% confidence intervals that include zero). Rather than recognizing their results as uncertain (''indeterminate''), they report them as ''negative'' and conclude that ''no difference'' is observed between the treatments.

5 THE FALLACY OF THE ''NEGATIVE'' TRIAL
Not only authors, but also editors, and especially readers often conclude that the ''absence of proof of a difference'' between two treatments constitutes ''proof of an absence of a difference'' between them. This mistake was forcefully pointed out by Philip Alderson and Iain Chalmers (4), who wrote, ''It is never correct to claim that treatments have no effect or that there is no difference in the effects of treatments. It is impossible to prove . . . that two treatments have the same effect. Some uncertainty will always surround estimates of treatment effects, and a small difference can never be excluded.''

6 THE SOLUTION LIES IN EMPLOYING ONE-SIDED STATISTICS

A solution to both this incompatibility (between one-sided clinical reasoning and two-sided statistical testing) and confusion (about the clinical interpretation of statistically nonsignificant results) is now gaining widespread recognition and application. It has been described most succinctly by Charles Dunnett and Michael Gent (5), and others have also contributed to its development (although the latter sometimes refer to ''noninferiority'' as ''equivalence,'' a term whose common usage fails to distinguish one-sided from two-sided thinking) (6). An example may be useful here.

7 EXAMPLES OF EMPLOYING ONE-SIDED STATISTICS

Twenty years ago, a randomized trial was conducted in the hope of demonstrating that superficial temporal artery–middle cerebral artery anastomosis (''EC–IC bypass''), compared with no operation, prevented stroke
(7). To the disappointment of the investigators, a statistically significant superiority of surgery for preventing subsequent fatal and nonfatal stroke could not be shown. Therefore, it became important to overcome the ambiguity of this ''indeterminate'' result. Accordingly, the investigators then asked the one-sided question: What degree of surgical benefit could be ruled out? That one-sided analysis, which calculated the upper end of a 90% (rather than a 95%) confidence interval, excluded a surgical benefit as small as 3%. When news of this one-sided result was reported, the performance of this operation rapidly declined. Based on the contributions of statisticians like Charles Dunnett and Michael Gent, it is now possible to translate rational, one-sided clinical reasoning into sensible, one-sided statistical analysis. Moreover, this modern strategy of asking one-sided superiority and noninferiority questions in randomized clinical trials has become both widely accepted and increasingly common. The CONSORT statement on recommendations for reporting randomized clinical trials omits any requirement for two-sided significance testing (8). In addition, one-sided superiority and noninferiority trials can now be found in prominent medical journals such as the New England Journal of Medicine, The Lancet, JAMA, and regularly in the ACP Journal Club.

8 ONE-SIDED STATISTICAL ANALYSES NEED TO BE SPECIFIED AHEAD OF TIME

An essential prerequisite to doing one-sided testing is the specification of the exact superiority and noninferiority questions before the randomized clinical trial begins. As with unannounced subgroup analyses, readers can and should be suspicious of authors who apply one-sided analyses without previous planning and notice. Have they been slipped in only after a peek at the data revealed that conventional two-sided tests generated indeterminate results? This need for prior specification of one-sided analyses provides yet another argument for registering randomized trials in their design stages and for publishing their protocols in open-access journals such as Biomed Central (9).
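The EC-IC bypass analysis just described amounts to reporting the upper end of a one-sided confidence interval for the benefit of the new treatment. The sketch below shows the arithmetic with entirely hypothetical event counts (they are not the EC-IC bypass data) and a simple normal approximation.

```python
# One-sided upper confidence bound on the benefit of a new treatment:
# how much benefit can be ruled out after an indeterminate two-sided result?
from math import sqrt
from scipy.stats import norm

def upper_bound_benefit(events_ctrl, n_ctrl, events_new, n_new, one_sided_alpha=0.05):
    """Upper one-sided confidence bound for the absolute risk reduction
    (control event rate minus new-treatment event rate), normal approximation.
    The one-sided 95% bound equals the upper end of a two-sided 90% interval."""
    p_c, p_n = events_ctrl / n_ctrl, events_new / n_new
    arr = p_c - p_n                                   # point estimate of benefit
    se = sqrt(p_c * (1 - p_c) / n_ctrl + p_n * (1 - p_n) / n_new)
    return arr, arr + norm.ppf(1 - one_sided_alpha) * se

arr, upper = upper_bound_benefit(events_ctrl=90, n_ctrl=700, events_new=95, n_new=700)
print(f"observed absolute risk reduction: {arr:.3f}")
print(f"one-sided 95% upper bound on benefit: {upper:.3f}")
# If this upper bound falls below the smallest benefit patients would care about,
# the trial has ruled out an important benefit (a "true-negative" result).
```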
9 A GRAPHIC DEMONSTRATION OF SUPERIORITY AND NONINFERIORITY

To assist in understanding the concepts of superiority and noninferiority, they will now be described graphically. In doing so, another important element of superiority trials will be described, that is, how investigators should test a new treatment when an effective one already exists. As before, this explanation will employ a single outcome measure that incorporates both the benefits and the harms of the experimental treatment. The discussion will begin in terms of two-sided tests of statistical significance and double-pointed 95% confidence intervals, as shown in Fig. 1. This figure (and the two that follow it) use ''forest'' plots to illustrate different sorts of trial results and their interpretation. In each figure, the horizontal arrows present the two-sided 95% confidence intervals for the differences in outcomes between experimental and control groups. For purposes of this discussion, imagine that these horizontal arrows express a composite of both good and harm, such as a quality of life measure. When no difference is observed in outcomes between experimental and control groups, the confidence interval would be centered on the heavy vertical line at 0. When the average patient fares better on experimental
therapy than on control therapy, the confidence interval is centered to the left of 0. But when the average patient fares worse on experimental therapy than on control therapy, the confidence interval is centered to the right of 0.
10 HOW TO THINK ABOUT AND INCORPORATE MINIMALLY IMPORTANT DIFFERENCES

Notice that dashed vertical lines are on either side of 0. These lines show the points where patients (and their clinicians) believe that these differences in outcomes on experimental versus control treatment exceed some minimally important differences in benefit and harm. As a consequence, patients would want, and their clinicians should offer, treatments that meet or exceed a positive minimally important difference in good outcomes. Similarly, patients would refuse, and their clinicians would not even offer, treatments that meet or exceed a negative minimally important difference in bad outcomes. These minimally important differences appear in Fig. 1. On the left, observe the dotted line for minimally important benefits (MIB), and on the right, examine the dashed line for minimally important harm (MIH) to patients.
Figure 1. Randomized clinical trials in which the control Rx is a placebo or no treatment at all. Each horizontal arrow is the two-sided 95% confidence interval for the difference between experimental and control event rates, plotted against a scale running from the minimally important benefit line (MIB, left), through 0, to the minimally important harm line (MIH, right). A: Important benefit is ruled out (''true-negative'' RCT). B: Important benefit cannot be ruled out (''indeterminate'' RCT). C: Treatment provides benefit, but cannot tell whether benefit is important. D: Treatment provides important benefit to patients. To the left of the MIB line, experimental Rx provides important benefit to patients; to the right of the MIH line, experimental Rx causes important harm to patients.
Note that patients and clinicians usually place the ''minimally important harm'' line closer to 0 than they do the ''minimally important benefit'' line, which is consistent with most people's greater concern about avoiding harm. The locations of the minimally important benefit and harm lines ought to be based on what patients consider important benefit and harm. As it happens, patient considerations are rarely determined in any formal way, but they are informally incorporated by the investigators as they set these lines. Their location depends on the question posed by the trial. For a disease with no known cure (say, a rapidly fatal untreatable cancer), any benefit would be welcomed by patients and their clinicians, and the minimally important benefit line might be placed very close to the 0 line. For a trivial self-limited disease like the common cold, the minimally important benefit line might be placed far to the left of 0, but its minimally important harm line only slightly to the right of 0.

11 INCORPORATING CONFIDENCE INTERVALS FOR TREATMENT EFFECTS

When a treatment's confidence interval is centered to the left of the minimally important benefit line, we are encouraged. If its entire confidence interval lies to the left of the minimally important benefit line, then superiority is confirmed and we can be confident that experimental treatment provides important benefits to patients (the double-tipped confidence interval arrow labeled D describes this case). In a similar fashion, any confidence interval centered to the right of the minimally important harm line is discouraging, for noninferiority is rejected; if the entire confidence interval lies to the right of the minimally important harm line, then we can be confident that experimental treatment causes important harm to patients. As a starting point, Fig. 1 illustrates four of several results we might obtain in randomized clinical trials performed with a promising new drug on a disorder for which ''no effective treatment'' exists. (''No effective treatment'' means that no prior randomized clinical trial has found any treatment whose
confidence interval lies entirely to the left of the 0 line.) These initial trials typically employ placebos and seek a treatment that is superior to that placebo. Trial result A is highly informative. Its entire confidence interval lies to the right of the IB line, so it has ruled out any minimally important benefit to patients. By contrast, trial result B is not very helpful. Its confidence interval stretches all the way from being slightly (but not importantly) harmful at its right tip to being importantly beneficial at its left tip. 12 WHY WE SHOULD NEVER LABEL AN ‘‘INDETERMINATE’’ TRIAL RESULT AS ‘‘NEGATIVE’’ OR AS SHOWING ‘‘NO EFFECT’’ Trial results A and B nicely illustrate the confusion that develops from stating that a treatment ‘‘has no effect’’ or that a randomized clinical trial was ‘‘negative’’ just because its confidence interval crosses the 0 line. These ‘‘negative’’ and ‘‘no effect’’ labels imply that one should relegate the experimental treatment into the dustbin. That conclusion is true only for Treatment A, where any important patient benefit has been ruled out. However, this result is not the case for Treatment B. Although its confidence interval also crosses the 0 line, the confidence interval for Treatment B also crosses the important benefit line, so it might, in fact, produce important benefit to patients. One simply does not know whether Treatment B is worthwhile. A useful word that many trialists use to label such a trial result is ‘‘indeterminate.’’ Because of this confusion in labeling, some leading trialists have called for banning the terms ‘‘negative’’ and ‘‘no effect’’ from health care journals. A strong case can be made for reserving the term ‘‘true negative’’ for only those A-type randomized clinical trials that rule out any important treatment benefit for patients. For example, the previously noted trial of superficial temporal-middle cerebral artery anastomosis (‘‘EC-IC bypass’’) ruled out a surgical benefit as small as 3% in the relative risk of fatal and nonfatal stroke (7). One way to distinguish A-type results from merely indeterminate ones is to label the A-types ‘‘true negative’’ trials.
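To make the geometry of Sections 10–12 concrete, the short sketch below (not part of the original article) classifies a placebo-controlled trial result from its two-sided 95% confidence interval for the difference in event rates. The MIB and MIH thresholds and the example intervals are illustrative assumptions, not values taken from any trial discussed here; the convention follows Fig. 1, with benefit to the left (negative differences).

```python
# Minimal sketch of the decision logic in Sections 10-12 (illustrative thresholds).
def classify_trial(ci_lower, ci_upper, mib=-0.05, mih=0.02):
    """Classify a result from the two-sided 95% CI for the risk difference.

    mib: minimally important benefit line (negative); mih: minimally important harm
    line (positive). Both are hypothetical placeholders, not values from the text.
    """
    if ci_lower > mih:
        return "important harm: entire CI beyond the MIH line"
    if ci_upper < mib:
        return "D: important benefit (superiority demonstrated)"
    if ci_lower > mib:
        return "A: important benefit ruled out ('true-negative' RCT)"
    if ci_upper < 0:
        return "C: definite benefit, but importance to patients uncertain"
    return "B: indeterminate (important benefit cannot be ruled out)"

print(classify_trial(-0.09, -0.06))  # entire CI beyond the MIB line -> D
print(classify_trial(-0.07, -0.02))  # definite benefit, straddles MIB -> C
print(classify_trial(-0.03, 0.01))   # excludes MIB but crosses 0 -> A
print(classify_trial(-0.07, 0.01))   # crosses both MIB and 0 -> B
```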
13 HOW DOES A TREATMENT BECOME ‘‘ESTABLISHED EFFECTIVE THERAPY’’? Trial Result C shows a definite benefit (the right tip of the arrow lies to the left of the 0 line). The treatment in C lies entirely to the good side of 0, so it works better than nothing. However, because it straddles the MIB line, one cannot tell whether its benefit is great enough to be important to patients. Result D is what most trialists hope to demonstrate in most trials. The right arrow tip lies to the left of the MIB line, and one can be confident that Treatment D provides important benefits to patients. Treatment D, especially when it is confirmed in a meta-analysis of several independent trials, now deserves the title of ‘‘established effective therapy’’ and trialists would urge their clinical colleagues to offer it to all patients who could obtain and tolerate it (alas, economic, geographic, political, and cultural barriers deny established effective therapy to millions of patients on every continent). 14 MOST TRIALS ARE TOO SMALL TO DECLARE A TREATMENT ‘‘ESTABLISHED EFFECTIVE THERAPY’’ The sample sizes enrolled in most trials are only large enough to generate Result C (treatment is better than nothing). Although its point estimate of efficacy (the mid-point of its two-tipped arrow) lies to the left of the important benefit (IB) line, the right tip of its confidence interval extends to the right of that important benefit (IB) line. It requires many more patients for that same treatment to shrink its confidence interval enough to drag its right tip to the left of that IB line, which demonstrates superiority. This degree of shrinkage may not be achieved until several trials of this treatment have been combined in a meta-analysis. 15 HOW DO WE ACHIEVE A SUPERIORITY RESULT? Most commonly, a superiority (D-type) result, with the entire confidence interval beyond the minimally important benefit line, is achieved
only in meta-analyses of several comparable trials. However, a D-type result can be achieved in other ways. First, superiority may be established in a single trial when the experimental treatment turned out to be so much better than expected that the number of enrolled patients was far greater than necessary for generating Result C. A second exception occurs in a trial that is very short-term, so that virtually all its patients had been admitted and treated before any interim analysis could have detected its favorable result. However, a superiority result also can occur when the outcomes are long term and are ascertained only some time after total recruitment has been completed. Finally, a D-type result can occur when no interim analysis of larger-than-expected treatment effect is conducted. 16 SUPERIORITY AND NONINFERIORITY TRIALS WHEN ESTABLISHED EFFECTIVE THERAPY ALREADY EXISTS To this point, the trials under consideration have been placebo-controlled trials. The next important question asks what should be done when an established effective therapy already exists, such as D. When a new, promising (but untested) treatment appears, we must determine whether it is ‘‘superior’’ to D (and ought to replace it as established effective therapy), or whether it is just as good as (‘‘noninferior’’ to) the current established effective therapy but safer or less expensive. A substantial proportion of trialists hold that one should not test such promising new treatments against a placebo when established effective therapy already exists. Their reasons for this strong stand are both ethical and clinical. The ethical reason is that it is simply wrong to replace established effective therapy (something we know does work among patients who can and will take it) with a placebo (something we know does not work) (10). They argue that it is already difficult to justify substituting a promising new treatment for established effective therapy in a superiority trial, and that it is surely unethical to assign half the study patients to a placebo known not to work.
The clinical reason for this position is that, when an established effective therapy already exists, it is clinically nonsensical to test the next promising treatment against a placebo. Clinicians do not care whether the promising new treatment is better than a placebo, because that result does not indicate whether they should offer it, rather than the established effective therapy, to future patients. As this argument goes, clinicians want to know the answer to one of two questions. First, is the promising new treatment better than the established effective therapy (a ‘‘superiority’’ question)? Second, if the promising new treatment is not better, then is the promising new treatment ‘‘as good as’’ (‘‘noninferior’’ to) established effective therapy but preferable on other grounds (such as safety or cost)? The appropriate design for answering these questions depends on whether the new treatment, if effective, would replace established effective therapy or simply be added to it. 17 EXCEPTIONS TO THE RULE THAT IT IS ALWAYS UNETHICAL TO SUBSTITUTE PLACEBOS FOR ESTABLISHED EFFECTIVE THERAPY The first exception occurs when patients cannot or will not take established effective therapy. In this case, it can be argued that no established effective therapy exists for them, and that a placebo-controlled trial is both clinically sensible and ethical. A similar situation exists when patients cannot obtain the established effective therapy because of geography, politics, local tradition, or economics. The second exception occurs when a promising new treatment, if found superior to established effective therapy, would replace it. In this situation, it can be argued that the only comparison that makes clinical (and ethical) sense is a ‘‘head-to-head’’ comparison of the established effective therapy and the promising new therapy. If the promising but untested treatment would replace established effective therapy, then the head-to-head comparison would be similar to the ‘‘placebo-controlled’’ trials previously described, but with established effective therapy in place of the placebo.
18 WHEN A PROMISING NEW TREATMENT MIGHT BE ADDED TO ESTABLISHED EFFECTIVE THERAPY When the promising but untested treatment might become a beneficial addition to established effective therapy, these clinical and ethical lines of reasoning dictate that the appropriate randomized clinical trial compares the combination of established effective therapy plus the new treatment versus established effective therapy alone. For example, Tugwell et al. (11) tested a promising new therapy, cyclosporine, by adding it or a placebo to methotrexate, which is the established effective therapy for rheumatoid arthritis. Trialists often call this sort of trial an ‘‘add-on’’ randomized clinical trial. 19 USING PLACEBOS IN A TRIAL SHOULD NOT MEAN THE ABSENCE OF TREATMENT Readers should not equate the use of placebos with the absence of any treatment. Placebos might be used in both sorts of randomized clinical trials that test the promising new treatment. If we label the established effective therapy ‘‘EE’’ and the promising untested treatment ‘‘PU,’’ in a head-to-head trial of EE versus PU (and unless the two drugs are identical in appearance), the treatment groups would be: [active EE + placebo PU] versus [placebo EE + active PU]. Similarly, the ‘‘add-on’’ randomized clinical trial to test PU would have treatment groups of: [active EE + active PU] versus [active EE + placebo PU].
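The arm compositions just described can also be written out as simple data structures. This is only a restatement of the text in Section 19 (EE, PU, and the double-dummy masking are as defined above), not an actual trial protocol.

```python
# Treatment-arm bookkeeping for the two designs in Section 19 (illustrative only).
head_to_head = [
    {"EE": "active",  "PU": "placebo"},   # EE arm, masked with a dummy PU
    {"EE": "placebo", "PU": "active"},    # PU arm, masked with a dummy EE
]
add_on = [
    {"EE": "active", "PU": "active"},     # combination arm
    {"EE": "active", "PU": "placebo"},    # EE alone, masked with a dummy PU
]
```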
20 DEMONSTRATING TRIALS OF PROMISING NEW TREATMENTS AGAINST (OR IN ADDITION TO) ESTABLISHED EFFECTIVE THERAPY This concept is illustrated in Fig. 2. The first thing to notice in Fig. 2 is that the ‘‘goalpost has been moved’’ to the left.
Figure 2. Randomized clinical trials in which the control Rx is established effective therapy (conventional two-sided 95% confidence interval). (Because the comparator is EET, the 0 line is shifted to the 0* line, with superiority line S on the left and inferiority line I on the right. E: new Rx importantly inferior to EET (trial conclusion ‘‘inferiority’’). F: new Rx may or may not be importantly inferior to EET (trial conclusion ‘‘indeterminate’’). G: new Rx displays ‘‘noninferiority’’ but may or may not be importantly superior to EET (trial conclusion ‘‘noninferiority’’). H: new Rx importantly superior to EET (trial conclusion ‘‘superiority’’). ‘‘Equivalence’’: entire interval between S and I.)
The vertical 0 line that was used when there was no established effective therapy has been replaced with the 0* line that sets the new standard by which benefit and harm are to be judged. These ‘‘shifting goal posts’’ reinforce the argument that, when a promising but untested treatment joins a clinical arena in which an established effective therapy already exists, the sensible questions require testing the promising treatment against (or in addition to) established effective therapy, not against placebo. The second thing to notice is that the MIB line has been replaced by the superiority line S, which identifies the boundary, to the left of which the new treatment is minimally importantly superior to established effective therapy. In similar fashion, the MIH line has been replaced by the inferiority line I, which identifies the boundary, to the right of which the new treatment is minimally importantly inferior to established effective therapy. As before, the locations of these superiority and inferiority lines should be based on what is important to patients. 21 WHY WE ALMOST NEVER FIND, AND RARELY SEEK, TRUE ‘‘EQUIVALENCE’’ Here the terminology becomes convoluted. Although it might be desirable to show that
the efficacy of the promising new treatment was ‘‘equivalent’’ to established effective therapy (but safer or cheaper), the number of patients required for that proof is huge; in fact, the determination of identical efficacy (a confidence interval of zero width) requires an infinite sample size. Even the ‘‘equivalence’’ result shown at the top of Fig. 2, where the entire confidence interval lies between the superiority and inferiority lines, requires massive numbers of patients (for example, to halve the width of a 95% confidence interval requires a quadrupling of the sample size). To add to this confusion, some authors use ‘‘equivalence’’ to denote ‘‘noninferiority.’’ 22 THE GRAPHICAL DEMONSTRATION OF ‘‘SUPERIORITY’’ AND ‘‘NONINFERIORITY’’ As previously stated, head-to-head trials of promising new versus established effective therapy usually ask two questions. Figure 2 illustrates these superiority and noninferiority questions. Trial result E shows that the new treatment is importantly inferior to established effective therapy (and should be abandoned). At the other extreme, result H shows that the new treatment is importantly superior to established effective therapy (and, if it is safe and affordable, should replace it).
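Section 21's parenthetical rule of thumb, that halving a 95% confidence interval requires quadrupling the sample size, follows from the 1/√n behavior of the standard error of a difference in event rates. The sketch below (with invented event rates, not from any trial in this entry) checks it numerically.

```python
from math import sqrt

# Rule-of-thumb check: SE of a risk difference scales as 1/sqrt(n), so a 4x
# larger sample halves the 95% CI width. Event rates 10% vs 14% are invented.
def ci_halfwidth(p1, p2, n_per_arm, z=1.96):
    se = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    return z * se

for n in (250, 1000, 4000):
    print(n, round(ci_halfwidth(0.10, 0.14, n), 4))
# 250 -> ~0.0569, 1000 -> ~0.0284, 4000 -> ~0.0142: each quadrupling halves the width.
```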
Results for treatments F and G are problematic. Result F is indeterminate for noninferiority because it crosses the inferiority line I, and the new treatment might or might not be inferior to established effective therapy. Similarly, result G is indeterminate for superiority because it crosses the superiority line S, and the new treatment might or might not be superior. However, an examination of their 95% confidence intervals reveals that the upper (left) bound of the confidence interval for treatment F lies to the right of the superiority line S. Thus, one can conclude confidently that treatment F is not superior to established effective therapy. Similarly, the lower (right) bound of the confidence interval for G lies to the left of the inferiority line I. Thus, treatment G is not inferior to established effective therapy. 23 COMPLETING THE CIRCLE: CONVERTING ONE-SIDED CLINICAL THINKING INTO ONE-SIDED STATISTICAL ANALYSIS The contribution of statisticians such as Charles Dunnett and Michael Gent provides a way to avoid the indeterminate noninferiority conclusion about treatment F and the indeterminate superiority conclusion about G without having to greatly increase the numbers of patients in these trials. The solution is illustrated in Fig. 3, in which some key results from the previous two-tailed figure are presented in their one-tailed forms. As stated in the opening sections of this entry, Charles Dunnett and Michael Gent have argued that the first question usually posed in a randomized clinical trial (Is treatment X superior to placebo or established effective therapy?) is a ‘‘one-sided’’ (not a ‘‘two-sided’’) question. That is, it asks only about superiority. Neither investigators nor clinicians would care whether a ‘‘no’’ answer was caused by equivalence or inferiority, because both would lead them to abandon the new treatment. Translated into graphic terms, this means that only the right-hand tip of the confidence interval arrow for treatment G is relevant, and it should extend rightward only to its
5% boundary not to its 2.5% boundary. Now, with the less stringent one-sided 95% confidence interval, indeterminate study result G in the previous figure becomes the definitive superiority result G’ in this one. Following from the proposition that both superiority and a noninferiority question can be posed in the design phase of a trial, the answers to both questions can be sought in the analysis. Thus, even if the one-sided ‘‘superiority’’ question is answered ‘‘no,’’ then investigators can still go on to ask the other one-sided ‘‘noninferiority’’ question: Is the new treatment not importantly inferior to established effective therapy? As noted earlier, this question would be highly relevant if the promising new treatment might be safer, or easier to take, or produce fewer or milder side effects, or was less expensive. In graphic terms, this result means that only the right-hand tip of the confidence interval arrow for treatment F is relevant, and it should extend rightward only to its 5% boundary not to its 2.5% boundary. Now, with the less stringent one-sided 95% confidence interval, indeterminate study result F in the previous figure becomes the definitive noninferiority result F’ in this one. 24 A FINAL NOTE ON SUPERIORITY AND NONINFERIORITY TRIALS OF ‘‘ME-TOO’’ DRUGS A promising new treatment might be a ‘‘metoo’’ drug from the same class as established effective therapy, developed by a competing drug company to gain a share of the market already created by the established effective therapy. In this case, its proposers are not interested in whether their new drug is superior to the standard drug, but only whether it is not inferior to it. As before, this question is best answered in a ‘‘head-to-head’’ randomized clinical trial. This strategy often is viewed with suspicion by some (especially some licensing authorities). They fear that the noninferiority trial may be a tool used to promote treatments of little value. This result would be accomplished by ‘‘shifting the goal posts’’ of either or both the superiority line S and the inferiority line I to the right. Two safeguards are set to protect against this abuse.
Figure 3. Randomized clinical trials in which the control Rx is established effective therapy (using one-sided 95% confidence intervals). (As in Fig. 2, the 0 line is shifted because the comparator is EET, with superiority line S and inferiority line I. F': new Rx is not importantly inferior to EET (‘‘noninferiority’’ conclusion). G': new Rx importantly superior to EET (‘‘superiority’’ conclusion).)
First, patients’ values should determine the location of the superiority and inferiority lines. Second, through the registration of trials and protocols, the precise noninferiority questions can be made public before the trial begins.
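The one-sided argument of Section 23 can be illustrated numerically: for the noninferiority question only the upper (harm-side) bound matters, and a one-sided 95% bound uses z = 1.645 rather than 1.96. The estimate, standard error, and inferiority margin below are invented for illustration.

```python
# Two-sided vs one-sided 95% upper bounds for a risk difference (new Rx minus EET).
diff, se = -0.02, 0.012          # hypothetical estimated difference and standard error
inferiority_line = 0.002         # hypothetical margin I (harm direction is positive)

two_sided_upper = diff + 1.960 * se   # upper tip of the conventional two-sided 95% CI
one_sided_upper = diff + 1.645 * se   # upper tip of a one-sided 95% CI

print(round(two_sided_upper, 4), two_sided_upper < inferiority_line)   # 0.0035 False
print(round(one_sided_upper, 4), one_sided_upper < inferiority_line)   # -0.0003 True
# The two-sided interval crosses the inferiority line (indeterminate, like F),
# while the one-sided interval stays to its left (noninferiority, like F').
```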
REFERENCES 1. D. L. Sackett, Superiority trials, noninferiority trials, and prisoners of the 2-sided null hypothesis (Editorial). ACP J. Club. 2004; 140:A- 11–12. 2. C. E. Counsell, M. J. Clarke, J. Slattery, and P. A. Sandercock, The miracle of DICE therapy for acute stroke: fact or fictional product of subgroup analysis? BMJ 1994; 309: 1677–1681. 3. M. Clarke and J. Halsey, DICE 2: a further investigation of the effects of chance in life, death and subgroup analyses. Int. J. Clin. Pract. 2001; 55: 240–242. 4. P. Alderson and I. Chalmers, Survey of claims of no effect in abstracts of Cochrane reviews. BMJ 2003; 326: 475. 5. C. W. Dunnett and M. Gent, An alternative to the use of two-sided tests in clinical trials. Stat. Med. 1996; 15: 1729–1738. 6. J. H. Ware and E. M. Antman, Equivalence trials. N. Engl. J. Med. 1997; 337: 1159–1161.
7. The EC/IC Bypass Study Group, Failure of extracranial-intracranial arterial bypass to reduce the risk of ischemic stroke. Results of an International randomized trial. N. Engl. J. Med. 1985; 313: 1191–1200. 8. http://www.consort-statement.org/Statement/ revisedstatement.htm#app. 9. http://www.biomedcentral.com/. 10. K. B. Michels and K. J. Rothman, Update on unethical use of placebos in randomised trials. Bioethics 2003; 17: 188–204. 11. P. Tugwell, T. Pincus, D. Yocum, M. Stein, O. Gluck, G. Kraag, R. McKendry, J. Tesser, P. Baker, and G. Wells, Combination therapy with cyclosporine and methotrexate in severe rheumatoid arthritis. The MethotrexateCyclosporine Combination Study Group. N. Engl. J. Med. 1995; 333: 137–141.
FURTHER READING R.B. Haynes, D.L. Sackett, G.H. Guyatt, and P. Tugwell, Clinical Epidemiology: How to Do Clinical Practice Research, 4th ed. Philadelphia, PA: Lippincott Williams & Wilkins, 2006.
CROSS-REFERENCES Non-Inferiority trial Equivalence trial Clinical significance
Surrogate Endpoints The selection of the primary “outcome measures” or “endpoints” is a very important step in the design of clinical trials (see Outcome Measures in Clinical Trials). Typically, the primary goal of the clinical trial is to assess definitively a treatment’s effect on these endpoints. Two major criteria should guide their selection. The endpoints should (i) be sensitive to treatment effects and (ii) be clinically relevant. Adequate attention is usually given to ensuring that the first criterion is satisfied. Unfortunately, ensuring that the endpoints also satisfy the criterion of clinical relevance is often improperly addressed. We focus on this second criterion and the corresponding controversial issues arising when surrogate endpoints are used as study outcomes. The nature of clinical relevance depends on the stage of clinical experimentation. In Phase II trials, which provide a screening evaluation of treatment effect, the primary objective usually is to assess a treatment’s biological activity. Relevant endpoints in such a trial in cancer patients might be measures of tumor shrinkage; in HIV-infected persons, measures of viral load or immune function; and in patients with cardiovascular disease, blood pressure or lipid levels. In contrast, in Phase III clinical trials, where the intent is to define the role of a therapy in standard clinical practice, the primary objective should be to assess the treatment’s clinical efficacy through outcome measures that unequivocally reflect tangible benefit to the patient. In the treatment of patients with life-threatening diseases, such clinical efficacy measures include improvement in the duration of survival or in the quality of life (QOL). Often, there is a sense of urgency in the evaluation of promising new interventions for patients having life-threatening diseases. When survival is the primary endpoint, clinical trials frequently require large sample sizes and very lengthy intervals of follow-up. The subjective nature of QOL outcome measures presents additional difficulties through the need to identify validated and widely accepted QOL instruments. To reduce the trial cost, size, and duration and to avoid complexities of QOL assessments, considerable attention has been given, in the design of definitive Phase III trials, to identifying surrogate or replacement endpoints for the true clinical efficacy endpoint. As defined by Temple [21],
a surrogate endpoint of a clinical trial is a laboratory measurement or a physical sign used as a substitute for a clinically meaningful endpoint that measures directly how a patient feels, functions or survives. Changes induced by a therapy on a surrogate endpoint are expected to reflect changes in a clinically meaningful endpoint.
Measures of biological activity have been chosen frequently as surrogates because usually they are readily available early in a clinical trial and because often they are strongly correlated with clinical efficacy. Unfortunately, treatment effects on the clinical efficacy endpoints may not be predicted reliably by the observed effects on surrogate endpoints, even when natural history data reveal that these surrogates are strongly correlated with the clinical efficacy outcomes. As indicated by Fleming & DeMets [9], there are several possible explanations for this failure. Even though a surrogate endpoint may be a correlate of disease progression, it might not involve the same pathophysiologic process that results in the clinical outcome. Even when it does, it is likely there are disease pathways causally related to the clinical outcome and yet unrelated to the surrogate endpoint. Of the disease pathways affecting the true clinical outcome, the intervention may only affect (i) the pathway mediated through the surrogate endpoint or (ii) the pathway(s) independent of the surrogate endpoint. Most importantly, the intervention might also affect the true clinical outcome by unintended mechanisms of action independent of the disease process. The intervention’s effects mediated through intended mechanisms could be substantially offset by an array of mechanisms that are unintended, unanticipated and unrecognized.
The example of lipid-lowering agents clearly illustrates the existence and impact of these unintended mechanisms. In a comprehensive overview of 50 randomized trials (see Meta-analysis of Clinical Trials) of cholesterol-lowering agents by Gordon [12], an average reduction in cholesterol of 10% was achieved along with an intended 9% reduction in coronary heart disease (CHD) mortality. However, overall mortality was unchanged, due to an unintended 24% increase in non-CHD mortality.
Illustrations Research across a broad array of clinical settings confirms that many powerful correlates of clinical
efficacy outcomes have been poor surrogates for true clinical efficacy [8, 9]. Anti-arrhythmic drugs effectively suppress ventricular arrhythmias after myocardial infarction (MI), yet lead to more than three-fold increases in death rate. Drugs improving cardiac output as treatment for congestive heart failure have increased mortality. The rate of vessel reperfusion has not predicted adequately the effect of thrombolytic therapies on mortality. Cholesterol-lowering interventions, such as diet, fibrates, hormones, resins, and lovastatin, have not lowered mortality rates. Antihypertensive calcium channel blockers reduce blood pressure, but now appear to increase the risk of myocardial infarction. Calcium antagonists reduce the risk of developing new angiographic lesions of atherosclerosis in patients with MI, yet increase the death rate. Sodium fluoride increases bone mineral density in postmenopausal women with osteoporosis, yet substantially increases the risk of bone fractures. In patients with retinitis pigmentosa, vitamin A provides a favorable slowing of decline on electroretinograms, yet has no effect on any direct measure of visual function. In addition to these false positive leads, surrogates can provide false negative leads as well. Gamma interferon fails to have a measurable effect on superoxide production and bacterial killing in children with chronic granulomatous disease, yet substantially reduces the rate of serious life-threatening infection.
Validation of Surrogates Proper validation of a surrogate endpoint is a difficult task. Insights about validity can be provided by empiric evidence from an array of clinical trials documenting treatment effects on both surrogate and clinical efficacy endpoints, as well as by a thorough biological understanding of causal pathways in the disease process and of mechanisms of treatment effect (see Causation). Prentice [19] provides a definition of a valid surrogate, and gives two sufficient conditions that jointly ensure this validity, thereby providing guidance for how one might approach using empiric evidence to assess validation. By his definition, a surrogate is valid if “a test of the null hypothesis of no relationship (of the surrogate endpoint) to the treatment groups must also be a valid test of the corresponding null hypotheses based on the true endpoint”. Prentice’s first condition to ensure this validity is the
“correlate” requirement, i.e. a valid surrogate endpoint must be correlated with the true clinical endpoint. This condition usually holds since, in practice, potential surrogates are often selected by identifying measures that are strongly correlated with clinical efficacy endpoints. Prentice’s very restrictive second condition requires the surrogate to capture fully the treatment’s “net effect” on the clinical endpoint, where the net effect is the aggregate effect accounting for all mechanisms of action. The restrictiveness of this condition provides important insight into why correlates are rarely valid surrogates. In applications, extensive analyses have been performed to assess surrogacy of CD4 cell count, using data from several large clinical trials evaluating nucleoside analogs in HIV/AIDS patients. While these analyses show consistently that CD4 cell count is a correlate of the “progression to symptomatic AIDS or death” endpoint, thereby satisfying Prentice’s first condition, CD4 has not been established as a valid surrogate endpoint, since the second condition of Prentice consistently fails to hold [3, 6, 14, 15, 22]. The validity of Prentice’s restrictive second condition, requiring a surrogate to capture fully the net effect of an intervention on the clinical efficacy endpoint, has been explored by Freedman et al. [11] in an epidemiologic setting. Their methods involve estimating the proportion of the net treatment effect apparently captured by the marker, allowing assessment of the strength of evidence about whether this proportion is near unity. It should be recognized, however, that while particular interest often is in evaluating the effect of treatment on the disease process pathway(s) causally inducing the clinical events, it is not possible to determine the proportion, p, of that effect that is accounted for by effects on a surrogate endpoint [20]. To demonstrate how this nonidentifiability arises, DeGruttola et al. [7] consider a simple example in which the clinical efficacy endpoint is death and where, on the control regimen, the death rate induced by the causal pathways of the disease process is µh , while the death rate due to other causes is µo . They suppose further that the experimental intervention alters the death rate induced by the causal pathways of the disease process by the multiplicative factor r, to rµh , but increases the death rate due to other causes (including those influenced by unintended mechanisms of the drug) by the multiplicative factor k, to kµo . If it is assumed that
the treatment-induced change in the surrogate endpoint would only influence the death rate induced by the causal pathways of the disease process, then this change would alter the overall death rate by a multiplicative factor

rso = [(µh + µo) − p(1 − r)µh] / (µh + µo).    (1)
This parameter rso can be measured using data on control patients from the study itself or natural history databases that allow modeling the association of death with surrogate endpoints. One can also measure the observed overall “net effect” of the intervention on death rate,

ro = (rµh + kµo) / (µh + µo),    (2)
and, using (1) and (2), can compute the observed portion of the net effect accounted for by the treatment-induced change in the surrogate endpoint,

po = (1 − rso) / (1 − ro) = p(1 − r)µh / [(1 − r)µh + (1 − k)µo].    (3)
By (3), po = p if either µo = 0 (i.e. death can only be caused by the causal pathways of the disease process) or k = 1 (i.e. the intervention has no effect on the other causes of death). However, in the more common setting where µo > 0 and k > 1, even when p ≪ 1, the observed proportion po approaches unity as k approaches 1 + (µh/µo)(1 − r)(1 − p). Thus, surrogate endpoints, which capture only a small fraction of the change in the death rate induced by treatment effects on the causal pathways of the disease process, may appear to capture an observed portion, po, near unity, simply due to unanticipated and unrecognized harmful effects of the intervention on the other causes of death. To formulate estimators of po in epidemiologic data, Freedman et al. [11] used linear logistic regression models, while Choi et al. [3], O’Brien et al. [17] and DeGruttola et al. [7] used proportional hazards models to conduct similar analyses in the setting of censored failure time data. Specifically, these three sets of authors assume that the failure rate at time t in treatment group Z is

λ(t|Z) = λo(t) exp(βZ),    (4)

where Z = 0 for control and Z = 1 for experimental treatment, β is an unknown constant, and λo(t) is an arbitrary positive function. From (2) and (4), the “net treatment effect” is ro = e^β. In turn, incorporating the effect of the surrogate X(t) on the failure rate at time t, DeGruttola et al. [7] assume the model

λ[t|Z, X(t)] = λ̃o(t) exp(βa Z) exp[αX(t)],    (5)
where βa and α are unknown constants, and λ̃o is an arbitrary positive function that might differ from λo. Strictly speaking, models (4) and (5) cannot hold simultaneously; however, they may hold approximately when either α or λ̃o(t) is small. We will assume that the effects of model misspecification are negligible (see Lin et al. [16] for a rigorous discussion of this issue). By (4) and (5), rso = exp(β − βa). Thus,

po = [1 − exp(β − βa)] / (1 − e^β).

Freedman et al. [11] approximate po by

po∗ = 1 − βa/β.
The two quantities, po and po∗, are equivalent when βa = β or βa = 0, and differ only slightly for intermediate values. Of course, as shown above, the quantities are equal to p only in very special cases. Note that while p, the proportion of the intended effect on causal pathways of the disease process captured by the surrogate, is always a proportion in the mathematical sense of lying in the interval [0, 1], po need not lie in the interval [0, 1]. When βa and β differ in sign (a situation that arises when the surrogate captures all of the benefit so that only the harmful effect is reflected in βa), po exceeds 1; when βa > β > 0, po is negative. The problems of interpretation of po and po∗ are compounded by the high variability of their estimators [11]. Let β̂ and β̂a denote the estimates of β and βa, obtained by the usual method of maximum partial likelihood. Then po∗ is estimated by

p̂o∗ = 1 − β̂a/β̂.
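A small numerical illustration (with invented rates, not from the cited trials) of equations (1)–(3) shows how the observed proportion po can reach 1 even though the surrogate captures only p = 0.2 of the treatment effect on the causal pathway, once the intervention also increases other-cause mortality (k > 1).

```python
# Numerical illustration (invented rates) of equations (1)-(3).
def observed_proportion(mu_h, mu_o, r, k, p):
    r_so = ((mu_h + mu_o) - p * (1 - r) * mu_h) / (mu_h + mu_o)   # equation (1)
    r_o = (r * mu_h + k * mu_o) / (mu_h + mu_o)                   # equation (2)
    return (1 - r_so) / (1 - r_o)                                 # equation (3)

for k in (1.0, 1.8):
    po = observed_proportion(mu_h=0.10, mu_o=0.05, r=0.5, k=k, p=0.2)
    print(k, round(po, 2))
# k = 1.0 (no off-pathway harm): po = 0.2 = p
# k = 1.8 (unintended rise in other-cause deaths): po = 1.0 although p is still 0.2
```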
Lin et al. [16] showed that, for large samples, p̂o∗ is approximately normal with mean po∗ and with variance

σ² = (Vβ/β²) [ Vβa/Vβ + (1 − po∗)² − 2(1 − po∗) Vββa/Vβ ],    (6)

where Vβ and Vβa are the variances of β̂ and β̂a, and Vββa is their covariance. Formula (6) indicates that the factors that determine the variance of p̂o∗ include the coefficient of variation for β (i.e. the inverse of the unadjusted treatment effect relative to its standard error), the value of po∗ itself, and the values of Vβa and Vββa relative to Vβ. For illustration, suppose α is small and the correlation between treatment and marker is low. Then Vβ ≈ Vβa ≈ Vββa, in which case

σ ≈ |po∗| se(β̂)/|β|.    (7)

Suppose that we have a large unadjusted treatment effect which is four times its standard error, i.e. β̂/se(β̂) = 4. Then (7) implies that the mean width of the 95% confidence interval for po∗ is equal to po∗ itself. In practice, (7) tends to underestimate the true variability of p̂o∗ because Vβa is generally larger than Vβ. The estimate β̂a becomes increasingly unstable as the correlation between treatment and marker increases. (An extreme scenario occurs when, in a placebo-controlled trial of a treatment, all treated and no untreated patients have a marker response.) Thus, an unadjusted treatment effect that is greater than four times its standard error is a necessary, though insufficient, condition for precise estimation of po∗. Similar observations are made by Freedman et al. [11]. Clinical studies with treatment effects that are many times their standard errors are unusual, because studies with large treatment effects tend to be stopped early, and because most studies do not compare treatments with greatly different degrees of efficacy. Thus, meta-analyses that combine evidence across studies usually would be required for statistical evaluation of the reliability of surrogate endpoints. There are a number of ways to make use of data collected across studies. The first is simply to estimate po∗ (and its associated variance) corresponding to a given surrogate for each individual study, and examine the consistency of these estimates. This may
be especially useful when there have been a variety of treatments under study, with differing mechanisms of action and toxicity profiles. More formally, one could treat the true values of the po∗ for each study as latent variables, and estimate their underlying distribution (or features of the distribution) across studies. In settings where po∗ appears to be highly variable across studies, it might be of interest to assess whether such factors as class of drug or population under study explain this variability. While such efforts are not free from the problems of identifiability described earlier, values of po∗ that are consistently near 1 for studies investigating different classes of treatments may provide more persuasive evidence about the validity of a surrogate than do results from individual studies. An alternative approach to using data across studies, proposed by Daniels & Hughes [5], uses Bayesian methods to construct prediction intervals for the true difference in clinical outcome associated with a given estimated treatment effect on the potential surrogate. A factor that further complicates analyses of surrogacy, especially analyses across studies, is that marker values are generally not measured continuously or without error. Measurement error and the fact that marker values are available only at certain times – times that are often influenced by the disease under study – can result in bias in the estimation of α, and hence of β and po∗ . Tsiatis et al. [22] explored methods for correcting for bias resulting from issues related to measurement.
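As a check on formulas (6) and (7), the sketch below computes p̂o∗ and its large-sample standard error from plugged-in estimates (all invented) with Vβ ≈ Vβa ≈ Vββa and an unadjusted effect four times its standard error, reproducing the point above that the 95% confidence interval is then about as wide as p̂o∗ itself.

```python
import math

# Sketch of formulas (6)-(7) for the SE of p̂o* = 1 - β̂a/β̂. Inputs are invented.
def po_star_and_se(beta, beta_a, v_beta, v_beta_a, v_cov):
    po_star = 1.0 - beta_a / beta
    var = (v_beta / beta ** 2) * (
        v_beta_a / v_beta
        + (1.0 - po_star) ** 2
        - 2.0 * (1.0 - po_star) * v_cov / v_beta
    )  # formula (6)
    return po_star, math.sqrt(var)

beta, se_beta = -0.40, 0.10   # unadjusted log-hazard ratio, four times its SE
po_star, se = po_star_and_se(beta, beta_a=-0.16,
                             v_beta=se_beta ** 2,
                             v_beta_a=se_beta ** 2,
                             v_cov=se_beta ** 2)
print(round(po_star, 2), round(se, 3), round(2 * 1.96 * se, 2))
# -> 0.6 0.15 0.59: with Vβ ≈ Vβa ≈ Vββa, (7) gives se ≈ |po*|·se(β̂)/|β̂| = po*/4,
#    so the 95% CI width is roughly equal to po* itself, as stated in the text.
```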
Auxiliary Variables Rather than serving as surrogates to replace clinical efficacy endpoints, response variables, such as the measures of biological activity discussed earlier, can be used to strengthen clinical efficacy analyses. Such variables, S, are then called auxiliary. Suppose one’s interest is in the effect of treatment on time to a clinical endpoint, T . Suppose, furthermore, that the auxiliary information, S, is readily observed, whereas T is censored in a substantial fraction of the patients because they have relatively late clinical endpoints. If S and T are strongly correlated, one can expect that S will provide useful additional information about the timing of the clinical endpoint for those patients in which T is censored. Three approaches have been proposed for using auxiliary variables, and are referred to as “variance
reduction”, “augmented score” and “estimated likelihood”. The variance reduction method, explored by Kosorok & Fleming [13], is applicable when S is a time-to-event endpoint and when the treatment relationship with S is described by a statistic X with zero mean, such that cor(X, Y) ≡ ρ is positive, where Y is a standard statistic used to assess the effect of treatment on T. The statistic Y − ρX proposed by Kosorok & Fleming makes use of auxiliary information to provide a variance-reduced alternative to using Y. The “augmented score” and “estimated likelihood” methods were explored by Fleming et al. [10]. Both approaches assume the proportional hazards model of (4) for the relationship between the covariate vector Z and the hazard function for the clinical outcome T. Denote the cumulative hazard for λo by Λ0. Assume Ti and Ui are independent latent failure and censoring variables for the ith patient (i = 1, . . . , n), and denote Xi = min{Ti, Ui} and δi = I{Xi=Ti}, where I{A} denotes an indicator for A. To motivate the “augmented score” approach, recall that in the semiparametric regression setting where λo is unspecified, the Cox [4] maximum partial likelihood estimate of β is obtained by solving the score estimating equation:

Σ_{i=1}^{n} Zi M̂i(Xi | β) = 0,    (8)
where, for any t ≥ 0, Mˆ i (t|β) = I{Ti ≤t} − exp(β Zi )Λˆ 0 (t ∧ Ti ) is the martingale residual (see Counting Process Methods in Survival Analysis) evaluated at β, and where Λˆ 0 is the semiparametric Breslow [1] estimator of Λ0 evaluated at β. Censorship reduces the information available in (8) that is used for the estimation of β. Specifically, Mˆ i (t|β) is only known over t ∈ [0, Xi ] rather than over t ∈ [0, Ti ] and, in (8), less information is available to formulate Λˆ 0 . Fortunately, the surrogate information, Si , does allow recovery of some of this lost information. Suppose τ denotes some arbitrary large time. To recover some information over (Xi , τ ] for a censored case (i.e. with δi = 0), consider eMˆ i (β) ≡ E[Mˆ i (τ |β) − Mˆ i (Xi |β)|Xi , δi = 0, Si ], which essentially is the conditional expectation of the lost information over (Xi , τ ], given available
information on case i to Xi. Fleming et al. [10] formulate an estimator êM̂i(β) in the special case in which Si is a censored time-to-event endpoint, and propose estimation of β based on solving the “augmented score equation”:

Σ_{i=1}^{n} Zi M̂i(Xi | β) + Σ_{i=1}^{n} (1 − δi) I{Xi<τ} Zi êM̂i(β) = 0.
In the “estimated likelihood” approach, following Pepe’s [18] semiparametric approach in which λo temporarily is assumed to be known and that involves nonparametric estimation of P(S|T, Z) to obtain greater robustness, the corresponding estimated likelihood is

L̂(β) = ∏_{δi=1} Pβ(Ti | Zi) × ∏_{δi=0} Pβ(T > Xi | Zi) × ∏_{δi=0} P̂β(Si | T > Xi, Zi),    (9)
where Si can be an arbitrary right-censored vectorvalued process providing auxiliary information. The first two terms on the right-hand side of (9) represent the usual likelihood when the auxiliary information, S, is not taken into account. Under (4), these two terms reduce to the usual Cox partial likelihood when λo is considered to be unspecified and, in turn, is estimated by the piecewise linear approach presented in Breslow [2]. Turning to the third term in the estimated likelihood in (9), the amount of improvement provided by the estimated likelihood relative to the usual partial likelihood depends on the degree of dependence of Pβ (Si |t, Zi ) on t. Improvements in efficiency with these approaches using auxiliary information are likely to be small unless S and T are highly correlated and unless there is one pool of patients having longer-term follow-up and another pool of patients with auxiliary information but with relatively short-term follow-up on the clinical endpoint. In spite of these limitations, approaches using auxiliary information are of interest since they avoid the substantial risks for false positive or false negative conclusions that arise when surrogate endpoints are used to replace measures of clinical efficacy.
Conclusions It would be rare to be able to establish rigorously the validity of a surrogate endpoint. False positive and false negative error rates in definitive trials evaluating intervention effects on clinical outcomes are required to be very low, typically in the range of 2.5% to 10%. Hence, to be a valid replacement endpoint, a surrogate must provide a very high level of accuracy in predicting the intervention’s effect on the true clinical endpoint. Predictions having an accuracy of approximately 50%, such as was provided by the CD4 surrogate in the HIV setting (see Fleming [8]), are as uninformative as random tosses of a coin. The statistical methods for validation discussed in this article usually require meta-analyses since the sample sizes needed are much larger than those necessary for the typical phase III evaluation of interventions (see Sample Size Determination for Clinical Trials). Proper validation of surrogates also requires in-depth understanding of the causal pathways of the disease process, as well as the intervention’s intended and unintended mechanisms of action. Such in-depth insights are rarely achievable. Surrogate endpoints should be used in screening for promising new therapies through the evaluation of biological activity in preliminary Phase II trials. Results of such studies can guide decisions about whether the intervention is sufficiently promising to justify the conduct of large-scale and longer-term clinical trials. In these definitive Phase III trials, while information on surrogate endpoints can provide valuable additional insights about the intervention’s mechanisms of action, the primary goal should be to obtain direct evidence about the intervention’s effect on safety and clinical outcomes.
References

[1] Breslow, N.E. (1972). Contribution to the discussion on the paper by D.R. Cox, Regression models and life tables, Journal of the Royal Statistical Society, Series B 34, 216–217.
[2] Breslow, N.E. (1974). Covariance analysis of censored survival data, Biometrics 30, 89–99.
[3] Choi, S., Lagakos, S.W., Schooley, R.T. & Volberding, P.A. (1993). CD4+ lymphocytes are an incomplete surrogate marker for clinical progression in persons with asymptomatic HIV infection taking zidovudine, Annals of Internal Medicine 118, 674–680.
[4] Cox, D.R. (1975). Partial likelihood, Biometrika 62, 269–276.
[5] Daniels, M.J. & Hughes, M.D. (1997). Meta-analysis for the evaluation of potential surrogate markers, Statistics in Medicine 16, 1965–1982.
[6] DeGruttola, V., Wulfsohn, M., Fischl, M.A. & Tsiatis, A.A. (1993). Modeling the relationship between survival and CD4+ lymphocytes in patients with AIDS and AIDS-related complex, Journal of Acquired Immune Deficiency Syndrome 6, 359–365.
[7] DeGruttola, V., Fleming, T.R., Lin, D.Y. & Coombs, R. (1996). Validating surrogate markers: are we being naive? Journal of Infectious Disease 175, 237–246.
[8] Fleming, T.R. (1994). Surrogate markers in AIDS and cancer trials, Statistics in Medicine 13, 1423–1435.
[9] Fleming, T.R. & DeMets, D.L. (1996). Surrogate end points in clinical trials: are we being misled?, Annals of Internal Medicine 125, 605–613.
[10] Fleming, T.R., Prentice, R.L., Pepe, M.S. & Glidden, D. (1994). Surrogate and auxiliary endpoints in clinical trials, with potential applications in cancer and AIDS research, Statistics in Medicine 13, 955–968.
[11] Freedman, L.S., Graubard, B.I. & Schatzkin, A. (1992). Statistical validation of intermediate endpoints for chronic diseases, Statistics in Medicine 11, 167–178.
[12] Gordon, D.J. (1994). In Contemporary Issues in Cholesterol Lowering: Clinical and Population Aspects, B.M. Rifkind, ed. Marcel Dekker, New York.
[13] Kosorok, M.R. & Fleming, T.R. (1993). Using surrogate failure time data to increase cost effectiveness in clinical trials, Biometrika 80, 823–833.
[14] Lagakos, S.W. & Hoth, D.F. (1992). Surrogate markers in AIDS: where are we? Annals of Internal Medicine 116, 599–601.
[15] Lin, D.Y., Fischl, M.A. & Schoenfeld, D.A. (1993). Evaluating the role of CD4-lymphocyte counts as surrogate endpoints in HIV clinical trials, Statistics in Medicine 12, 835–842.
[16] Lin, D.Y., Fleming, T.R. & DeGruttola, V. (1997). Estimating the proportion of treatment effect explained by a surrogate marker, Statistics in Medicine 16, 1515–1527.
[17] O’Brien, W., Hartigan, P.M., Martin, D., Esinhart, J., Hill, A., Benoit, S., Rubin, M., Simberkoff, M.S., Hamilton, J.D. & the Veterans Affairs Cooperative Study Group on AIDS (1996). Changes in plasma HIV-1 RNA and CD4+ lymphocyte counts and the risk of progression to AIDS, New England Journal of Medicine 334, 426–431.
[18] Pepe, M.S. (1992). Inference using surrogate outcome data and a validation sample, Biometrika 79, 355–365.
[19] Prentice, R.L. (1989). Surrogate endpoints in clinical trials: definition and operational criteria, Statistics in Medicine 8, 431–440.
[20] Schatzkin, A., Freedman, L.S., Schiffman, M.H. & Dawsey, S.M. (1990). Validation of intermediate end points in cancer research, Journal of the National Cancer Institute 82, 1746–1752.
[21] Temple, R.J. (1995). A regulatory authority’s opinion about surrogate endpoints, in Clinical Measurement in Drug Evaluation, W.S. Nimmo & G.T. Tucker, eds. Wiley, New York.
[22] Tsiatis, A.A., DeGruttola, V. & Wulfsohn, M.S. (1995). Modeling the relationship of survival to longitudinal data measured with error. Applications to survival and CD4
counts in patients with AIDS, Journal of the American Statistical Association 90, 27–37.
THOMAS R. FLEMING, VICTOR DE GRUTTOLA & DAVID L. DEMETS
SURVIVAL ANALYSIS, OVERVIEW
PER KRAGH ANDERSEN, University of Copenhagen, Denmark
NIELS KEIDING, University of Copenhagen, Denmark

Survival analysis is the study of the distribution of life times, that is, the times from an initiating event (birth, start of treatment, employment in a given job) to some terminal event (death, relapse, disability pension). A distinguishing feature of survival data is the inevitable presence of incomplete observations, particularly when the terminal event for some individuals is not observed; instead, it is only known that this event is at least later than a given point in time: right censoring. The aims of this entry are to provide a brief historical sketch of the long development of survival analysis and to survey what we have found to be central issues in the current methodology of survival analysis. Necessarily, this entry is rich in cross-references to other entries that treat specific subjects in more detail. However, we have not attempted to include cross-references to all specific entries within survival analysis.
1 HISTORY
1.1 The Prehistory of Survival Analysis in Demography and Actuarial Science
Survival analysis is one of the oldest statistical disciplines with roots in demography and actuarial science in the seventeenth century; see [49, Chapter 2]; (51) for general accounts of the history of vital statistics and (22) for specific accounts of the work before 1750. The basic life-table methodology in modern terminology amounts to the estimation of a survival function (one minus distribution function) from life times with delayed entry (or left truncation; see below) and right censoring. This was known before 1700, and explicit parametric models at least since the linear approximation of de Moivre (39) (see e.g. [22, p. 517]), later examples being due to Lambert [33, p. 483]:

1 − x/96 − 0.6176[exp(−x/31.682) − exp(−x/2.43114)]    (1)

and the influential nineteenth-century proposals by Gompertz (19) and Makeham (37), who modeled the hazard function as bc^x and a + bc^x, respectively. Motivated by the controversy over smallpox inoculation, D. Bernoulli (5) laid the foundation of the theory of competing risks; see (44) for a historical account. The calculation of expected number of deaths (how many deaths would there have been in a study population if a given standard set of death rates applied) also dates back to the eighteenth century; see (29) and the article on Historical Controls in Survival Analysis. Among the important methodological advances in the nineteenth century was, in addition to the parametric survival analysis models mentioned above, the graphical simultaneous handling of calendar time and age in the Lexis Diagram [35, cf. 30]. Two very important themes of modern survival analysis may be traced to early twentieth century actuarial mathematics: Multistate modeling in the particular case of disability insurance (41) and nonparametric estimation in continuous time of the survival function in the competing risk problem under delayed entry and right censoring (13). At this time, survival analysis was not an integrated component of theoretical statistics. A characteristic scepticism about ‘‘the value of life-tables in statistical research’’ was voiced by Greenwood (20) in the Journal of the Royal Statistical Society, and Westergaard’s (50) guest appearance in Biometrika on ‘‘Modern problems in vital statistics’’ had no reference to sampling variability. This despite the fact that these two authors were actually statistical pioneers in survival analysis: Westergaard (48) by deriving what we would call the standard error of the standardized mortality ratio (rederived by Yule (52);
see (29)) and Greenwood (21) with his famous expression for ‘‘the ‘errors of sampling’ of the survivorship tables’’, (see below). 1.2 The ‘‘Actuarial’’ life table and the Kaplan–Meier Estimator In the mid-twentieth century, these well-established demographic and actuarial techniques were presented to the medical–statistical community in influential surveys such as those by Berkson and Gage (4) and Cutler and Ederer (13). In this approach, time is grouped into discrete units (e.g. one-year intervals), and the chain of survival frequencies from one interval to the next are multiplied together to form an estimate of the survival probability across several time periods. The difficulty is in the development of the necessary approximations due to the discrete grouping of the intrinsically continuous time and the possibly somewhat oblique observation fields in cohort studies and more complicated demographic situations. The penetrating study by Kaplan and Meier (28), the fascinating genesis of which was chronicled by Breslow (8), in principle, eliminated the need for these approximations in the common situation in medical statistics where all survival and censoring times are known precisely. Kaplan and Meier’s tool (which they traced back to B¨ohmer (7)) was to shrink the observation intervals to include at most one observation per interval. Though overlooked by many later authors, Kaplan and Meier also formalized the age-old handling of delayed entry (actually also covered by B¨ohmer) through the necessary adjustment for the risk set, the set of individuals alive and under observation at a particular value of the relevant time variable. Among the variations on the actuarial model, we will mention two. Harris et al (23) anticipated much recent work in, for example, AIDS survival studies in their generalization of the usual life-table estimator to the situation in which the death and censoring times are known only in large, irregular intervals. Ederer et al [(14)] developed a ‘‘relative survival rate . . . as the ratio of the observed survival rate in a group of patients to the survival rate expected in a group similar to
the patients . . . ’’ thereby connecting to the long tradition of comparing observed with expected; see, for example, (29) and the article on Historical Controls in Survival Analysis. 1.3 Parametric Survival Models Parametric survival models were wellestablished in actuarial science and demography, but have never dominated medical uses of survival analysis. However, in the 1950s and 1960s important contributions to the statistical theory of survival analysis were based on simple parametric models. One example is the maximum likelihood approach by Boag (6) to a cure model assuming eternal life with probability c and lognormally distributed survival times otherwise. The exponential distribution was assumed by Littell (36), when he compared the ‘‘actuarial’’ and the maximum likelihood approach to the ‘‘T-year survival rate’’, by Armitage (3) in his comparative study of two-sample tests for clinical trials with staggered entry, and by Feigl and Zelen (16) in their model for (uncensored) lifetimes whose expectations were allowed to depend linearly on covariates, generalized to censored data by Zippin and Armitage (53). Cox (11) revolutionized survival analysis by his semiparametric regression model for the hazard, depending arbitrarily (‘‘nonparametrically’’) on time and parametrically on covariates. For details on the genesis of Cox’s paper, see (42,43). 1.4 Multistate Models Traditional actuarial and demographical ways of modeling several life events simultaneously may be formalized within the probabilistic area of finite-state Markov processes in continuous time. An important and influential documentation of this was by Fix and Neyman (18), who studied recovery, relapse, and death (and censoring) in what is now commonly termed an illness–death model allowing for competing risks. Chiang (9), for example, in his 1968 monograph, extensively documented the relevant stochastic models, and Sverdrup (46), in an important paper, gave a systematic statistical study. These models have constant
transition intensities, although subdivision of time into intervals allows grouped-time methodology of the actuarial life-table type, as carefully documented by Hoem (24).
2 SURVIVAL ANALYSIS CONCEPTS
The ideal basic independent nonnegative random variables Xi, i = 1, . . . , n are not always observed directly. For some individuals i, the available piece of information is a right-censoring time Ui, that is, a period elapsed in which the event of interest has not occurred (e.g. a patient has survived until Ui). Thus, a generic survival data sample includes ((X̂i, Di), i = 1, . . . , n), where X̂i is the smaller of Xi and Ui and Di is the indicator, I(Xi ≤ Ui), of not being censored. Mathematically, the distribution of Xi may be described by the survival function

Si(t) = Pr(Xi > t).    (2)
If the hazard function

αi(t) = lim_{Δt→0} Pr(Xi ≤ t + Δt | Xi > t) / Δt    (3)

exists, then

Si(t) = exp(−Ai(t)),    (4)

where

Ai(t) = ∫_0^t αi(u) du    (5)

is the integrated hazard over [0, t). If, more generally, the distribution of the Xi has discrete components, then Si(t) is given by the product-integral of the cumulative hazard measure. Owing to the dynamical nature of survival data, a characterization of the distribution via the hazard function is often convenient. (Note that αi(t)Δt when Δt > 0 is small is approximately the conditional probability of i ‘‘dying’’ just after time t given ‘‘survival’’ till time t.) Also, αi(t) is the basic quantity in the counting process approach to survival analysis (see e.g. (2), and the article on Survival Distributions and Their Characteristics).
3 NONPARAMETRIC ESTIMATION AND TESTING
The simplest situation encountered in survival analysis is the nonparametric estimation of a survival distribution function based on a right-censored sample of observation times (X̂1, . . . , X̂n), where the true survival times Xi, i = 1, . . . , n, are assumed to be independent and identically distributed with common survival distribution function S(t), whereas as few assumptions as possible are usually made about the right-censoring times Ui except for the assumption of independent censoring. The concept of independent censoring has the interpretation that the fact that an individual, i, is alive and uncensored at time t, say, should not provide more information on the survival time for that individual than Xi > t, that is, the right-censoring mechanism should not remove individuals from the study who are at a particularly high or a particularly low risk of dying. Under these assumptions, S(t) is estimated by the Kaplan–Meier estimator (28). This is given by

Ŝ(t) = ∏_{X̂i ≤ t} (1 − Di/Y(X̂i)),    (6)
where $Y(t) = \sum_i I(\tilde X_i \ge t)$ is the number of individuals at risk just before time t. The Kaplan–Meier estimator is a nonparametric maximum likelihood estimator and, in large samples, $\hat S(t)$ is approximately normally distributed with mean S(t) and a variance that may be estimated by Greenwood’s formula:
\[
\hat\sigma^2(t) = [\hat S(t)]^2 \sum_{\tilde X_i \le t} \frac{D_i}{Y(\tilde X_i)\,[Y(\tilde X_i) - 1]}. \qquad (7)
\]
From this result, pointwise confidence intervals for S(t) are easily constructed and, since one can also show weak convergence of the entire Kaplan–Meier curve $\{\sqrt{n}[\hat S(t) - S(t)];\ 0 \le t \le \tau\}$, $\tau \le \infty$, to a mean zero Gaussian process, simultaneous confidence bands for S(t) on $[0, \tau]$ can also be set up. As an alternative to estimating the survival distribution function S(t), the cumulative hazard function $A(t) = -\log S(t)$ may be
studied. Thus, A(t) may be estimated by the Nelson–Aalen estimator
\[
\hat A(t) = \sum_{\tilde X_i \le t} \frac{D_i}{Y(\tilde X_i)}. \qquad (8)
\]
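To make (6)–(8) concrete, the following is a minimal sketch of how the Kaplan–Meier and Nelson–Aalen estimates and the Greenwood variance could be computed directly with NumPy; it is not part of the original article, and the small follow-up vectors are made up for illustration.

```python
import numpy as np

# Made-up right-censored sample: follow-up times and event indicators
# (D = 1 means an observed death, D = 0 means right censored).
time  = np.array([2.0, 3.0, 3.0, 5.0, 8.0, 9.0, 11.0, 12.0])
event = np.array([1,   1,   0,   1,   0,   1,   1,    0  ])

t_event = np.unique(time[event == 1])          # distinct death times

# Number at risk Y(t) and number of deaths d(t) just before/at each death time
Y = np.array([(time >= t).sum() for t in t_event])
d = np.array([((time == t) & (event == 1)).sum() for t in t_event])

S_km = np.cumprod(1.0 - d / Y)                 # Kaplan-Meier estimate (6)
var_gw = S_km**2 * np.cumsum(d / (Y * (Y - 1)))  # Greenwood variance (7); no tied deaths here
A_na = np.cumsum(d / Y)                        # Nelson-Aalen estimate (8)

for t, s, v, a in zip(t_event, S_km, var_gw, A_na):
    print(f"t={t:5.1f}  S_hat={s:.3f}  se={np.sqrt(v):.3f}  A_hat={a:.3f}")
```

The loop-free cumulative products and sums mirror the fact that both estimators only change at observed death times.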
The relation between the estimators $\hat S(t)$ and $\hat A(t)$ is given by the product-integral, from which it follows that their large-sample properties are equivalent. Though the Kaplan–Meier estimator has the advantage that a survival probability is easier to interpret than a cumulative hazard function, the Nelson–Aalen estimator is easier to generalize to multistate situations beyond the survival data context. We shall return to this below. To give a nonparametric estimate of the hazard function α(t) itself requires some smoothing techniques to be applied. Right censoring is not the only kind of data-incompleteness to be dealt with in survival analysis; in particular, left truncation (or delayed entry), where individuals may not all be followed from time 0 but maybe only from a later entry time $V_i$, conditionally on having survived until $V_i$, occurs frequently in, for example, epidemiological applications. Dealing with left truncation only requires a redefinition of the risk set from the set $\{i : \tilde X_i \ge t\}$ of individuals still alive and uncensored at time t to the set $\{i : V_i < t \le \tilde X_i\}$ of individuals with entry time $V_i < t$ and who are still alive and uncensored. With Y(t) still denoting the size of the risk set at time t, both (6), (7), and (8) are applicable, though one should be aware of the fact that estimates of S(t) and A(t) may be ill-determined for small values of t due to the left truncation. When the survival time distributions in a number, k, of homogeneous groups have been estimated nonparametrically, it is often of interest to test the hypothesis $H_0$ of identical hazards in all groups. Thus, on the basis of censored survival data $((\tilde X_{hi}, D_{hi}), i = 1, \ldots, n_h)$ for group h, h = 1, ..., k, the Nelson–Aalen estimates $\hat A_h(t)$ have been computed, and based on the combined sample of
size $n = \sum_h n_h$ with data $((\tilde X_i, D_i), i = 1, \ldots, n)$, an estimate of the common cumulative hazard function A(t) under $H_0$ may be obtained by a Nelson–Aalen estimator $\hat A(t)$. As a general statistic for testing $H_0$, one may then use a k-vector of sums of weighted differences between the increments of $\hat A_h(t)$ and $\hat A(t)$:
\[
Z_h = \sum_{i=1}^{n} K_h(\tilde X_i)\,[\Delta\hat A_h(\tilde X_i) - \Delta\hat A(\tilde X_i)]. \qquad (9)
\]
Here, $\Delta\hat A_h(t) = 0$ if t is not among the observed survival times in the hth sample, and $K_h(t)$ is 0 whenever $Y_h(t) = 0$; in fact, all weight functions used in practice have the form $K_h(t) = Y_h(t)K(t)$. With this structure for the weight function, the covariance between $Z_h$ and $Z_j$ given by (9) is estimated by
\[
\hat\sigma_{hj} = \sum_{i=1}^{n} K(\tilde X_i)\,\frac{Y_h(\tilde X_i)}{Y(\tilde X_i)}\left[\delta_{hj} - \frac{Y_j(\tilde X_i)}{Y(\tilde X_i)}\right] D_i, \qquad (10)
\]
and, letting Z be the k-vector $(Z_1, \ldots, Z_k)$ and $\hat\Sigma$ the k by k matrix $(\hat\sigma_{hj},\ h, j = 1, \ldots, k)$, the test statistic $X^2 = Z'\hat\Sigma^- Z$ is asymptotically chi-squared distributed under $H_0$ with k − 1 degrees of freedom if all $n_h$ tend to infinity at the same rate. Here, $\hat\Sigma^-$ is a generalized inverse of $\hat\Sigma$. Special choices for K(t) correspond to test statistics with different properties for particular alternatives to $H_0$. An important such test statistic is the logrank test obtained for $K(t) = I(Y(t) > 0)$. For this test, which has particularly good power for proportional hazards alternatives, $Z_h$ given by (9) reduces to $Z_h = O_h - E_h$ with $O_h$ the total number of observed failures in group h and $E_h = \sum_i D_i Y_h(\tilde X_i)/Y(\tilde X_i)$ an ‘‘expected’’ number of failures in group h. For the two-sample case (k = 2), one may of course use the square root of $X^2$ as an asymptotically normal test statistic for the null hypothesis. For the case where the k groups are ordered, and where a score $x_h$ (with $x_1 \le \cdots \le x_k$) is attached to group h, a test for trend is given by $T^2 = (x'Z)^2 / x'\hat\Sigma x$ with $x = (x_1, \ldots, x_k)$, and it is asymptotically chi-squared with 1 df. The above linear rank tests have low power against certain important classes of alternatives such as ‘‘crossing hazards’’. Just as for uncensored data, this has motivated the development of test statistics of the Kolmogorov–Smirnov and Cramér–von Mises types, based on maximal deviation or integrated squared deviation between estimated hazards, cumulative hazards, or survival functions.
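The two-sample logrank computation just described can be sketched as follows; this is an editorial illustration only, with made-up data, using the variance form of (10) with K(t) = I(Y(t) > 0).

```python
import numpy as np

# Made-up two-group right-censored data
time  = np.array([3., 5., 7., 7., 9., 4., 6., 8., 10., 12.])
event = np.array([1,  1,  0,  1,  1,  1,  0,  1,  1,   0 ])
group = np.array([0,  0,  0,  0,  0,  1,  1,  1,  1,   1 ])

t_event = np.unique(time[event == 1])   # distinct death times

O = np.zeros(2)   # observed deaths per group, O_h
E = np.zeros(2)   # "expected" deaths per group, E_h
V = 0.0           # variance of O[0] - E[0] from (10) with K(t) = I(Y(t) > 0)

for t in t_event:
    at_risk = time >= t
    Y = at_risk.sum()                                  # total at risk just before t
    Yh = np.array([(at_risk & (group == g)).sum() for g in (0, 1)])
    d = ((time == t) & (event == 1)).sum()             # deaths at t
    dh = np.array([((time == t) & (event == 1) & (group == g)).sum()
                   for g in (0, 1)])
    O += dh
    E += d * Yh / Y
    V += d * (Yh[0] / Y) * (1 - Yh[0] / Y)

chi2 = (O[0] - E[0]) ** 2 / V    # X^2 on 1 df in the two-sample case
print("O =", O, "E =", E, "logrank chi-squared =", round(chi2, 3))
```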
4 PARAMETRIC INFERENCE
The nonparametric methods outlined in the previous section have become the standard approach to the analysis of simple homogeneous survival data without covariate information. However, parametric survival time distributions are sometimes used for inference, and we shall here give a brief review. Assume again that the true survival times $X_1, \ldots, X_n$ are independent and identically distributed with survival distribution function S(t; θ) and hazard function α(t; θ) but that only a right-censored sample $(\tilde X_i, D_i)$, i = 1, ..., n, is observed. Under independent censoring, the likelihood function for the parameter θ is
\[
L(\theta) = \prod_{i=1}^{n} (\alpha(\tilde X_i; \theta))^{D_i}\, S(\tilde X_i; \theta). \qquad (11)
\]
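As an illustration of (11), a Weibull log-likelihood for right-censored data can be maximized numerically; this sketch is not from the article, the data are made up, and SciPy's general-purpose optimizer is used rather than any purpose-built routine.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up right-censored sample
time  = np.array([2., 3., 3., 5., 8., 9., 11., 12., 4., 6.])
event = np.array([1,  1,  0,  1,  0,  1,  1,   0,  1,  1 ])

def neg_loglik(par):
    """Minus the log of (11) for a Weibull hazard a*r*(a*t)**(r - 1)."""
    log_a, log_r = par                 # optimize on the log scale so a, r stay positive
    a, r = np.exp(log_a), np.exp(log_r)
    log_hazard = np.log(a) + np.log(r) + (r - 1) * np.log(a * time)
    cum_hazard = (a * time) ** r       # A(t; theta), so log S(t; theta) = -A(t; theta)
    return -(np.sum(event * log_hazard) - np.sum(cum_hazard))

fit = minimize(neg_loglik, x0=[0.0, 0.0], method="Nelder-Mead")
a_hat, r_hat = np.exp(fit.x)
print("alpha_hat =", round(a_hat, 3), "rho_hat =", round(r_hat, 3))
```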
The function (11) may be analyzed using standard large-sample theory. Thus, standard tests, that is, Wald-, score-, and likelihood ratio tests are used as inferential tools. Two frequently used parametric survival models are the Weibull distribution with hazard function $\alpha\rho(\alpha t)^{\rho - 1}$, and the piecewise exponential distribution with $\alpha(t, \theta) = \alpha_j$ for $t \in I_j$ with $I_j = [t_{j-1}, t_j)$, $0 = t_0 < t_1 < \cdots < t_m = \infty$. Both of these distributions contain the very simplest model, the exponential distribution with a constant hazard function, as null cases.

5 COMPARISON WITH EXPECTED SURVIVAL

As a special case of the nonparametric tests discussed above, a one-sample situation may be studied. This may be relevant if one wants to compare the observed survival in the sample with the expected survival based on a standard life table. Thus, assume that a hazard function α*(t) is given and that the hypothesis $H_0: \alpha = \alpha^*$ is to be tested. One test statistic for $H_0$ is the one-sample logrank test $(O - E^*)/(E^*)^{1/2}$ where $E^*$, the ‘‘expected’’ number
of deaths, is given by $E^* = \sum_i [A^*(\tilde X_i) - A^*(V_i)]$ (with $A^*$ the cumulative hazard corresponding to $\alpha^*$). In this case, $\hat\theta = O/E^*$, the standardized mortality ratio, is the maximum likelihood estimate for the parameter θ in the model α(t) = θα*(t). Thus, the standardized mortality ratio arises from a multiplicative model involving the known population hazard α*(t). Another classical tool for comparing with expected survival, the so-called expected survival function, arises from an additive or excess hazard model.
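A small sketch of the one-sample logrank statistic and the standardized mortality ratio follows; the cohort data and the reference hazard α*(t) are hypothetical and chosen only for illustration.

```python
import numpy as np

# Made-up cohort: entry ages V, exit ages X (death or censoring), indicator D
V = np.array([50., 55., 60., 52., 58.])
X = np.array([62., 70., 66., 75., 61.])
D = np.array([1,   0,   1,   1,   0  ])

# Hypothetical known population hazard alpha*(t) = 0.02 * exp(0.08*(t - 50)),
# whose cumulative hazard from age 50 is A*(t) below.
def A_star(t):
    return 0.02 / 0.08 * (np.exp(0.08 * (t - 50.0)) - 1.0)

O = D.sum()                              # observed deaths
E_star = np.sum(A_star(X) - A_star(V))   # expected deaths under alpha*
smr = O / E_star                         # standardized mortality ratio
z = (O - E_star) / np.sqrt(E_star)       # one-sample logrank statistic
print(f"O={O}, E*={E_star:.2f}, SMR={smr:.2f}, z={z:.2f}")
```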
6 THE COX REGRESSION MODEL
In many applications of survival analysis, the interest focuses on how covariates may affect the outcome; in clinical trials, adjustment of treatment effects for effects of other explanatory variables may be crucial if the randomized groups are unbalanced with respect to important prognostic factors, and in epidemiological cohort studies, reliable effects of exposure may be obtained only if some adjustment is made for confounding variables. In these situations, a regression model is useful and the most important model for survival data is the Cox (11) proportional hazards regression model. In its simplest form, it states the hazard function for an individual, i, with covariates $Z_i = (Z_{i1}, \ldots, Z_{ip})$ to be
\[
\alpha_i(t; Z_i) = \alpha_0(t) \exp(\beta' Z_i), \qquad (12)
\]
where $\beta = (\beta_1, \ldots, \beta_p)$ is a vector of unknown regression coefficients and $\alpha_0(t)$, the baseline hazard, is the hazard function for individuals with all covariates equal to 0. Thus, the baseline hazard describes the common shape of the survival time distributions for all individuals while the relative risk function $\exp(\beta' Z_i)$ gives the level of each individual’s hazard. The interpretation of the parameter $\beta_j$ for a dichotomous $Z_{ij} \in \{0, 1\}$ is that $\exp(\beta_j)$ is the relative risk for individuals with $Z_{ij} = 1$ compared to those with $Z_{ij} = 0$, all other covariates being the same for the two individuals. Similar interpretations hold for parameters corresponding to covariates taking more than two values. The model is semiparametric in the sense that the relative risk part is modeled parametrically while the baseline hazard is left
unspecified. This semiparametric nature of the model led to a number of inference problems, which were discussed in the literature in the years following the publication of Cox’s article in 1972. However, these problems were all resolved and estimation proceeds as follows. The regression coefficients β are estimated by maximizing the Cox partial likelihood
\[
L(\beta) = \prod_{i=1}^{n} \left[ \frac{\exp(\beta' Z_i)}{\sum_{j \in R_i} \exp(\beta' Z_j)} \right]^{D_i}, \qquad (13)
\]
where $R_i = \{j : \tilde X_j \ge \tilde X_i\}$, the risk set at time $\tilde X_i$, is the set of individuals still alive and uncensored at that time. Furthermore, the cumulative baseline hazard $A_0(t)$ is estimated by the Breslow estimator
\[
\hat A_0(t) = \sum_{\tilde X_i \le t} \frac{D_i}{\sum_{j \in R_i} \exp(\hat\beta' Z_j)}, \qquad (14)
\]
which is the Nelson–Aalen estimator one would use if β were known and equal to the maximum partial likelihood estimate $\hat\beta$. The estimators based on (13) and (14) also have a nonparametric maximum likelihood interpretation. In large samples, $\hat\beta$ is approximately normally distributed with the proper mean and with a covariance, which may be estimated from the information matrix based on (13). This means that approximate confidence intervals for the relative risk parameters of interest can be calculated and that the usual large-sample test statistics based on (13) are available. Also, the asymptotic distribution of the Breslow estimator is normal; however, this estimate is most often used as a tool for estimating survival probabilities for individuals with given covariates, $Z_0$. Such an estimate may be obtained as the product-integral $\hat S(t; Z_0)$ of $\exp(\hat\beta' Z_0)\,\hat A_0(t)$. The joint asymptotic distribution of $\hat\beta$ and the Breslow estimator then yields an approximate normal distribution for $\hat S(t; Z_0)$ in large samples. A number of useful extensions of this simple Cox model are available. Thus, in some cases, the covariates are time-dependent, for example, a covariate might indicate whether or not a given event had occurred by time t, or a time-dependent covariate
might consist of repeated recordings of some measurement likely to affect the prognosis. In such cases, the regression coefficients β are estimated by replacing $\exp(\beta' Z_j)$ in (13) by $\exp[\beta' Z_j(\tilde X_i)]$. Also, a simple extension of the Breslow estimator (14) applies in this case. However, the survival function can, in general, no longer be estimated in a simple way because of the extra randomness arising from the covariates, which is not modeled in the Cox model. This has the consequence that the estimates are more difficult to interpret when the model contains time-dependent covariates. To estimate the survival function in such cases, a joint model for the hazard and the time-dependent covariate is needed. Another extension of (12) is the stratified Cox model where individuals are grouped into a number, k, of strata, each of which has a separate baseline hazard. This model has important applications for checking the assumptions of (12). The model assumption of proportional hazards may also be tested in a number of ways, the simplest possibility being to add interaction terms of the form $Z_{ij} f(t)$ between $Z_{ij}$ and time, where f(t) is some specified function. Also, various forms of residuals as for normal linear models may be used for model checking in (12). In (12), it is finally assumed that a quantitative covariate affects the hazard log-linearly. This assumption may also be checked in several ways and alternative models with other relative risk functions $r(\beta' Z_i)$ may be used. Special care is needed when covariates are measured with error.

7 OTHER REGRESSION MODELS FOR SURVIVAL DATA

Though the semiparametric Cox model is the regression model for survival data that is applied most frequently, other regression models, for example, parametric regression models, also play important roles in practice. Examples include models with a multiplicative structure, that is, models like (12) but with a parametric specification, $\alpha_0(t) = \alpha_0(t; \theta)$, of the baseline hazard, and accelerated failure-time models. A multiplicative model with important epidemiological applications is the Poisson
regression model with a piecewise constant baseline hazard. In large data sets with categorical covariates, this model has the advantage that a sufficiency reduction to the number of failures and the amount of person-time at risk in each cell defined by the covariates and the division of time into intervals is possible. This is in contrast to the Cox regression model (12) where each individual data record is needed to compute (13). The substantial computing time required to maximize (13) in large samples has also led to modifications of this estimation procedure. Thus, in nested case–control studies the risk set $R_i$ in the Cox partial likelihood is replaced by a random sample $\tilde R_i$ of $R_i$. In the accelerated failure-time model, the focus is not on the hazard function but on the survival time itself, much like in classical linear models. Thus, this model is given by $\log X_i = \alpha + \beta' Z_i + \epsilon_i$, where the error terms are assumed to be independent and identically distributed with expectation 0. Examples include normally distributed $(\epsilon_i, i = 1, \ldots, n)$, and error terms with a logistic or an extreme value distribution, the latter giving rise to a regression model with Weibull distributed life times. Finally, we shall mention some nonparametric hazard models. In Aalen’s additive model, $\alpha_i(t) = \beta_0(t) + \beta(t)' Z_i(t)$, the regression functions $\beta_0(t), \ldots, \beta_p(t)$ are left completely unspecified and estimated nonparametrically much like the Nelson–Aalen estimator discussed above. This model provides an attractive alternative to the other regression models discussed in this section. There also exist more general and flexible models containing both this model and the Cox regression model as special cases.
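Returning to the Cox model of Section 6, the log partial likelihood corresponding to (13) can be maximized with a general-purpose optimizer, and the Breslow estimator (14) then follows; this sketch is editorial, with made-up untied data, and is not the article's own implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up right-censored data with two covariates per individual
time  = np.array([5., 8., 3., 9., 6., 4., 7., 10.])
event = np.array([1,  0,  1,  1,  0,  1,  1,  0 ])
Z = np.array([[1, 0.5], [0, 1.2], [1, 0.3], [0, 0.9],
              [1, 1.1], [0, 0.2], [1, 0.7], [0, 1.5]])

def neg_log_partial_lik(beta):
    """Minus the log of the Cox partial likelihood (13)."""
    eta = Z @ beta
    loglik = 0.0
    for i in np.where(event == 1)[0]:
        risk_set = time >= time[i]       # R_i = {j : X_j >= X_i}
        loglik += eta[i] - np.log(np.sum(np.exp(eta[risk_set])))
    return -loglik

fit = minimize(neg_log_partial_lik, x0=np.zeros(2), method="BFGS")
beta_hat = fit.x
print("beta_hat =", np.round(beta_hat, 3))

# Breslow estimate (14) of the cumulative baseline hazard at the event times
eta = Z @ beta_hat
t_event = np.sort(time[event == 1])
A0 = np.cumsum([1.0 / np.exp(eta[time >= t]).sum() for t in t_event])
```

With no tied event times, a standard survival package should return the same β̂; the explicit loop is kept only to mirror (13) and (14).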
8 MULTISTATE MODELS
Models for survival data may be considered a special case of a multistate model, namely, a model with a transient state alive (0) and an absorbing state dead (1) and where the hazard rate is the force of transition from state 0 to state 1. Multistate models may conveniently be studied in the mathematical framework of counting processes with a notation that actually simplifies the notation of
the previous sections and, furthermore, unifies the description of survival data and that of more general models like the competing risks model and the illness–death model to be discussed below. We first introduce the counting processes relevant for the study of censored survival data (1). Define, for i = 1, ..., n, the stochastic processes
\[
N_i(t) = I(\tilde X_i \le t,\ D_i = 1) \qquad (15)
\]
and
\[
Y_i(t) = I(\tilde X_i \ge t). \qquad (16)
\]
Then (15) is a counting process counting 1 at time $\tilde X_i$ if individual i is observed to die; otherwise $N_i(t) = 0$ throughout. The process (16) indicates whether i is still at risk just before time t. Models for the survival data are then introduced via the intensity process, $\lambda_i(t) = \alpha_i(t) Y_i(t)$ for $N_i(t)$, where $\alpha_i(t)$, as before, denotes the hazard function for the distribution of $X_i$. Letting $N = N_1 + \cdots + N_n$ and $Y = Y_1 + \cdots + Y_n$, the Nelson–Aalen estimator (8) is given by the stochastic integral
\[
\hat A(t) = \int_0^t \frac{J(u)}{Y(u)}\, dN(u), \qquad (17)
\]
where $J(t) = I(Y(t) > 0)$. In this simple multistate model, the transition probability $P_{00}(0, t)$, that is, the conditional probability of being in state 0 by time t given state 0 at time 0, is simply the survival probability S(t), which, as described above, may be estimated using the Kaplan–Meier estimator, which is the product-integral of (17). In fact, all the models and methods for survival data discussed above, which are based on the hazard function, have immediate generalizations to models based on counting processes. Thus, both the nonparametric tests and the Cox regression model may be applied for counting process (multistate) models. One important extension of the two-state model for survival data is the competing risks model with one transient alive state 0 and a number, k, of absorbing states corresponding to death from cause h, h = 1, ..., k. In this model, the basic parameters are the cause-specific hazard functions $\alpha_h(t)$, h = 1, ..., k, and the observations for individual i will consist of $(\tilde X_i, D_{hi})$, h = 1, ..., k, where $D_{hi} = 1$ if
individual i is observed to die from cause h, and $D_{hi} = 0$ otherwise. On the basis of these data, k counting processes for each i can be defined by $N_{hi}(t) = I(\tilde X_i \le t,\ D_{hi} = 1)$ and, letting $N_h = N_{h1} + \cdots + N_{hn}$, the integrated cause-specific hazard $A_h(t)$ is estimated by the Nelson–Aalen estimator replacing N by $N_h$ in (17). A useful synthesis of the cause-specific hazards is provided by the transition probabilities $P_{0h}(0, t)$ of being dead from cause h by time t. This is frequently called the cumulative incidence function for cause h and is given by
\[
P_{0h}(s, t) = \int_s^t S(u)\,\alpha_h(u)\, du, \qquad (18)
\]
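The plug-in estimator described in the next sentence can be sketched as follows; this is an editorial illustration with made-up two-cause data, not taken from the article.

```python
import numpy as np

# Made-up competing risks data: cause = 0 means censored, 1 or 2 a death cause
time  = np.array([2., 3., 4., 4., 6., 7., 9., 11.])
cause = np.array([1,  0,  2,  1,  0,  2,  1,  0 ])

t_event = np.unique(time[cause > 0])
S_minus = 1.0                     # Kaplan-Meier estimate just before the current time
cif = {1: 0.0, 2: 0.0}            # cumulative incidence per cause
curve = []

for t in t_event:
    Y = (time >= t).sum()                                    # at risk just before t
    d = {h: ((time == t) & (cause == h)).sum() for h in (1, 2)}
    d_all = d[1] + d[2]
    for h in (1, 2):
        cif[h] += S_minus * d[h] / Y   # increment of (18): S(t-) times dA_h(t)
    S_minus *= 1.0 - d_all / Y         # update the all-cause Kaplan-Meier estimate
    curve.append((t, cif[1], cif[2]))

for t, c1, c2 in curve:
    print(f"t={t:4.1f}  P_01={c1:.3f}  P_02={c2:.3f}")
```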
This quantity may be estimated from (18) by inserting the Kaplan–Meier estimate for S(u) and the Nelson–Aalen estimate for the integrated cause-specific hazard. In fact, this Aalen–Johansen estimator of the matrix of transition probabilities is exactly the product-integral of the cause-specific hazards. Another important multistate model is the illness–death or disability model with two transient states, say healthy (0) and diseased (1), and one absorbing state dead (2). If transitions both from 0 to 1 and from 1 to 0 are possible, the disease is recurrent; otherwise it is chronic. On the basis of such observed transitions between the three states, it is possible to define counting processes for individual i as $N_{hji}(t)$ = number of observed $h \to j$ transitions in the time interval [0, t] for individual i and, furthermore, we may let $Y_{hi}(t) = I$(i is in state h at time t−). With these definitions, we may set up and analyze models for the transition intensities $\alpha_{hji}(t)$ from state h to state j, including nonparametric comparisons and Cox-type regression models. Furthermore, transition probabilities $P_{hj}(s, t)$ may be estimated by product-integration of the intensities.

9 OTHER KINDS OF INCOMPLETE OBSERVATION

A salient feature of survival data is right censoring, which has been referred to throughout in the present overview. However, several
other kinds of incomplete observation are important in survival analysis. Often, particularly when the time variable of interest is age, individuals enter the study after time 0. This is called delayed entry and may be handled by left truncation (conditioning) or left filtering (‘‘viewing the observations through a filter’’). There are also situations when only events (such as AIDS cases) that occur before a certain time are included (right truncation). The phenomenon of left censoring, though theoretically possible, is more rarely relevant in survival analysis. When the event times are only known to lie in an interval, one may use the grouped time approach of classical life tables, or (if the intervals are not synchronous) techniques for interval censoring may be relevant. A common framework (coarsening at random) was recently suggested for several of the above types of incomplete observation.
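As a concrete illustration of delayed entry (again a sketch with made-up entry and exit ages, not from the article), the only change needed in the Kaplan–Meier computation of (6) is the risk-set definition $\{i : V_i < t \le \tilde X_i\}$:

```python
import numpy as np

# Made-up left-truncated data: entry age V, exit age X, event indicator D
V = np.array([50., 52., 55., 58., 60.])
X = np.array([61., 70., 57., 66., 72.])
D = np.array([1,   0,   1,   1,   0  ])

t_event = np.unique(X[D == 1])
S = 1.0
for t in t_event:
    Y = ((V < t) & (X >= t)).sum()   # at risk: entered before t and still under observation
    d = ((X == t) & (D == 1)).sum()
    S *= 1.0 - d / Y                 # note: S may be ill-determined for small t (small Y)
    print(f"t={t:4.1f}  Y={Y}  S_hat(t)={S:.3f}")
```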
10 MULTIVARIATE SURVIVAL ANALYSIS

For multivariate survival, the innocent-looking problem of generalizing the Kaplan–Meier estimator to several dimensions has proved surprisingly intricate. A major challenge (in two dimensions) is how to efficiently use singly censored observations, where one component is observed and the other is right censored. For regression analysis of multivariate survival times, two major approaches have been taken. One is to model the marginal distributions and use estimation techniques based on generalized estimating equations, leaving the association structure unspecified. The other is to specify random effects models for survival data based on conditional independence. An interesting combination between these two methods is provided by copula models in which the marginal distributions are combined via a so-called copula function, thereby obtaining an explicit model for the association structure. For the special case of repeated events, both the marginal approach and the conditional (frailty) approach have been used successfully.
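As a small illustration of the copula idea, a Clayton copula can tie two marginal survival functions into a joint survival function; the Clayton form, the exponential margins, and the parameter values below are chosen for the example and are not taken from the article.

```python
import numpy as np

def clayton_joint_survival(S1, S2, theta):
    """Clayton copula on the survival scale: larger theta gives stronger positive
    association; theta near 0 approaches independence, S1 * S2."""
    return (S1 ** (-theta) + S2 ** (-theta) - 1.0) ** (-1.0 / theta)

# Example: exponential margins for two event times in the same cluster
t1, t2 = 2.0, 3.0
S1 = np.exp(-0.1 * t1)   # marginal survival of component 1
S2 = np.exp(-0.2 * t2)   # marginal survival of component 2

for theta in (0.01, 0.5, 2.0):
    print(theta, round(clayton_joint_survival(S1, S2, theta), 4))
```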
11 CONCLUDING REMARKS
Survival analysis is a well-established discipline in statistical theory as well as in biostatistics. Most books on biostatistics contain chapters on the topic and most software packages include procedures for handling the basic survival techniques. Several books have appeared, among them the documentation of the actuarial and demographical know-how by Elandt-Johnson and Johnson (15); the research monograph by Kalbfleisch and Prentice (27), the first edition of which for a decade maintained its position as main reference on the central theory; the comprehensive text by Lawless (34), covering also parametric models; and the concise text by Cox and Oakes (12), two central contributors to the recent theory. The counting process approach is covered by Fleming and Harrington (17) and by Andersen et al. (2); see also (25). Later, books intended primarily for the biostatistical user have appeared. These include (10,31,32,38,40). Also, books dealing with special topics, like implementation in the S-Plus software (47), multivariate survival data (26), and the linear regression model (45), have appeared.

REFERENCES

1. Aalen, O. O. (1978) Nonparametric inference for a family of counting processes, Annals of Statistics 6: 701–726.
2. Andersen, P. K., Borgan, Ø., Gill, R. D. & Keiding, N. (1993) Statistical Models Based on Counting Processes. Springer, New York.
3. Armitage, P. (1959) The comparison of survival curves, Journal of the Royal Statistical Society A 122: 279–300.
4. Berkson, J. & Gage, R. P. (1950) Calculation of survival rates for cancer, Proceedings of the Staff Meetings of the Mayo Clinic 25: 270–286.
5. Bernoulli, D. (1766) Essai d'une nouvelle analyse de la mortalité causée par la petite vérole, et des avantages de l'inoculation pour la prévenir, Mémoires de Mathématique et de Physique de l'Académie Royale des Sciences, Paris, Année MDCCLX, pp. 1–45 of Mémoires.
6. Boag, J. W. (1949) Maximum likelihood estimates of the proportion of patients cured by cancer therapy, Journal of the Royal Statistical Society B 11: 15–53.
7. Böhmer, P. E. (1912) Theorie der unabhängigen Wahrscheinlichkeiten, Rapports, Mémoires et Procès-verbaux du 7e Congrès International d'Actuaires, Amsterdam 2: 327–343.
8. Breslow, N. E. (1991) Introduction to Kaplan and Meier (1958). Nonparametric estimation from incomplete observations, in Breakthroughs in Statistics II, S. Kotz & N. L. Johnson, eds. Springer, New York, 311–318.
9. Chiang, C. L. (1968) Introduction to Stochastic Processes in Biostatistics. Wiley, New York.
10. Collett, D. (2003) Modelling Survival Data in Medical Research, 2nd Ed., Chapman & Hall, London.
11. Cox, D. R. (1972) Regression models and life-tables (with discussion), Journal of the Royal Statistical Society (B) 34: 187–220.
12. Cox, D. R. & Oakes, D. (1984) Analysis of Survival Data. Chapman & Hall, London.
13. Cutler, S. J. & Ederer, F. (1958) Maximum utilization of the life table method in analyzing survival, Journal of Chronic Diseases 8: 699–713.
14. Ederer, F., Axtell, L. M. & Cutler, S. J. (1961) The relative survival rate: A statistical methodology, National Cancer Institute Monographs 6: 101–121.
15. Elandt-Johnson, R. C. & Johnson, N. L. (1980) Survival Models and Data Analysis. Wiley, New York.
16. Feigl, P. & Zelen, M. (1965) Estimation of exponential survival probabilities with concomitant information, Biometrics 21: 826–838.
17. Fleming, T. R. & Harrington, D. P. (1991) Counting Processes and Survival Analysis. Wiley, New York.
18. Fix, E. & Neyman, J. (1951) A simple stochastic model of recovery, relapse, death and loss of patients, Human Biology 23: 205–241.
19. Gompertz, B. (1825) On the nature of the function expressive of the law of human mortality, Philosophical Transactions of the Royal Society of London, Series A 115: 513–580.
20. Greenwood, M. (1922) Discussion on the value of life-tables in statistical research, Journal of the Royal Statistical Society 85: 537–560.
21. Greenwood, M. (1926) The natural duration of cancer, in Reports on Public Health and Medical Subjects, Vol. 33. His Majesty's Stationery Office, London, pp. 1–26.
22. Hald, A. (1990) A History of Probability and Statistics and Their Applications before 1750. Wiley, New York.
23. Harris, T. E., Meier, P. & Tukey, J. W. (1950) The timing of the distribution of events between observations, Human Biology 22: 249–270.
24. Hoem, J. M. (1976) The statistical theory of demographic rates. A review of current developments (with discussion), Scandinavian Journal of Statistics 3: 169–185.
25. Hosmer, D. W. & Lemeshow, S. (1999) Applied Survival Analysis. Regression Modeling of Time to Event Data. Wiley, New York.
26. Hougaard, P. (2000) Analysis of Multivariate Survival Data. Springer, New York.
27. Kalbfleisch, J. D. & Prentice, R. L. (2002) The Statistical Analysis of Failure Time Data, 2nd Ed., Wiley, New York.
28. Kaplan, E. L. & Meier, P. (1958) Nonparametric estimation from incomplete observations, Journal of the American Statistical Association 53: 457–481, 562–563.
29. Keiding, N. (1987) The method of expected number of deaths 1786–1886–1986, International Statistical Review 55: 1–20.
30. Keiding, N. (1990) Statistical inference in the Lexis diagram, Philosophical Transactions of the Royal Society London A 332: 487–509.
31. Klein, J. P. & Moeschberger, M. L. (2003) Survival Analysis. Techniques for Censored and Truncated Data, 2nd Ed., Springer, New York.
32. Kleinbaum, D. G. (1996) Survival Analysis. A Self-Learning Text. Springer, New York.
33. Lambert, J. H. (1772) Beyträge zum Gebrauche der Mathematik und deren Anwendung, Vol. III, Verlage des Buchlages der Realschule, Berlin.
34. Lawless, J. F. (2002) Statistical Models and Methods for Lifetime Data, 2nd Ed., Wiley, New York.
35. Lexis, W. (1875) Einleitung in die Theorie der Bevölkerungsstatistik. Trübner, Strassburg.
36. Littell, A. S. (1952) Estimation of the T-year survival rate from follow-up studies over a limited period of time, Human Biology 24: 87–116.
37. Makeham, W. M. (1860) On the law of mortality, and the construction of mortality tables, Journal of the Institute of Actuaries 8: 301.
38. Marubini, E. & Valsecchi, M. G. (1995) Analysing Survival Data from Clinical Trials and Observational Studies. Wiley, Chichester.
39. de Moivre, A. (1725) Annuities upon Lives: or, The Valuation of Annuities upon any Number of Lives; as also, of Reversions. To which is
added, An Appendix concerning the Expectations of Life, and Probabilities of Survivorship. Fayram, Motte and Pearson, London.
40. Parmar, K. B. & Machin, D. (1995) Survival Analysis. A Practical Approach. Wiley, Chichester.
41. du Pasquier, L. G. (1913) Mathematische Theorie der Invaliditätsversicherung, Mitteilungen der Vereinigung der Schweizerische Versicherungs-Mathematiker 8: 1–153.
42. Prentice, R. L. (1991) Introduction to Cox (1972) Regression models and life-tables, in Breakthroughs in Statistics II, S. Kotz & N. L. Johnson, eds. Springer, New York, pp. 519–526.
43. Reid, N. (1994) A conversation with Sir David Cox, Statistical Science 9: 439–455.
44. Seal, H. L. (1977) Studies in the history of probability and statistics, XXXV. Multiple decrements or competing risks, Biometrika 64: 429–439.
45. Smith, P. J. (2002) Analysis of Failure Time Data. Chapman & Hall/CRC, London.
46. Sverdrup, E. (1965) Estimates and test procedures in connection with stochastic models for deaths, recoveries and transfers between different states of health, Skandinavisk Aktuarietidskrift 48: 184–211.
47. Therneau, T. M. & Grambsch, P. M. (2000) Modeling Survival Data: Extending the Cox Model. Springer, New York.
48. Westergaard, H. (1882) Die Lehre von der Mortalität und Morbilität. Fischer, Jena.
49. Westergaard, H. (1901) Die Lehre von der Mortalität und Morbilität, 2. Aufl., Fischer, Jena.
50. Westergaard, H. (1925) Modern problems in vital statistics, Biometrika 17: 355–364.
51. Westergaard, H. (1932) Contributions to the History of Statistics. King, London.
52. Yule, G. (1934) On some points relating to vital statistics, more especially statistics of occupational mortality (with discussion), Journal of the Royal Statistical Society 97: 1–84.
53. Zippin, C. & Armitage, P. (1966) Use of concomitant variables and incomplete survival information with estimation of an exponential survival parameter, Biometrics 22: 655–672.
SUSPENSION OR TERMINATION OF IRB APPROVAL

Under 21 CFR (Code of Federal Regulations) 56.113, an Institutional Review Board (IRB) shall have the authority to suspend or terminate approval of research that is not being conducted in accordance with the IRB’s requirements or that has been associated with unexpected serious harm to subjects. Any suspension or termination of approval shall include a statement of the reasons for the IRB’s action and shall be reported promptly to the investigator, appropriate institutional officials, and the Food and Drug Administration. 21 CFR 56.108(b) requires that the IRB follow written procedures for ensuring prompt reporting to the IRB, appropriate institutional officials, and the Food and Drug Administration of:
1. Any unanticipated problems that involve risks to human subjects or others;
2. Any instance of serious or continuing noncompliance with these regulations or the requirements or determinations of the IRB; or
3. Any suspension or termination of IRB approval.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/oc/gcp/irbterm.html) by Ralph D’Agostino and Sarah Karl.
BELMONT REPORT

DEBORAH ZUCKER
Tufts-New England Medical Center, Boston, Massachusetts

In the National Research Act of 1974, the United States Congress put forth guidelines for the establishment of Institutional Review Boards and created a National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. This commission met at the Smithsonian Institution’s Belmont Conference Center near Baltimore, Maryland in February 1976, and the resulting Belmont Report is a summary of the basic ethical principles and guidelines identified by the commission to assist in the protection of human subjects in research (1). The Nuremberg Code (2) and the Declaration of Helsinki (3) addressed the protection of human subjects in research. However, revelations regarding the Tuskegee syphilis study (4) heightened the need for developing U.S. government policies. The Belmont Report did not provide specific recommendations, but rather provided the broad ethical underpinnings upon which regulations, to this day, are based. The Belmont Report focuses on three main areas: (1) the boundaries between research and practice, (2) the basic ethical principles that should underlie the conduct of research and the protection of human subjects, and (3) the application of these principles into practice.

1 BOUNDARIES BETWEEN PRACTICE AND RESEARCH

This section considers the distinctions between biomedical and behavioral research and the practice of accepted and novel therapies in medical care. Even today, close to 30 years later, questions remain about defining research versus practice, and, in particular, research versus quality improvement activities (5–7). The Belmont Report defines the purpose of medical or behavioral practice ‘‘to provide diagnosis, preventive treatment or therapy to particular individuals, as in interventions designed to enhance the well-being of individual patients or clients, and that have reasonable expectations of success’’ (1). By contrast, they define research as, ‘‘activities designed to test hypotheses, permit conclusions to be drawn and thereby to develop or contribute to generalizable knowledge’’ (1). They make note of experimental practices and procedures and consider their distinctions from and similarities to research. They recognize that overlap often exists between these activities, but stress that if any portion of an activity is research, it should undergo review for the protection of human subjects.

2 BASIC ETHICAL PRINCIPLES

The second section of the report outlines three basic ethical principles that underlie practices for the protection of human subjects in research:
1. Respect for persons—(often termed autonomy) describes a conviction that individuals should be treated as autonomous agents. In addition, they discuss the need for those individuals with diminished capacity for autonomy to receive protections, often with the aid of a third party.
2. Beneficence. Their description of this ethical principle encompasses beneficial actions (beneficence) as well as nonmaleficence (doing no harm), which requires not only that persons be treated in an ethical manner with respect to their decisions and protecting them from harm, but also that efforts be made to secure their well-being and that steps be taken to help maximize possible benefits and minimize possible harms.
3. Justice. In discussing this principle, they consider who should receive the benefits of research and how the burdens of research should be shared. Although this section stresses the need for equal treatment of individuals, it also recognizes that equality can be viewed in different ways (e.g., equality based on need, equality based on merit,
equality based on contribution.) Again, no specific recommendations for one approach over the other are given, but rather the issues are raised for consideration in developing future policies. In this section of the Belmont Report, specific reference is made to the Tuskegee syphilis study and the ethical misconduct in that research. In addition to equality for participation in research, the commission also addresses the need to consider equality of research outcomes so that the fruits of research (e.g., through new products, and so on) are justly distributed.
3 APPLICATION OF ETHICAL PRINCIPLES INTO PRACTICE

The third portion of the Belmont Report describes several processes involved in carrying out research and their connections to the underlying ethical principles.
1. Informed consent. Respect for persons (autonomy) requires that individuals be given the opportunity to choose what shall or shall not happen to them to the extent possible. As such, informed consent requires that information be provided to participants regarding research risks and procedures as well as its purpose. In addition to providing information, the research procedures themselves need to be designed to minimize, to the extent possible, any risks to individuals within the study (beneficence/nonmaleficence). The report also describes the need to provide information in a comprehensible manner. For participants who might be incapable of understanding all aspects of the research plan, assistance should be provided for comprehension and to aid their participation in decision-making. Voluntariness and making sure that participation is free of coercion or undue influence are key in maintaining respect for persons.
2. Assessment of risk and benefit. This process is tied in large part to the principle of beneficence—to protect subjects
from harm and to secure the individual’s well-being. The scope of risk and benefit assessment and the need for systematic evaluation in order to balance the two in an unbiased manner are addressed. Overall, they stress that studies need to strive toward a favorable benefit/risk ratio for participants.
3. Participants (recruitment). Reference is made to the underlying principle of justice. The commission highlights the need to be inclusive and allow all people to participate in research, sharing in both the burden of the research risks and in potential benefits.
In its short 10 pages, the Belmont Report lays out many of the ethical considerations that, to this day, are used in formulating regulations and in addressing new issues that develop in clinical research and in ensuring the protection of human study volunteers.

REFERENCES

1. The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, Ethical Principles and Guidelines for the Protection of Human Subjects of Research. 1979.
2. The Nuremberg Code, Trials of War Criminals before the Nuremberg Military Tribunals under Control Council Law No. 10. Washington, DC: U.S. Government Printing Office, 1949.
3. World Medical Association, Declaration of Helsinki: Ethical Principles for Medical Research Involving Human Subjects. Adopted by the 18th WMA General Assembly, Helsinki, Finland, June 1964, and amended by the 29th WMA General Assembly, Tokyo, Japan, October 1975; 35th WMA General Assembly, Venice, Italy, October 1983; 41st WMA General Assembly, Hong Kong, September 1989; 48th WMA General Assembly, Somerset West, Republic of South Africa, October 1996; and the 52nd WMA General Assembly, Edinburgh, Scotland, October 2000; 1964.
4. J. H. Jones, Bad Blood: The Tuskegee Syphilis Experiment. New York: Free Press, 1993.
5. D. Casarett, J. H. Karlawish, and J. Sugarman, Determining when quality improvement initiatives should be considered research: Proposed criteria and potential implications. JAMA 2000; 283: 2275–2280.
6. N. Fost, Ethical dilemmas in medical innovation and research: distinguishing experimentation from practice. Sem. Perinatol. 1998; 22: 223–232.
7. B. Freedman, A. Fuks, and C. Weijer, Demarcating research and treatment: a systematic approach for the analysis of the ethics of clinical research. Clin. Res. 1992; 40: 653–660.
CAROTENE AND RETINOL EFFICACY TRIAL (CARET)

MARK D. THORNQUIST
Fred Hutchinson Cancer Research Center, Seattle, Washington

The Carotene and Retinol Efficacy Trial (CARET) is a phase III chemoprevention trial sponsored by the U.S. National Cancer Institute (NCI). The primary hypothesis tested in CARET was whether the daily combination of 30 mg beta-carotene plus 25,000 International Units (IU) retinyl palmitate (an ester of retinol, also known as Vitamin A) would lower the incidence of lung cancer in individuals at high risk for the disease. Secondary hypotheses tested the effect of the study vitamins on the incidence of other cancers (with particular interest in prostate cancer), overall and cause-specific mortality (with particular interest in mortality from cardiovascular disease), and symptoms, signs, and laboratory values potentially attributable to the study vitamins. The study vitamins were chosen based on evidence from in vitro, animal, and pre-clinical human studies that suggested beta-carotene (a Vitamin A precursor) could act as an antioxidant and Vitamin A could promote cell differentiation. CARET ended intervention early because of definitive evidence of no benefit and strong probability of harm of the study vitamins (1). The trial continues to monitor the time course of study endpoints after withdrawal of the study vitamins. This article describes the trial design, execution, and results of CARET, contrasting prevention trials to clinical trials of therapeutic efficacy.

1 TRIAL DESIGN

CARET is a two-armed, double-blind trial originally conducted at six study centers in the United States with the coordinating center at the Fred Hutchinson Cancer Research Center in Seattle, Washington (Table 1). CARET tested the effect of the combination of 30 mg/day beta-carotene plus 25,000 IU/day retinyl palmitate in two populations at high risk for lung cancer:
• Heavy smoker participants, who were men and women who, on entry to the trial, were 50 to 69 years of age, had a cigarette smoking history of 20 or more pack-years, and were current smokers or had quit no more than 6 years previously;
• Asbestos-exposed participants, who were men who, on entry to the trial, were 45 to 69 years of age, were current smokers or had quit within the previous 15 years, had first occupational exposure to asbestos at least 15 years previously, and either had a chest X-ray positive for changes compatible with asbestos exposure or had been employed in a protocol-defined high-risk trade for at least 5 years as of 10 years before enrollment.
Participants were recruited primarily through mass mailings to such groups as health insurance enrollees, high-risk trade unions, and the AARP. After a screening telephone call, potentially eligible participants were invited to a run-in visit at a study center and, if eligible, entered a placebo run-in phase. Participants who demonstrated adherence to taking the placebos were randomized at a randomization visit 3 months later. Randomized participants were followed on a regular schedule (see below) for the primary endpoint of lung cancer incidence, and secondary endpoints of incidence of other cancers, incidence of six symptoms, seven signs, and four laboratory values potentially attributable to the study agents, and mortality. CARET’s primary test of efficacy was a weighted log-rank test of the incident lung cancers, with the weight increasing linearly from zero at the time of randomization to one at 2 years following randomization, then constant at one thereafter. CARET planned to recruit 4000 asbestos-exposed participants and 13,700 heavy smokers and follow them for a mean of 6 years, in which time it was projected that 487 incident lung cancers would occur. CARET had 80% power to detect if the ultimate risk reduction (from 2 years on)
was 33% in fully adherent individuals receiving the active vitamins. If the study vitamins had an effect of this magnitude in fully adherent individuals, they were projected to produce an observed 23% reduction in incidence in the trial population, after allowing for a 2-year time lag to full effect and less than full adherence in participants. Monitoring of CARET was performed by an independent data and safety monitoring board (DSMB). The formal monitoring guidelines adopted by the board included two interim analyses of the primary outcome after one-third and two-thirds of the projected weighted endpoints, using O’Brien-Fleming boundaries. No formal monitoring guidelines were set for the incidence of other outcomes. In addition, the DSMB could request conditional power calculations in the event of apparent lack of benefit (2). CARET was one of four randomized placebo-controlled prevention trials using beta-carotene launched around the same time. The other trials were
• Nutrition intervention trials in Linxian, China randomized 29,584 men and women in a one-half fraction of a 2^4 factorial design. The primary outcomes were mortality and cancer incidence (with emphasis on stomach cancer). One of the four treatments tested was a combination of 15 mg/day beta-carotene, 50 µg/day selenium, and 30 mg/day alpha-tocopherol.
• Alpha-Tocopherol Beta-Carotene (ATBC) trial in Finland randomized 29,133 male smokers in a 2 × 2 factorial design testing dosages of 20 mg/day beta-carotene and 50 mg/day alpha-tocopherol. The primary outcome was lung cancer incidence.
• Physicians Health Study (PHS) in the United States randomized 22,071 male physicians in a 2 × 2 factorial design testing 325 mg aspirin on alternate days and 50 mg beta-carotene on alternate days. The primary outcomes were cardiovascular disease and cancer incidence. The aspirin component was terminated ahead of schedule because of benefit in reducing the incidence of first myocardial infarction.
Other than the early termination of the aspirin component of the PHS, all of these trials completed their full-planned follow-up and terminated normally. The timeline of events in CARET is shown in Table 2. 2 TRIAL EXECUTION 2.1 Pilot Studies CARET was preceded by two pilot studies in Seattle, one in 1029 heavy smoker participants with the same eligibility criteria as the full-scale trial, one in 816 asbestos-exposed participants, with broader age (45–74 at enrollment) and cigarette smoking (no smoking requirement) eligibility criteria. Both trials were launched in 1985. The heavy smoker pilot study was a 2 × 2 factorial trial of betacarotene (30 mg/day vs. placebo) and retinol (25,000 IU/day vs. placebo) with primary endpoints of symptoms, signs, and laboratory values potentially attributable to the study vitamins. The asbestos-exposed pilot study was a two-armed trial of beta-carotene (15 mg/day) plus retinol (25,000 IU/day) vs. placebo, with primary outcome of cancer incidence. Both trials incorporated a 2-month placebo run-in period to evaluate adherence to the study agents, with subsequent randomization at the second study center visit. Participants were followed every 2 months, alternating telephone calls and study center visits. 2.2 Full-Scale Trial—Intervention Phase When the full-scale trial was launched in 1988, participants in both pilot studies were included in the CARET study population. Pilot participants who had received any active agents during the pilot studies had their study dosage changed to the fullscale study active agent dosage (30 mg/day beta-carotene plus 25,000 IU/day retinyl palmitate); thus three-fourths of the pilot heavy smoker population were included in the active intervention group. To achieve the planned recruitment goal, four additional study center sites were funded, with a fifth (Irvine Study Center) added in 1991. CARET ultimately randomized 4060 asbestos-exposed individuals and 14,254
heavy smokers, exceeding its recruitment goals in both populations. The participants recruited after the launch of the full-scale trial are termed the ‘‘Efficacy’’ cohort, to distinguish them from the cohort of Pilot participants. The study timeline for Efficacy participants included a runin visit at which eligibility was confirmed and the placebo run-in was begun, a randomization visit 3 months later, two telephone calls and two visits in the first year postrandomization, then two telephone calls and one study center visit annually. Continuing Pilot participants had two study center visits and two telephone calls each year. Participants who stopped taking their study vitamins continued to be followed for endpoints by telephone every six months. The data collected at these contacts are shown in Table 3. Blood was collected from CARET participants on the schedule shown in Table 4. The collected blood was centrifuged and 8 ml of serum drawn off and shipped on dry ice to the coordinating center, where it was stored in 2 ml and 0.5 ml vials at −80◦ F. Beginning in 1994, CARET performed a one-time collection of whole blood, stored at −80◦ F in two 2 ml aliquots and blood spots on cards. A second collection of whole blood in a subset of CARET participants was planned for 2004. Tissue specimens from participants diagnosed with lung cancer were requested from the treating institution for central pathology. With institutional permission, some specimens, including blocks, slides, fine needle aspirates, and bronchial washings/brushings, have been retained at the coordinating center for further analysis. 2.3 Termination of CARET Intervention In April 1994, the ATBC trial published its unexpected finding of a statistically significant adverse effect of beta-carotene on the incidence of lung cancer and overall mortality. CARET’s first interim analysis was performed on schedule in July and August 1994, and CARET’s DSMB reviewed the results in light of ATBC’s findings. The DSMB voted to continue the trial with two modifications: recruitment was stopped at the one site still recruiting participants, and an additional
interim analysis was scheduled for 1 year later. The second interim analysis was completed in August 1995, but the DSMB needed additional time to agree on the appropriate steps. Late in December 1995, after receiving advice from an external review board set up by the NCI, the DSMB recommended to the CARET investigators that the trial end its intervention; early in January 1996, the CARET Steering Committee reviewed the recommendation and voted to stop the intervention, 16 months before its planned termination. CARET’s findings were published in May 1996 (1) and are described below. CARET participants were notified of the findings, instructed to return their study vitamins to the study centers, and informed of their randomization assignment. Participants continued to be followed on their regular contact schedule until they could complete a study center visit to transition them from the intervention phase to the postintervention phase. By early 1997, the transition to post-intervention follow-up was complete. 2.4 Full-Scale Trial—Post-Intervention Phase CARET received funding to continue to follow its participants to determine the time course of endpoint incidence following the withdrawal of the study vitamins. Such follow-up can help clarify the mechanism of action of the adverse effect of the study vitamins and determine how quickly the adverse effects resolve. The formal analysis plan calls for an analysis of the effects every 5 years after the end of intervention, with primary endpoints being lung cancer incidence, cardiovascular disease mortality, and overall mortality, secondary endpoints being cancer incidence at other sites, and comparisons of interest including current vs. former smokers and men vs. women. Participants are contacted annually, at first by telephone calls performed by the study centers and now by letters sent from the Coordinating Center. All study operations were centralized at the Coordinating Center in July 2001. In the years since the end of the CARET intervention, the numbers of study endpoints have grown remarkably. By May 2003, there had been 1213 primary lung cancers in the
CARET population. This large number of events, coupled with extensive data and prediagnosis specimen collection, makes CARET a valuable resource for investigations into causes of and early markers for lung cancer.
3 CARET’S FINDINGS
At the time CARET ended the intervention, it had accumulated 388 lung cancer cases in 73,135 person-years of follow-up. Compared with those receiving placebo, those receiving the active study vitamins had a lung cancer relative risk of 1.28, with nominal P-value (i.e., not adjusted for the interim analyses’ expenditure of type I error) of 0.02 and nominal 95% confidence interval of (1.04, 1.57). The cumulative lung cancer incidence curves for those receiving the active study vitamins began separating from that for those receiving placebo at 18 months after randomization, a similar time period to ATBC. The relative risk due to the study vitamins for all-cause mortality was 1.17 [nominal 95% confidence interval of (1.03, 1.33)] and for cardiovascular disease mortality it was 1.26 [nominal 95% confidence interval of (0.99, 1.61)]. Post-hoc analysis of the effect of the interim analyses indicate that the nominal P-value would be in good agreement with the corrected P-value. The adverse effect of the study vitamins on lung cancer seemed to be primarily in current smokers; the former smokers among the CARET heavy smoker population had an active: placebo relative risk of 0.80 [95% confidence interval of (0.48, 1.31)]. Data from ATBC and PHS support the hypothesis that any adverse effect of beta-carotene on lung cancer incidence occurs primarily in current smokers. Six symptoms and seven signs possibly attributable to the study vitamins were evaluated routinely; the only one that showed consistent differences between the study groups was skin yellowing attributable to beta-carotene. Two laboratory measures, serum alkaline phosphatase and serum total triglycerides, showed slight elevations in the individuals receiving the active study vitamins (3).
4 DIFFERENCES BETWEEN PREVENTION AND THERAPEUTIC TRIALS

CARET was designed beginning in 1983, at a time when chemoprevention trials were still uncommon, and much of the design followed the model of therapeutic trial design. However, important ways exist in which primary prevention trials differ from therapeutic trials, which strongly influence design considerations. A few of these differences are discussed here.

4.1 Length of Intervention

In most settings, treatment for disease involves a brief intervention, rarely longer than a few months and often just days. In contrast, primary prevention trials need to intervene on participants for years (for prevention targeted at late-stage events) or decades (for early-stage events). As a result, the study design must account for the changes in the event rates of study outcomes, both primary risks and competing risks, that result from the aging of the study population after recruitment. Further, the lengthy intervention period increases the possibility of new relevant knowledge developing during the course of the trial, potentially affecting the scientific rationale of the trial or the willingness of the trial participants to continue in the trial. The lengthy study period alone adds to the burden on the participant, which can increase non-adherence, so efforts to maintain adherence take on great importance.

4.2 Primary Outcome

Research into potential new therapies for disease follows the standard phase sequence for clinical trials, with each phase addressing a single unique aspect of the risk/benefit ratio. Thus, therapeutic trials of treatment efficacy are conducted in phase III, by which time the maximum tolerated dose has been found and the level of toxicity of the study agent has been well characterized. In contrast, because of the extreme length of time of prevention trials, no phase I or II studies will typically exist to determine the optimal dose and the toxicity profile. Thus, both safety and efficacy need to be evaluated in a single trial, and the primary outcome for prevention trials ought to be a true test of the risk/benefit ratio.
4.3 Trial Participants

Therapeutic trials are conducted in individuals with disease and, if the therapy works, the participants regain their health as a result of the trial's activities. In contrast, primary prevention trials are conducted in healthy individuals; if the prevention agent works, no observable change in their health occurs, although if side effects of the agents exist, they cause a worsening of participants' quality of life. Thus, prevention trials are at best of no noticeable benefit and may be of slight detriment to the study participants. As a result, although therapeutic trial participants generally agree to participate with the hope of personal benefit, most prevention trial participants agree to participate for altruistic reasons. Providing more tangible benefits, or reinforcing the esteem participants derive from their altruism, is an important task in prevention trials in order to minimize non-adherence.

4.4 Dosage of Study Agents

The different natures of the trial populations in therapy vs. prevention trials dictate different goals in the selection of the dosage of the study agents. In therapy trials, the dosage of study agent is pushed to the maximum tolerated dose in the expectation that the greatest therapeutic efficacy is obtained with the greatest dose. In contrast, prevention trials require agents with minimal toxicity, even if that results in less than optimal preventative effect.

4.5 Adherence

Therapeutic trials, often conducted in clinical settings in individuals highly motivated to regain their health, can usually achieve high adherence to the treatment regimen and data collection schedule. Prevention trials, with volunteer participants with a more abstract motivation, will typically face lower adherence. Absent external factors (such as the results of other trials being published), drop-out rates are highest in participants immediately after enrollment, before participants have bonded with the study, and early drop-out causes the greatest dilution of intervention effect in intent-to-treat analyses. An additional adherence issue in prevention trials that is virtually unknown in treatment trials is drop-in, in which participants randomized to control groups take the active agents on their own; drop-in can commonly occur in prevention research because the study agents are often readily available (e.g., vitamin supplements) or are behaviors that anyone can adopt (e.g., diet). Thus, adherence maintenance is usually a much more difficult task in prevention trials.
5 CONCLUSION
In 1981, beta-carotene was widely viewed as a highly promising agent for the prevention of cancer of various sites, but particularly for lung cancer. In consequence, four chemoprevention trials involving beta-carotene were launched in three countries. The findings from CARET confirmed those from ATBC and conclusively demonstrated that the vitamin provides no short-term reduction in lung cancer risk and likely causes harm in cigarette smokers. These results demonstrate the importance of testing epidemiological associations with controlled trials before recommendations on the use of potential preventative agents are made.

REFERENCES

1. G. S. Omenn et al., Effects of a combination of beta carotene and vitamin A on lung cancer and cardiovascular disease. N. Engl. J. Med. 1996; 334: 1150–1155.
2. M. D. Thornquist et al., Statistical design and monitoring of the Carotene and Retinol Efficacy Trial (CARET). Controlled Clin. Trials 1993; 14: 308–324.
3. G. S. Omenn, G. E. Goodman, M. Thornquist, and J. D. Brunzell, Long term vitamin A does not produce clinically significant hypertriglyceridemia; results from CARET, the β-Carotene and Retinol Efficacy Trial. Cancer Epidemiol. Biomark. Prevent. 1994; 3: 711–713.
THE CENTER FOR BIOLOGICS EVALUATION AND RESEARCH
BARNEY KING Macnas Consulting International San Diego, California
The Center for Biologics Evaluation and Research (CBER) is one of six main centers of the U.S. Food and Drug Administration (FDA), an agency within the U.S. Department of Health and Human Services (HHS). CBER is the Center within the FDA that regulates biological products for human use under applicable federal laws, including the Public Health Service Act and the Federal Food, Drug and Cosmetic Act. The CBER website is located at http://www.fda.gov/cber. CBER is responsible for ensuring the safety and efficacy (including purity and potency) of a broad range of products that come under the designation of biological products (1). Biologics are isolated from a variety of natural sources—human, animal, or microorganism—and may be produced by biotechnology methods and other technologies. Biologics can be composed of sugars, proteins, or nucleic acids or complex combinations of these substances, or they may be living entities such as cells and tissues. In contrast to most drugs, which are chemically synthesized and have a known structure, most biologics are complex mixtures that are not easily identified or characterized (2).
1 HISTORY (3)
At the beginning of the twentieth century, diphtheria patients were routinely treated with antitoxin derived from the blood serum of horses. The horse serum was manufactured in local establishments without external oversight. In 1901, in St. Louis, Missouri, the blood of a tetanus-infected horse was used to prepare the diphtheria antitoxin. Administration of the tainted material resulted in the deaths from tetanus of 13 children in St. Louis. A similar event resulted in the death of nine children in Camden, New Jersey. Spurred on by these tragedies, in 1902, Congress enacted the Biologics Control Act, also known as the Virus-Toxin Law, which gave the government its first control over the processes used for the production of biological products. The Act became effective in 1903 and mandated that manufacturers of vaccines, sera, and antitoxins be licensed annually. The administration of the Act was performed by the Public Health Service Hygienic Laboratory. As part of the licensing process, production was to be supervised by a qualified scientist and manufacturing facilities were required to undergo periodic inspections. All product labels were required to include the product name, expiration date, and address and license number of the manufacturer (4). At the same time, adulteration of agricultural commodities, particularly in the meat packing industry, became the subject of great public debate, aided by the activities of muckraking journalists, such as Samuel Hopkins Adams, and the publication of The Jungle by Upton Sinclair. As a result, the Federal Food and Drugs Act was passed by Congress in 1906, creating a meat inspection law and a comprehensive food and drug law. The Act gave the Bureau of Chemistry (headed by Harvey W. Wiley) administrative responsibility and specifically prohibited the interstate transport of unlawful food and drugs under penalty of seizure of the questionable products and/or prosecution of the responsible parties. The basis of the law rested on the regulation of product labeling rather than on premarket approval. The reference standard for drugs under the Act was defined in accordance with the standards of strength, quality, and purity in the United States Pharmacopoeia and the National Formulary (4). In 1930, the Public Health Service Hygienic Laboratory became the National Institutes of Health, and in 1937, the NIH was reorganized and the Hygienic Lab became the Division of Biologics Standardization (DBS) within NIH. Another tragedy, resulting in the deaths of 107 people, many of them children, from consumption of a liquid formulation of the antibiotic sulfanilamide, called Elixir Sulfanilamide, prompted revision of the drug regulations (5).
Despite the use of the word ''Elixir'' in the name, the formulation did not contain alcohol, but rather the highly toxic compound diethylene glycol. The 1938 Food, Drug, and Cosmetic Act (FD&C Act) replaced the 1906 legislation and focused on the demonstration of safety before marketing. Subsequently, the 1902 and 1938 Acts were used to regulate biologic products. Additional authority to regulate biologics was provided in the Public Health Service Act of 1944. In the period from the 1940s through the 1960s, scientists at DBS played critical roles in the development of several important vaccines, such as pertussis, polio, and rubella. In addition, beginning in World War II and continuing into the 1970s, there was great concern about the safety of the blood supply, particularly regarding the risk of hepatitis. Evidence in the 1970s also indicated that blood obtained from commercial blood banks carried a greater risk of hepatitis transmission. This led to more careful testing and to increased regulation of blood to further protect the blood supply, increasing DBS's responsibility in this area (4). In 1972, DBS was transferred to the Food and Drug Administration (FDA, successor to the Bureau of Chemistry) as part of the Center for Drugs and Biologics, and hence, responsibility for premarket licensing authority for therapeutic agents of biological origin then resided within the FDA. On October 6, 1987, the Center for Drugs and Biologics was split into the Center for Drug Evaluation and Research (CDER) and the Center for Biologics Evaluation and Research (CBER). Paul Parkman was named Director of CBER. (Carl Peck became the first Director of CDER.) In 1988, the FDA officially became an agency within the Department of Health and Human Services. Publication of the structure of DNA by Watson and Crick in 1953 directly led to the discovery of recombinant DNA methodology and, ultimately, resulted in the development of biotechnology and the creation of the biotechnology industry. The first recombinant DNA vaccine, Hepatitis B (Recombinant), was licensed in 1986. In 2003, the FDA transferred many product oversight responsibilities from CBER to CDER (6, 7). The categories of products transferred from CBER to CDER included:
• Monoclonal antibodies for in vivo use.
• Proteins intended for therapeutic use, including cytokines (e.g., interferons), enzymes (e.g., thrombolytics), and other novel proteins, except for those that are specifically assigned to CBER (e.g., vaccines and blood products). This category includes therapeutic proteins derived from plants, animals, or microorganisms, and recombinant versions of these products.
• Immunomodulators (non-vaccine and non-allergenic products intended to treat disease by inhibiting or modifying a preexisting immune response).
• Growth factors, cytokines, and monoclonal antibodies intended to mobilize, stimulate, decrease, or otherwise alter the production of hematopoietic cells in vivo.

Then, as a result of the events of September 11, 2001 and subsequent bioterrorism threats, the Project BioShield Act of 2004 authorized the FDA to expedite its review procedures to enable rapid distribution of treatments as countermeasures to chemical, biological, and nuclear agents that might be used in a terrorist attack against the United States. This area became a specific area of responsibility of CBER (8).

2 AUTHORITY

CBER's authority resides in sections 351 and 361 of the Public Health Service Act (PHS Act) and in various sections of the FD&C Act (9). Biological products are approved for marketing under provisions of the PHS Act. However, because most biological products also meet the definition of ''drugs'' under the FD&C Act, they are also subject to regulation under FD&C Act provisions (9). CBER has also been delegated authority to regulate certain drugs closely related to biologics, such as anticoagulants packaged in plastic blood collection containers. CBER regulates these as drugs under the FD&C Act. Similarly, some medical devices used in blood banks to produce biologics are regulated by CBER under the FD&C Act's Medical Device Amendments of 1976.
Examples of such devices are automated cell separators, empty plastic containers and transfer sets, and blood storage refrigerators and freezers. The PHS Act provides the agency with a licensing mechanism to confer approval of biological products and biologics manufacturing, including authority to immediately suspend licenses in situations where danger exists to public health. This law also provides important flexibility in regulation of biotechnology products, which facilitates the introduction and development of new medicines. Section 351 of the PHS Act requires licensure of biological products that travel in interstate commerce in the United States. Section 361 of the same act allows the Surgeon General to make and enforce regulations to control the interstate spread of communicable disease. Deriving from authority provided by these legal bases, CBER publishes regulations that become part of the Code of Federal Regulations. Most regulations relating to CBER activities are contained in 21 CFR 600-690. Other regulations pertaining to the conduct of clinical trials and institutional review boards are contained in other sections of 21 CFR. In addition to these laws and regulations, CBER also publishes guidance documents. These guidances are not requirements, but they are generally followed by industry. Licensed manufacturers are expected to adopt either the guidance or an equivalent process. Guidances pertaining to CBER are located at http://www.fda.gov/cber/guidelines.htm. Other guidances relating to associated activities, e.g., submission of regulatory documents, considerations for development of products for specific diseases, statistical analyses, and so on, are located on other portions of the FDA website, http://www.fda.gov. A unique activity of CBER, which was also enabled by the PHS Act, is the preparation or procurement of products by the Center in the event of shortages and critical public health needs. The PHS Act also authorizes the creation and enforcement of regulations to prevent the introduction or spread of communicable diseases into the United States and between states (9).
3 AREAS OF REGULATION
CBER regulates an array of diverse and complex biological products, both investigational and licensed, including:

• Allergenics (10). Examples include patch tests (e.g., to diagnose the causes of contact dermatitis) and extracts that might be used to diagnose and treat rhinitis (''hay fever''), allergic sinusitis and conjunctivitis, and bee stings.
• Blood (11). Examples include blood and blood components used for transfusion (e.g., red blood cells, plasma, and platelets) and pharmaceutical products made from blood (e.g., clotting factors and immunoglobulins).
• Devices (12). Examples include medical devices and tests used to safeguard blood, blood components, and cellular products from HIV, hepatitis, syphilis, and other infectious agents, reagents used to type blood, and machines and related software used to collect blood and blood components.
• Gene Therapy (13). Examples include products that replace a person's faulty or missing genetic material for diseases, such as cancer, cystic fibrosis, heart disease, hemophilia, and diabetes.
• Human Tissues and Cellular Products (13, 14). Examples include human tissues for transplantation (e.g., skin, tendons, ligaments, and cartilage) and cellular products (e.g., human stem cells and pancreatic islets).
• Vaccines (15). Examples include vaccines used for the prevention of infectious diseases (e.g., mumps, measles, chicken pox, diphtheria, tetanus, influenza, hepatitis, smallpox, and anthrax) and treatment or prevention of noninfectious conditions (e.g., certain cancers).
• Xenotransplantation Products (16). Examples include live animal cells, tissues, or organs to treat human diseases (e.g., liver failure and diabetes).
4 ORGANIZATION
Below the Office of the Director of CBER, eight other offices provide support and regulatory oversight. These offices include the Office of Cellular, Tissue and Gene Therapies; the Office of Blood Research and Review; the Office of Vaccines Research and Review; the Office of Communication, Training and Manufacturers Assistance; the Office of Compliance and Biologics Quality (responsible for inspections); the Office of Management; the Office of Biostatistics and Epidemiology; and the Office of Information Technology. Counterterrorism and Pandemic Threat Preparedness are located within the Office of the Director. The current organizational chart for CBER can be found at http://www.fda.gov/cber/inside/org.htm.

5 PRODUCT DEVELOPMENT
The PHS Act requires individuals or companies who manufacture biologics for introduction into interstate commerce to hold a license for the products. These licenses are issued by CBER. Licensing of biologic products under the PHS Act is very similar to the new drug approval process for human drugs. After initial laboratory and animal testing, a biological product is studied in clinical trials in humans under an investigational new drug application (IND). If the data generated by the studies demonstrate that the product is safe and effective for its intended use, the data are submitted to CBER as part of a biologics license application (BLA) for review and approval for marketing. As is true for drug development generally, approval of a biological product for marketing involves balancing the benefits to be gained with the potential risks. After a license application is approved for a biological product, the product may also be subject to official lot release. As part of the manufacturing process, the manufacturer is required to perform certain tests on each lot of the product before it is released
for distribution. If the product is subject to official release by CBER, the manufacturer submits samples of each lot of product to CBER together with a release protocol showing a summary of the history of manufacture of the lot and the results of all of the manufacturer's tests performed on the lot. CBER may also perform certain confirmatory tests on lots of some products, such as viral vaccines, before releasing the lots for distribution by the manufacturer. CBER inspects manufacturing plants before it approves products and, thereafter, on a regular basis. CBER also may inspect clinical study sites to determine whether the studies are being carried out properly, and to ensure that accurate information is being submitted to the agency. CBER continues to monitor the safety and stability of biological products that have been approved. Facilities in which biologics are manufactured are also subject to a licensing process, the Establishment License Application (ELA). As a result of its responsibility in the area of bioterrorism, CBER has taken an active role in facilitating the development of potential vaccines against agents such as anthrax and smallpox that might be used in a bioterrorist attack. As noted, CBER has established a Manufacturers Assistance Branch that provides assistance to manufacturers and sponsors regarding numerous areas supporting product development, including clinical development (17).

6 PERFORMANCE

On June 30, 2003, the FDA officially transferred some of the therapeutic biological products that had been reviewed and regulated by CBER to CDER. In 2004, for the first time, each center reported biological product approvals and approval times separately. It is interesting to compare mean review times in months for biologic products approved in 2004, the first full year for reviews of biological products in each center.
Biologic category | Center | Number approved | Mean review time (months)
''New'' Therapeutic Biologics | CDER | 5 | 15.7∗
Priority New Therapeutic Biologics | CDER | 4 | 5.7
Non-Therapeutic Biological Products | CBER | 2 | 19.8
All Biological Products | CDER & CBER | | 16.8
All Biological Products (Median) | CDER & CBER | | 6.0

∗ Decreases to 5.7 months if NeutroSpec's 55.4-month review is excluded. Source: FDA, PhRMA.

Approvals for New Biologics from 1996 to 2006 by year and approval time∗ (mean in months):

Year | Number approved | Mean review time (months)
1996 | 9 | 24.5
1997 | 10 | 32.4
1998 | 9 | 13.5
1999 | 5 | 17.1
2000 | 6 | 25.8
2001 | 8 | 19.6
2002 | 9 | 30.1
2003 | 14 | 34.7
2004 | 2 | 19.8
2005 | 8 | 9.1
2006 | 7 | 16.2
2007 | 5 | 10+

∗ Beginning in 2004, these figures do not include new BLAs for therapeutic biologic products that have been transferred from CBER to CDER. Source: FDA, PhRMA.

Fast track reviews (for products for life-threatening diseases) were likewise reviewed by CBER until 2003 and by both centers since then.

CBER Fast Track Designation Request Performance Report, All Requests Received (March 1, 1998 through December 31, 2007) (18)
Number Submitted: 196∗; Goal: 60 days
Within Goal: Granted 121, Denied 70, Pending 0, Total 191 (98%)
Overdue: Granted 3, Denied 1, Pending 1, Total 5 (2%)

∗ Does not include three requests received by OTRR and pending on October 1, 2003. Final actions on those requests are counted in CDER reports.
REFERENCES

1. http://www.fda.gov/cber/about.htm.
2. http://www.fda.gov/cber/faq.htm#3.
3. J. P. Swann, Food and Drug Administration. In: G. T. Kurian (ed.), A Historical Guide to the U.S. Government. New York: Oxford University Press, 1998.
4. http://www.fda.gov/cber/inside/centscireg.htm#1902.
5. L. Bien, Frances Oldham Kelsey: FDA Medical Reviewer Leaves Her Mark on History. FDA Consumer Magazine. 2001 (March-April). Available: http://www.fda.gov/fdac/features/2001/201 kelsey.html.
6. http://www.fda.gov/cber/transfer/transfer.htm.
7. Drug and Biological Product Consolidation. Fed Reg. 2003 June 26; 68(123): 38067-8. Available: http://www.fda.gov/cber/genadmin/transconsol.pdf.
8. M. Meadows, Project BioShield: Protecting Americans From Terrorism. FDA Consumer Magazine. 2004 (November-December). http://www.fda.gov/fdac/features/2004/604 terror.html.
9. http://www.fda.gov/cber/about.htm.
10. http://www.fda.gov/cber/allergenics.htm.
11. http://www.fda.gov/cber/blood.htm.
12. http://www.fda.gov/cber/devices.htm.
13. http://www.fda.gov/cber/devices.htm.
14. http://www.fda.gov/cber/tiss.htm.
15. http://www.fda.gov/cber/vaccines.htm.
16. http://www.fda.gov/cber/xap/xap.htm.
17. http://www.fda.gov/cber/manufacturer.htm.
18. http://www.fda.gov/CbER/inside/fastrk.htm.
FURTHER READING

M. Mathieu (ed.), Biologics Development: A Regulatory Overview, 3rd ed. Waltham, MA: PAREXEL International Corporation, 2004.
CROSS-REFERENCES
Regulatory Issues
Center for Drug Evaluation and Research (CDER)
Center for Devices and Radiological Health (CDRH)
Code of Federal Regulation (CFR)
Drug Development
Establishment License Application (ELA)
Product License Application (PLA)
Food and Drug Administration (FDA, USA)
Investigational New Drug Application (IND)
New Drug Application (NDA)
THE COCHRANE COLLABORATION
MIKE CLARKE UK Cochrane Centre Oxford, United Kingdom
1 INTRODUCTION
For all but the last 100 years, decisions on how to treat patients were almost always based on personal experience, anecdotal case histories, and comparisons of a group of patients who received one treatment with an entirely separate group who did not. These processes, although subject to many biases, are still in use today, but ways to minimize these biases are now available, accepted, and more easily adopted. Among these methods is the use of the randomized trial as a means of providing more reliable estimates of the relative effects of interventions, as the only difference between the patients in the groups being compared in a randomized trial will be that of most interest: namely, the interventions under investigation. However, in part because of chance variations in the types of patients allocated to the different interventions in the randomized trial, the results of a single trial will rarely be sufficient. Most trials are too small and their results are not sufficiently robust against the effects of chance. In addition, small trials might be too focused on a particular type of patient to provide a result that can be either easily or reliably generalized to future patients. Furthermore, the amount of information about health care, including that coming from individual randomized trials, is now overwhelming. Vast amounts of information are now readily available in journals, books, magazines, the media, and, especially in recent years, on the Internet. However, people making decisions about health care, including patients, their caregivers, health care professionals, policy makers, and managers, need high-quality information. Unfortunately, much of what is available is of poor quality. To help identify which forms of health care work, which do not, and which are even harmful, results from similar randomized trials need to be brought together. Trials need to be assessed, and those that are good enough can be combined to produce both a more statistically reliable result and one that can be more easily applied in other settings. This combination of trials needs to be done in as reliable a way as possible. It needs to be systematic. A systematic review uses a predefined, explicit methodology. The methods used include steps to minimize bias in all parts of the process: identifying relevant studies, selecting them for inclusion, and collecting and combining their data. Studies should be sought regardless of their results. A systematic review does not need to contain a statistical synthesis of the results from the included studies, which might be impossible if the designs of the studies are too different for an averaging of their results to be meaningful or if the outcomes measured are not sufficiently similar. If the results of the individual studies are combined to produce an overall statistic, it is usually called a meta-analysis. A meta-analysis can also be done without a systematic review, simply by combining the results from more than one trial. However, although such a meta-analysis will have greater mathematical precision than an analysis of any one of the component trials, it will be subject to any biases that occur from the study selection process, and may produce a mathematically precise, but clinically misleading, result.
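As a small illustration of what such a statistical synthesis involves, the sketch below pools three hypothetical study results using the fixed-effect, inverse-variance method commonly used in meta-analysis: each study is weighted by the reciprocal of its squared standard error, and the weighted average forms the pooled estimate. The numbers and the use of Python are illustrative assumptions only; they are not taken from Cochrane software or from any particular review.

```python
"""Minimal fixed-effect (inverse-variance) meta-analysis sketch; hypothetical data."""
from math import sqrt

# Hypothetical studies: (effect estimate, standard error), e.g., log odds ratios.
studies = [(-0.35, 0.20), (-0.10, 0.15), (-0.25, 0.30)]

# Inverse-variance weights: more precise studies get more weight.
weights = [1.0 / se**2 for _, se in studies]

# Pooled estimate is the weighted average of the study estimates.
pooled = sum(w * est for (est, _), w in zip(studies, weights)) / sum(weights)

# Standard error of the pooled estimate.
pooled_se = sqrt(1.0 / sum(weights))

lo, hi = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled estimate = {pooled:.3f}  (95% CI {lo:.3f} to {hi:.3f})")
```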
2 THE COCHRANE COLLABORATION
The Cochrane Collaboration is the largest organization in the world engaged in the production and maintenance of systematic reviews. It has received worldwide support in its efforts to do something about the problems outlined above, by making systematic reviews accessible to people making decisions about health care (www.cochrane.org). The Collaboration aims to help people make well-informed decisions by preparing, maintaining, and promoting the accessibility of systematic reviews of the effects of interventions in all areas of health care. These reviews bring together the relevant research
findings on a particular topic, synthesize this evidence, and then present it in a standard, structured way. One of their most important attributes is that they are periodically updated to take account of new studies and other new information, to help people be confident that the systematic reviews are sufficiently current to be useful in making decisions about health care. The Cochrane Collaboration was established in 1993, founded on ideas and ideals that stem from earlier times. In October 1992, Iain Chalmers, Kay Dickersin, and Thomas Chalmers wrote an editorial in the BMJ (1) that began with the following quote from the British epidemiologist, Archie Cochrane, published in 1972: ''It is surely a great criticism of our profession that we have not organised a critical summary, by specialty or subspecialty, updated periodically, of all relevant randomised controlled trials'' (2). This editorial was published at the time of the opening of the first Cochrane Centre in Oxford, United Kingdom. This center was funded by the National Health Service Research and Development Programme in the United Kingdom ''to facilitate and coordinate the preparation and maintenance of systematic reviews of randomized controlled trials of healthcare.'' However, a clear need existed for the work to extend beyond this center, beyond the United Kingdom, and, in some circumstances, beyond randomized trials. In 1993, 77 people from 19 countries gathered at what was to become the first Cochrane Colloquium and established The Cochrane Collaboration as an international organization. Annual Cochrane Colloquia have been held since then, with the most recent being in Melbourne, Australia, in October 2005, attended by more than 700 people from 45 countries. The Cochrane Collaboration is supported by hundreds of organizations from around the world, including health service providers, research funding agencies, departments of health, international organizations, and universities. More than 14,000 people are currently contributing to the work of The Cochrane Collaboration from over 90 countries, and this involvement continues to grow.
The number of people involved has increased by about 20% year on year for each of the 5 years to 2004. The importance of involving people from low- and middle-income countries in the work of The Cochrane Collaboration is well recognized, which is reflected by the efforts of the Centres based in these countries and the steady increase in the number of people actively involved in the preparation and maintenance of Cochrane reviews, from about 300 in the year 2000 to more than 1000 in 2006. The Cochrane Collaboration has ten guiding principles:

• Collaboration, by internally and externally fostering good communications, open decision-making, and teamwork.
• Building on the enthusiasm of individuals, by involving and supporting people of different skills and backgrounds.
• Avoiding duplication, by good management and co-ordination to maximize economy of effort.
• Minimizing bias, through a variety of approaches such as scientific rigor, ensuring broad participation, and avoiding conflicts of interest.
• Keeping up to date, by a commitment to ensure that Cochrane Reviews are maintained through identification and incorporation of new evidence.
• Striving for relevance, by promoting the assessment of health care interventions using outcomes that matter to people making choices in health care.
• Promoting access, by wide dissemination of the outputs of The Cochrane Collaboration, taking advantage of strategic alliances, and by promoting appropriate prices, content, and media to meet the needs of users worldwide.
• Ensuring quality, by being open and responsive to criticism, applying advances in methodology, and developing systems for quality improvement.
• Continuity, by ensuring that responsibility for reviews, editorial processes, and key functions is maintained and renewed.
• Enabling wide participation in the work of The Cochrane Collaboration by reducing barriers to contributing and by encouraging diversity.

The work of preparing and maintaining Cochrane reviews is done by the reviewers, of whom there were more than 7000 in 2006. Very few of these are paid to work on their reviews, and the main motivation is a desire to answer reliably a question about the relative effects of interventions for people with particular conditions. The reviewers are supported by 51 Cochrane Collaborative Review Groups, which are responsible for reviews within particular areas of health and collectively provide a home for reviews in all aspects of health care. These Groups organize the refereeing of the drafts for Cochrane reviews, and the protocols that precede them, and the editorial teams in these Groups decide whether a Cochrane review should be published. As far as possible, they work with the reviewers to ensure that publication occurs, and the decision that a Cochrane review will be published depends on its quality, not its findings, which is unlike the publication process elsewhere in the health care literature where journals rarely help authors with their reports and where the decision about whether a paper will be published will often depend on the importance given to the paper in the light of its findings. The Cochrane Review Groups are based around the world and some have editorial bases in more than one country. Cochrane Methods Groups, with expertise in relevant areas of methodology; Fields or Networks, with broad areas of interest and expertise spanning the scope of many Review Groups; and a Consumer Network helping to promote the interests of users of health care also exist. The work of these Cochrane entities, and their members, is supported by 12 regional Cochrane Centres: Australasian, Brazilian, Canadian, Chinese, Dutch, German, Iberoamerican, Italian, Nordic, South African, United Kingdom, and United States. The Cochrane Collaboration Steering Group, containing elected members from the different types of entity, is responsible for setting Collaboration-wide policy and, by working
with the entities, the implementation of the Collaboration's strategic plan. One of the important ways in which activity within The Cochrane Collaboration is supported is the Collaboration's Information Management System (IMS). This system was developed initially by Update Software, the original publishing partner of The Cochrane Collaboration, before responsibility was transferred to the Nordic Cochrane Centre in Copenhagen, Denmark, where much further development has taken place over the last few years. The IMS comprises the set of software tools used to prepare and maintain Cochrane reviews and to submit these reviews for publication, and also to describe the work of each entity and to manage contact details of their members. For the Collaboration's first decade, the IMS worked mainly as standard software running on local computers, with reviewers sharing their files by disk or e-mail attachment. As the Collaboration grew and the number of reviews and the vital task of keeping these up-to-date got bigger, a better way to share these documents and information was needed. In 2001, a software needs assessment survey was conducted. Nearly all Cochrane entities and almost 500 individuals responded. The results were influential in planning the new IMS, which is being introduced between 2004 and 2007 and will increase the ability of people in The Cochrane Collaboration to work together by providing a central computer approach to the storage of documents such as the draft versions of Cochrane reviews. Cochrane reviews are published in The Cochrane Database of Systematic Reviews (CDSR). As of mid-2006, this database contains the full text of more than 2600 complete Cochrane reviews, each of which will be kept up-to-date as new evidence and information accumulates. An additional 1600 published protocols exist for reviews in progress. These protocols set out how the reviews will be done and provide an explicit description of the methods to be followed. The growth in Cochrane reviews is well illustrated by the following milestones. The first issue of CDSR, at the beginning of 1995, included 36 Cochrane reviews, with 500 in 1999, 1000 in 2001, and 2000 in 2004. Hundreds of
newly completed reviews and protocols are added each year and a few hundred existing reviews are updated so substantively that they can be considered to be the equivalent of new reviews. Several hundred Cochrane reviews are currently at earlier stages than the published protocol. The Cochrane Database of Systematic Reviews is available on the Internet and on CD-ROM as part of The Cochrane Library. This library is published by John Wiley and Sons, Ltd., and is available on a subscription basis. The establishment of national contracts means that The Cochrane Library is currently free at the point of use to everyone in Australia, Denmark, England, Finland, Ireland, Northern Ireland, Norway, and Wales. More countries are being added to this list each year. The Cochrane Collaboration also produces the Cochrane Central Register of Controlled Trials (CENTRAL), the Cochrane Database of Methodology Reviews, and the Cochrane Methodology Register, all of which are unique resources. In 1993, when the Collaboration was established, less than 20,000 reports of randomized trials could be found easily in MEDLINE, and one of the main tasks facing the Collaboration was the need to identify and make accessible information on reports of trials that might be suitable for inclusion in Cochrane reviews. This task has been accomplished through extensive programs of the hand searching of journals (in which a journal is checked from cover to cover to look for relevant reports) and of electronic searching of bibliographic databases such as MEDLINE and EMBASE. Suitable records were then added to CENTRAL, with coordination by the U.S. Cochrane Centre in Rhode Island (3, 4). By 2006, CENTRAL contained records for more than 450,000 reports of randomized (or possibly randomized) trials, many of which are not included in any other electronic database. The Cochrane Database of Methodology Reviews contains the full text for Cochrane methodology reviews, which are systematic reviews of issues relevant to the conduct of reviews of health care interventions or evaluations of health care more generally. The Cochrane Methodology Register, to a large extent, provides the raw material for the Cochrane methodology reviews,
containing more than 8000 records of research into the control of bias in health care evaluation, including both published reports and ongoing, unpublished research. Over the next few years, The Cochrane Collaboration will strive to ensure that its work is sustainable. Even with more than 4000 Cochrane reviews already underway, and results available from 2600 of these reviews, a large amount of work remains to be done. A recent estimate is that approximately 10,000 systematic reviews are needed to cover all health care interventions that have already been investigated in controlled trials. If the growth in The Cochrane Collaboration continues at the pace of the last few years, this target will be reached within the next 10 years. However, this achievement will require continuing and evolving partnership and collaboration. The Cochrane Collaboration will need to continue to attract and support the wide variety of people that contribute to its work. It will also need to work together with funders and with providers of health care to ensure that the resources needed for the work grow and the output of the work is accessible to people making decisions about health care (5).

REFERENCES

1. I. Chalmers, K. Dickersin, and T. C. Chalmers, Getting to grips with Archie Cochrane's agenda. BMJ 1992; 305: 786–788.
2. A. L. Cochrane, 1931–1971: a critical review, with particular reference to the medical profession. In: Medicines for the Year 2000. London: Office of Health Economics, 1979, pp. 1–11.
3. K. Dickersin, E. Manheimer, S. Wieland, K. A. Robinson, C. Lefebvre, and S. McDonald, Development of the Cochrane Collaboration's CENTRAL Register of controlled clinical trials. Eval. Health Professions 2002; 25: 38–64.
4. C. Lefebvre and M. J. Clarke, Identifying randomised trials. In: M. Egger, G. Davey Smith, and D. Altman (eds.), Systematic Reviews in Health Care: Meta-analysis in Context, 2nd ed. London: BMJ Books, 2001, pp. 69–86.
5. M. Clarke and P. Langhorne, Revisiting the Cochrane Collaboration. BMJ 2001; 323: 821.
COMMUNITY INTERVENTION TRIAL FOR SMOKING CESSATION (COMMIT)
SYLVAN B. GREEN Arizona Cancer Center Tucson, Arizona

The Community Intervention Trial for Smoking Cessation (COMMIT), funded by the U.S. National Cancer Institute (NCI), was a large-scale study involving 11 pairs of communities in North America, matched on geographic location (state or province), size, and sociodemographic factors. Communities were randomized within pairs to active community-based intervention versus comparison. Thus, COMMIT is an example of a group randomization design (also called a cluster randomization design), in which intact groups (clusters), rather than individuals, were randomly allocated to intervention condition. Randomization of communities was used in COMMIT to obtain an unbiased assessment of the intervention effect; randomization also provided the basis for the statistical analysis. The design of this trial has been described by the COMMIT Research Group (1) and by Gail et al. (2). The interplay between the design and statistical analysis of COMMIT has been discussed by Green et al. (3); other descriptions of issues in the design and analysis have been published separately (4, 5). The primary results of COMMIT were published in 1995 (6, 7). A general presentation of cluster randomized trials, which includes COMMIT as an example, can be found in the textbook by Donner and Klar (8). Along with the NCI, 11 research institutions participated in COMMIT, each associated with one pair of communities; the community populations ranged in size from 49,421 to 251,208.

From January to May, 1988, a baseline survey was conducted, using random-digit-dialing methods, to estimate smoking prevalence in each community and to identify separate cohorts of ''heavy'' and of ''light-to-moderate'' smokers to be followed. For this categorization of smoking level, a ''heavy'' smoker was defined as one who reported smoking 25 or more cigarettes per day. Randomization occurred in May 1988. After community organization and mobilization, the active intervention took place over a 4-year period (1989 to 1992), with final outcome data collection in 1993. The objective of COMMIT was to establish and assess a cooperative program of smoking cessation strategies that would (1) reach and aid smokers (with a particular interest in heavy smokers) in achieving and maintaining cessation of cigarette smoking; and (2) work through communities, using existing channels, organizations, and institutions capable of influencing smoking behavior in large groups of people. The goal of the COMMIT intervention has been described as the creation of a persistent and inescapable network of cessation influences. Thought of another way, it was to act as a ''catalyst'' for smoking cessation activities in the community. The principal intervention channels were media and public education, health care providers, worksites (and community organizations), and cessation resources (1, 6). The assumption was that the combination would be more effective than the sum of its individual components. The primary hypothesis of COMMIT was that the community-level, multi-channel, 4-year intervention would increase quit rates among cigarette smokers. ''Quit rate'' was defined as the fraction of cohort members who had achieved and maintained cessation for at least 6 months at the end of the trial. Endpoint cohorts totaling 10,019 heavy smokers and 10,328 light-to-moderate smokers, between 25 and 64 years of age, were followed by telephone (6). As a secondary outcome, COMMIT investigated whether the intervention would decrease the prevalence of adult cigarette smoking; baseline (1988) and final (1993) telephone surveys sampled households to determine changes in smoking prevalence (7). However, it was recognized during the design of COMMIT that greater statistical power would be obtained by using quit rates among the cohorts as the primary outcome measure.
Research Institution | Intervention Community | Comparison Community
American Health Found., NY | Yonkers | New Rochelle
F. Hutchinson Cancer Ctr., WA | Bellingham | Longview-Kelso
Kaiser Permanente, CA | Vallejo | Hayward
Lovelace Institutes, NM | Santa Fe | Las Cruces
Oregon Research Inst., OR | Medford-Ashland | Albany-Corvallis
Research Triangle Inst., NC | Raleigh | Greensboro
Roswell Park Cancer Inst., NY | Utica | Binghamton-Johnson City
Univ. of Iowa, IA | Cedar Rapids | Davenport
Univ. of Massachusetts, MA | Fitchburg-Leominster | Lowell
Univ. of Med. & Dent., NJ | Paterson | Trenton
Univ. of Waterloo and McMaster Univ., Ont. | Brantford | Peterborough

Sample size calculations for group randomized trials need to consider the extra
source of variation resulting from the inherent heterogeneity across groups (8). Expressed another way, the outcomes of individuals within a group are generally correlated, as measured by the intraclass (intracluster) correlation. It is important to have an adequate number of randomized groups to account for the between-group variation. Often, in practice, it is easier to obtain a reasonably large number of individuals per group than it is to obtain a large number of groups, so the latter becomes the factor controlling (and perhaps limiting) the power of the trial. Such considerations formed the basis for the design of COMMIT (2, 3). As with the sample size calculations, the method of analysis for a group randomized trial must account for the correlation of individuals within a group (8). COMMIT used randomization-based inference, in which the outcome data were analyzed many times (once for each acceptable assignment that could have been employed, according to the randomization process) and then compared with the observed result, without dependence on additional distributional or model-based assumptions. Thus, this approach was robust; hypothesis testing (randomization tests or permutation tests) and corresponding test-based confidence intervals were designed for the group-randomized data, based specifically on the randomization distribution (3, 4). To perform the randomization test, the assignment of the communities to intervention condition (for analysis purposes) was permuted within the pairs. This test accounts for the fact that communities (rather than individuals) were randomized, and that this randomization was performed within community pairs. Specifically, the procedure involves estimating some quantity for each pair, for example, the difference in proportion of smokers quitting between the intervention and comparison community of a pair. The expected value of the mean difference is 0 under the null hypothesis of no intervention effect. For hypothesis testing, one wants to know the probability that an estimate of the mean would be as large as, or larger than, the observed estimate of the mean, by chance alone. The mean was calculated for each of the 2^J ways (permutations) that the intervention assignments could have occurred in this pair-matched design; for COMMIT, 2^11 = 2,048 such permutations existed. The rank of the observed outcome among all possible outcomes provided the one-tailed significance level; for example, a value in the top 5% would be considered significant at the 0.05 level. The primary analysis of the COMMIT cohort data (6) showed that the mean heavy smoker quit rate (fraction of cohort members who had achieved and maintained cessation at the end of the trial) was 0.180 for intervention communities versus 0.187 for comparison communities, a nonsignificant difference (one-sided P = 0.68 by permutation test); 90% test-based confidence interval for the difference = [−0.031, 0.019]. That this community-based intervention did not affect heavy smokers was disappointing. However, the corresponding light-to-moderate smoker quit rates were 0.306 for intervention communities versus 0.275 for comparison communities; the difference was significant
(one-sided P = 0.004), with 90% confidence interval = [0.014, 0.047]. Also, it was found that smokers in intervention communities had greater perceived exposure to smoking control activities, which correlated with outcome only for light-to-moderate smokers (6). The impact of this community-based intervention on light-to-moderate smokers, although modest, has potential public health importance. Subsidiary analyses were performed using baseline covariates to adjust the analysis of intervention effect. Individual-level baseline covariates were incorporated in a logistic regression model to predict outcome under the null hypothesis of no intervention effect (i.e., no intervention term in the model). For COMMIT, the logistic model included a separate intercept for each community pair, and various baseline covariates predictive of quitting (e.g., demographic, smoking history, household variables) selected using a stepdown regression approach. Then residuals between observed and predicted rates were calculated, and differences in such residuals between intervention and comparison communities were analyzed using a permutation test to evaluate an adjusted intervention effect. Such a permutation test using community mean residuals is a valid test of the null hypothesis even if the model is misspecified (4). In COMMIT, the adjusted and unadjusted analyses gave similar results (6). COMMIT also included separate analyses in subgroups defined by baseline covariates specified a priori. Statistical tests for interaction were performed to investigate whether the intervention effect differed according to covariate value. Such interaction tests are of interest because of the real risk of finding spurious differences in subgroups by chance alone. Little difference existed between males and females in the cohorts in the effect of the COMMIT intervention: no benefit for heavy smokers of either sex, but an additional 3% quitting among light-to-moderate smokers of both sexes. However, among lightto-moderate smokers, the less educated subgroup appeared more responsive to the intervention than college-educated smokers (6).
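To make the mechanics of the pair-matched randomization test described above concrete, the sketch below (written in Python, with entirely hypothetical quit rates rather than COMMIT data) enumerates all 2^11 = 2,048 within-pair assignments by flipping the sign of each pair's quit-rate difference and ranks the observed mean difference against that randomization distribution to obtain a one-sided significance level. The covariate-adjusted analysis proceeds the same way, except that community mean residuals from the null-model logistic regression replace the raw quit-rate differences.

```python
"""Illustrative pair-matched permutation test; quit rates below are hypothetical."""
from itertools import product

# Hypothetical (intervention, comparison) quit rates for 11 matched community pairs.
pairs = [
    (0.31, 0.27), (0.29, 0.28), (0.33, 0.30), (0.28, 0.26), (0.30, 0.27),
    (0.32, 0.31), (0.27, 0.28), (0.31, 0.29), (0.30, 0.26), (0.29, 0.27),
    (0.34, 0.30),
]

# Observed statistic: mean within-pair difference (intervention minus comparison).
diffs = [i - c for i, c in pairs]
observed = sum(diffs) / len(diffs)

# Randomization distribution: flip the sign of each pair's difference for every
# possible within-pair assignment (2^11 = 2,048 permutations in total).
perm_means = []
for signs in product([1, -1], repeat=len(diffs)):
    perm_means.append(sum(s * d for s, d in zip(signs, diffs)) / len(diffs))

# One-sided p-value: proportion of permutations whose mean is at least as large
# as the mean actually observed.
p_one_sided = sum(m >= observed for m in perm_means) / len(perm_means)

print(f"observed mean difference = {observed:.4f}")
print(f"number of permutations   = {len(perm_means)}")
print(f"one-sided p-value        = {p_one_sided:.4f}")
```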
Analysis of the cross-sectional survey data (7) showed favorable secular trends in smoking prevalence in intervention and in comparison communities. No intervention effect occurred on heavy smoking prevalence (ages 25 to 64 years), which decreased by 2.9 percentage points both in intervention and in comparison communities. Overall smoking prevalence (ages 25 to 64) decreased 3.5 percentage points in intervention communities versus 3.2 in comparison communities, a difference not statistically significant. However, these results were consistent with the cohort analysis, although the more powerful cohort design showed a statistically significant intervention effect on light-to-moderate smokers, as noted above. Based on sound principles of experimental design, COMMIT allowed a rigorous evaluation of the intervention (3).

REFERENCES

1. COMMIT Research Group, Community Intervention Trial for Smoking Cessation (COMMIT): summary of design and intervention. J. Natl. Cancer Inst. 1991; 83: 1620–1628.
2. M. H. Gail, D. P. Byar, T. F. Pechacek, and D. K. Corle, for the COMMIT Study Group, Aspects of statistical design for the Community Intervention Trial for Smoking Cessation (COMMIT). Controlled Clin. Trials 1992; 13: 6–21 [Published Erratum. Controlled Clin. Trials 1993; 14: 253–254].
3. S. B. Green et al., Interplay between design and analysis for behavioral intervention trials with community as the unit of randomization. Amer. J. Epidemiol. 1995; 142: 587–593.
4. M. H. Gail, S. D. Mark, R. J. Carroll, S. B. Green, and D. Pee, On design considerations and randomization-based inference for community intervention trials. Stat. Med. 1996; 15: 1069–1092.
5. L. S. Freedman, M. H. Gail, S. B. Green, and D. K. Corle, for the COMMIT Research Group, The efficiency of the matched-pairs design of the Community Intervention Trial for Smoking Cessation (COMMIT). Controlled Clin. Trials 1997; 18: 131–139.
6. COMMIT Research Group, Community Intervention Trial for Smoking Cessation (COMMIT): I. Cohort results from a four-year community intervention. Amer. J. Public Health 1995; 85: 183–192.
7. COMMIT Research Group, Community Intervention Trial for Smoking Cessation (COMMIT): II. Changes in adult cigarette smoking prevalence. Amer. J. Public Health 1995; 85: 193–200.
8. A. Donner and N. Klar, Design and Analysis of Cluster Randomization Trials in Health Research. London: Arnold, 2000.
THE FDA AND REGULATORY ISSUES
W. JANUSZ RZESZOTARSKI Food and Drug Administration, Rockville, Maryland, USA

The views expressed are my own and do not necessarily represent those of, nor imply endorsement from, the Food and Drug Administration or the U.S. Government.
1 CAVEAT
The reader is reminded that all the information that is provided in this chapter is freely available on the Web from the government and other sources and subject to change. Instead of a bibliography, only the hyperlink sources are included in text, and the reader is advised to check them frequently.

2 INTRODUCTION
‘‘The problem is to find a form of association which will defend and protect with the whole common force the person and goods of each associate, and in which each, while uniting himself with all, may still obey himself alone, and remain as free as before.’’ So wrote Jean Jacques Rousseau, Citizen of Geneva, in The Social Contract or Principles of Political Right (1762) (http://www.blackmask.com/ books10c/socon.htm). His teachings were well known to the Founding Fathers. The Miracle at Philadelphia, the Constitutional Convention of May–September 1787, so gloriously described by Catherine Drinker Bowen, established federalism in the United States and provided for regulation of commerce between the states. The progress of federalism was slow, and a trigger was needed. On the 25th of April 1846, Mexican troops crossed the Rio Grande to attack U.S. dragoons; this provided an excuse for the U.S.-Mexican war of 1846–1848. The state of medical support for the U.S. troops in Mexico was appalling. The drugs imported for them, counterfeited. In reaction to these events the U.S. Congress passed
the Drug Importation Act of 1848, considered by many as the cornerstone of drug regulation in the United States. The act itself required U.S. Customs Service (already in existence) inspection to stop entry of adulterated drugs from overseas. If one subscribes to George F. Will's precept: ''We are not a democracy, we are a republic,'' the ensuing train of legislative endeavor provides for a fascinating story of continuous interaction between the governing and the governed. Therefore, the Food and Drug Administration exists by the mandate of the U.S. Congress, with the Food, Drug & Cosmetic Act (http://www.fda.gov/opacom/laws/fdcact/fdctoc.htm) as the principal law to enforce. The Act, and the regulations developed by the Agency based on it, constitute the basis of the drug approval process (http://www.fda.gov/cder/regulatory/applications/default.htm). The name Food and Drug Administration is relatively new. In 1931 the Food, Drug, and Insecticide Administration, then part of the U.S. Department of Agriculture, was renamed the Food and Drug Administration (http://www.fda.gov).

3 CHRONOLOGY OF DRUG REGULATION IN THE UNITED STATES

The history of food and drug law enforcement in the United States and the consecutive modifications of the 1906 Act are summarized below (from http://www.fda.gov/cder/about/history/time1.htm).

– 1820 Eleven physicians meet in Washington, DC, to establish the U.S. Pharmacopeia, the first compendium of standard drugs for the United States.
– 1846 Publication of Lewis Caleb Beck's Adulteration of Various Substances Used in Medicine and the Arts helps document problems in the drug supply.
– 1848 Drug Importation Act passed by Congress requires U.S. Customs Service inspection to stop entry of adulterated drugs from overseas.
– 1903 Lyman F. Kebler, M.D., Ph.C., assumes duties as Director of the Drug Laboratory, Bureau of Chemistry.
– 1905 Samuel Hopkins Adams' 10-part exposé of the patent medicine industry, ''The Great American Fraud,'' begins in Collier's. The American Medical Association, through its Council on Pharmacy and Chemistry, initiates a voluntary program of drug approval that would last until 1955. To earn the right to advertise in AMA and related journals, companies submitted evidence, for review by the Council and outside experts, to support their therapeutic claims for drugs.
– 1906 The original Food and Drugs Act is passed by Congress on June 30 and signed by President Theodore Roosevelt. It prohibits interstate commerce in misbranded and adulterated foods and drugs. The Meat Inspection Act is passed the same day. Shocking disclosures of unsanitary conditions in meatpacking plants, the use of poisonous preservatives and dyes in foods, and cure-all claims for worthless and dangerous patent medicines were the major problems leading to the enactment of these laws.
– 1911 In U.S. versus Johnson, the Supreme Court rules that the 1906 Food and Drugs Act does not prohibit false therapeutic claims but only false and misleading statements about the ingredients or identity of a drug.
– 1912 Congress enacts the Sherley Amendment to overcome the ruling in U.S. versus Johnson. It prohibits labeling medicines with false therapeutic claims intended to defraud the purchaser, a standard difficult to prove.
– 1914 The Harrison Narcotic Act imposes upper limits on the amount of opium, opium-derived products, and cocaine allowed in products available to the public; requires prescriptions for products exceeding the allowable limit of narcotics; and mandates increased record keeping for physicians and pharmacists that dispense narcotics. A separate law dealing with marihuana would be enacted in 1937.
– 1933 FDA recommends a complete revision of the obsolete 1906 Food and Drugs Act. The first bill is introduced into the Senate, launching a 5-year legislative battle. FDA assembles a graphic display of shortcomings in pharmaceutical and other regulation under the 1906 act, dubbed by one reporter as the Chamber of Horrors and exhibited nationwide to help draw support for a new law.
– 1937 Elixir Sulfanilamide, containing the poisonous solvent diethylene glycol, kills 107 persons, many of whom are children, dramatizing the need to establish drug safety before marketing and to enact the pending food and drug law.
– 1938 The Federal Food, Drug, and Cosmetic Act of 1938 is passed by Congress, containing new provisions:
  – Requiring new drugs to be shown safe before marketing; starting a new system of drug regulation.
  – Eliminating the Sherley Amendment requirement to prove intent to defraud in drug misbranding cases.
  – Extending control to cosmetics and therapeutic devices.
  – Providing that safe tolerances be set for unavoidable poisonous substances.
  – Authorizing standards of identity, quality, and fill-of-container for foods.
  – Authorizing factory inspections.
  – Adding the remedy of court injunctions to the previous penalties of seizures and prosecutions.
  – Under the Wheeler-Lea Act, the Federal Trade Commission is charged to oversee advertising associated with products, including pharmaceuticals, otherwise regulated by FDA.
  – FDA promulgates the policy in August that sulfanilamide and selected other dangerous drugs must be administered under the direction of a qualified expert, thus launching the requirement for prescription-only (non-narcotic) drugs.
of a qualified expert, thus launching the requirement for prescription-only (non-narcotic) drugs.
– 1941 Insulin Amendment requires FDA to test and certify purity and potency of this life-saving drug for diabetes. Nearly 300 deaths and injuries result from distribution of sulfathiazole tablets tainted with the sedative phenobarbital. The incident prompts FDA to revise manufacturing and quality controls drastically, the beginning of what would later be called good manufacturing practices (GMPs).
– 1945 Penicillin Amendment requires FDA testing and certification of safety and effectiveness of all penicillin products. Later amendments would extend this requirement to all antibiotics. In 1983 such control would be found no longer needed and would be abolished.
– 1948 Supreme Court rules in U.S. versus Sullivan that FDA's jurisdiction extends to retail distribution, thereby permitting FDA to interdict in pharmacies illegal sales of drugs, the most problematical being barbiturates and amphetamines.
– 1951 Durham-Humphrey Amendment defines the kinds of drugs that cannot be used safely without medical supervision and restricts their sale to prescription by a licensed practitioner.
– 1952 In U.S. versus Cardiff, the Supreme Court rules that the factory inspection provision of the 1938 FDC Act is too vague to be enforced as criminal law. A nationwide investigation by FDA reveals that chloramphenicol, a broad-spectrum antibiotic, has caused nearly 180 cases of often fatal blood diseases. Two years later, the FDA would engage the American Society of Hospital Pharmacists, the American Association of Medical Record Librarians, and later the American Medical Association in a voluntary program of drug reaction reporting.
– 1953 Factory Inspection Amendment clarifies previous law and requires FDA to give manufacturers written reports of conditions observed during inspections and analyses of factory samples.
– 1955 HEW Secretary Oveta Culp Hobby appoints a committee of 14 citizens to study the adequacy of FDA's facilities and programs. The committee recommends a substantial expansion of FDA staff and facilities, a new headquarters building, and more use of educational and informational programs.
– 1962 Thalidomide, a new sleeping pill, is found to have caused birth defects in thousands of babies born in western Europe. News reports on the role of Dr. Frances Kelsey, FDA medical officer, in keeping the drug off the U.S. market arouse public support for stronger drug regulation. Kefauver-Harris Drug Amendments are passed to ensure drug efficacy and greater drug safety. For the first time, drug manufacturers are required to prove to the FDA the effectiveness of their products before marketing them. In addition, the FDA is given closer control over investigational drug studies, FDA inspectors are granted access to additional company records, and manufacturers must demonstrate the efficacy of products approved before 1962.
– 1963 Advisory Committee on Investigational Drugs meets, the first meeting of a committee to advise FDA on product approval and policy on an ongoing basis.
– 1965 Drug Abuse Control Amendments are enacted to deal with problems caused by abuse of depressants, stimulants, and hallucinogens.
– 1966 FDA contracts with the National Academy of Sciences/National Research Council to evaluate the effectiveness of 4000 drugs approved on the basis of safety alone between 1938 and 1962.
– 1968 FDA Bureau of Drug Abuse Control and the Treasury Department's Bureau of Narcotics are transferred to the Department of Justice to form the Bureau of Narcotics and Dangerous Drugs (BNDD), consolidating efforts to police traffic in abused drugs. A reorganization of BNDD in 1973 formed the Drug Enforcement Administration.
– The FDA forms the Drug Efficacy Study Implementation (DESI) to incorporate the recommendations of a National Academy of Sciences investigation of effectiveness of drugs marketed between 1938 and 1962.
– Animal Drug Amendments place all regulation of new animal drugs under one section of the Food, Drug, and Cosmetic Act—Section 512—making approval of animal drugs and medicated feeds more efficient.
– 1970 In Upjohn versus Finch, the Court of Appeals upholds enforcement of the 1962 drug effectiveness amendments by ruling that commercial success alone does not constitute substantial evidence of drug safety and efficacy.
– The FDA requires the first patient package insert: oral contraceptives must contain information for the patient about specific risks and benefits.
– The Comprehensive Drug Abuse Prevention and Control Act replaces previous laws and categorizes drugs based on abuse and addiction potential vis-à-vis therapeutic value.
– 1972 Over-the-Counter Drug Review initiated to enhance the safety, effectiveness, and appropriate labeling of drugs sold without prescription.
– 1973 The U.S. Supreme Court upholds the 1962 drug effectiveness law and endorses FDA action to control entire classes of products by regulations rather than to rely only on time-consuming litigation.
– 1976 Vitamins and Minerals Amendments ("Proxmire Amendments") stop the FDA from establishing standards limiting potency of vitamins and minerals in food supplements or regulating them as drugs based solely on potency.
– 1982 Tamper-resistant packaging regulations issued by the FDA to prevent poisonings such as deaths from cyanide placed in Tylenol capsules. The Federal Anti-Tampering Act passed in 1983 makes it a crime to tamper with packaged consumer products.
– 1983 Orphan Drug Act passed, enabling FDA to promote research and marketing of drugs needed for treating rare diseases.
– 1984 Drug Price Competition and Patent Term Restoration Act expedites the availability of less costly generic drugs by permitting the FDA to approve applications to market generic versions of brand-name drugs without repeating the research done to prove them safe and effective. At the same time, the brand-name companies can apply for up to 5 years of additional patent protection for the new medicines they developed to make up for time lost while their products were going through the FDA's approval process.
– 1987 The FDA revises investigational drug regulations to expand access to experimental drugs for patients with serious diseases with no alternative therapies.
– 1988 The Prescription Drug Marketing Act bans the diversion of prescription drugs from legitimate commercial channels. Congress finds that the resale of such drugs leads to the distribution of mislabeled, adulterated, subpotent, and counterfeit drugs to the public. The new law requires drug wholesalers to be licensed by the states; restricts reimportation from other countries; and bans sale, trade, or purchase of drug samples and traffic or counterfeiting of redeemable drug coupons.
– 1991 The FDA publishes regulations to accelerate reviews of drugs for life-threatening diseases.
– 1992 Generic Drug Enforcement Act imposes debarment and other penalties for illegal acts involving abbreviated drug applications. The Prescription Drug User Fee Act requires drug and biologics manufacturers to pay fees for product applications and supplements and other services. The act also requires the FDA to use these funds to hire more reviewers to assess applications.
– 1994 The FDA announces it could consider regulating nicotine in cigarettes as
a drug, in response to a citizen's petition by the Coalition on Smoking and Health. The Uruguay Round Agreements Act extends the patent terms of U.S. drugs from 17 to 20 years.
– 1995 The FDA declares cigarettes to be "drug delivery devices." Restrictions are proposed on marketing and sales to reduce smoking by young people.
– 1997 The Food and Drug Administration Modernization Act (FDAMA) re-authorizes the Prescription Drug User Fee Act of 1992 and mandates the most wide-ranging reforms in agency practices since 1938. Provisions include measures to accelerate review of devices, provisions on advertising of unapproved uses of approved drugs and devices, health claims for foods in agreement with published data from a reputable public health source, and development of good guidance practices for agency decision-making. The fast track provisions are intended to speed up the development and approval review process for a "drug intended for the treatment of a serious or life-threatening condition [that] demonstrates the potential to address unmet medical needs for such a condition."
As an agency of the U.S. Government, the FDA does not develop, manufacture, or test drugs. Drug approval is based entirely on the sponsor's (manufacturer's) reports of a drug's studies, so that the appropriate Center can evaluate the data. The evaluation of submitted data allows the Center reviewers (1) to establish whether the drug submitted for approval works for the proposed use, (2) to assess the benefit-to-risk relationship, and (3) to determine whether the drug will be approved. The approval of low molecular weight molecular entities rests within CDER's authority (http://www.fda.gov/cder/) and is the subject of this chapter. An analogous center, CBER, regulates biological products such as blood, vaccines, therapeutics, and related drugs and devices (http://www.fda.gov/cber/). The reader interested in other centers or aspects of FDA activities is advised to visit the appropriate sites. In general outline, the drug approval process is divided into the Investigational New Drug (IND) application process (with its phases representing a logical and safe process of drug development), New Drug Approval, and post-approval activities. For as long as an approved drug remains on the market, all aspects pertinent to its safety are under constant scrutiny by the FDA.
4 FDA BASIC STRUCTURE
Employing over 9000 people, the FDA has a structure that reflects the tasks at hand and consists of a number of centers and offices:
Center for Biologics Evaluation and Research (CBER)
Center for Devices and Radiological Health (CDRH)
Center for Drug Evaluation and Research (CDER)
Center for Food Safety and Applied Nutrition (CFSAN)
Center for Veterinary Medicine (CVM)
National Center for Toxicological Research (NCTR)
Office of the Commissioner (OC)
Office of Regulatory Affairs (ORA)
5 IND APPLICATION PROCESS
The tests carried out in the pre-clinical investigation of a potential drug serve to determine whether the new molecule has the desired pharmacological activity and is reasonably safe to be administered to humans in limited, early-stage clinical studies. Before any new drug under pre-clinical investigation is administered to patients to determine its value as a therapeutic or diagnostic, the drug's sponsor must obtain permission from the FDA through the IND process (http://www.fda.gov/cder/regulatory/applications). By definition, a sponsor is a person or entity who assumes responsibility for compliance with applicable provisions of the Federal Food, Drug, and Cosmetic Act (the FDC Act) and related regulations and initiates a clinical investigation. A sponsor could be an individual, partnership, corporation, government agency, manufacturer, or
scientific institution. In a way, the IND is an exemption from the legal requirement that only drugs approved for marketing may be transported or distributed across state lines. Although not approved, the molecule has to conform to specific requirements under the Federal Food, Drug, and Cosmetic Act as interpreted by the Code of Federal Regulations (CFR). The CFR, a codification of the general and permanent regulations published in the Federal Register by the Executive departments and agencies, provides detailed information on the requirements for each step of the approval process (http://www.access.gpo.gov/nara/cfr/waisidx 98/21cfr312 98.html). The Federal Register is an additional important source of information on the regulations the FDA proposes and the notices it issues. A sponsor wishing to submit an IND is assisted and guided by a number of regulatory mechanisms and documents created to secure the uniformity of applications and to guarantee consistency of the review process. The logical development of information and guidance is as follows: (1) from the Federal Food, Drug, and Cosmetic Act to (2) the Code of Federal Regulations to (3) the available guidance documents issued by CDER/CBER and the International Conference on Harmonization (ICH). In their review process, the FDA reviewers also depend on Manuals of Policies and Procedures (MAPPs), which constitute approved instructions for internal practices and procedures followed by CDER staff. MAPPs are meant to help standardize the new drug review process and other activities and are available to the public (http://www.fda.gov/cder/mapp.htm).
5.1 Types of IND
The CFR does not differentiate between the "commercial" and "non-commercial," "research," or "compassionate" IND. The three general "types" of INDs below are often mentioned, but again the nomenclature used is not recognized by 21 CFR 312.3. The term Commercial IND is defined in CDER's MAPP 6030.1 as: "An IND for which the sponsor is usually either a corporate entity or one of the institutes of the National Institutes of Health (NIH). In addition, CDER may designate other INDs as commercial
if it is clear the sponsor intends the product to be commercialized at a later date" (http://www.fda.gov/cder/mapp/6030–1.pdf). The term Screening IND is defined in CDER's MAPP 6030.4 (http://www.fda.gov/cder/mapp/6030–4.pdf) as "A single IND submitted for the sole purpose of comparing the properties of closely related active moieties to screen for the preferred compounds or formulations. These compounds or formulations can then become the subject of additional clinical development, each under a separate IND." The same applies to the FDA's fast track programs, which originate from section 112(b), "Expediting Study and Approval of Fast Track Drugs," of the Food and Drug Administration Modernization Act (FDAMA) of 1997. The FDAMA amendments of the Act are designed to facilitate the development and expedite the review of new drugs that are intended to treat serious or life-threatening conditions and that demonstrate the potential to address unmet medical needs (fast track products).
5.1.1 An Investigator IND. An investigator is an individual who conducts a clinical investigation or is a responsible leader of a team of investigators. A sponsor is a person who takes responsibility for and initiates a clinical investigation; a sponsor may be a person or an organization, company, university, etc. A sponsor-investigator is a physician who both initiates and conducts an investigation, and under whose immediate direction the investigational drug is administered or dispensed. A physician might submit a research IND to propose studying an unapproved drug, or an approved product for a new indication or in a new patient population. The investigator's name appears on the Investigational New Drug Application forms (Forms FDA 1571 and 1572) as the name of the person responsible for monitoring the conduct and progress of the clinical investigations.
5.1.2 Emergency Use IND. Emergency use of an investigational new drug (21 CFR 312.36) allows the FDA to authorize use of an experimental drug in an emergency situation that does not allow time for submission of an IND in accordance with the
Code of Federal Regulations. It is also used for patients who do not meet the criteria of an existing study protocol or when an approved study protocol does not exist.
5.1.3 Treatment IND. A Treatment IND (21 CFR 312.34) is submitted for experimental drugs showing promise in clinical testing for serious or immediately life-threatening conditions while the final clinical work is conducted and the FDA review takes place. An immediately life-threatening disease means a stage of a disease in which there is a reasonable likelihood that death will occur within a matter of months or in which premature death is likely without early treatment. For example, advanced cases of AIDS, herpes simplex encephalitis, and subarachnoid hemorrhage are all considered to be immediately life-threatening diseases. Treatment INDs are made available to patients before general marketing begins, typically during phase III studies. Treatment INDs also allow the FDA to obtain additional data on the drug's safety and effectiveness (http://www.fda.gov/cder/handbook/treatind.htm).
5.2 Parallel Track
Another mechanism used to permit wider availability of experimental agents is the "parallel track" policy (Federal Register of May 21, 1990) developed by the U.S. Public Health Service in response to AIDS. Under this policy, patients with AIDS whose condition prevents them from participating in controlled clinical trials can receive investigational drugs shown in preliminary studies to be promising.
5.3 Resources for Preparation of IND Applications
As listed above, to assist in the preparation of an IND, numerous resources are available on the Web to provide the sponsor with (1) the legal requirements of an IND application, (2) assistance from CDER/CBER to help meet those requirements, and (3) internal IND review principles, policies, and procedures.
5.3.1 Pre-IND Meeting. In addition to all documents available on the Web, under the
FDAMA provisions and the resulting guidances (http://www.fda.gov/cder/fdama/default.htm), a sponsor can request various types of meetings with the FDA to facilitate the review and approval process. Pre-IND meetings (21 CFR 312.82) belong to the type B meetings and should occur with the division of CDER responsible for the review of the given drug's therapeutic category within 60 days from when the Agency receives a written request. The list of questions and the information submitted to the Agency in the Information Package should be of sufficient pertinence and quality to permit a productive meeting.
5.3.2 Guidance Documents. Guidance documents representing the Agency's current thinking on a particular subject can be obtained from the Web (http://www.fda.gov/cder/guidance/index.htm) or from the Office of Training and Communications, Division of Communications Management (http://www.fda.gov/cder/dib/dibinfo.htm). One should remember that the guidance documents merely provide direction and are not binding on either party. The Guidance for Industry "Content and Format of Investigational New Drug Applications (INDs) for Phase I Studies of Drugs, Including Well-Characterized, Therapeutic, Biotechnology-derived Products" (http://www.fda.gov/cder/guidance/index.htm) is a place to start. This particular Guidance, based on 21 CFR 312, provides a detailed clarification of CFR requirements for the data and data presentation to be included in the initial phase I IND document, permitting its acceptance by the Agency for review.
5.3.3 Information Submitted with IND. To be acceptable for review by the FDA, the IND application must include the following groups of information.
5.3.3.1 Animal Pharmacology and Toxicology Studies. Pre-clinical data to permit an assessment as to whether the product is reasonably safe for initial testing in humans. Also included are the results of any previous experience with the drug in humans (often foreign use).
5.3.3.2 Manufacturing Information. Information pertaining to the composition, manufacturer, stability, and controls used for
manufacturing the drug substance and the drug product. This information is assessed to ensure that the company can adequately produce and supply consistent batches of the drug.
5.3.3.3 Clinical Protocols and Investigator Information. Detailed protocols for the proposed clinical studies need to be provided to assess whether the initial-phase trials will expose subjects to unnecessary risks. In addition, information is needed on the qualifications of the clinical investigators, the professionals (generally physicians) who oversee the administration of the experimental compound, to assess whether they are qualified to fulfill their clinical trial duties. Finally, commitments to obtain informed consent from the research subjects, to obtain review of the study by an institutional review board (IRB), and to adhere to the investigational new drug regulations are also required. Once the IND is submitted, the sponsor must wait 30 calendar days before initiating any clinical trials. During this time, the FDA has an opportunity to review the IND for safety to assure that research subjects will not be subjected to unreasonable risk.
5.4 The First Step, the Phase I IND Application
The content of the Phase I IND Application (http://www.fda.gov/cder/guidance/index.htm) must include the following:
a. FDA Forms 1571 (IND Application) and 1572 (Statement of Investigator)
b. Table of Contents
c. Introductory Statement and General Investigational Plan
d. Investigator's Brochure
e. Protocols
f. Chemistry, Manufacturing, and Control (CMC) Information
g. Pharmacology and Toxicology Information
h. Previous Human Experience with the Investigational Drug
i. Additional and Relevant Information
5.4.0.4 Ad C. It should succinctly describe what the sponsor attempts to determine by the first human studies. All previous
human experience with the drug, other INDs, previous attempts to investigate the drug followed by withdrawal, foreign marketing experience relevant to the safety of the proposed investigation, etc., should be described. Because the detailed development plans are contingent on the results of the initial studies, limited in scope, and subject to change, that section should be kept as brief as possible.
5.4.0.5 Ad D. Before the investigation of a drug by participating clinical investigators may begin, the sponsor should provide them with an Investigator's Brochure. The recommended elements of the Investigator's Brochure are the subject of ICH document E6 (http://www.fda.gov/cder/guidance/iche6.htm); the brochure should provide a compilation of the clinical and non-clinical data relevant to the study of the drug in human subjects. The brochure should include a brief description of the drug substance, summaries of pharmacological and toxicological effects, and pharmacokinetics and biological disposition in animals and, if known, in humans. Also included should be a summary of known safety and effectiveness in humans from previous clinical studies. Reprints of published studies may be attached. Based on prior experience or on related drugs, the brochure should describe possible risks and side effects and any precautions or special monitoring that may be needed.
5.4.0.6 Ad E. Protocols for phase I studies need not be detailed and may be quite flexible compared with those for later phases. They should provide the following: (1) an outline of the investigation, (2) the estimated number of patients involved, (3) a description of safety exclusions, (4) a description of the dosing plan, duration, and dose or method of determining the dose, and (5) the specific detail elements critical to safety. Monitoring of vital signs and blood chemistry and toxicity-based stopping or dose-adjustment rules should be specified in detail.
5.4.0.7 Ad F. Phase I studies are usually conducted with drug substance of drug discovery origin. It is recognized that the synthetic methods may change and that additional information will be accumulated as the studies and development progress. Nevertheless, the application should provide CMC information sufficient to evaluate the safety
of the drug substance. The governing principle is that the sponsor should be able to relate the drug product proposed for human studies to the drug product used in the animal toxicology studies. At issue is the comparability of the (im)purity profiles. Also addressed should be the stability of the drug product and the polymorphic form of the drug substance, as they might change with a change of synthetic methods. The CMC information section to be provided in the phase I application should consist of the following sections.
1. CMC Introduction: Should address any potential human risks and the proposed steps to monitor such risks, and describe any chemical and manufacturing differences between the batches used in animal studies and those proposed for human studies.
2. Drug Substance:
• Brief description of the drug substance, including some physicochemical characterization and proof of structure.
• The name and address of the manufacturer.
• Brief description of the manufacturing process, with a flow chart and a list of reagents, solvents, and catalysts.
• Proposed acceptable limits of assay and related substances (impurities) based on actual analytical results, with the certificates of analysis for batches used in animal toxicological studies and stability studies and batches destined for clinical studies.
• Stability studies, which may be brief but should cover the proposed duration of the study. A tabular presentation of past stability studies may be submitted.
3. Drug Product:
• List of all components.
• Quantitative composition of the investigational drug product.
• The name and address of the manufacturer.
• Brief description of the manufacturing and packaging process.
• Specifications and methods assuring the identification, strength, quality, and purity of the drug product.
• Stability data and the stability methods used. The stability studies should cover the duration of the toxicologic and clinical studies.
4. Placebo (see part 3).
5. Labels and Labeling: Copies or mock-ups of the proposed labeling that will be provided to each investigator.
6. A claim for Categorical Exclusion from submission, or submission of an Environmental Assessment. The FDA believes that the great majority of drug products should qualify for a categorical exclusion.
5.4.0.8 Ad G. The Pharmacology and Toxicology Information is usually divided into the following sections.
1. Pharmacology & Drug Distribution, which should contain, if known: a description of the drug's pharmacologic effects and mechanisms of action in animals, and of its absorption, distribution, metabolism, and excretion.
2. Toxicology: Integrated Summary of the toxicologic effects in animals and in vitro. In cases where species specificity may be of concern, the sponsor is encouraged to discuss the issue with the Agency. In the early phase of an IND, waiting for the final, fully quality-assured individual study reports may slow preparation and delay the submission of the application. If the integrated summary is based on unaudited draft reports, the sponsor is required to submit an update by 120 days after the start of the human studies and identify any differences. Any new findings discovered in preparation of the final documents that affect patient safety must be reported to the FDA in IND safety reports. To support the safety
of the human investigation, the integrated summary should include:
• Design of the toxicologic studies and any deviations from it, the dates of the trials, and references to protocols and protocol amendments.
• Systematic presentation of the findings, highlighting those that an expert might consider possible risk signals.
• Qualifications of the individual who evaluated the animal safety data. That individual should sign the summary, attesting that the summary accurately reflects the data.
• Location of the animal studies and where the records of the studies are located, in case of an inspection.
• Declaration of compliance with Good Laboratory Practices (GLP), or an explanation of why compliance was impossible and how this may affect the interpretation of the findings.
3. Toxicology—Full Data Tabulation. Each animal toxicology study intended to support the safety of the proposed clinical study should be supported by a full tabulation of the data suitable for detailed review. A technical report on the methods used and a copy of the study protocol should be attached.
5.4.0.9 Ad H. Previous Human Experience with the Investigational Drug may be presented in an integrated summary report. The absence of previous human experience should be stated.
5.4.0.10 Ad I. Additional and Relevant Information may be needed if the drug has a dependence or abuse potential, is radioactive, or if a pediatric safety and effectiveness assessment is planned. Any information previously submitted need not be resubmitted but may be referenced. Once the IND is submitted to the FDA, an IND number is assigned, and the application is forwarded to the appropriate reviewing
division. The reviewing division sends a letter to the Sponsor-Investigator providing the IND number assigned, the date of receipt of the original application, the address to which future submissions to the IND should be sent, and the name and telephone number of the FDA person to whom questions about the application should be directed. The IND studies shall not be initiated until 30 days after the date of receipt of the IND by the FDA, although the sponsor may receive earlier notification by the FDA that studies may begin.
5.4.1 Phase I of IND. The initial introduction of an investigational new drug into humans may be conducted in patients, but is usually conducted in healthy volunteer subjects. Phase I studies are usually designed to obtain, in humans, sufficient information about the pharmacokinetics, pharmacological effects, and metabolism of the drug, the side effects associated with increasing doses, and, perhaps, preliminary evidence on effectiveness. The information collected should permit the design of well-controlled, scientifically valid phase II studies. The studies might even attempt to examine structure–activity relationships and the mechanism of action. The total number of subjects in a phase I study may vary. Depending on intent, it is usually in the range of 20–80 and rarely exceeds 100. The phase lasts several months, and about 70% of investigated drugs pass it. Beginning with phase I studies, CDER can impose a clinical hold (i.e., prohibit the study from proceeding or stop a trial that has started) for reasons of safety or because of a sponsor's failure to accurately disclose the risks of the study to investigators. The review process, illustrated in Fig. 1, begins the moment the IND application is assigned to individual reviewers representing various disciplines.
5.4.2 Phase II of IND. The initial (phase I) studies can be conducted in a group of patients, but most likely are conducted in healthy volunteers. In phase II, the early clinical studies of the effectiveness of the drug for a particular indication or indications are conducted in patients with the disease or condition. They are also used to determine the common short-term side effects and
Figure 1. IND review flow chart from http://www.fda.gov/cder/handbook/ind.htm.
risks associated with the drug. The number of patients in phase II studies is still small and does not exceed several hundred. The studies, which have to be well-controlled and closely monitored, last several months to 2 years. Approximately 33% of the drugs investigated pass this phase.
5.4.3 Phase III of IND. Phase III studies are expanded controlled and uncontrolled trials. They are performed after preliminary evidence suggesting effectiveness of the drug has been obtained in phase II and are intended to gather the additional information about effectiveness and safety that is needed to evaluate the overall benefit–risk relationship of the drug. Phase III studies also provide an adequate basis for extrapolating the results to the general population and transmitting that information in the physician labeling. Phase III studies usually include several hundred to several thousand people. In both phases II and III, CDER can impose a clinical hold if a study is unsafe (as in phase I) or if the protocol is clearly deficient in design for meeting its stated objectives. Great care is taken to ensure that this determination is not made in isolation, but reflects current scientific knowledge, agency experience with the design of clinical trials, and experience with the class of drugs under investigation. Of 100 drugs entering phase I, over 25 should pass phase III and go into the New Drug Application (NDA) approval process. According to FDA calculations (http://www.fda.gov/fdac/special/newdrug/ndd toc.html), about 20% of drugs entering the IND phase are eventually approved for marketing. These numbers agree with similar figures from the Pharmaceutical Research and Manufacturers of America (PhRMA; http://www.phrma.org/index.phtml?mode=web), showing that, on average, it takes 12–15 years and over $500 million to discover and develop a new drug. Of 5000 compounds entering preclinical research, only 5 go to IND and only 1 is approved (http://www.phrma.org/publications/documents/factsheets//2001–03-01.210.phtml).
5.4.4 Phase IV of IND. 21 CFR 312 Subpart E provides for drugs intended to treat
life-threatening and severely debilitating illnesses. In that case, the end-of-phase I meetings would reach agreement on the design of phase II controlled clinical trials. If the results of preliminary analysis of the phase II studies are promising, a treatment protocol may be requested and, when granted, would remain in effect until the complete data necessary for the marketing application are assembled. Concurrent with the marketing approval, the FDA may seek agreement to conduct post-marketing, phase IV studies (21 CFR 312.85).
5.5 Meetings with the FDA (http://www.fda.gov/cder/guidance/2125fnl.pdf)
Section 119(a) of the 1997 FDAMA directs the FDA to meet with sponsors and applicants, provided certain conditions are met, for the purpose of reaching agreement on the design and size of clinical trials intended to form the primary basis of an effectiveness claim in an NDA submitted under section 505(b) of the Act. These meetings are considered special protocol assessment meetings. All in all, there are three categories of meetings between sponsors or applicants for PDUFA products and CDER staff listed in the above guidance: type A, type B, and type C.
5.5.1 Type A. A type A meeting is one that is immediately necessary for an otherwise stalled drug development program to proceed. Type A meetings generally will be reserved for dispute resolution meetings, meetings to discuss clinical holds, and special protocol assessment meetings that are requested by sponsors after the FDA's evaluation of protocols in assessment letters. Type A meetings should be scheduled to occur within 30 days of the FDA's receipt of a written request for a meeting from a sponsor or applicant for a PDUFA product.
5.5.2 Type B. Type B meetings are (1) pre-IND meetings (21 CFR 312.82), (2) certain end of phase I meetings (21 CFR 312.82), (3) end of phase II/pre-phase III meetings (21 CFR 312.47), and (4) pre-NDA/BLA meetings (21 CFR 312.47). Type B meetings should be scheduled to occur within 60 days of the
Agency's receipt of the written request for a meeting.
5.5.3 Type C. A type C meeting is any meeting other than a type A or type B meeting; it should concern the development and review of a product in a human drug application as described in section 735(1) of the Act.
6 DRUG DEVELOPMENT AND APPROVAL TIME FRAME
The development and approval process is presented in Fig. 2. In the preclinical phase, the sponsor conducts the short-term animal testing and begins more extensive long-term animal studies. It is advisable to meet with the appropriate division of CDER in a pre-IND meeting to clarify the content of an application. When a sufficient amount of the necessary data has been gathered into an IND document, the application is filed with the FDA. The Agency has 30 days from the date the document is received to review the IND application, request additional information, and reach the decision of whether the phase I studies using human subjects can begin (see Fig. 1). Depending on the amount of information available or developed about the investigated drug, the phases of the IND can overlap. There is no fixed time limit on the duration of the IND phases; their length is simply determined by the results and by economics. Approval of a drug does not end the IND process, which may continue for as long as the sponsor intends to accumulate additional information about the drug, which may lead to new uses or formulations (see Fig. 2). Accelerated development/review (Federal Register, April 15, 1992) is a highly specialized mechanism for speeding the development of drugs that promise significant benefit over existing therapy for serious or life-threatening illnesses for which no therapy exists. This process incorporates several novel elements aimed at making sure that rapid development and review is balanced by safeguards to protect both the patients and the integrity of the regulatory process.
6.1 Accelerated Development/Review
Accelerated development/review can be used under two special circumstances: when approval is based on evidence of the product's effect on a "surrogate endpoint" and when the FDA determines that safe use of a product depends on restricting its distribution or use. A surrogate endpoint is a laboratory finding or physical sign that may not be a direct measurement of how a patient feels, functions, or survives but is still considered likely to predict therapeutic benefit for the patient. The fundamental element of this process is that the manufacturer must continue testing after approval to demonstrate that the drug indeed provides therapeutic benefit to the patient (21 CFR 314.510). If not, the FDA can withdraw the product from the market more easily than usual.
6.2 Fast Track Programs
Section 112(b) of the Food and Drug Administration Modernization Act of 1997 (FDAMA) (http://www.fda.gov/cder/fdama/default.htm and http://www.fda.gov/cder/guidance/2112fnl.pdf) amends the Federal Food, Drug, and Cosmetic Act (the Act) by adding section 506 (21 U.S.C. 356) and directing the FDA to issue guidance describing its policies and procedures pertaining to fast track products. Section 506 authorizes the FDA to take actions appropriate to facilitate the development and expedite the review of an application for such a product. These actions are not limited to those specified in the fast track provision but also encompass existing FDA programs to facilitate the development and review of products for serious and life-threatening conditions. The advantages of Fast Track consist of scheduled meetings with the FDA to gain Agency input into development plans, the option of submitting a New Drug Application in sections, and the option of requesting evaluation of studies using surrogate endpoints (see Accelerated Approval). "The Fast Track designation is intended for products that address an unmet medical need, but is independent of Priority Review and Accelerated Approval. An applicant may use any or all of the components of Fast Track without the
formal designation. Fast Track designation does not necessarily lead to a Priority Review or Accelerated Approval" (http://www.accessdata.fda.gov/scripts/cder/onctools/accel.cfm).
6.3 Safety of Clinical Trials
The safety and effectiveness of the majority of investigated, unapproved drugs in treating, preventing, or diagnosing a specific disease or condition can be determined only by their administration to humans. It is the patient who is the ultimate premarket testing ground for unapproved drugs. To assure the safety of patients in clinical trials, CDER monitors the study design and conduct of clinical trials to ensure that people in the trials are not exposed to unnecessary risks. The information available on the Web refers sponsors and investigators to the necessary CFR regulations and guidances. The most important parts of the CFR regulating clinical trials are as follows.
1. The financial disclosure section under 21 CFR 54. This covers financial disclosure for clinical investigators to ensure that financial interests and arrangements of clinical investigators that could affect the reliability of data submitted to the FDA in support of product marketing are identified and disclosed by the sponsor (http://www.fda.gov/cder/about/smallbiz/financial disclosure.htm).
2. The parts of 21 CFR 312 that include regulations for clinical investigators (http://www.fda.gov/cder/about/smallbiz/CFR.htm#312.60 and further):
312.60 General Responsibilities of Investigators
312.61 Control of the Investigational Drug
312.62 Investigator Record Keeping and Record Retention
312.64 Investigator Reports
312.66 Assurance of Institutional Review Board (IRB) Review
312.68 Inspection of Investigator's Records and Reports
312.69 Handling of Controlled Substances
312.70 Disqualification of a Clinical Investigator
Figure 2. New drug development and approval process.
The important part of any clinical investigation is the presence and activity of an Institutional Review Board (http://www.fda.gov/oc/ohrt/irbs/default.htm), a group that is formally designated to review and monitor biomedical research involving human subjects. An IRB has the authority to approve, require modifications in (to secure approval), or disapprove research. This group review serves an important role in the protection of the rights and welfare of human research subjects. An IRB review is to assure, both in advance and by periodic review, that appropriate steps are taken to protect the rights and welfare of humans participating as subjects in the research. To achieve that, IRBs use a group process to review research protocols and related materials (e.g., informed consent documents and investigator brochures) to ensure protection of the rights and welfare of human subjects.
7 NDA PROCESS (http://www.fda.gov/cder/regulatory/applications/nda.htm)
By submitting an NDA to the FDA, the sponsor formally proposes that the Agency approve a new drug for sale and marketing in the United States. The information on the drug's safety and efficacy collected during the animal and human trials of the IND process becomes part of the NDA. The review process of the submitted NDA (Fig. 3) is expected to answer the following questions:
1. Is the new drug safe and effective in its proposed use(s)? Do the benefits of the drug outweigh the risks?
2. Is the proposed labeling (package insert) of the drug appropriate and complete?
3. Are the manufacturing and control methods adequate to preserve the drug's identity, strength, quality, and purity?
As for the IND, the preparation of an NDA submission is based on existing laws and regulations and is guided by various guidance documents representing the Agency's current thinking on the particular subjects to be included in the NDA documentation. The underlying statute is the Federal Food, Drug, and Cosmetic Act (http://www.fda.gov/opacom/laws/fdcact/fdctoc.htm), as amended, which is the basic drug law in the United States. Its interpretation is provided by the Code of Federal Regulations: 21 CFR 314—Applications for FDA Approval to Market a New Drug or an Antibiotic Drug, available in PDF format at http://www.fda.gov/opacom/laws/fdcact/fdctoc.htm. Further help in understanding the NDA process is obtained from the available online guidances (http://www.fda.gov/cder/guidance/index.htm) and the CDER Manuals of Policies and Procedures (MAPPs; http://www.fda.gov/cder/mapp.htm). The list of guidances is particularly long and needs constant monitoring because some of them may be updated or withdrawn. Many of them address the format and content of the application to assure uniformity and consistency of the review process and decision-making. Particularly useful are the following MAPPs (in PDF):
6050.1—Refusal to Accept Application for Filing from Applicants in Arrears
7211.1—Drug Application Approval 501(b) Policy
7600.6—Requesting and Accepting Non-Archivable Electronic Records for New Drug Applications
7.1 Review Priority Classification
Under the Food and Drug Administration Modernization Act (FDAMA), depending on the anticipated therapeutic or diagnostic value of the submitted NDA, its review might receive a "Priority" (P) or "Standard" (S) classification. The designations "Priority" (P) and "Standard" (S) are mutually exclusive. Both original NDAs and effectiveness supplements receive a review priority classification, but manufacturing supplements do not. The basics of the classification, discussed in CDER's MAPP 6020.3
Figure 3. NDA review process from http://www.fda.gov/cder/handbook/index.htm.
(http://www.fda.gov/cder/mapp/6020–3.pdf), are listed below.
7.2 P—Priority Review
The drug product, if approved, would be a significant improvement compared with marketed products [approved (if such is required), including non-"drug" products/therapies] in
the treatment, diagnosis, or prevention of a disease. Improvement can be demonstrated by (1) evidence of increased effectiveness in treatment, prevention, or diagnosis of disease; (2) elimination or substantial reduction of a treatment-limiting drug reaction; (3) documented enhancement of patient compliance; or (4) evidence of safety and effectiveness in a
new subpopulation. (The CBER definition of a priority review is stricter than the definition that CDER uses: the biological product, if approved, must be a significant improvement in the safety or effectiveness of the treatment, diagnosis, or prevention of a serious or life-threatening disease.)
7.3 S—Standard Review
All non-priority applications are considered standard applications. The target date for completing all aspects of a review and for the FDA taking an action on an "S" application (approve or not approve) is 10 months after the date it was filed. The "P" applications have the target date for the FDA action set at 6 months.
7.4 Accelerated Approval (21 CFR 314, Subpart H, Sec. 314.510)
Accelerated Approval is approval based on a surrogate endpoint or on an effect on a clinical endpoint other than survival or irreversible morbidity. The CFR clearly states that the FDA . . . "may grant marketing approval for a new drug product on the basis of adequate and well-controlled clinical trials establishing that the drug product has an effect on a surrogate endpoint that is reasonably likely, based on epidemiologic, therapeutic, pathophysiologic, or other evidence, to predict clinical benefit or on the basis of an effect on a clinical endpoint other than survival or irreversible morbidity. Approval under this section will be subject to the requirement that the applicant study the drug further, to verify and describe its clinical benefit, where there is uncertainty as to the relation of the surrogate endpoint to clinical benefit, or of the observed clinical benefit to ultimate outcome. Post-marketing studies would usually be studies already underway. When required to be conducted, such studies must also be adequate and well-controlled. The applicant shall carry out any such studies with due diligence." Therefore, an approval, if it is granted, may be considered a conditional approval with a written commitment to complete the clinical studies that formally demonstrate patient benefit.
8 U.S. PHARMACOPEIA AND FDA
The USP Convention is the publisher of the United States Pharmacopeia and National Formulary (USP/NF). These texts and supplements are recognized as official compendia under the Federal Food, Drug & Cosmetic Act (FD&C Act). As such, their standards of strength, quality, purity, packaging, and labeling are directly enforceable under the adulteration and misbranding provisions without further approval or adoption by the FDA (http://www.usp.org/frameset.htm; http://www.usp.org/standards/fda/jgv testimony.htm). The Federal Food, Drug, and Cosmetic Act § 321(g)(1) states: "The term 'drug' means (A) articles recognized in the official United States Pharmacopoeia, official Homeopathic Pharmacopoeia of the United States, or official National Formulary, or any supplement to any of them; and . . ." (http://www.mlmlaw.com/library/statutes/federal/fdcact1.htm). That statement and additional arguments evolving from it may lead to a misapprehension that the USP and the FDA are at loggerheads over the authority to regulate the quality of drugs marketed in the United States. Nothing could be further from the truth. The harmonious collaboration of the FDA with many of the USP offices and committees may serve as a model of interaction between a federal agency and a nongovernmental organization such as USP. CDER's MAPP 7211.1 (http://www.fda.gov/cder/mapp/7211–1.pdf) establishes policy applicable to drug application approval with regard to official compendial standards and Section 501(b) of the Act: "When a USP monograph exists and an ANDA/NDA application is submitted to the Agency, reviewers are not to approve regulatory methods/specifications (i.e., those which must be relied upon for enforcement purposes) that differ from those in the USP, unless a recommendation is being or has been sent to the USPC through Compendial Operations Branch (COB) to change the methods/specifications. Direct notification to the U.S. Pharmacopeial Convention, Inc. by applicants does not absolve reviewers of their obligation to notify COB. Each Office within the Center should
determine its own standard operating procedures under the policy decision."
9 CDER FREEDOM OF INFORMATION ELECTRONIC READING ROOM
The 1996 amendments to the Freedom of Information (FOI) Act (FOIA) mandate publicly accessible "electronic reading rooms" (ERRs) with agency FOIA response materials and other information routinely available to the public, with electronic search and indexing features. The FDA (http://www.fda.gov/foi/electrr.htm) and many centers (http://www.fda.gov/cder/foi/index.htm) have their ERRs on the Web.
The International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH; http://www.ifpma.org/ich1.html) brings together the regulatory authorities of the European Union (EU-15), Japan, and the United States, and experts from the pharmaceutical industry in these three regions. The purpose is to make recommendations on ways to achieve greater harmonization in the interpretation and application of technical guidelines and requirements for product registration to reduce or obviate the need to duplicate the testing carried out during the research and development of new medicines. The objective of such harmonization is a more economical use of human, animal, and material resources, and the elimination of unnecessary delay in the global development and availability of new medicines while maintaining safeguards on quality, safety, and efficacy, and regulatory obligations to protect public health. A series of guidances has been issued (such as http://www.fda.gov/cder/guidance/4539Q.htm) that provide recommendations for applicants preparing the Common Technical Document for the Registration of Pharmaceuticals for Human Use (CTD) for submission to the FDA.
10 CONCLUSION
The ever-expanding field of medicinal chemistry and the heterogeneity of treatment approaches require constant vigilance to
maintain the balance between the safe and the novel. The process of new drug evaluation to determine the risk/benefit quotient is affected by many conflicting factors. Nobody, whether the inventor, the generic manufacturer, the regulator, the physician, or the patient, is immune from temptation. The principles of the social contract as written by Jean-Jacques Rousseau and accepted by John Adams still apply. The globalization of the pharmaceutical industry in the G-7 countries and the harmonization of the regulatory process among the United States, the EU-15, and Japan will have a profound impact on how, and how many, new drugs will be developed in the future. The reader is advised to stay familiar with the Web and attuned to the FDA and PhRMA pages.
THERAPEUTIC DOSE RANGE
SYLVIE CHEVRET
Inserm, Paris, France
One main goal of clinical drug development is to answer the fundamental question of "how much" of a new drug should be administered to humans to achieve some desired therapeutic effect with no or acceptable drug-related toxicity. Therefore, knowledge of the relationships among dosage, drug concentration in blood, and therapeutic response (effectiveness and unwanted effects) is important for the safe and effective use of drugs in humans. Historically, drugs have often been marketed initially at what were later recognized as excessive doses, sometimes with adverse consequences. Indeed, an inappropriate dosage regimen may produce therapeutic failure as a result of toxic drug levels or overdose, or alternatively, as a result of subtherapeutic drug levels. This situation has been improved by attempts to define the therapeutic dose range, that is, the range between the smallest dose with a discernable useful effect and a maximum dose beyond which no additional beneficial effect is expected. Actually, many drugs, such as aminoglycosides, theophylline, and gentamicin, as well as anti-HIV drugs, have narrow therapeutic ranges. The proper administration of these drugs requires an accurate prediction of the relationship between dose, dosing interval, and the resulting blood concentration of the drug in the patient. Various predictive methods have been developed for calculating an appropriate dose and dose interval for drug administration, based on preclinical data, including in vitro and computer-based methods that link pharmacokinetics, pharmacodynamics, and disease aspects into mathematical models (1). The problem is that, despite significant progress, we still generally cannot predict the pharmacokinetic profile in humans of many drug classes. Over-reliance on premarketing clinical trials has thus been pointed out (2). Clearly, the ideal is to test in humans those compounds that have desirable pharmacokinetic properties. Indeed, the dosing profile of a drug is mostly determined by its pharmacokinetic characteristics (3). A poor pharmacokinetic profile may render a compound of so little therapeutic value as to be not worth developing. For example, very rapid elimination of a drug from the body would make it impractical to maintain a compound at a suitable level to have the desired effect. By contrast, the development of drug immunoassays has led to the introduction of therapeutic drug monitoring and the use of population pharmacokinetic (PK) methodology to develop guidelines for individualizing drug dosage. Recognition of pharmacodynamic (PD) variability and identification of the PK parameters that best correlate with clinical response have occurred in several areas, including oncology, infection, and immunosuppression. Assessment of dose–response relationships has thus occupied a prominent role in drug development, with clinical studies designed to assess dose response as an inherent part of establishing the safety and effectiveness of the drug. Such a therapeutic dose range is derived from data collected in both phase I and phase II trials, which include safety, tolerability, and effectiveness, as well as pharmacokinetic and pharmacodynamic data. However, in life-threatening diseases such as cancer or HIV, an implicit assumption is that toxicity is a surrogate for efficacy and that a higher dose leads to greater antitumor activity. Thus, the therapeutic dose range usually reduces to the concept of the maximum tolerated dose (MTD), which coincides with the lowest or minimum effective dose (MED) of the cytotoxic drug. We first summarize the main concepts surrounding drug dose–response relationships. Second, we focus on how to obtain dose–response information.
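As a rough illustration of how such predictive methods relate dose and dosing interval to blood concentrations, the sketch below uses a one-compartment model with first-order elimination and repeated intravenous bolus dosing to compute steady-state peak and trough concentrations and checks them against an assumed therapeutic window. All values in the sketch (volume of distribution, half-life, the candidate regimens, and the window itself) are hypothetical and chosen only for illustration; they do not describe any real compound.

```python
import math

def steady_state_peak_trough(dose_mg, tau_h, vd_l, half_life_h):
    """Steady-state Cmax/Cmin (mg/L) for repeated IV bolus dosing
    in a one-compartment model with first-order elimination."""
    k = math.log(2) / half_life_h             # elimination rate constant (1/h)
    accumulation = 1.0 / (1.0 - math.exp(-k * tau_h))
    cmax = (dose_mg / vd_l) * accumulation    # concentration just after a dose
    cmin = cmax * math.exp(-k * tau_h)        # trough just before the next dose
    return cmax, cmin

# Hypothetical drug: Vd = 40 L, half-life = 6 h.
# Assumed therapeutic window: 2.5-12 mg/L (illustrative only).
MEC, TOXIC = 2.5, 12.0
for dose, tau in [(200, 12), (200, 6), (400, 12)]:
    cmax, cmin = steady_state_peak_trough(dose, tau, vd_l=40, half_life_h=6)
    ok = MEC <= cmin and cmax <= TOXIC
    print(f"{dose} mg every {tau} h: Cmax={cmax:.1f}, Cmin={cmin:.1f} mg/L "
          f"-> {'within' if ok else 'outside'} the assumed window")
```

Run as written, the sketch shows the usual trade-off: for this hypothetical drug, 200 mg every 12 h lets the trough fall below the minimally effective concentration, 400 mg every 12 h pushes the peak above the toxicity level, and only the intermediate regimen keeps the whole profile inside the assumed window.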
1 MAIN CONCEPTS AND DEFINITIONS
Note that the dosage regimen includes the quantity of drug administered (dose) and the frequency of administration (dosing interval). The effect produced by a drug varies with the concentration that is present at its site of action and usually approaches a maximum
value beyond which an increase in concentration produces no additional effect.
1.1 Therapeutic Index, Therapeutic Window
Because the drug will be used therapeutically, it is important to understand the margin of safety that exists between the minimal dose needed for the desired effect and the dose that produces unwanted and possibly dangerous side effects. This margin is usually referred as the ‘‘therapeutic index’’ or ‘‘therapeutic ratio,’’ although it is also known as the ‘‘selectivity index,’’ or ‘‘margin of safety.’’ Quantitatively, it is defined as the ratio given by the dose required to produce the toxic effect divided by the therapeutic dose. Rather than on a dose scale, it has been also defined as the ratio of the highest potentially therapeutic concentration to the lowest potentially therapeutic concentration. Although the therapeutic index mostly refers as the ratio, the ‘‘therapeutic window,’’ which is also used as synonymously, sometimes refers to the range, that is, the difference between the dose needed for the desired effect and the dose that produces unwanted effects. Similar to the therapeutic index, such a measure can be defined by the range of plasma concentrations between the minimally effective concentration and the concentration associated with unwanted toxicity; this range represents the therapeutic window (Fig. 1). It is believed that this index can help to avoid most potential side effects, and it is believed to be more reliable than that of the therapeutic index because this index considers the biological variation among individuals to a larger extent. Whatever the drug and the population under study, whether animals or the humans,
the greater the value calculated for the therapeutic index or window, the greater a drug’s margin of safety. By contrast, drugs with a small therapeutic index/window must be administered with care and control, for example, by frequently measuring blood concentrations of the drug, because small deviations in exposure can lead to loss of effect or to adverse effects. In general, as the margin narrows, it is more likely that the drug will produce unwanted effects. For instance, in cancer chemotherapy, drug concentrations that fall below the minimally effective concentration are subtherapeutic, which may lead to cancer progression and multidrug resistance. However, most antineoplastics are associated with severe adverse reactions; therefore, drug concentrations above the therapeutic window may lead to life-threatening toxicity.

1.2 Usual Measures of Dose Response

According to the population studied, definitions of ‘‘minimally effective’’ and ‘‘unwanted toxicity’’ obviously differ. In toxicology, a commonly used measure of the therapeutic index is the lethal dose of a drug for 50% of the population (LD50) divided by the effective dose for 50% of the population (ED50). Similarly, the difference between the ED50 and the starting point of the LD50 curve is usually computed as the therapeutic window. This usage implies that the ‘‘dose needed for the desired effect’’ is the median effective dose (ED50), which is defined as the dose that produces a response that is one half of the maximal effect that can be elicited (Emax). Sometimes, the response is measured in terms of the proportion of subjects in a sample population that show a given all-or-nothing response (e.g., appearance of convulsions) rather than as a continuously
Figure 1. Therapeutic window, that is, the range between the minimally effective drug concentration and the drug concentration associated with severe adverse reactions. Cmax refers to the maximum plasma concentration, whereas Cmin refers to the minimal plasma concentration.
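As a rough numerical sketch of these concepts (all values are hypothetical and chosen only for illustration, not taken from the article), the fragment below checks measured peak (Cmax) and trough (Cmin) plasma concentrations against a therapeutic window bounded by a minimally effective concentration and a toxicity threshold.

```python
# Hypothetical therapeutic window for an illustrative drug (units: mg/L).
MIN_EFFECTIVE = 5.0   # minimally effective concentration (assumed)
TOXIC_LEVEL = 20.0    # concentration associated with unwanted toxicity (assumed)

def classify_concentration(c):
    """Place a measured plasma concentration relative to the window."""
    if c < MIN_EFFECTIVE:
        return "subtherapeutic"
    if c > TOXIC_LEVEL:
        return "potentially toxic"
    return "within therapeutic window"

# Simulated peak (Cmax) and trough (Cmin) measurements for one patient.
for label, conc in [("Cmax", 18.0), ("Cmin", 3.5)]:
    print(f"{label} = {conc} mg/L -> {classify_concentration(conc)}")
```

In this made-up case the trough falls below the minimally effective concentration, which is exactly the kind of subtherapeutic exposure the text warns about.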
graded response. As such, the ED50 represents the dose that causes 50% of a sample population to respond. By contrast, the measure of the ‘‘dose that produces unwanted and possibly dangerous side effects’’ is the median lethal dose (LD50), which is defined as the dose that causes death in 50% of animals. Such a measure of the therapeutic index/window has been criticized because, being based on animal data, the LD50 is a poor guide to the likelihood of unwanted effects in humans. Moreover, this range obviously cannot be measured in humans, so other measures of the therapeutic index are used. In clinical studies that involve humans, the therapeutic window often refers to the range between the lowest or minimum effective dose (MED) and the maximum tolerated dose (MTD), which is based on clinical endpoints rather than on plasma concentration. Moreover, such endpoints are commonly measured as being present or absent, so the probability of response or toxicity over the dose scale is usually modeled (Fig. 2). The lowest dose, or MED, represents the dose that causes some prespecified proportion of a sample population to respond. This definition is in agreement with the definition of the MTD, which is the highest dose level of a potential therapeutic agent at which patients experience an acceptable level of dose-limiting toxicity. See the discussion on MTD for details. Of note, the range of optimal dose levels is often determined during the clinical trial process. Often, it reflects merely an average response to the drug among the studied subjects, ignoring the potential effects of age,
race, gender, and treatment history. Optimal drug doses should instead correspond to the profile of the individual patient being monitored (not to the average response), as is done, for instance, with digoxin and tobramycin.

2 DRUG DOSE–RESPONSE RELATIONSHIP

2.1 Dose-Response Models: Generalities

Dose-response data typically are graphed with the dose or a dose function (e.g., log10 dose) on the x-axis and the measured effect (response) on the y-axis, independent of time. Graphing dose-response curves of drugs studied under identical conditions can help compare the pharmacologic profiles of the drugs. Based on the assumptions that very low doses produce little or no effect and that doses above some threshold produce increasingly large effects, a possible model for dose-response ‘‘curves’’ can be chosen. Mostly, mathematical models, such as the Emax model or sigmoid Emax models (4) and the log-linear model, relate some effect E to Emax and ED50. PK/PD modeling has become available more recently as a more sophisticated analysis tool. Notably, these models allow for consideration of the temporal aspects of effect intensity and provide a more rational scientific framework for the relationship between dose and response, incorporating sources of biological variation. Indeed, many factors may affect interindividual pharmacokinetic variability. Random effects allow interindividual differences to be expressed. As a result of the high degree of variability within and among patients, equal doses of the same drug in two different individuals can result in dramatically different clinical outcomes.
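A minimal sketch of a sigmoid Emax (Hill-type) model of the kind mentioned above is shown below; the maximal effect, EC50, and Hill coefficient are illustrative assumptions, not values from the cited references.

```python
def sigmoid_emax(conc, emax=100.0, ec50=4.0, hill=1.5):
    """Sigmoid Emax (Hill) model: effect as a function of concentration."""
    return emax * conc**hill / (ec50**hill + conc**hill)

# Effect predicted over a range of concentrations (arbitrary units);
# with emax = 100, the output can be read as a percentage of Emax.
for c in (0.5, 1, 2, 4, 8, 16, 32):
    print(f"C = {c:5.1f}  ->  E = {sigmoid_emax(c):5.1f} (% of Emax)")
```

At the assumed EC50 of 4 the predicted effect is exactly half of Emax, which is the defining property of the ED50/EC50 parameter discussed earlier.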
Figure 2. Example of dose–response curves in cancer. For an increase in dose from level 1 to level 2, there is a small increase in tumor control but a much larger increase in treatment complication probability.
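To illustrate the trade-off shown in Figure 2, the sketch below evaluates two hypothetical logistic dose-response curves, one for tumor control and one for complications, and reports the doses that satisfy both an efficacy and a toxicity criterion; the curve parameters and the 80%/20% thresholds are illustrative assumptions only.

```python
import math

def prob(dose, d50, slope=0.15):
    """Logistic probability of a binary outcome at a given dose."""
    return 1.0 / (1.0 + math.exp(-slope * (dose - d50)))

# Hypothetical curves: tumor control rises at lower doses than complications.
acceptable = [d for d in range(0, 101)
              if prob(d, d50=50.0) >= 0.80      # at least 80% tumor control
              and prob(d, d50=75.0) <= 0.20]    # no more than 20% complications
if acceptable:
    print(f"Doses meeting both criteria (illustrative): {min(acceptable)}-{max(acceptable)}")
else:
    print("No dose meets both criteria under these assumed curves")
```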
Population models assess and quantify potential sources of variability in exposure and response in the target population, even under sparse sampling conditions (5). Based on observed data, the models can be fitted using specific software; the most widely used application is NONMEM (University of California, San Francisco, CA).

2.2 Principles of Chemotherapy Dose Intensity

Dose intensity refers to the chemotherapy dose per unit of time over which the treatment is given and is expressed as mg/m2 per week. The assumption that increasing cytotoxic dose intensity will improve cancer cure rates is compelling. Actually, the principles of chemotherapy administration are based on its hematological toxicity. A mathematical model, first described by Goldie and Coldman in 1979 (6), demonstrated the link between the dose schedule and the cytotoxicity of the drug. Thereafter, Norton (7) integrated tumor growth, taking into account the parameters of in vivo tumor homeostasis. The finding that ‘‘more is better’’ has been shown in several tumor types, which include lymphomas, breast cancer (8), testicular cancers, and small-cell lung cancer (9). It has also gained general acceptance among pediatric oncologists (10). It is the rationale for dose-finding experiments in cancer patients when using cytotoxic drugs. Nevertheless, for molecularly targeted agents that have minimal toxicity in animal studies, dose-finding studies may be conducted in healthy volunteers as in other medical fields.
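Dose intensity as defined above (dose per unit time, in mg/m2 per week) is simple arithmetic; the sketch below uses a purely hypothetical regimen to show how dose reductions and delays lower the delivered dose intensity relative to plan.

```python
def dose_intensity(total_dose_mg_m2, weeks):
    """Dose intensity in mg/m2 per week: total dose divided by elapsed time."""
    return total_dose_mg_m2 / weeks

# Hypothetical plan: 60 mg/m2 every 3 weeks for 6 cycles (18 weeks).
planned = dose_intensity(total_dose_mg_m2=6 * 60, weeks=18)

# Hypothetical delivery: one cycle reduced to 45 mg/m2 plus a 2-week delay.
delivered = dose_intensity(total_dose_mg_m2=5 * 60 + 45, weeks=20)

print(f"Planned dose intensity:   {planned:.1f} mg/m2/week")
print(f"Delivered dose intensity: {delivered:.1f} mg/m2/week "
      f"({100 * delivered / planned:.0f}% of planned)")
```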
3 DOSE-RANGING DESIGNS
Designs that focus on establishing the therapeutic dose range are often termed ‘‘dose-ranging’’ studies, by contrast to ‘‘dose-finding’’ studies, which usually deal with a single clinical endpoint, mostly in cancer (11,12).

3.1 Preclinical Studies

Typically, during pharmaceutical development, for ethical and safety reasons and as a
requirement by regulatory authorities worldwide, no test substance can be given to humans without first demonstrating some evidence of its likely risk. This evidence is provided by a mixture of in vitro and animal data that conforms to requirements laid down by regulatory agencies worldwide (13). The studies are designed to permit the selection of a safe starting dose for humans, to gain an understanding of which organs may be the targets of toxicity, to estimate the margin of safety between a clinical and a toxic dose, and to predict pharmacokinetics and pharmacodynamics. One commonly applied approach to predicting a human PK profile from animal data is allometric scaling, which scales the animal data to humans assuming that the only difference between animals and humans is body size. Although body size is an important determinant of pharmacokinetics, it is certainly not the only feature that distinguishes humans from animals. Therefore, perhaps not surprisingly, this simple approach has been estimated to have less than 60% predictive accuracy (14).
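A minimal sketch of the simple allometric scaling idea described above follows; the 0.75 exponent for clearance and the rat clearance and weight are common illustrative assumptions, not data from the cited evaluation.

```python
def allometric_scale(animal_value, animal_weight_kg,
                     human_weight_kg=70.0, exponent=0.75):
    """Scale a pharmacokinetic parameter from an animal to humans by body weight."""
    return animal_value * (human_weight_kg / animal_weight_kg) ** exponent

# Hypothetical rat clearance of 2.0 mL/min scaled to a 70-kg human.
cl_human = allometric_scale(animal_value=2.0, animal_weight_kg=0.25)
print(f"Predicted human clearance: {cl_human:.0f} mL/min (illustrative only)")
```

As the text notes, predictions of this kind ignore every interspecies difference other than size, which is a major reason for their limited accuracy.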
3.2 Early Phases of Clinical Trials

Often, an upward dose exploration occurs in phase I, especially in cases in which a narrow therapeutic index is observed. In these cases, it may be important to identify the MTD, whereas later phase studies may identify a MED. In the clinical setting, the potential therapeutic results of a treatment depend on both pharmacokinetics and pharmacodynamics. Thus, an evaluation based on multiple subjects across a range of doses is helpful in defining an optimal therapeutic regimen for subsequent clinical trial phases. Choosing the appropriate outcome measure is essential to determine accurately both safety and efficacy.

3.3 Phase I in Cancer

In several therapeutic areas, such as life-threatening diseases, considerable toxicity could be accepted; relatively high doses of drugs are usually chosen to achieve the greatest possible beneficial effect. Thus, phase I clinical trials of new anticancer therapies aim at determining suitable doses for additional testing. See the phase I section.
Objective responses observed in phase I trials are important for determining the future development of an anticancer drug. Although the evaluation of new investigational drugs in phase I, II, and III trials requires considerable time and patient resources, only a few of these drugs ultimately are established as anticancer drugs. A consequence of using less effective designs is that more patients are treated with doses outside the therapeutic window (15). This result points out the importance of establishing correct dose selection in phase I trials (11).
3.4 PK/PD Assessments

In healthy volunteers, it is widely reported that dose ranging, depending on the demands of the clinical situation, should use either the crossover or the dose-escalation design rather than the often-used placebo-controlled randomized design (16). The initiation of drug therapy includes the administration of an initial loading dose and subsequent interim maintenance doses at periodic intervals to achieve therapeutic ‘‘peak’’ and ‘‘trough’’ blood concentrations. In conventional methods, a determination of the dosage regimen typically involves the use of nomograms and dosing guidelines derived from population averages. Typically, measured serum drug concentrations will not be available before a maintenance dose is required. After the initiation of drug therapy, blood assays are typically performed to determine analytically the serum concentration of the therapeutic agent. Recently, optimal adaptive design has emerged as a methodology for improving the performance of phase II dose-response studies. Optimal adaptive design uses both information available prior to the study and data accrued during the study to update and refine the study design continuously. Dose-response models include linear, log-linear, sigmoidal E(max), and exponential models. The capability of the adaptive designs to ‘‘learn’’ the true dose response resulted in performances up to 180% more efficient than the best fixed optimal designs (17).
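The loading-dose and maintenance-dose calculations mentioned earlier in this section follow standard pharmacokinetic relationships (loading dose = target concentration × volume of distribution / bioavailability; maintenance dose = target concentration × clearance × dosing interval / bioavailability). The sketch below uses hypothetical parameter values, not values from the article.

```python
def loading_dose(target_conc, vd, bioavailability=1.0):
    """Loading dose to reach a target concentration: C_target * Vd / F."""
    return target_conc * vd / bioavailability

def maintenance_dose(target_conc, clearance, interval_h, bioavailability=1.0):
    """Dose per interval to sustain a target average concentration: C * CL * tau / F."""
    return target_conc * clearance * interval_h / bioavailability

# Hypothetical drug: target 15 mg/L, Vd = 0.5 L/kg in a 70-kg patient, CL = 3 L/h.
vd_total = 0.5 * 70
print(f"Loading dose:     {loading_dose(15, vd_total):.0f} mg")
print(f"Maintenance dose: {maintenance_dose(15, 3, interval_h=12):.0f} mg every 12 h")
```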
4 CONCLUDING REMARKS
Historically, drugs have been marketed at excessive doses, with some patients experiencing adverse events unnecessarily. This has been attributed to two main causes: drugs are often introduced at a dose that will be effective in around 90% of the target population because it helps market penetration, and doses are determined partly by an irrational preference for round numbers (18). Over the last 5 years, a greater effort has been made to ensure that the best benefit-to-risk assessment is obtained for each new drug (2). This assessment has been improved, in some cases, by postmarketing label changes, which aim to optimize the dosage regimen for the indicated populations (19). Although clinical trials are the gold standard for demonstrating efficacy and for providing the data needed to convince the FDA of a favorable benefit/risk ratio for the drug under investigation, they cannot fully predict safety when drugs are used in the real world. Knowledge about the safety profile of a drug in humans is limited at the time of marketing. More extensive and earlier epidemiologic assessment of risks and benefits of new products will create a new standard of evidence for industry and regulators, and it is likely to result in more effective and balanced regulatory actions, thereby affording better care for patients (2).
REFERENCES

1. J. Dingemanse and S. Appel-Dingemanse, Integrated pharmacokinetics and pharmacodynamics in drug development. Clin. Pharmacokinet. 2007; 46: 713–737.
2. E. Andrews and M. Dombeck, The role of scientific evidence of risks and benefits in determining risk management policies for medications. Pharmacoepidemiol. Drug Saf. 2004; 13: 599–608.
3. L. Sheiner and J. Steimer, Pharmacokinetic/pharmacodynamic modeling in drug development. Annu. Rev. Pharmacol. Toxicol. 2000; 40: 67–95.
4. G. Alván, G. Paintaud, and M. Wakelkamp, The efficiency concept in pharmacodynamics. Clin. Pharmacokinet. 1999; 36: 375–389.
5. L. Sheiner and J. Wakefield, Population modelling in drug development. Stat. Methods Med. Res. 1999; 8: 183–193.
6. J. Goldie and A. Coldman, A mathematical model for relating the drug sensitivity of tumors to their spontaneous mutation rate. Cancer Treat. Rep. 1979; 63: 1727–1733.
7. L. Norton, Gompertzian model of human breast cancer growth. Cancer Res. 1988; 48: 7067–7071.
8. W. Hryniuk and H. Bush, The importance of dose intensity in chemotherapy of metastatic breast cancer. J. Clin. Oncol. 1984; 2: 1281–1288.
9. G. Crivellari, et al., Increasing chemotherapy in small-cell lung cancer: from dose intensity and density to megadoses. The Oncologist 2007; 12: 79–89.
10. M. Smith, et al., Dose intensity of chemotherapy for childhood cancers. Oncologist 1996; 1: 293–304.
11. S. Chevret, Statistical methods for dose-finding experiments. In: S. Senn (ed.), Statistics in Practice. Chichester, UK: John Wiley & Sons, 2006.
12. N. Ting, Dose finding in drug development. In: M. Gail, et al. (eds.), Statistics for Biology and Health. New York: Springer, 2006.
13. Food and Drug Administration, Guidance for Industry, Investigators and Reviewers: Exploratory IND Studies. Draft guidance. Washington, DC: US Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research, 2005.
14. K. Ward and B. Smith, A comprehensive quantitative and qualitative evaluation of extrapolation of intravenous pharmacokinetic parameters from rat, dog and monkey to humans. I. Clearance. Drug Metab. Dispos. 2004; 32: 603–611.
15. A. Rogatko, et al., Translation of innovative designs into phase I trials. J. Clin. Oncol. 2007; 25: 4982–4986.
16. L. Sheiner, Y. Hashimoto, and S. Beal, A simulation study comparing designs for dose ranging. Stat. Med. 1991; 10: 303–321.
17. A. Maloney, M. Karlsson, and U. Simonsson, Optimal adaptive design in clinical drug development: a simulation example. J. Clin. Pharmacol. 2007; 47: 1231–1243.
18. A. Herxheimer, How much drug in the tablet? Lancet 1991; 337: 346–348.
19. J. Cross, et al., Postmarketing drug dosage changes of 499 FDA-approved new molecular entities, 1980–1999. Pharmacoepidemiol. Drug Saf. 2002; 11: 439–446.
CROSS-REFERENCES
Area Under the Curve
Maximum Concentration
Maximum Tolerable Dose (MTD)
Pharmacodynamic Study
Pharmacokinetic Study
THERAPEUTIC EQUIVALENCE

BRIAN L. WIENS
Gilead Sciences, Inc.
Westminster, Colorado

Therapeutic equivalence clinical trials aim to demonstrate that two or more treatments are essentially similar. Therapeutic equivalence trials share characteristics with other trials discussed elsewhere in this volume: noninferiority trials, bioequivalence trials, and equivalence trials. Unlike a noninferiority trial, differences in either direction are of interest in a therapeutic equivalence trial. The inclusion of the term ‘‘therapeutic’’ implies that the similarity is based on an endpoint of clinical relevance—that is, not a biomarker or laboratory value without direct clinical benefit, but an endpoint of measurable relevance to the patient’s health. This distinguishes therapeutic equivalence trials from bioequivalence trials in which equivalence is based on pharmacokinetic profiles, which relate directly to drug concentrations in the bloodstream and much less directly to clinical consequences. Bioequivalence trials are also distinguished by well-established criteria to conclude equivalence (such as a ratio between 80% and 125% for some pharmacokinetic parameters) whereas in therapeutic equivalence trials the criteria are generally chosen individually for each trial. The term ‘‘equivalence’’ is a general term that can describe therapeutic equivalence and other trials, and many aspects of therapeutic equivalence trials will be similar to noninferiority and bioequivalence trials. For purposes of this article, only issues pertaining to therapeutic equivalence trials (and not issues unique to noninferiority trials or bioequivalence trials) will be discussed. However, because many of the issues around these various trial types are similar or identical, there will be some overlap and the other entries in this volume provide further reading.
1 WHEN TO USE A THERAPEUTIC EQUIVALENCE TRIAL DESIGN

A therapeutic equivalence trial is appropriate when two treatments are to be compared and a potential (generally desired) outcome is that neither treatment is markedly superior to the other one. In this way, equivalence trials are, in general, differentiated from noninferiority trials and superiority trials: in noninferiority trials, showing that an investigational treatment is superior to an active comparator is possible and often desired; in superiority trials, the only desired outcome is showing that one treatment is superior to the other. Therapeutic equivalence trials can be used for next-generation or improved drugs, when another approved drug with a similar mechanism of action is already available. Having multiple drugs available that are on average equivalent is an important goal for public health because having more treatment options increases the chance that a given patient will tolerate and benefit from at least one of them, and competition may decrease the cost of each of the drugs (1). Therapeutic equivalence trials can also be conducted for comparing drugs in different therapeutic classes that have similar effects with different mechanisms of action. However, in these trials, noninferiority analyses may be more appropriate because additional efficacy is generally desirable. With some endpoints, equivalence trials do not seem to make sense. For example, if mortality is the primary endpoint, a noninferiority or superiority trial will generally be a more obvious goal as excess efficacy is undoubtedly a good thing. Other endpoints may lend themselves to more obvious application of equivalence designs if too much effect is as bad as too little effect. With the use of an insulin product or device in diabetic patients, blood glucose that is either too high or too low will have negative and obvious consequences for patients. In the use of an erythropoiesis-stimulating product in patients with renal disease, negative consequences will be observed if hemoglobin levels are either too high (morbidity risk) or too
low (fatigue), and these will be noticeable to patients as well. It is possible that these examples can be turned into noninferiority trials; for example, instead of mean change in hemoglobin, the endpoint might be changed to the proportion of patients with hemoglobin in a predefined range of interest. Because more is undoubtedly better when considering the proportion of subjects in the predefined range of interest, such a proportion can be easily used as an endpoint in a noninferiority trial. Based on the preceding discussion, the number of studies for which therapeutic equivalence designs are useful is admittedly quite small, as more efficacy is generally beneficial. A specific example of a study that could be classified as a therapeutic equivalence study is a vaccine consistency lot study. In these studies, three lots of vaccine from three consecutive manufacturing runs are compared, with a desired outcome of equivalence of immunogenicity among the three lots to establish the consistency of the manufacturing process (2). In these studies, finding that one lot is much more immunogenic than another is an undesirable outcome, so demonstrating equivalence is required and demonstrating noninferiority is an inappropriate goal. The endpoint of immunogenicity is a level established to demonstrate protection from disease in case of exposure, making this more of a therapeutic equivalence trial than a bioequivalence trial. Also distinguishing these trials from bioequivalence trials is the equivalence limit, or how different the lots can be while still concluding equivalence; these limits are quite different in consistency lot trials than in bioequivalence trials. (Equivalence limits are discussed in detail in the next section.) Another possible situation in which a trial can be considered a therapeutic equivalence trial is when drug–drug interactions or manufacturing process changes cannot be assessed by pharmacokinetics but can be assessed clinically. This is an uncommon situation in systemically distributed (oral or parenteral) drugs, but may be necessary when the distribution of the drug in the body is not uniform or is not important for efficacy, such as for topically administered drugs for localized indications or for devices.
Equivalence might also be of interest in a decision-theoretic approach, in which the possible conclusions are that one treatment is superior to the second treatment, the second treatment is superior to the first, or the two treatments are equivalent.

2 DESIGN REQUIREMENTS FOR A THERAPEUTIC EQUIVALENCE TRIAL

Two important considerations are required for the design of a therapeutic equivalence trial: an equivalence limit (or margin) and a test statistic. In fact, other than these two considerations, a therapeutic equivalence trial will be quite similar to a superiority trial in design, conduct, analysis, and interpretation.

2.1 Choice of an Equivalence Limit

The equivalence limit is the amount that the treatments can differ while still concluding equivalence based on a parameter of choice, often the mean or a proportion and generally a measure of efficacy. The equivalence limit is often denoted as δ, and the interval between the upper and lower bounds is called the equivalence zone. The choice of the equivalence limit is one of the most important aspects of the study design. The conclusion of the trial will depend on the magnitude of the equivalence limit. If the equivalence limit is chosen to be too large, a positive trial will be viewed with skepticism. If the equivalence limit is chosen to be too small, a larger than necessary sample size will be required to avoid having a useful product rejected by a regulatory agency or the medical community. One aspect of an equivalence margin that is of utmost importance in noninferiority studies and perhaps as important in therapeutic equivalence trials is the putative placebo concern. That is, if two treatments are determined to be equivalent, that conclusion must imply that the treatments are both superior to placebo (or, in a logically equivalent requirement, that placebo could not have been concluded equivalent to an active treatment). This is accomplished by setting the equivalence limit appropriately. For a noninferiority trial, this implies that a positive result (conclusion of noninferiority) is tantamount to concluding that the test treatment
is superior to placebo via an indirect comparison. For a therapeutic equivalence trial, the putative placebo argument is different but must also be considered. The equivalence limit must be small enough that a placebo could not be concluded therapeutically equivalent to a treatment that provides meaningful benefit. It is possible to envision a therapeutic equivalence trial in which neither treatment has been compared with placebo, such as the aforementioned vaccine consistency lot study in which all participants receive an investigational product. In such a case, a heuristic argument on the putative placebo effect is required, and a conservative (small) margin should be chosen based on this argument. Even then, the trial results may be open to criticism due to the lack of definitive knowledge of the relative performance of placebo and the studied treatments. Another important aspect of the equivalence limit is the ability to rule out a difference of medical concern. This is an intentionally imprecise statement about an amount that is very difficult in practice to quantify: every party (patient, physician, third-party payer, etc.) has a different quantity that is of concern, and even different patients will have different opinions. Equivalence limits might also be based on tolerability, cost, or some other aspect of the profile of a given therapy. A drug with fewer side effects or lower cost might be granted a lower hurdle (larger margin) to be called equivalent. Some investigators have provided guidance on an equivalence limit based on absolute change (3), and others have argued that an equivalence limit should be based on a proportion of the effect of an active product compared with a placebo from a historical comparison (4). Obviously, the latter requires a historical or concurrent comparison of one or both treatments to placebo. Equivalence is most often based on a population mean or proportion, but can also be based on another parameter such as a median or hazard ratio. For purposes of illustration, the hypotheses of interest when basing equivalence on population means might be written
as follows:

H0: µA − µB ≥ δ or µA − µB ≤ −δ versus H1: −δ < µA − µB < δ.    (1)

Other formulations of the hypotheses are possible. Although there is no absolute requirement to do so, it is common that the equivalence limits will be symmetric about 0 when equivalence is based on the difference in two parameters, and symmetric about 1 when equivalence is based on the ratio of two parameters (often, this is actually symmetry about 0 after log-transformation). It is intuitive that if one allows µA − µB = θ to be included in the equivalence zone, then allowing µB − µA = θ to be in the equivalence zone seems defensible. This principle can be relaxed if there is reason based on some aspect of the treatment other than efficacy (e.g., cost or tolerability). A treatment that more commonly causes an adverse event that leads to significant discomfort might be required to meet a higher hurdle before it is considered therapeutically equivalent to another treatment with a cleaner tolerability profile. One way to accomplish this is to make the equivalence bounds asymmetric.

2.2 Choice of a Test Statistic

Testing of a therapeutic equivalence trial is generally conducted under a Neyman-Pearson framework (5). The test statistic to be used must be a consideration in the design of a therapeutic equivalence trial, and will be determined in part from the chosen form of the hypotheses. In general, as shown, two null hypotheses must be rejected: Treatment A is importantly better than treatment B, or treatment B is importantly better than treatment A. For hypotheses written as in equation 1 above, an alternate writing would be with two null and two alternative hypotheses as follows:

H0A: µA − µB ≥ δ versus H1A: µA − µB < δ, and
H0B: µA − µB ≤ −δ versus H1B: µA − µB > −δ.    (2)
From this form, it is easy to see that two test statistics will need to be calculated, one for each null hypothesis. Both null hypotheses must be rejected to conclude equivalence. Assuming that the data are normally distributed with known variance, or that the sample size is large enough for the central limit theorem to apply, an appropriate pair of test statistics follows the form
ZA = (xA − xB − δ) / se(xA − xB),    (3)
with ZB defined analogously. Tests are routinely conducted at the 2.5% level, one-sided, to correspond with superiority studies, in which the standard is to test at the 5% level, two-sided. (Again, this is a contrast to bioequivalence trials, in which each test is routinely conducted at the 5% level, one-sided.) Each test statistic can be compared with the standard normal critical value zα (e.g., ZA compared with −1.96, and ZB compared with +1.96 for two tests, each at the 2.5% level), with equivalence concluded if both test statistics are more extreme than the corresponding critical value. (With unknown variance or finite sample size, the test statistic would be the t-test statistic rather than the z-test statistic, and the critical value would come from the corresponding t distribution.) Because two test statistics are calculated and each tests a one-sided hypothesis, this procedure is sometimes referred to as a ‘‘two one-sided tests’’ (TOST) procedure (6). For notational convenience, the test statistics or critical values are sometimes presented with the opposite sign, such as ZA* = −ZA and ZB* = ZB. Then the lesser of the two, min(ZA*, ZB*), can be considered the test statistic and compared with the critical value +1.96. Because the TOST procedure effectively tests a noninferiority-type statistic twice, many testing issues discussed in the noninferiority literature apply directly to therapeutic equivalence testing.
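A minimal numerical sketch of the TOST calculation in equation 3 and of the equivalent confidence-interval check described next is shown below; the summary statistics, sample size, and the equivalence limit δ = 1.5 are purely illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical summary data (means, common SD, per-group sample size) and margin.
xbar_a, xbar_b, sd, n, delta, alpha = 10.2, 10.6, 4.0, 150, 1.5, 0.025

se = sd * sqrt(2.0 / n)
z_a = (xbar_a - xbar_b - delta) / se          # tests H0A: muA - muB >= delta
z_b = (xbar_a - xbar_b + delta) / se          # tests H0B: muA - muB <= -delta
crit = NormalDist().inv_cdf(1 - alpha)        # 1.96 for alpha = 0.025

equivalent = (z_a < -crit) and (z_b > crit)
ci = (xbar_a - xbar_b - crit * se, xbar_a - xbar_b + crit * se)
print(f"ZA = {z_a:.2f}, ZB = {z_b:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print("Equivalence concluded" if equivalent else "Equivalence not demonstrated")
```

For these made-up numbers both one-sided tests reject, and, equivalently, the two-sided 95% confidence interval lies entirely inside (−1.5, 1.5), illustrating the agreement between the two formulations noted in the text.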
In practice, it is more common to calculate a two-sided 100(1 − 2α)% confidence interval and conclude equivalence at level α if the entire confidence interval is contained in the equivalence interval (−δ, δ). This has the added advantage of reporting not only the test result but an interval estimate as well, and each reader can make a personal decision on whether the differences are of importance (7). The impact is that the bounds of the confidence interval act as the test statistics, and the equivalence limits act as the corresponding critical values. It is easy to verify that this is identical to using the test statistics shown in equation 3 if the confidence interval is calculated as xA − xB ± zα se(xA − xB). A two-sided 95% confidence interval corresponds to two one-sided tests, each with a type I error rate of 2.5%. Testing procedures for continuous data (better yet, normally distributed continuous data) are fairly easy to adapt from superiority analyses to equivalence analyses. For binary data, the adaptation is not always straightforward due to the relationship between the mean and the variance. Under the null hypothesis, the proportions differ by at least δ, but when two treatments are identical or nearly so, the proportions will often differ by much less than δ. A restricted maximum likelihood estimate (MLE) of the proportions (the restriction being that the proportions differ by exactly δ) can be used in the variance estimate of the test statistic instead of the usual unrestricted estimates commonly used for calculating confidence intervals or the pooled estimates commonly used for testing for superiority (8). The corresponding confidence interval, with limits calculated using the restricted MLE when estimating the variance, may not be appropriate, especially if the null hypothesis is rejected. Not too long ago, equivalence was commonly concluded by testing a hypothesis of no difference and finding a large P-value (e.g., P > 0.050 or ‘‘not statistically significant’’). This is now seen as inappropriate for two reasons: a significant difference might be too small to be of interest, and a nonsignificant difference might be large enough to be of concern despite the relatively large amount of noise. Both situations can be viewed in terms of inappropriate sample size: larger or smaller than required, respectively. Today, lack of statistical significance in testing a hypothesis of equality is not viewed as tantamount to concluding equivalence (3, 9, 10). Rather, the magnitude of the observed difference and
associated variability in the estimate must be known before a conclusion of equivalence is allowed (although this view is not universal, and the clinical literature occasionally presents conclusions of equivalence based on large P-values for tests of hypotheses of no difference). It is uncommon to report P-values for therapeutic equivalence trials.

3 IMPORTANCE OF RIGOROUS STUDY CONDUCT

The ability of a study to distinguish between treatments that are different is sometimes called ‘‘assay sensitivity’’ (11). Superiority studies are internally valid in the sense that a positive study inherently demonstrates that assay sensitivity was present (10). With an equivalence trial, this is not the case. Rather, historical evidence and proper trial conduct are necessary to provide evidence of the presence of assay sensitivity (11). This is not to say that rigorous methods are more important in equivalence trials than in superiority trials, only that the external interest in the rigor of study conduct may be higher for a therapeutic equivalence study due to the lack of internal validation. When a superiority trial shows statistical significance, it can be concluded that, in some patients, there is a difference. With an equivalence trial, this is not the case. Poor study conduct, such as enrolling study subjects who do not meet inclusion criteria, poor adherence to the study protocol, botched randomization, and even study fraud, can lead to a situation in which the probability of an outcome of equivalence under the null hypothesis is higher than the prespecified type I error rate. In each case, the impact will be to make the outcomes more similar in the two groups than they should be, leading to bias (9, 12). Although intentional misconduct is a concern in both superiority and equivalence studies, unintentionally decreasing the treatment effect via inadvertent conduct can be viewed very differently. Perversely, a placebo-controlled superiority study that is positive despite study conduct that makes the two treatments appear more similar (e.g., poor adherence to study treatment in both comparative treatment arms)
might be thought to increase confidence in the data. Showing a difference despite poor operational conditions suggests that the investigational treatment is effective even in a suboptimal setting. A conclusion of equivalence in such a setting would not inspire confidence in the results. Unfortunately, one aspect of assay sensitivity cannot be controlled in a therapeutic equivalence study: because every study participant receives an active treatment, there may be bias induced because the evaluations are made by someone who knows that every participant received an active treatment and therefore believes that every one of them has a good chance to have a positive response (11). This argues for using an objective endpoint when possible. Subjective endpoints such as physician interpretations on an ordinal scale (e.g., New York Heart Association functional class in heart failure patients) or study subject responses on a visual analog scale would be particularly prone to criticism.
4 CHOICE OF ANALYSIS SET
In superiority trials, it is nearly universal to use an intention-to-treat (ITT) approach to analyze the data (13–15). Although there are many interpretations of ITT, the general idea is that every randomized subject will contribute to the interpretation of the study, according to randomized treatment group, with only very narrow exceptions allowed (10, 16). In a therapeutic equivalence trial, the choice of analysis set is not as straightforward, and the ITT analysis may play a different role (10). Consider a trial in which a large proportion of patients do not strictly adhere to the intended treatment plan. Participants who do not have the disease under study are enrolled; some do not receive adequate amounts of the prescribed treatment; some have missing evaluations or outcomes; and/or primary endpoint measurements are taken incorrectly by investigator site personnel. With each of these situations, patients on the two treatments might tend to behave more alike, and the data might have more variability than if strict adherence to the intended treatment protocol had been obtained. In a
superiority trial, this will make it more difficult to show a difference in treatments, but in a therapeutic equivalence trial this might make it easier to show lack of an important difference. This is not an absolute because adding variability can decrease the probability of incorrectly demonstrating equivalence, even as bringing point estimates closer together can increase the probability of demonstrating equivalence (1, 17, 18). Allowing exceptions to a narrow interpretation of ITT in extraordinary situations is often considered for superiority studies (16). For example, subjects who do not meet laboratory test entry criteria for a study based on a serum sample that was obtained before randomization might be properly excluded from an ITT analysis, even if the laboratory test results are not available until after randomization due to the necessity of starting therapy immediately. Such exceptions might be useful in excluding from the analysis the patients who do not have the disease under study or those who will almost certainly not benefit from the treatments under study. A common analysis method for equivalence and noninferiority studies is to use a per protocol analysis set. Though again there are various definitions of this analysis set, in general the per protocol set will contain all participants who in retrospect met entry criteria, received at least a minimal amount of study drug, and had primary evaluations (16). Because this analysis set most accurately tests the underlying hypothesis of activity of the treatment on the biological target, it is useful in all studies (including superiority analyses) (10). The difference with therapeutic equivalence studies is that the per protocol set might be a primary or coprimary evaluation set (10). Bias may arise from the per protocol analysis due to, for example, participants who do not tolerate the treatment contributing nothing to the resulting analysis (due to being excluded from the analysis) rather than contributing lack of efficacy as in the ITT analysis. When one treatment is better tolerated but the other treatment is more efficacious among participants who can tolerate it, the two factors might make the treatments look equivalent or might even make the less tolerated treatment look somewhat superior. However, a
proper analysis (discriminating between the degree of tolerability and the conditional efficacy, given that a treatment is tolerated) will help distinguish between these two offsetting issues; for such an analysis, a per protocol and an ITT analysis might both be useful. Some investigators have argued that an ITT focus is as supportable for noninferiority (and therefore therapeutic equivalence) studies as for superiority studies (19). Specifically, the per protocol analysis is not the only, or even the best, alternative to the usual ITT approach in general. For now, the reader is cautioned that the ability to use an ITT approach to the analysis of such trials is an evolving area of research (20–22).

5 POWERING A THERAPEUTIC EQUIVALENCE TRIAL

The power of a therapeutic equivalence study is the probability that the study will conclude equivalence. This is the probability that both null hypotheses in the TOST procedure will be rejected. Standard methods can be used for calculating the power of each of the two test statistics individually, but the power of the TOST approach can be poorly estimated if the boundaries of the two test statistics are ignored. Consider a situation in which zα se(xA − xB) > 2δ. Such a study will have 0% power to conclude equivalence because the confidence interval is too wide to fit in the interval (−δ, δ). However, each of the TOST statistics, by the usual calculations, will have positive power under the assumption that µA = µB. In addition, the performance of binary and categorical data can vary dramatically from a sample size formula (23). For superiority studies, introductory statistics textbooks often give a sample size formula (assuming a t-test) as n = 2[σ(Z1−α + Z1−β)/µ∗]², where n is the sample size (per group), σ is the common standard deviation, α is the one-sided type I error rate, 1 − β is the desired power, and µ∗ is the difference in mean responses at a point of interest under the alternative hypothesis. For each one-sided test in the TOST procedure, this same formula can be used, with µ∗ being the difference in means minus δ (for ZA) or plus δ (for ZB).
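A sketch of the quoted per-group sample size formula, applied to each one-sided test of the TOST procedure, is given below; σ, δ, the assumed true difference, and the targeted per-test power are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(sigma, mu_star, alpha=0.025, power=0.90):
    """n = 2 * (sigma * (z_{1-alpha} + z_{1-beta}) / mu*)**2 for one one-sided test."""
    z = NormalDist()
    # mu_star is squared, so its sign does not matter.
    return ceil(2 * (sigma * (z.inv_cdf(1 - alpha) + z.inv_cdf(power)) / mu_star) ** 2)

# Hypothetical inputs: SD = 4, equivalence limit delta = 1.5, true difference = 0.
sigma, delta, true_diff = 4.0, 1.5, 0.0
n_a = n_per_group(sigma, mu_star=true_diff - delta)   # for ZA (difference minus delta)
n_b = n_per_group(sigma, mu_star=true_diff + delta)   # for ZB (difference plus delta)
print(f"n per group: {max(n_a, n_b)} (each one-sided test at alpha = 0.025, 90% power)")
```

With these inputs each one-sided test is powered at 90%, so under symmetric limits and a true difference of zero the overall TOST power is roughly 1 − 2(0.10) = 80%, which is the combination rule discussed next.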
For binary data, a similar formula will be used, but the estimate of variance will be based on the null hypothesis, such as the estimate proposed by Farrington and Manning (8). For the TOST procedure, the overall power for a given sample size will be 1 − β1 − β2, where 1 − βi is the power for the i-th hypothesis, i = 1, 2, for the given sample size. (With symmetric equivalence limits and the assumption that the two treatments have identical efficacy, this will simplify to 1 − 2β, where 1 − β is the common power for each one-sided hypothesis.) Many commercially available software packages are able to calculate power for the TOST. However, it is not always immediately clear which formula is being used or which test statistic is assumed, especially for the case of binary data. The reader is advised to closely read the documentation provided with the software and, in case of doubt, to consider alternatives such as simulations to verify the power of the planned study. The assumption about the true difference in treatments is an important consideration. It is common to assume that the treatments are identical, but this is actually the point at which power is maximized (assuming symmetric equivalence limits). The true power will be lower than calculated if there is a small non-zero difference in treatment effects (10).
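Continuing with illustrative inputs, the overall power combination 1 − β1 − β2 described above can be computed directly under the normal approximation; the result is clamped at zero to respect the interval-width caveat (zα se > 2δ) noted earlier.

```python
from math import sqrt
from statistics import NormalDist

def tost_power(n, sigma, delta, true_diff=0.0, alpha=0.025):
    """Approximate overall power of the TOST procedure (normal theory)."""
    z = NormalDist()
    se = sigma * sqrt(2.0 / n)
    crit = z.inv_cdf(1 - alpha)
    power_a = z.cdf((delta - true_diff) / se - crit)   # 1 - beta_1
    power_b = z.cdf((delta + true_diff) / se - crit)   # 1 - beta_2
    return max(0.0, power_a + power_b - 1.0)           # 1 - beta_1 - beta_2

print(f"Overall power at n = 150/group:    {tost_power(150, 4.0, 1.5):.2f}")
print(f"Overall power if true diff = 0.5:  {tost_power(150, 4.0, 1.5, true_diff=0.5):.2f}")
```

The second call illustrates the final point of this section: a small non-zero true difference noticeably reduces the power relative to the calculation that assumes identical treatments.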
6 INTERPRETATION OF AN EQUIVALENCE STUDY

In the simplest case, the bounds of the confidence interval for the difference in treatment means are calculated, and the study is interpreted. If the confidence interval lies entirely within (−δ, δ), the conclusion is one of therapeutic equivalence; if not, the conclusion is that therapeutic equivalence has not been demonstrated. However, there can be complications. If the confidence interval is entirely on one side of 0 but within the equivalence interval [or entirely in the interval (0, δ), for example], there might be concern that the treatments are in fact not equivalent. If the equivalence interval was chosen such that it represents a range of differences that are non-zero but also of no clinical consequence, a conclusion of therapeutic equivalence is still appropriate, despite a seemingly inconsistent yet supportable alternative conclusion of superiority of one treatment over the other. This possibility should be discussed during the planning stage. Alternatively, if the confidence interval lies partially outside of the equivalence interval, the conclusion is not necessarily that equivalence does not exist. Rather, it means that the study failed to demonstrate equivalence. Especially if the point estimate for the treatment effect (difference in sample means) is within the equivalence interval, it may be a situation in which a larger sample size is required to demonstrate therapeutic equivalence. If the entire confidence interval lies outside the equivalence zone of (−δ, δ), such as when the lower bound of the confidence interval exceeds δ, one can conclude that equivalence does not exist.
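The interpretation rules in this section can be collected into a small helper function; the sketch below and its margin and example intervals are illustrative only.

```python
def interpret_equivalence_ci(lower, upper, delta):
    """Classify a two-sided confidence interval for the treatment difference."""
    if -delta < lower and upper < delta:
        return "therapeutic equivalence demonstrated"
    if lower >= delta or upper <= -delta:
        return "equivalence ruled out (difference exceeds the margin)"
    return "equivalence not demonstrated (interval overlaps the margin)"

# Illustrative confidence intervals with an equivalence limit of 1.5.
for ci in [(-1.3, 0.5), (-0.2, 1.8), (1.6, 2.9)]:
    print(ci, "->", interpret_equivalence_ci(*ci, delta=1.5))
```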
7 OTHER ISSUES
Other details in clinical trials are also areas of research for noninferiority and equivalence trials, including multiple comparisons, exact and nonparametric tests, Bayesian methods, and group sequential methods. See the Further Reading list for a sample of such papers.

REFERENCES
1. E. Garbe, J. Röhmel, and U. Gundert-Remy, Clinical and statistical issues in therapeutic equivalence trials. Eur J Clin Pharmacol. 1993; 45: 1–7.
2. B. L. Wiens and B. Iglewicz, On testing equivalence of three populations. J Biopharm Stat. 1999; 9: 465–483.
3. B. L. Wiens, Choosing an equivalence limit for noninferiority or equivalence studies. Control Clin Trials. 2002; 23: 2–14.
4. E. B. Holmgren, Establishing equivalence by showing that a specified percentage of the effect of the active control over placebo is maintained. J Biopharm Stat. 1999; 9: 651–659.
5. G. Aras, Superiority, noninferiority, equivalence, and bioequivalence—revisited. Drug Inf J. 2001; 35: 1157–1164.
6. D. J. Schuirmann, A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. J Pharmacokinet Biopharm. 1987; 15: 657–680.
7. W. W. Hauck and S. Anderson, Some issues in the design and analysis of equivalence trials. Drug Inf J. 1999; 33: 109–118.
8. C. R. Farrington and G. Manning, Test statistics and sample size formulae for comparative binomial trials with null hypotheses of non-zero risk difference or non-unity relative risk. Stat Med. 1990; 9: 1447–1454.
9. B. Jones, P. Jarvis, J. A. Lewis, and A. F. Ebbutt, Trials to assess equivalence: the importance of rigorous methods. BMJ. 1996; 313: 36–39.
10. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline: E9 Statistical Principles for Clinical Trials. Step 4 version, February 5, 1998. Available at: http://www.ich.org/LOB/media/MEDIA485.pdf
11. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Harmonised Tripartite Guideline: E10 Choice of Control Group and Related Issues in Clinical Trials. Current Step 4 version, July 20, 2000. Available at: http://www.ich.org/LOB/media/MEDIA486.pdf
12. J. P. Siegel, Equivalence and noninferiority trials. Am Heart J. 2000; 139: S166–S170.
13. L. Fisher, D. O. Dixon, J. Herson, R. K. Frankowski, M. S. Hearron, and K. E. Peace, Intention to treat in clinical trials. In: K. E. Peace (ed.), Statistical Issues in Drug Research and Development. New York: Marcel-Dekker, 1990, pp. 331–350.
14. J. A. Lewis and D. Machin, Intention to treat—who should use ITT? Br J Cancer. 1993; 68: 647–650.
15. R. Peto, M. C. Pike, P. Armitage, N. E. Breslow, D. R. Cox, et al., Design and analysis of randomized clinical trials requiring prolonged observation of each patient. Br J Cancer. 1976; 34: 585–612.
16. D. Gillings and G. Koch, The application of the principle of intention-to-treat to the analysis of clinical trials. Drug Inf J. 1991; 25: 411–424.
17. C. Chuang-Stein, Clinical equivalence—a clarification. Drug Inf J. 1999; 33: 1189–1194.
18. A. Koch and J. Röhmel, The impact of sloppy study conduct on noninferiority studies. Drug Inf J. 2002; 36: 3–6.
19. B. L. Wiens and W. Zhao, The role of intention to treat in the analysis of noninferiority studies. Clin Trials. 2007; 4: 286–291.
20. A. F. Ebbutt and L. Frith, Practical issues in equivalence trials. Stat Med. 1998; 17: 1691–1701.
21. M. M. Sanchez and X. Chen, Choosing the analysis population in noninferiority studies: per protocol or intent-to-treat. Stat Med. 2006; 25: 1169–1181.
22. E. Brittain and D. Lin, A comparison of intent-to-treat and per-protocol results in antibiotic noninferiority trials. Stat Med. 2005; 24: 1–10.
23. D. Tu, Two one-sided tests procedures in establishing therapeutic equivalence with binary clinical endpoints: fixed sample size performance and sample size determination. J Stat Comput Simul. 1997; 59: 271–290.
FURTHER READING

E. Bofinger and M. Bofinger, Equivalence of normal means compared with a control. Commun Stat Theory Methods. 1993; 22: 3117–3141.
J. J. Chen, Y. Tsong, and S. H. Kang, Tests for equivalence or noninferiority between two proportions. Drug Inf J. 2000; 34: 569–578.
S. Y. Chen and H. J. Chen, A range test for the equivalence of means under unequal variances. Technometrics. 1999; 41: 250–260.
S. C. Chow and J. Shao, On noninferiority margin and statistical tests in active control trials. Stat Med. 2006; 25: 1101–1113.
C. W. Dunnett and A. C. Tamhane, Multiple testing to establish superiority/equivalence of a new treatment compared with k standard treatments. Stat Med. 1997; 16: 2489–2506.
T. Friede and M. Kieser, Blinded sample size reassessment in noninferiority and equivalence trials. Stat Med. 2003; 22: 995–1007.
H. Quan, J. Bolognese, and W. Yuan, Assessment of equivalence on multiple endpoints. Stat Med. 2001; 20: 3159–3173.
J. M. Robins, Correction for non-compliance in equivalence trials. Stat Med. 1998; 17: 269–302.
R. Simon, Bayesian design and analysis of active control clinical trials. Biometrics. 1999; 55: 484–487.
T. Yanagawa, T. Tango, and Y. Hiejima, Mantel-Haenszel-type tests for testing equivalence or
more than equivalence in comparative clinical trials. Biometrics. 1994; 50: 859–864.
CROSS-REFERENCES
Noninferiority trial
Bioequivalence
Equivalence trial
Equivalence limit
Sample size for comparing means (superiority and noninferiority)
Sample size for comparing proportions (superiority and noninferiority)
Equivalence analysis
THERAPEUTIC INDEX
RAED D. ABUGHAZALEH University of Minnesota College of Pharmacy Minneapolis, Minnesota
TIMOTHY S. TRACY Department of Experimental and Clinical Pharmacology University of Minnesota College of Pharmacy Minneapolis, Minnesota
The term ‘‘therapeutic index’’ has been used for several decades, yet the concept remains widely misunderstood and its significance is controversial. In this respect, although the general concept of the therapeutic index is describable by most individuals, issues such as what defines a ‘‘narrow therapeutic index’’ drug, pharmacokinetic and pharmacodynamic variability, interindividual and intraindividual variability, and adequacy of biomarkers and surrogate endpoints, among others, continue to confound determination and clinical utility of the ‘‘therapeutic index.’’ In this article, definitions of therapeutic index, ‘‘narrow therapeutic index,’’ and the aforementioned confounding issues will be discussed, and a perspective on the clinical utility of a therapeutic index will be presented.

1 DEFINITION
Therapeutic index is most commonly defined as the ratio of the median lethal dose (LD50) to the median effective dose (ED50), as determined in preclinical animal studies. The median lethal dose (LD50) is the dose that results in death in half the population of animals to which it is administered, whereas the median effective dose (ED50) is the dose that elicits the desired pharmacologic effect in half the studied population (1). Therapeutic index, therefore, reflects the selectivity of a drug to elicit a desired effect rather than toxicity. A higher (or greater) therapeutic index suggests a higher tolerance to a dose increase beyond the ED50, or a wider ‘‘margin of safety.’’
Figure 1. Quantal dose-effect plot. Each curve represents the cumulative frequency distribution of animals exhibiting efficacy or toxicity. The therapeutic index is calculated by dividing the median lethal dose (LD50) by the median effective dose (ED50).

Table 1. Proposed therapeutic ranges for some commonly monitored drugs

Amikacin: 20−30 µg/mL (peak), <10 µg/mL (trough)
Carbamazepine: 4−12 µg/mL
Digoxin: 1−2 ng/mL
Gentamicin: 5−10 µg/mL (peak), <2 µg/mL (trough)
Lidocaine: 1−5 µg/mL
Lithium: 0.6−1.2 mEq/L
Phenytoin: 10−20 µg/mL
Procainamide: 4−10 µg/mL
Quinidine: 1−4 µg/mL
Theophylline: 10−20 µg/mL
Tobramycin: 5−10 µg/mL (peak), <2 µg/mL (trough)
Valproic acid: 50−100 µg/mL
Vancomycin: 20−40 µg/mL (peak), 5–15 µg/mL (trough)

Source: Adapted from Schumacher 1995 (8).
Drugs with a low therapeutic index are frequently also referred to as ‘‘narrow therapeutic index’’ drugs (2). According to the Code of Federal Regulations (CFR) §20.33, the United States Food and Drug Administration (FDA) defines narrow therapeutic index drugs as those that ‘‘exhibit less than a 2-fold difference in median lethal dose (LD50) and median effective dose (ED50) values, or have less than a 2-fold difference in the minimum toxic concentrations and minimum effective concentrations in the blood, and safe and effective use of the drug products requires careful dosage titration and patient monitoring.’’ However, the FDA does not claim narrow therapeutic index to be a formal designation, and does not mention it in the publication Approved Drug Products with Therapeutic Equivalence Evaluations [The Orange Book] (3–5). Others have defined narrow therapeutic index drugs as those ‘‘for which relatively small changes in systemic concentrations can lead to marked changes in pharmacodynamic response’’ (6). To date, no official listing of narrow therapeutic index drugs has been published, and the FDA has taken the position that each drug is unique, and thus different drugs should not be clustered into discrete groups (3). Despite the FDA’s position, at least two state legislative attempts have been made to define narrow therapeutic index drugs, including legislation proposed by the state of New York in 1997 (NY proposed Assembly Bill 8087.A). The therapeutic index of a drug can be determined through the use of quantal dose-effect plots. This methodology plots the Gaussian normal distribution of the percentage of subjects either responding to or dying from increasing doses. From this curve, cumulative response and death curves are plotted, and the ED50 and LD50 are obtained (Figure 1). Throughout all of this, one must recognize that a therapeutic index determined from preclinical animal studies may not directly extrapolate to the human situation when used to estimate a drug’s prospective utility and safety in humans. Certainly, it is obvious that determination of an LD50 in humans is not possible. This has led to alternative methods of estimating the therapeutic index in humans that use the therapeutic dosing ranges of a drug and related measures. For instance, the median toxic dose (TD50) may be substituted in place of the LD50 in the numerator of the therapeutic index determination. In this case, the median TD50 is the dose that produces a defined toxicity in 50% of the subjects examined (in this case, it may be human subjects or animals). Apart from the obvious advantage of not requiring a lethal endpoint, one can also gather data more relevant to the human situation.
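As a numerical sketch of reading ED50 and LD50 off cumulative quantal curves such as those in Figure 1, the fragment below (assuming NumPy is available) uses fabricated response data and simple linear interpolation in place of a formal probit or logit fit; substituting a toxicity curve for the lethality curve would give the TD50-based variant described above in the same way.

```python
import numpy as np

doses = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
# Cumulative fraction of animals responding (efficacy) or dying (toxicity)
# at each dose; all values are hypothetical.
frac_effective = np.array([0.02, 0.10, 0.30, 0.60, 0.85, 0.97, 1.00])
frac_lethal    = np.array([0.00, 0.01, 0.02, 0.08, 0.30, 0.65, 0.95])

# Interpolate the dose giving a 50% cumulative response on each curve.
ed50 = np.interp(0.5, frac_effective, doses)
ld50 = np.interp(0.5, frac_lethal, doses)

print(f"ED50 ~ {ed50:.1f}, LD50 ~ {ld50:.1f}, therapeutic index ~ {ld50 / ed50:.1f}")
```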
The certain safety factor (CSF) is another variation of the therapeutic index in which the TD(LD)50 is replaced by the TD(LD)1, and the ED50 is replaced by the ED99. The obvious theoretical advantage comes from minimizing toxicity in the experimental subjects. However, it is sometimes not possible to achieve 99% efficacy in subjects even at the maximal dose of a drug. Using a 1% toxicity threshold (TD1) may be too ambitious as well because placebo may cause a higher than 1% incidence of mild side effects. Finally, use of the LD1 as a toxicity measure may compromise experimental accuracy as
the lethal dose determinations require very large sample sizes to assess 1% lethality with reasonable statistical certainty. The CSF has yet to be validated in the literature as a reliable method of estimating the therapeutic utility of an agent. Though few publications have used this method, the work of Zhou et al. (7) describes an interesting application of the CSF as related to intravenous administration of emulsified isoflurane. In this study, the investigators found that emulsified isoflurane could be given safely by the intravenous route and that the CSF for emulsified isoflurane was comparable with that of intravenous propofol. Another method proposed as a clinical alternative for the therapeutic index, and closely related to the estimation using TD50 and ED50, is the use of the upper and lower limits of the therapeutic concentration range as an estimate of the therapeutic index. Table 1 lists the proposed therapeutic ranges for some commonly monitored drugs (8).

2 ISSUES ASSOCIATED WITH THE DETERMINATION OF TD50

The use of TD50 as the primary measure of toxicity in estimating the therapeutic index is confounded by some not-so-obvious factors that are associated with its determination and merit further examination. Toxicity from drugs can be far more complex to assess and understand than efficacy. With efficacy evaluation, one is looking for a specific effect or a set of specific effects to monitor using specific endpoints, which are typically based on previously known mechanisms of action and an extrapolated ED50, and are generally predictable in terms of onset, magnitude, and duration. Toxicities to a certain extent can be predictable, such as those that are extensions of the drug’s known mechanism(s) of action and that tend to be consistent across a class of drugs. For example, a primary toxicity of alpha-receptor blockers is orthostatic hypotension, and fibrinolytic agents are notorious for increasing the risk of bleeding, both toxicities being obvious extensions of the respective drugs’ mechanisms of action. However, there are many forms of toxicities that are idiosyncratic, meaning that
the mechanism responsible for the toxicity is unknown or unexplainable. For example, the exact mechanism underlying why the statin hypolipidemic agents can cause muscle toxicity, including myopathy and rhabdomyolysis (9), remains elusive. This toxicity is rather unpredictable, can occur at any time during the treatment course, can be associated with virtually all drugs of the class, and does not appear to be dose related. However, there is consensus that the toxicity is related to the potency of the statin, such that muscle toxicity is more likely with the more potent statin agents (9). It is true that there are established risk factors for such toxicities; however, these are not prospective predictors of the toxicity but mere associations. At least three mechanisms, alone or in combination, have been proposed for this statin-associated muscle toxicity, but the exact cause(s) remain theoretical and not definitively established (9). Drug toxicity can sometimes be of an allergic nature and can occur as type I immediate, type II cytotoxic, or type III immune-complex hypersensitivity reactions (10). Type I hypersensitivity is immunoglobulin E (IgE) mediated and can be manifested in its most severe form as systemic anaphylaxis, such as that associated with penicillins. Type II hypersensitivity reactions are IgG mediated and can be severe as well; examples include quinidine-induced hemolysis and quinine-induced thrombocytopenia. Lastly, type III hypersensitivity reactions are also IgG mediated and can result in conditions like serum sickness, such as is noted with the penicillin-type drugs. Each of these types of hypersensitivity reactions is typically unpredictable when it initially occurs and can be greatly influenced by an individual's specific biological characteristics, such as previous exposure to similar compounds or antibody titers. Determination of a therapeutic index is therefore of little value with respect to hypersensitivity reactions because there may be no clear association between drug dose and the observed toxicity. Thus, margin-of-safety determinations are extremely difficult when hypersensitivity reactions are an element of the drug's toxicity profile.
Another area where the determination of a therapeutic index lacks guidance is that of toxicities arising from chronic administration of a drug extending beyond the timeframe of clinical trial evaluation. In such cases, long-term toxicities are usually identified during postmarketing surveillance studies, which is unfortunately after the drug has become available to the general public. For example, amiodarone is an antiarrhythmic agent that exhibits a very long half-life (t1/2 = 35 days), resulting in substantial accumulation of the drug over very long periods of time. Such long-term accumulation can ultimately result in toxicities, like pulmonary fibrosis, that do not become evident until a substantial period of time after initiation of drug administration (11). These toxicities may be beyond the scope of evaluation in preclinical animal studies and may not be reflected in determinations of the TD50, thus rendering determination of the therapeutic index meaningless in this situation.
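To put the cited half-life in perspective, the sketch below works out the accumulation implied by a simple one-compartment, first-order model with once-daily dosing; the model and the dosing interval are illustrative assumptions, not a full pharmacokinetic description of amiodarone.

import numpy as np

t_half = 35.0                                    # days, as cited in the text
k = np.log(2) / t_half                           # first-order elimination rate constant (1/day)
tau = 1.0                                        # assumed once-daily dosing interval (days)

accumulation_ratio = 1 / (1 - np.exp(-k * tau))  # steady-state exposure relative to a single dose
t_90 = np.log(10) / k                            # time to reach ~90% of steady state

print(round(accumulation_ratio), round(t_90))    # ~51-fold accumulation; ~116 days to near-steady state

Under these assumptions, concentrations keep climbing for roughly four months, so a toxicity driven by accumulated exposure can easily emerge only after a typical trial observation window has closed.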
3 ISSUES ASSOCIATED WITH DETERMINATION OF THE LD50

Upon careful consideration, it becomes apparent that issues related to the LD50 and its determination are abundant (12). For example, mortality can be an imprecise measure of a drug's toxicity because toxicity is generally a continuum and not an absolute endpoint. In fact, one may be willing to accept a certain level of toxicity in exchange for a given benefit, depending on the drug in question (e.g., anticancer agents). In addition, there are species differences that prevent accurate extrapolation of the LD50 (and ED50) from animals to humans, including interspecies pharmacodynamic and pharmacokinetic differences. One must be cognizant of how well these types of studies correlate from animals to humans when determining the potential information that may be gained. A second factor that may profoundly affect the determination of the LD50 (and the TD50, for that matter) is the effect of drug formulation on the toxicities and effects noted. For example, slow or sustained release (SR) formulations tend to show a steady and slow rise in serum concentration, resulting in a blunted maximum concentration (Cmax) and a longer time to maximum concentration (Tmax) as compared with immediate release (IR) preparations (Figure 2).

Figure 2. Effect of dose formulations on toxicity and effectiveness of a drug. Immediate release (IR) formulations typically result in higher Cmax and shorter Tmax values than do sustained release (SR) formulations. As a result, a given dose of a drug may exhibit a milder toxicity profile if given in a sustained release formulation than would the same dose given in an immediate release formulation.

The end result may be that one experiences drug concentrations within the toxic range after administration of an immediate release formulation, whereas a sustained release formulation might not achieve this same drug concentration. Frequently, solution formulations are used in animal studies for ease of administration and result in immediate drug absorption, whereas the final formulation for human use may be a tablet (and potentially a sustained release tablet or capsule) that will give a very different pharmacokinetic profile with potentially lower maximum concentrations. Likewise, use of other routes of administration such as intramuscular or transdermal administration may also give
pharmacokinetic profiles analogous to a sustained release preparation, and thus potentially a more blunted peak concentration. However, it should be noted that this effect of sustained release formulations can cut both ways in that it can also confound the determination of the ED50, because this kind of kinetic profile can alter the period of time during which effective concentrations are maintained. Finally, one must consider conditions where nonlinear pharmacokinetics occur with the drug of interest. Generally, nonlinear pharmacokinetics occur due to changes in one or more of the drug's absorption, distribution, metabolism, or elimination characteristics in response to a change in dose (13, 14). In such a case, the concept of superposition is violated, and thus the increase in drug bioavailability or concentration resulting from an increase in dose is greater (or lesser) than would be predicted based on the change in the size of the dose (Figure 3).

Figure 3. Nonlinear pharmacokinetics result from changes in one or more of a drug's absorption, distribution, metabolism, or elimination characteristics. This manifests in a nonlinear dose–concentration curve and may lead to underestimation or overestimation of the target concentration in response to incremental increases in dose. In the special case of metabolic saturation, nonlinearity is caused by approaching the maximum catalytic velocity (Vmax) at a given substrate concentration ([S]), as seen in the hyperbolic plot (inset).

For instance, the steady-state concentrations of fluoxetine are significantly higher and the half-life longer than would be predicted from single-dose administration (15). This occurs due to inhibition of fluoxetine metabolism by its metabolite norfluoxetine, which also exhibits a long half-life. Thus, the product of metabolism (norfluoxetine) inhibits parent drug metabolism, resulting in substantially higher drug concentrations upon multiple dosing. Without chronic dosing models and measurement of toxicity, this type of phenomenon would be missed and would not be reflected in the therapeutic index either. An additional classic example of nonlinearity is seen with phenytoin, which follows Michaelis-Menten kinetics with respect to saturation of metabolism (see Figure 3). As the drug's serum concentration rises, the catalytic sites of its metabolizing enzymes become saturated with drug, and the maximum catalytic velocity (Vmax) is
quickly reached. As a result, the drug’s metabolism reaches a plateau and causes its accumulation. The normal therapeutic concentration range of phenytoin is 10–20 µg/mL; however, within the range of doses that result in concentrations within this therapeutic range, an incremental increase in dose results in a disproportionately greater increase in concentration, meaning that the difference between subtherapeutic and toxic concentrations is traversed rapidly with a small change in dose (13, 14, 16). Another consequence of this saturable metabolism exhibited by phenytoin is the disproportionality between the rate of administration and time to reach steady state, where the time to reach steady state increases progressively with increasing rate of administration, again due to the saturation of metabolism (14). An additional factor potentially contributing to phenytoin’s nonlinear kinetics is its limited aqueous solubility and subsequent absorption, where its peak serum concentrations are actually reduced at higher oral
Table 2. Mechanisms and examples of important drugs that exhibit nonlinear kinetics

Poor aqueous solubility: phenytoin, griseofulvin
Saturation of carrier-mediated absorption: amino-β-lactam antibiotics, levodopa
Saturation of first-pass metabolism: verapamil, fluorouracil, propranolol, hydralazine
Saturable plasma protein binding: valproic acid, disopyramide
Saturable metabolism: ethanol, phenytoin, theophylline, salicylate
Autoinhibition: fluoxetine
Autoinduction: carbamazepine, phenytoin, rifampicin

Source: Adapted from Ludden 1991 (13) and Otton et al. 1993 (15).
doses. In one study, a 1600-mg single oral dose achieved a serum concentration peak of 10.7 mg/L, but dividing it into four 400-mg doses every 3 hours achieved a peak level of 15.3 mg/L (17). As we can see, nonlinear kinetics can pose a serious hurdle in interpreting a drug’s therapeutic index. One can imagine the level of complexity and unpredictability of behaviors of such drugs when considering other factors such as interindividual variability, which can be as high as 50% for phenytoin (18). Fortunately, not many drugs behave like phenytoin, and a list of some of these drugs and the mechanism resulting in the nonlinearity can be found in Table 2 (13, 15).
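The disproportionate rise described for phenytoin follows directly from the steady-state mass balance for Michaelis-Menten elimination, dosing rate = Vmax x Css / (Km + Css). The sketch below uses round, hypothetical Vmax and Km values (not parameters for any particular patient or the cited studies) to show how small dose increments can traverse the 10–20 µg/mL range.

# Illustrative only: hypothetical Vmax/Km values for a phenytoin-like drug.
def css_at_steady_state(dose_rate, vmax, km):
    """Steady-state concentration under saturable (Michaelis-Menten) elimination:
    dose_rate = vmax * Css / (km + Css)  =>  Css = km * dose_rate / (vmax - dose_rate)."""
    if dose_rate >= vmax:
        raise ValueError("dosing rate meets or exceeds Vmax; no steady state is reached")
    return km * dose_rate / (vmax - dose_rate)

vmax, km = 500.0, 4.0                 # mg/day and mg/L, chosen only for illustration
for dose in (300, 350, 400, 450):     # mg/day
    print(dose, round(css_at_steady_state(dose, vmax, km), 1))
# 300 -> 6.0, 350 -> 9.3, 400 -> 16.0, 450 -> 36.0 mg/L

In this sketch a 12.5% increase in the daily dose (400 to 450 mg) more than doubles the predicted steady-state concentration, which is exactly the behavior that makes dose titration of such drugs difficult.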
4 USE OF BIOMARKERS IN THE DETERMINATION OF ED50

Death of the animal is the only biomarker for assessing the LD50. In the case of the TD50, the biomarker is simply the toxicity observed and is most generally an extension of the drug's pharmacologic effects, except in the case of hypersensitivity reactions (see the previous discussion). However, determining biomarkers for pharmacologic effect (i.e., the ED50) can be a more difficult task. For example, if one were studying a new thrombolytic molecule in mice, a decision must be made on whether the efficacy endpoint should be time to complete lysis of the clot, time to partial lysis, degree of thrombus lysis, or even a surrogate endpoint marker such as degree of vascular occlusion. Because other investigators studying the effects of drugs of the same class might choose a different biomarker than previously used, the appropriateness of comparing therapeutic indices of any two drugs may be questionable. It is possible to use composite or multiple biomarkers in the determination of ED50, but this situation may be complex and difficult to interpret. An additional component of ED50 that merits consideration is whether one should consider dose or drug concentration as the correlate to effect. It is difficult, and perhaps often impossible, to derive an exact measure of drug exposure at the target molecular site of drug action, but more accurate
assumptions will naturally produce more representative and potentially useful therapeutic indices. It may be argued that the applied dose is a sufficient measure of exposure, where the ED50 would translate into the drug dose that achieves the desired effect in half the subjects studied. Accepting the dose as a surrogate measure of exposure obviates the need for drug concentration analyses and pharmacokinetic parameter estimations (such as Cmax, Tmax, or area under the curve). However, when only dose is used, one is making the assumption that the drug concentrations achieved are identical for all individuals at a given dose. This is obviously an oversimplification, as we will discuss with respect to pharmacokinetic variability (6). To further complicate matters, the effect of some drugs such as antibiotics (i.e., antibacterial action) is concentration dependent (e.g., aminoglycosides and fluoroquinolones), whereas for other antibacterials the effect is time dependent (e.g., penicillins and cephalosporins). Additionally, for some drugs such as the selective serotonin reuptake inhibitors (SSRIs), the primary antidepressant effects occur through synaptic remodeling, which can take weeks to months (1). These differences illustrate that in some cases the temporal nature of the drug's pharmacodynamics, rather than its pharmacokinetics, dictates the time frame of action.
5 INTERINDIVIDUAL VARIABILITY
There are multiple sources of interindividual, or between-subject, variability that can influence the application of a therapeutic index to human drug therapy. Interindividual variability can be divided into two general categories: pharmacokinetic (PK) and pharmacodynamic (PD). Because both can lead to differences in efficacy or toxicity of a drug between different individuals, the estimation of a drug’s therapeutic index will have a certain variability associated with it, and this variability may be substantial. Furthermore, application of a therapeutic index in making dosing decisions becomes more complex because individuals respond differently to a given drug; thus, the dose that is efficacious to one person may be toxic to another.
Pharmacokinetic variability can affect all aspects of drug exposure, including bioavailability, the concentration time profile, and area under the curve (AUC) of a drug. Pharmacodynamic variability manifests as interindividual differences in drug effect as measured either directly or with surrogate endpoints. An increasingly recognized source of PK and PD variability is genetic polymorphisms, which are typically single nucleotide polymorphisms (SNPs) that exist in our genes and result in differential protein expression. The PK variability can be influenced by genetics in multiple ways, such as variability in certain efflux or influx transporter proteins that play a role in transporting a drug molecule within the body or as variability in drug metabolizing enzymes. For example, a mutation in the multiple drug resistance gene (MDR-1) encoding the P-glycoprotein (P-gp) transporter has been shown to affect cyclosporine bioavailability in renal transplant patients (19, 20) and to be predictive of cyclosporine target concentration in liver transplant patients (21). With respect to pharmacodynamic genetic variation, mutations in β-adrenergic receptors have been associated with differential response to β-blocker therapy (22, 23) as well as other receptors. Disease and environmental factors may also result in PK and PD variability, such as the degree of renal or hepatic function present in an individual, concomitant drug therapy (drug interactions), dietary factors, and so on. As expected, variability may be more clinically relevant in drugs considered to exhibit a narrow therapeutic index, such as cyclosporine, as compared with drugs that have a wider therapeutic index such as benzodiazepines. Table 3 lists some polymorphic
genes with potential effects on drugs considered to possess a narrow therapeutic index. The subject of interindividual variability with respect to narrow therapeutic index drugs has recently become a contentious one (3). At issue is whether FDA bioequivalence criteria are sufficient for minimizing any significant differences in bioavailability and concentration–time profile in narrow therapeutic index drugs when switching patients from brand-name to generic drugs (3, 5). Several proposals have recently been put forth by state boards of pharmacy and state legislatures to tighten the regulations on "generic switching" with respect to narrow therapeutic index drugs (3). From a pharmacokinetic standpoint, the FDA currently requires that the 90% confidence interval for a test drug's rate and extent of absorption (Cmax and AUC, respectively) fall within 80% to 125% of the reference drug in order for the products to be considered bioequivalent, regardless of the drug class or group (4). In response, the FDA reaffirmed its current position and referred to the taskforce it commissioned in 1986 to investigate very similar issues at that time. That taskforce subsequently conducted two studies with a drug considered to possess a narrow therapeutic index, carbamazepine, which demonstrated no differences in bioequivalence, efficacy, or safety between generic and innovator products (3, 5).
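For orientation, the sketch below applies the 80–125% acceptance limits to a 90% confidence interval for the geometric mean ratio of AUC, computed on the log scale. The AUC values are invented, and the two-group comparison is a simplification: an actual bioequivalence submission would normally use a crossover design analyzed by ANOVA with sequence, period, and subject terms.

import numpy as np
from scipy import stats

# Hypothetical AUC values (ng*h/mL) for a test and a reference formulation.
test = np.array([95.0, 110.0, 102.0, 88.0, 121.0, 99.0, 105.0, 93.0])
ref  = np.array([100.0, 104.0, 98.0, 91.0, 118.0, 103.0, 110.0, 96.0])

lt, lr = np.log(test), np.log(ref)
n1, n2 = len(lt), len(lr)
diff = lt.mean() - lr.mean()                       # log of the geometric mean ratio
sp2 = ((n1 - 1) * lt.var(ddof=1) + (n2 - 1) * lr.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
tcrit = stats.t.ppf(0.95, n1 + n2 - 2)             # 90% CI corresponds to two one-sided 5% tests

lo, hi = np.exp(diff - tcrit * se), np.exp(diff + tcrit * se)
bioequivalent = (lo >= 0.80) and (hi <= 1.25)
print(f"GMR 90% CI: {lo:.3f} to {hi:.3f}; bioequivalent: {bioequivalent}")

The debate summarized above is essentially about whether these default limits should be tightened when the product in question has a narrow therapeutic index.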
6 INTRAINDIVIDUAL VARIABILITY

Another issue of importance in therapeutic index determinations is within-subject variability, or intraindividual variability. When
Table 3. Examples of narrow therapeutic index drugs affected by polymorphic genes

Drug                 Polymorphic gene(s)
Cyclosporine         CYP3A5, MDR1
Phenytoin            CYP2C9
Carbamazepine        CYP3A4
Warfarin             CYP2C9
Thioguanine          Thiopurine methyltransferase (TPMT)
Irinotecan           UGT1A1
Tamoxifen            CYP2D6
applying a therapeutic index to any population, one must keep in mind that each individual in the population has several intrinsic and extrinsic elements of variability that may lead him or her to experience differences in efficacy and/or toxicity on different occasions of drug administration. Certainly, when one first thinks of intraindividual variability, the issues of physiologic factors as potential sources come to mind. For example, differences in gastric pH between dose administrations, fasted versus fed state, food complexation, differences in gastric emptying rate across days, and other seemingly subtle but important changes can affect the absorption, distribution, metabolism, or excretion of a drug and play a role in intraindividual variability (24). Another element of intraindividual variability to consider is that a given drug can often be used to treat several conditions that may require different doses and target concentrations. For example, the recommended dose range for aspirin when used to prevent recurrent myocardial infarction is 75–325 mg per day, whereas that for treating rheumatoid arthritis is 3 grams per day to be increased until effect is achieved (25). Although platelets are anucleated and exquisitely sensitive to inhibition by aspirin and only require a low dose to achieve saturation of cyclo-oxygenase-1 (COX-1), monocytes are nucleated, exhibit a high turnover rate of cyclooxygenases, and thus require higher doses of aspirin to suppress inflammation and exhibit a dose-related anti-inflammatory effect (1, 26). Incidentally, gastrointestinal bleeding is more likely at higher doses of
aspirin due to the dose-related suppression of cyclooxygenases in nucleated enterocytes; therefore, one would expect two different toxicity profiles for aspirin treatment of rheumatoid arthritis and prevention of myocardial infarction. The example of aspirin is classic in terms of the different mechanisms of action underlying efficacy and toxicity at different doses. However, aspirin will most likely be assigned a single "therapeutic index," and one is left with the task of determining how to apply it to a population given the amount of variability in efficacy and toxicity that could result from its use for different indications. The question then becomes whether different indications for a given drug must have different therapeutic indices based on the dose range and the specified toxicity endpoint. It is apparent that this can increase the complexity of application of a therapeutic index, particularly as not all indications and dose ranges are studied in preclinical or early clinical testing.

7 VARIABILITY IN NARROW THERAPEUTIC INDEX DRUGS

By definition, narrow therapeutic index drugs show little intraindividual variability, typically less than a 30% analysis of variance coefficient of variation (ANOVA-CV). Otherwise, it becomes difficult to approve them for human use because patients would suffer continuous cycles of toxic and subtherapeutic drug concentrations, and the drug would fail in the advanced phases of clinical trials (6). Table 4 lists some commonly considered
Table 4. Intra- and intersubject variability for some commonly considered narrow therapeutic index drugs (coefficient of variation, %)

Drug                      Intersubject CV
Carbamazepine             38
Conjugated estrogens      42
Digoxin                   52
Levothyroxine             20
Phenytoin                 51
Theophylline (SR)         31
Warfarin                  53

Intrasubject CVs reported for these drugs are considerably smaller: 14−15, <20, 10−15, 11−14, and 6−11.

Source: Adapted from Benet 1999 (18).
narrow therapeutic index drugs and their estimated intersubject and intrasubject variability (18). Note that for all the listed drugs, the intrasubject variability remains below 30% and is less than the intersubject variability. This means that, for these drugs, practitioners may at first find it hard to identify the appropriate dosage regimen for a given patient, but once that regimen is arrived at there should be little need to readjust it in the future.
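The sketch below shows one simple way a within-subject (intrasubject) CV of the kind tabulated above can be estimated from replicate pharmacokinetic measurements. The data are simulated, and the log-normal, no-period-effect assumptions are simplifications of the ANOVA that would actually be used in practice.

import numpy as np

rng = np.random.default_rng(1)
n = 24
true_within_sd = 0.12                               # assumed SD of log(AUC) within a subject (~12% CV)
subject_mean = rng.normal(np.log(100), 0.4, n)      # simulated subject-level mean log(AUC)
period1 = rng.normal(subject_mean, true_within_sd)  # replicate measurement 1
period2 = rng.normal(subject_mean, true_within_sd)  # replicate measurement 2

d = period1 - period2                               # subject-level log differences
s_w2 = d.var(ddof=1) / 2                            # within-subject variance of log(AUC)
cv_within = np.sqrt(np.exp(s_w2) - 1)               # log-normal CV
cv_between = np.sqrt(np.exp(subject_mean.var(ddof=1)) - 1)  # CV from the spread of the simulated subject means
print(f"intrasubject CV ~ {cv_within:.0%}, intersubject CV ~ {cv_between:.0%}")

The output mirrors the pattern in Table 4: the between-subject spread is much larger than the replicate-to-replicate spread within a subject.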
8 SUMMARY
The ultimate question is how to apply a therapeutic index clinically. As we have seen, the concept of the therapeutic index is riddled with inconsistencies, and in some cases its use may be tenuous in clinical practice. The FDA does not endorse the use of the therapeutic index, and there is no mention of it in its publication Approved Drug Products with Therapeutic Equivalence Evaluations [the Orange Book] (4). There is also no apparent FDA guidance on the use of the therapeutic index for clinical trials and new drug applications (NDAs). However, the therapeutic index may prove useful in the preliminary phases of clinical development, when it is important to rule out any major toxicity associated with a given drug before proceeding any further with expensive and exhaustive clinical trials. Nevertheless, the ethical issues of experimental animal testing must be considered. In the end, the therapeutic index provides a semiquantitative estimate of the range of doses (or concentrations) that may be administered to a patient (or study subject) to achieve a desired response with an acceptable level of adverse effects. In addition, a therapeutic index may be useful in choosing among drugs with similar therapeutic indications.

REFERENCES

1. L. L. Brunton, ed. Goodman & Gilman's The Pharmacological Basis of Therapeutics, 11th ed. Boston: McGraw-Hill, 2005, pp. 126–129.
2. G. Levy, What are narrow therapeutic index drugs? Clin Pharmacol Ther. 1998; 63: 501–505.
3. FDA position on product selection for 'narrow therapeutic index' drugs. Am J Health Syst Pharm. 1997; 54: 1630–1632.
4. Center for Drug Evaluation and Research, U.S. Food and Drug Administration. Approved Drug Products with Therapeutic Equivalence Evaluations [Orange Book], 27th ed. Updated: October 10, 2007. Available at: http://www.fda.gov/cder/orange/default.htm
5. FDA comments on activities in states concerning narrow-therapeutic-index drugs. Am J Health Syst Pharm. 1998; 55: 686–687.
6. L. Z. Benet and J. E. Goyan, Bioequivalence and narrow therapeutic index drugs. Pharmacotherapy. 1995; 15: 433–440.
7. J. X. Zhou, N. F. Luo, X. M. Liang, and J. Liu, The efficacy and safety of intravenous emulsified isoflurane in rats. Anesth Analg. 2006; 102: 129–134.
8. G. E. Schumacher, ed. Therapeutic Drug Monitoring. Boston: Appleton & Lange, 1995, pp. 8–9.
9. K. A. Antons, C. D. Williams, S. K. Baker, and P. S. Phillips, Clinical perspectives of statin-induced rhabdomyolysis. Am J Med. 2006; 119: 400–409.
10. W. Levinson, Medical Microbiology & Immunology: Examination & Board Review, 8th ed. Boston: McGraw-Hill, 2004, pp. 445–452.
11. H. T. Stelfox, S. B. Ahmed, J. Fiskio, and D. W. Bates, Monitoring amiodarone's toxicities: recommendations, evidence, and clinical practice. Clin Pharmacol Ther. 2004; 75: 110–122.
12. H. P. Rang, M. M. Dale, and J. M. Ritter, Pharmacology, 4th ed. Edinburgh: Churchill Livingstone, 2000, pp. 57–60.
13. T. M. Ludden, Nonlinear pharmacokinetics: clinical implications. Clin Pharmacokinet. 1991; 20: 429–446.
14. M. Rowland and T. N. Tozer, Clinical Pharmacokinetics: Concepts and Applications, 3rd ed. Philadelphia: Lippincott Williams & Wilkins, 1995, pp. 406–411.
15. S. V. Otton, D. Wu, R. T. Joffe, S. W. Cheung, and E. M. Sellers, Inhibition by fluoxetine of cytochrome P450 2D6 activity. Clin Pharmacol Ther. 1993; 53: 401–409.
16. T. M. Ludden, S. R. Allerheiligen, T. R. Browne, and J. R. Koup, Sensitivity analysis of the effect of bioavailability or dosage form content on mean steady state phenytoin concentration. Ther Drug Monit. 1991; 13: 120–125.
17. D. Jung, J. R. Powell, P. Walson, and D. Perrier, Effect of dose on phenytoin absorption. Clin Pharmacol Ther. 1980; 28: 479–485.
18. L. Z. Benet, Relevance of pharmacokinetics in narrow therapeutic index drugs. Transplant Proc. 1999; 31: 1642–1644; discussion 1675–1684.
19. R. H. Ho and R. B. Kim, Transporters and drug therapy: implications for drug disposition and disease. Clin Pharmacol Ther. 2005; 78: 260–277.
20. C. R. Yates, W. Zhang, P. Song, S. Li, A. O. Gaber, et al., The effect of CYP3A5 and MDR1 polymorphic expression on cyclosporine oral disposition in renal transplant patients. J Clin Pharmacol. 2003; 43: 555–564.
21. L. Bonhomme, A. Devocelle, F. Saliba, S. Chatled, J. Maccario, et al., MDR-1 C3435T polymorphism influences cyclosporine A dose requirement in liver-transplant recipients. Transplantation. 2004; 78: 21–25.
22. D. E. Lanfear, P. G. Jones, S. Marsh, S. Cresci, H. L. McLeod, and J. A. Spertus, Beta2-adrenergic receptor genotype and survival among patients receiving beta-blocker therapy after an acute coronary syndrome. JAMA. 2005; 294: 1526–1533.
23. J. A. Johnson, I. Zineh, B. J. Puckett, S. P. McGorray, H. N. Yarandi, and D. F. Pauly, Beta 1-adrenergic receptor polymorphisms and antihypertensive response to metoprolol. Clin Pharmacol Ther. 2003; 74: 44–52.
24. V. P. Shah, A. Yacobi, W. H. Barr, L. Z. Benet, D. Breimer, et al., Evaluation of orally administered highly variable drugs and drug formulations. Pharm Res. 1996; 13: 1590–1594.
25. Physicians' Desk Reference, 57th ed. 2003.
26. C. Patrono, Aspirin: new cardiovascular uses for an old drug. Am J Med. 2001; 110: 62S–65S.
FURTHER READING

M. Rowland, L. B. Sheiner, and J. L. Steimer, eds., Variability in Drug Therapy: Description, Estimation, and Control (A Sandoz Workshop). New York: Raven Press, 1985.
CROSS-REFERENCES

Therapeutic window
Therapeutic dose range
Surrogate endpoint
Maximum tolerable dose
Minimum effective dose
TNT TRIAL
JOHN C. LAROSA
State University of New York Health Science Center, Brooklyn, New York

SCOTT M. GRUNDY
University of Texas Southwestern Medical Center, Dallas, Texas

DAVID D. WATERS
San Francisco General Hospital, San Francisco, California

on behalf of the TNT Steering Committee and Investigators

The link between elevated LDL-C levels and increased risk of coronary heart disease (CHD) events is well established. Indeed, for secondary prevention trials, the relationship between on-treatment LDL-C and CHD event rates seems to be approximately linear (1). Secondary prevention guidelines currently recommend an LDL-C treatment target of <100 mg/dL (<2.6 mmol/L) for patients with CHD or CHD equivalents (including clinical manifestations of noncoronary forms of atherosclerotic disease [peripheral arterial disease, abdominal aortic aneurysm, and carotid artery disease (transient ischemic attacks or stroke of carotid origin or >50% obstruction of a carotid artery)], diabetes, and 2+ risk factors with a 10-year risk for hard CHD >20%) and state that it is reasonable to reduce LDL cholesterol levels to <70 mg/dL (1.8 mmol/L) in very high-risk patients (2,3). However, prior to the Treating to New Targets (TNT) study, the value of treating CHD patients, particularly those with stable, nonacute disease, to LDL-C levels substantially below 100 mg/dL (2.6 mmol/L) had not been clearly demonstrated.

1 OBJECTIVES

The primary hypothesis of TNT was that reducing LDL-C levels to well below 100 mg/dL (2.6 mmol/L) in stable CHD patients with modest LDL-C elevation (despite previous low-dose atorvastatin therapy) could yield incremental clinical benefit (4,5). This hypothesis was tested by comparing the occurrence of major cardiovascular endpoints in two groups of patients: one that received 10-mg atorvastatin daily with the intent to reach an average LDL cholesterol target of approximately 100 mg/dL (2.6 mmol/L), and one that received 80-mg atorvastatin daily with the intent to reach an average LDL cholesterol of approximately 75 mg/dL (1.9 mmol/L).

2 STUDY DESIGN

2.1 Patient Population

Patients eligible for inclusion in TNT were men and women aged 35 to 75 years with clinically evident CHD, defined as previous myocardial infarction (MI), previous or present angina with objective evidence of atherosclerotic CHD, and/or a previous coronary revascularization procedure.

2.2 Study Protocol

TNT was a double-blind, parallel-group, randomized, multicenter study. All patients completed a washout period of 1–8 weeks, with any previously prescribed lipid-regulating drugs being discontinued at screening. To ensure that all patients at baseline achieved LDL-C levels consistent with then-current guidelines for the treatment of stable CHD, patients with LDL-C between 130 and 250 mg/dL (3.4 to 6.5 mmol/L) and triglycerides ≤600 mg/dL (6.8 mmol/L) commenced 8 weeks of open-label treatment with atorvastatin 10 mg/day (week −8). At the end of the run-in period (week 0), those patients with a mean LDL-C <130 mg/dL (3.4 mmol/L) were randomized to double-blind therapy with either atorvastatin 10 or 80 mg/day. During the double-blind period, follow-up visits occurred at week 12 and at months 6, 9, and 12 in the first year, and every 6 months thereafter. At each visit, vital signs, clinical endpoints, adverse events, and concurrent medication information were collected. In addition, on alternating visits (i.e., annually), physical examinations and electrocardiograms were performed and laboratory specimens were collected.
2.3 Study Endpoints

The primary endpoint was the time to occurrence of a major cardiovascular event, defined as CHD death; nonfatal, non-procedure-related MI; resuscitated cardiac arrest; or fatal or nonfatal stroke. Secondary study outcomes were the times to occurrence of the following: major coronary event (CHD death; nonfatal, non-procedure-related MI; or resuscitated cardiac arrest); any coronary event (major coronary event, revascularization procedure, procedure-related MI, or documented angina); cerebrovascular event (fatal or nonfatal stroke or transient ischemic attack); peripheral vascular disease; hospitalization with a primary diagnosis of congestive heart failure; any cardiovascular event (any of the previous); and all-cause mortality. An independent endpoint committee blinded to treatment assignment adjudicated all potential primary and secondary endpoints.

2.4 Sample Size Determination

Epidemiologic data suggested that the difference in LDL-C levels between the two treatment groups would reduce the number of 5-year recurrent coronary events in the atorvastatin 80 mg/day treatment arm by 20% to 30% compared with atorvastatin 10 mg/day. The study originally had a target enrollment of approximately 8600 subjects in order to accumulate 750 major coronary events over an average follow-up time estimated at 5.5 years. A higher than expected recruitment rate afforded the opportunity to increase the sample size in the study; 10,001 patients were randomized and dispensed the study drug. During the trial, and prior to any review of data, stroke (fatal and nonfatal) was added to the definition of the primary efficacy outcome measure. At this point, it was projected that approximately 950 primary events (750 coronary events plus an additional 200 stroke events) would accrue over the expected duration of the trial, providing 85% power for a two-sided test at α of 0.05 to detect a 17% reduction in the 5-year cumulative primary efficacy outcome rate for atorvastatin 80 mg compared with atorvastatin 10 mg.
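For readers unfamiliar with event-driven sizing, the sketch below shows the kind of calculation involved, using the common Schoenfeld approximation for the log-rank test. It is only an illustration: the TNT planning assumptions are not fully described here (the protocol presumably worked from projected 5-year cumulative event rates), so the numbers do not reproduce the 950-event and 85% power figures exactly.

import numpy as np
from scipy.stats import norm

hr = 0.83                       # hazard ratio corresponding to a 17% relative risk reduction
alpha, power = 0.05, 0.85
z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)

# Schoenfeld approximation, 1:1 allocation: required total number of primary events.
events_needed = 4 * (z_a + z_b) ** 2 / np.log(hr) ** 2
print(round(events_needed))     # roughly 1030 events under these simplifying assumptions

# Conversely, the approximate power supplied by ~950 events for the same hazard ratio.
z = np.sqrt(950 * np.log(hr) ** 2 / 4) - z_a
print(round(norm.cdf(z), 2))    # roughly 0.82 under this approximation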
3 RESULTS

3.1 Study Population

A total of 18,469 patients were screened at 256 sites in 14 countries. Of these patients, 15,464 (83.7%) were eligible to enter the open-label run-in period, during which another 5461 patients were excluded. A total of 10,001 patients were randomized to double-blind treatment with either atorvastatin 10 or 80 mg. The time of randomization was taken as the baseline for the study. Patients were followed for a median of 4.9 years. Patients in both treatment groups were well matched for age, gender, race, cardiovascular risk factors, and cardiovascular history (Table 1). A little over half of the patients had a history of hypertension, and 15% in each group had type 2 diabetes. The prevalence of conditions represented in their cardiovascular histories reflects what we had expected to find in this sample of patients with established coronary heart disease.

3.2 Change in Lipid Values

During the open-label period, LDL-C was reduced by 35% in the overall patient population, from 152 mg/dL (3.9 mmol/L) to 98 mg/dL (2.6 mmol/L). After randomization, mean LDL-C in the atorvastatin 10-mg group was maintained at around the baseline level for the duration of the treatment period, with an average of 101 mg/dL (2.6 mmol/L) across the 5 years of follow-up. Among patients receiving atorvastatin 80 mg, LDL-C was further reduced to a mean level of 77 mg/dL (2.0 mmol/L). The LDL-C level in the 80-mg group remained relatively stable over the course of the study. Total cholesterol and triglyceride levels decreased significantly from baseline in the atorvastatin 80-mg group, and levels remained stable over the treatment period. Atorvastatin 80 mg and atorvastatin 10 mg both produced nonsignificant increases over baseline in HDL-C, with no significant difference between groups throughout the duration of the study.
Table 1. Baseline Characteristics of Randomized Patients

Baseline characteristic*                 Atorvastatin 10 mg (n = 5006)    Atorvastatin 80 mg (n = 4995)
Age, years                               61 (8.8)                         61 (8.8)
Men, n (%)                               4045 (81)                        4054 (81)
White, n (%)                             4711 (94)                        4699 (94)
Systolic blood pressure, mmHg            131 (17)                         131 (17)
Diastolic blood pressure, mmHg           78 (10)                          78 (10)
Body mass index, kg/m2                   28.6 (4.7)                       28.4 (4.5)
Cardiovascular history, n (%)
  Current smoker                         672 (13)                         669 (13)
  Ex-smoker                              3167 (63)                        3155 (63)
  Systemic hypertension                  2721 (54)                        2692 (54)
  History of diabetes mellitus           753 (15)                         748 (15)
  Myocardial infarction                  2888 (58)                        2945 (59)
  Angina                                 4067 (81)                        4084 (82)
  Cerebrovascular accident               263 (5)                          255 (5)
  Peripheral arterial disease            570 (11)                         603 (12)
  Congestive heart failure               404 (8)                          377 (8)
  Arrhythmia                             927 (19)                         907 (18)
  Coronary revascularization
    Coronary angioplasty                 2719 (54)                        2688 (54)
    Coronary bypass                      2338 (47)                        2317 (47)
Mean lipid level, mg/dL
  LDL cholesterol                        98 (18)                          97 (18)
  Total cholesterol                      175 (24)                         175 (24)
  Triglycerides                          151 (72)                         151 (70)
  HDL cholesterol                        47 (11)                          47 (11)

*Mean (SD) unless otherwise denoted. To convert from mg/dL to mmol/L, for cholesterol multiply by 0.02586 and for triglycerides by 0.0113.
3.3 Efficacy

At the end of the study, the relative risk of the primary outcome measure was 22% lower in the 80-mg atorvastatin group than in the 10-mg atorvastatin group (Figure 1), a difference that was highly significant (hazard ratio [HR] = 0.78; 95% confidence interval [CI], 0.69, 0.89; P < 0.001). Treatment with atorvastatin 80 mg had a consistent and significant beneficial effect on most study endpoints (Figure 2). Relative risk reductions for individual components of the primary outcome measure were all consistent with that observed for the primary composite, with the exception of resuscitated cardiac arrest. Of particular note, the relative risk of stroke in the atorvastatin 80-mg group was 25% lower at the end of the study than in the 10-mg group (HR = 0.75; 95%
CI, 0.59, 0.96; P = 0.02). Among the secondary outcome measures, significant reductions also occurred in major coronary events, any coronary events, cerebrovascular events, hospitalization for congestive heart failure, and any cardiovascular events. The effects of atorvastatin 80 mg on peripheral arterial disease did not differ significantly from those of atorvastatin 10 mg, and differences between the two groups in total, cardiovascular, or noncardiovascular mortality did not reach statistical significance. The leading cause of noncardiovascular mortality in both groups was cancer, which accounted for 75 (1.5%) and 85 (1.7%) deaths in the atorvastatin 10-mg and 80-mg groups, respectively. Deaths from hemorrhagic stroke or trauma (including accidental death, suicide, and homicide) were infrequent and did not differ significantly between groups.
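As a quick arithmetic check on how such summaries fit together, the sketch below recovers the standard error of the log hazard ratio from the reported 95% CI for the primary endpoint and converts it back into a P-value. It assumes the interval was formed on the log-hazard scale with a normal approximation, which is the usual convention but is not stated explicitly here.

import numpy as np
from scipy.stats import norm

hr, lo, hi = 0.78, 0.69, 0.89                            # reported primary-endpoint summary
se = (np.log(hi) - np.log(lo)) / (2 * norm.ppf(0.975))   # SE of log(HR) implied by the CI
z = np.log(hr) / se
p = 2 * norm.sf(abs(z))

print(f"SE(log HR) = {se:.3f}, z = {z:.2f}, two-sided p = {p:.4f}")  # same order as the 0.0002 shown in Figure 1
print(f"Relative risk reduction = {1 - hr:.0%}")                     # the 22% quoted in the text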
Figure 1. Primary efficacy outcome measure: time to first cardiovascular event. [Plot of the proportion of subjects experiencing an event versus time (years) for atorvastatin 10 mg (5006 patients at risk at randomization) and atorvastatin 80 mg (4995 patients); HR = 0.78 (0.69–0.89), p = 0.0002.]

Figure 2. Hazard ratios for primary and secondary outcome measures (atorvastatin 80 mg vs. atorvastatin 10 mg; hazard ratios below 1 favor atorvastatin 80 mg).

Outcome                                      HR      P-value
Primary efficacy measure
  Major CV event                             0.78    <0.001
    CHD death                                0.80    0.09
    Nonfatal, non-procedure-related MI       0.78    0.004
    Resuscitated cardiac arrest              0.96    0.89
    Fatal/nonfatal stroke                    0.75    0.02
Secondary efficacy measures
  Any cardiovascular event                   0.81    <0.001
  Major coronary event*                      0.80    0.002
  Any coronary event                         0.79    <0.001
  Cerebrovascular event                      0.77    0.007
  Hospitalization for CHF                    0.74    0.01
  Peripheral arterial disease                0.97    0.76
  All-cause mortality                        1.01    0.92

*CHD death, nonfatal non-procedure-related MI, resuscitated cardiac arrest.
3.4 Safety
Adverse events related to study treatment occurred in 8.1% of patients in the atorvastatin 80-mg group and in 5.8% of patients in the atorvastatin 10-mg group. No specific category of events could be identified to account for this difference. No significant difference was found in the rate of treatment-related myalgia between the two groups. Five cases of muscle-related
pathology were reported as rhabdomyolysis in this study (two receiving atorvastatin 80 mg and three receiving atorvastatin 10 mg). None of these cases were considered by the on-site investigator to be causally related to atorvastatin, and none met American College of Cardiology/American Heart Association/National Heart, Lung and Blood Institute criteria for rhabdomyolysis. The rate of persistent elevations in liver enzymes was 1.2% in the atorvastatin 80 group and 0.2% in the atorvastatin 10 group.
4 CONCLUSIONS
Treating patients with established CHD to an LDL-C of 77 mg/dL (2.0 mmol/L) with 80 mg of atorvastatin daily, from a starting LDL-C level of 100 mg/dL (2.6 mmol/L), provided a highly significant reduction in the risk of major cardiovascular events. Significant declines in both CHD and cerebrovascular morbidity were observed in the atorvastatin 80-mg group relative to the atorvastatin 10-mg group. The morbid and mortal event rates recorded in both arms of the TNT study are lower than those achieved in any of the other major secondary prevention studies (6–9). It is important to emphasize that TNT was an active-controlled study and that the reduction in cardiovascular risk associated with atorvastatin 80 mg was over and above the well-established clinical benefits provided by treatment with atorvastatin 10 mg and the attainment of LDL-C levels regarded as optimal treatment targets. Atorvastatin was well tolerated at both doses, and the improved clinical outcome in the 80-mg group was achieved without additional safety concerns. Although persistent liver function abnormalities increased from 0.2% to 1.2% in the 10-mg and 80-mg groups, respectively, both of these incidence rates are well below those observed in other statin trials. The number of patients reported as having myalgia was similar in the two groups, and five cases of rhabdomyolysis were found, all of uncertain relationship to study medication. No increased risk of cancer, hemorrhagic stroke, or traumatic deaths was found in the high-dose atorvastatin group. All of these conditions have been suggested as occurring with low LDL-C in observational studies or with LDL-C lowering in earlier cholesterol-lowering trials. The results of the TNT study indicate that the linear relationship between reduced LDL-C and reduced CHD risk demonstrated in prior statin secondary prevention trials holds true even at very low LDL-C levels (10). Overall, these data extend the growing body of evidence indicating that lowering LDL-C well below currently recommended levels can
further reduce the health-care burden associated with cardiovascular and cerebrovascular disability.

REFERENCES

1. Kastelein JJ. The future of best practice. Atherosclerosis 1999; 143(Suppl 1): S17–S21.
2. Grundy SM, Cleeman JI, Merz CN, et al. Implications of recent clinical trials for the National Cholesterol Education Program Adult Treatment Panel III guidelines. Circulation 2004; 110: 227–239.
3. Smith SC, Allen J, Blair SN, et al. AHA/ACC guidelines for secondary prevention for patients with coronary and other atherosclerotic vascular disease: 2006 update. Circulation 2006; 113: 2363–2372.
4. Waters DD, Guyton JR, Herrington DM, et al. Treating to New Targets (TNT) Study: does lowering low-density lipoprotein cholesterol levels below currently recommended guidelines yield incremental clinical benefit? Am. J. Cardiol. 2004; 93: 154–158.
5. LaRosa JC, Grundy SM, Waters DD, et al. Intensive lipid lowering with atorvastatin in patients with stable coronary disease. N. Engl. J. Med. 2005; 352: 1425–1435.
6. Scandinavian Simvastatin Survival Study Group. Randomised trial of cholesterol lowering in 4444 patients with coronary heart disease: the Scandinavian Simvastatin Survival Study (4S). Lancet 1994; 344: 1383–1389.
7. Sacks FM, Pfeffer MA, Moyé LA, et al. The effect of pravastatin on coronary events after myocardial infarction in patients with average cholesterol levels. N. Engl. J. Med. 1996; 335: 1001–1009.
8. The Long-term Intervention with Pravastatin in Ischaemic Disease (LIPID) Study Group. Prevention of cardiovascular events and death with pravastatin in patients with coronary heart disease and a broad range of initial cholesterol levels. N. Engl. J. Med. 1998; 339: 1349–1357.
9. Heart Protection Study Collaborative Group. MRC/BHF Heart Protection Study of cholesterol lowering with simvastatin in 20,536 high-risk individuals: a randomised placebo-controlled trial. Lancet 2002; 360: 7–22.
10. LaRosa JC, Grundy SM, Kastelein JJP, Kostis JB, Greten H. Safety and efficacy of atorvastatin-induced very low low-density lipoprotein cholesterol levels in patients with coronary heart disease (a post hoc analysis
of the Treating to New Targets [TNT] study). Am. J. Cardiol. 2007; 100: 747–752.
FURTHER READING

Pedersen TR, Faergeman O, Kastelein JJP, et al. High-dose atorvastatin vs usual-dose simvastatin for secondary prevention after myocardial infarction. The IDEAL study: a randomized clinical trial. JAMA. 2005; 294: 2437–2445.
The Stroke Prevention by Aggressive Reduction in Cholesterol Levels (SPARCL) Investigators. High-dose atorvastatin after stroke or transient ischemic attack. N. Engl. J. Med. 2006; 355: 549–559.
Shepherd J, Barter P, Carmena R, et al. Effect of lowering low density lipoprotein cholesterol substantially below currently recommended levels in patients with coronary heart disease and diabetes: the Treating to New Targets (TNT) study. Diabetes Care. 2006; 29: 1220–1226.
Wenger NK, Lewis SJ, Herrington DM, et al. Outcomes of using high- or low-dose atorvastatin in patients 65 years of age or older with stable coronary heart disease. Ann. Intern. Med. 2007; 147: 1–9.
Deedwania P, Barter P, Carmena R, et al. Reduction of low density lipoprotein cholesterol in patients with coronary heart disease and metabolic syndrome: analysis of the Treating to New Targets study. Lancet 2006; 368: 919–928.
Barter P, Gotto AM, LaRosa JC, et al. HDL cholesterol, very low levels of LDL cholesterol, and cardiovascular events. N. Engl. J. Med. 2007; 357: 1301–1310.
Shepherd J, Kastelein JJP, Bittner V, et al. Effects of intensive lipid lowering with atorvastatin on renal function: the Treating to New Targets study. Clin. J. Am. Soc. Nephrol. 2007; 2: 1131–1139.
CROSS-REFERENCES

Active-Controlled Trial
Run-in Period
Secondary Prevention Trials
Statins
TREATMENT-BY-CENTER INTERACTION
PAUL GALLO
Novartis Pharmaceuticals, East Hanover, New Jersey

Clinical trials that compare treatment therapies are commonly conducted using patients from several different investigative sites or clinical centers. A primary motivation for using multiple centers is to enable the enrollment of a sufficient number of patients to investigate the trial hypotheses in a timely manner. A beneficial consequence is that this arrangement can provide evidence as to whether the trial results are strongly setting-dependent and can assist in interpreting how well the results should apply to a broad population of clinical sites or patient groups. This arrangement also has the added appeal of having the trial more closely reflect how treatments will be used in varied settings in actual practice outside of the clinical trial arena. Sites used in a clinical trial may differ in any of several aspects that could have an impact on patient outcomes, such as patient population (as reflected in demographic or disease history characteristics), investigator experience with the treatment regimens being studied, local differences in medical practice or concomitant therapy, and so forth. Inherent in the use of this broader range of settings within a trial is the possibility that treatment effects will not be highly similar across centers. Treatment-by-center interaction refers to evidence that comparative treatment effects are not consistent across centers. Such evidence of nonhomogeneity within a trial has important ramifications for the interpretation and generalizability of the trial results. Inconsistency of outcomes across centers can raise very natural questions about the strength and nature of any conclusions drawn from overall trial results, such as whether the results might have been markedly different had a different set of centers been used, and whether we sufficiently understand the true risks and benefits associated with the trial therapies in different patient subpopulations, and so forth. When evidence of treatment-by-center interaction exists, investigating and understanding its underlying causes may elucidate important aspects of the treatments being studied. Perhaps if groups of centers can be identified that produce similar patterns of results, then some common threads can be identified; this investigation may prove useful in more fully understanding the results of the trial or in identifying issues that warrant additional study.
1 CHARACTERIZING TREATMENT-BY-CENTER INTERACTION

It is important to distinguish treatment-by-center interaction from differences in the overall pattern of patient outcomes across centers. Average patient responses may differ from center to center in a multicenter trial, perhaps because of different patient risk populations, medical practices, outcome assessment, and so forth. Even when overall differences exist among patient outcomes at different centers, appropriate measures of comparative treatment effects may nevertheless be similar across centers, so that overall estimates of treatment differences obtained from the trial can be reasonably interpreted to apply across the sites. For example, consider a hypothetical situation in which patients in a control group at Center A have a 50% risk of experiencing a particular study outcome, whereas in Center B the control risk is 40%. At both sites, an experimental treatment reduces this risk by a factor of 0.10 (i.e., to 45% in Center A and to 36% in Center B). Despite the overall difference in risk between these two centers (i.e., patients in Center A are at higher risk of the outcome, regardless of treatment), the relative risk reduction of 0.10 applies to and has meaning in both centers. Treatment-by-center interaction, however, refers to differences among centers with regard to comparative measures of treatment effects. For example, in Center C there might be a 10% reduction in risk for a particular treatment relative to control, whereas in Center D the risk reduction is 20%. Pending the results
of additional investigation of this inconsistency, no single value describing the relative risk between treatments may be identified that applies satisfactorily to patients at both centers. It should be pointed out that the presence or magnitude of treatment-by-center interaction (as for any type of interaction in general) might be an artifact of the scale on which responses are measured and on which statistical analysis is performed (1). As an admittedly simplistic example, imagine that the baseline mean systolic blood pressure of patients at Center E is 150 mmHg and that following administration of trial therapies the control group mean remains unchanged, whereas the mean of the experimental group is 135. Suppose that at Center F the mean baseline value is 120 mmHg and again the mean control group response does not change after therapy, whereas the experimental group mean is reduced to 108 mmHg. On an absolute scale, the between-treatment differences of 15 units at Center E and 12 units at Center F may suggest a mild interaction. However, if the difference is expressed, and presumably the analysis performed, on a relative basis (which would likely be the more appropriate scale in this type of example), then the common advantage of a 10% reduction from baseline at both centers is not suggestive of any interaction. Sometimes, a treatment-by-center interaction is described as belonging to one of two categories, as follows:
• Quantitative interaction. Although the center effects seem to differ, they are all directionally the same with regard to the treatment groups (e.g., Treatment X is superior to Treatment Y in all centers, although the numerical magnitude of benefit differs).

• Qualitative interaction. Treatment effects differ among centers, but, in addition, different centers show advantage for different treatments (e.g., Treatment X seems superior to Treatment Y at Center G, but Treatment Y seems superior at Center H).

Clearly, qualitative treatment-by-center interaction, were it to exist in a given situation, is potentially a more challenging and complex concern with regard to fully understanding the relative merits of different treatment regimens and the implications for patients. Note that qualitative interaction cannot be explained as an artifact of measurement scale; for example, it will remain (although its magnitude may change) under any transformation of scale. It should be kept in mind that there has at times been ambiguity in the use of this terminology with regard to whether qualitative interaction refers simply to estimates of within-center treatment effects being in different directions or to the stricter standard of formal demonstration that treatment effects at different sites are truly directionally opposite. In trials run in many centers or that include centers with small numbers of patients, observing within-center estimates in opposite directions is by no means unusual. Indeed, it is often to be expected, even when one treatment is clearly superior and in truth no interaction at all exists, and often, thus, it should not be a source of concern. Senn (2, ch. 14) discusses this phenomenon and presents an interesting quantification of this effect. True qualitative treatment-by-center interaction in clinical data associated with effective treatments is probably rare. Formal statistical demonstration that qualitative treatment-by-center interaction exists tends to be challenging regardless; unlike other types of interactions that might be investigated involving larger patient subgroups, it will commonly be the case in multicenter trials that centers will not be large enough to characterize with high precision the magnitudes of effects within individual centers. Examples of formal statistical procedures that address qualitative interaction are given in Refs. 3–5.
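As context for the discussion that follows, the sketch below shows one generic way a treatment-by-center interaction is examined in practice: a likelihood-ratio test of the interaction term in a logistic model for a binary outcome, fit with the statsmodels package on simulated data. It is an illustration only, not the specific procedures of Refs. 3–5, and the center count, sample sizes, and effect sizes are arbitrary choices.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

rng = np.random.default_rng(0)
n_centers, n_per_arm = 8, 30
rows = []
for c in range(n_centers):
    base = rng.normal(-0.2, 0.4)                   # center-specific baseline log-odds
    for trt in (0, 1):
        logit = base - 0.5 * trt                   # common treatment effect: no true interaction
        p = 1 / (1 + np.exp(-logit))
        y = rng.binomial(1, p, n_per_arm)
        rows += [{"center": c, "treatment": trt, "response": yi} for yi in y]
df = pd.DataFrame(rows)

full = smf.logit("response ~ treatment * C(center)", data=df).fit(disp=False)
reduced = smf.logit("response ~ treatment + C(center)", data=df).fit(disp=False)
lr = 2 * (full.llf - reduced.llf)                  # likelihood-ratio statistic for the interaction terms
df_diff = full.df_model - reduced.df_model         # n_centers - 1 interaction parameters
print(f"LR = {lr:.2f} on {df_diff:.0f} df, p = {chi2.sf(lr, df_diff):.3f}")

Because the simulated treatment effect is the same in every center, a large p-value is the expected outcome here; as the next section notes, in real trials such tests have limited power and their results must be interpreted with care.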
2 INVESTIGATING TREATMENT-BY-CENTER INTERACTION

As alluded to above, it is sensible to consider the consistency of results across sites in a multicenter trial. When treatment-by-center
interaction seems indicated, it is responsible to address this interaction in greater depth to more fully understand the effects of the study therapies on specific patient groups and the extent to which the aggregate study results can be generalized. Differences in the patterns of patient outcomes across centers are by no means unexpected when centers are not identical with regard to relevant patient characteristics or clinical conditions; the process of investigating interaction generally involves observing whether such factors leading to the apparent interaction can be identified. Relevant regulatory perspectives regarding investigation of treatment-by-center interaction are reflected in the following excerpt from the ICH-E9 document (6): If positive treatment effects are found in a trial with appreciable numbers of subjects per center, there should generally be an exploration of the heterogeneity of treatment effects across centers, as this may affect the generalizability of the conclusions. Marked heterogeneity may be identified by graphical display of the results of individual centers or by analytical methods, such as a significance test of the treatment-by-center interaction. When using such a statistical significance test, it is important to recognize that this generally has low power in a trial designed to detect the main effect of treatment. If heterogeneity of treatment effects is found, this should be interpreted with care, and vigorous attempts should be made to find an explanation in terms of other features of trial management or subject characteristics. Such an explanation will usually suggest appropriate further analysis and interpretation . . . . It is even more important to understand the basis of any heterogeneity characterized by marked qualitative interaction, and failure to find an explanation may necessitate further clinical trials before the treatment effect can be reliably predicted.
These statements clearly reflect the importance that is ascribed to investigating whether treatment-by-center interaction is present and, if so, attempting to understand its cause or mechanism, generally through identification of relevant patient characteristics or clinical practices that differ across sites, as will be described below. Additionally,
the ICH-E9 statements suggest the potential for the trial results to be considered less convincing or less meaningful if inconsistency among center outcomes remains unexplained, particularly when qualitative interactions are suggested. We now go into more depth on the process of investigating, and if necessary explaining, treatment-by-center interaction, along the lines outlined in the ICH-E9 excerpt. Often, the possible presence of treatment-by-center interaction is initially addressed in a descriptive manner, for example, by examination of appropriate summary statistics relating to treatment effects, presented separately by center. Graphical presentation may be employed, such as plotting summary measures of outcomes for each treatment versus center and then connecting the values for each center and seeing whether the resulting curves seem nearly parallel. Descriptive and graphical approaches have limitations, especially given the typical disparity of within-center sample sizes and the issues associated with small centers that we have alluded to previously. Thus, formal statistical approaches for assessing treatment-by-center interaction usually play an important role in this investigation. Primarily, this involves a hypothesis test for significance of a treatment-by-center interaction term within an appropriate statistical model. We now present several sequential steps in such formal identification of interaction and, when present, investigation of its causes.

2.1 Is There Formal Statistical Evidence of Treatment-By-Center Interaction?

In trying to envision a possible criterion for a statistical test of an interaction term, it needs to be kept in mind that clinical studies are typically designed to yield power for some specified overall difference between treatment groups; the presence of treatment-by-center interaction is somewhat of a secondary concern (much as is the case for other types of subgroup comparisons). Studies will generally not have high power to detect what might be considered meaningful magnitudes of interaction when using conventional significance levels. In some contexts, the use of significance levels higher than typical significance thresholds has been described as
criteria for homogeneity tests, perhaps to be interpreted as ‘‘flags’’ that suggest that more investigation is warranted [e.g., p = 0.10 for a formal test of an interaction term (7); of course, it needs to be kept in mind that such larger significance level thresholds will more frequently falsely suggest interactions]. When an interaction test does reach a conventional level of significance (e.g., p < 0.05), however, that will often indicate that more examination or explanation is called for; in some situations, somewhat weaker signals might still merit some investigation. 2.2 Is the Interaction an Artifact of the Numerical Scale on which the Treatment Effect was Determined, as We Have Described Previously? If sensible alternate numerical scales reduce the signal of the interaction (i.e., performing analyses on a numerical scale that is scientifically plausible yields a far less significant result for the test of interaction), then the concern is reduced; in fact, it might suggest that analyses that use the alternative scale have desirable statistical properties for other purposes of evaluation of the trial data. 2.3 Is the Interaction Caused by a Small Number of Aberrant or Outlying Data Points? Depending on data structures and analysis methods used, some formal interaction tests may be highly sensitive to just a few extreme data values and thereby convey a signal of interaction when most data do not. Although certainly extreme values of a response variable are not to be ignored and may be indicative of an issue that needs investigation, it is generally the case that when an apparent interaction can be made to disappear largely by exclusion of a very small number of values, then the suggestion of the interaction is not considered to be much of a concern. 2.4 Is the Interaction Explainable Through Differences Between Centers With Regard to Other Prognostic Factors? Explanation of an interaction through relationship of the outcome to identifiable and scientifically sensible factors that may have differed across the centers is often a key part of
this investigation. When comparative treatment effects differ by site, it is very natural to consider whether prognostic factors present at baseline, and which may differ across sites, can be identified (e.g., demographics, risk factors, medical history, etc.) that seem to explain the interaction. Presumably, the factors believed to be most highly predictive of response can be expected to already have been included in primary analysis models. However, following current conventions and regulatory guidances (6,8), the number of factors typically included in primary analysis models tends to be limited. Thus, among a set of candidate prognostic factors that may have been considered for the primary model, only a subset may have been selected for the model; effects of others are often planned to be investigated in supplemental analyses. Additional analyses in which some other covariates are included in models will often be performed as part of the investigation of the interaction. The use of richer statistical models may decrease signals of treatment-by-center interaction, or this use may indicate that because of confounding, the suggested interaction with center is in reality an interaction with another covariate. For example, if comparative results at one particular center seemed different from those at other centers but in fact that site enrolled a noticeably older population than the others, this may indicate that the interaction in reality is one of treatment with age. To summarize, attempts to explain and understand an apparent treatment-by-center interaction fall in large part into two categories:
• Identifying whether the signal is a numerical artifact and not very meaningful from a clinical perspective (for example, whether it is caused by oversensitivity of an analysis to a few outlying values or to issues of measurement scale).
• Identifying measurable characteristics of the patient population or of some aspect of trial conduct at centers that seem to account for the inconsistency of comparative outcomes in a clinically plausible manner.
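As a concrete illustration of steps 2.1 and 2.4, the sketch below fits a fixed-effects model with a treatment-by-center interaction term, flags the interaction at the relaxed 0.10 level mentioned above, and then checks whether adjusting for a baseline covariate such as age attenuates the signal. This is a minimal sketch under stated assumptions, not a prescribed analysis; the data frame and its column names (response, treat, center, age) are hypothetical.

```python
# Illustrative sketch only; column names (response, treat, center, age) are hypothetical.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def check_center_interaction(df: pd.DataFrame) -> None:
    # Step 2.1: F-test of the treatment-by-center interaction term.
    base = smf.ols("response ~ C(treat) * C(center)", data=df).fit()
    p_int = sm.stats.anova_lm(base, typ=2).loc["C(treat):C(center)", "PR(>F)"]
    print(f"interaction p-value: {p_int:.3f}", "(flag at the 0.10 level)" if p_int < 0.10 else "")

    # Step 2.4: does adjusting for a baseline covariate that is unevenly
    # distributed across centers (here, age) attenuate the interaction signal?
    adjusted = smf.ols("response ~ C(treat) * C(center) + age", data=df).fit()
    p_adj = sm.stats.anova_lm(adjusted, typ=2).loc["C(treat):C(center)", "PR(>F)"]
    print(f"interaction p-value after adjusting for age: {p_adj:.3f}")
```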
When characteristics of within-center patient populations or clinical practices are identified as accounting for an apparent treatment-by-center interaction, it may have an effect on proper interpretation of trial results, for example, in how the results should be extrapolated to specific target populations that may receive more or less benefit than the aggregate results would suggest. Failure to identify explanations for the interaction, particularly when a suggestion develops that the interaction is qualitative or when very large disparities exist among the center results, may well be dissatisfying and limit the interpretability of the overall results. For example, if favorable overall trial results are highly influenced by particularly positive results at a small number of very large centers (i.e., results are much less favorable across the other centers), then it may not be possible to distinguish among the following explanations: whether some biasing influence existed at those large centers that does not apply to broad patient settings so that the true treatment effects are less favorable than the aggregate results suggest, whether some characteristic as yet unidentified exists that could put the trial results in a more interpretable and consistent context, or whether this result is a chance result so that the overall results are in fact fully meaningful. Investigation of a treatment-by-center interaction is rarely a highly algorithmic process and has a clearly post hoc flavor. Results of one step of the investigation commonly suggest what the next steps should be. The path will be guided by scientific plausibility, as some numerical limitations almost always exist in any process that looks at subsets of data that were not designed to be convincing in their own right. Useful perspective on these issues is found in Reference 9, for example: The power to look at interactions is . . . bound to be small, and apparent trends in the ‘wrong’ direction at individual centres must be quite likely, so that they should generally be disbelieved . . . . substantial treatment-by-centre interactions, if and when they exist, must be caused by something else, such as differences in specific patient characteristics or clinical conditions. Their interpretive value is purely as a
possible indicator of something more valuable for future use.
3 STATISTICAL ANALYSIS ISSUES
Whether or how potential treatment-by-center interaction should be accounted for in statistical analysis models has historically been a topic of much discussion and, at times, controversy, in the biostatistical literature. Typically, statistical analysis models for clinical trial data account for the site at which a patient participated; that is, models are used that adjust expected response for center. A main rationale for such adjustment is that for logistical reasons it is common for randomization schemes to be stratified within center, and it is considered good statistical practice for analysis models to reflect restrictions on randomization that result from stratification (8). Additionally, the center at which a patient participated may be somewhat predictive of response because of variations in patient populations or the manners in which therapies are administered at different sites, as we have discussed throughout this article (although predictive effects of center are often somewhat exploratory and usually not felt to be as likely to predict outcome as strongly as would other baseline or medical history characteristics). Less historical consensus exists regarding whether treatment-by-center interaction should be included in primary statistical analysis models. Whether a treatment-by-center interaction term is included in the model can have a large impact on the results of the main analysis of the overall treatment effect. The specific mechanism can be viewed in terms of ‘‘weights’’ that within-center treatment effect estimates receive when they are combined into an overall estimate for the entire trial. Much literature discussion on this topic played out in the specific framework of analyses using fixed-effects analysis of variance (ANOVA) models for normally distributed data, in which the principles and mechanics are most clearly illustrated. When a treatment-by-center interaction term is included in an ANOVA model, the resulting analysis of the treatment effect can be viewed as
giving equal weight to within-center effect estimates; that is, analysis is based on an overall treatment effect estimate determined as the numerical average of all the withincenter estimates. When an interaction term is not included in an analysis model, the analysis gives weight to within-center estimates according to their precision. For example, centers with more precise estimates receive greater weight. This precision is closely related to within-center sample size so that estimates from larger centers are weighted more strongly. (It may be noted that in this approach, patients, rather than centers, tend to be weighted equally; in the equally-weighted center approach, patient responses are weighted according to the size of their center; for example, outcomes from patients at large centers receive less weight than patients from small centers). Various arguments in support of both weighting approaches have been put forth (e.g., References 1 and 10–13). When center sample sizes vary, as is typical in multicenter clinical trials, approaches that give weight to centers according to their precision tend to be more efficient in the sense that the analysis is based on an overall treatment effect estimate that is more precise, which in turn tends to yield more powerful analyses (10). In recent years, precision-based weighting of centers, through the use of statistical models that do not include a treatment-bycenter interaction term, has become much more standard for primary analyses in multicenter trials, and this change is reflected in regulatory guidance documents (6,8). This occurrence is not indicative of a lesser degree of interest in the existence of treatment-bycenter interaction or in weaker motivation for ‘‘explaining’’ it when it exists, as described in the previous section; it simply is viewed as the preferred approach for the main test of the primary study hypothesis. Generally, investigation of treatment interactions (with center, and with other key subgroups) will remain a key secondary goal of the trial, and additional analyses that include the interaction term should still be performed. Appropriate followup of suggested treatment-by-center interaction should occur, as we have described previously.
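The weighting contrast just described can be made concrete with a small numerical sketch; the within-center estimates and standard errors below are invented for illustration, with one small, imprecise center included.

```python
# Minimal sketch, assuming hypothetical within-center treatment effect estimates.
import numpy as np

est = np.array([4.0, 2.5, 3.0, 1.0])  # within-center treatment effect estimates
se = np.array([0.8, 1.0, 0.9, 2.5])   # standard errors; the last center is small and imprecise

equal_weight = est.mean()                       # every center weighted equally
w = 1.0 / se**2                                 # precision (inverse-variance) weights,
precision_weight = np.sum(w * est) / np.sum(w)  # so larger centers dominate

print(f"equal-weight estimate:     {equal_weight:.2f}")
print(f"precision-weight estimate: {precision_weight:.2f} (SE {np.sum(w) ** -0.5:.2f})")
```

In this toy example, the small center's outlying estimate pulls the equally weighted average down more than the precision-weighted one, which is the sensitivity to small centers alluded to earlier.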
This discussion has focused on the most typical manners in which statistical analyses are performed for multicenter trials, in which center main effects are viewed statistically as fixed effects, and on the two most commonly considered center weighting schemes. A variety of other analysis approaches has been considered in the literature and used in practice, which may have different ramifications for questions of modeling or describing treatment-by-center interactions. These approaches include random effects models, in which center effects, including interactions, are considered random variables; in this setting, evidence of treatment-by-center interaction tends to translate directly into variability in the treatment effect estimate so that a higher degree of interaction tends to lead to lower strength of statistical evidence of an overall effect. Other proposals have included the use of empirical Bayes analyses (14), analyses in which center weights are related to expected representation of patient subgroups in a specified target population (15), and analyses that reflect the random nature of within-center sample sizes (16). REFERENCES 1. S. Snapinn, Interpreting interaction: the classical approach. Drug Inf. J. 1998; 32: 433–438. 2. S. Senn, Statistical Issues in Drug Development. Chichester, UK: John Wiley and Sons, 1997. 3. M. Gail, R. Simon, Testing for qualitative interactions between treatment effects and patient subgroups. Biometrics 1985; 41: 361–372. 4. J. L. Ciminera, J. F. Heyse, H. H. Nguyen, J. W. Tukey, Tests for qualitative treatmentby-centre interaction using a ‘‘pushback’’ procedure. Stat. Med. 1993; 12: 1033–1045. 5. X. Yan, Test for qualitative interaction in equivalence trials when the number of centres is large. Stat. Med. 2004; 23: 711–722. 6. International Conference on Harmonisation, Guidance on statistical principles for clinical trials (ICH-E9). Food and Drug Administration, DHHS, 1998. 7. J. L. Fleiss, Analysis of data from multiclinic trials. Control. Clin. Trials 1986; 7: 267–275. 8. Committee for Proprietary Medicinal Products (CPMP), Points to consider on adjustment for baseline covariates. London, 2003.
9. J. A. Lewis, Statistical issues in the regulation of medicines. Stat. Med. 1995; 14: 127–136. 10. P. Gallo, Center-weighting issues in multicenter clinical trials. J. Biopharm. Stat. 2000; 10: 145–163. 11. B. Jones, D. Teather, J. Wang, J. A. Lewis, A comparison of various estimators of a treatment difference for a multi-centre clinical trial. Stat. Med. 1998; 17: 1767–1777. 12. G. Schwemer, General linear models for multicenter clinical trials. Control. Clin. Trials 2000; 21: 21–29. 13. Z. Lin, An issue of statistical analysis in controlled multi-centre studies: how shall we weight the centres? Stat. Med. 1999; 18: 365–373. 14. A. L. Gould, Multi-centre trial analysis revisited. Stat. Med. 1998; 17: 1779–1797. 15. V. Dragalin, V. Fedorov, B. Jones, F. Rockhold, Estimation of the combined response to treatment in multicenter trials. J. Biopharm. Stat. 2001; 11: 275–295. 16. J. Ganju, D. V. Mehrotra, Stratified experiments reexamined with emphasis on multicenter trials. Control. Clin. Trials 2003; 24: 167–181.
FURTHER READING
A. Källén, Treatment-by-center interaction: what is the issue? Drug Inf. J. 1997; 31: 927–936.
T. Permutt, Probability models and computational models for ANOVA in multicenter clinical trials. J. Biopharm. Stat. 2003; 13: 495–505.
T. Sohzu, T. Omori, I. Yoshimura, Quantitative evaluation of the methods to deal with the treatment-by-centre interaction in "Statistical principles for clinical trials". Japanese J. Appl. Stat. 2001; 30: 1–18.
CROSS-REFERENCES
Multicenter Trial
Analysis of Variance (ANOVA)
Effect Size
Primary Hypothesis
Treatment Effect
TREATMENT INTERRUPTION
BIRGIT GRUND
University of Minnesota, Minneapolis, Minnesota

Interruption of active drug treatment in clinical trials can be classified broadly as therapeutic treatment interruption (TI), in which the potential therapeutic benefit of supervised interruption of treatment is investigated, and analytic TI, in which participants are randomized to stopping versus continuing active treatment and the TI arm serves as control group. Design issues for therapeutic TI trials are discussed in the example of a trial in HIV/AIDS. Clinical trials that included interruption of active drug treatment have been conducted in many fields and have addressed a wide range of research questions. The spectrum ranges from therapeutic treatment interruption (TI), which tests the hypothesis that supervised interruption of treatment may provide a net benefit to the individual patient compared with continuous treatment, to analytic TI, in which the TI arm serves as control arm to evaluate an active drug. This article is restricted to randomized, controlled trials; it includes examples from HIV research, treatment of osteoporosis and hypertension, and a Phase II cancer trial.

1 THERAPEUTIC TI STUDIES IN HIV/AIDS

1.1 Overview

No cure for chronic HIV infection currently exists, but highly active antiretroviral therapy (ART) has substantially reduced morbidity and mortality caused by HIV and AIDS. However, regimens are often complex and burdensome to patients; drugs may have toxic side effects, including diabetes, cardiovascular disease, renal or hepatic complications (1); and drugs may lose their effectiveness because of developing drug resistance (2). Anecdotal reports of patients who discontinued ART for several months and did not suffer progression of disease or substantial rebound of viral load spurred interest in intermittent ART as a viable option for patients with low short-term risk of disease (3,4). Fixed-term TI strategies that were evaluated in randomized, controlled clinical trials included, for example, cycling 7 days on, 7 days off ART [the HIV Netherlands Australia Thailand Research Collaboration 001.4 (HIV-NAT) (5) and Staccato studies (6)], longer cycles of equal time on and off ART [8 weeks, Window-ANRS 106 (7); 12 weeks, DART (8)], and others (9). In each of these trials, the control arm was continuous ART. In HIV-infected patients, the CD4+ T-lymphocyte count is a strong predictor for the risk of progression of disease; it is the major clinical marker for immunocompetence and substantially drives initiation of ART (10). Successful ART suppresses viral load and increases or maintains CD4+ levels; when stopping ART, CD4+ levels decline, and viral load rebounds (11). In contrast to fixed-term TIs, CD4+-guided intermittent treatment strategies stop use of ART when CD4+ cell counts are high and restart when CD4+ levels decline below a threshold. The aim is to minimize drug exposure while the risk of progression to AIDS is low, thus avoiding toxic side effects, and to reserve antiretroviral drugs for times when the risk of HIV-related illnesses outweighs the negative effects of prolonged drug exposure. Current treatment guidelines generally recommend starting treatment at CD4+ counts between 200 and 350 cells/mm3 (10). In recent years, 5 randomized clinical trials evaluated CD4+-guided intermittent treatment strategies, with thresholds for stopping at CD4+ counts above 350 and starting when the CD4+ level declined to below 250 cells/mm3 [the Strategies for Management of Anti-Retroviral Therapy (SMART) (12) and the Trivacan ANRS 1269 (13) studies], stopping at CD4+ > 350 and restarting when CD4+ declined to < 350 cells/mm3 (the Staccato (6) and HIV-NAT (5) studies), and others (14) versus continuous ART. Table 1 summarizes key characteristics; treatment assignments were open-label. In the two largest studies, SMART and Trivacan, the CD4+-guided TI arm was terminated early
Table 1. Randomized, Controlled Clinical Trials of CD4+-Guided Treatment Interruption Versus Continuous Antiretroviral Treatment of HIV Infection (See Also Reference 9)

SMART (12) (NCT00027352)
No. participants: 5472. Person-years: 7367. CD4+ thresholds (cells/mm3): stop ART > 350; start ART < 250; study entry > 350. ART period in TI arm before planned study close: no (C). Baseline characteristics: median CD4+ 597 cells/mm3; suppressed VL 72%; location 57% N. America, 26% Europe, 10% S. America. Primary endpoint: HIV-related illnesses or all-cause death.

Trivacan (A) (13) (NCT00158405)
No. participants: 326. Person-years: 516. CD4+ thresholds (cells/mm3): stop ART > 350; start ART < 250; study entry > 350. ART period in TI arm before planned study close: no. Baseline characteristics: median CD4+ 457 (TI) / 461 (cont.); suppressed VL 100%; location Côte d'Ivoire. Primary endpoint: WHO stage 3–4 morbidity; all-cause death.

Staccato (B) (6) (NCT00113126)
No. participants: 430. Person-years: 746. CD4+ thresholds (cells/mm3): stop ART > 350; start ART < 350; study entry > 350. ART period in TI arm before planned study close: 3–6 months. Baseline characteristics: median CD4+ 470 (TI) / 506 (cont.); suppressed VL 100%; location 80% Thailand, 18% Europe. Primary endpoint: % VL < 50 cp/mL (at study close); drug savings.

HIV-NAT 001.4 (A) (5)
No. participants: 48. Person-years: 99. CD4+ thresholds (cells/mm3): stop ART > 350; start ART < 350 or 30% drop; study entry > 350. ART period in TI arm before planned study close: 3 months. Baseline characteristics: median CD4+ 766 (TI) / 653 (cont.); suppressed VL 100%; location Thailand. Primary endpoint: % CD4+ > 350; % VL < 400 cp/mL (at study close).

TIBET (15)
No. participants: 201. Person-years: 378. CD4+ thresholds (cells/mm3): stop ART > 500 and VL < 50 cp/mL; start ART < 350 or VL > 10^5 cp/mL; study entry > 500. ART period in TI arm before planned study close: no. Baseline characteristics: median CD4+ 841 (TI) / 786 (cont.); suppressed VL 100%; location Spain. Primary endpoint: new AIDS events; AE.
Table 1. (continued)

SMART (12) (NCT00027352)
Results, primary endpoint: increased risk of HIV illnesses or death in TI arm; rates 3.3 (TI) / 1.3 (cont.) per 100 PY; HR 2.6. HIV-related disease: HR 3.5 (TI / cont.). Death: HR 1.8 (TI / cont.). Other: more CVD, renal, and hepatic events in TI; equal Grade IV AE. Time off ART in TI arm: 67% of total time; initial time off ART (median) 17 months. Conclusions: TI is inferior.

Trivacan (A) (13) (NCT00158405)
Results, primary endpoint: increased risk of morbidity in TI arm; rates 17.6 (TI) / 6.7 (cont.) per 100 PY; HR 2.8; no difference in death. HIV-related disease: HR 2.1 (TI / cont.). Time off ART in TI arm: 69% of total time; initial time off ART (median) 9 months. Conclusions: TI is inferior.

Staccato (B) (6) (NCT00113126)
Results, primary endpoint: no difference in % VL < 50 (p = 0.90); substantial drug savings in TI arm. HIV-related disease: none. Death: 1 (TI) / 1 (cont.). Other: equal Grade III/IV AE. Time off ART in TI arm: 62% of total time; initial time off ART (median) 4 months. Conclusions: no undue harm under TI.

HIV-NAT 001.4 (A) (5)
Results, primary endpoint: TI vs. cont.: VL 91/96% < 400; CD4+ 100/96% > 350; lower CD4+ median in TI group at week 108. HIV-related disease: 1 (TI) / 1 (cont.). Death: none. Other: equal Grade I–IV AE. Time off ART in TI arm: 54% of total time; initial time off ART (median) > 6 months. Conclusions: comparable outcome, cost saving in TI.

TIBET (15)
Results, primary endpoint: no difference in AIDS; increased risk of AE in TI arm, 43 / 15% (TI / cont.), HR 2.7. HIV-related disease: no new AIDS. Death: none. Time off ART in TI arm: 67% of total time; initial time off ART (median) 17 months. Conclusions: TI is less safe.

(A) Results of the CD4+-guided TI versus continuous treatment comparison; a third study arm tested a fixed-term TI, 4 months on, 2 months off ART.
(B) Results of the CD4+-guided TI versus continuous treatment comparison; a third study arm tested 1 week on/off ART cycling and was terminated because of high risk of virologic failure.
(C) Participants in the TI arm were recommended to restart ART when the TI strategy was terminated in January 2006; participants were followed for 18 more months.
Abbreviations: AE, adverse events; TI, treatment interruption; HR, hazard ratio; CI, confidence interval; PY, person-years; VL, viral load.
because of increased risk of HIV-related illnesses and death, whereas two of the three smaller studies found no treatment differences in clinical events and led to the interpretation that CD4+-guided TI was safe (5,6,9,12–14). None of the studies showed clinical benefit of CD4+-guided TI. The results for fixed-term TI were similarly disappointing (9). In these TI trials, it was accepted that the risk of HIV-related disease increases when interrupting ART; this increase in risk, however, was expected to be small and counterbalanced by potential long-term benefits. The SMART study showed a 2.6-fold increased risk of HIV-related illnesses and death in the TI arm, with an event rate of 1.3% under continuous treatment (12). The smaller studies were not powered to detect such a risk difference given the low event rates. It is also possible, however, that the TI strategies with higher CD4+ thresholds and less time spent off ART are safer, as hypothesized in Reference 6; conclusive proof would require an even larger study than SMART. TI was also investigated as a strategy to restore treatment options to patients whose virus had become resistant to multiple antiretroviral drug classes and could not be suppressed on their ART regimen. Continuing a failing regimen may contribute to more resistance; however, resistant viral strains are less fit, and when antiretroviral drugs are stopped, drug-susceptible "wild type" virus reemerges and tends to crowd out the resistant strains. The hope was that reconstitution of ART after a fixed-length treatment interruption would provide long-term benefit, in particular, durable suppression of viral load and higher CD4+ cell counts (2,15). Of 7 randomized controlled studies reviewed in Reference 16, only one study reported a gain in CD4+ counts (17); this study was small (68 participants) and underpowered for clinical events. The largest study enrolled 270 participants, and one half were randomized to a 4-month TI, followed by a salvage ART regimen. An increased risk of HIV-related illnesses and death in the TI arm, a decline of CD4+ cell counts that still persisted after 16 months on ART, and no benefit in viral suppression were found (15).
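To illustrate the power point made above, the rough normal-approximation sketch below uses the SMART event rates quoted in the text and Table 1 (3.3 versus 1.3 primary events per 100 person-years) together with the total person-years of two of the trials; the even split of follow-up between arms and the simple log-rate-ratio approximation are simplifying assumptions, not the trials' actual power calculations.

```python
# Rough two-sample Poisson-rate power approximation (not the trials' own calculations).
from math import log, sqrt
from scipy.stats import norm

def approx_power(rate_ti, rate_cont, total_person_years, alpha=0.05):
    # Expected events per arm, assuming follow-up splits evenly between the two arms.
    e_ti = rate_ti * total_person_years / 200.0
    e_cont = rate_cont * total_person_years / 200.0
    se_log_rr = sqrt(1.0 / e_ti + 1.0 / e_cont)   # SE of the log rate ratio
    z = abs(log(rate_ti / rate_cont)) / se_log_rr
    return norm.cdf(z - norm.ppf(1 - alpha / 2))

print(f"SMART-sized trial (7367 person-years):   power ~ {approx_power(3.3, 1.3, 7367):.2f}")
print(f"Staccato-sized trial (746 person-years): power ~ {approx_power(3.3, 1.3, 746):.2f}")
```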
1.2 Design Issues in Therapeutic TI studies In this section, selected issues in the design and monitoring of randomized, controlled therapeutic TI trials are discussed in the example of the SMART study. The main points are summarized in tables 2 and 3. General principles for the design of randomized trials are described elsewhere (18). The design of the TI strategy and choice of entrance criteria are interrelated. First and foremost, entrance criteria need to restrict the study population to those individuals for whom it is considered safe to discontinue treatment. Second, the TI strategy should allow participants to stay off treatment safely for a substantial period of time; otherwise, the interventions in the TI arm and the continuous therapy arm would be very similar, and it would be futile to look for a treatment difference. In the SMART trial, 5472 participants with CD4+ counts greater than 350 cells/ mm3 , age > 13 years, were randomized to intermittent versus continuous ART. Exclusion criteria concerned pregnancy and poor general health. The TI was guided by CD4+ cell count thresholds. ART was stopped at study entry and restarted when the CD4+ count had fallen below 250 cells/mm3 . When the CD4+ cell count had risen above 350 cells/mm3 , confirmed in a second measurement, ART would be stopped again. The threshold for restarting ART was selected to keep all participants at CD4+ levels that indicated low risk of disease, based on thencurrent guidelines for starting ART, and on rates of progression to AIDS known from cohort studies (19). Additional triggers for restarting ART in the TI arm included development of HIV-related symptoms or CD4% levels below 15. The thresholds for entering the study, and for stopping ART, were selected so that most participants could spend a considerable period of time off ART, assuming previously observed rates of decline in CD4+ cell counts. By the time the TI arm of the study was stopped, the difference in time spent off ART in two arms was substantial: In the TI group, the median time until the first reinitiation of ART was 17 months, and patients received ART for 34% of the time compared with 94% in the continuous
Table 2. Design Considerations for Therapeutic TI Studies Entry criteria • Stopping active treatment must be considered safe. • Participants should be sufficiently healthy to interrupt treatment safely for a substantial period of time, for differential exposure to treatment in the TI versus the control groups. The TI strategy • Often open-label assignment. • Thresholds for stopping and restarting treatment need to maintain patient safety but allow for sufficient difference in exposure to treatment between the TI and control groups. • Include provisions to restart treatment early for safety concerns; monitor clinical markers for safety. Endpoints • The expected risk–benefit trade-off motivating a TI trial often involves endpoints beyond the disease under investigation. • If the primary endpoint is restricted to the target disease, then the trial may be stopped with a ‘‘significant’’ result that does not accurately capture the true risk or benefit of TI. This problem may be partially alleviated by including all-cause mortality in the primary endpoint. Power considerations • The treatment difference is influenced by the duration of the TI. Early restart of therapy in the TI arm or interrupting treatment in the control arm will both decrease the difference. • Including all-cause mortality may dilute the treatment difference.
therapy arm (12). The median CD4+ count at study entry was 597 cells/mm3 ; a higher CD4+ entrance threshold would have prolonged the initial time off ART for patients in the TI arm. In therapeutic TI studies, it is often expected that the risk of the specific disease under investigation is somewhat higher in the TI arm, counterbalanced by a higher risk of adverse events associated with active treatment in the continuous treatment arm. The primary endpoint in the SMART study was a composite endpoint of HIV-related illnesses or all-cause mortality. It was expected that the TI group would have slightly more HIV-related illnesses in the short term
because active anti-HIV treatment was disrupted. A secondary endpoint consisted of major cardiovascular disease and renal and hepatic complications; these events were known to be associated with long-term use of ART from observational studies (1) and were at least as serious as the events in the primary endpoint (12). Equipoise in the trial came from the expectation that an increased risk of HIV-related illnesses in the TI arm would be counterbalanced by increased risk of cardiovascular disease and renal and hepatic complications in the continuous ART arm in the first few years; in the longer term, the scales might be tipped in favor of TI also with respect to HIV-related illnesses because higher rates of treatment failure resulting
Table 3. Considerations for Monitoring Therapeutic TI Studies Monitoring for efficacy • "Clear evidence for benefit or harm" includes the treatment difference with respect to the target disease (primary endpoint) but also other serious outcomes in the hypothesized risk–benefit trade-off (major secondary endpoint). • If the efficacy boundary is crossed for the primary endpoint but the major secondary endpoint favors the other study arm, consider continuation. • Short-term and long-term risks may be different. If relative risks are expected to reverse after a certain time, monitoring boundaries that discourage stopping at low information time may be preferred. Monitoring for futility • Early restart of therapy in the TI arm or interrupting therapy in the control arm decrease the difference in treatment between the study arms. • Monitor the difference in "time on treatment" between the study arms. Even with perfect adherence, this difference may turn out to be small. Monitoring for safety • Timely reinitiation of therapy, in addition to other safety parameters.
from drug resistance were thought likely under continuous ART. The study was powered to detect a 17% risk difference in the primary endpoint of HIV-related disease or death, assuming increasing event rates from 1.3 to 2.6 per 100 person-years under continuous treatment. To ensure that the trial would be terminated only if there was "clear and substantial evidence of benefit or harm," the interim monitoring guidelines recommended stopping the trial for efficacy only if (1) the O'Brien–Fleming boundary was crossed for the primary endpoint and (2) the secondary endpoint of major cardiovascular disease and renal and hepatic complications favored the same study arm. In this approach, the different risks expected in the two treatment arms were not treated symmetrically in the design or the interim stopping guidelines. The study included two provisions to guard against ending the trial solely based on the relative risk of HIV-related illnesses: (1) the requirement for consistency of the primary and secondary endpoint in the stopping guidelines and (2) the inclusion of all-cause mortality in the primary endpoint. Thus, the most severe, fatal non-HIV
complications were included in the primary comparison of the treatment arms. Alternatively, the trial could have been designed with coprimary endpoints to avoid the asymmetry, although at the cost of higher sample size. The TI strategy of the SMART trial was stopped early, and patients were recommended to resume ART; at this point, 18% of the planned 910 primary endpoint events had accrued. Both the primary and the secondary endpoint favored continuous ART, with hazard ratios of 2.6 (95% CI: 1.9 to 3.7) and 1.7 (1.1 to 2.5), respectively (12). A certain amount of nonadherence or cross over to another treatment arm would be expected in a large trial and taken into account when powering the study. In TI trials, however, participants in the TI arm effectively ‘‘cross over’’ to the intervention of the control arm (continuous therapy) when they restart treatment, even if the restart was mandated by the TI guidelines. Similarly, participants in the control arm ‘‘cross over’’ when they discontinue therapy. The proportion of follow-up time spent on or off therapy by participants in the TI and control groups is a measure for the difference in
intervention received in the two study arms. If this difference becomes too small, then the study may be severely underpowered. In the SMART trial, monitoring guidelines for study viability concentrated on follow-up time spent off and on ART in the two study arms. Monitoring guidelines included boundaries for the proportion of participants who restarted ART within 6 months after randomization in the TI arm and for the proportion of participants who interrupted ART for more than 4 weeks in the continuous ART arm.
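A sketch of such a viability check is shown below. The data-frame columns and the 20% and 10% limits are hypothetical placeholders; the actual boundary values used in the SMART monitoring plan are not given in this article.

```python
# Illustrative viability check; boundary values and column names are hypothetical.
import pandas as pd

def viability_check(ti_arm: pd.DataFrame, cont_arm: pd.DataFrame,
                    max_early_restart=0.20, max_long_interruption=0.10) -> bool:
    # TI arm: fraction of participants restarting ART within 6 months of randomization.
    early_restart = (ti_arm["months_to_restart"] <= 6).mean()
    # Continuous arm: fraction of participants interrupting ART for more than 4 weeks.
    long_interruption = (cont_arm["weeks_interrupted"] > 4).mean()
    print(f"TI arm restarting ART within 6 months:     {early_restart:.1%}")
    print(f"Continuous arm interrupting ART > 4 weeks: {long_interruption:.1%}")
    return early_restart <= max_early_restart and long_interruption <= max_long_interruption
```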
2 MANAGEMENT OF CHRONIC DISEASE
TI is often attractive to patients with chronic disease, in particular, when medication had to be taken for many years, drugs have side effects, and symptoms are not immediately apparent after stopping drugs. Under these circumstances, ‘‘drug holidays’’ outside of clinical studies are a fact of life, and unless controlled TI studies are performed, it is difficult to assess the risks of such interruptions. Another motivation for TI studies is uncertainty about the optimal duration of treatment, because the utility of life-long treatment may decrease after several years. Osteoporosis treatment with bisphosphonates is an example. Bisphosphonates increase bone mineral density (BMD) when actively taken and remain in the bone matrix for many years, which suggests a long-term beneficial effect even when treatment is discontinued. This property motivated a large, placebo-controlled randomized trial to assess whether the bisphosphonate alendronate could be discontinued or maintained at low dose after several years of successful treatment (20). A total of 1099 women who had received alendronate for at least 3 years in a previous study (21,22) were randomized to receiving placebo (40%), alendronate 5 mg/d (30%), or alendronate 10 mg/d (30%) and were followed for 5 years. The study excluded women who had lost bone mineral density since starting alendronate or had very low BMD (t-score < −3.5), and the blinded study treatment was discontinued when BMD dropped to 5% below the pretreatment value. Stopping alendronate resulted in increased risk of vertebral fractures but not
other bone fractures. BMD declined in the placebo group but stayed above the pretreatment value. No difference in BMD was found between the lower- and higher-dose study arms. In a study on treatment of mild hypertension, conducted in the 1980s, blood pressure medication was discontinued in all participants to investigate whether blood pressure could be controlled by adding potassium chloride supplements to a low-sodium diet (23). The study was motivated by concerns over long-term exposure to the then-used antihypertensive agents and by findings in epidemiologic studies that suggested benefit of high potassium and low sodium intake on blood pressure; previous smaller, short-term studies had inconsistent results concerning potassium supplements (23). The study randomized 287 men who had received antihypertensive drugs for at least 3.5 years to potassium chloride supplements versus placebo. During the first 12 weeks, all participants received dietary intervention to lower sodium intake; then antihypertensive drugs were stopped, and patients were followed for an average of 2 more years. To protect patient safety, those with cardiovascular or renal disease or diastolic blood pressure > 90 mm Hg were excluded, and blood pressure-lowering drugs were reinstated when diastolic blood pressure increased to > 90 mm Hg. The primary endpoint was the proportion of patients requiring reinstatement of antihypertensive drugs. No difference was found between the sodium supplement and placebo arms. The hypertension study blurs the boundaries of analytic and therapeutic TI. The TI was analytic, as the primary aim was not to assess the effect of interrupting treatment but to compare two interventions in patients who had interrupted treatment. However, the study was motivated by a possible therapeutic benefit of the TI—the reduction of side effects.
3 ANALYTIC TREATMENT INTERRUPTION IN THERAPEUTIC VACCINE TRIALS The objective of therapeutic anti-HIV vaccines is to control HIV replication in patients
with acute or chronic HIV infection by activating the host’s immune response. Therapeutic vaccines that protect the patient if administered after infection exist, for example, for rabies and hepatitis B, but no antiHIV vaccine has yet been approved. After a series of unsuccessful trials that tested vaccine candidates in patients with acute HIV infection and in patients with uncontrolled chronic HIV, it was suspected that ongoing HIV replication decreases the host’s responsiveness to HIV antigens (24). Consequently, trials were conducted in patients who had suppressed viral load because of successful ART, and analytic treatment interruptions were used to evaluate whether the vaccine response controlled HIV RNA levels in the absence of ART (25,26). For example, in a study of ALVAC (27), patients on ART with undetectable viral load were randomized to vaccine versus placebo; all patients continued on ART until the course of vaccinations was completed; then ART was interrupted. In the placebo group, viral load was expected to rebound and plateau after a certain time, and CD4+ cell counts were expected to decline. The effectiveness of the vaccine was assessed by comparing treatment groups for time to detectable viral load, magnitude of the viral load plateau, change in CD4+ levels, and other markers. To monitor the durability of viral response, analytic TIs often require about 12–16 weeks off ART, to allow the HIV RNA level to plateau (28), but may last substantially longer (29). The realization that interruption of ART poses risks even in patients with high CD4+ cell counts, as found in the Trivacan and SMART trials (12,13), raised concerns about use of analytic TIs, and stricter safety parameters and monitoring standards were recommended (28). 4 RANDOMIZED DISCONTINUATION DESIGNS In randomized discontinuation trials, as introduced by Kopec et al. (30), all patients receive the investigative treatment for a set run-in period. At the end of this run-in period, patients who may have responded to treatment are randomized to continuing or stopping (control group) the treatment; the
control group may receive placebo, an active control, or stop treatment open-label. Patients who had adverse reactions or did not adhere to the treatment regimen are usually excluded from randomization. The description so far also applies to the alendronate trial described earlier; however, a crucial difference exists. The study question in the alendronate trial concerned women who had responded to alendronate in long-term treatment, and such women were enrolled. In contrast, randomized discontinuation trials are conducted to explore whether the investigational treatment could be effective in future, untreated patients. Statistical properties, efficiency, and sample size are discussed in References 30–33. The Phase II trial that first reported that sorafenib inhibited tumor growth in renal cancer illustrates successful use of a randomized discontinuation design (34). A total of 202 patients with metastatic renal cell carcinoma were enrolled. During the first 12 weeks, all patients were to receive sorafenib. At 12 weeks, 7% of patients had discontinued treatment, most because of adverse effects; 36% of patients showed tumor shrinkage by more than 25% and continued on open-label sorafenib; and for 34% of patients (69 patients) tumor size stayed within 25% of baseline. Sixty-four of the patients whose tumor size stayed within 25% of baseline, plus one other patient, were randomized to continuing sorafenib versus placebo. Patients who were randomized to placebo (discontinuation arm) formed the control group. The primary endpoint was disease progression at 24 weeks, defined as > 25% tumor growth or other clinical progression. At 24 weeks, 16 of the 32 patients who were randomized to continue sorafenib were progression-free compared with only 6 of the 33 on placebo (P = 0.008). Sorafenib was later confirmed to prolong progression-free survival and overall survival in a large Phase III trial in which 602 patients with advanced renal cell carcinoma were randomized up front to sorafenib versus placebo (35). The original focus of the sorafenib trial was colon cancer, but patients with certain other tumors were also permitted to enter. Early in the trial, investigators observed tumor shrinkage in patients with renal cell
carcinoma and increased enrollment of such patients. In total, 502 patients were enrolled, including 202 with renal cell carcinoma. Sorafenib was not effective in colon cancer. The treatment effect in renal cancer, however, was sufficiently strong to be evident in the small subgroup that was randomized (65 patients). By enrolling a broad study population, and through careful monitoring of the run-in period, investigators could identify characteristics of patients who were likely to benefit. In this trial, investigators avoided the controversy about stopping effective treatment by continuing open-label sorafenib therapy in patients whose tumors had shrunk by more than 25% (36). Randomized discontinuation trials have been conducted in many areas, including cancer research, assessing drugs for mental disorders, asthma treatments, or pain medication. The design is particularly useful if the treatment is effective only in a small subgroup of the study population; by confining the randomized comparison to potential responders, the power for detecting the treatment effect in the enriched population is increased (30–32). Whether enriching the population results in substantially lower sample-size requirements compared with upfront randomization depends on the size of the treatment effect, the proportion of patients sensitive to the treatment, and other design parameters (31,32). The post hoc selection of the subgroup for randomization has disadvantages, including the following: (1) If the finding is positive, it is not clear to what population of future patients it applies, (2) the design is unsuited for estimating the risk of adverse events because patients with adverse reactions in the run-in phase are excluded, and (3) a negative finding may result from carry-over effects from active treatment. Therefore, findings from randomized discontinuation designs need to be confirmed in a trial with up-front randomization to treatment versus control (30,36,37). The randomized discontinuation trial can help identify characteristics of likely responders, to focus the population selection for a confirmatory trial (36).
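As a quick check of the randomized comparison quoted above (16 of 32 progression-free on continued sorafenib versus 6 of 33 on placebo at 24 weeks), a Fisher exact test can be run on those counts as sketched below; the reported P = 0.008 came from the trial's own prespecified analysis, so the value here is only expected to fall in the same range.

```python
# Fisher exact test on the 2 x 2 table quoted above (progression-free vs. progressed).
from scipy.stats import fisher_exact

table = [[16, 32 - 16],   # continued sorafenib: progression-free / progressed
         [6, 33 - 6]]     # placebo (discontinuation arm)
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio {odds_ratio:.2f}, two-sided Fisher exact P = {p_value:.3f}")
```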
5 FINAL COMMENTS
TI has various purposes in clinical trials. In randomized discontinuation designs, it is used to produce a control against which the investigational treatment is measured. In other settings, the TI itself is the target of the investigation, motivated by the hope that interrupting treatment provides a net benefit or at least draws even in the trade-off between risk of disease progression versus the adverse side effects, cost, or inconvenience of long-term treatment. Lessons learned from the history of CD4+-guided therapeutic TI studies in HIV infection mirror those in many other fields: if small studies suggest that TI is safe, this should be confirmed in larger studies powered to identify moderate risk differences in important clinical endpoints; broad data collection helps identify unexpected risks; and small, underpowered studies may lead to an over-optimistic perception of safety.

REFERENCES

1. A. Carr and D. A. Cooper, Adverse effects of antiretroviral therapy. Lancet. 2000; 356: 1423–1430.
2. S. Staszewski, Treatment interruption in advanced failing patients. Current Opinion in HIV & AIDS. 2007; 2(1): 39–45.
3. F. Lori, A. Foli, and J. Lisziewicz, Stopping HAART temporarily in the absence of virus rebound: exploring new HIV treatment options. Current Opinion in HIV & AIDS. 2007; 2(1): 14–20.
4. F. Lori and J. Lisziewicz, Structured treatment interruptions for the management of HIV infection. JAMA. 2001; 286(23): 2981–2987.
5. J. Ananworanich, U. Siangphoe, A. Hill, P. Cardiello, W. Apateerapong, B. Hirschel, A. Mahanontharit, S. Ubolyam, D. Cooper, P. Phanuphak, and K. Ruxrungtham, Highly active antiretroviral therapy (HAART) retreatment in patients on CD4-guided therapy achieved similar virologic suppression compared with patients on continuous HAART: the HIV Netherlands Australia Thailand Research Collaboration 001.4 study. J. Acquir. Immune Defic. Syndr. 2005; 39(5): 523–529.
6. J. Ananworanich, A. Gayet-Ageron, M. Le Braz, W. Prasithsirikul, P. Chetchotisakd, S. Kiertiburanakul, W. Munsakul, P. Raksakulkarn, S. Tansuphasawasdikul, S. Sirivichayakul, M. Cavassini, U. Karrer, D. Genne, R. Nuesch, P. Vernazza, E. Bernasconi, D. Leduc, C. Satchell, S. Yerly, L. Perrin, A. Hill, T. Perneger, P. Phanuphak, H. Furrer, D. Cooper, K. Ruxrungtham, and B. Hirschel, Staccato Study Group, Swiss HIV Cohort Study, CD4-guided scheduled treatment interruptions compared with continuous therapy for patients infected with HIV-1: results of the Staccato randomised trial. Lancet. 2006; 368(9534): 459–465.
7. B. Marchou, P. Tangre, I. Charreau, J. Izopet, P. M. Girard, T. May, J. M. Ragnaud, J. P. Aboulker, J. M. Molina, Team AS, Intermittent antiretroviral therapy in patients with controlled HIV infection. AIDS. 2007; 21(4): 457–466.
8. J. Hakim, on behalf of the DART trial team, A structured treatment interruption strategy of 12 week cycles on and off ART is clinically inferior to continuous treatment in patients with low CD4 counts before ART: a randomisation within the DART trial. In: XVI International AIDS Conference, Toronto, Canada, 13–18 August 2006, Abstract THLB0207; 2006.
9. R. Nüesch and B. Hirschel, Treatment interruption for convenience, cost cutting and toxicity sparing. Current Opinion in HIV & AIDS. 2007; 2(1): 31–38.
10. DHHS Panel on Antiretroviral Guidelines for Adults and Adolescents, Guidelines for the use of antiretroviral agents in HIV-infected adults and adolescents - January 29, 2008. Department of Health and Human Services. (Available: http://aidsinfo.nih.gov/contentfiles/AdultandAdolescentGL.pdf. Accessed on February 17, 2008.)
11. C. T. Burton, M. R. Nelson, P. Hay, B. G. Gazzard, F. M. Gotch, and N. Imami, Immunological and virological consequences of patient-directed antiretroviral therapy interruption during chronic HIV-1 infection. Clin. Exp. Immunol. 2005; 142(2): 354–361.
12. The SMART Study Group, CD4-guided antiretroviral treatment interruption: primary results of the SMART study. N. Engl. J. Med. 2006; 355(22): 2283–2296.
13. C. Danel, R. Moh, A. Minga, A. Anzian, O. Ba-Gomis, C. Kanga, G. Nzunetu, D. Gabillard, F. Rouet, S. Sorho, M.-L. Chaix, S. Eholie, H. Menan, D. Sauvageot, E. Bissagnene, R. Salamon, and X. Anglaret, Trivacan ANRS 1269 trial group, CD4-guided structured antiretroviral treatment interruption strategy in HIV-infected adults in west Africa (Trivacan ANRS 1269 trial): a randomised trial. Lancet. 2006; 367: 1981–1989.
14. L. Ruiz, R. Paredes, G. Gomez, J. Romeu, P. Domingo, N. Perez-Alvarez, G. Tambussi, J. M. Llibre, J. Martinez-Picado, F. Vidal, C. R. Fumaz, and B. Clotet, The Tibet Study Group, Antiretroviral therapy interruption guided by CD4 cell counts and plasma HIV-1 RNA levels in chronically HIV-1-infected patients. AIDS. 2007; 21(2): 169–178.
15. J. Lawrence, D. L. Mayers, K. H. Hullsiek, G. Collins, D. I. Abrams, R. B. Reisler, L. R. Crane, B. S. Schmetter, T. J. Dionne, J. M. Saldanha, M. C. Jones, J. D. Baxter, 064 Study Team of the Terry Beirn Community Programs for Clinical Research on AIDS, Structured treatment interruption in patients with multidrug-resistant human immunodeficiency virus. N. Engl. J. Med. 2003; 349: 837–846.
16. N. Pai, J. Lawrence, A. L. Reingold, and J. P. Tulsky, Structured treatment interruptions (STI) in chronic unsuppressed HIV infection in adults. Cochrane Database of Systematic Reviews 2006(3): Art. No. CD006148. DOI: 10.1002/14651858.CD006148.
17. C. Katlama, S. Dominguez, K. Gourlain, C. Duvivier, C. Delaugerre, M. Legrand, R. Tubiana, J. Reynes, J.-M. Molina, G. Peytavin, V. Calvez, and D. Costagliola, Benefit of treatment interruption in HIV-infected patients with multiple therapeutic failures: a randomized controlled trial (ANRS 097). AIDS. 2004; 18(2): 217–226.
18. L. M. Friedman, C. D. Furberg, and D. L. DeMets, Fundamentals of Clinical Trials. 3rd ed. New York: Springer, 1998.
19. A. Mocroft, C. Katlama, A. M. Johnson, C. Pradier, F. Antunes, F. Mulcahy, A. Chiesi, A. N. Phillips, O. Kirk, and J. D. Lundgren, AIDS across Europe, 1994–98: the EuroSIDA study. Lancet. 2000; 356(9226): 291–296.
20. D. M. Black, A. V. Schwartz, K. E. Ensrud, J. A. Cauley, S. Levis, S. A. Quandt, S. Satterfield, R. B. Wallace, D. C. Bauer, L. Palermo, L. E. Wehren, A. Lombardi, A. C. Santora, S. R. Cummings, and the FLEX Research Group, Effects of continuing or stopping alendronate after 5 years of treatment: the Fracture Intervention Trial Long-term Extension (FLEX): a randomized trial. JAMA. 2006; 296(24): 2927–2938.
21. S. R. Cummings, D. M. Black, D. E. Thompson, W. B. Applegate, E. Barrett-Connor, T. A. Musliner, L. Palermo, R. Prineas, S. M. Rubin, J. C. Scott, T. Vogt, R. Wallace, A. J. Yates, and A. Z. LaCroix, Effect of alendronate on risk of fracture in women with low bone density but without vertebral fractures: results from the Fracture Intervention Trial. JAMA. 1998; 280(24): 2077–2082.
22. D. M. Black, S. R. Cummings, D. B. Karpf, J. A. Cauley, D. E. Thompson, M. C. Nevitt, D. C. Bauer, H. K. Genant, W. L. Haskell, R. Marcus, S. M. Ott, J. C. Torner, S. A. Quandt, T. F. Reiss, and K. E. Ensrud, Randomised trial of effect of alendronate on risk of fracture in women with existing vertebral fractures. Fracture Intervention Trial Research Group. Lancet. 1996; 348(9041): 1535–1541.
23. J. R. Grimm, J. D. Neaton, P. J. Elmer, K. H. Svendsen, J. Levin, M. Segal, L. Holland, L. J. Witte, D. R. Clearman, P. Kofron, R. K. LaBounty, R. Crow, and R. J. Prineas, The influence of oral potassium chloride on blood pressure in hypertensive men on a low-sodium diet. N. Engl. J. Med. 1990; 322(9): 569–572.
24. R. T. Schooley, C. Spino, D. Kuritzkes, B. D. Walker, F. T. Valentine, M. Hirsch, E. Cooney, G. Friedland, S. Kundu, T. Merigan Jr., M. McElrath, A. Collier, S. Plaeger, R. Mitsuyasu, J. Kahn, P. Haslett, P. Uherova, V. deGruttola, S. Chiu, B. Zhang, G. Jones, D. Bell, N. Ketter, T. Twadell, D. Chernoff, and M. Rosandich, Two double-blinded, randomized, comparative trials of 4 human immunodeficiency virus type 1 (HIV-1) envelope vaccines in HIV-1–infected individuals across a spectrum of disease severity: AIDS Clinical Trials Groups 209 and 214. J. Infect. Dis. 2000; 182: 1357–1364.
25. Immune Response to a Therapeutic HIV Vaccine Followed by Treatment Interruption in Patients With Acute or Recent HIV Infection (AIN504/ACTG A5218). ClinicalTrials.gov identifier: NCT00183261. Available: http://clinicaltrials.gov/show/NCT00183261?order=11. Accessed February 17, 2008.
26. M. M. Lederman, A. Penn-Nicholson, S. F. Stone, S. F. Sieg, and B. Rodriguez, Monitoring clinical trials of therapeutic vaccines in HIV infection: role of treatment interruption. Current Opinion in HIV & AIDS. 2007; 2(1): 56–61.
27. J. M. Jacobson, R. Pat Bucy, J. Spritzler, M. S. Saag, J. J. J. Eron, R. W. Coombs, R. Wang, L. Fox, V. A. Johnson, S. Cu-Uvin, S. E. Cohn, D. Mildvan, D. O'Neill, J. Janik, L. Purdue, D. K. O'Connor, C. D. Vita, I. Frank, and the National Institute of Allergy and Infectious Diseases-AIDS Clinical Trials Group 5068 Protocol Team, Evidence that intermittent structured treatment interruption, but not immunization with ALVAC-HIV vCP1452, promotes host control of HIV replication: the results of AIDS Clinical Trials Group 5068. J. Infect. Diseases. 2006; 194: 623–632.
28. Report from the workshop on HIV structured treatment interruptions/intermittent therapy. July 17–19, 2006, London, U.K. Sponsored by the Office of AIDS Research, NIH, U.S. Department of Health and Human Services. Available: http://www.oar.nih.gov/public/NIH OAR STI IT Report Final.pdf. Accessed February 17, 2008.
29. K. Henry, D. Katzenstein, D. W. Cherng, H. Valdez, W. Powderly, M. B. Vargas, N. C. Jahed, J. M. Jacobson, L. S. Myers, J. L. Schmitz, M. Winters, P. Tebas, and the A5102 Study Team of the AIDS Clinical Trials Group, A pilot study evaluating time to CD4 T-cell count < 350 cells/mm3 after treatment interruption following antiretroviral therapy +/- Interleukin 2: results of ACTG A5102. J. Acquir. Immune Defic. Synd. 2006; 42(2): 140–148.
30. J. A. Kopec, M. Abrahamowicz, and J. M. Esdaile, Randomized discontinuation trial: utility and efficiency. J. Clin. Epidemiol. 1993; 46: 959–971.
31. W. B. Capra, Comparing the power of the discontinuation design to that of the classic randomized design on time-to-event endpoints. Control. Clin. Trials. 2004; 25(2): 168–177.
32. B. Freidlin and R. Simon, Evaluation of randomized discontinuation design. J. Clin. Oncol. 2005; 23(22): 5094–5098.
33. G. L. Rosner, W. Stadler, and M. J. Ratain, Randomized discontinuation design: application to cytostatic antineoplastic agents. J. Clin. Oncol. 2002; 20(22): 4478–4484.
34. M. J. Ratain, T. J. Eisen, W. M. Stadler, K. T. Flaherty, S. B. Kaye, G. L. Rosner, M. Gore, A. A. Desai, A. Patnaik, H. Q. Xiong, E. Rowinsky, J. L. Abbruzzese, C. Xia, R. Simantov, B. Schwartz, and P. J. O'Dwyer, Phase II placebo-controlled randomized discontinuation trial of sorafenib in patients with metastatic renal cell carcinoma. J. Clin. Oncol. 2006; 24: 2505–2512.
35. J. Llovet, S. Ricci, V. Mazzaferro, P. Hilgard, J. Raoul, S. Zeuzem, M. Poulin-Costello, M. Moscovici, D. Voliotis, and J. Bruix for the SHARP Investigators Study Group, Sorafenib improves survival in advanced Hepatocellular Carcinoma (HCC): Results of a Phase III randomized placebo-controlled trial (SHARP trial). J. Clin. Oncol. 2007 ASCO Annual Meeting Proceedings Part I vol. 25, no. 18S (June 20 Supplement), 2007: LBA1.
36. W. M. Stadler, The randomized discontinuation trial: a phase II design to assess growth-inhibitory agents. Mol. Cancer Ther. 2007; 6(4): 1180–1185.
37. P. D. Leber and C. S. Davis, Threats to the validity of clinical trials employing discontinuation strategies for sample selection. Control Clin. Trials. 1998; 19(22): 178–187.
CROSS-REFERENCES
Enrichment Design
Placebo-Controlled Trial
Human Immunodeficiency Virus (HIV) Trials
Run-in Period
Stopping Boundaries
TREATMENT INVESTIGATIONAL NEW DRUG (IND)

Treatment Investigational New Drugs (IND) are used to make promising new drugs available to desperately ill patients as early in the drug development process as possible (Federal Register, May 22, 1987). The U.S. Food and Drug Administration (FDA) will permit an investigational drug to be used under a treatment IND when there is preliminary evidence of the drug's efficacy and the drug is intended to treat a serious or life-threatening disease, or when there is no comparable alternative drug or therapy available for the stage of the disease in the intended patient population. In addition, these patients must not be eligible to be included in the definitive clinical trials, which must be well underway if not almost finished. To be considered immediately life-threatening, the disease must be in a stage where there is a reasonable likelihood of death within a matter of months or of a premature death without early treatment. For example, advanced cases of acquired immunodeficiency syndrome (AIDS), herpes simplex encephalitis, and subarachnoid hemorrhage are all considered to be immediately life-threatening diseases. Treatment INDs are made available to patients before general marketing begins, typically during phase III studies. Treatment INDs also allow the FDA to obtain additional data on the drug's safety and effectiveness. Between 1987 and 1998, 39 treatment INDs were allowed to proceed. Of these, 11 INDs were for the human immunodeficiency virus (HIV)/AIDS and 13 were for cancer.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/oashi/patrep/treat.html and http://www.fda.gov/cder/handbook/treatind.htm) by Ralph D’Agostino and Sarah Karl.
Wiley Encyclopedia of Clinical Trials, Copyright © 2008 John Wiley & Sons, Inc.
TRIAL SITE

The trial site is the location(s) where trial-related activities are actually conducted.
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cder/guidance/iche6.htm) by Ralph D’Agostino and Sarah Karl.
Wiley Encyclopedia of Clinical Trials, Copyright © 2008 John Wiley & Sons, Inc.
TRUE POSITIVES, TRUE NEGATIVES, FALSE POSITIVES, FALSE NEGATIVES

CARSTEN SCHWENKE
Schering AG, Berlin, Germany

1 INTRODUCTION

In the clinical development of diagnostic procedures, studies are performed to measure the efficacy of the new method in the target population. Such new methods could be laboratory tests that measure a tumor marker to evaluate the disease status of a patient. Other examples are imaging methods like ultrasound, where several raters are asked to assess the images. In imaging studies testing contrast agents for magnetic resonance imaging or computed tomography, several readers are asked to search for pulmonary nodules in the lung segments, to classify lesions in the brain, or to detect vessel segments with stenosis. Often, the outcomes in such studies are dichotomous with the two outcomes "disease present" or "disease absent." A chance of misclassification exists because the new method is not free from error (i.e., the method provides the wrong disease status). Often, a diagnostic study is performed to quantify the amount and direction of misclassification. Misclassification can only be determined when the truth is known. When the truth is not known and cannot be obtained, the misclassification cannot be measured. In diagnostic studies, the truth is obtained from standard methods used in clinical routine. The true status, also called the "gold standard" or "standard of truth," often is not known, so that a "standard of reference" is chosen to be closest to the truth. Examples of standards of reference are surgery in studies on the detection of focal liver lesions, histopathology to classify malignomas, as well as the clinical outcome of a disease obtained by following up with the patient after several years.

2 TRUE AND FALSE POSITIVES AND NEGATIVES

Whenever the clinical outcome of a diagnostic test is dichotomous with the two outcomes "positive" and "negative," a new diagnostic method under investigation classifies patients according to their test result into groups with outcomes "test positive" and "test negative." Four different classifications are possible, depending on whether the positive or negative test result is correct or not. Classifications in which the test result is not correct are called misclassifications. To estimate the direction and amount of the misclassification, the true disease status of the patient has to be known (e.g., "disease present" or "disease absent"). With a known true status of the patient, a true-positive test result (TP) is counted when the disease is truly present and the method gives a positive test result. A false-positive test result (FP) (i.e., a misclassification of a patient without the disease) is present when the test result is positive but the true status of the patient is "not diseased." A true-negative test result (TN) is counted when the disease is truly absent and the method gives a negative test result. A false-negative test result (FN) (i.e., a misclassification of a diseased patient) is present when the test result is negative but the true status of the patient is "diseased." The formal definitions are shown in Table 1. The direction and amount of the misclassification can be estimated by estimating validity parameters like sensitivity, specificity, and accuracy (see section: Diagnostic Studies). The four classifications can be summarized in a 2 × 2 table as shown in Table 2. In Table 3, a fictitious study with 100 patients is shown to explain the effect of misclassification. Here, 55 diseased patients were correctly classified by the method (i.e., for these patients, a positive test result was found).

Wiley Encyclopedia of Clinical Trials, Copyright © 2007 John Wiley & Sons, Inc.
Table 1. Formulae for true and false positive and negative test results

Source                    Outcome
Truth:                    1 = positive (+) = "disease present"; 0 = negative (-) = "disease absent"
Test procedure (Test):    1 = positive (+) = "test positive"; 0 = negative (-) = "test negative"

Classification:
True Positive (TP):    Truth = 1 and Test = 1
False Negative (FN):   Truth = 1 and Test = 0
False Positive (FP):   Truth = 0 and Test = 1
True Negative (TN):    Truth = 0 and Test = 0
Table 2. True and false positives and negatives in a 2 × 2 table

                        Truth
Test procedure          D+ (disease present)    D- (disease absent)    Total
T+ (test positive)      True positive (TP)      False positive (FP)    TP+FP
T- (test negative)      False negative (FN)     True negative (TN)     FN+TN
Total                   TP+FN                   FP+TN                  N (total)
Table 3. Example for a 2 × 2 table with true and false positives and negatives

                        Standard of reference
Test procedure          D+ (disease present)    D- (disease absent)    Total
T+ (test positive)      55                      10                     65
T- (test negative)      15                      20                     35
Total                   70                      30                     100
For 20 subjects without disease, the test result was negative, so these subjects were correctly classified as non-diseased. For 10 subjects without disease, the test was positive (i.e., these subjects would wrongly undergo therapy to treat a disease that is not present). For 15 patients having the disease, therapy would be required but none would be given, as the test result was negative. For these patients, the presence of the disease would be missed, as only the test result but not the truth would be known. True and false positives and negatives are essential, for example, for the estimation of sensitivity and specificity in diagnostic studies (see section: Diagnostic Studies).
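The classification rules in Table 1 and the counts in the Table 3 example translate directly into the usual validity parameters. The following short sketch (an illustration only; the function name and data layout are invented for this example) tallies the four cells from paired (truth, test) results and computes sensitivity, specificity, and accuracy, reproducing the fictitious 2 × 2 table above (55, 10, 15, 20):

    def tally(results):
        """Count TP, FP, FN, TN from (truth, test) pairs coded 1 = positive, 0 = negative."""
        tp = sum(1 for truth, test in results if truth == 1 and test == 1)
        fp = sum(1 for truth, test in results if truth == 0 and test == 1)
        fn = sum(1 for truth, test in results if truth == 1 and test == 0)
        tn = sum(1 for truth, test in results if truth == 0 and test == 0)
        return tp, fp, fn, tn

    # Reconstruct the Table 3 example: 55 TP, 10 FP, 15 FN, 20 TN
    pairs = [(1, 1)] * 55 + [(0, 1)] * 10 + [(1, 0)] * 15 + [(0, 0)] * 20
    tp, fp, fn, tn = tally(pairs)

    sensitivity = tp / (tp + fn)         # 55/70, about 0.79
    specificity = tn / (tn + fp)         # 20/30, about 0.67
    accuracy = (tp + tn) / len(pairs)    # 75/100 = 0.75
    print(tp, fp, fn, tn, round(sensitivity, 2), round(specificity, 2), round(accuracy, 2))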
UGDP TRIAL

CURTIS MEINERT
Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland

1 INTRODUCTION

The University Group Diabetes Program (UGDP) was a multicenter, investigator-initiated trial, started in 1960 and finished two decades later, designed to assess the value of established antihyperglycemic drugs for prevention of the late complications of adult-onset diabetes. It generated a firestorm of controversy starting in 1970 with presentation and publication of results suggesting that a widely used oral agent—tolbutamide (Orinase, Upjohn, Kalamazoo, now part of Pfizer)—was of no value in reducing the risk of morbidity or in prolonging life, if not possibly unsafe. The Food and Drug Administration (FDA) used its results to revise the label for the drug to include a special warning regarding possible cardiovascular (CV) risks associated with the drug, but only after a legal battle that raged on for more than a decade, which was propelled by the Committee for the Care of the Diabetic (CCD). This battle ultimately wound up in front of the U.S. Supreme Court. (The CCD was a group of diabetologists from around the country formed in late 1970 to challenge results from the trial.) The trial started in one era and ended in another. It started before Institutional Review Boards (IRBs) existed and finished with them in place. Indeed, it was finishing enrollment just as the U.S. Public Health Service notified recipients of NIH funds that further funding would be contingent on evidence of written informed consents from persons being studied. The trial has historical importance in that it was one of the first multicenter long-term prevention trials designed to assess the safety and efficacy of established treatments. Indeed, "multicenter study" as a publication type did not appear in the National Library of Medicine (NLM) lexicon until 1991. The impetus for the UGDP was born of a question from a U.S. Congressman to a diabetologist who would ultimately head up the UGDP. The question concerned treatment for the Congressman's daughter, who had recently been diagnosed as having adult-onset diabetes and had been placed on an oral antidiabetic drug. The Congressman wanted to know about the long-term safety and efficacy of such drugs. The answer was "not much." The expert noted that the drug agent she was taking was a commonly used drug and that it had been approved by the FDA because it was shown to be safe and effective in controlling blood sugar in people with type-2 diabetes. The expert went on to say that he had no idea whether such control was beneficial in prolonging life or delaying the usual complications associated with diabetes and argued that a long-term randomized trial should be conducted to answer the question. The dialogue ultimately led to a meeting of some key figures in diabetes and a methodologist-epidemiologist to map plans for such a trial. Those meetings ultimately led to an application to the National Institutes of Health for grant support, with funding initiated in 1960. Enrollment started in 1961 and finished in 1966 with a total enrollment of 1027 persons with newly diagnosed, non-insulin-dependent, adult-onset diabetes.

2 DESIGN AND CHRONOLOGY

The trial was designed to evaluate the effectiveness of hypoglycemic drug therapy in preventing or delaying the vascular complications in type-2 diabetics. The treatments and numbers enrolled per treatment group are as listed below. The trial involved 13 centers—12 clinics (5 clinics at the outset) and a coordinating center. All centers (except the clinic located in Williamson, WV) were located in universities or were university affiliated, hence the name University Group (Table 1).

Wiley Encyclopedia of Clinical Trials, Copyright © 2008 John Wiley & Sons, Inc.
Table 1. Study Centers, Location, and Directors*

Clinics
1. The Johns Hopkins School of Medicine, Baltimore, Maryland (1960). Director: Thaddeus Prout, MD
2. Massachusetts General Hospital, Boston, Massachusetts (1960). Director: Robert Osborne, MD
3. University of Cincinnati Medical Center, Cincinnati, Ohio (1960). Director: Harvey Knowles, MD
4. University of Minnesota Hospitals, Minneapolis, Minnesota (1960). Director: Frederick Goetz, MD
5. The Jewish Hospital and Medical Center of Brooklyn, Brooklyn, New York (1960). Director: Martin Goldner, MD
6. University Hospitals of Cleveland, Cleveland, Ohio (1961). Director: Max Miller, MD
7. Appalachian Regional Hospital, Williamson, West Virginia (1961). Director: Charles Jones, MD
8. University of Alabama Medical Center, Birmingham, Alabama (1962). Director: Buris Boshell, MD
9. Presbyterian-St. Luke's Hospital, Chicago, Illinois (1962). Director: Theodore Schwartz, MD
10. Washington University School of Medicine, St. Louis, Missouri (1962). Director: William Daughaday, MD
11. University of Puerto Rico School of Medicine, San Juan, Puerto Rico (1963). Director: Lillian Haddock, MD
12. The Virginia Mason Research Center, Seattle, Washington (1963). Director: Robert Reeves, MD

Coordinating Center (1960): University of Minnesota School of Public Health, Minneapolis, Minnesota (1960–1963); University of Maryland School of Medicine, Baltimore, Maryland (1963–). Director: Christian Klimt, MD, DrPH

Project Office (1960): National Inst of Arthritis & Metabolic Diseases, National Institutes of Health, Bethesda, Maryland. Director: LeMar Remmert, PhD

* Directors as of January 1970.
The trial, as originally designed, involved only tolbutamide as an oral agent and a corresponding placebo plus the two insulin treatments. The randomization schedule (administered by the coordinating center) was designed to provide equal numbers of people across the four treatments within blocks of size 16, by clinic. Phenformin and its corresponding placebo were added in 1962 along with additional clinics. The randomization scheme was modified to accommodate the change (1). Table 2 provides a chronology of the trial and an account of the legal battle mounted by the CCD to forestall the label change proposed by the FDA regarding the possibility of CV risks associated with tolbutamide. Table 3 provides a list of criticisms leveled against the trial and responses to those criticisms.

3 RESULTS

The primary results of the trial are summarized in References 2, 14, 26, and 37. The paper in Reference 2 describes results for the tolbutamide versus placebo comparison related to the decision of investigators to stop use of tolbutamide in the trial. Reference 14, published in 1971, describes results related to the decision of investigators to stop use of phenformin. References 26 and 37, published in 1978 and 1982, provide results on the two insulin treatment groups versus placebo—the only treatments to remain in use over the entire course of the trial.
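The blocked, clinic-stratified allocation described above, with equal assignment to the four original treatments within blocks of 16 separately for each clinic, can be sketched roughly as follows. This is a hypothetical illustration of permuted-block randomization in general; it is not the UGDP's actual schedule, software, or seeding scheme.

    import random

    TREATMENTS = ["IVAR", "ISTD", "PLBO", "TOLB"]   # the four original treatments
    BLOCK_SIZE = 16                                  # 4 assignments per treatment per block

    def clinic_schedule(clinic_id, n_patients, seed=0):
        """Permuted-block schedule for one clinic: equal numbers per treatment in each block."""
        rng = random.Random(1000 * seed + clinic_id)
        schedule = []
        while len(schedule) < n_patients:
            block = TREATMENTS * (BLOCK_SIZE // len(TREATMENTS))
            rng.shuffle(block)
            schedule.extend(block)
        return schedule[:n_patients]

    # Example: the first 16 assignments for a hypothetical clinic
    print(clinic_schedule(clinic_id=1, n_patients=16))

Because each complete block contains every treatment equally often, the assignment counts stay balanced within each clinic as enrollment proceeds.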
Enrollment by treatment group

No. Enrolled   Treatment*
204            Insulin variable (IVAR). Dosage: As much insulin (U-80 Lente Iletin or other insulins) per day as required to maintain "normal" blood glucose levels. Administered via subcutaneous injections.
210            Insulin standard (ISTD). Dosage: 10, 12, 14, or 16 units of insulin (U-80 Lente Iletin) per day, depending on patient body surface. Administered via subcutaneous injections.
205            Placebo (PLBO). Dosage: Placebo (lactose) tablets or capsules similar to those used for the tolbutamide or phenformin treatments.
204            Tolbutamide (TOLB). Dosage: 3 tablets per day, each containing 0.5 g of tolbutamide.
204            Phenformin (PHEN). Dosage: 1 capsule per day during the first week of treatment, thereafter 2 capsules per day; 50 mg of phenformin per capsule.

* All persons were prescribed standard antidiabetic diets in addition to their assigned treatments.
3.1 Tolbutamide Results

Conclusion (as given in Reference 2):

All UGDP investigators are agreed that the findings of this study indicate that the combination of diet and tolbutamide therapy is no more effective than diet alone in prolonging life. Moreover, the findings suggest that tolbutamide and diet may be less effective than diet alone or diet and insulin at least insofar as cardiovascular mortality is concerned. For this reason, use of tolbutamide has been discontinued in the UGDP.
The steering committee (SC—composed of the director and deputy director of each of the 12 clinical centers and the coordinating center) was responsible for conduct of the trial and for review of interim analyses of outcome data for the purpose of determining whether the trial should continue unaltered. The tolbutamide–placebo mortality difference started to trend against tolbutamide around year 6 of the trial (1967) and continued to increase with the difference for all cause mortality ultimately approaching the upper 95% monitoring bound (Fig. 1) and actually crossing the 95% upper monitoring bound for CV mortality. Ultimately, the investigators voted to stop use of tolbutamide
in the fall of 1969, but not without considerable debate prompted, in large measure, by a vocal minority who objected to stopping because they did not believe that the results were sufficient to show tolbutamide to be harmful. However, in the end investigators agreed that the finding, although by no means establishing harm, did indicate that "the combination of diet and tolbutamide therapy is no more effective than diet alone in prolonging life" and moreover "that tolbutamide and diet may be less effective than diet alone or than diet and insulin at least insofar as cardiovascular mortality is concerned (2)." The mortality results that led to the decision are presented in Table 4 and Figures 1 and 2 as contained in the above referenced publication. See also "The UGDP controversy: Thirty-four years of contentious ambiguity laid to rest" by Schwartz and Meinert (51). If tolbutamide was in fact harmful, investigators speculated that it was CV related. The tolbutamide–placebo difference in CV mortality was striking. The conventional p-value for the difference was 0.005 by the time tolbutamide was stopped. The concern regarding CV mortality was sufficient to cause the FDA to propose a labeling change for tolbutamide that was to include a special warning regarding potential CV risks associated with use of the drug. However, the CCD was successful in staying the change
Table 2. Chronology of Events Associated with the UGDP (34)

1959 June: First planning meeting of UGDP investigators (1)
1960 September: Initiation of grant support for the coordinating center and first 7 clinics (1)
1961 February: Enrollment of first patient (1)
1962 September: Addition of phenformin to the study and recruitment of 5 additional clinics (1)
1966 February: Completion of patient recruitment (1, 2)
1969 June 6: UGDP investigators vote to discontinue tolbutamide treatment (7 and UGDP meeting minutes)
1970 May 20: Tolbutamide results on Dow Jones ticker tape (3)
1970 May 21, 22: Wall Street Journal, Washington Post, and New York Times articles on tolbutamide results (4–6)
1970 June 14: Tolbutamide results presented at American Diabetes Association meeting, St Louis (7–9)
1970 October: Food and Drug Administration (FDA) distributes bulletin supporting findings (10)
1970 November: Tolbutamide results published (2)
1970 November: Committee for the Care of Diabetics (CCD) formed (11)
1971 April: Feinstein criticism of UGDP published (12)
1971 May 16: UGDP investigators vote to discontinue phenformin treatment
1971 June: FDA outlines labeling changes for sulfonylureas (13)
1971 August 9: UGDP preliminary report on phenformin published (14)
1971 September 14: Associate Director of the National Institutes of Health (NIH) (Tom Chalmers) asks the president of the International Biometrics Society to appoint a committee to review the UGDP (15)
1971 September 20: Schor criticism of UGDP published (16)
1971 September 20: Cornfield defense of UGDP published (17)
1971 October 7: CCD petitions commissioner of the FDA to rescind proposed label change (18 and actual petition)
1972 May: FDA reaffirms position on proposed labeling change (19)
1972 June 5: FDA commissioner denies October 1971 request to rescind proposed label change (18)
1972 July 13: CCD requests evidentiary hearing before FDA commissioner on proposed labeling changes (18)
1972 August 3: Commissioner of FDA denies CCD request for evidentiary hearing (20)
1972 August 11: CCD argues to have the FDA enjoined from implementing labeling change before the United States District Court for the District of Massachusetts (20)
1972 August 30: Request to have the FDA enjoined from making labeling change denied by Judge Campbell of the United States District Court for the District of Massachusetts (18,20)
1972 August: Biometrics Society Committee starts review of UGDP and other related studies (15)
1972 September: Seltzer criticism of UGDP published (21)
1972 October 17: Second motion for injunction against label change filed by CCD in the United States District Court for the District of Massachusetts (20)
1972 October: Response to Seltzer critique published (22)
1972 November 3: Temporary injunction order granted by Judge Murray of the United States District Court for the District of Massachusetts (20)
1972 November 7: Preliminary injunction against proposed label change granted by United States District Court for the District of Massachusetts (18)
1973 July 31: Preliminary injunction vacated by Judge Coffin of the United States Court of Appeals for the First Circuit; case sent back to FDA for further deliberations (18,20)
1973 October: FDA hearing on labeling of oral agents (18)
1974 February: FDA circulates proposed labeling revision (18)
1974 March–April: FDA holds meeting on proposed label change, then postpones action on change until report of Biometrics Committee (18)
1974 September 18–20: Testimony taken concerning use of oral hypoglycemic agents before the United States Senate Select Committee on Small Business, Monopoly Subcommittee (23)
1975 January 31: Added testimony concerning use of oral hypoglycemic agents before the United States Senate Select Committee on Small Business, Monopoly Subcommittee (24)
1975 February 10: Report of the Biometrics Committee published (15)
1975 February: UGDP final report on phenformin published (25)
1975 July 9, 10: Added testimony concerning use of oral hypoglycemic agents before the United States Senate Select Committee on Small Business, Monopoly Subcommittee (24)
1975 August: Termination of patient follow-up in UGDP (26)
1975 September 30: CCD files suit against David Mathews, Secretary of Health, Education, and Welfare, et al., for access to UGDP raw data under the Freedom of Information Act (FOIA) in the United States District Court for the District of Columbia (27)
1975 October 14: Ciba-Geigy files suit against David Mathews, Secretary of Health, Education, and Welfare, et al., for access to UGDP raw data under the FOIA in the United States District Court for the Southern District of New York (28)
1975 December: FDA announces intent to audit the UGDP results (29)
1976 February 5: United States District Court for the District of Columbia rules UGDP raw data not subject to FOIA (30)
1976 February 25: CCD files appeal of February 5 decision in United States Court of Appeals for the District of Columbia Circuit (29)
1976 September: FDA audit of UGDP begins
1976 October: FDA Endocrinology and Metabolism Advisory Committee recommends removal of phenformin from market (32)
1977 March 8: United States District Court for the Southern District of New York rejects Ciba-Geigy request for UGDP raw data (31)
1977 April 22: Health Research Group (HRG) of Washington, DC, petitions Secretary of HEW to suspend phenformin from market under imminent hazard provision of law (33)
1977 May 6: FDA begins formal proceedings to remove phenformin from market (33)
1977 May 13: FDA holds public hearing on petition of HRG (33)
1977 July 25: Secretary of HEW announces decision to suspend New Drug Applications (NDAs) for phenformin in 90 days (33)
1977 August: CCD requests that United States District Court for the District of Columbia issue an injunction against HEW order to suspend NDAs for phenformin (34)
1977 October 21: CCD request to United States District Court for the District of Columbia for injunction against HEW order to suspend NDAs for phenformin denied (34)
1977 October 23: NDAs for phenformin suspended by Secretary of HEW under imminent hazard provision of law (35)
1977 December: UGDP announces release of data listings for individual patients (36)
1978 January: Appeal of October 21, 1977, court ruling filed by the CCD in United States Court of Appeals for the District of Columbia Circuit
1978 July 7: Preliminary report on insulin findings published (37)
1978 July 11: Judges Leventhal and MacKinnon of the United States Court of Appeals for the District of Columbia Circuit rule that public does not have right to UGDP raw data under the FOIA; Judge Bazelon dissents (29, 38)
1978 July 25: CCD petitions United States Court of Appeals for the District of Columbia Circuit for rehearing on July 11 ruling (29)
1978 October 17: Petition for rehearing at the United States Court of Appeals for the District of Columbia Circuit denied (29)
1978 November 14: Results of FDA audit of UGDP announced (39)
1978 November 15: Commissioner of FDA orders phenformin withdrawn from market (40)
1979 January 15: CCD petitions the United States Supreme Court for writ of certiorari to the United States Court of Appeals for the District of Columbia Circuit (29)
1979 April 10: Appeal of October 21, 1977, ruling denied
1979 May 14: Writ of certiorari granted
1979 October 31: UGDP case of Forsham et al. versus Harris et al. argued before the United States Supreme Court (40)
1980 March 3: United States Supreme Court holds that HEW need not produce UGDP raw data in 6 to 2 decision (40)
1982 April: Expiration of NIH grant support for UGDP
1982 November: UGDP deposits patient listings plus other information at the National Technical Information Service for public access (41, 42)
1984 March 16: Revised label for sulfonylurea class of drugs released (43–45)
via prolonged court battles. Ultimately, the revised label, complete with the special CV warning, was issued after challenges were exhausted—1984, 13 years after having been first proposed by the FDA.
Special Warning on Increased Risk of Cardiovascular Mortality: The administration of oral hypoglycemic drugs has been reported to be associated with increased cardiovascular mortality as compared to treatment with diet alone

Figure 1. Tolbutamide-placebo 95% monitoring bounds (2). [Panels: (a) all cause mortality; (b) cardiovascular mortality; x-axis: years of study, Feb 1961 to Feb 1973; y-axis ticks from -20 to 20.]
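As noted later in this article, the 95% monitoring bounds summarized in Figure 1 were generated by Monte Carlo simulation under the assumption of no treatment difference. The sketch below conveys only the general idea; the mortality rate, number of looks, group size, and number of replications are assumptions made for illustration and do not reproduce the UGDP's actual computations.

    import random

    rng = random.Random(2)
    N_PER_ARM = 205           # roughly the UGDP group sizes
    YEARS = 8                 # yearly interim looks (illustrative)
    ANNUAL_DEATH_PROB = 0.02  # common null mortality rate (assumed)
    N_SIM = 2000

    def null_trial():
        """Percent-dead difference (arm A minus arm B) at each yearly look, no true difference."""
        alive = [N_PER_ARM, N_PER_ARM]
        dead = [0, 0]
        diffs = []
        for _ in range(YEARS):
            for arm in (0, 1):
                deaths = sum(rng.random() < ANNUAL_DEATH_PROB for _ in range(alive[arm]))
                alive[arm] -= deaths
                dead[arm] += deaths
            diffs.append(100 * (dead[0] - dead[1]) / N_PER_ARM)
        return diffs

    sims = [null_trial() for _ in range(N_SIM)]
    for year in range(YEARS):
        column = sorted(s[year] for s in sims)
        lower, upper = column[int(0.025 * N_SIM)], column[int(0.975 * N_SIM) - 1]
        print("year %d: 95%% null band for mortality difference = (%.1f%%, %.1f%%)"
              % (year + 1, lower, upper))
    # An observed treatment-placebo difference falling outside these widening bands at an
    # interim look is flagged as extreme relative to chance variation under the null.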
Table 3. Criticisms of the UGDP and Comments Pertaining Thereto (34)

Criticism: The study was not designed to detect differences in mortality (16).
Comment: The main aim of the trial was to detect differences in nonfatal vascular complications of diabetes (1). However, this focus in no way precludes comparisons for mortality differences. In fact, it is not possible to interpret results for nonfatal events in the absence of data on fatal events.

Criticism: The observed mortality difference was small and not statistically significant (12,46).
Comment: It is unethical to continue a trial, especially one involving an elective treatment, to produce unequivocal evidence of harm.

Criticism: The baseline differences in the composition of the study groups are large enough to account for excess mortality in the tolbutamide treatment group (12,16,21,46).
Comment: The tolbutamide-placebo mortality difference remains after adjustment for important baseline characteristics (17).

Criticism: The tolbutamide-treated group had a higher concentration of baseline cardiovascular risk factors than any of the other treatment groups (12,16,21,46).
Comment: Differences in the distribution of baseline characteristics, including CV risk factors, are within the range of chance. Further, the mortality excess is as great for the subgroup of patients who were free of CV risk factors as for those who were not. Finally, simultaneous adjustment for major CV baseline risk factors did not eliminate the excess (2,17).

Criticism: The treatment groups included patients who did not meet study eligibility criteria (12,16).
Comment: Correct. However, the number of such cases was small and not differential by treatment group. Further, analyses in which ineligible patients were removed did not affect the tolbutamide-placebo mortality difference (1).

Criticism: Data from patients who received little or none of the assigned study medication should have been removed from analysis (21,46).
Comment: The initial analysis included all patients to avoid the introduction of selection biases. This analysis approach tends to underestimate the true effect. Analyses in which noncompliant patients were not counted enhanced, rather than diminished, the mortality difference (1).

Criticism: The data analysis should have been restricted to patients with good blood glucose control (46).
Comment: The analysis philosophy for this variable was the same as for drug compliance. The removal of patients using a variable influenced by treatment has a good chance of rendering the treatment groups noncomparable with regard to important baseline characteristics. In any case, analyses by level of blood glucose control did not account for the mortality difference (47).

Criticism: The study failed to collect relevant clinical data (12,21).
Comment: The criticism is unjustified. The study collected data on a number of variables needed for assessing the occurrence of various kinds of peripheral vascular events. It is always possible to identify some variable that should have been observed with the perspective of hindsight. The criticism lacks credibility, in general and especially in this case, because of the nature of the result observed. It is hard to envision other clinical observations that would offset mortality, an outcome difficult to reverse!

Criticism: There were changes in the ECG coding procedures midway in the course of the study (16,21).
Comment: Correct. However, the changes were made before investigators had noted any real difference in mortality and were, in any case, made without regard to observed treatment results (17).

Criticism: The patients did not receive enough medication for effective control of blood glucose levels (21).
Comment: A higher percentage of tolbutamide-treated patients had blood glucose values in the range indicative of good control than did the placebo-treated patients. The percentage of patients judged to have fair or good control, based on blood glucose determinations done over the course of the study, was 74 in the tolbutamide-treated group versus 59 in the placebo-treated group (47,48).

Criticism: The excess mortality can be accounted for by differences in the smoking behavior of the treatment groups (source unknown).
Comment: The argument is not plausible. While it is true that the study did not collect baseline smoking histories, there is no reason to believe the distribution of this characteristic would be so skewed as to account for the excess (17). The study did, in fact, make an effort to rectify this oversight around 1972 with the collection of retrospective smoking histories. There were no major differences among the treatment groups with regard to smoking. However, the results were never published because of obvious questions involved in constructing baseline smoking histories long after patients were enrolled and then with the use of surrogate respondents for deceased patients. The oversight is understandable in view of the time the trial was designed. Cigarette smoking, while recognized at that time as a risk factor for cancer, was not widely recognized as a risk factor for coronary heart disease.

Criticism: The observed mortality difference can be accounted for by differences in the composition of the treatment groups for unobserved baseline characteristics (12,16).
Comment: This criticism can be raised for any trial. However, it lacks validity since there is no reason to assume treatment groups in a randomized trial are any less comparable for unobserved characteristics than for observed characteristics. And even if differences do exist, they will not have any effect on observed treatment differences unless the variables in question are important predictors of outcome.

Criticism: The majority of deaths were concentrated in a few clinics (12,21).
Comment: Differences in the number of deaths by clinic are to be expected in any multicenter trial. However, they are irrelevant to comparisons by treatment groups in the UGDP, since the number of patients assigned to treatment groups was balanced by clinic (1,2).

Criticism: The study included patients who did not meet the "usual" criteria for diabetes (21).
Comment: There are a variety of criteria used for diagnosing diabetes, all of which are based, in part or totally, on the glucose tolerance test. The sum of the fasting, one-, two-, and three-hour glucose tolerance test values used in the UGDP represented an attempt to make efficient use of all the information provided by the test (1).

Criticism: The patients received a fixed dose of tolbutamide. The usual practice is to vary dosage, depending on need (12,16,21).
Comment: Most patients in the real world received the dosage used in the study (22).

Criticism: The randomization schedules were not followed (16).
Comment: The Biometrics Committee reviewed the randomization procedure and found no evidence of any breakdown in the assignment process (15).

Criticism: There were "numerous" coding errors made at the coordinating center in transcription of data into computer readable formats (12).
Comment: There is no evidence of any problem in this regard. The few errors noted in audits performed by the Biometrics Committee and FDA audit team were of no consequence in the findings of the trial (15,39).

Criticism: There were coding and classification discrepancies in the assembled data (49).
Comment: The coding and classification error rate was in fact low, and the errors that did occur were not differential by treatment group. There were no errors in the classification of patients by treatment assignment or by vital status. Hence, the argument does not provide a valid explanation of the mortality differences observed (15,39,50).

Criticism: The cause of death information was not accurate (12,16,21).
Comment: Independent review of individual death records by the FDA audit team revealed only three classification discrepancies, only one of which affected the tolbutamide-placebo comparison (39). However, in any case, the main analyses in the study and the conclusions drawn from them relate to overall mortality.

Criticism: The study does not prove tolbutamide is harmful (12,16,21).
Comment: Correct. It would be unethical to continue a trial to establish the toxicity of an elective treatment. Toxicity is not needed to terminate an elective treatment (1).
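The analysis philosophy defended in several of the comments above, namely counting all randomized patients rather than removing noncompliers or patients selected on variables influenced by treatment, can be illustrated with a small simulation. The sketch below uses invented sample sizes, event rates, and a compliance model; it is not UGDP data, and depending on who stops taking the assigned treatment the bias from a per-protocol analysis can run in either direction.

    import random

    rng = random.Random(1)
    n = 2000  # patients per arm (hypothetical)

    def simulate_arm(active):
        """Return (died, complied) records for one arm under a null treatment effect."""
        records = []
        for _ in range(n):
            high_risk = rng.random() < 0.30          # unmeasured prognostic factor
            p_death = 0.20 if high_risk else 0.05    # mortality depends on risk only, not treatment
            died = rng.random() < p_death
            # Assume high-risk patients on the active drug are more likely to stop taking it
            p_noncompliance = (0.40 if high_risk else 0.10) if active else 0.10
            complied = rng.random() >= p_noncompliance
            records.append((died, complied))
        return records

    active, placebo = simulate_arm(True), simulate_arm(False)

    def rate(records, compliers_only=False):
        kept = [died for died, complied in records if complied or not compliers_only]
        return sum(kept) / len(kept)

    print("ITT mortality:          active %.3f vs placebo %.3f" % (rate(active), rate(placebo)))
    print("Per-protocol mortality: active %.3f vs placebo %.3f"
          % (rate(active, True), rate(placebo, True)))
    # The intention-to-treat rates are similar (there is no true effect); the per-protocol
    # comparison favors the active arm spuriously, because dropping noncompliers removes a
    # disproportionate share of high-risk patients from that arm only.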
Table 4. Tolbutamide-Placebo Mortality Results (as of 7 Oct 1969) (2)

                              Number            %
                              Tolb     Plbo     Tolb     Plbo
Enrolled                      204      205
Cardiovascular deaths
  1 MI                        10       0        4.9      —
  2 Sudden death              4        4        2.0      2.0
  3 Other heart disease       5        1        2.5      0.5
  4 Extracardiac vascular     7        5        3.4      2.4
  All CV causes               26       10       12.7     4.9
Noncardiovascular causes
  5 Cancer                    2        7        1.0      3.4
  6 Other than 1–5            2        3        1.0      1.5
  7 Unknown cause             0        1        —        0.5
  All non-CV causes           4        11       2.0      5.4
All causes                    30       21       14.7     10.2
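The cardiovascular rows of Table 4 underlie both the roughly two-and-one-half-fold excess quoted later from the revised drug label and the p-value of about 0.005 cited in the text. The sketch below simply redoes that arithmetic with a standard large-sample two-proportion test; it is offered as an illustration and is not necessarily the exact procedure used by the UGDP investigators.

    from math import sqrt, erfc

    # Cardiovascular deaths / enrolled, from Table 4
    d_tolb, n_tolb = 26, 204
    d_plbo, n_plbo = 10, 205

    p_tolb, p_plbo = d_tolb / n_tolb, d_plbo / n_plbo
    print("CV mortality: tolbutamide %.1f%%, placebo %.1f%%, ratio %.1f"
          % (100 * p_tolb, 100 * p_plbo, p_tolb / p_plbo))     # 12.7% vs 4.9%, ratio about 2.6

    # Pooled two-proportion z-test, two-sided p-value from the normal tail
    p_pool = (d_tolb + d_plbo) / (n_tolb + n_plbo)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_tolb + 1 / n_plbo))
    z = (p_tolb - p_plbo) / se
    p_value = erfc(abs(z) / sqrt(2))
    print("z = %.2f, two-sided p = %.3f" % (z, p_value))       # roughly 2.8 and 0.005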
Figure 2. Tolbutamide-placebo cumulative observed mortality results (2). [Panels: (a) all causes; (b) cardiovascular causes; x-axis: years of follow-up, 0–8; y-axis: cumulative mortality rate; curves: TOLB, IVAR, ISTD, PLBO.]

Table 5. Phenformin-Placebo Mortality Results (as of 6 Jan 1971) (14)

                              Number                                       %
                              Phen   Plbo   IStd   IVar   Plbo+IStd+IVar   Phen   Plbo+IStd+IVar
Enrolled                      204    64     68     65     197
Cardiovascular deaths
  1 MI                        5      1      1      0      2                2.5    1.0
  2 Sudden death              6      1      2      1      4                2.9    2.0
  3 Other heart disease       8      0      1      0      1                3.9    0.5
  4 Extracardiac vascular     7      0      2      2      4                3.4    2.0
  All CV causes               26     2      6      3      11               12.7   5.6
Non-cardiovascular causes
  5 Lactic acidosis           1      0      0      0      0                0.5    0.0
  6 Cancer                    2      3      0      0      3                1.0    1.5
  7 Causes other than 1–6     2      1      0      0      1                1.0    0.5
  8 Unknown cause             0      0      0      1      1                0.0    0.5
  All non-CV causes           5      4      0      1      5                2.5    2.5
All causes                    31     6      6      4      16               15.2   8.1

or diet plus insulin. This warning is based on the study conducted by the University Group Diabetes Program (UGDP), a long-term prospective clinical trial designed to evaluate the effectiveness of glucose-lowering drugs in preventing or delaying vascular complications in patients with non-insulin-dependent diabetes. The study involved 823 patients who were randomly assigned to one of four treatment groups (2). UGDP reported that patients treated for 5 to 8 years with diet plus a fixed dose of tolbutamide (1.5 grams per day) had a rate of cardiovascular mortality approximately 2 1/2 times that of
patients treated with diet alone. A significant increase in total mortality was not observed, but the use of tolbutamide was discontinued based on the increase in cardiovascular mortality, thus limiting the opportunity for the study to show an increase in overall mortality. Despite controversy regarding the interpretation of these results, the findings of the UGDP study provide an adequate basis for this warning. The patient should be informed of the potential risks and advantages of tolbutamide and of alternative modes of therapy. Although only one drug in the sulfonylurea class (tolbutamide) was included in this study, it is prudent
Figure 3. Phenformin-placebo cumulative observed mortality results (14). [Panels: (a) all causes; (b) cardiovascular causes; x-axis: years of follow-up, 0–8; y-axis: cumulative death rate; curves: PHEN, PLBO, INS, P+IN.]
from a safety standpoint to consider that this warning may also apply to other oral hypoglycemic drugs in this class, in view of their close similarities in mode of action and chemical structure (52).
3.2 Phenformin Results

Conclusion as taken from Reference 14:

This study provided no evidence that phenformin was more efficacious than diet alone or than diet and insulin in prolonging life for the patients studied. In fact, the observed mortality from all causes and from cardiovascular causes for patients in the phenformin treatment group was higher than that observed in any of the other treatment groups. In addition, there was no evidence that phenformin was more effective than any of the other treatments in preventing the occurrence of nonfatal vascular complications associated with diabetes. For these reasons, the use of phenformin has been terminated in the UGDP.
The mortality results that led to the decision are presented in Table 5 and Figure 3 as contained in the above-referenced publication.

3.3 Insulin Results

Conclusion as taken from Reference 37:

Mortality rates among the treatment groups were comparable. The differences in the occurrence of nonfatal vascular complications among the patients in these three treatment groups were small, and only one of the drug-placebo differences was considered significant by the study criterion, and that was the insulin-standard versus placebo comparison for the occurrence of elevated serum creatinine levels (8.3% versus 18.5%, p value = 0.005). The occurrence of serious microvascular complications was surprisingly low. The latter finding, as well as the slow progression of microvascular complications, underscores the differences in the course and the nature of the two principal types of diabetes mellitus: the rather stable and non-ketosis-prone maturity-onset type (type 2) and the relatively unstable insulin-dependent juvenile-onset type (type 1).

The insulin results are presented in Tables 6 and 7 and Figure 4 as contained in the above-referenced publication.

4 CONCLUSION AND DISCUSSION
Neither of the two oral agents tested showed evidence of benefit in prolonging life or in reducing the risk of morbidity associated with adult-onset diabetes. Indeed, phenformin disappeared from use on the U.S. market before the trial was finished when it was forcibly removed from the market by action of the Secretary of Health, Education, and Welfare in 1977 because of deaths from lactic acidosis linked to the drug (33). Use of Orinase declined after publication of the results in 1970, but the decline was short-lived and soon offset by increased use of other oral agents (34). Marketing of Orinase (in pill form) was discontinued in 1999. An obvious shortcoming of trials is that testing is performed in ways that depart
Table 6. Insulin-Placebo Mortality Results (as of 31 Dec 1974) (37)

                              Number                  %
                              IStd    IVar    Plbo    IStd    IVar    Plbo
Enrolled                      210     204     205
Cardiovascular deaths
  1 MI                        6       4       1       2.9     2.0     0.5
  2 Sudden death              8       11      11      3.8     5.4     5.4
  3 Other heart disease       4       5       4       1.9     2.5     2.0
  4 Extracardiac vascular     9       9       13      4.3     4.4     6.3
  All CV causes               27      29      29      12.9    14.2    14.1
Non-cardiovascular causes
  5 Cancer                    10      7       16      4.8     3.4     7.8
  6 Causes other than 1–5     9       11      8       4.3     5.4     3.9
  7 Unknown cause             2       2       1       1.0     1.0     0.5
  All non-CV causes           21      20      25      10.0    9.8     12.2
All causes                    48      49      54      22.9    24.0    26.3
Table 7. Insulin-Placebo Morbidity Results (as of 31 Dec 1974) (37)

                                           Number at Risk            %
                                           IStd    IVar    Plbo      IStd    IVar    Plbo
Enrolled                                   210     204     205
ECG abnormality                            192     188     190       16.7    17.6    20.0
Use of digitalis                           190     184     190       12.6    12.5    12.1
Hospitalized for heart disease             190     187     194       6.8     7.0     11.9
Hypertension                               139     142     128       54.7    55.6    50.0
Angina pectoris                            187     187     189       15.5    16.6    19.6
Visual acuity ≤ 20/200 (either eye)        179     175     179       11.7    11.4    11.2
Opacity§                                   179     173     173       10.6    11.6    9.2
Fundus abnormalities                       117     118     127       45.3    43.2    43.3
Urine protein ≥ 1.5 mg/dl                  195     190     189       2.1     5.8     4.2
Serum creatinine ≥ 1.5 mg/dl               193     186     184       8.3     9.1     16.3
Amputation (all or part; either limb)      198     190     194       0.5     1.6     1.5
Arterial calcification                     163     155     169       28.8    28.4    29.6
Intermittent claudication                  191     181     182       19.4    16.0    17.6

§ Vitreous, lenticular, or corneal; either eye.
from everyday usage. Hence, an issue always in trials is whether results obtained under such "idealized" circumstances generalize to everyday settings. In regard to blood sugar control, the usual everyday practice is to change the dosage prescription, albeit generally within fairly narrow limits, to achieve the desired level of blood sugar control. That option did not exist in the UGDP because investigators opted for a "fixed dose design" in regard to administration of the oral agents tested. The decision was prompted by the desire to evaluate the treatments in a double-masked/blind, placebo-controlled setting. Hence, the issue of whether the results are
generalizable to settings that involve individualized dosage schedules is, of necessity, a matter of judgment. But an even bigger shortcoming is the reality that only a few drugs can be tested in any given trial. Phenformin and tolbutamide are members of the biguanide and sulfonylurea classes of drugs, respectively. Technically, a trial like the UGDP reveals nothing about the safety and efficacy of other members of the classes. But scientifically, it is reasonable to expect that side effects associated with one member of a class will likely be present in other members of the
Figure 4. Insulin-placebo cumulative observed mortality results (26). [Panels: (a) all cause mortality; (b) cardiovascular mortality; x-axis: years of follow-up, 1–13; y-axis: cumulative event rate in percent; curves: IVAR, PLBO, ISTD.]
class. That some judgment is needed is obvious from the reality that new members of a class appear faster than they can be tested. For example, around the time the tolbutamide results were published, marketing shifted from Orinase (tolbutamide) to Tolinase (tolazamide; approved by the FDA July 1966), both of which are members of the sulfonylurea class of drugs. No doubt, the shift was prompted, in part, by the expiration of patent protection for Orinase. The FDA requires the special warning generated for tolbutamide (see the section on "Tolbutamide Results" above) for all members of the sulfonylurea class of drugs approved for blood sugar control in adult-onset diabetes. The chair (Rachmiel Levine, NY Medical College) of a symposium "Tolbutamide . . . after ten years" (53), held in Augusta, Michigan, opened the proceedings by reminding the audience that the story on tolbutamide went back to the 1940s with the work of Auguste Loubatières on the effects of sulfa derivatives on blood sugar and then noted that "We now have more people on sulfonylureas. Many more animals have been subjected to all kinds of experiments. But essentially, we are still with the concepts first brought forward by Professor Loubatières." To a large degree, the same might still be said today. One problem is the way antidiabetic drugs are approved by the FDA. For approval, an antihyperglycemic drug has to be shown to be safe and effective in lowering blood sugar. But blood sugar control in non-insulin-dependent adult-onset diabetes is merely a means to
an end. The supposition is that blood sugar control confers benefits in reducing the risk of death and morbidities associated with diabetes—an intuitively appealing supposition even if largely untested. But drugs, even if shown effective in controlling blood sugar, have other effects, a fact brought to the fore again just recently by evidence that rosiglitazone maleate (Avandia, GlaxoSmithKline, Philadelphia) for blood sugar control carries risks of myocardial infarction and death from CV causes (54). Various diabetes trials have been conducted since the UGDP, notably the Diabetes Control and Complications Trial (DCCT), which was conducted 1983–1993 (55), and the UK Prospective Diabetes Study (UKPDS) (56). In many senses, the DCCT grew out of the controversy caused by the UGDP and the equivocal insulin findings in the UGDP, although it concentrated on type-1 diabetes. It was instrumental in demonstrating that tight control of blood sugar levels in people with type-1 diabetes was useful in reducing the morbidity associated with diabetes. One lesson learned from the UGDP was the danger involved in presenting results before publication. Indeed, those lessons are part of the reason why various groups since have adopted a "publish first, present later" philosophy in relation to primary results from trials. The intent of the investigators at the time of the decision to stop use of tolbutamide was to orchestrate publication so as to coincide with presentation, but the orchestration failed (as is often the case because editors have their own time schedules). As
a result, diabetologists were obliged to deal with queries from patients prompted by press coverage before and after the presentation without benefit of a publication. By the time the publication appeared—about 5 months after the presentation—the mood in the diabetic community toward the study was decidedly hostile. The UGDP was performed before the establishment of committees created specifically for monitoring. Monitoring in the UGDP was done by the SC. This reality raised concerns on the part of some that allowing investigators to see accumulating data had the potential of introducing treatment-related biases in the data-collection processes. The concern ultimately generated monitoring structures entrusted largely to people independent of study clinics in trials. The trial also predated most work on statistical methods for monitoring. The primary methods of dealing with multiple looks were via use of relative betting odds based on the likelihood principle (57) and Monte Carlo-generated 95% monitoring bounds as shown in Fig. 1. The prevalence of diabetes in the U.S. population has increased steadily from the 1950s onward. The percent of people with diabetes was around 1% when the UGDP started and was estimated to be around 8% in 1993 (58). The number of people in the United States who live with diabetes was estimated to be about 21 million in 2005 (see: http://diabetes.niddk.nih.gov/dm/pubs/statistics/10). Not surprisingly, there has been a veritable explosion in the prescriptions of oral antidiabetic agents. In 1980, prescriptions in the United States were around 13 million (34). In 1990, there were 23.4 million such prescriptions and 91.8 million in 2001 (59). Glipizide and glyburide, which are sulfonylurea compounds, accounted for 77% of all prescriptions in 1990 and 33.5% in 2001. The issue of safety of the oral antidiabetic agents remains an open question because most diabetes trials are relatively small and short term in nature. Clinical trials, at best, are weak instruments for detecting rare adverse effects, and hence the shorter the period of follow-up, the less the likelihood of detection. Most meta-analyses of trials that involve
oral antidiabetic (54,60,61) agents are short term. For example, of the 29 reports of randomized placebo controlled monotherapy trials captured by Inzucchi, (61) only 1 had a study length of more than 1 year. Likewise, for the 42 trials included in the meta-analysis of Nissen and Wolski (54), only 5 were of more than 1 year duration of follow-up. To be sure, trials are not performed to establish harm, but the absence of evidence of harm, especially in small, short-term trials, should not be seen as evidence that drugs are safe. 5 ACKNOWLEDGMENTS Thanks to Betty Collison for her help in digging through old files for production of this manuscript and to Jill Meinert for her help with the graphics in the manuscript. REFERENCES 1. University Group Diabetes Program Research Group, A study of the effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes: I. Design, methods, and baseline characteristics. Diabetes 1970; 19(suppl 2): 747–783. 2. University Group Diabetes Program Research Group, A study of the effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes: II. Mortality results. Diabetes 1970; 19(suppl 2): 785–830. 3. M. Mintz. Discovery of diabetes drug’s perils stirs doubts over short-term tests. The Washington Post, June 8, 1970. 4. R.R. Ledger. Safety of Upjohn’s oral antidiabetic drug doubted in study: Firm disputes findings. Wall Street Journal, May 21, 1970. 5. M. Mintz. Antidiabetes pill held causing early death. The Washington Post, May 22, 1970. 6. H.M. Schmeck. Jr: Scientists wary of diabetic pill: FDA study indicates oral drug may be ineffective. New York Times, May 22, 1970. 7. University Group Diabetes Program Research Group, The effects of hypoglycemic agents in vascular complications in patients with adult-onset diabetes: 1. Design and methods (abstract). Diabetes 1970; 19(suppl 1): 387.
UGDP TRIAL 8. University Group Diabetes Program Research Group, The effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes: 2. Findings at baseline (abstract). Diabetes 1970; 19(suppl 1): 374. 9. University Group Diabetes Program Research Group, The effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes: 3. Course and mortality (abstract). Diabetes 1970; 19(suppl 1): 375. 10. Food and Drug Administration, Oral hypoglycemic agents: diabetes prescribing information. FDA Curr. Drug Inform., Oct, 1970. 11. Personal Communication. Robert F. Bradley, Joslin Diabeties Center, Boston, MA. First Chairman of the CCD. 12. A.R. Feinstein. Clinical biostatistics: VII. An analytic appraisal of the University Group Diabetes Program (UGDP) study. Clin. Pharmacol. Ther. 1971; 12: 167–191. 13. Food and Drug Administration, Oral hypoglycemic agents. FDA Drug Bull., June 23, 1971. 14. University Group Diabetes Program Research Group, Effects of hypoglycemic agents on vascular complications in patients with adultonset diabetes: IV. A preliminary report on phenformin results. JAMA 1971; 217: 777–784. 15. Committee for the Assessment of Biometric Aspects of Controlled Trials of Hypoglycemic Agents, Report of the Committee for the Assessment of Biometric Aspects of Controlled Trials of Hypoglycemic Agents. JAMA 1975; 231: 583–608. 16. S. Schor. The University Group Diabetes Program: A statistician looks at the mortality results. JAMA 1971; 217: 1,671–1,675. 17. J. Cornfield. The University Group Diabetes Program: A further statistical analysis of the mortality findings. JAMA 1971; 217: 1,676–1,687. 18. Food and Drug Administration, Drug labeling: failure to reveal material facts; labeling of oral hypoglycemic drugs. Fed. Reg. July 7, 1975; 40: 28,582–28,595. 19. Food and Drug Administration, Oral hypoglycemic drug labeling. FDA Drug Bull., May, 1972. 20. United States Court of Appeals for the First Circuit, Robert F Bradley, et al, versus Caspar W. Weinberger, Secretary of Health, Education, and Welfare et al. No 73-1014, July 31, 1973.
15
21. H.S. Seltzer. A summary of criticisms of the findings and conclusions of the University Group Diabetes Program (UGDP). Diabetes 1972; 21: 976–979. 22. University Group Diabetes Program Research Group, The UGDP controversy: clinical trials versus clinical impressions. Diabetes 1972; 21: 1,035–1,040. 23. United States Senate Select Committee on Small Business, Oral Hypoglycemic Drugs: Hearing before the Subcommittee on Monopoly, Sept 18–20, 1974. Part 25, US Government Printing Office, Washington, DC, 1974. 24. United States Senate Select Committee on Small Business, Oral Hypoglycemic Drugs: Hearing before the Subcommittee on Monopoly, Jan 31, July 9–10, 1975. Part 28, US Government Printing Office, Washington, 1975. 25. University Group Diabetes Program Research Group, A study of the effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes: V. Evaluation of Phenformin therapy. Diabetes 1975; 24(suppl 1): 65–184. 26. University Group Diabetes Program Research Group, Effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes: VIII. Evaluation of insulin therapy: Final report. Diabetes, 1982; 31(suppl 5): 1–81. 27. United States District Court for the District of Columbia, Forsham et al, versus Matthews, et al. No 75-1608, Sept 30, 1975. 28. United States District Court for the Southern District of New York, Ciba-Geigy Corporation versus David Matthews, et al. No 75 Civ 5049, Oct 14, 1975. 29. United States Supreme Court, Peter H. Forsham, et al, versus Joseph A. Califano, et al. Petition for a writ of certiorari to the United States Court of Appeals for the District of Columbia Circuit. Oct term, 1978. 30. United States District Court for the District of Columbia, Peter H. Forsham et al, versus David Matthews, et al. No 75-1608, Feb 5, 1976. 31. United States District Court for the Southern District of New York, Ciba-Geigy Corporation versus David Matthews, et al. No 75 Civ 5049 (CHT), March 8, 1977. 32. Food and Drug Administration, Phenformin: new labeling and possible removal from the market. FDA Drug Bull. 1977; 7: 6–7.
16
UGDP TRIAL
UPDATE IN HYPERLIPIDEMIA CLINICAL TRIALS
KUO-LIONG CHIEN
National Taiwan University
Institute of Preventive Medicine
Taipei, Taiwan

1 INTRODUCTION

Clinical trials on hyperlipidemia provide the most convincing evidence for the benefits of lipid lowering in reducing cardiovascular morbidity and mortality in primary and secondary prevention. Large-scale clinical trial results demonstrate significant protective effects of statins on coronary heart disease events. Building on these results, more aggressive study designs addressing additional outcomes and subgroups, such as stroke prevention and high-risk populations, were initiated to resolve remaining problems in cardiovascular disease prevention and treatment. The effectiveness of lipid-lowering therapy in reducing cardiovascular morbidity and mortality has been proven, including in patients with mild to moderate hypercholesterolemia.

2 PRIMARY PREVENTION CLINICAL TRIALS

The West of Scotland Coronary Prevention Study (WOSCOPS) randomized a total of 6595 patients with an average cholesterol of 272 mg/dL into the two arms of a double-blind, placebo-controlled trial (1). After 5 years of treatment, the pravastatin group had 29% fewer nonfatal myocardial infarctions, a 32% lower cardiovascular mortality rate, and a 22% lower total mortality rate than the control group. The Air Force/Texas Coronary Atherosclerosis Prevention Study (AFCAPS/TexCAPS) was a randomized trial of 6605 participants with average LDL cholesterol levels and below-average HDL levels who received lovastatin or placebo (2). After 5.2 years of follow-up, the lovastatin group had a 37% reduction in major coronary events. More recently, the Anglo-Scandinavian Cardiac Outcomes Trial-Lipid Lowering Arm (ASCOT-LLA) (10,305 patients) was stopped early because the atorvastatin group had a significant 36% reduction in nonfatal and fatal coronary heart disease events compared with the placebo group after 3.3 years of follow-up (3).

3 SECONDARY PREVENTION CLINICAL TRIALS

The Scandinavian Simvastatin Survival Study (4S) randomized 4444 patients with a history of coronary heart disease and elevated total cholesterol (≥213 mg/dL) into simvastatin and placebo groups (4). At 5.4 years of follow-up, simvastatin significantly reduced coronary heart disease mortality by 42% and total mortality by 30%, with fewer revascularization procedures and stroke events. The Cholesterol and Recurrent Events (CARE) trial included 4159 patients with a history of myocardial infarction and an average LDL of 139 mg/dL (5). A significant 24% reduction in coronary death or nonfatal myocardial infarction was shown with pravastatin after 5 years of treatment. This study also showed that when the baseline LDL level was less than 125 mg/dL, only diabetic patients had a significant benefit. The Long-Term Intervention with Pravastatin in Ischaemic Disease (LIPID) trial randomized 9014 patients with a history of coronary heart disease and normal cholesterol levels (6). Pravastatin reduced the risk of coronary heart disease mortality by 24% and overall mortality by 22% compared with placebo. Secondary endpoint analyses showed significant reductions in stroke and revascularization procedures. Among 1351 patients undergoing coronary artery bypass grafting (CABG), the Post-CABG trial showed that aggressive lovastatin therapy targeting an LDL of less than 85 mg/dL produced less graft atherosclerotic progression and 29% fewer revascularization procedures (7). Aggressive lowering of LDL levels is mandatory for high-risk groups such as post-CABG patients. Comparing percutaneous coronary intervention with aggressive LDL lowering with atorvastatin, the Atorvastatin versus Revascularization Treatments (AVERT) trial (341 patients) showed that the atorvastatin group had a nonsignificant 36% reduction in coronary events after 18 months of follow-up (8). In the Lescol Intervention Prevention Study (LIPS), 1677 patients were randomized to fluvastatin or placebo after a successful first percutaneous coronary intervention (9). After 3.9 years of follow-up, major adverse cardiac events were fewer in the fluvastatin group, a significant 22% risk reduction versus placebo.

4 LESSONS LEARNED FROM RECENT STATIN CLINICAL TRIALS

The above-mentioned large-scale randomized clinical trials of LDL lowering proved that statins can reduce cardiovascular risk effectively and safely. Three consistent lessons from these trials merit particular attention.

1. The higher the risk, the greater the benefit. In terms of absolute risk reduction, benefit is greatest in patients at highest risk (i.e., established coronary heart disease); a brief numerical sketch of this point follows the list. The absolute reduction in 5-year mortality was 3.3% in the high-risk 4S population, but no mortality reduction occurred in the low-risk AFCAPS/TexCAPS population. Targeting the high-risk populations who need treatment most is strongly recommended by recent guidelines.

2. Little or no clinical benefit exists if LDL is not reduced, and benefit appears to increase as LDL is lowered further. Whether large reductions of LDL, achievable with high doses of the most potent statins, are superior to more modest reductions of LDL, achievable with smaller doses or less potent statins, is still being tested formally in large trials.

3. Clinical benefit is largely independent of baseline LDL levels. Reduction in risk occurs even when treating patients whose concentrations of cholesterol and LDL-C are already close to current goals of therapy. Some argue that cholesterol measurements are unnecessary when a statin is used.
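The contrast between relative and absolute risk reduction in the first lesson can be made concrete with a small calculation. The sketch below uses illustrative event rates loosely patterned on the figures quoted above (a roughly 30% relative reduction and a 3.3% absolute mortality reduction in a high-risk population); the numbers are assumptions for demonstration, not data taken from any specific trial.

```python
def risk_summary(risk_control: float, risk_treated: float) -> dict:
    """Summarize a treatment effect as relative and absolute risk reduction.

    Both arguments are event proportions (0-1) over the same follow-up
    period in the control and treated groups.
    """
    arr = risk_control - risk_treated               # absolute risk reduction
    rrr = arr / risk_control                        # relative risk reduction
    nnt = 1.0 / arr if arr > 0 else float("inf")    # number needed to treat
    return {"ARR": round(arr, 3), "RRR": round(rrr, 2), "NNT": round(nnt)}

# Hypothetical high-risk (secondary prevention) population:
# 5-year mortality of 11.5% on placebo vs. 8.2% on treatment.
print(risk_summary(0.115, 0.082))              # ARR ~0.033, NNT ~30

# Hypothetical low-risk (primary prevention) population with the same
# relative effect: the absolute benefit shrinks and the NNT grows.
print(risk_summary(0.030, 0.030 * (1 - 0.29)))  # ARR ~0.009, NNT ~115
```

With the same relative risk reduction, a baseline risk that is several times lower yields a proportionally smaller absolute benefit and a much larger number needed to treat, which is the arithmetic behind targeting high-risk patients.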
The failure of current therapy to virtually eliminate coronary heart disease events indicates that additional therapies must be introduced to reduce events further. Several new approaches exist for reducing residual risk. First, low HDL and hypertriglyceridemia are frequently found among high-risk patients with metabolic syndrome and diabetes. Second, specific high-risk populations, such as patients with diabetes mellitus or obesity, are being considered as target populations. Third, trials of early statin initiation to reduce coronary risk are under way. Fourth, new drugs for more aggressive LDL lowering are in development. Finally, new measurement modalities have been developed to evaluate cardiovascular disease outcomes in new ways.
5 CURRENT ADVANCES IN HYPERLIPIDEMIA TRIALS

5.1 Target Low HDL and Hypertriglyceridemia

The classic target in a hyperlipidemia trial is lowering the LDL level. Although statins can lower cardiovascular events, further reduction in coronary events may depend on other therapeutic targets such as HDL cholesterol and triglyceride. Metabolic syndrome, which includes low HDL and hypertriglyceridemia, is identified by the NCEP guideline as a secondary target of prevention. A low HDL level is clearly an atherosclerotic risk factor, based on evidence from subgroup analyses of lipid trials (10). A low concentration of HDL is predictive for coronary heart disease at all levels of LDL, in people with or without diabetes and in women as well as in men. Statins also raise the level of HDL-C, although they tend to be less effective than fibrates and niacin. Fibrates (fibric acid derivatives, for example gemfibrozil and fenofibrate) have cardioprotective effects in people with metabolic syndrome. Previous fibrate clinical trials such as the Helsinki Heart Study (gemfibrozil in 4081 asymptomatic hypercholesterolemic men) (11), the Veterans Administration HDL Intervention Trial (VA-HIT; gemfibrozil in 2531 patients for secondary prevention) (12),
the World Health Organization Cooperative Study (clofibrate in 10,627 hypercholesterolemic men) (13), and the Bezafibrate Infarction Prevention (BIP) study (3090 patients with a history of coronary heart disease) (14) did not show significant reductions in total mortality or other primary endpoints. These fibrate trial results are unsatisfactory for coronary artery disease prevention, although small-scale secondary prevention trials, such as the Bezafibrate Coronary Atherosclerosis Intervention Trial (BECAIT) (15), showed that fibrates can improve lipid profiles. Therefore, more recent trials focus on high-risk populations, especially patients with diabetes mellitus. Published subgroup results from VA-HIT demonstrated that diabetics experienced a 41% reduction in coronary death and a 40% reduction in stroke with fibrate therapy. In VA-HIT (12, 16), patients with hyperinsulinemia or diabetes had an approximately 40% reduction in CHD and stroke death. In the secondary prevention trials, subgroup analyses of diabetic patients, such as in 4S, CARE, LIPID, and HPS, showed improved survival.

5.2 High-Risk Population as Target Intervention

Statin treatment is considered useful across all age ranges and in both sexes. Important benefits have been observed across a wide age range and among patients with total cholesterol concentrations much lower than average. Specific populations, such as patients with diabetes, the elderly, and groups with other characteristics, were first analyzed in subgroup analyses. Although subgroup analyses provided evidence of statin benefits in diabetes, trials intervening directly in diabetic patients provided strong evidence for protective effects.

5.2.1 Diabetes Mellitus. Coronary risk in diabetes is two-fold to four-fold that in those without diabetes. Patients with diabetes suffer from lipoprotein abnormalities, such as increased triglycerides, reduced HDL, high-normal LDL, smaller LDL particles, and postprandial hyperlipidemia, and their vessels are highly susceptible to the effects of oxidized LDL. NCEP/ATP-III guidelines suggest that the first target is to reduce LDL to below
100 mg/dL (2.6 mmol/L) in diabetes (17). The second target is to correct the HDL and triglyceride abnormalities, for which fibrates may be more favorable. Both statins and fibrates have undergone clinical trials in diabetes. The Heart Protection Study (HPS) randomized 20,536 patients with diabetes or cardiovascular disease into simvastatin or placebo groups (18). Risk reduction was approximately 25% for total cardiovascular disease events and 15–17% for coronary deaths (19). The HPS focused on high-risk patients, such as those with diabetes, and the significant differences in major vascular events were clear after 2 years of treatment. Also, even with LDL less than 125 mg/dL, statin treatment reduced coronary disease events by 13% among diabetic patients. The Diabetes Atherosclerosis Intervention Study (DAIS) was the first lipid intervention trial designed to determine whether correcting the lipoprotein abnormality typically seen in Type 2 diabetes will alter the course of angiographically determined coronary atherosclerosis (20). It included 418 Type 2 diabetic subjects between 40 and 65 years of age. The results showed that fenofibrate treatment was associated with 40% less progression of average minimum lumen diameter (P = 0.029) and of percent stenosis (P = 0.02), which were the primary endpoint outcomes. Secondary endpoints, such as HDL, ApoA1, and LDL particle size, increased. Keech headed the Fenofibrate Intervention and Event Lowering in Diabetes (FIELD) study, a total of 9795 diabetic patients in Australia, New Zealand, and Finland. FIELD is a large-scale, randomized controlled trial to determine whether increasing HDL and lowering triglyceride reduces coronary heart disease events and mortality. The study design called for 9795 patients with Type 2 diabetes (50–75 years of age) with normal to high cholesterol levels to be recruited and randomized to either daily fenofibrate 200 mg or placebo for more than 5 years (http://www.ctc.usyd.edu.au/trials/cardiovascular/field.htm). The study will answer whether treatment with fenofibrate, a potent modifier of blood lipid levels, will reduce the risk of fatal coronary heart disease in people with Type 2 diabetes.
5.2.2 Elderly Population. The PROspective Study of Pravastatin in the Elderly at Risk (PROSPER) trial was the first trial to examine the effects of statin therapy in an exclusively elderly population (21). A total of 5804 patients aged 70 to 82 years were randomized to pravastatin or placebo, and after 3.2 years of follow-up, pravastatin 40 mg produced a 15% reduction in the primary endpoint of fatal and nonfatal coronary and stroke events (P = 0.014) compared with placebo.

5.2.3 Other High-Risk Groups: Hypertension, Renal Transplant, Obesity, and Others. Hypertensive patients with hypercholesterolemia are prevalent and at high risk for cardiovascular events. The Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial (ALLHAT-LLT) (22) and ASCOT-LLA (3) were designed to determine whether statins can improve total mortality among hypertensive patients. Unfortunately, in ALLHAT-LLT, pravastatin did not significantly reduce cardiovascular events: pravastatin 40 mg produced nonsignificant reductions in coronary and stroke events compared with usual care. In contrast, the ASCOT-LLA trial, which was stopped early, found that atorvastatin 10 mg significantly reduced coronary and stroke events by 36% and 27%, respectively, compared with placebo. Several factors may explain this discrepancy. In ALLHAT-LLT, 28.5% of patients in the usual care group were receiving lipid-lowering drugs and 26.1% took a statin. As a result, the differences in total cholesterol (9.6%) and LDL-C (16.7%) between the pravastatin and usual care groups in ALLHAT were less than half of those in ASCOT-LLA (24% and 35%, respectively). The discrepancy between the results of ALLHAT-LLT and ASCOT-LLA implied methodological issues, namely inadequate power of ALLHAT-LLT because of a lower than projected sample size and poor compliance. The open design of ALLHAT-LLT rendered these discrepancies more apparent (23). Other explanations for the apparently disappointing clinical benefit in ALLHAT-LLT may be attributable to the dose-response effect on cardiovascular events associated with the achieved LDL-C reduction.
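To illustrate the methodological point about drop-in and diluted between-group differences, the sketch below runs a standard two-proportion power approximation on made-up numbers. The event rate, relative reduction, drop-in fraction, and sample size are hypothetical placeholders, not ALLHAT-LLT or ASCOT-LLA data; the point is only that treatment crossover in the comparison arm shrinks the observable difference and, with it, the power.

```python
from statistics import NormalDist

def two_prop_power(p_ctrl: float, p_trt: float, n_per_arm: int,
                   alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test."""
    z = NormalDist()
    p_bar = (p_ctrl + p_trt) / 2
    se0 = (2 * p_bar * (1 - p_bar) / n_per_arm) ** 0.5            # SE under H0
    se1 = (p_ctrl * (1 - p_ctrl) / n_per_arm
           + p_trt * (1 - p_trt) / n_per_arm) ** 0.5              # SE under H1
    z_alpha = z.inv_cdf(1 - alpha / 2)
    delta = abs(p_ctrl - p_trt)
    return 1 - z.cdf((z_alpha * se0 - delta) / se1)

# Hypothetical design: a 12% event rate with usual care, a true 20%
# relative reduction on active treatment, and 5000 patients per arm.
p_ctrl, rrr, n = 0.12, 0.20, 5000
print("power, no drop-in :", round(two_prop_power(p_ctrl, p_ctrl * (1 - rrr), n), 2))

# If 25% of the usual-care arm "drops in" to active lipid-lowering
# therapy, the observed control-arm rate moves toward the treated rate
# and the detectable difference (and the power) shrinks.
drop_in = 0.25
p_ctrl_diluted = (1 - drop_in) * p_ctrl + drop_in * p_ctrl * (1 - rrr)
print("power, 25% drop-in:", round(two_prop_power(p_ctrl_diluted, p_ctrl * (1 - rrr), n), 2))
```

Under these assumed inputs, the approximate power falls from roughly 0.97 to roughly 0.84 once a quarter of the comparison arm receives active therapy, which is the kind of dilution invoked above to explain an apparently null open-label result.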
The Assessment of LEscol in Renal Transplantation (ALERT) Study randomized 2102 renal transplant patients to fluvastatin, 40–80 mg daily, or placebo (24). After 5.1 years of follow-up, fluvastatin produced a nonsignificant reduction in the risk of major adverse cardiac events, although fewer cardiac deaths and nonfatal MIs occurred in the fluvastatin group than in the placebo group. Obesity has become an epidemic around the world, and body weight control has become an important issue for chronic disease prevention. Obesity is also a crucial component of metabolic syndrome, a cluster of dyslipidemia, hypertension, and insulin resistance. For dyslipidemic control, the Helsinki Heart Study and VA-HIT results indicate that overweight patients with a body mass index greater than 26 formed another group that received a large risk reduction with fibrate treatment (17, 25). Inflammation contributes to the pathogenesis of coronary heart disease, and elevated serum levels of C-reactive protein (CRP) are independently associated with increased coronary risk. Statins were reported to decrease CRP levels without affecting HDL levels (26). A large-scale clinical trial, the JUPITER study, is ongoing to test whether patients with low LDL and elevated CRP can benefit from rosuvastatin (27). In summary, markers of high atherosclerotic risk, such as a history of coronary heart disease, diabetes, high triglycerides, low HDL, obesity, and inflammation, are important for setting treatment goals, and the greatest chance of benefit lies in aggressive treatment of hyperlipidemia among these high-risk groups.

5.3 New Study Drugs and Strong Potency in Lipid Lowering

Although statins can reduce LDL and coronary events, a great proportion of hyperlipidemic patients do not reach the targets recommended by ATP-III. For example, the European Action on Secondary Prevention by Intervention to Reduce Events (EUROASPIRE II) study found that approximately 50% of high-risk patients in Europe are not achieving the goal for LDL cholesterol set in the guideline (28). Choices
between a simple treatment regimen and dose titration or polypharmacy exist. Combination therapy with a statin and one of these other lipid-lowering agents can be useful in patients who are unable to achieve target lipid levels through monotherapy. A simple regimen is appealing because it can reduce the cost of treatment, avoid problems of drug interaction, and enhance compliance with therapy. Also, it is still not known whether ‘‘lower is better.’’ Some of the new options for reducing LDL levels will be available in the near future.

5.3.1 Statins. Newly designed clinical trials based on more potent lipid-lowering regimens, such as SEARCH (Study of the Effectiveness of Additional Reductions in Cholesterol and Homocysteine) (29), TNT (Treat to New Targets), IDEAL (Incremental Decrease in Endpoints through Aggressive Lipid Lowering) (30), and STELLAR (Statin Therapies for Elevated Lipid Levels compared Across doses to Rosuvastatin) (31), comprise thousands of hyperlipidemic subjects and will, it is hoped, answer the ‘‘lower is better’’ question. For example, in IDEAL, simvastatin 20 or 40 mg is compared with atorvastatin 80 mg, whereas SEARCH and TNT test the effects of two doses of simvastatin (20 and 80 mg) or atorvastatin (10 and 80 mg), respectively. These trials can achieve targeted LDL goals satisfactorily. One new drug, rosuvastatin, which produces a 40–69% reduction in LDL, up to a 22% reduction in triglycerides, and a 13% increase in HDL, was launched in 2003 and provides more efficient LDL-lowering effects (32). The Measuring Effective Reductions in Cholesterol Using Rosuvastatin therapy I (MERCURY I, announced at the 13th International Symposium on Atherosclerosis, Kyoto, Japan, 2003) study examined 3164 patients with open-label treatment and showed that rosuvastatin 10 mg/day enabled a significantly greater proportion of patients to achieve the NCEP ATP III and 1998 European goals for LDL (80% and 88%, respectively). Drugs with new mechanisms, such as pitavastatin, have largely cytochrome P450-free metabolism, which potentially decreases adverse effects and drug interactions in combination therapy. According to the results
of pharmacokinetic studies, pitavastatin showed a favorable and promising safety profile; it was only slightly metabolized by the cytochrome P450 (CYP) system and did not inhibit the CYP3A4-mediated metabolism of concomitantly administered drugs (33). Also, pitavastatin has strong lipid-lowering effects, with a 55% reduction in LDL, a 15% increase in HDL, and a 30% reduction in triglycerides (34).

5.3.2 Cholesterol Transport Inhibitors (Ezetimibe). Ezetimibe, a selective cholesterol absorption inhibitor, can potently inhibit the absorption of biliary and dietary cholesterol from the small intestine without affecting the absorption of fat-soluble vitamins, triglycerides, or bile acids. Enterohepatic recirculation of ezetimibe and its glucuronide ensures repeated delivery to the site of action and limited peripheral excretion. Also, ezetimibe has no effect on the activity of CYP450 metabolic enzymes, which reduces potential interactions with other medications. A 10 mg daily dose of ezetimibe can inhibit cholesterol absorption by 54% in hypercholesterolemic patients (P < 0.0001). Used alone, it can reduce plasma total and LDL cholesterol by 18% in patients with primary hypercholesterolemia. With some ongoing statin combinations, an additional 25% reduction in LDL was found in familial hypercholesterolemic patients (35, 36). The drug effects a 15–20% reduction in LDL and offers potential additive benefits when used in combination with statins (37). Other new drugs, such as the bile acid transport inhibitor S-8921, are in the preclinical development phase and effect a 5–10% reduction in serum cholesterol. An inhibitor of acyl coenzyme A-cholesterol acyltransferase (ACAT), avasimibe, was also found to enhance statin effects on lowering cholesterol (38).

5.3.3 PPARα Agonists (Peroxisome Proliferator-Activated Receptor Alpha): Fibrates. A pleiotropic effect, that is, various phenotypic manifestations from a single treatment, is a popular idea, and the fibrates illustrate the development of this idea. The progenitor, clofibrate, which appeared on the market 40 years ago, attracted attention with its multiple effects, including those on lipids,
fibrinogen, and glucose intolerance. It took 20–30 years for such pleiotropic effects to be explained by action through a nuclear receptor, PPAR (peroxisome proliferator-activated receptor). PPARα is a transcription factor belonging to the nuclear receptor superfamily that is activated by fatty acids and their derivatives. It is expressed in tissues with a high metabolic rate, such as liver and muscle, and in the cells of the atherosclerotic lesion. PPARα regulates genes involved in lipid metabolism, hemostasis, and vascular inflammation, making it a candidate gene for risk of dyslipidemia, atherosclerosis, and coronary events (39). Agonists of PPAR nuclear receptors, such as the fibrates, are drugs for metabolic derangement and insulin resistance. They mediate effects on adipocyte differentiation, glucose and lipid metabolism, and inflammation and are theoretically good candidates for treatment of the metabolic syndrome (40).

5.3.4 Other Drug Developments. The gene PCSK9 (proprotein convertase subtilisin/kexin type 9), located on chromosome 1p32–34, encodes NARC-1 (neural apoptosis-regulated convertase-1), a novel proprotein convertase. Statins increase the mRNA levels of NARC-1 dose-dependently, and NARC-1 may constitute an important new target for drug therapy of hypercholesterolemia (41). The adenosine triphosphate-binding cassette protein A1 transporter (ABCA1) plays a pivotal role in cholesterol efflux from peripheral cells, and overexpression of ABCA1 in transgenic mice increased HDL, facilitated reverse cholesterol transport, and decreased diet-induced atherosclerosis. The ABC transporter is a major determinant of the plasma HDL level (42). Finally, some biological therapy modalities, such as autoimmunization, develop internal antibodies to inhibit the actions of various proteins or receptors. Vaccination to inhibit plasma CETP (cholesteryl ester transfer protein) results in elevated anti-CETP antibodies, and clinical trials are underway to determine whether these antibodies sufficiently raise HDL levels (43). Another target for autoimmunization is oxidized LDL. Exciting new clinical studies are examining intravenous and oral forms of an Apo-A1
mimetic to inhibit atherosclerosis in humans (44).

5.4 Timing of Statin Treatment: How Soon Should It Start in Acute Coronary Syndrome?

The timing of statins in the management of patients with an acute coronary syndrome, such as unstable angina or acute myocardial infarction, is under intensive investigation. Increasing experimental evidence suggests that statins may act rapidly in the arterial wall to reduce the endothelial dysfunction, local inflammatory response, and elevated thrombogenesis that might otherwise increase the risk of recurrent ischemic events in acute coronary artery patients. Several statins are being used to investigate the timing of initiation and duration of statin therapy, including pravastatin (Pravastatin Acute Coronary Treatment, PACT), pravastatin and atorvastatin (Pravastatin or Atorvastatin Evaluation and Infection Therapy, PROVE-IT) (45), cerivastatin (Prevention of Reinfarction by Early Treatment of Cerivastatin Study, PRINCESS), atorvastatin (Myocardial Ischemia Reduction with Aggressive Cholesterol Lowering, MIRACL) (46), and simvastatin (A to Z), as well as the Pravastatin Turkish Trial, the Fluvastatin on Risk Diminishing After Acute Myocardial Infarction (FLORIDA) study, and the Lipid-Coronary Artery Disease (L-CAD) study. Up to now, the evidence is not consistently favorable for early use of statins. For example, MIRACL showed a nonsignificant reduction in death and nonfatal myocardial infarction among 3086 patients randomized to atorvastatin or placebo 24–96 hours after admission for acute coronary syndrome. In PACT, 3408 patients admitted to the hospital within 24 hours were randomized to receive 40 mg of pravastatin or placebo, and the results showed a nonsignificant absolute risk reduction for death or recurrent events.

5.5 Characterization of Various Endpoints and Measurement Modalities

5.5.1 Stroke. Classic hard endpoints defined by total death, cardiovascular death, and coronary events, including nonfatal and fatal
myocardial infarction, are considered primary endpoints in clinical trial designs. Most classic clinical trial designs look at these hard endpoints. Stroke prevention was considered a secondary endpoint in previous large-scale clinical trials. Meta-analysis of stroke prevention clearly demonstrated that statins can reduce the risk of stroke; treatment with statins led to an overall risk reduction of 31% for stroke (47). The beneficial effects of statins on stroke prevention can be seen in primary and secondary prevention and may be due to delayed progression and stabilization of plaque in extracranial carotid atherosclerosis or to the marked reduction in incident coronary heart disease associated with treatment (48). Stroke as a primary target of statin therapy is being examined in ongoing research such as the Stroke Prevention by Aggressive Reduction in Cholesterol Levels (SPARCL) Study, which randomizes 4732 stroke patients to evaluate the effects of atorvastatin 80 mg/day on prevention of further stroke events (49).

5.5.2 Morphology Endpoints. Clinically useful imaging modalities, such as coronary angiography, intravascular ultrasound (IVUS), carotid ultrasonography, and coronary calcium burden detected by electron beam computed tomography (EBCT), are used for evaluation of atherosclerotic burden and are considered important subclinical markers for identifying high-risk populations (50). These surrogate measures are viewed as primary endpoints in various clinical trial programs.

1. Angiographic regression studies of changes in coronary luminal diameter were used to evaluate the effects of various intervention modalities on reduction of coronary narrowing. The treatment modalities included colestipol-niacin (CLAS-I) (51), statins [FATS (52), MARS (53), CCAIT (54), MAAS (55), PLAC-I (56), REGRESS (57), LCAS (58)], intensive multifactorial risk reduction programs [SCRIP (59), SCAT (60), HATS (61)], and fibrates [LOCAT (62), DAIS (20)]. For example, large-scale clinical trials with angiographic measures, such
as the Diabetes Atherosclerosis Intervention Study (DAIS), the first lipid intervention trial designed to determine whether correcting the lipoprotein abnormality typically seen in Type 2 diabetes will alter the course of angiographically determined coronary atherosclerosis.

2. IVUS imaging with high-frequency ultrasound shows not only the coronary lumen but also the structure of the vessel wall, including the atherosclerotic plaque. IVUS imaging can increase understanding of the atherosclerosis process and help to establish the optimal means of reducing atherosclerotic burden. Recent IVUS studies demonstrated clearly that LDL cholesterol is positively related, and HDL cholesterol inversely related, to annual changes in plaque size (63). Trials of plaque regression and progression visualized by IVUS, such as the REVERSAL (Reversal of Atherosclerosis with Aggressive Lipid Lowering) trial, compared the effects of atorvastatin and pravastatin after 18 months of treatment. The results, announced at an AHA meeting in November 2003, showed that aggressive lipid lowering can decrease atherosclerosis progression.

3. Carotid artery intima-medial thickness (IMT) is a surrogate marker of atherosclerosis. Statin treatment has been associated with a reduction of IMT and a reversal of atherosclerosis in trials including MARS (lovastatin) (53), REGRESS (pravastatin) (57), ASAP (atorvastatin and simvastatin), and ENHANCE (Ezetimibe and Simvastatin in Hypercholesterolemia Enhances Atherosclerosis Regression).

5.5.3 Other Endpoints. Statin treatment trials in other chronic diseases, such as Alzheimer's disease [the Alzheimer's Disease Cholesterol-Lowering Treatment Trial (ADCLT) (64)], osteoporosis (65), and congestive heart failure [UNIVERSE (RosUvastatiN Impact on VEntricular Remodelling
lipidS and cytokinEs)], are ongoing. These clinical trials are designed to assess the possibility of benefits of statins on various endpoints. The pleiotropic effects of statins, beyond their lipid-lowering effects, are interesting and potentially useful in clinical applications.
6 CONCLUSION

Lipid-lowering therapy is beneficial for cardiovascular disease prevention, and the evidence from various clinical trial results is increasing. In particular, progress in high-risk group interventions, more effective lipid lowering, and novel measurement modalities can improve knowledge of the pathogenesis and prevention of hyperlipidemia.

REFERENCES

1. J. Shepherd, S. M. Cobbe, I. Ford, C. G. Isles, A. R. Lorimer, P. W. MacFarlane, J. H. McKillop, and C. J. Packard, Prevention of coronary heart disease with pravastatin in men with hypercholesterolemia. West of Scotland Coronary Prevention Study Group. N. Engl. J. Med. 1995; 333: 1301–1307.
2. J. R. Downs, M. Clearfield, S. Weis, E. Whitney, D. R. Shapiro, P. A. Beere, A. Langendorfer, E. A. Stein, W. Kruyer, and A. M. Gotto, Jr., Primary prevention of acute coronary events with lovastatin in men and women with average cholesterol levels: results of AFCAPS/TexCAPS. Air Force/Texas Coronary Atherosclerosis Prevention Study. JAMA 1998; 279: 1615–1622.
3. P. S. Sever, B. Dahlof, N. R. Poulter, H. Wedel, G. Beevers, M. Caulfield, R. Collins, S. E. Kjeldsen, A. Kristinsson, G. T. McInnes, J. Mehlsen, M. Nieminen, E. O'Brien, and J. Ostergren, Prevention of coronary and stroke events with atorvastatin in hypertensive patients who have average or lower-than-average cholesterol concentrations, in the Anglo-Scandinavian Cardiac Outcomes Trial—Lipid Lowering Arm (ASCOT-LLA): a multicentre randomised controlled trial. Lancet 2003; 361: 1149–1158.
4. Randomised trial of cholesterol lowering in 4444 patients with coronary heart disease: the Scandinavian Simvastatin Survival Study (4S). Lancet 1994; 344: 1383–1389.
5. F. M. Sacks, M. A. Pfeffer, L. A. Moye, J. L. Rouleau, J. D. Rutherford, T. G. Cole, L. Brown, J. W. Warnica, J. M. Arnold, C. C. Wun, B. R. Davis, and E. Braunwald, The effect of pravastatin on coronary events after myocardial infarction in patients with average cholesterol levels. Cholesterol and Recurrent Events Trial investigators. N. Engl. J. Med. 1996; 335: 1001–1009.
6. Prevention of cardiovascular events and death with pravastatin in patients with coronary heart disease and a broad range of initial cholesterol levels. The Long-Term Intervention with Pravastatin in Ischaemic Disease (LIPID) Study Group. N. Engl. J. Med. 1998; 339: 1349–1357.
7. The effect of aggressive lowering of lowdensity lipoprotein cholesterol levels and lowdose anticoagulation on obstructive changes in saphenous-vein coronary-artery bypass grafts. The Post Coronary Artery Bypass Graft Trial Investigators. N. Engl. J. Med. 1997; 336: 153–162. 8. B. Pitt, D. Waters, W. V. Brown, A. J. van Boven, L. Schwartz, L. M. Title, D. Eisenberg, L. Shurzinske, and L. S. McCormick, Aggressive lipid-lowering therapy compared with angioplasty in stable coronary artery disease. Atorvastatin versus revascularization treatment investigators. N. Engl. J. Med. 1999; 341: 70–76. 9. P. W. Serruys, P. de Feyter, C. Macaya, N. Kokott, J. Puel, M. Vrolix, A. Branzi, M. C. Bertolami, G. Jackson, B. Strauss, and B. Meier, Fluvastatin for prevention of cardiac events following successful first percutaneous coronary intervention: a randomized controlled trial. JAMA 2002; 287: 3215– 3222. 10. F. M. Sacks, A. M. Tonkin, J. Shepherd, E. Braunwald, S. Cobbe, C. M. Hawkins, A. Keech, C. Packard, J. Simes, R. Byington, and C. D. Furberg, Effect of pravastatin on coronary disease events in subgroups defined by coronary risk factors: the Prospective Pravastatin Pooling Project. Circulation 2000; 102: 1893–1900. 11. M. H. Frick, O. Elo, K. Haapa, O. P. Heinonen, P. Heinsalmi, P. Helo, J. K. Huttunen, P. Kaitaniemi, P. Koskinen, V. Manninen, et al., Helsinki Heart Study: primary-prevention trial with gemfibrozil in middle-aged men with dyslipidemia. Safety of treatment, changes in risk factors, and incidence of coronary heart disease. N. Engl. J. Med. 1982; 317: 1237–1245.
12. H. B. Rubins, S. J. Robins, D. Collins, C. L. Fye, J. W. Anderson, M. B. Elam, F. H. Faas, E. Linares, E. J. Schaefer, G. Schectman, T. J. Wilt, and J. Wittes, Gemfibrozil for the secondary prevention of coronary heart disease in men with low levels of high-density lipoprotein cholesterol. Veterans Affairs High-Density Lipoprotein Cholesterol Intervention Trial Study Group. N. Engl. J. Med. 1999; 341: 410–418. 13. A co-operative trial in the primary prevention of ischaemic heart disease using clofibrate. Report from the Committee of Principal Investigators. Br. Heart J. 1978; 40: 1069–1118. 14. Secondary prevention by raising HDL cholesterol and reducing triglycerides in patients with coronary artery disease: the Bezafibrate Infarction Prevention (BIP) study. Circulation 2000; 102: 21–27. 15. G. Ruotolo, C. G. Ericsson, C. Tettamanti, F. Karpe, L. Grip, B. Svane, J. Nilsson, U. de Faire, and A. Hamsten, Treatment effects on serum lipoprotein lipids, apolipoproteins and low density lipoprotein particle size and relationships of lipoprotein variables to progression of coronary artery disease in the Bezafibrate Coronary Atherosclerosis Intervention Trial (BECAIT). J. Am. Coll. Cardiol. 1998; 32: 1648–1656. 16. H. B. Rubins, S. J. Robins, D. Collins, D. B. Nelson, M. B. Elam, E. J. Schaefer, F. H. Faas, and J. W. Anderson, Diabetes, plasma insulin, and cardiovascular disease: subgroup analysis from the Department of Veterans Affairs high-density lipoprotein intervention trial (VA-HIT). Arch. Intern. Med. 2002; 162: 2597–2604. 17. Executive Summary of The Third Report of The National Cholesterol Education Program (NCEP) Expert Panel on Detection, Evaluation, And Treatment of High Blood Cholesterol In Adults (Adult Treatment Panel III). JAMA 2001; 285: 2486–2497. 18. MRC/BHF Heart Protection Study of cholesterol lowering with simvastatin in 20,536 high-risk individuals: a randomised placebo-controlled trial. Lancet 2002; 360: 7–22. 19. R. Collins, J. Armitage, S. Parish, P. Sleigh, and R. Peto, MRC/BHF Heart Protection Study of cholesterol-lowering with simvastatin in 5963 people with diabetes: a randomised placebo-controlled trial. Lancet 2003; 361: 2005–2016. 20. Effect of fenofibrate on progression of coronary-artery disease in type 2 diabetes: the Diabetes Atherosclerosis Intervention
Study, a randomised study. Lancet 2001; 357: 905–910. 21. J. Shepherd, G. J. Blauw, M. B. Murphy, E. L. Bollen, B. M. Buckley, S. M. Cobbe, I. Ford, A. Gaw, M. Hyland, J. W. Jukema, A. M. Kamper, P. W. MacFarlane, A. E. Meinders, J. Norrie, C. J. Packard, I. J. Perry, D. J. Stott, B. J. Sweeney, C. Twomey, and R. G. Westendorp, Pravastatin in elderly individuals at risk of vascular disease (PROSPER): a randomised controlled trial. Lancet 2002; 360: 1623– 1630. 22. Major outcomes in moderately hypercholesterolemic, hypertensive patients randomized to pravastatin vs usual care: The Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial (ALLHAT-LLT). JAMA 2002; 288: 2998–3007. 23. C. H. Hennekens, The ALLHAT-LLT and ASCOT-LLA Trials: are the discrepancies more apparent than real? Curr. Atheroscler. Rep. 2004; 6: 9–11. 24. H. Holdaas, B. Fellstrom, A. G. Jardine, I. Holme, G. Nyberg, P. Fauchald, C. Gronhagen-Riska, S. Madsen, H. H. Neumayer, E. Cole, B. Maes, P. Ambuhl, A. G. Olsson, A. Hartmann, D. O. Solbu, and T. R. Pedersen, Effect of fluvastatin on cardiac outcomes in renal transplant recipients: a multicentre, randomised, placebo-controlled trial. Lancet 2003; 361: 2024–2031. 25. S. J. Robins, Targeting low high-density lipoprotein cholesterol for therapy: lessons from the Veterans Affairs High-density Lipoprotein Intervention Trial. Am. J. Cardiol. 2001; 88: 19N–23N. 26. B. J. Ansell, K. E. Watson, R. E. Weiss, and G. C. Fonarow, hsCRP and HDL effects of statins trial (CHEST): rapid effect of statin therapy on C-reactive protein and highdensity lipoprotein levels: a clinical investigation. Heart Dis. 2003; 5: 2–7. 27. P. M. Ridker, Rosuvastatin in the primary prevention of cardiovascular disease among patients with low levels of low-density lipoprotein cholesterol and elevated high-sensitivity C-reactive protein: rationale and design of the JUPITER trial. Circulation 2003; 108: 2292–2297. 28. Clinical reality of coronary prevention guidelines: a comparison of EUROASPIRE I and II in nine countries. EUROASPIRE I and II Group. European Action on Secondary Prevention by Intervention to Reduce Events. Lancet 2001; 357: 995–1001.
29. M. MacMahon, C. Kirkpatrick, C. E. Cummings, A. Clayton, P. J. Robinson, R. H. Tomiak, M. Liu, D. Kush, and J. Tobert, A pilot study with simvastatin and folic acid/vitamin B12 in preparation for the Study of the Effectiveness of Additional Reductions in Cholesterol and Homocysteine (SEARCH). Nutr. Metah. Cardiovasc. Dis. 2000; 10: 195– 203. 30. T. R. Pedersen, O. Faergeman, and I. Holme, Effect of Greater LDL-C Reductions on Prognosis - The Incremental Decrease in Endpoints through Aggressive Lipid Lowering (IDEAL) trial. Atherosclerosis, 1999; 114 (Suppl 1): 71. 31. J. M. McKenney, P. H. Jones, M. A. Adamczyk, V. A. Cain, B. S. Bryzinski, and J. W. Blasetto, Comparison of the efficacy of rosuvastatin versus atorvastatin, simvastatin, and pravastatin in achieving lipid goals: results from the STELLAR trial. Curr. Med. Res. Opin. 2003; 19: 689–698. 32. H. Schuster, Rosuvastatin—a highly effective new 3-hydroxy-3-methylglutaryl coenzyme A reductase inhibitor: review of clinical trial data at 10-40mg doses in dyslipidemic patients. Cardiology 2003; 99: 126–139. 33. K. Kajinami, N. Takekoshi, and Y. Saito, Pitavastatin: efficacy and safety profiles of a novel synthetic HMG-CoA reductase inhibitor. Cardiovasc. Drug Rev. 2003; 21: 199–215. 34. W. L. Isley, Pitavastatin (NK-104), a new HMG-CoA reductase inhibitor. Drugs Today (Barc.) 2001; 37: 587–594. 35. B. Kerzner, J. Corbelli, S. Sharp, L. J. Lipka, L. Melani, A. LeBeaut, R. Suresh, P. Mukhopadhyay, and E. P. Veltri, Efficacy and safety of ezetimibe coadministered with lovastatin in primary hypercholesterolemia. Am. J. Cardiol. 2003; 91: 418–424. 36. C. Gagne, D. Gaudet, and E. Bruckert, Efficacy and safety of ezetimibe coadministered with atorvastatin or simvastatin in patients with homozygous familial hypercholesterolemia. Circulation 2002; 105: 2469–2475. 37. M. H. Davidson, T. McGarry, R. Bettis, L. Melani, L. J. Lipka, A. P. LeBeaut, R. Suresh, S. Sun, and E. P. Veltri, Ezetimibe coadministered with simvastatin in patients with primary hypercholesterolemia. J. Am. Coll. Cardiol. 2002; 40: 2125–2134. 38. F. J. Raal, A. D. Marais, E. Klepack, J. Lovalvo, R. McLain, and T. Heinonen, Avasimibe, an ACAT inhibitor, enhances the lipid lowering effect of atorvastatin in
subjects with homozygous familial hypercholesterolemia. Atherosclerosis 2003; 171: 273–279. 39. J. C. Fruchart, Overview. PPAR and cardiovascular risk. J. Cardiovasc. Risk. 8: 185– 186. 40. S. J. Robins, 2001; PPARalpha ligands and clinical trials: cardiovascular risk reduction with fibrates. J. Cardiovasc. Risk. 2001; 8: 195–201. 41. M. Abifadel, M. Varret, J. P. Rabes, D. Allard, K. Ouguerram, M. Devillers, C. Cruaud, S. Benjannet, L. Wickham, D. Erlich, A. Derre, L. Villeger, M. Farnier, I. Beucler, E. Bruckert, J. Chambaz, B. Chanu, J. M. Lecerf, G. Luc, P. Moulin, J. Weissenbach, A. Prat, M. Krempf, C. Junien, N. G. Seidah, and C. Boileau, Mutations in PCSK9 cause autosomal dominant hypercholesterolemia. Nat. Genet. 2003; 34: 154–156. 42. H. B. Brewer, Jr. and S. Santamarina-Fojo, Clinical significance of high-density lipoproteins and the development of atherosclerosis: focus on the role of the adenosine triphosphate-binding cassette protein A1 transporter. Am. J. Cardiol. 2003; 92: 10K–16K. 43. G. J. de Grooth, J. A. Kuivenhoven, A. F. Stalenhoef, J. de Graaf, A. H. Zwinderman, J. L. Posma, A. van Tol, and J. J. Kastelein, Efficacy and safety of a novel cholesteryl ester transfer protein inhibitor, JTT-705, in humans: a randomized phase II dose-response study. Circulation 2002; 105: 2159–2165. 44. S. E. Nissen, T. Tsunoda, E. M. Tuzcu, P. Schoenhagen, C. J. Cooper, M. Yasin, G. M. Eaton, M. A. Lauer, W. S. Sheldon, C. L. Grines, S. Halpern, T. Crowe, J. C. Blankenship, and R. Kerensky, Effect of recombinant ApoA-I Milano on coronary atherosclerosis in patients with acute coronary syndromes: a randomized controlled trial. JAMA 2003; 290: 2292–2300. 45. C. P. Cannon, C. H. McCabe, R. Belder, J. Breen, and E. Braunwald, Design of the Pravastatin or Atorvastatin Evaluation and Infection Therapy (PROVE IT)-TIMI 22 trial. Am. J. Cardiol. 2002; 89: 860–861. 46. G. G. Schwartz, A. G. Olsson, M. D. Ezekowitz, P. Ganz, M. F. Oliver, D. Waters, A. Zeiher, B. R. Chaitman, S. Leslie, and T. Stern, Effects of atorvastatin on early recurrent ischemic events in acute coronary syndromes: the MIRACL study: a randomized controlled trial. JAMA 2001; 285: 1711–1718.
47. G. J. Blauw, A. M. Lagaay, A. H. Smelt, and R. G. Westendorp, Stroke, statins, and cholesterol. A meta-analysis of randomized, placebo-controlled, double-blind trials with HMG-CoA reductase inhibitors. Stroke 1997; 28: 946–950. 48. J. R. Crouse, III, R. P. Byington, H. M. Hoen, and C. D. Furberg, Reductase inhibitor monotherapy and stroke prevention. Arch. Intern. Med. 1997; 157: 1305–1310. 49. P. Amarenco, J. Bogousslavsky, A. S. Callahan, L. Goldstein, M. Hennerici, H. Sillsen, M. A. Welch, and J. Zivin, Design and baseline characteristics of the stroke prevention by aggressive reduction in cholesterol levels (SPARCL) study. Cerebrovasc. Dis. 2003; 16: 389–395. 50. D. E. Grobbee, and M. L. Bots, Statin treatment and progression of atherosclerotic plaque burden. Drugs 2003; 63: 893–911. 51. D. H. Blankenhorn, S. A. Nessim, R. L. Johnson, M. E. Sanmarco, S. P. Azen, and L. Cashin-Hemphill, Beneficial effects of combined colestipol-niacin therapy on coronary atherosclerosis and coronary venous bypass grafts. JAMA 1987; 257: 3233–3240. 52. G. Brown, J. J. Albers, L. D. Fisher, S. M. Schaefer, J. T. Lin, C. Kaplan, X. Q. Zhao, B. D. Bisson, V. F. Fitzpatrick, and H. T. Dodge, Regression of coronary artery disease as a result of intensive lipid-lowering therapy in men with high levels of apolipoprotein B. N. Engl. J. Med. 1990; 323: 1289–1298. 53. D. H. Blankenhorn, S. P. Azen, D. M. Kramsch, W. J. Mack, L. Cashin-Hemphill, H. N. Hodis, L. W. DeBoer, P. R. Mahrer, M. J. Masteller, L. I. Vailas, et al., Coronary angiographic changes with lovastatin therapy. The Monitored Atherosclerosis Regression Study (MARS). The MARS Research Group. Ann. Intern. Med. 1993; 119: 969–976. 54. D. Waters, L. Higginson, P. Gladstone, B. Kimball, M. Le May, S. J. Boccuzzi, and J. Lesperance, Effects of monotherapy with an HMG-CoA reductase inhibitor on the progression of coronary atherosclerosis as assessed by serial quantitative arteriography. The Canadian Coronary Atherosclerosis Intervention Trial. Circulation 1994; 89: 959–968. 55. Effect of simvastatin on coronary atheroma: the Multicentre Anti-Atheroma Study (MAAS). Lancet 1994; 344: 633–638. 56. B. Pitt, G. B. Mancini, S. G. Ellis, H. S. Rosman, J. S. Park, and M. E. McGovern, Pravastatin limitation of atherosclerosis in the coronary arteries (PLAC I): reduction
in atherosclerosis progression and clinical events. PLAC I investigation. J. Am. Coll. Cardiol. 1995; 26: 1133–1139. 57. J. W. Jukema, A. V. Bruschke, A. J. van Boven, J. H. Reiber, E. T. Bal, A. H. Zwinderman, H. Jansen, G. J. Boerma, F. M. van Rappard, K. I. Lie, et al., Effects of lipid lowering by pravastatin on progression and regression of coronary artery disease in symptomatic men with normal to moderately elevated serum cholesterol levels. The Regression Growth Evaluation Statin Study (REGRESS). Circulation 1995; 91: 2528–2540. 58. J. A. Herd, C. M. Ballantyne, J. A. Farmer, J. J. Ferguson, III, P. H. Jones, M. S. West, K. L. Gould, and A. M. Gotto, Jr., Effects of fluvastatin on coronary atherosclerosis in patients with mild to moderate cholesterol elevations (Lipoprotein and Coronary Atherosclerosis Study [LCAS]). Am. J. Cardiol. 1997; 80: 278–286. 59. W. L. Haskell, E. L. Alderman, J. M. Fair, D. J. Maron, S. F. Mackey, H. R. Superko, P. T. Williams, I. M. Johnstone, M. A. Champagne, R. M. Krauss, et al., Effects of intensive multiple risk factor reduction on coronary atherosclerosis and clinical cardiac events in men and women with coronary artery disease. The Stanford Coronary Risk Intervention Project (SCRIP). Circulation 1994; 89: 975–990. 60. K. K. Teo, J. R. Burton, C. E. Buller, S. Plante, D. Catellier, W. Tymchak, V. Dzavik, D. Taylor, S. Yokoyama, and T. J. Montague, Long-term effects of cholesterol lowering and angiotensin-converting enzyme inhibition on coronary atherosclerosis: the Simvastatin/Enalapril Coronary Atherosclerosis Trial (SCAT). Circulation 2000; 102: 1748–1754. 61. B. G. Brown, X. Q. Zhao, A. Chait, L. D. Fisher, M. C. Cheung, J. S. Morse, A. A. Dowdy, E. K. Marino, E. L. Bolson, P. Alaupovic, J. Frohlich, and J. J. Albers, Simvastatin and niacin, antioxidant vitamins, or the combination for the prevention of coronary disease. N. Engl. J. Med. 2001; 345: 1583–1592. 62. M. H. Frick, M. Syvanne, M. S. Nieminen, H. Kauma, S. Majahalme, V. Virtanen, Y. A. Kesaniemi, A. Pasternack, and M. R. Taskinen, Prevention of the angiographic progression of coronary and vein-graft atherosclerosis by gemfibrozil after coronary bypass surgery in men with low levels of HDL cholesterol. Lopid Coronary Angiography Trial (LOCAT) Study Group. Circulation 1997; 96: 2137–2143.
63. C. von Birgelen, M. Hartmann, G. S. Mintz, D. Baumgart, A. Schmermund, and R. Erbel, Relation between progression and regression of atherosclerotic left main coronary artery disease and serum cholesterol levels as assessed with serial long-term ( > or = 12 months) follow-up intravascular ultrasound. Circulation 2003; 108: 2757–2762. 64. D. L. Sparks, J. Lopez, D. Connor, M. Sabbagh, J. Seward, and P. Browne, A position paper: based on observational data indicating an increased rate of altered blood chemistry requiring withdrawal from the Alzheimer’s Disease Cholesterol-Lowering Treatment Trial (ADCLT). J. Mol. Neurosci. 2003; 20: 407–410. 65. A. Waldman, and L. Kritharides, The pleiotropic effects of HMG-CoA reductase inhibitors: their role in osteoporosis and dementia. Drugs 2003; 63: 139–152.
USING INTERNET IN COMMUNITY INTERVENTION STUDIES
DAVID B. BULLER
Klein Buendel, Inc.
Golden, Colorado

1 INTRODUCTION

The Internet has been adopted in the United States at unprecedented speed compared with previous forms of mass media. Adoption is progressing at a rapid pace elsewhere, too. Not surprisingly, it has attracted the attention of health care providers and public health practitioners as a means for delivering medical and prevention services. Unfortunately, very little is known about the efficacy of this medium, because very few community-based trials have been conducted on Internet health applications. It is essential to investigate the role of the Internet and to identify how effective it can be, with what audiences, in what circumstances, and through what features.

The Internet has several characteristics that make it an attractive medium for delivering health care and prevention services. It offers a dynamic, engaging, multimedia environment in which health care and prevention services can be delivered immediately, on demand, 24 hours per day, 7 days a week, and for a lower per unit cost than by other delivery systems (e.g., human contact, U.S. mail) (1–3). Web-based interventions may reduce demands on other hospital services, such as general inquiries, routine examinations (asthma), or outpatient visits (4–6). The Internet can span barriers created by distance, disability, and stigma (5, 7, 8). The interactivity of the Internet can be used to deliver information that is tailored both by the user and for the user (9). Interactivity involves an interdependent exchange of information that is built on monitoring and responsiveness to the discourse or interchange, best exemplified by the ability of Internet programs to ‘‘talk back’’ to the user (10–12). It is a defining feature of the Internet medium that distinguishes it from other electronic mass media. The extent to which interactivity helps or hinders the outcomes of an Internet health application is a function of the qualities of interactivity, the communication objectives, and the communicators (10, 13).

The general public and patients see the Internet as a powerful medium for obtaining reliable health information (14–16). This demand is likely to grow as Internet access continues to spread. Despite its potential, the Internet has only recently become the topic of scientific investigation. An examination of the small number of published studies shows that the Internet health applications evaluated ranged in scope from programs intended to provide online information to those that attempted to capitalize on the medium's interactivity to deliver content tailored to the user. Most of the applications depended on the user to initiate contact and ‘‘pull’’ information from them; however, a few encompassed personalized feedback and technologies like e-mail that ‘‘push’’ information to users. Also, in many instances, the Internet health application was part of a larger health program, but in a few cases, the Internet was relied on to deliver the entire health content. High-quality studies on the efficacy of Internet health programs are essential because of the expense of these programs and the speed with which governmental and non-governmental health organizations are embracing their use.

In some respects, mounting a community study to evaluate the efficacy of an Internet health application has the same requirements as testing any health care or disease prevention service in a community setting, for example, identifying and defining the at-risk population, gaining access to the population for testing and program delivery, implementing an experimental design with rigorous scientific control, and using valid and reliable measures of processes and outcomes. However, the nature of this new medium presents several unique challenges related to the access of populations to the Internet, the development and implementation of the Internet health programs, continued use of the programs, and measurement of the process of program delivery. These issues are considered in the following sections, with examples from published studies and the author's experience in performing three large community-based trials evaluating Internet programs intended to prevent smoking by children (17, 18), improve the community tobacco control efforts of local advocates, and improve the dietary behavior of multicultural adults (19, 20).

2 METHODS AND EXAMPLES
2.1 The User Populations

Internet health applications have been evaluated with a variety of healthy (1, 21–33) and patient (2, 4, 5, 7, 8, 34–45) populations in community settings. They have been tested on adults of various ages and on children. A few studies have enrolled college students (23, 33) or recruited adults in workplaces (26). In considering the population for an Internet health application, the issues of Internet access and computer aptitude can present barriers to its production and delivery and can influence the quality of a community-based study (41, 46). Early in the introduction of the Internet, only certain population subgroups (men, more educated individuals, younger and middle-aged populations, whites, and suburban dwellers) were online (47), which substantially restricted access to many populations for which the Internet might prove beneficial as a source of health information and programming. Fortunately, many of these disparities have shrunk substantially in the United States as the Internet penetrates into more households, workplaces, and schools. Internet access has become nearly universal in schools (48); home access far exceeds half of U.S. households (49); and more than 60% of employed adults have access to the Internet (50). Women are now online as much as men (47). Older and minority populations are experiencing some of the highest rates of new adoption, although they still lag in use (47, 49). The Internet is particularly attractive to children (49, 51, 52). For many children, it has become an indispensable tool for school work, and e-mail
and instant messaging are common forms of communication with peers. Still, regional variations exist in Internet access (53), and issues of access remain internationally, particularly in developing countries (54). Access to broadband technology, at least in the home environment, also presents impediments to accessing populations through the Internet in the United States, although it was estimated that over 30% of U.S. households had broadband Internet service in 2002 (55). Broadband Internet service is more widely available in school settings (48). Dial-up Internet service can experience severely restricted communication speed, either because older modems work more slowly or because, in some areas, particularly rural regions, the telecommunication infrastructure cannot support high-speed data transmission. For example, in an evaluation of a nutrition education website the author conducted in the six counties that comprise the Upper Rio Grande Valley, many home users could not obtain Internet access exceeding 28 kbps, even with newer modems, because of the older telephone infrastructure in this highly rural area. Slower Internet service can severely handicap Internet health applications by limiting the use of multimedia features like streaming audio and video or rich graphics that require transmitting large files. One strategy that has been employed to improve access to the Internet is to identify and refer study participants to sites that provide public access to computers and Internet service in the local communities (19). Unfortunately, public access sites can be difficult to use because their hours are limited, their technical support is lacking, and availability of and time on computers are restricted. Also, some public access computers may not allow private usage, potentially threatening the confidentiality of health information. Computer aptitude and comfort also present challenges for reaching some populations of interest to health care and public health practitioners with an Internet health application. Many less educated and older adults may have very little experience with computers and may find the prospect of using them daunting (22, 36, 56, 57). Computer familiarity may become less of an issue as
the Internet has become more widely available. Vision and dexterity deficits can make computers and the Internet difficult to use for older adults and populations with certain disabilities (58, 59). The federal Section 508 accessibility standard has helped to promote website designs that overcome some disabilities. However, some populations may require more training and assistance to use the Internet effectively before the efficacy of an Internet health application can be evaluated (26, 46, 59). In one study, weekly training in computer skills by a nurse or significant other improved self-esteem and depression outcomes when elderly individuals were assigned to use the Internet (22). In the Upper Rio Grande Valley, the author's research team has hired and trained a group of local residents to recruit adults to the randomized community-based trial evaluating the nutrition education website and to provide introductory training in basic computer skills to adults with limited computer experience.

2.2 Program Design, Development, and Implementation

Internet health applications that have been the focus of community-based evaluations were designed to teach healthy eating habits (21, 26, 27, 37); treat eating disorders and promote healthy body images (8, 23, 31, 33); manage weight (29, 30), diabetes (36, 37, 41, 44), asthma (2, 4, 39), tinnitus (34), back pain (40), depression (35), panic disorder (7), HIV (5), and low-birth weight babies (1); promote tobacco cessation (24, 32); improve quality of life of breast cancer and HIV patients (5, 38, 43, 45); and increase physical activity (27).

2.2.1 Programming Internet Health Applications. With their increasing familiarity with the Internet, users are undoubtedly becoming much more sophisticated in their expectations for the content they find on it. Medical and public health practitioners are challenged to produce Internet health applications with very high production quality. Unless one is relying solely on text-based e-mail, it is unlikely that a person with limited experience programming in Hypertext Markup Language (HTML) will be able to
produce a website that competes within the World Wide Web media environment and meets the expectations of users. For the author’s websites that are being evaluated in community-based trials, he works closely with a group of media developers, including graphic designers, Web programmers, a database programmer, and an instructional designer. For various applications, the author has used production studio facilities and hired voice and acting talent. It is also necessary that the programmers stay abreast of the latest developments in multimedia programming and database (e.g., SQL, Oracle, Gnu) software. To achieve the greatest cross-platform compatibility and optimum browser functionality, the author’s sites are currently being created using Macromedia’s Dreamweaver MX. The graphic design and Web-ready image optimization are handled by Adobe Photoshop and Macromedia Freehand. The Macromedia suite of multimedia authoring software (Director Shockwave Internet Studio and Flash MX) is used to create the bulk of the interactive elements and features. Sometimes, browser plug-in technologies such as the QuickTime and Shockwave players are used, and multimedia components such as Windows Media Player, Acrobat Reader, and Macromedia Flash Communication Server are employed. To tailor Web pages to users, the author’s websites rely on database software such as SQL that controls the serving of pages to a Web browser using template page technologies like the PHP Hypertext Preprocessor. As discussed below, custom-written automated data-gathering programs are also authored for the websites to provide extensive tracking of website use for process analysis (18).

When designing an Internet health application, it is important to know the minimum specifications of the users’ computers. This information determines the functionality that is programmed into an Internet health application. At a minimum, it is important to know the computer’s microprocessor speed, amount of RAM, Web browser type and version (e.g., Internet Explorer or Netscape), whether cookies and Java/JavaScript can be enabled and plug-in programs installed, the audio and video capabilities, and the type
and speed of Internet service (e.g., dial-up, cable, T1 line). The presence of a CD-ROM drive may be important if the Internet health application interfaces with files stored on a CD-ROM. Also, headphones are sometimes desirable to maintain privacy, which require that a sound card be installed. The type of e-mail software and presence of an e-mail account also may need to be considered when the program will be designed to send or receive e-mail. 2.2.2 Usability Testing. It is essential that Internet health applications be tested for usability throughout production (20). It is unwise to design a website without checking design assumptions and decisions with experts and potential users. Usability testing is a formative evaluation methodology (60–64) that is employed to assess users’ interactions with, and problems using, hardware (computers and other information products), software, and their interfaces and content. As a tool in information design, designers of Internet health applications should use it to test the look and feel of this electronic communication, font selection and size, layout, graphics, line art, photographs, sound, videos, white space, animations, interactivity, organization and structure, headlines, questions, overviews, bullet lists, paragraph length and structure, site navigation, and logical connectors. Recently, the National Cancer Institute published a series of usability guidelines that can be used as starting points for the development of Internet health applications. However, it is also advisable to obtain expert review and test the application with end users throughout development from initial script and site map development to prototype Web screens and activities. A sequence of usability testing that was deployed to develop the nutrition education website for adults in the Upper Rio Grande Valley (20) began with a modified Q-sort method to design the initial website map based on over 70 nutrition concepts followed by two rounds of protocol analyses (63) in which potential users worked with prototypes of the website and provided feedback on their navigational decisions and reactions to its content. This information was used by the research team and Web developers to
improve the content and format of the website throughout production.

2.2.3 Website Security. The security of medical and health information exchanged with an Internet health program is of paramount importance; it is relevant to community trials because of concerns over protecting human subjects from the risks that could result from disclosure of this information. Many computer firewalls and the security software on Web servers will protect low-risk information from most intruders; however, highly sensitive personal health information gathered on an Internet health program may need to be protected by the added security of a Secure Sockets Layer (SSL) encryption key issued by Verisign/Network Solutions, Inc. SSL is commonly used on e-commerce sites to protect payment information. Likewise, security features implemented by the end users can be barriers to website access and functionality. Software used to block certain content, like references to sex and drugs, can restrict access to some Internet health applications related to sexual health and drug abuse prevention (65). However, the degree to which this software blocks access depends on how restrictive the settings selected by the user are (66). Virus protection software and firewalls can also interfere with websites that have a high degree of interactivity and serve dynamic Web pages (18). Schools often have policies that prohibit children from using online discussions (18). It is advisable to test an Internet health program on users’ computers before launching a community trial.

2.3 Randomized and Non-Randomized Trial Designs

Given the infancy of the research on Internet health applications, it is probably not surprising that not all published evaluations have reported on randomized trials or on outcomes related to health behavior. Randomized controlled trials were performed on online programs to improve diet (26), manage weight (21, 25, 29, 30), improve body image and decrease disordered eating (22, 33); manage tinnitus (34), asthma (2, 39),
depression (35), lower back pain (40), and panic disorder (7); improve quality of life in HIV-positive and breast cancer patients (5, 38, 45); and increase physical activity (41, 42). The comparison group in these trials has included an untreated or wait-listed control group (7, 23, 33, 34) or an alternative intervention delivered through another medium (e.g., text or written materials) or in person (2, 25, 26, 35, 39, 40, 42). A few studies compared alternative online programs (29, 30, 41). One study included an Internet-based attention-control condition (21). The non-randomized trials have been designed to test the acceptability of an Internet health application to the user population, compare use of these programs between different user groups, or provide preliminary insight into the possible efficacy of the programs. For instance, trials have been performed to determine the acceptability of an online program to monitor asthma intended to reduce regular outpatient visits (4) and of an online program to promote physical activity and healthy nutrition with adolescents and adults in a primary care setting (27). Similarly, adults who participated or declined participation in an online diabetes self-management program were compared in one study (36); breast cancer and HIV-positive patients’ use of an online support system was compared in another study (43); and users at different stages of change in physical activity were compared on their use of and satisfaction with an online program (28). Efficacy data on Internet health applications to promote tobacco use cessation (24, 32), online counseling to manage weight and treat depression and bulimic symptoms (8), and provide blood glucose monitoring with immediate online feedback (44) were reported from non-randomized trials.

2.4 Measurement

Community trials on Internet health programs have employed measures of health-related outcomes (e.g., attitudes, mental states, and behaviors). Many have also included process measures assessing the implementation and use of the programs.

2.4.1 Outcome Measures. Most of the outcome measures are typical of the types
of self-report and observational measures used in community trials on other forms of health interventions. Many trials have assessed changes in behavior, usually with self-report measures: diaries logging sleep (34); assessments of disordered eating (8, 23, 31); measures of dietary intake (21, 25, 26, 30, 41) and stage of change (26); reports on physical activity (21, 25, 30, 42); body mass index (21, 33); reported medical care utilization (1, 5, 39, 40); reports on insulin, medication, and monitoring device use (2, 36, 39); measures of participation in health care (38); and assessments of physical disability (40). Behaviors were directly observed only in a few trials, for example, picking up of antidepressants at a pharmacy (35), and assessing body weight (25, 30) and waist circumference (30) in clinical settings. Three studies also collected medical assessments—blood samples (41), blood glucose levels (44), and spirometry tests (4). Several community trials on Internet health programs included self-reported measures of attitudes and mental states, among which are measures of depression and mental health (7, 8, 35, 41, 45), body image and disordered eating attitudes (23, 33), social and health support (31, 33, 37, 38), quality of life (5, 7, 38, 40), and back pain distress (40). Again, many of these measures are very similar to the ones used in community trials testing other forms of medical care and preventive interventions. 2.4.2 Process Measures. Process measures are essential in community trials to ensure that the program under study is being implemented as intended and is reaching the target population. They also can provide insight into which components of a medical or health program are most responsible for its effectiveness. In the community trials on Internet health applications, researchers have assessed users’ appreciation, relevance, credibility, appropriateness, acceptance, and usefulness of the online programs (26, 27, 44); comfort level entering personal information into the online program (27); word length and content in e-mail messages (8, 43); and reports of program use (e.g., frequency, time of day, components used) (28, 31, 43). Perceptual measures of reactions to an Internet
health application can be useful and add to designers’ understanding of how usable the application is. However, little reason exists to collect self-reports on program usage. An advantage of the Internet is that computer programs can be authored to track usage of various features and components of online programs, for example, time spent using the program (41), page views (42, 67), overall website activity (42), and number of log-ins and entries into online activities (30). Several software packages that run on the Web server are readily available to track hits to, and page views in, websites. A drawback to these software packages in the context of community trials is that they usually cannot associate the data on usage with individual users, which requires a customized program with unique user identifiers or passwords assigned to trial participants that they enter whenever they log into the Internet health program. With this information, a database associated with the Internet health application can record a variety of information such as date and time of each visit, Web pages visited, time spent on each Web page or on the website, and click stream (i.e., the sequence of page views as the user moves through the website). The author’s websites, containing a smoking prevention program for adolescents, technical assistance tools for local tobacco control advocates, and nutrition education for rural adults, all have these customized tracking programs. User identifiers or passwords can be an impediment to website usage in some circumstances and may not be desirable in all trials. For example, the author’s research team recently tested two versions of a website to disseminate a sun safety curriculum to public schools and state-licensed childcare facilities. User identifiers and passwords were not required to access the websites in order to make it easier for the managers of the schools and childcare facilities to access them. In another case, the Internet smoking prevention program was revised and user identifiers were eliminated so it could be used by any school that wanted to implement it. Instead, an entry screen was programmed in which visitors create their own user identifiers (known only to them and the program) and they are asked to record the name of their school to provide a small amount of identifying information.

2.4.3 Recent Developments. An emerging issue in community trials designed to evaluate Internet health applications is that of return use or drop off. It is becoming apparent that simply because an online health program is made available to a population, no guarantee exists that the population will actually use it, even when assigned to do so in a community trial (21, 31). Drop off can be detrimental; several trials found that the outcomes produced by Internet health applications improved for users who spent more time on them (8, 30, 31, 42). Research on the factors associated with the use of Internet health applications in community trials has found that those who volunteered to use a website in one study were younger and more recently diagnosed with a disease (36). In another study, participants in the preparation and action stages within the Transtheoretical Model (68) of health behavior change used websites promoting physical activity more frequently than those in contemplation. In a third study, HIV-positive patients used online support groups more than breast cancer patients. It has been suggested, however, that a low baseline rate of a desirable health behavior allows even infrequent users to make significant gains from Internet health applications (38, 69). Effective strategies are needed to achieve high utilization of websites in community trials to provide a good test of their efficacy. It has recently been reported that e-mail notifications increased use of a weight loss website in a hospital setting (29, 30). How successful these notifications will be in other settings remains to be tested, although data from the author’s evaluation of a nutrition education website with rural adults are promising in this regard. Computer-based technology remains at present a “pull” rather than a “push” technology, and, until this changes, use may be difficult to motivate unless programs are administered in highly controlled settings. The creation of a community, with commensurate norms and pressures for usage, may be necessary to improve use (67). Also, designers must consider how to create Internet health applications that are attractive to the user. Some authors have speculated that many users do not return to websites because they find little information of relevance, especially users from minority populations, those who have strong local (rather than national or international) orientations (70), and those who speak another language (71). It is also important to conduct careful formative analyses on the content and usability of Internet health applications before publishing them to the Internet to avoid problems that could drive users away from these programs (20). Still, in cases where use is an initial problem or interest in the health issue is low, online programs may not be effective motivators for change (25, 27, 35).
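To make the kind of custom server-side tracking described in this section concrete, the minimal sketch below rolls hypothetical page-view records up into the usage metrics discussed above (page views, log-in sessions, and the span of days over which a participant returned). The participant identifiers, page names, and 30-minute session rule are illustrative assumptions, not details of any of the cited programs.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical page-view records exported from a website's tracking database.
# Participant IDs, pages, and timestamps are illustrative only.
page_views = [
    ("P001", "/home",    "2003-05-01 10:00:00"),
    ("P001", "/lesson1", "2003-05-01 10:03:20"),
    ("P001", "/home",    "2003-05-09 18:15:00"),
    ("P002", "/home",    "2003-05-02 09:30:00"),
]

visits = defaultdict(list)
for pid, _page, ts in page_views:
    visits[pid].append(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))

for pid, times in sorted(visits.items()):
    times.sort()
    # Start a new session whenever more than 30 minutes elapse between page views.
    sessions = 1 + sum((b - a).total_seconds() > 1800 for a, b in zip(times, times[1:]))
    span_days = (times[-1] - times[0]).days
    print(f"{pid}: {len(times)} page views, {sessions} sessions, returned over {span_days} days")
```

In a trial that assigns unique identifiers to participants, a roll-up of this kind can be joined to outcome data to relate program exposure to observed change, which is how several of the trials cited above examined drop off and return use.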
3 CONCLUSIONS
The Internet is a potentially effective medium for delivering medical and preventive services to many populations. It shows many signs of continuing to diffuse rapidly in the United States and elsewhere and soon may be as universally available as older media such as videotape recorders and television. Adults express positive reactions to using the online health program (4). Younger generations in particular may be highly dependent on this new medium and any organization that wants to deliver medical and preventive services to them will need to be on it. However, the published community-based trials on Internet health programs are modest in number and several provide no efficacy data or use non-randomized research designs. The results of these studies are mixed at present, with some finding benefits from Internet health applications (5, 7, 23, 26, 30, 31, 34, 38, 39, 45) and others failing to find benefit (2, 21, 25, 33, 35, 41, 42). Although this failure may be in part caused by the rapidly increasing sophistication of the technology, program design, and users, Internet health applications should not be perceived as the panacea for every problem in delivering medical and preventive care. Much further research is needed to provide a solid understanding of under what circumstances these programs are truly superior, in
what ways they may best assist other interventions, and which features, components, or characteristics of the Internet will be most effective, and with what user populations. Access continues to present challenges, but, once it is solved, a large challenge will be to promote the continued use of Internet health programs to realize their full potential. Researchers interested in how to deliver medical and preventive services to community populations have only begun to answer these and other research questions on how the Internet can best play a role in this endeavor. REFERENCES 1. J. E. Gray et al., Baby CareLink: using the Internet and telemedicine to improve care for high-risk infants. Pediatrics 2000; 106: 1318–1324. 2. C. Homer et al., An evaluation of an innovative multimedia educational software program for asthma management: report of a randomized, controlled trial. Pediatrics 2000; 106: 210–215. 3. R. B. Jones et al., Randomised trial of personalised computer based information for patients with schizophrenia. BMJ 2001; 322: 835–840. 4. J. Finkelstein, M. R. Cabrera, and G. Hripcsak, Internet-based home asthma telemonitoring: can patients handle the technology? Chest 2000; 117: 148–155. 5. D. H. Gustafson et al., Impact of a patientcentered, computer-based health information/support system. Amer. J. Prev. Med. 1999; 16: 1–9. 6. K. Harno, T. Paavola, C. Carlson, and P. Viikinkoski, Patient referral by telemedicine: effectiveness and cost analysis of an Intranet system. J. Telemed. Telecare 2000; 6: 320–329. 7. P. Carlbring, B. E. Westling, P. Ljungstrand, L. Ekselius, and G. Andersson, Treatment of panic disorder via the Internet: a randomized trial of a self-help program. Behav. Ther. 2001; 32: 751–764. 8. P. H. Robinson and M. A. Serfaty, The use of e-mail in the identification of bulimia nervosa and its treatment. Eur. Eating Disord. Rev. 2001; 9: 182–193. 9. R. E. Rice and J. E. Katz, The Internet and Health Communication: Experiences and Expectations. Thousand Oaks, CA: Sage, 2001.
10. J. K. Burgoon et al., Testing the interractivity model: communication processes, partner assessments, and the quality of collaborative work. J. Manag. Inform. Syst. 2000; 16: 35–38. 11. C. Heeter, Implication of New Interactive Technologies for Conceptualizing Communication. Hillsdale, NJ: Lawrence Erlbaum Associates, 1989. 12. E. M. Rogers and M. M. Albritton, Interactive communication tecnologies in business organizations. J. Bus. Commun. 1995; 32: 177–195. 13. J. B. Walther, Computer-mediated communication: impersonal, interpersonal, and hyperpersonal interaction. Commun. Res. 1996; 23: 3–43. 14. M. Brodie et al., Health information, the Internet, and the digital divide. Health Aff. (Millwood). 2000; 19: 255–265. 15. S. Fox and D. Fallows, Internet Health Resources. Washington, DC: Pew Internet & American Life Project, 2003. 16. J. Monnier, M. Laken, and C. L. Carter, Patient and caregiver interest in Internetbased cancer services. Cancer Practice 2002; 10: 305–310. 17. D. B. Buller et al., Arresting tobacco uptake with interactive multimedia project: designing a web-based smoking prevention program for children. In: R. Rice and C. Atkin (eds.), Public Communication Campaigns. Thousand Oaks, CA: Sage, 2001. 18. J. R. Hall et al., Challenges to producing and implementing the Consider This web-based smoking prevention and cessation program. Elec. J. Commun. 2001; 11. 19. D. B. Buller et al., Formative research activities to provide Web-based nutrition education to adults in the Upper Rio Grande Valley. Fam.Community Health 2001; 24: 1–12. 20. D. E. Zimmerman et al., Integrating usability testing into the development of a 5 a day nutrition web site for at-risk populations in the American Southwest. J. Health Psychol. 2003; 8: 119–134. 21. T. Baranowski et al., The Fun, Food, and Fitness Project (FFFP): the Baylor GEMS pilot study. Ethn. Dis. 2003; 13: S30–S39. 22. S. H. Billipp, The psychosocial impact of interactive computer use within a vulnerable elderly population: a report on a randomized prospective trial in a home health care setting. Public Health Nurs. 2001; 18: 138–145. 23. A. A. Celio et al., Reducing risk factors for eating disorders: comparison of an Internet-
and a classroom-delivered psychoeducational program. J. Consult. Clin. Psychol. 2000; 68: 650–657. 24. K. J. Fisher, H. H. Severson, S. Christiansen, and C. Williams, Using interactive technology to aid smokeless tobacco cessation: a pilot study. Amer. J. Health Educ. 2001; 32: 332–342. 25. J. Harvey-Berino et al., Does using the Internet facilitate the maintenance of weight loss? Int. J. Obes. 2002; 26: 1254–1260. 26. A. Oenema, Web-based tailored nutrition education: results of a randomized controlled trial. Health Educ. Res. 2002; 16: 647–660. 27. J. J. Prochaska, M. F. Zabinski, K. J. Calfas, J. F. Sallis, and K. Patrick, PACE+: interactive communication technology for behavior change in clinical settings. Amer. J. Prev. Med. 2000; 19: 127–131. 28. C. N. Sciamanna et al., User attitudes toward a physical activity promotion website. Prev. Med. 2002; 35: 612–615. 29. D. F. Tate, E. H. Jackvony, and R. R. Wing, Effects of Internet behavioral counseling on weight loss in adults at risk for type 2 diabetes: a randomized trial. JAMA 2003; 289: 1833–1836. 30. D. F. Tate, R. R. Wing, and R. A. Winett, Using Internet technology to deliver a behavioral weight loss program. JAMA 2001; 285: 1172–1177. 31. A. J. Winzelberg et al., Effectiveness of an Internet-based program for reducing risk factors for eating disorders. J. Consult. Clin. Psychol. 2000; 68: 346–350. 32. S. I. Woodruff, C. C. Edwards, T. L. Conway, and S. P. Elliott, Pilot test of an Internet virtual world chat room for rural teen smokers. J. Adolesc. Health 2001; 29: 239–243. 33. M. F. Zabinski et al., Reducing risk factors for eating disorders: targeting at-risk women with a computerized psychoeducational program. Intl. J. Eating Disord. 2001; 29: 401–408. 34. G. Andersson, T. Stromgren, L. Strom, and L. Lyttkens, Randomized controlled trial of internet-based cognitive behavior therapy for distress associated with tinnitus. Psychosom. Med. 2002; 64: 810–816. 35. A. Atherton-Naji, R. Hamilton, W. Riddle, and S. Naji, Improving adherence to antidepressant drug treatment in primary care: a feasibility study for a randomized controlled trial of educational intervention. Primary Care Psychiatry 2001; 7: 61–67.
36. E. G. Feil, R. E. Glasgow, S. Boles, and H. G. McKay, Who participates in Internet-based self-management programs? A study among novice computer users in a primary care setting. Diabetes Educator 2000; 26: 806–811. 37. R. E. Glasgow, M. Barrera, H. G. McKay, and S. M. Boles, Social support, self-management, and quality of life among participants in an internet-based diabetes support program: a multi-dimensional investigation. Cyberpsychol. Behav. 1999; 2: 271–281. 38. D. H. Gustafson et al., Effect of computer support on younger women with breast cancer. J. Gen. Intern. Med. 2001; 16: 435–445. 39. S. Krishna, B. D. Francisco, E. A. Balas, P. Konig, G. R. Graff, and R. W. Madsen, Internet-enabled interactive multimedia asthma education program: a randomized trial. Pediatrics 2003; 111: 503–510. 40. K. R. Lorig et al., Can a Back Pain E-mail Discussion Group improve health status and lower health care costs?: a randomized study. Arch. Intern. Med. 2002; 162: 792–796. 41. H. G. McKay, R. E. Glasgow, E. G. Feil, S. M. Boles, and M. Barrera, Jr., Internet-based diabetes self-management and support: initial outcomes from the diabetes network project. Rehabil. Psychol. 2002; 47: 31–48. 42. H. G. McKay, D. King, E. G. Eakin, J. R. Seeley, and R. E. Glasgow, The diabetes network internet-based physical activity intervention: a randomized pilot study. Diabetes Care 2001; 24: 1328–1334. 43. S. J. Rolnick et al., Computerized information and support for patients with breast cancer or HIV infection. Nurs. Outlook 1999; 47: 78–83. 44. M. W. Tsang et al., Improvement in diabetes control with a monitoring system based on a hand-held, touch-screen electronic diary. J. Telemed. Telecare 2001; 7: 47–50. 45. A. J. Winzelberg et al., Evaluation of an internet support group for women with primary breast cancer. Cancer 2003; 97: 1164–1173. 46. S. C. Kalichman, L. Weinhardt, E. Benotsch, and C. Cherry, Closing the digital divide in HIV/AIDS care: development of a theory-based intervention to increase Internet access. Aids Care-Psycholog. Socio-Med. Aspects AIDS/HIV 2002; 14: 523–537. 47. U.S. Department of Commerce National Telecommunications and Information Administration, Falling Through the Net: Toward Digital Inclusion. Washington, DC: U.S. Department of Commerce, 2000. 48. U.S. Department of Education, National Center for Education Statistics, Computer and
Internet Use by Children and Adolescents in 2001, NCES 2004-014. Washington, DC: U.S. Department of Education, 2003. 49. U.S. Department of Commerce National Telecommunications and Information Administration, A Nation Online: How Americans are Expanding their Use of the Internet. Washington, DC: U.S. Department of Commerce, 2002. 50. J. B. Horrigan, Consumption of Information Goods and Services in the United States. Washington, DC: Pew Internet & American Life Project, 2003. 51. A. Lenhart, M. Simon, and M. Graziano, The Internet and Education: Findings of the Pew Internet and American life Project. Washington, DC: Pew Internet & American Life Project, 2003. 52. D. Levin and S. Arafeh, The Digital Disconnect: The Widening Gap Between InternetSavvy Students and their Schools. Washington, DC: Pew Internet & American Life Project, 2003. 53. T. Spooner, Internet Use by Region in the United States. Washington, DC: Pew Internet & American Life Project, 2003. 54. E. Fife and F. Pereira, Socio-economic and cultural factors affecting adoption of broadband access: a cross-country analysis. J. Commun. Network 2002; 1: 62–69. 55. J. Horrigan, Broadband Adoption at Home: A Pew Internet Project Data Memo. Washington, DC: Pew Internet & American Life Project, 2003. 56. S. J. Czaja and J. Sharit, Age differences in attitudes toward computers. J. Gerontol. B Psychol. Sci. Soc. Sci. 1998; 53: 329–340. 57. C. C. Hendrix, Computer use among elderly people. Comput. Nurs. 2000; 18: 62–68. 58. E. M. Alexy, Computers and caregiving: reaching out and redesigning interventions for homebound older adults and caregivers. Holist. Nurs. Pract. 2000; 14: 60–66. 59. H. White et al., A randomized controlled trial of the psychosocial impact of providing internet training and access to older adults. Aging Ment. Health 2002; 6: 213–221. 60. F. Bethke, Measuring the usability of software manuals. Tech. Commun. 1983; 32: 13–16. 61. R. Grice and L. S. Ridgway, A discussion of modes and motives for usability evaluation. IEEE Trans. Prof. Commun. 1989; 32: 230–237. 62. C. B. Mills and K. L. Dye, Usability testing. Tech. Commun. 1985; 32: 40–44.
63. J. Nielsen, Usability Engineering. Cambridge, MA: AP Professional, 1993. 64. T. Warren, Readers and microcomputers: approaches to increase usability. Tech. Commun. 1988; 35: 188–192. 65. N. J. Gray, J. D. Klein, J. A. Cantrill, and P. R. Noyce, Adolescent girls’ use of the Internet for health information: issues beyond access. J. Med. Syst. 2002; 26: 545–553. 66. C. R. Richardson et al., Does pornographyblocking software block access to health information on the Internet? JAMA 2002; 288: 2887–2894. 67. A. A. Celio, Improving compliance in on-line, structured self-help programs: evaluation of an eating disorder prevention program. J. Psychiatr. Practice 2002; 8: 14–20. 68. J. O. Prochaska and W. F. Velicer, The transtheorectical model of health behavior change. Amer. J. Health Prom. 1997; 12: 38–48. 69. R. Thomas, J. Cahill, and L. Santilli, Using an interactive computer game to increase skill and self-efficacy regarding safer sex negotiation: field test results. Health Educ. Behav. 1997; 24: 71–86. 70. Benton Foundation. Losing ground bit by bit: low-income communities in the information age. Washington, DC: Benton Foundation. 1998. 8-18-2000. 71. K. Alexander, Product alert. A selected list of Spanish language and low literacy patient education web sites. J. Ped. Health Care 2002; 16(3): 151–155.
VACCINE

IVAN S. F. CHAN
Clinical Biostatistics, Merck Research Laboratories, North Wales, Pennsylvania

1 WHAT ARE VACCINES?

Vaccines are biological products that are primarily designed to prevent infectious diseases. They are different from pharmaceutical products, which traditionally have been chemical products designed to treat or cure diseases. Vaccines work primarily by stimulating immune responses specific for disease protection. By introducing antigen, attenuated virus, or DNA plasmids into the body, vaccination aims to trigger an immune response in the form of antibodies or T-cell immunity that is specific for the target infectious agent. Rapid generation of antibodies and T-cell immunity is important in limiting the spread of most pathogens and preventing disease infection. Although antibodies may play an important role at the site of infection and in limiting the spread of the virus or bacteria, cell-mediated immune responses generated through T-cells may be required for the clearance of virus from infected cells. Vaccination induces immunologic memory of immune responses that enables the body to rapidly produce a large pool of antigen-specific antibodies and T-cells when the infectious agent is detected.

Most vaccines developed so far have focused on prevention of diseases in infants and young children. Some childhood diseases that are now preventable by vaccines include polio, hepatitis B, chickenpox, infant diarrhea (rotavirus), measles, mumps and rubella, Haemophilus influenzae type b, diphtheria, tetanus, pertussis (whooping cough), and pneumococcal disease. In recent years, research efforts have also been undertaken on development of vaccines for adolescents and adults to prevent diseases such as meningitis, hepatitis A, cervical cancer (caused by human papillomavirus), human immunodeficiency virus (HIV), herpes zoster, Lyme disease, pneumococcal pneumonia, and influenza. The widespread use of highly effective vaccines has led to the control of many once deadly infectious diseases, including the eradication of smallpox in 1980. The World Health Organization (WHO) is currently leading an initiative targeting polio as the next infectious disease to be eliminated from the earth. The remarkable success of vaccines has certainly proven to be one of the greatest achievements in public health in the twentieth century (1).

Vaccines are typically administered as a single dose or a single series over a short period of time, with the potential of a booster dose at a later time to maintain efficacy. For example, a vaccine to prevent chickenpox has been given as one single dose to children 1 to 12 years of age, and as a two-dose regimen (at least 4 weeks apart) to individuals 13 years of age or older. To date, a high level of disease protection maintained for at least 10 years following receipt of a chickenpox vaccine (Oka-Merck strain) has been demonstrated in clinical trials (2, 3). In another example, children are recommended to receive a five-dose series of a vaccine for prevention of diphtheria, tetanus, and pertussis before they reach 7 years of age, followed by a booster vaccination at 11 to 12 years of age. In adulthood, a tetanus and diphtheria toxoid booster vaccination is recommended every 10 years for continued protection. Recommendations of universal vaccination schedules are made by government bodies and vary by country. In the United States, universal vaccination recommendations are made by the Centers for Disease Control and Prevention.

2 CLINICAL DEVELOPMENT OF VACCINES

Vaccines have been successfully developed by using weakened live viruses, inactivated viruses, purified bacterial proteins and glycoproteins, and recombinant, pathogen-derived proteins. After successful basic research and animal studies, vaccine clinical development typically begins with small-scale testing of
tolerability and immunogenicity before the demonstration of safety and efficacy in larger clinical studies. Because almost all vaccines are developed for prophylaxis, vaccine clinical trials typically recruit healthy individuals, as opposed to patients in drug trials.

The entire clinical development process can generally be classified into four phases. Phase I vaccine clinical trials are early, small-scale human studies that explore the safety and immunogenicity of multiple dosage levels of a new vaccine. Results from phase I studies help researchers understand the maximum dosage level at which the vaccine is safe and tolerable as well as the minimum dosage level at which the vaccine is immunogenic. Phase II trials assess the safety, immunogenicity, and sometimes efficacy of selected doses of the vaccine and generate hypotheses for late-stage testing. Proof of concept for efficacy of the new vaccine is established during this phase, and one or two dosage levels of the vaccine may be chosen for further testing. Phase III trials, usually large in scale, seek to confirm the efficacy of the vaccine in the target population. This is the last stage of clinical studies before licensure of the vaccine is requested. Because of the variability associated with manufacturing of biological products, phase III clinical trials are also required to demonstrate consistency of manufacturing processes of a new vaccine. Phase IV trials are often conducted after licensure to collect additional information on the safety, immunogenicity, or efficacy of the vaccine to meet regulatory commitments or postmarketing objectives, such as expanding target populations of the drugs and vaccines. Mehrotra (4) and Chan et al. (5) provide more detailed descriptions of statistical considerations for different phases of vaccine clinical trials.

3 ASSESSING VACCINE EFFICACY
One of the most important steps in evaluating a new vaccine is to assess its protective efficacy against the targeted disease. Efficacy assessment usually starts in phase II with a proof-of-concept study, followed by a phase III, large-scale confirmatory trial. The goal of an efficacy trial is to evaluate
whether a new vaccine reduces the incidence (or severity) of the disease as compared with placebo controls in a randomized study. In some cases, another vaccine that does not contain the active ingredients of interest is used in lieu of the placebo. A common measure of vaccine efficacy (VE) is VE = 1 − RR, where RR is the relative risk of contracting disease among vaccine recipients compared with placebo recipients. Some key considerations for designing vaccine efficacy trials include the disease incidence rate in the target population, disease infectiousness and transmission rate, criteria for demonstrating efficacy (or precision of the efficacy estimate), and durability of vaccine efficacy. Because vaccines are given to healthy individuals for prophylaxis, the level of evidence and precision for demonstrating vaccine efficacy is high, compared with pharmaceutical products, to balance the benefits and risks to public health. For this reason, confirmatory trials of vaccine efficacy are generally required to show that the vaccine efficacy is not only statistically greater than zero, but statistically greater than a pre-established non-zero lower bound.

The sample size requirement for phase III vaccine efficacy trials depends largely on the disease incidence rate in the unvaccinated population and the VE lower bound requirement. The lower the disease incidence rate (or the higher the VE lower bound), the larger the sample size. For example, in a phase III efficacy trial of a new vaccine for prevention of hepatitis A disease, for which the disease incidence rate was predicted to be 3% in the unvaccinated subjects, approximately 1000 healthy children were enrolled and randomized to receive the vaccine or placebo (6). In planning a phase III pivotal trial to evaluate the efficacy of a new vaccine for prevention of invasive pneumococcal disease in infants, the sample size was estimated to be more than 30,000 subjects because the disease incidence rate was extremely low (∼0.2%) (7). A variety of statistical methods are available for planning and analyzing vaccine efficacy trials (5). The appropriate choice of method is context dependent. For comparison of incidence ratios, for example, time-to-event analyses based on the Cox regression framework as well as conditional and unconditional methods (7–9) have been proposed.

A recently completed study (Shingles Prevention Study) to evaluate a vaccine for the prevention of herpes zoster (HZ) illustrated some of the challenges in vaccine efficacy trials (10). Herpes zoster is caused by reactivation of varicella zoster virus, and its occurrence is generally related to declining immunity, associated primarily with natural aging. The primary complication of HZ is pain, including a chronic and severe pain condition called postherpetic neuralgia (PHN), which greatly interferes with general activities of daily living. The incidence of HZ was anticipated to be low (approximately 3 to 10 per 1000 person-years) among unvaccinated individuals, and the expected incidence rate of PHN among unvaccinated individuals was even lower (approximately 0.75 to 2.5 per 1000 person-years). To capture the potential vaccine benefit for disease prevention and the reduction of severity of pain associated with HZ, the trial used a novel composite endpoint based on the burden-of-illness concept (11), which is designed to measure the incidence, severity, and duration of HZ and its associated pain. The incidence of PHN was considered another important endpoint in the trial. The success criteria for the study required showing vaccine efficacy that was greater than a 25% lower bound (based on the 95% confidence interval [CI]) in at least one of these two endpoints, with appropriate adjustment for multiple hypothesis tests to control the overall false-positive rate (10). The low incidence of HZ and PHN as well as the strict criteria for success called for a very large study.

The Shingles Prevention Study enrolled approximately 38,500 individuals 60 years of age and older, with half of participants receiving a single injection of the new vaccine and the other half receiving a placebo injection. The trial participants were followed for symptoms of HZ for an average of 3.1 years, and a total of 957 confirmed HZ cases were identified (315 among vaccine recipients and 642 among placebo recipients). In addition, there were a total of 107 PHN cases (27 vaccine and 80 placebo). The study showed strong vaccine efficacy for the reduction of HZ incidence (51.3%; 95% CI, 44.2–57.6%), the reduction of PHN incidence (66.5%; 95% CI, 47.5–79.2%) and the reduction of the burden-of-illness associated with HZ pain (61.1%; 95% CI, 51.1–69.1%) (10). These vaccine efficacies were highly statistically significant (all P-values < 0.001) and exceeded the prespecified success criteria. It is expected that the HZ vaccine will have profound public health benefits for the elderly population.

Clinical trials are generally designed to measure the direct benefit of a vaccine to individuals who receive the vaccine. Vaccination also may offer indirect benefits to unvaccinated individuals by reducing the secondary transmission of disease in the community. This indirect benefit of the vaccine is called herd immunity (12), which is an important concept for measuring the public health impact of a vaccine. Halloran et al. (13) and Halloran (14) gave comprehensive reviews of direct and indirect vaccine efficacy and effectiveness in field studies. Moulton et al. (15) described the evaluation of the indirect effects of a pneumococcal vaccine in a community-randomized study.
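As a worked illustration of the VE = 1 − RR calculation, the sketch below recomputes the headline HZ result of the Shingles Prevention Study from the case counts quoted above (315 vaccine versus 642 placebo cases among roughly 38,500 participants randomized evenly between arms). Equal follow-up in the two arms is assumed here, and a simple large-sample interval on the log relative-risk scale is used, so the output only approximates the published time-to-event estimate of 51.3% (95% CI, 44.2–57.6%).

```python
import math

# Confirmed HZ case counts reported above; equal person-time per arm is an
# assumption made for this sketch, not a feature of the published analysis.
cases_vaccine, cases_placebo = 315, 642
n_per_arm = 38_500 // 2   # ~38,500 participants randomized 1:1

rr = (cases_vaccine / n_per_arm) / (cases_placebo / n_per_arm)
ve = 1 - rr

# Large-sample 95% confidence interval on the log relative-risk scale.
se = math.sqrt(1 / cases_vaccine - 1 / n_per_arm + 1 / cases_placebo - 1 / n_per_arm)
rr_lo = math.exp(math.log(rr) - 1.96 * se)
rr_hi = math.exp(math.log(rr) + 1.96 * se)

print(f"VE = {ve:.1%}, 95% CI ({1 - rr_hi:.1%}, {1 - rr_lo:.1%})")
# Prints roughly: VE = 50.9%, 95% CI (44.0%, 57.0%)
```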
4 ASSESSING VACCINE IMMUNOGENICITY
Critical to understanding the biological mechanism of immune responses, the immunogenicity of a new vaccine is studied in all phases of clinical development. Immunogenicity is commonly measured by (1) geometric means titer or concentration of immune responses (antibody or T-cell responses), and (2) an immune response rate, defined as the percentage of subjects who achieve at least a certain level of immune response. In clinical trials of pediatric vaccines, where almost all children are seronegative (with no preexposure and no measurable immune responses) before vaccination, emphasis is generally on measuring the immune responses after vaccination. In trials of adolescent and adult vaccines, where participants may have had prior exposure and hence some preexisting immunity, additional endpoints, such as geometric mean fold-rise in immune levels from baseline to postvaccination (i.e., change from baseline) or the percentage of subjects achieving a certain level of fold-rise from baseline, are also important measures to evaluate the immune responses to the vaccine.
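The summary measures just described (geometric mean titer, geometric mean fold-rise, and an immune response rate) can be computed directly from paired pre- and postvaccination titers, as in the minimal sketch below. The titer values and the four-fold-rise response definition are hypothetical, chosen only to illustrate the calculations; they do not come from any particular trial.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive titer values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical pre- and postvaccination antibody titers for eight participants.
pre  = [ 5,  10,  5,  20, 10,  5,  40,  10]
post = [80, 160, 40, 320, 80, 10, 640, 160]

gmt  = geometric_mean(post)                                   # geometric mean titer
gmfr = geometric_mean([b / a for a, b in zip(pre, post)])     # geometric mean fold-rise
response_rate = sum(b >= 4 * a for a, b in zip(pre, post)) / len(pre)  # >=4-fold rise

print(f"GMT = {gmt:.1f}, GMFR = {gmfr:.1f}, response rate = {response_rate:.0%}")
# Prints roughly: GMT = 103.7, GMFR = 10.4, response rate = 88%
```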
Early (phase I/II) immunogenicity studies generally aim to assess whether the vaccine can produce a quantifiable immune response in terms of antibody responses or cell-mediated immunity (T-cell responses). These early studies are also important in identifying markers of immune responses to the vaccine, evaluating the dose-response pattern of the vaccine, and determining the minimum effective dose or the optimal dosing schedule. If an immune marker is identified, it may be further tested in late-stage efficacy clinical trials to determine whether the immune marker can be used as a surrogate endpoint for efficacy or as a correlate of protection. The evaluation of an immune marker as a surrogate endpoint may include the following steps, based on the Prentice Criteria (16): 1. Demonstrate that the vaccine is efficacious as compared with placebo. 2. Demonstrate that the vaccine induces immune responses. 3. Demonstrate that the level of immune responses is correlated with efficacy. 4. Demonstrate that, at a given level of immune response, the probability of developing the disease is the same in vaccinated individuals as in unvaccinated individuals. Of note, step 4 is a very strict criterion, requiring that the immune marker capture the full vaccine effect on disease protection. This criterion is very difficult to meet in practice. As a result, it has been proposed to calculate measures of the strength of surrogacy using the proportion of vaccine effect explained by the surrogate endpoint (17, 18) or using the relative effect and adjusted association (19). A good surrogate endpoint should explain a large portion of the vaccine effect on disease protection and should have high association with disease protection after adjusting for the effect of vaccine. However, because these classic methods rely on having sufficient numbers of disease cases from both vaccinated and unvaccinated individuals, they may be difficult to apply when the vaccine is highly immunogenic and highly efficacious such that very few vaccinated individuals develop disease after vaccination. For example, a large clinical trial of
a seven-valent vaccine to prevent pneumococcal disease in ∼37,000 children demonstrated a 97.4% vaccine efficacy against invasive pneumococcal disease (20). As almost all disease cases occurred in the placebo group, a correlation between immune responses and protection could not be determined. Because of this potential issue, vaccine researchers focus on the concept of a “correlate of protection” or “approximate correlate of protection” (5, 21). By examining vaccine failures, researchers aim to establish a “protective level” or “approximate protective level” of immune responses that can be used to predict disease protection either on a population level or on an individual level. Statistical modeling approaches have also been proposed to assess correlates of protection using the whole continuum of immune responses and to predict vaccine efficacy based on immunogenicity data (22–24).

Once an immune surrogate or correlate of protection has been established, immunogenicity studies can be used as efficient alternatives to efficacy trials, for reasons of time and cost, in evaluating future vaccines. After successful demonstration of vaccine efficacy, for example, immunogenicity trials are typically used to demonstrate the consistency of the vaccine manufacturing process; to prove vaccine effectiveness when the manufacturing processes, storage conditions, or vaccination schedules are modified; to expand the indication to new populations; to justify concomitant use with other vaccines; or to develop a new combination vaccine with multiple components. The primary objective of such immunogenicity studies is to demonstrate that a new (or modified) vaccine is noninferior (or equivalent) to the current vaccine by ruling out a prespecified, clinically relevant difference in the immune response. More comprehensive descriptions of statistical considerations for designing and analyzing vaccine noninferiority trials can be found in Wang et al. (25).

5 ASSESSING VACCINE SAFETY

The importance of safety assessment of a new vaccine is paramount because the vaccine is potentially administered to millions
of healthy individuals worldwide for disease prophylaxis. Compared with drug products, the expectation of the benefit-to-risk ratio for vaccines is much greater in order to justify their widespread use in healthy populations. Safety evaluations begin with the earliest preclinical experiments and continue throughout the clinical development program and marketing lifespan of the vaccine. The ICH E9 guidelines (26) give specific recommendations on safety evaluation in clinical trials that pertain to both drugs and vaccines. Factors affecting vaccine safety generally include the type of vaccine (e.g., live virus or inactivated), biological mechanism for eliciting immune responses (e.g., antibody based or T-cell based), target population (e.g., infants, children, adolescents, or adults), and route of administration (e.g., injection or oral). Because vaccines stimulate the immune system, they may cause allergic or anaphylactic reactions. These adverse experiences or events (AEs) are closely monitored in every clinical trial. For vaccines administered via injection, some common AEs found at the injection site include pain and tenderness, swelling, and redness. These AEs usually occur shortly after vaccination (mostly within 1 to 5 days). It is important to assess the duration and severity of AEs to determine whether the vaccine is well tolerated. Vaccines can also cause systemic AEs such as fever, muscle pain, and headache. In addition, some live-virus vaccines can produce symptoms similar to those caused by the natural infection. Depending on the type of vaccine, systemic AEs are monitored for a few weeks to a few months after vaccination, and it is important to understand the temporal trend of systemic AEs as well as their duration and severity. In clinical trial settings, standardized diary cards, such as a validated vaccination report card (27), for safety data collection are recommended to ensure ease of use and to promote completeness and consistency of reporting AEs. Incidence of rare and serious AEs is another important parameter in evaluating vaccine safety because of the potential widespread use of the vaccine in millions of people. Data about these rare and serious
AEs will accumulate throughout the investigational clinical development phase; nevertheless, rare AEs may still be missed due to the limited size of clinical trials. As a result, monitoring of rare and serious AEs generally continues into the postmarketing setting, which can provide large sample sizes in usual care conditions to increase the precision of estimating the rate of rare events. Postmarketing studies also allow safety assessment in special populations and subgroups or further investigation on safety signals emerging in prelicensure clinical trials. As a general guideline, the precision of the database for rare and serious AEs depends on the sample size (n) and follows the ‘‘rule of three’’ (28), which states that 3/n is an upper 95% confidence bound for the true AE rate P when no events occur among n individuals. Applying this rule to a vaccine program of n = 5000 participants gives an upper bound estimate of P = 0.0006 (i.e., an incidence rate of 6 per 10,000 persons) for AEs not observed in studies. Interpretation of this statistical statement means that administration of the vaccine to 1 million persons could yield as many as 600 serious AEs (5). Statistical considerations for large trials for assessing safety of a new vaccine were given by Ellenberg (29), who proposed a sample size on the order of 60,000 to 80,000, using the methods appropriate under simple binomial sampling. In the late-stage development of a new rotavirus vaccine, Sadoff et al. (30) proposed a study design for including at least 60,000 infants in a placebo-controlled, randomized clinical trial to detect an increased risk of intussusception (a rare and serious AE associated with a previous rotavirus vaccine made by another manufacturer). Extensive monitoring of intussusception using multiple stopping boundaries was incorporated in the study design to ensure a high probability of detecting a real safety signal and a high power of demonstrating that the new vaccine is safe if its safety profile is similar to that of placebo. This design was used in the Rotavirus Efficacy and Safety Trial (31), in which approximately 68,000 infants were randomized to receive either the vaccine or placebo and monitored for serious AEs. At the end of the trial (conducted in several
countries over 4 years), a total of 27 intussusception cases were identified (12 vaccine and 15 placebo) within 1 year after the first dose of vaccination. Among these cases, 11 occurred during the primary safety followup period (42 days after any dose) with a split of 6 cases in the vaccine group and 5 in the placebo group. The relative risk (vaccine/placebo) was 1.6 (95% CI, 0.4–6.4) after adjusting for multiple sequential looks during the trial. It was concluded that the study had provided sufficient evidence showing that the risk of intussusception was similar in vaccine and placebo recipients. REFERENCES 1. Centers for Disease Control and Prevention. Ten great public health achievements— United States, 1900–1999. MMWR Morb Mortal Wkly Rep. 1999; April 2; 48(12): 241–243. 2. S. J. R. Vessey, C. Y. Chan, B. J. Kuter, K. M. Kaplan, M. Waters, et al., Childhood vaccination against varicella: persistence of antibody, duration of protection, and vaccine efficacy. J Pediatr. 2001; 139: 297–304. 3. B. Kuter, H. Matthews, H. Shinefield, S. Black, P. Dennehy, et al., Ten year followup of healthy children who received one or two doses of a varicella vaccine. Pediatr Infect Dis J. 2004; 23: 132–137. 4. D. V. Mehrotra, Vaccine clinical trials—a statistical primer. J. Biopharm Stat. 2006; 16: 403–414. 5. I. S. F. Chan, W. W. Wang, and J. F. Heyse, Vaccine clinical trials. In: S. C. Chow (ed.), The Encyclopedia of Biopharmaceutical Statistics, 2nd ed. New York: Marcel Dekker, 2003, pp. 1005–1022. 6. A. Werzberger, B. Mensch, B. Kuter, L. Brown, J. Lewis, et al., A controlled trial of a formalininactivated hepatitis A vaccine in healthy children. N Engl J Med. 1992; 327: 453–457. 7. I. S. F. Chan and N. R. Bohidar, Exact power and sample size for vaccine efficacy studies. Commun Stat Theory Methods. 1998; 27: 1305–1322. 8. I. S. F. Chan, Exact tests of equivalence and efficacy with a non-zero lower bound for comparative studies. Stat Med. 1998; 17: 1403–1413. 9. M. Ewell, Comparing methods for calculating confidence intervals for vaccine efficacy. Stat Med. 1996; 15: 2379–2392.
10. M. N. Oxman, M. J. Levin, G. R. Johnson, K. E. Schmader, S. E. Straus, et al., A vaccine to prevent herpes zoster and postherpetic neuralgia in older adults. N Engl J Med. 2005; 352: 2271–2284. 11. M. N. Chang, H. A. Guess, and J. F. Heyse, Reduction in burden of illness: (A) new efficacy measure for prevention trials. Stat Med. 1994; 13: 1807–1814. 12. P. E. M. Fine, Herd immunity: history, theory, practice. Epidemiol Rev. 1993; 15: 265–302. 13. M. E. Halloran, M. Haber, I. M. Longini, Jr., C. J. Struchiner, Direct and indirect effects in vaccine efficacy and effectiveness. Am J Epidemiol. 1991; 133: 323–331. 14. M. E. Halloran, Overview of vaccine field studies: types of effects and designs. J. Biopharm Stat. 2006; 16: 415–427. 15. L. H. Moulton, K. L. O’Brien, R. Reid, R. Weatherholtz, M. Santosham, and G. R. Siber, Evaluation of the indirect effects of a pneumococcal vaccine in a community-randomized study. J. Biopharm Stat. 2006; 16: 453–462. 16. R. L. Prentice, Surrogate endpoints in clinical trials: definition and operational criteria. Stat Med. 1989; 8: 431–440. 17. L. Freedman, B. I. Graubard, and A. Schatzkin, Statistical validation of intermediate endpoints for chronic disease. Stat Med. 1992; 11: 167–178. 18. D. Y. Lin, T. R. Fleming, and V. DeGruttola, Estimating the proportion of treatment effect explained by a surrogate marker. Stat Med. 1997; 16, 1515–1527. 19. M. Buyse and G. Molenberghs, The validation of surrogate endpoints in randomized experiments. Biometrics. 1998; 54: 1014–1029. 20. S. Black, H. Shinefield, B. Fireman, E. Lewis, P. Ray, et al., Efficacy, safety and immunogenicity of heptavalent pneumococcal conjugate vaccine in children. Pediatr Infect Dis J. 2000; 19: 187–195. 21. S. Li, I. S. F. Chan, H. Matthews, J. F. Heyse, C. Y. Chan, et al., Childhood vaccination against varicella: inverse relationship between 6-week postvaccination varicella antibody response and likelihood of long-term breakthrough infection. Pediatr Infect Dis J. 2002; 21: 337–342. 22. I. S. F. Chan, S. Li, H. Matthews, C. Chan, R. Vessey, J. Sadoff, J. Heyse, Use of statistical models for evaluating antibody response as a correlate of protection against varicella. Stat Med. 2002; 21: 3411–3430. 23. L. Jodar, J. Butler, G. Carlone, R. Dagan, D. Goldblatt, et al., Serological criteria for
evaluation and licensure of new pneumococcal conjugate vaccine formulations for use in infants. Vaccine. 2003; 21: 3265–3272. 24. A. J. Dunning, A model for immunological correlates of protection. Stat Med. 2006; 25: 1485–1497. 25. W. W. B. Wang, D. V. Mehrotra, I. S. F. Chan, and J. F. Heyse, Statistical considerations for noninferiority/equivalence trials in vaccine development. J. Biopharm Stat. 2006; 16: 429–441. 26. ICH E9 Expert Working Group. Statistical principles for clinical trials: ICH harmonized tripartite guidelines. Stat Med. 1999; 18: 1905–1942. 27. P. Coplan, L. Chiacchierini, A. Nikas, J. Sha, A. Baumritter, et al., Development and evaluation of a standardized questionnaire for identifying adverse events in vaccine clinical trials. Pharmacoepidemiol Drug Saf. 2000; 9: 457–471. 28. B. D. Jovanovic and P. S. Levy, A look at the rule of three. Am Stat. 1997; 51: 137–139. 29. S. S. Ellenberg, Safety considerations for new vaccine development. Pharmacoepidemiol Drug Saf. 2001; 10: 1–5. 30. J. Sadoff, J. F. Heyse, P. Heaton, and M. Dallas, Designing a study to evaluate whether the Merck rotavirus vaccine is associated with a rare adverse reaction (intussusception). U.S. Food and Drug Administration Workshop: Evaluation of new vaccines: how much safety data. Bethesda, Maryland, November 14–15, 2000. 31. T. Vesikari, D. O. Matson, P. Dennehy, P. Van Damme, M. Santosham, et al., Safety and efficacy of a pentavalent human-bovine (WC3) reassortant rotavirus vaccine. N Engl J Med. 2006; 354: 23–33.
CROSS-REFERENCES Benefit/risk assessment Clinical trial Confirmatory trials Efficacy Endpoint Placebo Primary prevention trials Safety Surrogate endpoints Trial monitoring
VACCINE ADVERSE EVENT REPORT SYSTEM (VAERS)
The Vaccine Adverse Event Reporting System (VAERS) is a national vaccine safety surveillance program co-sponsored by the U.S. Food and Drug Administration (FDA) and the Centers for Disease Control and Prevention (CDC). The VAERS program was created as an outgrowth of the National Childhood Vaccine Injury Act of 1986 (NCVIA) and is administered by the FDA and CDC. The purpose of VAERS is to detect possible signals of adverse events associated with vaccines. VAERS collects and analyzes information from reports of adverse events (possible side effects) that occur after the administration of U.S. licensed vaccines. Reports are welcome from all concerned individuals: patients, parents, healthcare providers, pharmacists, and vaccine manufacturers. More than 10 million vaccinations per year are given to children less than 1 year old, usually between 2 months and 6 months of age. At this stage of development, infants are at risk for a variety of medical events and serious childhood illnesses. These naturally occurring events include fevers, seizures, sudden infant death syndrome (SIDS), cancer, congenital heart disease, asthma, and other conditions. Some infants coincidentally experience an adverse event shortly after a vaccination. In such situations an infection, congenital abnormality, injury, or some other provocation may cause the event. Because of such coincidences, it is usually not possible from VAERS data alone to determine whether a particular adverse event resulted from a concurrent condition or from a vaccination, even when the event occurs soon after vaccination. Doctors and other vaccine providers are encouraged to report adverse events, whether or not they believe that the vaccination was the cause. The National Childhood Vaccine Injury Act (NCVIA) requires health-care providers to report:
• Any event listed by the vaccine manu-
facturer as a contraindication to subsequent doses of the vaccine. • Any event listed in the Reportable Events Table that occurs within the specified time period after vaccination. The Reportable Events Table specifically outlines the reportable postvaccination events and the time frames in which they must occur to qualify as being reportable. VAERS encourages the reporting of any significant adverse event occurring after the administration of any vaccine licensed in the United States. Any significant adverse event should be reported, even if it is unclear whether a vaccine caused the event. The report of an adverse event to VAERS is not proof that a vaccine caused the event. Sometimes, an event is noted but the evidence may not be adequate to conclude that a noted event is due to the vaccine. The FDA continually monitors VAERS reports for any unexpected patterns or changes in rates of adverse events. If the VAERS data suggest a possible link between an adverse event and vaccination, the relationship may be further studied in a controlled fashion. Both the CDC and the FDA review data reported to VAERS. The FDA reviews reports to assess whether a reported event is adequately reflected in product labeling, and closely monitors reporting trends for individual vaccine lots. Analyzing VAERS reports is a complex task. Children are often administered more than one vaccine at a time, making it difficult to know which vaccine, if any, may have contributed to any subsequent adverse events. Approximately 85% of the reports describe mild events such as fever, local reactions, episodes of crying or mild irritability, and other less serious experiences. The remaining 15% of the reports reflect
This article was modified from the website of the United States Food and Drug Administration (http://www.fda.gov/cber/vaers/vaers.htm, http:// www.fda.gov/cber/vaers/faq.htm, and http://www. fda.gov/cber/vaers/what.htm) by Ralph D’Agostino and Sarah Karl.
serious adverse events involving life-threatening conditions, hospitalization, permanent disability, or death, which may or may not have been truly caused by an immunization. As part of the VAERS program, FDA reviews the deaths and serious reports weekly, and conducts follow up investigations. In some cases, certain vaccines and potentially associated symptoms will receive more intense follow-up evaluation. In addition to analyzing individual VAERS reports, the FDA also analyzes patterns of reporting associated with vaccine lots. Many complex factors must be considered in comparing reports between different vaccine lots. More reports may be received for a large lot than for a small one simply because more doses of vaccine from the large lot will be given to more children. Some lots contain as many as 700,000 doses, and others as few as 20,000 doses. Similarly, more reports will be received for a lot that has been in use for a long time than for a lot that has been in use for a short time. Even among lots of similar size and time in use, some lots will receive more reports than others will simply due to chance. The FDA continually looks for lots that have received more death reports than would be expected on the basis of such factors as time in use and chance variation as well as any unusual patterns in other serious reports within a lot. If such a lot is detected, further review is conducted to determine if the lot continues to be safe for use or if additional FDA actions are needed. VAERS is a valuable tool for postmarketing safety surveillance (monitoring after a product has been approved and is on the market). Although extensive studies are required for licensure of new vaccines, postmarketing research and surveillance are necessary to identify safety issues that may only be detected after vaccination of a much larger and more diverse population. Rare events may not come to light before licensure.
WEB BASED DATA MANAGEMENT SYSTEM
HOPE E. BRYAN Collaborative Studies Coordinating Center, Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina

Data management in clinical trials comprises the collection, cleaning, and validating of study data. Data management systems assist study personnel in completing these tasks in a timely, efficient way to ensure the completeness and accuracy of the data. Earlier data management systems were manual, paper-based, centralized systems. Study data from participant interviews or medical records were recorded on paper case report forms (CRF) at clinical sites. The CRFs were sent to a coordinating center or professional keying service and keyed into a database hosted on a mainframe computer. Data queries were generated on paper forms and returned to the clinical sites. Clinicians recorded corrections and notes on the paper queries, which were returned and then used to correct the database (1). Eligibility determination was done using paper forms and written algorithms. Initially, paper was the only option. But with the advent of affordable personal computers, decentralized data management systems became an option. The clinical trials industry was slow to accept electronic substitutes for paper, in part because of confusion over the U.S. Food and Drug Administration's (FDA's) requirements for validation of electronic data capture systems (2). With the clarification and elucidation of those guidelines, the industry began to replace paper systems with electronic ones. Early electronic data capture systems were installed on computers at clinical sites. Data collection software and the database were installed on local computers. Site data were transferred periodically to a central location and used to update a central database. Other systems used a client-server model. The client software was installed on the local computers, which connected over a network to a central database. Since the late 1990s, with the expansive growth of the World Wide Web and the development of software tools for web applications, web-based data management systems have become popular. Web systems are browser-based and some can be used online or offline (3). As with other remote data management systems, web-based systems can be used for distributed data entry from paper case report forms; for computer-assisted data collection directly onto electronic case report forms; and for computer-assisted self-interviews by participants. In deciding whether to implement a web-based system, several issues must be considered. First are the needs and characteristics of the study (an aspect often ignored) and the characteristics of the study sites. It is important to understand what capabilities a web system offers before deciding to implement it. System requirements and cost should be considered. Last, web-based data management systems have advantages and disadvantages over other kinds of data management systems, and these must be taken into account.
1 STUDY AND STUDY SITE CHARACTERISTICS Web-based data management systems are more appropriate for some studies than for others. These studies include those in which: • Immediate access to data as soon as it
is entered is important. This aspect is one of the primary benefits of web-based systems because data are written to the server on entry and are immediately available for retrieval, reporting, and analysis. • Eligibility criteria are complex, include
data from several sources (e.g., laboratories), and are best performed by computer to reduce error. Interactive Voice Randomization systems (IVR) can do this as well but are not always integrated into the study data management system.
• Many participating sites have small numbers of participants per site, so maintaining a remotely installed system is inefficient. • Data collection forms are of reasonable length and complexity. In web systems, the only software on the user computer is a web browser. All form content and code to control form behavior must be downloaded from the server when the form is displayed. Very large forms with complex real-time validation checks and complicated conditional field patterns can take so long to load that the user becomes impatient. In deciding whether to implement a web-based system, the capabilities and characteristics of participating study sites should be evaluated. Sometimes study investigators prefer electronic data capture whereas experienced study coordinators do not. The following must be true about study sites: • All locations at which data are collected
have Internet connections, and the connections are sufficiently fast. • Clinics in which nurse coordinators run several studies may have trouble remembering how to use systems. • Clinic logistics may make it difficult to enter data directly while interacting with a patient.
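The computer-based eligibility checking described in the bullet list above can be made concrete with a brief sketch. The function below is purely illustrative and is not drawn from any particular system; the criteria, cutoffs, and parameter names are hypothetical stand-ins for values that a real system would take from the protocol and the study database.

    # Minimal sketch of an automated eligibility check combining data from
    # several sources (clinic, laboratory, pharmacy). All criteria are hypothetical.
    def eligible(age, egfr_ml_min, on_excluded_med, signed_consent):
        reasons = []
        if not (18 <= age <= 75):
            reasons.append("age outside 18-75")
        if egfr_ml_min < 60:                  # laboratory-supplied value
            reasons.append("eGFR below 60 mL/min")
        if on_excluded_med:                   # pharmacy-supplied flag
            reasons.append("taking an excluded medication")
        if not signed_consent:
            reasons.append("no signed informed consent")
        return len(reasons) == 0, reasons

    ok, why_not = eligible(age=52, egfr_ml_min=74, on_excluded_med=False, signed_consent=True)
    print(ok, why_not)                        # True, []

A web-based system would run such a check server-side at screening, pulling the laboratory and pharmacy values from the central database rather than passing them in by hand.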
2 SYSTEM CAPABILITIES
Web-based data management systems include some capabilities of other electronic systems and also offer some unique features. When deciding what type of system to implement, the capabilities of the system must be evaluated with respect to the needs of the study and compared with those offered by other types of systems. The following are features that may be included in web-based systems and that should be evaluated. • Data Capture and Validation: All
web-based data management systems allow electronic data capture. Most systems offer real-time field validation that
notifies users when fields are invalid or out of range. Validation checks can ensure completion of required fields or required forms, enforce skip rules, and detect inconsistent or illogical values across fields. Some web systems only validate fields when a data record is saved to the server. These systems present the user with a list of invalid fields to correct. Other systems validate each field on entry. • Data Integration: Web-based data management systems, with their access to the complete study database, can run eligibility checks for patient randomization and permit balancing within randomization strata (4). IVR systems allow clinicians to telephone a centralized system, enter eligibility criteria, and receive treatment assignment. Some web-based randomization systems provide a similar facility and can be integrated with the data management system. Some systems offer telephone and web-based access to the same eligibility algorithms and database, which allows the user to choose the most appropriate interface method (5). In addition to keyboard entry, systems can allow files to be uploaded in batch mode. In this way, data from many sources can be immediately integrated into the database and available for reporting and analysis. • Data Queries: Because data are stored centrally for web-based systems, error reports can be generated in real-time or as frequently as needed. Systems include query resolution tools that allow users to view error reports and resolve errors via the web. Thus, data can be cleaned and locked more rapidly. • Reports: In addition to error reports, web-based systems can generate data reports that use the consolidated database to advantage. Reports run at the sites against the study database can assist in study management, site monitoring, subject enrollment, and adverse events reporting (6). Some reports may be time-critical, and some may have access restricted to certain users. For example, reports of numbers of screened
and randomized patients by site and across sites allow sites to compare their progress with others. Which data reports are included in a system and how easy it is to generate custom reports is an important feature to consider. • Messaging: Because web systems connect to the Internet, e-mail messaging can be integrated into the system. For example, a help desk request system that submits electronic requests can easily be incorporated. The help request can be populated with the user's contact information and the request directed to the appropriate support staff. Protocol questions can be directed to research staff, whereas data entry system problems can be directed to the computing staff (6). Automatic e-mail messages can also be used to order study drug after treatment assignment, alert medical officers of serious adverse events, and notify users of system problems. • Standards: In 1997, in Title 21 Code of Federal Regulations (21 CFR Part 11) (22), the FDA issued rules that govern electronic records and electronic signatures. The rules provided criteria for the acceptance of electronic records and electronic signatures as equivalent to paper records and handwritten signatures. Part 11 signatures include electronic signatures that are used to document that certain events or actions occurred, for example, approval, review, and verification of data. Electronic signatures can be checked to authenticate the identity of the signer and can be used to ensure that the record has not been modified after signing. Electronic data collection systems, including web-based systems, that comply with these regulations have capabilities for signing electronic CRFs. Although the FDA is reevaluating the scope and application of Part 11, it intends to enforce the provisions applicable to records in electronic rather than paper format. Thus, compliance with Part 11 is a necessary feature for any electronic data management system used for studies in which data will be submitted to the
FDA. In 1999, the FDA published Guidance for Industry: Computerized Systems Used in Clinical Trials (7). This document describes specific requirements for adhering to 21 CFR Part 11 to guarantee data quality when computerized systems are used. It lists such things as the need for audit trails, signatures, date and time stamps, security, and training. • Data Storage and Export: Data are stored in database systems and must be extracted for analysis and data transport. Data management systems have facilities for exporting data from the database into other systems for analysis or data sharing. A common format for exported data files is Extensible Markup Language (XML). XML is a general-purpose markup language whose primary purpose is to facilitate the sharing of data across different information systems, particularly Internet-based systems. XML has several advantages for data transfer. It is self-documenting because the format describes the structure, field names, and values. All information in an XML file is coded as text, which makes it easily shared by different software applications and across different computing platforms (8). The Clinical Data Interchange Standards Consortium (CDISC) has developed several XML-based standards for storing and archiving electronic data. The CDISC Operational Data Model is a vendor-neutral, platform-independent format specifically for the interchange and archive of clinical trials data. The model includes the clinical data along with its associated metadata, administrative data, reference data, and audit information (9).
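To make the export step concrete, the short sketch below writes a few captured fields to a simple XML file using only the Python standard library. It is a generic illustration rather than the CDISC Operational Data Model itself; the element names, attribute names, and study identifier are invented for this example.

    import xml.etree.ElementTree as ET

    # Hypothetical captured records: one dict per subject visit.
    records = [
        {"subject_id": "001", "visit": "BASELINE", "sbp": "128", "dbp": "84"},
        {"subject_id": "002", "visit": "BASELINE", "sbp": "141", "dbp": "90"},
    ]

    root = ET.Element("StudyData", attrib={"study": "EXAMPLE-01"})   # invented names
    for rec in records:
        row = ET.SubElement(root, "Record",
                            attrib={"SubjectID": rec["subject_id"], "Visit": rec["visit"]})
        for field in ("sbp", "dbp"):
            item = ET.SubElement(row, "Item", attrib={"Name": field})
            item.text = rec[field]          # every value is stored as text, as noted above

    ET.ElementTree(root).write("export.xml", encoding="utf-8", xml_declaration=True)

Because every value is written as text, the resulting file can be read by any XML-aware application, which is the property noted above.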
3 SYSTEM REQUIREMENTS, DEVELOPMENT MODELS, AND COSTS
Web-based systems are based on a three-tiered software model. The three tiers comprise the client, the web application server, and the database server. This model uses a
thin client, which is the web browser installed on a user computer; the browser may run small parts of the application via scripts downloaded when a screen is displayed. The application server executes the main application code and interfaces between the client and the database on the database server. The servers are in a central location and are under the control of the study information technology group, which manages the configurations. • Client Requirements: The only soft-
ware requirement for a system user is a computer with an Internet connection and a web browser. Some systems require a particular browser and operating system, and some systems can run with any browser. The browser may have required settings, such as allowing cookies, or particular security settings, but no additional software is required. Most systems work better with a high-speed Internet connection rather than a modem connection. • Web Application Requirements: When the user requests a form for data entry, the application software on the server renders the form in a format the user's browser can understand. When the user saves a record, the browser sends the data to the application software on the server, which sends the data record to the database. The application software is data-management software and can be purchased from a commercial vendor, developed in-house or, if open source, installed and used with no purchase cost. The choice of database, sometimes called the backend, is independent of the application code. The most popular database model used for clinical trials systems is the relational model. In a relational database, data are organized in tables. Commercial products such as Oracle, Microsoft SQL Server, Microsoft Access, Informix, Sybase, and DB2, as well as open-source relational database systems, such as mySQL, PostgreSQL, and Ingres, are some commonly used relational databases. The application software and database can run on the same server or on two separate servers. Some organizations prefer to
separate data from applications for performance or security reasons. Three implementation approaches are available for web-based data management systems for clinical trials, as follows: • commercial web-based products for clin-
ical trials • open source products • custom-designed
software developed using general use programming tools
Commercial web-based data management systems for clinical trials offer comprehensive, integrated sets of components. Some of these products were initially designed as client server implementations but have added web-based capabilities or been redesigned for web use. They are usually licensed by user often with separate licenses for each software component. The software cost for these products is high, but programmer cost can be small. Support from the companies is available. Open source products can be obtained free of charge. The expectation from the software developers is that other users will make improvements and share these with others. Because open source operating systems, web servers, and databases are available, a whole system can be implemented with little software cost. Sometimes, open source software companies sell editions that include installation, training, and support. Even though the software costs for open source products are minimal, costs for programmers to install, customize, and maintain the system may not be minimal. Much open source software works as advertised, but if problems occur, often little assistance is available except online help documents. A programmer or information technology specialist is probably required to set up the software. Custom designed systems are built using software tools that are designed for web programs. Such tools include written Java, which is Sun Microsystems’ programming language; Javascript; Microsoft’s .NET platform; PHP, which is an open source scripting tool; or Adobe’s ColdFusion. The cost of these
WEB BASED DATA MANAGEMENT SYSTEM
tools is not prohibitive, but programmers must develop all the system components, and programmer costs are high. Organizations may choose, rather than building or buying web-based systems using the tools or products described above, to hire clinical information technology companies or contract research organizations to design and implement systems and, in some cases, to handle all data and study management. The specifications for these systems are often not published, but the functional descriptions are available to potential customers. Systems can be hosted at the contracted company or licensed to the data coordinating center.
4 ADVANTAGES AND DISADVANTAGES
Web-based data management systems have advantages and disadvantages compared with paper-based and other electronic systems. Other potential benefits have not been tested. 4.1 Speed and Efficiency One motivation for using web-based systems is to expedite the conduct of trials, which results in trials that are less expensive and take less time, so that results can be published more quickly. Ways that a web-based data management system can accelerate the conduct of a trial include: • Data Transfer: With web-based sys-
tems, the speed of data transfer from clinical sites to the sponsor or data coordinating center is faster because as soon as data are entered, they are written to a database, which obviates a separate data transfer step. Not only are data received more quickly, but also those data are cleaner if real-time validation checks are performed during data entry. • Re-key Verification: Electronic data capture eliminates the additional time and potential errors that develop when data are first written to paper forms and then keyed. The need for duplicate data entry or verification is eliminated when data are entered in real-time and edited as they are entered.
• Queries: With query resolution capa-
bilities available through a web-based interface, any queries that do develop can be resolved immediately by sites. In paper-based or remote systems, the data must be transferred to the sponsor and processed before queries can be sent to sites. • Data Access: The immediate access to the data provided by a web system gives investigators information about issues such as study status, recruitment and enrollment, and forms completion, which allows them to manage the study more easily and spot problems more readily. Clinic monitors also have access to the data so they can plan site visits more efficiently, which can result in fewer, shorter monitoring visits (10). One of the few studies conducted to compare web-based with paper-based data collection confirmed the expected time savings: the time from the last patient completing the trial to database closure, data entry, and query resolution all were shorter with the Internet system (11). Another study that assessed advantages of web-based data collection cited time-savings benefits as well: decreased time lag between data collection and validation, immediate correction of entry mistakes, and elimination of double data entry (10). However, few studies have been conducted to measure cost savings of web-based systems in conducting trials. 4.2 Study Site Configuration An economical advantage of a web-based data management system is the promise that the system can run on any computer with an Internet connection. Computers do not have to be dedicated to study data management nor do specialized computers need to be purchased to run the system. Systems that rely on server-side scripts rather than client-side scripts and that use standard HTML have fewer client requirements. However, relying on server-side scripting often means data are validated only when a form is submitted to the server rather than as individual fields are keyed. Systems that use client-side scripting often have client requirements,
such as for a specific operating system, web browser, or browser plug-in (12–15). Because the customized software in a web-based system is installed on servers at a single location, system maintenance is centralized. This centralization also makes scaling to multiple remote sites easy because no remote software must be installed and configured. Data security is improved with single-site control of data backup and system security (16). A technical requirement that may be a disadvantage of web-based systems is the need for an Internet connection that is fast and reliable enough to display forms and respond to user requests adequately. In addition, the connection must be in the right places in the clinic. One study found that investigator and user satisfaction depended on Internet connection speed (10). Unlike remotely installed systems, web-based systems require no local system administration or maintenance. The sponsor performs data backups on the centrally located server.
systems, the switch to web-based data management systems may require reassignment of duties and more user training (17,18).
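The timeout feature mentioned above can be sketched in a few lines; the 20-minute limit and the function name are hypothetical choices for illustration, not taken from any particular system.

    import time

    IDLE_LIMIT_SECONDS = 20 * 60     # hypothetical policy: log out after 20 idle minutes

    def session_expired(last_activity_epoch, now=None):
        # Return True if the session has been idle longer than the allowed limit.
        now = time.time() if now is None else now
        return (now - last_activity_epoch) > IDLE_LIMIT_SECONDS

    # On each request the server would record time.time() as the new last-activity
    # stamp; before serving a request it would call session_expired() and, if True,
    # discard the session and require the user to log in again.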
5 FUTURE WEB-BASED SYSTEMS Web-based data management systems continue to evolve. Some developments to look for in the immediate future are expanded system capabilities to include study management rather than just data management features, extended data integration and standardization, and trials run exclusively online. 5.1 Study Management and Recruitment Web-based systems are likely to include study management in addition to data management functions. Study management tools such as scheduling, specimen tracking, and document management, are included in some current systems (19). The STAR*D project developed an integrated web-based study communications and management system, which includes an online help desk with automated email requests sent to appropriate personnel; a document sharing program for forms, manuals, and minutes; data reports; and a specimen tracking system for blood samples. In this system, the overlap between data management and study management systems is clear. Data reports and specimen tracking could easily be included in a data management system but clearly facilitate study management (20). To increase efficiency data and study management components should be integrated even more to include patient recruitment, drug supply management, and regulatory documentation. 5.2 Data Integration With data coming from many systems data integration is a problem. However, standards for data exchange such as the CDISC based on XML, which is the Internet data interchange standard, should make integration and communication between systems easier. In the CDISC Operational Data Model protocol, metadata, and study data collected on subjects are integrated through the clinical trial process—from data acquisition, to
analysis and reporting, to regulatory submission and electronic data archive. The Electronic Source Data Interchange Group of CDISC is developing standards to facilitate and encourage use of electronic source documents in clinical research while complying with regulations and using standards to support platform-independent data interchange (20). 5.3 Online Trials Expanding online techniques, such as patient recruitment and web-based data capture, to conduct a clinical trial entirely online was tested in a feasibility study on osteoarthritis (21). Participants mailed signed paper consent forms and medical record release permission, but otherwise data were entered by participants through the study web site. Study drug was distributed by mail. The web site sent automated reminder e-mails to participants to log in every 2 weeks to fill out forms. The authors concluded that it is feasible to do certain types of trials entirely online. The cost of the trial was lower, participants were satisfied, and other metrics, such as adherence and retention, were similar to other osteoarthritis trials. This online approach may become increasingly practical as the Internet becomes more and more widely available.
6 SUMMARY
The World Wide Web is the latest technology to be used for clinical trials data management systems. However, web systems are not the best choice for all studies. Characteristics of the study and the study sites must be considered when choosing a system. In addition, different systems offer different features, and which must be matched against the needs of the study. As more computerized systems are used for clinical trials, standards are being developed to ensure the integrity and quality of the data and to facilitate data interchange and archival. Systems purchased or developed should adhere to these standards. Ease and cost of implementation are important factors as well. Data management software can be purchased, developed, and supported by the sponsor or downloaded from open-source
organizations, modified, and installed. The cost of the three options varies widely and the decision of which to choose depends on budget and expertise. Many advantages and disadvantages to these systems exist. Security is a concern, but if controls are implemented, then the risk can be minimized. Increasing the conduct of a trial by eliminating data transfer and query resolution time is one advantage as is immediate access to data and thus real-time information about study status. The appeal to investigators of web-based systems will likely lead to putting more and more study functions, such as study management, recruiting, even the complete conduct of trials, on the web. REFERENCES 1. C. J. Cooper, S. P. Cooper, D. J. del Junco, E. M. Shipp, R. Whitworth, and S. R. Cooper, Web-based data collection: detailed methods of a questionnaire and data gathering tool. Epidemiol. Perspect. Innov. 2006; 3: 1. 2. R. G. Marks, Validating electronic source data in clinical trials. Control. Clin. Trials 2004; 25: 437–446. 3. R. J. Piazza, Integrated web-based clinical data handling solutions. Drug Inf. J. 2001; 35: 731–735. 4. Web-based randomization facility for Perinatal trials. Retrieved from Oxford University, Department of Public Health, Institute of Health Sciences web site. Available at: http:// acdt.oucs.ox.ac.uk/ltgweb/index.php?action= view projects&proj id=2002n. 5. Clinphone product web site (n.d.) Available at: http://www.clinphone.com/IWR.asp. 6. S. R. Wisniewski, H. Eng, L. Meloro, R. Gatt, L. Ritz, D. Stegman, M. Trivedi, M. Biggs, E. Friedman, K. Shores-Wilson, D. Warden, D. Bartolowits, J. P. Martin, and A. J. Rush, Web-based communications and management of a multi-center clinical trial–The Sequenced Treatment Alternatives to Relieve Depression (STAR*D) Project. Clin. Trials 2004; 1: 387–398. 7. Food and Drug Administration, Guidance for Industry: Systems Used in Clinical Trials. Washington, DC: FDA, 1999. 8. Definition of XML. Avalable at: http://en. wikipedia.org/wiki/Xml. 9. Clinical Data Interchange Standards Consortium (CDISC). Operational Data Model, Final
Version 1.3. Available at: http://www.cdisc.org/models/odm/v1.3/index.html.
10. C. Lopez-Carrero, E. Arriaza, E. Bolanos, A. Ciudad, M. Municio, J. Ramos, and W. Hesen, Internet in clinical research based on a pilot experience. Contemp. Clin. Trials 2005; 26: 234–243. 11. J. Litchfield, J. Freeman, H. Schou, M. Elsley, R. Fuller, and B. Chubb, Is the future for clinical trials internet-based? A cluster randomized clinical trial. Clin. Trials 2005; 2: 72–79. 12. J. R. Schmidt, A. J. Vignati, R. M. Pogash, V. A. Simmons, and R. L. Evans, Web-based distributed data management in the Childhood Asthma Research and Education (CARE) Network. Clin. Trials 2005; 2: 50–60. 13. J. E. Gillen, T. Tse, N. C. Ide, and A. T. McCray, Design, implementation and management of a web-based data entry system for ClinicalTrials.gov. Medinfo 2004; 11: 1466–1470. 14. Oracle Clinical, An Oracle White Paper. Available at: http://www.oracle.com/industries/ life sciences/oc data sheet 81202.pdf. 15. Technology and System Configuration. 2006. Available at: http://www.datalabs. com/brochures/DataLabsTech.pdf. 16. J. Unutzer, Y. Choi, I. A. Cook, and S. Oishi, A web-based data management system to improve care for depression in a multicenter clinical trial. Psychiatr. Serv. 2002; 53: 671–673, 678. 17. M. Winget, H. Kincaid, P. Lin, L. Li, S. Kelly, and M. Thornquist, A web-based system for managing and coordinating multiple multisite studies. Clin. Trials 2005; 2: 42–49. 18. J. Paul, R. Seib, and T. Prescott, The internet and clinical trials: background, online resources, examples and issues. J. Med. Internet Res. 2005; 7: e5 19. D. Reboussin and M. A. Espeland, The science of web-based clinical trial management. Clin. Trials 2005; 2: 1–2. 20. Leveraging the CDISC Standards to facilitate the use of electronic source data within clinical trials, Clinical Data Interchange Standards Consortium, Electronic Source Data Interchange (eSDI) Group, November 20, 2006. Available at: http://www.cdisc.org/eSDI/ eSDI.pdf. 21. T. McAlindon, M. Formica, K. Kabbara, M. La Valley, and M. Lehmer, Conducting clinical trials over the Internet: feasibility study. BMJ. 2003; 327: 484–487.
22. Guidance for Industry. Part 11 Electronic Records; Electronic Signatures–Scope and Application. Available at: http://www.fda. gov/cder/guidance/5505dft.pdf.
CROSS-REFERENCES Electronic Data Capture Software for Data Management Clinical Data Management Remote Data Capture Data Validation
WEI-LIN-WEISSFELD METHOD FOR MULTIPLE TIMES TO EVENTS
JAMES E. SIGNOROVITCH Harvard Medical School
LEE-JEN WEI Harvard University Boston, Massachusetts

Although longitudinal studies often collect multiple event time data, primary analyses usually focus on a single event time and employ standard univariate techniques for survival analysis. For example, a study may use the Cox proportional hazards model (1) to study the times to each patient's first event. However, in some cases, use of the full, multivariate event time data may provide more comprehensive and efficient inferences for parameters of interest. For example, consider investigating whether a certain genetic mutation is associated with a complex disease such as asthma, diabetes, hypertension, autism, and so on. The progression of complex diseases can often be described by the occurrence times of several clinical phenotypes. But it is usually difficult if not impossible to summarize these occurrence times into a single, meaningful disease progression score that can be used to assess associations between clinical progression and genotype. The Wei–Lin–Weissfeld (WLW) method (2) provides a practical alternative to creating a single score. Given multiple, possibly censored event times for each patient, the WLW method estimates for each outcome type the strength of association between occurrence time and genotype, potentially adjusting for any confounders. Taken together across all outcome types, these estimates of genotypic association capture the overall association between genotype and disease progression. Proper inferences regarding these associations must account for nonindependence among their estimators. The attractive feature of the WLW method is that it provides a reliable, robust estimate of the joint distribution of the association estimators that does not require any modeling of the correlation structure among multiple intrasubject event times. One can use the WLW approach to make simultaneous inferences about the collection of association estimates or to make inferences regarding a single, global summary of the association estimates.

1 THE WLW APPROACH
In general, the WLW procedure can incorporate any parametric or semiparametric model for the marginal event times. Here we illustrate the WLW procedure with marginal Cox proportional hazards models, which are often used in univariate failure time analyses. Supposing K different types of failures and n subjects exist, let $T_{ki}$ be the time to the kth failure type in the ith subject. Either by design or by chance, however, observation of $T_{ki}$ may be censored at a random time $C_{ki}$, in which case we observe $X_{ki} = \min(T_{ki}, C_{ki})$, and $\Delta_{ki} = 1$ if $X_{ki} = T_{ki}$ and $\Delta_{ki} = 0$ otherwise. Also, let $Z_{ki}(t) = (Z_{1ki}(t), \ldots, Z_{pki}(t))'$ be a p × 1 vector of covariates for the ith subject at time t ≥ 0 specific to the kth failure type. For the ith subject, i = 1, . . . , n, it is assumed that conditional on the covariate histories for each of the K failure types $Z_i(\cdot) = (Z_{1i}(\cdot)', \ldots, Z_{Ki}(\cdot)')'$, the vector of failure times $T_i = (T_{1i}, \ldots, T_{Ki})'$ is independent of the vector of censoring times $C_i = (C_{1i}, \ldots, C_{Ki})'$. Our study generates observations that consist of n triplets, $(X_i, \Delta_i, Z_i(\cdot))$, i = 1, . . . , n, with $X_i = (X_{1i}, \ldots, X_{Ki})'$ and $\Delta_i = (\Delta_{1i}, \ldots, \Delta_{Ki})'$. Assume that the marginal model for the kth failure is specified through a hazard function $\lambda_{ki}(t)$ for the ith patient with the form
$$\lambda_{ki}(t) = \lambda_{k0}(t)\exp\{\beta_k' Z_{ki}(t)\}, \quad t \ge 0, \qquad (1)$$
where $\lambda_{k0}(t)$ is an unspecified baseline hazard function and $\beta_k = (\beta_{1k}, \ldots, \beta_{pk})'$ is the failure-specific regression parameter. With each $\beta_k$ unrestricted with respect to every $\beta_j$, $k \ne j$, this collection of K marginal models
can be fit to the data by maximizing the K failure-specific partial likelihoods (1)
$$L_k(\beta) = \prod_{i=1}^{n}\left[\frac{\exp\{\beta' Z_{ki}(X_{ki})\}}{\sum_{l \in R_k(X_{ki})} \exp\{\beta' Z_{kl}(X_{ki})\}}\right]^{\Delta_{ki}}, \quad k = 1, \ldots, K,$$
where the risk set $R_k(t) = \{l: X_{kl} \ge t\}$ indicates which subjects are at risk for the kth failure type just prior to time t. The maximum partial likelihood estimate $\hat{\beta}_k$ for $\beta_k$ is then defined as the solution to the normal equation $\partial L_k(\beta)/\partial\beta = 0$. If the marginal model in Equation (1) is correctly specified, then the estimator $\hat{\beta}_k$ is consistent for $\beta_k$. In general, the $\hat{\beta}_k$s will be correlated in some unknown fashion. However, it can be shown that for large n, the vector of estimated coefficients $(\hat{\beta}_1', \ldots, \hat{\beta}_K')'$ is asymptotically normal with mean $(\beta_1', \ldots, \beta_K')'$ and a covariance matrix Q, say, which can be consistently estimated via a robust sandwich estimate $\hat{Q}$. With this variance–covariance estimate in hand, inferences regarding the $\hat{\beta}_k$s, or any subset or linear function thereof, can be made. The WLW method can be implemented in SAS 9.1 (SAS Institute, Inc., Cary, NC) using the PHREG procedure by constructing strata that correspond to each event type and employing a robust sandwich estimate of the covariance matrix (3).
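As an illustration outside SAS, the sketch below shows one way the same marginal analysis could be set up in Python with the lifelines package (an assumption of this example; the variable names and toy data are invented). Each subject contributes one row per event type, event-type-specific covariates are created so that each beta_k is unrestricted, the model is stratified on event type, and a cluster-robust (sandwich) covariance is requested by identifying the subject as the cluster.

    import pandas as pd
    from lifelines import CoxPHFitter

    # Toy long-format data: one row per subject per event type (K = 2 event types).
    df = pd.DataFrame({
        "subject":    [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
        "event_type": [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
        "time":       [4.0, 6.0, 3.5, 3.5, 7.0, 2.5, 5.0, 5.0, 6.2, 6.2, 2.8, 4.1],
        "status":     [1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1],   # 1 = event, 0 = censored
        "treat":      [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0],
    })

    # Event-type-specific treatment covariates, so each event type gets its own beta_k.
    for k in (1, 2):
        df[f"treat_k{k}"] = df["treat"] * (df["event_type"] == k)

    cph = CoxPHFitter()
    cph.fit(df[["time", "status", "treat_k1", "treat_k2", "event_type", "subject"]],
            duration_col="time", event_col="status",
            strata=["event_type"],        # separate baseline hazard per event type
            cluster_col="subject")        # sandwich covariance clustered on subject
    print(cph.summary[["coef", "se(coef)"]])

The strata argument gives each event type its own baseline hazard, the interaction columns give each event type its own coefficient, and clustering on subject yields the robust covariance estimate described above.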
2 INFERENCE FOR THE WLW APPROACH
We now briefly describe some commonly used inference procedures for the $\hat{\beta}_k$s. Suppose interest is focused on the effects $\eta_k = \beta_{1k}$ (k = 1, . . . , K) of a particular covariate, a treatment indicator for example, on each of the K failure types. The estimated covariance matrix for $(\hat{\eta}_1, \ldots, \hat{\eta}_K)'$, denoted here as $\hat{\Sigma}$, can be obtained from $\hat{Q}$. The global null hypothesis $\eta_1 = \cdots = \eta_K = 0$ can be tested by comparing $(\hat{\eta}_1, \ldots, \hat{\eta}_K)\hat{\Sigma}^{-1}(\hat{\eta}_1, \ldots, \hat{\eta}_K)'$ to a chi-square distribution with K degrees of freedom. To test the individual null hypotheses $H_k: \eta_k = 0$ (k = 1, . . . , K) simultaneously, one could obtain z-scores by dividing the $\hat{\eta}_1, \ldots, \hat{\eta}_K$ by the square roots of their corresponding diagonal elements in $\hat{\Sigma}$ and then obtain critical values using the Bonferroni adjustment (see Bonferroni Adjustment), which is known to be conservative when the test statistics $z_k = \hat{\eta}_k/\hat{\Sigma}_{kk}^{1/2}$ are highly correlated or when few of the $H_k$s are true. Because $(\hat{\eta}_1, \ldots, \hat{\eta}_K)'$ is asymptotically normal, a less-conservative testing procedure is to reject all $H_k$ such that $|z_k| > c$, where c is chosen such that
$$\Pr\left(\max_{k=1,\ldots,K} |Z_k| < c\right) = 1 - \alpha$$
with $Z = (Z_1, \ldots, Z_K)'$ following a multivariate normal distribution with mean 0 and covariance matrix $A^{-1/2}\hat{\Sigma}A^{-1/2}$, where A is the K × K matrix obtained by setting all off-diagonal elements of $\hat{\Sigma}$ to zero. The critical value c may be found by simulating Z. Sequential procedures can provide even less conservative tests (2). A confidence region for η may be obtained by inverting the global chi-squared test, that is, by including in the confidence region any value η such that $(\hat{\eta}-\eta)'\hat{\Sigma}^{-1}(\hat{\eta}-\eta)$ is less than a critical value obtained from the chi-square distribution with K degrees of freedom. Estimation and hypothesis testing may also be of interest for particular summaries of the $\eta_k$s. For example, if all $\eta_k$s are thought to be equal to some common value η, then it is natural to estimate η as $\hat{\eta} = [J'\hat{\Sigma}^{-1}J]^{-1}J'\hat{\Sigma}^{-1}(\hat{\eta}_1, \ldots, \hat{\eta}_K)'$, where J is a K-vector of 1s. This is the linear estimator of η with the smallest asymptotic variance (2). Even if the $\eta_k$s are not equal, the aggregated estimate η can still be interpreted as an average treatment effect across failure types. If the true $\eta_k$s are at least nearly equal and generally have the same sign, then estimating η can provide an informative and efficient summary of the treatment effect across the K event types.
2.1 Notes on Interpretation
Interpreting the marginal regression coefficients for each recurrence takes some care. For example, suppose that the WLW method has been applied to a randomized trial in which the outcome is a recurrent event. Observing a treatment effect on the time to the second recurrence does not imply
that treatment affects the waiting time between the first and second recurrence. Rather, because the WLW approach models the marginal distribution of the time from baseline to the second recurrence, any treatment effect on the time to the second recurrence could develop from a treatment effect on the time to the first occurrence even if the time between first and second recurrence is independent of treatment. It has been argued that in a randomized trial, studying effects on the time to recurrent events from baseline is appropriate because such effects can be interpreted straightforwardly as causal effects of the treatment assignment at baseline (4). In contrast, the apparent effect of treatment on the time between the first and second recurrence is likely confounded by the time to the first recurrence, which is a postrandomization variable.
3 OTHER CONSIDERATIONS
This article has considered the WLW approach to multivariate failure time data for noncompeting events. That is, we have assumed that each type of event is observable until some censoring time that is independent of the event time given the covariates. However, when multiple events are allowed for each subject, it may be the case that some events, such as death for example, terminate follow-up and prevent the observation of subsequent events. Although the problem of competing risks cannot be avoided in this setting, several options for handling terminating events under the WLW approach have been explored (5,6). The multivariate failure time data considered in this article were assumed to be correlated within subjects but independent across subjects. So-called clustered failure time data, in which event times may be correlated across subjects, is another important setting for multivariate failure time analysis. For example, in a genetic study where the sampling units are families, individuals' failure times may be correlated with
those of their relatives (see Reference 7). This setting cannot be readily handled by the WLW approach. An excellent review on multiple event time analysis is given by Lin (8).
REFERENCES
1. D. R. Cox, Regression models and life-tables. J. Royal Stat. Soc. Series B-Stat. Methodol. 1972; 34: 187.
2. L. J. Wei, D. Y. Lin, and L. Weissfeld, Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. J. Am. Stat. Assoc. 1989; 84: 1065–1073.
3. SAS Institute Inc. SAS/STAT 9.1 User's Guide. Cary, NC: SAS Institute Inc., 2004.
4. C. Mahe and S. Chevret, Estimation of the treatment effect in a clinical trial when recurrent events define the endpoint. Stat. Med. 1999; 18: 1821–1829.
5. R. J. Cook and J. F. Lawless, Marginal analysis of recurrent events and a terminating event. Stat. Med. 1997; 16: 911–924.
6. Q. H. Li and S. W. Lagakos, Use of the Wei-Lin-Weissfeld method for the analysis of a recurring and a terminating event. Stat. Med. 1997; 16: 925–940.
7. E. Lee, L. Wei, et al., Cox-type regression analysis for large numbers of small groups of correlated failure time observations. In: J. P. Klein and P. K. Goel (eds.), Survival Analysis: State of the Art. Dordrecht, The Netherlands: Kluwer Academic Publisher, 1992.
8. D. Y. Lin, Cox regression analysis of multivariate failure time data: the marginal approach. Stat. Med. 1994; 13: 2233–2247.
FURTHER READING
L. J. Wei and D. V. Glidden, An overview of statistical methods for multiple failure time data in clinical trials. Stat. Med. 1997; 16: 833–839.
CROSS-REFERENCES Cox proportional hazards model; Time to event; Time to disease progression; Hypothesis testing; HIV trials
WILCOXON–MANN–WHITNEY TEST
LINCOLN E. MOSES Stanford University, Stanford, CA, USA

This test for comparing two samples with respect to their "general size" is based on ranking the observations—both samples combined—and then comparing the average ranks in the two samples. Though this idea had appeared several times in various disciplines (2), the statistical community first recognized the idea when Wilcoxon proposed it in 1945 (9); thereafter developments followed fast, the first of which was Mann & Whitney's paper (3).

1 REPRESENTATIONS OF THE TEST

To fix ideas we introduce some notation. Until further stated, we consider continuous data, thus excluding ties. Let $X_1, X_2, \ldots, X_m$ be independent and identically distributed, with unknown cumulative distribution function F. Define
$$F_m(t) = \frac{1}{m}\sum_{i=1}^{m} u(x_i, t), \qquad (1)$$
where the function u(a, b) = 1 if a < b and zero if not. Similarly, define $Y_1, Y_2, \ldots, Y_n$, and G,
$$G_n(t) = \frac{1}{n}\sum_{j=1}^{n} u(y_j, t), \qquad (2)$$
and write N = m + n. If $r_i = r(x_i)$ is the rank of $x_i$ in the combined sample, let $R(x) = \sum_{i=1}^{m} r(x_i)$; and if $s_j = s(y_j)$ is the rank of $y_j$ in the combined sample, let $R(y) = \sum_{j=1}^{n} s(y_j)$. Observe that R(x) + R(y) = N(N + 1)/2 since each side represents the sum of the integers 1, 2, . . . , N. Define
$$U(x < y) = \sum_{i=1}^{m}\sum_{j=1}^{n} u(x_i, y_j), \qquad (3)$$
where again $u(x_i, y_j) = 1$ if $x_i < y_j$ and zero otherwise. U(x < y) reports how many of the mn distinct pairs comprising one $x_i$ and one $y_j$ have $x_i < y_j$. Mann & Whitney showed that
$$R(y) = \frac{n(n+1)}{2} + U(x < y), \qquad (4)$$
and that, hence, properties of Wilcoxon's test could be learned by studying U(x < y). The relation between R(y) and U(x < y) also implies that one may choose to calculate whichever is more convenient with any particular data set. (In what follows we write W-M-W for Wilcoxon–Mann–Whitney.) Because a monotone continuous transformation (like $x^{1/2}$ or log x) does not change order relations, both U(x < y) and R(y) are also unaffected.

2 DISTRIBUTION THEORY WHEN F = G (i.e. H0 HOLDS)

The exact distribution of U(x < y) is obtained by enumeration, which is much expedited by using recursion relationships. Under H0 the mean and variance of U(x < y) are:
$$E_0[U(x < y)] = \frac{mn}{2} \qquad (5)$$
and
$$\mathrm{var}_0[U(x < y)] = \frac{mn(N+1)}{12}. \qquad (6)$$
Both results are readily obtained by regarding R(y) as the sum of n random observations chosen without replacement from (1, 2, . . . , N). Asymptotic normality (shown below) provides good approximation to the exact distribution for m and n both large (m ≥ 8, n ≥ 8 suffices at 2p = 0.05). Owen (7) tabulates distributions of both U(x < y) and R(y).
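As a brief numerical check of Equations (3)-(6), the following sketch (an addition for illustration, not part of the original entry) computes U(x < y), R(y), and the null mean and variance for two small artificial samples.

    import numpy as np

    x = np.array([1.2, 3.4, 2.2, 5.0])          # m = 4 hypothetical observations
    y = np.array([2.9, 4.1, 6.3, 3.8, 5.5])     # n = 5 hypothetical observations
    m, n = len(x), len(y)
    N = m + n

    # Equation (3): count pairs with x_i < y_j.
    U = sum(xi < yj for xi in x for yj in y)

    # R(y): sum of the ranks of the y's in the combined sample (ranks start at 1).
    combined = np.concatenate([x, y])
    ranks = combined.argsort().argsort() + 1
    R_y = ranks[m:].sum()

    print(U, R_y, n * (n + 1) / 2 + U)          # R(y) equals n(n+1)/2 + U, Equation (4)

    # Equations (5) and (6): null mean and variance, and the normal-approximation z.
    E0 = m * n / 2
    V0 = m * n * (N + 1) / 12
    z = (U - E0) / V0 ** 0.5
    print(E0, V0, z)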
3 DISTRIBUTION THEORY: GENERAL CASE; x AND y CONTINUOUS

It is evident that $E[u(x_i, y_j)] = 1 \times \Pr(x < y) + 0 \times \Pr(y < x) = \Pr(x < y)$, whence
$$E\left[\frac{1}{mn}U(x < y)\right] = \frac{1}{mn}E\left[\sum_{i=1}^{m}\sum_{j=1}^{n} u(x_i, y_j)\right] = \Pr(x < y).$$
Hereafter we write $\widehat{\Pr}(x < y)$ for U(x < y)/mn. Now,
$$\widehat{\Pr}(x < y) = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} u(x_i, y_j) = \frac{1}{n}\sum_{j=1}^{n}\left[\frac{1}{m}\sum_{i=1}^{m} u(x_i, y_j)\right].$$
Hence, applying (1),
$$\widehat{\Pr}(x < y) = \frac{1}{n}\sum_{j=1}^{n} F_m(y_j). \qquad (7)$$
It follows that, as m → ∞ [and hence $F_m(t) \to F(t)$],
$$\lim \widehat{\Pr}(x < y) = \frac{1}{n}\sum_{j=1}^{n} F(y_j). \qquad (8)$$
From (8), for large m, $\widehat{\Pr}(x < y)$ is nearly an average of n independent identically distributed bounded random variables, and hence is asymptotically normally distributed as n (in addition to m) grows large. From (7),
$$\widehat{\Pr}(x < y) = \frac{1}{n}\sum_{j=1}^{n} F_m(y_j) = \int F_m \, dG_n,$$
which estimates $\int F \, dG$. Examination of (8) yields a one-sample version of the Wilcoxon–Mann–Whitney test (4). If F is known (say from census figures), then we can test whether a given set of data $(y_1, \ldots, y_n)$ comes from that distribution, against an alternative that G(y) is some other distribution with $\int F \, dG \ne 1/2$. Under H0, G = F and the statistic $(1/n)\sum_{j=1}^{n} F(y_j)$ is the mean of n uniform (0, 1) random variables; the statistic has mean 1/2, variance 1/(12n), and, if n is large (say 8 or more), its distribution is effectively normal, providing the test for H0: G = F.

4 SOME PROPERTIES OF THE TEST

In comparison with the two-sample t test, the W-M-W enjoys a very strong property as a test against translation alternatives. First, if normality, with $\sigma_x^2 = \sigma_y^2$, governs the data, the asymptotic relative efficiency of the W-M-W procedure is 0.955 = 3/π, which is nearly 1. Secondly, if the data come from a heavy-tailed distribution, then that efficiency rises above 1.0 and, for some distributions, much above 1.0. Hodges & Lehmann showed (1) that the asymptotic relative efficiency of W-M-W vs. the t test never falls below 0.864 for translation alternatives.

5 SOME PRACTICAL ASPECTS

Where the distributions F and G are believed to differ by translation, a confidence interval for that translation can be constructed by a simple graphical procedure, based on W-M-W test theory (5). The parameter Pr(x < y) and its estimate $\widehat{\Pr}(x < y)$ are sometimes readily interpretable. They are unit-free, and can serve as indicators of "effect size". We have seen that $\widehat{\Pr}(x < y)$ is asymptotically normal, and unbiased for Pr(x < y); under H0,
$$\mathrm{var}[U(x < y)] = \frac{mn(N+1)}{12}$$
and
$$\mathrm{var}[\widehat{\Pr}(x < y)] = \frac{N+1}{12mn}.$$
Unfortunately, except when F = G, the standard error of $\widehat{\Pr}(x < y)$ is not a simple matter, though the following upper bound can be justified: $\mathrm{se}[\widehat{\Pr}(x < y)] \le [p(1-p)]^{1/2}/k$, where p denotes Pr(x < y) and k denotes the smaller of m and n (7).
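To connect the effect-size interpretation above with data, this added sketch computes the estimate of Pr(x < y) as U(x < y)/mn and the conservative standard-error bound from Section 5 for two invented samples, using the estimate in place of the unknown p.

    import numpy as np

    x = np.array([12.1, 9.8, 11.4, 13.0, 10.2])        # hypothetical control values
    y = np.array([12.9, 14.3, 11.8, 15.1, 13.6, 12.4]) # hypothetical treated values
    m, n = len(x), len(y)

    U = sum(xi < yj for xi in x for yj in y)
    p_hat = U / (m * n)                                 # estimate of Pr(x < y)

    # Conservative upper bound on the standard error: [p(1 - p)]**0.5 / k,
    # with k the smaller of m and n, using p_hat in place of the unknown p.
    k = min(m, n)
    se_bound = (p_hat * (1 - p_hat)) ** 0.5 / k
    print(round(p_hat, 3), round(se_bound, 3))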
6 TIED DATA (DISCRETE PROBABILITIES)

When ties occur only in the xs or only the ys, they do not affect anything. However, where there are t observations, including at least one x and at least one y, sharing a common value, they are handled in the following manner. To calculate U(x ≤ y), count each tied pair $(x_i, y_j)$ in the tied set as contributing 1/2 to U(x ≤ y). To calculate R(y), consider the t consecutive ranks that would be assigned were the tied data perturbed slightly to become distinct. The average of those t distinct ranks is then assigned as the rank for every observation in that tied set. These two approaches are consistent: they lead to compatible values of R and U. The variance of R (and U) is somewhat reduced by the ties. Indeed, the variance appropriate for untied data is multiplied by CF (for "correction factor") as follows:
$$CF = 1 - \frac{\sum (t^3 - t)}{N^3 - N},$$
where the sum runs over all the sets of x-with-y ties, and t denotes the length of such a tie. If no x-with-y tie includes as many as half of the observations, the variance will need correction only in borderline situations, as the correction factor stays near 1.0 unless at least half the observations are in one tied set.

6.1 Example (Table 1)

Table 1.
          Poor   Fair   Good   Excellent
x           2      4      5        11      22   m
y          19     16     18        13      66   n
t          21     20     23        24      88   N

$$CF = 1 - \frac{(21^3 - 21) + (20^3 - 20) + (23^3 - 23) + (24^3 - 24)}{88^3 - 88} = 0.93665.$$
So the routine null standard deviation $\sigma = [(22 \times 66 \times 89)/12]^{1/2}$ is reduced; it is multiplied by $\sqrt{0.93665} = 0.9678$. For this table, $\widehat{\Pr}(x < y)$ is calculated thus:
$$\widehat{\Pr}(x < y) = \frac{2(47) + 4(31) + 5(13) + \tfrac{1}{2}[2 \times 19 + 4 \times 16 + 5 \times 18 + 11 \times 13]}{22 \times 66} = 0.3103.$$
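The tie-corrected calculation just shown can be checked with a short script; this sketch is an added illustration that re-derives CF, the estimate of Pr(x < y), and the Z statistic directly from the counts in Table 1.

    # Counts from Table 1: x and y frequencies in each ordered category.
    x_counts = [2, 4, 5, 11]
    y_counts = [19, 16, 18, 13]
    m, n = sum(x_counts), sum(y_counts)          # 22, 66
    N = m + n                                    # 88

    # Tie-correction factor: each category is one x-with-y tied set of length t.
    t_sizes = [a + b for a, b in zip(x_counts, y_counts)]
    CF = 1 - sum(t**3 - t for t in t_sizes) / (N**3 - N)

    # U(x <= y) with each tied pair counted as 1/2, then the estimate of Pr(x < y).
    U = 0.0
    for i, xc in enumerate(x_counts):
        U += xc * sum(y_counts[i + 1:])          # y strictly above x's category
        U += 0.5 * xc * y_counts[i]              # tied pairs contribute 1/2
    phat = U / (m * n)

    # Normal-approximation test statistic with the tie-corrected standard error.
    se = ((N + 1) / (12 * m * n)) ** 0.5 * CF ** 0.5
    Z = (phat - 0.5) / se
    print(round(CF, 5), round(phat, 4), round(Z, 2))   # 0.93665, 0.3103, -2.74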
In the above calculation we have organized our work by choosing successive subgroups of x; thus, the 2 xs in "poor" form (x, y) pairs, where x < y, with 16 + 18 + 13 = 47 ys, etc. The tied x, y pairs each contribute 1/2. The example illustrates a salient application of W-M-W methodology. It is mistaken to apply the usual chi-square test of significance where the categories have a relevant order, because that order, the key to the problem, is not taken into account by the $\chi^2$ statistic. For these data $\chi_3^2 = 8.64$ (p = 0.0345), and the W-M-W statistic is
$$Z = \frac{0.31026 - 0.5000}{\left[\dfrac{89}{12 \times 22 \times 66}\right]^{1/2}(0.9678)} = \frac{-0.18974}{0.06917}$$
= −2.74 (2p = 0.0061). A more detailed treatment of the ordered 2 × k contingency table appears in (6). REFERENCES 1. Hodges, J. L. & Lehmann, E. L. (1956). The efficiency of some non-parametric competitors of the t-test, Annals of Mathematical Statistics 27, 324–335. 2. Kruskal, W. H. (1957). Historical notes on the Wilcoxon unpaired two-sample test, Journal of the American Statistical Association 52, 356–360. 3. Mann, H. B. & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other, Annals of Mathematical Statistics 18, 50–60. 4. Moses, L. E. (1964). One sample limits of some two sample rank tests, Journal of the American Statistical Association 59, 645–651. 5. Moses, L. E. (1965). Confidence limits from rank tests (query), Technometrics 7, 257–260. 6. Moses, L. E. (1986). Think and Explain with Statistics. Addison Wesley, Reading, pp. 184–187.
7. Owen, D. B. (1962). Handbook of Statistical Tables. Addison Wesley, Reading.
8. Pratt, J. W. & Gibbons, J. D. (1981). Concepts of Non-Parametric Theory. Springer-Verlag, New York, pp. 264–265.
9. Wilcoxon, F. (1945). Individual comparisons by ranking methods, Biometrics 1, 80–83.
WILCOXON SIGNED-RANK TEST
R. F. WOOLSON Medical University of South Carolina, Charleston, SC, USA
The Wilcoxon signed-rank test, due to Wilcoxon (9), is a nonparametric test procedure used for the analysis of matched-pair data or for the one-sample problem. In the matched-pair setting it is used to test the hypothesis that the probability distribution of the first sample is equal to the probability distribution of the second sample. This hypothesis can be tested from statistics calculated on the intrapair differences. The hypothesis commonly tested is that these differences come from a distribution centered at zero.

Consider the following example. A study was conducted in which nine people of varying weights were put on a particular exercise regimen to determine the program's effect on the resting heart rate of the subjects. Given that a low resting heart rate is beneficial in reducing blood pressure and increasing overall cardiovascular fitness, this exercise regimen was developed to help people lower their resting heart rate. To test the effectiveness of the regimen, the resting heart rate measurement for each subject was taken before the initiation of the regimen, and at six months after beginning the regimen. Table 1 presents the data from this study. Because this study involves before and after measurements of the same individuals, an independent-sample test procedure is not appropriate.

The null hypothesis in the Wilcoxon signed-rank test is that the set of pairwise differences has a probability distribution centered at zero. A key assumption is that the differences arise from a continuous, symmetric distribution. In the example, the null hypothesis would be that there is no resting heart rate difference before and after the exercise regimen (H0 : µd = 0). In this instance, µd represents the location parameter for the distribution of differences. One alternative hypothesis is that the resting heart rate before the exercise regimen is higher than the resting heart rate after the exercise regimen (H1 : µd > 0).

To execute the test, the absolute values of the differences, |di|, are computed. These values also are given in Table 1. After computing the absolute values, one must order them from smallest to largest, disregarding any zeros. In the case of absolute differences being tied for the same ranks, the mean rank (mid-rank) is calculated and assigned to each tied value. The rankings for the absolute differences are also given in Table 1. Test statistics for the Wilcoxon signed-rank test are calculated by either summing the ranks assigned to the positive differences (T+) or by summing the ranks assigned to the negative differences (T−). If there are n differences, then the two sums are related through
T− = n(n + 1)/2 − T+.    (1)
In the example, the sum of the ranks of the positive differences is T+ = 5 + 3 + 9 + 7 + 4 + 6 + 8 = 42. Note that n(n + 1)/2 = 45, so T− = 45 − 42 = 3. To test the null hypothesis, a rejection region can be determined for the test statistic T+. This rejection region can be determined from the exact null hypothesis distribution of T+. This null distribution is easily derived from a permutational argument, as each of the possible configurations of signs (+ or −) is equally likely under the null hypothesis. Tables of this exact null distribution are available in standard nonparametric texts such as (3), (4), or (7). This null distribution depends only on n, hence the test procedure is nonparametric, i.e. distribution-free.
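To make the permutational argument concrete, the short sketch below (an illustration added here, not part of the original entry; it uses only the Python standard library) enumerates the 2^n equally likely sign configurations for n = 9 to obtain the exact null distribution of T+ and the exact one-sided p-value for the observed T+ = 42.

```python
# Exact null distribution of T+ for n = 9 by enumerating all 2^n equally
# likely sign configurations (each rank is + or - with probability 1/2).
from itertools import product

n = 9
ranks = range(1, n + 1)
counts = {}
for signs in product((0, 1), repeat=n):          # 1 marks a positive difference
    t_plus = sum(r for r, s in zip(ranks, signs) if s)
    counts[t_plus] = counts.get(t_plus, 0) + 1

total = 2 ** n
# One-sided exact p-value for the observed T+ = 42
p_exact = sum(c for t, c in counts.items() if t >= 42) / total
print(p_exact)   # 5/512, about 0.0098
```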
Table 1. Resting Heart Rate of Nine People Before and after Initiation of an Exercise Regimen

Subject   Heart Rate at Baseline (yi)   Heart Rate at 6 Months (xi)   Difference di = yi − xi   Absolute Value |di|   Rank of |di| (sign)
1         80                            72                            +8                        8                     5 (+)
2         76                            70                            +6                        6                     3 (+)
3         78                            82                            −4                        4                     2 (−)
4         90                            76                            +14                       14                    9 (+)
5         84                            86                            −2                        2                     1 (−)
6         86                            76                            +10                       10                    7 (+)
7         81                            74                            +7                        7                     4 (+)
8         84                            75                            +9                        9                     6 (+)
9         88                            76                            +12                       12                    8 (+)
For large samples the standard normal distribution, Z, can be used as an approximation to test hypotheses. For this situation a two-tailed rejection region for the null hypothesis based on T+ is given by

Z+ = [T+ − n(n + 1)/4]/[n(n + 1)(2n + 1)/24]^(1/2) > Z1−α/2,    (2)

or

Z− = [T− − n(n + 1)/4]/[n(n + 1)(2n + 1)/24]^(1/2) > Z1−α/2.    (3)
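As a worked illustration (a sketch added here, not part of the original entry; it assumes numpy and scipy are installed), the following code reproduces T+, T−, and the normal approximation for the Table 1 heart-rate data, and compares with scipy's built-in routine.

```python
# Signed-rank test for the Table 1 heart-rate data; assumes numpy and scipy.
import numpy as np
from scipy.stats import norm, rankdata, wilcoxon

baseline = np.array([80, 76, 78, 90, 84, 86, 81, 84, 88])   # y_i
six_month = np.array([72, 70, 82, 76, 86, 76, 74, 75, 76])  # x_i
d = baseline - six_month            # paired differences d_i = y_i - x_i

d = d[d != 0]                       # discard zero differences before ranking
ranks = rankdata(np.abs(d))         # mid-ranks for tied absolute differences
n = len(d)

T_plus = ranks[d > 0].sum()         # 42 for these data
T_minus = n * (n + 1) / 2 - T_plus  # 3, via identity (1)

# Large-sample normal approximation, equation (2); Z is about 2.31 here
Z = (T_plus - n * (n + 1) / 4) / np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
print(T_plus, T_minus, Z, 1 - norm.cdf(Z))   # one-sided p for H1: mu_d > 0

# scipy's implementation for comparison
print(wilcoxon(baseline, six_month, alternative="greater"))
```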
A one-tailed test is conducted in a similar fashion with the comparison made to Z1−α . Other issues regarding the Wilcoxon signed-rank test include its testing efficiency and the construction of estimators. The asymptotic relative efficiency (ARE) of this test relative to the paired t test is never less than 0.864 in the entire class of continuous symmetric distributions, and is 0.955 if the underlying distribution of differences is normal, see (2). The handling of ties and zeros is discussed by Pratt (5) and Cureton (1). Point and confidence interval estimators are easily derived from the test procedure, and details are described in Lehmann (4). Lehmann (4) also describes power properties for the test procedure when shift alternatives are of interest. Extensions to censored
data are discussed by Woolson & Lachenbruch (10) and Schemper (8). References for other aspects of the Wilcoxon signed-rank test are given by Randles & Wolfe (7) and Randles (6).

REFERENCES

1. Cureton, E. E. (1967). The normal approximation to the signed-rank sampling distribution when zero differences are present, Journal of the American Statistical Association 62, 1068–1069.
2. Hodges, J. L., Jr & Lehmann, E. L. (1956). The efficiency of some nonparametric competitors of the t-test, Annals of Mathematical Statistics 27, 324–335.
3. Hollander, M. & Wolfe, D. A. (1973). Nonparametric Statistical Methods. Wiley, New York.
4. Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San Francisco.
5. Pratt, J. W. (1959). Remarks on zeros and ties in the Wilcoxon signed rank procedures, Journal of the American Statistical Association 54, 655–667.
6. Randles, R. H. (1988). Wilcoxon signed rank test, in Encyclopedia of Statistical Sciences, Vol. 9, S. Kotz & N. L. Johnson, eds. Wiley, New York, pp. 613–616.
7. Randles, R. H. & Wolfe, D. A. (1979). Introduction to the Theory of Nonparametric Statistics. Wiley, New York.
8. Schemper, M. (1984). A generalized Wilcoxon test for data defined by intervals, Communications in Statistics—Theory and Methods 13, 681–684.
9. Wilcoxon, F. (1945). Individual comparisons by ranking methods, Biometrics 1, 80–83.
10. Woolson, R. F. & Lachenbruch, P. A. (1980). Rank tests for censored matched pairs, Biometrika 67, 597–600.
WOMEN’S HEALTH INITIATIVE: STATISTICAL ASPECTS AND SELECTED EARLY RESULTS
Ross L. Prentice and Garnet L. Anderson Fred Hutchinson Cancer Research Center Seattle, WA, USA
1 INTRODUCTION
The Women's Health Initiative (WHI) is perhaps the most ambitious population research investigation ever undertaken. The centerpiece of the WHI program is a randomized, controlled clinical trial (CT) to evaluate the health benefits and risks of three distinct interventions (dietary modification, postmenopausal hormone therapy, and calcium/vitamin D supplementation) among 68 132 postmenopausal women. Participating women were identified from the general population living in proximity to any one of 40 participating clinical centers throughout the United States. The WHI program also includes an observational study (OS) comprising 93 676 postmenopausal women recruited from the same population base as the CT. Enrollment into WHI began in 1993 and concluded in 1998. Intervention activities in the combined hormone therapy component of the CT ended early in July 2002, when evidence had accumulated that the risks exceeded the benefits for combined hormone therapy. Follow-up on all participating women is planned through March 2005, giving an average follow-up duration of about 8.5 years in the CT and 7.5 years in the OS.

2 WHI CLINICAL TRIAL AND OBSERVATIONAL STUDY

The WHI CT includes three overlapping components, each a randomized controlled comparison among women who were postmenopausal and in the age range of 50 to 79 at randomization. The dietary modification (DM) component randomly assigned 48 835 (target 48 000) eligible women to either a sustained low-fat eating pattern (40%) or self-selected dietary behavior (60%), with breast cancer and colorectal cancer as designated primary outcomes and coronary heart disease as a secondary outcome. From the outset, the nutrition goals for women assigned to the DM intervention group have been to reduce total dietary fat to 20%, and saturated fat to 7%, of corresponding daily calories and, secondarily, to increase daily servings of vegetables and fruits to at least five and of grain products to at least six, and to maintain these changes throughout trial follow-up. The randomization of 40%, rather than 50%, of participating women to the DM intervention group was intended to reduce trial costs, while testing trial hypotheses with specified power.

The postmenopausal hormone therapy (PHT) component is composed of two parallel randomized, double-blind trials among 27 347 (target 27 500) women, with coronary heart disease (CHD) as the primary outcome, with hip and other fractures as secondary outcomes, and with breast cancer as a potential adverse outcome. Of these, 10 739 (39.3% of total) had a hysterectomy prior to randomization, in which case there was a 1:1 randomized double-blind allocation between conjugated equine estrogen (E-alone) 0.625 mg/day and placebo. The remaining 16 608 women (60.7%), each having a uterus at baseline, were randomized 1:1 to the same preparation of estrogen plus continuous 2.5 mg/day of medroxyprogesterone (E + P) or placebo. These numbers compare to design goals of 12 375 for the E-alone comparison and 15 125 for the E + P comparison, based on an assumption that 45% of women would be post hysterectomy. Over 8000 women were randomized to both the DM and PHT clinical trial components.

At their one-year anniversary from DM and/or PHT trial enrollment, all women were further screened for possible randomization in the calcium and vitamin D (CaD) component, a randomized double-blind trial of 1000 mg elemental calcium plus 400 international units of vitamin D3 daily, versus placebo. Hip fracture is the designated primary outcome for the CaD component, with
other fractures and colorectal cancer as secondary outcomes. A total of 36 282 (53.3% of CT enrollees) were randomized to the CaD component. While the WHI design estimated that about 45 000 women would enroll in the CaD trial component, protocol planning activities also included projected sample sizes of 35 000 and 40 000 and noted that most WHI objectives could be met with these smaller sample sizes. The total CT sample size of 68 132 is only 60.6% of the sum of the individual sample sizes for the three CT components, providing a cost and logistics justification for the use of a partial factorial design with overlapping components. Postmenopausal women of ages 50 to 79 years, who were screened for the CT but proved ineligible or unwilling to be randomized, were offered the opportunity to enroll in the observational study (OS). The OS is intended to provide additional knowledge about risk factors for a range of diseases, including cancer, cardiovascular disease, and fractures. It has an emphasis on biological markers of disease risk, and on risk factor changes as modifiers of risk. There was also an emphasis on the recruitment of women of racial/ethnic minority groups. Overall 18.5% of CT women and 16.7% of OS women identified themselves as other than white. These fractions allow meaningful study of disease risk factors within certain minority groups in the OS. Also, key CT subsamples are weighted heavily in favor of the inclusion of minority women in order to strengthen the study of intervention effects on specific intermediate outcomes (e.g. changes in blood lipids or micronutrients) within minority groups. Age distribution goals were also specified for the CT as follows: 10%, ages 50 to 54 years; 20%, ages 55 to 59 years; 45%, ages 60 to 69 years; and 25%, ages 70 to 79 years. While there was substantial interest in assessing the benefits and risks of each CT intervention over the entire 50 to 79 year age range, there was also interest in having sufficient representation of younger (50–54 years) postmenopausal women for meaningful age group-specific intermediate outcome (biomarker) studies, and of older (70–79 years) women for studies of treatment
effects on quality of life measures, including aspects of physical and cognitive function. Differing age and incidence rates within the 50 to 79 age range, and across the outcomes that were hypothesized to be affected, favorably or unfavorably, by the interventions under study, provided an additional motivation for a prescribed age-at-enrollment distribution.

The enrollment of such a large number of women, meeting designated eligibility and exclusionary criteria (11) for each CT component and for the OS, proved to be a challenge, particularly for the PHT component of the CT, since many women who volunteered for WHI were already taking hormones and did not wish to be randomized to take hormones or placebo, while other women had already made a decision against the use of hormones.

3 STUDY ORGANIZATION

In addition to the clinical centers, the study is implemented through a Clinical Coordinating Center (CCC) located in Seattle, with various collaborators providing specific expertise, as described below. The National Heart, Lung, and Blood Institute (NHLBI) sponsors the program with input from the National Cancer Institute, the National Institute on Aging, the National Institute of Arthritis and Musculoskeletal and Skin Diseases, the NIH Office of Research on Women's Health, and the NIH Director's Office. A steering committee, consisting of the principal investigators of the 40 CCs, the CCC, and NHLBI representatives, is responsible for major scientific and operational decisions. An executive committee identifies, prioritizes, and coordinates items for the steering committee discussion. Program activities are implemented through a regional organization that categorizes CCs geographically (West, Midwest, Northeast, Southeast). Principal investigators and staff groups defined by project responsibilities (clinic manager, clinic practitioner, nutritionist, recruitment coordinator, data coordinator, outcomes coordinator) meet regularly through conference calls within regions to discuss implementation plans and issues, and regional staff group representatives also confer regularly to ensure national
coordination. Nine standing advisory committees (behavior, calcium and vitamin D, design and analysis, dietary modification, hormone therapy, morbidity and mortality, observational study, publications and presentations, special populations), composed of study investigators having expertise in the major substantive areas involved in the program, provide recommendations on relevant issues as they arise. The CCC participates and provides liaison support in these various contexts.

Figure 1 shows the WHI governance more generally, including NIH advisory committees. Specifically, the directors of participating NIH institutes and offices form a consortium that advises the NHLBI director concerning the WHI. A special working group of the NHLBI council also advises the NHLBI director concerning the WHI.

4 PRINCIPAL CLINICAL TRIAL COMPARISONS, POWER CALCULATIONS, AND SAFETY AND DATA MONITORING

This section provides sample sizes by age for each CT component and for the OS, and provides power calculations for key outcomes for each continuing CT component. Relative to the basic WHI design manuscript (11), these calculations have been updated to reflect the sample size and age distribution achieved, and to reflect the actual average follow-up duration, which will be realized by March 2005.

The target sample sizes noted above were based on consideration of the probability of rejecting the null hypothesis of no treatment effect (i.e. power) on the designated primary outcome under a set of design specifications concerning age-specific control group primary outcome incidence rates, intervention effects on incidence rates as a function of time from randomization, intervention adherence rates, and competing risk mortality rates. These assumptions have previously been listed in (11), where an extensive bibliography is cited providing the rationale for these assumptions. The power calculations were based on weighted logrank statistics that accumulate the differences between the observed numbers of primary outcome events in the intervention group and the expected number of
such events under the null hypothesis, across the follow-up time period. Early events that may be less likely to be affected by intervention activities are downweighted relative to later events. Specifically, the observed minus expected differences are weighted linearly from zero at randomization to a maximum value of one at a certain time from randomization and are constant (at one) thereafter. For cardiovascular disease and fracture incidence, this "certain time" was taken to be three years, whereas for cancer and mortality, it was taken to be 10 years. For coronary heart disease incidence, the event times are grouped into three-year follow-up periods, in order to accommodate the inclusion of silent myocardial infarctions detected by routine electrocardiograms, which are to be obtained at baseline and every three years during follow-up for CT participants. A weighted odds ratio test statistic is then used to acknowledge this grouping.

Table 1 shows the number of enrollees, and percentages of the total, by age category for each component of the CT and the OS. Note the degree of correspondence to the target age distribution, especially in the PHT component. Such correspondence was achieved by the closure of age-specific cells as the target numbers were approached. Table 2 shows the projected power, that is, the probability of rejecting the null hypothesis, for the key outcomes for each continuing component of the CT, taking account of the age-specific sample sizes in Table 1. Projected power is given at planned termination in early 2005, in which case the average follow-up duration will be about 8.5 years in the DM and PHT components and about 7.5 years in the CaD component. The intervention effects shown in Table 2 represent the projected effect size after accounting for assumed nonadherence and loss to competing risks. Comparison with projected power calculations at the design stage (11) indicates that a somewhat prolonged recruitment period, and the minor departures from target in sample sizes by age category, had rather little effect on projected study power. The CHD and hip fracture power projections for the E-alone versus placebo comparison are somewhat reduced by a smaller than targeted sample size (10 739 versus 12 375) in this CT component.
Table 1. Women's Health Initiative Sample Sizes (% of Total) by Age Group (as of 4/1/00)

                                    Postmenopausal hormone therapy
Age group   Dietary          Without uterus   With uterus    Calcium and     Observational
            modification     (E-alone)        (E + P)        vitamin D       study
50–54       6961 (14)        1396 (13)        2029 (12)      5157 (14)       12 386 (13)
55–59       11 043 (23)      1916 (18)        3492 (21)      8265 (23)       17 321 (18)
60–69       22 713 (47)      4852 (45)        7512 (45)      16 520 (46)     41 196 (44)
70–79       8118 (17)        2575 (24)        3575 (22)      6340 (17)       22 773 (24)
Total       48 835           10 739           16 608         36 282          93 676
Table 2. Statistical Power for Each Component for the CT

                                            Disease probability (×100)[a]
Outcome                                     Control   Intervention   Intervention      Avg. follow-up   Projected
                                                                     effect[b] (%)     duration (yrs)   power (%)
Dietary modification component
  Breast cancer                             2.72      2.35           14                8.5              84
  Colorectal cancer                         1.39      1.12           19                8.5              87
  CHD                                       3.78      3.27           14                8.5              84
Postmenopausal hormone therapy (E-alone)
  CHD                                       4.63      3.67           21                8.5              72
  Hip fracture                              2.86      2.25           21                8.5              55
  Combined fracture[c]                      11.02     8.81           20                8.5              97
  Breast cancer                             4.38      5.36           (22)              13.5             71
Calcium and vitamin D
  Hip fracture                              2.23      1.77           21                7.5              88
  Combined fracture[c]                      8.93      7.23           19                7.5              >99
  Colorectal cancer                         1.25      1.02           18                7.5              66

[a] Cumulative disease probability to planned termination (×100).
[b] One minus ratio of control to intervention cumulative incidence rates at study termination (×100).
[c] Includes proximal femur, distal forearm, proximal humerus, pelvis, and vertebrae.
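For orientation only, a rough power figure can be obtained from the cumulative incidence probabilities in Table 2 with the standard unweighted logrank (Schoenfeld) approximation, as in the sketch below. This is an added illustration, not the WHI method: the actual Table 2 entries come from weighted statistics with adjustment for nonadherence and competing risks, so the approximation will not reproduce them; the sample sizes and the 40/60 allocation are taken from the text.

```python
# Rough logrank power approximation (Schoenfeld), for orientation only; it is
# not the WHI calculation, which used weighted statistics and further
# adjustments.  Pure standard library.
import math
from statistics import NormalDist

def logrank_power(n_int, n_ctl, p_int, p_ctl, alpha=0.05):
    """Approximate two-sided logrank power from cumulative incidence probabilities."""
    nd = NormalDist()
    events = n_int * p_int + n_ctl * p_ctl           # expected total events
    frac = n_int / (n_int + n_ctl)                   # allocation fraction
    log_hr = abs(math.log(p_int / p_ctl))            # crude hazard-ratio proxy
    z = math.sqrt(events * frac * (1 - frac)) * log_hr - nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(z)

# DM breast cancer row of Table 2: control 0.0272, intervention 0.0235, with
# 19 541 intervention and 29 294 comparison women (40/60 split).
print(logrank_power(19541, 29294, 0.0235, 0.0272))   # roughly 0.7, below the
# tabled 84% because the WHI calculation differs in several respects
```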
Power calculations for representative comparisons in the OS have been previously given (11).

An independent Data and Safety Monitoring Board (DSMB) is charged with monitoring the CT to ensure participant safety, to assess conformity to program goals, and to examine whether there is a need for early stoppage or other modification of any CT component. The DSMB is composed of senior researchers, otherwise not associated with the study, who have expertise in relevant areas of medicine, epidemiology, biostatistics, clinical trials, and ethics. The DSMB meets biannually to review study progress,
including its status in the context of emerging external data. The Board provides recommendations to the NHLBI Director (see Figure 1). The DSMB reviewed and approved the protocol and consent forms prior to study implementation. They are apprised of any significant changes to protocol. Throughout the period of study conduct, the DSMB reviews data on recruitment, adherence, retention, and outcomes. The DSMB is the only group given access to treatment arm comparisons outside of the necessary CCC and NHLBI staff. As such, they determine whether the existing data demonstrate either significant or unanticipated risk or unexpectedly strong benefits, in which case early trial termination, or modification, may be recommended.
Figure 1. Organizational structure of the Women's Health Initiative. NHLBI: National Heart, Lung, and Blood Institute; NIH: National Institutes of Health; PI: Principal Investigator; SC: Steering Committee; CC: Clinical Center; CM: Clinic Manager; LN: Lead Nutritionist; CP: Clinic Practitioner; DC: Data Coordinator; OC: Outcomes Coordinator
A particular complexity in this study, as often exists in prevention studies, is the need to consider effects on multiple disease processes that may differ in direction, timing, and magnitude. In the WHI, CT monitoring for consideration of early stopping is based on the following principles and procedures:

• Each trial component (DM, Estrogen alone, Estrogen plus Progestin, CaD) is evaluated separately, so that a stopping decision for one will not necessarily impact the continuation of the other three.

• The evaluation of each intervention includes an assessment of the overall intervention effects on health, through the use of a global index. This global index is defined for each woman as time to first incident event. The events to be included were selected on the basis of a priori evidence for each intervention, and supplemented with evidence of death from other causes to capture serious unanticipated intervention effects, as shown in Table 3.

• Early stopping for benefit would be considered, if the primary endpoint comparison crossed a 0.05 level O'Brien–Fleming (OBF) boundary, and the global index provided supportive evidence defined by crossing the 0.1 level OBF in favor of the intervention. For the DM, a Bonferroni correction is used to acknowledge the fact that there are two designated primary endpoints. This correction allows a stopping recommendation to be made if the boundary is crossed for either of the primary endpoints, without exceeding the designated probability (0.05) of falsely rejecting the overall null hypothesis.

• Early stopping for adverse effects uses a two-step procedure with a 0.1 level OBF boundary for primary safety endpoints, a Bonferroni corrected 0.1 level OBF boundary for all other safety endpoints, and a lower boundary of z = −1.0 for the global index to signify supportive evidence for overall harm.
Table 3. Trial Monitoring Endpoints for the WHI Clinical Trial Components

                            PHT (E-alone and E + P)        DM                          CaD
Primary endpoint            CHD                            Breast cancer,              Hip fractures
                                                           Colorectal cancer
Primary safety endpoint     Breast cancer                  N/A                         N/A
Other endpoints included    Stroke, Pulmonary embolism,    CHD, Death from             Colorectal cancer, Breast cancer,
in the global index         Hip fractures, Colorectal      other causes                Other fractures, Death from
                            cancer, Endometrial cancer,                                other causes
                            Death from other causes
As mentioned above, weighted logrank test statistics are used to test the difference between intervention and control event rates for each outcome. These weights were specified to yield efficient test statistics for the primary outcome under CT design assumptions. As such, these tests may not be sensitive to unexpected effects, whether adverse or beneficial, on any of the study outcomes. Consequently, the DSMB also informally examines unweighted logrank statistics, as well as weighted and unweighted tests for various intervals of time since randomization and for selected subgroups of participants (e.g. specific age groups), toward ensuring participant safety. Further detail on CT monitoring methods and their rationale is given in (6).

CT monitoring reports, prepared on a semiannual basis throughout trial follow-up, also present data on the adherence to intervention goals, the rates of participation in follow-up and other program activities, and control group incidence rates. These data are used to update power calculations, along the lines of Table 2, to help assess conformity to overall design goals, and to alert the DSMB to emerging problems. Data on selected biomarkers and intermediate outcomes are also assembled, as such data can provide an objective assessment of the extent to which intervention goals are achieved, and can provide insights into processes that can explain intervention effects on disease outcomes.
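As an added illustration of the weighted logrank idea described here (this is a sketch, not WHI code; the function, variable names, and simulated data are invented for the example), the following computes a weighted logrank Z statistic with the linear ramp weights w(t) = min(t/τ, 1), with τ = 3 years for cardiovascular and fracture outcomes and τ = 10 years for cancer and mortality.

```python
# Weighted logrank Z statistic with linear ramp weights, w(t) = min(t/tau, 1).
# Illustrative only; assumes numpy is available and uses simulated data.
import numpy as np

def weighted_logrank(time, event, arm, tau=3.0):
    """Weighted logrank Z comparing arm == 1 versus arm == 0."""
    time, event, arm = map(np.asarray, (time, event, arm))
    z_num, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):            # distinct event times
        at_risk = time >= t
        n = at_risk.sum()
        n1 = (at_risk & (arm == 1)).sum()
        d = ((time == t) & (event == 1)).sum()       # events at t, both arms
        d1 = ((time == t) & (event == 1) & (arm == 1)).sum()
        if n < 2:
            continue
        w = min(t / tau, 1.0)                        # downweight early events
        expected = d * n1 / n
        hyper_var = d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        z_num += w * (d1 - expected)
        var += w**2 * hyper_var
    return z_num / np.sqrt(var)

# Toy usage with simulated follow-up times (years), censored at 8.5 years
rng = np.random.default_rng(0)
n = 2000
arm = rng.integers(0, 2, n)
t_event = rng.exponential(scale=np.where(arm == 1, 60.0, 50.0))
time = np.minimum(t_event, 8.5)
event = (t_event <= 8.5).astype(int)
print(weighted_logrank(time, event, arm, tau=3.0))
```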
5 BIOMARKERS AND INTERMEDIATE OUTCOMES

Beyond testing primary and secondary hypotheses, the CT is designed to support specialized analyses to explain any treatment effects in terms of intermediate outcomes, and both the CT and OS are designed to produce new information on risk factors for cardiovascular disease, cancer, and other diseases. To do so, the basic WHI program supports a substantial infrastructure of archival blood product storage, which includes serum and plasma from CT and OS participants at baseline, and at selected follow-up times (one year from enrollment in the CT and three years from enrollment in the OS). In addition, baseline white blood cells (buffy coat) are stored in both the CT and OS. These blood specimens are used for specialized studies related to participant safety and CT intervention adherence, and for externally funded ancillary studies. Stored blood components collected from each CT and OS participant during screening include 7.2 mL serum (in 4 × 1.8 mL vials), 5.4 mL citrated plasma (in 3 × 1.8 mL vials), 5.4 mL EDTA plasma (in 3 × 1.8 mL vials), and two aliquots of buffy coat.

Intermediate outcome data collected in the CT include electrocardiograms (obtained at baseline, 3, 6, and 9 years among all CT women) to ascertain "silent" myocardial infarctions and other cardiac diagnoses, and bilateral mammograms (obtained annually for PHT women and biennially for other CT participants). In addition, all PHT women, 65 years of age and older, have cognitive
function assessment, and a 25% sample have functional assessment, at baseline and follow-up. A sample of women in both the CT and OS (all those who are enrolled at any one of three specified clinical centers) have dual x-ray absorptiometry at baseline, and at follow-up years 1 (CT only), 3, 6, and 9, to measure change in bone mass in the hip, spine, and total body; these women also provide urine specimens that are stored for studies of the interventions' effects on bone metabolites. Analyses to explain CT treatment effects, and CT/OS analyses to elucidate disease risk factors, generally take place in a case–control or case–cohort fashion, to limit the number of specialized analyte determinations. Extensive self-report questionnaire data at baseline and selected follow-up times are also available for use in these analyses, and can be used to inform the case–control sampling procedure.

6 DATA MANAGEMENT AND COMPUTING INFRASTRUCTURE

The size and scope of the WHI create a large and rather complex data processing load. Each clinical site has recruited at least 3000 participants, creating a local data management load as large as that for many multicenter trial coordinating centers. The data collected for WHI fall roughly into three categories: self-report, clinical measurements, and outcomes data. Self-reported information includes demographic, medical history, diet, reproductive history, family history, and psychosocial and behavioral factors. For these areas, standardized questionnaires were developed from instruments used in other studies of similar populations. Current use of medications and dietary supplements is captured directly from pill bottles that participating women bring to the clinic. To capture details of hormone therapy use prior to WHI enrollment, an in-person interview was conducted with each woman at baseline to determine her entire history of postmenopausal hormone use. For additional diet information in the DM trial beyond routine food frequency questionnaires, 4-day food records and 24-hour recall of diet are obtained from a subsample of women. Dietary
records are completed by the participant, reviewed and documented by certified clinic staff, and a subsample is sent to the CCC for nutrient coding and analysis. The 24-hour recalls of diet are obtained by telephone contact from the coordinating center and these data were coded using the same methods as for the dietary records. Clinical measures such as anthropometrics, blood pressure, functional status, and results from gynecologic exams are obtained by certified WHI clinic staff using standardized procedures and data collection forms and key-entered into the local study database. Limited blood specimen analyses are conducted locally and recorded. The remaining blood specimens are sent to a central blood repository where they are housed until the appropriate subsamples are identified and sent to the central laboratory for the selected analyses. Electrocardiogram and bone densitometry data are submitted electronically to respective central reading and coordination facilities. Information on significant health outcomes is initially obtained by self-report. If the type of event is of particular interest for WHI research, additional documentation is obtained from local health care providers and this information is used by a clinic physician to classify and code the event. Additional details of outcomes definitions and methods appear elsewhere (5). Data quality assurance mechanisms are incorporated at several levels, in addition to the overall quality assurance program described below. Data entry screens incorporate range and validity checks, and scanning software rejects forms containing critical errors. Routine audits of randomly selected charts document errors and provide feedback to CC and CCC staff. Additional data quality checks are used in creating analytic data sets. Multiple versions of most forms have been used, so some data items require mapping across versions. To support the large requirement of local operations as well as central analyses and reporting, the CCC developed and implemented a standardized computing and database management system that serves each clinical center site and the coordinating center. This computing system can be
logically divided into three major areas: computing at the clinical centers, computing at the CCC, and a private wide area network (WAN). The study-wide database uses this infrastructure to provide the appropriate data management tools to all sites. Each clinical center is equipped with its own local area network consisting of a file server, ethernet switch, 10 to 20 workstations, two or more printers, a mark sense form reader, bar code readers, and a router. The router provides connectivity back to the CCC over the WAN. In some cases, the router also provides connectivity to the parent institution. The file server is configured with Windows NT Advance Server and runs its own instance of the study’s Oracle database. The server also provides standardized office applications (Microsoft Office) and e-mail (Microsoft Exchange Web client). The workstations are Windows 98 clients. The CCC maintains a cadre of application servers dedicated to the development, testing, and warehousing of the consolidated database, currently requiring 100 GB. The CCC also maintains several other servers dedicated to statistical analysis, administrative support for CCC staff, website and e-mail services for study-wide communication, and centralized automated backup for all study servers. The website and e-mail system dedicated to WHI staff and investigators is critical to managing the challenges of study communications with nearly 1500 WHI staff and investigators spread across 5 time-zones. The website provides a kind of electronic glue for bringing together disparate groups. WHI email access is available through the website either over the WAN or through the Internet. The WHI WAN is a private network, which connects CCs to the CCC using a combination of 56 k and T1 frame-relay circuits. The WAN enables the CCC to conduct nightly backups of clinical center file servers. It also facilitates remote management and troubleshooting of clinical center equipment. In addition, it provides CCs direct access to the WHI e-mail system and website. The WHI database management system is a distributed replicated database, implemented in Oracle 8.0 for Windows NT. Database design and table structure are identical across CCs but are populated only with
data specific to that site. The average clinical center database currently requires approximately 15 GB of space. Data acquisition relies heavily on mark sense scanning, supplemented with traditional key entry and bar code reading. The database supports and enforces the study protocol through its participant eligibility confirmation, randomization, drug dispensing and collection, visit and task planning, and outcomes processing functions. Security is provided both by password protection and by limiting access to specific data based on the identified role of the user. Local access to clinical site-specific data is supported through centrally defined reports and a flexible data extract system. The CCC database provides the superstructure into which the CC data are consolidated routinely. Additional data are obtained from the central laboratories and specimen repository and are merged with, and checked against, the corresponding participant data. The central database serves as the source of all data reports and analyses.

7 QUALITY ASSURANCE PROGRAM OVERVIEW

The WHI program involves a complex protocol, with an extensive set of required procedures. The CT intervention goals and the study timeline are demanding program elements. With these challenges, an organized quality assurance (QA) program was needed to identify and correct emerging problems. The QA program is an integral part of the study protocol, procedures, and database, and covers all aspects of WHI. The program seeks to balance the need to assure scientific quality of the study with available resources. The complexity, size, and fiscal responsibility of WHI necessitated establishing priorities to guide local and central QA activities. The WHI QA priorities were developed by a task force comprising WHI investigators and staff, under the premise that aspects critical to the main scientific objectives of WHI would be of highest priority. As the centerpiece of WHI, the fundamental elements of the CT are considered of highest priority. The next highest priority is given to key elements of the OS and elements of the CT that
are important for interpretive analyses. The remaining elements are given a lower priority. The implementation of these priorities is manifested in the frequency and level of detailed QA activities. QA methods and responsibilities include activities performed at the CCs as well as activities initiated and coordinated by the CCC. The QA program includes: extensive documentation of procedures; training and certification of staff; routine QA visits conducted by the CCC (all CCs received an initial and an annual QA visit, while subsequent visits are done approximately every other year, or more frequently as needed); and database reports for review by CCs and pertinent committees describing the completeness, timeliness, and reliability of tasks at the CCs. For example, moving average monthly intervention adherence rates, and major task completeness rates, for each CC are used as one up-to-date indicator of CT status.

WHI has established performance goals for various important tasks that are centrally monitored. These goals were determined on the basis of design assumptions and, where available, on previously published standards of quality and safety. The performance of each CC is reviewed on a regular basis under a performance monitoring plan. This plan is used to identify clinic-specific performance issues in a timely fashion, to reinforce good performance, and to provide assistance or to institute corrective action if performance is inadequate. Much of this work is conducted under the auspices of a performance monitoring committee (PMC), comprising representatives of the CCC, CCs, and PO. The PMC follows up on persistent issues with specific CCs, and conducts site visits to facilitate the resolution of specific areas of concern. Some additional detail on the implementation of the WHI design is given in (2).

8 EARLY RESULTS FROM THE WHI CLINICAL TRIAL

In late May 2002, after an average follow-up of 5.2 years, the DSMB recommended the early stopping of the E + P trial component because the weighted logrank test
for invasive breast cancer exceeded the OBF stopping boundary for this adverse effect and the global index supported risks exceeding benefits. Participating women in the E + P trial were asked to stop taking their study pills on July 8, 2002, and principal trial results were published soon thereafter (13). Women in the E + P trial continue to be followed without intervention through 2005, and a plan is under development for the additional nonintervention follow-up of PHT women from 2005 to 2007.

On the basis of data through April 2002, the E + P trial generated hazard ratio estimates and nominal 95% confidence intervals as follows: coronary heart disease 1.29 (1.02–1.63), breast cancer 1.26 (1.00–1.59), stroke 1.41 (1.07–1.85), pulmonary embolism 2.13 (1.39–3.25), colorectal cancer 0.63 (0.43–0.92), endometrial cancer 0.83 (0.47–1.47), hip fracture 0.66 (0.45–0.98), and death due to other causes 0.92 (0.74–1.14). The global index, defined as the earliest event of these just listed, had a hazard ratio estimate (nominal 95% confidence interval) of 1.15 (1.03–1.28). Absolute excess risks per 10 000 person-years were estimated as seven for coronary heart disease, eight for stroke, eight for pulmonary embolism, and eight for breast cancer, while corresponding absolute risk reductions were estimated as six for colorectal cancer and five for hip fracture. The absolute excess risk for global index events was estimated as 19 per 10 000 person-years. Confidence intervals adjusted for sequential monitoring, and for multiple testing in accordance with the CT monitoring plan, are also given in (13).

Even though these risk alterations are fairly modest, they have substantial population implications for morbidity and mortality. As a result of these findings, various professional organizations have altered their recommendations concerning combined hormone use, and labeling changes have been made or are under consideration. These results follow decades of observational studies supporting a cardioprotective benefit for hormone therapy, and the discussion following the reporting of E + P trial results has sharpened the understanding of comparative properties of trials and observational
studies among scientific groups and the general population. Additional outcome events through July 7, 2002 have been adjudicated and several more specialized results papers have been published (1,3,4,7,8,12). An ancillary study examining the effects of E + P on dementia and cognitive function has also been published (9,10).

9 SUMMARY AND DISCUSSION
The WHI CT and OS were implemented in close correspondence to design specifications (11). Departures from design assumptions concerning sample size, age distribution, and projected average trial follow-up have limited effect on the adequacy of primary outcome study power for continuing CT components, with the possible exception of the E-alone versus placebo comparison, where some power reduction for coronary heart disease arises from a smaller than targeted sample size. Substantial infrastructure for specimen storage, routine analyte determination, data management and computing, and for data and protocol quality control was also implemented. Principal results from the trial of combined hormones (E + P) have been presented following the early stopping of intervention.

Ongoing challenges in the CT and OS include retaining the active participation of study subjects over a lengthy follow-up period, ensuring the unbiased and timely ascertainment of outcome events in each CT component and in the OS and, perhaps the most challenging, ensuring an adequate adherence to intervention goals for each continuing CT intervention. The estrogen-alone component of the WHI clinical trial stopped early on March 1, 2004, with principal results presented in the Journal of the American Medical Association 291, 1701–1712, 2004.

10 ACKNOWLEDGMENTS
This work was supported by NIH contracts for the WHI. Parts of this entry are closely related to a forthcoming monograph chapter (2) by the authors and other WHI colleagues.
REFERENCES

1. Anderson, G. L., Judd, H. L., Kaunitz, A. M., Barad, D. H., Beresford, S. A., Pettinger, M., Liu, J., McNeeley, S. G. & Lopez, A. M., for the WHI Investigators. (2003). Effects of estrogen plus progestin on gynecologic cancers and associated diagnostic procedures in the Women's Health Initiative: a randomized trial, Journal of the American Medical Association 290, 1739–1748.
2. Anderson, G. L., Manson, J., Wallace, R., Lund, B., Hall, D., Davis, S., Shumaker, S., Wang, C. Y., Stein, E. & Prentice, R. L. (2003). Implementation of the Women's Health Initiative study design, Annals of Epidemiology 13, S5–S17.
3. Cauley, J. A., Robbins, J., Chen, Z., Cummings, S. R., Jackson, R., LaCroix, A. Z., LeBoff, M., Lewis, C. E., McGowan, J., Neuner, J., Pettinger, M., Stefanick, M. L., Wactawski-Wende, J. & Watts, N. B. (2003). The effects of estrogen plus progestin on the risk of fracture and bone mineral density. The Women's Health Initiative clinical trial, Journal of the American Medical Association 290, 1729–1738.
4. Chlebowski, R. T., Hendrix, S. L., Langer, R. D., Stefanick, M. L., Gass, M., Lane, D., Rodabough, R. J., Gilligan, M. A., Cyr, M. G., Thomson, C. A., Khandekar, J., Petrovitch, H. & McTiernan, A., for the WHI Investigators. (2003). Influence of estrogen plus progestin on breast cancer and mammography in healthy postmenopausal women. The Women's Health Initiative randomized trial, Journal of the American Medical Association 289, 3243–3253.
5. Curb, J. D., McTiernan, A., Heckbert, S. R., Kooperberg, C., Stanford, J., Nevitt, M., Johnson, K., Proulx-Burns, L., Pastore, L., Criqui, M. & Daugherty, S. (2003). Outcomes ascertainment and adjudication methods in the Women's Health Initiative, Annals of Epidemiology 13, S122–S128.
6. Freedman, L. S., Anderson, G. L., Kipnis, V., Prentice, R. L., Wang, C. Y., Rossouw, J. E., Wittes, J. & De Mets, D. (1996). Approaches to monitoring the results of long-term disease prevention trials: examples from the Women's Health Initiative, Controlled Clinical Trials 17, 509–525.
7. Hays, J., Ockene, J. K., Brunner, R. L., Kotchen, J. M., Manson, J. E., Patterson, R. E., Aragaki, A. K., Shumaker, S. A., Brzyski, R. G., LaCroix, A. Z., Granek, I. A. & Valanis, B. G., for the WHI Investigators. (2003). Effects of estrogen plus progestin on health-related quality of life, New England Journal of Medicine 348, 1839–1854.
8. Manson, J. E., Hsia, J., Johnson, K. C., Rossouw, J. E., Assaf, A. R., Lasser, N. L., Trevisan, M., Black, H. R., Heckbert, S. R., Detrano, R., Strickland, O. L., Wong, N. D., Crouse, J. R., Stein, E. & Cushman, M., for the WHI Investigators. (2003). Estrogen plus progestin and the risk of coronary heart disease, New England Journal of Medicine 349, 523–534.
9. Rapp, S. R., Espeland, M. A., Shumaker, S. A., Henderson, V. W., Brunner, R. L., Manson, J. E., Gass, M. L., Stefanick, M. L., Lane, D. S., Hays, J., Johnson, K. C., Coker, L. H., Daily, M. & Bowen, D., for the WHIMS Investigators. (2003). Effect of estrogen plus progestin on global cognitive function in postmenopausal women. The Women's Health Initiative memory study: a randomized controlled trial, Journal of the American Medical Association 289, 2663–2672.
10. Shumaker, S. A., Legault, C., Rapp, S. R., Thal, L., Wallace, R. B., Ockene, J. K., Hendrix, S. L., Jones, III, B. N., Assaf, A. R., Jackson, R. D., Kotchen, J. M., Wassertheil-Smoller, S. & Wactawski-Wende, J., for the WHIMS Investigators. (2003). Estrogen plus progestin and the incidence of dementia and mild cognitive impairment in postmenopausal women. The Women's Health Initiative memory study: a randomized controlled trial, Journal of the American Medical Association 289, 2651–2662.
11. The Women's Health Initiative Study Group. (1998). Design of the women's health initiative clinical trial and observational study, Controlled Clinical Trials 19, 61–109.
12. Wassertheil-Smoller, S., Hendrix, S. L., Limacher, M., Heiss, G., Kooperberg, C., Baird, A., Kotchen, T., Curb, J. D., Black, H., Rossouw, J. E., Aragaki, A., Safford, M., Stein, E., Laowattana, S. & Mysiw, W. J., for the WHI Investigators. (2003). Effect of estrogen plus progestin on stroke in postmenopausal women. The Women's Health Initiative: a randomized trial, Journal of the American Medical Association 289, 2673–2684.
13. Writing Group for the Women's Health Initiative Investigators. (2002). Risks and benefits of estrogen plus progestin in healthy postmenopausal women. Principal results from the Women's Health Initiative randomized controlled trial, Journal of the American Medical Association 288, 321–333.
WOMEN’S HEALTH INITIATIVE DIETARY TRIAL
YASMIN MOSSAVAR-RAHMANI Albert Einstein College of Medicine— Epidemiology & Population Health Bronx, New York
LESLEY F. TINKER Fred Hutchinson Cancer Research Center Seattle, Washington
1 OBJECTIVES
The Women’s Health Initiative (WHI) Randomized Controlled Dietary Modification Trial (DMT) is an historic study given its sheer size, duration, and the strong clinical relevance of the research questions it was designed to answer (i.e., What are the effects of a low-fat dietary pattern on decreasing the risk of incident breast cancer, colorectal cancer, and heart disease among postmenopausal women at 40 clinical centers across the United States?). It is the largest randomized controlled trial to date that involves long-term dietary change and disease outcomes (1, 2).
2 STUDY DESIGN

The design and the baseline description of the WHI DMT and recruitment methods have been published previously (2–4). Briefly, the WHI DMT was one component in the entirety of the WHI (n = 161,808), which was composed additionally of two postmenopausal hormone trials, one calcium-vitamin D supplementation trial, and one observational study (2). Within the WHI DMT, 48,835 generally healthy postmenopausal women aged 50–79 years at baseline were recruited and enrolled beginning in 1993. Major exclusion criteria included having a history of previous breast cancer or colorectal cancer, any cancer within the previous 10 years except non-melanoma skin cancer, medical conditions with a predicted survival of less than three years, adherence or retention concerns, or a current dietary intake of less than 32% of energy from fat. Mean follow-up time for the WHI DMT was 8.1 years, and the trial progressed to its planned conclusion date of March 31, 2005.

The WHI protocol and consent forms were approved by the institutional review board for each participating institution and the clinical coordinating center. All participants provided signed informed consent prior to enrollment. An independently appointed external Data and Safety Monitoring Board (WHI DSMB) met twice yearly. The Clinical Trial Registration government identifier for the WHI DMT is NCT0000061.

Sixty percent (n = 29,294) of the WHI DMT participants were assigned at random to a usual diet comparison group and were not asked to make any changes in diet; 40% of the participants were assigned to the low-fat intervention. Usual care participants were provided a copy of the Dietary Guidelines for Americans available at the time (5, 6). The 19,541 women assigned to the WHI DMT intervention were educated to follow an ad libitum low-fat dietary pattern of 20% of calories from total fat, five or more servings of vegetables and fruits daily, and six or more servings of grains daily.

2.1 The Dietary Change Intervention

The WHI DMT intervention group received intensive nutritional and behavioral modification training that consisted of 18 group sessions and one individual session during the first year. The first sessions began as early as 1994. After the first year, participants attended quarterly maintenance refresher sessions through the summer of 2004. Groups were composed of 8 to 15 participants with sessions facilitated by a nutritionist trained to deliver the WHI dietary intervention. Each participant received an individualized dietary fat gram goal to achieve daily intake of 20% energy from fat during the intervention; they were also given a goal to consume five or more servings daily of combined vegetables and fruits and six or more servings daily of grains. Self-monitoring techniques were emphasized, as was group session attendance. Neither a weight loss goal nor a physical activity goal was set for these participants.
Details of the core intervention program have been published (7). Commencing in 1999, augmented interventions based on motivational enhancement (ME), an abbreviated form of motivational interviewing, were introduced into the intervention program (8). Individual and group sessions were formulated to be motivational, with the nutritionist guiding rather than directing the participants. The OARS (O = Open-ended questions, A = Affirmations, R = Reflections, S = Summary) and FRAMES (F = Feedback, R = Responsibility, A = Advice, M = Menu of options, E = Empathy, S = Self-efficacy) models were incorporated into both formats (9). Other group-based ME formats included the Targeted Mailing Campaign and the Personal Evaluation of Food Intake. A Dietary Modification Trial Advisory Committee composed of WHI investigators and lead nutritionists provided oversight of the intervention and its delivery. The Special Populations Advisory Committee was responsible for ensuring that materials were appropriate for special populations.
2.2 Data and Specimen Collection and Follow-Up

During screening, potential participants for the WHI DMT completed three prerandomization visits during which baseline information was obtained (10). Follow-up for clinical events occurred every six months. A standardized written protocol, centralized training of local clinic staff, local quality assurance activities, and periodic quality assurance visits by the Clinical Coordinating Center were used to maintain uniform data collection procedures at all study sites. Dietary intake was monitored primarily by a validated Food Frequency Questionnaire (FFQ) designed for the WHI (11). The FFQ was administered during screening (considered the baseline measure), at one year after randomization, and thereafter annually to one-third of the participants on a rotating basis. Additionally, all WHI DMT participants completed a four-day food record at baseline. A 4.6% cohort of the WHI DMT participants also completed a four-day food record at year 1 followed by a 24-hour dietary recall by telephone at years 3, 6, and 9.

All WHI DMT participants had blood drawn after an overnight 8-hour fast at baseline and at the first annual visit. Serum and plasma samples were frozen at −70°C and were shipped to the WHI central storage facility. A total of 2,816 (5.8%) of the DMT participants had serum analyzed at baseline and years 1, 3, and 6 for diet-related biomarkers. Details are described elsewhere (10).
2.3 Outcomes Ascertainment

Information about prespecified disease endpoints and hospitalizations was collected twice annually by participant self-report. For the WHI DMT, participant reports of breast cancer, colorectal cancer, and cardiovascular disease and stroke were confirmed by medical record review and by centralized physician adjudication [breast and colorectal cancers and coronary heart disease (CHD) death] or local physician adjudicators (all other CHD outcomes). Details of the outcomes ascertainment procedures have been published (12).

2.4 Statistical Analysis

Descriptive statistics for baseline demographic and health information, laboratory measures, and dietary intakes were computed and compared by randomization assignment. Differences between intervention and comparison for baseline, year 1, and year 6 were computed and tested for significance at P < 0.001 using a two-sample t test or appropriate non-parametric techniques. Primary outcome comparisons were analyzed as time-to-event hazard ratios (HRs) and 95% confidence intervals (CIs) from Cox proportional hazards analyses that control for confounding variables. Significance was analyzed by log-rank tests. Kaplan-Meier curves were drawn to depict cumulative hazard estimates over time. Annualized incidence rates were assessed for absolute disease rate comparisons. Details of the statistical design have been published (2). Additional details may be found in the results papers for the disease endpoints.
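As a concrete illustration of this kind of time-to-event analysis (a sketch added here, not WHI code; the simulated data frame, its column names, and the use of the lifelines library are assumptions for the example), the pattern of hazard ratio, logrank test, and Kaplan-Meier estimation might look like the following.

```python
# Minimal sketch of the analysis pattern described above: Cox proportional
# hazards HR with 95% CI, a logrank test, and Kaplan-Meier estimates by arm.
# Assumes pandas and lifelines are installed; the data are simulated.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter, KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(1)
n = 5000
df = pd.DataFrame({
    "intervention": rng.integers(0, 2, n),   # 1 = low-fat arm (hypothetical)
    "age": rng.uniform(50, 79, n),
})
# Simulated follow-up (years), administratively censored at 8.1 years
t_event = rng.exponential(scale=np.where(df["intervention"] == 1, 95, 90))
df["time"] = np.minimum(t_event, 8.1)
df["event"] = (t_event <= 8.1).astype(int)

# Hazard ratio and 95% CI from a Cox model (age as an example covariate)
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()

# Logrank test of intervention versus comparison
mask = df["intervention"] == 1
print(logrank_test(df["time"][mask], df["time"][~mask],
                   df["event"][mask], df["event"][~mask]).p_value)

# Kaplan-Meier estimate for one arm (cumulative incidence is 1 - survival)
km = KaplanMeierFitter()
km.fit(df["time"][mask], df["event"][mask], label="intervention")
print(km.survival_function_.tail())
```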
3 RESULTS
3.1 Description of Baseline Characteristics

Among the 48,835 WHI DMT participants, at baseline, 37% were aged 50–59 years, 47% were aged 60–69 years, and 17% were aged 70–79 years. Diversity in racial and ethnic recruitment was achieved with a minority group representation of over 18%. Details of the baseline demographics also have been published (4).

3.2 Nutrient Data

Intervention group participants significantly reduced their percentage of energy from total dietary fat from a mean of 35% at baseline to 24.3% at year 1, which increased to 26.7% at year 3 and to 28.8% after 8.1 years of follow-up; whereas control group intake remained relatively stable over the same time period (1, 13, 14). This finding resulted in a difference in mean dietary fat intake at year 1 of 10.7% of energy from fat between the intervention group and comparison group. This difference between the groups was mostly maintained (8.1% at year 6) despite the fact that the intervention group did not reach the target goal of 20% of energy from fat. Dietary fruit and vegetable intake at baseline averaged 3.6 servings and increased to 5.1 servings at year 1, 5.2 servings at year 3, and then decreased to 4.9 servings after 8.1 years of follow-up for intervention group participants. Again, intake was stable for the comparison group. Initially the intervention group reported an average grain intake of 4.7 servings/day at baseline, which increased to 5.1 servings at year 1, then decreased to 4.6 servings at year 3, and to 4.3 servings after 8.1 years of follow-up. Black and Hispanic women decreased fat intake slightly less (6.6% and 5.6%) than did the group as a whole (8.3%), and Hispanic women showed somewhat lower increases in fiber than did the intervention group as a whole (1.5 vs. 2.5 g/d) (14). From baseline to year 1, reductions in fat intake in the intervention group came primarily from added fats such as butter, oils, and salad dressings (9.1 g/day), secondarily from meats (4.6 g/day), and thirdly from desserts (3.9 g/day) (15).
3.3 Clinical Outcomes
3.3.1 DM and Breast Cancer. A low-fat dietary pattern did not result in a statistically significant reduction in invasive breast cancer risk over an 8.1-year average follow-up period (13). However, a 9% reduction in breast cancer incidence was observed in the low-fat intervention group compared with the comparison group (95% confidence interval 0.83–1.01; P = 0.07; P = 0.09 weighted for length of follow-up). Secondary analyses indicated that the greatest reduction in breast cancer cases was among dietary change participants who started with higher intakes of total fat as a percentage of calories and made the greatest reductions in fat intake. Additionally, dietary modification participants significantly reduced the risk of breast cancer that was positive for the estrogen receptor and negative for the progesterone receptor (hazard ratio 0.64; 95% confidence interval, 0.49–0.84; P = 0.001), which suggests a dietary effect that may vary by hormone receptor characteristics of the tumor (13). A greater reduction occurred in levels of gamma-tocopherol in the intervention group (related to the dietary fat restriction) compared with the comparison group, and small positive differences occurred in levels of alpha-carotene, beta-carotene, and beta-cryptoxanthin in the intervention group compared with the comparison group. Low-density lipoprotein (LDL) cholesterol levels decreased modestly in the intervention group compared with the comparison group, but changes in levels of high-density lipoprotein (HDL) cholesterol, triglycerides, insulin, and glucose did not differ significantly between the two groups; this observation is inconsistent with other low-fat dietary intervention trials (16, 17). The dietary intervention was not reported to be associated with any adverse effects or any major weight loss. However, the trends observed suggest that a longer, planned nonintervention follow-up may yield a more definitive comparison (13).
3.3.2 DM and Colon Cancer. This low-fat dietary pattern intervention did not reduce the risk of colorectal cancer in postmenopausal women during 8.1 years of follow-up (HR, 1.08; 95% CI, 0.90–1.29). However, the annualized incidence rates of colon polyps or adenomas (self-reported) were significantly lower in the intervention group than in the comparison group (2.16% vs. 2.35%, respectively; HR, 0.91; 95% CI, 0.87–0.95). No differences in tumor characteristics were observed between groups, nor was there evidence of reduced risk for any category of colorectal cancer outcome (such as distal vs. proximal colon cancer) associated with the intervention (18).
3.3.3 DM and Cardiovascular Disease. Over a mean of 8.1 years, a decreased intake of dietary fat and increased intake of vegetables, fruits, and grains did not significantly reduce the risk of coronary heart disease, stroke, or cardiovascular disease in postmenopausal women. Modest reductions in risk factors associated with heart disease were observed for LDL cholesterol, blood pressure, and factor VIIc. No increases in risk factors were observed for HDL cholesterol, triglycerides, glucose, or insulin. The results suggest that a more focused diet and lifestyle intervention, combining dietary modification with weight loss and physical activity, may be needed to improve risk factors and to reduce the risk of coronary vascular disease. A full report of the WHI DMT CVD outcomes has been published (19).
3.3.4 Adherence Results and Predictors of Dietary Change and Maintenance. Lifestyle change and maintenance are among the most challenging aspects of a dietary modification trial. Elucidation of predictors of adherence can assist in screening and in the development of protocols. In the WHI DMT, baseline predictors of dietary change after one year in intakes of total fat, vegetables, fruits, and grains included being 50–59 years of age at baseline compared with being older, having higher levels of education, and having scored more highly on measures of optimism. After randomization, the strongest predictors at one year were participation in the intervention program, namely attending the
group nutrition sessions and self-monitoring intake (1, 20). Self-monitoring rates were 74% in the first year of the intervention and declined to 59% by spring 2000 (21). A variety of self-monitoring tools, including behavioral, graphic, and electronic formats, was introduced to motivate participants to continue self-monitoring (21). A 1.2% decrease in percentage of energy from fat was predicted for every 10% increase in session attendance or self-monitoring. Furthermore, participating in the dietary intervention program mediated the negative effect of poorer mental health on dietary change (22). By year 3 of the WHI DMT, predictors of maintaining a low-fat dietary intake continued to include attending group sessions, self-monitoring, and being younger (1).
4 CONCLUSIONS
The challenges of the WHI DMT include its focus on long-term behavior change, the large number of participants, and the diversity of trial participants. A constant effort to balance standardization of assessment and intervention materials with adaptation to local conditions was necessary throughout the trial. For example, FFQ glossaries and special instructions were developed to assist Cuban participants at the Miami clinic and Puerto Rican participants in New York City in completing the questionnaire. A range of self-monitoring tools was developed to maintain the interest of participants. Other issues included the need to motivate participants to continue to adhere to the long-term intervention. The finding that participants who abandoned self-monitoring early in the trial were less likely to return to it highlights the importance of addressing barriers, such as discomfort with the recording methods or lack of time, early in a trial (21). These efforts, however, require a high degree of nutritionist time and effort. Another challenge was maintaining the integrity of the initial dietary intervention protocol over an 8+ year period in the face of scientific progress that suggested potential adaptations of the protocol for chronic disease risk reduction. To adhere to the a priori protocol in the face of new knowledge,
intervention sessions were written to incorporate the latest scientific findings; however, it was sometimes a challenge to reconcile new findings with the goals of the trial. An important finding of the trial was that, despite the fact that weight loss was not targeted, participants experienced modest weight loss. For example, mean weight in the intervention group decreased significantly by 2.2 kg from baseline to year 1 (P < 0.001), a change 2.2 kg greater than that in the comparison group. This difference between the comparison and intervention groups diminished over time, but a significant difference in weight was maintained through year 9 (0.5 kg, P = 0.01) (14). Given the sheer size, diversity, and multiple goals of the study, the accomplishments of the WHI DMT are extraordinary and provide substantial evidence that, given adequate support and guidance, participants can adopt and engage in long-term behavior change. The study continues to provide important results that impact chronic diseases affecting postmenopausal women worldwide. Although a small percentage of women discontinued trial participation upon conclusion of the WHI randomized trials, consenting participants (n = 115,405), including many in the WHI Observational Study, have continued to provide health status updates annually as we test additional hypotheses to advance the health of postmenopausal women.
5 ACKNOWLEDGMENTS
The WHI program is funded by the National Heart, Lung, and Blood Institute, U.S. Department of Health and Human Services.
6 SHORT LIST OF WHI INVESTIGATORS
Program Office: (National Heart, Lung, and Blood Institute, Bethesda, Maryland) Elizabeth Nabel, Jacques Rossouw, Shari Ludlam, Linda Pottern, Joan McGowan, Leslie Ford, and Nancy Geller. Clinical Coordinating Center: (Fred Hutchinson Cancer Research Center, Seattle, WA)
Ross Prentice, Garnet Anderson, Andrea LaCroix, Charles L. Kooperberg, Ruth E. Patterson, Anne McTiernan; (Wake Forest University School of Medicine, Winston-Salem, NC) Sally Shumaker; (Medical Research Labs, Highland Heights, KY) Evan Stein; (University of California at San Francisco, San Francisco, CA) Steven Cummings. Clinical Centers: (Albert Einstein College of Medicine, Bronx, NY) Sylvia Wassertheil-Smoller; (Baylor College of Medicine, Houston, TX) Aleksandar Rajkovic; (Brigham and Women’s Hospital, Harvard Medical School, Boston, MA) JoAnn Manson; (Brown University, Providence, RI) Annlouise R. Assaf; (Emory University, Atlanta, GA) Lawrence Phillips; (Fred Hutchinson Cancer Research Center, Seattle, WA) Shirley Beresford; (George Washington University Medical Center, Washington, DC) Judith Hsia; (Los Angeles Biomedical Research Institute at HarborUCLA Medical Center, Torrance, CA) Rowan Chlebowski; (Kaiser Permanente Center for Health Research, Portland, OR) Evelyn Whitlock; (Kaiser Permanente Division of Research, Oakland, CA) Bette Caan; (Medical College of Wisconsin, Milwaukee, WI) Jane Morley Kotchen; (MedStar Research Institute/Howard University, Washington, DC) Barbara V. Howard; (Northwestern University, Chicago/Evanston, IL) Linda Van Horn; (Rush Medical Center, Chicago, IL) Henry Black; (Stanford Prevention Research Center, Stanford, CA) Marcia L. Stefanick; (State University of New York at Stony Brook, Stony Brook, NY) Dorothy Lane; (The Ohio State University, Columbus, OH) Rebecca Jackson; (University of Alabama at Birmingham, Birmingham, AL) Cora E. Lewis; (University of Arizona, Tucson/Phoenix, AZ) Tamsen Bassford; (University at Buffalo, Buffalo, NY) Jean Wactawski-Wende; (University of California at Davis, Sacramento, CA) John Robbins; (University of California at Irvine, CA) F. Allan Hubbell; (University of California at Los Angeles, Los Angeles, CA) Howard Judd; (University of California at San Diego, LaJolla/Chula Vista, CA) Robert D. Langer; (University of Cincinnati, Cincinnati, OH) Margery Gass; (University of Florida, Gainesville/Jacksonville, FL) Marian Limacher; (University of Hawaii,
Honolulu, HI) David Curb; (University of Iowa, Iowa City/Davenport, IA) Robert Wallace; (University of Massachusetts/Fallon Clinic, Worcester, MA) Judith Ockene; (University of Medicine and Dentistry of New Jersey, Newark, NJ) Norman Lasser; (University of Miami, Miami, FL) Mary Jo O’Sullivan; (University of Minnesota, Minneapolis, MN) Karen Margolis; (University of Nevada, Reno, NV) Robert Brunner; (University of North Carolina, Chapel Hill, NC) Gerardo Heiss; (University of Pittsburgh, Pittsburgh, PA) Lewis Kuller; (University of Tennessee, Memphis, TN) Karen C. Johnson; (University of Texas Health Science Center, San Antonio, TX) Robert Brzyski; (University of Wisconsin, Madison, WI) Gloria E. Sarto; (Wake Forest University School of Medicine, Winston-Salem, NC) Denise Bonds; (Wayne State University School of Medicine/Hutzel Hospital, Detroit, MI) Susan Hendrix.
REFERENCES
1. L. F. Tinker, M. C. Rosal, A. F. Young, M. G. Perri, R. E. Patterson, L. Van Horn, A. R. Assaf, D. J. Bowen, J. Ockene, J. Hays, and L. Wu, Predictors of dietary change and maintenance in the Women’s Health Initiative dietary modification trial. J Am Diet Assoc. 2007; 107(7) 1155–1165. 2. The Women’s Health Initiative Study Group. Design of the Women’s Health Initiative clinical trial and observational study. Control Clin Trials 1998; 19(1) 61–109. 3. J. Hays, J. R. Hunt, F. A. Hubbell, G. L. Anderson, M. Limacher, C. Allen, J. W. Rossouw, The Women’s Health Initiative recruitment methods and results. Ann Epidemiol. 2003; 13(9, suppl. 1): S18–S77. 4. C. Ritenbaugh, R. E. Patterson, R. T. Chlebowski, B. Caan, L. Fels-Tinker, B. J. Howard, J. Ockene, The Women’s Health Initiative dietary modification trial: overview and baseline characteristics of participants. Ann Epidemiol. 2003; 13(9, suppl. 1): S87–S97. 5. U.S. Department of Agriculture, Dietary Guidelines for Americans, 3rd ed. Washington, D.C.: Department of Health and Human Services, 1990. 6. U.S. Department of Agriculture, Dietary Guidelines for Americans, 4th ed. Washington, D.C.: Department of Health and Human Services, 1995.
7. L. F. Tinker, E. R. Burrows, H. Henry, R. Patterson, J. Rupp, L. Van Horn, The Women’s Health Initiative: overview of the nutrition components. In: D. A. Krummel and P. M. Kris-Etherton (eds.), Nutrition and Women’s Health. Gaithersburg, MD: Aspen Publishers, 1996, pp. 510–542. 8. D. Bowen, C. Ehret, M. Pedersen, L. Snetselaar, M. Johnson, L. Tinker, D. Hollinger, L. Ilona, K. Bland, D. Sivertsen, D. Ocke, L. Staats, J. W. Beedoe, Results of an adjunct dietary intervention program in the Women’s Health Initiative. J Am Diet Assoc. 2002; 102(11) 1631–1637. 9. D. Ernst, S. Berg-Smith, D. Brenneman, and M. Johnson, Motivation Enhancement Training: August 1999. Women’s Health Initiative Intensive Intervention Protocol. Portland, OR: Kaiser Permanente Northwest Region; Center for Health Research, 1999. 10. G. Anderson, J. Manson, R. Wallace, B. Lund, D. Hall, S. Davis, S. Shumaker, C.-Y. Wang, E. Stein, R. L. Prentice, Implementation of the Women’s Health Initiative Study Design. Ann Epidemiol. 2003; 13(9, suppl. 1): S5–S17. 11. R. E. Patterson, A. R. Kristal, L. F. Tinker, R. A. Carter, M. P. Bolton, T. Agurs-Collins, Measurement characteristics of the Women’s Health Initiative food frequency questionnaire. Ann Epidemiol. 1999; 9(3) 178–187. 12. J. D. Curb A. McTiernan, S. R. Heckbert, C. Kooperberg, J. Stanford, M. Nevitt, K. C. Johnson, L. Proulx-Burns, L. Pastore, M. Criqui, and S. Daugherty, Outcomes ascertainment and adjudication methods in the Women’s Health Initiative. Ann Epidemiol. 2003; 13(9, suppl. 1): S122–S128. 13. R. L. Prentice, B. Caan, R. T. Chlebowski, R. Patterson, L. H. Kuller, J. K. Ockene, K. L. Margolis, M. C. Limacher, J. E. Manson, L. M. Parker, E. Paskett, L. Phillips, J. Robbins, J. E. Rossouw, G. E. Sarto, J. M. Shikany, M. L. Stefanick, C. A. Thomson, L. Van Horn, M. Z. Vitolins, J. Wactawski-Wende, et al., Lowfat dietary pattern and risk of invasive breast cancer: the Women’s Health Initiative randomized controlled dietary modification trial. JAMA. 2006; 295(6) 629–642. 14. B. V. Howard, J. E. Manson, M. L. Stefanick, S. A. Beresford, G. Frank, B. Jones, R. J. Rodabough, L. Snetselaar, C. Thomson, L. Tinker, M. Vitolins, R. Prentice, Low-fat dietary pattern and weight change over 7 years: the Women’s Health Initiative dietary modification trial. JAMA. 2006; 295(1) 39–49. 15. R. E. Patterson, A. Kristal, R. Rodabough, B. Caan, L. Lillington, Y. Mossavar-Rahmani,
M. S. Simon, L. Snetselaar, L. Van Horn, Changes in food sources of dietary fat in response to an intensive low-fat dietary intervention: early results from the Women’s Health Initiative. J Am Diet Assoc. 2003; 103(4) 454–460. 16. M. A. Pereira, S. Liu, Types of carbohydrates and risk of cardiovascular disease. J Women’s Health (Larchmt), 2003; 12: 115–122. 17. M. K. Hellerstein. Carbohydrate-induced hypertriglyceridemia: modifying factors and implications for cardiovascular risk. Curr Opin Lipidol. 2002; 13: 33–40. 18. S. A. Beresford, K. C. Johnson, C. Ritenbaugh, N. L. Lasser, L. G. Snetselaar, H. R. Black, G. L. Anderson, A. R. Assaf, T. Bassford, D. Bowen, R. L. Brunner, R. G. Brzyski, B. Caan, R. T. Chlebowski, M. Gass, R. C. Harrigan, J. Hays, et al., Low-fat dietary pattern and risk of colorectal cancer: the Women’s Health Initiative Randomized Controlled Dietary Modification Trial. JAMA. 2006; 295(6) 643–654. 19. B. V. Howard, L. Van Horn, J. Hsia, J. E. Manson, M. L. Stefanick, S. Wassertheil-Smoller, L. H. Kuller, A. Z. LaCroix, R. D. Langer, N. L. Lasser, C. E. Lewis, M. C. Limacher, K. L. Margolis, W. J. Mysiw, J. K. Ockene, L. M. Parker, M. G. Perri, L. Phillips, R. L. Prentice, et al., Low-fat dietary pattern and risk of cardiovascular disease: the Women’s Health Initiative Randomized Controlled Dietary Modification Trial. JAMA. 2006; 295(6) 655–666. 20. Women’s Health Initiative Study Group. Dietary adherence in the Women’s Health Initiative Dietary Modification Trial. J Am Diet Assoc. 2004; 104(4) 654–658. 21. Y. Mossavar-Rahmani, H. Henry, R. Rodabough, C. Bragg, A. Brewer, T. Freed, L. Kinzel, M. Pedersen, C. O. Soule, S. Vosburg. Additional self-monitoring tools in the dietary modification component of the Women’s Health Initiative. J Am Diet Assoc. 2004; 104(1) 76–85. 22. L. F. Tinker, M. G. Perri, R. E. Patterson, D. J. Bowen, M. McIntosh, L. M. Parker, M. A. Sevick, L. A. Wodarski. The effects of physical and emotional status on adherence to a low-fat dietary pattern in the Women’s Health Initiative. J Am Diet Assoc. 2002; 102(6) 789–800, 888.
FURTHER READING
WHI. www.whiscience.org is the WHI’s scientific resources website. It includes a full listing of publications, a description of the WHI, publicly available data tables, and information on how to collaborate with WHI scientists to write papers and propose ancillary studies.
CROSS-REFERENCES clinical trial intervention
WOMEN’S HEALTH INITIATIVE HORMONE THERAPY TRIALS
SHARI S. BASSUK
Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts

JOANN E. MANSON
Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts

Once prescribed primarily to treat vasomotor symptoms, postmenopausal hormone therapy (HT) had in recent years been increasingly promoted as a strategy to forestall many chronic diseases that accelerate with aging, including coronary heart disease (CHD), cognitive impairment, and osteoporosis. Indeed, more than two of five postmenopausal women in the United States were taking HT in the year 2001 (1). The belief that estrogen protected the heart and that women of all ages could benefit was so strong that many clinicians were initiating HT in patients who were 20 to 30 years past the menopausal transition and in women at high risk of CHD. This widespread use was unwarranted given the lack of conclusive data from randomized clinical trials on the balance of risks and benefits of such therapy when used for chronic disease prevention.

1 OBJECTIVES

Launched in 1991 by the National Institutes of Health, the HT component of the WHI consisted of two randomized, double-blind, placebo-controlled trials in postmenopausal women who were aged 50 to 79 years and generally healthy at baseline. The trials were designed to test the effect of estrogen plus progestin (for women with a uterus) or estrogen alone (for women with hysterectomy) on CHD, breast cancer, osteoporotic fracture, and other health outcomes and whether the possible benefits of treatment would outweigh possible risks. Taken in aggregate, data from observational studies had suggested both benefits and risks (2), but until the WHI, no large-scale randomized trial in a primary prevention setting had been conducted to confirm or refute the observational findings.

2 STUDY DESIGN

More than 27,000 healthy postmenopausal women (age range: 50–79 years; mean age: 63 years) were recruited largely by mass mailings at 40 clinical centers across the United States. The main endpoint of interest was CHD, so older women were targeted for enrollment because of their higher baseline risk of this outcome. Women with severe vasomotor symptoms—the traditional indication for HT—tended not to participate because of the need to accept possible randomization to placebo. In the estrogen–progestin trial, 16,608 women with an intact uterus were assigned to oral estrogen plus progestin [0.625 mg of conjugated equine estrogens (CEE) plus 2.5 mg of medroxyprogesterone acetate (MPA); Prempro] or a placebo daily. In the estrogen-alone trial, 10,739 women with hysterectomy (40% also had oophorectomy) were assigned to oral estrogen alone (0.625 mg of CEE; Premarin) or a placebo daily. The sample sizes were chosen to have adequate power (≥ 80%) to detect an effect of HT on CHD, the primary outcome of interest, and to understand the balance of benefits and risks over the trials’ planned duration of 9 years (3). The hormone preparations were selected because they were—and are—the most commonly prescribed forms of HT in this country, and they seemed to benefit women’s health in many observational studies. Moreover, the manufacturer, Wyeth Ayerst, interested in obtaining FDA approval to market these medications for CHD prevention, donated > 100 million pills to help defray the costs of this enormous undertaking. The WHI hormone trials enrolled study participants between 1993 and 1998 and were scheduled originally to end in 2005. However, the estrogen–progestin trial was halted in July 2002, after an average of 5.2 years of treatment, because of a significant increase in breast cancer risk and an
unfavorable overall risk–benefit ratio associated with estrogen–progestin therapy observed in the study population as a whole (4). The estrogen-alone trial was terminated in April 2004 after 6.8 years because of an excess risk of stroke that was not offset by a reduced risk of CHD in the hormone group (5). Although estrogen lowered the risk of osteoporotic fracture, it offered no other clear benefit in terms of reducing risk of chronic disease. After the decision is made to stop a clinical trial prematurely, the participants need to be notified and their medical information collected; these steps usually take several months, during which time randomized treatment necessarily continues. Thus, the final WHI results presented here are based on slightly longer treatment intervals— 5.6 years and 7.1 years for the estrogen–progestin and estrogen-alone trials, respectively—than the results initially reported when the trials were terminated. The WHI is continuing to follow participants observationally until the year 2010. The major research question to be addressed
by long-term follow-up is whether—and how rapidly—risks and benefits decline after HT use is stopped.
3 RESULTS
3.1 Main effects
The results for the primary and secondary endpoints of the trials are shown in Table 1.
3.1.1 Coronary heart disease. Rather than the expected cardioprotection, women assigned to estrogen–progestin for an average of 5.6 years were 24% more likely to experience CHD [defined in primary analyses as myocardial infarction (MI) or coronary death] than women assigned to placebo (6), with the risk increase most evident during the first year of the trial. Women assigned to 6.8 years of estrogen alone also experienced no reduction in CHD risk as compared with women assigned to placebo (7); relative risks (RR) were slightly elevated during early follow-up and diminished over time.
Table 1. Relative Risks and 95% Confidence Intervals for Various Health Outcomes in the Overall Study Population of Women Aged 50 to 79 Years in Women’s Health Initiative Trials of Menopausal Hormone Therapy

Outcome                      Estrogen–Progestin        Estrogen Alone
Primary endpoints:
  Coronary heart disease     1.24 (1.00–1.54) [6]      0.95 (0.79–1.16) [7]
  Stroke                     1.31 (1.02–1.68) [8]      1.33 (1.05–1.68) [9]
  Venous thromboembolism*    2.06 (1.57–2.70) [10]     1.32 (0.99–1.75) [11]
  Pulmonary embolism         2.13 (1.45–3.11) [10]     1.37 (0.90–2.07) [10]
  Breast cancer              1.24 (1.01–1.54) [12]     0.80 (0.62–1.04) [13]
  Colorectal cancer          0.56 (0.38–0.81) [15]     1.08 (0.75–1.55) [5]
  Endometrial cancer         0.81 (0.48–1.46) [21]     Not applicable
  Hip fracture               0.67 (0.47–0.96) [16]     0.61 (0.41–0.91) [5]
  All-cause mortality        1.00 (0.83–1.19) [17]     1.04 (0.88–1.22) [17]
  Global index†              1.13 (1.02–1.25) [17]     1.02 (0.92–1.13) [17]
Secondary endpoints:
  Type 2 diabetes            0.79 (0.67–0.93) [18]     0.88 (0.77–1.01) [19]
  Gallbladder removal        1.67 (1.32–2.11) [20]     1.93 (1.52–2.44) [20]
  Ovarian cancer             1.58 (0.77–3.24) [21]     Not available
  Dementia (age ≥ 65)        2.05 (1.21–3.48) [22]     1.49 (0.83–2.66) [22]

Bracketed numbers indicate the reference reporting each estimate.
* Venous thromboembolism includes deep venous thrombosis and pulmonary embolism.
† The global index is a composite outcome that represents the first event for each participant from among the following: coronary heart disease, stroke, pulmonary embolism, breast cancer, colorectal cancer, endometrial cancer (estrogen–progestin trial only), hip fracture, and death. Because participants can experience more than one type of event, the global index cannot be derived by a simple summing of the component events.
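The "first event per participant" rule in the global index footnote can be made concrete with a small sketch. The snippet below is a hypothetical illustration only (it is not WHI analysis code, and the event records are invented): because each woman contributes at most her earliest qualifying event, the global index is smaller than the sum of the component outcome counts.

```python
from datetime import date

# Components of the WHI "global index" (endometrial cancer applies to the
# estrogen-progestin trial only; "death" is all-cause mortality).
GLOBAL_INDEX_COMPONENTS = {
    "CHD", "stroke", "pulmonary embolism", "breast cancer",
    "colorectal cancer", "endometrial cancer", "hip fracture", "death",
}

def first_global_index_event(events):
    """Return (outcome, date) of the earliest qualifying event for one
    participant, or None if there is no qualifying event.

    `events` maps outcome names to the date of first occurrence.
    """
    qualifying = {k: v for k, v in events.items() if k in GLOBAL_INDEX_COMPONENTS}
    if not qualifying:
        return None
    outcome = min(qualifying, key=qualifying.get)
    return outcome, qualifying[outcome]

# Hypothetical participant: a stroke followed by a hip fracture counts as a
# single global-index event (the stroke).
example = {"stroke": date(2000, 5, 1), "hip fracture": date(2003, 2, 10)}
print(first_global_index_event(example))  # ('stroke', datetime.date(2000, 5, 1))
```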
3.1.2 Stroke. Women in the estrogen–progestin group were 31% more likely to develop stroke than those in the placebo group (8). Unlike the pattern for CHD, the excess risk for stroke emerged during the second year and remained elevated throughout the trial. An elevation in risk was found for ischemic but not hemorrhagic stroke. Similar findings were observed for women assigned to estrogen alone (9).
3.1.3 Venous thromboembolism. Assignment to estrogen–progestin was associated with a statistically significant doubling of the risk of venous thromboembolism and pulmonary embolism (10). Assignment to estrogen alone seemed to raise the risk of these events by approximately one third; this increase was of borderline statistical significance (11).
3.1.4 Breast cancer. Women assigned to estrogen–progestin therapy experienced a 24% increase in the risk of breast cancer (12). The excess risk did not emerge clearly until the fourth year of randomized treatment and was limited to participants who reported prior HT use. The breast cancers diagnosed in the estrogen–progestin group tended to be larger and possibly more advanced than those in the placebo group. In addition, the percentage of women with abnormal mammograms was significantly higher in the estrogen–progestin group than in the placebo group. Taken together, the results suggest that estrogen–progestin promotes development of breast cancer while also delaying detection, perhaps by increasing breast tissue density. In contrast, assignment to estrogen alone predicted a 20% reduction in risk of breast cancer, an association of borderline statistical significance (13). Reasons for this seeming protective effect are unclear. However, observational studies have also implicated estrogen–progestin more strongly than estrogen alone in breast carcinogenesis; in such studies, the excess risk associated with estrogen–progestin seems to increase more rapidly over time than that associated with estrogen alone (2,14).
3.1.5 Colorectal cancer. Women assigned to estrogen–progestin therapy experienced
a 44% risk reduction in colorectal cancer, although the proportion of cancers diagnosed at an advanced stage was unexpectedly higher in the hormone group than in the placebo group (15). Little association was found between estrogen alone and risk of colorectal cancer (5).
3.1.6 Endometrial cancer. Estrogen taken alone has long been recognized as a powerful risk factor for endometrial cancer and thus was not administered to women with a uterus in the WHI. However, concurrent use of a progestogen is believed to counteract the excess risk, a hypothesis confirmed in the estrogen–progestin trial.
3.1.7 Hip and other fractures. Compared with those assigned to placebo, women assigned to estrogen–progestin experienced 33% fewer hip fractures and 24% fewer total fractures (16). For estrogen alone, the corresponding risk reductions were 39% and 30% (5).
3.1.8 Global index. The health risks of estrogen–progestin therapy clearly outweighed its benefits in the WHI study population as a whole. For the composite outcome ("global index") of CHD, stroke, pulmonary embolism, breast cancer, colorectal cancer, endometrial cancer, hip fracture, or mortality, the RR associated with estrogen–progestin was 1.13 [95% confidence interval (CI): 1.02–1.25] (17). That is, participants assigned to estrogen–progestin were significantly more likely to experience an adverse event than those assigned to placebo. In absolute terms, the WHI findings suggest that for every 10,000 women taking estrogen–progestin each year, there would be six more coronary events, seven more strokes, ten more pulmonary emboli, and eight more invasive breast cancers, as well as seven fewer colorectal cancers, five fewer hip fractures, and one fewer endometrial cancer (Fig. 1a). Overall, the net effect would be 19 additional adverse events per 10,000 women per year (4). Parallel analyses of the estrogen-alone data also show no evidence of a favorable balance of benefits and risks when this therapy is used for chronic disease prevention among postmenopausal women as a whole
[Figure 1 (bar charts, data not reproduced): number of cases per 10,000 women per year of CHD, stroke, venous thromboembolism (VTE), breast cancer, colorectal cancer, hip fracture, and death in the hormone therapy and placebo groups, shown for all women and by age group (50–59, 60–69, and 70–79 years).]
Figure 1. (a) Estrogen–progestin therapy and health outcomes: results from the Women’s Health Initiative, by age group. (b) Estrogen-only therapy and health outcomes: results from the Women’s Health Initiative, by age group.
(Fig. 1b). For the global index, the RR associated with estrogen alone was 1.02 (95% CI: 0.92–1.13) (17).
3.1.9 Secondary endpoints (i.e., endpoints not included in the global index). Estrogen
with or without progestin was associated with a reduction in the risk of type 2 diabetes (18,19) and an increase in the risk of gallbladder disease (20). Estrogen–progestin also seemed to raise the risk of ovarian cancer, although the number of observed cases
[Figure 2 (forest plot, data not reproduced): relative risks, on an axis from 0 to 4, for coronary heart disease, stroke, pulmonary embolism, hip fractures, breast cancer, and colorectal cancer, as estimated in observational studies and in the WHI estrogen–progestin trial (4).]
From Michels KB, Manson JE. Circulation 2003; 107: 1830–1833. (Reproduced with permission.)
Figure 2. Relation between postmenopausal hormone therapy and various health outcomes in observational studies and in the Women’s Health Initiative estrogen–progestin trial.
was too small to allow a precise estimate of the effect (21). To examine the effect of HT on cognition, WHI investigators followed participants aged ≥ 65 years without probable dementia at enrollment, including 4532 women in the estrogen–progestin trial and 2947 women in the estrogen-alone trial. In analyses that pooled the data across both trials, assignment to HT was associated with a 76% increase in the risk of probable dementia (RR = 1.76; 95% CI: 1.19–2.60) and a 25% increase in the risk of mild cognitive impairment (RR = 1.25; 95% CI: 0.97–1.89) (22). In absolute terms, for every 10,000 women aged ≥ 65 years who take estrogen–progestin for 1 year, 23 excess cases of dementia would occur. For estrogen alone, 12 excess cases would occur. The relatively short follow-up interval during which probable cases were diagnosed suggests that some participants already had experienced cognitive decline at enrollment and that HT did not initiate the underlying dementia but rather accelerated its progression or manifestation. Indeed, the adverse impact of HT was greatest among women who performed poorly on the cognitive assessment at baseline.
3.2 Influence of age and time since menopause
The divergence between observational studies and the WHI with respect to their findings for CHD raises concerns that the coronary benefit seen in observational studies might be attributable to selection factors or confounding by participants’ baseline health and behavior. Nevertheless, the concordance between observational studies and the WHI for other endpoints, particularly stroke, which have lifestyle determinants similar to those for coronary disease, suggests that these biases are not the primary explanation for the discrepant CHD results (Fig. 2). Instead, a closer examination of available data suggests that the timing of HT initiation in relation to menopause onset may affect the association between such therapy and risk of coronary disease. Hormone users in observational studies typically start therapy within 2 to 3 years after menopause onset, which occurs on average at age 51 in the United States, whereas WHI participants were randomized to HT more than a decade after cessation of menses. These older
women likely had more extensive subclinical atherosclerosis than their younger counterparts. It has been postulated that estrogen has multiple and opposing actions, slowing early-stage atherosclerosis through beneficial effects on lipids and endothelial function but triggering acute coronary events through prothrombotic and inflammatory influences when advanced lesions are present (23,24). Several lines of evidence lend credence to this hypothesis. Trials in humans reveal complex effects of exogenous estrogen on cardiovascular biomarkers (25,26). Oral estrogen lowers low-density lipoprotein (LDL) cholesterol, lipoprotein(a), glucose, insulin, homocysteine, fibrinogen, and plasminogen activator inhibitor type 1 levels; inhibits oxidation of LDL cholesterol; raises high-density lipoprotein cholesterol; and improves endothelial function—effects expected to reduce coronary risk. However, oral estrogen also raises triglycerides, coagulation factors (factor VII, prothrombin fragments 1 and 2, and fibrinopeptide A), C-reactive protein, and matrix metalloproteinases—effects expected to increase coronary risk. In addition, certain progestogens counteract some salutary effects of estrogen. Data from controlled experiments in nonhuman primates also support the concept that the net coronary effect of HT varies according to the initial health of the vasculature. Conjugated estrogen with or without MPA had no effect on coronary artery plaque in cynomolgus monkeys started on this treatment at 2 years (∼6 human years) after oophorectomy and well after the establishment of atherosclerosis, whereas such therapy reduced the extent of plaque by 70% when initiated immediately after oophorectomy, during the early stages of atherosclerosis (27). Similarly, imaging trials in humans with significant coronary lesions at baseline have not found estrogen to be effective in slowing the rate of arterial narrowing (28–31). However, in the only imaging trial to date in which the presence of significant lesions was not a criterion for study entry, micronized 17β-estradiol significantly retarded the progression of carotid atherosclerosis (32).
Subgroup analyses of WHI data also suggest that age or time since menopause influences the relation between HT and CHD (Fig. 3). In analyses that pooled the data across the two trials to increase statistical power, RRs were 0.76 (95% CI: 0.50–1.16), 1.10 (95% CI: 0.84–1.45), and 1.28 (95% CI: 1.03–1.58) for women who were < 10, 10–19, and ≥ 20 years past the menopausal transition at study entry, respectively (p for trend = 0.02) (17); a pattern of steadily rising RRs was found in both the estrogen–progestin and estrogen-alone trials. With respect to age, HT-associated RRs were 0.93 (95% CI: 0.65–1.33), 0.98 (95% CI: 0.79–1.21), and 1.26 (95% CI: 1.00–1.59) for women aged 50–59, 60–69, and 70—79 years, respectively (p for trend = 0.16) (17). Although a trend by age group was not seen in the estrogen–progestin trial, this trend was apparent in the estrogen-alone trial, with RRs for women in their 50s, 60s, and 70s of 0.63 (95% CI: 0.36–1.09), 0.94 (95% CI: 0.71–1.24), and 1.13 (95% CI: 0.82–1.54), respectively (p for trend = 0.12). Indeed, among women aged 50—59 years, estrogen alone was associated with significant reductions in the secondary endpoint of coronary revascularization (RR = 0.55; 95% CI: 0.35–0.86) and a composite endpoint of MI, coronary death, or coronary revascularization (RR = 0.66; 95% CI: 0.44–0.97) (7). Taken in aggregate, the WHI results suggest that time since menopause has a somewhat greater impact than chronologic age on the HT–CHD association. HT seems to have a beneficial or neutral effect on CHD in women closer to menopause (who are likely to have healthier arteries) but a harmful effect in later years. The WHI findings have spurred reanalyses of existing observational and randomized data. For example, investigators with the Nurses’ Health Study, the largest and longest-running prospective study of HT and CHD in the United States, who earlier reported that current use of HT was associated with an approximate 40% reduction in risk of CHD in the cohort as a whole (33), recently found that the coronary benefit was largely limited to women who started HT within 4 years of menopause onset (34). A 2006 meta-analysis that pooled data from 22
[Figure 3 (forest plots; point estimates recovered from the figure): relative risks in the combined hormone therapy trials. By age — CHD: 0.93 (50–59 y), 0.98 (60–69 y), 1.26 (70–79 y), p for trend† = 0.16; total mortality: 0.70, 1.05, 1.14, p = 0.06; global index‡: 0.96, 1.08, 1.14, p = 0.09. By years since menopause§ — CHD: 0.76 (<10 y), 1.10 (10–19 y), 1.28 (20+ y), p = 0.02; total mortality: 0.76, 0.98, 1.14, p = 0.51; global index: 1.05, 1.12, 1.09, p = 0.82.]
* Confidence intervals plotted as error bars. † P values are for trend. ‡ The global index is a composite outcome of CHD, stroke, pulmonary embolism, breast cancer, colorectal cancer, endometrial cancer (estrogen–progestin trial only), hip fracture, and mortality. § Age at menopause was defined as the age at which a woman last had menstrual bleeding, had bilateral oophorectomy, or began using HT. For hysterectomy without bilateral oophorectomy, age at menopause was the age at which a woman either began using HT or first had vasomotor symptoms. For hysterectomy without bilateral oophorectomy at age 50 or older, but no use of HT or vasomotor symptoms, age at menopause was defined as the age at hysterectomy. Data from Ref. 17.
Figure 3. Relative risks and 95% confidence intervals* for selected health outcomes in the combined trials of postmenopausal hormone therapy (estrogen–progestin and estrogen alone) of the Women’s Health Initiative, by age and years since menopause.
smaller randomized trials with data from the WHI found that HT was associated with a 30–40% reduction in CHD risk in trials that enrolled predominantly younger participants (women age < 60 years or within 10 years of menopause) but not in trials with predominantly older participants (35). Available data also suggest that, in a pattern similar to that for coronary outcomes, initiation of HT soon after the menopausal transition may improve or preserve cognitive function, but later initiation— after the onset of subclinical neuropathologic changes in the brain—has a negligible or even harmful effect on cognition (36). In the WHI, age also seemed to modulate the effect of HT on total mortality and the global index. In an analysis that pooled data from both trials, HT was associated with a significant reduction in mortality (RR = 0.70; 95% CI: 0.51–0.96) among women aged 50–59 years but had little effect among those aged 60—69 years (RR = 1.05; 95% CI: 0.87–1.26) and was associated with
a borderline significant increase in mortality among those aged 70–79 years (RR = 1.14; 95% CI: 0.94–1.37) (p for trend = 0.06) (17). This pattern was observed in both trials. For the global index, the HT-associated RRs for women in their 50s, 60s, and 70s were 0.96 (95% CI: 0.81–1.14), 1.08 (95% CI: 0.97–1.20), and 1.14 (95% CI: 1.02–1.29), respectively (p for trend = 0.09), although a significant trend was present only in the estrogen-alone trial (17). In a 2003 meta-analysis of 30 randomized trials, including the WHI estrogen–progestin trial, HT was associated with a nearly 40% reduction in mortality in trials in which the mean age of participants was < 60 years but was unrelated to mortality in other trials (37). It is important to recognize that the evidence for differential health effects of HT by age or time since menopause, although strong, is not yet conclusive. Nonetheless, even if HT-associated RRs do not vary according to these characteristics, the much lower absolute baseline risks of coronary and
other events in younger or recently postmenopausal women mean that these women experience much lower absolute excess risks associated with HT use as compared with their counterparts who are older or further past menopause (Fig. 1 and Table 2).
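The "p for trend" values quoted in this section come from the WHI investigators' own models. As a rough, self-contained illustration of how a trend across ordered subgroups can be assessed from published summary estimates (this is not the WHI authors' method), the sketch below fits an inverse-variance-weighted regression of the log relative risk on an ordinal subgroup score, using the CHD estimates by years since menopause quoted earlier.

```python
import numpy as np
from scipy import stats

# CHD relative risks by years since menopause (<10, 10-19, >=20) and their
# upper 95% confidence limits, as quoted in the text; the ordinal scores are
# an assumption of this illustration.
rr = np.array([0.76, 1.10, 1.28])
ci_upper = np.array([1.16, 1.45, 1.58])
score = np.array([0.0, 1.0, 2.0])

log_rr = np.log(rr)
se = (np.log(ci_upper) - log_rr) / 1.96      # SE of log RR from the CI half-width
w = 1.0 / se**2                               # inverse-variance weights

# Weighted least-squares slope of log RR on the ordinal score (fixed-effect
# meta-regression); a nonzero slope indicates a trend across subgroups.
x_bar = np.sum(w * score) / np.sum(w)
y_bar = np.sum(w * log_rr) / np.sum(w)
s_xx = np.sum(w * (score - x_bar) ** 2)
slope = np.sum(w * (score - x_bar) * (log_rr - y_bar)) / s_xx
z = slope / np.sqrt(1.0 / s_xx)
p_trend = 2 * stats.norm.sf(abs(z))
print(f"slope per category = {slope:.3f}, z = {z:.2f}, two-sided p = {p_trend:.3f}")
```

With these summary numbers, the crude calculation shows a trend in the same direction as the published result; the WHI analysis itself was based on the underlying participant-level data rather than on summary estimates, so the p values differ.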
4 CONCLUSIONS
The WHI results have led to revisions of clinical guidelines for HT use. The U.S. Preventive Services Task Force, American College of Obstetricians and Gynecologists, American Heart Association, Canadian Task Force on Preventive Health Care, and the North American Menopause Society now recommend against the use of estrogen with or without a progestogen to prevent CHD and other chronic diseases. Hot flashes and night sweats that are severe or frequent enough to disrupt sleep or quality of life are currently the only compelling indications for HT. The WHI findings suggest that key factors to consider in deciding whether to initiate HT in a woman with these symptoms (assuming she has a personal preference for such therapy) are where she is in the menopausal transition and whether she is in good cardiovascular health. A younger, recently postmenopausal woman—one whose final menstrual period was ≤ 5 years ago—at low baseline risk of CHD, stroke, or venous thromboembolism is a reasonable candidate for HT. Conversely, an older woman many years past menopause, who is at higher baseline risk of these conditions, is not. HT is best used for 2 to 3 years and generally for no more than 5 years, as breast cancer risk rises with duration of hormone use, particularly estrogen–progestin. For more guidance on evidence-based HT decision making, see Ref. 38. The WHI trials undoubtedly will remain the gold standard of evidence on the health effects of HT for years to come, but limitations must be acknowledged. Although the WHI provided clear data on the benefits and risks of HT in women aged ≥ 60 years and ended the increasingly common practice of starting hormones in these women for the express purpose of preventing CHD, the overall findings likely overstate the risks for healthy women aged 40 to 59 years who begin HT
closer to menopause onset. Moreover, only one type and dose of oral estrogen and progestogen was tested, so the results may not apply to other formulations, doses, and routes of administration (e.g., patches or creams). There are few or no trials on alternative hormone medications, particularly custom-compounded "bioidentical" hormones. The lack of data on these agents should not be construed to mean that they are safer or more effective at preventing chronic disease than CEE or MPA; more research is needed to answer these questions. Trials to determine whether the effects of HT on the development and progression of atherosclerosis differ according to age at initiation (39) and type of therapy (40) are in progress.
REFERENCES
1. A. L. Hersh, M. L. Stefanick, and R. S. Stafford, National use of postmenopausal hormone therapy: annual trends and response to recent evidence. JAMA. 2004; 291: 47–53. 2. S. S. Bassuk and J. E. Manson, Postmenopausal hormone therapy: observational studies to clinical trials. In: J. Liu and M. L. S. Gass (eds.), Management of the Perimenopause (Practical Pathways in Obstetrics and Gynecology). New York: McGraw-Hill, 2006, pp. 377–408. 3. Women’s Health Initiative Study Group, Design of the Women’s Health Initiative clinical trial and observational study. Control. Clin. Trials. 1998; 19: 61–109. 4. Writing Group for the Women’s Health Initiative Investigators, Risks and benefits of estrogen plus progestin in healthy postmenopausal women: principal results from the Women’s Health Initiative randomized controlled trial. JAMA. 2002; 288: 321–333. 5. Women’s Health Initiative Steering Committee, Effects of conjugated equine estrogen in postmenopausal women with hysterectomy: the Women’s Health Initiative randomized controlled trial. JAMA. 2004; 291: 1701–1712. 6. J. E. Manson, J. Hsia, K. C. Johnson, et al., Estrogen plus progestin and the risk of coronary heart disease. N. Engl. J. Med. 2003; 349: 523–534. 7. J. Hsia, R. D. Langer, J. E. Manson, et al., Conjugated equine estrogens and the risk of coronary heart disease: the Women’s Health Initiative. Arch. Intern. Med. 2006; 166: 357–365.
Table 2. Estimated Absolute Excess Risks per 10,000 Person-Years* for Selected Health Outcomes in the Combined Trials of Menopausal Hormone Therapy (Estrogen–Progestin and Estrogen Alone) of the Women’s Health Initiative, by Age and Years since Menopause

Outcome                     Age 50–59   Age 60–69   Age 70–79   <10 y since menopause   10–19 y   ≥20 y
Coronary heart disease      −2          −1          +19†        −6                      +4        +17†
Total mortality             −10         −4          +16†        −7                      −1        +14
Global index‡               −4          +15         +43         +5                      +20       +23

* Estimated absolute excess risk per 10,000 person-years = [(annualized percentage in the placebo group) × (hazard ratio − 1)] × 1000.
† p = 0.03 compared with age 50–59 years or < 10 years since menopause.
‡ The global index is a composite outcome of coronary heart disease, stroke, pulmonary embolism, breast cancer, colorectal cancer, endometrial cancer, hip fracture, and mortality.
Data from Ref. 17.
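The table footnote's calculation can be written out more explicitly. As a sketch, assuming the annualized placebo-group event rate is expressed as a proportion \(p_0\) (events per person per year) and \(\mathrm{HR}\) denotes the hazard ratio for hormone therapy versus placebo, the quantity tabulated above is

\[
\text{Absolute excess risk per 10{,}000 person-years} \;=\; 10{,}000 \times p_{0} \times (\mathrm{HR} - 1).
\]

For a purely hypothetical example, a placebo rate of 50 events per 10,000 person-years (\(p_0 = 0.005\)) combined with \(\mathrm{HR} = 1.26\) gives \(10{,}000 \times 0.005 \times 0.26 = 13\) excess events per 10,000 person-years; a hazard ratio below 1 yields a negative value, that is, fewer events than with placebo.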
8. S. Wassertheil-Smoller, S. L. Hendrix, M. Limacher, et al., Effect of estrogen plus progestin on stroke in postmenopausal women: the Women’s Health Initiative: a randomized trial. JAMA. 2003; 289: 2673–2684. 9. S. L. Hendrix, S. Wassertheil-Smoller, K. C. Johnson, et al., Effects of conjugated equine estrogen on stroke in the Women’s Health Initiative. Circulation. 2006; 113: 2425–2434. 10. M. Cushman, L. H. Kuller, R. Prentice, et al., Estrogen plus progestin and risk of venous thrombosis. JAMA. 2004; 292: 1573–1580. 11. J. D. Curb, R. L. Prentice, P. F. Bray, et al., Venous thrombosis and conjugated equine estrogen in women without a uterus. Arch. Intern. Med. 2006; 166: 772–780. 12. R. T. Chlebowski, S. L. Hendrix, R. D. Langer, et al., Influence of estrogen plus progestin on breast cancer and mammography in healthy postmenopausal women: the Women’s Health Initiative Randomized Trial. JAMA. 2003; 289: 3243–3253. 13. M. L. Stefanick, G. L. Anderson, K. L. Margolis, et al., Effects of conjugated equine estrogens on breast cancer and mammography screening in postmenopausal women with hysterectomy. JAMA. 2006; 295: 1647–1657. 14. W. Y. Chen, J. E. Manson, S. E. Hankinson, et al., Unopposed estrogen therapy and the risk of invasive breast cancer. Arch. Intern. Med. 2006; 166: 1027–1032. 15. R. T. Chlebowski, J. Wactawski-Wende, C. Ritenbaugh, et al., Estrogen plus progestin and colorectal cancer in postmenopausal women. N. Engl. J. Med. 2004; 350: 991–1004. 16. J. A. Cauley, J. Robbins, Z. Chen, et al., Effects of estrogen plus progestin on risk of fracture and bone mineral density: The Women’s Health Initiative randomized trial. JAMA. 2003; 290: 1729–1738. 17. J. E. Rossouw, R. L. Prentice, J. E. Manson, et al., Effects of postmenopausal hormone therapy on cardiovascular disease by age and years since menopause. JAMA. 2007; 297: 1465–1477. 18. K. L. Margolis, D. E. Bonds, R. J. Rodabough, et al., Effect of oestrogen plus progestin on the incidence of diabetes in postmenopausal women: results from the Women’s Health Initiative Hormone Trial. Diabetologia. 2004; 47: 1175–1187. 19. D. E. Bonds, N. Lasser, L. Qi, et al., The effect of conjugated equine oestrogen on diabetes incidence: the Women’s Health Initiative randomised trial. Diabetologia. 2006: 1–10. 20. D. J. Cirillo, R. B. Wallace, R. J. Rodabough, et al., Effect of estrogen therapy on gallbladder disease. JAMA. 2005; 293: 330–339. 21. G. L. Anderson, H. L. Judd, A. M. Kaunitz, et al., Effects of estrogen plus progestin on gynecologic cancers and associated diagnostic procedures: The Women’s Health Initiative randomized trial. JAMA. 2003; 290: 1739–1748. 22. S. A. Shumaker, C. Legault, L. Kuller, et al., Conjugated equine estrogens and incidence of probable dementia and mild cognitive impairment in postmenopausal women: Women’s Health Initiative Memory Study. JAMA. 2004; 291: 2947–2958. 23. F. Grodstein, T. B. Clarkson, and J. E. Manson, Understanding the divergent data on postmenopausal hormone therapy. N. Engl. J. Med. 2003; 348: 645–650. 24. M. E. Mendelsohn and R. H. Karas, Molecular and cellular basis of cardiovascular gender differences. Science. 2005; 308: 1583–1587.
25. J. E. Manson, S. S. Bassuk, S. M. Harman, et al., Postmenopausal hormone therapy: new questions and the case for new clinical trials. Menopause. 2006; 13: 139–147. 26. S. R. Salpeter, J. M. Walsh, T. M. Ormiston, E. Greyber, N. S. Buckley, and E. E. Salpeter, Meta-analysis: effect of hormone-replacement therapy on components of the metabolic syndrome in postmenopausal women. Diabetes Obes. Metab. 2006; 8: 538–554. 27. T. S. Mikkola and T. B. Clarkson, Estrogen replacement therapy, atherosclerosis, and vascular function. Cardiovasc. Res. 2002; 53: 605–619. 28. D. M. Herrington, D. M. Reboussin, K. B. Brosnihan, et al., Effects of estrogen replacement on the progression of coronary-artery atherosclerosis. N. Engl. J. Med. 2000; 343: 522–529. 29. H. N. Hodis, W. J. Mack, S. P. Azen, et al., Hormone therapy and the progression of coronary-artery atherosclerosis in postmenopausal women. N. Engl. J. Med. 2003; 349: 535–545. 30. D. D. Waters, E. L. Alderman, J. Hsia, et al., Effects of hormone replacement therapy and antioxidant vitamin supplements on coronary atherosclerosis in postmenopausal women: a randomized controlled trial. JAMA. 2002; 288: 2432–2440. 31. P. Angerer, S. Stork, W. Kothny, P. Schmitt, and C. von Schacky, Effect of oral postmenopausal hormone replacement on progression of atherosclerosis: a randomized, controlled trial. Arterioscler. Thromb. Vasc. Biol. 2001; 21: 262–268. 32. H. N. Hodis, W. J. Mack, R. A. Lobo, et al., Estrogen in the prevention of atherosclerosis. A randomized, double-blind, placebocontrolled trial. Ann. Intern. Med. 2001; 135: 939–953. 33. F. Grodstein, J. E. Manson, G. A. Colditz, W. C. Willett, F. E. Speizer, and M. J. Stampfer, A prospective, observational study of postmenopausal hormone therapy and primary prevention of cardiovascular disease. Ann. Intern. Med. 2000; 133: 933–941.
34. F. Grodstein, J. E. Manson, and M. J. Stampfer. Hormone therapy and coronary heart disease: the role of time since menopause and age at hormone initiation. J. Womens Health. 2006; 15: 35–44. 35. S. R. Salpeter, J. M. Walsh, E. Greyber, and E. E. Salpeter, Brief report: Coronary heart disease events associated with hormone therapy in younger and older women. A meta-analysis. J. Gen. Intern. Med. 2006; 21: 363–366. 36. American College of Obstetricians and Gynecologists. Cognition and dementia. Obstet. Gynecol. 2004; 104: 25S–40S. 37. S. R. Salpeter, J. M. Walsh, E. Greyber, T. M. Ormiston, and E. E. Salpeter, Mortality associated with hormone replacement therapy in younger and older women: a meta-analysis. J. Gen. Intern. Med. 2004; 19: 791–804. 38. J. E. Manson and S. S. Bassuk, Hot Flashes, Hormones & Your Health. New York: McGraw-Hill, 2007. 39. Early versus Late Intervention Trial with Estrogen (ELITE). Available: http://clinicaltrials.gov/ct/show/NCT00114517?order=1. Accessed April 9, 2007. 40. S. M. Harman, E. A. Brinton, M. Cedars, et al., KEEPS: The Kronos Early Estrogen Prevention Study. Climacteric. 2005; 8: 3–12.
World Health Organization (WHO): Global Health Situation

To recognize the development of epidemiologic and statistical activities and trend assessment in the World Health Organization (WHO), it is first important to review the contribution of epidemiology to world health. Epidemiology originated in response to a need to understand and control the highly infectious epidemic diseases, such as cholera, plague, smallpox, and yellow fever. It was only with time that appreciation grew of the fact that all conditions of disease and ill-health are interrelated, and that the emerging science of epidemiology provided the tools for helping to understand the major factors underlying these issues as well [1]. International health work also began with a concentration on infectious diseases (see Communicable Diseases), and then moved towards a wider concept of health as part of overall development. The roots of the World Health Organization, as an international health agency, go back to efforts in the last century, and early in this century, particularly to the Rome Agreement of 1907, which established the Office International d'Hygiène Publique, with the express purpose "to combat infectious diseases". The progressive shift of the concept of health, from the prevention of infectious diseases to viewing health as "a state of complete physical, mental and social well-being and not merely the absence of disease or infirmity", is reflected in the successive evolution of the Pan American Sanitary Bureau, founded in 1902, the health services arm of the League of Nations, founded in 1918, and finally WHO, founded in 1948 [1]. Epidemiology has provided the tools for a better understanding of the incidence, prevalence, natural history, causes (see Causation), and effects of control and other measures that are relevant to each of the communicable disease control programs of WHO. More than this, the epidemiologic sciences have enabled us also to understand noncommunicable diseases such as cancer (see Oncology), cardiovascular diseases (see Cardiology and Cardiovascular Disease), and genetic disorders (see Genetic Epidemiology). In the area of primary prevention this understanding has allowed for intervention before the onset of disease [1].
WHO has a constitutional responsibility for the global epidemiologic surveillance of disease. It receives information on outbreaks of communicable disease and distributes this information throughout the world by telecommunication and its publication Weekly Epidemiological Record. The WHO system of international epidemiologic surveillance provides countries with access to information. This includes the countries that otherwise would not be able to communicate directly with each other. WHO is responsible for the International Health Regulations, the International Classification of Diseases, and a great many international standards, which together make international epidemiologic comparisons possible. WHO also publishes the World Health Statistics Quarterly, the World Health Statistics Annual, the World Health Report, and other publications of an epidemiologic nature [1]. Twenty years ago, the Thirtieth World Health Assembly decided that the main social target of governments and WHO in the coming decades should be health for all by the year 2000. One year later, in 1978, at a major international conference at AlmaAta, primary health care was declared to be the key to attaining this goal, in the spirit of social justice. The policies and strategies for “health for all” have subsequently been defined by the World Health Assembly at an international level, in the light of its own health and socioeconomic situation. Furthermore, WHO Member States have committed themselves, with the Organization, to monitoring progress towards and evaluating the attainment of this common goal, using a basic set of global indicators in addition to those applicable within each country. For the first time in the history of international health work, an epidemiologic framework is being applied on a global scale. The implication is that the science of epidemiology must be applied for strategic health planning and evaluation, in a systematic manner, in practically all countries of the world, for national and international health development purposes [1]. The health-for-all monitoring and evaluation process is intended to establish a baseline of current health and socioeconomic conditions, against which progress towards defined targets and objectives can be measured. Periodic measurement should establish trends that will permit anticipation of future conditions, and to start planning for them in advance. The three cycles of monitoring and evaluation of the health-for-all strategy that have taken place so far
make WHO optimistic that the information obtained can be used to reorient national and international priorities and directions for health development work, on the basis of sound epidemiologic evidence honestly reported by countries [1]. The latest evaluation was carried out in late 1996/early 1997; national reports will be consolidated into regional reports to be reviewed by the WHO regional committees in September–October 1997, and the global findings will be reported to the WHO Governing Bodies in 1998. At the present time epidemiology and evaluation are used substantially to support assessment of future health trends through scenario planning at the global, regional, and country levels. A reorientation of health statistics is also taking place in response to users' demand for broader and higher-quality data. In WHO, approaching the end of the twentieth century, epidemiology, statistics, and future trend assessment provide a substantive contribution to the formulation of the health-for-all policy and strategy for the next century.
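The baseline-and-trend logic described above can be made concrete with a small illustrative calculation. The sketch below is not a WHO method and uses hypothetical indicator values; it simply fits an ordinary least-squares linear trend to periodic measurements of a health indicator and extrapolates the fitted line to a target year, which is the arithmetic underlying measuring progress against a baseline and anticipating future conditions.

# Illustrative sketch only: hypothetical data and a deliberately simple linear trend.
# Actual WHO projections rely on far richer data and models.

def fit_linear_trend(years, values):
    """Ordinary least-squares slope and intercept for value ~ year."""
    n = len(years)
    mean_y = sum(years) / n
    mean_v = sum(values) / n
    slope = (sum((y - mean_y) * (v - mean_v) for y, v in zip(years, values))
             / sum((y - mean_y) ** 2 for y in years))
    intercept = mean_v - slope * mean_y
    return slope, intercept

def project(slope, intercept, target_year):
    """Extrapolate the fitted trend to a future year."""
    return intercept + slope * target_year

# Hypothetical periodic measurements of an indicator (e.g. life expectancy in years).
years = [1975, 1981, 1987, 1993]
values = [58.0, 60.1, 62.3, 64.0]

slope, intercept = fit_linear_trend(years, values)
print(f"Estimated trend: {slope:.2f} per year from a 1975 baseline of {project(slope, intercept, 1975):.1f}")
print(f"Projected value in 2000: {project(slope, intercept, 2000):.1f}")

In practice such an extrapolation would be only a starting point for the scenario-based health futures work mentioned above, which layers alternative assumptions on top of observed trends.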
Program Activities to the End of the Twentieth Century

The Health Situation and Trend Assessment Program is at present responsible for global health situation analysis and projection; strengthening of country health information; and partnerships and coordination of epidemiology, statistics, and trend assessment. Global epidemiologic surveillance is the responsibility of the Emerging and other Communicable Diseases Surveillance and Control Program. Various epidemiologic and statistical activities are also carried out by technical programs at WHO headquarters (e.g. the Expanded Program on Immunization; Health and Environment; Information System Management; the Special Program of Research, Development, and Research Training in Human Reproduction; and the Special Program for Research and Training in Tropical Diseases) and by the six regional offices. As already mentioned, one of the normative functions of WHO is monitoring the health situation and trends throughout the world, for which the compilation, use, and coordination of relevant health information and statistical activities are essential. WHO evaluates world health status and trends every 6 years and publishes its findings. It also assesses
the global health status annually and, since 1995, has published The World Health Report. The report gives an overview of the global situation and identifies priority areas for international health action; it also links the work of WHO to global health needs and priorities (see Morbidity and Mortality, Changing Patterns in the Twentieth Century; Mortality, International Comparisons). At headquarters this Program is carried out by the Division of Health Situation and Trend Assessment, which comprises the Health Situation Analysis and Projection unit, the Strengthening Country Health Information unit, and the Director's Office, which is responsible for partnerships and coordination in epidemiology, statistics, and trend assessment and for the International Statistical Classification of Diseases and Related Health Problems (ICD), in addition to providing direction and supervision of the Program. The six regional offices of WHO also undertake health situation and trend assessment activities. Based on the Program review in 1995, the focus of the regional programs is as follows.
African Region

In the African Region priority is given to the generation and utilization of epidemiologic and health information, particularly through the strengthening of national health information systems (see Administrative Databases) and epidemiologic surveillance systems. The development of epidemiology practice is based on the following approaches: integrated management of epidemiologic information systems; strengthening both data management capacity and decision-making processes, particularly at the local level; and combined training in epidemiology and management of health programs and health systems. Efforts are made to equip all district teams with a set of essential epidemiologic capabilities. To support this effort, WHO has established a position of epidemiologist in each Country Office.
Region of the Americas

The major achievements of the Region of the Americas include coordinating the response to the cholera epidemic, supporting the evaluation and strengthening of national health surveillance systems, supporting the modernization of records and statistical information systems, implementing and applying
the ICD-10 and Geographic–Epidemiologic Information Systems (see Geographic Patterns of Disease; Geographic Epidemiology), supporting national studies of social inequities that affect health status, producing regional publications on the health situation (Health Conditions in the Americas, Strategies to Monitor Health for All (HFA), the PAHO Epidemiological Bulletin, and Health Statistics from the Americas), and establishing a technical information system including country profiles, databases on mortality and population, and a bibliography on epidemiology. Efforts will continue to strengthen national epidemiologic capacities, to study inequity in health, to provide training in epidemiology, statistics, and health situation analysis and systems, and to disseminate information (see Health Services Data Sources in Canada; Health Services Data Sources in the US).
Eastern Mediterranean Region

The Eastern Mediterranean Region has cooperated and will continue to cooperate in the development of health information systems in Member States, by providing technical support, strengthening national capabilities through fellowships and workshops, developing a regional database, and publishing guidelines and manuals to improve health information management design, implementation, and use in the decision-making process.

European Region

In the European Region the main tasks are collecting and analyzing health information for periodic reports of progress towards health for all by the year 2000 (entitled "Health in Europe"), regularly updating and disseminating information from the health-for-all database to Member States (available also from http://www.who.dk), and supporting training in epidemiology and health information. A European Public Health Information Network (EUPHIN) is being established (together with the EC) to enable telematic reporting and exchange of data for international comparisons. As part of this initiative, efforts continue to improve international data comparability by developing standard definitions, measurement instruments, and methods, and by encouraging countries to use them. At the country level, the European Regional Office assists countries in developing and using national health and health service databases, such as the health-for-all database, as a means of making better use of available health data at national and local level. The European Regional Office also produces country "highlights", which give an overview of the health and health-related situation in a given country and compare, where possible, its position in relation to other countries in the WHO European Region (also available from http://www.who.dk). The intention in all cases is to make better use of available health information, which will in itself help to improve data quality and comparability, and to enable national and local agencies and institutions with health responsibilities to have easy access to, and benefit from, the activities and products of the European Regional Office (see Health Services Data Sources in Europe).

Southeast Asia Region

In the Southeast Asia Region support to countries has focused on the strengthening of health management information systems at the central and district levels, and the enhancement of mortality and morbidity statistics (see Vital Statistics, Overview). Attention will continue to focus on developing health management information systems in interested countries. Epidemiologic surveillance, the use of "health futures" methodology, and health data processing and rational use are being further enhanced.

Western Pacific Region

The Western Pacific Region is engaged in technical cooperation to strengthen epidemiologic surveillance and cholera control, support field epidemiology training, improve medical records, birth registration documentation, and health information systems, and provide guidelines and manuals. Future activities include reformulation and construction of a new regional database, increased coordination with technical units, strengthening of current activities, and research on more sensitive indicators and associated analytic methods for monitoring and evaluation.
Main Activities at Global and Regional Level

The Program's main activities at the global and regional level from around 1997 to 1999 can be summarized as follows:
1. Global health situation analysis and projection. Updating databases on mortality, health-for-all data resulting from the third global evaluation, and global health futures information; reporting on the third global evaluation of the implementation of the health-for-all strategy (to be published in the World Health Report 1998); improving and formulating new indicators and updating the common framework for the fourth monitoring of the implementation of the health-for-all strategy; issuing the World Health Statistics Annual 1997 and 1998 and the World Health Statistics Quarterly, volumes 51 and 52; updating the documents "Global health situation analysis and projection" and "Demographic data for health situation and projections"; and validating global health and health-related data and information. At the regional level, preparing and distributing regional reports on the world health situation; maintaining and updating databases of health statistics; and improving the health-for-all strategy through the findings of the third evaluation of the implementation of the Strategy.

2. Strengthening country health information. Contributing to defining universal public health functions; providing guidance on assessing the performance of public health functions and defining the minimum information required for their monitoring and management; preparing methodology for rapidly assessing the availability and use of information for managing the essential public health functions; preparing guidance materials and supporting processes for enhancing various information functions and support systems; and undertaking a series of applications of the above methods within interested countries. At the regional level, providing advisory services and support to countries, enhancing national information systems, and assisting countries in applying various methodologies such as health futures and rapid evaluation.

3. Partnerships and coordination in epidemiology, statistics, and trend assessment. Forming partnerships and coordination in epidemiology, statistics, and trend assessment with related WHO programs, regional offices, and international organizations; preparing an international agreement on a taxonomic approach for medical procedures and guidelines for establishing national classifications; and providing guidance on medical certification of causes of death. At the regional level, improving partnerships in epidemiology, statistics, and trend assessment, and holding computer-based training courses in the use of ICD-10.

It is important to note that the Division of Health Situation and Trend Assessment is responsible for the development, maintenance, and coordination of the International Statistical Classification of Diseases and Related Health Problems (ICD) and the other members of the "family" of disease and health-related classifications in both English and French. The future prospects of the ICD can be highlighted as follows. More than 60 Member States submit national mortality data to WHO on a routine basis. Twenty-eight of these Member States have already implemented the tenth revision of the ICD (ICD-10), which was published by WHO in 1992–93, for either mortality or morbidity coding or both. Twenty-two Member States in the Region of the Americas have received training in the use of ICD-10 but have not yet implemented it. A further nine Member States have indicated that they plan to implement it before the year 2000. It can be assumed, therefore, that by the year 2000 the vast majority of the countries currently submitting mortality data will have moved to ICD-10, though the actual data may not be available until some 2–3 years later. It is hoped that the planned publication of a simplified three-character version of ICD-10 will encourage more developing countries to use the classification. In general, it is estimated that, approaching the twenty-first century, most developed countries will have implemented ICD-10. In developing countries, on the other hand, the classification will be implemented step by step, in line with the strengthening of their vital and health statistical infrastructures.

At the global level the Division of Emerging and other Communicable Diseases Surveillance and Control publishes the Weekly Epidemiological Record; updates International Travel and Health annually; revises the International Health Regulations; and rapidly exchanges information through electronic media with WHO Collaborating Centers, public health administrations, and the general public.

At the present time the Health Situation and Trend Assessment Program is implemented with support from various professional staff, especially
epidemiologists, statisticians, medical officers, and public health specialists.
Vision for the Use and Generation of Data in the First Quarter of the Twenty-first Century

Epidemiologic and health statistical activities in WHO are now operational through dynamic networking with various parties at the global, regional, and country level. From experience at the global and regional level and in many countries, the trend of information required by users is towards more specific and higher-quality information. Figure 1 highlights the information on health status and determinants required by most users. The generation of data and information can usually be grouped into four main activities:

1. Collection, validation, analyses, and dissemination of data and information.
2. Support activities and resources, especially through cooperation with Member States in strengthening country health information.
3. Research and development activities.
4. Partnerships and coordination of information activities.

Figure 2 shows these activities and their interconnection. From experience this diagram is helpful in visualizing the complex issues of information activities.
On the basis of experience and future prospects the trends of the use and generation of information are briefly described in Table 1. The trends shown take into consideration the main changes in the use of data and the various activities involved in the generation of data. It is hoped that the main trends from 1975 to the end of the century are clearly highlighted in the table. From the WHO Global Health Situation Analysis and Projection from 1950 to 2025 it is possible to foresee the future prospects of health situation and trend assessment in the early part of the twenty-first century, as summarized below.

1. The various users of information will request many types of information or data, all of which should be up-to-date and of high quality. This kind of request will come not only from users in developed countries but also from many users in developing countries.

2. On the generation of information:

(i) In the collection, validation, analysis, and dissemination of data and information the emphasis will be on data concerning health status (see Quality of Life and Health Status) (health, life expectancy, mortality, disability, morbidity), health resources (including financial) (see Health Care Financing), physical and manpower resources (see Health Workforce Modeling), health services, the socioeconomic environment (especially the economic, sociocultural, and technological environment), and the biological and physical environments (including pollution) (see Environmental Epidemiology). Full informatic support for these information activities will become a reality.

(ii) The future focus of research and development will be on health measurement of the above-mentioned data and on various health classifications in a more elaborated way. Besides this, further development of future trend assessment in support of scenario planning will be needed by many countries.

(iii) The trend in support activities and resources will continue to focus on providing support to many developing countries, especially the least developed countries, in strengthening their country health information. The specific support activities and resources mentioned in Table 1 will continue throughout the beginning of the twenty-first century.

(iv) Partnerships, coordination, and provision of direction of information activities will become WHO's most important and challenging tasks in health situation and trend assessment in the future.
Figure 1 Health status and determinants: a schematic relating exogenous determinants (the political, economic, demographic, social and technological environment; community action and healthy lifestyles; health care, development of health systems and health resources; disease agents; the physical and biological environment) and endogenous determinants (hereditary, acquired) to health status (morbidity, disablement, mortality, health and life expectancy). Adapted from Sistem Kesehatan Nasional [National Health System], Annex 1, Jakarta, Ministry of Health, 1984, and Public Health Status and Forecast, Figure 1.3.3, The Hague, National Institute of Public Health and Environmental Protection, 1994
Figure 2 Relationship between information and implementation, management and development activities: management of information activities (future trend assessment and planning; direction, guidance and instruction; activating and coordination; monitoring and evaluation) is linked with development and support activities and with implementation through information and feedback/reporting
Table 1 Brief review of the development of the main health situation and trend assessment activities at country, regional, and global level

1. The use of information
   1975–87: For planning, management and evaluation; limited quality of data could be tolerated.
   1988–95: Limited use of information for past and present trends; limited quality of some data could be tolerated.
   1996–2001: Broader use, including future trends, of information and data; high quality of data is needed.

2. The generation of information
(i) Collect, validate, analyze, and disseminate data and information
   1975–87: Data on health status, utilization of services, resources, and demographic, social and environmental data.
   1988–95: Emphasis on data on monitoring and evaluation of Health for All; partial support of informatics.
   1996–2001: Emphasis on data on mortality, morbidity, disability; continued emphasis on data on monitoring and evaluation of Health for All; full support of informatics; further development of health indicators.
(ii) Research and development activities
   1975–87: Development of indicators; methods of presentation and dissemination; operational research methods; statistical support to health research; ICD.
   1988–95: Development of monitoring and evaluation; started health futures trend assessment; ICD, ICIDH.
   1996–2001: Health futures trend assessment; further development of health futures methods; implementation of ICD.
(iii) Support activities and resources
   1975–87: Support given country by country.
   1988–95: Strengthening country health information – started cooperation with Member States, especially developing countries and countries in transition, through regional offices.
   1996–2001: More on strengthening country health information through regional offices; support health futures trend assessment; support the monitoring and evaluation of HFA; support the enhancement of statistical capabilities; support the use of computers.
(iv) Partnerships and coordination of information activities
   1975–87: Improve the uniformity and comparability of data; balancing between decentralized and centralized systems; cooperation between statisticians and users.
   1988–95: Started through dynamic networking of health information and statistical activities.
   1996–2001: Dynamic networking of health information and statistical activities; coordination of various information, statistical activities, and trend assessment.

Source: Division of Health Situation and Trend Assessment, WHO.
Reference

[1] Nakajima, H. (1991). Epidemiology and the future of world health – The Robert Cruickshank Lecture, International Journal of Epidemiology 20, 589–594.
Bibliography

WHO (1992). Global Health Situation and Projections, Estimates, WHO/HST/92.1. WHO, Geneva (updated 1997).
WHO (1993–96). Implementation of the Global Strategy for Health for All by the Year 2000 – Second Evaluation: Eighth Report on the World Health Situation. Vol. 1, Global Review (WHO, Geneva, 1993); Vol. 2, African Region (WHO, Brazzaville, 1994); Vol. 3, Region of the Americas (WHO, Washington, 1993); Vol. 4, South-East Asia Region (WHO, New Delhi, 1993); Vol. 5, European Region (WHO, Copenhagen, 1993); Vol. 6, Eastern Mediterranean Region (WHO, Alexandria, 1996); Vol. 7, Western Pacific Region (WHO, Manila, 1993).
WHO (1994). Ninth General Programme of Work, Covering the Period 1996–2000. WHO, Geneva.
WHO (1995). Programme Budget for the Financial Period 1996–1997. WHO, Geneva.
WHO (1995). Progress towards health for all: third monitoring report, World Health Statistics Quarterly 48(3/4).
WHO (1995). Renewing the Health-for-All Strategy: Elaboration of Policy for Equity, Solidarity and Health, Consultation document, WHO/PAC/95.1. WHO, Geneva.
WHO (1995). World Health Report. WHO, Geneva.
WHO (1996). Catalogue of Health Indicators, WHO/HST/SCI/96.8. WHO, Geneva.
WHO (1996). Programme Budget for the Financial Period 1998–1999. WHO, Geneva.
WHO (1996). World Health Report. WHO, Geneva.
Weekly Epidemiological Record. WHO, Geneva.
World Health Statistics Annual. WHO, Geneva.
World Health Statistics Quarterly. WHO, Geneva.
H.R. HAPSARA